CN114880311A - Data processing method, data processing device, storage medium and computer equipment - Google Patents

Data processing method, data processing device, storage medium and computer equipment

Info

Publication number
CN114880311A
Authority
CN
China
Prior art keywords
data
attribute
deduplication
storage unit
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210514745.4A
Other languages
Chinese (zh)
Inventor
洪海滨
韩旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Weride Technology Co Ltd
Original Assignee
Guangzhou Weride Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Weride Technology Co Ltd filed Critical Guangzhou Weride Technology Co Ltd
Priority to CN202210514745.4A priority Critical patent/CN114880311A/en
Publication of CN114880311A publication Critical patent/CN114880311A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/219Managing data history or versioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method, a data processing device, a storage medium and computer equipment. The method comprises the following steps: reading repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeating groups; respectively reading the data attributes contained in the storage units of each repeating group and the attribute values corresponding to the data attributes; respectively acquiring deduplication rules corresponding to the data attributes according to the data attributes, where a deduplication rule is a priority rule for data retention based on the attribute values of a data attribute; and performing deduplication processing on the storage units in each repeating group based on the deduplication rule corresponding to that repeating group. With this method, data can be deduplicated according to different data attributes and according to the priority of the data that needs to be retained, which reduces the possibility that useful data is discarded and in turn improves the data reliability of subsequent data analysis tasks.

Description

Data processing method, data processing device, storage medium and computer equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, a storage medium, and a computer device.
Background
HBase (Hadoop Database) is a distributed, scalable NoSQL database that supports massive data storage. At the physical level, data is stored in a Key-Value format, and all HBase data files are stored in the Hadoop HDFS file system, which enables parallel and distributed processing of complex tasks with high performance and reliability. However, an HBase database may hold a large amount of repeated data, and this data needs to be deduplicated in order to save storage resources.
Existing HBase deduplication is based on timestamps, so some useful data may be discarded, which undermines the reliability of data analysis.
Disclosure of Invention
In view of the above, it is desirable to provide a data processing method, an apparatus, a storage medium, and a computer device capable of avoiding loss of useful data.
In a first aspect, the present application provides a data processing method, including:
reading repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeated groups;
respectively reading the data attribute contained in the storage unit in each repeating group and the attribute value corresponding to the data attribute;
respectively acquiring deduplication rules corresponding to the data attributes according to the data attributes; the deduplication rule is a priority rule for data retention based on the attribute value of the data attribute;
and performing deduplication processing on each storage unit in each repetitive group respectively based on the deduplication rule corresponding to each repetitive group.
In one embodiment, reading the repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeating groups includes:
reading a row key of each storage unit in each hfile file;
if the row keys of two or more storage units are the same, the storage units with the same row key are determined as one repeating group.
In one embodiment, when the storage units in a repeating group include two or more types of data attributes, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group includes:
scanning attribute values corresponding to each data attribute of each storage unit of the repeated group;
and sequentially carrying out deduplication processing on each storage unit in the repeated group according to the attribute value corresponding to each data attribute until only one storage unit is left in the repeated group.
In one embodiment, when the data attributes include a first attribute and a second attribute, sequentially performing deduplication processing on the storage units in the repeating group according to the attribute value corresponding to each data attribute until only one storage unit remains in the repeating group includes:
performing a first deduplication pass on each storage unit in the repeating group according to the attribute value corresponding to the first attribute;
and performing a second deduplication pass on the storage units remaining in the repeating group after the first pass according to the attribute value corresponding to the second attribute.
In one embodiment, the first attribute is link discovery time, and the second attribute is link depth;
or,
the first attribute is a link depth, and the second attribute is a link discovery time.
In one embodiment, when the data attribute includes link discovery time, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group includes:
scanning attribute values corresponding to link discovery time of each storage unit in the repeating group;
and reserving the storage unit with the minimum attribute value corresponding to the link discovery time.
In one embodiment, when the data attribute includes a link depth, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group includes:
scanning attribute values corresponding to the link depths of the storage units in the repeating group;
and reserving the storage unit with the minimum attribute value corresponding to the link depth.
In one embodiment, before reading the repeated storage units in each hfile file of the HBase database and dividing the hfile file into a plurality of repeated groups, the method further includes:
receiving data collected through a plurality of data mining channels;
and respectively storing the data acquired by each data mining channel in each hfile file corresponding to different data mining channels.
In one embodiment, the method further comprises:
and merging the residual storage units in each hfile file after the duplication removal into one hfile file for storage.
In a second aspect, the present application provides a data processing apparatus comprising:
the first reading module is used for reading repeated storage units in each hfile file of the HBase database and dividing the repeated storage units into a plurality of repeated groups;
the second reading module is used for respectively reading the data attribute contained in the storage unit in each repeating group and the attribute value corresponding to the data attribute;
a duplicate removal rule obtaining module, configured to obtain duplicate removal rules corresponding to the data attributes according to the data attributes respectively; the deduplication rule is a priority rule for data retention based on the attribute value of the data attribute;
and the deduplication module is used for performing deduplication processing on each storage unit in each repeating group respectively based on the deduplication rule corresponding to each repeating group.
In a third aspect, the present application provides a storage medium having stored therein computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method according to any one of the above embodiments.
In a fourth aspect, the present application provides a computer device comprising: one or more processors, and a memory;
the memory has stored therein computer-readable instructions which, when executed by the one or more processors, perform the steps of the data processing method of any of the above embodiments.
According to the technical scheme, the embodiment of the application has the following advantages:
The data processing method, data processing device, storage medium and computer equipment provided by the application read each hfile file in the HBase database to identify a plurality of groups of repeated storage units, each group of repeated storage units serving as one repeating group, and read the data attributes contained in the storage units of each repeating group and the attribute values corresponding to those data attributes. For each repeating group that needs to be deduplicated, the corresponding deduplication rule is acquired according to the data attributes, and the storage units of the group are deduplicated based on that rule rather than simply by timestamp. Because the deduplication rules are predefined, data can be deduplicated according to different data attributes and according to the priority of the data that needs to be retained, which reduces the possibility of discarding useful data and in turn improves the data reliability of subsequent data analysis tasks.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow diagram illustrating a data processing method according to an embodiment;
FIG. 2 is a schematic flow chart illustrating the steps of reading the repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeated groups in one embodiment;
FIG. 3 is a schematic flowchart illustrating the steps of performing deduplication processing on each storage unit in each duplication group based on the deduplication rule corresponding to each duplication group in one embodiment;
FIG. 4 is a flow chart illustrating a data processing method according to another embodiment;
FIG. 5 is a block diagram of a data processing apparatus in one embodiment;
FIG. 6 is a block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiments of the application apply to data storage in an HBase database. Data is stored in the HBase database in the form of a table, which comprises a row key (Row key), column families (Column Family) and a timestamp (Timestamp). The row key identifies each row of data in the HBase table. A column family is a collection of columns (Column). The timestamp identifies the version of the data. One storage unit (cell) in the HBase table is uniquely identified by a {Row key, Column, TimeStamp} triple.
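As an illustration of the cell coordinates described above, the following minimal Python sketch models a storage unit in memory. The `Cell` class, its field names and the sample values are hypothetical and are not part of the HBase API or of the application.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical in-memory model of one HBase storage unit (cell)."""
    row_key: str    # identifies the row
    column: str     # written here as "family:qualifier"
    timestamp: int  # identifies the version of the data
    value: bytes    # the stored attribute value

    def coordinates(self):
        # The {Row key, Column, TimeStamp} triple that identifies the cell.
        return (self.row_key, self.column, self.timestamp)


# Two versions of the same row and column differ only in their timestamp,
# so they are two distinct cells.
cells = [
    Cell("S1", "cf:link_depth", 1, b"2"),
    Cell("S1", "cf:link_depth", 2, b"1"),
]
coords = {c.coordinates() for c in cells}
```

Both sample cells share a row key, which is what later makes them candidates for the same repeating group.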
An embodiment of the present application provides a data processing method, as shown in fig. 1, the method includes steps S101 to S104, where:
Step S101, reading repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeating groups.
The hfile file is the storage format of KeyValue data in HBase and is a binary file format of Hadoop. Each storage unit stores one piece of data, and the repeated storage units are all storage units in the HBase database whose data is stored more than once. There may be repeated storage units for only one piece of data, that is, only one piece of data is stored repeatedly; or there may be repeated storage units for several different pieces of data, that is, several different pieces of data are stored repeatedly. If only one piece of data is stored repeatedly, there is only one repeating group; if several pieces of data are stored repeatedly, the number of repeating groups equals the number of pieces of data stored repeatedly. For example, if two different pieces of data are stored repeatedly, there are two repeating groups.
Step S102, respectively reading the data attribute and the attribute value corresponding to the data attribute contained in the storage unit in each repeating group.
Data attributes are represented by columns (Column): each column is one data attribute, and the value stored under that column is the attribute value. Each storage unit may have one or more data attributes.
Step S103, acquiring the deduplication rules corresponding to the data attributes according to the data attributes respectively.
The deduplication rule is a priority rule for data retention based on the attribute value of the data attribute. The deduplication rules are preset, and each data attribute has its own deduplication rule. Configuring the rules in advance allows the stored data to be screened during deduplication in a way that benefits subsequent data processing tasks.
Step S104, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group.
If the same piece of data is stored repeatedly, it may be that one or more data attributes are different, or that the timestamps are different. In this embodiment, for each duplicate group, deduplication is performed according to deduplication rules corresponding to data attributes, and when there are multiple types of data attributes that are different, deduplication is performed according to deduplication rules corresponding to each data attribute.
[Table 1: a repeating group. The table was published as image BDA0003641010820000061 and is not reproduced here.]
Taking the repeating group in Table 1 as an example, the data represented by S1 is stored repeatedly: 3 storage units store the same piece of data. Assuming that the deduplication rule corresponding to data attribute A gives the retention priority A1 > A2 > A3, the storage unit corresponding to {S1, A1, T2} is retained after deduplication.
In this embodiment, each hfile file in the HBase database is read to identify a plurality of groups of repeated storage units, each group serving as one repeating group, and the data attributes contained in the storage units of each repeating group and the corresponding attribute values are read. For each repeating group that needs to be deduplicated, the corresponding deduplication rule is acquired according to the data attributes, and the storage units of the group are deduplicated based on that rule rather than simply by timestamp. Because the deduplication rules are predefined, deduplication can be performed according to different data attributes and according to the priority of the data that needs to be retained, which reduces the possibility of discarding useful data and improves the data reliability of subsequent data analysis tasks. In addition, the attribute value of each data attribute is stored adjacent to the timestamp, and the original deduplication logic of the HBase database (deduplication by timestamp) already needs to read the timestamp. Reading the attribute values therefore does not greatly increase the computing resources consumed, and no large amount of additional storage is needed, so deduplication that better serves data processing tasks can be achieved while computing and storage resources are saved.
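The Table 1 example can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout and the function name `dedupe_by_priority` are hypothetical, and because the table itself is not reproduced, the timestamps paired with A2 and A3 are invented for the sketch. Only the retained unit {S1, A1, T2} comes from the text.

```python
# Retention priority for attribute A, highest priority first (A1 > A2 > A3).
PRIORITY_A = ["A1", "A2", "A3"]


def dedupe_by_priority(group, attribute, priority):
    """Return the storage unit whose attribute value ranks highest."""
    rank = {value: i for i, value in enumerate(priority)}
    return min(group, key=lambda cell: rank[cell[attribute]])


# Hypothetical contents of the repeating group for row key S1.
group = [
    {"row_key": "S1", "A": "A2", "timestamp": "T1"},
    {"row_key": "S1", "A": "A1", "timestamp": "T2"},
    {"row_key": "S1", "A": "A3", "timestamp": "T3"},
]

kept = dedupe_by_priority(group, "A", PRIORITY_A)
```

Note that the retained unit is not the one with the latest timestamp in this sample, which is the difference from purely timestamp-based deduplication.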
In some embodiments, if the attribute values corresponding to the same data attribute in the repeated storage units are the same, deduplication may be performed in cooperation with the timestamp.
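One possible reading of this fallback is sketched below in Python. The text does not say which timestamp wins when attribute values tie, so the sketch assumes the newest timestamp is retained; that choice, the function name and the sample data are all assumptions.

```python
def dedupe_with_timestamp_fallback(group, attribute, priority):
    """Prefer the best-ranked attribute value; break ties by timestamp.

    Assumption: among units with equal attribute values, the one with the
    largest (newest) timestamp is kept.
    """
    rank = {value: i for i, value in enumerate(priority)}
    # Lower rank is better; negating the timestamp makes newer sort first.
    return min(group, key=lambda c: (rank[c[attribute]], -c["timestamp"]))


group = [
    {"row_key": "S1", "A": "A1", "timestamp": 10},
    {"row_key": "S1", "A": "A1", "timestamp": 20},
    {"row_key": "S1", "A": "A2", "timestamp": 30},
]

kept = dedupe_with_timestamp_fallback(group, "A", ["A1", "A2", "A3"])
```

The unit with timestamp 30 is not retained even though it is the newest, because its attribute value ranks lower; the timestamp only decides between the two A1 units.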
In one embodiment, as shown in fig. 2, reading repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeated groups includes steps S201 to S202, where:
Step S201, reading a row key of each storage unit in each hfile file;
Step S202, if two or more storage units have the same row key, the storage units with the same row key are determined as one repeating group.
In this embodiment, when the same data is represented by the same row key, the division into the repetitive groups can be performed by identifying the same row key.
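Steps S201 and S202 can be sketched as a grouping pass over the storage units of all hfile files. In this Python sketch each hfile file is represented by a plain list of dictionaries, which is an assumption made for illustration; a real implementation would read the files through HBase's own readers.

```python
from collections import defaultdict


def find_repeating_groups(hfiles):
    """Group storage units by row key across all files (S201) and keep
    only the row keys that occur two or more times (S202)."""
    by_row_key = defaultdict(list)
    for cells in hfiles:
        for cell in cells:
            by_row_key[cell["row_key"]].append(cell)
    return {key: group for key, group in by_row_key.items() if len(group) >= 2}


# Hypothetical contents of two hfile files.
hfiles = [
    [{"row_key": "S1", "A": "A1"}, {"row_key": "S2", "A": "A1"}],
    [{"row_key": "S1", "A": "A2"}, {"row_key": "S3", "A": "A3"}],
]

groups = find_repeating_groups(hfiles)
```

Here only S1 forms a repeating group; S2 and S3 each occur once and need no deduplication.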
It will be appreciated that dividing the repeating groups may simply mean determining the storage locations of the storage units in which repetition occurs; it does not mean that the repeating groups must be stored separately. In some embodiments, the divided repeating groups may be stored separately.
In some embodiments, whether storage units are duplicated may also be determined through other elements. Alternatively, each row key may correspond to a piece of data in another data storage table; by looking up that table, the duplicated data is read, the storage units holding the duplicated data are located according to the corresponding row keys, and deduplication processing is then performed.
In one embodiment, when the storage units in a repeating group contain two or more types of data attributes, as shown in fig. 3, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group includes:
Step S301, scanning the attribute values corresponding to each data attribute of each storage unit of the repeating group;
Step S302, sequentially performing deduplication processing on each storage unit in the repeating group according to the attribute value corresponding to each data attribute until only one storage unit remains in the repeating group.
When two or more data attributes exist, repeated storage units may still remain if deduplication is performed according to the attribute value of only one data attribute. In that case, deduplication is performed sequentially according to the attribute value corresponding to each data attribute until only one storage unit is left in the repeating group.
In one embodiment, when the data attributes include a first attribute and a second attribute, sequentially performing deduplication processing on each storage unit in the repeating group according to the attribute value corresponding to each data attribute until only one storage unit remains in the repeating group includes:
performing a first deduplication pass on each storage unit in the repeating group according to the attribute value corresponding to the first attribute;
and performing a second deduplication pass on the storage units remaining in the repeating group after the first pass according to the attribute value corresponding to the second attribute.
In one embodiment, the first attribute is link discovery time and the second attribute is link depth. In another embodiment, the first attribute is link depth and the second attribute is link discovery time.
In one embodiment, when the data attribute includes link discovery time, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group includes:
scanning attribute values corresponding to link discovery time of each storage unit in the repeated group;
and reserving the storage unit with the minimum attribute value corresponding to the link discovery time.
The link discovery time is the time at which the link that the mined data belongs to was discovered. For example, if the source link of data S1 is P1, the discovery time of P1 is the link discovery time. With different data acquisition modes or acquisition times, the same link may be discovered several times, and data is acquired each time the link is discovered, so the data of that link is stored repeatedly, each copy carrying the discovery time of its source link. The smaller the attribute value corresponding to the link discovery time, the earlier the link was discovered. Retaining the earliest link discovery time gives the closest available approximation of when the link appeared on the internet, which is more useful for data analysis.
In one embodiment, when the data attribute includes a link depth, performing deduplication processing on each storage unit in each repeating group based on the deduplication rule corresponding to each repeating group includes:
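The rule for link discovery time reduces to keeping the minimum value in the group. A minimal Python sketch follows; the field name `link_discovery_time`, the use of integer epoch seconds and the sample values are assumptions made for illustration.

```python
def keep_earliest_discovery(group):
    """Retain the storage unit with the smallest link discovery time."""
    return min(group, key=lambda cell: cell["link_discovery_time"])


# The same link P1 discovered three times (hypothetical epoch seconds).
group = [
    {"row_key": "S1", "source_link": "P1", "link_discovery_time": 1_650_000_300},
    {"row_key": "S1", "source_link": "P1", "link_discovery_time": 1_650_000_100},
    {"row_key": "S1", "source_link": "P1", "link_discovery_time": 1_650_000_200},
]

kept = keep_earliest_discovery(group)
```

If several units share the smallest value, `min` returns the first of them in iteration order, so a further attribute or the timestamp would be needed to choose between them deliberately.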
scanning attribute values corresponding to the link depths of all storage units in the repeated group;
and reserving the storage unit with the minimum attribute value corresponding to the link depth.
If a sub-link is found on a page, the depth of the sub-link equals the depth of the current page plus 1, and the link depth of the top page is recorded as 0. A link may be found from different pages, which may result in different link depths. The storage unit with the minimum link depth needs to be retained, because the smaller the link depth, the higher the data analysis value. The smaller the attribute value corresponding to the link depth, the smaller the link depth.
When multiple data attributes exist, deduplication is first performed according to the deduplication rule corresponding to one of the data attributes. If repeated storage units still remain, deduplication is then performed according to the deduplication rule corresponding to another data attribute, and so on until only one storage unit is left in the repeating group. Any data attribute may be selected to perform deduplication first, without limitation.
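The depth rule and the reason one link can carry several depths can be sketched with a small breadth-first crawl in Python. The link graph, the page names and both function names are hypothetical; the sketch only illustrates "sub-link depth = current page depth + 1, top page = 0" and the retention of the minimum depth.

```python
from collections import deque


def crawl_depths(graph, top_page):
    """Record (link, depth) every time a link is found on some page."""
    discoveries = []
    depth_of = {top_page: 0}  # the top page has link depth 0
    queue = deque([top_page])
    while queue:
        page = queue.popleft()
        for child in graph.get(page, []):
            # Sub-link depth = depth of the current page + 1.
            discoveries.append((child, depth_of[page] + 1))
            if child not in depth_of:
                depth_of[child] = depth_of[page] + 1
                queue.append(child)
    return discoveries


def min_depth_per_link(discoveries):
    """Keep, for each link, the smallest depth at which it was found."""
    best = {}
    for link, depth in discoveries:
        if link not in best or depth < best[link]:
            best[link] = depth
    return best


# Hypothetical site: "b" is linked from both "home" and "a".
graph = {"home": ["a", "b"], "a": ["b", "c"], "b": ["c"]}
discoveries = crawl_depths(graph, "home")
best = min_depth_per_link(discoveries)
```

In this sample, "b" is recorded at depth 1 (from "home") and at depth 2 (from "a"), and the depth 1 record is the one retained.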
[Table 2: a repeating group. The table was published as image BDA0003641010820000081 and is not reproduced here.]
Taking the repeating group in Table 2 as an example, the data represented by S1 is stored repeatedly: 4 storage units store the same piece of data. Assume that the deduplication rule corresponding to data attribute A gives the retention priority A1 > A2 > A3, and that the deduplication rule corresponding to data attribute B gives the retention priority B2 > B1. If the repeating group is first deduplicated according to data attribute A, {S1, A1, B1, T1} and {S1, A1, B2, T2} are retained. The group is then deduplicated a second time according to data attribute B, and {S1, A1, B2, T2} is retained. At this point only one storage unit is left in the repeating group, and deduplication is finished.
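The two-pass procedure of the Table 2 example can be sketched in Python as below. The `RULES` registry and the function names are hypothetical. The first two storage units come from the text; the other two ({S1, A2, B1, T3} and {S1, A3, B2, T4}) are invented for the sketch because the table is not reproduced.

```python
# Hypothetical rule registry: retention priority per data attribute,
# highest priority first.
RULES = {
    "A": ["A1", "A2", "A3"],
    "B": ["B2", "B1"],
}


def dedupe_pass(group, attribute):
    """Keep every unit whose value for `attribute` has the best rank."""
    rank = {value: i for i, value in enumerate(RULES[attribute])}
    best = min(rank[cell[attribute]] for cell in group)
    return [cell for cell in group if rank[cell[attribute]] == best]


def dedupe_sequentially(group, attribute_order):
    """Apply one pass per attribute until a single unit remains."""
    remaining = list(group)
    for attribute in attribute_order:
        if len(remaining) == 1:
            break
        remaining = dedupe_pass(remaining, attribute)
    return remaining


group = [
    {"row_key": "S1", "A": "A1", "B": "B1", "timestamp": "T1"},
    {"row_key": "S1", "A": "A1", "B": "B2", "timestamp": "T2"},
    {"row_key": "S1", "A": "A2", "B": "B1", "timestamp": "T3"},  # assumed
    {"row_key": "S1", "A": "A3", "B": "B2", "timestamp": "T4"},  # assumed
]

after_a = dedupe_pass(group, "A")
result = dedupe_sequentially(group, ["A", "B"])
```

With this sample data, applying B before A leads to the same retained unit, but in general the order of the attributes can change the outcome, so the order is itself part of the configuration.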
In one embodiment, as shown in fig. 4, before reading the repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeating groups, the method further includes:
step S401, receiving data collected through a plurality of data mining channels;
step S402, storing the data collected by each data mining channel in each hfile file corresponding to different data mining channels.
To improve data mining efficiency, a plurality of data mining channels are configured to mine data in parallel; for example, data processing, analysis and mining are performed by a plurality of pipeline modules, and data is generated based on a database schema. When the HBase database stores massive data, the data generated by each data mining channel is stored offline through bulkload. To facilitate subsequent data processing, the data mined by one data mining channel is stored in one hfile file, so a plurality of hfile files are generated. During deduplication, all hfile files are analyzed together to determine whether duplication exists, and deduplication is performed according to the deduplication rules corresponding to the data attributes.
In one embodiment, the method further comprises:
and merging the residual storage units in each hfile file after the duplication removal into one hfile file for storage.
Merging the storage units that remain after deduplication into one hfile file for storage further saves storage resources, facilitates subsequent data processing, and saves computing resources.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time and may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The following describes a data processing apparatus provided in an embodiment of the present application; the data processing apparatus described below and the data processing method described above may be referred to in correspondence with each other.
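The flow from per-channel files to a single merged file can be sketched end to end in Python. Everything here is illustrative: the per-channel files are plain lists keyed by a hypothetical channel name, the rule used is "minimum link depth", and the merged output is a sorted list rather than a real hfile file.

```python
from collections import defaultdict


def dedupe_and_merge(hfiles, keep):
    """Analyze all per-channel files together, deduplicate each repeating
    group with `keep`, and merge the survivors into one sorted list."""
    by_row_key = defaultdict(list)
    for cells in hfiles.values():
        for cell in cells:
            by_row_key[cell["row_key"]].append(cell)
    merged = [
        keep(group) if len(group) > 1 else group[0]
        for group in by_row_key.values()
    ]
    # HFiles keep their entries in sorted key order; the sketch mirrors
    # that by sorting the merged output on the row key.
    return sorted(merged, key=lambda cell: cell["row_key"])


def keep_min_link_depth(group):
    return min(group, key=lambda cell: cell["link_depth"])


# One hypothetical file per data mining channel.
hfiles = {
    "channel_1": [
        {"row_key": "S2", "link_depth": 3},
        {"row_key": "S1", "link_depth": 2},
    ],
    "channel_2": [
        {"row_key": "S1", "link_depth": 1},
        {"row_key": "S3", "link_depth": 0},
    ],
}

merged = dedupe_and_merge(hfiles, keep_min_link_depth)
```

The duplicate of S1 is found only because both channel files are analyzed together; deduplicating each file on its own would leave both copies in place.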
The present application provides a data processing apparatus 500, as shown in fig. 5, comprising:
a first reading module 501, configured to read repeated storage units in each hfile file of the HBase database and divide the storage units into a plurality of repeated groups;
a second reading module 502, configured to read the data attribute and the attribute value corresponding to the data attribute included in the storage unit in each repeating group respectively;
a duplicate removal rule obtaining module 503, configured to obtain duplicate removal rules corresponding to the data attributes according to the data attributes respectively; the deduplication rule is a priority rule for data retention based on the attribute values of the data attributes;
and a deduplication module 504, configured to perform deduplication processing on the storage units in each duplication group respectively based on a deduplication rule corresponding to each duplication group.
In one embodiment, the first reading module comprises:
the row key reading unit is used for reading the row key of each storage unit in each hfile file;
and a repeating group determining unit for determining the storage units with the same row key as one repeating group when the row keys of two or more storage units are the same.
In one embodiment, the deduplication module comprises:
the first scanning unit is used for scanning the attribute value corresponding to each data attribute of each storage unit of the repeated group;
and the first deduplication unit is used for performing deduplication processing on each storage unit in the duplication group in sequence according to the attribute value corresponding to each data attribute until only one storage unit remains in the duplication group.
In one embodiment, the first deduplication unit is further configured to perform primary deduplication processing on each storage unit in the repeating group according to the attribute value corresponding to the first attribute, and to perform secondary deduplication processing on the storage units remaining in the repeating group after the primary deduplication processing according to the attribute value corresponding to the second attribute.
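The sequential deduplication described above, in which a second attribute is consulted only for the storage units that survive the first pass, might look like the following sketch. It assumes that a smaller attribute value has higher retention priority, as in the link discovery time and link depth embodiments; the function name `dedupe_in_sequence` and the dictionary layout are hypothetical.

```python
def dedupe_in_sequence(group, attributes):
    """Narrow a repeating group one attribute at a time.

    group: list of dicts, one per storage unit.
    attributes: attribute names in priority order.
    Returns the storage units that survive every pass.
    """
    remaining = list(group)
    for attribute in attributes:
        if len(remaining) <= 1:
            break  # only one unit left, nothing more to decide
        smallest = min(unit[attribute] for unit in remaining)
        remaining = [unit for unit in remaining if unit[attribute] == smallest]
    return remaining


group = [
    {"row_key": "page-1", "link_discovery_time": 10, "link_depth": 2},
    {"row_key": "page-1", "link_discovery_time": 10, "link_depth": 1},
    {"row_key": "page-1", "link_discovery_time": 20, "link_depth": 1},
]
survivors = dedupe_in_sequence(group, ["link_discovery_time", "link_depth"])
```

In this example the primary pass on link discovery time leaves two storage units tied at 10, and the secondary pass on link depth leaves only the second unit. The function returns a list because the text does not say how a tie on every listed attribute would be broken.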
In one embodiment, the deduplication module comprises:
the second scanning unit is used for scanning the attribute values corresponding to the link discovery time of each storage unit in the repeated group;
and the second deduplication unit is used for reserving a storage unit with the minimum attribute value corresponding to the link discovery time.
In one embodiment, the deduplication module comprises:
the third scanning unit is used for scanning the attribute value corresponding to the link depth of each storage unit in the repeated group;
and the third deduplication unit is used for reserving a storage unit with the minimum attribute value corresponding to the link depth.
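For the single-attribute embodiments above (link discovery time or link depth), retaining the storage unit with the minimum attribute value reduces to one comparison over the repeating group. The following is a sketch only; `keep_minimum` is a hypothetical name, and resolving ties by keeping the first unit encountered is an assumption that follows from Python's built-in `min`, not something stated in the text.

```python
def keep_minimum(group, attribute):
    """Return the single storage unit with the smallest attribute value.

    When several units share the smallest value, the first one in the
    group is returned (behaviour of the built-in min; an assumption here).
    """
    return min(group, key=lambda unit: unit[attribute])


group = [
    {"row_key": "page-7", "link_depth": 3},
    {"row_key": "page-7", "link_depth": 1},
    {"row_key": "page-7", "link_depth": 2},
]
kept = keep_minimum(group, "link_depth")
```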
In one embodiment, the data processing apparatus further comprises:
the data receiving module is used for receiving data acquired through a plurality of data mining channels;
and the data storage module is used for respectively storing the data acquired by each data mining channel into each hfile file corresponding to different data mining channels.
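The data receiving module and data storage module can be pictured as routing each received record to a file that belongs to its data mining channel. The sketch below models each hfile file as an in-memory list keyed by channel name; the function name `store_by_channel`, the channel names and the record layout are all hypothetical, and writing real HBase files is outside its scope.

```python
from collections import defaultdict


def store_by_channel(records):
    """Place each storage unit in the file of the channel that produced it.

    records: iterable of (channel, storage_unit) pairs.
    Returns a dict mapping channel name to its list of storage units.
    """
    hfiles = defaultdict(list)
    for channel, unit in records:
        hfiles[channel].append(unit)
    return dict(hfiles)


records = [
    ("crawler", {"row_key": "u1", "link_depth": 2}),
    ("feed", {"row_key": "u1", "link_depth": 1}),
    ("crawler", {"row_key": "u2", "link_depth": 4}),
]
hfiles = store_by_channel(records)
```

Because the same row key `u1` arrives through two channels, it ends up in two different files, which is the situation the later deduplication step addresses.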
In one embodiment, the data processing apparatus further comprises:
and the merging storage module is used for merging the residual storage units in each hfile file after the duplication removal into one hfile file for storage.
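The merging storage module might be sketched as follows, again with each hfile file modelled as a list of dictionaries. The name `merge_hfiles` is hypothetical, and sorting the merged result by row key is an assumption made for a deterministic output; the text only states that the remaining storage units are merged into one hfile file.

```python
def merge_hfiles(deduplicated_hfiles):
    """Combine the storage units left in each file into one list.

    deduplicated_hfiles: iterable of lists of storage units (dicts with
    a 'row_key'), each already deduplicated.
    """
    merged = [unit for hfile in deduplicated_hfiles for unit in hfile]
    # Sorting by row key is an illustrative choice, not taken from the text.
    merged.sort(key=lambda unit: unit["row_key"])
    return merged


merged = merge_hfiles([
    [{"row_key": "u2"}],
    [{"row_key": "u1"}, {"row_key": "u3"}],
])
```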
The modules in the data processing apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, the present application further provides a storage medium having stored therein computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method as described in any one of the above embodiments.
In one embodiment, the present application further provides a computer device having computer-readable instructions stored therein, which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method as described in any one of the above embodiments.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operating system and execution of computer-readable instructions in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by a processor, implement a data processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the embodiments described above may be implemented by computer-readable instructions instructing the relevant hardware; the instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum-computing-based data processing logic devices, and the like.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise. Also, as used in this specification, the term "and/or" includes any and all combinations of the associated listed items.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, the embodiments may be combined as needed, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of data processing, the method comprising:
reading repeated storage units in each hfile file of the HBase database and dividing the storage units into a plurality of repeated groups;
respectively reading the data attribute contained in the storage unit in each repeating group and the attribute value corresponding to the data attribute;
respectively acquiring deduplication rules corresponding to the data attributes according to the data attributes; the deduplication rule is a priority rule for data retention based on the attribute value of the data attribute;
and performing deduplication processing on each storage unit in each repetitive group respectively based on the deduplication rule corresponding to each repetitive group.
2. The data processing method according to claim 1, wherein reading the repeated storage units in each hfile file of the HBase database and dividing them into a plurality of repeating groups comprises:
reading a row key of each storage unit in each hfile file;
if the row keys of two or more storage units are the same, determining the storage units with the same row key as one repeating group.
3. The data processing method according to claim 1 or 2, wherein when the storage units in the repeating group contain two or more data attributes, performing deduplication processing on each storage unit in each repeating group respectively based on the deduplication rule corresponding to each repeating group comprises:
scanning attribute values corresponding to each data attribute of each storage unit of the repeated group;
and sequentially carrying out deduplication processing on each storage unit in the repeated group according to the attribute value corresponding to each data attribute until only one storage unit is left in the repeated group.
4. The data processing method according to claim 3, wherein when the data attributes include a first attribute and a second attribute, performing the deduplication processing on the storage units in the repeating group sequentially according to the attribute value corresponding to each data attribute until only one storage unit remains in the repeating group comprises:
performing primary deduplication processing on each storage unit in the repeating group according to the attribute value corresponding to the first attribute;
and performing secondary deduplication processing on the residual storage units in the repeated group after the primary deduplication processing according to the attribute value corresponding to the second attribute.
5. The data processing method of claim 4, wherein the first attribute is link discovery time and the second attribute is link depth;
or
the first attribute is a link depth, and the second attribute is a link discovery time.
6. The data processing method according to claim 1 or 2, wherein when the data attribute includes a link discovery time, performing deduplication processing on each storage unit in each repeating group respectively based on the deduplication rule corresponding to each repeating group comprises:
scanning attribute values corresponding to link discovery time of each storage unit in the repeating group;
and reserving the storage unit with the minimum attribute value corresponding to the link discovery time.
7. The data processing method according to claim 1 or 2, wherein when the data attribute includes a link depth, performing deduplication processing on each storage unit in each repeating group respectively based on the deduplication rule corresponding to each repeating group comprises:
scanning attribute values corresponding to the link depths of the storage units in the repeating group;
and reserving the storage unit with the minimum attribute value corresponding to the link depth.
8. The data processing method according to claim 1, further comprising, before reading the repeated storage units in each hfile file of the HBase database and dividing them into a plurality of repeating groups:
receiving data collected through a plurality of data mining channels;
and respectively storing the data acquired by each data mining channel in each hfile file corresponding to different data mining channels.
9. The data processing method of claim 1, wherein the method further comprises:
and merging the storage units remaining in each hfile file after deduplication into one hfile file for storage.
10. A data processing apparatus, comprising:
the first reading module is used for reading repeated storage units in each hfile file of the HBase database and dividing the repeated storage units into a plurality of repeated groups;
the second reading module is used for respectively reading the data attribute contained in the storage unit in each repeating group and the attribute value corresponding to the data attribute;
a deduplication rule obtaining module, configured to obtain, according to each data attribute, the deduplication rule corresponding to that data attribute; the deduplication rule is a priority rule for data retention based on the attribute value of the data attribute;
and the deduplication module is used for performing deduplication processing on each storage unit in each repeating group respectively based on the deduplication rule corresponding to each repeating group.
11. A storage medium, characterized by: the storage medium has stored therein computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method of any one of claims 1 to 9.
12. A computer device, comprising: one or more processors, and a memory;
the memory has stored therein computer-readable instructions which, when executed by the one or more processors, perform the steps of the data processing method of any one of claims 1 to 9.
CN202210514745.4A 2022-05-12 2022-05-12 Data processing method, data processing device, storage medium and computer equipment Pending CN114880311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210514745.4A CN114880311A (en) 2022-05-12 2022-05-12 Data processing method, data processing device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN114880311A true CN114880311A (en) 2022-08-09

Family

ID=82675759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210514745.4A Pending CN114880311A (en) 2022-05-12 2022-05-12 Data processing method, data processing device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN114880311A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination