CN115905254A - Method, equipment and storage medium for replacing data item values of mass data tables in batch - Google Patents

Info

Publication number
CN115905254A
Authority
CN
China
Prior art keywords
data
original
fragment
values
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211393108.2A
Other languages
Chinese (zh)
Inventor
董明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiong Anzhi Pingyun Digital Technology Co ltd
Original Assignee
Xiong Anzhi Pingyun Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiong Anzhi Pingyun Digital Technology Co ltd filed Critical Xiong Anzhi Pingyun Digital Technology Co ltd
Priority to CN202211393108.2A priority Critical patent/CN115905254A/en
Publication of CN115905254A publication Critical patent/CN115905254A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises the steps of taking out original values of data items to be replaced in an original data table, and respectively taking the original values and new values replaced on the basis of a replacement rule as original primary key values and corresponding new primary key values to be stored in a cache middleware. And importing the data item values in the original data table into a distributed database, reading the original values to be replaced in the original data table from each fragment one by utilizing each Map task, inquiring corresponding new values from the caching middleware, and replacing the original values with the new values to obtain each updated fragment. And inserting the data in each updating fragment into the SQL script by using each Reduce task, and executing each SQL script to insert the data in each updating fragment into a new data table with the same structure as the original data table. Therefore, the processing time for replacing mass data can be reduced, and the replacement efficiency is improved.

Description

Method, equipment and storage medium for replacing data item values of mass data tables in batches
Technical Field
The application relates to the technical field of data processing, in particular to a method, equipment and a storage medium for replacing data item values of mass data tables in batches.
Background
During data processing, certain data items in a data table often need to be replaced in batches. However, existing replacement methods take a long time when replacing mass data in batches. For example, replacing 30 million records in a batch with an existing method takes about 60 minutes, which significantly reduces the efficiency of data processing.
Disclosure of Invention
The application provides a method, equipment and a storage medium for replacing data item values of mass data tables in batches, which can reduce the processing time for replacing mass data in batches so as to realize quick and efficient batch replacement of mass data.
According to a first scheme of the application, a method for replacing data item values of a mass data table in batches is provided, which comprises the steps of taking out original values of data items to be replaced in an original data table, and respectively taking the original values and new values replaced on the basis of a replacement rule as original primary key values and corresponding new primary key values to be stored in a cache middleware; importing the data item values in the original data table into a distributed database, wherein the distributed database has fragments with the number not lower than a first threshold value; reading original values to be replaced in an original data table from each fragment one by using each Map task corresponding to each fragment, inquiring corresponding new primary key values from the cache middleware by taking the original values as the primary key values to serve as new values, and replacing the original values with the new values to obtain each updated fragment; inserting the data in each updating fragment into the SQL script corresponding to each updating fragment by using each Reduce task corresponding to each updating fragment; and executing each SQL script to insert the data in each update fragment into a new data table with the same structure as the original data table.
According to a second aspect of the present application, there is provided an apparatus for bulk replacement of data item values by mass data tables, the apparatus comprising one or more processors; a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method for bulk replacement of data item values by a mass data table as described in various embodiments of the present application.
According to a third aspect of the present application, a computer-readable storage medium stores a computer program, which when executed by a processor, implements a method for replacing data item values in bulk for a mass data table according to various embodiments of the present application.
Compared with the prior art, the beneficial effects of the embodiment of the application lie in that:
According to the method and the device of the present application, the mass data item values in the data table are replaced by using cache middleware, a distributed database, and the MapReduce computing model. Compared with the traditional approach, in which a program replaces values in the data table directly according to a replacement rule, the present application makes full use of the very fast lookup speed of the cache and of the parallel data lookup and processing capability of distributed storage, which greatly improves data processing efficiency and reduces the time cost of data processing.
The foregoing general description and the following detailed description are exemplary and explanatory only and are not intended to limit the invention as claimed.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar parts throughout the different views. Like reference numerals having letter suffixes or different letter suffixes may represent different examples of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments and, together with the description and the claims, serve to explain the disclosed embodiments. Such embodiments are illustrative and exemplary and are not intended to be exhaustive or exclusive embodiments of the present method, apparatus, system, or non-transitory computer-readable medium having instructions for implementing the method.
FIG. 1 is a flowchart illustrating a method for replacing data item values in bulk in a mass data table according to an embodiment of the present application.
Fig. 2 is a schematic diagram illustrating that an original value and a new value replaced by the original value based on a replacement rule are stored in caching middleware as an original primary key value (key) and a corresponding new primary key value (value), respectively, according to an embodiment of the present application.
FIG. 3 illustrates an implementation method of replacing data item values in batches by a mass data table according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the present application is described in detail below with reference to the accompanying drawings and the detailed description. The embodiments of the present application will be described in further detail below with reference to the drawings and specific embodiments, but the present application is not limited thereto.
As used in this application, the terms "first," "second," and the like do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element preceding the word covers the element listed after the word, and does not exclude the possibility that other elements are also covered. The order of execution of the steps in the methods described in this application in connection with the figures is not intended to be limiting. As long as the logical relationship between the steps is not affected, the steps can be integrated into a single step, the single step can be divided into a plurality of steps, and the execution order of the steps can be changed according to the specific requirements.
FIG. 1 is a flowchart illustrating a method for replacing data item values in bulk in a mass data table according to an embodiment of the present application. The method starts in step S101, in which the original values of the data items to be replaced in an original data table are taken out, and each original value and the new value that replaces it under a replacement rule are stored in cache middleware as an original primary key value (key) and a corresponding new primary key value (value), respectively. The cache middleware includes, but is not limited to, Memcached, MongoDB, and Redis (Remote Dictionary Server); preferably, the cache middleware is Redis. Redis is a key-value storage system whose data resides in memory, so reads and writes are very fast, and it can process more than 100,000 read and write operations per second. Storing the original values of the data items to be replaced in the cache middleware makes full use of its high-speed read and write capability to improve the data replacement efficiency.
Specifically, as shown in fig. 2, the contents of the "department" (dept) column of an employee table with 30 million records in the original table 201 need to be replaced in batches according to the replacement rule. In the replacement rule table 202, each id denotes an address stored in the caching middleware and corresponds to one key-value pair. The original value column holds the original value of the data item to be replaced in the original data table, and the corresponding new value column holds the new value after replacement. For example, when id is 1001, the original primary key value (key) and the new primary key value (value) form a key-value pair indicating that the original value dept01 is to be replaced by prod-safe-dept-01; when id is 1002, the pair indicates that the original value dept02 is to be replaced by prod-safe-dept-02; and so on. According to the replacement rule table 202, the original values to be replaced in the original table 201 and their new values after replacement are stored in the cache middleware 203 as the original primary key values and the new primary key values, respectively. As shown in the cache middleware 203, a Redis Hash is a set of key-value pairs. Internally, the value of a Redis Hash is itself a HashMap, and Redis provides an interface for accessing the members of this map directly. For example, the key may be a user ID and the value a map whose keys are the attribute names of the member and whose values are the attribute values. Data can then be modified and accessed directly through the key of the internal map (which Redis calls a field); that is, the corresponding attribute data can be manipulated through key (user ID) + field (attribute tag). Based on the caching middleware 203, the new value that is to replace a given original value can be queried directly, quickly, and efficiently.
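The key-value layout of fig. 2 can be sketched as follows. This is only a minimal illustration: it assumes a plain in-memory Python dictionary as a stand-in for the cache middleware (a real deployment would talk to Redis or a similar store through a client library), and the names `replacement_rules`, `build_cache`, and `lookup` are hypothetical.

```python
# Stand-in for the cache middleware 203: a plain dict instead of Redis (assumption).
# Rows of the replacement rule table 202 as (id, original value, new value).
replacement_rules = [
    (1001, "dept01", "prod-safe-dept-01"),
    (1002, "dept02", "prod-safe-dept-02"),
]


def build_cache(rules):
    """Store each original value as the key and its new value as the value."""
    cache = {}
    for _rule_id, original, new in rules:
        cache[original] = new
    return cache


def lookup(cache, original):
    """Return the new value for an original value, or None if no rule exists."""
    return cache.get(original)


cache = build_cache(replacement_rules)
print(lookup(cache, "dept01"))  # prints prod-safe-dept-01
```

Keying the store by the original value is what lets a later step find the replacement in a single lookup rather than by scanning the rule table.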
Returning to fig. 1, in step S102, the data item values in the original data table are imported into a distributed database, where the distributed database has no less than a first threshold number of shards. The distributed database is composed of a plurality of independent entities that are interconnected through a network. It consists of a plurality of data files located at different sites and allows a plurality of users to access and manipulate the data. Importing the data item values in the original data table into a distributed database disperses them over multiple nodes, so that the data can be processed more flexibly and efficiently. Specifically, when batch replacement of data item values in a mass data table is executed, the number of shards of the distributed database is kept not lower than the first threshold so as to improve the data processing efficiency. The distributed database can be the distributed column-oriented storage database HBase, which is a highly reliable, high-performance, column-oriented, and scalable distributed database. Data in a distributed column-oriented storage database is stored by column, with each column stored independently, so a query only needs to access the columns it involves; concurrent query performance based on HBase is therefore high.
In step S103, each Map task corresponding to each fragment reads the original values to be replaced in the original data table one by one from its fragment, queries the cache middleware for the corresponding new primary key value using the original value as the primary key value, takes the result as the new value, and replaces the original value with the new value to obtain each updated fragment. Specifically, the replacement of the data item values in the original data table is performed based on the distributed computing model MapReduce. MapReduce is a computing model and a programming framework for distributed programs, used for computation over large data volumes. MapReduce divides an application into two kinds of tasks, Map and Reduce. The distributed database comprises a number of fragments not lower than the first threshold, and each Map task reads the original values to be replaced in the original data table one by one from its corresponding fragment, such as dept01, dept02, dept03, dept04, ... from the original table 201. Then, based on the correspondence between the original primary key value (key) and the new primary key value (value), the original value is used as the original primary key value to query the corresponding new primary key value from the cache middleware as the new value. That is, during execution of the Map task, once the original values have been acquired, the new value for each original value is looked up through the key-value correspondence in the cache middleware 203, and the original value is replaced with it. Each Map task reads and parses the data of one fragment, and the computed result data is temporarily stored on the local disk of the node where the Map task runs. Replacing the original values in an original fragment with the new values yields the updated fragment.
In this embodiment, based on a replacement rule (for example, adding a random number after each value), a corresponding relationship between an original primary key value (key) and a new primary key value (value) is generated in the cache middleware 203 according to the replacement rule, and mass data item values in the original table 201 are imported into the distributed database, and a Map task is executed by using a MapReduce computing model to search and replace the data item values. Through the cooperation between the caching middleware and the distributed database, the efficiency of inquiring the corresponding new value based on each original value is improved, and the data processing time is greatly reduced.
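The lookup-and-replace work of a single Map task can be sketched as below. This is an illustrative sketch rather than the patent's implementation: the shard is modeled as a list of row dictionaries, the cache as a Python dict, and the function name `map_task` and the column name `dept` are assumptions. Keeping a value unchanged when the cache has no entry for it is also an assumption, since the source does not say how misses are handled.

```python
def map_task(shard_rows, cache, column):
    """Replace the value of `column` in every row of one shard.

    Rows whose original value has no entry in the cache are kept unchanged
    (an assumption; the source does not describe cache misses).
    """
    updated = []
    for row in shard_rows:
        new_row = dict(row)  # copy so the original shard is not modified
        original = row[column]
        new_row[column] = cache.get(original, original)
        updated.append(new_row)
    return updated


cache = {"dept01": "prod-safe-dept-01", "dept02": "prod-safe-dept-02"}
shard = [
    {"id": 1, "name": "A", "dept": "dept01"},
    {"id": 2, "name": "B", "dept": "dept02"},
    {"id": 3, "name": "C", "dept": "dept09"},
]
updated_shard = map_task(shard, cache, "dept")
```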
In step S104, the data in each update fragment is inserted into the SQL script corresponding to each update fragment by using each Reduce task corresponding to each update fragment. And replacing the original value in the fragment with a new value through the Map task to obtain an updated fragment, executing the Reduce task by using a MapReduce computing model, and gathering the data in each updated fragment and inserting the data into the SQL script corresponding to each updated fragment. Each update fragment corresponds to one Reduce task and one SQL script, so that the parallel processing of data in different update fragments by multiple MapReduce calculation models is facilitated, and the data processing efficiency is improved. And executing each SQL script to insert the data in each update fragment into a new data table with the same structure as the original data table (as in step S105), so that the original data table can be renamed to a backup table, the new data table can be renamed to the table name of the original data table, and batch replacement of mass data item values in the original data table is realized.
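Steps S104 and S105, together with the final table swap, can be sketched as follows. The sketch assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for the database holding the original table, and the table names `employee`, `employee_new`, and `employee_backup`, as well as the helpers `sql_literal` and `reduce_task`, are hypothetical.

```python
import sqlite3


def sql_literal(value):
    """Render a Python value as a SQL literal (strings quoted, quotes doubled)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def reduce_task(updated_shard, table, columns):
    """Build one SQL script holding an INSERT statement per row of the shard."""
    statements = []
    for row in updated_shard:
        values = ", ".join(sql_literal(row[c]) for c in columns)
        statements.append(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({values});"
        )
    return "\n".join(statements)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, dept TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'dept01')")
# New table with the same structure as the original table.
conn.execute("CREATE TABLE employee_new (id INTEGER, dept TEXT)")

script = reduce_task([{"id": 1, "dept": "prod-safe-dept-01"}],
                     "employee_new", ["id", "dept"])
conn.executescript(script)

# Swap the tables: the original becomes the backup, the new one takes its name.
conn.execute("ALTER TABLE employee RENAME TO employee_backup")
conn.execute("ALTER TABLE employee_new RENAME TO employee")
rows = conn.execute("SELECT id, dept FROM employee").fetchall()
```

Writing into a fresh table and renaming afterwards leaves the original data untouched in the backup table until the swap, which is the property the paragraph above relies on.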
The method for replacing mass data item values in a data table in batches shown in fig. 3 is described as a specific embodiment. The original table 301 has 30 million records, and the data in the "department" column of the original table 301 is to be replaced in batches according to the replacement rule. Replacing dept01 with prod-sale-dept-01, dept02 with prod-sale-dept-02, and so on, forms the replacement rule. The correspondence between the original primary key values and the new primary key values is generated according to the replacement rule and stored in the cache middleware. The data item values in the original table 301 are imported into the distributed database 302, which is divided into 3000 fragments: fragment 1 stores records 1 to 10,000, fragment 2 stores records 10,001 to 20,000, fragment 3 stores records 20,001 to 30,000, and so on, up to fragment 3000, which stores records 29,990,001 to 30,000,000. At this point, the data of the original table 301 is held in the fragments of the distributed database 302. A Map task is executed for each fragment by using the MapReduce computing model, that is, each fragment corresponds to one Map task, so that a plurality of Map tasks can run simultaneously and the data processing efficiency is improved. Each Map task reads the original values to be replaced one by one from its fragment, queries the cache middleware for the new value corresponding to each original value based on the key-value correspondence, and replaces the original value with the new value, so that the original values in the "department" column of each fragment are replaced with new values to obtain each updated fragment.
Although the original values in the "department" column of each update fragment have been replaced with new values, the 30 million records of the original table 301 are still distributed among the 3000 update fragments of the updated distributed database 304. A Reduce task is executed on each update fragment by using the MapReduce computing model, and the data in each update fragment is inserted into the SQL script 306 corresponding to that update fragment. Each SQL script 306 is then executed, and the data in each update fragment is inserted into a new data table in the database 307 where the original table is located. The structure of the new data table is the same as that of the original table 301, and the replaced new values are stored in the "department" column of the new data table, so that batch replacement of the data item values in the original table 301 is realized in only about 10 minutes. By contrast, replacing the data item values in the original table 301 in batches with the existing method takes more than 60 minutes. Therefore, the method provided by this embodiment can effectively improve the efficiency of replacing the data item values in the original data table in batches and can greatly reduce the data processing time.
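As a rough check of these figures, taking the quoted times at face value (about 10 minutes for the described method against more than 60 minutes for the existing method, both on 30 million records), the implied speedup and throughput are:

```latex
\text{speedup} \gtrsim \frac{60\ \text{min}}{10\ \text{min}} = 6,
\qquad
\text{throughput}_{\text{new}} \approx \frac{3\times10^{7}\ \text{records}}{10\ \text{min}} = 3\times10^{6}\ \text{records/min},
\qquad
\text{throughput}_{\text{old}} \lesssim \frac{3\times10^{7}\ \text{records}}{60\ \text{min}} = 5\times10^{5}\ \text{records/min}
```

These are back-of-the-envelope values derived only from the two times stated above; the source gives no further measurements.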
In some embodiments of the present application, taking out the original values of the data items to be replaced in the original data table further comprises: taking out only once any original value that occurs multiple times. For example, suppose each original value in column 1 of the original data table is to be replaced according to the replacement rule of adding 1, namely replacing 1 with 2, 2 with 3, and so on. If column 1 contains 10,000 original values equal to 1, only the first one encountered needs to be taken out, and a key-value pair is constructed for it according to the replacement rule and stored in the cache middleware; the other 9,999 original values equal to 1 do not need to be taken out or stored. Because identical original values are taken out and stored in the cache middleware only once, the workload of the processor is reduced, less cache middleware memory is occupied by the data, and queries against the cache middleware become faster.
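The deduplication described above can be sketched as follows, again with a Python dict standing in for the cache middleware. The function name `build_cache_from_column` and the counter `rule_calls` are hypothetical and exist only to show that the rule is applied once per distinct value.

```python
def build_cache_from_column(values, rule):
    """Apply `rule` once per distinct original value and cache the result."""
    cache = {}
    rule_calls = 0
    for original in values:
        if original in cache:
            continue  # value already taken out once; skip the duplicates
        cache[original] = rule(original)
        rule_calls += 1
    return cache, rule_calls


# 10,000 ones and 5,000 twos, replaced under the "add 1" rule.
column = [1] * 10000 + [2] * 5000
cache, rule_calls = build_cache_from_column(column, lambda v: v + 1)
```

With 15,000 input values but only two distinct ones, the cache ends up with two entries and the rule runs twice.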
In some embodiments of the present application, inserting the data in each update fragment into the SQL script corresponding to that update fragment by using the corresponding Reduce task specifically includes: when the amount of data in a data group of the update fragment whose values have been replaced with new values is not less than a second threshold, inserting the data in that data group into the SQL script corresponding to the update fragment by using the corresponding Reduce task. Specifically, the description takes as an example the case in which each fragment and each update fragment contain 10,000 records. The Map task replaces the original values in the fragment one by one, and each time an original value is replaced, the new value is stored in the update fragment. Each update fragment may be divided into 100 data groups of 100 records each. Taking the first data group as an example, as the replaced new values are stored into the first data group one by one, it is judged whether the quantity of data stored in the first data group has reached 100 (the second threshold in this example); if so, the data in the first data group is inserted into the SQL script by using the Reduce task. This embodiment is only an exemplary illustration and does not exclude other possibilities. The inventor of the present application finds that the data processing method provided by this embodiment can reduce the data processing time compared with storing the replaced new values into the cache middleware one by one.
Further, in a case where the number of data in the data group in which the new value is replaced in the update slice is smaller than the second threshold number, the replacement processing of the original value is continued. That is, in the case where the number of data in the data group in which the new value is replaced is not less than the second threshold value, the task of inserting the data in the data group into the SQL script is performed using the Reduce task, and if the number of data in the data group is less than the second threshold value, the replacement processing is continued without inserting the data into the SQL script, so as to improve the efficiency of the data processing.
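The threshold check described in the two paragraphs above can be sketched as a simple buffer that is handed over once it is full. In this sketch the function name `replace_and_flush` and the parameter `group_size` (playing the role of the second threshold) are hypothetical, and flushing a final, partly filled group at the end of the shard is an assumption that the source does not describe.

```python
def replace_and_flush(shard_values, cache, group_size=100):
    """Replace values one by one; flush a data group once it holds group_size items."""
    flushed_groups = []  # each entry stands for one batch handed to the Reduce task
    group = []
    for original in shard_values:
        group.append(cache.get(original, original))
        if len(group) >= group_size:  # not less than the second threshold
            flushed_groups.append(group)
            group = []
    if group:
        # Leftover records below the threshold are flushed at the end of the
        # shard (an assumption; the source does not describe this case).
        flushed_groups.append(group)
    return flushed_groups


cache = {"dept01": "prod-sale-dept-01"}
groups = replace_and_flush(["dept01"] * 250, cache, group_size=100)
```

For 250 input values and a threshold of 100 this yields two full groups and one remainder group of 50.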
In some embodiments of the present application, importing the data item values in the original data table into the distributed database specifically includes reading the data in the original data table record by record using an import tool, calculating the data range of each fragment according to the split field, and importing the data item values of the corresponding data range in the original data table into each fragment of the distributed database. The distributed database can be Hive. Hive is a data warehouse tool built on Hadoop (a distributed system infrastructure) for static batch processing; it maps structured data files to database tables, provides a simple SQL query function, and converts SQL statements into MapReduce tasks for execution. The import tool may be Sqoop (SQL-to-Hadoop), which can import data from a relational database (e.g., MySQL, Oracle, Postgres) into the HDFS (distributed file system) of Hadoop, or export data from HDFS into the relational database. Specifically, the description takes Sqoop as the import tool. Importing the data of the relational database table mysql_employee into the distributed database table hive_employee using Sqoop may be implemented with the following command: sqoop import --connect jdbc:mysql://<database IP address>:<database port number>/<database name> --username <user name> --password <database password> --table mysql_employee --fields-terminated-by '\t' --delete-target-dir --num-mappers 1 --hive-import --hive-database default --hive-table hive_employee. The data range of each fragment in the distributed database can be calculated according to the split field; the specific data range of each fragment is not limited and can be set according to the needs of the user. For example, the original table 301 in fig. 3 has 30 million records; when the number of shards is 3000, the data amount of each shard may be set to 10,000, the data range of each shard is determined accordingly, and the data item values of the corresponding data range are then imported into each shard by using the import tool. The data volume of each shard is thus kept consistent, which helps the shards stay as synchronized as possible when they are processed in parallel and shortens the processing time for replacing mass data item values in batches.
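The per-shard data ranges in this example can be computed with a few lines of arithmetic. The sketch below assumes the split field is a dense 1-based record number; the function name `shard_ranges` is hypothetical, and spreading a remainder over the first shards is an added assumption for totals that do not divide evenly.

```python
def shard_ranges(total_records, shard_count):
    """Return (first, last) 1-based record numbers for each shard."""
    size, remainder = divmod(total_records, shard_count)
    ranges = []
    start = 1
    for i in range(shard_count):
        # Spread any remainder over the first shards so sizes differ by at most 1.
        length = size + (1 if i < remainder else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


ranges = shard_ranges(30_000_000, 3000)
print(ranges[0], ranges[1], ranges[-1])
# prints (1, 10000) (10001, 20000) (29990001, 30000000)
```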
In some embodiments of the present application, the distributed database is a distributed columnar storage database, and the lookup and replacement of the data are performed on each of the segments in parallel by using each Map task. The total data of the original data table can be dispersed into different fragments by adopting a distributed column storage database, and each fragment is processed in parallel by utilizing a calculation model. Compared with the traditional method that the data table is directly replaced by using a program according to a replacement rule, the embodiment makes full use of the extremely fast search speed of the cache middleware and the parallel data search and processing capacity of the distributed column type storage database, greatly improves the data processing efficiency and reduces the time cost of data processing.
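Running one lookup-and-replace task per shard in parallel can be illustrated as below. This is a sketch under simplifying assumptions: a thread pool on one machine stands in for the Map slots of a cluster, each shard is a plain list of values, a dict stands in for the cache middleware, and the name `map_task` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def map_task(shard, cache):
    """Look up and replace every value of one shard."""
    return [cache.get(value, value) for value in shard]


cache = {"dept01": "prod-sale-dept-01", "dept02": "prod-sale-dept-02"}
shards = [["dept01", "dept02"], ["dept02", "dept02"], ["dept01"]]

# One task per shard; a thread pool stands in for the cluster's Map slots.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    updated_shards = list(pool.map(lambda s: map_task(s, cache), shards))
```

Because the shards do not share rows and the cache is only read, the tasks need no coordination with each other, which is what makes the per-shard parallelism straightforward.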
In some embodiments of the present application, the setting method of the replacement rule includes setting the replacement rule using a regular expression or a rule engine, and the setting of the replacement rule may be set according to a user's requirement. And a regular expression can be used for setting a slightly complicated replacement rule, and a certain rule engine middleware can be used for setting a more complicated data replacement rule. And generating the corresponding relation between the original primary key value and the new primary key value in the cache middleware according to the replacement rule so as to quickly inquire a new value corresponding to the original value in the cache middleware.
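A replacement rule expressed as a regular expression might look like the following sketch. The pattern is a hypothetical rule chosen to reproduce the dept01 to prod-sale-dept-01 example, and the names `RULE_PATTERN` and `apply_rule` are illustrative; values that do not match the pattern are left unchanged, which is an assumption.

```python
import re

# Hypothetical rule: "dept" followed by digits becomes "prod-sale-dept-<digits>".
RULE_PATTERN = re.compile(r"^dept(\d+)$")


def apply_rule(original):
    """Return the new value for `original`, or the value unchanged if no match."""
    return RULE_PATTERN.sub(r"prod-sale-dept-\1", original)


# Precompute the original-to-new correspondence that would go into the cache.
cache = {value: apply_rule(value) for value in ["dept01", "dept02", "hr"]}
```

Evaluating the rule once per distinct value and caching the result keeps the regular expression out of the per-record path.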
In some implementations of the present application, an apparatus for bulk replacement of data item values in a mass data table is provided that includes one or more processors, which may be processing devices including one or more general-purpose processing devices, such as microprocessors, Central Processing Units (CPUs), Graphics Processing Units (GPUs), and the like. More specifically, the processor may be a Complex Instruction Set Computing (CISC) microprocessor, a Reduced Instruction Set Computing (RISC) microprocessor, a Very Long Instruction Word (VLIW) microprocessor, a processor running other instruction sets, or a processor running a combination of instruction sets. The processor may also be one or more special-purpose processing devices such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), a System on a Chip (SoC), or the like.
Further comprising storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method for bulk replacement of data item values by a mass data table as described in various embodiments of the present application.
The present application describes various operations or functions that may be implemented as or defined as software code or instructions. Such content may be directly executable ("object" or "executable" form), source code, or difference code ("delta" or "patch" code). The software code or instructions may be stored in a computer-readable storage medium and, when executed, may cause a machine to perform the functions or operations described. The storage medium includes any mechanism for storing information in a form accessible by a machine (e.g., a computing device, an electronic system, etc.), such as recordable or non-recordable media (e.g., Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.).
The example methods described herein may be implemented at least in part by a machine or computer. In some embodiments, a computer-readable storage medium is provided that stores a computer program that, when executed by a processor, implements the method for bulk replacement of data item values in a mass data table as described in various embodiments of the present application. An implementation of such a method may include software code, such as microcode, assembly language code, high-level language code, and so forth. Various software programming techniques may be used to create the various programs or program modules. For example, the program parts or program modules may be designed in or by Java, Python, C++, assembly language, or any known programming language. One or more of such software portions or modules may be integrated into a computer system and/or computer-readable medium. Such software code may include computer-readable instructions for performing various methods. The software code may form part of a computer program product or a computer program module. Further, in an example, the software code can be tangibly stored on one or more volatile, non-transitory, or non-volatile tangible computer-readable media, e.g., during execution or at other times. Examples of such tangible computer-readable media may include, but are not limited to, hard disks, removable magnetic disks, removable optical disks (e.g., compact disks and digital video disks), magnetic cassettes, memory cards or sticks, Random Access Memories (RAMs), Read Only Memories (ROMs), and the like.
Moreover, although exemplary embodiments have been described herein, the scope of the present application includes any and all embodiments based on the present application that have equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, or alterations. The elements of the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents. The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other, and other embodiments may be devised by those of ordinary skill in the art upon reading the above description. In addition, in the above detailed description, various features may be grouped together to streamline the application. This should not be interpreted as intending that an unclaimed disclosed feature is essential to any claim. Rather, the subject matter of the present application may lie in less than all features of a particular disclosed embodiment. Thus, the claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that such embodiments can be combined with each other in various combinations or permutations. The scope of the application should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above embodiments are only exemplary embodiments of the present application and are not intended to limit it; the protection scope of the present application is defined by the claims. Those skilled in the art may make various modifications and equivalent substitutions within the spirit and scope of the present application, and such modifications and equivalents should also be considered as falling within its protection scope.

Claims (10)

1. A method for batch replacement of data item values in a mass data table, characterized by comprising the following steps:
extracting the original values of the data items to be replaced from an original data table, and storing each original value, together with the new value obtained by applying a replacement rule to it, in a cache middleware as an original primary key value and a corresponding new primary key value, respectively;
importing the data item values of the original data table into a distributed database, wherein the distributed database has a number of fragments not lower than a first threshold;
using the Map task corresponding to each fragment, reading the original values to be replaced from that fragment one by one, querying the cache middleware with each original value as the primary key value to obtain the corresponding new primary key value as the new value, and replacing the original value with the new value, to obtain updated fragments;
using the Reduce task corresponding to each updated fragment, inserting the data of that updated fragment into an SQL script corresponding to that updated fragment; and
executing each SQL script to insert the data of each updated fragment into a new data table having the same structure as the original data table.
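The steps of claim 1 can be illustrated with a small single-process sketch. Everything named here is an assumption made for illustration: a plain `dict` stands in for the cache middleware, a round-robin list split stands in for the import into a fragmented distributed database, ordinary function calls stand in for Map and Reduce tasks, and an in-memory SQLite table stands in for the new data table. The "SQL script" is represented as a parameterized `INSERT` statement plus its row tuples rather than as a script file.

```python
import sqlite3

def build_cache(rows, column, rule):
    """Store original -> new value pairs; a dict stands in for the cache middleware."""
    distinct_values = {row[column] for row in rows}
    return {value: rule(value) for value in distinct_values}

def split_into_fragments(rows, fragment_count):
    """Round-robin split standing in for the import into a fragmented database."""
    return [rows[i::fragment_count] for i in range(fragment_count)]

def map_task(fragment, column, cache):
    """Replace the original value in every row of one fragment via a cache lookup."""
    return [{**row, column: cache[row[column]]} for row in fragment]

def reduce_task(updated_fragment, table, columns):
    """Turn one updated fragment into an INSERT statement plus its row tuples."""
    placeholders = ", ".join("?" for _ in columns)
    statement = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return statement, [tuple(row[c] for c in columns) for row in updated_fragment]

def replace_in_bulk(rows, column, rule, fragment_count, connection, new_table, columns):
    """Run the whole flow: cache, fragment, map, reduce, then execute the inserts."""
    cache = build_cache(rows, column, rule)
    for fragment in split_into_fragments(rows, fragment_count):
        updated = map_task(fragment, column, cache)
        statement, params = reduce_task(updated, new_table, columns)
        connection.executemany(statement, params)
    connection.commit()
```

In a real deployment the per-fragment calls would run as parallel tasks; the sequential loop here only shows the data flow from original values to the new table.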
2. The method of claim 1, wherein extracting the original values of the data items to be replaced from the original data table further comprises: extracting only once an original value that is shared by a plurality of data items.
3. The method of claim 1, wherein using the Reduce task corresponding to each updated fragment to insert the data of that updated fragment into the corresponding SQL script specifically comprises:
when the number of data items in the data group of the updated fragment whose values have been replaced with new values is not less than a second threshold, inserting the data of that data group into the SQL script corresponding to the updated fragment by using the corresponding Reduce task.
4. The method according to claim 1 or 3, wherein using the Reduce task corresponding to each updated fragment to insert the data of that updated fragment into the corresponding SQL script specifically comprises:
when the number of data items in the data group of the updated fragment whose values have been replaced with new values is less than the second threshold, continuing the replacement of original values.
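The threshold behaviour of claims 3 and 4 can be sketched as a generator that keeps replacing values until the group of replaced rows reaches the second threshold and only then hands the group on for insertion into the SQL script. The function name `batched_replacements` and the `dict` cache are hypothetical, and flushing a final undersized group at the end of the fragment is an assumption the claims do not state.

```python
def batched_replacements(fragment, column, cache, threshold):
    """Yield groups of replaced rows once each group reaches the threshold size."""
    group = []
    for row in fragment:
        # Keep replacing original values while the group is below the threshold.
        group.append({**row, column: cache[row[column]]})
        if len(group) >= threshold:
            # Threshold reached: hand the group on for insertion into the script.
            yield group
            group = []
    if group:
        # Assumption: leftover rows are flushed when the fragment is exhausted.
        yield group
```

Batching in this way keeps each generated script segment at a predictable size instead of emitting one statement per row.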
5. The method of claim 1, wherein importing the data item values of the original data table into the distributed database specifically comprises:
reading the data of the original data table one by one using an import tool, calculating the data range of each fragment according to a split field, and importing the data item values of the original data table that fall within each data range into the corresponding fragment of the distributed database.
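One plausible way to calculate the data range of each fragment from a split field is to divide the field's value interval into contiguous ranges of nearly equal width. The sketch below assumes an integer split field with known minimum and maximum values; the function names and the half-open range convention are illustrative choices, not details taken from the application.

```python
def fragment_ranges(min_value, max_value, fragment_count):
    """Split the inclusive interval [min_value, max_value] into half-open ranges."""
    total = max_value - min_value + 1
    base, extra = divmod(total, fragment_count)
    ranges, start = [], min_value
    for index in range(fragment_count):
        # The first `extra` fragments take one additional value each.
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def assign_fragment(value, ranges):
    """Return the index of the fragment whose range [low, high) contains value."""
    for index, (low, high) in enumerate(ranges):
        if low <= value < high:
            return index
    raise ValueError(f"value {value} is outside every fragment range")
```

For example, `fragment_ranges(1, 10, 3)` gives three ranges covering 4, 3, and 3 values, and a row whose split field equals 5 is assigned to the second fragment.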
6. The method of claim 1, wherein the distributed database is a distributed columnar storage database, and the Map tasks look up and replace the data of their respective fragments in parallel.
7. The method of claim 1, wherein the cache middleware is Redis.
8. The method according to claim 1, wherein setting the replacement rule comprises: setting the replacement rule using a regular expression or a rule engine;
and generating, in the cache middleware, the correspondence between the original primary key values and the new primary key values according to the replacement rule.
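As a hedged illustration of claim 8, combined with the deduplication of claim 2, a replacement rule can be expressed as a regular expression substitution and applied once per distinct original value to populate the cache. The helper names are hypothetical, and a plain `dict` stands in for the cache middleware; with Redis (claim 7), each pair would instead be written as a key and value.

```python
import re

def make_regex_rule(pattern, replacement):
    """Build a replacement rule from a regular expression and a replacement template."""
    compiled = re.compile(pattern)
    return lambda value: compiled.sub(replacement, value)

def populate_cache(original_values, rule, cache):
    """Store one original -> new pair per distinct original value."""
    for value in set(original_values):
        # Values that the pattern does not match map to themselves.
        cache[value] = rule(value)
    return cache
```

A usage example: `populate_cache(["A01", "A02", "A01"], make_regex_rule(r"^A(\d+)$", r"B\1"), {})` yields a two-entry mapping from `A01` to `B01` and from `A02` to `B02`, however many rows share those values.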
9. An apparatus for batch replacement of data item values in a mass data table, the apparatus comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for batch replacement of data item values in a mass data table according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method for batch replacement of data item values in a mass data table according to any one of claims 1 to 8.
CN202211393108.2A 2022-11-08 2022-11-08 Method, equipment and storage medium for replacing data item values of mass data tables in batch Pending CN115905254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211393108.2A CN115905254A (en) 2022-11-08 2022-11-08 Method, equipment and storage medium for replacing data item values of mass data tables in batch

Publications (1)

Publication Number Publication Date
CN115905254A true CN115905254A (en) 2023-04-04

Family

ID=86475798

Country Status (1)

Country Link
CN (1) CN115905254A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination