CN112181921A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN112181921A
Authority
CN
China
Prior art keywords
data
column
incremental
storage system
timestamp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011137195.6A
Other languages
Chinese (zh)
Inventor
林兆祥
李晓松
马妍娇
李彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011137195.6A
Publication of CN112181921A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1756 De-duplication implemented within the file system, e.g. based on file segments based on delta files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method, a data processing apparatus, an electronic device, and a computer-readable storage medium, and relates to databases and storage in cloud technology. The method comprises the following steps: receiving incremental data and writing the incremental data into a data storage system; updating, according to the timestamp of the incremental data, the column-level timestamps respectively corresponding to all columns contained in the incremental data; receiving a data query request; querying the data storage system according to the key name carried by the data query request to acquire data corresponding to the key name; and comparing the timestamp of the data with the latest column-level timestamp of the column in which the data is located, and returning a corresponding query result according to the comparison result. The method and apparatus can improve the efficiency of data filtering in the data storage system.

Description

Data processing method and device
Technical Field
The present application relates to the field of database technologies in cloud technologies, and in particular, to a data processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In the era of the internet, particularly the mobile internet, data is generated more and more rapidly, and the performance requirement on the storage and processing (such as query) of the data is higher and higher.
When a data storage system is updated in batches based on newly generated data, the related art filters data by marking the data to be deleted with a Delete flag, so that the marked data is not returned at query time. In this approach, the new and old batches of data must first be joined (Join) on the row key (RowKey) of the data table, that is, the two batches are associated with each other by RowKey, and data that exists in the old batch but not in the new batch is marked with a Delete flag.
However, the Join operation is time-consuming, so the efficiency with which the server subsequently responds to data query requests is reduced.
Disclosure of Invention
Embodiments of the present application provide a data processing method, an apparatus, an electronic device, and a computer-readable storage medium, which can improve efficiency of data filtering, thereby ensuring efficiency of responding to a data query request.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides a data processing method, including:
receiving incremental data and writing the incremental data into a data storage system;
updating, according to the timestamp of the incremental data, the column-level timestamps respectively corresponding to all columns contained in the incremental data;
receiving a data query request;
querying the data storage system according to the key name carried by the data query request to acquire data corresponding to the key name;
and comparing the timestamp of the data with the latest column-level timestamp of the column in which the data is located, and returning a corresponding query result according to the comparison result.
An embodiment of the present application provides a data processing apparatus, including:
a receiving module, configured to receive incremental data;
a writing module, configured to write the incremental data into a data storage system;
an updating module, configured to update, according to the timestamp of the incremental data, the column-level timestamps respectively corresponding to all columns contained in the incremental data;
the receiving module being further configured to receive a data query request;
a query module, configured to query the data storage system according to the key name carried by the data query request to acquire data corresponding to the key name;
and a comparison module, configured to compare the timestamp of the data with the latest column-level timestamp of the column in which the data is located, and to return a corresponding query result according to the comparison result.
In the above solution, the comparison module is further configured to return a query result that does not include the data when the latest column-level timestamp of the column in which the data is located is greater than the timestamp of the data, and to return a query result that includes the data when that latest column-level timestamp is less than or equal to the timestamp of the data.
In the above solution, the updating module is further configured to update the corresponding index file according to the storage address of the incremental data in the data storage system, and to record all columns contained in the incremental data in the index file.
In the above solution, the updating module is further configured to, when a loading operation is performed on the index file, read all columns contained in the incremental data from the index file and update the column-level timestamps respectively corresponding to all the columns.
In the above solution, the updating module is further configured to perform the following operations within the lifetime of the same lock: performing a loading operation on the index file; and reading all columns contained in the incremental data from the index file and updating the column-level timestamps respectively corresponding to all the columns.
In the above solution, the updating module is further configured to, for any one of the columns contained in the incremental data, update the column-level timestamp corresponding to that column to be the same as the timestamp of the incremental data.
In the above solution, the writing module is further configured to write the incremental data to the corresponding addresses in the data storage system when the incremental data is provided by a plurality of data sources respectively and data corresponding to different columns in the data storage system is updated based on the incremental data.
In the above solution, the writing module is further configured to write the incremental data to the corresponding addresses in the data storage system when the incremental data is provided by a plurality of data sources respectively and data corresponding to the same column in the data storage system is updated based on the incremental data. The apparatus further includes an adding module, configured to add a unique corresponding tag to each key name for the different key names in the data storage system, and to record, in the index file, the column-level timestamp corresponding to each tag.
In the above solution, the query module is further configured to look up the corresponding storage address in the index file according to the key name, and to query the data storage system based on the storage address to acquire data corresponding to the key name.
An embodiment of the present application provides an electronic device, including:
a memory, configured to store executable instructions;
and a processor, configured to implement the data processing method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the data processing method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
The column-level timestamps corresponding to all columns contained in the incremental data are updated according to the timestamp of the incremental data, so that data can be filtered by comparing the timestamp of the data with the latest column-level timestamp of the column in which the data is located, which improves the efficiency of data filtering.
Drawings
FIG. 1 is a block diagram of an architecture of a data processing system according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a server provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a data processing method provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an application of a data processing method provided in an embodiment of the present application;
FIG. 6 is a diagram illustrating query results provided by the related art;
FIG. 7 is a schematic diagram of a query result provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the related art, data filtering is usually performed by marking the data to be deleted with a Delete flag so that the data is not returned when queried. In this approach, the new and old batches of data must first be joined on RowKey, and data that exists in the old batch but not in the new batch is marked with a Delete flag. However, the Join operation is time-consuming, which greatly reduces the efficiency with which the server subsequently responds to data query requests.
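The related-art approach can be pictured with the following minimal sketch. The function name `mark_deleted`, the dictionary layout, and the sample batches are assumptions made for illustration only; the point is that every key of the old batch must be matched against the new batch before any Delete flag can be set.

```python
def mark_deleted(old_batch, new_batch):
    """Return the row keys that exist in the old batch but not in the new one."""
    # The join on RowKey has to visit every key of the old batch.
    return {row_key for row_key in old_batch if row_key not in new_batch}

# Hypothetical batches keyed by RowKey.
old_batch = {"key1": {"col1": "V1"}, "key2": {"col1": "V4"}}
new_batch = {"key1": {"col1": "V7"}}

deleted = mark_deleted(old_batch, new_batch)
```

The cost of this step grows with the size of both batches, which is the overhead the embodiments below aim to avoid.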
In view of this, embodiments of the present application provide a data processing method, an apparatus, an electronic device, and a computer-readable storage medium, which can improve efficiency of data filtering, thereby ensuring efficiency of responding to a data query request.
The following describes an exemplary application of an electronic device that applies the data processing method provided in the embodiments of the present application. The electronic device may be implemented as any of various types of user terminals, such as a notebook computer or a desktop computer, or as a server, for example an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing cloud computing services. Next, an exemplary application in which the electronic device is implemented as a server is described with reference to FIG. 1.
Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a data processing system 100 according to an embodiment of the present application. To improve the efficiency of data filtering and thereby ensure the efficiency of responding to data query requests, the data processing system 100 includes a server 200, a network 300, a terminal 400, and a data storage system 500, which are described below.
The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 400 and the server 200 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited herein.
The server 200 is configured to receive real-time incremental data, store the received incremental data in the data storage system 500, and update, according to the timestamp of the received incremental data, the column-level timestamps corresponding to all columns contained in that data. The server 200 is further configured to receive a data query request sent by the terminal 400 through the network 300, and to query the data storage system 500 according to the key name carried in the data query request to obtain data corresponding to the key name. Finally, the server 200 compares the timestamp of the acquired data with the latest column-level timestamp of the column in which the data is located, and returns a corresponding query result to the terminal 400 according to the comparison result.
In some embodiments, the server 200 may be a background server for various internet applications, that is, the real-time incremental data received by the server 200 may be data generated during the running of various internet applications, such as user portrait data in a recommendation system, including information of age, sex, occupation, hobby, and the like of a user; or the operation record data of the user can also comprise information such as application lists installed by the user, advertisement sequences clicked by the user, keywords input by the user in a search engine and the like.
A network 300 for connecting the server 200 and the terminal 400, wherein the network 300 may be a wide area network or a local area network, or a combination thereof.
The terminal 400 is provided with a client 410, which is used for sending a data query request to the server 200 through the network 300 and for receiving a query result issued by the server 200 through the network 300.
The data storage system 500 is configured to store the real-time incremental data received by the server 200 and the latest column-level timestamps respectively corresponding to all columns contained in the incremental data.
The following describes the configuration of the server 200 in fig. 1. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 240, at least one network interface 220. The various components in server 200 are coupled together by a bus system 230. It is understood that the bus system 230 is used to enable connected communication between these components. The bus system 230 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 230 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 240 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 240 optionally includes one or more storage devices physically located remote from processor 210.
The memory 240 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 240 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 240 is capable of storing data, examples of which include programs, modules, and data structures, or subsets or supersets thereof, to support various operations, as exemplified below.
An operating system 241, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 242, configured to reach other computing devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including Bluetooth, Wireless Fidelity (WiFi), and Universal Serial Bus (USB).
In some embodiments, the data processing apparatus provided in the embodiments of the present application may be implemented in software. FIG. 2 shows the data processing apparatus 243 stored in the memory 240, which may be software in the form of programs and plug-ins and includes the following software modules: a receiving module 2431, a writing module 2432, an updating module 2433, a query module 2434, a comparison module 2435, and an adding module 2436. These modules are logical and can therefore be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules are explained below.
In other embodiments, the data processing apparatus provided in this embodiment may be implemented in hardware, and for example, the data processing apparatus provided in this embodiment may be a processor in the form of a hardware decoding processor, which is programmed to execute the data processing method provided in this embodiment, for example, the processor in the form of the hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
The data processing method provided by the embodiment of the present application will be described below with reference to an exemplary application when the electronic device provided by the embodiment of the present application is implemented as a server (for example, the server 200 shown in fig. 1).
In step S301, incremental data is received and written to the data storage system.
In some embodiments, when performing data update, not all the original data in the data storage system is updated, but only part of the original data is updated, that is, incremental update is performed with respect to all the original data stored in the data storage system, and therefore, after receiving incremental data (for example, real-time incremental data), the server writes the received incremental data into the data storage system to update part of the original data in the data storage system.
For example, referring to table 1, table 1 is a schematic table of data stored in the data storage system at time T1 provided in the embodiments of the present application. As shown in Table 1, at time T1, the data stored in the data storage system includes V1-V6. Taking the determination of the user characteristics as an example, when the real-time incremental data received by the server comes from the same data source, col1, col2, and col3 in table 1 may correspond to the age, gender, and hobbies of the user, respectively; keys 1 and 2 may correspond to the numbers of user A and user B, respectively.
TABLE 1 Schematic of data stored in the data storage system at time T1
        col1    col2    col3
key1    V1      V2      V3
key2    V4      V5      V6
Assuming that the server receives incremental data including V7-V8 and V11-V12 at time T2, the server writes the received incremental data into the data storage system to update the data shown in Table 1, obtaining the data stored at time T2 shown in Table 2. As shown in Table 2, the server updates V1 (key1, col1) to V7, V2 (key1, col2) to V8, V5 (key2, col2) to V11, and V6 (key2, col3) to V12, while V3 (key1, col3) and V4 (key2, col1) are not updated this time.
TABLE 2 Schematic of data stored in the data storage system at time T2
        col1    col2    col3
key1    V7      V8      V3
key2    V4      V11     V12
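The transition from Table 1 to Table 2 can be sketched as follows. This is only an illustration under stated assumptions: each cell is modeled as a `(value, timestamp)` pair, the times T1 and T2 are represented by the integers 1 and 2, and the function name `write_incremental` is hypothetical.

```python
T1, T2 = 1, 2  # assumed integer stand-ins for times T1 and T2

def write_incremental(table, incremental, batch_ts):
    """Overwrite only the cells present in the incremental data."""
    for key, cols in incremental.items():
        row = table.setdefault(key, {})
        for col, value in cols.items():
            row[col] = (value, batch_ts)  # each cell keeps its own timestamp

# State at time T1 (Table 1).
table = {
    "key1": {"col1": ("V1", T1), "col2": ("V2", T1), "col3": ("V3", T1)},
    "key2": {"col1": ("V4", T1), "col2": ("V5", T1), "col3": ("V6", T1)},
}
# Incremental data received at time T2.
incremental = {
    "key1": {"col1": "V7", "col2": "V8"},
    "key2": {"col2": "V11", "col3": "V12"},
}
write_incremental(table, incremental, T2)
```

After the call, V3 and V4 still carry timestamp T1, which is what later allows them to be recognized as not belonging to the newest batch.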
In some embodiments, after writing the received incremental data to the data storage system, the server may further perform the following operations: updating the corresponding index file according to the storage address of the incremental data in the data storage system, and recording all columns contained in the incremental data in the index file.
For example, in order to improve the efficiency of data retrieval, the original data stored in the data storage system may be read, and a corresponding index file may be generated according to the storage address of the original data, so that after the server writes the incremental data into the data storage system, the index file may also be updated according to the storage address of the incremental data in the data storage system. In addition, the server may additionally record all columns contained in the incremental data in the index file. Taking table 2 as an example, all columns included in the real-time incremental data V7-V8 and V11-V12 received by the server at time T2 include col1, col2 and col3, and therefore, the server additionally records col1, col2 and col3 in the index file.
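A minimal sketch of this index update follows. The in-memory dictionary standing in for the index file, the field names `addresses` and `columns`, and the sample addresses are all assumptions for illustration; the source does not specify the index file format.

```python
def update_index(index, incremental, addresses):
    """Record storage addresses and the columns touched by the batch."""
    index["addresses"].update(addresses)
    # Collect every column that appears anywhere in the incremental data.
    index["columns"] = sorted(
        {col for cols in incremental.values() for col in cols}
    )
    return index

index = {"addresses": {}, "columns": []}
incremental = {
    "key1": {"col1": "V7", "col2": "V8"},
    "key2": {"col2": "V11", "col3": "V12"},
}
# Hypothetical storage addresses of the written rows.
update_index(index, incremental, {"key1": 0, "key2": 64})
```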
In other embodiments, the real-time incremental data received by the server may also be provided separately from multiple data sources. For example, in the case of obtaining a user representation, the real-time incremental data received by the server may be provided separately by different social platforms. Taking table 1 as an example for illustration, key1 and key2 in table 1 may be accounts corresponding to the same user (e.g., user a) in two different social platforms, where key1 may be an account corresponding to user a being in a microblog, and key2 may be an account corresponding to user a being in a WeChat.
For example, when the incremental data received by the server is provided by a plurality of data sources respectively, and data corresponding to different columns in the data storage system is updated based on the incremental data (that is, the data sources update the same batch of key names (keys) in the data storage system, and the columns updated by different data sources do not intersect), the incremental data may be written to the corresponding addresses in the data storage system. In this case, because the columns updated by the different data sources do not intersect, loading data that becomes ready later cannot cause a query for data that became ready earlier to return nothing.
For example, when the incremental data received by the server is provided by a plurality of data sources respectively, and data corresponding to at least one same column in the data storage system is updated based on the incremental data (that is, the data sources update the same column in the data storage system, but update different batches of keys), the incremental data is written to the corresponding addresses in the data storage system. In this case, because the incremental data provided by the data sources updates the same column, loading the data that becomes ready later may cause a query for the data that became ready earlier to return nothing, since the column-level timestamp has already been advanced to the timestamp of the later data. To handle data that becomes ready at different times but updates the same column, the server may also perform the following operations: for the different key names in the data storage system, adding a unique corresponding tag (Tag) to each key name, and recording the column-level timestamp corresponding to each tag in the index file. That is, the server adds a unique tag to each batch of keys and maintains a separate group of column-level timestamps for each tag.
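The tag mechanism can be sketched as below, assuming one tag per data source, integer timestamps, and the hypothetical names `load_batch` and `is_current`. It illustrates the idea of keeping one group of column-level timestamps per tag and is not the patented implementation.

```python
from collections import defaultdict

# tag -> {column: latest column-level timestamp}; one group per tag.
column_ts = defaultdict(dict)
# key name -> tag of the batch that the key belongs to.
key_tag = {}

def load_batch(tag, keys, columns, batch_ts):
    """Tag every key of the batch and advance that tag's column timestamps."""
    for key in keys:
        key_tag[key] = tag
    for col in columns:
        column_ts[tag][col] = batch_ts

def is_current(key, col, data_ts):
    """Compare against the column timestamp of the key's own tag only."""
    return column_ts[key_tag[key]][col] <= data_ts

load_batch("source_a", ["key1"], ["col1"], batch_ts=2)  # ready early
load_batch("source_b", ["key2"], ["col1"], batch_ts=3)  # ready late

key1_visible = is_current("key1", "col1", data_ts=2)
key2_visible = is_current("key2", "col1", data_ts=3)
# A single shared col1 timestamp would be 3 and would hide key1's data.
hidden_without_tags = max(ts["col1"] for ts in column_ts.values()) > 2
```

With per-tag groups, the later load from `source_b` leaves the `col1` timestamp of `source_a` untouched, so the earlier data for key1 remains queryable.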
In step S302, the column-level time stamps corresponding to all columns included in the incremental data are updated based on the time stamp of the incremental data.
In some embodiments, the server updates the column-level timestamps corresponding to all columns included in the incremental data according to the timestamp of the incremental data, and may be implemented by: when the loading operation is executed for the index file, all columns contained in the currently received incremental data are read from the index file, and the column-level timestamps corresponding to all the columns respectively are updated according to the timestamp of the incremental data.
For example, still taking Table 1 and Table 2 above as an example, at time T1 the column-level timestamps corresponding to col1, col2, and col3 in Table 1 are all T1. At time T2, the server receives incremental data including V7-V8 and V11-V12; the columns contained in the incremental data are col1, col2, and col3, and the server records col1, col2, and col3 in the index file. Subsequently, when the server invokes the query process to load the index file, all columns contained in the current index are read from the index file and their column-level timestamps are updated. That is, according to the timestamp T2 of the incremental data V7-V8 and V11-V12, the server updates the column-level timestamps respectively corresponding to col1, col2, and col3 to T2.
In other embodiments, to ensure that the data can be correctly filtered, the server may also perform a loading operation of the index file and an updating operation of the column-level timestamp by: during the lifetime of the same lock, the following operations are performed: executing a loading operation aiming at the index file; and reading all columns contained in the incremental data from the index file, and updating the time stamps of the column levels respectively corresponding to all the columns.
For example, when the server invokes the query process to load the index file, the loading operation of the index file and the updating operation of the column-level timestamps may be performed under the same lock. This ensures that the column-level timestamps are updated at the same moment the index file is loaded, and therefore that data is filtered correctly. If the column-level timestamps were updated before the index file is loaded, data would be filtered out erroneously; if they were updated after the index file is loaded, data filtering would fail to take effect.
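One way to picture the shared lock is the following sketch, which uses Python's standard `threading.Lock`. The class name `LoadedIndex`, its fields, and the dictionary standing in for the index file are assumptions made for illustration.

```python
import threading

class LoadedIndex:
    """In-memory view of the index file plus the column-level timestamps."""

    def __init__(self):
        self._lock = threading.Lock()
        self.addresses = {}
        self.column_ts = {}

    def load(self, index_file, batch_ts):
        # Both steps run under one lock, so a reader never sees new
        # addresses with old timestamps, or the reverse.
        with self._lock:
            self.addresses = dict(index_file["addresses"])
            for col in index_file["columns"]:
                self.column_ts[col] = batch_ts

    def snapshot(self):
        with self._lock:
            return dict(self.addresses), dict(self.column_ts)

loaded = LoadedIndex()
loaded.load({"addresses": {"key1": 0}, "columns": ["col1", "col2"]}, batch_ts=2)
addresses, column_ts = loaded.snapshot()
```

Readers that take the same lock in `snapshot` observe either the state before the load or the state after it, never a mixture.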
In some embodiments, the server updates the column-level timestamp corresponding to each column contained in the incremental data as follows: for any one of the columns contained in the incremental data, the column-level timestamp corresponding to that column is updated to be the same as the timestamp of the incremental data.
For example, still taking table 2 as an example for explanation, assuming that the columns included in the incremental data received by the server at time T3 are col1 and col2, the server updates the column-level timestamps corresponding to col1 and col2 to T3 according to the timestamp T3 of the currently received incremental data, and the column-level timestamp corresponding to col3 is still T2.
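The T3 example above can be reproduced with a few lines. The function name `update_column_timestamps` is hypothetical, and the times T2 and T3 are represented by the integers 2 and 3 as an assumption.

```python
def update_column_timestamps(column_ts, columns, batch_ts):
    """Set the column-level timestamp of each touched column to batch_ts."""
    for col in columns:
        column_ts[col] = batch_ts
    return column_ts

# Column-level timestamps after the T2 batch.
column_ts = {"col1": 2, "col2": 2, "col3": 2}
# The T3 batch only contains col1 and col2.
update_column_timestamps(column_ts, ["col1", "col2"], batch_ts=3)
```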
In step S303, a data query request is received.
In some embodiments, when a user (e.g., an operator) needs to obtain data, a data query request may be sent to the server through the terminal, so that the server responds when receiving the data query request, and returns a corresponding query result to the terminal.
In step S304, the data storage system is queried according to the key name carried in the data query request to obtain data corresponding to the key name.
In some embodiments, the data query request sent by the user through the terminal may carry a key name, so that the server queries the data storage system according to the key name carried in the received data query request, thereby acquiring data corresponding to the key name.
For example, the server may query the data storage system according to the key name carried in the data query request as follows: the server looks up the corresponding storage address in the index file according to the key name, and then queries the data storage system based on that storage address to obtain the data corresponding to the key name.
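A minimal sketch of this two-step lookup is given below. Modeling the index file as a dictionary and the data storage system as a list addressed by position is an assumption for illustration; the names `query_by_key`, `index`, and `storage` and the sample values are hypothetical.

```python
def query_by_key(index, storage, key_name):
    """Look up the storage address for key_name in the index, then read the storage."""
    address = index.get(key_name)
    if address is None:
        return None  # the key name is not present in the index
    return storage[address]


# Each stored row maps a column to a (value, timestamp) pair.
index = {"key1": 0, "key2": 1}
storage = [
    {"col1": ("a", 2), "col2": ("b", 2), "col3": ("c", 1)},
    {"col1": ("d", 2)},
]
row = query_by_key(index, storage, "key1")
missing = query_by_key(index, storage, "key9")
```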
In step S305, the timestamp of the data is compared with the latest column-level timestamp of the column in which the data is located, and a corresponding query result is returned according to the comparison result.
In some embodiments, step S305 shown in fig. 3 may be implemented by step S3051 to step S3053 shown in fig. 4, which will be described in conjunction with the steps shown in fig. 4.
In step S3051, it is determined whether the latest timestamp at the column level of the column in which the data is located is greater than the timestamp of the data, and if so, step S3052 is performed; if not, step S3053 is performed.
In step S3052, a query result containing no data is returned.
In step S3053, a query result containing data is returned.
In some embodiments, after querying the data storage system for the data corresponding to the key name carried in the data query request, the server may further determine whether the latest column-level timestamp of the column in which the queried data is located is greater than the timestamp of the data. If it is greater, the queried data is not the latest data, and the server returns to the terminal a query result that does not contain the data. If the latest column-level timestamp is less than or equal to the timestamp of the data, the queried data is the latest data, and the server returns to the terminal a query result that contains the data.
For example, still taking table 2 above as an example, assume that the key name carried in the data query request sent by the user through the terminal is key1. The server queries the data storage system according to the received key name key1 and obtains V7, V8, and V3 as the data corresponding to key1. Next, the server compares the timestamps of V7, V8, and V3 with the latest column-level timestamps of their respective columns, i.e., it compares the timestamp of V7 with the latest column-level timestamp of col1, the timestamp of V8 with the latest column-level timestamp of col2, and the timestamp of V3 with the latest column-level timestamp of col3. The timestamp of V7 is T2, which is the same as the latest column-level timestamp T2 of col1; the timestamp of V8 is T2, which is the same as the latest column-level timestamp T2 of col2; the timestamp of V3 is T1, which is less than the latest column-level timestamp T2 of col3. The server therefore returns only the data V7 and V8 to the terminal, and does not return the data V3.
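The comparison of steps S3051 to S3053 applied to this example can be sketched as follows. The function name `filter_row` and the representation of a row as a mapping from column to a (value, timestamp) pair are assumptions for illustration; the integers 1 and 2 stand in for T1 and T2.

```python
def filter_row(row, column_ts):
    """Keep a value only if the column-level timestamp of its column
    is less than or equal to the value's own timestamp."""
    return {
        col: value
        for col, (value, ts) in row.items()
        if column_ts[col] <= ts
    }


T1, T2 = 1, 2  # integers standing in for times T1 and T2
row = {"col1": ("V7", T2), "col2": ("V8", T2), "col3": ("V3", T1)}
column_ts = {"col1": T2, "col2": T2, "col3": T2}
result = filter_row(row, column_ts)
```

Here `result` contains V7 and V8 only; V3 is dropped because the column-level timestamp of col3 (T2) is greater than the timestamp of V3 (T1).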
For incremental data written to the data storage system in batches, the embodiments of the present application update, according to the timestamp of the incremental data, the column-level timestamps corresponding to all columns contained in the incremental data. When a data query request is subsequently received, the timestamp of the data corresponding to the key name carried in the request can be compared directly with the latest column-level timestamp of the column in which the data is located, and a corresponding query result is returned according to the comparison result.
In other embodiments, the data processing method provided in the embodiments of the present application may also be implemented in combination with blockchain technology.
A blockchain is a storage structure of encrypted, chained transactions formed from blocks. It is essentially a shared database, and the data or information stored in it is unforgeable, traceable, and collectively maintained.
For example, referring to fig. 5, fig. 5 is a schematic diagram of an application of the data processing method provided in the embodiments of the present application. It includes a blockchain network 600 (consensus nodes 610-1 to 610-3 are shown as examples), a certificate authority 700, and business entities 800/900, which are described below.
The type of the blockchain network 600 is flexible and may be, for example, any of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, the electronic devices of any business entity (e.g., the server 200 and the terminal 400 in fig. 1) can access the blockchain network 600 without authorization and become client nodes. Taking a consortium chain as an example, a business entity can, after being authorized, connect the electronic devices under its jurisdiction to the blockchain network 600 as client nodes.
As an example, when the blockchain network 600 is a consortium chain, each business entity 800/900 registers with the certificate authority 700 to obtain a digital certificate. The digital certificate includes the public key of the business entity and a digital signature issued by the certificate authority 700 over that public key and the identity information of the business entity. The certificate is attached to a transaction (e.g., incremental data to be put on the chain, or a data query request) together with the business entity's digital signature for the transaction, and is sent to the blockchain network 600. The blockchain network 600 can then extract the digital certificate and the digital signature from the transaction, verify the authenticity of the transaction (i.e., that it has not been tampered with) and the identity of the business entity that sent it, and check, based on that identity, whether the business entity has the right to initiate the transaction.
In some embodiments, the client node may act only as an observer of the blockchain network 600, i.e., it supports the business entity in initiating transactions, while the functions of the consensus nodes 610 of the blockchain network 600, such as the ordering function, the consensus service, and the ledger function, may be implemented by the client node by default or selectively (e.g., depending on the specific business requirements of the business entity). In this way, the data and business processing logic of the business entity can be migrated to the blockchain network 600 to the greatest extent, and the blockchain network 600 makes the data and the business processing trustworthy and traceable.
The consensus nodes in the blockchain network 600 receive transactions submitted by the client nodes of different business entities (e.g., the business entities 800/900 shown in fig. 5) and execute the transactions to update or query the ledger. The intermediate or final results of executing the transactions may be returned to the client nodes of the business entities for display.
An exemplary application of a blockchain network is described below, taking as an example the server uploading received incremental data to the blockchain network for storage, and referring to fig. 5, a client node 810 in fig. 5 may correspond to the server 200 in fig. 1.
First, logic for putting incremental data on the chain is set at the client node 810. For example, when real-time incremental data is received, the client node 810 sends the received incremental data to the blockchain network 600 and generates a corresponding transaction. The transaction includes the smart contract that needs to be called in order to put the incremental data on the chain and to update, according to the timestamp of the incremental data, the column-level timestamps corresponding to all columns contained in the incremental data, together with the parameters passed to the smart contract. The transaction also includes the digital certificate of the client node 810 and a signed digital signature. The client node 810 broadcasts the transaction to the consensus nodes 610 in the blockchain network 600.
Then, when a consensus node 610 in the blockchain network 600 receives the transaction, it verifies the digital certificate and the digital signature carried in the transaction. After this verification succeeds, it determines, according to the identity of the business entity 800 carried in the transaction, whether the business entity 800 has the right to initiate the transaction. A failure of either the digital signature verification or the permission check causes the transaction to fail. After verification succeeds, the consensus node 610 appends its own digital signature (e.g., by encrypting the digest of the transaction with the private key of node 610-1) and continues to broadcast the transaction in the blockchain network 600.
Finally, after a consensus node 610 in the blockchain network 600 receives a successfully verified transaction, it fills the transaction into a new block and broadcasts the block. When a consensus node 610 receives the broadcast new block, it verifies the new block, for example by checking whether the digital signatures of the transactions in the new block are valid. If verification succeeds, the node appends the new block to the tail of the blockchain that it stores, and executes the transactions in the new block, updating the state database according to the results: for a submitted transaction that stores incremental data, key-value pairs containing the incremental data are added to the state database; for a submitted transaction that performs the update operation, the smart contract is called to update, according to the timestamp of the incremental data, the column-level timestamps corresponding to all columns contained in the incremental data.
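The append-and-execute step can be sketched in simplified form as follows. This sketch replaces the digital-signature verification described above with a plain SHA-256 hash-link check, purely to keep the example self-contained; it is not the verification scheme of the application. The names `make_block`, `append_block`, `state_db`, and the transaction layout are hypothetical.

```python
import hashlib
import json


def block_hash(block):
    """Hash the block contents (previous hash plus transactions)."""
    payload = json.dumps(
        {"prev_hash": block["prev_hash"], "transactions": block["transactions"]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_block(prev_hash, transactions):
    block = {"prev_hash": prev_hash, "transactions": transactions}
    block["hash"] = block_hash(block)
    return block


def append_block(chain, state_db, column_ts, block):
    """Verify the block, append it to the chain, then execute its transactions."""
    expected_prev = chain[-1]["hash"] if chain else "0" * 64
    if block["prev_hash"] != expected_prev or block["hash"] != block_hash(block):
        return False  # verification failed: the block is not appended
    chain.append(block)
    for tx in block["transactions"]:
        for col, value in tx["values"].items():
            # Store the incremental data as key-value pairs in the state database.
            state_db[(tx["key"], col)] = (value, tx["timestamp"])
            # Update the column-level timestamp to the timestamp of the incremental data.
            column_ts[col] = tx["timestamp"]
    return True


chain, state_db, column_ts = [], {}, {}
block = make_block("0" * 64, [{"key": "key1", "values": {"col1": "V7"}, "timestamp": 2}])
ok = append_block(chain, state_db, column_ts, block)
rejected = append_block(chain, state_db, column_ts, make_block("f" * 64, []))
```

A block whose previous-hash field does not match the tail of the locally stored chain is rejected and leaves the state database unchanged.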
An exemplary application of the blockchain network is described by taking an example that the terminal sends a data query request to the blockchain network. Referring to fig. 5, the client node 910 in fig. 5 may correspond to the terminal 400 in fig. 1.
In some embodiments, the types of data that the client node 910 can query in the blockchain network 600 may be controlled by the consensus node 610 restricting the permissions of the transactions that the client node of the business entity can initiate. When the client node 910 has permission to initiate a data query, the client node 910 may generate a transaction for querying data and submit it to the blockchain network 600, where the data query request carries a key name, so that the consensus node 610 executes the transaction to query the data corresponding to the key name from the state database. The blockchain network 600 then calls a smart contract to compare the timestamp of the queried data with the latest column-level timestamp of the column in which the data is located, and returns a corresponding query result to the client node 910 according to the comparison result.
The following continues with an exemplary structure in which the data processing device 243 provided by the embodiments of the present application is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the data processing device 243 stored in the memory 240 may include: a receiving module 2431, a writing module 2432, an updating module 2433, a querying module 2434, a comparing module 2435, and an adding module 2436.
The receiving module 2431 is configured to receive incremental data. The writing module 2432 is configured to write the incremental data to the data storage system. The updating module 2433 is configured to update, according to the timestamp of the incremental data, the column-level timestamps respectively corresponding to all columns contained in the incremental data. The receiving module 2431 is further configured to receive a data query request. The querying module 2434 is configured to query the data storage system according to the key name carried in the data query request to obtain data corresponding to the key name. The comparing module 2435 is configured to compare the timestamp of the data with the latest column-level timestamp of the column in which the data is located, and to return a corresponding query result according to the comparison result.
In some embodiments, the comparing module 2435 is further configured to return a query result that does not contain the data when the latest column-level timestamp of the column in which the data is located is greater than the timestamp of the data, and to return a query result that contains the data when the latest column-level timestamp of the column in which the data is located is less than or equal to the timestamp of the data.
In some embodiments, the updating module 2433 is further configured to update the corresponding index file according to the storage address of the incremental data in the data storage system, and record all columns included in the incremental data in the index file.
In some embodiments, the updating module 2433 is further configured to, when a load operation is performed on the index file, read all columns included in the incremental data from the index file, and perform an update operation on the column-level timestamps corresponding to all columns, respectively.
In some embodiments, update module 2433 is further configured to perform the following operations during the lifetime of the same lock: executing a loading operation aiming at the index file; and reading all columns contained in the incremental data from the index file, and updating the time stamps of the column levels respectively corresponding to all the columns.
In some embodiments, the updating module 2433 is further configured to perform the following for each of the columns contained in the incremental data: updating the column-level timestamp corresponding to that column to be the same as the timestamp of the incremental data.
In some embodiments, the writing module 2432 is further configured to write the incremental data at a corresponding address in the data storage system when the incremental data are respectively provided by a plurality of data sources and the data corresponding to different columns in the data storage system are respectively updated based on the incremental data.
In some embodiments, the writing module 2432 is further configured to, when the incremental data are respectively provided by a plurality of data sources and the corresponding data of the same column in the data storage system is updated based on the incremental data, write the incremental data at a corresponding address in the data storage system; the data processing apparatus 243 further comprises an adding module 2436 for adding a unique corresponding label for each key name for different key names in the data storage system; and recording a time stamp of a column level corresponding to the tag in the index file according to the tag.
In some embodiments, the query module 2434 is further configured to query the index file for a corresponding storage address according to the key name, and query the data storage system based on the storage address to obtain data corresponding to the key name.
It should be noted that the description of the apparatus in the embodiments of the present application is similar to the description of the method embodiments and has similar beneficial effects, and is therefore not repeated. Technical details not exhaustively described for the data processing device provided by the embodiments of the present application can be understood from the description of any one of figs. 3-4.
An exemplary application of the embodiment of the present application in an actual application scenario is described below by taking an information recommendation scenario as an example.
For example, the server 200 in fig. 1 may be a background server of the recommendation system, and is configured to periodically update the portrait data of the registered user, recall information that matches the latest portrait data of the registered user from the information to be recommended according to the latest portrait data of the registered user, and then perform sorting and pushing of the information.
For example, the user's portrait data may be published on a daily basis, so that data from multiple data sources may wait up to 24 hours to be published (new data that arrives just after the previous publication has started may need to wait 24 hours). In each batch, the data from all data sources, whether updated or not, is merged together to form a full index. The index is published by full replacement of the whole table: the index of the new batch directly replaces the index of the old batch, so data deletion does not need to be considered.
However, it is desirable to index and publish the user's portrait data as soon as it is provided. This immediate mode of data publishing is called on-demand publishing. Under on-demand publishing, each data source typically updates only individual columns for all keys, i.e., the update is incremental with respect to the whole table. With incremental updates, the deletion of data that exists in the old batch but not in the new batch must additionally be handled.
For example, referring to fig. 6, with incremental updates, a query may return user portrait data from old batches. In this case, because the user portrait data received by the server includes some historical user portrait data, the information that the server recalls from the information to be recommended does not match the user's latest portrait data, and the accuracy of subsequent information recommendation is poor. The old-batch user portrait data refers to V3 and V4 (i.e., data of time T1) and V8 and V12 (i.e., data of time T2) in fig. 6. At the current time T3, the data updated at T3 is the latest version of the user portrait data, but some cells are empty at T3 (i.e., not all of the user portrait data has been updated), which indicates that the data does not exist in the latest version and needs to be deleted when a query result is returned. However, an incremental update carries no such information unless a Delete flag is marked on each empty cell to indicate that the cell is empty and no data should be returned. Marking all empty cells is a time-consuming operation that can significantly reduce the efficiency of data filtering.
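The stale-data problem described above can be reproduced with a small sketch. The table contents below are hypothetical and do not reproduce fig. 6; the function name `write_batch` and the cell values are assumptions for illustration.

```python
def write_batch(table, batch, ts):
    """Incrementally write cells; cells absent from the batch are left untouched."""
    for cell, value in batch.items():
        table[cell] = (value, ts)


table = {}  # (key, column) -> (value, timestamp)
write_batch(table, {("key1", "col1"): "a1", ("key2", "col1"): "b1"}, 1)
write_batch(table, {("key1", "col1"): "a2"}, 2)  # key2 has no col1 value in the new batch

# A naive read still returns the old-batch value for key2.
stale = table[("key2", "col1")]

# Cells that a Delete-flag approach would have to mark one by one.
keys = {k for k, _ in table}
empty_cells = [(k, "col1") for k in sorted(keys) if table[(k, "col1")][1] < 2]
```

The list `empty_cells` grows with the number of keys missing from each new batch, which is why marking every empty cell is costly compared with keeping a single timestamp per column.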
In view of this, the embodiments of the present application provide a data processing method that, for incremental data written to a data storage system in batches, filters out deleted data by comparing the timestamp of the data with a global timestamp (i.e., the column-level timestamp of the column in which the data is located), thereby improving the efficiency of the deletion operation. The data processing method provided in the embodiments of the present application is described in detail below.
For example, each piece of data carries a timestamp, and all data of the same batch carries the same timestamp, which is the ready time of the batch. In the on-demand publishing scheme, all columns contained in the data of a batch need to be additionally recorded in the meta of the index file. Here, meta refers to metadata at the index-file level; the existing metadata includes the size, the name, and the total number of data entries of the index file. In the embodiments of the present application, all columns contained in the data of the batch and the column-level timestamp corresponding to each column are additionally recorded in the meta of the index file and are maintained by the query process. When the server calls the query process to load the index file, it reads the columns contained in the current index from the meta of the index file and updates the column-level timestamps of those columns. Subsequently, when the server receives a data query request, it compares the timestamp of the data corresponding to the request with the column-level timestamp of the column in which the data is located. If the column-level timestamp is greater than the timestamp of the data, the data is not contained in the latest update and is removed from the returned query result; if the column-level timestamp is less than or equal to the timestamp of the data, the data is contained in the latest update and is returned.
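One possible shape of such index-file metadata is sketched below. The JSON encoding, the field names (`name`, `size`, `count`, `column_timestamps`), and the function `on_index_loaded` are assumptions for illustration; the application does not specify a format for meta.

```python
import json

# Hypothetical meta of one index file: the existing fields plus the
# columns of the batch and their column-level timestamps.
meta_text = json.dumps({
    "name": "index_0003",
    "size": 4096,
    "count": 2,
    "column_timestamps": {"col1": 3, "col2": 3},
})


def on_index_loaded(meta_text, column_ts):
    """Read the columns recorded in meta and update their column-level timestamps."""
    meta = json.loads(meta_text)
    for col, ts in meta["column_timestamps"].items():
        column_ts[col] = ts
    return column_ts


# col3 was last updated by an earlier batch and is not touched by this load.
column_ts = on_index_loaded(meta_text, {"col3": 2})
```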
Therefore, on the one hand, the data received by the server contains only the latest user portrait data, so the server can recall information matching the latest user portrait data from the information to be recommended and then sort and recommend the recalled information, which improves the accuracy of information recommendation. On the other hand, data filtering is achieved simply by comparing the timestamp of the data with the column-level timestamp of the column in which the data is located, which also improves the efficiency of data filtering.
In other embodiments, the update of the column-level timestamp and the loading of the index file may be in the same lock when the index file is loaded, so that the update time of the column-level timestamp and the loading time of the index file can be kept consistent to ensure that the data can be correctly filtered. This is because if the column-level timestamps are updated earlier than the index file is loaded, then the data may be erroneously filtered; data filtering may be rendered ineffective if the column-level timestamps are updated later than the index file load.
In some embodiments, the same data table may contain multiple types of keys, for example QQ, WUID, IMEI, and IDFA. Data corresponding to different keys may update the same columns in the data table yet be provided by different data sources, i.e., the data corresponding to different keys has different ready times. For example, both QQ and WUID may update the age and gender columns, but the QQ data and the WUID data become ready separately. In this case, loading the data that becomes ready later may prevent the data that became ready earlier from being returned at query time, because the column-level timestamp has already been updated to the later time. To handle the case where the ready times differ but the same columns are updated, the server can distinguish the data by adding a unique corresponding Tag to each key. If multiple data sources update the same key, the columns updated by those data sources must not intersect. If multiple data sources update the same columns in the data table but update keys of different batches, the server adds a unique corresponding Tag to the keys of each batch, and the query process maintains a separate group of column-level timestamps for each Tag.
For example, as shown in fig. 7, when multiple data sources update the same columns in the data table but update keys of different batches, a unique corresponding Tag is added to the keys of each batch: Tag1 is added to key1 and Tag2 is added to key2. In actual use, keys of the same batch are grouped by key type, so the Tag is recorded as the key type. In this way, because each Tag maintains its own group of column-level timestamps, when a query result is returned the server compares the timestamp of the data with the column-level timestamp corresponding to the Tag rather than with a single timestamp for the whole column. This avoids the situation in which loading later-ready data prevents earlier-ready data from being returned at query time.
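The per-Tag timestamps can be sketched as follows, using the key types QQ and WUID as Tags. The names `tag_column_ts`, `load_batch`, and `visible`, and the integer ready times, are assumptions for illustration.

```python
from collections import defaultdict

# One group of column-level timestamps per Tag (here, per key type).
tag_column_ts = defaultdict(dict)  # tag -> {column -> column-level timestamp}


def load_batch(tag, columns, ts):
    """Record the ready time of a batch for the given Tag only."""
    for col in columns:
        tag_column_ts[tag][col] = ts


def visible(tag, col, data_ts):
    """Data is returned if the Tag's column-level timestamp is not greater than its own.
    A column with no recorded timestamp for this Tag is treated as visible."""
    return tag_column_ts[tag].get(col, data_ts) <= data_ts


load_batch("QQ", ["age", "gender"], 1)    # QQ data becomes ready at time 1
load_batch("WUID", ["age", "gender"], 2)  # WUID data becomes ready later, at time 2

qq_visible = visible("QQ", "age", 1)

# With a single timestamp per column, the later WUID load would hide the QQ data.
single_ts = max(ts_by_col["age"] for ts_by_col in tag_column_ts.values())
qq_visible_single = single_ts <= 1
```

In this sketch the QQ data written at time 1 remains visible under its own Tag, whereas a single shared timestamp for the age column would have filtered it out.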
For incremental data written to the data storage system in batches, the embodiments of the present application compare the timestamp of the data with the global timestamp. Data filtering can therefore be achieved using only the data of the new batch plus a small amount of extra information (i.e., the timestamp of the data and the column-level timestamp of the column in which the data is located), without relying on the data of the old batch, which greatly improves the efficiency of data filtering.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to execute a data processing method provided by embodiments of the present application, for example, a data processing method as shown in fig. 3 or fig. 4.
In some embodiments, the storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts stored in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiments of the present application have the following beneficial effects:
For incremental data written to the data storage system in batches, the column-level timestamps corresponding to all columns contained in the incremental data are updated according to the timestamp of the incremental data. When a data query request is subsequently received, the timestamp of the data corresponding to the key name carried in the request can be compared directly with the latest column-level timestamp of the column in which the data is located, and a corresponding query result is returned according to the comparison result.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A method of data processing, the method comprising:
receiving incremental data and writing the incremental data into a data storage system;
updating the time stamps of the column levels respectively corresponding to all columns contained in the incremental data according to the time stamps of the incremental data;
receiving a data query request;
inquiring the data storage system according to the key name carried by the data inquiry request to acquire data corresponding to the key name;
and comparing the timestamp of the data with the latest timestamp of the column level of the column in which the data is positioned, and returning a corresponding query result according to the comparison result.
2. The method of claim 1, wherein returning the corresponding query result according to the comparison result comprises:
when the latest timestamp of the column level of the column in which the data is located is larger than the timestamp of the data, returning a query result which does not contain the data;
and when the latest timestamp of the column level of the column where the data is located is less than or equal to the timestamp of the data, returning a query result containing the data.
3. The method of claim 1, wherein after writing the incremental data to a data storage system, the method further comprises:
and updating the corresponding index file according to the storage address of the incremental data in the data storage system, and recording all columns contained in the incremental data in the index file.
4. The method according to claim 3, wherein the updating the column-level time stamps corresponding to all columns included in the incremental data respectively comprises:
when a loading operation is executed for the index file, reading all columns contained in the incremental data from the index file, and performing an updating operation on the timestamps of the column levels respectively corresponding to all the columns.
5. The method according to claim 4, wherein when a load operation is performed on the index file, reading all columns included in the incremental data from the index file, and performing an update operation on the column-level timestamps corresponding to all the columns, respectively, includes:
during the lifetime of the same lock, the following operations are performed:
executing a loading operation aiming at the index file;
and reading all columns contained in the incremental data from the index file, and updating the timestamps of the column levels respectively corresponding to all the columns.
6. The method according to claim 1, wherein the updating the column-level time stamps corresponding to all columns included in the incremental data respectively comprises:
for any column of all columns contained in the incremental data, performing the following processing:
and updating the time stamp of the column level corresponding to any column to be the same as the time stamp of the incremental data.
7. The method of claim 1, wherein writing the incremental data to a data storage system comprises:
when the incremental data are respectively provided by a plurality of data sources and respectively corresponding data of different columns in the data storage system are updated based on the incremental data, the incremental data are written in corresponding addresses in the data storage system.
8. The method of claim 1, wherein writing the incremental data to a data storage system comprises:
when the incremental data are respectively provided by a plurality of data sources and the data corresponding to the same column in the data storage system are updated based on the incremental data, writing the incremental data in the corresponding address in the data storage system;
after writing the delta data, the method further comprises:
adding a unique corresponding label for each key name aiming at different key names in the data storage system;
and recording a column-level time stamp corresponding to the label in an index file according to the label.
9. The method according to any one of claims 1 to 8, wherein the obtaining data corresponding to the key name comprises:
and inquiring a corresponding storage address in an index file according to the key name, and inquiring the data storage system based on the storage address to acquire data corresponding to the key name.
10. A data processing apparatus, characterized in that the apparatus comprises:
a receiving module, configured to receive incremental data;
a writing module, configured to write the incremental data to a data storage system;
an updating module, configured to update, according to the timestamp of the incremental data, the column-level timestamps respectively corresponding to all columns contained in the incremental data;
the receiving module being further configured to receive a data query request;
a query module, configured to query the data storage system according to the key name carried in the data query request, so as to obtain data corresponding to the key name;
and a comparison module, configured to compare the timestamp of the data with the latest column-level timestamp of the column in which the data is located, and to return a corresponding query result according to the comparison result.
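The modules of claim 10 can be pictured together in one small in-memory sketch. The class `ColumnStore`, its method names, and the data layout are illustrative assumptions. In particular, the claims in this section only say that a result is returned "according to the comparison result"; the rule used here, that data older than its column's latest timestamp is treated as expired and nothing is returned, is an assumption of the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class ColumnStore:
    # (key name, column) -> (value, timestamp of the increment that wrote it)
    cells: Dict[Tuple[str, str], Tuple[Any, int]] = field(default_factory=dict)
    # column -> latest column-level timestamp
    column_ts: Dict[str, int] = field(default_factory=dict)

    def write_increment(self, rows: Dict[str, Dict[str, Any]], ts: int) -> None:
        """Writing and updating modules: store the data, then stamp its columns."""
        for key, columns in rows.items():
            for column, value in columns.items():
                self.cells[(key, column)] = (value, ts)
                self.column_ts[column] = ts

    def query(self, key: str, column: str) -> Optional[Any]:
        """Query and comparison modules: fetch by key name, then compare timestamps."""
        cell = self.cells.get((key, column))
        if cell is None:
            return None
        value, data_ts = cell
        # Assumed rule: data written before the column's latest increment is stale.
        if data_ts < self.column_ts.get(column, data_ts):
            return None
        return value


store = ColumnStore()
store.write_increment({"u1": {"city": "Shenzhen"}, "u2": {"city": "Beijing"}}, ts=100)
store.write_increment({"u1": {"city": "Shanghai"}}, ts=200)
```

Under the assumed rule, `u2`'s `city` value stops being returned after the second increment without ever being physically deleted, because its timestamp now lags the column-level timestamp.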
CN202011137195.6A 2020-10-22 2020-10-22 Data processing method and device Pending CN112181921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011137195.6A CN112181921A (en) 2020-10-22 2020-10-22 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011137195.6A CN112181921A (en) 2020-10-22 2020-10-22 Data processing method and device

Publications (1)

Publication Number Publication Date
CN112181921A true CN112181921A (en) 2021-01-05

Family

ID=73923252

Family Applications (1)

Application Number Priority Date Filing Date Title
CN202011137195.6A Pending CN112181921A (en) 2020-10-22 2020-10-22 Data processing method and device

Country Status (1)

Country Link
CN (1) CN112181921A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113138989A (en) * 2021-03-12 2021-07-20 莘上信息技术(上海)有限公司 Block chain data retrieval method and device
CN114638543A (en) * 2022-04-12 2022-06-17 中国工商银行股份有限公司 Document auditing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US9305002B2 (en) Method and apparatus for eventually consistent delete in a distributed data store
US8555018B1 (en) Techniques for storing data
CN110377649B (en) Construction and query methods, devices, equipment and storage medium of tagged data
US20090234880A1 (en) Remote storage and management of binary object data
US20090112870A1 (en) Management of distributed storage
CN111506592B (en) Database upgrading method and device
CN109739828B (en) Data processing method and device and computer readable storage medium
US9081784B2 (en) Delta indexing method for hierarchy file storage
US9665732B2 (en) Secure Download from internet marketplace
CN112035471B (en) Transaction processing method and computer equipment
CN112181921A (en) Data processing method and device
WO2016169237A1 (en) Data processing method and device
CN114064073A (en) Software version upgrading method and device, computer equipment and storage medium
US10318330B2 (en) Data-persisting temporary virtual machine environments
CN117453810A (en) Heterogeneous data processing method, heterogeneous data processing device, computer equipment and storage medium
CN108256019A (en) Database key generation method, device, equipment and its storage medium
CN111104408A (en) Data exchange method and device based on map data and storage medium
CN113094753B (en) Big data platform hive data modification method and system based on block chain
US7536398B2 (en) On-line organization of data sets
CN114691653A (en) Account set migration method and device, computer equipment and storage medium
CN115185946A (en) Multi-tenant system, multi-tenant management method, computer device, and storage medium
CN114281873A (en) Verifiable search method for medical block chain data
CN113094754A (en) Big data platform data modification system and modification, response, cache and verification method
CN113672640A (en) Data query method and device, computer equipment and storage medium
CN116302206B (en) Presto data source hot loading method based on MQ

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination