CN113392127A

CN113392127A - Data management method and device

Info

Publication number: CN113392127A
Application number: CN202110075597.6A
Authority: CN
Inventors: 苏福钦
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2021-09-14

Abstract

The embodiment of the application relates to the technical field of computers, in particular to a data management method and device, which are used for solving the problems of high task storage cost and heavy execution burden after task modification. The method comprises the following steps: selecting at least one target object from a target object storage unit according to a data storage rule corresponding to the target object storage unit and a storage time point of the object stored in the target object storage unit, wherein the data storage rule at least comprises the maximum storage time length of the object; the data storage rule is used for comparing with the attribute information of the object when the object is stored so as to determine an object storage unit of the object; generating at least one task according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object; and performing data processing on the at least one target object according to the at least one task.

Description

Data management method and device

Technical Field

The embodiment of the application relates to the technical field of computers, in particular to a data management method and device.

Background

At present, the demand for data keeps increasing, and most of that data is unstructured. In the future, unstructured data will grow exponentially. With this explosion of unstructured data, conventional SAN (storage area network) and NAS (network attached storage) architectures cannot cope with the resulting problems. Owing to its flat structure and strong scalability, object storage has become the best solution for storing unstructured data. Object storage replaces conventional SAN and NAS storage and greatly improves the efficiency of accessing and storing unstructured data. Object storage has all the advantages of distributed storage, namely flexible scalability and metadata management; through its version management function, object storage also effectively guards against logic errors caused by manual operation.

In the related art, a task is generally created and stored according to the execution rule and execution time of an object at the time the object is uploaded. When objects expire, the system polls for expired tasks and executes them one by one. Because the expiration time of the object is fixed in the index when the object is uploaded and stored, the storage cost is high; and if the execution rule is changed, a large number of duplicate tasks are left waiting to be executed, which increases the execution burden of the system.

Disclosure of Invention

The embodiment of the application provides a data management method and device, and aims to solve the problems that task storage cost is high and execution burden is heavy after task modification.

The application provides a data management method, which comprises the following steps:

selecting at least one target object from a target object storage unit according to a data storage rule corresponding to the target object storage unit and a storage time point of the object stored in the target object storage unit, wherein the data storage rule at least comprises the maximum storage time length of the object; the data storage rule is used for comparing with the attribute information of the object when the object is stored so as to determine an object storage unit of the object;

generating at least one task according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object; the target object is uploaded and then stored in the target object storage unit, and the at least one task is generated when the time length stored in the target object storage unit is longer than the maximum storage time length;

and performing data processing on the at least one target object according to the at least one task.

In another aspect, the present application provides a data management apparatus, including:

the screening unit is used for selecting at least one target object from the target object storage unit according to a data storage rule corresponding to the target object storage unit and a storage time point of the object stored in the target object storage unit, wherein the data storage rule at least comprises the maximum storage time length of the object; the data storage rule is used for comparing with the attribute information of the object when the object is stored so as to determine an object storage unit of the object;

the creating unit is used for generating at least one task according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object; the target object is uploaded and then stored in the target object storage unit, and the at least one task is generated when the time length stored in the target object storage unit is longer than the maximum storage time length;

and the execution unit is used for executing data processing on the at least one target object according to the at least one task.

Optionally, the apparatus further includes a matching unit, configured to:

receiving an uploaded object and determining attribute information of the object;

matching the attribute information of the object with the data storage rule of the target object storage unit;

and if the attribute information of the object is successfully matched with the data storage rule of the target object storage unit, storing the object into the target object storage unit.

Optionally, the matching unit is specifically configured to:

respectively setting corresponding data storage rules for each object storage unit, and respectively converting each obtained data storage rule into corresponding unit feature vectors;

converting the attribute information of the object into a corresponding object feature vector;

matching the object feature vector with a unit feature vector corresponding to the target object storage unit;

and determining that the object feature vector is the same as the unit feature vector corresponding to the target object storage unit.
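The feature-vector matching described above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that a data storage rule and an object's attribute information are both encoded as tuples over the same ordered fields; the field names, the bucket names, and the `to_vector` and `match_bucket` helpers are illustrative and are not specified in the application.

```python
from typing import Dict, Optional, Tuple

# Hypothetical ordered fields used to build feature vectors.
FIELDS = ("user_id", "prefix", "tag")

def to_vector(attrs: Dict[str, str]) -> Tuple[str, ...]:
    """Convert rule or object attributes into a fixed-order feature vector."""
    return tuple(attrs.get(field, "") for field in FIELDS)

def match_bucket(obj_attrs: Dict[str, str],
                 bucket_rules: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Return the first bucket whose unit vector equals the object vector."""
    obj_vec = to_vector(obj_attrs)
    for name, rule in bucket_rules.items():
        if to_vector(rule) == obj_vec:
            return name
    return None  # no bucket rule matches the object

buckets = {
    "logs-7d": {"user_id": "u1", "prefix": "logs/", "tag": "tmp"},
    "img-30d": {"user_id": "u1", "prefix": "img/", "tag": "keep"},
}
matched = match_bucket({"user_id": "u1", "prefix": "img/", "tag": "keep"}, buckets)
```

Here `matched` is the bucket whose rule vector is identical to the object's vector; an object that matches no rule yields `None`.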

Optionally, the apparatus further includes a storage unit, configured to:

determining, according to the attribute information of the at least one target object, a task storage partition for the corresponding at least one task in a preset task storage unit, wherein the task storage unit comprises a plurality of task storage partitions;

and storing the at least one task in a corresponding task storage partition.
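As a rough illustration of mapping tasks to task storage partitions, the sketch below hashes an object's attributes to a partition index. The partitioning key (bucket name plus object key), the partition count, and the in-memory `task_store` are all assumptions made for the example; the application only states that the partition is determined from the attribute information of the target object.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Dict, List

NUM_PARTITIONS = 4  # assumed number of task storage partitions

def partition_for(bucket: str, object_key: str) -> int:
    """Map an object's attributes to a stable partition index."""
    digest = hashlib.sha256(f"{bucket}/{object_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# In-memory stand-in for the task storage unit: partition index -> tasks.
task_store: DefaultDict[int, List[Dict[str, str]]] = defaultdict(list)

def store_task(task: Dict[str, str]) -> int:
    """Store a task in its partition and return the partition index."""
    idx = partition_for(task["bucket"], task["key"])
    task_store[idx].append(task)
    return idx

idx = store_task({"bucket": "logs-7d", "key": "a.log", "action": "delete"})
```

Because the index is derived only from the object's attributes, the same object always lands in the same partition.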

Optionally, the task storage partition is provided with a status flag;

the execution unit is further configured to:

obtaining the at least one task from at least one task storage partition marked as incomplete;

and after determining that all tasks in the at least one task storage partition have been executed, marking the at least one task storage partition as completed.

Optionally, the execution unit is further configured to:

based on at least one created execution process, acquiring at least one task in a free state from the task storage partition marked as incomplete, in order of the generation time of each task; wherein one execution process corresponds to one task;

changing the state of the corresponding at least one task from a free state to a locked state by the at least one executing process.

Optionally, the execution unit is further configured to:

adding lease time information in a state of the corresponding at least one task by the at least one executing process;

determining that a task is converted from the locked state to the free state after the lease time information corresponding to that task expires;

and acquiring that task through an execution process other than the at least one execution process, and changing the state of that task to the locked state.
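The free/locked states and the lease time can be sketched as follows. This is a single-process illustration under stated assumptions: the `Task` fields, the `acquire` helper, and the use of plain timestamps are hypothetical, and a real deployment would need an atomic compare-and-set in the task store so that two workers cannot lock the same task.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Task:
    task_id: int
    created_at: float
    owner: Optional[str] = None   # worker holding the lock, if any
    lease_expires: float = 0.0    # the lock is valid until this timestamp

    def is_free(self, now: float) -> bool:
        """A task is free if it was never locked or its lease has expired."""
        return self.owner is None or now >= self.lease_expires

def acquire(tasks: List[Task], worker: str,
            lease_seconds: float, now: float) -> Optional[Task]:
    """Lock the oldest free task for `worker`; return None if none is free."""
    for task in sorted(tasks, key=lambda t: t.created_at):
        if task.is_free(now):
            task.owner = worker
            task.lease_expires = now + lease_seconds
            return task
    return None

tasks = [Task(1, created_at=10.0), Task(2, created_at=5.0)]
first = acquire(tasks, "w1", 30, now=100.0)    # oldest task, locked by w1
second = acquire(tasks, "w2", 30, now=100.0)   # remaining task, locked by w2
blocked = acquire(tasks, "w3", 30, now=110.0)  # both leases still valid
retaken = acquire(tasks, "w3", 30, now=131.0)  # w1's lease expired at 130
```

If worker `w1` fails without finishing, its task becomes free again once the lease expires, and another worker (`w3` here) can take it over.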

Optionally, the execution unit is further configured to:

determining task concurrency of object storage units associated with each task in the at least one task storage partition;

and selecting the at least one task from the tasks associated with the target object storage unit with the task concurrency lower than the set threshold.
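The concurrency check can be sketched in Python as below. The task dictionaries, the `pick_tasks` helper, and the choice to count a newly picked task against its bucket immediately are assumptions for illustration; the application only requires that tasks be selected from buckets whose task concurrency is below the set threshold.

```python
from collections import Counter
from typing import Dict, List

def pick_tasks(pending: List[Dict], running: List[Dict], threshold: int) -> List[Dict]:
    """Pick pending tasks whose bucket has fewer than `threshold` tasks in flight."""
    in_flight = Counter(task["bucket"] for task in running)
    picked = []
    for task in pending:
        if in_flight[task["bucket"]] < threshold:
            picked.append(task)
            in_flight[task["bucket"]] += 1  # the picked task now counts as running
    return picked

running = [{"bucket": "a"}, {"bucket": "a"}]
pending = [
    {"bucket": "a", "id": 1},
    {"bucket": "b", "id": 2},
    {"bucket": "b", "id": 3},
    {"bucket": "b", "id": 4},
]
picked = pick_tasks(pending, running, threshold=2)
```

With a threshold of 2, bucket `a` is already saturated and bucket `b` accepts only two more tasks, so a single busy bucket cannot monopolize the workers.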

In another aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data management method described above.

In another aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program executable on the processor, and when the computer program is executed by the processor, the processor implements the data management method described above.

In the embodiment of the application, each object storage unit is preset with a data storage rule. When a user uploads an object, the attribute information of the object is compared with the data storage rule to determine the object storage unit for the object, and the object is stored in that unit; the data storage rule at least comprises the maximum storage duration of the object. The target object storage unit may be any object storage unit. When the storage of at least one target object falls due, the at least one target object is selected from the target object storage unit according to the data storage rule corresponding to the target object storage unit and the storage time point of each object stored in it. At least one task is then generated according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object, and data processing is performed on the at least one target object according to the at least one task. Because no task is created when the object is uploaded, and a task is generated only when the target object's storage falls due and the object needs to be processed, task storage space is saved. Because the rules are applied only at execution time, a modified rule takes effect immediately, and the problem of persisting tasks in advance is avoided.

Drawings

Fig. 1 is a process diagram of a data management method in the related art;

Fig. 2 is a process diagram of another data management method in the related art;

Fig. 3 is a process diagram of a data management method according to an embodiment of the present application;

Fig. 4 is a schematic diagram of an application architecture of a data management method according to an embodiment of the present application;

Fig. 5 is a flowchart of a data management method according to an embodiment of the present application;

Fig. 6 is a diagram illustrating a task storage partition provided by an embodiment of the present application;

Fig. 7 is a diagram illustrating a plurality of task processes obtaining tasks from an incomplete task storage partition according to an embodiment of the present application;

Fig. 8 is a schematic diagram illustrating other task processes continuing to execute a task after the task fails to be executed, according to an embodiment of the present application;

Fig. 9 is a graph of the total number of tasks on the day and the task concurrency threshold of the object storage unit according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of the overall architecture of a data management system according to an embodiment of the present application;

Fig. 11 is a diagram illustrating a specific task execution process according to an embodiment of the present application;

Fig. 12 is a schematic structural diagram of a data management apparatus according to an embodiment of the present application;

Fig. 13 is a block diagram of a physical architecture of a computer device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For the purpose of facilitating an understanding of the embodiments of the present application, a brief introduction of several concepts is provided below:

Life cycle system: a system that automatically manages data according to time, including deleting data when it falls due and moving data to colder storage when it falls due.

DB (DoggaByte, data storage unit): a data storage unit in computing, 1 DB = 1024⁸ GB. In the embodiment of the application, the DB is used for storing task data of the object.

CSV (Comma-Separated Values, character-separated values): a file that stores tabular data (numbers and text) in plain text form. Plain text means that the file is a sequence of characters and contains no data that must be interpreted as binary. A CSV file consists of any number of records separated by a line break character; each record consists of fields, and the separators between fields are other characters or strings, most commonly commas or tabs. Typically, all records have the same sequence of fields.

Bucket: the name for a storage space in a MOS (Management Operating System); it is a container for storing objects. Object storage is a very flat storage mode: the objects stored in a bucket are all at the same logical level, unlike a file system, which has a multi-level file structure. In MOS, bucket names are globally unique. Each bucket generates a default bucket ACL (Access Control List) when it is created, and each entry of the bucket ACL specifies which rights are granted to an authorized user, such as READ rights (READ), WRITE rights (WRITE), FULL control rights (FULL Control), and so on. A user can operate on a bucket, for example creating it, deleting it, displaying it, or setting the bucket ACL, only if the user has the corresponding rights to the bucket.

Cloud computing: a computing model that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". To the user, resources in the "cloud" appear infinitely expandable, available at any time, obtained on demand, and paid for according to use.

A basic capability provider of cloud computing establishes a cloud computing resource pool (referred to as an Infrastructure as a Service (IaaS) platform for short), and multiple types of virtual resources are deployed in the resource pool for external clients to select and use.

According to the logic function division, a Platform as a Service (PaaS) layer can be deployed on the IaaS layer, a Software as a Service (SaaS) layer is deployed on the PaaS layer, and the SaaS layer can be directly deployed on the IaaS layer. PaaS is a platform on which software runs, such as a database, a web container, etc. SaaS is a variety of business software, such as web portal, sms, and mass texting. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.

As society develops, human social activities generate a large amount of data whose value can be fully exploited only through analysis and processing. At present, most data is unstructured, and with the explosive growth of unstructured data, object storage has become the best solution for storing it. Object storage, also called object-based storage, is a general term for methods of addressing and processing discrete units referred to as objects. Like a file, an object contains data; unlike files, objects are not organized in a hierarchy. Every object sits at the same level of a flat address space called a storage pool, and no object is nested beneath another object.

In one related-art scheme, when an object is uploaded, an expiration time is written into the object index; the Time To Live (TTL) function of a DB, or a scan by a bypass system, is then used to select the objects to be processed. The specific structure is shown in Fig. 1. Since the expiration time of the object is fixed in the index when the object is uploaded, if the user rule is changed, the TTL of the existing data becomes invalid and is difficult to modify.

In another scheme, shown in Fig. 2, when an object is uploaded, a task is generated in time order from the execution time of the object, the operation to be executed, and the like, and is written into an independent DB. A Worker (execution process) of the life cycle system polls for expired tasks and executes them one by one. The number of tasks in this scheme is very large, and for buckets with life cycle rules the index storage cost consumed per object increases; when a rule changes, a large number of duplicate tasks are left waiting to be executed, which greatly increases the burden on the Worker.

In view of the above, in the embodiment of the present application, as shown in Fig. 3, when a user uploads an object, the object is only stored and no task is established. Specifically, objects are stored in buckets according to data storage rules: each bucket is preset with a data storage rule and a data processing rule, and an object is assigned to the corresponding bucket through rule matching. The data storage rule at least comprises the maximum storage duration of the object, so when the storage of a target object falls due, the target object is selected from the bucket according to the data storage rule corresponding to the bucket and the storage time of the object in the bucket. A task is then established for the target object according to the information to be executed corresponding to the target object and the data processing rule corresponding to the bucket, and data processing is performed. Because no task is established when the object is uploaded, task storage space is saved; because the rules are applied to generate tasks only at execution time, a modified rule takes effect immediately, and the problem of persisting tasks in advance is avoided.

Preferred embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Fig. 4 is a schematic diagram of an application architecture of the data management method in the embodiment of the present application, including a server 100 and a terminal device 200.

The terminal device 200 may be a mobile or a fixed electronic device. For example, a mobile phone, a tablet computer, a notebook computer, a desktop computer, various wearable devices, a smart television, a vehicle-mounted device, or other electronic devices capable of implementing the above functions may be used. The terminal device 200 may upload an object to the server 100 and receive feedback from the server 100.

The terminal device 200 and the server 100 can be connected via a network to communicate with each other. Optionally, the network uses standard communication techniques and/or protocols. The network is typically the Internet, but can be any network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or any combination of mobile, wireline or wireless networks, private networks or virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.

The server 100 may provide various network services for the terminal device 200, and the server 100 may perform information processing using a cloud computing technology. The server 100 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

Specifically, the server 100 may include a processor 110 (CPU), a memory 120, an input device 130, an output device 140, and the like, the input device 130 may include a keyboard, a mouse, a touch screen, and the like, and the output device 140 may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like.

Memory 120 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides processor 110 with program instructions and data stored in memory 120. In the embodiment of the present invention, the memory 120 may be used to store a program of the data management method in the embodiment of the present invention.

The processor 110 is configured to execute the steps of any of the data management methods according to the embodiments of the present invention according to the obtained program instructions by calling the program instructions stored in the memory 120.

In addition, the application architecture diagram in the embodiment of the present invention is intended to illustrate the technical solution more clearly and does not limit the technical solution provided in the embodiment of the present invention. The technical solution is not limited to any particular service application and is equally applicable to similar problems in other application architectures and service applications.

The various embodiments of the present invention are schematically illustrated as applied to the application architecture diagram shown in fig. 4.

Fig. 5 shows a flowchart of a data management method according to an embodiment of the present application. As shown in fig. 5, the method includes the steps of:

Step 501: selecting at least one target object from the target object storage unit according to the data storage rule corresponding to the target object storage unit and the storage time point of the object stored in the target object storage unit.

Wherein the data storage rule at least comprises the maximum storage duration of the object; the data storage rule is used to compare with the attribute information of the object when the object is stored to determine an object storage unit of the object.

Specifically, a plurality of object storage units, namely buckets, are arranged in the server. Each object storage unit is preset with corresponding data storage rules and data processing rules; the rules include user ID, object prefix, operation mode, storage days, object tag, and so on. The data storage rules and data processing rules differ between object storage units: either all of the rules differ, or some of the rules are the same and the rest differ.

After receiving an object uploaded by the user, the server stores the object in the corresponding object storage unit according to the data storage rule and the data processing rule. The data storage rule at least comprises the maximum storage duration of the object, which bounds how long the object is stored: once the object's storage falls due, the object is screened out and a task is established for it.

Generally, for convenience of operation, target objects may be acquired from the target object storage unit at a set frequency; for example, when object screening is performed at a predetermined time point each day, a plurality of target objects are acquired from the target object storage unit at a time. Further, since the server holds a plurality of object storage units, the screening covers all of the object storage units.

For example, 10 buckets are provided in the server, and their maximum object storage durations are set to 5 days, 7 days, 10 days, and so on. After an object is uploaded, it is stored in the corresponding bucket according to the maximum storage duration. At 00:00 each day, for each bucket, all objects that fall due on that day are screened out of the bucket according to the bucket's maximum storage duration and the storage time point of each object in the bucket, and are taken as target objects. In one possible scenario, 350 target objects are screened out from the first bucket, 120 from the second bucket, and so on, up to 200 from the tenth bucket.
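The daily screening can be sketched in Python as follows. The object records, the `select_expired` helper, and the choice to treat an object as due once its age reaches the maximum storage duration are assumptions for illustration. The point of the sketch is that expiry is computed at scan time from the bucket's current rule rather than being fixed at upload.

```python
from datetime import datetime, timedelta
from typing import Dict, List

def select_expired(objects: List[Dict], max_days: int, now: datetime) -> List[Dict]:
    """Return objects stored at or before the cutoff implied by the current rule."""
    cutoff = now - timedelta(days=max_days)
    return [obj for obj in objects if obj["stored_at"] <= cutoff]

now = datetime(2021, 1, 20)  # the daily scan time
bucket = {
    "max_days": 7,  # the bucket's current maximum storage duration
    "objects": [
        {"key": "a", "stored_at": datetime(2021, 1, 10)},
        {"key": "b", "stored_at": datetime(2021, 1, 13)},
        {"key": "c", "stored_at": datetime(2021, 1, 18)},
    ],
}
targets = select_expired(bucket["objects"], bucket["max_days"], now)

# If the rule is shortened to 1 day, the next scan applies it immediately,
# without rewriting any per-object expiry stored at upload time.
targets_after_change = select_expired(bucket["objects"], 1, now)
```

With the 7-day rule, objects `a` and `b` are selected; after the rule is changed to 1 day, all three objects are selected on the next scan.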

Step 502: generating at least one task according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object. The target object is stored in the target object storage unit after being uploaded, and the at least one task is generated when the length of time the target object has been stored in the target object storage unit exceeds the maximum storage duration.

Specifically, an object is assigned to a target object storage unit according to both the data storage rule and the data processing rule of that unit; that is, an object stored in the target object storage unit matches not only the data storage rule of the unit but also its data processing rule. Therefore, when the storage of a target object falls due, the target object is screened out from the target object storage unit, and the task for the target object is generated according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the target object.

In a specific implementation process, generally, object screening is performed on a plurality of object storage units, and the number of target objects screened from the same object storage unit is also multiple, that is, multiple target objects can be screened in the same batch. And generating a corresponding task for each target object according to the information to be executed of the target object and the data processing rule of the corresponding target object storage unit.
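A minimal sketch of per-object task generation is given below. The shape of a task, the `action` field of the processing rule, and the use of the object key as the information to be executed are all hypothetical; the application does not define these structures.

```python
from typing import Dict, List

def build_tasks(bucket: str, processing_rule: Dict[str, str],
                targets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Create one task per target object from the bucket's current processing rule."""
    return [
        {
            "bucket": bucket,
            "key": target["key"],                 # assumed to-be-executed information
            "action": processing_rule["action"],  # read from the rule at generation time
        }
        for target in targets
    ]

rule = {"action": "delete"}
tasks = build_tasks("logs-7d", rule, [{"key": "a.log"}, {"key": "b.log"}])
```

Since the rule is read when the tasks are built, no stale tasks exist that would have to be rewritten after a rule change.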

Step 503: data processing is performed on at least one target object according to at least one task.

In a specific implementation process, one target object corresponds to one task, and the tasks of different target objects may be the same or different, for example deleting the target object, or performing a read-write operation on the target object. In addition, tasks in the embodiment of the application can be processed serially, that is, a single process handles all the tasks in sequence; in parallel, that is, for a given number of tasks the same number of processes are created and run concurrently; or in a combination of the two, that is, a second number of processes are created for a first number of tasks, each process handles several tasks in sequence, the tasks within a single process are processed serially, and the tasks of different processes are processed in parallel.
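The combined mode can be sketched as follows. For brevity the sketch uses threads from Python's standard `concurrent.futures` module in place of the execution processes described in the text, and the round-robin split and the `run_task` placeholder are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def run_task(task: Dict[str, str]) -> str:
    """Placeholder for the real data processing, e.g. deleting the object."""
    return f"{task['action']}:{task['key']}"

def run_chunk(chunk: List[Dict[str, str]]) -> List[str]:
    """Tasks assigned to one worker are processed serially, in order."""
    return [run_task(task) for task in chunk]

def run_all(tasks: List[Dict[str, str]], workers: int) -> List[List[str]]:
    """Split tasks round-robin across workers; the chunks run in parallel."""
    chunks = [tasks[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_chunk, chunks))

tasks = [{"action": "delete", "key": f"k{i}"} for i in range(5)]
results = run_all(tasks, workers=2)
```

Setting `workers=1` gives purely serial processing, and setting `workers` equal to the number of tasks gives one worker per task.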

In the embodiment of the application, each object storage unit is preset with a data storage rule, and an object uploaded by a user is stored in an object storage unit according to the data storage rule, which at least sets the maximum storage duration of the object. The target object storage unit may be any object storage unit, and at least one target object is selected from it according to the data storage rule corresponding to the target object storage unit. At least one task is generated according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object, and data processing is performed on the at least one target object according to the at least one task. Because no task is created when the object is uploaded, and a task is generated only when the target object's storage falls due and the object needs to be processed, task storage space is saved. Because the rules are applied only at execution time, a modified rule takes effect immediately, and the problem of persisting tasks in advance is avoided.

The following specifically describes a processing scheme of an object in the embodiment of the present application, starting from the object uploading.

In the embodiment of the application, the task is not constructed immediately after the object is uploaded, and the object is directly stored in the corresponding target object storage unit. Before at least one target object is selected from the object storage unit according to the data storage rule corresponding to the target object storage unit, the method further comprises the following steps:

receiving the uploaded object and determining the attribute information of the object;

In a specific implementation process, a plurality of object storage units are arranged in the server, and each object storage unit is respectively provided with a data storage rule and a data processing rule. The data storage rule and the data processing rule of the same object storage unit may or may not coincide, that is, the data storage rule of the object storage unit may include the data processing rule, or the data storage rule does not include the data processing rule.

For example, the target object storage unit is provided with 50 rules, of which 25 are data storage rules and the other 25 are data processing rules. When a target object is stored, its attribute information is matched against the 25 data storage rules of the target object storage unit, and if all of them match, the target object is stored in the target object storage unit. In addition, since every object stored in the target object storage unit is subject to the data processing rules of that unit, when a target object in the unit reaches its storage period, a task is generated according to the 25 data processing rules and the to-be-executed information of the target object, and data processing is performed according to the task.

In another embodiment, the target object storage unit is provided with 50 rules, of which 30 are data storage rules, and 10 of the data storage rules are also data processing rules. The remaining 20 rules are data processing rules only, so the number of data processing rules is 30. That is, when the target object is stored, its attribute information needs to be matched against the 30 data storage rules. The 10 shared rules are matched when the target object is stored and are also used when the task of the target object is established.

In another embodiment, the target object storage unit is provided with 50 rules, all of which are data storage rules, and 25 of the data storage rules are also data processing rules, that is, the number of data processing rules is 25. Thus, when a target object is stored, its attribute information needs to be matched against all 50 data storage rules. The 25 data processing rules are matched when the target object is stored and are also used when the task of the target object is established.

After receiving the target object uploaded by the user, the server matches the target object against the object storage units arranged in the server one by one. For example, the server is provided with 10 object storage units. The target object is first matched with the data storage rule of the first object storage unit, and if the matching is successful, the target object is stored in the first object storage unit. If the matching is unsuccessful, the target object is matched with the data storage rule of the second object storage unit, and if that matching is successful, the target object is stored in the second object storage unit. If it is again unsuccessful, the target object is matched with the data storage rule of the third object storage unit, and so on. If the target object matches the data storage rule of none of the object storage units, the target object may be regarded as an erroneous object; it is not stored, and feedback is given to the user.
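The sequential matching described above can be sketched as follows. Representing each data storage rule as a Python predicate over the attribute information, and the unit names used here, are assumptions for illustration.

```python
def place_object(attrs, units):
    """Return the name of the first unit whose storage rules all match, else None.

    `units` is an ordered list of (unit_name, storage_rules) pairs, where each
    rule is a predicate over the object's attribute information.
    """
    for unit_name, storage_rules in units:
        if all(rule(attrs) for rule in storage_rules):
            return unit_name
    return None  # no unit matched: treat as an erroneous object and report back


# Hypothetical units and rules, for illustration only.
units = [
    ("unit-1", [lambda a: a["name"].startswith("tmp"), lambda a: "e" in a["tags"]]),
    ("unit-2", [lambda a: a["name"].startswith("logs/")]),
]
```

A `None` result corresponds to the case in which the object is not stored and the user is notified.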

Further, because each object storage unit is configured with many rules, matching them directly puts heavy processing pressure on the system. The rules set in the object storage units are therefore formatted as vectors, and matching a vector against a vector set reduces CPU usage and matching time.

Before receiving the uploaded object, the method further comprises:

matching the attribute information of the object with the data storage rule of the target object storage unit, comprising:

matching the object feature vector with a unit feature vector corresponding to a target object storage unit;

determining that the attribute information of the object is successfully matched with the data storage rule of the object storage unit, including:

In a specific implementation, 1000 rules can be configured for one bucket, and a data storage rule comprises various conditions and operations such as Id, Prefix, Action, Days, and Tags. The attribute information of each object, in turn, includes information such as the object name (Objname), modification time (ModifyTime), tags (Tags), and version (Version). Matching such complex data storage rules against the attribute information of objects consumes a great deal of CPU.

Therefore, in the embodiment of the application, both the attribute information of the object and the data storage rule of the object storage unit are formatted: the data storage rule of the object storage unit is formatted into a unit feature vector, and the attribute information of each object is formatted into an object feature vector. Matching between attribute information and data storage rules is thereby replaced by matching between vectors, which reduces CPU usage and matching time. Further, the data storage rule of an object storage unit can set several rules for the same attribute. For example, the data storage rule of a certain object storage unit specifies the prefix and tag of an object as follows: the prefix is null and the tag is any one of a, b, c, d, and e; the prefix is tmp and the tag is e or f; the prefix is tmp2 and the tag is g. Each of these rules is converted into a unit feature vector, and together the vectors form a vector set. If the object feature vector of an object matches any unit feature vector in the vector set, the object feature vector matches the vector set, that is, the object matches the object storage unit.

Specifically, the vectorization process of the data storage rule of the object storage unit is as follows:

The first step: initialize the unit feature vector mp1.

The second step: assuming that the data storage rule contains a prefix and a tag, add anchor points, such as 000000010010101, to mp1_prefix and mp1_tag.

The third step: add anchor points for mp2, where the format of mp2 is: prefix anchor -> rules corresponding to the anchor. The unit feature vector value of the data storage rule is thereby obtained.

The fourth step: the final vector value of the data storage rule is obtained and is denoted roleinfo.

The following examples are given.

The data storage rule of the object storage unit is as follows:

rule 1 -> prefix: null; tag: a, b, c, d, e;

rule 2 -> prefix: tmp; tag: e, f;

rule 3 -> prefix: tmp; tag: e, f;

rule 4 -> prefix: tmp2; tag: g.

The following anchor maps are generated according to the data storage rules:

mp1_prefix: Null:1, Tmp:2, Tmp2:3;

mp1_tag: A:4, B:5, C:6, D:7, E:8, F:9, G:10.

Then a vector is generated for each rule:

rule 1: 1+4+5+6+7+8;

rule 2: 2+8+9;

rule 3: 2+8+9;

rule 4: 3+10.

Then the vector prefix of each rule is generated:

rule 1: 1;

rule 2: 2;

rule 3: 2;

rule 4: 3.

The unit feature vector mp_rule is further generated:

1 (vector prefix) -> rule 1

2 (vector prefix) -> rule 2, rule 3

3 (vector prefix) -> rule 4

Finally, the rules are placed in the memory (cache) of the bucket.

It should be noted that, in the unit feature vector generated for a data storage rule, all unused bits are 0, and the position of the first 1 is called the vector prefix. When rules are matched, the vector prefixes are compared first; only if the vector prefixes match are the corresponding vectors fully compared to obtain the matching rule. In the above embodiment, if the vector prefix of the object is 2, only rule 2 and rule 3 need to be fully compared. That is, only one of the three prefix groups is examined, and the matching workload is reduced by 2/3.
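The prefix-first matching can be sketched in Python using the anchor numbers of the example above as bit positions. Encoding each vector as an integer bitmask, and treating a tag condition as satisfied when the object carries any one of the rule's tags, are assumptions of this sketch rather than details stated in the application.

```python
# Anchor numbers taken from the example above; None stands for the null prefix.
PREFIX_ANCHORS = {None: 1, "tmp": 2, "tmp2": 3}
TAG_ANCHORS = {"a": 4, "b": 5, "c": 6, "d": 7, "e": 8, "f": 9, "g": 10}

RULES = {
    "rule 1": (None, "abcde"),
    "rule 2": ("tmp", "ef"),
    "rule 3": ("tmp", "ef"),
    "rule 4": ("tmp2", "g"),
}


def to_vector(anchors):
    """Set one bit per anchor number."""
    vector = 0
    for anchor in anchors:
        vector |= 1 << anchor
    return vector


def build_index(rules):
    """Group rules by vector prefix: prefix anchor -> [(rule name, tag bits)]."""
    index = {}
    for name, (prefix, tags) in rules.items():
        tag_bits = to_vector(TAG_ANCHORS[t] for t in tags)
        index.setdefault(PREFIX_ANCHORS[prefix], []).append((name, tag_bits))
    return index


def match(index, prefix, tags):
    """Compare the vector prefix first, then the tag bits of that group only."""
    prefix_anchor = PREFIX_ANCHORS.get(prefix)
    object_bits = to_vector(TAG_ANCHORS[t] for t in tags if t in TAG_ANCHORS)
    return [name for name, tag_bits in index.get(prefix_anchor, [])
            if object_bits & tag_bits]
```

For an object with prefix tmp and tag e, `match(build_index(RULES), "tmp", ["e"])` inspects only the group for vector prefix 2 and returns rule 2 and rule 3.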

Further, in order to improve task execution efficiency, the storage layout of the tasks is optimized and the task DB is partitioned. After the at least one task is generated according to the data processing rule corresponding to the object storage unit and the information to be executed corresponding to the at least one target object, and before the data processing is performed on the at least one target object according to the at least one task, the method further includes:

determining, according to the attribute information of the at least one target object, the task storage partition corresponding to the at least one task in a preset task storage unit, wherein the task storage unit comprises a plurality of task storage partitions;

at least one task is stored in a corresponding task storage partition.

In a specific implementation, tasks are stored in task storage units. Generally, the storage space of a task storage unit is a set value, and one task storage unit can store a plurality of tasks. To improve efficiency when task processes execute the tasks in a task storage unit, each task storage unit is partitioned into a plurality of task storage partitions. When a task is stored, a hash may, for example, be calculated over the attribute information of the target object corresponding to the task to determine the corresponding task storage partition, and the task is then stored in that partition.

For example, FIG. 6 shows a schematic diagram of task storage partitions. As shown in fig. 6, three task storage units, task DB1, task DB2, and task DB3, are provided. Each task DB is divided into three task storage partitions, and each task storage partition may store a plurality of tasks. It should be noted that the number of task storage partitions into which different task DBs are divided may be the same or different.
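A minimal sketch of hash-based partition selection follows. The choice of SHA-256 and the use of the object name as the hashed attribute are assumptions; the application only states that a hash of the attribute information may be used.

```python
import hashlib


def partition_for(object_name: str, num_partitions: int) -> int:
    """Map an object's attribute (here, its name) to a task storage partition."""
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the
    # number of partitions in the task storage unit.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

A stable hash is used here instead of Python's built-in `hash`, whose value for strings varies between interpreter runs, so that the same object always maps to the same partition.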

In the embodiment of the application, task processes (workers) can be created to execute the tasks. Further, each task storage partition is provided with a status flag indicating whether the tasks in the partition have all been executed. A task process selects only from task storage partitions whose status flag is incomplete, which prevents tasks from being executed repeatedly.

Specifically, before performing data processing on at least one target object according to at least one task, the method further includes:

obtaining at least one task from at least one task storage partition marked as incomplete;

after performing data processing on at least one target object according to at least one task, the method further includes:

and after it is determined that the tasks in the at least one task storage partition have all been executed, marking the at least one task storage partition as completed.

In a specific implementation, each of a plurality of task processes randomly selects a task storage partition from all task storage partitions whose status flag is incomplete, which reduces contention among multiple task processes for the same task. Meanwhile, partitions whose tasks have been executed are marked as completed, and task processes select only from the incomplete task storage partitions, which prevents tasks from being executed repeatedly. FIG. 7 is a diagram illustrating a plurality of task processes obtaining tasks from an incomplete task storage partition. As shown in fig. 7, of the three task storage partitions of task DB3, only the lowest one is incomplete, and all 4 task processes (workers) in the figure acquire tasks from that task storage partition.

Further, in order to prevent a task from being executed repeatedly, after a task process preempts a task, the task process in the embodiment of the present application also locks the task so that other task processes do not execute it again. The above acquiring at least one task from at least one task storage partition marked as incomplete includes:

changing the state of the corresponding at least one task from the free state to the locked state by the at least one executing process.

In a specific implementation, as shown in fig. 7, tasks are stored in the corresponding task storage partition in order of generation time. For example, in the first task storage partition of task DB1, if task A is generated first, task B second, and task C third, then task A, task B, and task C are stored in that order. After the task process worker1 randomly selects this task storage partition, it selects task A for execution according to the storage positions or the generation times of task A, task B, and task C; worker1 then corresponds to task A and changes the state of task A from the free state to the locked state. Suppose the task process worker2 then randomly selects the same task storage partition. Although task A is still the earliest generated task, its locked state indicates that it is already being executed by another task process, so worker2 selects from the remaining tasks, that is, it preempts task B and likewise changes the state of task B from the free state to the locked state.
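The preemption of tasks in generation order can be sketched as below. An in-process `threading.Lock` stands in for whatever atomic state update the task DB provides; that substitution, and the class and field names, are assumptions for illustration.

```python
import threading


class TaskPartition:
    """Tasks are kept in generation order; each has a free or locked state."""

    def __init__(self, task_names):
        self._guard = threading.Lock()
        self.tasks = [{"name": n, "state": "free", "owner": None}
                      for n in task_names]

    def claim(self, worker):
        """Lock and return the earliest generated free task, or None."""
        with self._guard:  # makes the check-and-set of the state atomic
            for task in self.tasks:
                if task["state"] == "free":
                    task["state"] = "locked"
                    task["owner"] = worker
                    return task["name"]
        return None
```

With a partition holding tasks A, B, and C, a first call by worker1 yields task A and a subsequent call by worker2 yields task B, as in the example above.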

Further, when an abnormal task process leaves a task unfinished, other task processes can preempt the unfinished task again and continue executing it. Specifically, after acquiring at least one task in a free state from a task storage partition marked as incomplete according to the generation time of each task, and before performing data processing on at least one target object according to the at least one task, the method further includes:

adding lease time information in a state of the corresponding at least one task through at least one execution process;

performing data processing on at least one target object according to at least one task, further comprising:

determining that, after the lease time information corresponding to any task expires, the task changes from the locked state to the free state;

and acquiring the task through an execution process other than the at least one execution process, and changing the state of the task to the locked state.

In a specific implementation, a task process may execute a task successfully or may fail. When a task fails because the task process is abnormal, another task process needs to continue executing the task. To meet this requirement, lease time information is set for each task in the embodiment of the application. Generally, the lease time is longer than the time a normal task process needs to process the task, so a task process is expected to finish the task within the lease time. If the lease time expires and the task has not finished, the executing task process is considered abnormal and the task execution failed, and the task automatically changes from the locked state to the free state. Once the task is in the free state, other task processes can preempt it again and continue executing it.

Fig. 8 is a schematic diagram showing a task process continuing to execute a task after the task has failed. As shown by the solid arrows in fig. 8, during the first round of task execution, the task process worker1 selects a task in task storage partition 1, the task process worker2 selects a task in task storage partition 2, and the task process worker3 selects a task in task storage partition 4. The task process worker2 executes task D, changes the state of task D to the locked state, and sets a lease time for task D. The task process worker2 becomes abnormal while executing task D, so task D fails, and after the lease time of task D expires, the state of task D changes from the locked state to the free state. After the task process worker1 finishes executing the task in task storage partition 1, it selects task storage partition 2, as indicated by the dashed arrow in fig. 8. Since task D is in the free state at this time, the task process worker1 preempts task D and changes its state from the free state to the locked state.

Thus, the state of a task, including the lease time and the task object offset, is continuously updated during execution. When a task is left unfinished because its task process is abnormal, other task processes can preempt the task again after its lease time expires and continue executing it from the recorded offset, so that the task is completed smoothly.
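The lease and offset handling can be illustrated with the following sketch. Time is passed in explicitly so the behaviour is easy to follow; renewing the lease on every progress update is one reading of "continuously modify the state" and is an assumption, as are the class and method names.

```python
class LeasedTask:
    def __init__(self, name, lease_seconds):
        self.name = name
        self.lease_seconds = lease_seconds
        self.owner = None
        self.lease_expires_at = 0.0
        self.offset = 0  # how many objects of the task have been processed

    def is_free(self, now):
        """A task is free if never claimed or if its lease has expired."""
        return self.owner is None or now >= self.lease_expires_at

    def try_acquire(self, worker, now):
        """Lock the task for `worker` if it is free; keep the stored offset."""
        if not self.is_free(now):
            return False
        self.owner = worker
        self.lease_expires_at = now + self.lease_seconds
        return True

    def record_progress(self, worker, offset, now):
        """Store the offset and renew the lease; fails if the lease was lost."""
        if self.owner != worker or now >= self.lease_expires_at:
            return False
        self.offset = offset
        self.lease_expires_at = now + self.lease_seconds
        return True
```

If worker2 stops updating task D, the lease eventually expires, worker1 acquires the task, and processing can resume from `offset` rather than from the beginning.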

Further, since the number of objects processed by the system is very large, multi-level rate control is applied in the embodiment of the present application to prevent overload. Specifically, the acquiring at least one task from at least one task storage partition marked as incomplete includes:

determining task concurrency of object storage units associated with each task in at least one task storage partition;

and selecting at least one task from the tasks associated with the target object storage unit with the task concurrency lower than the set threshold.

In a specific implementation process, a task concurrency threshold is set for the object storage unit, and the threshold is calculated as follows:

$\mathrm{autoThreshold} = 2 \times \left(\sqrt{x} + \log_2 x\right)$ …… Formula 1

Wherein, the autoThreshold is a task concurrency threshold of the object storage unit, and x is a total amount of tasks in the current time period.

FIG. 9 is a graph of the task concurrency threshold of an object storage unit against the total amount of tasks on the current day. As shown in FIG. 9, the task concurrency threshold of each object storage unit is determined dynamically by the amount of tasks generated during the current time period. Generally, the current time period is set to one day, i.e., the task concurrency threshold of the object storage unit is determined by the amount of tasks generated on that day.

Therefore, when a task process selects a task, it determines the number of tasks currently being executed for the object storage unit corresponding to the task, that is, the task concurrency of the object storage unit. Only when this number is lower than the set threshold does the task process select a task from the tasks corresponding to that object storage unit for execution.

Thus, fair scheduling among the object storage units prevents object storage units with a large task volume from occupying the execution capacity of all task processes.
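The threshold check can be sketched as follows. The grouping of Formula 1 as 2 × (sqrt(x) + log2(x)) is a reading of the formula as printed and should be treated as an assumption, as should the function names and the handling of x < 1.

```python
import math


def auto_threshold(total_tasks: int) -> float:
    """Task concurrency threshold for one object storage unit (Formula 1)."""
    if total_tasks < 1:
        return 0.0  # assumption: no tasks in the period means nothing to schedule
    return 2 * (math.sqrt(total_tasks) + math.log2(total_tasks))


def eligible_units(running, totals):
    """Units whose current task concurrency is below their own threshold.

    `running` maps unit -> tasks currently executing;
    `totals` maps unit -> total tasks generated in the current time period.
    """
    return [unit for unit, total in totals.items()
            if running.get(unit, 0) < auto_threshold(total)]
```

Under this reading the threshold grows far more slowly than the task volume (for example, 1024 tasks give a threshold of 84), which is what keeps a very large object storage unit from monopolizing the task processes.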

In addition, the embodiment of the application also provides an overload fusing (circuit-breaking) mechanism: if an object storage unit returns an overload-related error code during task execution, the fuse trips immediately, and no task of that object storage unit is scheduled or executed for 5 minutes.
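The overload fusing (circuit-breaking) mechanism can be sketched as a per-unit timestamp before which scheduling is refused. The class and method names are hypothetical, and time is supplied by the caller for clarity.

```python
BREAK_SECONDS = 5 * 60  # the 5-minute pause stated above


class OverloadBreaker:
    def __init__(self):
        self._blocked_until = {}  # object storage unit -> time scheduling resumes

    def report_overload(self, unit, now):
        """Call when a unit returns an overload-related error code."""
        self._blocked_until[unit] = now + BREAK_SECONDS

    def may_schedule(self, unit, now):
        """True unless the unit is inside its 5-minute pause."""
        return now >= self._blocked_until.get(unit, 0.0)
```

A scheduler would consult `may_schedule` before selecting tasks of an object storage unit and call `report_overload` whenever an overload error is observed.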

The following describes, by way of specific examples, implementation procedures of the data management method provided in the embodiments of the present application. The overall architecture of the data management system in a specific implementation is shown in fig. 10.

A list subtask is generated every day for each Bucket configured by the user and is executed by the list system. After the listing is finished, the filtered objects are packaged into files and stored in the object storage system, and an execution statistics file named manifest is generated. The lifecycle system is then notified, and it completes execution of the tasks for the filtered objects.

The specific task execution process is shown in fig. 11. Each task is a CSV file stored in the object storage system, and each line of the file represents the information of one object to be executed. As shown in fig. 11, the lifecycle system parses the to-be-executed object information in each line according to Schema information (a Schema is a set of database objects; generally, one user corresponds to one Schema), queries the current state of the object, and executes the operation in combination with the rules. Multi-version objects are more complex: because the timestamp of a multi-version object depends on its preceding and following historical versions, execution for multi-version objects must partly be performed in reverse order.
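Parsing a task file line by line can be sketched with the standard `csv` module. Treating the Schema as an ordered list of column names, and the particular column names shown, are assumptions made for illustration; the application does not give the file layout.

```python
import csv
import io


def parse_task_file(csv_text, schema):
    """Turn each CSV line into a dict keyed by the schema's column names."""
    reader = csv.reader(io.StringIO(csv_text))
    return [dict(zip(schema, row)) for row in reader if row]


# Hypothetical schema and file content, for illustration only.
schema = ["bucket", "key", "version_id"]
task_file = "b1,logs/a.txt,v1\nb1,logs/b.txt,v2\n"
entries = parse_task_file(task_file, schema)
```

Each resulting entry identifies one object whose current state would then be queried and processed according to the rules.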

When the lifecycle system executes tasks, each Bucket needs an execution report. Therefore, after the list summary task is received, a Finish task is created to collect information on completed execution.

Corresponding to the method embodiment, the embodiment of the application also provides a data management device. Fig. 12 is a schematic structural diagram of a data management apparatus according to an embodiment of the present application; as shown in fig. 12, the data management apparatus includes:

the screening unit 121 is configured to select at least one target object from the target object storage units according to a data storage rule corresponding to the target object storage unit and a storage time point at which the object is stored in the target object storage unit, where the data storage rule at least includes a maximum storage duration of the object; the data storage rule is used for comparing with the attribute information of the object when the object is stored so as to determine an object storage unit of the object;

a creating unit 122, configured to generate at least one task according to a data processing rule corresponding to the target object storage unit and information to be executed corresponding to the at least one target object; the target object is uploaded and then stored in the target object storage unit, and the at least one task is generated when the time length stored in the target object storage unit is longer than the maximum storage time length;

an execution unit 123, configured to perform data processing on the at least one target object according to the at least one task.

Optionally, the apparatus further includes a matching unit 124, configured to:

Optionally, the matching unit 124 is specifically configured to:

Optionally, the storage unit 125 is further included to:

and storing the at least one task in a corresponding task storage partition.

Optionally, the task storage partition is provided with a status flag;

the execution unit 123 is further configured to:

Optionally, the execution unit 123 is further configured to:

Based on the same inventive concept, referring to fig. 13, an embodiment of the present application further provides a computer device 1300, where the computer device 1300 may be an electronic device such as a smart phone, a tablet computer, a laptop computer, or a PC. As shown in fig. 13, the computer device 1300 includes a display unit 1340, a processor 1380, and a memory 1320, wherein the display unit 1340 includes a display panel 1341 for displaying information input by a user or information provided to the user, and various object selection pages and the like of the computer device 1300, and in the embodiment of the present application, is mainly used for displaying pages of applications installed in the computer device 1300, shortcut windows, and the like. Alternatively, the Display panel 1341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. The computer device 1300 can execute any one of the methods executed by the terminal or the processing server in the embodiments described above.

The processor 1380 is used to read the computer program and then execute a method defined by the computer program, for example, the processor 1380 reads the social application program, thereby running the application on the computer device 1300 and displaying the page of the application on the display unit 1340. The Processor 1380 may include one or more general purpose processors and may also include one or more Digital Signal Processors (DSPs) for performing the relevant operations to implement the techniques provided by the embodiments of the present application.

The memory 1320 typically includes internal memory and external memory. The internal memory may be random access memory (RAM), read-only memory (ROM), cache memory (CACHE), and the like. The external memory may be a hard disk, an optical disk, a USB disk, a floppy disk, a tape drive, or the like. The memory 1320 is used for storing computer programs, including application programs corresponding to applications, and other data, which may include data generated after the operating system or an application program is run, including system data (e.g., configuration parameters of the operating system) and user data. In the embodiments of the present application, program instructions are stored in the memory 1320, and the processor 1380 executes the program instructions stored in the memory 1320 to implement any of the methods described above as being performed by the terminal or the processing server.

In addition, the display unit 1340 of the computer device 1300 may be used for receiving input numerical information, character information, or contact touch operations/non-contact gestures, and for generating signal inputs related to user settings and function control of the computer device 1300. Specifically, in the embodiment of the present application, the display unit 1340 may include a display panel 1341. The display panel 1341, such as a touch screen, can collect touch operations of a user on or near it (for example, operations performed by the user on or near the display panel 1341 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the display panel 1341 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 1380, and it can also receive and execute commands sent by the processor 1380.

The display panel 1341 can be implemented by various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 1340, the computer device 1300 may also include an input unit 1330, the input unit 1330 may include, but is not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like. In fig. 13, the input unit 1330 includes an image input device 1331 and another input device 1332 as an example.

In addition to the above, the computer device 1300 may also include a power supply 1390 for powering the other modules, an audio circuit 1360, a near field communication module 1370, and an RF circuit 1310. The computer device 1300 may also include one or more sensors 1350, such as acceleration sensors, light sensors, pressure sensors, and the like. The audio circuit 1360 specifically includes a speaker 1361 and a microphone 1362. For example, when the user uses voice control, the computer device 1300 can collect the user's voice through the microphone 1362 and perform control according to the user's voice, and when the user needs to be prompted, a corresponding prompt sound is played through the speaker 1361.

Based on the same inventive concept, the present application provides a computer-readable storage medium, and when instructions in the computer-readable storage medium are executed by a processor, the processor is enabled to execute any one of the methods performed by the terminal or the processing server in the above embodiments.

Alternatively, the computer readable medium may be a non-transitory computer readable storage medium, such as a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and so forth.

Based on the same inventive concept, the embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes any one of the methods executed by the terminal and the processing server in the various embodiments described above.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for managing data, comprising:

2. The method of claim 1, wherein before the selecting at least one target object from the object storage unit according to the data storage rule corresponding to the target object storage unit and the storage time point of the object stored in the target object storage unit, the method further comprises:

3. The method of claim 2, wherein prior to receiving the uploaded object, the method further comprises:

the matching the attribute information of the object with the data storage rule of the target object storage unit includes:

the determining that the attribute information of the object is successfully matched with the data storage rule of the object storage unit includes:

4. The method according to claim 2, wherein after generating at least one task according to the data processing rule corresponding to the target object storage unit and the information to be executed corresponding to the at least one target object, and before performing data processing on the at least one target object according to the at least one task, the method further comprises:

and storing the at least one task in a corresponding task storage partition.

5. The method of claim 4, wherein the task storage partition is provided with a status flag;

before the performing data processing on the at least one target object according to the at least one task, the method further includes:

after the performing data processing on the at least one target object according to the at least one task, the method further includes:

6. The method of claim 5, wherein said fetching the at least one task from the at least one task storage partition marked as unfinished comprises:

7. The method according to claim 6, wherein after acquiring at least one task whose state is a free state from the task storage partition marked as unfinished according to the generation time of each task, before performing data processing on the at least one target object according to the at least one task, the method further comprises:

the performing data processing on the at least one target object according to the at least one task further includes:

8. The method of claim 5, wherein said fetching the at least one task from the at least one task storage partition marked as unfinished comprises:

9. A data management apparatus, comprising:

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 8 are performed when the program is executed by the processor.

11. A computer-readable storage medium, having stored thereon a computer program executable by a computer device, for causing the computer device to perform the steps of the method of any one of claims 1 to 8, when the program is run on the computer device.
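
As a non-limiting illustration, the flow recited in the claims above (selecting objects by maximum storage time length, generating tasks, storing them in task storage partitions that carry a status flag, and fetching free tasks by generation time) might be sketched in Python as follows. All names here (`Task`, `TaskPartition`, `select_expired`, `fetch_idle_tasks`, `complete`) are hypothetical and do not come from the application. The expiry comparison and the rule that a partition is marked finished once it holds no tasks are assumptions made only so the sketch runs.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    object_key: str        # target object the task acts on
    action: str            # information to be executed, e.g. "delete" (assumed)
    created_at: float      # generation time of the task
    state: str = "idle"    # "idle" (free) or "running" (assumed state names)


@dataclass
class TaskPartition:
    finished: bool = False                      # status flag of the partition
    tasks: List[Task] = field(default_factory=list)


def select_expired(objects: Dict[str, float], max_age: float, now: float) -> List[str]:
    """Select target objects whose storage duration has reached max_age.

    `objects` maps an object key to its storage time point.
    """
    return [key for key, stored_at in objects.items() if now - stored_at >= max_age]


def generate_tasks(keys: List[str], action: str, now: float) -> List[Task]:
    """Generate one task per selected target object."""
    return [Task(key, action, now) for key in keys]


def fetch_idle_tasks(partitions: List[TaskPartition], limit: int) -> List[Task]:
    """Fetch up to `limit` idle tasks from unfinished partitions, oldest first."""
    candidates = [
        task
        for partition in partitions
        if not partition.finished
        for task in partition.tasks
        if task.state == "idle"
    ]
    candidates.sort(key=lambda task: task.created_at)
    picked = candidates[:limit]
    for task in picked:
        task.state = "running"   # so another worker does not pick it up again
    return picked


def complete(partition: TaskPartition, task: Task) -> None:
    """Remove a processed task; mark the partition finished once it is empty."""
    partition.tasks.remove(task)
    if not partition.tasks:
        partition.finished = True
```

In this sketch a worker would call `fetch_idle_tasks`, process each returned task against its target object, and then call `complete`, so that partitions whose tasks are all done are skipped on later scans.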