WO2017190469A1 - Data optimisation method and apparatus in big data processing

Info

Publication number: WO2017190469A1
Application number: PCT/CN2016/101058
Authority: WO
Inventors: 刘宏斌; 国铁龙; 杨海乐
Original assignee: 乐视控股（北京）有限公司; 乐视网信息技术（北京）股份有限公司
Priority date: 2016-05-04
Filing date: 2016-09-30
Publication date: 2017-11-09
Also published as: CN105975577A

Abstract

A data optimisation method and apparatus in big data processing. The method comprises: analysing the data processing logic of a plurality of tasks (S10); determining, according to the data processing logic of the plurality of tasks, intermediate data generated between the plurality of tasks (S11); analysing the usage state of the intermediate data to determine whether the intermediate data needs to be retained (S12); and deleting the intermediate data when it does not need to be retained (S13). Unnecessary intermediate data is thereby removed, and the storage space of the data warehouse is saved.

Description

Data optimization method and device in big data processing

Cross reference

The priority application of the present application is hereby incorporated by reference in its entirety.

Technical field

The invention belongs to the field of computers, and in particular relates to a data optimization method and device in big data processing.

Background art

With the rapid development of the Internet, many Internet companies have accumulated terabytes of data. Data warehouses receive data from different ecosystems every day, such as user data records from mobile phones, smart TVs, and video sites, as part of big data resources.

Data entering the data warehouse from its entry point must be processed as it passes through the layers inside the warehouse. Each data processing procedure is a collection of multiple tasks, and each task has its own inherent processing logic; for example, task 1 reads the data of some fields in table A and writes it to table B. When many data engineers need the same data, different engineers may derive it from different existing data, so the paths they take to obtain it may differ. Each path leaves some intermediate data behind, and over time a large amount of duplicate data accumulates, much of which will never be used again.

This problem arises because the intrinsic processing logic of the tasks is not adequately analyzed, which wastes storage resources and reduces the effective storage space of the data warehouse.

Summary of the invention

In view of this, embodiments of the present invention provide a data optimization method and apparatus in big data processing, to solve the technical problem in the prior art that storage resources are wasted because the intrinsic processing logic of tasks is not adequately analyzed.

Embodiments of the present invention also provide an electronic device, a non-transitory computer storage medium, and a computer program product.

To solve the above technical problem, the present invention discloses a data optimization method in big data processing, including: analyzing the data processing logic of multiple tasks; determining, according to the data processing logic of the multiple tasks, intermediate data generated between the multiple tasks; analyzing the usage status of the intermediate data to determine whether the intermediate data needs to be saved; and deleting the intermediate data when it does not need to be saved.

To solve the above technical problem, the present invention also discloses a data optimization apparatus in big data processing, comprising: a first analysis module, configured to analyze the data processing logic of multiple tasks; a first determining module, configured to determine, according to the data processing logic of the multiple tasks, intermediate data generated between the multiple tasks; a second determining module, configured to analyze the usage status of the intermediate data to determine whether the intermediate data needs to be saved; and a first deleting module, configured to delete the intermediate data when it does not need to be saved.

An embodiment of the present invention further provides a non-transitory computer storage medium storing computer-executable instructions for performing any of the above data optimization methods in big data processing of the present application.

An embodiment of the present invention provides an electronic device, including: at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:

analyze the data processing logic of multiple tasks;

determine intermediate data generated between the multiple tasks according to the data processing logic of the multiple tasks;

analyze the usage status of the intermediate data to determine whether the intermediate data needs to be saved; and

delete the intermediate data when the intermediate data does not need to be saved.

Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above data optimization methods in big data processing of the present application.

Compared with the prior art, the data optimization method and apparatus in big data processing provided by the embodiments of the present invention examine the intermediate data generated between tasks to determine whether it is still used. If the intermediate data is determined to be unused, it is deleted, so that unnecessary intermediate data is cleared and the storage space of the data warehouse is saved.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a data optimization method in big data processing according to an embodiment of the present invention;

FIG. 2 is a flowchart of a data optimization method in big data processing according to an embodiment of the present invention;

FIG. 3 is a block diagram of a data optimization apparatus in big data processing according to an embodiment of the present invention;

FIG. 4 is a block diagram of a data optimization apparatus in big data processing according to an embodiment of the present invention;

FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In the embodiments of the present invention, the computing tasks in the data warehouse are examined and the data processing logic of each task is analyzed. From this logic, the logical relationships and data dependencies between tasks are identified, and the intermediate data generated between tasks and the execution of the tasks are analyzed to find data that is no longer used and can be optimized away. Deleting the intermediate data that is no longer used saves storage space in the data warehouse, and merging the corresponding tasks where appropriate saves computing resources and improves the efficiency of task execution.

FIG. 1 shows a data optimization method in big data processing according to an embodiment of the present invention. The method is applicable to a server and includes the following steps.

S10, analyzing data processing logic of multiple tasks.

Data processing logic includes processing objects and a calculation method. The processing objects include source data, target data, and the like. For example, task T01 reads the data of three fields from table A and writes it to table B; here table A is the source data and table B is the target data. The calculation method is the method by which the target data is generated from the source data. If data is read from table A and written directly to table B, the task has no calculation method; if the data read from table A is computed on and the result is written to table B, the task has a calculation method between table A and table B.
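The notion of data processing logic described in S10 can be pictured as a small record per task. The following is a minimal sketch only, not the applicant's implementation: the class name `Task`, its attributes, and the field names `f1` to `f3` are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Task:
    """Data processing logic of one task: processing objects plus calculation method."""
    name: str
    source: str                        # source data: the table the task reads
    target: str                        # target data: the table the task writes
    fields: Tuple[str, ...] = ()       # fields read from the source (hypothetical names)
    calculation: Optional[str] = None  # None means data is copied without computation

    def has_calculation(self) -> bool:
        return self.calculation is not None


# T01 copies three fields from table A to table B: no calculation method.
t01 = Task("T01", source="A", target="B", fields=("f1", "f2", "f3"))
# T02 filters table B into table C: a calculation method exists.
t02 = Task("T02", source="B", target="C", fields=("f1", "f2", "f3"),
           calculation="filter rows by a preset condition")
```

Keeping the source, the target, and the calculation method together in one record is what lets later steps compare tasks and chain them by table name.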

S11. Determine intermediate data generated between the multiple tasks according to data processing logic of the plurality of tasks.

From the data processing logic of multiple tasks, the logical relationship between the tasks can be found. For example, task T01 reads the data of three fields from table A and writes it to table B. Task T02 filters the data of the three fields in table B, selects the data that meets a preset condition, and writes that data to table C. Task T03 reads the data of table C and adds it to table D. Tasks T01 to T03 are therefore performed in sequence according to the logical relationship between them. Once the logical relationship between multiple tasks has been found, the intermediate data generated between the tasks can be determined. In the above example, table B and table C can be determined to be intermediate data.

Different data engineers set up different calculation methods to obtain their target data, and sometimes produce intermediate data for other calculations according to the needs of the business they are responsible for. It is therefore necessary to determine further whether this intermediate data will be used, that is, whether it needs to be saved.
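One way to picture step S11 is to treat every table that one task writes and another task reads as intermediate data. The sketch below rests on that assumption; the function name `find_intermediate_tables` and the tuple layout `(name, source, target)` are hypothetical.

```python
def find_intermediate_tables(tasks):
    """Return tables that are written by one task and read by another.

    tasks: list of (name, source, target) tuples.
    The result keeps the order in which the tables are first written.
    """
    read = {source for _, source, _ in tasks}
    result = []
    for _, _, target in tasks:
        if target in read and target not in result:
            result.append(target)
    return result


# The chain from the example: A -> B -> C -> D.
chain = [("T01", "A", "B"), ("T02", "B", "C"), ("T03", "C", "D")]
intermediate = find_intermediate_tables(chain)
print(intermediate)  # ['B', 'C']
```

Table A is only read and table D is only written, so neither is reported; only tables B and C sit between two tasks.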

S12. Analyze the usage status of the intermediate data to determine whether the intermediate data needs to be saved.

The usage status includes whether the intermediate data will be used for other calculations and whether the intermediate data itself is the final result of other task chains. Therefore, the determination as to whether or not the intermediate data needs to be saved can be performed in various ways.

In an embodiment, the step S12 can be further implemented as the following steps.

S120: Analyze, according to business requirements, whether the intermediate data is used in the business.

Business requirements include whether the data is used to calculate other business data and whether the intermediate data is itself a desired end result in the business. For example, suppose intermediate data B records the sales of smart TVs in the stores in Shanghai from January to March 2016. If the business needs to go on to select the top five stores by sales, intermediate data B will be used. Alternatively, if intermediate data B is itself the final result of a task chain that counts the sales of smart TVs in Shanghai from January to March 2016, the intermediate data also needs to be used.

S121: When the intermediate data is not used in the business, determine that the intermediate data does not need to be saved.

It is determined whether the intermediate data of the task chain needs to be saved according to the actual demand for data in the preset business logic.
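The business-requirement check of S120 and S121 can be sketched as two set lookups. This is an illustrative assumption about how the preset business logic might be recorded: the function name and the sets `business_inputs` and `business_results` are hypothetical.

```python
def needs_saving_by_business(table, business_inputs, business_results):
    """Keep a table if the business computes other data from it,
    or if the table is itself a desired end result."""
    return table in business_inputs or table in business_results


# Hypothetical business logic: table B feeds a "top five stores" ranking,
# and no intermediate table is itself a final result.
business_inputs = {"B"}
business_results = set()

to_delete = [t for t in ["B", "C"]
             if not needs_saving_by_business(t, business_inputs, business_results)]
print(to_delete)  # ['C']
```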

In another embodiment, the step S12 can be further implemented as the following steps.

S122: Count the accumulated duration for which the intermediate data has not been used. When the accumulated duration reaches a preset threshold, mark the intermediate data as data that is not used.

For intermediate data identified in the task chain, the accumulated duration for which it has not been used is counted. For example, as long as no read operation occurs on intermediate data B, intermediate data B is considered unused; when intermediate data B is read, the accumulated duration is cleared and counting restarts. If no read operation occurs on intermediate data B within a preset duration (for example, 12 hours), intermediate data B is marked as data that is not used.

To reduce the probability of misjudgment, the number of times the intermediate data is marked as data that is not used is also counted. If the data is still not used during the next preset duration, the intermediate data is marked once more as data that is not used.

S123. When the number of times the intermediate data is marked as unused data is greater than or equal to a preset threshold, it is determined that the intermediate data does not need to be saved.

For example, if the intermediate data B has been marked as data that is not used 10 times in a row, it can be considered that the data does not need to be saved.
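The time-based check of S122 and S123 can be sketched with a small counter object. This is a minimal sketch under stated assumptions: time is advanced explicitly instead of read from a clock, a read clears the mark count as well as the idle duration (one reading of "10 times in a row"), and the class name `IdleTracker` with its methods is hypothetical. The defaults of 12 hours and 10 marks come from the examples in the text.

```python
class IdleTracker:
    """Tracks how long one intermediate table has gone unread."""

    def __init__(self, idle_threshold_hours=12.0, mark_threshold=10):
        self.idle_threshold_hours = idle_threshold_hours
        self.mark_threshold = mark_threshold
        self.idle_hours = 0.0  # accumulated unused duration
        self.marks = 0         # times marked as "not used"

    def on_read(self):
        # A read clears the accumulated duration and restarts counting.
        self.idle_hours = 0.0
        # Assumption: marks must be consecutive, so a read clears them too.
        self.marks = 0

    def advance(self, hours):
        """Advance simulated time with no read; add one mark per full idle window."""
        self.idle_hours += hours
        while self.idle_hours >= self.idle_threshold_hours:
            self.idle_hours -= self.idle_threshold_hours
            self.marks += 1

    def can_delete(self):
        return self.marks >= self.mark_threshold


tracker = IdleTracker()
tracker.advance(9 * 12)      # nine idle windows give nine marks
tracker.on_read()            # a read resets the count
tracker.advance(10 * 12)     # ten consecutive idle windows
print(tracker.can_delete())  # True
```

Requiring several consecutive marks rather than a single idle window is what reduces the chance of deleting a table that is merely read infrequently.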

Intermediate data of this kind, which will not be used, usually arises from manual configuration: different data engineers obtain their target data in different ways, so the resulting intermediate data is largely arbitrary and is unlikely to be used by other data engineers.

S13, when the intermediate data does not need to be saved, the intermediate data is deleted.

In the above example, if table B is determined to be intermediate data that does not need to be saved, table B is deleted; if table C is determined to be intermediate data that does not need to be saved, table C is deleted; if both table B and table C are determined to be intermediate data that does not need to be saved, both table B and table C are deleted.

In a task chain composed of multiple tasks, the intermediate data generated between tasks is examined to determine whether it will be used. If it is determined, either according to the business logic or by the time-based count, that the data will not be used, the intermediate data is deleted. Unnecessary intermediate data is thus cleared, saving storage space in the data warehouse.

In one embodiment, the data optimization method in the big data processing further includes the following steps.

S14, combining multiple tasks into one task according to data processing logic.

After the intermediate data that does not need to be saved is deleted, the tasks that generated it can be adjusted accordingly: the original multiple tasks are merged into one task, which prevents the intermediate data from being generated again, saves computing resources in the data warehouse, and improves its processing efficiency. In the above example, if table B is determined to be intermediate data that does not need to be saved, tasks T01 and T02 are merged into task T12 according to the data processing logic. The processing objects of the merged task T12 are table A and table C, and the calculation methods are merged correspondingly: the data of the three fields is read from table A, filtered according to the preset condition, and the filtering result is written to table C. If table C is determined to be intermediate data that does not need to be saved, tasks T02 and T03 are merged into task T23. The processing objects of T23 are table B and table D, and the merged calculation method filters the data of the three fields in table B and adds the filtering result to table D. If both table B and table C are determined to be intermediate data that does not need to be saved, tasks T01, T02, and T03 are merged into task T13. The processing objects of T13 are table A and table D, and the merged calculation method reads the data of the three fields from table A, filters it according to the preset condition, and adds the filtering result to table D.

That is to say, if the intermediate data between two tasks will not be used, the two tasks can be merged into one task. If several consecutive pieces of intermediate data will not be used, the multiple tasks involved can be merged into one task. This reduces the number of computing tasks that need to be performed in the data warehouse, saves computing resources, and helps improve the processing efficiency of the data warehouse.
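The merging rule of S14 can be sketched as a single pass over an ordered task chain that fuses two adjacent tasks whenever the table between them has been judged deletable. The function name `merge_chain`, the dict layout, and the `"T01+T02"` naming of merged tasks are assumptions for illustration only.

```python
def merge_chain(tasks, deletable):
    """Merge adjacent tasks whose connecting table is in `deletable`.

    tasks: ordered list of dicts with keys name, source, target, steps.
    deletable: set of intermediate tables that do not need to be saved.
    """
    merged = []
    for task in tasks:
        prev = merged[-1] if merged else None
        if prev and prev["target"] == task["source"] and prev["target"] in deletable:
            # Fuse: read from the earlier source, write to the later target,
            # and run the calculation steps back to back.
            merged[-1] = {
                "name": prev["name"] + "+" + task["name"],
                "source": prev["source"],
                "target": task["target"],
                "steps": prev["steps"] + task["steps"],
            }
        else:
            merged.append(dict(task))
    return merged


chain = [
    {"name": "T01", "source": "A", "target": "B", "steps": ["read three fields"]},
    {"name": "T02", "source": "B", "target": "C", "steps": ["filter by preset condition"]},
    {"name": "T03", "source": "C", "target": "D", "steps": ["append to target"]},
]

only_b = merge_chain(chain, {"B"})       # T01 and T02 fuse; T03 is unchanged
both = merge_chain(chain, {"B", "C"})    # all three tasks fuse into one
print([(t["name"], t["source"], t["target"]) for t in both])
# [('T01+T02+T03', 'A', 'D')]
```

The sketch assumes a linear chain; a table read by more than one downstream task would need a dependency check before its producer is fused away.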

In an embodiment, as shown in FIG. 2, the data optimization method in the above big data processing may further include the following steps.

S15. Determine, according to the data processing logic, whether there are multiple tasks that generate the same intermediate data.

S16. When multiple tasks generate the same intermediate data, retain one copy of the identical intermediate data and delete the other copies; subsequent tasks read the data they need from the retained copy.

Multiple tasks that produce the same intermediate data typically come from the configurations of different data engineers. For example, suppose everyone knows that table A exists. Engineer A needs to extract the data of three fields from table A and write it to table B, perform predictive analysis on the data in table B, and output the analysis result to table C. Engineer B needs to extract the data of the same three fields from table A and write it to table D, filter the data in table D, and output the result to table E. There are then two tasks that read the same three fields from table A, writing the data to table B and table D respectively. In this case, either table B or table D can be retained and the other deleted. For example, table B is retained and table D is deleted, and the tasks configured by engineer B (the task that writes the data read from table A into table D, and the task that reads data from table D for filtering) are redirected to table B. The tasks configured by engineer B then write the data read from table A into table B and read data from table B for filtering. In this way the duplicate intermediate data is deleted and only one copy is retained to meet the read and write needs of the other tasks, further saving storage resources in the data warehouse.
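Steps S15 and S16 can be sketched by treating two tasks as producing the same intermediate data when they apply the same operation to the same source. That equality test is an assumption of this sketch, as are the function name `deduplicate`, the dict layout, and the task names `a1`, `a2`, `b1`, `b2`.

```python
def deduplicate(tasks):
    """Keep one copy of identical intermediate tables and redirect the rest.

    tasks: list of dicts with keys name, source, target, op, in execution order.
    Returns (rewritten tasks, sorted list of tables that can be deleted).
    """
    canonical = {}  # (source, op) -> table that is retained
    redirect = {}   # deleted table -> retained table
    for t in tasks:
        source = redirect.get(t["source"], t["source"])
        key = (source, t["op"])
        if key not in canonical:
            canonical[key] = t["target"]
        elif t["target"] != canonical[key]:
            redirect[t["target"]] = canonical[key]
    rewritten = [
        {**t,
         "source": redirect.get(t["source"], t["source"]),
         "target": redirect.get(t["target"], t["target"])}
        for t in tasks
    ]
    return rewritten, sorted(redirect)


tasks = [
    {"name": "a1", "source": "A", "target": "B", "op": "extract three fields"},
    {"name": "a2", "source": "B", "target": "C", "op": "predictive analysis"},
    {"name": "b1", "source": "A", "target": "D", "op": "extract three fields"},
    {"name": "b2", "source": "D", "target": "E", "op": "filter"},
]
rewritten, deleted = deduplicate(tasks)
print(deleted)  # ['D']
```

After the call, task `b1` writes to table B and task `b2` reads from table B, which mirrors the redirection described in the example.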

In addition, in another embodiment, multiple tasks that generate the same intermediate data may be further merged into one task. In the above example, after table D is deleted, the task configured by engineer A that extracts the data of the three fields from table A and writes it to table B, and the redirected task configured by engineer B that extracts the data of the same three fields from table A and writes it to table B, may be merged into one task. After the merge, the subsequent tasks configured by engineer A and engineer B share the output of the merged task.

Merging multiple tasks that produce the same intermediate data further reduces the number of computing tasks and saves computing resources.
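Once duplicate tables have been redirected to a single retained copy, tasks that read the same source, apply the same operation, and write the same target are redundant and can be collapsed. The sketch below illustrates that idea under the same assumptions as before; the function name `merge_identical_tasks` and the task names are hypothetical.

```python
def merge_identical_tasks(tasks):
    """Collapse tasks with the same source, target and operation into one task."""
    seen = {}
    merged = []
    for t in tasks:
        key = (t["source"], t["target"], t["op"])
        if key in seen:
            # Record that the surviving task now serves both configurations.
            seen[key]["name"] += "+" + t["name"]
        else:
            copy = dict(t)
            seen[key] = copy
            merged.append(copy)
    return merged


# Task list after table D has been deleted and engineer B's tasks redirected.
redirected = [
    {"name": "a1", "source": "A", "target": "B", "op": "extract three fields"},
    {"name": "a2", "source": "B", "target": "C", "op": "predictive analysis"},
    {"name": "b1", "source": "A", "target": "B", "op": "extract three fields"},
    {"name": "b2", "source": "B", "target": "E", "op": "filter"},
]
result = merge_identical_tasks(redirected)
print([t["name"] for t in result])  # ['a1+b1', 'a2', 'b2']
```

The extraction from table A now runs once, and both the analysis task and the filtering task read its output from table B.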

The following are apparatus embodiments of the present invention, which can be used to perform the method embodiments described above.

FIG. 3 shows a data optimization apparatus in big data processing according to an embodiment of the present invention, including:

a first analysis module 30, configured to analyze data processing logic of multiple tasks;

a first determining module 31, configured to determine intermediate data generated between multiple tasks according to data processing logic of multiple tasks;

a second determining module 32, configured to analyze a usage status of the intermediate data to determine whether the intermediate data needs to be saved;

a first deleting module 33, configured to delete the intermediate data when the intermediate data does not need to be saved.

In an embodiment, the second determining module 32 further includes:

a first analysis submodule, configured to analyze, according to business requirements, whether the intermediate data is used in the business;

a first determining submodule, configured to determine that the intermediate data does not need to be saved when the intermediate data is not used in the business.

In an embodiment, the second determining module 32 further includes:

a marking submodule, configured to count the accumulated duration for which the intermediate data has not been used and, when the accumulated duration reaches a preset threshold, to mark the intermediate data as data that is not used;

a second determining submodule, configured to determine that the intermediate data does not need to be saved when the number of times the intermediate data has been marked as data that is not used is greater than or equal to a preset threshold.

In one embodiment, the apparatus further comprises:

a merging module, configured to merge multiple tasks into one task according to the data processing logic.

In one embodiment, as shown in FIG. 4, the apparatus further includes:

a determining module 34, configured to determine, according to the data processing logic, whether there are multiple tasks that generate the same intermediate data;

a second deleting module 35, configured to, when multiple tasks generate the same intermediate data, retain one copy of the identical intermediate data and delete the other copies, with subsequent tasks reading the data they need from the retained copy.

In addition, in the embodiment of the present invention, each of the foregoing functional modules may be implemented by a hardware processor.

FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention. The device may be a data optimization server for big data processing, or part of such a server. The device may include:

one or more processors 501 and a memory 502; one processor 501 is taken as an example in FIG. 5.

The electronic device may also include an input device 503 and an output device 504.

The processor 501, the memory 502, the input device 503, and the output device 504 can be connected by a bus or other means.

The memory 502, as a non-transitory computer readable storage medium, can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the data optimization method in big data processing in the embodiments of the present invention. The processor 501 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 502, thereby implementing the data optimization method in big data processing of the above method embodiments.

The memory 502 can include a program storage area and a data storage area. The program storage area can store an operating system and the application required for at least one function; the data storage area can store data created through use of the data optimization device in big data processing, and the like. Moreover, the memory 502 can include high-speed random access memory, and can also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 502 can optionally include memory located remotely from the processor 501, and such remote memory can be connected to the data optimization device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 503 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the data optimization device in the big data processing. Output device 504 can include a display device such as a display screen.

The one or more modules are stored in the memory 502, and when executed by the one or more processors 501, perform a data optimization method in big data processing in any of the above method embodiments.

The above product can perform the method provided by the embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present invention.

The electronic device of the embodiment of the invention exists in various forms, including but not limited to:

(1) Mobile communication devices: These devices are characterized by mobile communication functions and are mainly aimed at providing voice and data communication. Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has mobile Internet access. Such terminals include: PDAs, MIDs, and UMPC devices, such as the iPad.

(3) Portable entertainment devices: These devices can display and play multimedia content. Such devices include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) Server: A device that provides computing services. A server consists of a processor, a hard disk, memory, a system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements in terms of processing power, stability, reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction functions.

Correspondingly, an embodiment of the present invention further provides a computer readable storage medium storing a program for executing the method of the foregoing embodiments.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

Through the description of the above embodiments, those skilled in the art can clearly understand that the various embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments or in portions of the embodiments.

It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data optimization method in big data processing, comprising:

analyzing data processing logic of a plurality of tasks;

determining intermediate data generated between the plurality of tasks according to the data processing logic of the plurality of tasks;

analyzing a usage status of the intermediate data to determine whether the intermediate data needs to be saved; and

deleting the intermediate data when the intermediate data does not need to be saved.
2. The method according to claim 1, wherein said analyzing the usage status of the intermediate data to determine whether the intermediate data needs to be saved further comprises:

analyzing, according to business requirements, whether the intermediate data is used in the business; and

when the intermediate data is not used in the business, determining that the intermediate data does not need to be saved.
3. The method according to claim 1, wherein said analyzing the usage status of the intermediate data to determine whether the intermediate data needs to be saved further comprises:

counting the accumulated duration for which the intermediate data has not been used, and when the accumulated duration reaches a preset threshold, marking the intermediate data as data that is not used; and

when the number of times the intermediate data has been marked as data that is not used is greater than or equal to a preset threshold, determining that the intermediate data does not need to be saved.
4. The method of claim 1, further comprising:

merging the plurality of tasks into one task according to the data processing logic.
5. The method of claim 1, further comprising:

determining, according to the data processing logic, whether there are multiple tasks that generate the same intermediate data; and

when multiple tasks generate the same intermediate data, retaining one copy of the identical intermediate data and deleting the other copies, wherein subsequent tasks read the required data from the retained copy.
6. A data optimization device in big data processing, comprising:

a first analysis module, configured to analyze data processing logic of a plurality of tasks;

a first determining module, configured to determine intermediate data generated between the plurality of tasks according to the data processing logic of the plurality of tasks;

a second determining module, configured to analyze a usage status of the intermediate data to determine whether the intermediate data needs to be saved; and

a first deleting module, configured to delete the intermediate data when the intermediate data does not need to be saved.
7. The device according to claim 6, wherein the second determining module comprises:

a first analysis submodule, configured to analyze, according to business requirements, whether the intermediate data is used in the business; and

a first determining submodule, configured to determine that the intermediate data does not need to be saved when the intermediate data is not used in the business.
8. The device according to claim 6, wherein the second determining module comprises:

a marking submodule, configured to count the accumulated duration for which the intermediate data has not been used, and to mark the intermediate data as data that is not used when the accumulated duration reaches a preset threshold; and

a second determining submodule, configured to determine that the intermediate data does not need to be saved when the number of times the intermediate data has been marked as data that is not used is greater than or equal to a preset threshold.
9. The device according to claim 6, wherein the device further comprises:

a merging module, configured to merge the plurality of tasks into one task according to the data processing logic.
10. The device according to claim 6, wherein the device further comprises:

a determining module, configured to determine, according to the data processing logic, whether there are multiple tasks that generate the same intermediate data; and

a second deleting module, configured to, when multiple tasks generate the same intermediate data, retain one copy of the identical intermediate data and delete the other copies, wherein subsequent tasks read the required data from the retained copy.
11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:

analyze data processing logic of a plurality of tasks;

determine intermediate data generated between the plurality of tasks according to the data processing logic of the plurality of tasks;

analyze a usage status of the intermediate data to determine whether the intermediate data needs to be saved; and

delete the intermediate data when the intermediate data does not need to be saved.
[Corrected under Rule 26, 01.11.2016]
12. A non-transitory computer readable storage medium, wherein the non-transitory computer readable storage medium stores computer instructions for causing a computer to perform the method of any of claims 1-5.
[Corrected under Rule 26, 01.11.2016]
13. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to execute the method of any of claims 1-5.