WO2017190469A1 - Data optimisation method and apparatus in big data processing

Info

Publication number
WO2017190469A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
intermediate data
tasks
saved
data processing
Prior art date
Application number
PCT/CN2016/101058
Other languages
French (fr)
Chinese (zh)
Inventor
刘宏斌
国铁龙
杨海乐
Original Assignee
乐视控股(北京)有限公司
乐视网信息技术(北京)股份有限公司
Application filed by 乐视控股(北京)有限公司 and 乐视网信息技术(北京)股份有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Definitions

  • The invention belongs to the field of computers and, in particular, relates to a data optimization method and apparatus in big data processing.
  • Data warehouses receive data from different ecosystems every day, such as user data records from mobile phones, smart TVs, and video sites, as part of their big data resources.
  • Data processing is required both when data enters the data warehouse through its entry machine and when the data is layered inside the warehouse.
  • Each data processing run is a collection of multiple tasks, and each task has inherent processing logic; for example, task 1 reads the data of some fields in table A and writes it to table B.
  • Sometimes, when many data engineers need certain data, different engineers may follow different paths to derive it from the existing data. This leaves some intermediate data behind; over time a lot of duplicate data accumulates, and much of it will never be used again.
  • This problem arises because the inherent processing logic of the tasks is not analyzed thoroughly, which wastes storage resources and reduces the effective storage space of the data warehouse.
  • In view of this, the embodiments of the present invention provide a data optimization method and apparatus in big data processing, to solve the technical problem in the prior art that storage resources are wasted because the inherent processing logic of tasks is not analyzed thoroughly.
  • Embodiments of the present invention also provide an electronic device, a non-transitory computer storage medium, and a computer program product.
  • The present invention discloses a data optimization method in big data processing, including: analyzing data processing logic of multiple tasks; determining, according to the data processing logic of the multiple tasks, intermediate data generated between the tasks; analyzing the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and deleting the intermediate data when it does not need to be saved.
  • The present invention also discloses a data optimization apparatus in big data processing, including: a first analysis module, configured to analyze data processing logic of multiple tasks; a first determining module, configured to determine, according to the data processing logic of the multiple tasks, intermediate data generated between the tasks; a second determining module, configured to analyze the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and a first deleting module, configured to delete the intermediate data when it does not need to be saved.
  • An embodiment of the present invention further provides a non-transitory computer storage medium storing computer-executable instructions for performing any of the above data optimization methods in big data processing of the present application.
  • An embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor;
  • wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to: analyze data processing logic of multiple tasks; determine, according to the data processing logic of the multiple tasks, intermediate data generated between the tasks; analyze the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and delete the intermediate data when it does not need to be saved.
  • Embodiments of the present invention also provide a computer program product, including a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above data optimization methods in big data processing of the present application.
  • Compared with the prior art, the data optimization method and apparatus in big data processing provided by the embodiments check the intermediate data generated between tasks to determine whether it will still be used. If it is determined that the intermediate data is not used, it is deleted. This clears unnecessary intermediate data and saves storage space in the data warehouse.
  • FIG. 1 is a flowchart of a data optimization method in big data processing according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a data optimization method in big data processing according to an embodiment of the present invention
  • FIG. 3 is a block diagram of a data optimization apparatus in big data processing according to an embodiment of the present invention.
  • FIG. 4 is a block diagram of a data optimization apparatus in big data processing according to an embodiment of the present invention.
  • FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
  • In the embodiments of the present invention, the computing tasks in the data warehouse are analyzed: the data processing logic of each task is analyzed, and the logical relationships and data dependencies between tasks are derived from that logic. The intermediate data generated between tasks and the execution of the tasks are then analyzed to find data that is no longer used and can be optimized. Intermediate data that is no longer used is deleted, which saves storage space in the data warehouse. In addition, the corresponding tasks are merged where appropriate, which saves computing resources in the data warehouse and improves the efficiency of task execution.
  • FIG. 1 shows a data optimization method in big data processing according to an embodiment of the present invention. The method is applicable to a server and includes the following steps.
  • Data processing logic includes processing objects and calculation methods.
  • the processing object includes source data, target data, and the like.
  • For example, task T01 reads the data of three fields from table A and writes it to table B.
  • The calculation method is the method of generating target data from source data. If data is read directly from table A and written to table B, there is no calculation method; if the data read from table A is computed on before the result is written to table B, the task has a calculation method between table A and table B.
  • For example, task T01 reads the data of three fields from table A and writes it to table B.
  • Task T02 filters the data of the three fields in table B, selects the data that meets a preset condition, and writes that data to table C.
  • Task T03 reads the data of table C and appends it to table D. Tasks T01 to T03 are therefore performed in sequence according to the logical relationship between them. Once the logical relationship between multiple tasks is found, it can be determined which intermediate data is generated between the tasks. In the above example, table B and table C can be identified as intermediate data.
  • the usage status includes whether the intermediate data will be used for other calculations and whether the intermediate data itself is the final result of other task chains. Therefore, the determination as to whether or not the intermediate data needs to be saved can be performed in various ways.
  • step S12 can be further implemented as the following steps.
  • S120: Analyze, according to the business requirements, whether the intermediate data is used in the business.
  • Business requirements include whether the data is used to calculate other business data and whether the intermediate data is itself a desired end result in the business.
  • For example, intermediate data B records the sales of smart TVs in each store in Shanghai from January to March 2016. If the business needs to go on to select the five stores with the highest sales, intermediate data B will be used. Alternatively, if intermediate data B is itself the final result of a task chain that compiles smart TV sales in Shanghai from January to March 2016, the intermediate data also needs to be kept for use.
  • step S12 can be further implemented as the following steps.
  • S122: Count the cumulative duration for which the intermediate data has not been used. When the cumulative duration reaches a preset threshold, mark the intermediate data as data that will not be used.
  • For example, as long as no read operation occurs on intermediate data B, intermediate data B is regarded as unused and the cumulative duration keeps growing. When intermediate data B is read, the cumulative duration is cleared and counting restarts. If no read operation on intermediate data B occurs within a preset duration (for example, 12 hours), intermediate data B is marked as data that will not be used.
  • The number of times the intermediate data has been marked as unused is also counted. If the data is still not used during the next preset duration, it is marked as unused once more.
  • For example, if intermediate data B has been marked as unused 10 times in a row, it can be considered that the data does not need to be saved.
  • For example, if table B is determined to be intermediate data that does not need to be saved, table B is deleted; if table C is determined to be intermediate data that does not need to be saved, table C is deleted; if both table B and table C are determined to be intermediate data that does not need to be saved, both tables are deleted.
  • In this embodiment, the intermediate data generated between tasks is checked to determine whether it will still be used. If it is determined, either from the business logic or from how long it has gone unused, that it will not be used, the intermediate data is deleted. This clears unnecessary intermediate data and saves storage space in the data warehouse.
  • the data optimization method in the big data processing further includes the following steps.
  • The tasks that generated the deleted intermediate data can also be adjusted accordingly: the original multiple tasks are merged into one task, which avoids generating the intermediate data again, saves computing resources of the data warehouse, and improves its processing efficiency.
  • For example, tasks T01 and T02 are merged into T12 according to the data processing logic. The processing objects of the merged task T12 are table A and table C, and the calculation methods are merged accordingly: the data of the three fields is read from table A and filtered according to the preset condition, and the screening result is written to table C.
  • Similarly, tasks T02 and T03 are merged into T23 according to the data processing logic. The processing objects of the merged task T23 are table B and table D, and the calculation methods are merged accordingly: the data of the three fields in table B is filtered, and the screening result is added to table D.
  • Alternatively, tasks T01, T02, and T03 are merged into T13 according to the data processing logic. The processing objects of the merged task T13 are table A and table D, and the calculation methods are merged accordingly: the data of the three fields is read from table A and filtered according to the preset condition, and the screening result is added to table D.
  • the data optimization method in the above big data processing may further include the following steps.
  • S15: Determine, according to the data processing logic, whether there are multiple tasks that generate the same intermediate data at the same time.
  • Multiple tasks that produce the same intermediate data usually come from configurations made by different data engineers. For example, both engineers know that table A exists. Engineer A needs to extract the data of three fields in table A and write it to table B, perform predictive analysis on the data of table B, and output the analysis result to table C. Engineer B needs to extract the data of the same three fields in table A and write it to table D, filter the data of table D, and output the result to table E. At this point there are two tasks that read the same three fields from table A, writing the data to table B and table D respectively.
  • Multiple tasks that generate the same intermediate data at the same time may be further merged into one task.
  • For example, the task configured by engineer B, which extracts the data of the three fields in table A and writes it to table D, may be merged with the task configured by engineer A, which extracts the data of the same three fields in table A and writes it to table B, so that a single task performs the extraction.
  • The subsequent tasks configured by engineers A and B then share the output of the merged task.
  • Merging multiple tasks that produce the same intermediate data at the same time further reduces the number of computing tasks and saves computing resources.
  • FIG. 3 is a data optimization apparatus for processing big data according to an embodiment of the present invention, including:
  • a first analysis module 30 configured to analyze data processing logic of multiple tasks
  • a first determining module 31 configured to determine intermediate data generated between multiple tasks according to data processing logic of multiple tasks
  • a second determining module 32 configured to analyze a usage status of the intermediate data to determine whether the intermediate data needs to be saved
  • the first deleting module 33 is configured to delete the intermediate data when the intermediate data does not need to be saved.
  • the second determining module 32 further includes:
  • a first analysis submodule configured to analyze, according to a service requirement, whether the intermediate data is used in a service
  • the first determining submodule is configured to determine that the intermediate data does not need to be saved when the intermediate data is not used in the service.
  • the second determining module 32 further includes:
  • a marking submodule, configured to count the cumulative duration for which the intermediate data has not been used and, when the cumulative duration reaches a preset threshold, mark the intermediate data as data that will not be used;
  • the second determining submodule is configured to determine that the intermediate data does not need to be saved when the number of times the intermediate data is marked as unused data is greater than or equal to a preset threshold.
  • the apparatus further comprises:
  • a merge module that combines multiple tasks into one task based on data processing logic.
  • the apparatus further includes:
  • a determining module 34, configured to determine, according to the data processing logic, whether there are multiple tasks that generate the same intermediate data at the same time;
  • a second deleting module 35, configured to, when multiple tasks generate the same intermediate data at the same time, retain one copy of that intermediate data and delete the other identical copies, so that subsequent tasks read the data they need from the retained copy.
  • each of the foregoing functional modules may be implemented by a hardware processor.
  • FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
  • The device may be a data optimization server in big data processing, or a part of such a server.
  • The device may include:
  • one or more processors 501 and a memory 502; one processor 501 is taken as an example in FIG. 5.
  • the electronic device may also include an input device 503 and an output device 504.
  • the processor 501, the memory 502, the input device 503, and the output device 504 can be connected by a bus or other means.
  • As a non-transitory computer-readable storage medium, the memory 502 can store non-transitory software programs, non-transitory computer-executable programs, and modules, such as those corresponding to the data optimization method in big data processing in the embodiments of the present invention.
  • By running the non-transitory software programs, instructions, and modules stored in the memory 502, the processor 501 executes the various functional applications and data processing of the server, that is, implements the data optimization method in big data processing of the above method embodiments.
  • The memory 502 can include a program storage area and a data storage area. The program storage area can store an operating system and the applications required for at least one function; the data storage area can store data created through use of the data optimization apparatus in big data processing, and the like.
  • memory 502 can include high speed random access memory, and can also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device.
  • The memory 502 can optionally include memory located remotely from the processor 501; such remote memory can be connected to the data optimization apparatus over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 503 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the data optimization device in the big data processing.
  • Output device 504 can include a display device such as a display screen.
  • the one or more modules are stored in the memory 502, and when executed by the one or more processors 501, perform a data optimization method in big data processing in any of the above method embodiments.
  • the above product can perform the method provided by the embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the execution method.
  • The electronic device of the embodiments of the invention exists in various forms, including but not limited to:
  • Mobile communication devices: these devices are characterized by mobile communication functions, and their main purpose is to provide voice and data communication.
  • Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
  • Ultra-mobile personal computer devices: this type of device belongs to the category of personal computers, has computing and processing functions, and generally also has mobile Internet access.
  • Such terminals include PDA, MID, and UMPC devices, such as the iPad.
  • Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
  • Servers: a server consists of a processor, a hard disk, memory, a system bus, and so on.
  • Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing power, stability, reliability, security, scalability, and manageability.
  • an embodiment of the present invention further provides a computer readable storage medium, where the program for executing the method of the foregoing embodiment is stored.
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
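
The idle-time rule of step S122 described above (mark intermediate data as unused once it has gone unread for a preset duration, and treat it as no longer needing to be saved after a preset number of consecutive marks) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the class `IdleTracker`, its method names, and the use of plain numeric timestamps in seconds are hypothetical. The 12-hour window and the count of 10 are the example values from the text. The sketch also assumes that a read resets the consecutive-mark count, which the phrase "10 times in a row" suggests but the text does not state outright.

```python
HOUR = 3600  # seconds


class IdleTracker:
    """Hypothetical tracker for unused intermediate tables (sketch of S122)."""

    def __init__(self, idle_threshold_s, max_marks):
        self.idle_threshold_s = idle_threshold_s  # preset duration, e.g. 12 hours
        self.max_marks = max_marks                # preset mark count, e.g. 10
        self.window_start = {}                    # table -> start of current idle window
        self.marks = {}                           # table -> consecutive "unused" marks

    def register(self, table, now):
        """Start tracking a newly identified intermediate table."""
        self.window_start[table] = now
        self.marks[table] = 0

    def record_read(self, table, now):
        """A read clears the accumulated idle duration (assumed: and the marks)."""
        self.window_start[table] = now
        self.marks[table] = 0

    def check(self, table, now):
        """Periodic check; returns True when the table no longer needs saving."""
        if now - self.window_start[table] >= self.idle_threshold_s:
            self.marks[table] += 1            # mark as unused once more
            self.window_start[table] = now    # begin the next preset duration
        return self.marks[table] >= self.max_marks


tracker = IdleTracker(idle_threshold_s=12 * HOUR, max_marks=10)
tracker.register("B", now=0)
# Table B is never read: check it once every 12 hours.
results = [tracker.check("B", now=k * 12 * HOUR) for k in range(1, 11)]

tracker.register("C", now=0)
tracker.check("C", now=12 * HOUR)        # first mark for table C
tracker.record_read("C", now=13 * HOUR)  # a read resets the count
c_deletable = tracker.check("C", now=20 * HOUR)
```

In this run, table B only becomes deletable on the tenth consecutive check, while the read of table C at hour 13 resets its count. A real scheduler would call `check` from a periodic job and hand deletable tables to the deletion step.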

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data optimisation method and apparatus in big data processing. The method comprises: analysing a data processing logic of a plurality of tasks (S10); determining, according to the data processing logic of the plurality of tasks, intermediate data generated between the plurality of tasks (S11); analysing the usage state of the intermediate data, so as to determine whether the intermediate data needs to be conserved continuously (S12); and deleting the intermediate data when the intermediate data does not need to be conserved (S13). Thereby, unnecessary intermediate data is removed, and the storage space of a data warehouse is saved.

Description

Data optimization method and apparatus in big data processing
Cross reference
This application cites Chinese patent application No. 201610290381.0, filed on May 4, 2016 and entitled "Data optimization method and apparatus in big data processing", which is incorporated into this application by reference in its entirety.
Technical field
The invention belongs to the field of computers and, in particular, relates to a data optimization method and apparatus in big data processing.
Background
With the rapid development of the Internet, many Internet companies have accumulated terabytes of data. Data warehouses receive data from different ecosystems every day, such as user data records from mobile phones, smart TVs, and video sites, as part of their big data resources.
Data processing is required both when data enters the data warehouse through its entry machine and when the data is layered inside the warehouse. Each data processing run is a collection of multiple tasks, and each task has inherent processing logic; for example, task 1 reads the data of some fields in table A and writes it to table B. Sometimes, when many data engineers need certain data, different engineers may follow different paths to derive it from the existing data. This leaves some intermediate data behind; over time a lot of duplicate data accumulates, and much of it will never be used again.
This problem arises because the inherent processing logic of the tasks is not analyzed thoroughly, which wastes storage resources and reduces the effective storage space of the data warehouse.
Summary of the invention
In view of this, the embodiments of the present invention provide a data optimization method and apparatus in big data processing, to solve the technical problem in the prior art that storage resources are wasted because the inherent processing logic of tasks is not analyzed thoroughly.
Embodiments of the present invention also provide an electronic device, a non-transitory computer storage medium, and a computer program product.
To solve the above technical problem, the present invention discloses a data optimization method in big data processing, including: analyzing data processing logic of multiple tasks; determining, according to the data processing logic of the multiple tasks, intermediate data generated between the tasks; analyzing the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and deleting the intermediate data when it does not need to be saved.
To solve the above technical problem, the present invention also discloses a data optimization apparatus in big data processing, including: a first analysis module, configured to analyze data processing logic of multiple tasks; a first determining module, configured to determine, according to the data processing logic of the multiple tasks, intermediate data generated between the tasks; a second determining module, configured to analyze the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and a first deleting module, configured to delete the intermediate data when it does not need to be saved.
An embodiment of the present invention further provides a non-transitory computer storage medium storing computer-executable instructions for performing any of the above data optimization methods in big data processing of the present application.
An embodiment of the present invention provides an electronic device, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
analyze data processing logic of multiple tasks;
determine, according to the data processing logic of the multiple tasks, intermediate data generated between the tasks;
analyze the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved;
delete the intermediate data when it does not need to be saved.
Embodiments of the present invention also provide a computer program product, including a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions that, when executed by a computer, cause the computer to perform any of the above data optimization methods in big data processing of the present application.
Compared with the prior art, the data optimization method and apparatus in big data processing provided by the embodiments of the present invention check the intermediate data generated between tasks to determine whether it will still be used. If it is determined that the intermediate data is not used, it is deleted. This clears unnecessary intermediate data and saves storage space in the data warehouse.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a data optimization method in big data processing according to an embodiment of the present invention;
FIG. 2 is a flowchart of a data optimization method in big data processing according to an embodiment of the present invention;
FIG. 3 is a block diagram of a data optimization apparatus in big data processing according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data optimization apparatus in big data processing according to an embodiment of the present invention;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
In the embodiments of the present invention, the computing tasks in a data warehouse are analyzed. The data processing logic of each task is analyzed and used to find the logical relationships and data dependencies between tasks. The intermediate data generated between tasks and the execution of the tasks are then analyzed to find data that is no longer used and can be optimized. Deleting intermediate data that is no longer used saves storage space in the data warehouse. In addition, the corresponding tasks are merged where appropriate, which saves computing resources in the data warehouse and improves the efficiency of task execution.
FIG. 1 shows a data optimization method in big data processing according to an embodiment of the present invention. The method is applicable to a server and includes the following steps.
S10: Analyze the data processing logic of multiple tasks.
The data processing logic includes processing objects and a calculation method. The processing objects include source data, target data, and the like; for example, task T01 reads the data of three fields from Table A and writes it to Table B. The calculation method is the method by which the target data is generated from the source data. If data is read from Table A and written directly to Table B, there is no calculation method; if the data read from Table A is computed on before the result is written to Table B, then a calculation method exists between Table A and Table B for that task.
S11: Determine, according to the data processing logic of the multiple tasks, the intermediate data generated between the multiple tasks.
The logical relationships between the tasks are found from their data processing logic. For example, task T01 reads the data of three fields from Table A and writes it to Table B; task T02 filters the data of the three fields in Table B, selects the data that meets a preset condition, and writes it to Table C; task T03 reads the data of Table C and appends it to Table D. Tasks T01 to T03 thus run in sequence according to the logical relationships between them. Once the logical relationships between the tasks are found, the intermediate data generated between the tasks can be determined. In the example above, Table B and Table C can be identified as intermediate data.
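Steps S10 and S11 can be illustrated with a minimal Python sketch. It assumes that the data processing logic of each task has already been reduced to the set of tables it reads and the set of tables it writes. The names `TaskLogic` and `find_intermediate_tables` are hypothetical and do not appear in the embodiment. Under that assumption, a table is treated as intermediate data when one task writes it and another task reads it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskLogic:
    """Hypothetical summary of one task's data processing logic."""
    name: str
    sources: frozenset  # tables the task reads
    targets: frozenset  # tables the task writes


def find_intermediate_tables(tasks):
    """Return the tables written by one task and read by a different task."""
    intermediate = set()
    for producer in tasks:
        for consumer in tasks:
            if producer is consumer:
                continue
            # A table that links producer to consumer sits between two tasks.
            intermediate |= producer.targets & consumer.sources
    return intermediate


# The T01 -> T02 -> T03 chain from the example above.
chain = [
    TaskLogic("T01", frozenset({"A"}), frozenset({"B"})),
    TaskLogic("T02", frozenset({"B"}), frozenset({"C"})),
    TaskLogic("T03", frozenset({"C"}), frozenset({"D"})),
]
print(sorted(find_intermediate_tables(chain)))  # ['B', 'C']
```

Table A is only read and Table D is only written, so neither is reported. Whether Table B or Table C may actually be deleted is decided separately in step S12.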
Different data engineers set up different calculation approaches to obtain the target data, and sometimes produce intermediate data for use in other calculations according to the actual needs of the business they are responsible for. It is therefore necessary to further determine whether this intermediate data will be used, that is, whether it needs to be kept.
S12: Analyze the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved.
The usage status includes whether the intermediate data will be used in other calculations and whether the intermediate data is itself the final result of another task chain. Whether the intermediate data needs to be saved can therefore be determined in several ways.
In one embodiment, step S12 may be further implemented as the following steps.
S120: Analyze, according to business requirements, whether the intermediate data is used in the business.
The business requirements include whether the data is used to calculate other business data and whether the intermediate data is itself a final result that the business needs. For example, suppose intermediate data B records the smart TV sales of each store in Shanghai from January to March 2016. If the business also needs to select the five stores with the highest sales, intermediate data B will still be used. Alternatively, if intermediate data B is itself the final result of a task chain that compiles smart TV sales in Shanghai from January to March 2016, the intermediate data likewise needs to be used.
S121: When the intermediate data is not used in the business, determine that the intermediate data does not need to continue to be saved.
In this way, whether the intermediate data of a task chain needs to be saved is determined according to the actual demand for data in the preset business logic.
In another embodiment, step S12 may also be implemented as the following steps.
S122: Track the cumulative length of time for which the intermediate data has not been used; when the cumulative time reaches a preset threshold, mark the intermediate data as unused data.
For data identified as intermediate data in a task chain, the cumulative time for which it has not been used is tracked. For example, as long as no read operation on intermediate data B occurs, intermediate data B is considered unused. When intermediate data B is read, the cumulative time is reset to zero and timing restarts. If no read operation on intermediate data B occurs within the preset duration (for example, 12 hours), intermediate data B is marked as unused data.
To reduce the probability of misjudgment, the number of times the intermediate data has been marked as unused data is also counted. If the data is still not used during the next preset duration, the intermediate data is marked as unused data once more.
S123: When the number of times the intermediate data has been marked as unused data is greater than or equal to a preset threshold, determine that the intermediate data does not need to continue to be saved.
For example, if intermediate data B has been marked as unused data 10 times in a row, the data can be considered not to need further saving.
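The timing logic of S122 and S123 can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation. The class name `IdleTracker` and its parameters are hypothetical, and time is passed in explicitly as a number of hours so the example is deterministic. Because the example speaks of marks "in a row", the sketch also assumes that a read clears the mark count as well as the idle timer.

```python
class IdleTracker:
    """Hypothetical tracker for one piece of intermediate data."""

    def __init__(self, idle_limit, mark_limit, start=0.0):
        if idle_limit <= 0:
            raise ValueError("idle_limit must be positive")
        self.idle_limit = idle_limit  # preset duration, e.g. 12 hours
        self.mark_limit = mark_limit  # preset number of marks, e.g. 10
        self.last_event = start       # time of the last read or last mark
        self.marks = 0                # consecutive "unused" marks

    def record_read(self, now):
        """A read resets the idle timer (and, by assumption, the marks)."""
        self.last_event = now
        self.marks = 0

    def check(self, now):
        """Add one mark per full idle period elapsed; return True once the
        data no longer needs to be saved (S123)."""
        while now - self.last_event >= self.idle_limit:
            self.marks += 1
            self.last_event += self.idle_limit
        return self.marks >= self.mark_limit


tracker = IdleTracker(idle_limit=12, mark_limit=10)
print(tracker.check(119))  # False: only 9 full idle periods have elapsed
print(tracker.check(120))  # True: the 10th consecutive mark is reached
```

With the 12-hour duration and 10 marks used in the example above, the data becomes deletable only after 120 hours without a single read.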
Intermediate data that will never be used usually arises because different data engineers configure it by hand when obtaining target data in different ways. Such data is fairly ad hoc and is not used by other data engineers.
S13: When the intermediate data does not need to be saved, delete the intermediate data.
In the example above, if Table B is determined to be intermediate data that does not need to be saved, Table B is deleted; if Table C is determined to be intermediate data that does not need to be saved, Table C is deleted; if both Table B and Table C are so determined, both tables are deleted.
In a task chain made up of multiple tasks, the intermediate data generated between the tasks is examined to determine whether it will still be used. If the business logic shows that the data will not be used, or timing shows that it has gone unused for a long time, the intermediate data is deleted. Clearing unnecessary intermediate data in this way saves storage space in the data warehouse.
In one embodiment, the data optimization method in big data processing further includes the following step.
S14: Merge multiple tasks into one task according to the data processing logic.
After intermediate data that does not need to be saved is deleted, the tasks that generated it can be adjusted accordingly: the original tasks are merged into one task, which avoids generating the intermediate data again, saves computing resources in the data warehouse, and improves its processing efficiency. In the example above, if Table B is determined to be intermediate data that does not need to be saved, tasks T01 and T02 are merged into T12 according to the data processing logic. The processing objects of the merged task T12 are Table A and Table C, and the calculation methods are merged accordingly: read the data of three fields from Table A, filter it according to the preset condition, and write the filtered result to Table C. If Table C is determined to be intermediate data that does not need to be saved, tasks T02 and T03 are merged into T23 according to the data processing logic. The processing objects of the merged task T23 are Table B and Table D, and the calculation methods are merged accordingly: filter the data of the three fields in Table B and append the filtered result to Table D. If both Table B and Table C are determined to be intermediate data that does not need to be saved, tasks T01, T02, and T03 are merged into T13 according to the data processing logic. The processing objects of the merged task T13 are Table A and Table D, and the calculation methods are merged accordingly: read the data of three fields from Table A, filter it according to the preset condition, and append the filtered result to Table D.
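The merging in S14 can be sketched as a single pass over a linear task chain. This is a minimal illustration that assumes each task reads one table and writes one table, and that its calculation method can be represented as a list of step descriptions. The names `ChainTask` and `merge_chain` are hypothetical, and merged tasks are named by joining the original names rather than with the T12, T23, and T13 labels used above.

```python
from dataclasses import dataclass


@dataclass
class ChainTask:
    """Hypothetical task with one source table and one target table."""
    name: str
    source: str
    target: str
    steps: list  # descriptions of the calculation method, in order


def merge_chain(tasks, removable):
    """Merge adjacent tasks whose connecting table is in `removable`."""
    merged = []
    for task in tasks:
        prev = merged[-1] if merged else None
        if prev is not None and prev.target == task.source and prev.target in removable:
            # Skip the removed table: keep prev's source, take task's target,
            # and concatenate the calculation steps.
            merged[-1] = ChainTask(
                name=prev.name + "+" + task.name,
                source=prev.source,
                target=task.target,
                steps=prev.steps + task.steps,
            )
        else:
            merged.append(task)
    return merged


chain = [
    ChainTask("T01", "A", "B", ["read three fields"]),
    ChainTask("T02", "B", "C", ["filter by preset condition"]),
    ChainTask("T03", "C", "D", ["append to target"]),
]
for t in merge_chain(chain, removable={"B", "C"}):
    print(t.name, t.source, "->", t.target)  # T01+T02+T03 A -> D
```

Passing `removable={"B"}` yields a merged task from Table A to Table C followed by the unchanged T03, which corresponds to the T12 case in the example above.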
In other words, if unused intermediate data exists between two tasks, the two tasks can be merged into one; if several pieces of unused intermediate data occur in succession, multiple tasks can be merged into one. This reduces the number of computing tasks the data warehouse has to run, saves computing resources, and helps improve the processing efficiency of the data warehouse.
In one embodiment, as shown in FIG. 2, the above data optimization method in big data processing may further include the following steps.
S15: Determine, according to the data processing logic, whether multiple tasks exist that produce the same intermediate data.
S16: When multiple tasks exist that produce the same intermediate data, keep one copy of the identical intermediate data and delete the other copies; subsequent tasks all read the data they need from the retained copy.
The multiple tasks that produce the same intermediate data come from the configurations of different data engineers. For example, suppose Table A is known to everyone. A first engineer needs to extract the data of three fields from Table A and write it to Table B, run a predictive analysis on the data in Table B, and output the analysis result to Table C. A second engineer needs to extract the data of the same three fields from Table A and write it to Table D, filter the data in Table D, and output the result to Table E. There are then two tasks that read the data of three fields from Table A, writing what they read to Table B and Table D respectively. In this case either of Table B and Table D can be kept and the other deleted. For example, Table B is kept and Table D is deleted, and both the second engineer's task that writes the data read from Table A to Table D and the task that reads from Table D for filtering are redirected to Table B. The second engineer's tasks then write the data read from Table A to Table B and read from Table B for filtering. Duplicate intermediate data is thereby deleted and only one copy is kept to serve the read and write needs of the other tasks, which further saves storage resources in the data warehouse.
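Steps S15 and S16 can be sketched as below. The sketch assumes that two tasks produce the same intermediate data when they read the same source table, the same columns, and apply the same operation, and that each intermediate table is written by exactly one task. The names `TableTask` and `deduplicate`, and the column names `f1` to `f3`, are hypothetical. Duplicates that only become visible after an earlier redirection are not handled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableTask:
    """Hypothetical task description used to compare processing logic."""
    name: str
    source: str
    columns: tuple
    operation: str
    target: str


def deduplicate(tasks, intermediate):
    """Keep one copy of identical intermediate tables and redirect the rest.

    Returns the rewritten task list and a map of deleted -> retained table.
    """
    retained = {}  # (source, columns, operation) -> table kept for that logic
    redirect = {}  # deleted table -> retained table
    for task in tasks:
        if task.target not in intermediate:
            continue
        signature = (task.source, task.columns, task.operation)
        if signature in retained:
            redirect[task.target] = retained[signature]
        else:
            retained[signature] = task.target
    rewritten = [
        replace(
            task,
            source=redirect.get(task.source, task.source),
            target=redirect.get(task.target, task.target),
        )
        for task in tasks
    ]
    return rewritten, redirect


cols = ("f1", "f2", "f3")
tasks = [
    TableTask("first-extract", "A", cols, "copy", "B"),
    TableTask("first-predict", "B", cols, "predict", "C"),
    TableTask("second-extract", "A", cols, "copy", "D"),
    TableTask("second-filter", "D", cols, "filter", "E"),
]
rewritten, redirect = deduplicate(tasks, intermediate={"B", "D"})
print(redirect)  # {'D': 'B'}
```

After the call, the second engineer's extraction task writes to Table B and the filtering task reads from Table B, so Table D can be deleted. The two extraction tasks are now identical apart from their names, which is what makes the further merge described in the next embodiment possible.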
In addition, in another embodiment, the multiple tasks that produce the same intermediate data may further be merged into one task. In the example above, after Table D is deleted, the first engineer's task that extracts the data of three fields from Table A and writes it to Table B and the second engineer's task that extracts the data of the same three fields from Table A and, after redirection, writes it to Table B may be merged into one task. After the merge, the other subsequent tasks configured by the two engineers share the output of the merged task.
Merging multiple tasks that produce the same intermediate data further reduces the number of computing tasks and saves computing resources.
The following are apparatus embodiments of the present invention, which are used to carry out the method embodiments described above.
FIG. 3 shows a data optimization apparatus in big data processing according to an embodiment of the present invention, including:
a first analysis module 30, configured to analyze the data processing logic of multiple tasks;
a first determining module 31, configured to determine, according to the data processing logic of the multiple tasks, the intermediate data generated between the multiple tasks;
a second determining module 32, configured to analyze the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved;
a first deleting module 33, configured to delete the intermediate data when the intermediate data does not need to be saved.
In one embodiment, the second determining module 32 further includes:
a first analysis submodule, configured to analyze, according to business requirements, whether the intermediate data is used in the business;
a first determining submodule, configured to determine, when the intermediate data is not used in the business, that the intermediate data does not need to continue to be saved.
In one embodiment, the second determining module 32 further includes:
a marking submodule, configured to track the cumulative length of time for which the intermediate data has not been used and, when the cumulative time reaches a preset threshold, mark the intermediate data as unused data;
a second determining submodule, configured to determine, when the number of times the intermediate data has been marked as unused data is greater than or equal to a preset threshold, that the intermediate data does not need to continue to be saved.
In one embodiment, the apparatus further includes:
a merging module, configured to merge multiple tasks into one task according to the data processing logic.
In one embodiment, as shown in FIG. 4, the apparatus further includes:
a judging module 34, configured to determine, according to the data processing logic, whether multiple tasks exist that produce the same intermediate data;
a second deleting module 35, configured to, when multiple tasks exist that produce the same intermediate data, keep one copy of the identical intermediate data and delete the other copies, with subsequent tasks all reading the data they need from the retained copy.
In addition, in the embodiments of the present invention, each of the above functional modules may be implemented by a hardware processor.
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention. The device in this embodiment may be a data optimization server for big data processing, or a part of such a server, and may include:
one or more processors 501 and a memory 502; one processor 501 is taken as an example in FIG. 5.
The electronic device may further include an input device 503 and an output device 504.
The processor 501, the memory 502, the input device 503, and the output device 504 may be connected by a bus or by other means.
As a non-transitory computer-readable storage medium, the memory 502 can store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the data optimization method in big data processing in the embodiments of the present invention. By running the non-transitory software programs, instructions, and modules stored in the memory 502, the processor 501 performs the various functional applications and data processing of the server, that is, implements the data optimization method in big data processing of the above method embodiments.
The memory 502 may include a program storage area and a data storage area. The program storage area can store an operating system and an application required by at least one function; the data storage area can store data created through use of the data optimization apparatus in big data processing, and the like. In addition, the memory 502 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 502 optionally includes memory located remotely from the processor 501, and such remote memory may be connected over a network to the data optimization apparatus. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 503 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the data optimization apparatus in big data processing. The output device 504 may include a display device such as a display screen.
The one or more modules are stored in the memory 502 and, when executed by the one or more processors 501, perform the data optimization method in big data processing of any of the above method embodiments.
The above product can perform the method provided by the embodiments of the present invention and has the functional modules and beneficial effects corresponding to performing the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
The electronic device of the embodiments of the present invention exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices have mobile communication functions and are mainly intended to provide voice and data communication. Such terminals include smartphones (for example, the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, for example the iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (for example, the iPod), handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server consists of a processor, hard disk, memory, system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, and manageability.
(5) Other electronic devices with data interaction functions.
Correspondingly, an embodiment of the present invention further provides a computer-readable storage medium that stores a program for performing the methods of the above embodiments.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above implementations, a person skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, or of course by hardware. On this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (13)

  1. A data optimization method in big data processing, characterized by comprising:
    analyzing data processing logic of multiple tasks;
    determining, according to the data processing logic of the multiple tasks, intermediate data generated between the multiple tasks;
    analyzing a usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and
    deleting the intermediate data when the intermediate data does not need to be saved.
  2. The method according to claim 1, characterized in that analyzing the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved comprises:
    analyzing, according to business requirements, whether the intermediate data is used in the business; and
    determining, when the intermediate data is not used in the business, that the intermediate data does not need to continue to be saved.
  3. The method according to claim 1, characterized in that analyzing the usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved comprises:
    tracking a cumulative length of time for which the intermediate data has not been used and, when the cumulative time reaches a preset threshold, marking the intermediate data as unused data; and
    determining, when the number of times the intermediate data has been marked as unused data is greater than or equal to a preset threshold, that the intermediate data does not need to continue to be saved.
  4. The method according to claim 1, characterized in that the method further comprises:
    merging the multiple tasks into one task according to the data processing logic.
  5. The method according to claim 1, characterized in that the method further comprises:
    determining, according to the data processing logic, whether multiple tasks exist that produce the same intermediate data; and
    when multiple tasks exist that produce the same intermediate data, keeping one copy of the identical intermediate data and deleting the other copies, with subsequent tasks all reading the data they need from the retained copy.
  6. A data optimization apparatus in big data processing, characterized by comprising:
    a first analysis module, configured to analyze data processing logic of multiple tasks;
    a first determining module, configured to determine, according to the data processing logic of the multiple tasks, intermediate data generated between the multiple tasks;
    a second determining module, configured to analyze a usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and
    a first deleting module, configured to delete the intermediate data when the intermediate data does not need to be saved.
  7. The apparatus according to claim 6, characterized in that the second determining module comprises:
    a first analysis submodule, configured to analyze, according to business requirements, whether the intermediate data is used in the business; and
    a first determining submodule, configured to determine, when the intermediate data is not used in the business, that the intermediate data does not need to continue to be saved.
  8. The apparatus according to claim 6, characterized in that the second determining module comprises:
    a marking submodule, configured to track a cumulative length of time for which the intermediate data has not been used and, when the cumulative time reaches a preset threshold, mark the intermediate data as unused data; and
    a second determining submodule, configured to determine, when the number of times the intermediate data has been marked as unused data is greater than or equal to a preset threshold, that the intermediate data does not need to continue to be saved.
  9. The apparatus according to claim 6, characterized in that the apparatus further comprises:
    a merging module, configured to merge the multiple tasks into one task according to the data processing logic.
  10. The apparatus according to claim 6, characterized in that the apparatus further comprises:
    a judging module, configured to determine, according to the data processing logic, whether multiple tasks exist that produce the same intermediate data; and
    a second deleting module, configured to, when multiple tasks exist that produce the same intermediate data, keep one copy of the identical intermediate data and delete the other copies, with subsequent tasks all reading the data they need from the retained copy.
  11. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    analyze data processing logic of a plurality of tasks;
    determine, according to the data processing logic of the plurality of tasks, intermediate data generated between the plurality of tasks;
    analyze a usage status of the intermediate data to determine whether the intermediate data needs to continue to be saved; and
    delete the intermediate data when the intermediate data does not need to be saved.
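    Illustrative note (not part of the claims): the four instructions in claim 11 can be sketched end to end as follows. The sketch assumes each task declares its input and output dataset names, treats a dataset as intermediate when one task produces it and another consumes it, and takes the usage analysis as a given set of names that are still needed. `TaskSpec`, `find_intermediate`, `clean_up`, and the in-memory `store` dictionary are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one task's data processing logic."""

    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]


def find_intermediate(tasks: List[TaskSpec]) -> Set[str]:
    """Datasets written by one task and read by another are intermediate data."""
    produced = set().union(*(t.outputs for t in tasks))
    consumed = set().union(*(t.inputs for t in tasks))
    return produced & consumed


def clean_up(tasks: List[TaskSpec], store: Dict[str, bytes], still_needed: Set[str]) -> List[str]:
    """Delete intermediate datasets that no longer need to be saved."""
    deleted = []
    for name in sorted(find_intermediate(tasks)):
        if name not in still_needed and name in store:
            del store[name]
            deleted.append(name)
    return deleted


tasks = [
    TaskSpec("extract", frozenset({"raw"}), frozenset({"cleaned"})),
    TaskSpec("aggregate", frozenset({"cleaned"}), frozenset({"daily"})),
    TaskSpec("report", frozenset({"daily"}), frozenset({"report"})),
]
store = {"raw": b"", "cleaned": b"", "daily": b"", "report": b""}
deleted = clean_up(tasks, store, still_needed={"daily"})
```

    In this example only `cleaned` is removed: `daily` is intermediate but still needed, while `raw` and `report` are not intermediate data at all.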
  12. [Corrected under Rule 26, 01.11.2016]
    A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing the computer to perform the method of any one of claims 1-5.
  13. [Corrected under Rule 26, 01.11.2016]
    A computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method of any one of claims 1-5.
PCT/CN2016/101058 2016-05-04 2016-09-30 Data optimisation method and apparatus in big data processing WO2017190469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610290381.0 2016-05-04
CN201610290381.0A CN105975577A (en) 2016-05-04 2016-05-04 Data optimization method and device in big data processing

Publications (1)

Publication Number Publication Date
WO2017190469A1 (en) 2017-11-09

Family

ID=56994506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101058 WO2017190469A1 (en) 2016-05-04 2016-09-30 Data optimisation method and apparatus in big data processing

Country Status (2)

Country Link
CN (1) CN105975577A (en)
WO (1) WO2017190469A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975577A (en) * 2016-05-04 2016-09-28 乐视控股(北京)有限公司 Data optimization method and device in big data processing
CN115604331A (en) * 2021-06-28 2023-01-13 华为技术有限公司(Cn) Data processing system, method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456031A (en) * 2010-10-26 2012-05-16 腾讯科技(深圳)有限公司 MapReduce system and method for processing data streams
CN102932416A (en) * 2012-09-26 2013-02-13 东软集团股份有限公司 Intermediate data storage method, processing method and device in information flow task
CN103793530A (en) * 2014-02-26 2014-05-14 北京京东尚科信息技术有限公司 Method, device and system for cleaning up business data regularly
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method
CN105975577A (en) * 2016-05-04 2016-09-28 乐视控股(北京)有限公司 Data optimization method and device in big data processing

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111552730A (en) * 2020-04-28 2020-08-18 杭州数梦工场科技有限公司 Data distribution method and device, electronic equipment and storage medium
CN111552730B (en) * 2020-04-28 2024-01-26 杭州数梦工场科技有限公司 Data distribution method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN105975577A (en) 2016-09-28

Legal Events

Date Code Title Description
NENP Non-entry into the national phase (Ref country code: DE)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16900991; Country of ref document: EP; Kind code of ref document: A1)
122 EP: PCT application non-entry in European phase (Ref document number: 16900991; Country of ref document: EP; Kind code of ref document: A1)