CN112579259B - GC self-adaptive adjustment method and device for big data processing framework - Google Patents

GC self-adaptive adjustment method and device for big data processing framework

Info

Publication number
CN112579259B
CN112579259B CN202011472196.6A
Authority
CN
China
Prior art keywords
data
memory
jvm
information
actuator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011472196.6A
Other languages
Chinese (zh)
Other versions
CN112579259A
Inventor
黄涛
许利杰
王伟
李慧
汪钇丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202011472196.6A priority Critical patent/CN112579259B/en
Publication of CN112579259A publication Critical patent/CN112579259A/en
Application granted granted Critical
Publication of CN112579259B publication Critical patent/CN112579259B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45583 Memory management, e.g. access or allocation
    • G06F2009/45591 Monitoring or debugging support

Abstract

The invention relates to a GC (garbage collection) adaptive adjustment method and device for a big data processing framework. The method collects current job information from the big data framework and memory state information from the executor JVM (Java Virtual Machine), and predicts the memory usage demand of each processing stage of the big data application; based on the prediction, it adaptively adjusts the GC parameters of the JVM according to a set of logic rules, through an interface that allows GC parameters to be modified dynamically at runtime. The invention adapts to the constantly changing memory usage characteristics of big data applications, reduces the GC trigger frequency and global pause time of the executor JVM, and improves JVM memory management efficiency in a big data environment.

Description

GC self-adaptive adjusting method and device for big data processing framework
Technical Field
The invention relates to a GC adaptive adjustment tool, and in particular to a method and device for dynamically adjusting the GC parameters of the JVM (Java Virtual Machine) in a big data processing framework; it belongs to the field of software technology.
Background
With the rapid development and spread of internet technology, the amount of data generated by its users has grown explosively, creating an urgent need to process and store massive data. To meet this need, industry and academia have developed a variety of scalable distributed big data processing frameworks over the last decade, which are now widely deployed. Mainstream big data frameworks such as Spark and Flink are generally written in object-oriented languages such as Java and Scala; they use the Java Virtual Machine (JVM) as the runtime environment to execute the concrete computing tasks of a big data application, and rely on the JVM's Garbage Collection (GC) mechanism to manage the data objects created in the JVM heap.
The GC algorithms of conventional JVMs are based on the weak generational hypothesis, i.e., most objects die soon after creation. These GC algorithms therefore use generational collection, dividing the JVM heap into a Young Generation and an Old Generation. The young generation is further subdivided into one Eden space and two Survivor spaces. Objects are initially allocated in Eden; when Eden is exhausted, a Minor GC is triggered, and objects that reachability analysis determines to be alive are copied to a survivor space. When the number of Minor GCs an object has survived reaches a certain threshold, it is promoted to the old generation. When old-generation usage reaches a certain proportion, a Major GC is triggered to compact the whole heap. To guarantee consistency and safety of the GC result, all GC algorithms have phases that require a global pause (Stop-The-World, STW) in which all application threads are suspended.
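The promotion mechanics described above can be illustrated with a toy simulation (a sketch for intuition only, not part of the patent): each object has a lifetime measured in Minor GCs survived, and is promoted to the old generation once its age reaches the tenuring threshold.

```java
import java.util.List;

// Toy model of generational promotion: an object whose lifetime (in Minor
// GCs survived) reaches the tenuring threshold is promoted to the old
// generation; shorter-lived objects die in the young generation.
final class PromotionModel {
    // Returns how many of the given objects end up promoted to the old generation.
    static long promotedCount(List<Integer> lifetimesInMinorGcs, int tenuringThreshold) {
        return lifetimesInMinorGcs.stream()
                .filter(lifetime -> lifetime >= tenuringThreshold)
                .count();
    }
}
```

With a higher threshold, fewer short-lived objects reach the old generation, which is exactly the lever the adjustment rules in this patent manipulate.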
Classical GC algorithms were not designed for the computational characteristics of big data processing frameworks. Results from industrial practice and academic evaluations show that when the JVM actually runs a big data application it often suffers from long GC global pause times and high GC frequency; in some big data scenarios GC pauses exceed 50% of the application's total running time, severely affecting the throughput and latency of the big data application and becoming a performance bottleneck of the big data processing framework. The cause of this inefficiency is a shift in the memory usage pattern of big data applications: unlike traditional Java applications, which are "compute-intensive" with objects that die soon after birth, big data applications are generally "data-intensive" and "memory-intensive". Large numbers of input-data and intermediate-result objects survive in the JVM heap for a long time, repeatedly undergoing GC reachability analysis and object copying without being reclaimed, which wastes a large number of CPU time slices.
The default JVM configuration of heap size, old/young generation ratio, object promotion threshold and GC trigger threshold used by conventional GC algorithms cannot adapt to this change in memory usage pattern. Since a big data application's memory usage pattern differs across processing stages, self-adaptation of the GC algorithm based only on historical execution information does not carry over to the application's future processing stages. A user can tune the configuration parameters manually, but obtaining a good result requires deep knowledge of the application's memory usage and of the GC algorithm, plus extensive comparative testing. Because the JVM provides no way to adjust these parameters dynamically at runtime, manual tuning can only be applied statically at JVM startup and cannot adapt to each processing stage of the big data application. Existing related research dynamically adjusts the partitioning of JVM heap usage at the big data framework level, but such coarse-grained adjustment has limited effect. At the GC-algorithm level, research has attempted to treat different kinds of objects according to user annotations or lifetimes estimated from historical statistics, but this requires substantial user effort or complex algorithmic machinery.
None of these methods achieves effective adaptive adjustment of the GC algorithm under a big data processing framework with both low user burden and low computational complexity.
Disclosure of Invention
To address the adaptability problem of GC algorithms in big data processing frameworks, the invention aims to provide a GC adaptive adjustment method and device for a big data processing framework.
The invention uses information collected from the big data framework at runtime to determine the approximate volume and life cycle of data objects, and passes the result to the task executor JVM to dynamically adjust GC parameters at runtime, thereby improving the efficiency of the GC mechanism. As shown in fig. 1, the invention predicts the life cycle and memory usage of the relevant data objects from information such as the data volume and execution flow of the big data application in the big data processing framework, derives adaptive GC configuration parameters, and passes them to the executor JVM for dynamic adjustment, so as to reduce the duration and frequency of GC global pauses.
The technical scheme of the invention focuses on the 3 GC parameters with the greatest influence on the memory-management efficiency of big data applications: the Heap Size, the old/young generation ratio (New Ratio), and the object promotion threshold (Tenuring Threshold). Their specific meanings are as follows:
1. heap memory size
The memory that the physical node of the big data processing framework cluster hosting the executor JVM allocates to the JVM heap. It determines how much memory the JVM can use to store the object instances and arrays of the data to be processed and of intermediate results, and is proportional to the execution memory available to each big data framework computing-task thread in the JVM.
2. Old/young generation ratio
The ratio by which the JVM heap is divided into the old generation and the young generation. Since the heap consists of only these two parts, their sizes trade off against each other. A larger old generation can hold more of the long-lived objects produced by a big data processing framework, reducing the trigger frequency of Major GC; a larger young generation reduces the trigger frequency of Minor GC, improving overall GC throughput.
3. Object promotion threshold
The number of GCs an object must survive to move from a survivor space of the young generation to the old generation. A higher promotion threshold can reduce pressure on the old generation by keeping short-lived objects out of it; a lower promotion threshold can reduce the number of memory copies a long-lived object must undergo in the young generation before entering the old generation.
The technical scheme of the invention is as follows:
a GC self-adaptive adjustment method for a big data processing framework comprises the following steps:
acquiring dynamic information related to memory use, including information about the current computing task from the big data processing framework and memory-management information of the current executor from the executor JVM;
predicting, from the acquired information, the memory usage of the input data and intermediate computation results in the executor JVM during the current task stage;
and generating adaptive GC configuration parameters from the predicted memory usage of input data and intermediate results in the executor JVM during the current task stage together with the acquired current memory state of the executor JVM, and dynamically adjusting the GC configuration parameters while the executor JVM runs, so as to improve the GC efficiency of the executor JVM.
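The three steps above form a collect, predict, adjust cycle; a sketch follows (all class and method names here are invented for illustration; the patent's implementation lives inside a modified OpenJDK and the big data framework's executor):

```java
// Hypothetical collect -> predict -> adjust cycle. The bytes-per-record
// factor stands in for the linear model fitted from historical data.
final class AdaptiveGcTuner {
    static final class GcParams {
        final long heapBytes; final int newRatio; final int tenuringThreshold;
        GcParams(long heapBytes, int newRatio, int tenuringThreshold) {
            this.heapBytes = heapBytes; this.newRatio = newRatio;
            this.tenuringThreshold = tenuringThreshold;
        }
    }

    // Predict the memory demand of the coming stage from its record count.
    static long predictBytes(long records, long bytesPerRecord) {
        return records * bytesPerRecord;
    }

    // Emit parameters: long-lived data favors a bigger heap and old generation,
    // short-lived data favors a larger young generation and higher threshold.
    static GcParams adjust(long predictedBytes, boolean longLived, GcParams current) {
        if (longLived) {
            long heap = Math.max(current.heapBytes, predictedBytes * 2);
            return new GcParams(heap, current.newRatio + 1, current.tenuringThreshold);
        }
        return new GcParams(current.heapBytes, Math.max(1, current.newRatio - 1),
                current.tenuringThreshold + 5);
    }
}
```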
A GC adaptive adjustment device for a big data processing framework mainly comprises the following 3 functional modules: a data information collector (Profiler), a data usage pattern analyzer (Analyzer), and a GC parameter dynamic modifier (Modifier).
1. Data information collector
In the method of the invention, the data information collector is divided into a framework data-flow collector and a GC information collector, which collect dynamic information related to memory use from the big data processing framework and from the executor JVMs respectively. The framework data-flow collector is responsible for collecting:
(1) operation information, including the current code position and data-operation type, the caching status of the data set (whether it is cached, the cache level, cache dependencies), and the Shuffle Write and Shuffle Read types;
(2) data information, including the data-structure type of the data currently being operated on and the volume of data assigned to each executor JVM;
(3) configuration information, including the coarse-grained memory partitioning of the current big data processing framework, i.e. the proportions of executor-JVM memory the framework designates for data-set caching, Shuffle computation and user-code execution.
The GC information collector is responsible for collecting:
(1) memory information, including the current executor JVM's heap size, generation sizes and ratio, object promotion threshold, GC trigger threshold, used size of each generation, and the age distribution of objects;
(2) log information, including information about the GCs performed so far, such as the trigger reason of each GC, the amount of space and number of objects reclaimed, and the time spent in each phase.
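Part of the GC-side information can be read from any stock JVM through the standard java.lang.management API; a minimal sketch follows (the patent's collector additionally reads generation boundaries and object age tables, which this API does not expose):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Read cumulative GC counts and current heap usage via the standard JMX beans.
final class GcInfoCollector {
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 if undefined for this collector
            if (c > 0) count += c;
        }
        return count;
    }
    static MemoryUsage heapUsage() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }
}
```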
2. Data usage pattern analyzer
In the method of the invention, the data usage pattern analyzer predicts the memory usage of the input data and intermediate computation results in the executor JVM during the current task stage from the information obtained by the data information collector, as shown in fig. 2. Specifically:
(1) memory footprint:
linearly fitting the memory occupied by the set of data objects after the current input data is loaded into the executor JVM, using the structure type of the current data and the data volume to be processed together with historical data and a memory-footprint model;
and computing the memory required by the data objects of intermediate results and by cached data objects, taking into account the current data operation and the Shuffle Write and Shuffle Read types.
(2) object life cycle:
determining the final cache location of a cached data set (in-heap memory, off-heap memory or local disk) from the code positions of the cache functions persist(), unpersist() and cache(), the data set's computation dependencies, its cache level, and the size and usage of the data-cache space designated by the big data processing framework, and from this inferring how long the cached data survives in the in-heap memory managed by the GC algorithm;
and determining, from the data-operation type, the Shuffle-operation type, the volume of data to be processed, the JVM's processing speed, and the size and usage of the JVM's data-execution space, whether the input data and intermediate results must be held in heap memory for a long period and whether a data spill will occur.
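The linear fit in step (1) can be sketched as ordinary least squares through the origin over historical (record count, bytes) pairs (a simplification; per the text, the patent additionally conditions on the data-structure type):

```java
// Fit bytes ≈ a * records by least squares through the origin:
// a = Σ(x*y) / Σ(x*x), then predict for a new record count.
final class MemoryFootprintModel {
    static double fitBytesPerRecord(long[] records, long[] bytes) {
        double sxy = 0, sxx = 0;
        for (int i = 0; i < records.length; i++) {
            sxy += (double) records[i] * bytes[i];
            sxx += (double) records[i] * records[i];
        }
        return sxy / sxx;
    }
    static double predictBytes(double bytesPerRecord, long newRecords) {
        return bytesPerRecord * newRecords;
    }
}
```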
3. GC parameter dynamic modifier
In the invention, the GC parameter dynamic modifier combines the data-set size and life-cycle characteristics predicted by the data usage pattern analyzer with the current memory state of the executor JVM obtained by the data information collector to generate adaptive GC configuration parameters, and dynamically adjusts the GC configuration parameters while the executor JVM runs.
The specific decision flow is shown in figs. 3 and 4; the logic rules for GC parameter modification are as follows.
Rule 1: for a cache operation, if the data usage pattern analyzer determines the data set's cache location to be off-heap memory or local disk, raise the object promotion threshold and increase the young-generation share.
Explanation 1: once data objects cached to off-heap memory or local disk have been computed and written to their destination, the originals can be cleared from the heap. Raising the object promotion threshold keeps these briefly-resident objects out of the old generation, and enlarging the young generation reduces the Minor GC trigger frequency, improving GC throughput.
Rule 2: for a cache operation, if the analyzer determines the cache location to be in-heap memory, lower the object promotion threshold.
Explanation 2: cached data objects survive in the heap for a long time; lowering the promotion threshold avoids repeatedly copying data objects that would otherwise have to accumulate GC cycles in the young generation before reaching the old generation.
Rule 3: for a cache operation, if the data set's cache level prefers in-heap memory and GC parameter adjustment could make the current heap large enough to hold it entirely, increase the heap size and enlarge the old-generation share.
Explanation 3: if the data sets' dependencies show that all existing cached data sets will be used again, the heap should be enlarged and the old-generation share increased to avoid triggering Major GC, so that cached data sets are evicted as little as possible.
Rule 4: for a cache operation, if the cache level prefers in-heap memory but even the full potential of the heap and the physical node's memory cannot hold the entire cached data set, lower the object promotion threshold.
Explanation 4: when memory cannot satisfy the cache demand and evicting part of the cached data is unavoidable, lowering the promotion threshold lets Major GC clean up the evicted data as early as possible and promotes the current data set to the old generation quickly.
Rule 5: for a data operation, if the analyzer determines that the objects produced by the current operation are short-lived, increase the young-generation share and raise the object promotion threshold.
Explanation 5: the current operation likely has a small data volume and low computational complexity, the physical node processes it quickly, and its data objects need not be kept long. Enlarging the young generation and raising the promotion threshold keeps these objects from occupying old-generation space.
Rule 6: for a data operation, if the analyzer determines that the objects produced by the current operation are long-lived, increase the old-generation share and increase the heap size.
Explanation 6: the current operation may involve a Shuffle, whose data objects must be kept for a long time. Enlarging the old generation and the heap avoids, as far as possible, the performance loss caused by Major GC and data spills.
Rule 7: for a data operation, if the analyzer judges that a data spill is unavoidable, enlarge the heap and the old generation while raising the object promotion threshold.
Explanation 7: spilling data to disk carries a significant performance cost; raising the promotion threshold keeps more Shuffle data in heap memory, reducing the frequency and proportion of spilled data.
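The seven rules reduce to a small decision table; a sketch of Rules 1, 2 and 5 as pure functions over the analyzer's predictions (enum and field names are invented for illustration):

```java
// Simplified dispatch over a subset of the rules (1, 2, 5).
final class GcRuleEngine {
    enum CacheTarget { ON_HEAP, OFF_HEAP_OR_DISK }

    static final class Adjustment {
        final int tenuringDelta;  // change to the object promotion threshold
        final int youngGenDelta;  // change to the young-generation share (percentage points)
        Adjustment(int tenuringDelta, int youngGenDelta) {
            this.tenuringDelta = tenuringDelta; this.youngGenDelta = youngGenDelta;
        }
    }

    static Adjustment forCache(CacheTarget target) {
        // Rule 1: off-heap/disk caches die young on-heap -> raise threshold, grow young gen.
        // Rule 2: on-heap caches live long -> lower threshold so they promote quickly.
        return target == CacheTarget.OFF_HEAP_OR_DISK
                ? new Adjustment(+5, +10) : new Adjustment(-3, 0);
    }

    static Adjustment forShortLivedOperation() {
        // Rule 5: keep short-lived operation output out of the old generation.
        return new Adjustment(+5, +10);
    }
}
```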
The parameter adjustment ranges are determined as follows:
(1) heap size: the adjustment range is determined by the free memory of the physical node hosting the JVM and the computed total extra memory required by all computing-task threads;
(2) old/young generation ratio: the adjustment range is determined by the computed old-generation footprint required by the computing task and the proportion of long-lived objects among all objects;
(3) object promotion threshold: the adjustment is determined by the young-generation size, the computed data-object life cycle, and the historical promotion speed of objects.
The tool modifies the OpenJDK source code to provide an interface for adjusting GC parameters dynamically at runtime. The JVM declares the physical node's entire memory at startup, enabling the actually-used heap size to grow and shrink; the old/young ratio is adjusted at runtime through a movable boundary between the two generations; and a dedicated method enables dynamic setting of the object promotion threshold. Before each computing task executes, the adaptive GC parameter values computed by the three modules are applied by calling the corresponding interfaces.
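Stock JVMs expose no such runtime interface; the patent obtains one by patching OpenJDK. A stub of what the interface surface might look like (all names hypothetical, stored locally here rather than applied to a real VM):

```java
// Hypothetical runtime GC-parameter interface, as a settable stub.
// A real implementation would call into the patched OpenJDK runtime.
final class DynamicGcInterface {
    private long heapBytes;
    private int newRatio;
    private int tenuringThreshold;

    void setHeapSize(long bytes)       { this.heapBytes = bytes; }       // grow/shrink used heap
    void setNewRatio(int ratio)        { this.newRatio = ratio; }        // move the generation boundary
    void setTenuringThreshold(int age) { this.tenuringThreshold = age; } // change promotion age

    long heapBytes()        { return heapBytes; }
    int newRatio()          { return newRatio; }
    int tenuringThreshold() { return tenuringThreshold; }
}
```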
The invention provides a GC adaptive adjustment method and device for a big data processing framework. Compared with the prior art, it has the following advantages:
(1) the invention provides a method for dynamically adjusting GC parameters at runtime, designed specifically for big data processing frameworks; compared with static, fixed parameter tuning, it adapts better to the memory usage characteristics of the big data application at its different stages;
(2) the memory usage pattern of a computing task is inferred from historical data and memory-usage information combined with the data operation and data volume about to be executed; compared with adjustment based on history alone, this better matches the memory demand of the near future;
(3) the invention designs detailed GC parameter adjustment rules, providing a tuning strategy for each combination of memory state and memory demand, so that the executor JVM achieves higher memory-management efficiency across a variety of scenarios.
Drawings
FIG. 1 is a block diagram of the big data processing framework oriented GC adaptive tuning tool of the present invention;
FIG. 2 is a flow chart of the present invention for predicting the life cycle of a data object in the heap memory of an executor JVM;
FIG. 3 is a flow chart of the GC parameter adjustment rule for cache operations according to the present invention;
FIG. 4 is a flowchart illustrating the GC parameter adjustment rule for Shuffle operation according to the present invention;
fig. 5 is a flow chart of an implementation of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments and the accompanying drawings.
The GC adaptive adjustment method of the invention is applied to the Spark big data processing framework, taking the representative big data application PageRank as an example. The concrete implementation steps are as follows; the flow chart is shown in FIG. 5.
Spark generates the logical processing flow and physical execution plan of PageRank. According to this flow, Spark's first data operation passes the edges of the input graph data set through map() to obtain <user, follower> records, at which point the tool of the invention starts working:
1. and the data information acquisition unit acquires the related information.
Acquiring a frame data stream by using a frame data stream acquisition unit:
(1) operation information: map, shuffle write (BypassMergeSortShufflWriter)
(2) Data information: 1000 side
(3) Configuration information:
spark is assigned to user code space: 40 percent of
Frame memory space: 60 percent
Frame execution space: 60% by 50% to 30%
Data cache space: 60%. 50%. 30%
The size of the memory outside the heap: 200MB (multi-media broadcast)
Collected by the GC information collector:
(4) memory information:
Heap memory size at executor JVM startup: 400 MB
Young generation size: 400 MB × 1/4 = 100 MB
Eden space: 100 MB × 4/5 = 80 MB
Survivor spaces: 100 MB × 1/10 × 2 = 10 MB × 2
Old generation size: 400 MB × 3/4 = 300 MB
Object promotion threshold: 5
(5) log information:
Garbage collector in use: Parallel Scavenge
2. The data usage pattern analyzer computes from the information gathered by the data information collector:
(1) memory footprint: the input data objects occupy 100 MB of memory after entering the executor;
(2) object life cycle: the resulting data objects are short-lived.
3. The GC parameter dynamic modifier decides based on the memory characteristics computed by the data usage pattern analyzer.
According to Rule 5, keep short-lived objects out of the old generation:
Raise the young-generation share: 400 MB × 1/2 = 200 MB
Raise the survivor-space share: 200 MB × 1/4 × 2 = 50 MB × 2
Raise the object promotion threshold: 10
Next, the <user, follower> records are processed by a reduce operation into <user, list(followers)>, which is cached in heap memory for a long time as the input of subsequent iterative computation.
1. The data information collector gathers the relevant information.
Collected by the framework data-flow collector:
(1) operation information: reduceByKey
shuffle read (BlockStoreShuffleReader)
persist(MEMORY_AND_DISK)
(2) data information: information of 1000 edges, generating information of 200 users and their followers
(3) configuration information:
Spark's allocation to user code space: 40%
Framework memory space: 60%
Framework execution space: 60% × 50% = 30%
Data cache space: 60% × 50% = 30%
Off-heap memory size: 200 MB
Collected by the GC information collector:
(4) memory information:
Heap memory size at executor JVM startup: 400 MB
Young generation size: 400 MB × 1/2 = 200 MB, of which 120 MB used
Eden space: 200 MB × 1/2 = 100 MB, of which 75 MB used
Survivor spaces: 200 MB × 1/4 × 2 = 50 MB × 2, of which 45 MB used
(object age distribution over 5 ages: 1:2:2:2:3)
Old generation size: 400 MB × 1/2 = 200 MB, of which 0 MB used
Object promotion threshold: 10
(5) log information:
5 Minor GCs have been triggered by Eden exhaustion, reclaiming 10000 objects, with a total pause time of 0.5 s
2. The data usage pattern analyzer computes from the information gathered by the data information collector:
(1) memory footprint: the shuffle and cache data will occupy up to 300 MB of memory;
(2) object life cycle: the cached data objects are long-lived.
3. The GC parameter dynamic modifier decides based on the memory characteristics computed by the data usage pattern analyzer.
According to Rule 3 and Rule 6:
Increase the heap size to 500 MB
Raise the old-generation share: 500 MB × 4/5 = 400 MB
Lower the object promotion threshold: 5
Subsequently, <user, list(followers)> is joined with rank information to obtain <user, list(rank)>, and the Cartesian product of each list(followers, rank) is computed.
1. The data information collector gathers the relevant information.
Collected by the framework data-flow collector:
(1) operation information: join, flatMap
(2) data information: the follower information of 200 users and the rank information of 200 users, generating rank contributions to 250 users
(3) configuration information:
Spark's allocation to user code space: 40%
Framework memory space: 60%
Framework execution space: 60% × 20% = 12%
Data cache space: 60% × 80% = 48%
Off-heap memory size: 100 MB
Collected by the GC information collector:
(4) memory information:
Heap memory size at executor JVM startup: 500 MB
Young generation size: 500 MB × 1/5 = 100 MB, of which 50 MB used
Eden space: 100 MB × 1/2 = 50 MB, of which 15 MB used
Survivor spaces: 100 MB × 1/4 × 2 = 25 MB × 2, of which 35 MB used
(object age distribution over 5 ages: 3:2:1:1:1)
Old generation size: 500 MB × 4/5 = 400 MB, of which 300 MB used
Object promotion threshold: 5
(5) log information:
12 Minor GCs have been triggered by Eden exhaustion, reclaiming 20000 objects, with a total pause time of 1 s
2. The data usage pattern analyzer computes from the information gathered by the data information collector:
(1) memory footprint: the shuffle and cache data will occupy 200 MB of memory;
(2) object life cycle: the shuffle objects are long-lived.
3. The GC parameter dynamic modifier decides based on the memory characteristics computed by the data usage pattern analyzer.
According to Rule 6 and Rule 7:
Increase the heap size to 550 MB
Raise the old-generation share: 550 MB × 4/5 = 440 MB
Raise the object promotion threshold: 10
The subsequent processing flow of PageRank iterates repeatedly, and before each operation the tool adjusts the GC parameters following a flow similar to the above, so that the executor JVM under the big data processing framework Spark reduces the number of GC triggers and the pause time, improving the execution efficiency of the big data application.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (a computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, performs the steps of the method of the invention.
The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims (8)

1. A GC self-adaptive adjustment method for a big data processing framework is characterized by comprising the following steps:
acquiring dynamic information related to memory use, including acquiring information related to the current computing task from the big data processing framework and acquiring memory management information of the current executor from the executor JVM;
predicting, according to the acquired information, the memory usage of the input data and intermediate calculation results of the current task stage in the executor JVM (Java virtual machine);
generating adaptive GC configuration parameters according to the predicted memory usage of the input data and intermediate calculation results in the executor JVM in the current task stage and the acquired current memory state of the executor JVM, and dynamically adjusting the GC configuration parameters while the executor JVM is running so as to improve the GC efficiency of the executor JVM;
the dynamic adjustment while the executor JVM is running dynamically adjusts the GC configuration parameters for the memory usage characteristics of the big data application by using the following rules:
rule 1: for a cache operation, if the cache location of the data set determined by the data usage pattern analyzer is off-heap memory or a local disk, raising the promotion threshold of the objects and increasing the proportion of the young generation;
rule 2: for a cache operation, if the cache location of the data set determined by the data usage pattern analyzer is in-heap memory, lowering the promotion threshold of the objects;
rule 3: for a cache operation, if the cache level of the data set is in-heap memory first and the current in-heap memory can potentially hold the entire data set after GC parameter adjustment, increasing the size of the in-heap memory and enlarging the proportion of the old generation;
rule 4: for a cache operation, if the cache level of the data set is in-heap memory first, but the in-heap memory and the physical node memory together cannot hold the entire cached data set, lowering the promotion threshold of the objects;
rule 5: for a data operation, if the data usage pattern analyzer judges that the objects generated by the current operation are short-lived, increasing the proportion of the young generation and raising the object promotion threshold;
rule 6: for a data operation, if the data usage pattern analyzer judges that the objects generated by the current operation are long-lived, increasing the proportion of the old generation and increasing the size of the heap memory;
rule 7: for a data operation, if the data usage pattern analyzer judges that a spill (overflow write) cannot be avoided, raising the promotion threshold of the objects while expanding the heap memory and the old generation;
wherein the GC configuration parameter adjustment ranges are determined as follows:
the adjustment range of the heap memory size is determined according to the size of the free memory of the physical node where the executor JVM is located and the calculated total extra memory required by all computing task threads;
the adjustment range of the proportion between the old generation and the young generation is determined according to the calculated old-generation size required by the computing task and the proportion of long-lived objects among all objects;
the adjustment range of the object promotion threshold is determined based on the size of the young generation, the derived life cycle of the data objects, and the historical object promotion rate.
2. The method of claim 1, wherein the information related to the current computing task includes operation information, data information, and configuration information; the memory management information of the current executor includes memory state information and GC log information.
3. The method of claim 1, wherein predicting the memory usage of the input data and intermediate calculation results of the current task stage in the executor JVM comprises:
linearly fitting, according to the data amount and the data structure type and in combination with the historical memory usage, the memory size to be occupied by the data;
and predicting, according to the memory usage state and the data caching requirements, the final storage location and the life cycle of the data objects in the JVM heap memory.
4. The method of claim 3, wherein linearly fitting the memory size to be occupied by the data according to the data amount and the data structure type and in combination with the historical memory usage comprises:
linearly fitting, according to the structure type of the current data and the amount of data to be processed, and in combination with relevant historical data and a memory occupancy model, the memory size occupied by the data object set after the current input data is loaded into the executor JVM;
and calculating, in combination with the current specific data operation and the Shuffle Write and Shuffle Read types, the memory size occupied by the data objects of the intermediate calculation results and by the cached data objects.
5. The method of claim 3, wherein predicting the final storage location and life cycle of the data objects in the JVM heap memory according to the memory usage state and the data caching requirements comprises:
judging the final cache location of the cached data set according to the code position of the data set's cache function, the computation dependency relationships, the cache level of the data set, and the data cache space size and usage specified by the big data processing framework, and deducing the survival time of the cached data in the in-heap memory managed by the GC algorithm;
and judging, according to the type of data operation, the type of Shuffle operation, the size of the data volume to be processed, the processing speed of the executor JVM, and the size and usage of the JVM data execution space, whether the input data and intermediate calculation results remain in the heap memory for a long time and whether a spill (overflow write) occurs.
6. A GC self-adaptive adjustment device oriented to a big data processing framework, adopting the method of any one of claims 1-5, characterized by comprising the following components:
a data information collector, responsible for collecting dynamic information related to memory use, divided into a framework data flow collector and a GC information collector; the framework data flow collector is responsible for collecting information related to the current computing task from the big data processing framework; the GC information collector is responsible for collecting the memory management information of the current executor from the executor JVM;
a data usage pattern analyzer, responsible for predicting, according to the information obtained by the data information collector, the memory usage of the input data and intermediate calculation results of the current task stage in the executor JVM;
and a dynamic GC parameter modifier, responsible for generating adaptive GC configuration parameters by combining the prediction results of the data usage pattern analyzer with the current memory state of the executor JVM obtained by the data information collector, and dynamically adjusting the GC configuration parameters while the executor JVM is running so as to improve the GC efficiency of the executor JVM.
7. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 5.
8. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 5.
CN202011472196.6A 2020-12-14 2020-12-14 GC self-adaptive adjustment method and device for big data processing framework Active CN112579259B (en)

Publications (2)
CN112579259A, published 2021-03-30
CN112579259B, granted 2022-07-15
