CN112579259A - GC self-adaptive adjusting method and device for big data processing framework - Google Patents


Info

Publication number: CN112579259A
Application number: CN202011472196.6A
Authority: CN (China)
Prior art keywords: data, memory, JVM, information, current
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Other versions: CN112579259B
Inventors: 黄涛, 许利杰, 王伟, 李慧, 汪钇丞
Assignee (original and current): Institute of Software of CAS
Application filed by Institute of Software of CAS

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45583 Memory management, e.g. access or allocation
    • G06F2009/45591 Monitoring or debugging support


Abstract

The invention relates to a GC adaptive tuning method and device for big data processing frameworks. Current job information and memory state information are collected from the big data framework and from each executor JVM, respectively, and the memory usage requirements of each processing stage of the big data application are predicted. Based on the prediction, the GC parameters of the executor JVM are adaptively adjusted according to logic rules, through an interface for dynamically modifying GC parameters at runtime. The invention adapts to the continuously changing memory usage characteristics of big data applications, reduces the GC trigger frequency and global pause time of the executor JVM, and improves JVM memory management efficiency in big data environments.

Description

GC self-adaptive adjusting method and device for big data processing framework
Technical Field
The invention relates to a GC adaptive tuning tool, and in particular to a method and device for dynamically adjusting the GC parameters of the JVM (Java Virtual Machine) in a big data processing framework, belonging to the field of software technology.
Background
With the rapid development and popularization of Internet technology, the volume of data generated by its users has grown explosively, creating an urgent need to process and store massive data. To meet this need, industry and academia have developed various scalable distributed big data processing frameworks over the last decade, and these are now widely deployed. Mainstream big data frameworks such as Spark and Flink are generally written in object-oriented languages such as Java and Scala; they use the Java Virtual Machine (JVM) as the runtime environment to execute the computing tasks of big data applications, and rely on the JVM's Garbage Collection (GC) mechanism to manage the data objects created in the JVM heap.
The GC algorithms of conventional JVMs are based on the weak generational hypothesis: most objects become unreachable soon after creation. These algorithms therefore use generational collection, dividing the JVM heap into a Young Generation and an Old Generation. The young generation is further subdivided into an Eden space and two Survivor spaces. Objects are initially allocated in the Eden space; when Eden is exhausted, a Minor GC is triggered, and the objects found live by reachability analysis are copied to a survivor space. When the number of Minor GCs an object has survived reaches a threshold, it is promoted to the old generation. When old-generation usage reaches a certain proportion, a Major GC is triggered to collect and compact the entire heap. To guarantee the consistency and safety of GC results, all GC algorithms have phases that must interrupt all working threads in a global pause (Stop-The-World, STW).
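The generational mechanics described above can be sketched as a toy model (an illustration of the promotion rule only, not the HotSpot implementation; all class and method names are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of generational promotion: objects age by one on each
 *  Minor GC they survive, and move to the old generation once their
 *  age reaches the tenuring threshold. Names are illustrative. */
public class GenerationalModel {
    static class Obj { int age = 0; }

    final int tenuringThreshold;
    final List<Obj> youngGen = new ArrayList<>();
    final List<Obj> oldGen = new ArrayList<>();

    GenerationalModel(int tenuringThreshold) {
        this.tenuringThreshold = tenuringThreshold;
    }

    void allocate(Obj o) { youngGen.add(o); } // new objects start in Eden

    /** One Minor GC: survivors age, old-enough survivors are promoted. */
    void minorGc(List<Obj> survivors) {
        youngGen.retainAll(survivors);        // non-survivors are reclaimed
        List<Obj> promoted = new ArrayList<>();
        for (Obj o : youngGen) {
            if (++o.age >= tenuringThreshold) promoted.add(o);
        }
        youngGen.removeAll(promoted);
        oldGen.addAll(promoted);
    }

    /** Helper: allocate one long-lived object, run `gcs` Minor GCs,
     *  and report how many objects ended up in the old generation. */
    static int promotedAfter(int threshold, int gcs) {
        GenerationalModel h = new GenerationalModel(threshold);
        Obj longLived = new Obj();
        h.allocate(longLived);
        for (int i = 0; i < gcs; i++) h.minorGc(List.of(longLived));
        return h.oldGen.size();
    }

    public static void main(String[] args) {
        System.out.println(promotedAfter(3, 3)); // 1: promoted on the 3rd GC
        System.out.println(promotedAfter(5, 3)); // 0: still in the young gen
    }
}
```

The model makes the later tuning rules concrete: a higher threshold keeps briefly-lived objects in the young generation, while a lower one moves long-lived objects to the old generation with fewer copies.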
Classical GC algorithms were not designed for the computational characteristics of big data processing frameworks. Results from industrial practice and academic evaluation show that when a JVM actually runs a big data application, it often suffers from long GC global pauses and high GC frequency; in some big data scenarios GC pause time even exceeds 50% of the application's total running time, severely affecting the throughput and latency of big data applications and becoming a performance bottleneck of the big data processing framework. The cause of this inefficiency is the changed memory usage pattern of big data applications: unlike traditional Java applications, which are "compute-intensive" and whose objects die young, big data applications are generally "data-intensive" and "memory-intensive". Large numbers of objects holding input data and intermediate results survive in the JVM heap for a long time; they go through the reachability analysis and object copying of many GCs without being reclaimed, wasting a large number of CPU time slices.
The default configuration of conventional GC algorithms — heap size, the ratio of old to young generations, the object promotion threshold, and the GC trigger threshold — cannot adapt to this change in memory usage pattern. Because a big data application uses memory differently at different processing stages, adaptive tuning of the GC algorithm based only on historical execution information does not carry over to the application's future processing stages. Users can adjust configuration parameters manually, but obtaining good results requires deep knowledge of the application's memory usage and of the GC algorithm, plus extensive comparative testing. Since the JVM provides no way to adjust parameters dynamically at runtime, manual tuning can only be applied statically at JVM startup and cannot adapt effectively to each processing stage of a big data application. Existing related research dynamically adjusts the partitioning of JVM heap memory at the big data framework layer, but such coarse-grained adjustment has limited effect. Research at the GC-algorithm level treats different kinds of objects differently according to lifetimes given by user annotations or inferred from historical statistics, but this requires substantial user effort or complex algorithmic machinery.
None of these methods achieves effective adaptive tuning of the GC algorithm under a big data processing framework with both low user burden and low computational complexity.
Disclosure of Invention
To address the adaptability problem of GC algorithms in big data processing frameworks, the present invention aims to provide a GC adaptive tuning method and device for big data processing frameworks.
The invention uses information collected from the big data framework at runtime to determine the approximate number and life cycle of data objects, and passes this information to the task executor JVM to adjust GC parameters dynamically at runtime, improving the efficiency of the GC mechanism. As shown in FIG. 1, the invention predicts the life cycle and memory usage of data objects from the data volume, execution flow, and other information of the big data application in the processing framework, derives adaptive GC configuration parameters, and passes them to the executor JVM for dynamic adjustment, so as to reduce the duration and frequency of GC global pauses.
The technical solution of the invention focuses on the 3 GC parameters with the greatest impact on the memory management efficiency of big data applications: the heap memory size (Heap Size), the ratio of old to young generations (New Ratio), and the object promotion threshold (Tenuring Threshold). Their meanings are as follows:
1. Heap memory size
The memory that the physical node of the big data cluster allocates to the heap of the executor JVM. It determines how much memory the JVM can use to store the object instances and arrays holding data to be processed and intermediate results, and is proportional to the execution memory available to each computing task thread in the JVM.
2. Ratio of old and young generations
The ratio of the sizes of the old and young generations within the JVM heap. Since the heap consists of only these two parts, their sizes trade off against each other. A larger old generation can hold more of the long-lived objects produced by the big data processing framework, reducing the trigger frequency of Major GCs; a larger young generation reduces the trigger frequency of Minor GCs, improving overall GC throughput.
3. Object promotion threshold
The number of GC cycles an object must survive in the young generation's survivor space before it is promoted to the old generation. A higher promotion threshold can reduce pressure on the old generation by keeping short-lived objects out of it. A lower promotion threshold can reduce the number of copies a long-lived object undergoes in the young generation before entering the old generation.
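In HotSpot these three parameters correspond to the startup flags -Xmx/-Xms (heap size), -XX:NewRatio (old-to-young ratio), and -XX:MaxTenuringThreshold (promotion threshold). A small helper shows how NewRatio partitions the heap; the 400MB heap with NewRatio=3 matches the worked example later in this document:

```java
/** HotSpot's NewRatio gives old:young, so young = heap/(NewRatio+1). */
public class HeapLayout {
    static long youngGenMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }
    static long oldGenMb(long heapMb, int newRatio) {
        return heapMb - youngGenMb(heapMb, newRatio);
    }
    public static void main(String[] args) {
        // e.g. -Xmx400m -XX:NewRatio=3 -XX:MaxTenuringThreshold=5
        System.out.println(youngGenMb(400, 3)); // 100 (1/4 of the heap)
        System.out.println(oldGenMb(400, 3));   // 300 (3/4 of the heap)
    }
}
```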
The technical scheme of the invention is as follows:
a GC self-adaptive adjusting method facing a big data processing framework comprises the following steps:
collecting dynamic information associated with memory use, including information about the current computing task from the big data processing framework and memory management information from each executor JVM;
predicting, from the collected information, the memory usage of the input data and intermediate results in the executor JVM during the current task stage;
generating adaptive GC configuration parameters from the predicted memory usage and the collected current memory state of the executor JVM, and dynamically adjusting the GC configuration parameters while the executor JVM is running, so as to improve its GC efficiency.
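The three steps above can be sketched as a minimal profile-analyze-modify skeleton (all types, names, and the per-record heuristic are illustrative assumptions, not the patent's implementation):

```java
/** Skeleton of the three-step loop: collect, predict, then modify,
 *  once per task stage. Types and heuristics are illustrative. */
public class AdaptiveGcLoop {
    record Profile(long recordCount, long freeHeapMb) {}
    record Prediction(double footprintMb, boolean exceedsFreeHeap) {}

    /** Step 2: predict memory usage from collected task information. */
    static Prediction analyze(Profile p, double mbPerRecord) {
        double mb = p.recordCount() * mbPerRecord;
        return new Prediction(mb, mb > p.freeHeapMb());
    }

    public static void main(String[] args) {
        Profile info = new Profile(1000, 80);  // step 1: collected info
        Prediction pred = analyze(info, 0.1);  // step 2: predict usage
        // step 3: a real Modifier would now push adapted GC parameters
        System.out.println(pred.footprintMb()); // 100.0
    }
}
```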
A GC adaptive tuning device for big data processing frameworks mainly comprises the following 3 functional modules: a data information collector (Profiler), a data usage pattern analyzer (Analyzer), and a GC parameter dynamic modifier (Modifier).
1. Data information collector
In the method of the invention, the data information collector is divided into a framework data stream collector and a GC information collector, which collect dynamic information related to memory use from the big data processing framework and from each executor JVM, respectively. The framework data stream collector is responsible for collecting:
(1) Operation information: the current code position and data operation type, the caching status of the data set (whether cached, cache level, cache dependencies), and the types of Shuffle Write and Shuffle Read.
(2) Data information: the data structure type of the data currently being operated on, and the amount of data assigned to each executor JVM.
(3) Configuration information: the coarse-grained memory partitioning of the current big data processing framework, i.e. the proportions of executor JVM memory the framework designates for data set caching, Shuffle computation, and user code execution.
The GC information collector is responsible for collecting:
(1) Memory information: the heap size of the current executor JVM, the sizes and ratio of the old and young generations, the object promotion threshold, the GC trigger threshold, the used size of each generation, and the age distribution of objects.
(2) Log information: GC information recorded so far, such as the trigger reason of each GC, the amount of space and number of objects reclaimed, and the time spent in each phase.
2. Data usage pattern analyzer
In the method of the invention, the data usage pattern analyzer predicts the memory usage of the input data and intermediate results in the executor JVM during the current task stage from the information obtained by the data information collector, as shown in FIG. 2. Specifically:
(1) Memory footprint:
linearly fitting the memory occupied by the set of data objects once the current input data is loaded into the executor JVM, based on the structure type of the current data and the volume of data to be processed, combined with historical data and a memory-footprint model;
and, from the current data operation and the Shuffle Write and Shuffle Read types, computing the memory required by the data objects of intermediate results and by cached data objects.
(2) Object life cycle:
determining the final cache location of a cached data set (in-heap memory, off-heap memory, or local disk) from the code positions of the cache functions persist(), unpersist(), and cache(), the computational dependencies of the data set, the cache level, and the size and usage of the data cache space designated by the big data processing framework, and from this inferring how long the cached data survives in the in-heap memory managed by the GC algorithm;
determining, from the data operation type, the Shuffle operation type, the volume of data to be processed, the JVM's processing speed, and the size and usage of the JVM's data execution space, whether the input data and intermediate results must be held in heap memory for a long period, and whether a spill to disk will occur.
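The two predictions above can be caricatured in code: a least-squares fit of memory footprint against record count from historical runs, followed by a spill judgment that compares the predicted footprint with the execution space granted to the executor JVM. The per-record model and all figures are assumptions, not the patent's actual model:

```java
/** Illustrative Analyzer sketch: fit footprint (MB) vs. record count
 *  from history, then a crude spill check against execution space. */
public class UsageAnalyzer {
    /** Least squares: returns {slope, intercept} of mb = slope*records + intercept. */
    static double[] fit(long[] records, double[] mb) {
        int n = records.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += records[i]; sy += mb[i];
            sxx += (double) records[i] * records[i];
            sxy += records[i] * mb[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        return new double[] { slope, (sy - slope * sx) / n };
    }
    static double predictMb(double[] model, long records) {
        return model[0] * records + model[1];
    }
    /** A spill is expected when predicted data exceeds execution space. */
    static boolean willSpill(double predictedMb, double execSpaceMb) {
        return predictedMb > execSpaceMb;
    }
    public static void main(String[] args) {
        // hypothetical history: roughly 0.1 MB per record
        double[] m = fit(new long[] {100, 500, 1000},
                         new double[] {10, 50, 100});
        double predicted = predictMb(m, 3000);         // 300 MB expected
        System.out.println(willSpill(predicted, 120)); // true: 300 > 120
    }
}
```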
3. GC parameter dynamic modifier
In the invention, the GC parameter dynamic modifier combines the data set sizes and life cycle characteristics predicted by the data usage pattern analyzer with the current memory state of the executor JVM obtained by the data information collector to generate adaptive GC configuration parameters, and dynamically adjusts the GC configuration parameters while the executor JVM runs.
The decision flow is shown in FIG. 3 and FIG. 4; the logic rules for GC parameter modification are as follows.
Rule 1: for a cache operation, if the cache location of the data set determined by the data usage pattern analyzer is off-heap memory or the local disk, raise the object promotion threshold and increase the proportion of the young generation.
Explanation 1: once data objects cached to off-heap memory or the local disk have been computed and written to their destination, the original data can be cleared from the heap. Raising the object promotion threshold prevents these briefly-lived data objects from entering the old generation; increasing the proportion of the young generation additionally reduces the trigger frequency of Minor GCs and improves GC throughput.
Rule 2: for a cache operation, if the cache location of the data set determined by the data usage pattern analyzer is in-heap memory, lower the object promotion threshold.
Explanation 2: cached data objects survive in the heap for a long time; lowering the object promotion threshold avoids the repeated copying these long-lived data objects would otherwise undergo before promotion to the old generation.
Rule 3: for a cache operation, if the cache level of the data set prefers in-heap memory and the current heap, after GC parameter adjustment, can potentially hold the entire data set, increase the heap size and enlarge the proportion of the old generation.
Explanation 3: if the data set dependencies indicate that all existing cached data sets will be used in the future, the heap size should be increased and the old-generation proportion enlarged to avoid triggering Major GCs, so that cached data sets are evicted as little as possible.
Rule 4: for a cache operation, if the cache level of the data set prefers in-heap memory but the potential of the heap and the physical node's memory is insufficient to hold the entire cached data set, lower the object promotion threshold.
Explanation 4: when memory cannot satisfy the cache demand and evicting part of the cached data set is unavoidable, lowering the object promotion threshold promotes the current data set to the old generation sooner, so that a Major GC can be triggered to clean up the evicted data as early as possible.
Rule 5: for a data operation, if the data usage pattern analyzer determines that the objects created by the current operation are short-lived, increase the proportion of the young generation and raise the object promotion threshold.
Explanation 5: the current operation likely processes little data with low computational complexity; the physical node processes it quickly and the data objects need not be kept long. Increasing the young generation's proportion and raising the promotion threshold keep these data objects from occupying old-generation space.
Rule 6: for a data operation, if the data usage pattern analyzer determines that the objects created by the current operation are long-lived, increase the proportion of the old generation and increase the heap size.
Explanation 6: the current operation may involve a Shuffle, and its data objects may need to be kept for a long time. Enlarging the old generation and the heap avoids, as far as possible, the performance loss caused by Major GCs and data spills.
Rule 7: for a data operation, if the data usage pattern analyzer judges that a spill cannot be avoided, raise the object promotion threshold while enlarging the heap and the old generation.
Explanation 7: spill operations incur significant performance loss; raising the object promotion threshold keeps more Shuffle data in the heap, reducing the frequency and proportion of spilled data.
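Rules 1-7 amount to a lookup from the analyzer's verdict (cache location, object lifetime, spill risk) to a set of parameter deltas. A hedged sketch of such a rule table follows; the enum, record, and delta values are invented for illustration and are not the patent's API:

```java
/** Sketch of the Modifier's rule table (Rules 1-7). The enum names,
 *  deltas, and Adjustment record are illustrative assumptions. */
public class GcRuleEngine {
    enum CacheLocation { OFF_HEAP_OR_DISK, IN_HEAP }
    record Adjustment(int heapDeltaMb, double oldGenRatioDelta,
                      int tenuringDelta) {}

    /** Rule 1 vs. Rule 2: where will the cached data set end up? */
    static Adjustment forCache(CacheLocation loc) {
        return loc == CacheLocation.OFF_HEAP_OR_DISK
            ? new Adjustment(0, -0.1, +3)  // Rule 1: grow young gen, raise threshold
            : new Adjustment(0, 0.0, -3);  // Rule 2: lower threshold
    }

    /** Rule 5 vs. Rules 6/7: lifetime of the operation's objects. */
    static Adjustment forOperation(boolean longLived, boolean willSpill) {
        if (!longLived) return new Adjustment(0, -0.1, +3); // Rule 5
        return willSpill
            ? new Adjustment(+100, +0.1, +3) // Rule 7: also keep shuffle data in heap
            : new Adjustment(+100, +0.1, 0); // Rule 6
    }

    public static void main(String[] args) {
        // an in-heap cache lowers the tenuring threshold (Rule 2)
        System.out.println(forCache(CacheLocation.IN_HEAP).tenuringDelta()); // -3
    }
}
```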
The magnitude of each parameter adjustment is determined as follows:
(1) Heap size: the adjustment range is determined by the free memory of the physical node hosting the JVM and the computed total extra memory required by all computing task threads.
(2) Old-to-young generation ratio: the adjustment range is determined by the computed old-generation footprint the computing task requires and the fraction of all objects that are long-lived.
(3) Object promotion threshold: the adjustment magnitude is determined by the young generation's size, the inferred life cycle of the data objects, and the historical object promotion rate.
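Criterion (1) can be expressed as a simple clamp: the proposed heap size may not exceed what the physical node can spare after reserving the extra memory the task threads need. A sketch under assumed figures:

```java
/** Clamp a proposed heap size to what the physical node can spare,
 *  per criterion (1) above. All figures are illustrative. */
public class AdjustmentBounds {
    static long boundedHeapMb(long proposedMb, long currentMb,
                              long freePhysicalMb, long taskExtraMb) {
        // ceiling: current heap plus free node memory, minus what the
        // task threads need outside the heap; never shrink below current
        long ceiling = currentMb + freePhysicalMb - taskExtraMb;
        return Math.min(proposedMb, Math.max(currentMb, ceiling));
    }
    public static void main(String[] args) {
        // want 550MB; node has 200MB free, tasks need 50MB outside the heap
        System.out.println(boundedHeapMb(550, 400, 200, 50)); // 550
        System.out.println(boundedHeapMb(700, 400, 200, 50)); // capped at 550
    }
}
```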
The tool modifies the OpenJDK source code to provide an interface for dynamically adjusting GC parameters at runtime. The memory of the entire physical node is declared at JVM startup, allowing the actually used heap size to grow and shrink; a movable boundary between the young and old generations allows their ratio to be adjusted at runtime; and a dedicated method added to the VM allows the object promotion threshold to be set dynamically. Before each computing task executes, the adaptive GC parameter values produced by the three modules are applied by calling the corresponding interfaces.
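Stock JVMs expose no such runtime setters, so the interface below is only a hypothetical rendering of what the modified OpenJDK described here might offer; the in-memory implementation exists purely to show the call sequence before a task runs:

```java
/** Hypothetical runtime-tuning interface of the modified OpenJDK.
 *  Stock JVMs do NOT expose these operations; this only illustrates
 *  the shape of the interface the patent describes. */
public class DynamicGcTunerDemo {
    interface DynamicGcTuner {
        void setHeapSizeMb(long mb);                     // grow/shrink heap
        void setOldYoungBoundary(double oldGenFraction); // movable boundary
        void setTenuringThreshold(int threshold);        // added VM method
    }

    /** In-memory stand-in used here only to record the call sequence. */
    static class RecordingTuner implements DynamicGcTuner {
        long heapMb; double oldFrac; int tenuring;
        public void setHeapSizeMb(long mb) { heapMb = mb; }
        public void setOldYoungBoundary(double f) { oldFrac = f; }
        public void setTenuringThreshold(int t) { tenuring = t; }
    }

    public static void main(String[] args) {
        RecordingTuner t = new RecordingTuner();
        // before a task runs, the Modifier pushes the adapted parameters:
        t.setHeapSizeMb(500);
        t.setOldYoungBoundary(0.8);
        t.setTenuringThreshold(5);
        System.out.println(t.heapMb); // 500
    }
}
```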
Compared with the prior art, the GC adaptive tuning method and device for big data processing frameworks provided by the invention have the following advantages:
(1) The invention provides a method for dynamically adjusting GC parameters at runtime, designed specifically for the big data processing framework scenario; compared with static, fixed parameter tuning, it adapts better to the memory usage characteristics of big data applications at different stages.
(2) The invention infers the memory usage pattern of a computing task from historical data and memory usage information combined with the data operation and data volume about to execute; compared with tuning based on history alone, this better matches the memory demand of the near future.
(3) The invention designs detailed GC parameter adjustment rules, providing a tuning strategy for each combination of memory state and memory demand, so that the executor JVM achieves higher memory management efficiency across a variety of scenarios.
Drawings
FIG. 1 is a block diagram of the GC adaptive tuning tool for big data processing frameworks of the present invention;
FIG. 2 is a flow chart of predicting the life cycle of data objects in the executor JVM heap according to the present invention;
FIG. 3 is a flow chart of the GC parameter adjustment rules for cache operations according to the present invention;
FIG. 4 is a flow chart of the GC parameter adjustment rules for Shuffle operations according to the present invention;
FIG. 5 is a flow chart of an implementation of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments and the accompanying drawings.
The GC adaptive tuning method provided by the invention is applied to the Spark big data processing framework, taking the representative big data application PageRank as an example. The specific implementation steps are as follows; the flow chart is shown in FIG. 5.
Spark generates the logical processing flow and physical execution plan for PageRank. According to the processing flow, Spark's first data operation passes the edges of the input graph data set through map() to obtain <user, follower> records, at which point the tool of the invention starts working:
1. The data information collector collects relevant information.
The framework data stream collector collects:
(1) Operation information: map, Shuffle Write (BypassMergeSortShuffleWriter)
(2) Data information: 1000 edges
(3) Configuration information:
User code space assigned by Spark: 40%
Framework memory space: 60%
Framework execution space: 60% × 50% = 30%
Data cache space: 60% × 50% = 30%
Off-heap memory size: 200MB
The GC information collector collects:
(4) Memory information:
Heap size at executor JVM startup: 400MB
Young generation size: 400MB × 1/4 = 100MB
Eden space: 100MB × 4/5 = 80MB
Survivor spaces: 100MB × 1/10 = 10MB, × 2
Old generation size: 400MB × 3/4 = 300MB
Object promotion threshold: 5
(5) Log information:
Garbage collector in use: Parallel Scavenge
2. The data usage pattern analyzer computes from the information collected by the data information collector:
(1) Memory footprint: the input data objects occupy 100MB of memory after entering the executor.
(2) Object life cycle: the data objects formed are short-lived.
3. The GC parameter dynamic modifier decides based on the memory characteristics computed by the data usage pattern analyzer.
According to Rule 5, short-lived objects are kept out of the old generation:
Young generation proportion raised to: 400MB × 1/2 = 200MB
Survivor space proportion raised to: 200MB × 1/4 = 50MB, × 2
Object promotion threshold raised to: 10
Next, the <user, follower> records are processed by a Reduce operation to obtain <user, list(follower)>, which is cached in heap memory for a long time as the input of subsequent iterative computation.
1. The data information collector collects relevant information.
The framework data stream collector collects:
(1) Operation information: reduceByKey
Shuffle Read (BlockStoreShuffleReader)
persist(MEMORY_AND_DISK)
(2) Data information: information on 1000 edges; 200 users and their followers
(3) Configuration information:
User code space assigned by Spark: 40%
Framework memory space: 60%
Framework execution space: 60% × 50% = 30%
Data cache space: 60% × 50% = 30%
Off-heap memory size: 200MB
The GC information collector collects:
(4) Memory information:
Heap size at executor JVM startup: 400MB
Young generation size: 400MB × 1/2 = 200MB, 120MB used
Eden space: 200MB × 1/2 = 100MB, 75MB used
Survivor spaces: 200MB × 1/4 = 50MB, × 2, 45MB used
(object age distribution over 5 ages: 1:2:2:2:3)
Old generation size: 400MB × 1/2 = 200MB, 0MB used
Object promotion threshold: 10
(5) Log information:
The Eden space has filled 5 times, triggering 5 Minor GCs that reclaimed 10000 objects, with a total pause time of 0.5s
2. The data usage pattern analyzer computes from the information collected by the data information collector:
(1) Memory footprint: the shuffle and cached data will occupy 300MB of memory.
(2) Object life cycle: the cached data objects are long-lived.
3. The GC parameter dynamic modifier decides based on the memory characteristics computed by the data usage pattern analyzer.
According to Rule 3 and Rule 6:
Heap size increased to 500MB
Old generation proportion raised to: 500MB × 4/5 = 400MB
Object promotion threshold lowered to: 5
Subsequently, <user, list(follower)> and rank information are joined to obtain <user, (list(follower), rank)>, and the Cartesian product of each (list(follower), rank) is computed.
1. The data information collector collects relevant information.
The framework data stream collector collects:
(1) Operation information: join, flatMap
(2) Data information: 200 users and their followers, ranking information of 200 users, generating ranking contributions for 250 users.
(3) Configuration information:
User code space assigned by Spark: 40%
Framework memory space: 60%
Framework execution space: 60% × 20% = 12%
Data cache space: 60% × 80% = 48%
Off-heap memory size: 100MB
The GC information collector collects:
(4) Memory information:
Heap size at executor JVM startup: 500MB
Young generation size: 500MB × 1/5 = 100MB, 50MB used
Eden space: 100MB × 1/2 = 50MB, 15MB used
Survivor spaces: 100MB × 1/4 = 25MB, × 2, 35MB used
(object age distribution over 5 ages: 3:2:1:1:1)
Old generation size: 500MB × 4/5 = 400MB, 300MB used
Object promotion threshold: 5
(5) Log information:
The Eden space has filled 12 times, triggering 12 Minor GCs that reclaimed 20000 objects, with a total pause time of 1s
4. The data usage pattern analyzer computes from the information collected by the data information collector:
(1) Memory footprint: the shuffle and cached data will occupy 200MB of memory.
(2) Object life cycle: the shuffle objects are long-lived.
5. The GC parameter dynamic modifier decides based on the memory characteristics computed by the data usage pattern analyzer.
According to rule 6, rule 7:
increasing heap memory size to 550MB
The proportion of the old generation is increased to: 550MB 4/5-440 MB
Increase of subject promotion threshold: 10
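Expressed as standard HotSpot flags, the adjusted values above correspond to -Xmx550m (heap size), -XX:NewRatio=4 (old:young = 4:1, i.e. the old generation is 4/5 of the heap), and -XX:MaxTenuringThreshold=10. The sketch below composes such a flag string; note that heap size and generation ratios cannot be changed inside a live JVM, so a tool like the one described would apply them when (re)launching executor JVMs:

```python
heap_mb = 550
new_ratio = 4                 # -XX:NewRatio=4 -> old generation = 4/5 of the heap
tenuring = 10                 # -XX:MaxTenuringThreshold=10

old_gen_mb = heap_mb * new_ratio // (new_ratio + 1)   # 550 * 4/5 = 440MB
flags = f"-Xmx{heap_mb}m -XX:NewRatio={new_ratio} -XX:MaxTenuringThreshold={tenuring}"
print(flags)
```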
The subsequent PageRank processing flow iterates repeatedly, and before each run the tool adjusts the GC parameters following a flow similar to the one above. In this way, the executor JVM under the big data processing framework Spark reduces both the number of GC triggers and the GC pause time, improving the execution efficiency of the big data application.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A GC adaptive adjustment method oriented to a big data processing framework, characterized by comprising the following steps:
collecting dynamic information associated with memory use, including collecting information related to the current computing task from the big data processing framework, and collecting memory management information of the current executor from the executor JVM;
predicting, according to the collected information, the memory usage of the input data and intermediate computation results of the current task stage in the executor JVM (Java virtual machine);
and generating adaptive GC configuration parameters according to the predicted memory usage of the input data and intermediate computation results in the executor JVM at the current task stage and the collected current memory state of the executor JVM, and dynamically adjusting the GC configuration parameters while the executor JVM is running, so as to improve the GC efficiency of the executor JVM.
2. The method of claim 1, wherein the information related to the current computing task comprises operation information, data information and configuration information; and the memory management information of the current executor comprises memory state information and GC log information.
3. The method of claim 1, wherein predicting the memory usage of the input data and intermediate computation results of the current task stage in the executor JVM comprises:
linearly fitting, according to the data volume and the data structure type and in combination with historical memory usage, the memory size to be occupied by the data;
and predicting the final storage location and the life cycle of the data objects in the JVM heap memory according to the memory usage state and the data caching requirements.
4. The method of claim 3, wherein linearly fitting the memory size to be occupied by the data according to the data volume and the data structure type and in combination with historical memory usage comprises:
linearly fitting, according to the structure type of the current data and the data volume to be processed, and in combination with relevant information of historical data and a memory occupancy model, the memory size occupied by the data object set after the current input data is loaded into the executor JVM;
and calculating, in combination with the current specific data operation and the Shuffle Write and Shuffle Read types, the memory size occupied by the data objects of intermediate computation results and by cached data objects.
5. The method of claim 3, wherein predicting the final storage location and life cycle of the data objects in the JVM heap memory according to the memory usage state and data caching requirements comprises:
judging the final cache location of a cached data set according to the code position of the data set's cache function, the computation dependencies and cache level of the data set, and the data cache space size and usage specified by the big data processing framework, and inferring the survival time of the cached data in the in-heap memory managed by the GC algorithm;
and judging, according to the type of data operation, the type of Shuffle operation, the size of the data volume to be processed, the processing speed of the JVM, and the size and usage of the JVM data execution space, whether the input data and intermediate computation results need to remain in the heap memory for a long time and whether spill (overflow) writes will occur.
6. The method of claim 1, wherein the dynamic adjustment while the executor JVM is running is a dynamic adjustment of GC configuration parameters for the memory usage characteristics of big data applications, using the following rules:
rule 1: for a cache operation, if the cache location of the data set determined by the data usage pattern analyzer is off-heap memory or a local disk, increase the promotion threshold of objects and increase the proportion of the young generation;
rule 2: for a cache operation, if the cache location of the data set determined by the data usage pattern analyzer is in-heap memory, decrease the promotion threshold of objects;
rule 3: for a cache operation, if the cache level of the data set is in-heap-memory-first and the current in-heap memory can potentially hold the whole set after GC parameter adjustment, increase the size of the in-heap memory and expand the proportion of the old generation;
rule 4: for a cache operation, if the cache level of the data set is in-heap-memory-first, but the in-heap memory and the physical node memory cannot potentially hold the whole cached data set, decrease the promotion threshold of objects;
rule 5: for a data operation, if the data usage pattern analyzer judges that the objects generated by the current operation are short-lived, increase the proportion of the young generation and increase the object promotion threshold;
rule 6: for a data operation, if the data usage pattern analyzer judges that the objects generated by the current operation are long-lived, increase the proportion of the old generation and increase the size of the heap memory;
rule 7: for a data operation, if the data usage pattern analyzer judges that spill (overflow) writes cannot be avoided, expand the heap memory and the old generation while increasing the promotion threshold of objects.
7. The method of claim 6, wherein the GC configuration parameter adjustment is performed according to the following steps:
the heap memory size determines its adjustment range according to the free memory of the physical node where the JVM is located and the calculated total extra memory required by all computing task threads;
the proportion of the old and young generations determines its adjustment range according to the calculated old-generation size required by the computing task and the proportion of long-lived objects among all objects;
and the object promotion threshold determines its adjustment magnitude according to the size of the young generation, the inferred life cycle of the data objects, and the historical object promotion rate.
8. A GC adaptive adjustment device oriented to a big data processing framework, adopting the method of any one of claims 1 to 7, characterized by comprising the following components:
a data information collector, responsible for collecting dynamic information related to memory use, divided into a framework data stream collector and a GC information collector; the framework data stream collector is responsible for collecting information related to the current computing task from the big data processing framework; and the GC information collector is responsible for collecting the memory management information of the current executor from the executor JVM;
a data usage pattern analyzer, responsible for predicting, according to the information obtained by the data information collector, the memory usage of the input data and intermediate computation results of the current task stage in the executor JVM;
and a GC parameter dynamic modifier, responsible for generating adaptive GC configuration parameters by combining the prediction result of the data usage pattern analyzer with the current memory state of the executor JVM obtained by the data information collector, and for dynamically adjusting the GC configuration parameters while the executor JVM is running, so as to improve the GC efficiency of the executor JVM.
9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.
CN202011472196.6A 2020-12-14 2020-12-14 GC self-adaptive adjustment method and device for big data processing framework Active CN112579259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011472196.6A CN112579259B (en) 2020-12-14 2020-12-14 GC self-adaptive adjustment method and device for big data processing framework


Publications (2)

Publication Number Publication Date
CN112579259A true CN112579259A (en) 2021-03-30
CN112579259B CN112579259B (en) 2022-07-15

Family

ID=75136205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011472196.6A Active CN112579259B (en) 2020-12-14 2020-12-14 GC self-adaptive adjustment method and device for big data processing framework

Country Status (1)

Country Link
CN (1) CN112579259B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766123A (en) * 2017-10-11 2018-03-06 郑州云海信息技术有限公司 A kind of JVM tunings method
US20180276117A1 (en) * 2017-03-21 2018-09-27 Linkedin Corporation Automated virtual machine performance tuning
CN110888712A (en) * 2019-10-10 2020-03-17 望海康信(北京)科技股份公司 Java virtual machine optimization method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
廖旺坚 et al.: "Spark并行计算框架的内存优化" [Memory optimization of the Spark parallel computing framework], 《计算机工程与科学》 [Computer Engineering and Science] *
胡振宇 et al.: "基于程序分析的大数据应用内存预估方法" [A program-analysis-based memory estimation method for big data applications], 《中国科学》 [Scientia Sinica] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816760A (en) * 2022-05-13 2022-07-29 兰考堌阳医院有限公司 Interactive nursing billboard system and storage medium
CN114816760B (en) * 2022-05-13 2023-04-28 兰考堌阳医院有限公司 Interactive nursing billboard system and storage medium
US11972242B2 (en) 2022-07-26 2024-04-30 Red Hat, Inc. Runtime environment optimizer for JVM-style languages
CN116089319A (en) * 2022-08-30 2023-05-09 荣耀终端有限公司 Memory processing method and related device
CN116089319B (en) * 2022-08-30 2023-10-31 荣耀终端有限公司 Memory processing method and related device


Similar Documents

Publication Publication Date Title
CN112579259B (en) GC self-adaptive adjustment method and device for big data processing framework
JP4079684B2 (en) Heap memory management method and computer system using the same
US10802718B2 (en) Method and device for determination of garbage collector thread number and activity management in log-structured file systems
US7779054B1 (en) Heuristic-based resumption of fully-young garbage collection intervals
Zhou et al. Second-level buffer cache management
US10235044B2 (en) System and methods for storage data deduplication
JP2006092532A (en) Increasing data locality of recently accessed resource
CN103631730A (en) Caching optimizing method of internal storage calculation
CN112015765B (en) Spark cache elimination method and system based on cache value
Villalba et al. Constant-time sliding window framework with reduced memory footprint and efficient bulk evictions
Zhang et al. Program-level adaptive memory management
Itshak et al. AMSQM: adaptive multiple super-page queue management
Kim et al. $ ezswap $: Enhanced compressed swap scheme for mobile devices
US20050066305A1 (en) Method and machine for efficient simulation of digital hardware within a software development environment
CN112597076B (en) Spark-oriented cache replacement method and system based on data perception
Zhu et al. MCS: memory constraint strategy for unified memory manager in spark
CN110908771A (en) Memory management method of intelligent contract based on JAVA
CN103970679A (en) Dynamic cache pollution prevention system and method
KR102031490B1 (en) Apparatus and method for prefetching
JP5577518B2 (en) Memory management method, computer and memory management program
US11403232B2 (en) Sequence thrashing avoidance via fall through estimation
KR102168464B1 (en) Method for managing in-memory cache
Wu Ordering functions for improving memory reference locality in a shared memory multiprocessor system
CN116501660A (en) Spark-oriented automatic caching method and device
CN118093055A (en) Front-end lazy loading optimization method, device and medium based on dynamic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant