CN116737440A - Concurrent error detection method and system for Arm architecture branch record buffer

Info

Publication number: CN116737440A
Application number: CN202310690059.7A
Authority: CN
Inventors: 宁振宇; 刘佳禄; 杨祺浩; 胡玉鹏
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2023-06-12
Filing date: 2023-06-12
Publication date: 2023-09-12

Abstract

The application discloses a concurrent error detection method and system for the Arm architecture branch record buffer. An event counter of a performance monitoring unit counts the number of recorded branch records, and when the count value of the event counter reaches a threshold, the performance monitoring unit triggers an interrupt handler. The latest branch record in the branch record buffer is then read; the last position to which the program successfully jumped is the position where the concurrency error occurred. The application uses a hardware mechanism to record the program execution path, so the program source code does not need to be modified. The execution flow is recovered from the branch records of branch instructions, which are captured by the branch record buffer, a hardware component; because context switching and online reading of the branch record buffer are cheap, the overhead is almost zero.

Description

Concurrent error detection method and system for Arm architecture branch record buffer

Technical Field

The application relates to the field of concurrent defect detection, in particular to a concurrent error detection method and a concurrent error detection system for an Arm architecture branch record buffer.

Background

Concurrency is an important feature of modern computers that enables them to perform complex tasks more efficiently and improves overall performance. Concurrency provides the ability to perform multiple tasks at the same time, which also improves the reliability and scalability of the system. Introducing concurrency helps developers handle complex tasks and problems, but it has a downside. Although concurrent programming is essential for achieving better performance in the multi-core era, the nondeterminism of multi-threaded program execution makes it highly susceptible to serious concurrency errors and also makes those errors difficult to detect. A concurrency error is an error that occurs in a concurrent system in which multiple processes or threads access a shared resource at the same time. When multiple processes or threads access a shared resource at the same time, the shared resource may be modified without proper synchronization, resulting in unexpected behavior or a program crash. Concurrency errors manifest themselves in various forms, such as race conditions, deadlocks, livelocks, and thread starvation. Once a concurrency error occurs, the program or system may exhibit unpredictable behavior or produce erroneous results, which can seriously impair system usability and even cause significant economic loss. More importantly, because the concurrent execution of multiple threads is nondeterministic, the cause of a concurrency error tends to be harder to ascertain than the cause of other types of errors. How to avoid and correctly resolve concurrency errors is a major difficulty for developers.

Since a concurrency error can have a serious negative impact on the usability and reliability of the system, the error must be discovered and resolved in a timely manner. There are several approaches to the concurrency problem, such as lock mechanisms, semaphores, and atomic operations. Each of these methods has advantages and disadvantages, and a solution must be selected according to the specific scenario. For example, a lock mechanism is a synchronization mechanism that prevents multiple threads from accessing the same resource at the same time: the lock is set when a thread acquires it and cleared when the thread releases it, so that only one thread can access the resource at a time and the concurrency error is avoided. However, a lock mechanism may cause a deadlock, in which multiple threads are blocked waiting for each other to release a lock. When many threads contend for the same lock, performance problems may also arise. An atomic operation is an uninterruptible operation that cannot be interrupted by other threads during execution, thereby avoiding concurrency errors. Unfortunately, atomic operations can only resolve concurrency errors of the data race type; concurrency errors of the deadlock type require other solutions. In addition, atomic operations may also lead to performance problems. All of the above methods require careful code design to avoid deadlocks, race conditions, and similar problems, which in turn increases programming complexity and maintenance costs. The current solutions therefore have clear disadvantages and cannot solve these difficult problems both effectively and efficiently.

In addition to the techniques mentioned above, common techniques used in computer systems to resolve concurrency errors also include rollback and re-execution. This technique involves detecting the occurrence of a concurrency error, eliminating the effect of any instructions executed after the erroneous instruction, and then re-executing from a point prior to the erroneous instruction. When an error occurs during program execution, the program may be rolled back and re-executed from the previous correct checkpoint. Because concurrency errors are nondeterministic, re-execution may avoid the failure. When multiple transactions operate on the same data at the same time, a concurrency conflict may occur, for example two transactions modifying the same data at the same time, and this problem can be solved by rollback and re-execution. Although rollback and re-execution can effectively reduce the likelihood of concurrency errors, it incurs significant overhead in performance and system resources and does not address all concurrency errors, e.g., deadlocks and lost updates.

To resolve a concurrency error, the error must first be detected and its cause ascertained. One approach to detecting concurrency errors used in academia and industry today is dynamic analysis, such as data race detection and model checking. Data race detection analyzes the execution of a program to identify instances where multiple threads access the same shared resource without proper synchronization. Model checking systematically explores all possible program executions to identify potential concurrency errors. Both techniques can effectively detect concurrency errors, but the overhead of monitoring and analysis may slow down the software system. Another approach is static analysis, which analyzes the source code of the software system to identify potential concurrency issues such as race conditions or deadlocks. Static analysis tools scan code for specific patterns or constructs that can lead to concurrency errors and provide feedback that helps developers solve these problems. Static analysis techniques may be more efficient than dynamic analysis techniques, but may produce false positives when detecting subtle concurrency errors.

In addition to the two analysis methods above, concurrency errors can be detected by actually recording and tracking the behavior of the software system. Last Branch Record (LBR) on the x86 architecture is a hardware feature that records information about the branch instructions executed by a processor. Several researchers have investigated detecting concurrency errors by using the LBR registers. For example, LBR-RECORD uses the LBR registers to detect race conditions in Linux kernel code. By analyzing the memory records in the LBR registers, LBR-RECORD can detect race conditions that occur when multiple threads access shared resources without proper synchronization.

The Arm architecture has become increasingly popular in recent years, especially in the mobile device and Internet of Things (IoT) fields, by virtue of its power efficiency, scalability, low cost, and high performance. The reliability of Arm architecture machines is therefore becoming particularly important. The Armv9 architecture was published by Arm in 2021. While remaining compatible with the Armv8 architecture, Armv9 adds some new features, including the Branch Record Buffer Extension (BRBE) (Martin Weidmann, director Product Management, ATG ARM, LVC20-214 https:// static.). The Branch Record Buffer (BRB) is a hardware component of the Arm processor that captures the control path history at low cost. Specifically, the address of the branch instruction is recorded in the BRBSRC register, the target address of the branch is recorded in the BRBTGT register, and information such as the type of the branch instruction, whether the jump is valid, and the exception level of the target address is recorded in the BRBINF register, as shown in FIG. 1. In the prior art, there is no research on the Arm architecture that uses hardware features to detect concurrency errors.
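As a rough illustration of the three registers described above, the following Python sketch models a single branch record. The class name `BranchRecord` and its field names are hypothetical and chosen only for readability; the sketch does not reproduce the actual bit layout of BRBSRC, BRBTGT, or BRBINF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchRecord:
    """Illustrative model of one branch record (not the real register layout)."""
    source: int        # address of the branch instruction (held in BRBSRC)
    target: int        # target address of the branch (held in BRBTGT)
    branch_type: str   # type of the branch instruction (held in BRBINF)
    valid: bool        # whether the jump is valid (held in BRBINF)
    target_el: int     # exception level of the target address (held in BRBINF)


# Example values are made up for illustration.
record = BranchRecord(source=0x400100, target=0x400200,
                      branch_type="conditional", valid=True, target_el=0)
print(f"{record.source:#x} -> {record.target:#x} ({record.branch_type})")
```

Keeping the record immutable (`frozen=True`) reflects that a captured record is a snapshot of past control flow and is only read afterwards.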

Disclosure of Invention

Aiming at the defects of the prior art, the application provides a concurrent error detection method and a concurrent error detection system for an Arm architecture branch record buffer area, which reduce the cost.

In order to solve the technical problems, the application adopts the following technical scheme: a concurrent error detection method for Arm architecture branch record buffers, the method comprising:

S1, counting the number of recorded branch records by using an event counter of a performance monitoring unit, and triggering an interrupt handler by the performance monitoring unit if the count value of the event counter reaches a threshold;

S2, reading the latest branch record in the branch record buffer, wherein the last position to which the program successfully jumped is the position where the concurrency error occurred.

The Performance Monitoring Unit (PMU) in the Arm architecture is a hardware component used to collect and monitor performance data of a processor. Unlike the branch record buffer, the performance monitoring unit can raise interrupts. The PMU provides a set of special registers for measuring and analyzing various performance indicators of the processor. Typically, a programmer uses the performance monitoring unit only to count CPU-related events (number of executed instructions, exceptions taken, number of clock cycles, etc.), cache-related events (number of cache accesses, number of misses, etc.), TLB-related events, and so on. In the prior art, the control registers of these event counters are not programmed to make the counters count other events. The application uses the PMU to count the number of branch records stored in the branch record buffer. Because the overhead of a hardware performance counter is extremely low, almost zero, the impact on the execution speed of the software system is negligible.

Step S1 of the present application further includes: placing the instruction information data stored in the branch record buffer into a physical memory block allocated in advance. When the branch record buffer is full, it does not raise an interrupt; instead, earlier branch record data is overwritten cyclically, so that part of the branch trace is lost. By placing the instruction information data stored in the branch record buffer into the pre-allocated physical memory block, the application effectively prevents old branch record data from being overwritten by the circular buffer.

After step S2, the method further includes:

S3, reading the branch records in the branch record buffer, determining the jump relation between nodes, determining the execution order of the instructions according to the instruction execution time recorded by the time stamp, and connecting the unordered instruction execution nodes into an ordered instruction execution flow by combining the execution order and the jump relation between the nodes.

The application obtains the branch records of branch instructions from the branch record buffer, a hardware component. Because context switching and online reading of the branch record buffer are cheap, the overhead is almost zero. The method uses a hardware mechanism to record the program execution path and does not need to modify the source code of the program, so the execution speed of the software system is hardly affected.

After step S3 of the present application, the method further includes:

S4, detecting the cause of the concurrency error by using the ordered instruction execution flow.

After step S4 of the present application, the method further includes:

S5, selecting a corrective measure to handle the concurrency error according to the cause determined in step S4.

In the present application, the corrective measure includes repairing the code or optimizing the code.

Before step S1, the method further includes: judging whether the branch record buffer area is in an available state, if so, entering a step S1; otherwise, stopping generating the branch record and ending.

As an inventive concept, the present application also provides a concurrent error detection system of Arm architecture branch record buffer, which includes:

one or more processors;

and a memory having one or more programs stored thereon, which when executed by the one or more processors cause the one or more processors to implement the steps of the above-described method of the present application.

As an inventive concept, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described method of the present application.

Compared with the prior art, the application has the following beneficial effects:

1. The application provides a new method for solving concurrency errors, which can effectively improve the stability and reliability of multi-threaded programs. The application uses the branch record buffer newly introduced by the Arm architecture to recover the instruction execution flow of the processor and, in combination with other debugging or analysis tools, can detect the root cause of a concurrency error, thereby solving concurrency errors that are otherwise difficult to handle.

2. The application uses a hardware mechanism to record the program execution path, so the program source code does not need to be modified. The execution flow is recovered from the branch records of branch instructions, which are captured by the branch record buffer, a hardware component; because context switching and online reading of the branch record buffer are cheap, the overhead is almost zero.

3. Since the branch record buffer is a general mechanism on the Arm architecture, the application is applicable to most Arm architecture computers and has certain universality.

4. The application can solve the concurrent error safely and reliably and hardly affects the performance of the software system.

Drawings

FIG. 1 is a schematic diagram of BRB registers;

FIG. 2 is a flow chart of a method for resolving concurrency errors by restoring the execution flow according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for preventing buffer records from being overwritten according to an embodiment of the present application;

FIG. 4 is a diagram illustrating an example of restoring the instruction execution flow according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Example 1

The BRBE-based concurrent error detection method for the Arm architecture in this embodiment of the application is mainly divided into the following steps; the specific flow chart is shown in FIG. 2:

Step 1: first ensure that the branch record buffer is in a usable state:

The generation of branch records in the buffer is controlled by the BRBFCR register. Taking the BRBFCR_EL1 register as an example, when bit 7 of BRBFCR_EL1, i.e., BRBFCR_EL1.PAUSED, is 1, the processor stops generating branch records, and no new branch record is written to the branch record buffer. It is therefore necessary to confirm that the value of BRBFCR_EL1.PAUSED is 0, ensuring normal recording of conditional branches.
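The check described in Step 1 amounts to testing a single bit. The sketch below shows that test on a plain integer; the function name `is_recording_paused` is hypothetical, and the register value is assumed to have been read elsewhere by privileged code, since Python cannot read BRBFCR_EL1 directly.

```python
PAUSED_BIT = 7  # bit position of PAUSED in BRBFCR_EL1, as stated in the text


def is_recording_paused(brbfcr_el1: int) -> bool:
    """Return True if the PAUSED bit is set in the given register value."""
    return (brbfcr_el1 >> PAUSED_BIT) & 1 == 1


# The values below are example register contents, not values read from hardware.
print(is_recording_paused(0b1000_0000))  # PAUSED set: recording is stopped
print(is_recording_paused(0b0000_0000))  # PAUSED clear: recording continues
```

Under the flow in the text, the method proceeds to Step 2 only when this check returns False.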

Step 2: in order to solve the problem that the branch record storage capacity of the branch record buffer is far smaller than the number of branch records, the application uses a timely interrupt caused by a performance monitoring unit (Performance Monitoring Unit, PMU) to extract the saved branch records from the buffer and save them to disk before the buffer is full.

(1) Counting the number of recorded branch records by using an event counter of the performance monitoring unit;

(2) The maximum number of branch records in the branch record buffer is defined by BRBIDR0_EL1.NUMREC. In general, the branch record buffer stores at most 32 branch records, so the event counter threshold of the performance monitoring unit is set to 32;

(3) When the event counter value of the performance monitoring unit reaches 32, which indicates that the branch record buffer is full, the performance monitoring unit raises a performance monitoring interrupt (PMI). The interrupt handler is modified so that it stores the instruction information data held in the branch record buffer into a physical memory block allocated in advance. This prevents old branch record data from being overwritten by the circular buffer. The specific flow is shown in FIG. 3.
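The interaction between the counter threshold and the circular buffer in Step 2 can be simulated in software. The following is a toy model only: the class `BranchBufferSim`, its attributes, and the choice to clear the buffer and reset the counter after each simulated interrupt are assumptions of this sketch, not details stated in the application.

```python
from collections import deque


class BranchBufferSim:
    """Toy model: a fixed-size circular buffer drained by a counter 'interrupt'."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.buffer = deque(maxlen=capacity)  # oldest entries are overwritten when full
        self.counter = 0                      # stands in for the PMU event counter
        self.saved = []                       # stands in for the pre-allocated memory block

    def record_branch(self, source, target):
        self.buffer.append((source, target))
        self.counter += 1
        if self.counter >= self.capacity:     # threshold reached: simulated PMI
            self._on_interrupt()

    def _on_interrupt(self):
        self.saved.extend(self.buffer)        # copy records out before they are overwritten
        self.buffer.clear()                   # assumption: buffer is emptied after the copy
        self.counter = 0                      # assumption: counter restarts from zero


sim = BranchBufferSim(capacity=32)
for i in range(70):
    sim.record_branch(0x1000 + 4 * i, 0x2000 + 4 * i)
print(len(sim.saved), len(sim.buffer))  # prints "64 6"
```

With 70 simulated branches, two interrupts fire (after the 32nd and 64th record), so 64 records are saved and 6 remain in the buffer; none are lost to overwriting.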

Step 3: restoring the instruction execution stream by means of the branch record in the branch record buffer:

The execution path that caused the failure can be recovered by using the branch records, so the branch records are among the most useful information in fault diagnosis.

(1) Identifying the location where the concurrency error occurred: to restore the instruction execution flow of a program using the branch record buffer, it is first necessary to determine the specific location where the concurrency error occurred in the program. The last position to which the program successfully jumped is the position where the concurrency error occurred, and it can be obtained by reading the latest branch record in the branch record buffer.

(2) Reconstructing the instruction execution flow: once the specific location of the concurrency error has been determined, only the execution flow before that location needs to be restored to help detect the cause of the concurrency error. The instruction execution flow before the program crash can be restored from the address of each branch instruction and the target address of its jump. The position of each branch jump is regarded as a node, and the jumps between nodes are regarded as edges, so that the instruction execution path that caused the program crash is constructed. As shown in FIG. 4, the jump relation between nodes, such as node 1 jumping to node 2, can be determined by reading the branch records in the branch record buffer. Meanwhile, the execution order of the instructions can be determined from the instruction execution time recorded by the time stamp, and the unordered instruction execution nodes can be connected into an ordered instruction execution flow by combining the execution order and the specific jump relation.
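The two sub-steps above can be sketched as a sort followed by a scan. In this minimal sketch, each record is a hypothetical `(timestamp, source, target)` tuple, the function name `rebuild_flow` is invented for illustration, and taking the target of the latest record as the error location is one reading of the text, stated here as an assumption.

```python
def rebuild_flow(records):
    """Order branch records by timestamp and return (edges, last_location).

    records: iterable of (timestamp, source, target) tuples in any order.
    """
    ordered = sorted(records, key=lambda r: r[0])      # execution order from timestamps
    edges = [(src, tgt) for _, src, tgt in ordered]    # jumps between nodes
    # Assumption: the target of the latest record is the last successful jump.
    last_location = edges[-1][1] if edges else None
    return edges, last_location


# Example records are made up and deliberately out of order.
records = [
    (30, 0x400300, 0x400400),
    (10, 0x400100, 0x400200),
    (20, 0x400200, 0x400300),
]
edges, last_location = rebuild_flow(records)
print([f"{s:#x}->{t:#x}" for s, t in edges], hex(last_location))
```

The resulting edge list is the ordered execution flow that the analysis tools in Step 4 would consume.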

Step 4: detecting a root cause of a concurrency error by analyzing an instruction execution stream:

After the instruction execution stream has been restored, detecting concurrency errors becomes much simpler. The method combines existing methods or tools to perform the detection.

(1) For data race detection, reference may be made to the ThreadSanitizer (TSan) method (see: Konstantin Serebryany; Timur Iskhodzhanov. ThreadSanitizer: data race detection in practice [A]. WBIA '09: Proceedings of the Workshop on Binary Instrumentation and Applications [C], 2009). TSan monitors memory accesses during program execution and, for each access by a thread, records the memory address, the type of access (read or write), and the order in which the access operations occur. When two or more threads access the same memory address at the same time and at least one access is a write operation, there may be a data race.
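The race condition stated above (same address, different threads, at least one write) can be expressed as a simple check over an access log. This is a deliberately simplified sketch and not ThreadSanitizer's algorithm: it ignores synchronization and timing entirely, so it reports only candidate races. The function name `find_possible_races` and the tuple format are hypothetical.

```python
from itertools import combinations


def find_possible_races(accesses):
    """Return sorted addresses touched by two different threads with at least one write.

    accesses: list of (thread_id, address, kind), kind being "read" or "write".
    """
    by_address = {}
    for thread_id, address, kind in accesses:
        by_address.setdefault(address, []).append((thread_id, kind))

    races = set()
    for address, entries in by_address.items():
        for (t1, k1), (t2, k2) in combinations(entries, 2):
            if t1 != t2 and "write" in (k1, k2):   # different threads, one is a write
                races.add(address)
    return sorted(races)


# Example access log, made up for illustration.
accesses = [
    (1, 0x1000, "write"),
    (2, 0x1000, "read"),
    (1, 0x2000, "read"),
    (2, 0x2000, "read"),
]
print([hex(a) for a in find_possible_races(accesses)])  # prints ['0x1000']
```

Address 0x2000 is not reported because both accesses are reads.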

(2) For deadlock detection, a method combining static analysis and dynamic analysis may be used (see: M. Pistoia; S. Chandra; S. J. Fink. A Survey of Static Analysis Methods for Identifying Security Vulnerabilities in Software Systems [J]. IBM Systems Journal, 2007, vol. 46 (2): 265-288; or J. Schutte, R. Fedler and D. Titze, "ConDroid: Targeted Dynamic Analysis of Android Applications," 2015 IEEE 29th International Conference on Advanced Information Networking and Applications, Gwangju, Korea (South), 2015, pp. 571-578, doi: 10.1109/AINA.2015.238). In the static analysis phase, the code, the program structure, and the code logic are analyzed to detect potential faults; alternatively, control flow and data flow graphs of the program are built by data flow analysis and traversed to detect whether a potential deadlock exists. In the dynamic analysis phase, when a thread is detected to be blocked, the current execution path is recorded, and the other threads on the path are analyzed to determine whether a deadlock exists.

(3) Diagnosis of concurrency errors that violate single-variable atomicity: static-dynamic hybrid program analysis can be achieved with the SNORLAX technique (Kasikci B, Cui W, Ge X, et al. Lazy Diagnosis of In-Production Concurrency Bugs [C]// Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17), Shanghai, China. ACM, 2017: 582-598), which uses the coarse interleaving hypothesis to diagnose concurrency errors accurately and efficiently.
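One common way to realize the graph traversal mentioned above is to look for a cycle in a wait-for graph, where an edge from thread A to thread B means A is blocked waiting on B. The sketch below uses that standard technique as an illustration; it is not taken from the cited works, and the function name `has_deadlock` and the dictionary format are assumptions.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle.

    wait_for: dict mapping a thread name to the threads it is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in wait_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                  # back edge: a cycle of waiting threads
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(wait_for))


print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # prints True
print(has_deadlock({"T1": ["T2"], "T2": []}))      # prints False
```

A cycle means every thread on it waits for another thread on it, so none can make progress.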

Step 5: measures are taken to solve the concurrency errors:

Once the cause of the error is determined, appropriate corrective action can be taken to solve the problem and prevent it from happening again.

(1) Repairing the code (see: Boehm, H.-J.; Adve, S. V. Foundations of the C++ concurrency memory model [J]. ACM SIGPLAN Notices, 2008, vol. 43 (6): 68-78): after the root cause of the concurrency error is found, the code can be modified directly to fix the problem. For example, when multiple threads or processes wait for each other to release resources, the concurrent code cannot continue executing; this is known as a deadlock. By modifying the code to use a proper resource allocation strategy and mechanisms such as cycle avoidance, and by introducing deadlock detection and release algorithms, deadlocks can be avoided or an existing deadlock can be resolved. In the case of a data race, a necessary synchronization mechanism (such as a mutex lock, an atomic operation, or a thread-safe data structure) may be added to ensure consistent access to shared data, thereby avoiding the data race. In other cases, the problem may be solved by modifying the data structure or by similar means.

(2) Optimizing the code (see: Yoongu Kim; Michael Papamichael; Onur Mutlu; Mor Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior [A]. 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture [C], 2010): the performance and stability of the program can be improved by optimizing the code, thereby reducing the occurrence of concurrency errors. Although deadlock and starvation problems typically need to be addressed by modifying the code, in some cases they can also be mitigated by optimizing it. For example, introducing more rational resource allocation policies, adjusting the priority of threads or processes, or improving scheduling algorithms may reduce the probability of deadlock and starvation. In addition, techniques such as asynchronous programming and thread pools can improve the concurrent processing capacity of the program, and optimized code can manage system resources effectively, avoiding waste of or contention for resources.
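As one example of the data race repair mentioned above, the sketch below guards a shared counter with a mutex. The class `SafeCounter` and the `worker` function are hypothetical and only illustrate the general pattern of adding a lock around a read-modify-write on shared data.

```python
import threading


class SafeCounter:
    """Shared counter whose updates are serialized by a mutex."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:           # only one thread updates the value at a time
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def worker(counter, times):
    for _ in range(times):
        counter.increment()


counter = SafeCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # prints 4000
```

Because every increment happens while holding the lock, the final value is the same on every run regardless of how the threads are scheduled.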

Example 2

Embodiment 2 of the present application provides a concurrent error checking system corresponding to embodiment 1, including a memory, a processor, and a computer program stored on the memory; the processor executes the computer program on the memory to implement the steps of the method of embodiment 1 described above.

In some implementations, the memory may be high-speed random access memory (RAM: Random Access Memory), and may also include non-volatile memory, such as at least one disk memory.

In other implementations, the processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or other general-purpose processor, which is not limited herein.

Example 3

Embodiment 3 of the present application provides a computer-readable storage medium corresponding to embodiment 1 described above, on which a computer program/instructions is stored. The steps of the method of embodiment 1 described above are implemented when the computer program/instructions are executed by a processor.

The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination of the preceding.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiment of the application can be realized by adopting various computer languages, such as the object-oriented programming language Java, the interpreted scripting language JavaScript, and the like.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. The concurrent error detection method for Arm architecture branch record buffer is characterized by comprising the following steps:

2. The method for concurrent error detection of Arm architecture branch record buffers according to claim 1, wherein step S1 further comprises: placing the instruction information data stored in the branch record buffer into a physical memory block allocated in advance.

3. The method for concurrent error detection of Arm architecture branch record buffers according to claim 1, further comprising, after step S2:

4. The method for concurrent error detection of Arm architecture branch record buffers according to claim 3, further comprising, after step S3:

5. The method for concurrent error detection of Arm architecture branch record buffers according to claim 4, further comprising, after step S4:

6. The method of claim 5, wherein the corrective action comprises a repair code or an optimization code.

7. The method for concurrent error detection of Arm architecture branch record buffers according to any one of claims 1 to 6, further comprising, prior to step S1:

judging whether the branch record buffer area is in an available state, if so, entering a step S1; otherwise, stopping generating the branch record and ending.

8. A concurrent error detection system for Arm architecture branch record buffers, comprising:

one or more processors;

a memory having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the steps of the method of any of claims 1-7.

9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-7.