WO2020207310A1 - Defect detection/processing method and device - Google Patents

Defect detection/processing method and device

Info

Publication number
WO2020207310A1
Authority
WO
WIPO (PCT)
Prior art keywords
shared data
defect
distributed system
node
atomic
Prior art date
Application number
PCT/CN2020/082707
Other languages
English (en)
French (fr)
Inventor
高钰
周利
黄瑞瑞
吴永明
龙舟
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020207310A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • This application relates to the field of distributed system technology, in particular to defect detection methods and devices, and defect processing methods and devices.
  • As more and more data and computation migrate from local machines to the cloud, large-scale distributed systems are coming into wide use. Compared with traditional single-machine deployments, large-scale distributed systems offer better scalability and fault tolerance, and the cost of obtaining the same computing power is lower.
  • However, large-scale distributed systems must manage a large number of distributed software components, hardware, and their configurations, which makes such systems extremely complex. Failures are therefore inevitable; they affect a large number of end users and reduce reliability and availability. Ensuring the high reliability of large-scale distributed systems is therefore very important.
  • A typical way to ensure high reliability is for the distributed system to manage its internal computing nodes and recover from node failures, so that the system keeps operating normally.
  • To cope with node failures, developers have introduced various complex failure recovery mechanisms into distributed systems.
  • Nevertheless, large-scale distributed systems still face great challenges when dealing with node failures.
  • Any node may fail at any time, triggering a wide variety of failure scenarios. It is very difficult for developers to anticipate all possible failure scenarios, design correct failure recovery mechanisms, and implement those mechanisms correctly.
  • Common methods for detecting node failure and recovery defects include at least the following: 1) testing as many different failure scenarios as possible while avoiding re-testing the same recovery behavior; users describe the fault-testing method and the recovery specification of the distributed system in Datalog, so that the failure recovery logic of the distributed system can be tested systematically; 2) SAMC, which intercepts non-deterministic events in the distributed system and permutes their order; SAMC uses gray-box testing and adds semantic information about the distributed system to traditional black-box model checking, thereby reducing the state space and avoiding, as far as possible, the state-explosion problem of model checking; 3) systematically generating and exploring the file states that may be produced during execution of the distributed system, in order to detect distributed-system vulnerabilities related to the simultaneous failure of all replicas.
  • In the course of making the invention, the inventors discovered a new class of post-failure recovery defects: certain associated operations in a distributed system are expected to execute atomically and must not be interrupted by a node failure; if a node failure does occur between them, the distributed system is left with inconsistent data and cannot recover to a normal state.
  • The present invention refers to this type of node failure as failure atomicity.
  • However, the inventors found that the existing technical solutions described above cannot handle this type of node failure.
  • For example, the first method above treats the distributed system as a black box and does not consider in which system state injecting a node failure would produce a failure-atomicity violation.
  • The third method above only concerns operations related to file persistence and does not consider failure atomicity.
  • In summary, the prior art suffers from the problem that node failures can leave multiple shared data of the same data source inconsistent.
  • This application provides a defect detection method to solve the problem of inconsistency of multiple shared data of the same data source caused by node failure in the prior art.
  • This application additionally provides a defect detection device, as well as a defect processing method and device.
  • This application provides a defect detection method, including: determining an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system; determining candidate failure-atomicity defects according to the execution trace; running the distributed system and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system; obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect; and, if the multiple shared data are inconsistent, determining the failure-atomicity defect.
  • Optionally, the multiple shared-data processing operations include shared-data write operations to two or more files within one node; the failure-atomicity defect includes a node-failure atomicity defect between any two shared-data write operations.
  • Optionally, the multiple shared-data processing operations include a shared-data write operation to a file within a source node and a shared-data send operation from the source node to a target node; the failure-atomicity defect includes a source-node failure-atomicity defect between the shared-data write operation and the shared-data send operation.
  • Optionally, the multiple shared-data processing operations include a shared-data send operation from a source node to a first target node and a shared-data send operation from the source node to a second target node; the failure-atomicity defect includes a source-node failure-atomicity defect between the first shared-data send operation and the second shared-data send operation.
  • Optionally, the execution trace is determined by performing a system test on the distributed system and recording the execution trace of the distributed system under the test data.
  • Optionally, running the distributed system includes re-executing the system test on the distributed system according to the test data.
  • Optionally, the method further includes determining the multiple shared data respectively corresponding to each data source.
  • Optionally, the multiple shared-data processing operations include at least one of the following operations: a data write operation and a data send operation.
  • This application also provides a defect processing method, including: determining a failure-atomicity defect included in a distributed system; adding, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code including code for keeping the multiple shared data related to the failure-atomicity defect consistent; and, if a node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, executing the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent.
  • Optionally, determining the failure-atomicity defect of the distributed system includes: determining an execution trace of multiple shared-data processing operations having an atomicity relationship in the distributed system; determining candidate failure-atomicity defects according to the execution trace; running the distributed system and injecting the corresponding node failures into the distributed system; obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect; and, if the multiple shared data are inconsistent, determining the failure-atomicity defect.
  • This application also provides a defect detection device, including:
  • an execution trace determining unit, configured to determine an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system;
  • a candidate defect determining unit, configured to determine candidate failure-atomicity defects according to the execution trace;
  • a node failure injection unit, configured to run the distributed system and inject the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
  • a shared data acquiring unit, configured to acquire, after the distributed system has run, multiple shared data related to the failure-atomicity defect;
  • a defect determining unit, configured to determine the failure-atomicity defect if the multiple shared data are inconsistent.
  • This application also provides a defect processing device, including:
  • a defect determining unit, configured to determine the failure-atomicity defects included in a distributed system;
  • an error code adding unit, configured to add, to the source code of the distributed system, error-handling code for the failure-atomicity defect;
  • an error processing unit, configured to execute the error-handling code if the node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, so that the multiple shared data related to the failure-atomicity defect are kept consistent.
  • The present application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the various methods described above.
  • The present application also provides a computer program product including instructions which, when run on a computer, cause the computer to execute the various methods described above.
  • The defect detection method determines an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system; determines candidate failure-atomicity defects according to the execution trace; runs the distributed system and injects the node failures corresponding to the candidate failure-atomicity defects into the distributed system; obtains, after running the distributed system, multiple shared data related to the failure-atomicity defect; and, if the multiple shared data are inconsistent, determines the failure-atomicity defect. With this approach, possible atomicity-violation errors can be predicted by observing a single correct execution of the distributed system, without injecting node failures, and the defects can then be deterministically confirmed and replayed by finally replaying the workload and injecting the node failures. Failure-atomicity defects can therefore be detected effectively, improving the reliability of the distributed system.
  • The defect processing method determines a failure-atomicity defect included in a distributed system; adds, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code including code for keeping the multiple shared data consistent; and, if the node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, executes the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent. With this approach, when the node corresponding to the failure-atomicity defect fails, the error can be caught and handled in time and the multiple shared data processed consistently; shared-data consistency is therefore effectively improved, enhancing the reliability of the distributed system.
  • FIG. 1 is a flowchart of an embodiment of a defect detection method provided by the present application;
  • FIG. 2 is a schematic diagram of shared data and key operations in an embodiment of a defect detection method provided by the present application;
  • FIG. 3a is a schematic diagram of a failure-atomicity violation mode in an embodiment of a defect detection method provided by the present application;
  • FIG. 3b is a schematic diagram of another failure-atomicity violation mode in an embodiment of a defect detection method provided by the present application;
  • FIG. 3c is a schematic diagram of yet another failure-atomicity violation mode in an embodiment of a defect detection method provided by the present application;
  • FIG. 4 is a schematic diagram of an embodiment of a defect detection device provided by the present application;
  • FIG. 5 is a flowchart of an embodiment of a defect processing method provided by the present application;
  • FIG. 6 is a schematic diagram of an embodiment of a defect processing device provided by the present application.
  • The core technical idea of the technical solutions provided by the embodiments of this application is: by observing a single correct execution of the distributed system, possible atomicity-violation errors can be predicted without injecting node failures; the workload is then replayed, the node failures are injected, and the defects are deterministically confirmed and replayed. Since atomicity-violation errors can be detected, the reliability of distributed systems can be effectively improved.
  • FIG. 1 is a flowchart of an embodiment of a defect detection method provided by this application.
  • The method is performed by a defect detection device.
  • a defect detection method provided by this application includes:
  • Step S101 Determine the execution trajectory of multiple shared data processing operations with atomic relationships in the distributed system.
  • the distributed system includes but is not limited to: distributed storage system, distributed computing framework, synchronization service, cluster management service, etc., such as HDFS, HBase, Hadoop, Spark, ZooKeeper, Mesos, YARN.
  • the distributed system is a distributed application coordination service Zookeeper.
  • Some operations in the services provided by a distributed system are atomic and indivisible with respect to the consistency of multiple shared data: either all of the steps are executed, or none of them is.
  • The embodiments of this application refer to such operations as multiple shared-data processing operations having an atomicity relationship, or atomically associated operations for short. After these atomically associated operations finish, the related shared-data processing results should be consistent.
  • A failure-atomicity defect relates to associated operations on multiple shared data from the same data source; it may involve one or more node failures, and a node failure may occur at a moment between two atomically associated operations.
  • Note that consistency of shared data is not limited to every copy of the shared data being exactly identical. For example, the inventory quantity of a product in file 1 and in file 2 may both be 1000; alternatively, only part of the data may match, for example the product number in file 1 is "123" while the product number in file 2 is "A-123", and so on.
  • Step S101 can be implemented as follows: run a specified workload, such as the test cases shipped with the system itself, to drive the distributed system; record how shared data is used under that workload; and generate the system execution trace.
  • For example, the key operations with an atomicity relationship corresponding to shared data v include: 1) writing shared data v to file file1 in node 1 and writing shared data v to file file2 in node 1; 2) sending shared data v to node 2 and writing shared data v into the resource "resource" in node 2; 3) sending shared data v to node 3 and storing it in a memory variable of node 3. The system execution trace of shared data v then covers files file1 and file2 in node 1, the resource in node 2, and the memory variable in node 3.
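  • As an illustration only, a tracer could represent each recorded key operation on shared data as a small event record and accumulate the events per data source to form the execution trace; the class, enum, and field names below are assumptions made for this sketch and are not defined by the patent.

```java
import java.util.*;

// Hypothetical sketch of an execution-trace record for the example above.
public class TraceDemo {
    enum OpKind { FILE_WRITE, MSG_SEND, MEM_STORE }

    // One key operation observed while the workload runs.
    record TraceEvent(String dataId, String node, OpKind kind, String target, long order) {}

    // Execution trace: all key operations grouped by the shared-data source they touch.
    static Map<String, List<TraceEvent>> buildTrace(List<TraceEvent> observed) {
        Map<String, List<TraceEvent>> trace = new LinkedHashMap<>();
        for (TraceEvent e : observed) {
            trace.computeIfAbsent(e.dataId(), k -> new ArrayList<>()).add(e);
        }
        return trace;
    }

    public static void main(String[] args) {
        // The trace of shared data v from the example: two file writes on node 1,
        // a send to node 2 followed by a resource write, and a send to node 3.
        List<TraceEvent> observed = List.of(
            new TraceEvent("v", "node1", OpKind.FILE_WRITE, "file1", 1),
            new TraceEvent("v", "node1", OpKind.FILE_WRITE, "file2", 2),
            new TraceEvent("v", "node1", OpKind.MSG_SEND,  "node2", 3),
            new TraceEvent("v", "node2", OpKind.FILE_WRITE, "resource", 4),
            new TraceEvent("v", "node1", OpKind.MSG_SEND,  "node3", 5),
            new TraceEvent("v", "node3", OpKind.MEM_STORE, "memVar", 6));
        System.out.println(buildTrace(observed).get("v"));
    }
}
```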
  • the method may further include the following steps: identifying key operations in the distributed system; determining shared data based on the identified key operations.
  • the key operations include operations involving file writing and network data sending.
  • For example, the key operations in Zookeeper include the application programming interfaces (APIs) involved in file reading and writing and in network reading and writing.
  • To identify the key operations in a distributed system such as Zookeeper, the following approach can be used: first, identify basic file operation instructions by manual analysis, for example the file-related IO operations in Java; then, by analyzing call relationships, identify the APIs in the system that involve file reads and writes and network reads and writes, and treat these APIs as key operations.
  • In a concrete implementation, key operations can be extracted automatically by static analysis, so as to identify pairs of key operations with an atomicity relationship.
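  • A minimal sketch of this idea follows: a hand-curated list of basic file/network write signatures is used to filter the call sites reported by a static-analysis front end. The signature list and the CallSite representation are illustrative assumptions, not an exhaustive or official list from the patent.

```java
import java.util.List;
import java.util.Set;

// Sketch: treat call sites whose callee matches a known write signature as key operations.
public class KeyOpFilter {
    // Basic file/network write instructions identified by manual analysis (example list).
    private static final Set<String> KEY_SIGNATURES = Set.of(
        "java.io.FileOutputStream.write",
        "java.io.RandomAccessFile.write",
        "java.nio.channels.FileChannel.write",
        "java.io.OutputStream.write");   // e.g. a socket's output stream

    // A call site found by a static-analysis front end (caller -> callee).
    record CallSite(String callerMethod, String calleeSignature) {}

    static boolean isKeyOperation(CallSite site) {
        return KEY_SIGNATURES.contains(site.calleeSignature());
    }

    // Keep only the call sites that count as key operations.
    static List<CallSite> filterKeyOperations(List<CallSite> allSites) {
        return allSites.stream().filter(KeyOpFilter::isKeyOperation).toList();
    }
}
```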
  • The multiple shared data are data originating from the same data source; these data have a sharing relationship. The shared data may be data located in different persistent files on the same node, for example the data written by node N into three files; it may also be data written by one node into a persistent file and sent to another node, for example the two pieces of data that node N1 writes into file x and sends to node N2; it may also be data sent by one node to multiple nodes, for example the two pieces of data sent by node N1 to nodes N2 and N3.
  • Step S103: Determine candidate failure-atomicity defects according to the execution trace.
  • The purpose of determining the execution trace is to track how shared data is used (its trajectory) and the causal chain, so as to determine on which nodes data sharing has occurred.
  • The embodiments of this application analyze the execution trace of the distributed system, discover how the shared data is used, match it against preset failure-atomicity defect modes, and find node-failure injection points that may cause failure-atomicity violations.
  • The first mode is shown in FIG. 3a.
  • Variable v (for example the name "Zhang San", v0) is a piece of shared data that may be written by a node into file 1 (file1) and file 2 (file2). The values of v stored in the two files (v1 and v2) should therefore be consistent, and the write operations w1 and w2 to these two files can be regarded as a pair of associated operations.
  • If a node failure occurs between these two file write operations (for example, w1 is executed at time t1 and w2 at time t2, and the node fails at some moment between t1 and t2), the two files may become inconsistent (v1 != v2), and an exception then arises during node failure recovery (v1 != v2 after node N recovers).
  • If, according to the system execution trace, the use of the shared data matches this first mode, the key operations associated with that shared data constitute a candidate failure-atomicity defect, and the node-failure injection point can be: make node N fail at some moment after the w1 operation is performed and before the w2 operation is performed.
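  • A minimal Java sketch of this first mode is given below, assuming the file names from the example; it illustrates the injection point between the two writes and is not code from the patent itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the first violation mode: the same shared value v0 must reach both
// file1 and file2; a crash between the two writes leaves v1 != v2 on recovery.
public class TwoFileWrite {
    static void persist(String v0, Path file1, Path file2) throws IOException {
        // w1: write v0 into file1
        Files.write(file1, v0.getBytes(StandardCharsets.UTF_8));

        // <-- node-failure injection point: if node N crashes here,
        //     file1 already holds v0 but file2 still holds the old value.

        // w2: write v0 into file2
        Files.write(file2, v0.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        persist("Zhang San", Path.of("file1"), Path.of("file2"));
    }
}
```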
  • The second mode is shown in FIG. 3b. Variable v is a piece of shared data that may be written by node 1 into file file1 (v1) and, at the same time, sent by node 1 to node 2. After receiving the message, node 2 may save the value of v into some resource (v2), for example by writing it to a file or keeping it in a memory variable.
  • The values of v stored in file1 and in the resource should therefore be consistent.
  • If node 1 fails between the file write operation and the message send operation, two inconsistent pieces of data may result, and an exception may arise during node failure recovery.
  • If the use of the shared data matches this second mode, the associated key operations constitute a candidate failure-atomicity defect, and the node-failure injection points include but are not limited to: make node 1 fail after the w1 operation is performed, or make node 2 fail before the w2 operation is performed.
  • The third mode is shown in FIG. 3c. Variable v is a piece of shared data that may be sent by node 1 to node 2 and to node 3. After receiving the message, node 2 and node 3 save the value of v into resource 1 (v1) and resource 2 (v2) respectively, so the two pieces of data v1 and v2 stored on node 2 and node 3 should be consistent.
  • If node 1 fails between the two message send operations, the data on node 2 and node 3 (v1 and v2) may become inconsistent, and an exception may arise during node failure recovery.
  • If the use of the shared data matches this third mode, the associated key operations constitute a candidate failure-atomicity defect, and the node-failure injection points include but are not limited to: make node 1 fail after the shared data v has been sent to node 2, make node 1 fail after node 2 performs the w1 operation, or make node 3 fail before node 3 performs the w2 operation.
  • Step S105: Run the distributed system, and inject the node failures corresponding to the candidate failure-atomicity defects into the distributed system.
  • According to the node-failure injection points found in the previous step, the distributed system is re-run and node failures are inserted at the corresponding injection points. In this embodiment, the workload of step S101 is replayed to run the distributed system.
  • Step S107: After running the distributed system, obtain the multiple shared data related to the failure-atomicity defect.
  • After the atomically associated operations have executed, the multiple shared data associated with those operations can be obtained from the locations and data names of the shared data. For example, in the first mode, the shared data v1 and v2 associated with the data source v0 are read from files file1 and file2 of node N; in the second mode, v1 and v2 are read from file file1 in node 1 and from the resource or memory in node 2, respectively; in the third mode, v1 and v2 are read from the resource in node 2 and the resource in node 3, respectively.
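  • For the first mode, the check in steps S107/S109 could look like the sketch below: read the two copies back after the injected failure and recovery, then compare them. The strict equality test is only an example; as noted above, "consistent" may also mean the copies agree on part of the data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of reading the shared copies back and comparing them.
public class ConsistencyCheck {
    static boolean copiesConsistent(Path file1, Path file2) throws IOException {
        String v1 = Files.readString(file1, StandardCharsets.UTF_8);
        String v2 = Files.readString(file2, StandardCharsets.UTF_8);
        return v1.equals(v2);
    }

    public static void main(String[] args) throws IOException {
        if (!copiesConsistent(Path.of("file1"), Path.of("file2"))) {
            // Inconsistent copies confirm the candidate as a real
            // failure-atomicity defect (step S109).
            System.out.println("failure-atomicity defect confirmed");
        }
    }
}
```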
  • Step S109: If the multiple shared data are inconsistent, determine the failure-atomicity defect.
  • By judging whether the multiple shared data are consistent, that is, by observing whether the distributed system satisfies the user constraint, it is determined whether the candidate failure-atomicity defect is a real failure-atomicity defect. The user constraint includes the constraint that the multiple shared data are mutually consistent.
  • In this embodiment, the processing flow of the defect detection method includes the following steps: 1) determine the key operations from the source code of the distributed system by static analysis; 2) determine the shared data from the key operations by analyzing data dependences; 3) run the distributed system and determine the execution trace of the shared data; 4) determine, from the execution trace, the node failures to be verified that relate to the shared data; 5) inject the node failures to be verified into the distributed system; 6) if the shared data are inconsistent after the node failure is injected, determine the failure-atomicity defect.
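  • A compact, illustrative driver tying these six steps together is sketched below; all interfaces and method names are assumptions made for this sketch, not APIs defined by the patent.

```java
import java.util.List;

// Sketch of the overall detect loop: trace once, then replay with each candidate injection.
public class DetectionDriver {
    interface SystemUnderTest {
        ExecutionTrace runWorkload();                       // steps 1-3: trace a correct run
        SharedCopies runWorkloadWithFailure(Injection i);   // step 5: replay with a node failure
    }
    interface ExecutionTrace { List<Injection> candidateInjections(); } // step 4
    interface Injection {}
    interface SharedCopies { boolean consistent(); }        // step 6 check

    static void detect(SystemUnderTest sut) {
        ExecutionTrace trace = sut.runWorkload();
        for (Injection injection : trace.candidateInjections()) {
            SharedCopies copies = sut.runWorkloadWithFailure(injection);
            if (!copies.consistent()) {
                System.out.println("failure-atomicity defect: " + injection);
            }
        }
    }
}
```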
  • a defect detection method is provided, and correspondingly, the present application also provides a defect detection device.
  • This device corresponds to the embodiment of the above method.
  • FIG. 4 is a schematic diagram of an embodiment of the defect detection device of the present application. Since the device embodiment is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • the device embodiments described below are merely illustrative.
  • This application additionally provides a defect detection device, including:
  • the execution trajectory determination unit 401 is configured to determine the execution trajectory of multiple shared data processing operations with atomic relationships in the distributed system
  • the candidate defect determining unit 403 is configured to determine candidate failed atomic defects according to the execution track
  • a node failure injection unit 405, configured to run the distributed system and inject the node failures corresponding to the candidate failed atomic defects into the distributed system;
  • the shared data acquisition unit 407 is configured to acquire multiple shared data related to the failed atomic defect after running the distributed system;
  • the defect determining unit 409 is configured to determine the failed atomic defect if the multiple shared data are inconsistent.
  • the multiple shared data processing operations include shared data write operations on two or more files in one node; the failure atomic defect includes node failure atomic defects between any two shared data write operations.
  • the multiple shared data processing operations include a shared data write operation to a file in the source node, and a shared data transmission operation from the source node to the target node;
  • the invalid atomic defect includes a shared data write operation and sharing Atomic defect of source node failure between data sending operations.
  • the multiple shared data processing operations include a shared data sending operation from a source node to a first target node, and a shared data sending operation from a source node to a second target node; the failure atomic defect includes the first shared data The source node failure atomic defect between the data sending operation and the second shared data sending operation.
  • the execution trajectory determining unit 401 is specifically configured to perform a system test on the distributed system, and record the execution trajectory of the distributed system under test data.
  • the node failure injection unit 405 includes:
  • the system operation subunit is specifically configured to re-execute the system test on the distributed system according to the test data.
  • Optional also includes:
  • the shared data determining unit is configured to determine, according to the source program code of the distributed system, a plurality of shared data respectively corresponding to each data source and the plurality of shared data processing operations.
  • the multiple shared data processing operations include at least one of the following operations: a data writing operation and a data sending operation.
  • In the above embodiments, a defect detection method is provided; correspondingly, the present application also provides a defect processing method, which corresponds to the embodiment of the above method.
  • FIG. 5 is a flowchart of an embodiment of the defect processing method of this application. Since this method embodiment corresponds to the method embodiment of the first embodiment, it is described relatively briefly; for related details, refer to the description of the first method embodiment.
  • This application additionally provides a defect processing method, including:
  • Step S501: Determine the failure-atomicity defects included in the distributed system.
  • This step can be implemented using the approach of the first embodiment above, or by other means, such as identifying failure-atomicity defects manually.
  • Step S503: Add error-handling code for the failure-atomicity defect to the source code of the distributed system.
  • The error-handling code includes code that keeps the multiple shared data related to the failure-atomicity defect consistent, including but not limited to: code that catches the error caused by the failure-atomicity defect, and shared-data consistency processing code that runs after the error is caught.
  • After the error-handling code has been added to the source code, the distributed system can be run so that it receives service requests from users.
  • The user may be a person, who sends service requests to the distributed system through a client; the user may also be another software system, and so on.
  • Step S505: If the node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, execute the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent.
  • While the distributed system is running, if the failure-atomicity defect occurs, that is, a node failure occurs between the multiple atomically associated operations related to the defect, the error caused by the failure-atomicity defect is caught by the error-handling code and handled by that code.
  • In one example, the shared-data consistency processing code includes program code that rolls back the shared-data associated operations that have already been executed.
  • In another example, the shared-data consistency processing code includes program code that, starting from the shared data already processed by the executed shared-data associated operations, executes the associated operations that follow them.
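  • A minimal sketch of the second (roll-forward) style for the two-file example is shown below; the method name and the file layout are assumptions, and a rollback variant would instead restore file1 to its previous value. This is an illustration, not error-handling code from the patent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of roll-forward consistency processing: on recovery, if file1 already
// holds the new value but file2 does not, complete the interrupted w2 write.
public class RollForwardHandler {
    static void restoreConsistency(Path file1, Path file2) throws IOException {
        String v1 = Files.readString(file1, StandardCharsets.UTF_8);
        String v2 = Files.exists(file2)
                ? Files.readString(file2, StandardCharsets.UTF_8) : "";
        if (!v1.equals(v2)) {
            // Roll forward: re-apply the unfinished associated operation (w2).
            Files.write(file2, v1.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        restoreConsistency(Path.of("file1"), Path.of("file2"));
    }
}
```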
  • a large-scale distributed system consists of a large number of computing nodes and runs different complex protocols.
  • The computing nodes in these systems may face the following abnormal situations: 1) a single computing node is usually an ordinary personal computer (PC) with various reliability problems, such as hardware faults like disk damage and memory errors, operating system crashes, and so on, which can cause the node to crash or to hold erroneous data and exhibit erroneous behavior; 2) a single computing node or the entire data center may lose power, causing multiple computing nodes to crash, restart, and so on; 3) data-center management requirements, such as a burst of user requests, may require adding certain nodes to the data center, or removing some computing nodes when the load is light. All of these behaviors cause node failures at the level of computing nodes, data centers, and so on.
  • The defect processing method determines the failure-atomicity defects included in the distributed system; adds, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code including code that keeps the multiple shared data related to the failure-atomicity defect consistent; and, if the error corresponding to the failure-atomicity defect occurs while the distributed system is executing, executes the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent. With this approach, when the node corresponding to the failure-atomicity defect fails in the distributed system, the error can be caught and handled in time and the multiple shared data processed consistently; shared-data consistency is therefore effectively improved, and the reliability of the distributed system is improved.
  • a defect processing method is provided, and correspondingly, the present application also provides a defect processing device.
  • This device corresponds to the embodiment of the above method.
  • FIG. 6 is a schematic diagram of an embodiment of the defect processing apparatus of the present application. Since the device embodiment is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • the device embodiments described below are merely illustrative.
  • This application additionally provides a defect processing device, including:
  • the defect determining unit 601 is used to determine the failed atomic defects included in the distributed system
  • An error code adding unit 603 is configured to add an error handling program code for the failure atomic defect to the source program code of the distributed system
  • the error processing unit 605 is configured to execute the error handling program code if the node corresponding to the failed atomic defect fails during the execution of the distributed system, so as to make multiple errors related to the failed atomic defect.
  • the shared data is consistent.
  • the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in computer readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer readable media.
  • Computer-readable media includes permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, Magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media can be used to store information that can be accessed by computing devices.
  • As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
  • this application can be provided as methods, systems or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Hardware Redundancy (AREA)

Abstract

Disclosed are a defect detection method and device, and a defect processing method and device. The defect detection method includes: determining an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system (S101); determining candidate failure-atomicity defects according to the execution trace (S103); running the distributed system, and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system (S105); obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect (S107); and, if the multiple shared data are inconsistent, determining the failure-atomicity defect (S109). With this approach, possible atomicity-violation errors can be predicted by observing a single correct execution of the distributed system, without injecting node failures, and the defects can then be deterministically confirmed and replayed by finally replaying the workload and injecting the node failures; failure-atomicity defects can therefore be detected effectively, improving the reliability of the distributed system.

Description

Defect detection/processing method and device
This application claims priority to Chinese Patent Application No. 201910296074.7, entitled "Defect detection/processing method and device", filed on April 12, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of distributed systems, and in particular to a defect detection method and device and a defect processing method and device.
Background
As more and more data and computation migrate from local machines to the cloud, large-scale distributed systems are coming into wide use. Compared with traditional single-machine deployments, large-scale distributed systems offer better scalability and fault tolerance, and the cost of obtaining the same computing power is lower. However, large-scale distributed systems must manage a large number of distributed software components, hardware, and their configurations, which makes such systems extremely complex. Failures are therefore inevitable, affect a large number of end users, and reduce reliability and availability. Ensuring the high reliability of large-scale distributed systems is therefore very important.
A typical way to ensure high reliability is for the distributed system to manage its internal computing nodes and recover from node failures so that the system keeps operating normally. To cope with node failures, developers have introduced various complex failure recovery mechanisms into distributed systems; nevertheless, large-scale distributed systems still face great challenges when handling node failures. In a large-scale distributed system, any node may fail at any time, triggering a wide variety of failure scenarios. It is very difficult for developers to anticipate all possible failure scenarios, design correct failure recovery mechanisms, and implement them correctly. At the same time, thoroughly testing the system by injecting node failures in all possible scenarios is also hard to achieve. Incorrect node-failure recovery mechanisms and implementations therefore introduce intricate recovery-related defects, which often lead to serious consequences such as node downtime and data inconsistency.
At present, common methods for detecting node failure and recovery defects include at least the following: 1) testing as many different failure scenarios as possible while avoiding re-testing the same recovery behavior; users describe the fault-testing method and the recovery specification of the distributed system in Datalog, so that the failure recovery logic of the distributed system can be tested systematically; 2) SAMC, which intercepts non-deterministic events in the distributed system and permutes their order; SAMC uses gray-box testing and adds semantic information about the distributed system to traditional black-box model checking, thereby reducing the state space and avoiding, as far as possible, the state-explosion problem of model checking; 3) systematically generating and exploring the file states that may be produced during execution of the distributed system, in order to detect distributed-system vulnerabilities related to the simultaneous failure of all replicas.
In the course of making the invention, the inventors discovered a new class of post-failure recovery defects: certain associated operations in a distributed system are expected to execute atomically and must not be interrupted by a node failure; if a node failure does occur, the distributed system is left with inconsistent data and cannot recover to a normal state. The present invention refers to this type of node failure as failure atomicity. The inventors found, however, that the above existing technical solutions cannot handle this type of node failure: the first method above treats the distributed system as a black box and does not consider in which system state injecting a node failure would produce a failure-atomicity violation, and the third method above only concerns operations related to file persistence and does not consider failure atomicity. In summary, the prior art suffers from the problem that node failures can leave multiple shared data of the same data source inconsistent.
Summary of the Invention
This application provides a defect detection method to solve the prior-art problem of inconsistency among multiple shared data of the same data source caused by node failures. This application additionally provides a defect detection device, as well as a defect processing method and device.
This application provides a defect detection method, including:
determining an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system;
determining candidate failure-atomicity defects according to the execution trace;
running the distributed system, and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect;
if the multiple shared data are inconsistent, determining the failure-atomicity defect.
Optionally, the multiple shared-data processing operations include shared-data write operations to two or more files within one node;
the failure-atomicity defect includes a node-failure atomicity defect between any two shared-data write operations.
Optionally, the multiple shared-data processing operations include a shared-data write operation to a file within a source node, and a shared-data send operation from the source node to a target node;
the failure-atomicity defect includes a source-node failure-atomicity defect between the shared-data write operation and the shared-data send operation.
Optionally, the multiple shared-data processing operations include a shared-data send operation from the source node to a first target node, and a shared-data send operation from the source node to a second target node;
the failure-atomicity defect includes a source-node failure-atomicity defect between the first shared-data send operation and the second shared-data send operation.
Optionally, the execution trace is determined by the following steps:
performing a system test on the distributed system, and recording the execution trace of the distributed system under the test data.
Optionally, running the distributed system includes:
re-executing the system test on the distributed system according to the test data.
Optionally, the method further includes:
determining the multiple shared data respectively corresponding to each data source.
Optionally, the multiple shared-data processing operations include at least one of the following operations: a data write operation and a data send operation.
This application also provides a defect processing method, including:
determining a failure-atomicity defect included in a distributed system;
adding, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code including code for keeping the multiple shared data related to the failure-atomicity defect consistent;
if a node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, executing the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent.
Optionally, determining the failure-atomicity defect of the distributed system includes:
determining an execution trace of multiple shared-data processing operations having an atomicity relationship in the distributed system;
determining candidate failure-atomicity defects according to the execution trace;
running the distributed system, and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect; if the multiple shared data are inconsistent, determining the failure-atomicity defect.
This application also provides a defect detection device, including:
an execution trace determining unit, configured to determine an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system;
a candidate defect determining unit, configured to determine candidate failure-atomicity defects according to the execution trace;
a node failure injection unit, configured to run the distributed system and inject the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
a shared data acquiring unit, configured to acquire, after running the distributed system, multiple shared data related to the failure-atomicity defect;
a defect determining unit, configured to determine the failure-atomicity defect if the multiple shared data are inconsistent.
This application also provides a defect processing device, including:
a defect determining unit, configured to determine a failure-atomicity defect included in a distributed system;
an error code adding unit, configured to add, to the source code of the distributed system, error-handling code for the failure-atomicity defect;
an error processing unit, configured to execute the error-handling code if a node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, so that the multiple shared data related to the failure-atomicity defect are kept consistent.
This application also provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the various methods described above.
This application also provides a computer program product including instructions which, when run on a computer, cause the computer to execute the various methods described above.
Compared with the prior art, this application has the following advantages:
The defect detection method provided by the embodiments of this application determines an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system; determines candidate failure-atomicity defects according to the execution trace; runs the distributed system and injects the node failures corresponding to the candidate failure-atomicity defects into the distributed system; obtains, after running the distributed system, multiple shared data related to the failure-atomicity defect; and, if the multiple shared data are inconsistent, determines the failure-atomicity defect. With this approach, possible atomicity-violation errors can be predicted by observing a single correct execution of the distributed system, without injecting node failures, and the defects can then be deterministically confirmed and replayed by finally replaying the workload and injecting the node failures; failure-atomicity defects can therefore be detected effectively, improving the reliability of the distributed system.
The defect processing method provided by the embodiments of this application determines the failure-atomicity defect included in a distributed system; adds, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code including code for keeping the multiple shared data consistent; and, if a node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, executes the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent. With this approach, when the node corresponding to the failure-atomicity defect fails, the error can be caught and handled in time and the multiple shared data processed consistently; shared-data consistency is therefore effectively improved, enhancing the reliability of the distributed system.
Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of a defect detection method provided by the present application;
FIG. 2 is a schematic diagram of shared data and key operations in an embodiment of a defect detection method provided by the present application;
FIG. 3a is a schematic diagram of a failure-atomicity violation mode in an embodiment of a defect detection method provided by the present application;
FIG. 3b is a schematic diagram of another failure-atomicity violation mode in an embodiment of a defect detection method provided by the present application;
FIG. 3c is a schematic diagram of yet another failure-atomicity violation mode in an embodiment of a defect detection method provided by the present application;
FIG. 4 is a schematic diagram of an embodiment of a defect detection device provided by the present application;
FIG. 5 is a flowchart of an embodiment of a defect processing method provided by the present application;
FIG. 6 is a schematic diagram of an embodiment of a defect processing device provided by the present application.
Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of this application. However, this application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from the spirit of this application; this application is therefore not limited to the specific implementations disclosed below.
This application provides a defect detection method and device, and a defect processing method and device. The various solutions are described in detail one by one in the following embodiments.
The core technical idea of the technical solutions provided by the embodiments of this application is: by observing a single correct execution of the distributed system, possible atomicity-violation errors can be predicted without injecting node failures; the workload is then replayed, the node failures are injected, and the defects are deterministically confirmed and replayed. Since atomicity-violation errors can be detected, the reliability of distributed systems can be effectively improved.
First Embodiment
Please refer to FIG. 1, which is a flowchart of an embodiment of a defect detection method provided by this application; the method is performed by a defect detection device. A defect detection method provided by this application includes:
Step S101: Determine an execution trace of multiple shared-data processing operations having an atomicity relationship in the distributed system.
The distributed system includes, but is not limited to, distributed storage systems, distributed computing frameworks, synchronization services, cluster management services, and so on, such as HDFS, HBase, Hadoop, Spark, ZooKeeper, Mesos, and YARN. In this embodiment, the distributed system is the distributed application coordination service Zookeeper.
Some operations in the services provided by a distributed system are atomic and indivisible with respect to the consistency of multiple shared data: either all of the steps are executed, or none of them is. The embodiments of this application refer to such operations as multiple shared-data processing operations having an atomicity relationship, or atomically associated operations for short. After these atomically associated operations finish, the related shared-data processing results should all be consistent. A failure-atomicity defect relates to associated operations on multiple shared data from the same data source; it may involve one or more node failures, and a node failure may occur at a moment between two atomically associated operations.
It should be noted that consistency of shared data is not limited to every copy being exactly identical; for example, the inventory quantity of a product in file 1 and file 2 may both be 1000, or only part of the data may match, for example the product number in file 1 is "123" while the product number in file 2 is "A-123", and so on.
Step S101 can be implemented as follows: run a specified workload, such as the test cases shipped with the system itself, to drive the distributed system; record how shared data is used under that workload; and generate the system execution trace. For example, the key operations with an atomicity relationship corresponding to shared data v include: 1) writing shared data v to file file1 in node 1 and writing shared data v to file file2 in node 1; 2) sending shared data v to node 2 and writing shared data v into the resource "resource" in node 2; 3) sending shared data v to node 3 and storing it in a memory variable of node 3. The system execution trace of shared data v then covers files file1 and file2 in node 1, the resource in node 2, and the memory variable in node 3.
In this embodiment, in addition to recording the key operations with an atomicity relationship corresponding to shared data (such as file writes and network data sends), other key operations are also recorded, such as thread creation/join, thread notify/wait, and message send/receive operations. From these data and their processing, causal relationships between events are constructed, so that the use of shared data can be tracked more accurately, improving the accuracy of the system execution trace.
In this embodiment, the method may further include the following steps: identifying key operations in the distributed system; and determining shared data from the identified key operations.
The key operations include operations involving file writes and network data sends. For example, the key operations in Zookeeper include the application programming interfaces (APIs) involved in file reading and writing and in network reading and writing.
To identify the key operations in a distributed system (such as Zookeeper), the following approach can be used: first, identify basic file operation instructions by manual analysis, for example the file-related IO operations in Java; then, by analyzing call relationships, identify the APIs in the system that involve file reads and writes and network reads and writes, and treat these APIs as key operations. In a concrete implementation, key operations can be extracted automatically by static analysis, so as to identify pairs of key operations with an atomicity relationship.
The multiple shared data are data originating from the same data source; these data have a sharing relationship. The shared data may be data located in different persistent files on the same node, for example the data written by node N into three files; it may also be data written by one node into a persistent file and sent to another node, for example the two pieces of data that node N1 writes into file x and sends to node N2; it may also be data sent by one node to multiple nodes, for example the two pieces of data sent by node N1 to nodes N2 and N3.
To identify the shared data in the distributed system, the approach shown in FIG. 2 can be used: starting from all the file operations and message send operations in the source code of the distributed system, data-dependence analysis traces the origin of the data written to files and the data carried in message packets until a common shared data source is found; such data are then marked as shared data. For example, if the data in three files are all found to have been written by node N, the data in these three files are shared data.
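The sketch below illustrates, under stated assumptions, the backwards data-dependence walk just described: from each file-write or message-send site, follow def-use edges back to the value's origin, and treat sites that reach the same origin as operating on shared data. The graph representation and all names are assumptions made for this sketch only, not structures defined by the patent.

```java
import java.util.*;

// Sketch: group write/send sites by the origin of the value they handle.
public class SharedDataFinder {
    // defUse.get(x) = the values that x was directly derived from.
    static String findOrigin(String value, Map<String, List<String>> defUse) {
        String current = value;
        while (defUse.containsKey(current) && !defUse.get(current).isEmpty()) {
            current = defUse.get(current).get(0); // follow one def-use edge back
        }
        return current; // a value with no predecessor is treated as the data source
    }

    // Any group with more than one site is a set of shared-data operations.
    static Map<String, List<String>> groupByOrigin(Map<String, String> siteToValue,
                                                   Map<String, List<String>> defUse) {
        Map<String, List<String>> groups = new HashMap<>();
        siteToValue.forEach((site, value) ->
            groups.computeIfAbsent(findOrigin(value, defUse), k -> new ArrayList<>()).add(site));
        return groups;
    }
}
```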
Step S103: Determine candidate failure-atomicity defects according to the execution trace.
The purpose of determining the execution trace is to track how shared data is used (its trajectory) and the causal chain, so as to determine on which nodes data sharing has occurred. The embodiments of this application analyze the execution trace of the distributed system, discover how the shared data is used, match it against preset failure-atomicity defect modes, and find node-failure injection points that may cause failure-atomicity violations.
As shown in FIG. 3a to FIG. 3c, the inventors of the present invention identified the following three failure-atomicity violation modes:
The first mode is shown in FIG. 3a. Variable v (for example the name "Zhang San", v0) is a piece of shared data that may be written by a node into file 1 (file1) and file 2 (file2), so the values of v stored in the two files (v1 and v2) should be consistent, and the write operations w1 and w2 to these two files can be regarded as a pair of associated operations. If a node failure occurs between the two file write operations (for example, w1 is executed at time t1 and w2 at time t2, and the node fails at some moment between t1 and t2), the two files may become inconsistent (v1 != v2), and an exception then arises during node failure recovery (v1 != v2 after node N recovers).
If, according to the system execution trace, the use of the shared data matches the first mode, the key operations associated with that shared data constitute a candidate failure-atomicity defect, and the node-failure injection point can be: make node N fail at some moment after the w1 operation is performed and before the w2 operation is performed.
The second mode is shown in FIG. 3b. Variable v is a piece of shared data that may be written by node 1 into file file1 (v1) and, at the same time, sent by node 1 to node 2. After receiving the message, node 2 may save the value of v into some resource (v2), for example by writing it to a file or keeping it in a memory variable, so the values of v stored in file1 and in the resource should be consistent. If node 1 fails between the file write operation and the message send operation, two inconsistent pieces of data may result, and an exception may arise during node failure recovery.
If the use of the shared data matches the second mode, the associated key operations constitute a candidate failure-atomicity defect, and the node-failure injection points include but are not limited to: make node 1 fail after the w1 operation is performed, or make node 2 fail before the w2 operation is performed.
The third mode is shown in FIG. 3c. Variable v is a piece of shared data that may be sent by node 1 to node 2 and to node 3; after receiving the message, node 2 and node 3 save the value of v in resource 1 (v1) and resource 2 (v2) respectively, so the two pieces of data v1 and v2 stored on node 2 and node 3 should be consistent. If node 1 fails between the two message send operations, the data on node 2 and node 3 (v1 and v2) may become inconsistent, and an exception may arise during node failure recovery.
If the use of the shared data matches the third mode, the associated key operations constitute a candidate failure-atomicity defect, and the node-failure injection points include but are not limited to: make node 1 fail after the shared data v has been sent to node 2, make node 1 fail after node 2 performs the w1 operation, or make node 3 fail before node 3 performs the w2 operation.
Step S105: Run the distributed system, and inject the node failures corresponding to the candidate failure-atomicity defects into the distributed system.
According to the node-failure injection points found in the previous step, the distributed system is re-run and node failures are inserted at the corresponding injection points. In this embodiment, the workload of step S101 is replayed to run the distributed system.
For example, in the first mode above, injecting a failure of node N after the w1 operation means that, after node N has written variable v0 into file file1, the w2 operation is not executed, i.e. node N does not write v0 into file file2.
As another example, in the second mode above, injecting a failure of node 1 after the w1 operation means that, after node 1 has written variable v0 into file file1, v0 is not sent to node 2, i.e. node 2 does not write v0 into its resource or memory.
As yet another example, in the third mode above, injecting a failure of node 1 after it has sent variable v0 to node 2 means that v0 is not subsequently sent to node 3, i.e. node 3 does not write v0 into its resource or memory.
Step S107: After running the distributed system, obtain the multiple shared data related to the failure-atomicity defect.
To observe whether the system satisfies the user constraint after the node failure is injected, the multiple shared data related to the failure-atomicity defect must be obtained after the injection. In this embodiment, after the atomically associated operations have executed, the multiple shared data associated with those operations can be obtained from the locations and data names of the shared data. For example, in the first mode above, the shared data v1 and v2 associated with the data source v0 are read from files file1 and file2 of node N; in the second mode above, v1 and v2 are read from file file1 in node 1 and from the resource or memory in node 2, respectively; in the third mode above, v1 and v2 are read from the resource in node 2 and the resource in node 3, respectively.
After the multiple shared data have been obtained, the next step is to observe whether the system satisfies the user constraint, so as to determine whether the candidate failure-atomicity defect is a real failure-atomicity defect.
Step S109: If the multiple shared data are inconsistent, determine the failure-atomicity defect.
By judging whether the multiple shared data are consistent, that is, by observing whether the distributed system satisfies the user constraint, it is determined whether the candidate failure-atomicity defect is a real failure-atomicity defect. The user constraint includes the constraint that the multiple shared data are mutually consistent.
Taking the first mode above as an example: during node failure recovery, if the shared data of the two files are consistent (v1 = v2), i.e. no exception arises, the candidate failure-atomicity defect determined in step S103 is not a real failure-atomicity defect; if the shared data of the two files become inconsistent (v1 != v2), i.e. an exception arises (v1 != v2 after node N recovers), the candidate failure-atomicity defect determined in step S103 is a real failure-atomicity defect.
In this embodiment, the processing flow of the defect detection method includes the following steps:
1) determine the key operations from the source code of the distributed system by static analysis;
2) determine the shared data from the key operations by analyzing data dependences;
3) run the distributed system and determine the execution trace of the shared data;
4) determine, from the execution trace, the node failures to be verified that relate to the shared data;
5) inject the node failures to be verified into the distributed system;
6) if the shared data are inconsistent after the node failure is injected, determine the failure-atomicity defect.
As can be seen from the above embodiment, by determining an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system, determining candidate failure-atomicity defects according to the execution trace, running the distributed system and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system, obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect, and, if the multiple shared data are inconsistent, determining the failure-atomicity defect, possible atomicity-violation errors can be predicted by observing a single correct execution of the distributed system without injecting node failures, and the defects can then be deterministically confirmed and replayed by finally replaying the workload and injecting the node failures; failure-atomicity defects can therefore be detected effectively, improving the reliability of the distributed system.
In the above embodiment, a defect detection method is provided; correspondingly, the present application also provides a defect detection device. The device corresponds to the embodiment of the above method.
Second Embodiment
Please refer to FIG. 4, which is a schematic diagram of an embodiment of the defect detection device of the present application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the description of the method embodiment. The device embodiment described below is merely illustrative.
This application additionally provides a defect detection device, including:
an execution trace determining unit 401, configured to determine an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system;
a candidate defect determining unit 403, configured to determine candidate failure-atomicity defects according to the execution trace;
a node failure injection unit 405, configured to run the distributed system and inject the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
a shared data acquisition unit 407, configured to acquire, after running the distributed system, multiple shared data related to the failure-atomicity defect;
a defect determining unit 409, configured to determine the failure-atomicity defect if the multiple shared data are inconsistent.
Optionally, the multiple shared-data processing operations include shared-data write operations to two or more files within one node; the failure-atomicity defect includes a node-failure atomicity defect between any two shared-data write operations.
Optionally, the multiple shared-data processing operations include a shared-data write operation to a file within the source node and a shared-data send operation from the source node to the target node; the failure-atomicity defect includes a source-node failure-atomicity defect between the shared-data write operation and the shared-data send operation.
Optionally, the multiple shared-data processing operations include a shared-data send operation from the source node to a first target node and a shared-data send operation from the source node to a second target node; the failure-atomicity defect includes a source-node failure-atomicity defect between the first shared-data send operation and the second shared-data send operation.
Optionally, the execution trace determining unit 401 is specifically configured to perform a system test on the distributed system and record the execution trace of the distributed system under the test data.
Optionally, the node failure injection unit 405 includes:
a system running subunit, specifically configured to re-execute the system test on the distributed system according to the test data.
Optionally, the device further includes:
a shared data determining unit, configured to determine, according to the source code of the distributed system, the multiple shared data respectively corresponding to each data source and the multiple shared-data processing operations.
Optionally, the multiple shared-data processing operations include at least one of the following operations: a data write operation and a data send operation.
In the above embodiments, a defect detection method is provided; correspondingly, the present application also provides a defect processing method, which corresponds to the embodiment of the above method.
Third Embodiment
Please refer to FIG. 5, which is a flowchart of an embodiment of the defect processing method of this application. Since this method embodiment corresponds to the method embodiment of the first embodiment, it is described relatively briefly; for related details, refer to the description of the first method embodiment.
This application additionally provides a defect processing method, including:
Step S501: Determine the failure-atomicity defects included in the distributed system.
This step can be implemented using the approach of the first embodiment above, or by other means, such as identifying failure-atomicity defects manually.
Step S503: Add error-handling code for the failure-atomicity defect to the source code of the distributed system.
The error-handling code includes code that keeps the multiple shared data related to the failure-atomicity defect consistent, including but not limited to: code that catches the error caused by the failure-atomicity defect, and shared-data consistency processing code that runs after the error is caught.
After the error-handling code has been added to the source code, the distributed system can be run so that it receives service requests from users. The user may be a person, who sends service requests to the distributed system through a client; the user may also be another software system, and so on.
Step S505: If the node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, execute the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent.
While the distributed system is running, if the failure-atomicity defect occurs, that is, a node failure occurs between the multiple atomically associated operations related to the defect, the error caused by the failure-atomicity defect is caught by the error-handling code and handled by that code.
In one example, the shared-data consistency processing code includes program code that rolls back the shared-data associated operations that have already been executed.
In another example, the shared-data consistency processing code includes program code that, starting from the shared data already processed by the executed shared-data associated operations, executes the associated operations that follow them.
It should be noted that a large-scale distributed system consists of a large number of computing nodes running different complex protocols. The computing nodes in such systems may face the following abnormal situations: 1) a single computing node is usually an ordinary personal computer (PC) with various reliability problems, such as hardware faults like disk damage and memory errors, operating system crashes, and so on, which can cause the node to crash or to hold erroneous data and exhibit erroneous behavior; 2) a single computing node or the entire data center may lose power, causing multiple computing nodes to crash, restart, and so on; 3) data-center management requirements, such as a burst of user requests, may require adding certain nodes to the data center, or removing some computing nodes when the load is light. All of these behaviors cause node failures at the level of computing nodes, data centers, and so on.
As can be seen from the above embodiment, the defect processing method provided by the embodiments of this application determines the failure-atomicity defects included in the distributed system; adds, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code including code that keeps the multiple shared data related to the failure-atomicity defect consistent; and, if the error corresponding to the failure-atomicity defect occurs while the distributed system is executing, executes the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent. With this approach, when the node corresponding to the failure-atomicity defect fails in the distributed system, the error can be caught and handled in time and the multiple shared data processed consistently; shared-data consistency is therefore effectively improved, improving the reliability of the distributed system.
In the above embodiments, a defect processing method is provided; correspondingly, the present application also provides a defect processing device. The device corresponds to the embodiment of the above method.
Fourth Embodiment
Please refer to FIG. 6, which is a schematic diagram of an embodiment of the defect processing device of the present application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the description of the method embodiment. The device embodiment described below is merely illustrative.
This application additionally provides a defect processing device, including:
a defect determining unit 601, configured to determine the failure-atomicity defects included in the distributed system;
an error code adding unit 603, configured to add error-handling code for the failure-atomicity defect to the source code of the distributed system;
an error processing unit 605, configured to execute the error-handling code if the node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, so that the multiple shared data related to the failure-atomicity defect are kept consistent.
Although this application is disclosed above with preferred embodiments, they are not intended to limit this application; any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of this application, so the protection scope of this application shall be subject to the scope defined by the claims of this application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent storage in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Claims (12)

  1. A defect detection method, characterized by comprising:
    determining an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system;
    determining candidate failure-atomicity defects according to the execution trace;
    running the distributed system, and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
    obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect;
    if the multiple shared data are inconsistent, determining the failure-atomicity defect.
  2. The method according to claim 1, characterized in that
    the multiple shared-data processing operations comprise shared-data write operations to two or more files within one node;
    the failure-atomicity defect comprises a node-failure atomicity defect between any two shared-data write operations.
  3. The method according to claim 1, characterized in that
    the multiple shared-data processing operations comprise a shared-data write operation to a file within a source node, and a shared-data send operation from the source node to a target node;
    the failure-atomicity defect comprises a source-node failure-atomicity defect between the shared-data write operation and the shared-data send operation.
  4. The method according to claim 1, characterized in that
    the multiple shared-data processing operations comprise a shared-data send operation from a source node to a first target node, and a shared-data send operation from the source node to a second target node;
    the failure-atomicity defect comprises a source-node failure-atomicity defect between the first shared-data send operation and the second shared-data send operation.
  5. The method according to claim 1, characterized in that the execution trace is determined by the following steps:
    performing a system test on the distributed system, and recording the execution trace of the distributed system under the test data.
  6. The method according to claim 5, characterized in that running the distributed system comprises:
    re-executing the system test on the distributed system according to the test data.
  7. The method according to claim 1, characterized by further comprising:
    determining the multiple shared data respectively corresponding to each data source.
  8. The method according to claim 1, characterized in that the multiple shared-data processing operations comprise at least one of the following operations: a data write operation and a data send operation.
  9. A defect processing method, characterized by comprising:
    determining failure-atomicity defects included in a distributed system;
    adding, to the source code of the distributed system, error-handling code for the failure-atomicity defect, the error-handling code comprising code for keeping multiple shared data related to the failure-atomicity defect consistent;
    if a node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, executing the error-handling code so that the multiple shared data related to the failure-atomicity defect are kept consistent.
  10. The method according to claim 9, characterized in that determining the failure-atomicity defect of the distributed system comprises:
    determining an execution trace of multiple shared-data processing operations having an atomicity relationship in the distributed system;
    determining candidate failure-atomicity defects according to the execution trace;
    running the distributed system, and injecting the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
    obtaining, after running the distributed system, multiple shared data related to the failure-atomicity defect; if the multiple shared data are inconsistent, determining the failure-atomicity defect.
  11. A defect detection device, characterized by comprising:
    an execution trace determining unit, configured to determine an execution trace of multiple shared-data processing operations having an atomicity relationship in a distributed system;
    a candidate defect determining unit, configured to determine candidate failure-atomicity defects according to the execution trace;
    a node failure injection unit, configured to run the distributed system and inject the node failures corresponding to the candidate failure-atomicity defects into the distributed system;
    a shared data acquiring unit, configured to acquire, after running the distributed system, multiple shared data related to the failure-atomicity defect;
    a defect determining unit, configured to determine the failure-atomicity defect if the multiple shared data are inconsistent.
  12. A defect processing device, characterized by comprising:
    a defect determining unit, configured to determine failure-atomicity defects included in a distributed system;
    an error code adding unit, configured to add, to the source code of the distributed system, error-handling code for the failure-atomicity defect;
    an error processing unit, configured to execute the error-handling code if a node failure corresponding to the failure-atomicity defect occurs while the distributed system is executing, so that the multiple shared data related to the failure-atomicity defect are kept consistent.
PCT/CN2020/082707 2019-04-12 2020-04-01 Defect detection/processing method and device WO2020207310A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910296074.7A CN111813786A (zh) 2019-04-12 2019-04-12 Defect detection/processing method and device
CN201910296074.7 2019-04-12

Publications (1)

Publication Number Publication Date
WO2020207310A1 true WO2020207310A1 (zh) 2020-10-15

Family

ID=72750888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/082707 WO2020207310A1 (zh) 2019-04-12 2020-04-01 缺陷检测/处理方法和装置

Country Status (2)

Country Link
CN (1) CN111813786A (zh)
WO (1) WO2020207310A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020032883A1 (en) * 2000-05-02 2002-03-14 Sun Microsystems, Inc. Method and system for providing cluster replicated checkpoint services
CN101183377A * 2007-12-10 2008-05-21 华中科技大学 High-availability database cluster based on message middleware
CN109002462A * 2018-06-04 2018-12-14 北京明朝万达科技股份有限公司 Method and system for implementing distributed transactions

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7827438B2 (en) * 2008-06-10 2010-11-02 Microsoft Corporation Distributed testing system and techniques
US8990640B2 (en) * 2012-11-16 2015-03-24 International Business Machines Corporation Selective posted data error detection based on request type
CN102999604B * 2012-11-20 2017-07-28 北京奇虎科技有限公司 Database performance detection method and device
CN103473031B * 2013-01-18 2015-11-18 龙建 Cooperative concurrent message bus, active component assembly model and component splitting method
CN105117369B * 2015-08-04 2017-11-10 复旦大学 Multiple parallel error detection system based on heterogeneous platforms
CN105095092A * 2015-09-25 2015-11-25 南京大学 Atomicity violation detection for Web application JavaScript code based on static analysis and dynamic execution
US10025788B2 * 2015-09-29 2018-07-17 International Business Machines Corporation Detection of file corruption in a distributed file system
CN106874074B * 2016-12-26 2020-05-05 哈尔滨工业大学 Concurrency defect avoidance system and method based on software transactional memory
US10346166B2 * 2017-04-28 2019-07-09 Intel Corporation Intelligent thread dispatch and vectorization of atomic operations
CN109522097B * 2018-10-11 2023-03-07 天津大学 Concurrency defect detection method based on adaptive random testing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020032883A1 (en) * 2000-05-02 2002-03-14 Sun Microsystems, Inc. Method and system for providing cluster replicated checkpoint services
CN101183377A * 2007-12-10 2008-05-21 华中科技大学 High-availability database cluster based on message middleware
CN109002462A * 2018-06-04 2018-12-14 北京明朝万达科技股份有限公司 Method and system for implementing distributed transactions

Also Published As

Publication number Publication date
CN111813786A (zh) 2020-10-23

Similar Documents

Publication Publication Date Title
Gao et al. An empirical study on crash recovery bugs in large-scale distributed systems
Yuan et al. Simple testing can prevent most critical failures: An analysis of production failures in distributed {Data-Intensive} systems
Tan et al. SALSA: Analyzing Logs as StAte Machines.
CN109643255B (zh) Automatically detecting distributed concurrency errors in cloud systems
US8726225B2 (en) Testing of a software system using instrumentation at a logging module
Viennot et al. Transparent mutable replay for multicore debugging and patch validation
US7747742B2 (en) Online predicate checking for distributed systems
US20130246358A1 (en) Online verification of a standby database in log shipping physical replication environments
Zhao Building dependable distributed systems
US11748215B2 (en) Log management method, server, and database system
Li et al. Dfix: automatically fixing timing bugs in distributed systems
US11500854B2 (en) Selective data synchronization to troubleshoot production environment failures
Song et al. Why software hangs and what can be done with it
Mudduluru et al. Lasso detection using partial-state caching
Nascimento et al. Shuttle: Intrusion recovery for paas
WO2020207310A1 (zh) Defect detection/processing method and device
Sun et al. Reasoning about modern datacenter infrastructures using partial histories
Kim et al. Modulo: Finding Convergence Failure Bugs in Distributed Systems with Divergence Resync Models.
Pham et al. Verifying mpi applications with simgridmc
WO2019109257A1 (zh) Log management method, server and database system
CN115421990A (zh) Data consistency testing method, system, terminal and medium for a distributed storage system
US9471409B2 (en) Processing of PDSE extended sharing violations among sysplexes with a shared DASD
Cachin et al. On limitations of using cloud storage for data replication
CN109791541B (zh) Log sequence number generation method and apparatus, and readable storage medium
US20230229582A1 (en) Information processing apparatus, processing method for information processing apparatus, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20788640

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20788640

Country of ref document: EP

Kind code of ref document: A1