CN114780442A - Testing method and device for distributed system - Google Patents

Publication number
CN114780442A
Authority
CN
China
Prior art keywords
distributed system
node
main node
determining
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210710364.3A
Other languages
Chinese (zh)
Inventor
吴文林
叶小萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yueshu Technology Co ltd
Original Assignee
Hangzhou Yueshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yueshu Technology Co ltd filed Critical Hangzhou Yueshu Technology Co ltd
Priority to CN202210710364.3A priority Critical patent/CN114780442A/en
Publication of CN114780442A publication Critical patent/CN114780442A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a testing method and a testing device for a distributed system. The distributed system comprises a plurality of shards, each having a master node, and the testing method is applied to any one of the shards. The testing method comprises the following steps, performed while write or read requests are continuously submitted to the distributed system: determining the master node of the shard to obtain a first master node, and constructing a network partition within the shard to isolate the first master node; and determining whether the distributed system fails. The method and the device solve the problem in the related art of low testing efficiency when performance-testing a distributed system, and improve testing efficiency.

Description

Testing method and device for distributed system
Technical Field
The present application relates to the field of computer software technology, and in particular to a method and an apparatus for testing a distributed system.
Background
With the development of information technology, the volume of information has grown explosively, placing ever higher demands on the computing power, storage and other resources of information systems. Standalone computing often cannot meet current demands, and more and more systems have adopted a distributed architecture to address this challenge. A distributed system is a collection of independent computers in which a set of algorithms organizes the computer nodes so that they cooperate on a computing task, and the system appears to its users as a single computer. The greatest benefit of a distributed system is that it can scale horizontally: the performance of the whole system is improved by adding more machines, meeting the continuously growing demands for computing power, storage and the like.
However, distributed systems are also known for the high complexity of their theory and implementation. For example, about a year after the well-known distributed algorithm Raft was published, an important logical flaw was exposed in its theory, and well-known industry implementations of distributed systems such as Etcd and Zookeeper have repeatedly been found to have problems such as inconsistent reads and writes. It is therefore necessary to performance-test distributed systems so that problems are found in time and the systems can be improved. Because of this complexity, however, it is difficult in the related art to find and diagnose problems quickly when testing the robustness, data consistency and similar properties of a distributed system.
No effective solution has yet been proposed for the problem of low testing efficiency when performance-testing a distributed system in the related art.
Disclosure of Invention
The embodiment of the application provides a method and a device for testing a distributed system, which are used for at least solving the problem of low testing efficiency when a performance test is performed on the distributed system in the related art.
In a first aspect, an embodiment of the present application provides a method for testing a distributed system, where the distributed system includes a plurality of shards each having a master node, and the method is applied to any one of the shards. The method includes, while write or read requests are continuously submitted to the distributed system:
determining the master node of the shard to obtain a first master node, and constructing a network partition within the shard to isolate the first master node;
determining whether the distributed system fails.
In some of these embodiments, where the first master node is isolated, the method includes:
after the shard has voted in a new master node, restoring the first master node before the new master node has completed the master-switch operation;
determining whether the distributed system fails.
In some of these embodiments, where the first master node is isolated, the method includes:
determining the master node of the shard to obtain a second master node, and constructing a network partition within the shard to isolate the second master node;
determining whether the distributed system fails.
In some embodiments, where multiple nodes are isolated within the shard, the method includes:
randomly restoring one of the isolated nodes;
determining the master node of the shard to obtain a third master node, and constructing a network partition within the shard to isolate the third master node;
determining whether the distributed system fails.
In some embodiments, determining the master node of the shard includes: acquiring node information of each node in the shard, and determining the master node of the shard according to the node information.
In some embodiments, in a distributed system using the Raft algorithm, the node information includes: an identification of whether the node is a master node, and the Term value of the node.
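As an illustration only, the following minimal Python sketch shows one way such node information could be used: among the nodes that report themselves as master, the one with the highest Term is taken as the actual master node. The `NodeInfo` type, its field names and the `find_true_master` function are hypothetical names introduced here, not part of the application.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical per-node report: role flag plus Raft Term."""
    node_id: str
    is_master: bool
    term: int


def find_true_master(infos: Iterable[NodeInfo]) -> Optional[NodeInfo]:
    """Return the self-reported master with the highest Term, or None."""
    claimants = [info for info in infos if info.is_master]
    if not claimants:
        return None  # no node currently claims the master role
    # A higher Term is more recent, so the stale (isolated) master loses.
    return max(claimants, key=lambda info: info.term)


reports = [
    NodeInfo("a", is_master=True, term=3),   # isolated old master, stale Term
    NodeInfo("b", is_master=True, term=4),   # newly elected master
    NodeInfo("c", is_master=False, term=4),
]
print(find_true_master(reports).node_id)
```

The sketch assumes the reports were collected at roughly the same time; how they are gathered (for example over RPC) is left out.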
In some of these embodiments, determining whether the distributed system fails includes:
checking the consistency of the finally written data, the consistency of the state machine log data, and linear consistency, and determining whether the distributed system fails according to the results.
In a second aspect, an embodiment of the present application provides a testing apparatus for a distributed system, where the distributed system includes a plurality of shards each having a master node, and the apparatus is applied to any one of the shards. The apparatus includes:
a master node positioning module, configured to determine the master node of the shard to obtain a first master node;
a network partition module, configured to construct a network partition within the shard to isolate the first master node;
a load generation module, configured to continuously submit write or read requests to the distributed system;
and a fault determination module, configured to determine whether the distributed system fails.
In a third aspect, an embodiment of the present application provides an electronic apparatus, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to run the computer program to perform the testing method of the distributed system.
In a fourth aspect, an embodiment of the present application provides a storage medium in which a computer program is stored, where the computer program is configured to perform the testing method of the distributed system when run.
Compared with the related art, the testing method of the distributed system provided by the embodiment of the application determines the master node of the shard to obtain a first master node, constructs a network partition within the shard to isolate the first master node, and continuously submits write or read requests to the distributed system throughout this process in order to determine whether the distributed system fails. This solves the problem in the related art of low testing efficiency when performance-testing a distributed system, and improves testing efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of the application environment of a testing method of a distributed system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a distributed system according to an embodiment of the present application;
FIG. 3 is a flow chart of a method of testing a distributed system according to a first embodiment of the present application;
FIG. 4 is a flow chart of a method of testing a distributed system according to a second embodiment of the present application;
FIG. 5 is a flow chart of a method of testing a distributed system according to a third embodiment of the present application;
FIG. 6 is a block diagram of a testing apparatus of a distributed system according to a fourth embodiment of the present application;
FIG. 7 is a schematic diagram of a master node positioning module according to a fifth embodiment of the present application;
FIG. 8 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the application, and that it is also possible for a person skilled in the art to apply the application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by one of ordinary skill in the art that the embodiments described herein may be combined with other embodiments without conflict.
Unless otherwise defined, technical or scientific terms referred to herein should have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The use of the terms "a" and "an" and "the" and similar referents in the context of describing the invention (including a single reference) are to be construed in a non-limiting sense as indicating either the singular or the plural. The use of the terms "including," "comprising," "having," and any variations thereof herein, is meant to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but rather can include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "And/or" describes the association relationship of the associated objects, indicating that there may be three relationships; for example, "A and/or B" may indicate: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The testing method of the distributed system provided by the present application may be applied to the application environment shown in FIG. 1, which is a schematic diagram of the application environment of the testing method according to an embodiment of the present application. As shown in FIG. 1, a stress-test terminal 102 communicates with a distributed system 104 through a network. FIG. 2 is a schematic diagram of the distributed system according to an embodiment of the present application. As shown in FIG. 2, the distributed system 104 includes a plurality of shards (shard 1 to shard n); each shard has a plurality of nodes, one of which is its master node, and the testing method may be applied to any shard. After a shard to be tested has been chosen, the stress-test terminal 102 continuously submits write or read requests to the distributed system 104; the master node of the shard is determined to obtain a first master node, a network partition is constructed within the shard to isolate the first master node, and it is determined whether the system has a fault.
The inventors analyzed the problem of low testing efficiency when performance-testing the distributed system 104 in the related art and found that such testing is random. An examination of existing testing frameworks for distributed systems (for example, Jepsen and Chaos Mesh) shows that, although most of these products already support network-partition fault injection, they take neither the current master node information nor the current network partition state into account when injecting such faults. These products therefore need a large number of test runs to uncover a system fault. In practice, however, the inventors found that the master node plays a very important role in a distributed system 104 that has one: all activity in the system is generally coordinated by the master node. When the master node fails (for example, through downtime or network isolation), the system must switch master nodes, and the switching process is highly error-prone.
For example, to increase the processing capacity of the system, the distributed system 104 generally uses a thread pool to process requests, and in practice the inventors have found that the following deadlock model occurs frequently:
task queue of thread one:
  + task1:
    1. condVar = true
    2. trylock(lock)
    3. condVar = false
task queue of thread two:
  + task2:
    1. hold(lock)
    2. wait for condVar = false
    3. release(lock)
Here the variable condVar indicates that Raft currently has a request in progress. For example, when Raft is processing a heartbeat request, task1 sets condVar = true and sets it back to false once processing is complete; before switching master, Raft first waits for all in-progress requests to finish. The task queue of thread one contains task1, which sets condVar = true, then waits to acquire the lock, and finally sets condVar = false. The task queue of thread two contains task2, which has already acquired the lock and waits until condVar = false before releasing it. When the steps interleave in this way, task1 waits for the lock held by task2 while task2 waits for condVar = false, which only task1 can set, so both threads are completely deadlocked.
The inventors analyzed the cause of this deadlock model. During a Raft master switch, a node (denoted node e) is initially in the follower state. Because of network isolation, a timeout or some other reason, node e does not receive the heartbeat request of the master node, so it initiates an election, obtains enough votes and begins to switch to master; node e then starts task2, which is dispatched to thread two for execution. However, during node e's master switch, the original master node resumes communication with it and sends it a heartbeat request, which enters the task queue of thread one as task1. It can be seen that in the distributed system 104 the switching of master nodes is a very error-prone process.
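The interleaving of task1 and task2 described above can be reproduced in a small, self-terminating Python sketch. This is an illustration under stated assumptions, not the application's code: the helper events that force the ordering, the timeouts that stand in for "blocked forever", and all names (`run_deadlock_model`, `cond_var`, and so on) are hypothetical.

```python
import threading


def run_deadlock_model(timeout: float = 0.2) -> dict:
    """Replay the task1/task2 interleaving; timeouts replace an endless hang."""
    lock = threading.Lock()
    cond = threading.Condition()          # guards cond_var
    state = {"cond_var": False}
    lock_held = threading.Event()         # task2 holds the lock
    cond_raised = threading.Event()       # task1 has set cond_var = True
    task1_done = threading.Event()
    outcome = {}

    def task1():  # heartbeat handling on thread one
        lock_held.wait()
        with cond:
            state["cond_var"] = True      # step 1: request in progress
        cond_raised.set()
        got = lock.acquire(timeout=timeout)   # step 2: blocks, task2 owns lock
        outcome["task1_got_lock"] = got
        if got:
            lock.release()
            with cond:
                state["cond_var"] = False     # step 3: never reached here
                cond.notify_all()
        task1_done.set()

    def task2():  # master switch on thread two
        with lock:                        # step 1: hold(lock)
            lock_held.set()
            cond_raised.wait()
            with cond:                    # step 2: wait for cond_var == False
                cleared = cond.wait_for(
                    lambda: not state["cond_var"], timeout=timeout)
            outcome["task2_saw_cond_false"] = cleared
            task1_done.wait()             # keep the demo deterministic
        # step 3: release(lock) happens only when the with-block exits

    threads = [threading.Thread(target=task1), threading.Thread(target=task2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


print(run_deadlock_model())
```

Both entries of the returned dictionary are `False`: task1 never obtains the lock and task2 never observes condVar = false, which is the mutual wait described in the text.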
Based on this, the present embodiment provides a testing method for a distributed system 104, and fig. 3 is a flowchart of the testing method for a distributed system according to the first embodiment of the present application, as shown in fig. 3, the flowchart includes the following steps:
step S301, while write or read requests are continuously submitted to the distributed system 104: determining the master node of the shard to obtain a first master node. Optionally, node information of each node in the shard may be acquired, and the master node of the shard determined from that information. It should be noted that it is not enough for each node to return only its current role information (for example, whether it is the master node), because in some cases (for example, when a master node has been isolated and the cluster has elected a new master node) several nodes report the master role at the same time, and the true master node must then be identified using other information. How the master node is judged depends on the specific consensus algorithm. For example, with the Raft algorithm, the Term value of each node must also be obtained. The Term value is the logical clock in Raft and is an integer; a higher Term is more recent. In Raft, if two nodes both claim to be the master node, the one reporting the higher Term value is taken as the actual master node;
step S302, while write or read requests are continuously submitted to the distributed system 104: constructing a network partition within the shard to isolate the first master node. It should be noted that the first master node may be isolated alone, or together with a few ordinary nodes randomly selected from the shard. Optionally, iptables can be used to conveniently construct the required network partition; for example, executing the following instruction on machine A disconnects all links between machine A and machine B: iptables -D INPUT -W1000 -W4 -s B -j DROP. Optionally, a compact declarative interface for injecting network partitions may be provided on top of Linux iptables: the system only needs to be given a declarative statement such as { { a, b }, { c, d, e } } to construct the two network partitions { a, b } and { c, d, e }, whose nodes cannot communicate with each other;
step S303, while write or read requests are continuously submitted to the distributed system 104: determining whether the distributed system 104 fails. For example, a fault of the distributed system 104 may be determined directly from a system error report; or, where the system reports no error, the consistency of the finally written data, the consistency of the state machine log data, and linear consistency may be checked, and the results used to determine whether the distributed system 104 has a data consistency problem.
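As a hedged illustration of the declarative partition description in step S302, the sketch below expands a statement such as { { a, b }, { c, d, e } } into per-node rule strings. It only builds command text and executes nothing. The function name, the node-to-address mapping and the use of the common `iptables -A INPUT -s <address> -j DROP` form (append a rule that drops incoming packets from one source) are assumptions made for this example, not the interface described in the application.

```python
from typing import Dict, List, Sequence


def partition_rules(
    partitions: Sequence[Sequence[str]],
    addresses: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each node to the drop rules that cut it off from other groups."""
    rules: Dict[str, List[str]] = {}
    for index, group in enumerate(partitions):
        # Nodes of every other group must be unreachable from this group.
        outsiders = [
            node
            for other_index, other in enumerate(partitions)
            if other_index != index
            for node in other
        ]
        for node in group:
            rules[node] = [
                f"iptables -A INPUT -s {addresses[peer]} -j DROP"
                for peer in outsiders
            ]
    return rules


# Hypothetical addresses for a five-node shard.
addresses = {name: f"10.0.0.{i}" for i, name in enumerate("abcde", start=1)}
rules = partition_rules([["a", "b"], ["c", "d", "e"]], addresses)
for node, commands in rules.items():
    print(node, commands)
```

Because every node drops incoming traffic from all nodes outside its own group, traffic is blocked in both directions between the two groups. Isolating only the master node corresponds to a statement in which the master forms a group of its own.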
Through steps S301 to S303, the embodiment addresses the low efficiency of performance-testing the distributed system 104 in the related art. Based on the finding that faults readily occur when the master node is switched, the master node of the shard is first determined to obtain a first master node; once the true master node of the shard has been quickly located, a network partition is constructed within the shard to isolate it. Write or read requests are submitted to the distributed system 104 throughout, so that the system keeps running, and the system is monitored for faults during the process. In this way, problems that related-art testing finds only slowly or misses altogether, such as infinite loops, deadlocks and inconsistent data across distributed replicas, are reproduced efficiently. This solves the problem of low testing efficiency in the related art and improves the efficiency of performance tests of the distributed system 104 such as system robustness and data consistency tests.
In some embodiments, fig. 4 is a flowchart of a testing method for a distributed system according to a second embodiment of the present application, where, as shown in fig. 4, the process includes the following steps:
step S401, after the shard has voted in a new master node, restoring the first master node before the new master node has completed the master-switch operation;
step S402, determining whether the distributed system 104 fails.
Through steps S401 to S402, the embodiment addresses the difficulty in the related art of discovering through testing that a system harbors a potential deadlock. Based on the finding that deadlocks are more likely to occur while a master switch is in progress, the master node is isolated, the shard votes in a new master node, and the isolated master node is restored (so that it may send heartbeat requests to other nodes) before the new master node has completed the master-switch operation. This reproduces the scenario in which a deadlock is likely to occur, so that a potential deadlock in the system can be discovered efficiently.
In some embodiments, fig. 5 is a flowchart of a testing method for a distributed system according to a third embodiment of the present application, where, as shown in fig. 5, the process includes the following steps:
step S501, determining the master node of the shard to obtain a second master node, and constructing a network partition within the shard to isolate the second master node; for example, in a 5-node Raft cluster (node a to node e), after the master node (e.g., node a) is isolated, the cluster elects a new master node (e.g., node b), which is then isolated in turn; 3 nodes (e.g., node c to node e) can still work, and the system elects a new master node (e.g., node c) from among them;
step S502, determining the master node of the shard to obtain a third master node, and constructing a network partition within the shard to isolate the third master node; for example, isolating master node c;
step S503, randomly restoring one isolated node; for example, one node (e.g., node b) is randomly restored from the three isolated nodes a, b and c;
step S504, determining the master node of the shard to obtain a fourth master node, and constructing a network partition within the shard to isolate the fourth master node; for example, the cluster elects a new master node (e.g., node d) from the 3 working nodes, namely node b, node d and node e, and that master node is isolated;
step S505, determining whether the distributed system 104 fails.
Through steps S501 to S505, the embodiment addresses the low efficiency of performance-testing the distributed system 104 in the related art. Based on the finding that faults readily occur when the master node is switched, a stateful network partition strategy is designed: each new network partition is constructed according to the current master node information and the current network partition state. Write or read requests are submitted to the distributed system 104 throughout, so that the system keeps running, and the system is monitored for faults during the process. In this way, problems that related-art testing finds only slowly or misses altogether, such as infinite loops, deadlocks and inconsistent data across distributed replicas, are reproduced efficiently, and testing efficiency is improved.
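The stateful strategy of steps S501 to S505 can be sketched as a small simulation in Python. Everything here is hypothetical and simplified: `ToyCluster` stands in for a real shard, elects a master only while a majority of nodes is connected, and always picks the first connected node, whereas a real Raft election may pick any eligible node. The strategy function isolates the current master, and when no master exists because quorum is lost, restores one randomly chosen isolated node.

```python
import random
from typing import List, Optional, Set, Tuple


class ToyCluster:
    """Toy stand-in for a shard: a master exists only with a majority."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        self.isolated: Set[str] = set()
        self.term = 0
        self.master: Optional[str] = None
        self._elect()

    def connected(self) -> List[str]:
        return [n for n in self.nodes if n not in self.isolated]

    def _elect(self) -> None:
        live = self.connected()
        if 2 * len(live) <= len(self.nodes):
            self.master = None            # quorum lost, no master node
        elif self.master not in live:
            self.term += 1                # new election, higher Term
            self.master = live[0]         # arbitrary deterministic choice

    def isolate(self, node: str) -> None:
        self.isolated.add(node)
        self._elect()

    def restore(self, node: str) -> None:
        self.isolated.discard(node)
        self._elect()


def next_partition_step(cluster: ToyCluster, rng: random.Random) -> Tuple[str, str]:
    """Choose the next action from the master and partition state."""
    if cluster.master is None:
        node = rng.choice(sorted(cluster.isolated))
        cluster.restore(node)             # regain quorum first
        return ("restore", node)
    node = cluster.master
    cluster.isolate(node)                 # always target the current master
    return ("isolate", node)


cluster = ToyCluster(list("abcde"))
rng = random.Random(0)
steps = [next_partition_step(cluster, rng) for _ in range(5)]
print(steps)
```

In a real test, load would keep running during every step and the fault check would follow each action; both are omitted from this sketch.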
It should be noted that the above embodiment is only one example of combining master node positioning with network partition construction to arrange a complex test scenario. A person skilled in the art will understand that, for brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
An embodiment of the present application further provides a testing apparatus for a distributed system 104, where the distributed system 104 includes a plurality of shards each having a master node, and the apparatus is applied to any one of the shards. FIG. 6 is a block diagram of a testing apparatus of a distributed system according to a fourth embodiment of the present application. As shown in FIG. 6, the apparatus includes a master node positioning module 601, a network partition module 602, a load generation module 603, and a fault determination module 604, where:
the master node positioning module 601 is configured to determine the master node of the shard to obtain a first master node; for example, the master node positioning module 601 may consist of two parts. FIG. 7 is a schematic diagram of a master node positioning module according to a fifth embodiment of the present application; as shown in FIG. 7, a node information query interface may be deployed on each distributed node under test, while a master node query module is deployed at the stress-test terminal. The master node query module calls the node information query interface through RPC, and other modules obtain master node information by integrating the master node query module;
the network partition module 602 is configured to construct a network partition within the shard to isolate the first master node;
the load generation module 603 is configured to continuously submit write or read requests to the distributed system 104; the load generation module keeps the distributed system 104 actually running, thereby triggering various potential problems. It should be noted that the load generation module 603 may integrate the master node positioning module 601 so as to apply load to the master node, or it may apply load to every node indiscriminately. Indiscriminate load is more efficient, simpler to implement, and can trigger more potential data problems in the distributed system 104; for example, it can detect the problems caused when a node still believes it is the master node even though it has already lost the master role.
the fault determination module 604 is configured to determine whether the distributed system 104 fails. Data checks on the distributed system 104 may include: a consistency check of the finally written data, a consistency check of the state machine log data, and a linear consistency (linearizability) check. For the consistency check of the finally written data, only the data finally persisted to disk on each distributed instance needs to be examined. For the linear consistency check of the system, the existing tool Porcupine can be integrated. Even if the data finally persisted on every distributed instance is found to be consistent, this does not guarantee that the cluster was free of inconsistency while it was running; to guarantee that, the log contents of the state machines on all instances must also be strictly checked for consistency.
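Two of the checks named above, the consistency of the finally persisted data and the consistency of the state machine logs, can be sketched in Python as follows. This is a minimal illustration assuming the persisted data and the logs have already been collected from each instance into plain dictionaries and lists; the function names and data shapes are hypothetical, and the linear consistency check is left to a dedicated tool.

```python
from typing import Dict, List, Optional


def final_data_consistent(snapshots: Dict[str, Dict[str, str]]) -> bool:
    """True when every instance persisted exactly the same key/value data."""
    values = list(snapshots.values())
    return all(snapshot == values[0] for snapshot in values[1:])


def first_log_divergence(logs: Dict[str, List[str]]) -> Optional[int]:
    """Index of the first log position where instances disagree, else None.

    Assumption of this sketch: a log that is shorter than the others counts
    as diverging at its end, which is a deliberately strict reading.
    """
    entries = list(logs.values())
    if not entries:
        return None
    shortest = min(len(log) for log in entries)
    for index in range(shortest):
        if len({log[index] for log in entries}) > 1:
            return index
    if any(len(log) != shortest for log in entries):
        return shortest
    return None


snapshots = {"a": {"k": "2"}, "b": {"k": "2"}, "c": {"k": "2"}}
logs = {
    "a": ["set k=1", "set k=2"],
    "b": ["set k=1", "set k=2"],
    "c": ["set k=1", "set k=3", "set k=2"],
}
print(final_data_consistent(snapshots), first_log_divergence(logs))
```

The example data shows why the log check is needed in addition to the final-data check: all three instances end with the same value for `k`, yet instance c applied a different entry at log position 1.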
In addition, in combination with the testing method of the distributed system in the foregoing embodiments, an embodiment of the present application may provide a storage medium for implementing it. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the testing method of a distributed system of any of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a method of testing a distributed system. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
In an embodiment, an electronic device is provided, which may be a server. Fig. 8 is a schematic diagram of the internal structure of the electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor provides computing and control capabilities, the network interface is used for communicating with an external terminal through a network connection, the internal memory provides an environment for running the operating system and the computer program, the computer program is executed by the processor to implement a method of testing a distributed system, and the database is used for storing data.
Those skilled in the art will appreciate that the structure shown in fig. 8 is a block diagram of only the part of the structure relevant to the present application, and does not limit the electronic device to which the present application may be applied; a particular electronic device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The embodiments described above represent only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of testing a distributed system, the distributed system comprising a plurality of shards each having a master node, the method being applied to any one of the shards, the method comprising: while write or read requests are continuously submitted to the distributed system:
determining the master node of the shard to obtain a first master node, and constructing a network partition within the shard to isolate the first master node;
determining whether the distributed system has failed.
2. The method of claim 1, wherein, with the first master node isolated, the method comprises:
after the shard has elected a new master node by voting, restoring the first master node before the new master node has completed the master switchover operation;
determining whether the distributed system has failed.
3. The method of claim 1, wherein, with the first master node isolated, the method comprises:
determining the master node of the shard to obtain a second master node, and constructing a network partition within the shard to isolate the second master node;
determining whether the distributed system has failed.
4. The method of claim 3, wherein, with a plurality of nodes isolated within the shard, the method comprises:
randomly restoring one of the isolated nodes;
determining the master node of the shard to obtain a third master node, and constructing a network partition within the shard to isolate the third master node;
determining whether the distributed system has failed.
5. The method according to any one of claims 1 to 4, wherein determining the master node of the shard comprises: acquiring node information of each node in the shard, and determining the master node of the shard according to the node information.
6. The method of claim 5, wherein, in a distributed system using the Raft algorithm, the node information comprises: an identifier indicating whether the node is a master node, and the Term value of the node.
7. The method according to any one of claims 1 to 4, wherein determining whether the distributed system has failed comprises:
checking the consistency of the finally written data, the consistency of the state machine log data, and linearizability, and determining whether the distributed system has failed according to the check results.
8. A test apparatus for a distributed system, the distributed system comprising a plurality of shards each having a master node, the apparatus being applied to any one of the shards, the apparatus comprising:
a master node location module for determining the master node of the shard to obtain a first master node;
a network partition module for constructing a network partition within the shard to isolate the first master node;
a load generation module for continuously submitting write or read requests to the distributed system;
a failure determination module for determining whether the distributed system has failed.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the method of testing a distributed system according to any one of claims 1 to 7.
10. A storage medium having stored thereon a computer program, wherein the computer program is arranged to perform, when run, the method of testing a distributed system according to any one of claims 1 to 7.
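
The master node determination recited in claims 5 and 6 can be illustrated with a short Python sketch. It assumes that each node reports an identifier saying whether it considers itself the master node together with its Term value, and that among several nodes claiming to be the master the one with the highest Term is the current one. The names NodeInfo and locate_master are hypothetical and the selection rule is an assumption for illustration, not a statement of the claimed method.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class NodeInfo:
    name: str
    is_master: bool  # whether the node reports itself as the master node
    term: int        # the node's Raft Term value


def locate_master(infos: Iterable[NodeInfo]) -> Optional[NodeInfo]:
    """Pick the self-reported master with the highest Term, if any.

    An isolated former master may still report is_master=True, but it
    cannot advance its Term past that of the newly elected master.
    """
    claimants = [info for info in infos if info.is_master]
    if not claimants:
        return None
    return max(claimants, key=lambda info: info.term)


infos = [
    NodeInfo("n1", is_master=True, term=3),   # isolated, stale master
    NodeInfo("n2", is_master=True, term=4),   # newly elected master
    NodeInfo("n3", is_master=False, term=4),
]
master = locate_master(infos)
```

In this example locate_master returns the entry for n2, because n1 claims the role only with an older Term.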
CN202210710364.3A 2022-06-22 2022-06-22 Testing method and device for distributed system Pending CN114780442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210710364.3A CN114780442A (en) 2022-06-22 2022-06-22 Testing method and device for distributed system

Publications (1)

Publication Number Publication Date
CN114780442A true CN114780442A (en) 2022-07-22

Family

ID=82422379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210710364.3A Pending CN114780442A (en) 2022-06-22 2022-06-22 Testing method and device for distributed system

Country Status (1)

Country Link
CN (1) CN114780442A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138687A1 (en) * 2008-11-28 2010-06-03 Fujitsu Limited Recording medium storing failure isolation processing program, failure node isolation method, and storage system
CN106911728A (en) * 2015-12-22 2017-06-30 华为技术服务有限公司 The choosing method and device of host node in distributed system
CN108959045A (en) * 2018-06-08 2018-12-07 郑州云海信息技术有限公司 A kind of test method and system of NAS clustering fault performance of handoffs
CN111104283A (en) * 2019-11-29 2020-05-05 浪潮电子信息产业股份有限公司 Fault detection method, device, equipment and medium of distributed storage system
CN111124719A (en) * 2019-12-13 2020-05-08 苏州浪潮智能科技有限公司 High-availability test method, device and readable medium based on main and standby management nodes
CN111181779A (en) * 2019-12-20 2020-05-19 苏州浪潮智能科技有限公司 Method and device for testing cluster failover performance and storage medium
CN112069014A (en) * 2020-08-28 2020-12-11 苏州浪潮智能科技有限公司 Storage system fault simulation method, device, equipment and medium
CN112596934A (en) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 Fault testing method and device
CN112860525A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Node fault prediction method and device in distributed system
CN113411221A (en) * 2021-06-30 2021-09-17 中国南方电网有限责任公司 Power communication network fault simulation verification method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhan Xiaolang: "Raft in Practice Series", 51CTO Blog *
Li Yafang (ed.): "Linux Operating System Application and Security: A Project-Based Practical Tutorial", 31 July 2020 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220722