WO2024045980A1 - Distributed timing message system testing method, apparatus, and device - Google Patents

Distributed timing message system testing method, apparatus, and device

Info

Publication number
WO2024045980A1
Authority
WO
WIPO (PCT)
Prior art keywords
message
timing
server
fault
timing message
Prior art date
Application number
PCT/CN2023/110149
Other languages
English (en)
French (fr)
Inventor
高坤
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2024045980A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • This specification relates to the field of testing technology, and in particular to a distributed timing message system testing method, device and equipment.
  • A distributed timing message system contains multiple nodes responsible for delivering scheduled messages. These nodes usually cooperate according to certain strategies to improve efficiency and to take disaster recovery into account.
  • One or more embodiments of this specification provide a distributed timing message system testing method, device, equipment and storage medium to solve the following technical problem: a test scheme is needed that can more reliably test the robustness of a distributed timing message system, so as to guide improvements to the system and reduce these risks.
  • One or more embodiments of this specification provide a distributed timing message system testing method. The system includes a message publishing client, a message subscription client, and a timing message server.
  • The method includes: initiating, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; constructing a fault injection instruction according to the delay time corresponding to the timing messages and sending it to the timing message server, so as to inject a specified type of fault into the timing message server; publishing, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verifying, according to the subscriptions, the reception of the timing messages by the message subscription client, and determining the test result for the system according to the verification result.
  • One or more embodiments of this specification provide a distributed timing message system testing device. The system includes a message publishing client, a message subscription client, and a timing message server. The device includes:
  • a message subscription module, which initiates, through the message subscription client, batch subscriptions to the timing messages that the message publishing client can publish;
  • a fault injection module, which constructs a fault injection instruction according to the delay time corresponding to the timing messages and sends it to the timing message server, so as to inject a specified type of fault into the timing message server;
  • a publishing and delivery module, which publishes, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client;
  • a result determination module, which verifies, according to the subscriptions, the reception of the timing messages by the message subscription client, and determines the test result for the system according to the verification result.
  • One or more embodiments of this specification provide distributed timing message system testing equipment. The system includes a message publishing client, a message subscription client, and a timing message server.
  • The equipment includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a specified type of fault into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, according to the subscriptions, the reception of the timing messages by the message subscription client, and determine the test result for the system according to the verification result.
  • One or more embodiments of this specification provide a non-volatile computer storage medium. The medium stores computer-executable instructions, and the computer-executable instructions are configured to: initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a specified type of fault into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, according to the subscriptions, the reception of the timing messages by the message subscription client, and determine the test result for the system according to the verification result.
  • At least one of the above technical solutions adopted in one or more embodiments of this specification can achieve the following beneficial effects: for scenarios that use scheduled messages on a distributed cluster, a cooperation architecture of a message publishing client, a message subscription client, and a scheduled message server is provided.
  • The message publishing client publishes scheduled messages to the scheduled message server, the scheduled message server delivers the scheduled messages to the message subscription client, and faults of optional types are actively injected into the scheduled message server.
  • Figure 1 is a schematic flow chart of a distributed timing message system testing method provided by one or more embodiments of this specification;
  • Figure 2 is a partial architectural schematic diagram of a distributed timing message system provided by one or more embodiments of this specification;
  • Figure 3 is a schematic diagram of a partition takeover scenario in the system in Figure 2 provided by one or more embodiments of this specification;
  • Figure 4 is a schematic diagram of a specific implementation of the method in Figure 1 provided by one or more embodiments of this specification;
  • Figure 5 is a schematic structural diagram of a distributed timing message system testing device provided by one or more embodiments of this specification;
  • Figure 6 is a schematic structural diagram of distributed timing message system testing equipment provided by one or more embodiments of this specification.
  • The embodiments of this specification provide a distributed timing message system testing method, device, equipment and storage medium.
  • Scheduled messages (also called timing messages in this specification) are messages that consumers can consume only after a specified timestamp. They address scenarios in which message production and consumption have time window requirements, or in which scheduled tasks are triggered by messages. Scheduled messages are often used to trigger designated business functions to run at set times, such as page refresh, transaction retry, payment reminder, running risk control strategies, collecting order data, and reporting monitoring data.
  • Robustness refers to whether software can avoid freezing or crashing and still complete its work normally in the face of input errors, disk failures, network overload, or intentional attacks.
  • This solution constructs server-side fault injection by sending instructions, thereby integrating stability tests at a daily level and continuously uncovering issues related to product robustness. The following is a detailed explanation based on this idea.
  • Figure 1 is a schematic flowchart of a distributed timing message system testing method provided by one or more embodiments of this specification.
  • This method can be applied to different business fields, such as: electronic payment business field, e-commerce business field, social business field, game business field, official business field, etc.
  • This process can be executed on distributed timing message system equipment in these fields, and/or on test equipment connected to these systems. For example, test programs are pre-installed on these devices; by running a test program, corresponding control instructions are sent to one or more ends, and the end that receives an instruction responds to it and executes the corresponding steps. Certain input parameters or intermediate results in the process can be adjusted manually to help improve accuracy.
  • The process in Figure 1 may include the following steps S102 to S108.
  • S102: Use the message subscription client to initiate batch subscriptions to scheduled messages that can be published by the message publishing client.
  • The distributed scheduled message system includes a message publishing client, a message subscription client, and a scheduled message server (referred to as the server for short). Each end can have multiple instances, which have been deployed in a distributed manner in advance.
  • The system includes at least a distributed cluster composed of multiple scheduled message servers. Each scheduled message server is a node in the distributed cluster. The message publishing client and the message subscription client can also serve as nodes in the distributed cluster, or be outside the distributed cluster.
  • The message subscription client subscribes to a scheduled message, and the message publishing client or the scheduled message server itself publishes the scheduled message accordingly. The published scheduled message is delivered by the scheduled message server to the message subscription client on schedule.
  • The test script triggers the message subscription client to subscribe, triggers the message publishing client to publish, and then tests whether the delivery behavior of the scheduled message server is abnormal.
  • S104: Construct a fault injection instruction according to the delay time corresponding to the timing message and send it to the timing message server, so as to inject a specified type of fault into the timing message server.
  • Node status changes may occur within the system and introduce risk. This risk is mainly reflected in its impact on the delivery behavior of the scheduled message server.
  • The delay time corresponding to a scheduled message includes at least the scheduled duration. For example, if a scheduled message is given a scheduled duration of 30 seconds, the message should be sent (for example, delivered, or published and delivered) or received 30 seconds after the point in time at which the timer was set, so the delay time corresponding to this message can be 30 seconds. In addition, because sending and receiving messages also takes a small amount of time (for example, on the order of milliseconds), a small margin can be added to the scheduled duration, and the total used as the delay time corresponding to the scheduled message.
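  • The delay time calculation described above can be illustrated with a minimal sketch. The function name `delay_time_ms` and the default margin of 50 ms are assumptions made for illustration only; the specification says only that a small, millisecond-level margin may be added to the scheduled duration.

```python
def delay_time_ms(scheduled_duration_s: float, transit_margin_ms: int = 50) -> int:
    """Return the delay time for a scheduled message, in milliseconds.

    The delay time is the scheduled duration plus a small margin that
    accounts for the time spent sending and receiving the message.
    The 50 ms default is an illustrative assumption, not a value from the source.
    """
    if scheduled_duration_s < 0 or transit_margin_ms < 0:
        raise ValueError("durations must be non-negative")
    return round(scheduled_duration_s * 1000) + transit_margin_ms


# A 30-second scheduled message with the assumed 50 ms margin.
print(delay_time_ms(30))
```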
  • This solution can also be used for testing in instant message scenarios. However, in such scenarios messages are sent immediately and the stability and predictability of the server are relatively clear, so the test time window is narrower and the test effect is very limited, which is not conducive to completely and comprehensively simulating and uncovering the various complex risk environments of real situations.
  • The solution of this application is therefore mainly aimed at the scheduled message scenario. Through the combination of a large number of scheduled messages with different delay times, together with server-side fault injection of multiple optional types and optional strategies, various complex test scenarios can be constructed more realistically, which helps achieve more efficient and effective test results at low cost.
  • According to the delay time corresponding to the timing message, a fault injection instruction whose fault injection time is not less than the delay time is constructed to perform server-side fault injection, where the fault injection time includes the duration for which at least part of the injected fault remains in effect.
  • Because the fault injection time covers the duration for which at least part of the injected fault remains in effect, the server is more likely to be in the expected fault state at key time points, such as when the timing message is sent or received after the delay. Exceptions that occur in the reception of the timing message are therefore more likely to be caused by this fault, which helps clarify the correlation, and the degree of correlation, between the test influencing factors and the test results.
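  • The constraint that the fault injection time is not less than the delay time can be sketched as follows. The class `FaultInjectionInstruction`, its field names, the fault type string, and the 5-second extra margin are all hypothetical; the specification does not define an instruction format. The JSON form is shown only as one plausible way to send the instruction to a receiving end.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FaultInjectionInstruction:
    fault_type: str      # hypothetical identifier, e.g. "cpu_surge"
    target_server: str   # hypothetical server name
    duration_ms: int     # how long the injected fault stays in effect


def build_instruction(fault_type: str, target_server: str,
                      delay_time_ms: int, extra_ms: int = 5000) -> FaultInjectionInstruction:
    """Build an instruction whose fault duration is not less than the delay time."""
    if delay_time_ms < 0 or extra_ms < 0:
        raise ValueError("durations must be non-negative")
    # duration >= delay time, so the fault still holds when delivery is due
    return FaultInjectionInstruction(fault_type, target_server, delay_time_ms + extra_ms)


def to_json(instruction: FaultInjectionInstruction) -> str:
    """Serialize the instruction, for example as an HTTP request body."""
    return json.dumps(asdict(instruction), sort_keys=True)


instruction = build_instruction("cpu_surge", "server-A", delay_time_ms=30050)
print(to_json(instruction))
```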
  • A large number of timing messages can be involved at the same time, and their delay times can be set with diverse strategies, such as a consistent delay time or stepped delay times.
  • There can likewise be multiple servers and message subscription clients participating in message delivery, thereby constructing a more complex and realistic test environment and testing the robustness of the server more accurately.
  • S106: Publish each of the timing messages subscribed to by the message subscription client to the timing message server through the message publishing client, so that the timing message server delivers the timing messages to the message subscription client.
  • One or more message subscription clients subscribe to a scheduled message from the scheduled message server. The scheduled message is published by the message publishing client to the server, and the server delivers the published scheduled message to each message subscription client that subscribed to it.
  • Fault injection may affect the delivery action of the server, and thereby the reception by the message subscription client.
  • For example, the scheduled message server is in a state affected by an injected fault before delivering the scheduled message or while delivering it. In that case, another scheduled message server may take over the service and complete the delivery. The process of taking over the service causes node status changes.
  • This solution pays special attention to the robustness of the server in this case. Note that the other scheduled message server can also be injected with faults. This solution focuses not only on the robustness of a single scheduled message server, but also on the robustness of the entire distributed cluster.
  • S108: Verify the reception of the timing messages by the message subscription client according to the subscriptions, and determine the test result of the system according to the verification result.
  • The subscriptions reflect the expectation of which scheduled messages are to be received. Because of an injected fault, however, the scheduled messages actually received by the message subscription client may not meet this expectation, and verification determines the extent to which the expectation is not met. For example, according to the subscriptions, the completeness and timeliness of the scheduled messages that the message subscription client receives through delivery by the server are verified to determine whether delivery was affected by the injected fault of the specified type, and thereby to determine the robustness of the server. In addition, the earlier interactions and status changes of the one or more servers involved in the timing messages can be checked to see whether they comply with the predetermined logic.
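  • A minimal sketch of the completeness and timeliness check follows. The names `Expected` and `verify`, the 1-second tolerance, and the representation of received messages as a mapping from message ID to receive time are assumptions for illustration; the specification does not prescribe how the verification is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expected:
    message_id: str
    due_at: float  # expected receive time, in seconds since the test started


def verify(expected: list[Expected], received: dict[str, float],
           tolerance_s: float = 1.0) -> dict:
    """Compare what was subscribed to against what was actually received.

    Completeness: every expected message must have been received.
    Timeliness: a message must not arrive before its due time, nor more
    than `tolerance_s` seconds after it (the tolerance is an assumption).
    """
    missing = [e.message_id for e in expected if e.message_id not in received]
    early = [e.message_id for e in expected
             if e.message_id in received and received[e.message_id] < e.due_at]
    late = [e.message_id for e in expected
            if e.message_id in received
            and received[e.message_id] - e.due_at > tolerance_s]
    return {"missing": missing, "early": early, "late": late,
            "passed": not (missing or early or late)}


expected = [Expected("m1", 30.0), Expected("m2", 30.0), Expected("m3", 30.0)]
received = {"m1": 30.2, "m2": 33.0}  # m3 was never delivered, m2 arrived late
print(verify(expected, received))
```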
  • The above test process can be arranged into daily-level test cases and executed repeatedly to obtain more reliable overall test results.
  • With the method in Figure 1, a three-end cooperation architecture of message publishing client, message subscription client, and scheduled message server is provided.
  • The message publishing client publishes scheduled messages to the scheduled message server, the scheduled message server delivers the scheduled messages to the message subscription client, and faults of optional types are actively injected into the scheduled message server, so that the delivery of scheduled messages is affected by server failures.
  • The reception of a large number of scheduled messages by the message subscription client under this influence (for example, its completeness and timeliness) is used as the basis for evaluating system stability.
  • The approach is well targeted and reliable, enables daily-level, continuously running abnormality tests, reduces manual operation costs, and better supports robustness testing of distributed timing message systems.
  • Figure 2 is a partial architectural schematic diagram of a distributed timing message system provided by one or more embodiments of this specification.
  • In Figure 2, the producer serves as the above-mentioned message publishing client, and the consumer serves as the above-mentioned message subscription client.
  • There is also a partition coordinator responsible for coordinating the scheduled message servers.
  • The system can use the scheduled message server as a dimension to logically divide all messages (including scheduled messages to be delivered) into multiple partitions and assign them to the corresponding scheduled message servers. Each scheduled message server provides delivery and other services for the messages belonging to its partitions.
  • For example, the partitions currently corresponding to server A are P1 and P2, the partitions corresponding to server B are P3 and P4, and the partitions corresponding to server C are P5 and P6.
  • The partition coordinator synchronizes partition configuration information between the scheduled message servers and allocates partitions.
  • When working normally, the producer produces messages and inputs them into the corresponding scheduled message server. It can also send a timing instruction to the server to make a message a scheduled message, or send an instruction to cancel the timing.
  • Incoming messages pass through filters (multiple filters can form a filter chain).
  • A time wheel, a wheel-shaped data structure, stores the timing messages. The time wheel is divided into multiple slots, and the slot in which a timing message is stored is determined by its scheduled duration.
  • The storage area can be divided into a short-term storage area and a long-term storage area, which are chosen according to actual needs.
  • The scheduled message server also keeps time through a time wheel trigger. When a scheduled duration expires, it triggers a storage query for the corresponding scheduled message and delivers that message as an output message to the corresponding consumer through the delivery router.
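  • The time wheel described above can be sketched as a simple hashed timing wheel. This is a generic illustration of the data structure, not the implementation used in the patent: the class name `TimeWheel`, the tick-based API, and the use of a "remaining rounds" counter for delays longer than one revolution are assumptions.

```python
class TimeWheel:
    """A minimal timing wheel: messages are stored in the slot for their due tick."""

    def __init__(self, slots: int = 60):
        if slots < 1:
            raise ValueError("the wheel needs at least one slot")
        self.slots = [[] for _ in range(slots)]
        self.current = 0  # index of the slot the pointer is on

    def add(self, message: str, delay_ticks: int) -> None:
        """Store a message that becomes due `delay_ticks` ticks from now."""
        if delay_ticks < 1:
            raise ValueError("delay must be at least one tick")
        n = len(self.slots)
        slot = (self.current + delay_ticks) % n
        rounds = (delay_ticks - 1) // n  # full revolutions to wait before firing
        self.slots[slot].append((rounds, message))

    def tick(self) -> list[str]:
        """Advance the pointer by one slot and return the messages that are due."""
        self.current = (self.current + 1) % len(self.slots)
        due, keep = [], []
        for rounds, message in self.slots[self.current]:
            if rounds == 0:
                due.append(message)
            else:
                keep.append((rounds - 1, message))
        self.slots[self.current] = keep
        return due


wheel = TimeWheel(slots=4)
wheel.add("a", 1)
wheel.add("b", 5)  # longer than one revolution of the 4-slot wheel
wheel.add("c", 4)
print([wheel.tick() for _ in range(5)])
```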
  • Figure 3 is a schematic diagram of a partition takeover scenario in the system in Figure 2 provided by one or more embodiments of this specification.
  • In Figure 3, server A is responsible for partition P3, and server B serves as the backup manager of P3. When a takeover is required, server B takes over P3.
  • Server A shuts down P3 by closing the timer of P3 and updating the checkpoint (rendered as "detection point" in the translation) accordingly.
  • Server B starts the failure message compensation task and restores P3 by establishing the timer of P3 and reading the partition checkpoint. For example, before P3 was closed, the index had reached time point 1007. After taking over P3, server B can continue to obtain the messages at the corresponding time points starting from 1007. For example, the messages corresponding to time points 1007 to 1009 are obtained, and then the current index is updated.
  • Server B then continues to serve P3. During this process there is a partition state transition stage, which leads to a risk of distributed inconsistency.
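  • The checkpoint-based compensation in the takeover example can be sketched as follows. The function name `compensate`, the dictionary layout, and two details are assumptions: the checkpoint is treated as the first time point not yet processed (so 1007 itself is replayed), and the index is advanced to the time point after the last one replayed. The specification states only that messages from 1007 to 1009 are obtained and that the index is then updated.

```python
def compensate(messages_by_time: dict[int, list[str]],
               checkpoint: int, now: int) -> tuple[list[str], int]:
    """Replay the messages of a taken-over partition from its checkpoint.

    Returns the recovered messages for time points checkpoint..now
    (inclusive, an assumption) and the updated index.
    """
    if now < checkpoint:
        return [], checkpoint  # nothing to replay yet
    recovered: list[str] = []
    for time_point in range(checkpoint, now + 1):
        recovered.extend(messages_by_time.get(time_point, []))
    return recovered, now + 1


# Partition P3 as read by server B after server A closed it at index 1007.
p3 = {1006: ["old"], 1007: ["m7"], 1008: ["m8a", "m8b"], 1009: ["m9"], 1010: ["future"]}
recovered, index = compensate(p3, checkpoint=1007, now=1009)
print(recovered, index)
```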
  • After a specified type of fault is injected into a scheduled message server, the injected fault can trigger the following in the distributed cluster: at least one scheduled message server newly comes online to take on the service of delivering scheduled messages, or at least one scheduled message server takes over the service of delivering scheduled messages from another scheduled message server.
  • In response to the injected fault, the system enters the partition state transition stage. In this stage, at least some partitions are reallocated or switched to a disaster recovery state (for example, a different server takes over the service to provide multi-backup disaster recovery).
  • When the test process is actually executed, the fault injection time period may be relatively long, and the more critical partition state transition stage may account for only a small part of it. The delay time corresponding to a timing message is preset and may not fall within the partition state transition stage. This can reduce the probability of abnormal situations and makes it harder to uncover problems through testing.
  • To address this, this solution uses an adaptively adjusted delay time, so that the timing message in effect becomes an adaptive, dynamically timed message. Specifically, for example, while waiting for the timing message server to execute the step of delivering the timing message, it can be determined whether the delay time corresponding to the timing message matches the partition state transition stage. If not, the delay time is adjusted accordingly to force an attempt to deliver the timing message during the partition state transition stage, which makes exceptions more likely and improves the yield of the test.
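  • The adaptive adjustment can be sketched as clamping the due time into the transition window. Everything here is an assumption made for illustration: the function name `adjust_delay`, the idea that the start and end of the partition state transition stage are known as timestamps, and the choice of the smallest adjustment that moves the due time into the window. The specification does not say how the matching or the adjustment is computed.

```python
def adjust_delay(publish_at: float, delay: float,
                 window_start: float, window_end: float) -> float:
    """Return a delay whose due time falls inside the transition window.

    If the original due time already falls in [window_start, window_end],
    the delay is kept. Otherwise the due time is moved to the nearest
    instant inside the window. If the window is already over at publish
    time, the delay is returned unchanged because it cannot be hit.
    """
    due = publish_at + delay
    if window_start <= due <= window_end:
        return delay
    clamped = min(max(due, window_start), window_end)
    clamped = max(clamped, publish_at)  # a message cannot be due before it is published
    if clamped > window_end:
        return delay
    return clamped - publish_at


# A message published at t=0 with a 30 s delay, while the transition
# stage is assumed to last from t=40 s to t=42 s.
print(adjust_delay(0.0, 30.0, 40.0, 42.0))
```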
  • One or more embodiments of this specification also provide a schematic diagram of a specific implementation of the method in Figure 1, as shown in Figure 4.
  • This specific implementation is described with reference to Figure 4.
  • Exemplary steps include the following:
  • Environment setup: Set up one or more fixed scheduled message publishing clients and scheduled message subscription clients in a stable test environment, and start the scheduled message server. The designated instruction receiving end exposes an HTTP service to accept fault injection.
  • Data preparation: Clean up residual data to ensure that no client has leftover messages, then initiate subscriptions to, and the corresponding publication of, scheduled messages in batches. For example, initiate 100 scheduled message subscriptions with a 30-second delay (the quantity can be adjusted arbitrarily). After triggering, check that the 100 subscriptions have been successfully initialized and the subscribed messages have been received.
  • Send fault injection instructions: After data preparation is complete, asynchronously initiate a fault injection whose duration is not less than the delay time of the scheduled messages. The injected faults include, for example, at least one of the following types of atomic faults: downtime, CPU surge, disk full, IO exception, memory surge, network packet delay, duplication, and loss, JVM method-level exception, and so on.
  • The scheduled message server asynchronously delivers scheduled messages based on the data published by the message publishing client, and the message subscription client consumes each message after receiving it from the server.
  • Result verification: Verify that the message subscription client receives the above batch of 100 timing messages at the expected time, for example 30 seconds later. If so, it can be judged that the completeness and timeliness of the timing messages were not affected by the server exception. Checking whether the timing messages are consumed successfully as expected thereby verifies the robustness of the system. After the test completes, the remaining data can be cleaned up promptly for subsequent tests.
  • Generate baseline: Arrange the test process and scenarios corresponding to the test results into daily test tasks, trigger multiple runs, and generate a robustness test baseline from the results of one or more runs. If the test results are abnormal, make the corresponding adjustments and then generate the baseline.
  • After a fault is injected, the scheduled message server into which the fault was injected is treated as the first server. It is determined that, in response to the fault, a second server among the multiple scheduled message servers is to take over the service of delivering scheduled messages from the first server, and a specified type of fault is injected into the second server through the first server. The fault can thus be injected incidentally during the takeover switching process, which helps avoid additional instruction interaction overhead.
  • To extend the service takeover chain (which consists of multiple servers that take over the service in turn), a gradually weakening fault injection is designed. The service-impairing effect of the fault injected into the second server can be lower than that of the fault injected into the first server (for example, the service-impairing effect of a CPU surge is usually lower than that of downtime). This is because the higher the service-impairing effect, the more likely the fault is to cause a server switchover and takeover, which may lead to multiple relay takeovers for the same scheduled message.
  • By analogy, a third server may continue to take over the service, and the second server injects into the third server a fault whose service-impairing effect is reduced further.
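  • The gradually weakening injection along a takeover chain can be sketched as follows. The severity table is an assumption: the specification states only that a CPU surge usually impairs service less than downtime, so the numeric ranks and the position of `network_delay` are illustrative, as are the function and fault type names.

```python
# Assumed ranking of service-impairing effect; higher means more severe.
# Only "downtime above cpu_surge" is stated in the source.
SEVERITY = {"downtime": 3, "cpu_surge": 2, "network_delay": 1}


def plan_chain(servers: list[str], first_fault: str) -> list[tuple[str, str]]:
    """Assign each server in the takeover chain a strictly weaker fault than the previous one.

    The chain stops when either the servers or the weaker fault types run out.
    """
    if first_fault not in SEVERITY:
        raise ValueError(f"unknown fault type: {first_fault}")
    ordered = sorted(SEVERITY, key=SEVERITY.get, reverse=True)
    faults = ordered[ordered.index(first_fault):]
    return list(zip(servers, faults))


# Server A fails first, B takes over from A, C takes over from B.
print(plan_chain(["A", "B", "C", "D"], "downtime"))
```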
  • Alternatively, a single unified end can be responsible for injecting all faults, although the corresponding accuracy and flexibility may be affected.
  • The faults injected at each step of the service takeover chain can also be made to differ at least partly in type (for example, the first injection is a downtime fault and the second is a CPU surge), which increases the complexity of the scenario and helps uncover system anomalies more efficiently.
  • Some types of faults, such as CPU surges and memory surges, fluctuate in real time in actual applications. When the service-impairing effect does not reach a critical value, the fault does not necessarily cause a server switchover and takeover, and the server remains in the fault state for a period of time.
  • Within that period, a trough state may be the key to maintaining the robustness of the system. Although the system may behave abnormally in the peak state of the fault, it may also survive the peak and complete its business successfully during the trough.
  • The trough state may be fleeting. This solution considers simulating the trough state by inserting safe time slots during the test, giving the same server a chance to recover instead of suppressing it continuously with faults, so that the robustness of the system under severe fluctuations in actual business can be evaluated more objectively.
  • Specifically, it is determined whether the injection of the specified type of fault triggered at least one scheduled message server to take over the service of delivering scheduled messages from another scheduled message server. If not, a safe time slot is inserted into the injection effective time period of the fault. Within the safe time slot, the injected fault does not take effect.
  • The length of the safe time slot is smaller than, or even much smaller than, the injection effective time period, which helps to accurately evaluate the robustness sensitivity and response speed of the system. If the server can seize the safe time slot to deliver scheduled messages successfully, its robustness sensitivity and response speed are relatively high, and the entire system is relatively more reliable.
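  • Inserting safe time slots into the injection effective time period can be sketched as follows. The function names, the periodic placement of the slots, and the parameter `every` are assumptions; the specification requires only that a safe slot is inserted when no takeover was triggered, that the fault does not take effect within it, and that the slot is shorter than the effective period.

```python
def plan_fault_intervals(start: int, end: int, slot_len: int, every: int,
                         takeover_triggered: bool) -> list[tuple[int, int]]:
    """Return the half-open intervals [a, b) during which the fault is in effect.

    If the fault already triggered a takeover, the whole period stays active.
    Otherwise a safe slot of `slot_len` is placed at the end of each `every`-long cycle.
    """
    if end <= start:
        raise ValueError("the effective period must not be empty")
    if takeover_triggered:
        return [(start, end)]
    if not 0 < slot_len < every or slot_len >= end - start:
        raise ValueError("the safe slot must be shorter than the cycle and the period")
    active = []
    t = start
    while t < end:
        active_end = min(t + every - slot_len, end)
        active.append((t, active_end))
        t = active_end + slot_len  # skip over the safe slot
    return active


def fault_active(intervals: list[tuple[int, int]], t: int) -> bool:
    """Report whether the injected fault is in effect at time t."""
    return any(a <= t < b for a, b in intervals)


# A 60 s effective period with a 2 s safe slot in every 20 s cycle.
plan = plan_fault_intervals(0, 60, slot_len=2, every=20, takeover_triggered=False)
print(plan, fault_active(plan, 19), fault_active(plan, 21))
```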
  • Based on the same idea, one or more embodiments of this specification also provide a device and equipment corresponding to the above method, as shown in Figures 5 and 6.
  • FIG. 5 is a schematic structural diagram of a distributed timing message system testing device provided by one or more embodiments of this specification.
  • The system includes a message publishing client, a message subscription client, and a timing message server.
  • The device includes: a message subscription module 502, which initiates, through the message subscription client, batch subscriptions to the timing messages that the message publishing client can publish; a fault injection module 504, which constructs a fault injection instruction according to the delay time corresponding to the timing messages and sends it to the timing message server, so as to inject a specified type of fault into the timing message server; a publishing and delivery module 506, which publishes, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and a result determination module 508, which verifies, according to the subscriptions, the reception of the timing messages by the message subscription client, and determines the test result for the system according to the verification result.
  • The fault injection module 504 constructs, according to the delay time corresponding to the timing message, a fault injection instruction whose fault injection time is not less than the delay time.
  • The system includes a distributed cluster composed of multiple timing message servers. After injecting a specified type of fault into the timing message server, the fault injection module 504 uses the injected fault to trigger the following in the distributed cluster: at least one timing message server newly comes online to take on the service of delivering the timing messages, or at least one timing message server takes over the service of delivering the timing messages from another timing message server.
  • a partition management module 510 which uses the timing message server as a dimension to logically divide all timing messages to be delivered to obtain multiple partitions and assign them to the corresponding timing message servers;
  • the partition management module 510, after the fault of the specified type has been injected into the timing message server, enters a partition state transition phase in response to the injected fault; in the partition state transition phase, at least some of the partitions are reallocated or switch their disaster recovery state.
  • the fault injection module 504 determines whether the delay time corresponding to the timing message matches the partition state transition phase; if not, it adjusts the delay time accordingly, so as to force a delivery attempt for the timing message during the partition state transition phase.
  • the fault injection module 504, after the fault of the specified type has been injected into the timing message server, treats the timing message server into which the fault was injected as a first server; determines that, in response to the fault, a second server among the multiple timing message servers is to take over the timing message delivery service from the first server; and injects a fault of a specified type into the second server through the first server.
  • the service-impairing effect corresponding to the fault injected into the second server is lower than the service-impairing effect corresponding to the fault injected into the first server.
  • the fault injection module 504, after the fault of the specified type has been injected into the timing message server, determines whether the injected fault has triggered at least one timing message server to take over the timing message delivery service from another timing message server; if not, it inserts a safety time slot into the period during which the injected fault is in effect. During the safety time slot the fault does not take effect, and the safety time slot is shorter than that effective period.
  • the fault injection module 504 injects at least one of the following types of atomic fault into the timing message server: downtime, CPU surge, full disk, IO exception, memory surge, network packet delay, duplication, and loss, and JVM method-level exceptions.
  • the result determination module 508 verifies, against the subscriptions, the completeness and timeliness of the timing messages that the message subscription client receives through the delivery, so as to determine how the delivery is affected by the injected fault of the specified type.
  • the result determination module 508, after determining the test result for the system from the verification result, arranges the test process corresponding to the test result as a daily test task and runs it multiple times; a test baseline is generated from the results of the multiple runs.
  • FIG. 6 is a schematic structural diagram of a distributed timing message system testing device provided by one or more embodiments of this specification.
  • the system includes a message publishing client, a message subscription client, and a timing message server.
  • the device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: initiate, through the message subscription client, batch subscriptions to the timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, against the subscriptions, how the message subscription client received the timing messages, and determine the test result for the system from the verification result.
  • the processor and memory can communicate through a bus, and the device can also include input/output interfaces for communicating with other devices.
  • one or more embodiments of this specification also provide a non-volatile computer storage medium corresponding to the method in Figure 1, which stores computer-executable instructions, and the computer-executable instructions are set to:
  • initiate, through the message subscription client, batch subscriptions to the timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, against the subscriptions, how the message subscription client received the timing messages, and determine the test result for the system from the verification result.
  • PLD (Programmable Logic Device)
  • FPGA (Field Programmable Gate Array)
  • HDL (Hardware Description Language)
  • ABEL (Advanced Boolean Expression Language)
  • AHDL (Altera Hardware Description Language)
  • Confluence
  • CUPL (Cornell University Programming Language)
  • HDCal
  • JHDL (Java Hardware Description Language)
  • Lava
  • Lola
  • MyHDL
  • PALASM
  • RHDL (Ruby Hardware Description Language)
  • VHDL (Very-High-Speed Integrated Circuit Hardware Description Language)
  • Verilog
  • the controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller can also be implemented as part of the memory's control logic.
  • in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can be regarded as structures within the hardware component. The means for implementing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • embodiments of the present description may be provided as methods, systems, or computer program products. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer-readable media, random access memory (RAM) and/or non-volatile memory in the form of read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
  • Computer-readable media include persistent and non-persistent, removable and non-removable media, which can store information by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through communications networks.
  • program modules may be located in both local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

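Embodiments of this specification disclose a distributed timing message system testing method, apparatus, and device. The solution includes: initiating, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; constructing a fault injection instruction according to the delay time corresponding to the timing messages and sending it to the timing message server, so as to inject a fault of a specified type into the timing message server; publishing, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verifying, against the subscriptions, how the message subscription client received the timing messages, and determining a test result for the system from the verification result.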

Description

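Distributed timing message system testing method, apparatus, and device
Technical Field
This specification relates to the field of testing technology, and in particular to a distributed timing message system testing method, apparatus, and device.
Background
In a distributed timing message system, multiple nodes are responsible for delivering timing messages. They usually cooperate according to some strategy to improve efficiency while also providing disaster recovery.
Because a timing message is not sent immediately but only after a certain delay time has elapsed, the uncertainty of the whole system increases. During the delay time, and while the timing message is being delivered after the delay expires, node state transitions may occur inside the system and create various risks of distributed inconsistency. The visible symptoms of these risks include timing messages being lost, being resent, or being delivered at an unexpected time.
A test solution is therefore needed that can test the robustness of a distributed timing message system reliably and at low cost, in order to guide improvements to the system and reduce these risks.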
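Summary
One or more embodiments of this specification provide a distributed timing message system testing method, apparatus, device, and storage medium to solve the following technical problem: a test solution is needed that can test the robustness of a distributed timing message system more reliably, in order to guide improvements to the system and reduce these risks.
To solve the above technical problem, one or more embodiments of this specification are implemented as follows. One or more embodiments of this specification provide a distributed timing message system testing method, where the system includes a message publishing client, a message subscription client, and a timing message server, and the method includes: initiating, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; constructing a fault injection instruction according to the delay time corresponding to the timing messages and sending it to the timing message server, so as to inject a fault of a specified type into the timing message server; publishing, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verifying, against the subscriptions, how the message subscription client received the timing messages, and determining a test result for the system from the verification result.
One or more embodiments of this specification provide a distributed timing message system testing apparatus, where the system includes a message publishing client, a message subscription client, and a timing message server, and the apparatus includes: a message subscription module, which, through the message subscription client, initiates batch subscriptions to timing messages that the message publishing client can publish; a fault injection module, which constructs a fault injection instruction according to the delay time corresponding to the timing messages and sends it to the timing message server, so as to inject a fault of a specified type into the timing message server; a publishing and delivery module, which, through the message publishing client, publishes each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and a result determination module, which verifies, against the subscriptions, how the message subscription client received the timing messages and determines a test result for the system from the verification result.
One or more embodiments of this specification provide a distributed timing message system testing device, where the system includes a message publishing client, a message subscription client, and a timing message server, and the device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, against the subscriptions, how the message subscription client received the timing messages, and determine a test result for the system from the verification result.
One or more embodiments of this specification provide a non-volatile computer storage medium storing computer-executable instructions, the computer-executable instructions being configured to: initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, against the subscriptions, how the message subscription client received the timing messages, and determine a test result for the system from the verification result.
At least one of the technical solutions adopted by one or more embodiments of this specification can achieve the following beneficial effects. For the use of timing messages on a distributed cluster, an architecture is provided in which the message publishing client, the message subscription client, and the timing message server cooperate. The message publishing client publishes timing messages to the timing message server, and the timing message server delivers them to the message subscription client. Faults of selectable types are actively injected into the timing message server so that the delivery of timing messages is affected by server faults, and the way the message subscription client receives a large number of timing messages under that influence (for example, completeness and timeliness) is then used as the basis for judging the steady state of the system. The approach is well targeted and reliable, makes it possible to run abnormal-condition tests continuously on a daily basis while reducing manual operation cost, and supports the scenario of testing the robustness of a distributed timing message system well.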
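Brief Description of the Drawings
To explain the technical solutions in the embodiments of this specification more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a distributed timing message system testing method provided by one or more embodiments of this specification;
FIG. 2 is a schematic diagram of part of the architecture of a distributed timing message system provided by one or more embodiments of this specification;
FIG. 3 is a schematic diagram of a partition takeover scenario in the system of FIG. 2, provided by one or more embodiments of this specification;
FIG. 4 is a schematic diagram of the principle of a specific implementation of the method in FIG. 1, provided by one or more embodiments of this specification;
FIG. 5 is a schematic structural diagram of a distributed timing message system testing apparatus provided by one or more embodiments of this specification;
FIG. 6 is a schematic structural diagram of a distributed timing message system testing device provided by one or more embodiments of this specification.
Detailed Description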
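Embodiments of this specification provide a distributed timing message system testing method, apparatus, device, and storage medium.
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this specification without creative effort shall fall within the scope of protection of this application.
This application proposes a broadly applicable solution for testing the robustness of a distributed timing message system, and it has been trialed on product systems such as middleware timing message products. A timing message is a message that consumers can consume only after a specified timestamp. It is used in scenarios where message production and consumption are subject to time-window requirements, or where a message triggers a scheduled task. Timing messages are often used to trigger a specified business function at a set time, for example page refresh, transaction retry, payment reminders, running risk-control policies, collecting order data, and reporting monitoring data. Robustness refers to whether software can keep working normally, without hanging or crashing, in the face of input errors, disk failures, network overload, or deliberate attacks.
Taking the completeness and timeliness of timing messages as the acceptance criteria, this solution constructs server-side fault injection by sending instructions, thereby integrating stability testing into daily runs and continuously uncovering robustness problems in the product. The details follow from this idea.
FIG. 1 is a schematic flowchart of a distributed timing message system testing method provided by one or more embodiments of this specification. The method can be applied in different business domains, for example electronic payment, e-commerce, social networking, gaming, and public services. The process can be executed on the devices of the distributed timing message system in these domains and/or on test devices connected to those systems. For example, a test program is installed on these devices in advance; running it sends the corresponding control instructions to one or more ends, and the receiving end carries out the corresponding steps in response. Some input parameters or intermediate results of the process may be adjusted manually to help improve accuracy.
The process in FIG. 1 may include the following steps S102 to S108.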
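S102: Initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish.
The distributed timing message system includes message publishing clients, message subscription clients, and timing message servers (referred to as servers for short). There may be more than one of each, deployed in a distributed manner in advance.
In one or more embodiments of this specification, the system includes at least a distributed cluster composed of multiple timing message servers, each of which is a node of the cluster; the message publishing client and the message subscription client may also be nodes of the cluster or may sit outside it. The cluster provides the timing message service to clients. The message subscription client subscribes to timing messages, the message publishing client (or the timing message server itself) publishes them correspondingly, and the timing message server delivers published timing messages to the message subscription client at the appropriate time. During testing, for example, a test script triggers the message subscription client to subscribe and the message publishing client to publish, and then checks whether the delivery behavior of the timing message server is abnormal.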
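S104: Construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server.
In one or more embodiments of this specification, as noted in the Background, node state transitions may occur within the system and lead to risk; this risk shows up mainly as an effect on the delivery behavior of the timing message server. To reproduce faults and test stability on a daily basis, fault injection instructions are actively constructed to inject the required faults into the timing message server. This makes it easy to test on demand at any time and to restore normal service promptly.
The delay time corresponding to a timing message includes at least the timer duration. For example, if a timer duration of 30 seconds is set for a timing message, the message should be sent (for example, delivered, or published and delivered) or received 30 seconds after the predetermined starting time point, and the corresponding delay time can accordingly be 30 seconds. In addition, because sending and receiving a message also take a small amount of time (for example, on the order of milliseconds), a small latency allowance can be added to the timer duration, and the total used as the delay time corresponding to the timing message.
This solution can also be used for testing in an instant message scenario. However, because instant messages are sent immediately, and the stability and predictability of the server at that moment are relatively clear, the test time window is narrower and the test is of limited value; it is not suited to fully simulating and uncovering the complex conditions under which risks arise in practice. For this reason, the solution of this application mainly targets timing message scenarios. By combining a large number of timing messages with different delay times and server fault injection with multiple selectable types and strategies, it can construct complex test scenarios more realistically, which helps achieve more efficient and effective testing at low cost.
In one or more embodiments of this specification, according to the delay time corresponding to the timing messages, a fault injection instruction is constructed whose fault injection time is not less than that delay time, and server fault injection is performed with it. The fault injection time covers at least part of the duration for which the injected fault is in effect. This helps ensure that, at key moments such as sending or receiving a timing message after its delay, the server is in the intended fault state. Any anomaly in how timing messages are received is then more likely to have been caused by that fault, which helps clarify the relationship, and its strength, between the influencing factors and the test results.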
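The relationship between the delay time and the fault injection time can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `FaultInjectionInstruction` type, the fault name `"cpu_surge"`, the 50 ms transit allowance, and the 5-second slack are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInjectionInstruction:
    fault_type: str      # illustrative name, e.g. "cpu_surge"
    target_server: str   # hypothetical server identifier
    duration_s: float    # how long the injected fault stays in effect


def effective_delay(timer_s: float, transit_allowance_s: float = 0.05) -> float:
    """Delay time = timer duration plus a small send/receive allowance."""
    return timer_s + transit_allowance_s


def build_instruction(fault_type, target_server, delays_s, slack_s=5.0):
    """Build an instruction whose duration is not less than the longest delay."""
    if not delays_s:
        raise ValueError("at least one timing message delay is required")
    duration = max(delays_s) + slack_s
    return FaultInjectionInstruction(fault_type, target_server, duration)


# 100 timing messages, each with a 30-second timer.
delays = [effective_delay(30.0) for _ in range(100)]
instruction = build_instruction("cpu_surge", "server-A", delays)
```

Taking the maximum over all delays keeps the server in the fault state when the last message of the batch falls due, which is the property this paragraph relies on.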
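In one or more embodiments of this specification, a large number of timing messages take part in the test at the same time, and their delay times can be set according to various strategies, such as a uniform delay time or stepped delay times. Similarly, the servers and message subscription clients taking part in delivery can also be numerous. This allows a more complex and realistic test environment to be constructed and the robustness of the servers to be tested more accurately.
S106: Publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client.
In one or more embodiments of this specification, one or more message subscription clients subscribe to a timing message at the timing message server. The timing message is published to the server by the message publishing client, and the server delivers the published timing message to each message subscription client that subscribed to it. An injected fault may affect the server's delivery and, in turn, what the message subscription client receives.
In one or more embodiments of this specification, before or while delivering timing messages, the timing message server is in a state affected by the injected fault. As a result, another timing message server may take over its service and complete the delivery, and node state transitions occur during the takeover. This solution pays particular attention to server robustness in this situation. Note that the other timing message server may itself have a fault injected; the solution is concerned not only with the robustness of a single timing message server but, even more, with that of the whole distributed cluster.
S108: Verify, against the subscriptions, how the message subscription client received the timing messages, and determine a test result for the system from the verification result.
In one or more embodiments of this specification, a subscription reflects which timing messages are expected to be received. Because of the fault, what the message subscription client actually receives may not match that expectation, and verification determines the degree of deviation. For example, the completeness and timeliness of the timing messages that the message subscription client receives through server delivery are verified against the subscriptions to determine how the injected fault affected delivery, and hence how robust the server is. In addition, the interactions and state transitions involving the timing message, local to one server or between multiple servers, can be verified to see whether they follow the intended logic.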
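The completeness and timeliness check in S108 could look roughly like the following Python sketch. The data shapes (`Expected`, a dict of arrival times per message ID) and the 1-second tolerance are assumptions made for illustration; the specification does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expected:
    message_id: str
    due_at: float  # expected delivery time, in seconds since the test started


def verify(expected, received, tolerance_s=1.0):
    """Compare subscriptions with arrivals.

    `received` maps a message ID to the list of its arrival times (seconds).
    Completeness covers lost and duplicated messages; timeliness covers
    messages whose first arrival deviates from the due time by more than
    `tolerance_s`.
    """
    report = {"lost": [], "duplicated": [], "mistimed": []}
    for exp in expected:
        arrivals = received.get(exp.message_id, [])
        if not arrivals:
            report["lost"].append(exp.message_id)
            continue
        if len(arrivals) > 1:
            report["duplicated"].append(exp.message_id)
        if abs(arrivals[0] - exp.due_at) > tolerance_s:
            report["mistimed"].append(exp.message_id)
    report["passed"] = not (
        report["lost"] or report["duplicated"] or report["mistimed"]
    )
    return report


expected = [Expected(f"m{i}", 30.0) for i in range(1, 5)]
received = {"m1": [30.2], "m2": [30.1, 31.0], "m3": [34.5]}
report = verify(expected, received)
```

In this example `m4` is lost, `m2` is delivered twice, and `m3` arrives 4.5 seconds late, which correspond to the three symptoms named in the Background (loss, resending, and unexpected delivery time).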
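In one or more embodiments of this specification, the test process above can be arranged as a daily test case and executed repeatedly to obtain a more reliable overall test result.
With the method of FIG. 1, for the use of timing messages on a distributed cluster, an architecture is provided in which the message publishing client, the message subscription client, and the timing message server cooperate. The message publishing client publishes timing messages to the timing message server, and the timing message server delivers them to the message subscription client. Faults of selectable types are actively injected into the timing message server so that the delivery of timing messages is affected by server faults, and the way the message subscription client receives a large number of timing messages under that influence (for example, completeness and timeliness) is then used as the basis for judging the steady state of the system. The approach is well targeted and reliable, makes it possible to run abnormal-condition tests continuously on a daily basis while reducing manual operation cost, and supports the scenario of testing the robustness of a distributed timing message system well.
Based on the method of FIG. 1, this specification also provides some specific implementations and extensions of the method, described below.
For a more intuitive description, the following continues with an exemplary distributed timing message system; see FIG. 2. FIG. 2 is a schematic diagram of part of the architecture of a distributed timing message system provided by one or more embodiments of this specification.
In FIG. 2, the producer acts as the message publishing client described above and the consumer as the message subscription client. There are multiple timing message servers, such as server A, server B, and server C shown in the partition table, and a partition coordinator is responsible for coordinating them. Using the timing message server as the dimension, the system can logically divide all messages (including timing messages to be delivered) into multiple partitions and assign them to the corresponding timing message servers, each of which provides services such as delivery for the messages in its partitions. For example, server A currently owns partitions P1 and P2, server B owns P3 and P4, and server C owns P5 and P6. The partition coordinator synchronizes partition configuration information among the timing message servers and allocates partitions.
In normal operation, the producer produces messages and inputs them to the corresponding timing message server. It can also send a timing instruction to the server to turn a message into a timing message, or send an instruction to cancel the timer. On the timing message server, filters can be configured (multiple filters can form a filter chain) to filter incoming messages and pass those that satisfy the rules to the storage router. In this system, a wheel-shaped data structure called a time wheel is used, by way of example, to store timing messages. The time wheel is divided into slots, and the slot in which a timing message should be stored is determined from its timer duration. When the duration is long (longer than the time represented by one full revolution of the wheel), the slot is determined by a number of rounds plus a number of slots, and the timing message is sent once that number of rounds has been completed and that slot is reached. The storage area can be divided into a short-term storage area and a long-term storage area, chosen according to actual needs. The timing message server also keeps time through a time wheel trigger; when the timer expires, it triggers a query of storage for the corresponding timing message and, through the delivery router, delivers that message as an output message to the corresponding consumer.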
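The "rounds plus slots" placement described for the time wheel can be sketched in Python as follows. This is a generic time wheel written for illustration, not the implementation used by the system in FIG. 2; the class name, the 8-slot size, and the tick-based interface are assumptions.

```python
class TimeWheel:
    """A minimal time wheel: one slot is visited per tick."""

    def __init__(self, slots=8):
        self.slots = slots
        self.cursor = 0
        self.buckets = [[] for _ in range(slots)]

    def add(self, message, delay_ticks):
        """Store a message that should fire after `delay_ticks` ticks."""
        if delay_ticks < 1:
            raise ValueError("delay must be at least one tick")
        slot = (self.cursor + delay_ticks) % self.slots
        # Number of visits to this slot that must be skipped before firing.
        rounds = (delay_ticks - 1) // self.slots
        self.buckets[slot].append([rounds, message])
        return slot, rounds

    def tick(self):
        """Advance one slot and return the messages that are now due."""
        self.cursor = (self.cursor + 1) % self.slots
        due, waiting = [], []
        for rounds, message in self.buckets[self.cursor]:
            if rounds == 0:
                due.append(message)
            else:
                waiting.append([rounds - 1, message])
        self.buckets[self.cursor] = waiting
        return due


wheel = TimeWheel(slots=8)
slot_a, rounds_a = wheel.add("a", 3)    # shorter than one revolution
slot_b, rounds_b = wheel.add("b", 19)   # longer: two full rounds plus 3 slots
fired = {}
for t in range(1, 21):
    for message in wheel.tick():
        fired[message] = t
```

Both messages land in the same slot; the round counter is what distinguishes a 3-tick delay from a 19-tick delay on an 8-slot wheel.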
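In practice, when the number of timing message servers in the distributed cluster changes (for example, through scaling out or in, replacement, or downtime), a partition reallocation is triggered in the cluster so that a newly online server obtains the partition configuration information and can start serving, or so that the partitions of an offline server are taken over by other servers. See FIG. 3, which is a schematic diagram of a partition takeover scenario in the system of FIG. 2, provided by one or more embodiments of this specification.
In FIG. 3, partition P3 is initially owned by server A, with server B as the backup owner of P3. When server A is temporarily unable to serve P3, for example because of a fault or a deliberate shutdown, server B takes over P3. Server A closes P3 by stopping the timer for P3 and updating the checkpoint accordingly. Server B restores P3 by creating a timer for P3, reading the partition checkpoint, and starting a compensation task for expired messages. For example, if the index had reached time point 1007 before P3 was closed, then on takeover server B can continue fetching the messages for P3 from 1007 onward; once it has fetched the messages for time points 1007 to 1009, the current index advances to 1010 and server B continues serving P3. A partition state transition phase occurs during this process, which creates a risk of distributed inconsistency.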
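The checkpoint-based takeover in FIG. 3 can be mimicked with a short Python sketch. The function names, the in-memory `checkpoints` dict, and the `store` layout (time point to message list) are hypothetical; they only reproduce the 1007 to 1010 example from the text.

```python
def close_partition(checkpoints, partition, next_index):
    """Old owner: stop serving and record the first index not yet delivered."""
    checkpoints[partition] = next_index


def take_over(checkpoints, partition, store, now_index):
    """New owner: read the checkpoint and compensate the missed time points.

    Returns the messages whose time points fell between the checkpoint and
    `now_index` (inclusive), then advances the checkpoint past them.
    """
    start = checkpoints[partition]
    compensated = []
    for time_point in range(start, now_index + 1):
        compensated.extend(store.get(time_point, []))
    checkpoints[partition] = now_index + 1
    return compensated


# Messages of partition P3, keyed by the time point at which they are due.
store = {1006: ["m0"], 1007: ["m1"], 1008: ["m2", "m3"], 1009: ["m4"], 1010: ["m5"]}
checkpoints = {}

close_partition(checkpoints, "P3", 1007)               # server A closes P3
recovered = take_over(checkpoints, "P3", store, 1009)  # server B takes over
```

If the checkpoint is written too early or too late relative to what server A actually delivered, messages are resent or lost, which is the kind of inconsistency the test is meant to expose.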
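Similarly, after a fault of a specified type has been injected into the timing message server, the injected fault can trigger one of the following in the distributed cluster: at least one timing message server newly comes online to provide the timing message delivery service, or at least one timing message server takes over the timing message delivery service from another timing message server. The system as a whole, in response to the fault injected into the server, may enter a partition state transition phase, in which at least some partitions are reallocated or switch their disaster recovery state (for example, different servers take over the service to provide multi-backup disaster recovery).
In one or more embodiments of this specification, when the test process is actually executed, the fault injection period may be relatively long, and the critical partition state transition phase may occupy only a small part of it. The delay times of the timing messages are preset and may not fall within the partition state transition phase, which lowers the probability of anomalies occurring and makes it harder for the test to uncover problems. To address this, the solution uses adaptively adjusted delay times, so that the timing messages in effect become adaptively and dynamically timed. Specifically, for example, while the timing message server is waiting to deliver and delivering timing messages, it can determine whether the delay time of a timing message matches the partition state transition phase; if not, the delay time is adjusted accordingly to force a delivery attempt during the partition state transition phase. This makes anomalies more likely to be triggered and increases the value of the test.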
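One simple way to read "matching" the delay time to the partition state transition phase is to check whether the due time falls inside the observed transition window, and to move it there if it does not. The Python sketch below makes that assumption explicit; the window bounds and the choice of the window midpoint as the new target are illustrative, not taken from the specification.

```python
def adapt_delay(publish_at, delay_s, window_start, window_end):
    """Return a delay whose due time falls inside the transition window.

    All arguments are in seconds on the same clock. The delay is kept as is
    when the message is already due inside [window_start, window_end], or
    when the window can no longer be reached.
    """
    due_at = publish_at + delay_s
    if window_start <= due_at <= window_end:
        return delay_s
    target = (window_start + window_end) / 2
    if target <= publish_at:
        return delay_s  # the window has already passed; nothing to adjust
    return target - publish_at


# A transition phase observed between t=40s and t=44s.
adjusted = adapt_delay(publish_at=0, delay_s=30, window_start=40, window_end=44)
unchanged = adapt_delay(publish_at=0, delay_s=41, window_start=40, window_end=44)
too_late = adapt_delay(publish_at=50, delay_s=30, window_start=40, window_end=44)
```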
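Based on the description above, one or more embodiments of this specification also provide a schematic diagram of the principle of a specific implementation of the method in FIG. 1, as shown in FIG. 4. With reference to FIG. 4, the exemplary steps of this implementation are as follows.
Environment setup: in a stable test environment, set up one or more fixed timing message publishing clients and timing message subscription clients. The timing message server starts a designated instruction receiver that exposes an HTTP service for accepting fault injection.
Data preparation: clean up residual data to make sure that no client holds leftover messages, then initiate the batch subscriptions to timing messages and the corresponding publications. For example, initiate 100 subscriptions to timing messages with a 30-second delay (the volume can be adjusted freely); after triggering, check that all 100 subscriptions were initialized successfully and that the subscribed messages have not yet been received.
Sending the fault injection instruction: after data preparation is complete, asynchronously start a fault injection whose duration is not less than the delay time of the timing messages. The injected faults include, for example, at least one of the following types of atomic fault: downtime, CPU surge, full disk, IO exception, memory surge, network packet delay, duplication, and loss, JVM method-level exceptions, and so on. Meanwhile, the timing message server asynchronously delivers timing messages according to the data published by the message publishing client, and the message subscription client consumes the messages once it receives them from the server. This step constructs a scenario in which the distributed timing message system processes timing messages under abnormal conditions.
Result verification: at the expected time, for example after 30 seconds, verify that the message subscription side has received the 100 batch timing messages above, and use this as the basis for judging whether the completeness and timeliness of the timing messages were affected by the server exception. Whether the timing messages were consumed as expected can be checked in order to verify the robustness of the system. After the test is finished, residual data can be cleaned up promptly for subsequent tests.
Baseline generation: arrange the test process and scenario corresponding to the test result as a daily test task and trigger it repeatedly. A robustness test baseline is generated from the results of one or more runs. If a test result is abnormal, the baseline can be generated after the corresponding adjustments.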
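The specification does not say what a robustness test baseline contains. As one possible reading, the Python sketch below aggregates the on-time delivery rate over several daily runs; the field names (`expected`, `received_on_time`) and the pass threshold are assumptions for illustration only.

```python
from statistics import mean


def build_baseline(runs, min_pass_rate=1.0):
    """Summarize repeated test runs into a simple baseline record.

    Each run is a dict with the number of timing messages `expected` and the
    number `received_on_time` by the message subscription client.
    """
    if not runs:
        raise ValueError("at least one run is required")
    rates = [run["received_on_time"] / run["expected"] for run in runs]
    return {
        "runs": len(runs),
        "mean_on_time_rate": mean(rates),
        "worst_on_time_rate": min(rates),
        "stable": min(rates) >= min_pass_rate,
    }


runs = [
    {"expected": 100, "received_on_time": 100},
    {"expected": 100, "received_on_time": 98},
    {"expected": 100, "received_on_time": 100},
]
strict = build_baseline(runs)                       # every message must arrive on time
lenient = build_baseline(runs, min_pass_rate=0.95)  # tolerate up to 5% deviation
```

Later runs can then be compared against the stored record, so that a drop in the on-time rate shows up as a regression rather than as an isolated failure.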
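In one or more embodiments of this specification, multiple timing message servers in the distributed cluster take part in the test, so service takeovers between timing message servers are possible. To expose as many anomalies as possible in a single test, a chained, tracking fault injection scheme is used, in which fault injection follows the service as it moves from server to server. This also avoids wasting resources by injecting faults into many timing message servers in advance.
Specifically, for example, after a fault of a specified type has been injected into a timing message server, that server is treated as the first server. It is then determined that, in response to the fault, a second server among the multiple timing message servers is to take over the timing message delivery service from the first server, and a fault of a specified type is injected into the second server through the first server. The injection can be done as part of the takeover switch, which helps avoid extra instruction exchange overhead.
Further, based on this idea, a gradually weakening fault injection is designed to lengthen the service takeover chain (formed by the servers that take over the service in turn). Continuing the example above, the service impairment caused by the fault injected into the second server can be made lower than that caused by the fault injected into the first server (for example, the impairment caused by a CPU surge is usually lower than that caused by downtime). The reason is that the more severely a fault impairs service, the more likely it is to cause a server takeover, so a single timing message may be handed over several times in relay. By analogy, if the second server also cannot deliver the message because of its fault, a third server may take over, and the second server then injects into the third server a fault with a still lower impairment. Besides relayed injection, a single unified end can also be responsible for injecting all faults, although precision and flexibility may suffer. In addition to gradually decreasing impairment, the faults injected along the takeover chain can differ at least partly in type (for example, downtime for the first injection and CPU surge for the second), which increases the complexity of the scenario and helps uncover system anomalies more efficiently.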
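The weakening takeover chain can be planned with a few lines of Python. The severity ranking below is an assumption: the text only states that a CPU surge usually impairs service less than downtime, so the position of `disk_full` and the fault names themselves are hypothetical.

```python
# Assumed ranking of service impairment; higher means more disruptive.
SEVERITY = {"crash": 3, "disk_full": 2, "cpu_surge": 1}


def plan_chain(takeover_order, first_fault="crash"):
    """Assign each server on the takeover chain a weaker fault than the last.

    `takeover_order` lists servers in the order they are expected to take
    over the delivery service. The plan stops when no weaker fault remains,
    so the last server on the chain stays healthy and can deliver.
    """
    ranked = sorted(SEVERITY, key=SEVERITY.get, reverse=True)
    start = ranked.index(first_fault)
    plan = []
    for hop, server in enumerate(takeover_order):
        position = start + hop
        if position >= len(ranked):
            break
        plan.append((server, ranked[position]))
    return plan


plan = plan_chain(["server-A", "server-B", "server-C", "server-D"])
```

Because each hop uses a different fault, the plan also satisfies the second idea in the paragraph above, namely that faults along the chain differ in type.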
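In one or more embodiments of this specification, some types of fault fluctuate from moment to moment in practice, such as CPU surges and memory surges. Most of the time the service impairment does not reach a critical value, so a server takeover is not necessarily triggered and the server simply remains in the fault state for a while. In this case the test should simulate not only the peak of the fault but also its trough, because the trough may be the key to the system staying robust: the system may misbehave at the peak, but it may also ride out the peak and complete its work in the trough. In practice the trough may be fleeting. This solution therefore simulates the trough during testing by inserting a safety time slot, giving the same server a chance to recover rather than suppressing it steadily with a continuous fault. This allows a more objective assessment of the system's robustness when real traffic fluctuates sharply.
Specifically, for example, after a fault of a specified type has been injected into the timing message server, it can be determined whether the injected fault has triggered at least one timing message server to take over the timing message delivery service from another timing message server. If not, a safety time slot is inserted into the period during which the injected fault is in effect. During the safety time slot the injected fault does not take effect, and the safety time slot is shorter, possibly much shorter, than the effective period. This helps evaluate precisely how sensitive and responsive the system's robustness is: if the server can seize the safety time slot to deliver timing messages successfully, its sensitivity and responsiveness are relatively high and the whole system is relatively more reliable.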
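Inserting a safety time slot amounts to splitting the fault's effective period into the intervals on either side of the slot. The Python sketch below shows one way to do this; the function names and the half-open interval convention are assumptions for illustration.

```python
def insert_safety_slot(fault_start, fault_end, slot_start, slot_len):
    """Split the effective period into active intervals around a safety slot.

    Returns a list of (start, end) intervals during which the fault is in
    effect. The slot must lie inside the effective period and be shorter
    than it.
    """
    period = fault_end - fault_start
    if not 0 < slot_len < period:
        raise ValueError("the safety slot must be shorter than the effective period")
    slot_end = slot_start + slot_len
    if slot_start < fault_start or slot_end > fault_end:
        raise ValueError("the safety slot must lie inside the effective period")
    intervals = [(fault_start, slot_start), (slot_end, fault_end)]
    return [(start, end) for start, end in intervals if end > start]


def fault_active(intervals, t):
    """True if the fault is in effect at time t (intervals are half-open)."""
    return any(start <= t < end for start, end in intervals)


# A 60-second fault with a 2-second safety slot around the 30-second due time.
intervals = insert_safety_slot(fault_start=0, fault_end=60, slot_start=29, slot_len=2)
```

Placing the slot around the due time of the timing messages gives the server a brief fault-free interval in which to deliver, which is what this paragraph uses to judge responsiveness.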
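Based on the same idea, one or more embodiments of this specification also provide an apparatus and a device corresponding to the above method, as shown in FIG. 5 and FIG. 6.
FIG. 5 is a schematic structural diagram of a distributed timing message system testing apparatus provided by one or more embodiments of this specification. The system includes a message publishing client, a message subscription client, and a timing message server, and the apparatus includes: a message subscription module 502, which, through the message subscription client, initiates batch subscriptions to timing messages that the message publishing client can publish; a fault injection module 504, which constructs a fault injection instruction according to the delay time corresponding to the timing messages and sends it to the timing message server, so as to inject a fault of a specified type into the timing message server; a publishing and delivery module 506, which, through the message publishing client, publishes each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and a result determination module 508, which verifies, against the subscriptions, how the message subscription client received the timing messages and determines a test result for the system from the verification result.
Optionally, the fault injection module 504 constructs, according to the delay time corresponding to the timing messages, a fault injection instruction whose fault injection time is not less than the delay time.
Optionally, the system includes a distributed cluster composed of multiple timing message servers; after the fault of the specified type has been injected into the timing message server, the fault injection module 504 uses the injected fault to trigger one of the following in the distributed cluster: at least one timing message server newly comes online to provide the timing message delivery service, or at least one timing message server takes over the timing message delivery service from another timing message server.
Optionally, the apparatus further includes a partition management module 510, which, using the timing message server as the dimension, logically divides all timing messages to be delivered into multiple partitions and assigns them to the corresponding timing message servers; after the fault of the specified type has been injected into the timing message server, the partition management module 510 enters a partition state transition phase in response to the injected fault; in the partition state transition phase, at least some of the partitions are reallocated or switch their disaster recovery state.
Optionally, the fault injection module 504 determines whether the delay time corresponding to a timing message matches the partition state transition phase; if not, it adjusts the delay time accordingly, so as to force a delivery attempt for the timing message during the partition state transition phase.
Optionally, after the fault of the specified type has been injected into the timing message server, the fault injection module 504 treats the timing message server into which the fault was injected as a first server; determines that, in response to the fault, a second server among the multiple timing message servers is to take over the timing message delivery service from the first server; and injects a fault of a specified type into the second server through the first server.
Optionally, the service impairment caused by the fault injected into the second server is lower than that caused by the fault injected into the first server.
Optionally, after the fault of the specified type has been injected into the timing message server, the fault injection module 504 determines whether the injected fault has triggered at least one timing message server to take over the timing message delivery service from another timing message server; if not, it inserts a safety time slot into the period during which the injected fault is in effect. During the safety time slot the fault does not take effect, and the safety time slot is shorter than that effective period.
Optionally, the fault injection module 504 injects at least one of the following types of atomic fault into the timing message server: downtime, CPU surge, full disk, IO exception, memory surge, network packet delay, duplication, and loss, and JVM method-level exceptions.
Optionally, the result determination module 508 verifies, against the subscriptions, the completeness and timeliness of the timing messages that the message subscription client receives through the delivery, so as to determine how the delivery is affected by the injected fault of the specified type.
Optionally, after determining the test result for the system from the verification result, the result determination module 508 arranges the test process corresponding to the test result as a daily test task and runs it multiple times; a test baseline is generated from the results of the multiple runs.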
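FIG. 6 is a schematic structural diagram of a distributed timing message system testing device provided by one or more embodiments of this specification. The system includes a message publishing client, a message subscription client, and a timing message server, and the device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, against the subscriptions, how the message subscription client received the timing messages, and determine a test result for the system from the verification result.
The processor and the memory can communicate through a bus, and the device can also include input/output interfaces for communicating with other devices.
Based on the same idea, one or more embodiments of this specification also provide a non-volatile computer storage medium corresponding to the method in FIG. 1, storing computer-executable instructions configured to: initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish; construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server; publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client; and verify, against the subscriptions, how the message subscription client received the timing messages, and determine a test result for the system from the verification result.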
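In the 1990s, an improvement to a technology could be clearly distinguished as a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, and switches) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is now mostly done with logic compiler software, which is similar to the software compilers used in program development; the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the memory's control logic. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can be regarded as structures within the hardware component. The means for implementing various functions can even be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product with a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of units divided by function. When implementing this specification, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.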
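Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, the embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This specification is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this specification. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-persistent storage in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.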
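It should also be noted that the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, product, or device that includes the element.
This specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform specific tasks or implement specific abstract data types. This specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For the parts that are the same or similar among the embodiments, reference can be made to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, device, and non-volatile computer storage medium embodiments are described relatively briefly because they are basically similar to the method embodiments; for the relevant parts, see the description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The above are only one or more embodiments of this specification and are not intended to limit this specification. For those skilled in the art, one or more embodiments of this specification may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of the claims of this specification.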

Claims (23)

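  1. A distributed timing message system testing method, the system comprising a message publishing client, a message subscription client, and a timing message server, the method comprising:
    initiating, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish;
    constructing a fault injection instruction according to the delay time corresponding to the timing messages and sending it to the timing message server, so as to inject a fault of a specified type into the timing message server;
    publishing, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client;
    verifying, against the subscriptions, how the message subscription client received the timing messages, and determining a test result for the system from the verification result.
  2. The method of claim 1, wherein constructing a fault injection instruction according to the delay time corresponding to the timing messages comprises:
    constructing, according to the delay time corresponding to the timing messages, a fault injection instruction whose corresponding fault injection time is not less than the delay time.
  3. The method of claim 1, wherein the system comprises a distributed cluster composed of a plurality of the timing message servers;
    after the fault of the specified type is injected into the timing message server, the method further comprises:
    triggering in the distributed cluster, by the injected fault of the specified type: at least one timing message server newly coming online to provide the service of delivering the timing messages, or at least one timing message server taking over the service of delivering the timing messages from another timing message server.
  4. The method of claim 3, further comprising:
    logically dividing, with the timing message server as the dimension, all timing messages to be delivered into a plurality of partitions, and assigning them to the corresponding timing message servers;
    after the fault of the specified type is injected into the timing message server, the method further comprises:
    entering a partition state transition phase in response to the injected fault of the specified type;
    in the partition state transition phase, at least some of the partitions being reallocated or switching their disaster recovery state.
  5. The method of claim 4, wherein the timing message server delivering the timing messages to the message subscription client further comprises:
    determining whether the delay time corresponding to a timing message matches the partition state transition phase;
    if not, adjusting the delay time accordingly, so as to force an attempt to deliver the timing message during the partition state transition phase.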
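  6. The method of claim 3, wherein after the fault of the specified type is injected into the timing message server, the method further comprises:
    treating the timing message server into which the fault was injected as a first server;
    determining that, in response to the fault, a second server among the plurality of timing message servers is to take over the service of delivering the timing messages from the first server;
    injecting, through the first server, a fault of a specified type into the second server.
  7. The method of claim 6, wherein the service impairment corresponding to the fault injected into the second server is lower than the service impairment corresponding to the fault injected into the first server.
  8. The method of claim 3, wherein after the fault of the specified type is injected into the timing message server, the method further comprises:
    determining whether the injected fault of the specified type has triggered at least one timing message server to take over the service of delivering the timing messages from another timing message server;
    if not, inserting a safety time slot into the period during which the injected fault is in effect, wherein the fault does not take effect during the safety time slot, and the safety time slot is shorter than the effective period.
  9. The method of claim 1, wherein injecting a fault of a specified type into the timing message server comprises:
    injecting at least one of the following types of atomic fault into the timing message server:
    downtime, CPU surge, full disk, IO exception, memory surge, network packet delay, duplication, and loss, and JVM method-level exceptions.
  10. The method of claim 1, wherein verifying, against the subscriptions, how the message subscription client received the timing messages comprises:
    verifying, against the subscriptions, the completeness and timeliness of the timing messages received by the message subscription client through the delivery, so as to determine how the delivery is affected by the injected fault of the specified type.
  11. The method of claim 1, wherein after the test result for the system is determined from the verification result, the method further comprises:
    arranging the test process corresponding to the test result as a daily test task and running it multiple times;
    generating a test baseline from the results of the multiple runs.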
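  12. A distributed timing message system testing apparatus, the system comprising a message publishing client, a message subscription client, and a timing message server, the apparatus comprising:
    a message subscription module, which, through the message subscription client, initiates batch subscriptions to timing messages that the message publishing client can publish;
    a fault injection module, which constructs a fault injection instruction according to the delay time corresponding to the timing messages and sends it to the timing message server, so as to inject a fault of a specified type into the timing message server;
    a publishing and delivery module, which, through the message publishing client, publishes each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client;
    a result determination module, which verifies, against the subscriptions, how the message subscription client received the timing messages, and determines a test result for the system from the verification result.
  13. The apparatus of claim 12, wherein the fault injection module constructs, according to the delay time corresponding to the timing messages, a fault injection instruction whose corresponding fault injection time is not less than the delay time.
  14. The apparatus of claim 12, wherein the system comprises a distributed cluster composed of a plurality of the timing message servers;
    the fault injection module, after the fault of the specified type is injected into the timing message server, triggers in the distributed cluster, by the injected fault of the specified type: at least one timing message server newly coming online to provide the service of delivering the timing messages, or at least one timing message server taking over the service of delivering the timing messages from another timing message server.
  15. The apparatus of claim 14, further comprising:
    a partition management module, which, with the timing message server as the dimension, logically divides all timing messages to be delivered into a plurality of partitions and assigns them to the corresponding timing message servers;
    the partition management module, after the fault of the specified type is injected into the timing message server, enters a partition state transition phase in response to the injected fault of the specified type;
    in the partition state transition phase, at least some of the partitions are reallocated or switch their disaster recovery state.
  16. The apparatus of claim 15, wherein the fault injection module determines whether the delay time corresponding to a timing message matches the partition state transition phase;
    if not, it adjusts the delay time accordingly, so as to force an attempt to deliver the timing message during the partition state transition phase.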
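  17. The apparatus of claim 14, wherein the fault injection module, after the fault of the specified type is injected into the timing message server, treats the timing message server into which the fault was injected as a first server;
    determines that, in response to the fault, a second server among the plurality of timing message servers is to take over the service of delivering the timing messages from the first server;
    and injects, through the first server, a fault of a specified type into the second server.
  18. The apparatus of claim 17, wherein the service impairment corresponding to the fault injected into the second server is lower than the service impairment corresponding to the fault injected into the first server.
  19. The apparatus of claim 14, wherein the fault injection module, after the fault of the specified type is injected into the timing message server, determines whether the injected fault of the specified type has triggered at least one timing message server to take over the service of delivering the timing messages from another timing message server;
    if not, it inserts a safety time slot into the period during which the injected fault is in effect, wherein the fault does not take effect during the safety time slot, and the safety time slot is shorter than the effective period.
  20. The apparatus of claim 12, wherein the fault injection module injects at least one of the following types of atomic fault into the timing message server:
    downtime, CPU surge, full disk, IO exception, memory surge, network packet delay, duplication, and loss, and JVM device-level exceptions.
  21. The apparatus of claim 12, wherein the result determination module verifies, against the subscriptions, the completeness and timeliness of the timing messages received by the timing message server through the delivery, so as to determine how the delivery is affected by the injected fault of the specified type.
  22. The apparatus of claim 12, wherein the result determination module, after the test result for the system is determined from the verification result, arranges the test process corresponding to the test result as a daily test task and runs it multiple times;
    and generates a test baseline from the results of the multiple runs.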
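  23. A distributed timing message system testing device, the system comprising a message publishing client, a message subscription client, and a timing message server, the device comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can:
    initiate, through the message subscription client, batch subscriptions to timing messages that the message publishing client can publish;
    construct a fault injection instruction according to the delay time corresponding to the timing messages and send it to the timing message server, so as to inject a fault of a specified type into the timing message server;
    publish, through the message publishing client, each timing message subscribed to by the message subscription client to the timing message server, so that the timing message server delivers the timing messages to the message subscription client;
    verify, against the subscriptions, how the message subscription client received the timing messages, and determine a test result for the system from the verification result.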
PCT/CN2023/110149 2022-08-29 2023-07-31 Distributed timing message system testing method, apparatus, and device WO2024045980A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211038337.2A CN115604164A (zh) 2022-08-29 2022-08-29 Distributed timing message system testing method, apparatus, and device
CN202211038337.2 2022-08-29

Publications (1)

Publication Number Publication Date
WO2024045980A1 true WO2024045980A1 (zh) 2024-03-07

Family

ID=84842974

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/110149 WO2024045980A1 (zh) 2022-08-29 2023-07-31 Distributed timing message system testing method, apparatus, and device

Country Status (2)

Country Link
CN (1) CN115604164A (zh)
WO (1) WO2024045980A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115604164A (zh) * 2022-08-29 2023-01-13 支付宝(杭州)信息技术有限公司(Cn) Distributed timing message system testing method, apparatus, and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170242784A1 (en) * 2016-02-19 2017-08-24 International Business Machines Corporation Failure recovery testing framework for microservice-based applications
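CN109981406A (zh) * 2019-03-22 2019-07-05 北京达佳互联信息技术有限公司 Test method, apparatus, system, and computer-readable storage medium
CN111857585A (zh) * 2020-07-10 2020-10-30 苏州浪潮智能科技有限公司 Method, apparatus, device, and medium for configuring custom service functions of a storage system
CN114237994A (zh) * 2021-12-01 2022-03-25 中国工商银行股份有限公司 Test method and system for a distributed system, electronic device, and storage medium
CN114500635A (zh) * 2022-01-07 2022-05-13 支付宝(杭州)信息技术有限公司 Service processing method and apparatus
CN115604164A (zh) * 2022-08-29 2023-01-13 支付宝(杭州)信息技术有限公司(Cn) Distributed timing message system testing method, apparatus, and device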

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YONG YANG, LI YING; WU ZHONG-HAI: "Survey of State-of-the-art Distributed Tracing Technology", JOURNAL OF SOFTWARE, vol. 31, no. 7, 15 July 2020 (2020-07-15), pages 2019 - 2039, XP093143455, ISSN: 1000-9825, DOI: 10.13328/j.cnki.jos.006047 *

Also Published As

Publication number Publication date
CN115604164A (zh) 2023-01-13

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23859015

Country of ref document: EP

Kind code of ref document: A1