CN114500349B

CN114500349B - A cloud platform chaos testing method and device

Info

Publication number: CN114500349B
Application number: CN202111613738.1A
Authority: CN
Inventors: 杨帆; 刘磊; 何玥
Original assignee: China Telecom Cloud Technology Co Ltd
Current assignee: China Telecom Cloud Technology Co Ltd
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2023-08-08
Anticipated expiration: 2041-12-27
Also published as: CN114500349A

Abstract

The invention discloses a cloud platform chaos testing method and device. The method includes: after injecting a fault into the cloud platform under test, obtaining the test result of the k-th round of concurrent testing of the cloud platform under test and computing the actual error rate of the k-th round; using the actual error rate of the k-th round to estimate the predicted error rate of the (k+1)-th round; determining whether the predicted error rate of the (k+1)-th round is greater than a preset threshold; when it is greater than the preset threshold, reducing the number of concurrent requests; when it is less than the preset threshold, increasing the number of concurrent requests; and when it is equal to the preset threshold, determining the number of concurrent requests of the k-th round as the critical value of the cloud platform under test under the fault. By using historical test results to adjust the number of concurrent requests of the current test, the invention adaptively adjusts the number of concurrent requests after fault injection, thereby accurately measuring the performance critical value of the cloud platform after the fault is injected and determining how far the platform's performance degrades under the fault.

Description

A cloud platform chaos testing method and device

Technical field

The invention relates to the technical field of testing, and in particular to a cloud platform chaos testing method and device.

Background

In recent years, cloud computing has been a very active research direction in the ICT field, and quality control of cloud platform performance has attracted sustained attention. Mainstream open-source cloud platform software (for example, OpenStack) generally adopts a microservice architecture. A microservice architecture splits monolithic software into multiple software services with distinct functions that can be run and deployed independently, and it offers good scalability, easy deployment, and easy development. Adopting a microservice architecture helps reduce software development costs and fits well with the DevOps (development and operations) working model, but it also introduces new challenges. Software built on a microservice architecture is distributed, and replicas of a microservice that provide the same function are generally located on different hosts, virtual machines, or containers. Infrastructure failures and unexpected starts and stops of microservice replicas can reduce or even interrupt the software's ability to provide services. For cloud service providers, a sharp drop in control-plane performance or failed responses can cause huge economic losses.

Chaos experiments are a research direction that has emerged in software testing in recent years. They are mainly used to observe whether a microservice software system can cope with failures when random faults are injected. Executing chaos experiments is an important step that urgently needs to be automated. Existing chaos experiment tools, such as ChaosBlade and ChaosMonkey, can simulate CPU, memory, and other faults. However, the inventors found that if a cloud platform test uses a constant number of concurrent requests, the request success rate can only show whether a fault affects the result; it cannot show how much the fault degrades the performance of the cloud platform. An assessment of the performance impact can then be obtained only by repeating the experiment many times.

Summary of the invention

Therefore, the present invention aims to solve the technical problem in the prior art that the performance degradation of a cloud platform caused by a fault cannot be determined, and provides a cloud platform chaos testing method and device.

One aspect of the embodiments of the present invention provides a cloud platform chaos testing method, comprising the following steps: after injecting a fault into the cloud platform under test, obtaining the test result of the k-th round of concurrent testing of the cloud platform under test and computing the actual error rate of the k-th round, where k takes the values 1, 2, 3, ...; estimating the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round; determining whether the predicted error rate of the (k+1)-th round is greater than a preset threshold; when the predicted error rate of the (k+1)-th round is greater than the preset threshold, reducing the number of concurrent requests relative to the k-th round to obtain the number of concurrent requests for the (k+1)-th round; when the predicted error rate of the (k+1)-th round is less than the preset threshold, increasing the number of concurrent requests relative to the k-th round to obtain the number of concurrent requests for the (k+1)-th round; and when the predicted error rate of the (k+1)-th round is equal to the preset threshold, determining the number of concurrent requests of the k-th round as the critical value of the cloud platform under test under the fault.

Optionally, estimating the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round includes: obtaining the predicted error rate of the k-th round; and calculating the predicted error rate of the (k+1)-th round from pre-configured weights together with the predicted and actual error rates of the k-th round.

Optionally, the predicted error rate of the (k+1)-th round of concurrent testing is calculated by the following formula:

$$e'_{k+1} = \alpha e_k + (1-\alpha)\,e'_k$$

where $e_k$ denotes the actual error rate of the k-th round of concurrent testing, $e'_k$ denotes the predicted error rate of the k-th round, and $\alpha$ denotes the smoothing coefficient.

Optionally, reducing the number of concurrent requests relative to the k-th round of concurrent testing to obtain the number of concurrent requests for the (k+1)-th round includes: using the predicted error rate of the (k+1)-th round as a decay coefficient to calculate the number of concurrent requests for the (k+1)-th round.

Optionally, the number of concurrent requests for the (k+1)-th round of concurrent testing is calculated by the following formula:

where $e'_{k+1}$ denotes the predicted error rate of the (k+1)-th round of concurrent testing, $C_k$ denotes the number of concurrent requests of the k-th round, and $\lceil\cdot\rceil$ denotes the round-up (ceiling) operator.

Optionally, increasing the number of concurrent requests relative to the k-th round of concurrent testing to obtain the number of concurrent requests for the (k+1)-th round includes: determining the number of concurrent requests to add from a preset increase coefficient and the number of concurrent requests of the k-th round, and adding it to the number of concurrent requests of the k-th round to obtain the number of concurrent requests for the (k+1)-th round.

where $e'_{k+1}$ denotes the predicted error rate of the (k+1)-th round of concurrent testing, $C_k$ denotes the number of concurrent requests of the k-th round, $\beta$ denotes the increase coefficient, and $\lfloor\cdot\rfloor$ denotes the round-down (floor) operator.

Another aspect of the present invention provides a cloud platform chaos testing device, including: an acquisition module, configured to obtain, after a fault is injected into the cloud platform under test, the test result of the k-th round of concurrent testing of the cloud platform under test and to compute the actual error rate of the k-th round, where k takes the values 1, 2, 3, ...; an estimation module, configured to estimate the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round; a judging module, configured to determine whether the predicted error rate of the (k+1)-th round is greater than a preset threshold; a first calculation module, configured to reduce the number of concurrent requests relative to the k-th round when the predicted error rate of the (k+1)-th round is greater than the preset threshold, obtaining the number of concurrent requests for the (k+1)-th round; a second calculation module, configured to increase the number of concurrent requests relative to the k-th round when the predicted error rate of the (k+1)-th round is less than the preset threshold, obtaining the number of concurrent requests for the (k+1)-th round; and a determination module, configured to determine the number of concurrent requests of the k-th round as the critical value of the cloud platform under test under the fault when the predicted error rate of the (k+1)-th round is equal to the preset threshold.

Another aspect of the present invention provides a computer device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, carry out the cloud platform chaos testing method described above.

Another aspect of the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the cloud platform chaos testing method described above.

The technical solution of the present invention has the following advantages:

According to the embodiments of the present invention, historical test results are used to adjust the number of concurrent requests of the current test, so that the number of concurrent requests is adjusted adaptively after fault injection. This makes it possible to accurately measure the performance critical value of the cloud platform after the fault is injected and to determine how far the platform's performance degrades under the fault.

According to the embodiments of the present invention, the error rate of the previous round of concurrent testing, that is, historical data, is used to dynamically adjust the current error rate estimate, which in turn adjusts the number of concurrent requests and improves the accuracy with which the level of performance degradation after a fault is evaluated.

Description of drawings

To explain the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic framework diagram of the test system according to an embodiment of the present invention;

Fig. 2 is a flowchart of a specific example of the cloud platform chaos testing method in Embodiment 1 of the present invention;

Fig. 3 is a sequence diagram of fault injection in the cloud platform chaos test according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a deployment architecture of a test system according to an embodiment of the present invention;

Fig. 5 is a functional block diagram of a specific example of the cloud platform chaos testing device in Embodiment 2 of the present invention;

Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed description

The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that orientation or position terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positions shown in the drawings. They are used only to simplify the description, and they do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "installed", "coupled", and "connected" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or an internal communication between two elements; and it may be wireless or wired. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific situation.

In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict.

The cloud platform chaos testing method and device provided by the embodiments of the present invention can dynamically adjust the number of concurrent requests after a fault during automated testing of the cloud platform, and can also control the iteration number or time at which a fault is injected and released, enabling the simulation of faults such as service start/stop, network congestion, and high memory load. The proposed device can automatically execute and record fault injection for chaos experiments, which broadens the variety of steady-state indicators available to a chaos experiment and improves the evaluation of performance degradation caused by faults.

Before the test method of the embodiments of the present invention is introduced, the test resources and environment provided by the embodiments are described.

The test process of the embodiments of the present invention requires the following resources: chaos tasks, users, projects, and roles. A chaos task is one execution of a chaos configuration on the cloud platform under test. A project is the smallest unit for managing chaos configurations and chaos tasks. Users, projects, and roles are associated through role-based access control (RBAC). Users access the resources under a project by virtue of their roles.

The embodiments of the present invention also provide a test system for performing chaos testing. As shown in Fig. 1, the system mainly consists of the following modules: a user permission management module, a task management module, a cloud platform test suite, a remote fault injection tool, and so on. In addition to these modules, the device relies on a code hosting platform and a continuous integration platform.

User permission management module: provides functions for creating, deleting, modifying, and querying users, projects, and roles, and supports RBAC-based permission control.

Code hosting platform: a code hosting repository such as GitLab or Gerrit can be used. Users submit the YAML configuration for fault injection to the code hosting platform. The configuration parameters in the YAML file include, but are not limited to: fault type, injection iteration number, release iteration number, and total number of iterations. The configuration is merged on the code hosting platform after it has been reviewed.
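The parameters listed above can be pictured as a small validated record. The following is a minimal sketch only: the class name, field names, the fault type string, and the convention that the fault is active from the injection iteration up to (but not including) the release iteration are all assumptions made for illustration, not the schema used by the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultInjectionConfig:
    """Hypothetical in-memory form of the fault-injection YAML file."""
    fault_type: str          # e.g. "network_delay"; the set of names is assumed
    inject_iteration: int    # iteration at which the fault is injected
    release_iteration: int   # iteration at which the fault is released
    total_iterations: int    # total number of test iterations

    def validate(self) -> None:
        """Raise ValueError if the iteration window is inconsistent."""
        if self.total_iterations < 1:
            raise ValueError("total_iterations must be at least 1")
        if not 1 <= self.inject_iteration < self.release_iteration:
            raise ValueError("fault must be injected before it is released")
        if self.release_iteration > self.total_iterations:
            raise ValueError("release_iteration exceeds total_iterations")

    def fault_active(self, iteration: int) -> bool:
        """True while the fault is injected and not yet released (assumed convention)."""
        return self.inject_iteration <= iteration < self.release_iteration


# Illustrative values only.
config = FaultInjectionConfig("network_delay", 200, 800, 1000)
config.validate()
```

Validating the window before a task is scheduled keeps an inconsistent configuration from reaching the pipeline, which is one plausible reason to review the file before merging it.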

Task management module: the task management module runs as an HTTP service and provides functions for creating, deleting, modifying, and querying chaos tasks. When a chaos task is created, the module first validates the request; the checks include, but are not limited to, whether the start time is later than the current time and whether the fault injection configuration file exists. After validation, it saves the ID, name, remarks, start time, configuration file content, and other information to the database. At this point the task is in the pending state. When the task's start time arrives, the module triggers the pipeline on the continuous integration platform that performs fault injection and testing of the cloud platform, and updates the task status to running. After the pipeline completes, the module updates the task to finished and saves the test results to the database. When a chaos task is updated, the module first checks its status: the name and remarks fields can be updated in any state, but the start time can be updated only for tasks in the pending state. After the update, the module saves the information to the database.
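The task lifecycle described above (pending, running, finished, with restricted updates) can be sketched as a small state machine. This is an illustrative sketch: the names `ChaosTask`, `create_task`, `update_task`, and `advance` are hypothetical, the database and HTTP layers are omitted, and the "configuration file exists" check is reduced to a non-empty string.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class ChaosTask:
    task_id: str
    name: str
    remark: str
    start_time: datetime
    config_text: str
    status: TaskStatus = TaskStatus.PENDING


def create_task(task_id, name, remark, start_time, config_text, now):
    """Validate a new task request and return it in the pending state."""
    if start_time <= now:
        raise ValueError("start time must be later than the current time")
    if not config_text:
        raise ValueError("fault injection configuration is missing")
    return ChaosTask(task_id, name, remark, start_time, config_text)


def update_task(task, name=None, remark=None, start_time=None):
    """Name and remark may change in any state; start time only while pending."""
    if start_time is not None:
        if task.status is not TaskStatus.PENDING:
            raise ValueError("start time can only be changed while pending")
        task.start_time = start_time
    if name is not None:
        task.name = name
    if remark is not None:
        task.remark = remark
    return task


def advance(task, now, pipeline_done=False):
    """pending -> running at the start time; running -> finished when the pipeline ends."""
    if task.status is TaskStatus.PENDING and now >= task.start_time:
        task.status = TaskStatus.RUNNING   # a real module would trigger the CI pipeline here
    elif task.status is TaskStatus.RUNNING and pipeline_done:
        task.status = TaskStatus.FINISHED  # a real module would save the results here
    return task.status


now = datetime(2021, 12, 27, 9, 0)
task = create_task("t-1", "service restart", "", now + timedelta(minutes=5),
                   "fault_type: service_stop", now)
```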

Continuous integration platform pipeline: when the pipeline is triggered, the slave node of the continuous integration platform first downloads the cloud platform testing tool, the fault injection tool, and the fault injection configuration file from the code hosting platform. It then installs the fault injection tool and the cloud platform testing tool in turn. Finally, it performs the basic configuration for cloud platform testing and runs the test with the specified fault injection configuration file.

Cloud platform testing tool: the cloud platform testing tool runs on the slave node of the continuous integration platform. After the test command is issued, the tool starts two threads, one to execute the test tasks and one to handle fault injection. The test tasks and the executed iteration numbers are each managed in a first-in, first-out queue; before each execution, the thread that runs the tests puts per-run information such as the iteration number into the queue. If fault injection is triggered at a specified iteration number, the fault-injection thread listens on the iteration-number queue and takes iteration numbers from it one at a time to compare against the specified one. If fault injection is triggered after a specified time, the timer starts when the first iteration number is taken from the queue.
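The iteration-triggered case can be sketched with Python's standard `queue` and `threading` modules. This is a simplified illustration, not the tool's actual code: `INJECT_AT`, `TOTAL`, the `DONE` sentinel, and the `events` log are assumptions, and the real test call and injection API call are replaced by log entries.

```python
import queue
import threading

INJECT_AT = 3      # hypothetical: trigger injection when iteration 3 is announced
TOTAL = 5          # hypothetical total number of iterations
DONE = None        # sentinel telling the listener that testing has ended

iteration_queue = queue.Queue()   # FIFO queue of iteration numbers
events = []                       # shared log; list.append is thread-safe in CPython


def run_tests():
    """Test thread: publish each iteration number before running that iteration."""
    for k in range(1, TOTAL + 1):
        iteration_queue.put(k)
        events.append(("test", k))        # placeholder for the real test call
    iteration_queue.put(DONE)


def watch_for_injection():
    """Fault-injection thread: compare each iteration number with the trigger."""
    while True:
        k = iteration_queue.get()
        if k is DONE:
            break
        if k == INJECT_AT:
            events.append(("inject", k))  # placeholder for the injection API call


threads = [threading.Thread(target=run_tests),
           threading.Thread(target=watch_for_injection)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch the test thread does not wait for the injection to complete, so the position of the `("inject", 3)` entry relative to the test entries is not deterministic; only the fact that it fires exactly once is.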

Cloud platform fault injection: the cloud platform fault injection module can start and stop services, Docker containers, and servers, and can inject network packet loss, network delay, memory load, and CPU load. The module exposes an API for other programs to call. In the present invention, the fault injection module is called by the cloud platform testing tool: once the fault-injection thread in the testing tool is triggered, it calls the fault injection API. The fault injection module logs in over SSH on the management network or over IPMI on the out-of-band network, and then uses tools such as systemctl and tc to carry out the fault injection.
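As a rough illustration of how such a module might map fault types onto `systemctl` and `tc` invocations, the sketch below only builds command lines and does not execute anything. The function names, fault type strings, host name, and parameter names are hypothetical; the commands themselves use the standard `systemctl start|stop <unit>` form and the `tc qdisc ... netem delay|loss` form.

```python
import shlex


def build_fault_command(fault_type, **params):
    """Return the argv list for one fault; fault_type names are hypothetical."""
    if fault_type == "service_stop":
        return ["systemctl", "stop", params["service"]]
    if fault_type == "service_start":
        return ["systemctl", "start", params["service"]]
    if fault_type == "network_delay":
        return ["tc", "qdisc", "add", "dev", params["device"], "root",
                "netem", "delay", f"{params['delay_ms']}ms"]
    if fault_type == "network_loss":
        return ["tc", "qdisc", "add", "dev", params["device"], "root",
                "netem", "loss", f"{params['loss_percent']}%"]
    if fault_type == "network_reset":
        # Removes the root qdisc, undoing a previously injected netem rule.
        return ["tc", "qdisc", "del", "dev", params["device"], "root"]
    raise ValueError(f"unsupported fault type: {fault_type}")


def over_ssh(host, argv):
    """Wrap a command so that it would run on a remote host through the ssh client."""
    return ["ssh", host, shlex.join(argv)]


command = over_ssh("root@controller-1",
                   build_fault_command("network_delay", device="eth0", delay_ms=100))
```

Keeping command construction separate from execution makes each fault easy to log and to pair with its release command.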

A cloud platform chaos testing method provided by an embodiment of the present invention, as shown in Fig. 2, includes the following steps:

Step S101: after injecting a fault into the cloud platform under test, obtain the test result of the k-th round of concurrent testing of the cloud platform under test and compute the actual error rate of the k-th round, where k takes the values 1, 2, 3, ....

The k-th round of concurrent testing in the embodiments of the present invention refers to the k-th iteration of the test process. For the k-th round, the cloud platform testing tool can send the cloud platform a number of requests equal to that round's number of concurrent requests and obtain the corresponding test results. A test result may be the response to a single concurrent request: if there is no response, an error is reported; if the response succeeds, the test passes. The ratio of the number of errors to the number of concurrent requests can be used as the actual error rate. In the embodiments of the present invention, k takes the values 1, 2, 3, ... and may count either the test rounds after fault injection or the test rounds of the entire test task. For example, a test may run for 1000 iterations, with k taking the values 1, 2, 3, ..., 1000, where the fault is injected starting at the 200th iteration and recovered at the 800th. The embodiments of the present invention mainly concern how the number of concurrent requests is adjusted after fault injection.

Of course, after reading the embodiments of the present invention, those skilled in the art will appreciate that the number of concurrent requests can be adjusted throughout the entire test. The embodiments of the present invention emphasize how, after fault injection, to determine the critical value of the degraded performance of the cloud platform under test in the fault state.

In the embodiments of the present invention, messages between test execution and fault injection are passed through queue 1, which holds the iteration number (that is, k). Before each test execution, the test process puts the iteration number into queue 1 and then starts a new thread to run the test task. When the number of running test threads reaches the number of concurrent requests, the test process waits for the test threads to finish, reads the test results, and computes the actual error rate. Each test thread corresponds to one test case, and one test case corresponds to one concurrent test request.
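One round of this process can be sketched as follows: start one thread per concurrent request, wait for all of them, then take the ratio of failed requests to the number of concurrent requests as the actual error rate. The names `run_round` and `fake_request` are hypothetical, the queue handling is omitted, and `fake_request` stands in for a real call to the cloud platform.

```python
import threading


def run_round(send_request, concurrency):
    """Start `concurrency` test threads, wait for all of them, return the error rate."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    results = [False] * concurrency          # one slot per test thread

    def worker(index):
        try:
            results[index] = bool(send_request(index))
        except Exception:
            results[index] = False           # an exception counts as an error

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                             # wait until every test thread finishes

    errors = results.count(False)
    return errors / concurrency


def fake_request(index):
    """Stand-in for a real API call: every fourth request fails."""
    return index % 4 != 0


rate = run_round(fake_request, 8)   # requests 0 and 4 fail, so 2 of 8 are errors
```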

Step S102: estimate the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round.

In this embodiment, after the actual error rate has been computed, it is smoothed to estimate the predicted error rate of the next round of concurrent testing. Specifically, the predicted error rate of the (k+1)-th round can be estimated directly from the actual error rate of the k-th round, for example by multiplying the actual error rate by a coefficient. Alternatively, it can be calculated from the actual error rate of the k-th round together with the predicted error rate of the k-th round, for example as a weighted sum of the two.

As an optional implementation, estimating the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round includes: obtaining the predicted error rate of the k-th round; and calculating the predicted error rate of the (k+1)-th round from pre-configured weights together with the predicted and actual error rates of the k-th round.

In the embodiments of the present invention, the predicted error rate of the k-th round of concurrent testing may be calculated after the (k-1)-th round ends; when k = 1, the predicted error rate takes its initial value, which is 0. The pre-configured weights are the relative weights given to the predicted error rate and the actual error rate of the k-th round when the predicted error rate of the (k+1)-th round is calculated.

作为进一步可选的实施方式，可以通过以下公式计算得到所述第k+1轮并发测试的预测错误率：As a further optional implementation manner, the prediction error rate of the k+1th round of concurrent testing can be calculated by the following formula:

e′_k+1＝αe_k+(1-α)e′_k e′ _k+1 ＝αe _k +(1-α)e′ _k

where $e_k$ denotes the actual error rate of the k-th round of concurrent testing, $e'_k$ denotes the predicted error rate of the k-th round, and $\alpha$ denotes the smoothing coefficient. $\alpha$ is a constant that may take a value greater than 0.5 and less than 1, for example 0.8. The larger the smoothing coefficient, the greater the weight of the newest observation. Because the testing process in the embodiment of the present invention is iterative, the formula is applied iteratively as well: as the number of iterations grows, the older an error rate is, the smaller its share in the prediction.
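The smoothing step can be illustrated with a minimal sketch. The function names `predict_error_rate` and `smooth` are hypothetical and chosen only for illustration; the update rule, the initial value 0, and the example coefficient 0.8 come from the text. Unrolling the recursion shows that the actual error rate observed j rounds earlier enters the prediction with weight $\alpha(1-\alpha)^j$, which is why older observations fade.

```python
def predict_error_rate(actual: float, predicted: float, alpha: float = 0.8) -> float:
    """Return e'_{k+1} = alpha * e_k + (1 - alpha) * e'_k."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * actual + (1.0 - alpha) * predicted


def smooth(actual_rates, alpha: float = 0.8, initial: float = 0.0):
    """Fold per-round actual error rates into the successive predictions."""
    predicted = initial  # e'_1 = 0, the initial value given in the text
    history = []
    for actual in actual_rates:
        predicted = predict_error_rate(actual, predicted, alpha)
        history.append(predicted)  # prediction for the round after this one
    return history
```

For example, `smooth([0.5, 0.5])` yields predictions close to 0.4 and then 0.48: the estimate approaches the observed rate without jumping to it immediately.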

Step S103: determine whether the predicted error rate of the (k+1)-th round of concurrent testing is greater than a preset threshold.

Step S104: when the predicted error rate of the (k+1)-th round of concurrent testing is greater than the preset threshold, reduce the amount of concurrent requests relative to the k-th round of concurrent testing to obtain the amount of concurrent requests for the (k+1)-th round.

Step S105: when the predicted error rate of the (k+1)-th round of concurrent testing is less than the preset threshold, increase the amount of concurrent requests relative to the k-th round of concurrent testing to obtain the amount of concurrent requests for the (k+1)-th round.

Step S106: when the predicted error rate of the (k+1)-th round of concurrent testing is equal to the preset threshold, determine the amount of concurrent requests of the k-th round of concurrent testing as the critical point of the cloud platform under test under the fault.

In the embodiment of the present invention, when the predicted error rate of the (k+1)-th round of concurrent testing is less than the preset threshold, the cloud platform under test has not yet reached its critical point for handling concurrent requests, so the amount of concurrent requests for the next round can be raised. When the predicted error rate of the (k+1)-th round is greater than the preset threshold, the platform has already exceeded its limit for handling concurrent requests, so the amount of concurrent requests for the next round must be lowered. When the predicted error rate equals the preset threshold, the current amount of concurrent requests can be regarded as the platform's limit under the fault, and no adjustment is needed.

In the embodiment of the present invention, the test may proceed to the next iteration whichever of these relations holds. If the predicted error rate is greater than or less than the preset threshold, the next round of concurrent testing is run with the adjusted amount of concurrent requests: k is incremented by 1, the corresponding concurrent test is run again (steps S101 to S103), the comparison is made, and the follow-up action is taken according to its result. The chaos test is complete when k reaches its maximum value. Alternatively, the loop may be set to exit as soon as the predicted error rate of the (k+1)-th round equals the preset threshold, because at that point the concurrency limit of the platform under the fault has already been determined.
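The iteration described above can be sketched as a small control loop. This is only an illustration under stated assumptions: `run_round` (runs one round at a given concurrency and returns its actual error rate), `adjust` (returns the adjusted concurrency), and the `tolerance` parameter are hypothetical names that do not appear in the source. The source compares the prediction with the threshold for exact equality; the optional tolerance is an added convenience because floating-point values rarely match exactly.

```python
from typing import Callable, Optional, Tuple


def find_critical_concurrency(
    run_round: Callable[[int], float],
    adjust: Callable[[int, float], int],
    threshold: float,
    initial_concurrency: int,
    max_rounds: int,
    alpha: float = 0.8,
    stop_on_equal: bool = True,
    tolerance: float = 0.0,  # assumption: 0.0 reproduces the exact comparison in the text
) -> Tuple[int, Optional[int]]:
    """Return (final concurrency, critical concurrency or None if never reached)."""
    concurrency = initial_concurrency
    predicted = 0.0  # initial predicted error rate for k = 1
    critical = None
    for _ in range(max_rounds):
        actual = run_round(concurrency)                       # run round k, count errors
        predicted = alpha * actual + (1 - alpha) * predicted  # estimate for round k+1
        if abs(predicted - threshold) <= tolerance:           # equal: critical point found
            critical = concurrency
            if stop_on_equal:
                break
        else:                                                 # greater or less: adjust
            concurrency = adjust(concurrency, predicted)
    return concurrency, critical
```

With `stop_on_equal=False` the loop keeps running until `max_rounds` is exhausted, which corresponds to the first exit condition in the text; with `stop_on_equal=True` it exits at the first match, which corresponds to the alternative.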

As an optional implementation, reducing the amount of concurrent requests relative to the k-th round of concurrent testing to obtain the amount of concurrent requests for the (k+1)-th round includes: using the predicted error rate of the (k+1)-th round as an attenuation coefficient to calculate the amount of concurrent requests for the (k+1)-th round.

Because the predicted error rate is computed from the actual error rate of the previous round, using it as the attenuation coefficient makes the adjusted amount of concurrent requests match the previous round's test result more closely and approach the critical point of the cloud platform under the fault more quickly.

Specifically, in the embodiment of the present invention, the amount of concurrent requests for the (k+1)-th round of concurrent testing (that is, the reduced amount) can be calculated by the following formula:

On the other hand, in the embodiment of the present invention the amount of concurrent requests may also be raised in a similar manner, with the predicted error rate used as the increase coefficient. The calculation can be analogous to the formula above and has a similar effect, so it is not described again here.
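The reduction formula itself is not reproduced in this text. The sketch below is therefore an assumption, not the patented formula: it takes the attenuation to be $C_{k+1} = \lceil (1 - e'_{k+1})\, C_k \rceil$, which uses only the symbols the claims list for this step (the predicted error rate, the current concurrency, and a round-up operator). The function name `decrease_concurrency` and the lower bound of 1 are likewise hypothetical.

```python
import math


def decrease_concurrency(current: int, predicted_error_rate: float) -> int:
    """Attenuate the concurrency by the predicted error rate (assumed form)."""
    if not 0.0 <= predicted_error_rate <= 1.0:
        raise ValueError("predicted_error_rate must be in [0, 1]")
    # ASSUMPTION: C_{k+1} = ceil((1 - e'_{k+1}) * C_k); the source formula is missing.
    reduced = math.ceil((1.0 - predicted_error_rate) * current)
    # ASSUMPTION: never drop below one concurrent request so testing can continue.
    return max(1, reduced)
```

Under this assumed form, a predicted error rate of 0.5 halves a concurrency of 20 to 10, so a larger predicted error rate produces a sharper cut.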

As an optional implementation, increasing the amount of concurrent requests relative to the k-th round of concurrent testing to obtain the amount of concurrent requests for the (k+1)-th round includes: using a preset increase coefficient and the amount of concurrent requests of the k-th round to determine the amount to be added, and adding it to the amount of concurrent requests of the k-th round to obtain the amount of concurrent requests for the (k+1)-th round.

That is, in the embodiment of the present invention, a fixed increase coefficient can be set to calculate the raised amount of concurrent requests. Specifically, the amount of concurrent requests for the (k+1)-th round of concurrent testing can be calculated by the following formula:

where $e'_{k+1}$ denotes the predicted error rate of the (k+1)-th round of concurrent testing, $C_k$ denotes the amount of concurrent requests of the k-th round, $\beta$ denotes the increase coefficient, which can be set empirically, and $\lfloor \cdot \rfloor$ denotes the round-down (floor) operator.
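The increase formula is also missing from this text. As a hedged illustration, the sketch below assumes $C_{k+1} = C_k + \lfloor \beta C_k \rfloor$, which matches the description (an increment derived from the fixed coefficient and the current concurrency, added to the current concurrency, with a round-down operator) but is not confirmed by the source. The function name `increase_concurrency` is hypothetical.

```python
import math


def increase_concurrency(current: int, beta: float) -> int:
    """Raise the concurrency by a fixed fraction of its current value (assumed form)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # ASSUMPTION: C_{k+1} = C_k + floor(beta * C_k); the source formula is missing.
    return current + math.floor(beta * current)
```

One consequence of rounding down under this assumed form: when `beta * current` is below 1 (for example `current = 3`, `beta = 0.25`), the increment is 0 and the concurrency stops growing, so the coefficient would need to be chosen with the starting concurrency in mind.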

Combining the above formulas gives:

where $\varepsilon$ is the preset threshold. If the predicted error rate is greater than the preset threshold, the test management process reduces its amount of concurrent requests according to the error rate. If the error rate is less than the preset threshold, for example when the current error rate is 0, the amount of concurrent requests has not reached the critical point of the cloud platform, and the test management process tries to increase its amount of concurrent requests appropriately.
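The combined formula is not reproduced in this text. One piecewise form consistent with the surrounding description is sketched below; it is a reconstruction under assumptions, not the source's formula. The threshold is written $\varepsilon$ here, and the specific reduction and increase expressions are assumed.

$$
C_{k+1} =
\begin{cases}
\left\lceil (1 - e'_{k+1})\, C_k \right\rceil, & e'_{k+1} > \varepsilon \\
C_k + \left\lfloor \beta\, C_k \right\rfloor, & e'_{k+1} < \varepsilon \\
C_k, & e'_{k+1} = \varepsilon
\end{cases}
$$

As a worked example under these assumptions, take $C_k = 20$, $\beta = 0.25$, and $\varepsilon = 0.1$. A predicted error rate of 0 gives $C_{k+1} = 20 + \lfloor 5 \rfloor = 25$, while a predicted error rate of 0.5 gives $C_{k+1} = \lceil 0.5 \times 20 \rceil = 10$.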

In the embodiment of the present invention, the fault injection thread has two working modes: a timed mode and an iteration-count mode. In timed mode, the chaos experiment configuration specifies the time after the start of the test at which the fault is injected. In iteration-count mode, the injection time is determined by the number of times the test has been executed. Both modes work by polling.
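The two polling modes can be sketched as follows. All names (`make_trigger`, `poll_and_inject`, the mode strings, `poll_interval`) are hypothetical; the source states only that a timed mode and an iteration-count mode exist and that both poll. The elapsed time and the iteration counter are passed in as callables so that the polling loop itself stays independent of the mode.

```python
import threading
from typing import Callable, Optional


def make_trigger(
    mode: str,
    value: float,
    elapsed_seconds: Callable[[], float],
    iterations_done: Callable[[], int],
) -> Callable[[], bool]:
    """Build the polling predicate for the chosen working mode."""
    if mode == "timed":        # inject once enough time has passed since test start
        return lambda: elapsed_seconds() >= value
    if mode == "iterations":   # inject once the test has run enough times
        return lambda: iterations_done() >= value
    raise ValueError(f"unknown mode: {mode!r}")


def poll_and_inject(
    should_inject: Callable[[], bool],
    inject: Callable[[], None],
    poll_interval: float = 0.01,
    stop_event: Optional[threading.Event] = None,
) -> bool:
    """Poll until the trigger fires, then inject once; return False if stopped first."""
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        if should_inject():
            inject()
            return True
        stop_event.wait(poll_interval)  # sleep between polls, wake early on stop
    return False
```

In use, `poll_and_inject` would run in its own thread alongside the test workers, with `iterations_done` reading a counter that the workers update.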

The embodiment of the present invention is described below through the chaos testing workflow shown in FIG. 3, which includes:

Step 1: The user submits the chaos experiment configuration to the code hosting platform.

Step 2: After review by an administrator, the configuration is merged into the code repository.

Step 3: The user creates a chaos experiment task through the RESTful API, specifying the configuration and the start time of the task at creation.

Step 4: When the task management module detects that the start time of the chaos experiment task has arrived, it triggers the pipeline of the continuous integration platform.

Step 5: The slave node of the continuous integration platform downloads the chaos experiment configuration, the fault injection tool, and the cloud platform test tool.

Step 6: The fault injection tool and the cloud platform test tool are installed in turn on the slave node of the continuous integration platform.

Step 7: The cloud platform test is run. During the test, the cloud platform test tool and the fault injection tool manage the injection of and recovery from faults.

Step 8: The test results are generated and the task status is returned.

FIG. 4 shows a chaos testing platform according to an embodiment of the present invention. Its deployment is as follows: the code hosting platform is deployed on server 1; the user permission management module, the task management module, and the database are deployed on server 2; the Jenkins master is deployed on server 3; and server 4 serves as the Jenkins slave. Servers 1 to 4 are all connected to the cloud platform management network, and server 4 is connected to the out-of-band management network of all management nodes and compute nodes of the cloud platform.

The cloud platform under test uses OpenStack and consists of three management nodes and two compute nodes. The management nodes run keystone, nova-api, nova-scheduler, nova-conductor, placement, cinder-api, cinder-scheduler, glance, neutron-server, neutron-dhcp-agent, and so on. The compute nodes run services such as nova-compute and cinder-volume.

The specific steps performed by the user include:

Step 1: The user submits the chaos experiment configuration through git to the code hosting platform on server 1.

Step 2: The configuration is merged after review.

Step 3: The user presents credentials to the user permission management module to request a token, then uses the token to send a request to the task management module to create a chaos experiment task.

Step 4: The Jenkins pipeline is triggered.

Step 5: The Jenkins slave starts downloading the configuration and the related software.

Step 6: The chaos experiment task is executed on the Jenkins slave node (server 4), on which the fault injection tool and the cloud platform test tool are installed.

Step 7: The cloud platform test tool executes the test. In this embodiment, the chaos experiment configuration specifies a test that creates cloud hosts, with 1000 total iterations and a concurrency of 20. The fault type is the nova-api service going down on one randomly chosen management node; the fault occurs at iteration 200 and is cleared at iteration 800. When the test tool has run the test case 200 times, the fault injection thread in the test tool detects that the specified count has been reached, logs in remotely over the cloud platform management network to management node 1 (chosen at random), and executes the command systemctl stop openstack-nova-api, whereupon the service stops. The service is restored when iteration 800 is reached. After iteration 200, the tool begins to adjust the concurrency adaptively according to the error rate statistics, steadily approaching the concurrency limit at which the system can still process requests normally.

Step 8: After the chaos experiment ends, the test results and the chaos experiment status are returned.
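The experiment in step 7 can be written down as a small configuration object. This is a sketch only: the class name, the field names, and the `test_case` label are hypothetical, and the recovery command `systemctl start openstack-nova-api` is an assumption, since the source gives only the stop command and states that the service is restored at iteration 800. The numeric values are those of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChaosExperimentConfig:
    test_case: str = "create_cloud_host"       # hypothetical label for the test
    total_iterations: int = 1000               # total iterations in the embodiment
    initial_concurrency: int = 20              # starting concurrency in the embodiment
    fault_service: str = "openstack-nova-api"  # service stopped on a management node
    inject_at_iteration: int = 200             # iteration-count mode: inject here
    recover_at_iteration: int = 800            # iteration-count mode: recover here


def fault_command(config: ChaosExperimentConfig, iteration: int) -> Optional[str]:
    """Return the command to run on the chosen management node, if any."""
    if iteration == config.inject_at_iteration:
        return f"systemctl stop {config.fault_service}"
    if iteration == config.recover_at_iteration:
        # ASSUMPTION: recovery is the matching start command; the source does not say.
        return f"systemctl start {config.fault_service}"
    return None
```

Keeping the injection and recovery points in the configuration, rather than in the test tool, matches the workflow in which the configuration is reviewed and merged before a task is created from it.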

Embodiment 2

This embodiment provides a cloud platform chaos testing device, which can be used to execute the testing method of Embodiment 1. As shown in FIG. 5, the device includes:

an acquisition module 501, configured to, after a fault is injected into the cloud platform under test, obtain the test result of the k-th round of concurrent testing of the cloud platform under test and count the actual error rate of the k-th round, where k = 1, 2, 3, ...;

an estimation module 502, configured to estimate the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round;

a judging module 503, configured to determine whether the predicted error rate of the (k+1)-th round of concurrent testing is greater than a preset threshold;

a first calculation module 504, configured to, when the predicted error rate of the (k+1)-th round of concurrent testing is greater than the preset threshold, reduce the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

a second calculation module 505, configured to, when the predicted error rate of the (k+1)-th round of concurrent testing is less than the preset threshold, increase the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

a determining module 506, configured to, when the predicted error rate of the (k+1)-th round of concurrent testing is equal to the preset threshold, determine the amount of concurrent requests of the k-th round as the critical point of the cloud platform under test under the fault.

For a detailed description of the device embodiment, refer to the method embodiment above; it is not repeated here.

Embodiment 3

In one embodiment of the present invention, a computer device is also provided, whose internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, and a network interface connected through a system bus, and may also include a display screen and an input device. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with external computer devices over a network connection. When executed by the processor, the computer program implements a data deduplication method for traffic playback or a test method for a business system. The computer device may also include a display screen and an input device: the display screen may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, or a button, trackball, or touchpad provided on the housing of the computer device.

On the other hand, the computer device may omit the display screen and the input device. Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, implement the following steps:

after a fault is injected into the cloud platform under test, obtaining the test result of the k-th round of concurrent testing of the cloud platform under test and counting the actual error rate of the k-th round, where k = 1, 2, 3, ...;

estimating the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round;

determining whether the predicted error rate of the (k+1)-th round of concurrent testing is greater than a preset threshold;

when the predicted error rate of the (k+1)-th round of concurrent testing is greater than the preset threshold, reducing the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

when the predicted error rate of the (k+1)-th round of concurrent testing is less than the preset threshold, increasing the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

when the predicted error rate of the (k+1)-th round of concurrent testing is equal to the preset threshold, determining the amount of concurrent requests of the k-th round as the critical point of the cloud platform under test under the fault.

In one embodiment, a readable storage medium is provided. The computer-readable storage medium stores computer instructions, and the computer instructions are used to cause the computer to execute:

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived from them remain within the scope of protection of the present invention.

Claims

1. A cloud platform chaos testing method, characterized by comprising the following steps:

after a fault is injected into the cloud platform under test, obtaining the test result of the k-th round of concurrent testing of the cloud platform under test and counting the actual error rate of the k-th round, where k = 1, 2, 3, ...;

estimating the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round;

determining whether the predicted error rate of the (k+1)-th round of concurrent testing is greater than a preset threshold;

when the predicted error rate of the (k+1)-th round of concurrent testing is greater than the preset threshold, reducing the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

when the predicted error rate of the (k+1)-th round of concurrent testing is less than the preset threshold, increasing the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

when the predicted error rate of the (k+1)-th round of concurrent testing is equal to the preset threshold, determining the amount of concurrent requests of the k-th round as the critical point of the cloud platform under test under the fault.

2. The cloud platform chaos testing method according to claim 1, characterized in that estimating the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round comprises:

obtaining the predicted error rate of the k-th round of concurrent testing;

calculating the predicted error rate of the (k+1)-th round of concurrent testing from preconfigured weights and the predicted error rate and actual error rate of the k-th round.

3. The cloud platform chaos testing method according to claim 2, characterized in that the predicted error rate of the (k+1)-th round of concurrent testing is calculated by the following formula:

$$e'_{k+1} = \alpha e_k + (1 - \alpha) e'_k$$

where $e_k$ denotes the actual error rate of the k-th round of concurrent testing, $e'_k$ denotes the predicted error rate of the k-th round, and $\alpha$ denotes the smoothing coefficient.

4. The cloud platform chaos testing method according to claim 1, characterized in that reducing the amount of concurrent requests relative to the k-th round of concurrent testing to obtain the amount of concurrent requests for the (k+1)-th round comprises:

using the predicted error rate of the (k+1)-th round of concurrent testing as an attenuation coefficient to calculate the amount of concurrent requests for the (k+1)-th round.

5. The cloud platform chaos testing method according to claim 4, characterized in that the amount of concurrent requests for the (k+1)-th round of concurrent testing is calculated by the following formula:

where $e'_{k+1}$ denotes the predicted error rate of the (k+1)-th round of concurrent testing, $C_k$ denotes the amount of concurrent requests of the k-th round, and $\lceil \cdot \rceil$ denotes the round-up (ceiling) operator.

6. The cloud platform chaos testing method according to claim 1, characterized in that increasing the amount of concurrent requests relative to the k-th round of concurrent testing to obtain the amount of concurrent requests for the (k+1)-th round comprises:

using a preset increase coefficient and the amount of concurrent requests of the k-th round of concurrent testing to determine the amount of concurrent requests to be added, and adding it to the amount of concurrent requests of the k-th round to obtain the amount of concurrent requests for the (k+1)-th round.

7. The cloud platform chaos testing method according to claim 6, characterized in that the amount of concurrent requests for the (k+1)-th round of concurrent testing is calculated by the following formula:

where $C_k$ denotes the amount of concurrent requests of the k-th round of concurrent testing, $\beta$ denotes the increase coefficient, and $\lfloor \cdot \rfloor$ denotes the round-down (floor) operator.

8. A cloud platform chaos testing device, characterized by comprising:

an acquisition module, configured to, after a fault is injected into the cloud platform under test, obtain the test result of the k-th round of concurrent testing of the cloud platform under test and count the actual error rate of the k-th round, where k = 1, 2, 3, ...;

an estimation module, configured to estimate the predicted error rate of the (k+1)-th round of concurrent testing from the actual error rate of the k-th round;

a judging module, configured to determine whether the predicted error rate of the (k+1)-th round of concurrent testing is greater than a preset threshold;

a first calculation module, configured to, when the predicted error rate of the (k+1)-th round of concurrent testing is greater than the preset threshold, reduce the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

a second calculation module, configured to, when the predicted error rate of the (k+1)-th round of concurrent testing is less than the preset threshold, increase the amount of concurrent requests relative to the k-th round to obtain the amount of concurrent requests for the (k+1)-th round;

a determining module, configured to, when the predicted error rate of the (k+1)-th round of concurrent testing is equal to the preset threshold, determine the amount of concurrent requests of the k-th round as the critical point of the cloud platform under test under the fault.

9. A computer device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to perform the cloud platform chaos testing method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to perform the cloud platform chaos testing method according to any one of claims 1-7.