CN109800160A - Cluster server fault testing method and relevant apparatus in machine learning system - Google Patents

Publication number
CN109800160A
Authority
CN
China
Prior art keywords
server
fault test
software fault
failure
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811620118.9A
Other languages
Chinese (zh)
Other versions
CN109800160B (en)
Inventor
郑海刚
吕旭涛
王孝宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Application filed by Shenzhen Intellifusion Technologies Co Ltd
Priority to CN201811620118.9A
Publication of CN109800160A
Application granted
Publication of CN109800160B
Legal status: Active

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

This application discloses a cluster server fault testing method and related apparatus in a machine learning system. The method comprises: receiving a fault test task sent by a failure generation server, wherein the fault test task carries a software fault test script; issuing, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer; receiving M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses; and verifying the M pieces of software fault test data to obtain M software fault test results. Implementing the embodiments of the present invention helps broaden the application scenarios.

Description

Cluster server fault testing method and relevant apparatus in machine learning system
Technical field
The present invention relates to the field of computer technology, and in particular to a cluster server fault testing method and related apparatus in a machine learning system.
Background
After decades of development, machine learning has finally come into wide use thanks to advances in storage capacity and computing power. Model training in machine learning requires computation over large amounts of data to obtain a suitable model. Although the computing power of a GPU is several orders of magnitude higher than that of a CPU, it still falls short of the computational demands of machine learning. Therefore, Docker services are often deployed on a cluster server to carry out multi-machine, multi-GPU training and thereby meet those demands. Docker is an open-source container engine and a lightweight virtualization technology; it incurs little performance loss and is easy to package, so it is used more and more widely in machine learning.
In general, training a machine learning model takes anywhere from a few hours to several weeks. If the cluster server in the machine learning system fails during a training run, the run must be restarted and all earlier computation is wasted. A good machine learning system therefore needs a process for handling faults and recovering from them.
Therefore, before a machine learning system is put into actual use, fault test scripts need to be issued to the cluster server in the machine learning system to carry out fault testing, so as to check the cluster server's fault-handling capability and uncover its latent defects.
However, existing test methods can only issue a hardware fault test script to an individual server for hardware fault testing, so their application scenario is limited.
Summary of the invention
Embodiments of the present invention provide a cluster server fault testing method and related apparatus in a machine learning system; implementing the embodiments helps broaden the application scenarios.
A first aspect of the present invention provides a cluster server fault testing method in a machine learning system, comprising:
A failure execution server receives a fault test task sent by a failure generation server, wherein the fault test task carries a software fault test script;
The failure execution server issues, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer;
The failure execution server receives M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses;
The failure execution server verifies the M pieces of software fault test data to obtain M software fault test results.
Based on the first aspect, in a possible embodiment of the present invention, the method further includes:
On receiving a fault retest request, the failure execution server sends a fault retest message to the database server, wherein the fault retest message carries a fault retest identifier and instructs the database server to look up the historical fault test task that matches the fault retest identifier;
The failure execution server receives the historical fault test task sent by the database server, wherein the historical fault test task carries a historical fault test script;
The failure execution server issues, to the M servers, a historical test request carrying the historical fault test script;
The failure execution server receives M historical test responses sent by the M servers, wherein the M historical test responses carry M pieces of historical fault test data obtained by the M servers running the historical fault test script, and the M servers correspond one-to-one with the M historical test responses;
The failure execution server verifies the M pieces of historical fault test data to obtain M historical fault test results.
A second aspect of the present invention provides a server, comprising:
A first receiving module, configured to receive a fault test task sent by a failure generation server, wherein the fault test task carries a software fault test script;
An issuing module, configured to issue, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer;
A second receiving module, configured to receive M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses;
A verification module, configured to verify the M pieces of software fault test data to obtain M software fault test results.
A third aspect of the present invention provides a computer-readable storage medium for storing a computer program, wherein the stored computer program is executed by a processor to implement any one of the cluster server fault testing methods in a machine learning system described above.
It can be seen that, in the above technical solution, when the failure execution server receives a fault test task sent by the failure generation server that carries a software fault test script, it issues a test request carrying the software fault test script to M servers in the cluster server. After the M servers run the software fault test script and obtain M pieces of software fault test data, the failure execution server receives the M test responses, carrying those M pieces of data, sent by the M servers. The failure execution server then verifies the M pieces of software fault test data to obtain M software fault test results. This achieves software fault testing of the cluster server in the machine learning system and uncovers both the latent defects of the cluster server and the abnormal conditions that arise at run time, while also broadening the application scenarios and meeting the needs of a cluster.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Wherein:
Fig. 1-a is a schematic flowchart of a cluster server fault testing method in a machine learning system provided by an embodiment of the present invention;
Fig. 1-b is a schematic architecture diagram of a machine learning system provided by another embodiment of the present invention;
Fig. 2-a is a schematic flowchart of a cluster server fault testing method in a machine learning system provided by another embodiment of the present invention;
Fig. 2-b is a schematic diagram, provided by an embodiment of the present invention, of a failure execution server calling the user interface to issue the test request to the N containers deployed on the i-th server in descending order of software fault test priority;
Fig. 2-c is a schematic diagram, provided by an embodiment of the present invention, of a failure execution server remotely logging in to the M servers over the Secure Shell protocol in the order of the M numbers;
Fig. 3 is a schematic diagram of a server provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Each is described in detail below.
The terms "include" and "have" and any variants of them in the specification, claims, and drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
First, refer to Fig. 1-a and Fig. 1-b. Fig. 1-a is a schematic flowchart of a cluster server fault testing method in a machine learning system provided by an embodiment of the present invention. The method shown in Fig. 1-a may be implemented in a system having the architecture shown in Fig. 1-b. As shown in Fig. 1-a, the cluster server fault testing method in a machine learning system provided by an embodiment of the present invention may include:
101. The failure execution server receives a fault test task sent by the failure generation server.
The fault test task carries a software fault test script.
A software fault test script is mainly a script that causes an application service process related to machine learning to behave abnormally. Examples include: a script that restarts the parameter server process, or restarts or terminates a worker process, in distributed training; a script that makes a server's daemon (the Docker daemon) abnormal; and a script that makes a pod (a group of several associated containers), a job (a timed task), or a deployment (a stateless application) in the container orchestration engine Kubernetes (K8s) abnormal.
Kubernetes is a container orchestration engine open-sourced by Google; it supports automated deployment, large-scale scaling, and management of containerized applications.
102. The failure execution server issues, to M servers in the cluster server, a test request carrying the software fault test script.
M is a positive integer.
The cluster server may include, for example, servers in different distributed systems. Multiple containers (Docker) are deployed on each server in the cluster server so that different application service processes can run in containers.
M may be, for example, 1, 2, 3, 5, 6, 11, 13, 20, or another value.
103. The failure execution server receives M test responses sent by the M servers.
The M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses.
104. The failure execution server verifies the M pieces of software fault test data to obtain M software fault test results.
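Steps 101 to 104 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the FaultTestTask fields, the send_request and verify callables, and the server names are hypothetical and do not come from the patent, and the transport is replaced by a stand-in function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FaultTestTask:
    """Task received in step 101 (field names are assumptions)."""
    task_id: str
    software_fault_script: str


def run_fault_test(
    task: FaultTestTask,
    servers: List[str],
    send_request: Callable[[str, str], str],
    verify: Callable[[str], bool],
) -> Dict[str, bool]:
    """Issue the script to M servers, collect M responses, verify each one."""
    results: Dict[str, bool] = {}
    for server in servers:
        # Steps 102 and 103: one test request and one test response per server.
        data = send_request(server, task.software_fault_script)
        # Step 104: one test result per piece of test data.
        results[server] = verify(data)
    return results


def fake_send(server: str, script: str) -> str:
    # Stand-in transport (hypothetical): pretends node-2 did not recover.
    return "hung" if server == "node-2" else "recovered"


task = FaultTestTask("t-1", "restart_worker.sh")
outcome = run_fault_test(
    task, ["node-1", "node-2", "node-3"], fake_send, lambda d: d == "recovered"
)
```

Keeping the results keyed by server preserves the one-to-one correspondence between servers, responses, and results that the method requires.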
Refer to Fig. 2-a, which is a schematic flowchart of a cluster server fault testing method in a machine learning system provided by another embodiment of the present invention. As shown in Fig. 2-a, the cluster server fault testing method in a machine learning system provided by another embodiment of the present invention may include:
201. The failure execution server receives a fault test task sent by the failure generation server.
The fault test task carries a software fault test script.
A software fault test script is mainly a script that causes an application service process related to machine learning to behave abnormally. Examples include: a script that restarts the parameter server process, or restarts or terminates a worker process, in distributed training; a script that makes a server's daemon (the Docker daemon) abnormal; and a script that makes a pod (a group of several associated containers), a job (a timed task), or a deployment (a stateless application) in the container orchestration engine Kubernetes (K8s) abnormal.
Kubernetes is a container orchestration engine open-sourced by Google; it supports automated deployment, large-scale scaling, and management of containerized applications.
202. The failure execution server issues, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer.
The cluster server may include, for example, servers in different distributed systems. Multiple containers (Docker) are deployed on each server in the cluster server so that different application service processes can run in containers.
M may be, for example, 1, 2, 3, 5, 6, 11, 13, 20, or another value.
Optionally, based on the first aspect, in a possible embodiment of the present invention, the failure execution server issuing, to the M servers in the cluster server, the test request carrying the software fault test script includes:
The failure execution server numbers the M servers in ascending order of the time the test request takes to be transmitted to each of them, to obtain M numbers;
The failure execution server issues the test request to the M servers in the order of the M numbers.
It can be seen that, in the above technical solution, the failure execution server numbers the M servers in ascending order of test request transmission time and issues the test request to the M servers in the cluster server in numbered order, which improves the efficiency with which the cluster server as a whole receives the test request.
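The numbering step can be sketched in Python as follows. This is a minimal sketch assuming the transmission time to each server has already been measured in seconds; the function name, the server names, and the timings are hypothetical.

```python
from typing import Dict, List, Tuple


def number_servers(transfer_seconds: Dict[str, float]) -> List[Tuple[int, str]]:
    """Number servers 1..M in ascending order of request transmission time."""
    # Ties are broken by server name so the numbering is deterministic.
    ordered = sorted(transfer_seconds, key=lambda name: (transfer_seconds[name], name))
    return [(number, name) for number, name in enumerate(ordered, start=1)]


# Hypothetical measurements, in seconds.
measured = {"node-a": 0.042, "node-b": 0.013, "node-c": 0.027}
numbering = number_servers(measured)
```

The test request would then be issued by walking the returned list in order, so the server that can be reached fastest is served first.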
Optionally, based on the first aspect, in a first possible embodiment of the present invention, the failure execution server issuing the test request to the M servers in the order of the M numbers includes:
The failure execution server determines whether the software fault test script matches a preset expression;
If the software fault test script matches the preset expression, the failure execution server determines whether it has permission to call a user interface, wherein the user interface belongs to a container management node, the container management node is any one of the M servers, and the container management node is used to manage the N containers deployed on each of the M servers, N being a positive integer;
If it has permission to call the user interface, the failure execution server calls the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers.
The preset expression may include, for example, a regular expression.
A regular expression is a logical formula for operating on strings: predefined specific characters and combinations of them form a "pattern string", which expresses a filtering rule applied to strings.
The user interface includes an API (Application Programming Interface). An API is a set of predefined functions intended to give applications and developers access to a group of routines of some software or hardware without having to access the source code or understand the details of the internal working mechanism.
N may be, for example, 1, 2, 3, 5, 6, 11, 13, 20, or another value.
It can be seen that, in the above technical solution, the failure execution server first determines whether the software fault test script matches the preset expression. If it matches, the failure execution server determines whether it has permission to call the user interface. If it has that permission, the failure execution server calls the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers. Issuing the test request through the user interface in this way ensures that only a failure execution server holding the permission can call the user interface, which prevents other scripts that are not intended for fault testing from being issued to the M servers.
Optionally, based on the first aspect or the first possible embodiment of the first aspect, in a second possible embodiment, the method further includes:
If it does not have permission to call the user interface, the failure execution server sends a user interface call permission acquisition request to the container management node, wherein the request carries identity authentication information of the failure execution server and instructs the container management node to authenticate the identity authentication information; when authentication passes, the container management node grants the failure execution server permission to call the user interface;
The failure execution server receives a user interface call permission acquisition response sent by the container management node, thereby obtaining permission to call the user interface.
It can be seen that, in the above technical solution, when the failure execution server does not have permission to call the user interface, it must request that permission from the container management node. Only when the container management node has authenticated the identity authentication information of the failure execution server, and authentication has passed, does it grant the failure execution server permission to call the user interface. This prevents a third-party server without that permission from issuing test requests or performing other unauthorized operations, and so secures the interaction between the failure execution server and the cluster server.
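The permission exchange with the container management node might look like the following Python sketch. The patent does not say what form the identity authentication information takes; a pre-shared credential string is assumed here purely for illustration, and the class and method names are hypothetical.

```python
import hmac
from typing import Dict, Set


class ContainerManagementNode:
    """Toy model of the node that owns the user interface (API)."""

    def __init__(self, known_credentials: Dict[str, str]) -> None:
        self._known = known_credentials   # server id -> expected credential (assumed scheme)
        self._allowed: Set[str] = set()   # servers granted permission to call the API

    def handle_permission_request(self, server_id: str, credential: str) -> bool:
        """Authenticate the requester; grant API permission only on success."""
        expected = self._known.get(server_id)
        # compare_digest performs a constant-time comparison of the two strings.
        if expected is not None and hmac.compare_digest(expected, credential):
            self._allowed.add(server_id)
            return True
        return False

    def may_call_api(self, server_id: str) -> bool:
        return server_id in self._allowed


node = ContainerManagementNode({"executor-1": "s3cret"})
granted = node.handle_permission_request("executor-1", "s3cret")
denied = node.handle_permission_request("intruder", "guess")
```

A third-party server that fails authentication never enters the allowed set, which corresponds to the safeguard the paragraph above describes.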
Optionally, refer to Fig. 2-b, which is a schematic diagram, provided by an embodiment of the present invention, of a failure execution server calling the user interface to issue the test request to the N containers deployed on the i-th server in descending order of software fault test priority. Based on the first aspect or the second possible embodiment of the first aspect, in a third possible embodiment, the step in which, if it has permission to call the user interface, the failure execution server calls the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers includes:
If it has permission to call the user interface, the failure execution server calls the user interface, in the order of the M numbers, to obtain M configuration files of the M servers, wherein the i-th configuration file among the M configuration files includes the software fault test priorities of the N containers deployed on the i-th server, the i-th server belongs to the M servers, and i is a positive integer with 0 < i ≤ M;
The failure execution server calls the user interface, in descending order of software fault test priority, to issue the test request to the N containers deployed on each of the M servers.
i may be, for example, 1, 2, 3, 5, 6, 11, 13, 20, or another value.
It can be seen that, in the above technical solution, the failure execution server calls the user interface, in the order of the M numbers, to obtain the M configuration files of the M servers, and reads from the configuration files the software fault test priorities of the N containers deployed on each of the M servers. It then calls the user interface, in descending order of software fault test priority, to issue the test request to the N containers deployed on each of the M servers. This avoids overloading the cluster server and improves the test rate, while also achieving fault testing of the different containers and uncovering the latent defects of each container and the abnormal conditions that arise at run time.
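The priority-ordered dispatch can be sketched in Python as below. The configuration file layout is not specified in the patent; here each configuration file is assumed to reduce to a mapping from container name to an integer priority, with a larger integer meaning a higher priority. All names and values are hypothetical.

```python
from typing import Dict, List


def dispatch_order(priorities: Dict[str, int]) -> List[str]:
    """Return container names in descending order of fault test priority."""
    # Ties are broken by container name so the order is deterministic.
    return sorted(priorities, key=lambda name: (-priorities[name], name))


# One assumed configuration per server, already listed in server-number order.
configs = {
    "server-1": {"trainer": 3, "logger": 1, "param-server": 5},
    "server-2": {"worker-a": 2, "worker-b": 2},
}
plan = {server: dispatch_order(cfg) for server, cfg in configs.items()}
```

The failure execution server would then walk the servers in numbered order and, within each server, issue the test request to the containers in the listed order.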
Optionally, refer to Fig. 2-c, which is a schematic diagram, provided by an embodiment of the present invention, of a failure execution server remotely logging in to the M servers over the Secure Shell (SSH) protocol in the order of the M numbers. Based on the first aspect or the first, second, or third possible embodiment of the first aspect, in a fourth possible embodiment, the method further includes:
If the software fault test script does not match the preset expression, the failure execution server determines whether it has permission to remotely log in to the M servers over the Secure Shell protocol;
If it has that permission, the failure execution server remotely logs in to the M servers over the Secure Shell protocol in the order of the M numbers, so as to issue the test request to the Q role services running on each of the M servers, wherein Q is a positive integer.
The Secure Shell protocol is a protocol designed to provide security for remote login sessions and other network services.
The role services may include, for example, the parameter server process and the worker process in distributed training, and the daemon of a server (the Docker daemon).
Q may be, for example, 1, 2, 3, 5, 6, 11, 13, 20, or another value.
It can be seen that, in the above technical solution, when the software fault test script does not match the preset expression, the failure execution server determines whether it has permission to remotely log in to the M servers over the Secure Shell protocol. If it has that permission, it remotely logs in to the M servers over the Secure Shell protocol in the order of the M numbers and issues the test request to the Q role services running on each of the M servers. Logging in to the servers in numbered order avoids overloading the failure execution server and improves the efficiency of issuing the test request, while also achieving fault testing of the different role services and uncovering the latent defects of each role service and the abnormal conditions that arise at run time.
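The choice between the two delivery channels described so far (the user interface of the container management node when the script matches the preset expression, SSH remote login when it does not) can be summarised in a short Python sketch. The regular expression below is a made-up example of a preset expression, and the returned labels are hypothetical; the patent specifies neither.

```python
import re

# Hypothetical preset expression: scripts that target Kubernetes objects.
PRESET_EXPRESSION = re.compile(r"^(pod|job|deployment)_\w+\.sh$")


def choose_channel(script_name: str, has_api_permission: bool, has_ssh_permission: bool) -> str:
    """Decide how the test request for a script would be delivered."""
    if PRESET_EXPRESSION.match(script_name):
        # Matched: deliver to containers through the management node's API.
        return "api" if has_api_permission else "request_api_permission"
    # Not matched: deliver to role services over an SSH remote login.
    return "ssh" if has_ssh_permission else "request_ssh_permission"


first = choose_channel("pod_kill.sh", True, False)
second = choose_channel("restart_worker.sh", False, True)
```

In either branch a missing permission leads to a permission acquisition request rather than a direct dispatch, mirroring the optional embodiments of the method.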
Optionally, based on the first aspect or the first, second, third, or fourth possible embodiment of the first aspect, in a fifth possible embodiment, the method further includes:
If it does not have permission to remotely log in to the M servers over the Secure Shell protocol, the failure execution server sends M remote login permission acquisition requests to the M servers in the order of the M numbers, wherein each of the M remote login permission acquisition requests carries the identity authentication information, and the i-th of the M remote login permission acquisition requests instructs the i-th server to authenticate the identity authentication information; when authentication passes, the i-th server grants the failure execution server permission to remotely log in to the i-th server over the Secure Shell protocol; the M servers correspond one-to-one with the M remote login permission acquisition requests;
The failure execution server receives M remote login permission acquisition responses sent by the M servers, thereby obtaining permission to remotely log in to the M servers over the Secure Shell protocol, wherein the M servers correspond one-to-one with the M remote login permission acquisition responses.
It can be seen that, when the failure execution server does not have permission to remotely log in to the M servers over the Secure Shell protocol, it must send M remote login permission acquisition requests to the M servers to obtain that permission. This prevents a third-party server without that permission from issuing test requests or performing other unauthorized operations, and so secures the interaction between the failure execution server and the cluster server.
Optionally, based on the first aspect or the first, second, third, fourth, or fifth possible embodiment of the first aspect, in a sixth possible embodiment, the fault test task also carries a hardware fault test script, and the method further includes:
The failure execution server determines that it has permission to remotely log in to the M servers over the Secure Shell protocol, remotely logs in to the M servers over the Secure Shell protocol in the order of the M numbers, and then issues the hardware fault test script to the K modules in each of the M servers, wherein K is a positive integer.
K may be, for example, 1, 2, 3, 5, 6, 11, 13, 20, or another value.
The hardware fault test script may include: a script for abnormal server shutdown, a script for disk read/write failure, a script for hard disk plugging and unplugging, a script for network card failure, a script for network jitter or packet loss, a script for high memory usage, and a script for excessive CPU or GPU load.
The modules may include, for example: CPU, GPU, memory, hard disk, network card, and power supply.
It can be seen that, in the above technical solution, the failure execution server determines whether it has permission to remotely log in to the M servers over the Secure Shell protocol. If it has that permission, it remotely logs in to the M servers over the Secure Shell protocol in the order of the M numbers and issues the test request to the K modules in each of the M servers. Logging in to the servers in numbered order avoids overloading the failure execution server and improves the efficiency of issuing the test request, while also achieving fault testing of the different modules and uncovering the latent defects of each module and the abnormal conditions that arise at run time.
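As an illustration of how hardware fault scripts could be planned per module, the Python sketch below pairs the script types listed above with the modules listed above. The pairing itself and the script identifiers are assumptions made for this example; the patent lists the scripts and the modules but does not map one to the other.

```python
from typing import Dict, List, Tuple

# Assumed pairing of module -> hardware fault scripts (identifiers are hypothetical).
HARDWARE_FAULTS: Dict[str, List[str]] = {
    "power": ["abnormal_shutdown"],
    "hard_disk": ["disk_read_write_failure", "disk_plug_unplug"],
    "network_card": ["nic_failure", "network_jitter_or_packet_loss"],
    "memory": ["high_memory_usage"],
    "cpu": ["cpu_overload"],
    "gpu": ["gpu_overload"],
}


def plan_hardware_tests(numbered_servers: List[str], modules: List[str]) -> List[Tuple[str, str, str]]:
    """List (server, module, script) triples in server-number order."""
    return [
        (server, module, script)
        for server in numbered_servers
        for module in modules
        for script in HARDWARE_FAULTS.get(module, [])
    ]


hw_plan = plan_hardware_tests(["node-b", "node-a"], ["hard_disk", "gpu"])
```

Each triple stands for one hardware fault script that would be issued to one module after the SSH login to that server succeeds.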
203. The failure execution server receives M test responses sent by the M servers.
The M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses.
204. The failure execution server verifies the M pieces of software fault test data to obtain M software fault test results.
Optionally, based on the first aspect, in a possible embodiment of the present invention, the failure execution server verifying the M pieces of software fault test data to obtain M software fault test results includes:
The failure execution server generates M software fault test data identifiers that match the M pieces of software fault test data;
The failure execution server sends a preset software fault test data request to a database server, wherein the request carries the M software fault test data identifiers and instructs the database server to look up the M pieces of preset software fault test data that match the M identifiers;
The failure execution server receives the preset software fault test data response that the database server sends in reply to the request, wherein the response carries the M pieces of preset software fault test data;
The failure execution server checks the M pieces of software fault test data against the M pieces of preset software fault test data to obtain M software fault test results.
A database server consists of one or more computers running in a local area network together with database management system software; the database server provides data services for client applications.
It can be seen that, in the above technical solution, the failure execution server checks the M pieces of software fault test data against the M pieces of preset software fault test data obtained from the database server to obtain M software fault test results, and from the check results identifies the latent defects of the cluster server exposed during testing and the abnormal conditions that arise at run time.
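The verification against preset data can be sketched in Python as follows. The identifier scheme (task id plus server name) and the use of a plain dictionary in place of the database server are assumptions for illustration only; the patent does not describe how identifiers are built or how the data are compared.

```python
from typing import Dict


def verify_against_presets(
    task_id: str, actual: Dict[str, str], presets: Dict[str, str]
) -> Dict[str, str]:
    """Compare each server's test data with the preset data for its identifier."""
    results: Dict[str, str] = {}
    for server, data in actual.items():
        data_id = f"{task_id}:{server}"      # hypothetical identifier scheme
        expected = presets.get(data_id)      # stands in for the database lookup
        if expected is None:
            results[server] = "no_preset"
        elif data == expected:
            results[server] = "pass"
        else:
            results[server] = "fail"
    return results


preset_store = {"t-1:node-1": "worker restarted", "t-1:node-2": "worker restarted"}
observed = {"node-1": "worker restarted", "node-2": "worker hung", "node-3": "worker restarted"}
checked = verify_against_presets("t-1", observed, preset_store)
```

One result is produced per server, so the M results stay aligned with the M pieces of test data.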
Optionally, based on the first aspect, in a first possible embodiment of the invention, after the fault execution server verifies the M pieces of software fault test data to obtain the M software fault test results, the method further comprises:
The fault execution server sends M query messages to the M servers, wherein the i-th query message of the M query messages instructs the i-th server to detect, after the fault test, the operating status of the N containers deployed on the i-th server, the Q role services running in the i-th server, and the K modules of the i-th server, and the M servers correspond one-to-one with the M query messages;
The fault execution server receives M query-end messages sent by the M servers, wherein the M query-end messages carry M pieces of operating status detection data, and the M servers correspond one-to-one with the M query-end messages;
The fault execution server analyzes the M pieces of operating status detection data to determine whether the M servers are operating normally;
If the M servers are operating abnormally, the fault execution server sends an operating status log view request to the database server, wherein the operating status log view request carries an operating status log identifier and instructs the database server to query the operating status log matching the operating status log identifier;
The fault execution server receives an operating status log view response sent by the database server in reply to the operating status log view request, wherein the operating status log view response carries the operating status log;
The fault execution server analyzes the operating status log to determine the reason each of the M servers is operating abnormally, and then generates an abnormal operation report.
As can be seen, in the above technical solution, the fault execution server first sends M query messages to the M servers. On receiving its query message, the i-th server of the M servers detects, after the fault test, the operating status of the N containers deployed on the i-th server, the Q role services running in the i-th server, and the K modules of the i-th server. After detection, the M servers send M query-end messages to the fault execution server. The fault execution server then analyzes the M pieces of operating status detection data carried by the M query-end messages to determine whether the M servers are operating normally. If the M servers are operating abnormally, the fault execution server analyzes the operating status log to determine the reason each of the M servers is operating abnormally, and then generates an abnormal operation report. This realizes a health check of the cluster server after testing. At the same time, identifying from the operating status log why each of the M servers is operating abnormally helps to optimize the cluster server in a targeted way afterwards and improves the stability of the machine learning system.
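The post-test health check can be sketched as follows. This is an illustrative assumption rather than the patented implementation: the `StatusReport` type, the boolean health flags, the component names, and the shape of the log lookup are hypothetical, and a dictionary stands in for the operating status log returned by the database server.

```python
from dataclasses import dataclass, field


@dataclass
class StatusReport:
    """Operating status detection data carried by one query-end message (hypothetical shape)."""
    server: str
    containers: dict[str, bool] = field(default_factory=dict)     # the N containers
    role_services: dict[str, bool] = field(default_factory=dict)  # the Q role services
    modules: dict[str, bool] = field(default_factory=dict)        # the K modules

    def unhealthy(self) -> list[str]:
        groups = (
            ("container", self.containers),
            ("role", self.role_services),
            ("module", self.modules),
        )
        return [f"{kind}:{name}" for kind, items in groups
                for name, ok in items.items() if not ok]


def build_abnormal_report(reports: list[StatusReport],
                          logs: dict[str, list[str]]) -> dict[str, dict]:
    """Collect, per abnormal server, its unhealthy components and matching log lines."""
    report = {}
    for r in reports:
        bad = r.unhealthy()
        if bad:  # only abnormal servers trigger a log lookup
            report[r.server] = {"unhealthy": bad, "log": logs.get(r.server, [])}
    return report


reports = [
    StatusReport("srv-1", {"c0": True}, {"ps": True}, {"gpu0": True}),
    StatusReport("srv-2", {"c0": False}, {"worker": True}, {"nic": False}),
]
status_logs = {"srv-2": ["c0 exited unexpectedly", "nic link down"]}
abnormal_report = build_abnormal_report(reports, status_logs)
```

Only `srv-2` appears in `abnormal_report`, since all components of `srv-1` are reported healthy.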
Optionally, based on the first aspect or the first possible embodiment of the first aspect, in a second possible embodiment of the invention, the method further comprises:
When a retest fault request is received, the fault execution server sends a retest fault message to the database server, wherein the retest fault message carries a retest fault identifier and instructs the database server to look up the historical fault test task matching the retest fault identifier;
The fault execution server receives, from the database server, the historical fault test task carrying a historical fault test script;
The fault execution server issues a history test request carrying the historical fault test script to the M servers;
The fault execution server receives M history test responses sent by the M servers, wherein the M history test responses carry M pieces of historical fault test data obtained by the M servers running the historical fault test script, and the M servers correspond one-to-one with the M history test responses;
The fault execution server verifies the M pieces of historical fault test data to obtain M historical fault test results.
As can be seen, in the above technical solution, when a retest fault request is received, the fault execution server sends a retest fault message to the database server to obtain the historical fault test task carrying the historical fault test script. Thus, when a software fault test result is in doubt, or when the previous software fault test result proved inaccurate, the history test request carrying the historical fault test script can be reissued to the cluster server. At the same time, unnecessary operations are avoided and overall testing time is saved.
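The retest flow can be sketched in a few lines of Python. The sketch is hedged: the retest identifier `retest-42`, the script name, the `run_script` callback, and the equality check against an expected value are all hypothetical, and a dictionary stands in for the database server's store of historical fault test tasks.

```python
from typing import Callable


def retest(retest_id: str,
           history: dict[str, dict],
           servers: list[str],
           run_script: Callable[[str, str], str],
           expected: str) -> dict[str, bool]:
    """Look up a historical task by identifier, reissue its script, and verify the responses."""
    task = history.get(retest_id)
    if task is None:
        raise KeyError(f"no historical fault test task for {retest_id!r}")
    script = task["script"]
    # One history test response per server (one-to-one correspondence).
    responses = {server: run_script(server, script) for server in servers}
    return {server: data == expected for server, data in responses.items()}


# Stand-in for historical fault test tasks held by the database server.
history_tasks = {"retest-42": {"script": "restart_parameter_server.sh"}}


def fake_run(server: str, script: str) -> str:
    """Hypothetical runner: pretends srv-2 fails to recover."""
    return "timeout" if server == "srv-2" else "recovered"


history_results = retest("retest-42", history_tasks, ["srv-1", "srv-2"], fake_run, "recovered")
```

Reusing the stored script avoids regenerating the fault test task, which is the time saving the paragraph above describes.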
Referring to Fig. 3, Fig. 3 is a schematic diagram of a server provided by an embodiment of the present invention. As shown in Fig. 3, a server 300 provided by an embodiment of the present invention may include:
A first receiving module 301, configured to receive a fault test task sent by a fault generation server.
Wherein, the fault test task carries a software fault test script.
The software fault test script mainly refers to a script that makes an application service process related to machine learning abnormal. Examples include: a script that restarts the parameter server process in distributed training; a script that restarts or exits a task (worker) process; a script that makes the server's daemon (docker daemon) abnormal; and a script that makes a pod (a group of several associated containers), job (timed task), or deployment (stateless application) in the container orchestration engine (Kubernetes, K8s) abnormal.
Wherein, Kubernetes is a container orchestration engine open-sourced by Google; it supports automated deployment, large-scale scaling, and management of containerized applications.
An issuing module 302, configured to issue, to M servers in the cluster server, a test request carrying the software fault test script.
Wherein, M is a positive integer.
Wherein, the cluster server may include, for example, servers in different distributed systems. Multiple containers (Docker) are deployed on each server in the cluster server, so that different application service processes can run in containers.
Wherein, M may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20, or another value.
A second receiving module 303, configured to receive M test responses sent by the M servers.
Wherein, the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses.
A verification module 304, configured to verify the M pieces of software fault test data to obtain M software fault test results.
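The division of server 300 into modules 301 to 304 can be pictured with the following Python sketch. It is an assumption-laden illustration only: the class name, the task dictionary key, and the callables that stand in for cluster servers and for the verification rule are hypothetical, and modules 302 and 303 are folded into one method for brevity.

```python
from typing import Callable

Runner = Callable[[str], str]    # stand-in for a cluster server: script in, test data out
Checker = Callable[[str], bool]  # stand-in for the verification rule


class FaultExecutionServer:
    def __init__(self, cluster: dict[str, Runner], checker: Checker):
        self.cluster = cluster
        self.checker = checker

    def receive_task(self, task: dict) -> str:
        """First receiving module 301: accept the task and extract its script."""
        return task["software_fault_test_script"]

    def issue(self, script: str) -> dict[str, str]:
        """Issuing module 302 and second receiving module 303, combined here:
        send the script to each server and collect one response per server."""
        return {name: run(script) for name, run in self.cluster.items()}

    def verify(self, responses: dict[str, str]) -> dict[str, bool]:
        """Verification module 304: one test result per piece of test data."""
        return {name: self.checker(data) for name, data in responses.items()}

    def handle(self, task: dict) -> dict[str, bool]:
        return self.verify(self.issue(self.receive_task(task)))


cluster = {"srv-1": lambda script: "recovered", "srv-2": lambda script: "crashed"}
server = FaultExecutionServer(cluster, lambda data: data == "recovered")
outcome = server.handle({"software_fault_test_script": "kill_worker.sh"})
```

With M = 2 stand-in servers, `outcome` contains two test results, one per server.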
It should be noted that, for brevity, the foregoing method embodiments are described as a series of combined actions; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A cluster server fault testing method in a machine learning system, characterized by comprising:
A fault execution server receives a fault test task sent by a fault generation server, wherein the fault test task carries a software fault test script;
The fault execution server issues, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer;
The fault execution server receives M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses;
The fault execution server verifies the M pieces of software fault test data to obtain M software fault test results.
2. The method according to claim 1, characterized in that the fault execution server issuing, to the M servers in the cluster server, the test request carrying the software fault test script comprises:
The fault execution server numbers the M servers in ascending order of the time taken to transmit the test request, to obtain M numbers;
The fault execution server issues the test request to the M servers in the order of the M numbers.
3. The method according to claim 2, characterized in that the fault execution server issuing the test request to the M servers in the order of the M numbers comprises:
The fault execution server determines whether the software fault test script successfully matches a preset expression;
If the software fault test script successfully matches the preset expression, the fault execution server determines whether it has permission to call a user interface, wherein the user interface belongs to a container management node, the container management node is any one of the M servers, and the container management node is used to manage the N containers deployed on each of the M servers, N being a positive integer;
If it has permission to call the user interface, the fault execution server calls the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers.
4. The method according to claim 3, characterized in that, if it has permission to call the user interface, the fault execution server calling the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers comprises:
If it has permission to call the user interface, the fault execution server calls the user interface, in the order of the M numbers, to obtain M configuration files of the M servers, wherein the i-th configuration file of the M configuration files includes the software fault test priority order of the N containers deployed on the i-th server, the i-th server belongs to the M servers, and i is a positive integer with 0 < i ≤ M;
The fault execution server calls the user interface, in descending order of software fault test priority, to issue the test request to the N containers deployed on each of the M servers.
5. The method according to claim 3, characterized in that the method further comprises:
If the software fault test script does not successfully match the preset expression, the fault execution server determines whether it has permission to log in remotely to the M servers via the Secure Shell protocol;
If it has permission to log in remotely to the M servers via the Secure Shell protocol, the fault execution server logs in remotely to the M servers via the Secure Shell protocol in the order of the M numbers, so as to issue the test request to the Q role services running in each of the M servers, wherein Q is a positive integer.
6. The method according to claim 1 or 5, characterized in that the fault test task also carries a hardware fault test script, and the method further comprises:
The fault execution server determines that it has permission to log in remotely to the M servers via the Secure Shell protocol, logs in remotely to the M servers via the Secure Shell protocol in the order of the M numbers, and then issues the hardware fault test script to the K modules in each of the M servers, wherein K is a positive integer.
7. The method according to claim 1, characterized in that the fault execution server verifying the M pieces of software fault test data to obtain M software fault test results comprises:
The fault execution server generates M software fault test data identifiers matching the M pieces of software fault test data;
The fault execution server sends a preset software fault test data request to a database server, wherein the preset software fault test data request carries the M software fault test data identifiers and instructs the database server to query the M pieces of preset software fault test data matching the M software fault test data identifiers;
The fault execution server receives a preset software fault test data response sent by the database server in reply to the preset software fault test data request, wherein the preset software fault test data response carries the M pieces of preset software fault test data;
The fault execution server verifies the M pieces of software fault test data against the M pieces of preset software fault test data to obtain M software fault test results.
8. The method according to claim 7, characterized in that, after the fault execution server verifies the M pieces of software fault test data to obtain the M software fault test results, the method further comprises:
The fault execution server sends M query messages to the M servers, wherein the i-th query message of the M query messages instructs the i-th server to detect, after the fault test, the operating status of the N containers deployed on the i-th server, the Q role services running in the i-th server, and the K modules of the i-th server, and the M servers correspond one-to-one with the M query messages;
The fault execution server receives M query-end messages sent by the M servers, wherein the M query-end messages carry M pieces of operating status detection data, and the M servers correspond one-to-one with the M query-end messages;
The fault execution server analyzes the M pieces of operating status detection data to determine whether the M servers are operating normally;
If the M servers are operating abnormally, the fault execution server sends an operating status log view request to the database server, wherein the operating status log view request carries an operating status log identifier and instructs the database server to query the operating status log matching the operating status log identifier;
The fault execution server receives an operating status log view response sent by the database server in reply to the operating status log view request, wherein the operating status log view response carries the operating status log;
The fault execution server analyzes the operating status log to determine the reason each of the M servers is operating abnormally, and then generates an abnormal operation report.
9. A server, characterized by comprising:
A first receiving module, configured to receive a fault test task sent by a fault generation server, wherein the fault test task carries a software fault test script;
An issuing module, configured to issue, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer;
A second receiving module, configured to receive M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one with the M test responses;
A verification module, configured to verify the M pieces of software fault test data to obtain M software fault test results.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program, and the stored computer program is executed by a processor to implement the method according to claim 1.
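By way of illustration only, and without limiting the claims, the dispatch order of claim 2 and the path selection of claims 3 and 5 can be sketched in Python. The preset expression, the example script text, the transfer times, and the path labels `container-api` and `ssh` are hypothetical; the claims do not specify them.

```python
import re


def dispatch_plan(script: str,
                  preset_pattern: str,
                  transfer_seconds: dict[str, float],
                  can_call_api: bool,
                  can_ssh: bool) -> list[tuple[str, str]]:
    """Return (server, path) pairs in dispatch order, or an empty list if permission is lacking."""
    # Claim 2: number the servers in ascending order of request transmission time.
    order = sorted(transfer_seconds, key=transfer_seconds.get)
    # Claim 3: a script matching the preset expression goes through the
    # container management node's interface.
    if re.search(preset_pattern, script):
        return [(s, "container-api") for s in order] if can_call_api else []
    # Claim 5: otherwise fall back to remote login over the Secure Shell protocol.
    return [(s, "ssh") for s in order] if can_ssh else []


transfer = {"srv-a": 0.30, "srv-b": 0.05, "srv-c": 0.12}
pattern = r"\bkubectl\b"  # hypothetical preset expression
plan = dispatch_plan("kubectl delete pod trainer-0", pattern, transfer, True, True)
fallback = dispatch_plan("systemctl restart ps", pattern, transfer, True, True)
```

In this sketch `plan` routes every server through the container interface, starting with the server that has the shortest transmission time, while `fallback` uses the same order over the Secure Shell path.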
CN201811620118.9A 2018-12-27 2018-12-27 Cluster server fault testing method and related device in machine learning system Active CN109800160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811620118.9A CN109800160B (en) 2018-12-27 2018-12-27 Cluster server fault testing method and related device in machine learning system

Publications (2)

Publication Number Publication Date
CN109800160A true CN109800160A (en) 2019-05-24
CN109800160B CN109800160B (en) 2021-03-05

Family

ID=66557909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811620118.9A Active CN109800160B (en) 2018-12-27 2018-12-27 Cluster server fault testing method and related device in machine learning system

Country Status (1)

Country Link
CN (1) CN109800160B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030101385A1 (en) * 2001-11-28 2003-05-29 Inventec Corporation Cross-platform system-fault warning system and method
CN102354298A (en) * 2011-07-27 2012-02-15 哈尔滨工业大学 Software testing automation framework (STAF)-based fault injection automation testing platform and method for high-end fault-tolerant computer
CN105205003A (en) * 2015-10-28 2015-12-30 努比亚技术有限公司 Automated testing method and device based on clustering system
CN106897110A (en) * 2017-02-23 2017-06-27 郑州云海信息技术有限公司 A kind of container dispatching method and management node scheduler
CN107967837A (en) * 2017-05-31 2018-04-27 常州信息职业技术学院 A kind of training platform and its implementation based on container
CN108092850A (en) * 2017-12-12 2018-05-29 郑州云海信息技术有限公司 A kind of cluster server method for diagnosing faults and system based on heartbeat mechanism
CN108654089A (en) * 2018-05-09 2018-10-16 腾讯科技(深圳)有限公司 The test method and device of Mission Objective, electronic equipment, storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
维克托•法西克 等: "《微服务运维实战》", 30 June 2018, 华中科技大学出版社 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110618853B (en) * 2019-08-02 2022-04-22 东软集团股份有限公司 Detection method, device and equipment for zombie container
CN110618853A (en) * 2019-08-02 2019-12-27 东软集团股份有限公司 Detection method, device and equipment for zombie container
CN110852445A (en) * 2019-10-28 2020-02-28 广州文远知行科技有限公司 Distributed machine learning training method and device, computer equipment and storage medium
CN111641716A (en) * 2020-06-01 2020-09-08 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN111641716B (en) * 2020-06-01 2023-05-02 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
CN112217899A (en) * 2020-10-19 2021-01-12 政采云有限公司 Container troubleshooting system and method
CN112346979A (en) * 2020-11-11 2021-02-09 杭州飞致云信息科技有限公司 Software performance testing method, system and readable storage medium
CN112905445A (en) * 2020-12-09 2021-06-04 江苏苏宁云计算有限公司 Log-based test method and device and computer system
CN112783769A (en) * 2021-01-19 2021-05-11 深圳市莫廷影像技术有限公司 Self-defined automatic software testing method
CN113094266A (en) * 2021-04-06 2021-07-09 中国工商银行股份有限公司 Fault testing method, platform and equipment for container database
CN113094266B (en) * 2021-04-06 2024-06-14 中国工商银行股份有限公司 Fault testing method, platform and equipment for container database
CN115022328A (en) * 2022-06-24 2022-09-06 脸萌有限公司 Server cluster, server cluster testing method and device and electronic equipment
CN115022328B (en) * 2022-06-24 2023-08-08 脸萌有限公司 Server cluster, testing method and device of server cluster and electronic equipment

Also Published As

Publication number Publication date
CN109800160B (en) 2021-03-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant