CN109800160A - Cluster server fault testing method and relevant apparatus in machine learning system - Google Patents
- Publication number
- CN109800160A (application CN201811620118.9A)
- Authority
- CN
- China
- Prior art keywords
- server
- fault test
- software fault
- failure
- test
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Debugging And Monitoring (AREA)
Abstract
This application discloses a cluster server fault testing method and related apparatus in a machine learning system. The method comprises: receiving a fault test task sent by a fault generation server, wherein the fault test task carries a software fault test script; issuing, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer; receiving M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one to the M test responses; and verifying the M pieces of software fault test data to obtain M software fault test results. Implementing the embodiments of the present invention helps broaden the applicable scenarios.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a cluster server fault testing method and related apparatus in a machine learning system.
Background
After decades of development, machine learning has finally come into wide use, driven by advances in storage capacity and computing power. Model training in machine learning requires computation over large amounts of data to obtain a suitable model. Although the computing power of a GPU is several orders of magnitude greater than that of a CPU, it is still insufficient for the computing demands of machine learning. Therefore, Docker services are often deployed on a cluster server to perform multi-machine, multi-GPU training and so meet those demands. Docker is an open-source container engine and a lightweight virtualization technology; it incurs little performance loss and is easy to package, so it is increasingly widely used in machine learning.
In general, training a machine learning model takes anywhere from several hours to several weeks. If the cluster server in the machine learning system fails during a training run, training has to be restarted and all previous computation is wasted. A good machine learning system therefore needs a process for handling faults and recovering from them.
Therefore, before a machine learning system is put into actual use, fault test scripts need to be issued to the cluster server in the machine learning system to perform fault testing, in order to check the cluster server's fault handling capability and to find latent defects in the cluster server.
However, existing test methods can only issue hardware fault test scripts to a single server to perform hardware fault testing, so their application scenario is limited.
Summary of the invention
The embodiments of the present invention provide a cluster server fault testing method and related apparatus in a machine learning system. Implementing the embodiments of the present invention helps broaden the applicable scenarios.
A first aspect of the present invention provides a cluster server fault testing method in a machine learning system, comprising:
A fault execution server receives a fault test task sent by a fault generation server, wherein the fault test task carries a software fault test script;
The fault execution server issues, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer;
The fault execution server receives M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one to the M test responses;
The fault execution server verifies the M pieces of software fault test data to obtain M software fault test results.
Based on the first aspect, in one possible embodiment of the present invention, the method further comprises:
Upon receiving a fault retest request, the fault execution server sends a fault retest message to the database server, wherein the fault retest message carries a fault retest identifier and is used to instruct the database server to look up the historical fault test task that matches the fault retest identifier;
The fault execution server receives, from the database server, the historical fault test task, which carries a historical fault test script;
The fault execution server issues, to the M servers, a historical test request carrying the historical fault test script;
The fault execution server receives M historical test responses sent by the M servers, wherein the M historical test responses carry M pieces of historical fault test data obtained by the M servers running the historical fault test script, and the M servers correspond one-to-one to the M historical test responses;
The fault execution server verifies the M pieces of historical fault test data to obtain M historical fault test results.
A second aspect of the present invention provides a server, comprising:
A first receiving module, configured to receive a fault test task sent by a fault generation server, wherein the fault test task carries a software fault test script;
An issuing module, configured to issue, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer;
A second receiving module, configured to receive M test responses sent by the M servers, wherein the M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one to the M test responses;
A verification module, configured to verify the M pieces of software fault test data to obtain M software fault test results.
A third aspect of the present invention provides a computer-readable storage medium. The computer-readable storage medium is used to store a computer program, and the stored computer program is executed by a processor to implement any one of the cluster server fault testing methods in a machine learning system described above.
As can be seen, in the above technical solution, when the fault execution server receives from the fault generation server a fault test task carrying a software fault test script, it issues a test request carrying the software fault test script to M servers in the cluster server. After the M servers run the software fault test script and obtain M pieces of software fault test data, the fault execution server receives from the M servers the M test responses carrying that data, and then verifies the M pieces of software fault test data to obtain M software fault test results. This realizes software fault testing of the cluster server in a machine learning system and finds latent defects in the cluster server and abnormal conditions that occur at run time, while also broadening the applicable scenarios and meeting the needs of a cluster.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings in the following description show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Wherein:
Fig. 1-a is a schematic flowchart of the cluster server fault testing method in a machine learning system provided by an embodiment of the present invention;
Fig. 1-b is a schematic architecture diagram of a machine learning system provided by another embodiment of the present invention;
Fig. 2-a is a schematic flowchart of the cluster server fault testing method in a machine learning system provided by another embodiment of the present invention;
Fig. 2-b is a schematic diagram, provided by an embodiment of the present invention, of a fault execution server calling the user interface to issue the test request to the N containers deployed on the i-th server in descending order of software fault test priority;
Fig. 2-c is a schematic diagram, provided by an embodiment of the present invention, of a fault execution server remotely logging in to the M servers via the Secure Shell protocol in the order of the M numbers;
Fig. 3 is a schematic diagram of a server provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Each is described in detail below.
The terms "comprising" and "having" in the description, claims and drawings of the present invention, and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product or device.
First, refer to Fig. 1-a and Fig. 1-b. Fig. 1-a is a schematic flowchart of the cluster server fault testing method in a machine learning system provided by an embodiment of the present invention. The scheme shown in Fig. 1-a can be implemented in a system with the architecture shown in Fig. 1-b. As shown in Fig. 1-a, the cluster server fault testing method in a machine learning system provided by an embodiment of the present invention may include:
101. The fault execution server receives a fault test task sent by the fault generation server.
The fault test task carries a software fault test script.
A software fault test script mainly refers to a script that makes an application service process related to machine learning behave abnormally. Examples include: a script that restarts the parameter server process, or restarts or exits a task (worker) process, in distributed training; a script that makes a server's daemon process (docker daemon) abnormal; and a script that makes a pod (a group of associated containers), a job (a scheduled task) or a deployment (a stateless application) in the container orchestration engine (Kubernetes, K8s) abnormal.
Kubernetes is a container orchestration engine open-sourced by Google. It supports automated deployment, large-scale scaling and management of containerized applications.
102. The fault execution server issues, to M servers in the cluster server, a test request carrying the software fault test script.
M is a positive integer.
The cluster server may include, for example, servers in different distributed systems. Multiple containers (Docker) are deployed on each server in the cluster server, so that different application service processes run in containers.
M may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20 or another value.
103. The fault execution server receives M test responses sent by the M servers.
The M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one to the M test responses.
104. The fault execution server verifies the M pieces of software fault test data to obtain M software fault test results.
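Steps 101 to 104 can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the names `FaultTestTask`, `run_fault_test`, `send_request` and `verify` are hypothetical, and the transport and the verification rule are passed in as callables because the description does not fix them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FaultTestTask:
    """Hypothetical container for the task received in step 101."""
    script: str  # the software fault test script carried by the task


def run_fault_test(
    task: FaultTestTask,
    servers: List[str],
    send_request: Callable[[str, str], str],
    verify: Callable[[str], str],
) -> Dict[str, str]:
    """Issue the script to M servers, collect M responses, verify each one."""
    # Steps 102 and 103: one test request and one test response per server.
    responses = {server: send_request(server, task.script) for server in servers}
    # Step 104: one software fault test result per piece of test data.
    return {server: verify(data) for server, data in responses.items()}


def fake_send(server: str, script: str) -> str:
    # Stand-in for the real network call; the response format is made up.
    return f"{server} ran {script}: exit=0"


def fake_verify(data: str) -> str:
    return "pass" if data.endswith("exit=0") else "fail"


results = run_fault_test(
    FaultTestTask("restart_worker.sh"), ["s1", "s2"], fake_send, fake_verify
)
```

The one-to-one correspondence between servers and responses is represented here by keying both dictionaries on the server name.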
Refer to Fig. 2-a. Fig. 2-a is a schematic flowchart of the cluster server fault testing method in a machine learning system provided by another embodiment of the present invention. As shown in Fig. 2-a, the cluster server fault testing method in a machine learning system provided by another embodiment of the present invention may include:
201. The fault execution server receives a fault test task sent by the fault generation server.
The fault test task carries a software fault test script.
A software fault test script mainly refers to a script that makes an application service process related to machine learning behave abnormally. Examples include: a script that restarts the parameter server process, or restarts or exits a task (worker) process, in distributed training; a script that makes a server's daemon process (docker daemon) abnormal; and a script that makes a pod (a group of associated containers), a job (a scheduled task) or a deployment (a stateless application) in the container orchestration engine (Kubernetes, K8s) abnormal.
Kubernetes is a container orchestration engine open-sourced by Google. It supports automated deployment, large-scale scaling and management of containerized applications.
202. The fault execution server issues, to M servers in the cluster server, a test request carrying the software fault test script, wherein M is a positive integer.
The cluster server may include, for example, servers in different distributed systems. Multiple containers (Docker) are deployed on each server in the cluster server, so that different application service processes run in containers.
M may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20 or another value.
Optionally, based on the first aspect, in one possible embodiment of the present invention, the fault execution server issuing, to the M servers in the cluster server, the test request carrying the software fault test script comprises:
The fault execution server numbers the M servers in ascending order of the time taken to transmit the test request to each of them, to obtain M numbers;
The fault execution server issues the test request to the M servers in the order of the M numbers.
As can be seen, in the above technical solution, the fault execution server numbers the M servers in ascending order of the time taken to transmit the test request and issues the test request to the M servers in the cluster server in the order of those numbers, thereby improving the efficiency with which the cluster server as a whole receives the test request.
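The numbering step can be sketched as a simple sort. This is a hedged illustration: the function name `number_servers_by_transfer_time` and the millisecond measurements are assumptions, since the description does not say how the transmission time is measured or how ties are broken (here ties fall back to the server name).

```python
from typing import Dict, List, Tuple


def number_servers_by_transfer_time(
    transfer_ms: Dict[str, float],
) -> List[Tuple[int, str]]:
    """Assign numbers 1..M to servers, shortest transmission time first."""
    ordered = sorted(transfer_ms, key=lambda name: (transfer_ms[name], name))
    return list(enumerate(ordered, start=1))


numbering = number_servers_by_transfer_time({"a": 30.0, "b": 5.0, "c": 12.5})
```

Issuing the request "in the order of the M numbers" then amounts to iterating over the returned list.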
Optionally, based on the first aspect, in a first possible embodiment of the present invention, the fault execution server issuing the test request to the M servers in the order of the M numbers comprises:
The fault execution server determines whether the software fault test script matches a preset expression;
If the software fault test script matches the preset expression, the fault execution server determines whether it has permission to call a user interface, wherein the user interface belongs to a container management node, the container management node is any one of the M servers, the container management node is used to manage the N containers deployed on each of the M servers, and N is a positive integer;
If it has permission to call the user interface, the fault execution server calls the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers.
The preset expression may include, for example, a regular expression.
A regular expression is a logical formula for operating on character strings: predefined specific characters and combinations of those characters form a "rule string", and this "rule string" expresses a filtering logic for character strings.
The user interface includes an API (Application Programming Interface). An API is a set of predefined functions intended to give applications and developers the ability to access a group of routines based on certain software or hardware, without having to access the source code or understand the details of the internal working mechanism.
N may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20 or another value.
As can be seen, in the above technical solution, the fault execution server first determines whether the software fault test script matches the preset expression. If it matches, the fault execution server determines whether it has permission to call the user interface. If it has that permission, the fault execution server calls the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers. Issuing the test request by calling the user interface in this way ensures that the fault execution server has permission to call the user interface, and prevents scripts that are not intended for fault testing from being issued to the M servers.
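The match-then-check-permission logic can be sketched as follows. Everything concrete here is an assumption made for illustration: the description does not give the preset expression, so `PRESET_EXPRESSION` is a made-up regular expression that treats scripts mentioning `kubectl` or `docker` as container-level scripts, and `plan_container_dispatch` is a hypothetical helper that only plans the dispatch order rather than calling a real API.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical preset expression; the patent does not specify its content.
PRESET_EXPRESSION = re.compile(r"\b(kubectl|docker)\b")


def matches_preset_expression(script: str) -> bool:
    """Return True if the script text matches the preset expression."""
    return PRESET_EXPRESSION.search(script) is not None


def plan_container_dispatch(
    script: str,
    numbered_servers: List[str],
    containers: Dict[str, List[str]],
    has_api_permission: bool,
) -> List[Tuple[str, str]]:
    """List (server, container) targets in numbering order, or [] if blocked."""
    if not matches_preset_expression(script) or not has_api_permission:
        return []
    return [
        (server, container)
        for server in numbered_servers
        for container in containers.get(server, [])
    ]


plan = plan_container_dispatch(
    "kubectl delete pod worker-0",
    ["s2", "s1"],
    {"s1": ["c1"], "s2": ["c2", "c3"]},
    has_api_permission=True,
)
```

An empty plan covers both the "no match" and the "no permission" cases, which the description handles in separate embodiments.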
Optionally, based on the first aspect or the first possible embodiment of the first aspect, in a second possible embodiment, the method further comprises:
If it does not have permission to call the user interface, the fault execution server sends a user interface call permission acquisition request to the container management node, wherein the request carries identity authentication information of the fault execution server and is used to instruct the container management node to authenticate the identity authentication information; when authentication passes, the container management node grants the fault execution server permission to call the user interface;
The fault execution server receives a user interface call permission acquisition response sent by the container management node, to obtain permission to call the user interface.
As can be seen, in the above technical solution, when the fault execution server does not have permission to call the user interface, it must request that permission from the container management node. Only when the container management node has authenticated the identity authentication information of the fault execution server, and authentication has passed, does it grant the fault execution server permission to call the user interface. This prevents a third-party server without permission to call the user interface from issuing test requests or performing other unauthorized operations, and ensures the security of the interaction between the fault execution server and the cluster server.
Optionally, refer to Fig. 2-b. Fig. 2-b is a schematic diagram, provided by an embodiment of the present invention, of a fault execution server calling the user interface to issue the test request to the N containers deployed on the i-th server in descending order of software fault test priority. Based on the first aspect or the second possible embodiment of the first aspect, in a third possible embodiment, if it has permission to call the user interface, the fault execution server calling the user interface, in the order of the M numbers, to issue the test request to the N containers deployed on each of the M servers comprises:
If it has permission to call the user interface, the fault execution server calls the user interface, in the order of the M numbers, to obtain M configuration files of the M servers, wherein the i-th configuration file among the M configuration files includes the software fault test priorities of the N containers deployed on the i-th server, the i-th server belongs to the M servers, 0 < i ≤ M, and i is a positive integer;
The fault execution server calls the user interface, in descending order of software fault test priority, to issue the test request to the N containers deployed on each of the M servers.
i may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20 or another value.
As can be seen, in the above technical solution, the fault execution server calls the user interface, in the order of the M numbers, to obtain the M configuration files of the M servers, and reads from them the software fault test priorities of the N containers deployed on each of the M servers. It then calls the user interface, in descending order of software fault test priority, to issue the test request to the N containers deployed on each of the M servers. This avoids overloading the cluster server and improves the test rate, while also realizing fault testing of the individual containers and finding the latent defects in each container and the abnormal conditions that occur at run time.
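A minimal sketch of the priority ordering, under stated assumptions: each configuration file is modelled as a plain dictionary mapping a container name to an integer priority where a larger number means a higher priority. The real file format is not described, and the names `order_containers_by_priority` and `plan_priority_dispatch` are hypothetical.

```python
from typing import Dict, List, Tuple


def order_containers_by_priority(config: Dict[str, int]) -> List[str]:
    """Order one server's containers from highest to lowest test priority."""
    # Ties are broken by container name so the order is deterministic.
    return sorted(config, key=lambda name: (-config[name], name))


def plan_priority_dispatch(
    numbered_servers: List[str],
    configs: Dict[str, Dict[str, int]],
) -> List[Tuple[str, str]]:
    """Visit servers in numbering order and containers in priority order."""
    return [
        (server, container)
        for server in numbered_servers
        for container in order_containers_by_priority(configs.get(server, {}))
    ]


dispatch = plan_priority_dispatch(
    ["s1", "s2"],
    {"s1": {"ps": 3, "worker": 5, "monitor": 1}, "s2": {"worker": 2}},
)
```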
Optionally, refer to Fig. 2-c. Fig. 2-c is a schematic diagram, provided by an embodiment of the present invention, of a fault execution server remotely logging in to the M servers via the Secure Shell protocol in the order of the M numbers. Based on the first aspect or the first, second or third possible embodiment of the first aspect, in a fourth possible embodiment, the method further comprises:
If the software fault test script does not match the preset expression, the fault execution server determines whether it has permission to remotely log in to the M servers via the Secure Shell protocol;
If it has permission to remotely log in to the M servers via the Secure Shell protocol, the fault execution server remotely logs in to the M servers via the Secure Shell protocol in the order of the M numbers, to issue the test request to the Q role services running on each of the M servers, wherein Q is a positive integer.
The Secure Shell (SSH) protocol is a protocol designed to provide security for remote login sessions and other network services.
Role services may include, for example, the parameter server process and the task (worker) process in distributed training, and the server's daemon process (docker daemon).
Q may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20 or another value.
As can be seen, in the above technical solution, when the software fault test script does not match the preset expression, the fault execution server determines whether it has permission to remotely log in to the M servers via the Secure Shell protocol. If it has that permission, the fault execution server remotely logs in to the M servers via the Secure Shell protocol in the order of the M numbers, to issue the test request to the Q role services running on each of the M servers. Logging in to the M servers in numbered order avoids overloading the fault execution server and improves the efficiency of issuing the test request, while also realizing fault testing of the individual role services and finding the latent defects in each role service and the abnormal conditions that occur at run time.
Optionally, based on the first aspect or the first, second, third or fourth possible embodiment of the first aspect, in a fifth possible embodiment, the method further comprises:
If it does not have permission to remotely log in to the M servers via the Secure Shell protocol, the fault execution server sends M remote login permission acquisition requests to the M servers in the order of the M numbers, wherein each of the M remote login permission acquisition requests carries the identity authentication information, the i-th remote login permission acquisition request is used to instruct the i-th server to authenticate the identity authentication information, and when authentication passes, the i-th server grants the fault execution server permission to remotely log in to the i-th server via the Secure Shell protocol; the M servers correspond one-to-one to the M remote login permission acquisition requests;
The fault execution server receives M remote login permission acquisition responses sent by the M servers, to obtain permission to remotely log in to the M servers via the Secure Shell protocol, wherein the M servers correspond one-to-one to the M remote login permission acquisition responses.
As can be seen, when the fault execution server does not have permission to remotely log in to the M servers via the Secure Shell protocol, it must send M remote login permission acquisition requests to the M servers to obtain that permission. This prevents a third-party server without such permission from issuing test requests or performing other unauthorized operations, and ensures the security of the interaction between the fault execution server and the cluster server.
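Taken together, the embodiments above describe a four-way decision: which channel to use, and whether permission must be requested first. The sketch below merely restates that decision in Python; the enum `Action` and the function `choose_action` are hypothetical names, and the authentication exchange itself is not modelled.

```python
from enum import Enum


class Action(Enum):
    CALL_USER_INTERFACE = "issue the test request to containers via the user interface"
    REQUEST_INTERFACE_PERMISSION = "ask the container management node for permission"
    SSH_LOGIN = "issue the test request to role services over Secure Shell"
    REQUEST_SSH_PERMISSION = "ask each of the M servers for remote login permission"


def choose_action(
    script_matches: bool,
    has_interface_permission: bool,
    has_ssh_permission: bool,
) -> Action:
    """Pick the next step from the match result and the permissions held."""
    if script_matches:
        # Scripts matching the preset expression go through the user interface.
        if has_interface_permission:
            return Action.CALL_USER_INTERFACE
        return Action.REQUEST_INTERFACE_PERMISSION
    # Non-matching scripts are issued after a Secure Shell remote login.
    if has_ssh_permission:
        return Action.SSH_LOGIN
    return Action.REQUEST_SSH_PERMISSION
```

After a permission request succeeds, calling `choose_action` again with the updated flag yields the corresponding dispatch action.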
Optionally, based on the first aspect or the first, second, third, fourth or fifth possible embodiment of the first aspect, in a sixth possible embodiment, the fault test task also carries a hardware fault test script, and the method further comprises:
The fault execution server determines that it has permission to remotely log in to the M servers via the Secure Shell protocol, remotely logs in to the M servers via the Secure Shell protocol in the order of the M numbers, and then issues the hardware fault test script to the K modules in each of the M servers, wherein K is a positive integer.
K may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20 or another value.
The hardware fault test script may include: a script for abnormal server shutdown, a script for disk read/write failure, a script for hard disk insertion and removal, a script for network card failure, a script for network jitter or packet loss, a script for high memory usage, and a script for excessive CPU or GPU load.
Modules may include, for example: CPU, GPU, memory, hard disk, network card and power supply.
As can be seen, in the above technical solution, the fault execution server determines whether it has permission to remotely log in to the M servers via the Secure Shell protocol. If it has that permission, the fault execution server remotely logs in to the M servers via the Secure Shell protocol in the order of the M numbers, to issue the test request to the K modules in each of the M servers. Logging in to the M servers in numbered order avoids overloading the fault execution server and improves the efficiency of issuing the test request, while also realizing fault testing of the individual modules and finding the latent defects in each module and the abnormal conditions that occur at run time.
203. The fault execution server receives M test responses sent by the M servers.
The M test responses carry M pieces of software fault test data obtained by the M servers running the software fault test script, and the M servers correspond one-to-one to the M test responses.
204. The fault execution server verifies the M pieces of software fault test data to obtain M software fault test results.
Optionally, based on the first aspect, in one possible embodiment of the present invention, the fault execution server verifying the M pieces of software fault test data to obtain M software fault test results comprises:
The fault execution server generates M software fault test data identifiers that match the M pieces of software fault test data;
The fault execution server sends a preset software fault test data request to a database server, wherein the preset software fault test data request carries the M software fault test data identifiers and is used to instruct the database server to query the M pieces of preset software fault test data that match the M software fault test data identifiers;
The fault execution server receives a preset software fault test data response sent by the database server in reply to the preset software fault test data request, wherein the preset software fault test data response carries the M pieces of preset software fault test data;
The fault execution server verifies the M pieces of software fault test data against the M pieces of preset software fault test data to obtain M software fault test results.
The database server consists of one or more computers running in a local area network together with database management system software, and provides data services for client applications.
As can be seen, in the above technical solution, the fault execution server verifies the M pieces of software fault test data against the M pieces of preset software fault test data obtained from the database server to obtain M software fault test results, and thereby uses the verification results to find the latent defects in the cluster server and the abnormal conditions that occurred at run time during the test.
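A hedged sketch of the verification step follows. The description does not say how identifiers are generated or what "verify" compares, so this example assumes that an identifier is the pair of server name and script name joined by a colon, and that a result passes when the observed data equals the preset data. Both rules, and the names `make_identifier` and `verify_against_preset`, are assumptions; the database server is replaced by an in-memory dictionary.

```python
from typing import Dict


def make_identifier(server: str, script_name: str) -> str:
    """Hypothetical identifier scheme: one identifier per server and script."""
    return f"{server}:{script_name}"


def verify_against_preset(
    observed: Dict[str, str],
    preset: Dict[str, str],
) -> Dict[str, bool]:
    """Compare each observed record with the preset record of the same identifier."""
    # A missing preset record counts as a failed verification.
    return {
        ident: ident in preset and preset[ident] == data
        for ident, data in observed.items()
    }


observed = {
    make_identifier("s1", "restart_ps"): "restarted",
    make_identifier("s2", "restart_ps"): "crashed",
}
preset = {
    make_identifier("s1", "restart_ps"): "restarted",
    make_identifier("s2", "restart_ps"): "restarted",
}
results = verify_against_preset(observed, preset)
```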
Optionally, based on the first aspect, in a first possible implementation of the present invention,
after the fault execution server verifies the M pieces of software fault test data to obtain the M
software fault test results, the method further includes:
The fault execution server sends M query messages to the M servers, where the i-th query message among
the M query messages instructs the i-th server to detect, after the fault test, the operating status
of the N containers deployed on the i-th server, the Q role services running on the i-th server, and
the K modules of the i-th server, and the M servers correspond one-to-one with the M query messages;
The fault execution server receives M query-end messages sent by the M servers, where the M query-end
messages carry M pieces of operating status detection data, and the M servers correspond one-to-one
with the M query-end messages;
The fault execution server analyzes the M pieces of operating status detection data to determine
whether the M servers are running normally;
If the M servers are running abnormally, the fault execution server sends an operating status log view
request to the database server, where the operating status log view request carries an operating
status log identifier and instructs the database server to look up the operating status log that
matches the operating status log identifier;
The fault execution server receives an operating status log view response sent by the database server
in reply to the operating status log view request, where the operating status log view response
carries the operating status log;
The fault execution server analyzes the operating status log to determine why each of the M servers
is running abnormally, and then generates an abnormal operation report.
As can be seen, in the above technical solution the fault execution server first sends M query messages
to the M servers. On receiving its query message, the i-th of the M servers detects, after the fault
test, the operating status of the N containers deployed on it, the Q role services running on it, and
its K modules. When detection is complete, the M servers send M query-end messages to the fault
execution server. The fault execution server then analyzes the M pieces of operating status detection
data carried by the M query-end messages to determine whether the M servers are running normally. If
the M servers are running abnormally, the fault execution server analyzes the operating status log to
determine why each of the M servers is running abnormally, and then generates an abnormal operation
report. This implements a health check of the cluster server after testing. In addition, finding the
cause of each server's abnormal operation by analyzing the operating status log facilitates targeted
optimization of the cluster server later and improves the stability of the machine learning system.
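The post-test health check summarized above could be sketched as follows. This is only an illustrative Python sketch: `health_check`, the `query` and `fetch_log` callables, the `"ok"` status string, and the `"ERROR"` log marker are hypothetical names and conventions chosen for the example, and the log lookup is keyed by server name rather than by a separate log identifier for simplicity.

```python
from typing import Callable, Dict, Iterable, List


def health_check(
    servers: Iterable[str],
    query: Callable[[str], Dict[str, str]],
    fetch_log: Callable[[str], List[str]],
) -> Dict[str, dict]:
    """Query every server after a fault test and build an abnormal operation report.

    `query(server)` stands in for the query message / query-end message exchange and
    returns the status of each container, role service, and module on that server.
    `fetch_log(server)` stands in for the log view request sent to the database server.
    """
    report = {}
    for server in servers:
        status = query(server)
        failed = sorted(name for name, state in status.items() if state != "ok")
        if not failed:
            continue  # Server is running normally; no log lookup is needed.
        log_lines = fetch_log(server)
        report[server] = {
            "failed_components": failed,
            # Keep only the log lines that point at a likely cause (assumed marker).
            "evidence": [line for line in log_lines if "ERROR" in line],
        }
    return report
```

The log is fetched only for servers that report an abnormal component, which mirrors the conditional log view request in the steps above.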
Optionally, based on the first aspect or the first possible implementation of the first aspect, in a
second possible implementation of the present invention, the method further includes:
When receiving a fault retest request, the fault execution server sends a fault retest message to the
database server, where the fault retest message carries a fault retest identifier and instructs the
database server to look up the historical fault test task that matches the fault retest identifier;
The fault execution server receives, from the database server, the historical fault test task, which
carries a historical fault test script;
The fault execution server issues, to the M servers, a historical test request carrying the historical
fault test script;
The fault execution server receives M historical test responses sent by the M servers, where the M
historical test responses carry M pieces of historical fault test data obtained by the M servers
running the historical fault test script, and the M servers correspond one-to-one with the M
historical test responses;
The fault execution server verifies the M pieces of historical fault test data to obtain M historical
fault test results.
As can be seen, in the above technical solution, when a fault retest request is received, the server
sends a fault retest message to the database server to obtain the historical fault test task carrying
the historical fault test script. Thus, when there is doubt about a software fault test, or when the
previous software fault test result turns out to be inaccurate, a historical test request carrying the
historical fault test script can be re-issued to the cluster server. This also avoids unnecessary
operations and shortens the overall test time.
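As a rough illustration of the retest flow above, the following Python sketch looks up a stored task by its retest identifier and re-issues its script. All names here (`retest`, `task_store`, `run_script`, `verify`, and the `"script"` key) are hypothetical, and the dictionary `task_store` merely stands in for the database server.

```python
from typing import Callable, Dict, Iterable


def retest(
    retest_id: str,
    task_store: Dict[str, dict],
    servers: Iterable[str],
    run_script: Callable[[str, str], dict],
    verify: Callable[[dict], bool],
) -> Dict[str, bool]:
    """Re-run a historical fault test task identified by `retest_id`.

    `run_script(server, script)` stands in for issuing the historical test request
    and receiving the historical test response from one server.
    """
    task = task_store.get(retest_id)
    if task is None:
        raise KeyError(f"no historical fault test task matches {retest_id!r}")
    script = task["script"]
    # Issue the same historical script to every server and collect the responses.
    responses = {server: run_script(server, script) for server in servers}
    # Verify each piece of historical fault test data to get one result per server.
    return {server: verify(data) for server, data in responses.items()}
```

Because the stored script is reused as-is, nothing needs to be regenerated before the retest, which is the time saving the paragraph above refers to.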
Referring to Fig. 3, Fig. 3 is a schematic diagram of a server provided by an embodiment of the present
invention. As shown in Fig. 3, a server 300 provided by an embodiment of the present invention may
include:
A first receiving module 301, configured to receive the fault test task sent by the fault generation
server.
The fault test task carries a software fault test script.
Software fault test scripts are mainly scripts that trigger abnormalities in application service
processes related to machine learning. Examples include: a script that restarts the parameter server
process in distributed training; a script that restarts or exits a task (worker) process; a script
that makes the server's daemon (docker daemon) abnormal; and a script that makes a pod (a group of
associated containers), a job (scheduled task), or a deployment (stateless application) in the
container orchestration engine (Kubernetes, K8s) abnormal.
Kubernetes is a container orchestration engine open-sourced by Google; it supports automated
deployment, large-scale scaling, and management of containerized applications.
An issuing module 302, configured to issue, to M servers in the cluster server, a test request carrying
the software fault test script.
M is a positive integer.
The cluster server may, for example, include servers in different distributed systems. Multiple
containers (Docker) are deployed on each server in the cluster server, so that different application
service processes can run in containers.
M may, for example, be equal to 1, 2, 3, 5, 6, 11, 13, 20, or another value.
A second receiving module 303, configured to receive M test responses sent by the M servers.
The M test responses carry M pieces of software fault test data obtained by the M servers running the
software fault test script, and the M servers correspond one-to-one with the M test responses.
A verification module 304, configured to verify the M pieces of software fault test data to obtain M
software fault test results.
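One way to picture how the four modules of server 300 cooperate is the following Python sketch. It is not the structure disclosed in Fig. 3: the class and method names, the `software_fault_test_script` key, and the in-memory `LoopbackTransport` used in place of a real network are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


class LoopbackTransport:
    """In-memory stand-in for the network between server 300 and the M servers (hypothetical)."""

    def __init__(self, handlers: Dict[str, Callable[[str], dict]]):
        self.handlers = handlers  # server name -> function that "runs" a script
        self.outbox: Dict[str, dict] = {}

    def send(self, server: str, script: str) -> None:
        self.outbox[server] = self.handlers[server](script)

    def recv(self, server: str) -> dict:
        return self.outbox.pop(server)


class FaultExecutionServer:
    def __init__(self, transport: LoopbackTransport, verify: Callable[[dict], bool]):
        self.transport = transport
        self.verify = verify

    def receive_task(self, task: dict) -> str:
        # Role of the first receiving module 301: accept the task, extract its script.
        return task["software_fault_test_script"]

    def issue(self, servers: List[str], script: str) -> None:
        # Role of the issuing module 302: send the test request to each of the M servers.
        for server in servers:
            self.transport.send(server, script)

    def receive_responses(self, servers: List[str]) -> Dict[str, dict]:
        # Role of the second receiving module 303: one test response per server.
        return {server: self.transport.recv(server) for server in servers}

    def check(self, responses: Dict[str, dict]) -> Dict[str, bool]:
        # Role of the verification module 304: one test result per piece of test data.
        return {server: self.verify(data) for server, data in responses.items()}

    def run(self, task: dict, servers: List[str]) -> Dict[str, bool]:
        script = self.receive_task(task)
        self.issue(servers, script)
        return self.check(self.receive_responses(servers))
```

Keeping the transport and the verification rule as injected dependencies lets the same flow be exercised without a real cluster, for example with handlers that simply return canned test data.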
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series
of combined actions. Those skilled in the art should understand, however, that the present invention
is not limited by the described order of actions, because according to the present invention some
steps may be performed in another order or simultaneously. Those skilled in the art should also
understand that the embodiments described in the specification are all preferred embodiments, and that
the actions and modules involved are not necessarily required by the present invention.
The above embodiments are intended only to illustrate the technical solutions of the present invention,
not to limit them. Although the present invention has been described in detail with reference to the
foregoing embodiments, those of ordinary skill in the art should understand that they may still modify
the technical solutions described in the foregoing embodiments or make equivalent replacements of some
of their technical features, and that such modifications or replacements do not cause the essence of
the corresponding technical solutions to depart from the scope of the technical solutions of the
embodiments of the present invention.
Claims (10)
1. A cluster server fault testing method in a machine learning system, characterized by comprising:
A fault execution server receives a fault test task sent by a fault generation server, where the fault
test task carries a software fault test script;
The fault execution server issues, to M servers in the cluster server, a test request carrying the
software fault test script, where M is a positive integer;
The fault execution server receives M test responses sent by the M servers, where the M test responses
carry M pieces of software fault test data obtained by the M servers running the software fault test
script, and the M servers correspond one-to-one with the M test responses;
The fault execution server verifies the M pieces of software fault test data to obtain M software
fault test results.
2. The method according to claim 1, characterized in that the fault execution server issuing, to the M
servers in the cluster server, the test request carrying the software fault test script comprises:
The fault execution server numbers the M servers in ascending order of the time taken to transmit the
test request to each of them, to obtain M numbers;
The fault execution server issues the test request to the M servers in the order of the M numbers.
3. The method according to claim 2, characterized in that the fault execution server issuing the test
request to the M servers in the order of the M numbers comprises:
The fault execution server determines whether the software fault test script matches a preset
expression;
If the software fault test script matches the preset expression, the fault execution server determines
whether it has permission to call a user interface, where the user interface belongs to a container
management node, the container management node is any one of the M servers, the container management
node is used to manage the N containers deployed on each of the M servers, and N is a positive integer;
If it has permission to call the user interface, the fault execution server calls the user interface,
in the order of the M numbers, to issue the test request to the N containers deployed on each of the M
servers.
4. The method according to claim 3, characterized in that, if it has permission to call the user
interface, the fault execution server calling the user interface, in the order of the M numbers, to
issue the test request to the N containers deployed on each of the M servers comprises:
If it has permission to call the user interface, the fault execution server calls the user interface,
in the order of the M numbers, to obtain M configuration files of the M servers, where the i-th
configuration file among the M configuration files includes the software fault test priority order of
the N containers deployed on the i-th server, the i-th server is one of the M servers, 0 < i ≤ M, and
i is a positive integer;
The fault execution server calls the user interface to issue the test request to the N containers
deployed on each of the M servers in descending order of software fault test priority.
5. The method according to claim 3, characterized in that the method further comprises:
If the software fault test script does not match the preset expression, the fault execution server
determines whether it has permission to log in to the M servers remotely via the Secure Shell (SSH)
protocol;
If it has permission to log in to the M servers remotely via the Secure Shell protocol, the fault
execution server logs in to the M servers remotely via the Secure Shell protocol in the order of the M
numbers, to issue the test request to the Q role services running on each of the M servers, where Q is
a positive integer.
6. The method according to claim 1 or 5, characterized in that the fault test task further carries a
hardware fault test script, and the method further comprises:
The fault execution server determines that it has permission to log in to the M servers remotely via
the Secure Shell protocol, logs in to the M servers remotely via the Secure Shell protocol in the order
of the M numbers, and then issues the hardware fault test script to the K modules in each of the M
servers, where K is a positive integer.
7. The method according to claim 1, characterized in that the fault execution server verifying the M
pieces of software fault test data to obtain the M software fault test results comprises:
The fault execution server generates M software fault test data identifiers matching the M pieces of
software fault test data;
The fault execution server sends a preset software fault test data request to a database server, where
the preset software fault test data request carries the M software fault test data identifiers and
instructs the database server to look up the M pieces of preset software fault test data that match
the M software fault test data identifiers;
The fault execution server receives a preset software fault test data response sent by the database
server in reply to the preset software fault test data request, where the preset software fault test
data response carries the M pieces of preset software fault test data;
The fault execution server verifies the M pieces of software fault test data against the M pieces of
preset software fault test data to obtain the M software fault test results.
8. The method according to claim 7, characterized in that, after the fault execution server verifies the
M pieces of software fault test data to obtain the M software fault test results, the method further
comprises:
The fault execution server sends M query messages to the M servers, where the i-th query message among
the M query messages instructs the i-th server to detect, after the fault test, the operating status
of the N containers deployed on the i-th server, the Q role services running on the i-th server, and
the K modules of the i-th server, and the M servers correspond one-to-one with the M query messages;
The fault execution server receives M query-end messages sent by the M servers, where the M query-end
messages carry M pieces of operating status detection data, and the M servers correspond one-to-one
with the M query-end messages;
The fault execution server analyzes the M pieces of operating status detection data to determine
whether the M servers are running normally;
If the M servers are running abnormally, the fault execution server sends an operating status log view
request to the database server, where the operating status log view request carries an operating
status log identifier and instructs the database server to look up the operating status log that
matches the operating status log identifier;
The fault execution server receives an operating status log view response sent by the database server
in reply to the operating status log view request, where the operating status log view response
carries the operating status log;
The fault execution server analyzes the operating status log to determine why each of the M servers
is running abnormally, and then generates an abnormal operation report.
9. A server, characterized by comprising:
A first receiving module, configured to receive a fault test task sent by a fault generation server,
where the fault test task carries a software fault test script;
An issuing module, configured to issue, to M servers in the cluster server, a test request carrying the
software fault test script, where M is a positive integer;
A second receiving module, configured to receive M test responses sent by the M servers, where the M
test responses carry M pieces of software fault test data obtained by the M servers running the
software fault test script, and the M servers correspond one-to-one with the M test responses;
A verification module, configured to verify the M pieces of software fault test data to obtain M
software fault test results.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is
used to store a computer program, and the stored computer program is executed by a processor to
implement the method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811620118.9A CN109800160B (en) | 2018-12-27 | 2018-12-27 | Cluster server fault testing method and related device in machine learning system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109800160A (en) | 2019-05-24 |
CN109800160B (en) | 2021-03-05 |
Family
ID=66557909
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811620118.9A Active CN109800160B (en) | 2018-12-27 | 2018-12-27 | Cluster server fault testing method and related device in machine learning system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109800160B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030101385A1 (en) * | 2001-11-28 | 2003-05-29 | Inventec Corporation | Cross-platform system-fault warning system and method |
CN102354298A (en) * | 2011-07-27 | 2012-02-15 | 哈尔滨工业大学 | Software testing automation framework (STAF)-based fault injection automation testing platform and method for high-end fault-tolerant computer |
CN105205003A (en) * | 2015-10-28 | 2015-12-30 | 努比亚技术有限公司 | Automated testing method and device based on clustering system |
CN106897110A (en) * | 2017-02-23 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of container dispatching method and management node scheduler |
CN107967837A (en) * | 2017-05-31 | 2018-04-27 | 常州信息职业技术学院 | A kind of training platform and its implementation based on container |
CN108092850A (en) * | 2017-12-12 | 2018-05-29 | 郑州云海信息技术有限公司 | A kind of cluster server method for diagnosing faults and system based on heartbeat mechanism |
CN108654089A (en) * | 2018-05-09 | 2018-10-16 | 腾讯科技(深圳)有限公司 | The test method and device of Mission Objective, electronic equipment, storage medium |
Non-Patent Citations (1)
Title |
---|
维克托•法西克 等: "《微服务运维实战》", 30 June 2018, 华中科技大学出版社 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110618853B (en) * | 2019-08-02 | 2022-04-22 | 东软集团股份有限公司 | Detection method, device and equipment for zombie container |
CN110618853A (en) * | 2019-08-02 | 2019-12-27 | 东软集团股份有限公司 | Detection method, device and equipment for zombie container |
CN110852445A (en) * | 2019-10-28 | 2020-02-28 | 广州文远知行科技有限公司 | Distributed machine learning training method and device, computer equipment and storage medium |
CN111641716A (en) * | 2020-06-01 | 2020-09-08 | 第四范式(北京)技术有限公司 | Self-healing method of parameter server, parameter server and parameter service system |
CN111641716B (en) * | 2020-06-01 | 2023-05-02 | 第四范式(北京)技术有限公司 | Self-healing method of parameter server, parameter server and parameter service system |
CN112217899A (en) * | 2020-10-19 | 2021-01-12 | 政采云有限公司 | Container troubleshooting system and method |
CN112346979A (en) * | 2020-11-11 | 2021-02-09 | 杭州飞致云信息科技有限公司 | Software performance testing method, system and readable storage medium |
CN112905445A (en) * | 2020-12-09 | 2021-06-04 | 江苏苏宁云计算有限公司 | Log-based test method and device and computer system |
CN112783769A (en) * | 2021-01-19 | 2021-05-11 | 深圳市莫廷影像技术有限公司 | Self-defined automatic software testing method |
CN113094266A (en) * | 2021-04-06 | 2021-07-09 | 中国工商银行股份有限公司 | Fault testing method, platform and equipment for container database |
CN113094266B (en) * | 2021-04-06 | 2024-06-14 | 中国工商银行股份有限公司 | Fault testing method, platform and equipment for container database |
CN115022328A (en) * | 2022-06-24 | 2022-09-06 | 脸萌有限公司 | Server cluster, server cluster testing method and device and electronic equipment |
CN115022328B (en) * | 2022-06-24 | 2023-08-08 | 脸萌有限公司 | Server cluster, testing method and device of server cluster and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109800160B (en) | 2021-03-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109800160A (en) | Cluster server fault testing method and relevant apparatus in machine learning system | |
CN103678354B (en) | Local relation type database node scheduling method and device based on cloud computing platform | |
US10474563B1 (en) | System testing from production transactions | |
CN108829581B (en) | Application program testing method and device, computer equipment and storage medium | |
CN102402481B (en) | The fuzz testing of asynchronous routine code | |
CN109165168A (en) | A kind of method for testing pressure, device, equipment and medium | |
CN108206830B (en) | Vulnerability scanning method, apparatus, computer equipment and storage medium | |
CN105427695B (en) | Program class examination paper automatic assessment method and system | |
CN109194543A (en) | Collecting method and device | |
EP2629205A1 (en) | Multi-entity test case execution workflow | |
CN106209503B (en) | RPC interface test method and system | |
CN107608902A (en) | Routine interface method of testing and device | |
CN107168844B (en) | Performance monitoring method and device | |
CN112732499A (en) | Test method and device based on micro-service architecture and computer system | |
CN111382080A (en) | Stability test method for equipment cloud management platform system | |
CN114168429A (en) | Error reporting analysis method and device, computer equipment and storage medium | |
TWI626538B (en) | Infrastructure rule generation | |
CN104537284B (en) | Software protecting system and method based on remote service | |
CN106302412A (en) | A kind of intelligent checking system for the test of information system crushing resistance and detection method | |
CN106875184A (en) | Abnormal scene analogy method, device and equipment | |
JP2004145413A (en) | Diagnostic system for security hole | |
CN115119197B (en) | Wireless network risk analysis method, device, equipment and medium based on big data | |
CN114338051B (en) | Method, device, equipment and medium for acquiring random number by block chain | |
CN109274533A (en) | A kind of positioning device and method of the Web service failure of rule-based engine | |
CN109658259A (en) | Peasant household's listings data processing method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||