Disclosure of Invention
In order to solve the technical problem, the invention provides a server fault identification method, a server fault identification device and a storage medium.
The invention provides a server fault identification method, wherein a server is used for serving a predetermined project, a plurality of servers used for serving the predetermined project are distributed in a plurality of areas, and a terminal accesses the server used for serving the predetermined project when logging in the predetermined project, and the method comprises the following steps:
acquiring abnormal information occurring when a terminal accesses a server;
according to the abnormal information, determining a time node of the terminal with the abnormal information accessing the server and an information acquisition interval corresponding to the time node;
acquiring the change trend information and login duration information of the information acquisition interval;
and judging whether the server fails or not according to the change trend information and the login duration information.
The method also has the following characteristics: the method further comprises the steps of:
acquiring total time information and total amount information of all terminals accessed to an area to which the server belongs;
and judging whether the server has faults or not according to the total time information and the total quantity information.
The method also has the following characteristics: the judging whether the server fails according to the change trend information and the login duration information comprises the following steps:
judging, by using a pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault according to the change trend information and the login duration information;
and/or,
the judging whether the server has the fault according to the total time information and the total quantity information comprises the following steps:
judging, by using a pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault according to the total time information and the total quantity information.
The method also has the following characteristics: the judging, by using a pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault according to the change trend information and the login duration information comprises the following steps:
determining that the predetermined item served by the server has a first type of fault when any one of the following conditions is satisfied:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
and/or,
the judging, by using a pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault according to the total time information and the total quantity information comprises the following steps:
determining that the predetermined item served by the area to which the server belongs has a second type of fault when all of the following conditions are satisfied:
condition four: the total time information is smaller than a first duration, the first duration being the lower quartile of the average login durations of the terminals accessed in all the areas for logging in the predetermined item;
condition five: the total time information is smaller than a second duration;
condition six: the total quantity information is larger than a fourth preset value.
The method also has the following characteristics: the method further comprises the following steps:
acquiring the number of times a terminal accesses the server;
determining the audience ratio of the access times of the terminal according to the number of times the terminal accesses the server;
and when condition four, condition five and condition six are all satisfied, judging whether the audience ratio is smaller than a preset audience ratio, and if so, determining that the predetermined item served by the area to which the server belongs has no fault.
The method also has the following characteristics: the method further comprises the following steps:
determining that the server has a fault when all of the following conditions are satisfied:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in the area to which the server belongs;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in the area to which the server belongs;
more than a preset proportion of all the terminals accessing the area access the server.
The method also has the following characteristics: the method for determining the information acquisition interval comprises the following steps:
taking the time node as a center, selecting n consecutive data access points upstream and n consecutive data access points downstream of the time node in the time dimension, and taking the resulting 2n+1 nodes as information acquisition points, wherein the data access points are the time nodes at which other terminals access the server.
The method also has the following characteristics: the change trend information comprises the slope of the line connecting the 2n+1 nodes;
the login duration information includes a total duration spent in the predetermined item after the terminal accesses the server.
The application also provides a server failure recognition device, wherein the server is used for serving a predetermined item, a plurality of servers for serving the predetermined item are distributed in a plurality of areas, and a terminal accesses the server for serving the predetermined item when logging in the predetermined item, the recognition device comprises:
the abnormal information acquisition module is used for acquiring abnormal information which appears when the terminal is accessed to the server;
the interval determining module is used for determining a time node of the server accessed by the terminal with the abnormal information and an information acquisition interval corresponding to the time node according to the abnormal information;
the judgment parameter acquisition module is used for acquiring the change trend information and the login duration information of the information acquisition interval;
and the first judgment module is used for judging whether the server has a fault or not according to the change trend information and the login duration information.
The device also has the following characteristics: the device further comprises:
the area information acquisition module is used for acquiring total time information and total amount information of all terminals accessing the area to which the server belongs;
and the second judging module is used for judging whether the server has faults or not according to the total time information and the total quantity information.
The device also has the following characteristics: the first judging module is used for judging, by using a pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault according to the change trend information and the login duration information;
and/or,
the second judging module is used for judging, by using a pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault according to the total time information and the total quantity information.
The device also has the following characteristics: the first judging module is used for executing the following judgment:
determining that the predetermined item served by the server has the first type of fault when any one of the following conditions is satisfied:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
and/or,
the second judging module is used for executing the following judgment:
determining that the predetermined item served by the area to which the server belongs has the second type of fault when all of the following conditions are satisfied:
condition four: the total time information is smaller than a first duration, the first duration being the lower quartile of the average login durations of the terminals accessed in all the areas for logging in the predetermined item;
condition five: the total time information is smaller than a second duration;
condition six: the total quantity information is larger than a fourth preset value.
The device also has the following characteristics: the device further comprises:
the number obtaining module is used for obtaining the number of times that the terminal accesses the server;
the audience ratio determining module is used for determining the audience ratio of the access times of the terminal according to the number of times the terminal accesses the server;
and the third judging module is used for judging whether the audience ratio is smaller than a preset audience ratio when condition four, condition five and condition six are all satisfied, and if so, determining that the predetermined item served by the area to which the server belongs has no fault.
The device also has the following characteristics: the device further comprises:
a fourth judgment unit configured to perform the following judgment:
and when the following conditions are met, judging that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in the area to which the server belongs;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in the area to which the server belongs;
more than a preset proportion of all the terminals accessing the area access the server.
The device also has the following characteristics: the area information acquisition module includes:
the area information determining unit is used for selecting, with the time node as a center, n consecutive data access points upstream and n consecutive data access points downstream of the time node in the time dimension, and taking the resulting 2n+1 nodes as information acquisition points, wherein the data access points are the time nodes at which other terminals access the server.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a server failure identification method as described above.
By using the server fault identification method and device provided by the invention, the server accessed by the terminal with abnormal information can be quickly judged by utilizing the information such as the login time length and the login time node of the terminal which can be conveniently obtained in the prior art, so that the server fault and the position of the server with the fault can be quickly and accurately identified, and the problems of long time consumption and inaccurate identification and positioning in the process of identifying and positioning the server with the fault in the prior art are effectively solved.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The application provides a server fault identification method, which can quickly judge a server accessed by a terminal with abnormal information according to information such as login duration, login time nodes and the like of the terminal, which can be obtained in the prior art, so that a server fault and the position of the server with the fault can be quickly and accurately identified.
When a game or a large nationwide project is operated, servers are needed to support its operation. To ensure that game players or users throughout the country can quickly access the game or project, servers are deployed in a plurality of regions, and each region may contain a plurality of servers. Each server may serve a plurality of games or projects, that is, a plurality of games or projects may run on one server at the same time.
As shown in fig. 1, a server failure identification method includes the following steps:
S10, acquiring abnormal information when the terminal accesses the server;
S20, according to the abnormal information, determining a time node of the terminal with the abnormal information, which is accessed to the server, so as to obtain an information acquisition interval corresponding to the time node;
S30, acquiring the change trend information and the login duration information of the information acquisition interval;
S40, judging whether the server has a fault or not according to the change trend information and the login duration information.
In S10, the abnormal information is sent by the terminal. For example, a crash, stuttering, a game start failure, or a failure to enter the game through the App that occurs while the terminal is logging in to the game all count as abnormal information. The causes of a game start failure are complicated, the abnormal information causing a game start failure is reported by the server, and this application cannot cover every such problem. In general, the abnormal information that occurs when a terminal accesses the server reflects relatively minor abnormalities, such as stuttering, crashes, player disconnection and reconnection, failure to establish a decoding channel, or insufficient server GPU resources. When such a minor abnormality occurs, the probability of a server fault is very low, so no fault identification is needed and the abnormality can be ignored. A serious abnormality occurring when a terminal accesses the server, by contrast, is likely to be caused by a server fault, so fault identification is performed on the server only in that case. Accordingly, the abnormal information referred to in S10 is abnormal information of a relatively serious degree that occurs when the terminal accesses the server. It should be noted that the abnormal information occurring when the terminal accesses the server may be acquired by methods existing in the prior art.
The above method steps are mainly used for judging whether the predetermined item (which may be a game or a large project) running on the server has a problem. Further, to ensure that the predetermined item runs normally and stably, the method also includes steps for judging whether the predetermined item has a problem in the area to which the server belongs, the server being the one accessed by the terminal that exhibited the abnormal information at login. As shown in fig. 2, the method includes the following steps:
S50, acquiring total time information and total amount information of all terminals accessing the area to which the server belongs;
S60, judging whether the server has a fault according to the total time information and the total amount information.
The above method involves an information acquisition interval, which may also be regarded as a data collection window: data are collected at the nodes within the interval to obtain the change trend information and the login duration information in S30. The information acquisition interval is determined as follows: taking the time node as a center, n consecutive data access points upstream and n consecutive data access points downstream of the time node are selected in the time dimension, and the resulting 2n+1 nodes are taken as information acquisition points, wherein the data access points are the time nodes at which other terminals access the server. The value of n is determined according to actual conditions, for example according to the performance of the server: when the server performs well, relatively few nodes may be selected, and when it performs poorly, more nodes are selected to obtain more accurate data. Preferably, n is 5 to 20. Here, the n upstream nodes are the time nodes at which terminals accessed the server before the terminal exhibiting the abnormal information did; each may be a time node of a normal login or a time node at which abnormal information occurred. The n downstream nodes are the time nodes at which terminals accessed the server after the terminal exhibiting the abnormal information did. Both the n upstream nodes and the n downstream nodes are consecutive in time, which ensures the reliability of the change trend information obtained in the information acquisition interval.
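The window selection described above can be sketched as follows. This is a minimal illustration only: the function name `acquisition_interval`, the representation of time nodes as plain numbers, and the choice to return `None` when fewer than n nodes exist on either side are assumptions made for the example, not details given in the application.

```python
from bisect import bisect_left
from typing import List, Optional


def acquisition_interval(
    access_times: List[float], anomaly_time: float, n: int
) -> Optional[List[float]]:
    """Return the 2n+1 time nodes centered on the anomalous access.

    access_times holds the time nodes at which terminals accessed the server,
    including the anomalous access itself (hypothetical representation).
    Returns None when the window cannot be filled (assumed behavior).
    """
    times = sorted(access_times)
    i = bisect_left(times, anomaly_time)
    if i == len(times) or times[i] != anomaly_time:
        return None  # the anomalous access must be one of the recorded nodes
    if i - n < 0 or i + n >= len(times):
        return None  # fewer than n nodes upstream or downstream
    # n consecutive nodes upstream, the center node, n consecutive nodes downstream
    return times[i - n : i + n + 1]


window = acquisition_interval([10, 20, 30, 40, 50, 60, 70], anomaly_time=40, n=2)
```

With the sample data, the window contains the five nodes from 20 to 60; asking for n = 4 on the same data yields `None` because only three nodes exist on each side.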
In a specific embodiment, the change trend information of the information acquisition interval obtained in S30 is the slope of the line connecting the 2n+1 nodes. That is, with time as the abscissa and the time taken to access the server as the ordinate, connecting the 2n+1 nodes in chronological order yields approximately a straight line, with the 2n+1 nodes distributed on both sides of it. The slope of this straight line is taken as the change trend information of the information acquisition interval.
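One way to obtain such a slope is an ordinary least-squares fit through the 2n+1 points. The application does not state which fitting method is used, so the least-squares choice and the name `trend_slope` below are assumptions for illustration.

```python
from typing import Sequence


def trend_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the least-squares line through the points (xs[i], ys[i]).

    xs: time nodes (abscissa); ys: the value measured at each node (ordinate).
    Least squares is an assumed fitting method, not one named in the source.
    """
    count = len(xs)
    if count != len(ys) or count < 2:
        raise ValueError("need at least two points with matching lengths")
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all time nodes are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Five nodes (n = 2) lying exactly on y = 2x + 1
slope = trend_slope([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

A constant series of ordinates gives a slope of 0, which is the situation that condition two of the first judgment condition is meant to catch.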
In S30, the login duration information includes a total duration spent in a predetermined item, such as a predetermined game, after the terminal accesses the server. For example, if a terminal takes 30 minutes from the start of App triggering to the exit of a game, the login duration information of the terminal is 30 minutes.
In S50, the total time information is determined as follows: determine the area to which the server belongs, obtain the servers in the area, count the terminals accessing each server, record the online time of each terminal from accessing the server to disconnecting from it, sum the online times of all terminals on all servers in the area, and take the average as the total time information. That is, the total time information is the average duration for which the players corresponding to the terminals accessing all servers in a given area run the game.
In S50, the total amount information is determined as follows: determine the area to which the server belongs, obtain the servers in the area, count the terminals accessing each server, and sum the numbers of terminals accessing all servers in the area as the total amount information. That is, the total amount information is the total sample size of players playing a given game on the servers in a given area.
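The two regional aggregates of S50 can be sketched as below. The input layout (a mapping from a server identifier to the list of online durations of its terminals) and the name `region_totals` are hypothetical; the application does not specify how the data are stored.

```python
from typing import Dict, List, Tuple


def region_totals(sessions_by_server: Dict[str, List[float]]) -> Tuple[float, int]:
    """Return (total time information, total amount information) for one area.

    sessions_by_server maps each server in the area to the online durations,
    in minutes, of the terminals that accessed it (assumed layout).
    """
    durations = [d for sessions in sessions_by_server.values() for d in sessions]
    total_amount = len(durations)  # number of terminals across all servers
    if total_amount == 0:
        return 0.0, 0  # assumed convention for an area with no terminals
    total_time = sum(durations) / total_amount  # average online duration
    return total_time, total_amount


total_time, total_amount = region_totals({"server-a": [30.0, 10.0], "server-b": [20.0]})
```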
In a preferred embodiment, judging whether the server fails according to the change trend information and the login duration information specifically includes:
judging, by using the pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault according to the change trend information and the login duration information.
Specifically, judging, by using the pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault according to the change trend information and the login duration information includes the following steps:
when any one of the following conditions is satisfied, it is determined that the predetermined item served by the server has a first type of fault:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value.
The second preset value is 0; that is, condition two indicates that the login duration is the same at every node in the whole information acquisition interval. Because terminal performance and network environment differ between terminals, login durations normally differ, so change trend information equal to 0 indicates a problem. The first preset value and the third preset value are determined by a general classification prediction model, such as a CART decision tree, when the fault identification model is established (described in detail later).
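The first judgment condition amounts to a logical OR over conditions one to three, as in the following sketch. The threshold values used in the example calls are hypothetical placeholders; in the application they come from the fault identification model. The exact comparison with 0 follows the text literally, and a practical implementation might compare within a small tolerance instead.

```python
def has_first_type_fault(
    slope: float,
    login_minutes: float,
    first_preset: float,
    third_preset: float,
    second_preset: float = 0.0,
) -> bool:
    """Return True when any of conditions one to three holds."""
    condition_one = slope < first_preset
    condition_two = slope == second_preset  # literal equality, as in the text
    condition_three = login_minutes < third_preset
    return condition_one or condition_two or condition_three


# Hypothetical thresholds: first preset value -0.5, third preset value 5 minutes
flat_trend = has_first_type_fault(0.0, 30.0, first_preset=-0.5, third_preset=5.0)
healthy = has_first_type_fault(0.3, 30.0, first_preset=-0.5, third_preset=5.0)
```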
In another preferred embodiment, judging whether the server has a fault according to the total time information and the total amount information includes:
judging, by using the pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault according to the total time information and the total quantity information.
Specifically, judging, by using the pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault according to the total time information and the total quantity information includes:
when all of the following conditions are satisfied, it is determined that the predetermined item served by the area to which the server belongs has a second type of fault:
condition four: the total time information is smaller than a first duration, the first duration being the lower quartile of the average login durations of the terminals accessed in all the areas for logging in the predetermined item;
condition five: the total time information is smaller than a second duration;
condition six: the total quantity information is larger than a fourth preset value.
The first duration is determined statistically: the average login durations of the terminals accessed in all the areas for logging in the predetermined item are sorted from longest to shortest, and condition four is considered satisfied when the total time information falls in the last quarter of this ordering. The second duration and the fourth preset value are likewise determined by a general classification prediction model, such as a CART decision tree, when the server fault identification model is established.
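The second judgment condition can be sketched as a logical AND over conditions four to six. The example below uses `statistics.quantiles` from the Python standard library to obtain the lower quartile; the application does not say which quartile convention is used, so the library default is an assumption, as are the function name and the sample thresholds.

```python
from statistics import quantiles
from typing import Sequence


def has_second_type_fault(
    total_time: float,
    total_amount: int,
    all_region_avg_durations: Sequence[float],
    second_duration: float,
    fourth_preset: int,
) -> bool:
    """Return True when conditions four, five and six all hold.

    all_region_avg_durations holds the average login duration of each area;
    it needs at least two values for the quartile to be computed.
    """
    # First duration: lower quartile of the per-area average login durations
    first_duration = quantiles(all_region_avg_durations, n=4)[0]
    condition_four = total_time < first_duration
    condition_five = total_time < second_duration
    condition_six = total_amount > fourth_preset
    return condition_four and condition_five and condition_six


region_avgs = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
# Hypothetical thresholds: second duration 18 minutes, fourth preset value 100
suspect = has_second_type_fault(15.0, 500, region_avgs, 18.0, 100)
```

Requiring condition six keeps areas with very few terminals from being flagged on the basis of a small sample.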
It should be noted that when the network environment of a terminal is poor, the terminal may attempt to connect repeatedly without limit, which lowers the value of the total time information and causes misjudgment, so that a fault-free situation is classified as a fault. To solve this problem and avoid misjudging the second type of fault, the method of the invention further includes:
acquiring the number of times a terminal accesses the server;
determining the audience ratio of the access times of the terminal according to the number of times the terminal accesses the server;
and when condition four, condition five and condition six are all satisfied, judging whether the audience ratio is smaller than a preset audience ratio, and if so, determining that the predetermined item served by the area to which the server belongs has no fault.
The preset audience ratio may be set according to the network condition of the area where the server is located, and is preferably 70%. That is, when the number of times a certain terminal accesses the server accounts for 30% of the total number of times all terminals access the server, the situation is considered not to conform to the second type of fault: the server is determined to be operating normally and the problem is attributed to the terminal.
Further, in order to further improve the determination accuracy and avoid the occurrence of a misjudgment condition, the fault identification method of the present invention further includes:
when all of the following conditions are satisfied, it is determined that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in the area to which the server belongs;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in the area to which the server belongs;
more than a predetermined proportion of all the terminals accessing the area access the server.
If exactly one server satisfies all three of the above conditions, that server is judged to have failed, and the problem is not attributed to the predetermined item in the whole area.
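The three conditions for isolating a single faulty server can be sketched as follows. The `ServerStats` record, the function name and the sample figures are hypothetical. Returning `None` for an area with only one server is also an assumption, made because the comparison with "any other server" has nothing to compare against in that case.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ServerStats:
    terminals: int            # number of terminals accessing the server
    avg_login_minutes: float  # average login duration of those terminals


def find_faulty_server(
    region_has_second_type_fault: bool,
    servers: Dict[str, ServerStats],
    min_share: float,
) -> Optional[str]:
    """Return the single server meeting all three conditions, else None."""
    if not region_has_second_type_fault or len(servers) < 2:
        return None
    total = sum(s.terminals for s in servers.values())
    if total == 0:
        return None
    hits = []
    for name, s in servers.items():
        others = [o for key, o in servers.items() if key != name]
        most_terminals = all(s.terminals > o.terminals for o in others)
        shortest_login = all(s.avg_login_minutes < o.avg_login_minutes for o in others)
        large_share = s.terminals / total > min_share
        if most_terminals and shortest_login and large_share:
            hits.append(name)
    return hits[0] if len(hits) == 1 else None


area = {
    "a": ServerStats(terminals=60, avg_login_minutes=5.0),
    "b": ServerStats(terminals=25, avg_login_minutes=30.0),
    "c": ServerStats(terminals=15, avg_login_minutes=28.0),
}
faulty = find_faulty_server(True, area, min_share=0.5)
```

In the sample area, server "a" carries 60% of the terminals and has the shortest average login duration, so it alone is flagged.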
The server fault identification method is deployed on a Linux system. When the method determines that a fault has occurred, which may be a first type of fault, a second type of fault, or a fault of an individual server, at least one of the following is used to handle it:
1. recording the fault information and pushing it to the back end of the Linux system so that operators can see the fault information of the server, thereby realizing collaborative management;
2. shutting down the service of the server involved in the fault;
3. sending the fault information of the server to maintenance personnel for handling by mail, short message or WeChat;
4. after the fault of the faulty server is cleared, changing the state of the server to the normal online state in the back end of the Linux system so that the server can be used normally.
Before the fault identification method of the present application is run, a fault identification model needs to be established. In establishing the model, faulty servers are sampled to construct sample data, from which the various thresholds used in fault identification are obtained. During fault sampling, all server fault information within a sufficiently long time interval is collected, so that the sampled data are sufficient and reflect the fault information clearly and completely. For example, all information from 0:00 to 24:00 of a day, including fault information and non-fault information, is collected, and the fault information is processed, for example by removing unnecessary noise and transforming the data, to obtain the value of each fault point with respect to the judgment conditions. For example, the first judgment condition involves the first preset value for judging the change trend information and the third preset value for judging the login duration information, so the fault data must be processed to obtain these two values. In processing the fault data, the fault change trend information of a fault information acquisition interval is obtained as the slope of the line connecting the 2n+1 nodes, and the slope of this straight line is used as the change trend information of the fault information acquisition interval.
The fault login duration information includes the total duration spent in the predetermined item, such as a predetermined game, after the faulty terminal accesses the server. For example, if the faulty terminal takes 5 minutes from triggering the App to exiting the game, its fault login duration information is 5 minutes.
Using a CART decision tree, the change trend information, the login duration information and the like are input as model features (independent variables), and whether the server is a faulty server is used as the dependent variable. After parameter tuning, the decision tree automatically yields the first preset value, the third preset value and the other thresholds used in the first judgment condition. Similarly, every threshold used for judgment in the method can be determined by sampling and a CART decision tree. Relative values, such as quartiles, appear in the judgment conditions because future judgments require a relative value, whereas a threshold determined by the previous classification model only approximates a relative value in the current sample set.
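To illustrate how a CART-style tree can yield a threshold such as the third preset value, the sketch below searches for the single split on one feature that minimizes the weighted Gini impurity, which is the splitting criterion CART uses for classification. It is a one-split simplification, not a full tree and not the model of the application; the function names and the sample data are hypothetical.

```python
from typing import List, Optional


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of binary labels (1 = faulty, 0 = normal)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values: List[float], labels: List[int]) -> Optional[float]:
    """Threshold t minimizing the weighted Gini of (value < t) vs (value >= t)."""
    pairs = sorted(zip(values, labels))
    best_t: Optional[float] = None
    best_score = float("inf")
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate equal feature values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = (low + high) / 2.0, score
    return best_t


# Hypothetical samples: login duration in minutes and whether a fault occurred
login_minutes = [2.0, 3.0, 5.0, 20.0, 30.0, 45.0]
is_fault = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(login_minutes, is_fault)
```

On these samples the best split falls midway between 5 and 20 minutes, so login durations below 12.5 minutes would be treated as indicating a fault.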
In addition, the server fault identification model must be tested before it goes online for server fault identification. In the test, the precision (PPV) is calculated from the confusion matrix as TP/(TP + FP), that is, the proportion of correct predictions among all results that the model predicts as positive. In testing, the PPV of the server fault identification method of the present application was greater than 90%.
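The precision calculation can be sketched as below. The prediction and ground-truth lists are made-up sample data for the example and are not test results from the application.

```python
from typing import Sequence


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """PPV = TP / (TP + FP), computed from paired predictions and ground truth."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)      # true positives
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)  # false positives
    if tp + fp == 0:
        raise ValueError("the model made no positive predictions")
    return tp / (tp + fp)


# Hypothetical data: True means "server judged faulty" / "server actually faulty"
predicted = [True, True, True, False, True]
actual = [True, False, True, True, True]
ppv = precision(predicted, actual)
```

Here three of the four positive predictions are correct, giving a precision of 0.75; the missed fault in the fourth position affects recall, not precision.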
The present application further provides a server failure recognition apparatus, where a server is used for servicing a predetermined item, a plurality of servers for servicing the predetermined item are distributed in a plurality of areas, and a terminal accesses the server for servicing the predetermined item when logging in the predetermined item, as shown in fig. 3, the recognition apparatus includes:
the abnormal information acquisition module is used for acquiring abnormal information which appears when the terminal is accessed to the server;
the interval determining module is used for determining a time node of the terminal access server with the abnormal information according to the abnormal information so as to obtain an information acquisition interval corresponding to the time node;
the judgment parameter acquisition module is used for acquiring the change trend information and the login duration information of the information acquisition interval;
and the first judgment module is used for judging whether the server has a fault or not according to the change trend information and the login duration information.
Further, the apparatus further comprises:
the area information acquisition module is used for acquiring total time information and total amount information of all terminals accessing the area to which the server belongs;
and the second judging module is used for judging whether the server has faults or not according to the total time information and the total amount information.
Wherein the area information acquisition module includes:
the area information determining unit is used for selecting, with the time node as the center, n consecutive data access points before the time node and n consecutive data access points after the time node in the time dimension, and taking the resulting 2n+1 nodes as information acquisition points, wherein the data access points are the time nodes at which other terminals access the server.
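The selection of the 2n+1 information acquisition points can be sketched as a window over time-ordered access nodes. The function name `acquisition_points`, the use of plain numeric timestamps, and the truncation of the window when fewer than n neighbours exist on one side are assumptions for illustration; the application does not state how edge cases are handled.

```python
def acquisition_points(other_access_times, anomaly_time, n):
    """Return the anomaly time node plus up to n neighbours on each side.

    other_access_times: time nodes at which other terminals accessed the server.
    anomaly_time: time node of the terminal that reported the abnormal information.
    """
    ordered = sorted(list(other_access_times) + [anomaly_time])
    center = ordered.index(anomaly_time)
    start = max(0, center - n)  # truncate at the start of the record
    return ordered[start:center + n + 1]


# Hypothetical access log (seconds since some reference point).
others = [100, 105, 110, 121, 130, 142]
window = acquisition_points(others, anomaly_time=118, n=2)
print(window)  # [105, 110, 118, 121, 130]
```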
Further, the first judging module is used for judging, by using a prestored first judgment condition and according to the change trend information and the login duration information, whether the predetermined item served by the server has a first type of fault;
and/or,
the second judging module is used for judging, by using a prestored second judgment condition and according to the total time information and the total amount information, whether the predetermined item served by the area to which the server belongs has a second type of fault.
Further, the first determining module is configured to perform the following determination:
when any one of the following conditions is satisfied, it is determined that the predetermined item served by the server has a fault:
condition one: the change trend information is less than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is less than a third preset value;
and/or,
the second judging module is used for executing the following judgment:
when all the following conditions are met, it is determined that the predetermined item served by the area to which the server belongs has a fault:
condition four: the total time information is less than a first duration, the first duration being the lower quartile of the average login durations of the terminals that access all the areas to log in to the predetermined item;
condition five: the total time information is less than a second duration;
condition six: the total amount information is greater than a fourth preset value.
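The two judgments above differ in how their conditions combine: conditions one to three are joined by "any", while conditions four to six are joined by "all". A minimal sketch follows, in which every function name, parameter name and numeric value is a hypothetical placeholder. The lower quartile is computed here with Python's `statistics.quantiles` using its default method, which is only one of several possible quartile definitions.

```python
import statistics


def first_type_fault(trend, login_duration, first_preset, second_preset, third_preset):
    """Conditions one to three: a fault is reported if ANY condition holds."""
    return (trend < first_preset
            or trend == second_preset
            or login_duration < third_preset)


def second_type_fault(total_time, total_amount, first_duration, second_duration, fourth_preset):
    """Conditions four to six: a fault is reported only if ALL conditions hold."""
    return (total_time < first_duration
            and total_time < second_duration
            and total_amount > fourth_preset)


def lower_quartile(average_durations):
    """First quartile of the per-area average login durations."""
    return statistics.quantiles(average_durations, n=4)[0]


# Hypothetical average login durations (seconds) across all areas.
area_averages = [40, 55, 60, 62, 70, 75, 90]
first_duration = lower_quartile(area_averages)

area_fault = second_type_fault(total_time=30, total_amount=200,
                               first_duration=first_duration,
                               second_duration=50, fourth_preset=100)
server_fault = first_type_fault(trend=0.4, login_duration=300,
                                first_preset=-0.5, second_preset=0.0,
                                third_preset=60)
print(area_fault, server_fault)
```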
Further, the apparatus further comprises:
the number obtaining module is used for obtaining the number of times that the terminal accesses the server;
the crowd ratio determining module is used for determining the crowd ratio of the access times of the terminal according to the number of times the terminal accesses the server;
and the third judging module is used for judging, when condition four, condition five and condition six are simultaneously met, whether the crowd ratio is smaller than a preset crowd ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault.
Still further, the apparatus further comprises:
a fourth judgment unit configured to perform the following judgment:
when all the following conditions are met, it is determined that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in that area;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in that area;
and more than a preset proportion of all the terminals accessing the area access the server.
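The fourth judgment narrows an area-level fault down to a single server. The sketch below assumes per-server terminal counts and average login durations are available as dictionaries keyed by a server identifier; the function name `server_has_fault`, the data layout and the sample numbers are all hypothetical.

```python
def server_has_fault(server_id, area_has_second_type_fault,
                     terminals_per_server, avg_login_per_server,
                     preset_proportion):
    """Return True only if all conditions of the fourth judgment hold."""
    if not area_has_second_type_fault:
        return False
    others = [s for s in terminals_per_server if s != server_id]
    count = terminals_per_server[server_id]
    avg = avg_login_per_server[server_id]
    # More terminals than any other server in the same area.
    most_terminals = all(count > terminals_per_server[s] for s in others)
    # Shorter average login duration than any other server in the same area.
    shortest_login = all(avg < avg_login_per_server[s] for s in others)
    # Share of the area's terminals that access this server.
    total = sum(terminals_per_server.values())
    dominant = total > 0 and count / total > preset_proportion
    return most_terminals and shortest_login and dominant


# Hypothetical area with three servers.
terminals = {"srv-a": 700, "srv-b": 200, "srv-c": 100}
avg_login = {"srv-a": 12.0, "srv-b": 95.0, "srv-c": 110.0}

print(server_has_fault("srv-a", True, terminals, avg_login, 0.5))  # True
print(server_has_fault("srv-b", True, terminals, avg_login, 0.5))  # False
```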
Since the server failure recognition device in the present application is used for implementing the above server failure recognition method, the functions and actions of the server failure recognition device are not described herein again.
In addition, the present application also provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the above-described server failure identification method.
According to the method and the device, server faults are sampled, an expert sample set is established, and each threshold value for judgment is determined through a classification prediction model according to the expert sample set, so that the server faults can be determined more accurately and rapidly.
The above-described aspects may be implemented individually or in various combinations, and such variations are within the scope of the present invention.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
It is to be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that an article or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of additional like elements in the article or device comprising the element.
The above embodiments are merely to illustrate the technical solutions of the present invention and not to limit the present invention, and the present invention has been described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent arrangements may be made without departing from the spirit and scope of the present invention and it should be understood that the present invention is to be covered by the appended claims.