CN110519102B - Server fault identification method and device and storage medium - Google Patents

Server fault identification method and device and storage medium

Info

Publication number
CN110519102B
Authority
CN
China
Prior art keywords
server
information
time
terminal
accessed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910865239.8A
Other languages
Chinese (zh)
Other versions
CN110519102A (en
Inventor
孙翌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Abacus Industrial Technology Co ltd
Original Assignee
Guiyang Gloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guiyang Gloud Technology Co ltd
Priority to CN201910865239.8A
Publication of CN110519102A
Application granted
Publication of CN110519102B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a server fault identification method, a device and a storage medium. A server serves a predetermined item, a plurality of servers serving the predetermined item are distributed across a plurality of areas, and a terminal accesses a server serving the predetermined item when logging in to the predetermined item. The method comprises the following steps: acquiring abnormal information that occurs when a terminal accesses a server; determining, according to the abnormal information, the time node at which the terminal with the abnormal information accessed the server, so as to obtain an information acquisition interval corresponding to the time node; acquiring change trend information and login duration information for the information acquisition interval; and judging, according to the change trend information and the login duration information, whether the server has failed. In this way the server fault and the location of the failed server are identified quickly and accurately, which addresses the long time consumption and inaccurate identification and positioning of failed servers in the prior art.

Description

Server fault identification method and device and storage medium
Technical Field
The present invention relates to the field of fault identification technologies, and in particular, to a server fault identification method, apparatus, and storage medium.
Background
Traditional server fault identification relies mainly on periodic checks by staff and on feedback from game players using the server. Although fault information can be obtained this way, many faults go unreported and the fault information arrives with serious delay, so the faulty server cannot be located quickly and accurately, and the personnel handling server faults are often left at a loss. Such methods depend too heavily on people, cannot mine fault data, and lack a clear fault-finding target.
An existing server background system can collect the duration of each access to the server by every user (that is, every game player who enters the game through a terminal), namely the actual online game time, together with information on terminals whose service was abnormal. The abnormal information and the access-time data reflect how users are using the server, but the appearance of abnormal terminal information does not necessarily mean that the server or the service has failed. Personnel handling server faults must diagnose manually by checking the background data item by item to locate the faulty server, enter it into the system, and then dispatch a maintenance engineer.
As a result, identifying and locating a faulty server in the prior art is time-consuming and not accurate enough.
Disclosure of Invention
In order to solve the technical problem, the invention provides a server fault identification method, a server fault identification device and a storage medium.
The invention provides a server fault identification method, wherein a server serves a predetermined item, a plurality of servers serving the predetermined item are distributed across a plurality of areas, and a terminal accesses a server serving the predetermined item when logging in to the predetermined item, the method comprising the following steps:
acquiring abnormal information that occurs when a terminal accesses a server;
determining, according to the abnormal information, the time node at which the terminal with the abnormal information accessed the server and an information acquisition interval corresponding to the time node;
acquiring change trend information and login duration information for the information acquisition interval;
and judging whether the server has failed according to the change trend information and the login duration information.
The method also has the following characteristics: the method further comprises the steps of:
acquiring total time information and total amount information of all terminals accessed to an area to which the server belongs;
and judging whether the server has faults or not according to the total time information and the total quantity information.
The method also has the following characteristics: the judging whether the server has failed according to the change trend information and the login duration information comprises:
judging, according to the change trend information and the login duration information and using a pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault;
and/or,
the judging whether the server has failed according to the total time information and the total quantity information comprises:
judging, according to the total time information and the total quantity information and using a pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault.
The method also has the following characteristics: the judging whether the predetermined item served by the server has a first type of fault comprises:
determining that the predetermined item served by the server has a fault when any one of the following conditions is satisfied:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
and/or,
the judging whether the predetermined item served by the area to which the server belongs has a second type of fault comprises:
determining that the predetermined item served by the area to which the server belongs has a fault when all of the following conditions are satisfied:
condition four: the total time information is smaller than a first duration, the first duration being the lower quartile of the average login durations with which terminals accessed in all the areas log in to the predetermined item;
condition five: the total time information is smaller than a second duration;
condition six: the total quantity information is larger than a fourth preset value.
The method also has the following characteristics: the method further comprises the following steps:
acquiring the number of times a terminal accesses the server;
determining the audience ratio of the terminal's access count according to the number of times the terminal accesses the server;
and when condition four, condition five and condition six are satisfied simultaneously, judging whether the audience ratio is smaller than a preset audience ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault.
The method also has the following characteristics: the method further comprises the following steps:
judging that the server has a fault when all of the following conditions are satisfied:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessed to the server is greater than the number of terminals accessed to any other server in that area;
the average login duration of the terminals accessed to the server is less than the average login duration of the terminals accessed to any other server in that area;
and more than a preset proportion of all the terminals accessed in the area are accessed to the server.
The method also has the following characteristics: the method for determining the information acquisition interval comprises the following steps:
taking the time node as the center, selecting n consecutive data access points before the time node and n after it in the time dimension, and taking the resulting 2n+1 nodes as information acquisition points, wherein the data access points are the time nodes at which other terminals accessed the server.
The method also has the following characteristics: the change trend information comprises the slope of the line connecting the 2n+1 nodes;
the login duration information includes a total duration spent in the predetermined item after the terminal accesses the server.
The application also provides a server failure recognition device, wherein the server is used for serving a predetermined item, a plurality of servers for serving the predetermined item are distributed in a plurality of areas, and a terminal accesses the server for serving the predetermined item when logging in the predetermined item, the recognition device comprises:
the abnormal information acquisition module is used for acquiring abnormal information which appears when the terminal is accessed to the server;
the interval determining module is used for determining a time node of the server accessed by the terminal with the abnormal information and an information acquisition interval corresponding to the time node according to the abnormal information;
the judgment parameter acquisition module is used for acquiring the change trend information and the login duration information of the information acquisition interval;
and the first judgment module is used for judging whether the server has a fault or not according to the change trend information and the login duration information.
The device also has the following characteristics: the device further comprises:
the area information acquisition module is used for acquiring total time information and total amount information of all terminals accessing the area to which the server belongs;
and the second judging module is used for judging whether the server has faults or not according to the total time information and the total quantity information.
The device also has the following characteristics: the first judging module is used for judging, according to the change trend information and the login duration information and using a pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault;
and/or,
the second judging module is used for judging, according to the total time information and the total quantity information and using a pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault.
The device also has the following characteristics: the first judging module is used for executing the following judgment:
when any one of the following conditions is satisfied, judging that the predetermined item served by the server has the first type of fault:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
and/or,
the second judging module is used for executing the following judgment:
when all of the following conditions are satisfied, judging that the predetermined item served by the area to which the server belongs has the second type of fault:
condition four: the total time information is smaller than a first duration, the first duration being the lower quartile of the average login durations with which terminals accessed in all the areas log in to the predetermined item;
condition five: the total time information is smaller than a second duration;
condition six: the total quantity information is larger than a fourth preset value.
The device also has the following characteristics: the device further comprises:
the number obtaining module is used for obtaining the number of times a terminal accesses the server;
the audience ratio determining module is used for determining the audience ratio of the terminal's access count according to the number of times the terminal accesses the server;
and the third judging module is used for judging, when condition four, condition five and condition six are satisfied simultaneously, whether the audience ratio is smaller than a preset audience ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault.
The device also has the following characteristics: the device further comprises:
a fourth judgment unit configured to perform the following judgment:
judging that the server has a fault when all of the following conditions are satisfied:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessed to the server is greater than the number of terminals accessed to any other server in that area;
the average login duration of the terminals accessed to the server is less than the average login duration of the terminals accessed to any other server in that area;
and more than a preset proportion of all the terminals accessed in the area are accessed to the server.
The device also has the following characteristics: the area information acquisition module includes:
the area information determining unit is used for selecting, with the time node as the center, n consecutive data access points before the time node and n after it in the time dimension, and taking the resulting 2n+1 nodes as information acquisition points, wherein the data access points are the time nodes at which other terminals accessed the server.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a server failure identification method as described above.
With the server fault identification method and device provided by the invention, the server accessed by a terminal with abnormal information can be assessed quickly using information that is already readily available in the prior art, such as the terminal's login duration and login time node, so that the server fault and the location of the failed server are identified quickly and accurately. This addresses the long time consumption and inaccurate identification and positioning of failed servers in the prior art.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a server failure identification method in an embodiment;
FIG. 2 is a second flowchart of a server failure identification method in the embodiment;
FIG. 3 is a block diagram of a server failure recognition apparatus in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
The application provides a server fault identification method, which can quickly judge a server accessed by a terminal with abnormal information according to information such as login duration, login time nodes and the like of the terminal, which can be obtained in the prior art, so that a server fault and the position of the server with the fault can be quickly and accurately identified.
Running a game or a large nationwide project requires servers to support it. To ensure that game players or users across the country can access the game or project quickly, servers are deployed in multiple regions, and each region may contain multiple servers. Each server may serve multiple games or projects, that is, multiple games or projects can run on one server at the same time.
As shown in FIG. 1, the server fault identification method includes the following steps:
S10, acquiring abnormal information that occurs when a terminal accesses the server;
S20, determining, according to the abnormal information, the time node at which the terminal with the abnormal information accessed the server, so as to obtain an information acquisition interval corresponding to the time node;
S30, acquiring change trend information and login duration information for the information acquisition interval;
S40, judging whether the server has failed according to the change trend information and the login duration information.
In S10, the abnormal information is sent by the terminal. For example, a crash, stuttering, a game start failure, or a failure to enter the game through the App while the terminal is logging in to the game all count as abnormal information. The causes of a game start failure are complex, and although the abnormal information behind it is reported by the server, this application cannot cover every such problem. In general, most abnormal information that appears when a terminal accesses the server reflects relatively minor abnormalities, such as stuttering, crashes, a player disconnecting and reconnecting, failure to establish a decoding channel, or insufficient server GPU resources. When such a minor abnormality occurs, the probability that the server has failed is very low, so no fault identification is needed and the abnormality can be ignored. A serious abnormality during access, by contrast, is likely to be caused by a server failure, so fault identification is performed on the server only in that case. The abnormal information referred to in S10 therefore means abnormal information of the more serious kind. Such abnormal information can be acquired using methods already available in the prior art.
The above steps mainly judge whether the predetermined item (a game or a large project) running on the server has a problem. To further ensure that the predetermined item runs normally and stably, the method also includes steps for judging whether the predetermined item running in the area to which that server belongs (the server on which abnormal information appeared when the terminal logged in) has a problem. As shown in FIG. 2, these steps are:
S50, acquiring total time information and total quantity information of all terminals accessed in the area to which the server belongs;
S60, judging whether the server has failed according to the total time information and the total quantity information.
The above method involves the information acquisition interval, which can be regarded as a data collection window: data are collected at the nodes inside the interval to obtain the change trend information and the login duration information of S30. The interval is determined as follows: taking the time node as the center, n consecutive data access points before the time node and n after it are selected in the time dimension, and the resulting 2n+1 nodes are taken as information acquisition points, where the data access points are the time nodes at which other terminals accessed the server. The value of n is chosen according to actual conditions, for example the performance of the server: when server performance is good, fewer nodes may be selected, and when it is poor, more nodes are selected to obtain more accurate data. Preferably, n is between 5 and 20. The n upstream nodes are the time nodes at which terminals accessed the server before the terminal with the abnormal information did; each may be a node of a normal login or a node at which abnormal information appeared. The n downstream nodes are the time nodes at which terminals accessed the server after the terminal with the abnormal information did. Both the upstream and the downstream nodes are consecutive in time, which ensures the reliability of the change trend information obtained from the interval.
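As a minimal sketch of the interval selection described above, the following assumes that the access time nodes of a server are available as a list of numbers (for example, timestamps in seconds). The function name acquisition_interval, and the choice to return None when fewer than n nodes exist on either side of the time node, are illustrative assumptions and are not specified in the application.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def acquisition_interval(
    access_times: Sequence[float], anomaly_time: float, n: int
) -> Optional[List[float]]:
    """Return the 2n+1 access time nodes centred on anomaly_time.

    access_times holds the time nodes at which terminals accessed the server,
    including the node of the terminal that reported abnormal information.
    Returns None when fewer than n nodes exist before or after the centre
    (an assumption; the application does not say what to do in that case).
    """
    times = sorted(access_times)
    i = bisect_left(times, anomaly_time)
    if i == len(times) or times[i] != anomaly_time:
        raise ValueError("anomaly_time must be one of the recorded access times")
    if i < n or i + n >= len(times):
        return None
    # n upstream nodes, the centre node, and n downstream nodes
    return times[i - n : i + n + 1]
```

For example, with ten access nodes at times 0 to 9, an abnormal node at time 5 and n = 2, the interval consists of the nodes at times 3 to 7.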
In a specific embodiment, the change trend information obtained in S30 for the information acquisition interval is the slope of the line connecting the 2n+1 nodes. That is, if the 2n+1 nodes are plotted in time order with time as the abscissa and the time spent accessing the server as the ordinate, they lie approximately along a straight line, distributed on both sides of it. The slope of this straight line is taken as the change trend information of the information acquisition interval.
In S30, the login duration information includes the total duration spent in the predetermined item, such as a predetermined game, after the terminal accesses the server. For example, if 30 minutes elapse between the App being launched on a terminal and the game being exited, the login duration information of that terminal is 30 minutes.
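The application does not state how the approximating straight line is fitted. One common choice, shown here purely as an assumption, is an ordinary least-squares fit; the function name trend_slope is hypothetical.

```python
from typing import Sequence


def trend_slope(times: Sequence[float], durations: Sequence[float]) -> float:
    """Least-squares slope of durations against times for the 2n+1 nodes.

    Least squares is an assumed fitting method; the application only says the
    nodes lie approximately on a straight line whose slope is used.
    """
    if len(times) != len(durations) or len(times) < 2:
        raise ValueError("need at least two (time, duration) pairs of equal length")
    mean_t = sum(times) / len(times)
    mean_d = sum(durations) / len(durations)
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time nodes are identical; slope is undefined")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, durations))
    return sxy / sxx
```

A negative slope means the durations shrink across the interval, while a slope of exactly 0 means every node has the same duration.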
In S50, the total time information is determined as follows: determine the area to which the server belongs, obtain the number of servers in the area, count the terminals accessed to each server, measure the online time of each terminal from accessing the server to disconnecting from it, sum the online times of all terminals on all servers in the area, and take the average as the total time information. In other words, it is the average game running time of the players whose terminals accessed any server in the area.
In S50, the total quantity information is determined as follows: determine the area to which the server belongs, obtain the number of servers in the area, count the terminals accessed to each server, and sum the number of terminals accessed to all servers in the area. In other words, it is the total sample size of players of a given game on the servers of the area.
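The two regional quantities of S50 can be sketched as follows, assuming each server in the area is represented by a list with one online duration per accessed terminal. The function name region_totals and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple


def region_totals(sessions_by_server: Dict[str, List[float]]) -> Tuple[float, int]:
    """Return (total time information, total quantity information) for an area.

    sessions_by_server maps a server name to the online durations of the
    terminals accessed to it (one entry per terminal is an assumption).
    The total time information is the average online duration over all
    terminals in the area; the total quantity is the number of terminals.
    """
    durations = [d for sessions in sessions_by_server.values() for d in sessions]
    total_quantity = len(durations)
    if total_quantity == 0:
        raise ValueError("no terminals accessed any server in this area")
    total_time = sum(durations) / total_quantity
    return total_time, total_quantity
```

For instance, an area with one server holding sessions of 30 and 10 minutes and another holding a single 20 minute session has a total time information of 20 minutes and a total quantity of 3.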
In a preferred embodiment, judging whether the server has failed according to the change trend information and the login duration information includes:
judging, according to the change trend information and the login duration information and using a pre-stored first judgment condition, whether the predetermined item served by the server has a first type of fault.
Specifically, this judgment includes:
determining that the predetermined item served by the server has a fault when any one of the following conditions is satisfied:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value.
The second preset value is 0; that is, condition two indicates that the login duration is the same at every node in the information acquisition interval. Because terminals differ in their own performance and network environment, their login durations normally differ, so a change trend of 0 itself indicates a problem. The first and third preset values are determined by a general classification prediction model, such as a CART decision tree, when the fault identification model is established (described in detail later).
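The first judgment condition can be sketched as below. The threshold values in the usage note are placeholders only, since the application obtains them from a trained model; the function name and the small tolerance used for the equality test of condition two are also assumptions.

```python
def has_first_type_fault(
    slope: float,
    login_duration: float,
    first_preset: float,
    third_preset: float,
    second_preset: float = 0.0,
    tol: float = 1e-9,
) -> bool:
    """True when any one of conditions one to three holds.

    second_preset defaults to 0 as stated in the description; tol is an
    assumed tolerance so that floating-point slopes can be compared to it.
    """
    condition_one = slope < first_preset
    condition_two = abs(slope - second_preset) <= tol
    condition_three = login_duration < third_preset
    return condition_one or condition_two or condition_three
```

With placeholder thresholds first_preset = -1.0 and third_preset = 5 (minutes), a slope of 0.5 with a 30 minute login is judged normal, whereas a slope of -3.0, a slope of 0, or a 2 minute login each triggers a first type of fault.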
In another preferred embodiment, judging whether the server has failed according to the total time information and the total quantity information includes:
judging, according to the total time information and the total quantity information and using a pre-stored second judgment condition, whether the predetermined item served by the area to which the server belongs has a second type of fault.
Specifically, this judgment includes:
determining that the predetermined item served by the area to which the server belongs has a fault when all of the following conditions are satisfied:
condition four: the total time information is smaller than a first duration, the first duration being the lower quartile of the average login durations with which terminals accessed in all the areas log in to the predetermined item;
condition five: the total time information is smaller than a second duration;
condition six: the total quantity information is larger than a fourth preset value.
The first duration is determined statistically: the average login durations with which terminals accessed in all the areas log in to the predetermined item are sorted from longest to shortest, and condition four is considered met when the total time information falls in the last quarter of this ordering. The second duration and the fourth preset value are likewise determined by a general classification prediction model, such as a CART decision tree, when the server fault identification model is established.
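The second judgment condition can be sketched as follows, using the lower quartile from Python's standard statistics module as the first duration. The function name, the argument layout and the numbers in the usage note are illustrative assumptions; the second duration and fourth preset value would come from the trained model.

```python
import statistics
from typing import Sequence


def has_second_type_fault(
    total_time: float,
    total_quantity: int,
    all_area_avg_durations: Sequence[float],
    second_duration: float,
    fourth_preset: int,
) -> bool:
    """True only when conditions four, five and six all hold.

    all_area_avg_durations holds the average login duration of every area
    (at least two values are needed to compute quartiles).
    """
    # First duration: lower quartile of the per-area average login durations.
    first_duration = statistics.quantiles(all_area_avg_durations, n=4)[0]
    condition_four = total_time < first_duration
    condition_five = total_time < second_duration
    condition_six = total_quantity > fourth_preset
    return condition_four and condition_five and condition_six
```

For example, with per-area averages of 10, 20, 30, 40 and 50 minutes, an area whose total time information is 12 minutes over 500 terminals is flagged when the placeholder thresholds are 15 minutes and 100 terminals, but not if only 50 terminals were accessed.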
It should be noted that when the network environment of a terminal is poor, the terminal may keep trying to connect without limit, which lowers the total time information and can cause a fault-free situation to be misjudged as a fault. To avoid such misjudgment of the second type of fault, the method of the invention further comprises:
acquiring the number of times a terminal accesses the server;
determining the audience ratio of the terminal's access count according to the number of times the terminal accesses the server;
and when condition four, condition five and condition six are satisfied simultaneously, judging whether the audience ratio is smaller than a preset audience ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault.
The preset audience ratio may be set according to the network conditions of the area where the server is located, and is preferably 70%. That is, when the number of times a certain terminal accesses the server accounts for 30% of the total number of times all terminals access the server, the situation is considered not to match the second type of fault: the server is judged to be operating normally, and the problem is attributed to the terminal.
Further, to improve determination accuracy and avoid misjudgment, the fault identification method of the present invention further includes:
when all the following conditions are met, judging that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in that area;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in that area;
more than a predetermined proportion of all the terminals accessing the area are accessing the server.
If one and only one server satisfies all three conditions, that server is judged to have failed, and the problem is not attributed to the predetermined item across the whole area.
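The three conditions for isolating a single faulty server can be sketched as below. `ServerStats`, `find_single_faulty_server` and the `preset_share` parameter are hypothetical names introduced for illustration; the text does not give a value for the predetermined proportion.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    name: str
    terminal_count: int        # terminals currently accessing this server
    avg_login_duration: float  # average login duration of those terminals


def find_single_faulty_server(servers, region_has_second_type_fault, preset_share):
    """Return the one server meeting all three conditions, or None."""
    if not region_has_second_type_fault:
        return None
    total = sum(s.terminal_count for s in servers)
    if total == 0:
        return None
    matches = []
    for s in servers:
        others = [o for o in servers if o is not s]
        if not others:
            continue  # no other server in the area to compare against
        most_terminals = all(s.terminal_count > o.terminal_count for o in others)
        shortest_login = all(s.avg_login_duration < o.avg_login_duration for o in others)
        large_share = s.terminal_count / total > preset_share
        if most_terminals and shortest_login and large_share:
            matches.append(s)
    # The fault is attributed to a server only if exactly one server qualifies.
    return matches[0] if len(matches) == 1 else None
```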
The server fault identification method is deployed on a Linux system. When a fault is identified by the method, whether a first-type fault, a second-type fault or a fault of an individual server, at least one of the following measures is taken:
1. recording the fault information and pushing it to the background of the Linux system so that operators can see the fault information of the server, thereby enabling cooperative management;
2. shutting down the service of the server involved in the fault;
3. sending the fault information of the server to maintenance personnel by mail, short message or WeChat for handling;
4. after the fault of the faulty server is removed, changing the state of the server to a normal online state in the background of the Linux system, so that the server can be used normally.
Before the fault identification method of the present application is run, a fault identification model needs to be established. In establishing the model, faulty servers are sampled to construct sample data, from which the various thresholds used in fault identification are obtained. During fault sampling, all server fault information within a sufficiently long time interval is collected, so that the sampled data is sufficient and reflects the fault information clearly and completely. For example, all information from 0:00 to 24:00 of a day, including fault information and non-fault information, is collected, and the fault information is processed, for example by removing unnecessary noise and transforming the data, to obtain the values at the fault points that relate to the judgment conditions. For example, the first judgment condition involves a first preset value for judging the change trend information and a third preset value for judging the login duration information, so the fault data needs to be processed to obtain the first preset value and the third preset value. In processing the fault data, the slope of the line through the 2n+1 nodes is obtained from the fault change trend information in the fault information acquisition interval, and the slope of this straight line is used as the change trend information of the fault information acquisition interval.
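The slope computation can be illustrated with a short Python sketch. The claims describe a straight line with the 2n+1 nodes distributed on both sides of it; an ordinary least-squares fit is assumed here as one way to obtain such a line, and the function name `trend_slope` is hypothetical.

```python
def trend_slope(times, access_costs):
    """Least-squares slope of access cost (ordinate) against time (abscissa)."""
    n = len(times)
    if n < 2 or n != len(access_costs):
        raise ValueError("need at least two points and equal-length inputs")
    mean_t = sum(times) / n
    mean_c = sum(access_costs) / n
    # Sum of squared deviations of the time values.
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time values are identical")
    # Sum of cross-deviations between time and access cost.
    sxy = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, access_costs))
    return sxy / sxx
```

A negative slope means the time spent accessing the server is falling across the interval, and a slope of 0 means it is flat.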
The fault login duration information is the total duration spent in the predetermined item, such as a predetermined game, after the faulty terminal accesses the server. For example, if the faulty terminal takes 5 minutes from launching the APP to exiting the game, the fault login duration information of that terminal is 5 minutes.
Using the CART decision tree, the change trend information, the login duration information and the like are fed into the decision tree as input feature dimensions (independent variables), and whether the server is a faulty server is used as the dependent variable. After parameter tuning, the decision tree automatically calculates the first preset value, the third preset value and the other values used in the first judgment condition. Similarly, every threshold used for judgment in the method can be determined by sampling and by using a CART decision tree. Non-fixed values, such as quartiles, appear in the judgment conditions because future judgments require a relative value, and the threshold determined by the classification model described above is close to a relative value in the current sample set.
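How a CART-style tree turns labeled samples into a threshold can be sketched with a single split on one feature. This is a deliberately simplified stand-in for a full CART decision tree, written in plain Python; the function names are hypothetical, and a real model would split on several features and be tuned as the text describes.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = fault, 0 = normal)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return the split point t minimising the weighted Gini impurity of
    the two groups value < t and value >= t, or None if no split exists."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_t, best_score = None, float("inf")
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t
```

Applied to sampled login durations labeled as fault or normal, the returned split point would play the role of the third preset value.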
In addition, the server fault identification model must be tested before it goes online and is used for server fault identification. In the test, the precision (positive predictive value, PPV) is calculated from a confusion matrix as TP/(TP + FP), that is, the proportion of correct predictions among all results that the model predicted as positive. In testing, the PPV of the server fault identification method of the present application is greater than 90%.
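The precision formula TP/(TP + FP) can be computed directly from paired predictions and ground-truth labels. The sketch below is illustrative only; the function name and the list-based input format are assumptions.

```python
def precision(predicted, actual):
    """PPV = TP / (TP + FP) for binary labels (truthy = fault predicted/present)."""
    # True positives: predicted fault and actually a fault.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    # False positives: predicted fault but actually normal.
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if tp + fp == 0:
        raise ValueError("the model made no positive predictions")
    return tp / (tp + fp)
```

For instance, four positive predictions of which three are real faults give a precision of 0.75; the model would go online only once this figure exceeds the required level.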
The present application further provides a server fault identification apparatus, wherein the server is used for serving a predetermined item, a plurality of servers serving the predetermined item are distributed in a plurality of areas, and a terminal accesses a server serving the predetermined item when logging in to the predetermined item. As shown in Fig. 3, the identification apparatus includes:
the abnormal information acquisition module is used for acquiring abnormal information which appears when the terminal is accessed to the server;
the interval determining module is used for determining a time node of the terminal access server with the abnormal information according to the abnormal information so as to obtain an information acquisition interval corresponding to the time node;
the judgment parameter acquisition module is used for acquiring the change trend information and the login duration information of the information acquisition interval;
and the first judgment module is used for judging whether the server has a fault or not according to the change trend information and the login duration information.
Further, the apparatus further comprises:
the area information acquisition module is used for acquiring total time information and total amount information of all terminals accessing the area to which the server belongs;
and the second judging module is used for judging whether the server has faults or not according to the total time information and the total amount information.
Wherein the area information acquisition module includes:
an area information determining unit, used for selecting, with the time node as the center, n adjacent data access points upstream and n adjacent data access points downstream of the time node in the time dimension, and taking the 2n+1 nodes as information acquisition points, wherein the data access points are the time nodes at which other terminals access the server.
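The selection of the 2n+1 information acquisition points can be sketched as a window over the sorted access time nodes. The function name is hypothetical, and raising an error when fewer than n neighbours exist on either side is an assumption, since the text does not describe that case.

```python
def acquisition_points(access_times, anomaly_time, n):
    """Return the 2n+1 time nodes centred on the anomalous access time."""
    ordered = sorted(access_times)
    center = ordered.index(anomaly_time)  # raises ValueError if absent
    if center - n < 0 or center + n >= len(ordered):
        raise ValueError("fewer than n access points on one side of the time node")
    # n upstream nodes, the anomalous node itself, and n downstream nodes.
    return ordered[center - n:center + n + 1]
```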
Further, the first judging module is used for judging whether the first type of fault exists in the preset item of the server service by utilizing a prestored first judging condition according to the change trend information and the login duration information;
and/or,
and the second judging module is used for judging whether a second type of fault exists in the preset item of the area service of the server by utilizing a prestored second judging condition according to the total time information and the total amount information.
Further, the first judging module is configured to perform the following judgment:
when any one of the following conditions is met, judging that the predetermined item served by the server has a fault:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
and/or,
the second judging module is configured to perform the following judgment:
when all the following conditions are met, judging that the predetermined item served by the area to which the server belongs has a fault:
condition four: the total time information is less than a first duration, the first duration being the lower quartile of the average login durations of the terminals that access all the areas and log in to the predetermined item;
condition five: the total time information is less than a second duration;
condition six: the total amount information is greater than a fourth preset value.
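The two judgments performed by these modules reduce to simple boolean combinations, sketched below. The function and parameter names are hypothetical; the default of 0 for the second preset value follows the claims, and all other thresholds are assumed to come from the model-building step.

```python
def first_type_fault(trend, login_duration, first_preset, third_preset,
                     second_preset=0.0):
    """Conditions one to three: any single condition indicates a fault."""
    return (trend < first_preset            # condition one
            or trend == second_preset       # condition two
            or login_duration < third_preset)  # condition three


def second_type_fault(total_time, total_amount, first_duration,
                      second_duration, fourth_preset):
    """Conditions four to six: all must hold for a regional fault."""
    return (total_time < first_duration        # condition four
            and total_time < second_duration   # condition five
            and total_amount > fourth_preset)  # condition six
```

The first judgment is a logical OR and the second a logical AND, which is why the regional judgment is the stricter of the two.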
Further, the apparatus further comprises:
the number obtaining module is used for obtaining the number of times the terminal accesses the server;
the audience ratio determining module is used for determining the audience ratio of the terminal's access times according to the number of times the terminal accesses the server;
and the third judging module is used for judging, when condition four, condition five and condition six are met simultaneously, whether the audience ratio is smaller than the preset audience ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault.
Still further, the apparatus further comprises:
a fourth judgment unit configured to perform the following judgment:
when all the following conditions are met, judging that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in that area;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in that area;
and more than a preset proportion of all the terminals accessing the area are accessing the server.
Since the server failure recognition device in the present application is used for implementing the above server failure recognition method, the functions and actions of the server failure recognition device are not described herein again.
In addition, the present application also provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the above-described server failure identification method.
According to the method and the device, server faults are sampled, an expert sample set is established, and each threshold value for judgment is determined through a classification prediction model according to the expert sample set, so that the server faults can be determined more accurately and rapidly.
The above-described aspects may be implemented individually or in various combinations, and such variations are within the scope of the present invention.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
It is to be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, so that an article or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of additional like elements in the article or device comprising the element.
The above embodiments are merely to illustrate the technical solutions of the present invention and not to limit the present invention, and the present invention has been described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent arrangements may be made without departing from the spirit and scope of the present invention and it should be understood that the present invention is to be covered by the appended claims.

Claims (11)

1. A server failure recognition method, wherein the server is used for servicing a predetermined item, a plurality of servers for servicing the predetermined item are distributed in a plurality of areas, and a terminal accesses the server for servicing the predetermined item when logging in the predetermined item, the method comprising the steps of:
acquiring abnormal information occurring when a terminal accesses a server;
according to the abnormal information, determining a time node of the terminal with the abnormal information accessing the server and an information acquisition interval corresponding to the time node; the method for determining the information acquisition interval comprises the following steps:
taking the time node as a center, selecting n connected data access points at the upstream and downstream of the time node in the time dimension, and taking 2n +1 nodes as information acquisition points, wherein the data access points are the time nodes for accessing other terminals into the server;
the n nodes at the upstream of the time node are the time nodes accessed by the terminal which is accessed into the server before the terminal with abnormal information is accessed into the server, and the n nodes at the downstream of the time node are the time nodes accessed into the terminal of the server after the terminal with abnormal information is accessed into the server;
acquiring the change trend information and login duration information of the information acquisition interval;
judging whether a preset item of the server service fails or not according to the change trend information and the login duration information;
the change trend information comprises the slope of the connecting line of 2n +1 nodes; the method for acquiring the change trend information of the information acquisition interval comprises the following steps:
sequentially connecting 2n +1 nodes according to a time sequence by taking time as an abscissa and time spent when a server is accessed as an ordinate to obtain a straight line, wherein the 2n +1 nodes are distributed on two sides of the straight line, and the slope of the straight line is the change trend information of the information acquisition interval;
and the login duration information is the total duration spent in the preset project after the terminal accesses the server.
2. The server failure identification method according to claim 1, characterized in that the method further comprises the steps of:
acquiring total time information and total amount information of all terminals accessed to an area to which the server belongs;
judging whether a preset item running on the area of the server with abnormal information during terminal login has a fault or not according to the total time information and the total amount information;
the method for determining the total time information comprises the following steps:
determining the area to which the server belongs, acquiring the number of the servers in the area, counting the number of terminals accessed in each server, counting the online time of each terminal from the moment when the server is accessed to the moment when the terminal is disconnected from the server, summing the online time of all the terminals in all the servers in the area, and then averaging the online time of all the terminals in all the servers in the area, wherein the average value is total time information;
the method for determining the total amount information comprises the following steps: determining the area of the server, acquiring the number of the servers in the area, counting the number of terminals accessed by each server, and summing the number of all the terminals accessed by all the servers in the area to obtain total number information.
3. The server failure identification method according to claim 2, wherein the determining whether the predetermined item of the server service fails according to the change trend information and the login duration information comprises:
judging whether the preset items of the server service have a first type of fault or not by utilizing a prestored first judgment condition according to the change trend information and the login duration information;
the step of judging whether the preset item of the server service has a first type of fault or not by using a pre-stored first judgment condition according to the change trend information and the login duration information comprises the following steps:
determining that a predetermined item of the server service has a failure when any one of the following conditions is satisfied:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
wherein the second preset value is 0; the first preset value and the third preset value are determined by a classification prediction model when the fault identification model is established;
and/or,
the judging whether the server has the fault according to the total time information and the total quantity information comprises the following steps:
judging whether a second type of fault exists in the preset item of the regional service of the server by utilizing a pre-stored second judgment condition according to the total time information and the total quantity information;
the judging whether the predetermined item of the regional service to which the server belongs has a fault by using a pre-stored second judgment condition according to the total time information and the total amount information includes:
when the following conditions are all met, judging that the second type of fault exists in the preset item of the regional service to which the server belongs:
the condition four is that the total time information is smaller than a first duration, and the first duration is a lower quartile of an average login duration of the terminals accessed into all the areas for logging in the predetermined item;
the condition five is that the total time information is less than a second duration;
the condition six is that the total quantity information is greater than a fourth preset value;
the first duration is determined by a statistical method, that is, the average login durations of the terminals that access all the areas and log in to the predetermined item are arranged in order from longest to shortest, and when the total time is located in the last quarter of the arrangement, the total time meets condition four;
and the second duration and the fourth preset value are determined by a classification prediction model when the server fault identification model is established.
4. The server failure identification method according to claim 3, wherein the method further comprises:
acquiring the number of times a terminal accesses the server;
determining the audience ratio of the terminal's access times according to the number of times the terminal accesses the server;
when condition four, condition five and condition six are met simultaneously, judging whether the audience ratio is smaller than a preset audience ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault;
the audience ratio is the ratio of the number of times a certain terminal accesses the server to the total number of times all terminals access the server.
5. The server failure identification method according to claim 4, wherein the method further comprises:
when all the following conditions are met, judging that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in that area;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in that area;
and more than a preset proportion of all the terminals accessing the area are accessing the server.
6. An apparatus for identifying a failure of a server, the server serving a predetermined item, a plurality of servers serving the predetermined item being distributed in a plurality of areas, and a terminal accessing the server serving the predetermined item when registering the predetermined item, the apparatus comprising:
the abnormal information acquisition module is used for acquiring abnormal information which appears when the terminal is accessed to the server;
the interval determining module is used for determining a time node of the server accessed by the terminal with the abnormal information and an information acquisition interval corresponding to the time node according to the abnormal information;
the area information acquisition module includes:
the area information determining unit is used for selecting n connected data access points at the upstream and downstream of the time node in the time dimension by taking the time node as a center, and taking 2n +1 nodes as information acquisition points, wherein the data access points are the time nodes for accessing other terminals into the server;
the n nodes at the upstream of the time node are the time nodes accessed by the terminal which is accessed into the server before the terminal with abnormal information is accessed into the server, and the n nodes at the downstream of the time node are the time nodes accessed into the terminal of the server after the terminal with abnormal information is accessed into the server;
the judgment parameter acquisition module is used for acquiring the change trend information and the login duration information of the information acquisition interval; wherein, the change trend information comprises the slope of the 2n +1 node connecting lines; the login duration information is the total duration spent in the preset project after the terminal accesses the server;
the judgment parameter acquisition module is specifically configured to:
sequentially connecting 2n +1 nodes according to a time sequence by taking time as an abscissa and time spent when a server is accessed as an ordinate to obtain a straight line, wherein the 2n +1 nodes are distributed on two sides of the straight line, and the slope of the straight line is the change trend information of the information acquisition interval;
and the first judgment module is used for judging whether the preset item of the server service has a fault or not according to the change trend information and the login duration information.
7. The server failure recognition apparatus according to claim 6, wherein the apparatus further comprises:
the area information acquisition module is used for acquiring total time information and total amount information of all terminals accessing the area to which the server belongs;
the area information acquisition module is specifically configured to:
determining the area to which the server belongs, acquiring the number of the servers in the area, counting the number of terminals accessed in each server, counting the online time of each terminal from the moment when the server is accessed to the moment when the terminal is disconnected from the server, summing the online time of all the terminals in all the servers in the area, and then averaging the online time of all the terminals in all the servers in the area, wherein the average value is total time information;
the area information acquiring module is specifically further configured to:
determining the area of the server, acquiring the number of the servers in the area, counting the number of terminals accessed by each server, and summing the number of all the terminals accessed by all the servers in the area to obtain total number information;
and the second judgment module is used for judging whether the preset item running on the region where the abnormal information appears during the terminal login has a fault or not according to the total time information and the total amount information.
8. The server fault recognition device according to claim 7, wherein the first determining module is configured to determine whether a first type of fault exists in the predetermined item of the server service according to the change trend information and the login duration information by using a pre-stored first determining condition;
the first judging module is used for executing the following judgment:
when any one of the following conditions is met, judging that the first type of fault exists in the preset item of the server service:
condition one: the change trend information is smaller than a first preset value;
condition two: the change trend information is equal to a second preset value;
condition three: the login duration information is smaller than a third preset value;
wherein the second preset value is 0; the first preset value and the third preset value are determined by a classification prediction model when the fault identification model is established;
and/or,
the second judging module is used for judging whether a second type of fault exists in the preset item of the regional service of the server by utilizing a prestored second judging condition according to the total time information and the total amount information;
the second judging module is used for executing the following judgment:
when the following conditions are all met, judging that the second type of fault exists in the preset item of the regional service to which the server belongs:
the condition four is that the total time information is smaller than a first duration, and the first duration is a lower quartile of an average login duration of the terminals accessed into all the areas for logging in the predetermined item;
the condition five is that the total time information is less than a second duration;
the condition six is that the total quantity information is greater than a fourth preset value;
the first duration is determined by a statistical method, that is, the average login durations of the terminals that access all the areas and log in to the predetermined item are arranged in order from longest to shortest, and when the total time is located in the last quarter of the arrangement, the total time meets condition four;
and the second duration and the fourth preset value are determined by a classification prediction model when the server fault identification model is established.
9. The server failure recognition apparatus according to claim 8, wherein the apparatus further comprises:
the number obtaining module is used for obtaining the number of times the terminal accesses the server;
the audience ratio determining module is used for determining the audience ratio of the terminal's access times according to the number of times the terminal accesses the server;
the third judging module is used for judging, when condition four, condition five and condition six are met simultaneously, whether the audience ratio is smaller than a preset audience ratio, and if so, judging that the predetermined item served by the area to which the server belongs has no fault;
the audience ratio is the ratio of the number of times a certain terminal accesses the server to the total number of times all terminals access the server.
10. The server failure recognition apparatus according to claim 9, wherein the apparatus further comprises:
a fourth judgment unit configured to perform the following judgment:
when all the following conditions are met, judging that the server has a fault:
the area to which the server belongs is judged to have a second type of fault, and the number of terminals accessing the server is greater than the number of terminals accessing any other server in that area;
the average login duration of the terminals accessing the server is less than the average login duration of the terminals accessing any other server in that area;
and more than a preset proportion of all the terminals accessing the area are accessing the server.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a server failure identification method according to any one of claims 1 to 5.
CN201910865239.8A 2019-09-12 2019-09-12 Server fault identification method and device and storage medium Active CN110519102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910865239.8A CN110519102B (en) 2019-09-12 2019-09-12 Server fault identification method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910865239.8A CN110519102B (en) 2019-09-12 2019-09-12 Server fault identification method and device and storage medium

Publications (2)

Publication Number Publication Date
CN110519102A CN110519102A (en) 2019-11-29
CN110519102B true CN110519102B (en) 2020-10-30

Family

ID=68630775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910865239.8A Active CN110519102B (en) 2019-09-12 2019-09-12 Server fault identification method and device and storage medium

Country Status (1)

Country Link
CN (1) CN110519102B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111835702B (en) * 2020-01-20 2023-10-31 北京嘀嘀无限科技发展有限公司 Login method, login device, login equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102130726A (en) * 2010-01-15 2011-07-20 西门子公司 Fault diagnosis method in vehicle-mounted wireless communication system and device thereof
CN103650569B (en) * 2013-07-22 2018-02-02 华为技术有限公司 Wireless network method for diagnosing faults and equipment
CN106464547B (en) * 2014-03-31 2021-06-25 英国电讯有限公司 Method and system for detecting performance problem of home data network and storage medium
CN107391341A (en) * 2017-07-21 2017-11-24 郑州云海信息技术有限公司 A kind of fault early warning method and device
CN107864063B (en) * 2017-12-12 2021-09-17 北京奇艺世纪科技有限公司 Abnormity monitoring method and device and electronic equipment

Also Published As

Publication number Publication date
CN110519102A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN107872353A Fault locating method and device
CN110046073B (en) Log collection method and device, equipment and storage medium
CN109995614B (en) Alpha testing method and device
CN108399114A System performance testing method, apparatus and storage medium
CN111506489A (en) Test method, system, device, server and storage medium
CN111752850B (en) Method and related equipment for testing block chain system
CN110674021A (en) Detection method and system for login log of mobile application
CN111611140A (en) Reporting verification method and device of buried point data, electronic equipment and storage medium
CN111464376A (en) Website availability monitoring method and device, storage medium and computer equipment
CN112540887A (en) Fault drilling method and device, electronic equipment and storage medium
CN110519102B (en) Server fault identification method and device and storage medium
CN113037562A (en) Gateway fault assessment method and device and server
CN111526109B (en) Method and device for automatically detecting running state of web threat recognition defense system
CN112184072A (en) Machine room equipment management method and device
CN110769076B (en) DNS (Domain name System) testing method and system
CN110069382A (en) Software supervision method, server, terminal device, computer equipment and medium
CN114840422A (en) Test method, test device, electronic equipment and storage medium
CN113973068A (en) Chaos test method and device, chaos test platform and storage medium
US20120072160A1 (en) Measure presentation device, measure presentation method, and non-transitory computer readable storage medium
CN110134558B (en) Method and device for detecting server
CN113836013A (en) Embedded point testing method and device, computer equipment and computer readable storage medium
CN112131128A (en) Data testing method, device, storage medium and electronic device
CN114245242B (en) User offline detection method and device and electronic equipment
CN110661677A (en) DNS (Domain name System) testing method, device and system
CN115225455B (en) Abnormal device detection method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240403

Address after: Room 503, Building 3, No. 6, Xicheng Xi'an North Road, Xinluo District, Longyan City, Fujian Province, 364000

Patentee after: Xie Xinyong

Country or region after: China

Address before: 550000 floor 5, building a, Liyang building (Gaoke No.1), 160 Changling South Road, Guiyang National High tech Industrial Development Zone, Guiyang City, Guizhou Province

Patentee before: GUIYANG GLOUD TECHNOLOGY Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240424

Address after: Room 501-2432, Office Building, Development Zone, No. 8, Xingsheng South Road, Economic Development Zone, Miyun District, Beijing 100000 (Central Office Area of Economic Development Zone)

Patentee after: Beijing Abacus Industrial Technology Co.,Ltd.

Country or region after: China

Address before: Room 503, Building 3, No. 6, Xicheng Xi'an North Road, Xinluo District, Longyan City, Fujian Province, 364000

Patentee before: Xie Xinyong

Country or region before: China
