CN106844078A

CN106844078A - Method and apparatus for handling PCIE faults

Info

Publication number: CN106844078A
Application number: CN201611230230.2A
Authority: CN
Inventors: 常现超
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2016-12-27
Filing date: 2016-12-27
Publication date: 2017-06-13

Abstract

This application discloses a method and an apparatus for handling PCIE faults. The method includes: collecting PCIE fault information in the kernel; transferring the PCIE fault information from the kernel to user space; analyzing the collected PCIE fault information in user space; and repairing or isolating the PCIE fault according to the analysis result. The apparatus includes: a collecting unit for collecting PCIE fault information in the kernel; a transmission unit for transferring the PCIE fault information from the kernel to user space; an analysis unit for analyzing the collected PCIE fault information in user space; and a repair and isolation unit for repairing or isolating the PCIE fault according to the analysis result. With this method and apparatus, faults no longer need to be repaired through time-consuming manual effort, which improves the efficiency and quality of fault repair.

Description

Method and apparatus for handling PCIE faults

Technical field

The invention belongs to the field of computer application technology, and in particular relates to a method and an apparatus for handling PCIE faults.

Background art

With the rapid development of computer technology and integrated circuit technology, both software and hardware have advanced quickly. Many computer peripherals are PCIE (Peripheral Component Interconnect Express) devices, and as the number of devices grows, the probability that a PCIE device fails also grows. This poses a major challenge for administrators, who must check the health of PCIE devices frequently and even then may not discover faults in time. When a PCIE fault occurs, the administrator has to inspect and analyze a large volume of system logs and spend a long time repairing the faulty device. Some services also handle huge data volumes on large server clusters, so maintenance is time-consuming and laborious, and service quality may be seriously affected.

Summary of the invention

To solve the above problems, the invention provides a method and an apparatus for handling PCIE faults, which remove the need for time-consuming manual repair and improve the efficiency and quality of fault repair.

The method for handling PCIE faults provided by the invention includes:

Collecting PCIE fault information in the kernel;

Transferring the PCIE fault information from the kernel to user space;

Analyzing the collected PCIE fault information in user space;

Repairing or isolating the PCIE fault according to the analysis result.

Preferably, in the above method for handling PCIE faults, after the PCIE fault is repaired or isolated, the method further includes:

Notifying an administrator of the PCIE fault information.

Preferably, in the above method for handling PCIE faults, after the administrator is notified of the PCIE fault information, the method further includes:

Raising an alarm for the PCIE fault information.

Preferably, in the above method for handling PCIE faults, collecting the PCIE fault information in the kernel is:

Applying a kernel patch to the system to modify kernel code, and collecting the PCIE fault information in the kernel.

Preferably, in the above method for handling PCIE faults, transferring the PCIE fault information from the kernel to user space is:

Transferring the PCIE fault information from the kernel to user space by netlink communication.

The apparatus for handling PCIE faults provided by the invention includes:

A collecting unit, for collecting PCIE fault information in the kernel;

A transmission unit, for transferring the PCIE fault information from the kernel to user space;

An analysis unit, for analyzing the collected PCIE fault information in user space;

A repair and isolation unit, for repairing or isolating the PCIE fault according to the analysis result.

Preferably, the above apparatus for handling PCIE faults further includes:

A notification unit, for notifying an administrator of the PCIE fault information.

Preferably, the above apparatus for handling PCIE faults further includes:

An alarm unit, for raising an alarm for the PCIE fault information.

Preferably, in the above apparatus for handling PCIE faults, the collecting unit is specifically configured to apply a kernel patch to the system to modify kernel code, and to collect the PCIE fault information in the kernel.

Preferably, in the above apparatus for handling PCIE faults, the transmission unit is specifically configured to transfer the PCIE fault information from the kernel to user space by netlink communication.

As the above description shows, the method provided by the invention collects PCIE fault information in the kernel, transfers it from the kernel to user space, analyzes the collected PCIE fault information in user space, and repairs or isolates the PCIE fault according to the analysis result. Faults therefore no longer need to be repaired through time-consuming manual effort, which improves the efficiency and quality of fault repair.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the first method for handling PCIE faults provided by the embodiments of this application;

Fig. 2 is a schematic diagram of the first apparatus for handling PCIE faults provided by the embodiments of this application;

Fig. 3 is a schematic diagram of the fourth apparatus for handling PCIE faults provided by the embodiments of this application.

Detailed description

The core idea of the invention is to provide a method and an apparatus for handling PCIE faults that remove the need for time-consuming manual repair and improve the efficiency and quality of fault repair.

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.

The first method for handling PCIE faults provided by the embodiments of this application is shown in Fig. 1, which is a schematic diagram of that method. The method includes the following steps:

S1: Collecting PCIE fault information in the kernel;

It should be noted that a fault patch can be applied to the computer's operating system kernel using the KPatch tool in order to collect fault information. The fault information may include, but is not limited to, the location and the cause of the fault, and it is packaged for transmission. The patch module can be loaded while the operating system is running, without recompiling the kernel. By contrast, a patch in the usual sense is applied to the kernel source code at compile time, after which the kernel is built. Kernel code may also be modified directly through /proc files, which likewise allows fault information to be collected. The specific implementation is not limited here.

S2: Transferring the PCIE fault information from the kernel to user space;

It should be noted that the fault information is collected in the kernel while the subsequent processing takes place in user space, so the PCIE fault information must be transferred from the kernel to user space. The specific transfer mechanism includes, but is not limited to, a netlink channel.

S3: Analyzing the collected PCIE fault information in user space;

Specifically, the PCIE fault information can be classified and counted to obtain the analysis result.

S4: Repairing or isolating the PCIE fault according to the analysis result.

It should be noted that in this step, once the analysis is complete, an automatic repair of the fault can be attempted. If the repair does not succeed, as with a memory error, the faulty memory can be isolated so that it is not used again and does not destabilize the system. This prevents the fault from seriously affecting the system or key services. The approach compensates for the low efficiency of manually monitoring PCIE device health, managing faults, and analyzing their causes, and for the unstable operation that results when faults are not handled promptly and effectively.

As the above description shows, the first method for handling PCIE faults provided by the embodiments of this application collects PCIE fault information in the kernel, transfers it from the kernel to user space, analyzes the collected PCIE fault information in user space, and repairs or isolates the PCIE fault according to the analysis result. Faults therefore no longer need to be repaired through time-consuming manual effort, which improves the efficiency and quality of fault repair.
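The user-space part of this method (steps S3 and S4) can be sketched roughly as follows. This is only an illustrative sketch: the dictionary keys `location` and `cause`, the `Action` enum, and the `try_repair` and `isolate` callbacks are hypothetical names introduced here, not taken from the application.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    """Outcome of handling one fault (hypothetical names)."""
    REPAIRED = "repaired"
    ISOLATED = "isolated"


def classify(faults):
    """S3: count the collected faults by cause and by location."""
    by_cause = Counter(fault["cause"] for fault in faults)
    by_location = Counter(fault["location"] for fault in faults)
    return by_cause, by_location


def repair_or_isolate(fault, try_repair, isolate):
    """S4: attempt an automatic repair first; isolate only if it fails.

    try_repair(fault) returns True on success; isolate(fault) takes the
    faulty component out of service. Both are supplied by the caller.
    """
    if try_repair(fault):
        return Action.REPAIRED
    isolate(fault)
    return Action.ISOLATED
```

Passing the repair and isolation routines in as callbacks keeps the decision logic separate from the hardware-specific operations, which the application leaves open.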

The second method for handling PCIE faults provided by the embodiments of this application builds on the first method above and further includes the following technical feature:

After the PCIE fault is repaired or isolated, the method further includes:

Notifying an administrator of the PCIE fault information.

Specifically, the handling result and the details of the fault are sent to the administrator by SMS or email, so that the administrator can confirm that the fault was handled appropriately. The specific form includes, but is not limited to, charts or curves that help the administrator view the fault more intuitively.
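As a rough illustration of the email notification, the sketch below builds a message with Python's standard `email.message.EmailMessage`. The function name, the field layout of the message body, and the addresses are assumptions made for this example; the application does not specify a message format.

```python
from email.message import EmailMessage


def build_notification(sender, recipient, location, cause, action):
    """Build an email that reports one handled PCIE fault.

    The subject and body layout are hypothetical; only the idea of
    sending the handling result and fault details comes from the text.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"PCIE fault at {location}: {action}"
    msg.set_content(
        f"Location: {location}\n"
        f"Cause: {cause}\n"
        f"Action taken: {action}\n"
    )
    return msg
```

The resulting message could then be handed to a mail server, for example with `smtplib.SMTP(host).send_message(msg)`, where the host is site-specific.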

The third method for handling PCIE faults provided by the embodiments of this application builds on the second method above and further includes the following technical feature:

After the administrator is notified of the PCIE fault information, the method further includes:

Raising an alarm for the PCIE fault information.

It should be noted that some faults are serious, so it is very important that the administrator learns of them and handles them as quickly as possible. For example, when a piece of hardware is damaged beyond repair, it must be isolated so that normal use of the system is not affected. Taking the CPU as an example, if a machine's CPU has 24 cores and one core is damaged and cannot be repaired, that core must be isolated as soon as possible and not used again. The other 23 cores remain usable, but performance is reduced, so the administrator must be notified to replace the device promptly. Raising an alarm conveys the urgency of the situation so that the administrator handles the device problem first.
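The CPU example can be sketched as follows. The function names and the alarm wording are hypothetical, and the code only does the bookkeeping; actually taking a core out of service is platform-specific (on Linux, one option is writing 0 to `/sys/devices/system/cpu/cpuN/online`).

```python
def isolate_core(online_cores, failed_core):
    """Return the cores that stay in service after isolating one core."""
    cores = set(online_cores)
    if failed_core not in cores:
        raise ValueError(f"core {failed_core} is not online")
    return sorted(cores - {failed_core})


def alarm_text(total_cores, remaining_cores, failed_core):
    """Format an alarm that conveys the urgency (hypothetical wording)."""
    return (
        f"ALARM: CPU core {failed_core} isolated; "
        f"{len(remaining_cores)} of {total_cores} cores remain, "
        "replace the device"
    )
```

With 24 cores and one failed core, `isolate_core(range(24), 5)` leaves 23 cores in service, matching the example in the text.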

The fourth method for handling PCIE faults provided by the embodiments of this application builds on the third method above and further includes the following technical feature:

Collecting the PCIE fault information in the kernel is:

Applying a kernel patch to the system to modify kernel code, and collecting the PCIE fault information in the kernel.

It should be noted that by applying a kernel patch, the patch module can be loaded directly without compiling the kernel, so faults are captured more efficiently.

The fifth method for handling PCIE faults provided by the embodiments of this application builds on any one of the first to fourth methods above and further includes the following technical feature:

Transferring the PCIE fault information from the kernel to user space is:

Transferring the PCIE fault information from the kernel to user space by netlink communication.

It should be noted that netlink is a mechanism for communication between kernel space and user space in Linux. When a PCIE fault occurs, the patch module collects the related fault information, places it in the netlink channel, and sends it to user space.
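To make the framing concrete, the sketch below packs a fault record behind a netlink message header and parses it again in user space. The 16-byte header layout follows `struct nlmsghdr` (length, type, flags, sequence number, port ID). The payload layout, the message type value, and the function names are assumptions for illustration only; the application does not define a wire format.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HEADER = struct.Struct("=IHHII")

# Hypothetical payload: PCI domain, bus, device, function, cause code.
FAULT_PAYLOAD = struct.Struct("=HBBBI")

# Hypothetical message type; 0x10 is the first value not reserved for
# netlink control messages.
PCIE_FAULT_MSG = 0x10


def frame_fault(domain, bus, device, function, cause, seq=0):
    """Pack one fault record as a netlink-style message (sender side)."""
    payload = FAULT_PAYLOAD.pack(domain, bus, device, function, cause)
    length = NLMSG_HEADER.size + len(payload)  # nlmsg_len covers the header
    return NLMSG_HEADER.pack(length, PCIE_FAULT_MSG, 0, seq, 0) + payload


def parse_fault(data):
    """Parse a message produced by frame_fault (user-space side)."""
    length, msg_type, _flags, _seq, _pid = NLMSG_HEADER.unpack_from(data)
    if length < NLMSG_HEADER.size or length > len(data):
        raise ValueError("truncated netlink message")
    if msg_type != PCIE_FAULT_MSG:
        raise ValueError(f"unexpected message type {msg_type:#x}")
    domain, bus, device, function, cause = FAULT_PAYLOAD.unpack(
        data[NLMSG_HEADER.size:length]
    )
    location = f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"
    return {"location": location, "cause": cause}
```

In a real deployment the kernel module would build the message, and the user-space side would read it from a socket opened with `socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, protocol)`, where the protocol number is chosen by the implementer.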

The first apparatus for handling PCIE faults provided by the embodiments of this application is shown in Fig. 2, which is a schematic diagram of that apparatus. The apparatus includes:

A collecting unit 201, for collecting PCIE fault information in the kernel. It should be noted that a fault patch can be applied to the computer's operating system kernel using the KPatch tool in order to collect fault information. The fault information may include, but is not limited to, the location and the cause of the fault, and it is packaged for transmission. The patch module can be loaded while the operating system is running, without recompiling the kernel. By contrast, a patch in the usual sense is applied to the kernel source code at compile time, after which the kernel is built. Kernel code may also be modified directly through /proc files, which likewise allows fault information to be collected. The specific implementation is not limited here;

A transmission unit 202, for transferring the PCIE fault information from the kernel to user space. It should be noted that the fault information is collected in the kernel while the subsequent processing takes place in user space, so the PCIE fault information must be transferred from the kernel to user space. The specific transfer mechanism includes, but is not limited to, a netlink channel;

An analysis unit 203, for analyzing the collected PCIE fault information in user space. Specifically, the PCIE fault information can be classified and counted to obtain the analysis result;

A repair and isolation unit 204, for repairing or isolating the PCIE fault according to the analysis result. It should be noted that once the analysis is complete, an automatic repair of the fault can be attempted. If the repair does not succeed, as with a memory error, the faulty memory can be isolated so that it is not used again and does not destabilize the system. This prevents the fault from seriously affecting the system or key services. The approach compensates for the low efficiency of manually monitoring PCIE device health, managing faults, and analyzing their causes, and for the unstable operation that results when faults are not handled promptly and effectively.

The second apparatus for handling PCIE faults provided by the embodiments of this application builds on the first apparatus above and further includes the following technical feature:

A notification unit, for notifying an administrator of the PCIE fault information.

Specifically, the handling result and the details of the fault are sent to the administrator by SMS or email, so that the administrator can confirm that the fault was handled appropriately. The specific form includes, but is not limited to, charts or curves that help the administrator view the fault more intuitively.

The third apparatus for handling PCIE faults provided by the embodiments of this application builds on the second apparatus above and further includes the following technical feature:

An alarm unit, for raising an alarm for the PCIE fault information.

It should be noted that some faults are serious, so it is very important that the administrator learns of them and handles them as quickly as possible. For example, when a piece of hardware is damaged beyond repair, it must be isolated so that normal use of the system is not affected. Taking the CPU as an example, if a machine's CPU has 24 cores and one core is damaged and cannot be repaired, that core must be isolated as soon as possible and not used again. The other 23 cores remain usable, but performance is reduced, so an alarm must be sent to notify the administrator to replace the device promptly. Raising an alarm conveys the urgency of the situation so that the administrator handles the device problem first.

The fourth apparatus for handling PCIE faults provided by the embodiments of this application builds on the third apparatus above and further includes the following technical feature:

The collecting unit is specifically configured to apply a kernel patch to the system to modify kernel code, and to collect the PCIE fault information in the kernel.

Specifically, refer to Fig. 3, which is a schematic diagram of the fourth apparatus for handling PCIE faults provided by the embodiments of this application. The apparatus includes a kernel 402 connected to a PCIE device 401, with a kernel patch 403 applied to the kernel 402. After the kernel patch 403 collects PCIE fault information in the kernel 402, the information is transferred to an analysis unit 405 by a transmission unit 404, and a repair and isolation unit 406 then repairs or isolates the fault according to the analysis result. It should be noted that by applying a kernel patch, the patch module can be loaded directly without compiling the kernel, so faults are captured and handled more efficiently.
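The data flow of Fig. 3 can be pictured as a chain of four stages. The sketch below is only a schematic under the assumption that each unit can be modeled as a plain callable; the function and parameter names are hypothetical.

```python
def run_pipeline(collect, transmit, analyze, repair_or_isolate):
    """Chain the stages of Fig. 3 in order.

    collect()             models the kernel patch 403 gathering raw faults
    transmit(raw)         models the transmission unit 404
    analyze(delivered)    models the analysis unit 405, returning findings
    repair_or_isolate(f)  models the repair and isolation unit 406
    """
    raw = collect()
    delivered = transmit(raw)
    findings = analyze(delivered)
    return [repair_or_isolate(finding) for finding in findings]
```

Modeling the units as interchangeable callables mirrors the text's point that the collection and transfer mechanisms are not limited to one implementation.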

The fifth apparatus for handling PCIE faults provided by the embodiments of this application builds on any one of the first to fourth apparatuses above and further includes the following technical feature:

The transmission unit is specifically configured to transfer the PCIE fault information from the kernel to user space by netlink communication.

In summary, the method and apparatus provided by the embodiments of this application reduce the workload of fault management, automate fault management, and find and resolve faults promptly and effectively, ensuring the safe and reliable operation of the system and key services.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for handling PCIE faults, characterized by comprising:

Collecting PCIE fault information in the kernel;

Transferring the PCIE fault information from the kernel to user space;

Analyzing the collected PCIE fault information in user space;

2. The method for handling PCIE faults according to claim 1, characterized in that

An administrator is notified of the PCIE fault information.

3. The method for handling PCIE faults according to claim 2, characterized in that

An alarm is raised for the PCIE fault information.

4. The method for handling PCIE faults according to claim 3, characterized in that

Collecting the PCIE fault information in the kernel is:

5. The method for handling PCIE faults according to any one of claims 1 to 4, characterized in that

6. An apparatus for handling PCIE faults, characterized by comprising:

A collecting unit, for collecting PCIE fault information in the kernel;

7. The apparatus for handling PCIE faults according to claim 6, characterized in that

It further comprises:

A notification unit, for notifying an administrator of the PCIE fault information.

8. The apparatus for handling PCIE faults according to claim 7, characterized in that

It further comprises:

An alarm unit, for raising an alarm for the PCIE fault information.

9. The apparatus for handling PCIE faults according to claim 8, characterized in that

The collecting unit is specifically configured to apply a kernel patch to the system to modify kernel code, and to collect the PCIE fault information in the kernel.

10. The apparatus for handling PCIE faults according to any one of claims 6 to 9, characterized in that

The transmission unit is specifically configured to transfer the PCIE fault information from the kernel to user space by netlink communication.