CN103312564B - InfiniBand network detecting method - Google Patents


Info

Publication number
CN103312564B
Authority
CN
China
Prior art keywords
port
equipment
error
port number
infiniband network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310253119.5A
Other languages
Chinese (zh)
Other versions
CN103312564A (en)
Inventor
胡耀国
路川
曹振南
马少杰
杨亮
田相桂
何沧平
姜金良
范娟
沈杰
易成
曹征
侯雪峰
苗春葆
赵明坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Shuguang International Information Industry Co ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd
Priority to CN201310253119.5A
Publication of CN103312564A
Application granted
Publication of CN103312564B

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Small-Scale Networks (AREA)

Abstract

The invention discloses an InfiniBand network detection method, comprising: obtaining a first correspondence between the device names and LID numbers of the devices in the InfiniBand network, and a second correspondence between the physical port number and the logical port number of each port in each device; obtaining the LID number of the device on which an error port in the InfiniBand network is located and the logical port number of the error port; and obtaining, from the first correspondence, the second correspondence, the LID number of the device on which the error port is located, and the logical port number of the error port, the device name of that device and the physical port number of the error port. The invention is applicable to clusters of different scales: a single run finds all problematic links in the IB network. It can be integrated with other debugging methods, minimizes human error, greatly improves the debugging efficiency of large-scale IB networks, and saves time and labor.

Description

InfiniBand network detecting method
Technical field
The present invention relates generally to the field of network communication, and more specifically to an InfiniBand network detection method.
Background art
InfiniBand (IB for short) is a high-speed network technology and standard. With advantages such as high bandwidth and low latency, it is widely used in fields such as high-performance computing and big data processing, and high-performance computing clusters generally use InfiniBand as the computing network or the storage network. Before the nodes of a cluster use the InfiniBand network, it must be ensured that every device connected to the network sends and receives data normally. In addition, hardware electrical characteristics, cable quality, and the physical contact made when devices are installed may prevent certain ports of some devices from maintaining a zero error rate when sending data. In that case the InfiniBand network must be debugged: the problematic devices are found, and the device parameters or the connections between devices are adjusted step by step, so that data is sent with a zero or extremely low error rate.
Existing IB debugging methods involve cumbersome steps: a network stress test program is used to load the entire InfiniBand network, the problematic devices are then found manually, and the problem devices or problem ports are then dealt with.
The general procedure is: (1) first confirm that the opensm service is running and that the IB daughter card of each node is operating normally; (2) clear the error counters of all IB devices in the whole IB network; (3) run the IB network stress program; (4) output the diagnostic information of the IB network with ibdiagnet and, if a device is found to have recorded errors, use its LID number and port number to look up the corresponding device in the network structure output by ibnetdiscover, together with the other device on the same link as that port; (5) troubleshoot the problematic devices. Steps (1), (2), and (3) execute relatively quickly; most of the work in the debugging process lies in step (4). If the IB network is small, for example a few dozen nodes, the number of devices with problems each time is also small, usually only a few, and the problem devices can be found quickly by manually comparing the output of the two commands. When the cluster is large, for example more than 1000 nodes, ibdiagnet outputs more diagnostic information, ibnetdiscover outputs more network structure data, and, compared with a small cluster, more IB devices report errors after initial installation. Comparing entries by hand one by one to find the problem devices then becomes very time-consuming; at a large supercomputing center, one such debugging pass can take staff on the order of half a day. Moreover, because the lookup is done by manual comparison, omissions occur easily.
Summary of the invention
In view of the above defects of the prior art, the present invention proposes an InfiniBand network detection method that solves the technical problem of how to improve the debugging efficiency of IB networks.
According to one aspect of the present invention, an InfiniBand network detection method is provided, comprising: step S1: obtaining a first correspondence between the device names and LID numbers of the devices in the InfiniBand network, and a second correspondence between the physical port number and the logical port number of each port in each device; step S2: obtaining the LID number of the device on which an error port in the InfiniBand network is located and the logical port number of the error port; step S3: obtaining, from the first correspondence, the second correspondence, the LID number of the device on which the error port is located, and the logical port number of the error port, the device name of that device and the physical port number of the error port.
In the method, step S1 comprises: obtaining the first correspondence and the second correspondence by means of the Ibnetwork command.
In the method, step S2 comprises: obtaining, by means of the Ibdiagnet command, the LID number of the device on which the error port in the InfiniBand network is located and the logical port number of the error port.
The method further comprises: obtaining, from the device name of the device on which the error port is located and the physical port number of the error port, the device on which a faulty port is located and the physical port number of the faulty port, where the faulty port and the error port are the two ports on the same link.
The method further comprises: outputting the device name of the device on which the faulty port is located and the physical port number of the faulty port.
Before step S1, the method further comprises: confirming that the Opensm service of the InfiniBand network is running, and confirming that the daughter card of each node of the InfiniBand network is operating normally; clearing the error counters of all devices in the InfiniBand network; and running an InfiniBand network stress program on the InfiniBand network.
The invention is applicable to clusters of different scales: a single run finds all problematic links in the IB network. It can be integrated with other debugging methods, minimizes human error, greatly improves the debugging efficiency of large-scale IB networks, and saves time and labor.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and constitute a part of the description. Together with the embodiments of the present invention they serve to explain the invention, and they do not limit it. In the drawings:
Fig. 1 is a flow chart of a general embodiment of the present invention;
Fig. 2 is a flow chart of a specific embodiment of the present invention.
Detailed description of the invention
The preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Fig. 1 is a flow chart of a general embodiment of the present invention. In Fig. 1:
Step S100: obtain a first correspondence between the device names and LID numbers of the devices in the InfiniBand network, and a second correspondence between the physical port number and the logical port number of each port in each device.
Step S102: obtain the LID number of the device on which an error port in the InfiniBand network is located and the logical port number of the error port.
Step S104: obtain the device name of the device on which the error port is located and the physical port number of the error port from the first correspondence, the second correspondence, the LID number of that device, and the logical port number of the error port.
Each of these steps can be carried out automatically, without human participation.
In this embodiment, because the correspondence between device names and LID numbers and the correspondence between the physical and logical port numbers of each port of each device have been obtained, the device name and physical port number can be obtained automatically once the device LID and logical port number of an error port are detected. This reduces human participation and improves the efficiency of InfiniBand network detection.
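The lookup in steps S100 to S104 can be sketched as two dictionary lookups. This is only an illustrative sketch in Python: the device names, LID values, port numbers, and the names `lid_to_name`, `logical_to_physical`, `ErrorReport`, and `resolve` are hypothetical and do not come from the patent, and the tables are assumed to have already been parsed from the tool output.

```python
from typing import Dict, NamedTuple


class ErrorReport(NamedTuple):
    lid: int           # LID number of the device on which the error port is located
    logical_port: int  # logical port number of the error port


class Location(NamedTuple):
    device_name: str
    physical_port: int


# First correspondence (hypothetical sample data): LID number -> device name.
lid_to_name: Dict[int, str] = {12: "leaf-switch-03", 47: "node0815"}

# Second correspondence (hypothetical sample data):
# per device, logical port number -> physical port number.
logical_to_physical: Dict[str, Dict[int, int]] = {
    "leaf-switch-03": {1: 1, 2: 3, 18: 36},
    "node0815": {1: 1},
}


def resolve(report: ErrorReport) -> Location:
    """Map (LID, logical port) to (device name, physical port)."""
    name = lid_to_name[report.lid]
    physical = logical_to_physical[name][report.logical_port]
    return Location(name, physical)
```

With the sample tables above, `resolve(ErrorReport(lid=12, logical_port=18))` yields the device name `leaf-switch-03` and physical port 36, which is the result a person would otherwise obtain by comparing the two command outputs by hand.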
Fig. 2 is a flow chart of a specific embodiment of the present invention. In Fig. 2:
Step S200: confirm that the Opensm service of the InfiniBand network is running, and confirm that the daughter card of each node of the InfiniBand network is operating normally.
Step S202: clear the error counters of all devices in the InfiniBand network.
Step S204: run an InfiniBand network stress program on the InfiniBand network.
Step S206: obtain, by means of the Ibnetwork command, the first correspondence between the device names and LID numbers of the devices in the InfiniBand network and the second correspondence between the physical port number and the logical port number of each port in each device. For example, each LID number is matched to a device, and the ports inside the device are then numbered 1 to 18.
Step S208: obtain, by means of the Ibdiagnet command, the LID number of the device on which an error port in the InfiniBand network is located and the logical port number of the error port. Here the error port is the port that receives erroneous data, and the faulty port mentioned below is the malfunctioning port that produces the erroneous data; the error port and the faulty port are the two ports on the same link.
Step S210: obtain the device name of the device on which the error port is located and the physical port number of the error port from the first correspondence, the second correspondence, the LID number of that device, and the logical port number of the error port. The error-port information may be output at this point; alternatively, the faulty-port information may be output after it has been obtained in the following steps.
Step S212: obtain the device on which the faulty port is located and the physical port number of the faulty port from the device name of the device on which the error port is located and the physical port number of the error port. The faulty-port information may be output at this point, or in the following step.
Step S214: output the device name of the device on which the faulty port is located and the physical port number of the faulty port. For example, the device name together with the physical port number can be printed or shown on a display, so that staff can clearly see which port has failed.
In this embodiment, because the correspondence between device names and LID numbers and the correspondence between the physical and logical port numbers of each port of each device have been obtained, the device name and physical port number can be obtained automatically once the device LID and logical port number of an error port are detected. A single run therefore finds all problematic links in the IB network. The method can be integrated with other debugging methods, minimizes human error, greatly improves the debugging efficiency of large-scale IB networks, and saves time and labor.
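Steps S212 and S214 amount to finding the link partner of each error port and printing it. The following Python sketch assumes the links of the network are already available as pairs of (device name, physical port number); the sample link list and the names `build_peer_map` and `report_faulty_ports` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, int]  # (device name, physical port number)

# Hypothetical sample topology: each entry is one link with its two end ports.
links: List[Tuple[Port, Port]] = [
    (("leaf-switch-03", 36), ("node0815", 1)),
    (("leaf-switch-03", 3), ("spine-switch-01", 7)),
]


def build_peer_map(link_list: List[Tuple[Port, Port]]) -> Dict[Port, Port]:
    """Index every port by the port at the other end of its link."""
    peers: Dict[Port, Port] = {}
    for end_a, end_b in link_list:
        peers[end_a] = end_b
        peers[end_b] = end_a
    return peers


def report_faulty_ports(error_ports: List[Port], peers: Dict[Port, Port]) -> List[str]:
    """For each error port (receiver of bad data), name the faulty port on the same link."""
    lines = []
    for device, port in error_ports:
        peer = peers.get((device, port))
        if peer is None:
            lines.append(f"{device} port {port}: no link partner found")
        else:
            lines.append(
                f"{peer[0]} port {peer[1]} (link partner of {device} port {port})"
            )
    return lines


if __name__ == "__main__":
    for line in report_faulty_ports([("leaf-switch-03", 36)], build_peer_map(links)):
        print(line)
```

Storing each link under both of its end ports keeps the lookup symmetric, so the same map serves whichever end of the link recorded the errors.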
The foregoing describes only the preferred embodiments of the present invention and does not limit the invention; for a person skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. An InfiniBand network detection method, characterized by comprising:
Step S1: obtaining a first correspondence between the device names and LID numbers of the devices in the InfiniBand network, and a second correspondence between the physical port number and the logical port number of each port in each device;
Step S2: obtaining the LID number of the device on which an error port in the InfiniBand network is located and the logical port number of the error port;
Step S3: obtaining, from the first correspondence, the second correspondence, the LID number of the device on which the error port is located, and the logical port number of the error port, the device name of that device and the physical port number of the error port;
wherein step S1 comprises: obtaining the first correspondence and the second correspondence by means of the Ibnetwork command.
2. The method according to claim 1, characterized in that step S2 comprises:
obtaining, by means of the Ibdiagnet command, the LID number of the device on which the error port in the InfiniBand network is located and the logical port number of the error port.
3. The method according to claim 2, characterized in that the method further comprises: obtaining, from the device name of the device on which the error port is located and the physical port number of the error port, the device on which a faulty port is located and the physical port number of the faulty port, wherein the faulty port and the error port are the two ports on the same link.
4. The method according to claim 3, characterized in that the method further comprises: outputting the device name of the device on which the faulty port is located and the physical port number of the faulty port.
5. The method according to claim 1, characterized in that, before step S1, the method further comprises:
confirming that the Opensm service of the InfiniBand network is running, and confirming that the daughter card of each node of the InfiniBand network is operating normally;
clearing the error counters of all devices in the InfiniBand network; and
running an InfiniBand network stress program on the InfiniBand network.
CN201310253119.5A 2013-06-24 2013-06-24 InfiniBand network detecting method Active CN103312564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310253119.5A CN103312564B (en) 2013-06-24 2013-06-24 InfiniBand network detecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310253119.5A CN103312564B (en) 2013-06-24 2013-06-24 InfiniBand network detecting method

Publications (2)

Publication Number Publication Date
CN103312564A CN103312564A (en) 2013-09-18
CN103312564B true CN103312564B (en) 2016-07-06

Family

ID=49137365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310253119.5A Active CN103312564B (en) 2013-06-24 2013-06-24 InfiniBand network detecting method

Country Status (1)

Country Link
CN (1) CN103312564B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1554055A (en) * 2001-07-23 2004-12-08 Advanced Micro Devices High-availability cluster virtual server system
CN1647466A (en) * 2002-04-18 2005-07-27 国际商业机器公司 A method for providing redundancy for channel adapter failure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8681606B2 (en) * 2011-08-30 2014-03-25 International Business Machines Corporation Implementing redundancy on infiniband (IB) networks

Also Published As

Publication number Publication date
CN103312564A (en) 2013-09-18

Similar Documents

Publication Publication Date Title
CN108667725A (en) A kind of industrial AnyRouter and implementation method based on a variety of accesses and edge calculations
US8793351B2 (en) Automated configuration of new racks and other computing assets in a data center
CN105015565B (en) A kind of train fault information automatic displaying method
CN102880990B (en) Fault processing system
CN103095475B (en) The method for inspecting and system of multimode communication device
CN105740121A (en) Log text monitoring and early-warning method and apparatus
CN109905268B (en) Network operation and maintenance method and device
CN106568168A (en) Debugging method, debugger and system
CN107402797A (en) A kind of software compilation method and device
CN108322318B (en) Alarm analysis method and equipment
CN106776226A (en) The monitoring method and device of self-aided terminal
CN106600120A (en) Economic management cost control system
CN112583644A (en) Alarm processing method, device, equipment and readable storage medium
CN104202328B (en) A kind of method, configuration module and the subscription end of subscription GOOSE/SMV messages
CN103312564B (en) InfiniBand network detecting method
CN105071970A (en) Failure analysis method, failure analysis system and network management equipment
CN111948993A (en) Sample processing pipeline control system
CN108512675B (en) Network diagnosis method and device, control node and network node
CN108712306A (en) A kind of information system automation inspection platform and method for inspecting
CN104517082B (en) Electric power data acquisition apparatus and method
CN103810085B (en) A kind of method and device that module testing is carried out by comparing
CN106294126B (en) The automation formula correctness management method and device of SEN ion injection machine table
CN106547697B (en) A kind of the automation formula correctness management method and device of NISSIN ion injection machine table
CN106776236A (en) The method and apparatus of the execution of monitoring program
CN106126364A (en) A kind of fault event memory collection method based on Linux system and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220729

Address after: 100089 building 36, courtyard 8, Dongbeiwang West Road, Haidian District, Beijing

Patentee after: Dawning Information Industry (Beijing) Co.,Ltd.

Patentee after: DAWNING INFORMATION INDUSTRY Co.,Ltd.

Address before: 100193 No. 36 Building, No. 8 Hospital, Wangxi Road, Haidian District, Beijing

Patentee before: Dawning Information Industry (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240407

Address after: 266100 room 211, area a, software park, 169 Songling Road, Laoshan District, Qingdao City, Shandong Province

Patentee after: Zhongke Shuguang International Information Industry Co.,Ltd.

Country or region after: China

Address before: 100089 building 36, courtyard 8, Dongbeiwang West Road, Haidian District, Beijing

Patentee before: Dawning Information Industry (Beijing) Co.,Ltd.

Country or region before: China

Patentee before: DAWNING INFORMATION INDUSTRY Co.,Ltd.