WO2022267304A1 - Network communication method, computing device, and readable storage medium - Google Patents

Info

Publication number
WO2022267304A1
Authority
WO
WIPO (PCT)
Prior art keywords
function
ucx
processor
computing device
godson
Prior art date
Application number
PCT/CN2021/129671
Other languages
French (fr)
Chinese (zh)
Inventor
马海亮
孟杰
薛皓琳
吴昆鹏
Original Assignee
统信软件技术有限公司 (UnionTech Software Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 统信软件技术有限公司 (UnionTech Software Technology Co., Ltd.)
Publication of WO2022267304A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00: Data switching networks
    • H04L12/02: Details
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The invention relates to the field of computers, and in particular to a network communication method, a computing device, and a readable storage medium.
  • High-performance parallel computing: with the growing demand for computing power, high-performance parallel computing is becoming increasingly important. High-speed interconnect network communication is an important part of high-performance parallel computing and plays a vital role in its computational efficiency.
  • UCX: Unified Communications X
  • UCX is a network communication framework (a collection of libraries and interfaces) that provides efficient and relatively simple ways to build widely used HPC (High Performance Computing) protocols: tag matching, remote memory access operations, streams, remote atomic operations, etc.
  • HPC: High Performance Computing
  • The present invention provides a network communication method, a computing device, and a readable storage medium in an attempt to solve, or at least alleviate, the problems described above.
  • A network communication method is executed in a computing device that includes a Loongson processor. The method includes: obtaining the network communication software framework UCX; adding to UCX target functions that support the Loongson processor architecture, and adding the ability to obtain the Loongson processor mode to UCX's processor-mode acquisition function, thereby obtaining a target UCX.
  • The target functions include a function that flushes the processor data and instruction caches, a function that counts the leading zeros in a binary word, and an inline hook function.
  • The target UCX is compiled and installed on the computing device, so that the computing device uses the interface provided by the target UCX for network communication.
  • The step of adding target functions that support the Loongson processor architecture to UCX includes: adding, in the UCS part of UCX, a function that flushes the processor data and instruction caches and a function that counts leading zeros in a binary word, both supporting the Loongson processor architecture; and adding, in the UCM part of UCX, an inline hook function that supports the Loongson processor architecture.
  • The step of adding the ability to obtain the Loongson processor mode to UCX's processor-mode acquisition function includes: adding a Loongson processor mode item to the processor-mode enumeration type, and adding to the processor-mode acquisition function the logic that obtains this Loongson enumeration item.
  • asm: declares an inline assembly expression.
  • volatile: tells the compiler that the inline assembly must not be optimized.
  • sync: the LoongISA instruction used to refresh the processor data and caches.
  • memory: declares that memory has changed.
  • The LoongISA count-leading-zeros instructions are used to count the number of leading zeros in a binary word.
  • The inline hook function supporting the Loongson processor architecture is added to the UCM part of UCX.
  • The inline hook function replaces a call to a system library function with a call to a UCX-defined function, as follows: when a system library function is called, the address of the called system library function is obtained, together with the address of the UCX-defined function that corresponds to it; a jump instruction and the obtained address of the UCX-defined function are then written to the address of the called system library function.
  • Whether the current processor is a Loongson processor is determined as follows: when the return value of the processor-mode acquisition function is the Loongson processor mode enumeration item, the current processor is determined to be a Loongson processor.
  • A computing device comprises: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and comprise instructions for executing the network communication method according to the present invention.
  • A readable storage medium stores program instructions which, when read and executed by a computing device, cause the computing device to execute the network communication method according to the present invention.
  • According to the invention, the network communication software framework UCX is obtained. Then, target functions that support the Loongson processor architecture are added to UCX, and the ability to obtain the Loongson processor mode is added to UCX's processor-mode acquisition function, yielding the target UCX. The target UCX is then compiled and installed on a computing device that includes a Loongson processor. In this way, the computing device can use the interface provided by the target UCX to achieve high-speed interconnect communication, improving the efficiency of its high-performance parallel computing. The method therefore allows the Loongson platform, which the original UCX does not support, to use the UCX interface for high-speed network interconnect communication, improving the efficiency of high-performance parallel computing on that platform.
  • FIG. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the present invention.
  • FIG. 2 shows a flowchart of a network communication method 200 according to an embodiment of the present invention.
  • FIG. 3 shows a schematic diagram of a function call process using the Inline Hook method according to an embodiment of the present invention.
  • Loongson processors have been widely used in various industries.
  • The Loongson processor is based on the LoongISA architecture. As described above, the existing UCX does not support the LoongISA architecture. Therefore, to support high-performance parallel communication on the Loongson platform, the present invention builds on the existing UCX to implement inter-node communication over a high-speed network on the Loongson platform. Furthermore, a UCX communication interface for efficient intra-node communication is implemented on top of the shared-memory mechanism.
  • FIG. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the present invention.
  • The computing device 100 shown in FIG. 1 is only an example.
  • The computing device used to implement the network communication method of the present invention may be any type of device, and its hardware configuration may be the same as or different from that of the computing device 100 shown in FIG. 1.
  • Hardware components may be added to or removed from the computing device 100 shown in FIG. 1 when implementing the network communication method of the present invention; the present invention does not limit the specific hardware configuration of the computing device.
  • The computing device 100 typically includes a system memory 106 and one or more processors 104.
  • A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
  • The processor 104 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • The processor 104 may include one or more levels of cache, such as an L1 cache 110 and an L2 cache 112, as well as a processor core 114 and registers 116.
  • An exemplary processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • ALU: arithmetic logic unit
  • FPU: floating point unit
  • DSP core: digital signal processing core
  • An example memory controller 118 may be used with the processor 104, or, in some implementations, the memory controller 118 may be an internal part of the processor 104.
  • The system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof.
  • The physical memory of a computing device usually refers to volatile RAM; data on disk must be loaded into physical memory before the processor 104 can read it.
  • The system memory 106 may include an operating system 120, one or more applications 122, and program data 124.
  • The applications 122 may be arranged to be executed by the one or more processors 104 on the operating system, using the program data 124.
  • The operating system 120 may be, for example, Linux or Windows; it includes program instructions for handling basic system services and performing hardware-dependent tasks.
  • The applications 122 include program instructions for realizing various functions desired by the user.
  • The applications 122 may be, for example, a browser, instant messaging software, or software development tools (such as an integrated development environment (IDE) or a compiler), but are not limited thereto.
  • A driver module may also be added to the operating system 120.
  • When the computing device 100 starts running, the processor 104 reads the program instructions of the operating system 120 from the system memory 106 and executes them.
  • The applications 122 run on the operating system 120 and use the interfaces provided by the operating system 120 and the underlying hardware to realize various functions desired by the user.
  • An application 122 is loaded into the system memory 106, and the processor 104 reads and executes the program instructions of the application 122 from the system memory 106.
  • The computing device 100 also includes a storage device 132, which includes removable storage 136 and non-removable storage 138, both connected to a storage interface bus 134.
  • The computing device 100 may also include an interface bus 140 to facilitate communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the base configuration 102 via a bus/interface controller 130.
  • Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices, such as a display or speakers, via one or more A/V ports 152.
  • Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate, via one or more I/O ports 158, with external devices such as input devices (e.g., a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (e.g., a printer or scanner).
  • An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
  • A network communication link may be one example of a communication medium.
  • Communication media typically embody computer-readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
  • A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • Communication media may include wired media, such as a wired network or a dedicated-line network, and various wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • RF: radio frequency
  • IR: infrared
  • The term computer-readable media as used herein may include both storage media and communication media.
  • The applications 122 include instructions for executing the network communication method 200 of the present invention, and these instructions may instruct the processor 104 to execute the method.
  • The applications 122 may also include other applications 126 for implementing other functions.
  • FIG. 2 shows a flowchart of a network communication method 200 according to an embodiment of the present invention. The method 200 is suitable for execution in a computing device (such as the computing device 100 shown in FIG. 1).
  • The computing device includes a Loongson processor.
  • The network communication method 200 of the present invention starts at step S210.
  • In step S210, the network communication software framework UCX is obtained.
  • UCX mainly consists of four parts: UCS, UCM, UCT, and UCP. Specifically:
  • UCS is a service layer that provides the functionality needed to implement portable and efficient utilities.
  • This layer mainly provides: abstractions for accessing platform-specific functionality (atomic operations, thread safety, etc.), tools for efficient memory management (memory pools, memory allocators, etc.), and commonly used data structures (hash tables, trees, lists).
  • UCM is mainly responsible for intercepting the memory allocation and release events used by the memory registration cache.
  • UCT is a transport layer that abstracts the differences between hardware architectures and provides a low-level API for implementing communication protocols. The main goal of this layer is to provide direct and efficient access to hardware network functionality. In addition, this layer provides constructs for communication context management (thread-based and application-level) and for the allocation and management of device memory.
  • Communication API: UCT defines three communication methods according to data length: short data transfer (short), transfer with data copy (bcopy), and zero-copy transfer (zcopy). Short transfers are optimized for sending small messages. Transfers with data copy (bcopy) are optimized for medium-sized messages sent through so-called bounce buffers.
  • The bounce buffer is usually allocated with the network constraints taken into account and is ready for immediate use by the hardware. This method can be used for non-contiguous I/O because custom data-packing routines can be provided. Zero-copy transfers (zcopy) allow messages to be sent directly from user buffers, or received directly into user buffers, without copies between network layers.
  • UCP implements the higher-level protocols used by parallel programming models such as MPI and PGAS, building on the lower-level functions exposed by the UCT layer.
  • UCP mainly provides the following functionality: initialization, remote memory access (RMA) communication, remote atomic memory operations (AMO), active messages, tag matching, and streams.
  • Initialization: this interface covers setting up the communication context, querying network capabilities, and initializing local communication endpoints.
  • A communication context is an abstraction of network transport resources.
  • The endpoint setup interface initializes a UCP endpoint, which is an abstraction of all the resources associated with a particular connection. Endpoints are passed as input to all communication operations to describe the source and destination of the communication.
  • Remote memory access (RMA) communication: this interface defines the one-sided communication operations (such as PUT and GET) needed to implement the low-overhead, direct-memory-access communication constructs required by both distributed and shared memory programming models.
  • UCP includes a separate set of interfaces for communicating non-contiguous data. This functionality supports the communication requirements of various programming models and takes advantage of the scatter-gather capabilities of modern network hardware.
  • Remote atomic memory operations (AMO): this interface supports atomic execution of operations on remote memory, an important class of operations in the PGAS programming model (especially OpenSHMEM).
  • Tag matching: this interface supports tag matching for send-receive semantics, a key communication semantic defined by the MPI specification.
  • Active messages: this interface invokes a callback, specified by the sender, on incoming data packets so that they are processed by the receiving process.
  • A two-sided MPI interface can easily be implemented on top of this concept.
  • These interfaces are more general and also apply to other programming paradigms in which the receiving process does not pre-post receives but wants to react directly to incoming packets.
  • The active message interface provides separate APIs for different message types and for non-contiguous data. Streams: this interface provides ordered, reliable communication semantics. Data is treated as an ordered sequence of bytes pushed over the connection. In contrast to the tag-matching interface, the size of each send does not have to match the size of each receive, as long as the total number of bytes is the same.
  • This API is designed to match the widely used BSD-socket programming model.
  • In step S220, target functions that support the Loongson processor architecture are added to UCX, and the ability to obtain the Loongson processor mode is added to UCX's processor-mode acquisition function, yielding the target UCX.
  • The target functions include a function that flushes the processor data and instruction caches, a function that counts the leading zeros in a binary word, and an inline hook function.
  • The function that flushes the processor data and instruction caches is implemented with inline assembly.
  • Specifically, it uses the inline assembly expression asm volatile("sync" ::: "memory");
  • asm: every inline assembly expression begins with this keyword.
  • volatile: tells the compiler that this inline assembly must not be optimized.
  • sync: the LoongISA instruction that refreshes the processor data and caches.
  • "memory": declares that memory has changed; that is, it tells the compiler that memory contents may have been modified, so values must be read directly from memory rather than from copies held in registers.
  • The function that counts leading zeros in a binary word is also implemented with inline assembly, specifically through the LoongISA count-leading-zeros instructions.
  • The LoongISA count-leading-zeros instructions include clz and dclz: clz returns the number of 0 bits before the first 1 bit in a 32-bit word, and dclz returns the number of 0 bits before the first 1 bit in a 64-bit word.
  • The purpose of adding the inline hook function supporting the Loongson processor architecture to the UCM part of UCX is to replace system library functions with functions defined by UCX. That is, when a program calls a given system library function, the system library function itself is not executed; instead, the UCX-defined function corresponding to it is executed.
  • The inline hook performs this replacement by modifying machine code. Specifically, when a program calls a system library function, the address of the called function is obtained, together with the address of the corresponding UCX-defined function. A jump instruction and the address of the UCX-defined function are then written to the address of the called system library function. As a result, when a program calls the system library function, execution jumps to the corresponding UCX-defined function.
  • FIG. 3 shows a schematic diagram of a function call process using the Inline Hook method according to an embodiment of the present invention.
  • The system library functions mmap, munmap, mremap, shmat, shmdt, sbrk, brk, and madvise correspond to the UCX-defined functions ucm_mmap, ucm_munmap, ucm_mremap, ucm_shmat, ucm_shmdt, ucm_sbrk, ucm_brk, and ucm_madvise, respectively.
  • This step also adds the ability to obtain the Loongson processor mode to the processor-mode acquisition function in the UCS part of UCX.
  • The present invention uses the UCS_TEST_F(test_math, bitops) test to check the leading-zero counting function in UCS, the UCS_TEST_F(test_type, cpu_set) test to check the processor-mode function in UCS, and the UCS_TEST_F(malloc_hook_cplusplus, mmap_ptrs) and UCS_TEST_F(malloc_hook, bistro_patch) tests to check the Inline Hook function in UCM.
  • In step S230, after the target UCX has been obtained, the target UCX is compiled and installed on the computing device, so that the computing device uses the interface provided by the target UCX for network communication.
  • The system library functions mmap, munmap, mremap, shmat, shmdt, sbrk, brk, and madvise can then be intercepted, so that ucm_mmap, ucm_munmap, ucm_mremap, ucm_shmat, ucm_shmdt, ucm_sbrk, ucm_brk, ucm_madvise, and similar functions are executed instead.
  • As a specific example: in the UCS part of UCX, a function that flushes the processor data and instruction caches and a function that counts leading zeros in a binary word, both supporting the LoongISA architecture, are added; in the UCM part of UCX, an inline hook function supporting the LoongISA architecture is added; and the ability to obtain the Loongson processor mode is added to the processor-mode acquisition function in the UCS part of UCX. The result is the target UCX.
  • The resulting target UCX supports the Loongson platform and can therefore be compiled and installed on it. The interface provided by UCX can then be used on the Loongson platform to achieve high-speed interconnect communication, improving the efficiency of high-performance parallel computing on that platform.
  • According to the invention, the network communication software framework UCX is obtained. Then, target functions that support the Loongson processor architecture are added to UCX, and the ability to obtain the Loongson processor mode is added to UCX's processor-mode acquisition function, yielding the target UCX. The target UCX is then compiled and installed on a computing device that includes a Loongson processor, so that the device can use the interface provided by the target UCX for high-speed interconnect communication, improving the efficiency of its high-performance parallel computing. The method thus enables the Loongson platform, which the original UCX does not support, to use the UCX interface for high-speed network interconnect communication, significantly improving the efficiency of high-performance parallel computing on the Loongson platform.
  • The various techniques described herein may be implemented in hardware, in software, or in a combination of both.
  • The method and device of the present invention, or certain aspects or parts thereof, may take the form of program code (i.e., instructions) embedded in a tangible medium, such as a removable hard disk, USB flash drive, floppy disk, CD-ROM, or any other machine-readable storage medium.
  • When the program code is loaded into a machine, such as a computer, and executed by the machine, the machine becomes an apparatus for practicing the invention.
  • Where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • The memory is configured to store program code, and the processor is configured to execute the network communication method of the present invention according to the instructions in the program code stored in the memory.
  • Readable media include, by way of example and not limitation, readable storage media and communication media.
  • Readable storage media store information such as computer readable instructions, data structures, program modules or other data.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
  • The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other device.
  • Various general-purpose systems can also be used with the examples of the invention. The structure required to construct such systems is apparent from the description above.
  • The present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the invention described herein, and the above description of a specific language is provided to disclose the best mode of the invention.
  • The modules, units, or components of the devices in the examples disclosed herein may be arranged in the device as described in this embodiment, or alternatively may be located in one or more devices different from the device in this example.
  • The modules in the preceding examples may be combined into one module or divided into multiple sub-modules.
  • The modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment.
  • The modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components.
  • All features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination, except combinations in which at least some of such features and/or processes or units are mutually exclusive.
  • Each feature disclosed in this specification may be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
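The inline hook described above overwrites the entry of the called system library function with a jump instruction plus the address of the UCX-defined function. The sketch below shows only the encoding half of that step for a MIPS64-compatible core. Everything in it is an assumption for illustration: LoongISA is assumed to share the pre-Release-6 MIPS64 encodings of lui, ori, dsll, and jr; register t9 ($25) is chosen because the MIPS PIC calling convention already uses it to hold the callee's address; and all helper names are hypothetical rather than UCX's own (UCX keeps its per-architecture patching code in UCM's "bistro" sources, which the bistro_patch test exercises). The function only fills a buffer. A real hook would also have to make the target page writable, save the overwritten instructions if the original function is still needed, and synchronize the instruction cache after writing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: encode a MIPS64 absolute jump stub that an inline
 * hook could write over the entry of a system library function.
 * Assumes pre-Release-6 MIPS64 encodings. Nothing in memory is patched. */

#define MIPS_REG_T9     25u /* $t9: holds the callee address under MIPS PIC */
#define HOOK_STUB_WORDS 8u  /* 6 words build the address, then jr + delay slot */

/* lui rt, imm: rt = sign-extended (imm << 16) on MIPS64 */
static uint32_t mips_lui(uint32_t rt, uint16_t imm)
{
    return (0x0Fu << 26) | (rt << 16) | imm;
}

/* ori rt, rs, imm: rt = rs | zero-extended imm */
static uint32_t mips_ori(uint32_t rt, uint32_t rs, uint16_t imm)
{
    return (0x0Du << 26) | (rs << 21) | (rt << 16) | imm;
}

/* dsll rd, rt, sa: 64-bit shift left logical by sa (0..31) */
static uint32_t mips_dsll(uint32_t rd, uint32_t rt, uint32_t sa)
{
    return (rt << 16) | (rd << 11) | (sa << 6) | 0x38u;
}

/* jr rs: jump to the address in rs (one delay-slot instruction follows) */
static uint32_t mips_jr(uint32_t rs)
{
    return (rs << 21) | 0x08u;
}

/* Writes the jump stub for `target` into `out`. Returns the number of words
 * written, or 0 if `out` is NULL or too small. The two dsll shifts push the
 * sign-extension bits produced by lui out of the register. */
static size_t encode_hook_stub(uint64_t target, uint32_t *out, size_t out_words)
{
    const uint32_t t9 = MIPS_REG_T9;

    if (out == NULL || out_words < HOOK_STUB_WORDS) {
        return 0;
    }
    out[0] = mips_lui(t9, (uint16_t)(target >> 48));     /* bits 63..48 */
    out[1] = mips_ori(t9, t9, (uint16_t)(target >> 32)); /* bits 47..32 */
    out[2] = mips_dsll(t9, t9, 16);
    out[3] = mips_ori(t9, t9, (uint16_t)(target >> 16)); /* bits 31..16 */
    out[4] = mips_dsll(t9, t9, 16);
    out[5] = mips_ori(t9, t9, (uint16_t)target);         /* bits 15..0 */
    out[6] = mips_jr(t9);
    out[7] = 0u;                                         /* nop in the delay slot */
    return HOOK_STUB_WORDS;
}
```

For example, encoding the target 0x123456789ABCDEF0 produces lui t9,0x1234; ori t9,t9,0x5678; dsll t9,t9,16; ori t9,t9,0x9abc; dsll t9,t9,16; ori t9,t9,0xdef0; jr t9; nop, whose first word is 0x3C191234. Building the address from immediates keeps the stub position-independent, so the same 32 bytes work wherever the hooked function lives.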

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Stored Programmes (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed in the present invention is a network communication method performed in a computing device comprising a Loongson processor. The method comprises: acquiring the network communication software framework UCX; adding to UCX a target function that supports the Loongson processor architecture, and adding to UCX's processor mode acquisition function the ability to acquire a Loongson processor mode, to obtain a target UCX; and compiling and installing the target UCX on the computing device, so that the computing device performs network communication using an interface provided by the target UCX. Also disclosed in the present invention are a corresponding computing device and readable storage medium. The network communication method of the present invention enables a Loongson platform, which is not supported by the original UCX, to achieve high-speed network interconnect communication using the interface provided by UCX.
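The processor-mode step can be sketched in C as follows. This is a minimal illustration under stated assumptions, not UCX's code: the enumeration and function names are hypothetical, and detection assumes (for illustration only) that a Loongson system's /proc/cpuinfo contains the string "Loongson", while x86_64 hosts report a vendor_id of GenuineIntel or AuthenticAMD. The final check mirrors the rule the method describes: the processor is treated as Loongson exactly when the mode query returns the Loongson enumeration item.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical processor-mode enumeration; entries for the other supported
 * architectures are abbreviated to one. */
typedef enum {
    CPU_MODE_UNKNOWN = 0,
    CPU_MODE_X86_64,
    CPU_MODE_LOONGSON /* the new enumeration item for Loongson processors */
} cpu_mode_t;

/* Classify /proc/cpuinfo-style text. The search strings are assumptions. */
static cpu_mode_t cpu_mode_from_cpuinfo(const char *cpuinfo)
{
    if (cpuinfo == NULL) {
        return CPU_MODE_UNKNOWN;
    }
    if (strstr(cpuinfo, "Loongson") != NULL) {
        return CPU_MODE_LOONGSON;
    }
    if (strstr(cpuinfo, "GenuineIntel") != NULL ||
        strstr(cpuinfo, "AuthenticAMD") != NULL) {
        return CPU_MODE_X86_64;
    }
    return CPU_MODE_UNKNOWN;
}

/* Processor-mode query: reads the start of /proc/cpuinfo (Linux only). */
static cpu_mode_t cpu_mode_get(void)
{
    char buf[4096];
    size_t n;
    FILE *fp = fopen("/proc/cpuinfo", "r");

    if (fp == NULL) {
        return CPU_MODE_UNKNOWN;
    }
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return cpu_mode_from_cpuinfo(buf);
}

/* The current processor is Loongson iff the query returns the new item. */
static int cpu_is_loongson(void)
{
    return cpu_mode_get() == CPU_MODE_LOONGSON;
}
```

Keeping the text classifier separate from the file read lets the detection rule be tested on any host with sample strings, while cpu_mode_get() does the platform-specific part.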

Description

A network communication method, computing device and readable storage medium
Technical field
The invention relates to the field of computers, and in particular to a network communication method, a computing device, and a readable storage medium.
Background
With the growing demand for computing, high-performance parallel computing is becoming increasingly important. High-speed interconnect network communication is an important part of high-performance parallel computing and plays a vital role in its computational efficiency. UCX (Unified Communications X) is one existing framework that can provide high-speed network interconnect communication.
UCX is a network communication framework (a collection of libraries and interfaces) that provides efficient and relatively simple ways to build widely used HPC (High Performance Computing) protocols: tag matching, remote memory access operations, streams, remote atomic operations, etc.
However, the existing UCX has limited extensibility: it supports only the x86_64, POWER8, POWER9, and Arm v8 architectures. Platforms based on other architectures therefore cannot use the interfaces provided by UCX to achieve high-speed network interconnect communication.
Summary of the Invention
To this end, the present invention provides a network communication method, a computing device, and a readable storage medium, in an attempt to solve or at least alleviate the problems described above.
According to one aspect of the present invention, a network communication method is provided, executed in a computing device that includes a Loongson processor. The method includes: obtaining the network communication software framework UCX; adding to UCX target functions that support the Loongson processor architecture, and adding to UCX's processor mode acquisition function the ability to acquire the Loongson processor mode, to obtain a target UCX, where the target functions include a function that flushes the processor data and instruction caches, a function that counts the leading zeros in a binary value, and an inline hook function; and compiling and installing the target UCX on the computing device, so that the computing device performs network communication using the interfaces provided by the target UCX.
Optionally, in the network communication method according to the present invention, the step of adding to UCX the target functions that support the Loongson processor architecture includes: adding to the UCS part of UCX a function that flushes the processor data and instruction caches and a function that counts leading zeros in a binary value, both supporting the Loongson processor architecture; and adding to the UCM part of UCX an inline hook function that supports the Loongson processor architecture.
Optionally, in the network communication method according to the present invention, the step of adding to UCX's processor mode acquisition function the ability to acquire the Loongson processor mode includes: adding a Loongson processor mode entry to the processor mode enumeration type; and adding to the processor mode acquisition function the logic for returning the Loongson processor mode entry.
Optionally, in the network communication method according to the present invention, after the function that flushes the processor data and instruction caches with Loongson processor architecture support is added to the UCS part of UCX, the Loongson processor data and instruction caches are flushed with the following inline assembly expression:
asm volatile("sync" ::: "memory");
Here, asm declares an inline assembly expression; volatile tells the compiler not to optimize the inline assembly; sync is used on the LoongISA architecture to flush processor data and caches; and memory declares that memory has been modified.
Optionally, in the network communication method according to the present invention, after the leading-zero count function with Loongson processor architecture support is added to the UCS part of UCX, the leading-zero count instructions of the LoongISA architecture are used to count the leading zeros in a binary value.
Optionally, in the network communication method according to the present invention, after the inline hook function with Loongson processor architecture support is added to the UCM part of UCX, the inline hook function replaces a called system library function with the corresponding UCX-defined function as follows: when the system library function is called, the address of the called system library function is obtained, together with the address of the UCX-defined function corresponding to it; a jump instruction and the obtained address of the UCX-defined function are then written at the address of the called system library function.
Optionally, in the network communication method according to the present invention, after the ability to acquire the Loongson processor mode is added to UCX's processor mode acquisition function, the current processor is determined to be a Loongson processor as follows: when the processor mode acquisition function returns the Loongson processor mode enumeration entry, the current processor is determined to be a Loongson processor.
According to another aspect of the present invention, a computing device is provided, comprising: at least one processor; and a memory storing program instructions, where the program instructions are configured to be executed by the at least one processor and include instructions for performing the network communication method according to the present invention.
According to yet another aspect of the present invention, a readable storage medium storing program instructions is provided; when the program instructions are read and executed by a computing device, they cause the computing device to perform the network communication method according to the present invention.
In the network communication method according to the present invention, the network communication software framework UCX is first obtained. Next, target functions supporting the Loongson processor architecture are added to UCX, and the ability to acquire the Loongson processor mode is added to UCX's processor mode acquisition function, yielding the target UCX. The target UCX is then compiled and installed on a computing device that includes a Loongson processor. That computing device can then use the interfaces provided by the target UCX for high-speed interconnect network communication, which improves the efficiency of its high-performance parallel computing. The method thus allows a Loongson platform, which the original UCX does not support, to use the interfaces provided by UCX for high-speed network interconnect communication, improving the efficiency of high-performance parallel computing on the Loongson platform.
Brief Description of the Drawings
To achieve the foregoing and related ends, certain illustrative aspects are described herein in conjunction with the following description and drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a structural block diagram of a computing device 100 according to an embodiment of the present invention;
FIG. 2 shows a flowchart of a network communication method 200 according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a function call flow using the inline hook method according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
With the acceleration of domestic technological self-reliance and the rapid development of the Xinchuang (information technology application innovation) industry, achieving independent control over core technologies is increasingly important. To achieve independent control over the CPU in particular, China independently developed the Loongson processor, which is now widely used across many industries.
The Loongson processor is based on the LoongISA architecture. As described above, existing UCX does not support the LoongISA architecture. To support high-performance, parallel communication on the Loongson platform, the present invention builds on existing UCX to implement a method for inter-node communication over a high-speed network on the Loongson platform. It further provides a UCX communication interface that achieves efficient intra-node communication based on a shared memory mechanism.
FIG. 1 shows a structural block diagram of computing device 100 according to an embodiment of the present invention. Note that computing device 100 in FIG. 1 is only an example. In practice, the computing device that implements the network communication method of the present invention may be any type of device, and its hardware configuration may be the same as or different from that of computing device 100 in FIG. 1. In practice, hardware components may be added to or removed from computing device 100; the present invention does not limit the specific hardware configuration of the computing device.
As shown in FIG. 1, in a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between processor 104 and system memory 106.
Depending on the desired configuration, processor 104 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 104 may include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114, and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used with processor 104, or in some implementations memory controller 118 may be an internal part of processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. Physical memory in a computing device usually refers to volatile RAM; data on disk must be loaded into physical memory before processor 104 can read it. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some implementations, applications 122 may be arranged to be executed on the operating system by the one or more processors 104 using program data 124. Operating system 120 may be, for example, Linux or Windows, and includes program instructions for handling basic system services and performing hardware-dependent tasks. Applications 122 include program instructions for implementing various functions desired by the user; an application 122 may be, for example, a browser, instant messaging software, or a software development tool (such as an integrated development environment (IDE) or a compiler), but is not limited thereto. When an application 122 is installed on computing device 100, a driver module may be added to operating system 120.
When computing device 100 starts, processor 104 reads the program instructions of operating system 120 from system memory 106 and executes them. Applications 122 run on operating system 120 and use the interfaces provided by operating system 120 and the underlying hardware to implement various user-desired functions. When the user starts an application 122, it is loaded into system memory 106, and processor 104 reads and executes its program instructions from system memory 106.
Computing device 100 also includes a storage device 132, which includes removable storage 136 and non-removable storage 138, both connected to a storage interface bus 134.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to communicate with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to communicate via one or more I/O ports 158 with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner). An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link is one example of a communication medium. Communication media may typically embody computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable media as used herein may include both storage media and communication media.
In computing device 100 according to the present invention, applications 122 include instructions for performing the network communication method 200 of the present invention, and these instructions may direct processor 104 to perform the network communication method of the present invention. Those skilled in the art will understand that, besides the instructions for performing network communication method 200, applications 122 may also include other applications 126 for implementing other functions.
FIG. 2 shows a flowchart of a network communication method 200 according to an embodiment of the present invention. Method 200 is suitable for execution in a computing device (such as computing device 100 shown in FIG. 1) that includes a Loongson processor.
As shown in FIG. 2, network communication method 200 of the present invention starts at step S210, in which the network communication software framework UCX is obtained.
To aid understanding of the present invention, UCX is briefly described here. UCX consists mainly of four parts: UCS, UCM, UCT, and UCP. Specifically:
UCS is a service layer that provides the functionality needed to implement portable, efficient utilities. It mainly includes the following services: abstractions for accessing platform-specific functionality (atomic operations, thread safety, etc.), tools for efficient memory management (memory pools, memory allocators, etc.), and commonly used data structures (hash tables, trees, lists).
UCM is mainly responsible for intercepting the memory allocation and release events used by the memory registration cache.
UCT is a transport layer that abstracts the differences among various hardware architectures and provides a low-level API for implementing communication protocols. Its main goal is to provide direct, efficient access to hardware network capabilities. This layer also provides constructs for communication context management (at the thread and application level) and for the allocation and management of device-specific memory.
For its communication API, UCT defines three transfer modes, chosen according to data length: short data transfer (short), transfer with data copy (bcopy), and zero-copy transfer (zcopy). Short operations are optimized for transferring small amounts of data. Bcopy operations are optimized for medium-sized messages sent through a so-called bounce buffer; this auxiliary buffer is usually allocated with the network's constraints in mind and is ready for immediate use by the hardware. Because a custom data-packing routine can be supplied, bcopy can also be used for non-contiguous I/O. Zcopy operations send messages directly from, or receive them directly into, user buffers without copying between network layers.
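The size-based choice among short, bcopy, and zcopy can be sketched as follows. This is a minimal illustration, assuming just two per-transport size limits; the names xfer_limits_t, xfer_mode_t, and select_xfer_mode are hypothetical and are not UCX APIs. Real UCT interfaces report their own limits, and the actual selection logic in UCX is more involved than this single size rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transport size limits (placeholders, not real values). */
typedef struct {
    size_t max_short; /* largest payload sent inline ("short") */
    size_t max_bcopy; /* largest payload sent via a bounce buffer ("bcopy") */
} xfer_limits_t;

typedef enum {
    XFER_SHORT, /* payload copied straight into the send descriptor */
    XFER_BCOPY, /* payload packed into a pre-allocated bounce buffer */
    XFER_ZCOPY  /* payload sent directly from the user buffer */
} xfer_mode_t;

/* Pick the cheapest mode that can carry a message of the given length. */
static xfer_mode_t select_xfer_mode(const xfer_limits_t *lim, size_t length)
{
    assert(lim->max_short <= lim->max_bcopy);
    if (length <= lim->max_short) {
        return XFER_SHORT;
    }
    if (length <= lim->max_bcopy) {
        return XFER_BCOPY;
    }
    return XFER_ZCOPY;
}
```

The ordering of the checks encodes the trade-off described above: copying is cheapest for tiny payloads, a single bounce-buffer copy suits medium ones, and avoiding the copy altogether pays off only for large messages.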
UCP implements the higher-level protocols used by parallel programming models such as MPI and PGAS, using the lower-level capabilities exposed by the UCT layer. UCP mainly provides the following functions: initialization, remote memory access (RMA) communication, remote atomic memory operations (AMO), active messages, and tag matching.
Initialization: this interface sets up the communication context, queries network capabilities, and initializes local communication endpoints. A communication context is an abstraction of network transport resources. The endpoint setup interface initializes a UCP endpoint, which is an abstraction of all the resources associated with a particular connection. Endpoints are passed as input to all communication operations to describe the source and destination of the communication.
Remote memory access (RMA) communication: this interface defines the one-sided communication operations (such as PUT and GET) needed to implement distributed and shared-memory programming models with low overhead and direct access to memory. UCP includes a separate set of interfaces for transferring non-contiguous data; this functionality supports the communication requirements of various programming models and takes advantage of the scatter-gather capabilities of modern network hardware.
Remote atomic memory operations (AMO): this interface supports executing atomic operations on remote memory, an important operation in the PGAS programming model (especially OpenSHMEM).
Tag matching: this interface supports tag matching for send-receive semantics, a key communication semantic defined by the MPI specification.
Active messages: this interface invokes a sender-specified callback on incoming data packets so that they are processed by the receiving process. For example, a two-sided MPI interface can easily be implemented on top of this concept. These interfaces are more general, however, and also suit other programming paradigms in which the receiving process does not pre-post receives but wants to react directly to incoming packets. Like the RMA and tag-matching interfaces, the active message interface provides separate APIs for different message types and for non-contiguous data.
Streams: this interface provides ordered, reliable communication semantics. Data is treated as an ordered sequence of bytes pushed through the connection. Unlike the tag-matching interface, the size of each send does not have to match the size of each receive, as long as the total number of bytes is the same. This API is designed to match the widely used BSD-socket-based programming model.
The method then proceeds to step S220: target functions supporting the Loongson processor architecture are added to UCX, and the ability to acquire the Loongson processor mode is added to UCX's processor mode acquisition function, yielding the target UCX. The target functions include a function that flushes the processor data and instruction caches, a function that counts leading zeros in a binary value, and an inline hook function.
Adding the target functions that support the Loongson processor architecture involves the UCS and UCM parts of UCX. Specifically, the cache-flush function and the leading-zero count function supporting the Loongson processor architecture are added to the UCS part of UCX, and the inline hook function supporting the Loongson processor architecture is added to the UCM part of UCX.
According to one embodiment of the present invention, the function that flushes the processor data and instruction caches is implemented with inline assembly, specifically the inline assembly expression asm volatile("sync":::"memory"). Here, asm (with which every inline assembly expression begins) declares an inline assembly expression; volatile tells the compiler not to optimize the inline assembly; sync (the LoongISA instruction for flushing processor data and caches) performs the flush on the LoongISA architecture; and memory declares that memory has been modified, that is, it tells the compiler that memory has changed, so values must be read directly from memory rather than from copies held in registers.
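A minimal C sketch of such a helper follows, assuming a GCC or Clang toolchain; the names arch_memory_barrier and arch_clear_cache are hypothetical. One caveat is worth stating: in the MIPS architecture with which LoongISA is compatible, sync is specified as a memory-ordering barrier and does not by itself make instruction fetch see newly written code, so this sketch additionally calls the GCC/Clang builtin __builtin___clear_cache for the instruction-cache part. The non-MIPS branch exists only so the sketch compiles on other targets.

```c
#include <assert.h>

/* Full memory barrier. On MIPS/LoongISA builds this emits the "sync"
 * instruction quoted in the text; other targets use a compiler builtin so
 * the sketch still compiles. Both forms also stop the compiler from
 * reusing memory values cached in registers across the barrier. */
static inline void arch_memory_barrier(void)
{
#if defined(__mips__)
    __asm__ __volatile__("sync" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Make freshly written bytes in [start, end) visible to instruction fetch:
 * order the data writes first, then ask the compiler runtime to
 * synchronise the instruction cache for that range (a no-op on x86). */
static inline void arch_clear_cache(void *start, void *end)
{
    assert((char *)start <= (char *)end);
    arch_memory_barrier();
    __builtin___clear_cache((char *)start, (char *)end);
}
```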
The function that counts leading zeros in a binary value is also implemented with inline assembly, using the leading-zero count instructions of the LoongISA architecture, namely the clz and dclz assembly instructions. clz returns the number of 0 bits before the first 1 bit in a 32-bit value, and dclz returns the number of 0 bits before the first 1 bit in a 64-bit value.
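The sketch below wraps clz and dclz in C, assuming GCC/Clang inline assembly on MIPS/LoongISA and falling back to __builtin_clz/__builtin_clzll elsewhere; all function names are hypothetical. The fallback special-cases zero because __builtin_clz(0) is undefined, whereas clz returns 32 and dclz returns 64 for a zero input. An integer base-2 logarithm is included as one typical use of a leading-zero count.

```c
#include <assert.h>
#include <stdint.h>

/* Leading zeros of a 32-bit value; 32 when x == 0, matching MIPS "clz". */
static inline unsigned count_leading_zeros32(uint32_t x)
{
#if defined(__mips__)
    uint32_t n;
    __asm__("clz %0, %1" : "=r"(n) : "r"(x));
    return n;
#else
    /* __builtin_clz(0) is undefined, so handle zero explicitly. */
    return x ? (unsigned)__builtin_clz(x) : 32u;
#endif
}

/* Leading zeros of a 64-bit value; 64 when x == 0, matching MIPS64 "dclz". */
static inline unsigned count_leading_zeros64(uint64_t x)
{
#if defined(__mips64)
    uint64_t n;
    __asm__("dclz %0, %1" : "=r"(n) : "r"(x));
    return (unsigned)n;
#else
    return x ? (unsigned)__builtin_clzll(x) : 64u;
#endif
}

/* Example use: floor(log2(x)) for x > 0. */
static inline unsigned ilog2_64(uint64_t x)
{
    assert(x != 0);
    return 63u - count_leading_zeros64(x);
}
```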
Regarding the step of adding the inline hook function supporting the Loongson processor architecture to the UCM part of UCX, first note its purpose: to replace system library functions with functions defined by UCX. Specifically, when a program calls a system library function, that function is replaced with the corresponding UCX-defined function; that is, the system library function itself is not executed, and the UCX-defined function corresponding to it is executed instead.
An inline hook achieves this replacement by modifying machine code. Specifically, when a program calls a system library function, the address of the called system library function is obtained, along with the address of the corresponding UCX-defined function. A jump instruction and the obtained address of the UCX-defined function are then written at the address of the called system library function. As a result, when a program calls the system library function, execution jumps to the corresponding custom function.
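Writing the jump at the entry of a loaded function requires temporarily making its code page writable. A minimal sketch for Linux follows, assuming POSIX mprotect and a GCC/Clang toolchain; write_code_patch is a hypothetical name. It assumes the target page was originally read+execute, keeps the page executable while writing (other code on the same page may be running), and does not save the overwritten instructions, which a complete hook needs if the original function must remain callable.

```c
#define _GNU_SOURCE /* glibc: expose sysconf and MAP_* flags in strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite `len` bytes at `addr` (in practice, the entry of the function
 * being hooked) with `patch` (in practice, a jump to the hook). */
static int write_code_patch(void *addr, const void *patch, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && len > 0);

    /* mprotect works on whole pages: round the range out to page bounds. */
    uintptr_t mask = ~((uintptr_t)page - 1);
    uintptr_t start = (uintptr_t)addr & mask;
    uintptr_t end = ((uintptr_t)addr + len + (uintptr_t)page - 1) & mask;

    /* Keep PROT_EXEC while writing, since other code on these pages may run. */
    if (mprotect((void *)start, end - start,
                 PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        return -1;
    }
    memcpy(addr, patch, len);

    /* Restore read+execute (assumes that was the original protection). */
    if (mprotect((void *)start, end - start, PROT_READ | PROT_EXEC) != 0) {
        return -1;
    }

    /* Make the instruction stream observe the new bytes (no-op on x86). */
    __builtin___clear_cache((char *)addr, (char *)addr + len);
    return 0;
}
```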
Refer to FIG. 3, which shows a schematic diagram of a function call flow using the inline hook method according to an embodiment of the present invention. When a program calls a system library function, execution first jumps to the address of the called system function. The jump instruction written at that address is then executed, jumping to the address of the user-defined function. After the user-defined function finishes, execution returns to the instruction following the original jalr call. The specific implementation steps are as follows:
(1) Construct the jump instruction: in machine code, load the address of the custom function into the t9 register, then jump to the address held in t9.
(2) Write the constructed instructions at the address of the called system library function.
(3) When the system library function is called, execution jumps to the custom function.
(4) After the custom function has executed, the original flow continues.
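Step (1) can be made concrete for a MIPS64-compatible LoongISA target. The sketch below uses hypothetical helper names and only produces instruction words; it does not write them anywhere. It encodes the standard MIPS64 sequence lui/ori/dsll/ori/dsll/ori, which builds a 64-bit address in t9 (general register $25). The sequence ends with jr t9 and a nop that fills the branch delay slot. Using t9 fits the MIPS position-independent calling convention, in which a function expects its own entry address in t9. The sketch assumes MIPS64 Release 2 style encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIPS_T9 25u          /* t9 is general-purpose register $25 */
#define HOOK_PATCH_WORDS 8   /* 6 address-building insns + jr + delay-slot nop */

/* Hypothetical encoders for the few MIPS64 instructions needed. */
static uint32_t mips_lui(uint32_t rt, uint16_t imm)              /* lui rt, imm */
{ return (0x0Fu << 26) | (rt << 16) | imm; }
static uint32_t mips_ori(uint32_t rt, uint32_t rs, uint16_t imm) /* ori rt, rs, imm */
{ return (0x0Du << 26) | (rs << 21) | (rt << 16) | imm; }
static uint32_t mips_dsll(uint32_t rd, uint32_t rt, uint32_t sa) /* dsll rd, rt, sa */
{ return (rt << 16) | (rd << 11) | (sa << 6) | 0x38u; }
static uint32_t mips_jr(uint32_t rs)                             /* jr rs */
{ return (rs << 21) | 0x08u; }

/* Build "t9 = target; jr t9; nop" into out[]; returns the word count. */
static size_t build_t9_jump(uint32_t out[HOOK_PATCH_WORDS], uint64_t target)
{
    assert(out != NULL);
    size_t n = 0;
    out[n++] = mips_lui(MIPS_T9, (uint16_t)(target >> 48));          /* bits 63..48 */
    out[n++] = mips_ori(MIPS_T9, MIPS_T9, (uint16_t)(target >> 32)); /* bits 47..32 */
    out[n++] = mips_dsll(MIPS_T9, MIPS_T9, 16);
    out[n++] = mips_ori(MIPS_T9, MIPS_T9, (uint16_t)(target >> 16)); /* bits 31..16 */
    out[n++] = mips_dsll(MIPS_T9, MIPS_T9, 16);
    out[n++] = mips_ori(MIPS_T9, MIPS_T9, (uint16_t)target);         /* bits 15..0 */
    out[n++] = mips_jr(MIPS_T9);                                      /* jump to t9 */
    out[n++] = 0u;                                                    /* nop (delay slot) */
    return n;
}
```

Step (2) writes the resulting eight words (32 bytes) over the entry of the hooked function. This overwrites the function's original first instructions. Any path that still needs to call the original function must therefore save those instructions first.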
Thus, once the inline hook method supporting the Loongson processor architecture is added to the UCM part of UCX, a program's calls to system library functions can be intercepted, and the corresponding UCX-defined functions run instead. The system library functions mmap, munmap, mremap, shmat, shmdt, sbrk, brk, and madvise correspond one-to-one to the UCX-defined functions ucm_mmap, ucm_munmap, ucm_mremap, ucm_shmat, ucm_shmdt, ucm_sbrk, ucm_brk, and ucm_madvise.
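The one-to-one correspondence can be pictured as a lookup table. This is a hypothetical illustration that stores names as strings only. A real hook would resolve the library function's address (for example with dlsym) and use the address of the corresponding ucm_ function instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only: pairs each intercepted system library function
 * with the UCX (UCM) function that runs in its place. */
typedef struct {
    const char *symbol;       /* system library function being hooked */
    const char *replacement;  /* UCX-defined function executed instead */
} hook_entry_t;

static const hook_entry_t hook_table[] = {
    { "mmap",    "ucm_mmap"    },
    { "munmap",  "ucm_munmap"  },
    { "mremap",  "ucm_mremap"  },
    { "shmat",   "ucm_shmat"   },
    { "shmdt",   "ucm_shmdt"   },
    { "sbrk",    "ucm_sbrk"    },
    { "brk",     "ucm_brk"     },
    { "madvise", "ucm_madvise" },
};

#define HOOK_TABLE_LEN (sizeof hook_table / sizeof hook_table[0])

/* Returns the replacement for `symbol`, or NULL if it is not hooked. */
static const char *hook_replacement_for(const char *symbol)
{
    assert(symbol != NULL);
    for (size_t i = 0; i < HOOK_TABLE_LEN; i++) {
        if (strcmp(hook_table[i].symbol, symbol) == 0)
            return hook_table[i].replacement;
    }
    return NULL;
}
```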
At this point, three Loongson-architecture functions have been added to UCX: one that flushes the processor data and instruction caches, one that counts leading zeros in a binary value, and the inline hook function.
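The leading-zero count maps directly onto hardware. MIPS32/MIPS64, and therefore LoongISA, provide clz for 32-bit values and dclz for 64-bit values. The sketch below is hypothetical: the names clz32, clz64, and ilog2_u32 are not UCX's. It uses inline assembly only when compiled for MIPS. Elsewhere it falls back to the GCC/Clang builtins __builtin_clz and __builtin_clzll. Those builtins are undefined for an input of zero, which is why the functions assert a nonzero argument.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading zero bits in a nonzero 32-bit value. */
static inline unsigned clz32(uint32_t x)
{
    assert(x != 0); /* __builtin_clz(0) is undefined */
#if defined(__mips__) && defined(__mips_isa_rev) && __mips_isa_rev >= 1
    unsigned r;
    __asm__("clz %0, %1" : "=r"(r) : "r"(x)); /* MIPS32/MIPS64 clz */
    return r;
#else
    return (unsigned)__builtin_clz(x);
#endif
}

/* Number of leading zero bits in a nonzero 64-bit value. */
static inline unsigned clz64(uint64_t x)
{
    assert(x != 0);
#if defined(__mips64) && defined(__mips_isa_rev) && __mips_isa_rev >= 1
    uint64_t r;
    __asm__("dclz %0, %1" : "=r"(r) : "r"(x)); /* MIPS64 dclz */
    return (unsigned)r;
#else
    return (unsigned)__builtin_clzll(x);
#endif
}

/* floor(log2(x)) for nonzero x: the index of the highest set bit. */
static inline unsigned ilog2_u32(uint32_t x)
{
    return 31u - clz32(x);
}
```

Deriving the highest-set-bit index (ilog2_u32) from the leading-zero count is a common use of this primitive in bit-manipulation helpers.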
Next, this section describes the step of adding Loongson processor mode detection to UCX's get-processor-mode function. The change is made in the get-processor-mode function of the UCS part of UCX.
Specifically, a Loongson processor mode item is added to the processor mode enumeration type, and logic that returns this item is added to the get-processor-mode function. When the get-processor-mode function returns the Loongson processor mode item, the current processor is identified as a Loongson processor.
As a concrete example, the enumeration item UCS_CPU_MODEL_LOONGISA is added to the enumeration type ucs_cpu_model_t, and logic that returns UCS_CPU_MODEL_LOONGISA is added to the ucs_arch_get_cpu_model function. When ucs_arch_get_cpu_model returns UCS_CPU_MODEL_LOONGISA, the CPU mode is LoongISA.
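Below is a trimmed-down sketch of the enumeration change. Only ucs_cpu_model_t, UCS_CPU_MODEL_LOONGISA, and ucs_arch_get_cpu_model come from the text above. Everything else is an assumption for illustration and is not UCX's actual logic: the other enum entries, the helper cpu_model_from_cpuinfo, and the detection rule, which looks for the string "Loongson" in /proc/cpuinfo when built for MIPS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Trimmed-down stand-in for ucs_cpu_model_t; existing x86/ARM items omitted. */
typedef enum {
    UCS_CPU_MODEL_UNKNOWN,
    UCS_CPU_MODEL_LOONGISA,   /* new item: Loongson (LoongISA) processor */
    UCS_CPU_MODEL_LAST
} ucs_cpu_model_t;

/* Hypothetical helper: classify the text of /proc/cpuinfo. */
static ucs_cpu_model_t cpu_model_from_cpuinfo(const char *cpuinfo)
{
    if (cpuinfo != NULL && strstr(cpuinfo, "Loongson") != NULL)
        return UCS_CPU_MODEL_LOONGISA;
    return UCS_CPU_MODEL_UNKNOWN;
}

static ucs_cpu_model_t ucs_arch_get_cpu_model(void)
{
#if defined(__mips__)
    char buf[4096];
    size_t n = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f != NULL) {
        n = fread(buf, 1, sizeof buf - 1, f); /* first 4 KiB is enough here */
        fclose(f);
    }
    assert(n < sizeof buf);
    buf[n] = '\0';
    return cpu_model_from_cpuinfo(buf);
#else
    /* Non-MIPS builds: this sketch only recognizes Loongson. */
    return UCS_CPU_MODEL_UNKNOWN;
#endif
}
```

A caller that receives UCS_CPU_MODEL_LOONGISA from ucs_arch_get_cpu_model can then treat the current processor as a Loongson processor, as described above.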
At this point, the UCS part of UCX has gained the Loongson-architecture cache-flush function, the leading-zero counting function, and Loongson processor mode detection. The UCM part of UCX has gained the Loongson-architecture inline hook function. Together these changes yield the target UCX.
Note that the present invention tested each addition with a UCX test function. The leading-zero counting function in UCS was tested with UCS_TEST_F(test_math, bitops), and the CPU mode function in UCS with UCS_TEST_F(test_type, cpu_set). The inline hook function in UCM was tested with UCS_TEST_F(malloc_hook_cplusplus, mmap_ptrs) and UCS_TEST_F(malloc_hook, bistro_patch). After running the test command make -C test/gtest test, all tests passed. This shows that once Loongson support is added, UCX can be compiled and run on platforms based on the Loongson processor architecture.
After the target UCX is obtained, the method proceeds to step S230: the target UCX is compiled and installed on the computing device, so that the computing device performs network communication through the interfaces provided by the target UCX.
With the Loongson-capable inline hook method added to UCM, the system library functions mmap, munmap, mremap, shmat, shmdt, sbrk, brk, and madvise can be intercepted. The UCX-defined functions ucm_mmap, ucm_munmap, ucm_mremap, ucm_shmat, ucm_shmdt, ucm_sbrk, ucm_brk, and ucm_madvise then run in their place.
Once the Loongson-architecture cache-flush function, leading-zero counting function, and processor mode detection are added to UCS, UCX can identify the Loongson processor mode, flush the Loongson processor's data and instruction caches, and count leading zeros in binary values. This in turn lets the other facilities in UCS run on platforms based on the preset processor architecture, here the Loongson architecture. These facilities include abstractions for atomic operations and thread safety, tools for efficient memory management (memory pools, memory allocators, etc.), and common data structures (hash tables, trees, lists). Therefore, once the target UCX is compiled and installed on the computing device, the computing device can perform network communication through the interfaces provided by the target UCX.
As a concrete example, the following changes are made to UCX:
in the UCS part, the cache-flush function and the leading-zero counting function supporting the LoongISA architecture are added;
in the UCM part, the inline hook function supporting the LoongISA architecture is added;
in the get-processor-mode function of the UCS part, Loongson processor mode detection is added.
These changes yield the target UCX. Because the target UCX supports the Loongson platform, it can be compiled and installed there. The interfaces provided by UCX can then be used on the Loongson platform for high-speed interconnect network communication, improving the efficiency of high-performance parallel computing on that platform.
The network communication method of the present invention proceeds as follows. First, the network communication software framework UCX is obtained. Next, target functions supporting the Loongson processor architecture are added to UCX, and Loongson processor mode detection is added to UCX's get-processor-mode function, yielding the target UCX. The target UCX is then compiled and installed on a computing device that includes the preset (Loongson) processor. That computing device can use the interfaces provided by the target UCX for high-speed interconnect network communication, improving the efficiency of its high-performance parallel computing. The method thus lets the Loongson platform, which the original UCX does not support, use UCX's interfaces for high-speed interconnect communication. This significantly improves the efficiency of high-performance parallel computing on the Loongson platform.
The various techniques described herein may be implemented in hardware or software, or a combination thereof. Thus, the method and device of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media. Such media include removable hard disks, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium. When the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code, and the processor is configured to execute the network communication method of the present invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, readable media include readable storage media and communication media. Readable storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, the algorithms and displays are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with examples of the invention; the structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. Various programming languages may be used to implement the invention described herein, and the above description of specific languages is provided to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
In the above description of exemplary embodiments, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. This is done to streamline the disclosure and to aid understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the preceding examples may be combined into one module or divided into multiple sub-modules.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) may be combined in any combination. The same applies to all processes or units of any method or device so disclosed. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, some embodiments described herein include certain features that appear in other embodiments while omitting others. Those skilled in the art will understand that combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of performing the function. A processor having the necessary instructions for carrying out such a method or method element therefore forms a means for carrying out the method or method element. Likewise, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the ordinal adjectives "first," "second," "third," etc. used to describe a common object merely indicate that different instances of like objects are being referred to. They are not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments are conceivable within the scope of the invention thus described. In addition, the language used in this specification has been chosen principally for readability and instructional purposes, not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is illustrative, not restrictive, and the scope of the invention is defined by the appended claims.

Claims (9)

  1. A network communication method, adapted to be executed in a computing device comprising a Loongson processor, the method comprising:
    obtaining the network communication software framework UCX;
    adding target functions supporting the Loongson processor architecture to the UCX, and adding Loongson processor mode detection to the get-processor-mode function of the UCX, to obtain a target UCX, wherein the target functions comprise a function for flushing the processor data and instruction caches, a function for counting leading zeros in a binary value, and an inline hook function; and
    compiling and installing the target UCX on the computing device, so that the computing device performs network communication through the interfaces provided by the target UCX.
  2. The method according to claim 1, wherein the step of adding target functions supporting the Loongson processor architecture to the UCX comprises:
    adding, in the UCS part of the UCX, the function for flushing the processor data and instruction caches and the function for counting leading zeros in a binary value, both supporting the Loongson processor architecture; and
    adding, in the UCM part of the UCX, the inline hook function supporting the Loongson processor architecture.
  3. The method according to claim 1 or 2, wherein the step of adding Loongson processor mode detection to the get-processor-mode function of the UCX comprises:
    adding a Loongson processor mode item to the processor mode enumeration type; and
    adding, to the get-processor-mode function, logic for returning the Loongson processor mode item.
  4. The method according to claim 2, wherein, after the function for flushing the processor data and instruction caches supporting the Loongson processor architecture is added to the UCS part of the UCX, the Loongson processor data and instruction caches are flushed by the following inline assembly expression:
    asm volatile("sync":::"memory")
    wherein asm declares an inline assembly expression, volatile tells the compiler not to optimize the inline assembly, sync flushes the processor data and caches in the LoongISA architecture, and memory declares to the compiler that memory has been modified.
  5. The method according to claim 2 or 4, wherein, after the function for counting leading zeros in a binary value supporting the Loongson processor architecture is added to the UCS part of the UCX, the number of leading zeros in a binary value is computed using the count-leading-zeros instruction of the LoongISA architecture.
  6. The method according to claim 2, wherein, after the inline hook function supporting the Loongson processor architecture is added to the UCM part of the UCX, the inline hook function replaces a called system library function with a UCX-defined function as follows:
    when a system library function is called, obtaining the address of the called system library function, and obtaining the address of the UCX-defined function corresponding to the called system library function; and
    writing a jump instruction together with the obtained address of the UCX-defined function at the address of the called system library function.
  7. The method according to claim 3, wherein, after Loongson processor mode detection is added to the get-processor-mode function of the UCX, the current processor is determined to be a Loongson processor as follows:
    when the get-processor-mode function returns the Loongson processor mode item, determining that the current processor is a Loongson processor.
  8. A computing device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and comprise instructions for performing the method according to any one of claims 1-7.
  9. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method according to any one of claims 1-7.
PCT/CN2021/129671 2021-06-25 2021-11-10 Network communication method, computing device, and readable storage medium WO2022267304A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110710382.7 2021-06-25
CN202110710382.7A CN113452532B (en) 2021-06-25 2021-06-25 Network communication method, computing device and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022267304A1 true WO2022267304A1 (en) 2022-12-29

Family

ID=77812729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/129671 WO2022267304A1 (en) 2021-06-25 2021-11-10 Network communication method, computing device, and readable storage medium

Country Status (2)

Country Link
CN (2) CN113452532B (en)
WO (1) WO2022267304A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452532B (en) * 2021-06-25 2022-08-12 统信软件技术有限公司 Network communication method, computing device and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915206A (en) * 2015-06-15 2015-09-16 南京阿凡达机器人科技有限公司 Method for managing attributes and data on DSP based on text analysis
CN106815086A (en) * 2017-01-13 2017-06-09 邦彦技术股份有限公司 Communication control framework based on Loongson platform
US20170168783A1 (en) * 2015-12-10 2017-06-15 Sap Se Generating logic with scripting language in software as a service enterprise resource planning
CN110716710A (en) * 2019-08-26 2020-01-21 许华敏 Radar signal processing software architecture
CN113452532A (en) * 2021-06-25 2021-09-28 统信软件技术有限公司 Network communication method, computing device and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6609158B1 (en) * 1999-10-26 2003-08-19 Novell, Inc. Component architecture in a computer system
CN106502706A (en) * 2016-11-10 2017-03-15 成都中嵌自动化工程有限公司 A kind of credible embedded computer and its collocation method based on Loongson processor
CN106991329A (en) * 2017-03-31 2017-07-28 山东超越数控电子有限公司 A kind of trust calculation unit and its operation method based on domestic TCM
US11106491B2 (en) * 2018-04-06 2021-08-31 Beijing Didi Infinity Technology And Development Co., Ltd. Method and system for kernel routine callbacks
CN111597109B (en) * 2020-04-24 2022-03-11 清华大学 Defect detection method and system for cross-architecture firmware stack memory
US20210042254A1 (en) * 2020-10-28 2021-02-11 Pratik Marolia Accelerator controller hub
CN112929461B (en) * 2021-01-21 2022-09-16 中国人民解放军国防科技大学 MPI process management interface implementation method based on high-speed interconnection network

Also Published As

Publication number Publication date
CN113452532A (en) 2021-09-28
CN115242563A (en) 2022-10-25
CN113452532B (en) 2022-08-12
CN115242563B (en) 2023-11-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946786

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE