WO2023051282A1 - Method, apparatus, system, and related device for embedding vector prefetching - Google Patents

Method, apparatus, system, and related device for embedding vector prefetching

Info

Publication number
WO2023051282A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
accelerator
vector
embedded vector
hash value
Prior art date
Application number
PCT/CN2022/119301
Other languages
English (en)
French (fr)
Inventor
端启航
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP22874669.9A (EP4390706A1)
Publication of WO2023051282A1
Priority to US18/619,696 (US20240241724A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0813Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/40Bus structure
    • G06F13/4004Coupling between buses
    • G06F13/4022Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30018Bit or string instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3861Recovery, e.g. branch miss-prediction, exception handling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/503Resource availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/509Offload
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/30Providing cache or TLB in specific location of a processing system
    • G06F2212/301In special purpose processing node, e.g. vector processor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/454Vector or matrix data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/50Control mechanisms for virtual memory, cache or TLB
    • G06F2212/502Control mechanisms for virtual memory, cache or TLB using adaptive policy
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/6028Prefetching based on hints or prefetch instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026PCI express
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the field of deep learning, and in particular to a method, apparatus, system, and related device for prefetching embedding vectors.
  • Embedding vector techniques are widely used in deep-learning-based recommender systems. The technique treats the user features obtained by the recommender system as a sparse vector and converts it into a dense vector through an embedding table.
  • The embedding table is stored in the memory of the server associated with the recommender system, and each row of the embedding table is an embedding vector.
  • Prefetching is the process in which the processor moves the embedding vectors required for training from the embedding table in the server's memory into the accelerator.
  • This application provides a method, apparatus, system, and related device for prefetching embedding vectors. They address the following problem: during prefetching, most embedding vectors are stored on an accelerator, and because the accelerator's memory capacity is limited, the volume of embedding vectors can exceed that capacity and cause a system exception.
  • An embodiment of the present invention provides an application system for embedding vector prefetching. The system includes a server, an accelerator, and a high-speed serial computer expansion bus (PCIE).
  • The server includes a processor and a first memory.
  • The accelerator includes a second memory, an instruction decoder, a controller, a multiplexer, and a computing module.
  • The server and the accelerator can be interconnected by a high-bandwidth link such as PCIE, and accelerators can be connected to one another by a high-bandwidth link or a network. One server can be connected to multiple accelerators.
  • A server is a device with both computing and storage capabilities. It can be a physical server, or a virtual machine implemented on a general-purpose physical server using network functions virtualization technology. This application does not limit the form of the server.
  • The server includes a processor and a first memory. The server may include more or fewer components, or integrate multiple components into one.
  • Processors are used to process data access requests from servers or other systems, as well as requests generated internally by the system.
  • For example, when the processor receives write requests sent by the server through the front-end port, the data in these requests is temporarily stored in the first memory. When the total amount of data in the first memory reaches a certain threshold, the processor sends the data stored in the first memory to the hard disk for persistent storage through the back-end port.
  • The first memory is the internal memory that exchanges data directly with the processor. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system or other running programs.
  • The first memory can be used to store data such as batch data and the embedding table, which the processor can access very quickly.
  • The first memory can also be used to store program code. The processor reads the data stored in the first memory and calls the program code stored there to manage the hard disk.
  • The accelerator can be a graphics processor, an embedded neural network processor, or another type of accelerator card.
  • The accelerator may include a second memory, an instruction decoder, a controller, a multiplexer, and a computing module.
  • The second memory can be used to store data. Its structure is similar to that of the first memory, but its capacity differs.
  • The instruction decoder receives the instructions sent by the processor and decodes them to obtain the addresses of the data to be computed and the operation type.
  • The controller may receive the addresses of the data sent by the instruction decoder and the calculation result output by the computing module.
  • The multiplexer selects, according to a control signal from the instruction decoder, whether to send the memory access command of the controller or of the processor to the second memory, and obtains from the second memory the data to be sent to the controller and the processor.
  • The computing module performs the corresponding calculation on the data according to the operation type.
  • The high-speed serial computer expansion bus standard was designed to replace older bus standards. It provides high-speed, serial, point-to-point, dual-channel, high-bandwidth transmission. Each connected device is allocated dedicated channel bandwidth and does not share bus bandwidth. The standard mainly supports active power management, error reporting, end-to-end reliable transmission, hot swapping, and quality of service.
  • An embodiment of the present invention provides a method for prefetching embedding vectors. The processor reads a salt value (salt) and a first embedding vector key (embedding key), and determines, according to the salt value and the first embedding vector key, the accelerator (device) corresponding to the first embedding vector key.
  • The processor determines whether the embedding vectors in the accelerator overflow. If there is no overflow, the processor sends the first embedding vector (embedding) to the corresponding accelerator. If there is an overflow, the processor does not send the first embedding vector to the accelerator, and the first embedding vector remains stored in the first memory.
  • The processor can read a batch of data from a disk or the network. The batch data can include m embedding vector keys, and the operation of sending the corresponding embedding vector from the first memory to the second memory is the same for every key.
  • The first embedding vector key can be any key in the batch data. It corresponds to exactly one row in the embedding table, and therefore to exactly one embedding vector.
  • The processor can also deduplicate and segment the embedding vector keys in the batch data.
  • The processor can randomly generate a salt value.
  • The processor reads the salt value and the first embedding vector key, and determines a first hash value according to the first embedding vector key.
  • The processor inputs the first embedding vector key into a first hash algorithm to determine the first hash value.
  • The first hash algorithm may be a message-digest algorithm, a secure hash algorithm, or the like.
  • The processor determines a second hash value according to the salt value and the first hash value.
  • The processor first combines the salt value with the first hash value. It may concatenate the first hash value and the salt value as strings, or insert the salt value at one or more positions in the first hash value, to obtain the salted first hash value.
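The two salting options described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and it assumes that "inserting the salt at one or more positions" means inserting the whole salt string at each listed index of the first hash value, which the source does not spell out.

```python
def salt_by_concatenation(first_hash: str, salt: str) -> str:
    """Append the salt to the first hash value (string concatenation)."""
    return first_hash + salt


def salt_by_insertion(first_hash: str, salt: str, positions: list[int]) -> str:
    """Insert the salt at each given index of the first hash value.

    Assumption: positions refer to indices in the original, unsalted string.
    Inserting from the highest index down keeps the lower indices valid.
    """
    salted = first_hash
    for pos in sorted(set(positions), reverse=True):
        salted = salted[:pos] + salt + salted[pos:]
    return salted
```

Either variant yields a string that depends on both the key's first hash value and the salt, which is all the later steps require.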
  • The processor can input the salted first hash value into a second hash algorithm.
  • The second hash algorithm may be a message-digest algorithm, a secure hash algorithm, or the like.
  • The processor determines the accelerator corresponding to the first embedding vector key according to the second hash value.
  • The processor may convert the second hash value to numeric form and substitute it, together with the number n of accelerators in the system, into the modulo-n mapping to obtain the corresponding accelerator.
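The key-to-accelerator mapping described above can be sketched with Python's standard `hashlib`. This is a sketch under stated assumptions: the source names only the algorithm families, so the choice of MD5 for the first hash, SHA-256 for the second, string concatenation for salting, and the function name `assign_accelerator` are all illustrative.

```python
import hashlib


def assign_accelerator(embedding_key: str, salt: str, num_accelerators: int) -> int:
    """Map an embedding key to an accelerator index via a salted two-stage hash."""
    # First hash: digest of the embedding key alone (MD5 is an assumed choice).
    first_hash = hashlib.md5(embedding_key.encode("utf-8")).hexdigest()
    # Salting: here, plain string concatenation of first hash and salt.
    salted = first_hash + salt
    # Second hash: digest of the salted first hash (SHA-256 is an assumed choice).
    second_hash = hashlib.sha256(salted.encode("utf-8")).hexdigest()
    # Convert the hex digest to an integer and apply the modulo-n mapping.
    return int(second_hash, 16) % num_accelerators
```

For a fixed salt the mapping is deterministic, so a key always lands on the same accelerator. Changing the salt reshuffles the assignment of keys to accelerators without changing n.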
  • The processor can determine the accelerators corresponding to all embedding vector keys in the batch data in the same way, and obtain the number of embedding vectors assigned to each accelerator.
  • The processor determines whether an embedding vector overflow occurs in any accelerator.
  • The processor can obtain the number of embedding vectors that each accelerator can store, that is, the capacity of its second memory, and compare the number of embedding vectors assigned to the accelerator with that capacity. When the capacity is not less than the number of assigned embedding vectors, there is no overflow.
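The capacity comparison can be sketched as below. The helper names are hypothetical, and the sketch assumes that capacity is expressed as a count of embedding vectors per accelerator, as the paragraph above suggests.

```python
from collections import Counter


def count_per_accelerator(assignments: list[int], num_accelerators: int) -> list[int]:
    """Count how many embedding vectors were assigned to each accelerator index."""
    counts = Counter(assignments)
    return [counts.get(i, 0) for i in range(num_accelerators)]


def has_capacity_overflow(counts: list[int], capacities: list[int]) -> bool:
    """Return True if any accelerator is assigned more vectors than it can store."""
    return any(count > capacity for count, capacity in zip(counts, capacities))
```

Here `assignments` is the list of accelerator indices computed for the deduplicated keys of one batch, and `capacities[i]` is the number of vectors that fit in the second memory of accelerator i.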
  • Alternatively, the processor may calculate a standard deviation from the numbers of embedding vectors assigned to the accelerators, set a threshold, and compare the standard deviation with the threshold. When the standard deviation is less than or equal to the threshold, there is no embedding vector overflow.
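The standard-deviation criterion can be sketched with the standard `statistics` module. The source does not say whether the population or the sample standard deviation is meant, nor how the threshold is chosen; the sketch assumes the population form (`pstdev`) and takes the threshold as a caller-supplied parameter. The function name is hypothetical.

```python
import statistics


def is_unbalanced(counts: list[int], threshold: float) -> bool:
    """Treat the assignment as overflowing when the spread of per-accelerator
    counts exceeds the threshold (population standard deviation assumed)."""
    return statistics.pstdev(counts) > threshold
```

A perfectly even assignment such as `[25, 25, 25, 25]` has a standard deviation of 0 and passes any non-negative threshold, whereas `[100, 0, 0, 0]` has a standard deviation of about 43.3 and fails a threshold of, say, 5.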
  • When the embedding vectors overflow, the first embedding vector remains stored in the first memory, and the processor does not send it to the accelerator.
  • The processor can also read a new salt value, save it in the configuration file, and repeat the above steps with the new salt value to recalculate the correspondence between embedding vectors and accelerators, until no accelerator has an embedding vector overflow.
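The re-salting loop can be sketched as follows. Several details are assumptions made for illustration: the hash choices (MD5, then SHA-256), the use of `secrets.token_hex` as the salt source, the `max_attempts` bound (the source simply repeats until no overflow occurs), and returning the tried salts in a list instead of writing them to a configuration file.

```python
import hashlib
import secrets
from collections import Counter


def accelerator_for(key: str, salt: str, n: int) -> int:
    """Salted two-stage hash followed by a modulo-n mapping (assumed algorithms)."""
    first = hashlib.md5(key.encode("utf-8")).hexdigest()
    second = hashlib.sha256((first + salt).encode("utf-8")).hexdigest()
    return int(second, 16) % n


def find_balanced_salt(keys, capacities, max_attempts=100, salt_factory=None):
    """Try fresh salts until no accelerator exceeds its capacity.

    Returns (salt, tried_salts); salt is None if every attempt overflowed.
    """
    if salt_factory is None:
        salt_factory = lambda: secrets.token_hex(8)  # hypothetical salt source
    n = len(capacities)
    tried = []
    for _ in range(max_attempts):
        salt = salt_factory()
        tried.append(salt)  # record each salt that was read
        counts = Counter(accelerator_for(k, salt, n) for k in keys)
        if all(counts.get(i, 0) <= capacities[i] for i in range(n)):
            return salt, tried
    return None, tried
```

If the total capacity is smaller than the number of keys, no salt can succeed, which is why the sketch bounds the number of attempts. In that case the vectors would simply stay in the first memory, as described above.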
  • The processor may send the first embedding vector, the address of the embedding vector, and the communication information between accelerators to the cache of the second memory of the corresponding accelerator.
  • The above method can mask the information transfer between the server and the accelerator, and can effectively solve the accelerator capacity overflow caused by unbalanced prefetching, so that no system exception occurs.
  • The processor reads the salt value and the first embedding vector key, determines the first hash value according to the first embedding vector key, determines the second hash value according to the salt value and the first hash value, and performs a modulo-n operation on the second hash value to obtain the accelerator corresponding to the first embedding vector key.
  • The processor determines whether the embedding vectors overflow. If there is no overflow, the processor sends the first embedding vector to the second memory of the corresponding accelerator. If there is an overflow, the processor does not send the first embedding vector to the accelerator, but reads a new salt value.
  • By adding a string to the hash value, that is, by salting, the processor turns the fixed modulo-n mapping into a dynamic one and changes the correspondence between embedding vectors and accelerators. The embedding vectors can thus be distributed evenly across accelerators, which avoids embedding vector overflow in any accelerator, avoids system exceptions, and achieves balanced prefetching.
  • An embodiment of the present invention provides an embedding vector prefetching apparatus, which includes an acquisition unit, a hash operation unit, a comparison unit, and a data output unit.
  • The acquisition unit is used to obtain the salt value and the first embedding vector key.
  • The hash operation unit is used to determine the accelerator corresponding to the first embedding vector key according to the first embedding vector key and the salt value, and to determine whether the embedding vectors assigned to each accelerator overflow.
  • The data output unit is used to send the first embedding vector to the second memory of the corresponding accelerator when there is no overflow. When there is an overflow, it does not send the first embedding vector to the corresponding accelerator and keeps it stored in the first memory.
  • The salt value obtained by the acquisition unit may be a randomly generated string of one or more characters, or a string stored in a configuration file.
  • The first embedding vector key is any embedding vector key in the batch data, and the batch data may be obtained by the acquisition unit from a disk or a network.
  • The acquisition unit is used to input the salt value and the first embedding vector key into the hash operation unit.
  • The hash operation unit is configured to determine the accelerator corresponding to the first embedding vector key according to the first embedding vector key and the salt value.
  • The hash operation unit is configured to input the first embedding vector key into the first hash algorithm to determine the first hash value.
  • The hash operation unit is used to add the salt value to the first hash value and input the salted first hash value into the second hash algorithm to determine the second hash value.
  • The first hash algorithm and the second hash algorithm may be message-digest algorithms, secure hash algorithms, or the like.
  • The hash operation unit is used to convert the second hash value to numeric form, and then substitute the second hash value and the number of accelerators into the modulo-n mapping to determine the accelerator corresponding to the first embedding vector key.
  • The hash operation unit is used to determine whether the embedding vectors allocated to each accelerator overflow.
  • The comparison unit can be used to compare the volume of embedding vectors allocated to each accelerator with the capacity of that accelerator's second memory. When the volume of embedding vectors is greater than the capacity of the second memory, the embedding vectors overflow. When it is less than or equal to the capacity of the second memory, there is no overflow.
  • The comparison unit can also be used, after the accelerator corresponding to each embedding vector key has been determined, to compare the number of embedding vectors with the capacity of the second memory of each accelerator.
  • Alternatively, the hash operation unit can be used to calculate a standard deviation from the number of embedding vectors assigned to each accelerator, set a threshold, and compare the standard deviation with the threshold. When the standard deviation is greater than the threshold, the embedding vectors overflow. When it is less than or equal to the threshold, there is no overflow.
  • In the no-overflow case, the data output unit is configured to send the first embedding vector, the address of the embedding vector, and the communication information between accelerators to the second memory of the corresponding accelerator.
  • In the overflow case, the data output unit is configured not to send the first embedding vector to the accelerator, and to keep the first embedding vector stored in the first memory.
  • The acquisition unit can be used to obtain a new salt value, store it in the configuration file, and repeat the above process to recalculate the correspondence between embedding vector keys and accelerators.
  • The apparatus can be widely used in deep-learning-based model training, and may also include a training unit.
  • An embodiment of the present invention provides a computing device that includes a processor and a memory. The memory stores computer instructions, and the processor executes them to implement the functions of the modules in any possible implementation of the first, second, or third aspect.
  • An embodiment of the present invention provides a computer-readable storage medium that stores instructions. When the instructions are run on the above computing device, they cause the computing device to execute the method described in the first, second, or third aspect.
  • The hash algorithm used in the prefetching process is fast, and the salt values read during prefetching are recorded. All salt values that achieve balanced prefetching can therefore be compared in terms of training effect and other aspects, and a better salt value can be selected to achieve higher throughput and improve overall computing efficiency.
  • In summary, this application turns the fixed modulo-n mapping into a dynamic one by adding salt to the basic hash value, which changes the correspondence between embedding vectors and accelerator location information. The embedding vectors corresponding to the embedding vector keys can thus be allocated evenly to different accelerators, which solves the problem of insufficient accelerator memory capacity caused by unbalanced prefetching and keeps the system running normally.
  • FIG. 1 is a schematic block diagram of an application system for embedding vector prefetching provided by an embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for embedding vector prefetching provided by an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of an embedding vector prefetching apparatus provided by an embodiment of the present invention.
  • This application provides a method for prefetching embedding vectors, which can distribute the embedding vectors in the embedding table evenly across the accelerators and avoid system exceptions.
  • FIG. 1 is a schematic block diagram of an application system for embedding vector prefetching provided by this application.
  • The application system 100 includes a server 110, an accelerator 120, and a high-speed serial computer expansion bus (PCIE) 130.
  • The server 110 and the accelerator 120 are interconnected by a high-bandwidth link such as PCIE 130, and the accelerators 120 are connected to one another by PCIE 130 or a network.
  • The server 110 is a device with both computing and storage capabilities. It may be a physical server, such as an x86 server or an ARM server, or a virtual machine (VM) implemented on a general-purpose physical server using network functions virtualization (NFV) technology. A virtual machine is a complete computer system that is simulated in software, has complete hardware system functions, and runs in a fully isolated environment, such as a virtual device in cloud computing. This application does not impose a specific limitation.
  • The server 110 includes a processor 111 and a first memory 112. It should be understood that the server shown in FIG. 1 may include more or fewer components, and that multiple components of the server shown in FIG. 1 may also be integrated into one component. This application does not specifically limit the structure of the server.
  • The processor 111 may consist of at least one general-purpose processor, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip.
  • The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • The processor 111 is used to process data access requests from the server or other systems, and also to process requests generated within the system. For example, when the processor 111 receives write data requests sent by the server 110 through the front-end port, the data in these requests is temporarily stored in the first memory 112; when the total amount of data in the first memory 112 reaches a certain threshold, the processor 111 sends the data stored in the first memory to the hard disk through the back-end port for persistent storage. After receiving a request, the processor can also read data, including the salt value and the batch data, and store this data in the first memory as well.
  • the first memory 112 refers to an internal memory directly exchanging data with the processor. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for the operating system or other running programs.
  • The memory includes at least two kinds of memory; for example, it may be random access memory (RAM) or read-only memory (ROM).
  • For example, the random access memory may be dynamic random access memory (DRAM) or storage class memory (SCM).
  • DRAM is a semiconductor memory that, like most random access memory (RAM), is a volatile memory device.
  • the first memory may also include other random access memories, such as static random access memory (static random access memory, SRAM) and the like.
  • The read-only memory may be, for example, programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), or the like.
  • The first memory may also be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or a solid state disk (SSD).
  • the first memory 112 can be used to store data information, such as batch data and embedded tables, so the speed of reading the above data information is very fast.
  • The first memory can also be used to store program code. The processor 111 reads the data stored in the first memory and runs the program code stored in the first memory to manage the hard disk.
  • The program code in the first memory 112 in FIG. 1 may include one or more units, for example, an acquisition unit, a hash operation unit, a comparison unit, and a data output unit. It should be understood that this division of the program code into units is only an example: the units can be merged or split into more or fewer units, and the positional relationship between the system and the units does not constitute any limitation. This application does not specifically limit this.
  • The accelerator 120 may be a graphics processing unit (GPU), an embedded neural-network processing unit (NPU), or another type of accelerator card, such as a physics processing unit (PPU).
  • the accelerator 120 includes a second memory 121 , an instruction decoder 122 , a controller 123 , a multiplexer 124 and a computing module 125 .
  • the second memory 121 is used to store data information, has a similar structure and function to the first memory 112, and differs only in memory capacity. There is also a cache memory (cache) in the second memory.
  • the original meaning of the cache memory refers to a kind of RAM whose access speed is faster than general random access memory (RAM).
  • The cache memory is a level of memory located between the main memory and the processor. It is composed of static random access memory (SRAM) chips and has a relatively small capacity but a high speed, close to that of the processor. The scheduling and transfer of information between the cache memory and the main memory is done automatically by hardware.
  • the instruction decoder 122 is used to receive instructions sent by the processor, decode the instructions sent by the processor, and obtain a decoding result, which is used to indicate addresses and operation types of multiple data to be calculated.
  • the instruction decoder 122 includes a status register and an instruction cache queue.
  • the status register is an addressable memory space. When the processor sends a read request to this address, the instruction decoder 122 immediately returns the working status of the accelerator stored in the status register to the processor.
  • the controller 123 receives the addresses of the multiple data sent by the instruction decoder 122 and the calculation result output by the calculation module 125 .
  • the multiplexer 124 is used to select and send the memory access command of the controller 123 or the processor to the memory according to the control signal of the instruction decoder 122 , and obtain data to be sent to the controller 123 and the processor from the memory.
  • the calculation module 125 is used for performing corresponding calculations on a plurality of data according to operation types.
  • the calculation module 125 includes a calculation unit, an input unit, a calculation unit array and an output unit.
  • the calculation unit is used to control the calculation unit array to execute instructions to perform corresponding data processing operations.
  • the input unit is used to cache data ready for execution of instructions.
  • the output unit is used to cache calculation results obtained after the calculation unit array executes instructions.
  • The first memory 112 and the second memory 121 have similar functions but differ in capacity: the capacity of the second memory is smaller than that of the first memory. Because the size of the embedding table generally exceeds the capacity of the second memory, the embedding table can only be stored in the first memory. The second memory 121 is located in the accelerator 120 and can store the part of the embedding vectors sent by the processor, the embedding vector addresses, and the communication information between accelerators.
  • The accelerator 120 shown in FIG. 1 may include more or fewer components, and multiple components of the accelerator may also be integrated into one component. This application does not specifically limit the structure of the accelerator.
  • The high-speed serial computer expansion bus standard (peripheral component interconnect express, PCIE) 130 shown in FIG. 1 comes in multiple specifications, from PCIE x1 to PCIE x32, and this application does not specifically limit the type of the high-speed serial computer expansion bus standard.
  • FIG. 2 is a flowchart of a method for prefetching embedded vectors provided by the present application.
  • the method for embedding vector prefetching can be applied to the application system shown in Figure 1, and the method includes the following steps:
  • S201 The processor reads the salt value and the first embedded vector key.
  • the salt value may be a randomly generated string consisting of one or more characters, or a string stored in a configuration file.
  • the first embedding vector key is any embedding vector key in the batch data (batch).
  • the batch data may include keywords corresponding to each embedding vector (embedding) in an embedding table.
  • The batch data includes multiple embedding vector keywords. If some of these keywords are repeated, the duplicates can be removed; if some of these keywords are not used, the unused keywords can be deleted. The batch data can be read by the processor from a disk or a network.
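  • The cleanup of the batch data described above can be sketched as follows. This is only an illustration: the function name `clean_batch` and the idea of passing the set of used keywords as `used_keys` are assumptions, since the application does not say how unused keywords are identified.

```python
from typing import Iterable, List, Set


def clean_batch(keys: Iterable[str], used_keys: Set[str]) -> List[str]:
    """Drop repeated keys (keeping first occurrences) and keys that are not used.

    `used_keys` is a hypothetical input: the caller decides which keys count as used.
    """
    deduplicated = dict.fromkeys(keys)  # dicts preserve insertion order
    return [key for key in deduplicated if key in used_keys]


batch = ["item_7", "item_3", "item_7", "item_9", "item_3"]
print(clean_batch(batch, used_keys={"item_3", "item_7"}))  # ['item_7', 'item_3']
```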
  • S202 The processor determines an accelerator corresponding to the first embedded vector keyword according to the first embedded vector keyword and the salt.
  • the process for the processor to determine the accelerator information corresponding to the first embedded vector keyword according to the first embedded vector keyword and the salt includes the following steps S211 to S213.
  • S211 The processor determines a first hash value according to the first embedded vector key.
  • the processor inputs the first embedding vector key into a first hash algorithm to determine a first hash value.
  • The first hash algorithm may be a message digest algorithm (message-digest algorithm 5, MD5), a secure hash algorithm (secure hash algorithm 1, SHA-1), or the like. It should be understood that the first hash algorithm can take a variety of other forms, such as SHA-224 or SHA-256, and this application does not impose any specific limitation on the first hash algorithm.
  • The first hash algorithm can be expressed as:
  • $ADD_1 = \mathrm{hash}(key)$
  • where $ADD_1$ denotes the first hash value, $key$ denotes the first embedding vector keyword, and $\mathrm{hash}()$ denotes the mapping between the first embedding vector keyword and the first hash value.
  • S212 The processor determines a second hash value according to the salt value and the first hash value.
  • the processor needs to combine the salt value with the first hash value.
  • The processor may combine the salt value with the first hash value in ways that include: (1) concatenating the first hash value and the salt value as character strings to obtain the salted first hash value; (2) inserting the salt value at one or more positions in the first hash value to obtain the salted first hash value.
  • Combining the salt value with the first hash value is a form of string combination. Besides the two ways above, many other forms are possible, and this application does not specifically limit the way in which the salt value and the first hash value are combined.
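  • The two ways of combining the salt value with the first hash value can be sketched as follows. The function names are hypothetical, and inserting the whole salt at every listed position is one possible reading of way (2), not the only one.

```python
from typing import Iterable


def salt_by_concatenation(first_hash: str, salt: str) -> str:
    """Way (1): append the salt to the first hash value."""
    return first_hash + salt


def salt_by_insertion(first_hash: str, salt: str, positions: Iterable[int]) -> str:
    """Way (2): insert the salt at each given index of the first hash value.

    Indices refer to the unsalted string; duplicates are ignored.
    """
    ordered = sorted(set(positions))
    if any(p < 0 or p > len(first_hash) for p in ordered):
        raise ValueError("position outside the first hash value")
    pieces = []
    previous = 0
    for position in ordered:
        pieces.append(first_hash[previous:position])
        pieces.append(salt)
        previous = position
    pieces.append(first_hash[previous:])
    return "".join(pieces)


print(salt_by_concatenation("abcdef", "XY"))      # abcdefXY
print(salt_by_insertion("abcdef", "XY", [2, 4]))  # abXYcdXYef
```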
  • The processor inputs the salted first hash value into a second hash algorithm to determine the second hash value.
  • The second hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like.
  • The second hash algorithm can also take many different forms, such as SHA-224 or SHA-256, and this application does not impose any specific limitation on the second hash algorithm.
  • The second hash algorithm can be expressed as:
  • $ADD_2 = \mathrm{hash}(\mathrm{hash}(key) + salt)$
  • where $ADD_2$ denotes the second hash value, $salt$ denotes the salt value, $key$ denotes the first embedding vector keyword, $+$ denotes the combination of the salt value with the first hash value, and the outer $\mathrm{hash}()$ denotes the mapping between the salted first hash value and the second hash value.
  • S213 The processor determines the accelerator corresponding to the first embedded vector key according to the second hash value.
  • The processor converts the second hash value into numeric form and substitutes it, together with the number of accelerators in the system, into the modulo n mapping relationship to determine the accelerator information corresponding to the first embedding vector keyword.
  • The modulo n mapping relationship can be expressed as:
  • $dev = \mathrm{hash}(\mathrm{hash}(key) + salt) \bmod n$
  • where $dev$ denotes the accelerator information, $\mathrm{hash}(\mathrm{hash}(key) + salt)$ is the second hash value expressed in numeric form, $n$ is the number of accelerators in the system, and $\bmod$ denotes the mapping between the second hash value and the accelerators.
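  • Steps S211 to S213 can be sketched with Python's standard `hashlib` module. This is a minimal illustration under stated assumptions: MD5 is used for both hash algorithms, the salt is appended to the first hash value, and the helper names are hypothetical. The application allows other hash algorithms and other ways of combining the salt.

```python
import hashlib


def first_hash(key: str) -> str:
    """ADD1 = hash(key): hex digest of the embedding vector keyword (MD5 assumed)."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def second_hash(key: str, salt: str) -> str:
    """ADD2 = hash(hash(key) + salt): the salt is appended to the first hash value."""
    salted = first_hash(key) + salt
    return hashlib.md5(salted.encode("utf-8")).hexdigest()


def accelerator_for(key: str, salt: str, n: int) -> int:
    """dev = ADD2 mod n, reading the hexadecimal second hash value as an integer."""
    return int(second_hash(key, salt), 16) % n


# The same keyword can land on a different accelerator when the salt changes.
for salt in ("salt-a", "salt-b"):
    print(salt, accelerator_for("user_42", salt, n=8))
```

  • Because the mapping is a pure function of the keyword, the salt, and n, any party that knows the salt can recompute which accelerator holds a given embedding vector without a lookup table.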
  • S203 The processor determines whether an embedding vector overflow exists in the above accelerator. If not, the processor proceeds to step S204; if so, it proceeds to step S205.
  • The processor first determines whether an embedding vector overflow exists in the accelerator, and does not yet send the corresponding first embedding vector to the accelerator.
  • the processor can obtain the number of embedding vectors that can be stored in each accelerator, that is, the capacity of the second memory.
  • The processor compares the number of embedding vectors allocated to each accelerator with the capacity of that accelerator's second memory. If, for every accelerator, the capacity of the second memory is not less than the number of embedding vectors allocated to it, no embedding vector overflows, and the processor executes step S204. If the number of embedding vectors allocated to an accelerator is greater than the capacity of its second memory, the embedding vectors overflow, and the processor executes step S205.
  • the processor calculates a standard deviation according to the number of embedding vectors corresponding to each accelerator, sets a threshold, and compares the standard deviation with the threshold. If the standard deviation is less than or equal to the threshold, the embedding vector has no overflow, and the processor executes step S204. If the standard deviation is greater than the threshold, the embedding vector overflows, and the processor executes step S205.
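  • The two overflow criteria described above (capacity comparison and standard deviation against a threshold) can be sketched as follows. The function names are hypothetical, capacities are counted in embedding vectors, and the population standard deviation is an assumption because the application does not say which variant is used.

```python
from collections import Counter
from statistics import pstdev
from typing import Dict, List, Sequence


def count_per_accelerator(assignments: Dict[str, int], n: int) -> List[int]:
    """Number of embedding vectors assigned to each of the n accelerators."""
    counts = Counter(assignments.values())
    return [counts.get(dev, 0) for dev in range(n)]


def overflows_by_capacity(counts: Sequence[int], capacities: Sequence[int]) -> bool:
    """Overflow if any accelerator is assigned more vectors than its second memory holds."""
    return any(count > capacity for count, capacity in zip(counts, capacities))


def overflows_by_spread(counts: Sequence[int], threshold: float) -> bool:
    """Overflow if the standard deviation of the per-accelerator counts exceeds the threshold."""
    return pstdev(counts) > threshold


assignments = {"k1": 0, "k2": 0, "k3": 0, "k4": 1}
counts = count_per_accelerator(assignments, n=2)
print(counts, overflows_by_capacity(counts, [2, 2]), overflows_by_spread(counts, 0.5))
```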
  • S204 The processor sends the first embedding vector to a corresponding accelerator.
  • The processor determines that the capacity of the second memory of every accelerator is not less than the number of embedding vectors allocated to that accelerator. The processor then finds, in the embedding table, the first embedding vector corresponding to the first embedding vector keyword and sends it to the corresponding accelerator.
  • the processor may send the first embedding vector to the cache memory in the second memory of the corresponding accelerator through PCIE.
  • the processor can also send the address of the embedded vector and the communication information between the accelerators to each accelerator.
  • S205 The processor does not send the first embedding vector to the accelerator, but keeps the first embedding vector stored in the first memory.
  • the processor determines that an embedded vector overflow exists in the accelerator.
  • The processor does not send the first embedding vector corresponding to the first embedding vector keyword in the embedding table to the accelerator, and the first embedding vector remains stored in the first memory of the server.
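  • The decision between steps S204 and S205 for a single keyword can be sketched as follows. This is a simplified model under stated assumptions: each accelerator's second memory is represented as a dictionary, capacity is counted in embedding vectors, and the name `prefetch_one` is hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def prefetch_one(
    key: str,
    dev: int,
    embedding_table: Dict[str, Vector],
    second_memories: List[Dict[str, Vector]],
    capacities: List[int],
) -> bool:
    """Return True if the vector is sent to accelerator `dev` (S204),
    False if it stays in the first memory because `dev` is full (S205)."""
    target = second_memories[dev]
    if key in target:
        return True  # already prefetched, nothing to send
    if len(target) >= capacities[dev]:
        return False  # overflow: keep the vector in the embedding table only
    target[key] = embedding_table[key]
    return True


table = {"a": [0.1, 0.2], "b": [0.3, 0.4]}
memories: List[Dict[str, Vector]] = [{}]
print(prefetch_one("a", 0, table, memories, [1]))  # True
print(prefetch_one("b", 0, table, memories, [1]))  # False
```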
  • After the processor finishes the calculation for each embedding vector keyword in the batch data, it obtains the accelerator corresponding to each embedding vector keyword.
  • the processor can count the capacity of the second memory of each accelerator, and compare it with the number of embedding vectors allocated to each accelerator. In the case that the number of embedded vectors is greater than the capacity of the second memory of the accelerator, the processor can read a new salt value, save it in the configuration file, repeat the above steps from S201 to S205 with the new salt value, and recalculate each The corresponding relationship between the embedded vector keywords and the accelerators until there is no embedded vector overflow in all accelerators. In the case that the capacity of the second memory of all accelerators is not less than the number of embedding vectors allocated to the corresponding accelerator, the processor may send each embedding vector to the second memory of the corresponding accelerator.
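  • The retry loop described above, in which a new salt value is tried until no accelerator overflows, can be sketched as follows. Several details are assumptions for illustration: MD5 with an appended salt, random salts drawn with `secrets.token_hex`, capacities counted in embedding vectors, and a `max_attempts` bound that the application does not mention (it repeats until a suitable salt is found).

```python
import hashlib
import secrets
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def accelerator_for(key: str, salt: str, n: int) -> int:
    """dev = hash(hash(key) + salt) mod n, with MD5 assumed for both hashes."""
    first = hashlib.md5(key.encode("utf-8")).hexdigest()
    second = hashlib.md5((first + salt).encode("utf-8")).hexdigest()
    return int(second, 16) % n


def find_balanced_salt(
    keys: Sequence[str], capacities: Sequence[int], max_attempts: int = 1000
) -> Optional[Tuple[str, Dict[str, int]]]:
    """Try random salts until every accelerator's share fits in its second memory.

    Returns (salt, keyword-to-accelerator mapping), or None if no salt fits
    within max_attempts (a safety bound added for this sketch).
    """
    n = len(capacities)
    for _ in range(max_attempts):
        salt = secrets.token_hex(4)
        mapping = {key: accelerator_for(key, salt, n) for key in keys}
        counts = Counter(mapping.values())
        if all(counts.get(dev, 0) <= capacities[dev] for dev in range(n)):
            return salt, mapping
    return None


keys = [f"item_{i}" for i in range(20)]
result = find_balanced_salt(keys, capacities=[12, 12])
print("found" if result else "no suitable salt")
```

  • In practice the chosen salt would be saved (the application mentions a configuration file) so that later runs can reuse it instead of searching again.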
  • After determining the accelerator corresponding to one embedding vector keyword in the batch data, the processor obtains the capacity of the second memory of each accelerator and compares it with the number of embedding vectors allocated to each accelerator so far. In the case that the number of embedding vectors is greater than the capacity of the second memory of an accelerator, the processor can read a new salt value, save it in the configuration file, and repeat steps S201 to S205 with the new salt value to recalculate the correspondence between each embedding vector keyword and the accelerators, until no embedding vector overflow exists in any accelerator. The processor may also return to the first memory of the server the embedding vectors that were sent to the accelerators before this comparison.
  • In the case that no embedding vector overflow exists, the processor sends the embedding vector corresponding to the embedding vector keyword calculated this time to the second memory of the corresponding accelerator through PCIE.
  • the processor continues to calculate the accelerator corresponding to the next embedding vector keyword in the batch data, and repeats the above steps.
  • The processor can also send the addresses of the embedding vectors and the communication information between the accelerators to each accelerator. This is because, in a recommendation system, an accelerator performing model training may need embedding vectors stored by other accelerators. According to the address of the embedding vector required for training, the processor extracts the embedding vector from the cache memory of the accelerator where it is located, and then sends it through PCIE or the network to the accelerator that needs the embedding vector for its operation.
  • When training of the recommendation system is initialized, the processor stores the embedding table in the first memory, reads the batch data from the disk, and also stores the batch data in the first memory of the server.
  • the batch data includes a set of embedding vector keywords to be used in training, and each embedding vector keyword corresponds to a unique row in the embedding table, corresponding to an embedding vector.
  • the processor invokes the salt value and the embedding vector key in the batch data to calculate the accelerator corresponding to the embedding vector key.
  • the first hash algorithm is MD5.
  • The processor substitutes the salted first hash value into the second hash algorithm and calculates the second hash value after salting, 6595.
  • The processor substitutes the second hash value and the number of accelerators in the system into the modulo n mapping relationship and, through the modulo n calculation, obtains device5 as the accelerator corresponding to the embedding vector keyword.
  • both the first hash value and the second hash value are expressed in the form of hexadecimal, and when the modulo n operation is performed, the second hash value is expressed in the form of numbers.
  • The accelerators corresponding to the other embedding vectors in the batch data can be calculated in the same way, which is not repeated here.
  • After determining the accelerators corresponding to all the embedding vector keywords, the processor counts the number of embedding vectors allocated to each accelerator and compares it with the capacity of the second memory of each accelerator. The processor determines that, for some accelerator, the number of allocated embedding vectors is greater than the capacity of its second memory, so the embedding vectors overflow.
  • In this case, the embedding vectors remain stored in the first memory of the server.
  • the processor randomly generates a new salt value, saves it in the configuration file, and then uses the new salt value to repeat the above steps to recalculate the corresponding relationship between the embedded vector keyword and the accelerator until a salt value is found that satisfies the balanced prefetch conditions, so that the accelerator does not have the overflow of embedded vectors.
  • Once the processor finds an appropriate salt value and determines that no embedding vector overflows in any accelerator, it sends the embedding vectors, the address of each embedding vector, and the communication information between accelerators to the second memory of the corresponding accelerators through PCIE. The prefetch process then ends, and the training phase begins.
  • When an accelerator performs recommendation model training, it needs to call embedding vectors stored in other accelerators. According to the address, stored in the second memory of the accelerator, of the embedding vector required for training, the processor extracts the embedding vector from the cache memory of the corresponding accelerator and then sends it through PCIE from the original accelerator to this accelerator.
  • The processor reads the salt value and the batch data, determines the first hash value according to the first embedding vector keyword in the batch data, determines the second hash value according to the salt value and the first hash value, and performs the modulo n operation on the second hash value to obtain the accelerator corresponding to the first embedding vector keyword. The processor then sends the first embedding vector to the second memory of the corresponding accelerator.
  • With the first and second hash algorithms, the processor does not need to record the position of each embedding vector, record the embedding table in the accelerators, or search for the position of an embedding vector, which greatly improves the operation speed.
  • By adding a string to the first hash value, that is, by salting, the processor turns the fixed modulo n mapping relationship into a dynamic one and changes the correspondence between embedding vectors and accelerators, so that the number of embedding vectors corresponding to each accelerator is not greater than the capacity of that accelerator's second memory. This achieves balanced prefetching and avoids embedding vector overflow and system exceptions.
  • FIG. 3 is a schematic structural diagram of an embedded vector prefetching device provided in the present application.
  • the program code 310 in the prefetching device 300 includes an acquisition unit 320 , a hash operation unit 330 , a comparison unit 340 and a data output unit 350 .
  • The device sends the embedding vector corresponding to each embedding vector keyword from the first memory to the second memory using the same process, which is as follows:
  • The acquisition unit is used to obtain the salt value and the first embedding vector keyword. When there are multiple embedding vector keywords, the acquisition unit can deduplicate and split the keywords and store them in the first memory of the server.
  • the obtained salt value is a string that can be split and stored in a configuration file. In the absence of a configuration file, the salt is randomly generated.
  • the acquisition unit is used to input the salt value and the batch data into the hash operation unit.
  • the hash operation unit is configured to determine the accelerator corresponding to the first embedded vector keyword according to the first embedded vector keyword and the salt value.
  • The first hash value can be expressed as $ADD_1 = \mathrm{hash}(key)$, where $ADD_1$ denotes the first hash value, $key$ denotes the first embedding vector keyword, and $\mathrm{hash}()$ denotes the mapping between the first embedding vector keyword and the first hash value.
  • The first hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like. It should be understood that the first hash algorithm, as a basic hash algorithm, can take many different forms, and this application does not impose any specific limitation on the first hash algorithm.
  • The hash operation unit is used to add the salt value to the first hash value and substitute the salted first hash value into the second hash algorithm to determine the second hash value, which can be expressed as $ADD_2 = \mathrm{hash}(\mathrm{hash}(key) + salt)$.
  • Here $ADD_2$ denotes the second hash value, $salt$ denotes the salt value, $key$ denotes the first embedding vector keyword, $+$ denotes the combination of the salt value with the first hash value, and the outer $\mathrm{hash}()$ denotes the mapping between the salted first hash value and the second hash value. The second hash value is represented in the same form as the first hash value.
  • The way of combining the salt value with the first hash value may include: (1) the hash operation unit concatenates the first hash value and the salt value to obtain the salted first hash value; (2) the hash operation unit inserts the salt value at one or more positions in the first hash value to obtain the salted first hash value.
  • The second hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like. It should be understood that the second hash algorithm can also take a variety of different forms, such as SHA-224 or SHA-256, and this application does not impose any specific limitation on the second hash algorithm. It should also be understood that there are many ways of combining the salt value and the first hash value besides the two ways above, and this application does not specifically limit the way in which they are combined.
  • The hash operation unit is used to convert the second hash value into numeric form and then substitute the second hash value and the number of accelerators into the modulo n mapping relationship to determine the accelerator corresponding to the first embedding vector keyword.
  • The modulo n mapping relationship can be expressed as $dev = \mathrm{hash}(\mathrm{hash}(key) + salt) \bmod n$, where $dev$ denotes the accelerator information, $salt$ denotes the salt value, $key$ denotes the first embedding vector keyword, $n$ is the number of accelerators in the system, and $\bmod$ denotes the mapping between the second hash value and the accelerators.
  • the comparison unit is used to determine whether the embedding vectors allocated to each accelerator overflow.
  • The method for determining whether the embedding vectors overflow may include: (1) the comparison unit compares the capacity occupied by the embedding vectors allocated to each accelerator with the capacity of the second memory of that accelerator; when the capacity occupied by the embedding vectors is greater than the capacity of the second memory of the accelerator, the embedding vectors overflow. (2) The comparison unit calculates a standard deviation from the number of embedding vectors corresponding to each accelerator, sets a threshold, and compares the standard deviation with the threshold; when the standard deviation is greater than the threshold, the embedding vectors overflow.
  • the data output unit sends the first embedding vector to the second memory of the corresponding accelerator.
  • When the comparison unit finds that the memory capacity of each accelerator is not less than the capacity of the corresponding embedding vectors, or that the standard deviation is less than or equal to the threshold, it determines that no embedding vector overflows.
  • In that case, the data output unit sends the first embedding vector, the address of the embedding vector, and the communication information between the accelerators to the cache memory in the second memory of the corresponding accelerator.
  • When the comparison unit finds that the memory capacity of an accelerator is smaller than the capacity of the corresponding embedding vectors, or that the standard deviation is greater than the threshold, it determines that the embedding vectors overflow.
  • In that case, the data output unit does not send the first embedding vector to the corresponding accelerator.
  • the obtaining unit is used to reacquire a new salt value, store it in the configuration file, repeat the above steps of each unit, and recalculate the corresponding relationship between the embedded vector keyword and the accelerator.
  • the device for prefetching embedded vectors may further include a training unit.
  • The device for embedding vector prefetching may include an acquisition unit, a hash operation unit, a comparison unit, and a data output unit. By combining these four units, the fixed modulo n mapping relationship can be turned into a dynamic modulo n mapping relationship, that is, the accelerator corresponding to each embedding vector can be changed, so that the embedding vectors are distributed across the accelerators more evenly. This solves the problem that embedding vectors overflow because the capacity of the second memory is limited.
  • Embedding vector techniques are widely used in deep learning-based recommender systems.
  • the processor needs to send the embedding vector required for training to the accelerator in advance.
  • the processor reads the embedding table into the first memory of the server.
  • the processor randomly generates a salt value and reads a batch of data from disk.
  • the processor deduplicates and splits the embedded vector keywords needed for training in the batch data, and stores the batch data in the first memory of the server.
  • the processor executes the program code in the first memory, calls the salt value and the first embedding vector key, substitutes into the first hash algorithm, and determines the first hash value.
  • the processor combines the first hash value and the salt value by splicing to obtain the first salted hash value, and substitutes the salted first hash value into the second hash algorithm to determine the second hash value.
  • the processor continues to execute the program code in the first memory, and substitutes the second hash value and the number of accelerators in the system into the modulo n mapping relationship to determine the accelerator corresponding to the first embedding vector key.
  • the processor determines the accelerators corresponding to all the embedding vector keywords, it calculates a standard deviation according to the number of embedding vectors corresponding to each accelerator, sets a threshold, and compares the standard deviation with the threshold. If the standard deviation is greater than the threshold, the processor returns all embedding vectors to be sent to the accelerator to the first memory. The processor re-reads a new salt value, stores it in the configuration file, repeats the above steps according to the new salt value, and re-determines the accelerator corresponding to the embedding vector.
  • the processor compares the standard deviation with the threshold again, and when it is determined that the standard deviation is less than or equal to the threshold and there is no overflow of the embedded vector, sends the embedded vector to the second memory of the corresponding accelerator through PCIE. At the same time, the processor also sends the address of the embedding vector required for each accelerator training and the communication information between the accelerators to each accelerator through PCIE, and the prefetch process ends.
  • The above method of determining whether the embedding vectors overflow ensures that the salt value in use prevents the embedding vectors from overflowing during training, which keeps the system operating normally. It also allows the transfer of information between the server and the accelerators to be hidden, which improves overall operating efficiency.
  • After the processor sends the embedding vectors to the accelerators, each accelerator starts training the recommendation model. When training requires embedding vectors held by other accelerators, the embedding vectors are extracted from the cache memory of the other accelerator according to their addresses and sent through PCIE, according to the communication information between accelerators, to the accelerator that requires them. The accelerator assembles the acquired embedding vectors and trains on them. After training is completed, the gradient information of the embedding vectors obtained by the accelerator is sent to each accelerator through PCIE according to the communication information between the accelerators; the corresponding embedding vector is then found according to its address, and the gradient information is added to the original embedding vector to update it.
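  • The update step at the end of training, in which gradient information is added to the original embedding vector held by the owning accelerator, can be sketched as follows. This is a simplified model: each accelerator's second memory is a dictionary, the `owner` map stands in for the address and communication information, and all names are hypothetical. Any scaling of the gradient (for example by a learning rate) is assumed to be already included in the gradient information.

```python
from typing import Dict, List

Vector = List[float]


def apply_gradients(
    shards: List[Dict[str, Vector]],
    owner: Dict[str, int],
    gradients: Dict[str, Vector],
) -> None:
    """Add each gradient to the embedding vector stored on the accelerator that owns it."""
    for key, gradient in gradients.items():
        shard = shards[owner[key]]  # locate the accelerator holding this vector
        shard[key] = [value + delta for value, delta in zip(shard[key], gradient)]


shards = [{"a": [1.0, 2.0]}, {"b": [0.0, 0.0]}]
owner = {"a": 0, "b": 1}
apply_gradients(shards, owner, {"a": [0.5, -1.0]})
print(shards[0]["a"])  # [1.5, 1.0]
```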
  • All salt values used during the above training are recorded, so that when training is run again with the same embedding vectors, the processor can directly read a suitable salt value.
  • When the processor computes with different salt values, it can compare the distribution of embedding vectors across the accelerators, or compare the resulting training performance, and further filter the salt values on these conditions, so that a better salt value gives the system a higher throughput.
  • In summary, this application provides a method in which an application system prefetches embedding vectors: salting changes the hash value, and thereby changes the accelerator that the hash operation assigns to each embedding vector key, so that no embedding vector overflows in any accelerator and prefetching is balanced.
  • The hash operation is fast. The position of each embedding vector need not be recorded during the computation, and the embedding table need not be synchronously recorded in every accelerator just to look up those positions, which improves overall operating efficiency.
  • Through repeated training and conditional filtering, the system can select a better salt value, achieve higher throughput, and be applied to parallel processing.
  • An embodiment of this application also provides a computing device that includes a processor and a memory. The memory stores computer instructions, and the processor is configured to execute the functions of the modules shown in FIGS. 1-3.
  • An embodiment of this application also provides a computer-readable storage medium that stores instructions; when the instructions are run on a processor, the method flow shown in FIGS. 1-3 is implemented.
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product.
  • A computer program product comprises at least one computer instruction.
  • When the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
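  • Computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • A computer-readable storage medium may be any available medium that a computer can access, or a data storage node, such as a server or a data center, that contains at least one set of available media. Available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., high-density digital video discs (DVD)), or semiconductor media.
  • The semiconductor media may be SSDs.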

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

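Embodiments of the present invention disclose an embedding vector prefetching method, apparatus, and system, and a related device. In the method, a processor reads a salt value and a first embedding vector key and, based on the salt value and the first embedding vector key, determines the accelerator corresponding to the first embedding vector key. The processor then determines whether embedding vector overflow exists on the accelerator. If there is no overflow, the processor sends the first embedding vector to the accelerator; if there is overflow, the processor does not send the first embedding vector to the accelerator, and the first embedding vector remains stored in a first memory. By salting, the embodiments change the hash value and therefore the mapping between embedding vector keys and accelerators, so that embedding vectors are distributed evenly across the accelerators, embedding vector overflow is eliminated, and the system keeps running normally.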

Description

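Embedding vector prefetching method, apparatus, and system, and related device
This application claims priority to Chinese Patent Application No. 202111157670.0, filed with the Chinese Patent Office on September 29, 2021 and entitled "Embedding vector prefetching method, apparatus, and system, and related device", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of deep learning, and in particular to an embedding vector prefetching method, apparatus, and system, and a related device.
Background
Deep learning is developing rapidly and is widely applied across industries. In particular, applying deep learning to recommendation systems has produced good results.
Embedding vector technology is widely used in deep-learning-based recommendation systems. It treats the user features obtained by the recommendation system as sparse vectors and converts them into dense vectors through an embedding table. The embedding table is stored in the memory of a server associated with the recommendation system, and each row of the embedding table is one embedding vector. When a recommendation model is trained, the process in which the processor moves the embedding vectors required for training from the server's memory into the accelerators is called prefetching. In current prefetching, most embedding vectors can end up on a single accelerator; because accelerator memory capacity is limited, the embedding vectors overflow and the system becomes abnormal.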
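Summary
This application provides an embedding vector prefetching method, apparatus, and system, and a related device, to solve the problem that, in current prefetching, most embedding vectors are placed on one accelerator and, because accelerator memory capacity is limited, the volume of embedding vectors exceeds the accelerator's memory capacity and causes a system abnormality.
According to a first aspect, an embodiment of the present invention provides an application system for embedding vector prefetching. The system includes a server, accelerators, and a peripheral component interconnect express (PCIE) bus. The server includes a processor and a first memory; each accelerator includes a second memory, an instruction decoder, a controller, a multiplexer, and a computing module. The server and the accelerators may be interconnected by a high-bandwidth link such as PCIE, and the accelerators may be connected to one another by a high-bandwidth link or a network. One server may be connected to multiple accelerators.
The server is a device with both computing and storage capabilities. It may be a physical server, or a virtual machine implemented on a general-purpose physical server using network functions virtualization; this application does not specifically limit the form of the server. Optionally, the server includes a processor and a first memory. The server may include more or fewer components, and multiple components may be integrated into one.
The processor handles data access requests from the server or other systems, as well as requests generated inside the system. Optionally, when the processor receives, through a front-end port, write requests sent by the server, it temporarily stores the data of these requests in the first memory; when the total amount of data in the first memory reaches a threshold, the processor sends the data stored in the first memory through a back-end port to a hard disk for persistent storage.
The first memory is internal memory that exchanges data directly with the processor. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system and other running programs. The first memory may store data such as batch data and the embedding table, which the processor can then access quickly. The first memory may also store program code; by reading the data and invoking the program code stored in the first memory, the processor can manage the hard disk.
An accelerator may be a graphics processing unit, an embedded neural-network processing unit, or another type of accelerator card. Optionally, the accelerator may include a second memory, an instruction decoder, a controller, a multiplexer, and a computing module.
The second memory may store data. Its structure and role are similar to those of the first memory; it differs only in capacity. Optionally, the second memory also contains a cache, and the scheduling and transfer of information between the cache and main memory are performed automatically by hardware. The instruction decoder receives instructions sent by the processor and decodes them to obtain the addresses of the data to be computed on and the operation type. The controller may receive the addresses of the data from the instruction decoder and the computation results output by the computing module. The multiplexer, according to the control signal of the instruction decoder, selects whether to send the memory access command of the controller or of the processor to the second memory, and fetches from the second memory the data to be sent to the controller and the processor. The computing module performs the computation corresponding to the operation type on the data.
PCIE is intended to replace older bus standards. It provides high-speed, serial, point-to-point, dual-channel, high-bandwidth transmission; each connected device is allocated dedicated channel bandwidth rather than sharing bus bandwidth. It mainly supports active power management, error reporting, end-to-end reliable transmission, hot plugging, and quality of service.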
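According to a second aspect, an embodiment of the present invention provides an embedding vector prefetching method. In the method, a processor reads a salt value (salt) and a first embedding vector key (embedding key), and determines, based on them, the accelerator (device) corresponding to the first embedding vector key. The processor determines whether embedding vector overflow exists on the accelerator. If there is no overflow, the processor sends the first embedding vector (embedding) to the corresponding accelerator; if there is overflow, the processor does not send the first embedding vector to the accelerator, and the first embedding vector remains stored in the first memory.
In a possible implementation of the second aspect, the processor may read batch data from a disk or a network. The batch data may include m embedding vector keys, and the processor performs the same operation to send the embedding vector corresponding to each key from the first memory to the second memory. The first embedding vector key may be any key in the batch data; it corresponds to a unique row of the embedding table and thus to a unique embedding vector. After reading the batch data, the processor may also deduplicate and split the embedding vector keys in it. The processor may generate a salt value at random.
The processor reads the salt value and the first embedding vector key, and determines a first hash value from the first embedding vector key. Optionally, the processor inputs the first embedding vector key into a first hash algorithm to determine the first hash value. The first hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like.
The processor determines a second hash value from the salt value and the first hash value. Optionally, the processor first combines the salt value with the first hash value: it may concatenate the first hash value and the salt value as strings, or insert the salt value at one or more positions in the first hash value, to obtain a salted first hash value. The processor may then input the salted first hash value into a second hash algorithm. Optionally, the second hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like.
The processor determines the accelerator corresponding to the first embedding vector key from the second hash value. Optionally, the processor may convert the second hash value to numeric form and substitute it, together with the number of accelerators in the system, into the modulo-n mapping formula. After determining the accelerator for the first embedding vector key, the processor may determine, in the same way, the accelerators for all embedding vector keys in the batch data, and thereby obtain the volume of embedding vector keys assigned to each accelerator.
The processor determines whether embedding vector overflow exists on the accelerators. Optionally, the processor may obtain the number of embedding vectors that each accelerator can hold, that is, the capacity of its second memory, and compare the number of embedding vectors assigned to each accelerator with that capacity. If the capacity of every accelerator's second memory is no less than the number of embedding vectors assigned to it, there is no overflow. Alternatively, the processor may compute a standard deviation from the numbers of embedding vectors assigned to the accelerators, set a threshold, and compare the two. If the standard deviation is less than or equal to the threshold, there is no overflow.
If there is overflow, the first embedding vector remains stored in the first memory and the processor does not send it to the accelerator. Optionally, when the volume of embedding vectors exceeds the capacity of an accelerator's second memory, the processor may read a new salt value, save it to a configuration file, and repeat the above steps with the new salt value to recompute the correspondence between embedding vectors and accelerators, until no accelerator has embedding vector overflow.
Optionally, if there is no overflow, the processor may send the first embedding vector, the embedding vector addresses, and the inter-accelerator communication information to the cache in the second memory of the corresponding accelerator.
The above method can hide the information transfer between the server and the accelerators, and effectively solves the accelerator capacity overflow caused by unbalanced prefetching, so that no system abnormality occurs.
When the method of the second aspect is implemented, the processor reads the salt value and the first embedding vector key, determines the first hash value from the first embedding vector key, determines the second hash value from the salt value and the first hash value, and performs a modulo-n operation on the second hash value to obtain the accelerator corresponding to the first embedding vector key. The processor determines whether embedding vector overflow exists. If there is no overflow, the processor sends the first embedding vector to the second memory of the corresponding accelerator. If there is overflow, the processor does not send the first embedding vector to the accelerator and instead reads a new salt value. By adding a string to the hash value, that is, by salting, the processor turns the fixed modulo-n mapping into a dynamic modulo-n mapping and changes the correspondence between embedding vectors and accelerators, so that embedding vectors can be distributed evenly across the accelerators. This avoids embedding vector overflow in the accelerators and the resulting system abnormality, and achieves balanced prefetching.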
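According to a third aspect, an embodiment of the present invention provides an embedding vector prefetching apparatus, which includes an obtaining unit, a hash operation unit, a comparison unit, and a data output unit. The obtaining unit obtains the salt value and the first embedding vector key. The hash operation unit determines, based on the first embedding vector key and the salt value, the accelerator corresponding to the first embedding vector key. The comparison unit determines whether the embedding vectors assigned to each accelerator overflow. The data output unit sends the first embedding vector to the second memory of the corresponding accelerator if there is no overflow or, if there is overflow, does not send the first embedding vector to the corresponding accelerator and keeps it stored in the first memory.
Optionally, the salt value obtained by the obtaining unit may be a randomly generated string of one or more characters, or a string stored in a configuration file. The first embedding vector key is any embedding vector key in the batch data, which the obtaining unit may read from a disk or a network. The obtaining unit inputs the salt value together with the first embedding vector key into the hash operation unit.
The hash operation unit determines, based on the first embedding vector key and the salt value, the accelerator corresponding to the first embedding vector key. Optionally, the hash operation unit substitutes the first embedding vector key into the first hash algorithm to determine the first hash value. It then adds the salt value to the first hash value and substitutes the salted first hash value into the second hash algorithm to determine the second hash value. Optionally, the first and second hash algorithms may be a message digest algorithm, a secure hash algorithm, or the like. After obtaining the second hash value, the hash operation unit converts it to numeric form and substitutes it, together with the number of accelerators, into the modulo-n mapping to determine the accelerator corresponding to the first embedding vector key. The comparison unit determines whether the embedding vectors assigned to each accelerator overflow.
Optionally, after the accelerators for all embedding vector keys have been obtained, the comparison unit may compare the volume of embedding vectors assigned to each accelerator with the capacity of that accelerator's second memory. If the volume exceeds the capacity, the embedding vectors overflow; if the volume is less than or equal to the capacity, there is no overflow. The comparison unit may also compare the number of embedding vectors with the capacity of each accelerator's second memory each time the accelerator for one embedding vector key is determined.
Optionally, after the accelerators for all embedding vector keys have been obtained, the comparison unit may compute a standard deviation from the numbers of embedding vectors assigned to the accelerators, set a threshold, and compare the two. If the standard deviation is greater than the threshold, the embedding vectors overflow; if it is less than or equal to the threshold, there is no overflow.
Optionally, if there is no overflow, the data output unit sends the first embedding vector, the embedding vector addresses, and the inter-accelerator communication information to the second memory of the corresponding accelerator.
Optionally, if there is overflow, the data output unit does not send the first embedding vector to the accelerator, and the first embedding vector remains stored in the first memory. Optionally, the obtaining unit may obtain a new salt value, store it in the configuration file, and repeat the above process to recompute the correspondence between embedding vector keys and accelerators.
Optionally, the apparatus can be widely applied to deep-learning-based model training, and may further include a training unit.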
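According to a fourth aspect, an embodiment of the present invention provides a computing device that includes a processor and a memory. The memory stores computer instructions, and the processor executes the functions of the modules in any possible implementation of the first, second, or third aspect.
According to a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores instructions; when the instructions are run on the above computing device, they cause the computing device to perform the method described in the first, second, or third aspect.
The hash algorithms used in prefetching are fast. In addition, the salt values read during prefetching are recorded, so that among all salt values that achieve balanced prefetching, aspects such as training performance can be compared and a better salt value selected to reach higher throughput, which improves overall operating efficiency.
In summary, by salting the base hash value, this application turns the fixed modulo-n mapping into a dynamic modulo-n mapping and changes the correspondence between embedding vectors and their location, the accelerator, so that the embedding vectors corresponding to the embedding vector keys can be distributed evenly across the accelerators. This solves the insufficient accelerator memory capacity caused by unbalanced prefetching and keeps the system running normally.
The implementations provided in the above aspects may be further combined to provide more implementations.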
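Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic block diagram of an application system for embedding vector prefetching according to an embodiment of the present invention;
FIG. 2 is a flowchart of an embedding vector prefetching method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an embedding vector prefetching apparatus according to an embodiment of the present invention.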
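Detailed Description
To solve the problem that limited accelerator memory capacity leads to embedding vector overflow and a system abnormality, this application provides an embedding vector prefetching method that distributes the embedding vectors in the embedding table evenly across the accelerators and so avoids the abnormality.
FIG. 1 is a schematic block diagram of an application system for embedding vector prefetching provided by this application. The application system 100 includes a server 110, accelerators 120, and a peripheral component interconnect express (PCIE) bus 130. The server 110 and the accelerators 120 are interconnected by a high-bandwidth link such as PCIE 130, and the accelerators 120 are connected to one another by PCIE 130 or a network.
The server 110 is a device with both computing and storage capabilities. It may be a physical server, such as an X86 server or an ARM server, or a virtual machine (VM) implemented on a general-purpose physical server using network functions virtualization (NFV) technology. A virtual machine is a complete computer system, simulated by software, that has full hardware system functionality and runs in a fully isolated environment, such as a virtual device in cloud computing; this application does not specifically limit this. In a possible implementation, the server 110 includes a processor 111 and a first memory 112. It should be understood that the server shown in FIG. 1 may include more or fewer components, and multiple components of the server shown in FIG. 1 may be integrated into one; this application does not specifically limit the structure of the server.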
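The processor 111 may consist of at least one general-purpose processor, such as a central processing unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The processor 111 handles data access requests from the server or other systems, as well as requests generated inside the system. For example, when the processor 111 receives, through a front-end port, write requests sent by the server 110, it temporarily stores the data of these requests in the first memory 112; when the total amount of data in the first memory 112 reaches a threshold, the processor 111 sends the data stored in the first memory through a back-end port to a hard disk for persistent storage. After receiving a request, the processor may also read data, namely the salt value and the batch data, and likewise store them in the first memory.
The first memory 112 is internal memory that exchanges data directly with the processor. It can be read and written at any time at high speed, and serves as temporary data storage for the operating system and other running programs. The memory includes at least two kinds of storage; for example, it may be random access memory or read-only memory (ROM). The random access memory may be, for example, dynamic random access memory (DRAM) or storage class memory (SCM). DRAM is a semiconductor memory and, like most random access memory (RAM), is a volatile memory device. The first memory may also include other random access memory, such as static random access memory (SRAM). The read-only memory may be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM). In addition, the first memory may be a dual in-line memory module (DIMM), that is, a module composed of DRAM, or a solid state disk (SSD).
The first memory 112 may store data such as batch data and the embedding table, so this data can be read quickly. The first memory may also store program code; the processor reads the data stored in the first memory and runs the program code, and by running the program code stored in the first memory the processor 111 can manage the hard disk. The program code in the first memory 112 in FIG. 1 may include one or more units, for example an obtaining unit, a hash operation unit, a comparison unit, and a data output unit. It should be understood that this division of the program code into module units is only an example: the module units may be merged or split into more or fewer module units, and the positional relationship between the system and the modules does not constitute any limitation; this application does not specifically limit this.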
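The accelerator 120 may be a graphics processing unit (GPU), an embedded neural-network processing unit (NPU), or another type of accelerator card (physics processing unit, PPU). In a possible implementation, the accelerator 120 includes a second memory 121, an instruction decoder 122, a controller 123, a multiplexer 124, and a computing module 125.
The second memory 121 stores data. Its structure and role are similar to those of the first memory 112; it differs only in capacity. The second memory also contains a cache. In its original sense, a cache is a kind of RAM with faster access than ordinary random access memory (RAM). The cache is a level of storage between main memory and the processor; it is built from static memory chips (SRAM), and has a small capacity but a high speed close to that of the processor. The scheduling and transfer of information between the cache and main memory are performed automatically by hardware.
The instruction decoder 122 receives instructions sent by the processor and decodes them to obtain a decoding result, which indicates the addresses of the data to be computed on and the operation type. In a more specific embodiment, the instruction decoder 122 includes a status register and an instruction buffer queue. The status register is a memory-addressable space: when the processor sends a read request to that address, the instruction decoder 122 returns to the processor the accelerator's working state stored in the status register.
The controller 123 receives the addresses of the data from the instruction decoder 122 and the computation results output by the computing module 125.
The multiplexer 124, according to the control signal of the instruction decoder 122, selects whether to send the memory access command of the controller 123 or of the processor to the memory, and fetches from the memory the data to be sent to the controller 123 and the processor.
The computing module 125 performs the computation corresponding to the operation type on the data. The computing module 125 includes a computing unit, an input unit, a computing unit array, and an output unit. The computing unit controls the computing unit array to execute the instruction's processing operation on the corresponding data. The input unit buffers the data prepared for executing the instruction. The output unit buffers the computation results obtained after the computing unit array executes the instruction.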
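In the application system, the first memory 112 and the second memory 121 play similar roles and differ only in capacity: the second memory is smaller than the first memory. Because the embedding table is generally larger than the second memory, the embedding table can only be stored in the first memory. The second memory 121 is located in the accelerator 120 and can store the subset of embedding vectors, the embedding vector addresses, and the inter-accelerator communication information sent by the processor.
It should be understood that the accelerator 120 shown in FIG. 1 may include more or fewer components, and multiple components may be integrated into one. Accelerators can be structured in many ways, and this application does not specifically limit the structure of the accelerator.
Peripheral component interconnect express 130 (PCIE) is intended to replace older bus standards. PCIE provides high-speed, serial, point-to-point, dual-channel, high-bandwidth transmission; each connected device is allocated dedicated channel bandwidth rather than sharing bus bandwidth. It mainly supports active power management, error reporting, end-to-end reliable transmission, hot plugging, and quality of service.
It should be understood that the PCIE 130 shown in FIG. 1 comes in multiple specifications, from PCIE x1 to PCIE x32 and so on; this application does not specifically limit the type of PCIE.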
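FIG. 2 is a flowchart of an embedding vector prefetching method provided by this application. The method can be applied to the application system shown in FIG. 1 and includes the following steps.
The batch data read by the processor contains m embedding vector keys. The processor performs the same operation to send the embedding vector corresponding to each key from the first memory to the second memory, as follows:
S201: The processor reads a salt value and a first embedding vector key.
In a possible implementation, the salt value (salt) may be a randomly generated string of one or more characters, or a string stored in a configuration file.
In a possible implementation, the first embedding vector key (embedding key) is any embedding vector key in the batch data (batch). The batch data may include the keys corresponding to the embedding vectors (embedding) in the embedding table (embedding table). When the batch data includes multiple embedding vector keys, any duplicate keys among them may be removed, and any keys that will not be used may be deleted. The processor may read the batch data from a disk or a network.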
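The key clean-up described in S201 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `prepare_keys` and the `needed_keys` set are hypothetical, and the sample batch is invented for the example.

```python
def prepare_keys(raw_keys, needed_keys):
    """Drop duplicate keys and keys training will not use, keeping first-seen order."""
    unique = dict.fromkeys(raw_keys)  # a dict preserves insertion order
    return [k for k in unique if k in needed_keys]


# Hypothetical batch: 2 and 5 repeat, and 33 is not needed for training.
batch = [2, 5, 2, 9, 33, 5, 12]
keys = prepare_keys(batch, needed_keys={2, 5, 9, 12})
print(keys)
```

Keeping first-seen order is a design choice of this sketch; the text does not say whether order matters.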
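S202: The processor determines, based on the first embedding vector key and the salt value, the accelerator corresponding to the first embedding vector key.
In a possible implementation, the process in which the processor determines the accelerator corresponding to the first embedding vector key from that key and the salt value includes the following steps S211 to S213.
S211: The processor determines a first hash value from the first embedding vector key.
In a possible implementation, the processor inputs the first embedding vector key into a first hash algorithm to determine the first hash value. The first hash algorithm may be a message digest algorithm (message digest algorithm md5, MD5), a secure hash algorithm (secure hash algorithm 1, SHA-1), or the like. It should be understood that the first hash algorithm can take many other forms, such as SHA-224 or SHA-256; this application does not limit the first hash algorithm in any way.
In a more specific implementation, the first hash algorithm may be expressed as:
ADD1=hash(key);
where ADD1 is the first hash value, key is the first embedding vector key, and the hash() function is the mapping between the first embedding vector key and the first hash value. When the processor inputs different embedding vector keys into the same first hash algorithm, the output always has a fixed length. It can be understood that, for the same first embedding vector key, different choices of first hash algorithm generally give different first hash values.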
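As a hedged illustration of ADD1=hash(key), the sketch below uses Python's standard `hashlib`. The function name `first_hash` and the choice to encode the key as its decimal string are assumptions of this example; the text does not say how a key is serialized before hashing.

```python
import hashlib


def first_hash(key: int, algorithm: str = "md5") -> str:
    """ADD1 = hash(key): hex digest of the key (encoded as a decimal string)."""
    data = str(key).encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()


# The digest length depends only on the algorithm, not on the key.
print(len(first_hash(2)), len(first_hash(123456789)))   # MD5: 32 hex characters
print(len(first_hash(2, "sha1")))                       # SHA-1: 40 hex characters
```

The same key hashed with MD5 and with SHA-1 gives digests of different lengths, which is one concrete way to see that the first hash value depends on the algorithm chosen.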
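S212: The processor determines a second hash value from the salt value and the first hash value.
In a possible implementation, the processor needs to combine the salt value with the first hash value. The ways of combining them may include: (1) concatenating the first hash value and the salt value as strings to obtain the salted first hash value; (2) inserting the salt value at one or more positions in the first hash value to obtain the salted first hash value. It should be understood that combining the salt value with the first hash value is a matter of combining strings, and many forms other than these two exist; this application does not specifically limit how the salt value and the first hash value are combined. The processor then inputs the salted first hash value into a second hash algorithm, which may be a message digest algorithm, a secure hash algorithm, or the like. It should be understood that the second hash algorithm can take many other forms, such as SHA-224 or SHA-256; this application does not limit the second hash algorithm in any way.
In a more specific implementation, the second hash algorithm may be expressed as:
ADD2=hash(hash(key)+salt);
where ADD2 is the second hash value, salt is the salt value, key is the first embedding vector key, hash(key)+salt is the salted first hash value (with + standing for the chosen way of combining the two strings), and the outer hash() function is the mapping between the salted first hash value and the second hash value.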
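The two combination options and the second hash can be sketched as below. This is an illustration under assumptions: MD5 stands in for both hash algorithms, keys are hashed as decimal strings, and the helper names (`md5_hex`, `salt_by_concat`, `salt_by_insert`, `second_hash`) are hypothetical.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex digest of a string; MD5 stands in for either hash algorithm."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def salt_by_concat(first_hash: str, salt: str) -> str:
    """Option (1): append the salt to the first hash value."""
    return first_hash + salt


def salt_by_insert(first_hash: str, salt: str, positions) -> str:
    """Option (2): insert the salt at each given index of the original string."""
    out = first_hash
    # Work from the rightmost position so earlier indexes stay valid.
    for pos in sorted(positions, reverse=True):
        out = out[:pos] + salt + out[pos:]
    return out


def second_hash(key: int, salt: str) -> str:
    """ADD2 = hash(hash(key) + salt), using concatenation as the combination."""
    return md5_hex(salt_by_concat(md5_hex(str(key)), salt))


print(salt_by_concat("7f89", "3"))           # 7f893
print(salt_by_insert("7f89", "3", [1, 3]))   # 73f839
print(second_hash(2, "3"))
```

Changing only the salt changes the input of the second hash, so the same key yields a different ADD2; that is what lets a new salt reshuffle the key-to-accelerator mapping.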
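S213: The processor determines the accelerator corresponding to the first embedding vector key from the second hash value.
In a possible implementation, the processor converts the second hash value to numeric form and substitutes it, together with the number of accelerators in the system, into the modulo-n mapping formula to determine the accelerator corresponding to the first embedding vector key.
In a more specific implementation, the modulo-n mapping may be expressed as:
dev=hash(hash(key)+salt) mod n;
where dev is the accelerator information, hash(hash(key)+salt) is the second hash value, which in this implementation is expressed as a number, n is the number of accelerators in the system, and mod is the mapping between the second hash value and the accelerator.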
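A minimal sketch of the modulo-n mapping follows, assuming MD5 for both hashes, string concatenation for salting, and a hexadecimal-to-integer conversion of the digest. The name `device_for_key` and the value n=8 are illustrative choices, not values taken from the text.

```python
import hashlib


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def device_for_key(key: int, salt: str, n: int) -> int:
    """dev = hash(hash(key) + salt) mod n, with the digest read as a base-16 integer."""
    add1 = _md5_hex(str(key))          # first hash value
    add2 = _md5_hex(add1 + salt)       # second hash value of the salted string
    return int(add2, 16) % n           # index of the accelerator, 0 .. n-1


keys = [2, 5, 9, 12, 15, 17, 20]
placement = {k: device_for_key(k, "3", 8) for k in keys}
print(placement)
```

Because the mapping is a pure function of (key, salt, n), any party that knows the salt can recompute where an embedding vector lives without a lookup table.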
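S203: The processor determines whether embedding vector overflow exists on the accelerators. If there is no overflow, the method proceeds to step S204; if there is overflow, it proceeds to step S205.
In a possible implementation, after obtaining the accelerator corresponding to the first embedding vector key, the processor first checks whether embedding vector overflow exists on the accelerators, without yet sending the corresponding first embedding vector to the accelerator. The processor can obtain the number of embedding vectors that each accelerator can hold, that is, the capacity of its second memory. The processor compares the number of embedding vectors assigned to each accelerator with the capacity of that accelerator's second memory. If the capacity of every accelerator's second memory is no less than the number of embedding vectors assigned to it, there is no overflow, and the processor performs step S204. If the number of embedding vectors assigned to any accelerator exceeds the capacity of its second memory, the embedding vectors overflow, and the processor performs step S205.
In another possible implementation, the processor computes a standard deviation from the numbers of embedding vectors assigned to the accelerators, sets a threshold, and compares the two. If the standard deviation is less than or equal to the threshold, there is no overflow, and the processor performs step S204. If the standard deviation is greater than the threshold, the embedding vectors overflow, and the processor performs step S205.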
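The two overflow checks in S203 can be sketched as follows. The function names are hypothetical, capacities are counted in embedding vectors, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which form of standard deviation is meant.

```python
from collections import Counter
from statistics import pstdev


def counts_per_device(placement: dict, n: int) -> list:
    """Number of embedding vectors assigned to each of the n accelerators."""
    load = Counter(placement.values())
    return [load.get(d, 0) for d in range(n)]


def overflows_by_capacity(counts, capacities) -> bool:
    """Overflow if any accelerator is assigned more vectors than it can hold."""
    return any(cnt > cap for cnt, cap in zip(counts, capacities))


def overflows_by_stddev(counts, threshold: float) -> bool:
    """Overflow if the spread of the per-accelerator counts exceeds the threshold."""
    return pstdev(counts) > threshold


skewed = counts_per_device({2: 0, 5: 0, 9: 0, 12: 0, 15: 0, 17: 0, 20: 0}, 3)
even = counts_per_device({2: 0, 5: 1, 9: 2, 12: 0, 15: 1, 17: 2, 20: 0}, 3)
print(skewed, overflows_by_capacity(skewed, [4, 4, 4]), overflows_by_stddev(skewed, 1.0))
print(even, overflows_by_capacity(even, [4, 4, 4]), overflows_by_stddev(even, 1.0))
```

The capacity check is exact, while the standard-deviation check is a proxy for imbalance: a suitable threshold depends on the batch size and the accelerator capacities.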
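S204: The processor sends the first embedding vector to the corresponding accelerator.
In a possible implementation, the processor determines that the capacity of every accelerator's second memory is no less than the number of embedding vectors assigned to that accelerator. The processor looks up the first embedding vector in the embedding table by the first embedding vector key and sends it to the corresponding accelerator.
In a more specific implementation, the processor may send the first embedding vector over PCIE to the cache in the second memory of the corresponding accelerator. The processor may also send the embedding vector addresses and the inter-accelerator communication information to each accelerator.
S205: The processor does not send the first embedding vector to the accelerator, and the first embedding vector remains stored in the first memory.
In a possible implementation, the processor determines that embedding vector overflow exists on an accelerator. The processor does not send the first embedding vector, which corresponds to the first embedding vector key in the embedding table, to the accelerator; the first embedding vector remains stored in the server's first memory.
In a specific implementation, after computing every embedding vector key in the batch data, the processor has the accelerator corresponding to each key. The processor may tally the capacity of each accelerator's second memory and compare it with the number of embedding vectors assigned to that accelerator. If the number of embedding vectors assigned to any accelerator exceeds the capacity of its second memory, the processor may read a new salt value, save it to the configuration file, and repeat steps S201 to S205 with the new salt value to recompute the correspondence between embedding vector keys and accelerators, until no accelerator has embedding vector overflow. Once the capacity of every accelerator's second memory is no less than the number of embedding vectors assigned to it, the processor may send each embedding vector to the second memory of its corresponding accelerator.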
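The re-salting loop described above might look like the following sketch. It is an illustration under assumptions: MD5 with string concatenation is used for the mapping, new salts come from `secrets.token_hex`, capacities are counted in embedding vectors, and `max_tries` is a safety bound added for this example (the text simply repeats until no accelerator overflows). Saving the salt to a configuration file is omitted.

```python
import hashlib
import secrets
from collections import Counter


def assign(keys, salt: str, n: int) -> dict:
    """Map every key to an accelerator index via hash(hash(key) + salt) mod n."""
    def h(text: str) -> str:
        return hashlib.md5(text.encode("utf-8")).hexdigest()
    return {k: int(h(h(str(k)) + salt), 16) % n for k in keys}


def find_salt(keys, capacities, max_tries: int = 100):
    """Draw salts until no accelerator is assigned more vectors than it can hold."""
    n = len(capacities)
    for _ in range(max_tries):
        salt = secrets.token_hex(4)            # hypothetical salt source
        placement = assign(keys, salt, n)
        load = Counter(placement.values())
        if all(load.get(d, 0) <= capacities[d] for d in range(n)):
            return salt, placement             # balanced enough: prefetch can go ahead
    raise RuntimeError("no salt avoided overflow within max_tries")


keys = [2, 5, 9, 12, 15, 17, 20]
salt, placement = find_salt(keys, capacities=[7, 7, 7, 7])
print(salt, placement)
```

If the total capacity is smaller than the number of keys, no salt can succeed, which is why this sketch bounds the number of attempts instead of looping forever.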
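In another specific implementation, each time the processor determines the accelerator corresponding to one embedding vector key in the batch data, it tallies the capacity of each accelerator's second memory and compares it with the volume of embedding vectors assigned to each accelerator so far. If the number of embedding vectors assigned to any accelerator exceeds the capacity of its second memory, the processor may read a new salt value, save it to the configuration file, and repeat steps S201 to S205 with the new salt value to recompute the correspondence between embedding vector keys and accelerators, until no accelerator has embedding vector overflow. The processor may also send back to the server's first memory the embedding vectors that had been sent to the accelerators before this comparison. If the capacity of every accelerator's second memory is no less than the number of embedding vectors assigned to it, the processor sends the one embedding vector corresponding to the key just computed over PCIE to the second memory of the corresponding accelerator. The processor then computes the accelerator for the next embedding vector key in the batch data and repeats the above steps.
In another specific implementation, if the capacity of every accelerator's second memory is no less than the number of embedding vectors assigned to it, the processor may also send the embedding vector addresses and the inter-accelerator communication information to each accelerator. This is because, in a recommendation system, an accelerator may need embedding vectors stored on other accelerators when it trains the model. According to the address of the embedding vector that the accelerator needs for training, the processor extracts the embedding vector from the cache of the accelerator that holds it and then, according to the inter-accelerator communication information, sends it over PCIE or the network to the accelerator that needs it for computation.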
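In a specific embodiment, when training of the recommendation system is initialized, the processor stores the embedding table in the first memory, reads batch data from the disk, and likewise stores it in the server's first memory. The batch data includes a set of embedding vector keys to be used in training; each key corresponds to a unique row of the embedding table, that is, to one embedding vector. The processor then parses the batch data, deduplicating and splitting the embedding vector keys in it; the parsed batch data is keys=[2,5,9,12,15,17,20]. The processor randomly generates a salt value, salt=3.
The processor takes the salt value and an embedding vector key from the batch data and computes the accelerator corresponding to that key. In this embodiment, the first hash algorithm is MD5. The processor first takes one embedding vector key from the batch data, for example key=2, and executes the first hash algorithm by substituting it into ADD1=hash(key), obtaining the first hash value ADD1=7f89. The processor then concatenates the first hash value and the salt value as strings, obtaining the salted first hash value 7f893. The processor substitutes the salted first hash value into the second hash algorithm, ADD2=hash(hash(key)+salt), and obtains the second hash value 6595. The processor substitutes the second hash value and the number of accelerators in the system into the modulo-n mapping, dev=hash(hash(key)+salt) mod n, and obtains device5 as the accelerator corresponding to this embedding vector key. Here the first and second hash values are written in hexadecimal, and the second hash value is treated as a number for the modulo-n operation. The accelerators for the other embedding vector keys in the batch data can be computed in the same way and are not described one by one.
After determining the accelerators for all embedding vector keys, the processor tallies the volume of embedding vectors assigned to each accelerator and compares it with the capacity of each accelerator's second memory. Suppose the volume of embedding vectors assigned to some accelerator is found to exceed the capacity of its second memory, so that the embedding vectors overflow. In that case the embedding vectors remain stored in the server's first memory. The processor randomly generates a new salt value, saves it to the configuration file, and repeats the above steps with the new salt value to recompute the correspondence between embedding vector keys and accelerators, until it finds a salt value that satisfies the condition for balanced prefetching, so that no accelerator has embedding vector overflow. Having found a suitable salt value and confirmed that no accelerator overflows, the processor sends the embedding vectors, the address of each embedding vector, and the inter-accelerator communication information over PCIE to the second memory of the corresponding accelerators. The prefetch process then ends and the training phase begins.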
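The arithmetic of this worked example can be checked with a few lines. The hash strings 7f89 and 6595 are taken from the text as given (they are far shorter than a real 32-character MD5 digest, so they read as abbreviated illustrations). The number of accelerators n is not stated in the text, so the sketch only lists which small values of n would be consistent with the result device5; the search range 2 to 16 is an arbitrary choice for this example.

```python
first_hash = "7f89"   # ADD1 as quoted in the example
salt = "3"
salted = first_hash + salt          # string concatenation
print(salted)                       # 7f893

second_hash = "6595"  # ADD2 as quoted in the example
value = int(second_hash, 16)        # hexadecimal string read as a number
print(value)                        # 26005

# n is not given in the source; list the small n for which value mod n == 5.
consistent_n = [n for n in range(2, 17) if value % n == 5]
print(consistent_n)                 # [8, 10, 13, 16]
```

For instance, 26005 mod 8 = 5 and 26005 mod 10 = 5, so a system with 8 or with 10 accelerators would both place this key on device5.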
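When an accelerator trains the recommendation model, it needs embedding vectors stored on other accelerators. According to the addresses, stored in the accelerator's second memory, of the embedding vectors needed for training, the processor extracts each embedding vector from the cache of the accelerator that holds it and then, according to the communication information between the two accelerators, sends it over PCIE from the original accelerator to the accelerator that needs it.
In summary, the processor reads the salt value and the batch data, determines the first hash value from the first embedding vector key in the batch data, determines the second hash value from the salt value and the first hash value, and performs a modulo-n operation on the second hash value to obtain the accelerator corresponding to the first embedding vector key; the processor then sends the first embedding vector to the second memory of that accelerator. With the first and second hash algorithms used in this process, there is no need to record embedding vector positions, to record the embedding table in the accelerators, or to look up embedding vector positions, which greatly increases speed. Moreover, by adding a string to the first hash value, that is, by salting, the processor turns the fixed modulo-n mapping into a dynamic modulo-n mapping and changes the correspondence between embedding vectors and their location, the accelerator, so that the volume of embedding vectors assigned to each accelerator does not exceed the capacity of its second memory. This achieves balanced prefetching and avoids embedding vector overflow and system abnormalities.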
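FIG. 3 is a schematic structural diagram of an embedding vector prefetching apparatus provided by this application. The program code 310 in the prefetching apparatus 300 includes an obtaining unit 320, a hash operation unit 330, a comparison unit 340, and a data output unit 350.
Likewise, because the batch data contains m embedding vector keys, the apparatus follows the same process to send the embedding vector corresponding to each key from the first memory to the second memory, as follows:
The obtaining unit obtains the salt value and the first embedding vector key. When there are multiple embedding vector keys, the obtaining unit may deduplicate and split them and store them in the server's first memory. The obtained salt value is a string that can be split and can be stored in a configuration file; if there is no configuration file, the salt value is generated at random. The obtaining unit inputs the salt value together with the batch data into the hash operation unit.
The hash operation unit determines, based on the first embedding vector key and the salt value, the accelerator corresponding to the first embedding vector key. In a possible implementation, the hash operation unit substitutes the first embedding vector key into the first hash algorithm ADD1=hash(key) to determine the first hash value, where ADD1 is the first hash value, key is the first embedding vector key, and the hash() function is the mapping between the first embedding vector key and the first hash value. The first hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like. It should be understood that the first hash algorithm, as the base hash algorithm, can take many different forms; this application does not limit the first hash algorithm in any way.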
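In a possible implementation, after the first hash value is obtained, the hash operation unit adds the salt value to the first hash value and substitutes the salted first hash value into the second hash algorithm ADD2=hash(hash(key)+salt) to determine the second hash value. Here ADD2 is the second hash value, salt is the salt value, key is the first embedding vector key, hash(key)+salt is the salted first hash value (with + standing for the chosen way of combining the two strings), and the outer hash() function is the mapping between the salted first hash value and the second hash value; the second hash value has the same form as the first hash value. The ways of combining the salt value with the first hash value may include: (1) the hash operation unit concatenates the first hash value and the salt value as strings to obtain the salted first hash value; (2) the hash operation unit inserts the salt value at one or more positions in the first hash value to obtain the salted first hash value. The second hash algorithm may be a message digest algorithm, a secure hash algorithm, or the like. It should be understood that the second hash algorithm can take many other forms, such as SHA-224 or SHA-256; this application does not limit the second hash algorithm in any way. It should also be understood that many ways of combining the salt value with the first hash value exist beyond these two; this application does not specifically limit how they are combined.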
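In a possible implementation, after the second hash value is obtained, the hash operation unit converts it to numeric form and substitutes it, together with the number of accelerators, into the modulo-n mapping dev=hash(hash(key)+salt) mod n to determine the accelerator corresponding to the first embedding vector key. Here dev is the accelerator information, salt is the salt value, key is the first embedding vector key, hash(hash(key)+salt) is the second hash value, n is the number of accelerators in the system, and mod is the mapping between the second hash value and the accelerator.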
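In a possible implementation, the comparison unit determines whether the embedding vectors assigned to each accelerator overflow. The ways of determining overflow may include: (1) the comparison unit compares the volume of embedding vectors assigned to each accelerator with the capacity of that accelerator's second memory; if the volume exceeds the capacity, the embedding vectors overflow. (2) The comparison unit computes a standard deviation from the numbers of embedding vectors assigned to the accelerators, sets a threshold, and compares the two; if the standard deviation is greater than the threshold, the embedding vectors overflow.
The data output unit sends the first embedding vector to the second memory of the corresponding accelerator. In a possible implementation, the comparison unit finds that the memory capacity of every accelerator is no less than the volume of embedding vectors assigned to it, or that the standard deviation is less than or equal to the threshold, and so determines that there is no embedding vector overflow. The data output unit then sends the first embedding vector, the embedding vector addresses, and the inter-accelerator communication information to the cache in the second memory of the corresponding accelerator.
In a possible implementation, the comparison unit finds that the memory capacity of an accelerator is less than the volume of embedding vectors assigned to it, or that the standard deviation is greater than the threshold, and so determines that embedding vector overflow exists. The data output unit then does not send the first embedding vector to the corresponding accelerator. In a possible implementation, the obtaining unit obtains a new salt value and stores it in the configuration file, and the steps of the above units are repeated to recompute the correspondence between embedding vector keys and accelerators.
In a possible implementation, the embedding vector prefetching apparatus may further include a training unit.
In summary, the embedding vector prefetching apparatus may include an obtaining unit, a hash operation unit, a comparison unit, and a data output unit. Together, the four units turn the fixed modulo-n mapping into a dynamic modulo-n mapping, that is, they change the accelerator corresponding to each embedding vector, so that embedding vectors are distributed more evenly across the accelerators. This solves the embedding vector overflow caused by the limited capacity of the accelerators' second memory.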
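To make this application easier to understand, the application system and method for embedding vector prefetching provided by this application are described in detail below in a specific application scenario.
Embedding vector technology is widely used in deep-learning-based recommendation systems. During training of the recommendation model, the processor needs to send the embedding vectors required for training to the accelerators in advance. At initialization, because the embedding table is generally larger than an accelerator's second memory, the processor reads the embedding table into the server's first memory. The processor randomly generates a salt value and reads batch data from the disk. The processor deduplicates and splits the embedding vector keys in the batch data that training will use, and also stores the batch data in the server's first memory. The processor executes the program code in the first memory, takes the salt value and the first embedding vector key, and substitutes the key into the first hash algorithm to determine the first hash value. The processor combines the first hash value with the salt value by concatenation to obtain the salted first hash value, and substitutes it into the second hash algorithm to determine the second hash value. Continuing to execute the program code in the first memory, the processor substitutes the second hash value and the number of accelerators in the system into the modulo-n mapping to determine the accelerator corresponding to the first embedding vector key.
After determining the accelerators for all embedding vector keys, the processor computes a standard deviation from the number of embedding vectors assigned to each accelerator, sets a threshold, and compares the two. If the standard deviation is greater than the threshold, the processor returns to the first memory all embedding vectors that were to be sent to the accelerators. The processor reads a new salt value, stores it in the configuration file, repeats the above steps with the new salt value, and re-determines the accelerator corresponding to each embedding vector. The processor compares the standard deviation with the threshold again and, once the standard deviation is less than or equal to the threshold and no embedding vector overflows, sends the embedding vectors over PCIE to the second memory of the corresponding accelerators. At the same time, the processor sends over PCIE, to each accelerator, the addresses of the embedding vectors it needs for training and the inter-accelerator communication information. The prefetch process then ends.
This way of checking for embedding vector overflow ensures that the loaded salt value keeps embedding vectors from overflowing during training, so the system runs normally. It can also hide the exchange of information between the server and the accelerators, which improves overall operating efficiency.
After the processor sends the embedding vectors to the accelerators, the accelerators start training the recommendation model. When training on one accelerator requires embedding vectors held by another, the embedding vector is extracted from the other accelerator's cache according to its address and sent over PCIE, according to the inter-accelerator communication information, to the accelerator that needs it. The accelerator assembles the embedding vectors it has obtained and trains on them. After training, the accelerator sends the gradient information it obtained for an embedding vector over PCIE to the other accelerators according to the inter-accelerator communication information; the corresponding embedding vector is then located by its address, and the gradient information is added to the original embedding vector to update it.
All salt values used during the above training are recorded, so that when training is run again with the same embedding vectors, the processor can directly read a suitable salt value. When the processor computes with different salt values, it can compare the distribution of embedding vectors across the accelerators, or compare the resulting training performance, and further filter the salt values on these conditions, so that a better salt value gives the system a higher throughput.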
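Filtering recorded salt values by how evenly they spread the embedding vectors could look like the sketch below. It covers only the distribution criterion (not training performance), and it assumes MD5 with concatenation for the mapping and the population standard deviation as the measure of spread; `spread`, `best_salt`, and the candidate salts are hypothetical.

```python
import hashlib
from collections import Counter
from statistics import pstdev


def spread(keys, salt: str, n: int) -> float:
    """Standard deviation of the per-accelerator key counts under one salt."""
    def h(text: str) -> str:
        return hashlib.md5(text.encode("utf-8")).hexdigest()
    load = Counter(int(h(h(str(k)) + salt), 16) % n for k in keys)
    return pstdev([load.get(d, 0) for d in range(n)])


def best_salt(keys, candidate_salts, n: int) -> str:
    """Pick the recorded salt whose placement is the most even."""
    return min(candidate_salts, key=lambda s: spread(keys, s, n))


keys = [2, 5, 9, 12, 15, 17, 20]
recorded = ["1", "2", "3", "a7", "salt-x"]   # hypothetical salts seen in earlier runs
chosen = best_salt(keys, recorded, n=4)
print(chosen, spread(keys, chosen, 4))
```

A lower spread means the vectors are shared more evenly; ties are resolved by the order in which the salts were recorded.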
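In summary, this application provides a method in which an application system prefetches embedding vectors: salting changes the hash value, and thereby changes the accelerator that the hash operation assigns to each embedding vector key, so that no embedding vector overflows in any accelerator and prefetching is balanced. The hash operation is fast; the position of each embedding vector need not be recorded during the computation, and the embedding table need not be synchronously recorded in every accelerator just to look up those positions, which improves overall operating efficiency. Through repeated training and conditional filtering, the system can select a better salt value, achieve higher throughput, and be applied to parallel processing.
An embodiment of this application also provides a computing device that includes a processor and a memory. The memory stores computer instructions, and the processor is configured to execute the functions of the modules shown in FIGS. 1-3.
An embodiment of this application also provides a computer-readable storage medium that stores instructions; when the instructions are run on a processor, the method flow shown in FIGS. 1-3 is implemented.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. A computer program product includes at least one computer instruction. When the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. Computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). A computer-readable storage medium may be any available medium that a computer can access, or a data storage node, such as a server or a data center, that contains at least one set of available media. Available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., high-density digital video discs (DVD)), or semiconductor media. The semiconductor media may be SSDs.
The foregoing is merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.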

Claims (12)

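  1. An embedding vector prefetching method, comprising:
    a processor reading a salt value and a first embedding vector key;
    the processor determining, based on the salt value and the first embedding vector key, an accelerator corresponding to the first embedding vector key; and
    the processor determining whether embedding vector overflow exists on the accelerator; if no embedding vector overflow exists, the processor sending a first embedding vector to the accelerator; and if embedding vector overflow exists, the processor not sending the first embedding vector to the accelerator and keeping the first embedding vector stored in a first memory.
  2. The method according to claim 1, wherein
    the salt value is a randomly generated string of one or more characters, or a string stored in a configuration file.
  3. The method according to claim 1 or 2, wherein the processor determining, based on the salt value and the first embedding vector key, the accelerator corresponding to the first embedding vector key comprises:
    the processor determining a first hash value based on the first embedding vector key;
    the processor determining a second hash value based on the salt value and the first hash value; and
    the processor determining, based on the second hash value, the accelerator corresponding to the first embedding vector key.
  4. The method according to claim 3, wherein the processor determining the first hash value based on the first embedding vector key comprises:
    the processor inputting the first embedding vector key into a first hash algorithm to determine the first hash value,
    wherein the first hash algorithm comprises one or more of a message digest algorithm and a secure hash algorithm.
  5. The method according to claim 3 or 4, wherein the processor determining the second hash value based on the salt value and the first hash value comprises:
    the processor combining the salt value with the first hash value to obtain a salted first hash value; and
    the processor inputting the salted first hash value into a second hash algorithm to determine the second hash value.
  6. The method according to claim 5, wherein the processor combining the salt value with the first hash value to obtain the salted first hash value comprises:
    the processor concatenating the first hash value and the salt value as strings to obtain the salted first hash value.
  7. The method according to claim 5, wherein the processor combining the salt value with the first hash value to obtain the salted first hash value comprises:
    the processor inserting the salt value at one or more positions in the first hash value to obtain the salted first hash value.
  8. The method according to any one of claims 3 to 7, wherein the processor determining, based on the second hash value, the accelerator corresponding to the first embedding vector key comprises:
    the processor converting the second hash value to numeric form, obtaining the number n of accelerators in the system, and substituting both into a modulo-n mapping to determine the accelerator corresponding to the first embedding vector key.
  9. The method according to any one of claims 1 to 8, wherein
    if no embedding vector overflow exists, the processor sending the first embedding vector to the accelerator comprises:
    if no embedding vector overflow exists, the processor finding the corresponding first embedding vector in an embedding table based on the first embedding vector key, and sending the first embedding vector to a cache of the accelerator; and
    if embedding vector overflow exists, the processor not sending the first embedding vector to the accelerator and keeping the first embedding vector stored in the first memory comprises:
    if embedding vector overflow exists, the processor not sending the first embedding vector to the accelerator, reading a new salt value, and storing the new salt value in a configuration file.
  10. An embedding vector prefetching apparatus, comprising:
    an obtaining unit, configured to obtain a salt value and a first embedding vector key;
    a hash operation unit, configured to determine, based on the salt value and the first embedding vector key, an accelerator corresponding to the first embedding vector key;
    a comparison unit, configured to determine whether embedding vector overflow exists on the accelerator; and
    a data output unit, configured to send a first embedding vector to the accelerator if no embedding vector overflow exists, and not to send the first embedding vector to the accelerator if embedding vector overflow exists.
  11. A computing device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program to cause the computing device to perform the method according to any one of claims 1 to 9.
  12. A computer-readable storage medium storing a program, wherein when the program is run on a computing device, the computing device is caused to perform the method according to any one of claims 1 to 9.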
PCT/CN2022/119301 2021-09-29 2022-09-16 一种嵌入向量预取的方法、装置、系统及相关设备 WO2023051282A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22874669.9A EP4390706A1 (en) 2021-09-29 2022-09-16 Embedded vector prefetching method, apparatus and system, and related device
US18/619,696 US20240241724A1 (en) 2021-09-29 2024-03-28 Embedding vector prefetching method and apparatus, system, and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111157670.0 2021-09-29
CN202111157670.0A CN114936087B (zh) 2021-09-29 2021-09-29 一种嵌入向量预取的方法、装置、系统及相关设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/619,696 Continuation US20240241724A1 (en) 2021-09-29 2024-03-28 Embedding vector prefetching method and apparatus, system, and related device

Publications (1)

Publication Number Publication Date
WO2023051282A1 true WO2023051282A1 (zh) 2023-04-06

Family

ID=82863035

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/119301 WO2023051282A1 (zh) 2021-09-29 2022-09-16 一种嵌入向量预取的方法、装置、系统及相关设备

Country Status (4)

Country Link
US (1) US20240241724A1 (zh)
EP (1) EP4390706A1 (zh)
CN (1) CN114936087B (zh)
WO (1) WO2023051282A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936087B (zh) * 2021-09-29 2023-06-02 华为技术有限公司 一种嵌入向量预取的方法、装置、系统及相关设备
CN117076720B (zh) * 2023-10-18 2024-02-02 北京燧原智能科技有限公司 一种嵌入表访问方法、装置、电子设备及存储介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005098628A1 (ja) * 2004-03-30 2005-10-20 Ibm Japan, Ltd. オーバーフロー防止方法、装置、及びプログラム
CN101034412A (zh) * 2007-04-02 2007-09-12 华为技术有限公司 一种信息存储的方法、信息查找的方法及引擎装置
CN102737064A (zh) * 2011-04-15 2012-10-17 腾讯科技(深圳)有限公司 文件缓存方法及装置
CN105721390A (zh) * 2014-12-01 2016-06-29 阿里巴巴集团控股有限公司 一种加密存储方法和装置
CN107766258A (zh) * 2017-09-27 2018-03-06 精硕科技(北京)股份有限公司 内存存储方法与装置、内存查询方法与装置
US20180314524A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Supporting learned branch predictors
US20190073580A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Sparse Neural Network Modeling Infrastructure
CN111767364A (zh) * 2019-03-26 2020-10-13 钉钉控股(开曼)有限公司 数据处理方法、装置和设备
CN114936087A (zh) * 2021-09-29 2022-08-23 华为技术有限公司 一种嵌入向量预取的方法、装置、系统及相关设备

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6063321B2 (ja) * 2013-03-27 2017-01-18 株式会社富士通エフサス サーバ装置およびハッシュ値処理方法
JP6829156B2 (ja) * 2017-06-26 2021-02-10 日本電信電話株式会社 ネットワーク負荷分散装置および方法
CN109379297B (zh) * 2018-11-26 2023-01-10 锐捷网络股份有限公司 一种实现流量负载均衡的方法和装置
CN113132249A (zh) * 2019-12-31 2021-07-16 华为技术有限公司 一种负载均衡方法和设备


Also Published As

Publication number Publication date
CN114936087B (zh) 2023-06-02
EP4390706A1 (en) 2024-06-26
US20240241724A1 (en) 2024-07-18
CN114936087A (zh) 2022-08-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874669

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022874669

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022874669

Country of ref document: EP

Effective date: 20240318

NENP Non-entry into the national phase

Ref country code: DE