CN112819145A - Chip, neural network training system, memory management method, device and equipment


Info

Publication number
CN112819145A
CN112819145A
Authority
CN
China
Prior art keywords
address
data
chip
state
neural network
Prior art date
Legal status
Pending
Application number
CN202110221213.7A
Other languages
Chinese (zh)
Inventor
冷祥纶
王勇
The other inventors have requested that their names not be disclosed
Current Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd
Priority: CN202110221213.7A (published as CN112819145A)
Related application: CN202110517588.8A (published as CN113033785B)


Classifications

    • G06N 3/04 Neural networks; architecture, e.g. interconnection topology
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Memory System (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The embodiments of this specification provide a chip, a neural network training system, and a memory management method, device, and equipment. When data related to neural network training is read from or written to an external device connected to a communication interface of the chip, a processing unit in the chip sends the original address of the data to an address translation unit, which converts it into a target address matched to the type of the connected external device, so that the processing unit can read or write the data at that target address. By extending the function of the existing inter-chip interface so that it can also connect to an external memory, the memory capacity of a single chip is increased and training efficiency is improved.

Description

Chip, neural network training system, memory management method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a chip, a neural network training system, and a memory management method, device, and equipment.
Background
With the development of artificial intelligence technology, neural network models are widely used in many fields. As these models become more complex and larger, the amount of data they involve keeps growing. Dedicated chips (e.g., AI chips) can be used to accelerate neural network training, but the memory of a single chip is usually very limited and often cannot meet the memory requirements of a large-scale model. The current practice is therefore to train with many chips to make up for the insufficient memory of a single chip: the chips receive instructions from a CPU and jointly complete the accelerated training according to those instructions. However, the communication link each chip uses to talk to the CPU is the same link the CPU uses to issue instructions, so when many chips are used, their communication with the CPU heavily occupies that link, reducing the efficiency with which the CPU issues instructions and therefore the training efficiency. In addition, with many chips the volume of inter-chip communication is large and its efficiency is low, which also hurts the training efficiency.
Disclosure of Invention
The disclosure provides a chip, a neural network training system, a memory management method, a memory management device and memory management equipment.
According to a first aspect of the embodiments of the present disclosure, there is provided a chip, the chip comprising a processing unit, an address translation unit, and at least one communication interface, each of the communication interfaces being usable for connecting an external device, the type of the external device comprising an external memory and/or other chips,
the processing unit is configured to send an original address corresponding to data to be read or written, which is related to neural network training, to the address conversion unit, where the original address is a physical address in the other chip;
the address conversion unit is configured to convert the original address into a target address matching the type of the external device and output the target address, so that the processing unit reads or writes the data in the target address of the external device.
In some embodiments, the address translation unit includes a storage subunit;
the storage subunit is configured to store a mapping relationship between physical addresses in the other chips and physical addresses in the external memory;
the address conversion unit is configured to convert the original address into the target address based on the mapping relationship if it is determined that the communication interface is connected to the external memory.
In some embodiments, the address translation unit is further configured to directly determine the original address as the target address if it is determined that the communication interface is connected to the other chip.
In some embodiments, the physical addresses of the storage units of the other chips are pre-divided into a plurality of address partitions, and the physical addresses in each address partition can be mapped into the physical addresses in the external memory through a preset mapping relationship;
and the address conversion unit is used for determining the address partition where the original address is located and determining the target address based on the address partition where the original address is located and the mapping relation.
In some embodiments, the address translation unit includes a plurality of sets of first registers, each set of the first registers being for storing indication information indicating an address range of one of the address partitions;
and the address conversion unit is used for determining the address partition where the original address is located according to the indication information.
In some embodiments, the address translation unit further includes a plurality of sets of second registers, each set of the second registers being configured to store current enable status information of one of the address partitions;
the address translation unit is configured to determine that the communication interface is connected to the external memory based on:
determining a target address partition from the plurality of address partitions based on the enable state information;
determining that the communication interface is connected with the external memory in the case that the original address is determined to fall within the address range of the target address partition according to the indication information.
In some embodiments, the external memory comprises a DDR, the DDR in communication with the chip over a bus, the bus comprising a PCIE bus.
In some embodiments, the processing unit is further to:
determining a state in which data relating to the neural network training is located, the state relating to a time interval between a time at which the data is next used and a current time;
determining whether to store the data stored in an internal memory of the chip to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
In some embodiments, the processing unit is further to: in response to determining to store the data stored in the internal memory of the chip to the external memory according to the state, sending the original address corresponding to the data to be stored in the external memory to the address conversion unit, so that the address conversion unit converts the original address into the target address in the external memory; and/or
In response to determining to store the data stored in the external memory to the internal memory according to the status, sending the original address corresponding to the data to be stored to the internal memory to the address translation unit, so that the address translation unit translates the original address to the target address in the external memory.
In some embodiments, the states include a first state and a second state; the processing unit is configured to:
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in a second state.
In some embodiments, the processing unit is to determine the state the data is in based on:
and determining the state of the data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the corresponding chip, and the current calculation direction of the neural network.
In some embodiments, the processing unit is to:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining the state of data relating to the target layer as the second state;
determining the state of data relating to layers of the neural network other than the target layer as the first state.
In some embodiments, the predetermined number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
According to a second aspect of the embodiments of the present disclosure, there is provided a neural network training system, the neural network training system comprising a main processor and a plurality of chips mentioned in the first aspect, the plurality of chips being connected through the communication interface; and the plurality of chips are used for receiving the instruction of the main processor and carrying out accelerated training on the neural network based on the instruction.
In some embodiments, the address translation unit includes a storage subunit;
the storage subunit is configured to store a mapping relationship between physical addresses in the other chips and physical addresses in the external memory;
the address conversion unit is configured to convert the original address into the target address based on the mapping relationship if it is determined that the communication interface is connected to the external memory.
In some embodiments, the address translation unit is further configured to directly determine the original address as the target address if it is determined that the communication interface is connected to the other chip.
In some embodiments, the physical addresses of the storage units of the other chips are pre-divided into a plurality of address partitions, and the physical addresses in each address partition can be mapped into the physical addresses in the external memory through a preset mapping relationship;
and the address conversion unit is used for determining the address partition where the original address is located and determining the target address based on the address partition where the original address is located and the mapping relation.
In some embodiments, the address translation unit includes a plurality of sets of first registers, each set of the first registers being for storing indication information indicating an address range of one of the address partitions;
and the address conversion unit is used for determining the address partition where the original address is located according to the indication information.
In some embodiments, the address translation unit further includes a plurality of sets of second registers, each set of the second registers being configured to store current enable status information of one of the address partitions;
the address translation unit is configured to determine that the communication interface is connected to the external memory based on:
determining a target address partition from the plurality of address partitions based on the enable state information;
determining that the communication interface is connected with the external memory in the case that the original address is determined to fall within the address range of the target address partition according to the indication information.
In some embodiments, the external memory comprises a DDR, the DDR in communication with the chip over a bus, the bus comprising a PCIE bus.
In some embodiments, the processing unit is further to:
determining a state in which data relating to the neural network training is located, the state relating to a time interval between a time at which the data is next used and a current time;
determining whether to store the data stored in an internal memory of the chip to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
In some embodiments, the processing unit is further to: in response to determining to store the data stored in the internal memory of the chip to the external memory according to the state, sending the original address corresponding to the data to be stored in the external memory to the address conversion unit, so that the address conversion unit converts the original address into the target address in the external memory; and/or
In response to determining to store the data stored in the external memory to the internal memory according to the status, sending the original address corresponding to the data to be stored to the internal memory to the address translation unit, so that the address translation unit translates the original address to the target address in the external memory.
In some embodiments, the states include a first state and a second state; the processing unit is configured to:
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in a second state.
In some embodiments, the processing unit is to determine the state the data is in based on:
and determining the state of the data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the corresponding chip, and the current calculation direction of the neural network.
In some embodiments, the processing unit is to:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining the state of data relating to the target layer as the second state;
determining the state of data relating to layers of the neural network other than the target layer as the first state.
In some embodiments, the predetermined number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
According to a third aspect of the embodiments of the present disclosure, there is provided a memory management method for a chip, where the chip includes at least one communication interface and an internal memory, the communication interface is used to connect to at least one external memory, and the method includes:
determining the state of data related to the neural network training in the process of training the neural network by using the chip, wherein the state is related to the duration between the time when the data is used next time and the current time;
determining whether to store the data stored in the internal memory to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
In some embodiments, the method further comprises:
in response to determining to store the data stored in the internal memory of the chip to the external memory according to the state, converting an original address corresponding to the data to be stored in the external memory into the target address in the external memory, so as to write the data to be stored in the external memory at the target address of the external memory; and/or
In response to determining to store the data stored in the external memory to the internal memory according to the state, converting the original address corresponding to the data to be stored to the internal memory to a target address in the external memory to read the data to be stored to the internal memory from the target address of the external memory.
In some embodiments, the states include a first state and a second state;
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in the second state.
In some embodiments, determining the state in which the data relating to the neural network training is located comprises:
and determining the state of data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the chip and the current calculation direction of the chip for calculating the neural network.
In some embodiments, determining the state of the data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the chip and the current calculation direction of the chip for calculating the neural network comprises:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining the state of data relating to the target layer as the second state;
determining the state of data relating to layers of the neural network other than the target layer as the first state.
In some embodiments, the predetermined number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a memory management device for a chip, where the device is configured to perform memory management on the chip, the chip includes at least one communication interface and an internal memory, the communication interface is configured to connect to at least one external memory, and the device includes:
the state determining module is used for determining the state of data related to the neural network training in the process of training the neural network by using the chip, wherein the state is related to the time length between the next time of using the data and the current time;
a determination module that determines whether to store the data stored in the internal memory to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
In some embodiments, the states include a first state and a second state;
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in the second state.
In some embodiments, the state determination module, when configured to determine the state of the data related to the neural network training, is specifically configured to:
and determining the state of data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the chip and the current calculation direction of the chip for calculating the neural network.
In some embodiments, the state determining module is configured to, when determining the state of the data related to each layer according to the number of layers spaced between each layer in the neural network and a current layer currently processed by the chip and a current calculation direction of the chip for calculating the neural network, specifically:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining the state of data relating to the target layer as the second state;
determining the state of data relating to layers of the neural network other than the target layer as the first state.
In some embodiments, the predetermined number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
According to a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device, which includes a processor and a memory, and further includes a chip as provided in the first aspect or any embodiment thereof, and/or performs the memory management method provided in the third aspect or any embodiment thereof.
In the embodiments of the present disclosure, to expand the memory of a single chip, the communication interface originally used for inter-chip communication is functionally extended so that it can connect either to an external memory or to another chip. To allow the external memory to be accessed through this original inter-chip interface, the chip includes an address translation unit. When data related to neural network training is to be read from or written to an external device connected to the communication interface, the processing unit in the chip sends the original address of the data, which is a physical address in another chip, to the address translation unit. Based on the type of the external device connected to the interface, the address translation unit converts the original address into a matching target address, i.e., an address in the external memory or in the other chip, so that the processing unit can read or write the data at the target address of the external device. Because the extended interface can connect an external memory, the memory capacity of a single chip increases. Fewer chips are then needed to train a large-scale neural network, which reduces the volume of inter-chip communication and improves training efficiency; it also reduces the communication between the chips and the CPU, lowering the occupation of the CPU's communication link and improving the efficiency with which the CPU issues instructions. In addition, with a larger per-chip memory, a large-scale neural network can be trained with simple data parallelism alone, which further improves training efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram of a neural network training system of an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of a chip according to an embodiment of the disclosure.
Fig. 3 is a schematic diagram of a neural network training system according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of an address mapping according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of a neural network training process of an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a memory management method of a chip according to an embodiment of the disclosure.
Fig. 7 is a schematic diagram of a memory management device according to an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
In order to make the technical solutions in the embodiments of the present disclosure better understood and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Artificial intelligence technology has been widely applied in many fields. To improve the accuracy of their prediction results, neural network models have become increasingly complex and large, and the amount of data they involve keeps growing. Taking the widely used GPT-3 model as an example, it has about 175 billion parameters; storing just its weights as 32-bit floating-point numbers requires about 700 GB (175 billion x 4 bytes). Chips such as AI chips can therefore be employed to accelerate neural network training: an AI chip mainly executes specific computation tasks during model training so as to speed it up. However, the memory of an AI chip is currently around 32-80 GB, which falls far short of the memory requirements of a large-scale neural network model.
To address the insufficient memory of a single chip, a multi-machine, multi-card training system is usually adopted to train the neural network with data parallelism or model parallelism. As shown in fig. 1, one node may contain several AI chips (also called accelerator cards, such as the V100 GPUs in fig. 1). Each AI chip communicates with a CPU over a PCIE (Peripheral Component Interconnect Express) link, for example through a PCIE Switch as shown in fig. 1; the AI chips communicate with one another over NVLink links; and different CPUs communicate over QPI (QuickPath Interconnect). For example, if the model has 40 layers and the training system has 8 AI chips in total, each AI chip may store the parameters of 5 layers, and different AI chips transmit the calculation results of their stored layers to other AI chips or to the CPU. Because the storage space of one node may still be insufficient, multiple nodes (i.e., multiple machines) are usually required, and AI chips on different nodes communicate through NICs (Network Interface Controllers), which have low bandwidth and high latency, so training efficiency is low. In addition, because the AI chips communicate with the CPU over PCIE links and the CPU also issues instructions over PCIE, when many AI chips are used their communication with the CPU heavily contends for the CPU's PCIE links, reducing the efficiency with which the CPU issues instructions.
Based on this, the embodiments of the present disclosure provide another solution: by expanding the memory of a single chip, the memory capacity of each chip is increased, so fewer chips are needed for neural network training, the volume of inter-chip communication drops sharply, and training efficiency improves. At the same time, the volume of communication between the chips and the main processor (e.g., a CPU) also drops, reducing the occupation of the main processor's PCIE communication link and improving its instruction-issuing efficiency.
Specifically, the embodiments of the present disclosure provide a chip, which may be various chips that can be used to accelerate neural network training, for example, the chip may be an AI chip, or other chips with similar functions.
As shown in fig. 2, the chip includes a processing unit 21, an address translation unit 22, and at least one communication interface 23; each communication interface 23 may be used to connect an external device 24, and the external device 24 may be an external memory and/or another chip for accelerating neural network training.
The processing unit 21 is configured to send an original address corresponding to data to be read or written, which is related to neural network training, to the address translation unit, where the original address is a physical address in another chip;
the address conversion unit 22 is configured to convert the original address into a target address matching the type of the external device 24 and output the target address, so that the processing unit 21 reads or writes the data at the target address of the external device 24.
To allow a chip to attach additional memory, thereby enlarging the memory capacity of a single chip while changing the chip structure as little as possible, the embodiments of the present disclosure functionally extend the chip's original inter-chip communication interface. The extended interface can still connect to other chips, so joint model training with other chips remains possible; at the same time, it can also connect to an external memory to expand the memory capacity of the single chip. When the interface communicates with another chip and when it communicates with an external memory depends on the chip structure and the scheduling policy at run time, and is not limited here.
Of course, since the embodiments of the present disclosure extend the interface's functions while the interface protocol of the original inter-chip communication interface defines addresses as physical addresses in another chip, supporting memory expansion, i.e., accessing an external memory through this interface, requires adding an address translation unit to the chip. The address translation unit maps a physical address in another chip to a physical address in the external memory, so that the processing unit can read and write data in the external memory. In some embodiments, all communication interfaces may share one address translation unit, which translates the addresses output by each interface. In other examples, to simplify the data processing flow of the address translation unit, each communication interface may have its own address translation unit for translating the addresses output by that interface.
During neural network training, when the processing unit in a chip needs to read data related to the training for calculation, or needs to store calculated data related to the training, it may send the original address corresponding to the data to be read or written to the address translation unit. The original address may be the address carried in the read or write instruction; since the addresses defined by the original inter-chip interface protocol are physical addresses in other chips, every such original address is a physical address in another chip. Because the external device connected to a given communication interface may be either another chip or an external memory, the address translation unit first determines the type of the connected external device, then converts the original address into a target address matching that type, and outputs the target address through the communication interface, so that the processing unit can read the data from, or write the data to, the target address of the external device.
The data in the embodiments of the present disclosure include the various tensors used during neural network training, for example input data (e.g., sample data) and process data generated during training: model parameters such as the weights and biases of each layer, and outputs such as the activations of each layer and the loss.
In some embodiments, the chip may be used in a neural network training system. As shown in fig. 3, the system includes a main processor and a plurality of the chips; the chips are communicatively connected to one another, receive instructions from the main processor, and perform accelerated training of the neural network based on those instructions. The main processor may be a CPU or another processor and is responsible for allocating a calculation task to each chip so that the chip completes the task according to the instructions. In some scenarios, each chip may also transmit its calculation result to the main processor for summary processing. Because each chip attaches an external memory, the memory capacity of a single chip increases, so fewer chips are needed; this reduces the volume of inter-chip communication and improves processing efficiency, and it also reduces the communication between the chips and the main processor, lowering the occupation of the main processor's communication link and improving its instruction-issuing efficiency.
In some embodiments, the address translation unit includes a storage subunit for storing the mapping relationship between physical addresses in other chips and physical addresses in the external memory. For example, some or all of the physical addresses in other chips may be mapped to physical addresses in the external memory, with the specific mapping determined by actual requirements. For instance, the physical addresses in the chip may be divided into several address partitions and the physical addresses in the external memory likewise divided into several partitions, with each chip partition associated with one external-memory partition; addresses in two associated partitions can then be converted into one another by some translation rule. After receiving the original address sent by the processing unit, the address translation unit may first determine the type of external device currently connected to the corresponding communication interface, i.e., whether it is an external memory or another chip, and, if it determines that an external memory is connected, translate the original address into the target address in the external memory according to the stored mapping relationship.
Of course, in some embodiments, if the address translation unit determines that the communication interface corresponding to the original address is connected to another chip, the original address does not need to be translated, and the original address is directly output as the target address.
Because a chip may include several communication interfaces, some connected to an external memory and some to other chips, an original address may carry an interface identifier to distinguish the interfaces. After receiving an original address from the processing unit, the address translation unit determines from the interface identifier whether the corresponding communication interface is connected to another chip or to an external memory, and then decides whether to translate the original address. In some embodiments, if one address translation unit performs address translation for all the communication interfaces, it may include one register per communication interface identifying the type of external device connected to that interface; for example, if a chip has 12 communication interfaces, the address translation unit may include 12 registers, each indicating through a 1 or 0 state whether its interface is currently connected to another chip or to an external memory. In the scenario where each communication interface has its own address translation unit, each unit may instead include a single register storing the type of external device connected to its interface. A minimal sketch of such type registers follows below.
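For illustration only, the per-interface type registers can be modeled in a few lines of Python. The 12-interface count is taken from the example above; all names and values are assumptions, not the actual implementation.

NUM_INTERFACES = 12                # interface count from the example above
iface_type = [0] * NUM_INTERFACES  # 0 = other chip, 1 = external memory

def needs_translation(interface_id: int) -> bool:
    """An original address needs translating only when its interface is
    connected to an external memory; otherwise it passes through as-is."""
    return iface_type[interface_id] == 1

iface_type[3] = 1  # e.g., interface 3 is wired to an external DDR module
assert needs_translation(3)
assert not needs_translation(0)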
In some embodiments, to let the address translation unit quickly translate an original address into a target address in the external memory, the physical addresses to be mapped in the chip's memory unit may be divided into a plurality of address partitions, each corresponding to a segment of addresses, and the addresses of each partition may be mapped to physical addresses in the external memory through a preset mapping relationship. After receiving an original address, the address translation unit may determine the address partition in which it falls and then translate it into the target address based on that partition and the mapping relationship corresponding to it. The number of partitions and the number of addresses each contains can be set according to actual requirements.
In some embodiments, the address translation unit may include multiple groups of first registers, each group corresponding to one address partition and storing indication information that indicates the partition's address range. The indication information may take various forms, for example the partition's start and end addresses, the start address plus an offset to the end address, or the end address plus an offset back to the start address, as actual requirements dictate. After receiving an original address, the address translation unit may compare it in parallel against the indication information stored in each group of first registers, determine the partition in which the original address falls, and then translate it into a target address in the external memory using the mapping relationship corresponding to that partition.
In some embodiments, the address translation unit may further include multiple groups of second registers, each group corresponding to one address partition and storing the partition's current enable state information. The enable state indicates whether a partition is active: an enabled partition is compared against incoming original addresses to decide whether an address falls within its range, while a disabled partition is not compared at all. After receiving an original address, the address translation unit may first determine one or more target address partitions, i.e., the enabled partitions, from the enable state information in the second registers, then compare the original address in parallel against the target partitions' address ranges. If the original address does not fall within any target partition, it is determined that the communication interface is connected to another chip; if it falls within the range of one target partition, it is translated directly into the target address according to that partition's mapping relationship and then output.
For example, as shown in fig. 4, part or all of the physical address space of the memory unit in the chip may be divided into N address partitions (N can be set according to actual requirements), and the physical address space in the external memory is likewise divided into N address partitions, with the nth partition in the chip corresponding to the nth partition in the external memory. For each address partition N, the address translation unit has a corresponding group of registers storing the partition's enable state information, its start address (start_addr), its end address (end_addr), and its base address in the external memory (tgrt_base_addr). Upon receiving the original address (addr1) sent by the processing unit, the address translation unit compares addr1 in parallel against the enabled partitions; if it falls within the range of a partition, say partition N, it is translated into the target address (addr2) in the external memory according to equation (1):
addr2 = addr1 - start_addr + tgrt_base_addr    (1)
where addr2 is the target address, addr1 is the original address received by the address translation unit, start_addr is the start address of address partition N, and tgrt_base_addr is the base address of address partition N in the external memory.
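As a worked example of equation (1) with hypothetical values: if partition N has start_addr = 0x10000000 and tgrt_base_addr = 0x80000000, then an original address addr1 = 0x10000040 translates to addr2 = 0x10000040 - 0x10000000 + 0x80000000 = 0x80000040.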
After the target address (addr2) is determined, it may be output over the communication interface so that the processing unit can read or write data at the target address of the external memory.
Of course, if the parallel comparison finds that the original address (addr1) does not fall within any enabled address partition, the original address (addr1) is output directly without translation.
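Putting the pieces together, the lookup-and-translate behavior above can be summarized in a short Python model. This is a minimal sketch under stated assumptions: a sequential loop stands in for the hardware's parallel comparison, the register names follow fig. 4, and the partition values in the usage lines are hypothetical.

from dataclasses import dataclass

@dataclass
class AddressPartition:
    enable: bool         # enable state information (a group of second registers)
    start_addr: int      # partition start address (a group of first registers)
    end_addr: int        # partition end address (a group of first registers)
    tgrt_base_addr: int  # base address of this partition in the external memory

class AddressTranslationUnit:
    def __init__(self, partitions: list[AddressPartition]):
        self.partitions = partitions

    def translate(self, addr1: int) -> int:
        """Translate an original address; the hardware compares all enabled
        partitions in parallel, which this loop stands in for."""
        for p in self.partitions:
            if p.enable and p.start_addr <= addr1 <= p.end_addr:
                # Equation (1): addr2 = addr1 - start_addr + tgrt_base_addr
                return addr1 - p.start_addr + p.tgrt_base_addr
        # No enabled partition matched: the interface is treated as connected
        # to another chip and the original address is output unchanged.
        return addr1

# Hypothetical partition: chip addresses 0x10000000-0x1FFFFFFF map onto
# external-memory addresses starting at 0x80000000.
atu = AddressTranslationUnit(
    [AddressPartition(True, 0x1000_0000, 0x1FFF_FFFF, 0x8000_0000)]
)
assert atu.translate(0x1000_0040) == 0x8000_0040  # falls in the partition
assert atu.translate(0x0000_0040) == 0x0000_0040  # passes through unchanged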
In some embodiments, the external memory may be a DDR memory that communicates with the chip over a bus, such as a PCIE bus. Of course, the external memory may also be another type of memory, and the bus type may be chosen to suit the memory; the embodiments of the present disclosure are not limited in this respect.
According to the embodiments of the present disclosure, expanding the chip's memory gives a single chip a larger storage space, so neural network training can be completed with fewer chips, reducing the volume of inter-chip communication and improving training efficiency. It also reduces the volume of communication between the chips and the CPU and the occupation of the CPU's PCIE communication link.
In addition, for some larger models, the memory of a single chip cannot hold the parameters of the whole model, so a model-parallel training mode must be adopted, which is less efficient. By increasing the memory capacity of a single chip, one chip alone can store the parameters of the whole model, so training can be completed with simple data parallelism and without a complex model-parallel mode, greatly improving training efficiency.
Attaching an external memory increases the storage capacity of a single chip, but it also poses challenges for the chip's memory management. Once the external memory is attached, it becomes very important how the chip's internal and external memories are managed, so that the external memory is well utilized and the chip's processing efficiency during training improves.
In some embodiments, during training of the neural network with the chip, the processing unit may determine the current state of each piece of data related to the training, where the state reflects the interval between the next time the data is used and the current time. The states may be any set of states that indicate how soon data will be used. For example, if a piece of data will be used within less than a preset duration, it may be determined to be in an active state; if it will be used only after more than the preset duration, it may be determined to be in an inactive state. In actual use, the data may of course be divided into more states as needed, and data in different states may be handled differently.
After the state of each piece of data is determined, it can be decided, according to that state, whether data currently in the internal memory should be stored to the external memory and/or whether data in the external memory should be stored to the internal memory. Since the chip stores each layer's output in the internal memory after calculating it, and fetches data from the internal memory when using it, data about to be used should reside in the internal memory, while data that will not be used for a while can reside in the external memory. For example, if the state indicates that a piece of data is about to be used but it currently sits in the external memory, it can be moved into the internal memory for the chip's calculation; if the state indicates that a piece of data will not be used for a long time but it sits in the internal memory, it can be moved to the external memory to free internal memory space for data that is about to be used.
In some embodiments, when the processing unit determines from the state that some data in the chip's internal memory should be stored to the external memory, it may send the original address corresponding to that data to the address translation unit; the address translation unit translates the original address into a target address in the external memory, and the processing unit stores the data at that target address. For example, in some scenarios the CPU may determine, based on which layer of the neural network the chip is currently computing and on the chip's computing capability, that the data of certain layers will not be used for a while and can be stored in the external memory. The CPU then sends the processing unit an instruction carrying the data to be stored in the external memory and its corresponding original address (for example, a block of addresses may be reserved in advance in the external memory for holding temporarily unused data from the internal memory; the address the CPU knows is the original address corresponding to that block). The processing unit sends the original address carried in the instruction to the address translation unit to obtain the actual target address in the external memory, and stores the data at that target address.
Similarly, if the processing unit determines from the state that some data in the external memory should be stored to the internal memory, it may send the original address corresponding to that data to the address translation unit; the address translation unit translates it into a target address in the external memory, and the processing unit reads the data from that target address. For example, in some scenarios the CPU may determine, based on which layer the chip is currently computing and on the chip's computing capability, that the data of certain layers is about to be used and should be moved into the internal memory. The CPU sends the processing unit an instruction carrying the data to be stored to the internal memory and its corresponding original address (the CPU knows in advance which address space each layer's data occupies, but the address it knows is the original address corresponding to that block). The processing unit sends that original address to the address translation unit to obtain the actual target address in the external memory, and reads the data from that target address of the external memory.
In some embodiments, the state of each piece of data may include a first state and a second state. Data in the first state will not be used for a long time, so if data stored in the internal memory is in the first state, it can be moved to the external memory first to free internal memory space. Data in the second state is about to be used, so if data stored in the external memory is in the second state, it can be moved into the internal memory, allowing the chip to read it directly from the internal memory when needed and improving processing efficiency.
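Expressed as code, this two-state rule reduces to a small decision function. The sketch below uses hypothetical `State` tags and string-valued locations purely for illustration:

```python
from enum import Enum

class State(Enum):
    FIRST = 1   # will not be used for a while -> belongs in external memory
    SECOND = 2  # about to be used             -> belongs in internal memory

def plan_move(state, location):
    """Return where a datum should be moved, or None to leave it in place."""
    if state is State.FIRST and location == "internal":
        return "external"  # evict to free internal memory space
    if state is State.SECOND and location == "external":
        return "internal"  # prefetch so the chip can read it locally
    return None
```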
In some embodiments, when determining the state of each piece of data related to the neural network training, the state of the data related to each layer may be determined from the number of layers between that layer and the layer the chip is currently processing, together with the direction in which the chip is currently computing the neural network. The computation direction can be derived from the stage of training: in the forward propagation stage computation proceeds from layer 1 toward the last layer, and in the backward propagation stage it proceeds from the last layer back toward layer 1. From the computation direction and the number of layers between each layer and the current layer, it can be estimated how soon the data related to each layer will be used, and hence the state of that data.
For example, suppose the neural network has 20 layers and layer 5 is currently being computed in the forward propagation process. The data of the layers before layer 5 (layers 1-4) will not be used in the near term, so the data related to these layers can be placed in the first state. Likewise, layers far enough after layer 5 will not be reached for a while, for example layers 10-20, which are at least 5 layers away, so the model parameters of these layers can also be placed in the first state. The backward propagation stage can be handled in a similar way.
In some embodiments, when the state of the data related to each layer is determined from the number of layers between each layer and the current layer and from the current computation direction, a target layer may first be determined from the neural network based on the computation direction and the layer spacing: the target layer is a layer that is computed after the current layer and is fewer than a preset number of layers away from it. The data related to the target layer is then placed in the second state, while the data of the layers other than the target layer is placed in the first state.
For example, suppose the neural network has 20 layers, training is in the forward propagation stage, and the currently computed layer is layer 5. Since computation proceeds from front to back, the parameters of layers 6, 7, and 8 are about to be used, so the layers that lie after the current layer and fewer than a certain number of layers away (for example, 3 or 5) can be determined as target layers, and the data related to those layers placed in the second state. The data of the remaining layers will not be used in the near term and can therefore be placed in the first state.
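One possible way to express this target-layer rule, reusing the hypothetical `State` tags from the sketch above and assuming layers are numbered from 1, is:

```python
def classify_layers(num_layers, current, direction, window):
    """Mark layers computed within `window` layers after the current one
    as SECOND (to prefetch) and all other layers as FIRST (to evict).
    `direction` is +1 for forward propagation, -1 for backward."""
    states = {}
    for layer in range(1, num_layers + 1):
        distance = (layer - current) * direction  # layers until it is reached
        states[layer] = State.SECOND if 0 < distance < window else State.FIRST
    return states

# Forward pass over 20 layers, currently at layer 5, window of 4:
# layers 6-8 come back as SECOND, all other layers as FIRST.
states = classify_layers(20, current=5, direction=+1, window=4)
```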
In some embodiments, how many upcoming layers should be treated as target layers can be determined from the time the chip needs to compute each layer of the neural network and the transfer time of data between the internal memory and the external memory. For example, if layer 5 is currently being computed, each layer takes 1 s to compute, and moving one layer's data from the external memory to the internal memory takes 2 s, then while layer 5 is being computed the data of at least the next 2 layers must already be in the internal memory. This guarantees that later computation never has to wait for data to be fetched from outside, optimizing memory use and improving processing efficiency.
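The sizing rule in this example is simple arithmetic: the prefetch window must cover at least ceil(transfer time / compute time) upcoming layers. A sketch under the stated assumptions:

```python
import math

def min_prefetch_layers(compute_time_per_layer, transfer_time_per_layer):
    """Smallest number of upcoming layers whose data must already be in the
    internal memory so that data transfers never stall the computation."""
    return math.ceil(transfer_time_per_layer / compute_time_per_layer)

# 1 s of compute per layer, 2 s to move one layer's data in: at least the
# next 2 layers must be resident, matching the example above.
assert min_prefetch_layers(1.0, 2.0) == 2
```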
By exploiting the fact that, during neural network training, different pieces of data are used in different time periods, the data about to be used can be kept in the internal memory and the data that will not be used for a while moved to the external memory. This ensures that when the processing unit needs a piece of data, it can read it directly from the internal memory, improving processing efficiency.
Further, an embodiment of the present disclosure provides a neural network training system that includes a main processor and a plurality of the chips described in the above embodiments, the chips being connected through the communication interface; the chips receive instructions from the main processor and perform accelerated training of the neural network based on those instructions.
For the specific structure and functions of the chip, reference may be made to the chip embodiments above; details are not repeated here.
Attaching an external memory to a chip increases the storage capacity available to a single chip, but it also complicates the chip's memory management. Once an external memory is attached, how to manage the chip's embedded memory and the external memory together, so that the external memory actually improves the chip's processing efficiency when training neural networks, becomes an important question.
Whether it is the chip's embedded memory or the external memory, its main purpose is to store the data used in the neural network training process. This data comprises the various tensors involved in training: input data (e.g., sample data) and the process data generated during training, such as the model parameters of each layer (weights, biases, and so on) and the output results of each layer (losses and the like).
To better understand the chip memory management method provided by the embodiments of the present disclosure, the training process of a neural network is briefly introduced below with an example. Fig. 5 shows a schematic diagram of a neural network model comprising an input layer, a hidden layer, and an output layer; a neural network with a complex structure and large scale often contains many hidden layers. Training a neural network generally involves two processes: forward propagation and backward propagation. During forward propagation, sample data for training is fed in at the input layer and passes through each layer of the network in turn; the output of each layer is computed from that layer's current model parameters, serves as the input to the next layer, and is also needed later during backward propagation to adjust that layer's model parameters. Once all layers have been computed, a loss is determined from the final output, and backward propagation begins: according to the loss, the model parameters of each layer (the updated parameters of each layer in the figure) are updated in turn from back to front. Forward and backward propagation are then repeated until the loss converges.
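As a framework-free illustration of this forward/backward structure, the toy sketch below shows which tensors each phase produces and later consumes; the `DenseLayer` class and all names are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

class DenseLayer:
    """Toy fully connected layer, used only to illustrate the data flow."""
    def __init__(self, n_in, n_out, rng):
        self.w = rng.standard_normal((n_in, n_out)) * 0.1  # model parameters
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                               # input kept for backward
        self.y = np.tanh(x @ self.w + self.b)    # output kept for backward
        return self.y

    def backward(self, grad_out):
        grad_pre = grad_out * (1.0 - self.y ** 2)  # back through tanh
        self.gw = self.x.T @ grad_pre              # parameter gradients
        self.gb = grad_pre.sum(axis=0)
        return grad_pre @ self.w.T                 # gradient for previous layer

    def update(self, lr):
        self.w -= lr * self.gw                     # updated parameters
        self.b -= lr * self.gb

def train_step(layers, sample, target, lr=0.01):
    out = sample
    for layer in layers:                  # forward propagation, front to back
        out = layer.forward(out)
    loss = ((out - target) ** 2).mean()   # loss from the final output
    grad = 2.0 * (out - target) / out.size
    for layer in reversed(layers):        # backward propagation, back to front
        grad = layer.backward(grad)
        layer.update(lr)
    return loss                           # repeated until the loss converges
```

Note that each layer's input and output (`self.x`, `self.y` above) must survive from the forward pass until the backward pass reaches that layer, which is exactly the reuse interval that the memory management below exploits.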
As can be seen from this training process, in both forward and backward propagation the chip computes the layers one by one, from front to back or from back to front, producing each layer's result in turn. When computing the current layer, only the data related to that layer's computation is needed, such as the current layer's input and its model parameters (e.g., weights or biases); data related to layers far from the current layer will not be used until much later. In addition, the output result of the current layer (for example, the activation output during forward propagation, or the updated model parameters after backward propagation) is not needed again until the next time the layer is computed, and the interval between these two uses is relatively long.
Based on these characteristics of the neural network training process, the embodiments of the present disclosure provide a memory management method for a chip. Because reading data from the chip's internal memory is much faster than reading it from the external memory, data that will not be used in the near term can be stored in the external memory during training, freeing internal memory space, while data that is about to be used can be moved from the external memory into the internal memory in advance. The data can then be read directly from the internal memory when it is needed for computing the neural network, improving processing efficiency during training.
The method provided by the embodiments of the present disclosure may be executed by the chip itself or by another processor that communicates with the chip. For example, some neural network training systems include a CPU and a plurality of chips, and the chips receive instructions from the CPU and train the neural network accordingly. In such a scenario the memory management method may be executed by the CPU: the CPU determines the current state of each piece of data, where the state indicates how soon the data will be used, and then informs the chip which data is about to be used and which will not be needed for a while, so that the chip can move the latter from the internal memory to the external memory and the former from the external memory to the internal memory. Of course, the memory management method may also be executed by the chip: for example, the CPU may send the chip the information needed to determine the state of each piece of data, and the chip itself determines which data is about to be used and which is not, moving data between the internal and external memories accordingly.
Specifically, as shown in fig. 6, the memory management method in the embodiment of the present disclosure includes the following steps:
S602, in the process of training a neural network by using the chip, determining the state of the data related to the neural network training, wherein the state relates to the time between the next time the data will be used and the current time;
S604, determining, according to the state, whether to store the data stored in the internal memory into the external memory and/or whether to store the data stored in the external memory into the internal memory.
In step S602, during training of the neural network with the chip, the current state of each piece of data related to the training may be determined. The state of a piece of data relates to the interval between the next time the data will be used and the current time, and can be used to indicate how long that interval is. The states may be any set of states that indicates how soon data will be used. For example, data may be assigned different states according to this interval: if a piece of data will be used within a preset time length, it may be marked as active; if it will only be used after more than the preset time length, it may be marked as inactive. Of course, in actual use the data may be divided into more states as needed, and data in different states may be handled differently.
In step S604, after the state of each piece of data is determined, it can be decided, according to that state, whether data currently stored in the internal memory needs to be moved to the external memory and whether data stored in the external memory needs to be moved to the internal memory. Because the chip stores the output result of each layer in the internal memory after calculation and reads data from the internal memory when using it, data that is about to be used should reside in the internal memory, while data that will not be used for some time can be kept in the external memory. For example, if the state indicates that a piece of data is about to be used but the data currently resides in the external memory, the data can be moved into the internal memory for the chip's computation; conversely, if the state indicates that a piece of data will not be used for a long time but it currently occupies the internal memory, the data can be moved out to the external memory to free internal memory space for data that is about to be used.
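Putting steps S602 and S604 together, one possible management pass, reusing the hypothetical `classify_layers` and `plan_move` sketches from earlier and taking the chip's transfer primitives as parameters (since their real interface is not specified here), would look like:

```python
def manage_memory(tensors, states, store_to_external, load_to_internal):
    """One pass of steps S602/S604. `tensors` maps layer -> (datum, location);
    `states` comes from classify_layers (step S602); the two callables stand
    in for the chip's transfer primitives, whose real interface is assumed."""
    for layer, (datum, location) in tensors.items():   # step S604
        destination = plan_move(states[layer], location)
        if destination == "external":
            store_to_external(datum)   # free internal memory space
        elif destination == "internal":
            load_to_internal(datum)    # prefetch before the data is needed
```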
In some embodiments, the chip may be the chip described in the above chip embodiments. Of course, the chip may also have another structure, as long as it can be connected to an external memory and read and write data in the external memory; the embodiments of the present application impose no limitation here. In some embodiments, where the chip is the chip described in the above embodiments, the method further includes:
in response to determining, according to the state, to store the data stored in the internal memory of the chip to the external memory, converting the original address corresponding to the data to be stored to the external memory into a target address in the external memory, so as to write that data at the target address of the external memory; and/or
in response to determining, according to the state, to store the data stored in the external memory to the internal memory, converting the original address corresponding to the data to be stored to the internal memory into a target address in the external memory, so as to read that data from the target address of the external memory.

In some embodiments, the state of each piece of data may include a first state and a second state. Data in the first state will not be used for a long time, so if data stored in the internal memory is in the first state, it can be moved to the external memory first to free internal memory space. Data in the second state is about to be used, so if data stored in the external memory is in the second state, it can be moved into the internal memory, allowing the chip to read it directly from the internal memory when needed and improving processing efficiency.
In some embodiments, in step S602, when determining the state of each piece of data related to the neural network training, the state of the data related to each layer may be determined from the number of layers between that layer and the layer the chip is currently processing, together with the direction in which the chip is currently computing the neural network. The computation direction can be derived from the stage of training: in the forward propagation stage computation proceeds from layer 1 toward the last layer, and in the backward propagation stage it proceeds from the last layer back toward layer 1. From the computation direction and the number of layers between each layer and the current layer, it can be estimated how soon the data related to each layer will be used, and hence the state of that data.
For example, suppose the neural network has 20 layers and layer 5 is currently being computed in the forward propagation process. The data of the layers before layer 5 (layers 1-4) will not be used in the near term, so the data related to these layers can be placed in the first state. Likewise, layers far enough after layer 5 will not be reached for a while, for example layers 10-20, which are at least 5 layers away, so the model parameters of these layers can also be placed in the first state. The backward propagation stage can be handled in a similar way.
In some embodiments, when the state of the data related to each layer is determined from the number of layers between each layer and the current layer and from the current computation direction, a target layer may first be determined from the neural network based on the computation direction and the layer spacing: the target layer is a layer that is computed after the current layer and is fewer than a preset number of layers away from it. The data related to the target layer is then placed in the second state, while the data of the layers other than the target layer is placed in the first state.
For example, suppose the neural network has 20 layers, training is in the forward propagation stage, and the currently computed layer is layer 5. Since computation proceeds from front to back, the parameters of layers 6, 7, and 8 are about to be used, so the layers that lie after the current layer and fewer than a certain number of layers away (for example, 3 or 5) can be determined as target layers, and the data related to those layers placed in the second state. The data of the remaining layers will not be used in the near term and can therefore be placed in the first state.
In some embodiments, how many upcoming layers should be treated as target layers can be determined from the time the chip needs to compute each layer of the neural network and the transfer time of data between the internal memory and the external memory. For example, if layer 5 is currently being computed, each layer takes 1 s to compute, and moving one layer's data from the external memory to the internal memory takes 2 s, then while layer 5 is being computed the data of at least the next 2 layers must already be in the internal memory. This guarantees that later computation never has to wait for data to be fetched from outside, optimizing memory use and improving processing efficiency.
Of course, in practical applications, if the chip's embedded memory is large enough, all model parameters of the neural network (e.g., weights and biases) can be kept in the embedded memory, and only the output of each layer is placed in the embedded memory or the external memory according to its state. When the embedded memory is not large enough to hold all the model parameters, the model parameters too can be placed in the embedded memory or the external memory according to their states.
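This fallback policy can be stated as one small placement rule: model parameters stay resident whenever they fit in the embedded memory, and otherwise follow the same state-based rule as layer outputs. A sketch, again with the hypothetical `State` tags from above:

```python
def place(kind, state, params_fit_internally):
    """Decide where a tensor should live; `kind` is 'param' or 'output'."""
    if kind == "param" and params_fit_internally:
        return "internal"  # embedded memory large enough for all parameters
    # Otherwise both parameters and outputs follow the state-based rule.
    return "internal" if state is State.SECOND else "external"
```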
The memory management method provided by the embodiments of the present disclosure manages the chip's memory in a way that matches the characteristics of the neural network training process, expanding the chip's memory while improving the efficiency of neural network processing. Moreover, since the chip itself can manage the memory, programmers do not need to handle the memory management introduced by attaching an external memory, reducing the inconvenience that expanding the chip's memory would otherwise bring.
Accordingly, an embodiment of the present disclosure provides an apparatus for memory management of a chip, as shown in fig. 7. The apparatus 70 is configured to perform memory management for a chip that includes at least one communication interface and an internal memory, the communication interface being used to connect at least one external memory. The apparatus 70 includes:
a state determining module 71, configured to determine, during training of the neural network by using the chip, a state of data related to the training of the neural network, where the state is related to a duration between a next time when the data is used and a current time;
a decision module 72, configured to determine, according to the state, whether to store the data stored in the internal memory to the external memory and/or whether to store the data stored in the external memory to the internal memory.
In some embodiments, the states include a first state and a second state;
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in the second state.
In some embodiments, the state determination module, when configured to determine the state of the data related to the neural network training, is specifically configured to:
determining the state of the data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the chip, and the current calculation direction in which the chip calculates the neural network.
In some embodiments, the state determining module is configured to, when determining the state of the data related to each layer according to the number of layers spaced between each layer in the neural network and a current layer currently processed by the chip and a current calculation direction of the chip for calculating the neural network, specifically:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining a state in which data relating to the target layer is in as the second state;
determining the state of the data relating to layers of the neural network other than the target layer as the first state.
In some embodiments, the predetermined number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
For specific implementation details of the apparatus for managing the chip memory, reference may be made to the description of the method above; they are not repeated here.
Further, an embodiment of the present disclosure also provides an electronic device. The electronic device includes a processor, a memory, and the chip of the foregoing embodiments, and/or the memory of the electronic device stores computer instructions executable by the processor, which, when executed by the processor, implement the memory management method of any one of the foregoing embodiments.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the memory management method according to any of the foregoing embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present specification may essentially be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present specification or in parts thereof.
The systems, apparatuses, modules, or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product having certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant points, reference may be made to the description of the method embodiment. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and when the embodiments of the present disclosure are implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the embodiment's solution. Those of ordinary skill in the art can understand and implement this without inventive effort.
The foregoing is only a specific implementation of the embodiments of the present disclosure. It should be noted that those skilled in the art may make several improvements and refinements without departing from the principles of the embodiments of the present disclosure, and such improvements and refinements shall also fall within the protection scope of the embodiments of the present disclosure.

Claims (22)

1. Chip, characterized in that it comprises a processing unit, an address translation unit and at least one communication interface, each of said communication interfaces being usable for connecting an external device, the type of said external device comprising an external memory and/or other chips,
the processing unit is configured to send an original address corresponding to data to be read or written, which is related to neural network training, to the address conversion unit, where the original address is a physical address in the other chip;
the address conversion unit is configured to convert the original address into a target address matching the type of the external device and output the target address, so that the processing unit reads or writes the data in the target address of the external device.
2. The chip of claim 1, wherein the address translation unit comprises a storage subunit;
the storage subunit is configured to store a mapping relationship between physical addresses in the other chips and physical addresses in the external memory;
the address conversion unit is configured to convert the original address into the target address based on the mapping relationship if it is determined that the communication interface is connected to the external memory.
3. The chip according to claim 1 or 2, wherein the address translation unit is further configured to directly determine the original address as the target address if it is determined that the communication interface is connected to the other chip.
4. The chip according to any one of claims 1 to 3, wherein the physical addresses of the storage units of the other chips are divided into a plurality of address partitions in advance, and the physical addresses in each address partition can be mapped to the physical addresses in the external memory through a preset mapping relationship;
and the address conversion unit is used for determining the address partition where the original address is located and determining the target address based on the address partition where the original address is located and the mapping relation.
5. The chip of claim 4, wherein the address translation unit includes a plurality of sets of first registers, each set of the first registers being configured to store indication information indicating an address range of one of the address partitions;
and the address conversion unit is used for determining the address partition where the original address is located according to the indication information.
6. The chip of claim 5, wherein the address translation unit further comprises a plurality of sets of second registers, each set of the second registers being configured to store current enable status information of one of the address partitions;
the address translation unit is configured to determine that the communication interface is connected to the external memory based on:
determining a target address partition from the plurality of address partitions based on the enable state information;
determining that the communication interface is connected with the external memory in the case that the original address is determined to fall within the address range of the target address partition according to the indication information.
7. The chip of any of claims 1-6, wherein the external memory comprises a DDR, the DDR and the chip communicate over a bus, the bus comprising a PCIE bus.
8. The chip of any one of claims 1-7, wherein the processing unit is further configured to:
determining a state in which data relating to the neural network training is located, the state relating to a time interval between a time at which the data is next used and a current time;
determining whether to store the data stored in an internal memory of the chip to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
9. The chip of claim 8, wherein the processing unit is further configured to: in response to determining to store the data stored in the internal memory of the chip to the external memory according to the state, sending the original address corresponding to the data to be stored in the external memory to the address conversion unit, so that the address conversion unit converts the original address into the target address in the external memory; and/or
In response to determining to store the data stored in the external memory to the internal memory according to the status, sending the original address corresponding to the data to be stored to the internal memory to the address translation unit, so that the address translation unit translates the original address to the target address in the external memory.
10. The chip of claim 9, wherein the states comprise a first state and a second state; the processing unit is configured to:
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in a second state.
11. The chip according to claim 9 or 10, wherein the processing unit is configured to determine the state of the data based on:
determining the state of data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the chip, and the current calculation direction of the chip on the neural network.
12. The chip of claim 11, wherein the processing unit is configured to:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining a state in which data relating to the target layer is in as the second state;
determining the state of the data relating to layers of the neural network other than the target layer as the first state.
13. The chip of claim 12, wherein the preset number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
14. A neural network training system comprising a host processor and a plurality of chips as claimed in any one of claims 1 to 13, the plurality of chips being connected via the communications interface;
and the plurality of chips are used for receiving the instruction of the main processor and carrying out accelerated training on the neural network based on the instruction.
15. A method for managing a memory of a chip, wherein the chip includes at least one communication interface and an internal storage, and the communication interface is used for connecting to at least one external storage, the method comprising:
determining the state of data related to the neural network training in the process of training the neural network by using the chip, wherein the state is related to the duration between the time when the data is used next time and the current time;
determining whether to store the data stored in the internal memory to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
16. The method of claim 15, further comprising:
in response to determining to store the data stored in the internal memory of the chip to the external memory according to the state, converting an original address corresponding to the data to be stored in the external memory into a target address in the external memory, so as to write the data to be stored in the external memory at the target address of the external memory; and/or
In response to determining to store the data stored in the external memory to the internal memory according to the state, converting the original address corresponding to the data to be stored to the internal memory to a target address in the external memory to read the data to be stored to the internal memory from the target address of the external memory.
17. The method of claim 15 or 16, wherein the states comprise a first state and a second state;
determining to store the data stored in the internal memory to the external memory if the data stored in the internal memory is in the first state;
determining to store the data in the external memory to the internal memory if the data stored in the external memory is in the second state.
18. The method of any one of claims 15-17, wherein determining the state in which the data related to the neural network training is located comprises:
determining the state of data related to each layer according to the number of layers spaced between each layer in the neural network and the current layer currently processed by the chip, and the current calculation direction of the chip on the neural network.
19. The method of claim 18, wherein determining the state of the data associated with each layer according to the number of layers in the neural network that are spaced from the current layer currently processed by the chip and the current calculation direction of the chip on the neural network comprises:
determining a target layer from the neural network based on the calculation direction and the number of layers, wherein the target layer participates in calculation after the current layer, and the number of layers spaced from the current layer is smaller than a preset number of layers;
determining a state in which data relating to the target layer is in as the second state;
determining the state of the data relating to layers of the neural network other than the target layer as the first state.
20. The method of claim 19, wherein the preset number of layers is determined based on a time required for the chip to perform calculation on each layer of the neural network and a transmission time of the data between the internal memory and the external memory.
21. A memory management device of a chip, wherein the device is configured to perform memory management on the chip, the chip includes at least one communication interface and an internal storage, the communication interface is configured to connect to at least one external storage, and the device includes:
the state determining module is used for determining the state of data related to the neural network training in the process of training the neural network by using the chip, wherein the state is related to the time length between the next time of using the data and the current time;
a determination module that determines whether to store the data stored in the internal memory to the external memory and/or whether to store the data in the external memory to the internal memory according to the state.
22. An electronic device, comprising a processor, a memory, and the chip of any one of claims 1 to 13, and/or wherein the memory stores computer instructions executable by the processor, and the processor, when executing the computer instructions, implements the memory management method of any one of claims 15 to 20.