CN117474067A - Neural network training method and related equipment - Google Patents

Neural network training method and related equipment

Info

Publication number
CN117474067A
Authority
CN
China
Prior art keywords
neural network
training
calculation
graph
compiled code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211391730.XA
Other languages
Chinese (zh)
Inventor
龙国平
钟苑丰
李立英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to PCT/CN2023/099689 (published as WO2024016894A1)
Publication of CN117474067A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/084 — Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose a neural network training method and related devices, which can be used in neural network training scenarios in the field of artificial intelligence. The method includes the following steps: when the Nth round of training of the neural network is performed, a first computational graph may be obtained, where the first computational graph is one of one or more computational graphs corresponding to the Nth round of training; after it is determined that first compiled code corresponding to the first computational graph is already stored in the system, the first compiled code may be executed directly, the first compiled code having been generated while the Mth round of training of the neural network was performed, where M is less than N. Because the operations of converting the first computational graph into an intermediate representation and obtaining compiled code from that intermediate representation no longer need to be performed, computer-resource overhead is saved.

Description

Neural network training method and related equipment
The present application claims priority to Chinese Patent Application No. 202210871003.7, entitled "A data processing method and related apparatus", filed with the Chinese Patent Office on the 22nd of [month] 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The application relates to the field of artificial intelligence, in particular to a training method of a neural network and related equipment.
Background
A computational graph (computational graph) is a general way of representing a computational process as a directed acyclic graph describing a function, and is widely used across data processing platforms. In the field of artificial intelligence (Artificial Intelligence, AI), if a neural network requires iterative training, each round of training of the neural network can be converted into a computational graph, compiled code corresponding to that computational graph can be obtained, and the compiled code can be executed to carry out that round of training.
Specifically, in each round of training of the neural network, after the computational graph corresponding to that round is obtained, an expression conversion (trace) may be performed on the whole computational graph to obtain an intermediate representation (intermediate representation, IR) corresponding to it; the intermediate representation may also be regarded as a logical description of the computational graph. A compiling operation is then performed on the intermediate representation to obtain compiled code corresponding to the computational graph.
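The trace-then-compile pipeline described above can be sketched as follows. This is a minimal illustration in Python; the function names (`trace`, `compile_ir`) and the graph encoding are assumptions for exposition, not the patent's actual implementation:

```python
# Minimal sketch of the per-round pipeline described above:
# computational graph -> intermediate representation (IR) -> compiled code.
# All names and encodings here are illustrative, not from the patent.

def trace(graph):
    """Expression-convert the whole graph into an intermediate representation."""
    # The IR here is just a canonical string listing each operator in order.
    return ";".join(f"{op}({','.join(ins)})" for op, ins in graph)

def compile_ir(ir):
    """Compile the IR into executable code (stubbed as a Python closure)."""
    ops = ir.split(";")
    def compiled(values):
        # A real compiler would emit device code; the stub just reports
        # how many operators it would execute and echoes its inputs.
        return {"executed_ops": len(ops), "inputs": values}
    return compiled

# One training step expressed as a graph of (operator, input names) pairs.
graph = [("matmul", ["x", "w"]), ("add", ["h", "b"]), ("relu", ["h"])]
code = compile_ir(trace(graph))
result = code({"x": 1.0, "w": 0.5, "b": 0.1})
```

In the background scheme, both `trace` and `compile_ir` run on every round, which is precisely the overhead at issue.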
However, in each round of training of the neural network, the computational graph must be converted into an intermediate representation and compiled code must then be obtained from that intermediate representation, which incurs computer-resource overhead.
Disclosure of Invention
Embodiments of the present application provide a neural network training method and related devices. When the Nth round of training of the first neural network is performed, because the first compiled code corresponding to the first computational graph was generated when the Mth round of training was performed, it can be determined that the first compiled code is already stored in the system, and the first compiled code can be executed directly. The operations of converting the first computational graph into an intermediate representation and obtaining the first compiled code from that representation no longer need to be performed, saving computer-resource overhead.
To solve the above technical problem, the embodiments of the present application provide the following technical solutions:
In a first aspect, an embodiment of the present application provides a neural network training method, which can be used in neural network training scenarios in the field of artificial intelligence. The method includes: after obtaining a first computational graph, a first communication device may determine that first compiled code corresponding to the first computational graph is already stored in the system, and execute the first compiled code, where the first compiled code was generated while the Mth round of training of the first neural network was performed, N and M are positive integers, and M is less than N. One or more computational graphs correspond to the Nth round of training of the first neural network. A computational graph is a graphical representation of a computational process: the one or more computational graphs corresponding to the Nth round graphically represent the computational process of that round, and executing them can be understood as performing the Nth round of training of the first neural network. The first computational graph is one of these computational graphs and graphically represents the computational process of at least one first step in the Nth round of training of the first neural network. Illustratively, the one or more first steps corresponding to the first computational graph may include: computing a loss function value, back-propagating to generate gradient values, or updating weight parameters of the first neural network, and so on. The first communication device may be a cloud device or a terminal device.
In this implementation, when the Nth round of training of the first neural network is performed, after the first computational graph is obtained, because the first compiled code corresponding to it was already generated during the Mth round, it can be determined that the first compiled code is stored in the system and the code can be executed directly. That is, in the Nth round of training, the operations of converting the first computational graph into an intermediate representation and obtaining the first compiled code from that representation do not need to be performed, which saves computer resources.
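The round-by-round reuse described here can be sketched as a compile cache: compile once in round M, then execute the stored code directly in every later round N. The cache key and structure below are assumptions for illustration:

```python
# Sketch: compile a graph only on first encounter; later rounds reuse the
# stored compiled code directly. Cache layout is illustrative.
compiled_cache = {}   # stands in for the system's stored compiled code
compile_count = 0     # counts how often the expensive path runs

def get_compiled(graph_key):
    global compile_count
    if graph_key not in compiled_cache:            # round M: first encounter
        compile_count += 1                         # expensive trace + compile
        compiled_cache[graph_key] = lambda: f"run {graph_key}"
    return compiled_cache[graph_key]               # round N > M: direct reuse

for round_n in range(5):                           # five training rounds
    step = get_compiled("forward_backward_step")
    step()

# Despite five rounds, the graph was traced and compiled exactly once.
```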
In a possible implementation of the first aspect, the first communication device executing the first compiled code includes: the first communication device may obtain a first mapping relationship from the system, where the first mapping relationship indicates where the values of the input parameters of the first computational graph are to be fetched from. Optionally, the input parameters of the first computational graph may include weight parameters and non-training parameters, and the first mapping relationship may indicate the fetch locations of the values of both. The first communication device determines the values of the input parameters of the first computational graph in the Nth round of training according to the first mapping relationship, and executes the first compiled code with those values. It should be noted that determining the values of the input parameters and executing the first compiled code may be interleaved; for example, while the first compiled code is being executed, the value of at least one input parameter of the first computational graph may be determined, after which execution continues.
In this implementation, the system may further store a first mapping relationship indicating where the input parameters of the first computational graph are obtained, so that during execution of the first compiled code the values of those parameters can be determined directly from the first mapping relationship. This helps speed up obtaining the input-parameter values and, in turn, the training of the first neural network.
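The first mapping relationship can be pictured as a small lookup table from input-parameter name to the place its current value is fetched from. The table layout below is an assumption for illustration only:

```python
# Sketch: a "first mapping relationship" from input-parameter name to the
# location its value is fetched from. Structure is illustrative.
param_store = {"w0": 0.42, "running_mean": 0.0}   # weight + non-training param

first_mapping = {
    "w0": ("param_store", "w0"),                  # (which store, which key)
    "running_mean": ("param_store", "running_mean"),
}

def fetch(name):
    """Resolve an input parameter's value via the mapping relationship."""
    store, key = first_mapping[name]
    # Only one store exists in this sketch; a real system could have several.
    return {"param_store": param_store}[store][key]
```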
In a possible implementation of the first aspect, before the first communication device obtains the first mapping relationship, the method may further include: if the first mapping relationship does not exist in the system, the first communication device may establish it and store it in the system. The system in which the first communication device is located includes storage accessible to the first communication device; this includes memory accessible to the device and may also include other storage accessible to it. In this implementation, when the first mapping relationship does not exist in the system and therefore cannot be obtained directly, it can be established, which ensures the feasibility of the solution under various conditions and improves its completeness.
In a possible implementation of the first aspect, the first computational graph is a reusable computational graph. If the first computational graph were not reused, the first compiled code corresponding to it would not be reused either, and storing that code would waste the system's storage resources. Restricting the first computational graph to reusable graphs, so that only compiled code for reusable graphs is stored in the system, helps improve the utilization of the system's storage resources.
In a possible implementation of the first aspect, the first communication device determining that the first compiled code corresponding to the first computational graph is stored in the system may include: the first communication device performs expression conversion on the first computational graph to obtain the intermediate representation (IR) corresponding to it, and determines from that IR that the first compiled code is stored in the system; alternatively, the first communication device may determine from the IR that the first compiled code is already stored in memory included in the system. Determining whether the first compiled code is stored in the system according to the IR allows this to be determined accurately, so the first compiled code can be obtained from the system successfully, improving the smoothness of executing the first computational graph.
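One way to realize an IR-based membership check is to key the stored code on a digest of the IR text. The hashing scheme below is an illustrative assumption; the text above only says the determination is made according to the IR:

```python
import hashlib

# Sketch: decide whether compiled code for a graph is already stored by
# keying the code store on a digest of the graph's IR. The keying scheme
# (a SHA-256 of the IR text) is an illustrative assumption.
code_store = {}

def ir_key(ir_text):
    return hashlib.sha256(ir_text.encode()).hexdigest()

def has_compiled(ir_text):
    return ir_key(ir_text) in code_store

ir = "matmul;add;relu"
missed_first = not has_compiled(ir)        # round M: miss -> compile and store
code_store[ir_key(ir)] = "<compiled object>"
hit_later = has_compiled(ir)               # round N: hit -> execute directly
```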
In a possible implementation of the first aspect, the first computational graph may also have been obtained by the first communication device when the Mth round of training of the first neural network was performed, and the method may further include: after obtaining the first computational graph, the first communication device generates the first compiled code from it and stores the first compiled code in the system. Storing the first compiled code during the Mth round means it can be obtained directly from the system during the Nth round, improving the smoothness of the solution.
In a possible implementation of the first aspect, the first communication device determining that the first compiled code corresponding to the first computational graph is stored in the system includes: if the first communication device determines that the first mapping relationship is stored in the system, it may conclude that the first compiled code corresponding to the first computational graph is already stored, where the first mapping relationship indicates where the input parameters of the first computational graph are obtained. In this implementation, the first communication device generates the first compiled code in the first round of training performed after the first computational graph is determined to be reusable, while the first mapping relationship is established in the second and subsequent such rounds. Therefore, if the first mapping relationship is already stored in the system, the step of generating the first compiled code with the compiler has, with high probability, already been performed, and the first compiled code can be obtained from the system directly. Using the existence of the first mapping relationship as the basis for judging whether the first compiled code exists means there is no longer any need to generate the intermediate representation of the first computational graph and query for the compiled code according to it.
In a possible implementation of the first aspect, the first computational graph corresponds to a first step in the Nth round of training of the first neural network, and after the first communication device executes the first compiled code, the method further includes: the first communication device generates first output data that uses a first data structure, where the first output data includes at least one item of input data of a second step in the training of the first neural network (the second step may also be called a downstream task of the first step), the first data structure is the data structure used when the second step is executed, and the training of the first neural network includes the Nth round. For example, the first output data may be represented as tensor data, and the first communication device may generate, according to a first data structure of the tensor, first output data that the downstream task can understand. The first data structure of the tensor may describe the definition of the data members in tensor form used when the second step of the Nth round is executed, the arrangement of the first output data in memory, or the memory-alignment form used when storing the first output data in memory, among other things not exhausted here. Illustratively, the definition of the data members in tensor form may include the data type of each data member (for example, a 32-bit floating-point number, float32, and a 16-bit integer, int16, are different data types), the size of the tensor corresponding to each data member, and other information defining each data member.
The arrangement of the data in memory may include the storage structure the tensor-form output data uses in memory, such as a queue, a stack, a linked list, or another structure; these examples are given only to aid understanding of the tensor data structure and do not limit the solution.
In this implementation, the first data structure used by the downstream task of the first step can also be obtained, and when the first compiled code corresponding to the first computational graph is executed, output data in that first data structure is generated. The downstream task does not need to convert the data structure of the first output data when accessing it, which avoids the computer-resource overhead of converting the same data between different data structures.
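The "first data structure" can be pictured as a tensor descriptor that the producer and the downstream consumer agree on; when the two descriptors match, no conversion is needed. The field names below are illustrative, not the patent's actual definition:

```python
from dataclasses import dataclass

# Sketch: a tensor descriptor standing in for the "first data structure".
# Field names are illustrative, not taken from the patent.
@dataclass(frozen=True)
class TensorSpec:
    dtype: str        # e.g. "float32" and "int16" are different data types
    shape: tuple      # size of the tensor for this data member
    layout: str       # arrangement in memory, e.g. "row_major"
    alignment: int    # memory-alignment in bytes used when storing

producer_spec = TensorSpec("float32", (64, 128), "row_major", 64)
consumer_spec = TensorSpec("float32", (64, 128), "row_major", 64)

# When the producer emits data already in the consumer's structure,
# the downstream step can read it without any conversion.
needs_conversion = producer_spec != consumer_spec
```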
In a possible implementation of the first aspect, the first computational graph corresponds to a first step in the Nth round of training of the first neural network, and the first communication device executing the first compiled code may include: the first communication device obtains at least one item of input data of the first computational graph according to the format of a second data structure, where that input data exists in second output data of a third step in the training of the first neural network, and the second data structure is the data structure used when the third step is executed. For example, at least one item of input data of the first computational graph may be represented as a tensor, and the first communication device may interpret the second output data stored in memory according to the second data structure of the tensor; the second data structure may describe the definition of the data members in tensor form, the arrangement of the second output data in memory, or the memory-alignment form used when storing it, among other kinds of information not exhausted here.
In this implementation, the second data structure used by the upstream task of the first step can also be obtained, and at least one item of input data of the first computational graph can be read from the upstream task's output data according to the format of that structure, avoiding the computer-resource overhead of converting the same data between different data structures.
In a possible implementation of the first aspect, the read location of at least one item of input data of the first computational graph is consistent with the storage location of the second output data. Optionally, a shared-pointer technique may be used to make the read location of that input data in memory coincide with the storage location of the second output data in memory. After reading the input data, the first communication device does not modify the second output data while executing the first compiled code corresponding to the first computational graph. In this implementation, because the read location of the input data is consistent with the storage location of the second output data, copying of the same data between different storage locations is avoided, further reducing computer-resource overhead.
In a possible implementation of the first aspect, the storage location of the first output data is consistent with the read location of at least one item of input data of the second step. Optionally, a shared-pointer technique may be used to make these locations coincide. It should be noted that after the first communication device performs the write operation of the first output data, ownership of the first output data is transferred to the downstream task. In this implementation, because the storage location of the first output data is consistent with the read location of the second step's input data, copying of the same data between different storage locations is avoided, further reducing computer-resource overhead.
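The zero-copy idea — the consumer reads directly from the producer's storage rather than from a copy — can be illustrated with Python's buffer sharing standing in for a shared pointer:

```python
import array

# Sketch: the first computational graph reads its input directly from the
# upstream step's output buffer; no bytes are copied. Python's memoryview
# stands in here for the shared-pointer technique mentioned above.
second_output = array.array("f", [0.1, 0.2, 0.3])  # upstream step's output
graph_input = memoryview(second_output)            # read from the same storage

aliases = graph_input.obj is second_output         # same underlying buffer?
first_value = graph_input[0]                       # read without copying
# Per the scheme above, the reader must not modify the shared data while
# the compiled code for the first computational graph is executing.
```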
In a possible implementation of the first aspect, the method further includes: the first communication device sends the first output data by calling a preset interface, where the second step in the training of the first neural network includes sending the first output data, the first data structure is the data structure used when the sending operation is performed, and the preset interface may be an interface of a gradient communication library provided by a third party.
This implementation provides a specific example of a downstream task of the first step: communication of the first output data is achieved by calling a preset interface, which is convenient and fast. Because the first output data is generated directly in the first data structure, conversion of the first output data between data structures is avoided, improving the efficiency of sending it.
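As a sketch of this sending step, the example below uses a stand-in `preset_send` function; the actual preset interface would belong to a third-party gradient communication library and is not specified in the text, so the name and signature here are hypothetical:

```python
# Sketch: the second step sends the first output data by calling a preset
# interface. `preset_send` is a hypothetical stand-in, not a real library API.
sent = []

def preset_send(tensor):
    """Stand-in for a third-party gradient-communication-library interface."""
    sent.append(tensor)

# The first output data is generated directly in the data structure the
# sending operation expects, so no conversion happens before the call.
first_output = [0.5, -0.5]
preset_send(first_output)
```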
In a second aspect, an embodiment of the present application provides a neural network training apparatus, which can be used in neural network training scenarios in the field of artificial intelligence. The apparatus includes an acquisition module, a determination module, and an execution module. The acquisition module is configured to obtain a first computational graph, the first computational graph being one of one or more computational graphs corresponding to the Nth round of training of the neural network, N being a positive integer. The determination module is configured to determine that first compiled code corresponding to the first computational graph is stored in the system, the first compiled code having been generated during the Mth round of training of the neural network, M being a positive integer less than N. The execution module is configured to execute the first compiled code.
In the second aspect of the present application, the neural network training apparatus may also perform the steps performed by the first communication device in the first aspect and its possible implementations. For the specific implementations, meanings of terms, and beneficial effects of the steps in each possible implementation of the second aspect, refer to the first aspect; details are not repeated here.
In a third aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, which when run on a computer, causes the computer to perform the method for training a neural network according to the first aspect.
In a fourth aspect, embodiments of the present application provide a communication device including a processor and a memory, where the processor is coupled to the memory, the memory is configured to store a program, and the processor is configured to execute the program in the memory, causing the communication device to perform the neural network training method of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, the computer program product comprising a program which, when run on a computer, causes the computer to perform the method for training a neural network according to the first aspect.
In a sixth aspect, the present application provides a chip system including a processor, configured to support a terminal device or a communication device in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the terminal device or the communication device. The chip system may consist of chips, or may include chips and other discrete devices.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence main framework according to an embodiment of the present application;
FIG. 2a is a system architecture diagram of a neural network training system according to an embodiment of the present application;
FIG. 2b is another system architecture diagram of a neural network training system according to an embodiment of the present application;
FIG. 2c is a schematic flowchart of a neural network training method according to an embodiment of the present application;
FIG. 3 is another schematic flowchart of a neural network training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a first computational graph according to an embodiment of the present application;
FIG. 5 is another schematic diagram of a first computational graph according to an embodiment of the present application;
FIG. 6 is a further schematic diagram of a first computational graph according to an embodiment of the present application;
FIG. 7 is a further schematic diagram of a first computational graph according to an embodiment of the present application;
FIG. 8 is a further schematic diagram of a first computational graph according to an embodiment of the present application;
FIG. 9 is a schematic diagram of input parameters in a first computational graph according to an embodiment of the present application;
FIG. 10 is a schematic flowchart of sending first output data according to an embodiment of the present application;
FIG. 11 is a schematic flowchart of a neural network training method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present application;
FIG. 13 is another schematic structural diagram of a neural network training apparatus according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a communication device according to an embodiment of the present application;
FIG. 15 is another schematic structural diagram of a communication device according to an embodiment of the present application;
FIG. 16 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. As those of ordinary skill in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish between similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms so used are interchangeable under appropriate circumstances; they merely distinguish objects of the same nature in describing the embodiments of the application. Furthermore, the terms "comprise", "include", and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus.
Referring to fig. 1, fig. 1 shows a schematic structural diagram of an artificial intelligence main framework, which is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up to the industrial ecology of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, enables communication with the outside world, and is supported by a base platform. Communication with the outside is performed through sensors; computing power is provided by smart chips, which may specifically be hardware acceleration chips such as a central processing unit (central processing unit, CPU), an embedded neural network processor (neural-network processing unit, NPU), a graphics processor (graphics processing unit, GPU), an application specific integrated circuit (application specific integrated circuit, ASIC), or a field programmable gate array (field programmable gate array, FPGA); the base platform includes a distributed computing framework, networks, and other related platform guarantees and support, and may include cloud storage and computing, interconnection and interworking networks, and the like. For example, a sensor communicates with the outside to obtain data, and the data is provided for computation to the smart chips in the distributed computing system provided by the base platform.
(2) Data
The data at the layer above the infrastructure is used to represent data sources in the field of artificial intelligence. The data relates to graphics, images, speech, and text, and also relates to internet-of-things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, carrying out machine thinking and problem solving with formalized information according to a reasoning control strategy; typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information has been reasoned about, and generally provides functions such as classification, ranking, and prediction.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of an artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision making and realizing its practical deployment. The application fields mainly include: intelligent terminals, intelligent manufacturing, intelligent transportation, intelligent home, intelligent healthcare, intelligent security, automated driving, smart cities, and the like.
The neural network training method and device can be applied to the process of training the neural network, and the neural network can be any neural network in the application field of an artificial intelligence system. Before describing the training method of the neural network provided in the embodiments of the present application, please refer to fig. 2a and fig. 2b, and fig. 2a and fig. 2b are two system architecture diagrams of the training system of the neural network provided in the embodiments of the present application.
In an application scenario, referring first to fig. 2a, a training system 200 of a neural network may include a cloud device 210, a database 220, a terminal device 230, and a data storage system 240, where the terminal device 230 includes a computing module 231, and in fig. 2a, a training operation of the first machine learning model/rule 201 is performed by the cloud device 210.
The cloud device 210 may be implemented by one or more servers, a training sample is stored in the database 220, the cloud device 210 generates a first machine learning model/rule 201, and iteratively trains the first machine learning model/rule 201 by using the training sample to obtain a trained first machine learning model/rule 201. The first machine learning model/rule 201 may be embodied as a neural network or may be embodied as a model other than a neural network, and in the embodiment of the present application, only the first machine learning model/rule 201 is described as an example of the first neural network.
The cloud device 210 configures the trained first machine learning model/rule 201 in the computing module 231 of the terminal device 230, for example, the terminal device 230 may be a mobile phone, a tablet, a notebook, a VR device, a monitoring system, a data processing system of a radar, or the like. The terminal device 230 may call data, codes, etc. in the data storage system 240, or may store data, instructions, etc. in the data storage system 240. The data storage system 240 may be disposed in the terminal device 230, or the data storage system 240 may be an external memory with respect to the terminal device 230. The first machine learning model/rule 201 in the terminal device 230 is used for processing the input data to obtain prediction information corresponding to the input data.
In another application scenario, referring to fig. 2b, the training system 200 of the neural network may include a cloud device 210, a database 220, a terminal device 230, and a data storage system 240, where the terminal device 230 includes a computing module 231, and in fig. 2b, the training operation of the first machine learning model/rule 201 is jointly performed by the cloud device 210 and the plurality of terminal devices 230.
The data storage system 240 may store a training data set, and each terminal device 230 may perform iterative training on the first machine learning model/rule 201 according to the training samples in the data storage system 240, to obtain first gradient values corresponding to the weight parameters in the first machine learning model/rule 201. In one implementation manner, each terminal device 230 may send the foregoing first gradient values to the cloud device 210; the cloud device 210 aggregates the first gradient values uploaded by the plurality of terminal devices 230 to obtain second gradient values corresponding to the weight parameters in the first machine learning model/rule 201, and sends the second gradient values to each terminal device 230; each terminal device 230 then updates the weight parameters in the first machine learning model/rule 201 according to the second gradient values, thereby implementing iterative training of the first machine learning model/rule 201. It should be noted that the training of the first machine learning model/rule 201 may also be implemented in other manners; fig. 2a and fig. 2b above are only two examples for facilitating understanding of the present solution and are not intended to limit it.
Based on the above description, the present application provides a training method for a neural network, which may be applied to the process in which the cloud device 210 trains the first machine learning model/rule 201 with a training data set, or to the process in which the terminal device 230 trains the first machine learning model/rule 201 with a training data set. Specifically, referring to fig. 2c, fig. 2c is a schematic flow chart of a neural network training method according to an embodiment of the present application. A1, when performing an Nth round of training on a neural network (hereinafter referred to as the "first neural network" for convenience of description), the first communication device may acquire a first calculation map, where the first calculation map is one of one or more calculation maps corresponding to the Nth round of training of the first neural network, N being a positive integer. A calculation map is a graphical representation of a calculation process: the one or more calculation maps corresponding to the Nth round of training graphically represent the calculation processes in that round of training, and executing them can be understood as executing the Nth round of training of the first neural network. Since the first calculation map is one of these calculation maps, the meaning of "the first calculation map" follows from the explanation of "the one or more calculation maps corresponding to the Nth round of training of the first neural network".
A2, after acquiring the first calculation map, the first communication device may determine that first compiled code corresponding to the first calculation map is stored in the system, where the first compiled code was generated during an Mth round of training of the neural network, M being a positive integer less than N. A3, the first communication device executes the first compiled code.
In the embodiment of the application, when the Nth round of training of the first neural network is executed, the operations of converting the first calculation graph into an intermediate representation and generating compiled code from that intermediate representation do not need to be performed again, so the cost of computer resources is saved.
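The compiled-code reuse in steps A1 to A3 can be sketched as a cache keyed by a signature of the computation graph. This is a minimal illustration under stated assumptions, not the patented implementation: `graph_signature`, the tuple-of-ops graph model, and `compile_graph` are all hypothetical stand-ins.

```python
# Hypothetical sketch of the compiled-code reuse in steps A1 to A3: in
# round M the compiled code for a computation graph is generated and
# cached; in a later round N the cache is hit and compilation is skipped.
compiled_cache = {}            # graph signature -> compiled code (a callable)
compile_calls = []             # records each (expensive) compilation

def graph_signature(graph):
    # A graph is modeled here as a tuple of op names; a real system would
    # hash the full graph structure, shapes, and attributes.
    return hash(graph)

def compile_graph(graph):
    compile_calls.append(graph)
    n_ops = len(graph)
    return lambda: f"ran {n_ops} ops"

def execute_graph(graph):
    sig = graph_signature(graph)
    if sig not in compiled_cache:            # round M: compile once and store
        compiled_cache[sig] = compile_graph(graph)
    return compiled_cache[sig]()             # round N: reuse directly

g = ("matmul", "add", "relu")
out_round_m = execute_graph(g)   # compiles
out_round_n = execute_graph(g)   # cache hit, no recompilation
```

Running the same graph twice triggers only a single compilation, which is the resource saving the paragraph above describes.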
In conjunction with the foregoing description, a specific implementation flow of the neural network training method provided in the embodiment of the present application is described below, and since the step of "training the first neural network according to training data" may be performed by the cloud device 210 or may be performed by the terminal device 230, the following two cases are described separately.
1. Performing, by the cloud device, a training operation on the first neural network
In this embodiment of the present application, referring specifically to fig. 3, fig. 3 is another flow chart of a neural network training method provided in this embodiment of the present application, where the neural network training method provided in this embodiment of the present application may include:
301. a first computational graph is obtained, the first computational graph being one of one or more computational graphs corresponding to an nth round of training of the neural network.
In this embodiment of the present application, when performing the nth round of training of the first neural network, the first communication device may acquire a first calculation map, where the first calculation map corresponds to the nth round of training of the first neural network, and the one or more calculation maps include the first calculation map, that is, the first calculation map is one of the one or more calculation maps corresponding to the nth round of training of the neural network, where N is a positive integer; for example, the first computational graph corresponds to at least one first step in an nth round of training operations of the first neural network. The first communication device may be a processor in the cloud device, for example, the first communication device may be a neural network processor in the cloud device; for another example, the first communication device may be a graphics processor in a cloud device; for another example, the first communication device may be a central processor in the cloud device, and the like, and may specifically be flexibly determined in combination with an actual application scenario, which is not limited herein.
One round of training operations on the first neural network may include one or more training operations on the first neural network; it may include multiple training operations performed on the first neural network with one batch or multiple batches of training samples, where each batch of training samples includes multiple training samples.
Optionally, the one or more calculation maps graphically represent the calculation process corresponding to the Nth round of training of the first neural network, and executing them may be understood as executing the Nth round of training of the first neural network; the first computational graph graphically represents one or more first steps in the Nth round of training operations of the first neural network, and executing the first computational graph may be understood as implementing those one or more first steps.
Further, in one case, the first computational graph graphically represents all steps in an Nth round of training operations of the first neural network. For a more visual understanding of the present solution, please refer to fig. 4, which is a schematic diagram of a first calculation map provided in an embodiment of the present application. As shown in fig. 4, the training system of the first neural network includes 1 CPU and 1 NPU, all steps in each round of training operation of the first neural network are performed by the NPU, and the one or more first steps corresponding to the first calculation map executed by the NPU may include: calculating the loss function value, back-propagating to generate gradient values, and updating the weight parameters in the first neural network (the weight parameters may also be referred to as the training parameters of the first neural network). Each round of training operations on the first neural network in fig. 4 may include performing one training operation on the first neural network, or may include performing multiple training operations on the first neural network with one batch of training samples. It should be understood that the example in fig. 4 is merely for facilitating understanding of the present solution and is not intended to limit it.
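The three first steps named for fig. 4 (computing the loss function value, back-propagating to generate gradient values, and updating the weight parameters) can be sketched for a one-parameter model. The model, loss, and learning rate below are illustrative assumptions only.

```python
# One-parameter training step covering the three first steps of fig. 4.
# The model (pred = w * x), loss, and learning rate are illustrative.
def train_step(w, x, y, lr=0.1):
    pred = w * x                 # forward pass
    loss = (pred - y) ** 2       # loss function value
    grad = 2.0 * (pred - y) * x  # back-propagation (analytic gradient)
    w = w - lr * grad            # weight-parameter update
    return w, loss

w, loss = 0.0, None
for _ in range(50):              # repeated training operations in one round
    w, loss = train_step(w, x=1.0, y=2.0)
```

After the loop the weight converges toward the value 2.0 that fits the single training sample, showing all three steps acting together.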
In another case, the plurality of computation graphs corresponds to an nth round of training of the first neural network, the first computation graph being one of the plurality of computation graphs, i.e., the first computation graph graphically represents a portion of the steps in the nth round of training operation of the first neural network. Specifically, after the first communication device or other communication devices other than the first communication device acquire the second calculation map corresponding to the nth round of training of the first neural network, the first communication device may acquire a plurality of calculation maps corresponding to the nth round of training of the first neural network; the second calculation map corresponding to the training operation of the first neural network represents all steps in the nth training operation of the first neural network in a graphical manner, each calculation map of the plurality of calculation maps corresponding to the nth training of the first neural network is a subgraph of the second calculation map, and then the first calculation map is also a subgraph of the second calculation map, that is, the first calculation map represents part of the steps in the nth training operation of the first neural network in a graphical manner.
For a more intuitive understanding of the present solution, please refer to fig. 5 to 8, which are various schematic diagrams of the first calculation map provided in the embodiment of the present application. Referring to fig. 5, in fig. 5 the training system of the first neural network includes 1 CPU and 1 NPU, and the example of fig. 5 may include three calculation maps, where the first calculation map may be any one of the three. When executing the first calculation map, the NPU generates the function value of the loss function of the first neural network and calculates, according to that function value, the gradient values corresponding to the weight parameters of the first neural network; the CPU judges whether the gradient values of the weight parameters of the first neural network overflow; if the judgment result is yes, the NPU is triggered to execute the second calculation map, which indicates scaling the gradient values of the weight parameters of the first neural network; if the judgment result is no, the NPU is triggered to execute the third calculation map, which indicates updating the weight parameters in the first neural network. It should be noted that, in fig. 5, the first, second, and third calculation maps are all executed by the same NPU; in other application scenarios they may be executed by different NPUs. The example in fig. 5 is merely for convenience of understanding the present solution and is not intended to limit it.
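The control flow of fig. 5 (the NPU computes a scaled gradient, the CPU checks it for overflow, and then either the second or the third calculation map runs) can be sketched as follows. The loss-scaling scheme, the toy model, and all numeric values are illustrative assumptions, not the patent's implementation.

```python
import math

# Sketch of the fig. 5 control flow under a hypothetical loss-scaling scheme.
LOSS_SCALE = [2.0 ** 16]        # mutable loss scale used in mixed precision

def first_graph(w, x, y):
    # first calculation map: loss function value and (scaled) gradient on the NPU
    pred = w * x
    grad = 2.0 * (pred - y) * x * LOSS_SCALE[0]
    return grad

def overflows(grad):
    # CPU-side judgment on the gradient value
    return math.isinf(grad) or math.isnan(grad)

def second_graph():
    # overflow detected: shrink the loss scale, skip the weight update
    LOSS_SCALE[0] /= 2.0

def third_graph(w, grad, lr=0.1):
    # no overflow: unscale the gradient and update the weight parameter
    return w - lr * (grad / LOSS_SCALE[0])

w = 0.0
grad = first_graph(w, x=1.0, y=2.0)
if overflows(grad):
    second_graph()
else:
    w = third_graph(w, grad)
```

In this run the gradient is finite, so the third graph executes and the loss scale is left unchanged; an infinite or NaN gradient would instead trigger the second graph.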
Referring to fig. 6, in fig. 6 the training system of the first neural network includes 1 CPU and multiple NPUs (that is, NPU1, NPU2, … NPU6 in fig. 6). In each round of training operation of the first neural network, the multiple NPUs may use the same calculation map. Specifically, each NPU generates a function value of the loss function of the first neural network according to one batch of training samples and calculates, according to that function value, the gradient values corresponding to the weight parameters of the first neural network. Synchronization of the weight parameters of the first neural network across the multiple NPUs may be achieved by means of all-reduce (AllReduce): each NPU sends out the gradient values it generated, and after the aggregation operation on the gradient values generated by the multiple NPUs is completed, each NPU receives the aggregated gradient values and updates the parameters of the first neural network according to them. It should be understood that fig. 6 is merely for facilitating understanding and is not intended to limit the present embodiment.
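The AllReduce synchronization of fig. 6 can be sketched by aggregating the local gradients of the six NPUs and broadcasting the same result back to each one. The averaging reduction and all numbers here are assumptions for illustration.

```python
# All-reduce sketch for fig. 6: six NPUs each produce a local gradient;
# the reduction (here an average, which is an assumption) is broadcast
# back so every NPU applies the same aggregated value.
def all_reduce(local_grads):
    agg = sum(local_grads) / len(local_grads)
    return [agg] * len(local_grads)            # every participant gets agg

local_grads = [0.5, 0.7, 0.6, 0.4, 0.8, 0.6]   # one per NPU (NPU1..NPU6)
synced = all_reduce(local_grads)

lr = 0.1
weights = [1.0 - lr * g for g in synced]       # identical update on each NPU
```

Because every NPU receives the same aggregated gradient, the weight parameters stay synchronized after each round without any device holding a privileged copy.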
Referring to fig. 7 and 8, in both figures the first neural network is so large that the memory resources or computing power resources of a single processor cannot complete the forward operation of the whole first neural network, so the Nth round of training operation of the first neural network can be split into a plurality of first calculation maps. Referring to fig. 7, in fig. 7 the first neural network is split into a serial neural network module B1, neural network module B2, and neural network module B3, where each neural network module includes a plurality of neural network layers. The forward propagation operation in the Nth round of training operation of the first neural network is realized through the first calculation maps 1 to 3, obtaining the prediction information output by the first neural network. Then, the back propagation operation in the Nth round of training operation is realized through the first calculation maps 4 to 6, generating the gradient values corresponding to the weight parameters in the neural network modules B1, B2, and B3, respectively; the weight parameters in the first neural network are updated through the first calculation map 8.
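The serial split of fig. 7 can be sketched by chaining three stand-in modules, with each call corresponding loosely to one of the first calculation maps 1 to 3 of the forward pass. The modules themselves are hypothetical one-line functions, not the patent's architecture.

```python
# Serial split of fig. 7 with three hypothetical stand-in modules; each
# call corresponds loosely to one of the first calculation maps 1 to 3.
def module_B1(x):
    return x + 1.0               # stand-in for a block of layers

def module_B2(x):
    return x * 2.0

def module_B3(x):
    return x - 0.5

def forward(x):
    h1 = module_B1(x)            # first calculation map 1
    h2 = module_B2(h1)           # first calculation map 2
    return module_B3(h2)         # first calculation map 3 -> prediction

pred = forward(1.0)
```

Because each module only needs its own parameters and its predecessor's output, each calculation map can run on a different processor, which is the point of the split.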
Referring to fig. 8, in fig. 8 the first neural network is split into neural network modules C1 to C5, where the neural network modules C2 to C4 are three modules in parallel. The forward propagation operation in the Nth round of training operation of the first neural network is realized through the first calculation maps 1 to 5, obtaining the prediction information output by the first neural network. Then, the back propagation operation in the Nth round of training operation is realized through the first calculation maps 6 to 10, generating the gradient values corresponding to the weight parameters in the neural network modules C1 to C5, respectively; the first calculation map 8 is used to update the weight parameters in the first neural network. It should be understood that the examples in fig. 7 and 8 are only intended to facilitate understanding of the concept of the "first calculation map" and are not intended to limit the present solution.
Alternatively, the first communication device may determine "the one or more calculation maps corresponding to the Nth round of training of the first neural network" in various manners. It should be noted that this determination process may be performed by the first communication device, or by another communication device other than the first communication device, in which case the first communication device receives the first calculation map sent by the other communication device; this is not limited in this application. Specifically, in one implementation manner, a preset policy may be configured on the first communication device, and after the second calculation map is obtained, a splitting operation may be performed on the second calculation map according to the preset policy to obtain the one or more calculation maps corresponding to the Nth round of training of the first neural network.
The preset policy may include any one or more of the following: preferentially using compiled execution for computationally intensive steps, increasing the training speed of the neural network, reducing computer resource overhead, or other policies; this list is not exhaustive. Optionally, before performing step 301, the first communication device may also receive a preset policy configured by the user, and further optionally, the preset policy configured on the first communication device can be updated. It should be noted that the user here may be a user of the first communication device, for example a technician training the first neural network.
For example, since most of the steps shown in the first calculation map need to be performed by an NPU, a GPU, or another type of artificial intelligence accelerator (artificial intelligence accelerator), the CPU may be required to send the values of the input parameters in the first calculation map to the artificial intelligence accelerator. Among the foregoing steps, having the artificial intelligence accelerator execute the steps corresponding to the first calculation map can accelerate the training of the neural network, but the process of sending the values of the input parameters to the artificial intelligence accelerator can slow training down and increase the cost of computer resources. Therefore, by configuring the preset policy on the first communication device, the user can guide the process of determining the first calculation map, which is beneficial to the rationality of the determined first calculation map.
In another implementation, the first communication device may present the second calculation map to the user after acquiring the second calculation map corresponding to the Nth round of training of the first neural network; the first communication device then receives first information entered by the user, the first information indicating how to split the second calculation map into one or more calculation maps. For example, the first information may include the one or more calculation maps corresponding to the Nth round of training of the first neural network; for another example, the first information may include the positions of at least one split node in the second calculation map, and the first communication device may split the second calculation map into a plurality of calculation maps according to the at least one split node in the first information. Which information the first information specifically carries can be flexibly set in combination with the actual application scenario and is not limited here. In the embodiment of the application, the second calculation map is displayed to the user, and the user directly determines the first calculation map according to the second calculation map, which further improves the rationality of the determined first calculation map.
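Splitting the second (whole) calculation map at the split nodes carried in the first information can be sketched as cutting an ordered list of steps at given positions. The step names and the `split_graph` helper are hypothetical; a real graph would be cut along edges of a directed graph rather than a flat list.

```python
# Splitting the second (whole) computation graph into sub-graphs at
# user-indicated split positions; step names and helper are hypothetical.
def split_graph(ops, split_positions):
    # ops: the whole second computation graph as an ordered list of steps
    # split_positions: indices after which the graph is cut
    subgraphs, start = [], 0
    for pos in sorted(split_positions):
        subgraphs.append(ops[start:pos])
        start = pos
    subgraphs.append(ops[start:])
    return subgraphs

second_graph = ["loss", "grad", "overflow_check", "scale", "update"]
subgraphs = split_graph(second_graph, [2, 4])
```

Each resulting sub-graph plays the role of one calculation map in the round, and together they cover every step of the original second calculation map exactly once.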
In another implementation, the first communication device may also heuristically determine one or more first computational graphs from the second computational graph after acquiring the second computational graph.
302. Judging whether the first calculation graph can be reused or not, and if not, entering a step 303; if yes, go to step 304.
In some embodiments of the present application, after the first communication device acquires the first calculation map, it may determine whether the first calculation map is reusable; if the determination result is no, step 303 may be entered, and if yes, step 304. It should be noted that step 302 is an optional step: in some scenarios the calculation maps adopted by each round of training operation of the first neural network are the same, and in one implementation the first communication device may assume by default that every first calculation map it obtains is reusable, entering step 304 directly without executing step 302. In this embodiment of the present application, if the first calculation map will not be reused, the first compiled code corresponding to it will not be reused either, and storing that compiled code in the system would waste the system's storage resources; by storing in the system only the compiled code corresponding to reusable calculation maps, the utilization of the system's storage resources is improved.
The first communication device may determine whether the first calculation map is reusable in a number of ways. Specifically, in one implementation, the first communication device may determine whether the first calculation map can be reused according to the value of N. For example, in an application scenario where the calculation map used by the first round of training of the first neural network differs from the calculation map used by the second round, while the second and every subsequent round use the same calculation map, step 302 may include: in the case where the value of N is 1, the first communication device may determine that the first calculation map is not reusable; in the case where the value of N is greater than 1, in one implementation the first communication device may directly determine that the first calculation map can be reused, while in another implementation the first communication device may further judge, according to the computation amount of the first calculation map and the amount of parameters it requires, whether the compiled execution manner can bring a gain: if the judgment result is yes, it may determine that the first calculation map can be reused; if no, it may determine that it cannot. The determining factors of "whether a gain can be brought" may include whether the training speed of the neural network can be increased, whether the consumption of computer resources can be reduced, or other factors; which factors are adopted can be flexibly set in combination with the actual application scenario and is not limited here.
For another example, in another application scenario, multiple rounds of training operations of the first neural network may correspond to at least two different second calculation maps. As illustrated with reference to fig. 4 and 5, in the process of training the first neural network according to the second calculation map shown in fig. 4, in order to increase the training speed the training mode may be changed from a high-precision training mode to a mixed-precision training mode; after this change, however, the generated gradient values of the weight parameters of the first neural network may overflow, so the step of "judging whether the gradient values of the weight parameters of the first neural network overflow" needs to be added. That is, the second calculation map corresponding to each round of training operation of the first neural network may change to the form shown in fig. 5. It should be noted that the second calculation map may also change due to other factors.
Specifically, the first communication device may store second information, where the second information indicates a preset value set corresponding to N. Step 302 may include: judging whether the value of N is contained in the preset value set; if the value of N is outside the preset value set, it can be determined that the first calculation map corresponding to at least one first step in the Nth round of training operation of the first neural network cannot be reused. If the value of N is within the preset value set, in one implementation the first communication device may directly determine that the first calculation map can be reused; in another implementation, the first communication device may further judge, according to the computation amount of the first calculation map and the amount of parameters it requires, whether the compiled execution manner can bring a gain: if the judgment result is yes, it can be determined that the first calculation map can be reused; if no, it can be determined that it cannot.
In another implementation, the first communication device may further determine whether the first calculation map is reusable according to the values of the non-training parameters in the first neural network. For example, when the learning rate among the non-training parameters of the first neural network changes, the gradient values used to update the weight parameters in each round may change, and thus the calculation map used in the training operation may change. The first communication device may therefore judge whether the learning rate used in the Nth round of training operation of the first neural network is the same as the learning rate used in the (N-1)th round; if the judgment result is no, the first communication device may determine that the first calculation map corresponding to at least one first step in the Nth round of training operation cannot be reused, and if yes, it may determine that the first calculation map can be reused. The example here is only for facilitating understanding of the present solution and does not limit it. In another implementation, the first communication device may also determine whether the first calculation map is reusable in combination with both the value of N and the values of the non-training parameters in the first neural network. It should be noted that the first communication device may also perform the operation of judging whether the first calculation map can be reused based on other policies, which may be flexibly determined in combination with the actual application scenario and is not limited here.
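A reuse judgment combining two of the checks described for step 302 (whether N falls in the preset value set, and whether the learning rate changed since the previous round) might look as follows. The policy, names, and values are illustrative assumptions, not the patent's criteria.

```python
# Sketch of a reuse judgment combining two of the checks in step 302:
# N must lie in a preset value set, and the learning rate must be
# unchanged from the previous round. Names and values are illustrative.
def is_reusable(n, preset_values, lr_per_round):
    if n not in preset_values:
        return False                       # e.g. round 1 uses a different graph
    if n >= 2 and lr_per_round[n - 1] != lr_per_round[n - 2]:
        return False                       # learning rate changed this round
    return True

preset = set(range(2, 100))                # rounds 2..99 share one graph
lrs_constant = [0.1] * 100                 # same learning rate every round
lrs_changed = [0.1] * 4 + [0.05] * 96      # learning rate changes at round 5
```

With a constant learning rate every round after the first is judged reusable; a learning-rate change at round 5 makes that round's calculation map non-reusable.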
303. At least one first step in the N-th round of training operation of the first neural network is performed by way of interpreted execution.
In some embodiments of the present application, if it is determined that the first computation graph cannot be reused, the first communication device may perform, by way of interpreted execution, the at least one first step in the N-th round of training operation of the first neural network corresponding to the first computation graph.
The "compiled execution" manner means that, according to a first intermediate computing expression (intermediate representation, IR) corresponding to the first computation graph, a compiler is used to generate, in one pass, the compiled code corresponding to the entire first computation graph (that is, the graph is compiled into machine code), and the compiled code corresponding to the first computation graph is stored; during execution, the compiled code corresponding to the entire first computation graph can be executed directly. The "interpreted execution" manner means that, during the execution process, the first intermediate computing expression (IR) corresponding to the first computation graph is interpreted into machine code and run line by line, and then the next line is interpreted and executed; that is, interpretation and execution are interleaved.
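For ease of understanding, the difference between the two execution manners can be sketched with a toy IR consisting of Python statements. This is only an illustrative analogy (Python's built-in `compile` and `exec` stand in for a real graph compiler and runtime), not the implementation claimed in this application:

```python
def interpret(ir_lines, env):
    """Interpreted execution: translate and run one IR line at a time."""
    for line in ir_lines:        # interpretation and execution are interleaved
        exec(line, {}, env)      # each line is handled independently
    return env

def compile_graph(ir_lines):
    """Compiled execution: lower the whole graph to one code object, once."""
    code = compile("\n".join(ir_lines), "<graph>", "exec")
    def run(env):                # the stored artifact can be re-run directly
        exec(code, {}, env)
        return env
    return run
```

In the compiled manner the one-time artifact returned by `compile_graph` can be stored and re-run directly in later rounds, which is exactly what makes caching it (steps 306 to 309) worthwhile.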
It should be noted that step 303 is an optional step. In the case where the first communication device determines that the first computation graph cannot be reused, the at least one first step in the N-th round of training operation of the first neural network corresponding to the first computation graph may also be performed by way of compiled execution.
304. Judging whether the first mapping relationship has been established; if not, entering step 305; if yes, entering step 309.
In some embodiments of the present application, the first communication device may judge whether the first mapping relationship has been established, that is, whether an already-established first mapping relationship exists in the system where the first communication device is located; if the result is no, enter step 305; if the result is yes, enter step 309. The first mapping relationship is used to indicate the locations from which the values of the input parameters in the first computation graph are obtained. The system where the first communication device is located includes the storage devices accessible to the first communication device; the storage devices accessible to the first communication device include memory accessible to the first communication device, and may also include external storage accessible to the first communication device.
Optionally, the input parameters in the first computation graph may include the weight parameters in the first computation graph and the non-training parameters in the first computation graph, and the first mapping relationship may be used to indicate the locations from which the values of the weight parameters and the values of the non-training parameters in the first computation graph are obtained.
Optionally, the first mapping relationship may include a one-to-one mapping relationship between the plurality of non-training parameters in the first computation graph and the plurality of non-training parameters in a third computation graph, where this mapping relationship is used to indicate the locations from which the values of the non-training parameters in the first computation graph are obtained. For any one non-training parameter in the first computation graph (hereinafter referred to as a "target parameter" for ease of description), the first mapping relationship may, for example, be expressed as a mapping relationship between the target parameter and the location, in the third computation graph, of the source of the target parameter's value. Optionally, the first mapping relationship may further include a one-to-one mapping relationship between the plurality of weight parameters in the first computation graph and the plurality of weight parameters in the third computation graph, where this mapping relationship is used to indicate the locations from which the values of the weight parameters in the first computation graph are obtained.
The third computation graph corresponds to at least one first step in the (N-1)-th round of training operation of the first neural network and is similar to the first computation graph; the difference is that the third computation graph is adopted in the (N-1)-th round of training operation of the first neural network, while the first computation graph is adopted in the N-th round. After the (N-1)-th round of training operation of the first neural network is performed, the value of each non-training parameter in the first neural network and the updated value of each weight parameter in the first neural network may be determined.
The "non-training parameters in the first computation graph" are used to control the training process of the first neural network. For example, the "non-training parameters in the first computation graph" may include parameters in a regularization layer (batch norm) employed during the training of the first neural network, where the regularization layer is used to prevent the trained first neural network from overfitting. For another example, the "non-training parameters in the first computation graph" may include the learning rate in the loss function, where the learning rate is used to control the update step size of the weight parameters of the first neural network. The values of the "non-training parameters in the first computation graph" are updated during the forward propagation of each round of training operation, and the updated values are used in the next round of training operation. It should be understood that these examples of "non-training parameters in the first computation graph" are only for ease of understanding this solution and do not limit it. The "weight parameters in the first computation graph" may also be referred to as the training parameters in the first computation graph; the gradient values obtained through back propagation during the training of the first neural network are used to update the values of the weight parameters in the first computation graph, and the updated values of the "weight parameters in the first computation graph" are used in the next round of training operation.
It should be noted that the "first mapping relationship" may not include a one-to-one mapping relationship between the plurality of weight parameters in the first computation graph and the plurality of weight parameters in the third computation graph, but may instead be a mapping relationship between the plurality of weight parameters in the first computation graph and parameters in other computation graphs. In connection with the above description of fig. 5, three computation graphs are shown in fig. 5; if the values of the weight parameters in the first computation graph are derived from the third computation graph, the first mapping relationship may include a one-to-one mapping relationship between the weight parameters in the first computation graph and the weight parameters in the third computation graph in fig. 5, so as to indicate the locations from which the values of the weight parameters in the first computation graph are obtained. This is only for ease of understanding this solution and does not limit it.
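For ease of understanding, the first mapping relationship can be pictured as a table mapping each input parameter of the current round's graph to the location holding its value from the (N-1)-th round. The following Python sketch is illustrative; the key format and names are assumptions, not the claimed data layout:

```python
def build_first_mapping(weight_names, non_training_names):
    """One-to-one map: parameter name -> location of its value source.

    Each input parameter of the round-N graph points at the storage slot
    produced by the corresponding parameter of the round N-1 graph.
    """
    mapping = {}
    for name in weight_names + non_training_names:
        # the value source is the updated parameter from round N-1
        mapping[name] = ("round_n_minus_1", name)
    return mapping
```

A real implementation could equally point weight parameters at a different source graph, as the text above notes; only the non-training parameters are necessarily tied to the third computation graph.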
305. Performing expression conversion on the first computation graph to obtain a first intermediate computing expression corresponding to the first computation graph.
In this embodiment, step 304 is an optional step. If step 304 is executed, the first communication device may, upon determining that the first mapping relationship has not yet been established, perform expression conversion (trace) on the first computation graph to obtain the first intermediate computing expression corresponding to the first computation graph. If step 304 is not executed, the first communication device may, upon determining that the first computation graph can be reused, directly perform expression conversion on the first computation graph to obtain the first intermediate computing expression corresponding to the first computation graph. For example, the first computation graph obtained in step 301 may be understood as the first computation graph in a high-level language, and "the first intermediate computing expression corresponding to the first computation graph" may be understood as the first computation graph in a logic-description form.
306. Judging, according to the first intermediate computing expression, whether a first compiled code corresponding to the first computation graph has been stored in the system; if not, entering step 307; if yes, entering step 308.
In this embodiment of the present application, after obtaining the first intermediate computing expression corresponding to the first computation graph, the first communication device may judge, according to the first intermediate computing expression, whether the first compiled code corresponding to the first computation graph has been stored in the system; alternatively, the first communication device may judge, according to the first intermediate computing expression, whether the first compiled code has been stored in the memory of the system. In this implementation, whether the first compiled code is stored in the system is judged according to the IR corresponding to the first computation graph, so the judgment can be made accurately; the first compiled code can then be successfully acquired from the system, which improves the smoothness of the execution process of the first computation graph.
Specifically, step 306 may include: the first communication device generates an index value according to the first intermediate computing expression, and judges, based on the index value, whether the first compiled code corresponding to the first computation graph exists at a preset location in the memory of the first communication device. If the judgment result is no, that is, it is determined that the first compiled code corresponding to the first computation graph does not exist in the system, step 307 is entered; if the judgment result is yes, that is, it is determined that the first compiled code corresponding to the first computation graph has been stored in the system, step 308 is entered.
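For ease of understanding, generating an index value from the IR and probing the cache can be sketched as follows. The use of a SHA-256 hash as the index value and a dictionary as the preset memory location are illustrative assumptions:

```python
import hashlib

def ir_index(ir_text):
    """Stable index value computed from the IR of the first computation graph."""
    return hashlib.sha256(ir_text.encode("utf-8")).hexdigest()

def lookup_compiled(cache, ir_text):
    """Probe the preset location; None means 'not stored' (enter step 307)."""
    return cache.get(ir_index(ir_text))
```

Because the index is derived from the IR itself, two textually identical graphs map to the same slot, which is what lets a later round find the code compiled in an earlier round.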
307. According to the first intermediate computing expression, generating, by a compiler, the first compiled code corresponding to the first computation graph, and storing the first compiled code in the system.
In this embodiment of the present application, when the first communication device determines, according to the first intermediate computing expression, that the first compiled code corresponding to the first computation graph does not exist in the system, the first communication device may generate, by the compiler and according to the first intermediate computing expression, the first compiled code corresponding to the first computation graph, and store the first compiled code in the system, for example, by writing the first compiled code corresponding to the first computation graph to a preset location in the memory of the first communication device. In this implementation, when the first compiled code does not exist in the system, it is stored in the system after being generated, so that the next time the first computation graph is acquired, the first compiled code can be obtained directly from the system, which improves the smoothness of the implementation of this solution.
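For ease of understanding, the compile-and-store path of step 307, combined with the later direct acquisition of step 309, can be sketched as a single cache accessor. Python's built-in `compile` stands in for the real compiler and a dictionary for the preset memory location; these are illustrative assumptions:

```python
import hashlib

def get_or_compile(cache, ir_text):
    """Return cached compiled code for this IR, compiling and storing on a miss."""
    key = hashlib.sha256(ir_text.encode("utf-8")).hexdigest()
    code = cache.get(key)
    if code is None:                           # step 307: compile once, store
        code = compile(ir_text, "<graph>", "exec")
        cache[key] = code
    return code                                # later rounds: direct cache hit
```

The first call in round M pays the compilation cost; every later round N > M that reuses the graph retrieves the identical artifact without invoking the compiler.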
Optionally, the first communication device may further trigger the start of establishing the first mapping relationship. Further optionally, the first communication device may trigger the establishment of a one-to-one mapping relationship between the plurality of weight parameters in the first computation graph and the plurality of weight parameters in another computation graph. Specifically, when the first computation graph can be reused and the first communication device determines that the first compiled code corresponding to the first computation graph does not exist at the preset location in the memory, this indicates that the current round is the first round of training operation performed after it was determined that the first computation graph can be reused. The first communication device may then generate, by the compiler, the first compiled code corresponding to the first computation graph and store it at the preset location in the memory, and establish a mapping relationship between the plurality of weight parameters in the first computation graph and the plurality of weight parameters in another computation graph, where the other computation graph may be the third computation graph (that is, the first computation graph adopted in the (N-1)-th round of training of the first neural network) or a computation graph other than the third computation graph; for details, refer to the description in step 304.
For a more intuitive understanding of this solution, please refer to fig. 9. Fig. 9 is a schematic diagram of the input parameters in the first computation graph provided in this embodiment, where the input parameters in the first computation graph include the weight parameters in the first computation graph and the non-training parameters in the first computation graph. Fig. 9 illustrates the input relationships of the weight parameters and the non-training parameters in the first computation graph across the first, second, and third rounds of training operations performed according to the first computation graph, taking as an example the case where the first computation graph corresponding to the second and subsequent rounds of training operations can be reused. D0 represents the first neural network in the first round of training operations, and a0, d0, and e0 represent the values of the weight parameters in the first neural network (i.e., D0) in the first round of training operations. D1 represents the first neural network in the second round of training operations, and a1, d1, and e1 represent the values of the weight parameters in the first neural network (i.e., D1) in the second round of training operations. The arrow from D0 to D1 represents: the values of the non-training parameters in D0 obtained in the forward propagation of the first round of training operation are determined as the values of the non-training parameters in D1 before the start of the second round of training operation.
D2 represents the first neural network in the third round of training operations, and a2, d2, and e2 represent the values of the weight parameters in the first neural network (i.e., D2) in the third round of training operations. The arrow from D1 to D2 represents: the values of the non-training parameters in D1 obtained in the forward propagation of the second round of training operation are determined as the values of the non-training parameters in D2 before the third round of training operation begins.
As shown in fig. 9, the weight parameters are obtained in the same manner in the first round of training operation and in the subsequent rounds. By contrast, the manner of obtaining the non-training parameters in the first neural network differs between the first and second rounds of training operations, while it is the same in the second and all subsequent rounds. Therefore, the first communication device may trigger the start of establishing the first mapping relationship in the first round of training operation performed after it is determined that the first computation graph can be reused (i.e., the second round of training operation in fig. 9), but can complete it only in the second and subsequent rounds after that determination. It should be understood that the example in fig. 9 is only for ease of understanding this solution and does not limit it.
308. Establishing the first mapping relationship.
In this embodiment of the present application, if the first computation graph can be reused, the first communication device determines that the first compiled code corresponding to the first computation graph exists at the preset location in the local memory, and the system does not yet have an established first mapping relationship, then the first mapping relationship may be established and stored in the system.
Specifically, in one implementation, the first communication device may directly establish a one-to-one mapping relationship between the plurality of weight parameters in the first computation graph and the plurality of weight parameters in another computation graph, where the other computation graph may be the third computation graph (that is, the first computation graph adopted in the (N-1)-th round of training of the first neural network) or a computation graph other than the third computation graph (for details, refer to the description in step 304), as well as a one-to-one mapping relationship between the plurality of non-training parameters in the first computation graph and the plurality of non-training parameters in the third computation graph, so as to complete the construction of the first mapping relationship.
In another implementation, if a one-to-one mapping relationship between the plurality of weight parameters in the first computation graph and the plurality of weight parameters in the other computation graph has already been established in the first round of training operation performed after it was determined that the first computation graph can be reused, then in step 308 the first communication device may establish a one-to-one mapping relationship between the plurality of non-training parameters in the first computation graph and the plurality of non-training parameters in the third computation graph, so as to complete the construction of the first mapping relationship.
Optionally, if the multiple rounds of training operations of the first neural network may correspond to at least two different second computation graphs, that is, if the first computation graph may change during the multiple rounds of training operations of the first neural network, the first mapping relationship needs to be reconstructed. Alternatively, if the first computation graph executed by the first communication device does not change but the locations from which the input parameters in the first computation graph are obtained change, the first mapping relationship also needs to be reconstructed.
309. Acquiring, from the system, the first compiled code corresponding to the first computation graph, where the first compiled code is generated during the M-th round of training of the neural network, M is a positive integer, and M is less than N.
In this embodiment, step 304 is an optional step. If step 304 is executed and step 309 is entered upon determining through step 304 that the first mapping relationship has been established, the first communication device may directly acquire, from the preset location in the memory, the first compiled code corresponding to the first computation graph, where the first compiled code is generated during the execution of the M-th round of training of the neural network, M is a positive integer, and M is less than N.
As can be seen from the descriptions in steps 307 and 308, the first communication device generates the first compiled code in the first round of training operation performed after determining that the first computation graph can be reused, and the first mapping relationship can be established in the second and subsequent rounds after that determination. Therefore, if the first mapping relationship has already been stored in the system, the first communication device has, with high probability, already performed the step of "generating, by the compiler, the first compiled code corresponding to the first computation graph", so the first compiled code corresponding to the first computation graph can be acquired directly from the system. In this case, there is no longer any need to generate the intermediate computing expression corresponding to the first computation graph, nor to query, according to that expression, whether the first compiled code corresponding to the first computation graph exists in the system.
If step 304 is performed and it is determined in step 304 that the first mapping relationship has not been established, step 308 may be entered through step 306 and then step 309 may be performed; that is, when it is determined that the first mapping relationship has not yet been successfully established and the first compiled code has been stored in the system, the construction of the first mapping relationship is performed through step 308 and the first compiled code is acquired from the system.
Alternatively, if step 304 is not performed, steps 306 to 308 may be performed and then step 309 may be entered. That is, when it is determined, according to the first intermediate computing expression, that the first compiled code corresponding to the first computation graph exists at the preset location in the memory, the construction of the first mapping relationship is performed through step 308, and the first compiled code is acquired from the system.
In the embodiment of the present application, when the first computation graph can be reused and the first mapping relationship has not yet been established, expression conversion is performed on the first computation graph to obtain the intermediate computing expression corresponding to the first computation graph; when it is determined, according to that intermediate computing expression, that the first compiled code corresponding to the first computation graph exists in the stored data, the first mapping relationship is established and the first compiled code is obtained directly from the stored data. That is, rather than generating the intermediate computing expression corresponding to the first computation graph and then generating the first compiled code with the compiler whenever the first mapping relationship has not been established, the step of generating the first compiled code from the intermediate computing expression is saved, which reduces the overhead of computer resources, accelerates the acquisition of the first compiled code corresponding to the first computation graph, and thus improves the execution speed of the training operation of the first neural network.
310. Obtaining input data of the first computation graph.
In this embodiment of the present application, the first communication device needs to acquire the input data of the first computation graph. The input data of the first computation graph may include the values of the input parameters of the first computation graph. Specifically, the first communication device may acquire the first mapping relationship from the system, where the first mapping relationship is used to indicate the locations from which the input parameters of the first computation graph are obtained, and determine, according to the first mapping relationship, the values of the input parameters in the first computation graph in the N-th round of training of the first neural network. In this implementation, the system may further store the first mapping relationship indicating those locations, so that during the execution of the first compiled code, the values of the input parameters in the first computation graph can be determined directly according to the first mapping relationship. This helps to improve the speed at which the values of the input parameters in the first computation graph are acquired, and in turn the execution speed of the training operation of the first neural network.
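For ease of understanding, determining the values of the input parameters according to the first mapping relationship can be sketched as follows; the mapping format and storage layout are illustrative assumptions:

```python
def resolve_input_values(first_mapping, storage):
    """Resolve each input parameter's value by following the first mapping.

    first_mapping: parameter name -> storage key of its value source.
    storage: key -> value, as left behind by the (N-1)-th round.
    """
    missing = [name for name, key in first_mapping.items() if key not in storage]
    if missing:
        raise KeyError(f"unresolved input parameters: {missing}")
    return {name: storage[key] for name, key in first_mapping.items()}
```

Because the mapping is a direct lookup table, no graph traversal or re-tracing is needed while the first compiled code runs; each value is one indexed read.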
Optionally, the input data of the first computation graph may further include training samples input into the first neural network. For example, if the one or more first steps corresponding to the first computation graph implement the forward propagation of the training samples through the first neural network, the input data of the first computation graph may include the training samples; for another example, if the one or more first steps corresponding to the first computation graph implement the forward propagation of the training samples through the first n neural network layers of the first neural network, the input data of the first computation graph may include the training samples.
Alternatively, the input data of the first computation graph may further include data generated by a neural network layer in the first neural network. For example, referring to fig. 7 and fig. 8, since the forward propagation process of the entire first neural network consumes excessive computer resources, the one or more first steps corresponding to the first computation graph may implement the operations of a plurality of neural network layers within the first neural network, and the input data of the first computation graph may include the data generated by a certain neural network layer of the first neural network.
Alternatively, the input data of the first computation graph may further include the gradient values corresponding to the weight parameters in the first neural network, and the like. Which types of data the input data of the first computation graph specifically includes may be determined according to the actual application scenario; the examples here are only for ease of understanding this solution and do not limit it.
If the value of at least one item of input data of the first computation graph exists in the second output data obtained after the third step in the training operation of the neural network is performed, and the third step in the training operation of the first neural network is not performed by way of compiled execution, then, optionally, the first communication device may further acquire the second data structure adopted when the third step in the training operation of the first neural network is performed, and acquire the value of the at least one item of input data of the first computation graph according to the format of the second data structure. The "third step in the training operation of the neural network" may also be referred to as an upstream task of the "first step in the training operation of the neural network".
Further optionally, after the first communication device acquires the first mapping relationship from the system, if it determines, according to the first mapping relationship, that the value of at least one input parameter in the first computation graph is stored in the second output data generated when the third step in the training operation of the first neural network is performed, that is, that the acquisition locations of the input parameters of the first computation graph include the second output data, then in step 310 the first communication device may acquire, according to the format of the second data structure, the value of the at least one input parameter in the first computation graph from the second output data.
Illustratively, at least one item of input data of the first computation graph may be represented as a tensor, and the first communication device may understand, according to the second data structure of the tensor, the second output data stored in the memory. Illustratively, the second data structure of the tensor may be used to describe the definition of the data members in tensor form, the arrangement of the second output data in the memory, or the memory alignment adopted when the second output data is stored in the memory, and so on; this is not an exhaustive list of the types of information carried by the second data structure of the tensor when the third step in the N-th round of training operation of the first neural network is performed.
For example, the "definition of the data members in tensor form adopted when the third step in the N-th round of training operation of the first neural network is performed" may include the data type of each data member adopted when that step is performed; for example, a 32-bit floating-point number (float32) and a 16-bit integer (int16) are different data types. It may also include the size of the tensor corresponding to each data member, and may further define other information of each data member, which is not exhaustively listed here. Illustratively, the arrangement of the second output data in the memory may include which storage structure the tensor form of the second output data adopts in the memory, where the foregoing storage structure may include a queue, a stack, a linked list, or another storage structure. This is only intended to illustrate the concept of the tensor data structure for ease of understanding and does not limit this solution.
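For ease of understanding, the kinds of information carried by such a tensor data structure can be modeled as a small descriptor. The field names and the byte-width table below are illustrative assumptions, not the claimed format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorDesc:
    dtype: str       # data type of the data member, e.g. "float32" or "int16"
    shape: tuple     # size of the tensor corresponding to the data member
    layout: str      # arrangement in memory, e.g. "row_major"
    alignment: int   # memory alignment in bytes adopted when storing

def nbytes(desc):
    """Total bytes occupied, derived from the dtype width and the shape."""
    widths = {"float32": 4, "float16": 2, "int16": 2, "int8": 1}
    n = widths[desc.dtype]
    for dim in desc.shape:
        n *= dim
    return n
```

With such a descriptor, a consumer can interpret the producer's raw bytes directly, which is the premise for the zero-copy access discussed next.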
In the embodiment of the present application, the second data structure adopted by the upstream task of the first step in the training operation of the neural network can further be acquired, and at least one item of input data of the first computation graph can be obtained from the output data of the upstream task according to the format of the second data structure, thereby avoiding the overhead of computer resources caused by converting the same data between different data structures.
Further optionally, the read location of the at least one item of input data of the first computation graph coincides with the storage location of the second output data. Optionally, a shared-pointer technique may be used to make the read location, in the memory, of the at least one item of input data of the first computation graph coincide with the storage location of the second output data in the memory. It should be noted that, after the first communication device reads the at least one item of input data of the first computation graph, the second output data is not modified during the execution of the first compiled code. In the embodiment of the present application, keeping the read location of the at least one item of input data of the first computation graph consistent with the storage location of the second output data avoids copying the same data between different storage locations, which further reduces the overhead of computer resources.
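For ease of understanding, the idea that the read location coincides with the storage location (so no copy is made) can be illustrated with a Python `memoryview`, which here stands in for the shared-pointer technique mentioned above; the variable names are illustrative:

```python
def share_output(producer_buffer):
    """Expose the producer's storage to the consumer without copying it."""
    return memoryview(producer_buffer)   # same storage, a second handle

upstream = bytearray(b"\x01\x02\x03\x04")  # second output data of the upstream task
view = share_output(upstream)              # consumer's read location == storage location
```

Because `view` aliases the same bytes, any byte visible through the producer's buffer is visible through the consumer's handle without a copy; correctness then rests on the guarantee, stated above, that the second output data is not modified while the first compiled code executes.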
It should be noted that the embodiment of the present application does not limit the execution order between step 310 and any one of steps 302 to 309; step 310 may be performed before or after any one of steps 302 to 309.
311. Executing the first compiled code corresponding to the first computation graph.
In this embodiment of the present application, after executing the first compiled code corresponding to the first computation graph according to the values of the input parameters in the first computation graph, the first communication device may generate third output data; for example, the third output data may be tensor data. It should be noted that the embodiment of the present application does not limit the execution order of steps 310 and 311; during the execution of the first compiled code, the value of at least one input parameter in the first computation graph may also be obtained through step 310 while the first compiled code continues to be executed, that is, steps 310 and 311 may be executed in an interleaved manner.
Optionally, if the second step in the Nth round of training operations of the first neural network is not performed in a compiled manner, in one implementation, before performing step 310 the first communication device may further obtain the first data structure used when performing the second step of the training operation of the neural network, and step 311 may include: the first communication device generates first output data of the first data structure, where the first output data may be the same as the third output data, or may include part of the data in the third output data. The first output data includes at least one input data of the second step of the training operation of the neural network; the second step may also be referred to as a downstream task of the first step of the training operation of the neural network.
The first communication device may generate the first output data that can be understood by the downstream task according to the first data structure of the tensor, where the first data structure of the tensor may be used to describe a definition of data members in a tensor form, an arrangement form of the first output data in a memory, or a memory alignment form adopted when the first output data is stored in the memory, which is not exhaustive herein. The meaning of the "first data structure" is similar to that of the "second data structure" described above, and it will be understood by referring to the above description, and details thereof will not be repeated here.
In the embodiment of the present application, the first data structure adopted by the downstream task of the first step in the training operation of the neural network may be further obtained, when the first compiled code corresponding to the first computation graph is executed, output data of the first data structure is generated, and when the downstream task accesses the first output data, it is no longer necessary to convert the data structure of the first output data, so that overhead of computer resources caused by converting the same data between different data structures is avoided.
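For illustration, a "first data structure" of the kind described above, covering member definitions, memory arrangement, and alignment, could be modelled as in the following sketch; all field names and the helper function are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical description of a tensor "data structure" in the sense used in
# the text: the definition of data members, the arrangement of the data in
# memory, and the memory alignment adopted when storing it.
@dataclass(frozen=True)
class TensorSpec:
    dtype: str          # element type, e.g. "float32"
    shape: tuple        # logical dimensions
    layout: str         # arrangement in memory: "row_major" or "col_major"
    alignment: int      # byte alignment used when storing the data

def emit_in_downstream_spec(values, spec: TensorSpec):
    """Produce output directly in the structure the downstream task expects,
    so the consumer needs no conversion before reading it."""
    return {"spec": spec, "data": list(values)}

downstream_spec = TensorSpec("float32", (2, 2), "row_major", 64)
out = emit_in_downstream_spec([1.0, 2.0, 3.0, 4.0], downstream_spec)
```

Generating the output already tagged with the downstream spec is what removes the conversion step when the downstream task accesses the first output data.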
Further alternatively, the storage location of the first output data and the read location of the at least one input data of the second step coincide. Alternatively, the "the storage location of the first output data and the read location of the at least one input data of the second step coincide" may be achieved using a shared pointer technique. It should be noted that, after the first communication device performs the write operation of the first output data, ownership of the first output data will be transferred to the downstream task.
In the embodiment of the application, the storage position of the first output data is consistent with the reading position of at least one input data in the second step, so that the same data is prevented from being copied between different storage positions, and the cost of computer resources is further reduced.
In another implementation, the first communication device generates first output data of the target data structure, and converts the first output data of the target data structure into output data of the first data structure, where the first output data includes at least one input data of the second step of the training operation of the neural network, the first data structure is the data structure adopted in the second step of the training operation of the neural network, and the target data structure is the data structure adopted in the first step of the training operation of the neural network.
Optionally, the first communication device needs to perform a transmission operation of the third output data after generating the third output data. For example, the first communication device is an NPU, and the first steps corresponding to the first computation graph include generating a gradient value of a weight parameter of the first neural network (i.e., an example of the third output data) in an nth training operation of the first neural network, where the NPU needs to send the generated gradient value to the CPU, i.e., needs to perform a sending operation of the third output data.
Specifically, in one implementation, the plurality of first steps corresponding to the first computation graph include generating, in the nth training operation of the first neural network, a gradient value of a weight parameter of the first neural network, and performing a sending operation of the third output data, and step 311 may include: the first communication device executes the first compiled code corresponding to the first computational graph to generate third output data and transmits the third output data.
In another implementation, step 311 may include: the first communication device may execute a first compiled code corresponding to the first computation graph to generate third output data, and execute a sending operation of the third output data by calling a preset interface, where the preset interface may be an interface of a gradient communication library provided by a third party.
In this application scenario, the "transmission operation of the third output data" serves as the downstream task of the "first step in the training operation of the neural network", that is, the "transmission operation of the third output data" serves as the "second step in the training operation of the neural network". In one implementation, the first communication device may execute the first compiled code corresponding to the first computation graph to generate third output data of the first data structure, and send the first output data of the first data structure by invoking the preset interface. Optionally, a shared pointer technique may be used so that the storage position of the first output data of the first data structure coincides with the position from which the preset interface reads the first output data.
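A minimal sketch of sending gradient values through a preset interface follows; the function `send_gradients` stands in for the interface of a third-party gradient communication library and is an assumption, not a real API:

```python
# Hypothetical stand-in for a third-party gradient communication library;
# the interface name `send_gradients` is an illustrative assumption.
sent = []

def send_gradients(grads):          # the "preset interface"
    sent.append(list(grads))        # in practice this would transmit the data

def run_first_step(weights, inputs):
    # Stand-in for executing the first compiled code: compute gradient values
    # (the third output data) with a toy rule, then invoke the preset
    # interface to perform the sending operation (the second step).
    grads = [w * x for w, x in zip(weights, inputs)]
    send_gradients(grads)
    return grads

g = run_first_step([0.5, 1.5], [2.0, 2.0])
```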
For a more intuitive understanding of the present solution, please refer to fig. 10, which is a schematic flowchart of transmitting the first output data according to an embodiment of the present application. Fig. 10 takes step 304 as an example: after acquiring reusable computation graph 1, communication device 1 determines whether a first mapping relation corresponding to the parameters in computation graph 1 exists; if so, it acquires the first compiled code corresponding to computation graph 1 from the stored data and executes it. If not, it traces computation graph 1 to obtain the intermediate computation representation corresponding to computation graph 1, and determines, according to that intermediate computation representation, whether compiled code corresponding to computation graph 1 exists at the preset location in the memory; if not, communication device 1 may generate the compiled code corresponding to computation graph 1, store it at the preset location in the memory, establish the first mapping relation, and execute the compiled code; if so, communication device 1 may establish the first mapping relation and execute the compiled code corresponding to computation graph 1.
In fig. 10, the transmission operation of the first output data is performed by calling a preset interface provided by the third party, so communication device 1 may acquire the first data structure adopted when the third party performs the transmission operation of the first output data, generate the first output data of the first data structure after executing the compiled code corresponding to computation graph 1, and call the interface to transmit the first output data of the first data structure.
After receiving the first output data of the first data structure, communication device 2 may convert the data structure of the first output data and start to perform the at least one step corresponding to computation graph 2. Specifically, after acquiring reusable computation graph 2, communication device 2 determines whether a first mapping relation corresponding to the parameters in computation graph 2 exists; if so, it acquires the first compiled code corresponding to computation graph 2 from the stored data and executes it. If not, it traces computation graph 2 to obtain the intermediate computation representation corresponding to computation graph 2, and determines, according to that intermediate computation representation, whether compiled code corresponding to computation graph 2 exists at the preset location in the memory; if not, communication device 2 may generate the compiled code corresponding to computation graph 2, store it at the preset location in the memory, establish the first mapping relation, and execute the compiled code; if so, communication device 2 may establish the first mapping relation and execute the compiled code corresponding to computation graph 2. It should be noted that fig. 10 shows the process of executing the first computation graph on communication device 1 and communication device 2, and the data interaction process between communication device 1 and communication device 2; the example in fig. 10 is merely for ease of understanding the present solution and does not limit it.
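The decision flow that fig. 10 describes for each communication device — check the first mapping relation, otherwise trace to an intermediate representation, look up the preset memory location, and compile and store on a miss — can be sketched as follows; the cache structures and the trace/compile helpers are illustrative assumptions:

```python
# A minimal sketch of the caching flow around a computation graph.
compiled_cache = {}       # preset memory location: IR -> compiled code
mapping_relation = {}     # graph id -> IR: the "first mapping relation"

def trace(graph):
    # Stand-in for tracing the graph to an intermediate computation representation.
    return "ir:" + graph

def compile_ir(ir):
    # Stand-in for the compiler generating compiled code from the IR.
    return lambda x: x + 1

def run_graph(graph, x):
    if graph in mapping_relation:             # first mapping relation exists
        code = compiled_cache[mapping_relation[graph]]
    else:
        ir = trace(graph)                     # trace to obtain the IR
        if ir not in compiled_cache:          # no code at the preset location
            compiled_cache[ir] = compile_ir(ir)
        mapping_relation[graph] = ir          # establish the mapping relation
        code = compiled_cache[ir]
    return code(x)

first = run_graph("graph1", 41)   # traces, compiles, stores, then executes
second = run_graph("graph1", 41)  # hits the mapping relation, executes directly
```

On the second call neither tracing nor compilation is performed, which is the saving the embodiment attributes to reusing the first compiled code across training rounds.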
In the embodiment of the application, a specific example of a downstream task of a first step in training operation of the neural network is provided, and communication of first output data is realized by calling a preset interface form, so that the method is convenient and quick; and the first output data of the first data structure is generated, so that the conversion of the first output data among different data structures is avoided, and the efficiency of the transmission process of the first output data is improved.
In another implementation, the first communication device may execute the first compiled code corresponding to the first computation graph to generate third output data of the target data structure, generate the first output data of the first data structure according to the third output data of the target data structure, and send the first output data of the first data structure by calling a preset interface.
It should be noted that the embodiment corresponding to fig. 3 takes, as an example only, the case where the execution bodies of steps 301 to 311 are all the first communication device; in an actual application scenario, steps 301 to 311 may be jointly implemented by at least two communication devices. For example, steps 301 to 302 and steps 303 to 311 may be performed by different communication devices. Taking the architecture diagram shown in fig. 5 as an example, steps 301 to 302 may be performed by a CPU; if the CPU determines that the first computation graph can be reused, a first compiled code corresponding to the first computation graph may be generated by a compiler, and first information, the first computation graph, and the first compiled code corresponding to the first computation graph are sent to each NPU, where the first information is used to instruct the NPU to implement the one or more first steps corresponding to the first computation graph in a compiled manner. If the CPU determines that the first computation graph cannot be reused, third information and the first computation graph may be sent to each NPU, where the third information is used to instruct the NPU to implement the one or more first steps corresponding to the first computation graph in an interpreted-execution manner. In other application scenarios, the execution bodies of steps 301 to 311 may be allocated in other ways, which are not listed here; the execution body of each of steps 301 to 311 may be flexibly determined in combination with the actual application scenario, which is not limited in the embodiments of the present application.
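The CPU-side decision described above can be sketched as follows; the message field names are illustrative assumptions:

```python
# Sketch of the CPU-side dispatch in the CPU + NPU deployment: if the graph
# is reusable, compile it and send "first information" plus the compiled
# code; otherwise send "third information" so the NPU interprets the graph.
def cpu_dispatch(graph, reusable, compiler):
    if reusable:
        code = compiler(graph)
        # first information: instruct NPUs to run the graph in a compiled manner
        return {"info": "first", "graph": graph, "code": code}
    # third information: instruct NPUs to run the graph by interpreted execution
    return {"info": "third", "graph": graph}

msg = cpu_dispatch("graph1", True, lambda g: "compiled:" + g)
```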
2. Training operation on the first neural network is jointly executed by the cloud end equipment and the terminal equipment
In this embodiment of the present application, in a scenario in which the cloud device and the terminal device jointly execute the training operation on the first neural network, in one implementation manner, the terminal device executes the step of "generating, by the compiler, the first compiled code corresponding to the first computation graph", and a specific implementation manner of the training method for the terminal device to execute the neural network may refer to the description in the embodiment corresponding to fig. 3, which is not described herein.
In another implementation manner, the "first compiled code corresponding to the first calculation map" is sent by the cloud device to the terminal device, specifically, referring to fig. 11, fig. 11 is a further flowchart of a neural network training method provided in the embodiment of the present application, where the neural network training method provided in the embodiment of the present application may include:
1101. a first computational graph is obtained, the first computational graph being one of one or more computational graphs corresponding to an nth round of training of the neural network.
In this embodiment of the present application, the terminal device obtains a first computation graph, where the first computation graph is included in the one or more computation graphs corresponding to the Nth round of training of the first neural network, that is, the first computation graph is one of the one or more computation graphs corresponding to the Nth round of training of the neural network. Specifically, step 1101 may include: the terminal device receives the first computation graph sent by the cloud device; for the manner in which the cloud device generates the first computation graph and the concept of the first computation graph, refer to the description of step 301 in the embodiment corresponding to fig. 3, which is not repeated here. Alternatively, the terminal device generates the first computation graph itself; for this manner, likewise refer to the description of step 301 in the embodiment corresponding to fig. 3, which is not repeated here.
1102. Judging whether the first calculation graph can be reused or not, if not, entering step 1103; if yes, go to step 1104.
In this embodiment, step 1102 is an optional step, and if step 1102 is performed, in one implementation manner, the specific implementation manner of performing step 1102 by the terminal device may refer to the description in step 202 in the corresponding embodiment of fig. 2, which is not described herein. In another implementation manner, the terminal device receives the first calculation graph and fourth information sent by the cloud device, the fourth information indicates whether the first calculation graph can be reused, and the terminal device can determine whether the first calculation graph can be reused according to the received fourth information.
1103. At least one first step in an nth round of training operations of the first neural network is performed by way of explanation execution.
In some embodiments of the present application, after determining that the first calculation map cannot be reused, the terminal device may execute at least one first step in the nth round of training operation of the first neural network by explaining and executing the method, and the specific implementation manner of the executing step 1103 may refer to the description in step 303 in the corresponding embodiment of fig. 3, which is not repeated herein.
It should be noted that, in the optional step 1103, when the terminal device receives the first calculation map and the fourth information sent by the cloud device, the terminal device may also receive the compiled code corresponding to the first calculation map sent by the cloud device, after determining that the first calculation map cannot be reused, the terminal device may execute the compiled code corresponding to the first calculation map sent by the cloud device, and after executing, delete the compiled code corresponding to the first calculation map.
1104. Input data of a first computational graph is obtained.
In this embodiment of the present application, the terminal device may obtain input data of the first computation graph, where the input data may include training samples and the values of the parameters in the first computation graph, and the training samples included in the input data may be obtained by the terminal device from stored data.
As for the specific manner of acquiring the "values of the parameters in the first computation graph": in one implementation, the values of the input parameters in the first computation graph may be sent to the terminal device by the cloud device. In another implementation, the values of the input parameters in the first computation graph may be generated by the terminal device when performing the (N-1)th round of training operation of the first neural network; specifically, the terminal device may determine the values of the parameters in the first computation graph according to the first mapping relation. The first mapping relation may be generated by the cloud device and then sent to the terminal device, or may be generated by the terminal device; for the concept of the "first mapping relation" and the specific manner of generating it, refer to the description in the embodiment corresponding to fig. 3, which is not repeated here.
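A minimal sketch of resolving input parameter values through the first mapping relation follows, under the assumption that the mapping simply records a storage key per parameter; all names are hypothetical:

```python
# Sketch: the "first mapping relation" indicates where the terminal device
# should read each input parameter of the graph, e.g. values produced during
# the (N-1)th round of training.
storage = {"slot_w": [0.1, 0.2]}        # values produced in the previous round
first_mapping = {"w": "slot_w"}         # parameter name -> acquisition location

def resolve_inputs(param_names, mapping, store):
    # Look each parameter up at the location the mapping relation indicates,
    # rather than copying values into the graph explicitly.
    return {name: store[mapping[name]] for name in param_names}

inputs = resolve_inputs(["w"], first_mapping, storage)
```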
1105. Acquire the first compiled code corresponding to the first computation graph from the system and execute the first compiled code corresponding to the first computation graph, where the first compiled code was generated when the Mth round of training of the first neural network was performed.
In some embodiments of the present application, the cloud device may send the first compiled code corresponding to the first computation graph to the terminal device in the round of training in which it determines that the first computation graph can be reused; correspondingly, the terminal device stores the first compiled code corresponding to the first computation graph into the system when the first computation graph can be reused.
After acquiring the input data of the first computation graph, the terminal device may acquire the first compiled code corresponding to the first computation graph from the system and execute the first compiled code corresponding to the first computation graph, and the specific implementation manner of step 1105 may refer to the description in step 311 in the corresponding embodiment of fig. 3, which is not described herein. It should be noted that, the execution sequence of step 1104 and step 1105 is not limited in the embodiments of the present application, steps 1104 and 1105 may be executed alternately, that is, in the process of executing the first compiled code, the input data of the first computation graph may be obtained again, and the execution of the first compiled code may be continued.
In this implementation, when the Nth round of training of the first neural network is performed, after the first computation graph is acquired, because the first compiled code corresponding to the first computation graph was already generated during the Mth round of training of the first neural network, it can be determined that the first compiled code corresponding to the first computation graph is already stored in the system, and the first compiled code is executed directly. That is, in the Nth round of training operation of the first neural network, the operations of converting the first computation graph into an intermediate computation representation and obtaining the first compiled code based on the intermediate computation representation are not required, which saves computer resources.
To better implement the above solutions of the embodiments of the present application, related devices for implementing the above solutions are further provided below on the basis of the embodiments corresponding to fig. 1 to 11. Referring specifically to fig. 12, fig. 12 is a schematic structural diagram of a neural network training apparatus according to an embodiment of the present application. The neural network training apparatus 1200 includes an obtaining module 1201, a determining module 1202, and an executing module 1203. The obtaining module 1201 is configured to obtain a first computation graph, where the first computation graph is included in one or more computation graphs corresponding to an Nth round of training of the neural network, that is, the first computation graph is one of the one or more computation graphs corresponding to the Nth round of training of the neural network, and N is a positive integer. The determining module 1202 is configured to determine that a first compiled code corresponding to the first computation graph is stored in the system, where the first compiled code was generated during an Mth round of training of the neural network, M is a positive integer, and M is less than N. The execution module 1203 is configured to execute the first compiled code.
In one possible design, the execution module 1203 is specifically configured to: acquiring a first mapping relation from the system, wherein the first mapping relation is used for indicating the acquisition position of the input parameters of the first calculation map; determining the value of an input parameter in a first calculation graph in the nth round according to the first mapping relation; and executing the first compiled code according to the value of the input parameter.
In one possible design, referring to fig. 13, fig. 13 is another schematic structural diagram of a training device for a neural network provided in an embodiment of the present application, and the training device 1200 for a neural network further includes: the establishing module 1204 is configured to establish the first mapping relationship if the first mapping relationship does not exist in the system.
In one possible design, the first computational graph is a reusable computational graph.
In one possible design, the determining module 1202 is specifically configured to: perform representation conversion on the first computation graph to obtain the intermediate computation representation (IR) corresponding to the first computation graph; and determine, according to the IR, that the first compiled code is stored in the system.
In one possible design, referring to fig. 13, during the Mth round of training of the neural network, the obtaining module 1201 is further configured to obtain the first computation graph and generate the first compiled code according to the first computation graph; the neural network training apparatus 1200 further includes: a storage module 1205, configured to store the first compiled code in the system.
In one possible design, the determining module 1202 is specifically configured to determine that the first compiled code is stored in the system if a first mapping relationship is stored in the system, where the first mapping relationship is used to indicate an acquisition location of an input parameter of the first computation graph.
In one possible design, referring to fig. 13, the first computational graph corresponds to a first step in an nth round of training of the neural network; the training apparatus 1200 of the neural network further includes: a generating module 1206, configured to generate first output data, where the first output data adopts a first data structure, the first output data includes at least one input data of a second step in a training operation of the neural network, the first data structure is a data structure adopted when the second step in the training operation of the neural network is performed, and the training operation of the neural network includes an nth round of training of the neural network;
and/or,
the execution module 1203 is specifically configured to: at least one input data of the first computational graph is obtained according to a format of a second data structure, wherein the at least one input data of the first computational graph exists in second output data of a third step in training operations of the neural network, and the second data structure is a data structure adopted when the third step in training operations of the neural network is executed.
In one possible design, the storage location of the first output data is coincident with the read location of the at least one input data of the second step; and/or the read position of at least one input data of the first computational graph is consistent with the storage position of the second output data.
In one possible design, referring to fig. 13, the training apparatus 1200 of the neural network further includes: the transmitting module 1207 is configured to transmit the first output data by calling a preset interface, where the second step in the training operation of the neural network includes transmitting the first output data, and the first data structure is a data structure adopted when the transmitting operation of the first output data is performed.
In one possible design, referring to fig. 13, the training apparatus 1200 of the neural network further includes: the splitting module 1208 is configured to perform a splitting operation on the second computational graph according to a preset policy input by a user to obtain one or more computational graphs corresponding to an nth round training of the neural network; alternatively, the training apparatus 1200 of the neural network further includes: a receiving module 1209, configured to receive one or more calculation graphs corresponding to an nth round training of the neural network, which is input by a user.
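The splitting operation performed by the splitting module could be sketched as follows, assuming the preset policy is simply a list of split points over the graph's operator sequence; the policy format and operator names are hypothetical:

```python
# Sketch of splitting the second computation graph into one or more
# computation graphs according to a user-supplied preset policy.
def split_graph(ops, split_points):
    graphs, start = [], 0
    for point in sorted(split_points):
        graphs.append(ops[start:point])   # cut at each policy-given point
        start = point
    graphs.append(ops[start:])
    return [g for g in graphs if g]       # drop empty fragments

ops = ["matmul", "add", "relu", "loss", "grad"]
subgraphs = split_graph(ops, [3])         # policy: split before the loss op
```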
It should be noted that content such as the information interaction and execution processes between the modules/units in the neural network training apparatus 1200 is based on the same concept as the method embodiments corresponding to fig. 3 to 11 in the present application; for specific content, refer to the descriptions in the foregoing method embodiments of the present application, which are not repeated here.
Next, a communication device provided in the embodiments of the present application is described. The communication device is configured to perform the neural network training method provided in the present application. In one application scenario, the communication device may be a server; referring specifically to fig. 14, fig. 14 is a schematic structural diagram of the communication device provided in the embodiments of the present application. Specifically, the communication device is implemented by one or more servers. The communication device 1400 may vary considerably depending on configuration or performance, and may include one or more central processing units (central processing units, CPU) 1422 (for example, one or more processors), a memory 1432, and one or more storage media 1430 (for example, one or more mass storage devices) storing an application 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transitory or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations for the communication device. Further, the central processor 1422 may be configured to communicate with the storage medium 1430 to perform, on the communication device 1400, the series of instruction operations in the storage medium 1430.
The communication device 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In the embodiment of the present application, the central processor 1422 is configured to perform the neural network training method performed by the communication device in the embodiments corresponding to fig. 3 to 10. It should be noted that the specific manner in which the central processor 1422 performs the foregoing steps is based on the same concept as the method embodiments corresponding to fig. 3 to 10 in the present application, and brings the same technical effects; for specific content, refer to the descriptions in the foregoing method embodiments of the present application, which are not repeated here.
In another application scenario, the communication device may be represented as a terminal device, referring to fig. 15, fig. 15 is a schematic structural diagram of the communication device provided in the embodiment of the present application, and the communication device may be specifically represented as a virtual reality VR device, a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a monitoring data processing device, or a radar data processing device, which is not limited herein. Specifically, the communication device includes: a receiver 1501, a transmitter 1502, a processor 1503 and a memory 1504 (where the number of processors 1503 in the communication device may be one or more, one processor is exemplified in fig. 15), wherein the processor 1503 may include an application processor 15031 and a communication processor 15032. In some embodiments of the present application, the receiver 1501, transmitter 1502, processor 1503, and memory 1504 may be connected by a bus or other means.
The memory 1504 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1503. A portion of the memory 1504 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1504 stores processor-executable operating instructions, executable modules, or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1503 controls the operation of the communication device. In a specific application, the various components of the communication device are coupled together by a bus system that may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the embodiments of the present application may be applied to the processor 1503 or implemented by the processor 1503. The processor 1503 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 1503 or by instructions in the form of software. The processor 1503 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor 1503 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied as being performed directly by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1504, and the processor 1503 reads the information in the memory 1504 and completes the steps of the above method in combination with its hardware.
The receiver 1501 may be used to receive input digital or character information and to generate signal inputs related to the relevant settings and function control of the communication device. The transmitter 1502 may be used to output numeric or character information through a first interface; the transmitter 1502 may also be configured to send instructions to the disk set through the first interface to modify data in the disk set; the transmitter 1502 may also include a display device such as a display screen.
In this embodiment, in one case, the processor 1503 is configured to execute the training method of the neural network executed by the terminal device in the corresponding embodiment of fig. 11. It should be noted that, the specific manner in which the application processor 15031 in the processor 1503 executes the foregoing steps is based on the same concept as that of the method embodiment corresponding to fig. 11 in the present application, and the technical effects brought by the specific manner are the same as those of the method embodiment corresponding to fig. 11 in the present application, and details of the specific manner may be referred to the descriptions in the method embodiment shown in the foregoing application, which are not repeated herein.
There is also provided in an embodiment of the present application a computer-readable storage medium in which a program for performing signal processing is stored, which when run on a computer causes the computer to perform the steps performed by the communication device in the method described in the embodiment shown in the foregoing fig. 3 to 10, or causes the computer to perform the steps performed by the terminal device in the method described in the embodiment shown in the foregoing fig. 11.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps performed by the communication device in the method described in the embodiments shown in fig. 3 to 10, or causes the computer to perform the steps performed by the terminal device in the method described in the embodiment shown in fig. 11.
The first communication device or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, pins or circuitry, etc. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip performs the training method of the neural network described in the embodiments shown in fig. 3 to 11. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit in the wireless access device side located outside the chip, such as a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a random access memory (random access memory, RAM), etc.
Specifically, referring to fig. 16, fig. 16 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be represented as a neural network processor NPU 160. The NPU 160 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is the arithmetic circuit 1603; the controller 1604 controls the arithmetic circuit 1603 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 1603 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 1603 is a two-dimensional systolic array. The arithmetic circuit 1603 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1603 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1602 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 1601 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 1608.
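The partial-sum accumulation described above can be illustrated with a plain Python sketch. This is a functional model only, not the NPU's instruction set: the real arithmetic circuit 1603 holds B stationary across its PEs and performs these loops in parallel, with partial sums landing in the accumulator 1608.

```python
def matmul_accumulate(A, B):
    """Multiply A (m x k) by B (k x n). B plays the role of the weight
    matrix cached on the PEs; rows of A are streamed in; `acc` plays the
    role of the accumulator that collects partial results."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for t in range(k):      # partial products accumulate here
                acc += A[i][t] * B[t][j]
            C[i][j] = acc           # final result written out of the accumulator
    return C
```

For instance, multiplying a 2x2 input matrix by a 2x2 weight matrix with this sketch reproduces the standard matrix product.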
Unified memory 1606 is used to store input data and output data. The weight data is carried into the weight memory 1602 directly through a direct memory access controller (Direct Memory Access Controller, DMAC) 1605. The input data is also carried into the unified memory 1606 through the DMAC.
A bus interface unit (Bus Interface Unit, BIU) 1610 is used for interaction between the AXI bus on one side and the DMAC and the instruction fetch memory (Instruction Fetch Buffer, IFB) 1609 on the other.

Specifically, the bus interface unit 1610 is used by the instruction fetch memory 1609 to obtain instructions from an external memory, and is further used by the storage unit access controller 1605 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1606 or to transfer weight data to the weight memory 1602 or to transfer input data to the input memory 1601.
The vector calculation unit 1607 includes a plurality of operation processing units, and, when necessary, performs further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. The vector calculation unit 1607 is mainly used for non-convolution/fully-connected-layer network calculation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 1607 can store the processed output vector to the unified memory 1606. For example, the vector calculation unit 1607 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 1603, for example performing linear interpolation on the feature planes extracted by the convolution layers, or applying an activation function to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1607 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1603, for example for use in subsequent layers of the neural network.
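The post-processing described here can be modeled in a few lines of Python. This is a hedged functional sketch, not the vector unit's actual operation set: it assumes a bias-add followed by a ReLU-style nonlinearity applied to the arithmetic circuit's accumulated outputs.

```python
def vector_postprocess(acc_outputs, bias):
    """Model of the vector calculation unit's element-wise stage:
    add a per-element bias to the accumulator outputs, then apply a
    nonlinear activation (ReLU assumed here) to produce activation
    values for the next layer."""
    assert len(acc_outputs) == len(bias)
    return [max(0.0, x + b) for x, b in zip(acc_outputs, bias)]
```

In hardware the same element-wise pass runs over whole vectors at once; the sketch only shows the arithmetic.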
An instruction fetch memory (instruction fetch buffer) 1609 connected to the controller 1604 is used for storing instructions used by the controller 1604.

The unified memory 1606, the input memory 1601, the weight memory 1602, and the instruction fetch memory 1609 are all on-chip memories. The external memory is a memory external to the NPU hardware architecture.
The operation corresponding to the first calculation map may be performed by the operation circuit 1603 or the vector calculation unit 1607.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the program of the method of the first aspect.
It should be further noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, the connection relationship between modules indicates that the modules have communication connections between them, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general-purpose hardware, or of course by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is in many cases the preferred embodiment. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a communication device, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, communication device, or data center to another website, computer, communication device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a communication device or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

Claims (25)

1. A method of training a neural network, wherein, in performing an Nth round of training of the neural network, the method comprises:
acquiring a first computational graph, wherein the first computational graph is one of one or more computational graphs corresponding to the Nth round of training of the neural network, and N is a positive integer;
determining that a first compiled code corresponding to the first computational graph is stored in a system, wherein the first compiled code is generated in performing an Mth round of training of the neural network, M is a positive integer, and M is less than N; and
executing the first compiled code.
2. The method of claim 1, wherein the executing the first compiled code comprises:
acquiring a first mapping relation from the system, wherein the first mapping relation is used for indicating the acquisition position of the input parameters of the first computational graph;
determining the values of the input parameters of the first computational graph in the Nth round according to the first mapping relation; and
executing the first compiled code according to the values of the input parameters.
3. The method of claim 2, further comprising, prior to the acquiring of the first mapping relation:
if the first mapping relation does not exist in the system, establishing the first mapping relation.
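Claims 1-3 together describe a cache of compiled code keyed by computational graph, with a stored mapping that tells round N where to fetch each input parameter. The following is a minimal Python sketch of that flow under stated assumptions; the names `run_graph`, `compiled_cache`, and `input_mappings` are illustrative, not from the patent, and compiled code is modeled as a plain callable.

```python
compiled_cache = {}   # graph key -> compiled code (modeled as a callable)
input_mappings = {}   # graph key -> mapping: input parameter -> fetch location

def run_graph(graph_key, compile_fn, fetch_env):
    """Execute one computational graph in round N.

    If compiled code produced in an earlier round M is already stored,
    reuse it instead of recompiling (claim 1); input values are resolved
    through the stored mapping relation (claim 2), which is established
    on first use if absent (claim 3)."""
    if graph_key not in compiled_cache:            # round M: compile once, store
        compiled_cache[graph_key] = compile_fn()
    if graph_key not in input_mappings:            # claim 3: establish the mapping
        input_mappings[graph_key] = {name: name for name in fetch_env}
    mapping = input_mappings[graph_key]
    # claim 2: fetch each input parameter's value from its recorded location
    args = {param: fetch_env[loc] for param, loc in mapping.items()}
    return compiled_cache[graph_key](**args)       # execute the stored code
```

In round N the compile step is skipped entirely; only the input lookup and the stored code run, which is the saving the claims target.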
4. A method according to any of claims 1-3, wherein the first computational graph is a reusable computational graph.
5. The method of any of claims 1-4, wherein the determining that the first compiled code corresponding to the first computational graph is stored in the system comprises:
performing expression conversion on the first computational graph to obtain an intermediate representation (IR) corresponding to the first computational graph; and
determining, according to the IR, that the first compiled code is stored in the system.
6. The method of any of claims 1-5, wherein, in performing an Mth round of training of the neural network, the method further comprises:
acquiring the first computational graph, and generating the first compiled code according to the first computational graph; and
storing the first compiled code in the system.
7. The method of any of claims 1-4, wherein the determining that the first compiled code corresponding to the first computational graph is stored in the system comprises:
if a first mapping relation is stored in the system, determining that the first compiled code is stored in the system, wherein the first mapping relation is used for indicating the acquisition position of the input parameters of the first computational graph.
8. The method of any of claims 1-7, wherein the first computational graph corresponds to a first step in the Nth round of training of the neural network;
after the executing the first compiled code, the method further comprises: generating first output data, wherein the first output data adopts a first data structure, the first output data comprises at least one input data of a second step in the training operation of the neural network, the first data structure is a data structure adopted when the second step in the training operation of the neural network is executed, and the training operation of the neural network comprises an N-th round of training of the neural network;
and/or,
the executing the first compiled code, comprising: at least one input data of the first computational graph is obtained according to a format of a second data structure, wherein the at least one input data of the first computational graph exists in second output data of a third step in training operations of the neural network, and the second data structure is a data structure adopted when the third step in training operations of the neural network is executed.
9. The method of claim 8, wherein
the storage location of the first output data is consistent with the read location of the at least one input data of the second step; and/or,
the read location of the at least one input data of the first computational graph is consistent with the storage location of the second output data.
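Claims 8 and 9 describe the producing step writing its output in the data structure and at the location that the consuming step reads from, which removes format conversion and copying between steps. A deliberately tiny Python sketch of that idea (illustrative only; `SharedSlot` is a hypothetical name):

```python
class SharedSlot:
    """One agreed-upon location shared by two adjacent steps: the
    producer's storage location is by construction the consumer's
    read location, so the data passes through untouched."""
    def __init__(self):
        self.buf = None

    def write(self, data):    # storage location of the first output data
        self.buf = data

    def read(self):           # read location of the next step's input data
        return self.buf
```

Because `read` returns the very object `write` stored, no copy or re-serialization happens between the two steps.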
10. The method of claim 8, wherein the method further comprises:
and sending the first output data in a mode of calling a preset interface, wherein the second step in the training operation of the neural network comprises sending the first output data, and the first data structure is a data structure adopted when the sending operation of the first output data is executed.
11. The method of any of claims 1-10, wherein, prior to the acquiring of the first computational graph, the method further comprises:
performing a splitting operation on a second computational graph according to a preset strategy input by a user, to obtain the one or more computational graphs corresponding to the Nth round of training of the neural network; or,
receiving the one or more computational graphs corresponding to the Nth round of training of the neural network that are input by a user.
12. A training apparatus for a neural network, wherein, in performing an Nth round of training of the neural network, the apparatus comprises:
an acquisition module, configured to acquire a first computational graph, wherein the first computational graph is one of one or more computational graphs corresponding to the Nth round of training of the neural network, and N is a positive integer;
a determining module, configured to determine that a first compiled code corresponding to the first computational graph is stored in a system, wherein the first compiled code is generated in performing an Mth round of training of the neural network, M is a positive integer, and M is less than N; and
an execution module, configured to execute the first compiled code.
13. The apparatus according to claim 12, wherein the execution module is specifically configured to:
acquire a first mapping relation from the system, wherein the first mapping relation is used for indicating the acquisition position of the input parameters of the first computational graph;
determine the values of the input parameters of the first computational graph in the Nth round according to the first mapping relation; and
execute the first compiled code according to the values of the input parameters.
14. The apparatus of claim 13, wherein the apparatus further comprises:
an establishing module, configured to establish the first mapping relation if the first mapping relation does not exist in the system.
15. The apparatus of any of claims 12-14, wherein the first computational graph is a reusable computational graph.
16. The apparatus of any of claims 12-15, wherein the determining module is specifically configured to:
perform expression conversion on the first computational graph to obtain an intermediate representation (IR) corresponding to the first computational graph; and
determine, according to the IR, that the first compiled code is stored in the system.
17. The apparatus of any of claims 12-16, wherein the acquisition module, in performing an Mth round of training of the neural network, is further configured to acquire the first computational graph and generate the first compiled code according to the first computational graph; and
the apparatus further comprises: a storage module, configured to store the first compiled code in the system.
18. The apparatus of any of claims 12-15, wherein
the determining module is specifically configured to: if a first mapping relation is stored in the system, determine that the first compiled code is stored in the system, wherein the first mapping relation is used for indicating the acquisition position of the input parameters of the first computational graph.
19. The apparatus of any of claims 12-18, wherein the first computational graph corresponds to a first step in the Nth round of training of the neural network;
the apparatus further comprises: a generation module, configured to generate first output data, where the first output data adopts a first data structure, the first output data includes at least one input data of a second step in a training operation of the neural network, the first data structure is a data structure adopted when the second step in the training operation of the neural network is performed, and the training operation of the neural network includes an nth round of training of the neural network;
and/or,
the execution module is specifically configured to: at least one input data of the first computational graph is obtained according to a format of a second data structure, wherein the at least one input data of the first computational graph exists in second output data of a third step in training operations of the neural network, and the second data structure is a data structure adopted when the third step in training operations of the neural network is executed.
20. The apparatus of claim 19, wherein
the storage location of the first output data is consistent with the read location of the at least one input data of the second step; and/or,
the read location of the at least one input data of the first computational graph is consistent with the storage location of the second output data.
21. The apparatus of claim 19, wherein the apparatus further comprises:
a sending module, configured to send the first output data by calling a preset interface, wherein the second step in the training operation of the neural network includes sending the first output data, and the first data structure is a data structure adopted when the sending operation of the first output data is executed.
22. The apparatus of any of claims 12-21, wherein
the apparatus further comprises: a splitting module, configured to perform a splitting operation on a second computational graph according to a preset strategy input by a user, to obtain the one or more computational graphs corresponding to the Nth round of training of the neural network; or,
the apparatus further comprises: a receiving module, configured to receive the one or more computational graphs corresponding to the Nth round of training of the neural network that are input by a user.
23. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program which, when run on a computer, causes the computer to perform the method of any of claims 1 to 11.
24. A communication device comprising a processor and a memory, the processor being coupled to the memory,
the memory is used for storing programs;
the processor configured to execute a program in the memory, so that the communication device performs the method according to any one of claims 1 to 11.
25. A computer program product, characterized in that the computer program product comprises a program which, when run on a computer, causes the computer to perform the method according to any of claims 1 to 11.
CN202211391730.XA 2022-07-22 2022-11-08 Neural network training method and related equipment Pending CN117474067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/099689 WO2024016894A1 (en) 2022-07-22 2023-06-12 Method for training neural network and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210871003 2022-07-22
CN2022108710037 2022-07-22

Publications (1)

Publication Number Publication Date
CN117474067A (en) 2024-01-30

Family

ID=89624387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211391730.XA Pending CN117474067A (en) 2022-07-22 2022-11-08 Neural network training method and related equipment

Country Status (1)

Country Link
CN (1) CN117474067A (en)


Legal Events

Date Code Title Description
PB01 Publication