WO2022068314A1 - Method for training neural network, method for compressing neural network, and related devices

Info

Publication number
WO2022068314A1
WO2022068314A1 PCT/CN2021/105927 CN2021105927W WO2022068314A1 WO 2022068314 A1 WO2022068314 A1 WO 2022068314A1 CN 2021105927 W CN2021105927 W CN 2021105927W WO 2022068314 A1 WO2022068314 A1 WO 2022068314A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
neural network
feature extraction
pieces
feature
Prior art date
Application number
PCT/CN2021/105927
Other languages
English (en)
French (fr)
Inventor
孟笑君 (Meng Xiaojun)
王雅圣 (Wang Yasheng)
张正彦 (Zhang Zhengyan)
岂凡超 (Qi Fanchao)
刘知远 (Liu Zhiyuan)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
清华大学 (Tsinghua University)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 清华大学 (Tsinghua University)
Publication of WO2022068314A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a method for training a neural network, a method for compressing a neural network, and related devices.
  • artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • in other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
  • artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • text processing based on deep learning neural networks is a common application of artificial intelligence, and natural language processing (NLP) is one of its most important fields.
  • however, text processing models based on the Transformer structure are usually relatively large, which results in large occupied storage space and slow inference speed; therefore, a neural network compression scheme is urgently needed.
  • embodiments of the present application provide a method for training a neural network, a method for compressing a neural network, and related devices. A first neural network for performing a pruning operation on a first feature extraction network is trained with a first loss function, so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, which ensures that the feature expression capabilities of the feature extraction network before and after pruning are similar, thereby ensuring the performance of the pruned feature extraction network.
  • in a first aspect, the embodiments of the present application provide a method for training a neural network, which can be used in the field of artificial intelligence.
  • the method may include: the training device inputs first training data into the first feature extraction network, and obtains N pieces of first feature information corresponding to the first training data output by the first feature extraction network, where N is an integer greater than 1; according to the N pieces of first feature information, first distribution information is calculated, and the first distribution information is used to indicate the data distribution law of the N pieces of first feature information.
  • the training device performs a pruning operation on the first feature extraction network through the first neural network, and obtains a pruned first feature extraction network; it then inputs the first training data into the pruned first feature extraction network, obtains N pieces of second feature information output by the pruned first feature extraction network, and calculates second distribution information according to the N pieces of second feature information, where the second distribution information is used to indicate the data distribution law of the N pieces of second feature information.
  • the training device performs a training operation on the first neural network according to a first loss function to obtain a second neural network, where the second neural network is the first neural network on which the training operation has been performed, and the first loss function indicates the similarity between the first distribution information and the second distribution information. that is, the goal of iterative training is to reduce the difference between the first distribution information and the second distribution information; the similarity reflects the degree of difference between the first distribution information and the second distribution information, and can also be expressed as the distance between the first distribution information and the second distribution information.
  • the aforementioned distance can be a KL divergence distance, a cross-entropy distance, a Euclidean distance, a Mahalanobis distance, a cosine distance, or another type of distance. it should be noted that, in the process of training the first neural network, the weight parameters of the first feature extraction network are not modified; a minimal sketch of this training procedure is given below.
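The following Python sketch illustrates one training iteration as described above; it is a minimal illustration, not the patent's implementation. The names `feature_net` (the frozen first feature extraction network), `pruner` (the first neural network), `pruner.prune` (assumed to apply a differentiable pruning mask to the feature net), and `distribution_info` are all assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def distribution_info(features):
    # features: [N, d]; pairwise similarities as one possible form of distribution information
    f = F.normalize(features, dim=-1)
    return f @ f.t()  # [N, N]

def train_step(pruner, feature_net, batch, optimizer):
    with torch.no_grad():                          # the feature net's weights stay frozen
        feats_before = feature_net(batch)          # N pieces of first feature information
    dist_before = distribution_info(feats_before)  # first distribution information

    pruned_net = pruner.prune(feature_net)         # hypothetical: pruning via a differentiable mask
    feats_after = pruned_net(batch)                # N pieces of second feature information
    dist_after = distribution_info(feats_after)    # second distribution information

    # first loss function: distance (here, KL divergence) between the two distributions
    loss = F.kl_div(F.log_softmax(dist_after, dim=-1),
                    F.softmax(dist_before, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()                                # gradients reach only the pruner's weights
    optimizer.step()
    return loss.item()
```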
  • in the embodiments of the present application, a method for training a neural network used to perform a pruning operation on the first feature extraction network is provided, and the first neural network on which the training operation has been performed can be used to prune the first feature extraction network; that is, a compression scheme for the neural network is provided. in addition, the first loss function is used to train the first neural network, so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, which ensures that the feature expression capabilities of the feature extraction network before and after pruning are similar, thereby ensuring the performance of the pruned feature extraction network.
  • in addition, the first feature extraction network can not only be a feature extraction network of the Transformer structure, but can also be a feature extraction network of neural networks such as recurrent neural networks or convolutional neural networks, which expands the application scenarios of this scheme.
  • the first distribution information includes the value of the distance between any two pieces of first feature information among the N pieces of first feature information, so as to indicate the data distribution law of the N pieces of first feature information; the second distribution information includes the value of the distance between any two pieces of second feature information among the N pieces of second feature information, so as to indicate the data distribution law of the N pieces of second feature information. that is, the distribution law of one piece of feature information among the N pieces of first feature information is reflected by the values of the distances between that piece of feature information and each of the N pieces of first feature information, and the distribution law of one piece of feature information among the N pieces of second feature information is likewise represented by the values of the distances between that piece of feature information and each of the N pieces of second feature information.
  • in the embodiments of the present application, the data distribution law of the N pieces of feature information is determined by calculating the distance between any two pieces of feature information among the N pieces, which provides an implementation of the data distribution law of the N pieces of feature information, is simple to operate, and is easy to implement, as sketched below.
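A minimal sketch of this pairwise computation, assuming the N pieces of feature information are stacked into an [N, d] tensor; the helper name `pairwise_distance_info` and the choice of metrics are illustrative only.

```python
import torch
import torch.nn.functional as F

def pairwise_distance_info(features, metric="euclidean"):
    # features: [N, d] tensor, one row per piece of feature information
    if metric == "euclidean":
        return torch.cdist(features, features)   # [N, N] pairwise Euclidean distances
    if metric == "cosine":
        f = F.normalize(features, dim=-1)
        return 1.0 - f @ f.t()                   # [N, N] pairwise cosine distances
    raise ValueError(f"unsupported metric: {metric}")
```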
  • in one implementation, the N pieces of first feature information include third feature information and fourth feature information, where the third feature information and the fourth feature information are each any one piece of the N pieces of first feature information. the training device calculating the first distribution information according to the N pieces of first feature information may include: the training device directly calculates the cosine distance, Euclidean distance, Manhattan distance, Mahalanobis distance, first-order distance, or cross-entropy distance between the third feature information and the fourth feature information, and determines it as the distance between the third feature information and the fourth feature information.
  • in another implementation, the N pieces of first feature information include third feature information, where the third feature information is any one piece of the N pieces of first feature information. the training device calculating the first distribution information according to the N pieces of first feature information may include: the training device calculates the first distance between the third feature information and each piece of first feature information among the N pieces of first feature information, and obtains the sum of the first distances between the third feature information and all the first feature information, where the aforementioned first distance refers to a cosine distance, Euclidean distance, Manhattan distance, Mahalanobis distance, first-order distance, or cross-entropy distance.
  • the training device calculates a second distance between the third feature information and the fourth feature information, where the aforementioned second distance likewise refers to a cosine, Euclidean, Manhattan, Mahalanobis, first-order, or cross-entropy distance; the training device then determines the ratio between the second distance and the sum of all the first distances as the distance between the third feature information and the fourth feature information, as sketched below.
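A sketch of the normalized distance just described: the distance between the third and fourth feature information is the raw second distance divided by the sum of the first distances. Euclidean distance stands in here for any of the listed metrics.

```python
import torch

def normalized_distance_info(features):
    # features: [N, d]
    d = torch.cdist(features, features)        # raw pairwise distances, [N, N]
    row_sums = d.sum(dim=1, keepdim=True)      # sum of first distances for each feature
    # ratio of the second distance to the sum of all first distances
    return d / row_sums.clamp_min(1e-12)
```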
  • in another implementation, the first distribution information includes the value of the distance between each piece of the N pieces of first feature information and preset feature information, so as to indicate the data distribution law of the N pieces of first feature information; the second distribution information includes the value of the distance between each piece of the N pieces of second feature information and the preset feature information, so as to indicate the data distribution law of the N pieces of second feature information.
  • the preset feature information has the same shape as the first feature information and the second feature information. here, having the same shape means that the preset feature information and the first feature information are both M-dimensional tensors whose corresponding dimensions have the same size, where M is an integer greater than or equal to 1; that is, for a first dimension, which is any one of the M dimensions of the first feature information, the size of the corresponding second dimension, which is the same dimension among the M dimensions of the preset feature information, is identical.
  • for example, if the first feature information or the second feature information is a vector including m elements, the preset feature information may be a vector including m zeros, or a vector including m ones, as in the sketch below.
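A sketch of this variant, assuming each piece of feature information is a vector of m elements; the preset vector of zeros or ones follows the example above, and the helper name is illustrative.

```python
import torch

def preset_distance_info(features, preset="zeros"):
    # features: [N, m]; the preset feature information has the same shape as one feature
    m = features.shape[1]
    ref = torch.zeros(m) if preset == "zeros" else torch.ones(m)
    return torch.norm(features - ref, dim=1)   # [N] distances to the preset feature information
```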
  • in one possible implementation, the first feature extraction network is a feature extraction network in a neural network with a Transformer structure, and the first feature extraction network includes at least two attention heads.
  • the training device performing a pruning operation on the first feature extraction network through the first neural network to obtain a pruned first feature extraction network may include: the training device performs, through the first neural network, a pruning operation on each of the at least two attention heads included in the first feature extraction network, and constructs a pruned first feature extraction network according to the at least one attention head still retained after pruning.
  • the number of attention heads included in the pruned first feature extraction network is less than the number of attention heads included in the first feature extraction network.
  • since pruning attention heads has little influence on the performance of the feature extraction network of a Transformer-structure neural network, the first feature extraction network is chosen here to be the feature extraction network of a neural network with the Transformer structure, and the attention heads in the first feature extraction network are pruned, so as to preserve the performance of the pruned first feature extraction network as much as possible.
  • the training device performing a pruning operation on the at least two attention heads included in the first feature extraction network through the first neural network may include: the training device generates, through the first neural network, a first score for each of the at least two attention heads, and performs a pruning operation on the at least two attention heads according to the at least two first scores corresponding to the at least two attention heads.
  • the first score of an attention head represents the importance of that attention head, and is used to indicate whether the attention head is pruned: among the attention heads included in the first feature extraction network, the attention heads with a high degree of importance will be preserved, and the less important attention heads will be pruned.
  • in the embodiments of the present application, the first score of each attention head is generated by the first neural network, and whether an attention head will be pruned is then determined according to its score, which is simple to operate and easy to implement.
  • the value of the first score is a first preset value or a second preset value, where the first preset value and the second preset value are different. taking a first attention head, which is any one of the at least two attention heads, as an example: when the value of the first score of the first attention head is the first preset value, the first attention head will be retained; when the value of the first score of the first attention head is the second preset value, the first attention head will be pruned.
  • the training device generating, through the first neural network, a first score for each of the at least two attention heads may include: the training device inputs each of the at least two attention heads into the first neural network, and obtains a second score of each attention head output by the first neural network, where the second score can be a continuous score.
  • in a possible implementation manner of the first aspect, the generation process of the second score of a first attention head among the at least two attention heads is as follows: the training device inputs the attention matrices corresponding to the first attention head into the first neural network according to the self-attention mechanism; that is, it performs the self-attention operation according to a set of attention matrices corresponding to the first attention head, and then inputs the operation result into the first neural network to obtain the second score of the first attention head output by the first neural network.
  • the training device performs discretization processing on the second score to obtain the first score, where the process of the discretization is differentiable.
  • in the embodiments of the present application, the process of generating the first score of each attention head is differentiable, so the process of reversely updating the weight parameters of the first neural network using the first loss function is also continuous; this makes the update process of the weight parameters of the first neural network more rigorous, improves the training efficiency of the first neural network, and is also beneficial to obtaining a first neural network with a higher accuracy rate. a sketch of such a differentiable discretization follows.
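The patent does not name a specific discretization technique here; the sketch below uses a straight-through estimator, one common way to make a hard 0/1 decision while keeping the process differentiable.

```python
import torch

def discretize(second_scores, threshold=0.5):
    # second_scores: continuous scores, one per attention head
    hard = (second_scores > threshold).float()   # first scores: 1 (retain) or 0 (prune)
    # forward pass uses the hard values; backward pass uses the continuous scores'
    # gradient, so updating the first neural network stays differentiable
    return hard + second_scores - second_scores.detach()

second_scores = torch.sigmoid(torch.randn(12, requires_grad=True))  # e.g. 12 heads
first_scores = discretize(second_scores)   # 0/1 mask deciding which heads survive
```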
  • the first training data includes N sentences, one piece of first feature information is the feature information of one sentence among the N sentences, and one piece of second feature information is also the feature information of one sentence among the N sentences.
  • alternatively, the first training data is one sentence that includes N words, one piece of first feature information is the feature information of one word among the N words, and one piece of second feature information is the feature information of one word among the N words.
  • in the embodiments of the present application, two representations of the N pieces of first feature information are provided, which improves the implementation flexibility of this solution. if one piece of first feature information is the feature information of one sentence among N sentences, the difficulty of the training process is increased, which is beneficial to improving the accuracy of the final first feature extraction network; if one piece of first feature information is the feature information of one word among N words, feature extraction only needs to be performed on one sentence to complete one training iteration of the first neural network, which is beneficial to improving the efficiency of the training process of the first neural network.
  • the first neural network is any one of the following neural networks: a convolutional neural network, a recurrent neural network, a residual neural network, or a fully connected neural network.
  • the method may further include: the training device obtains a final pruned first feature extraction network.
  • specifically, after the training device determines that the function value of the first loss function satisfies the convergence condition, the first neural network will not be trained again, and the training device can take the pruned first feature extraction network generated during the last training iteration by the first neural network (also called the second neural network) as the final pruned first feature extraction network to be output.
  • in a second aspect, an embodiment of the present application provides a method for compressing a neural network, the method including: an execution device obtains a second feature extraction network; the execution device prunes the second feature extraction network through a second neural network to obtain a pruned second feature extraction network, where the second neural network is a neural network on which the training operation has been performed.
  • the second neural network is obtained by training according to the first loss function, where the first loss function indicates the similarity between the first distribution information and the second distribution information; the first distribution information is used to indicate the data distribution law of the N pieces of first feature information, which are obtained by inputting the first training data into the first feature extraction network; the second distribution information is used to indicate the data distribution law of the N pieces of second feature information, which are obtained by inputting the first training data into the pruned first feature extraction network.
  • the second neural network is obtained through training by a training device, and the execution device and the training device may be the same device.
  • the neural network structures of the first feature extraction network and the second feature extraction network may be identical, that is, the neural network layers included in the first feature extraction network and the second feature extraction network are identical.
  • the neural network structures of the first feature extraction network and the second feature extraction network may also be different.
  • the number of attention heads included in a multi-head attention layer of the second feature extraction network may be the same as the number of attention heads included in a multi-head attention layer of the first feature extraction network.
  • the first distribution information includes a value of a distance between any two pieces of first characteristic information in the N pieces of first characteristic information, so as to indicate a data distribution rule of the N pieces of first characteristic information ;
  • the second distribution information includes the value of the distance between any two pieces of second characteristic information in the N pieces of second characteristic information, so as to indicate the data distribution law of the N pieces of second characteristic information.
  • in one possible implementation, the second feature extraction network is trained by means of pre-training and fine-tuning, and pruning the second feature extraction network through the second neural network includes: before fine-tuning the second feature extraction network, using the second neural network to prune the second feature extraction network on which the pre-training operation has been performed.
  • in the embodiments of the present application, pruning the feature extraction network in the pre-training stage can not only compress the feature extraction network, reduce the storage space it occupies, and improve its efficiency in the inference stage, but can also improve the efficiency of the fine-tuning stage, thereby improving the efficiency of the training process of the feature extraction network.
  • the first feature extraction network is a feature extraction network in a neural network with a Transformer structure, and the first feature extraction network includes at least two attention heads.
  • in one possible implementation, the execution device pruning the first feature extraction network through the second neural network may include: performing a pruning operation on the at least two attention heads included in the first feature extraction network to obtain a pruned first feature extraction network, where the number of attention heads included in the pruned first feature extraction network is less than the number of attention heads included in the first feature extraction network.
  • the execution device performing a pruning operation on the at least two attention heads included in the first feature extraction network through the second neural network may include: the execution device generates, through the second neural network, a first score for each of the at least two attention heads, where the first score of an attention head is used to indicate whether that attention head is pruned; and the execution device performs a pruning operation on the at least two attention heads according to the at least two first scores corresponding to the at least two attention heads.
  • the execution device generating, through the second neural network, a first score for each of the at least two attention heads may include: the execution device inputs each of the at least two attention heads into the second neural network, and obtains the second score of each attention head output by the second neural network; the second score is then discretized to obtain the first score, where the process of discretization is differentiable.
  • the execution device in the second aspect of the embodiments of the present application may also perform the steps in the various possible implementations of the first aspect. for the specific implementation steps, the meanings of terms, and the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the various possible implementations in the first aspect, which are not repeated here.
  • in a third aspect, the embodiments of the present application provide a neural network training device, which can be used in the field of artificial intelligence.
  • the neural network training device includes: an input module, configured to input the first training data into the first feature extraction network to obtain N pieces of first feature information corresponding to the first training data output by the first feature extraction network, where N is an integer greater than 1; a calculation module, configured to calculate the first distribution information according to the N pieces of first feature information, where the first distribution information is used to indicate the data distribution law of the N pieces of first feature information; a pruning module, configured to perform a pruning operation on the first feature extraction network through the first neural network to obtain the pruned first feature extraction network, where the input module is further configured to input the first training data into the pruned first feature extraction network to obtain N pieces of second feature information, and the calculation module is further configured to calculate the second distribution information according to the N pieces of second feature information; and a training module, configured to perform a training operation on the first neural network according to the first loss function to obtain the second neural network, where the first loss function indicates the similarity between the first distribution information and the second distribution information.
  • the training device in the third aspect of the embodiments of the present application may also perform the steps in the various possible implementations of the first aspect. for the specific implementation steps of the third aspect and its possible implementations, and for the beneficial effects brought by each possible implementation, reference may be made to the descriptions of the various possible implementations in the first aspect, which are not repeated here.
  • in a fourth aspect, an embodiment of the present application provides a neural network compression device, which can be used in the field of artificial intelligence.
  • the device includes: an acquisition module, configured to acquire a second feature extraction network; and a pruning module, configured to prune the second feature extraction network through a second neural network to obtain a pruned second feature extraction network.
  • the second neural network is obtained by training according to the first loss function, where the first loss function indicates the similarity between the first distribution information and the second distribution information; the first distribution information is used to indicate the data distribution law of the N pieces of first feature information, which are obtained by inputting the first training data into the first feature extraction network; the second distribution information is used to indicate the data distribution law of the N pieces of second feature information, which are obtained by inputting the first training data into the pruned first feature extraction network.
  • the compression device in the fourth aspect of the embodiments of the present application may also perform the steps in the various possible implementations of the second aspect.
  • an embodiment of the present application provides a training device, which may include a processor, where the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method for training a neural network described in the first aspect above is implemented.
  • an embodiment of the present application provides an execution device, which may include a processor, where the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method for compressing a neural network described in the second aspect above is implemented.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is run on a computer, the computer is caused to execute the method for training a neural network described in the first aspect above, or to execute the method for compressing a neural network described in the second aspect above.
  • an embodiment of the present application provides a circuit system, where the circuit system includes a processing circuit, and the processing circuit is configured to execute the method for training a neural network described in the first aspect above, or to execute the method for compressing a neural network described in the second aspect above.
  • an embodiment of the present application provides a computer program that, when run on a computer, enables the computer to execute the method for training a neural network described in the first aspect above, or to execute the method for compressing a neural network described in the second aspect above.
  • an embodiment of the present application provides a chip system, where the chip system includes a processor configured to implement the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing necessary program instructions and data of the server or the communication device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of an artificial intelligence main framework provided by an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a feature extraction network in a neural network of the Transformer structure provided by an embodiment of the present application;
  • FIG. 3 is a system architecture diagram of a neural network compression system provided by an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of a neural network training method provided by an embodiment of the present application;
  • FIG. 5 shows two schematic diagrams of the distribution of N pieces of first feature information in the neural network training method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of the first distribution information in the neural network training method provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of the process of pruning attention heads in the neural network training method provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of the first distribution information and the second distribution information in the neural network training method provided by an embodiment of the present application;
  • FIG. 9 is another schematic flowchart of the neural network training method provided by an embodiment of the present application;
  • FIG. 10 is a schematic flowchart of a neural network compression method provided by an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a neural network training apparatus provided by an embodiment of the present application;
  • FIG. 12 is a schematic structural diagram of a neural network compression apparatus provided by an embodiment of the present application;
  • FIG. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
  • FIG. 14 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • embodiments of the present application provide a method for training a neural network, a method for compressing a neural network, and related devices. A first neural network for performing a pruning operation on a first feature extraction network is trained with a first loss function, so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, which ensures that the feature expression capabilities of the feature extraction network before and after pruning are similar, thereby ensuring the performance of the pruned feature extraction network.
  • FIG. 1 shows a schematic structural diagram of the artificial intelligence main framework. the artificial intelligence main framework is described below from two dimensions: the “intelligent information chain” (horizontal axis) and the “IT value chain” (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the “IT value chain” reflects the value brought by artificial intelligence to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the provision and processing of technology implementations) to the industrial ecology of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. the infrastructure communicates with the outside through sensors; computing power is provided by smart chips, including but not limited to hardware acceleration chips such as central processing units (CPU), embedded neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC), and field programmable gate arrays (FPGA); the basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. for example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as Internet-of-Things data from traditional devices, including business data from existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to a reasoning control strategy; its typical functions are search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, and speech recognition.
  • intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, the productization of intelligent information decision-making, and the realization of practical applications. their application areas mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe city, etc.
  • the embodiments of the present application can be applied to various fields of artificial intelligence, including natural language processing, image processing, and audio processing, and can be specifically applied to scenarios in which various types of neural networks in various fields need to be compressed.
  • the aforementioned various types of neural networks include, but are not limited to, recurrent neural networks, convolutional neural networks, residual neural networks, fully connected neural networks, and neural networks with the Transformer structure.
  • in the following embodiments of the present application, only the case where the neural network to be compressed (that is, the first feature extraction network) is a neural network with a Transformer structure applied to the field of natural language processing is taken as an example for description.
  • when the neural network to be compressed is another type of neural network, or when the first feature extraction network processes other types of data, for example, image data or audio data, the description can be understood by analogy, and details are not repeated here.
  • related terms and concepts, such as neural networks, involved in the embodiments of the present application are first introduced below.
  • the neural network of the Transformer structure may include an encoder part (that is, the feature extraction network in the neural network of the Transformer structure) and a decoder part; see FIG. 2, which is a schematic structural diagram of the feature extraction network in the neural network of the Transformer structure provided by an embodiment of the present application.
  • the feature extraction network in the neural network of the Transformer structure includes an embedding layer and at least one Transformer layer, and one Transformer layer includes a multi-head attention layer, a summation and normalization (add & norm) layer, a feedforward neural network layer, and another summation and normalization layer. after the text to be processed is processed by the feature extraction network in the neural network of the Transformer structure, the feature information of the entire text to be processed can be obtained.
  • the feature information is a representation of the text to be processed that is suitable for computer processing, and can be used for tasks such as text similarity, text classification, reading comprehension, and machine translation.
  • the embedding layer can perform embedding processing on each word in the text to be processed to obtain the initial feature information of each word.
  • the text to be processed can be a piece of text or a sentence.
  • the text can be Chinese text, English text, or other language text.
  • the embedding layer includes an input embedding layer and a positional encoding layer.
  • in the input embedding layer, word embedding processing can be performed on each word in the text to be processed to obtain the word embedding tensor of each word.
  • a tensor can be expressed as a one-dimensional vector, a two-dimensional matrix, three- or higher-dimensional data, etc.
  • in the position encoding layer, the position of each word in the text to be processed can be obtained, and a position tensor can then be generated for the position of each word.
  • the position of each word may be the absolute position of each word in the text to be processed.
  • for example, the position of “today” can be expressed as the first position, and the position of “day” can be expressed as the second position.
  • the positions of the respective words may be relative positions between the respective words.
  • for example, the position of “today” can be expressed as before “day”, and the position of “day” can be expressed as after “today”, and so on.
  • after the position tensor and the word embedding tensor of each word are obtained, the two can be combined to obtain the initial feature information of each word, thereby obtaining the initial feature information corresponding to the text to be processed; a minimal sketch of this embedding step follows.
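A minimal sketch of the embedding layer described above; learned absolute-position embeddings combined by addition are one common choice, and the patent does not prescribe how the position tensor is built, so the class and parameter names here are assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, d_model, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)  # input embedding layer
        self.pos_emb = nn.Embedding(max_len, d_model)      # positional encoding layer

    def forward(self, token_ids):
        # token_ids: [batch, seq_len]; absolute positions 0 .. seq_len - 1
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        # combine the word embedding tensor and the position tensor of each word
        return self.word_emb(token_ids) + self.pos_emb(positions)
```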
  • the multi-head attention layer can also be called an attention layer, in one example, the attention layer can be a fixed window multi-head attention layer.
  • each attention head among the multiple attention heads corresponds to a set of attention matrices, and the set of attention matrices includes a first transformation matrix, a second transformation matrix, and a third transformation matrix, whose functions are different: the first transformation matrix is used to generate the query (Query) feature information of the text to be processed, the second transformation matrix is used to generate the key (Key) feature information of the text to be processed, and the third transformation matrix is used to generate the value (Value) feature information of the text to be processed.
  • different attention heads are used to extract the semantic information of the text to be processed from different angles. for example, one attention head can focus on the sentence components of the text to be processed, such as the subject-verb-object structure, and another attention head can focus on the dependencies between words in the text to be processed, and so on.
  • it should be noted that the feature information each attention head attends to is learned by the model itself during the training process; the above example is only intended to explain the learning ability of multiple attention heads, and is not used to limit this scheme.
  • in one case, the multi-head attention layer includes z attention heads; although the value of z is 3 in the example of FIG. 2, in actual situations the layer can include more or fewer attention heads.
  • the operation of any one of the multiple attention heads can be expressed by the following formula:

$$Q_i = XW_i^{Q},\qquad K_i = XW_i^{K},\qquad V_i = XW_i^{V},\qquad head_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(Q_i K_i^{T}\right)V_i$$

  • where X represents the initial feature information of the entire text to be processed (that is, the initial feature information obtained after the entire text to be processed is input into the embedding layer), which includes the initial feature information of each word in the text to be processed; $head_i$ represents the output obtained after the initial feature information of the text to be processed is input into the i-th attention head among the z attention heads; $\mathrm{Attention}(\cdot)$ represents that the i-th attention head adopts the attention mechanism in the calculation process; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ represent the first, second, and third transformation matrices in the i-th attention head, respectively; $K_i^{T}$ represents the transpose of $K_i$; and z represents the number of attention heads in the attention layer. it should be understood that the example here is only for the convenience of understanding the operation of the attention heads, and is not used to limit this solution.
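A sketch of a single attention head implementing the formula above; the scaling by sqrt(d_k), as in the standard Transformer, is an assumption here, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    def __init__(self, d_model, d_k):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)  # first transformation matrix
        self.w_k = nn.Linear(d_model, d_k, bias=False)  # second transformation matrix
        self.w_v = nn.Linear(d_model, d_k, bias=False)  # third transformation matrix

    def forward(self, x):
        # x: [seq_len, d_model], the initial feature information X of the text
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                 # head_i: [seq_len, d_k]
```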
  • in some embodiments, the multi-head attention layer may be the layer following the embedding layer; in other embodiments, there may be multiple Transformer layers in the feature extraction network of the neural network of the Transformer structure, and the output of the last Transformer layer is the feature information of the text to be processed.
  • the operating principle of each attention head is the attention mechanism, which imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external senses to increase the fineness of observation in some areas, and which can use limited attention resources to quickly filter out high-value information from a large amount of information. the attention mechanism can quickly extract important features of sparse data, so it is widely used in natural language processing tasks, especially machine translation.
  • the self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • the essential idea of the attention mechanism can be rewritten as the following formula:

$$\mathrm{Attention}(Query, Source) = \sum_{i=1}^{L_x} \mathrm{Similarity}(Query, Key_i)\cdot Value_i$$

  • where $L_x$ represents the length of Source, and Source represents the input text to be processed. the meaning of the formula is as follows: imagine that the elements included in Source are composed of a series of <Key, Value> data pairs; given an element Query in a target, the weight coefficient of the Value corresponding to the Key of each element in Source is obtained by calculating the similarity between the Query and each Key, and the Values of the elements in Source are then weighted and summed, which yields the final Attention value of the aforementioned element. so in essence, the Attention mechanism performs a weighted summation of the Value of each element in Source, while Query and Key are used to calculate the weight coefficient of the corresponding Value.
  • attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on this important information while ignoring most of the unimportant information.
  • the process of focusing is reflected in the calculation of the weight coefficient.
  • the self-attention mechanism can be understood as internal Attention (intra attention).
  • in the general attention mechanism, attention occurs between the Query of an element in the Target and all the elements in the Source; the self-attention mechanism instead refers to attention that occurs among the internal elements of the Source or among the internal elements of the Target.
  • the specific calculation process is the same, but the calculation object has changed.
  • Natural language is human language, and natural language processing is the processing of human language. Natural language processing is the process of systematically analyzing, understanding, and extracting information from text data in an intelligent and efficient manner.
  • typical natural language processing tasks include machine translation, named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering, natural language inference, and topic segmentation.
  • natural language processing tasks can fall into the following categories.
  • sequence tagging: for each word in a sentence, the model is required to give a categorical label based on the context, such as Chinese word segmentation, part-of-speech tagging, named entity recognition, and semantic role tagging.
  • classification tasks: one classification value is output for the entire sentence, such as text classification.
  • sentence relationship inference: given two sentences, determine whether the two sentences have a certain nominal relationship, for example entailment, QA, semantic rewriting, and natural language inference.
  • generative tasks: one piece of text is taken as input and another piece of text is generated.
  • word segmentation (word segmentation or word breaker, WB): continuous natural language text is divided into lexical sequences with semantic rationality and completeness, which can solve the problem of cross ambiguity.
  • named entity recognition (NER): identifies entities with specific meaning in natural language text, such as people, places, and institutions.
  • part-of-speech tagging: assigns a part of speech (noun, verb, adjective, etc.) to each word in natural language text; dependency parsing: automatically analyzes the syntactic components (subject, predicate, object, attributive, adverbial, complement, etc.) in a sentence, which can solve the problem of structural ambiguity. for example, for the comment “you can enjoy the sunrise in the room”, there are two readings: ambiguity 1: the room is okay; ambiguity 2: you can enjoy the sunrise; the analysis is: in the room (subject), you can (predicate), enjoy the sunrise (verb-object phrase).
  • word embedding & semantic similarity: vocabulary is represented as vectors, and the semantic similarity of vocabulary is calculated on this basis, which can solve the problem of vocabulary-level similarity. for example: which is closer to “watermelon”, “dumb melon” or “strawberry”?
  • vectorized representation: watermelon (0.1222, 0.22333, ...); similarity calculation: dumb melon (0.115), strawberry (0.325); vectorized representations: (-0.333, 0.1223, ...) and (0.333, 0.3333, ...).
  • text semantic similarity: relying on massive data from the whole network and deep neural network technology, the semantic similarity between texts can be calculated, which can solve the problem of text-level semantic similarity. for example: which is closer to “how to prevent the license plate from the front of the car”, “how to install the front license plate” or “how to apply for a Beijing license plate”?
  • vectorized representation: how to prevent the license plate from the front of the car (0.1222, 0.22333, ...); similarity calculation: how to install the front license plate (0.762), how to apply for a Beijing license plate (0.486); vectorized representations: (-0.333, 0.1223, ...) and (0.333, 0.3333, ...). a toy illustration of this similarity calculation follows.
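A toy illustration of the similarity calculation above; the embedding values below are hypothetical and are not the numbers from the example.

```python
import torch
import torch.nn.functional as F

watermelon = torch.tensor([0.1222, 0.22333, -0.31, 0.05])  # hypothetical embeddings
dumb_melon = torch.tensor([-0.333, 0.1223, -0.28, 0.02])
strawberry = torch.tensor([0.333, 0.3333, 0.11, -0.44])

for name, vec in [("dumb melon", dumb_melon), ("strawberry", strawberry)]:
    sim = F.cosine_similarity(watermelon, vec, dim=0)
    print(f"similarity(watermelon, {name}) = {sim:.3f}")
```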
  • the neural network training method provided by the embodiments of the present application is used to train a first neural network whose task objective is to perform a pruning operation on the first feature extraction network while ensuring that the feature expression performance of the first feature extraction network before and after pruning remains basically unchanged.
  • the neural network compression system provided by the embodiments of the present application is first introduced with reference to FIG. 3.
  • the neural network compression system includes a training device 310, a database 320, an execution device 330, a data storage system 340, and a client device 350; the execution device 330 includes a computing module 331 and an input/output (I/O) interface 332.
  • in some embodiments, the training process of the first feature extraction network 302 consists of pre-training and fine-tuning. in one implementation, as shown in FIG. 3, the first neural network 301 prunes the first feature extraction network 302 in the pre-training stage of the first feature extraction network 302.
  • the database 320 stores a first training data set, and the first training data set may include multiple training texts.
  • the training device 310 obtains the first feature extraction network 302, which is a neural network that has been pre-trained, and the training device 310 generates the first neural network 301 for performing the pruning operation; it then uses a plurality of training texts in the first training data set together with the first feature extraction network 302 to train the first neural network 301, obtaining the first neural network 301 on which the training operation has been performed. it should be noted that the weight parameters of the first feature extraction network 302 are not modified during the training process of the first neural network 301.
  • after obtaining the mature first neural network 301, the training device 310 uses it to prune the first feature extraction network 302 to obtain the pruned first feature extraction network 302, and the training device 310 sends the pruned first feature extraction network 302 to the execution device 330.
  • the execution device 330 can call data, codes, etc. in the data storage system 340, and can also store data, instructions, etc. in the data storage system 340.
  • the data storage system 340 may be configured in the execution device 330 , or the data storage system 340 may be an external memory relative to the execution device 330 .
  • a second training data set may be stored in the data storage system 340, and the second training data set includes a plurality of training texts and the correct result of each training text.
  • the execution device 330 uses the second training data set to train the third neural network integrated with the pruned first feature extraction network 302 to obtain a mature third neural network.
  • in the scenario where the “user” directly interacts with the client device, the execution device 330 obtains the text to be processed sent by the client device 350 through the I/O interface 332.
  • the computing module 331 processes the text to be processed through the mature third neural network to generate a prediction result of the text to be processed, and sends the prediction result of the text to be processed to the client device 350 through the I/O interface 332.
  • it should be noted that FIG. 3 is only an example of the neural network compression system provided by the embodiments of the present application, and the positional relationship among the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the execution device 330 and the client device 350 may be integrated in the same device.
  • in some other cases, the execution device 330 may be divided into two independent devices: a training device of the third neural network and an execution device of the third neural network. the steps of the fine-tuning phase of the first feature extraction network 302 are executed by the training device of the third neural network, and the steps of the inference phase of the third neural network are executed by the execution device of the third neural network.
  • in another implementation, the training process of the first feature extraction network 302 does not adopt the pre-training plus fine-tuning approach.
  • in this implementation, the training device 310 obtains a third neural network, where the third neural network is a neural network on which the training operation has been performed, that is, a mature neural network, and the third neural network is integrated with the first feature extraction network 302.
  • after the pruning is performed, the training device 310 sends the pruned third neural network to the execution device 330, where the pruned third neural network includes the pruned first feature extraction network 302.
  • after obtaining the pruned third neural network, the execution device 330 no longer trains the pruned third neural network, but directly executes the operations of the inference stage according to the pruned third neural network.
  • the execution device 330 and the client device 350 may be integrated in the same device.
  • the embodiments of the present application involve both the training phase and the inference phase of the first neural network 301 (that is, the neural network for performing the pruning operation), and the processes of the two phases are different; the training phase and the inference phase of the first neural network 301 are described separately below.
  • FIG. 4 is a schematic flowchart of a training method of a neural network provided by an embodiment of the present application.
  • the training method of a neural network provided by the embodiment of the present application may include:
  • 401. the training device inputs the first training data into a first feature extraction network, and obtains N pieces of first feature information corresponding to the first training data output by the first feature extraction network.
  • the training device is configured with a training data set, and the training data set includes a plurality of pieces of training data; since the first feature extraction network processing text data is taken as an example here, each piece of training data can be represented as a training text.
  • the training device inputs the first training data into the first feature extraction network, and obtains N pieces of first feature information corresponding to the first training data output by the first feature extraction network.
  • in this embodiment, the first feature extraction network is described by taking a feature extraction network in a neural network with a Transformer structure as an example; the first feature extraction network may also be called an encoder, and it includes at least two attention heads. for the specific network structures of the first feature extraction network and the attention heads, reference may be made to the description of FIG. 2 above, which is not repeated here. further, the first feature extraction network belongs to the third neural network used for natural language processing, and there can be many types of natural language processing tasks, such as word segmentation, named entity recognition, and part-of-speech tagging, which are not exhaustively listed here; for specific examples of the foregoing tasks, reference may be made to the above description, which is not repeated here.
  • In one case, the first training data includes N pieces of training data, each of which can be represented as a sentence; that is, the first training data includes N sentences. In this case, step 401 may include: the training device inputs the N sentences into the first feature extraction network respectively, so as to obtain the first feature information of each of the N sentences output by the first feature extraction network, that is, N pieces of first feature information, where one piece of first feature information is the feature information of one sentence among the N sentences. N is an integer greater than or equal to 2; for example, the value of N may be 2, 3, 4, 5, 6, or another value.
  • In another case, the first training data is one sentence, and the sentence includes N words. In this case, step 401 may include: the training device inputs the sentence into the first feature extraction network so as to generate the feature information of the sentence through the first feature extraction network, and obtains the feature information of each word from the feature information of the sentence; that is, the feature information of the sentence is decomposed to obtain the feature information of each of the N words, and one piece of first feature information is the feature information of one word among the N words.
  • Two representations of the N pieces of first feature information are thus provided, which improves the implementation flexibility of the solution. If one piece of first feature information is the feature information of one sentence among N sentences, feature extraction needs to be performed on N sentences for one training of the first neural network, which increases the difficulty of the training process but is beneficial to improving the accuracy of the final first feature extraction network; if one piece of first feature information is the feature information of one word among N words, feature extraction needs to be performed on only one sentence to complete one training of the first neural network, which is beneficial to improving the efficiency of the training process of the first neural network.
  • In yet another case, the first training data is one word, and the word includes N letters. In this case, step 401 may include: the training device inputs the word into the first feature extraction network so as to generate the feature information of the word through the first feature extraction network, and obtains the feature information of each letter from the feature information of the word; that is, the feature information of the word is decomposed to obtain the feature information of each of the N letters, and one piece of first feature information is the feature information of one letter among the N letters.
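  • As a minimal sketch of the decomposition described above (the function name and the (N, d) tensor layout are illustrative assumptions, not interfaces from this application): a Transformer-style encoder typically emits one hidden vector per token, so the feature information of a sentence (or word) can be split along the sequence dimension to obtain per-word (or per-letter) feature information.

```python
import torch

def decompose_feature_info(sequence_feats: torch.Tensor) -> list[torch.Tensor]:
    """Split sequence-level feature information into per-token pieces.

    sequence_feats: (N, d) tensor holding the feature information of one
    sentence (one row per word) or of one word (one row per letter).
    Returns N pieces of first feature information, one (d,) vector each.
    """
    return [sequence_feats[i] for i in range(sequence_feats.shape[0])]
```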
  • 402. The training device calculates first distribution information according to the N pieces of first feature information, where the first distribution information is used to indicate the data distribution law of the N pieces of first feature information.
  • Specifically, after obtaining the N pieces of first feature information, the training device calculates the first distribution information.
  • the first distribution information may be stored in the form of a table, matrix, array, index, etc.
  • The first distribution information is used to indicate the data distribution law of the N pieces of first feature information, including the distribution of each piece of feature information among the N pieces of first feature information.
  • In one implementation, the first distribution information includes the value of the distance between any two pieces of first feature information among the N pieces of first feature information, so as to indicate the data distribution law of the N pieces of first feature information; that is, the distribution rule of one piece of feature information among the N pieces of first feature information is represented by the values of the distances between that piece of feature information and each of the N pieces of first feature information. The farther the distance between two pieces of first feature information, the smaller the similarity between them; the closer the distance between two pieces of first feature information, the greater the similarity between them.
  • In this way, the data distribution law of the N pieces of feature information is determined by calculating the distance between any two pieces of feature information among the N pieces of feature information, which provides an implementation of the data distribution law of the N pieces of feature information and is simple to operate and easy to realize.
  • Specifically, for any two pieces of the N pieces of first feature information (referred to as the third feature information and the fourth feature information), the training device may directly calculate the cosine distance, Euclidean distance, Manhattan distance, Mahalanobis distance, first-order distance, cross-entropy distance, or another type of distance between the third feature information and the fourth feature information, and determine it as the distance between the third feature information and the fourth feature information; the training device performs the foregoing operation on every pair of the N pieces of first feature information to obtain the first distribution information.
  • For example, if the N pieces of first feature information include feature information $h_n$ and feature information $h_i$, the value of the distance between the two pieces of feature information may be calculated as

$$D(h_n, h_i) = \frac{\mathrm{Dist}_{cos}(h_n, h_i)}{\sum_{j=1}^{N} \mathrm{Dist}_{cos}(h_n, h_j)} \qquad (1)$$

where $\mathrm{Dist}_{cos}(h_n, h_i)$ represents the cosine distance between $h_n$ and $h_i$, the denominator represents the sum of the cosine distances between $h_n$ and each of the N pieces of first feature information, and equation (2) gives the specific formula of the cosine distance between $h_n$ and $h_i$:

$$\mathrm{Dist}_{cos}(h_n, h_i) = 1 - \frac{h_n \cdot h_i}{\lVert h_n \rVert\, \lVert h_i \rVert} \qquad (2)$$
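  • The following is a minimal sketch of this calculation under the cosine-distance form of equations (1) and (2); the function name and the stacking of the N pieces of first feature information into an (N, d) tensor are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def distribution_info(features: torch.Tensor) -> torch.Tensor:
    """First (or second) distribution information as an (N, N) matrix.

    features: (N, d) tensor, one row per piece of feature information.
    Row n holds the cosine distance from h_n to every h_j (equation (2)),
    normalized by the row sum as in equation (1).
    """
    h = F.normalize(features, dim=-1)       # unit-norm feature vectors
    cos_dist = 1.0 - h @ h.t()              # pairwise cosine distances
    row_sum = cos_dist.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    return cos_dist / row_sum               # each row sums to 1
```

  • Each row of the returned matrix then corresponds to one row of the matrix illustrated in FIG. 6 below (up to normalization).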
  • FIG. 5 shows two schematic diagrams of the distribution of N pieces of first feature information in the training method of the neural network provided by the embodiment of the present application, and FIG. 6 is a schematic diagram of the first distribution information in the neural network training method provided by the embodiment of the present application. Both FIG. 5 and FIG. 6 take the value of N as 4 as an example. FIG. 5 includes sub-schematic diagram (a) and sub-schematic diagram (b); A1, A2, A3, and A4 respectively represent the feature information generated by the first feature extraction network for 4 sentences (that is, the first training data includes 4 sentences), so the distribution of the 4 pieces of first feature information is shown. Since the distribution of the 4 pieces of first feature information can be seen intuitively from the two sub-schematic diagrams in FIG. 5, it is not introduced again here.
  • the first distribution information is represented as a matrix as an example.
  • Each value in the matrix represents the distance between two pieces of first feature information.
  • For example, B1 represents the distance between the two pieces of feature information A3 and A4. The first distribution information shown in FIG. 6 represents the distribution of the four pieces of first feature information in sub-schematic diagram (a) of FIG. 5: the value of the distance between A1 and A1 is 0, the value of the distance between A1 and A2 is 2, the value of the distance between A1 and A3 is 6, and so on. Since the distance between A1 and A3 in FIG. 5 is the farthest, correspondingly, the value of the distance between A1 and A3 in FIG. 6 is the largest.
  • the matrix shown in Figure 6 can be understood in conjunction with Figure 5.
  • The matrix values in FIG. 6 are not explained one by one here. It should be noted that the examples in FIG. 5 and FIG. 6 are only intended to facilitate understanding of this solution; in practical applications, the first distribution information can also be expressed in other forms, such as tables or arrays, and the value of each distance in the first distribution information can also be a value after normalization processing, which is not limited here.
  • In another implementation, the first distribution information includes the value of the distance between each of the N pieces of first feature information and preset feature information, so as to indicate the data distribution law of the N pieces of first feature information. The longer the distance between a piece of first feature information and the preset feature information, the smaller the similarity between them; the shorter the distance between a piece of first feature information and the preset feature information, the greater the similarity between them.
  • The shape of the preset feature information is the same as that of the first feature information, which means that the preset feature information and the first feature information are both M-dimensional tensors, and a first dimension among the M dimensions of the first feature information has the same size as a second dimension among the M dimensions of the preset feature information, where M is an integer greater than or equal to 1, the first dimension is any one of the M dimensions of the first feature information, and the second dimension is the dimension, among the M dimensions of the preset feature information, at the same position as the first dimension.
  • For example, if the first feature information is a vector including m elements, the preset feature information may be a vector including m zeros, a vector including m ones, or the like; the examples here are only intended to facilitate understanding of the concept of preset feature information and are not used to limit this solution.
  • Specifically, for the third feature information (any one of the N pieces of first feature information), the training device may calculate the cosine distance, Euclidean distance, Manhattan distance, Mahalanobis distance, first-order distance, cross-entropy distance, or another type of distance between the third feature information and the preset feature information, and determine it as the distance between the third feature information and the preset feature information; the training device performs the foregoing operation on each of the N pieces of first feature information to obtain the first distribution information.
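  • A sketch of this preset-feature-information variant follows. Note that the cosine distance to an all-zero preset vector is undefined, so the sketch uses the Euclidean distance, which is one of the distance types listed above; the names are illustrative assumptions.

```python
import torch

def distribution_info_vs_preset(features: torch.Tensor,
                                preset: torch.Tensor) -> torch.Tensor:
    """Distance from each piece of first feature information to the preset.

    features: (N, d) tensor; preset: (d,) tensor with the same shape as
    one piece of feature information, e.g. torch.zeros(d) or torch.ones(d).
    Returns an (N,) vector indicating the data distribution law.
    """
    return torch.linalg.norm(features - preset, dim=-1)
```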
  • 403. The training device performs a pruning operation on the first feature extraction network through the first neural network, to obtain a pruned first feature extraction network.
  • The first neural network can be expressed as any one of various types of neural networks: a convolutional neural network, a recurrent neural network, a residual neural network, or a fully connected neural network.
  • multiple implementation manners of the first neural network are provided, which improves the implementation flexibility of this solution.
  • Specifically, through the first neural network, the training device can prune the weight parameters of the first feature extraction network, prune the neural network layers in the first feature extraction network, or prune at least one attention head in the neural network layers of the first feature extraction network.
  • Since the attention layers of the first feature extraction network may include at least two attention heads, step 403 may include: the training device performs, through the first neural network, a pruning operation on the at least two attention heads included in the first feature extraction network, and constructs a pruned first feature extraction network from the attention heads still retained after pruning; the number of attention heads included in the pruned first feature extraction network is less than the number of attention heads included in the first feature extraction network.
  • For example, if the first feature extraction network includes 8 attention heads, the pruned first feature extraction network may include 6 attention heads, so that the pruned first feature extraction network includes fewer parameters; the example here is only for the convenience of understanding the solution and is not intended to limit the solution.
  • Specifically, step 403 may include: the training device generates, through the first neural network, a first score for each of the at least two attention heads; the first score of an attention head represents the importance of that attention head and is used to indicate whether the attention head is pruned. Among the attention heads included in the first feature extraction network, those with a high degree of importance are preserved, and those with a low degree of importance are pruned.
  • the training device performs a pruning operation on the at least two attention heads according to the at least two first scores corresponding to the at least two attention heads.
  • the first score of each attention head is generated by the first neural network, and then whether the attention head will be pruned is determined according to the score of each attention head, which is simple to operate and easy to implement.
  • In one case, the first score corresponding to an attention head with a higher degree of importance may be higher and that corresponding to an attention head with a lower degree of importance lower; alternatively, an attention head with a higher degree of importance may correspond to a lower first score and an attention head with a lower degree of importance to a higher first score.
  • In one implementation, the value of the first score is a first preset value or a second preset value, and the first preset value and the second preset value are different. Taking the first attention head (any one of the at least two attention heads) as an example: when the value of the first score of the first attention head is the first preset value, the first attention head is retained; when the value is the second preset value, the first attention head is pruned. The first preset value may be 1, 2, 3, 4, or another value, and the second preset value may be 0, 1, 2, or another value, as long as it is guaranteed that the first preset value and the second preset value are different. As an example, the value of the first score is 0 or 1: if the value of the first score of the first attention head is 0, the first attention head is pruned; if the value is 1, the first attention head is retained. The specific values of the first preset value and the second preset value can be flexibly set according to the actual situation, which is not limited here.
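  • As an illustrative sketch of this scoring scheme (the scorer architecture below is an assumption; the application only requires that the first neural network output one first score per attention head): a small network maps each head's representation to a score, and heads whose first score equals the second preset value (here 0) are pruned.

```python
import torch
import torch.nn as nn

class HeadScorer(nn.Module):
    """Hypothetical stand-in for the first neural network: maps the
    representation of each attention head to one continuous score."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, head_feats: torch.Tensor) -> torch.Tensor:
        # head_feats: (num_heads, feat_dim) -> (num_heads,) scores
        return self.mlp(head_feats).squeeze(-1)

def retained_heads(first_scores: torch.Tensor) -> torch.Tensor:
    # First scores take the preset values 1 (retain) or 0 (prune);
    # returns a boolean retain-mask over the attention heads.
    return first_scores > 0.5
```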
  • In one implementation, the training device inputs each of the at least two attention heads into the first neural network and obtains a second score for each attention head output by the first neural network; the second score may be a continuous score, for example, 0.58, 0.69, 1, 1.28, or 1.38. The examples here are only for easier understanding of the solution and are not used to limit it. The following specifically describes the generation process of the second score for the first attention head among the at least two attention heads.
  • The training device performs, according to the self-attention mechanism, the self-attention operation with the set of attention matrices corresponding to the first attention head, and then inputs the operation result into the first neural network to obtain the second score of the first attention head output by the first neural network.
  • After obtaining the second score of the first attention head, the training device performs discretization processing on the second score of the first attention head to obtain the first score of the first attention head.
  • the process of discretization processing is differentiable.
  • the specific method of discretization processing may be gumbel-softmax, gumbel-max or other types of discretization processing methods, and so on.
  • the training device performs the foregoing operations on each of the plurality of attention heads, so that a first score for each attention head can be generated.
  • Since the process of generating the first score of each attention head is differentiable, the process of reversely updating the weight parameters of the first neural network using the first loss function is also continuous, so that the updating process of the weight parameters of the first neural network is more rigorous; this helps improve the training efficiency of the first neural network and is also beneficial to obtaining a first neural network with a higher accuracy rate.
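  • A minimal sketch of such a differentiable discretization using gumbel-softmax follows; treating each head as a two-way keep/prune choice with logits derived from the second score is an illustrative design choice, not the application's prescribed formulation.

```python
import torch
import torch.nn.functional as F

def discretize(second_scores: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Turn continuous second scores into (near-)binary first scores.

    second_scores: (num_heads,) continuous scores from the first neural
    network. Uses the straight-through gumbel-softmax estimator, so the
    forward pass yields hard 0/1 values while gradients flow through the
    soft relaxation.
    """
    logits = torch.stack([-second_scores, second_scores], dim=-1)  # [prune, keep]
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot[..., 1]  # 1.0 = retain the head, 0.0 = prune it
```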
  • In another implementation, the value of the first score may be continuous, and a first threshold is preset on the training device: when the first score of an attention head reaches the first threshold, the attention head can be pruned; when the first score of an attention head is less than the first threshold, the attention head can be retained.
  • FIG. 7 is a schematic diagram of a process of pruning attention heads in the neural network training method provided by the embodiment of the present application.
  • In FIG. 7, the first feature extraction network includes 3 Transformer layers, and each Transformer layer includes 4 attention heads as an example.
  • In FIG. 7, the attention heads represented by the gray blocks are unimportant attention heads, and the attention heads represented by the mosaic blocks are important attention heads; for example, the attention head represented by the mosaic block numbered 1 in the neural network layer numbered 1 is an important attention head, and the attention head represented by the gray block numbered 1 in the neural network layer numbered 2 is an unimportant attention head. After pruning the multiple attention heads included in the different neural network layers of the first feature extraction network, the pruned first feature extraction network is reconstructed; the pruned first feature extraction network includes the remaining 6 important attention heads.
  • FIG. 7 is only for the convenience of understanding this scheme and is not used to limit this scheme.
  • step 403 may include: the training device directly inputs the first feature extraction network into the first neural network, and obtains the pruned first feature extraction network output by the first neural network.
  • step 403 may be executed before or after any of steps 401 and 402, as long as it is ensured that step 403 is executed before step 404.
  • 404. The training device inputs the first training data into the pruned first feature extraction network, and obtains N pieces of second feature information corresponding to the first training data output by the pruned first feature extraction network.
  • Specifically, after obtaining the pruned first feature extraction network, the training device inputs the first training data into the pruned first feature extraction network, so as to perform feature extraction on the first training data and obtain the N pieces of second feature information corresponding to the first training data output by the pruned first feature extraction network.
  • The specific implementation of step 404 is similar to that of step 401; the difference is only that the execution subject in step 401 is the first feature extraction network, while the execution subject in step 404 is the pruned first feature extraction network, so the details are not repeated here.
  • The meanings of the N pieces of second feature information are similar to those of the N pieces of first feature information. If the first training data includes N sentences, one piece of second feature information is the feature information of one sentence among the N sentences; or, if the first training data is one sentence that includes N words, one piece of second feature information is the feature information of one word among the N words.
  • 405. The training device calculates second distribution information according to the N pieces of second feature information, where the second distribution information is used to indicate the data distribution law of the N pieces of second feature information.
  • The specific implementation of step 405 is similar to that of step 402; the difference is only that in step 402 the training device processes the N pieces of first feature information, while in step 405 it processes the N pieces of second feature information, which can be understood by referring to the above description.
  • The specific expression form of the second distribution information is similar to that of the first distribution information; reference may be made to the introduction in step 402, which is not repeated here.
  • 406. The training device performs a training operation on the first neural network according to the first loss function to obtain a second neural network, where the first loss function indicates the similarity between the first distribution information and the second distribution information.
  • Specifically, after obtaining the first distribution information and the second distribution information, the training device calculates the function value of the first loss function according to the first distribution information and the second distribution information, performs gradient derivation according to the function value of the first loss function, and reversely updates the weight parameters of the first neural network to complete one training of the first neural network.
  • The training device performs iterative training on the first neural network by repeatedly performing steps 401 to 406 until the convergence condition of the first loss function is satisfied, thereby obtaining the second neural network, which is the trained first neural network.
  • The goal of the iterative training is to make the first distribution information and the second distribution information as similar as possible. The similarity between the first distribution information and the second distribution information reflects the degree of difference between them and may also be expressed as the distance between the first distribution information and the second distribution information. It should be noted that the weight parameters of the first feature extraction network are not updated in the process of training the first neural network.
  • After the training device determines that the function value of the first loss function satisfies the convergence condition, the first neural network is not trained again; the training device can take the pruned first feature extraction network generated through step 403 by the first neural network (also referred to as the second neural network) during the last training as the finally output pruned first feature extraction network.
  • The first loss function may specifically calculate the distance between the first distribution information and the second distribution information, and the aforementioned distance can be the KL divergence (Kullback-Leibler divergence) distance, cross-entropy distance, Euclidean distance, Mahalanobis distance, cosine distance, or another type of distance, which are not exhaustively listed here. It should be noted that drawing the first distribution information and the second distribution information closer does not mean shortening the distance between each piece of first feature information and each piece of second feature information.
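  • A minimal sketch of the first loss function under the KL-divergence choice follows, assuming both distribution matrices are row-normalized as in the sketch after equation (2); the names are illustrative.

```python
import torch
import torch.nn.functional as F

def first_loss(dist_before: torch.Tensor, dist_after: torch.Tensor) -> torch.Tensor:
    """KL-divergence form of the first loss function.

    dist_before: (N, N) first distribution information (before pruning).
    dist_after:  (N, N) second distribution information (after pruning).
    Returns a scalar; minimizing it draws the two distributions together.
    """
    eps = 1e-12
    p = dist_before.clamp_min(eps)   # target distribution
    q = dist_after.clamp_min(eps)    # distribution produced after pruning
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(q.log(), p, reduction="batchmean")
```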
  • As an example, suppose the value of N is 3 and the first training data includes the three sentences "the weather is so nice today", "the weather is so comfortable today", and "the clothes are so pretty". The distance between the first feature information of "the weather is so nice today" and the first feature information of "the weather is so comfortable today" will be relatively close, while the first feature information of "the clothes are so pretty" will be farther from the first two. The training target is then that the distance between the second feature information of "the weather is so nice today" and the second feature information of "the weather is so comfortable today" is likewise close, and the second feature information of "the clothes are so pretty" is likewise far from the first two. That is, the purpose of training is to make the relative distances between the different pieces of second feature information similar to the relative distances between the different pieces of first feature information.
  • FIG. 8 is a schematic diagram of the first distribution information and the second distribution information in the training method of the neural network provided by the embodiment of the present application. In FIG. 8, the first distribution information and the second distribution information both include the distance between any two of the N pieces of feature information as an example, and FIG. 8 includes three sub-schematic diagrams (a), (b), and (c), each showing three pieces of feature information. Sub-schematic diagram (a) represents the distribution of the three pieces of first feature information, and sub-schematic diagrams (b) and (c) both represent the distribution of the three pieces of second feature information. C1, C2, and C3 respectively represent three different pieces of training data: in sub-schematic diagram (a), the box represents the first feature information of C1, the circle represents the first feature information of C2, and the five-pointed star represents the first feature information of C3. Since the attention heads pruned from the first feature extraction network can differ between training iterations, the distribution of the N pieces of second feature information output by the pruned first feature extraction network differs between training iterations; sub-schematic diagrams (b) and (c) respectively represent the distribution of the three pieces of second feature information in different training iterations. In sub-schematic diagrams (b) and (c), the box represents the second feature information of C1, the circle represents the second feature information of C2, and the five-pointed star represents the second feature information of C3.
  • FIG. 9 is another schematic flowchart of the training method of a neural network provided by an embodiment of the present application.
  • D1. The training device obtains N pieces of training data from the training data set (that is, obtains the first training data), and inputs the N pieces of training data into the first feature extraction network, which is a pre-trained neural network, to obtain N pieces of first feature information.
  • D2. The training device generates the first distribution information according to the N pieces of first feature information.
  • D3. The training device inputs the multiple sets of attention matrices corresponding to the multiple attention heads included in the first feature extraction network into the first neural network, to obtain a second score for each attention head generated by the first neural network.
  • D4. The training device performs discretization processing on the second score of each attention head to obtain the first score of each attention head; the aforementioned discretization processing is differentiable.
  • D5. The training device prunes the first feature extraction network according to the first score of each attention head, and reconstructs the pruned first feature extraction network.
  • D6. The training device inputs the N pieces of training data into the pruned first feature extraction network to obtain N pieces of second feature information.
  • D7. The training device generates the second distribution information according to the N pieces of second feature information.
  • D8. The training device calculates the distance between the first distribution information and the second distribution information, that is, calculates the function value of the first loss function, and backpropagates to update the weight parameters of the first neural network, so as to complete one training of the first neural network. It should be understood that the example in FIG. 9 is only to facilitate understanding of the solution and is not intended to limit it.
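  • Putting steps D1 to D8 together, a minimal end-to-end sketch of one training iteration follows. It assumes the helper functions sketched above and a hypothetical encoder interface (encoder(x, head_mask=...) returning (N, d) feature information, and encoder.head_features(x) returning per-head representations); none of these names come from this application, and the encoder's own weights are assumed frozen.

```python
import torch

def train_step(encoder, scorer, optimizer, batch):
    """One training iteration of the first neural network (steps D1-D8)."""
    with torch.no_grad():                         # D1-D2: unpruned pass;
        h1 = encoder(batch)                       # encoder weights stay fixed
        p = distribution_info(h1)
    head_feats = encoder.head_features(batch)     # D3: per-head inputs
    second_scores = scorer(head_feats)            # D3: continuous scores
    first_scores = discretize(second_scores)      # D4: differentiable 0/1
    h2 = encoder(batch, head_mask=first_scores)   # D5-D6: pruned forward pass
    q = distribution_info(h2)                     # D7: second distribution info
    loss = first_loss(p, q)                       # D8: compare distributions
    optimizer.zero_grad()
    loss.backward()                               # gradients reach only the scorer
    optimizer.step()
    return loss.item()
```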
  • The above provides a method for training a neural network that performs a pruning operation on a first feature extraction network; the trained first neural network can be used to prune the first feature extraction network, that is, a compression scheme for neural networks is provided. In addition, the first loss function is used to train the first neural network so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, thereby ensuring that the feature expression capabilities of the feature extraction network before and after pruning are similar and guaranteeing the performance of the pruned feature extraction network. Moreover, the first feature extraction network can be not only a feature extraction network with a Transformer structure but also the feature extraction network of a recurrent neural network, a convolutional neural network, or another neural network, which expands the application scenarios of this scheme.
  • FIG. 10 is a schematic flowchart of a method for compressing a neural network provided by an embodiment of the present application.
  • the method for compressing a neural network provided by the embodiment of the present application may include:
  • 1001. The execution device acquires a second feature extraction network.
  • In this embodiment, the execution device first needs to acquire the second feature extraction network on which compression is to be performed.
  • the training device of the first neural network and the execution device of the second neural network may be the same device, or may be separate devices.
  • the second feature extraction network and the first feature extraction network may be different feature extraction networks, or may be the same feature extraction network.
  • the neural network structures of the first feature extraction network and the second feature extraction network may be identical, that is, the neural network layers included in the first feature extraction network and the second feature extraction network are identical.
  • In another case, the neural network structures of the first feature extraction network and the second feature extraction network may also be different; for example, the number of attention heads included in a multi-head attention layer of the second feature extraction network may differ from the number of attention heads included in a multi-head attention layer of the first feature extraction network.
  • In one case, the acquired second feature extraction network is a neural network that has performed a pre-training operation; in another case, where the second feature extraction network does not adopt the training method of pre-training plus fine-tuning, the acquired second feature extraction network is a trained neural network. The specific process can be understood with reference to the above description of FIG. 3.
  • 1002. The execution device prunes the second feature extraction network through the second neural network to obtain a pruned second feature extraction network, where the second neural network is obtained by training according to the first loss function. The first loss function indicates the similarity between the first distribution information and the second distribution information; the first distribution information is used to indicate the data distribution law of the N pieces of first feature information, which are obtained after the first training data is input into the first feature extraction network; the second distribution information is used to indicate the data distribution law of the N pieces of second feature information, which are obtained by inputting the first training data into the pruned first feature extraction network.
  • Specifically, the execution device prunes the second feature extraction network through the second neural network, and obtains the pruned second feature extraction network.
  • the second neural network is obtained by training according to the first loss function.
  • The specific implementation of the pruning operation through the second neural network is similar to the specific implementation of step 403 in the embodiment corresponding to FIG. 4, and details are not described here.
  • In one case, where the second feature extraction network is a neural network that has performed the pre-training operation, the execution device uses the second neural network to prune the second feature extraction network before entering the fine-tuning stage of the second feature extraction network. In another case, where the second feature extraction network is a trained neural network, the execution device prunes the second feature extraction network through the second neural network, and the pruned second feature extraction network no longer needs to be trained.
  • It should be noted that the result of step 1002 can also be obtained through step 403; that is, when the training of the first neural network (to obtain the second neural network) finishes, the pruned first feature extraction network can be obtained directly. Specifically, when it is determined that the convergence condition of the first loss function is satisfied, the pruned first feature extraction network generated in the current training batch, that is, the pruned first feature extraction network generated during the last training of the first neural network, can be obtained.
  • Pruning the first feature extraction network in the pre-training stage not only compresses the first feature extraction network, reducing the storage space it occupies and improving its efficiency in the inference stage, but also improves the efficiency of the fine-tuning stage when training the first feature extraction network, thereby improving the efficiency of the entire training process of the first feature extraction network.
  • In this embodiment of the present application, the first feature extraction network is pruned by the second neural network, that is, compression of the first feature extraction network is realized, and a compression scheme for neural networks is provided. In addition, the first loss function is used to train the first neural network so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, which ensures that the feature expression capabilities of the feature extraction network before and after pruning are similar and guarantees the performance of the pruned feature extraction network. Moreover, the first feature extraction network can be not only a feature extraction network with a Transformer structure but also the feature extraction network of a recurrent neural network, a convolutional neural network, or another neural network, which expands the application scenarios of this scheme.
  • BERT base and BERT Large represent two different types of neural networks, and the first feature extraction network comes from each of the aforementioned two neural networks. After pruning, the storage space occupied by BERT base and BERT Large is reduced and their processing speed is improved. STS is the abbreviation of Semantic Textual Similarity and represents the type of task performed by the neural network; the serial numbers in STS-12, STS-13, STS-14, and STS-15 represent different training data sets, and each value in Table 2 is an accuracy value. It can be seen from Table 2 above that after pruning is performed through the solution provided in this embodiment of the present application, the performance of the neural network is, on the contrary, improved.
  • FIG. 11 is a schematic structural diagram of a neural network training apparatus provided by an embodiment of the present application.
  • the neural network training device 1100 includes an input module 1101 , a calculation module 1102 , a pruning module 1103 and a training module 1104 .
  • The input module 1101 is configured to input the first training data into the first feature extraction network and obtain the N pieces of first feature information corresponding to the first training data output by the first feature extraction network, where N is an integer greater than 1. The calculation module 1102 is configured to calculate the first distribution information according to the N pieces of first feature information, where the first distribution information is used to indicate the data distribution law of the N pieces of first feature information. The pruning module 1103 is configured to perform a pruning operation on the first feature extraction network through the first neural network to obtain the pruned first feature extraction network. The input module 1101 is further configured to input the first training data into the pruned first feature extraction network to obtain the N pieces of second feature information corresponding to the first training data output by the pruned first feature extraction network.
  • The calculation module 1102 is further configured to calculate the second distribution information according to the N pieces of second feature information, where the second distribution information is used to indicate the data distribution law of the N pieces of second feature information.
  • The training module 1104 is configured to perform a training operation on the first neural network according to the first loss function to obtain the second neural network, where the first loss function indicates the similarity between the first distribution information and the second distribution information.
  • Through the above, an apparatus for training the neural network that performs the pruning operation on the first feature extraction network is provided, and the trained first neural network can be used to prune the first feature extraction network; that is, a compression scheme for neural networks is provided. In addition, the training module 1104 adopts the first loss function to train the first neural network so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, which ensures that the feature expression capabilities of the feature extraction network before and after pruning are similar, thereby guaranteeing the performance of the pruned feature extraction network.
  • In a possible design, the first distribution information includes the value of the distance between any two pieces of first feature information among the N pieces of first feature information, so as to indicate the data distribution law of the N pieces of first feature information; the second distribution information includes the value of the distance between any two pieces of second feature information among the N pieces of second feature information, so as to indicate the data distribution law of the N pieces of second feature information.
  • the first feature extraction network is a feature extraction network in a neural network with a Transformer structure, and the first feature extraction network includes at least two attention heads.
  • In a possible design, the pruning module 1103 is specifically configured to perform, through the first neural network, a pruning operation on the at least two attention heads included in the first feature extraction network to obtain the pruned first feature extraction network, and the pruned first feature extraction network includes fewer attention heads than the first feature extraction network.
  • In a possible design, the pruning module 1103 is specifically configured to generate, through the first neural network, a first score for each of the at least two attention heads, and to perform a pruning operation on the at least two attention heads according to the at least two first scores corresponding to the at least two attention heads.
  • In a possible design, the pruning module 1103 is specifically configured to input each of the at least two attention heads into the first neural network, obtain the second score of each attention head output by the first neural network, and perform discretization processing on the second score to obtain the first score, where the process of discretization processing is differentiable.
  • In a possible design, the first training data includes N sentences, and one piece of first feature information is the feature information of one sentence among the N sentences; or, the first training data is one sentence that includes N words, and one piece of first feature information is the feature information of one word among the N words.
  • the first neural network is any one of the following neural networks: convolutional neural network, recurrent neural network, residual neural network or fully connected neural network.
  • FIG. 12 is a schematic structural diagram of the neural network compression apparatus provided by the embodiment of the present application.
  • the neural network compression apparatus 1200 includes an acquisition module 1201 and a pruning module 1202 .
  • the obtaining module 1201 is used for obtaining the second feature extraction network;
  • the pruning module 1202 is used for pruning the second feature extraction network through the second neural network to obtain the pruned second feature extraction network.
  • The second neural network is obtained by training according to the first loss function; the first loss function indicates the similarity between the first distribution information and the second distribution information; the first distribution information is used to indicate the data distribution law of the N pieces of first feature information, which are obtained by inputting the first training data into the first feature extraction network; the second distribution information is used to indicate the data distribution law of the N pieces of second feature information, which are obtained by inputting the first training data into the pruned first feature extraction network.
  • In this embodiment of the present application, the second feature extraction network is pruned by the second neural network, that is, compression of the second feature extraction network is realized, and a compression scheme for neural networks is provided. In addition, the first loss function is used to train the first neural network so that the data distribution laws of the N pieces of feature information generated by the feature extraction network before and after pruning are similar, which ensures that the feature expression capabilities of the feature extraction network before and after pruning are similar and guarantees the performance of the pruned feature extraction network.
  • In a possible design, the first distribution information includes the value of the distance between any two pieces of first feature information among the N pieces of first feature information, so as to indicate the data distribution law of the N pieces of first feature information; the second distribution information includes the value of the distance between any two pieces of second feature information among the N pieces of second feature information, so as to indicate the data distribution law of the N pieces of second feature information.
  • In a possible design, the second feature extraction network is trained by means of pre-training and fine-tuning; the pruning module 1202 is specifically configured to prune the second feature extraction network through the second neural network before the fine-tuning stage.
  • In a possible design, the second feature extraction network is a feature extraction network in a neural network with a Transformer structure, and the second feature extraction network includes at least two attention heads. The pruning module 1202 is specifically configured to perform, through the second neural network, a pruning operation on the at least two attention heads included in the second feature extraction network to obtain the pruned second feature extraction network, and the pruned second feature extraction network includes fewer attention heads than the second feature extraction network.
  • In a possible design, the pruning module 1202 is specifically configured to generate, through the second neural network, a first score for each of the at least two attention heads, and to perform a pruning operation on the at least two attention heads according to the at least two first scores corresponding to the at least two attention heads.
  • In a possible design, the pruning module 1202 is specifically configured to input each of the at least two attention heads into the second neural network, obtain the second score of each attention head output by the second neural network, and perform discretization processing on the second score to obtain the first score, where the process of discretization processing is differentiable.
  • In a possible design, the first training data includes N sentences, and one piece of first feature information is the feature information of one sentence among the N sentences; or, the first training data is one sentence that includes N words, and one piece of first feature information is the feature information of one word among the N words.
  • the second neural network is any one of the following neural networks: a convolutional neural network, a recurrent neural network, a residual neural network, or a fully connected neural network.
  • FIG. 13 is a schematic structural diagram of the electronic device provided by the embodiment of the present application.
  • The electronic device 1300 may be deployed with the neural network training apparatus 1100 described in the embodiment corresponding to FIG. 11, to implement the functions of the training device in the embodiments corresponding to FIG. 4 to FIG. 9; or, the electronic device 1300 may be deployed with the neural network compression apparatus 1200 described in the embodiment corresponding to FIG. 12, to implement the functions of the execution device in the embodiment corresponding to FIG. 10.
  • The electronic device 1300 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 1322 (for example, one or more processors), a memory 1332, and one or more storage media 1330 (for example, one or more mass storage devices) that store applications 1342 or data 1344.
  • the memory 1332 and the storage medium 1330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instructions to operate on the electronic device.
  • the central processing unit 1322 may be configured to communicate with the storage medium 1330 to execute a series of instruction operations in the storage medium 1330 on the electronic device 1300 .
  • The electronic device 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the central processing unit 1322 is configured to implement the functions of the training device in the embodiments corresponding to FIG. 4 to FIG. 9 . Specifically, the central processing unit 1322 is used for:
  • the central processing unit 1322 also implements other steps performed by the training equipment in the embodiments corresponding to FIG. 4 to FIG. 9 .
  • For the specific implementations and beneficial effects of the functions of the training device performed by the central processing unit 1322 in the embodiments corresponding to FIG. 4 to FIG. 9, reference may be made to the descriptions in the respective method embodiments corresponding to FIG. 4 to FIG. 9, which are not repeated here.
  • the central processing unit 1322 is configured to implement the function of the execution device in the embodiment corresponding to FIG. 10 . Specifically, the central processing unit 1322 is used for:
  • acquiring a second feature extraction network; and pruning the second feature extraction network through the second neural network to obtain a pruned second feature extraction network.
  • The second neural network is obtained by training according to the first loss function; the first loss function indicates the similarity between the first distribution information and the second distribution information; the first distribution information is used to indicate the data distribution law of the N pieces of first feature information, which are obtained by inputting the first training data into the first feature extraction network; the second distribution information is used to indicate the data distribution law of the N pieces of second feature information, which are obtained by inputting the first training data into the pruned first feature extraction network.
  • the central processing unit 1322 also implements other steps performed by the execution device in the embodiment corresponding to FIG. 10 .
  • For the specific implementations and beneficial effects of the functions of the execution device performed by the central processing unit 1322 in the embodiment corresponding to FIG. 10, reference may be made to the descriptions in the method embodiment corresponding to FIG. 10, which are not repeated here.
  • Embodiments of the present application also provide a computer-readable storage medium storing a program; when the program runs on a computer, the computer is caused to perform the steps performed by the training device in the embodiments corresponding to FIG. 4 to FIG. 9, or to perform the steps performed by the execution device in the embodiment corresponding to FIG. 10.
  • The embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the training device in the embodiments corresponding to FIG. 4 to FIG. 9, or the steps performed by the execution device in the embodiment corresponding to FIG. 10.
  • An embodiment of the present application further provides a circuit system, where the circuit system includes a processing circuit configured to perform the steps performed by the training device in the embodiments corresponding to FIG. 4 to FIG. 9, or the steps performed by the execution device in the embodiment corresponding to FIG. 10.
  • the execution device or training device provided in this embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
  • The processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip performs the steps performed by the training device in the embodiments corresponding to FIG. 4 to FIG. 9, or the steps performed by the execution device in the embodiment corresponding to FIG. 10.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • Alternatively, the storage unit may be a storage unit located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), or the like.
  • FIG. 14 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • The chip may be represented as a neural network processor (NPU) 140; the NPU 140 is mounted as a co-processor to the host CPU, and tasks are allocated by the host CPU.
  • the core part of the NPU is the arithmetic circuit 1403, which is controlled by the controller 1404 to extract the matrix data in the memory and perform multiplication operations.
  • the operation circuit 1403 includes multiple processing units (Process Engine, PE).
  • In some implementations, the arithmetic circuit 1403 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1403 is a general-purpose matrix processor.
  • the arithmetic circuit 1403 fetches the data corresponding to the matrix B from the weight memory 1402, and buffers it on each PE in the arithmetic circuit.
  • the operation circuit 1403 fetches the data of the matrix A from the input memory 1401 and performs the matrix operation on the matrix B, and stores the partial result or the final result of the matrix in the accumulator 1408 .
  • Unified memory 1406 is used to store input data and output data.
  • The weight data is transferred to the weight memory 1402 directly through the direct memory access controller (DMAC) 1405, and input data is also moved to the unified memory 1406 via the DMAC.
  • The bus interface unit (BIU) 1410 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1409; it enables the instruction fetch memory 1409 to obtain instructions from the external memory, and also enables the storage unit access controller 1405 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1406 , the weight data to the weight memory 1402 , or the input data to the input memory 1401 .
  • the vector calculation unit 1407 includes a plurality of operation processing units, and further processes the output of the operation circuit 1403 if necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1407 can store the processed output vectors to the unified memory 1406 .
  • The vector calculation unit 1407 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1403, for example, performing linear interpolation on the feature plane extracted by a convolutional layer, or applying a nonlinear function to a vector of accumulated values to generate activation values.
  • the vector computation unit 1407 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1403, eg, for use in subsequent layers in a neural network.
  • The instruction fetch buffer 1409 connected to the controller 1404 is used to store the instructions used by the controller 1404.
  • The unified memory 1406, the input memory 1401, the weight memory 1402, and the instruction fetch buffer 1409 are all on-chip memories; the external memory is private to the NPU hardware architecture.
  • The operations of each layer in a recurrent neural network can be performed by the arithmetic circuit 1403 or the vector calculation unit 1407.
  • The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits that control execution of the program of the method of the first aspect.
  • The device embodiments described above are only schematic: units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • The readable storage medium may be, for example, a floppy disk, a USB flash drive (U disk), a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • a computer device which may be a personal computer, server, or network device, etc.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media.
  • The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed is a neural network training method in the field of neural network compression within artificial intelligence. The method includes: inputting first training data into a first feature extraction network to obtain N pieces of first feature information corresponding to the first training data; computing first distribution information that indicates the data distribution pattern of the N pieces of first feature information; pruning the first feature extraction network by means of a first neural network; inputting the first training data into the pruned first feature extraction network to obtain N pieces of second feature information corresponding to the first training data; computing second distribution information that indicates the data distribution pattern of the N pieces of second feature information; and training the first neural network according to a first loss function that indicates the similarity between the first distribution information and the second distribution information. A training method for the neural network that performs the pruning operation is thus provided; the feature information obtained before and after pruning follows a similar data distribution pattern, which guarantees the performance of the pruned feature extraction network.

Description

神经网络训练的方法、神经网络的压缩方法以及相关设备
本申请要求于2020年9月29日提交中国专利局、申请号为202011057004.5、发明名称为“神经网络训练的方法、神经网络的压缩方法以及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种神经网络训练的方法、神经网络的压缩方法以及相关设备。
背景技术
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。目前,基于深度学习(deep learning)的神经网络进行文本处理是人工智能常见的一个应用方式。
近两年来,基于转换器(Transformer)结构的文本处理模型在自然语言处理(natural language processing,NLP)领域中已经成为了一种新的范式,相比较早期的模型,Transformer结构的文本处理模型的语义特征提取能力更强,具备更长距离的特征捕获能力。
基于Transformer结构的文本处理模型通常比较大,从而导致占据的存储空间较大,且推理速度较慢,因此,一种神经网络的压缩方案亟待推出。
发明内容
本申请实施例提供了一种神经网络训练的方法、神经网络的压缩方法以及相关设备,提供了一种用于对第一特征提取网络执行剪枝操作的神经网络的训练方法,采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能。
为解决上述技术问题,本申请实施例提供以下技术方案:
第一方面,本申请实施例提供一种神经网络的训练方法,可用于人工智能领域中。方法可以包括:训练设备将第一训练数据输入第一特征提取网络,得到第一特征提取网络输出的与第一训练数据对应的N个第一特征信息,N为大于1的整数;根据N个第一特征信息,计算第一分布信息,第一分布信息用于指示N个第一特征信息的数据分布规律。训练设备通过第一神经网络,对第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络;将第一训练数据输入剪枝后的第一特征提取网络,得到剪枝后的第一特征提取网络输出的与第一训练数据对应的N个第二特征信息,根据N个第二特征信息,计算第二分布信息,第二分布信息用于指示N个第二特征信息的数据分布规律。训练设备根据第一损失函数,对第一神经网络执行训练操作,得到第二神经网络;其中,第二神经网络为执行过 训练操作的第一神经网络,第一损失函数指示第一分布信息与第二分布信息之间的相似度,也即迭代训练的目标为拉近第一分布信息与第二分布信息之间的相似度,第一分布信息与第二分布信息之间的相似度用于体现第一分布信息与第二分布信息之间的差异程度,也可以表示为第一分布信息与第二分布信息之间的距离,前述距离可以为KL散度距离、交叉熵距离、欧式距离、马氏距离、余弦距离或其他类型的距离。需要说明的是,在对第一神经网络进行训练的过程中,不修改第一特征提取网络的权重参数。
本实现方式中,通过上述方式,提供了一种用于对第一特征提取网络执行剪枝操作的神经网络的训练方法,执行过训练操作第一神经网络能够用于对第一特征提取网络进行剪枝,也即提供了一种神经网络的压缩方案;此外,采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能;且第一特征提取网络不仅可以为Transform结构的特征提取网络,还可以为循环神经网络、卷积神经网络等神经网络的特征提取网络,扩展了本方案的应用场景。
在第一方面的一种可能实现方式中,第一分布信息包括N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律;第二分布信息包括N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示N个第二特征信息的数据分布规律。也即N个第一特征信息中一个特征信息的分布规律为通过该一个特征信息与N个第一特征信息中每个特征信息之间的距离的值来体现,N个第二特征信息中一个特征信息的分布规律为通过该一个特征信息与N个第二特征信息中每个特征信息之间的距离的值来体现。
本实现方式中,通过计算N个特征信息中任意两个特征信息之间的距离,来确定N个特征信息的数据分布规律,提供了N个特征信息的数据分布规律的一种实现方式,且操作简单,易于实现。
在第一方面的一种可能实现方式中,N个第一特征信息包括第三特征信息和第四特征信息,第三特征信息和第四特征信息均为N个第一特征信息中的任意一个特征信息。训练设备根据N个第一特征信息,计算第一分布信息,可以包括:训练设备直接计算第三特征信息和第四特征信息之间的余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离或交叉熵距离,并确定为第三特征信息和第四特征信息之间的距离。
在第一方面的一种可能实现方式中,N个第一特征信息包括第三特征信息,第三特征信息为N个第一特征信息中的任意一个特征信息。则训练设备根据N个第一特征信息,计算第一分布信息,可以包括:训练设备计算第三特征信息与N个第一特征信息中每个第一特征信息的第一距离,得到的第三特征信息与所有第一特征信息之间的第一距离的和,前述第一距离指的是余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离或交叉熵距离。训练设备计算第三特征信息和第四特征信息之间的第二距离,前述第二距离指的是余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离或交叉熵距离。训练设备将第二距离与所有第一距离之间的和的比值确定为第三特征信息与第四特征信息之间的距离。
在第一方面的一种可能实现方式中,第一分布信息包括N个第一特征信息中每个特征 信息与预设特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律;第二分布信息包括N个第二特征信息中每个特征信息与预设特征信息之间的距离的值,以指示N个第二特征信息的数据分布规律。其中,由于第一特征信息与第二特征信息的形状可以相同,预设特征信息与第一特征信息以及第二特征信息的形状相同,预设特征信息和第一特征信息的形状相同指的是预设特征信息和第一特征信息均为M维张量,且第一特征信息的M维中的第一维和第二特征信息的M维中的第二维的尺寸相同,M为大于或等于1的整数,第一维为第一特征信息的M维中的任一维,第二维为第二特征信息的M维中与第一维相同的维度。作为示例,例如第一特征信息或第二特征信息为包括m个元素的向量,则预设特征信息可以为包括m个0的向量,或者,预设特征信息为包括m个1的向量。
在第一方面的一种可能实现方式中,第一特征提取网络为Transformer结构的神经网络中的特征提取网络,第一特征提取网络中包括至少两个注意力头。训练设备通过第一神经网络,对第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络,包括:训练设备通过第一神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,并根据进行剪枝后仍旧保留下的至少一个注意力头,构建剪枝后的第一特征提取网络。剪枝后的第一特征提取网络包括的注意力头的数量少于第一特征提取网络包括的注意力头的数量。
本实现方式中,技术人员在研究中发现,Transformer结构的神经网络中的部分注意力头是冗余的,或者,Transformer结构的神经网络中的部分注意力头的重要性较低,去掉之后对第一特征提取网络的性能的影响不大,所以将第一特征提取网络选取为Transformer结构的神经网络的特征提取网络,对第一特征提取网络中的注意力头进行剪枝,从而尽可能的提高剪枝后的第一特征提取网络的性能。
在第一方面的一种可能实现方式中,训练设备通过所第一神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,包括:训练设备通过第一神经网络,生成至少两个注意力头中每个注意力头的第一评分,根据与至少两个注意力头对应的至少两个第一评分,对至少两个注意力头执行剪枝操作。其中,一个注意力头的第一评分代表该一个注意力头的重要程度,用于指示一个注意力头是否被剪枝,第一特征提取网络包括的多个注意力头中重要程度高的注意力头将会被保留,重要程度低的注意力头将会被剪枝。
本实现方式中,通过第一神经网络生成每个注意力头的第一评分,进而根据每个注意力头的评分决定该注意力头是否会被剪枝,操作简单,易于实现。
在第一方面的一种可能实现方式中,第一评分的取值为第一预设值或第二预设值,第一预设值和第二预设值的取值不同。第一注意力头为至少两个注意力头中任一个注意力头,当第一注意力头的取值为第一预设值时,第一注意力会被保留;当第一注意力头的取值为第二预设值时,第一注意力头会被剪枝。
在第一方面的一种可能实现方式中,训练设备通过第一神经网络,生成至少两个注意力头中每个注意力头的第一评分,包括:训练设备将至少两个注意力头中每个注意力头输入第一神经网络,得到第一神经网络输出的每个注意力头的第二评分,第二评分可以为连续的评分。具体的,针对至少两个注意力头中第一注意力头的第二评分的生成过程。训练 设备根据自注意力机制,将与第一注意力头对应的注意力矩阵输入第一神经网络中,也即根据与第一注意力头对应的一套注意力矩阵,执行自注意力运算,进而将运算结果输入第一神经网络中,得到第一神经网络输出的第一注意力头的第二评分。训练设备对第二评分进行离散化处理,得到第一评分,离散化处理的过程为可微分的。
本实现方式中,生成每个注意力头的第一评分的过程为可微分的,则在利用第一损失函数,反向更新第一神经网络的权重参数的过程也是连续的,从而使第一神经网络的权重参数的更新过程更为严谨,以提高第一神经网络的训练效率,也有利于得到正确率更高的第一神经网络。
在第一方面的一种可能实现方式中,第一训练数据包括N个句子,一个第一特征信息为N个句子中一个句子的特征信息,一个第二特征信息为N个句子中一个句子的特征信息。或者,第一训练数据为一个句子,一个句子中包括N个词语,一个第一特征信息为N个词语中一个词语的特征信息,一个第二特征信息为N个词语中一个词语的特征信息。
本实现方式中,提供了N个第一特征信息的两种表现形式,提高了本方案的实现灵活性;若一个第一特征信息为N个句子中一个句子的特征信息,则有利于提高训练过程的难度,以提高最后的第一特征提取网络的准确率;若一个第一特征信息为N个词语中一个词语的特征信息,则只需要对一个句子进行特征提取就可以实现对第一神经网络的一次训练,有利于提高第一神经网络的训练过程的效率。
在第一方面的一种可能实现方式中,第一神经网络为以下中的任一种神经网络:卷积神经网络、循环神经网络、残差神经网络或全连接神经网络。本实现方式中,提供了第一神经网络的多种实现方式,提高了本方案的实现灵活性。
在第一方面的一种可能实现方式中,方法还可以包括:训练设备获取最终的剪枝后的第一特征提取网络。具体的,在对第一神经网络进行迭代训练的过程中,当训练设备确定第一损失函数的函数值满足收敛条件后,不会再对第一神经网络进行下一次训练,训练设备可以获取在对第一神经网络进行最后一次训练的过程中,通过第一神经网络(也可以称为第二神经网络)生成的剪枝后的第一特征提取网络(也即在最后一次训练的过程中生成的剪枝后的第一特征提取网络),作为最终的可以输出的剪枝后的第一特征提取网络。
第二方面,本申请实施例提供一种神经网络的压缩方法,其特征在于,方法包括:执行设备获取第一特征提取网络;执行设备通过第二神经网络,对第二特征提取网络进行剪枝,得到剪枝后的第二特征提取网络,第二神经网络为执行过训练操作的神经网络。其中,第二神经网络为根据第一损失函数进行训练得到的,第一损失函数指示第一分布信息与第二分布信息之间的相似度,第一分布信息用于指示N个第一特征信息的数据分布规律,N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,第二分布信息用于指示N个第二特征信息的数据分布规律,N个第二特征信息为将第一训练数据输入剪枝后的第一特征提取网络后得到的。
在第二方面的一种可能实现方式中,第二神经网络为由训练设备训练得到的,执行设备和训练设备可以为同一个设备。第一特征提取网络和第二特征提取网络的神经网络结构可以完全相同,也即第一特征提取网络和第二特征提取网络包括的神经网络层完全相同。 或者,第一特征提取网络和第二特征提取网络的神经网络结构也可以有所不同,在第二特征提取网络与第一特征提取网络均为Transform结构的特征提取网络的情况下,仅需要保证第二特征提取网络的一个多头注意力层中包括的注意力头的个数,与,第一特征提取网络的一个多头注意力层中包括的注意力头的个数相同即可。
在第二方面的一种可能实现方式中,第一分布信息包括N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律;第二分布信息包括N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示N个第二特征信息的数据分布规律。
在第二方面的一种可能实现方式中,第二特征提取网络为采用预训练和微调(fine-tune)的方式进行训练,通过第二神经网络,对第二特征提取网络进行剪枝,包括:在对第二特征提取网络进行微调之前,通过第二神经网络,对执行过预训练操作的第二特征提取网络进行剪枝。
本实现方式中,在预训练阶段对第一特征提取网络进行剪枝,不仅能够实现对第一特征提取网络的压缩,以减少第一特征提取网络所占的存储空间,提高第一特征提取网络在推理阶段的效率,也可以提高对第一特征提取网络进行训练时微调阶段的效率,从而提高第一特征提取网络的训练过程的效率。
在第二方面的一种可能实现方式中,第一特征提取网络为Transformer结构的神经网络中的特征提取网络,第一特征提取网络中包括至少两个注意力头。执行设备通过第二神经网络,对第一特征提取网络进行剪枝,得到剪枝后的第二神经网络,第二神经网络为执行过训练操作的神经网络,包括:通过第二神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,得到剪枝后的第一特征提取网络,剪枝后的第一特征提取网络包括的注意力头的数量少于第一特征提取网络包括的注意力头的数量。
在第二方面的一种可能实现方式中,执行设备通过所第二神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,包括:执行设备通过第二神经网络,生成至少两个注意力头中每个注意力头的第一评分,一个注意力头的第一评分用于指示一个注意力头是否被剪枝;根据与至少两个注意力头对应的至少两个第一评分,对至少两个注意力头执行剪枝操作。
在第二方面的一种可能实现方式中,执行设备通过第二神经网络,生成至少两个注意力头中每个注意力头的第一评分,包括:执行设备将至少两个注意力头中每个注意力头输入第二神经网络,得到第二神经网络输出的每个注意力头的第二评分;对第二评分进行离散化处理,得到第一评分,离散化处理的过程为可微分的。
本申请实施例的第二方面还可以执行第一方面的各个可能实现方式中的步骤,对于本申请实施例第二方面以及第二方面的各种可能实现方式的具体实现步骤、名词的含义以及每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
第三方面,本申请实施例提供一种神经网络的训练装置,可用于人工智能领域中。神经网络的训练装置包括:输入模块,用于将第一训练数据输入第一特征提取网络,得到第 一特征提取网络输出的与第一训练数据对应的N个第一特征信息,N为大于1的整数;计算模块,用于根据N个第一特征信息,计算第一分布信息,第一分布信息用于指示N个第一特征信息的数据分布规律;剪枝模块,用于通过第一神经网络,对第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络;输入模块,还用于将第一训练数据输入剪枝后的第一特征提取网络,得到剪枝后的第一特征提取网络输出的与第一训练数据对应的N个第二特征信息;计算模块,还用于根据N个第二特征信息,计算第二分布信息,第二分布信息用于指示N个第二特征信息的数据分布规律;训练模块,用于根据第一损失函数,对第一神经网络执行训练操作,得到第二神经网络,第一损失函数指示第一分布信息与第二分布信息之间的相似度。
本申请实施例的第三方面还可以执行第一方面的各个可能实现方式中的步骤,对于本申请实施例第三方面以及第三方面的各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
第四方面,本申请实施例提供一种神经网络的压缩装置,可用于人工智能领域中。装置包括:获取模块,用于获取第二特征提取网络;剪枝模块,用于通过第二神经网络,对第二特征提取网络进行剪枝,得到剪枝后的第二特征提取网络;其中,第二神经网络为根据第一损失函数进行训练得到的,第一损失函数指示第一分布信息与第二分布信息之间的相似度,第一分布信息用于指示N个第一特征信息的数据分布规律,N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,第二分布信息用于指示N个第二特征信息的数据分布规律,N个第二特征信息为将第一训练数据输入剪枝后的第一特征提取网络后得到的。
本申请实施例的第四方面还可以执行第二方面的各个可能实现方式中的步骤,对于本申请实施例第四方面以及第四方面的各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第二方面中各种可能的实现方式中的描述,此处不再一一赘述。
第五方面,本申请实施例提供了一种训练设备,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第一方面所述的神经网络的训练方法。
第六方面,本申请实施例提供了一种执行设备,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第二方面所述的神经网络的压缩方法。
第七方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的神经网络的训练方法,或者,使得计算机执行上述第二方面所述的神经网络的压缩方法。
第八方面,本申请实施例提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面所述的神经网络的训练方法,或者,执行上述第二方面所述的神经网络的压缩方法。
第九方面,本申请实施例提供了一种计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的神经网络的训练方法,或者,执行上述第二方面所述的神经网络的压缩方法。
第十方面,本申请实施例提供了一种芯片系统,该芯片系统包括处理器,用于实现上述各个方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
附图说明
图1为本申请实施例提供的人工智能主体框架的一种结构示意图;
图2为本申请实施例提供的Transformer结构的神经网络中特征提取网络的一种架构示意图;
图3为本申请实施例提供的神经网络的压缩系统的一种系统架构图;
图4为本申请实施例提供的神经网络的训练方法的一种流程示意图;
图5为本申请实施例提供的神经网络的训练方法中N个第一特征信息的分布情况的两种示意图;
图6为本申请实施例提供的神经网络的训练方法中第一分布信息的一个示意图;
图7为本申请实施例提供的神经网络的训练方法中对注意力头进行剪枝过程的一个示意图;
图8为本申请实施例提供的神经网络的训练方法中第一分布信息和第二分布信息的示意图;
图9为本申请实施例提供的神经网络的训练方法的另一种流程示意图;
图10为本申请实施例提供的神经网络的压缩方法的一种流程示意图;
图11为本申请实施例提供的神经网络的训练装置的一种结构示意图;
图12为本申请实施例提供的神经网络的压缩装置的一种结构示意图;
图13为本申请实施例提供的电子设备的一种结构示意图;
图14为本申请实施例提供的芯片的一种结构示意图。
具体实施方式
本申请实施例提供了一种神经网络训练的方法、神经网络的压缩方法以及相关设备,提供了一种用于对第一特征提取网络执行剪枝操作的神经网络的训练方法,采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下 可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片提供,前述智能芯片包括但不限于中央处理器(central processing unit,CPU)、嵌入式神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专用集成电路(application specific integrated circuit,ASIC)和现场可编程逻辑门阵列(field programmable gate array,FPGA)等硬件加速芯片;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决 方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶、平安城市等。
本申请实施例可以应用于对人工智能的各种领域中,可以包括自然语言处理领域、图像处理领域和音频处理领域,具体可以应用于需要对各种领域的各种类型神经网络进行压缩的场景中。前述各种类型的神经网络包括但不限于循环神经网络、卷积神经网络、残差神经网络、全连接神经网络和转换器(Transformer)结构的神经网络等,后续实施例中仅为待压缩的神经网络为Transformer结构的神经网络,且应用于自然语言处理领域为例进行介绍,当待压缩的神经网络(也即第一特征提取网络)为其他类型的神经网络时,或者当待压缩的神经网络处理的为其他类型的数据,例如第一特征提取网络为处理图像数据或音频数据时,均可以类推理解,此处不做赘述。为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)Transformer结构的神经网络
Transformer结构的神经网络可以包括编码器(encoder)部分(也即Transformer结构的神经网络中的特征提取网络)和解码器(decoder)部分,参阅图2,图2为本申请实施例提供的Transformer结构的神经网络中特征提取网络的一种架构示意图。如图2所示,Transformer结构的神经网络中特征提取网络包括嵌入层和至少一个Transformer层,一个Transformer层中包括多头(multi-head)注意力层、求和与归一化(add&norm)层、前馈(feed forward)神经网络层和求和与归一化层,也即待处理文本经过Transformer结构的神经网络中特征提取网络的处理之后,能够得到整个待处理文本的特征信息。该特征信息为待处理文本的一种适合计算机处理的特征信息,可用于文本相似度、文本分类、阅读理解、机器翻译等任务。接下来,结合具体例子对上述嵌入层和多头注意力层进行具体介绍。
嵌入层在获取待处理文本后,可以对待处理文本中各个词进行嵌入处理,以得到各个词的初始特征信息。待处理文本可以为一段文本,也可以为一个句子。文本可以为中文文本,也可以为英文文本,还可以为其他语言文本。
具体的,在一些实施例中,如图2所示,嵌入层包括输入嵌入(input embedding)层和位置编码(positional encoding)层。在输入嵌入层,可以对待处理文本中的各个词进行词嵌入处理,从而得到各个词的词嵌入张量,张量具体可以表现为一维的向量、二维的矩阵、三维或更多维的数据等等。在位置编码层,可以获取各个词在待处理文本中的位置,进而对各个词的位置生成位置张量。在一些示例中,各个词的位置可以为各个词在待处理文本中的绝对位置。以待处理文本为“今天天气真好”为例,其中的“今”的位置可以表示为第一位,“天”的位置可以表示为第二位,……。在一些示例中,各个词的位置可以为各个词之间的相对位置。仍以待处理文本为“今天天气真好”为例,其中的“今”的位置可以表示为“天”之前,“天”的位置可以表示为“今”之后、“天”之前,……。在得到待处理文本中每个词的词嵌入张量和位置张量之后,可以将每个词的位置张量和词嵌入张量进行组合,得到每个词的初始特征信息,从而得到与待处理文本对应的初始特征信息。
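To make the embedding layer concrete, here is a minimal NumPy sketch that combines a word-embedding lookup with an absolute position encoding; the vocabulary size, the dimensions, the random initialization, and the choice of addition to combine the word tensor with the position tensor are illustrative assumptions, not details fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 1000, 128, 64        # illustrative sizes
word_emb = rng.normal(size=(vocab_size, d_model))   # input-embedding table
pos_emb = rng.normal(size=(max_len, d_model))       # absolute-position table

def embed(token_ids):
    """Initial feature information of a text: each token's word-embedding
    tensor combined (here by addition) with its position tensor."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + pos_emb[positions]

X = embed([5, 17, 17, 42, 9, 3])   # a hypothetical 6-token input
print(X.shape)                     # (6, 64): one initial feature row per token
```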
多头注意力层也可以称为注意力层,在一个例子中,注意力层可以为固定窗口多头注 意力(fixed window multi-head attention)层。多个注意力头中每个注意力头对应一套注意力矩阵(attention matrix),一套注意力矩阵中包括第一转换矩阵、第二转换矩阵和第三转换矩阵,第一转换矩阵、第二转换矩阵和第三转换矩阵的功能不同,第一转换矩阵用于生成待处理文本的查询(Query)特征信息,第二转换矩阵用于生成待处理文本的键(Key)特征信息,第三转换矩阵用于生成待处理文本的价值(Value)特征信息。不同的注意力头用于提取待处理文本在不同角度的语义信息,作为示例,例如一个注意力头关注的可以为待处理文本的句子成分,另一个注意力头关注的可以为待处理文本的主谓宾结构,另一个注意力头关注的可以为待处理文本中各个词语之间的依存关系等,需要说明的是,此处举例仅为方便理解本方案,在实际情况中每个注意力头关注的特征信息是在训练过程中模型自己学习的,前述例子更多的是想要解释多个注意力头的学习能力,不用于限定本方案。为更直观地理解本方案,如图2所示,多头注意力层包括z个注意力头(head),虽然图2中以h的取值为3为例,但实际情况中可以包括更多或更少的注意力头。多个注意力头中任一个注意力头的运行方式可以通过如下公式表示:
$$\mathrm{head}_i=\mathrm{Attention}(Q_i,K_i,V_i)=\mathrm{softmax}\!\left(\frac{Q_iK_i^{\top}}{\sqrt{d}}\right)V_i$$
$$Q_i=XW_i^{Q},\quad K_i=XW_i^{K},\quad V_i=XW_i^{V}$$
其中，X代表整个待处理文本的初始特征信息（也即将整个待处理文本输入嵌入层之后得到的初始特征信息），其中包括待处理文本中每个词的初始特征信息；$\mathrm{head}_i$代表将待处理文本的初始特征信息输入z个注意力头中第i个注意力头后得到的输出；$\mathrm{Attention}(\cdot)$代表第i个注意力头在计算过程中采用了注意力机制；$W_i^{Q}$代表第i个注意力头中的第一转换矩阵，$W_i^{K}$代表第i个注意力头中的第二转换矩阵，$W_i^{V}$代表第i个注意力头中的第三转换矩阵；$K_i^{\top}$代表$K_i$的转置，$Q_iK_i^{\top}$代表$Q_i$和$K_i^{\top}$之间的外积，$\mathrm{softmax}(\cdot)V_i$代表归一化结果和$V_i$之间的外积；z代表注意力层中注意力头的个数。应理解，此处举例仅为方便理解注意力头的运行方式，不用于限定本方案。
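To make the per-head computation concrete, here is a minimal NumPy sketch of one attention head and the concatenation of z heads; the per-head dimension and the $1/\sqrt{d}$ scaling follow standard Transformer convention and are assumptions here rather than quotations from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One head: the head's first/second/third transformation matrices
    produce Q_i, K_i, V_i, then softmax(Q_i K_i^T / sqrt(d)) V_i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(1)
d_model, d_head, z = 64, 16, 4                  # z attention heads (illustrative)
X = rng.normal(size=(6, d_model))               # initial feature info of 6 tokens
heads = []
for _ in range(z):                              # each head owns its own matrix set
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention_head(X, W_q, W_k, W_v))
multi_head_out = np.concatenate(heads, axis=-1) # shape (6, z * d_head)
```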
在一些实施例中,如图2所示,多头注意力层可以为嵌入层的下一层;在另一些实施例中,Transformer结构的神经网络的特征提取网络中可以多个Transformer层,则最后一个Transformer层输出的为待处理文本的特征信息。
(2)注意力机制(attention mechanism)
每个注意力头的运行原理为注意力机制，注意力机制模仿了生物观察行为的内部过程，即一种将内部经验和外部感觉对齐从而增加部分区域的观察精细度的机制，能够利用有限的注意力资源从大量信息中快速筛选出高价值信息。注意力机制可以快速提取稀疏数据的重要特征，因而被广泛用于自然语言处理任务，特别是机器翻译。而自注意力机制(self-attention mechanism)是注意力机制的改进，其减少了对外部信息的依赖，更擅长捕捉数据或特征的内部相关性。注意力机制的本质思想可以改写为如下公式：
$$\mathrm{Attention}(Query,\ Source)=\sum_{i=1}^{L_x}\mathrm{Similarity}(Query,\ Key_i)\times Value_i$$
其中,Lx=||Source||代表Source的长度,Source代表输入的待处理文本,公式含义即将Source包括的多个元素想象成是由一系列的数据对构成,此时给定目标Target中的某个元素的Query,通过计算前述某个元素的Query和Source中所有元素的Key的相似性或者相关性,得到Source中每个元素的Key对应Value的权重系数,然后对Source中每个元素的Value进行加权求和,即得到了前述某个元素的最终的Attention数值。所以本质上Attention机制是对Source中各个元素的Value值进行加权求和,而Query和Key用来计算对应Value的权重系数。从概念上理解,把Attention可以理解为从大量信息中有选择地筛选出少量重要信息并聚焦到这些重要信息上,忽略大多不重要的信息。聚焦的过程体现在权重系数的计算上,权重越大越聚焦于其对应的Value值上,即权重代表了信息的重要性,而Value是其对应的信息。自注意力机制可以理解为内部Attention(intra attention),注意力机制发生在Target中元素的Query和Source中的所有元素之间,自注意力机制指的是在Source内部元素之间或者Target内部元素之间发生的Attention机制,也可以理解为Target=Source这种特殊情况下的注意力计算机制,其具体计算过程是一样的,只是计算对象发生了变化而已。
(3)自然语言处理
自然语言(natural language)即人类语言,自然语言处理就是对人类语言的处理。自然语言处理是以一种智能与高效的方式,对文本数据进行系统化分析、理解与信息提取的过程。通过使用NLP及其组件,我们可以管理非常大块的文本数据,或者执行大量的自动化任务,并且解决各式各样的问题,如自动摘要(automatic summarization),机器翻译(machine translation,MT),命名实体识别(named entity recognition,NER),关系提取(relation extraction,RE),信息抽取(information extraction,IE),情感分析(Sentiment analysis),语音识别(speech recognition),问答系统(question answering),自然语言推断(Natural language inference)以及主题分割等等。
示例性的,自然语言处理任务可以有以下几类。
序列标注:句子中每一个单词要求模型根据上下文给出一个分类类别。如中文分词、词性标注、命名实体识别、语义角色标注。
分类任务:整个句子输出一个分类值,如文本分类。
句子关系推断:给定两个句子,判断这两个句子是否具备某种名义关系。例如entilment、QA、语义改写、自然语言推断。
生成式任务:输出一段文本,生成另一段文本。如机器翻译、文本摘要、写诗造句、看图说话。
下面示例性的列举一些自然语言处理案例。
分词(word segmentation或word breaker,WB):将连续的自然语言文本,切分成具有语义合理性和完整性的词汇序列,可以解决交叉歧义问题。
命名实体识别(named entity recognition,NER):识别自然语言文本中具有特定意义的实体(人、地、机构、时间、作品等),可以从粒度整合未登录体词。例句:天使爱美丽在线观看;分词:天使爱美丽在线观看;实体:天使爱美丽->电影。
词性标注(part-speech tagging):为自然语言文本中的每个词汇赋予一个词性(名词、动词、形容词等);依存句法分析(dependency parsing):自动分析句子中的句法成分(主语、谓语、宾语、定语、状语和补语等成分),可以解决结构歧义问题。评论:房间里还可以欣赏日出;歧义1:房间还可以;歧义2:可以欣赏日出;词性:房间里(主语),还可以(谓语),欣赏日出(动宾短语)。
词向量与语义相似度(word embedding&semantic similarity):对词汇进行向量化表示,并据此实现词汇的语义相似度计算,可以解决词汇语言相似度。例如:西瓜与(呆瓜/草莓),哪个更接近?向量化表示:西瓜(0.1222,0.22333,..);相似度计算:呆瓜(0.115)草莓(0.325);向量化表示:(-0.333,0.1223..)(0.333,0.3333,..)。
文本语义相似度(text semantic similarity):依托全网海量数据和深度神经网络技术,实现文本间的语义相似度计算的能力,可以解决文本语义相似度问题。例如:车头如何防止车牌与(前牌照怎么装/如何办理北京牌照),哪个更接近?向量化表示:车头如何防止车牌(0.1222,0.22333,..);相似度计算:前牌照怎么装(0.762),如何办理北京牌照(0.486),向量化表示:(-0.333,0.1223..)(0.333,0.3333,..)。
本申请实施例提供的神经网络的训练方法,用于训练一个任务目标为对第一特征提取网络执行剪枝操作的第一神经网络,并且保证剪枝前后的第一特征提取网络的特征表达性能基本不变。为了便于理解本方案,本申请实施例中首先结合图3对本申请实施例提供的神经网络的压缩系统进行介绍,请先参阅图3,图3为本申请实施例提供的神经网络的压缩系统的一种系统架构图。神经网络的压缩系统中包括训练设备310、数据库320、执行设备330、数据存储系统340和客户设备350;执行设备330中包括计算模块331和输入/输出(I/O)接口332。
在一种情况下,第一特征提取网络302的训练过程采用的为预训练和微调的方式。则在一种实现方式中,如图3所示,第一神经网络301为在第一特征提取网络302预训练阶段对第一特征提取网络302进行剪枝。则具体的,数据库320中存储有第一训练数据集合,第一训练数据集合中可以包括多个训练文本。在第一神经网络301的训练阶段,训练设备310获取第一特征提取网络302,第一特征提取网络302为已经进行过预训练的神经网络,训练设备310生成用于执行剪枝操作的第一神经网络301,并利用第一训练数据集合中的多个训练文本和第一特征提取网络302,对第一神经网络301进行训练,以得到执行过训练操作的第一神经网络301,需要说明的是,在第一神经网络301的训练过程中不会修改第一特征提取网络302的权重参数。
在第一神经网络301的推理阶段,训练设备310利用成熟的第一神经网络301对第一特征提取网络302进行剪枝,以得到剪枝后的第一特征提取网络302,训练设备310将剪枝后的第一特征提取网络302发送给执行设备330。
执行设备330可以调用数据存储系统340中的数据、代码等,也可以将数据、指令等 存入数据存储系统340中。数据存储系统340可以配置于执行设备330中,也可以为数据存储系统340相对执行设备330是外部存储器。数据存储系统340中可以存储有第二训练数据集合,第二训练数据集合中包括多个训练文本以及每个训练文本的正确结果。在第一特征提取网络302的微调阶段,执行设备330利用第二训练数据集合对集成有剪枝后的第一特征提取网络302的第三神经网络进行训练,以得到成熟的第三神经网络。
本申请的一些实施例中,如图3所示,在第三神经网络的推理阶段,“用户”与客户端直接进行交互,执行设备330通过I/O接口332获取客户设备350发送的待处理文本,计算模块211通过成熟的第三神经网络对待处理文本进行处理,以生成待处理文本的预测结果,并通过I/O接口332向客户设备350发送待处理文本的预测结果。
但图3仅是本发明实施例提供的神经网络的压缩系统的一种示例,图中所示设备、器件、模块等之间的位置关系不构成任何限制。在本申请的另一些实施例中,执行设备330和客户设备350可以集成于同一设备中。或者,执行设备330可以被分为第三神经网络的训练设备和第三神经网络的执行设备两个独立的设备,由第三神经网络的训练设备执行第一特征提取网络302的微调阶段的步骤,由第三神经网络的执行设备执行第三神经网络的推理阶段的步骤。
在另一种情况下,第一特征提取网络302的训练过程采用的不是预训练和微调的训练方式,与上一种情况的区别在于,训练设备310获取第三神经网络,第三神经网络为执行过训练操作的神经网络,也即第三神经网络为成熟的神经网络,第三神经网络中集成有第一特征提取网络302。训练设备310在得到剪枝后的第一特征提取网络302之后,也即得到剪枝后的第三神经网络,训练设备310将剪枝后的第三神经网络发送给执行设备330,剪枝后的第三神经网络中包括剪枝后的第一特征提取网络302。
执行设备330在得到剪枝后的第三神经网络之后,不再对剪枝后的第三神经网络进行训练,而是直接根据剪枝后的第三神经网络,执行推理阶段的操作。对应的,在本种情况下,执行设备330和客户设备350可以集成于同一设备中。
由图3中的描述可知,本申请实施例包括第一神经网络301(也即用于执行剪枝操作的神经网络)的推理阶段和训练阶段,而第一神经网络301推理阶段和训练阶段的流程有所不同,以下分别对第一神经网络301推理阶段和训练阶段进行描述。
一、第一神经网络的训练阶段
本申请实施例中,请参阅图4,图4为本申请实施例提供的神经网络的训练方法的一种流程示意图,本申请实施例提供的神经网络的训练方法可以包括:
401、训练设备将第一训练数据输入第一特征提取网络,得到第一特征提取网络输出的与第一训练数据对应的N个第一特征信息。
本申请实施例中,训练设备中配置有训练数据集合,训练数据集合中包括多个训练数据,由于以第一特征提取网络为用于处理文本数据的特征提取网络为例,则每个训练数据可以表现为训练文本。训练设备将第一训练数据输入第一特征提取网络,得到第一特征提取网络输出的与第一训练数据对应的N个第一特征信息。
其中,由于以第一特征提取网络表现为转换器(Transformer)结构的神经网络中的特 征提取网络为例,第一特征提取网络也可以称为编码器,第一特征提取网络中包括至少两个注意力头,第一特征提取网络以及注意力头的具体网络结构可以参阅上述图2中的描述,此处不做赘述。进一步地,第一特征提取网络归属于用于进行自然语言处理的第三神经网络中,自然语言处理类型的任务又可以有多种,例如分词、命名实体识别、词性标注等等,此处不做穷举,对于前述各种任务的具体举例可参阅上述描述,此处也不再赘述。
具体的,第一训练数据中包括N个训练数据,每个训练数据可以表现为一个句子。在一种实现方式中,第一训练数据包括N个句子,步骤401可以包括:训练设备将N个句子分别输入第一特征提取网络,从而分别得到第一特征提取网络输出的N个句子中每个句子的第一特征信息,也即得到N个第一特征信息。一个第一特征信息为N个句子中一个句子的特征信息。N为大于或等于2的整数,作为示例,例如N的取值可以为2、3、4、5、6或其他数值等等。
在另一种实现方式中,第一训练数据为一个句子,一个句子中包括N个词语。步骤301可以包括:训练设备将前述一个句子输入第一特征提取网络,以通过第一特征提取网络生成该一个句子的特征信息,从一个句子的特征信息中获取每个词语的特征信息,也即对该一个句子的特征信息进行分解,以得到N个词语中每个词语的特征信息,一个第一特征信息为N个词语中一个词语的特征信息。本申请实施例中,提供了N个第一特征信息的两种表现形式,提高了本方案的实现灵活性;若一个第一特征信息为N个句子中一个句子的特征信息,则有利于提高训练过程的难度,以提高最后的第一特征提取网络的准确率;若一个第一特征信息为N个词语中一个词语的特征信息,则只需要对一个句子进行特征提取就可以实现对第一神经网络的一次训练,有利于提高第一神经网络的训练过程的效率。
在另一种实现方式中,第一训练数据为一个词语,一个词语中包括N个字母,步骤401可以包括:训练设备将前述一个词语输入第一特征提取网络,以通过第一特征提取网络生成该一个词语的特征信息,从一个词语的特征信息中获取每个字母的特征信息,也即对该一个词语的特征信息进行分解,以得到N个字母中每个字母的特征信息,一个第一特征信息为N个字母中一个字母的特征信息。
402、训练设备根据N个第一特征信息,计算第一分布信息,第一分布信息用于指示N个第一特征信息的数据分布规律。
本申请实施例中,训练设备在得到N个第一特征信息之后,会计算第一分布信息。其中,第一分布信息具体可以通过表格、矩阵、数组、索引等形式进行存储,第一分布信息用于指示N个第一特征信息的数据分布规律,包括N个第一特征信息中每个特征信息的分布情况。
进一步地,在一种情况下,第一分布信息包括N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律;也即N个第一特征信息中一个特征信息的分布规律为通过该一个特征信息与N个第一特征信息中每个特征信息之间的距离的值来体现。两个第一特征信息之间的距离越远,两个第一特征信息之间的相似度越小;两个第一特征信息之间的距离越近,两个第一特征信息之间的相似度越大。本申请实施例中,通过计算N个特征信息中任意两个特征信息之间的距离,来确定N个特征信 息的数据分布规律,提供了N个特征信息的数据分布规律的一种实现方式,且操作简单,易于实现。
具体的,训练设备在得到N个第一特性信息之后,N个第一特征信息中包括一个第三特征信息和一个第四特征信息,第三特征信息和第四特征信息均为N个第一特征信息中的任意一个特征信息。在一种实现方式中,训练设备可以直接计算第三特征信息和第四特征信息之间的余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离、交叉熵距离或其他类型的距离等,并确定为第三特征信息和第四特征信息之间的距离,训练设备对N个第一特性信息中任意两个特征信息均执行前述操作,以得到第一分布信息。
在另一种实现方式中,先以在余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离、交叉熵距离等类型的距离中选取余弦距离为例,训练设备计算第三特征信息与N个第一特征信息中每个第一特征信息的第一余弦距离,得到的第三特征信息与所有第一特征信息之间的第一余弦距离的和,并计算第三特征信息和第四特征信息之间的第二余弦距离,将第二余弦距离与所有第一余弦距离之间的和的比值确定为第三特征信息与第四特征信息之间的距离。
为进一步理解本方案,以下公开两个第一特征信息之间距离的计算公式的一个示例,例如N个第一特征信息中包括特征信息h n和特征信息h i,则
$$\mathrm{dist}(h_n,h_i)=\frac{\mathrm{Dist}_{\cos}(h_n,h_i)}{\sum_{j=1}^{N}\mathrm{Dist}_{\cos}(h_n,h_j)}\tag{1}$$
$$\mathrm{Dist}_{\cos}(h_n,h_i)=1-\frac{h_n\cdot h_i}{\left\|h_n\right\|\left\|h_i\right\|}\tag{2}$$
其中，$\mathrm{dist}(h_n,h_i)$代表第一分布信息中的一个数值，为N个第一特征信息中第n个特征信息的分布信息中的第i项，也即代表N个第一特征信息中第n个特征信息与第i个特征信息之间的距离的值；$\mathrm{Dist}_{\cos}(h_n,h_i)$代表计算$h_n$和$h_i$之间的余弦距离，式(2)公开了计算$h_n$和$h_i$之间的余弦距离的具体公式；$\sum_{j=1}^{N}\mathrm{Dist}_{\cos}(h_n,h_j)$代表计算$h_n$与N个第一特征信息中每个特征信息之间的余弦距离的和。应理解，式(1)和式(2)中的举例仅为方便理解本方案，在其他实施例中余弦距离也可以被替换为欧式距离、曼哈顿距离、马氏距离、一阶距离、交叉熵距离等，此处不做限定。
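Reading equations (1) and (2) together, the first distribution information can be sketched as a row-normalized pairwise cosine-distance matrix. The helper names below and the interpretation of cosine distance as one minus cosine similarity are assumptions for illustration:

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def distribution_info(features, eps=1e-12):
    """features: (N, d) array of N feature-information vectors. Entry
    [n, i] is Dist_cos(h_n, h_i) divided by the sum of the distances
    from h_n to all N features, as in equation (1)."""
    N = len(features)
    D = np.array([[cosine_distance(features[n], features[i])
                   for i in range(N)] for n in range(N)])
    return D / (D.sum(axis=1, keepdims=True) + eps)

H1 = np.random.default_rng(2).normal(size=(4, 64))  # N = 4 first feature infos
P1 = distribution_info(H1)                          # first distribution information
```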
为更直观地理解第一分布信息,请参阅图5和图6,图5为本申请实施例提供的神经网络的训练方法中N个第一特征信息的分布情况的两种示意图,图6为本申请实施例提供的神经网络的训练方法中第一分布信息的一个示意图。图5和图6中均以N的取值为4为例。请先参阅图5,图5包括(a)子示意图和(b)子示意图,A1、A2、A3和A4分别代表通过第一特征提取网络生成的4个句子(也即第一训练数据中包括4个句子)的特征信息,也即示出了4个第一特征信息的分布情况,由于图5的两个子示意图中均可以直观的看出4个第一特征信息的分布情况,此处不再进行介绍。
继续参阅图6,图6中以第一分布信息表现为一个矩阵为例,矩阵中的每个值均代表两个第一特征信息之间的距离,例如B1代表A3和A4这两个特征信息之间的距离,图6 中示出的第一分布信息表示的为图5的(a)子示意图中4个第一特征信息的分布情况,A1和A1之间的距离的值为0,A1和A2之间的距离的值为2,A1和A3之间的距离的值为6等,由于图5中A1和A3之间的距离最远,则对应的,图6中A1和A3之间的距离的值最大,图6中示出的矩阵可以结合图5进行理解,此处不对图6中的矩阵值进行一一解释,需要说明的是,图5和图6中的示例仅为方便理解本方案,在实际应用中,第一分布信息还可以表现为其他形式,例如表格、数组等,或者第一分布信息中的每个距离的值均可以为进行过归一化处理后的值等,此处均不做限定。
在另一种情况下,第一分布信息包括N个第一特征信息中每个特征信息与预设特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律。其中,一个第一特征信息与预设特征信息之间的距离越远,该第一特征信息与预设特征信息之间的相似度越小;一个第一特征信息与预设特征信息之间的相似度越大,该第一特征信息与预设特征信息之间的相似度越大。
预设特征信息与第一特征信息的形状相同,预设特征信息和第一特征信息的形状相同指的是预设特征信息和第一特征信息均为M维张量,且第一特征信息的M维中的第一维和第二特征信息的M维中的第二维的尺寸相同,M为大于或等于1的整数,第一维为第一特征信息的M维中的任一维,第二维为第二特征信息的M维中与第一维相同的维度。作为示例,例如第一特征信息为包括m个元素的向量,则预设特征信息可以为包括m个0的向量,或者,预设特征信息为包括m个1的向量等等,此处举例仅为方便理解预设特征信息的概念,不用于限定本方案。
具体的,在一种实现方式中,训练设备在得到N个第一特征信息之后,针对第三特征信息(N个第一特征信息中的任一个特征信息),可以计算第三特征信息与预设特征信息之间的余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离、交叉熵距离或其他类型的距离等,并确定为第三特征信息和预设特征信息之间的距离,训练设备对N个第一特性信息中每个特征信息均执行前述操作,以得到第一分布信息。
在另一种实现方式中,先以在余弦距离、欧式距离、曼哈顿距离、马氏距离、一阶距离、交叉熵距离等类型的距离中选取余弦距离为例,训练设备计算N个第一特征信息中每个特征信息与预设特征信息之间的第三余弦距离,得到的N个第一特征信息中所有特征信息与预设特征信息之间的第三余弦距离的和,并计算第三特征信息与预设特征信息之间的第四余弦距离,将第四余弦距离与所有第三余弦距离的和之间的比值确定为第三特征信息与预设特征信息之间的距离。
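The preset-feature variant described above admits a similar sketch; choosing an all-ones preset vector and cosine distance here is just one of the options the text enumerates, not the required configuration:

```python
import numpy as np

def preset_distribution_info(features, preset, eps=1e-12):
    """Entry n is the cosine distance from h_n to the preset feature,
    normalized by the sum of such distances over all N features."""
    d = np.array([1.0 - (f @ preset) /
                  (np.linalg.norm(f) * np.linalg.norm(preset) + eps)
                  for f in features])
    return d / (d.sum() + eps)

H = np.random.default_rng(3).normal(size=(4, 16))
preset = np.ones(16)   # a preset feature with the same shape as each h_n
p = preset_distribution_info(H, preset)
```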
403、训练设备通过第一神经网络,对第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络。
本申请实施例中,训练设备会通过第一神经网络,对第一特征提取网络执行剪枝操作,以得到剪枝后的第一特征提取网络。其中,第一神经网络可以表现为各种类型的神经网络,第一神经网络为以下中的任一种神经网络:卷积神经网络、循环神经网络、残差神经网络或全连接神经网络。本申请实施例中,提供了第一神经网络的多种实现方式,提高了本方案的实现灵活性。
具体的,训练设备可以通过第一神经网络对第一特征提取网络的权重参数进行剪枝,或者,对第一特征提取网络中的神经网络层进行剪枝,或者,对第一特征提取网络的神经网络层中的至少一个注意力头进行剪枝。
可选地,在第一特征提取网络为Transformer结构的神经网络中的特征提取网络的情况下,第一特征提取网络的注意力层可以包括至少两个注意力头,步骤403可以包括:训练设备通过第一神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,并根据进行剪枝后仍旧保留下的至少一个注意力头,构建剪枝后的第一特征提取网络,剪枝后的第一特征提取网络包括的注意力头的数量少于第一特征提取网络包括的注意力头的数量。作为示例,例如第一特征提取网络包括8个注意力头,剪枝后的第一特征提取网络可以包括6个注意力头,从而剪枝后的第一特征提取网络中包括的参数数量更少,应理解,此处举例仅为方便理解本方案,不用于限定本方案。
本申请实施例中,技术人员在研究中发现,Transformer结构的神经网络中的部分注意力头是冗余的,或者,Transformer结构的神经网络中的部分注意力头的重要性较低,去掉之后对第一特征提取网络的性能的影响不大,所以将第一特征提取网络选取为Transformer结构的神经网络的特征提取网络,对第一特征提取网络中的注意力头进行剪枝,从而尽可能的提高剪枝后的第一特征提取网络的性能。
进一步地,在一种情况下,步骤403可以包括:训练设备通过第一神经网络,生成至少两个注意力头中每个注意力头的第一评分。其中,一个注意力头的第一评分代表该一个注意力头的重要程度,用于指示一个注意力头是否被剪枝,第一特征提取网络包括的多个注意力头中重要程度高的注意力头将会被保留,重要程度低的注意力头将会被剪枝。训练设备根据与至少两个注意力头对应的至少两个第一评分,对至少两个注意力头执行剪枝操作。本申请实施例中,通过第一神经网络生成每个注意力头的第一评分,进而根据每个注意力头的评分决定该注意力头是否会被剪枝,操作简单,易于实现。
更进一步地,可以为与重要程度越高的注意力头对应的第一评分越高,与重要程度越低的注意力对应的第一评分越低;也可以为与重要程度越高的注意力头对应的第一评分越低,与重要程度越低的注意力对应的第一评分越高。
针对训练设备利用第一评分执行剪枝操作的过程。在一种实现方式中,第一评分的取值为第一预设值或第二预设值,第一预设值和第二预设值的取值不同。第一注意力头为至少两个注意力头中任一个注意力头,当第一注意力头的取值为第一预设值时,第一注意力会被保留;当第一注意力头的取值为第二预设值时,第一注意力头会被剪枝。前述第一预设值的取值可以为1、2、3、4或其他取值等的,前述第二预设值的取值可以为0、1、2或其他取值等,只要保证第一预设值和第二预设值的取值不同即可。作为示例,例如第一评分的取值为0或1,若第一注意力头的第一评分的取值为0,则第一注意力头被剪枝,若第一注意力头的第一评分的取值为1,则第一注意力头被保留等,具体第一预设值和第二预设值的取值均可结合实际情况灵活设定,此处不做限定。
针对生成第一评分的过程。训练设备将至少两个注意力头中每个注意力头输入第一神经网络,得到第一神经网络输出的每个注意力头的第二评分,第二评分可以为连续的评分。 作为示例,例如一个第二评分具体可以为0.58、0.69、1、1.28、1.38等等,此处举例仅为更方便理解本方案,不用于限定本方案。具体的,针对至少两个注意力头中第一注意力头的第二评分的生成过程。训练设备根据自注意力机制,将与第一注意力头对应的注意力矩阵输入第一神经网络中,也即根据与第一注意力头对应的一套注意力矩阵,执行自注意力运算,进而将运算结果输入第一神经网络中,得到第一神经网络输出的第一注意力头的第二评分。为进一步理解本方案,请参阅如下公式:
$$A_i=\mathrm{softmax}\!\left(\frac{W_i^{Q}\left(W_i^{K}\right)^{\top}}{\sqrt{d}}\right)W_i^{V},\quad i=1,\dots,z$$
其中，$A_i$代表对与第i个注意力头对应的一套注意力矩阵执行自注意力运算后得到的运算结果；$W_i^{Q}$代表第i个注意力头中的第一转换矩阵，$W_i^{K}$代表第i个注意力头中的第二转换矩阵，$W_i^{V}$代表第i个注意力头中的第三转换矩阵；$\left(W_i^{K}\right)^{\top}$代表对$W_i^{K}$进行转置；z代表注意力层中注意力头的个数。应理解，此处举例仅为方便理解将注意力矩阵输入第一神经网络的过程，不用于限定本方案。
训练设备在得到第一注意力头的第二评分之后,对第一注意力头的第二评分进行离散化处理,得到第一注意力头的第一评分。其中,离散化处理的过程为可微分的,作为示例,例如离散化处理的具体方式可以为gumbel(耿贝尔)-softmax、gumbel-max或其他类型的离散化处理方式等等。训练设备对多个注意力头中的每个注意力头均执行前述操作,从而可以生成每个注意力头的第一评分。本申请实施例中,生成每个注意力头的第一评分的过程为可微分的,则在利用第一损失函数,反向更新第一神经网络的权重参数的过程也是连续的,从而使第一神经网络的权重参数的更新过程更为严谨,以提高第一神经网络的训练效率,也有利于得到正确率更高的第一神经网络。
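A minimal NumPy sketch of the differentiable discretization step follows, assuming the gumbel-softmax variant with a straight-through hard sample; the temperature value and the (keep, prune) two-way logits layout are illustrative assumptions:

```python
import numpy as np

def gumbel_softmax_hard(logits, tau=1.0, rng=None):
    """Discretization sketch: add Gumbel noise, apply a temperature-scaled
    softmax, then harden by argmax. In an autodiff framework the soft
    probabilities carry the gradient (straight-through estimator), which
    is what keeps the whole scoring path differentiable."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-12) + 1e-12)            # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    y_soft = y / y.sum(axis=-1, keepdims=True)
    y_hard = (y_soft == y_soft.max(axis=-1, keepdims=True)).astype(float)
    return y_hard, y_soft

# hypothetical second scores, one (keep, prune) logit pair per head
second_scores = np.array([[1.2, -0.3], [0.1, 0.9], [2.0, 0.4]])
hard, soft = gumbel_softmax_hard(second_scores, tau=0.5,
                                 rng=np.random.default_rng(0))
first_scores = hard[:, 0]   # 1 = head kept, 0 = head pruned
```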
在另一种实现方式中,第一评分的取值可以为连续的,训练设备上预先设置有第一阈值,若重要程度高的注意力头对应的第一评分越高,则当一个注意力头的第一评分大于或等于第一阈值时,可以保留该注意力头,当一个注意力头的第一评分小于第一阈值时,可以对该注意力头进行剪枝。
或者,若重要程度高的注意力头对应的第一评分越低,则当一个注意力头的第一评分大于或等于第一阈值时,可以对该注意力头进行剪枝,当一个注意力头的第一评分小于第一阈值时,可以保留该注意力头。
为更直观地理解本方案,请参阅图7,图7为本申请实施例提供的神经网络的训练方法中对注意力头进行剪枝过程的一个示意图。图7中以第一特征提取网络中包括3个Transform层,每个Transform层包括4个注意力头为例,其中,灰色块代表的注意力头为不重要的注意力头,马赛克块代表的注意力头为重要的注意头,例如编号为1的神经网络层中编号为1的马赛克块代表的注意力头为重要的注意力头,编号为2的神经网络层中编号为1的灰色块代表的注意力头为不重要的注意力头,则在对第一特征提取网络的不同神经网络层包括的多个注意力头进行剪枝之后,并重新构建剪枝后的第一特征提取网络,剪枝后的第一特征提取网络中包括保留下的6个重要的注意力头,应理解,图7中的示例仅 为方便理解本方案,不用于限定本方案。
在另一种情况下,步骤403可以包括:训练设备直接将第一特征提取网络输入至第一神经网络中,得到第一神经网络输出的剪枝后的第一特征提取网络。
需要说明的是,本申请实施例不限定步骤403的执行顺序,步骤403可以在步骤401和402任一步骤之前或之后执行,只要保证步骤403在步骤404之前执行即可。
404、训练设备将第一训练数据输入剪枝后的第一特征提取网络,得到剪枝后的第一特征提取网络输出的与第一训练数据对应的N个第二特征信息。
本申请实施例中,训练设备在得到剪枝后的第一特征提取网络之后,将第一训练数据输入剪枝后的第一特征提取网络,以通过剪枝后的第一特征提取网络对第一训练数据进行特征提取,得到剪枝后的第一特征提取网络输出的与第一训练数据对应的N个第二特征信息。步骤404的具体实现方式与步骤401的具体实现方式类似,区别仅在于步骤401中的执行主体为第一特征提取网络,步骤404的执行主体为剪枝后的第一特征提取网络,此处不做赘述。
其中,N个第二特征信息与N个第二特征信息的含义类似,若第一训练数据包括N个句子,一个第二特征信息为N个句子中一个句子的特征信息;或者,第一训练数据为一个句子,一个句子中包括N个词语,一个第二特征信息为N个词语中一个词语的特征信息。
405、训练设备根据N个第二特征信息,计算第二分布信息,第二分布信息用于指示N个第二特征信息的数据分布规律。
本申请实施例中,步骤405的具体实现方式与步骤402的具体实现方式类似,区别仅在于步骤402中训练设备处理的为N个第一特征信息,步骤405中处理的为N个第二特征信息,可参阅上述描述理解。其中,第二分布信息的具体表现形式与第一分布信息的具体表现形式类似,均可参照步骤405中的介绍,此处不做赘述。
406、训练设备根据第一损失函数,对第一神经网络执行训练操作,得到第二神经网络,第一损失函数指示第一分布信息与第二分布信息之间的相似度。
本申请实施例中,训练设备在得到第一分布信息和第二分布信息之后,会根据第一分布信息和第二分布信息计算第一损失函数的函数值,并根据第一损失函数的函数值进行梯度求导,并反向更新第一神经网络的权重参数,以完成对第一神经网络的一次训练,训练设备通过重复执行步骤401至406,来对第一神经网络进行迭代训练,直至满足第一损失函数的收敛条件,得到第二神经网络,第二神经网络为训练后的第一神经网络。迭代训练的目标为拉近第一分布信息和第二分布信息之间的相似度,也即迭代训练的目标为拉近第一分布信息与第二分布信息之间的相似度,第一分布信息与第二分布信息之间的相似度用于体现第一分布信息与第二分布信息之间的差异程度,也可以表示为第一分布信息与第二分布信息之间的距离。需要说明的是,在对第一神经网络进行训练的过程中不会更新第一特征提取网络的权重参数。
此外,在对第一神经网络进行迭代训练的过程中,当训练设备确定第一损失函数的函数值满足收敛条件后,不会再对第一神经网络进行下一次训练,训练设备可以获取在对第一神经网络进行最后一次训练的过程中,通过第一神经网络(也可以称为第二神经网络) 生成的剪枝后的第一特征提取网络(也即在最后一次训练的过程中通过步骤403生成的剪枝后的第一特征提取网络),作为最终的可以输出的剪枝后的第一特征提取网络。
其中,第一损失函数具体可以计算第一分布信息和第二分布信息之间的距离,前述距离可以为KL散度(Kullback Leibler divergence)距离、交叉熵距离、欧式距离、马氏距离、余弦距离或其他类型的距离等等,此处不做穷举。需要说明的是,拉近第一分布信息与第二分布信息之间的相似度,不代表拉近每个第一特征信息与每个第二特征信息之间的距离。作为示例,例如N的取值为3,3个第一训练数据分别为“今天天气真好啊”、“今天天气真舒服啊”和“花花的衣服真好看”,则“今天天气真好啊”的第一特征信息和“今天天气真舒服啊”的第一特征信息之间的距离会比较近,“花花的衣服真好看”的第一特征信息与前两者的距离会较远,则训练的目标为“今天天气真好啊”的第二特征信息和“今天天气真舒服啊”的第二特征信息之间的距离近,“花花的衣服真好看”的第二特征信息与前两者的距离较远,也即训练的目的为提高不同第二特征信息之间的相对距离,与,不同第一特征信息之间的相对距离之间的相似度。
为更直观地理解第一分布信息与第二分布信息之间的相似度这个概念,请参阅图8,图8为本申请实施例提供的神经网络的训练方法中第一分布信息和第二分布信息的示意图。图8中以第一分布信息和第二分布信息包括的均为N个特征信息中任意两个特征信息之间的距离为例,图8包括(a)、(b)和(c)三个子示意图,图8的(a)、(b)和(c)三个子示意图中均以示出三个第一特征信息为例,图8的(a)子示意图代表3个第一特征信息的分布情况,图8的(b)子示意图和(c)子示意图均代表3个第二特征信息的分布情况,C1、C2和C3分别代表三个不同的训练数据,图8的(a)子示意图中的方框代表C1的第一特征信息,图8的(a)子示意图中的圆形代表C2的第一特征信息,图8的(a)子示意图中的五角星代表C3的第一特征信息。由于在不同训练次数中第一特征提取网络被减掉的注意力头可以不同,所以不同训练次数中剪枝后的第一特征提取网络输出的N个第二特征信息的分布不同,图8的(b)子示意图和图8的(c)子示意图分别代表在不同训练次数中3个第二特征的分布情况。图8的(b)子示意图和图8的(c)子示意图中的方框代表C1的第一特征信息,图8的(b)子示意图和图8的(c)子示意图中的圆形代表C2的第一特征信息,图8的(b)子示意图和图8的(c)子示意图中的五角星代表C3的第一特征信息。在图8的(a)子示意图和图8的(c)子示意图中,虽然方框、圆形以及五角星的绝对位置不同,但由于图8的(a)子示意图和图8的(c)子示意图中,五角星和圆形的距离较近,圆形与方框、五角星与方框的距离均较远,则图8的(a)子示意图所示出的3个第一特征信息的分布情况(也即对应第一分布信息)与图8的(c)子示意图所示出的3个第二特征信息的分布情况(也即对应第二分布信息)之间的相似度较高,与图8的(a)子示意图对应的第一分布信息和与图8的(b)子示意图对应的第二分布信息之间的相似度较低,应理解,图8中的举例仅为方便理解第一分布信息与第二分布信息之间的相似度这个概念的一个距离,不用于限定本方案。
为了更直观地理解本方案,请参阅图9,图9为本申请实施例提供的神经网络的训练方法的一种流程示意图。D1、训练设备从训练数据集合中获取N个训练数据(也即获取到 第一训练数据),将N个训练数据输入第一特征提取网络,第一特征提取网络为执行过预训练的神经网络,以得到N个第一特征信息。D2、训练设备根据N个第一特征信息,生成第一分布信息。D3、训练设备将与第一特征提取网络包括的多个注意力头一一对应的多套注意力矩阵输入第一神经网络,得到第一神经网络生成的每个注意力头的第二评分。D4、训练设备根据每个注意力头的第二评分,进行离散化处理,得到每个注意力头的第一评分,前述离散化处理的过程为可微的。D5、训练设备根据每个注意力头的第一评分,对第一特征提取网络进行剪枝,并重新构建剪枝后的第一特征提取网络。D6、训练设备将N个训练数据输入剪枝后的第一特征提取网络,以得到N个第二特征信息。D7、训练设备根据N个第二特征信息,生成第二分布信息。D8、训练设备计算第一分布信息和第二分布信息之间的距离,也即计算第一损失函数的函数值,并反向传播以更新第一神经网络的权重参数,以完成了对第一神经网络的一次训练。应理解,图9中的示例仅为方便理解本方案,不用于限定本方案。
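Steps D1 through D8 can be condensed into a single schematic training step. The PyTorch sketch below is a toy stand-in, not the patented implementation: the per-head feature decomposition, the linear scoring network, and all dimensions are assumptions chosen so that the gradient path from the first loss function back to the pruning network is visible while the feature extractor stays frozen:

```python
import torch

def distribution_info(H, eps=1e-12):
    Hn = H / (H.norm(dim=1, keepdim=True) + eps)
    D = 1 - Hn @ Hn.t()                      # pairwise cosine distances
    return D / (D.sum(dim=1, keepdim=True) + eps)

torch.manual_seed(0)
N, d, z, eps = 4, 32, 8, 1e-12
head_feats = torch.randn(z, N, d)            # stand-in per-head contributions (frozen)
H1 = head_feats.sum(dim=0)                   # D1: N first feature infos (all heads)
P1 = distribution_info(H1).detach()          # D2: first distribution information

pruning_net = torch.nn.Linear(16, 2)         # stand-in "first neural network"
head_repr = torch.randn(z, 16)               # D3 input: per-head self-attention results
second_scores = pruning_net(head_repr)       # D3: continuous second scores
keep = torch.nn.functional.gumbel_softmax(   # D4: differentiable discretization
    second_scores, tau=0.5, hard=True)[:, 0]

H2 = (keep.view(z, 1, 1) * head_feats).sum(dim=0)  # D5/D6: features of kept heads only
P2 = distribution_info(H2)                   # D7: second distribution information

# D8: first loss function as the KL distance between the two distributions;
# backward() updates only pruning_net, the extractor's weights stay frozen.
loss = (P1 * ((P1 + eps).log() - (P2 + eps).log())).sum(dim=1).mean()
loss.backward()
```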
本申请实施例中,通过上述方式,提供了一种用于对第一特征提取网络执行剪枝操作的神经网络的训练方法,执行过训练操作第一神经网络能够用于对第一特征提取网络进行剪枝,也即提供了一种神经网络的压缩方案;此外,采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能;且第一特征提取网络不仅可以为Transform结构的特征提取网络,还可以为循环神经网络、卷积神经网络等神经网络的特征提取网络,扩展了本方案的应用场景。
一、第一神经网络的推理阶段
本申请实施例中,请参阅图10,图10为本申请实施例提供的神经网络的压缩方法的一种流程示意图,本申请实施例提供的神经网络的压缩方法可以包括:
1001、执行设备获取第二特征提取网络。
本申请实施例中,执行设备需要获取第二特征提取网络。其中,第一神经网络的训练设备和第二神经网络的执行设备可以为同一设备,也可以为分别独立的设备。第二特征提取网络和第一特征提取网络可以为不同的特征提取网络,也可以为同一的特征提取网络。进一步地,第一特征提取网络和第二特征提取网络的神经网络结构可以完全相同,也即第一特征提取网络和第二特征提取网络包括的神经网络层完全相同。或者,第一特征提取网络和第二特征提取网络的神经网络结构也可以有所不同,在第二特征提取网络与第一特征提取网络均为Transform结构的特征提取网络的情况下,仅需要保证第二特征提取网络的一个多头注意力层中包括的注意力头的个数,与,第一特征提取网络的一个多头注意力层中包括的注意力头的个数相同即可。
具体的,若第二特征提取网络为采用预训练和微调的训练方式,则获取的第二特征提取网络为执行过预训练操作的神经网络。
若第二特征提取网络不是采用预训练和微调的训练方式,则获取第二特征提取网络为训练后的神经网络,具体过程可结合上述对图3的描述。
1002、执行设备通过第二神经网络,对第二特征提取网络进行剪枝,得到剪枝后的第 二特征提取网络,第二神经网络为根据第一损失函数进行训练得到的,第一损失函数指示第一分布信息与第二分布信息之间的相似度,第一分布信息用于指示N个第一特征信息的数据分布规律,N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,第二分布信息用于指示N个第二特征信息的数据分布规律,N个第二特征信息为将第一训练数据输入剪枝后的第一特征提取网络后得到的。
本申请实施例中,执行设备通过第二神经网络,对第一特征提取网络进行剪枝,得到剪枝后的所述第一神经网络。其中,第二神经网络为根据第一损失函数训练得到的,对于第一神经网络的(或第二神经网络)的训练过程可以参阅图4对应实施例中的描述。对于通过第一神经网络进行剪枝操作的具体实现方式,与图4对应实施例中步骤403的具体实现方式类似,此处不做赘述。
具体的,若第一特征提取网络为采用预训练和微调的训练方式,则执行设备为通过第二神经网络,在进入第二特征提取网络的微调阶段之前,对第二特征提取网络进行剪枝,第二特征提取网络为执行过预训练操作的神经网络。
若第二特征提取网络不是采用预训练和微调的训练方式,则执行设备通过第二神经网络对第二特征提取网络进行剪枝,第二特征提取网络为训练后的神经网络,剪枝后的第二特征提取网络不再需要训练。
需要说明的是,若第一神经网络的训练设备和第二神经网络的执行设备为同一设备,则步骤1002也可以为通过步骤403得到,也即在对第一神经网络(或第二神经网络)的训练过程中可以直接获取剪枝后的第一特征提取网络。具体的,可以为在确定满足第一损失函数的收敛条件时,获取当前训练批次中生成的剪枝后的第一特征提取网络,也即获取在第一神经网络的最后一次训练过程中,生成的剪枝后的第一特征提取网络。
本申请实施例中,在预训练阶段对第一特征提取网络进行剪枝,不仅能够实现对第一特征提取网络的压缩,以减少第一特征提取网络所占的存储空间,提高第一特征提取网络在推理阶段的效率,也可以提高对第一特征提取网络进行训练时微调阶段的效率,从而提高第一特征提取网络的训练过程的效率。
本申请实施例中,通过第二神经网络对第一特征提取网络进行剪枝,也即实现了对第一特征提取网络的压缩,提供了一种神经网络的压缩方案;此外,采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能;且第一特征提取网络不仅可以为Transform结构的特征提取网络,还可以为循环神经网络、卷积神经网络等神经网络的特征提取网络,扩展了本方案的应用场景。
为更直观地理解本申请实施例带来的有益效果,以下结合实际数据进行介绍。先参阅如下表1。
表1
（表1以图片形式给出：BERT base与BERT Large在Ratio=0%与Ratio=50%下的存储空间与处理速度对比数据）
其中,BERT base和BERT Large代表两种不同类型的神经网络,第一特征提取网络分别来自前述两种神经网络,Ratio=0%代表没有对第一特征提取网络进行剪枝,Ratio=50%代表剪去了第一特征提取网络中50%的注意力头,对于BERT base和BERT Large而言,在剪枝后,存储空间都减小了,且处理速度得到了提升。
继续结合表2理解采用本申请实施例提供的方案进行剪枝后,剪枝后的神经网络的性能的改变,请参阅如下表2。
表2
（表2以图片形式给出：BERT base与BERT Large剪枝前后在STS-12、STS-13、STS-14、STS-15数据集上的准确度对比数据）
其中,BERT base和BERT Large代表两种不同类型的神经网络,STS为语义文本相似度(Semantic Textual Similarity)的缩写,代表神经网络执行的任务类型,STS-12、STS-13、STS-14和STS-15中后面的序号代表不同的训练数据集合的编号,表2中的每个数值均为一个准确度值,通过上述表2可知,通过本申请实施例提供的方案进行剪枝后,神经网络的性能反而有所提升。
在图1至图10所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关设备。具体参阅图11,图11为本申请实施例提供的神经网络的训练装置的一种结构示意图。神经网络的训练装置1100包括输入模块1101、计算模块1102、剪枝模块1103和训练模块1104。其中,输入模块1101,用于将第一训练数据输入第一特征提取网络,得到第一特征提取网络输出的与第一训练数据对应的N个第一特征信息,N为大于1的整数;计算模块1102,用于根据N个第一特征信息,计算第一分布信息,第一分布信息用于指示N个第一特征信息的数据分布规律;剪枝模块1103,用于通过第一神经网络,对第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络;输入模块1101,还用于将第一训练数据输入剪枝后的第一特征提取网络,得到剪枝后的第一特征提取网络输出的与第一训练数据对应的N个第二特征信息;计算模块1102,还用于根据N个第二特征信息,计算第二分布信息,第二分布信息用于指示N个第二特征信息的数据分布规律;训练模块1104,用于根据第一损失函数,对第一神经网络执行训练操作,得到第二神经网络,第一损失函数指示第一分布信息与第二分布信息之间的相似度。
本申请实施例中,提供了一种用于对第一特征提取网络执行剪枝操作的神经网络的训 练方法,执行过训练操作第一神经网络能够用于对第一特征提取网络进行剪枝,也即提供了一种神经网络的压缩方案;此外,训练模块1104采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能。
在一种可能的设计中,第一分布信息包括N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律;第二分布信息包括N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示N个第二特征信息的数据分布规律。
在一种可能的设计中,第一特征提取网络为转换器(Transformer)结构的神经网络中的特征提取网络,第一特征提取网络中包括至少两个注意力头。剪枝模块1103,具体用于通过第一神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,得到剪枝后的第一特征提取网络,剪枝后的第一特征提取网络包括的注意力头的数量少于第一特征提取网络包括的注意力头的数量。
在一种可能的设计中,剪枝模块1103,具体用于通过第一神经网络,生成至少两个注意力头中每个注意力头的第一评分,根据与至少两个注意力头对应的至少两个第一评分,对至少两个注意力头执行剪枝操作。
在一种可能的设计中,剪枝模块1103,具体用于将至少两个注意力头中每个注意力头输入第一神经网络,得到第一神经网络输出的每个注意力头的第二评分,对第二评分进行离散化处理,得到第一评分,离散化处理的过程为可微分的。
在一种可能的设计中,第一训练数据包括N个句子,一个第一特征信息为N个句子中一个句子的特征信息;或者,第一训练数据为一个句子,一个句子中包括N个词语,一个第一特征信息为N个词语中一个词语的特征信息。
在一种可能的设计中,第一神经网络为以下中的任一种神经网络:卷积神经网络、循环神经网络、残差神经网络或全连接神经网络。
需要说明的是,神经网络的训练装置1100中各模块/单元之间的信息交互、执行过程等内容,与本申请中图4至图9对应的各个方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例还提供一种神经网络的压缩装置,具体参阅图12,图12为本申请实施例提供的神经网络的压缩装置的一种结构示意图。神经网络的压缩装置1200包括获取模块1201和剪枝模块1202。获取模块1201,用于获取第二特征提取网络;剪枝模块1202,用于通过第二神经网络,对第二特征提取网络进行剪枝,得到剪枝后的第二特征提取网络。其中,第二神经网络为根据第一损失函数进行训练得到的,第一损失函数指示第一分布信息与第二分布信息之间的相似度,第一分布信息用于指示N个第一特征信息的数据分布规律,N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,第二分布信息用于指示N个第二特征信息的数据分布规律,N个第二特征信息为将第一训练数据输入剪枝后的第一特征提取网络后得到的。
本申请实施例中,通过第二神经网络对第二特征提取网络进行剪枝,也即实现了对第 二特征提取网络的压缩,提供了一种神经网络的压缩方案;此外,采用第一损失函数来训练第一神经网络,以使剪枝前后的特征提取网络生成的N个特征信息的数据分布规律类似,从而保证剪枝前后的特征提取网络的特征表达能力相似,以保证剪枝后的特征提取网络的性能。
在一种可能的设计中,第一分布信息包括N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示N个第一特征信息的数据分布规律;第二分布信息包括N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示N个第二特征信息的数据分布规律。
在一种可能的设计中,第一特征提取网络为采用预训练和微调的方式进行训练;剪枝模块1202,具体用于在微调之前,通过第二神经网络,对第二特征提取网络进行剪枝。
在一种可能的设计中,第一特征提取网络为转换器(Transformer)结构的神经网络中的特征提取网络,第一特征提取网络中包括至少两个注意力头。剪枝模块1202,具体用于通过第一神经网络,对第一特征提取网络包括的至少两个注意力头执行剪枝操作,得到剪枝后的第一特征提取网络,剪枝后的第一特征提取网络包括的注意力头的数量少于第一特征提取网络包括的注意力头的数量。
在一种可能的设计中,剪枝模块1202,具体用于通过第一神经网络,生成至少两个注意力头中每个注意力头的第一评分,根据与至少两个注意力头对应的至少两个第一评分,对至少两个注意力头执行剪枝操作。
在一种可能的设计中,剪枝模块1202,具体用于将至少两个注意力头中每个注意力头输入第一神经网络,得到第一神经网络输出的每个注意力头的第二评分,对第二评分进行离散化处理,得到第一评分,离散化处理的过程为可微分的。
在一种可能的设计中,第一训练数据包括N个句子,一个第一特征信息为N个句子中一个句子的特征信息;或者,第一训练数据为一个句子,一个句子中包括N个词语,一个第一特征信息为N个词语中一个词语的特征信息。
在一种可能的设计中,第二神经网络为以下中的任一种神经网络:卷积神经网络、循环神经网络、残差神经网络或全连接神经网络。
需要说明的是,神经网络的压缩装置1200中各模块/单元之间的信息交互、执行过程等内容,与本申请中图10对应的各个方法实施例基于同一构思,具体内容可参见本申请前述所示的方法实施例中的叙述,此处不再赘述。
本申请实施例提供了一种电子设备,请参阅图13,图13为本申请实施例提供的电子设备的一种结构示意图,电子设备1300上可以部署有图11对应实施例中所描述的神经网络的训练装置1100,用于实现图4至图9对应的训练设备的功能;或者,电子设备1300上可以部署有图12对应实施例中所描述的神经网络的压缩装置1200,用于实现图10对应的执行设备的功能。具体的,电子设备1300可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1322(例如,一个或一个以上处理器)和存储器1332,一个或一个以上存储应用程序1342或数据1344的存储介质1330(例如一个或一个以上海量存储设备)。其中,存储器1332和存储介质1330可以 是短暂存储或持久存储。存储在存储介质1330的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对电子设备中的一系列指令操作。更进一步地,中央处理器1322可以设置为与存储介质1330通信,在电子设备1300上执行存储介质1330中的一系列指令操作。
电子设备1300还可以包括一个或一个以上电源1326,一个或一个以上有线或无线网络接口1350,一个或一个以上输入输出接口1358,和/或,一个或一个以上操作系统1341,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
本申请实施例中,在一种情况下,中央处理器1322,用于实现图4至图9对应实施例中的训练设备的功能。具体的,中央处理器1322用于:
将第一训练数据输入第一特征提取网络,得到第一特征提取网络输出的与第一训练数据对应的N个第一特征信息,N为大于1的整数;根据N个第一特征信息,计算第一分布信息,第一分布信息用于指示N个第一特征信息的数据分布规律;通过第一神经网络,对第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络;将第一训练数据输入剪枝后的第一特征提取网络,得到剪枝后的第一特征提取网络输出的与第一训练数据对应的N个第二特征信息;根据N个第二特征信息,计算第二分布信息,第二分布信息用于指示N个第二特征信息的数据分布规律;根据第一损失函数,对第一神经网络执行训练操作,得到第二神经网络,第二神经网络为执行过训练操作的第一神经网络,第一损失函数指示第一分布信息与第二分布信息之间的相似度。
需要说明的是,中央处理器1322还于实现图4至图9对应实施例中的训练设备执行的其他步骤,对于中央处理器1322执行图4至图9对应实施例中训练设备的功能的具体实现方式以及带来的有益效果,均可以参考图4至图9对应的各个方法实施例中的叙述,此处不再一一赘述。
本申请实施例中,在另一种情况下,中央处理器1322,用于实现图10对应实施例中的执行设备的功能。具体的,中央处理器1322用于:
获取第一特征提取网络;通过第二神经网络,对第二特征提取网络进行剪枝,得到剪枝后的第二特征提取网络。其中,第二神经网络为根据第一损失函数进行训练得到的,第一损失函数指示第一分布信息与第二分布信息之间的相似度,第一分布信息用于指示N个第一特征信息的数据分布规律,N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,第二分布信息用于指示N个第二特征信息的数据分布规律,N个第二特征信息为将第一训练数据输入剪枝后的第一特征提取网络后得到的。
需要说明的是,中央处理器1322还于实现图10对应实施例中的执行设备执行的其他步骤,对于中央处理器1322执行图10对应实施例中执行设备的功能的具体实现方式以及带来的有益效果,均可以参考图10对应的各个方法实施例中的叙述,此处不再一一赘述。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有程序,当其在计算机上运行时,使得计算机执行如上述图4至图9对应实施例中训练设备所执行的步骤,或者,执行如上述图10对应实施例中执行设备所执行的步骤。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算 机执行如上述图4至图9对应实施例中训练设备所执行的步骤,或者,执行如上述图10对应实施例中执行设备所执行的步骤。
本申请实施例中还提供一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行如上述图4至图9对应实施例中训练设备所执行的步骤,或者,执行如上述图10对应实施例中执行设备所执行的步骤。
本申请实施例提供的执行设备或训练设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使芯片执行上述图4至图9对应实施例中训练设备所执行的步骤,或者,执行如上述图10对应实施例中执行设备所执行的步骤。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图14,图14为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 140,NPU 140作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1403,通过控制器1404控制运算电路1403提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1403内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1403是二维脉动阵列。运算电路1403还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路1403从权重存储器1402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路1403从输入存储器1401中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)1408中。
统一存储器1406用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1405,DMAC被搬运到权重存储器1402中。输入数据也通过DMAC被搬运到统一存储器1406中。
BIU为Bus Interface Unit即,总线接口单元1410,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1409的交互。
总线接口单元1410(Bus Interface Unit,简称BIU),用于取指存储器1409从外部存储器获取指令,还用于存储单元访问控制器1405从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1406或将权重数据搬运到权重存储器1402中或将输入数据数据搬运到输入存储器1401中。
向量计算单元1407包括多个运算处理单元,在需要的情况下,对运算电路1403的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神 经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元1407能将经处理的输出的向量存储到统一存储器1406。例如,向量计算单元1407可以将线性函数和/或非线性函数应用到运算电路1403的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1407生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1403的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1404连接的取指存储器(instruction fetch buffer)1409,用于存储控制器1404使用的指令;
统一存储器1406,输入存储器1401,权重存储器1402以及取指存储器1409均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,循环神经网络中各层的运算可以由运算电路1403或向量计算单元1407执行。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述第一方面方法的程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CLU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传 输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (24)

  1. 一种神经网络的训练方法,其特征在于,所述方法包括:
    将第一训练数据输入第一特征提取网络,得到所述第一特征提取网络输出的与所述第一训练数据对应的N个第一特征信息,所述N为大于1的整数;
    根据所述N个第一特征信息,计算第一分布信息,所述第一分布信息用于指示所述N个第一特征信息的数据分布规律;
    通过第一神经网络,对所述第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络;
    将所述第一训练数据输入所述剪枝后的第一特征提取网络,得到所述剪枝后的第一特征提取网络输出的与所述第一训练数据对应的N个第二特征信息;
    根据所述N个第二特征信息,计算第二分布信息,所述第二分布信息用于指示所述N个第二特征信息的数据分布规律;
    根据第一损失函数,对所述第一神经网络执行训练操作,得到第二神经网络,所述第一损失函数指示所述第一分布信息与所述第二分布信息之间的相似度。
  2. 根据权利要求1所述的方法,其特征在于,
    所述第一分布信息包括所述N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示所述N个第一特征信息的数据分布规律;
    所述第二分布信息包括所述N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示所述N个第二特征信息的数据分布规律。
  3. 根据权利要求1或2所述的方法,其特征在于,所述第一特征提取网络为转换器(Transformer)结构的神经网络中的特征提取网络,所述第一特征提取网络中包括至少两个注意力头;
    所述通过第一神经网络,对所述第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络,包括:
    通过所述第一神经网络,对所述第一特征提取网络包括的所述至少两个注意力头执行剪枝操作,得到所述剪枝后的第一特征提取网络,所述剪枝后的第一特征提取网络包括的注意力头的数量少于所述第一特征提取网络包括的注意力头的数量。
  4. 根据权利要求3所述的方法，其特征在于，所述通过所述第一神经网络，对所述第一特征提取网络包括的所述至少两个注意力头执行剪枝操作，包括：
    通过所述第一神经网络,生成所述至少两个注意力头中每个注意力头的第一评分;
    根据与所述至少两个注意力头对应的至少两个第一评分,对所述至少两个注意力头执行剪枝操作。
  5. 根据权利要求4所述的方法,其特征在于,所述通过所述第一神经网络,生成所述至少两个注意力头中每个注意力头的第一评分,包括:
    将所述至少两个注意力头中每个注意力头输入所述第一神经网络,得到所述第一神经网络输出的所述每个注意力头的第二评分;
    对所述第二评分进行离散化处理，得到所述第一评分，所述离散化处理的过程为可微分的。
  6. 根据权利要求1或2所述的方法,其特征在于,所述第一训练数据包括N个句子,一个第一特征信息为所述N个句子中一个句子的特征信息;或者,
    所述第一训练数据为一个句子,所述一个句子中包括N个词语,一个第一特征信息为所述N个词语中一个词语的特征信息。
  7. 根据权利要求1或2所述的方法,其特征在于,所述第一神经网络为以下中的任一种神经网络:卷积神经网络、循环神经网络、残差神经网络或全连接神经网络。
  8. 一种神经网络的压缩方法,其特征在于,所述方法包括:
    获取第二特征提取网络;
    通过第二神经网络,对所述第二特征提取网络进行剪枝,得到剪枝后的所述第二特征提取网络,其中,所述第二神经网络为根据第一损失函数进行训练得到的,所述第一损失函数指示第一分布信息与第二分布信息之间的相似度,所述第一分布信息用于指示N个第一特征信息的数据分布规律,所述N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,所述第二分布信息用于指示所述N个第二特征信息的数据分布规律,所述N个第二特征信息为将所述第一训练数据输入剪枝后的所述第一特征提取网络后得到的。
  9. 根据权利要求8所述的方法,其特征在于,
    所述第一分布信息包括所述N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示所述N个第一特征信息的数据分布规律;
    所述第二分布信息包括所述N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示所述N个第二特征信息的数据分布规律。
  10. 根据权利要求8或9所述的方法,其特征在于,所述第二特征提取网络为采用预训练和微调(fine-tune)的方式进行训练,所述通过第二神经网络,对所述第二特征提取网络进行剪枝,包括:
    在所述微调之前,通过所述第二神经网络,对所述第二特征提取网络进行剪枝。
  11. 一种神经网络的训练装置,其特征在于,所述装置包括:
    输入模块,用于将第一训练数据输入第一特征提取网络,得到所述第一特征提取网络输出的与所述第一训练数据对应的N个第一特征信息,所述N为大于1的整数;
    计算模块,用于根据所述N个第一特征信息,计算第一分布信息,所述第一分布信息用于指示所述N个第一特征信息的数据分布规律;
    剪枝模块,用于通过第一神经网络,对所述第一特征提取网络执行剪枝操作,得到剪枝后的第一特征提取网络;
    所述输入模块,还用于将所述第一训练数据输入所述剪枝后的第一特征提取网络,得到所述剪枝后的第一特征提取网络输出的与所述第一训练数据对应的N个第二特征信息;
    所述计算模块,还用于根据所述N个第二特征信息,计算第二分布信息,所述第二分布信息用于指示所述N个第二特征信息的数据分布规律;
    训练模块,用于根据第一损失函数,对所述第一神经网络执行训练操作,得到第二神经网络,所述第一损失函数指示所述第一分布信息与所述第二分布信息之间的相似度。
  12. 根据权利要求11所述的装置,其特征在于,
    所述第一分布信息包括所述N个第一特征信息中任意两个第一特征信息之间的距离的值,以指示所述N个第一特征信息的数据分布规律;
    所述第二分布信息包括所述N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示所述N个第二特征信息的数据分布规律。
  13. 根据权利要求11或12所述的装置,其特征在于,所述第一特征提取网络为转换器(Transformer)结构的神经网络中的特征提取网络,所述第一特征提取网络中包括至少两个注意力头;
    所述剪枝模块,具体用于通过所述第一神经网络,对所述第一特征提取网络包括的所述至少两个注意力头执行剪枝操作,得到所述剪枝后的第一特征提取网络,所述剪枝后的第一特征提取网络包括的注意力头的数量少于所述第一特征提取网络包括的注意力头的数量。
  14. 根据权利要求13所述的装置,其特征在于,
    所述剪枝模块,具体用于通过所述第一神经网络,生成所述至少两个注意力头中每个注意力头的第一评分,根据与所述至少两个注意力头对应的至少两个第一评分,对所述至少两个注意力头执行剪枝操作。
  15. 根据权利要求14所述的装置,其特征在于,
    所述剪枝模块,具体用于将所述至少两个注意力头中每个注意力头输入所述第一神经网络,得到所述第一神经网络输出的所述每个注意力头的第二评分,对所述第二评分进行离散化处理,得到所述第一评分,所述离散化处理的过程为可微分的。
  16. 根据权利要求11或12所述的装置,其特征在于,所述第一训练数据包括N个句子,一个第一特征信息为所述N个句子中一个句子的特征信息;或者,
    所述第一训练数据为一个句子,所述一个句子中包括N个词语,一个第一特征信息为所述N个词语中一个词语的特征信息。
  17. 根据权利要求11或12所述的装置,其特征在于,所述第一神经网络为以下中的任一种神经网络:卷积神经网络、循环神经网络、残差神经网络或全连接神经网络。
  18. 一种神经网络的压缩装置,其特征在于,所述装置包括:
    获取模块,用于获取第二特征提取网络;
    剪枝模块,用于通过第二神经网络,对所述第二特征提取网络进行剪枝,得到剪枝后的所述第二特征提取网络;
    其中,所述第二神经网络为根据第一损失函数进行训练得到的,所述第一损失函数指示第一分布信息与第二分布信息之间的相似度,所述第一分布信息用于指示N个第一特征信息的数据分布规律,所述N个第一特征信息为将第一训练数据输入第一特征提取网络后得到的,所述第二分布信息用于指示所述N个第二特征信息的数据分布规律,所述N个第二特征信息为将所述第一训练数据输入剪枝后的所述第一特征提取网络后得到的。
  19. 根据权利要求18所述的装置,其特征在于,
    所述第一分布信息包括所述N个第一特征信息中任意两个第一特征信息之间的距离的值，以指示所述N个第一特征信息的数据分布规律；
    所述第二分布信息包括所述N个第二特征信息中任意两个第二特征信息之间的距离的值,以指示所述N个第二特征信息的数据分布规律。
  20. 根据权利要求18或19所述的装置,其特征在于,所述第二特征提取网络为采用预训练和微调(fine-tune)的方式进行训练;
    所述剪枝模块,具体用于在所述微调之前,通过所述第二神经网络,对所述第二特征提取网络进行剪枝。
  21. 一种训练设备,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序指令,当所述存储器存储的程序指令被所述处理器执行时实现权利要求1至7中任一项所述的方法。
  22. 一种执行设备,其特征在于,包括处理器,所述处理器和存储器耦合,所述存储器存储有程序指令,当所述存储器存储的程序指令被所述处理器执行时实现权利要求8至10中任一项所述的方法。
  23. 一种计算机可读存储介质,其特征在于,包括程序,当其在计算机上运行时,使得计算机执行如权利要求1至7中任一项所述的方法,或者,使得计算机执行如权利要求8至10中任一项所述的方法。
  24. 一种电路系统,其特征在于,所述电路系统包括处理电路,所述处理电路配置为执行如权利要求1至7中任一项所述的方法,或者,所述处理电路配置为执行如权利要求8至10中任一项所述的方法。
PCT/CN2021/105927 2020-09-29 2021-07-13 神经网络训练的方法、神经网络的压缩方法以及相关设备 WO2022068314A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011057004.5 2020-09-29
CN202011057004.5A CN112183747A (zh) 2020-09-29 2020-09-29 神经网络训练的方法、神经网络的压缩方法以及相关设备

Publications (1)

Publication Number Publication Date
WO2022068314A1 2022-04-07

Family

ID=73947316

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/105927 WO2022068314A1 (zh) 2020-09-29 2021-07-13 神经网络训练的方法、神经网络的压缩方法以及相关设备

Country Status (2)

Country Link
CN (1) CN112183747A (zh)
WO (1) WO2022068314A1 (zh)


Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183747A (zh) * 2020-09-29 2021-01-05 华为技术有限公司 神经网络训练的方法、神经网络的压缩方法以及相关设备
CN113065636A (zh) * 2021-02-27 2021-07-02 华为技术有限公司 一种卷积神经网络的剪枝处理方法、数据处理方法及设备
CN112989977B (zh) * 2021-03-03 2022-09-06 复旦大学 一种基于跨模态注意力机制的视听事件定位方法及装置
CN113761841B (zh) * 2021-04-19 2023-07-25 腾讯科技(深圳)有限公司 将文本数据转换为声学特征的方法
CN113486189A (zh) * 2021-06-08 2021-10-08 广州数说故事信息科技有限公司 一种开放性知识图谱挖掘方法及系统
CN113516638B (zh) * 2021-06-25 2022-07-19 中南大学 一种神经网络内部特征重要性可视化分析及特征迁移方法
CN113849601A (zh) * 2021-09-17 2021-12-28 上海数熙传媒科技有限公司 一种针对问答任务模型的输入剪枝加速方法
CN113901904A (zh) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 图像处理方法、人脸识别模型训练方法、装置及设备
CN116881430B (zh) * 2023-09-07 2023-12-12 北京上奇数字科技有限公司 一种产业链识别方法、装置、电子设备及可读存储介质


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279089A1 (en) * 2016-11-17 2019-09-12 Tusimple, Inc. Method and apparatus for neural network pruning
WO2018148493A1 (en) * 2017-02-09 2018-08-16 Painted Dog, Inc. Methods and apparatus for detecting, filtering, and identifying objects in streaming video
CN109034372A (zh) * 2018-06-28 2018-12-18 浙江大学 一种基于概率的神经网络剪枝方法
CN109635936A (zh) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 一种基于重训练的神经网络剪枝量化方法
CN111079691A (zh) * 2019-12-27 2020-04-28 中国科学院重庆绿色智能技术研究院 一种基于双流网络的剪枝方法
CN112183747A (zh) * 2020-09-29 2021-01-05 华为技术有限公司 神经网络训练的方法、神经网络的压缩方法以及相关设备

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935992A (zh) * 2022-11-23 2023-04-07 贝壳找房(北京)科技有限公司 命名实体识别方法、装置及存储介质
CN117556828A (zh) * 2024-01-03 2024-02-13 华南师范大学 图文情感分析方法
CN117556828B (zh) * 2024-01-03 2024-04-30 华南师范大学 图文情感分析方法

Also Published As

Publication number Publication date
CN112183747A (zh) 2021-01-05

Similar Documents

Publication Publication Date Title
WO2022068314A1 (zh) 神经网络训练的方法、神经网络的压缩方法以及相关设备
WO2020228376A1 (zh) 文本处理方法、模型训练方法和装置
WO2022057776A1 (zh) 一种模型压缩方法及装置
WO2022007823A1 (zh) 一种文本数据处理方法及装置
WO2021047286A1 (zh) 文本处理模型的训练方法、文本处理方法及装置
CN111368996B (zh) 可传递自然语言表示的重新训练投影网络
Lu et al. Brain intelligence: go beyond artificial intelligence
WO2022068627A1 (zh) 一种数据处理方法及相关设备
CN111368993B (zh) 一种数据处理方法及相关设备
Chen et al. Big data deep learning: challenges and perspectives
US20210034813A1 (en) Neural network model with evidence extraction
CN111930942B (zh) 文本分类方法、语言模型训练方法、装置及设备
CN113239700A (zh) 改进bert的文本语义匹配设备、系统、方法及存储介质
CN109992773B (zh) 基于多任务学习的词向量训练方法、系统、设备及介质
WO2023160472A1 (zh) 一种模型训练方法及相关设备
WO2022001724A1 (zh) 一种数据处理方法及装置
Mishra et al. The understanding of deep learning: A comprehensive review
WO2022156561A1 (zh) 一种自然语言处理方法以及装置
WO2022253074A1 (zh) 一种数据处理方法及相关设备
WO2023236977A1 (zh) 一种数据处理方法及相关设备
WO2022206717A1 (zh) 一种模型训练方法及装置
GB2574098A (en) Interactive systems and methods
WO2023284716A1 (zh) 一种神经网络搜索方法及相关设备
CN113553510B (zh) 一种文本信息推荐方法、装置及可读介质
CN116432019A (zh) 一种数据处理方法及相关设备

Legal Events

  • Code 121 (Ep: the EPO has been informed by WIPO that EP was designated in this application). Ref document number: 21873976; Country of ref document: EP; Kind code of ref document: A1.
  • Code NENP (Non-entry into the national phase). Ref country code: DE.
  • Code 122 (Ep: PCT application non-entry in European phase). Ref document number: 21873976; Country of ref document: EP; Kind code of ref document: A1.