WO2024077981A1 - Natural language processing method, system and device, and storage medium - Google Patents

Natural language processing method, system and device, and storage medium

Info

Publication number
WO2024077981A1
Authority
WO
WIPO (PCT)
Prior art keywords
natural language
language processing
model
sparsification
parameter
Prior art date
Application number
PCT/CN2023/098938
Other languages
French (fr)
Chinese (zh)
Inventor
李兵兵
阚宏伟
王彦伟
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2024077981A1 publication Critical patent/WO2024077981A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of machine learning technology, and in particular to a natural language processing method, system, device and storage medium.
  • Natural language processing models require matrix multiplication calculations.
  • large-scale natural language deep learning models based on attention mechanisms contain a large number of matrix multiplication calculations.
  • deep network model parameters are highly redundant, which provides the conditions for inference optimization based on model compression.
  • the natural language processing model based on the attention mechanism is usually composed of multiple functional modules that are cyclically and sequentially superimposed.
  • Figure 1 is a commonly used natural language processing model based on the attention mechanism.
  • most calculations are performed in the form of matrix multiplication.
  • the calculation of the Multi-Head Attention module in Figure 1 requires multiple matrix multiplication operations.
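  • as an aside for illustration, a minimal single-head sketch of those multiplications might look as follows; the names and shapes here are assumptions for exposition, not taken from the patent:

```python
import numpy as np

# Minimal single-head attention sketch (names/shapes are illustrative assumptions).
def attention_head(X, Wq, Wk, Wv):
    Q = X @ Wq                                 # matrix multiply 1: query projection
    K = X @ Wk                                 # matrix multiply 2: key projection
    V = X @ Wv                                 # matrix multiply 3: value projection
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # matrix multiply 4: attention scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ V                         # matrix multiply 5: weighted sum of values

X = np.random.randn(8, 16)                     # 8 tokens, hidden size 16
Wq = Wk = Wv = np.random.randn(16, 16)
out = attention_head(X, Wq, Wk, Wv)            # every projection is a dense matmul the patent targets
```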
  • the currently common knowledge distillation method starts from a trained large model, namely the teacher model, obtains the model's final output loss and the output values of its intermediate layers, and from these derives the distillation loss.
  • for the small model of a specified structure, namely the student model, some parameters of the large model are selected as its initialization parameters.
  • for example, 20 layers of the original 100-layer large model are selected to form the small model, and the small model is then trained through the model prediction loss and the distillation loss.
  • in the knowledge distillation method, since the small model selects only some parameters of the large model for initialization, the model parameters undergo large adjustments during training, which leads to a significant decrease in model accuracy.
  • moreover, the structure of the small model must be given manually, which limits the flexibility of the model structure and the effectiveness of model compression, and is not conducive to ensuring model accuracy.
  • in addition, knowledge distillation is a software-level optimization; in actual hardware deployment it uses the general matrix multiplication calculation method, so it cannot form an effective collaborative optimization with the software layer, and the space for optimization on existing hardware is very limited.
  • the purpose of this application is to provide a natural language processing method, system, device and storage medium to effectively implement natural language processing, ensure accuracy, and effectively perform collaborative optimization at the software and hardware levels.
  • a natural language processing method comprising:
  • for any one model parameter matrix of the first natural language processing model, setting a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;
  • the first natural language processing model is trained according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, and in the forward propagation process of the training, a prediction loss and a sparsity loss are determined, and in the backward propagation process of the training, the remaining parameters of the first natural language processing model that are not currently sparse are updated by the prediction loss, and each of the row sparsification parameter groups and each of the column sparsification parameter groups are updated by the prediction loss and the sparsity loss;
  • Hardware deployment is performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated by using the prediction loss, including:
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated with the goal of reducing the prediction loss.
  • updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss includes:
  • each of the row sparsification parameter groups and each of the column sparsification parameter groups are updated with the goal of reducing the total loss.
  • the updating of each of the row sparsification parameter groups and each of the column sparsification parameter groups includes:
  • Sk represents the row sparsification parameter group at the current moment
  • Sk+1 represents the row sparsification parameter group at the next moment
  • lr represents the learning rate for the differential calculation of the sparsification parameters
  • Loss represents the total loss
  • Softplus represents the Softplus function
  • Qk represents the column sparsification parameter group at the current moment
  • Qk+1 represents the column sparsification parameter group at the next moment
  • Mk represents the model parameter mask of the model parameter matrix
  • Mij represents the value in the i-th row and j-th column of Mk
  • xi represents the value of the i-th parameter in Sk
  • yj represents the value of the j-th parameter in Qk
  • i and j are both positive integers
  • 1 ≤ i ≤ a, 1 ≤ j ≤ b, where a and b are respectively the number of rows and columns of the model parameter matrix.
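  • the referenced update formulas are not reproduced textually here; a plausible reconstruction from the variable definitions above and the sign-based retention rule described later (an assumption, not the verbatim published formulas) is:

```latex
% Assumed STE forward: a parameter survives only if its row and column are both retained.
M_{ij} = \begin{cases} 1, & x_i > 0 \ \text{and}\ y_j > 0 \\ 0, & \text{otherwise} \end{cases}

% Assumed STE backward: Softplus acts as the smooth surrogate through which the
% total loss is differentiated, with learning rate lr.
S_{k+1} = S_k - lr \cdot \frac{\partial \mathrm{Loss}}{\partial\, \mathrm{Softplus}(S_k)}
\qquad
Q_{k+1} = Q_k - lr \cdot \frac{\partial \mathrm{Loss}}{\partial\, \mathrm{Softplus}(Q_k)}
```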
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is a default value after being set and before being updated, so as to retain each row and each column in any of the model parameter matrices.
  • the total loss is the sum of the prediction loss and the sparsity loss.
  • the total loss = k1*loss1 + k2*loss2, wherein k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  • it also includes: after obtaining the trained second natural language processing model, outputting prompt information when it is determined that the prediction loss is higher than a first threshold or the sparsity loss is higher than a second threshold.
  • it also includes: receiving a coefficient adjustment instruction, and adjusting the values of k1 and/or k2 according to the coefficient adjustment instruction.
  • inputting the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model includes:
  • the text to be processed is input into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
  • the hardware deployment based on the second natural language processing model includes:
  • the non-zero parameter rows and the non-zero parameter columns of the model parameter matrix are extracted to obtain a sparse model parameter matrix
  • Hardware deployment is performed based on each sparse model parameter matrix.
  • the hardware deployment based on each sparse model parameter matrix includes:
  • Hardware deployment is performed based on each sparse model parameter matrix. During deployment, zeros are added to the calculation results of the corresponding model parameter matrix in accordance with the principle of maintaining dimensional invariance.
  • a natural language processing system comprising:
  • a first natural language processing model determination module is used to establish an initial natural language processing model and perform training to obtain a trained first natural language processing model
  • a sparsification setting module used for setting, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for determining whether a row in the model parameter matrix is retained, and a column sparsification parameter group for determining whether a column in the model parameter matrix is retained;
  • a pruning module configured to train the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, and determine the prediction loss and the sparsity loss during the forward propagation process of the training, and during the backward propagation process of the training, update the remaining parameters of the first natural language processing model that are not currently sparse by using the prediction loss, and update each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss;
  • a second natural language processing model determination module configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges
  • An execution module is used to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
  • a natural language processing device comprising:
  • a processor is used to execute the computer program to implement the steps of the natural language processing method as described above.
  • a non-volatile computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the natural language processing method described above are implemented.
  • the present application will not directly delete some layers in the first natural language processing model, but will set a row sparse parameter group for determining whether the rows in the model parameter matrix are retained, and a column sparse parameter group for determining whether the columns in the model parameter matrix are retained, for any one model parameter matrix of the first natural language processing model.
  • which rows and columns of the model parameter matrix are retained, and which rows and columns are excluded are determined by the corresponding row and column sparse parameter groups.
  • the first natural language processing model can be trained according to the row sparse parameter group and column sparse parameter group of each model parameter matrix.
  • the prediction loss and sparsity loss are determined. Since the remaining parameters of the first natural language processing model that are not currently sparse are updated through the prediction loss during the back propagation process of training, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after the training is completed, that is, the present application achieves lossless accuracy while achieving optimization.
  • the present application will also update each row sparsification parameter group and each column sparsification parameter group through the prediction loss and sparsity loss.
  • when the total loss determined based on the prediction loss and sparsity loss converges, it means that the prediction loss and sparsity loss have reached a suitable level, and when the sparsity loss reaches a suitable optimization level, it means that some rows and columns of each model parameter matrix in the second natural language processing model have been filtered out and deleted.
  • hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, through the second natural language processing model, natural language processing of the text to be processed can be performed effectively.
  • since the present application sets, during optimization, a row sparsification parameter group for determining whether the rows in the model parameter matrix are retained and a column sparsification parameter group for determining whether the columns in the model parameter matrix are retained, the rows and columns that have been filtered out do not need to be deployed in hardware after the optimization is completed; that is, during hardware deployment only the rows and columns remaining after sparsification need to be considered, so the hardware requirements of the present application are lower and the amount of calculation is small, achieving the coordinated optimization of software and hardware.
  • the solution of the present application can effectively implement natural language processing, and can effectively perform collaborative optimization at the software and hardware levels to improve the natural language processing efficiency of the model. At the same time, after the optimization is completed, there will be no loss of accuracy.
  • FIG. 1 is a schematic diagram of the structure of a commonly used natural language processing model based on an attention mechanism;
  • FIG. 2 is a schematic diagram showing the calculation principle of matrix multiplication;
  • FIG. 3 is a flowchart of an implementation of a natural language processing method in the present application;
  • FIG. 4 is a schematic diagram of the training principle of the first natural language processing model in the present application;
  • FIG. 5 is a schematic diagram showing the changes in the model parameter matrix after sparsification in the present application;
  • FIG. 6 is a schematic diagram of the structure of a natural language processing system in the present application;
  • FIG. 7 is a schematic diagram of the structure of a natural language processing device in the present application.
  • the core of this application is to provide a natural language processing method that can effectively implement natural language processing and can effectively perform collaborative optimization at the software and hardware levels to improve the natural language processing efficiency of the model. At the same time, after the optimization is completed, there will be no loss of accuracy.
  • FIG. 3 is a flowchart of an implementation of a natural language processing method in the present application.
  • the natural language processing method may include the following steps:
  • Step S301: Establish an initial natural language processing model and perform training to obtain a trained first natural language processing model.
  • an initial natural language processing model may be established first.
  • the specific form of the initial natural language processing model may be various and may be set and adjusted according to actual needs, such as an initial natural language processing model that adopts a deep network structure.
  • After the initial natural language processing model is established, it can be trained with training samples. When the recognition accuracy meets the requirement, the training can be determined to be complete, and the trained first natural language processing model is obtained.
  • the training samples are usually text data.
  • Step S302: For any one model parameter matrix of the first natural language processing model, set a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained.
  • the first natural language processing model will include multiple model parameter matrices. It can be understood that at this time, since sparsification has not yet been performed, each model parameter matrix is an original, non-sparse model parameter matrix.
  • the present application will set a row sparsification parameter group for determining whether the rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether the columns in the model parameter matrix are retained.
  • for example, assume the original model parameter matrix is a model parameter matrix with 4 rows and 3 columns,
  • and that the 2nd row and the 2nd column need to be sparsified, that is, the 2nd row and the 2nd column do not need to be retained.
  • each row sparsification parameter group and each column sparsification parameter group will be continuously adjusted. Therefore, when performing the initial setting of each row sparsification parameter group and each column sparsification parameter group, they can be set arbitrarily. However, in actual applications, in order to ensure accuracy, when performing the initial setting of each row sparsification parameter group and each column sparsification parameter group, each row and each column of each model parameter matrix are usually retained.
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value after the setting is completed and before being updated, so as to retain each row and each column in any model parameter matrix.
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value, making the setting process simple and convenient.
  • there are many specific forms of the row sparsification parameter groups and the column sparsification parameter groups, as long as they can realize their respective functions in the present application.
  • the row sparsification parameter group can usually be set in the form of a vector, and each numerical value in the vector is used to determine whether the corresponding row of the model parameter matrix is retained.
  • the column sparsification parameter group can usually be set in the form of a vector, and each numerical value in the vector is used to determine whether the corresponding column of the model parameter matrix is retained.
  • Step S303: The first natural language processing model is trained according to the row sparsification parameter groups and column sparsification parameter groups of each model parameter matrix, and the prediction loss and the sparsity loss are determined during the forward propagation of the training. During the backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparse are updated through the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated through the prediction loss and the sparsity loss.
  • the row sparsification parameter group and the column sparsification parameter group can determine the model parameter matrix after sparsification, and then train the first natural language processing model.
  • the present application needs to determine the prediction loss and sparsity loss.
  • the prediction loss is marked as ce_loss, which is loss1 in the subsequent implementation.
  • the sparsity loss is marked as sparsity_loss, which is loss2 in the subsequent implementation.
  • the prediction loss reflects the prediction accuracy of the first natural language processing model. The smaller the prediction loss, the higher the prediction accuracy of the first natural language processing model.
  • the present application will update the remaining parameters of the first natural language processing model that are not currently sparse through the prediction loss during the back propagation process of training.
  • in this case, the remaining un-sparsified parameters of the model parameter matrix are 6, namely the parameter in the 1st row and 1st column, the parameter in the 1st row and 3rd column, the parameter in the 3rd row and 1st column, the parameter in the 3rd row and 3rd column, the parameter in the 4th row and 1st column, and the parameter in the 4th row and 3rd column.
  • it should be noted that the rows and columns finally chosen for sparsification may not necessarily be the 2nd row and the 2nd column.
  • updating the remaining parameters of the first natural language processing model that are not currently sparse by using the prediction loss may include:
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated with the goal of reducing the prediction loss.
  • the prediction loss reflects the prediction accuracy of the first natural language processing model.
  • the present application also updates each row sparse parameter group and each column sparse parameter group through prediction loss and sparsity loss.
  • the sparsity loss reflects the degree to which the first natural language processing model is sparse. The smaller the sparsity loss, the higher the degree to which the first natural language processing model is sparse, that is, more rows and columns do not need to be retained.
  • the updating of each row sparsification parameter group and each column sparsification parameter group by prediction loss and sparsity loss described in step S303 includes:
  • each row sparsification parameter group and each column sparsification parameter group are updated with the goal of reducing the total loss.
  • the total loss is composed of the prediction loss and the sparsity loss; as described above, the lower the prediction loss, the higher the accuracy, and the lower the sparsity loss, the higher the degree of sparsity.
  • however, as the model becomes sparser, the prediction loss may increase. Therefore, in this implementation, each row sparsification parameter group and each column sparsification parameter group are updated with the training goal of reducing the total loss, which is equivalent to making a trade-off between the prediction loss and the sparsity loss with the purpose of reducing the total loss. A training-loop sketch of this procedure is given below.
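  • to make this trade-off concrete, the following is a minimal training-loop sketch; the sign-based mask with a Softplus straight-through estimator, the particular loss functions, and the plain-gradient updates are illustrative assumptions, not the verbatim published procedure:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: one weight matrix W (a x b) with row/column sparsification groups.
W = torch.nn.Parameter(torch.randn(6, 6))
S = torch.nn.Parameter(torch.zeros(6) + 0.1)    # row sparsification parameter group
Q = torch.nn.Parameter(torch.zeros(6) + 0.1)    # column sparsification parameter group
k1, k2, lr = 1.0, 0.1, 1e-2                     # preset loss coefficients, learning rate

def ste_mask(v):
    hard = (v > 0).float()                      # forward: keep a row/column if its parameter > 0
    soft = F.softplus(v)                        # backward: smooth surrogate (assumed STE)
    return hard + soft - soft.detach()          # straight-through estimator

for step in range(100):
    x = torch.randn(8, 6)
    target = torch.randn(8, 6)
    M = ste_mask(S)[:, None] * ste_mask(Q)[None, :]   # model parameter mask Mk
    y = x @ (W * M)                                   # only retained rows/columns contribute
    loss1 = F.mse_loss(y, target)                     # prediction loss (ce_loss in the text)
    loss2 = M.sum() / M.numel()                       # sparsity loss (one plausible choice)
    loss = k1 * loss1 + k2 * loss2                    # total loss
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad * M     # prediction loss updates only un-sparsified parameters
        S -= lr * S.grad         # total loss updates the row sparsification group
        Q -= lr * Q.grad         # total loss updates the column sparsification group
        for p in (W, S, Q):
            p.grad = None
```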
  • the updating of each row sparsification parameter group and each column sparsification parameter group may specifically be performed according to the row update formula and the column update formula, wherein:
  • Sk represents the row sparsification parameter group at the current moment
  • Sk+1 represents the row sparsification parameter group at the next moment
  • lr represents the learning rate for the differential calculation of the sparsification parameters
  • Loss represents the total loss
  • Softplus represents the Softplus function
  • Qk represents the column sparsification parameter group at the current moment
  • Qk+1 represents the column sparsification parameter group at the next moment
  • Mk represents the model parameter mask of the model parameter matrix
  • Mij represents the value in the i-th row and j-th column of Mk
  • xi represents the value of the i-th parameter in Sk
  • yj represents the value of the j-th parameter in Qk.
  • Both i and j are positive integers, and 1 ≤ i ≤ a, 1 ≤ j ≤ b, where a and b are respectively the number of rows and columns of the model parameter matrix.
  • the STE forward function indicates that the model parameter mask of the model parameter matrix is determined based on the row sparse parameter group and the column sparse parameter group of the model parameter matrix.
  • the STE reverse function indicates the update process of the row sparse parameter group and the column sparse parameter group.
  • for example, assume that the row sparsification parameter group of the model parameter matrix is [-5.5, 3.0, 1.2]
  • and the column sparsification parameter group is [3.3, -2.2, 1.0]. Since -5.5 and -2.2 are negative numbers, the first row and the second column of the model parameter matrix W are sparsified, that is, the first row and the second column of the model parameter matrix W do not need to be retained.
  • the model parameter mask of the model parameter matrix can be expressed accordingly: a 0 in the model parameter mask indicates that the parameter is set to 0, and a 1 indicates that the parameter may be non-zero.
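  • the following small sketch reproduces this example with the sign-based retention rule (numpy is used purely for illustration):

```python
import numpy as np

S = np.array([-5.5, 3.0, 1.2])   # row sparsification parameter group
Q = np.array([3.3, -2.2, 1.0])   # column sparsification parameter group

# A row/column is retained when its sparsification parameter is positive.
row_keep = (S > 0).astype(int)   # -> [0, 1, 1]: the 1st row is sparsified
col_keep = (Q > 0).astype(int)   # -> [1, 0, 1]: the 2nd column is sparsified

M = np.outer(row_keep, col_keep) # model parameter mask Mk
print(M)
# [[0 0 0]
#  [1 0 1]
#  [1 0 1]]
```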
  • Step S304: When the total loss determined based on the prediction loss and the sparsity loss converges, a trained second natural language processing model is obtained.
  • the training is completed and a trained second natural language processing model can be obtained.
  • for example, as shown in FIG. 5, a model parameter matrix that was 6x6 in size becomes 4x4 in size after sparsification.
  • Step S305: Perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
  • the text to be processed is input into the second natural language processing model, and the natural language processing result for the text to be processed output by the second natural language processing model can be obtained.
  • the hardware deployment based on the second natural language processing model described in step S305 may specifically include:
  • the non-zero parameter rows and the non-zero parameter columns of the model parameter matrix are extracted to obtain the sparse model parameter matrix
  • Hardware deployment is performed based on each sparse model parameter matrix.
  • the row sparse parameter groups and column sparse parameter groups of each final model parameter matrix can be determined, and then the non-zero parameter rows and non-zero parameter columns of the model parameter matrix can be extracted to obtain the sparse model parameter matrix.
  • a 4x4 model parameter matrix is obtained.
  • the 6x6 model parameter matrix of Figure 5, together with the indices of its non-zero parameter rows and non-zero parameter columns, can be input to extract the non-zero parameter rows and non-zero parameter columns of the model parameter matrix to obtain the sparse model parameter matrix, and hardware deployment is then performed based on each sparse model parameter matrix.
  • for example, the TensorRT interface call can be modified so that, for the second row of the input 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position; similarly, for the sixth row of the 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position, and for the second and fifth columns of the 6x6 model parameter matrix of Figure 5, row compression is performed at the corresponding position. Finally, the non-zero parameter rows and non-zero parameter columns of the model parameter matrix are extracted, and the sparse model parameter matrix is obtained.
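  • as a generic illustration of this extraction step (plain numpy, not the TensorRT interface, which the text describes only at a high level):

```python
import numpy as np

def compress(W, row_keep, col_keep):
    """Extract non-zero parameter rows/columns to get the sparse model parameter matrix."""
    return W[np.ix_(row_keep, col_keep)]

W = np.arange(36.0).reshape(6, 6)   # 6x6 model parameter matrix of Figure 5 (values illustrative)
row_keep = [0, 2, 3, 4]             # rows 2 and 6 sparsified (1-based), per the example
col_keep = [0, 2, 3, 5]             # columns 2 and 5 sparsified (1-based), per the example

W_sparse = compress(W, row_keep, col_keep)
print(W_sparse.shape)               # (4, 4): the matrix actually deployed
```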
  • hardware deployment based on each sparse model parameter matrix may specifically include:
  • Hardware deployment is performed based on each sparse model parameter matrix. During deployment, zeros are added to the calculation results of the corresponding model parameter matrix in accordance with the principle of maintaining dimensional invariance.
  • a model parameter matrix of a certain layer is 6x6 in size before sparsification, and is used to multiply input data of size 106x6 to obtain an output of size 106x6. If the model parameter matrix becomes 4x4 in size after sparsification, the last two columns of the 106x6 input data do not need to be used, that is, the 106x4 input data can be multiplied by the 4x4 model parameter matrix to obtain an output of size 106x4. After obtaining the output of size 106x4, this implementation method uses the principle of maintaining dimensional invariance and fills the calculation results of the corresponding model parameter matrix with 0, that is, in this example, 2 columns of 0 will be added to the 106x4 output to restore it to an output of size 106x6.
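  • a sketch of this compute-then-pad step, with shapes following the example above (the assumption that the dropped input columns and output columns sit at the end is for illustration only):

```python
import numpy as np

X = np.random.randn(106, 6)            # original input data
W_sparse = np.random.randn(4, 4)        # model parameter matrix after sparsification
in_keep = [0, 1, 2, 3]                  # input columns matching the retained weight rows
out_cols = 6                            # original output dimension to restore

Y_small = X[:, in_keep] @ W_sparse      # 106x4 multiply: the hardware only computes this
Y = np.zeros((X.shape[0], out_cols))    # 106x6 output, dimensional invariance preserved
Y[:, :W_sparse.shape[1]] = Y_small      # zero-pad the 2 dropped columns of the result

print(Y_small.shape, Y.shape)           # (106, 4) (106, 6)
```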
  • the 0-padding in this implementation method is to fill the calculation results of the corresponding model parameter matrix with 0, rather than to fill the corresponding sparse rows and columns with 0. That is, for the hardware, the multiplication calculation of the model parameter matrix of size 4x4 is performed, rather than the multiplication calculation of the model parameter matrix of size 6x6. If the multiplication calculation of the model parameter matrix of size 6x6 is performed, it is equivalent to the optimization of the software not being coordinated with the hardware, because in this case the amount of calculation on the hardware is unchanged. In some traditional solutions, when optimizing the software, some parameters in the model parameter matrix will be reset to zero. These reset parameters are randomly distributed, resulting in these 0s still needing to participate in the matrix multiplication operation on the hardware, and no coordinated optimization of the hardware is achieved.
  • that is, the above example restores the 106x4 output to a 106x6 output by padding 2 columns of 0 after the 106x4 output is obtained, thereby ensuring dimensional invariance, that is, ensuring that the output size of the second natural language processing model is consistent with the output of the original model, while the previously determined sparse rows and columns do not participate in the matrix multiplication operation.
  • the total loss of the present application is composed of prediction loss and sparsity loss.
  • the total loss can be specifically the sum of prediction loss and sparsity loss, which is also a relatively simple setting method in practical applications.
  • the total loss = k1*loss1 + k2*loss2, wherein k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  • a certain weight can be set for the prediction loss and the sparsity loss respectively, that is, the two have different degrees of influence on the total loss.
  • the larger k1 is, the greater the influence of the prediction loss on the total loss.
  • the larger k2 is, the greater the influence of the sparsity loss on the total loss.
  • Such an implementation is more conducive to simplifying the model and can usually be used in situations where high computing speed is required.
  • the total loss can also be selected as other forms, which can be set according to actual conditions and does not affect the implementation of this application. However, it can be understood that the total loss usually needs to be positively correlated with the prediction loss and positively correlated with the sparsity loss.
  • this application considers that training is completed when the total loss determined based on the prediction loss and the sparsity loss converges. In most cases, when the total loss converges, the prediction loss and the sparsity loss usually reach a low level, but in a few cases, the prediction loss or the sparsity loss may still be high.
  • therefore, when the prediction loss is higher than the first threshold, or when the sparsity loss is higher than the second threshold, a prompt message will be output so that the staff can promptly notice the situation and take corresponding measures.
  • the value of k1 in the aforementioned implementation can be appropriately increased so that the total loss takes more consideration of the predicted loss.
  • the value of k2 in the aforementioned implementation can be appropriately increased so that the total loss takes more consideration of the sparsity loss.
  • it may also include: receiving a coefficient adjustment instruction, and adjusting the values of k1 and/or k2 according to the coefficient adjustment instruction.
  • the step S305 of inputting the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model may specifically include:
  • the text to be processed is input into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
  • the second natural language processing model of the present application can process the text to be processed.
  • the specific processing purpose is usually to perform semantic recognition of the text to be processed.
  • the second natural language processing model of the present application can also perform other processing on the text to be processed, such as grammatical error analysis, knowledge extraction, text translation, etc.
  • the present application will not directly delete some layers in the first natural language processing model, but will set a row sparse parameter group for determining whether the rows in the model parameter matrix are retained, and a column sparse parameter group for determining whether the columns in the model parameter matrix are retained, for any one model parameter matrix of the first natural language processing model.
  • which rows and columns of the model parameter matrix are retained, and which rows and columns are excluded are determined by the corresponding row and column sparse parameter groups.
  • the first natural language processing model can be trained according to the row sparse parameter group and column sparse parameter group of each model parameter matrix.
  • the prediction loss and sparsity loss are determined. Since the remaining parameters of the first natural language processing model that are not currently sparse are updated through prediction loss during the back propagation process of training, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after the training is completed, that is, the present application achieves lossless accuracy while achieving optimization.
  • the present application will also update each row sparsification parameter group and each column sparsification parameter group through prediction loss and sparsity loss.
  • the prediction loss and sparsity loss are determined, the accuracy of the second natural language processing model obtained after the training is completed can be effectively guaranteed. That is, the present application achieves lossless accuracy while achieving optimization.
  • when the total loss converges, it means that the prediction loss and the sparsity loss have reached a suitable degree, and when the sparsity loss reaches a suitable degree of optimization, it means that some rows and columns of each model parameter matrix in the second natural language processing model have been filtered out and deleted.
  • hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, through the second natural language processing model, natural language processing of the text to be processed can be performed effectively.
  • since the present application sets, during optimization, the row sparsification parameter group used to determine whether the rows in the model parameter matrix are retained and the column sparsification parameter group used to determine whether the columns in the model parameter matrix are retained, the rows and columns that have been filtered out do not need to be deployed in hardware after the optimization is completed; that is, during hardware deployment only the rows and columns remaining after sparsification need to be considered, so the hardware requirements of the present application are lower and the amount of calculation is small, realizing the collaborative optimization of software and hardware.
  • the solution of the present application can effectively implement natural language processing, and can effectively perform collaborative optimization at the software and hardware levels to improve the natural language processing efficiency of the model. At the same time, after the optimization is completed, there will be no loss of accuracy.
  • the embodiment of the present application also provides a natural language processing system, which can be referenced in correspondence with the above.
  • FIG. 6 is a schematic diagram of the structure of a natural language processing system in the present application, the system including:
  • a first natural language processing model determination module 601 is used to establish an initial natural language processing model and perform training to obtain a trained first natural language processing model
  • a sparsification setting module 602 for setting, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for determining whether a row in the model parameter matrix is retained, and a column sparsification parameter group for determining whether a column in the model parameter matrix is retained;
  • the pruning module 603 is used to train the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, and determine the prediction loss and the sparsity loss during the forward propagation process of the training, and update the remaining parameters of the first natural language processing model that are not currently sparse by the prediction loss during the backward propagation process of the training, and update each row sparsification parameter group and each column sparsification parameter group by the prediction loss and the sparsity loss;
  • a second natural language processing model determination module 604 is used to obtain a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges;
  • the execution module 605 is used to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model.
  • the pruning module 603 updates the remaining parameters of the first natural language processing model that are not currently sparse by using the prediction loss during the back propagation process of the training, including:
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated with the goal of reducing the prediction loss.
  • the pruning module 603 updates each row sparsification parameter group and each column sparsification parameter group by using the prediction loss and the sparsity loss, including:
  • each row sparsification parameter group and each column sparsification parameter group are updated with the goal of reducing the total loss.
  • the pruning module 603 updates each row sparsification parameter group and each column sparsification parameter group, including:
  • Sk represents the row sparsification parameter group at the current moment
  • Sk+1 represents the row sparsification parameter group at the next moment
  • lr represents the learning rate for the differential calculation of the sparsification parameters
  • Loss represents the total loss
  • Softplus represents the Softplus function
  • Qk represents the column sparsification parameter group at the current moment
  • Qk+1 represents the column sparsification parameter group at the next moment
  • Mk represents the model parameter mask of the model parameter matrix
  • Mij represents the value in the i-th row and j-th column of Mk
  • xi represents the value of the i-th parameter in Sk
  • yj represents the value of the j-th parameter in Qk
  • i and j are both positive integers
  • 1 ≤ i ≤ a, 1 ≤ j ≤ b, where a and b are respectively the number of rows and columns of the model parameter matrix.
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value after being set and before being updated, so as to retain each row and each column in any model parameter matrix.
  • the total loss is the sum of the prediction loss and the sparsity loss.
  • the total loss = k1*loss1 + k2*loss2, wherein k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  • the information prompt module is used to output prompt information after obtaining the trained second natural language processing model when it is judged that the prediction loss is higher than the first threshold or the sparsity loss is higher than the second threshold.
  • the coefficient adjustment module is used to receive the coefficient adjustment instruction and adjust the value of k1 and/or k2 according to the coefficient adjustment instruction.
  • the execution module 605 inputs the text to be processed into the second natural language processing model, and obtains the natural language processing result for the text to be processed output by the second natural language processing model, including:
  • the text to be processed is input into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
  • the execution module 605 performs hardware deployment based on the second natural language processing model, including:
  • the non-zero parameter rows and the non-zero parameter columns of the model parameter matrix are extracted to obtain the sparse model parameter matrix
  • Hardware deployment is performed based on each sparse model parameter matrix.
  • the execution module 605 performs hardware deployment based on each sparse model parameter matrix, including:
  • Hardware deployment is performed based on each sparse model parameter matrix. During deployment, zeros are added to the calculation results of the corresponding model parameter matrix in accordance with the principle of maintaining dimensional invariance.
  • the embodiment of the present application also provides a natural language processing device and a non-volatile readable storage medium, which can be referenced in correspondence with the above.
  • the non-volatile readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the natural language processing method in any of the above embodiments are implemented.
  • the natural language processing device may include:
  • the memory 701 is used for storing a computer program;
  • the processor 702 is used to execute a computer program to implement the steps of the natural language processing method in any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed in the present application are a natural language processing method, system and device, and a storage medium, which are applied to the technical field of machine learning. The method comprises: obtaining a trained first natural language processing model; setting row and column sparsity parameter groups used for determining whether to retain rows and columns in model parameter matrixes of the first natural language processing model, performing training, updating, by means of a prediction loss, current remaining parameters which are not sparsified, and updating the row and column sparsity parameter groups by means of the prediction loss and a sparsity loss; when a total loss converges, obtaining a trained second natural language processing model; and deploying hardware on the basis of the second natural language processing model and, after the deployment is completed, inputting into the second natural language processing model a text to be processed so as to obtain a natural language processing result. The solution of the present application can be used for effectively implementing natural language processing, and performing collaborative optimization on software and hardware levels without precision losses.

Description

A natural language processing method, system, device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on October 11, 2022, with application number 202211237680.X and entitled "A Natural Language Processing Method, System, Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of machine learning technology, and in particular to a natural language processing method, system, device and storage medium.

Background Art

The design and inference of natural language processing models rely on the training of the software model and the adaptation and deployment of the actual hardware. Natural language processing models require matrix multiplication calculations; for example, large natural language deep learning models based on the attention mechanism contain a large number of matrix multiplication calculations. At the same time, deep network model parameters are highly redundant, which provides the conditions for inference optimization based on model compression.

A natural language processing model based on the attention mechanism is usually composed of multiple functional modules that are cyclically and sequentially stacked; for example, Figure 1 shows a commonly used natural language processing model based on the attention mechanism. In a natural language processing model, most calculations are performed in the form of matrix multiplication; for example, the calculation of the Multi-Head Attention module in Figure 1 involves multiple matrix multiplication operations.

When performing a matrix multiplication operation, taking the multiplication of matrix A and matrix B in Figure 2 to obtain matrix C as an example, each row of matrix A must be multiplied element-wise with each column of matrix B and the products summed to obtain the element at the corresponding position in matrix C.
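As an illustration, this row-by-column multiply-and-sum is exactly what a standard matrix product computes (a minimal sketch with arbitrary example matrices):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # matrix A (2x2)
B = np.array([[5.0, 6.0], [7.0, 8.0]])   # matrix B (2x2)

# C[i, j] = sum over k of A[i, k] * B[k, j]: row i of A times column j of B.
C = A @ B
print(C)   # [[19. 22.] [43. 50.]]
```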
In order to accelerate the operation of deep networks, current acceleration methods generally optimize in two stages: software and hardware. At the software level, the model structure is simplified so that, at the cost of a certain degree of accuracy loss, a small model structure is used in place of a large model structure, thereby reducing the amount of calculation during model inference. At the hardware level, the streamlined model is deployed and accelerated in hardware to achieve efficient real-time inference.

In order to be deployed on existing hardware platforms, existing acceleration methods need to compress the original large model into a small model that conforms to the general matrix multiplication format. For example, the currently common knowledge distillation method starts from a trained large model, namely the teacher model, obtains the model's final output loss and the output values of its intermediate layers, and from these derives the distillation loss. For a small model of a specified structure, namely the student model, some parameters of the large model are selected as the initialization parameters of the small model; for example, 20 layers of the original 100-layer large model are selected to form the small model, and the small model is then trained through the model prediction loss and the distillation loss.

In this knowledge distillation approach, since the small model selects only some parameters of the large model for initialization, the model parameters undergo large adjustments during training, which leads to a significant decrease in model accuracy. In addition, the structure of the small model must be given manually, which limits the flexibility of the model structure and the effectiveness of model compression, and is not conducive to ensuring model accuracy. Moreover, knowledge distillation is a software-level optimization: in actual hardware deployment it uses the general matrix multiplication calculation method, so it cannot form an effective collaborative optimization with the software layer, and the space for optimization on existing hardware is very limited.
Summary of the Invention

The purpose of the present application is to provide a natural language processing method, system, device and storage medium, so as to effectively implement natural language processing and, while ensuring accuracy, effectively perform collaborative optimization at the software and hardware levels.

In order to solve the above technical problems, the present application provides the following technical solutions:

A natural language processing method, comprising:

establishing an initial natural language processing model and performing training to obtain a trained first natural language processing model;

for any one model parameter matrix of the first natural language processing model, setting a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;

training the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, determining a prediction loss and a sparsity loss in the forward propagation process of the training, and, in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified by using the prediction loss, and updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss;

when the total loss determined based on the prediction loss and the sparsity loss converges, obtaining a trained second natural language processing model;

performing hardware deployment based on the second natural language processing model, and after the deployment is completed, inputting the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
In some embodiments, in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified by using the prediction loss includes:

in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified with the goal of reducing the prediction loss.

In some embodiments, updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss includes:

in the backward propagation process of the training, updating each of the row sparsification parameter groups and each of the column sparsification parameter groups with the goal of reducing the total loss.

In some embodiments, the updating of each of the row sparsification parameter groups and each of the column sparsification parameter groups includes:
updating the row sparsification parameter group of any one model parameter matrix according to

$$S_{k+1} = S_k - lr \cdot \frac{\partial\, Loss}{\partial\, S_k}, \qquad \frac{\partial\, Loss}{\partial\, x_i} = \sum_{j=1}^{b} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(x_i);$$

and updating the column sparsification parameter group of any one model parameter matrix according to

$$Q_{k+1} = Q_k - lr \cdot \frac{\partial\, Loss}{\partial\, Q_k}, \qquad \frac{\partial\, Loss}{\partial\, y_j} = \sum_{i=1}^{a} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(y_j),$$

where Softplus' denotes the derivative of the Softplus function, i.e., the gradient through the hard model parameter mask is computed by a straight-through estimator whose backward pass replaces the step function with the Softplus function;
wherein S_k denotes the row sparsification parameter group at the current step, S_{k+1} denotes the row sparsification parameter group at the next step, lr denotes the learning rate for the differential computation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter group at the current step, Q_{k+1} denotes the column sparsification parameter group at the next step, M_k denotes the model parameter mask of the model parameter matrix, M_{ij} denotes the value in the i-th row and j-th column of M_k, x_i denotes the value of the i-th parameter in S_k, y_j denotes the value of the j-th parameter in Q_k, i and j are both positive integers with 1≤i≤a and 1≤j≤b, and a and b are the numbers of rows and columns of the model parameter matrix, respectively.
In some embodiments, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any model parameter matrix is retained.
In some embodiments, the total loss is the sum of the prediction loss and the sparsity loss.
In some embodiments, the total loss = k1*loss1 + k2*loss2, where k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
In some embodiments, the method further comprises:
after the trained second natural language processing model is obtained, outputting prompt information when it is determined that the prediction loss is higher than a first threshold or the sparsity loss is higher than a second threshold.
In some embodiments, the method further comprises:
receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
In some embodiments, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model comprises:
inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
In some embodiments, performing hardware deployment based on the second natural language processing model comprises:
determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter group and column sparsification parameter group to obtain a sparsified model parameter matrix; and
performing hardware deployment based on each sparsified model parameter matrix.
In some embodiments, performing hardware deployment based on each sparsified model parameter matrix comprises:
performing hardware deployment based on each sparsified model parameter matrix, and during deployment, padding the computation result of the corresponding model parameter matrix with zeros on the principle of maintaining dimensional invariance.
A natural language processing system, comprising:
a first natural language processing model determination module, configured to establish an initial natural language processing model and train it to obtain a trained first natural language processing model;
a sparsification setting module, configured to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained;
a pruning module, configured to train the first natural language processing model according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix, wherein a prediction loss and a sparsity loss are determined during forward propagation of the training; during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated by means of the prediction loss and the sparsity loss;
a second natural language processing model determination module, configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges; and
an execution module, configured to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
A natural language processing device, comprising:
a memory, configured to store a computer program; and
a processor, configured to execute the computer program to implement the steps of the natural language processing method described above.
A non-volatile readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the natural language processing method described above.
By applying the technical solution provided by the embodiments of the present application, an initial natural language processing model is established and trained, and after the trained first natural language processing model is obtained, sparsification is used to achieve software-level optimization. Specifically, the present application does not directly delete some layers of the first natural language processing model; instead, for any one model parameter matrix of the first natural language processing model, it sets a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained. That is, for any one model parameter matrix of the first natural language processing model, which rows and columns of that matrix are retained and which are excluded is decided by the corresponding row and column sparsification parameter groups. The first natural language processing model can then be trained according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix. During forward propagation of the training, a prediction loss and a sparsity loss are determined. Since, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after training, i.e., the present application achieves optimization without loss of accuracy. During backward propagation of the training, the present application also updates each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss. When the total loss determined based on the prediction loss and the sparsity loss converges, the prediction loss and the sparsity loss have reached a suitable level; and when the sparsity loss has reached a suitable degree of optimization, some rows and columns of each model parameter matrix of the second natural language processing model have been filtered out.
Afterwards, hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, the second natural language processing model can effectively perform natural language processing of the text to be processed. It can further be understood that, since the present application sets, during optimization, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained, the filtered rows and columns need not be deployed on hardware after optimization; that is, hardware deployment only needs to consider the rows and columns remaining after sparsification. The hardware requirements of the present application are therefore lower and the amount of computation is small, i.e., co-optimization of software and hardware is achieved.
In summary, the solution of the present application can effectively implement natural language processing and can effectively perform co-optimization at the software and hardware levels to improve the natural language processing efficiency of the model; moreover, no accuracy is lost after the optimization of the present application is completed.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Figure 1 is a schematic structural diagram of a commonly used natural language processing model based on an attention mechanism;
Figure 2 is a schematic diagram of the principle of matrix multiplication;
Figure 3 is an implementation flowchart of a natural language processing method in the present application;
Figure 4 is a schematic diagram of the training principle of the first natural language processing model in the present application;
Figure 5 is a schematic diagram of changes of a model parameter matrix after sparsification in the present application;
Figure 6 is a schematic structural diagram of a natural language processing system in the present application;
Figure 7 is a schematic structural diagram of a natural language processing device in the present application.
DETAILED DESCRIPTION OF THE EMBODIMENTS
The core of the present application is to provide a natural language processing method that can effectively implement natural language processing and can effectively perform co-optimization at the software and hardware levels to improve the natural language processing efficiency of the model; moreover, no accuracy is lost after the optimization of the present application is completed.
To enable a person skilled in the art to better understand the solution of the present application, the present application is further described in detail below with reference to the drawings and specific embodiments. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Referring to Figure 3, which is an implementation flowchart of a natural language processing method in the present application, the natural language processing method may include the following steps:
Step S301: establish an initial natural language processing model and train it to obtain a trained first natural language processing model.
Specifically, an initial natural language processing model may be established first. The initial natural language processing model may take various specific forms, which may be set and adjusted according to actual needs, for example, an initial natural language processing model adopting a deep network structure.
After the initial natural language processing model is established, it can be trained with training samples. When the recognition accuracy meets the requirement, the training can be determined to be completed, and what is obtained at this point is the trained first natural language processing model.
Since natural language processing is being performed, the training samples are usually text data.
Step S302: for any one model parameter matrix of the first natural language processing model, set a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained.
After the trained first natural language processing model is obtained, it contains multiple model parameter matrices. It can be understood that, since no sparsification has been performed at this point, each model parameter matrix is the original, non-sparsified model parameter matrix.
For any one model parameter matrix of the first natural language processing model, the present application sets a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained.
That is to say, for any one row of any one model parameter matrix, whether the row needs to be retained is decided by the row sparsification parameter group of that model parameter matrix. Likewise, for any one column of the model parameter matrix, whether the column needs to be retained is decided by the column sparsification parameter group of that model parameter matrix.
For example, in Figure 4, the original model parameter matrix has 4 rows and 3 columns. Through the row sparsification parameter group and the column sparsification parameter group of this model parameter matrix, it is determined that the 2nd row and the 2nd column need to be sparsified, i.e., the 2nd row and the 2nd column do not need to be retained.
It should further be noted that, during the training process of the subsequent step S303, each row sparsification parameter group and each column sparsification parameter group are continuously adjusted. The initial settings of each row sparsification parameter group and each column sparsification parameter group can therefore be arbitrary. In practice, however, to preserve accuracy, every row and every column of each model parameter matrix is usually retained in the initial settings of the row and column sparsification parameter groups.
For example, in some embodiments of the present application, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any model parameter matrix is retained. In this implementation, every parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value, which makes the setting process simple and convenient.
The row sparsification parameter group and the column sparsification parameter group may take various specific forms, as long as their respective functions in the present application can be realized. Since a model parameter matrix usually has multiple rows and columns, the row sparsification parameter group can usually be set in the form of a vector, with each value in the vector deciding whether the corresponding row of the model parameter matrix is retained. Likewise, the column sparsification parameter group can usually be set in the form of a vector, with each value in the vector deciding whether the corresponding column of the model parameter matrix is retained.
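By way of illustration only, the following minimal NumPy sketch shows how such parameter vectors can induce a row/column mask over a parameter matrix. It assumes the sign convention of the 3x3 example given later in this description (a positive parameter keeps the row or column, a negative one prunes it); the function name and the concrete parameter values are not part of the claimed solution.

```python
import numpy as np

def build_mask(row_params, col_params):
    # A row/column is retained when its sparsification parameter is
    # positive, and pruned when it is negative (the sign convention of
    # the 3x3 example given later in this description).
    row_keep = (np.asarray(row_params) > 0).astype(np.float32)  # shape (a,)
    col_keep = (np.asarray(col_params) > 0).astype(np.float32)  # shape (b,)
    return np.outer(row_keep, col_keep)                         # shape (a, b)

# The 4x3 matrix of Figure 4: the 2nd row and 2nd column are pruned.
W = np.ones((4, 3), dtype=np.float32)
S = np.array([1.0, -2.0, 0.5, 0.3])  # row sparsification group (illustrative values)
Q = np.array([0.8, -1.5, 0.2])       # column sparsification group (illustrative values)
W_masked = W * build_mask(S, Q)      # the 2nd row and 2nd column become all zeros
```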
Step S303: train the first natural language processing model according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix, wherein a prediction loss and a sparsity loss are determined during forward propagation of the training; during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated by means of the prediction loss and the sparsity loss.
The row sparsification parameter group and the column sparsification parameter group determine the sparsified model parameter matrix, based on which the first natural language processing model is trained.
During forward propagation of the training, the present application needs to determine the prediction loss and the sparsity loss. In Figure 4, the prediction loss is labeled ce_loss, i.e., loss1 in the subsequent implementations, and the sparsity loss is labeled sparsity_loss, i.e., loss2 in the subsequent implementations.
The prediction loss reflects the prediction accuracy of the first natural language processing model: the smaller the prediction loss, the higher the prediction accuracy of the first natural language processing model.
To preserve accuracy, during backward propagation of the training, the present application updates, by means of the prediction loss, the remaining parameters of the first natural language processing model that are not currently sparsified. For example, for the model parameter matrix in Figure 4 that originally has 4 rows and 3 columns, since the current row sparsification parameter group and column sparsification parameter group determine that the 2nd row and the 2nd column need to be sparsified, the matrix has 6 remaining non-sparsified parameters, namely those at (row 1, column 1), (row 1, column 3), (row 3, column 1), (row 3, column 3), (row 4, column 1) and (row 4, column 3). Of course, as training proceeds, the row and column finally chosen for sparsification for this model parameter matrix will not necessarily be the 2nd row and the 2nd column.
In some embodiments of the present application, updating, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified by means of the prediction loss, as described in step S303, may include:
during backward propagation of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified, with reducing the prediction loss as the training objective.
As described above, the prediction loss reflects the prediction accuracy of the first natural language processing model: the smaller the prediction loss, the higher the prediction accuracy. Therefore, during backward propagation of the training, the remaining non-sparsified parameters of the first natural language processing model can usually be updated with reducing the prediction loss as the training objective. In Figure 4 of the present application, this process is labeled back propagation 1.
During backward propagation of the training, in addition to updating the remaining non-sparsified parameters of the first natural language processing model, the present application also updates each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss.
The sparsity loss reflects the degree to which the first natural language processing model has been sparsified: the smaller the sparsity loss, the higher the degree of sparsification, i.e., the more rows and columns do not need to be retained.
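As a concrete illustration, the two losses of the forward pass might be computed as in the following PyTorch-style sketch. The application does not fix a formula for the sparsity loss, so the Softplus-based measure below, which shrinks as more sparsification parameters turn negative (i.e., as more rows and columns are pruned), is an assumption introduced only for illustration.

```python
import torch
import torch.nn.functional as F

def forward_losses(logits, labels, S, Q):
    ce_loss = F.cross_entropy(logits, labels)  # prediction loss (loss1)
    # Assumed sparsity measure: Softplus is close to 0 for strongly
    # negative (pruned) parameters and grows for retained (positive)
    # ones, so minimizing it encourages more rows/columns to be pruned.
    sparsity_loss = F.softplus(S).mean() + F.softplus(Q).mean()  # loss2
    return ce_loss, sparsity_loss
```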
In some embodiments of the present application, updating each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss, as described in step S303, includes:
during backward propagation of the training, updating each row sparsification parameter group and each column sparsification parameter group, with reducing the total loss as the training objective.
In this implementation, the total loss is composed of the prediction loss and the sparsity loss, and as described above, the lower the prediction loss, the higher the accuracy, and the lower the sparsity loss, the higher the degree of sparsification. The prediction loss and the sparsity loss influence each other: for example, updating the row and column sparsification parameter groups to reduce the sparsity loss may increase the prediction loss. Therefore, in this implementation, each row sparsification parameter group and each column sparsification parameter group are updated with reducing the total loss as the training objective, which amounts to trading off the prediction loss against the sparsity loss with the aim of reducing the total loss.
In Figure 4 of the present application, the process of updating each row sparsification parameter group and each column sparsification parameter group is labeled back propagation 2.
It should further be noted that various specific algorithms can be used to update each row sparsification parameter group and each column sparsification parameter group. For example, in some embodiments of the present application, updating each row sparsification parameter group and each column sparsification parameter group, as described in step S303, may specifically include:
updating the row sparsification parameter group of any one model parameter matrix according to

$$S_{k+1} = S_k - lr \cdot \frac{\partial\, Loss}{\partial\, S_k}, \qquad \frac{\partial\, Loss}{\partial\, x_i} = \sum_{j=1}^{b} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(x_i);$$

and updating the column sparsification parameter group of any one model parameter matrix according to

$$Q_{k+1} = Q_k - lr \cdot \frac{\partial\, Loss}{\partial\, Q_k}, \qquad \frac{\partial\, Loss}{\partial\, y_j} = \sum_{i=1}^{a} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(y_j),$$

where Softplus' denotes the derivative of the Softplus function, i.e., the gradient through the hard model parameter mask is computed by a straight-through estimator whose backward pass replaces the step function with the Softplus function;
wherein S_k denotes the row sparsification parameter group at the current step, S_{k+1} denotes the row sparsification parameter group at the next step, lr denotes the learning rate for the differential computation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter group at the current step, Q_{k+1} denotes the column sparsification parameter group at the next step, M_k denotes the model parameter mask of the model parameter matrix, M_{ij} denotes the value in the i-th row and j-th column of M_k, x_i denotes the value of the i-th parameter in S_k, y_j denotes the value of the j-th parameter in Q_k, i and j are both positive integers with 1≤i≤a and 1≤j≤b, and a and b are the numbers of rows and columns of the model parameter matrix, respectively.
In Figure 4, the STE forward function denotes determining the model parameter mask of a model parameter matrix from its row sparsification parameter group and column sparsification parameter group, while the STE reverse function denotes the update process of the row sparsification parameter group and the column sparsification parameter group.
For example, in some embodiments, for a 3x3 model parameter matrix W, after step S303 is completed, i.e., after the training of the first natural language processing model is finished and the second natural language processing model is obtained, suppose the row sparsification parameter group of this model parameter matrix is [-5.5, 3.0, 1.2] and the column sparsification parameter group is [3.3, -2.2, 1.0]. Since -5.5 and -2.2 are negative, the 1st row and the 2nd column of the model parameter matrix W are sparsified, i.e., the 1st row and the 2nd column of the parameter matrix W do not need to be retained. In this example, the model parameter mask of the model parameter matrix can be expressed as

$$M = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix},$$

where a 0 in the model parameter mask indicates that the parameter at that position is set to 0, and a 1 indicates that the parameter at that position may be non-zero.
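The STE forward/reverse pair of Figure 4 can be sketched as a custom autograd function: the hard 0/1 mask is used in the forward pass, while the backward pass substitutes a smooth Softplus-based surrogate (the derivative of Softplus is the sigmoid). The exact backward form is given by the update formulas above; the following is only an illustrative approximation under that assumption.

```python
import torch

class STEMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        # Hard mask: M_ij = 1 when row i and column j are both kept.
        ctx.save_for_backward(x, y)
        return torch.outer((x > 0).float(), (y > 0).float())

    @staticmethod
    def backward(ctx, grad_M):
        x, y = ctx.saved_tensors
        # Smooth surrogate: d Softplus(t)/dt = sigmoid(t) replaces the
        # zero-almost-everywhere derivative of the step function.
        gx = (grad_M * (y > 0).float()).sum(dim=1) * torch.sigmoid(x)
        gy = (grad_M * (x > 0).float().unsqueeze(1)).sum(dim=0) * torch.sigmoid(y)
        return gx, gy

# Usage: mask the parameter matrix, then let total-loss gradients flow
# back into the row/column sparsification parameter groups S and Q.
S = torch.tensor([-5.5, 3.0, 1.2], requires_grad=True)
Q = torch.tensor([3.3, -2.2, 1.0], requires_grad=True)
W = torch.randn(3, 3)
W_eff = W * STEMask.apply(S, Q)
```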
Step S304: when the total loss determined based on the prediction loss and the sparsity loss converges, obtain a trained second natural language processing model.
When the total loss determined based on the prediction loss and the sparsity loss converges, the training is finished and the trained second natural language processing model is obtained.
For example, in Figure 5, for a model parameter matrix originally of size 6x6, after the training of step S303 the 2nd row, the 6th row, the 2nd column and the 5th column need to be sparsified, so in the second natural language processing model this model parameter matrix becomes of size 4x4.
Step S305: perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model.
Since the present application sparsifies entire rows and entire columns during model optimization, the sparsified rows and columns require no computation on the hardware during deployment, which effectively reduces hardware resource usage, i.e., it facilitates co-optimization of software and hardware.
After the hardware deployment is completed, the text to be processed is input into the second natural language processing model, and the natural language processing result for the text to be processed output by the second natural language processing model is obtained.
In some embodiments of the present application, performing hardware deployment based on the second natural language processing model, as described in step S305, may specifically include:
determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter group and column sparsification parameter group to obtain the sparsified model parameter matrix; and
performing hardware deployment based on each sparsified model parameter matrix.
Specifically, after training is finished and the second natural language processing model is obtained, the final row sparsification parameter group and column sparsification parameter group of each model parameter matrix can be determined, and the non-zero parameter rows and non-zero parameter columns of each model parameter matrix can then be extracted to obtain the sparsified model parameter matrix. For example, in Figure 5, for a model parameter matrix originally of size 6x6, a 4x4 model parameter matrix is obtained after its non-zero parameter rows and non-zero parameter columns are extracted. Finally, hardware deployment can be performed based on each sparsified model parameter matrix.
For example, in a specific scenario, the 6x6 model parameter matrix of Figure 5, together with the numbers of its non-zero parameter rows and non-zero parameter columns, can be provided as input, so that the non-zero parameter rows and non-zero parameter columns of the model parameter matrix are extracted to obtain the sparsified model parameter matrix, and hardware deployment is then performed based on each sparsified model parameter matrix.
For example, the TensorRT interface calls can be modified so that, for the 2nd row of the input 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position; likewise, for the 6th row of the 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position; and for the 2nd and 5th columns of the 6x6 model parameter matrix of Figure 5, row compression is performed at the corresponding positions. This finally extracts the non-zero parameter rows and non-zero parameter columns of the model parameter matrix, i.e., the sparsified model parameter matrix is obtained.
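Independent of any particular inference library, the extraction step itself reduces to indexing with the retained row and column numbers. A minimal sketch follows; the function name and the sign convention are assumptions introduced for illustration only.

```python
import numpy as np

def extract_nonzero(W, S, Q):
    # Keep only rows/columns whose sparsification parameter is positive,
    # e.g. turning the 6x6 matrix of Figure 5 into a 4x4 one.
    keep_rows = np.flatnonzero(np.asarray(S) > 0)
    keep_cols = np.flatnonzero(np.asarray(Q) > 0)
    return W[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols
```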
Further, in some embodiments of the present application, performing hardware deployment based on each sparsified model parameter matrix may specifically include:
performing hardware deployment based on each sparsified model parameter matrix, and during deployment, padding the computation result of the corresponding model parameter matrix with zeros on the principle of maintaining dimensional invariance.
For example, a model parameter matrix of a certain layer is of size 6x6 before sparsification and is used to multiply input data of size 106x6, producing an output of size 106x6. If this model parameter matrix becomes of size 4x4 after sparsification, the last two columns of the 106x6 input data need not be used; that is, input data of size 106x4 is multiplied by the 4x4 model parameter matrix, producing an output of size 106x4. After the 106x4 output is obtained, this implementation, following the principle of maintaining dimensional invariance, pads the computation result of the corresponding model parameter matrix with zeros; in this example, 2 columns of zeros are added to the 106x4 output to restore an output of size 106x6.
It should further be emphasized that the zero padding in this implementation pads the computation result of the corresponding model parameter matrix with zeros, rather than padding the corresponding sparsified rows and columns with zeros. That is, the hardware performs the multiplication of a 4x4 model parameter matrix rather than of a 6x6 model parameter matrix. If the multiplication of a 6x6 model parameter matrix were performed, the software optimization would not be coordinated with the hardware, because the amount of computation on the hardware would remain unchanged. In some traditional schemes, software optimization zeros out some parameters of the model parameter matrix, and these zeroed parameters are randomly distributed, so on the hardware these zeros still have to take part in the matrix multiplication, and no co-optimization on the hardware is achieved.
In the solution of the present application, in the above example, after the 106x4 output is obtained, 2 columns of zeros are added to restore the output of size 106x6, which preserves dimensional invariance, i.e., it guarantees that the output size of the second natural language processing model is consistent with that of the original model, while the previously determined sparsified rows and columns do not take part in the matrix multiplication.
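The dimension-preserving computation described above can be sketched as follows; this is a simplified host-side illustration, and an actual deployment would perform the equivalent inside the accelerator kernel.

```python
import numpy as np

def matmul_with_zero_padding(X, W_small, keep_rows, keep_cols, out_dim):
    # Multiply only the retained slice, e.g. (106, 4) = (106, 4) @ (4, 4)
    # instead of the original (106, 6) @ (6, 6).
    Y_small = X[:, keep_rows] @ W_small
    # Re-insert zero columns at the pruned positions so the output keeps
    # its original size, e.g. (106, 4) padded back to (106, 6).
    Y = np.zeros((X.shape[0], out_dim), dtype=X.dtype)
    Y[:, keep_cols] = Y_small
    return Y
```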
The total loss of the present application is composed of the prediction loss and the sparsity loss. In some embodiments of the present application, the total loss may specifically be the sum of the prediction loss and the sparsity loss, which is also a relatively simple setting in practice.
Further, in some embodiments of the present application, total loss = k1*loss1 + k2*loss2, where k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
In this implementation, a weight can be set for each of the prediction loss and the sparsity loss, so that the two contribute to the total loss to different degrees. It can be understood that the larger the value of k1, the more the total loss is affected by the prediction loss; such an implementation is more conducive to guaranteeing high prediction accuracy and is usually applicable where high accuracy is required. The larger the value of k2, the more the total loss is affected by the sparsity loss; such an implementation is more conducive to simplifying the model and is usually applicable where high computing speed is required.
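In code, this weighting is a one-line combination; the coefficient values below are purely illustrative defaults, not values fixed by the present application.

```python
def total_loss(loss1, loss2, k1=1.0, k2=0.1):
    # Larger k1 weights prediction accuracy more heavily; larger k2
    # pushes toward a sparser (smaller, faster) model.
    return k1 * loss1 + k2 * loss2
```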
In addition, in other implementations the total loss can take other forms, which can be set according to the actual situation without affecting the implementation of the present application; it can be understood, however, that the total loss usually needs to be positively correlated with both the prediction loss and the sparsity loss.
In some embodiments of the present application, the method further includes:
after the trained second natural language processing model is obtained, outputting prompt information when it is determined that the prediction loss is higher than a first threshold or the sparsity loss is higher than a second threshold.
As described above, the present application considers training completed when the total loss determined based on the prediction loss and the sparsity loss converges. In most cases, when the total loss converges, both the prediction loss and the sparsity loss have usually reached a low level; in a small number of cases, however, the prediction loss or the sparsity loss may still be high.
In this implementation, prompt information is output when the prediction loss is higher than the first threshold or the sparsity loss is higher than the second threshold, so that the staff can notice the situation in time and take corresponding measures. For example, in one scenario, when the prediction loss is higher than the first threshold, the value of k1 in the foregoing implementation can be appropriately increased so that the total loss takes the prediction loss more into account; correspondingly, when the sparsity loss is higher than the second threshold, the value of k2 in the foregoing implementation can be appropriately increased so that the total loss takes the sparsity loss more into account.
That is, in some embodiments of the present application, the method may further include: receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
In some embodiments of the present application, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model, as described in step S305, may specifically include:
inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
The second natural language processing model of the present application can process the text to be processed; in practical applications, the specific processing purpose is usually semantic recognition of the text to be processed. Of course, in other implementations, the second natural language processing model of the present application can also perform other kinds of processing on the text to be processed, such as grammatical error analysis, knowledge extraction, text translation, and so on.
By applying the technical solution provided by the embodiments of the present application, an initial natural language processing model is established and trained, and after the trained first natural language processing model is obtained, sparsification is used to achieve software-level optimization. Specifically, the present application does not directly delete some layers of the first natural language processing model; instead, for any one model parameter matrix of the first natural language processing model, it sets a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained. That is, for any one model parameter matrix of the first natural language processing model, which rows and columns of that matrix are retained and which are excluded is decided by the corresponding row and column sparsification parameter groups. The first natural language processing model can then be trained according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix. During forward propagation of the training, a prediction loss and a sparsity loss are determined. Since, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after training, i.e., the present application achieves optimization without loss of accuracy. During backward propagation of the training, the present application also updates each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss. When the total loss determined based on the prediction loss and the sparsity loss converges, the prediction loss and the sparsity loss have reached a suitable level; and when the sparsity loss has reached a suitable degree of optimization, some rows and columns of each model parameter matrix of the second natural language processing model have been filtered out.
Afterwards, hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, the second natural language processing model can effectively perform natural language processing of the text to be processed. It can further be understood that, since the present application sets, during optimization, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained, the filtered rows and columns need not be deployed on hardware after optimization; that is, hardware deployment only needs to consider the rows and columns remaining after sparsification. The hardware requirements of the present application are therefore lower and the amount of computation is small, i.e., co-optimization of software and hardware is achieved.
In summary, the solution of the present application can effectively implement natural language processing and can effectively perform co-optimization at the software and hardware levels to improve the natural language processing efficiency of the model; moreover, no accuracy is lost after the optimization of the present application is completed.
Corresponding to the above method embodiments, an embodiment of the present application further provides a natural language processing system, which may be referred to in correspondence with the above.
Referring to Figure 6, which is a schematic structural diagram of a natural language processing system in the present application, the system includes:
a first natural language processing model determination module 601, configured to establish an initial natural language processing model and train it to obtain a trained first natural language processing model;
a sparsification setting module 602, configured to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained;
a pruning module 603, configured to train the first natural language processing model according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix, wherein a prediction loss and a sparsity loss are determined during forward propagation of the training; during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated by means of the prediction loss and the sparsity loss;
a second natural language processing model determination module 604, configured to obtain a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges; and
an execution module 605, configured to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model.
In some embodiments of the present application, the pruning module 603 updating, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified by means of the prediction loss includes:
during backward propagation of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified, with reducing the prediction loss as the training objective.
In some embodiments of the present application, the pruning module 603 updating each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss includes:
during backward propagation of the training, updating each row sparsification parameter group and each column sparsification parameter group, with reducing the total loss as the training objective.
In some embodiments of the present application, the pruning module 603 updating each row sparsification parameter group and each column sparsification parameter group includes:
updating the row sparsification parameter group of any one model parameter matrix according to

$$S_{k+1} = S_k - lr \cdot \frac{\partial\, Loss}{\partial\, S_k}, \qquad \frac{\partial\, Loss}{\partial\, x_i} = \sum_{j=1}^{b} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(x_i);$$

and updating the column sparsification parameter group of any one model parameter matrix according to

$$Q_{k+1} = Q_k - lr \cdot \frac{\partial\, Loss}{\partial\, Q_k}, \qquad \frac{\partial\, Loss}{\partial\, y_j} = \sum_{i=1}^{a} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(y_j),$$

where Softplus' denotes the derivative of the Softplus function, i.e., the gradient through the hard model parameter mask is computed by a straight-through estimator whose backward pass replaces the step function with the Softplus function;
wherein S_k denotes the row sparsification parameter group at the current step, S_{k+1} denotes the row sparsification parameter group at the next step, lr denotes the learning rate for the differential computation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter group at the current step, Q_{k+1} denotes the column sparsification parameter group at the next step, M_k denotes the model parameter mask of the model parameter matrix, M_{ij} denotes the value in the i-th row and j-th column of M_k, x_i denotes the value of the i-th parameter in S_k, y_j denotes the value of the j-th parameter in Q_k, i and j are both positive integers with 1≤i≤a and 1≤j≤b, and a and b are the numbers of rows and columns of the model parameter matrix, respectively.
In some embodiments of the present application, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any model parameter matrix is initially retained.
In some embodiments of the present application, the total loss is the sum of the prediction loss and the sparsity loss.
In some embodiments of the present application, total loss = k1*loss1 + k2*loss2, where k1 and k2 are preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
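Expressed as code, the weighted combination is a one-liner; the default coefficient values below are placeholders, not values from the patent.

def total_loss(loss1: float, loss2: float, k1: float = 1.0, k2: float = 0.1) -> float:
    """total = k1 * prediction loss + k2 * sparsity loss."""
    return k1 * loss1 + k2 * loss2

In effect, raising k2 biases training toward sparser matrices at some cost in prediction accuracy, which is one reason the coefficient adjustment module described below exposes k1 and k2.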
In some embodiments of the present application, the system further includes:
an information prompt module, used to output prompt information after the trained second natural language processing model has been obtained, when the prediction loss is determined to be higher than a first threshold or the sparsity loss is higher than a second threshold.
In some embodiments of the present application, the system further includes:
a coefficient adjustment module, used to receive a coefficient adjustment instruction and to adjust the value of k1 and/or k2 according to that instruction.
In some embodiments of the present application, the execution module 605 inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result output by the model for that text includes:
inputting the text to be processed into the second natural language processing model to obtain the semantic recognition result output by the second natural language processing model for the text to be processed.
In some embodiments of the present application, the execution module 605 performing hardware deployment based on the second natural language processing model includes:
determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the matrix according to its row sparsification parameter group and column sparsification parameter group, to obtain a sparsified model parameter matrix;
performing hardware deployment based on each sparsified model parameter matrix.
In some embodiments of the present application, the execution module 605 performing hardware deployment based on each sparsified model parameter matrix includes:
performing hardware deployment based on each sparsified model parameter matrix and, during deployment, padding the calculation results of the corresponding model parameter matrix with zeros, following the principle of keeping dimensions unchanged.
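The following sketch illustrates this compact-then-pad scheme for a single matrix-vector product. The function names and index handling are illustrative; a real deployment target (for example an FPGA kernel) would consume the compacted matrix directly.

import numpy as np

def compact(W: np.ndarray, keep_rows: np.ndarray, keep_cols: np.ndarray) -> np.ndarray:
    """Extract only the retained (non-zero) rows and columns of W."""
    return W[np.ix_(keep_rows, keep_cols)]

def matvec_with_padding(W_c: np.ndarray, x: np.ndarray,
                        keep_rows: np.ndarray, keep_cols: np.ndarray,
                        out_dim: int) -> np.ndarray:
    """Multiply with the compacted matrix, then scatter the result back
    with zeros so downstream layers see unchanged dimensions."""
    y = np.zeros(out_dim, dtype=x.dtype)
    y[keep_rows] = W_c @ x[keep_cols]  # pruned output rows stay exactly zero
    return y

# Tiny usage example with assumed kept indices:
W = np.arange(12, dtype=np.float64).reshape(3, 4)
keep_rows, keep_cols = np.array([0, 2]), np.array([1, 3])
W_c = compact(W, keep_rows, keep_cols)          # 2x2 compacted matrix
y = matvec_with_padding(W_c, np.ones(4), keep_rows, keep_cols, out_dim=3)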
Corresponding to the above method and system embodiments, embodiments of the present application further provide a natural language processing device and a non-volatile readable storage medium, which may be cross-referenced with the above. The non-volatile readable storage medium stores a computer program which, when executed by a processor, implements the steps of the natural language processing method in any of the above embodiments.
Referring to FIG. 7, the natural language processing device may include:
a memory 701, used to store a computer program;
a processor 702, used to execute the computer program to implement the steps of the natural language processing method in any of the above embodiments.
It should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes it.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
Specific examples have been used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is intended only to help understand the technical solution and its core ideas. It should be noted that those of ordinary skill in the art can make improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of protection of the present application.

Claims (20)

  1. A natural language processing method, characterized by comprising:
    establishing an initial natural language processing model and training it, to obtain a trained first natural language processing model;
    for any one model parameter matrix of the first natural language processing model, setting a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;
    training the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of the model parameter matrices, determining a prediction loss and a sparsity loss during the forward propagation of the training, and, during the back propagation of the training, updating the remaining parameters of the first natural language processing model that have not yet been sparsified by means of the prediction loss, and updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by means of the prediction loss and the sparsity loss;
    when the total loss determined based on the prediction loss and the sparsity loss converges, obtaining a trained second natural language processing model;
    performing hardware deployment based on the second natural language processing model and, after deployment is complete, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result output by the second natural language processing model for the text to be processed.
  2. The natural language processing method according to claim 1, characterized in that updating, during the back propagation of the training and by means of the prediction loss, the remaining parameters of the first natural language processing model that have not yet been sparsified comprises:
    during the back propagation of the training, updating the remaining parameters of the first natural language processing model that have not yet been sparsified, with reducing the prediction loss as the training objective.
  3. The natural language processing method according to claim 1, characterized in that updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by means of the prediction loss and the sparsity loss comprises:
    during the back propagation of the training, updating each of the row sparsification parameter groups and each of the column sparsification parameter groups, with reducing the total loss as the training objective.
  4. The natural language processing method according to claim 1, characterized in that updating each of the row sparsification parameter groups and each of the column sparsification parameter groups comprises:
    for the row sparsification parameter group of any one model parameter matrix, updating the row sparsification parameter group according to
    Sk+1 = Sk − lr · ∂Loss/∂Sk;
    for the column sparsification parameter group of any one model parameter matrix, updating the column sparsification parameter group according to
    Qk+1 = Qk − lr · ∂Loss/∂Qk;
    where Sk denotes the row sparsification parameter group at the current step, Sk+1 denotes the row sparsification parameter group at the next step, lr denotes the learning rate used in the differential computation for the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Qk denotes the column sparsification parameter group at the current step, Qk+1 denotes the column sparsification parameter group at the next step, and Mk denotes the model parameter mask of the model parameter matrix, with Mij = Softplus(xi) · Softplus(yj);
    Mij denotes the value in row i, column j of Mk, xi denotes the value of the i-th parameter in Sk, yj denotes the value of the j-th parameter in Qk, i and j are positive integers with 1 ≤ i ≤ a and 1 ≤ j ≤ b, and a and b are respectively the numbers of rows and columns of the model parameter matrix.
  5. The natural language processing method according to claim 1, characterized in that, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any of the model parameter matrices is retained.
  6. The natural language processing method according to claim 1, characterized in that the total loss is the sum of the prediction loss and the sparsity loss.
  7. The natural language processing method according to claim 1, characterized in that the total loss = k1*loss1 + k2*loss2, where k1 and k2 are preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  8. The natural language processing method according to claim 7, characterized by further comprising:
    after the trained second natural language processing model is obtained, outputting prompt information when the prediction loss is determined to be higher than a first threshold or the sparsity loss is higher than a second threshold.
  9. The natural language processing method according to claim 8, characterized by further comprising:
    receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
  10. The natural language processing method according to claim 1, characterized in that inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result output by the second natural language processing model for the text to be processed comprises:
    inputting the text to be processed into the second natural language processing model to obtain the semantic recognition result output by the second natural language processing model for the text to be processed.
  11. The natural language processing method according to any one of claims 1 to 10, characterized in that performing hardware deployment based on the second natural language processing model comprises:
    determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
    for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter group and column sparsification parameter group, to obtain a sparsified model parameter matrix;
    performing hardware deployment based on each sparsified model parameter matrix.
  12. The natural language processing method according to claim 11, characterized in that performing hardware deployment based on each sparsified model parameter matrix comprises:
    performing hardware deployment based on each sparsified model parameter matrix and, during deployment, padding the calculation results of the corresponding model parameter matrix with zeros, following the principle of keeping dimensions unchanged.
  13. The natural language processing method according to claim 1, characterized in that establishing an initial natural language processing model and training it to obtain a trained first natural language processing model comprises:
    obtaining an initial natural language processing model with a deep network structure, and corresponding text data;
    training the initial natural language processing model with the text data, to obtain a trained first natural language processing model.
  14. The natural language processing method according to claim 1, characterized in that setting the row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and the column sparsification parameter group for determining whether columns in the model parameter matrix are retained, comprises:
    setting a row sparsification parameter group, represented by a first vector, for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group, represented by a second vector, for determining whether columns in the model parameter matrix are retained;
    wherein each value in the first vector is used to determine whether the corresponding row in the model parameter matrix is retained, and each value in the second vector is used to determine whether the corresponding column in the model parameter matrix is retained.
  15. The natural language processing method according to claim 1, characterized in that determining the prediction loss during the forward propagation of the training comprises:
    obtaining the model output of the first natural language processing model;
    determining the prediction loss in the forward propagation of the training according to the model output and a loss function.
  16. The natural language processing method according to claim 1, characterized in that determining the sparsity loss during the forward propagation of the training comprises:
    obtaining, in the first natural language processing model, a first back-propagation result corresponding to the model parameter row masks and a second back-propagation result corresponding to the model parameter column masks;
    determining the sparsity loss in the forward propagation of the training according to the first back-propagation result, the second back-propagation result, and a loss function.
  17. The natural language processing method according to claim 1, characterized in that inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result output by the second natural language processing model for the text to be processed comprises:
    inputting the text to be processed into the second natural language processing model for one of semantic recognition, grammatical error analysis, knowledge extraction, and text translation, to obtain the natural language processing result output by the second natural language processing model for the text to be processed;
    wherein the natural language processing result is a result of one of the semantic recognition, the grammatical error analysis, the knowledge extraction, and the text translation.
  18. A natural language processing system, characterized by comprising:
    a first natural language processing model determination module, used to establish an initial natural language processing model and train it, to obtain a trained first natural language processing model;
    a sparsification setting module, used to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;
    a pruning module, used to train the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of the model parameter matrices, to determine a prediction loss and a sparsity loss during the forward propagation of the training, and, during the back propagation of the training, to update the remaining parameters of the first natural language processing model that have not yet been sparsified by means of the prediction loss, and to update each of the row sparsification parameter groups and each of the column sparsification parameter groups by means of the prediction loss and the sparsity loss;
    a second natural language processing model determination module, used to obtain a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges;
    an execution module, used to perform hardware deployment based on the second natural language processing model and, after deployment is complete, to input the text to be processed into the second natural language processing model, obtaining the natural language processing result output by the second natural language processing model for the text to be processed.
  19. A natural language processing device, characterized by comprising:
    a memory, used to store a computer program;
    a processor, used to execute the computer program to implement the steps of the natural language processing method according to any one of claims 1 to 17.
  20. A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and when the computer program is executed by a processor, the steps of the natural language processing method according to any one of claims 1 to 17 are implemented.
PCT/CN2023/098938 2022-10-11 2023-06-07 Natural language processing method, system and device, and storage medium WO2024077981A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211237680.XA CN115329744B (en) 2022-10-11 2022-10-11 Natural language processing method, system, equipment and storage medium
CN202211237680.X 2022-10-11

Publications (1)

Publication Number Publication Date
WO2024077981A1

Family

ID=83914501

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098938 WO2024077981A1 (en) 2022-10-11 2023-06-07 Natural language processing method, system and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115329744B (en)
WO (1) WO2024077981A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329744B (en) * 2022-10-11 2023-04-07 浪潮电子信息产业股份有限公司 Natural language processing method, system, equipment and storage medium
CN117668563B (en) * 2024-01-31 2024-04-30 苏州元脑智能科技有限公司 Text recognition method, text recognition device, electronic equipment and readable storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021673A1 (en) * 2020-07-31 2022-02-03 中国原子能科学研究院 Method and system for predicting sparse matrix vector multiplication operation time
CN114490922A (en) * 2020-10-27 2022-05-13 华为技术有限公司 Natural language understanding model training method and device
CN114723047A (en) * 2022-04-15 2022-07-08 支付宝(杭州)信息技术有限公司 Task model training method, device and system
CN115329744A (en) * 2022-10-11 2022-11-11 浪潮电子信息产业股份有限公司 Natural language processing method, system, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LA Vu Tuan; DAO Van Phuong; ZUO Jiakuo; ZHAO Li, "Adaptive Compressed Sensing Method for Speech", Journal of Southeast University (Natural Science Edition), vol. 42, no. 6, 30 November 2012, pp. 1027-1030, ISSN: 1001-0505, DOI: 10.3969/j.issn.1001-0505.2012.06.001 *
LEI Chenyi; LIU Dong; LI Weiping; ZHA Zheng-Jun; LI Houqiang, "Comparative Deep Learning of Hybrid Representations for Image Recommendations", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 27 June 2016, pp. 2545-2553, DOI: 10.1109/CVPR.2016.279 *
LI Xiaowei; SHU Hui; GUANG Yan; ZHAI Yi; YANG Zi-Ji, "Survey of the Application of Natural Language Processing for Resume Analysis", Computer Science, vol. 49, no. 6A, 30 June 2022, pp. 66-73, ISSN: 1002-137X, DOI: 10.11896/jsjkx.210600134 *

Also Published As

Publication number Publication date
CN115329744B (en) 2023-04-07
CN115329744A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
WO2024077981A1 (en) Natural language processing method, system and device, and storage medium
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN108280514B (en) FPGA-based sparse neural network acceleration system and design method
WO2020140386A1 (en) Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium
CN113905391B (en) Integrated learning network traffic prediction method, system, equipment, terminal and medium
CN112784964A (en) Image classification method based on bridging knowledge distillation convolution neural network
US20220083868A1 (en) Neural network training method and apparatus, and electronic device
CN108764317A (en) A kind of residual error convolutional neural networks image classification method based on multichannel characteristic weighing
CN112215353B (en) Channel pruning method based on variational structure optimization network
CN107395211B (en) Data processing method and device based on convolutional neural network model
CN110909874A (en) Convolution operation optimization method and device of neural network model
CN109583586B (en) Convolution kernel processing method and device in voice recognition or image recognition
CN110751265A (en) Lightweight neural network construction method and system and electronic equipment
CN113157919B (en) Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system
CN107784360A (en) Step-by-step movement convolutional neural networks beta pruning compression method
CN107644252A (en) A kind of recurrent neural networks model compression method of more mechanism mixing
CN116644804B (en) Distributed training system, neural network model training method, device and medium
CN111126595A (en) Method and equipment for model compression of neural network
CN113111889A (en) Target detection network processing method for edge computing terminal
CN111353534A (en) Graph data category prediction method based on adaptive fractional order gradient
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
Fuketa et al. Image-classifier deep convolutional neural network training by 9-bit dedicated hardware to realize validation accuracy and energy efficiency superior to the half precision floating point format
CN117521763A (en) Artificial intelligent model compression method integrating regularized pruning and importance pruning
CN116431816B (en) Document classification method, apparatus, device and computer readable storage medium
WO2023246177A1 (en) Image processing method, and electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876192

Country of ref document: EP

Kind code of ref document: A1