WO2024077981A1 - Natural language processing method, system and device, and storage medium - Google Patents

Natural language processing method, system and device, and storage medium

Info

Publication number
WO2024077981A1
Authority
WO
WIPO (PCT)
Prior art keywords
natural language
language processing
model
sparsification
parameter
Prior art date
Application number
PCT/CN2023/098938
Other languages
French (fr)
Chinese (zh)
Inventor
李兵兵
阚宏伟
王彦伟
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2024077981A1 publication Critical patent/WO2024077981A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of machine learning technology, and in particular to a natural language processing method, system, device and storage medium.
  • Natural language processing models require matrix multiplication calculations.
  • large-scale natural language deep learning models based on attention mechanisms contain a large number of matrix multiplication calculations.
  • deep network model parameters are highly redundant, which provides the conditions for inference optimization based on model compression.
  • the natural language processing model based on the attention mechanism is usually composed of multiple functional modules that are cyclically and sequentially superimposed.
  • Figure 1 is a commonly used natural language processing model based on the attention mechanism.
  • most calculations are performed in the form of matrix multiplication.
  • the calculation of the Multi-Head Attention module in Figure 1 requires multiple matrix multiplication operations.
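  • as an aside for illustration, a minimal single-head sketch of those multiplications might look as follows; the names and shapes here are assumptions for exposition, not taken from the patent:

```python
import numpy as np

# Minimal single-head attention sketch (names/shapes are illustrative assumptions).
def attention_head(X, Wq, Wk, Wv):
    Q = X @ Wq                                 # matrix multiply 1: query projection
    K = X @ Wk                                 # matrix multiply 2: key projection
    V = X @ Wv                                 # matrix multiply 3: value projection
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # matrix multiply 4: attention scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ V                         # matrix multiply 5: weighted sum of values

X = np.random.randn(8, 16)                     # 8 tokens, hidden size 16
Wq = Wk = Wv = np.random.randn(16, 16)
out = attention_head(X, Wq, Wk, Wv)            # every projection is a dense matmul the patent targets
```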
  • the currently common knowledge distillation method starts from a trained large model, namely the teacher model, obtains the model's final output loss and the output values of its intermediate layers, and from these derives the distillation loss.
  • for the small model of a specified structure, namely the student model, some parameters of the large model are selected as its initialization parameters.
  • for example, 20 layers of the original 100-layer large model are selected to form the small model, and the small model is then trained through the model prediction loss and the distillation loss.
  • in the knowledge distillation method, since the small model selects only some parameters of the large model for initialization, the model parameters undergo large adjustments during training, which leads to a significant decrease in model accuracy.
  • moreover, the structure of the small model must be given manually, which limits the flexibility of the model structure and the effectiveness of model compression, and is not conducive to ensuring model accuracy.
  • in addition, knowledge distillation is a software-level optimization; in actual hardware deployment it uses the general matrix multiplication calculation method, so it cannot form an effective collaborative optimization with the software layer, and the space for optimization on existing hardware is very limited.
  • the purpose of this application is to provide a natural language processing method, system, device and storage medium to effectively implement natural language processing, ensure accuracy, and effectively perform collaborative optimization at the software and hardware levels.
  • a natural language processing method comprising:
  • for any one model parameter matrix of the first natural language processing model, setting a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;
  • the first natural language processing model is trained according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, and in the forward propagation process of the training, a prediction loss and a sparsity loss are determined, and in the backward propagation process of the training, the remaining parameters of the first natural language processing model that are not currently sparse are updated by the prediction loss, and each of the row sparsification parameter groups and each of the column sparsification parameter groups are updated by the prediction loss and the sparsity loss;
  • Hardware deployment is performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated by using the prediction loss, including:
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated with the goal of reducing the prediction loss.
  • updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss includes:
  • each of the row sparsification parameter groups and each of the column sparsification parameter groups are updated with the goal of reducing the total loss.
  • the updating of each of the row sparsification parameter groups and each of the column sparsification parameter groups includes:
  • Sk represents the row sparsification parameter group at the current moment
  • Sk+1 represents the row sparsification parameter group at the next moment
  • lr represents the learning rate for the differential calculation of the sparsification parameters
  • Loss represents the total loss
  • Softplus represents the Softplus function
  • Qk represents the column sparsification parameter group at the current moment
  • Qk+1 represents the column sparsification parameter group at the next moment
  • Mk represents the model parameter mask of the model parameter matrix
  • Mij represents the value in the i-th row and j-th column of Mk
  • xi represents the value of the i-th parameter in Sk
  • yj represents the value of the j-th parameter in Qk
  • i and j are both positive integers
  • 1 ≤ i ≤ a, 1 ≤ j ≤ b, where a and b are respectively the number of rows and columns of the model parameter matrix.
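  • the referenced update formulas are not reproduced textually here; a plausible reconstruction from the variable definitions above and the sign-based retention rule described later (an assumption, not the verbatim published formulas) is:

```latex
% Assumed STE forward: a parameter survives only if its row and column are both retained.
M_{ij} = \begin{cases} 1, & x_i > 0 \ \text{and}\ y_j > 0 \\ 0, & \text{otherwise} \end{cases}

% Assumed STE backward: Softplus acts as the smooth surrogate through which the
% total loss is differentiated, with learning rate lr.
S_{k+1} = S_k - lr \cdot \frac{\partial \mathrm{Loss}}{\partial\, \mathrm{Softplus}(S_k)}
\qquad
Q_{k+1} = Q_k - lr \cdot \frac{\partial \mathrm{Loss}}{\partial\, \mathrm{Softplus}(Q_k)}
```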
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is a default value after being set and before being updated, so as to retain each row and each column in any of the model parameter matrices.
  • the total loss is the sum of the prediction loss and the sparsity loss.
  • the total loss = k1*loss1 + k2*loss2, wherein k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  • it also includes: after obtaining the trained second natural language processing model, outputting prompt information when it is determined that the prediction loss is higher than a first threshold or the sparsity loss is higher than a second threshold.
  • it also includes: receiving a coefficient adjustment instruction, and adjusting the values of k1 and/or k2 according to the coefficient adjustment instruction.
  • inputting the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model includes:
  • the text to be processed is input into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
  • the hardware deployment based on the second natural language processing model includes:
  • the non-zero parameter rows and the non-zero parameter columns of the model parameter matrix are extracted to obtain a sparse model parameter matrix
  • Hardware deployment is performed based on each sparse model parameter matrix.
  • the hardware deployment based on each sparse model parameter matrix includes:
  • Hardware deployment is performed based on each sparse model parameter matrix. During deployment, zeros are added to the calculation results of the corresponding model parameter matrix in accordance with the principle of maintaining dimensional invariance.
  • a natural language processing system comprising:
  • a first natural language processing model determination module is used to establish an initial natural language processing model and perform training to obtain a trained first natural language processing model
  • a sparsification setting module used for setting, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for determining whether a row in the model parameter matrix is retained, and a column sparsification parameter group for determining whether a column in the model parameter matrix is retained;
  • a pruning module configured to train the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, and determine the prediction loss and the sparsity loss during the forward propagation process of the training, and during the backward propagation process of the training, update the remaining parameters of the first natural language processing model that are not currently sparse by using the prediction loss, and update each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss;
  • a second natural language processing model determination module configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges
  • An execution module is used to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
  • a natural language processing device comprising:
  • a processor is used to execute the computer program to implement the steps of the natural language processing method as described above.
  • a non-volatile computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the natural language processing method described above are implemented.
  • the present application will not directly delete some layers in the first natural language processing model, but will set a row sparse parameter group for determining whether the rows in the model parameter matrix are retained, and a column sparse parameter group for determining whether the columns in the model parameter matrix are retained, for any one model parameter matrix of the first natural language processing model.
  • which rows and columns of the model parameter matrix are retained, and which rows and columns are excluded are determined by the corresponding row and column sparse parameter groups.
  • the first natural language processing model can be trained according to the row sparse parameter group and column sparse parameter group of each model parameter matrix.
  • the prediction loss and sparsity loss are determined. Since the remaining parameters of the first natural language processing model that are not currently sparse are updated through the prediction loss during the back propagation process of training, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after the training is completed, that is, the present application achieves lossless accuracy while achieving optimization.
  • the present application will also update each row sparsification parameter group and each column sparsification parameter group through the prediction loss and sparsity loss.
  • when the total loss determined based on the prediction loss and sparsity loss converges, it means that the prediction loss and sparsity loss have reached a suitable level, and when the sparsity loss reaches a suitable optimization level, it means that some rows and columns of each model parameter matrix in the second natural language processing model have been filtered out and deleted.
  • hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, through the second natural language processing model, natural language processing of the text to be processed can be performed effectively.
  • since the present application sets, during optimization, a row sparsification parameter group for determining whether the rows in the model parameter matrix are retained and a column sparsification parameter group for determining whether the columns in the model parameter matrix are retained, the rows and columns that have been filtered out do not need to be deployed in hardware after the optimization is completed; that is, during hardware deployment only the rows and columns remaining after sparsification need to be considered, so the hardware requirements of the present application are lower and the amount of calculation is small, achieving the coordinated optimization of software and hardware.
  • the solution of the present application can effectively implement natural language processing, and can effectively perform collaborative optimization at the software and hardware levels to improve the natural language processing efficiency of the model. At the same time, after the optimization is completed, there will be no loss of accuracy.
  • FIG. 1 is a schematic diagram of the structure of a commonly used natural language processing model based on an attention mechanism;
  • FIG. 2 is a schematic diagram showing the calculation principle of matrix multiplication;
  • FIG. 3 is a flowchart of an implementation of a natural language processing method in the present application;
  • FIG. 4 is a schematic diagram of the training principle of the first natural language processing model in the present application;
  • FIG. 5 is a schematic diagram showing the changes in the model parameter matrix after sparsification in the present application;
  • FIG. 6 is a schematic diagram of the structure of a natural language processing system in the present application;
  • FIG. 7 is a schematic diagram of the structure of a natural language processing device in the present application.
  • the core of this application is to provide a natural language processing method that can effectively implement natural language processing and can effectively perform collaborative optimization at the software and hardware levels to improve the natural language processing efficiency of the model. At the same time, after the optimization is completed, there will be no loss of accuracy.
  • FIG. 3 is a flowchart of an implementation of a natural language processing method in the present application.
  • the natural language processing method may include the following steps:
  • Step S301: Establish an initial natural language processing model and perform training to obtain a trained first natural language processing model.
  • an initial natural language processing model may be established first.
  • the specific form of the initial natural language processing model may be various and may be set and adjusted according to actual needs, such as an initial natural language processing model that adopts a deep network structure.
  • After the initial natural language processing model is established, it can be trained with training samples. When the recognition accuracy meets the requirement, the training can be determined to be complete, and the trained first natural language processing model is obtained.
  • the training samples are usually text data.
  • Step S302: For any one model parameter matrix of the first natural language processing model, set a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained.
  • the first natural language processing model will include multiple model parameter matrices. It can be understood that at this time, since sparsification has not yet been performed, each model parameter matrix is an original, non-sparse model parameter matrix.
  • the present application will set a row sparsification parameter group for determining whether the rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether the columns in the model parameter matrix are retained.
  • for example, assume the original model parameter matrix is a model parameter matrix with 4 rows and 3 columns,
  • and that the 2nd row and the 2nd column need to be sparsified, that is, the 2nd row and the 2nd column do not need to be retained.
  • each row sparsification parameter group and each column sparsification parameter group will be continuously adjusted. Therefore, when performing the initial setting of each row sparsification parameter group and each column sparsification parameter group, they can be set arbitrarily. However, in actual applications, in order to ensure accuracy, when performing the initial setting of each row sparsification parameter group and each column sparsification parameter group, each row and each column of each model parameter matrix are usually retained.
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value after the setting is completed and before being updated, so as to retain each row and each column in any model parameter matrix.
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value, making the setting process simple and convenient.
  • there are many specific forms of the row sparsification parameter groups and the column sparsification parameter groups, as long as they can realize their respective functions in the present application.
  • the row sparsification parameter group can usually be set in the form of a vector, and each numerical value in the vector is used to determine whether the corresponding row of the model parameter matrix is retained.
  • the column sparsification parameter group can usually be set in the form of a vector, and each numerical value in the vector is used to determine whether the corresponding column of the model parameter matrix is retained.
  • Step S303: The first natural language processing model is trained according to the row sparsification parameter groups and column sparsification parameter groups of each model parameter matrix, and the prediction loss and the sparsity loss are determined during the forward propagation of the training. During the backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparse are updated through the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated through the prediction loss and the sparsity loss.
  • the row sparsification parameter group and the column sparsification parameter group can determine the model parameter matrix after sparsification, and then train the first natural language processing model.
  • the present application needs to determine the prediction loss and sparsity loss.
  • the prediction loss is marked as ce_loss, which is loss1 in the subsequent implementation.
  • the sparsity loss is marked as sparsity_loss, which is loss2 in the subsequent implementation.
  • the prediction loss reflects the prediction accuracy of the first natural language processing model. The smaller the prediction loss, the higher the prediction accuracy of the first natural language processing model.
  • the present application will update the remaining parameters of the first natural language processing model that are not currently sparse through the prediction loss during the back propagation process of training.
  • in this case, the remaining un-sparsified parameters of the model parameter matrix are 6, namely the parameter in the 1st row and 1st column, the parameter in the 1st row and 3rd column, the parameter in the 3rd row and 1st column, the parameter in the 3rd row and 3rd column, the parameter in the 4th row and 1st column, and the parameter in the 4th row and 3rd column.
  • it should be noted that the rows and columns finally chosen for sparsification may not necessarily be the 2nd row and the 2nd column.
  • updating the remaining parameters of the first natural language processing model that are not currently sparse by using the prediction loss may include:
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated with the goal of reducing the prediction loss.
  • the prediction loss reflects the prediction accuracy of the first natural language processing model.
  • the present application also updates each row sparse parameter group and each column sparse parameter group through prediction loss and sparsity loss.
  • the sparsity loss reflects the degree to which the first natural language processing model is sparse. The smaller the sparsity loss, the higher the degree to which the first natural language processing model is sparse, that is, more rows and columns do not need to be retained.
  • the updating of each row sparsification parameter group and each column sparsification parameter group by prediction loss and sparsity loss described in step S303 includes:
  • each row sparsification parameter group and each column sparsification parameter group are updated with the goal of reducing the total loss.
  • the total loss is composed of the prediction loss and the sparsity loss; as described above, the lower the prediction loss, the higher the accuracy, and the lower the sparsity loss, the higher the degree of sparsity.
  • however, as the model becomes sparser, the prediction loss may increase. Therefore, in this implementation, each row sparsification parameter group and each column sparsification parameter group are updated with the training goal of reducing the total loss, which is equivalent to making a trade-off between the prediction loss and the sparsity loss with the purpose of reducing the total loss. A training-loop sketch of this procedure is given below.
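  • to make this trade-off concrete, the following is a minimal training-loop sketch; the sign-based mask with a Softplus straight-through estimator, the particular loss functions, and the plain-gradient updates are illustrative assumptions, not the verbatim published procedure:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: one weight matrix W (a x b) with row/column sparsification groups.
W = torch.nn.Parameter(torch.randn(6, 6))
S = torch.nn.Parameter(torch.zeros(6) + 0.1)    # row sparsification parameter group
Q = torch.nn.Parameter(torch.zeros(6) + 0.1)    # column sparsification parameter group
k1, k2, lr = 1.0, 0.1, 1e-2                     # preset loss coefficients, learning rate

def ste_mask(v):
    hard = (v > 0).float()                      # forward: keep a row/column if its parameter > 0
    soft = F.softplus(v)                        # backward: smooth surrogate (assumed STE)
    return hard + soft - soft.detach()          # straight-through estimator

for step in range(100):
    x = torch.randn(8, 6)
    target = torch.randn(8, 6)
    M = ste_mask(S)[:, None] * ste_mask(Q)[None, :]   # model parameter mask Mk
    y = x @ (W * M)                                   # only retained rows/columns contribute
    loss1 = F.mse_loss(y, target)                     # prediction loss (ce_loss in the text)
    loss2 = M.sum() / M.numel()                       # sparsity loss (one plausible choice)
    loss = k1 * loss1 + k2 * loss2                    # total loss
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad * M     # prediction loss updates only un-sparsified parameters
        S -= lr * S.grad         # total loss updates the row sparsification group
        Q -= lr * Q.grad         # total loss updates the column sparsification group
        for p in (W, S, Q):
            p.grad = None
```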
  • the updating of each row sparsification parameter group and each column sparsification parameter group may specifically be performed according to the row update formula and the column update formula, wherein:
  • Sk represents the row sparsification parameter group at the current moment
  • Sk+1 represents the row sparsification parameter group at the next moment
  • lr represents the learning rate for the differential calculation of the sparsification parameters
  • Loss represents the total loss
  • Softplus represents the Softplus function
  • Qk represents the column sparsification parameter group at the current moment
  • Qk+1 represents the column sparsification parameter group at the next moment
  • Mk represents the model parameter mask of the model parameter matrix
  • Mij represents the value in the i-th row and j-th column of Mk
  • xi represents the value of the i-th parameter in Sk
  • yj represents the value of the j-th parameter in Qk.
  • Both i and j are positive integers, and 1 ≤ i ≤ a, 1 ≤ j ≤ b, where a and b are respectively the number of rows and columns of the model parameter matrix.
  • the STE forward function indicates that the model parameter mask of the model parameter matrix is determined based on the row sparse parameter group and the column sparse parameter group of the model parameter matrix.
  • the STE reverse function indicates the update process of the row sparse parameter group and the column sparse parameter group.
  • for example, assume that the row sparsification parameter group of the model parameter matrix is [-5.5, 3.0, 1.2]
  • and the column sparsification parameter group is [3.3, -2.2, 1.0]. Since -5.5 and -2.2 are negative numbers, the first row and the second column of the model parameter matrix W are sparsified, that is, the first row and the second column of the model parameter matrix W do not need to be retained.
  • the model parameter mask of the model parameter matrix can be expressed accordingly: a 0 in the model parameter mask indicates that the parameter is set to 0, and a 1 indicates that the parameter may be non-zero.
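  • the following small sketch reproduces this example with the sign-based retention rule (numpy is used purely for illustration):

```python
import numpy as np

S = np.array([-5.5, 3.0, 1.2])   # row sparsification parameter group
Q = np.array([3.3, -2.2, 1.0])   # column sparsification parameter group

# A row/column is retained when its sparsification parameter is positive.
row_keep = (S > 0).astype(int)   # -> [0, 1, 1]: the 1st row is sparsified
col_keep = (Q > 0).astype(int)   # -> [1, 0, 1]: the 2nd column is sparsified

M = np.outer(row_keep, col_keep) # model parameter mask Mk
print(M)
# [[0 0 0]
#  [1 0 1]
#  [1 0 1]]
```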
  • Step S304: When the total loss determined based on the prediction loss and the sparsity loss converges, a trained second natural language processing model is obtained.
  • the training is completed and a trained second natural language processing model can be obtained.
  • for example, as shown in FIG. 5, a model parameter matrix that was 6x6 in size becomes 4x4 in size after sparsification.
  • Step S305: Perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
  • the text to be processed is input into the second natural language processing model, and the natural language processing result for the text to be processed output by the second natural language processing model can be obtained.
  • the hardware deployment based on the second natural language processing model described in step S305 may specifically include:
  • the non-zero parameter rows and the non-zero parameter columns of the model parameter matrix are extracted to obtain the sparse model parameter matrix
  • Hardware deployment is performed based on each sparse model parameter matrix.
  • the row sparse parameter groups and column sparse parameter groups of each final model parameter matrix can be determined, and then the non-zero parameter rows and non-zero parameter columns of the model parameter matrix can be extracted to obtain the sparse model parameter matrix.
  • a 4x4 model parameter matrix is obtained.
  • the 6x6 model parameter matrix of Figure 5, together with the indices of its non-zero parameter rows and non-zero parameter columns, can be input to extract the non-zero parameter rows and non-zero parameter columns of the model parameter matrix to obtain the sparse model parameter matrix, and hardware deployment is then performed based on each sparse model parameter matrix.
  • for example, the TensorRT interface call can be modified so that, for the second row of the input 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position; similarly, for the sixth row of the 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position, and for the second and fifth columns of the 6x6 model parameter matrix of Figure 5, row compression is performed at the corresponding position. Finally, the non-zero parameter rows and non-zero parameter columns of the model parameter matrix are extracted, and the sparse model parameter matrix is obtained.
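  • as a generic illustration of this extraction step (plain numpy, not the TensorRT interface, which the text describes only at a high level):

```python
import numpy as np

def compress(W, row_keep, col_keep):
    """Extract non-zero parameter rows/columns to get the sparse model parameter matrix."""
    return W[np.ix_(row_keep, col_keep)]

W = np.arange(36.0).reshape(6, 6)   # 6x6 model parameter matrix of Figure 5 (values illustrative)
row_keep = [0, 2, 3, 4]             # rows 2 and 6 sparsified (1-based), per the example
col_keep = [0, 2, 3, 5]             # columns 2 and 5 sparsified (1-based), per the example

W_sparse = compress(W, row_keep, col_keep)
print(W_sparse.shape)               # (4, 4): the matrix actually deployed
```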
  • hardware deployment based on each sparse model parameter matrix may specifically include:
  • Hardware deployment is performed based on each sparse model parameter matrix. During deployment, zeros are added to the calculation results of the corresponding model parameter matrix in accordance with the principle of maintaining dimensional invariance.
  • a model parameter matrix of a certain layer is 6x6 in size before sparsification, and is used to multiply input data of size 106x6 to obtain an output of size 106x6. If the model parameter matrix becomes 4x4 in size after sparsification, the last two columns of the 106x6 input data do not need to be used, that is, the 106x4 input data can be multiplied by the 4x4 model parameter matrix to obtain an output of size 106x4. After obtaining the output of size 106x4, this implementation method uses the principle of maintaining dimensional invariance and fills the calculation results of the corresponding model parameter matrix with 0, that is, in this example, 2 columns of 0 will be added to the 106x4 output to restore it to an output of size 106x6.
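  • a sketch of this compute-then-pad step, with shapes following the example above (the assumption that the dropped input columns and output columns sit at the end is for illustration only):

```python
import numpy as np

X = np.random.randn(106, 6)            # original input data
W_sparse = np.random.randn(4, 4)        # model parameter matrix after sparsification
in_keep = [0, 1, 2, 3]                  # input columns matching the retained weight rows
out_cols = 6                            # original output dimension to restore

Y_small = X[:, in_keep] @ W_sparse      # 106x4 multiply: the hardware only computes this
Y = np.zeros((X.shape[0], out_cols))    # 106x6 output, dimensional invariance preserved
Y[:, :W_sparse.shape[1]] = Y_small      # zero-pad the 2 dropped columns of the result

print(Y_small.shape, Y.shape)           # (106, 4) (106, 6)
```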
  • the 0-padding in this implementation method is to fill the calculation results of the corresponding model parameter matrix with 0, rather than to fill the corresponding sparse rows and columns with 0. That is, for the hardware, the multiplication calculation of the model parameter matrix of size 4x4 is performed, rather than the multiplication calculation of the model parameter matrix of size 6x6. If the multiplication calculation of the model parameter matrix of size 6x6 is performed, it is equivalent to the optimization of the software not being coordinated with the hardware, because in this case the amount of calculation on the hardware is unchanged. In some traditional solutions, when optimizing the software, some parameters in the model parameter matrix will be reset to zero. These reset parameters are randomly distributed, resulting in these 0s still needing to participate in the matrix multiplication operation on the hardware, and no coordinated optimization of the hardware is achieved.
  • that is, the above example restores the 106x4 output to a 106x6 output by padding 2 columns of 0 after the 106x4 output is obtained, thereby ensuring dimensional invariance, that is, ensuring that the output size of the second natural language processing model is consistent with the output of the original model, while the previously determined sparse rows and columns do not participate in the matrix multiplication operation.
  • the total loss of the present application is composed of prediction loss and sparsity loss.
  • the total loss can be specifically the sum of prediction loss and sparsity loss, which is also a relatively simple setting method in practical applications.
  • the total loss = k1*loss1 + k2*loss2, wherein k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  • a certain weight can be set for the prediction loss and the sparsity loss respectively, that is, the two have different degrees of influence on the total loss.
  • the larger k1 is, the greater the influence of the prediction loss on the total loss.
  • the larger k2 is, the greater the influence of the sparsity loss on the total loss.
  • Such an implementation is more conducive to simplifying the model and can usually be used in situations where high computing speed is required.
  • the total loss can also be selected as other forms, which can be set according to actual conditions and does not affect the implementation of this application. However, it can be understood that the total loss usually needs to be positively correlated with the prediction loss and positively correlated with the sparsity loss.
  • this application considers that training is completed when the total loss determined based on the prediction loss and the sparsity loss converges. In most cases, when the total loss converges, the prediction loss and the sparsity loss usually reach a low level, but in a few cases, the prediction loss or the sparsity loss may still be high.
  • therefore, when the prediction loss is higher than the first threshold, or when the sparsity loss is higher than the second threshold, a prompt message will be output so that the staff can promptly notice the situation and take corresponding measures.
  • the value of k1 in the aforementioned implementation can be appropriately increased so that the total loss takes more consideration of the predicted loss.
  • the value of k2 in the aforementioned implementation can be appropriately increased so that the total loss takes more consideration of the sparsity loss.
  • it may also include: receiving a coefficient adjustment instruction, and adjusting the values of k1 and/or k2 according to the coefficient adjustment instruction.
  • the step S305 of inputting the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model may specifically include:
  • the text to be processed is input into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
  • the second natural language processing model of the present application can process the text to be processed.
  • the specific processing purpose is usually to perform semantic recognition of the text to be processed.
  • the second natural language processing model of the present application can also perform other processing on the text to be processed, such as grammatical error analysis, knowledge extraction, text translation, etc.
  • the present application will not directly delete some layers in the first natural language processing model, but will set a row sparse parameter group for determining whether the rows in the model parameter matrix are retained, and a column sparse parameter group for determining whether the columns in the model parameter matrix are retained, for any one model parameter matrix of the first natural language processing model.
  • which rows and columns of the model parameter matrix are retained, and which rows and columns are excluded are determined by the corresponding row and column sparse parameter groups.
  • the first natural language processing model can be trained according to the row sparse parameter group and column sparse parameter group of each model parameter matrix.
  • the prediction loss and sparsity loss are determined. Since the remaining parameters of the first natural language processing model that are not currently sparse are updated through prediction loss during the back propagation process of training, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after the training is completed, that is, the present application achieves lossless accuracy while achieving optimization.
  • the present application will also update each row sparsification parameter group and each column sparsification parameter group through prediction loss and sparsity loss.
  • the prediction loss and sparsity loss are determined, the accuracy of the second natural language processing model obtained after the training is completed can be effectively guaranteed. That is, the present application achieves lossless accuracy while achieving optimization.
  • when the total loss converges, it means that the prediction loss and the sparsity loss have reached a suitable degree, and when the sparsity loss reaches a suitable degree of optimization, it means that some rows and columns of each model parameter matrix in the second natural language processing model have been filtered out and deleted.
  • hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, through the second natural language processing model, natural language processing of the text to be processed can be performed effectively.
  • since the present application sets, during optimization, the row sparsification parameter group used to determine whether the rows in the model parameter matrix are retained and the column sparsification parameter group used to determine whether the columns in the model parameter matrix are retained, the rows and columns that have been filtered out do not need to be deployed in hardware after the optimization is completed; that is, during hardware deployment only the rows and columns remaining after sparsification need to be considered, so the hardware requirements of the present application are lower and the amount of calculation is small, realizing the collaborative optimization of software and hardware.
  • the solution of the present application can effectively implement natural language processing, and can effectively perform collaborative optimization at the software and hardware levels to improve the natural language processing efficiency of the model. At the same time, after the optimization is completed, there will be no loss of accuracy.
  • the embodiment of the present application also provides a natural language processing system, which can be referenced in correspondence with the above.
  • FIG. 6 is a schematic diagram of the structure of a natural language processing system in the present application, the system including:
  • a first natural language processing model determination module 601 is used to establish an initial natural language processing model and perform training to obtain a trained first natural language processing model
  • a sparsification setting module 602 for setting, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for determining whether a row in the model parameter matrix is retained, and a column sparsification parameter group for determining whether a column in the model parameter matrix is retained;
  • the pruning module 603 is used to train the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, and determine the prediction loss and the sparsity loss during the forward propagation process of the training, and update the remaining parameters of the first natural language processing model that are not currently sparse by the prediction loss during the backward propagation process of the training, and update each row sparsification parameter group and each column sparsification parameter group by the prediction loss and the sparsity loss;
  • a second natural language processing model determination module 604 is used to obtain a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges;
  • the execution module 605 is used to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model.
  • the pruning module 603 updates the remaining parameters of the first natural language processing model that are not currently sparse by using the prediction loss during the back propagation process of the training, including:
  • the remaining parameters of the first natural language processing model that are not currently sparse are updated with the goal of reducing the prediction loss.
  • the pruning module 603 updates each row sparsification parameter group and each column sparsification parameter group by using the prediction loss and the sparsity loss, including:
  • each row sparsification parameter group and each column sparsification parameter group are updated with the goal of reducing the total loss.
  • the pruning module 603 updates each row sparsification parameter group and each column sparsification parameter group, including:
  • Sk represents the row sparsification parameter group at the current moment
  • Sk+1 represents the row sparsification parameter group at the next moment
  • lr represents the learning rate for the differential calculation of the sparsification parameters
  • Loss represents the total loss
  • Softplus represents the Softplus function
  • Qk represents the column sparsification parameter group at the current moment
  • Qk+1 represents the column sparsification parameter group at the next moment
  • Mk represents the model parameter mask of the model parameter matrix
  • Mij represents the value in the i-th row and j-th column of Mk
  • xi represents the value of the i-th parameter in Sk
  • yj represents the value of the j-th parameter in Qk
  • i and j are both positive integers
  • 1 ≤ i ≤ a, 1 ≤ j ≤ b, where a and b are respectively the number of rows and columns of the model parameter matrix.
  • each parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value after being set and before being updated, so as to retain each row and each column in any model parameter matrix.
  • the total loss is the sum of the prediction loss and the sparsity loss.
  • the total loss = k1*loss1 + k2*loss2, wherein k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  • the information prompt module is used to output prompt information after obtaining the trained second natural language processing model when it is judged that the prediction loss is higher than the first threshold or the sparsity loss is higher than the second threshold.
  • the coefficient adjustment module is used to receive the coefficient adjustment instruction and adjust the value of k1 and/or k2 according to the coefficient adjustment instruction.
  • the execution module 605 inputs the text to be processed into the second natural language processing model, and obtains the natural language processing result for the text to be processed output by the second natural language processing model, including:
  • the text to be processed is input into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
  • the execution module 605 performs hardware deployment based on the second natural language processing model, including:
  • the non-zero parameter rows and the non-zero parameter columns of the model parameter matrix are extracted to obtain the sparse model parameter matrix
  • Hardware deployment is performed based on each sparse model parameter matrix.
  • the execution module 605 performs hardware deployment based on each sparse model parameter matrix, including:
  • Hardware deployment is performed based on each sparse model parameter matrix. During deployment, zeros are added to the calculation results of the corresponding model parameter matrix in accordance with the principle of maintaining dimensional invariance.
  • the embodiment of the present application also provides a natural language processing device and a non-volatile readable storage medium, which can be referenced in correspondence with the above.
  • the non-volatile readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the natural language processing method in any of the above embodiments are implemented.
  • the natural language processing device may include:
  • the memory 701 is used for storing a computer program;
  • the processor 702 is used to execute a computer program to implement the steps of the natural language processing method in any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed in the present application are a natural language processing method, system and device, and a storage medium, which are applied to the technical field of machine learning. The method comprises: obtaining a trained first natural language processing model; setting row and column sparsity parameter groups used for determining whether to retain rows and columns in model parameter matrixes of the first natural language processing model, performing training, updating, by means of a prediction loss, current remaining parameters which are not sparsified, and updating the row and column sparsity parameter groups by means of the prediction loss and a sparsity loss; when a total loss converges, obtaining a trained second natural language processing model; and deploying hardware on the basis of the second natural language processing model and, after the deployment is completed, inputting into the second natural language processing model a text to be processed so as to obtain a natural language processing result. The solution of the present application can be used for effectively implementing natural language processing, and performing collaborative optimization on software and hardware levels without precision losses.

Description

A natural language processing method, system, device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on October 11, 2022, with application number 202211237680.X and entitled "A Natural Language Processing Method, System, Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of machine learning technology, and in particular to a natural language processing method, system, device and storage medium.

Background Art

The design and inference of natural language processing models rely on the training of the software model and the adaptation and deployment of the actual hardware. Natural language processing models require matrix multiplication calculations; for example, large natural language deep learning models based on the attention mechanism contain a large number of matrix multiplication calculations. At the same time, deep network model parameters are highly redundant, which provides the conditions for inference optimization based on model compression.

A natural language processing model based on the attention mechanism is usually composed of multiple functional modules that are cyclically and sequentially stacked; for example, Figure 1 shows a commonly used natural language processing model based on the attention mechanism. In a natural language processing model, most calculations are performed in the form of matrix multiplication; for example, the calculation of the Multi-Head Attention module in Figure 1 involves multiple matrix multiplication operations.

When performing a matrix multiplication operation, taking the multiplication of matrix A and matrix B in Figure 2 to obtain matrix C as an example, each row of matrix A must be multiplied element-wise with each column of matrix B and the products summed to obtain the element at the corresponding position in matrix C.
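As an illustration, this row-by-column multiply-and-sum is exactly what a standard matrix product computes (a minimal sketch with arbitrary example matrices):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # matrix A (2x2)
B = np.array([[5.0, 6.0], [7.0, 8.0]])   # matrix B (2x2)

# C[i, j] = sum over k of A[i, k] * B[k, j]: row i of A times column j of B.
C = A @ B
print(C)   # [[19. 22.] [43. 50.]]
```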
In order to accelerate the operation of deep networks, current acceleration methods generally optimize in two stages: software and hardware. At the software level, the model structure is simplified so that, at the cost of a certain degree of accuracy loss, a small model structure is used in place of a large model structure, thereby reducing the amount of calculation during model inference. At the hardware level, the streamlined model is deployed and accelerated in hardware to achieve efficient real-time inference.

In order to be deployed on existing hardware platforms, existing acceleration methods need to compress the original large model into a small model that conforms to the general matrix multiplication format. For example, the currently common knowledge distillation method starts from a trained large model, namely the teacher model, obtains the model's final output loss and the output values of its intermediate layers, and from these derives the distillation loss. For a small model of a specified structure, namely the student model, some parameters of the large model are selected as the initialization parameters of the small model; for example, 20 layers of the original 100-layer large model are selected to form the small model, and the small model is then trained through the model prediction loss and the distillation loss.

In this knowledge distillation approach, since the small model selects only some parameters of the large model for initialization, the model parameters undergo large adjustments during training, which leads to a significant decrease in model accuracy. In addition, the structure of the small model must be given manually, which limits the flexibility of the model structure and the effectiveness of model compression, and is not conducive to ensuring model accuracy. Moreover, knowledge distillation is a software-level optimization: in actual hardware deployment it uses the general matrix multiplication calculation method, so it cannot form an effective collaborative optimization with the software layer, and the space for optimization on existing hardware is very limited.
Summary of the Invention

The purpose of the present application is to provide a natural language processing method, system, device and storage medium, so as to effectively implement natural language processing and, while ensuring accuracy, effectively perform collaborative optimization at the software and hardware levels.

In order to solve the above technical problems, the present application provides the following technical solutions:

A natural language processing method, comprising:

establishing an initial natural language processing model and performing training to obtain a trained first natural language processing model;

for any one model parameter matrix of the first natural language processing model, setting a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;

training the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of each model parameter matrix, determining a prediction loss and a sparsity loss in the forward propagation process of the training, and, in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified by using the prediction loss, and updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss;

when the total loss determined based on the prediction loss and the sparsity loss converges, obtaining a trained second natural language processing model;

performing hardware deployment based on the second natural language processing model, and after the deployment is completed, inputting the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
In some embodiments, in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified by using the prediction loss includes:

in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified with the goal of reducing the prediction loss.

In some embodiments, updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by using the prediction loss and the sparsity loss includes:

in the backward propagation process of the training, updating each of the row sparsification parameter groups and each of the column sparsification parameter groups with the goal of reducing the total loss.

In some embodiments, the updating of each of the row sparsification parameter groups and each of the column sparsification parameter groups includes:
updating the row sparsification parameter group of any one model parameter matrix according to

$$S_{k+1} = S_k - lr \cdot \frac{\partial\, Loss}{\partial\, S_k}, \qquad \frac{\partial\, Loss}{\partial\, x_i} = \sum_{j=1}^{b} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(x_i);$$

and updating the column sparsification parameter group of any one model parameter matrix according to

$$Q_{k+1} = Q_k - lr \cdot \frac{\partial\, Loss}{\partial\, Q_k}, \qquad \frac{\partial\, Loss}{\partial\, y_j} = \sum_{i=1}^{a} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(y_j),$$

where Softplus' denotes the derivative of the Softplus function, i.e., the gradient through the hard model parameter mask is computed by a straight-through estimator whose backward pass replaces the step function with the Softplus function;
wherein S_k denotes the row sparsification parameter group at the current step, S_{k+1} denotes the row sparsification parameter group at the next step, lr denotes the learning rate for the differential computation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter group at the current step, Q_{k+1} denotes the column sparsification parameter group at the next step, M_k denotes the model parameter mask of the model parameter matrix, M_{ij} denotes the value in the i-th row and j-th column of M_k, x_i denotes the value of the i-th parameter in S_k, y_j denotes the value of the j-th parameter in Q_k, i and j are both positive integers with 1≤i≤a and 1≤j≤b, and a and b are the numbers of rows and columns of the model parameter matrix, respectively.
In some embodiments, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any model parameter matrix is retained.
In some embodiments, the total loss is the sum of the prediction loss and the sparsity loss.
In some embodiments, the total loss = k1*loss1 + k2*loss2, where k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
In some embodiments, the method further comprises:
after the trained second natural language processing model is obtained, outputting prompt information when it is determined that the prediction loss is higher than a first threshold or the sparsity loss is higher than a second threshold.
In some embodiments, the method further comprises:
receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
In some embodiments, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model comprises:
inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
In some embodiments, performing hardware deployment based on the second natural language processing model comprises:
determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter group and column sparsification parameter group to obtain a sparsified model parameter matrix; and
performing hardware deployment based on each sparsified model parameter matrix.
In some embodiments, performing hardware deployment based on each sparsified model parameter matrix comprises:
performing hardware deployment based on each sparsified model parameter matrix, and during deployment, padding the computation result of the corresponding model parameter matrix with zeros on the principle of maintaining dimensional invariance.
A natural language processing system, comprising:
a first natural language processing model determination module, configured to establish an initial natural language processing model and train it to obtain a trained first natural language processing model;
a sparsification setting module, configured to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained;
a pruning module, configured to train the first natural language processing model according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix, wherein a prediction loss and a sparsity loss are determined during forward propagation of the training; during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated by means of the prediction loss and the sparsity loss;
a second natural language processing model determination module, configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges; and
an execution module, configured to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed output by the second natural language processing model.
A natural language processing device, comprising:
a memory, configured to store a computer program; and
a processor, configured to execute the computer program to implement the steps of the natural language processing method described above.
A non-volatile readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the natural language processing method described above.
By applying the technical solution provided by the embodiments of the present application, an initial natural language processing model is established and trained, and after the trained first natural language processing model is obtained, sparsification is used to achieve software-level optimization. Specifically, the present application does not directly delete some layers of the first natural language processing model; instead, for any one model parameter matrix of the first natural language processing model, it sets a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained. That is, for any one model parameter matrix of the first natural language processing model, which rows and columns of that matrix are retained and which are excluded is decided by the corresponding row and column sparsification parameter groups. The first natural language processing model can then be trained according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix. During forward propagation of the training, a prediction loss and a sparsity loss are determined. Since, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after training, i.e., the present application achieves optimization without loss of accuracy. During backward propagation of the training, the present application also updates each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss. When the total loss determined based on the prediction loss and the sparsity loss converges, the prediction loss and the sparsity loss have reached a suitable level; and when the sparsity loss has reached a suitable degree of optimization, some rows and columns of each model parameter matrix of the second natural language processing model have been filtered out.
Afterwards, hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, the second natural language processing model can effectively perform natural language processing of the text to be processed. It can further be understood that, since the present application sets, during optimization, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained, the filtered rows and columns need not be deployed on hardware after optimization; that is, hardware deployment only needs to consider the rows and columns remaining after sparsification. The hardware requirements of the present application are therefore lower and the amount of computation is small, i.e., co-optimization of software and hardware is achieved.
In summary, the solution of the present application can effectively implement natural language processing and can effectively perform co-optimization at the software and hardware levels to improve the natural language processing efficiency of the model; moreover, no accuracy is lost after the optimization of the present application is completed.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Figure 1 is a schematic structural diagram of a commonly used natural language processing model based on an attention mechanism;
Figure 2 is a schematic diagram of the principle of matrix multiplication;
Figure 3 is an implementation flowchart of a natural language processing method in the present application;
Figure 4 is a schematic diagram of the training principle of the first natural language processing model in the present application;
Figure 5 is a schematic diagram of changes of a model parameter matrix after sparsification in the present application;
Figure 6 is a schematic structural diagram of a natural language processing system in the present application;
Figure 7 is a schematic structural diagram of a natural language processing device in the present application.
DETAILED DESCRIPTION OF THE EMBODIMENTS
The core of the present application is to provide a natural language processing method that can effectively implement natural language processing and can effectively perform co-optimization at the software and hardware levels to improve the natural language processing efficiency of the model; moreover, no accuracy is lost after the optimization of the present application is completed.
To enable a person skilled in the art to better understand the solution of the present application, the present application is further described in detail below with reference to the drawings and specific embodiments. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Referring to Figure 3, which is an implementation flowchart of a natural language processing method in the present application, the natural language processing method may include the following steps:
Step S301: establish an initial natural language processing model and train it to obtain a trained first natural language processing model.
Specifically, an initial natural language processing model may be established first. The initial natural language processing model may take various specific forms, which may be set and adjusted according to actual needs, for example, an initial natural language processing model adopting a deep network structure.
After the initial natural language processing model is established, it can be trained with training samples. When the recognition accuracy meets the requirement, the training can be determined to be completed, and what is obtained at this point is the trained first natural language processing model.
Since natural language processing is being performed, the training samples are usually text data.
Step S302: for any one model parameter matrix of the first natural language processing model, set a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained.
After the trained first natural language processing model is obtained, it contains multiple model parameter matrices. It can be understood that, since no sparsification has been performed at this point, each model parameter matrix is the original, non-sparsified model parameter matrix.
For any one model parameter matrix of the first natural language processing model, the present application sets a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained.
That is to say, for any one row of any one model parameter matrix, whether the row needs to be retained is decided by the row sparsification parameter group of that model parameter matrix. Likewise, for any one column of the model parameter matrix, whether the column needs to be retained is decided by the column sparsification parameter group of that model parameter matrix.
For example, in Figure 4, the original model parameter matrix has 4 rows and 3 columns. Through the row sparsification parameter group and the column sparsification parameter group of this model parameter matrix, it is determined that the 2nd row and the 2nd column need to be sparsified, i.e., the 2nd row and the 2nd column do not need to be retained.
It should further be noted that, during the training process of the subsequent step S303, each row sparsification parameter group and each column sparsification parameter group are continuously adjusted. The initial settings of each row sparsification parameter group and each column sparsification parameter group can therefore be arbitrary. In practice, however, to preserve accuracy, every row and every column of each model parameter matrix is usually retained in the initial settings of the row and column sparsification parameter groups.
For example, in some embodiments of the present application, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any model parameter matrix is retained. In this implementation, every parameter in any row sparsification parameter group and any column sparsification parameter group is set to a default value, which makes the setting process simple and convenient.
The row sparsification parameter group and the column sparsification parameter group may take various specific forms, as long as their respective functions in the present application can be realized. Since a model parameter matrix usually has multiple rows and columns, the row sparsification parameter group can usually be set in the form of a vector, with each value in the vector deciding whether the corresponding row of the model parameter matrix is retained. Likewise, the column sparsification parameter group can usually be set in the form of a vector, with each value in the vector deciding whether the corresponding column of the model parameter matrix is retained.
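By way of illustration only, the following minimal NumPy sketch shows how such parameter vectors can induce a row/column mask over a parameter matrix. It assumes the sign convention of the 3x3 example given later in this description (a positive parameter keeps the row or column, a negative one prunes it); the function name and the concrete parameter values are not part of the claimed solution.

```python
import numpy as np

def build_mask(row_params, col_params):
    # A row/column is retained when its sparsification parameter is
    # positive, and pruned when it is negative (the sign convention of
    # the 3x3 example given later in this description).
    row_keep = (np.asarray(row_params) > 0).astype(np.float32)  # shape (a,)
    col_keep = (np.asarray(col_params) > 0).astype(np.float32)  # shape (b,)
    return np.outer(row_keep, col_keep)                         # shape (a, b)

# The 4x3 matrix of Figure 4: the 2nd row and 2nd column are pruned.
W = np.ones((4, 3), dtype=np.float32)
S = np.array([1.0, -2.0, 0.5, 0.3])  # row sparsification group (illustrative values)
Q = np.array([0.8, -1.5, 0.2])       # column sparsification group (illustrative values)
W_masked = W * build_mask(S, Q)      # the 2nd row and 2nd column become all zeros
```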
Step S303: train the first natural language processing model according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix, wherein a prediction loss and a sparsity loss are determined during forward propagation of the training; during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated by means of the prediction loss and the sparsity loss.
The row sparsification parameter group and the column sparsification parameter group determine the sparsified model parameter matrix, based on which the first natural language processing model is trained.
During forward propagation of the training, the present application needs to determine the prediction loss and the sparsity loss. In Figure 4, the prediction loss is labeled ce_loss, i.e., loss1 in the subsequent implementations, and the sparsity loss is labeled sparsity_loss, i.e., loss2 in the subsequent implementations.
The prediction loss reflects the prediction accuracy of the first natural language processing model: the smaller the prediction loss, the higher the prediction accuracy of the first natural language processing model.
To preserve accuracy, during backward propagation of the training, the present application updates, by means of the prediction loss, the remaining parameters of the first natural language processing model that are not currently sparsified. For example, for the model parameter matrix in Figure 4 that originally has 4 rows and 3 columns, since the current row sparsification parameter group and column sparsification parameter group determine that the 2nd row and the 2nd column need to be sparsified, the matrix has 6 remaining non-sparsified parameters, namely those at (row 1, column 1), (row 1, column 3), (row 3, column 1), (row 3, column 3), (row 4, column 1) and (row 4, column 3). Of course, as training proceeds, the row and column finally chosen for sparsification for this model parameter matrix will not necessarily be the 2nd row and the 2nd column.
In some embodiments of the present application, updating, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified by means of the prediction loss, as described in step S303, may include:
during backward propagation of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified, with reducing the prediction loss as the training objective.
As described above, the prediction loss reflects the prediction accuracy of the first natural language processing model: the smaller the prediction loss, the higher the prediction accuracy. Therefore, during backward propagation of the training, the remaining non-sparsified parameters of the first natural language processing model can usually be updated with reducing the prediction loss as the training objective. In Figure 4 of the present application, this process is labeled back propagation 1.
During backward propagation of the training, in addition to updating the remaining non-sparsified parameters of the first natural language processing model, the present application also updates each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss.
The sparsity loss reflects the degree to which the first natural language processing model has been sparsified: the smaller the sparsity loss, the higher the degree of sparsification, i.e., the more rows and columns do not need to be retained.
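As a concrete illustration, the two losses of the forward pass might be computed as in the following PyTorch-style sketch. The application does not fix a formula for the sparsity loss, so the Softplus-based measure below, which shrinks as more sparsification parameters turn negative (i.e., as more rows and columns are pruned), is an assumption introduced only for illustration.

```python
import torch
import torch.nn.functional as F

def forward_losses(logits, labels, S, Q):
    ce_loss = F.cross_entropy(logits, labels)  # prediction loss (loss1)
    # Assumed sparsity measure: Softplus is close to 0 for strongly
    # negative (pruned) parameters and grows for retained (positive)
    # ones, so minimizing it encourages more rows/columns to be pruned.
    sparsity_loss = F.softplus(S).mean() + F.softplus(Q).mean()  # loss2
    return ce_loss, sparsity_loss
```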
In some embodiments of the present application, updating each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss, as described in step S303, includes:
during backward propagation of the training, updating each row sparsification parameter group and each column sparsification parameter group, with reducing the total loss as the training objective.
In this implementation, the total loss is composed of the prediction loss and the sparsity loss, and as described above, the lower the prediction loss, the higher the accuracy, and the lower the sparsity loss, the higher the degree of sparsification. The prediction loss and the sparsity loss influence each other: for example, updating the row and column sparsification parameter groups to reduce the sparsity loss may increase the prediction loss. Therefore, in this implementation, each row sparsification parameter group and each column sparsification parameter group are updated with reducing the total loss as the training objective, which amounts to trading off the prediction loss against the sparsity loss with the aim of reducing the total loss.
In Figure 4 of the present application, the process of updating each row sparsification parameter group and each column sparsification parameter group is labeled back propagation 2.
It should further be noted that various specific algorithms can be used to update each row sparsification parameter group and each column sparsification parameter group. For example, in some embodiments of the present application, updating each row sparsification parameter group and each column sparsification parameter group, as described in step S303, may specifically include:
updating the row sparsification parameter group of any one model parameter matrix according to

$$S_{k+1} = S_k - lr \cdot \frac{\partial\, Loss}{\partial\, S_k}, \qquad \frac{\partial\, Loss}{\partial\, x_i} = \sum_{j=1}^{b} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(x_i);$$

and updating the column sparsification parameter group of any one model parameter matrix according to

$$Q_{k+1} = Q_k - lr \cdot \frac{\partial\, Loss}{\partial\, Q_k}, \qquad \frac{\partial\, Loss}{\partial\, y_j} = \sum_{i=1}^{a} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(y_j),$$

where Softplus' denotes the derivative of the Softplus function, i.e., the gradient through the hard model parameter mask is computed by a straight-through estimator whose backward pass replaces the step function with the Softplus function;
wherein S_k denotes the row sparsification parameter group at the current step, S_{k+1} denotes the row sparsification parameter group at the next step, lr denotes the learning rate for the differential computation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter group at the current step, Q_{k+1} denotes the column sparsification parameter group at the next step, M_k denotes the model parameter mask of the model parameter matrix, M_{ij} denotes the value in the i-th row and j-th column of M_k, x_i denotes the value of the i-th parameter in S_k, y_j denotes the value of the j-th parameter in Q_k, i and j are both positive integers with 1≤i≤a and 1≤j≤b, and a and b are the numbers of rows and columns of the model parameter matrix, respectively.
In Figure 4, the STE forward function denotes determining the model parameter mask of a model parameter matrix from its row sparsification parameter group and column sparsification parameter group, while the STE reverse function denotes the update process of the row sparsification parameter group and the column sparsification parameter group.
For example, in some embodiments, for a 3x3 model parameter matrix W, after step S303 is completed, i.e., after the training of the first natural language processing model is finished and the second natural language processing model is obtained, suppose the row sparsification parameter group of this model parameter matrix is [-5.5, 3.0, 1.2] and the column sparsification parameter group is [3.3, -2.2, 1.0]. Since -5.5 and -2.2 are negative, the 1st row and the 2nd column of the model parameter matrix W are sparsified, i.e., the 1st row and the 2nd column of the parameter matrix W do not need to be retained. In this example, the model parameter mask of the model parameter matrix can be expressed as

$$M = \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix},$$

where a 0 in the model parameter mask indicates that the parameter at that position is set to 0, and a 1 indicates that the parameter at that position may be non-zero.
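The STE forward/reverse pair of Figure 4 can be sketched as a custom autograd function: the hard 0/1 mask is used in the forward pass, while the backward pass substitutes a smooth Softplus-based surrogate (the derivative of Softplus is the sigmoid). The exact backward form is given by the update formulas above; the following is only an illustrative approximation under that assumption.

```python
import torch

class STEMask(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        # Hard mask: M_ij = 1 when row i and column j are both kept.
        ctx.save_for_backward(x, y)
        return torch.outer((x > 0).float(), (y > 0).float())

    @staticmethod
    def backward(ctx, grad_M):
        x, y = ctx.saved_tensors
        # Smooth surrogate: d Softplus(t)/dt = sigmoid(t) replaces the
        # zero-almost-everywhere derivative of the step function.
        gx = (grad_M * (y > 0).float()).sum(dim=1) * torch.sigmoid(x)
        gy = (grad_M * (x > 0).float().unsqueeze(1)).sum(dim=0) * torch.sigmoid(y)
        return gx, gy

# Usage: mask the parameter matrix, then let total-loss gradients flow
# back into the row/column sparsification parameter groups S and Q.
S = torch.tensor([-5.5, 3.0, 1.2], requires_grad=True)
Q = torch.tensor([3.3, -2.2, 1.0], requires_grad=True)
W = torch.randn(3, 3)
W_eff = W * STEMask.apply(S, Q)
```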
Step S304: when the total loss determined based on the prediction loss and the sparsity loss converges, obtain a trained second natural language processing model.
When the total loss determined based on the prediction loss and the sparsity loss converges, the training is finished and the trained second natural language processing model is obtained.
For example, in Figure 5, for a model parameter matrix originally of size 6x6, after the training of step S303 the 2nd row, the 6th row, the 2nd column and the 5th column need to be sparsified, so in the second natural language processing model this model parameter matrix becomes of size 4x4.
Step S305: perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model.
Since the present application sparsifies entire rows and entire columns during model optimization, the sparsified rows and columns require no computation on the hardware during deployment, which effectively reduces hardware resource usage, i.e., it facilitates co-optimization of software and hardware.
After the hardware deployment is completed, the text to be processed is input into the second natural language processing model, and the natural language processing result for the text to be processed output by the second natural language processing model is obtained.
In some embodiments of the present application, performing hardware deployment based on the second natural language processing model, as described in step S305, may specifically include:
determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter group and column sparsification parameter group to obtain the sparsified model parameter matrix; and
performing hardware deployment based on each sparsified model parameter matrix.
Specifically, after training is finished and the second natural language processing model is obtained, the final row sparsification parameter group and column sparsification parameter group of each model parameter matrix can be determined, and the non-zero parameter rows and non-zero parameter columns of each model parameter matrix can then be extracted to obtain the sparsified model parameter matrix. For example, in Figure 5, for a model parameter matrix originally of size 6x6, a 4x4 model parameter matrix is obtained after its non-zero parameter rows and non-zero parameter columns are extracted. Finally, hardware deployment can be performed based on each sparsified model parameter matrix.
For example, in a specific scenario, the 6x6 model parameter matrix of Figure 5, together with the numbers of its non-zero parameter rows and non-zero parameter columns, can be provided as input, so that the non-zero parameter rows and non-zero parameter columns of the model parameter matrix are extracted to obtain the sparsified model parameter matrix, and hardware deployment is then performed based on each sparsified model parameter matrix.
For example, the TensorRT interface calls can be modified so that, for the 2nd row of the input 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position; likewise, for the 6th row of the 6x6 model parameter matrix of Figure 5, column compression is performed at the corresponding position; and for the 2nd and 5th columns of the 6x6 model parameter matrix of Figure 5, row compression is performed at the corresponding positions. This finally extracts the non-zero parameter rows and non-zero parameter columns of the model parameter matrix, i.e., the sparsified model parameter matrix is obtained.
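Independent of any particular inference library, the extraction step itself reduces to indexing with the retained row and column numbers. A minimal sketch follows; the function name and the sign convention are assumptions introduced for illustration only.

```python
import numpy as np

def extract_nonzero(W, S, Q):
    # Keep only rows/columns whose sparsification parameter is positive,
    # e.g. turning the 6x6 matrix of Figure 5 into a 4x4 one.
    keep_rows = np.flatnonzero(np.asarray(S) > 0)
    keep_cols = np.flatnonzero(np.asarray(Q) > 0)
    return W[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols
```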
Further, in some embodiments of the present application, performing hardware deployment based on each sparsified model parameter matrix may specifically include:
performing hardware deployment based on each sparsified model parameter matrix, and during deployment, padding the computation result of the corresponding model parameter matrix with zeros on the principle of maintaining dimensional invariance.
For example, a model parameter matrix of a certain layer is of size 6x6 before sparsification and is used to multiply input data of size 106x6, producing an output of size 106x6. If this model parameter matrix becomes of size 4x4 after sparsification, the last two columns of the 106x6 input data need not be used; that is, input data of size 106x4 is multiplied by the 4x4 model parameter matrix, producing an output of size 106x4. After the 106x4 output is obtained, this implementation, following the principle of maintaining dimensional invariance, pads the computation result of the corresponding model parameter matrix with zeros; in this example, 2 columns of zeros are added to the 106x4 output to restore an output of size 106x6.
It should further be emphasized that the zero padding in this implementation pads the computation result of the corresponding model parameter matrix with zeros, rather than padding the corresponding sparsified rows and columns with zeros. That is, the hardware performs the multiplication of a 4x4 model parameter matrix rather than of a 6x6 model parameter matrix. If the multiplication of a 6x6 model parameter matrix were performed, the software optimization would not be coordinated with the hardware, because the amount of computation on the hardware would remain unchanged. In some traditional schemes, software optimization zeros out some parameters of the model parameter matrix, and these zeroed parameters are randomly distributed, so on the hardware these zeros still have to take part in the matrix multiplication, and no co-optimization on the hardware is achieved.
In the solution of the present application, in the above example, after the 106x4 output is obtained, 2 columns of zeros are added to restore the output of size 106x6, which preserves dimensional invariance, i.e., it guarantees that the output size of the second natural language processing model is consistent with that of the original model, while the previously determined sparsified rows and columns do not take part in the matrix multiplication.
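The dimension-preserving computation described above can be sketched as follows; this is a simplified host-side illustration, and an actual deployment would perform the equivalent inside the accelerator kernel.

```python
import numpy as np

def matmul_with_zero_padding(X, W_small, keep_rows, keep_cols, out_dim):
    # Multiply only the retained slice, e.g. (106, 4) = (106, 4) @ (4, 4)
    # instead of the original (106, 6) @ (6, 6).
    Y_small = X[:, keep_rows] @ W_small
    # Re-insert zero columns at the pruned positions so the output keeps
    # its original size, e.g. (106, 4) padded back to (106, 6).
    Y = np.zeros((X.shape[0], out_dim), dtype=X.dtype)
    Y[:, keep_cols] = Y_small
    return Y
```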
The total loss of the present application is composed of the prediction loss and the sparsity loss. In some embodiments of the present application, the total loss may specifically be the sum of the prediction loss and the sparsity loss, which is also a relatively simple setting in practice.
Further, in some embodiments of the present application, total loss = k1*loss1 + k2*loss2, where k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
In this implementation, a weight can be set for each of the prediction loss and the sparsity loss, so that the two contribute to the total loss to different degrees. It can be understood that the larger the value of k1, the more the total loss is affected by the prediction loss; such an implementation is more conducive to guaranteeing high prediction accuracy and is usually applicable where high accuracy is required. The larger the value of k2, the more the total loss is affected by the sparsity loss; such an implementation is more conducive to simplifying the model and is usually applicable where high computing speed is required.
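In code, this weighting is a one-line combination; the coefficient values below are purely illustrative defaults, not values fixed by the present application.

```python
def total_loss(loss1, loss2, k1=1.0, k2=0.1):
    # Larger k1 weights prediction accuracy more heavily; larger k2
    # pushes toward a sparser (smaller, faster) model.
    return k1 * loss1 + k2 * loss2
```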
In addition, in other implementations the total loss can take other forms, which can be set according to the actual situation without affecting the implementation of the present application; it can be understood, however, that the total loss usually needs to be positively correlated with both the prediction loss and the sparsity loss.
In some embodiments of the present application, the method further includes:
after the trained second natural language processing model is obtained, outputting prompt information when it is determined that the prediction loss is higher than a first threshold or the sparsity loss is higher than a second threshold.
As described above, the present application considers training completed when the total loss determined based on the prediction loss and the sparsity loss converges. In most cases, when the total loss converges, both the prediction loss and the sparsity loss have usually reached a low level; in a small number of cases, however, the prediction loss or the sparsity loss may still be high.
In this implementation, prompt information is output when the prediction loss is higher than the first threshold or the sparsity loss is higher than the second threshold, so that the staff can notice the situation in time and take corresponding measures. For example, in one scenario, when the prediction loss is higher than the first threshold, the value of k1 in the foregoing implementation can be appropriately increased so that the total loss takes the prediction loss more into account; correspondingly, when the sparsity loss is higher than the second threshold, the value of k2 in the foregoing implementation can be appropriately increased so that the total loss takes the sparsity loss more into account.
That is, in some embodiments of the present application, the method may further include: receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
In some embodiments of the present application, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model, as described in step S305, may specifically include:
inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result for the text to be processed output by the second natural language processing model.
The second natural language processing model of the present application can process the text to be processed; in practical applications, the specific processing purpose is usually semantic recognition of the text to be processed. Of course, in other implementations, the second natural language processing model of the present application can also perform other kinds of processing on the text to be processed, such as grammatical error analysis, knowledge extraction, text translation, and so on.
By applying the technical solution provided by the embodiments of the present application, an initial natural language processing model is established and trained, and after the trained first natural language processing model is obtained, sparsification is used to achieve software-level optimization. Specifically, the present application does not directly delete some layers of the first natural language processing model; instead, for any one model parameter matrix of the first natural language processing model, it sets a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained. That is, for any one model parameter matrix of the first natural language processing model, which rows and columns of that matrix are retained and which are excluded is decided by the corresponding row and column sparsification parameter groups. The first natural language processing model can then be trained according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix. During forward propagation of the training, a prediction loss and a sparsity loss are determined. Since, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, the solution of the present application can effectively guarantee the accuracy of the second natural language processing model obtained after training, i.e., the present application achieves optimization without loss of accuracy. During backward propagation of the training, the present application also updates each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss. When the total loss determined based on the prediction loss and the sparsity loss converges, the prediction loss and the sparsity loss have reached a suitable level; and when the sparsity loss has reached a suitable degree of optimization, some rows and columns of each model parameter matrix of the second natural language processing model have been filtered out.
Afterwards, hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model; that is, the second natural language processing model can effectively perform natural language processing of the text to be processed. It can further be understood that, since the present application sets, during optimization, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained, the filtered rows and columns need not be deployed on hardware after optimization; that is, hardware deployment only needs to consider the rows and columns remaining after sparsification. The hardware requirements of the present application are therefore lower and the amount of computation is small, i.e., co-optimization of software and hardware is achieved.
In summary, the solution of the present application can effectively implement natural language processing and can effectively perform co-optimization at the software and hardware levels to improve the natural language processing efficiency of the model; moreover, no accuracy is lost after the optimization of the present application is completed.
Corresponding to the above method embodiments, an embodiment of the present application further provides a natural language processing system, which may be referred to in correspondence with the above.
Referring to Figure 6, which is a schematic structural diagram of a natural language processing system in the present application, the system includes:
a first natural language processing model determination module 601, configured to establish an initial natural language processing model and train it to obtain a trained first natural language processing model;
a sparsification setting module 602, configured to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for deciding whether rows of the model parameter matrix are retained, and a column sparsification parameter group for deciding whether columns of the model parameter matrix are retained;
a pruning module 603, configured to train the first natural language processing model according to the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix, wherein a prediction loss and a sparsity loss are determined during forward propagation of the training; during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated by means of the prediction loss, and each row sparsification parameter group and each column sparsification parameter group are updated by means of the prediction loss and the sparsity loss;
a second natural language processing model determination module 604, configured to obtain a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges; and
an execution module 605, configured to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain the natural language processing result for the text to be processed output by the second natural language processing model.
In some embodiments of the present application, the pruning module 603 updating, during backward propagation of the training, the remaining parameters of the first natural language processing model that are not currently sparsified by means of the prediction loss includes:
during backward propagation of the training, updating the remaining parameters of the first natural language processing model that are not currently sparsified, with reducing the prediction loss as the training objective.
In some embodiments of the present application, the pruning module 603 updating each row sparsification parameter group and each column sparsification parameter group by means of the prediction loss and the sparsity loss includes:
during backward propagation of the training, updating each row sparsification parameter group and each column sparsification parameter group, with reducing the total loss as the training objective.
In some embodiments of the present application, the pruning module 603 updating each row sparsification parameter group and each column sparsification parameter group includes:
updating the row sparsification parameter group of any one model parameter matrix according to

$$S_{k+1} = S_k - lr \cdot \frac{\partial\, Loss}{\partial\, S_k}, \qquad \frac{\partial\, Loss}{\partial\, x_i} = \sum_{j=1}^{b} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(x_i);$$

and updating the column sparsification parameter group of any one model parameter matrix according to

$$Q_{k+1} = Q_k - lr \cdot \frac{\partial\, Loss}{\partial\, Q_k}, \qquad \frac{\partial\, Loss}{\partial\, y_j} = \sum_{i=1}^{a} \frac{\partial\, Loss}{\partial\, M_{ij}} \cdot \mathrm{Softplus}'(y_j),$$

where Softplus' denotes the derivative of the Softplus function, i.e., the gradient through the hard model parameter mask is computed by a straight-through estimator whose backward pass replaces the step function with the Softplus function;
wherein S_k denotes the row sparsification parameter group at the current step, S_{k+1} denotes the row sparsification parameter group at the next step, lr denotes the learning rate for the differential computation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter group at the current step, Q_{k+1} denotes the column sparsification parameter group at the next step, M_k denotes the model parameter mask of the model parameter matrix, M_{ij} denotes the value in the i-th row and j-th column of M_k, x_i denotes the value of the i-th parameter in S_k, y_j denotes the value of the j-th parameter in Q_k, i and j are both positive integers with 1≤i≤a and 1≤j≤b, and a and b are the numbers of rows and columns of the model parameter matrix, respectively.
In some embodiments of the present application, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any model parameter matrix is initially retained.
In some embodiments of the present application, the total loss is the sum of the prediction loss and the sparsity loss.
In some embodiments of the present application, total loss = k1*loss1 + k2*loss2, where k1 and k2 are preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
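Expressed as code, the weighted combination is a one-liner; the default coefficient values below are placeholders, not values from the patent.

def total_loss(loss1: float, loss2: float, k1: float = 1.0, k2: float = 0.1) -> float:
    """total = k1 * prediction loss + k2 * sparsity loss."""
    return k1 * loss1 + k2 * loss2

In effect, raising k2 biases training toward sparser matrices at some cost in prediction accuracy, which is one reason the coefficient adjustment module described below exposes k1 and k2.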
In some embodiments of the present application, the system further includes:
an information prompt module, used to output prompt information after the trained second natural language processing model has been obtained, when the prediction loss is determined to be higher than a first threshold or the sparsity loss is higher than a second threshold.
In some embodiments of the present application, the system further includes:
a coefficient adjustment module, used to receive a coefficient adjustment instruction and to adjust the value of k1 and/or k2 according to that instruction.
In some embodiments of the present application, the execution module 605 inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result output by the model for that text includes:
inputting the text to be processed into the second natural language processing model to obtain the semantic recognition result output by the second natural language processing model for the text to be processed.
In some embodiments of the present application, the execution module 605 performing hardware deployment based on the second natural language processing model includes:
determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the matrix according to its row sparsification parameter group and column sparsification parameter group, to obtain a sparsified model parameter matrix;
performing hardware deployment based on each sparsified model parameter matrix.
In some embodiments of the present application, the execution module 605 performing hardware deployment based on each sparsified model parameter matrix includes:
performing hardware deployment based on each sparsified model parameter matrix and, during deployment, padding the calculation results of the corresponding model parameter matrix with zeros, following the principle of keeping dimensions unchanged.
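The following sketch illustrates this compact-then-pad scheme for a single matrix-vector product. The function names and index handling are illustrative; a real deployment target (for example an FPGA kernel) would consume the compacted matrix directly.

import numpy as np

def compact(W: np.ndarray, keep_rows: np.ndarray, keep_cols: np.ndarray) -> np.ndarray:
    """Extract only the retained (non-zero) rows and columns of W."""
    return W[np.ix_(keep_rows, keep_cols)]

def matvec_with_padding(W_c: np.ndarray, x: np.ndarray,
                        keep_rows: np.ndarray, keep_cols: np.ndarray,
                        out_dim: int) -> np.ndarray:
    """Multiply with the compacted matrix, then scatter the result back
    with zeros so downstream layers see unchanged dimensions."""
    y = np.zeros(out_dim, dtype=x.dtype)
    y[keep_rows] = W_c @ x[keep_cols]  # pruned output rows stay exactly zero
    return y

# Tiny usage example with assumed kept indices:
W = np.arange(12, dtype=np.float64).reshape(3, 4)
keep_rows, keep_cols = np.array([0, 2]), np.array([1, 3])
W_c = compact(W, keep_rows, keep_cols)          # 2x2 compacted matrix
y = matvec_with_padding(W_c, np.ones(4), keep_rows, keep_cols, out_dim=3)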
Corresponding to the above method and system embodiments, embodiments of the present application further provide a natural language processing device and a non-volatile readable storage medium, which may be cross-referenced with the above. The non-volatile readable storage medium stores a computer program which, when executed by a processor, implements the steps of the natural language processing method in any of the above embodiments.
Referring to FIG. 7, the natural language processing device may include:
a memory 701, used to store a computer program;
a processor 702, used to execute the computer program to implement the steps of the natural language processing method in any of the above embodiments.
It should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes it.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
Specific examples have been used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is intended only to help understand the technical solution and its core ideas. It should be noted that those of ordinary skill in the art can make improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of protection of the present application.

Claims (20)

  1. A natural language processing method, characterized by comprising:
    establishing an initial natural language processing model and training it, to obtain a trained first natural language processing model;
    for any one model parameter matrix of the first natural language processing model, setting a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;
    training the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of the model parameter matrices, determining a prediction loss and a sparsity loss during the forward propagation of the training, and, during the back propagation of the training, updating the remaining parameters of the first natural language processing model that have not yet been sparsified by means of the prediction loss, and updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by means of the prediction loss and the sparsity loss;
    when the total loss determined based on the prediction loss and the sparsity loss converges, obtaining a trained second natural language processing model;
    performing hardware deployment based on the second natural language processing model and, after deployment is complete, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result output by the second natural language processing model for the text to be processed.
  2. The natural language processing method according to claim 1, characterized in that updating, during the back propagation of the training and by means of the prediction loss, the remaining parameters of the first natural language processing model that have not yet been sparsified comprises:
    during the back propagation of the training, updating the remaining parameters of the first natural language processing model that have not yet been sparsified, with reducing the prediction loss as the training objective.
  3. The natural language processing method according to claim 1, characterized in that updating each of the row sparsification parameter groups and each of the column sparsification parameter groups by means of the prediction loss and the sparsity loss comprises:
    during the back propagation of the training, updating each of the row sparsification parameter groups and each of the column sparsification parameter groups, with reducing the total loss as the training objective.
  4. The natural language processing method according to claim 1, characterized in that updating each of the row sparsification parameter groups and each of the column sparsification parameter groups comprises:
    for the row sparsification parameter group of any one model parameter matrix, updating the row sparsification parameter group according to
    Sk+1 = Sk − lr · ∂Loss/∂Sk;
    for the column sparsification parameter group of any one model parameter matrix, updating the column sparsification parameter group according to
    Qk+1 = Qk − lr · ∂Loss/∂Qk;
    where Sk denotes the row sparsification parameter group at the current step, Sk+1 denotes the row sparsification parameter group at the next step, lr denotes the learning rate used in the differential computation for the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Qk denotes the column sparsification parameter group at the current step, Qk+1 denotes the column sparsification parameter group at the next step, and Mk denotes the model parameter mask of the model parameter matrix, with Mij = Softplus(xi) · Softplus(yj);
    Mij denotes the value in row i, column j of Mk, xi denotes the value of the i-th parameter in Sk, yj denotes the value of the j-th parameter in Qk, i and j are positive integers with 1 ≤ i ≤ a and 1 ≤ j ≤ b, and a and b are respectively the numbers of rows and columns of the model parameter matrix.
  5. The natural language processing method according to claim 1, characterized in that, after being set and before being updated, each parameter in any row sparsification parameter group and any column sparsification parameter group takes a default value, so that every row and every column of any of the model parameter matrices is retained.
  6. The natural language processing method according to claim 1, characterized in that the total loss is the sum of the prediction loss and the sparsity loss.
  7. The natural language processing method according to claim 1, characterized in that the total loss = k1*loss1 + k2*loss2, where k1 and k2 are preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
  8. The natural language processing method according to claim 7, characterized by further comprising:
    after the trained second natural language processing model is obtained, outputting prompt information when the prediction loss is determined to be higher than a first threshold or the sparsity loss is higher than a second threshold.
  9. The natural language processing method according to claim 8, characterized by further comprising:
    receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
  10. The natural language processing method according to claim 1, characterized in that inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result output by the second natural language processing model for the text to be processed comprises:
    inputting the text to be processed into the second natural language processing model to obtain the semantic recognition result output by the second natural language processing model for the text to be processed.
  11. The natural language processing method according to any one of claims 1 to 10, characterized in that performing hardware deployment based on the second natural language processing model comprises:
    determining, based on the second natural language processing model, the row sparsification parameter group and the column sparsification parameter group of each model parameter matrix;
    for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter group and column sparsification parameter group, to obtain a sparsified model parameter matrix;
    performing hardware deployment based on each sparsified model parameter matrix.
  12. The natural language processing method according to claim 11, characterized in that performing hardware deployment based on each sparsified model parameter matrix comprises:
    performing hardware deployment based on each sparsified model parameter matrix and, during deployment, padding the calculation results of the corresponding model parameter matrix with zeros, following the principle of keeping dimensions unchanged.
  13. The natural language processing method according to claim 1, characterized in that establishing an initial natural language processing model and training it to obtain a trained first natural language processing model comprises:
    obtaining an initial natural language processing model with a deep network structure, and corresponding text data;
    training the initial natural language processing model with the text data, to obtain a trained first natural language processing model.
  14. The natural language processing method according to claim 1, characterized in that setting the row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and the column sparsification parameter group for determining whether columns in the model parameter matrix are retained, comprises:
    setting a row sparsification parameter group, represented by a first vector, for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group, represented by a second vector, for determining whether columns in the model parameter matrix are retained;
    wherein each value in the first vector is used to determine whether the corresponding row in the model parameter matrix is retained, and each value in the second vector is used to determine whether the corresponding column in the model parameter matrix is retained.
  15. The natural language processing method according to claim 1, characterized in that determining the prediction loss during the forward propagation of the training comprises:
    obtaining the model output of the first natural language processing model;
    determining the prediction loss in the forward propagation of the training according to the model output and a loss function.
  16. The natural language processing method according to claim 1, characterized in that determining the sparsity loss during the forward propagation of the training comprises:
    obtaining, in the first natural language processing model, a first back-propagation result corresponding to the model parameter row masks and a second back-propagation result corresponding to the model parameter column masks;
    determining the sparsity loss in the forward propagation of the training according to the first back-propagation result, the second back-propagation result, and a loss function.
  17. The natural language processing method according to claim 1, characterized in that inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result output by the second natural language processing model for the text to be processed comprises:
    inputting the text to be processed into the second natural language processing model for one of semantic recognition, grammatical error analysis, knowledge extraction, and text translation, to obtain the natural language processing result output by the second natural language processing model for the text to be processed;
    wherein the natural language processing result is a result of one of the semantic recognition, the grammatical error analysis, the knowledge extraction, and the text translation.
  18. A natural language processing system, characterized by comprising:
    a first natural language processing model determination module, used to establish an initial natural language processing model and train it, to obtain a trained first natural language processing model;
    a sparsification setting module, used to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter group for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter group for determining whether columns in the model parameter matrix are retained;
    a pruning module, used to train the first natural language processing model according to the row sparsification parameter groups and the column sparsification parameter groups of the model parameter matrices, to determine a prediction loss and a sparsity loss during the forward propagation of the training, and, during the back propagation of the training, to update the remaining parameters of the first natural language processing model that have not yet been sparsified by means of the prediction loss, and to update each of the row sparsification parameter groups and each of the column sparsification parameter groups by means of the prediction loss and the sparsity loss;
    a second natural language processing model determination module, used to obtain a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges;
    an execution module, used to perform hardware deployment based on the second natural language processing model and, after deployment is complete, to input the text to be processed into the second natural language processing model, obtaining the natural language processing result output by the second natural language processing model for the text to be processed.
  19. A natural language processing device, characterized by comprising:
    a memory, used to store a computer program;
    a processor, used to execute the computer program to implement the steps of the natural language processing method according to any one of claims 1 to 17.
  20. A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and when the computer program is executed by a processor, the steps of the natural language processing method according to any one of claims 1 to 17 are implemented.
PCT/CN2023/098938 2022-10-11 2023-06-07 Natural language processing method, system and device, and storage medium WO2024077981A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211237680.XA CN115329744B (en) 2022-10-11 2022-10-11 Natural language processing method, system, equipment and storage medium
CN202211237680.X 2022-10-11

Publications (1)

Publication Number Publication Date
WO2024077981A1

Family

ID=83914501

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098938 WO2024077981A1 (en) 2022-10-11 2023-06-07 Natural language processing method, system and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115329744B (en)
WO (1) WO2024077981A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329744B (en) * 2022-10-11 2023-04-07 浪潮电子信息产业股份有限公司 Natural language processing method, system, equipment and storage medium
CN117668563B (en) * 2024-01-31 2024-04-30 苏州元脑智能科技有限公司 Text recognition method, text recognition device, electronic equipment and readable storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021673A1 (en) * 2020-07-31 2022-02-03 中国原子能科学研究院 Method and system for predicting sparse matrix vector multiplication operation time
CN114490922A (en) * 2020-10-27 2022-05-13 华为技术有限公司 Natural language understanding model training method and device
CN114723047A (en) * 2022-04-15 2022-07-08 支付宝(杭州)信息技术有限公司 Task model training method, device and system
CN115329744A (en) * 2022-10-11 2022-11-11 浪潮电子信息产业股份有限公司 Natural language processing method, system, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LA Vu Tuan; DAO Van Phuong; ZUO Jiakuo; ZHAO Li, "Adaptive Compressed Sensing Method for Speech", Journal of Southeast University (Natural Science Edition), vol. 42, no. 6, 30 November 2012, pp. 1027-1030, ISSN: 1001-0505, DOI: 10.3969/j.issn.1001-0505.2012.06.001 *
LEI Chenyi; LIU Dong; LI Weiping; ZHA Zheng-Jun; LI Houqiang, "Comparative Deep Learning of Hybrid Representations for Image Recommendations", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 27 June 2016, pp. 2545-2553, DOI: 10.1109/CVPR.2016.279 *
LI Xiaowei; SHU Hui; GUANG Yan; ZHAI Yi; YANG Zi-Ji, "Survey of the Application of Natural Language Processing for Resume Analysis", Computer Science, vol. 49, no. 6A, 30 June 2022, pp. 66-73, ISSN: 1002-137X, DOI: 10.11896/jsjkx.210600134 *

Also Published As

Publication number Publication date
CN115329744B (en) 2023-04-07
CN115329744A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
WO2024077981A1 (en) Natural language processing method, system and device, and storage medium
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN108280514B (en) FPGA-based sparse neural network acceleration system and design method
WO2020140386A1 (en) Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium
CN113905391B (en) Integrated learning network traffic prediction method, system, equipment, terminal and medium
CN112784964A (en) Image classification method based on bridging knowledge distillation convolution neural network
US20220083868A1 (en) Neural network training method and apparatus, and electronic device
CN108764317A (en) A kind of residual error convolutional neural networks image classification method based on multichannel characteristic weighing
CN112215353B (en) Channel pruning method based on variational structure optimization network
CN107395211B (en) Data processing method and device based on convolutional neural network model
CN110909874A (en) Convolution operation optimization method and device of neural network model
CN109583586B (en) Convolution kernel processing method and device in voice recognition or image recognition
CN110751265A (en) Lightweight neural network construction method and system and electronic equipment
CN113157919B (en) Sentence text aspect-level emotion classification method and sentence text aspect-level emotion classification system
CN107784360A (en) Step-by-step movement convolutional neural networks beta pruning compression method
CN107644252A (en) A kind of recurrent neural networks model compression method of more mechanism mixing
CN116644804B (en) Distributed training system, neural network model training method, device and medium
CN111126595A (en) Method and equipment for model compression of neural network
CN113111889A (en) Target detection network processing method for edge computing terminal
CN111353534A (en) Graph data category prediction method based on adaptive fractional order gradient
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
Fuketa et al. Image-classifier deep convolutional neural network training by 9-bit dedicated hardware to realize validation accuracy and energy efficiency superior to the half precision floating point format
CN117521763A (en) Artificial intelligent model compression method integrating regularized pruning and importance pruning
CN116431816B (en) Document classification method, apparatus, device and computer readable storage medium
WO2023246177A1 (en) Image processing method, and electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876192

Country of ref document: EP

Kind code of ref document: A1