CN115329744B - Natural language processing method, system, equipment and storage medium - Google Patents

Natural language processing method, system, equipment and storage medium

Info

Publication number
CN115329744B
Authority
CN
China
Prior art keywords
natural language
language processing
model
parameter
loss
Prior art date
Legal status
Active
Application number
CN202211237680.XA
Other languages
Chinese (zh)
Other versions
CN115329744A (en)
Inventor
李兵兵
阚宏伟
王彦伟
Current Assignee
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202211237680.XA
Publication of CN115329744A
Application granted
Publication of CN115329744B
Priority to PCT/CN2023/098938 (WO2024077981A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a natural language processing method, system, device and storage medium, applied to the technical field of machine learning, comprising the following steps: obtaining a trained first natural language processing model; setting row and column sparsification parameter sets that determine whether the rows and columns in each model parameter matrix of the first natural language processing model are retained, and performing training in which the remaining parameters that are not currently sparsified are updated through the prediction loss, while each row and column sparsification parameter set is updated through the prediction loss and the sparsity loss; when the total loss converges, obtaining a trained second natural language processing model; and performing hardware deployment based on the second natural language processing model, and after the deployment is completed, inputting the text to be processed into the second natural language processing model to obtain a natural language processing result. By applying the scheme of the application, natural language processing can be effectively realized, software-hardware co-optimization can be carried out, and no precision is lost.

Description

Natural language processing method, system, equipment and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method, a system, a device, and a storage medium for natural language processing.
Background
The design and inference of a natural language processing model depend on the training of the software model and on adaptive deployment to the actual hardware. A natural language processing model needs to perform matrix multiplication calculations; for example, a large attention-based natural language deep learning model contains a large amount of matrix multiplication, and at the same time the parameters of a deep network model are highly redundant, which provides the conditions for inference optimization based on model compression.
An attention-based natural language processing model is generally composed of a plurality of functional modules that are stacked sequentially and repeatedly; for example, fig. 1 shows a currently common attention-based natural language processing model. In such a model, most of the calculation is performed in the form of matrix multiplication; for example, the Multi-Head Attention module in fig. 1 involves multiple matrix multiplication operations.
When performing matrix multiplication, taking the matrix A and the matrix B in fig. 2 as an example to obtain the matrix C, the corresponding elements of each row of matrix A and each column of matrix B are multiplied and summed to obtain the element at the corresponding position in matrix C.
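The element-wise rule can be illustrated with a short sketch (the matrices below are illustrative examples, not the ones in fig. 2), checked against a library implementation:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # 2 x 3
B = np.array([[7., 8.],
              [9., 10.],
              [11., 12.]])            # 3 x 2

C = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        # element (i, j) of C: row i of A multiplied element-wise with
        # column j of B, then summed
        C[i, j] = np.sum(A[i, :] * B[:, j])

assert np.allclose(C, A @ B)
```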
To accelerate the operation of deep networks, current acceleration methods are generally optimized at two levels: software and hardware. At the software level, the model structure is simplified so that, at the cost of some precision loss, a small model structure replaces a large one and the amount of calculation during model inference is reduced. At the hardware level, the simplified model is deployed on hardware and accelerated, realizing efficient real-time inference.
In order to realize deployment on an existing hardware platform, the original large model needs to be compressed into a small model that conforms to the general matrix multiplication format. For example, the conventional knowledge distillation method starts from a trained large model, namely the teacher model, obtains the final output loss and the intermediate-layer output values of that model, and from them derives the distillation loss. For a small model with a specified structure, namely the student model, part of the parameters of the large model are selected as the initialization parameters of the small model (for example, 20 of the original 100 layers of the large model are selected to form the small model), and the small model is then trained using the model prediction loss and the distillation loss.
In this knowledge distillation approach, because the small model is initialized with only part of the parameters of the large model, the model parameters may be adjusted substantially during training and the model precision may drop noticeably. In addition, the structure of the small model must be specified manually, which limits the flexibility of the model structure and the effectiveness of model compression, while the model precision is still not guaranteed. Moreover, knowledge distillation is an optimization at the software level only; in actual hardware deployment, because the general matrix multiplication calculation mode is adopted, no effective co-optimization with the software layer can be formed, that is, the room for optimization on existing hardware is very limited.
In summary, how to effectively implement natural language processing, and effectively perform cooperative optimization on software and hardware layers while ensuring precision is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a natural language processing method, a natural language processing system, natural language processing equipment and a storage medium, so that the natural language processing is effectively realized, the precision is ensured, and the software and hardware level collaborative optimization is effectively carried out.
In order to solve the technical problems, the invention provides the following technical scheme:
a natural language processing method, comprising:
establishing an initial natural language processing model and training to obtain a trained first natural language processing model;
setting, for any 1 model parameter matrix of the first natural language processing model, a row thinning parameter group for determining whether a row in the model parameter matrix is reserved, and a column thinning parameter group for determining whether a column in the model parameter matrix is reserved;
training the first natural language processing model according to the row thinning parameter group and the column thinning parameter group of each model parameter matrix, determining prediction loss and sparsity loss in the forward propagation process of the training, updating the residual parameters of the first natural language processing model which are not thinned currently through the prediction loss in the backward propagation process of the training, and updating each row thinning parameter group and each column thinning parameter group through the prediction loss and the sparsity loss;
obtaining a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss is converged;
and performing hardware deployment based on the second natural language processing model, and after the deployment is completed, inputting the text to be processed into the second natural language processing model to obtain a natural language processing result output by the second natural language processing model and aiming at the text to be processed.
Preferably, in the training back propagation process, updating the residual parameters of the first natural language processing model, which are not currently sparse, through the prediction loss includes:
and in the training back propagation process, updating the residual parameters of the first natural language processing model which are not sparse currently by taking the reduction of the prediction loss as a training target.
Preferably, updating each of the row thinning parameter groups and each of the column thinning parameter groups by the prediction loss and the sparsity loss includes:
and in the back propagation process of training, updating each row sparse parameter set and each column sparse parameter set by taking the total loss reduction as a training target.
Preferably, the updating each of the row thinning parameter groups and the column thinning parameter groups includes:
for the row sparse parameter set of any 1 model parameter matrix, updating the row sparse parameter set according to
S_{k+1} = S_k − l_r × ∂Loss/∂S_k ;
for the column sparse parameter set of any 1 model parameter matrix, updating the column sparse parameter set according to
Q_{k+1} = Q_k − l_r × ∂Loss/∂Q_k ;
wherein S_k denotes the row sparse parameter set at the current moment, S_{k+1} denotes the row sparse parameter set at the next moment, l_r denotes the learning rate used for the differential update of the sparse parameters, Loss denotes the total loss, Softplus denotes the Softplus function used to approximate the gradient of the model parameter mask during back propagation, Q_k denotes the column sparse parameter set at the current moment, Q_{k+1} denotes the column sparse parameter set at the next moment, M_k denotes the model parameter mask of the model parameter matrix, and
M_ij = 1 if x_i > 0 and y_j > 0, otherwise M_ij = 0,
where M_ij denotes the value in row i and column j of M_k, x_i denotes the i-th parameter in S_k, y_j denotes the j-th parameter in Q_k, i and j are positive integers, 1 ≤ i ≤ a, 1 ≤ j ≤ b, and a and b are the number of rows and the number of columns of the model parameter matrix, respectively.
Preferably, after each parameter in any 1 row thinning parameter group and any 1 column thinning parameter group is set, the parameter takes a default value before being updated, so that every row and every column of each model parameter matrix is initially retained.
Preferably, the total loss is a sum of the prediction loss and the sparsity loss.
Preferably, the total loss = k1 × loss1+ k2 × loss2, where k1 and k2 are both preset coefficients, loss1 is the predicted loss, and loss2 is the sparsity loss.
Preferably, the method further comprises the following steps:
and after a trained second natural language processing model is obtained, when the prediction loss is judged to be higher than a first threshold value or the sparsity loss is judged to be higher than a second threshold value, outputting prompt information.
Preferably, the method further comprises the following steps:
and receiving a coefficient adjusting instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjusting instruction.
Preferably, inputting the text to be processed into the second natural language processing model to obtain the natural language processing result, output by the second natural language processing model, for the text to be processed includes:
and inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result output by the second natural language processing model and aiming at the text to be processed.
Preferably, the deploying hardware based on the second natural language processing model includes:
determining a row thinning parameter group and a column thinning parameter group of each model parameter matrix based on the second natural language processing model;
aiming at any 1 model parameter matrix, extracting non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to a row sparse parameter set and a column sparse parameter set of the model parameter matrix to obtain a sparse model parameter matrix;
and carrying out hardware deployment based on each sparse model parameter matrix.
Preferably, the deploying hardware based on each sparse model parameter matrix includes:
and (3) hardware deployment is carried out based on each sparse model parameter matrix, and 0 is supplemented in the calculation result of the corresponding model parameter matrix by using the principle of keeping dimension invariance during deployment.
A natural language processing system comprising:
the first natural language processing model determining module is used for establishing an initial natural language processing model and training the initial natural language processing model to obtain a trained first natural language processing model;
a sparseness setting module configured to set, for any 1 model parameter matrix of the first natural language processing model, a row sparseness parameter group for determining whether a row in the model parameter matrix is reserved, and a column sparseness parameter group for determining whether a column in the model parameter matrix is reserved;
a pruning module, configured to perform training of the first natural language processing model according to the row sparsification parameter set and the column sparsification parameter set of each model parameter matrix, and determine a prediction loss and a sparseness loss in a forward propagation process of the training, update, in a backward propagation process of the training, a remaining parameter of the first natural language processing model that is not currently sparsified through the prediction loss, and update, through the prediction loss and the sparseness loss, each of the row sparsification parameter sets and each of the column sparsification parameter sets;
a second natural language processing model determining module, configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges;
and the execution module is used for deploying hardware based on the second natural language processing model, inputting the text to be processed into the second natural language processing model after deployment is completed, and obtaining a natural language processing result which is output by the second natural language processing model and aims at the text to be processed.
A natural language processing device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the natural language processing method as described above.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the natural language processing method as described above.
By applying the technical scheme provided by the embodiment of the invention, an initial natural language processing model is established and trained, and after the trained first natural language processing model is obtained, software-level optimization is realized by means of sparsification. Specifically, the present application does not directly delete part of the layers in the first natural language processing model; instead, for any 1 model parameter matrix of the first natural language processing model, it sets a row thinning parameter group for determining whether the rows in the model parameter matrix are retained and a column thinning parameter group for determining whether the columns in the model parameter matrix are retained. That is, for any 1 model parameter matrix of the first natural language processing model, which rows and columns of the model parameter matrix are retained and which are excluded is determined by the corresponding row and column thinning parameter groups. Then, the first natural language processing model can be trained according to the row thinning parameter group and the column thinning parameter group of each model parameter matrix. In the forward propagation process of training, the prediction loss and the sparsity loss are determined. In the back propagation process of training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated through the prediction loss, so the scheme of the application can effectively guarantee the precision of the second natural language processing model obtained after training, that is, the optimization is realized without damaging the precision. Also in the back propagation process of training, each row thinning parameter group and each column thinning parameter group are updated through the prediction loss and the sparsity loss; when the total loss determined based on the prediction loss and the sparsity loss converges, both the prediction loss and the sparsity loss have reached a suitable level, and when the sparsity loss reaches a suitable degree of optimization, part of the rows and columns of each model parameter matrix in the second natural language processing model have been filtered out and deleted. Then, hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result, output by the second natural language processing model, for the text to be processed; that is, through the second natural language processing model, the natural language processing of the text to be processed can be carried out effectively. In addition, it can be understood that, since the optimization sets the row thinning parameter group for determining whether rows in the model parameter matrix are retained and the column thinning parameter group for determining whether columns in the model parameter matrix are retained, the filtered-out rows and columns do not need to be deployed on hardware after the optimization is completed; that is, only the rows and columns remaining after sparsification need to be considered during hardware deployment, so the hardware requirements of the application are lower, the amount of calculation is small, and the co-optimization of software and hardware is realized.
In summary, the scheme of the present application can effectively realize natural language processing and can effectively perform software-hardware co-optimization to improve the natural language processing efficiency of the model; at the same time, after the optimization is completed, no precision is lost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a conventional attention-based natural language processing model;
FIG. 2 is a schematic diagram of the calculation of matrix multiplication;
FIG. 3 is a flow chart of an embodiment of a natural language processing method according to the present invention;
FIG. 4 is a schematic diagram illustrating the training of a first natural language processing model in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating the sparse variation of a model parameter matrix in one embodiment;
FIG. 6 is a schematic diagram of a natural language processing system according to the present invention;
FIG. 7 is a schematic diagram of a natural language processing apparatus according to the present invention.
Detailed Description
The core of the invention is to provide a natural language processing method, which can effectively realize natural language processing, can effectively carry out cooperative optimization of software and hardware levels so as to improve the natural language processing efficiency of the model, and meanwhile, the precision can not be lost after the optimization is completed.
In order that those skilled in the art will better understand the disclosure, reference will now be made in detail to the embodiments of the disclosure as illustrated in the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 3, fig. 3 is a flowchart illustrating an implementation of a natural language processing method according to the present invention, where the natural language processing method includes the following steps:
step S301: and establishing an initial natural language processing model and training to obtain a trained first natural language processing model.
Specifically, the initial natural language processing model may be established first. The specific form of the initial natural language processing model may vary and may be set and adjusted according to actual needs; for example, an initial natural language processing model adopting a deep network structure may be used.
After the initial natural language processing model is established, the initial natural language processing model can be trained by using a training sample, and when the recognition precision meets the requirement, the training can be determined to be finished, and the first natural language processing model after the training is obtained at the moment.
Since natural language processing is performed, the training sample is usually text data.
Step S302: for any 1 model parameter matrix of the first natural language processing model, a row thinning parameter group for deciding whether a row in the model parameter matrix is reserved and a column thinning parameter group for deciding whether a column in the model parameter matrix is reserved are set.
After the trained first natural language processing model is obtained, the first natural language processing model includes a plurality of model parameter matrices, and it can be understood that at this time, since the thinning-out is not performed, each model parameter matrix is an original model parameter matrix which is not thinned out.
For any 1 model parameter matrix of the first natural language processing model, the present application sets a row thinning parameter group for determining whether a row in the model parameter matrix is reserved, and a column thinning parameter group for determining whether a column in the model parameter matrix is reserved.
That is, for any 1 row of any 1 model parameter matrix, whether the row needs to be reserved is determined by the row thinning parameter group of the model parameter matrix. Similarly, for any 1 column of the model parameter matrix, whether the column needs to be reserved is determined by the column thinning parameter group of the model parameter matrix.
For example, in fig. 4 the original model parameter matrix has 4 rows and 3 columns, and the row thinning parameter set and the column thinning parameter set of this matrix determine that the 2nd row and the 2nd column need to be sparsified, that is, the 2nd row and the 2nd column do not need to be retained.
In addition, since each row thinning parameter group and each column thinning parameter group are constantly adjusted in the training process in the subsequent step S303, the initial settings of each row thinning parameter group and each column thinning parameter group can be arbitrarily set. However, in practice, in order to ensure the accuracy, when performing initial setting of each row thinning parameter set and each column thinning parameter set, each row and each column of each model parameter matrix are usually reserved.
For example, in an embodiment of the present invention, after the setting of each parameter in any 1 row thinning parameter group and any 1 column thinning parameter group is completed, the parameters are default values before being updated, so as to keep each row and each column in any model parameter matrix. In this embodiment, each parameter in any 1 row thinning parameter group and any 1 column thinning parameter group is set as a default value, so that the setting process is simple and convenient.
The specific form of the row thinning parameter group and the column thinning parameter group is various, as long as the respective functions of the row thinning parameter group and the column thinning parameter group of the present application can be realized. Since the model parameter matrix usually has a plurality of rows and a plurality of columns, the row thinning parameter set may be usually configured as a vector, and each value in the vector is used to determine whether a corresponding row of the model parameter matrix is reserved. Likewise, the column thinning parameter group may be generally configured in the form of a vector, and each value in the vector is used to determine whether a corresponding column of the model parameter matrix is reserved.
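A minimal sketch of this vector representation is given below; it assumes, purely for illustration, that the sign of each parameter decides whether the corresponding row or column is retained and that the resulting mask is applied multiplicatively to the parameter matrix (these conventions are assumptions, not the patent's exact implementation):

```python
import numpy as np

def build_mask(S, Q):
    # S: row sparsification parameter set, Q: column sparsification parameter set
    keep_rows = (S > 0).astype(np.float32)   # 1 = keep the row, 0 = drop the row
    keep_cols = (Q > 0).astype(np.float32)   # 1 = keep the column, 0 = drop the column
    return np.outer(keep_rows, keep_cols)    # M_ij = keep_rows[i] * keep_cols[j]

W = np.random.randn(4, 3)   # original model parameter matrix (4 rows, 3 columns)
S = np.ones(4)              # default initialisation: all rows retained
Q = np.ones(3)              # default initialisation: all columns retained
M = build_mask(S, Q)

W_masked = W * M            # masked parameter matrix used in the forward pass
```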
Step S303: and according to the row thinning parameter group and the column thinning parameter group of each model parameter matrix, training the first natural language processing model, determining prediction loss and sparsity loss in the forward propagation process of the training, updating residual parameters of the first natural language processing model which are not thinned currently through the prediction loss in the backward propagation process of the training, and updating each row thinning parameter group and each column thinning parameter group through the prediction loss and the sparsity loss.
The row thinning parameter group and the column thinning parameter group can determine a model parameter matrix after thinning, and further train the first natural language processing model.
In the forward propagation process of training, the present application needs to determine the prediction loss and the sparsity loss; in fig. 4, the prediction loss is labeled ce_loss, i.e., loss1 in the following embodiments, and the sparsity loss is labeled sparse_loss, i.e., loss2 in the following embodiments.
The prediction loss reflects the prediction accuracy of the first natural language processing model, and the smaller the prediction loss, the higher the prediction accuracy of the first natural language processing model.
In order to guarantee the precision, the remaining parameters of the first natural language processing model that are not currently sparsified are updated through the prediction loss in the training back propagation process. For example, for the model parameter matrix in fig. 4 that originally has 4 rows and 3 columns, since the current row and column thinning parameter groups determine that the 2nd row and the 2nd column are to be sparsified, 6 parameters of the matrix remain un-sparsified, namely the parameters at row 1 column 1, row 1 column 3, row 3 column 1, row 3 column 3, row 4 column 1 and row 4 column 3. Of course, as training progresses, it is not necessarily the 2nd row and 2nd column of this matrix that are finally decided to be sparse.
In an embodiment of the present invention, the updating, during the back propagation process of the training described in step S303, the remaining parameters of the first natural language processing model that are not currently sparse by predicting the loss may include:
and in the training back propagation process, updating the residual parameters of the first natural language processing model which are not sparse currently by taking the reduction of the prediction loss as a training target.
As described above, the prediction loss reflects the prediction accuracy of the first natural language processing model, and the smaller the prediction loss, the higher the prediction accuracy of the first natural language processing model is, so that in the back propagation process of training, the residual parameters of the first natural language processing model that are not sparse currently may be updated with the prediction loss reduced as a training target. In fig. 4 of the present application, this process is labeled as back-propagation 1.
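A minimal sketch of this update, assuming plain gradient descent; grad_W stands for the gradient of the prediction loss with respect to the masked parameter matrix, and the names and the optimizer are assumptions for illustration rather than the patent's implementation:

```python
import numpy as np

def update_remaining_params(W, M, grad_W, lr=1e-3):
    # grad_W: gradient of the prediction loss w.r.t. the masked matrix W * M
    # positions already sparsified (M == 0) receive no update
    return W - lr * (grad_W * M)
```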
In the training back propagation process, the method can update the residual parameters of the first natural language processing model which are not sparse currently, and can update the row sparse parameter sets and the column sparse parameter sets through prediction loss and sparsity loss.
The sparsity loss reflects the degree of sparseness of the first natural language processing model, and the smaller the sparsity loss is, the higher the degree of sparseness of the first natural language processing model is, that is, the more rows and columns do not need to be reserved.
In one embodiment of the present invention, the updating of each row thinning parameter group and each column thinning parameter group by predicting the loss and the sparsity loss described in step S303 includes:
and in the back propagation process of training, updating each row sparse parameter set and each column sparse parameter set by taking the total loss reduction as a training target.
In this embodiment, the total loss is made up of the prediction loss and the sparsity loss; as described above, the lower the prediction loss, the higher the prediction accuracy, and the lower the sparsity loss, the higher the degree of sparsity. However, the prediction loss and the sparsity loss influence each other; for example, when each row thinning parameter group and each column thinning parameter group is updated so as to reduce the sparsity loss, the prediction loss may increase. Taking the reduction of the total loss as the training target therefore trades off the prediction loss against the sparsity loss.
In fig. 4 of the present application, a process of updating each row thinning parameter group and each column thinning parameter group is denoted as back propagation 2.
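A corresponding sketch of back propagation 2, again assuming plain gradient descent; grad_S_total and grad_Q_total stand for the gradients of the total loss with respect to the row and column sparsification parameter sets, and are assumed names for illustration:

```python
import numpy as np

def update_sparsification_params(S, Q, grad_S_total, grad_Q_total, lr=1e-3):
    # grad_S_total / grad_Q_total: gradients of the total loss
    # (prediction loss plus sparsity loss) w.r.t. S and Q
    S_next = S - lr * grad_S_total
    Q_next = Q - lr * grad_Q_total
    return S_next, Q_next
```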
It should be noted that various algorithms may be used to update each row thinning parameter group and each column thinning parameter group. For example, in an embodiment of the present invention, the updating of each row thinning parameter group and each column thinning parameter group described in step S303 may specifically include:
for the row thinning parameter group of any 1 model parameter matrix, updating the row thinning parameter group according to
S_{k+1} = S_k − l_r × ∂Loss/∂S_k ;
for the column thinning parameter group of any 1 model parameter matrix, updating the column thinning parameter group according to
Q_{k+1} = Q_k − l_r × ∂Loss/∂Q_k ;
wherein S_k denotes the row thinning parameter group at the current moment, S_{k+1} denotes the row thinning parameter group at the next moment, l_r denotes the learning rate used for the differential update of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function used to approximate the gradient of the model parameter mask during back propagation, Q_k denotes the column thinning parameter group at the current moment, Q_{k+1} denotes the column thinning parameter group at the next moment, M_k denotes the model parameter mask of the model parameter matrix, and
M_ij = 1 if x_i > 0 and y_j > 0, otherwise M_ij = 0,
where M_ij denotes the value in row i and column j of M_k, x_i denotes the i-th parameter in S_k, y_j denotes the j-th parameter in Q_k, i and j are positive integers, 1 ≤ i ≤ a, 1 ≤ j ≤ b, and a and b are the number of rows and the number of columns of the model parameter matrix, respectively.
In fig. 4, the STE forward function indicates that the model parameter mask of the model parameter matrix is determined based on the row thinning parameter group and the column thinning parameter group of that matrix. The STE backward function indicates how the row thinning parameter group and the column thinning parameter group are updated.
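One possible realization of such a straight-through estimator (STE) is sketched below for a single entry of the row thinning parameter group; the use of the Softplus derivative (the sigmoid) as the backward surrogate is an assumption made for illustration and is not necessarily the exact function used in the patent:

```python
import numpy as np

def ste_forward(x):
    # forward pass: hard 0/1 decision used to build the model parameter mask
    return 1.0 if x > 0 else 0.0

def ste_backward(x, upstream_grad):
    # backward pass: the hard step has zero gradient almost everywhere, so the
    # derivative of its Softplus relaxation log(1 + e^x), i.e. the sigmoid,
    # is used as a surrogate so that the sparsification parameter x still
    # receives a gradient signal
    sigmoid = 1.0 / (1.0 + np.exp(-x))
    return upstream_grad * sigmoid
```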
For example, in one embodiment, after step S303 is completed, i.e. the training of the first natural language processing model is completed and the second natural language processing model is obtained, consider a model parameter matrix W of size 3 × 3 whose row thinning parameter set is [-5.5, 3.0, 1.2] and whose column thinning parameter set is [3.3, -2.2, 1.0]. Since -5.5 and -2.2 are negative, the 1st row and the 2nd column of the model parameter matrix W are sparsified, i.e. the 1st row and the 2nd column do not need to be retained. In this example, the model parameter mask of the model parameter matrix may be expressed as
[0 0 0]
[1 0 1]
[1 0 1]
where a 0 in the model parameter mask indicates that the parameter is set to 0, and a 1 indicates that the parameter may be non-zero.
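The example above can be reproduced with a short sketch, assuming the sign convention that negative sparsification parameters mean the corresponding row or column is dropped:

```python
import numpy as np

S = np.array([-5.5, 3.0, 1.2])   # row thinning parameter set
Q = np.array([3.3, -2.2, 1.0])   # column thinning parameter set

M = np.outer((S > 0), (Q > 0)).astype(int)
print(M)
# [[0 0 0]
#  [1 0 1]
#  [1 0 1]]
```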
Step S304: and when the total loss determined based on the prediction loss and the sparsity loss is converged, obtaining a trained second natural language processing model.
When the total loss determined based on the prediction loss and the sparsity loss is converged, it is indicated that the training is completed, and a trained second natural language processing model can be obtained.
For example, in fig. 5, after the training of step S303 is performed on a model parameter matrix originally of size 6 × 6, if the 2nd row, the 6th row, the 2nd column and the 5th column need to be sparsified, the model parameter matrix becomes 4 × 4 in the second natural language processing model.
Step S305: and performing hardware deployment based on the second natural language processing model, and inputting the text to be processed into the second natural language processing model after the deployment is completed, so as to obtain a natural language processing result output by the second natural language processing model and aiming at the text to be processed.
The whole row and whole column are thinned when the model is optimized, so that the row and the column which are thinned do not need to be calculated on hardware when the hardware is deployed, the occupation of hardware resources is effectively reduced, and the collaborative optimization of software and hardware is facilitated.
After the deployment on the hardware is finished, the text to be processed is input into the second natural language processing model, so that a natural language processing result output by the second natural language processing model and aiming at the text to be processed can be obtained.
In a specific embodiment of the present invention, the hardware deployment based on the second natural language processing model, which is described in step S305, may specifically include:
determining a row sparse parameter set and a column sparse parameter set of each model parameter matrix based on the second natural language processing model;
aiming at any 1 model parameter matrix, extracting non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to a row sparse parameter set and a column sparse parameter set of the model parameter matrix to obtain a sparse model parameter matrix;
and carrying out hardware deployment based on each sparse model parameter matrix.
Specifically, after training is completed and a second natural language processing model is obtained, the final row sparse parameter set and column sparse parameter set of each model parameter matrix can be determined, and then non-zero parameter rows and non-zero parameter columns of the model parameter matrix can be extracted to obtain a sparse model parameter matrix. For example, in fig. 5, for 1 model parameter matrix originally having a size of 6 × 6, after extracting non-zero parameter rows and non-zero parameter columns of the model parameter matrix, a model parameter matrix having a size of 4 × 4 is obtained. And finally, hardware deployment can be carried out based on each sparse model parameter matrix.
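A minimal sketch of this extraction step for the 6 × 6 example; the row and column indices follow fig. 5, and the weight values are random placeholders rather than real model parameters:

```python
import numpy as np

W = np.random.randn(6, 6)                                        # trained 6 x 6 parameter matrix
keep_rows = np.array([True, False, True, True, True, False])     # 2nd and 6th rows sparsified
keep_cols = np.array([True, False, True, True, False, True])     # 2nd and 5th columns sparsified

W_deploy = W[keep_rows][:, keep_cols]                            # non-zero rows and columns only
assert W_deploy.shape == (4, 4)
```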
For example, for the model parameter matrix of size 6 × 6 in fig. 5, the indices of the non-zero parameter rows and non-zero parameter columns of the model parameter matrix may be provided as input, so that the non-zero parameter rows and non-zero parameter columns are extracted to obtain the sparse model parameter matrix, after which hardware deployment is carried out based on each sparse model parameter matrix.
For example, the TensorRT interface call may be modified so that, for the input model parameter matrix of size 6 × 6 in fig. 5, the column at the corresponding position is compressed for the 2nd row, the column at the corresponding position is likewise compressed for the 6th row, and the rows at the corresponding positions are compressed for the 2nd and 5th columns, finally realizing the extraction of the non-zero parameter rows and non-zero parameter columns of the model parameter matrix, that is, obtaining the sparse model parameter matrix.
Further, in a specific embodiment of the present invention, the deploying hardware based on each sparse model parameter matrix may specifically include:
and (3) hardware deployment is carried out based on each sparse model parameter matrix, and 0 is supplemented in the calculation result of the corresponding model parameter matrix by using the principle of keeping dimension invariance during deployment.
For example, a certain model parameter matrix of a certain layer is 6x6 size before thinning for multiplication with input data of 106x6 size, resulting in an output of 106x6 size. If the model parameter matrix becomes 4x4 after thinning, the last two columns of input data with the size of 106x6 do not need to be used, i.e. the input data with the size of 106x4 is multiplied by the model parameter matrix with the size of 4x4 to obtain the output with the size of 106x 4. After the output of 106x4 size is obtained, in this embodiment, on the principle of keeping the dimension invariance, 0 is complemented in the calculation result of the corresponding model parameter matrix, that is, in this example, 2 columns of 0 are added to the output of 106x4 size, and the output is recovered to the output of 106x6 size.
It should be further emphasized that in this embodiment the zero-filling is applied to the calculation result of the corresponding model parameter matrix, not to the sparsified rows and columns themselves. That is, the hardware performs the multiplication with a 4x4 model parameter matrix rather than a 6x6 one; if the 6x6 multiplication were still performed, the software-level optimization would not be carried over to the hardware, because the amount of calculation on the hardware would be unchanged. In some conventional schemes, software-level optimization zeroes out individual parameters of the model parameter matrix, and because these zeroed parameters are randomly distributed, the zeros still have to participate in the matrix multiplication on hardware, so no co-optimization with the hardware is achieved.
In the solution of the present application, after the output of size 106x4 is obtained, the example above restores it to an output of size 106x6 by appending 2 columns of zeros, which keeps the dimensionality unchanged, that is, it ensures that the output size of the second natural language processing model is consistent with that of the original model, while the previously determined sparse rows and columns do not participate in the matrix multiplication.
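A minimal sketch of this zero-padding step; the kept input and output column indices are assumptions chosen to match the 6 × 6 example above, not values prescribed by the patent:

```python
import numpy as np

X = np.random.randn(106, 6)           # original input of the layer
in_keep = np.array([0, 2, 3, 4])      # input columns matching the kept rows of W
out_keep = np.array([0, 2, 3, 5])     # output columns matching the kept columns of W
W_deploy = np.random.randn(4, 4)      # sparsified parameter matrix deployed on hardware

Y_small = X[:, in_keep] @ W_deploy    # 106 x 4: the only multiplication done on hardware
Y = np.zeros((106, 6))
Y[:, out_keep] = Y_small              # pad zeros to restore the 106 x 6 output dimension
```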
The total loss of the present application is composed of a predicted loss and a sparsity loss, and in a specific embodiment of the present invention, the total loss may specifically be a sum of the predicted loss and the sparsity loss, which is a simpler setting mode in practical application.
Further, in an embodiment of the present invention, the total loss = k1 × loss1+ k2 × loss2, where k1 and k2 are both preset coefficients, loss1 is a predicted loss, and loss2 is a sparsity loss.
In this embodiment, it is considered that a weight may be set for each of the prediction loss and the sparsity loss, that is, the prediction loss and the sparsity loss influence the total loss to different degrees. It can be understood that the larger the value of k1, the greater the influence of the prediction loss on the total loss; such an embodiment is more favorable for guaranteeing high prediction accuracy and is generally applicable to occasions with higher accuracy requirements. Conversely, the larger the value of k2, the greater the influence of the sparsity loss on the total loss; such an embodiment is more favorable for simplifying the model and is generally applicable to occasions with higher requirements on calculation speed.
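As a sketch, the weighted total loss can be written as follows; the default values of k1 and k2 are illustrative only:

```python
def total_loss(loss1, loss2, k1=1.0, k2=0.1):
    # loss1: prediction loss (ce_loss), loss2: sparsity loss (sparse_loss)
    return k1 * loss1 + k2 * loss2
```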
In addition, in other embodiments, the total loss may also be selected in other forms, which may be set according to actual situations, and does not affect the implementation of the present invention, but it can be understood that the total loss generally needs to be positively correlated with the predicted loss and positively correlated with the sparsity loss.
In one embodiment of the present invention, the method further comprises:
and after the trained second natural language processing model is obtained, when the prediction loss is judged to be higher than a first threshold value or the sparsity loss is judged to be higher than a second threshold value, outputting prompt information.
As can be seen from the foregoing description, the present application considers training to be complete when the total loss determined based on the prediction loss and the sparsity loss converges. In most cases, the prediction loss and the sparsity loss are both low when the total loss converges, but in a few cases the prediction loss or the sparsity loss may still be high.
In this embodiment, when the predicted loss is higher than the first threshold or the sparsity loss is higher than the second threshold, prompt information is output so that the staff can notice the situation in time and take corresponding measures. For example, in one case, when the prediction loss is higher than the first threshold, the value of k1 in the foregoing embodiment may be increased appropriately so that the total loss takes more into account the prediction loss, and correspondingly, when the sparsity loss is higher than the second threshold, the value of k2 in the foregoing embodiment may be increased appropriately so that the total loss takes more into account the sparsity loss.
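A minimal sketch of such a check; the threshold values and prompt messages are assumptions for illustration:

```python
def check_losses(loss1, loss2, threshold1=0.5, threshold2=0.5):
    # loss1: prediction loss, loss2: sparsity loss
    if loss1 > threshold1:
        print("prediction loss above threshold: consider increasing k1")
    if loss2 > threshold2:
        print("sparsity loss above threshold: consider increasing k2")
```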
That is, in an embodiment of the present invention, the method may further include: and receiving a coefficient adjusting instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjusting instruction.
In a specific embodiment of the present invention, the inputting the text to be processed into the second natural language processing model described in step S305 to obtain the natural language processing result for the text to be processed output by the second natural language processing model may specifically include:
and inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result output by the second natural language processing model and aiming at the text to be processed.
The second natural language processing model of the application can process the text to be processed, and in practical application, a specific processing purpose is generally to perform semantic recognition on the text to be processed. Of course, in other embodiments, the second natural language processing model of the present application may also perform other aspects of processing on the text to be processed, such as language analysis, knowledge extraction, text translation, and the like.
By applying the technical scheme provided by the embodiment of the invention, an initial natural language processing model is established and trained, and after the trained first natural language processing model is obtained, software-level optimization is realized by means of sparsification. Specifically, the present application does not directly delete part of the layers in the first natural language processing model; instead, for any 1 model parameter matrix of the first natural language processing model, it sets a row thinning parameter group for determining whether the rows in the model parameter matrix are retained and a column thinning parameter group for determining whether the columns in the model parameter matrix are retained. That is, for any 1 model parameter matrix of the first natural language processing model, which rows and columns of the model parameter matrix are retained and which are excluded is determined by the corresponding row and column thinning parameter groups. Then, the first natural language processing model can be trained according to the row thinning parameter group and the column thinning parameter group of each model parameter matrix. In the forward propagation process of training, the prediction loss and the sparsity loss are determined. In the back propagation process of training, the remaining parameters of the first natural language processing model that are not currently sparsified are updated through the prediction loss, so the scheme of the application can effectively guarantee the precision of the second natural language processing model obtained after training, that is, the optimization is realized without damaging the precision. Also in the back propagation process of training, each row thinning parameter group and each column thinning parameter group are updated through the prediction loss and the sparsity loss; when the total loss determined based on the prediction loss and the sparsity loss converges, both the prediction loss and the sparsity loss have reached a suitable level, and when the sparsity loss reaches a suitable degree of optimization, part of the rows and columns of each model parameter matrix in the second natural language processing model have been filtered out and deleted. Then, hardware deployment can be performed based on the second natural language processing model, and after the deployment is completed, the text to be processed is input into the second natural language processing model to obtain the natural language processing result, output by the second natural language processing model, for the text to be processed; that is, through the second natural language processing model, the natural language processing of the text to be processed can be carried out effectively. In addition, it can be understood that, since the optimization sets the row thinning parameter group for determining whether rows in the model parameter matrix are retained and the column thinning parameter group for determining whether columns in the model parameter matrix are retained, the filtered-out rows and columns do not need to be deployed on hardware after the optimization is completed; that is, only the rows and columns remaining after sparsification need to be considered during hardware deployment, so the hardware requirements of the application are lower, the amount of calculation is small, and the co-optimization of software and hardware is realized.
In summary, the scheme of the present application can effectively realize natural language processing and can effectively perform software-hardware co-optimization to improve the natural language processing efficiency of the model; at the same time, after the optimization is completed, no precision is lost.
Corresponding to the above method embodiments, the embodiments of the present invention further provide a natural language processing system, which can be referred to in correspondence with the above.
Referring to fig. 6, a schematic structural diagram of a natural language processing system according to the present invention is shown, which includes:
a first natural language processing model determining module 601, configured to establish an initial natural language processing model and perform training to obtain a trained first natural language processing model;
a sparseness setting module 602, configured to set, for any 1 model parameter matrix of the first natural language processing model, a row sparseness parameter set used for determining whether rows in the model parameter matrix are reserved, and a column sparseness parameter set used for determining whether columns in the model parameter matrix are reserved;
a pruning module 603, configured to perform training of the first natural language processing model according to the row sparsification parameter set and the column sparsification parameter set of each model parameter matrix, determine prediction loss and sparsity loss in a forward propagation process of the training, update, in a backward propagation process of the training, remaining parameters of the first natural language processing model that are not currently sparse through the prediction loss, and update, through the prediction loss and the sparsity loss, each row sparsification parameter set and each column sparsification parameter set;
a second natural language processing model determining module 604, configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges;
the execution module 605 is configured to perform hardware deployment based on the second natural language processing model, and after the hardware deployment is completed, input the text to be processed into the second natural language processing model to obtain a natural language processing result for the text to be processed, which is output by the second natural language processing model.
In a specific embodiment of the present invention, the pruning module 603 updates, in the training back propagation process, the remaining parameters of the first natural language processing model that are not currently sparse through predicting loss, including:
and in the training back propagation process, updating the residual parameters of the first natural language processing model which are not sparse currently by taking the reduction of the prediction loss as a training target.
In an embodiment of the present invention, the pruning module 603 updates each row sparsity parameter set and each column sparsity parameter set by predicting the loss and the sparsity loss, including:
and in the back propagation process of training, updating each row sparse parameter set and each column sparse parameter set by taking the total loss reduction as a training target.
In an embodiment of the present invention, the updating, by the pruning module 603, of each row sparsification parameter set and each column sparsification parameter set includes:
for the row sparsification parameter set of any one model parameter matrix, updating the row sparsification parameter set according to
[formula provided as an image: update rule giving the row sparsification parameter set at the next time from the one at the current time];
for the column sparsification parameter set of any one model parameter matrix, updating the column sparsification parameter set according to
[formula provided as an image: update rule giving the column sparsification parameter set at the next time from the one at the current time];
wherein S_k denotes the row sparsification parameter set at the current time, S_{k+1} denotes the row sparsification parameter set at the next time, l_r denotes the learning rate for differential calculation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter set at the current time, Q_{k+1} denotes the column sparsification parameter set at the next time, and M_k denotes the model parameter mask of the model parameter matrix, with
[formula provided as an image: definition of the entries of the model parameter mask M_k];
M_ij denotes the value in the i-th row and j-th column of M_k, x_i denotes the i-th parameter in S_k, and y_j denotes the j-th parameter in Q_k, where i and j are positive integers, 1 ≤ i ≤ a, 1 ≤ j ≤ b, and a and b are respectively the number of rows and the number of columns of the model parameter matrix.
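Since the update formulas above are only available as images, the following LaTeX rendering is offered purely as a reading aid; it assumes an ordinary gradient-descent update on the total loss with learning rate l_r, and an entrywise mask built from Softplus-activated row and column parameters. Both assumed forms are reconstructions from the variable definitions, not the formulas of the original publication.

\[
S_{k+1} = S_k - l_r \cdot \frac{\partial Loss}{\partial S_k}, \qquad
Q_{k+1} = Q_k - l_r \cdot \frac{\partial Loss}{\partial Q_k}, \qquad
M_{ij} = \mathrm{Softplus}(x_i) \cdot \mathrm{Softplus}(y_j).
\]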
In an embodiment of the present invention, each parameter in any one row sparsification parameter set and any one column sparsification parameter set takes a default value after being set and before being updated, so that every row and every column of any model parameter matrix is initially retained.
In one embodiment of the present invention, the total loss is the sum of the prediction loss and the sparsity loss.
In an embodiment of the present invention, the total loss = k1 × loss1 + k2 × loss2, where k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
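As a small illustration of this weighted combination (the default values k1 = 1.0 and k2 = 0.1 below are assumptions; the embodiment only requires that k1 and k2 be preset coefficients):

def weighted_total_loss(loss1, loss2, k1=1.0, k2=0.1):
    # loss1: prediction loss, loss2: sparsity loss.
    # Setting k1 = k2 = 1 recovers the plain sum of the preceding embodiment.
    return k1 * loss1 + k2 * loss2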
In one embodiment of the present invention, the system further comprises:
an information prompt module, configured to output prompt information when, after the trained second natural language processing model is obtained, the prediction loss is determined to be higher than a first threshold or the sparsity loss is determined to be higher than a second threshold.
In one embodiment of the present invention, the system further comprises:
a coefficient adjustment module, configured to receive a coefficient adjustment instruction and adjust the value of k1 and/or k2 according to the coefficient adjustment instruction.
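A minimal sketch of how the information prompt module and the coefficient adjustment module of these two embodiments might cooperate is given below; the prompt wording, the threshold handling and the dict-based adjustment instruction are assumptions for illustration only.

def prompt_if_needed(pred_loss, sparsity_loss, first_threshold, second_threshold):
    # Information prompt module: output prompt information when either loss
    # remains too high after the second natural language processing model is obtained.
    if pred_loss > first_threshold:
        print("Prompt: prediction loss exceeds the first threshold.")
    if sparsity_loss > second_threshold:
        print("Prompt: sparsity loss exceeds the second threshold.")

def adjust_coefficients(k1, k2, instruction):
    # Coefficient adjustment module: apply a received adjustment instruction,
    # assumed here to be a dict such as {"k1": 2.0} or {"k2": 0.05}.
    return instruction.get("k1", k1), instruction.get("k2", k2)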
In an embodiment of the present invention, the execution module 605 inputting the text to be processed into the second natural language processing model and obtaining the natural language processing result, output by the second natural language processing model, for the text to be processed includes:
inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result, output by the second natural language processing model, for the text to be processed.
In an embodiment of the invention, the execution module 605 performing hardware deployment based on the second natural language processing model includes:
determining the row sparsification parameter set and the column sparsification parameter set of each model parameter matrix based on the second natural language processing model;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter set and column sparsification parameter set, to obtain a sparse model parameter matrix;
performing hardware deployment based on each sparse model parameter matrix.
In a specific embodiment of the present invention, the execution module 605 performing hardware deployment based on each sparse model parameter matrix includes:
performing hardware deployment based on each sparse model parameter matrix, and during deployment, padding zeros into the calculation result of the corresponding model parameter matrix on the principle of keeping the dimensions unchanged.
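The row/column extraction and the zero-padding rule of these embodiments can be illustrated with NumPy as follows; the index arrays, function names and the 4 x 6 example matrix are assumptions, and an actual deployment would map the compressed matrix onto the target hardware rather than compute with NumPy.

import numpy as np

def compress_matrix(W, row_keep, col_keep):
    # Keep only the rows and columns that the sparsification parameter sets mark
    # as retained, yielding the dense sub-matrix that is actually deployed.
    return W[np.ix_(row_keep, col_keep)]

def matvec_with_padding(W_sub, x, row_keep, col_keep, out_dim):
    # Multiply on the compressed matrix, then pad zeros back into the pruned
    # output positions so that downstream layers see unchanged dimensions.
    y_sub = W_sub @ x[col_keep]
    y = np.zeros(out_dim, dtype=W_sub.dtype)
    y[row_keep] = y_sub
    return y

# Example with an assumed 4 x 6 parameter matrix in which rows {0, 2} and
# columns {1, 3, 4} are retained by the sparsification parameter sets.
W = np.arange(24, dtype=np.float64).reshape(4, 6)
row_keep, col_keep = np.array([0, 2]), np.array([1, 3, 4])
W_sub = compress_matrix(W, row_keep, col_keep)
x = np.ones(6)
print(matvec_with_padding(W_sub, x, row_keep, col_keep, out_dim=4))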
Corresponding to the above method and system embodiments, the present invention also provides a natural language processing device and a computer readable storage medium, which can be referred to in correspondence with the above. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the natural language processing method as in any of the above embodiments.
Referring to fig. 7, the natural language processing apparatus may include:
a memory 701 for storing a computer program;
a processor 702 for executing a computer program to implement the steps of the natural language processing method in any of the embodiments described above.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, without departing from the principle of the present invention, several improvements and modifications can be made to the present invention, and these improvements and modifications also fall into the protection scope of the present invention.

Claims (15)

1. A natural language processing method, comprising:
establishing an initial natural language processing model and training to obtain a trained first natural language processing model;
setting, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter set for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter set for determining whether columns in the model parameter matrix are retained;
training the first natural language processing model according to the row sparsification parameter set and the column sparsification parameter set of each model parameter matrix, determining a prediction loss and a sparsity loss in the forward propagation process of the training, and, in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that have not been sparsified through the prediction loss, and updating each row sparsification parameter set and each column sparsification parameter set through the prediction loss and the sparsity loss;
obtaining a trained second natural language processing model when the total loss determined based on the prediction loss and the sparsity loss converges; and
performing hardware deployment based on the second natural language processing model, and after the deployment is completed, inputting a text to be processed into the second natural language processing model to obtain a natural language processing result, output by the second natural language processing model, for the text to be processed.
2. The natural language processing method according to claim 1, wherein the updating, in the backward propagation process of the training, the remaining parameters of the first natural language processing model that have not been sparsified through the prediction loss comprises:
in the backward propagation process of the training, updating the remaining parameters of the first natural language processing model that have not been sparsified, with reduction of the prediction loss as the training target.
3. The natural language processing method according to claim 1, wherein the updating each row sparsification parameter set and each column sparsification parameter set through the prediction loss and the sparsity loss comprises:
in the backward propagation process of the training, updating each row sparsification parameter set and each column sparsification parameter set with reduction of the total loss as the training target.
4. The natural language processing method according to claim 1, wherein the updating each row sparsification parameter set and each column sparsification parameter set comprises:
for the row sparsification parameter set of any one model parameter matrix, updating the row sparsification parameter set according to
[formula provided as an image: update rule giving the row sparsification parameter set at the next time from the one at the current time];
for the column sparsification parameter set of any one model parameter matrix, updating the column sparsification parameter set according to
[formula provided as an image: update rule giving the column sparsification parameter set at the next time from the one at the current time];
wherein S_k denotes the row sparsification parameter set at the current time, S_{k+1} denotes the row sparsification parameter set at the next time, l_r denotes the learning rate for differential calculation of the sparsification parameters, Loss denotes the total loss, Softplus denotes the Softplus function, Q_k denotes the column sparsification parameter set at the current time, Q_{k+1} denotes the column sparsification parameter set at the next time, and M_k denotes the model parameter mask of the model parameter matrix, with
[formula provided as an image: definition of the entries of the model parameter mask M_k];
M_ij denotes the value in the i-th row and j-th column of M_k, x_i denotes the i-th parameter in S_k, and y_j denotes the j-th parameter in Q_k, where i and j are positive integers, 1 ≤ i ≤ a, 1 ≤ j ≤ b, and a and b are respectively the number of rows and the number of columns of the model parameter matrix.
5. The natural language processing method according to claim 1, wherein each parameter in any one row sparsification parameter set and any one column sparsification parameter set takes a default value after being set and before being updated, so that every row and every column of any model parameter matrix is initially retained.
6. The natural language processing method of claim 1, wherein the total loss is a sum of the prediction loss and the sparsity loss.
7. The natural language processing method according to claim 1, wherein the total loss = k1 × loss1 + k2 × loss2, where k1 and k2 are both preset coefficients, loss1 is the prediction loss, and loss2 is the sparsity loss.
8. The natural language processing method according to claim 7, further comprising:
after the trained second natural language processing model is obtained, outputting prompt information when the prediction loss is determined to be higher than a first threshold or the sparsity loss is determined to be higher than a second threshold.
9. The natural language processing method according to claim 8, further comprising:
receiving a coefficient adjustment instruction, and adjusting the value of k1 and/or k2 according to the coefficient adjustment instruction.
10. The natural language processing method according to claim 1, wherein the inputting the text to be processed into the second natural language processing model to obtain the natural language processing result, output by the second natural language processing model, for the text to be processed comprises:
inputting the text to be processed into the second natural language processing model to obtain a semantic recognition result, output by the second natural language processing model, for the text to be processed.
11. The natural language processing method according to any one of claims 1 to 10, wherein the performing hardware deployment based on the second natural language processing model comprises:
determining the row sparsification parameter set and the column sparsification parameter set of each model parameter matrix based on the second natural language processing model;
for any one model parameter matrix, extracting the non-zero parameter rows and non-zero parameter columns of the model parameter matrix according to its row sparsification parameter set and column sparsification parameter set, to obtain a sparse model parameter matrix;
performing hardware deployment based on each sparse model parameter matrix.
12. The natural language processing method according to claim 11, wherein the deploying hardware based on each sparse model parameter matrix comprises:
performing hardware deployment based on each sparse model parameter matrix, and during deployment, padding zeros into the calculation result of the corresponding model parameter matrix on the principle of keeping the dimensions unchanged.
13. A natural language processing system, comprising:
the first natural language processing model determining module is used for establishing an initial natural language processing model and training the initial natural language processing model to obtain a trained first natural language processing model;
a sparsification setting module, configured to set, for any one model parameter matrix of the first natural language processing model, a row sparsification parameter set for determining whether rows in the model parameter matrix are retained, and a column sparsification parameter set for determining whether columns in the model parameter matrix are retained;
a pruning module, configured to train the first natural language processing model according to the row sparsification parameter set and the column sparsification parameter set of each model parameter matrix, determine a prediction loss and a sparsity loss in the forward propagation process of the training, update, in the backward propagation process of the training, the remaining parameters of the first natural language processing model that have not been sparsified through the prediction loss, and update each row sparsification parameter set and each column sparsification parameter set through the prediction loss and the sparsity loss;
a second natural language processing model determining module, configured to obtain a trained second natural language processing model when a total loss determined based on the prediction loss and the sparsity loss converges;
an execution module, configured to perform hardware deployment based on the second natural language processing model, and after the deployment is completed, input the text to be processed into the second natural language processing model to obtain a natural language processing result, output by the second natural language processing model, for the text to be processed.
14. A natural language processing apparatus, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the natural language processing method of any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the natural language processing method according to any one of claims 1 to 12.
CN202211237680.XA 2022-10-11 2022-10-11 Natural language processing method, system, equipment and storage medium Active CN115329744B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211237680.XA CN115329744B (en) 2022-10-11 2022-10-11 Natural language processing method, system, equipment and storage medium
PCT/CN2023/098938 WO2024077981A1 (en) 2022-10-11 2023-06-07 Natural language processing method, system and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211237680.XA CN115329744B (en) 2022-10-11 2022-10-11 Natural language processing method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115329744A CN115329744A (en) 2022-11-11
CN115329744B true CN115329744B (en) 2023-04-07

Family

ID=83914501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211237680.XA Active CN115329744B (en) 2022-10-11 2022-10-11 Natural language processing method, system, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115329744B (en)
WO (1) WO2024077981A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329744B (en) * 2022-10-11 2023-04-07 浪潮电子信息产业股份有限公司 Natural language processing method, system, equipment and storage medium
CN117668563B (en) * 2024-01-31 2024-04-30 苏州元脑智能科技有限公司 Text recognition method, text recognition device, electronic equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021673A1 (en) * 2020-07-31 2022-02-03 中国原子能科学研究院 Method and system for predicting sparse matrix vector multiplication operation time
CN114490922A (en) * 2020-10-27 2022-05-13 华为技术有限公司 Natural language understanding model training method and device
CN114723047A (en) * 2022-04-15 2022-07-08 支付宝(杭州)信息技术有限公司 Task model training method, device and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329744B (en) * 2022-10-11 2023-04-07 浪潮电子信息产业股份有限公司 Natural language processing method, system, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022021673A1 (en) * 2020-07-31 2022-02-03 中国原子能科学研究院 Method and system for predicting sparse matrix vector multiplication operation time
CN114490922A (en) * 2020-10-27 2022-05-13 华为技术有限公司 Natural language understanding model training method and device
CN114723047A (en) * 2022-04-15 2022-07-08 支付宝(杭州)信息技术有限公司 Task model training method, device and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Comparative Deep Learning of Hybrid Representations for Image Recommendations; Chenyi Lei; IEEE; 2016-12-12; full text *
A Survey of Research on the Application of Natural Language Processing in Resume Analysis; Li Xiaowei; Computer Science; 2022-06-09; Vol. 49 (No. S1); full text *
Adaptive Speech Compressed Sensing Method; Luo Wujun et al.; Journal of Southeast University (Natural Science Edition); 2012-11-20 (No. 06); full text *

Also Published As

Publication number Publication date
WO2024077981A1 (en) 2024-04-18
CN115329744A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN115329744B (en) Natural language processing method, system, equipment and storage medium
CN108764317B (en) Residual convolutional neural network image classification method based on multipath feature weighting
CN109978142B (en) Neural network model compression method and device
WO2018121472A1 (en) Computation method
CN107342078B (en) Conversation strategy optimized cold start system and method
CN108416440A (en) A kind of training method of neural network, object identification method and device
CN107977704A (en) Weighted data storage method and the neural network processor based on this method
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN109284812B (en) Video game simulation method based on improved DQN
CN107395211B (en) Data processing method and device based on convolutional neural network model
CN107944545A (en) Computational methods and computing device applied to neutral net
CN108038539A (en) A kind of integrated length memory Recognition with Recurrent Neural Network and the method for gradient lifting decision tree
CN109583586B (en) Convolution kernel processing method and device in voice recognition or image recognition
CN115511069A (en) Neural network training method, data processing method, device and storage medium
CN109993302A (en) The convolutional neural networks channel of knowledge based migration is compressed from selection and accelerated method
CN113420651A (en) Lightweight method and system of deep convolutional neural network and target detection method
CN114881225A (en) Power transmission and transformation inspection model network structure searching method, system and storage medium
CN110009644B (en) Method and device for segmenting line pixels of feature map
CN110659735A (en) Method, device and equipment for dynamically adjusting neural network channel
CN113128664A (en) Neural network compression method, device, electronic equipment and storage medium
KR102393761B1 (en) Method and system of learning artificial neural network model for image processing
CN110516806A (en) The rarefaction method and apparatus of neural network parameter matrix
CN113159168B (en) Pre-training model accelerated reasoning method and system based on redundant word deletion
CN115994590B (en) Data processing method, system, equipment and storage medium based on distributed cluster
CN109409226A (en) A kind of finger vena plot quality appraisal procedure and its device based on cascade optimization CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant