CN117391079A - Method for generating large model by reasoning text - Google Patents

Method for generating large model by reasoning text

Info

Publication number
CN117391079A
CN117391079A (Application CN202311321583.3A)
Authority
CN
China
Prior art keywords
text
model
training
length
token
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311321583.3A
Other languages
Chinese (zh)
Inventor
曹肖攀
马国祖
秦瑾
陈超
张峻崎
赵玲
花榕励
张喜强
秦涛
杨祺
孙力泽
赵长海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Wanwei Information Technology Co Ltd
Original Assignee
China Telecom Wanwei Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Wanwei Information Technology Co Ltd
Priority to CN202311321583.3A
Publication of CN117391079A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of natural language processing and relates to a method for generating text by reasoning with a large model, in particular to a method for training an infinitely long text model based on LSAFormer-1.3B. The patent builds LSAFormer-1.3B, a 1.3-billion-parameter (1.3B) text generation large model with low-dimensional multi-head attention that supports unlimited-length training and unlimited-length inference. The patent proposes a k-q dimension-reduction attention matrix scheme; the dimension reduction effectively improves the efficiency of attention computation, and the dimension-reduction attention matrix scheme proposed here is entirely different from the q-v matrix operation in the industry's Linformer. To enable unlimited-length training and inference under low-compute conditions, the patent proposes fusing four operations: deleting the position encodings, the sliding-window scheme, grouped shuffling, and random deletion of several model inputs and outputs.

Description

Method for generating large model by reasoning text
Technical Field
The invention belongs to the technical field of natural language processing and relates to a method for generating text by reasoning with a large model, in particular to a method for training an infinitely long text model based on LSAFormer-1.3B.
Background
Large text generation models have wide application scenarios such as text creation and knowledge question answering. However, as models grow, and especially as the embedding dimension and the text length increase, the computational cost becomes enormous and ordinary servers can hardly handle it. This patent therefore proposes to delete the position encodings so that the text length is not limited in the training and inference stages of the generation model, allowing training and inference to be carried out in the same sliding-window manner. Second, because the sliding-window scheme combined with the shuffle mechanism used during training can fragment the content of a single text, this patent proposes a grouped shuffle mechanism, which improves the generalization of the model and avoids the model forgetting earlier knowledge due to the fragmentation of text information. Finally, for the embedding dimension, in the multi-head attention mechanism q and k are each mapped to a low dimension (768) when computing attention, which improves the computational efficiency of the attention mechanism.
Disclosure of Invention
The invention relates to a method for training an infinitely long model based on LSAFormer-1.3B.
The specific implementation steps are as follows:
S1, making a token word list
Collect the training data sets related to text generation tasks in the project, such as text writing, knowledge question answering, and knowledge recommendation. After collecting and de-duplicating the Chinese characters, letters, and symbols in the training data set, build a mapping word list and reserve the special tokens of the token word list, i.e. token_Dict = {0: "[CLS]", 1: "[PAD]", 2: "[SEP]", 3: "[NewLine]", 4: ...}; where [CLS] is the text start token, [PAD] is used to pad the text to max_len when it does not reach the maximum length, [SEP] is the separator in text generation tasks, [NewLine] is the text line-break token, and max_len = 512 is the maximum text length agreed for model input;
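A minimal sketch of how such a token vocabulary could be built is shown below, assuming the collected texts are plain txt files; the glob pattern, helper names, and the exact set of reserved special tokens are illustrative assumptions, not details fixed by the patent.

```python
import glob

SPECIAL_TOKENS = ["[CLS]", "[PAD]", "[SEP]", "[NewLine]"]

def build_token_dict(txt_glob="data/*.txt"):
    chars = set()
    for path in glob.glob(txt_glob):
        with open(path, encoding="utf-8") as f:
            for line in f:
                chars.update(line.rstrip("\n"))   # collect and de-duplicate characters
    id_to_token = dict(enumerate(SPECIAL_TOKENS + sorted(chars)))  # {0: "[CLS]", 1: "[PAD]", ...}
    token_to_id = {tok: idx for idx, tok in id_to_token.items()}   # reverse map used for encoding
    return id_to_token, token_to_id
```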
S2, making a training data set
The training data set is stored as several txt files; if a piece of text-generation data contains line breaks, they are replaced by [NewLine], so that each text generation sample occupies one line of the txt file in the training set. An unlimited-length text generation training task is realized with the sliding-window training scheme, where the window width is set to Window = 256. If the number of tokens of a piece of data exceeds the maximum length agreed with the model, that data realizes the unlimited-length text generation task through the sliding window. A piece of data is denoted X; if the length of X exceeds max_len, X[id1:id2] denotes the characters from id1 to id2 of X. The input ["[CLS]"] + X[0:max_len-1] corresponds, through the model, to the optimization target output X[0:max_len]; for the part exceeding max_len, the training set is constructed again by translating the window, where Window = 256 is smaller than max_len: X[Window:max_len+Window] is the translated input and corresponds to the output X[Window+1:max_len+Window+1]. If the text length exceeds max_len + Window, the translation continues several times, so an unlimited-length text can be handled; at the data level a training sample enters the model several times through the sliding window while the model text length remains 512, realizing the unlimited-length text training task under low-compute conditions. All texts in the current training txt file form a batch through the sliding window, giving a training set list [a1, a2, a3, a4, a5, a6, b1, b2, b3, b4, b5, b6, ...], where ai and bi denote training samples and a1 to a6 are the several inputs and outputs formed by sliding windows over one training text. For each training set list, several inputs and outputs formed by sliding windows are randomly deleted before grouping, which further increases the chance that different inputs and outputs of one text are combined within a batch;
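A minimal sketch of the window-translation rule above, assuming X is a list of token ids and that the id of "[CLS]" is 0; the exact loop bounds are an assumption, since the text only states that translation continues while the data extends past max_len + Window.

```python
MAX_LEN, WINDOW, CLS_ID = 512, 256, 0

def sliding_window_samples(X):
    # first window: input [CLS] + X[0:max_len-1], target X[0:max_len]
    samples = [([CLS_ID] + X[:MAX_LEN - 1], X[:MAX_LEN])]
    start = WINDOW
    # only texts longer than max_len are re-windowed; shift by WINDOW each time
    while len(X) > MAX_LEN and start + 1 < len(X):
        samples.append((X[start:start + MAX_LEN],           # translated input
                        X[start + 1:start + MAX_LEN + 1]))  # target shifted by one token
        if start + MAX_LEN >= len(X):
            break                                           # last window reached the end of X
        start += WINDOW
    return samples
```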
S3, constructing a network model structure
For a text X of the project corpus, it is converted into an index list through token_Dict from step S1: the numerical index of each word is found in turn, so the training sample becomes a numerical sequence and then a tensor that is fed into the embedding representation layer of the neural network model. A GPT decoder and a low-dimensional-space attention mechanism are used, the position embedding codes are deleted, and text inference in sliding-window mode is realized without the length limitation of long-text training;
S4, model training
Existing deep learning techniques are added on top of the model structure set in step S3: model training is performed with the Adam optimizer and the cross-entropy loss function. The model input data obtained in step S2 is substituted into the network structure obtained in step S3, and minimizing the loss function is the final objective of the model. The final purpose of model training is that, when the index of the true token at the i-th position of the optimization target Y is index, the probability that the input X yields that token at the i-th position of the LSAFormer neural network model is increased; that is, by training with the cross-entropy loss function, the model makes the text content fed into it predict the corresponding Y value through the LSAFormer neural network;
S5, model prediction
If the maximum length max_len is exceeded during inference, the window is translated by the Window length and inference continues. The text entered by the user is text; it is split into inputs by the word segmentation algorithm, where inputs is a list; the content fed into LSAFormer is ["[CLS]"] + inputs; the model outputs the corresponding token value, which is then appended to the input. Repeating these steps generates the text content.
S3: constructing the network model structure:
A text X of the project corpus is converted into an index list through token_Dict from step S1; the numerical index of each word is found in turn, so the training sample becomes a numerical sequence and then a tensor. It then passes through an embedding layer of size [20000, 2048], where 20000 is the size of the vocabulary token_Dict and 2048 is the representation dimension of each character, giving an embedding representation matrix of size [12, 512, 2048], where 12 is the batch size. The word embedding representations of the 512 characters are then fed into the model proposed by this patent: a multi-head attention mechanism combined with a feed-forward fully connected neural network forms one unit, repeated 30 times, i.e. 30 identical layers, giving the overall network model structure, with the position embedding codes deleted at the embedding representation layer. Following GPT, if the current input multi-head attention representation matrix is X, three matrices q = X W_Q, k = X W_K, v = X W_V are computed and q, k, v are split by the multi-head mechanism. In the original GPT and Transformer model structures, W_Q, W_K, W_V all have the same dimension; here a low-dimensional-space attention mechanism is introduced, i.e. a parameter d_s smaller than d, usually set to 768, with X ∈ R^(n×d), W_Q ∈ R^(d×d_s), W_K ∈ R^(d×d_s), W_V ∈ R^(d×d), where n is the number of tokens fed into the model, i.e. the text length (equal to max_len), and d is the representation dimension of each character token in the text. Usually n and d cannot be changed at will: n is the text length and generally depends on the application scenario, while the dimension d is the representation dimension of each token character, and the larger d is, the stronger the learning ability of the model; that is, the larger d and n are, the stronger the model, but the computational efficiency then tends to be low. Introducing d_s = 768 gives the three matrices q ∈ R^(n×d_s), k ∈ R^(n×d_s), v ∈ R^(n×d). After q, k, v are split into multiple heads, (q)(k)^T ∈ R^(n×n), and each element of (q)(k)^T has the same meaning as in the original Transformer; d_s << d, and d_s is an independently set parameter that does not grow as the values of d and n increase. If the matrix fed into the feed-forward fully connected neural network has size [12, 512, 2048], it is mapped through a fully connected layer of size [2048, 4800*2] to a matrix of size [12, 512, 4800*2], which is split into two matrices of the same size, res and gate, each of size [12, 512, 4800].
The gate matrix passes through a GELU activation function and is multiplied element-wise with res, giving a matrix of size [12, 512, 4800]; after a fully connected layer of size [4800, 2048], the output of the feed-forward fully connected neural network is again [12, 512, 2048]. Unlike the original Transformer, which raises the dimension and then returns to the low dimension through an activation function, the mapping here goes directly to 4800*2 dimensions and is split into halves, avoiding the multiplication of the [12, 512, 4800*2] matrix with a [4800*2, 2048] fully connected layer as in the Transformer; finally a fully connected layer of size [2048, 20000] is applied, where 20000 is the total number of tokens.
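A minimal PyTorch sketch of the gated feed-forward block described above (one [2048, 4800*2] projection split into res and gate, GELU applied to gate, element-wise product, then a [4800, 2048] projection back); class and argument names are assumptions, and bias, residual, and normalization details are omitted since the patent does not specify them.

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """Sketch of the res/gate feed-forward unit; names and defaults are illustrative."""
    def __init__(self, d=2048, d_ff=4800):
        super().__init__()
        self.up = nn.Linear(d, d_ff * 2)   # single [d, 2*d_ff] projection, later split in half
        self.act = nn.GELU()
        self.down = nn.Linear(d_ff, d)     # [d_ff, d] projection back to the model width

    def forward(self, x):                  # x: [batch, seq_len, d], e.g. [12, 512, 2048]
        res, gate = self.up(x).chunk(2, dim=-1)    # two [batch, seq_len, d_ff] halves
        return self.down(res * self.act(gate))     # GELU-gated product, then project down
```

The single up-projection followed by a split is what avoids the [12, 512, 4800*2] by [4800*2, 2048] multiplication described in the text: only the 4800-wide half is projected back down.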
This patent builds LSAFormer-1.3B, a 1.3-billion-parameter (1.3B) text generation large model with low-dimensional multi-head attention and unlimited-length training and inference. The patent proposes a k-q dimension-reduction attention matrix scheme; the dimension reduction effectively improves the efficiency of attention computation, and the dimension-reduction attention matrix scheme proposed here is entirely different from the q-v matrix operation in the industry's Linformer. To enable unlimited-length training and inference under low-compute conditions, the patent proposes fusing four operations: deleting the position encodings, the sliding-window scheme, grouped shuffling, and random deletion of several model inputs and outputs. The sliding-window scheme of this patent is exactly the same in the training and inference stages; it resembles the sliding windows used in industry, but the way it is used is completely different, and only when the four operations are fused can unlimited-length training and inference under low-compute conditions truly be achieved.
Drawings
FIG. 1 is a diagram of the LSAFormer-1.3B network model of the present invention.
Detailed Description
A method for training an infinitely long model based on LSAFormer-1.3B comprises the following specific implementation steps:
S1: making a token word list: existing publicly disclosed technology may be employed here. Collect the training data sets related to text generation tasks in the project, such as text writing, knowledge question answering, and knowledge recommendation; collect all Chinese characters, letters, and symbols, build a mapping word list after de-duplication, and reserve several special tokens, i.e. token_Dict = {0: "[CLS]", 1: "[PAD]", 2: "[SEP]", 3: "[NewLine]", 4: ...}. Here [CLS] is the text start token, [PAD] is used to pad the text to max_len when it does not reach the maximum length, [SEP] is the separator in text generation tasks, [NewLine] is the text line-break token, and max_len = 512 is the maximum text length agreed for model input.
S2: making a training data set. The data sets of the text generation tasks (free text generation, text question answering, multi-turn text dialogue) are stored as a few hundred txt files; if a piece of text-generation data contains line breaks, they are replaced by [NewLine], ensuring that each text generation sample occupies one line of the txt file in the training set. With the sliding-window training scheme an unlimited-length text generation training task can be achieved, with the window width set to Window = 256. If the number of tokens of a piece of data exceeds the maximum length agreed with the model, that data realizes the unlimited-length text generation task through the sliding window. The specific steps are as follows: a piece of data is denoted X; if the length of X exceeds max_len, X[id1:id2] denotes the characters from id1 to id2 of X. The input ["[CLS]"] + X[0:max_len-1] corresponds, through the model, to the optimization target output X[0:max_len]; for the part exceeding max_len, the training set is constructed again by translating the window, where Window = 256 is smaller than max_len: X[Window:max_len+Window] is the translated input, and the corresponding output through the model is X[Window+1:max_len+Window+1]. If the text length exceeds max_len + Window, the translation continues several times, so this method can realize an unlimited-length text generation scheme; at the data level the corresponding training sample enters the model several times through the sliding window while the model text length remains 512, so the unlimited-length text training task under low-compute conditions can be realized. This application is completely different from previous methods that truncate a text and continue training from the end; to retain the history information, only Window is translated.
The shuffle mechanism is commonly used in industry during model training, but in this application using the conventional shuffle mechanism together with the truncation operation would fragment the information: the multiple inputs and outputs produced by sliding a window over one text end up in different batches, for example one input/output of a text is updated in the current batch while another input/output of the same text is only updated in a batch two weeks or a month later, which seriously damages the model's overall perception of that text and causes the model to forget knowledge. This application therefore proposes a grouped shuffle mechanism that places the inputs and outputs of the several windows of one text in the same batch as far as possible. The specific implementation is as follows: all texts in the current training txt file form a batch through the sliding window, giving a training set list [a1, a2, a3, a4, a5, a6, b1, b2, b3, b4, b5, b6, ...], where ai and bi denote training samples and a1 to a6 are the several inputs and outputs of one training text formed by sliding windows. Grouping means dividing a training set list into several groups according to a parameter group = 4, for example [[a1, a2, a3, a4], [a5, a6, b1, b2], [b3, b4, b5, b6], ...], and then shuffling by group. In this way the model's dependence on the order is avoided and its generalization is improved, while the information fragmentation of a sample caused by the sliding-window mechanism is remedied, avoiding the model's forgetting of knowledge. To further avoid the information fragmentation of a sample caused by batch boundaries, each training set list in this application randomly deletes several inputs and outputs formed by sliding windows before grouping, which further increases the chance that different inputs and outputs of one text are combined in one batch; for example, deleting the leading a1 from [a1, a2, a3, a4, a5, a6, b1, b2, b3, b4, b5, b6, ...] gives [[a2, a3, a4, a5], [a6, b1, b2, b3], [b4, b5, b6, ...], ...], which further improves the generalization of the model and remedies the information fragmentation of a single text, as shown in the grouped-shuffle sketch below.
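A minimal sketch of the grouped shuffle with random deletion described above; group = 4 follows the text, while the number of deleted window samples is an assumption, since the application only says that "several" are removed before grouping.

```python
import random

def group_shuffle(window_samples, group=4, n_delete=1):
    """Drop a few window samples at random, then shuffle whole groups of `group` samples."""
    samples = list(window_samples)
    for _ in range(min(n_delete, max(len(samples) - group, 0))):
        samples.pop(random.randrange(len(samples)))     # random deletion before grouping
    groups = [samples[i:i + group] for i in range(0, len(samples), group)]
    random.shuffle(groups)                              # shuffle groups, not individual samples
    return [s for g in groups for s in g]               # flatten back into a training list
```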
S3: constructing the network model structure. A text X of the project corpus is converted into an index list through token_Dict from step S1. The numerical index of each word is found in turn through token_Dict, so the training sample becomes a numerical sequence and then a tensor; it then passes through an embedding layer of size [20000, 2048], where 2048 is the representation dimension of each character and 20000 is the size of the vocabulary token_Dict, giving an embedding representation matrix of size [12, 512, 2048], where 12 is the batch size; the word embedding representations of the 512 characters are then fed into the model proposed by this application. This application adopts the industry-mature neural network structure of the GPT decoder as the main framework: the multi-head attention mechanism and the feed-forward fully connected neural network form one unit, repeated 30 times, i.e. 30 identical layers, giving the overall network model structure. At the embedding representation layer, in order to realize sliding-window text training without limiting the text length, the position embedding codes are deleted. For GPT, if the current input multi-head attention representation matrix is X, three matrices q = X W_Q, k = X W_K, v = X W_V are computed and q, k, v are split by the multi-head mechanism; in the original GPT and Transformer model structures W_Q, W_K, W_V all have the same dimension. To improve model efficiency, this application introduces a low-dimensional-space attention mechanism, i.e. a parameter d_s smaller than d, usually set to 768, with X ∈ R^(n×d), W_Q ∈ R^(d×d_s), W_K ∈ R^(d×d_s), W_V ∈ R^(d×d), where n is the number of tokens fed into the model, i.e. the text length (the same as max_len), and d is the representation dimension of each character token in the text. In current natural language processing technology (including large models), n and d usually cannot be changed at will: n is the text length and is generally tied to the application scenario, while the dimension d is the representation dimension of each token character, and the larger d is, the stronger the learning ability of the model. That is, the larger d and n are, the more powerful the model, but the computational efficiency then becomes low. This application introduces d_s = 768 (d_s is usually a multiple of the number of attention heads), giving the three matrices q ∈ R^(n×d_s), k ∈ R^(n×d_s), v ∈ R^(n×d). After q, k, v are split into multiple heads, (q)(k)^T ∈ R^(n×n), and each element of (q)(k)^T has the same meaning as in the original Transformer; since d_s << d, this is more computationally efficient than the original Transformer, especially when training large models. The attention coefficient matrix is computed more efficiently, and d_s is an independently set parameter that does not grow as the values of d and n increase; this is one of the innovation points this application seeks to protect, and the above is the low-dimensional multi-head attention part. For the feed-forward fully connected part: if the matrix fed into the feed-forward fully connected neural network has size [12, 512, 2048], it is mapped through a fully connected layer of size [2048, 4800*2] to a matrix of size [12, 512, 4800*2], which is split into two matrices of the same size, res and gate, each of size [12, 512, 4800].
The gate matrix passes through a GELU activation function (existing technology) and is multiplied element-wise with res, giving a matrix of size [12, 512, 4800]; after a fully connected layer of size [4800, 2048], the output of the feed-forward fully connected neural network is again [12, 512, 2048]. The above is the feed-forward fully connected neural network; the original Transformer raises the dimension, applies an activation function, and then returns to the low dimension, whereas the mapping in this application goes directly to 4800*2 dimensions and is split into halves, avoiding the multiplication of the [12, 512, 4800*2] matrix with a [4800*2, 2048] fully connected layer as in the Transformer; finally a fully connected layer of size [2048, 20000] is applied, where 20000 is the total number of tokens. The main innovation point of this application is the proposed low-dimensional attention computation.
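As an illustration of the low-dimensional k-q attention, the following is a minimal PyTorch sketch in which q and k are projected to d_s = 768 while v keeps the full width d = 2048; the head count of 16, the omission of the causal mask and dropout, and all names are simplifying assumptions rather than details fixed by the application.

```python
import torch
import torch.nn as nn

class LowDimMultiHeadAttention(nn.Module):
    """Sketch of multi-head attention with q, k reduced to d_s and v kept at d."""
    def __init__(self, d=2048, d_s=768, n_heads=16):
        super().__init__()
        assert d_s % n_heads == 0 and d % n_heads == 0
        self.h, self.d_s, self.d = n_heads, d_s, d
        self.w_q = nn.Linear(d, d_s, bias=False)   # W_Q in R^(d x d_s)
        self.w_k = nn.Linear(d, d_s, bias=False)   # W_K in R^(d x d_s)
        self.w_v = nn.Linear(d, d, bias=False)     # W_V in R^(d x d)
        self.out = nn.Linear(d, d, bias=False)

    def forward(self, x):                          # x: [batch, n, d]
        b, n, _ = x.shape
        q = self.w_q(x).view(b, n, self.h, -1).transpose(1, 2)   # [b, h, n, d_s/h]
        k = self.w_k(x).view(b, n, self.h, -1).transpose(1, 2)   # [b, h, n, d_s/h]
        v = self.w_v(x).view(b, n, self.h, -1).transpose(1, 2)   # [b, h, n, d/h]
        att = (q @ k.transpose(-2, -1)) / (self.d_s // self.h) ** 0.5  # [b, h, n, n]
        att = att.softmax(dim=-1)                  # causal masking omitted in this sketch
        y = (att @ v).transpose(1, 2).reshape(b, n, self.d)
        return self.out(y)
```

Because only the q and k projections use d_s, the cost of forming the [n, n] attention coefficient matrix scales with d_s rather than with d, which is the efficiency gain the text describes.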
S4, model training: existing deep learning techniques are added on top of the model structure set in step S3. Model training uses the Adam optimizer and the cross-entropy loss function, both existing techniques; the specific training process is as follows: the data obtained in step S2 is fed as model input into the network structure obtained in step S3, and the final objective of the model is to minimize the loss function. The final purpose of model training is that, when the index of the true token at the i-th position of the optimization target Y is index, the probability that the input X yields that token at the i-th position of the LSAFormer neural network model is maximized. By training with the cross-entropy loss function, the model makes the text content fed into it predict the corresponding Y value through the LSAFormer neural network.
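A minimal training-loop sketch with Adam and cross-entropy, assuming `model` maps an id tensor of shape [batch, 512] to logits of shape [batch, 512, 20000] and that batches are produced as in S2; the learning rate, device handling, and the use of ignore_index for [PAD] are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train(model, batches, pad_id=1, epochs=1, lr=1e-4, device="cuda"):
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)        # skip [PAD] positions
    for _ in range(epochs):
        for inputs, targets in batches:                       # long tensors, shape [batch, 512]
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                            # [batch, 512, vocab_size]
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()                                   # minimize cross-entropy, as in S4
            opt.step()
```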
S5, model prediction: for text inference the sliding-window mode is still used. To stay consistent with the training process, if the maximum length max_len is exceeded during inference, the window is translated by the Window length and the subsequent inference continues. The specific process is as follows: if the text entered by the current user is text, it is split into inputs by the word segmentation algorithm, where inputs is a list; the content fed into LSAFormer is ["[CLS]"] + inputs; the model outputs the corresponding token value, which is then appended to the input. These steps are repeated to generate the text content; when it exceeds max_len, the window is translated by Window and inference continues, so this method supports unlimited-length training and realizes the unlimited-length text generation inference task.
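A minimal greedy-decoding sketch of the sliding-window inference above; the stop condition ([SEP]), greedy argmax decoding, and helper names are assumptions, since the text only specifies the [CLS] prefix and the Window translation once max_len is exceeded.

```python
import torch

MAX_LEN, WINDOW, CLS_ID, SEP_ID = 512, 256, 0, 2

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=200):
    context = [CLS_ID] + list(prompt_ids)     # ["[CLS]"] + inputs, as token ids
    generated = list(prompt_ids)
    for _ in range(max_new_tokens):
        if len(context) >= MAX_LEN:
            context = context[WINDOW:]        # translate the window by WINDOW, as in training
        logits = model(torch.tensor([context]))       # [1, len(context), vocab_size]
        next_id = int(logits[0, -1].argmax())         # greedy choice of the next token
        if next_id == SEP_ID:                         # assumed stop token
            break
        context.append(next_id)
        generated.append(next_id)
    return generated
```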
Provided that the software and hardware facilities fully meet the conditions required to run the deep learning model, the method operates through the above steps and finally obtains the result. The scheme is moderately difficult to modify: with knowledge of large language models, it can be changed at the model level and the number of network layers can be increased or decreased for testing and verification. Applicable scenarios include long-text authoring tasks: when the text of a user's text creation task is long, the task can be implemented through this patent. Domain knowledge question answering: based on a domain question-answer data set, training this model provides domain knowledge question answering and domain knowledge recommendation capabilities. Natural language processing tasks: by forming data sets from natural-language-processing-related data through templates, model training can provide capabilities such as text translation, phrasing rewriting, title generation, and the like.

Claims (2)

1. A method for generating text by reasoning with a large model, comprising the following steps:
S1, making a token word list
Collect the training data sets related to text generation tasks in the project, such as text writing, knowledge question answering, and knowledge recommendation. After collecting and de-duplicating the Chinese characters, letters, and symbols in the training data set, build a mapping word list and reserve the special tokens of the token word list, i.e. token_Dict = {0: "[CLS]", 1: "[PAD]", 2: "[SEP]", 3: "[NewLine]", 4: ...}; where [CLS] is the text start token, [PAD] is used to pad the text to max_len when it does not reach the maximum length, [SEP] is the separator in text generation tasks, [NewLine] is the text line-break token, and max_len = 512 is the maximum text length agreed for model input;
S2, making a training data set
The training data set is stored as several txt files; if a piece of text-generation data contains line breaks, they are replaced by [NewLine], so that each text generation sample occupies one line of the txt file in the training set. An unlimited-length text generation training task is realized with the sliding-window training scheme, where the window width is set to Window = 256. If the number of tokens of a piece of data exceeds the maximum length agreed with the model, that data realizes the unlimited-length text generation task through the sliding window. A piece of data is denoted X; if the length of X exceeds max_len, X[id1:id2] denotes the characters from id1 to id2 of X. The input ["[CLS]"] + X[0:max_len-1] corresponds, through the model, to the optimization target output X[0:max_len]; for the part exceeding max_len, the training set is constructed again by translating the window, where Window = 256 is smaller than max_len: X[Window:max_len+Window] is the translated input and corresponds to the output X[Window+1:max_len+Window+1]. If the text length exceeds max_len + Window, the translation continues several times, so an unlimited-length text can be handled; at the data level a training sample enters the model several times through the sliding window while the model text length remains 512, realizing the unlimited-length text training task under low-compute conditions. All texts in the current training txt file form a batch through the sliding window, giving a training set list [a1, a2, a3, a4, a5, a6, b1, b2, b3, b4, b5, b6, ...], where ai and bi denote training samples and a1 to a6 are the several inputs and outputs formed by sliding windows over one training text. For each training set list, several inputs and outputs formed by sliding windows are randomly deleted before grouping, which further increases the chance that different inputs and outputs of one text are combined within a batch;
S3, constructing a network model structure
For a text X of the project corpus, it is converted into an index list through token_Dict from step S1: the numerical index of each word is found in turn, so the training sample becomes a numerical sequence and then a tensor that is fed into the embedding representation layer of the neural network model. A GPT decoder and a low-dimensional-space attention mechanism are used, the position embedding codes are deleted, and text inference in sliding-window mode is realized without the length limitation of long-text training;
S4, model training
Existing deep learning techniques are added on top of the model structure of step S3: model training is performed with the Adam optimizer and the cross-entropy loss function. The model input data obtained in step S2 is substituted into the network structure obtained in step S3, and minimizing the loss function is the final objective of the model. The final purpose of model training is that, when the index of the true token at the i-th position of the optimization target Y is index, the probability that the input X yields that token at the i-th position of the LSAFormer neural network model is increased; that is, by training with the cross-entropy loss function, the model makes the text content fed into it predict the corresponding Y value through the LSAFormer neural network;
S5, model prediction
If the maximum length max_len is exceeded during inference, the window is translated by the Window length and inference continues. The text entered by the user is text; it is split into inputs by the word segmentation algorithm, where inputs is a list; the content fed into LSAFormer is ["[CLS]"] + inputs; the model outputs the corresponding token value, which is then appended to the input. Repeating these steps generates the text content.
2. The method for generating text by reasoning with a large model according to claim 1, wherein in step S3, a text X of the project corpus is converted into an index list through token_Dict from step S1; the numerical index of each word is found in turn through token_Dict, so the training sample becomes a numerical sequence and then a tensor. It then passes through an embedding layer of size [20000, 2048], where 2048 is the representation dimension of each character and 20000 is the size of the vocabulary token_Dict, giving an embedding representation matrix of size [12, 512, 2048], where 12 is the batch size; the word embedding representations of the 512 characters are then fed into the model, in which a multi-head attention mechanism combined with a feed-forward fully connected neural network forms one unit, repeated 30 times, i.e. 30 identical layers, giving the overall network structure, and the position embedding codes are deleted at the embedding representation layer. Following GPT, if the current input multi-head attention representation matrix is X, three matrices q = X W_Q, k = X W_K, v = X W_V are computed and q, k, v are split by the multi-head mechanism; in the original GPT and Transformer model structures W_Q, W_K, W_V all have the same dimension, and a low-dimensional-space attention mechanism is introduced, i.e. a parameter d_s smaller than d, usually set to 768, with X ∈ R^(n×d), W_Q ∈ R^(d×d_s), W_K ∈ R^(d×d_s), W_V ∈ R^(d×d), where n is the number of tokens fed into the model, i.e. the text length (equal to max_len), and d is the representation dimension of each character token in the text; usually n and d cannot be changed at will: n is the text length and generally depends on the application scenario, while the dimension d is the representation dimension of each token character, and the larger d is, the stronger the learning ability of the model, i.e. the larger d and n are, the stronger the model, but the computational efficiency then tends to be low; d_s = 768 is introduced, giving the three matrices q ∈ R^(n×d_s), k ∈ R^(n×d_s), v ∈ R^(n×d); after q, k, v are split into multiple heads, (q)(k)^T ∈ R^(n×n), and each element of (q)(k)^T has the same meaning as in the original Transformer; d_s << d, and d_s is an independently set parameter that does not grow as the values of d and n increase; if the matrix fed into the feed-forward fully connected neural network has size [12, 512, 2048], it is mapped through a fully connected layer of size [2048, 4800*2] to a matrix of size [12, 512, 4800*2], which is split into two matrices of the same size, res and gate, each of size [12, 512, 4800];
the gate matrix passes through a GELU activation function and is multiplied element-wise with res, giving a matrix of size [12, 512, 4800]; after a fully connected layer of size [4800, 2048], the output of the feed-forward fully connected neural network is again [12, 512, 2048]; this differs from the original Transformer, which raises the dimension and returns to the low dimension through an activation function: here the mapping goes directly to 4800*2 dimensions and is split into halves, avoiding the multiplication of the [12, 512, 4800*2] matrix with a [4800*2, 2048] fully connected layer as in the Transformer; finally a fully connected layer of size [2048, 20000] is applied, where 20000 is the total number of tokens.
CN202311321583.3A 2023-10-12 2023-10-12 Method for generating large model by reasoning text Pending CN117391079A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311321583.3A CN117391079A (en) 2023-10-12 2023-10-12 Method for generating large model by reasoning text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311321583.3A CN117391079A (en) 2023-10-12 2023-10-12 Method for generating large model by reasoning text

Publications (1)

Publication Number Publication Date
CN117391079A true CN117391079A (en) 2024-01-12

Family

ID=89436846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311321583.3A Pending CN117391079A (en) 2023-10-12 2023-10-12 Method for generating large model by reasoning text

Country Status (1)

Country Link
CN (1) CN117391079A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117875434A (en) * 2024-03-13 2024-04-12 中国科学技术大学 Financial large model length extrapolation method for expanding input context length
CN117875434B (en) * 2024-03-13 2024-06-04 中国科学技术大学 Financial large model length extrapolation method for expanding input context length

Similar Documents

Publication Publication Date Title
CN111611377B (en) Knowledge distillation-based multi-layer neural network language model training method and device
CN108121975B (en) Face recognition method combining original data and generated data
CN110570346B (en) Method for performing style migration on calligraphy based on cyclic generation countermeasure network
CN110413986A (en) A kind of text cluster multi-document auto-abstracting method and system improving term vector model
CN110413785A (en) A kind of Automatic document classification method based on BERT and Fusion Features
CN109977250B (en) Deep hash image retrieval method fusing semantic information and multilevel similarity
CN117391079A (en) Method for generating large model by reasoning text
CN112749253B (en) Multi-text abstract generation method based on text relation graph
CN110688834B (en) Method and equipment for carrying out intelligent manuscript style rewriting based on deep learning model
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN109815496A (en) Based on capacity adaptive shortening mechanism carrier production text steganography method and device
CN107579816A (en) Password dictionary generation method based on recurrent neural network
CN114462420A (en) False news detection method based on feature fusion model
CN112732864A (en) Document retrieval method based on dense pseudo query vector representation
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
Chen et al. News image captioning based on text summarization using image as query
CN115841119A (en) Emotional cause extraction method based on graph structure
CN116629324B (en) Optimization generation method for generating text repeated degradation phenomenon facing model
CN115424663B (en) RNA modification site prediction method based on attention bidirectional expression model
CN111897987B (en) Molecular structure diagram retrieval method based on evolution calculation multi-view fusion
CN113901820A (en) Chinese triplet extraction method based on BERT model
CN113468874A (en) Biomedical relation extraction method based on graph convolution self-coding
CN112257469A (en) Compression method of deep neural machine translation model for small mobile device
CN113535945A (en) Text type identification method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination