CN113902094A - Structure searching method of double-unit searching space facing language model - Google Patents

Structure searching method of double-unit searching space facing language model

Info

Publication number
CN113902094A
Authority
CN
China
Prior art keywords
unit
search
language model
search space
searching
Prior art date
Legal status
Pending
Application number
CN202111084940.XA
Other languages
Chinese (zh)
Inventor
余正涛 (Yu Zhengtao)
苗育华 (Miao Yuhua)
Current Assignee
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202111084940.XA
Publication of CN113902094A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a structure search method for a language-model-oriented dual-unit search space, in the field of artificial intelligence. The invention improves the search space of existing search strategies on the language model task and constructs a search space better suited to language modeling. An information storage unit is added inside the recurrent neural network unit to effectively store the front-end information of the sequence, so that the search space better matches the language model task; the added unit alleviates the inability of conventional recurrent unit structures to handle long-sequence dependence and improves the continuity of sequence semantic information. At the same time, adding the unit directly enlarges the search space and raises the probability of finding a better network structure.

Description

Structure searching method of double-unit searching space facing language model
Technical Field
The invention relates to a structure searching method of a language model-oriented double-unit searching space, belonging to the technical field of artificial intelligence.
Background
The design of the search space is the first, and an extremely important, step in neural architecture search research, since the search space determines the upper and lower bounds of model performance. However, the tension between the size of the search space on one hand and search speed and hardware requirements on the other makes its design difficult. On one hand, a huge search space has great potential for exploring networks but demands extremely high hardware support and time consumption; on the other hand, a smaller search space is friendlier to hardware and time but is very limited in its ability to mine network potential. Therefore, how to define a suitable search space that achieves the best search effect has become a problem to be solved in current structure search research.
Research on neural architecture search is still at an early stage, but experts in the field have proposed many excellent structure search methods and achieved good results. DARTS, currently the most popular neural architecture search method, constructs a minimal unit for recurrent structures: a directed acyclic graph is arranged inside the unit, the structure inside the unit is learned by gradient-based optimization, and the learned unit is connected recurrently to form the final model. A model based on such a recurrent unit can handle a certain amount of short-term sequence dependence, but when the sequence is long, the gradient from the far end of the sequence is difficult to propagate back to the current position, so the gradient vanishes and the semantic information of the sequence is interrupted. To address this problem, the invention studies the search space of structure search on the language model task and proposes a structure search method based on a dual-unit expanded search space.
Disclosure of Invention
The invention provides a structure search method for a language-model-oriented dual-unit search space, which is used to solve the problem that, when a sequence is long, the gradient at the far end of the sequence is difficult to back-propagate to the current position, so that the gradient vanishes and the semantic information of the sequence is interrupted.
The technical scheme of the invention is as follows: the structure search method of the language-model-oriented dual-unit search space comprises the following steps: first, construct the dual-unit search space;
secondly, search on the PTB data set, and select the structures with the smallest validation-set loss during the search as candidate unit structures;
and finally, enter an evaluation stage, and evaluate the candidate unit structures obtained in the search stage for a short time on the language model task to obtain the optimal unit structure.
As a further aspect of the present invention, the structure searching method based on the dual-unit search space includes the following specific implementation steps:
Step1, propose a dual-unit search space for the language model task, set up the search unit, and form the final recurrent neural network through the connection of units, thereby constructing the search space;
Step2, carry out the whole search stage on the PTB data set with the input parameters set, and train continuously for 50 epochs in total to obtain several different initial candidate cell structures; select the structures with the smallest validation-set loss during the search as candidate unit structures;
and Step3, evaluate the candidate unit structures obtained in the search stage for a short time on the language model task to obtain the optimal unit structure.
As a further aspect of the present invention, the dual-unit search space proposed in Step1 continues the overall framework of DARTS for the whole search space, i.e. a single unit is searched and the final recurrent neural network is then built by connecting units; unlike DARTS, two sub-units are arranged inside each unit: an information storage unit cell_ct and an information processing unit cell_ht; each sub-unit is a directed acyclic graph containing a plurality of nodes; the input of the information storage unit consists of the inputs at several moments before the current position of the sequence, so that the front-end information of the sequence can be stored effectively.
As a further aspect of the invention, the experimental parameters of the Step2 search phase mostly follow the settings in DARTS, with the following differences: the number of layers of the recurrent neural network is fixed at one, the word embedding size and the hidden layer size are both 300, and the batch size is 256; an information storage unit cell_ct and an information processing unit cell_ht are arranged inside each unit, the information storage unit contains 3 nodes and the information processing unit contains 8 nodes.
As a further scheme of the invention, the edges between nodes use the following four operation functions: tanh, relu, sigmoid and identity.
As a further scheme of the invention, in the Step2 search stage, different algorithms are used for the two alternating optimization steps: the network weights w are optimized with the stochastic gradient descent (SGD) algorithm with a learning rate of 20 and a weight decay of 5e-7; the structure weights alpha are optimized with the Adam algorithm with an initial learning rate of 3e-3 and a weight decay of 1e-3.
As a further scheme of the invention, for the parameter settings of the Step3 evaluation phase, the word embedding size and the hidden layer size of the model are expanded to 850, the batch size is 64, and the weights are optimized with the averaged stochastic gradient descent (ASGD) algorithm with an initial learning rate of 20 and a weight decay of 8e-7.
As a further scheme of the invention, in Step3 the candidate unit structures obtained in the search stage are evaluated for a short time to obtain the optimal unit structure; after the optimal unit structure is obtained, its network weights are randomly re-initialized and it is trained on the training set for a longer time until it converges.
To verify the transferability of the unit structures searched by the present invention, the optimal unit structure searched on the PTB data set is directly transferred to the WT2 data set for evaluation.
The invention has the beneficial effects that:
1. The structure search method of the language-model-oriented dual-unit search space provided by the invention improves the search space of existing search strategies on the language model task and constructs a search space better suited to language modeling. An information storage unit is added inside the recurrent neural network unit to effectively store the front-end information of the sequence, so that the search space better matches the language model task; the added unit alleviates the inability of conventional recurrent unit structures to handle long-sequence dependence and improves the continuity of sequence semantic information. At the same time, adding the unit directly enlarges the search space and raises the probability of finding a better network structure.
2. The invention improves the search unit framework of DARTS and proposes a dual-unit framework. Two sub-units are arranged inside each unit, and each sub-unit is a directed acyclic graph containing a plurality of nodes. Adding the information storage unit also directly enlarges the search space and improves the probability of finding an excellent network structure. Experiments on the Penn Treebank (PTB, vocabulary 10,000) and Wikitext-2 (WT2, vocabulary 33,000) data sets show that on the PTB data set the perplexity is reduced by 0.4 points relative to the baseline method, achieving a better result. The transferability of the invention was also verified on the WT2 data set, where the perplexity on the test set is reduced by 0.2 points compared with the baseline method.
Drawings
FIG. 1 is a model diagram of the structure search method of the language-model-oriented dual-unit search space according to the present invention;
FIG. 2 is a schematic illustration of the perplexity of the five candidate structures of the present invention;
FIG. 3 is a schematic diagram of the information storage unit structure searched on the PTB data set that corresponds to the above perplexity;
FIG. 4 is a schematic diagram of the information processing unit structure searched on the PTB data set that corresponds to the above perplexity;
FIG. 5 is a graph comparing the perplexity performance of the present invention with the DARTS method.
Detailed Description
Example 1: as shown in fig. 1 to 5, the structure search method of the language-model-oriented dual-unit search space includes: first, construct a dual-unit search space; secondly, search on the PTB data set and select the structures with the smallest validation-set loss during the search as candidate unit structures; finally, enter the evaluation stage and evaluate the candidate unit structures obtained in the search stage for a short time on the language model task to obtain the optimal unit structure.
The structure searching method based on the double-unit searching space comprises the following specific implementation steps:
Step1, propose a dual-unit search space for the language model task, set up the search unit, and form the final recurrent neural network through the connection of units, thereby constructing the search space;
the two-cell search space proposed in Step1 is a large frame continuation of the entire search spaceUnlike DARTS, in which a unit is searched and then connected to form a final recurrent neural network, two sub-units are provided in the unit at each time in the recurrent neural network: information storage unit cellctAnd an information processing unit cellhtAs shown in fig. 1, each unit is a directed acyclic graph including a plurality of nodes; the input of the information storage unit is the input x at the first five moments of the sequencet-1,xt-2,xt-3,xt-4,xt-5So as to effectively store the front-end information of the sequence. And performing linear transformation and addition on the five inputs, and then obtaining the input of the first node in the cellct unit through an activation function tanh, wherein the output of the unit is obtained by adding and averaging the outputs of all intermediate nodes. The addition of the information storage unit also directly enlarges the search space and improves the probability of searching the excellent network structure. The input of the information processing unit is the input x of the current time of the sequencetHidden state h at the previous momentt-1And output c of the information storage unitt. The input of the first node in the cell and the output of the cell are processed in the same way as the information storage cell.
Step2, carry out the whole search stage on the PTB data set with the input parameters set, and train continuously for 50 epochs in total to obtain several different initial candidate cell structures; select the structures with the smallest validation-set loss during the search as candidate unit structures;
the experimental parameters for the search phase in Step2 mostly follow the setup in DARTS, which is the first choice because both reinforcement learning and evolutionary algorithms require a large enough GPU cluster to search, and DARTS is much less demanding in terms of hardware and more efficient in search speed than the first two methods. The different parameters are: the number of layers of the recurrent neural network is determined as one layer, the word embedding size and the hidden layer size are both 300, and the batch size is 256; information storage unit cellc is arranged in each unittAnd an information processing unit cellhtThe information storage unit internally comprises 3 nodes, and the information processing unit internally comprises 8 nodes.Edges between nodes are operated by adopting the following four operation functions, wherein the four operation functions are tanh, relu, sigmoid and identity.
In the Step2 search stage, different algorithms are used for the two alternating optimization steps: the network weights w are optimized with the stochastic gradient descent (SGD) algorithm with a learning rate of 20 and a weight decay of 5e-7; the structure weights alpha are optimized with the Adam algorithm with an initial learning rate of 3e-3 and a weight decay of 1e-3.
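For illustration, these two optimizers could be instantiated roughly as follows; the parameter lists are stand-ins for the real weight and structure parameters of the network, and only the algorithm choices and hyperparameters come from the description.

```python
import torch

# Stand-ins for the two parameter groups; in practice these come from the model
w_params = [torch.nn.Parameter(torch.randn(300, 300))]     # network weights w
alpha_params = [torch.nn.Parameter(torch.randn(8, 4))]     # structure weights alpha

# Search-stage optimizers: SGD for w, Adam for alpha
w_optimizer = torch.optim.SGD(w_params, lr=20.0, weight_decay=5e-7)
alpha_optimizer = torch.optim.Adam(alpha_params, lr=3e-3, weight_decay=1e-3)
```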
The searching algorithm mainly comprises four steps:
1) Construct a directed acyclic graph containing a plurality of nodes, with an ordered node set node^(1), node^(2), ..., node^(n).
2) Place all operations that can be taken between every two nodes, thereby making the discrete network structure continuous. Here o^(i,j) (i < j) denotes the operation on the edge from node i to node j, and the input of node j is obtained by applying the operations to all nodes with index smaller than j, as follows:

    x^{(j)} = \sum_{i<j} o^{(i,j)}(x^{(i)})

The operation o^(i,j) is usually selected from a set of candidate operations; for a recurrent neural network, these candidates are a few activation functions.
3) During the joint optimization of the structure weights α and the network weights w, find the operation corresponding to the largest weight α. For each set of operations o^(i,j), the invention defines a set of coefficients α^(i,j) = {α_o^(i,j)}. In practice, a mixed operation \bar{o}^(i,j) is used during search training: softmax is applied to the structure weights and all candidate operations are combined with those weights, as follows:

    \bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})} \, o(x)
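A minimal sketch of this continuous relaxation follows, assuming the four candidate operations named earlier; the class name MixedOp and the omission of the per-edge affine transforms are simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate operations on one edge: tanh, relu, sigmoid, identity
OPS = [torch.tanh, torch.relu, torch.sigmoid, lambda x: x]

class MixedOp(nn.Module):
    """One edge under the continuous relaxation: a softmax over the structure
    weights alpha mixes all candidate operations, as in the formula above."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(len(OPS)))  # one weight per operation

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, OPS))
```

At the end of the search, the discrete structure is recovered by keeping, on each edge, the operation whose alpha is largest.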
4) The whole recurrent neural network has two groups of parameters to be trained: the structure parameters α of the network and the weight parameters w of the network. The two groups of parameters are optimized alternately. First, the invention randomly initializes the structure parameters α to obtain an initialized network; then the network weights w are trained on the training set, and w is updated by reducing the training-set loss L_train, while the network structure parameters α are updated according to the validation-set loss L_val. Through this alternating optimization the optimal network structure is obtained, and the search phase of the NAS then ends. The network structure is fixed according to the structure parameters α obtained in the search stage, all network weights w are randomly re-initialized, and the network is trained on the training set again to obtain the final network.
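The alternating optimization can be sketched as a first-order procedure like the following; the function name, the data loaders, and the zipping of training and validation batches are assumptions, and the second-order gradient approximation used by DARTS is omitted for brevity.

```python
import torch

def search_epoch(model, loss_fn, train_loader, val_loader,
                 w_optimizer, alpha_optimizer):
    """One epoch of the alternating optimization: the structure weights alpha
    are updated on the validation loss L_val, then the network weights w are
    updated on the training loss L_train."""
    for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
        # update structure weights alpha using the validation loss
        alpha_optimizer.zero_grad()
        loss_fn(model(x_val), y_val).backward()
        alpha_optimizer.step()

        # update network weights w using the training loss
        w_optimizer.zero_grad()
        loss_fn(model(x_tr), y_tr).backward()
        w_optimizer.step()
```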
Step3, the evaluation stage. For the parameter settings of the evaluation stage, the word embedding size and the hidden layer size of the model are expanded to 850, the batch size is 64, and the weights w are optimized with the averaged stochastic gradient descent (ASGD) algorithm with an initial learning rate of 20 and a weight decay of 8e-7. The invention evaluates the five candidate unit structures obtained in the search stage for a short time on the language model task to obtain the optimal unit structure. The weights w of each candidate unit structure are randomly initialized and trained for 300 epochs on the training set, and the unit structure with the lowest validation perplexity at that point is selected as the optimal structure. Fig. 2 shows the perplexity of the five candidate unit structures when each is trained for 300 epochs; the lowest perplexity is 61.79, and the dotted line is the validation-set perplexity of the structure searched by DARTS when trained for 300 epochs (lower perplexity is better). The unit structures corresponding to this perplexity are shown in figures 3 and 4: fig. 3 is the information storage unit cell_ct searched on the PTB data set, and fig. 4 is the information processing unit cell_ht searched on the PTB data set. After the optimal unit structure is obtained, its network weights w are randomly re-initialized and it is trained on the training set for a longer time until it converges. Table 1 shows the perplexity of the optimal unit structure searched by the present invention after full training, compared with the baseline method and other methods.
Table 1: Perplexity comparison of the present invention with other methods on the PTB data set
In Table 1, the second row shows the hand-designed network, the third row the other NAS methods, and the fourth row the baseline model and the results of the present invention. Compared with the baseline model, the perplexity of the method is reduced by 0.6 points on the validation set and by 0.4 points on the test set, achieving better performance.
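For illustration, the evaluation-stage optimizer settings given in Step3 could be instantiated as follows; the parameter list is a stand-in for the weights of the re-built model, and only the algorithm and hyperparameters come from the description.

```python
import torch

# Stand-in parameters; in the evaluation stage the searched cell is re-built
# with embedding and hidden sizes of 850 and trained with batch size 64
eval_params = [torch.nn.Parameter(torch.randn(850, 850))]

# Evaluation-stage optimizer: averaged stochastic gradient descent (ASGD)
eval_optimizer = torch.optim.ASGD(eval_params, lr=20.0, weight_decay=8e-7)
```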
Step4, to verify the transferability of the unit structure searched by the present invention, the optimal unit structure searched on the PTB data set is directly transferred to the WT2 data set for evaluation. The embedding size and the hidden layer size are both set to 700 and the weight decay is 5e-7. Table 2 shows the perplexity results on the test set after transferring to the WT2 data set and training.
Table 2: Results of directly transferring the structure searched on the PTB data set to the WT2 data set
In Table 2, the second row is the manually designed network, the third row shows network structures searched on the PTB data set by other NAS methods and transferred to the WT2 data set, and the last row is the network structure searched on the PTB data set by the present invention and transferred to the WT2 data set. Compared with the baseline model, the method achieves a better result, with the perplexity on the test set reduced by 0.2 points.
Step5, verify the degree of match of the constructed dual-unit search space: the match between the currently constructed search space and the task is analyzed by examining the perplexity on sentences of different lengths. Specifically, the test set is counted and grouped as shown in Table 3:
table 3: PTB test set size and grouping
As shown in Table 3, the test set contains 3761 sentences; the shortest sentence has 1 word and the longest has 77 words. The invention divides the test set into eight groups according to word count and tests the performance of the model on each group. The result is shown in FIG. 5, where the abscissa is the sequence length and the ordinate is the perplexity. The grouping experiment of FIG. 5 shows that, compared with the DARTS method, the method of the invention models long sequences better, i.e. the model's ability to capture long sequences is enhanced.
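The grouping experiment can be reproduced with a short routine like the one below, which bins test sentences by word count and computes perplexity within each bin; the function and argument names are illustrative, not part of the invention.

```python
import math
from collections import defaultdict

def perplexity_by_length(sentences, nll_per_sentence, bins):
    """Group test sentences by word count and compute perplexity per group.
    `sentences` is a list of token lists, `nll_per_sentence` holds the summed
    negative log-likelihood of each sentence under the model, and `bins` is a
    list of (low, high) word-count ranges."""
    totals = defaultdict(lambda: [0.0, 0])          # bin -> [total NLL, token count]
    for tokens, nll in zip(sentences, nll_per_sentence):
        for low, high in bins:
            if low <= len(tokens) <= high:
                totals[(low, high)][0] += nll
                totals[(low, high)][1] += len(tokens)
                break
    # perplexity = exp(average per-token NLL) within each bin
    return {b: math.exp(nll_sum / count) for b, (nll_sum, count) in totals.items()}
```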
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (8)

1. The structure search method of the language-model-oriented dual-unit search space is characterized in that: first, a dual-unit search space is constructed;
secondly, a search is carried out on the PTB data set, and the structures with the smallest validation-set loss during the search are selected as candidate unit structures;
and finally, an evaluation stage is entered, and the candidate unit structures obtained in the search stage are evaluated for a short time on the language model task to obtain the optimal unit structure.
2. The structure search method of a language model-oriented two-unit search space according to claim 1, wherein: the structure searching method based on the double-unit searching space comprises the following specific implementation steps:
Step1, propose a dual-unit search space for the language model task, set up the search unit, and form the final recurrent neural network through the connection of units, thereby constructing the search space;
Step2, carry out the whole search stage on the PTB data set with the input parameters set, and train continuously for 50 epochs in total to obtain several different initial candidate cell structures; select the structures with the smallest validation-set loss during the search as candidate unit structures;
and Step3, evaluate the candidate unit structures obtained in the search stage for a short time on the language model task to obtain the optimal unit structure.
3. The structure search method of a language model-oriented two-unit search space according to claim 1, wherein: the dual-unit search space proposed in Step1 continues the overall framework of DARTS for the whole search space, i.e. a single unit is searched and the final recurrent neural network is then built by connecting units; unlike DARTS, two sub-units are arranged inside each unit: an information storage unit cell_ct and an information processing unit cell_ht; each sub-unit is a directed acyclic graph containing a plurality of nodes; the input of the information storage unit consists of the inputs at several moments before the current position of the sequence, so that the front-end information of the sequence can be stored effectively.
4. The structure search method of a language model-oriented two-unit search space according to claim 1, wherein: the experimental parameters of the Step2 search phase mostly follow the settings in DARTS, with the following differences: the number of layers of the recurrent neural network is fixed at one, the word embedding size and the hidden layer size are both 300, and the batch size is 256; an information storage unit cell_ct and an information processing unit cell_ht are arranged inside each unit, the information storage unit contains 3 nodes and the information processing unit contains 8 nodes.
5. The structure search method of a language model-oriented two-unit search space according to claim 4, wherein: the edges between nodes use the following four operation functions: tanh, relu, sigmoid and identity.
6. The structure search method of a language model-oriented two-unit search space according to claim 1, wherein: in the Step2 search stage, different algorithms are used for the two alternating optimization steps: the network weights w are optimized with the stochastic gradient descent (SGD) algorithm with a learning rate of 20 and a weight decay of 5e-7; the structure weights alpha are optimized with the Adam algorithm with an initial learning rate of 3e-3 and a weight decay of 1e-3.
7. The structure search method of a language model-oriented two-unit search space according to claim 1, wherein: for the parameter settings of the Step3 evaluation phase, the word embedding size and the hidden layer size of the model are expanded to 850, the batch size is 64, and the weights are optimized with the averaged stochastic gradient descent (ASGD) algorithm with an initial learning rate of 20 and a weight decay of 8e-7.
8. The structure search method of a language model-oriented two-unit search space according to claim 1, wherein: in Step3 the candidate unit structures obtained in the search stage are evaluated for a short time to obtain the optimal unit structure; after the optimal unit structure is obtained, its network weights are randomly re-initialized and the unit is trained on the training set for a longer time until it converges.
CN202111084940.XA 2021-09-16 2021-09-16 Structure searching method of double-unit searching space facing language model Pending CN113902094A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111084940.XA CN113902094A (en) 2021-09-16 2021-09-16 Structure searching method of double-unit searching space facing language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111084940.XA CN113902094A (en) 2021-09-16 2021-09-16 Structure searching method of double-unit searching space facing language model

Publications (1)

Publication Number Publication Date
CN113902094A true CN113902094A (en) 2022-01-07

Family

ID=79028602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111084940.XA Pending CN113902094A (en) 2021-09-16 2021-09-16 Structure searching method of double-unit searching space facing language model

Country Status (1)

Country Link
CN (1) CN113902094A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228035A1 (en) * 2013-03-15 2019-07-25 Locus Lp Weighted analysis of stratified data entities in a database system
JP2016130950A (en) * 2015-01-14 2016-07-21 京セラドキュメントソリューションズ株式会社 Data processing apparatus and data processing method
CN112215332A (en) * 2019-07-12 2021-01-12 华为技术有限公司 Searching method of neural network structure, image processing method and device
WO2021014986A1 (en) * 2019-07-22 2021-01-28 ソニー株式会社 Information processing method, information processing device, and program
US20210056378A1 (en) * 2019-08-23 2021-02-25 Google Llc Resource constrained neural network architecture search
CN111126037A (en) * 2019-12-18 2020-05-08 昆明理工大学 Thai sentence segmentation method based on twin cyclic neural network
CN111428854A (en) * 2020-01-17 2020-07-17 华为技术有限公司 Structure searching method and structure searching device
CN111860495A (en) * 2020-06-19 2020-10-30 上海交通大学 Hierarchical network structure searching method and device and readable storage medium
CN111931901A (en) * 2020-07-02 2020-11-13 华为技术有限公司 Neural network construction method and device
CN113191489A (en) * 2021-04-30 2021-07-30 华为技术有限公司 Training method of binary neural network model, image processing method and device
WO2023035986A1 (en) * 2021-09-10 2023-03-16 Oppo广东移动通信有限公司 Image processing method, electronic device and computer storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WAN, Q. et al.: "Dual-cell differentiable architecture search for language modeling", Journal of Intelligent & Fuzzy Systems, vol. 41, no. 02, 15 September 2021 (2021-09-15), pages 3985-3992 *
YINQIAO LI et al.: "Learning architectures from an extended search space for language modeling", arXiv, 6 March 2020 (2020-03-06), pages 1-11 *
WAN Quan et al.: "Fully automatic cell structure search for language models" (面向语言模型的全自动单元结构搜索), Journal of Chinese Computer Systems (小型微型计算机系统), vol. 43, no. 11, 8 October 2021 (2021-10-08), pages 2308-2313 *
LI Jianming et al.: "Constrained differentiable neural architecture search in an optimized search space" (优化搜索空间下带约束的可微分神经网络架构搜索), Journal of Computer Applications (计算机应用), vol. 42, no. 01, 17 April 2021 (2021-04-17), pages 44-49 *

Similar Documents

Publication Publication Date Title
Devlin et al. Neural program meta-induction
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN107729999A (en) Consider the deep neural network compression method of matrix correlation
US20230112223A1 (en) Multi-stage fpga routing method for optimizing time division multiplexing
CN104731882B (en) A kind of adaptive querying method that weighting sequence is encoded based on Hash
CN111144555A (en) Recurrent neural network architecture search method, system and medium based on improved evolutionary algorithm
CN109697289A (en) It is a kind of improved for naming the Active Learning Method of Entity recognition
CN110738362A (en) method for constructing prediction model based on improved multivariate cosmic algorithm
Ding et al. NAP: Neural architecture search with pruning
Nugroho et al. Hyper-parameter tuning based on random search for densenet optimization
CN112381208A (en) Neural network architecture searching method and system with gradual depth optimization
CN111667043B (en) Chess game playing method, system, terminal and storage medium
CN111191785A (en) Structure searching method based on expanded search space
Wan Deep learning: Neural network, optimizing method and libraries review
US11461656B2 (en) Genetic programming for partial layers of a deep learning model
CN113539372A (en) Efficient prediction method for LncRNA and disease association relation
Lan et al. Accelerated device placement optimization with contrastive learning
CN113313250B (en) Neural network training method and system adopting mixed precision quantization and knowledge distillation
Hornby et al. Accelerating human-computer collaborative search through learning comparative and predictive user models
CN108805280A (en) A kind of method and apparatus of image retrieval
CN113902094A (en) Structure searching method of double-unit searching space facing language model
Li et al. Pruner to predictor: An efficient pruning method for neural networks compression
KR102559605B1 (en) Method and apparatus for function optimization
CN115345303A (en) Convolutional neural network weight tuning method, device, storage medium and electronic equipment
Yuan et al. Uncertainty-based network for few-shot image classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination