CN116976351B - Language model construction method based on subject entity and subject entity recognition device - Google Patents
- Publication number
- CN116976351B CN116976351B CN202311228568.4A CN202311228568A CN116976351B CN 116976351 B CN116976351 B CN 116976351B CN 202311228568 A CN202311228568 A CN 202311228568A CN 116976351 B CN116976351 B CN 116976351B
- Authority
- CN
- China
- Prior art keywords
- subject
- subject entity
- language model
- language
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F40/295 — Named entity recognition
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F40/216 — Parsing using statistical methods
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Classification techniques
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a subject-entity-based language model construction method, comprising the following steps: acquiring teaching resources to construct a corresponding initial data set; performing subject entity screening on the initial data set to construct a corresponding subject entity library; randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words; constructing a language neural network comprising a pre-coding layer, a feature extraction layer and a prediction layer; training the language neural network with the data set to obtain a subject entity language model for mining subject entities; and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text. The invention also provides a subject entity recognition device. The language model constructed by the method can acquire massive prior knowledge in the education field, so that a more comprehensive subject entity data set is constructed.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a subject entity-based language model construction method and a subject entity identification device.
Background
In the field of natural language processing, general-purpose language models have developed rapidly and been widely applied in recent years. The most notable progress is the application of deep learning techniques such as the Transformer model and the attention mechanism, whose development and application have brought higher accuracy and efficiency to the field of natural language processing.
However, conventional general-purpose language models may encounter the following problems when processing text of a particular domain. First, vertical-domain vocabulary is sparse: although a general-purpose language model has a large vocabulary, it does not necessarily contain the professional terms or new words of certain specific domains, leading to misjudgments or errors when processing domain-specific text. Second, context handling is inaccurate: a general-purpose model may not understand the common contexts of a specific domain, so ambiguities can arise. Third, a general-purpose model must process large amounts of data and text, which can affect processing speed.
With the development of artificial intelligence and natural language processing technology, building language models for specific fields has become feasible. Training a language model in a vertical domain improves the accuracy and efficiency of natural language processing in that domain and can solve the above problems.
Patent document CN111931020A discloses a formula labeling method, device, equipment and storage medium, comprising: acquiring a formula to be labeled; calling a formula labeling model obtained by training on formula labeling data on the basis of a target language characterization model corresponding to the formula-related subject, where the target language characterization model is obtained by expanding the vocabulary of the formula-related discipline on top of at least a basic language characterization model, and the formula labeling data comprises at least sample formula data and the labels corresponding to the sample formula data; and predicting the label of the formula to be labeled with the formula labeling model. That method only uses a basic language characterization model to extract discipline-related vocabulary from formulas to complete labeling; the model it adopts is a simple language model, and it cannot extract the corresponding vocabulary for labeling with respect to the association between text and discipline.
Patent document CN112580361A discloses a formula and text recognition model method based on a unified attention mechanism, which comprises: recognizing presentation LaTeX or content LaTeX and obtaining recognition results, parsing the results into a LaTeX semantic tree, and traversing the semantic tree; segmenting the LaTeX sequence with a statistical word segmentation method and segmenting the natural language in the question stem, apart from the mathematical formulas, with the WordPiece method to form a token sequence; and encoding the token sequence with a neural network, transforming the variable-length token sequence into a fixed-length hidden-space characterization, then mapping the output to knowledge points with a feed-forward neural network to complete knowledge-point labeling. That method recognizes the natural language in the question stem to obtain knowledge-point annotations, but it fails to consider the subject field of the current text, so irrelevant vocabulary may be extracted.
Disclosure of Invention
The invention mainly aims to provide a subject entity-based language model construction method and a subject entity identification device, and the language model constructed by the method can acquire massive priori knowledge in the education field so as to construct a more comprehensive subject entity data set.
In order to achieve the first object, the present invention provides a subject entity-based language model construction method, comprising the steps of:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
subject entity screening is carried out on the initial data set so as to construct a corresponding subject entity library;
randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;
the method comprises the steps of constructing a language neural network based on a Transformer, wherein the language neural network comprises a pre-coding layer, a feature extraction layer and a prediction layer, the pre-coding layer is used for converting an input text into word codes and corresponding position codes, the feature extraction layer generates associated feature values of different angles among characters in the input text according to the input word codes and the position codes, and the prediction layer outputs a prediction result according to the input associated feature values, wherein the prediction result comprises subject entities related to the text and corresponding text positions;
training the language neural network by adopting a data set to obtain a subject entity language model for mining subject entities;
and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
According to the invention, text mining and language model techniques from the NLP field are comprehensively utilized: entity vocabulary is extracted through new-word mining, and the language model is improved and optimized in combination with downstream language model training tasks. Raising the difficulty of the training task improves the trained model's ability to recognize entity vocabulary, so that richer information is fused into the language model.
Preferably, the subject entities of the initial data set are screened based on word frequency, solidification degree, and left- and right-neighbor word entropy, targeting the phrase characteristics of Chinese subject entities and thereby improving screening accuracy.
Specifically, the screening process is specifically as follows:
matching the initial data set by adopting a regular expression to obtain an original character string set containing Chinese, english characters and numbers;
traversing the original character string set according to the preset character string length, and filtering the pure digital character strings to obtain a candidate character string set;
screening character strings with occurrence times exceeding a word frequency threshold value in a candidate character string set according to the preset word frequency threshold value to construct corresponding initial candidate phrase;
analyzing the tightness between the characters of each initial candidate phrase, and retaining the initial candidate phrases that meet the preset solidification degree;

and calculating the left-neighbor entropy and right-neighbor entropy of each initial candidate phrase meeting the solidification degree, and retaining those whose left-neighbor and right-neighbor entropies exceed the preset values as the optimal candidate phrases, thereby constructing the subject entity library.
Specifically, the solidification degree of an initial candidate phrase (illustrated for a four-character string "ABCD") is calculated as follows:

\[ S(\mathrm{ABCD}) = \log \min\!\left( \frac{p(\mathrm{ABCD})}{p(\mathrm{A})\,p(\mathrm{BCD})},\; \frac{p(\mathrm{ABCD})}{p(\mathrm{AB})\,p(\mathrm{CD})},\; \frac{p(\mathrm{ABCD})}{p(\mathrm{ABC})\,p(\mathrm{D})} \right) \]

where \(S(\mathrm{ABCD})\) represents the solidification degree of the string "ABCD" and \(p(\cdot)\) represents the frequency of occurrence of a string in the original character string set, thereby ensuring the tightness from character to character in the candidate phrase.
Specifically, during training, a cross entropy loss function is adopted as the optimization objective of the language neural network, and iterative optimization is performed using a gradient descent method to update the parameters of the language neural network.
Specifically, the cross entropy loss function is expressed as follows:

\[ L = -\sum_{i=1}^{n} y_i \log \hat{y}_i \]

where \(y_i\) represents the i-th element value of the real label of sample x, \(\hat{y}_i\) represents the predicted probability that sample x belongs to the i-th category, and n represents the total number of categories of the prediction result.
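As a hedged numeric sketch of this objective (toy values in plain Python rather than a deep-learning framework; the helper names are illustrative), the cross entropy loss \(L = -\sum_i y_i \log \hat{y}_i\) and one plain gradient-descent update can be written as:

```python
import math

def cross_entropy(y_true, y_pred):
    # L = -sum_i y_i * log(yhat_i); only nonzero label entries contribute.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sgd_step(logits, y_true, lr=0.5):
    # For softmax followed by cross entropy, the gradient of the loss
    # with respect to the logits is simply (yhat - y).
    y_pred = softmax(logits)
    return [z - lr * (p - t) for z, p, t in zip(logits, y_pred, y_true)]

y = [0.0, 1.0, 0.0]           # one-hot real label of a 3-class sample
logits = [0.0, 0.0, 0.0]      # uniform initial prediction
loss0 = cross_entropy(y, softmax(logits))   # log(3) at the uniform prediction
logits = sgd_step(logits, y)
loss1 = cross_entropy(y, softmax(logits))   # smaller after one update
```

One update moves probability mass toward the true class, so the loss decreases, mirroring the iterative optimization described in the claim.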
Specifically, the initial dataset needs to be preprocessed before subject entity screening. Preprocessing includes full-width to half-width conversion, conversion of traditional characters into simplified characters, and removal of spaces, line breaks and special characters, avoiding the problem of unidentifiable characters and phrases.
Preferably, the pre-coding layer maps each character of the text into an initialized vector matrix and takes the position information of the characters as additional vectors, constructing the corresponding word codes and position codes so that the associated feature values carry the positional characteristics of characters and phrases.
Specifically, the feature extraction layer comprises a multi-head self-attention mechanism unit, a fully connected feed-forward network, and a residual connection and normalization unit;

the multi-head self-attention mechanism unit searches for association relations of different angles between the input characters and concatenates the comprehensive association features captured in the different subspaces;

the fully connected feed-forward network performs a nonlinear transformation on the captured comprehensive association features to obtain the corresponding predicted association features;

and the residual connection and normalization unit integrates the association features and the predicted association features through residual addition, followed by a normalization operation, to output the final associated feature values.
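Since the claims describe this layer only at a high level, a compact single-head NumPy sketch can make the data flow concrete (the patent's network is multi-head; one head with random toy weights is shown here for brevity, and the function names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_block(x, wq, wk, wv, w1, w2):
    # Scaled dot-product self-attention over all positions.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    x = layer_norm(x + attn @ v)                       # residual + normalization
    x = layer_norm(x + np.maximum(x @ w1, 0.0) @ w2)   # feed-forward (ReLU) + residual
    return x

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))                  # 5 characters, d-dim codes
ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(5)]
y = encoder_block(x, *ws)                        # associated feature values
```

Each row of `y` is the associated feature value for one character, combining information from every other position through the attention weights.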
To achieve the second object of the present invention, there is provided a subject entity recognition apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, the computer memory storing the subject entity language model described above; when executing the computer program, the computer processor performs the following step:
the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Compared with the prior art, the invention has the beneficial effects that:
the new word discovery method based on unsupervised learning is used for mining subject entities in the education field, simultaneously, word codes and position codes of the subject entities are combined to construct corresponding association features, and more implicit relations between characters are obtained based on the association features, so that a corresponding subject entity language model is constructed; through the subject entity model, a more comprehensive subject entity data set can be obtained, so that training tasks of all downstream NLPs can be better enabled.
Drawings
FIG. 1 is a flowchart of a subject entity-based language model construction method provided in this embodiment;
FIG. 2 is a schematic diagram of a pre-coding layer of a subject entity language model according to the present embodiment;
FIG. 3 is a schematic diagram of a feature extraction layer of the subject entity language model according to the present embodiment;
fig. 4 is a schematic diagram of a prediction layer of a subject entity language model according to the present embodiment.
Detailed Description
In order to more clearly describe the technical scheme in the embodiment of the invention, the invention is described in detail below with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
As shown in fig. 1, a subject entity-based language model construction method includes the following steps:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
more specifically, in this embodiment, teaching resources under the "artificial intelligence" specialty are used as sources, more than 500 teaching materials including 20 courses under the specialty such as "machine learning", "deep learning", "advanced mathematics", and "high-grade mathematics" are collected, paper data under the specialty is collected, and multi-modal data in the specialty field such as course video data, PPT courseware data, etc. are collected. The data of teaching materials, papers and PPT courseware are transcribed into character form by using OCR technology. And for course video data, converting the expression of the teacher in the video into a character form by using an ASR speech-text conversion technology, and obtaining an original corpus.
For the acquired corpus, the collected text data is cleaned using data preprocessing techniques. Cleaning operations include full-width to half-width conversion, conversion of traditional characters into simplified characters, and removal of spaces, line breaks, special characters and the like. After data cleaning, an initial dataset of the professional domain is obtained.
Subject entity screening is carried out on the initial data set so as to construct a corresponding subject entity library;
more specifically, in this embodiment, word frequency, solidification degree, left and right adjacent word entropy are sequentially used, and filtering is performed according to a set threshold value, so as to obtain an entity library of the professional discipline of "artificial intelligence".
First, the original corpus is read, and the regular expression "[\u4E00-\u9FFFa-zA-Z0-9]+" is used to match Chinese characters, English letters and digits, obtaining the original character strings and recording their lengths.
After the original character strings are obtained, the next step is traversal. The candidate-string length parameter is set to 5, i.e., character combinations of length 1 to 5 are selected as candidate strings. For example, traversing the phrase "gradient descent method" ("梯度下降法") from its first character yields the candidate strings "梯", "梯度", "梯度下", "梯度下降" and "梯度下降法". By traversing all original strings, all candidate string data can be obtained.
After the candidate strings are obtained, pure-number strings are filtered out. Word frequency is then counted: the number of occurrences of each candidate string in the original strings is computed and used for filtering. A string below the word-frequency threshold is not added as a candidate phrase. For example, "gradient descent" occurs more than 3 times in the original strings, so it is added as a candidate phrase.
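The extraction and frequency-filtering steps above can be sketched as follows (the corpus, threshold values and function name are illustrative, not the patent's data):

```python
import re
from collections import Counter

def extract_candidates(corpus, max_len=5, min_freq=3):
    # Match runs of Chinese characters, English letters and digits.
    strings = re.findall(r"[\u4E00-\u9FFFa-zA-Z0-9]+", corpus)
    counts = Counter()
    for s in strings:
        # Enumerate all substrings of length 1..max_len (the traversal step).
        for i in range(len(s)):
            for n in range(1, max_len + 1):
                if i + n <= len(s):
                    counts[s[i:i + n]] += 1
    # Filter pure-number strings, then apply the word-frequency threshold.
    return {g: c for g, c in counts.items()
            if not g.isdigit() and c >= min_freq}

corpus = "梯度下降法 梯度下降法 梯度下降法 learning rate 0.01"
cands = extract_candidates(corpus)   # "梯度下降" survives; "0" and "01" do not
```

Only strings that occur at least `min_freq` times become candidate phrases, matching the word-frequency filtering described above.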
After the candidate phrases are obtained, the solidification degree of each candidate phrase is calculated. The solidification degree measures the tightness between the characters of a phrase fragment: under the "artificial intelligence" specialty, a phrase such as "gradient descent" ("梯度下降") usually occurs as a whole and therefore has a high solidification degree, while chance combinations of fragments such as "network" and "intelligence" have a low solidification degree. Taking the calculation for "gradient descent" as an example:

\[ S(\text{梯度下降}) = \log \min\!\left( \frac{p(\text{梯度下降})}{p(\text{梯})\,p(\text{度下降})},\; \frac{p(\text{梯度下降})}{p(\text{梯度})\,p(\text{下降})},\; \frac{p(\text{梯度下降})}{p(\text{梯度下})\,p(\text{降})} \right) \]

where S(梯度下降) represents the solidification degree of "gradient descent", p(·) represents the frequency of a string in the original character strings, min takes the minimum over all binary splits, and the logarithm of that minimum gives the final solidification degree. The solidification-degree threshold is set to 1.8: phrases exceeding the threshold are retained for the next filtering step, and candidate phrases below it are removed;
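A minimal sketch of this calculation, assuming the split-product form of the solidification degree — S(G) = log of the minimum over binary splits of p(G) / (p(left) · p(right)) — and toy frequency counts rather than real corpus statistics:

```python
import math

def solidification(phrase, counts, total):
    # p(.) is taken as relative frequency over the candidate-string counts.
    p = lambda s: counts[s] / total
    ratios = [p(phrase) / (p(phrase[:i]) * p(phrase[i:]))
              for i in range(1, len(phrase))]
    return math.log(min(ratios))

# Toy counts: "梯度下降" almost always occurs as a whole phrase.
counts = {"梯": 12, "度下降": 11, "梯度": 12, "下降": 14,
          "梯度下": 11, "降": 15, "梯度下降": 10}
S = solidification("梯度下降", counts, total=1000)   # well above the 1.8 threshold
```

Because the whole phrase occurs nearly as often as its parts, every split ratio is large and the phrase passes the 1.8 threshold.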
after the calculation and filtration of the solidification degree, the left-neighbor entropy and the right-neighbor entropy of the candidate phrase are further calculated. The left adjacent entropy of the candidate phrase refers to the sum of the information entropy of the candidate phrase and the information entropy combined with all adjacent words to the left of the candidate phrase, so as to judge the diversity of the left adjacent words of the candidate phrase. The larger the left neighbor entropy, the more the categories of the words adjacent to the left of the candidate phrase are described, and the more the adjacent characters are diversified, the more clear the boundary of the candidate phrase is, and the greater the probability of becoming a real phrase is. The left neighbor entropy calculation formula is:wherein->Is the left neighbor entropy of the candidate phrase G,left adjacent word being candidate phrase GThe set, e.g., the candidate phrase "gradient," may be preceded by the terms "edge," "at," "yes," "and," "use," "and" etc. left adjacency words. />Is that the adjacent word to the left of the candidate phrase G is +.>The calculation formula is: />Wherein->Is a left adjacent wordFrequency of co-occurrence with candidate phrase G, +.>Is the frequency with which the candidate phrase G appears alone. The calculation method of right neighbor entropy is similar, and only the calculation object is adjusted to be the entropy of the word appearing on the right side of the candidate phrase, so that the richness of the character appearing on the rear side of the candidate phrase is indicated. 
The right-neighbor entropy calculation formula is: />Wherein->Is the right neighbor entropy of the candidate phrase G, +.>Is the right set of contiguous words of the candidate phrase G, +.>Is that the adjacent word to the left of the candidate phrase G is +.>The calculation formula is: />Wherein->Is the right adjacency word->Frequency of co-occurrence with candidate phrase G, +.>Is the frequency with which the candidate phrase G appears alone.
The left-neighbor entropy and right-neighbor entropy of each candidate phrase are calculated separately, with both thresholds set to 1.2. A candidate phrase is retained if both entropies exceed the threshold, and filtered out if either or both fall below it.
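A character-level sketch of the neighbor-entropy filter (using single adjacent characters rather than segmented words, and a toy corpus, for simplicity):

```python
import math
from collections import Counter

def neighbor_entropy(corpus, phrase, side="left"):
    # Collect the character adjacent to each occurrence of the phrase.
    neighbors = Counter()
    start = corpus.find(phrase)
    while start != -1:
        if side == "left" and start > 0:
            neighbors[corpus[start - 1]] += 1
        elif side == "right" and start + len(phrase) < len(corpus):
            neighbors[corpus[start + len(phrase)]] += 1
        start = corpus.find(phrase, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    # H = -sum p * log p over the neighbor distribution.
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

corpus = "用梯度下降训练,按梯度下降优化,沿梯度下降搜索,将梯度下降应用"
h_left = neighbor_entropy(corpus, "梯度下降", "left")     # ln(4): 4 distinct neighbors
h_right = neighbor_entropy(corpus, "梯度下降", "right")
```

Four distinct left neighbors each occurring once give entropy ln(4) ≈ 1.39, above the 1.2 threshold, so the phrase would be retained.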
After the screening and filtering in the steps, all the candidate phrases still reserved are summarized to form a subject entity library about artificial intelligence.
Randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;

More specifically, based on the "artificial intelligence" subject entity library, this embodiment constructs the corresponding data set by randomly masking the entities in the initial data set.
Firstly, using a HanLP word segmentation tool to perform word segmentation on an initial data set in the professional field. In the word segmentation process, the subject entity library is preloaded, the phrases in the subject entity library are segmented into one word, and after the word segmentation is finished, the total number of all the words in the text is counted.
After counting the total number of words, the entity words in the subject entity library are masked: each character of a selected word is replaced with the [MASK] token, so that during training the model predicts the masked entity words from the unmasked corpus information, learning rich subject knowledge in the process. During masking, the ratio of masked entity words is computed and kept at no more than 20%; if the ratio exceeds 20%, some masked entity words are reverted to the unmasked state to satisfy the masking ratio.
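A hedged sketch of this masking step on an already-segmented token list (a greedy in-order selection stands in for random choice, and the token sequence and entity library are toy values):

```python
def mask_entities(tokens, entity_lib, max_ratio=0.20):
    # Budget of maskable characters: at most max_ratio of all characters.
    total_chars = sum(len(t) for t in tokens)
    budget = int(total_chars * max_ratio)
    masked = []
    for t in tokens:
        if t in entity_lib and len(t) <= budget:
            masked.append("[MASK]" * len(t))   # one [MASK] per character
            budget -= len(t)
        else:
            masked.append(t)                   # kept unmasked (ratio cap)
    return masked

tokens = ["梯度下降", "是", "一种", "常用", "的", "优化",
          "算法", "广泛", "用于", "训练", "神经网络"]
out = mask_entities(tokens, {"梯度下降", "神经网络"})
```

Here only the first entity fits within the 20% character budget, so the second entity stays unmasked, mirroring the ratio adjustment described above.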
Based on the above procedure, take the sentence "A gradient is a vector, indicating that the directional derivative of a function at a point attains its maximum value along that direction, i.e., the function changes fastest along that direction" as an illustration: the process comprises, in order, initial data acquisition, subject entity extraction, random masking, and subject-entity-based masking, as shown in Table 1.
The language neural network is constructed based on the Transformer and comprises a pre-coding layer, a feature extraction layer and a prediction layer.
As shown in fig. 2, the pre-coding layer converts the input text into word codes and corresponding position codes: so that the model knows the position information of each character in the input sequence, a position code is added to the word code, i.e., an extra vector indicating the element's position in the sequence according to the position of the character.
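The patent does not fix a particular encoding scheme for these extra position vectors; assuming the standard sinusoidal scheme of the original Transformer, a sketch looks like:

```python
import math

def positional_encoding(seq_len, d_model):
    # pe[pos][2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos][2i+1] = cos(pos / 10000^(2i/d_model))
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=8, d_model=16)
# Input to the feature extraction layer = word code + pe[position].
```

Each position gets a distinct vector, so adding it to the word code lets the model distinguish identical characters at different positions.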
As shown in fig. 3, the feature extraction layer generates associated feature values of different angles between characters in the input text according to the input word codes and the position codes.
More specifically, the Encoder portion of the Transformer model is employed, comprising a multi-head self-attention mechanism layer, a fully connected feed-forward network layer, and a residual connection and normalization layer, with four such layers stacked. The multi-head self-attention layer searches for association relations of different angles between the input corpus and, in the final concatenation step, recombines the association relations captured in the different subspaces. The feed-forward layer is a fully connected network through which the characters at each position pass individually. The residual connection and normalization layer performs residual addition on the outputs of the self-attention and feed-forward layers, followed by a layer normalization operation.
As shown in fig. 4, the prediction layer outputs a prediction result according to the input associated feature value, wherein the prediction result comprises subject entities related to the text and corresponding text positions.
The language neural network is trained using the data set to obtain a discipline entity language model for mining discipline entities.
More specifically, the loss is calculated using a cross entropy loss function, which characterizes the distance between the probability distribution of the actual output and the probability distribution of the desired output, i.e. the smaller the value of the cross entropy, the closer the two probability distributions are. In the model training process, the model parameters are continuously optimized so that the output of the cross entropy loss function decreases, finally reaching the loss minimum.
The cross entropy loss function is expressed as follows:

L = -\sum_{i=1}^{n} y_i \log(p_i)

where y_i denotes the i-th element value of the true label of sample x, p_i denotes the predicted probability that sample x belongs to the i-th category, and n denotes the total number of categories of the prediction result.
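As a concrete illustration of the loss above, the sketch below evaluates L = -Σ y_i log(p_i) for a one-hot true label; the function name and example probabilities are illustrative, not from the patent:

```python
import math

def cross_entropy(true_label, predicted, eps=1e-12):
    """L = -sum_i y_i * log(p_i): distance between the true (one-hot)
    distribution and the predicted distribution over n categories.
    eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(true_label, predicted))

y = [0, 1, 0]                  # sample x truly belongs to category 2
confident = [0.1, 0.8, 0.1]    # prediction close to the true distribution
diffuse   = [0.3, 0.4, 0.3]    # prediction far from it
assert cross_entropy(y, confident) < cross_entropy(y, diffuse)
```

The assertion reflects the property stated above: the closer the predicted distribution is to the true one, the smaller the cross entropy, which is what gradient descent drives down during training.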
The teaching resources to be identified are input into the subject entity language model to output the subject entities contained in the text.
More specifically, in this embodiment, subject entities in the corpus under the "artificial intelligence" specialty are masked, the masked text is passed to the model for prediction, and the model effect is verified against the prediction results.
In order to verify the reasoning effect of the trained model, a reading-comprehension task in the professional field is constructed, and it is observed whether the model's reasoning results accord with knowledge of that professional field. Under the artificial intelligence professional field, a piece of prediction data is constructed: "transfer the [MASK] obtained in forward propagation into the backward propagation process, and find the partial derivative of the loss function with respect to the weight and bias of each [MASK] layer by layer, as the [MASK] of the objective function with respect to the weights and biases". The sentence is fed into the model interface; after model inference, the predictions at the three masked positions are "loss function", "neuron" and "gradient" respectively. This demonstrates that the model has fused specialized subject knowledge, possesses basic reasoning capability in the subject professional field, and can better enable downstream NLP training tasks.
The present embodiment also provides a subject entity recognition apparatus, including a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, wherein the subject entity language model established in the above embodiment is stored in the computer memory.
The computer processor, when executing the computer program, performs the steps of: the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Claims (4)
1. The language model construction method based on subject entities is characterized by comprising the following steps:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
subject entity screening is carried out on the initial data set to construct a corresponding subject entity library, and the screening process is specifically as follows:
matching the initial data set by adopting a regular expression to obtain an original character string set containing Chinese characters, English characters and numbers;
traversing the original character string set according to the preset character string length, and filtering the pure digital character strings to obtain a candidate character string set;
screening character strings with occurrence times exceeding a word frequency threshold value in a candidate character string set according to the preset word frequency threshold value to construct corresponding initial candidate phrase;
analyzing the tightness degree between characters in each initial candidate phrase, and reserving the initial candidate phrases meeting the preset solidification degree;
calculating the left-neighbor entropy and right-neighbor entropy of each initial candidate phrase meeting the solidification degree, and retaining, as optimal candidate phrases, the initial candidate phrases whose left-neighbor entropy and right-neighbor entropy both exceed the preset average values, thereby constructing the subject entity library;
randomly shielding the subject entities in the subject entity library to obtain corresponding shielding words, and forming a data set by the subject entities and the corresponding shielding words;
the method comprises the steps of constructing a language neural network based on a Transformer, wherein the language neural network comprises a pre-coding layer, a feature extraction layer and a prediction layer, the pre-coding layer is used for constructing corresponding word codes and position codes by importing characters in a text into an initialized vector matrix and taking position information of the characters as additional vectors, the feature extraction layer is used for generating associated feature values of different angles among the characters in the input text according to the input word codes and the position codes, and the feature extraction layer comprises a multi-head self-attention mechanism unit, a fully-connected feedforward network and a residual error connection and normalization unit;
the multi-head self-attention mechanism unit is used for searching association relations of different angles between input characters and splicing comprehensive association features captured in different subspaces;
the fully-connected feedforward network is used for carrying out nonlinear transformation on the captured comprehensive association characteristics so as to obtain corresponding prediction association characteristics;
the residual connection and normalization unit is used for integrating the correlation characteristic and the prediction correlation characteristic to carry out residual addition and perform normalization operation so as to output a final correlation characteristic value, and the prediction layer outputs a prediction result according to the input correlation characteristic value, wherein the prediction result comprises subject entities related to the text and corresponding text positions;
training the language neural network by adopting the data set, taking a cross entropy loss function as the optimization objective of the language neural network, and performing iterative optimization by utilizing a gradient descent method to update parameters of the language neural network, so as to obtain a subject entity language model for mining subject entities, wherein the expression of the cross entropy loss function is as follows:
Loss = -\frac{1}{n}\sum_{i=1}^{n} y_i \log(p_i)

wherein y_i represents the true label of the i-th sample x_i, p_i represents the predicted value for the i-th sample x_i, and n represents the total number of samples; and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
2. The subject entity-based language model construction method of claim 1 wherein the initial candidate phrase solidification degree calculation process is as follows:
solid(ABCD) = \min\left(\frac{P(ABCD)}{P(A)\,P(BCD)}, \frac{P(ABCD)}{P(AB)\,P(CD)}, \frac{P(ABCD)}{P(ABC)\,P(D)}\right)

wherein solid(ABCD) represents the solidification degree of the string "ABCD", and P(\cdot) represents the frequency with which the corresponding string appears in the original character string set.
3. The subject entity-based language model construction method of claim 1, wherein the initial data set requires pre-processing prior to subject entity screening, the pre-processing including full-width to half-width conversion, traditional-to-simplified Chinese character conversion, and removal of spaces, line breaks, and special characters.
4. A subject entity recognition apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, wherein the computer memory employs the subject entity-based language model construction method of any one of claims 1-3;
the computer processor, when executing the computer program, performs the steps of:
the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311228568.4A CN116976351B (en) | 2023-09-22 | 2023-09-22 | Language model construction method based on subject entity and subject entity recognition device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116976351A CN116976351A (en) | 2023-10-31 |
CN116976351B true CN116976351B (en) | 2024-01-23 |
Family
ID=88473370
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311228568.4A Active CN116976351B (en) | 2023-09-22 | 2023-09-22 | Language model construction method based on subject entity and subject entity recognition device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116976351B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133220A (en) * | 2017-06-07 | 2017-09-05 | 东南大学 | Name entity recognition method in a kind of Geography field |
CN109902298A (en) * | 2019-02-13 | 2019-06-18 | 东北师范大学 | Domain Modeling and know-how estimating and measuring method in a kind of adaptive and learning system |
CN111368545A (en) * | 2020-02-28 | 2020-07-03 | 北京明略软件系统有限公司 | Named entity identification method and device based on multi-task learning |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN112287037A (en) * | 2020-10-23 | 2021-01-29 | 大连东软教育科技集团有限公司 | Multi-entity mixed knowledge graph construction method and device and storage medium |
CN112560486A (en) * | 2020-11-25 | 2021-03-26 | 国网江苏省电力有限公司电力科学研究院 | Power entity identification method based on multilayer neural network, storage medium and equipment |
CN112784009A (en) * | 2020-12-28 | 2021-05-11 | 北京邮电大学 | Subject term mining method and device, electronic equipment and storage medium |
CN112800766A (en) * | 2021-01-27 | 2021-05-14 | 华南理工大学 | Chinese medical entity identification and labeling method and system based on active learning |
CN113901807A (en) * | 2021-08-30 | 2022-01-07 | 重庆德莱哲企业管理咨询有限责任公司 | Clinical medicine entity recognition method and clinical test knowledge mining method |
CN114186013A (en) * | 2021-12-15 | 2022-03-15 | 广州华多网络科技有限公司 | Entity recognition model hot updating method and device, equipment, medium and product thereof |
CN114443813A (en) * | 2022-01-09 | 2022-05-06 | 西北大学 | Intelligent online teaching resource knowledge point concept entity linking method |
CN115169349A (en) * | 2022-06-30 | 2022-10-11 | 中国人民解放军战略支援部队信息工程大学 | Chinese electronic resume named entity recognition method based on ALBERT |
CN116720519A (en) * | 2023-06-08 | 2023-09-08 | 吉首大学 | Seedling medicine named entity identification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140280314A1 (en) * | 2013-03-14 | 2014-09-18 | Advanced Search Laboratories, lnc. | Dimensional Articulation and Cognium Organization for Information Retrieval Systems |
Non-Patent Citations (4)
Title |
---|
Summarization and Simplification of Medical Articles using Natural Language Processing; Shashank Patel et al.; 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT); pp. 1-6 *
Research and Application of a Retrieval Method for High School Chemistry Test Questions Based on Chinese Named Entity Recognition; Zhang Lu; Wanfang; Section 1.2, Chapter 3 *
Research and Implementation of Key Technologies for Knowledge-Graph-Based Intelligent Transformation of an Online Teaching Resource Library; Wang Yuyang; China Masters' Theses Full-text Database, Social Sciences II; Vol. 2023, No. 2; pp. 24-25 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134946B (en) | Machine reading understanding method for complex data | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN110362819B (en) | Text emotion analysis method based on convolutional neural network | |
CN109684626A (en) | Method for recognizing semantics, model, storage medium and device | |
CN111709242B (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN112306494A (en) | Code classification and clustering method based on convolution and cyclic neural network | |
CN112307130B (en) | Document-level remote supervision relation extraction method and system | |
CN116432655B (en) | Method and device for identifying named entities with few samples based on language knowledge learning | |
CN114359946A (en) | Optical music score image recognition method based on residual attention transducer | |
CN115545041B (en) | Model construction method and system for enhancing semantic vector representation of medical statement | |
CN113223509A (en) | Fuzzy statement identification method and system applied to multi-person mixed scene | |
CN114841151B (en) | Medical text entity relation joint extraction method based on decomposition-recombination strategy | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN114912453A (en) | Chinese legal document named entity identification method based on enhanced sequence features | |
CN113239690A (en) | Chinese text intention identification method based on integration of Bert and fully-connected neural network | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN115238693A (en) | Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory | |
CN115064154A (en) | Method and device for generating mixed language voice recognition model | |
CN112434686B (en) | End-to-end misplaced text classification identifier for OCR (optical character) pictures | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN115204143B (en) | Method and system for calculating text similarity based on prompt | |
CN116976351B (en) | Language model construction method based on subject entity and subject entity recognition device | |
CN115840815A (en) | Automatic abstract generation method based on pointer key information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||