CN116976351B - Language model construction method based on subject entity and subject entity recognition device - Google Patents
- Publication number
- CN116976351B CN116976351B CN202311228568.4A CN202311228568A CN116976351B CN 116976351 B CN116976351 B CN 116976351B CN 202311228568 A CN202311228568 A CN 202311228568A CN 116976351 B CN116976351 B CN 116976351B
- Authority
- CN
- China
- Prior art keywords
- subject
- subject entity
- language model
- language
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F40/295 — Named entity recognition
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F40/216 — Parsing using statistical methods
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24 — Classification techniques
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a subject-entity-based language model construction method, comprising the following steps: acquiring teaching resources to construct a corresponding initial data set; performing subject entity screening on the initial data set to construct a corresponding subject entity library; randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words; constructing a language neural network comprising a pre-coding layer, a feature extraction layer and a prediction layer; training the language neural network with the data set to obtain a subject entity language model for mining subject entities; and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text. The invention also provides a subject entity recognition device. The language model constructed by the method can acquire massive prior knowledge in the education field, so that a more comprehensive subject entity data set is constructed.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a subject entity-based language model construction method and a subject entity identification device.
Background
In the field of natural language processing, general-purpose language models have developed rapidly and been widely applied in recent years. The most notable progress is the application of deep learning techniques such as the Transformer model and the attention mechanism, whose development and application have brought higher accuracy and efficiency to the field of natural language processing.
However, conventional general-purpose language models may encounter the following problems when processing text of a particular domain. First, vertical-domain vocabulary is sparse: although a general-purpose language model has a large vocabulary, it does not necessarily contain the professional terms or new words of certain specific domains, leading to misjudgments or errors when processing domain-specific text. Second, context handling is inaccurate: a general-purpose model may not understand the common contexts of a specific domain, so ambiguities can arise. Third, a general-purpose model must process large amounts of data and text, which can affect processing speed.
With the development of artificial intelligence and natural language processing technology, building language models for specific fields has become feasible. Training a language model in a vertical domain improves the accuracy and efficiency of natural language processing in that domain and can solve the above problems.
Patent document CN111931020A discloses a formula labeling method, device, equipment and storage medium, comprising: acquiring a formula to be labeled; calling a formula labeling model obtained by training on formula labeling data on the basis of a target language characterization model corresponding to the formula-related subject, where the target language characterization model is obtained by expanding the vocabulary of the formula-related discipline on top of at least a basic language characterization model, and the formula labeling data comprises at least sample formula data and the labels corresponding to the sample formula data; and predicting the label of the formula to be labeled with the formula labeling model. That method only uses a basic language characterization model to extract discipline-related vocabulary from formulas to complete labeling; the model it adopts is a simple language model, and it cannot extract the corresponding vocabulary for labeling with respect to the association between text and discipline.
Patent document CN112580361A discloses a formula and text recognition model method based on a unified attention mechanism, which comprises: recognizing presentation LaTeX or content LaTeX and obtaining recognition results, parsing the results into a LaTeX semantic tree, and traversing the semantic tree; segmenting the LaTeX sequence with a statistical word segmentation method and segmenting the natural language in the question stem, apart from the mathematical formulas, with the WordPiece method to form a token sequence; and encoding the token sequence with a neural network, transforming the variable-length token sequence into a fixed-length hidden-space characterization, then mapping the output to knowledge points with a feed-forward neural network to complete knowledge-point labeling. That method recognizes the natural language in the question stem to obtain knowledge-point annotations, but it fails to consider the subject field of the current text, so irrelevant vocabulary may be extracted.
Disclosure of Invention
The invention mainly aims to provide a subject entity-based language model construction method and a subject entity identification device, and the language model constructed by the method can acquire massive priori knowledge in the education field so as to construct a more comprehensive subject entity data set.
In order to achieve the first object, the present invention provides a subject entity-based language model construction method, comprising the steps of:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
subject entity screening is carried out on the initial data set so as to construct a corresponding subject entity library;
randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;
the method comprises the steps of constructing a language neural network based on a Transformer, wherein the language neural network comprises a pre-coding layer, a feature extraction layer and a prediction layer, the pre-coding layer is used for converting an input text into word codes and corresponding position codes, the feature extraction layer generates associated feature values of different angles among characters in the input text according to the input word codes and the position codes, and the prediction layer outputs a prediction result according to the input associated feature values, wherein the prediction result comprises subject entities related to the text and corresponding text positions;
training the language neural network by adopting a data set to obtain a subject entity language model for mining subject entities;
and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
According to the invention, text mining and language model techniques from the NLP field are comprehensively utilized: entity vocabulary is extracted through new-word mining, and the language model is improved and optimized in combination with downstream language model training tasks. Raising the difficulty of the training task improves the trained model's ability to recognize entity vocabulary, so that richer information is fused into the language model.
Preferably, the subject entities of the initial data set are screened based on word frequency, solidification degree, and left- and right-neighbor word entropy, targeting the phrase characteristics of Chinese subject entities and thereby improving screening accuracy.
Specifically, the screening process is specifically as follows:
matching the initial data set by adopting a regular expression to obtain an original character string set containing Chinese, english characters and numbers;
traversing the original character string set according to the preset character string length, and filtering the pure digital character strings to obtain a candidate character string set;
screening character strings with occurrence times exceeding a word frequency threshold value in a candidate character string set according to the preset word frequency threshold value to construct corresponding initial candidate phrase;
analyzing the tightness between the characters of each initial candidate phrase, and retaining the initial candidate phrases that meet the preset solidification degree;

and calculating the left-neighbor entropy and right-neighbor entropy of each initial candidate phrase meeting the solidification degree, and retaining those whose left-neighbor and right-neighbor entropies exceed the preset values as the optimal candidate phrases, thereby constructing the subject entity library.
Specifically, the solidification degree of an initial candidate phrase (illustrated for a four-character string "ABCD") is calculated as follows:

\[ S(\mathrm{ABCD}) = \log \min\!\left( \frac{p(\mathrm{ABCD})}{p(\mathrm{A})\,p(\mathrm{BCD})},\; \frac{p(\mathrm{ABCD})}{p(\mathrm{AB})\,p(\mathrm{CD})},\; \frac{p(\mathrm{ABCD})}{p(\mathrm{ABC})\,p(\mathrm{D})} \right) \]

where \(S(\mathrm{ABCD})\) represents the solidification degree of the string "ABCD" and \(p(\cdot)\) represents the frequency of occurrence of a string in the original character string set, thereby ensuring the tightness from character to character in the candidate phrase.
Specifically, during training, a cross entropy loss function is adopted as the optimization objective of the language neural network, and iterative optimization is performed using a gradient descent method to update the parameters of the language neural network.
Specifically, the cross entropy loss function is expressed as follows:

\[ L = -\sum_{i=1}^{n} y_i \log \hat{y}_i \]

where \(y_i\) represents the i-th element value of the real label of sample x, \(\hat{y}_i\) represents the predicted probability that sample x belongs to the i-th category, and n represents the total number of categories of the prediction result.
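As a hedged numeric sketch of this objective (toy values in plain Python rather than a deep-learning framework; the helper names are illustrative), the cross entropy loss \(L = -\sum_i y_i \log \hat{y}_i\) and one plain gradient-descent update can be written as:

```python
import math

def cross_entropy(y_true, y_pred):
    # L = -sum_i y_i * log(yhat_i); only nonzero label entries contribute.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sgd_step(logits, y_true, lr=0.5):
    # For softmax followed by cross entropy, the gradient of the loss
    # with respect to the logits is simply (yhat - y).
    y_pred = softmax(logits)
    return [z - lr * (p - t) for z, p, t in zip(logits, y_pred, y_true)]

y = [0.0, 1.0, 0.0]           # one-hot real label of a 3-class sample
logits = [0.0, 0.0, 0.0]      # uniform initial prediction
loss0 = cross_entropy(y, softmax(logits))   # log(3) at the uniform prediction
logits = sgd_step(logits, y)
loss1 = cross_entropy(y, softmax(logits))   # smaller after one update
```

One update moves probability mass toward the true class, so the loss decreases, mirroring the iterative optimization described in the claim.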
Specifically, the initial dataset needs to be preprocessed before subject entity screening. Preprocessing includes full-width to half-width conversion, conversion of traditional characters into simplified characters, and removal of spaces, line breaks and special characters, avoiding the problem of unidentifiable characters and phrases.
Preferably, the pre-coding layer maps each character of the text into an initialized vector matrix and takes the position information of the characters as additional vectors, constructing the corresponding word codes and position codes so that the associated feature values carry the positional characteristics of characters and phrases.
Specifically, the feature extraction layer comprises a multi-head self-attention mechanism unit, a fully connected feed-forward network, and a residual connection and normalization unit;

the multi-head self-attention mechanism unit searches for association relations of different angles between the input characters and concatenates the comprehensive association features captured in the different subspaces;

the fully connected feed-forward network performs a nonlinear transformation on the captured comprehensive association features to obtain the corresponding predicted association features;

and the residual connection and normalization unit integrates the association features and the predicted association features through residual addition, followed by a normalization operation, to output the final associated feature values.
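Since the claims describe this layer only at a high level, a compact single-head NumPy sketch can make the data flow concrete (the patent's network is multi-head; one head with random toy weights is shown here for brevity, and the function names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_block(x, wq, wk, wv, w1, w2):
    # Scaled dot-product self-attention over all positions.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    x = layer_norm(x + attn @ v)                       # residual + normalization
    x = layer_norm(x + np.maximum(x @ w1, 0.0) @ w2)   # feed-forward (ReLU) + residual
    return x

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))                  # 5 characters, d-dim codes
ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(5)]
y = encoder_block(x, *ws)                        # associated feature values
```

Each row of `y` is the associated feature value for one character, combining information from every other position through the attention weights.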
To achieve the second object of the present invention, there is provided a subject entity recognition apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, the computer memory storing the subject entity language model described above; when executing the computer program, the computer processor performs the following step:
the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Compared with the prior art, the invention has the beneficial effects that:
the new word discovery method based on unsupervised learning is used for mining subject entities in the education field, simultaneously, word codes and position codes of the subject entities are combined to construct corresponding association features, and more implicit relations between characters are obtained based on the association features, so that a corresponding subject entity language model is constructed; through the subject entity model, a more comprehensive subject entity data set can be obtained, so that training tasks of all downstream NLPs can be better enabled.
Drawings
FIG. 1 is a flowchart of a subject entity-based language model construction method provided in this embodiment;
FIG. 2 is a schematic diagram of a pre-coding layer of a subject entity language model according to the present embodiment;
FIG. 3 is a schematic diagram of a feature extraction layer of the subject entity language model according to the present embodiment;
fig. 4 is a schematic diagram of a prediction layer of a subject entity language model according to the present embodiment.
Detailed Description
In order to more clearly describe the technical scheme in the embodiment of the invention, the invention is described in detail below with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
As shown in fig. 1, a subject entity-based language model construction method includes the following steps:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
more specifically, in this embodiment, teaching resources under the "artificial intelligence" specialty are used as sources, more than 500 teaching materials including 20 courses under the specialty such as "machine learning", "deep learning", "advanced mathematics", and "high-grade mathematics" are collected, paper data under the specialty is collected, and multi-modal data in the specialty field such as course video data, PPT courseware data, etc. are collected. The data of teaching materials, papers and PPT courseware are transcribed into character form by using OCR technology. And for course video data, converting the expression of the teacher in the video into a character form by using an ASR speech-text conversion technology, and obtaining an original corpus.
For the acquired corpus, the collected text data is cleaned using data preprocessing techniques. Cleaning operations include full-width to half-width conversion, conversion of traditional characters into simplified characters, and removal of spaces, line breaks, special characters and the like. After data cleaning, an initial dataset of the professional domain is obtained.
Subject entity screening is carried out on the initial data set so as to construct a corresponding subject entity library;
more specifically, in this embodiment, word frequency, solidification degree, left and right adjacent word entropy are sequentially used, and filtering is performed according to a set threshold value, so as to obtain an entity library of the professional discipline of "artificial intelligence".
First, the original corpus is read, and the regular expression "[\u4E00-\u9FFFa-zA-Z0-9]+" is used to match Chinese characters, English letters and digits, obtaining the original character strings and recording their lengths.
After the original character strings are obtained, the next step is traversal. The candidate-string length parameter is set to 5, i.e., character combinations of length 1 to 5 are selected as candidate strings. For example, traversing the phrase "gradient descent method" ("梯度下降法") from its first character yields the candidate strings "梯", "梯度", "梯度下", "梯度下降" and "梯度下降法". By traversing all original strings, all candidate string data can be obtained.
After the candidate strings are obtained, pure-number strings are filtered out. Word frequency is then counted: the number of occurrences of each candidate string in the original strings is computed and used for filtering. A string below the word-frequency threshold is not added as a candidate phrase. For example, "gradient descent" occurs more than 3 times in the original strings, so it is added as a candidate phrase.
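The extraction and frequency-filtering steps above can be sketched as follows (the corpus, threshold values and function name are illustrative, not the patent's data):

```python
import re
from collections import Counter

def extract_candidates(corpus, max_len=5, min_freq=3):
    # Match runs of Chinese characters, English letters and digits.
    strings = re.findall(r"[\u4E00-\u9FFFa-zA-Z0-9]+", corpus)
    counts = Counter()
    for s in strings:
        # Enumerate all substrings of length 1..max_len (the traversal step).
        for i in range(len(s)):
            for n in range(1, max_len + 1):
                if i + n <= len(s):
                    counts[s[i:i + n]] += 1
    # Filter pure-number strings, then apply the word-frequency threshold.
    return {g: c for g, c in counts.items()
            if not g.isdigit() and c >= min_freq}

corpus = "梯度下降法 梯度下降法 梯度下降法 learning rate 0.01"
cands = extract_candidates(corpus)   # "梯度下降" survives; "0" and "01" do not
```

Only strings that occur at least `min_freq` times become candidate phrases, matching the word-frequency filtering described above.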
After the candidate phrases are obtained, the solidification degree of each candidate phrase is calculated. The solidification degree measures the tightness between the characters of a phrase fragment: under the "artificial intelligence" specialty, a phrase such as "gradient descent" ("梯度下降") usually occurs as a whole and therefore has a high solidification degree, while chance combinations of fragments such as "network" and "intelligence" have a low solidification degree. Taking the calculation for "gradient descent" as an example:

\[ S(\text{梯度下降}) = \log \min\!\left( \frac{p(\text{梯度下降})}{p(\text{梯})\,p(\text{度下降})},\; \frac{p(\text{梯度下降})}{p(\text{梯度})\,p(\text{下降})},\; \frac{p(\text{梯度下降})}{p(\text{梯度下})\,p(\text{降})} \right) \]

where S(梯度下降) represents the solidification degree of "gradient descent", p(·) represents the frequency of a string in the original character strings, min takes the minimum over all binary splits, and the logarithm of that minimum gives the final solidification degree. The solidification-degree threshold is set to 1.8: phrases exceeding the threshold are retained for the next filtering step, and candidate phrases below it are removed;
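A minimal sketch of this calculation, assuming the split-product form of the solidification degree — S(G) = log of the minimum over binary splits of p(G) / (p(left) · p(right)) — and toy frequency counts rather than real corpus statistics:

```python
import math

def solidification(phrase, counts, total):
    # p(.) is taken as relative frequency over the candidate-string counts.
    p = lambda s: counts[s] / total
    ratios = [p(phrase) / (p(phrase[:i]) * p(phrase[i:]))
              for i in range(1, len(phrase))]
    return math.log(min(ratios))

# Toy counts: "梯度下降" almost always occurs as a whole phrase.
counts = {"梯": 12, "度下降": 11, "梯度": 12, "下降": 14,
          "梯度下": 11, "降": 15, "梯度下降": 10}
S = solidification("梯度下降", counts, total=1000)   # well above the 1.8 threshold
```

Because the whole phrase occurs nearly as often as its parts, every split ratio is large and the phrase passes the 1.8 threshold.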
after the calculation and filtration of the solidification degree, the left-neighbor entropy and the right-neighbor entropy of the candidate phrase are further calculated. The left adjacent entropy of the candidate phrase refers to the sum of the information entropy of the candidate phrase and the information entropy combined with all adjacent words to the left of the candidate phrase, so as to judge the diversity of the left adjacent words of the candidate phrase. The larger the left neighbor entropy, the more the categories of the words adjacent to the left of the candidate phrase are described, and the more the adjacent characters are diversified, the more clear the boundary of the candidate phrase is, and the greater the probability of becoming a real phrase is. The left neighbor entropy calculation formula is:wherein->Is the left neighbor entropy of the candidate phrase G,left adjacent word being candidate phrase GThe set, e.g., the candidate phrase "gradient," may be preceded by the terms "edge," "at," "yes," "and," "use," "and" etc. left adjacency words. />Is that the adjacent word to the left of the candidate phrase G is +.>The calculation formula is: />Wherein->Is a left adjacent wordFrequency of co-occurrence with candidate phrase G, +.>Is the frequency with which the candidate phrase G appears alone. The calculation method of right neighbor entropy is similar, and only the calculation object is adjusted to be the entropy of the word appearing on the right side of the candidate phrase, so that the richness of the character appearing on the rear side of the candidate phrase is indicated. 
The right-neighbor entropy calculation formula is: />Wherein->Is the right neighbor entropy of the candidate phrase G, +.>Is the right set of contiguous words of the candidate phrase G, +.>Is that the adjacent word to the left of the candidate phrase G is +.>The calculation formula is: />Wherein->Is the right adjacency word->Frequency of co-occurrence with candidate phrase G, +.>Is the frequency with which the candidate phrase G appears alone.
The left-neighbor entropy and right-neighbor entropy of each candidate phrase are calculated separately, with both thresholds set to 1.2. A candidate phrase is retained if both entropies exceed the threshold, and filtered out if either or both fall below it.
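A character-level sketch of the neighbor-entropy filter (using single adjacent characters rather than segmented words, and a toy corpus, for simplicity):

```python
import math
from collections import Counter

def neighbor_entropy(corpus, phrase, side="left"):
    # Collect the character adjacent to each occurrence of the phrase.
    neighbors = Counter()
    start = corpus.find(phrase)
    while start != -1:
        if side == "left" and start > 0:
            neighbors[corpus[start - 1]] += 1
        elif side == "right" and start + len(phrase) < len(corpus):
            neighbors[corpus[start + len(phrase)]] += 1
        start = corpus.find(phrase, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    # H = -sum p * log p over the neighbor distribution.
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())

corpus = "用梯度下降训练,按梯度下降优化,沿梯度下降搜索,将梯度下降应用"
h_left = neighbor_entropy(corpus, "梯度下降", "left")     # ln(4): 4 distinct neighbors
h_right = neighbor_entropy(corpus, "梯度下降", "right")
```

Four distinct left neighbors each occurring once give entropy ln(4) ≈ 1.39, above the 1.2 threshold, so the phrase would be retained.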
After the screening and filtering in the steps, all the candidate phrases still reserved are summarized to form a subject entity library about artificial intelligence.
Randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;

More specifically, based on the "artificial intelligence" subject entity library, this embodiment constructs the corresponding data set by randomly masking the entities in the initial data set.
Firstly, using a HanLP word segmentation tool to perform word segmentation on an initial data set in the professional field. In the word segmentation process, the subject entity library is preloaded, the phrases in the subject entity library are segmented into one word, and after the word segmentation is finished, the total number of all the words in the text is counted.
After counting the total number of words, the entity words in the subject entity library are masked: each character of a selected word is replaced with the [MASK] token, so that during training the model predicts the masked entity words from the unmasked corpus information, learning rich subject knowledge in the process. During masking, the ratio of masked entity words is computed and kept at no more than 20%; if the ratio exceeds 20%, some masked entity words are reverted to the unmasked state to satisfy the masking ratio.
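A hedged sketch of this masking step on an already-segmented token list (a greedy in-order selection stands in for random choice, and the token sequence and entity library are toy values):

```python
def mask_entities(tokens, entity_lib, max_ratio=0.20):
    # Budget of maskable characters: at most max_ratio of all characters.
    total_chars = sum(len(t) for t in tokens)
    budget = int(total_chars * max_ratio)
    masked = []
    for t in tokens:
        if t in entity_lib and len(t) <= budget:
            masked.append("[MASK]" * len(t))   # one [MASK] per character
            budget -= len(t)
        else:
            masked.append(t)                   # kept unmasked (ratio cap)
    return masked

tokens = ["梯度下降", "是", "一种", "常用", "的", "优化",
          "算法", "广泛", "用于", "训练", "神经网络"]
out = mask_entities(tokens, {"梯度下降", "神经网络"})
```

Here only the first entity fits within the 20% character budget, so the second entity stays unmasked, mirroring the ratio adjustment described above.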
Based on the above procedure, take the sentence "A gradient is a vector, indicating that the directional derivative of a function at a point attains its maximum value along that direction, i.e., the function changes fastest along that direction" as an illustration: the process comprises, in order, initial data acquisition, subject entity extraction, random masking, and subject-entity-based masking, as shown in Table 1.
The language neural network is constructed based on the Transformer and comprises a pre-coding layer, a feature extraction layer and a prediction layer.
As shown in fig. 2, the pre-coding layer converts the input text into word codes and corresponding position codes: so that the model knows the position information of each character in the input sequence, a position code is added to the word code, i.e., an extra vector indicating the element's position in the sequence according to the position of the character.
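The patent does not fix a particular encoding scheme for these extra position vectors; assuming the standard sinusoidal scheme of the original Transformer, a sketch looks like:

```python
import math

def positional_encoding(seq_len, d_model):
    # pe[pos][2i]   = sin(pos / 10000^(2i/d_model))
    # pe[pos][2i+1] = cos(pos / 10000^(2i/d_model))
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=8, d_model=16)
# Input to the feature extraction layer = word code + pe[position].
```

Each position gets a distinct vector, so adding it to the word code lets the model distinguish identical characters at different positions.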
As shown in fig. 3, the feature extraction layer generates associated feature values of different angles between characters in the input text according to the input word codes and the position codes.
More specifically, the Encoder portion of the Transformer model is employed, comprising a multi-head self-attention mechanism layer, a fully connected feed-forward network layer, and a residual connection and normalization layer, with four such layers stacked. The multi-head self-attention layer searches for association relations of different angles between the input corpus and, in the final concatenation step, recombines the association relations captured in the different subspaces. The feed-forward layer is a fully connected network through which the characters at each position pass individually. The residual connection and normalization layer performs residual addition on the outputs of the self-attention and feed-forward layers, followed by a layer normalization operation.
As shown in fig. 4, the prediction layer outputs a prediction result according to the input associated feature value, wherein the prediction result comprises subject entities related to the text and corresponding text positions.
The language neural network is trained using the data set to obtain a discipline entity language model for mining discipline entities.
More specifically, the loss is calculated using a cross entropy loss function, which characterizes the distance between the probability distribution of the actual output and the probability distribution of the desired output, i.e. the smaller the value of the cross entropy, the closer the two probability distributions are. In the model training process, the model parameters are continuously optimized so that the output of the cross entropy loss function decreases, finally reaching the loss minimum.
The cross entropy loss function is expressed as follows:

L = -\sum_{i=1}^{n} y_i \log(p_i)

where y_i denotes the i-th element value of the true label of sample x, p_i denotes the predicted probability that sample x belongs to the i-th category, and n denotes the total number of categories of the prediction result.
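As a concrete illustration of the loss above, the sketch below evaluates L = -Σ y_i log(p_i) for a one-hot true label; the function name and example probabilities are illustrative, not from the patent:

```python
import math

def cross_entropy(true_label, predicted, eps=1e-12):
    """L = -sum_i y_i * log(p_i): distance between the true (one-hot)
    distribution and the predicted distribution over n categories.
    eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(true_label, predicted))

y = [0, 1, 0]                  # sample x truly belongs to category 2
confident = [0.1, 0.8, 0.1]    # prediction close to the true distribution
diffuse   = [0.3, 0.4, 0.3]    # prediction far from it
assert cross_entropy(y, confident) < cross_entropy(y, diffuse)
```

The assertion reflects the property stated above: the closer the predicted distribution is to the true one, the smaller the cross entropy, which is what gradient descent drives down during training.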
The teaching resources to be identified are input into the subject entity language model to output the subject entities contained in the text.
More specifically, in this embodiment, subject entities in the corpus under the "artificial intelligence" specialty are masked, the masked text is passed to the model for prediction, and the model effect is verified against the prediction results.
In order to verify the reasoning effect of the trained model, a reading-comprehension task in the professional field is constructed, and it is observed whether the model's reasoning results accord with knowledge of that professional field. Under the artificial intelligence professional field, a piece of prediction data is constructed: "transfer the [MASK] obtained in forward propagation into the backward propagation process, and find the partial derivative of the loss function with respect to the weight and bias of each [MASK] layer by layer, as the [MASK] of the objective function with respect to the weights and biases". The sentence is fed into the model interface; after model inference, the predictions at the three masked positions are "loss function", "neuron" and "gradient" respectively. This demonstrates that the model has fused specialized subject knowledge, possesses basic reasoning capability in the subject professional field, and can better enable downstream NLP training tasks.
The present embodiment also provides a subject entity recognition apparatus, including a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, wherein the subject entity language model established in the above embodiment is stored in the computer memory.
The computer processor, when executing the computer program, performs the steps of: the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Claims (4)
1. The language model construction method based on subject entities is characterized by comprising the following steps:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
subject entity screening is carried out on the initial data set to construct a corresponding subject entity library, and the screening process is specifically as follows:
matching the initial data set by adopting a regular expression to obtain an original character string set containing Chinese characters, English characters and numbers;
traversing the original character string set according to the preset character string length, and filtering the pure digital character strings to obtain a candidate character string set;
screening character strings with occurrence times exceeding a word frequency threshold value in a candidate character string set according to the preset word frequency threshold value to construct corresponding initial candidate phrase;
analyzing the tightness degree between characters in each initial candidate phrase, and reserving the initial candidate phrases meeting the preset solidification degree;
calculating the left-neighbor entropy and right-neighbor entropy of each initial candidate phrase meeting the solidification degree, and retaining, as optimal candidate phrases, the initial candidate phrases whose left-neighbor entropy and right-neighbor entropy both exceed the preset average values, thereby constructing the subject entity library;
randomly shielding the subject entities in the subject entity library to obtain corresponding shielding words, and forming a data set by the subject entities and the corresponding shielding words;
the method comprises the steps of constructing a language neural network based on a Transformer, wherein the language neural network comprises a pre-coding layer, a feature extraction layer and a prediction layer, the pre-coding layer is used for constructing corresponding word codes and position codes by importing characters in a text into an initialized vector matrix and taking position information of the characters as additional vectors, the feature extraction layer is used for generating associated feature values of different angles among the characters in the input text according to the input word codes and the position codes, and the feature extraction layer comprises a multi-head self-attention mechanism unit, a fully-connected feedforward network and a residual error connection and normalization unit;
the multi-head self-attention mechanism unit is used for searching association relations of different angles between input characters and splicing comprehensive association features captured in different subspaces;
the fully-connected feedforward network is used for carrying out nonlinear transformation on the captured comprehensive association characteristics so as to obtain corresponding prediction association characteristics;
the residual connection and normalization unit is used for integrating the correlation characteristic and the prediction correlation characteristic to carry out residual addition and perform normalization operation so as to output a final correlation characteristic value, and the prediction layer outputs a prediction result according to the input correlation characteristic value, wherein the prediction result comprises subject entities related to the text and corresponding text positions;
training the language neural network by adopting the data set, taking a cross entropy loss function as the optimization objective of the language neural network, and performing iterative optimization by utilizing a gradient descent method to update parameters of the language neural network, so as to obtain a subject entity language model for mining subject entities, wherein the expression of the cross entropy loss function is as follows:
Loss = -\frac{1}{n}\sum_{i=1}^{n} y_i \log(p_i)

wherein y_i represents the true label of the i-th sample x_i, p_i represents the predicted value for the i-th sample x_i, and n represents the total number of samples; and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
2. The subject entity-based language model construction method of claim 1 wherein the initial candidate phrase solidification degree calculation process is as follows:
solid(ABCD) = \min\left(\frac{P(ABCD)}{P(A)\,P(BCD)}, \frac{P(ABCD)}{P(AB)\,P(CD)}, \frac{P(ABCD)}{P(ABC)\,P(D)}\right)

wherein solid(ABCD) represents the solidification degree of the string "ABCD", and P(\cdot) represents the frequency with which the corresponding string appears in the original character string set.
3. The subject entity-based language model construction method of claim 1, wherein the initial data set requires pre-processing prior to subject entity screening, the pre-processing including full-width to half-width conversion, traditional-to-simplified Chinese character conversion, and removal of spaces, line breaks, and special characters.
4. A subject entity recognition apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, wherein the computer memory employs the subject entity-based language model construction method of any one of claims 1-3;
the computer processor, when executing the computer program, performs the steps of:
the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311228568.4A CN116976351B (en) | 2023-09-22 | 2023-09-22 | Language model construction method based on subject entity and subject entity recognition device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116976351A CN116976351A (en) | 2023-10-31 |
CN116976351B true CN116976351B (en) | 2024-01-23 |
Family
ID=88473370
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311228568.4A Active CN116976351B (en) | 2023-09-22 | 2023-09-22 | Language model construction method based on subject entity and subject entity recognition device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116976351B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133220A (en) * | 2017-06-07 | 2017-09-05 | 东南大学 | Name entity recognition method in a kind of Geography field |
CN109902298A (en) * | 2019-02-13 | 2019-06-18 | 东北师范大学 | Domain Modeling and know-how estimating and measuring method in a kind of adaptive and learning system |
CN111368545A (en) * | 2020-02-28 | 2020-07-03 | 北京明略软件系统有限公司 | Named entity identification method and device based on multi-task learning |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
CN112287037A (en) * | 2020-10-23 | 2021-01-29 | 大连东软教育科技集团有限公司 | Multi-entity mixed knowledge graph construction method and device and storage medium |
CN112560486A (en) * | 2020-11-25 | 2021-03-26 | 国网江苏省电力有限公司电力科学研究院 | Power entity identification method based on multilayer neural network, storage medium and equipment |
CN112784009A (en) * | 2020-12-28 | 2021-05-11 | 北京邮电大学 | Subject term mining method and device, electronic equipment and storage medium |
CN112800766A (en) * | 2021-01-27 | 2021-05-14 | 华南理工大学 | Chinese medical entity identification and labeling method and system based on active learning |
CN113901807A (en) * | 2021-08-30 | 2022-01-07 | 重庆德莱哲企业管理咨询有限责任公司 | Clinical medicine entity recognition method and clinical test knowledge mining method |
CN114186013A (en) * | 2021-12-15 | 2022-03-15 | 广州华多网络科技有限公司 | Entity recognition model hot updating method and device, equipment, medium and product thereof |
CN114443813A (en) * | 2022-01-09 | 2022-05-06 | 西北大学 | Intelligent online teaching resource knowledge point concept entity linking method |
CN115169349A (en) * | 2022-06-30 | 2022-10-11 | 中国人民解放军战略支援部队信息工程大学 | Chinese electronic resume named entity recognition method based on ALBERT |
CN116720519A (en) * | 2023-06-08 | 2023-09-08 | 吉首大学 | Seedling medicine named entity identification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140280314A1 (en) * | 2013-03-14 | 2014-09-18 | Advanced Search Laboratories, lnc. | Dimensional Articulation and Cognium Organization for Information Retrieval Systems |
Non-Patent Citations (4)
Title |
---|
Summarization and Simplification of Medical Articles using Natural Language Processing; Shashank Patel et al.; 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT); pp. 1-6 *
Research and Application of a Retrieval Method for High School Chemistry Test Questions Based on Chinese Named Entity Recognition; Zhang Lu; Wanfang; Section 1.2, Chapter 3 *
Research and Implementation of Key Technologies for Knowledge-Graph-Based Intelligent Transformation of an Online Teaching Resource Library; Wang Yuyang; China Masters' Theses Full-text Database, Social Sciences II; Vol. 2023, No. 2; pp. 24-25 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134946B (en) | Machine reading understanding method for complex data | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN110362819B (en) | Text emotion analysis method based on convolutional neural network | |
CN109684626A (en) | Method for recognizing semantics, model, storage medium and device | |
CN111709242B (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN112306494A (en) | Code classification and clustering method based on convolution and cyclic neural network | |
CN112307130B (en) | Document-level remote supervision relation extraction method and system | |
CN116432655B (en) | Method and device for identifying named entities with few samples based on language knowledge learning | |
CN114359946A (en) | Optical music score image recognition method based on residual attention transducer | |
CN115545041B (en) | Model construction method and system for enhancing semantic vector representation of medical statement | |
CN113223509A (en) | Fuzzy statement identification method and system applied to multi-person mixed scene | |
CN114841151B (en) | Medical text entity relation joint extraction method based on decomposition-recombination strategy | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN114912453A (en) | Chinese legal document named entity identification method based on enhanced sequence features | |
CN113239690A (en) | Chinese text intention identification method based on integration of Bert and fully-connected neural network | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN115238693A (en) | Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory | |
CN115064154A (en) | Method and device for generating mixed language voice recognition model | |
CN112434686B (en) | End-to-end misplaced text classification identifier for OCR (optical character) pictures | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN115204143B (en) | Method and system for calculating text similarity based on prompt | |
CN116976351B (en) | Language model construction method based on subject entity and subject entity recognition device | |
CN115840815A (en) | Automatic abstract generation method based on pointer key information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||