CN116976351B - Language model construction method based on subject entity and subject entity recognition device - Google Patents

Language model construction method based on subject entity and subject entity recognition device Download PDF

Info

Publication number
CN116976351B
CN116976351B (application CN202311228568.4A)
Authority
CN
China
Prior art keywords
subject
subject entity
language model
language
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311228568.4A
Other languages
Chinese (zh)
Other versions
CN116976351A (en)
Inventor
曹柳
王琪皓
黄程韦
吴江
朱晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202311228568.4A priority Critical patent/CN116976351B/en
Publication of CN116976351A publication Critical patent/CN116976351A/en
Application granted granted Critical
Publication of CN116976351B publication Critical patent/CN116976351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a subject entity-based language model construction method, which comprises the following steps: acquiring teaching resources to construct a corresponding initial data set; performing subject entity screening on the initial data set to construct a corresponding subject entity library; randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words; constructing a language neural network comprising a pre-coding layer, a feature extraction layer and a prediction layer; training the language neural network with the data set to obtain a subject entity language model for mining subject entities; and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text. The invention also provides a subject entity recognition device. The language model constructed by the method can acquire massive prior knowledge in the education field, so that a more comprehensive subject entity data set can be constructed.

Description

Language model construction method based on subject entity and subject entity recognition device
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a subject entity-based language model construction method and a subject entity recognition device.
Background
In the field of natural language processing, general-purpose language models have developed rapidly and been widely applied in recent years. The most notable progress comes from deep learning techniques such as the Transformer model and the attention mechanism, which have brought higher accuracy and efficiency to natural language processing.
However, conventional general-purpose language models may encounter the following problems when processing text in a specific domain. First, vertical-domain vocabulary is limited: although a general-purpose language model has a large vocabulary, it does not necessarily contain the technical terms or new words of certain specific domains, which leads to misjudgments or errors when processing domain-specific text. Second, context handling is inaccurate: a general-purpose model may not understand the conventions of a specific domain, so ambiguities may arise. Third, a general-purpose model must process large amounts of data and text, so processing speed may suffer.
With the development of artificial intelligence and natural language processing technology, it has become feasible to build language models for specific fields. Training a vertical-domain language model improves the accuracy and efficiency of natural language processing in that domain and can address the above problems.
Patent document CN111931020A discloses a formula labeling method, apparatus, device and storage medium, comprising: acquiring a formula to be labeled; invoking a formula labeling model, which is trained on formula labeling data on the basis of a target language characterization model corresponding to the discipline related to the formula, the target language characterization model being obtained by at least expanding the vocabulary of the formula-related discipline on a basic language characterization model, and the formula labeling data comprising at least sample formula data and the labels corresponding to the sample formula data; and predicting the label of the formula to be labeled with the formula labeling model. This method only uses a basic language characterization model to extract discipline-related vocabulary from formulas to complete labeling; the model it adopts is a simple language model, and it cannot extract the corresponding vocabulary for labeling based on the association between texts and disciplines.
Patent document CN112580361A discloses a formula and text recognition model method based on a unified attention mechanism, which comprises recognizing presentation LaTeX or content LaTeX, obtaining recognition results, parsing the results into a LaTeX semantic tree, and traversing the semantic tree; segmenting the LaTeX sequence with a statistical word segmentation method, and segmenting the natural language in the question stem other than mathematical formulas with the WordPiece method, to form a token sequence; and encoding the token sequence with a neural network to transform the variable-length token sequence into a fixed-length hidden-space representation, then mapping the output to knowledge points with a feedforward neural network to complete knowledge-point labeling. This method recognizes the natural language in the question stem to obtain knowledge-point labels, but it fails to consider the subject field of the current text, so irrelevant vocabulary may be extracted.
Disclosure of Invention
The main object of the present invention is to provide a subject entity-based language model construction method and a subject entity recognition device; the language model constructed by the method can acquire massive prior knowledge in the education field, so as to construct a more comprehensive subject entity data set.
In order to achieve the first object, the present invention provides a subject entity-based language model construction method, comprising the steps of:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
performing subject entity screening on the initial data set to construct a corresponding subject entity library;
randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;
constructing a Transformer-based language neural network, wherein the language neural network comprises a pre-coding layer, a feature extraction layer and a prediction layer; the pre-coding layer is used for converting the input text into word codes and corresponding position codes, the feature extraction layer generates associated feature values of different angles between the characters in the input text according to the input word codes and position codes, and the prediction layer outputs a prediction result according to the input associated feature values, wherein the prediction result comprises the subject entities related to the text and the corresponding text positions;
training the language neural network with the data set to obtain a subject entity language model for mining subject entities;
and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
The invention comprehensively utilizes text mining and language model techniques from the NLP field: entity vocabulary is extracted through new word mining, and the language model is improved and optimized in combination with downstream language model training tasks, strengthening the trained model's ability to recognize entity vocabulary; the increased training difficulty fuses richer information into the language model.
Preferably, the subject entities of the initial data set are screened based on word frequency, degree of solidification, and left- and right-neighbor word entropy; that is, the screening targets the phrase characteristics of Chinese subject entities, which improves screening accuracy.
Specifically, the screening process is as follows:
matching the initial data set with a regular expression to obtain an original character string set containing Chinese characters, English characters and numbers;
traversing the original character string set according to a preset character string length, and filtering out pure-digit character strings to obtain a candidate character string set;
screening the character strings whose occurrence counts exceed a preset word frequency threshold in the candidate character string set to construct the corresponding initial candidate phrases;
analyzing the degree of tightness between the characters in each initial candidate phrase, and retaining the initial candidate phrases meeting a preset degree of solidification;
and calculating the left-neighbor entropy and right-neighbor entropy of each initial candidate phrase meeting the degree of solidification, and retaining, as optimal candidate phrases, the initial candidate phrases whose left-neighbor entropy and right-neighbor entropy both exceed preset average values, thereby constructing the subject entity library.
Specifically, the degree of solidification of an initial candidate phrase is calculated as follows, taking a four-character string "ABCD" as an example:

$$S(\mathrm{ABCD}) = \log \min\left(\frac{p(\mathrm{ABCD})}{p(\mathrm{A})\,p(\mathrm{BCD})},\ \frac{p(\mathrm{ABCD})}{p(\mathrm{AB})\,p(\mathrm{CD})},\ \frac{p(\mathrm{ABCD})}{p(\mathrm{ABC})\,p(\mathrm{D})}\right)$$

where $S(\mathrm{ABCD})$ represents the degree of solidification of the "ABCD" string and $p(\cdot)$ represents the frequency with which the corresponding string appears in the original character string set; taking the minimum over all binary splits ensures the compactness between the characters in the candidate phrase.
Specifically, during training, a cross entropy loss function is adopted as the optimization objective of the language neural network, and iterative optimization is performed with a gradient descent method to update the parameters of the language neural network.
Specifically, the cross entropy loss function is expressed as follows:

$$L = -\sum_{i=1}^{n} y_i \log(p_i)$$

where $y_i$ represents the i-th element value of the true label of sample x, $p_i$ represents the model's predicted probability that sample x belongs to the i-th category, and n represents the total number of categories of the prediction result.
Specifically, the initial data set needs to be preprocessed before subject entity screening; the preprocessing includes full-width to half-width conversion, conversion of traditional Chinese characters into simplified characters, and removal of spaces, line breaks and special characters, avoiding the problem of unrecognizable characters and phrases.
Preferably, the pre-coding layer imports the characters in the text into an initialized vector matrix while taking the position information of the characters as additional vectors, so as to construct the corresponding word codes and position codes; the associated feature values thus carry the positional characteristics between characters and phrases.
Specifically, the feature extraction layer comprises a multi-head self-attention mechanism unit, a fully connected feedforward network, and a residual connection and normalization unit;
the multi-head self-attention mechanism unit is used for finding association relations of different angles between input characters and splicing the comprehensive association features captured in different subspaces;
the fully connected feedforward network is used for performing a nonlinear transformation on the captured comprehensive association features to obtain corresponding predicted association features;
and the residual connection and normalization unit is used for integrating the association features and the predicted association features through residual addition, and performing a normalization operation to output the final associated feature values.
To achieve the second object of the present invention, there is provided a subject entity recognition apparatus including a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, the computer memory employing the subject entity language model described above; the computer processor, when executing the computer program, performs the steps of:
the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
Compared with the prior art, the invention has the beneficial effects that:
the new word discovery method based on unsupervised learning is used for mining subject entities in the education field, simultaneously, word codes and position codes of the subject entities are combined to construct corresponding association features, and more implicit relations between characters are obtained based on the association features, so that a corresponding subject entity language model is constructed; through the subject entity model, a more comprehensive subject entity data set can be obtained, so that training tasks of all downstream NLPs can be better enabled.
Drawings
FIG. 1 is a flowchart of a subject entity-based language model construction method provided in this embodiment;
FIG. 2 is a schematic diagram of a pre-coding layer of a subject entity language model according to the present embodiment;
FIG. 3 is a schematic diagram of a feature extraction layer of the subject entity language model according to the present embodiment;
fig. 4 is a schematic diagram of a prediction layer of a subject entity language model according to the present embodiment.
Detailed Description
In order to more clearly describe the technical scheme in the embodiment of the invention, the invention is described in detail below with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
As shown in fig. 1, a subject entity-based language model construction method includes the following steps:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
more specifically, in this embodiment, teaching resources under the "artificial intelligence" specialty are used as sources, more than 500 teaching materials including 20 courses under the specialty such as "machine learning", "deep learning", "advanced mathematics", and "high-grade mathematics" are collected, paper data under the specialty is collected, and multi-modal data in the specialty field such as course video data, PPT courseware data, etc. are collected. The data of teaching materials, papers and PPT courseware are transcribed into character form by using OCR technology. And for course video data, converting the expression of the teacher in the video into a character form by using an ASR speech-text conversion technology, and obtaining an original corpus.
And aiming at the acquired corpus, cleaning the acquired text data by using a data preprocessing technology. The cleaned regular expression comprises full-angle conversion half-angle conversion, conversion of traditional characters into simplified characters, removal of blank spaces, line-changing characters, special characters and the like. After data cleaning, an initial dataset of the professional domain is obtained.
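As a concrete illustration, this cleaning step can be sketched in a few lines of Python; the exact filtering rules and the traditional-to-simplified converter are not specified by the patent, so the helpers below are assumptions rather than the patent's implementation:

```python
import re

def full_to_half(text: str) -> str:
    """Full-width to half-width conversion: full-width ASCII sits at
    0xFF01-0xFF5E (offset 0xFEE0 from half-width); the full-width
    space (0x3000) maps to an ordinary space."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

def clean(text: str) -> str:
    text = full_to_half(text)
    # Traditional-to-simplified conversion would go here, e.g. via the
    # third-party opencc package (an assumption, not named by the patent):
    #   text = opencc.OpenCC("t2s").convert(text)
    text = re.sub(r"\s+", "", text)  # remove spaces and line breaks
    # Remove special characters, keeping Chinese, English and digits.
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", text)
```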
Performing subject entity screening on the initial data set to construct a corresponding subject entity library;
More specifically, in this embodiment, word frequency, degree of solidification, and left- and right-neighbor word entropy are applied in sequence, with filtering according to set thresholds, to obtain the entity library of the "artificial intelligence" professional discipline.
First, the original corpus is read, and Chinese characters, English characters and numbers are matched in the original corpus with the regular expression "[\u4E00-\u9FFFa-zA-Z0-9]+", obtaining the original character strings and recording their lengths.
After the original character strings are obtained, the next step of traversal is performed. The candidate character string length parameter is set to 5, i.e., character combinations of length 1 to 5 are selected to obtain the candidate character strings. For example, traversing the combination "gradient descent method" yields candidate character strings such as "ladder" (the single first character of the Chinese phrase), "gradient", "gradient descent" and "gradient descent method". By traversing the original character strings, all candidate character string data can be obtained.
After the candidate character strings are obtained, the pure-digit character strings are filtered out. Word frequency is then counted: the number of occurrences of each candidate in the original character strings is calculated and used for filtering, and candidates below the word frequency threshold are not added as candidate phrases. For example, "gradient descent" occurs more than 3 times in the original character strings, so it is added as a candidate phrase.
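A minimal Python sketch of candidate generation and word-frequency filtering under the parameters of this embodiment (maximum candidate length 5, word frequency threshold 3); the function names are illustrative:

```python
import re
from collections import Counter

MAX_LEN = 5         # preset candidate character string length
FREQ_THRESHOLD = 3  # word frequency threshold

def extract_strings(corpus: str) -> list:
    # Match runs of Chinese characters, English characters and digits.
    return re.findall(r"[\u4E00-\u9FFFa-zA-Z0-9]+", corpus)

def candidate_phrases(strings: list) -> Counter:
    counts = Counter()
    for s in strings:
        for n in range(1, MAX_LEN + 1):        # substrings of length 1..5
            for i in range(len(s) - n + 1):
                cand = s[i:i + n]
                if not cand.isdigit():         # filter pure-digit strings
                    counts[cand] += 1
    # Keep only candidates occurring more often than the threshold.
    return Counter({c: f for c, f in counts.items() if f > FREQ_THRESHOLD})
```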
After the candidate phrases are obtained, the degree of solidification of each candidate phrase is calculated. The degree of solidification refers to the tightness between the characters within a phrase fragment: under the "artificial intelligence" specialty, a phrase such as "gradient" frequently appears as a unit and therefore has a high degree of solidification, while incidental character combinations spanning words such as "network" and "intelligence" have a low degree of solidification. Taking the degree of solidification of "gradient descent" as an example, the calculation formula is:

$$S(\text{gradient descent}) = \log \min_{\text{binary splits}} \frac{p(\text{gradient descent})}{p(\text{left part})\,p(\text{right part})}$$

where S(gradient descent) represents the degree of solidification of "gradient descent", p(gradient descent) represents the frequency of "gradient descent" in the original character strings, min takes the minimum of the ratios over all binary splits of the phrase into a left part and a right part, and the log of that minimum gives the final degree of solidification. The threshold of the degree of solidification is set to 1.8: if a phrase exceeds the threshold, it is retained for the next round of filtering; if it is below the threshold, the candidate phrase is removed.
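A sketch of this solidification filter, reusing the frequency counter from the previous sketch; the enumeration of binary splits follows the min-over-splits formula above, and taking `total` as the total candidate occurrence count is an assumption:

```python
import math
from collections import Counter

SOLID_THRESHOLD = 1.8  # degree-of-solidification threshold

def solidification(phrase: str, counts: Counter, total: int) -> float:
    p = lambda s: counts[s] / total  # empirical frequency of a string
    # Enumerate every binary split of the phrase and keep the weakest link.
    ratios = [p(phrase) / (p(phrase[:i]) * p(phrase[i:]))
              for i in range(1, len(phrase))
              if counts[phrase[:i]] and counts[phrase[i:]]]
    return math.log(min(ratios)) if ratios else float("-inf")

def filter_by_solidification(counts: Counter, total: int) -> list:
    return [ph for ph in counts
            if len(ph) >= 2
            and solidification(ph, counts, total) > SOLID_THRESHOLD]
```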
after the calculation and filtration of the solidification degree, the left-neighbor entropy and the right-neighbor entropy of the candidate phrase are further calculated. The left adjacent entropy of the candidate phrase refers to the sum of the information entropy of the candidate phrase and the information entropy combined with all adjacent words to the left of the candidate phrase, so as to judge the diversity of the left adjacent words of the candidate phrase. The larger the left neighbor entropy, the more the categories of the words adjacent to the left of the candidate phrase are described, and the more the adjacent characters are diversified, the more clear the boundary of the candidate phrase is, and the greater the probability of becoming a real phrase is. The left neighbor entropy calculation formula is:wherein->Is the left neighbor entropy of the candidate phrase G,left adjacent word being candidate phrase GThe set, e.g., the candidate phrase "gradient," may be preceded by the terms "edge," "at," "yes," "and," "use," "and" etc. left adjacency words. />Is that the adjacent word to the left of the candidate phrase G is +.>The calculation formula is: />Wherein->Is a left adjacent wordFrequency of co-occurrence with candidate phrase G, +.>Is the frequency with which the candidate phrase G appears alone. The calculation method of right neighbor entropy is similar, and only the calculation object is adjusted to be the entropy of the word appearing on the right side of the candidate phrase, so that the richness of the character appearing on the rear side of the candidate phrase is indicated. The right-neighbor entropy calculation formula is: />Wherein->Is the right neighbor entropy of the candidate phrase G, +.>Is the right set of contiguous words of the candidate phrase G, +.>Is that the adjacent word to the left of the candidate phrase G is +.>The calculation formula is: />Wherein->Is the right adjacency word->Frequency of co-occurrence with candidate phrase G, +.>Is the frequency with which the candidate phrase G appears alone.
The left-neighbor entropy and right-neighbor entropy of each candidate phrase are calculated separately, with both thresholds set to 1.2: a candidate phrase is retained if its left-neighbor entropy and right-neighbor entropy both exceed the threshold, and filtered out if either or both fall below it.
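A sketch of the neighbor-entropy filter with the 1.2 threshold; for simplicity it approximates each "adjacent word" by the single adjacent character, which is an assumption rather than the patent's exact procedure:

```python
import math
from collections import defaultdict

ENTROPY_THRESHOLD = 1.2

def neighbor_entropy(phrase: str, corpus: str, side: str = "left") -> float:
    # Collect the character adjacent to each occurrence of the phrase.
    neighbors = defaultdict(int)
    start = corpus.find(phrase)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(phrase)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(phrase, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    # E = -sum p(w|G) log p(w|G) over the neighbor distribution.
    return -sum((c / total) * math.log(c / total)
                for c in neighbors.values())

def passes_entropy(phrase: str, corpus: str) -> bool:
    return (neighbor_entropy(phrase, corpus, "left") > ENTROPY_THRESHOLD
            and neighbor_entropy(phrase, corpus, "right") > ENTROPY_THRESHOLD)
```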
After the screening and filtering of the above steps, all candidate phrases still retained are collected to form the subject entity library for "artificial intelligence".
Randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;
More specifically, based on the subject entity library for "artificial intelligence", this embodiment constructs the corresponding data set by randomly masking the entities in the initial data set.
First, the HanLP word segmentation tool is used to segment the initial data set of the professional field. During segmentation, the subject entity library is preloaded so that each phrase in the subject entity library is segmented as a single word; after segmentation, the total number of words in the text is counted.
After the total number of words is counted, the entity words in the subject entity library are masked: each character of a selected word is replaced with the [MASK] token, so that during training the model predicts the masked entity words from the unmasked corpus information and thereby learns rich subject knowledge. During masking, the ratio of masked entity words is calculated to ensure that it does not exceed 20%; if the ratio exceeds 20%, some masked entity words are reverted to the unmasked state to satisfy the masking ratio.
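A sketch of the masking step, under the assumptions that the 20% cap is measured against the total word count and that each character of a masked word is replaced by one [MASK] token; the names are illustrative:

```python
import random

MASK_RATIO = 0.20  # at most 20% of the words may be masked

def mask_entities(tokens: list, entity_set: set, seed: int = 0):
    """Randomly mask subject-entity tokens, capping the masked ratio."""
    rng = random.Random(seed)
    positions = [i for i, t in enumerate(tokens) if t in entity_set]
    rng.shuffle(positions)
    budget = int(len(tokens) * MASK_RATIO)  # cap relative to total word count
    masked, labels = list(tokens), {}
    for i in positions[:budget]:
        labels[i] = tokens[i]                  # remember the original word
        masked[i] = "[MASK]" * len(tokens[i])  # one [MASK] per character
    return masked, labels
```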
Based on the above procedure, take as an illustration the sentence "a gradient is a vector indicating that the directional derivative of a function at a given point attains its maximum value along that direction, i.e., the function changes fastest along that direction": the processing comprises, in order, initial data acquisition, acquisition of subject entities, random masking, and subject-entity-based masking, as shown in Table 1.
The language neural network is constructed based on a Transformer and comprises a pre-coding layer, a feature extraction layer and a prediction layer.
As shown in fig. 2, the pre-coding layer is used to convert the input text into word codes and corresponding position codes; that is, in order for the model to know the position information of each character in the input sequence, a position code is added to the word code: according to the position of the character in the input sequence, an extra vector indicates the element's position in the sequence.
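A minimal PyTorch sketch of such a pre-coding layer with learned position codes; the vocabulary size, hidden size and maximum length are illustrative assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class PreEncoding(nn.Module):
    """Word embedding plus a learned position embedding per character."""
    def __init__(self, vocab_size: int = 21128, d_model: int = 768,
                 max_len: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)  # word codes
        self.pos_emb = nn.Embedding(max_len, d_model)      # position codes

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Each character's vector is its word code plus its position code.
        return self.word_emb(token_ids) + self.pos_emb(positions)
```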
As shown in fig. 3, the feature extraction layer generates associated feature values of different angles between characters in the input text according to the input word codes and the position codes.
More specifically, the Encoder portion of the Transformer model is adopted, comprising a multi-head self-attention mechanism layer, a neural network layer, and a residual connection and normalization layer, with this stack repeated 4 times. The multi-head self-attention mechanism layer is responsible for finding the association relations of different angles within the input corpus; in the final splicing step, the association relations captured in different subspaces are recombined. The neural network layer is a fully connected feedforward network through which the character at each position passes individually. The residual connection and normalization layer is responsible for residual addition of the outputs of the multi-head self-attention mechanism layer and the neural network layer, followed by a layer normalization operation.
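A PyTorch sketch of one such feature-extraction block, stacked 4 times as described above; the head count and feedforward width are assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Multi-head self-attention, feedforward network, and residual
    connection + layer normalization, as in a Transformer Encoder."""
    def __init__(self, d_model: int = 768, n_heads: int = 12,
                 d_ff: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # associations across subspaces
        x = self.norm1(x + attn_out)      # residual addition + normalization
        x = self.norm2(x + self.ffn(x))   # position-wise feedforward
        return x

# The embodiment loops through this block 4 times:
feature_extractor = nn.Sequential(*[EncoderBlock() for _ in range(4)])
```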
As shown in fig. 4, the prediction layer outputs a prediction result according to the input associated feature value, wherein the prediction result comprises subject entities related to the text and corresponding text positions.
The language neural network is trained using the data set to obtain a subject entity language model for mining subject entities.
More specifically, the loss is calculated using a cross entropy loss function, which characterizes the distance between the probability distribution of the actual output and the probability distribution of the desired output: the smaller the cross entropy, the closer the two probability distributions. During model training, the model parameters are continuously optimized so that the output of the cross entropy loss function decreases, finally reaching the minimum loss.
The cross entropy loss function is expressed as follows:

$$L = -\sum_{i=1}^{n} y_i \log(p_i)$$

where $y_i$ represents the i-th element value of the true label of sample x, $p_i$ represents the model's predicted probability that sample x belongs to the i-th category, and n represents the total number of categories of the prediction result.
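A sketch of the training step, reusing the PreEncoding and EncoderBlock sketches above: cross entropy as the optimization objective and a gradient descent optimizer updating the network parameters. The optimizer settings, the final linear prediction layer over the vocabulary, and the use of -100 to skip unmasked positions are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(PreEncoding(),
                      *[EncoderBlock() for _ in range(4)],
                      nn.Linear(768, 21128))      # prediction layer over the vocabulary
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked positions
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def train_step(token_ids: torch.Tensor, labels: torch.Tensor) -> float:
    logits = model(token_ids)                     # (batch, seq_len, vocab)
    loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()        # compute gradients of the cross entropy loss
    optimizer.step()       # gradient descent update of the parameters
    return loss.item()
```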
And inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
More specifically, in this embodiment, corpus under the "artificial intelligence" specialty is entity-masked and passed to the model for prediction, and the model's effect is verified against the prediction results.
To verify the inference effect of the trained model, a reading comprehension task in the professional field is constructed, and whether the model's inference results accord with professional-domain knowledge is observed. Under the "artificial intelligence" professional field, a piece of prediction data is constructed: "pass the [MASK] from forward propagation into the back-propagation process, and find layer by layer the partial derivatives of the loss function with respect to the weights and biases of each [MASK], as the [MASK] of the objective function with respect to the weights and biases". The sentence is passed to the model interface; after model inference, the predictions at the three positions are "loss function", "neuron" and "gradient" respectively, demonstrating that the model has fused professional subject knowledge, possesses basic inference capability in the professional subject field, and can better enable downstream NLP training tasks.
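A hypothetical inference sketch in the same vein, feeding toy token ids through the model from the training sketch and reading off the predictions at the [MASK] positions; MASK_ID and the id sequence are purely illustrative:

```python
import torch

MASK_ID = 103  # illustrative id for the [MASK] token
token_ids = torch.tensor([[101, 2458, MASK_ID, 3291, MASK_ID, 102]])  # toy ids

model.eval()
with torch.no_grad():
    logits = model(token_ids)  # (1, seq_len, vocab)

for pos in (token_ids[0] == MASK_ID).nonzero().flatten():
    predicted_id = logits[0, pos].argmax().item()  # most probable vocabulary id
    print(f"position {pos.item()}: predicted token id {predicted_id}")
```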
The present embodiment also provides a subject entity recognition apparatus, including a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, where the computer memory adopts the subject entity language model set forth in the above embodiment.
The computer processor, when executing the computer program, performs the steps of: the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.

Claims (4)

1. A subject entity-based language model construction method, characterized by comprising the following steps:
acquiring teaching resources to construct a corresponding initial data set comprising video data, text data and voice data;
performing subject entity screening on the initial data set to construct a corresponding subject entity library, wherein the screening process is specifically as follows:
matching the initial data set with a regular expression to obtain an original character string set containing Chinese characters, English characters and numbers;
traversing the original character string set according to a preset character string length, and filtering out pure-digit character strings to obtain a candidate character string set;
screening the character strings whose occurrence counts exceed a preset word frequency threshold in the candidate character string set to construct the corresponding initial candidate phrases;
analyzing the degree of tightness between the characters in each initial candidate phrase, and retaining the initial candidate phrases meeting a preset degree of solidification;
calculating the left-neighbor entropy and right-neighbor entropy of each initial candidate phrase meeting the degree of solidification, and retaining, as optimal candidate phrases, the initial candidate phrases whose left-neighbor entropy and right-neighbor entropy both exceed preset average values, thereby constructing the subject entity library;
randomly masking the subject entities in the subject entity library to obtain corresponding masked words, and forming a data set from the subject entities and the corresponding masked words;
constructing a Transformer-based language neural network, wherein the language neural network comprises a pre-coding layer, a feature extraction layer and a prediction layer; the pre-coding layer is used for constructing corresponding word codes and position codes by importing the characters in the text into an initialized vector matrix and taking the position information of the characters as additional vectors; the feature extraction layer generates associated feature values of different angles between the characters in the input text according to the input word codes and position codes, and comprises a multi-head self-attention mechanism unit, a fully connected feedforward network, and a residual connection and normalization unit;
the multi-head self-attention mechanism unit is used for finding association relations of different angles between input characters and splicing the comprehensive association features captured in different subspaces;
the fully connected feedforward network is used for performing a nonlinear transformation on the captured comprehensive association features to obtain corresponding predicted association features;
the residual connection and normalization unit is used for integrating the association features and the predicted association features through residual addition, and performing a normalization operation to output the final associated feature values; and the prediction layer outputs a prediction result according to the input associated feature values, wherein the prediction result comprises the subject entities related to the text and the corresponding text positions;
training the language neural network with the data set, adopting a cross entropy loss function as the optimization objective of the language neural network, and performing iterative optimization with a gradient descent method to update the parameters of the language neural network, so as to obtain a subject entity language model for mining subject entities, wherein the cross entropy loss function is expressed as:

$$L = -\sum_{i=1}^{N} y_i \log(p_i)$$

where $y_i$ represents the true label of the i-th sample $x_i$, $p_i$ represents the predicted value for the i-th sample $x_i$, and N represents the total number of samples; and inputting the teaching resources to be identified into the subject entity language model to output the subject entities contained in the text.
2. The subject entity-based language model construction method of claim 1, wherein the degree of solidification of an initial candidate phrase is calculated as follows, for a string "ABCD":

$$S(\mathrm{ABCD}) = \log \min\left(\frac{p(\mathrm{ABCD})}{p(\mathrm{A})\,p(\mathrm{BCD})},\ \frac{p(\mathrm{ABCD})}{p(\mathrm{AB})\,p(\mathrm{CD})},\ \frac{p(\mathrm{ABCD})}{p(\mathrm{ABC})\,p(\mathrm{D})}\right)$$

where $S(\mathrm{ABCD})$ represents the degree of solidification of the "ABCD" string and $p(\cdot)$ represents the frequency with which the corresponding string appears in the original character string set.
3. The subject entity-based language model construction method of claim 1, wherein the initial data set requires preprocessing before subject entity screening, the preprocessing comprising full-width to half-width conversion, conversion of traditional Chinese characters into simplified characters, and removal of spaces, line breaks and special characters.
4. A subject entity recognition apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, characterized in that the computer memory employs a subject entity language model constructed by the subject entity-based language model construction method of any one of claims 1-3;
the computer processor, when executing the computer program, performs the steps of:
the teaching resources are input into the subject entity language model to output a set of subject entities associated with the teaching resources.
CN202311228568.4A 2023-09-22 2023-09-22 Language model construction method based on subject entity and subject entity recognition device Active CN116976351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311228568.4A CN116976351B (en) 2023-09-22 2023-09-22 Language model construction method based on subject entity and subject entity recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311228568.4A CN116976351B (en) 2023-09-22 2023-09-22 Language model construction method based on subject entity and subject entity recognition device

Publications (2)

Publication Number Publication Date
CN116976351A CN116976351A (en) 2023-10-31
CN116976351B true CN116976351B (en) 2024-01-23

Family

ID=88473370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311228568.4A Active CN116976351B (en) 2023-09-22 2023-09-22 Language model construction method based on subject entity and subject entity recognition device

Country Status (1)

Country Link
CN (1) CN116976351B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140280314A1 (en) * 2013-03-14 2014-09-18 Advanced Search Laboratories, lnc. Dimensional Articulation and Cognium Organization for Information Retrieval Systems

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN109902298A (en) * 2019-02-13 2019-06-18 东北师范大学 Domain Modeling and know-how estimating and measuring method in a kind of adaptive and learning system
CN111368545A (en) * 2020-02-28 2020-07-03 北京明略软件系统有限公司 Named entity identification method and device based on multi-task learning
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112287037A (en) * 2020-10-23 2021-01-29 大连东软教育科技集团有限公司 Multi-entity mixed knowledge graph construction method and device and storage medium
CN112560486A (en) * 2020-11-25 2021-03-26 国网江苏省电力有限公司电力科学研究院 Power entity identification method based on multilayer neural network, storage medium and equipment
CN112784009A (en) * 2020-12-28 2021-05-11 北京邮电大学 Subject term mining method and device, electronic equipment and storage medium
CN112800766A (en) * 2021-01-27 2021-05-14 华南理工大学 Chinese medical entity identification and labeling method and system based on active learning
CN113901807A (en) * 2021-08-30 2022-01-07 重庆德莱哲企业管理咨询有限责任公司 Clinical medicine entity recognition method and clinical test knowledge mining method
CN114186013A (en) * 2021-12-15 2022-03-15 广州华多网络科技有限公司 Entity recognition model hot updating method and device, equipment, medium and product thereof
CN114443813A (en) * 2022-01-09 2022-05-06 西北大学 Intelligent online teaching resource knowledge point concept entity linking method
CN115169349A (en) * 2022-06-30 2022-10-11 中国人民解放军战略支援部队信息工程大学 Chinese electronic resume named entity recognition method based on ALBERT
CN116720519A (en) * 2023-06-08 2023-09-08 吉首大学 Seedling medicine named entity identification method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Summarization and Simplification of Medical Articles using Natural Language Processing; Shashank Patel et al.; 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT); pp. 1-6 *
Research and Application of a High School Chemistry Test Question Retrieval Method Based on Chinese Named Entity Recognition; 张璐; Wanfang; Section 1.2, Chapter 3 *
Research and Implementation of Key Technologies for the Knowledge-Graph-Based Intelligent Transformation of an Online Teaching Resource Library; 王雨扬; China Master's Theses Full-text Database, Social Sciences II; Vol. 2023, No. 2; pp. 24-25 *

Also Published As

Publication number Publication date
CN116976351A (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN110134946B (en) Machine reading understanding method for complex data
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110362819B (en) Text emotion analysis method based on convolutional neural network
CN109684626A (en) Method for recognizing semantics, model, storage medium and device
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111444367B (en) Image title generation method based on global and local attention mechanism
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN112307130B (en) Document-level remote supervision relation extraction method and system
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
CN114359946A (en) Optical music score image recognition method based on residual attention transducer
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN113223509A (en) Fuzzy statement identification method and system applied to multi-person mixed scene
CN114841151B (en) Medical text entity relation joint extraction method based on decomposition-recombination strategy
CN114239574A (en) Miner violation knowledge extraction method based on entity and relationship joint learning
CN114912453A (en) Chinese legal document named entity identification method based on enhanced sequence features
CN113239690A (en) Chinese text intention identification method based on integration of Bert and fully-connected neural network
CN113836896A (en) Patent text abstract generation method and device based on deep learning
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN112434686B (en) End-to-end misplaced text classification identifier for OCR (optical character) pictures
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN116976351B (en) Language model construction method based on subject entity and subject entity recognition device
CN115840815A (en) Automatic abstract generation method based on pointer key information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant