CN114443813A - Intelligent online teaching resource knowledge point concept entity linking method

Publication number: CN114443813A
Application number: CN202210018754.4A
Authority: CN (China)
Prior art keywords: knowledge point, character, vector, concept, entity
Legal status: Granted; active
Inventors: 袁新瑞, 王雨扬
Assignee: Northwest University
Application filed by Northwest University; priority to CN202210018754.4A
Publication of CN114443813A; application granted and published as CN114443813B (zh)

Classifications

    • G06F 16/374: Creation of semantic tools (thesaurus) for information retrieval of unstructured textual data
    • G06F 16/3346: Query execution using a probabilistic model
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 40/216: Parsing using statistical methods
    • G06F 40/295: Named entity recognition
    • G06F 40/30: Semantic analysis
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/08: Neural network learning methods

Abstract

An intelligent online teaching resource knowledge point concept entity linking method comprises a knowledge point concept entity recognition model and a knowledge point concept linking model. The application scenario is mainly teaching resource organization and management in domestic online learning platforms; because domestic teaching is conducted essentially in Chinese, the method targets Chinese-language text and is compatible with partial English text. Knowledge point concept entity recognition extracts knowledge point concept entity vocabulary (disciplines, professional terms, historical events and the like) from teaching resource text; the extracted knowledge point concept entities are called knowledge point mentions. Knowledge point concept linking finds, from a knowledge base, the concept knowledge with the highest semantic similarity to each extracted knowledge point mention and its surrounding context, and establishes the association. Through knowledge point concept entity recognition and knowledge point concept linking, teaching resources are associated with knowledge point concepts, achieving the purpose of constructing a teaching resource organization system with concept knowledge at its core.

Description

Intelligent online teaching resource knowledge point concept entity linking method
Technical Field
The invention relates to intelligent education, in particular to an intelligent online teaching resource knowledge point concept entity linking method.
Background
Traditional teaching resource libraries carry a large amount of learning resources, and their rich resource types have attracted wide attention. As the number of users on online learning platforms grows, the quantity and variety of teaching resources in each platform keep increasing to meet different users' different requirements. In practice, as teaching resources grow in number and diversify in content, learners must spend more time and energy than before searching for and selecting the learning resources they need, their learning efficiency on the platform gradually decreases, and their learning quality and initiative are seriously affected.
As an effective means of structuring human knowledge, knowledge graphs have become a core driving force in the development of the internet and artificial intelligence. The teaching resource library in an adaptive learning system can likewise use knowledge graph technology to build a teaching resource system with knowledge at its core: by associating teaching resources with concept knowledge points, the teaching resource system can be organized effectively and thereby empower the adaptive learning system.
In existing online teaching resources, knowledge point concept labeling and association are entered manually by teachers. Manual entry, however, consumes a great deal of time and energy; most knowledge point concepts provided by teachers are coarse-grained, fine-grained knowledge point concepts in the teaching resources are ignored, labeling remains incomplete, and learners cannot intuitively grasp the details of course content. Solving these problems requires an intelligent method or tool that accurately identifies and associates the knowledge point concept entities in online teaching resources. So far only a few researchers have carried out related work, mainly extracting key phrases and terms from teaching resources by means of statistical learning, and this progress falls far short of solving the key problems above.
With progress in knowledge graphs and natural language processing, entity linking technology can address the problems above. Entity linking identifies mentions in text and links them to the corresponding entities in a knowledge base. Most existing entity linking methods are open-domain: they recognize key entity vocabulary such as person names, place names, organizations and times in a text corpus and link them to corresponding entries in a knowledge base (such as online encyclopedias and Wikipedia). Fairly mature entity linking tools already exist, such as Wikify!, AIDA, DBpedia Spotlight, TagMe and Linkify. These systems consist mainly of two parts: entity mention detection and entity linking. Although such entity linking systems are well developed, they have certain shortcomings. For mention detection they rely mostly on existing Named Entity Recognition (NER) tools such as Stanza, Jieba and SnowNLP, which achieve considerable recognition accuracy but can only recognize three entity categories: people, places and organizations.
Unlike the open-domain entity linking task, teaching resource knowledge point concept linking extracts and associates only the concept entities involved in teaching resources, not all entities (place name entities, person entities, time entities and so on), so existing entity linking tools are not suitable for linking knowledge point concepts in teaching resources.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an intelligent online teaching resource knowledge point concept entity linking method, which is based on the natural language processing model Bert combined with text data enhancement to extract and link the knowledge point concepts contained in teaching resources, realizes the association between teaching resources and concept knowledge, and finally constructs a teaching resource organization system with knowledge point concepts at its core.
In order to achieve the purpose, the invention adopts the technical scheme that:
an intelligent online teaching resource knowledge point concept entity linking method is characterized by comprising the following steps:
1) firstly, a preprocessing step of character string cleaning is performed: for each character, judge whether it belongs to the union of the Chinese, numeric and English character sets, denoted S; characters not in S are removed;
2) the model labels every element of the cleaned character string C = {c_1, c_2, ..., c_l} with a BIO labeling scheme: when a character c_i is labeled "B", c_i is the first character of some knowledge point concept vocabulary entity, "I" marks a middle character of a knowledge point concept vocabulary entity, and "O" marks a character outside any knowledge point concept vocabulary; labeled text data are finally obtained;
3) text data enhancement builds a knowledge point concept dictionary Dict from the knowledge point entry nouns and their aliases in the knowledge base; a bidirectional maximum matching algorithm (BiDirectional Maximum Matching) matches the character string C to find the dictionary words it contains, and every matched character substring is labeled with a "BIEO" scheme: if a matched substring is C_sub = {c_i, c_{i+1}, ..., c_{i+m}}, C_sub ∈ Dict, the starting character c_i is labeled "B", the ending character c_{i+m} is labeled "E", the characters {c_{i+1}, c_{i+2}, ..., c_{i+m-1}} between them are labeled "I", and unmatched characters are labeled "O"; this yields a labeled character string, to which the start character "[CLS]" and end character "[SEP]" are added: S = {s_[CLS], s_1, s_2, ..., s_l, s_[SEP]}, where each element s_i consists of the character c_i at the corresponding index of C and its label character;
4) a vector space embedding operation Embedding(S) is applied to the labeled character string S: each element s_i of S is characterized as a d_s-dimensional vector whose values are randomly initialized with the Kaiming distribution; the embedded sequence vector is E_S = {e_[CLS], e_1, e_2, ..., e_l, e_[SEP]};
5) the sequence vector E_S obtained above contains the boundary information of the knowledge point concept vocabulary; the context semantic information contained in the character string C is then characterized with a pre-trained neural network language model, Bert, i.e. a model trained on large-scale general text data. Used as a semantic encoder, the pre-trained Bert can effectively represent a text sequence as a high-dimensional vector. The cleaned character string C serves as the input of the pre-trained Bert language model, which computes over C character by character: for the input string C = {c_1, c_2, ..., c_l}, the Bert model first inserts the identifiers "[CLS]" and "[SEP]" before the start position and after the end position, respectively, and the string {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"} becomes the model's computation data;
6) the output vector F obtained from the Bert model is the encoding vector of the character string C; combined with the sequence vector E_S carrying the concept knowledge point vocabulary boundary information, candidate concept knowledge point entities are extracted from C through an LSTM model and a conditional random field (CRF); the substrings corresponding to the predicted tag sequence are extracted to obtain the knowledge point concept mention entities;
7) the knowledge point concept entity linking model matches and associates the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} with knowledge point entities in the knowledge base; candidate knowledge point concept entities are generated with the Levenshtein Distance string fuzzy matching algorithm: the current mention entity m_i is fuzzily matched against the knowledge point concept vocabulary in the knowledge base, matched vocabulary whose edit distance exceeds the parameter Distance is filtered out, and a candidate knowledge point concept entity set Ent_i = {ent_1, ent_2, ..., ent_n} is generated;
8) the abstract text description of each candidate knowledge point concept entity is encoded by the pre-trained Bert model introduced above to obtain a vector characterizing the candidate entity: for a candidate knowledge point concept entity ent_j, its corresponding abstract description, as a character string D_j = {d_1, d_2, ..., d_q}, is the input of the Bert model; the encoded output is H_ent = {h_cls, h_1, ..., h_sep}; the implicit vector h_cls corresponding to the identifier "[CLS]" is passed through a fully connected layer with tanh activation to obtain the output vector v_{ent_j}, which serves as the characterization vector of the candidate knowledge point concept entity, i.e. v_{ent_j} = tanh(h_cls W + b); in this way the set of characterization vectors of the candidate entity set is obtained: V_Ent_i = {v_{ent_1}, v_{ent_2}, ..., v_{ent_n}};
9) for each mentioned knowledge point concept m_i, the course text C in which it appears is characterized: the pre-trained Bert model first encodes the course text C = {c_1, c_2, ..., c_l} to obtain its characterization vector V_C, obtained in the same way as the characterization vector of a candidate knowledge point concept entity;
10) the encoding vectors of the characters of the course text computed by the Bert model are H_C = {h_cls, h_1, h_2, ..., h_l, h_sep}; for an extracted knowledge point concept mention m_i, the index position of the plaintext substring it denotes can be represented as a tuple (beg, end), where beg is the index of the substring's starting position in C and end the index of its ending position; the slice of H_C between the start index beg and the end index end is H_{m_i} = {h_beg, ..., h_end}; H_{m_i} is fed through a text convolution network TextCNN to obtain the mention's characterization vector v_{m_i} = TextCNN(H_{m_i}); the course text characterization vector V_C and the mention characterization vector v_{m_i} are joined by a Concate splicing operation and passed through a fully connected layer with tanh activation to obtain the output vector u_{m_i}, i.e. u_{m_i} = tanh(Concate(V_C, v_{m_i}) W + b);
11) the mention's output vector u_{m_i} is compared by cos similarity with every characterization vector in the candidate entity set V_Ent_i = {v_{ent_1}, ..., v_{ent_n}}; the knowledge point concept with the highest similarity is selected from the candidate set and associated with the mention, so the final association result can be represented as a tuple (m_i, ent*);
12) the linking results of all knowledge point concepts contained in the input course text are R = {(m_1, ent_1*), (m_2, ent_2*), ..., (m_k, ent_k*)}, completing the association between teaching resources and the knowledge point concepts in the knowledge base.
The input of the knowledge point concept entity recognition model is a text string X = {x_1, x_2, ..., x_n}: X consists of n characters, x_i being the i-th character of X; the text string may come from course video captions, electronic textbook text, and the like.
The character string cleaning preprocessing is implemented mainly through the Unicode code table. When a character x_i's Unicode code u(x_i) lies between \u4e00 and \u9fa5, x_i is a Chinese character. Likewise, when u(x_i) lies between \u0030 and \u0039, x_i is a numeric character; when u(x_i) lies between \u0041 and \u005a or between \u0061 and \u007a, x_i is an English character. All characters outside these code ranges are deleted, which completes the cleaning of the character string; the cleaned string is C = {c_1, c_2, ..., c_l}, whose length l satisfies l ≤ n.
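This cleaning rule can be sketched directly from the code-point comparisons above. A minimal sketch in Python, where the function name is illustrative and the digit and letter ranges are the standard Unicode ranges \u0030-\u0039, \u0041-\u005a and \u0061-\u007a:

```python
def clean_string(text: str) -> str:
    """Keep only Chinese characters, digits and English letters; drop everything else."""
    def in_charset(ch: str) -> bool:
        return (
            '\u4e00' <= ch <= '\u9fa5'   # Chinese characters (CJK unified ideographs)
            or '0' <= ch <= '9'          # numeric characters
            or 'A' <= ch <= 'Z'          # uppercase English letters
            or 'a' <= ch <= 'z'          # lowercase English letters
        )
    return ''.join(ch for ch in text if in_charset(ch))

# Punctuation and whitespace are removed, so the cleaned length l <= n.
assert clean_string("机器学习, 即 Machine Learning!") == "机器学习即MachineLearning"
```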
The calculation of the Bert model for the character string mainly comprises the following steps:
Character embedding operation: each character of the string to be computed, {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"}, is characterized as a d-dimensional character vector by an embedding operation (Embedding); the embedded character string vector is E_c = {e_cls, e_1, ..., e_l, e_sep}.
Integrating position information coding: to obtain the sequence features of text data, the Bert model uses sin and cos mechanisms on charactersString vector EcThe position index of each element in (a) is encoded. I.e. for elements at the position of pos, where diD is more than or equal to 1 for the dimension position in each elementiD is less than or equal to d, when d isiFor even numbers, using sin function for conversion, diIf the number is odd, the cos function is used for conversion to obtain the position coding vector
Figure BDA0003461606260000079
Each element p is a d-dimensional vector, and the corresponding position coding formula is as follows:
Figure BDA00034616062600000710
Figure BDA0003461606260000081
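A minimal sketch of this sin/cos position coding, assuming an even model dimension d and 0-based dimension indexing as in the standard transformer formulation:

```python
import numpy as np

def position_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position codes: sin on even dimensions, cos on odd dimensions."""
    P = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]      # position index pos
    i = np.arange(0, d, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000.0, i / d)
    P[:, 0::2] = np.sin(angle)             # even dimensions use sin
    P[:, 1::2] = np.cos(angle)             # odd dimensions use cos
    return P
```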
Self-attention mechanism based on dot product scaling: the character string vector E_c computed above is added to the position coding vector P to obtain the input vector of the self-attention mechanism, Z = {z_cls, z_1, ..., z_l, z_sep}. The self-attention mechanism captures the degree of association between every pair of elements in the sequence through scaled dot products: the stronger the association between two elements, the larger the computed value. The formula is as follows, where each input is the vector Z multiplied by its corresponding weight parameter W, i.e. Q = Z W^Q, K = Z W^K, V = Z W^V, and d is the input vector dimension:

Attention(Q, K, V) = softmax(Q K^T / √d) V
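A sketch of this scaled dot product attention in plain numpy, assuming Q, K and V have already been projected from Z with their weight matrices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)    # pairwise association degrees
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```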
Multi-head self-attention mechanism: in order to fully consider information from different independent subspaces after the scaled dot product computation, the vectors of h scaled dot product computations, i.e. h attention heads, are spliced by Concate and then linearly transformed. The formula is as follows, where head_i = Attention(Z W_i^Q, Z W_i^K, Z W_i^V) and W^O is a trainable parameter matrix:

MultiHead(Q, K, V) = Concate(head_1, ..., head_h) W^O
Feed-forward network layer: the result of each character element after multi-head attention is only a linear transformation; to fully consider the interaction of information across different latent dimensions, a feed-forward network layer with a nonlinear transformation is integrated into the model. It is computed as follows, where W^(1), W^(2), b^(1), b^(2) are trainable parameter matrices:

F = FFN(Z) = ReLU(Z W^(1) + b^(1)) W^(2) + b^(2)
Candidate concept knowledge point entities are extracted from the character string C through an LSTM model and a conditional random field CRF; the main process is as follows:
Feature vector fusion: the encoding vector F carrying semantic features and the sequence vector E_S carrying knowledge point concept vocabulary boundary information are spliced by Concate and then linearly transformed by a weight parameter matrix W to obtain the fused vector V = {v_cls, v_1, v_2, ..., v_l, v_sep}:

V = Concate(F, E_S) W
Encoding of the LSTM model: the LSTM model is a variant of the recurrent neural network (RNN) and has a more robust predictive effect than the RNN model. When computing the i-th element it can fully incorporate the vector information of the first i-1 elements. For the element at each time step t the computation is:

z_t = σ(W_z · [h_{t-1}, v_t])
r_t = σ(W_r · [h_{t-1}, v_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, v_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, ⊙ is element-wise (dot product) multiplication, v_t is the t-th element of the fused vector V, and h_t is the implicit state vector, i.e. the output of v_t; the output of the vector V after the LSTM model is H = {h_1, h_2, ..., h_T}, where T = l + 2.
CRF model prediction layer: the model prediction layer judges the implicit vectors output by the LSTM model and consists of a fully connected layer and a CRF layer. First, the implicit state vectors H = {h_1, h_2, ..., h_T} output by the LSTM model are linearly transformed by a fully connected layer to obtain the score of each character for each class label, i.e. each label score l_score_i = [score_1, score_2, score_3] contains three elements, where score_1 is the probability score of predicting the current character as "B", score_2 as "I", and score_3 as "O". The prediction label scores of the characters in the string form L_Score = {l_score_cls, l_score_1, l_score_2, ..., l_score_l, l_score_sep}, which is taken as the input of the CRF layer. The CRF layer models the labels by taking the input score set as an emission score matrix, computes a score transition matrix T between label categories representing the transition probability from one label to another, mines the dependency between label categories, computes the sequence score Score(H) of the string, and decodes it with the Viterbi algorithm to obtain the predicted label sequence Y = {y_cls, y_1, y_2, ..., y_l, y_sep}. Removing the prediction tags corresponding to the start identifier "[CLS]" and end identifier "[SEP]" added by the Bert model gives the prediction tag sequence of the string, Y' = {y_1, y_2, ..., y_l}; extracting the corresponding substrings from the predicted tag sequence yields the knowledge point concept mention entities M = {m_1, m_2, ..., m_k}.
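A minimal sketch of the Viterbi decoding step over the emission scores and the transition matrix; start/stop transition scores are omitted as a simplifying assumption:

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """emissions: (T, num_tags) scores from the fully connected layer;
    transitions: (num_tags, num_tags) score transition matrix T."""
    T, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # total[i, j]: best score ending in tag j at step t via tag i at step t-1
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # backtrack the best tag sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```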
The extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} are matched and associated with knowledge point entities in the knowledge base, mainly in the following steps: 1. each mention entity m_i is fuzzily searched with the Levenshtein Distance string fuzzy matching algorithm, and a set of possibly matching candidate knowledge point entities is selected from the knowledge base; 2. the mention entity m_i and the candidate entities are given context semantic representations through the Bert model, obtaining context semantic representation vectors; 3. the similarity between the context semantic representation vectors of the mention and of each candidate entity is computed with the cos function, and the candidate knowledge point entity with the highest similarity is the linked knowledge point concept.
The TextCNN model takes H_{m_i} as input; the calculation steps are as follows:
1. A plurality of one-dimensional convolution kernels are defined and applied to the input to capture correlations between adjacent characters.
2. Max-over-time pooling is applied to each output channel, and the pooled outputs of all channels are spliced to obtain the characterization vector.
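A sketch of such a TextCNN in PyTorch; the filter count and kernel sizes are illustrative, and the input span is assumed to be at least as long as the largest kernel:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """1-D convolutions over a mention's character vectors, then max-over-time pooling."""
    def __init__(self, d_in: int, n_filters: int = 64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_in, n_filters, k) for k in kernel_sizes]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = h.transpose(1, 2)                 # (batch, span_len, d_in) -> (batch, d_in, span_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)       # spliced characterization vector
```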
The invention has the beneficial effects that:
the technical framework of this patent mainly contains two main parts: the application scene of the patent mainly faces to the organization and management of teaching resources in a domestic online learning platform, and domestic teaching is basically Chinese teaching, so that the knowledge point concept entity recognition model and the knowledge point concept link model are only suitable for Chinese language texts and are compatible with partial English texts. The knowledge point concept entity identification is to extract the contained knowledge point concept entity vocabulary from the teaching resource text, such as: the extracted knowledge point concept entities are called knowledge point mentions; the knowledge point concept association means that the concept knowledge with the highest semantic similarity is found from a knowledge base according to the extracted knowledge point concept mention and the context where the knowledge point concept is located, and the relationship is carried out. The association between teaching resources and knowledge point concepts is realized through knowledge point concept entity recognition and knowledge point concept linkage, and the purpose of constructing a teaching resource organization system taking concept knowledge as a core is achieved.
Drawings
Fig. 1 is a working principle diagram of the present invention.
Fig. 2 is a schematic diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1 and 2, the method consists of a knowledge point concept entity recognition model and a knowledge point concept entity link model. The knowledge point concept entity recognition model is described first.
The input of the knowledge point concept entity recognition model is a text string X = {x_1, x_2, ..., x_n}: X consists of n characters, x_i being the i-th character of X; the text string may come from course video captions, electronic textbook text, and the like.
The character string is first put through a cleaning preprocessing step: for each character, judge whether it belongs to the union of the Chinese, numeric and English character sets, denoted S; characters not in S are removed. This is implemented mainly through the Unicode code table: when a character x_i's Unicode code u(x_i) lies between \u4e00 and \u9fa5, x_i is a Chinese character. Likewise, when u(x_i) lies between \u0030 and \u0039, x_i is a numeric character; when u(x_i) lies between \u0041 and \u005a or between \u0061 and \u007a, x_i is an English character. All characters outside these Unicode ranges are deleted, completing the cleaning; the cleaned string is C = {c_1, c_2, ..., c_l}, with length l ≤ n. Next, the model labels every element of the cleaned character string C = {c_1, c_2, ..., c_l} with the BIO labeling scheme: when a character c_i is labeled "B", c_i is the first character of some knowledge point concept vocabulary entity; "I" marks a middle character of a knowledge point concept vocabulary entity; "O" marks a character outside any knowledge point concept vocabulary.
Because knowledge point concept vocabulary occurs with low frequency in teaching text and the concept words are long, a traditional character-level entity recognition model has difficulty recognizing the text boundaries of knowledge point concept entities, making complete recognition hard. This method uses text data enhancement combined with the Bert language model to improve the accuracy of the knowledge point concept entity recognition model.
Text data enhancement builds a knowledge point concept dictionary Dict from the knowledge point vocabulary terms and their aliases in the knowledge base; the external knowledge base used in this patent is a discipline knowledge base provided online by academia. The bidirectional maximum matching algorithm (BiDirectional Maximum Matching) matches the character string C to find the dictionary words it contains. Every matched character substring is labeled with the "BIEO" scheme: if a matched substring is C_sub = {c_i, c_{i+1}, ..., c_{i+m}}, C_sub ∈ Dict, the starting character c_i is labeled "B", the ending character c_{i+m} is labeled "E", the characters {c_{i+1}, c_{i+2}, ..., c_{i+m-1}} between them are labeled "I", and unmatched characters are labeled "O". This yields a labeled character string, to which the start character "[CLS]" and end character "[SEP]" are added: S = {s_[CLS], s_1, s_2, ..., s_l, s_[SEP]}, where each element s_i consists of the character c_i at the corresponding index of C and its label character.
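A sketch of the dictionary matching and labeling; only the forward pass of the bidirectional maximum matching is shown for brevity (the backward pass scans right to left, and the segmentation with fewer unmatched characters would be kept), and max_len is an assumed cap on dictionary word length:

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 8):
    """Greedy longest-match from left to right; returns (start, end) spans."""
    spans, i = [], 0
    while i < len(text):
        # only words of length >= 2 are matched, so "B" and "E" stay distinct
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                spans.append((i, j))
                i = j
                break
        else:
            i += 1                      # no dictionary word starts here
    return spans

def label_bieo(text: str, spans) -> list:
    """Mark each matched substring with B/I/E; everything else gets O."""
    tags = ['O'] * len(text)
    for beg, end in spans:
        tags[beg] = 'B'
        tags[end - 1] = 'E'
        for k in range(beg + 1, end - 1):
            tags[k] = 'I'
    return tags
```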
A vector space embedding operation Embedding(S) is applied to the labeled character string S: each element s_i is characterized as a d_s-dimensional vector whose values are randomly initialized with the Kaiming distribution; the embedded sequence vector is E_S = {e_[CLS], e_1, e_2, ..., e_l, e_[SEP]}.
The sequence vector E_S obtained above contains the boundary information of the knowledge point concept vocabulary; next, the context semantic information contained in the character string C is characterized. This patent uses a pre-trained neural network language model, Bert, i.e. a model trained on large-scale general text data. With the pre-trained Bert as semantic encoder, a text sequence can be effectively represented as a high-dimensional vector.
The cleaned character string C is taken as the input of the pre-trained Bert language model, which computes over C character by character. For an input string C = {c_1, c_2, ..., c_l}, the Bert model first inserts the identifiers "[CLS]" and "[SEP]" before the start position and after the end position, respectively; the string {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"} becomes the model's computation data. The Bert model's computation over the string mainly comprises the following steps:
Character embedding operation: each character of the string to be computed, {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"}, is characterized as a d-dimensional character vector by an embedding operation (Embedding); the embedded character string vector is E_c = {e_cls, e_1, ..., e_l, e_sep}.
Integrating position information coding: to obtain the sequence features of the text data, the Bert model uses sin and cos mechanisms on a string vector EcThe position index of each element in (a) is encoded. I.e. for elements at the position of pos, where diD is more than or equal to 1 for the dimension position in each elementiD is less than or equal to d, when d isiFor even numbers, using sin function for conversion, diIf the number is odd, the cos function is used for conversion to obtain the position coding vector
Figure BDA0003461606260000142
Each element p is a d-dimensional vector, and the corresponding position coding formula is as follows:
Figure BDA0003461606260000143
Figure BDA0003461606260000144
Self-attention mechanism based on dot product scaling: the character string vector E_c computed above is added to the position coding vector P to obtain the input vector of the self-attention mechanism, Z = {z_cls, z_1, ..., z_l, z_sep}. The self-attention mechanism captures the degree of association between every pair of elements in the sequence through scaled dot products: the stronger the association between two elements, the larger the computed value. The formula is as follows, where each input is the vector Z multiplied by its corresponding weight parameter W, i.e. Q = Z W^Q, K = Z W^K, V = Z W^V, and d is the input vector dimension:

Attention(Q, K, V) = softmax(Q K^T / √d) V
Multi-head self-attention mechanism: in order to fully consider information from different independent subspaces after the scaled dot product computation, the vectors of h scaled dot product computations, i.e. h self-attention heads, are spliced by Concate and then linearly transformed. The formula is as follows, where head_i = Attention(Z W_i^Q, Z W_i^K, Z W_i^V) and W^O is a trainable parameter matrix:

MultiHead(Q, K, V) = Concate(head_1, ..., head_h) W^O
Feed-forward network layer: the result of each character element after multi-head attention is only a linear transformation; to fully consider the interaction of information across different latent dimensions, a feed-forward network layer with a nonlinear transformation is integrated into the model. It is computed as follows, where W^(1), W^(2), b^(1), b^(2) are trainable parameter matrices:

F = FFN(Z) = ReLU(Z W^(1) + b^(1)) W^(2) + b^(2)
The output vector F obtained from the Bert model is the encoding vector of the character string C; combined with the sequence vector E_S carrying the concept knowledge point vocabulary boundary information, candidate concept knowledge point entities are extracted from C through an LSTM model and a conditional random field CRF. The main process is as follows:

Feature vector fusion: the encoding vector F carrying semantic features and the sequence vector E_S carrying knowledge point concept vocabulary boundary information are spliced by Concate and then linearly transformed by a weight parameter matrix W to obtain the fused vector V = {v_cls, v_1, v_2, ..., v_l, v_sep}:

V = Concate(F, E_S) W
Encoding of the LSTM model: the LSTM model is a variant of the recurrent neural network (RNN) and has a more robust predictive effect than the RNN model. When computing the i-th element it can fully incorporate the vector information of the first i-1 elements. For the element at each time step t the computation is:

z_t = σ(W_z · [h_{t-1}, v_t])
r_t = σ(W_r · [h_{t-1}, v_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, v_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, ⊙ is element-wise (dot product) multiplication, v_t is the t-th element of the fused vector V, and h_t is the implicit state vector, i.e. the output of v_t; the output of the vector V after the LSTM model is H = {h_1, h_2, ..., h_T}, where T = l + 2.
CRF model prediction layer: the model prediction layer judges the implicit vectors output by the LSTM model and consists of a fully connected layer and a CRF layer. First, the implicit state vectors H = {h_1, h_2, ..., h_T} output by the LSTM model are linearly transformed by a fully connected layer to obtain the score of each character for each class label, i.e. each label score l_score_i = [score_1, score_2, score_3] contains three elements, where score_1 is the probability score of predicting the current character as "B", score_2 as "I", and score_3 as "O". The prediction label scores of the characters in the string form L_Score = {l_score_cls, l_score_1, l_score_2, ..., l_score_l, l_score_sep}, which is taken as the input of the CRF layer. The CRF layer models the labels by taking the input score set as an emission score matrix, computes a score transition matrix T between label categories representing the transition probability from one label to another, mines the dependency between label categories, computes the sequence score Score(H) of the string, and decodes it with the Viterbi algorithm to obtain the predicted label sequence Y = {y_cls, y_1, y_2, ..., y_l, y_sep}. Removing the prediction tags corresponding to the start identifier "[CLS]" and end identifier "[SEP]" added by the Bert model gives the prediction tag sequence of the string, Y' = {y_1, y_2, ..., y_l}; extracting the corresponding substrings from the predicted tag sequence yields the knowledge point concept mention entities M = {m_1, m_2, ..., m_k}.
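A minimal sketch of recovering the mention set M and the span indices (beg, end) from the predicted BIO tags:

```python
def extract_mentions(chars: list, tags: list):
    """Collect substrings tagged as a 'B' followed by consecutive 'I' characters."""
    mentions, i = [], 0
    while i < len(tags):
        if tags[i] == 'B':
            j = i + 1
            while j < len(tags) and tags[j] == 'I':
                j += 1
            mentions.append((''.join(chars[i:j]), i, j - 1))  # (text, beg, end)
            i = j
        else:
            i += 1
    return mentions
```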
Knowledge point concept entity link model
The knowledge point concept entity link model matches and associates the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} with knowledge point entities in the knowledge base, mainly in the following steps: 1. each mention entity m_i is fuzzily searched with the Levenshtein Distance string fuzzy matching algorithm, and a set of possibly matching candidate knowledge point entities is selected from the knowledge base; 2. the mention entity m_i and the candidate entities are given context semantic representations through the Bert model, obtaining context semantic representation vectors; 3. the similarity between the context semantic representation vectors of the mention and of each candidate entity is computed with the cos function, and the candidate knowledge point entity with the highest similarity is the linked knowledge point concept.
Candidate knowledge point concept entities are generated with the Levenshtein Distance string fuzzy matching algorithm: the current mention entity m_i is fuzzily matched against the knowledge point concept vocabulary in the knowledge base; by setting the edit distance parameter Distance of the fuzzy matching algorithm, matched vocabulary whose edit distance exceeds Distance is filtered out, generating the candidate knowledge point concept entity set Ent_i = {ent_1, ent_2, ..., ent_n}.
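A sketch of this candidate generation step; the edit distance is computed with the classic dynamic program rather than a particular library, and Distance = 2 is an illustrative threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidate_entities(mention: str, vocabulary, distance: int = 2):
    """Keep knowledge-base vocabulary whose edit distance to the mention is <= distance."""
    return [w for w in vocabulary if levenshtein(mention, w) <= distance]
```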
In the external knowledge base, every knowledge point concept vocabulary entry has a corresponding abstract text description. This method encodes the abstract text description of each candidate knowledge point concept entity with the pre-trained Bert model introduced above to obtain a vector characterizing the candidate entity. For a candidate knowledge point concept entity ent_j, its corresponding abstract description, as a character string D_j = {d_1, d_2, ..., d_q}, is the input of the Bert model. The encoded output vector is H_ent = {h_cls, h_1, ..., h_sep}. The implicit vector h_cls corresponding to the identifier "[CLS]" is passed through a fully connected layer with tanh activation to obtain the output vector v_{ent_j}, the characterization vector of the candidate knowledge point concept entity, i.e. v_{ent_j} = tanh(h_cls W + b). In this way the set of characterization vectors of the candidate entity set is obtained: V_Ent_i = {v_{ent_1}, v_{ent_2}, ..., v_{ent_n}}.
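A sketch of this encoding step using the Hugging Face transformers API, assuming a generic Chinese Bert checkpoint; the tanh-activated fully connected layer here is freshly initialized and stands in for the trained layer described above:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
fc = torch.nn.Linear(bert.config.hidden_size, bert.config.hidden_size)

def encode_entity(abstract: str) -> torch.Tensor:
    """Characterization vector v_ent = tanh(FC(h_cls)) of an entity's abstract text."""
    inputs = tokenizer(abstract, return_tensors="pt", truncation=True)
    with torch.no_grad():
        h = bert(**inputs).last_hidden_state   # (1, seq_len, hidden)
    h_cls = h[:, 0]                            # hidden vector at the "[CLS]" position
    return torch.tanh(fc(h_cls)).squeeze(0)
```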
For each mentioned knowledge point concept m_i, the course text C in which it appears is characterized: the pre-trained Bert model first encodes the course text C = {c_1, c_2, ..., c_l} to obtain its characterization vector V_C, in the same way as the characterization vector of a candidate knowledge point concept entity is obtained.
The encoding vectors of the characters of the course text computed by the Bert model are H_C = {h_cls, h_1, h_2, ..., h_l, h_sep}. For an extracted knowledge point concept mention m_i, the index position of the plaintext substring it denotes can be represented as a tuple (beg, end), where beg is the index of the substring's starting position in C and end the index of its ending position. The slice of H_C between the start index beg and the end index end is H_{m_i} = {h_beg, ..., h_end}. H_{m_i} is fed through the text convolution network TextCNN to obtain the mention's characterization vector v_{m_i} = TextCNN(H_{m_i}).
The TextCNN model takes H_{m_i} as input; the calculation steps are as follows:
1. A plurality of one-dimensional convolution kernels are defined and applied to the input to capture correlations between adjacent characters.
2. Max-over-time pooling is applied to each output channel, and the pooled outputs of all channels are spliced to obtain the characterization vector.
Finally, the course text characterization vector V_C and the mention characterization vector v_{m_i} are joined by a Concate splicing operation and passed through a fully connected layer with tanh activation to obtain the output vector u_{m_i}, i.e. u_{m_i} = tanh(Concate(V_C, v_{m_i}) W + b).
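A one-step sketch of this fusion, with the linear layer's dimensions left to the caller (input size dim(V_C) + dim(v_m)):

```python
import torch

def mention_output_vector(V_C: torch.Tensor, v_m: torch.Tensor,
                          fc: torch.nn.Linear) -> torch.Tensor:
    """u_m = tanh(FC(Concate(V_C, v_m))): course-text and mention vectors fused."""
    fused = torch.cat([V_C, v_m], dim=-1)   # Concate splicing operation
    return torch.tanh(fc(fused))
```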
The mention's output vector u_{m_i} is compared by cos similarity with every characterization vector in the candidate entity set V_Ent_i = {v_{ent_1}, ..., v_{ent_n}}; the knowledge point concept with the highest similarity is selected from the candidate set and associated with the mention, so the final association result can be represented as a tuple (m_i, ent*).
The linking results of all knowledge point concepts contained in the input course text are R = {(m_1, ent_1*), (m_2, ent_2*), ..., (m_k, ent_k*)}, completing the association between teaching resources and the knowledge point concepts in the knowledge base.
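A sketch of the final selection by cos similarity, assuming the mention output vector and the candidate characterization vectors have already been computed:

```python
import torch
import torch.nn.functional as F

def link_mention(u_m: torch.Tensor, candidate_vectors, candidate_names):
    """Pick the candidate entity whose characterization vector is most cos-similar to u_m."""
    sims = torch.stack([F.cosine_similarity(u_m, v, dim=0) for v in candidate_vectors])
    best = int(sims.argmax())
    return candidate_names[best], float(sims[best])   # yields the tuple (mention, ent*)
```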

Claims (7)

1. An intelligent online teaching resource knowledge point concept entity linking method is characterized by comprising the following steps:
1) firstly, a preprocessing step of character string cleaning is performed: for each character, judge whether it belongs to the union of the Chinese, numeric and English character sets; characters not in this character set are removed;
2) the model labels every element of the cleaned character string C = {c_1, c_2, ..., c_l} with a BIO labeling scheme: when a character c_i is labeled "B", c_i is the first character of some knowledge point concept vocabulary entity, "I" marks a middle character of a knowledge point concept vocabulary entity, and "O" marks a character outside any knowledge point concept vocabulary; labeled text data are finally obtained;
3) text data enhancement builds a knowledge point concept dictionary Dict from the knowledge point vocabulary terms and their aliases in the knowledge base; a bidirectional maximum matching algorithm (BiDirectional Maximum Matching) matches the character string C to find the dictionary words it contains, and every matched character substring is labeled with a "BIEO" scheme: if a matched substring is C_sub = {c_i, c_{i+1}, ..., c_{i+m}}, C_sub ∈ Dict, the starting character c_i is labeled "B", the ending character c_{i+m} is labeled "E", the characters {c_{i+1}, c_{i+2}, ..., c_{i+m-1}} between them are labeled "I", and unmatched characters are labeled "O"; this yields a labeled character string, to which the start character "[CLS]" and end character "[SEP]" are added: S = {s_[CLS], s_1, s_2, ..., s_l, s_[SEP]}, where each element s_i consists of the character c_i at the corresponding index of C and its label character;
4) a vector space embedding operation Embedding(S) is applied to the labeled character string S: each element s_i of S is characterized as a d_s-dimensional vector whose values are randomly initialized with the Kaiming distribution; the embedded sequence vector is E_S = {e_[CLS], e_1, e_2, ..., e_l, e_[SEP]};
5) the sequence vector E_S obtained above contains the boundary information of the knowledge point concept vocabulary; the context semantic information contained in the character string C is then characterized with a pre-trained neural network language model Bert, i.e. a model trained on large-scale general text data; used as a semantic encoder, the pre-trained Bert can effectively represent a text sequence as a high-dimensional vector; the cleaned character string C serves as the input of the pre-trained Bert language model, which computes over C character by character: for the input string C = {c_1, c_2, ..., c_l}, the Bert model first inserts the identifiers "[CLS]" and "[SEP]" before the start position and after the end position, respectively, and the string {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"} becomes the model's computation data;
6) the output vector F obtained from the Bert model is the encoding vector of the character string C; combined with the sequence vector E_S carrying the concept knowledge point vocabulary boundary information, candidate concept knowledge point entities are extracted from C through an LSTM model and a conditional random field CRF; the substrings corresponding to the predicted tag sequence are extracted to obtain the knowledge point concept mention entities;
7) the knowledge point concept entity linking model matches and associates the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} with knowledge point entities in the knowledge base; candidate knowledge point concept entities are generated with the Levenshtein Distance string fuzzy matching algorithm: the current mention entity m_i is fuzzily matched against the knowledge point concept vocabulary in the knowledge base, matched vocabulary whose edit distance exceeds the parameter Distance is filtered out, and a candidate knowledge point concept entity set Ent_i = {ent_1, ent_2, ..., ent_n} is generated;
8) the abstract text description of each candidate knowledge point concept entity is encoded by the pre-trained Bert model to obtain a vector characterizing the candidate entity: for a candidate knowledge point concept entity ent_j, its corresponding abstract description, as a character string D_j = {d_1, d_2, ..., d_q}, is the input of the Bert model; the encoded output is H_ent = {h_cls, h_1, ..., h_sep}; the implicit vector h_cls corresponding to the identifier "[CLS]" is passed through a fully connected layer with tanh activation to obtain the output vector v_{ent_j}, which serves as the characterization vector of the candidate knowledge point concept entity, i.e. v_{ent_j} = tanh(h_cls W + b); in this way the set of characterization vectors of the candidate entity set is obtained: V_Ent_i = {v_{ent_1}, v_{ent_2}, ..., v_{ent_n}};
9) for each mentioned knowledge point concept m_i, the course text C in which it appears is characterized: the pre-trained Bert model first encodes the course text C = {c_1, c_2, ..., c_l} to obtain its characterization vector V_C, obtained in the same way as the characterization vector of a candidate knowledge point concept entity;
10) the encoding vectors of the characters of the course text computed by the Bert model are H_C = {h_cls, h_1, h_2, ..., h_l, h_sep}; for an extracted knowledge point concept mention m_i, the index position of the plaintext substring it denotes can be represented as a tuple (beg, end), where beg is the index of the substring's starting position in C and end the index of its ending position; the slice of H_C between the start index beg and the end index end is H_{m_i} = {h_beg, ..., h_end}; H_{m_i} is fed through a text convolution network TextCNN to obtain the mention's characterization vector v_{m_i} = TextCNN(H_{m_i}); the course text characterization vector V_C and the mention characterization vector v_{m_i} are joined by a Concate splicing operation and passed through a fully connected layer with tanh activation to obtain the output vector u_{m_i}, i.e. u_{m_i} = tanh(Concate(V_C, v_{m_i}) W + b);
11) the mention's output vector u_{m_i} is compared by cos similarity with every characterization vector in the candidate entity set V_Ent_i = {v_{ent_1}, ..., v_{ent_n}}; the knowledge point concept with the highest similarity is selected from the candidate set and associated with the mention, so the final association result can be represented as a tuple (m_i, ent*);
12) the linking results of all knowledge point concepts contained in the input course text are R = {(m_1, ent_1*), (m_2, ent_2*), ..., (m_k, ent_k*)}, completing the association between teaching resources and the knowledge point concepts in the knowledge base.
2. The method as claimed in claim 1, wherein the input of the knowledge point concept entity recognition model is a text string X = {x_1, x_2, ..., x_n}: X consists of n characters, x_i being the i-th character of X; the text string may come from course video captions, electronic textbook text, and the like.
3. The method as claimed in claim 1, wherein the character string cleaning preprocessing is implemented mainly through the Unicode code table: when a character x_i's Unicode code u(x_i) lies between \u4e00 and \u9fa5, x_i is a Chinese character; likewise, when u(x_i) lies between \u0030 and \u0039, x_i is a numeric character; when u(x_i) lies between \u0041 and \u005a or between \u0061 and \u007a, x_i is an English character; all characters outside these Unicode ranges are deleted, completing the cleaning of the character string; the cleaned string is C = {c_1, c_2, ..., c_l}, whose length l satisfies l ≤ n.
4. The intelligent online teaching resource knowledge point concept entity linking method as claimed in claim 1, wherein the calculation of the character string by the Bert model mainly comprises the following steps:
1) character embedding operation: the character string { "[ CLS ] to be calculated]″,c1,c2,......,cl,″[SEP]Each character in the character string is characterized as a d-dimensional character vector through Embedding operation (Embedding), and the embedded character string vector is
Figure FDA0003461606250000056
2) Integrating position information encoding: to obtain the sequence features of the text data, the Bert model uses sin and cos mechanisms to encode the position index of each element in the string vector E_c. That is, for the element at position pos, let d_i be the dimension index within each element, 1 ≤ d_i ≤ d; when d_i is even the sin function is used for the conversion, and when d_i is odd the cos function is used, giving the position encoding vector P = {p_cls, p_1, ..., p_l, p_sep}, where each element p is a d-dimensional vector. The corresponding position encoding formulas are as follows:

PE(pos, 2i) = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
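A sketch of the standard sinusoidal encoding matching the two formulas above; sequence length and dimension are hypothetical:

```python
import numpy as np

def position_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position encoding: sin on even dimensions, cos on odd ones."""
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1) position indices
    i = np.arange(d)[None, :]           # (1, d) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimension indices
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimension indices
    return pe

P = position_encoding(seq_len=10, d=16)    # one d-dim vector per position
```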
3) Self-attention mechanism based on scaled dot products: the character string vector E_c obtained above is added to the position encoding vector P to obtain the input vector Z = {z_cls, z_1, ..., z_l, z_sep} of the self-attention mechanism. The self-attention mechanism captures the degree of association between every pair of elements in the sequence through scaled dot products; the stronger the association between two elements, the larger the computed value. The self-attention calculation formula is as follows, where each input is the vector Z multiplied by its corresponding weight parameter matrix, i.e. Q = Z W^Q, K = Z W^K, V = Z W^V, and d is the input vector dimension:

Attention(Q, K, V) = softmax(Q K^T / √d) V
4) Multi-head self-attention mechanism: in order to fully consider information from different independent subspaces after the scaled dot-product calculation, the vectors from h scaled dot-product calculations, i.e. h attention heads head_1, ..., head_h, are concatenated (Concate) and then linearly transformed, where W^O is a trainable parameter matrix:

MultiHead(Q, K, V) = Concate(head_1, ..., head_h) W^O
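A minimal NumPy sketch covering steps 3) and 4) together; the sequence length, model dimension, and number of heads are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Z, heads, W_o):
    """Run h attention heads, concatenate their outputs, project with W_o."""
    outs = [self_attention(Z, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1) @ W_o

# Hypothetical sizes: sequence of 6 elements, model dim 16, 4 heads of dim 4.
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
W_o = rng.normal(size=(16, 16))
out = multi_head(Z, heads, W_o)   # shape (6, 16)
```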
5) Feed-forward network layer: the result for each character element after multi-head attention is only a linear transformation; in order to fully consider the interactions between information in different latent dimensions, a feed-forward network layer with a nonlinear transformation is integrated into the model. Its calculation is as follows, where W^(1), W^(2) and b^(1), b^(2) are trainable parameter matrices and bias vectors:

F = FFN(Z) = ReLU(Z W^(1) + b^(1)) W^(2) + b^(2)
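The feed-forward formula as a one-liner in NumPy, with hypothetical shapes (a 64-dim hidden layer here is only an example):

```python
import numpy as np

def ffn(Z, W1, b1, W2, b2):
    """F = ReLU(Z W1 + b1) W2 + b2, applied position-wise."""
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))
W1, b1 = rng.normal(size=(16, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 16)), np.zeros(16)
F = ffn(Z, W1, b1, W2, b2)   # shape (6, 16)
```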
5. The method of claim 1, wherein the candidate knowledge point concept mention entities are extracted from the character string C by an LSTM model and a conditional random field (CRF); the main process is as follows:
1) Feature vector fusion: the encoding vector F carrying semantic features and the sequence vector E_S carrying knowledge point concept vocabulary boundary information are concatenated (Concate) and then linearly transformed through a weight parameter matrix W to obtain the fused vector V = {v_cls, v_1, v_2, ..., v_l, v_sep}; the formula is as follows:

V = Concate(F, E_S) W
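A sketch of this fusion step; the sequence length and feature dimensions are hypothetical:

```python
import numpy as np

def fuse_features(F, E_s, W):
    """Concatenate semantic features F with boundary-information features E_s,
    then linearly transform: V = Concate(F, E_s) W."""
    return np.concatenate([F, E_s], axis=-1) @ W

# Hypothetical shapes: T characters, d-dim semantic and boundary features.
T, d = 12, 16
rng = np.random.default_rng(2)
F, E_s = rng.normal(size=(T, d)), rng.normal(size=(T, d))
W = rng.normal(size=(2 * d, d))
V = fuse_features(F, E_s, W)   # fused vector sequence, shape (T, d)
```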
2) Encoding by the LSTM model: the LSTM model is a variant of the recurrent neural network (RNN) and has a more robust predictive effect than the plain RNN; when the i-th element is calculated, the vector information of the first i-1 elements can be fully combined. The calculation process of the model for the element at each time step t is as follows:

z_t = σ(W_i · [h_{t-1}, v_t])
r_t = σ(W_r · [h_{t-1}, v_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, v_t])
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, ⊙ is the element-wise product operator, v_t is the t-th element of the fused vector V, and h_t is the hidden state vector corresponding to v_t. The output of the vector V after passing through the model is H = {h_1, h_2, ..., h_T}, where T = l + 2.
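A sketch of one time step of the gated recurrence written out above (the z_t/r_t gates are GRU-form, so that is what is implemented here); the weight names W_z, W_r, W_h and all shapes are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(h_prev, v_t, W_z, W_r, W_h):
    """One time step of the gated recurrence in claim 5."""
    x = np.concatenate([h_prev, v_t])
    z = sigmoid(W_z @ x)                  # update gate z_t
    r = sigmoid(W_r @ x)                  # reset gate r_t
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, v_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # new hidden state h_t

d = 8
rng = np.random.default_rng(0)
h, V_seq = np.zeros(d), rng.normal(size=(5, d))
W_z, W_r, W_h = (rng.normal(size=(d, 2 * d)) for _ in range(3))
H = []
for v_t in V_seq:              # run the recurrence over the fused sequence
    h = gated_step(h, v_t, W_z, W_r, W_h)
    H.append(h)                # H collects the hidden states h_1..h_T
```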
3) CRF model prediction layer: the model prediction layer judges the hidden vectors output by the LSTM model and consists of a fully connected layer and a CRF layer. First, the hidden state vector H = {h_1, h_2, ..., h_T} output by the LSTM model is linearly transformed through a fully connected layer to obtain the score of each character for each category label, i.e. each label score l_score_i = [score_1, score_2, score_3] contains three elements, where score_1 is the probability score of predicting the current character as "B", score_2 the probability score of predicting it as "I", and score_3 the probability score of predicting it as "O". The prediction scores of all characters in the string form L_Score = {l_score_cls, l_score_1, l_score_2, ..., l_score_l, l_score_sep}, and this score set is taken as the input of the CRF layer. The CRF layer models the labels by taking the input score set as an emission score matrix, calculates a label transition matrix T representing the transition probability from one label category to another, mines the dependency relationships between label categories, computes the sequence score Score(H) of the character string, and decodes it with the Viterbi algorithm to obtain the predicted label sequence Ŷ = {y_cls, y_1, ..., y_l, y_sep}. After removing the prediction tags corresponding to the start identifier "[CLS]" and the end identifier "[SEP]" introduced by the Bert model, the predicted label sequence of the character string is Y = {y_1, y_2, ..., y_l}. Extracting the corresponding substrings from the predicted label sequence yields the knowledge point concept mention entities M = {m_1, m_2, ..., m_k}.
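Once the B/I/O label sequence is available (after Viterbi decoding and removal of the [CLS]/[SEP] tags), extracting the mention substrings is mechanical; a sketch with a hypothetical example string:

```python
def extract_mentions(chars, labels):
    """Extract knowledge point concept mention substrings from B/I/O tags."""
    mentions, current = [], []
    for ch, tag in zip(chars, labels):
        if tag == "B":                  # beginning of a new mention
            if current:
                mentions.append("".join(current))
            current = [ch]
        elif tag == "I" and current:    # continuation of the current mention
            current.append(ch)
        else:                           # "O": outside any mention
            if current:
                mentions.append("".join(current))
            current = []
    if current:
        mentions.append("".join(current))
    return mentions

assert extract_mentions(list("学习二叉树与哈希表"),
                        ["O","O","B","I","I","O","B","I","I"]) == ["二叉树", "哈希表"]
```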
6. The method as claimed in claim 1, wherein the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} are matched and associated with knowledge point entities in a knowledge base, mainly comprising the following steps: 1) a Levenshtein distance string fuzzy matching algorithm performs a fuzzy search for each mention entity m_i and selects a set of possibly matching candidate knowledge point entities from the knowledge base; 2) the mention entity m_i and the candidate entities are given context semantic representations through the Bert model, yielding context semantic representation vectors; 3) the similarity between the context semantic representation vector of the mention entity and that of each candidate entity is calculated with the cos function, and the candidate knowledge point entity with the highest similarity is the linked knowledge point concept.
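A sketch of the fuzzy candidate retrieval of step 1); the max_dist threshold is a hypothetical parameter that the claim does not fix:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidate_entities(mention, knowledge_base, max_dist=2):
    """Step 1) of claim 6: fuzzy-match a mention against entity names."""
    return [e for e in knowledge_base if levenshtein(mention, e) <= max_dist]

assert candidate_entities("二叉搜索树", ["二叉树", "二叉搜索树", "哈希表"]) \
       == ["二叉树", "二叉搜索树"]
```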
7. The intelligent online teaching resource knowledge point concept entity linking method as claimed in claim 1, wherein the TextCNN model operates on the input embedded character vector sequence, and the calculation steps are as follows:
1) A plurality of one-dimensional convolution kernels are defined, and each kernel performs a convolution calculation over the input to capture the correlation between adjacent characters.
2) Max-over-time pooling is performed on each output channel, and the pooled output values of all channels are concatenated to obtain the characterization vector, as in the sketch after this list.
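A minimal sketch of the two TextCNN steps above, with one filter per kernel width; a real TextCNN typically uses many filters per width, and all shapes here are hypothetical:

```python
import numpy as np

def textcnn_characterize(E, kernels):
    """1-D convolutions over the embedded character sequence E (seq_len x d),
    max-over-time pooling per output channel, then concatenation."""
    pooled = []
    for K in kernels:                  # K shape: (width, d) - one conv kernel
        width = K.shape[0]
        outs = [np.sum(E[t:t + width] * K)   # channel response at position t
                for t in range(E.shape[0] - width + 1)]
        pooled.append(max(outs))             # max-over-time pooling
    return np.array(pooled)            # characterization vector, one value/channel

rng = np.random.default_rng(3)
E = rng.normal(size=(20, 16))          # hypothetical embedded character sequence
kernels = [rng.normal(size=(w, 16)) for w in (2, 3, 4)]
v = textcnn_characterize(E, kernels)   # 3-dim characterization vector here
```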