CN114443813A - Intelligent online teaching resource knowledge point concept entity linking method

Publication number: CN114443813A
Application number: CN202210018754.4A
Authority: CN (China)
Prior art keywords: knowledge point, character, vector, concept, entity
Legal status: Granted; active
Inventors: 袁新瑞, 王雨扬
Assignee: Northwest University
Application filed by Northwest University; priority to CN202210018754.4A
Publication of CN114443813A; application granted and published as CN114443813B (zh)

Classifications

    • G06F 16/374: Creation of semantic tools (thesaurus) for information retrieval of unstructured textual data
    • G06F 16/3346: Query execution using a probabilistic model
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 40/216: Parsing using statistical methods
    • G06F 40/295: Named entity recognition
    • G06F 40/30: Semantic analysis
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/08: Neural network learning methods

Abstract

An intelligent online teaching resource knowledge point concept entity linking method comprises a knowledge point concept entity recognition model and a knowledge point concept linking model. The application scenario is mainly teaching resource organization and management in domestic online learning platforms; because domestic teaching is conducted essentially in Chinese, the method targets Chinese-language text and is compatible with partial English text. Knowledge point concept entity recognition extracts knowledge point concept entity vocabulary (disciplines, professional terms, historical events and the like) from teaching resource text; the extracted knowledge point concept entities are called knowledge point mentions. Knowledge point concept linking finds, from a knowledge base, the concept knowledge with the highest semantic similarity to each extracted knowledge point mention and its surrounding context, and establishes the association. Through knowledge point concept entity recognition and knowledge point concept linking, teaching resources are associated with knowledge point concepts, achieving the purpose of constructing a teaching resource organization system with concept knowledge at its core.

Description

Intelligent online teaching resource knowledge point concept entity linking method
Technical Field
The invention relates to intelligent education, in particular to an intelligent online teaching resource knowledge point concept entity linking method.
Background
Traditional teaching resource libraries carry a large amount of learning resources, and their rich resource types have attracted wide attention. As the number of users on online learning platforms grows, the quantity and variety of teaching resources in each platform keep increasing to meet different users' different requirements. In practice, as teaching resources grow in number and diversify in content, learners must spend more time and energy than before searching for and selecting the learning resources they need, their learning efficiency on the platform gradually decreases, and their learning quality and initiative are seriously affected.
As an effective means of structuring human knowledge, knowledge graphs have become a core driving force in the development of the internet and artificial intelligence. The teaching resource library in an adaptive learning system can likewise use knowledge graph technology to build a teaching resource system with knowledge at its core: by associating teaching resources with concept knowledge points, the teaching resource system can be organized effectively and thereby empower the adaptive learning system.
In existing online teaching resources, knowledge point concept labeling and association are entered manually by teachers. Manual entry, however, consumes a great deal of time and energy; most knowledge point concepts provided by teachers are coarse-grained, fine-grained knowledge point concepts in the teaching resources are ignored, labeling remains incomplete, and learners cannot intuitively grasp the details of course content. Solving these problems requires an intelligent method or tool that accurately identifies and associates the knowledge point concept entities in online teaching resources. So far only a few researchers have carried out related work, mainly extracting key phrases and terms from teaching resources by means of statistical learning, and this progress falls far short of solving the key problems above.
With progress in knowledge graphs and natural language processing, entity linking technology can address the problems above. Entity linking identifies mentions in text and links them to the corresponding entities in a knowledge base. Most existing entity linking methods are open-domain: they recognize key entity vocabulary such as person names, place names, organizations and times in a text corpus and link them to corresponding entries in a knowledge base (such as online encyclopedias and Wikipedia). Fairly mature entity linking tools already exist, such as Wikify!, AIDA, DBpedia Spotlight, TagMe and Linkify. These systems consist mainly of two parts: entity mention detection and entity linking. Although such entity linking systems are well developed, they have certain shortcomings. For mention detection they rely mostly on existing Named Entity Recognition (NER) tools such as Stanza, Jieba and SnowNLP, which achieve considerable recognition accuracy but can only recognize three entity categories: people, places and organizations.
Unlike the open-domain entity linking task, teaching resource knowledge point concept linking extracts and associates only the concept entities involved in teaching resources, not all entities (place name entities, person entities, time entities and so on), so existing entity linking tools are not suitable for linking knowledge point concepts in teaching resources.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide an intelligent online teaching resource knowledge point concept entity linking method, which is based on the natural language processing model Bert combined with text data enhancement to extract and link the knowledge point concepts contained in teaching resources, realizes the association between teaching resources and concept knowledge, and finally constructs a teaching resource organization system with knowledge point concepts at its core.
In order to achieve the purpose, the invention adopts the technical scheme that:
an intelligent online teaching resource knowledge point concept entity linking method is characterized by comprising the following steps:
1) firstly, a preprocessing step of character string cleaning is performed: for each character, judge whether it belongs to the union of the Chinese, numeric and English character sets, denoted S; characters not in S are removed;
2) the model labels every element of the cleaned character string C = {c_1, c_2, ..., c_l} with a BIO labeling scheme: when a character c_i is labeled "B", c_i is the first character of some knowledge point concept vocabulary entity, "I" marks a middle character of a knowledge point concept vocabulary entity, and "O" marks a character outside any knowledge point concept vocabulary; labeled text data are finally obtained;
3) text data enhancement builds a knowledge point concept dictionary Dict from the knowledge point entry nouns and their aliases in the knowledge base; a bidirectional maximum matching algorithm (BiDirectional Maximum Matching) matches the character string C to find the dictionary words it contains, and every matched character substring is labeled with a "BIEO" scheme: if a matched substring is C_sub = {c_i, c_{i+1}, ..., c_{i+m}}, C_sub ∈ Dict, the starting character c_i is labeled "B", the ending character c_{i+m} is labeled "E", the characters {c_{i+1}, c_{i+2}, ..., c_{i+m-1}} between them are labeled "I", and unmatched characters are labeled "O"; this yields a labeled character string, to which the start character "[CLS]" and end character "[SEP]" are added: S = {s_[CLS], s_1, s_2, ..., s_l, s_[SEP]}, where each element s_i consists of the character c_i at the corresponding index of C and its label character;
4) a vector space embedding operation Embedding(S) is applied to the labeled character string S: each element s_i of S is characterized as a d_s-dimensional vector whose values are randomly initialized with the Kaiming distribution; the embedded sequence vector is E_S = {e_[CLS], e_1, e_2, ..., e_l, e_[SEP]};
5) the sequence vector E_S obtained above contains the boundary information of the knowledge point concept vocabulary; the context semantic information contained in the character string C is then characterized with a pre-trained neural network language model, Bert, i.e. a model trained on large-scale general text data. Used as a semantic encoder, the pre-trained Bert can effectively represent a text sequence as a high-dimensional vector. The cleaned character string C serves as the input of the pre-trained Bert language model, which computes over C character by character: for the input string C = {c_1, c_2, ..., c_l}, the Bert model first inserts the identifiers "[CLS]" and "[SEP]" before the start position and after the end position, respectively, and the string {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"} becomes the model's computation data;
6) the output vector F obtained from the Bert model is the encoding vector of the character string C; combined with the sequence vector E_S carrying the concept knowledge point vocabulary boundary information, candidate concept knowledge point entities are extracted from C through an LSTM model and a conditional random field (CRF); the substrings corresponding to the predicted tag sequence are extracted to obtain the knowledge point concept mention entities;
7) the knowledge point concept entity linking model matches and associates the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} with knowledge point entities in the knowledge base; candidate knowledge point concept entities are generated with the Levenshtein Distance string fuzzy matching algorithm: the current mention entity m_i is fuzzily matched against the knowledge point concept vocabulary in the knowledge base, matched vocabulary whose edit distance exceeds the parameter Distance is filtered out, and a candidate knowledge point concept entity set Ent_i = {ent_1, ent_2, ..., ent_n} is generated;
8) the abstract text description of each candidate knowledge point concept entity is encoded by the pre-trained Bert model introduced above to obtain a vector characterizing the candidate entity: for a candidate knowledge point concept entity ent_j, its corresponding abstract description, as a character string D_j = {d_1, d_2, ..., d_q}, is the input of the Bert model; the encoded output is H_ent = {h_cls, h_1, ..., h_sep}; the implicit vector h_cls corresponding to the identifier "[CLS]" is passed through a fully connected layer with tanh activation to obtain the output vector v_{ent_j}, which serves as the characterization vector of the candidate knowledge point concept entity, i.e. v_{ent_j} = tanh(h_cls W + b); in this way the set of characterization vectors of the candidate entity set is obtained: V_Ent_i = {v_{ent_1}, v_{ent_2}, ..., v_{ent_n}};
9) for each mentioned knowledge point concept m_i, the course text C in which it appears is characterized: the pre-trained Bert model first encodes the course text C = {c_1, c_2, ..., c_l} to obtain its characterization vector V_C, obtained in the same way as the characterization vector of a candidate knowledge point concept entity;
10) the encoding vectors of the characters of the course text computed by the Bert model are H_C = {h_cls, h_1, h_2, ..., h_l, h_sep}; for an extracted knowledge point concept mention m_i, the index position of the plaintext substring it denotes can be represented as a tuple (beg, end), where beg is the index of the substring's starting position in C and end the index of its ending position; the slice of H_C between the start index beg and the end index end is H_{m_i} = {h_beg, ..., h_end}; H_{m_i} is fed through a text convolution network TextCNN to obtain the mention's characterization vector v_{m_i} = TextCNN(H_{m_i}); the course text characterization vector V_C and the mention characterization vector v_{m_i} are joined by a Concate splicing operation and passed through a fully connected layer with tanh activation to obtain the output vector u_{m_i}, i.e. u_{m_i} = tanh(Concate(V_C, v_{m_i}) W + b);
11) the mention's output vector u_{m_i} is compared by cos similarity with every characterization vector in the candidate entity set V_Ent_i = {v_{ent_1}, ..., v_{ent_n}}; the knowledge point concept with the highest similarity is selected from the candidate set and associated with the mention, so the final association result can be represented as a tuple (m_i, ent*);
12) the linking results of all knowledge point concepts contained in the input course text are R = {(m_1, ent_1*), (m_2, ent_2*), ..., (m_k, ent_k*)}, completing the association between teaching resources and the knowledge point concepts in the knowledge base.
The input of the knowledge point concept entity recognition model is a text string X = {x_1, x_2, ..., x_n}: X consists of n characters, x_i being the i-th character of X; the text string may come from course video captions, electronic textbook text, and the like.
The character string cleaning preprocessing is implemented mainly through the Unicode code table. When a character x_i's Unicode code u(x_i) lies between \u4e00 and \u9fa5, x_i is a Chinese character. Likewise, when u(x_i) lies between \u0030 and \u0039, x_i is a numeric character; when u(x_i) lies between \u0041 and \u005a or between \u0061 and \u007a, x_i is an English character. All characters outside these code ranges are deleted, which completes the cleaning of the character string; the cleaned string is C = {c_1, c_2, ..., c_l}, whose length l satisfies l ≤ n.
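This cleaning rule can be sketched directly from the code-point comparisons above. A minimal sketch in Python, where the function name is illustrative and the digit and letter ranges are the standard Unicode ranges \u0030-\u0039, \u0041-\u005a and \u0061-\u007a:

```python
def clean_string(text: str) -> str:
    """Keep only Chinese characters, digits and English letters; drop everything else."""
    def in_charset(ch: str) -> bool:
        return (
            '\u4e00' <= ch <= '\u9fa5'   # Chinese characters (CJK unified ideographs)
            or '0' <= ch <= '9'          # numeric characters
            or 'A' <= ch <= 'Z'          # uppercase English letters
            or 'a' <= ch <= 'z'          # lowercase English letters
        )
    return ''.join(ch for ch in text if in_charset(ch))

# Punctuation and whitespace are removed, so the cleaned length l <= n.
assert clean_string("机器学习, 即 Machine Learning!") == "机器学习即MachineLearning"
```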
The calculation of the Bert model for the character string mainly comprises the following steps:
Character embedding operation: each character of the string to be computed, {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"}, is characterized as a d-dimensional character vector by an embedding operation (Embedding); the embedded character string vector is E_c = {e_cls, e_1, ..., e_l, e_sep}.
Integrating position information coding: to obtain the sequence features of text data, the Bert model uses sin and cos mechanisms on charactersString vector EcThe position index of each element in (a) is encoded. I.e. for elements at the position of pos, where diD is more than or equal to 1 for the dimension position in each elementiD is less than or equal to d, when d isiFor even numbers, using sin function for conversion, diIf the number is odd, the cos function is used for conversion to obtain the position coding vector
Figure BDA0003461606260000079
Each element p is a d-dimensional vector, and the corresponding position coding formula is as follows:
Figure BDA00034616062600000710
Figure BDA0003461606260000081
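A minimal sketch of this sin/cos position coding, assuming an even model dimension d and 0-based dimension indexing as in the standard transformer formulation:

```python
import numpy as np

def position_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position codes: sin on even dimensions, cos on odd dimensions."""
    P = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]      # position index pos
    i = np.arange(0, d, 2)[None, :]        # even dimension indices
    angle = pos / np.power(10000.0, i / d)
    P[:, 0::2] = np.sin(angle)             # even dimensions use sin
    P[:, 1::2] = np.cos(angle)             # odd dimensions use cos
    return P
```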
Self-attention mechanism based on dot product scaling: the character string vector E_c computed above is added to the position coding vector P to obtain the input vector of the self-attention mechanism, Z = {z_cls, z_1, ..., z_l, z_sep}. The self-attention mechanism captures the degree of association between every pair of elements in the sequence through scaled dot products: the stronger the association between two elements, the larger the computed value. The formula is as follows, where each input is the vector Z multiplied by its corresponding weight parameter W, i.e. Q = Z W^Q, K = Z W^K, V = Z W^V, and d is the input vector dimension:

Attention(Q, K, V) = softmax(Q K^T / √d) V
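A sketch of this scaled dot product attention in plain numpy, assuming Q, K and V have already been projected from Z with their weight matrices:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)    # pairwise association degrees
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```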
Multi-head self-attention mechanism: in order to fully consider information from different independent subspaces after the scaled dot product computation, the vectors of h scaled dot product computations, i.e. h attention heads, are spliced by Concate and then linearly transformed. The formula is as follows, where head_i = Attention(Z W_i^Q, Z W_i^K, Z W_i^V) and W^O is a trainable parameter matrix:

MultiHead(Q, K, V) = Concate(head_1, ..., head_h) W^O
Feed-forward network layer: the result of each character element after multi-head attention is only a linear transformation; to fully consider the interaction of information across different latent dimensions, a feed-forward network layer with a nonlinear transformation is integrated into the model. It is computed as follows, where W^(1), W^(2), b^(1), b^(2) are trainable parameter matrices:

F = FFN(Z) = ReLU(Z W^(1) + b^(1)) W^(2) + b^(2)
Candidate concept knowledge point entities are extracted from the character string C through an LSTM model and a conditional random field CRF; the main process is as follows:
Feature vector fusion: the encoding vector F carrying semantic features and the sequence vector E_S carrying knowledge point concept vocabulary boundary information are spliced by Concate and then linearly transformed by a weight parameter matrix W to obtain the fused vector V = {v_cls, v_1, v_2, ..., v_l, v_sep}:

V = Concate(F, E_S) W
Encoding of the LSTM model: the LSTM model is a variant of the recurrent neural network (RNN) and has a more robust predictive effect than the RNN model. When computing the i-th element it can fully incorporate the vector information of the first i-1 elements. For the element at each time step t the computation is:

z_t = σ(W_z · [h_{t-1}, v_t])
r_t = σ(W_r · [h_{t-1}, v_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, v_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, ⊙ is element-wise (dot product) multiplication, v_t is the t-th element of the fused vector V, and h_t is the implicit state vector, i.e. the output of v_t; the output of the vector V after the LSTM model is H = {h_1, h_2, ..., h_T}, where T = l + 2.
CRF model prediction layer: the model prediction layer judges the implicit vectors output by the LSTM model and consists of a fully connected layer and a CRF layer. First, the implicit state vectors H = {h_1, h_2, ..., h_T} output by the LSTM model are linearly transformed by a fully connected layer to obtain the score of each character for each class label, i.e. each label score l_score_i = [score_1, score_2, score_3] contains three elements, where score_1 is the probability score of predicting the current character as "B", score_2 as "I", and score_3 as "O". The prediction label scores of the characters in the string form L_Score = {l_score_cls, l_score_1, l_score_2, ..., l_score_l, l_score_sep}, which is taken as the input of the CRF layer. The CRF layer models the labels by taking the input score set as an emission score matrix, computes a score transition matrix T between label categories representing the transition probability from one label to another, mines the dependency between label categories, computes the sequence score Score(H) of the string, and decodes it with the Viterbi algorithm to obtain the predicted label sequence Y = {y_cls, y_1, y_2, ..., y_l, y_sep}. Removing the prediction tags corresponding to the start identifier "[CLS]" and end identifier "[SEP]" added by the Bert model gives the prediction tag sequence of the string, Y' = {y_1, y_2, ..., y_l}; extracting the corresponding substrings from the predicted tag sequence yields the knowledge point concept mention entities M = {m_1, m_2, ..., m_k}.
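A minimal sketch of the Viterbi decoding step over the emission scores and the transition matrix; start/stop transition scores are omitted as a simplifying assumption:

```python
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """emissions: (T, num_tags) scores from the fully connected layer;
    transitions: (num_tags, num_tags) score transition matrix T."""
    T, n = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        # total[i, j]: best score ending in tag j at step t via tag i at step t-1
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # backtrack the best tag sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```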
The extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} are matched and associated with knowledge point entities in the knowledge base, mainly in the following steps: 1. each mention entity m_i is fuzzily searched with the Levenshtein Distance string fuzzy matching algorithm, and a set of possibly matching candidate knowledge point entities is selected from the knowledge base; 2. the mention entity m_i and the candidate entities are given context semantic representations through the Bert model, obtaining context semantic representation vectors; 3. the similarity between the context semantic representation vectors of the mention and of each candidate entity is computed with the cos function, and the candidate knowledge point entity with the highest similarity is the linked knowledge point concept.
The TextCNN model takes H_{m_i} as input; the calculation steps are as follows:
1. A plurality of one-dimensional convolution kernels are defined and applied to the input to capture correlations between adjacent characters.
2. Max-over-time pooling is applied to each output channel, and the pooled outputs of all channels are spliced to obtain the characterization vector.
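A sketch of such a TextCNN in PyTorch; the filter count and kernel sizes are illustrative, and the input span is assumed to be at least as long as the largest kernel:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """1-D convolutions over a mention's character vectors, then max-over-time pooling."""
    def __init__(self, d_in: int, n_filters: int = 64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_in, n_filters, k) for k in kernel_sizes]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = h.transpose(1, 2)                 # (batch, span_len, d_in) -> (batch, d_in, span_len)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)       # spliced characterization vector
```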
The invention has the beneficial effects that:
the technical framework of this patent mainly contains two main parts: the application scene of the patent mainly faces to the organization and management of teaching resources in a domestic online learning platform, and domestic teaching is basically Chinese teaching, so that the knowledge point concept entity recognition model and the knowledge point concept link model are only suitable for Chinese language texts and are compatible with partial English texts. The knowledge point concept entity identification is to extract the contained knowledge point concept entity vocabulary from the teaching resource text, such as: the extracted knowledge point concept entities are called knowledge point mentions; the knowledge point concept association means that the concept knowledge with the highest semantic similarity is found from a knowledge base according to the extracted knowledge point concept mention and the context where the knowledge point concept is located, and the relationship is carried out. The association between teaching resources and knowledge point concepts is realized through knowledge point concept entity recognition and knowledge point concept linkage, and the purpose of constructing a teaching resource organization system taking concept knowledge as a core is achieved.
Drawings
Fig. 1 is a working principle diagram of the present invention.
Fig. 2 is a schematic diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1 and 2, the method consists of a knowledge point concept entity recognition model and a knowledge point concept entity link model. The knowledge point concept entity recognition model is described first.
The input of the knowledge point concept entity recognition model is a text string X = {x_1, x_2, ..., x_n}: X consists of n characters, x_i being the i-th character of X; the text string may come from course video captions, electronic textbook text, and the like.
The character string is first put through a cleaning preprocessing step: for each character, judge whether it belongs to the union of the Chinese, numeric and English character sets, denoted S; characters not in S are removed. This is implemented mainly through the Unicode code table: when a character x_i's Unicode code u(x_i) lies between \u4e00 and \u9fa5, x_i is a Chinese character. Likewise, when u(x_i) lies between \u0030 and \u0039, x_i is a numeric character; when u(x_i) lies between \u0041 and \u005a or between \u0061 and \u007a, x_i is an English character. All characters outside these Unicode ranges are deleted, completing the cleaning; the cleaned string is C = {c_1, c_2, ..., c_l}, with length l ≤ n. Next, the model labels every element of the cleaned character string C = {c_1, c_2, ..., c_l} with the BIO labeling scheme: when a character c_i is labeled "B", c_i is the first character of some knowledge point concept vocabulary entity; "I" marks a middle character of a knowledge point concept vocabulary entity; "O" marks a character outside any knowledge point concept vocabulary.
Because knowledge point concept vocabulary occurs with low frequency in teaching text and the concept words are long, a traditional character-level entity recognition model has difficulty recognizing the text boundaries of knowledge point concept entities, making complete recognition hard. This method uses text data enhancement combined with the Bert language model to improve the accuracy of the knowledge point concept entity recognition model.
Text data enhancement builds a knowledge point concept dictionary Dict from the knowledge point vocabulary terms and their aliases in the knowledge base; the external knowledge base used in this patent is a discipline knowledge base provided online by academia. The bidirectional maximum matching algorithm (BiDirectional Maximum Matching) matches the character string C to find the dictionary words it contains. Every matched character substring is labeled with the "BIEO" scheme: if a matched substring is C_sub = {c_i, c_{i+1}, ..., c_{i+m}}, C_sub ∈ Dict, the starting character c_i is labeled "B", the ending character c_{i+m} is labeled "E", the characters {c_{i+1}, c_{i+2}, ..., c_{i+m-1}} between them are labeled "I", and unmatched characters are labeled "O". This yields a labeled character string, to which the start character "[CLS]" and end character "[SEP]" are added: S = {s_[CLS], s_1, s_2, ..., s_l, s_[SEP]}, where each element s_i consists of the character c_i at the corresponding index of C and its label character.
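A sketch of the dictionary matching and labeling; only the forward pass of the bidirectional maximum matching is shown for brevity (the backward pass scans right to left, and the segmentation with fewer unmatched characters would be kept), and max_len is an assumed cap on dictionary word length:

```python
def forward_max_match(text: str, dictionary: set, max_len: int = 8):
    """Greedy longest-match from left to right; returns (start, end) spans."""
    spans, i = [], 0
    while i < len(text):
        # only words of length >= 2 are matched, so "B" and "E" stay distinct
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                spans.append((i, j))
                i = j
                break
        else:
            i += 1                      # no dictionary word starts here
    return spans

def label_bieo(text: str, spans) -> list:
    """Mark each matched substring with B/I/E; everything else gets O."""
    tags = ['O'] * len(text)
    for beg, end in spans:
        tags[beg] = 'B'
        tags[end - 1] = 'E'
        for k in range(beg + 1, end - 1):
            tags[k] = 'I'
    return tags
```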
A vector space embedding operation Embedding(S) is applied to the labeled character string S: each element s_i is characterized as a d_s-dimensional vector whose values are randomly initialized with the Kaiming distribution; the embedded sequence vector is E_S = {e_[CLS], e_1, e_2, ..., e_l, e_[SEP]}.
The sequence vector E_S obtained above contains the boundary information of the knowledge point concept vocabulary; next, the context semantic information contained in the character string C is characterized. This patent uses a pre-trained neural network language model, Bert, i.e. a model trained on large-scale general text data. With the pre-trained Bert as semantic encoder, a text sequence can be effectively represented as a high-dimensional vector.
The cleaned character string C is taken as the input of the pre-trained Bert language model, which computes over C character by character. For an input string C = {c_1, c_2, ..., c_l}, the Bert model first inserts the identifiers "[CLS]" and "[SEP]" before the start position and after the end position, respectively; the string {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"} becomes the model's computation data. The Bert model's computation over the string mainly comprises the following steps:
Character embedding operation: each character of the string to be computed, {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"}, is characterized as a d-dimensional character vector by an embedding operation (Embedding); the embedded character string vector is E_c = {e_cls, e_1, ..., e_l, e_sep}.
Integrating position information coding: to obtain the sequence features of the text data, the Bert model uses sin and cos mechanisms on a string vector EcThe position index of each element in (a) is encoded. I.e. for elements at the position of pos, where diD is more than or equal to 1 for the dimension position in each elementiD is less than or equal to d, when d isiFor even numbers, using sin function for conversion, diIf the number is odd, the cos function is used for conversion to obtain the position coding vector
Figure BDA0003461606260000142
Each element p is a d-dimensional vector, and the corresponding position coding formula is as follows:
Figure BDA0003461606260000143
Figure BDA0003461606260000144
Self-attention mechanism based on dot product scaling: the character string vector E_c computed above is added to the position coding vector P to obtain the input vector of the self-attention mechanism, Z = {z_cls, z_1, ..., z_l, z_sep}. The self-attention mechanism captures the degree of association between every pair of elements in the sequence through scaled dot products: the stronger the association between two elements, the larger the computed value. The formula is as follows, where each input is the vector Z multiplied by its corresponding weight parameter W, i.e. Q = Z W^Q, K = Z W^K, V = Z W^V, and d is the input vector dimension:

Attention(Q, K, V) = softmax(Q K^T / √d) V
Multi-head self-attention mechanism: in order to fully consider information from different independent subspaces after the scaled dot product computation, the vectors of h scaled dot product computations, i.e. h self-attention heads, are spliced by Concate and then linearly transformed. The formula is as follows, where head_i = Attention(Z W_i^Q, Z W_i^K, Z W_i^V) and W^O is a trainable parameter matrix:

MultiHead(Q, K, V) = Concate(head_1, ..., head_h) W^O
Feed-forward network layer: the result of each character element after multi-head attention is only a linear transformation; to fully consider the interaction of information across different latent dimensions, a feed-forward network layer with a nonlinear transformation is integrated into the model. It is computed as follows, where W^(1), W^(2), b^(1), b^(2) are trainable parameter matrices:

F = FFN(Z) = ReLU(Z W^(1) + b^(1)) W^(2) + b^(2)
The output vector F obtained from the Bert model is the encoding vector of the character string C; combined with the sequence vector E_S carrying the concept knowledge point vocabulary boundary information, candidate concept knowledge point entities are extracted from C through an LSTM model and a conditional random field CRF. The main process is as follows:

Feature vector fusion: the encoding vector F carrying semantic features and the sequence vector E_S carrying knowledge point concept vocabulary boundary information are spliced by Concate and then linearly transformed by a weight parameter matrix W to obtain the fused vector V = {v_cls, v_1, v_2, ..., v_l, v_sep}:

V = Concate(F, E_S) W
Encoding of the LSTM model: the LSTM model is a variant of the recurrent neural network (RNN) and has a more robust predictive effect than the RNN model. When computing the i-th element it can fully incorporate the vector information of the first i-1 elements. For the element at each time step t the computation is:

z_t = σ(W_z · [h_{t-1}, v_t])
r_t = σ(W_r · [h_{t-1}, v_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, v_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, ⊙ is element-wise (dot product) multiplication, v_t is the t-th element of the fused vector V, and h_t is the implicit state vector, i.e. the output of v_t; the output of the vector V after the LSTM model is H = {h_1, h_2, ..., h_T}, where T = l + 2.
CRF model prediction layer: the model prediction layer judges the implicit vectors output by the LSTM model and consists of a fully connected layer and a CRF layer. First, the implicit state vectors H = {h_1, h_2, ..., h_T} output by the LSTM model are linearly transformed by a fully connected layer to obtain the score of each character for each class label, i.e. each label score l_score_i = [score_1, score_2, score_3] contains three elements, where score_1 is the probability score of predicting the current character as "B", score_2 as "I", and score_3 as "O". The prediction label scores of the characters in the string form L_Score = {l_score_cls, l_score_1, l_score_2, ..., l_score_l, l_score_sep}, which is taken as the input of the CRF layer. The CRF layer models the labels by taking the input score set as an emission score matrix, computes a score transition matrix T between label categories representing the transition probability from one label to another, mines the dependency between label categories, computes the sequence score Score(H) of the string, and decodes it with the Viterbi algorithm to obtain the predicted label sequence Y = {y_cls, y_1, y_2, ..., y_l, y_sep}. Removing the prediction tags corresponding to the start identifier "[CLS]" and end identifier "[SEP]" added by the Bert model gives the prediction tag sequence of the string, Y' = {y_1, y_2, ..., y_l}; extracting the corresponding substrings from the predicted tag sequence yields the knowledge point concept mention entities M = {m_1, m_2, ..., m_k}.
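A minimal sketch of recovering the mention set M and the span indices (beg, end) from the predicted BIO tags:

```python
def extract_mentions(chars: list, tags: list):
    """Collect substrings tagged as a 'B' followed by consecutive 'I' characters."""
    mentions, i = [], 0
    while i < len(tags):
        if tags[i] == 'B':
            j = i + 1
            while j < len(tags) and tags[j] == 'I':
                j += 1
            mentions.append((''.join(chars[i:j]), i, j - 1))  # (text, beg, end)
            i = j
        else:
            i += 1
    return mentions
```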
Knowledge point concept entity link model
The knowledge point concept entity link model matches and associates the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} with knowledge point entities in the knowledge base, mainly in the following steps: 1. each mention entity m_i is fuzzily searched with the Levenshtein Distance string fuzzy matching algorithm, and a set of possibly matching candidate knowledge point entities is selected from the knowledge base; 2. the mention entity m_i and the candidate entities are given context semantic representations through the Bert model, obtaining context semantic representation vectors; 3. the similarity between the context semantic representation vectors of the mention and of each candidate entity is computed with the cos function, and the candidate knowledge point entity with the highest similarity is the linked knowledge point concept.
Candidate knowledge point concept entities are generated with the Levenshtein Distance string fuzzy matching algorithm: the current mention entity m_i is fuzzily matched against the knowledge point concept vocabulary in the knowledge base; by setting the edit distance parameter Distance of the fuzzy matching algorithm, matched vocabulary whose edit distance exceeds Distance is filtered out, generating the candidate knowledge point concept entity set Ent_i = {ent_1, ent_2, ..., ent_n}.
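A sketch of this candidate generation step; the edit distance is computed with the classic dynamic program rather than a particular library, and Distance = 2 is an illustrative threshold:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidate_entities(mention: str, vocabulary, distance: int = 2):
    """Keep knowledge-base vocabulary whose edit distance to the mention is <= distance."""
    return [w for w in vocabulary if levenshtein(mention, w) <= distance]
```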
In the external knowledge base, every knowledge point concept vocabulary entry has a corresponding abstract text description. This method encodes the abstract text description of each candidate knowledge point concept entity with the pre-trained Bert model introduced above to obtain a vector characterizing the candidate entity. For a candidate knowledge point concept entity ent_j, its corresponding abstract description, as a character string D_j = {d_1, d_2, ..., d_q}, is the input of the Bert model. The encoded output vector is H_ent = {h_cls, h_1, ..., h_sep}. The implicit vector h_cls corresponding to the identifier "[CLS]" is passed through a fully connected layer with tanh activation to obtain the output vector v_{ent_j}, the characterization vector of the candidate knowledge point concept entity, i.e. v_{ent_j} = tanh(h_cls W + b). In this way the set of characterization vectors of the candidate entity set is obtained: V_Ent_i = {v_{ent_1}, v_{ent_2}, ..., v_{ent_n}}.
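A sketch of this encoding step using the Hugging Face transformers API, assuming a generic Chinese Bert checkpoint; the tanh-activated fully connected layer here is freshly initialized and stands in for the trained layer described above:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")
fc = torch.nn.Linear(bert.config.hidden_size, bert.config.hidden_size)

def encode_entity(abstract: str) -> torch.Tensor:
    """Characterization vector v_ent = tanh(FC(h_cls)) of an entity's abstract text."""
    inputs = tokenizer(abstract, return_tensors="pt", truncation=True)
    with torch.no_grad():
        h = bert(**inputs).last_hidden_state   # (1, seq_len, hidden)
    h_cls = h[:, 0]                            # hidden vector at the "[CLS]" position
    return torch.tanh(fc(h_cls)).squeeze(0)
```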
For each mentioned knowledge point concept m_i, the course text C in which it appears is characterized: the pre-trained Bert model first encodes the course text C = {c_1, c_2, ..., c_l} to obtain its characterization vector V_C, in the same way as the characterization vector of a candidate knowledge point concept entity is obtained.
The encoding vectors of the characters of the course text computed by the Bert model are H_C = {h_cls, h_1, h_2, ..., h_l, h_sep}. For an extracted knowledge point concept mention m_i, the index position of the plaintext substring it denotes can be represented as a tuple (beg, end), where beg is the index of the substring's starting position in C and end the index of its ending position. The slice of H_C between the start index beg and the end index end is H_{m_i} = {h_beg, ..., h_end}. H_{m_i} is fed through the text convolution network TextCNN to obtain the mention's characterization vector v_{m_i} = TextCNN(H_{m_i}).
The TextCNN model takes H_{m_i} as input; the calculation steps are as follows:
1. A plurality of one-dimensional convolution kernels are defined and applied to the input to capture correlations between adjacent characters.
2. Max-over-time pooling is applied to each output channel, and the pooled outputs of all channels are spliced to obtain the characterization vector.
Finally, the course text characterization vector V_C and the mention characterization vector v_{m_i} are joined by a Concate splicing operation and passed through a fully connected layer with tanh activation to obtain the output vector u_{m_i}, i.e. u_{m_i} = tanh(Concate(V_C, v_{m_i}) W + b).
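A one-step sketch of this fusion, with the linear layer's dimensions left to the caller (input size dim(V_C) + dim(v_m)):

```python
import torch

def mention_output_vector(V_C: torch.Tensor, v_m: torch.Tensor,
                          fc: torch.nn.Linear) -> torch.Tensor:
    """u_m = tanh(FC(Concate(V_C, v_m))): course-text and mention vectors fused."""
    fused = torch.cat([V_C, v_m], dim=-1)   # Concate splicing operation
    return torch.tanh(fc(fused))
```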
The mention's output vector u_{m_i} is compared by cos similarity with every characterization vector in the candidate entity set V_Ent_i = {v_{ent_1}, ..., v_{ent_n}}; the knowledge point concept with the highest similarity is selected from the candidate set and associated with the mention, so the final association result can be represented as a tuple (m_i, ent*).
The linking results of all knowledge point concepts contained in the input course text are R = {(m_1, ent_1*), (m_2, ent_2*), ..., (m_k, ent_k*)}, completing the association between teaching resources and the knowledge point concepts in the knowledge base.
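A sketch of the final selection by cos similarity, assuming the mention output vector and the candidate characterization vectors have already been computed:

```python
import torch
import torch.nn.functional as F

def link_mention(u_m: torch.Tensor, candidate_vectors, candidate_names):
    """Pick the candidate entity whose characterization vector is most cos-similar to u_m."""
    sims = torch.stack([F.cosine_similarity(u_m, v, dim=0) for v in candidate_vectors])
    best = int(sims.argmax())
    return candidate_names[best], float(sims[best])   # yields the tuple (mention, ent*)
```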

Claims (7)

1. An intelligent online teaching resource knowledge point concept entity linking method is characterized by comprising the following steps:
1) firstly, a preprocessing step of character string cleaning is performed: for each character, judge whether it belongs to the union of the Chinese, numeric and English character sets; characters not in this character set are removed;
2) the model labels every element of the cleaned character string C = {c_1, c_2, ..., c_l} with a BIO labeling scheme: when a character c_i is labeled "B", c_i is the first character of some knowledge point concept vocabulary entity, "I" marks a middle character of a knowledge point concept vocabulary entity, and "O" marks a character outside any knowledge point concept vocabulary; labeled text data are finally obtained;
3) text data enhancement builds a knowledge point concept dictionary Dict from the knowledge point vocabulary terms and their aliases in the knowledge base; a bidirectional maximum matching algorithm (BiDirectional Maximum Matching) matches the character string C to find the dictionary words it contains, and every matched character substring is labeled with a "BIEO" scheme: if a matched substring is C_sub = {c_i, c_{i+1}, ..., c_{i+m}}, C_sub ∈ Dict, the starting character c_i is labeled "B", the ending character c_{i+m} is labeled "E", the characters {c_{i+1}, c_{i+2}, ..., c_{i+m-1}} between them are labeled "I", and unmatched characters are labeled "O"; this yields a labeled character string, to which the start character "[CLS]" and end character "[SEP]" are added: S = {s_[CLS], s_1, s_2, ..., s_l, s_[SEP]}, where each element s_i consists of the character c_i at the corresponding index of C and its label character;
4) a vector space embedding operation Embedding(S) is applied to the labeled character string S: each element s_i of S is characterized as a d_s-dimensional vector whose values are randomly initialized with the Kaiming distribution; the embedded sequence vector is E_S = {e_[CLS], e_1, e_2, ..., e_l, e_[SEP]};
5) the sequence vector E_S obtained above contains the boundary information of the knowledge point concept vocabulary; the context semantic information contained in the character string C is then characterized with a pre-trained neural network language model Bert, i.e. a model trained on large-scale general text data; used as a semantic encoder, the pre-trained Bert can effectively represent a text sequence as a high-dimensional vector; the cleaned character string C serves as the input of the pre-trained Bert language model, which computes over C character by character: for the input string C = {c_1, c_2, ..., c_l}, the Bert model first inserts the identifiers "[CLS]" and "[SEP]" before the start position and after the end position, respectively, and the string {"[CLS]", c_1, c_2, ..., c_l, "[SEP]"} becomes the model's computation data;
6) the output vector F obtained from the Bert model is the encoding vector of the character string C; combined with the sequence vector E_S carrying the concept knowledge point vocabulary boundary information, candidate concept knowledge point entities are extracted from C through an LSTM model and a conditional random field CRF; the substrings corresponding to the predicted tag sequence are extracted to obtain the knowledge point concept mention entities;
7) the knowledge point concept entity linking model matches and associates the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} with knowledge point entities in the knowledge base; candidate knowledge point concept entities are generated with the Levenshtein Distance string fuzzy matching algorithm: the current mention entity m_i is fuzzily matched against the knowledge point concept vocabulary in the knowledge base, matched vocabulary whose edit distance exceeds the parameter Distance is filtered out, and a candidate knowledge point concept entity set Ent_i = {ent_1, ent_2, ..., ent_n} is generated;
8) the abstract text description of each candidate knowledge point concept entity is encoded by the pre-trained Bert model to obtain a vector characterizing the candidate entity: for a candidate knowledge point concept entity ent_j, its corresponding abstract description, as a character string D_j = {d_1, d_2, ..., d_q}, is the input of the Bert model; the encoded output is H_ent = {h_cls, h_1, ..., h_sep}; the implicit vector h_cls corresponding to the identifier "[CLS]" is passed through a fully connected layer with tanh activation to obtain the output vector v_{ent_j}, which serves as the characterization vector of the candidate knowledge point concept entity, i.e. v_{ent_j} = tanh(h_cls W + b); in this way the set of characterization vectors of the candidate entity set is obtained: V_Ent_i = {v_{ent_1}, v_{ent_2}, ..., v_{ent_n}};
9) for each mentioned knowledge point concept m_i, the course text C in which it appears is characterized: the pre-trained Bert model first encodes the course text C = {c_1, c_2, ..., c_l} to obtain its characterization vector V_C, obtained in the same way as the characterization vector of a candidate knowledge point concept entity;
10) the encoding vectors of the characters of the course text computed by the Bert model are H_C = {h_cls, h_1, h_2, ..., h_l, h_sep}; for an extracted knowledge point concept mention m_i, the index position of the plaintext substring it denotes can be represented as a tuple (beg, end), where beg is the index of the substring's starting position in C and end the index of its ending position; the slice of H_C between the start index beg and the end index end is H_{m_i} = {h_beg, ..., h_end}; H_{m_i} is fed through a text convolution network TextCNN to obtain the mention's characterization vector v_{m_i} = TextCNN(H_{m_i}); the course text characterization vector V_C and the mention characterization vector v_{m_i} are joined by a Concate splicing operation and passed through a fully connected layer with tanh activation to obtain the output vector u_{m_i}, i.e. u_{m_i} = tanh(Concate(V_C, v_{m_i}) W + b);
11) the mention's output vector u_{m_i} is compared by cos similarity with every characterization vector in the candidate entity set V_Ent_i = {v_{ent_1}, ..., v_{ent_n}}; the knowledge point concept with the highest similarity is selected from the candidate set and associated with the mention, so the final association result can be represented as a tuple (m_i, ent*);
12) the linking results of all knowledge point concepts contained in the input course text are R = {(m_1, ent_1*), (m_2, ent_2*), ..., (m_k, ent_k*)}, completing the association between teaching resources and the knowledge point concepts in the knowledge base.
2. The method as claimed in claim 1, wherein the input of the knowledge point concept entity recognition model is a text string X = {x_1, x_2, ..., x_n}: X consists of n characters, x_i being the i-th character of X; the text string may come from course video captions, electronic textbook text, and the like.
3. The method as claimed in claim 1, wherein the character string cleaning preprocessing is implemented mainly through the Unicode code table: when a character x_i's Unicode code u(x_i) lies between \u4e00 and \u9fa5, x_i is a Chinese character; likewise, when u(x_i) lies between \u0030 and \u0039, x_i is a numeric character; when u(x_i) lies between \u0041 and \u005a or between \u0061 and \u007a, x_i is an English character; all characters outside these Unicode ranges are deleted, completing the cleaning of the character string; the cleaned string is C = {c_1, c_2, ..., c_l}, whose length l satisfies l ≤ n.
4. The intelligent online teaching resource knowledge point concept entity linking method as claimed in claim 1, wherein the calculation of the character string by the Bert model mainly comprises the following steps:
1) character embedding operation: the character string { "[ CLS ] to be calculated]″,c1,c2,......,cl,″[SEP]Each character in the character string is characterized as a d-dimensional character vector through Embedding operation (Embedding), and the embedded character string vector is
Figure FDA0003461606250000056
2) Integrating position information encoding: to obtain the sequence features of the text data, the Bert model uses sin and cos mechanisms to encode the position index of each element in the string vector E_c. That is, for the element at position pos, let d_i be the dimension index within each element, 1 ≤ d_i ≤ d; when d_i is even the sin function is used for the conversion, and when d_i is odd the cos function is used, giving the position encoding vector P = {p_cls, p_1, ..., p_l, p_sep}, where each element p is a d-dimensional vector. The corresponding position encoding formulas are as follows:

PE(pos, 2i) = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
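A sketch of the standard sinusoidal encoding matching the two formulas above; sequence length and dimension are hypothetical:

```python
import numpy as np

def position_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position encoding: sin on even dimensions, cos on odd ones."""
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1) position indices
    i = np.arange(d)[None, :]           # (1, d) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimension indices
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimension indices
    return pe

P = position_encoding(seq_len=10, d=16)    # one d-dim vector per position
```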
3) Self-attention mechanism based on scaled dot products: the character string vector E_c obtained above is added to the position encoding vector P to obtain the input vector Z = {z_cls, z_1, ..., z_l, z_sep} of the self-attention mechanism. The self-attention mechanism captures the degree of association between every pair of elements in the sequence through scaled dot products; the stronger the association between two elements, the larger the computed value. The self-attention calculation formula is as follows, where each input is the vector Z multiplied by its corresponding weight parameter matrix, i.e. Q = Z W^Q, K = Z W^K, V = Z W^V, and d is the input vector dimension:

Attention(Q, K, V) = softmax(Q K^T / √d) V
4) Multi-head self-attention mechanism: in order to fully consider information from different independent subspaces after the scaled dot-product calculation, the vectors from h scaled dot-product calculations, i.e. h attention heads head_1, ..., head_h, are concatenated (Concate) and then linearly transformed, where W^O is a trainable parameter matrix:

MultiHead(Q, K, V) = Concate(head_1, ..., head_h) W^O
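A minimal NumPy sketch covering steps 3) and 4) together; the sequence length, model dimension, and number of heads are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Z, heads, W_o):
    """Run h attention heads, concatenate their outputs, project with W_o."""
    outs = [self_attention(Z, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1) @ W_o

# Hypothetical sizes: sequence of 6 elements, model dim 16, 4 heads of dim 4.
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
W_o = rng.normal(size=(16, 16))
out = multi_head(Z, heads, W_o)   # shape (6, 16)
```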
5) Feed-forward network layer: the result for each character element after multi-head attention is only a linear transformation; in order to fully consider the interactions between information in different latent dimensions, a feed-forward network layer with a nonlinear transformation is integrated into the model. Its calculation is as follows, where W^(1), W^(2) and b^(1), b^(2) are trainable parameter matrices and bias vectors:

F = FFN(Z) = ReLU(Z W^(1) + b^(1)) W^(2) + b^(2)
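The feed-forward formula as a one-liner in NumPy, with hypothetical shapes (a 64-dim hidden layer here is only an example):

```python
import numpy as np

def ffn(Z, W1, b1, W2, b2):
    """F = ReLU(Z W1 + b1) W2 + b2, applied position-wise."""
    return np.maximum(0.0, Z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 16))
W1, b1 = rng.normal(size=(16, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 16)), np.zeros(16)
F = ffn(Z, W1, b1, W2, b2)   # shape (6, 16)
```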
5. The method of claim 1, wherein the candidate knowledge point concept mention entities are extracted from the character string C by an LSTM model and a conditional random field (CRF); the main process is as follows:
1) Feature vector fusion: the encoding vector F carrying semantic features and the sequence vector E_S carrying knowledge point concept vocabulary boundary information are concatenated (Concate) and then linearly transformed through a weight parameter matrix W to obtain the fused vector V = {v_cls, v_1, v_2, ..., v_l, v_sep}; the formula is as follows:

V = Concate(F, E_S) W
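A sketch of this fusion step; the sequence length and feature dimensions are hypothetical:

```python
import numpy as np

def fuse_features(F, E_s, W):
    """Concatenate semantic features F with boundary-information features E_s,
    then linearly transform: V = Concate(F, E_s) W."""
    return np.concatenate([F, E_s], axis=-1) @ W

# Hypothetical shapes: T characters, d-dim semantic and boundary features.
T, d = 12, 16
rng = np.random.default_rng(2)
F, E_s = rng.normal(size=(T, d)), rng.normal(size=(T, d))
W = rng.normal(size=(2 * d, d))
V = fuse_features(F, E_s, W)   # fused vector sequence, shape (T, d)
```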
2) Encoding by the LSTM model: the LSTM model is a variant of the recurrent neural network (RNN) and has a more robust predictive effect than the plain RNN; when the i-th element is calculated, the vector information of the first i-1 elements can be fully combined. The calculation process of the model for the element at each time step t is as follows:

z_t = σ(W_i · [h_{t-1}, v_t])
r_t = σ(W_r · [h_{t-1}, v_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, v_t])
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ is the sigmoid function, ⊙ is the element-wise product operator, v_t is the t-th element of the fused vector V, and h_t is the hidden state vector corresponding to v_t. The output of the vector V after passing through the model is H = {h_1, h_2, ..., h_T}, where T = l + 2.
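A sketch of one time step of the gated recurrence written out above (the z_t/r_t gates are GRU-form, so that is what is implemented here); the weight names W_z, W_r, W_h and all shapes are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(h_prev, v_t, W_z, W_r, W_h):
    """One time step of the gated recurrence in claim 5."""
    x = np.concatenate([h_prev, v_t])
    z = sigmoid(W_z @ x)                  # update gate z_t
    r = sigmoid(W_r @ x)                  # reset gate r_t
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, v_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # new hidden state h_t

d = 8
rng = np.random.default_rng(0)
h, V_seq = np.zeros(d), rng.normal(size=(5, d))
W_z, W_r, W_h = (rng.normal(size=(d, 2 * d)) for _ in range(3))
H = []
for v_t in V_seq:              # run the recurrence over the fused sequence
    h = gated_step(h, v_t, W_z, W_r, W_h)
    H.append(h)                # H collects the hidden states h_1..h_T
```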
3) CRF model prediction layer: the model prediction layer judges the hidden vectors output by the LSTM model and consists of a fully connected layer and a CRF layer. First, the hidden state vector H = {h_1, h_2, ..., h_T} output by the LSTM model is linearly transformed through a fully connected layer to obtain the score of each character for each category label, i.e. each label score l_score_i = [score_1, score_2, score_3] contains three elements, where score_1 is the probability score of predicting the current character as "B", score_2 the probability score of predicting it as "I", and score_3 the probability score of predicting it as "O". The prediction scores of all characters in the string form L_Score = {l_score_cls, l_score_1, l_score_2, ..., l_score_l, l_score_sep}, and this score set is taken as the input of the CRF layer. The CRF layer models the labels by taking the input score set as an emission score matrix, calculates a label transition matrix T representing the transition probability from one label category to another, mines the dependency relationships between label categories, computes the sequence score Score(H) of the character string, and decodes it with the Viterbi algorithm to obtain the predicted label sequence Ŷ = {y_cls, y_1, ..., y_l, y_sep}. After removing the prediction tags corresponding to the start identifier "[CLS]" and the end identifier "[SEP]" introduced by the Bert model, the predicted label sequence of the character string is Y = {y_1, y_2, ..., y_l}. Extracting the corresponding substrings from the predicted label sequence yields the knowledge point concept mention entities M = {m_1, m_2, ..., m_k}.
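Once the B/I/O label sequence is available (after Viterbi decoding and removal of the [CLS]/[SEP] tags), extracting the mention substrings is mechanical; a sketch with a hypothetical example string:

```python
def extract_mentions(chars, labels):
    """Extract knowledge point concept mention substrings from B/I/O tags."""
    mentions, current = [], []
    for ch, tag in zip(chars, labels):
        if tag == "B":                  # beginning of a new mention
            if current:
                mentions.append("".join(current))
            current = [ch]
        elif tag == "I" and current:    # continuation of the current mention
            current.append(ch)
        else:                           # "O": outside any mention
            if current:
                mentions.append("".join(current))
            current = []
    if current:
        mentions.append("".join(current))
    return mentions

assert extract_mentions(list("学习二叉树与哈希表"),
                        ["O","O","B","I","I","O","B","I","I"]) == ["二叉树", "哈希表"]
```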
6. The method as claimed in claim 1, wherein the extracted knowledge point concept mention entities M = {m_1, m_2, ..., m_k} are matched and associated with knowledge point entities in a knowledge base, mainly comprising the following steps: 1) a Levenshtein distance string fuzzy matching algorithm performs a fuzzy search for each mention entity m_i and selects a set of possibly matching candidate knowledge point entities from the knowledge base; 2) the mention entity m_i and the candidate entities are given context semantic representations through the Bert model, yielding context semantic representation vectors; 3) the similarity between the context semantic representation vector of the mention entity and that of each candidate entity is calculated with the cos function, and the candidate knowledge point entity with the highest similarity is the linked knowledge point concept.
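A sketch of the fuzzy candidate retrieval of step 1); the max_dist threshold is a hypothetical parameter that the claim does not fix:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidate_entities(mention, knowledge_base, max_dist=2):
    """Step 1) of claim 6: fuzzy-match a mention against entity names."""
    return [e for e in knowledge_base if levenshtein(mention, e) <= max_dist]

assert candidate_entities("二叉搜索树", ["二叉树", "二叉搜索树", "哈希表"]) \
       == ["二叉树", "二叉搜索树"]
```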
7. The intelligent online teaching resource knowledge point concept entity linking method as claimed in claim 1, wherein the TextCNN model operates on the input embedded character vector sequence, and the calculation steps are as follows:
1) A plurality of one-dimensional convolution kernels are defined, and each kernel performs a convolution calculation over the input to capture the correlation between adjacent characters.
2) Max-over-time pooling is performed on each output channel, and the pooled output values of all channels are concatenated to obtain the characterization vector, as in the sketch after this list.
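A minimal sketch of the two TextCNN steps above, with one filter per kernel width; a real TextCNN typically uses many filters per width, and all shapes here are hypothetical:

```python
import numpy as np

def textcnn_characterize(E, kernels):
    """1-D convolutions over the embedded character sequence E (seq_len x d),
    max-over-time pooling per output channel, then concatenation."""
    pooled = []
    for K in kernels:                  # K shape: (width, d) - one conv kernel
        width = K.shape[0]
        outs = [np.sum(E[t:t + width] * K)   # channel response at position t
                for t in range(E.shape[0] - width + 1)]
        pooled.append(max(outs))             # max-over-time pooling
    return np.array(pooled)            # characterization vector, one value/channel

rng = np.random.default_rng(3)
E = rng.normal(size=(20, 16))          # hypothetical embedded character sequence
kernels = [rng.normal(size=(w, 16)) for w in (2, 3, 4)]
v = textcnn_characterize(E, kernels)   # 3-dim characterization vector here
```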