CN116204607A - Text online learning resource knowledge point labeling method, system and medium - Google Patents

Text online learning resource knowledge point labeling method, system and medium

Info

Publication number
CN116204607A
CN116204607A (application number CN202310188731.2A)
Authority
CN
China
Prior art keywords
sequence
entity
attention
text
knowledge point
Prior art date
Legal status
Pending
Application number
CN202310188731.2A
Other languages
Chinese (zh)
Inventor
王挺
庞焜元
唐晋韬
李莎莎
吕明阳
龙科含
何亮亮
李冬
王攀成
陈凤
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202310188731.2A
Publication of CN116204607A
Legal status: Pending

Classifications

    • G06F16/3344: Information retrieval of unstructured textual data; querying; query execution using natural language analysis
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F16/36: Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri
    • G06F40/237: Handling natural language data; natural language analysis; lexical tools
    • G06F40/295: Handling natural language data; recognition of textual entities; named entity recognition
    • G06F40/30: Handling natural language data; semantic analysis
    • G06Q50/20: ICT specially adapted for implementation of business processes of specific business sectors; services; education
    • Y02D10/00: Climate change mitigation technologies in ICT; energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text online learning resource knowledge point labeling method, system and medium. The method comprises: tokenizing an input course subtitle text to obtain a token sequence and obtaining a BERT encoding through a BERT encoding layer; performing dictionary matching between the input course subtitle text and a preset mention-entity table to obtain a candidate entity sequence; computing a dictionary attention encoding for each element of the candidate entity sequence using an entity encoder BE; concatenating the BERT encoding with the dictionary attention encodings and inputting them into a Transformer layer to obtain an attention-enhanced representation; and inputting the attention-enhanced representation into a linear classification layer to obtain a start score, an end score and a mention-interior score, which are then input into a decoding layer to obtain the knowledge point labeling result. The method realizes automatic labeling of knowledge points in text online learning resources and has the advantages of high precision and high recall.

Description

Text online learning resource knowledge point labeling method, system and medium
Technical Field
The invention relates to the technical field of online education and learning, and in particular to a method, system and medium for labeling knowledge points in text online learning resources.
Background
Massive Open Online Courses (MOOCs) have become an important Internet-based online learning application in recent years. Unlike traditional classroom teaching, the background knowledge of MOOC learners varies greatly, and so does their understanding of the various terms involved in the course materials.
A MOOC provides learners with course introduction pages, reading materials, illustrations, lecture videos (and their subtitles), quizzes and discussion questions, which together form the learning resources available to the learner. Online learning resources represented by reading materials and lecture videos exist as long plain-text structures and are called text online learning resources. In text online learning resources, knowledge points (or course concepts) are one of the characteristic language structures. A knowledge point refers to a knowledge concept taught in a course video that helps students understand the topics of that video. Specifically, it must satisfy two criteria: (1) phraseness: it must be a grammatically and semantically correct complete phrase; (2) informativeness: it must denote a scientific or technological concept, and this concept must be relevant to the current course. Although the definition of knowledge points is somewhat subjective, annotators who master the relevant knowledge usually agree on what they are.
Existing knowledge point recognition methods take the phrase as the unit and rely only on literal-matching modeling, without truly understanding course concepts in their context. For example, the modeling process of the Pan model is: first extract phrases from the subtitles and merge phrases with identical surface form and composition into candidate samples; then manually label and machine-score the candidate samples as units; finally, use the positive candidate samples as a course concept library and match it against the video subtitles to complete the labeling. Under this scheme, whenever the same surface text is encountered, all occurrences are treated as references to the same course concept, a problem that should instead be handled with techniques and procedures such as entity linking. An important task in online learning resource analysis is to distinguish which content is closely related to the course, which is the focus of the teacher's instruction and the content students need to master. Labeling the course-related entities in the text produces semantic highlighting that guides the learner's attention, helps learners check their learning results and prevents key points from being missed. How to realize knowledge point labeling for text online learning resources has therefore become a key technical problem to be solved urgently.
Disclosure of Invention
The technical problem to be solved by the invention: in view of the above problems in the prior art, the invention provides a text online learning resource knowledge point labeling method, system and medium.
In order to solve the technical problems, the invention adopts the following technical scheme:
a text online learning resource knowledge point labeling method comprises the following steps:
s101, tokenizing an input course caption text to obtain a tokenized sequence [ t ] 1 ,t 2 ,...,t n ]And obtain BERT code through BERT coding layer
Figure BDA0004104731060000021
Dictionary matching is carried out on the input course caption text and a preset designated entity table, and a candidate entity sequence [ e ] is obtained 1 ,e 2 ,...]And calculates a dictionary attention code of each element therein using an entity encoder BE;
s102, BERT is encoded
Figure BDA0004104731060000022
And inputting the dictionary attention code split into a transducer layer to obtain an attention enhancement representation hr;
s103, enhancing the attention to the representation h r Inputting the initial scoring s into a linear classification layer for linear classification start Ending scoring s end And index internal scoring s mention Scoring s the start start Ending scoring s end And index internal scoring s mention And inputting the decoding layer to obtain a knowledge point labeling result.
Optionally, the functional expression for obtaining the BERT encoding h^b through the BERT encoding layer in step S101 is:
h^b = BERT([CLS], t_1, t_2, ..., t_n, [SEP]), h^b ∈ R^(n×h)
In the above formula, BERT denotes the BERT encoding model, [CLS] and [SEP] are the marker tokens for sentence start and separation, t_1 ~ t_n are the tokens in the token sequence, R^(n×h) denotes the dimensions of the representation, h is the hidden layer dimension of the BERT encoding model, and n is the number of tokens in the token sequence.
Optionally, concatenating the BERT encoding h^b with the dictionary attention encodings BE(e_(i-n)) in step S102 comprises:
S201, first, using the mention-entity table and the prior probabilities, finding, among all substrings of the token sequence [t_1, t_2, ..., t_n] that match the mention-entity table, the matching entity with the highest prior probability, and taking that prior probability as the link confidence; then screening the link confidences according to a preset threshold th_rl and selecting the mention-entity pairs whose confidence is greater than the threshold to obtain a mention list {(rs_i, re_i, e_i)}, where (rs_i, re_i) is the position information of the candidate entity e_i, rs_i is the start position of the candidate entity e_i and re_i is the end position of the candidate entity e_i;
S202, splicing the mention list {(rs_i, re_i, e_i)} with the original token sequence [t_1, t_2, ..., t_n] into three sequences:
x^r = [t_1, t_2, ..., t_n, e_1, e_2, ...]
head^r = [1, 2, ..., n, rs_1, rs_2, ...]
tail^r = [1, 2, ..., n, re_1, re_2, ...]
In the above formulas, x^r denotes the token-and-entity sequence, head^r denotes the sequence of start positions of the elements of x^r in the original token sequence [t_1, t_2, ..., t_n], and tail^r denotes the sequence of end positions of the elements of x^r in the original token sequence [t_1, t_2, ..., t_n];
S203, using the start position sequence head^r and the end position sequence tail^r, computing, for any i-th element x^r_i and j-th element x^r_j of the token-and-entity sequence x^r, the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt), and computing the correlation R_ij between the i-th element x^r_i and the j-th element x^r_j of the token-and-entity sequence x^r;
S204, determining, based on the BERT encoding h^b and the dictionary attention encodings, the vector representation E_i of any i-th element x^r_i of the token-and-entity sequence x^r, and, combined with the correlation R_ij, determining the attention weight A_ij between the i-th element x^r_i and the j-th element x^r_j of the token-and-entity sequence x^r; performing attention weighting based on the attention weights A_ij to obtain the weighted features A, which serve as input to the Transformer layer.
Optionally, the functional expressions for calculating the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt) in step S203 are:
d_ij^(ht) = head_i^r - tail_j^r
d_ij^(hh) = head_i^r - head_j^r
d_ij^(th) = tail_i^r - head_j^r
d_ij^(tt) = tail_i^r - tail_j^r
In the above formulas, head_i^r and tail_i^r are the elements of the start position sequence head^r and of the end position sequence tail^r corresponding to any i-th element x^r_i of the token-and-entity sequence x^r, and head_j^r and tail_j^r are the elements of the start position sequence head^r and of the end position sequence tail^r corresponding to any j-th element x^r_j of the token-and-entity sequence x^r.
Optionally, the calculation function expression of the correlation R_ij between the i-th element x^r_i and the j-th element x^r_j in step S203 is:
R_ij = ReLU(W_r [P(d_ij^(ht)) ⊕ P(d_ij^(hh)) ⊕ P(d_ij^(th)) ⊕ P(d_ij^(tt))])
In the above formula, ReLU denotes the ReLU activation function, W_r is a weight matrix for the token-and-entity sequence x^r, ⊕ denotes the concatenation operation, and P(d_ij^(ht)), P(d_ij^(hh)), P(d_ij^(th)) and P(d_ij^(tt)) are the results of encoding the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt) with the relative position encoding P.
Optionally, the functional expression of the vector representation E_i in step S204 is:
E_i = h_i^b if i ≤ n; E_i = BE(e_(i-n)) if i > n
In the above formula, h_i^b is the i-th encoding in the BERT encoding h^b, BE(e_(i-n)) is the (i-n)-th encoding among the dictionary attention encodings, e_(i-n) is the (i-n)-th candidate entity in the candidate entity sequence, i is the index of the i-th element x^r_i of the token-and-entity sequence x^r, n is the number of tokens in the token sequence, and x^r_i is any i-th element of the token-and-entity sequence x^r; and the calculation function expression of the attention weight A_ij is:
A_ij = E_i^T W_q^T W_{k,E} E_j + E_i^T W_q^T W_{k,R} R_ij + u^T W_{k,E} E_j + v^T W_{k,R} R_ij
In the above formula, W_q is a trainable weight matrix, E_i is the vector representation of the i-th element x^r_i of the token-and-entity sequence x^r, E_j is the vector representation of the j-th element x^r_j of the token-and-entity sequence x^r, W_{k,E} is a trainable weight matrix, R_ij is the correlation between the i-th element x^r_i and the j-th element x^r_j, W_{k,R} is a trainable weight matrix, u is a trainable weight vector and v is a trainable weight vector.
Optionally, step S102 comprises:
S301, concatenating the BERT encoding h^b with the dictionary attention encodings and inputting them into the Transformer layer to obtain the complete dictionary attention-enhanced representation H, wherein the functional expression of the complete dictionary attention-enhanced representation H_i of the i-th element x^r_i of the token-and-entity sequence x^r is:
H_i = [softmax(A) E W_v]_i
In the above formula, softmax denotes the softmax activation function, A is the matrix composed of the attention weights A_ij, E is the matrix composed of the vector representations E_i of all elements x^r_i of the token-and-entity sequence x^r, and W_v is a trainable weight matrix;
S302, for the complete dictionary attention-enhanced representations H_i of the elements x^r_i of the token-and-entity sequence x^r, taking out the first n items as the attention-enhanced representations h^r_i, and obtaining the attention-enhanced representation h^r composed of the taken-out representations h^r_i:
h^r = [h^r_1, h^r_2, ..., h^r_n] = [H_1, H_2, ..., H_n]
In the above formula, i is the index of the i-th element x^r_i of the token-and-entity sequence x^r, and n is the number of tokens in the token sequence.
Optionally, in step S103, the functional expressions for obtaining the start score s_start, the end score s_end and the mention-interior score s_mention by linear classification in the linear classification layer are:
s_start(i) = w_start^T h^r_i
s_end(j) = w_end^T h^r_j
s_mention(k) = w_mention^T h^r_k
In the above formulas, s_start(i) is the predicted score of position i being the start position of a knowledge point, h^r_i is the attention-enhanced encoded representation at position i, s_end(j) is the predicted score of position j being the end position of a knowledge point, h^r_j is the attention-enhanced encoded representation at position j, s_mention(k) is the predicted score of position k being an interior component of a knowledge point, h^r_k is the attention-enhanced encoded representation at position k, and w_start, w_end and w_mention are trainable network parameters of the linear classification layer; and when the decoding layer obtains the knowledge point labeling result, the calculation function expression of the probability of any region (i, j) is:
p(i, j) = σ(s_start(i) + s_end(j) + Σ_{k=i}^{j} s_mention(k))
In the above formula, p(i, j) denotes the probability of the region (i, j) and σ denotes the sigmoid function; if the probability of the region (i, j) is greater than a set value, the region (i, j) is judged to be a knowledge point labeling region, thereby obtaining the knowledge point labeling result.
In addition, the invention also provides a text online learning resource knowledge point labeling system, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the text online learning resource knowledge point labeling method.
Furthermore, the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program is used for being programmed or configured by a microprocessor to execute the text online learning resource knowledge point labeling method.
Compared with the prior art, the invention has the following advantages: the invention tokenizes the input course subtitle text to obtain a token sequence and obtains the BERT encoding through a BERT encoding layer; performs dictionary matching between the input course subtitle text and a preset mention-entity table to obtain a candidate entity sequence; computes the dictionary attention encoding of each element of the candidate entity sequence using the entity encoder BE; concatenates the BERT encoding with the dictionary attention encodings and inputs them into a Transformer layer to obtain an attention-enhanced representation; and decodes the resulting scores to obtain the labeling result. The invention can realize automatic labeling of knowledge points for text online learning resources and has the advantages of high precision and high recall.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the network structure of the entity labeling model DsMOOC in an embodiment of the invention.
Fig. 3 is a schematic diagram of the network training of the entity labeling model DsMOOC in an embodiment of the invention.
Fig. 4 is a general entity linking example in an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the text online learning resource knowledge point labeling method of this embodiment comprises:
S101, tokenizing an input course subtitle text to obtain a token sequence [t_1, t_2, ..., t_n] and obtaining a BERT encoding h^b through a BERT encoding layer; performing dictionary matching between the input course subtitle text and a preset mention-entity table to obtain a candidate entity sequence [e_1, e_2, ...], and computing the dictionary attention encoding of each element therein using an entity encoder BE;
S102, concatenating the BERT encoding h^b with the dictionary attention encodings and inputting them into a Transformer layer to obtain an attention-enhanced representation h^r;
S103, inputting the attention-enhanced representation h^r into a linear classification layer for linear classification to obtain a start score s_start, an end score s_end and a mention-interior score s_mention, and inputting the start score s_start, the end score s_end and the mention-interior score s_mention into a decoding layer to obtain the knowledge point labeling result.
In the text online learning resource knowledge point labeling method of this embodiment, the network model formed by the BERT encoding layer, the entity encoder BE, the Transformer layer, the linear classification layer and the decoding layer is named the entity labeling model DsMOOC (Discovery and selection in MOOC). Entity labeling of course subtitles is a refinement of traditional course concept extraction research. In order to mark in the subtitles the knowledge that learners need to understand, the degree of relevance between an entity and the course must be distinguished on the basis of entity recognition in the subtitles, and only phrases referring to entities that help the learner understand are selected. Entity labeling of course subtitles differs from the conventional concept extraction task in that concept extraction does not consider context information, whereas this embodiment delineates entity boundaries from the context and judges whether an entity is related to the course. The context provides richer semantic information, which allows pre-trained language models such as BERT to be applied, but it also brings more complex problem boundaries: entity recognition is required, and mismatches between literally identical strings must be eliminated. Entity labeling of course subtitles is similar to the Wikipedia entity knowledge labeling task in that entities must be recognized in plain text and then screened for relevance, but there are two significant differences. On the one hand, this embodiment does not need to avoid repeated labeling. On the other hand, Wikipedia and course subtitles differ in their criteria of what is "helpful", which can lead to significant differences in the screening results and needs to be addressed. In addition, the Wikipedia labeling task has relatively abundant training corpora, whereas the training set available for this task may be much smaller than needed, and this problem also needs to be solved. The entity labeling model DsMOOC adopted by the text online learning resource knowledge point labeling method of this embodiment uses information from the knowledge graph as dictionary-attention representation enhancement, and compared with other existing labeling methods, DsMOOC achieves a significant performance improvement.
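The BERT encoding of step S101 (its functional expression is given below) can be obtained, for example, with a standard pretrained Chinese BERT. The following is a minimal sketch assuming the HuggingFace transformers library and the bert-base-chinese checkpoint; it is illustrative only and not the patent's own implementation.

```python
# Sketch of step S101's token encoding, assuming HuggingFace transformers and
# a Chinese BERT checkpoint (illustrative; not the patent's own code).
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def bert_encode(subtitle_text: str):
    # Tokenize the course subtitle text into [t_1, ..., t_n]; the tokenizer
    # adds [CLS] and [SEP] automatically.
    enc = tokenizer(subtitle_text, return_tensors="pt", return_offsets_mapping=True)
    with torch.no_grad():
        out = bert(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    hidden = out.last_hidden_state[0]            # (n + 2, h) including [CLS]/[SEP]
    h_b = hidden[1:-1]                           # h^b: the n token representations
    offsets = enc["offset_mapping"][0][1:-1]     # character span of each token
    return h_b, offsets
```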
In this embodiment, the functional expression for obtaining the BERT encoding h^b through the BERT encoding layer in step S101 is:
h^b = BERT([CLS], t_1, t_2, ..., t_n, [SEP]), h^b ∈ R^(n×h)
In the above formula, BERT denotes the BERT encoding model, [CLS] and [SEP] are the marker tokens for sentence start and separation, t_1 ~ t_n are the tokens in the token sequence, R^(n×h) denotes the dimensions of the representation, h is the hidden layer dimension of the BERT encoding model, n is the number of tokens in the token sequence, and the superscript b indicates that this is the original semantic representation of the BERT encoding model, distinguishing it from later enhanced representations such as h^r. The BERT encoding model is an existing encoding model; see Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-86. This embodiment only applies the encoding model and does not improve it, so its details are not described here.
One difficulty of the task is: which sub-regions of the text represent entities? Logically, this is a question that precedes considering which entity a sub-region represents and whether that entity is helpful to the reader. In other tasks and systems this step is accomplished by consulting the mention-entity table, but this also means those systems can only find the mentions enumerated in the mention-entity table. Compared with tasks and systems such as NER that discover new entities, the corpus of this embodiment directly provides the combined result of the two steps of discovery and screening, and the result of the discovery step cannot be analyzed separately. As a consequence, the score of a basic model trained directly on the corpus of this embodiment is low. This embodiment therefore proposes to use the matches against the mention-entity table as part of the features for an enhanced representation, so that the information in the mention-entity table is exploited while avoiding the problem that hard matching rules cannot produce entities outside the entity table. Specifically, concatenating the BERT encoding h^b with the dictionary attention encodings BE(e_(i-n)) in step S102 of this embodiment comprises:
S201, first, using the mention-entity table and the prior probabilities, finding, among all substrings of the token sequence [t_1, t_2, ..., t_n] that match the mention-entity table, the matching entity with the highest prior probability, and taking that prior probability as the link confidence; then screening the link confidences according to a preset threshold th_rl and selecting the mention-entity pairs whose confidence is greater than the threshold to obtain a mention list {(rs_i, re_i, e_i)}, where (rs_i, re_i) is the position information of the candidate entity e_i, rs_i is the start position of the candidate entity e_i and re_i is the end position of the candidate entity e_i. This mention list {(rs_i, re_i, e_i)} may have a relatively low precision and is not necessarily complete in recall, but through the representation enhancement mechanism it can provide the model with sufficient supplementary semantic information. The basic idea of the method of this embodiment is to add, for each item (rs_i, re_i, e_i), its position information (rs_i, re_i) and semantic information e_i to the representations of the tokens in the region (rs_i, re_i), so that when scoring the tokens the system can use this information to jointly consider the discovery and the screening of mentions, thereby achieving an enhanced representation (a sketch of this matching step and of the sequence construction of step S202 is given after step S204 below);
S202, splicing the mention list {(rs_i, re_i, e_i)} with the original token sequence [t_1, t_2, ..., t_n] into three sequences:
x^r = [t_1, t_2, ..., t_n, e_1, e_2, ...]
head^r = [1, 2, ..., n, rs_1, rs_2, ...]
tail^r = [1, 2, ..., n, re_1, re_2, ...]
In the above formulas, x^r denotes the token-and-entity sequence, head^r denotes the sequence of start positions of the elements of x^r in the original token sequence [t_1, t_2, ..., t_n], and tail^r denotes the sequence of end positions of the elements of x^r in the original token sequence [t_1, t_2, ..., t_n];
S203, using the start position sequence head^r and the end position sequence tail^r, computing, for any i-th element x^r_i and j-th element x^r_j of the token-and-entity sequence x^r, the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt), and computing the correlation R_ij between the i-th element x^r_i and the j-th element x^r_j of the token-and-entity sequence x^r;
S204, determining, based on the BERT encoding h^b and the dictionary attention encodings, the vector representation E_i of any i-th element x^r_i of the token-and-entity sequence x^r, and, combined with the correlation R_ij, determining the attention weight A_ij between the i-th element x^r_i and the j-th element x^r_j of the token-and-entity sequence x^r; performing attention weighting based on the attention weights A_ij to obtain the weighted features A, which serve as input to the Transformer layer.
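The following is a minimal sketch of the dictionary matching of step S201 and the sequence construction of step S202. The mention-entity table is assumed to be a dictionary mapping surface strings to (entity id, prior probability) pairs; the threshold value, the maximum span length and all names are illustrative assumptions, not the patent's own code.

```python
# Sketch of S201 (dictionary matching with prior-probability screening) and
# S202 (building x^r, head^r, tail^r); positions are 1-based and inclusive.
def match_mentions(tokens, mention_entity_table, th_rl=0.3, max_len=10):
    """Return the mention list [(rs, re, entity_id), ...]."""
    mentions = []
    n = len(tokens)
    for rs in range(1, n + 1):
        for re_ in range(rs, min(rs + max_len, n + 1)):
            surface = "".join(tokens[rs - 1:re_])
            candidates = mention_entity_table.get(surface, [])
            if not candidates:
                continue
            # Keep the matching entity with the highest prior probability and
            # use that prior as the link confidence.
            entity_id, prior = max(candidates, key=lambda c: c[1])
            if prior > th_rl:                       # screen by the threshold th_rl
                mentions.append((rs, re_, entity_id))
    return mentions

def build_lattice(n_tokens, mentions):
    """Splice tokens and matched entities into x^r, head^r and tail^r."""
    x_r = list(range(1, n_tokens + 1))              # token positions stand for t_1..t_n
    head_r = list(range(1, n_tokens + 1))
    tail_r = list(range(1, n_tokens + 1))
    for rs, re_, entity_id in mentions:             # append each candidate entity
        x_r.append(entity_id)
        head_r.append(rs)                           # start position rs_i
        tail_r.append(re_)                          # end position re_i
    return x_r, head_r, tail_r
```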
In this embodiment, the functional expressions for calculating the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt) in step S203 are:
d_ij^(ht) = head_i^r - tail_j^r
d_ij^(hh) = head_i^r - head_j^r
d_ij^(th) = tail_i^r - head_j^r
d_ij^(tt) = tail_i^r - tail_j^r
In the above formulas, head_i^r and tail_i^r are the elements of the start position sequence head^r and of the end position sequence tail^r corresponding to any i-th element x^r_i of the token-and-entity sequence x^r, and head_j^r and tail_j^r are the elements of the start position sequence head^r and of the end position sequence tail^r corresponding to any j-th element x^r_j of the token-and-entity sequence x^r.
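A minimal sketch of these four relative distances for a single (i, j) pair, assuming the head_r and tail_r lists built above (0-based indexing into those lists is used here purely for illustration):

```python
# Sketch of the relative distances of S203 for one pair of lattice elements.
def relative_distances(head_r, tail_r, i, j):
    d_ht = head_r[i] - tail_r[j]    # head-tail relative distance
    d_hh = head_r[i] - head_r[j]    # head-head relative distance
    d_th = tail_r[i] - head_r[j]    # tail-head relative distance
    d_tt = tail_r[i] - tail_r[j]    # tail-tail relative distance
    return d_ht, d_hh, d_th, d_tt
```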
In this embodiment, the calculation function expression of the correlation R_ij between the i-th element x^r_i and the j-th element x^r_j in step S203 is:
R_ij = ReLU(W_r [P(d_ij^(ht)) ⊕ P(d_ij^(hh)) ⊕ P(d_ij^(th)) ⊕ P(d_ij^(tt))])
In the above formula, ReLU denotes the ReLU activation function, W_r is a weight matrix for the token-and-entity sequence x^r, ⊕ denotes the concatenation operation, and P(d_ij^(ht)), P(d_ij^(hh)), P(d_ij^(th)) and P(d_ij^(tt)) are the results of encoding the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt) with the relative position encoding P. It should be noted that the relative position encoding P is an existing encoder; see Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." In Proc. of NeurIPS, 5998-6008. This embodiment only applies the encoding model and does not improve it, so its details are not described here.
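A minimal sketch of the correlation R_ij, assuming a sinusoidal relative position encoding P in the style of "Attention Is All You Need"; the dimensions and parameter names are illustrative assumptions, not the patent's fixed choices.

```python
# Sketch of R_ij = ReLU(W_r [P(d_ht) ⊕ P(d_hh) ⊕ P(d_th) ⊕ P(d_tt)]).
import math
import torch
import torch.nn as nn

def pos_encode(d: int, dim: int = 64) -> torch.Tensor:
    """Sinusoidal relative position encoding P of a (possibly negative) distance d."""
    pe = torch.zeros(dim)
    for k in range(0, dim, 2):
        freq = d / (10000 ** (k / dim))
        pe[k] = math.sin(freq)
        pe[k + 1] = math.cos(freq)
    return pe

class Correlation(nn.Module):
    def __init__(self, dim: int = 64, out_dim: int = 64):
        super().__init__()
        self.W_r = nn.Linear(4 * dim, out_dim)     # weight matrix W_r
        self.dim = dim

    def forward(self, d_ht: int, d_hh: int, d_th: int, d_tt: int) -> torch.Tensor:
        # Concatenate the four position encodings and apply ReLU(W_r [...]).
        p = torch.cat([pos_encode(d, self.dim) for d in (d_ht, d_hh, d_th, d_tt)])
        return torch.relu(self.W_r(p))             # R_ij
```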
The representation enhancement in this embodiment is in fact an improvement of the mechanism of the FLAT method (Li, Xiaonan, Hang Yan, Xipeng Qiu, and Xuanjing Huang. 2020. "FLAT: Chinese NER Using Flat-Lattice Transformer." In Proc. of ACL, 6836-42.). In the FLAT method, word information is added to the character representations for enhancement, using the matched character and word representations obtained by training as described in that literature. In the method of this embodiment, the representation of a token is the semantic representation h^b_i generated by BERT, and the entity representation uses the entity embedding method selected from the literature, denoted BE(e_(i-n)). This entity representation has two benefits: on the one hand, it is computed from the title and information of the entity using the same BERT model encoding, so it lies in the same space as the token representations; on the other hand, BE(e_(i-n)) remains independent and makes full use of the information on the knowledge base side. The functional expression of the vector representation E_i in step S204 of this embodiment is:
E_i = h_i^b if i ≤ n; E_i = BE(e_(i-n)) if i > n
In the above formula, h_i^b is the i-th encoding in the BERT encoding h^b, BE(e_(i-n)) is the (i-n)-th encoding among the dictionary attention encodings, e_(i-n) is the (i-n)-th candidate entity in the candidate entity sequence, i is the index of the i-th element x^r_i of the token-and-entity sequence x^r, n is the number of tokens in the token sequence, and x^r_i is any i-th element of the token-and-entity sequence x^r; and the calculation function expression of the attention weight A_ij is:
A_ij = E_i^T W_q^T W_{k,E} E_j + E_i^T W_q^T W_{k,R} R_ij + u^T W_{k,E} E_j + v^T W_{k,R} R_ij
In the above formula, W_q is a trainable weight matrix, E_i is the vector representation of the i-th element x^r_i of the token-and-entity sequence x^r, E_j is the vector representation of the j-th element x^r_j of the token-and-entity sequence x^r, W_{k,E} is a trainable weight matrix, R_ij is the correlation between the i-th element x^r_i and the j-th element x^r_j, W_{k,R} is a trainable weight matrix, u is a trainable weight vector and v is a trainable weight vector.
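The attention weight expression above follows the decomposed relative-attention style used by FLAT. A minimal sketch of computing the full score matrix A for a lattice of L elements is given below; all shapes and parameter names are illustrative assumptions.

```python
# Sketch of the lattice attention scores A_ij of S204 (single head, no scaling).
import torch
import torch.nn as nn

class LatticeAttentionScores(nn.Module):
    def __init__(self, d_model: int, d_rel: int):
        super().__init__()
        self.W_q = nn.Parameter(torch.randn(d_model, d_model) * 0.02)
        self.W_kE = nn.Parameter(torch.randn(d_model, d_model) * 0.02)
        self.W_kR = nn.Parameter(torch.randn(d_model, d_rel) * 0.02)
        self.u = nn.Parameter(torch.zeros(d_model))
        self.v = nn.Parameter(torch.zeros(d_model))

    def forward(self, E: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # E: (L, d_model) stacked vectors E_i; R: (L, L, d_rel) correlations R_ij.
        q = E @ self.W_q.T                                # W_q E_i for every i
        kE = E @ self.W_kE.T                              # W_kE E_j for every j
        kR = torch.einsum("mr,ijr->ijm", self.W_kR, R)    # W_kR R_ij -> (L, L, d_model)
        content = q @ kE.T                                # E_i^T W_q^T W_kE E_j
        position = torch.einsum("id,ijd->ij", q, kR)      # E_i^T W_q^T W_kR R_ij
        bias_content = kE @ self.u                        # u^T W_kE E_j, shape (L,)
        bias_position = kR @ self.v                       # v^T W_kR R_ij, shape (L, L)
        return content + position + bias_content.unsqueeze(0) + bias_position
```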
In this embodiment, step S102 comprises:
S301, concatenating the BERT encoding h^b with the dictionary attention encodings and inputting them into the Transformer layer to obtain the complete dictionary attention-enhanced representation H, wherein the functional expression of the complete dictionary attention-enhanced representation H_i of the i-th element x^r_i of the token-and-entity sequence x^r is:
H_i = [softmax(A) E W_v]_i
In the above formula, softmax denotes the softmax activation function, A is the matrix composed of the attention weights A_ij, E is the matrix composed of the vector representations E_i of all elements x^r_i of the token-and-entity sequence x^r, and W_v is a trainable weight matrix;
S302, for the complete dictionary attention-enhanced representations H_i of the elements x^r_i of the token-and-entity sequence x^r, taking out the first n items as the attention-enhanced representations h^r_i, and obtaining the attention-enhanced representation h^r composed of the taken-out representations h^r_i:
h^r = [h^r_1, h^r_2, ..., h^r_n] = [H_1, H_2, ..., H_n]
In the above formula, i is the index of the i-th element x^r_i of the token-and-entity sequence x^r, and n is the number of tokens in the token sequence.
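A minimal sketch of S301 and S302, given the score matrix A, the stacked representations E and a value projection W_v (illustrative names and shapes):

```python
# Sketch of S301-S302: attention-weight E with softmax(A) and keep the first
# n rows (the token positions) as the attention-enhanced representation h^r.
import torch

def attention_enhance(A: torch.Tensor, E: torch.Tensor, W_v: torch.Tensor, n: int):
    # A: (L, L) attention weights, E: (L, d) vector representations,
    # W_v: (d, d) trainable projection, n: number of tokens in the sequence.
    H = torch.softmax(A, dim=-1) @ E @ W_v   # complete dictionary attention-enhanced H
    h_r = H[:n]                              # first n items correspond to the tokens
    return h_r
```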
In this embodiment, in step S103, the functional expressions for obtaining the start score s_start, the end score s_end and the mention-interior score s_mention by linear classification in the linear classification layer are:
s_start(i) = w_start^T h^r_i
s_end(j) = w_end^T h^r_j
s_mention(k) = w_mention^T h^r_k
In the above formulas, s_start(i) is the predicted score of position i being the start position of a knowledge point, h^r_i is the attention-enhanced encoded representation at position i, s_end(j) is the predicted score of position j being the end position of a knowledge point, h^r_j is the attention-enhanced encoded representation at position j, s_mention(k) is the predicted score of position k being an interior component of a knowledge point, h^r_k is the attention-enhanced encoded representation at position k, and w_start, w_end and w_mention are trainable network parameters of the linear classification layer; and when the decoding layer obtains the knowledge point labeling result, the calculation function expression of the probability of any region (i, j) is:
p(i, j) = σ(s_start(i) + s_end(j) + Σ_{k=i}^{j} s_mention(k))
In the above formula, p(i, j) denotes the probability of the region (i, j) and σ denotes the sigmoid function; if the probability of the region (i, j) is greater than a set value, the region (i, j) is judged to be a knowledge point labeling region, thereby obtaining the knowledge point labeling result.
Because the data set is much smaller in both the number of sentences and the average sentence length, and it does not by itself provide the entity encodings required for the dictionary attention-based representation enhancement, the model's learning of the entity discovery task must be completed by first fine-tuning on a task associated with entity discovery. As shown in fig. 3, the method of this embodiment designs a two-stage fine-tuning scheme to complete the training task. The first step is data preparation: downloading a Chinese BERT pre-trained model, preparing a public Chinese entity discovery and linking task data set in advance, and arranging the knowledge point labeling data set for training course entity labeling. The second step is generic entity discovery and linking fine-tuning: the method of this embodiment uses the generic entity discovery and linking task for the first fine-tuning. A generic entity linking sample is shown in fig. 4; the task is to recognize and link all entities within the scope of the Baidu Baike encyclopedia in general-domain web page text, which is relatively close to the course entity labeling task of this text. The specific fine-tuning procedure is described in the ELQ literature (Li, Belinda Z., Sewon Min, Srinivasan Iyer, Yashar Mehdad, and Wen-tau Yih. 2020. "Efficient One-Pass End-to-End Entity Linking for Questions." In Proc. of EMNLP, 6433-41.); the training procedure of this stage is not within the scope of the claimed invention. This fine-tuning updates the parameters of the context encoder, the entity encoder and the linear classifier. The third step is course entity labeling training: the parameters of the context encoder, the entity encoder and the linear classifier are retained, the parameters of the Transformer layer are randomly initialized, and the training set of the knowledge point labeling data set is input for fine-tuning. The network parameters of each layer of the entity labeling model DsMOOC are updated with the back-propagation algorithm, and the trained entity labeling model DsMOOC is obtained after training is completed.
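A minimal sketch of this two-stage fine-tuning schedule is given below; the data set objects, loss functions and model attributes are illustrative assumptions and not the patent's actual training code.

```python
# Sketch of the two-stage fine-tuning: generic entity discovery and linking first,
# then course knowledge point labeling with a re-initialized Transformer layer.
def two_stage_finetune(model, generic_el_batches, knowledge_point_batches, make_optimizer):
    # Stage 1: generic entity discovery and linking fine-tuning (updates the
    # context encoder, entity encoder and linear classifier).
    opt = make_optimizer(model)
    for batch in generic_el_batches:
        loss = model.entity_linking_loss(batch)   # hypothetical loss hook
        loss.backward()
        opt.step()
        opt.zero_grad()

    # Stage 2: course entity labeling training. Keep the fine-tuned encoders and
    # classifier, randomly re-initialize the lattice Transformer layer, and
    # fine-tune on the knowledge point labeling training set.
    model.reset_transformer_layer()               # hypothetical re-initialization hook
    opt = make_optimizer(model)
    for batch in knowledge_point_batches:
        loss = model.knowledge_point_loss(batch)  # hypothetical loss hook
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model
```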
The text online learning resource knowledge point labeling method of this embodiment is further verified by experiments. In the experiments of this embodiment, a test set was constructed by manual labeling: for a sample of 12,440 sentences of online course subtitles, 6 volunteers labeled the course knowledge point entities. The training set, validation set and test set were randomly divided in the ratio 8:1:1. The final test set contains 1,244 examples, which include 1,956 entities. The existing methods compared with the entity labeling model DsMOOC of this embodiment include: WMP, a statistical dictionary matching method in the style of Wikipedia Miner (Milne and Witten 2008) combined with prior-probability screening, in which the dictionary is a mention-entity table built from Chinese Wikipedia statistics and the prior probability is the ratio of the number of documents in which a mention appears as anchor text to the total number of documents in which the mention appears in Wikipedia; this embodiment specifically uses the parameter combination WMP_recall that maximizes recall and the parameter combination WMP_best that maximizes the F1 value. MOOCCube (Yu et al. 2020) and MOOCCubeX (Yu et al. 2021): methods that match against an existing published concept library; these are existing research models for handling course concepts in video subtitles. Pan was not selected for comparison because the course concept library he published does not label all course concepts in the subtitles and its recall is too low. MOOCCube assigns the course concepts related to computer science courses to the computer science and technology category, so the performance of the MOOCCube (computer) sub-library is additionally reported. In addition, "direct training" refers to skipping the generic entity discovery and linking fine-tuning and training directly on the course knowledge point labeling data set on top of the pre-trained BERT model, and "first fine-tuning" refers to performing knowledge point labeling prediction with the model obtained only through the generic entity discovery and linking fine-tuning. The experimental results are shown in Table 1.
Table 1: Experimental results (precision, recall and F1 of the compared methods; the table itself is given as an image in the original publication).
Table 1 shows the performance of the method of this embodiment and of the existing methods on the course subtitle entity labeling task. Among the traditional literal-matching methods, the WMP method based on Chinese Wikipedia anchor-text statistics has an advantage in recall, and after threshold adjustment its precision also improves. Among the three existing concept libraries, the precision of the MOOCCube concept library in the computer field is considerable, but its recall is very low. In the general-subject comparison, the recall of MOOCCubeX is higher than that of MOOCCube by nearly 20 percent, reflecting the advantage of its broader sources, while MOOCCube's more precise scope gives it an advantage in precision. The F1 values of the three methods that directly match existing concept libraries lie between 16.05% and 21.69%, far lower than Wikipedia matching and the entity labeling model DsMOOC of this embodiment, which illustrates the infeasibility of the direct matching approach. The precision and recall of the model trained directly on the training set are low; its F1 value is only slightly higher than that of WMP_best, which is based on matching and statistical features. After the first fine-tuning, recall is significantly improved, even exceeding the maximum recall of the matching-based models. Through the second fine-tuning, precision and recall are further improved. Finally, the entity labeling model DsMOOC of this embodiment obtains the best performance on the course subtitle entity labeling task, with an F1 value of 53.40%. In summary, the text online learning resource knowledge point labeling method of this embodiment can automatically label knowledge points in text online learning resources and has the advantages of high precision and high recall.
In addition, the embodiment also provides a text online learning resource knowledge point labeling system, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the text online learning resource knowledge point labeling method. The present embodiment also provides a computer-readable storage medium having a computer program stored therein, the computer program being for programming or configuring by a microprocessor to perform the text-based online learning resource knowledge point labeling method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A text online learning resource knowledge point labeling method, characterized by comprising the following steps:
S101, tokenizing an input course subtitle text to obtain a token sequence [t_1, t_2, ..., t_n] and obtaining a BERT encoding h^b through a BERT encoding layer; performing dictionary matching between the input course subtitle text and a preset mention-entity table to obtain a candidate entity sequence [e_1, e_2, ...], and computing the dictionary attention encoding of each element therein using an entity encoder BE;
S102, concatenating the BERT encoding h^b with the dictionary attention encodings and inputting them into a Transformer layer to obtain an attention-enhanced representation h^r;
S103, inputting the attention-enhanced representation h^r into a linear classification layer for linear classification to obtain a start score s_start, an end score s_end and a mention-interior score s_mention, and inputting the start score s_start, the end score s_end and the mention-interior score s_mention into a decoding layer to obtain a knowledge point labeling result.
2. The text online learning resource knowledge point labeling method according to claim 1, wherein the functional expression for obtaining the BERT encoding h^b through the BERT encoding layer in step S101 is:
h^b = BERT([CLS], t_1, t_2, ..., t_n, [SEP]), h^b ∈ R^(n×h)
wherein BERT denotes the BERT encoding model, [CLS] and [SEP] are the marker tokens for sentence start and separation, t_1 ~ t_n are the tokens in the token sequence, R^(n×h) denotes the dimensions of the representation, h is the hidden layer dimension of the BERT encoding model, and n is the number of tokens in the token sequence.
3. The text online learning resource knowledge point labeling method according to claim 1, wherein concatenating the BERT encoding h^b with the dictionary attention encodings BE(e_(i-n)) in step S102 comprises:
S201, first, using the mention-entity table and the prior probabilities, finding, among all substrings of the token sequence [t_1, t_2, ..., t_n] that match the mention-entity table, the matching entity with the highest prior probability, and taking that prior probability as the link confidence; then screening the link confidences according to a preset threshold th_rl and selecting the mention-entity pairs whose confidence is greater than the threshold to obtain a mention list {(rs_i, re_i, e_i)}, where (rs_i, re_i) is the position information of the candidate entity e_i, rs_i is the start position of the candidate entity e_i and re_i is the end position of the candidate entity e_i;
S202, splicing the mention list {(rs_i, re_i, e_i)} with the original token sequence [t_1, t_2, ..., t_n] into three sequences:
x^r = [t_1, t_2, ..., t_n, e_1, e_2, ...]
head^r = [1, 2, ..., n, rs_1, rs_2, ...]
tail^r = [1, 2, ..., n, re_1, re_2, ...]
wherein x^r denotes the token-and-entity sequence, head^r denotes the sequence of start positions of the elements of x^r in the original token sequence [t_1, t_2, ..., t_n], and tail^r denotes the sequence of end positions of the elements of x^r in the original token sequence [t_1, t_2, ..., t_n];
S203, using the start position sequence head^r and the end position sequence tail^r, computing, for any i-th element x^r_i and j-th element x^r_j of the token-and-entity sequence x^r, the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt), and computing the correlation R_ij between the i-th element x^r_i and the j-th element x^r_j of the token-and-entity sequence x^r;
S204, determining, based on the BERT encoding h^b and the dictionary attention encodings, the vector representation E_i of any i-th element x^r_i of the token-and-entity sequence x^r, and, combined with the correlation R_ij, determining the attention weight A_ij between the i-th element x^r_i and the j-th element x^r_j of the token-and-entity sequence x^r; performing attention weighting based on the attention weights A_ij to obtain the weighted features A as input to the Transformer layer.
4. The text online learning resource knowledge point labeling method according to claim 3, wherein the functional expressions for calculating the head-tail relative distance d_ij^(ht), the head-head relative distance d_ij^(hh), the tail-head relative distance d_ij^(th) and the tail-tail relative distance d_ij^(tt) in step S203 are:
d_ij^(ht) = head_i^r - tail_j^r
d_ij^(hh) = head_i^r - head_j^r
d_ij^(th) = tail_i^r - head_j^r
d_ij^(tt) = tail_i^r - tail_j^r
wherein head_i^r and tail_i^r are the elements of the start position sequence head^r and of the end position sequence tail^r corresponding to any i-th element x^r_i of the token-and-entity sequence x^r, and head_j^r and tail_j^r are the elements of the start position sequence head^r and of the end position sequence tail^r corresponding to any j-th element x^r_j of the token-and-entity sequence x^r.
5. The text online learning resource knowledge point labeling method according to claim 3, wherein the calculation function expression of the relevance R_{ij} of the i-th element x^r_i and the j-th element x^r_j in step S203 is:

R_{ij} = ReLU(W_r (P_{d^{hh}_{ij}} ⊕ P_{d^{ht}_{ij}} ⊕ P_{d^{th}_{ij}} ⊕ P_{d^{tt}_{ij}}))

in the above formula, ReLU represents the ReLU activation function, W_r is the weight matrix for the word symbol and entity sequence x^r, ⊕ is the splicing operation, and P_{d^{hh}_{ij}}, P_{d^{ht}_{ij}}, P_{d^{th}_{ij}} and P_{d^{tt}_{ij}} respectively represent the results of encoding the head-head relative distance d^{hh}_{ij}, the head-tail relative distance d^{ht}_{ij}, the tail-head relative distance d^{th}_{ij} and the tail-tail relative distance d^{tt}_{ij} with the relative position encoding P.
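A sketch of the relevance R_{ij} of claim 5; the claim does not fix the concrete form of the relative position encoding P, so a sinusoidal encoding is assumed here, and the dimensionality and module names are illustrative choices.

import torch
import torch.nn as nn

def position_encoding(distances, dim):
    """Encode a matrix of signed relative distances into dim-dimensional sinusoidal features."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = distances.unsqueeze(-1).float() * inv_freq                 # (L, L, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (L, L, dim)

class Relevance(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.w_r = nn.Linear(4 * dim, dim, bias=False)                  # W_r over the spliced encodings

    def forward(self, d_hh, d_ht, d_th, d_tt):
        p = torch.cat([position_encoding(d, self.dim) for d in (d_hh, d_ht, d_th, d_tt)], dim=-1)
        return torch.relu(self.w_r(p))    # R_ij = ReLU(W_r(P_hh ⊕ P_ht ⊕ P_th ⊕ P_tt))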
6. The text online learning resource knowledge point labeling method according to claim 3, wherein the functional expression of the vector representation E_i in step S204 is:

E_i = E^{BERT}_i, if i ≤ n
E_i = e_{i-n}, if i > n

in the above formula, E^{BERT}_i is the i-th code in the BERT encoding E^{BERT}, e_{i-n} is the (i-n)-th code in the dictionary attention codes, namely the code of the (i-n)-th candidate entity in the candidate entity sequence, i is the index of the i-th element x^r_i in the word symbol and entity sequence x^r, and n is the number of characters in the word symbol sequence; and the calculation function expression of the attention weight a_{i,j} is:

a_{i,j} = E_i^T W_q^T W_{k,E} E_j + E_i^T W_q^T W_{k,R} R_{ij} + u^T W_{k,E} E_j + v^T W_{k,R} R_{ij}

in the above formula, W_q is a trainable weight matrix, E_i is the vector representation of the i-th element x^r_i of the word symbol and entity sequence x^r, E_j is the vector representation of the j-th element x^r_j of the word symbol and entity sequence x^r, W_{k,E} is a trainable weight matrix, R_{ij} is the relevance of the i-th element x^r_i and the j-th element x^r_j, W_{k,R} is a trainable weight matrix, and u and v are trainable weight vectors.
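A sketch of the attention weight a_{i,j} of claim 6 in the Transformer-XL style the formula suggests; E is assumed to be the L x d matrix of vector representations, R the L x L x d relevance tensor, and all class and variable names are illustrative.

import torch
import torch.nn as nn

class RelativeAttentionScore(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_q = nn.Parameter(torch.randn(d, d) * 0.02)
        self.W_kE = nn.Parameter(torch.randn(d, d) * 0.02)
        self.W_kR = nn.Parameter(torch.randn(d, d) * 0.02)
        self.u = nn.Parameter(torch.zeros(d))
        self.v = nn.Parameter(torch.zeros(d))

    def forward(self, E, R):
        q = E @ self.W_q.T                              # rows are W_q E_i
        k_e = E @ self.W_kE.T                           # rows are W_{k,E} E_j
        k_r = R @ self.W_kR.T                           # (L, L, d), entries W_{k,R} R_ij
        term1 = q @ k_e.T                               # E_i^T W_q^T W_{k,E} E_j
        term2 = torch.einsum('id,ijd->ij', q, k_r)      # E_i^T W_q^T W_{k,R} R_ij
        term3 = (k_e @ self.u)[None, :]                 # u^T W_{k,E} E_j, broadcast over i
        term4 = k_r @ self.v                            # v^T W_{k,R} R_ij
        return term1 + term2 + term3 + term4            # a_{i,j}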
7. The text online learning resource knowledge point labeling method according to claim 6, wherein step S102 comprises:

S301, the BERT encoding E^{BERT} and the dictionary attention encoding are spliced and then input into the Transformer layer to obtain the complete dictionary attention enhanced representation H, wherein the functional expression of the complete dictionary attention enhanced representation H_i of the i-th element x^r_i of the word symbol and entity sequence x^r is:

H_i = softmax(A) E W_v

in the above formula, softmax represents the softmax activation function, A is the matrix composed of the attention weights a_{i,j}, E is the matrix composed of the vector representations E_i of the elements x^r_i of the word symbol and entity sequence x^r, and W_v is a trainable weight matrix;

S302, from the complete dictionary attention enhanced representation H of the word symbol and entity sequence x^r, the first n items are taken out as the attention enhanced representations h_i, and the obtained representations h_i constitute the attention enhanced representation h^r:

h^r = [H_1, H_2, ..., H_n]

in the above formula, H_i is the complete dictionary attention enhanced representation of the i-th element x^r_i of the word symbol and entity sequence x^r, and n is the number of characters in the word symbol sequence.
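A sketch of the enhancement and truncation in claim 7, assuming A is the L x L matrix of attention weights, E the L x d matrix of vector representations, W_v a d x d value projection and n the number of characters; the shapes in the usage comment are illustrative.

import torch

def dictionary_attention_enhance(A, E, W_v, n):
    H = torch.softmax(A, dim=-1) @ E @ W_v   # H_i = softmax(A) E W_v
    h_r = H[:n]                              # keep the first n items (word symbol positions) as h^r
    return H, h_r

# Illustrative usage:
# L, d, n = 12, 64, 8
# H, h_r = dictionary_attention_enhance(torch.randn(L, L), torch.randn(L, d), torch.randn(d, d), n)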
8. The text online learning resource knowledge point labeling method according to claim 1, wherein in step S103, the functional expressions for performing linear classification by the linear classification layer to obtain the start score s_start, the end score s_end and the mention-internal score s_mention are:

s_start(i) = W_start h^r_i + b_start
s_end(j) = W_end h^r_j + b_end
s_mention(k) = W_mention h^r_k + b_mention

in the above formulas, s_start(i) is the probability of predicting position i as the start position of a knowledge point, h^r_i is the attention enhanced coded representation at position i, s_end(j) is the probability of predicting position j as the end position of a knowledge point, h^r_j is the attention enhanced coded representation at position j, s_mention(k) is the probability of predicting position k as an internal component of a knowledge point, h^r_k is the attention enhanced coded representation at position k, and W_start, W_end, W_mention, b_start, b_end and b_mention are trainable network parameters of the linear classification layer; when the decoding layer obtains the knowledge point labeling result, the calculation function expression of the probability of any region (i, j) is:
p(i, j) = σ(s_start(i) + s_end(j) + Σ_{k=i..j} s_mention(k))
in the above formula, p(i, j) represents the probability of the region (i, j), and σ represents the sigmoid function; if the probability of the region (i, j) is greater than a set value, the region (i, j) is judged to be a knowledge point labeling region, thereby obtaining the knowledge point labeling result.
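A sketch of the scoring and region decoding of claim 8 under explicit assumptions: three linear heads produce s_start, s_end and s_mention, the region score sums s_mention over the positions inside the region, and a sigmoid threshold selects knowledge point regions; the head names, the summation over the region interior and the threshold value are illustrative readings, not the claimed implementation.

import torch
import torch.nn as nn

class KnowledgePointDecoder(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.start = nn.Linear(d, 1)      # produces s_start
        self.end = nn.Linear(d, 1)        # produces s_end
        self.mention = nn.Linear(d, 1)    # produces s_mention

    def forward(self, h_r, set_value=0.5, max_span=20):
        s_start = self.start(h_r).squeeze(-1)       # (n,)
        s_end = self.end(h_r).squeeze(-1)           # (n,)
        s_mention = self.mention(h_r).squeeze(-1)   # (n,)
        regions = []
        n = h_r.size(0)
        for i in range(n):
            for j in range(i, min(i + max_span, n)):
                score = s_start[i] + s_end[j] + s_mention[i:j + 1].sum()
                if torch.sigmoid(score) > set_value:     # p(i, j) greater than the set value
                    regions.append((i, j))               # knowledge point labeling region
        return regions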
9. A text-based online learning resource knowledge point labeling system comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the text-based online learning resource knowledge point labeling method of any one of claims 1-8.
10. A computer readable storage medium having a computer program stored therein, wherein the computer program is configured or programmed to be executed by a microprocessor to perform the text-based online learning resource knowledge point labeling method of any one of claims 1-8.
CN202310188731.2A 2023-02-27 2023-02-27 Text online learning resource knowledge point labeling method, system and medium Pending CN116204607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310188731.2A CN116204607A (en) 2023-02-27 2023-02-27 Text online learning resource knowledge point labeling method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310188731.2A CN116204607A (en) 2023-02-27 2023-02-27 Text online learning resource knowledge point labeling method, system and medium

Publications (1)

Publication Number Publication Date
CN116204607A true CN116204607A (en) 2023-06-02

Family

ID=86507442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310188731.2A Pending CN116204607A (en) 2023-02-27 2023-02-27 Text online learning resource knowledge point labeling method, system and medium

Country Status (1)

Country Link
CN (1) CN116204607A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117435746A (en) * 2023-12-18 2024-01-23 广东信聚丰科技股份有限公司 Knowledge point labeling method and system based on natural language processing
CN117435746B (en) * 2023-12-18 2024-02-27 广东信聚丰科技股份有限公司 Knowledge point labeling method and system based on natural language processing

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN106328147B (en) Speech recognition method and device
CN106650943B (en) Auxiliary writing method and device based on artificial intelligence
CN111160031A (en) Social media named entity identification method based on affix perception
CN112487139B (en) Text-based automatic question setting method and device and computer equipment
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
KR101988165B1 (en) Method and system for improving the accuracy of speech recognition technology based on text data analysis for deaf students
CN114218379A (en) Intelligent question-answering system-oriented method for attributing questions which cannot be answered
CN116204607A (en) Text online learning resource knowledge point labeling method, system and medium
CN112382295A (en) Voice recognition method, device, equipment and readable storage medium
CN116320607A (en) Intelligent video generation method, device, equipment and medium
CN114611520A (en) Text abstract generating method
CN111091002A (en) Method for identifying Chinese named entity
CN114048335A (en) Knowledge base-based user interaction method and device
CN112800177B (en) FAQ knowledge base automatic generation method and device based on complex data types
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
CN111274354B (en) Referee document structuring method and referee document structuring device
Gehrmann et al. Improving human text comprehension through semi-Markov CRF-based neural section title generation
Suman et al. Gender Age and Dialect Recognition using Tweets in a Deep Learning Framework-Notebook for FIRE 2019.
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110750669A (en) Method and system for generating image captions
Zhu et al. YUN111@ Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Dravidian Code Mixed Text.
CN115859961A (en) Curriculum concept extraction method for admiration lessons
KasthuriArachchi et al. Deep learning approach to detect plagiarism in sinhala text
CN111090720B (en) Hot word adding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination