CN111950540A - Knowledge point extraction method, system, device and medium based on deep learning - Google Patents

Knowledge point extraction method, system, device and medium based on deep learning

Info

Publication number
CN111950540A
Authority
CN
China
Prior art keywords
knowledge
mask
bert
model
training model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010720576.0A
Other languages
Chinese (zh)
Inventor
黄昌勤
朱佳
吴志杰
韩中美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Zhejiang Normal University CJNU
Original Assignee
South China Normal University
Zhejiang Normal University CJNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University, Zhejiang Normal University CJNU filed Critical South China Normal University
Priority to CN202010720576.0A priority Critical patent/CN111950540A/en
Publication of CN111950540A publication Critical patent/CN111950540A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The invention discloses a deep learning-based knowledge point extraction method, system, device and medium. The method comprises the following steps: acquiring an original data set by a web crawler method and/or an OCR text recognition method; preprocessing the original data set to obtain knowledge representation data; determining a Bert pre-training model according to the knowledge representation data; optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method; and extracting knowledge points from the original data set with the optimized Bert model. The Bert pre-training model removes the need for complicated feature engineering and resolves word ambiguity; by optimizing the Bert model, the invention extracts knowledge points more accurately and can be widely applied in the technical field of deep learning.

Description

Knowledge point extraction method, system, device and medium based on deep learning
Technical Field
The invention relates to the technical field of deep learning, and in particular to a deep learning-based knowledge point extraction method, system, device and medium.
Background
Both offline textbook knowledge and online classroom knowledge (e.g., MOOCs) contain a great deal of redundant information, most of which is irrelevant to knowledge points. The common practices in the industry for dealing with this are as follows:
1. Keyword extraction
Keywords are extracted from educational text data and treated as important knowledge points. This is in essence a keyword extraction problem, whose core lies in constructing word features (part of speech, word frequency, and the like). However, this approach is ill-suited to domain-specific educational data; for mathematics texts in particular, the special symbols they contain, such as formulas, introduce a large amount of noise that limits the recognition performance of the model.
2. Text representation and text classification
Text representation:
Text representation is essentially language modeling. There are currently two main approaches: the auto-regressive model (AR) and the auto-encoding model (AE).
Representative AR models include ELMo, based on a bidirectional LSTM, and GPT, based on the Transformer. The drawback of the AR approach is that context is not fully exploited: the underlying Bayesian network is acyclic, i.e. the words form a one-way chain read either left-to-right or right-to-left, so only the probability conditioned on the preceding words can be computed. This motivated the other modeling idea, AE.
The AE model can make full use of context and be trained without supervision on large amounts of data; in essence it compresses the data into low-dimensional features and then recovers them with a decoder. Representative models are Word2vec and Bert. The former predicts the surrounding words from a given word, but its parameters are stored statically and cannot be adjusted dynamically; the latter uses a bidirectional Transformer and introduces the word mask method, training the model on a cloze-style fill-in-the-blank task to predict hidden words from their context.
Text classification:
there are currently 3 major solutions to the text classification problem.
The first is a manual approach based on rule matching, i.e. an expert system. Matching rules are established manually in advance, the text is matched against them, and if a rule matches a preset pattern the text is assigned to the corresponding category. This is the most straightforward and simple method, but its disadvantage is equally obvious: writing the rules takes a great deal of manual effort.
The second is machine learning based on feature engineering. Common classifiers include naive Bayes, the support vector machine (SVM), logistic regression, the K-nearest-neighbor algorithm and decision trees. However, machine learning requires complicated feature engineering, i.e. reducing or raising the dimension of the original data so that it fits the problem at hand, which consumes a large amount of labor.
The third is the deep learning approach. Deep learning addresses the difficulty of text representation by mapping text into a real-valued space and learning feature representations through training of an end-to-end network, removing complicated feature engineering and making text classification more practical. For domain-specific text classification, such as the mathematics domain considered here, existing work is mainly based on traditional machine learning and neural network methods. For example, one line of work classified elementary mathematics problems with an SVM, using TF-IDF features based on word-frequency statistics; another built an elementary mathematics knowledge point labeling system based on an LSTM network with an attention mechanism. In terms of text representation, however, such work either relies on manual feature engineering or uses the static parameters of a Word2vec pre-trained model, which cannot be adjusted dynamically to new context data and handles polysemous text poorly.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, an apparatus, and a medium for extracting knowledge points based on deep learning, so as to improve accuracy of knowledge point extraction.
The invention provides a knowledge point extraction method based on deep learning, which comprises the following steps:
acquiring an original data set by a crawler method and/or an OCR text recognition method;
preprocessing the original data set to acquire knowledge representation data;
determining a Bert pre-training model according to the knowledge characterization data;
optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method;
and extracting knowledge points in the original data set according to the optimized Bert model.
In some embodiments, determining a Bert pre-trained model based on the knowledge characterization data comprises:
constructing a Bert pre-training model according to the knowledge representation data;
and fine-tuning the Bert pre-training model through preset parameters based on a transfer learning technology, and determining the fine-tuned Bert pre-training model.
In some embodiments, the constructing a Bert pre-training model according to the knowledge characterization data specifically includes: constructing a Bert pre-training model according to an output result of the mask language model and a prediction result of a next sentence;
wherein the mask language model is constructed by the steps of:
calculating a first loss of the knowledge representation data through cross entropy;
calculating a first softmax layer according to the first loss;
inputting the data of the first softmax layer into a full-connection network to obtain masked mask characters;
determining a sequence output result according to the mask characters;
constructing a mask language model according to the sequence output result;
the step of predicting the next sentence comprises:
calculating a second loss of the knowledge characterization data through binary cross entropy loss;
calculating a second softmax layer according to the second loss;
inputting the data of the second softmax layer into a cls unit;
and predicting the next sentence according to the output result of the cls unit.
In some embodiments, the optimizing the Bert pre-training model according to a dynamic mask method and a hybrid mask method includes:
converting mass pre-training corpora into a word embedding layer;
determining a parameter matrix of the word embedding layer;
decomposing the parameter matrix into a first matrix and a second matrix;
converting the words in the word embedding layer into one-hot codes;
performing dimension compression on the one-hot code through the first matrix to obtain a compression result;
and performing dimension recovery on the compression result through the second matrix.
In some embodiments, the optimizing the Bert pre-training model according to a dynamic mask method and a hybrid mask method further includes: and constructing the mask language model by a dynamic mask method and a mixed mask method.
In some embodiments, in the step of determining the Bert pre-trained model according to the knowledge characterization data, a learning rate of the Bert pre-trained model is dynamically adjusted in a linear manner.
In some embodiments, the step of determining the Bert pre-trained model according to the knowledge characterization data further includes updating the weights of the Bert pre-trained model by a gradient accumulation method, and clipping the weights by a gradient clipping method.
The second aspect of the present invention provides a knowledge point extraction system based on deep learning, including:
the acquisition module is used for acquiring an original data set by a crawler method and/or an OCR text recognition method;
the preprocessing module is used for preprocessing the original data set to acquire knowledge representation data;
the pre-training model building module is used for determining a Bert pre-training model according to the knowledge representation data;
the pre-training model optimization module is used for optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method;
and the extraction module is used for extracting the knowledge points in the original data set according to the optimized Bert model.
A third aspect of the invention provides an apparatus comprising a processor and a memory;
the memory is used for storing programs;
the processor is adapted to perform the method according to the first aspect of the invention according to the program.
A fourth aspect of the invention provides a storage medium storing a program for execution by a processor to perform the method according to the first aspect of the invention.
According to the invention, the Bert pre-training model removes the need for complicated feature engineering and resolves word ambiguity; by optimizing the Bert model, the invention extracts knowledge points more accurately.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating the overall steps of an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a Bert pre-training model according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a Bert pre-training process according to an embodiment of the present invention;
FIG. 4 is a graph of the linear learning-rate schedule according to an embodiment of the present invention.
Detailed Description
The invention will be further explained and illustrated with reference to the drawings and the embodiments in the description. The step numbers in the embodiments of the present invention are set for convenience of illustration only; the order of the steps is not limited, and the execution order of each step in the embodiments can be adjusted according to the understanding of those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a knowledge point extraction method based on deep learning, including:
acquiring an original data set by a crawler method and/or an OCR text recognition method;
preprocessing the original data set to acquire knowledge representation data;
determining a Bert pre-training model according to the knowledge characterization data;
optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method;
and extracting knowledge points in the original data set according to the optimized Bert model.
It should be noted that the OCR technology mentioned in this embodiment refers to detecting and recognizing characters on printed or handwritten paper material as computer characters. OCR is widely used in everyday scenarios such as extracting ticket information, digitizing test paper titles, and digitizing form data. The OCR framework used in this application consists of two parts: text region detection (CTPN) and end-to-end character recognition (CRNN).
It should be noted that the present application implements data crawling with the Selenium automated testing engine. Selenium is a web browser automation engine built on a JavaScript-based framework; it is mainly used to script a web driver so as to control a browser automatically and perform various operations. Selenium currently supports Python 3 as well as major mainstream browsers such as Google Chrome, Firefox and IE, so webpage elements can be crawled with the Selenium engine.
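By way of illustration, a minimal Python sketch of such crawling follows; the URL and the CSS selector are placeholder values, not details taken from the application, and a locally available Chrome driver is assumed.

# Minimal Selenium crawling sketch (Python 3, Selenium 4 API).
# The URL and the ".question-text" selector are illustrative placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")                       # run the browser without a window
driver = webdriver.Chrome(options=options)               # assumes a Chrome driver is installed
driver.get("https://example.com/math-problems")          # placeholder URL
elements = driver.find_elements(By.CSS_SELECTOR, ".question-text")
raw_texts = [el.text for el in elements]                 # collect the raw problem statements
driver.quit()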
In some embodiments, determining a Bert pre-trained model based on the knowledge characterization data comprises:
constructing a Bert pre-training model according to the knowledge representation data;
and fine-tuning the Bert pre-training model through preset parameters based on a transfer learning technology, and determining the fine-tuned Bert pre-training model.
Specifically, the pre-trained model of this embodiment is based on transfer learning. Transfer learning means that a base model is trained and saved on a large amount of data, and is then further trained, with only slight modification, on the data set of a specific problem. In natural language processing, transfer learning assumes that language has broad commonality, so a model that captures the language can be trained on a large corpus. Because the data in a downstream task is still natural language and shares the characteristics of language, the neural network only needs to be fine-tuned. Transfer learning removes the end-to-end constraint, so the model parameters do not have to be trained from scratch; it is also well suited to training in low-compute settings, because fine-tuning costs far less computation than training from scratch.
Model fine-tuning means starting from a relatively general model with good generalization ability and continuing to train it, adding a small number of parameters as required by the specific task. The proposal of BERT in 2018 brought natural language processing into the era of pre-training and fine-tuning. BERT is a pre-training model based on a bidirectional Transformer structure; its main framework is shown in FIG. 2, where Trm is an abbreviation of Transformer, E is a word vector, and T_n is the n-th output.
In some embodiments, the constructing a Bert pre-training model according to the knowledge characterization data specifically includes: constructing a Bert pre-training model according to an output result of the mask language model and a prediction result of a next sentence;
wherein the mask language model is constructed by the steps of:
calculating a first loss of the knowledge representation data through cross entropy;
calculating a first softmax layer according to the first loss;
inputting the data of the first softmax layer into a full-connection network to obtain masked mask characters;
determining a sequence output result according to the mask characters;
constructing a mask language model according to the sequence output result;
the step of predicting the next sentence comprises:
calculating a second loss of the knowledge characterization data through binary cross entropy loss;
calculating a second softmax layer according to the second loss;
inputting the data of the second softmax layer into a cls unit;
and predicting the next sentence according to the output result of the cls unit.
This embodiment describes pre-training with the Bert pre-training model. Pre-training is a form of multi-task training and is divided into two parts overall; a schematic diagram is shown in FIG. 3.
1. Mask Language Model (MLM)
Masking means "blocking out" some words in the text; in practice they are typically replaced with the [MASK] character, which from the model's point of view makes them "disappear". During pre-training, BERT adopts a cloze-like MLM method: words in the text are masked at random and the model is trained to predict them, thereby learning the regularities of the language. In contrast to the left-to-right prediction of RNN-type networks, randomly masking the text subtly brings in contextual information, because all words other than the masked ones remain visible. For example, the text "know that triangle ABC is a right triangle" becomes "know that [MASK] ABC is a [MASK] triangle" after masking.
In practice, because adding [MASK] creates a mismatch between the pre-training corpus and the downstream task corpus, only 80% of the selected words are actually replaced by [MASK]; 10% remain unchanged, and 10% are randomly replaced with other words.
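As an illustration of the 80%/10%/10% rule above, a minimal Python sketch follows; the function name and the simplified vocabulary handling are assumptions made for the example, not details of the application.

import random

def apply_mlm_mask(tokens, vocab, mask_prob=0.15):
    # Select roughly 15% of tokens; of those, 80% become [MASK], 10% stay, 10% become a random word.
    masked_tokens = list(tokens)
    labels = [None] * len(tokens)            # prediction targets at the masked positions
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                  # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                masked_tokens[i] = "[MASK]"
            elif r < 0.9:
                masked_tokens[i] = tok       # keep the token unchanged
            else:
                masked_tokens[i] = random.choice(vocab)  # replace with a random word
    return masked_tokens, labels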
2. Next Sentence Prediction (NSP)
Random masking trains the model well within a single sentence, but language also has inter-sentence relationships, as in question answering, sentence-level inference, and so on. The designers of BERT therefore added the NSP task to learn relationships between sentences. The data set is organized into sentence pairs, denoted (a, b), where in 50% of the pairs b is the actual continuation of a and in the other 50% it is not; the goal is for the model to predict whether sentence b follows sentence a. NSP is essentially a simple binary classification problem.
These two subtasks make up BERT pre-training. When computing the loss, MLM uses cross entropy over the prediction at each [MASK] position, NSP uses binary cross entropy (BCE), and the total loss of a single training step is the sum of the two task losses.
The BERT input also contains special characters. BERT adds a [CLS] character at the beginning of the text and a separator [SEP] after each sentence. Because it sits in the first position of the whole sequence, [CLS] serves to represent the entire text, similar to the intermediate semantic vector of an RNN-type structure. Downstream tasks such as classification or generation can then be fine-tuned by attaching a simple network on top of [CLS].
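A minimal sketch of this input layout follows; the function name and the segment-id convention shown are illustrative assumptions.

def build_bert_input(sentence_a_tokens, sentence_b_tokens):
    # Assemble one sentence pair in the [CLS] ... [SEP] ... [SEP] layout described above.
    tokens = ["[CLS]"] + sentence_a_tokens + ["[SEP]"] + sentence_b_tokens + ["[SEP]"]
    # Segment ids: 0 for sentence A (including [CLS] and its [SEP]), 1 for sentence B.
    segment_ids = [0] * (len(sentence_a_tokens) + 2) + [1] * (len(sentence_b_tokens) + 1)
    return tokens, segment_ids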
In some embodiments, the optimizing the Bert pre-training model according to a dynamic mask method and a hybrid mask method includes:
converting mass pre-training corpora into a word embedding layer;
determining a parameter matrix of the word embedding layer;
decomposing the parameter matrix into a first matrix and a second matrix;
converting the words in the word embedding layer into one-hot codes;
performing dimension compression on the one-hot code through the first matrix to obtain a compression result;
and performing dimension recovery on the compression result through the second matrix.
In some embodiments, the optimizing the Bert pre-training model according to a dynamic mask method and a hybrid mask method further includes: and constructing the mask language model by a dynamic mask method and a mixed mask method.
Specifically, when the Bert pre-training model performs masking, typically 15% of the words are selected, of which 80% are replaced by [MASK], 10% are left unchanged, and 10% are replaced by random words. This mask pattern is kept until training is complete, so every round predicts the same [MASK] positions; this is called a static mask. A dynamic mask method is therefore needed to give the model better generality.
The dynamic mask copies the data k times in advance and independently applies a random mask to each copy, producing k differently masked versions of the data (k = 10 in this application). The benefit is that the amount of pre-training data is increased and the MLM task generalizes better.
In addition, the n-gram mask improves on the original mask method: the original BERT mask randomly masks words that are often far apart. Language, however, contains meaning groups, i.e. several consecutive words can express a single concept or topic; for example, in the phrase "the area of the known right triangle is 6", the words "right triangle" together express one concept. The n-gram mask therefore masks n consecutive words, which is in essence another enhancement of the MLM task.
The BERT implementation of the present invention is described in detail below, taking ALBERT as an example:
Because BERT pre-training is based on a very large corpus, the dictionary size (denoted V) is huge, which makes the parameter count of the word embedding layer enormous. Assuming the word vector dimension is H, the word embedding parameter matrix has size H × V. In fact, when data flows into the model, each word is first converted into a one-hot vector and multiplied by this matrix; because the one-hot vector is almost entirely zeros, most of this computation is wasted. In other words, the real word embedding layer performs a great deal of invalid computation. Why not compress this part of the parameters and allocate them elsewhere? The designers of ALBERT proposed decomposing the matrix into two matrices: the original H × V matrix becomes an H × E matrix and an E × V matrix. When computing a word vector, the word is converted to a one-hot code, its dimension is compressed by the first matrix, and then restored by the second. Although this compression may cost some performance, the loss proves minimal, and the GPU memory saved far outweighs the small degradation.
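A minimal PyTorch sketch of this factorization is given below; the class and parameter names are illustrative, and E is assumed to be much smaller than H.

import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    # ALBERT-style factorized word embedding: V -> E -> H instead of a direct V -> H lookup.
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # the V x E "compression" table
        self.project = nn.Linear(embed_dim, hidden_dim)         # the E x H "recovery" projection

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.word_embed(token_ids))

# Parameter count: V*E + E*H instead of V*H for the unfactorized embedding layer.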
In addition, cross-layer parameter sharing is used to further reduce the number of BERT parameters. In an ordinary Transformer encoder, the output of one block is the input of the next; the blocks have the same structure but different parameters. Sharing the parameters turns the serial structure into a recurrent one: because there is effectively only one block, the parameter count drops to 1/N of the original, where N is the total number of blocks. This idea is essentially a trade-off between the depth and width of the neural network: with the same parameter budget, ALBERT with shared parameters can widen a single layer, whereas the original BERT without sharing effectively deepens the network. Greater width lets the network learn richer features, while greater depth helps the model learn more complex distributions. There is as yet no consensus in academia on whether width or depth matters more, but many experiments suggest that width benefits model performance more than depth.
Sentence Order Prediction (SOP) is also used. BERT trains the model to learn language features with the next-sentence prediction subtask, but this method was not ablated in the original paper, which has led to questions in academia about its effectiveness. ALBERT instead adopts sentence order prediction. Unlike the construction used by BERT, ALBERT does not randomly draw other text to form the next sentence; it simply swaps the order of two adjacent sentences and trains a classifier to predict whether the pair is in the original order. This task is an enhancement of next-sentence prediction and showed a considerable improvement in comparative experiments.
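A minimal sketch of such pair construction follows; the 50/50 split and the function name are assumptions made for illustration.

import random

def make_sop_pair(sentence_a: str, sentence_b: str):
    # Build one SOP training example: label 1 if the pair keeps its original order, else 0.
    if random.random() < 0.5:
        return (sentence_a, sentence_b), 1   # positive: original order
    return (sentence_b, sentence_a), 0       # negative: the same two adjacent sentences, swapped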
In some embodiments, in the step of determining the Bert pre-trained model according to the knowledge characterization data, a learning rate of the Bert pre-trained model is dynamically adjusted in a linear manner.
In some embodiments, the step of determining the Bert pre-trained model according to the knowledge characterization data further includes updating the weights of the Bert pre-trained model by a gradient accumulation method, and clipping the weights by a gradient clipping method.
The Bert pre-training model as modified with the dynamic mask and hybrid mask is described in detail below:
the present application pre-trains (pre-training) the ALBERT language model based on a crawled dataset. Because the text in the mathematical domain often has the characteristic of concentrated distribution of core words, and a domain proper noun often consists of a plurality of characters. If the conventional random Mask approach is followed, the model is less sensitive to these words because the occluded words are only segments of these words. In this regard, the present application modifies the pre-trained MLM task into a dynamic Mask + hybrid Mask form, using the dynamic Mask scheme of RoBERTa and the continuous Mask scheme of BERT n-gram as references. The specific implementation details are as follows:
Dynamic mask: 10 copies of the corpus are made, and a random mask or an n-gram mask is applied to each copy, giving 10 differently masked versions. Over N training epochs, each copy is trained for N/10 epochs.
Hybrid mask: among the 10 copies produced by the dynamic mask, 70% of the corpus receives a random mask and 30% receives an n-gram mask, where n is a random value between 2 and 4. Of the masked positions, 80% are replaced by [MASK], 10% remain unchanged, and 10% are replaced with random words.
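A minimal Python sketch combining the dynamic mask and the hybrid mask is given below; the simplified random mask (which omits the 80/10/10 replacement rule) and the function names are assumptions made for the example.

import random

def random_mask(tokens, mask_prob=0.15):
    # Simplified token-level random mask: every selected token becomes [MASK].
    return [("[MASK]" if random.random() < mask_prob else t) for t in tokens]

def ngram_mask(tokens, n_low=2, n_high=4, mask_prob=0.15):
    # Mask runs of n consecutive tokens until roughly mask_prob of the sentence is covered.
    masked = list(tokens)
    budget, covered = max(1, int(len(tokens) * mask_prob)), 0
    while covered < budget:
        n = random.randint(n_low, n_high)
        start = random.randrange(0, max(1, len(tokens) - n + 1))
        for i in range(start, min(start + n, len(tokens))):
            masked[i] = "[MASK]"
        covered += n
    return masked

def build_dynamic_hybrid_copies(corpus, k=10, ngram_ratio=0.3):
    # Dynamic mask: k independently masked copies; about 30% of sentences get an n-gram mask.
    copies = []
    for _ in range(k):
        copies.append([ngram_mask(s) if random.random() < ngram_ratio else random_mask(s)
                       for s in corpus])     # corpus: list of token lists
    return copies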
The following describes the specific implementation process of the present invention in detail by taking the extraction process of mathematical knowledge points as an example:
the automatic extraction task of the mathematical knowledge points is a multi-label classification problem of an education domain NLP, and the aim of the automatic extraction task is to train a model to select a plurality of knowledge points related to a mathematical subject from a known knowledge point set. The symbols referred to in this example are defined below by table 1:
TABLE 1 (symbol definitions; presented as an image in the original publication)
The mathematical knowledge point automatic extraction task (hereinafter abbreviated KE) of the present embodiment is defined as follows: for a mathematics problem Q(ID, S) and a known set of knowledge points K, the task of KE is to find a mapping f such that f(S) = K*, where K* ⊆ K is the subset of knowledge points relevant to the problem.
First, when training a large model, the random initialization of the weights makes the gradients computed early in training large; drastic weight changes easily cause the model to overfit a small batch of data, making gradient descent very unstable. Although the model can correct this later in training, the early instability still affects the performance of the model after final convergence.
Learning-rate warm-up is a training strategy for improving model performance. In this embodiment, the model learns at a small learning rate for a certain number of initial steps, and the preset learning rate is restored once that number of steps is reached. The learning rate is adjusted dynamically in a linear manner: the first 10% of the steps are warm-up steps, and for the remaining steps the learning rate decreases linearly.
As shown in FIG. 4, in the learning-rate schedule adopted in this embodiment, the total number of pre-training steps is 200,000, of which 10%, i.e. 20,000 steps, serve as warm-up steps. During the warm-up stage the learning rate increases linearly toward 2e-5; after the warm-up stage it decreases linearly as the number of training steps grows.
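A minimal sketch of this schedule follows; the function name is illustrative, and the step counts and peak learning rate are the values stated above.

def linear_warmup_decay_lr(step: int, total_steps: int = 200_000,
                           warmup_ratio: float = 0.1, peak_lr: float = 2e-5) -> float:
    # Learning rate rises linearly to peak_lr over the warm-up steps, then decays linearly to 0.
    warmup_steps = int(total_steps * warmup_ratio)       # 20,000 steps in this embodiment
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))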
Gradient accumulation is used when GPU memory cannot support training with a large batch size. The main idea is to trade time for space: the weights are updated only after the network has computed a certain number of steps, which is equivalent to training on a larger batch at a time. The algorithm flow is as follows:
1. Feed forward through the network and compute the loss of the current step k;
2. Divide the loss by a preset number of gradient accumulation steps ga;
3. Call backward() to back-propagate and compute the gradient;
4. Check whether k % ga == 0: if true, update the weights and then reset the gradients; if false, return to step 1.
For a deeper network, large initial weights easily produce large gradients; after multiplication across many layers these can overflow the range the computer can represent, appearing as NaN values in Python, a condition known as gradient explosion. To keep the computation stable, large gradients need to be clipped: when a gradient exceeds a certain threshold, it is set to that threshold. In PyTorch, model gradients can be clipped with the clip_grad_norm_() function.
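A minimal PyTorch sketch combining gradient accumulation with clip_grad_norm_() follows; the model is assumed to follow a HuggingFace-style interface that returns an object with a .loss attribute, and the accumulation step count and clipping threshold are illustrative values.

import torch

def train_with_accumulation(model, dataloader, optimizer, ga_steps=4, clip_norm=1.0):
    # Accumulate gradients over ga_steps mini-batches, clip them, then update the weights.
    model.train()
    optimizer.zero_grad()
    for k, (inputs, targets) in enumerate(dataloader, start=1):
        loss = model(inputs, labels=targets).loss        # 1. feed forward, loss of current step k
        (loss / ga_steps).backward()                     # 2-3. divide by ga and back-propagate
        if k % ga_steps == 0:                            # 4. every ga steps:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)  # clip large gradients
            optimizer.step()                             #    update the weights
            optimizer.zero_grad()                        #    reset the gradients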
The multi-label classification task solved by the invention has extremely unbalanced labels: under each label the number of negative examples far exceeds the number of positive examples. Under such an unbalanced distribution the model's sensitivity to positive examples is low; predicting every sample as negative yields high accuracy but obviously defeats the purpose of knowledge point extraction.
Solutions to data imbalance operate mainly at two levels: the data level and the algorithm level.
At the data level, resampling or undersampling is generally used to balance the labels by adding or removing data. Resampling increases the number of samples in an under-represented class by sampling it repeatedly, but may cause the model to overfit that class; undersampling discards samples from over-represented classes, but then the existing data is not fully used, and for the small data set of this embodiment every sample is precious.
Work at the algorithm level focuses mainly on improving sensitivity to sparse classes, addressing label imbalance by adjusting sample weights, strengthening the penalty on sparse samples, and similar methods.
This embodiment improves on the standard binary cross entropy function, introducing a modulation coefficient γ and a weight factor α on top of the original cross entropy. For a sample with label y and predicted probability p of the positive class, the loss takes the form
FL(p, y) = -α · y · (1 - p)^γ · log(p) - (1 - α) · (1 - y) · p^γ · log(1 - p),
where the coefficient γ, applied to the predicted value p, increases the attention paid to hard-to-classify samples. For an easily distinguished sample, p is large, so (1 - p)^γ is small and the sample's contribution to the overall loss shrinks; conversely, for a class that is harder to predict, (1 - p)^γ is large, so the model optimizes its parameters more toward the error at that data point. The value α reflects the proportion of positive examples in the data and is introduced to adjust the sample weights, giving more attention to the rarer class.
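A minimal PyTorch sketch of this weighted loss follows; gamma = 2 and alpha = 0.25 are illustrative defaults, and in practice alpha would be set from the proportion of positive examples as described above.

import torch

def focal_bce_loss(pred_prob: torch.Tensor, target: torch.Tensor,
                   gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    # Binary cross entropy with the modulation coefficient gamma and weight factor alpha.
    eps = 1e-7
    p = pred_prob.clamp(eps, 1 - eps)                    # predicted probability of the positive class
    pos = -alpha * (1 - p) ** gamma * torch.log(p)       # term for positive examples
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p)   # term for negative examples
    return (target * pos + (1 - target) * neg).mean()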
The extraction results of the method of the present invention are verified as follows:
in general, we refer to the data labels in the binary problem as positive (positive) and negative (negative). For example, for the knowledge point of "triangle", the sample containing the label is a positive example, and the sample not containing is a negative example. The prediction made by a binary model on a sample has two possibilities-true and false. There are thus four possibilities for each sample-true positive, true negative, false positive, false negative, which together constitute the chaotic matrix, and a series of indices are derived.
Because the classifier used in this application is a probabilistic model based on logistic regression, its predictions are output as probabilities, and converting a probability into a 0/1 prediction requires setting a threshold manually. As noted above, for data sets with unbalanced labels it is not enough to look only at accuracy or recall, because they do not reflect the model's sensitivity to sparse classes: predicting everything as the dense class can still achieve very high accuracy. The ROC (receiver operating characteristic) curve jointly reflects the continuous variation of sensitivity and specificity. It is plotted with FPR on the abscissa and TPR on the ordinate, sweeping the threshold θ and generating a pair of (FPR, TPR) coordinates for each θ.
The AUC is defined as the Area Under the ROC Curve; its value typically lies in [0.5, 1], and the larger the value, the better the classifier. AUC equals the probability that, for a randomly drawn positive example and negative example, the model scores the positive example higher than the negative one. The larger the AUC, the better the model distinguishes positive from negative examples, i.e. the better its performance. In Python, the AUC of a classification result can be computed conveniently with the roc_auc_score function of the sklearn package.
The AUC metric is not affected by label imbalance, and it evaluates the classifier better than accuracy or recall.
Because the task solved here is a multi-label classification task, it is implemented as a set of binary classifiers. An AUC value can therefore be computed for each label, and the AUC-macro value, the arithmetic mean of all per-label AUC values, is used to measure the overall performance of the model.
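A minimal sketch of the AUC-macro calculation with sklearn follows; the function name is illustrative, and each label column is assumed to contain both classes (otherwise roc_auc_score raises an error).

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_macro(y_true: np.ndarray, y_score: np.ndarray) -> float:
    # Arithmetic mean of the per-label AUC values of a multi-label problem.
    # y_true: (n_samples, n_labels) binary matrix; y_score: predicted probabilities of the same shape.
    aucs = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
    return float(np.mean(aucs))

# Equivalently: roc_auc_score(y_true, y_score, average="macro")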
The pre-training in this embodiment continues from Google's open-source ALBERT-tiny pre-trained weights. After 60 rounds of training on the data set, the total loss drops from 4.5175 to 0.12, the MLM loss from 3.8242 to 0.1, and the SOP loss from 0.69 to 0.02; the MLM validation-set accuracy rises from 0.3889 to 0.9713 and the SOP validation-set accuracy from 0.6497 to 0.9867.
Fine-tuning in this embodiment improves validation-set performance by using a smaller learning rate. After several attempts, the best results were obtained with an initial learning rate of 2.5e-5 and a decay coefficient of 0.9.
In addition, to verify the improvement brought by further pre-training, this embodiment compares Google's original open-source ALBERT-tiny model with the model after 25 additional rounds of pre-training on the data set. Each pre-trained model was fine-tuned for 6 rounds on the data set, and the AUC-macro of its validation set was observed.
The hybrid mask builds on the dynamic mask: 30% of the copies use the n-gram mask and 70% use the random mask, so that the model better learns words that occur in succession. To verify the effectiveness of the hybrid mask, this embodiment compares the pre-training performance of the ALBERT model on hybrid-masked versus randomly masked corpora, as well as the fine-tuned performance of the two.
Verification shows that on the MLM task the model learns better under the random mask, while on the SOP task the two are identical. However, although the random mask yields higher MLM validation-set accuracy, this only means the task is easier for the model; it does not show that the model has learned more complex language features. This was confirmed in the comparative fine-tuning experiment.
The embodiment of the invention also provides a knowledge point extraction system based on deep learning, which comprises the following steps:
the acquisition module is used for acquiring an original data set by a crawler method and/or an OCR text recognition method;
the preprocessing module is used for preprocessing the original data set to acquire knowledge representation data;
the pre-training model building module is used for determining a Bert pre-training model according to the knowledge representation data;
the pre-training model optimization module is used for optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method;
and the extraction module is used for extracting the knowledge points in the original data set according to the optimized Bert model.
The embodiment of the invention also provides a device, which comprises a processor and a memory;
the memory is used for storing programs;
the processor is configured to perform the method as described above according to the program.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a program, and the program is executed by a processor to complete the method.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A knowledge point extraction method based on deep learning is characterized by comprising the following steps:
acquiring an original data set by a crawler method and/or an OCR text recognition method;
preprocessing the original data set to acquire knowledge representation data;
determining a Bert pre-training model according to the knowledge characterization data;
optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method;
and extracting knowledge points in the original data set according to the optimized Bert model.
2. The method for extracting knowledge points based on deep learning of claim 1, wherein the determining a Bert pre-training model according to the knowledge characterization data comprises:
constructing a Bert pre-training model according to the knowledge representation data;
and fine-tuning the Bert pre-training model through preset parameters based on a transfer learning technology, and determining the fine-tuned Bert pre-training model.
3. The knowledge point extraction method based on deep learning according to claim 2, wherein the construction of the Bert pre-training model according to the knowledge characterization data specifically comprises: constructing a Bert pre-training model according to an output result of the mask language model and a prediction result of a next sentence;
wherein the mask language model is constructed by the steps of:
calculating a first loss of the knowledge representation data through cross entropy;
calculating a first softmax layer according to the first loss;
inputting the data of the first softmax layer into a full-connection network to obtain masked mask characters;
determining a sequence output result according to the mask characters;
constructing a mask language model according to the sequence output result;
the step of predicting the next sentence comprises:
calculating a second loss of the knowledge characterization data through binary cross entropy loss;
calculating a second softmax layer according to the second loss;
inputting the data of the second softmax layer into a cls unit;
and predicting the next sentence according to the output result of the cls unit.
4. The method for extracting knowledge points based on deep learning of claim 3, wherein the optimization of the Bert pre-training model according to the dynamic mask method and the hybrid mask method comprises:
converting mass pre-training corpora into a word embedding layer;
determining a parameter matrix of the word embedding layer;
decomposing the parameter matrix into a first matrix and a second matrix;
converting the words in the word embedding layer into one-hot codes;
performing dimension compression on the one-hot code through the first matrix to obtain a compression result;
and performing dimension recovery on the compression result through the second matrix.
5. The method of claim 4, wherein the Bert pre-training model is optimized according to a dynamic mask method and a hybrid mask method, and further comprising: and constructing the mask language model by a dynamic mask method and a mixed mask method.
6. The method as claimed in claim 2, wherein in the step of determining the Bert pre-trained model according to the knowledge characterization data, a linear manner is adopted to dynamically adjust the learning rate of the Bert pre-trained model.
7. The method as claimed in claim 6, wherein the step of determining the Bert pre-trained model according to the knowledge characterization data further comprises updating the weights of the Bert pre-trained model by a gradient accumulation method, and clipping the weights by a gradient clipping method.
8. A knowledge point extraction system based on deep learning, comprising:
the acquisition module is used for acquiring an original data set by a crawler method and/or an OCR text recognition method;
the preprocessing module is used for preprocessing the original data set to acquire knowledge representation data;
the pre-training model building module is used for determining a Bert pre-training model according to the knowledge representation data;
the pre-training model optimization module is used for optimizing the Bert pre-training model according to a dynamic mask method and a mixed mask method;
and the extraction module is used for extracting the knowledge points in the original data set according to the optimized Bert model.
9. An apparatus comprising a processor and a memory;
the memory is used for storing programs;
the processor is configured to perform the method according to the program as claimed in any one of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a program, which is executed by a processor to perform the method according to any one of claims 1-7.
CN202010720576.0A 2020-07-24 2020-07-24 Knowledge point extraction method, system, device and medium based on deep learning Pending CN111950540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010720576.0A CN111950540A (en) 2020-07-24 2020-07-24 Knowledge point extraction method, system, device and medium based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010720576.0A CN111950540A (en) 2020-07-24 2020-07-24 Knowledge point extraction method, system, device and medium based on deep learning

Publications (1)

Publication Number Publication Date
CN111950540A true CN111950540A (en) 2020-11-17

Family

ID=73340984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010720576.0A Pending CN111950540A (en) 2020-07-24 2020-07-24 Knowledge point extraction method, system, device and medium based on deep learning

Country Status (1)

Country Link
CN (1) CN111950540A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528650A (en) * 2020-12-18 2021-03-19 恩亿科(北京)数据科技有限公司 Method, system and computer equipment for pretraining Bert model
CN112800748A (en) * 2021-03-30 2021-05-14 平安科技(深圳)有限公司 Phoneme prediction method, device and equipment suitable for polyphone and storage medium
CN112948603A (en) * 2021-03-08 2021-06-11 北方自动控制技术研究所 Transportation delivery knowledge question-answering method based on transfer learning
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113112113A (en) * 2021-02-24 2021-07-13 华南师范大学 Learning strategy generation method, system, device and storage medium
CN113128232A (en) * 2021-05-11 2021-07-16 济南大学 Named entity recognition method based on ALBERT and multi-word information embedding
CN113569016A (en) * 2021-09-27 2021-10-29 北京语言大学 Bert model-based professional term extraction method and device
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning
CN114743440A (en) * 2022-04-29 2022-07-12 长沙酷得网络科技有限公司 Intelligent programming training environment construction method and device based on application disassembly
CN116665308A (en) * 2023-06-21 2023-08-29 石家庄铁道大学 Double interaction space-time feature extraction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130106695A1 (en) * 2011-10-31 2013-05-02 Elwha LLC, a limited liability company of the State of Delaware Context-sensitive query enrichment
CA2930618A1 (en) * 2016-05-20 2017-11-20 Tse-Kin Tong Knowledge management system
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110222188A (en) * 2019-06-18 2019-09-10 深圳司南数据服务有限公司 A kind of the company's bulletin processing method and server-side of multi-task learning
CN110472136A (en) * 2019-07-04 2019-11-19 微民保险代理有限公司 Method for pushing, device, storage medium and the computer equipment of query result

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130106695A1 (en) * 2011-10-31 2013-05-02 Elwha LLC, a limited liability company of the State of Delaware Context-sensitive query enrichment
CA2930618A1 (en) * 2016-05-20 2017-11-20 Tse-Kin Tong Knowledge management system
CN110032648A (en) * 2019-03-19 2019-07-19 微医云(杭州)控股有限公司 A kind of case history structuring analytic method based on medical domain entity
CN110222188A (en) * 2019-06-18 2019-09-10 深圳司南数据服务有限公司 A kind of the company's bulletin processing method and server-side of multi-task learning
CN110472136A (en) * 2019-07-04 2019-11-19 微民保险代理有限公司 Method for pushing, device, storage medium and the computer equipment of query result

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YINHAN LIU: ""RoBERTa: A Robustly Optimized BERT Pretraining Approach"", 《ARXIV》 *
ZHANG R: ""Rapid Adaptation of BERT for Information Extraction on Domain-Specific Business Documents"", 《ARXIV》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528650B (en) * 2020-12-18 2024-04-02 恩亿科(北京)数据科技有限公司 Bert model pre-training method, system and computer equipment
CN112528650A (en) * 2020-12-18 2021-03-19 恩亿科(北京)数据科技有限公司 Method, system and computer equipment for pretraining Bert model
CN113112113A (en) * 2021-02-24 2021-07-13 华南师范大学 Learning strategy generation method, system, device and storage medium
CN112948603B (en) * 2021-03-08 2023-05-05 北方自动控制技术研究所 Transport delivery knowledge question-answering method based on transfer learning
CN112948603A (en) * 2021-03-08 2021-06-11 北方自动控制技术研究所 Transportation delivery knowledge question-answering method based on transfer learning
CN113094482A (en) * 2021-03-29 2021-07-09 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN113094482B (en) * 2021-03-29 2023-10-17 中国地质大学(北京) Lightweight semantic intelligent service adaptation training evolution method and system
CN112800748A (en) * 2021-03-30 2021-05-14 平安科技(深圳)有限公司 Phoneme prediction method, device and equipment suitable for polyphone and storage medium
CN113128232A (en) * 2021-05-11 2021-07-16 济南大学 Named entity recognition method based on ALBERT and multi-word information embedding
CN113569016A (en) * 2021-09-27 2021-10-29 北京语言大学 Bert model-based professional term extraction method and device
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning
CN114743440A (en) * 2022-04-29 2022-07-12 长沙酷得网络科技有限公司 Intelligent programming training environment construction method and device based on application disassembly
CN116665308A (en) * 2023-06-21 2023-08-29 石家庄铁道大学 Double interaction space-time feature extraction method
CN116665308B (en) * 2023-06-21 2024-01-23 石家庄铁道大学 Double interaction space-time feature extraction method

Similar Documents

Publication Publication Date Title
CN111950540A (en) Knowledge point extraction method, system, device and medium based on deep learning
US11227121B2 (en) Utilizing machine learning models to identify insights in a document
Du et al. On attribution of recurrent neural network predictions via additive decomposition
CN111966786B (en) Microblog rumor detection method
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111738003A (en) Named entity recognition model training method, named entity recognition method, and medium
CN108664512B (en) Text object classification method and device
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
Markou et al. Ex machina lex: Exploring the limits of legal computability
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Soyalp et al. Improving text classification with transformer
CN113221569A (en) Method for extracting text information of damage test
Sabban et al. Automatic analysis of insurance reports through deep neural networks to identify severe claims
US11941360B2 (en) Acronym definition network
CN112035629B (en) Method for implementing question-answer model based on symbolized knowledge and neural network
Jeyakarthic et al. Optimal bidirectional long short term memory based sentiment analysis with sarcasm detection and classification on twitter data
Yang et al. Generation-based parallel particle swarm optimization for adversarial text attacks
CN115730656A (en) Out-of-distribution sample detection method using mixed unmarked data
CN114579761A (en) Information security knowledge entity relation connection prediction method, system and medium
Capela et al. A deep learning model for detection of traffic events based on social networks publications
CN114996424B (en) Weak supervision cross-domain question-answer pair generation method based on deep learning
CN116562284B (en) Government affair text automatic allocation model training method and device
CN115114915B (en) Phrase identification method, device, equipment and medium
Irtiza Towards hidden backdoor attacks on natural language processing models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination