CN112926340B - Semantic matching model for knowledge point positioning - Google Patents


Info

Publication number
CN112926340B
CN112926340B
Authority
CN
China
Prior art keywords
corpus
topic
model
fragments
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110319217.9A
Other languages
Chinese (zh)
Other versions
CN112926340A (en
Inventor
吴亦珂
吴天星
李林
高超禹
漆桂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110319217.9A priority Critical patent/CN112926340B/en
Publication of CN112926340A publication Critical patent/CN112926340A/en
Application granted granted Critical
Publication of CN112926340B publication Critical patent/CN112926340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic matching model for knowledge point positioning, mainly intended to locate the knowledge points behind examination questions in the electrical field. The method first preprocesses the original teaching material to form a corpus. The corpus and the question are then encoded with the statistics-based semantic matching models TF-IDF, LSI and LDA. A deep-learning semantic matching model is added to strengthen deep semantic understanding by encoding with BERT. For each of the four encodings, cosine similarity is calculated as the measure of semantic similarity. Finally, a voting-based semantic matching ensemble model selects a user-specified number of teaching material fragments as the final knowledge point positioning result, according to how often each fragment ranks near the top and its cosine similarity.

Description

Semantic matching model for knowledge point positioning
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a semantic matching model for knowledge point positioning.
Background
Semantic matching is an important and fundamental problem in natural language processing and can be applied to a large number of NLP tasks, such as information retrieval, question answering systems, duplicate question detection, dialogue systems and machine translation; to a large extent these tasks can be abstracted as semantic matching problems. The quality of the semantic matching model therefore has a large impact on the effectiveness of the end application.
Traditional semantic matching techniques include algorithms such as BoW, VSM, TF-IDF, BM25, Jaccard and SimHash. For example, the BM25 algorithm computes a matching score between a query and a document according to how well the document covers the query terms; the higher the score, the better the document matches the query. These methods mainly solve matching or similarity at the lexical level, but they are often limited by word senses, by sentence structure and by the lack of background knowledge. Topic models can also be used for semantic matching: sentences are mapped to a low-dimensional continuous space of equal length, so that similarity can be computed implicitly in a latent semantic space. LSI, LDA and PLSA, for example, provide compact semantic representations of text that are convenient to operate on and thus compensate well for the shortcomings of traditional lexical matching; in practice, however, they cannot replace lexical matching and serve only as an effective supplement to it. In recent years, increasingly popular neural networks have played an important role in semantic matching in the form of deep semantic matching models. A neural network can encode text content as vectors built from trained word embeddings and match semantic similarity by computing the cosine similarity of those vectors, thereby mining deeper semantic information; however, neural networks have poor interpretability and are prone to problems such as semantic drift. Each of these three families of semantic matching models has its own strengths and weaknesses, so achieving more effective semantic matching with an appropriate combination of methods is of great research significance.
The semantic matching model for knowledge point positioning proposed here is mainly aimed at examination scenarios in the electrical field: it automatically performs semantic understanding of the question stem and retrieves the corresponding knowledge points from the teaching material. The model takes several semantic matching models into account: it adopts TF-IDF for traditional semantic matching, uses LSI and LDA to add topic-level information, strengthens the understanding of deep semantics through a BERT neural network, and finally applies a voting-based semantic matching ensemble model to select a certain number of candidate results (specified by the user) as the sources of the knowledge points.
Disclosure of Invention
Technical problems: the invention provides a semantic matching model for knowledge point positioning that automatically captures the question stem, judges the correlation between the stem and the paragraphs of different sections of a teaching material, and selects the most relevant teaching material paragraphs as the matching knowledge points. The invention combines a statistics-based semantic matching method with a deep-learning-based correlation judgment method. The statistics-based method comprises three encoding modes: TF-IDF, LSI and LDA. A deep learning model based on BERT encoding is added to handle out-of-vocabulary words, perform deeper semantic understanding and support fuzzy semantic matching. Finally, a voting-based semantic matching ensemble model selects the K (user-specified) highest-ranked paragraphs as candidate results of knowledge point positioning.
The technical scheme is as follows: the invention builds on several semantic matching methods. First, a statistics-based semantic matching method is adopted: the question stem and the teaching material are preprocessed, stop words are removed, the material is segmented into words, and the semantic similarity between the stem and the teaching material is matched with the TF-IDF, LSI and LDA models. Then, a deep-learning-based correlation judgment method is adopted: the original corpus is preprocessed so that it meets the input requirements of the BERT neural network, and the similarity is calculated with BERT. Finally, the results obtained under the four encoding modes are sorted by the number of top-ranked occurrences and by the similarity values, and the K highest-ranked fragments are selected as candidate results of knowledge point positioning.
The invention relates to a semantic matching model for knowledge point positioning, which comprises the following steps:
1) For the given teaching material in the electrical field, dividing the material by paragraph to facilitate knowledge point positioning, and recording the chapter information and page number of each paragraph;
2) Removing stop words and meaningless words from the teaching material, segmenting the remaining text into words to form a corpus, and constructing a dictionary;
3) For any specified topic and the processed corpus, calculating the TF-IDF values of the topic and all corpus fragments according to the dictionary constructed in 2), encoding the corpus and the topic with the statistics-based semantic matching methods TF-IDF, LSI and LDA respectively, and then calculating the cosine similarity between the encoding of the topic and the encodings of all corpus fragments as their semantic similarity;
4) Constructing a training set from the teaching material fragments processed in step 1), and training the BERT neural network:
4-1) considering together the cosine similarities obtained by the three methods in 3), selecting the corpus pairs with the highest similarity as positive examples;
4-2) randomly selecting sentences from different chapters as negative examples;
5) Training the BERT neural network with the data set from 4), using the trained model to encode the corpus fragments and the question bank, and using cosine similarity as the semantic similarity between a topic and the corpus fragments;
6) With the voting-based semantic matching ensemble model, selecting K fragments as the knowledge point positioning result according to how often each corpus fragment ranks near the top under the four encoding modes and its similarity.
In a preferred scheme of the semantic matching model method for knowledge point positioning, the statistics-based semantic matching method in step 3) encodes the corpus and the topics as follows:
3-1) forming a dictionary from the word segmentation results of step 2), and for each corpus fragment and each topic calculating the TF-IDF value of every word in the dictionary to obtain the corresponding encoding;
3-2) using the TF-IDF encodings of the corpus fragments and topics obtained in 3-1) as column vectors to form a TF-IDF matrix;
then performing topic-model analysis with LDA to obtain the probability that each corpus fragment belongs to each topic, and taking these probability values as an encoding;
3-3) LSI encoding, specifically decomposing the TF-IDF matrix obtained in 3-2) with SVD, that is:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is also an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of the matrix V^T are selected as the encodings of the corpus fragments;
3-4) for the three encoding modes, calculating the cosine similarity between the topic to be solved and the teaching material fragments to measure their similarity.
In a preferred scheme of the semantic matching model method for knowledge point positioning, the encoding by the BERT neural network in step 5) proceeds as follows:
5-1) removing stop words and meaningless words from the teaching material fragments and topics obtained in 1);
5-2) segmenting the teaching material fragments and topics into words and encoding them as token ids;
5-3) feeding the resulting ids into the model and taking the encoding output by the BERT neural network as the result.
In a preferred scheme of the semantic matching model method for knowledge point positioning, the voting-based semantic matching ensemble model in step 6) determines the final knowledge point positioning result as follows:
6-1) the user sets the number K of knowledge point positions required; for each of the four encoding modes, the K teaching material fragments with the largest cosine similarity are taken as candidate set 1;
6-2) the total number of times each teaching material fragment occurs in candidate set 1 is counted, and the K fragments with the most occurrences are selected from the candidate set as candidate set 2 (fragments tied on the occurrence count are all kept, so the result may exceed K);
6-3) within candidate set 2, the fragments are sorted by occurrence count from large to small; for fragments with the same count, the sum of their cosine similarities under the four encoding modes is calculated, and the higher the similarity, the higher the rank;
6-4) the first K fragments are taken as the final matching result.
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
Compared with most current semantic matching models, the greatest advantage of the method is that it considers several semantic matching models together and draws on the strengths of each, realizing the comprehensive use of multiple semantic matching modes. First, traditional TF-IDF encoding performs text matching at the level of shallow semantics. Second, the TF-IDF model is suitably extended by introducing topic models: LSI increases the model's understanding of topics so that more important information receives attention, and LDA infers latent topics so that the model gains a deeper grasp of the subject matter, helping it understand the semantics more profoundly. Moreover, the BERT encoding mode performs semantic matching at a still deeper level, avoids dependence on the quality of word segmentation, and handles out-of-vocabulary words. Finally, the ensemble model combines the four encoding modes from the two aspects of occurrence count and cosine similarity and screens out the final result. Most current semantic matching models use only one of these approaches, or focus on one of them, and cannot realize semantic understanding and matching from shallow to deep.
The model is simple and convenient to use. The construction of the dictionary and the TF-IDF values of the teaching material can be completed offline, and the BERT encodings of the teaching material can also be calculated in advance. When the knowledge point positioning of a specified question is calculated, only the TF-IDF encoding vector of the question needs to be calculated for TF-IDF; for LDA and LSI, the calculated question TF-IDF vector is appended to the TF-IDF matrix to obtain the encodings of the question and the corpus under those two models. For the BERT neural network, the question is suitably preprocessed and fed directly into the network to obtain its encoding. Once the encodings are available, the cosine similarities between the question and the corpus fragments can be calculated to obtain the final result. In other words, most of the heavy computation can be completed in advance, and the remaining per-question work is light enough to be carried out online. The model can therefore be conveniently deployed on a server and has a strong application prospect.
The model not only locates knowledge points for problems in the question bank but also supports fuzzy queries and the input of new questions. For topics outside the question bank, the BERT model handles out-of-vocabulary words and thus achieves better semantic matching.
Drawings
FIG. 1 is a schematic view of the overall framework of the present invention;
FIG. 2 is a flow chart of corpus processing in the present invention;
FIG. 3 is a schematic diagram of the statistics-based semantic matching model in the present invention;
FIG. 4 is a schematic diagram of the voting-based semantic matching ensemble model in the present invention.
Detailed Description
The invention is described in more detail below with reference to examples and accompanying drawings.
The invention relates to a semantic matching model method for knowledge point positioning, which comprises the following 6 steps:
1) For the given teaching material in the electrical field, divide the material by paragraph to facilitate knowledge point positioning, and record the chapter information and page number of each paragraph.
(1) If a sentence were chosen as the retrieval granularity, the corpus would be excessively large, the computational complexity high and the response time long; if a whole page were chosen, positioning would be imprecise and cover too wide a range. To locate knowledge points well, the paragraph is an appropriate granularity, i.e. each paragraph of text is treated as one corpus fragment;
(2) The chapter information of each paragraph can be obtained from the page-number division or with a regular expression;
(3) The storage structure may use json; an example entry looks like the following (a sketch of how such a structure might be assembled is given after the example):

    {
        "professional basic theory 1.1-1.2, page 1 part 1": {
            "Chapter number": "Chapter 1 Professional basic theory",
            "Page number": 1,
            "Section number": "Section 1 Analysis and calculation of single-phase alternating current circuits",
            "Text": "Alternating current that varies with time according to a sinusoidal law is called sinusoidal alternating current."
        }
    }
2) Remove stop words and meaningless words from the teaching material, segment the remaining text into words to form a corpus, and construct a dictionary.
(1) The original teaching material contains a large number of stop words and meaningless words that not only occupy extra storage space but also harm the running efficiency and the final matching quality, such as punctuation marks, function words and chapter markers like "Chapter 1" and "Chapter 2". They can be removed with regular expression matching or with a stop-word list.
(2) The processed corpus still cannot be understood directly by the model; it must first be segmented, i.e. each sentence is split into words, which can be done directly with the jieba segmenter (see the sketch below).
(3) During word segmentation, every new word is added to the dictionary the first time it appears, which completes the construction of the dictionary.
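A minimal sketch of steps (1) to (3), using jieba as mentioned above; the stop-word set shown here is only illustrative, and a full Chinese stop-word file would be loaded in practice:

    import jieba

    # Illustrative stop-word set; a complete stop-word list is used in practice.
    STOP_WORDS = {"的", "了", "是", "在", "，", "。", "第一章", "第二章"}

    def tokenize(text):
        """Segment a fragment or question with jieba and drop stop words and empty tokens."""
        return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

    def build_dictionary(corpus_texts):
        """Assign an integer id to every word the first time it appears, as in step (3)."""
        dictionary = {}
        for text in corpus_texts:
            for word in tokenize(text):
                dictionary.setdefault(word, len(dictionary))
        return dictionary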
3) For any specified topic and the processed corpus, calculate the TF-IDF values of the topic and all corpus fragments according to the dictionary constructed in 2), and encode the corpus and the topic with the statistics-based semantic matching methods TF-IDF, LSI and LDA respectively. Then calculate the cosine similarity between the encoding of the topic and the encodings of all corpus fragments as their semantic similarity.
The detailed steps of statistics-based semantic matching are described here with reference to FIG. 3:
(1) Calculation of TF-IDF value
TF-IDF is a statistical method used to evaluate how important a word is to a document in a document set or corpus. Specifically, TF is the term frequency, i.e. how often a given word appears in the document: the more often a word appears, the more important it is. IDF is the inverse document frequency and reflects the discriminative power of a word: if few documents contain a particular term, its IDF is large, indicating that the term distinguishes categories well, which to a certain extent reflects the importance of the word to the corpus.
The TF is calculated as
TF(n) = count(n) / m
where n is the word in question, count(n) is the number of occurrences of n in the corpus fragment, and m is the total number of words in the fragment.
The IDF is calculated as
IDF(n) = lg(D / N)
where lg denotes the base-10 logarithm, D is the total number of documents in the corpus, and N is the number of documents containing n.
TF-IDF(n) = TF(n) × IDF(n)
Thus, for each corpus fragment, the TF-IDF value of every dictionary word appearing in that fragment can be calculated, yielding the TF-IDF encoding of each fragment.
For a topic, stop words and words without practical meaning are removed first and the text is segmented, after which its TF-IDF encoding is computed in the same way.
(2) Construction of TF-IDF matrix
For each corpus fragment and for the question, the TF-IDF encoding is used as a column vector to form a TF-IDF matrix. In practice, since the teaching material does not change, the TF-IDF encoding of each corpus fragment can be calculated offline in advance; when a question whose knowledge points need to be located is input, the question is TF-IDF encoded and appended to the matrix. A sketch covering both the TF-IDF values and the matrix assembly follows:
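A minimal sketch that directly implements TF(n) = count(n)/m and IDF(n) = lg(D/N) as defined above; the function and variable names (tf_idf_vector, doc_freq, the precomputed document frequencies) are illustrative assumptions:

    import math
    import numpy as np

    def tf_idf_vector(tokens, dictionary, doc_freq, num_docs):
        """Encode one tokenized fragment (or question) over the dictionary using
        TF(n) = count(n) / m and IDF(n) = lg(D / N) as defined above."""
        vec = np.zeros(len(dictionary))
        m = len(tokens)
        for word in set(tokens):
            if word in dictionary and doc_freq.get(word, 0) > 0 and m > 0:
                tf = tokens.count(word) / m
                idf = math.log10(num_docs / doc_freq[word])
                vec[dictionary[word]] = tf * idf
        return vec

    # doc_freq[word]: number of fragments containing the word (assumed precomputed);
    # stacking the fragment encodings as columns gives the TF-IDF matrix A used below:
    # A = np.column_stack([tf_idf_vector(t, dictionary, doc_freq, D) for t in tokenized_corpus])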
(3) LSI coding
LSI stands for latent semantic indexing. Its basic idea is to project high-dimensional documents into a low-dimensional latent semantic space, so dimensionality reduction is the most important step of LSI analysis: by reducing the dimension, "noise", i.e. irrelevant information, is removed from the documents, and the semantic structure of the text gradually emerges. Compared with the traditional vector space, the latent semantic space has a smaller dimension and clearer semantic relations. LSI obtains the topics of a text through singular value decomposition (SVD); specifically, SVD can decompose any matrix into the product of three matrices:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is also an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix. Among these three matrices, U captures the association between words and topics and V the association between texts and topics, so from the decomposition result the column vectors of V^T are taken as the LSI encodings of the corpus fragments.
In our model, encoding with LSI therefore extends the TF-IDF encoding vectors: through singular value decomposition, the encoding gains an understanding of topic information and can focus more on the main information. Since LSI is a topic model, the choice of the number of topics strongly influences the final result: it should not be too large, otherwise redundant concepts are introduced, nor too small, otherwise not all information is retained. Preferably it is judged and tuned manually according to the actual corpus. This encoding step takes the TF-IDF matrix of step (2) as input, fixes the number of topics, and uses the resulting V matrix as the LSI encoding vectors of the corpus fragments and of the question to be solved.
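A minimal sketch of the LSI step, under the assumption that the TF-IDF matrix A stores one fragment (or the appended question) per column; the truncation to num_topics singular vectors corresponds to the manually chosen number of topics:

    import numpy as np

    def lsi_codes(tfidf_matrix, num_topics):
        """Truncated SVD of A = U Σ V^T; the top-num_topics rows of V^T hold,
        column by column, the LSI code of each fragment."""
        U, s, Vt = np.linalg.svd(tfidf_matrix, full_matrices=False)
        return Vt[:num_topics, :]           # shape: (num_topics, number of columns of A)

    # codes = lsi_codes(A, num_topics=50)   # 50 topics is an illustrative choice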
(4) LDA coding
LDA can give the topics of each document in a document set in the form of probability distributions; by analysing a batch of documents and extracting the topic distribution of each one, topic clustering or text classification can then be performed according to those distributions. It is also a typical bag-of-words model, i.e. a document is treated as a set of words with no ordering relation between them. Like LSI, LDA belongs to the topic models, but compared with LSI, LDA places sparse Dirichlet priors on the topic distribution of each document and on the word distribution of each topic; these two priors allow the LDA model to characterise the document-topic-word relationship better than LSI. Using the LDA model therefore effectively strengthens the model's understanding of topic information further.
Using LDA also requires setting the number of topics. After the number of topics is set, the TF-IDF matrix is taken as input, the LDA model yields the probability that each corpus fragment belongs to each topic, and these probability values form a vector that serves as the LDA-based encoding of the fragment.
For example, if the probabilities of a corpus fragment belonging to topic 1, topic 2 and topic 3 are 0.1, 0.6 and 0.3 respectively, the fragment is encoded as [0.1, 0.6, 0.3].
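A minimal sketch of the LDA encoding using the gensim library as an assumed implementation; note that gensim's LdaModel is normally trained on bag-of-words counts, whereas the description above feeds in the TF-IDF matrix, so this is an approximation of the step rather than the exact procedure:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    def lda_codes(tokenized_corpus, num_topics):
        """Train LDA and encode every fragment as its full topic probability
        distribution, e.g. [0.1, 0.6, 0.3] for three topics."""
        dictionary = Dictionary(tokenized_corpus)
        bow = [dictionary.doc2bow(tokens) for tokens in tokenized_corpus]
        lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)
        codes = []
        for doc in bow:
            dist = lda.get_document_topics(doc, minimum_probability=0.0)
            codes.append([prob for _, prob in sorted(dist)])
        return codes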
(5) Calculation of cosine similarity
Cosine similarity evaluates the similarity of two vectors by the cosine of the angle between them: the two vectors are placed in a vector space according to their coordinate values and the cosine of the angle between them is computed. The smaller the angle, the closer the cosine is to 1, the more closely the two vectors point in the same direction and the more similar they are. The calculation formula is
cos(θ) = Σ(Xi × Yi) / (sqrt(Σ Xi²) × sqrt(Σ Yi²))
where X and Y are the two vectors whose similarity is to be calculated, Xi and Yi are the i-th elements of X and Y respectively, the sums run over the n elements of the vectors, and θ is the angle between X and Y in the vector space.
The invention uses cosine similarity as the similarity measure between the topic encoding and the teaching material encodings. The three methods above yield three encodings; for each encoding, the cosine similarity between the topic and every corpus fragment can be calculated and recorded.
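A direct implementation of the cosine similarity formula above:

    import numpy as np

    def cosine_similarity(x, y):
        """cos(theta) = sum_i(x_i * y_i) / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))"""
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))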
4) Construct a training set from the teaching material processed in step 1) and train the BERT neural network.
The three models used above are all traditional semantic matching methods. The BERT model is introduced so that the model obtains deeper semantic understanding; as a neural network model, BERT is not affected by the quality of word segmentation and can handle words that are not in the dictionary.
The training data set does not use the corpus processed in 2), mainly because the BERT model released by google already provides the relevant preprocessing functions and only segmentation of the material is required. After the data set is read, the BERT pipeline first converts the original corpus to Unicode, calls the relevant functions to remove stop words and words without practical meaning, and calls FullTokenizer for word segmentation. The words of the corpus are then converted to ids and input to the BERT model.
The overall task of BERT here can be regarded as binary classification: for two input sentences, decide whether they are related. Therefore only positive and negative examples need to be prepared. For the positive examples, the cosine similarities of the encoding modes in 3) are calculated in advance; the larger the cosine similarity, the smaller the angle between the two corpus fragments in the vector space and the more similar they can be considered. Accordingly, the N corpus pairs with the highest similarity ranking are selected as positive examples, and their quality is ensured by manual review. For the negative examples, any two sentences from different paragraphs can be selected; to guarantee training quality, a human confirms that the two sentences are unrelated.
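A minimal sketch of the training-pair construction; the manual review of positive and negative examples required by the description is omitted here, and the data structures (a precomputed pairwise similarity table and a list of fragments) are illustrative assumptions:

    import random

    def build_training_pairs(fragments, pair_similarity, top_n, num_negatives):
        """fragments: list of fragment texts; pair_similarity[(i, j)]: combined cosine
        similarity of fragments i and j under the statistical encodings (precomputed).
        Returns (sentence_a, sentence_b, label) triples for BERT fine-tuning."""
        ranked = sorted(pair_similarity.items(), key=lambda kv: kv[1], reverse=True)
        positives = [(fragments[i], fragments[j], 1) for (i, j), _ in ranked[:top_n]]

        negatives = []
        while len(negatives) < num_negatives:
            i, j = random.sample(range(len(fragments)), 2)
            # The description additionally requires the pair to come from different
            # paragraphs and to be confirmed as unrelated by a human reviewer.
            negatives.append((fragments[i], fragments[j], 0))
        return positives + negatives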
5) Train the BERT neural network with the data set from 4), use the trained model to encode the teaching material and the question bank, and use cosine similarity as the semantic similarity between a question and the corpus fragments.
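A minimal sketch of encoding a question or fragment with BERT. The patent uses Google's original BERT release with FullTokenizer; this sketch substitutes the Hugging Face transformers port and the bert-base-chinese checkpoint as assumptions, and takes the [CLS] hidden state as the encoding (the description does not specify which output vector is used):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    def bert_encode(text):
        """Encode a question or corpus fragment; the [CLS] hidden state serves as the code."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()

    # similarity = cosine_similarity(bert_encode(question), bert_encode(fragment))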
6) With the voting-based semantic matching ensemble model, select K fragments as the knowledge point positioning result according to how often each teaching material fragment ranks near the top under the four encoding modes and its similarity.
The specific implementation steps of the voting-based semantic matching ensemble model are described here with reference to FIG. 4:
(1) A value K is set manually; K is the number of knowledge point positions the user wants to determine;
(2) For each of the four encoding modes, the K corpus fragments with the largest cosine similarity are taken as candidate set 1, with identical fragments merged into one;
(3) For each fragment in candidate set 1, the number of times it occurs among the four encoding modes is counted, and the first K fragments, ordered from most to least frequent, are selected as candidate set 2; because fragments with the same occurrence count rank equally, more than K fragments may be kept at this stage;
(4) For all fragments in candidate set 2, the cosine similarities of the four encoding modes are summed, the fragments are ordered from largest to smallest, and the top K fragments are selected and returned as the final positioning result.
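A minimal sketch of steps (1) to (4), following the tie-breaking order of claim steps 6-1) to 6-4) (occurrence count first, summed cosine similarity second); the data structures for the per-encoding rankings and similarities are illustrative assumptions:

    from collections import Counter

    def vote(rankings, similarities, k):
        """rankings: per encoding, fragment ids ordered by descending cosine similarity;
        similarities: per encoding, a dict mapping fragment id to its cosine similarity."""
        # Candidate set 1: the union of the top-K lists of the four encodings.
        top_lists = [ranked[:k] for ranked in rankings]

        # Candidate set 2: fragments appearing most often in those lists; ties on the
        # boundary are all kept, so the set may temporarily exceed K fragments.
        counts = Counter(fid for top in top_lists for fid in top)
        ordered = counts.most_common()
        cutoff = ordered[k - 1][1] if len(ordered) >= k else 0
        candidate2 = [fid for fid, c in ordered if c >= cutoff]

        # Final ranking: occurrence count first, summed cosine similarity as tie-breaker.
        def total_sim(fid):
            return sum(sims.get(fid, 0.0) for sims in similarities)
        candidate2.sort(key=lambda fid: (counts[fid], total_sim(fid)), reverse=True)
        return candidate2[:k]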
The above examples are only preferred embodiments of the present invention. It should be noted that those skilled in the art may make several modifications and equivalent substitutions without departing from the principles of the invention, and such modifications and equivalent substitutions also fall within the scope of the invention.

Claims (2)

1. A method for establishing a semantic matching model for knowledge point positioning, the method comprising the steps of:
1) for the given teaching material in the electrical field, dividing the material by paragraph to facilitate knowledge point positioning, and recording the chapter information and page number of each paragraph;
2) removing stop words and meaningless words from the teaching material, segmenting the remaining text into words to form a corpus, and constructing a dictionary;
3) for any specified topic and the processed corpus, calculating the TF-IDF values of the topic and all corpus fragments according to the dictionary constructed in 2), encoding the corpus and the topic with the statistics-based semantic matching methods TF-IDF, LSI and LDA respectively, and then calculating the cosine similarity between the encoding of the topic and the encodings of all corpus fragments as their semantic similarity; in step 3), the corpus and the topics are encoded as follows:
3-1) forming a dictionary from the word segmentation results of step 2), and for each corpus fragment and each topic calculating the TF-IDF value of every word in the dictionary to obtain the corresponding encoding;
3-2) using the TF-IDF encodings of the corpus fragments and topics obtained in 3-1) as column vectors to form a TF-IDF matrix, then performing topic-model analysis with LDA to obtain the probability that each corpus fragment belongs to each topic, and taking these probability values as an encoding;
3-3) LSI encoding, specifically decomposing the TF-IDF matrix obtained in 3-2) with SVD, that is:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is also an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of the matrix V^T are selected as the encodings of the corpus fragments;
3-4) for the three encoding modes, calculating the cosine similarity between the topic to be solved and the teaching material fragments to measure their similarity;
4) constructing a training set from the teaching material fragments processed in step 1), and training the BERT neural network:
4-1) considering together the cosine similarities obtained by the three methods in 3), selecting the N corpus pairs with the highest similarity ranking as positive examples;
4-2) randomly selecting sentences from different chapters as negative examples;
5) training the BERT neural network with the data set from 4), using the trained model to encode the corpus fragments and the topic to be solved, and using cosine similarity as the semantic similarity between the topic and the corpus fragments;
6) with the voting-based semantic matching ensemble model, selecting K fragments as the knowledge point positioning result according to how often each corpus fragment ranks near the top under the four encoding modes and its cosine similarity; the voting-based semantic matching ensemble model in step 6) determines the final knowledge point positioning result as follows:
6-1) the user sets the number K of knowledge point positions required; for each of the four encoding modes, the K corpus fragments with the largest cosine similarity are taken as candidate set 1;
6-2) the total number of times each corpus fragment occurs in candidate set 1 is counted, and the K fragments with the most occurrences are selected from the candidate set as candidate set 2, where fragments tied on the occurrence count are all kept, so the result may exceed K fragments;
6-3) within candidate set 2, the fragments are sorted by occurrence count from large to small; for fragments with the same count, the sum of their cosine similarities under the four encoding modes is calculated, and the higher the similarity, the higher the rank;
6-4) the first K fragments are taken as the final matching result.
2. The method for establishing a semantic matching model for knowledge point positioning according to claim 1, wherein the encoding by the BERT neural network in step 5) proceeds as follows:
5-1) removing stop words and meaningless words from the teaching material fragments and topics obtained in 1);
5-2) segmenting the teaching material fragments and topics into words and encoding them as token ids;
5-3) feeding the resulting ids into the model and taking the encoding output by the BERT neural network as the result.
CN202110319217.9A 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning Active CN112926340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110319217.9A CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110319217.9A CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Publications (2)

Publication Number Publication Date
CN112926340A (en) 2021-06-08
CN112926340B true (en) 2024-05-07

Family

ID=76175948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110319217.9A Active CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Country Status (1)

Country Link
CN (1) CN112926340B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113986968B (en) * 2021-10-22 2022-09-16 广西电网有限责任公司 Scheme intelligent proofreading method based on electric power standard standardization datamation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
WO2020140635A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text matching method and apparatus, storage medium and computer device
CN111753550A (en) * 2020-06-28 2020-10-09 汪秀英 Semantic parsing method for natural language

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
WO2020140635A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text matching method and apparatus, storage medium and computer device
CN111753550A (en) * 2020-06-28 2020-10-09 汪秀英 Semantic parsing method for natural language

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on multi-document concept map construction based on open-domain extraction; Sheng Yongpan et al.; Application Research of Computers; 2020-01-31; Vol. 37, No. 1; full text *

Also Published As

Publication number Publication date
CN112926340A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
Yan et al. Docchat: An information retrieval approach for chatbot engines using unstructured documents
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN109948143B (en) Answer extraction method of community question-answering system
CN105183833B (en) Microblog text recommendation method and device based on user model
CN110134946B (en) Machine reading understanding method for complex data
CN111291188B (en) Intelligent information extraction method and system
CN107229610A (en) The analysis method and device of a kind of affection data
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN109766544A (en) Document keyword abstraction method and device based on LDA and term vector
CN110705247B (en) Based on x2-C text similarity calculation method
CN114357127A (en) Intelligent question-answering method based on machine reading understanding and common question-answering model
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
CN115248839A (en) Knowledge system-based long text retrieval method and device
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111259156A (en) Hot spot clustering method facing time sequence
CN112926340B (en) Semantic matching model for knowledge point positioning
CN110704638A (en) Clustering algorithm-based electric power text dictionary construction method
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
Zi et al. SOM-NCSCM: An efficient neural chinese sentence compression model enhanced with self-organizing map
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant