CN112926340A - Semantic matching model for knowledge point positioning - Google Patents

Semantic matching model for knowledge point positioning

Info

Publication number
CN112926340A
CN112926340A (application number CN202110319217.9A; also published as CN112926340B)
Authority
CN
China
Prior art keywords
corpus
model
semantic matching
semantic
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110319217.9A
Other languages
Chinese (zh)
Other versions
CN112926340B (en)
Inventor
吴亦珂
吴天星
李林
高超禹
漆桂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110319217.9A priority Critical patent/CN112926340B/en
Publication of CN112926340A publication Critical patent/CN112926340A/en
Application granted granted Critical
Publication of CN112926340B publication Critical patent/CN112926340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic matching model for knowledge point positioning, mainly intended to solve the problem of locating knowledge points in the electrical field. The original teaching material is first preprocessed to form a corpus. The corpus and the question are then encoded with the statistics-based semantic matching models TF-IDF, LSI, and LDA. A deep-learning semantic matching model based on BERT encoding is added to strengthen deep semantic understanding. For each of the four encoding modes, cosine similarity is computed as the measure of semantic similarity. Finally, a voting-based semantic matching ensemble model selects a user-specified number of teaching-material segments as the final knowledge point positioning result, according to how often each segment appears in the top-ranked candidates and its cosine similarity.

Description

Semantic matching model for knowledge point positioning
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a semantic matching model for knowledge point positioning.
Background
Semantic matching is an important and fundamental problem in natural language processing and underlies a large number of NLP tasks, such as information retrieval, question answering, dialogue systems, and machine translation; to a great extent these tasks can be abstracted as semantic matching problems. The quality of the semantic matching model therefore greatly affects the effectiveness of the final application.
Traditional semantic matching techniques include algorithms such as BoW, VSM, TF-IDF, BM25, Jaccard, and SimHash. For example, the BM25 algorithm computes a matching score from the degree to which the query terms are covered by a document: the higher the score, the better the match between the document and the query. These methods mainly address matching, or similarity, at the lexical level. However, they tend to suffer from limitations in word sense, structure, and background knowledge. Topic models can also be used for semantic matching: sentences are mapped into a low-dimensional continuous space of equal length, and similarity can then be computed in this latent semantic space. Techniques such as LSI, LDA, and PLSA represent text semantics simply and are convenient to use, and they compensate well for the deficiencies of lexical matching, but in practice they still cannot replace literal matching and can only serve as an effective supplement to it. In recent years neural networks have also played an important role in semantic matching, in the form of deep semantic matching models. Using trained word vectors, a neural network can encode text content as vectors and use their cosine similarity to measure semantic similarity, thereby mining deeper semantic information. However, neural networks have poor interpretability and are prone to semantic drift. Since these three kinds of semantic matching models each have their own strengths, adopting a suitable method to achieve more effective semantic matching is of great research significance.
The semantic matching model for knowledge point positioning proposed here targets examination scenarios in the electrical field: it automatically performs semantic understanding of the question stem and searches the teaching materials for the corresponding knowledge points. The model combines several semantic matching approaches: TF-IDF for traditional semantic matching, LSI and LDA to add topic-level information, a BERT neural network to strengthen deep semantic understanding, and finally a voting-based semantic matching ensemble model that selects a certain (user-specified) number of candidate results as the source of the knowledge points.
Disclosure of Invention
The technical problem is as follows: the invention provides a semantic matching model for knowledge point positioning that automatically captures question stem information, judges the correlation between the question stem and different sections of the teaching materials, and selects the most relevant teaching-material segments as the matching knowledge points. The invention combines a statistics-based semantic matching method with a deep-learning-based correlation judgment method. The statistics-based method comprises three encoding modes: TF-IDF, LSI, and LDA. A deep learning model based on BERT encoding is added to handle out-of-vocabulary words, perform deeper semantic understanding, and support fuzzy semantic understanding. Finally, a voting-based semantic matching ensemble model selects the top K (user-specified) most similar paragraphs of the overall ranking as candidate results for knowledge point positioning.
The technical scheme is as follows: the invention is based on multiple semantic matching methods. First, a statistics-based semantic matching method is adopted: the question stem and the teaching materials are preprocessed, stop words are removed, the materials are segmented, and the semantic similarity between the question stem and the teaching materials is matched using the TF-IDF, LSI, and LDA models. Then a deep-learning-based correlation judgment method is adopted: the original corpus is preprocessed to meet the input requirements of the BERT neural network, and similarity is computed using BERT. Finally, the results of the four encoding modes are ranked according to the number of top-ranked occurrences and the similarity values, and the K highest-ranked segments are selected as candidate results for knowledge point positioning.
The invention discloses a semantic matching model for knowledge point positioning, which comprises the following steps:
1) for given teaching materials in the electrical field, to facilitate knowledge point positioning, dividing the materials into segments, taking paragraphs as the unit, and recording the chapter information and page number of each segment;
2) removing stop words and meaningless words from the teaching materials, performing word segmentation to form the corpus, and constructing a dictionary;
3) for any specified question and the processed corpus, calculating the TF-IDF values of the question and all corpus segments according to the dictionary constructed in 2), encoding the corpus and the question respectively with the statistics-based semantic matching methods TF-IDF, LSI, and LDA, and then computing the cosine similarity between the question encoding and each corpus-segment encoding as their semantic similarity;
4) constructing a training set from the teaching-material segments processed in step 1), and training a BERT neural network:
4-1) considering the cosine similarities obtained by the three methods in 3) together, selecting corpus pairs with higher similarity as positive examples;
4-2) randomly selecting sentences from different chapters as negative examples;
5) training the BERT neural network with the data set from 4), using the trained model to encode the corpus segments and the question bank, and using cosine similarity as the semantic similarity between a question and the corpus segments;
6) the voting-based semantic matching ensemble model selecting K segments as the knowledge point positioning result according to the number of times each corpus segment appears in the top-ranked results of the four encoding modes and its cosine similarity.
In the preferred embodiment of the semantic matching model method for knowledge point positioning of the present invention, the semantic matching method based on statistics in step 3) correspondingly encodes the corpus and the question according to the following steps:
3-1) forming a dictionary from all word segmentation results through the word segmentation in the step 2), and calculating the TF-IDF value of each word in the dictionary for each corpus and question to obtain corresponding codes;
3-2) taking the TF-IDF codes of the corpus segments and the question obtained in 3-1) as column vectors to form a TF-IDF matrix; then using LDA to perform topic-model analysis, obtaining the probability that each corpus segment belongs to each topic, and taking these probability values as the code;
3-3) LSI coding, specifically, decomposing the TF-IDF matrix obtained in 3-2) using SVD, that is:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of the matrix V^T are taken as the code of each corpus segment;
3-4) for each of the three encoding modes, computing the cosine similarity between the question to be solved and the teaching materials to measure the similarity between the question and each teaching-material segment.
In the preferred embodiment of the semantic matching model method for knowledge point positioning of the present invention, the coding method of the BERT neural network in the step 5) is performed as follows:
5-1) removing stop words and meaningless words from the teaching-material segments and the questions obtained in step 1);
5-2) tokenizing the teaching-material segments and the questions and encoding them as ids;
5-3) feeding the resulting ids into the model and taking the encoding output by the BERT neural network as the result.
In the preferred embodiment of the semantic matching model method for knowledge point positioning of the present invention, in the step 6), the voting-based semantic matching integrated model determines the final knowledge point positioning result according to the following steps:
6-1) the user setting the number K of knowledge point positions required; for each of the four encoding modes, taking the K teaching-material segments with the largest cosine similarity to form candidate set 1;
6-2) counting the total number of times each teaching-material segment appears in candidate set 1, and selecting the K segments with the most occurrences as candidate set 2 (ties in the occurrence count are allowed, so the final result may exceed K);
6-3) within candidate set 2, sorting the segments by occurrence count in descending order; when counts are equal, computing the sum of the cosine similarities under the four encoding modes, with larger sums ranked higher;
6-4) taking the first K as the final matching result.
Beneficial effects: compared with the prior art, the invention has the following advantages:
Compared with most current semantic matching models, the greatest advantage of this method is that it considers several kinds of semantic matching models together and combines their strengths, achieving a comprehensive use of multiple semantic matching modes. First, traditional TF-IDF encoding performs semantic matching of the text at the level of shallow semantics. Second, the TF-IDF model is suitably extended by introducing topic models: LSI increases the model's understanding of the topics and lets it focus on the more important information, while LDA infers the implied topics, giving the model deeper knowledge of them and helping it understand semantics at a deeper level. Furthermore, BERT encoding carries out semantic matching at a still deeper level; it is insensitive to the quality of preprocessing and can handle out-of-vocabulary words. Finally, a model-ensemble method combines the four encoding modes in terms of occurrence counts and cosine similarity and screens out the final result. Most semantic matching models use, or mainly rely on, only one of these approaches and cannot realize semantic understanding and matching from the shallow level to the deep level.
The model is simple and convenient to use. The dictionary and the TF-IDF values of the teaching materials can be built offline, and the BERT codes of the textbook can also be computed in advance. When locating the knowledge points of a specified question, only the TF-IDF code vector of the question needs to be computed for TF-IDF; for LSI and LDA, the question's TF-IDF code vector is added to the TF-IDF matrix, from which the codes of the question and the corpus under these two models are obtained. For the BERT neural network, the question can be fed directly into the network after suitable preprocessing to obtain its code. Once the codes are obtained, the cosine similarity between the question and each corpus segment can be computed to produce the final result. In other words, most of the whole process can be completed offline, so the model can easily be deployed on a server and has a great application prospect.
The model can not only locate the knowledge points of questions already in the question bank, but also supports fuzzy queries and the input of new questions. For questions outside the question bank, the BERT model handles out-of-vocabulary words, so semantic matching can still be achieved well.
Drawings
FIG. 1 is a general framework schematic of the present invention;
FIG. 2 is a flow chart of corpus processing in the present invention;
FIG. 3 is a schematic diagram of a statistics-based semantic matching model according to the present invention;
FIG. 4 is a diagram of the voting-based semantic matching integration model of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings.
The invention relates to a semantic matching model method for knowledge point positioning, which comprises the following 6 steps:
1) For the given teaching materials in the electrical field, to facilitate knowledge point positioning, the materials are divided into segments, taking paragraphs as the unit, and the chapter information and page number of each segment are recorded.
(1) In this model, if sentences were chosen as the retrieval granularity, the corpus would be too large, the computational complexity high, and the response time long; if whole pages were chosen, the positioning would be imprecise and the range too wide. To locate knowledge points more accurately, the paragraph is the more suitable granularity, i.e. each paragraph of text is treated as one corpus segment;
(2) the chapter information of each segment can be obtained from the page-number ranges or with regular expressions;
(3) the storage structure may be kept in JSON, as shown below:
"Professional basic theory 1.1-1.2, page 1, part 1": {
    "chapter": "Chapter 1 Professional Basic Theory",
    "page": 1,
    "text": "An alternating current that varies sinusoidally with time is called a sinusoidal alternating current.",
    "section": "Section 1 Analysis and Calculation of Single-Phase AC Circuits"
}
2) Remove stop words and meaningless words from the teaching materials, perform word segmentation to form the corpus, and construct a dictionary.
(1) The original textbook contains many stop words and meaningless words that not only occupy extra storage space but also hurt the efficiency of computation and the final matching quality, for example punctuation marks, function words ("however", "also"), and structural words ("Chapter 1", "Section 2"). They can be removed with regular-expression matching or with a stop-word list.
(2) The processed corpus still cannot be understood directly by the model; word segmentation is needed, i.e. each word in a sentence is separated, which can be done directly with the jieba tokenizer.
(3) During word segmentation, whenever a new word appears it is added to the dictionary, completing the construction of the dictionary.
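As a concrete illustration of steps (1)-(3) above, the following Python sketch preprocesses the stored paragraphs with jieba and builds the dictionary incrementally. The file names (stopwords.txt, textbook_paragraphs.json) and the exact punctuation pattern are assumptions for illustration; the patent itself provides no code.

import json
import re
import jieba

# Hypothetical stop-word list, one word per line.
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f if line.strip())

def preprocess(text):
    """Strip punctuation and stop words, then segment the paragraph with jieba."""
    text = re.sub(r"[，。！？；：、()（）\s]+", " ", text)
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

# Hypothetical storage file holding the JSON structure from step 1).
with open("textbook_paragraphs.json", encoding="utf-8") as f:
    paragraphs = json.load(f)

corpus, dictionary = [], {}
for key, entry in paragraphs.items():
    tokens = preprocess(entry["text"])
    corpus.append(tokens)
    for w in tokens:
        dictionary.setdefault(w, len(dictionary))  # add each new word once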
3) For any specified question and the processed corpus, calculate the TF-IDF values of the question and all corpus segments according to the dictionary constructed in step 2), and encode the corpus and the question with the statistics-based semantic matching methods TF-IDF, LSI, and LDA. Then compute the cosine similarity between the question encoding and each corpus-segment encoding as their semantic similarity.
The detailed steps of semantic matching based on statistics are described here in connection with fig. 3:
(1) calculation of TF-IDF values
TF-IDF is a statistical method for evaluating how important a word is to a document in a document set or corpus. Specifically, TF is the term frequency, i.e. how frequently a given word appears in the document; the more often a word occurs, the more important it is. IDF is the inverse document frequency and reflects a word's discriminative power: if fewer documents contain a term, its IDF is larger and the term has good category-distinguishing ability, which to some extent reflects the importance of the word to the corpus.
TF is calculated as
TF(n) = count(n) / m
Wherein n is the word to be solved, count (n) is the number of occurrences of n in the corpus, and m is the total number of words in the corpus.
IDF is calculated as
IDF(n) = lg(D / N)
Wherein lg represents the logarithm with the base of 10, D represents the total number of documents in the corpus, and N represents the number of documents containing N.
TF-IDF(n)=TF(n)×IDF(n)
Thus, for each corpus segment, the TF-IDF value of every word in the dictionary can be calculated, which yields the TF-IDF code of that segment.
For the question, stop words and words without practical meaning are removed and word segmentation is performed, after which TF-IDF coding is carried out in the same way.
(2) Construction of TF-IDF matrices
For each corpus segment and for the question, the TF-IDF code can be taken as a column vector to form a TF-IDF matrix. In practice, since the teaching materials do not change, the TF-IDF code of each corpus segment can be computed offline in advance. When a question whose knowledge points need to be located is input, it is TF-IDF encoded and added to the TF-IDF matrix. The final result is a matrix whose columns are the TF-IDF codes of the corpus segments and of the question.
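A minimal sketch of this construction, following the TF and IDF formulas above (TF = count(n)/m, IDF = lg(D/N)); the function name and the convention of one column per corpus segment are assumptions for illustration.

import numpy as np
from collections import Counter

def tfidf_matrix(corpus_tokens, dictionary):
    """Columns are the TF-IDF codes of the corpus segments; rows follow the dictionary."""
    D = len(corpus_tokens)
    # document frequency: number of segments containing each word
    df = Counter(w for doc in corpus_tokens for w in set(doc))
    mat = np.zeros((len(dictionary), D))
    for j, doc in enumerate(corpus_tokens):
        counts, m = Counter(doc), len(doc)
        for w, c in counts.items():
            mat[dictionary[w], j] = (c / m) * np.log10(D / df[w])
    return mat

# The question, preprocessed the same way and encoded with the same D and document
# frequencies, is appended as an extra column before the LSI/LDA analysis.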
(3) LSI coding
LSI stands for latent semantic indexing. Its basic idea is to project high-dimensional documents into a low-dimensional latent semantic space. Dimensionality reduction is therefore the most important step in LSI analysis: by reducing the dimension, the "noise", i.e. the irrelevant information in a document, is removed, and the semantic structure of the text gradually emerges. Compared with the traditional vector space, the latent semantic space has fewer dimensions and clearer semantic relationships. LSI obtains the topics of the text through singular value decomposition (SVD); concretely, SVD factorizes any matrix into the product of three matrices:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of V^T are taken as the code of each corpus segment. Among the three matrices, U captures the correlation between words and V captures the correlation between texts and topics, so the V matrix can be extracted as the corresponding LSI code.
Therefore, in our model, LSI coding further extends the coding vectors computed by TF-IDF: through singular value decomposition, understanding of the topic information is added to the coding, so that more attention is paid to the main information. Since LSI is a topic model, the number of topics has a great influence on the final result: it should not be too large, otherwise redundant concepts are included, nor too small, otherwise not all information can be retained. It is best to judge and experiment manually according to the actual corpus. For the coding step, the TF-IDF matrix from step (2) is taken as input, the number of topics is set, and the generated V matrix provides the LSI coding vectors of the corpus and of the question to be solved.
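A small numpy sketch of this step, assuming the TF-IDF matrix from the previous section (one column per segment plus one appended column for the question); the truncation to k topics mirrors the decomposition A = UΣV^T described above.

import numpy as np

def lsi_codes(tfidf, k):
    """Return the k-dimensional LSI codes; one column per corpus segment (and question)."""
    U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
    return Vt[:k, :]  # rows correspond to the k strongest latent topics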
(4) LDA coding
LDA expresses the topics of every document in a document set as a probability distribution; by analyzing a batch of documents and extracting their topic distributions, topic clustering or text classification can then be carried out. It is also a typical bag-of-words model: a document is a collection of words with no ordering relationship between them. Like LSI, LDA is a topic model, but compared with LSI it places sparse Dirichlet priors on each document's distribution over topics and on each topic's distribution over words; these two priors let LDA describe the relationships among documents, topics, and words better than LSI. Using the LDA model can therefore further strengthen the model's understanding of the topic information.
Using LDA also requires setting the number of topics. After the number of topics is set and the TF-IDF matrix is input, the LDA model gives the probability that each corpus segment corresponds to each topic; these probability values form a vector, which is the LDA-based code of the segment.
For example, if a corpus segment has probabilities 0.1, 0.6, and 0.3 of belonging to topic 1, topic 2, and topic 3 respectively, it is encoded as [0.1, 0.6, 0.3].
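A sketch of LDA coding using gensim, a swapped-in library the patent does not name; note that gensim's LdaModel consumes bag-of-words counts rather than the TF-IDF matrix the description feeds it, so this is an approximation of the step rather than the exact procedure.

from gensim import corpora, models

def lda_codes(corpus_tokens, num_topics):
    """Return each segment's probability distribution over the chosen topics."""
    dictionary = corpora.Dictionary(corpus_tokens)
    bow = [dictionary.doc2bow(doc) for doc in corpus_tokens]
    lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)
    codes = []
    for doc in bow:
        dist = lda.get_document_topics(doc, minimum_probability=0.0)
        codes.append([prob for _, prob in dist])  # e.g. [0.1, 0.6, 0.3]
    return codes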
(5) Calculation of cosine similarity
Cosine similarity is evaluated by computing the cosine of the angle between two vectors. The two vectors are placed in a vector space according to their coordinate values, the angle between them is obtained, and the cosine of that angle is computed; the smaller the angle, the closer the cosine is to 1, and the more closely the directions of the two vectors match, the more similar they are. The calculation formula is:
cos θ = (Σ_{i=1}^{n} X_i Y_i) / ( √(Σ_{i=1}^{n} X_i²) × √(Σ_{i=1}^{n} Y_i²) )
where X and Y are the two vectors whose similarity is to be computed, X_i and Y_i are the i-th elements of X and Y respectively, n is the number of elements in the vectors, and θ is the angle between X and Y in the vector space.
The invention uses cosine similarity as the similarity measure between the question code and the teaching-material codes. The three methods above produce three codes; for each code, the cosine similarity between the question and every corpus segment is computed and recorded.
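The corresponding computation in code, shown here for completeness; the function name is illustrative.

import numpy as np

def cosine_similarity(X, Y):
    """cos(theta) between two code vectors, as in the formula above."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return float(np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y)))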
4) Constructing a training set for the teaching materials processed in the step 1), and training the BERT neural network.
The three models used above are traditional semantic matching methods; the BERT model is introduced to give deeper semantic understanding. As a neural network model, BERT is not affected by the quality of word segmentation and can handle out-of-vocabulary words that are not in the dictionary.
For the training data set, the corpus processed in step 2) is not used, mainly because the BERT model released by Google already provides the relevant preprocessing functions and the text only needs to be split. After reading the data set, the BERT code converts the original corpus to Unicode, calls the relevant functions to remove stop words and meaningless words, and calls FullTokenizer to perform tokenization. The words in the corpus are then converted into id form and fed into the BERT model.
The overall task of BERT here can be viewed as a binary classification task: given two input text segments, decide whether they are related. Therefore only positive and negative examples need to be prepared. For the positive examples, the cosine similarities of the encoding modes in step 3) can be computed in advance; the larger the cosine similarity, the smaller the angle between the two corpus segments in the vector space, and thus the more similar they are. Accordingly, the N corpus pairs ranked highest by similarity can be selected as positive examples, with manual inspection to ensure data quality. For the negative examples, any two sentences from different paragraphs can be selected; to guarantee training quality, it should be confirmed manually that they are indeed unrelated.
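For illustration, a sentence-encoding sketch with the Hugging Face transformers library; the patent itself refers to Google's original BERT code and FullTokenizer, so this library, the bert-base-chinese checkpoint, and the use of the [CLS] vector as the segment code are substitutions assumed here.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def bert_encode(text):
    """Encode a question or corpus segment as the [CLS] vector of the last layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]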
5) Train the BERT neural network with the data set from step 4), use the trained model to encode the teaching materials and the question bank, and use cosine similarity as the semantic similarity between a question and each corpus segment.
6) The voting-based semantic matching ensemble model selects K segments as the knowledge point positioning result according to the number of times each teaching-material segment appears in the top-ranked results of the four encoding modes and its cosine similarity.
The specific implementation steps of the voting-based semantic matching integration model are described in conjunction with fig. 4:
(1) manually set the value K, i.e. the number of knowledge point positions the user wants to locate;
(2) for each of the four encoding modes, take the K corpus segments with the largest cosine similarity to form candidate set 1, merging duplicate segments;
(3) for every segment in candidate set 1, count how many times it appears across the four encoding modes, sort the segments by this count in descending order, and select the first K as candidate set 2; because ties in the occurrence count can produce equal ranks, the number of segments in the final result is allowed to exceed K;
(4) for all segments in candidate set 2, sum the cosine similarities of the four encoding modes, sort from largest to smallest, and select the largest K segments as the final positioning result.
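A compact sketch of this voting procedure; the data structure (each encoding mode's ranking as a list of (segment_id, cosine_similarity) pairs sorted by similarity) is an assumption, and the tie-break here sums the similarities of the modes in which a segment reached the top K, a slight simplification of step (4).

from collections import defaultdict

def vote(rankings, K):
    """rankings: {mode_name: [(segment_id, cosine_similarity), ...] sorted descending}."""
    votes, sim_sum = defaultdict(int), defaultdict(float)
    for ranked in rankings.values():          # candidate set 1: top K of every mode
        for seg_id, sim in ranked[:K]:
            votes[seg_id] += 1
            sim_sum[seg_id] += sim
    # sort by occurrence count, break ties by the summed cosine similarity
    ordered = sorted(votes, key=lambda s: (votes[s], sim_sum[s]), reverse=True)
    return ordered[:K]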
The above examples are only preferred embodiments of the present invention. It should be noted that various modifications and equivalents may be made by those skilled in the art without departing from the spirit of the invention, and all such modifications and equivalents are intended to fall within the scope of the invention as defined in the claims.

Claims (4)

1. A semantic matching model for knowledge point localization, the method comprising the steps of:
1) for given teaching materials in the electrical field, to facilitate knowledge point positioning, dividing the materials into segments, taking paragraphs as the unit, and recording the chapter information and page number of each segment;
2) removing stop words and meaningless words from the teaching materials, performing word segmentation to form the corpus, and constructing a dictionary;
3) for any specified question and the processed corpus, calculating the TF-IDF values of the question and all corpus segments according to the dictionary constructed in 2), encoding the corpus and the question respectively with the statistics-based semantic matching methods TF-IDF, LSI, and LDA, and then computing the cosine similarity between the question encoding and each corpus-segment encoding as their semantic similarity;
4) constructing a training set from the teaching-material segments processed in step 1), and training a BERT neural network:
4-1) considering the cosine similarities obtained by the three methods in 3) together, selecting corpus pairs with higher similarity as positive examples;
4-2) randomly selecting sentences from different chapters as negative examples;
5) training the BERT neural network with the data set from 4), using the trained model to encode the corpus segments and the questions to be solved, and using cosine similarity as the semantic similarity between a question and the corpus segments;
6) the voting-based semantic matching ensemble model selecting K segments as the knowledge point positioning result according to the number of times each corpus segment appears in the top-ranked results of the four encoding modes and its cosine similarity.
2. The semantic matching model for knowledge point positioning according to claim 1, wherein in the step 3), the semantic matching method based on statistics correspondingly encodes the corpus and the questions according to the following steps:
3-1) forming a dictionary from all word segmentation results through the word segmentation in the step 2), and calculating the TF-IDF value of each word in the dictionary for each corpus and question to obtain corresponding codes;
3-2) coding the linguistic data obtained in 3-1) and TF-IDF values of the questions as column vectors to form a TF-IDF matrix, then, analyzing based on a topic model by using LDA to obtain the probability that each linguistic data in the linguistic data belongs to each topic, and taking the probability value as a code;
3-3) LSI coding, specifically, decomposing the TF-IDF matrix obtained in 3-2) using SVD, that is:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of V^T are taken as the code of each corpus segment;
3-4) for each of the three encoding modes, computing the cosine similarity between the question to be solved and the teaching materials to measure the similarity between the question and each teaching-material segment.
3. The semantic matching model for knowledge point localization according to claim 1, wherein the encoding method of the BERT neural network in the step 5) is performed as follows:
5-1) removing stop words and meaningless words from the teaching-material segments and the questions obtained in step 1);
5-2) tokenizing the teaching-material segments and the questions and encoding them as ids;
5-3) feeding the resulting ids into the model and taking the encoding output by the BERT neural network as the result.
4. The semantic matching model for knowledge point positioning according to claim 1, wherein the voting-based semantic matching integrated model in step 6) determines the final knowledge point positioning result according to the following steps:
6-1) the user setting the number K of knowledge point positions required; for each of the four encoding modes, taking the K corpus segments with the largest cosine similarity to form candidate set 1;
6-2) counting the total number of times each corpus segment appears in candidate set 1, and selecting the K segments with the most occurrences as candidate set 2, wherein ties in the occurrence count are allowed so that the final result may exceed K;
6-3) within candidate set 2, sorting the segments by occurrence count in descending order; when counts are equal, computing the sum of the cosine similarities under the four encoding modes, with larger sums ranked higher;
6-4) taking the first K as the final matching result.
CN202110319217.9A 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning Active CN112926340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110319217.9A CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110319217.9A CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Publications (2)

Publication Number Publication Date
CN112926340A (en) 2021-06-08
CN112926340B CN112926340B (en) 2024-05-07

Family

ID=76175948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110319217.9A Active CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Country Status (1)

Country Link
CN (1) CN112926340B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
WO2020140635A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text matching method and apparatus, storage medium and computer device
CN111753550A (en) * 2020-06-28 2020-10-09 汪秀英 Semantic parsing method for natural language

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
盛泳潘 et al.: "Research on multi-document concept graph construction based on open-domain extraction" (基于开放域抽取的多文档概念图构建研究), Application Research of Computers (计算机应用研究), vol. 37, no. 1, 31 January 2020 (2020-01-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113986968A (en) * 2021-10-22 2022-01-28 广西电网有限责任公司 Scheme intelligent proofreading method based on electric power standard standardization datamation

Also Published As

Publication number Publication date
CN112926340B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110134946B (en) Machine reading understanding method for complex data
CN105183833B (en) Microblog text recommendation method and device based on user model
CN111291188B (en) Intelligent information extraction method and system
CN104881458B (en) A kind of mask method and device of Web page subject
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN113282729B (en) Knowledge graph-based question and answer method and device
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN110704638A (en) Clustering algorithm-based electric power text dictionary construction method
CN112926340B (en) Semantic matching model for knowledge point positioning
CN117131383A (en) Method for improving search precision drainage performance of double-tower model
CN116628192A (en) Text theme representation method based on Seq2Seq-Attention
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant