CN112926340B - Semantic matching model for knowledge point positioning - Google Patents


Info

Publication number
CN112926340B
CN112926340B
Authority
CN
China
Prior art keywords
corpus
topic
model
fragments
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110319217.9A
Other languages
Chinese (zh)
Other versions
CN112926340A (en
Inventor
吴亦珂
吴天星
李林
高超禹
漆桂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110319217.9A priority Critical patent/CN112926340B/en
Publication of CN112926340A publication Critical patent/CN112926340A/en
Application granted granted Critical
Publication of CN112926340B publication Critical patent/CN112926340B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic matching model for knowledge point positioning, mainly intended to locate the knowledge points behind examination questions in the electrical field. The method first preprocesses the original teaching material to form a corpus. The corpus and the question are then encoded with the statistics-based semantic matching models TF-IDF, LSI and LDA. A deep-learning semantic matching model is added to strengthen deep semantic understanding by encoding with BERT. For each of the four encodings, cosine similarity is calculated as the measure of semantic similarity. Finally, a voting-based semantic matching ensemble model selects a user-specified number of teaching material fragments as the final knowledge point positioning result, according to how often each fragment ranks near the top and its cosine similarity.

Description

Semantic matching model for knowledge point positioning
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a semantic matching model for knowledge point positioning.
Background
Semantic matching is an important and fundamental problem in natural language processing and can be applied to a large number of NLP tasks, such as information retrieval, question answering systems, duplicate question detection, dialogue systems and machine translation; to a large extent these tasks can be abstracted as semantic matching problems. The quality of the semantic matching model therefore has a large impact on the effectiveness of the end application.
Traditional semantic matching techniques include algorithms such as BoW, VSM, TF-IDF, BM25, Jaccard and SimHash. For example, the BM25 algorithm computes a matching score between a query and a document according to how well the document covers the query terms; the higher the score, the better the document matches the query. These methods mainly solve matching or similarity at the lexical level, but they are often limited by word senses, by sentence structure and by the lack of background knowledge. Topic models can also be used for semantic matching: sentences are mapped to a low-dimensional continuous space of equal length, so that similarity can be computed implicitly in a latent semantic space. LSI, LDA and PLSA, for example, provide compact semantic representations of text that are convenient to operate on and thus compensate well for the shortcomings of traditional lexical matching; in practice, however, they cannot replace lexical matching and serve only as an effective supplement to it. In recent years, increasingly popular neural networks have played an important role in semantic matching in the form of deep semantic matching models. A neural network can encode text content as vectors built from trained word embeddings and match semantic similarity by computing the cosine similarity of those vectors, thereby mining deeper semantic information; however, neural networks have poor interpretability and are prone to problems such as semantic drift. Each of these three families of semantic matching models has its own strengths and weaknesses, so achieving more effective semantic matching with an appropriate combination of methods is of great research significance.
The semantic matching model for knowledge point positioning proposed here is mainly aimed at examination scenarios in the electrical field: it automatically performs semantic understanding of the question stem and retrieves the corresponding knowledge points from the teaching material. The model takes several semantic matching models into account: it adopts TF-IDF for traditional semantic matching, uses LSI and LDA to add topic-level information, strengthens the understanding of deep semantics through a BERT neural network, and finally applies a voting-based semantic matching ensemble model to select a certain number of candidate results (specified by the user) as the sources of the knowledge points.
Disclosure of Invention
Technical problems: the invention provides a semantic matching model for knowledge point positioning that automatically captures the question stem, judges the correlation between the stem and the paragraphs of different sections of a teaching material, and selects the most relevant teaching material paragraphs as the matching knowledge points. The invention combines a statistics-based semantic matching method with a deep-learning-based correlation judgment method. The statistics-based method comprises three encoding modes: TF-IDF, LSI and LDA. A deep learning model based on BERT encoding is added to handle out-of-vocabulary words, perform deeper semantic understanding and support fuzzy semantic matching. Finally, a voting-based semantic matching ensemble model selects the K (user-specified) highest-ranked paragraphs as candidate results of knowledge point positioning.
The technical scheme is as follows: the invention builds on several semantic matching methods. First, a statistics-based semantic matching method is adopted: the question stem and the teaching material are preprocessed, stop words are removed, the material is segmented into words, and the semantic similarity between the stem and the teaching material is matched with the TF-IDF, LSI and LDA models. Then, a deep-learning-based correlation judgment method is adopted: the original corpus is preprocessed so that it meets the input requirements of the BERT neural network, and the similarity is calculated with BERT. Finally, the results obtained under the four encoding modes are sorted by the number of top-ranked occurrences and by the similarity values, and the K highest-ranked fragments are selected as candidate results of knowledge point positioning.
The invention relates to a semantic matching model for knowledge point positioning, which comprises the following steps:
1) For the given teaching material in the electrical field, dividing the material by paragraph to facilitate knowledge point positioning, and recording the chapter information and page number of each paragraph;
2) Removing stop words and meaningless words from the teaching material, segmenting the remaining text into words to form a corpus, and constructing a dictionary;
3) For any specified topic and the processed corpus, calculating the TF-IDF values of the topic and all corpus fragments according to the dictionary constructed in 2), encoding the corpus and the topic with the statistics-based semantic matching methods TF-IDF, LSI and LDA respectively, and then calculating the cosine similarity between the encoding of the topic and the encodings of all corpus fragments as their semantic similarity;
4) Constructing a training set from the teaching material fragments processed in step 1), and training the BERT neural network:
4-1) considering together the cosine similarities obtained by the three methods in 3), selecting the corpus pairs with the highest similarity as positive examples;
4-2) randomly selecting sentences from different chapters as negative examples;
5) Training the BERT neural network with the data set from 4), using the trained model to encode the corpus fragments and the question bank, and using cosine similarity as the semantic similarity between a topic and the corpus fragments;
6) With the voting-based semantic matching ensemble model, selecting K fragments as the knowledge point positioning result according to how often each corpus fragment ranks near the top under the four encoding modes and its similarity.
In a preferred scheme of the semantic matching model method for knowledge point positioning, the statistics-based semantic matching method in step 3) encodes the corpus and the topics as follows:
3-1) forming a dictionary from the word segmentation results of step 2), and for each corpus fragment and each topic calculating the TF-IDF value of every word in the dictionary to obtain the corresponding encoding;
3-2) using the TF-IDF encodings of the corpus fragments and topics obtained in 3-1) as column vectors to form a TF-IDF matrix;
then performing topic-model analysis with LDA to obtain the probability that each corpus fragment belongs to each topic, and taking these probability values as an encoding;
3-3) LSI encoding, specifically decomposing the TF-IDF matrix obtained in 3-2) with SVD, that is:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is also an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of the matrix V^T are selected as the encodings of the corpus fragments;
3-4) for the three encoding modes, calculating the cosine similarity between the topic to be solved and the teaching material fragments to measure their similarity.
In a preferred scheme of the semantic matching model method for knowledge point positioning, the encoding by the BERT neural network in step 5) proceeds as follows:
5-1) removing stop words and meaningless words from the teaching material fragments and topics obtained in 1);
5-2) segmenting the teaching material fragments and topics into words and encoding them as token ids;
5-3) feeding the resulting ids into the model and taking the encoding output by the BERT neural network as the result.
In a preferred scheme of the semantic matching model method for knowledge point positioning, the voting-based semantic matching ensemble model in step 6) determines the final knowledge point positioning result as follows:
6-1) the user sets the number K of knowledge point positions required; for each of the four encoding modes, the K teaching material fragments with the largest cosine similarity are taken as candidate set 1;
6-2) the total number of times each teaching material fragment occurs in candidate set 1 is counted, and the K fragments with the most occurrences are selected from the candidate set as candidate set 2 (fragments tied on the occurrence count are all kept, so the result may exceed K);
6-3) within candidate set 2, the fragments are sorted by occurrence count from large to small; for fragments with the same count, the sum of their cosine similarities under the four encoding modes is calculated, and the higher the similarity, the higher the rank;
6-4) the first K fragments are taken as the final matching result.
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
Compared with most current semantic matching models, the greatest advantage of the method is that it considers several semantic matching models together and draws on the strengths of each, realizing the comprehensive use of multiple semantic matching modes. First, traditional TF-IDF encoding performs text matching at the level of shallow semantics. Second, the TF-IDF model is suitably extended by introducing topic models: LSI increases the model's understanding of topics so that more important information receives attention, and LDA infers latent topics so that the model gains a deeper grasp of the subject matter, helping it understand the semantics more profoundly. Moreover, the BERT encoding mode performs semantic matching at a still deeper level, avoids dependence on the quality of word segmentation, and handles out-of-vocabulary words. Finally, the ensemble model combines the four encoding modes from the two aspects of occurrence count and cosine similarity and screens out the final result. Most current semantic matching models use only one of these approaches, or focus on one of them, and cannot realize semantic understanding and matching from shallow to deep.
The model is simple and convenient to use. The construction of the dictionary and the TF-IDF values of the teaching material can be completed offline, and the BERT encodings of the teaching material can also be calculated in advance. When the knowledge point positioning of a specified question is calculated, only the TF-IDF encoding vector of the question needs to be calculated for TF-IDF; for LDA and LSI, the calculated question TF-IDF vector is appended to the TF-IDF matrix to obtain the encodings of the question and the corpus under those two models. For the BERT neural network, the question is suitably preprocessed and fed directly into the network to obtain its encoding. Once the encodings are available, the cosine similarities between the question and the corpus fragments can be calculated to obtain the final result. In other words, most of the heavy computation can be completed in advance, and the remaining per-question work is light enough to be carried out online. The model can therefore be conveniently deployed on a server and has a strong application prospect.
The model not only locates knowledge points for problems in the question bank but also supports fuzzy queries and the input of new questions. For topics outside the question bank, the BERT model handles out-of-vocabulary words and thus achieves better semantic matching.
Drawings
FIG. 1 is a schematic view of the overall framework of the present invention;
FIG. 2 is a flow chart of corpus processing in the present invention;
FIG. 3 is a schematic diagram of the statistics-based semantic matching model in the present invention;
FIG. 4 is a schematic diagram of the voting-based semantic matching ensemble model in the present invention.
Detailed Description
The invention is described in more detail below with reference to examples and accompanying drawings.
The invention relates to a semantic matching model method for knowledge point positioning, which comprises the following 6 steps:
1) For the given teaching material in the electrical field, divide the material by paragraph to facilitate knowledge point positioning, and record the chapter information and page number of each paragraph.
(1) If a sentence were chosen as the retrieval granularity, the corpus would be excessively large, the computational complexity high and the response time long; if a whole page were chosen, positioning would be imprecise and cover too wide a range. To locate knowledge points well, the paragraph is an appropriate granularity, i.e. each paragraph of text is treated as one corpus fragment;
(2) The chapter information of each paragraph can be obtained from the page-number division or with a regular expression;
(3) The storage structure may use json; an example entry looks like the following (a sketch of how such a structure might be assembled is given after the example):

    {
        "professional basic theory 1.1-1.2, page 1 part 1": {
            "Chapter number": "Chapter 1 Professional basic theory",
            "Page number": 1,
            "Section number": "Section 1 Analysis and calculation of single-phase alternating current circuits",
            "Text": "Alternating current that varies with time according to a sinusoidal law is called sinusoidal alternating current."
        }
    }
2) Remove stop words and meaningless words from the teaching material, segment the remaining text into words to form a corpus, and construct a dictionary.
(1) The original teaching material contains a large number of stop words and meaningless words that not only occupy extra storage space but also harm the running efficiency and the final matching quality, such as punctuation marks, function words and chapter markers like "Chapter 1" and "Chapter 2". They can be removed with regular expression matching or with a stop-word list.
(2) The processed corpus still cannot be understood directly by the model; it must first be segmented, i.e. each sentence is split into words, which can be done directly with the jieba segmenter (see the sketch below).
(3) During word segmentation, every new word is added to the dictionary the first time it appears, which completes the construction of the dictionary.
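A minimal sketch of steps (1) to (3), using jieba as mentioned above; the stop-word set shown here is only illustrative, and a full Chinese stop-word file would be loaded in practice:

    import jieba

    # Illustrative stop-word set; a complete stop-word list is used in practice.
    STOP_WORDS = {"的", "了", "是", "在", "，", "。", "第一章", "第二章"}

    def tokenize(text):
        """Segment a fragment or question with jieba and drop stop words and empty tokens."""
        return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

    def build_dictionary(corpus_texts):
        """Assign an integer id to every word the first time it appears, as in step (3)."""
        dictionary = {}
        for text in corpus_texts:
            for word in tokenize(text):
                dictionary.setdefault(word, len(dictionary))
        return dictionary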
3) For any specified topic and the processed corpus, calculate the TF-IDF values of the topic and all corpus fragments according to the dictionary constructed in 2), and encode the corpus and the topic with the statistics-based semantic matching methods TF-IDF, LSI and LDA respectively. Then calculate the cosine similarity between the encoding of the topic and the encodings of all corpus fragments as their semantic similarity.
The detailed steps of statistics-based semantic matching are described here with reference to FIG. 3:
(1) Calculation of TF-IDF value
TF-IDF is a statistical method used to evaluate how important a word is to a document in a document set or corpus. Specifically, TF is the term frequency, i.e. how often a given word appears in the document: the more often a word appears, the more important it is. IDF is the inverse document frequency and reflects the discriminative power of a word: if few documents contain a particular term, its IDF is large, indicating that the term distinguishes categories well, which to a certain extent reflects the importance of the word to the corpus.
The TF is calculated as
TF(n) = count(n) / m
where n is the word in question, count(n) is the number of occurrences of n in the corpus fragment, and m is the total number of words in the fragment.
The IDF is calculated as
IDF(n) = lg(D / N)
where lg denotes the base-10 logarithm, D is the total number of documents in the corpus, and N is the number of documents containing n.
TF-IDF(n) = TF(n) × IDF(n)
Thus, for each corpus fragment, the TF-IDF value of every dictionary word appearing in that fragment can be calculated, yielding the TF-IDF encoding of each fragment.
For a topic, stop words and words without practical meaning are removed first and the text is segmented, after which its TF-IDF encoding is computed in the same way.
(2) Construction of TF-IDF matrix
For each corpus fragment and for the question, the TF-IDF encoding is used as a column vector to form a TF-IDF matrix. In practice, since the teaching material does not change, the TF-IDF encoding of each corpus fragment can be calculated offline in advance; when a question whose knowledge points need to be located is input, the question is TF-IDF encoded and appended to the matrix. A sketch covering both the TF-IDF values and the matrix assembly follows:
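A minimal sketch that directly implements TF(n) = count(n)/m and IDF(n) = lg(D/N) as defined above; the function and variable names (tf_idf_vector, doc_freq, the precomputed document frequencies) are illustrative assumptions:

    import math
    import numpy as np

    def tf_idf_vector(tokens, dictionary, doc_freq, num_docs):
        """Encode one tokenized fragment (or question) over the dictionary using
        TF(n) = count(n) / m and IDF(n) = lg(D / N) as defined above."""
        vec = np.zeros(len(dictionary))
        m = len(tokens)
        for word in set(tokens):
            if word in dictionary and doc_freq.get(word, 0) > 0 and m > 0:
                tf = tokens.count(word) / m
                idf = math.log10(num_docs / doc_freq[word])
                vec[dictionary[word]] = tf * idf
        return vec

    # doc_freq[word]: number of fragments containing the word (assumed precomputed);
    # stacking the fragment encodings as columns gives the TF-IDF matrix A used below:
    # A = np.column_stack([tf_idf_vector(t, dictionary, doc_freq, D) for t in tokenized_corpus])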
(3) LSI coding
LSI stands for latent semantic indexing. Its basic idea is to project high-dimensional documents into a low-dimensional latent semantic space, so dimensionality reduction is the most important step of LSI analysis: by reducing the dimension, "noise", i.e. irrelevant information, is removed from the documents, and the semantic structure of the text gradually emerges. Compared with the traditional vector space, the latent semantic space has a smaller dimension and clearer semantic relations. LSI obtains the topics of a text through singular value decomposition (SVD); specifically, SVD can decompose any matrix into the product of three matrices:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is also an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix. Among these three matrices, U captures the association between words and topics and V the association between texts and topics, so from the decomposition result the column vectors of V^T are taken as the LSI encodings of the corpus fragments.
In our model, encoding with LSI therefore extends the TF-IDF encoding vectors: through singular value decomposition, the encoding gains an understanding of topic information and can focus more on the main information. Since LSI is a topic model, the choice of the number of topics strongly influences the final result: it should not be too large, otherwise redundant concepts are introduced, nor too small, otherwise not all information is retained. Preferably it is judged and tuned manually according to the actual corpus. This encoding step takes the TF-IDF matrix of step (2) as input, fixes the number of topics, and uses the resulting V matrix as the LSI encoding vectors of the corpus fragments and of the question to be solved.
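A minimal sketch of the LSI step, under the assumption that the TF-IDF matrix A stores one fragment (or the appended question) per column; the truncation to num_topics singular vectors corresponds to the manually chosen number of topics:

    import numpy as np

    def lsi_codes(tfidf_matrix, num_topics):
        """Truncated SVD of A = U Σ V^T; the top-num_topics rows of V^T hold,
        column by column, the LSI code of each fragment."""
        U, s, Vt = np.linalg.svd(tfidf_matrix, full_matrices=False)
        return Vt[:num_topics, :]           # shape: (num_topics, number of columns of A)

    # codes = lsi_codes(A, num_topics=50)   # 50 topics is an illustrative choice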
(4) LDA coding
LDA can give the topics of each document in a document set in the form of probability distributions; by analysing a batch of documents and extracting the topic distribution of each one, topic clustering or text classification can then be performed according to those distributions. It is also a typical bag-of-words model, i.e. a document is treated as a set of words with no ordering relation between them. Like LSI, LDA belongs to the topic models, but compared with LSI, LDA places sparse Dirichlet priors on the topic distribution of each document and on the word distribution of each topic; these two priors allow the LDA model to characterise the document-topic-word relationship better than LSI. Using the LDA model therefore effectively strengthens the model's understanding of topic information further.
Using LDA also requires setting the number of topics. After the number of topics is set, the TF-IDF matrix is taken as input, the LDA model yields the probability that each corpus fragment belongs to each topic, and these probability values form a vector that serves as the LDA-based encoding of the fragment.
For example, if the probabilities of a corpus fragment belonging to topic 1, topic 2 and topic 3 are 0.1, 0.6 and 0.3 respectively, the fragment is encoded as [0.1, 0.6, 0.3].
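A minimal sketch of the LDA encoding using the gensim library as an assumed implementation; note that gensim's LdaModel is normally trained on bag-of-words counts, whereas the description above feeds in the TF-IDF matrix, so this is an approximation of the step rather than the exact procedure:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    def lda_codes(tokenized_corpus, num_topics):
        """Train LDA and encode every fragment as its full topic probability
        distribution, e.g. [0.1, 0.6, 0.3] for three topics."""
        dictionary = Dictionary(tokenized_corpus)
        bow = [dictionary.doc2bow(tokens) for tokens in tokenized_corpus]
        lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)
        codes = []
        for doc in bow:
            dist = lda.get_document_topics(doc, minimum_probability=0.0)
            codes.append([prob for _, prob in sorted(dist)])
        return codes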
(5) Calculation of cosine similarity
Cosine similarity evaluates the similarity of two vectors by the cosine of the angle between them: the two vectors are placed in a vector space according to their coordinate values and the cosine of the angle between them is computed. The smaller the angle, the closer the cosine is to 1, the more closely the two vectors point in the same direction and the more similar they are. The calculation formula is
cos(θ) = Σ(Xi × Yi) / (sqrt(Σ Xi²) × sqrt(Σ Yi²))
where X and Y are the two vectors whose similarity is to be calculated, Xi and Yi are the i-th elements of X and Y respectively, the sums run over the n elements of the vectors, and θ is the angle between X and Y in the vector space.
The invention uses cosine similarity as the similarity measure between the topic encoding and the teaching material encodings. The three methods above yield three encodings; for each encoding, the cosine similarity between the topic and every corpus fragment can be calculated and recorded.
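A direct implementation of the cosine similarity formula above:

    import numpy as np

    def cosine_similarity(x, y):
        """cos(theta) = sum_i(x_i * y_i) / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))"""
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))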
4) Construct a training set from the teaching material processed in step 1) and train the BERT neural network.
The three models used above are all traditional semantic matching methods. The BERT model is introduced so that the model obtains deeper semantic understanding; as a neural network model, BERT is not affected by the quality of word segmentation and can handle words that are not in the dictionary.
The training data set does not use the corpus processed in 2), mainly because the BERT model released by google already provides the relevant preprocessing functions and only segmentation of the material is required. After the data set is read, the BERT pipeline first converts the original corpus to Unicode, calls the relevant functions to remove stop words and words without practical meaning, and calls FullTokenizer for word segmentation. The words of the corpus are then converted to ids and input to the BERT model.
The overall task of BERT here can be regarded as binary classification: for two input sentences, decide whether they are related. Therefore only positive and negative examples need to be prepared. For the positive examples, the cosine similarities of the encoding modes in 3) are calculated in advance; the larger the cosine similarity, the smaller the angle between the two corpus fragments in the vector space and the more similar they can be considered. Accordingly, the N corpus pairs with the highest similarity ranking are selected as positive examples, and their quality is ensured by manual review. For the negative examples, any two sentences from different paragraphs can be selected; to guarantee training quality, a human confirms that the two sentences are unrelated.
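A minimal sketch of the training-pair construction; the manual review of positive and negative examples required by the description is omitted here, and the data structures (a precomputed pairwise similarity table and a list of fragments) are illustrative assumptions:

    import random

    def build_training_pairs(fragments, pair_similarity, top_n, num_negatives):
        """fragments: list of fragment texts; pair_similarity[(i, j)]: combined cosine
        similarity of fragments i and j under the statistical encodings (precomputed).
        Returns (sentence_a, sentence_b, label) triples for BERT fine-tuning."""
        ranked = sorted(pair_similarity.items(), key=lambda kv: kv[1], reverse=True)
        positives = [(fragments[i], fragments[j], 1) for (i, j), _ in ranked[:top_n]]

        negatives = []
        while len(negatives) < num_negatives:
            i, j = random.sample(range(len(fragments)), 2)
            # The description additionally requires the pair to come from different
            # paragraphs and to be confirmed as unrelated by a human reviewer.
            negatives.append((fragments[i], fragments[j], 0))
        return positives + negatives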
5) Train the BERT neural network with the data set from 4), use the trained model to encode the teaching material and the question bank, and use cosine similarity as the semantic similarity between a question and the corpus fragments.
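A minimal sketch of encoding a question or fragment with BERT. The patent uses Google's original BERT release with FullTokenizer; this sketch substitutes the Hugging Face transformers port and the bert-base-chinese checkpoint as assumptions, and takes the [CLS] hidden state as the encoding (the description does not specify which output vector is used):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
    model = BertModel.from_pretrained("bert-base-chinese")
    model.eval()

    def bert_encode(text):
        """Encode a question or corpus fragment; the [CLS] hidden state serves as the code."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()

    # similarity = cosine_similarity(bert_encode(question), bert_encode(fragment))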
6) With the voting-based semantic matching ensemble model, select K fragments as the knowledge point positioning result according to how often each teaching material fragment ranks near the top under the four encoding modes and its similarity.
The specific implementation steps of the voting-based semantic matching ensemble model are described here with reference to FIG. 4:
(1) A value K is set manually; K is the number of knowledge point positions the user wants to determine;
(2) For each of the four encoding modes, the K corpus fragments with the largest cosine similarity are taken as candidate set 1, with identical fragments merged into one;
(3) For each fragment in candidate set 1, the number of times it occurs among the four encoding modes is counted, and the first K fragments, ordered from most to least frequent, are selected as candidate set 2; because fragments with the same occurrence count rank equally, more than K fragments may be kept at this stage;
(4) For all fragments in candidate set 2, the cosine similarities of the four encoding modes are summed, the fragments are ordered from largest to smallest, and the top K fragments are selected and returned as the final positioning result.
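A minimal sketch of steps (1) to (4), following the tie-breaking order of claim steps 6-1) to 6-4) (occurrence count first, summed cosine similarity second); the data structures for the per-encoding rankings and similarities are illustrative assumptions:

    from collections import Counter

    def vote(rankings, similarities, k):
        """rankings: per encoding, fragment ids ordered by descending cosine similarity;
        similarities: per encoding, a dict mapping fragment id to its cosine similarity."""
        # Candidate set 1: the union of the top-K lists of the four encodings.
        top_lists = [ranked[:k] for ranked in rankings]

        # Candidate set 2: fragments appearing most often in those lists; ties on the
        # boundary are all kept, so the set may temporarily exceed K fragments.
        counts = Counter(fid for top in top_lists for fid in top)
        ordered = counts.most_common()
        cutoff = ordered[k - 1][1] if len(ordered) >= k else 0
        candidate2 = [fid for fid, c in ordered if c >= cutoff]

        # Final ranking: occurrence count first, summed cosine similarity as tie-breaker.
        def total_sim(fid):
            return sum(sims.get(fid, 0.0) for sims in similarities)
        candidate2.sort(key=lambda fid: (counts[fid], total_sim(fid)), reverse=True)
        return candidate2[:k]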
The above examples are only preferred embodiments of the present invention. It should be noted that those skilled in the art may make several modifications and equivalent substitutions without departing from the principles of the invention, and such modifications and equivalent substitutions also fall within the scope of the invention.

Claims (2)

1. A method for establishing a semantic matching model for knowledge point positioning, the method comprising the steps of:
1) for the given teaching material in the electrical field, dividing the material by paragraph to facilitate knowledge point positioning, and recording the chapter information and page number of each paragraph;
2) removing stop words and meaningless words from the teaching material, segmenting the remaining text into words to form a corpus, and constructing a dictionary;
3) for any specified topic and the processed corpus, calculating the TF-IDF values of the topic and all corpus fragments according to the dictionary constructed in 2), encoding the corpus and the topic with the statistics-based semantic matching methods TF-IDF, LSI and LDA respectively, and then calculating the cosine similarity between the encoding of the topic and the encodings of all corpus fragments as their semantic similarity; in step 3), the corpus and the topics are encoded as follows:
3-1) forming a dictionary from the word segmentation results of step 2), and for each corpus fragment and each topic calculating the TF-IDF value of every word in the dictionary to obtain the corresponding encoding;
3-2) using the TF-IDF encodings of the corpus fragments and topics obtained in 3-1) as column vectors to form a TF-IDF matrix, then performing topic-model analysis with LDA to obtain the probability that each corpus fragment belongs to each topic, and taking these probability values as an encoding;
3-3) LSI encoding, specifically decomposing the TF-IDF matrix obtained in 3-2) with SVD, that is:
A = UΣV^T
where A is the TF-IDF matrix, U is an orthogonal matrix whose column vectors are the left singular vectors, V is also an orthogonal matrix whose column vectors are the right singular vectors, and Σ is a rectangular diagonal matrix; from the decomposition result, the column vectors of the matrix V^T are selected as the encodings of the corpus fragments;
3-4) for the three encoding modes, calculating the cosine similarity between the topic to be solved and the teaching material fragments to measure their similarity;
4) constructing a training set from the teaching material fragments processed in step 1), and training the BERT neural network:
4-1) considering together the cosine similarities obtained by the three methods in 3), selecting the N corpus pairs with the highest similarity ranking as positive examples;
4-2) randomly selecting sentences from different chapters as negative examples;
5) training the BERT neural network with the data set from 4), using the trained model to encode the corpus fragments and the topic to be solved, and using cosine similarity as the semantic similarity between the topic and the corpus fragments;
6) with the voting-based semantic matching ensemble model, selecting K fragments as the knowledge point positioning result according to how often each corpus fragment ranks near the top under the four encoding modes and its cosine similarity; the voting-based semantic matching ensemble model in step 6) determines the final knowledge point positioning result as follows:
6-1) the user sets the number K of knowledge point positions required; for each of the four encoding modes, the K corpus fragments with the largest cosine similarity are taken as candidate set 1;
6-2) the total number of times each corpus fragment occurs in candidate set 1 is counted, and the K fragments with the most occurrences are selected from the candidate set as candidate set 2, where fragments tied on the occurrence count are all kept, so the result may exceed K fragments;
6-3) within candidate set 2, the fragments are sorted by occurrence count from large to small; for fragments with the same count, the sum of their cosine similarities under the four encoding modes is calculated, and the higher the similarity, the higher the rank;
6-4) the first K fragments are taken as the final matching result.
2. The method for establishing a semantic matching model for knowledge point positioning according to claim 1, wherein the encoding by the BERT neural network in step 5) proceeds as follows:
5-1) removing stop words and meaningless words from the teaching material fragments and topics obtained in 1);
5-2) segmenting the teaching material fragments and topics into words and encoding them as token ids;
5-3) feeding the resulting ids into the model and taking the encoding output by the BERT neural network as the result.
CN202110319217.9A 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning Active CN112926340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110319217.9A CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110319217.9A CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Publications (2)

Publication Number Publication Date
CN112926340A (en) 2021-06-08
CN112926340B true (en) 2024-05-07

Family

ID=76175948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110319217.9A Active CN112926340B (en) 2021-03-25 2021-03-25 Semantic matching model for knowledge point positioning

Country Status (1)

Country Link
CN (1) CN112926340B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113986968B (en) * 2021-10-22 2022-09-16 广西电网有限责任公司 Scheme intelligent proofreading method based on electric power standard standardization datamation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
WO2020140635A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text matching method and apparatus, storage medium and computer device
CN111753550A (en) * 2020-06-28 2020-10-09 汪秀英 Semantic parsing method for natural language

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
CN107291688A (en) * 2017-05-22 2017-10-24 南京大学 Judgement document's similarity analysis method based on topic model
CN109344236A (en) * 2018-09-07 2019-02-15 暨南大学 One kind being based on the problem of various features similarity calculating method
WO2020140635A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Text matching method and apparatus, storage medium and computer device
CN111753550A (en) * 2020-06-28 2020-10-09 汪秀英 Semantic parsing method for natural language

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on multi-document concept map construction based on open-domain extraction; Sheng Yongpan et al.; Application Research of Computers; 2020-01-31; Vol. 37, No. 1; full text *

Also Published As

Publication number Publication date
CN112926340A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
Yan et al. Docchat: An information retrieval approach for chatbot engines using unstructured documents
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN109948143B (en) Answer extraction method of community question-answering system
CN105183833B (en) Microblog text recommendation method and device based on user model
CN110134946B (en) Machine reading understanding method for complex data
CN111291188B (en) Intelligent information extraction method and system
CN107229610A (en) The analysis method and device of a kind of affection data
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN109766544A (en) Document keyword abstraction method and device based on LDA and term vector
CN110705247B (en) Based on x2-C text similarity calculation method
CN114357127A (en) Intelligent question-answering method based on machine reading understanding and common question-answering model
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
CN115248839A (en) Knowledge system-based long text retrieval method and device
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111259156A (en) Hot spot clustering method facing time sequence
CN112926340B (en) Semantic matching model for knowledge point positioning
CN110704638A (en) Clustering algorithm-based electric power text dictionary construction method
TWI734085B (en) Dialogue system using intention detection ensemble learning and method thereof
Zi et al. SOM-NCSCM: An efficient neural chinese sentence compression model enhanced with self-organizing map
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant