CN114595337A

CN114595337A - Method for constructing curriculum knowledge graph based on GMM

Info

Publication number: CN114595337A
Application number: CN202210109036.8A
Authority: CN
Inventors: 许涛; 马夏青; 许遨鹏; 张自祥; 沈夏炯; 韩道军; 张磊
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-06-07
Anticipated expiration: 2042-01-28
Also published as: CN114595337B

Abstract

The invention discloses a method for constructing a curriculum knowledge graph based on a Gaussian mixture model (GMM), comprising the following steps: grouping test questions by chapter, preprocessing the test questions in each chapter to obtain structured test question data, and performing jieba Chinese word segmentation on the valid test question strings to generate a word-frequency matrix of the test questions; based on the processed word-frequency matrix, clustering the test question knowledge points and extracting their features with a GMM to generate a test question knowledge point model; in parallel, identifying chapter knowledge points and extracting their features from the structured test question data to generate a chapter knowledge point model; and, based on the two models, integrating the test question knowledge points and chapter knowledge points into a curriculum knowledge graph using knowledge graph technology. The invention takes a large body of course test questions as its research object, identifies test question knowledge points and their associations by Gaussian mixture clustering, combines them with the existing chapter knowledge point system, and reconstructs the course knowledge system using knowledge graph technology.

Description

Method for constructing curriculum knowledge graph based on GMM

Technical Field

The invention belongs to the technical field of education data mining, and particularly relates to a method for constructing a course knowledge graph based on GMM.

Background

To meet the needs of sustainable economic and social development, China is committed to building a high-quality education system, and the rapid development of information technology together with the continuous emergence of educational big data provides new opportunities for improving the quality of education and teaching. In the existing teaching mode, the course syllabus is an important basis for course teaching and for evaluating teaching quality, and course knowledge points are generally organized by chapter hierarchy into a tree-shaped knowledge system that guides student learning and sets course assessment standards. However, under intense academic competition, especially in secondary education, test questions are designed with ingenuity so that examinations can serve their selection function, and they emphasize the fusion and comprehensive application of knowledge points. Students therefore have to rely on intensive question drilling (the so-called "sea of questions" tactic) to strengthen their understanding and comprehensive application of knowledge, and course assessment gradually departs from the paradigm specified by the course syllabus. The tree-shaped knowledge system in the existing course syllabus can no longer meet the requirements of course teaching and examination.

In addition, as society and the economy develop, new technologies and knowledge keep emerging, and the expression and understanding of course knowledge points change over time. For example, in Chinese middle school textbooks of different periods, the treatment of crystals has evolved from using the word "crystallization" and covering crystals only as a knowledge point within the concept of solutions, to a dedicated topic on crystals that adds crystal classification, structure models, unit cells, close packing and simple crystal-related calculations. Therefore, a course knowledge system should present knowledge point characteristics comprehensively, accurately and dynamically, and it should be a network structure that expresses many-to-many associations between knowledge points.

Disclosure of Invention

The invention provides a method for constructing a course knowledge graph based on GMM (Gaussian mixture clustering), which aims at solving the problem that a tree-shaped knowledge system in the existing course outline cannot comprehensively, accurately and dynamically present knowledge point characteristics.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for constructing a curriculum knowledge graph based on GMM comprises the following steps:

Step A, grouping the test questions by chapter, and preprocessing and segmenting the test questions in each chapter: test questions containing invalid characters and picture content are cleaned to obtain structured test question data, jieba word segmentation is then performed on the valid test question strings to generate a word-frequency matrix of the test questions, and the word-frequency matrix is processed;

Step B, identifying and extracting the features of the two types of knowledge points, namely test question knowledge points and chapter knowledge points: based on the processed word-frequency matrix, test question knowledge points are clustered and their features extracted with a Gaussian mixture model (GMM) to generate a test question knowledge point model; in parallel, chapter knowledge points are identified and their features extracted from the structured test question data to generate a chapter knowledge point model;

Step C, based on the generated test question knowledge point model and chapter knowledge point model, integrating the two types of knowledge points into a curriculum knowledge graph using knowledge graph technology.

Further, step A includes:

Step A1, modeling the test questions: the picture parts of the test questions are deleted and useless characters are cleaned, where useless characters include digits, letters, spaces and line feeds; the structure of the test question data is organized, the attributes "chapter", "question type" and "score" are introduced, and a structured test question data model is created. The cleaned and preprocessed question bank is defined as the set of test questions $q_i$, $i = 1, \dots, I$, where $q_i = \{V, W, e, t\}$ denotes the i-th test question, V denotes the cleaned test question string, W denotes the set of valid test question words, e denotes the chapter to which the test question belongs, t denotes the question type, and I denotes the number of test questions in the question bank;

Step A2, generating the word-frequency matrix of the test questions: Chinese word segmentation is performed on the valid test question strings with jieba, decomposing each valid string into a phrase set $W^a$; irrelevant phrases and low-frequency phrases are then filtered out using the stop word list $W_s$ to obtain the valid word set W of the test question. The complete phrase set over all test questions consists of the segmented words $w_j$, where $w_j$ is the j-th segmented word; the word-frequency matrix of the question bank has entries $f_{ij}$, where $f_{ij}$ indicates whether the j-th segmented word $w_j$ appears in the i-th test question $q_i$;

Step A3, de-duplicating the word-frequency matrix by merging strongly correlated phrase columns: a phrase correlation threshold ρ is defined, and if the Pearson correlation coefficient of any two phrases x and y satisfies $\rho_{x,y} > 0.8$, the two phrases are merged into a new phrase $w_{x,y} = w_x \cup w_y$, and $w_x$ and $w_y$ are deleted; upper and lower phrase coverage thresholds $[\eta_{min}, \eta_{max}]$ are defined, and only phrase features within this range are used for knowledge point identification.

Further, step B includes:

Step B1, for the question bank, the probability distribution p(q) of the Gaussian mixture model (GMM) of a test question q is given by formula (1), where K denotes the number of knowledge points contained in the test questions; the k-th component is the Gaussian density function of the k-th knowledge point; $\alpha_k$ denotes the probability that a test question contains the k-th knowledge point and satisfies $\sum_{k=1}^{K} \alpha_k = 1$; $\mu_k$ denotes the mean; and $\Sigma_k$ denotes the covariance;

The parameters $\alpha_k$, $\mu_k$ and $\Sigma_k$ are estimated with the EM algorithm: first, the probability $\gamma_{ik}$ that test question $q_i$ contains knowledge point k is calculated as shown in formula (2); then $\alpha_k$, $\mu_k$ and $\Sigma_k$ are calculated as shown in formulas (3) to (5);

Given a convergence threshold ε, formulas (2) to (5) are computed iteratively until $|\alpha'_k - \alpha'_{k-1}| \le \varepsilon$, which yields $\alpha_k$, $\mu_k$ and $\Sigma_k$;

The distribution of knowledge points in the test questions is determined using formula (1), and a many-to-many mapping between test questions and knowledge points is constructed; a range of cluster numbers is set, the clustering results are evaluated with the Bayesian information criterion, and the optimal number of clusters is selected, which is the number of knowledge points in each chapter;

Step B2, according to the clustering result of the test questions, $K_i$ denotes the i-th test question knowledge point in the complete set of test question knowledge points; the phrases that co-occur most frequently in the test questions belonging to a knowledge point are used to express the features of that knowledge point: a test question knowledge point K is described by $\langle q \rangle$ and $\langle w \rangle$, which respectively denote the test question set and the feature set of K; $\langle w \rangle$ is composed of the several phrases with the highest coverage of the test question knowledge point, and the overall coverage of the feature set $\langle w \rangle$ over the test question knowledge point is 100%;

Step B3, performing chapter knowledge point identification and feature extraction: based on the chapter attribute q.e of each test question in the structured question bank, the chapter-level knowledge structure of the course is identified, and each chapter knowledge point is identified; a set of high-coverage phrases is used to represent the features of a chapter knowledge point: within the complete set of course chapters, a chapter knowledge point C is described by $\langle q \rangle$ and $\langle w \rangle$, which respectively denote the test question set and the feature set of C; $\langle w \rangle$ is composed of the several phrases with the highest coverage of the chapter knowledge point, and the overall coverage of the feature set $\langle w \rangle$ over the chapter knowledge point is 100%.

Further, step C includes:

The RDF technology of knowledge graphs is adopted to define the course knowledge system so as to describe the associations between knowledge point entities and the associations between chapter entities and knowledge point entities;

In the triples representing knowledge point entities and their relationships, the association between knowledge point entities $K_a$ and $K_b$ is the set of phrases that co-occur in $K_a$ and $K_b$; $G^c = \langle\langle C_x, e, K_y \rangle\rangle$ represents the association between chapter entity $C_x$ and knowledge point entity $K_y$, where the association e is consistent with the chapter attribute of the test question.

Compared with the prior art, the invention has the following beneficial effects:

The invention takes a large body of course test questions as its research object, identifies test question knowledge points and their associations by Gaussian mixture clustering, combines them with the existing chapter knowledge point system, and reconstructs the course knowledge system using knowledge graph technology. Because knowledge points and their associations are expressed with characteristic phrases from the test questions, the knowledge system can be built simply and efficiently, without having to interpret or compute the complex semantics of the knowledge points.

Drawings

Fig. 1 is a flowchart of a method for constructing a curriculum knowledge graph based on a GMM according to an embodiment of the present invention.

Detailed Description

The invention is further illustrated by the following examples in conjunction with the accompanying drawings:

The concepts related to knowledge points are first defined. The invention defines "test question knowledge points" as knowledge entities that are derived from the test questions and characterized by several phrases; "chapter knowledge points" are the knowledge entities in the course syllabus that are defined by the chapter hierarchy. Based on the data of a course test question database, the invention aims to identify and describe test question knowledge points and chapter knowledge points and to reconstruct the course knowledge graph.

The method is roughly divided into three steps. First, the original test question data is preprocessed and segmented into Chinese words, which covers test question modeling, word-frequency matrix generation, and phrase merging and screening, and produces a structured question bank and a word-frequency matrix. Second, the features of the two types of knowledge points are identified and extracted: test question knowledge points are clustered and their features extracted with a Gaussian mixture model (GMM) to generate a test question knowledge point model, and in parallel chapter knowledge points are identified and their features extracted to generate a chapter knowledge point model. Finally, the two types of knowledge points are integrated into a course knowledge graph using knowledge graph technology. The technical roadmap is shown in Fig. 1.

Specifically, the method for constructing the curriculum knowledge graph based on the GMM comprises the following steps:

Step A: the test questions are first grouped by chapter, and the test questions in each chapter are preprocessed and segmented into Chinese words. To allow quantitative computation on the test questions, test questions containing text, pictures and other content must first be preprocessed and cleaned to obtain structured test question data; jieba word segmentation is then performed on the valid test question strings to generate the word-frequency matrix of the test questions. Preprocessing and cleaning of the test question data comprises test question modeling, word-frequency matrix generation, and phrase merging and screening.

The step A comprises the following specific steps:

Step A1, modeling the test questions. Because the method processes only the textual part of a test question, pictures are invalid information and are deleted; useless characters (such as digits, letters, spaces and line feeds) are also removed to improve the accuracy of word segmentation. Preprocessing refers to organizing the structure of the test question data, introducing attributes such as "chapter", "question type" and "score", and creating a structured test question data model. The cleaned and preprocessed question bank can be defined as the set of test questions $q_i$, $i = 1, \dots, I$, where $q_i = \{V, W, e, t\}$ represents the i-th test question, V represents the cleaned test question string, W represents the valid word set (the valid test question string set), i.e., the segmentation result, e represents the chapter to which the test question belongs, t represents the question type, and I represents the number of test questions in the question bank.
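
The cleaning and modeling in step A1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Question` class, the `clean` and `build_bank` helpers, and the `[img:...]` picture placeholder are all assumptions introduced here, and a real question bank may mark pictures differently.

```python
import re
from dataclasses import dataclass, field

# Characters the text treats as useless: digits, Latin letters, spaces, line feeds.
_USELESS = re.compile(r"[0-9A-Za-z\s]+")
# Hypothetical picture placeholder; an assumption, not a format named in the source.
_IMAGE = re.compile(r"\[img:[^\]]*\]")


@dataclass
class Question:
    V: str                                # cleaned test question string
    e: str                                # chapter the question belongs to
    t: str                                # question type
    W: set = field(default_factory=set)   # valid word set, filled in by step A2


def clean(raw: str) -> str:
    """Drop picture placeholders first, then digits, letters and whitespace."""
    return _USELESS.sub("", _IMAGE.sub("", raw))


def build_bank(rows):
    """rows: iterable of (raw_text, chapter, question_type) tuples."""
    return [Question(V=clean(text), e=chapter, t=qtype) for text, chapter, qtype in rows]
```

Each `Question` mirrors the tuple {V, W, e, t}; W stays empty until word segmentation has been run.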

Step A2, generating the word-frequency matrix of the test questions. Chinese word segmentation of the test question strings is carried out with jieba, a widely used word segmentation component, which decomposes each valid test question string into a phrase set $W^a$; irrelevant phrases, such as prepositions and adjectives, and low-frequency phrases are then filtered out using the stop word list $W_s$ to obtain the valid word set W of the test question. The complete phrase set over all test questions consists of the segmented words $w_j$, where $w_j$ is the j-th segmented word. The word-frequency matrix of the question bank has entries $f_{ij}$, where $f_{ij}$ indicates whether the j-th segmented word $w_j$ appears in the i-th test question $q_i$.
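
A small sketch of how the binary word-frequency matrix of step A2 might be built is given below. It assumes the questions are already segmented (for example with `jieba.lcut`), and the `min_df` cut-off for low-frequency phrases is a hypothetical parameter, since the text does not state the threshold it uses.

```python
from collections import Counter


def filter_tokens(tokens, stop_words, doc_freq, min_df=2):
    """Keep tokens that are not stop words and occur in at least min_df questions."""
    return {w for w in tokens if w not in stop_words and doc_freq[w] >= min_df}


def build_matrix(segmented, stop_words, min_df=2):
    """segmented: one token list per question, e.g. produced by jieba.lcut(q.V).

    Returns (vocab, F) where F[i][j] is 1 if vocab[j] occurs in question i, else 0.
    """
    # Document frequency: in how many questions each token appears.
    doc_freq = Counter(w for tokens in segmented for w in set(tokens))
    valid = [filter_tokens(tokens, stop_words, doc_freq, min_df) for tokens in segmented]
    vocab = sorted(set().union(*valid))
    F = [[1 if w in words else 0 for w in vocab] for words in valid]
    return vocab, F
```

The entries record presence or absence rather than counts, matching the description of $f_{ij}$ as "whether the word appears".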

Step A3, the word-frequency matrix is then de-duplicated by merging strongly correlated phrase columns, which improves the independence of the phrase columns and supports the subsequent analysis. We define a phrase correlation threshold ρ: if the Pearson correlation coefficient of any two phrases x and y satisfies $\rho_{x,y} > 0.8$, the two phrases are merged into a new phrase $w_{x,y} = w_x \cup w_y$, and $w_x$ and $w_y$ are deleted. In addition, to improve the discriminating power of the phrase features, we define upper and lower phrase coverage thresholds $[\eta_{min}, \eta_{max}]$, and only phrase features within this range are used for knowledge point identification. Because test question knowledge points and chapter knowledge points are computed for different purposes, their coverage thresholds differ slightly: the coverage threshold for test question knowledge points is [4%, 60%], and the coverage threshold for chapter knowledge points is [4%, 100%].
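
The merging and screening of step A3 could look roughly like the sketch below. Two points are assumptions made for illustration: the merged column is taken as the element-wise OR of the two original columns (one reading of $w_{x,y} = w_x \cup w_y$), and merged phrases are named by joining the originals with "+".

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length 0/1 columns; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def merge_correlated(columns, rho=0.8):
    """columns: dict mapping a phrase to its 0/1 column over all questions.

    Repeatedly merges the first pair whose correlation exceeds rho until no
    such pair remains.
    """
    cols = dict(columns)
    merged = True
    while merged:
        merged = False
        names = list(cols)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if pearson(cols[a], cols[b]) > rho:
                    union = [max(u, v) for u, v in zip(cols[a], cols[b])]
                    del cols[a], cols[b]
                    cols[a + "+" + b] = union
                    merged = True
                    break
            if merged:
                break
    return cols


def filter_by_coverage(columns, eta_min=0.04, eta_max=0.60):
    """Keep phrases whose share of questions lies within [eta_min, eta_max]."""
    return {w: c for w, c in columns.items() if eta_min <= sum(c) / len(c) <= eta_max}
```

For chapter knowledge points the same filter would be called with `eta_max=1.0`, following the thresholds stated above.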

Step B: after the preprocessing of step A, the test question word-frequency matrix is taken to follow a mixture of Gaussian distributions over several knowledge features, so a Gaussian mixture model (GMM) can be used to identify test question knowledge points and express their features. The chapter hierarchy of the course knowledge can be identified by querying the question bank, and its features are expressed by high-frequency phrases.

The step B comprises the following specific steps:

Step B1. A GMM decomposes the data into several components, each described by a Gaussian probability density function, and quantifies the data through these functions. It suits data records that contain several distribution characteristics, and a test question can be regarded as a mixed probability distribution over several knowledge features.

For the question bank, assuming that the test questions contain K knowledge points, the probability distribution of the Gaussian mixture model of a test question q is given by formula (1), where the k-th component is the Gaussian density function of the k-th knowledge point, $\alpha_k$ represents the probability that a test question contains the k-th knowledge point and satisfies $\sum_{k=1}^{K} \alpha_k = 1$, $\mu_k$ represents the mean, and $\Sigma_k$ represents the covariance.

The parameters $\alpha_k$, $\mu_k$ and $\Sigma_k$ in formula (1) are estimated with the EM (expectation-maximization) algorithm. First, the probability $\gamma_{ik}$ that test question $q_i$ contains knowledge point k is calculated as shown in formula (2); then $\alpha_k$, $\mu_k$ and $\Sigma_k$ are calculated as shown in formulas (3) to (5).
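
For orientation, the mixture density and EM updates referred to as formulas (1) to (5) take the following form in the textbook formulation, written with the symbols defined above. The density symbol $\phi$ and the assignment of the three parameter updates to the numbers (3), (4) and (5) are assumptions.

```latex
\begin{aligned}
p(q) &= \sum_{k=1}^{K} \alpha_k \, \phi(q \mid \mu_k, \Sigma_k),
  \qquad \sum_{k=1}^{K} \alpha_k = 1 && (1) \\
\gamma_{ik} &= \frac{\alpha_k \, \phi(q_i \mid \mu_k, \Sigma_k)}
  {\sum_{j=1}^{K} \alpha_j \, \phi(q_i \mid \mu_j, \Sigma_j)} && (2) \\
\alpha_k &= \frac{1}{I} \sum_{i=1}^{I} \gamma_{ik} && (3) \\
\mu_k &= \frac{\sum_{i=1}^{I} \gamma_{ik} \, q_i}{\sum_{i=1}^{I} \gamma_{ik}} && (4) \\
\Sigma_k &= \frac{\sum_{i=1}^{I} \gamma_{ik} \, (q_i - \mu_k)(q_i - \mu_k)^{\top}}
  {\sum_{i=1}^{I} \gamma_{ik}} && (5)
\end{aligned}
```

Here $q_i$ stands for the word-frequency vector of the i-th test question, and $\phi(\cdot \mid \mu_k, \Sigma_k)$ is the multivariate Gaussian density with mean $\mu_k$ and covariance $\Sigma_k$.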

Given a convergence threshold ε, formulas (2) to (5) are computed iteratively until $|\alpha'_k - \alpha'_{k-1}| \le \varepsilon$, which yields $\alpha_k$, $\mu_k$ and $\Sigma_k$. Using formula (1), the distribution of knowledge points in the test questions can be determined, and a many-to-many mapping between test questions and knowledge points is constructed. We set the range of cluster numbers to [2, 30], evaluate the clustering results with the Bayesian information criterion (BIC), and select the optimal number of clusters, which is the number of knowledge points in each chapter.
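
The EM iteration and BIC-based selection of the cluster number can be sketched in plain Python as below. Several simplifications are assumptions made to keep the example short: covariances are diagonal, initial means are taken from evenly spaced rows, variances are floored at `floor`, and the stopping rule is read as "the mixing weights change by at most ε between successive iterations". The function names are hypothetical.

```python
import math


def log_gauss(x, mu, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mu, var))


def e_step(X, alpha, mu, var):
    """Responsibilities gamma[i][k] and total log-likelihood, computed in log space."""
    gamma, loglik = [], 0.0
    for x in X:
        logs = [math.log(a) + log_gauss(x, m, v) for a, m, v in zip(alpha, mu, var)]
        top = max(logs)
        norm = top + math.log(sum(math.exp(l - top) for l in logs))
        gamma.append([math.exp(l - norm) for l in logs])
        loglik += norm
    return gamma, loglik


def fit_gmm(X, K, eps=1e-4, max_iter=200, floor=1e-3):
    """EM for a K-component GMM; returns (alpha, mu, var, gamma, loglik)."""
    n, d = len(X), len(X[0])
    mu = [list(X[(k * n) // K]) for k in range(K)]   # evenly spaced rows as initial means
    var = [[1.0] * d for _ in range(K)]
    alpha = [1.0 / K] * K
    for _ in range(max_iter):
        gamma, _ = e_step(X, alpha, mu, var)
        new_alpha = []
        for k in range(K):
            nk = sum(g[k] for g in gamma) + 1e-12    # guard against empty components
            new_alpha.append(nk / n)
            mu[k] = [sum(g[k] * x[j] for g, x in zip(gamma, X)) / nk for j in range(d)]
            var[k] = [max(floor, sum(g[k] * (x[j] - mu[k][j]) ** 2
                                     for g, x in zip(gamma, X)) / nk) for j in range(d)]
        shift = max(abs(a - b) for a, b in zip(new_alpha, alpha))
        alpha = new_alpha
        if shift <= eps:
            break
    gamma, loglik = e_step(X, alpha, mu, var)
    return alpha, mu, var, gamma, loglik


def bic(loglik, n, d, K):
    """Bayesian information criterion; lower is better."""
    n_params = (K - 1) + 2 * K * d       # weights + means + diagonal variances
    return -2.0 * loglik + n_params * math.log(n)


def select_k(X, k_range=range(2, 31)):
    """Fit each cluster count in k_range (the text uses [2, 30]) and keep the lowest BIC."""
    best = None
    for K in k_range:
        if K > len(X):
            break
        *_, loglik = fit_gmm(X, K)
        score = bic(loglik, len(X), len(X[0]), K)
        if best is None or score < best[1]:
            best = (K, score)
    return best[0] if best else None
```

Because `gamma[i][k]` is a soft assignment, one test question can carry weight on several knowledge points, which is what allows the many-to-many mapping described above.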

Step B2, according to the clustering result of the test questions, $K_i$ denotes the i-th test question knowledge point in the complete set of test question knowledge points. To describe the features of a test question knowledge point fully, we express them with the phrases that co-occur most frequently in the test questions belonging to that knowledge point. A test question knowledge point K is described by $\langle q \rangle$ and $\langle w \rangle$, which respectively denote the test question set and the feature set of K. The feature set $\langle w \rangle$ satisfies two conditions: (1) its phrases are the several phrases with the highest coverage of the knowledge point K, and (2) together they cover 100% of the test question set. In short, we choose the phrases with the highest coverage of the test question knowledge point to express its features, such that their total coverage of the knowledge point is 100%.
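
One way to satisfy both conditions on the feature set is a greedy selection, sketched below. The greedy strategy and the `feature_phrases` helper are assumptions; the text only states that the chosen phrases have the highest coverage and jointly cover every test question of the knowledge point.

```python
def feature_phrases(question_words):
    """question_words: the valid word set W of each question in one knowledge point.

    Repeatedly picks the phrase that covers the most not-yet-covered questions
    until every question contains at least one selected phrase.
    """
    uncovered = set(range(len(question_words)))
    chosen = []
    while uncovered:
        counts = {}
        for i in uncovered:
            for w in question_words[i]:
                counts[w] = counts.get(w, 0) + 1
        if not counts:          # remaining questions have no usable phrase
            break
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in question_words[i]}
    return chosen
```

The same helper applies unchanged to chapter knowledge points, with the test questions of a chapter as input.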

Step B3, performing chapter knowledge point identification and feature extraction. Each test question in the structured question bank has a chapter attribute q.e, from which we can identify the chapter-level knowledge structure of the course and each chapter knowledge point. As with test question knowledge points, we use a set of high-coverage phrases to represent the features of a chapter knowledge point. Within the complete set of course chapters, a chapter knowledge point C is described by its test question set $\langle q \rangle$ and its feature set $\langle w \rangle$. The feature phrase set $\langle w \rangle$ is extracted in the same way as the features of test question knowledge points, so the procedure is not repeated here.
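
Grouping by the chapter attribute can be illustrated with the short sketch below, which also reports each phrase's coverage within its chapter. The input layout (a list of `(q.e, q.W)` pairs) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def chapter_knowledge_points(bank):
    """bank: list of (chapter, word_set) pairs, i.e. (q.e, q.W) for each question.

    Returns {chapter: (question indices, {phrase: coverage within the chapter})}.
    """
    groups = defaultdict(list)
    for i, (chapter, _) in enumerate(bank):
        groups[chapter].append(i)
    result = {}
    for chapter, idx in groups.items():
        counts = defaultdict(int)
        for i in idx:
            for w in bank[i][1]:
                counts[w] += 1
        # Coverage is the share of the chapter's questions that contain the phrase.
        result[chapter] = (idx, {w: c / len(idx) for w, c in counts.items()})
    return result
```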

Step C, on the basis of the feature expression of course knowledge, knowledge graph technology is used to compute the knowledge associations and construct the course knowledge graph. Using the RDF (Resource Description Framework) technology of knowledge graphs, we define the course knowledge system so that it describes two types of associations of course knowledge: associations between knowledge point entities, and associations between chapter entities and knowledge point entities.

In the triples representing knowledge point entities and their relationships, the association between knowledge point entities $K_a$ and $K_b$ is the set of phrases that co-occur in $K_a$ and $K_b$. Because this representation expresses knowledge points and their associations with characteristic phrases from the test questions, the knowledge system can be built simply and efficiently, without having to interpret or compute the complex semantics of the knowledge points. $G^c = \langle\langle C_x, e, K_y \rangle\rangle$ represents the association between chapter entity $C_x$ and knowledge point entity $K_y$, where the association e is consistent with the chapter attribute of the test question.
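
The two kinds of triples can be assembled as plain tuples, as in the sketch below. The data layout, the function name, and the rule that a chapter is linked to a knowledge point when they share at least one test question are assumptions; serializing the triples to an actual RDF store is left out.

```python
def build_triples(knowledge_points, chapters):
    """knowledge_points: {K_id: (set of question ids, set of feature phrases)}
    chapters: {C_id: (chapter attribute e, set of question ids)}

    Returns (kp_triples, chapter_triples):
      kp_triples      holds (K_a, shared phrases, K_b)
      chapter_triples holds (C_x, e, K_y)
    """
    kp_triples, chapter_triples = [], []
    ids = sorted(knowledge_points)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = knowledge_points[a][1] & knowledge_points[b][1]
            if shared:      # only linked when at least one phrase co-occurs
                kp_triples.append((a, frozenset(shared), b))
    for c in sorted(chapters):
        e, chapter_questions = chapters[c]
        for k in ids:
            # Assumed linking rule: the chapter and the knowledge point share a question.
            if chapter_questions & knowledge_points[k][0]:
                chapter_triples.append((c, e, k))
    return kp_triples, chapter_triples
```

Since a knowledge point may share questions with several chapters and phrases with several other knowledge points, the resulting structure is a network rather than a tree.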

In conclusion, the invention takes a large body of course test questions as its research object, identifies test question knowledge points and their associations by Gaussian mixture clustering, combines them with the existing chapter knowledge point system, and reconstructs the course knowledge system using knowledge graph technology. Because knowledge points and their associations are expressed with characteristic phrases from the test questions, the knowledge system can be built simply and efficiently, without having to interpret or compute the complex semantics of the knowledge points.

The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims

1. A method for constructing a curriculum knowledge graph based on GMM is characterized by comprising the following steps:

2. The method of claim 1, wherein step A comprises:

Step A1, modeling the test questions, including deleting the picture parts of the test questions and cleaning useless characters, wherein the useless characters include digits, letters, spaces and line feeds; organizing the structure of the test question data, introducing the attributes "chapter", "question type" and "score", and creating a structured test question data model; the cleaned and preprocessed test questions form the question bank;

Step A2, generating the word-frequency matrix of the test questions, including performing Chinese word segmentation on the valid test question strings with jieba, decomposing each valid test question string into a phrase set $W^a$, and then filtering out irrelevant phrases and low-frequency phrases using the stop word list $W_s$; the complete phrase set over all test questions consists of the segmented words $w_j$, where $w_j$ is the j-th segmented word; each entry of the word-frequency matrix of the question bank indicates whether the j-th segmented word $w_j$ appears in the i-th test question $q_i$;

Step A3, de-duplicating the word-frequency matrix by merging strongly correlated phrase columns: a phrase correlation threshold ρ is defined, and if the Pearson correlation coefficient $\rho_{x,y}$ of any two phrases x and y is greater than 0.8, the two phrases are merged into a new phrase $w_{x,y} = w_x \cup w_y$, and $w_x$ and $w_y$ are deleted; upper and lower phrase coverage thresholds $[\eta_{min}, \eta_{max}]$ are defined, and only phrase features within this range are used for knowledge point identification.

3. The method of claim 2, wherein step B comprises:

Step B1, for the question bank, the probability distribution p(q) of the Gaussian mixture model (GMM) of a test question q is shown in formula (1), where $\mu_k$ represents the mean and $\Sigma_k$ represents the covariance;

Determining the distribution condition of the knowledge points in the test questions by using a formula (1), and constructing a many-to-many mapping relation between the test questions and the knowledge points; setting the clustering category number, evaluating the clustering result by using a Bayesian information criterion, and selecting the optimal clustering number, namely the number of knowledge points in each chapter;

step B2, according to the clustering result of the test questions, order

Representing a corpus of knowledge points of test questions, K_iRepresenting the ith test question knowledge point; the most frequently co-occurring phrase of the test questions belonging to the knowledge point is used for expressing the characteristics of the knowledge point of the test questions: order to

A complete set of the course chapters is represented,

a chapter knowledge point is represented, where $\langle q \rangle$ and $\langle w \rangle$ respectively represent the test question set and the feature set of C; $\langle w \rangle$ is composed of the several phrases with the highest coverage of the chapter knowledge point, and the overall coverage of the feature set $\langle w \rangle$ over the chapter knowledge point is 100%.

4. The method of claim 3, wherein step C comprises:

in the triples representing knowledge point entities and their relationships, the association between knowledge point entities $K_a$ and $K_b$ is the set of phrases that co-occur in $K_a$ and $K_b$; $G^c = \langle\langle C_x, e, K_y \rangle\rangle$ represents the association between chapter entity $C_x$ and knowledge point entity $K_y$, where the association e is consistent with the chapter attribute of the test question.