CN114595337A - Method for constructing curriculum knowledge graph based on GMM - Google Patents

Method for constructing curriculum knowledge graph based on GMM

Info

Publication number
CN114595337A
Authority
CN
China
Prior art keywords
knowledge
test question
test
chapter
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210109036.8A
Other languages
Chinese (zh)
Other versions
CN114595337B (en)
Inventor
许涛
马夏青
许遨鹏
张自祥
沈夏炯
韩道军
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN202210109036.8A
Publication of CN114595337A
Application granted
Publication of CN114595337B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a GMM-based method for constructing a course knowledge graph, comprising the following steps: grouping test questions by chapter, preprocessing the test questions in each chapter to obtain structured test question data, and performing Chinese word segmentation on the valid test question strings with jieba to generate a word frequency matrix of the test questions; based on the processed word frequency matrix, clustering the test question knowledge points and extracting their features with a Gaussian mixture model (GMM) to generate a test question knowledge point model; meanwhile, identifying chapter knowledge points and extracting their features from the structured test question data to generate a chapter knowledge point model; and, based on the two models, integrating the test question knowledge points and the chapter knowledge points into a course knowledge graph by means of knowledge graph technology. The invention takes large volumes of course test questions as its research object, uses Gaussian mixture clustering to identify test question knowledge points and their associations, combines them with the existing chapter knowledge point system, and uses knowledge graph technology to reconstruct the course knowledge system.

Description

Method for constructing curriculum knowledge graph based on GMM
Technical Field
The invention belongs to the technical field of educational data mining, and in particular relates to a method for constructing a course knowledge graph based on GMM.
Background
To meet the needs of sustainable economic and social development, China is committed to building a high-quality education system, and the rapid development of information technology and the continuous emergence of educational big data provide new opportunities for improving the quality of education and teaching. In the existing teaching mode, the course syllabus is an important basis for course teaching and for teaching quality evaluation, and course knowledge points are generally organized by chapter hierarchy into a tree-shaped knowledge system that guides students' learning and establishes course assessment standards. However, under intense academic competition, especially at the secondary education stage, examinations serve a selection function, so test questions are ingeniously conceived and emphasize the fusion and comprehensive application of knowledge points. Students therefore have to rely on the "sea of questions" tactic (working through very large numbers of exercises) to strengthen their understanding and comprehensive application of knowledge, and course assessment has gradually departed from the paradigm prescribed by the course syllabus. The tree-shaped knowledge system in the existing course syllabus can no longer meet the requirements of course teaching and examination.
In addition, as society and the economy develop, new technologies and knowledge keep emerging, and the expression and understanding of course knowledge points change dynamically. For example, in Chinese secondary school textbooks of different periods, the treatment of crystals has evolved from using the word "crystallization" to refer to crystals, covered only as one knowledge point under the concept of solutions, to a dedicated treatment of crystals that adds crystal classification, structure models, unit cells, close packing and simple crystal-related calculations. The course knowledge system should therefore present knowledge point features comprehensively, accurately and dynamically, and it should be a network structure that expresses many-to-many associations between knowledge points.
Disclosure of Invention
To address the problem that the tree-shaped knowledge system in the existing course syllabus cannot present knowledge point features comprehensively, accurately and dynamically, the invention provides a method for constructing a course knowledge graph based on GMM (Gaussian mixture model) clustering.
To achieve this purpose, the invention adopts the following technical solution:
A method for constructing a curriculum knowledge graph based on GMM comprises the following steps:
step A, grouping the test questions by chapter, and preprocessing the test questions in each chapter and performing Chinese word segmentation, which includes cleaning test questions that contain invalid characters and picture content to obtain structured test question data, then performing jieba word segmentation on the valid test question strings to generate a word frequency matrix of the test questions, and processing the word frequency matrix;
step B, identifying and extracting the features of the two types of knowledge points, namely test question knowledge points and chapter knowledge points, which includes clustering test question knowledge points and extracting their features with a GMM (Gaussian mixture model) based on the processed word frequency matrix to generate a test question knowledge point model, and meanwhile identifying chapter knowledge points and extracting their features based on the structured test question data to generate a chapter knowledge point model;
step C, based on the generated test question knowledge point model and chapter knowledge point model, integrating the two types of knowledge points into a course knowledge graph using knowledge graph technology.
Further, step A includes:
step A1, modeling the test questions, including deleting the picture part of each test question and cleaning useless characters, wherein the useless characters include digits, letters, spaces and line feeds; organizing the structure of the test question data, introducing the attributes "chapter", "question type" and "score" as a preprocessing operation, and creating a structured test question data model; defining the question bank that has been cleaned and preprocessed as
Figure BDA0003494446300000021
where q_i = {V, W, e, t} denotes the i-th test question, V denotes the cleaned test question string, W denotes the set of valid test question strings, e denotes the chapter to which the test question belongs, t denotes the question type, and I denotes the number of test questions in the question bank;
step A2, generating the word frequency matrix of the test questions, including performing Chinese word segmentation on the valid test question strings with jieba, decomposing each valid test question string into a phrase set W_a, and then filtering out irrelevant phrases and low-frequency phrases with the stop word list W_s, so that
Figure BDA0003494446300000022
the complete set of phrases over all test questions is expressed as
Figure BDA0003494446300000023
where w_j is the j-th segmented word; for the question bank
Figure BDA0003494446300000024
its word frequency matrix is
Figure BDA0003494446300000025
where f_ij indicates, for the question bank
Figure BDA0003494446300000026
whether the j-th segmented word w_j appears in the i-th test question q_i;
step A3, de-duplicating the word frequency matrix
Figure BDA0003494446300000027
by merging strongly correlated phrase columns: a phrase correlation threshold ρ is defined, and if the Pearson correlation coefficient ρ_{x,y} of any two phrases x and y satisfies ρ_{x,y} > 0.8, the two phrases are merged into a new phrase w_{x,y} = w_x ∪ w_y, and w_x and w_y are deleted; upper and lower phrase coverage thresholds [η_min, η_max] are defined, and only phrase features within this range are used for knowledge point identification.
Further, step B includes:
step B1, for the question bank
Figure BDA0003494446300000028
the probability distribution p(q) of the Gaussian mixture model (GMM) of a test question q is given by formula (1):
Figure BDA0003494446300000029
where K denotes the number of knowledge points contained in the test questions;
Figure BDA0003494446300000031
is the Gaussian density function of the k-th knowledge point; α_k denotes the probability that a test question contains the k-th knowledge point and satisfies
Figure BDA0003494446300000032
μ_k denotes the mean; Σ_k denotes the covariance;
the parameters α_k, μ_k and Σ_k are estimated with the EM algorithm: first, the probability γ_ik that test question q_i contains knowledge point k is calculated as shown in formula (2); then α_k, μ_k and Σ_k are calculated as shown in formulas (3) to (5):
Figure BDA0003494446300000033
Figure BDA0003494446300000034
Figure BDA0003494446300000035
Figure BDA0003494446300000036
given a convergence threshold ε, formulas (2) to (5) are computed iteratively until |α′_k − α′_{k−1}| ≤ ε, which yields α_k, μ_k and Σ_k;
the distribution of knowledge points over the test questions is determined with formula (1), and a many-to-many mapping between test questions and knowledge points is constructed; a range for the number of clusters is set, the clustering results are evaluated with the Bayesian information criterion, and the optimal number of clusters, i.e. the number of knowledge points in each chapter, is selected;
step B2, according to the clustering result of the test questions, let
Figure BDA0003494446300000037
denote the complete set of test question knowledge points, where K_i denotes the i-th test question knowledge point; the features of a test question knowledge point are expressed by the phrases that co-occur most frequently in the test questions belonging to that knowledge point: let
Figure BDA0003494446300000038
denote a test question knowledge point, where <q> and <w> denote the test question set and the feature set of K respectively; the feature set <w> consists of the phrases with the highest coverage of the test question knowledge point, and the overall coverage of <w> over the test question knowledge point is 100%;
step B3, identifying chapter knowledge points and extracting their features: based on the structured question bank
Figure BDA0003494446300000039
and the chapter attribute q.e of each of its test questions, the chapter-level knowledge structure of the course is identified, and each chapter knowledge point is identified; a set of high-coverage phrases is used to express the features of a chapter knowledge point: let
Figure BDA00034944463000000310
denote the complete set of course chapters, and let
Figure BDA00034944463000000311
denote a chapter knowledge point, where <q> and <w> denote the test question set and the feature set of C respectively; the feature set <w> consists of the phrases with the highest coverage of the chapter knowledge point, and the overall coverage of <w> over the chapter knowledge point is 100%.
Further, step C includes:
using the RDF technology of knowledge graphs, the course knowledge system is defined as
Figure BDA00034944463000000312
to describe the associations between knowledge point entities and the associations between chapter entities and knowledge point entities in the course knowledge;
Figure BDA0003494446300000041
denotes knowledge point entities and their association, where the association between knowledge point entities K_a and K_b is the set of phrases that co-occur in K_a and K_b; G_c = <<C_x, e, K_y>> denotes that the association between chapter entity C_x and knowledge point entity K_y is e, which is consistent with the chapter attribute of the test questions.
Compared with the prior art, the invention has the following beneficial effects:
The invention takes large volumes of course test questions as its research object, uses Gaussian mixture clustering to identify test question knowledge points and their associations, combines them with the existing chapter knowledge point system, and uses knowledge graph technology to reconstruct the course knowledge system. The invention expresses knowledge points and their associations with characteristic phrases of the test questions, so that a knowledge system can be constructed simply and efficiently while avoiding the need to understand and compute the complex semantics of the knowledge points.
Drawings
Fig. 1 is a flowchart of a method for constructing a curriculum knowledge graph based on a GMM according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
The concepts related to knowledge points are first defined. The invention defines "test question knowledge points" as knowledge entities that are derived from the test questions and characterized by several phrases; "chapter knowledge points" are the knowledge entities in the course syllabus that are defined by the chapter hierarchy. The invention aims to identify and describe test question knowledge points and chapter knowledge points, and to reconstruct the course knowledge graph, on the basis of course question bank data.
The method is broadly divided into three steps. First, the raw test question data are preprocessed and segmented into Chinese words, which covers test question modeling, word frequency matrix generation, and phrase merging and screening, and yields a structured question bank and a word frequency matrix. Second, the features of the two types of knowledge points are identified and extracted: test question knowledge points are clustered and their features extracted with a GMM (Gaussian mixture model) to generate a test question knowledge point model, and chapter knowledge points are identified and their features extracted to generate a chapter knowledge point model. Finally, the two types of knowledge points are integrated into a course knowledge graph using knowledge graph technology. The technical roadmap is shown in Fig. 1.
Specifically, the method for constructing the curriculum knowledge graph based on GMM comprises the following steps:
Step A: First, the test questions are grouped by chapter, and the test questions in each chapter are preprocessed and segmented into Chinese words. To compute on the test questions quantitatively, test questions containing text, pictures and other content must first be preprocessed and cleaned to obtain structured test question data; jieba word segmentation is then performed on the valid test question strings to generate the word frequency matrix of the test questions. Preprocessing and cleaning the test question data comprise test question modeling, word frequency matrix generation, and phrase merging and screening.
Step A comprises the following specific steps:
Step A1: First, the test questions are modeled. Because the method of the invention processes only the text of a test question, pictures are invalid information and must be deleted; in addition, noise characters (such as digits, letters, spaces and line feeds) must be removed to improve the accuracy of word segmentation. Preprocessing refers to organizing the structure of the test question data, introducing attributes such as "chapter", "question type" and "score", and creating a structured test question data model. The question bank that has been cleaned and preprocessed can be defined as
Figure BDA0003494446300000051
where q_i = {V, W, e, t} denotes the i-th test question, V denotes the cleaned test question string, W denotes the set of valid words (the set of valid test question strings), i.e. the segmentation result, e denotes the chapter to which the test question belongs, t denotes the question type, and I denotes the number of test questions in the question bank.
Step A2: The word frequency matrix of the test questions is generated. Chinese word segmentation of the test question strings is performed with jieba, a currently popular word segmentation component: each valid test question string is decomposed into a phrase set W_a, and the stop word list W_s is then used to filter out irrelevant phrases, such as prepositions and adjectives, and low-frequency phrases, so that
Figure BDA0003494446300000052
The complete set of phrases over all test questions is expressed as
Figure BDA0003494446300000053
where w_j is the j-th segmented word. For the question bank
Figure BDA0003494446300000054
its word frequency matrix is
Figure BDA0003494446300000055
where f_ij indicates, for the question bank
Figure BDA0003494446300000056
whether the j-th segmented word w_j appears in the i-th test question q_i.
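A minimal sketch of building the occurrence matrix f_ij of step A2 is given below. It assumes the segmentation has already been done (in practice the token lists would typically come from `jieba.lcut` applied to each cleaned string); the function name `build_occurrence_matrix` and the `min_count` cut-off for "low-frequency" phrases are assumptions, since the text does not state a numeric cut-off.

```python
def build_occurrence_matrix(token_lists, stop_words, min_count=2):
    """Return (vocab, F) where F[i][j] is 1 if word vocab[j] occurs in question i.

    token_lists: one list of segmented words (W_a) per test question.
    stop_words:  the stop word list W_s.
    min_count:   assumed cut-off; words found in fewer questions are dropped.
    """
    # Remove stop words from every question.
    kept = [[w for w in toks if w not in stop_words] for toks in token_lists]

    # Count in how many questions each remaining word occurs.
    doc_freq = {}
    for toks in kept:
        for w in set(toks):
            doc_freq[w] = doc_freq.get(w, 0) + 1

    vocab = sorted(w for w, c in doc_freq.items() if c >= min_count)
    index = {w: j for j, w in enumerate(vocab)}

    # Binary occurrence matrix: one row per question, one column per word.
    F = [[0] * len(vocab) for _ in kept]
    for i, toks in enumerate(kept):
        for w in toks:
            j = index.get(w)
            if j is not None:
                F[i][j] = 1
    return vocab, F


# English placeholder tokens stand in for segmented Chinese phrases.
tokens = [["crystal", "of", "structure"],
          ["crystal", "cell"],
          ["solution", "of", "structure"]]
vocab, F = build_occurrence_matrix(tokens, stop_words={"of"}, min_count=2)
print(vocab, F)
```

The matrix is binary because the text defines f_ij as whether a word appears, not how often.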
Step A3: Next, the word frequency matrix
Figure BDA0003494446300000057
is de-duplicated by merging strongly correlated phrase columns, which improves the independence of the phrase columns and supports subsequent analysis. We define a phrase correlation threshold ρ: if the Pearson correlation coefficient ρ_{x,y} of any two phrases x and y satisfies ρ_{x,y} > 0.8, the two phrases are merged into a new phrase w_{x,y} = w_x ∪ w_y, and w_x and w_y are deleted. In addition, to improve the discriminative power of the phrase features, we define upper and lower phrase coverage thresholds [η_min, η_max] and use only the phrase features within this range for knowledge point identification. Because test question knowledge points and chapter knowledge points have different computation targets, their phrase coverage thresholds differ slightly: the coverage threshold for test question knowledge points is [4%, 60%], and that for chapter knowledge points is [4%, 100%].
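The merging and screening of step A3 might look like the sketch below. Several points are assumptions made for illustration: the merged column w_x ∪ w_y is taken as the element-wise OR of the two columns, coverage is taken as the fraction of test questions in which a phrase occurs, both interval ends are treated as inclusive, and the names `pearson` and `merge_and_filter` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def merge_and_filter(vocab, F, rho=0.8, eta_min=0.04, eta_max=0.60):
    """Merge columns with correlation > rho, then keep columns by coverage."""
    cols = {w: [row[j] for row in F] for j, w in enumerate(vocab)}
    names = list(vocab)
    merged = True
    while merged:                       # repeat until no pair exceeds rho
        merged = False
        for a in range(len(names)):
            for b in range(a + 1, len(names)):
                x, y = names[a], names[b]
                if pearson(cols[x], cols[y]) > rho:
                    new = x + "|" + y   # label of the merged phrase
                    cols[new] = [max(p, q) for p, q in zip(cols[x], cols[y])]
                    del cols[x], cols[y]
                    names = [w for w in names if w not in (x, y)] + [new]
                    merged = True
                    break
            if merged:
                break
    total = len(F)
    keep = [w for w in names if eta_min <= sum(cols[w]) / total <= eta_max]
    return keep, [[cols[w][i] for w in keep] for i in range(total)]


vocab = ["a", "b", "c"]
F = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0]]
keep, M = merge_and_filter(vocab, F)
print(keep, M)
```

In the example, columns "a" and "b" are identical and are merged; "c" is uncorrelated with the merged column and survives the coverage filter unchanged.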
Step B: After the preprocessing in step A, the word frequency matrix of the test questions follows a mixture of Gaussian distributions over several knowledge features, so a Gaussian mixture model (GMM) can be used to identify test question knowledge points and express their features. The chapter hierarchy of the course knowledge can be identified by querying the question bank, and its features are expressed with high-frequency phrases.
Step B comprises the following specific steps:
Step B1: A GMM decomposes an object into several components, each described by a Gaussian probability density function, and can accurately quantify the object with those functions. It is suitable when a data record contains several distribution characteristics, and a test question can be regarded as a mixed probability distribution over several knowledge features.
For the question bank
Figure BDA0003494446300000061
assuming that the test questions contain K knowledge points, the probability distribution of the Gaussian mixture model of a test question q is given by formula (1), where
Figure BDA0003494446300000062
is the Gaussian density function of the k-th knowledge point, α_k denotes the probability that a test question contains the k-th knowledge point and satisfies
Figure BDA0003494446300000063
μ_k denotes the mean, and Σ_k denotes the covariance.
Figure BDA0003494446300000064
The parameters α_k, μ_k and Σ_k in formula (1) are estimated with the EM (expectation maximization) algorithm. First, the probability γ_ik that test question q_i contains knowledge point k is calculated as shown in formula (2); then α_k, μ_k and Σ_k are calculated as shown in formulas (3) to (5).
Figure BDA0003494446300000065
Figure BDA0003494446300000066
Figure BDA0003494446300000067
Figure BDA0003494446300000068
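The formula images for (1) to (5) are not reproduced in this text. For orientation only, the textbook Gaussian mixture density and EM updates, written with the symbols defined above (I test questions, K knowledge points, word frequency vectors q_i), take the following form. The assignment of the three M-step updates to the numbers (3), (4) and (5) follows the order α_k, μ_k, Σ_k in which the text names them and is an assumption, as is the dimension symbol d.

```latex
\begin{align}
p(q) &= \sum_{k=1}^{K} \alpha_k \,\phi(q \mid \mu_k, \Sigma_k),
  \qquad \sum_{k=1}^{K} \alpha_k = 1,\ \alpha_k \ge 0 \tag{1}\\
\phi(q \mid \mu_k, \Sigma_k) &= \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}
  \exp\!\Big(-\tfrac{1}{2}(q-\mu_k)^{\top}\Sigma_k^{-1}(q-\mu_k)\Big) \notag\\
\gamma_{ik} &= \frac{\alpha_k\,\phi(q_i \mid \mu_k, \Sigma_k)}
  {\sum_{j=1}^{K} \alpha_j\,\phi(q_i \mid \mu_j, \Sigma_j)} \tag{2}\\
\alpha_k &= \frac{1}{I}\sum_{i=1}^{I} \gamma_{ik} \tag{3}\\
\mu_k &= \frac{\sum_{i=1}^{I} \gamma_{ik}\, q_i}{\sum_{i=1}^{I} \gamma_{ik}} \tag{4}\\
\Sigma_k &= \frac{\sum_{i=1}^{I} \gamma_{ik}\,(q_i-\mu_k)(q_i-\mu_k)^{\top}}
  {\sum_{i=1}^{I} \gamma_{ik}} \tag{5}
\end{align}
```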
Given a convergence threshold ε, formulas (2) to (5) are computed iteratively until |α′_k − α′_{k−1}| ≤ ε, which yields α_k, μ_k and Σ_k. With formula (1), the distribution of knowledge points over the test questions can be determined, and a many-to-many mapping between test questions and knowledge points is constructed. We set the range of the number of clusters to [2, 30], evaluate the clustering results with the Bayesian information criterion (BIC), and select the optimal number of clusters, which is the number of knowledge points in each chapter.
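The EM iteration and BIC-based choice of the cluster count can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: it restricts Σ_k to a diagonal covariance with a variance floor, initialises the means from randomly sampled questions, reads the stopping rule as "the largest change in any α_k between two successive iterations is at most ε", and uses the usual definition BIC = −2·log-likelihood + (number of parameters)·ln I. The names `fit_gmm` and `select_k` are hypothetical; in practice a library such as scikit-learn's `GaussianMixture` would normally be used instead.

```python
import math
import random


def _log_gauss_diag(x, mu, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mu, var))


def _e_step(X, alpha, mu, var):
    """Return responsibilities gamma[i][k] and the total log-likelihood."""
    gamma, loglik = [], 0.0
    for x in X:
        logs = [math.log(a) + _log_gauss_diag(x, m, v)
                for a, m, v in zip(alpha, mu, var)]
        top = max(logs)                       # log-sum-exp for stability
        norm = sum(math.exp(l - top) for l in logs)
        loglik += top + math.log(norm)
        gamma.append([math.exp(l - top) / norm for l in logs])
    return gamma, loglik


def fit_gmm(X, K, eps=1e-6, max_iter=200, seed=0, var_floor=1e-3):
    """Fit a K-component diagonal GMM with EM and report its BIC."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mu = [list(x) for x in rng.sample(X, K)]  # means start at sampled rows
    var = [[1.0] * d for _ in range(K)]
    alpha = [1.0 / K] * K
    for _ in range(max_iter):
        gamma, _ll = _e_step(X, alpha, mu, var)
        new_alpha = []
        for k in range(K):
            nk = max(sum(g[k] for g in gamma), 1e-12)
            new_alpha.append(nk / n)
            mu[k] = [sum(g[k] * x[j] for g, x in zip(gamma, X)) / nk
                     for j in range(d)]
            var[k] = [max(var_floor,
                          sum(g[k] * (x[j] - mu[k][j]) ** 2
                              for g, x in zip(gamma, X)) / nk)
                      for j in range(d)]
        shift = max(abs(a - b) for a, b in zip(new_alpha, alpha))
        alpha = new_alpha
        if shift <= eps:                      # assumed reading of the stop rule
            break
    gamma, loglik = _e_step(X, alpha, mu, var)
    n_params = (K - 1) + 2 * K * d            # weights + means + variances
    bic = -2.0 * loglik + n_params * math.log(n)
    return {"alpha": alpha, "mu": mu, "var": var, "gamma": gamma, "bic": bic}


def select_k(X, k_range):
    """Fit one model per candidate K and keep the one with the lowest BIC."""
    fits = {k: fit_gmm(X, k) for k in k_range}
    best = min(fits, key=lambda k: fits[k]["bic"])
    return best, fits[best]


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [9.0, 9.0], [9.0, 10.0], [10.0, 9.0], [10.0, 10.0]]
best_k, fit = select_k(X, range(1, 4))
print(best_k, [round(a, 3) for a in fit["alpha"]])
```

For a real chapter, `select_k(F, range(2, 31))` would correspond to the [2, 30] range given in the text, where F is the processed word frequency matrix; the rows of `fit["gamma"]` then give the soft, many-to-many assignment of test questions to knowledge points.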
Step B2: According to the clustering result of the test questions, let
Figure BDA0003494446300000069
denote the complete set of test question knowledge points, where K_i denotes the i-th test question knowledge point. To describe the features of a test question knowledge point fully, we express them with the phrases that co-occur most frequently in the test questions belonging to that knowledge point. Let
Figure BDA00034944463000000610
Figure BDA00034944463000000611
denote a test question knowledge point, where <q> and <w> denote the test question set and the feature set of K respectively. The feature set <w> satisfies two conditions: (1) each phrase w is among the phrases with the highest coverage of the knowledge point K, and (2) the coverage of the test question set is 100%. In short, we choose the phrases with the highest coverage of a test question knowledge point to express its features, such that their total coverage of the knowledge point is 100%.
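One way to realise the two conditions on the feature set <w> is a greedy pass over phrases ranked by coverage, sketched below. The greedy rule (walk down the ranking and keep a phrase only if it covers a question not yet covered) is an assumption; the text states the conditions but not a selection procedure, and the function name `feature_phrases` is hypothetical.

```python
def feature_phrases(question_words):
    """Pick high-coverage phrases until every question of the knowledge point
    contains at least one selected phrase.

    question_words: list of sets, one set of phrases per test question in <q>.
    Returns the selected feature phrases <w> in selection order.
    """
    # Coverage count: in how many questions of this knowledge point a phrase occurs.
    counts = {}
    for words in question_words:
        for w in words:
            counts[w] = counts.get(w, 0) + 1

    # Highest coverage first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))

    uncovered = set(range(len(question_words)))
    features = []
    for w in ranked:
        if not uncovered:
            break
        hit = {i for i in uncovered if w in question_words[i]}
        if hit:                      # keep only phrases that add coverage
            features.append(w)
            uncovered -= hit
    return features


# English placeholder tokens stand in for segmented Chinese phrases.
qs = [{"crystal", "structure"}, {"crystal", "cell"}, {"solution"}]
features = feature_phrases(qs)
print(features)
```

A question whose phrase set is empty can never be covered, so in that edge case the returned set covers less than 100% of the questions.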
Step B3: Chapter knowledge points are identified and their features extracted. Each test question in the structured question bank
Figure BDA00034944463000000612
has a chapter attribute q.e, from which we can identify the chapter-level knowledge structure of the course and each chapter knowledge point. As with test question knowledge points, we use a set of high-coverage phrases to express the features of a chapter knowledge point. Let
Figure BDA0003494446300000071
denote the complete set of course chapters, and let
Figure BDA0003494446300000072
Figure BDA0003494446300000073
denote a chapter knowledge point. The feature phrase set <w> is extracted in the same way as the features of test question knowledge points, which is not repeated here.
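Grouping test questions by the chapter attribute q.e and ranking phrases by their coverage within each chapter can be sketched as follows. The dictionary layout of a question (keys "e" and "W"), the function name `chapter_phrase_coverage` and the use of only the lower threshold η_min = 4% (the upper bound for chapters being 100%) are assumptions made for this illustration.

```python
from collections import defaultdict


def chapter_phrase_coverage(questions, eta_min=0.04):
    """Map each chapter to its phrases, ranked by coverage within the chapter.

    questions: list of dicts with keys "e" (chapter) and "W" (list of phrases).
    Returns {chapter: [(phrase, coverage), ...]} with coverage >= eta_min.
    """
    by_chapter = defaultdict(list)
    for q in questions:
        by_chapter[q["e"]].append(set(q["W"]))

    result = {}
    for chapter, word_sets in by_chapter.items():
        total = len(word_sets)
        counts = defaultdict(int)
        for words in word_sets:
            for w in words:
                counts[w] += 1
        # Highest coverage first; ties broken alphabetically.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[chapter] = [(w, c / total) for w, c in ranked
                           if c / total >= eta_min]
    return result


bank = [{"e": "ch1", "W": ["crystal", "cell"]},
        {"e": "ch1", "W": ["crystal"]},
        {"e": "ch2", "W": ["solution"]}]
coverage = chapter_phrase_coverage(bank)
print(coverage)
```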
Step C: On the basis of the feature expression of course knowledge, knowledge graph technology is used to compute knowledge associations and construct the course knowledge graph. Using the RDF (Resource Description Framework) technology of knowledge graphs, we define the course knowledge system as
Figure BDA0003494446300000074
which describes two types of associations in course knowledge: associations between knowledge point entities, and associations between chapter entities and knowledge point entities.
Figure BDA0003494446300000075
Figure BDA0003494446300000076
denotes knowledge point entities and their association, where the association between knowledge point entities K_a and K_b is the set of phrases that co-occur in K_a and K_b. This representation expresses knowledge points and their associations with characteristic phrases of the test questions, so that a knowledge system can be constructed simply and efficiently while avoiding the need to understand and compute the complex semantics of the knowledge points. G_c = <<C_x, e, K_y>> denotes that the association between chapter entity C_x and knowledge point entity K_y is e, which is consistent with the chapter attribute of the test questions.
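The two kinds of associations in step C can be illustrated with plain subject-predicate-object tuples, as in the sketch below. It does not produce real RDF (which would require IRIs and a serialization format); the function name `build_triples`, the "chapter:" prefix for chapter entities and the choice to store the shared phrases as a sorted tuple are all assumptions made for illustration.

```python
def build_triples(kp_features, kp_chapters):
    """Build (subject, predicate, object) tuples for the course knowledge graph.

    kp_features: {knowledge point name: set of feature phrases}
    kp_chapters: {knowledge point name: set of chapter attributes e of its questions}
    """
    triples = []
    names = sorted(kp_features)

    # Knowledge point to knowledge point: the association is the set of
    # phrases shared by K_a and K_b (pairs without shared phrases are skipped).
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = kp_features[a] & kp_features[b]
            if shared:
                triples.append((a, tuple(sorted(shared)), b))

    # Chapter to knowledge point: the association is the chapter attribute e.
    for kp in names:
        for e in sorted(kp_chapters.get(kp, ())):
            triples.append(("chapter:" + e, e, kp))
    return triples


kp_features = {"K1": {"crystal", "cell"},
               "K2": {"crystal", "lattice"},
               "K3": {"solution"}}
kp_chapters = {"K1": {"3"}, "K2": {"3"}, "K3": {"2"}}
triples = build_triples(kp_features, kp_chapters)
for t in triples:
    print(t)
```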
In conclusion, the invention takes large volumes of course test questions as its research object, uses Gaussian mixture clustering to identify test question knowledge points and their associations, combines them with the existing chapter knowledge point system, and uses knowledge graph technology to reconstruct the course knowledge system. The invention expresses knowledge points and their associations with characteristic phrases of the test questions, so that a knowledge system can be constructed simply and efficiently while avoiding the need to understand and compute the complex semantics of the knowledge points.
The above describes only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements shall also fall within the protection scope of the present invention.

Claims (4)

1. A method for constructing a curriculum knowledge graph based on GMM, characterized by comprising the following steps:
step A, grouping the test questions by chapter, and preprocessing the test questions in each chapter and performing Chinese word segmentation, which includes cleaning test questions that contain invalid characters and picture content to obtain structured test question data, then performing jieba word segmentation on the valid test question strings to generate a word frequency matrix of the test questions, and processing the word frequency matrix;
step B, identifying and extracting the features of the two types of knowledge points, namely test question knowledge points and chapter knowledge points, which includes clustering test question knowledge points and extracting their features with a GMM (Gaussian mixture model) based on the processed word frequency matrix to generate a test question knowledge point model, and meanwhile identifying chapter knowledge points and extracting their features based on the structured test question data to generate a chapter knowledge point model;
step C, based on the generated test question knowledge point model and chapter knowledge point model, integrating the two types of knowledge points into a course knowledge graph using knowledge graph technology.
2. The method of claim 1, wherein the step a comprises:
step A1, modeling the test question, including deleting the picture part of the test question and cleaning useless characters, wherein the useless characters include numbers, letters, space characters and linefeed characters; the method comprises the following steps of (1) arranging the structure of test question data, introducing preprocessing operation of attributes of 'affiliated chapters', 'question types' and 'scores', and creating a structured test question data model; defining the question bank which completes cleaning and pretreatment as
Figure FDA0003494446290000011
Wherein q isiThe I-th test question is represented by { V, W, e, t }, wherein V represents a washed test question string, W represents an effective test question string set, e represents a chapter to which the test question belongs, t represents a test question type, and I represents the number of test questions in a test question bank;
step A2, generating word-frequency matrix of test questions, including dividing Chinese words for effective test question strings by jieba method, decomposing each effective test question string into phrase set WaThen, howeverPost-utilization stop word list WsFiltering the irrelevant phrases and low-frequency phrases to make
Figure FDA0003494446290000012
the complete phrase set of all test questions is expressed as
Figure FDA0003494446290000017
where w_j is the j-th segmented word; for the question bank
Figure FDA0003494446290000013
its word frequency matrix
Figure FDA0003494446290000014
indicates, for the question bank
Figure FDA0003494446290000015
whether the j-th segmented word w_j appears in the i-th test question q_i;
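The construction in step A2 can be sketched as follows. This is an illustration under assumptions: segmentation is taken to have happened upstream (for example with `jieba.lcut`), and the function name `build_matrix` and the `min_count` cut-off for low-frequency phrases are hypothetical, since the claim does not state a concrete frequency threshold.

```python
def build_matrix(token_lists, stop_words, min_count=2):
    """Return (vocabulary, matrix); matrix[i][j] is 1 iff word j occurs in question i."""
    # Filter stop words (the list W_s in the claim).
    filtered = [[w for w in toks if w not in stop_words] for toks in token_lists]
    # Document frequency: number of questions that contain each word.
    doc_freq = {}
    for toks in filtered:
        for w in set(toks):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    # Drop low-frequency words; min_count is an assumed cut-off.
    vocab = sorted(w for w, c in doc_freq.items() if c >= min_count)
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in filtered]
    for i, toks in enumerate(filtered):
        for w in toks:
            if w in index:
                matrix[i][index[w]] = 1
    return vocab, matrix


# Hypothetical token lists, as a segmenter might produce them.
tokens = [["进程", "的", "调度"], ["进程", "同步"], ["调度", "算法", "的", "分类"]]
vocab, m = build_matrix(tokens, stop_words={"的"}, min_count=2)
```

The matrix is binary because the claim describes each entry as recording whether a word appears in a question, not how often.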
step A3, de-duplicating the word frequency matrix
Figure FDA0003494446290000016
by merging strongly correlated phrase columns: a phrase correlation threshold ρ is defined, and if the Pearson correlation coefficient ρ_{x,y} of any two phrases x and y is greater than 0.8, the two phrases are merged into a new phrase w_{x,y} = w_x ∪ w_y, and w_x and w_y are deleted; upper and lower phrase coverage thresholds [η_min, η_max] are defined, and only phrase features within this threshold range are used for knowledge point identification.
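The merging and filtering of step A3 could look like the sketch below. Several details are assumptions: the union w_x ∪ w_y is read as an element-wise OR of the two binary columns, merged phrases are named by joining with "+", constant columns are treated as uncorrelated, and the default values of `eta_min` and `eta_max` are placeholders because the claim gives no numbers for them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant column has no defined correlation; treat it as uncorrelated.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def merge_and_filter(columns, rho=0.8, eta_min=0.05, eta_max=0.9):
    """columns: dict phrase -> 0/1 list over questions. Returns the reduced dict."""
    cols = dict(columns)
    merged = True
    while merged:  # repeat until no pair exceeds rho
        merged = False
        names = list(cols)
        for i, x in enumerate(names):
            for y in names[i + 1:]:
                if pearson(cols[x], cols[y]) > rho:
                    # w_{x,y} = w_x ∪ w_y, read here as element-wise OR.
                    union = [a | b for a, b in zip(cols[x], cols[y])]
                    del cols[x], cols[y]
                    cols[x + "+" + y] = union
                    merged = True
                    break
            if merged:
                break
    # Coverage: fraction of questions in which the phrase appears.
    return {w: c for w, c in cols.items() if eta_min <= sum(c) / len(c) <= eta_max}


# Hypothetical columns over four questions.
columns = {"进程": [1, 1, 0, 0], "线程": [1, 1, 0, 0],
           "调度": [0, 1, 1, 0], "的": [1, 1, 1, 1]}
reduced = merge_and_filter(columns)
```

Here "进程" and "线程" are perfectly correlated and get merged, while "的" appears in every question and is removed by the upper coverage threshold.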
3. The method of claim 2, wherein step B comprises:
step B1, for the question bank
Figure FDA0003494446290000021
the Gaussian mixture model (GMM) probability distribution p(q) of a test question q is shown in formula (1):
Figure FDA0003494446290000022
wherein K represents the number of knowledge points contained in the test question;
Figure FDA0003494446290000023
is the Gaussian density function of the k-th knowledge point; α_k denotes the probability that the test question contains the k-th knowledge point and satisfies
Figure FDA0003494446290000024
μ_k denotes the mean; Σ_k denotes the covariance;
the parameters α_k, μ_k and Σ_k are estimated using the EM algorithm: first, the probability γ_ik that test question q_i contains knowledge point k is calculated, as shown in formula (2); then α_k, μ_k and Σ_k are calculated as shown in formulas (3) to (5):
Figure FDA0003494446290000025
Figure FDA0003494446290000026
Figure FDA0003494446290000027
Figure FDA0003494446290000028
given a convergence threshold ε, formulas (2) to (5) are computed iteratively until |α′_k − α′_{k−1}| ≤ ε, yielding α_k, μ_k and Σ_k;
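The formula images (1) to (5) are not reproduced in this text. Assuming they follow the textbook GMM density and EM updates, and writing them with the symbols defined in this claim (q_i taken as the i-th row of the word frequency matrix, I the number of test questions), they would read roughly as below. The assignment of numbers (3) to (5) follows the order α_k, μ_k, Σ_k in which the claim lists the parameters, which is itself an assumption.

```latex
\begin{align}
p(q) &= \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(q \mid \mu_k, \Sigma_k),
  \qquad \sum_{k=1}^{K} \alpha_k = 1 \tag{1} \\
\gamma_{ik} &= \frac{\alpha_k \, \mathcal{N}(q_i \mid \mu_k, \Sigma_k)}
  {\sum_{j=1}^{K} \alpha_j \, \mathcal{N}(q_i \mid \mu_j, \Sigma_j)} \tag{2} \\
\alpha_k &= \frac{1}{I} \sum_{i=1}^{I} \gamma_{ik} \tag{3} \\
\mu_k &= \frac{\sum_{i=1}^{I} \gamma_{ik} \, q_i}{\sum_{i=1}^{I} \gamma_{ik}} \tag{4} \\
\Sigma_k &= \frac{\sum_{i=1}^{I} \gamma_{ik} \, (q_i - \mu_k)(q_i - \mu_k)^{\top}}
  {\sum_{i=1}^{I} \gamma_{ik}} \tag{5}
\end{align}
```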
determining the distribution of knowledge points in the test questions using formula (1), and constructing a many-to-many mapping between test questions and knowledge points; setting the number of cluster categories, evaluating the clustering results using the Bayesian information criterion, and selecting the optimal number of clusters, which is the number of knowledge points in each chapter;
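The EM iteration and the BIC-based choice of the cluster number can be sketched in plain Python. This is a simplified illustration, not the patented implementation: it assumes diagonal covariances with a variance floor (full covariances of sparse binary features are often singular), a deterministic initialisation, and convergence measured on the log-likelihood rather than on α as in the claim. The names `fit_gmm`, `e_step` and `bic` and the toy data `X` are hypothetical.

```python
import math


def log_gauss(x, mu, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mu, var))


def e_step(X, alpha, mu, var):
    """Responsibilities gamma_ik and the total log-likelihood."""
    gamma, ll = [], 0.0
    for x in X:
        logs = [math.log(a) + log_gauss(x, m, v) for a, m, v in zip(alpha, mu, var)]
        top = max(logs)  # log-sum-exp for numerical stability
        norm = top + math.log(sum(math.exp(s - top) for s in logs))
        gamma.append([math.exp(s - norm) for s in logs])
        ll += norm
    return gamma, ll


def fit_gmm(X, K, iters=200, eps=1e-8, floor=1e-3):
    """EM for a diagonal GMM; returns (alpha, mu, var, gamma, log_likelihood)."""
    n, d = len(X), len(X[0])
    alpha = [1.0 / K] * K
    mu = [list(X[(k * n) // K]) for k in range(K)]  # assumed spread-out init
    var = [[1.0] * d for _ in range(K)]
    gamma, ll = e_step(X, alpha, mu, var)
    for _ in range(iters):
        # M-step: mixing weights, means and (floored) variances.
        nk = [max(sum(g[k] for g in gamma), 1e-12) for k in range(K)]
        alpha = [v / n for v in nk]
        mu = [[sum(g[k] * x[j] for g, x in zip(gamma, X)) / nk[k]
               for j in range(d)] for k in range(K)]
        var = [[max(sum(g[k] * (x[j] - mu[k][j]) ** 2
                        for g, x in zip(gamma, X)) / nk[k], floor)
                for j in range(d)] for k in range(K)]
        gamma, new_ll = e_step(X, alpha, mu, var)
        converged = abs(new_ll - ll) <= eps
        ll = new_ll
        if converged:
            break
    return alpha, mu, var, gamma, ll


def bic(ll, n, d, K):
    """BIC = -2 ln L + p ln n, with p the free parameters of a diagonal GMM."""
    p = (K - 1) + 2 * K * d
    return -2.0 * ll + p * math.log(n)


# Toy data: two well-separated groups in two dimensions.
X = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
scores = {K: bic(fit_gmm(X, K)[4], len(X), 2, K) for K in (1, 2, 3)}
best_K = min(scores, key=scores.get)  # candidate with the lowest BIC
```

Each row of `gamma` gives the soft membership of one question in every knowledge point, which is what allows the many-to-many mapping described above.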
step B2, according to the clustering result of the test questions, let
Figure FDA0003494446290000029
denote the complete set of test-question knowledge points, where K_i denotes the i-th test-question knowledge point; the features of a test-question knowledge point are expressed by the phrases that co-occur most frequently in the test questions belonging to that knowledge point: let
Figure FDA00034944462900000210
denote a test-question knowledge point, where <q> and <w> denote the test-question set and the feature set of K, respectively; the feature set <w> is composed of the several phrases with the highest coverage of the test-question knowledge point, and the overall coverage of the test-question knowledge point by the feature set <w> is 100%;
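One way to obtain a feature set of high-coverage phrases whose overall coverage reaches 100% is a greedy cover, sketched below. The greedy strategy, the alphabetical tie-break and the name `feature_set` are assumptions; the claim only requires that the chosen phrases have the highest coverage and together cover every question of the knowledge point.

```python
def feature_set(questions):
    """questions: dict question id -> set of phrases.

    Repeatedly pick the phrase contained in the most still-uncovered questions
    until every question of the knowledge point is covered.
    """
    uncovered = set(questions)
    features = []
    while uncovered:
        counts = {}
        for qid in uncovered:
            for w in questions[qid]:
                counts[w] = counts.get(w, 0) + 1
        if not counts:  # remaining questions contain no phrases at all
            break
        best = max(sorted(counts), key=counts.get)  # ties: smallest phrase wins
        features.append(best)
        uncovered -= {qid for qid in uncovered if best in questions[qid]}
    return features


# Hypothetical cluster of three questions assigned to one knowledge point.
cluster = {"q1": {"进程", "调度"}, "q2": {"进程", "同步"}, "q3": {"死锁"}}
print(feature_set(cluster))  # ['进程', '死锁']
```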
step B3, performing chapter knowledge point identification and feature extraction: based on the structured question bank
Figure FDA00034944462900000211
and the chapter attribute q.e of each test question in it, identifying the chapter-level knowledge structure of the course and identifying each chapter knowledge point; the features of a chapter knowledge point are expressed by a set of high-coverage phrases: let
Figure FDA00034944462900000213
denote the complete set of course chapters, and let
Figure FDA00034944462900000212
denote a chapter knowledge point, where <q> and <w> denote the test-question set and the feature set of C, respectively; the feature set <w> is composed of the several phrases with the highest coverage of the chapter knowledge point, and the overall coverage of the chapter knowledge point by the feature set <w> is 100%.
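For step B3, grouping by the chapter attribute and measuring phrase coverage per chapter can be illustrated as follows. The function name `chapter_coverage` and the input layout (a list of chapter and phrase-set pairs) are assumptions made for the example.

```python
from collections import defaultdict


def chapter_coverage(bank):
    """bank: list of (chapter e, set of phrases). Returns chapter -> {phrase: coverage}."""
    by_chapter = defaultdict(list)
    for chapter, phrases in bank:
        by_chapter[chapter].append(phrases)
    result = {}
    for chapter, phrase_sets in by_chapter.items():
        counts = defaultdict(int)
        for phrases in phrase_sets:
            for w in phrases:
                counts[w] += 1
        # Coverage: share of the chapter's questions that contain the phrase.
        result[chapter] = {w: c / len(phrase_sets) for w, c in counts.items()}
    return result


# Hypothetical bank with two chapters.
bank_b3 = [("ch1", {"进程", "调度"}), ("ch1", {"进程"}), ("ch2", {"文件"})]
coverage = chapter_coverage(bank_b3)
```

Ranking each chapter's phrases by this coverage value, and keeping the top ones until every question is covered, yields the chapter feature set <w>.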
4. The method of claim 3, wherein step C comprises:
using the RDF technology of knowledge graphs, the course knowledge system is defined as
Figure FDA0003494446290000031
to describe the associations between knowledge point entities of the course and the associations between chapter entities and knowledge point entities;
Figure FDA0003494446290000032
represents knowledge point entities and their relationships, where the relationship between knowledge point entities K_a and K_b is the set of phrases that co-occur in K_a and K_b; g_c = <C_x, e, K_y> represents that the association between chapter entity C_x and knowledge point entity K_y is e, which is consistent with the chapter attribute of the test question.
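The two kinds of triples in step C can be sketched with plain tuples. This is an illustration only: no RDF library is used, the names `build_triples`, `kp_features` and `kp_chapter` are hypothetical, and the pairing of a chapter entity name with its attribute value e is an assumed input format.

```python
def build_triples(kp_features, kp_chapter):
    """Build (subject, predicate, object) triples.

    kp_features: knowledge point -> set of feature phrases
    kp_chapter:  knowledge point -> (chapter entity, chapter attribute e)
    """
    triples = []
    names = sorted(kp_features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = kp_features[a] & kp_features[b]
            if shared:
                # <K_a, co-occurring phrases, K_b>
                triples.append((a, frozenset(shared), b))
    for k in names:
        chapter_entity, e = kp_chapter[k]
        # <C_x, e, K_y>
        triples.append((chapter_entity, e, k))
    return triples


# Hypothetical knowledge points and chapter assignments.
kp_features = {"K1": {"进程", "调度"}, "K2": {"进程", "同步"}, "K3": {"文件"}}
kp_chapter = {"K1": ("C1", "第2章"), "K2": ("C1", "第2章"), "K3": ("C2", "第5章")}
triples = build_triples(kp_features, kp_chapter)
```

Knowledge points that share no feature phrase, such as K1 and K3 here, simply get no edge between them.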
CN202210109036.8A 2022-01-28 2022-01-28 Method for constructing course knowledge graph based on GMM Active CN114595337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210109036.8A CN114595337B (en) 2022-01-28 2022-01-28 Method for constructing course knowledge graph based on GMM

Publications (2)

Publication Number Publication Date
CN114595337A true CN114595337A (en) 2022-06-07
CN114595337B CN114595337B (en) 2024-06-28

Family

ID=81806283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109036.8A Active CN114595337B (en) 2022-01-28 2022-01-28 Method for constructing course knowledge graph based on GMM

Country Status (1)

Country Link
CN (1) CN114595337B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815338A (en) * 2018-12-28 2019-05-28 北京市遥感信息研究所 Relation extraction method and system in knowledge mapping based on mixed Gauss model
US20210104234A1 (en) * 2019-10-08 2021-04-08 Pricewaterhousecoopers Llp Intent-based conversational knowledge graph for spoken language understanding system
CN111883140A (en) * 2020-07-24 2020-11-03 中国平安人寿保险股份有限公司 Authentication method, device, equipment and medium based on knowledge graph and voiceprint recognition
CN113127731A (en) * 2021-03-16 2021-07-16 西安理工大学 Knowledge graph-based personalized test question recommendation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
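Liu Yiran; Luo Liming: "Research on extracting test points of subject single-choice questions based on knowledge graph", Application Research of Computers, no. 06, 8 April 2018 (2018-04-08) *
Ruan Tong; Gao Ju; Feng Donglei; Qian Xiyuan; Wang Ting; Sun Chenglin: "Process and methods of clinical big data mining based on electronic medical records", Big Data Research, no. 05, 20 September 2017 (2017-09-20) *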

Also Published As

Publication number Publication date
CN114595337B (en) 2024-06-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant