CN115879463A - Course element recognition model training and recognition method based on text mining - Google Patents

Publication number
CN115879463A
Authority
CN
China
Prior art keywords
text
text data
data
word
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211249113.6A
Other languages
Chinese (zh)
Inventor
张建桃
刘洁荧
曾莉
韦婷婷
林筱芸
张叶
姜可欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University
Priority to CN202211249113.6A
Publication of CN115879463A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a course element recognition model training and recognition method based on text mining. The training method comprises the following steps: constructing a target frame, wherein the target frame is provided with a plurality of dimensions and elements corresponding to the dimensions; collecting and preprocessing text data from political courses and professional courses; partitioning the preprocessed text data to obtain text data blocks; manually labeling the text data blocks according to the plurality of dimensions and the elements corresponding to the dimensions to obtain corresponding labels; performing word segmentation processing and word vectorization processing on the text data blocks to obtain a word vector data set; and constructing a text mining model, and training and testing the text mining model based on the word vector data set and the manually labeled labels to obtain a final text mining model. The method can use the text mining model to extract and identify the dimensions and elements contained in teaching materials.

Description

Course element recognition model training and recognition method based on text mining
Technical Field
The invention relates to the field of big data analysis, in particular to a course element recognition model training and recognition method based on text mining.
Background
Text mining is the process of extracting valuable information of interest to a user from a large amount of text data. Teaching materials systematically reflect the main content of a subject and are important tools in college and university teaching. With the rise of deep learning, the ability of computers to process unstructured text has improved greatly, so that loosely structured textbook text can be processed well. However, textbook text often presupposes considerable subject background knowledge, and without that background a computer has difficulty recognizing the intent of the text. In addition, course development needs to mine the connotations embedded in professional knowledge, which requires the computer to understand the deeper meaning of the text. At present, text mining research on natural language understanding is not yet mature: text intent is mostly recognized from the high-frequency words in an article, so the accuracy of text understanding is low and cannot meet the need of course research to explore deep educational connotations.
Disclosure of Invention
The invention aims to provide a course element identification method and a system based on text mining, which can be used for mining elements contained in professional courses.
In order to achieve the above object, an embodiment of the present invention provides a course element recognition model training method based on text mining, including:
constructing a target frame, wherein the target frame is provided with a plurality of dimensions and elements corresponding to the dimensions;
acquiring text data from political courses and professional courses and preprocessing the text data to obtain preprocessed text data;
partitioning the preprocessed text data to obtain text data blocks;
manually labeling the text data block according to the plurality of dimensions and the elements corresponding to the dimensions to obtain corresponding labels;
performing word segmentation processing on the text data block to obtain a word segmentation data set, and performing word vectorization processing on the word segmentation data set to obtain a word vector data set;
and constructing a text mining model, inputting the word vector data set and the manually labeled labels into the text mining model, and training and testing to obtain a final text mining model.
Preferably, partitioning the preprocessed text data to obtain text data blocks specifically includes:
manually partitioning the preprocessed text data according to the knowledge points and/or cases in the teaching materials of political courses and professional courses to obtain text data blocks corresponding to the knowledge points and/or cases.
Preferably, performing word segmentation processing on the text data block to obtain a word segmentation data set specifically includes:
pre-constructing a custom dictionary according to the terms of political courses and professional courses, and performing word segmentation processing on the text data blocks corresponding to the knowledge points and/or cases according to the custom dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or cases.
Preferably, performing word segmentation processing on the text data block to obtain a word segmentation data set further includes:
pre-constructing a custom stop-word dictionary according to the terms of political courses and professional courses;
and performing word segmentation processing on the text data blocks corresponding to the knowledge points and/or cases according to the custom dictionary and the custom stop-word dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or cases.
Preferably, performing word vectorization processing on the word segmentation data set to obtain a word vector data set specifically includes:
performing word vectorization processing on the word segmentation data sets corresponding to the knowledge points and/or cases using a Word2vec model to obtain word vector data sets corresponding to the knowledge points and/or cases.
Preferably, the manually labeling the text data block according to the plurality of dimensions and the elements corresponding to the dimensions to obtain the corresponding label specifically includes:
performing dimension labeling on each text data block according to a plurality of dimensions, and acquiring dimension label information corresponding to each text data block;
and performing element labeling on the text data blocks according to the dimension label information corresponding to each text data block to obtain the element label information of each text data block under the corresponding dimension.
Preferably, the final text mining model specifically includes:
a total dimension text mining model, used for identifying the dimension corresponding to the course text data to be processed;
and a dimension-divided text mining model corresponding to each dimension, used for identifying the elements of the course text data to be processed under that dimension.
Preferably, the total dimension text mining model and the dimension-divided text mining models are all implemented using an LSTM model.
Another embodiment of the present invention correspondingly provides a course element recognition model training system based on text mining, including:
the framework building module is used for building a target framework, and the target framework is provided with at least two levels of indexes;
the data acquisition module is used for acquiring text data from the teaching materials of political courses and professional courses;
the data preprocessing module is used for preprocessing the acquired text data to obtain preprocessed text data;
the text characteristic extraction module is used for segmenting the preprocessed text data to obtain text data blocks, segmenting the text data blocks to obtain word segmentation data sets, and carrying out word vectorization processing on the word segmentation data sets to obtain word vector data sets;
the label labeling module is used for manually labeling the text data blocks according to the at least two levels of indexes of the target frame to obtain corresponding labels;
and the text mining model building module is used for building a text mining model, and training and testing the text mining model based on the word vector data set and the manually labeled labels to obtain a final text mining model.
The invention further provides a course element recognition method based on text mining, which is implemented using the final text mining model and includes:
preprocessing the course text data to be processed, and sequentially performing blocking, word segmentation and word vectorization on the preprocessed course text data to obtain the word vector data to be processed;
and inputting the word vector data to be processed into the final text mining model to obtain the dimensions corresponding to the course text data blocks and the elements corresponding to those dimensions.
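The two-stage recognition just described can be sketched as follows. This is a minimal illustration only: the function `recognize`, the stand-in classifiers and all label names are hypothetical, and trained models would take the place of the simple rules used here.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Classifier = Callable[[Vector], str]


def recognize(block_vec: Vector,
              total_model: Classifier,
              dimension_models: Dict[str, Classifier]) -> Tuple[str, str]:
    """Return (dimension, element) for one vectorized text block."""
    dimension = total_model(block_vec)                 # stage 1: identify the dimension
    element = dimension_models[dimension](block_vec)   # stage 2: identify the element under it
    return dimension, element


# Hypothetical stand-ins for trained models; the labels are invented for illustration.
def total(v: Vector) -> str:
    return "dimension_A" if v[0] > 0 else "dimension_B"


per_dim: Dict[str, Classifier] = {
    "dimension_A": lambda v: "element_A1" if v[1] > 0 else "element_A2",
    "dimension_B": lambda v: "element_B1",
}

result = recognize([0.7, -0.2], total, per_dim)
print(result)  # ('dimension_A', 'element_A2')
```

Routing each block through the dimension model first means each element model only has to separate the elements of a single dimension.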
Compared with the prior art, the invention has the following beneficial effects:
According to the method, a target frame is constructed; text data obtained from the teaching materials of political courses and of other, non-political professional courses is preprocessed and manually labeled according to the dimensions and corresponding elements of the target frame, and is then input into a text mining model for training, yielding a text mining model that can identify the dimensions and the corresponding elements. The text mining model can extract and identify the dimensions and elements contained in teaching materials, so that the extracted information can be used for further research.
Drawings
Fig. 1 is a flowchart of a course element recognition model training method based on text mining according to embodiment 1 of the present invention.
Fig. 2 is an architecture diagram of a course element recognition model training system based on text mining according to embodiment 2 of the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
As shown in fig. 1, the embodiment provides a course element recognition model training method based on text mining, including:
S1, constructing a target frame, wherein the target frame is provided with a plurality of dimensions and elements corresponding to the dimensions;
S2, collecting text data from the teaching materials of political courses and professional courses and preprocessing the text data to obtain preprocessed text data;
S3, partitioning the preprocessed text data to obtain text data blocks;
S4, manually labeling the text data blocks according to the plurality of dimensions and the elements corresponding to the dimensions to obtain corresponding labels;
S5, performing word segmentation processing on the text data blocks to obtain a word segmentation data set, and performing word vectorization processing on the word segmentation data set to obtain a word vector data set;
and S6, constructing a text mining model, and training and testing the text mining model based on the word vector data set and the manually labeled labels to obtain a final text mining model.
In step S1, a target frame containing two levels of indexes may be established according to course-related policy documents and literature, the requirements of political courses, and their core content. The first-level indexes are the dimensions, and several second-level indexes are subdivided under each dimension, namely the elements corresponding to that dimension. The text mining models can subsequently be set up according to these two levels of indexes to identify the dimensions and the corresponding elements.
In step S2, text data is collected from the teaching materials of political courses and professional courses and preprocessed to obtain preprocessed text data. Here, professional courses are courses other than political courses, such as "Applied Statistics"; relevant teaching materials are collected for these courses, the relevant text data is extracted from them, and the text data is then preprocessed.
In a preferred embodiment, when a certain type of target teaching material needs to be analyzed, texts such as the teaching materials of the political course series can be collected as text data, and subject teaching materials related to the target teaching material can be collected as auxiliary text data. The text data and the auxiliary text data are used together as the training samples and test samples for that type of target teaching material, yielding a targeted recognition model that identifies the dimensions and corresponding elements in that type of teaching material more accurately.
The pre-processing may include: null and duplicate removal, outlier removal, etc. Specifically, the pandas library in Python may be used to remove null values, repeated values, and abnormal values in the text data.
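A minimal sketch of this preprocessing with pandas is given below. The column names, the sample rows and the rule that treats very short passages as abnormal values are assumptions made for illustration; the text does not specify how abnormal values are detected.

```python
import pandas as pd

# Hypothetical raw extract: one row per passage of textbook text.
raw = pd.DataFrame({
    "course": ["politics", "politics", "statistics", "statistics", "statistics"],
    "text": ["Passage one.", "Passage one.", None, "  ", "A sampling case study."],
})


def preprocess(df: pd.DataFrame, min_chars: int = 5) -> pd.DataFrame:
    """Drop null, repeated and abnormally short text rows."""
    out = df.dropna(subset=["text"]).copy()          # remove null values
    out["text"] = out["text"].str.strip()            # trim surrounding whitespace
    out = out.drop_duplicates(subset=["text"])       # remove repeated values
    out = out[out["text"].str.len() >= min_chars]    # assumed outlier rule: too-short text
    return out.reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

With the sample rows above, only the first passage and the case study remain.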
In step S3, in order to identify the dimensions and elements involved in the teaching materials more accurately, this embodiment treats each knowledge point and/or case in the teaching materials as one data block for identification. In a specific implementation, the text data is partitioned according to the knowledge points and/or cases in the teaching materials of political courses and professional courses, giving a text data block for each knowledge point and/or case.
In step S4, manual labeling is performed, that is:
S41, performing dimension labeling on each text data block according to the plurality of dimensions, and acquiring the dimension label information corresponding to each text data block as a primary label;
and S42, performing element labeling on the text data blocks according to the dimension label information corresponding to each text data block to obtain the element label information of each text data block under the corresponding dimension, and using the element label information as a secondary label.
Namely, the dimension and the element under the target frame are used for marking the text data block with the appropriate primary index label and secondary index label.
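One possible way to represent the target frame and the two-level labels in code is sketched below. The dictionary contents and the names `TARGET_FRAME`, `LabeledBlock` and `label_block` are hypothetical; the source does not list concrete dimensions or elements.

```python
from typing import Dict, List, NamedTuple

# Hypothetical target frame: dimension (primary index) -> elements (secondary indexes).
TARGET_FRAME: Dict[str, List[str]] = {
    "dimension_A": ["element_A1", "element_A2"],
    "dimension_B": ["element_B1"],
}


class LabeledBlock(NamedTuple):
    text: str
    dimension: str   # primary label
    element: str     # secondary label


def label_block(text: str, dimension: str, element: str) -> LabeledBlock:
    """Attach a two-level label, rejecting pairs the target frame does not define."""
    if dimension not in TARGET_FRAME:
        raise ValueError(f"unknown dimension: {dimension}")
    if element not in TARGET_FRAME[dimension]:
        raise ValueError(f"{element} is not an element of {dimension}")
    return LabeledBlock(text, dimension, element)


block = label_block("A knowledge point about sampling.", "dimension_A", "element_A2")
print(block.dimension, block.element)
```

Checking the element against its dimension at labeling time keeps annotators from assigning a secondary label that belongs to a different primary index.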
In step S5, the specific steps of performing word segmentation processing on the text data block to obtain a word segmentation data set include:
S51, pre-constructing a custom dictionary according to the terms of political courses and professional courses;
and S52, performing word segmentation processing on the text data blocks corresponding to the knowledge points and/or the cases according to the user-defined dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or the cases.
During the word segmentation in step S52, meaningful words such as compound words may be continually picked out of the segmented text and added to the custom dictionary. In a specific implementation, the jieba word segmentation package in Python may be used together with the constructed custom dictionary to segment the text data blocks, and the resulting word segmentation data set is used as the input of the text mining model.
Further, in order to improve the accuracy of data processing and reduce the dimensionality of the data, step S51 of this embodiment also pre-constructs a custom stop-word dictionary according to the terms of political courses and professional courses. The custom stop words may be words of no research significance; they are collected into the custom stop-word dictionary, and applying this dictionary during word segmentation improves segmentation accuracy.
Further, in step S52, word segmentation processing is performed on the text data blocks corresponding to the knowledge points and/or cases according to the custom dictionary and the custom stop-word dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or cases. In this process, the Harbin Institute of Technology (HIT) stop-word list may be combined with the constructed custom stop-word lexicon to remove stop words.
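The effect of a custom dictionary and a stop-word dictionary can be illustrated with a small self-contained sketch. It uses simple forward maximum matching as a stand-in segmenter, which is not the algorithm used by jieba; in practice the jieba package named above would take the place of the `segment` function. The dictionary entries and the sample sentence are hypothetical.

```python
from typing import Iterable, List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 6) -> List[str]:
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:   # a single character is the fallback
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop stop words and whitespace-only tokens."""
    return [t for t in tokens if t not in stop_words and t.strip()]


custom_dict = {"假设检验", "一种", "统计", "方法"}   # hypothetical course terms
stop_words = {"是", "一种"}                          # hypothetical stop words

raw_tokens = segment("假设检验是一种统计方法", custom_dict)
tokens = remove_stop_words(raw_tokens, stop_words)
print(tokens)  # ['假设检验', '统计', '方法']
```

Without the entry "假设检验" in the custom dictionary, this segmenter would split the term into single characters, which is the kind of error a course-specific dictionary is meant to prevent.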
After word segmentation processing is finished, word vectorization is carried out on the word segmentation data sets corresponding to the knowledge points and/or the cases on the basis of the word vectorization model to obtain word vector data sets corresponding to the knowledge points and/or the cases.
In step S6, the text mining model is constructed on the basis of a Long Short-Term Memory (LSTM) network, taking as input the text semantic features obtained by word vectorization of the word segmentation data set in the preceding steps. To improve the accuracy and running speed of text mining, the text mining model comprises a total dimension text mining model and several dimension-divided text mining models, one for each dimension. Finally, each model is trained and tested on the word vector data set and the manually labeled labels.
The Word vectorization processing can be realized by adopting a Word2vec model, and each Word in the Word segmentation data set is represented as a dense vector by using the Word2vec model to obtain text semantic features.
In the Word2vec model, the conditional probability of the target word is predicted from its context, and the vector representation of the target word is obtained by maximizing the log-likelihood objective function using gradient descent and back propagation. The log-likelihood objective function is:
$$L = \frac{1}{T}\sum_{t=1}^{T}\log P(\omega_t \mid \omega_{t-c}:\omega_{t+c}) \tag{1}$$
where $P(\omega_t \mid \omega_{t-c}:\omega_{t+c})$ is the conditional probability; $T$ is the length of the text, and $t$ denotes the $t$-th target word; $\omega_t$ is the predicted target word; $c$ is the context size; and $\omega_{t-c}:\omega_{t+c}$ denotes the $c$ words before and the $c$ words after the target word, excluding the target word itself.
The conditional probability $P(\omega_t \mid \omega_{t-c}:\omega_{t+c})$ is calculated by the softmax function:
$$P(\omega_t \mid \omega_{t-c}:\omega_{t+c}) = \frac{\exp\left(\bar{v}^{\top} v_{\omega_t}\right)}{\sum_{n=1}^{N}\exp\left(\bar{v}^{\top} v_n\right)} \tag{2}$$
$$\bar{v} = \frac{1}{2c}\sum_{-c \le j \le c,\ j \ne 0} v_j \tag{3}$$
where $v_j$ is the vector representation of the $j$-th word in the context of the target word; $\bar{v}$ is the context vector representation of the $c$ words before and after the target word, and $\bar{v}^{\top}$ is its transpose; $v_{\omega_t}$ is the vector representation of the target word; $N$ is the number of distinct words in the word segmentation data set; $n$ denotes the $n$-th distinct word in the word segmentation data set; $v_n$ is the vector representation of the $n$-th distinct word; and $\exp$ is the exponential function with the natural constant as its base.
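As a worked illustration of the context-based softmax described above, the following NumPy sketch computes the conditional probability of every vocabulary word given a context, and the log-probability of one target word. It assumes that the context vector is the plain average of the context word vectors and that a single embedding table is shared by context and target words; both are simplifications chosen for this sketch, and the vectors are random stand-ins rather than trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim, c = 6, 4, 2                       # vocabulary size, vector size, context size
vectors = rng.normal(size=(N, dim))       # v_n: one row per distinct word (random stand-ins)


def cbow_probabilities(context_ids, vectors):
    """P(word n | context) for every n: softmax over the dot products with the context vector."""
    v_bar = vectors[context_ids].mean(axis=0)   # average of the 2c context vectors
    scores = vectors @ v_bar                    # dot product of v_bar with every v_n
    scores = scores - scores.max()              # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


context_ids = [0, 1, 3, 4]                # c words before and after the target (index 2)
probs = cbow_probabilities(context_ids, vectors)
log_likelihood_term = np.log(probs[2])    # this position's contribution to the objective
print(probs.round(3), log_likelihood_term)
```

Training adjusts the vectors so that such log-probability terms, summed over all positions in the text, become as large as possible.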
After the words in the word segmentation data set are vectorized with the Word2vec model, the resulting vectors are used as the input of the subsequent mining models for data analysis. Specifically, one total dimension text mining model and several dimension-divided text mining models are used; the number of dimension-divided text mining models is determined by the number of dimensions in the target frame, that is, by the number of primary indexes. Both the total dimension text mining model and the dimension-divided text mining models are implemented using an LSTM model.
The principle of the LSTM model is to first obtain a feed-forward representation of the network by unrolling its recurrent part, and then to train it by back propagation. The model protects and controls information through a forget gate, an input gate and an output gate.
The LSTM model uses the forget gate to decide which information should be discarded, as shown in formula (4), where $\sigma$ is the sigmoid activation function, $W_f$ is the forget gate weight matrix, $h_{t-1}$ is the output at time $t-1$, $x_t$ is the input at time $t$, and $b_f$ is the forget gate bias vector.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{4}$$
The LSTM model uses the input gate to decide which information in the input should be added. The first step determines the information to be updated, $i_t$, through a sigmoid activation function, as shown in formula (5); the second step obtains the candidate memory cell information $C'_t$ through the tanh (hyperbolic tangent) function, as shown in formula (6). Here $\sigma$ is the sigmoid activation function, $W_i$ is the input gate weight matrix, $b_i$ is the input gate bias vector, $W_c$ is the weight matrix of the tanh function, and $b_c$ is the bias vector of the tanh function.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{5}$$
$$C'_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \tag{6}$$
In order to allow valuable information to be passed on through the network, the LSTM model updates the cell state according to the forget gate and the input gate, as shown in formula (7), where $C_{t-1}$ is the cell state at time $t-1$.
$$C_t = f_t \cdot C_{t-1} + i_t \cdot C'_t \tag{7}$$
Finally, the LSTM model determines the output information through the output gate. The current cell state is first normalized by the tanh function; a sigmoid function then determines which information needs to be output, giving $o_t$; and the two results are multiplied to obtain the information output to the next layer. The output gate is given by formulas (8) and (9), where $\sigma$ is the sigmoid activation function, $W_o$ is the output gate weight matrix, and $b_o$ is the output gate bias vector.
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{8}$$
$$h_t = o_t \cdot \tanh(C_t) \tag{9}$$
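Formulas (4) to (9) can be traced in a few lines of NumPy. The sketch below runs one LSTM cell over a short sequence; the layer sizes, the random weights and the dictionary keys `f`, `i`, `c` and `o` are illustrative assumptions, and the products between gates and states are taken element-wise.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following formulas (4) to (9)."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])        # (4) forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])        # (5) input gate
    c_cand = np.tanh(W["c"] @ z + b["c"])     # (6) candidate cell information
    c_t = f_t * c_prev + i_t * c_cand         # (7) cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])        # (8) output gate
    h_t = o_t * np.tanh(c_t)                  # (9) output passed to the next step
    return h_t, c_t


hidden, inputs = 3, 4                         # illustrative sizes
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):      # a sequence of five word vectors
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```

For text classification, the final output `h` would typically be fed to a classification layer that predicts the dimension or element label; that layer is outside what the formulas above describe.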
In a specific implementation, the total dimension text mining model is trained first to mine the dimensions contained in the text data: the text data blocks are divided proportionally into a training set and a test set, the primary labels of the text data blocks in the training set are used as training labels, the training set and the training labels are input into the total dimension text mining model for training, and the test set is used to test how accurately the model mines dimensions;
the dimension-divided text mining models are then trained one by one to mine the elements of the data under each dimension: text data blocks with the same dimension label information (that is, the same primary index) in the manual labeling result are divided proportionally into a training set and a test set, the secondary labels under that primary index are used as training labels, the training set and the training labels are input into the dimension-divided text mining model corresponding to that primary index for training, and the test set is used to test how accurately the model mines elements under that dimension.
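The data preparation for this two-level training scheme might look like the sketch below. The 0.8 training proportion, the random seed and the sample labels are assumptions; the text only states that the data is divided "in proportion". Model training itself is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, str]   # (text block, primary label, secondary label)


def split(samples: List[Sample], train_ratio: float, seed: int = 0):
    """Shuffle and split into (train, test) by the given proportion."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def build_datasets(samples: List[Sample], train_ratio: float = 0.8):
    """Data for the total dimension model plus one dataset per dimension."""
    total = split(samples, train_ratio)        # the total model learns primary labels
    by_dim: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        by_dim[s[1]].append(s)                 # group blocks sharing a primary label
    per_dim = {d: split(group, train_ratio) for d, group in by_dim.items()}
    return total, per_dim


# Hypothetical labeled blocks.
data = [(f"block {i}", "dim_A" if i % 2 else "dim_B", f"elem_{i % 3}") for i in range(20)]
(total_train, total_test), per_dim = build_datasets(data)
print(len(total_train), len(total_test),
      {d: (len(tr), len(te)) for d, (tr, te) in per_dim.items()})
```

Each per-dimension dataset contains only blocks carrying that primary label, so each dimension-divided model is trained and tested solely on the secondary labels of its own dimension.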
Compared with traditional methods such as the literature method, content analysis, interviews and questionnaire surveys, this embodiment applies text mining technology so that a computer can readily interpret the meaning contained in the text, and the elements implicit in professional courses can be mined quickly and intelligently. The embodiment uses text mining to identify the elements contained in a professional course: by constructing a target frame, acquiring and processing data, and constructing a text mining model, it obtains a text mining model that can identify course elements. This model can identify the course elements contained in professional courses, providing a reference for course construction in each subject.
Example 2
As shown in fig. 2, the embodiment provides a course element recognition model training system based on text mining, which includes:
the framework building module 101 is used for building a target framework, and the target framework is provided with at least two levels of indexes;
the data acquisition module 102 is used for acquiring text data from political courses and professional courses;
the data preprocessing module 103 is configured to preprocess the acquired text data to obtain preprocessed text data;
the text feature extraction module 104 is configured to perform blocking on the preprocessed text data to obtain text data blocks, perform word segmentation on the text data blocks to obtain word segmentation data sets, and perform word vectorization on the word segmentation data sets to obtain word vector data sets;
the label labeling module 105 is used for manually labeling the text data blocks according to the at least two levels of indexes of the target frame to obtain corresponding labels;
and the text mining model building module 106 is used for building a text mining model, and training and testing the text mining model based on the word vector data set and the manually labeled labels to obtain a final text mining model.
In the framework building module 101, a target framework containing two levels of indexes can be built according to course-related policy documents and literature, the course requirements, and the core content. The first-level indexes are the dimensions, and several second-level indexes are subdivided under each dimension, namely the elements corresponding to that dimension. The text mining models can subsequently be set up according to these two levels of indexes to identify the dimensions and the corresponding elements.
In the data acquisition module 102, professional courses are courses other than political courses, such as "Applied Statistics"; relevant teaching materials are collected for these courses, the relevant text data is extracted from them, and the text data is then preprocessed by the data preprocessing module 103.
In a preferred embodiment, when a certain type of target teaching material needs to be analyzed, texts such as series course teaching materials and the like are correspondingly collected as text data, subject teaching materials related to the target teaching material are collected as auxiliary text data, and the text data and the auxiliary text data are used as training samples and test samples of the certain type of target teaching material for model training, so that a targeted recognition model is obtained, and dimensions and corresponding elements in the certain type of target teaching material can be recognized more accurately.
The pre-processing may include: null and duplicate removal, outlier removal, etc. The data preprocessing module 103 can remove null values, repeated values, and abnormal values in the text data using the pandas library in Python.
In the text feature extraction module 104, in order to identify the dimensions and elements involved in the teaching materials more accurately, each knowledge point and/or case in the teaching materials is treated as one data block for identification. In a specific implementation, the text data is partitioned according to the knowledge points and/or cases in the teaching materials of political courses and professional courses, giving a text data block for each knowledge point and/or case.
Then, word segmentation is performed on the partitioned text data, which specifically comprises the following steps:
pre-constructing a custom dictionary according to the terms of political courses and professional courses;
and performing word segmentation processing on the text data blocks corresponding to the knowledge points and/or the cases according to the custom dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or the cases.
During the word segmentation process, meaningful words such as compound words may be continually picked out of the segmented text and added to the custom dictionary. In a specific implementation, the jieba word segmentation package in Python may be used together with the constructed custom dictionary to segment the text data blocks, and the resulting word segmentation data set is used as the input of the text mining model.
Further, in order to improve the accuracy of data processing and reduce the dimensionality of the data, a custom stop-word dictionary is also constructed in advance according to the terms of political courses and professional courses. The custom stop words may be words of no research significance; they are collected into the custom stop-word dictionary, and applying this dictionary during word segmentation improves segmentation accuracy.
Further, word segmentation processing is performed on the text data blocks corresponding to the knowledge points and/or cases according to the custom dictionary and the custom stop-word dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or cases. In this process, the Harbin Institute of Technology (HIT) stop-word list may be combined with the constructed custom stop-word lexicon to remove stop words.
After word segmentation processing is finished, word vectorization is carried out on the word segmentation data sets corresponding to the knowledge points or the cases based on the word vectorization model to obtain word vector data sets corresponding to the knowledge points or the cases.
The specific way of manual labeling by the label labeling module 105 is as follows:
dimension labeling is carried out on each text data block according to a plurality of dimensions, and dimension label information corresponding to each text data block is obtained and used as a primary label;
and performing element labeling on the text data blocks according to the dimension label information corresponding to each text data block to obtain element label information of each text data block under the corresponding dimension, and using the element label information as a secondary label.
That is, the dimensions and elements of the target framework are used to assign an appropriate primary index label and an appropriate secondary index label to each text data block.
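The two-level labeling can be pictured as a small data structure. The sketch below is illustrative only: the framework contents (`dimension_A`, `element_A1`, and so on) are placeholders rather than the dimensions of the patent, and `LabeledBlock` and `label_block` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

# Placeholder target framework: dimension (primary index) -> elements (secondary indexes).
TARGET_FRAMEWORK: Dict[str, List[str]] = {
    "dimension_A": ["element_A1", "element_A2"],
    "dimension_B": ["element_B1"],
}


@dataclass(frozen=True)
class LabeledBlock:
    text: str
    primary_label: str    # dimension label
    secondary_label: str  # element label under that dimension


def label_block(
    text: str,
    dimension: str,
    element: str,
    framework: Dict[str, List[str]] = TARGET_FRAMEWORK,
) -> LabeledBlock:
    """Attach a primary and a secondary label, checking them against the framework."""
    if dimension not in framework:
        raise ValueError(f"unknown dimension: {dimension}")
    if element not in framework[dimension]:
        raise ValueError(f"element {element} does not belong to {dimension}")
    return LabeledBlock(text, dimension, element)
```

Validating the secondary label against its dimension keeps a manual annotation from pairing an element with the wrong primary index.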
In the text mining model building module 106, the word segmentation data set is word-vectorized by the preceding module to obtain text semantic features, and a text mining model is built on a Long Short-Term Memory (LSTM) network. To improve the accuracy and running speed of text mining, the text mining model comprises a total dimension text mining model and a plurality of dimension-divided text mining models, one for each dimension. Finally, each model is trained and tested based on the word vector data set and the manually labeled labels.
The word vectorization can be implemented with a Word2vec model, which represents each word in the word segmentation data set as a dense vector to obtain text semantic features.
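A minimal sketch of this step, assuming the gensim 4.x `Word2Vec` class is used as the Word2vec implementation (the patent does not name a library). The helper names, the hyperparameter values, and the zero-vector handling of unseen words are assumptions made for illustration.

```python
from typing import List, Mapping, Sequence


def train_word2vec(corpus: Sequence[Sequence[str]], vector_size: int = 100):
    """Train Word2vec on segmented text blocks and return the word vectors.

    Assumes gensim 4.x; window and min_count are illustrative values.
    """
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=corpus, vector_size=vector_size, window=5, min_count=1, workers=1
    )
    return model.wv


def vectorize_tokens(
    tokens: Sequence[str],
    vectors: Mapping[str, Sequence[float]],
    vector_size: int,
) -> List[List[float]]:
    """Map each token to its dense vector; unseen tokens get a zero vector."""
    zero = [0.0] * vector_size
    return [list(vectors[t]) if t in vectors else list(zero) for t in tokens]
```

`vectorize_tokens` accepts any mapping that supports `in` and indexing, so the object returned by `train_word2vec` or a plain dictionary can be passed.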
After the words in the word segmentation data set are vectorized with the Word2vec model, they are used as the input of the subsequent text mining models for data analysis. Specifically, a total dimension text mining model and a plurality of dimension-divided text mining models are used, where the number of dimension-divided text mining models is determined by the number of dimensions in the target frame, that is, the number of primary indexes. Both the total dimension text mining model and the dimension-divided text mining models are implemented with an LSTM model.
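One way to picture the model family is the PyTorch sketch below. The patent specifies only that the models are LSTM-based; the single-layer architecture, the hidden size, and the names `LSTMClassifier` and `build_models` are assumptions made for illustration.

```python
from typing import Dict, Tuple

import torch
from torch import nn


class LSTMClassifier(nn.Module):
    """LSTM over a sequence of word vectors, followed by a linear classifier."""

    def __init__(self, embedding_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        # word_vectors: (batch, sequence_length, embedding_dim)
        _, (last_hidden, _) = self.lstm(word_vectors)
        # last_hidden[-1]: final hidden state of the last layer, (batch, hidden_dim)
        return self.classifier(last_hidden[-1])


def build_models(
    elements_per_dimension: Dict[str, int],
    embedding_dim: int = 100,
    hidden_dim: int = 64,
) -> Tuple[LSTMClassifier, Dict[str, LSTMClassifier]]:
    """One total dimension model plus one dimension-divided model per dimension."""
    total = LSTMClassifier(embedding_dim, hidden_dim, len(elements_per_dimension))
    per_dimension = {
        name: LSTMClassifier(embedding_dim, hidden_dim, count)
        for name, count in elements_per_dimension.items()
    }
    return total, per_dimension
```

The total model has one output per dimension, while each dimension-divided model has one output per element of its own dimension, so the number of dimension-divided models follows the number of primary indexes.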
In a specific implementation, the total dimension text mining model is trained first to mine the dimensions contained in the text data: the text data blocks are divided proportionally into a training set and a test set, the primary labels of the text data blocks in the training set are taken as training labels, the training set and the training labels are input into the total dimension text mining model for training, and the test set is used to test the accuracy with which the model mines dimensions;
the dimension-divided text mining models are then trained one by one to mine the elements of the data under each dimension: the text data blocks that share the same dimension label information (that is, the same primary index) in the manual labeling result are divided proportionally into a training set and a test set, the secondary labels of the training-set text data blocks under that primary index are taken as training labels, the training set and the training labels are input into the dimension-divided text mining model corresponding to that primary index for training, and the test set is used to test the accuracy with which the model mines elements under that dimension.
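The data partitioning behind these two training stages can be sketched in plain Python. The helper names and the tuple layout are hypothetical; the default ratio of 0.9 mirrors the 9:1 split reported in the worked example later in the description, and the fixed shuffle seed is an assumption added for reproducibility.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (text or word vectors, primary label, secondary label)
Block = Tuple[str, str, str]


def split_train_test(
    blocks: Sequence[Block], train_ratio: float = 0.9, seed: int = 0
) -> Tuple[List[Block], List[Block]]:
    """Shuffle the blocks and divide them proportionally into train and test sets."""
    shuffled = list(blocks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def group_by_dimension(blocks: Sequence[Block]) -> Dict[str, List[Block]]:
    """Collect the blocks that share the same primary (dimension) label."""
    groups: Dict[str, List[Block]] = defaultdict(list)
    for block in blocks:
        groups[block[1]].append(block)
    return dict(groups)
```

The total dimension model would be trained on `split_train_test(all_blocks)` with primary labels as targets; each dimension-divided model would be trained on `split_train_test(group)` for its own group, with secondary labels as targets.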
In this method, building the target framework makes it easier for a computer to understand the meaning contained in the text. The text data of the related courses is processed with text mining technology, the internal relation between professional courses and education is established, and the elements contained in the professional courses are mined, providing support for course construction and for cultivating professional talents with good literacy.
Example 3
The embodiment provides a course element identification method based on text mining on the basis of embodiment 1, and the identification method is implemented by using the final text mining model obtained in the above embodiment 1; after preprocessing the curriculum text data to be processed, sequentially carrying out blocking, word segmentation and word vectorization on the preprocessed curriculum text data to obtain word vectorization data to be processed;
and inputting the word vectorization data to be processed into the final text mining model for processing to obtain dimensions corresponding to the knowledge points and/or cases in the course text data and elements corresponding to the dimensions.
Further, the preprocessing comprises removing empty values, duplicates, and abnormal values. The preprocessed course text data is then partitioned according to knowledge points and/or cases, the partitioned data is segmented into words, word vectorization is performed after word segmentation, and the result is input into the final text mining model. The following steps may be performed when the data is input into the text mining model:
Dimension identification: the word vector data to be processed is first input, per knowledge point and/or case, into the total dimension text mining model, which mines the dimension that the text most probably implies; the mining result of each knowledge point and/or case under the primary index is obtained and used as the basis for selecting a dimension-divided text mining model.
Element identification: according to the mining result under the primary index, each knowledge point and/or case is input into the corresponding dimension-divided text mining model, which mines the implicit elements and yields the element mining result under that dimension; once all knowledge points and/or cases have been identified, the text mining results are integrated to obtain the relation between the course to be processed and the elements.
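The two identification steps amount to a routing scheme, sketched below with stand-in callables in place of trained models. The function name, the score dictionaries, and the 0.5 element threshold are assumptions; the source states only that the most probable dimension is selected and that several implicit elements may be mined.

```python
from typing import Callable, List, Mapping, Sequence, Tuple

Vectors = Sequence[Sequence[float]]
Scorer = Callable[[Vectors], Mapping[str, float]]


def identify_elements(
    block_vectors: Vectors,
    total_model: Scorer,
    dimension_models: Mapping[str, Scorer],
    element_threshold: float = 0.5,
) -> Tuple[str, List[str]]:
    """Pick the most probable dimension, then mine elements under that dimension."""
    # Step 1: dimension identification with the total dimension model.
    dimension_scores = total_model(block_vectors)
    dimension = max(dimension_scores, key=dimension_scores.get)

    # Step 2: element identification with the matching dimension-divided model.
    element_scores = dimension_models[dimension](block_vectors)
    elements = [
        name for name, score in element_scores.items() if score >= element_threshold
    ]
    return dimension, elements
```

Running this for every knowledge point and case and collecting the returned pairs gives the course-to-element relation described above.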
The technical solutions of example 1 and example 3 are further illustrated below by way of examples.
Step S1 of embodiment 1 is performed to construct a target framework. Six dimensions are established according to the relevant guidance documents and the course teaching objectives, and 13 secondary indexes are subdivided within the six dimensions according to the core content of the course teaching.
Step S2 of embodiment 1 is executed to complete data acquisition and processing. Relevant text data is gathered from the goal perspective and the course perspective, respectively. 'Applied Statistics', edited by Jiang Jun Ping, is taken as the course to be processed, that is, the target teaching material, and the texts of this teaching material and 10 related statistics teaching materials are collected as the data set. Empty and duplicate entries are removed from all text data using the pandas library.
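The cleaning step can be sketched with pandas as follows. The column name `text` and the function name `clean_text_data` are hypothetical, and abnormal-value removal is left out because the source does not define what counts as abnormal.

```python
import pandas as pd


def clean_text_data(frame: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Remove empty and duplicate text entries from the collected data."""
    cleaned = frame.copy()
    # Normalize surrounding whitespace so near-identical rows compare equal.
    cleaned[text_column] = cleaned[text_column].str.strip()
    # Drop missing values and empty strings.
    cleaned = cleaned.dropna(subset=[text_column])
    cleaned = cleaned[cleaned[text_column] != ""]
    # Drop duplicate texts, keeping the first occurrence.
    cleaned = cleaned.drop_duplicates(subset=[text_column])
    return cleaned.reset_index(drop=True)
```

The cleaned frame can then be partitioned into text data blocks by knowledge point and/or case.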
Step S3 of embodiment 1 is executed: the collected texts are partitioned according to knowledge points and/or cases to obtain text data blocks;
and step S4 of embodiment 1 is executed: according to the target framework, the text data blocks are labeled with the six dimensions and their subordinate secondary indexes to obtain primary labels and secondary labels.
Step S5 of embodiment 1 is executed: the jieba library is used to perform Chinese word segmentation on the text data blocks to obtain a word segmentation data set; a custom dictionary is loaded to improve segmentation accuracy, and a custom stop dictionary is loaded to remove stop words, improving the accuracy of data processing and reducing the dimensionality of the data. The custom dictionary and the custom stop dictionary can be updated continuously by inspecting the segmentation and stop-word results, so as to obtain better processing results. Word vectorization is then performed on the word segmentation data set to obtain a word vector data set: by training a Word2vec model, each word in the word segmentation data set is represented as a dense vector, giving the text semantic features.
Step S6 of embodiment 1 is executed to construct a text mining model based on the LSTM model. To improve accuracy and running speed, the text mining model comprises one total dimension text mining model and six dimension-divided text mining models. The word vector data set is divided at a ratio of 9:1 into a training set and a test set; the training set data and the corresponding labels are input into each model to complete training, and each model is tested with the test set. The average accuracy of the resulting text mining models is 97.4% on the training set and 93.1% on the test set.
The steps of embodiment 3 are executed to identify course elements. Text mining is performed on the knowledge points and cases in the eleven chapters of the target teaching material 'Applied Statistics': the text data of the target teaching material, after preprocessing, blocking, word segmentation, and word vectorization, is input into the total dimension text mining model to classify it by dimension and obtain the corresponding dimension mining result; according to the classification result, the corresponding dimension-divided text mining model is then used to obtain the implicit elements. The element mining results involve 25 relevant knowledge points and cases in total.
This example study shows that the course element identification method based on text mining can quickly and intelligently identify, from text data, the course elements implicit in a course. The method can be applied and extended to various professional subjects, provides a reference for teachers of those subjects in constructing their courses, and supports colleges and universities in training high-quality talents who have both literacy and professional knowledge.
It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the claims of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A course element recognition model training method based on text mining is characterized by comprising the following steps:
constructing a target frame, wherein the target frame is provided with a plurality of dimensions and elements corresponding to the dimensions;
acquiring text data from teaching materials of political courses and professional courses and preprocessing the text data to obtain preprocessed text data;
partitioning the preprocessed text data to obtain text data blocks;
manually labeling the text data block according to the plurality of dimensions and the elements corresponding to the dimensions to obtain corresponding labels;
performing word segmentation on the text data block to obtain a word segmentation data set, and performing word vectorization on the word segmentation data set to obtain a word vector data set;
and constructing a text mining model, and training and testing the text mining model based on the word vector data set and the manually labeled labels to obtain a final text mining model.
2. The training method of course element recognition model based on text mining as claimed in claim 1, wherein the step of blocking the preprocessed text data to obtain the text data block specifically comprises:
and manually partitioning the preprocessed text data according to the knowledge points and/or cases in the teaching materials of the political courses and the professional courses to obtain text data blocks corresponding to the knowledge points and/or cases.
3. The training method of course element recognition model based on text mining as claimed in claim 2, wherein performing segmentation processing on the text data block to obtain a segmentation data set specifically comprises:
and pre-constructing a custom dictionary according to the political course and the professional course, and performing word segmentation processing on the text data blocks corresponding to the knowledge points and/or the cases according to the custom dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or the cases.
4. The method as claimed in claim 3, wherein the step of performing word segmentation on the text data block to obtain a word segmentation data set further comprises:
pre-constructing a self-defined stop dictionary according to the terms of political courses and professional courses;
and performing word segmentation processing on the text data blocks corresponding to the knowledge points and/or the cases according to the user-defined dictionary and the user-defined stop dictionary to obtain word segmentation data sets corresponding to the knowledge points and/or the cases.
5. The course element recognition model training method based on text mining as claimed in claim 4, wherein the word vectorization processing is performed on the word segmentation data set to obtain a word vector data set, specifically:
and performing word vectorization processing on the word segmentation data set corresponding to the knowledge point/case by adopting a Word2vec model to obtain a word vector data set corresponding to the knowledge point/case.
6. The training method of course element recognition model based on text mining as claimed in claim 5, wherein said manually labeling the text data block according to the plurality of dimensions and the elements corresponding to each dimension to obtain the corresponding label specifically comprises:
performing dimension labeling on each text data block according to a plurality of dimensions, and acquiring dimension label information corresponding to each text data block;
and performing element labeling on the text data blocks according to the dimension label information corresponding to each text data block to obtain the element label information of each text data block under the corresponding dimension.
7. The method as claimed in claim 6, wherein the final text-mining model specifically comprises:
the total dimension text mining model is used for identifying the dimension corresponding to the curriculum text data to be processed;
and the dimension-divided mining model corresponding to each dimension is used for identifying elements of the curriculum text data to be processed under the current dimension.
8. The method of claim 7, wherein the total dimension text mining model and the dimension-divided mining models are implemented using an LSTM model.
9. A course element recognition model training system based on text mining, comprising:
the framework building module is used for building a target framework, and the target framework is provided with at least two levels of indexes;
the data acquisition module is used for acquiring text data from political courses and professional courses;
the data preprocessing module is used for preprocessing the acquired text data to obtain preprocessed text data;
the text characteristic extraction module is used for segmenting the preprocessed text data to obtain text data blocks, segmenting the text data blocks to obtain word segmentation data sets, and carrying out word vectorization processing on the word segmentation data sets to obtain word vector data sets;
the label marking module is used for manually marking the text data block according to the at least two levels of indexes of the target framework to obtain a corresponding label;
and the text mining model building module is used for building a text mining model, and training and testing the text mining model based on the word vector data set and the manually labeled labels to obtain a final text mining model.
10. A course element recognition method based on text mining, which is characterized by being realized by adopting the final text mining model of any one of claims 1 to 8;
preprocessing the curriculum text data to be processed, sequentially partitioning, segmenting words and performing word vectorization on the preprocessed curriculum text data to obtain word vectorization data to be processed;
and inputting the word vectorization data to be processed into the final text mining model for processing to obtain the dimensions corresponding to the course text data blocks and the elements corresponding to the dimensions.
CN202211249113.6A 2022-10-12 2022-10-12 Course element recognition model training and recognition method based on text mining Pending CN115879463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211249113.6A CN115879463A (en) 2022-10-12 2022-10-12 Course element recognition model training and recognition method based on text mining

Publications (1)

Publication Number Publication Date
CN115879463A true CN115879463A (en) 2023-03-31

Family

ID=85770413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211249113.6A Pending CN115879463A (en) 2022-10-12 2022-10-12 Course element recognition model training and recognition method based on text mining

Country Status (1)

Country Link
CN (1) CN115879463A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556264A (en) * 2024-01-11 2024-02-13 浙江同花顺智能科技有限公司 Training method and device for evaluation model and electronic equipment
CN117556264B (en) * 2024-01-11 2024-05-07 浙江同花顺智能科技有限公司 Training method and device for evaluation model and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination