CN112149884A - Academic early warning monitoring method for large-scale students - Google Patents

Academic early warning monitoring method for large-scale students

Info

Publication number
CN112149884A
CN112149884A (application CN202010928263.4A)
Authority
CN
China
Prior art keywords
data
risk
prediction
learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010928263.4A
Other languages
Chinese (zh)
Inventor
龚少麟
满青珊
赵文涛
郝路遥
王骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Laiwangxin Technology Research Institute Co ltd
Original Assignee
Nanjing Laiwangxin Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Laiwangxin Technology Research Institute Co ltd filed Critical Nanjing Laiwangxin Technology Research Institute Co ltd
Priority to CN202010928263.4A priority Critical patent/CN112149884A/en
Publication of CN112149884A publication Critical patent/CN112149884A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Abstract

The invention provides an academic early warning monitoring method oriented to large-scale student populations. Building on research into prediction methods and big-data processing platforms, it offers an effective remedy for the lack of student differentiation analysis and personalized teaching methods in current intelligent teaching applications: an offline learning prediction model is established on the basis of a parallel computing framework and a learning prediction algorithm, and the solution comprises four stages of feature processing, modeling preparation, model training and model deployment. The learning prediction process realizes a large-scale real-time learning prediction system based on Spark and HBase. Based on the prediction results, at-risk students (those predicted to fail their courses) are monitored closely to find their risk points. The risk student monitoring process includes cluster analysis of the risk group; the purpose of the cluster analysis is to discover the risk points of each at-risk student and to provide suggestions for different student groups.

Description

Academic early warning monitoring method for large-scale students
Technical Field
The invention relates to the field of academic situation analysis in intelligent teaching, and in particular to an academic early warning monitoring method for large-scale students.
Background
As research on intelligent teaching in the education industry grows deeper and broader, differentiated teaching has become one of the hot spots of teaching research. Learners with good academic performance differ to some extent in their learning behaviors from other learners. Based on learners' learning behaviors, a supervised algorithm can be used to predict whether a learner will pass a course, and the learning characteristics of low-efficiency learners can be further mined. Academic prediction and clustering results can provide targeted guidance for teachers' teaching activities and students' learning activities, which has important practical significance. At present, online learning platforms are widely used in colleges and universities and record large amounts of learning behavior data. Many academic prediction methods acquire relevant feature data from the learning behavior data recorded on online learning platforms, predict learning outcomes through data analysis and mining, and implement teaching interventions accordingly. However, the problems are also obvious: 1. in large-scale scenarios the volume and complexity of the data to be processed increase sharply, and existing non-distributed machine-learning-based academic prediction methods take long to run, are inefficient, and can hardly meet the requirement of real-time response; 2. because academic data have complicated hierarchies and diverse categories, existing methods rely only on prior knowledge and simply extract a few features, which inevitably loses important information and weakens model performance; 3. learning-situation data cover all aspects of student learning and are high-dimensional and heterogeneous in style, for example hundred-mark and ten-mark grading scales, discrete data and continuous data, and so on; 4. many schemes are limited to result prediction and do not mine further information from the data, so the conclusions are thin and poorly supported, and it is difficult to provide targeted solutions. A processing scheme is therefore needed to overcome these problems.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to address the defects of the prior art by providing an academic early warning monitoring method for large-scale student populations. The scheme has been verified on the education platform of a vocational college with about 200,000 students, where the volume of related data reaches the tens of millions of records.
The method comprises a academic prediction process and a risk student monitoring process, wherein the academic prediction process comprises the following steps:
step a1, processing the learning activity data of the learner by characteristics;
step a2, carrying out modeling preparation;
step a3, performing model batch training by utilizing Spark MLlib to obtain a trained prediction model;
step a4, deploying a prediction model, performing Spark Streaming real-time calculation, and predicting the learning situation.
The risk student monitoring process comprises the following steps:
b1, extracting samples which are subjected to preprocessing, standardization processing and feature selection from a database according to the output result of the academic prediction to form a risk student set;
b2, clustering the risk student sets by using a DBSCAN algorithm with the parameter eps subjected to self-adaptive adjustment, and mining the risk characteristics of the risk subsets in each risk student set;
and b3, for risk subsets with different characteristics, providing countermeasures based on the expert knowledge base to mitigate the risk level of each risk group.
Step a1 includes:
extracting characteristics of learning activity data of a learner, wherein the learning activity data comprises online learning behavior data, offline learning behavior data and scores of basic courses;
the on-line learning behavior data comprises classroom test results, attendance conditions and results of job completion;
the offline learning behavior data comprises an online operation completion result, online video watching time, online examination times, an online test result, the number of posts in a forum and operation review times;
the score data of the basic courses refers to the scores of all the basic courses;
the method comprises the following steps of uniformly calling various sub-data under learning activity data, on-line learning behavior data and off-line learning behavior data as characteristic data, setting k sub-data, calling k groups of characteristic data as k indexes, and obtaining the relative importance degree of the characteristic data based on an entropy weight method:
the information entropy is calculated using the following formula:
Figure RE-GDA0002736993100000021
the information entropy redundancy is calculated by using the following formula: dj=1-Hj
Wherein, p (x)i) Representing random events xiProbability of (H)jIs the entropy of the jth set of characteristic data, djThe information entropy redundancy of the jth group of feature data is shown, n is the sum of the number of samples, i is less than or equal to n, each i corresponds to a learning subject, such as a student, and n represents the number of all students.
Step a1 further includes:

Assuming there are m groups of feature data, x_ij is the value of the j-th index for the i-th sample, i = 1, …, n; j = 1, …, m.

Calculate the proportion p_ij of the i-th sample value under the j-th index:

p_ij = x_ij / Σ_{i=1}^{n} x_ij

Calculate the entropy value H_j under the j-th index:

H_j = -k · Σ_{i=1}^{n} P_ij

where k = 1/ln(n) and P_ij = p_ij·ln(p_ij).

Calculate the information entropy redundancy d_j:

d_j = 1 - H_j

Calculate the weight w_j of each index:

w_j = d_j / Σ_{j=1}^{m} d_j

Sort the weights in descending order to obtain the weight set W = {w_1, …, w_j, …, w_m} and select the first t weights such that:

Σ_{j=1}^{t} w_j > 0.8

Finally, t groups of feature data are obtained.
Step a1 further includes: after the importance of each group of feature data is obtained with the entropy weight method, the first five groups of feature data whose cumulative importance exceeds 80% are selected to form the training set, so that X = {x_1, x_2, x_3, x_4, x_5}, where x_1, x_2, x_3, x_4, x_5 respectively represent the number of forum posts, the average assignment score, the average score of the related basic courses, the average weekly online video viewing time, and the weekly participation frequency on the online learning platform.
Step a2 includes: establishing the Logistic regression model function g(z):

g(z) = 1 / (1 + e^(-z))

The output value y of the function g(z) lies in the open interval (0, 1). If y ≥ 0.5, the output label value is 1, indicating a passing result; if y < 0.5, the output label value is 0, indicating a failing result.

z is the classification decision boundary, calculated as:

z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4 + θ_5·x_5

θ_i (i = 0, 1, …, 5) are the regression coefficients of the Logistic regression model; for example, θ_1 is the Logistic regression coefficient of the number of forum posts, and θ_0 is the bias term.

The training set X is standardized based on the following formula:

Z = (X - μ) / σ

where Z is the feature data after z-score standardization, μ is the mean of X, and σ is the standard deviation of X.
Step a3 includes: the LR model in Spark MLlib computes an error value at each iteration, and iteration stops when the error is smaller than a set error threshold; the default error threshold is 1e-6.

The performance of the trained model can be evaluated by computing classical metrics such as accuracy, precision, recall, F1, and the ROC curve. The model can also be saved and loaded using the API provided by MLlib.
Step a4 includes: deploying the trained prediction model online, selecting HBase to store daily learning activity data, using Spark to clean the data and calling the saved prediction model to predict, and storing the cleaned data and the prediction results in a MySQL database; finally, a real-time learning prediction service is built as a Web application so that teachers and students can view prediction results in real time.
Step b1 includes: according to the label values output by the academic prediction, extracting the samples that have undergone preprocessing, standardization, and feature selection from the MySQL database of step a4 to form the risk student set.
Step b2 includes: in the DBSCAN algorithm with the adaptively adjusted parameter eps, the parameter is determined by the following formulas:

MISE(h) = AMISE(h) + O(1/(nh) + h^4)

where the asymptotic integrated mean square error is

AMISE(h) = (1/(nh))·∫K^2(x)dx + (h^4/4)·[∫x^2·K(x)dx]^2·∫[f''(x)]^2 dx

MISE is the integrated mean square error; O(1/(nh) + h^4) denotes a function of n and h whose absolute value is bounded by a fixed multiple of |1/(nh) + h^4| once the argument is sufficiently large; K denotes the kernel function, generally a Gaussian kernel with mean 0 and variance 1; h is called the window width or smoothing parameter; n is the number of samples; x is an auxiliary integration variable that is integrated out and does not affect the result; ∫[f''(x)]^2 dx = ∫[φ''(x)]^2 dx / σ^5, where φ is the standard normal density function. The samples here are the n feature-data samples described above, and σ denotes the standard deviation of the n sample values under the given index. Taking the extreme value of MISE with respect to h, the extreme point gives the optimal eps value:

eps = h_opt = (4/(3n))^(1/5)·σ ≈ 1.06·σ·n^(-1/5)

The risk student set is clustered with the DBSCAN algorithm whose parameter eps has been adaptively adjusted in this way, yielding k risk subsets {r_1, …, r_k}, where r_k denotes the k-th risk subset. Let t_ij be the mean of risk subset r_j over the i-th feature and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j, with i = 1, 2, 3, 4, 5 and j = 1, 2, …, k.
Step b3 includes: establishing an expert knowledge base from teaching and research data and teacher opinions; the expert knowledge base performs rule matching on each input risk subset and outputs corresponding evaluations and targeted learning suggestions.
The main innovation points are as follows:
1. The academic prediction scheme based on the Spark distributed architecture increases system capacity and can efficiently process large-scale data; it is modular, reduces redundancy, improves the extensibility of the system, and speeds up development and release.
2. A feature selection method based on the entropy weight method is proposed. The method is suited to academic data analysis scenarios, can mine feature data of high information value from the data, and avoids the reliance on subjective experience found in traditional methods.
3. Two data preprocessing methods are proposed. The proportion averaging method fuses similar feature data, such as scores graded on different scales, so that the fused data reflect objective differences; the numericization method linearly converts discrete data into numerical values and thereby overcomes the tendency of one-hot encoding to produce sparse matrices.
4. On the basis of the academic prediction, the risk set is clustered and divided with an improved clustering algorithm, so that academic early-warning students can be monitored in a targeted manner in combination with the expert knowledge base, the students' course failure rate can be reduced, and teachers' workload can be lowered.
Beneficial effects: by using a highly practical prediction algorithm model and an unsupervised clustering model, the invention provides an academic prediction function for large-scale student populations, and at the same time can mine the characteristics of low-efficiency learning groups and subdivide the categories of low-efficiency learners, helping teachers provide differentiated and targeted teaching and intervention activities. As the academic-related data grow, the invention can scale its performance by adding Spark cluster nodes. The method is easy to deploy and highly reliable, and can perform real-time academic prediction on the large-scale real-time data continuously extracted during intelligent teaching.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 shows the data set after standardization.
FIG. 2 is a schematic diagram of the training set X.
FIG. 3 shows the academic prediction sub-process.
FIG. 4 shows the risk student monitoring sub-process.
Fig. 5 shows the overall flow of the academic prediction monitoring system.
Detailed Description
In this embodiment, a large-scale student-oriented academic early warning monitoring system is constructed based on the academic early warning monitoring method described above. As shown in fig. 3 and 5, the system can predict students' academic risk and can evaluate and advise at-risk students based on an expert knowledge base. The system distinguishes academic-risk students from normal students well and enhances the pertinence of teaching and of teacher intervention; project practice shows that it can significantly improve students' course pass rate. Fig. 3 shows the framework for model training and prediction based on Spark: an offline learning prediction model is established on the basis of parallel computing and a logistic regression algorithm within the Spark framework, and in a real-time environment a large-scale real-time academic prediction system based on Spark Streaming and Kafka is realized. The scheme is divided into four stages: the first stage is feature processing; the second is modeling preparation; the third is prediction model training; and the fourth is prediction model deployment.
1. Stage of feature processing
The characteristic processing stage mainly comprises: data preprocessing, data standardization processing and feature selection.
The online learning behavior data, offline learning behavior data, and score data of the relevant basic courses can be obtained through the online learning platform. The online learning behavior data include online assignment completion results, online video viewing time, the number of online exams, online test results, the number of forum posts, and the number of assignment reviews; the offline learning behavior data include classroom test results, attendance, and homework completion results; the basic course score data are the scores of the basic courses. These data implicitly contain information relevant to the target variable, i.e., student performance and status.
The online learning behavior data, offline learning behavior data, and score data of the related basic courses of all students in the previous term are extracted from the database of the online learning platform as the feature part of the data set, and the students' average end-of-term exam scores are taken as the label data of the data set (if the student passes, the label value is set to 1; otherwise it is set to 0). An initial data set P = {p_1, …, p_i, …, p_m} can thus be generated, where 1 ≤ i ≤ m and m is the number of groups of feature data. For example, p_i may represent the online test results of all students; assuming the sample size of the training set is n, the column vector p_i has length n.
(1) Data pre-processing
In the field of intelligent teaching, attribute construction of academic datasets often involves a variety of scenarios, such as: multi-category score value merging processing, discrete value processing and the like. The present invention therefore proposes two specific data preprocessing methods.
1) Proportion averaging method

The score combination processing may adopt the proportion averaging method:

S = (a_1/M_1 + a_2/M_2 + … + a_i/M_i + … + a_n/M_n) × 100 / n,  i = 1, 2, …, n

where S is the combined percentage score, a_i is the score of the i-th category, M_i is the full mark of the i-th category, and n is the number of score categories.

The proportion averaging method effectively eliminates the fusion distortion caused by scores graded against different full marks and gives a better comprehensive evaluation of an object's overall score.

In the feature part of the data set, the online exam scores, classroom test scores, and similar scores of all courses in the previous term are fused using the proportion averaging method; in the label part of the data set, the end-of-term exam results can be fused.
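For illustration only, the following minimal Python sketch shows how such a proportion-average fusion could be computed; the function and variable names are illustrative and not taken from the patent.

```python
def proportion_average(scores, full_marks):
    """Fuse scores graded against different full marks into one percentage score.

    scores[i] is the raw score of the i-th category and full_marks[i] its full mark,
    following S = (a_1/M_1 + ... + a_n/M_n) * 100 / n.
    """
    if not scores or len(scores) != len(full_marks):
        raise ValueError("scores and full_marks must be non-empty and of equal length")
    n = len(scores)
    return sum(a / m for a, m in zip(scores, full_marks)) * 100.0 / n

# Example: a 10-point quiz scored 8, a 150-point exam scored 120, a 100-point test scored 70
print(proportion_average([8, 120, 70], [10, 150, 100]))  # -> approximately 76.67
```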
2) Numerical processing
When building the learning prediction model, discrete values often have to be handled; for example, a feature may take four grades A, B, C, D, or three grades such as good, medium, poor. Only after the discrete values are converted into numerical values can they be fed into the model for training. The usual approach is one-hot encoding; however, when there are many grade levels this clearly produces a large sparse matrix and hurts performance. Since most grade indexes divide a range into bands with an obvious linear character, a linear mapping method is adopted here to process the discrete values.

For example, if the grade set D consists of four grades {d1, d2, d3, d4}, ranked from high to low by level of excellence, and the values are numericized on a percentage basis, then the score of grade d_i is:

V_i = 100 - (100·i / n) + 50/n

where n is the number of elements in set D and i = 1, 2, 3, 4.
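A minimal sketch of this linear mapping, assuming the percentage-based formula above (names are illustrative, not from the patent):

```python
def numericize_grades(grades):
    """Map an ordered grade set (best grade first) to percentage values.

    V_i = 100 - 100*i/n + 50/n for the i-th grade (1-based), i.e. the midpoints
    of n equal bands on the 0-100 scale.
    """
    n = len(grades)
    return {g: 100 - 100 * (i + 1) / n + 50 / n for i, g in enumerate(grades)}

print(numericize_grades(["A", "B", "C", "D"]))
# {'A': 87.5, 'B': 62.5, 'C': 37.5, 'D': 12.5}
```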
Data preprocessing is a very important preliminary step in the data-mining modeling process; its main purposes are to reduce the dimensionality of the data set and improve its quality, which largely determines the upper bound of model performance. This stage also involves common preprocessing operations such as null value removal and outlier handling, which are not described further here because of their generality.
(2) Data normalization process
The main purposes of data standardization are to reduce the magnitude of each type of data and to eliminate the dimension. Since the various kinds of learning-situation data exhibit an approximately normal distribution, the invention uses the z-score standardization method:

Z = (X - μ) / σ

where X is the original feature, Z is the feature data after z-score standardization, μ is the mean of X, and σ is the standard deviation of X.

Applying z-score standardization to the preprocessed data avoids dependence on extreme values of the data, effectively equalizes each kind of feature data, and improves the performance of the trained model.
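A small NumPy sketch of the column-wise z-score standardization (illustrative names only):

```python
import numpy as np

def z_score(X):
    """Standardize each feature column: Z = (X - mu) / sigma."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```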
(3) Feature selection
The feature selection is to select and extract features of the original data set so as to obtain a new data set which is smaller in scale and has stronger relevance with the target variable. Based on a large-scale user scene of intelligent teaching, the invention provides a feature selection method based on entropy weight.
The entropy weight method is an objective weighting method because it relies only on the discreteness of the data itself.
Information entropy:

H(X) = -Σ_{i=1}^{n} p(x_i)·ln p(x_i)

Information entropy redundancy: d_j = 1 - H_j

where p(x_i) is the probability of the random event x_i, H_j is the information entropy of the j-th group of feature data, and d_j is the information entropy redundancy of the j-th group of feature data.
The method comprises the following specific steps:
1) Suppose there are n samples and m groups of feature data; then x_ij is the value of the j-th index for the i-th sample (i = 1, …, n; j = 1, …, m), as shown in fig. 1.

2) Calculate the proportion of the i-th sample value under the j-th index:

p_ij = x_ij / Σ_{i=1}^{n} x_ij

3) Calculate the entropy value under the j-th index:

H_j = -k · Σ_{i=1}^{n} P_ij

where k = 1/ln(n) and P_ij = p_ij·ln(p_ij).

4) Calculate the information entropy redundancy:

d_j = 1 - H_j,  (j = 1, 2, …, m)

5) Calculate the weight of each index:

w_j = d_j / Σ_{j=1}^{m} d_j

6) Sort the weights in descending order to obtain W = {w_1, …, w_j, …, w_m} and select the first t weights such that:

Σ_{j=1}^{t} w_j > 0.8

Finally, t groups of feature data are obtained.
Feature selection mitigates the curse of dimensionality and reduces the difficulty of training the model. Feature extraction based on the entropy weight method yields the importance of each group of feature data, from which high-value feature data can be selected. A data set X is obtained through the feature selection process; a portion of it is shown in fig. 2.
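As a sketch only, the entropy-weight feature selection described above could be implemented with NumPy as follows; the function names are illustrative, a non-negative feature matrix is assumed, and a small constant guards against log(0):

```python
import numpy as np

def entropy_weights(X, tiny=1e-12):
    """X: (n_samples, m_indexes) non-negative matrix. Returns the weight w_j of each index."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    p = X / (X.sum(axis=0, keepdims=True) + tiny)    # p_ij = x_ij / sum_i x_ij
    k = 1.0 / np.log(n)
    H = -k * np.sum(p * np.log(p + tiny), axis=0)     # entropy H_j of each index
    d = 1.0 - H                                       # information entropy redundancy d_j
    return d / d.sum()                                # weights w_j

def select_top_indexes(X, threshold=0.8):
    """Return the smallest set of indexes whose cumulative weight exceeds the threshold."""
    w = entropy_weights(X)
    order = np.argsort(w)[::-1]                       # sort indexes by weight, descending
    t = int(np.searchsorted(np.cumsum(w[order]), threshold)) + 1
    return order[:t], w
```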
2. Modeling preparation phase
The modeling preparation phase comprises: data set division, model selection construction and parameter setting.
(1) Training set partitioning
The samples in the data set are randomly shuffled and then split into a training set and a test set at a ratio of 8:2. The training set is used to train the model, and the test set is used to verify the model's performance.
(2) Model selection construction
The method adopts the Logistic regression model g(z):

g(z) = 1 / (1 + e^(-z))

The output value y of the function g(z) lies in the open interval (0, 1). If y ≥ 0.5, the output label value is 1, indicating a passing result; if y < 0.5, the output label value is 0, indicating a failing result.

z is the classification decision boundary:

z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4 + θ_5·x_5

where θ_j (j = 0, 1, …, 5) are the Logistic regression coefficients; for example, θ_1 is the Logistic regression coefficient of the number of forum posts, and θ_0 is the bias term.
(3) Parameter setting
A logistic regression instance is created in the Spark MLlib module. The iteration error threshold Δmin is set to 1e-6, an L2 regularization penalty term is used with regularization parameter 0.01, the binary classification threshold is set to 0.5, and the other parameters keep their default values.
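For illustration, a PySpark sketch of such a configuration is shown below; the patent gives no code, so the DataFrame-based pyspark.ml API, the column names, and the input path are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("academic-early-warning").getOrCreate()

# Assumed schema: x1..x5 are the selected, standardized features; label is 1 (pass) / 0 (fail)
data = spark.read.parquet("/data/academic/training")  # hypothetical path
assembler = VectorAssembler(inputCols=["x1", "x2", "x3", "x4", "x5"], outputCol="features")
dataset = assembler.transform(data).select("features", "label")

# Shuffle and split into training and test sets at a ratio of 8:2
train, test = dataset.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    maxIter=100,           # upper bound on the number of iterations
    tol=1e-6,              # iteration error threshold
    regParam=0.01,         # regularization strength
    elasticNetParam=0.0,   # 0.0 corresponds to a pure L2 penalty
    threshold=0.5,         # binary classification threshold
)
model = lr.fit(train)
```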
3. Model training phase
The LR model in Spark MLlib computes an error value at each iteration. The error decreases as the number of iterations increases, and iteration stops once the error is smaller than the set error threshold (default 1e-6). A cross-validation strategy is adopted during training. The accuracy, precision, recall, and F1 score of the model are computed from the predictions on the test set, the ROC curve is drawn, and the final effect of the prediction model is evaluated. The AUC under the ROC curve is 0.87, showing that the model has a certain prediction accuracy. The model is saved and loaded using the API provided by MLlib: calling the save method stores the trained prediction model, and the load method loads it.
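Continuing the sketch above, evaluation and persistence could look as follows with the pyspark.ml evaluators (metric names follow that API; the model path is hypothetical):

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegressionModel

pred = model.transform(test)

# Area under the ROC curve on the test set
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(pred)

# Accuracy, (weighted) precision and recall, and F1 on the test set
metrics = {
    name: MulticlassClassificationEvaluator(labelCol="label", metricName=name).evaluate(pred)
    for name in ("accuracy", "weightedPrecision", "weightedRecall", "f1")
}
print(auc, metrics)

# Save the trained model and load it again for online prediction
model.write().overwrite().save("/models/academic_lr")   # hypothetical path
loaded = LogisticRegressionModel.load("/models/academic_lr")
```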
4. Model deployment phase
In the model deployment and application stage, the trained prediction model is deployed online to process and predict on the incoming learning behavior data, classifying students into academic-risk students and academically normal students. First, HBase is selected to store the daily mass of learning behavior data; Spark is used to clean the data and the saved prediction model is called to predict; the prediction results, the training set, and the processed prediction data are stored in a MySQL database. Finally, a real-time academic prediction service is built as a web application so that teachers and students can view prediction results in real time.
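The real-time path is only described at a high level in the patent (Spark Streaming, Kafka, HBase, MySQL, a Web front end). The rough Structured-Streaming sketch below is therefore an assumption: the Kafka topic, the JSON payload, the JDBC connection details, and the use of foreachBatch are illustrative choices, not the patent's implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("academic-warning-stream").getOrCreate()
model = LogisticRegressionModel.load("/models/academic_lr")   # hypothetical path
assembler = VectorAssembler(inputCols=["x1", "x2", "x3", "x4", "x5"], outputCol="features")

# Read cleaned learning-behavior records from a Kafka topic (assumed JSON payload)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "learning-behavior")
       .load())
schema = "student_id STRING, x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE, x5 DOUBLE"
rows = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r")).select("r.*")

def write_batch(df, epoch_id):
    # Score each micro-batch and store the predictions in MySQL for the web front end
    scored = model.transform(assembler.transform(df)).select("student_id", "prediction")
    (scored.write.format("jdbc")
        .option("url", "jdbc:mysql://db:3306/academic")        # hypothetical connection
        .option("dbtable", "predictions")
        .option("user", "app").option("password", "***")
        .mode("append").save())

query = rows.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```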
The academic prediction function can realize the academic prediction for large-scale students. The system establishes an off-line academic prediction model based on parallel computation of a Spark framework and a binary logistic regression model. Finally, a real-time academic prediction service is realized in the Web environment.
Second, risk student monitoring sub-function
As shown in fig. 4, the invention provides a monitoring function for at-risk learners. Based on the risk set generated by the academic prediction (i.e., the set of students predicted to be at risk), cluster analysis is performed with the unsupervised DBSCAN algorithm to obtain subdivided risk subsets. The scheme is divided into three stages. The first stage is data extraction: according to the risk set output by the academic prediction sub-function, the feature vectors of the corresponding samples are extracted to construct a new sample set. The second stage is model construction: eps is determined for each dimension based on a Gaussian kernel function method, MinPts defaults to 2, an initial DBSCAN model is constructed, and the clustering result is output. The third stage is evaluation and suggestion: the feature mean of each cluster in the clustering result is calculated and used to characterize the corresponding risk subset, and evaluations and countermeasures are given for each risk subset based on the expert knowledge base.
1. Data extraction phase
Samples that have been preprocessed, normalized, and feature selected are extracted from the database for later modeling based on each sample label value output by the academic prediction.
2. Stage of model construction
A DBSCAN clustering model is built. Because the sample data in this scheme is high-dimensional, a kernel function is introduced to optimize the measurement radius eps in order to overcome the problem that Euclidean distances in the DBSCAN model differ only slightly; MinPts defaults to 2.

The integrated mean square error MISE is used as the criterion for assessing the density estimate; clearly, minimizing MISE requires a good value of h:

MISE(h) = AMISE(h) + O(1/(nh) + h^4)

where the asymptotic integrated mean square error is

AMISE(h) = (1/(nh))·∫K^2(x)dx + (h^4/4)·[∫x^2·K(x)dx]^2·∫[f''(x)]^2 dx

It is known that

∫[f''(x)]^2 dx = ∫[φ''(x)]^2 dx / σ^5

where φ is the standard normal density function and σ denotes the standard deviation of the n sample values under the given index. Taking the extreme value of MISE with respect to h, the extreme point gives the optimal eps value, namely:

eps = h_opt = (4/(3n))^(1/5)·σ ≈ 1.06·σ·n^(-1/5)

The risk student set is clustered with the DBSCAN algorithm whose parameter eps has been adaptively adjusted in this way, yielding k risk subsets {r_1, …, r_k}, where r_k denotes the k-th risk subset. Let t_ij be the mean of risk subset r_j over the i-th feature and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j, with i = 1, 2, 3, 4, 5 and j = 1, 2, …, k.
The kernel function DBSCAN algorithm can determine certain parameters in a self-adaptive mode, prior knowledge of a current data set is not needed, and accuracy is high. And clustering the risk sets through a kernel function DBSCAN algorithm to obtain k risk subsets. The subsets are classified based on the number of posts in the forum, the average score of the work, the average score of the relevant basic courses, the average watching time of the online video every week and the participation frequency of the online learning platform every week, namely, the subsets have characteristic differences.
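A minimal sketch of this adaptive-eps clustering under the above assumptions (a Silverman-style window width per dimension; how the per-dimension values are combined into the single eps required by scikit-learn's DBSCAN is not specified in the patent, so the Euclidean norm used below is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def adaptive_eps(X):
    """Window width h = (4/(3n))**(1/5) * sigma per dimension, combined by Euclidean norm."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    h = (4.0 / (3.0 * n)) ** 0.2 * X.std(axis=0, ddof=1)
    return float(np.linalg.norm(h))

def cluster_risk_students(X, min_pts=2):
    """Cluster the risk-student feature matrix X (n_students x 5 features)."""
    X = np.asarray(X, dtype=float)
    labels = DBSCAN(eps=adaptive_eps(X), min_samples=min_pts).fit_predict(X)
    # Feature means T_j of each risk subset r_j (noise points, labelled -1, are skipped)
    subsets = {j: X[labels == j].mean(axis=0) for j in set(labels) if j != -1}
    return labels, subsets
```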
3. Evaluation recommendation phase
The risk subsets r_1, …, r_k obtained above form the risk set R = {r_1, …, r_k}. Let t_ij (i = 1, 2, 3, 4, 5; j = 1, 2, …, k) be the mean of risk subset r_j over the i-th feature, and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j.
The method is implemented on the education platform of a vocational college, and a rule-based expert knowledge base is constructed by collecting and summarizing the school's teaching and research data, teachers' experience, and the like. The expert knowledge base receives the group feature data, evaluates each group according to which of a series of internal rules match, and provides suggestions. For example, for a group with low values of the number of forum posts, the average assignment score, the average score of related basic courses, the average weekly online video viewing time, and the weekly participation frequency on the online learning platform, it outputs the evaluation "extremely weak learning attitude, weak learning ability, low self-control, high-level academic early warning" and the suggestion that teachers and families should intervene and pay attention immediately.
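To illustrate the rule-matching idea only, a toy sketch follows; the rules, thresholds, and wording below are invented placeholders and not the content of the patent's expert knowledge base (feature means are assumed to be z-score standardized, so negative values mean below average):

```python
# Each rule: a predicate over the feature-mean vector T_j = (t_1j, ..., t_5j),
# plus the evaluation and suggestion it triggers.
RULES = [
    {
        "match": lambda t: all(v < -0.5 for v in t),   # all five standardized means clearly low
        "evaluation": "high-level academic early warning: weak attitude, ability and self-control",
        "suggestion": "teachers and families should intervene and follow up immediately",
    },
    {
        "match": lambda t: t[2] < -0.5,                # only the basic-course average is low
        "evaluation": "weak foundation in the related basic courses",
        "suggestion": "arrange remedial tutoring for the related basic courses",
    },
]

def evaluate_risk_subset(feature_means):
    """Return the evaluations and suggestions of all rules matching one risk subset."""
    hits = [(r["evaluation"], r["suggestion"]) for r in RULES if r["match"](feature_means)]
    return hits or [("no rule matched", "refer the group for manual review")]

print(evaluate_risk_subset((-0.8, -0.7, -0.9, -0.6, -1.1)))
```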
The working process of the invention is as follows:
(1) feature processing
The feature processing stage mainly comprises data preprocessing, data standardization, and feature selection. Data preprocessing uses the proportion averaging method and the numericization method to process, respectively, score data with different grading scales and discrete values, so that they meet the requirements of model training; data standardization uses the z-score method to bring data of different dimensions and magnitudes to comparable levels of importance; feature selection mainly uses the entropy weight method to screen out higher-quality feature data. Feature processing lays the foundation for the subsequent modeling work.
(2) Predictive modeling preparation
In the modeling preparation phase, the arguments x1, x2, x3, x4 and x5, which respectively represent the characteristics of the number of posts in the forum, the average score of the work, the average score of the relevant basic courses, the average viewing time of the online video per week and the participation frequency of the online learning platform per week, are obtained through the characteristic processing phase. Firstly, a data set is segmented, and a training set for model training and a testing set for testing are divided. Then, model selection is carried out, and the method carries out prediction of learning conditions based on the logistic regression model. And finally, setting relevant parameters of the logistic regression model. The modeling preparation process is mainly to set some basic parameters of the model.
(3) Predictive model training
Error values were calculated for each iteration using the LR model in Spark MLlib. The error value decreases as the number of iterations increases. When the error is less than the set error threshold, the iteration is stopped, and the default threshold is 1 e-6. And calculating four evaluation indexes of the accuracy, the precision, the recall ratio and F1 of the model on the test set, further drawing an ROC curve, and evaluating the final effect of the prediction model. Finally, the model is saved and loaded by using the API provided by the MLlib.
(4) Predictive model deployment
Deploying the trained prediction model on line, selecting Hbase to store daily massive learning behavior data, cleaning the data by using Spark, calling the stored prediction model to predict, and storing the prediction result in a MySQL database. And finally, constructing real-time learning prediction service by utilizing Web application, so that teachers and students can check prediction results in real time. The academic prediction result is one of the outputs of the scheme related to the invention.
(5) Data extraction phase
Samples that have been preprocessed, normalized, and feature selected are extracted from the database for later modeling based on each sample label value output by the academic prediction.
(6) Clustering model construction phase
A DBSCAN algorithm for adaptively adjusting eps parameters is designed, the algorithm can overcome the problem that high-dimensional data objects are distributed too randomly, and the applicability of Euclidean distance is enhanced. Based on the algorithm, the risk set can be reclassified to form a risk subset.
(7) Evaluation recommendation phase
The risk student groups are evaluated and suggestions are provided according to the subsets formed by clustering and their characteristics, in combination with the existing expert knowledge base. The evaluations and suggestions are one of the outputs of the scheme of the invention, forming a closed loop of input data, problem finding, and problem solving. Practical projects show that the invention significantly reduces the course failure rate, improves students' performance, and reduces teachers' workload.
The invention provides an academic early warning monitoring method for large-scale students, and there are many methods and ways to implement this technical scheme. The above description is only a preferred embodiment of the invention; it should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized by the prior art.

Claims (9)

1. The academic early warning monitoring method facing the large-scale students is characterized by comprising an academic prediction process and a risk student monitoring process, wherein the academic prediction process comprises the following steps:
step a1, processing the learning activity data of the learner by characteristics;
step a2, carrying out modeling preparation;
step a3, performing model batch training by utilizing Spark MLlib to obtain a trained prediction model;
step a4, deploying a prediction model, carrying out Spark Streaming real-time calculation, and predicting the learning situation;
the risk student monitoring process comprises the following steps:
b1, extracting samples which are subjected to preprocessing, standardization processing and feature selection from a database according to the output result of the academic prediction to form a risk student set;
b2, clustering the risk student sets by using a DBSCAN algorithm with the parameter eps subjected to self-adaptive adjustment, and mining the risk characteristics of the risk subsets in each risk student set;
and b3, for risk subsets with different characteristics, providing countermeasures based on an expert knowledge base to mitigate the risk level of each risk group.
2. The method of claim 1, wherein step a1 comprises:
extracting characteristics of learning activity data of a learner, wherein the learning activity data comprises online learning behavior data, offline learning behavior data and scores of basic courses;
the online learning behavior data comprises online assignment completion results, online video viewing time, the number of online exams, online test results, the number of forum posts, and the number of assignment reviews;

the offline learning behavior data comprises classroom test results, attendance, and homework completion results;
the score data of the basic courses refers to the scores of all the basic courses;
the various kinds of sub-data under the learning activity data, the online learning behavior data, and the offline learning behavior data are collectively referred to as feature data; assuming there are k kinds of sub-data, the k groups of feature data are treated as k indexes, and the relative importance of the feature data is obtained based on the entropy weight method:

the information entropy is calculated using the following formula:

H(X) = -Σ_{i=1}^{n} p(x_i)·ln p(x_i)

the information entropy redundancy is calculated using the following formula: d_j = 1 - H_j

wherein p(x_i) is the probability of the random event x_i, H_j is the information entropy of the j-th group of feature data, d_j is the information entropy redundancy of the j-th group of feature data, n is the total number of samples, i ≤ n, and each i corresponds to one learning subject.
3. The method of claim 2, wherein step a1 further comprises:
assuming there are m groups of feature data, x_ij is the value of the j-th index for the i-th sample, i = 1, …, n; j = 1, …, m;

calculating the proportion p_ij of the i-th sample value under the j-th index:

p_ij = x_ij / Σ_{i=1}^{n} x_ij

calculating the entropy value H_j under the j-th index:

H_j = -k · Σ_{i=1}^{n} p_ij·ln(p_ij)

wherein k = 1/ln(n);

calculating the information entropy redundancy d_j:

d_j = 1 - H_j

calculating the weight w_j of each index:

w_j = d_j / Σ_{j=1}^{m} d_j

sorting the weights in descending order to obtain the weight set W = {w_1, …, w_j, …, w_m} and selecting the first t weights such that:

Σ_{j=1}^{t} w_j > 0.8

and finally obtaining t groups of feature data.
4. The method of claim 3, wherein step a1 further comprises: after the importance of each group of feature data is obtained based on the entropy weight method, selecting the first five groups of feature data whose cumulative importance exceeds 80% to form the training set, so that X = {x_1, x_2, x_3, x_4, x_5}, wherein x_1, x_2, x_3, x_4, x_5 respectively represent the number of forum posts, the average assignment score, the average score of the related basic courses, the average weekly online video viewing time, and the weekly participation frequency on the online learning platform.
5. The method of claim 4, wherein step a2 comprises: establishing the Logistic regression model function g(z):

g(z) = 1 / (1 + e^(-z))

wherein the output value y of the function g(z) lies in the open interval (0, 1); if y ≥ 0.5, the output label value is 1, indicating a passing result; if y < 0.5, the output label value is 0, indicating a failing result;

z is the classification decision boundary, calculated as:

z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4 + θ_5·x_5

wherein θ_i (i = 0, 1, …, 5) are the regression coefficients of the Logistic regression model, and θ_0 is the bias term;

the training set X is standardized based on the following formula:

Z = (X - μ) / σ

wherein Z is the feature data after z-score standardization, μ is the mean of X, and σ is the standard deviation of X.
6. The method of claim 5, wherein step a3 includes: and calculating an error value by using an LR model in Spark MLlib each time, and stopping iteration when the error is smaller than a set error threshold.
7. The method of claim 6, wherein step a4 comprises: deploying the trained prediction model online, selecting HBase to store daily learning activity data, using Spark to clean the data and calling the saved prediction model to predict, and storing the cleaned data and the prediction results in a MySQL database; and finally, building a real-time learning prediction service as a Web application so that teachers and students can view prediction results in real time.
8. The method of claim 7, wherein step b1 comprises: and according to the label value output by the academic prediction, extracting the samples which are subjected to preprocessing, standardization processing and feature selection from the MySQL database to form a risk student set.
9. The method of claim 8, wherein step b2 comprises: in the DBSCAN algorithm with the adaptively adjusted parameter eps, the parameter is determined by the following formulas:

MISE(h) = AMISE(h) + O(1/(nh) + h^4)

wherein the asymptotic integrated mean square error is

AMISE(h) = (1/(nh))·∫K^2(x)dx + (h^4/4)·[∫x^2·K(x)dx]^2·∫[f''(x)]^2 dx

MISE is the integrated mean square error; O(1/(nh) + h^4) denotes a function of n and h whose absolute value is bounded by a fixed multiple of |1/(nh) + h^4| when the argument is sufficiently large; K denotes the kernel function; h is called the window width; n is the number of samples; x is an auxiliary integration variable; ∫[f''(x)]^2 dx = ∫[φ''(x)]^2 dx / σ^5, wherein φ is the standard normal density function and σ denotes the standard deviation of the n samples under any given index; taking the extreme value of MISE, the extreme point gives the optimal eps value:

eps = h_opt = (4/(3n))^(1/5)·σ ≈ 1.06·σ·n^(-1/5)

clustering the risk student set with the DBSCAN algorithm whose parameter eps has been adaptively adjusted to obtain k risk subsets {r_1, …, r_k}, wherein r_k denotes the k-th risk subset; letting t_ij be the mean of risk subset r_j over the i-th feature and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j, i = 1, 2, 3, 4, 5; j = 1, 2, …, k.
CN202010928263.4A 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students Pending CN112149884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010928263.4A CN112149884A (en) 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010928263.4A CN112149884A (en) 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students

Publications (1)

Publication Number Publication Date
CN112149884A true CN112149884A (en) 2020-12-29

Family

ID=73890622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010928263.4A Pending CN112149884A (en) 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students

Country Status (1)

Country Link
CN (1) CN112149884A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113080986A (en) * 2021-05-07 2021-07-09 中国科学院深圳先进技术研究院 Method and system for detecting exercise fatigue based on wearable equipment
CN113516286A (en) * 2021-05-14 2021-10-19 山东建筑大学 Student academic early warning method and system based on multi-granularity task joint modeling
CN113609779A (en) * 2021-08-16 2021-11-05 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004063831A2 (en) * 2003-01-15 2004-07-29 Bracco Imaging S.P.A. System and method for optimization of a database for the training and testing of prediction algorithms
US20130096892A1 (en) * 2011-10-17 2013-04-18 Alfred H. Essa Systems and methods for monitoring and predicting user performance
CA2838119A1 (en) * 2012-12-27 2014-06-27 Pearson Education, Inc. System and method for selecting predictors for a student risk model
US20140205987A1 (en) * 2013-01-18 2014-07-24 Steve Habermehl Apparatus and method for enhancing academic planning and tracking via an interactive repository database
CN106373057A (en) * 2016-09-29 2017-02-01 西安交通大学 Network education-orientated poor learner identification method
CN107578181A (en) * 2017-09-15 2018-01-12 中南大学 Exceptional student method for digging based on statistic frequency and correlation rule
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN111260514A (en) * 2020-01-14 2020-06-09 华中师范大学 Student score prediction method based on campus big data
CN111626372A (en) * 2020-05-29 2020-09-04 安徽医学高等专科学校 Online teaching supervision management method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004063831A2 (en) * 2003-01-15 2004-07-29 Bracco Imaging S.P.A. System and method for optimization of a database for the training and testing of prediction algorithms
US20130096892A1 (en) * 2011-10-17 2013-04-18 Alfred H. Essa Systems and methods for monitoring and predicting user performance
CA2838119A1 (en) * 2012-12-27 2014-06-27 Pearson Education, Inc. System and method for selecting predictors for a student risk model
US20140205987A1 (en) * 2013-01-18 2014-07-24 Steve Habermehl Apparatus and method for enhancing academic planning and tracking via an interactive repository database
CN106373057A (en) * 2016-09-29 2017-02-01 西安交通大学 Network education-orientated poor learner identification method
CN107578181A (en) * 2017-09-15 2018-01-12 中南大学 Exceptional student method for digging based on statistic frequency and correlation rule
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN111260514A (en) * 2020-01-14 2020-06-09 华中师范大学 Student score prediction method based on campus big data
CN111626372A (en) * 2020-05-29 2020-09-04 安徽医学高等专科学校 Online teaching supervision management method and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
廉宇: "Design of an Academic Early Warning System Based on Hadoop and Research on Key Technologies", China Master's Theses Full-text Database, Information Science and Technology, pages 169-170 *
李宗林等: "Adaptive Determination of Parameters in the DBSCAN Algorithm", Computer Engineering and Applications, pages 71-72 *
李建伟; 苏占玖; 黄茹: "Research on Online Learning Risk Prediction Based on Big Data Learning Analytics", Modern Educational Technology, no. 08, pages 78-84 *
程光胜: "Behavior Analysis of Higher Vocational Students Based on Big Data", Vocational Education Research, pages 76-80 *
陈子健等: "Research on Predictive Modeling of Online Learners' Academic Performance Based on Educational Data Mining", China Educational Technology, pages 75-78 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113080986A (en) * 2021-05-07 2021-07-09 中国科学院深圳先进技术研究院 Method and system for detecting exercise fatigue based on wearable equipment
CN113516286A (en) * 2021-05-14 2021-10-19 山东建筑大学 Student academic early warning method and system based on multi-granularity task joint modeling
CN113609779A (en) * 2021-08-16 2021-11-05 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning
CN113609779B (en) * 2021-08-16 2024-04-09 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning

Similar Documents

Publication Publication Date Title
CN112149884A (en) Academic early warning monitoring method for large-scale students
CN116108758B (en) Landslide susceptibility evaluation method
Sani et al. Drop-out prediction in higher education among B40 students
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN110555459A (en) Score prediction method based on fuzzy clustering and support vector regression
CN112150304A (en) Power grid running state track stability prejudging method and system and storage medium
CN112116002A (en) Determination method, verification method and device of detection model
Bassi et al. Students Graduation on Time Prediction Model Using Artificial Neural Network
Hamim et al. Student profile modeling using boosting algorithms
CN112686462A (en) Student portrait-based anomaly detection method, device, equipment and storage medium
Loganathan et al. Development of machine learning based framework for classification and prediction of students in virtual classroom environment
Gao et al. Machine learning for credit card fraud detection
CN114912027A (en) Learning scheme recommendation method and system based on learning outcome prediction
Kumar et al. Comparative study of various supervised machine learning algorithms for an early effective prediction of the employability of students
YURTKAN et al. Student Success Prediction Using Feedforward Neural Networks
Suresh et al. Predicting the e-learners learning style by using support vector regression technique
CN113378581A (en) Knowledge tracking method and system based on multivariate concept attention model
Subarkah et al. Comparison of Different Classification Techniques to Predict Student Graduation
Wahyono et al. Optimization of Random Forest with Genetic Algorithm for Determination of Assessment
CN111079348A (en) Method and device for detecting slowly-varying signal
Gata et al. The Feasibility of Credit Using C4. 5 Algorithm Based on Particle Swarm Optimization Prediction
CN113421176B (en) Intelligent screening method for abnormal data in student score scores
Ma et al. A Comparison of Data Mining Approaches on Predicting the Repayment Behavior in P2P Lending
Zhang et al. Research on intelligent detection method of cognitive ability based on principal component analysis and extreme learning machine
CN116975621A (en) Model stability monitoring method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination