CN112149884A - Academic early warning monitoring method for large-scale students - Google Patents

Academic early warning monitoring method for large-scale students

Info

Publication number
CN112149884A
CN112149884A (application CN202010928263.4A)
Authority
CN
China
Prior art keywords
data
risk
prediction
learning
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010928263.4A
Other languages
Chinese (zh)
Inventor
龚少麟
满青珊
赵文涛
郝路遥
王骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Laiwangxin Technology Research Institute Co ltd
Original Assignee
Nanjing Laiwangxin Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Laiwangxin Technology Research Institute Co ltd filed Critical Nanjing Laiwangxin Technology Research Institute Co ltd
Priority to CN202010928263.4A priority Critical patent/CN112149884A/en
Publication of CN112149884A publication Critical patent/CN112149884A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Abstract

The invention provides an academic early warning monitoring method oriented to large-scale student populations. Building on research into prediction methods and big-data processing platforms, it offers an effective remedy for the lack of student differentiation analysis and personalized teaching methods in current intelligent teaching applications: an offline learning prediction model is established on the basis of a parallel computing framework and a learning prediction algorithm, and the solution comprises four stages of feature processing, modeling preparation, model training and model deployment. The learning prediction process realizes a large-scale real-time learning prediction system based on Spark and HBase. Based on the prediction results, at-risk students (those predicted to fail their courses) are monitored closely to find their risk points. The risk student monitoring process includes cluster analysis of the risk group; the purpose of the cluster analysis is to discover the risk points of each at-risk student and to provide suggestions for different student groups.

Description

Academic early warning monitoring method for large-scale students
Technical Field
The invention relates to the field of academic situation analysis in intelligent teaching, and in particular to an academic early warning monitoring method for large-scale students.
Background
As research on intelligent teaching in the education industry grows deeper and broader, differentiated teaching has become one of the hot spots of teaching research. Learners with good academic performance differ to some extent in their learning behaviors from other learners. Based on learners' learning behaviors, a supervised algorithm can be used to predict whether a learner will pass a course, and the learning characteristics of low-efficiency learners can be further mined. Academic prediction and clustering results can provide targeted guidance for teachers' teaching activities and students' learning activities, which has important practical significance. At present, online learning platforms are widely used in colleges and universities and record large amounts of learning behavior data. Many academic prediction methods acquire relevant feature data from the learning behavior data recorded on online learning platforms, predict learning outcomes through data analysis and mining, and implement teaching interventions accordingly. However, the problems are also obvious: 1. in large-scale scenarios the volume and complexity of the data to be processed increase sharply, and existing non-distributed machine-learning-based academic prediction methods take long to run, are inefficient, and can hardly meet the requirement of real-time response; 2. because academic data have complicated hierarchies and diverse categories, existing methods rely only on prior knowledge and simply extract a few features, which inevitably loses important information and weakens model performance; 3. learning-situation data cover all aspects of student learning and are high-dimensional and heterogeneous in style, for example hundred-mark and ten-mark grading scales, discrete data and continuous data, and so on; 4. many schemes are limited to result prediction and do not mine further information from the data, so the conclusions are thin and poorly supported, and it is difficult to provide targeted solutions. A processing scheme is therefore needed to overcome these problems.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to address the defects of the prior art by providing an academic early warning monitoring method for large-scale student populations. The scheme has been verified on the education platform of a vocational college with about 200,000 students, where the volume of related data reaches the tens of millions of records.
The method comprises a academic prediction process and a risk student monitoring process, wherein the academic prediction process comprises the following steps:
step a1, processing the learning activity data of the learner by characteristics;
step a2, carrying out modeling preparation;
step a3, performing model batch training by utilizing Spark MLlib to obtain a trained prediction model;
step a4, deploying a prediction model, performing Spark Streaming real-time calculation, and predicting the learning situation.
The risk student monitoring process comprises the following steps:
b1, extracting samples which are subjected to preprocessing, standardization processing and feature selection from a database according to the output result of the academic prediction to form a risk student set;
b2, clustering the risk student sets by using a DBSCAN algorithm with the parameter eps subjected to self-adaptive adjustment, and mining the risk characteristics of the risk subsets in each risk student set;
and b3, for risk subsets with different characteristics, providing countermeasures based on the expert knowledge base to mitigate the risk level of each risk group.
Step a1 includes:
extracting characteristics of learning activity data of a learner, wherein the learning activity data comprises online learning behavior data, offline learning behavior data and scores of basic courses;
the on-line learning behavior data comprises classroom test results, attendance conditions and results of job completion;
the offline learning behavior data comprises an online operation completion result, online video watching time, online examination times, an online test result, the number of posts in a forum and operation review times;
the score data of the basic courses refers to the scores of all the basic courses;
the method comprises the following steps of uniformly calling various sub-data under learning activity data, on-line learning behavior data and off-line learning behavior data as characteristic data, setting k sub-data, calling k groups of characteristic data as k indexes, and obtaining the relative importance degree of the characteristic data based on an entropy weight method:
the information entropy is calculated using the following formula:
Figure RE-GDA0002736993100000021
the information entropy redundancy is calculated by using the following formula: dj=1-Hj
Wherein, p (x)i) Representing random events xiProbability of (H)jIs the entropy of the jth set of characteristic data, djThe information entropy redundancy of the jth group of feature data is shown, n is the sum of the number of samples, i is less than or equal to n, each i corresponds to a learning subject, such as a student, and n represents the number of all students.
Step a1 further includes:

Assuming there are m groups of feature data, x_ij is the value of the j-th index for the i-th sample, i = 1, …, n; j = 1, …, m.

Calculate the proportion p_ij of the i-th sample value under the j-th index:

p_ij = x_ij / Σ_{i=1}^{n} x_ij

Calculate the entropy value H_j under the j-th index:

H_j = -k · Σ_{i=1}^{n} P_ij

where k = 1/ln(n) and P_ij = p_ij·ln(p_ij).

Calculate the information entropy redundancy d_j:

d_j = 1 - H_j

Calculate the weight w_j of each index:

w_j = d_j / Σ_{j=1}^{m} d_j

Sort the weights in descending order to obtain the weight set W = {w_1, …, w_j, …, w_m} and select the first t weights such that:

Σ_{j=1}^{t} w_j > 0.8

Finally, t groups of feature data are obtained.
Step a1 further includes: after the importance of each group of feature data is obtained with the entropy weight method, the first five groups of feature data whose cumulative importance exceeds 80% are selected to form the training set, so that X = {x_1, x_2, x_3, x_4, x_5}, where x_1, x_2, x_3, x_4, x_5 respectively represent the number of forum posts, the average assignment score, the average score of the related basic courses, the average weekly online video viewing time, and the weekly participation frequency on the online learning platform.
Step a2 includes: establishing the Logistic regression model function g(z):

g(z) = 1 / (1 + e^(-z))

The output value y of the function g(z) lies in the open interval (0, 1). If y ≥ 0.5, the output label value is 1, indicating a passing result; if y < 0.5, the output label value is 0, indicating a failing result.

z is the classification decision boundary, calculated as:

z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4 + θ_5·x_5

θ_i (i = 0, 1, …, 5) are the regression coefficients of the Logistic regression model; for example, θ_1 is the Logistic regression coefficient of the number of forum posts, and θ_0 is the bias term.

The training set X is standardized based on the following formula:

Z = (X - μ) / σ

where Z is the feature data after z-score standardization, μ is the mean of X, and σ is the standard deviation of X.
Step a3 includes: the LR model in Spark MLlib computes an error value at each iteration, and iteration stops when the error is smaller than a set error threshold; the default error threshold is 1e-6.

The performance of the trained model can be evaluated by computing classical metrics such as accuracy, precision, recall, F1, and the ROC curve. The model can also be saved and loaded using the API provided by MLlib.
Step a4 includes: deploying the trained prediction model online, selecting HBase to store daily learning activity data, using Spark to clean the data and calling the saved prediction model to predict, and storing the cleaned data and the prediction results in a MySQL database; finally, a real-time learning prediction service is built as a Web application so that teachers and students can view prediction results in real time.
Step b1 includes: according to the label values output by the academic prediction, extracting the samples that have undergone preprocessing, standardization, and feature selection from the MySQL database of step a4 to form the risk student set.
Step b2 includes: in the DBSCAN algorithm with the adaptively adjusted parameter eps, the parameter is determined by the following formulas:

MISE(h) = AMISE(h) + O(1/(nh) + h^4)

where the asymptotic integrated mean square error is

AMISE(h) = (1/(nh))·∫K^2(x)dx + (h^4/4)·[∫x^2·K(x)dx]^2·∫[f''(x)]^2 dx

MISE is the integrated mean square error; O(1/(nh) + h^4) denotes a function of n and h whose absolute value is bounded by a fixed multiple of |1/(nh) + h^4| once the argument is sufficiently large; K denotes the kernel function, generally a Gaussian kernel with mean 0 and variance 1; h is called the window width or smoothing parameter; n is the number of samples; x is an auxiliary integration variable that is integrated out and does not affect the result; ∫[f''(x)]^2 dx = ∫[φ''(x)]^2 dx / σ^5, where φ is the standard normal density function. The samples here are the n feature-data samples described above, and σ denotes the standard deviation of the n sample values under the given index. Taking the extreme value of MISE with respect to h, the extreme point gives the optimal eps value:

eps = h_opt = (4/(3n))^(1/5)·σ ≈ 1.06·σ·n^(-1/5)

The risk student set is clustered with the DBSCAN algorithm whose parameter eps has been adaptively adjusted in this way, yielding k risk subsets {r_1, …, r_k}, where r_k denotes the k-th risk subset. Let t_ij be the mean of risk subset r_j over the i-th feature and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j, with i = 1, 2, 3, 4, 5 and j = 1, 2, …, k.
Step b3 includes: establishing an expert knowledge base from teaching and research data and teacher opinions; the expert knowledge base performs rule matching on each input risk subset and outputs corresponding evaluations and targeted learning suggestions.
The main innovation points are as follows:
1. The academic prediction scheme based on the Spark distributed architecture increases system capacity and can efficiently process large-scale data; it is modular, reduces redundancy, improves the extensibility of the system, and speeds up development and release.
2. A feature selection method based on the entropy weight method is proposed. The method is suited to academic data analysis scenarios, can mine feature data of high information value from the data, and avoids the reliance on subjective experience found in traditional methods.
3. Two data preprocessing methods are proposed. The proportion averaging method fuses similar feature data, such as scores graded on different scales, so that the fused data reflect objective differences; the numericization method linearly converts discrete data into numerical values and thereby overcomes the tendency of one-hot encoding to produce sparse matrices.
4. On the basis of the academic prediction, the risk set is clustered and divided with an improved clustering algorithm, so that academic early-warning students can be monitored in a targeted manner in combination with the expert knowledge base, the students' course failure rate can be reduced, and teachers' workload can be lowered.
Beneficial effects: by using a highly practical prediction algorithm model and an unsupervised clustering model, the invention provides an academic prediction function for large-scale student populations, and at the same time can mine the characteristics of low-efficiency learning groups and subdivide the categories of low-efficiency learners, helping teachers provide differentiated and targeted teaching and intervention activities. As the academic-related data grow, the invention can scale its performance by adding Spark cluster nodes. The method is easy to deploy and highly reliable, and can perform real-time academic prediction on the large-scale real-time data continuously extracted during intelligent teaching.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 shows the data set after standardization.
FIG. 2 is a schematic diagram of the training set X.
FIG. 3 shows the academic prediction sub-process.
FIG. 4 shows the risk student monitoring sub-process.
Fig. 5 shows the overall flow of the academic prediction monitoring system.
Detailed Description
In this embodiment, a large-scale student-oriented academic early warning monitoring system is constructed based on the academic early warning monitoring method described above. As shown in fig. 3 and 5, the system can predict students' academic risk and can evaluate and advise at-risk students based on an expert knowledge base. The system distinguishes academic-risk students from normal students well and enhances the pertinence of teaching and of teacher intervention; project practice shows that it can significantly improve students' course pass rate. Fig. 3 shows the framework for model training and prediction based on Spark: an offline learning prediction model is established on the basis of parallel computing and a logistic regression algorithm within the Spark framework, and in a real-time environment a large-scale real-time academic prediction system based on Spark Streaming and Kafka is realized. The scheme is divided into four stages: the first stage is feature processing; the second is modeling preparation; the third is prediction model training; and the fourth is prediction model deployment.
1. Stage of feature processing
The characteristic processing stage mainly comprises: data preprocessing, data standardization processing and feature selection.
The online learning behavior data, offline learning behavior data, and score data of the relevant basic courses can be obtained through the online learning platform. The online learning behavior data include online assignment completion results, online video viewing time, the number of online exams, online test results, the number of forum posts, and the number of assignment reviews; the offline learning behavior data include classroom test results, attendance, and homework completion results; the basic course score data are the scores of the basic courses. These data implicitly contain information relevant to the target variable, i.e., student performance and status.
The online learning behavior data, offline learning behavior data, and score data of the related basic courses of all students in the previous term are extracted from the database of the online learning platform as the feature part of the data set, and the students' average end-of-term exam scores are taken as the label data of the data set (if the student passes, the label value is set to 1; otherwise it is set to 0). An initial data set P = {p_1, …, p_i, …, p_m} can thus be generated, where 1 ≤ i ≤ m and m is the number of groups of feature data. For example, p_i may represent the online test results of all students; assuming the sample size of the training set is n, the column vector p_i has length n.
(1) Data pre-processing
In the field of intelligent teaching, attribute construction of academic datasets often involves a variety of scenarios, such as: multi-category score value merging processing, discrete value processing and the like. The present invention therefore proposes two specific data preprocessing methods.
1) Proportion averaging method

The score combination processing may adopt the proportion averaging method:

S = (a_1/M_1 + a_2/M_2 + … + a_i/M_i + … + a_n/M_n) × 100 / n,  i = 1, 2, …, n

where S is the combined percentage score, a_i is the score of the i-th category, M_i is the full mark of the i-th category, and n is the number of score categories.

The proportion averaging method effectively eliminates the fusion distortion caused by scores graded against different full marks and gives a better comprehensive evaluation of an object's overall score.

In the feature part of the data set, the online exam scores, classroom test scores, and similar scores of all courses in the previous term are fused using the proportion averaging method; in the label part of the data set, the end-of-term exam results can be fused.
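For illustration only, the following minimal Python sketch shows how such a proportion-average fusion could be computed; the function and variable names are illustrative and not taken from the patent.

```python
def proportion_average(scores, full_marks):
    """Fuse scores graded against different full marks into one percentage score.

    scores[i] is the raw score of the i-th category and full_marks[i] its full mark,
    following S = (a_1/M_1 + ... + a_n/M_n) * 100 / n.
    """
    if not scores or len(scores) != len(full_marks):
        raise ValueError("scores and full_marks must be non-empty and of equal length")
    n = len(scores)
    return sum(a / m for a, m in zip(scores, full_marks)) * 100.0 / n

# Example: a 10-point quiz scored 8, a 150-point exam scored 120, a 100-point test scored 70
print(proportion_average([8, 120, 70], [10, 150, 100]))  # -> approximately 76.67
```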
2) Numerical processing
When building the learning prediction model, discrete values often have to be handled; for example, a feature may take four grades A, B, C, D, or three grades such as good, medium, poor. Only after the discrete values are converted into numerical values can they be fed into the model for training. The usual approach is one-hot encoding; however, when there are many grade levels this clearly produces a large sparse matrix and hurts performance. Since most grade indexes divide a range into bands with an obvious linear character, a linear mapping method is adopted here to process the discrete values.

For example, if the grade set D consists of four grades {d1, d2, d3, d4}, ranked from high to low by level of excellence, and the values are numericized on a percentage basis, then the score of grade d_i is:

V_i = 100 - (100·i / n) + 50/n

where n is the number of elements in set D and i = 1, 2, 3, 4.
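A minimal sketch of this linear mapping, assuming the percentage-based formula above (names are illustrative, not from the patent):

```python
def numericize_grades(grades):
    """Map an ordered grade set (best grade first) to percentage values.

    V_i = 100 - 100*i/n + 50/n for the i-th grade (1-based), i.e. the midpoints
    of n equal bands on the 0-100 scale.
    """
    n = len(grades)
    return {g: 100 - 100 * (i + 1) / n + 50 / n for i, g in enumerate(grades)}

print(numericize_grades(["A", "B", "C", "D"]))
# {'A': 87.5, 'B': 62.5, 'C': 37.5, 'D': 12.5}
```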
Data preprocessing is a very important preliminary step in the data-mining modeling process; its main purposes are to reduce the dimensionality of the data set and improve its quality, which largely determines the upper bound of model performance. This stage also involves common preprocessing operations such as null value removal and outlier handling, which are not described further here because of their generality.
(2) Data normalization process
The main purposes of data standardization are to reduce the magnitude of each type of data and to eliminate the dimension. Since the various kinds of learning-situation data exhibit an approximately normal distribution, the invention uses the z-score standardization method:

Z = (X - μ) / σ

where X is the original feature, Z is the feature data after z-score standardization, μ is the mean of X, and σ is the standard deviation of X.

Applying z-score standardization to the preprocessed data avoids dependence on extreme values of the data, effectively equalizes each kind of feature data, and improves the performance of the trained model.
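A small NumPy sketch of the column-wise z-score standardization (illustrative names only):

```python
import numpy as np

def z_score(X):
    """Standardize each feature column: Z = (X - mu) / sigma."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```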
(3) Feature selection
The feature selection is to select and extract features of the original data set so as to obtain a new data set which is smaller in scale and has stronger relevance with the target variable. Based on a large-scale user scene of intelligent teaching, the invention provides a feature selection method based on entropy weight.
The entropy weight method is an objective weighting method because it relies only on the discreteness of the data itself.
Information entropy:

H(X) = -Σ_{i=1}^{n} p(x_i)·ln p(x_i)

Information entropy redundancy: d_j = 1 - H_j

where p(x_i) is the probability of the random event x_i, H_j is the information entropy of the j-th group of feature data, and d_j is the information entropy redundancy of the j-th group of feature data.
The method comprises the following specific steps:
1) Suppose there are n samples and m groups of feature data; then x_ij is the value of the j-th index for the i-th sample (i = 1, …, n; j = 1, …, m), as shown in fig. 1.

2) Calculate the proportion of the i-th sample value under the j-th index:

p_ij = x_ij / Σ_{i=1}^{n} x_ij

3) Calculate the entropy value under the j-th index:

H_j = -k · Σ_{i=1}^{n} P_ij

where k = 1/ln(n) and P_ij = p_ij·ln(p_ij).

4) Calculate the information entropy redundancy:

d_j = 1 - H_j,  (j = 1, 2, …, m)

5) Calculate the weight of each index:

w_j = d_j / Σ_{j=1}^{m} d_j

6) Sort the weights in descending order to obtain W = {w_1, …, w_j, …, w_m} and select the first t weights such that:

Σ_{j=1}^{t} w_j > 0.8

Finally, t groups of feature data are obtained.
Feature selection mitigates the curse of dimensionality and reduces the difficulty of training the model. Feature extraction based on the entropy weight method yields the importance of each group of feature data, from which high-value feature data can be selected. A data set X is obtained through the feature selection process; a portion of it is shown in fig. 2.
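As a sketch only, the entropy-weight feature selection described above could be implemented with NumPy as follows; the function names are illustrative, a non-negative feature matrix is assumed, and a small constant guards against log(0):

```python
import numpy as np

def entropy_weights(X, tiny=1e-12):
    """X: (n_samples, m_indexes) non-negative matrix. Returns the weight w_j of each index."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    p = X / (X.sum(axis=0, keepdims=True) + tiny)    # p_ij = x_ij / sum_i x_ij
    k = 1.0 / np.log(n)
    H = -k * np.sum(p * np.log(p + tiny), axis=0)     # entropy H_j of each index
    d = 1.0 - H                                       # information entropy redundancy d_j
    return d / d.sum()                                # weights w_j

def select_top_indexes(X, threshold=0.8):
    """Return the smallest set of indexes whose cumulative weight exceeds the threshold."""
    w = entropy_weights(X)
    order = np.argsort(w)[::-1]                       # sort indexes by weight, descending
    t = int(np.searchsorted(np.cumsum(w[order]), threshold)) + 1
    return order[:t], w
```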
2. Modeling preparation phase
The modeling preparation phase comprises: data set division, model selection construction and parameter setting.
(1) Training set partitioning
The samples in the data set are randomly shuffled and then split into a training set and a test set at a ratio of 8:2. The training set is used to train the model, and the test set is used to verify the model's performance.
(2) Model selection construction
The method adopts the Logistic regression model g(z):

g(z) = 1 / (1 + e^(-z))

The output value y of the function g(z) lies in the open interval (0, 1). If y ≥ 0.5, the output label value is 1, indicating a passing result; if y < 0.5, the output label value is 0, indicating a failing result.

z is the classification decision boundary:

z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4 + θ_5·x_5

where θ_j (j = 0, 1, …, 5) are the Logistic regression coefficients; for example, θ_1 is the Logistic regression coefficient of the number of forum posts, and θ_0 is the bias term.
(3) Parameter setting
A logistic regression instance is created in the Spark MLlib module. The iteration error threshold Δmin is set to 1e-6, an L2 regularization penalty term is used with regularization parameter 0.01, the binary classification threshold is set to 0.5, and the other parameters keep their default values.
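For illustration, a PySpark sketch of such a configuration is shown below; the patent gives no code, so the DataFrame-based pyspark.ml API, the column names, and the input path are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("academic-early-warning").getOrCreate()

# Assumed schema: x1..x5 are the selected, standardized features; label is 1 (pass) / 0 (fail)
data = spark.read.parquet("/data/academic/training")  # hypothetical path
assembler = VectorAssembler(inputCols=["x1", "x2", "x3", "x4", "x5"], outputCol="features")
dataset = assembler.transform(data).select("features", "label")

# Shuffle and split into training and test sets at a ratio of 8:2
train, test = dataset.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    maxIter=100,           # upper bound on the number of iterations
    tol=1e-6,              # iteration error threshold
    regParam=0.01,         # regularization strength
    elasticNetParam=0.0,   # 0.0 corresponds to a pure L2 penalty
    threshold=0.5,         # binary classification threshold
)
model = lr.fit(train)
```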
3. Model training phase
The LR model in Spark MLlib computes an error value at each iteration. The error decreases as the number of iterations increases, and iteration stops once the error is smaller than the set error threshold (default 1e-6). A cross-validation strategy is adopted during training. The accuracy, precision, recall, and F1 score of the model are computed from the predictions on the test set, the ROC curve is drawn, and the final effect of the prediction model is evaluated. The AUC under the ROC curve is 0.87, showing that the model has a certain prediction accuracy. The model is saved and loaded using the API provided by MLlib: calling the save method stores the trained prediction model, and the load method loads it.
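Continuing the sketch above, evaluation and persistence could look as follows with the pyspark.ml evaluators (metric names follow that API; the model path is hypothetical):

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegressionModel

pred = model.transform(test)

# Area under the ROC curve on the test set
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(pred)

# Accuracy, (weighted) precision and recall, and F1 on the test set
metrics = {
    name: MulticlassClassificationEvaluator(labelCol="label", metricName=name).evaluate(pred)
    for name in ("accuracy", "weightedPrecision", "weightedRecall", "f1")
}
print(auc, metrics)

# Save the trained model and load it again for online prediction
model.write().overwrite().save("/models/academic_lr")   # hypothetical path
loaded = LogisticRegressionModel.load("/models/academic_lr")
```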
4. Model deployment phase
In the model deployment and application stage, the trained prediction model is deployed online to process and predict on the incoming learning behavior data, classifying students into academic-risk students and academically normal students. First, HBase is selected to store the daily mass of learning behavior data; Spark is used to clean the data and the saved prediction model is called to predict; the prediction results, the training set, and the processed prediction data are stored in a MySQL database. Finally, a real-time academic prediction service is built as a web application so that teachers and students can view prediction results in real time.
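The real-time path is only described at a high level in the patent (Spark Streaming, Kafka, HBase, MySQL, a Web front end). The rough Structured-Streaming sketch below is therefore an assumption: the Kafka topic, the JSON payload, the JDBC connection details, and the use of foreachBatch are illustrative choices, not the patent's implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("academic-warning-stream").getOrCreate()
model = LogisticRegressionModel.load("/models/academic_lr")   # hypothetical path
assembler = VectorAssembler(inputCols=["x1", "x2", "x3", "x4", "x5"], outputCol="features")

# Read cleaned learning-behavior records from a Kafka topic (assumed JSON payload)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "learning-behavior")
       .load())
schema = "student_id STRING, x1 DOUBLE, x2 DOUBLE, x3 DOUBLE, x4 DOUBLE, x5 DOUBLE"
rows = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r")).select("r.*")

def write_batch(df, epoch_id):
    # Score each micro-batch and store the predictions in MySQL for the web front end
    scored = model.transform(assembler.transform(df)).select("student_id", "prediction")
    (scored.write.format("jdbc")
        .option("url", "jdbc:mysql://db:3306/academic")        # hypothetical connection
        .option("dbtable", "predictions")
        .option("user", "app").option("password", "***")
        .mode("append").save())

query = rows.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```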
The academic prediction function can realize the academic prediction for large-scale students. The system establishes an off-line academic prediction model based on parallel computation of a Spark framework and a binary logistic regression model. Finally, a real-time academic prediction service is realized in the Web environment.
Second, risk student monitoring sub-function
As shown in fig. 4, the invention provides a monitoring function for at-risk learners. Based on the risk set generated by the academic prediction (i.e., the set of students predicted to be at risk), cluster analysis is performed with the unsupervised DBSCAN algorithm to obtain subdivided risk subsets. The scheme is divided into three stages. The first stage is data extraction: according to the risk set output by the academic prediction sub-function, the feature vectors of the corresponding samples are extracted to construct a new sample set. The second stage is model construction: eps is determined for each dimension based on a Gaussian kernel function method, MinPts defaults to 2, an initial DBSCAN model is constructed, and the clustering result is output. The third stage is evaluation and suggestion: the feature mean of each cluster in the clustering result is calculated and used to characterize the corresponding risk subset, and evaluations and countermeasures are given for each risk subset based on the expert knowledge base.
1. Data extraction phase
Samples that have been preprocessed, normalized, and feature selected are extracted from the database for later modeling based on each sample label value output by the academic prediction.
2. Stage of model construction
A DBSCAN clustering model is built. Because the sample data in this scheme is high-dimensional, a kernel function is introduced to optimize the measurement radius eps in order to overcome the problem that Euclidean distances in the DBSCAN model differ only slightly; MinPts defaults to 2.

The integrated mean square error MISE is used as the criterion for assessing the density estimate; clearly, minimizing MISE requires a good value of h:

MISE(h) = AMISE(h) + O(1/(nh) + h^4)

where the asymptotic integrated mean square error is

AMISE(h) = (1/(nh))·∫K^2(x)dx + (h^4/4)·[∫x^2·K(x)dx]^2·∫[f''(x)]^2 dx

It is known that

∫[f''(x)]^2 dx = ∫[φ''(x)]^2 dx / σ^5

where φ is the standard normal density function and σ denotes the standard deviation of the n sample values under the given index. Taking the extreme value of MISE with respect to h, the extreme point gives the optimal eps value, namely:

eps = h_opt = (4/(3n))^(1/5)·σ ≈ 1.06·σ·n^(-1/5)

The risk student set is clustered with the DBSCAN algorithm whose parameter eps has been adaptively adjusted in this way, yielding k risk subsets {r_1, …, r_k}, where r_k denotes the k-th risk subset. Let t_ij be the mean of risk subset r_j over the i-th feature and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j, with i = 1, 2, 3, 4, 5 and j = 1, 2, …, k.
The kernel function DBSCAN algorithm can determine certain parameters in a self-adaptive mode, prior knowledge of a current data set is not needed, and accuracy is high. And clustering the risk sets through a kernel function DBSCAN algorithm to obtain k risk subsets. The subsets are classified based on the number of posts in the forum, the average score of the work, the average score of the relevant basic courses, the average watching time of the online video every week and the participation frequency of the online learning platform every week, namely, the subsets have characteristic differences.
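A minimal sketch of this adaptive-eps clustering under the above assumptions (a Silverman-style window width per dimension; how the per-dimension values are combined into the single eps required by scikit-learn's DBSCAN is not specified in the patent, so the Euclidean norm used below is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def adaptive_eps(X):
    """Window width h = (4/(3n))**(1/5) * sigma per dimension, combined by Euclidean norm."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    h = (4.0 / (3.0 * n)) ** 0.2 * X.std(axis=0, ddof=1)
    return float(np.linalg.norm(h))

def cluster_risk_students(X, min_pts=2):
    """Cluster the risk-student feature matrix X (n_students x 5 features)."""
    X = np.asarray(X, dtype=float)
    labels = DBSCAN(eps=adaptive_eps(X), min_samples=min_pts).fit_predict(X)
    # Feature means T_j of each risk subset r_j (noise points, labelled -1, are skipped)
    subsets = {j: X[labels == j].mean(axis=0) for j in set(labels) if j != -1}
    return labels, subsets
```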
3. Evaluation recommendation phase
The risk subsets r_1, …, r_k obtained above form the risk set R = {r_1, …, r_k}. Let t_ij (i = 1, 2, 3, 4, 5; j = 1, 2, …, k) be the mean of risk subset r_j over the i-th feature, and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j.
The method is implemented on the education platform of a vocational college, and a rule-based expert knowledge base is constructed by collecting and summarizing the school's teaching and research data, teachers' experience, and the like. The expert knowledge base receives the group feature data, evaluates each group according to which of a series of internal rules match, and provides suggestions. For example, for a group with low values of the number of forum posts, the average assignment score, the average score of related basic courses, the average weekly online video viewing time, and the weekly participation frequency on the online learning platform, it outputs the evaluation "extremely weak learning attitude, weak learning ability, low self-control, high-level academic early warning" and the suggestion that teachers and families should intervene and pay attention immediately.
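To illustrate the rule-matching idea only, a toy sketch follows; the rules, thresholds, and wording below are invented placeholders and not the content of the patent's expert knowledge base (feature means are assumed to be z-score standardized, so negative values mean below average):

```python
# Each rule: a predicate over the feature-mean vector T_j = (t_1j, ..., t_5j),
# plus the evaluation and suggestion it triggers.
RULES = [
    {
        "match": lambda t: all(v < -0.5 for v in t),   # all five standardized means clearly low
        "evaluation": "high-level academic early warning: weak attitude, ability and self-control",
        "suggestion": "teachers and families should intervene and follow up immediately",
    },
    {
        "match": lambda t: t[2] < -0.5,                # only the basic-course average is low
        "evaluation": "weak foundation in the related basic courses",
        "suggestion": "arrange remedial tutoring for the related basic courses",
    },
]

def evaluate_risk_subset(feature_means):
    """Return the evaluations and suggestions of all rules matching one risk subset."""
    hits = [(r["evaluation"], r["suggestion"]) for r in RULES if r["match"](feature_means)]
    return hits or [("no rule matched", "refer the group for manual review")]

print(evaluate_risk_subset((-0.8, -0.7, -0.9, -0.6, -1.1)))
```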
The working process of the invention is as follows:
(1) feature processing
The feature processing stage mainly comprises data preprocessing, data standardization, and feature selection. Data preprocessing uses the proportion averaging method and the numericization method to process, respectively, score data with different grading scales and discrete values, so that they meet the requirements of model training; data standardization uses the z-score method to bring data of different dimensions and magnitudes to comparable levels of importance; feature selection mainly uses the entropy weight method to screen out higher-quality feature data. Feature processing lays the foundation for the subsequent modeling work.
(2) Predictive modeling preparation
In the modeling preparation phase, the arguments x1, x2, x3, x4 and x5, which respectively represent the characteristics of the number of posts in the forum, the average score of the work, the average score of the relevant basic courses, the average viewing time of the online video per week and the participation frequency of the online learning platform per week, are obtained through the characteristic processing phase. Firstly, a data set is segmented, and a training set for model training and a testing set for testing are divided. Then, model selection is carried out, and the method carries out prediction of learning conditions based on the logistic regression model. And finally, setting relevant parameters of the logistic regression model. The modeling preparation process is mainly to set some basic parameters of the model.
(3) Predictive model training
Error values were calculated for each iteration using the LR model in Spark MLlib. The error value decreases as the number of iterations increases. When the error is less than the set error threshold, the iteration is stopped, and the default threshold is 1 e-6. And calculating four evaluation indexes of the accuracy, the precision, the recall ratio and F1 of the model on the test set, further drawing an ROC curve, and evaluating the final effect of the prediction model. Finally, the model is saved and loaded by using the API provided by the MLlib.
(4) Predictive model deployment
Deploying the trained prediction model on line, selecting Hbase to store daily massive learning behavior data, cleaning the data by using Spark, calling the stored prediction model to predict, and storing the prediction result in a MySQL database. And finally, constructing real-time learning prediction service by utilizing Web application, so that teachers and students can check prediction results in real time. The academic prediction result is one of the outputs of the scheme related to the invention.
(5) Data extraction phase
Samples that have been preprocessed, normalized, and feature selected are extracted from the database for later modeling based on each sample label value output by the academic prediction.
(6) Clustering model construction phase
A DBSCAN algorithm for adaptively adjusting eps parameters is designed, the algorithm can overcome the problem that high-dimensional data objects are distributed too randomly, and the applicability of Euclidean distance is enhanced. Based on the algorithm, the risk set can be reclassified to form a risk subset.
(7) Evaluation recommendation phase
The risk student groups are evaluated and suggestions are provided according to the subsets formed by clustering and their characteristics, in combination with the existing expert knowledge base. The evaluations and suggestions are one of the outputs of the scheme of the invention, forming a closed loop of input data, problem finding, and problem solving. Practical projects show that the invention significantly reduces the course failure rate, improves students' performance, and reduces teachers' workload.
The invention provides an academic early warning monitoring method for large-scale students, and there are many methods and ways to implement this technical scheme. The above description is only a preferred embodiment of the invention; it should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be realized by the prior art.

Claims (9)

1. The academic early warning monitoring method facing the large-scale students is characterized by comprising an academic prediction process and a risk student monitoring process, wherein the academic prediction process comprises the following steps:
step a1, processing the learning activity data of the learner by characteristics;
step a2, carrying out modeling preparation;
step a3, performing model batch training by utilizing Spark MLlib to obtain a trained prediction model;
step a4, deploying a prediction model, carrying out Spark Streaming real-time calculation, and predicting the learning situation;
the risk student monitoring process comprises the following steps:
b1, extracting samples which are subjected to preprocessing, standardization processing and feature selection from a database according to the output result of the academic prediction to form a risk student set;
b2, clustering the risk student sets by using a DBSCAN algorithm with the parameter eps subjected to self-adaptive adjustment, and mining the risk characteristics of the risk subsets in each risk student set;
and b3, for risk subsets with different characteristics, providing countermeasures based on an expert knowledge base to mitigate the risk level of each risk group.
2. The method of claim 1, wherein step a1 comprises:
extracting characteristics of learning activity data of a learner, wherein the learning activity data comprises online learning behavior data, offline learning behavior data and scores of basic courses;
the online learning behavior data comprises online assignment completion results, online video viewing time, the number of online exams, online test results, the number of forum posts, and the number of assignment reviews;

the offline learning behavior data comprises classroom test results, attendance, and homework completion results;
the score data of the basic courses refers to the scores of all the basic courses;
the various kinds of sub-data under the learning activity data, the online learning behavior data, and the offline learning behavior data are collectively referred to as feature data; assuming there are k kinds of sub-data, the k groups of feature data are treated as k indexes, and the relative importance of the feature data is obtained based on the entropy weight method:

the information entropy is calculated using the following formula:

H(X) = -Σ_{i=1}^{n} p(x_i)·ln p(x_i)

the information entropy redundancy is calculated using the following formula: d_j = 1 - H_j

wherein p(x_i) is the probability of the random event x_i, H_j is the information entropy of the j-th group of feature data, d_j is the information entropy redundancy of the j-th group of feature data, n is the total number of samples, i ≤ n, and each i corresponds to one learning subject.
3. The method of claim 2, wherein step a1 further comprises:
assuming there are m groups of feature data, x_ij is the value of the j-th index for the i-th sample, i = 1, …, n; j = 1, …, m;

calculating the proportion p_ij of the i-th sample value under the j-th index:

p_ij = x_ij / Σ_{i=1}^{n} x_ij

calculating the entropy value H_j under the j-th index:

H_j = -k · Σ_{i=1}^{n} p_ij·ln(p_ij)

wherein k = 1/ln(n);

calculating the information entropy redundancy d_j:

d_j = 1 - H_j

calculating the weight w_j of each index:

w_j = d_j / Σ_{j=1}^{m} d_j

sorting the weights in descending order to obtain the weight set W = {w_1, …, w_j, …, w_m} and selecting the first t weights such that:

Σ_{j=1}^{t} w_j > 0.8

and finally obtaining t groups of feature data.
4. The method of claim 3, wherein step a1 further comprises: after the importance of each group of feature data is obtained based on the entropy weight method, selecting the first five groups of feature data whose cumulative importance exceeds 80% to form the training set, so that X = {x_1, x_2, x_3, x_4, x_5}, wherein x_1, x_2, x_3, x_4, x_5 respectively represent the number of forum posts, the average assignment score, the average score of the related basic courses, the average weekly online video viewing time, and the weekly participation frequency on the online learning platform.
5. The method of claim 4, wherein step a2 comprises: establishing the Logistic regression model function g(z):

g(z) = 1 / (1 + e^(-z))

wherein the output value y of the function g(z) lies in the open interval (0, 1); if y ≥ 0.5, the output label value is 1, indicating a passing result; if y < 0.5, the output label value is 0, indicating a failing result;

z is the classification decision boundary, calculated as:

z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + θ_4·x_4 + θ_5·x_5

wherein θ_i (i = 0, 1, …, 5) are the regression coefficients of the Logistic regression model, and θ_0 is the bias term;

the training set X is standardized based on the following formula:

Z = (X - μ) / σ

wherein Z is the feature data after z-score standardization, μ is the mean of X, and σ is the standard deviation of X.
6. The method of claim 5, wherein step a3 includes: and calculating an error value by using an LR model in Spark MLlib each time, and stopping iteration when the error is smaller than a set error threshold.
7. The method of claim 6, wherein step a4 comprises: deploying the trained prediction model online, selecting HBase to store daily learning activity data, using Spark to clean the data and calling the saved prediction model to predict, and storing the cleaned data and the prediction results in a MySQL database; and finally, building a real-time learning prediction service as a Web application so that teachers and students can view prediction results in real time.
8. The method of claim 7, wherein step b1 comprises: and according to the label value output by the academic prediction, extracting the samples which are subjected to preprocessing, standardization processing and feature selection from the MySQL database to form a risk student set.
9. The method of claim 8, wherein step b2 comprises: in the DBSCAN algorithm with the adaptively adjusted parameter eps, the parameter is determined by the following formulas:

MISE(h) = AMISE(h) + O(1/(nh) + h^4)

wherein the asymptotic integrated mean square error is

AMISE(h) = (1/(nh))·∫K^2(x)dx + (h^4/4)·[∫x^2·K(x)dx]^2·∫[f''(x)]^2 dx

MISE is the integrated mean square error; O(1/(nh) + h^4) denotes a function of n and h whose absolute value is bounded by a fixed multiple of |1/(nh) + h^4| when the argument is sufficiently large; K denotes the kernel function; h is called the window width; n is the number of samples; x is an auxiliary integration variable; ∫[f''(x)]^2 dx = ∫[φ''(x)]^2 dx / σ^5, wherein φ is the standard normal density function and σ denotes the standard deviation of the n samples under any given index; taking the extreme value of MISE, the extreme point gives the optimal eps value:

eps = h_opt = (4/(3n))^(1/5)·σ ≈ 1.06·σ·n^(-1/5)

clustering the risk student set with the DBSCAN algorithm whose parameter eps has been adaptively adjusted to obtain k risk subsets {r_1, …, r_k}, wherein r_k denotes the k-th risk subset; letting t_ij be the mean of risk subset r_j over the i-th feature and T_j = {t_1j, …, t_5j} be the set of feature means of risk subset r_j, i = 1, 2, 3, 4, 5; j = 1, 2, …, k.
CN202010928263.4A 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students Pending CN112149884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010928263.4A CN112149884A (en) 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010928263.4A CN112149884A (en) 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students

Publications (1)

Publication Number Publication Date
CN112149884A true CN112149884A (en) 2020-12-29

Family

ID=73890622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010928263.4A Pending CN112149884A (en) 2020-09-07 2020-09-07 Academic early warning monitoring method for large-scale students

Country Status (1)

Country Link
CN (1) CN112149884A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113080986A (en) * 2021-05-07 2021-07-09 中国科学院深圳先进技术研究院 Method and system for detecting exercise fatigue based on wearable equipment
CN113516286A (en) * 2021-05-14 2021-10-19 山东建筑大学 Student academic early warning method and system based on multi-granularity task joint modeling
CN113609779A (en) * 2021-08-16 2021-11-05 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004063831A2 (en) * 2003-01-15 2004-07-29 Bracco Imaging S.P.A. System and method for optimization of a database for the training and testing of prediction algorithms
US20130096892A1 (en) * 2011-10-17 2013-04-18 Alfred H. Essa Systems and methods for monitoring and predicting user performance
CA2838119A1 (en) * 2012-12-27 2014-06-27 Pearson Education, Inc. System and method for selecting predictors for a student risk model
US20140205987A1 (en) * 2013-01-18 2014-07-24 Steve Habermehl Apparatus and method for enhancing academic planning and tracking via an interactive repository database
CN106373057A (en) * 2016-09-29 2017-02-01 西安交通大学 Network education-orientated poor learner identification method
CN107578181A (en) * 2017-09-15 2018-01-12 中南大学 Exceptional student method for digging based on statistic frequency and correlation rule
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN111260514A (en) * 2020-01-14 2020-06-09 华中师范大学 Student score prediction method based on campus big data
CN111626372A (en) * 2020-05-29 2020-09-04 安徽医学高等专科学校 Online teaching supervision management method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004063831A2 (en) * 2003-01-15 2004-07-29 Bracco Imaging S.P.A. System and method for optimization of a database for the training and testing of prediction algorithms
US20130096892A1 (en) * 2011-10-17 2013-04-18 Alfred H. Essa Systems and methods for monitoring and predicting user performance
CA2838119A1 (en) * 2012-12-27 2014-06-27 Pearson Education, Inc. System and method for selecting predictors for a student risk model
US20140205987A1 (en) * 2013-01-18 2014-07-24 Steve Habermehl Apparatus and method for enhancing academic planning and tracking via an interactive repository database
CN106373057A (en) * 2016-09-29 2017-02-01 西安交通大学 Network education-orientated poor learner identification method
CN107578181A (en) * 2017-09-15 2018-01-12 中南大学 Exceptional student method for digging based on statistic frequency and correlation rule
CN110119421A (en) * 2019-04-03 2019-08-13 昆明理工大学 A kind of electric power stealing user identification method based on Spark flow sorter
CN111260514A (en) * 2020-01-14 2020-06-09 华中师范大学 Student score prediction method based on campus big data
CN111626372A (en) * 2020-05-29 2020-09-04 安徽医学高等专科学校 Online teaching supervision management method and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
廉宇: "Design of an Academic Early Warning System Based on Hadoop and Research on Key Technologies", China Master's Theses Full-text Database, Information Science and Technology, pages 169-170 *
李宗林等: "Adaptive Determination of Parameters in the DBSCAN Algorithm", Computer Engineering and Applications, pages 71-72 *
李建伟; 苏占玖; 黄茹: "Research on Online Learning Risk Prediction Based on Big Data Learning Analytics", Modern Educational Technology, no. 08, pages 78-84 *
程光胜: "Behavior Analysis of Higher Vocational Students Based on Big Data", Vocational Education Research, pages 76-80 *
陈子健等: "Research on Predictive Modeling of Online Learners' Academic Performance Based on Educational Data Mining", China Educational Technology, pages 75-78 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113080986A (en) * 2021-05-07 2021-07-09 中国科学院深圳先进技术研究院 Method and system for detecting exercise fatigue based on wearable equipment
CN113516286A (en) * 2021-05-14 2021-10-19 山东建筑大学 Student academic early warning method and system based on multi-granularity task joint modeling
CN113609779A (en) * 2021-08-16 2021-11-05 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning
CN113609779B (en) * 2021-08-16 2024-04-09 深圳力维智联技术有限公司 Modeling method, device and equipment for distributed machine learning

Similar Documents

Publication Publication Date Title
CN112149884A (en) Academic early warning monitoring method for large-scale students
CN116108758B (en) Landslide susceptibility evaluation method
Sani et al. Drop-out prediction in higher education among B40 students
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN110555459A (en) Score prediction method based on fuzzy clustering and support vector regression
CN112150304A (en) Power grid running state track stability prejudging method and system and storage medium
CN112116002A (en) Determination method, verification method and device of detection model
Bassi et al. Students Graduation on Time Prediction Model Using Artificial Neural Network
Hamim et al. Student profile modeling using boosting algorithms
CN112686462A (en) Student portrait-based anomaly detection method, device, equipment and storage medium
Loganathan et al. Development of machine learning based framework for classification and prediction of students in virtual classroom environment
Gao et al. Machine learning for credit card fraud detection
CN114912027A (en) Learning scheme recommendation method and system based on learning outcome prediction
Kumar et al. Comparative study of various supervised machine learning algorithms for an early effective prediction of the employability of students
YURTKAN et al. Student Success Prediction Using Feedforward Neural Networks
Suresh et al. Predicting the e-learners learning style by using support vector regression technique
CN113378581A (en) Knowledge tracking method and system based on multivariate concept attention model
Subarkah et al. Comparison of Different Classification Techniques to Predict Student Graduation
Wahyono et al. Optimization of Random Forest with Genetic Algorithm for Determination of Assessment
CN111079348A (en) Method and device for detecting slowly-varying signal
Gata et al. The Feasibility of Credit Using C4. 5 Algorithm Based on Particle Swarm Optimization Prediction
CN113421176B (en) Intelligent screening method for abnormal data in student score scores
Ma et al. A Comparison of Data Mining Approaches on Predicting the Repayment Behavior in P2P Lending
Zhang et al. Research on intelligent detection method of cognitive ability based on principal component analysis and extreme learning machine
CN116975621A (en) Model stability monitoring method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination