CN115409257A

CN115409257A - Score distribution prediction method and system based on condition density estimation model

Info

Publication number: CN115409257A
Application number: CN202211026487.1A
Authority: CN
Inventors: 张娜; 刘明
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2022-08-25
Filing date: 2022-08-25
Publication date: 2022-11-29

Abstract

The invention provides a score distribution prediction method and system based on a conditional density estimation model. It relates to the field of data mining and addresses the narrow applicability and low accuracy of existing score prediction schemes. The method comprises the following steps: collecting student data according to the prediction target and storing the data in a database of a first server; preprocessing the student data stored in the database, removing features that are severely missing or whose value distribution is abnormal, and performing feature fusion with a conditional mask mechanism to obtain a data set; constructing a conditional density estimation model and training it on the data set; and inputting the student data to be predicted into the trained conditional density estimation model to obtain the score probability density distribution of the prediction target. The invention uses the conditional density estimation model to build a unified technical framework that takes any available data as the input condition and predicts the complete probability density distribution of future course or examination scores, so that scores can be accurately predicted in any educational scenario.

Description

Score distribution prediction method and system based on condition density estimation model

Technical Field

The invention belongs to the field of data mining, and particularly relates to a score distribution prediction method and system based on a condition density estimation model.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

At present, student score prediction is an important goal of learning analytics. Driven by informatization, colleges and universities have begun to build digital or smart campuses with good results; however, smart campus construction has focused mainly on colleges and universities, and only some schools have begun to promote digital education. Existing prediction schemes have the following problems: (1) existing score prediction targets only one group, for example only the courses of college students, and no unified framework has been formed that suits all education stages from primary school through junior high school to university; (2) few of the factors that influence student scores are considered, and the influence of factors such as student characteristics, after-class activities and teachers is ignored; (3) the collected information cannot be fully fused, for example prior inventions cannot fully fuse samples of students who take different courses because they are in different majors; (4) only a single score value or a pass/fail outcome can be predicted, and the overall score distribution of the student cannot be predicted, so the prediction information is incomplete. The existing score prediction schemes therefore suffer from narrow applicability and low accuracy, and further research is needed.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a score distribution prediction method and a score distribution prediction system based on a condition density estimation model.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

The first aspect of the invention provides a score distribution prediction method based on a conditional density estimation model.

A score distribution prediction method based on a conditional density estimation model comprises the following steps:

collecting student data according to the prediction target, storing the data in a database of a first server, and screening the student data stored in the database to obtain attribute features;

preprocessing the student data stored in the database, removing features that are severely missing or whose value distribution is abnormal, and performing feature fusion with a conditional mask mechanism to obtain a data set;

constructing a conditional density estimation model and training it on the data set;

and inputting the student data to be predicted into the trained conditional density estimation model to obtain the score probability density distribution of the prediction target.

Further, the prediction targets comprise single-course scores, senior high school entrance examination scores, college entrance examination scores, postgraduate entrance examination scores and training-institution graduation scores;

the student data comprise student basic information, student course related data, teacher course related data, student behavior data and other information, wherein the other information comprises library access records, the number of books borrowed and awards won.

Further, the data are collected from the school's educational administration system, digital campus and smart campus through a data interface or a web crawler, and are stored in the database of the first server.

Further, the feature fusion adopts a conditional mask mechanism: to address the problem that unified modeling is impossible when samples have different features, a mask is introduced and associated with the features to mark which elements are missing.

Further, the conditional density estimation model fits and predicts the distribution of the target from the input conditions by a parametric or non-parametric method, and likelihood estimation is used to optimize the distribution error during fitting.

Further, the data set is divided by cohort (year of enrollment) into a training set, a validation set and a test set, which are used respectively for training, validating and testing the conditional density estimation model.

Further, the conditional density estimation model is one of a deconvolutional density network (DDN), conditional normalizing flows (CNFs), a kernel mixture network (KMN), a mixture density network (MDN) and a quantile regression random forest (QRFCDF).

A second aspect of the invention provides a score distribution prediction system based on a conditional density estimation model.

A score distribution prediction system based on a condition density estimation model comprises a data acquisition module, a feature processing module, a model training module and a score prediction module;

a data acquisition module configured to: collect student data according to the prediction target, store the data in a database of a first server, and screen the student data stored in the database to obtain attribute features;

a data processing module configured to: preprocess the student data stored in the database, remove features that are severely missing or whose value distribution is abnormal, and perform feature fusion with a conditional mask mechanism to obtain a data set;

a model training module configured to: construct a conditional density estimation model and train it on the data set;

a score prediction module configured to: input the student data to be predicted into the trained conditional density estimation model to obtain the score probability density distribution of the prediction target.

A third aspect of the present invention provides a computer-readable storage medium on which a program is stored, which, when executed by a processor, implements the steps in a performance distribution prediction method based on a conditional density estimation model according to the first aspect of the present invention.

A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor implements the steps of the performance distribution prediction method based on a conditional density estimation model according to the first aspect of the present invention when executing the program.

The above one or more technical solutions have the following beneficial effects:

The invention provides a score distribution prediction method and system based on a conditional density estimation model. The conditional density estimation model is used to build a unified technical framework that takes any available data as the input condition and predicts the complete probability density distribution of future course or examination scores, so that scores can be predicted accurately in any educational scenario. Because the uncertainty of the prediction process is taken into account, the method is closer to the nature of score prediction, provides more complete information and improves prediction accuracy.

To address the problem that unified modeling is impossible when attribute features differ, a mask mechanism is introduced and the mask is associated with the features to mark missing elements. This resolves the inconsistency of student information and allows all collected student information to be fully used.

By predicting the distribution over all possible scores of a student, the method, compared with traditional prediction and early-warning methods, can serve different applications under different schemes, support differentiated instruction, allow teaching difficulty to be adjusted appropriately, reduce the academic demands on some students, and cultivate diverse talent.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are included to illustrate an exemplary embodiment of the invention and not to limit the invention.

FIG. 1 is a flow chart of the method of the first embodiment.

FIG. 2 is a diagram of course history information and its derived features after the mask mechanism is applied in the first embodiment.

FIG. 3 shows the encoding procedure for the student's home province in the first embodiment.

FIG. 4 is a diagram of the steps for using the conditional probability density estimation model in the first embodiment.

FIG. 5 is a system configuration diagram of the second embodiment.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.

Example one

This embodiment discloses a score distribution prediction method based on a conditional density estimation model.

As shown in FIG. 1, the score distribution prediction method based on a conditional density estimation model includes:

S1, collecting student data according to the prediction target, and storing the student data in a database of a first server.

The prediction targets comprise single-course scores, senior high school entrance examination scores, college entrance examination scores, postgraduate entrance examination scores and training-institution graduation scores.

The student data come from the school's educational administration system, digital campus and smart campus; they are acquired through a data interface or a web crawler and stored in the database of the first server.

The dimensions of the collected data and the prediction target can change with the education stage. First, to predict course scores at any grade level, the data to be analyzed change with the grade. For example, when a university course score is predicted, data that become available only at the university stage, such as the student's historical course scores at university, college entrance examination score and library borrowing records, are collected in addition to the data available in primary and secondary school, to improve prediction accuracy. Second, when scores in entrance examinations are predicted, including the senior high school entrance examination, the college entrance examination and the postgraduate entrance examination, data that influence final performance at that stage should be collected as fully as possible, including historical course scores for the whole stage, student basic information, award information and mock examination scores.

The student data specifically collected include student basic information, student course related data, teacher course related data and student behavior data, which record the student's learning situation, the teacher's historical grading, teaching supervision scores, student evaluations of teaching, and the like.

Student course related data includes:

student ID, course serial number, course name, academic year, semester code, score, grade point, retake status, and the staff ID of the course teacher. Additional relevant data can be added according to the student's education stage; for example, the student's senior high school entrance examination score is added when predicting high school scores, and the student's college entrance examination score is added when predicting university scores.

Teacher lesson related data includes:

teacher staff ID, teacher name, course number, academic year, semester code.

The student and teacher derived data comprises:

the student's historical per-semester course failure rate, the student's historical average grade point (GPA), and the mean and standard deviation of the teacher's historical scores.
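As an illustration of how such derived features might be computed, the following is a minimal Pandas sketch. The column names (`student_id`, `teacher_id`, `semester`, `score`, `grade_point`), the toy records and the unweighted GPA are assumptions made for this example; the source lists the derived quantities but does not specify a schema or a credit weighting.

```python
import pandas as pd

# Hypothetical course records; column names are illustrative, not from the source.
records = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "teacher_id": ["t1", "t2", "t1", "t1", "t2", "t2"],
    "semester": [1, 1, 2, 1, 1, 2],
    "score": [55.0, 72.0, 80.0, 90.0, 58.0, 66.0],
    "grade_point": [0.0, 2.2, 3.0, 4.0, 0.0, 1.6],
})

PASS_MARK = 60.0  # pass mark on a percentage scale


def student_history_features(df, current_semester):
    """Failure rate and mean grade point over semesters before `current_semester`."""
    past = df[df["semester"] < current_semester]
    grouped = past.groupby("student_id")
    return pd.DataFrame({
        "hist_fail_rate": grouped["score"].apply(lambda s: (s < PASS_MARK).mean()),
        "hist_gpa": grouped["grade_point"].mean(),  # unweighted mean (assumption)
    })


def teacher_history_features(df, current_semester):
    """Mean and standard deviation of the scores each teacher gave in the past."""
    past = df[df["semester"] < current_semester]
    grouped = past.groupby("teacher_id")["score"]
    return pd.DataFrame({
        "teacher_score_mean": grouped.mean(),
        "teacher_score_std": grouped.std(ddof=0),
    })


student_feats = student_history_features(records, current_semester=2)
teacher_feats = teacher_history_features(records, current_semester=2)
print(student_feats)
print(teacher_feats)
```

Restricting the aggregation to earlier semesters keeps the derived features free of information from the semester being predicted.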

The basic information of the student comprises:

student ID, name, grade, college, major, date of birth, age, place of origin.

The student behavior data includes:

the number of books borrowed and the number of library entries and exits. Further behavior data, such as students' daily consumption data and movement trajectories, can be added depending on the school's level of informatization.

Besides data from the student's current stage, information from the previous stage is added according to the education stage the student is currently in; for example, features including the student's college entrance examination results are introduced when a university student's scores are predicted.

S2, preprocessing the student data stored in the database, removing features that are severely missing or whose value distribution is abnormal, and performing feature fusion with a conditional mask mechanism to obtain a data set.

Data preprocessing operations such as data cleaning and deduplication are performed with the Pandas module to improve data quality and associate the data, specifically:

1) Data cleaning: records with abnormal information are removed, such as dirty data with abnormal course score values and duplicate records.

2) Missing-value handling: features are removed or filled depending on how much data is missing; filling uses the mean of the feature, so that the filled data do not differ greatly from the original data.

3) Information association: the data of students, teachers and so on are joined through the student ID, course number and teacher staff ID fields to form a sample set.
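The three preprocessing steps can be sketched with Pandas as follows. This is a minimal illustration, not the applicant's implementation: the table layouts, the column names, the 0 to 100 validity range and the `MAX_MISSING_RATIO` threshold for "severely missing" are all assumptions introduced for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; schemas and values are illustrative only.
students = pd.DataFrame({
    "student_id": ["s1", "s2", "s3"],
    "age": [19.0, np.nan, 21.0],
    "mostly_missing": [np.nan, np.nan, 1.0],
})
scores = pd.DataFrame({
    "student_id": ["s1", "s1", "s2", "s3", "s3"],
    "course_id": ["c1", "c1", "c1", "c1", "c2"],
    "teacher_id": ["t1", "t1", "t1", "t1", "t2"],
    "score": [78.0, 78.0, 640.0, 91.0, 55.0],
})
teachers = pd.DataFrame({
    "teacher_id": ["t1", "t2"],
    "teacher_score_mean": [74.0, 68.0],
})

MAX_MISSING_RATIO = 0.5  # assumed cut-off for a "severely missing" feature


def preprocess(students, scores, teachers):
    # 1) Cleaning: drop exact duplicates and scores outside the percentage scale.
    scores = scores.drop_duplicates()
    scores = scores[scores["score"].between(0, 100)]

    # 2) Missing values: drop severely missing features, mean-fill the rest.
    keep = students.columns[students.isna().mean() <= MAX_MISSING_RATIO]
    students = students[keep].copy()
    numeric = students.select_dtypes("number").columns
    students[numeric] = students[numeric].fillna(students[numeric].mean())

    # 3) Association: join on the student ID and teacher staff ID keys.
    merged = scores.merge(students, on="student_id", how="left")
    merged = merged.merge(teachers, on="teacher_id", how="left")
    return merged.reset_index(drop=True)


samples = preprocess(students, scores, teachers)
print(samples)
```

Left joins are used here so that every cleaned score record is kept even when a matching student or teacher row is absent; whether that is the desired behavior depends on the deployment.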

To address the problem that unified modeling is impossible when samples have different features, a conditional mask mechanism is introduced: a mask is associated with the features and marks which elements are missing. The specific process is as follows:

1) Determine the full set of required attribute features, $\text{feature} = \{c_1, c_2, \dots, c_k\}$, where $k$ is the number of attribute features.

2) Randomly generate a $k$-dimensional mask sequence $rm = [m_1, m_2, \dots, m_k]$, where $m_i \in \{0, 1\}$ and $i \in \{1, 2, \dots, k\}$. The sequence $rm$ satisfies two rules: (1) when the attribute feature $c_i$ is missing (the student did not take the relevant course, or the data are absent), $m_i$ is set to 0, otherwise to 1; (2) during model training, a portion of the non-missing data is also randomly masked, which improves the generalization ability of the model under unknown conditions. Finally, the feature set is multiplied element-wise by $rm$ and combined with $rm$ itself to form the final data set, as shown in FIG. 2.
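The mask construction described above might look like the following NumPy sketch. The function name `apply_condition_mask`, the use of NaN to encode a missing feature and the `drop_prob` parameter (the share of observed entries hidden at random during training) are assumptions for illustration; the source does not state how missing values are encoded or how many observed entries are masked.

```python
import numpy as np


def apply_condition_mask(features, drop_prob=0.2, rng=None):
    """Build masked inputs from an (n_samples, k) array that uses NaN for missing.

    Returns an (n_samples, 2k) array: the masked features followed by the mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    observed = ~np.isnan(features)                   # rule (1): 0 where missing
    keep = rng.random(features.shape) >= drop_prob   # rule (2): random hiding
    mask = (observed & keep).astype(features.dtype)
    filled = np.nan_to_num(features, nan=0.0)        # NaN * 0 would remain NaN
    return np.concatenate([filled * mask, mask], axis=1)


x = np.array([[85.0, np.nan, 3.2],
              [np.nan, 70.0, 2.5]])

# Training: some observed entries are hidden at random.
train_rows = apply_condition_mask(x, drop_prob=0.2, rng=np.random.default_rng(0))
# Inference: only genuinely missing entries are masked.
infer_rows = apply_condition_mask(x, drop_prob=0.0)
print(infer_rows)
```

Appending the mask to the masked features lets the model distinguish a feature that is truly zero from one that is absent, which is what allows samples with different feature sets to share one input layout.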

S3, constructing a conditional density estimation model and training it on the data set.

The conditional density estimation (CDE) model may be any of several models, such as a deconvolutional density network (DDN), conditional normalizing flows (CNFs), a kernel mixture network (KMN), a mixture density network (MDN) or a quantile regression random forest (QRFCDF). Such a model fits the probability density distribution of the prediction target to the input conditions by a parametric or non-parametric method; during fitting, likelihood estimation is used to optimize the distribution error, as follows:

$m$ samples $x^1, x^2, \dots, x^m$ are obtained by sampling from the input vectors of the model;

the likelihood function of the samples is calculated:

$$L = \prod_{i=1}^{m} P_G(x^i; \theta)$$

the parameter $\theta$ that maximizes $L$ is calculated:

$$\theta^* = \arg\max_{\theta} L$$

The higher the probability of the sampled data under the distribution model $P_G(x^i; \theta)$, i.e., the larger $L$, the closer the distribution $P_G(x^i; \theta)$ is to the sample distribution.

Here $P_G(x^i; \theta)$ is the defined distribution model, whose distribution is determined by $\theta$; the aim is to find the parameter $\theta$ that makes the distribution $P_G(x^i; \theta)$ as close as possible to the distribution from which the real samples were drawn.

The loss function of the neural network is the negative log-likelihood, which aims to minimize the error between the predicted distribution and the real distribution:

$$\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{i=1}^{m} \log P_G(x^i; \theta)$$
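To make the negative log-likelihood objective concrete, the sketch below evaluates it for a Gaussian mixture, the output form used by a mixture density network, which is one of the model families named above. The function names, the two-component toy parameters and the choice of averaging rather than summing over samples (which does not change the minimizer) are assumptions for this example; in a real model the mixture parameters would be produced by the network from the input conditions.

```python
import numpy as np


def mixture_log_density(y, weights, means, stds):
    """log p(y_i | x_i) under a per-sample Gaussian mixture.

    y has shape (n,); weights, means and stds have shape (n, K).
    """
    y = y[:, None]
    log_comp = (-0.5 * ((y - means) / stds) ** 2
                - np.log(stds) - 0.5 * np.log(2.0 * np.pi))
    a = np.log(weights) + log_comp
    a_max = a.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    return (a_max + np.log(np.exp(a - a_max).sum(axis=1, keepdims=True)))[:, 0]


def negative_log_likelihood(y, weights, means, stds):
    """Average negative log-likelihood of the observed scores."""
    return float(-mixture_log_density(y, weights, means, stds).mean())


y_true = np.array([72.0, 58.0])          # observed scores (toy values)
w = np.full((2, 2), 0.5)                 # equal mixture weights
s = np.full((2, 2), 5.0)                 # component standard deviations
good_means = np.array([[70.0, 75.0], [55.0, 60.0]])
poor_means = np.array([[40.0, 45.0], [90.0, 95.0]])

nll_good = negative_log_likelihood(y_true, w, good_means, s)
nll_poor = negative_log_likelihood(y_true, w, poor_means, s)
print(nll_good, nll_poor)
```

A predicted density that places more mass near the observed scores yields a lower loss, which is the sense in which minimizing this quantity brings the predicted distribution closer to the real one.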

When data are collected cohort by cohort in the same setting, the score distributions of successive cohorts do not deviate greatly and are approximately independent and identically distributed, so the distribution learned from the training set does not deviate greatly from that of the validation set and the test set. The invention therefore mirrors the educational process and practical needs of a real setting: the most recent cohort of students serves as the test set for score distribution prediction, and the partitioned data sets are stored separately in the database for later use, specifically:

The data set is divided by cohort into a training set, a validation set and a test set, which are used respectively for training, validating and testing the conditional density estimation model. For example, to simulate a real setting, all student samples from the 2015 to 2018 cohorts form the training set, student samples from the 2019 cohort form the validation set, and students of the 2020 cohort form the test set.
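The cohort-based split from the example can be written as a short Pandas sketch. The `cohort` column name and the toy sample table are assumptions; only the 2015 to 2018 / 2019 / 2020 assignment comes from the text.

```python
import pandas as pd


def split_by_cohort(samples, train_cohorts, val_cohorts, test_cohorts,
                    cohort_col="cohort"):
    """Split samples by enrollment cohort so later cohorts never leak into training."""
    train = samples[samples[cohort_col].isin(list(train_cohorts))]
    val = samples[samples[cohort_col].isin(list(val_cohorts))]
    test = samples[samples[cohort_col].isin(list(test_cohorts))]
    return train, val, test


# Hypothetical sample table: two students per cohort.
samples = pd.DataFrame({
    "student_id": [f"s{i}" for i in range(12)],
    "cohort": [2015, 2016, 2017, 2018, 2019, 2020] * 2,
    "score": [71, 64, 88, 59, 93, 77, 68, 82, 55, 74, 61, 90],
})

train, val, test = split_by_cohort(
    samples,
    train_cohorts=range(2015, 2019),  # 2015 to 2018
    val_cohorts=[2019],
    test_cohorts=[2020],
)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random matches the deployment situation, where the model is always applied to a cohort newer than any it was trained on.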

The conditional density estimation model is trained on the training set; the validation set is then fed to the trained model to check whether the output distribution deviates substantially from the real distribution, and the model is re-optimized accordingly to calibrate it. The validation set is used to verify the model's learning performance, so that the model learns correctly and over-fitting or under-fitting is avoided.

Calibration may be performed with a regression calibration method, which checks whether the probabilities given by the predicted distribution match the frequencies actually observed, i.e., it calibrates the error between the observed confidence and the expected confidence.

Based on the regression calibration evaluation index, calibrating the model with the validation set comprises the following steps:

(1) After the score probability density distribution of each student is predicted, the cumulative probability (CDF value) of the predicted distribution below the student's real score is additionally computed and stored in the predefined cumulative probability array CDF_LIST.

(2) A two-dimensional coordinate system is defined. The abscissa is divided into ten confidence intervals between 0 and 1, namely [0-0.1, 0.1-0.2, …, 0.9-1.0], and represents the confidence expected by the model. The cumulative probability values in CDF_LIST are then assigned to the corresponding confidence intervals and the frequency of samples in each interval is calculated, where frequency = number of samples in the interval / total number of samples. The ordinate represents the observed confidence of the model's predictions and ranges from 0 to 1. With a plotting tool, the calibration curve for the validation set samples can be visualized.

(3) When the model is perfectly calibrated, the expected confidence equals the observed confidence, i.e., the calibration curve drawn in the previous step coincides with the perfect calibration curve (the diagonal); for example, among all samples for which the model outputs a cumulative probability value of 0.1, the proportion of positive samples should be 0.1. When calibration is imperfect, the model may be over-calibrated or under-calibrated. In the over-calibrated state the calibration curve lies above the diagonal, meaning that, in probabilistic terms, the cumulative probability of the model's output distribution is higher than in the real situation; the under-calibrated state is the opposite, with the observed confidence curve below the perfect calibration curve. In addition, the error between the expected confidence and the observed confidence can be assessed by inspecting the gap between the calibration curve and the perfect calibration curve or by computing the mean squared error (MSE), and the parameters are then adjusted to bring the calibration curve toward the perfect state, which completes model calibration.

(4) The calibrated model and its parameters are stored.
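The regression calibration check in steps (1) to (3) can be sketched as follows. Several choices here are assumptions rather than details from the source: the predictive distributions are taken to be Gaussian so that the CDF has a closed form, the validation data are synthetic, and the per-interval frequencies are accumulated so that the resulting curve can be compared with the diagonal.

```python
import math

import numpy as np


def normal_cdf(y, mean, std):
    """CDF of a Gaussian predictive distribution, evaluated at the real score."""
    z = (np.asarray(y) - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + np.vectorize(math.erf)(z))


def calibration_curve(cdf_list, n_bins=10):
    """Expected vs. observed confidence from the CDF values of real scores."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(cdf_list, bins=edges)
    freq = counts / len(cdf_list)   # frequency of samples per interval
    expected = edges[1:]            # expected confidence levels 0.1, ..., 1.0
    observed = np.cumsum(freq)      # share of samples with CDF value <= level
    mse = float(np.mean((observed - expected) ** 2))
    return expected, observed, freq, mse


# Synthetic validation data: real scores drawn around known means with std 8.
rng = np.random.default_rng(0)
means = rng.uniform(60.0, 85.0, size=5000)
true_scores = rng.normal(means, 8.0)

cdf_ok = normal_cdf(true_scores, means, 8.0)      # predicted spread matches
cdf_narrow = normal_cdf(true_scores, means, 4.0)  # predicted spread too narrow

_, obs_ok, _, mse_ok = calibration_curve(cdf_ok)
_, obs_narrow, _, mse_narrow = calibration_curve(cdf_narrow)
print(mse_ok, mse_narrow)
```

When the predicted spread matches the data, the observed curve stays close to the diagonal and the MSE is near zero; a model whose predicted distributions are too narrow departs from the diagonal and shows a clearly larger MSE.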

S4, inputting the student data to be predicted into the trained conditional density estimation model to obtain the score probability density distribution of the prediction target.

The student data to be predicted are input into the calibrated model to predict the student's score distribution. With the complete distribution, educators can make more comprehensive and reliable decisions about the student. The student information and prediction results are stored in the database, which provides a data interface for a subsequent student score early-warning platform.

After step S3 the model is well calibrated and outputs a reliable score probability density distribution. From the complete distribution, educators obtain more complete information than existing score prediction models provide, which supports applications in any scenario.

Take score early warning as an example. Once the complete distribution of a student's score is obtained, the probability that the score falls in the 0-60 interval (on a percentage scale) is computed as the student's final risk of failing. Compared with traditional student score early-warning models, more distributional information is used in the decision. Students are then ranked by risk, and those with a risk above 70% receive focused attention and help to pass the examination. The model induction and inference steps are the same as steps S1 and S2. However, building on the regression calibration in step S3, the task can be further converted into a classification problem of whether the warning is correct (for example, a binary classification task of whether a student with a failure risk of 85% actually passes), and a classification calibration evaluation index is used to evaluate the model and calibrate it. Calibrating the model with the validation set then comprises the following steps:

(1) From each predicted score probability density distribution, the cumulative probability of the score falling in the 0-60 interval (on a percentage scale, a score below 60 is a fail) is computed as the course risk probability value and stored in the array CDF_RISK.

(2) Confidence intervals are defined: the abscissa divides the confidence range 0-1 into ten groups, and each sample is assigned to the corresponding interval according to its risk probability value in CDF_RISK. Unlike regression calibration, the ordinate for the classification task is the accuracy, i.e., the proportion of samples in the interval that are classified correctly. Perfect calibration is reached when the confidence equals the accuracy; for example, if 10 students have predicted risk probabilities between 0.9 and 1.0, then 9 to 10 of them actually fail. If the confidence does not equal the accuracy, the model can be optimized by adjusting its parameters. As before, the error can be measured by inspecting the calibration plot or by computing the expected calibration error (ECE), the sample-weighted average over intervals of the difference between accuracy and average confidence:

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{N_b}{N}\,\bigl|\mathrm{acc}(b) - \mathrm{conf}(b)\bigr|$$

where $b$ indexes the risk intervals, $B$ is the number of intervals (ten here), $N$ is the total number of samples, $N_b$ is the number of samples in the $b$-th interval, and $\mathrm{acc}(b)$ and $\mathrm{conf}(b)$ are the accuracy and the average confidence (average risk probability) in that interval.

(3) After calibration the model is stored for testing. The result returned is the risk probability value of each student. This value reflects not only the probability that the student fails (score below 60); it also means that, among the students in the same risk interval, the proportion who actually fail (i.e., the accuracy) matches the value of that interval. For example, if 10 students each have a risk probability of 0.9, then 9 to 10 of them should actually fail, so the interval accuracy is 0.9, which corresponds to the 0.9-1.0 risk interval.
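The early-warning calibration steps above can be sketched in NumPy as follows. The helper names `fail_risk` and `expected_calibration_error`, the tabulation of the density on a score grid, the trapezoidal integration and the toy data are assumptions for illustration; the accuracy in each interval is taken to be the share of students who actually failed, as in the worked example in the text.

```python
import numpy as np


def fail_risk(grid, density, pass_mark=60.0):
    """P(score < pass_mark) from a density tabulated on an ascending score grid."""
    widths = np.diff(grid)
    areas = 0.5 * (density[1:] + density[:-1]) * widths  # trapezoid rule
    below = areas[grid[1:] <= pass_mark + 1e-9].sum()
    return float(below / areas.sum())                    # renormalize on the grid


def expected_calibration_error(risk, failed, n_bins=10):
    """Sample-weighted mean gap between failure rate and average predicted risk."""
    risk = np.asarray(risk, dtype=float)
    failed = np.asarray(failed, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(risk, edges[1:-1])  # interval index 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        in_bin = idx == b
        n_b = int(in_bin.sum())
        if n_b == 0:
            continue
        acc = failed[in_bin].mean()       # share of students who actually failed
        conf = risk[in_bin].mean()        # average predicted risk in the interval
        ece += n_b / len(risk) * abs(acc - conf)
    return float(ece)


# Risk for one student whose predicted density is centered on the pass mark.
grid = np.linspace(0.0, 100.0, 201)
density = np.exp(-0.5 * ((grid - 60.0) / 10.0) ** 2)
risk_one = fail_risk(grid, density)

# Toy validation set: ten high-risk students (nine failed), ten low-risk (none failed).
CDF_RISK = np.array([0.95] * 10 + [0.05] * 10)
failed = np.array([1] * 9 + [0] + [0] * 10)
ece = expected_calibration_error(CDF_RISK, failed)
print(risk_one, ece)
```

Ranking students by the `fail_risk` value and flagging those above a chosen threshold, such as the 70% mentioned earlier, would then give the early-warning list.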

From a student's complete conditional probability density distribution, educators can analyze the student's performance characteristics in a course. While ensuring that students whose score distribution for the course is concentrated in the low range pass the examination, they can also identify and cultivate students with potential and talent, encouraging or guiding them to take part in academic competitions and research activities to further raise their level, thereby improving student quality overall.

Example two

This embodiment discloses a score distribution prediction system based on a conditional density estimation model.

As shown in FIG. 5, the score distribution prediction system based on a conditional density estimation model includes a data acquisition module, a feature processing module, a model training module and a score prediction module;

a data processing module configured to: preprocess the student data stored in the database, remove features that are severely missing or whose value distribution is abnormal, and perform feature fusion with a conditional mask mechanism to obtain a data set;

Example three

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in a performance distribution prediction method based on a conditional density estimation model according to embodiment 1 of the present disclosure.

Example four

An object of the present embodiment is to provide an electronic device.

An electronic device, comprising a memory, a processor and a program stored in the memory and executable on the processor, wherein the processor implements the steps of the performance distribution prediction method based on the conditional density estimation model according to embodiment 1 of the present disclosure when executing the program.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A score distribution prediction method based on a condition density estimation model is characterized by comprising the following steps:

according to the prediction target, student data are collected and stored in a database of a first server;

preprocessing the student data stored in the database, removing features that are severely missing or whose value distribution is abnormal, and performing feature fusion with a conditional mask mechanism to obtain a data set;

and inputting the student data to be predicted into the trained conditional density estimation model to obtain the score probability density distribution of the prediction target.

2. The score distribution prediction method based on the conditional density estimation model as claimed in claim 1, wherein the prediction targets comprise single-course scores, senior high school entrance examination scores, college entrance examination scores, postgraduate entrance examination scores and training-institution graduation scores;

the student data comprise student basic information, student course related data, teacher course related data, student behavior data and other information, wherein the other information comprises library access records, the number of books borrowed and awards won.

3. The method as claimed in claim 1, wherein the data are collected from an educational administration system, a digital campus and a smart campus through a data interface or a web crawler, and are stored in the database of the first server.

4. The score distribution prediction method based on the conditional density estimation model as claimed in claim 1, wherein the feature fusion adopts a conditional mask mechanism, specifically: to address the problem that unified modeling is impossible when samples have different features, a mask mechanism is introduced, and missing elements are marked by associating the mask with the features.

5. The method as claimed in claim 1, wherein the conditional density estimation model is used to fit the distribution of the predicted target according to the input conditions by using a parametric or non-parametric method, and the likelihood estimation is used to optimize the distribution error during the distribution fitting process.

6. The performance distribution prediction method based on the conditional density estimation model as claimed in claim 1, wherein the data set is divided into a training set, a validation set and a test set according to the grade, and the training set, the validation set and the test set are respectively used for learning, checking and testing of the conditional density estimation model.

7. The score distribution prediction method based on the conditional density estimation model as claimed in claim 1, wherein the conditional density estimation model is one of a deconvolutional density network (DDN), conditional normalizing flows (CNFs), a kernel mixture network (KMN), a mixture density network (MDN) and a quantile regression random forest (QRFCDF).

8. A score distribution prediction system based on a condition density estimation model is characterized by comprising a data acquisition module, a feature processing module, a model training module and a score prediction module;

a data acquisition module configured to: according to the prediction target, student data are collected and stored in a database of a first server;

9. Computer-readable storage medium, on which a program is stored which, when being executed by a processor, carries out the steps of a method for performance distribution prediction based on a conditional density estimation model according to any one of claims 1 to 7.

10. Electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, characterized in that the processor implements the steps of a performance distribution prediction method based on a condition density estimation model according to any of claims 1 to 7 when executing the program.