CN105894119A

CN105894119A - Student ranking prediction method based on campus data

Info

Publication number: CN105894119A
Application number: CN201610207978.4A
Authority: CN
Inventors: 杨磊; 聂敏; 夏虎
Original assignee: Chengdu Xundao Technology Co Ltd
Current assignee: Chengdu Xundao Technology Co Ltd
Priority date: 2016-04-05
Filing date: 2016-04-05
Publication date: 2016-08-24

Abstract

The invention discloses a student ranking prediction method based on campus data, comprising the following steps: collecting the data of all students, including achievement data and behavior data; cleaning the student data and normalizing the non-temporal data items; extracting the behavior feature vector of each student from the processed data, where the behavior features include achievement features, effort features, and life-regularity features; reducing the dimension of each behavior feature vector; for each student, subtracting the behavior feature vector of each other student from the student's dimension-reduced behavior feature vector to obtain difference feature vectors, inputting the difference feature vectors into a classifier to obtain the corresponding label values, and summing the label values to obtain the student's score; and sorting the scores of all students to obtain the predicted ranking of each student. By analyzing students' campus data, the invention describes their study habits and behavior features with data and predicts each student's ranking, which serves as a reference for student education.

Description

Student ranking prediction method based on campus data

Technical field

The invention belongs to the technical field of big data analysis and mining, and more specifically relates to a student ranking prediction method based on campus data.

Background technology

How to understand student psychology, detect abnormal student behavior, predict students' academic performance, and provide personalized teaching has become a problem and a challenge facing many colleges and universities. In recent years, with the technological revolution driven by "data and computing", big data has become an important driver of the Internet information technology industry. How to introduce big data into education, as a powerful aid for promoting education reform and leading educational innovation, has become a new research direction. At present, however, because of problems such as the difficulty of quantifying student behavior, the application of big data in education is still at the research stage, and no effective application mode has yet emerged.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and provide a student ranking prediction method based on campus data, which analyzes students' campus data, describes their study habits and behavior features with data, and predicts student rankings as a reference for student education.

To achieve the above object, the student ranking prediction method based on campus data of the invention comprises the following steps:

S1: collect the data of all students, including achievement data and behavior data, where the achievement data includes the course type, number of credits, and grade of each of the student's courses, and the behavior data includes records of the student's use of the campus all-in-one card at various places on campus;

S2: perform data cleaning on the collected student data;

S3: normalize the non-temporal data items in the cleaned student data using the following method:

Denote the j-th non-temporal data item of the i-th student as x_ij, where i=1,2,…,N, N is the number of students, j=1,2,…,M, and M is the number of non-temporal data items. For each value x_ij, compute the linear transformation value x'_ij as:

x'_{ij} = \frac{x_{ij} - \min_{j}}{\max_{j} - \min_{j}} (T_{j\_max} - T_{j\_min}) + T_{j\_min}

where max_j is the maximum of the j-th data sequence, min_j is the minimum of the j-th data sequence, T_{j_max} is the upper bound of the target interval for the j-th data sequence, and T_{j_min} is the lower bound of that interval;

For the linearly transformed data x'_ij, compute the normalized data value y_ij as:

y_{ij} = \frac{x'_{ij} - \bar{x}_{j}}{s_{j}}

where \bar{x}_j is the mean of the j-th data sequence after linear transformation and s_j is its standard deviation;

S4: extract the behavior feature vector of each student from the student data. The behavior features include achievement features, effort features, and life-regularity features, where the achievement features include the course type, number of credits, and grade of each of the student's courses, the effort features are the frequencies with which the student enters study-related places, and the life-regularity feature is the student's life-regularity measure. These data items together form the student's behavior feature vector;

S5: reduce the dimension of the behavior feature vectors extracted in step S4 to obtain the dimension-reduced behavior feature vector of each student;

S6: for the i-th student, subtract the behavior feature vector of each of the other students from that student's behavior feature vector to obtain N-1 difference feature vectors. Input the difference feature vectors into a pre-trained classifier to obtain the corresponding N-1 labels, each with value 1 or -1. Sum all of the student's label values to obtain the student's score, then sort the scores of all students to obtain each student's predicted ranking;

The classifier is trained as follows: for students who have historical rankings, collect their data, obtain their behavior feature vectors by the method of steps S1 to S5, and compute the difference feature vector between each pair of students. For a given difference feature vector, if the student whose feature vector is the minuend (the vector being subtracted from) ranks higher, the label of the difference feature vector is 1; otherwise it is -1. The classifier is trained with these difference feature vectors as input and the corresponding labels as output.

The student ranking prediction method based on campus data of the invention collects the data of all students, including achievement data and behavior data, cleans the student data, and normalizes the non-temporal data items. It then extracts the behavior feature vector of each student from the processed data, where the behavior features include achievement features, effort features, and life-regularity features, and reduces the dimension of the behavior feature vectors. For each student, the behavior feature vector of each of the other students is subtracted from the student's dimension-reduced behavior feature vector to obtain difference feature vectors, which are input into the classifier to obtain the corresponding label values. Summing the label values gives the student's score, and sorting the scores of all students gives the predicted ranking of each student.

The invention performs in-depth analysis of students' learning behavior data on campus, gives an accurate quantitative description of each student's basic information, study, and living conditions, and predicts each student's ranking. It thereby provides a quantitative decision-making basis for the teaching management and daily guidance work of the relevant functional departments and effectively releases the value of student data.

Brief description of the drawings

Fig. 1 is the flow chart of the student ranking prediction method based on campus data of the invention;

Fig. 2 is the flow chart of the dimensionality reduction of the behavior feature data.

Detailed description of the invention

Specific embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note in particular that, in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.

Embodiment

Fig. 1 is the flow chart of the student ranking prediction method based on campus data of the invention. As shown in Fig. 1, the method comprises the following steps:

S101: Student data collection:

First, the data of all students must be collected. Student data originates from the various functional departments of the school and is structurally heterogeneous, ranging from structured basic student information to time-serialized campus life trajectories. Student data includes achievement data and behavior data. The achievement data includes the course type, number of credits, and grade of each of the student's courses, together with the components of each grade (such as the regular coursework grade and the mid-term grade). The behavior data includes records of the student's use of the campus all-in-one card at various places on campus, for example: records of purchases at the supermarket and canteen and of fetching water in classrooms, including the time and amount of each transaction; entry and exit records from the library and dormitory access control; and book-borrowing records, including the book information and borrowing time. Table 1 gives an example of student data sources and contents.

Table 1

S102: Data cleaning:

After all student data has been collected, the raw data must be cleaned. In the invention, the student data comes from multiple business systems and contains a large amount of historical data, so duplicate values, missing values, and similar defects are common, and data cleaning is therefore required. The task of data cleaning is to filter out data that does not meet the requirements and write the corrected data back to the data warehouse. The objects of cleaning mainly include duplicate values, missing values, and inconsistent data. Data cleaning is a routine technique in the big data field, and its detailed procedure is not repeated here.
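The patent leaves the cleaning procedure open. As a minimal sketch only, the following Python removes records with missing required fields and exact duplicates. The record layout, the field names (`student_id`, `place`, `time`), and the helper `clean_records` are hypothetical and are not taken from the source.

```python
def clean_records(records, required_fields):
    """Drop records with a missing required field and exact duplicates.

    Hypothetical helper: the source does not prescribe a cleaning procedure.
    The first occurrence of each duplicate is kept.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # A field counts as missing when it is absent, None, or an empty string.
        if any(rec.get(f) is None or rec.get(f) == "" for f in required_fields):
            continue
        key = tuple(rec[f] for f in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Illustrative card-swipe records (invented values).
raw = [
    {"student_id": "S001", "place": "library", "time": "2016-03-01 08:05"},
    {"student_id": "S001", "place": "library", "time": "2016-03-01 08:05"},  # duplicate
    {"student_id": "S002", "place": None, "time": "2016-03-01 09:10"},       # missing place
    {"student_id": "S003", "place": "canteen", "time": "2016-03-01 12:00"},
]
cleaned = clean_records(raw, ["student_id", "place", "time"])
```

Inconsistent values (for example, the same place recorded under different names) would need rules specific to each source system and are not covered by this sketch.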

S103: Data normalization:

In the cleaned student data, the attributes of the data items differ and generally have different dimensions and orders of magnitude. In general, expressing an attribute in smaller units gives that attribute a larger range of values, which tends to give it greater influence, or a higher "weight". To avoid dependence on the choice of measurement units and ensure reliable results, all data items in the raw data other than time data need to be normalized.

Data normalization scales the data proportionally so that it falls within a small, specific interval. This approach is commonly used when comparing and evaluating indicators: it removes the units of the data and converts them into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted. In the invention, data normalization comprises the following two steps:

● Linear transformation:

Denote the j-th non-temporal data item of the i-th student as x_ij, where i=1,2,…,N, N is the number of students, j=1,2,…,M, and M is the number of non-temporal data items. For each value, compute the linear transformation value x'_ij as:

x'_{ij} = \frac{x_{ij} - \min_{j}}{\max_{j} - \min_{j}} (T_{j\_max} - T_{j\_min}) + T_{j\_min}

where max_j is the maximum of the j-th data sequence, min_j is the minimum of the j-th data sequence, T_{j_max} is the upper bound of the target interval for the j-th data sequence, and T_{j_min} is the lower bound of that interval. The j-th data sequence is the sequence formed by the j-th data item of all students. The formula thus maps the values of the j-th data sequence uniformly from the original interval [min_j, max_j] onto [T_{j_min}, T_{j_max}].

Suppose the j-th data sequence is [1,2,1,4,3,2,5,6,2,7], so its range is [1,7], and the target interval is [0,1]. The data sequence after linear transformation is then [0,0.16,0,0.5,0.33,0.16,0.66,0.83,0.16,1] (values truncated to two decimal places).
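The linear transformation can be checked with a short Python sketch. It applies the formula above to the ten-value example sequence, read as [1,2,1,4,3,2,5,6,2,7], with the target interval [0,1]. The function name `linear_transform` and the handling of a constant sequence are assumptions made for illustration.

```python
def linear_transform(seq, t_min=0.0, t_max=1.0):
    """Map seq linearly from [min(seq), max(seq)] onto [t_min, t_max]."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        # Assumption: a constant sequence is mapped to the lower bound,
        # since the formula is undefined when max_j equals min_j.
        return [t_min for _ in seq]
    scale = (t_max - t_min) / (hi - lo)
    return [(x - lo) * scale + t_min for x in seq]


seq = [1, 2, 1, 4, 3, 2, 5, 6, 2, 7]
scaled = linear_transform(seq)
# The minimum (1) maps to 0, the maximum (7) maps to 1, and 4 maps to 0.5.
```

Choosing a different target interval only changes `t_min` and `t_max`; the relative spacing of the values is preserved.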

● Numerical standardization:

Numerical standardization is performed based on the mean and standard deviation of the linearly transformed data. For the linearly transformed data x'_ij, the normalized data value y_ij is computed as:

y_{ij} = \frac{x'_{ij} - \bar{x}_{j}}{s_{j}}

where \bar{x}_j is the mean of the j-th data sequence and s_j is the standard deviation of the j-th data sequence.

After numerical standardization, each data sequence has a mean of 0 and a variance of 1 and is dimensionless. The values in the sequence fluctuate around 0: a value greater than 0 indicates an above-average level, and a value less than 0 indicates a below-average level.

These two steps not only map the data onto a unified interval but also effectively eliminate the influence of outliers beyond the value range on the overall distribution of the data.
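As a rough illustration of the second step, the sketch below standardizes a sequence and can be used to confirm the stated property that the result has mean 0 and variance 1. It assumes that s_j is the standard deviation computed with divisor N (the population form); the source does not say which divisor is intended, and the function name `standardize` is hypothetical.

```python
import math


def standardize(seq):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    n = len(seq)
    mean = sum(seq) / n
    # Assumption: divisor N (population form); the source does not specify.
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        # Assumption: a constant sequence carries no information and maps to zeros.
        return [0.0] * n
    return [(x - mean) / std for x in seq]


# Values of the example sequence after the linear transformation onto [0, 1].
transformed = [(x - 1) / 6 for x in [1, 2, 1, 4, 3, 2, 5, 6, 2, 7]]
normalized = standardize(transformed)
new_mean = sum(normalized) / len(normalized)
new_variance = sum((y - new_mean) ** 2 for y in normalized) / len(normalized)
```

Dividing by the variance instead of the standard deviation would not give unit variance, which is why the standard deviation is assumed here.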

S104: Extract behavior feature vectors:

After data normalization, learning behavior features need to be extracted from the data. In the invention, the behavior features required for each student are divided into three parts: achievement features, effort features, and life-regularity features. Achievement features include the course type, number of credits, and grade of each of the student's courses. Effort features count how often the student enters study-related places, including the number of library entries, classroom check-ins, printing jobs, and book loans, and thereby describe the student's study effort and willingness to learn actively. The life-regularity feature is the student's life-regularity measure, which characterizes the regularity of the student's daily routine by analyzing the times at which the student swipes the card at different places.

In this embodiment, the life-regularity measure is computed as follows: first, from each student's data on visits to several preset places (typically the canteen, dormitory, and classroom), compute the student's access probability for each of these places within a predetermined time period; then compute the Shannon entropy from the access probabilities. This Shannon entropy is the student's life-regularity measure.

Shannon entropy expresses the average amount of information carried by a discrete variable and can be used to characterize life regularity. It is computed as:

H_{i}(z) = - \sum_{f=1}^{F} P_{if}(z) \log_{2} P_{if}(z)

where H_i(z) is the Shannon entropy of the i-th student, P_if(z) is the access probability of the i-th student for the f-th place, f=1,2,…,F, and F is the number of places.

For example, if a student's access probabilities for the canteen, dormitory, and classroom are 0.3, 0.3, and 0.4 respectively, the Shannon entropy is H₁(z)=1.571. If another student's access probabilities for the three places are 0.1, 0.6, and 0.2 respectively, then H₂(z)=1.24. The latter student's Shannon entropy is smaller, reflecting more regular behavior (a higher probability of entering and leaving the dormitory). For a probability distribution, when the probability is concentrated on a few values (that is, the variable takes one of a small number of values most of the time), the Shannon entropy is low; conversely, if the probability is spread fairly evenly over the possible values (so that it is almost impossible to tell which value the variable will take), the Shannon entropy is high. Hence, the more concentrated the times at which a student visits a place, the smaller the entropy and the stronger the life regularity.
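The two example values can be reproduced with a few lines of Python. This is a minimal sketch of the entropy formula only; the function name `shannon_entropy` is an assumption, and the probabilities are the ones given in the example above, used as stated.

```python
import math


def shannon_entropy(probs):
    """Base-2 Shannon entropy; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


h1 = shannon_entropy([0.3, 0.3, 0.4])  # about 1.571
h2 = shannon_entropy([0.1, 0.6, 0.2])  # about 1.239
```

The second student's entropy is lower because most of the probability mass sits on a single place, which is the sense in which a lower value indicates a more regular routine.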

The access probability for each place can be obtained by counting from the student data or by density estimation; the specific method can be chosen as needed. Given the large volume of student data in the invention, an access probability calculation method is proposed, whose procedure is as follows:

Divide the predetermined time period into subdivided time intervals. Extract the student's access times for each class of place from the student data, project them onto the subdivided time intervals, and count the number of accesses to each class of place in each subdivided time interval. Then use a density estimation function to estimate the access probability for that class of place in each subdivided time interval, and integrate to obtain the access probability for that class of place over the predetermined time period. The density estimation function can be chosen according to actual needs; the one used in this embodiment is:

p_{ifv}(z) = \frac{1}{\sqrt{2\pi} G_{if} h_{if}} \sum_{v=1}^{V} e^{- \frac{(z - z_{ifv})^{2}}{2 h_{if}^{2}}}

where p_ifv(z) is the access probability of the i-th student for the f-th place in the v-th subdivided time interval, v=1,2,…,V, and V is the number of subdivided time intervals; z_ifv is the number of times the i-th student accesses the f-th place in the v-th subdivided time interval; G_if is the total number of times the i-th student accesses the f-th place within the predetermined time period, i.e. the sum of z_ifv over the V subdivided time intervals; and h_if is the density estimation bandwidth for the i-th student's accesses to the f-th place, given by the empirical formula:

h_{if} = 1.06 \, \sigma_{if} \, G_{if}^{- \frac{1}{5}}

where σ_if is the standard deviation of the V access counts z_ifv.

Integrating the V values p_ifv(z) then gives the access probability P_if(z) of the i-th student for the f-th place within the predetermined time period.
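The density estimate and bandwidth rule can be sketched as follows. The code transcribes the two formulas above literally for one student and one place; it stops short of the final integration because the source does not state the integration bounds. The function names, the example counts, and the choice of the population standard deviation for σ_if are assumptions.

```python
import math
import statistics


def bandwidth(counts):
    """h = 1.06 * sigma * G**(-1/5), with G the total number of accesses."""
    g = sum(counts)
    sigma = statistics.pstdev(counts)  # assumption: population standard deviation
    if g == 0 or sigma == 0:
        raise ValueError("bandwidth is undefined for empty or constant counts")
    return 1.06 * sigma * g ** (-1 / 5)


def density(z, counts):
    """Gaussian kernel density at z, built from the per-interval access counts."""
    g = sum(counts)
    h = bandwidth(counts)
    norm = 1.0 / (math.sqrt(2 * math.pi) * g * h)
    return norm * sum(math.exp(-((z - zv) ** 2) / (2 * h * h)) for zv in counts)


# Invented access counts for six subdivided time intervals.
counts = [0, 2, 5, 3, 0, 1]
h = bandwidth(counts)
near = density(2.0, counts)   # close to several observed counts
far = density(10.0, counts)   # far from every observed count
```

The estimate is larger near the observed counts than far from them, which is the behavior expected of a kernel density estimate.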

S105: Dimensionality reduction of behavior feature data:

After the student features are extracted, the large number of feature items makes dimensionality reduction necessary. Dimensionality reduction lowers computational complexity and reduces the information loss caused by correlation, and is of great significance for feature extraction from massive data. There are many dimensionality reduction methods, and one can be selected according to actual needs. In this embodiment, a dimensionality reduction method is designed for the characteristics of the invention's application scenario: it converts many indicators into a few composite indicators, so that the reduced feature data carries more comprehensive information.

Fig. 2 is the flow chart of the dimensionality reduction of the behavior feature data. As shown in Fig. 2, the dimensionality reduction of the feature data comprises the following steps:

S201: Construct the behavior feature matrix:

Denote the behavior feature vector of the i-th student as B_i={b_i1,b_i2,…,b_iD}^T, where D is the number of feature items and the superscript T denotes transposition. The behavior feature data of all students form a behavior feature matrix U of size D×N, whose i-th column is B_i.

S202: Compute the covariance matrix:

Compute the covariance matrix C of the behavior feature matrix U.

S203: Compute the eigenvector matrix of the covariance matrix:

Compute the eigenvalues of the covariance matrix C and the corresponding eigenvectors, arrange the eigenvectors as the rows of a matrix in descending order of their eigenvalues, and take the first K rows to form the eigenvector matrix P. The value of K is set according to actual needs.

S204: Compute the dimension-reduced behavior feature matrix:

Compute the dimension-reduced behavior feature matrix Q=PU. The i-th column of Q is the dimension-reduced behavior feature vector B'_i of the i-th student.

Clearly the number of rows of Q is K. The larger K is in step S203, the better the resulting matrix Q reflects the behavior features, but the higher the complexity of subsequent computation. The value range of K is generally set to

Suppose the behavior feature matrix H constructed from the behavior feature vectors of 10 students is as follows:

H = \begin{bmatrix} 2.5 & 0.5 & 2.2 & 1.9 & 3.1 & 2.3 & 2 & 1 & 1.5 & 1.1 \\ 2.4 & 0.7 & 2.9 & 2.2 & 3 & 2.7 & 1.6 & 1.1 & 1.6 & 0.9 \end{bmatrix}

As can be seen, each student's behavior feature vector contains two feature items.

The covariance matrix C is:

C = \begin{bmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{bmatrix}

The eigenvalues λ of the covariance matrix C and the corresponding eigenvectors α are:

λ₁=0.0490833989, α₁=[-0.735178656, 0.677873399]

λ₂=1.28402771, α₂=[-0.677873399, -0.735178656]

The eigenvector corresponding to the largest eigenvalue, λ₂, is then selected to form the eigenvector matrix, so P=[-0.677873399, -0.735178656]. Projecting the feature matrix, with the mean of each row subtracted, gives the dimension-reduced behavior feature matrix Q=PU, that is:

Q=[-0.8280, 1.7776, -0.9922, -0.2742, -1.6758, -0.9129, 0.0991, 1.1446, 0.4380, 1.2238]

Each value in Q is rounded to four decimal places.
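The worked example can be reproduced approximately with the pure-Python sketch below. Two points are assumptions rather than statements from the source: the covariance uses the N-1 divisor and mean-centered rows (this is what reproduces the values of C and Q above), and the dominant eigenvector is found by power iteration, a technique swapped in here to avoid external libraries. Function names are hypothetical.

```python
import math


def covariance(rows):
    """Return (C, centred) for a D x N matrix given as D rows (divisor N-1)."""
    n = len(rows[0])
    centred = [[x - sum(r) / n for x in r] for r in rows]
    cov = [[sum(a * b for a, b in zip(p, q)) / (n - 1) for q in centred]
           for p in centred]
    return cov, centred


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[p][q] * v[q] for q in range(d)) for p in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[p][q] * v[q] for q in range(d)) for p in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


H = [[2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1],
     [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]]
C, centred = covariance(H)
eigenvalue, vec = dominant_eigenpair(C)  # eigenvalue is about 1.2840
# The sign of an eigenvector is arbitrary; it is flipped here so that P
# matches the sign convention of the worked example.
P = [-x for x in vec]
# K = 1: project each student's centred feature vector onto P.
Q = [sum(P[p] * centred[p][i] for p in range(len(P))) for i in range(len(H[0]))]
```

For K greater than 1, further eigenvectors would be needed (for example by deflation or a library eigensolver); the sketch covers only the K=1 case of the example.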

S106: Student ranking prediction:

Steps S101 to S105 extract the behavior feature vector of each student from the massive student data, and ranking prediction can then be carried out using these behavior feature vectors. The ranking prediction method in the invention is as follows:

For the i-th student, subtract the behavior feature vector of each of the other students from that student's behavior feature vector to obtain N-1 difference feature vectors. Input the difference feature vectors into a pre-trained classifier to obtain the corresponding N-1 labels, each with value 1 or -1. Sum all of the student's label values to obtain the student's score, then sort the scores of all students to obtain each student's predicted ranking.

The classifier is trained on the data of students who have historical rankings. The training method is: for students with historical rankings, collect their data, obtain their behavior feature vectors by the method of steps S101 to S105, and then compute the difference feature vector between each pair of students. For a given difference feature vector, if the student whose feature vector is the minuend (the vector being subtracted from) ranks higher, the label of the difference feature vector is 1; otherwise it is -1. The classifier is trained with these difference feature vectors as input and the corresponding labels as output.

As described above, the invention uses pairwise comparison to characterize the difference between two people: the behavior feature vectors of any two people are subtracted element by element to form a new feature vector. For example, if student A ranks 5th with behavior feature vector A=(3,2,5,7,9,6,8,1,4,7)^T and student B ranks 12th with behavior feature vector B=(5,9,8,6,7,1,3,4,7,6)^T, then the difference feature vector is A-B=(-2,-7,-3,1,2,5,5,-3,-3,1)^T.
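The construction of training samples from ranked students can be sketched in Python as below, using the vectors of students A and B from the example above. The function name `build_training_pairs` is hypothetical, and it is an assumption that a smaller rank number means a higher rank (rank 5 is ahead of rank 12). The classifier itself is not shown, since the source does not name one.

```python
from itertools import combinations


def build_training_pairs(features, ranks):
    """Build one (difference vector, label) sample per unordered pair of students.

    Label is 1 when the minuend student ranks higher (smaller rank number),
    otherwise -1, following the labelling rule in the text.
    """
    samples, labels = [], []
    for a, b in combinations(range(len(features)), 2):
        diff = [x - y for x, y in zip(features[a], features[b])]
        samples.append(diff)
        labels.append(1 if ranks[a] < ranks[b] else -1)
    return samples, labels


A = [3, 2, 5, 7, 9, 6, 8, 1, 4, 7]   # ranked 5th
B = [5, 9, 8, 6, 7, 1, 3, 4, 7, 6]   # ranked 12th
samples, labels = build_training_pairs([A, B], [5, 12])
# samples[0] is A - B and labels[0] is 1, because A ranks ahead of B.
```

With W ranked students this yields W(W-1)/2 samples, one per unordered pair.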

Assume there are W students in the training sample. One difference feature vector is computed for each pair of students, giving W(W-1)/2 difference feature vectors, so the classifier has W(W-1)/2 training samples. Because there are only two classes of label (1 and -1), the prediction is exactly this label. In other words, the invention converts ranking prediction into first predicting the relative ranking order of each pair of students and then converting these relative orders into an actual ranking. The ranking prediction problem is thus transformed into a learning-to-rank problem, which effectively solves the student ranking prediction problem. The higher student A ranks, the more often 1 appears among the labels produced by comparing A with the others and the less often -1 appears. Summing the labels produced by comparing student A with the other students therefore gives a score, and sorting the scores of all students gives the predicted ranking of student A. For example, if the label set obtained by comparing student A with the other students is (1,-1,-1,1,1,1,-1,1,-1,-1,1) and the label set for student B is (-1,1,-1,-1,1,1,-1,-1,-1,1,1), then student A's score is 1 and student B's score is -1, so student A ranks ahead of student B.
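The scoring and sorting step can be illustrated with the sketch below. The trained classifier is replaced by a stand-in, `toy_classifier`, which is purely hypothetical and exists only so the example runs; the feature values are invented. The last lines check the label sums from the example above.

```python
def predict_ranking(features, classify):
    """Score each student by summing pairwise labels, then rank by score."""
    n = len(features)
    scores = []
    for i in range(n):
        total = 0
        for k in range(n):
            if k == i:
                continue
            diff = [x - y for x, y in zip(features[i], features[k])]
            total += classify(diff)  # classify returns 1 or -1
        scores.append(total)
    # Higher score means a better (numerically smaller) predicted rank.
    order = sorted(range(n), key=lambda idx: scores[idx], reverse=True)
    ranks = [0] * n
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return scores, ranks


def toy_classifier(diff):
    """Hypothetical stand-in for the trained classifier."""
    return 1 if sum(diff) > 0 else -1


features = [[0.9, 0.8], [0.2, 0.1], [0.5, 0.6]]  # invented feature vectors
scores, ranks = predict_ranking(features, toy_classifier)

# Label sums from the example in the text.
score_a = sum([1, -1, -1, 1, 1, 1, -1, 1, -1, -1, 1])
score_b = sum([-1, 1, -1, -1, 1, 1, -1, -1, -1, 1, 1])
```

Ties in score are broken here by input order, which is a side effect of Python's stable sort; the source does not say how ties should be handled.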

Although illustrative embodiments of the invention are described above so that those skilled in the art can understand the invention, it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations that make use of the inventive concept are protected.

Claims

1. A student ranking prediction method based on campus data, characterized by comprising the following steps:

S2: perform data cleaning on the collected student data;

Denote the j-th non-temporal data item of the i-th student as x_ij, where i=1,2,…,N, N is the number of students, j=1,2,…,M, and M is the number of data items; for each value x_ij, compute the linear transformation value x'_ij as:

x'_{ij} = \frac{x_{ij} - \min_{j}}{\max_{j} - \min_{j}} (T_{j\_max} - T_{j\_min}) + T_{j\_min}

y_{ij} = \frac{x'_{ij} - \bar{x}_{j}}{s_{j}}

S6: for the i-th student, subtract the behavior feature vector of each of the other students from that student's dimension-reduced behavior feature vector to obtain N-1 difference feature vectors, input the difference feature vectors into a pre-trained classifier to obtain the corresponding N-1 labels, each with value 1 or -1, sum all of the student's label values to obtain the student's score, and sort the scores of all students to obtain each student's predicted ranking;

2. The student ranking prediction method according to claim 1, characterized in that the life-regularity measure in step S4 is computed as follows: from each student's data on visits to several preset places, compute the student's access probability for each of these places within a predetermined time period, then compute the Shannon entropy from the access probabilities; this Shannon entropy is the student's life-regularity measure.

3. The student ranking prediction method according to claim 2, characterized in that the access probability is computed as follows:

Divide the predetermined time period into subdivided time intervals, extract the student's access times for each class of place from the student data, project them onto the subdivided time intervals, count the number of accesses to each class of place in each subdivided time interval, use a density estimation function to estimate the access probability for that class of place in each subdivided time interval, and then integrate to obtain the access probability for that class of place over the predetermined time period.

4. The student ranking prediction method according to claim 1, characterized in that the behavior feature vectors in step S5 are reduced in dimension as follows:

S5.1: denote the behavior feature vector of the i-th student as B_i={b_i1,b_i2,…,b_iD}^T, where D is the number of feature items, and form the behavior feature data of all students into a behavior feature matrix U of size D×N;

S5.2: compute the covariance matrix C of the behavior feature matrix U;

S5.3: compute the eigenvalues of the covariance matrix C and the corresponding eigenvectors, arrange the eigenvectors as the rows of a matrix in descending order of their eigenvalues, and take the first K rows to form the eigenvector matrix P, where the value of K is set according to actual needs;

S5.4: compute the dimension-reduced behavior feature matrix Q=PU, where the i-th column of Q is the dimension-reduced behavior feature vector B'_i of the i-th student.

5. The student ranking prediction method according to claim 4, characterized in that the value range of the parameter K is