CN105631536A - Massive open online course (MOOC) quitting prediction algorithm based on semi-supervised learning - Google Patents

Publication number
CN105631536A
Authority
CN
China
Prior art keywords
sample
behavior characteristics
max
samples
unmarked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510967503.0A
Other languages
Chinese (zh)
Inventor
江峰
李文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Technology and Business Institute
Original Assignee
Chongqing Technology and Business Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Technology and Business Institute
Priority to CN201510967503.0A
Publication of CN105631536A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Abstract

The invention relates to a massive open online course (MOOC) quitting (dropout) prediction algorithm based on semi-supervised learning. First, users' learning log files are acquired from a MOOC website; one part of the acquired users forms a test sample set and the other part forms a training sample set. Second, from the users' learning log files, statistics are compiled on the behavior features of all samples in the training sample set to obtain the n behavior features that best express the common characteristics of those samples. Third, based on the n behavior features, a semi-supervised learning method is used to obtain R classifiers. Fourth, the test sample set is used to measure the labeling accuracy of the R classifiers, and the classifier with the highest labeling accuracy is selected. Finally, the behavior features of any unlabeled user are input to the selected classifier, and the user is labeled. The algorithm needs only a small number of labeled samples, which reduces the manpower and material cost of labeling samples, saves prediction cost, and also improves prediction accuracy.

Description

Dropout prediction algorithm for massive open online courses (MOOCs) based on semi-supervised learning
Technical field
The present invention relates to computer and information technology, and in particular to a dropout prediction algorithm for massive open online courses (MOOCs) based on semi-supervised learning.
Background art
The maturing of technologies such as Web 2.0 and cloud computing has opened new opportunities for the informatization of education, and the massive open online course (MOOC) is a product of Internet application innovation. With the rise of MOOC websites such as edX, Coursera and Udacity, and with universities such as MIT and Stanford successively offering courses on MOOC platforms, MOOCs have received growing attention and recognition. Relying on the Internet, a MOOC offers large numbers of students educational experiences such as answering questions, taking examinations and watching videos, and lets students learn collaboratively through forms such as online forums. The open nature of MOOCs also gives students of different learning backgrounds the opportunity to study.
Although MOOCs have unique advantages over traditional education, MOOC learners are a highly diverse population. This diversity shows mainly in educational background and learning motivation: some students, for example, register for a course only to pick up a few specific knowledge points. Because the cost of leaving a MOOC course is low, the dropout rate among learners is very high. Many educators have pointed out that a high dropout rate is a common phenomenon in MOOCs and that, if countermeasures are not taken in time, it will constrain the development of MOOC platforms.
Analyzing the factors that lead students to drop out can not only help improve the construction of MOOC platforms but also raise student retention, thereby ensuring that courses proceed in an orderly manner. Building a model that predicts students' dropout behavior can therefore help MOOCs achieve better teaching results. Dropout prediction has both short-term and long-term value. Short-term value: by judging whether a user will drop out, teachers or the system can intervene with users who are likely to drop out and reduce the likelihood that they do. Long-term value: by analyzing the relationship between course characteristics and dropout rate, courses with lower dropout rates can be designed and the quality of MOOC courses improved.
Existing prediction algorithms are mainly of two kinds. The first tracks certain student behaviors, such as assignment behavior, video-watching behavior and other resource-access behavior, counts how often these behaviors occur, and from the counts predicts whether a student will drop out or be retained. This kind of algorithm has the following defects. First, it uses supervised learning and trains a model on a large labeled sample set, but labels are costly to obtain: the number of samples is large, labeling takes a great deal of manpower and time, and labeling has to be done by professionals. Second, the features it uses are generic features that cannot accurately characterize dropout students, so prediction accuracy is low. The second kind of method computes the final dropout rate of a course from its weekly dropout rates. Although this method can predict the dropout rate of a course, it cannot make a judgment about a specific student or user; that is, it cannot tell which students or users will drop out.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a MOOC dropout prediction algorithm that can accurately judge whether a given user will drop out or be retained.
To achieve this object, the present invention adopts the following technical solution: a MOOC dropout prediction algorithm based on semi-supervised learning, comprising the following steps:
S1: Obtain users' learning log files from a MOOC website. One part of the obtained users forms the test sample set and the other part forms the training sample set. All test samples in the test sample set are labeled samples; the training sample set contains both unlabeled samples and labeled samples. All unlabeled samples form the unlabeled sample set, and all labeled samples form the labeled sample set;
S2: From the users' learning log files, compile statistics on the behavior features of all samples in the training sample set, obtaining the n kinds of behavior features that best express the common characteristics of all samples in the training sample set;
Let the duration of a given course be K weeks;
Let $U_i = \{U(i,1), \ldots, U(i,j), \ldots, U(i,n)\}$, where $U_i$ denotes the i-th sample in the training sample set; $U(i,j) = (U(i,j)_1, \ldots, U(i,j)_k, \ldots, U(i,j)_K)$ denotes the vector of the j-th behavior feature of the i-th sample in the training sample set, and $U(i,j)_k$ denotes the number of times the j-th behavior of the i-th user occurs in the k-th week of the course;
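The feature vectors U(i, j) defined in S2 can be sketched as weekly counts built from log rows. This is a minimal sketch, not the patent's implementation: the `(user_id, behavior, week)` row layout, the function name `weekly_counts` and the behavior names are assumptions made for illustration.

```python
from collections import defaultdict

def weekly_counts(log_rows, behaviors, num_weeks):
    """Build U(i, j): for each user i and behavior j, a list of K weekly counts.

    log_rows: iterable of (user_id, behavior, week) tuples, week in 1..num_weeks
              (hypothetical log layout).
    Returns {user_id: {behavior: [count_week_1, ..., count_week_K]}}.
    """
    features = defaultdict(lambda: {b: [0] * num_weeks for b in behaviors})
    for user_id, behavior, week in log_rows:
        # Ignore rows for behaviors or weeks outside the tracked range.
        if behavior in behaviors and 1 <= week <= num_weeks:
            features[user_id][behavior][week - 1] += 1
    return dict(features)

rows = [("u1", "video", 1), ("u1", "video", 1), ("u1", "problem", 2), ("u2", "video", 3)]
U = weekly_counts(rows, behaviors=["problem", "video"], num_weeks=3)
print(U["u1"]["video"])  # weekly video counts for user u1
```

Each inner list plays the role of one vector U(i, j) with K entries.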
S3: Randomly select m kinds of behavior features from the n kinds of behavior features, and obtain R classifiers in the following manner, where $m \le n$, $R = C_n^m = \dfrac{n!}{m!\,(n-m)!}$ and $r = 1, 2, 3, \ldots, R$;
The R classifiers are obtained as follows:
S301: Let r = 1;
S302: Let j = 1;
S303: Let v = 1;
S304: Let $P_{rj}(C \mid U(i,j))$ be the probability that the i-th sample in the training sample set is labeled C under the j-th behavior feature, where a sample labeled C = 0 is a retained user and a sample labeled C = 1 is a dropout user;
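The count R in S3 is the number of m-element subsets of the n behavior features, one candidate classifier per subset. A small sketch with hypothetical feature names (they are not taken from the patent) checks this with the standard library:

```python
from itertools import combinations
from math import comb

# Hypothetical names for n = 4 behavior features.
feature_names = ["problem", "video", "resource", "page"]
m = 2

# One candidate classifier per m-element subset of the features.
subsets = list(combinations(feature_names, m))
R = comb(len(feature_names), m)  # n! / (m! (n - m)!)

print(R, subsets)
```

With n = 4 and m = 2 this enumerates the pairwise feature combinations.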
S305: From the unlabeled sample set, take all unlabeled samples under the j-th behavior feature; they form the set $U_j$. For each unlabeled sample in $U_j$, compute $P_{rj}(C=0 \mid U(v,j))$ and $P_{rj}(C=1 \mid U(v,j))$, where $v = 1, 2, \ldots, |U_j|$ and $|U_j|$ denotes the number of samples in $U_j$;
$$P_{rj}(C=0 \mid U(v,j)) = \frac{P_{rj}(U(v,j) \mid C=0) \cdot P_{rj}(C=0)}{P_{rj}(U(v,j))} \tag{1}$$
$$P_{rj}(C=0) = \frac{|L_{j,C=0}|}{|U_j| + |L_j|} \tag{1a}$$
where $|L_{j,C=0}|$ denotes the number of samples labeled C = 0 in the labeled sample set under the j-th behavior feature, $L_j$ denotes the set of all labeled samples under the j-th behavior feature, $|L_j|$ denotes the number of samples in $L_j$, and $|U_j| + |L_j|$ denotes the number of samples in the training sample set under the j-th behavior feature;
$$P_{rj}(U(v,j) \mid C=0) = P_{rj}(U(v,j)_1 \mid C=0)\, P_{rj}(U(v,j)_2 \mid C=0) \cdots P_{rj}(U(v,j)_k \mid C=0) \cdots P_{rj}(U(v,j)_K \mid C=0) \tag{1b}$$
$$P_{rj}(U(v,j)_k \mid C=0) = \frac{|L_{j,C=0}(U(v,j)_k)|}{|L_{j,C=0}|} \tag{1b-1}$$
where $|L_{j,C=0}|$ denotes the number of samples labeled C = 0 in the labeled sample set under the j-th behavior feature, and $|L_{j,C=0}(U(v,j)_k)|$ denotes, among those samples, the number for which the j-th behavior occurs exactly $U(v,j)_k$ times in the k-th week of the course;
$$P_{rj}(C=1 \mid U(v,j)) = \frac{P_{rj}(U(v,j) \mid C=1) \cdot P_{rj}(C=1)}{P_{rj}(U(v,j))} \tag{2}$$
$$P_{rj}(C=1) = \frac{|L_{j,C=1}|}{|U_j| + |L_j|} \tag{2a}$$
where $|L_{j,C=1}|$ denotes the number of samples labeled C = 1 in the labeled sample set under the j-th behavior feature;
$$P_{rj}(U(v,j) \mid C=1) = P_{rj}(U(v,j)_1 \mid C=1)\, P_{rj}(U(v,j)_2 \mid C=1) \cdots P_{rj}(U(v,j)_k \mid C=1) \cdots P_{rj}(U(v,j)_K \mid C=1) \tag{2b}$$
$$P_{rj}(U(v,j)_k \mid C=1) = \frac{|L_{j,C=1}(U(v,j)_k)|}{|L_{j,C=1}|} \tag{2b-1}$$
where $|L_{j,C=1}|$ denotes the number of samples labeled C = 1 in the labeled sample set under the j-th behavior feature, and $|L_{j,C=1}(U(v,j)_k)|$ denotes, among those samples, the number for which the j-th behavior occurs exactly $U(v,j)_k$ times in the k-th week of the course;
$$P_{rj}(U(v,j)) = P_{rj}(U(v,j) \mid C=0)\, P_{rj}(C=0) + P_{rj}(U(v,j) \mid C=1)\, P_{rj}(C=1) \tag{3}$$
Output $P_{rj}(C=0 \mid U(v,j))$ and $P_{rj}(C=1 \mid U(v,j))$;
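Formulas (1) to (3) amount to a naive Bayes posterior estimated from counts over the labeled set. The following is a minimal sketch under stated assumptions: the function name `class_posteriors` and the data layout are illustrative, the labeled set is assumed non-empty, and returning zeros when no labeled sample matches any weekly count is a choice made here because the text does not specify that case.

```python
def class_posteriors(x, labeled, num_unlabeled):
    """Return (P(C=0 | x), P(C=1 | x)) following formulas (1)-(3).

    x: weekly count vector U(v, j) of one unlabeled sample.
    labeled: list of (vector, label) pairs, label 0 (retained) or 1 (dropout).
    num_unlabeled: |U_j|, used in the prior denominators (1a)/(2a).
    """
    total = num_unlabeled + len(labeled)              # |U_j| + |L_j|
    joint = []
    for c in (0, 1):
        members = [vec for vec, label in labeled if label == c]
        prior = len(members) / total                  # (1a) / (2a)
        likelihood = 1.0
        for k, count in enumerate(x):                 # product over weeks, (1b) / (2b)
            same = sum(1 for vec in members if vec[k] == count)
            # (1b-1) / (2b-1): fraction of class members with the same week-k count
            likelihood *= (same / len(members)) if members else 0.0
        joint.append(likelihood * prior)
    evidence = joint[0] + joint[1]                    # (3)
    if evidence == 0.0:
        return 0.0, 0.0  # assumption: unseen counts yield zero posteriors
    return joint[0] / evidence, joint[1] / evidence

seed = [([1, 0], 0), ([1, 1], 0), ([0, 0], 1)]
p0, p1 = class_posteriors([1, 0], seed, num_unlabeled=1)
print(p0, p1)
```

Because the priors in (1a) and (2a) divide by the whole training set rather than by the labeled set only, they need not sum to one; the normalization in (3) still makes the two posteriors sum to one whenever the evidence is non-zero.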
S306: Let v = v + 1;
S307: When $v > |U_j|$, go to the next step; otherwise return to step S304;
S308: Let $\max\{P_{rj}(C=0 \mid U(v,j))\} = \max\{P_{rj}(C=0 \mid U(v,j)),\ v = 1, 2, 3, \ldots, |U_j|\}$. Remove the unlabeled sample corresponding to $\max\{P_{rj}(C=0 \mid U(v,j))\}$ from the set $U_j$, move it into the set $L_j$, and label it C = 0;
Let $\max\{P_{rj}(C=1 \mid U(v,j))\} = \max\{P_{rj}(C=1 \mid U(v,j)),\ v = 1, 2, 3, \ldots, |U_j|\}$. Remove the unlabeled sample corresponding to $\max\{P_{rj}(C=1 \mid U(v,j))\}$ from the set $U_j$, move it into the set $L_j$, and label it C = 1;
S309: Update the set $U_j$ of all unlabeled samples under the j-th behavior feature and the set $L_j$ of all labeled samples under the j-th behavior feature, letting $|U_j| = |U_j| - 2$ and $|L_j| = |L_j| + 2$;
S310: When $|U_j| \ge 2$, return to step S303; otherwise go to the next step;
S311: Let j = j + 1;
S312: When j > m, output the current labeled sample set and go to the next step; otherwise return to step S303;
S313: Let r = r + 1;
S314: When r > R, go to the next step; otherwise return to step S302;
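The inner loop S305 to S310 is a self-training loop: in each round the unlabeled sample most confidently scored C = 0 and the one most confidently scored C = 1 are moved into the labeled set. A hedged sketch for a single behavior feature follows. The function `self_train`, the pluggable `posterior` callable and the toy scoring rule are assumptions for illustration, and picking the C = 1 sample among the remaining samples (so that two distinct samples are moved per round) is a choice made here, since the text does not say what happens if one sample attains both maxima.

```python
def self_train(labeled, unlabeled, posterior):
    """Label unlabeled vectors two at a time, as in S305-S310.

    labeled: list of (vector, label); unlabeled: list of vectors.
    posterior(x, labeled, num_unlabeled) -> (p0, p1) for one vector.
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while len(unlabeled) >= 2:                                   # S310
        scores = [posterior(x, labeled, len(unlabeled)) for x in unlabeled]  # S305-S307
        best0 = max(range(len(unlabeled)), key=lambda v: scores[v][0])       # S308, C=0
        # Assumption: the C=1 pick is made among the other samples.
        rest = [v for v in range(len(unlabeled)) if v != best0]
        best1 = max(rest, key=lambda v: scores[v][1])                        # S308, C=1
        labeled.append((unlabeled[best0], 0))
        labeled.append((unlabeled[best1], 1))
        # S309: |U_j| shrinks by 2, |L_j| grows by 2.
        unlabeled = [x for v, x in enumerate(unlabeled) if v not in (best0, best1)]
    return labeled, unlabeled

def toy_posterior(x, labeled, num_unlabeled):
    """Hypothetical stand-in scorer: less activity means more likely dropout."""
    p1 = 1.0 / (1.0 + sum(x))
    return 1.0 - p1, p1

final_labeled, leftover = self_train(
    labeled=[([5, 4], 0), ([0, 0], 1)],
    unlabeled=[[6, 6], [0, 1], [3, 3]],
    posterior=toy_posterior,
)
print(final_labeled, leftover)
```

In the full procedure this loop would run once per selected behavior feature (S311, S312) and once per feature subset (S313, S314).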
S4: Select the best classifier
S401: Take the test sample set obtained in step S1; it contains H test samples, h = 1, 2, ..., H;
S402: Let r = 1;
S403: Let h = 1;
S404: For the h-th test sample, compute $P_h(C=0 \mid U(v,j))$ according to formula (4):
$$P_h(C=0 \mid U(v,j)) = \sum_{j=1}^{m} P_{rj}(C=0 \mid U(v,j)) \tag{4}$$
Compute $P_h(C=1 \mid U(v,j))$ according to formula (5):
$$P_h(C=1 \mid U(v,j)) = \sum_{j=1}^{m} P_{rj}(C=1 \mid U(v,j)) \tag{5}$$
S405: If $P_h(C=0 \mid U(v,j)) \ge P_h(C=1 \mid U(v,j))$, label the h-th test sample C = 0; otherwise label it C = 1. Output the labeled h-th test sample;
S406: Let h = h + 1;
S407: If h > H, go to the next step; otherwise return to step S404;
S408: Compute the accuracy $\eta_r$ of the r-th classifier, $\eta_r = S'/S$, where S = H denotes the number of labelings made with the r-th classifier and S' denotes the number of those labelings that are correct;
S409: Let r = r + 1;
S410: If r > R, go to the next step; otherwise return to step S403;
S411: Let $\max\{\eta_r\} = \max\{\eta_r,\ r = 1, 2, 3, \ldots, R\}$. The classifier corresponding to $\max\{\eta_r\}$ is the classifier with the highest labeling accuracy; output this classifier and denote it $r_{\max}$;
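Step S4 can be sketched as follows: each candidate classifier yields one posterior pair per behavior feature, the pairs are summed as in formulas (4) and (5), and the classifier with the highest fraction of correct labels on the test set is kept. The helper names and the toy classifiers below are assumptions for illustration, not part of the patent.

```python
def label_by_sum(posteriors):
    """posteriors: list of (p0, p1), one pair per behavior feature.

    Sums the pairs as in formulas (4) and (5); ties go to C = 0 as in S405.
    """
    total0 = sum(p0 for p0, _ in posteriors)
    total1 = sum(p1 for _, p1 in posteriors)
    return 0 if total0 >= total1 else 1

def accuracy(classifier, test_set):
    """Fraction of test samples labeled correctly (S'/S with S = H)."""
    correct = sum(1 for sample, truth in test_set
                  if label_by_sum(classifier(sample)) == truth)
    return correct / len(test_set)

def select_best(classifiers, test_set):
    """Return the index of the most accurate classifier and all accuracies (S411)."""
    scores = [accuracy(c, test_set) for c in classifiers]
    best = max(range(len(classifiers)), key=lambda r: scores[r])
    return best, scores

# Hypothetical classifiers: each maps a sample to per-feature posterior pairs.
clf_a = lambda s: [(0.9, 0.1)] if s > 2 else [(0.2, 0.8)]
clf_b = lambda s: [(0.6, 0.4)]
test_set = [(5, 0), (1, 1), (0, 1), (4, 0)]

best, scores = select_best([clf_a, clf_b], test_set)
print(best, scores)
```

The summed values act as scores rather than probabilities, since they are not renormalized over the m features.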
S5: For any unlabeled user $U_x$, obtain the user's n kinds of behavior features from the user's learning log file, take the classifier output by step S411, and compute $P_{U_x}(C=0 \mid U(1,j))$ according to formula (6):
$$P_{U_x}(C=0 \mid U(1,j)) = \sum_{j=1}^{m} P_{r_{\max} j}(C=0 \mid U(1,j)) \tag{6}$$
Compute $P_{U_x}(C=1 \mid U(1,j))$ according to formula (7):
$$P_{U_x}(C=1 \mid U(1,j)) = \sum_{j=1}^{m} P_{r_{\max} j}(C=1 \mid U(1,j)) \tag{7}$$
If $P_{U_x}(C=0 \mid U(1,j)) \ge P_{U_x}(C=1 \mid U(1,j))$, label user $U_x$ as C = 0; otherwise label the user as C = 1.
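As an illustration of the decision rule in formulas (6) and (7), consider made-up numbers that are not taken from the patent: a selected classifier with m = 2 behavior features whose per-feature posteriors for a user are assumed to be (0.7, 0.3) and (0.4, 0.6).

```latex
\begin{align*}
P_{U_x}(C=0 \mid U(1,j)) &= 0.7 + 0.4 = 1.1,\\
P_{U_x}(C=1 \mid U(1,j)) &= 0.3 + 0.6 = 0.9.
\end{align*}
```

Since 1.1 is at least 0.9, this user would be labeled C = 0 (retained) under these assumed values.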
Compared with the prior art, the present invention has the following advantages:
1. The invention is a prediction algorithm based on semi-supervised learning, which is embodied mainly in how the R classifiers are obtained. Semi-supervised learning needs only a small number of labeled samples, which avoids the large amount of manpower and material resources otherwise spent on labeling samples; this not only saves prediction cost but also improves prediction accuracy.
2. Behavior features obtained by counting the number of activities per week characterize MOOC users well.
3. Using multiple behavior features embodies the idea of ensemble learning and improves prediction accuracy.
4. Semi-supervised learning can use the labeled sample set and the unlabeled sample set at the same time, which makes it better suited to practical deployment.
5. Performing semi-supervised learning on several features reduces the accumulated error of sample labeling, allows unlabeled samples to be predicted better, and improves prediction accuracy. That is, R classifiers are trained at the same time, each on m kinds of behavior features, and the best one is selected.
Brief description of the drawings
Fig. 1 compares BSP with other methods (F-measure).
Detailed description
The step numbers are to be read as follows: S1, S2, S3 and S4 denote step S1, step S2, step S3 and step S4 respectively; S301 denotes sub-step 01 of step S3, S302 denotes sub-step 02 of step S3, and so on; S401 denotes sub-step 01 of step S4, S402 denotes sub-step 02 of step S4, and so on.
The present invention is described in further detail below.
The MOOC dropout prediction algorithm based on semi-supervised learning comprises steps S1 to S5 exactly as set out in the Summary of the invention above, from the construction of the sample sets (S1) to the labeling of an unlabeled user (S5), with the following additions:
In step S3, the R classifiers are obtained by the method based on semi-supervised learning (BSP).
In step S310, |Uj| = 1 or 0 means that the labeling of the unlabeled samples in the set Uj, formed by all unlabeled samples under the j-th behavior feature, is complete.
In step S408, because the labels of the test samples are known, the number of correct labelings is also known.
Comparative experiments:
Data set
The experiments use the public data set provided by XueTangX for KDD Cup 2015. The data set contains students' behavior records for 39 courses, together with whether each student dropped out. We split the data set into 39 subsets by course, each containing the learning records of the students taking that course. One course, C, was drawn at random; it contains the learning records of 2392 users, of whom 546 are retained students and 1846 are dropout students. The purpose of the experiments is to train classifiers that judge whether a student will drop out of the course.
Evaluation criteria
With one data subset per course, we have 39 data subsets, and the final performance of the algorithm is the mean of its prediction performance over these 39 subsets. On each subset we estimate performance by ten-fold cross-validation; the estimates for all 39 subsets are then averaged to give the final performance. To assess predictive ability, we use precision, recall and F-measure as evaluation criteria. Comparing the prediction with whether a student actually dropped out gives four categories of students:
True positive (TP): predicted to drop out, and actually dropped out
True negative (TN): predicted to be retained, and actually did not drop out
False positive (FP): predicted to drop out, but did not drop out
False negative (FN): predicted to be retained, but actually dropped out
From these, precision, recall and F-measure are computed as follows:
$$\mathrm{Precision} = \frac{\mathit{truePositives}}{\mathit{truePositives} + \mathit{falsePositives}}$$
$$\mathrm{Recall} = \frac{\mathit{truePositives}}{\mathit{truePositives} + \mathit{falseNegatives}}$$
$$F\text{-}\mathrm{measure} = \frac{2\,(\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}$$
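The three criteria above can be computed directly from the confusion counts. The counts in this sketch are made up for illustration and are not experimental results from the patent.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion counts.

    tp: dropouts predicted as dropouts; fp: retained students predicted as
    dropouts; fn: dropouts predicted as retained.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_measure

# Illustrative counts only.
p, r, f = precision_recall_f(tp=80, fp=20, fn=20)
print(p, r, f)
```

True negatives do not enter any of the three formulas, which is why the function does not take them as an argument.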
Behavior feature extraction:
From the users' learning log files, statistics are compiled on the behavior features of all samples in the training sample set, yielding 6 kinds of behavior features that best express the common characteristics of all samples in the training sample set. Each behavior feature is the number of times the user performed the behavior in each week of the course. The behaviors are described in Table 1 below:
Table 1
The performance comparison of various actions feature
We move back class as a kind of classification problem using prediction student, namely according to the learning behavior record of user, user are divided into two classes: retain user and move back class user. In order to analyze the impact on prediction effect of all kinds of learning behavior, we extract six kinds for feature. Each behavior characteristics, in conjunction with basic sorting technique such as naive Bayesian, decision tree and support vector machine, obtains moving back class based on the student of a certain behavior characteristics and predicts the outcome. Using the method that feature is assembled with Feature Fusion to combine with conventional sorting methods simultaneously as with reference to us, obtain the prediction algorithm of reference, feature is assembled: only considers behavior kind, and is left out the time; Feature Fusion: only consider the time, and be left out behavior kind. The prediction effect of each category feature is as shown in the table:
The prediction effect of various features under table 2 naive Bayesian
Table 3: the prediction effect of various features under decision tree
Table 4: the prediction effect of various features under support vector machine
From table 2-table 4 it appeared that under decision Tree algorithms the performance of various features better, simultaneously except support vector machine, the feature of single kind is better than the feature that traditional characteristic extracting method obtains. This illustrates whether user is moved back class and be respectively provided with certain predictive ability by the behavior characteristics that the feature extracting method that the present invention proposes obtains, and traditional feature extracting method or the kind of neglect, ignore time concept, cause in most cases lower than the prediction effect of single features.
In practice, labeling whether a student has dropped out is costly, so labels are expensive to obtain. The present invention therefore proposes a student dropout prediction algorithm based on behavior features and semi-supervised learning.
Among the six behavior features obtained by the method of the invention, the features based on problems (feature #1), video viewing (feature #2), viewing of other resources (feature #3), and page viewing (feature #4) perform well, whereas the features based on wiki viewing (feature #5) and problem discussion (feature #6) give lower classification accuracy than the others. The comparison experiments below therefore use only the first four behavior features.
We combine these four features in pairs and then obtain BSP with the two-view co-training method. BSP takes 10% of the samples as the labeled sample set, 70% as the unlabeled sample set, and the remaining 20% as the test sample set. Supervised learning methods serve as the reference methods. Since the experiments above show that the decision tree has good predictive performance, the decision tree is used as the baseline classifier. The prediction method that uses the aggregated feature with the decision tree is denoted Aggregation, the method that uses the fused feature with the decision tree is denoted Merging, and the method that uses feature #4 with the decision tree is denoted SVF. The resulting prediction performance is shown in Table 5 and Fig. 1.
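As a rough illustration of this experimental setup, the sketch below enumerates the pairwise feature combinations and performs a 10% / 70% / 20% split. The function name `split_samples`, the use of integer percentages, and the fixed random seed are assumptions made for the example; the patent does not state how the split was drawn.

```python
import random
from itertools import combinations


def split_samples(sample_ids, labeled_pct=10, unlabeled_pct=70, seed=0):
    """Shuffle the samples and split them into labeled, unlabeled and test sets.

    Integer percentages avoid floating-point rounding in the set sizes.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = len(ids) * labeled_pct // 100
    n_unlabeled = len(ids) * unlabeled_pct // 100
    labeled = ids[:n_labeled]
    unlabeled = ids[n_labeled:n_labeled + n_unlabeled]
    test = ids[n_labeled + n_unlabeled:]
    return labeled, unlabeled, test


# The four retained features combined in pairs give C(4, 2) = 6 views.
feature_pairs = list(combinations([1, 2, 3, 4], 2))

labeled, unlabeled, test = split_samples(range(100))
print(len(labeled), len(unlabeled), len(test))  # 10 70 20
print(feature_pairs)
```

The six pairs produced here correspond to the six feature combinations reported for BSP.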
Table 5: Comparison of BSP with other methods (precision and recall)
Method             Precision   Recall
#1 + #2 features   0.9571      0.8739
#1 + #3 features   0.9927      0.8747
#1 + #4 features   0.954       0.8878
#2 + #3 features   0.9958      0.8758
#2 + #4 features   0.9579      0.8767
#3 + #4 features   0.9519      0.8767
Aggregation        0.837       0.843
Merging            0.831       0.837
SVF                0.843       0.851
Table 5 shows that the single-view supervised methods perform well below the dual-view semi-supervised method. Although the method of the invention uses only 10% of the samples as the labeled sample set, its precision is above 95% and its recall is above 85% for every feature pair, which shows that the method has good predictive performance. Fig. 1 likewise shows that a prediction algorithm trained by semi-supervised learning on two behavior features at once is far better overall than a prediction algorithm that uses only one behavior feature.
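For reference, precision and recall of the kind reported in Table 5 can be computed from true and predicted labels as in the sketch below. Treating dropout (C = 1) as the positive class is an assumption; the text does not say which class the reported figures refer to, and the function name is illustrative.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
precision, recall = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```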
Student dropout is a factor that hinders the long-term development of MOOC platforms. A prediction method not only makes it easier to evaluate a course accurately, but also helps analyze the factors that lead students to drop out, so that early warning of dropout can be given. The method of the invention starts from students' learning behavior and can therefore characterize individual students.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the technical solution may be modified or equivalently replaced without departing from its purpose and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims (1)

1. A massive open online course (MOOC) dropout prediction algorithm based on semi-supervised learning, characterized by comprising the following steps:
S1: obtain users' learning log files from a MOOC website; some of the obtained users constitute the test sample set and the rest constitute the training sample set, wherein every test sample in the test sample set is a labeled sample, the training sample set includes unlabeled samples and labeled samples, all unlabeled samples constitute the unlabeled sample set, and all labeled samples constitute the labeled sample set;
S2: from the users' learning log files, compute the behavior features of all samples in the training sample set, obtaining n kinds of behavior features in total that describe all samples in the training sample set;
Let the duration of a given course be K weeks;
Let $U_i = \{U(i,1), \ldots, U(i,j), \ldots, U(i,n)\}$, where $U_i$ denotes the i-th sample in the training sample set, and let $U(i,j) = (U(i,j)_1, \ldots, U(i,j)_k, \ldots, U(i,j)_K)$, where $U(i,j)$ denotes the j-th behavior feature vector of the i-th sample in the training sample set and $U(i,j)_k$ denotes the number of times the j-th behavior of the i-th user occurs in week k of the course duration;
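The feature vectors defined in step S2 can be built from log records roughly as follows. This is a minimal sketch that assumes each log record has already been reduced to a `(user_id, behavior, week)` triple with weeks numbered from 1 to K; that record format and the function name `weekly_counts` are hypothetical, since the claim does not describe the log layout.

```python
from collections import defaultdict


def weekly_counts(events, behaviors, num_weeks):
    """Build U(i, j)_k: per user, one weekly count vector per behavior kind.

    events: iterable of (user_id, behavior, week) triples, week in 1..num_weeks.
    behaviors: ordered list of the n behavior kinds.
    Returns {user_id: [[count for week 1..K] for each behavior kind]}.
    """
    index = {name: j for j, name in enumerate(behaviors)}
    vectors = defaultdict(lambda: [[0] * num_weeks for _ in behaviors])
    for user, behavior, week in events:
        # Records with an unknown behavior or an out-of-range week are skipped.
        if behavior in index and 1 <= week <= num_weeks:
            vectors[user][index[behavior]][week - 1] += 1
    return dict(vectors)


events = [
    ("u1", "video", 1),
    ("u1", "video", 1),
    ("u1", "problem", 3),
    ("u2", "video", 2),
]
features = weekly_counts(events, ["problem", "video"], num_weeks=3)
print(features["u1"])  # [[0, 0, 1], [2, 0, 0]]
```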
S3: randomly select m kinds of behavior features from the n kinds of behavior features and obtain R classifiers in the following manner, where $m \le n$, $R = C_n^m = \frac{n!}{m!\,(n-m)!}$, and $r = 1, 2, 3, \ldots, R$;
The R classifiers are obtained as follows:
S301: let r = 1;
S302: let j = 1;
S303: let v = 1;
S304: let $P_{rj}(C \mid U(i,j))$ be the probability that the i-th sample in the training sample set is labeled C under the j-th behavior feature, where a sample labeled C = 0 represents a retained user and a sample labeled C = 1 represents a dropout user;
S305: select all unlabeled samples under the j-th behavior feature from the unlabeled sample set; these samples form the set $U_j$; for each unlabeled sample in $U_j$, compute $P_{rj}(C=0 \mid U(v,j))$ and $P_{rj}(C=1 \mid U(v,j))$, where $v = 1, 2, \ldots, |U_j|$ and $|U_j|$ denotes the total number of samples in $U_j$;
$$P_{rj}(C=0 \mid U(v,j)) = \frac{P_{rj}(U(v,j) \mid C=0) \cdot P_{rj}(C=0)}{P_{rj}(U(v,j))} \tag{1}$$
$$P_{rj}(C=0) = \frac{|L_{j,C=0}|}{|U_j| + |L_j|} \tag{1a}$$
where $|L_{j,C=0}|$ denotes the total number of samples labeled C = 0 in the labeled sample set under the j-th behavior feature, $L_j$ denotes the set of all labeled samples under the j-th behavior feature, $|L_j|$ denotes the total number of samples in $L_j$, and $|U_j| + |L_j|$ denotes the total number of samples in the training sample set under the j-th behavior feature;
$$P_{rj}(U(v,j) \mid C=0) = P_{rj}(U(v,j)_1 \mid C=0)\, P_{rj}(U(v,j)_2 \mid C=0) \cdots P_{rj}(U(v,j)_k \mid C=0) \cdots P_{rj}(U(v,j)_K \mid C=0) \tag{1b}$$
$$P_{rj}(U(v,j)_k \mid C=0) = \frac{|L_{j,C=0}(U(v,j)_k)|}{|L_{j,C=0}|} \tag{1b-1}$$
where $|L_{j,C=0}|$ denotes the total number of samples labeled C = 0 in the labeled sample set under the j-th behavior feature, and $|L_{j,C=0}(U(v,j)_k)|$ denotes, among the samples labeled C = 0 in the labeled sample set under the j-th behavior feature, the number of samples for which the j-th behavior occurs $U(v,j)_k$ times in week k of the course duration;
$$P_{rj}(C=1 \mid U(v,j)) = \frac{P_{rj}(U(v,j) \mid C=1) \cdot P_{rj}(C=1)}{P_{rj}(U(v,j))} \tag{2}$$
$$P_{rj}(C=1) = \frac{|L_{j,C=1}|}{|U_j| + |L_j|} \tag{2a}$$
where $|L_{j,C=1}|$ denotes the total number of samples labeled C = 1 in the labeled sample set under the j-th behavior feature;
$$P_{rj}(U(v,j) \mid C=1) = P_{rj}(U(v,j)_1 \mid C=1)\, P_{rj}(U(v,j)_2 \mid C=1) \cdots P_{rj}(U(v,j)_k \mid C=1) \cdots P_{rj}(U(v,j)_K \mid C=1) \tag{2b}$$
$$P_{rj}(U(v,j)_k \mid C=1) = \frac{|L_{j,C=1}(U(v,j)_k)|}{|L_{j,C=1}|} \tag{2b-1}$$
where $|L_{j,C=1}|$ denotes the total number of samples labeled C = 1 in the labeled sample set under the j-th behavior feature, and $|L_{j,C=1}(U(v,j)_k)|$ denotes, among the samples labeled C = 1 in the labeled sample set under the j-th behavior feature, the number of samples for which the j-th behavior occurs $U(v,j)_k$ times in week k of the course duration;
$$P_{rj}(U(v,j)) = P_{rj}(U(v,j) \mid C=0)\, P_{rj}(C=0) + P_{rj}(U(v,j) \mid C=1)\, P_{rj}(C=1) \tag{3}$$
output $P_{rj}(C=0 \mid U(v,j))$ and $P_{rj}(C=1 \mid U(v,j))$;
S306: let v = v + 1;
S307: when $v > |U_j|$, perform the next step; otherwise return to step S304;
S308: let $\max\{P_{rj}(C=0 \mid U(v,j))\} = \max\{P_{rj}(C=0 \mid U(v,j)),\ v = 1, 2, 3, \ldots, |U_j|\}$; remove the unlabeled sample corresponding to $\max\{P_{rj}(C=0 \mid U(v,j))\}$ from the set $U_j$, move it into the set $L_j$, and label it C = 0;
let $\max\{P_{rj}(C=1 \mid U(v,j))\} = \max\{P_{rj}(C=1 \mid U(v,j)),\ v = 1, 2, 3, \ldots, |U_j|\}$; remove the unlabeled sample corresponding to $\max\{P_{rj}(C=1 \mid U(v,j))\}$ from the set $U_j$, move it into the set $L_j$, and label it C = 1;
S309: update the set $U_j$ of all unlabeled samples under the j-th behavior feature and the set $L_j$ of all labeled samples under the j-th behavior feature, letting $|U_j| = |U_j| - 2$ and $|L_j| = |L_j| + 2$;
S310: when $|U_j| \ge 2$, return to step S303; otherwise perform the next step;
S311: let j = j + 1;
S312: when j > m, output the current labeled sample set and perform the next step; otherwise return to step S303;
S313: let r = r + 1;
S314: when r > R, perform the next step; otherwise return to step S302;
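The inner loop of step S3 (steps S303 to S310) for a single behavior feature can be sketched as follows. The posterior follows formulas (1) to (3) as written, without smoothing. Two points are assumptions of this sketch rather than statements of the claim: when none of a sample's weekly counts has been seen in the labeled data, the evidence is zero and the posterior is returned as (0.0, 0.0); and when one sample has the highest posterior for both classes, it is labeled C = 0 and the C = 1 pick is made among the remaining samples. The function names and data layout are illustrative.

```python
def posterior(x, labeled, n_unlabeled):
    """Return (P(C=0 | x), P(C=1 | x)) for one weekly count vector x.

    labeled: list of (weekly_counts, label) pairs with label in {0, 1}.
    n_unlabeled: |U_j|, which enters the prior's denominator in (1a)/(2a).
    """
    total = n_unlabeled + len(labeled)
    joint = []
    for c in (0, 1):
        members = [vec for vec, label in labeled if label == c]
        if not members:
            joint.append(0.0)
            continue
        prob = len(members) / total                  # prior, (1a)/(2a)
        for k, value in enumerate(x):                # product over weeks, (1b)/(2b)
            matches = sum(1 for vec in members if vec[k] == value)
            prob *= matches / len(members)           # per-week ratio, (1b-1)/(2b-1)
        joint.append(prob)
    evidence = joint[0] + joint[1]                   # total probability, (3)
    if evidence == 0.0:
        return 0.0, 0.0                              # assumption: no evidence either way
    return joint[0] / evidence, joint[1] / evidence


def self_train(labeled, unlabeled):
    """Move two unlabeled samples per round into the labeled set (S303-S310)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while len(unlabeled) >= 2:                       # loop condition of S310
        post = [posterior(x, labeled, len(unlabeled)) for x in unlabeled]
        best0 = max(range(len(unlabeled)), key=lambda v: post[v][0])
        rest = [v for v in range(len(unlabeled)) if v != best0]
        best1 = max(rest, key=lambda v: post[v][1])  # assumption: a distinct sample
        labeled.append((unlabeled[best0], 0))        # S308, most confident C = 0
        labeled.append((unlabeled[best1], 1))        # S308, most confident C = 1
        unlabeled = [x for v, x in enumerate(unlabeled) if v not in (best0, best1)]
    return labeled, unlabeled


seed_labeled = [([5, 4], 0), ([0, 0], 1)]
grown, leftover = self_train(seed_labeled, [[5, 4], [0, 0]])
print(grown)     # the two unlabeled samples now carry labels 0 and 1
print(leftover)  # []
```

Because the ratios in (1b-1) and (2b-1) are exact-match counts, a real log with many distinct weekly counts would often hit the zero-evidence case; some form of binning or smoothing would likely be needed in practice, which the claim does not specify.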
S4: select the optimal classifier:
S401: obtain the test sample set from step S1; the test sample set contains H test samples, $h = 1, 2, \ldots, H$;
S402: let r = 1;
S403: let h = 1;
S404: calculate $P_h(C=0 \mid U(v,j))$ according to formula (4):
$$P_h(C=0 \mid U(v,j)) = \sum_{j=1}^{m} P_{rj}(C=0 \mid U(v,j)) \tag{4}$$
calculate $P_h(C=1 \mid U(v,j))$ according to formula (5):
$$P_h(C=1 \mid U(v,j)) = \sum_{j=1}^{m} P_{rj}(C=1 \mid U(v,j)) \tag{5}$$
S405: if $P_h(C=0 \mid U(v,j)) \ge P_h(C=1 \mid U(v,j))$, label the h-th test sample C = 0; otherwise label it C = 1; output the labeled h-th test sample;
S406: let h = h + 1;
S407: if h > H, perform the next step; otherwise return to step S404;
S408: calculate the accuracy $\eta_r$ of the r-th classifier, $\eta_r = S'/S$, where S = H denotes the number of times the r-th classifier is used for labeling and S' denotes the number of times the r-th classifier labels correctly;
S409: let r = r + 1;
S410: if r > R, perform the next step; otherwise return to step S403;
S411: let $\max\{\eta_r\} = \max\{\eta_r,\ r = 1, 2, 3, \ldots, R\}$; the classifier corresponding to $\max\{\eta_r\}$ is the classifier with the highest labeling accuracy; finally output the classifier corresponding to $\max\{\eta_r\}$, which is denoted $r_{\max}$;
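Step S4 can be sketched as below. The example assumes the per-view posteriors for every test sample have already been computed and are passed in as pairs (P(C=0), P(C=1)); the accuracy is taken as the fraction of correctly labeled test samples (S' divided by S). The dictionary keys, the toy posterior values, and the function names are hypothetical.

```python
def label_sample(view_posteriors):
    """Sum the posteriors over the m views, formulas (4) and (5), then apply S405."""
    p0 = sum(p[0] for p in view_posteriors)
    p1 = sum(p[1] for p in view_posteriors)
    return 0 if p0 >= p1 else 1


def accuracy(posteriors_per_sample, true_labels):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(
        1 for post, y in zip(posteriors_per_sample, true_labels)
        if label_sample(post) == y
    )
    return correct / len(true_labels)


def select_best(classifiers, true_labels):
    """Return the name of the most accurate classifier and all accuracies (S411)."""
    scores = {name: accuracy(post, true_labels) for name, post in classifiers.items()}
    return max(scores, key=scores.get), scores


# Toy data: two candidate classifiers, three test samples, two views each.
true_labels = [0, 1, 1]
classifiers = {
    "#1+#2": [[(0.9, 0.1), (0.6, 0.4)], [(0.2, 0.8), (0.4, 0.6)], [(0.7, 0.3), (0.1, 0.9)]],
    "#1+#3": [[(0.9, 0.1), (0.3, 0.7)], [(0.2, 0.8), (0.9, 0.1)], [(0.7, 0.3), (0.6, 0.4)]],
}
best, scores = select_best(classifiers, true_labels)
print(best)  # #1+#2
```

The same `label_sample` rule, applied with the posteriors of the selected classifier, gives the label of a new unlabeled user.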
S5: for any unlabeled user $U_x$, obtain the n kinds of behavior features of this user from the user's learning log file, select the classifier output by step S411, and then calculate $P_{U_x}(C=0 \mid U(1,j))$ according to formula (6):
$$P_{U_x}(C=0 \mid U(1,j)) = \sum_{j=1}^{m} P_{r_{\max} j}(C=0 \mid U(1,j)) \tag{6}$$
calculate $P_{U_x}(C=1 \mid U(1,j))$ according to formula (7):
$$P_{U_x}(C=1 \mid U(1,j)) = \sum_{j=1}^{m} P_{r_{\max} j}(C=1 \mid U(1,j)) \tag{7}$$
if $P_{U_x}(C=0 \mid U(1,j)) \ge P_{U_x}(C=1 \mid U(1,j))$, label user $U_x$ as C = 0; otherwise label the user C = 1.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510967503.0A CN105631536A (en) 2015-12-21 2015-12-21 Massive open online course (MOOC) quitting prediction algorithm based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510967503.0A CN105631536A (en) 2015-12-21 2015-12-21 Massive open online course (MOOC) quitting prediction algorithm based on semi-supervised learning

Publications (1)

Publication Number Publication Date
CN105631536A true CN105631536A (en) 2016-06-01

Family

ID=56046443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510967503.0A Pending CN105631536A (en) 2015-12-21 2015-12-21 Massive open online course (MOOC) quitting prediction algorithm based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN105631536A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107274077A (en) * 2017-05-31 2017-10-20 清华大学 Course elder generation's postorder computational methods and equipment
CN107274077B (en) * 2017-05-31 2020-07-31 清华大学 Course first-order and last-order computing method and equipment
CN107609084A (en) * 2017-09-06 2018-01-19 华中师范大学 One kind converges convergent resource correlation method based on gunz
CN107609084B (en) * 2017-09-06 2021-01-26 华中师范大学 Resource association method based on crowd-sourcing convergence
CN108597280A (en) * 2018-04-27 2018-09-28 中国人民解放军国防科技大学 Teaching system and teaching method based on learning behavior analysis
CN109670999A (en) * 2018-12-29 2019-04-23 重庆工商职业学院 Overturn classroom instruction management system
CN111724867A (en) * 2020-06-24 2020-09-29 中国科学技术大学 Molecular property measurement method, molecular property measurement device, electronic apparatus, and storage medium
CN111724867B (en) * 2020-06-24 2022-09-09 中国科学技术大学 Molecular property measurement method, molecular property measurement device, electronic apparatus, and storage medium
CN112884449A (en) * 2021-03-12 2021-06-01 北京乐学帮网络技术有限公司 User guiding method, device, computer equipment and storage medium
CN112884449B (en) * 2021-03-12 2024-05-14 北京乐学帮网络技术有限公司 User guiding method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160601