CN114662779A

CN114662779A - Student portrait-based programming score prediction method and system

Info

Publication number: CN114662779A
Application number: CN202210359959.9A
Authority: CN
Inventors: 沈国华; 杨思恩; 黄志球; 李广龙; 李锐; 蔡茂东
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2022-04-07
Filing date: 2022-04-07
Publication date: 2022-06-24

Abstract

The invention discloses a programming score prediction method and system based on student portraits. The method comprises the following steps: acquiring student programming data and preprocessing it; generating a student portrait comprising personal information, programming skills and learning records, wherein the programming skills comprise the question types the student excels at, code quality, code style, and time and memory consumption, and the learning records comprise the total number of submissions, the total number of passes, experiment scores and accuracy; calculating the medians of values such as the total number of code quality issues and the total number of submissions, and using them to flag abnormal data; using a clone detection technique to measure the similarity between the code submitted by each student flagged as abnormal and the code submitted by other students, and deleting that student's data if the similarity exceeds a set threshold; and constructing a deep neural network and training it on the data set from which the abnormal data has been deleted, to obtain a trained programming score prediction model. The method predicts programming scores more comprehensively and accurately and reduces the influence of plagiarized code on score prediction.

Description

Student portrait-based programming score prediction method and system

Technical Field

The invention belongs to the field of educational data mining, and particularly relates to a programming score prediction method and system that construct student portraits and apply a deep neural network.

Background

In recent years, education and information technology have changed profoundly. Many online learning systems have emerged; they supply massive amounts of data for Educational Data Mining (EDM) research and accelerate the development of EDM applications. EDM applies theories and techniques from several disciplines, such as education, computer science, psychology and statistics, to problems in educational research and teaching practice: by analyzing and mining education-related data, problems can be discovered and solved. Although teaching between teacher and student remains one of the most important teaching settings, the application of the Internet and artificial intelligence to education creates a more open and intelligent teaching environment, provides more convenient platforms and more complete management, and generates richer data.

In programming education in particular, an online judge (OJ) system is needed to increase automation, reduce the workload of teaching staff, and help students learn programming and improve their programming skills. OJ systems were originally used in large contests such as the ICPC and IOI, where they give contestants around the world a secure and reliable judging environment for submitting contest code. An OJ system is an online platform that collects, compiles and evaluates source code: it compiles the source code a user submits, generates an executable, and evaluates the code against test cases for correctness, running time, memory consumption and so on. Data from an OJ system can be mined to study students' performance in learning programming, for example by building student portraits and predicting student scores.

A student portrait depicts a student more vividly, so that the student's learning situation can be better understood. It is a set of labels, derived from personal information and learning behavior, that describes the student concretely and intuitively. The labels show, directly or indirectly, a student's basic information, learning habits and efficiency, which makes it easier for teachers to get to know their students. Predicting students' programming scores, in turn, helps teachers understand their own teaching effectiveness and their students' learning progress in time, so that they can offer teaching guidance.

A user portrait is essentially a labeled overall view of a user. It is built from broad user data: user attributes are classified, user characteristics are extracted with suitable techniques and distilled into user labels, and the labels together form the portrait. Compared with a general user portrait, a student portrait focuses more on the student's learning performance. Different researchers select different features when constructing student portraits. Students' personal and academic information is often considered, while some studies focus on a few specific features. For example, to predict the performance of students on online learning platforms, some researchers use users' mouse-click behavior as features, and others use students' psychological condition and family economic circumstances.

Student score prediction can be treated as either a classification task or a regression task. In classification, results are typically divided into two groups, "pass" and "fail". In regression, the target is the end-of-term exam score, typically between 0 and 100. Many different data mining techniques can be used for both tasks. Researchers have compared techniques including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Artificial Neural Networks (ANN), decision trees and naive Bayes on classification and regression tasks, and found that ANN gave the best results. Recent studies also apply deep learning to educational data, and Deep Neural Networks (DNNs) have achieved strong results in student performance prediction.

Most previous educational data mining methods target all courses, or a broad class of courses, rather than one specific course, so the selected features cannot be tailored to be representative of each course when performing the prediction task. For a programming course, for example, many methods predict future performance only from items such as the question ID, user ID, number of answers and accuracy recorded in the system; they neither analyze the submitted code nor consider possible code plagiarism, so their predictions are unsatisfactory.

Disclosure of Invention

Purpose of the invention: in view of the above shortcomings of the prior art, the invention aims to provide a student portrait-based programming score prediction method and system that predict a student's end-of-term programming score from the programming data in an online judge system.

Technical solution: to achieve the above purpose, the invention adopts the following technical solution:

A student portrait-based programming score prediction method comprises the following steps:

(1) acquiring student programming data from an online judge system, including student personal information, question information and student submission data; the question information comprises question descriptions, question types and test cases, and the student submission data comprises judging results, source code and experiment scores;

(2) preprocessing the acquired data, including normalization, encoding and missing-value handling;

(3) generating a student portrait comprising personal information, programming skills and learning records; the programming skills comprise the question types the student excels at, code quality, code style, and time and memory consumption, and the learning records comprise the total number of submissions, the total number of passes, experiment scores and accuracy;

(4) calculating the median of the total number of code quality issues, the median of the total number of submissions, the median of the total number of passes and the median of the experiment scores, and marking as abnormal the data of any student who fails the end-of-term exam and meets the abnormality determination condition; using a clone detection technique to measure the similarity between the code submitted by each student marked as abnormal and the code submitted by other students, and deleting that student's data if the similarity exceeds a set threshold;

(5) constructing a deep neural network and training it on the data set from which the abnormal data has been deleted, to obtain a trained programming score prediction model.

Preferably, the question type a student excels at is obtained from the student's answer accuracy on each type of programming question: the higher the accuracy, the better the student is at that type; the time and memory consumption is read from each of the student's submission records and then averaged.

Preferably, the code quality in the programming skills is obtained by running a quality checking tool on the code that students have submitted to the database, which gives the number of times each type of code quality issue occurs for each student; the code style is obtained by processing the submitted code as text with a natural language analysis tool, which gives each student's naming style, use of comments and use of indentation.

Preferably, the total number of submissions and the total number of passes in the learning records are obtained by summing each student's records in the database; the experiment score is calculated from the final score of each experiment, where the final score is the highest score the student obtains within the specified time; the accuracy is calculated from the number of submissions and the number of passes.

Preferably, the abnormality determination condition is:

where a flag value of 1 indicates that the data is abnormal; m_cq, m_sn, m_an and m_cs are respectively the median of the total number of code quality issues, the median of the total number of submissions, the median of the total number of passes and the median of the experiment scores; and s_cq, s_sn, s_an and s_cs are respectively the student's total number of code quality issues, total number of submissions, total number of passes and experiment score.

Preferably, the code submitted by students marked as abnormal is also compared with the code in a previously collected code library, and the data of students whose similarity exceeds a set threshold is deleted.

Preferably, the deep neural network is a fully connected four-layer network comprising an input layer, two hidden layers and an output layer; it uses the rectified linear unit (ReLU) as the activation function and the root mean square propagation (RMSprop) algorithm as the optimizer.

A student portrait-based programming score prediction system comprises:

a data acquisition module, configured to acquire student programming data from an online judge system, including student personal information, question information and student submission data; the question information comprises question descriptions, question types and test cases, and the student submission data comprises judging results, source code and experiment scores;

a preprocessing module, configured to preprocess the acquired data, including normalization, encoding and missing-value handling, and to generate a student portrait comprising personal information, programming skills and learning records; the programming skills comprise the question types the student excels at, code quality, code style, and time and memory consumption, and the learning records comprise the total number of submissions, the total number of passes, experiment scores and accuracy;

an abnormality processing module, configured to calculate the median of the total number of code quality issues, the median of the total number of submissions, the median of the total number of passes and the median of the experiment scores, and to mark as abnormal the data of any student who fails the end-of-term exam and meets the abnormality determination condition; and to use a clone detection technique to measure the similarity between the code submitted by each student marked as abnormal and the code submitted by other students, deleting that student's data if the similarity exceeds a set threshold;

and a programming score prediction module, configured to construct a deep neural network and train it on the data set from which the abnormal data has been deleted, to obtain a trained programming score prediction model.

A computer system comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the steps of the student portrait-based programming score prediction method.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the student portrait-based programming score prediction method.

Advantageous effects: compared with the prior art, the invention has the following advantages:

1. The invention recognizes that the code students submit is rich in useful information. When constructing the student portrait, the code is analyzed by both text analysis and code analysis: text analysis yields the student's code style, and code analysis yields the student's code quality. Using this code information as features for the subsequent prediction allows scores to be predicted more comprehensively and accurately.

2. The invention takes into account the plagiarism that occurs in online programming learning: answers a student submits may not have been completed independently, which ultimately distorts the prediction. The invention provides an abnormality determination rule to uncover suspected plagiarism and uses a clone detection tool to detect code plagiarism in regular experiments, thereby reducing the influence of plagiarized code on score prediction.

Drawings

FIG. 1 is a method block diagram of an embodiment of the invention.

FIG. 2 is a schematic diagram of a student portrait according to an embodiment of the invention.

FIG. 3 is a graph comparing prediction results with and without clone detection in an embodiment of the invention (S: programming skills, I: personal information, L: learning records).

FIG. 4 is a graph comparing the prediction performance of the method of the invention with existing methods and with different feature combinations (S: programming skills, I: personal information, L: learning records).

Detailed Description

To facilitate understanding by those skilled in the art, the invention is further described below with reference to the embodiments and the accompanying drawings.

To predict students' programming scores, data is first collected from the online judge system of Nanjing University of Aeronautics and Astronautics (NUAAOJ); the data is preprocessed to remove unwanted noise, features are selected, and student portraits are formed. The data set is then split into a training set and a test set, a deep neural network is trained, and the resulting score prediction model is evaluated. The overall framework is shown in FIG. 1.

(1) Deploy an open-source OJ system at the school to acquire student programming data.

All data used by the method is collected from NUAAOJ and comprises student personal information, question information and student submission data. The personal information gives basic facts about each student, such as student number, gender and class, and also records the student's logins to the system. The question information describes each question in detail, including its description, type and test cases. The submission data contains the judging result and source code for each question, together with the total score of each experiment. With this data a student portrait can be constructed conveniently.

(2) Acquire the required attributes from the database and preprocess them.

To obtain better prediction results, both good feature selection and careful handling of missing data are important. Many data preprocessing methods, such as normalization and one-hot encoding, can be used to reduce the impact of low-quality data. Because the features differ by orders of magnitude, each is normalized to the range 0 to 1 before use, as follows:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$

where X and X' are the attribute values before and after normalization, respectively, and X_max and X_min are the maximum and minimum values of the attribute, respectively. Equation (1) limits every attribute to the range 0 to 1.

The method encodes unordered (categorical) data with one-hot encoding. For example, the data set contains five classes (501 to 505). Without one-hot encoding, the class feature would take the values 501 to 505. With one-hot encoding, the feature is replaced by five columns, 501 to 505; if a student is in class 502, column 502 has the value 1 and the other columns have the value 0.

Some students may miss an experiment because of special circumstances. Missing values in the training data are handled in one of two ways: fill them with the mean or median of all data in the column, or delete the student's data outright. If a student lacks scores for only one or two experiments, the average over all students is used to fill them. If a student lacks more than two experiment scores, or lacks the end-of-term exam score, the student's data is deleted.
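The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the row layout (`scores`, `final`) and the handling of a constant column are assumptions made for the example.

```python
from statistics import mean


def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] as in equation (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: return zeros to avoid division by zero (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def fill_or_drop(rows, max_missing=2):
    """Handle missing scores.

    rows: list of dicts with 'scores' (list, None = missing experiment score)
    and 'final' (end-of-term score, None = missing). Layout is hypothetical.
    Students missing the final score or more than `max_missing` experiment
    scores are dropped; other gaps are filled with the mean over all students.
    """
    n_cols = len(rows[0]["scores"]) if rows else 0
    col_means = []
    for j in range(n_cols):
        present = [r["scores"][j] for r in rows if r["scores"][j] is not None]
        col_means.append(mean(present) if present else 0.0)
    cleaned = []
    for r in rows:
        if r["final"] is None or r["scores"].count(None) > max_missing:
            continue  # delete this student's data
        filled = [col_means[j] if s is None else s for j, s in enumerate(r["scores"])]
        cleaned.append({"scores": filled, "final": r["final"]})
    return cleaned
```

For a student in class 502, `one_hot(502, [501, 502, 503, 504, 505])` yields a vector with a 1 in the second position only, which matches the class example in the text.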

(3) Generate a student portrait that contains code information.

A student portrait can describe a student's characteristics from several aspects. To capture these characteristics clearly and quickly, the method adopts a portrait model with three dimensions: personal information, programming skills and learning records. Personal information, such as student number, class and gender, is read directly from a database table. The question types excelled at and the time and memory consumption in programming skills, and all features in the learning records, are obtained by joining several database tables and computing over them. The code style in programming skills is obtained by text analysis with a natural language analysis tool, and the code quality is extracted by code analysis with a tool. Table 1 lists how each feature in the student portrait is obtained, and FIG. 2 shows the student portrait model.
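As a rough illustration of the three-dimension portrait model, the following Python sketch groups the features named in the text into one record per dimension. All class and field names, and the units for time and memory, are hypothetical choices for this example; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class PersonalInfo:
    student_number: str
    class_id: int
    gender: str


@dataclass
class ProgrammingSkills:
    best_question_type: str              # question type the student excels at
    code_quality_counts: Dict[str, int]  # issue id -> number of occurrences
    naming_style: str                    # e.g. "camel", "pascal" or "snake"
    comment_count: int
    indent_count: int
    avg_time_ms: float                   # unit is an assumption
    avg_memory_kb: float                 # unit is an assumption


@dataclass
class LearningRecord:
    total_submissions: int
    total_passes: int
    experiment_scores: List[float]
    accuracy: float


@dataclass
class StudentPortrait:
    personal: PersonalInfo
    skills: ProgrammingSkills
    record: LearningRecord

    def labels(self) -> Dict[str, object]:
        """Flatten the three dimensions into a 'dimension.field' -> value mapping."""
        flat = {}
        for dimension, fields in asdict(self).items():
            for name, value in fields.items():
                flat[f"{dimension}.{name}"] = value
        return flat
```

Keeping the dimensions separate makes it straightforward to train on feature subsets, such as programming skills only or skills plus learning records.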

TABLE 1 Manner of constructing student portraits

Specifically, the question type a student excels at is obtained from the student's answer accuracy on each type of programming question: the higher the accuracy, the better the student is considered to be at that type. The programming questions in the OJ system fall into five categories: basics, branches, loops, arrays and functions. The time and memory consumption is read directly from each of the student's submission records and averaged. The total number of submissions and the total number of passes are obtained by summing each student's records in the database. The experiment score is calculated from the final score of each experiment; in the OJ system, the final score is the highest score a student obtains within the specified time. The accuracy is calculated from the number of submissions and the number of passes.
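The feature calculations in this paragraph could look like the following Python sketch. The record layout (keys `type`, `passed`, `time_ms`, `memory_kb`) is a hypothetical stand-in for the OJ database tables, and accuracy is taken to be passes divided by submissions, which the text implies but does not state as a formula.

```python
from collections import defaultdict


def learning_features(submissions):
    """Summarize one student's submission records.

    submissions: list of dicts with keys 'type', 'passed', 'time_ms',
    'memory_kb' (hypothetical field names).
    """
    total = len(submissions)
    if total == 0:
        return None
    passes = sum(1 for s in submissions if s["passed"])
    # Per question type: [number of passes, number of submissions].
    by_type = defaultdict(lambda: [0, 0])
    for s in submissions:
        by_type[s["type"]][0] += int(s["passed"])
        by_type[s["type"]][1] += 1
    # The type with the highest accuracy is the one the student excels at.
    best_type = max(by_type, key=lambda t: by_type[t][0] / by_type[t][1])
    return {
        "total_submissions": total,
        "total_passes": passes,
        "accuracy": passes / total,
        "best_type": best_type,
        "avg_time_ms": sum(s["time_ms"] for s in submissions) / total,
        "avg_memory_kb": sum(s["memory_kb"] for s in submissions) / total,
    }


def experiment_final_score(attempt_scores):
    """Final score of one experiment: the highest score obtained in the allowed time."""
    return max(attempt_scores)
```

In a real deployment these aggregates would more likely be computed with SQL joins over the OJ tables; the Python version only makes the arithmetic explicit.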

As described above, all features in the student portrait can be obtained directly or indirectly (using SQL) from the data set, except for code style and code quality, which are derived with tools. For code style, text analysis is performed through natural language processing, mainly considering the naming conventions and formatting of the student's code. With reference to existing work, we identified three mainstream naming styles: camel case, Pascal case and snake case. Format style is more complex than naming style; the method considers indentation, comments and other factors. The code submitted by students is analyzed with Stanford University's CoreNLP. The naming style is added to the student portrait (features) through one-hot encoding; indentation and comments are stored as the number of times each is used, with the value 0 if none is used. For code quality, code analysis is performed with tools. Because the source code in the OJ system is mainly written in C, CppCheck, a static analysis tool for C/C++ code, is used to detect undefined behavior and dangerous coding constructs in all code that students submitted and that passed the test cases in the OJ system. Code quality issues detected more than 200 times in total are retained, and each student's count for each of these issues is added to the student portrait (features) through one-hot encoding. Table 2 shows the most frequently occurring code quality issues and their descriptions.

TABLE 2 Most frequently occurring code quality issues in submitted code

Code quality issue	Total count	Description
variableScope	3280	The scope of the variable can be reduced
unusedVariable	995	Unused variable
unreadVariable	594	Variable is assigned a value that is never used
invalidscanf	569	scanf() without field width limits can crash with huge input data
knownConditionTrueFalse	384	Condition is always false
getsCalled	334	Obsolete function 'gets' called
selfAssignment	203	Redundant assignment to itself
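The selection of frequent issues and the per-student counts could be computed as in the sketch below. It assumes the static-analysis findings have already been collected as `(student_id, issue_id)` pairs; how they are extracted from the tool's output is not shown, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def quality_features(findings, min_total=200):
    """Build per-student issue counts for frequently occurring issues.

    findings: iterable of (student_id, issue_id) pairs (assumed input format).
    Returns the retained issue ids and, per student, a count vector in that order.
    """
    findings = list(findings)
    totals = Counter(issue for _, issue in findings)
    # Keep only issue types detected more than `min_total` times overall.
    kept = sorted(issue for issue, n in totals.items() if n > min_total)
    per_student = defaultdict(Counter)
    for student, issue in findings:
        per_student[student][issue] += 1
    vectors = {s: [counts[i] for i in kept] for s, counts in per_student.items()}
    return kept, vectors
```

With the threshold of 200 from the text, the seven issue types in Table 2 would be the retained columns.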

(4) Use clone detection to reduce the influence of abnormal data on prediction.

Although automatic judging by the OJ system is convenient for instructors, students can easily obtain answers to OJ questions from online sources or classmates. It is inevitable that some students will refer to external answers or even copy other people's code. Clone detection is therefore essential to obtain satisfactory programming score predictions. Because some students submit far more often than others, which creates large gaps, medians are more effective at representing overall performance. The median of the total number of code quality issues (m_cq), the median of the total number of submissions (m_sn), the median of the total number of passes (m_an) and the median of the experiment scores (m_cs) are calculated. Each student s_i has their own total number of code quality issues (s_cq), number of submissions (s_sn), number of passes (s_an) and experiment score (s_cs). The data of a student who fails the end-of-term exam and meets the abnormality determination condition is marked as abnormal.
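A short Python sketch of the median calculation follows. The dictionary keys are hypothetical names for the four per-student totals; the result corresponds to m_cq, m_sn, m_an and m_cs in the text. The sample data in the usage note is invented purely to show why a median is less sensitive than a mean to one student with an unusually large number of submissions.

```python
from statistics import mean, median

# Hypothetical field names for s_cq, s_sn, s_an and s_cs.
METRICS = ("code_quality_issues", "submissions", "passes", "experiment_score")


def class_medians(students):
    """Return the median of each metric over all students (m_cq, m_sn, m_an, m_cs)."""
    return {m: median(s[m] for s in students) for m in METRICS}


def class_means(students):
    """Means of the same metrics, for comparison only."""
    return {m: mean(s[m] for s in students) for m in METRICS}
```

For three students with 10, 12 and 500 submissions, the median number of submissions is 12, while the mean is pulled up to 174 by the single outlier.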

Equation (2) shows the determination condition of abnormal data (a value of 1 indicates data abnormality).

Students with a flag value of 1 are selected, and the clone detection tool SIM is then used to measure the similarity between the code each of them submitted and the code submitted by other students. SIM tests lexical similarity in natural-language text and in programs written in languages such as C, C++, Java and Pascal. It can also efficiently compare the code of a large number of students with code collected in the past to find signs of cheating. If a student's code is found to be highly similar to the code of other students, that student's data is deleted. In addition, students' code can be compared with the code in a previously collected code library, and the data of students with high similarity is deleted.
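The filtering step can be illustrated with the Python sketch below. It does not call SIM; as a stand-in it measures token-level similarity with the standard library's `difflib.SequenceMatcher`, which is a different and much simpler technique. The tokenizer, the threshold of 0.9 and the function names are all assumptions for the example.

```python
import re
from difflib import SequenceMatcher

# Very rough tokenizer: identifiers/keywords, numbers, and single symbols.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(code):
    return TOKEN.findall(code)


def similarity(code_a, code_b):
    """Token-level similarity in [0, 1] (stand-in for a clone detector's score)."""
    return SequenceMatcher(None, tokens(code_a), tokens(code_b), autojunk=False).ratio()


def students_to_delete(codes, flagged, threshold=0.9):
    """Return flagged students whose code is too similar to any other student's.

    codes: dict student_id -> source code; flagged: ids with flag value 1.
    """
    delete = set()
    for student in flagged:
        for other, source in codes.items():
            if other != student and similarity(codes[student], source) > threshold:
                delete.add(student)
                break
    return delete
```

Comparison against a previously collected code library fits the same pattern: add the library entries to `codes` under ids that are never flagged.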

(5) Predict student scores using a deep neural network.

The method uses regression to predict a student's future programming performance in the form of the end-of-term exam score. We obtained about 15 million code submissions from the NUAAOJ data set; 25% of the data set is used for testing and the rest for training. Regression performance is evaluated with the Root Mean Square Error (RMSE), calculated using equation (3):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{3}$$

where $y_i$ is the actual end-of-term exam score, $\hat{y}_i$ is the predicted end-of-term exam score, and $n$ is the number of samples evaluated. The test procedure is repeated 5 times to obtain 5 different RMSE values, which are averaged to give the final value. After comparing multiple experiments, a fully connected four-layer deep neural network is adopted: the input layer has 128 units, the first hidden layer has 64 units, the second hidden layer has 8 units, and the output layer has one unit. The rectified linear unit (ReLU) is used as the activation function and the root mean square propagation (RMSprop) algorithm as the optimizer, with a learning rate of 0.001. Finally, to prevent overfitting, early stopping is used with a patience of 10, meaning that training stops if the loss does not decrease for 10 consecutive epochs.
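The network shape, the RMSE metric and the early-stopping rule can be sketched in plain Python as follows. This is an illustration under assumptions, not the trained model: weights are random, RMSprop training is omitted, the 128-unit "input layer" is interpreted as the first fully connected layer applied to the feature vector, and all function names are hypothetical.

```python
import math
import random

LAYER_UNITS = (128, 64, 8, 1)  # units per fully connected layer, from the text


def init_network(n_features, seed=0):
    """Random weights for illustration only; returns a list of (weights, biases)."""
    rng = random.Random(seed)
    sizes = (n_features,) + LAYER_UNITS
    network = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        network.append((weights, [0.0] * n_out))
    return network


def forward(network, features):
    """Forward pass with ReLU after every layer except the single-unit output."""
    x = list(features)
    for index, (weights, biases) in enumerate(network):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(network) - 1:
            x = [max(0.0, v) for v in x]  # ReLU
    return x[0]


def rmse(actual, predicted):
    """Root mean square error as in equation (3)."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def stop_epoch(losses, patience=10):
    """Return the 1-based epoch at which early stopping triggers, else None."""
    best, wait = math.inf, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None
```

The final reported value would be the average of `rmse` over the 5 repeated test runs. In practice a deep learning framework would supply the layers, the RMSprop optimizer with learning rate 0.001, and the early-stopping callback.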

FIG. 3 compares the prediction results before and after clone detection is used to remove the influence of abnormal data. With clone detection the prediction results improve markedly: the RMSE is below 15, whereas before it was almost always above 20. The clone detection tool therefore has a positive effect on the final result.

FIG. 4 shows the performance of each regression algorithm with different feature combinations. The best result, the lowest RMSE of 12.68, is achieved when a Deep Neural Network (DNN) is used with all of the features mentioned. The deep neural network therefore performs well on prediction from many features, and adding code-related information to the features effectively improves the final prediction.

Based on the same inventive concept, the student portrait-based programming score prediction system disclosed in an embodiment of the invention comprises: a data acquisition module, configured to acquire student programming data from an online judge system, including student personal information, question information and student submission data, wherein the question information comprises question descriptions, question types and test cases, and the student submission data comprises judging results, source code and experiment scores; a preprocessing module, configured to preprocess the acquired data, including normalization, encoding and missing-value handling, and to generate a student portrait comprising personal information, programming skills and learning records, wherein the programming skills comprise the question types the student excels at, code quality, code style, and time and memory consumption, and the learning records comprise the total number of submissions, the total number of passes, experiment scores and accuracy; an abnormality processing module, configured to calculate the median of the total number of code quality issues, the median of the total number of submissions, the median of the total number of passes and the median of the experiment scores, to mark as abnormal the data of any student who fails the end-of-term exam and meets the abnormality determination condition, and to use a clone detection technique to measure the similarity between the code submitted by each student marked as abnormal and the code submitted by other students, deleting that student's data if the similarity exceeds a set threshold; and a programming score prediction module, configured to construct a deep neural network and train it on the data set from which the abnormal data has been deleted, to obtain a trained programming score prediction model.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the modules described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again. The division of the modules is only one logical functional division, and in actual implementation, there may be another division, for example, a plurality of modules may be combined or may be integrated into another system.

Based on the same inventive concept, an embodiment of the invention discloses a computer system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when loaded into the processor, the computer program implements the steps of the student portrait-based programming score prediction method.

Based on the same inventive concept, an embodiment of the invention discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the student portrait-based programming score prediction method.

It will be understood by those skilled in the art that the technical solution of the invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer system (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the invention. The storage medium includes various media capable of storing computer programs, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disk.

Claims

1. A student portrait-based programming score prediction method, characterized by comprising the following steps:

2. The student portrait-based programming score prediction method of claim 1, wherein the question type a student excels at in the programming skills is obtained from the student's answer accuracy on each type of programming question, and the higher the accuracy, the better the student is considered to be at that type; the time and memory consumption is read from each of the student's submission records and averaged.

3. The student portrait-based programming score prediction method of claim 1, wherein the code quality in the programming skills is obtained by running a quality checking tool on the code that students have submitted to the database, so as to obtain the number of times each type of code quality issue occurs for each student; the code style is obtained by processing the code submitted by students in the database as text with a natural language analysis tool, so as to obtain each student's naming style, use of comments and use of indentation.

4. The student portrait-based programming score prediction method of claim 1, wherein the total number of submissions and the total number of passes in the learning records are obtained by summing each student's records in the database; the experiment score is calculated from the final score of each experiment, where the final score is the highest score the student obtains within the specified time; the accuracy is calculated from the number of submissions and the number of passes.

5. The student portrait-based programming score prediction method of claim 1, wherein the abnormality determination condition is:

6. The student portrait-based programming score prediction method of claim 1, further comprising comparing the code submitted by students marked as abnormal with the code in a previously collected code library, and deleting the data of students whose similarity exceeds a set threshold.

7. The student portrait-based programming score prediction method of claim 1, wherein the deep neural network is a fully connected four-layer network comprising an input layer, two hidden layers and an output layer, uses the rectified linear unit (ReLU) as the activation function, and uses the root mean square propagation (RMSprop) algorithm as the optimizer.

8. A student portrait-based programming score prediction system, comprising:

9. A computer system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the steps of the student portrait-based programming score prediction method of any one of claims 1-7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the student portrait-based programming score prediction method of any one of claims 1-7.