CN110705025A - Large-scale examination data analog simulation method, device and storage medium - Google Patents

Large-scale examination data analog simulation method, device and storage medium Download PDF

Info

Publication number
CN110705025A
CN110705025A (application CN201910830105.2A)
Authority
CN
China
Prior art keywords
data
simulation
subject
basic condition
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910830105.2A
Other languages
Chinese (zh)
Inventor
柯瑞强
李宏强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Crystal Ball Education Information Technology Co Ltd
Original Assignee
Crystal Ball Education Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Crystal Ball Education Information Technology Co Ltd filed Critical Crystal Ball Education Information Technology Co Ltd
Priority to CN201910830105.2A
Publication of CN110705025A
Legal status: Pending

Links

Images

Classifications

    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B7/00: Electrically-operated teaching apparatus or devices working with questions and answers
    • G09B7/02: Electrically-operated teaching apparatus or devices working with questions and answers of the type wherein the student is expected to construct an answer to the question which is presented or wherein the machine gives an answer to the question presented by a student

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale examination data simulation method, device and storage medium. The method comprises the following steps: inputting basic condition parameters, wherein the basic condition parameters comprise examinee information, subject information and historical examination data; calculating the score data of each individual subject from the basic condition parameters using the probability density function of the normal distribution; and performing optimization matching and iterative processing on the per-subject score data using the least-squares principle to obtain the simulated result data. The invention is suitable for research and analysis of large-scale examinations and can effectively improve the quality of simulation data, thereby improving the level and quality of examinations and tests.

Description

Large-scale examination data analog simulation method, device and storage medium
Technical Field
The invention relates to the technical field of simulation, and in particular to a large-scale examination data simulation method, device and storage medium.
Background
Large-scale examinations and evaluations have conventionally included qualification examinations and graded (selective) examinations for different purposes, used for talent evaluation and talent selection. Conventional examinations and evaluations rely largely on techniques based on CTT (classical test theory). In recent years, with advances in theoretical research, IRT (item response theory) has been used increasingly, but large-scale examinations remain based on CTT as a whole.
With the reform of the middle-school and college entrance examinations, the requirements on large-scale examinations have grown ever higher. Two basic principles of traditional large-scale examinations have remained unchanged throughout: quality and fairness; on that basis, the requirement for efficiency has gradually risen to the level of a principle as well. Amid the entrance-examination reforms of recent years, research in related fields has multiplied, and problems have increasingly surfaced in traditional examination means, content and standard specifications, including score-assignment modes, which cannot meet the demands of the public and of national policy for educational fairness and higher quality standards.
For large-scale examinations, especially highly influential ones such as the college entrance examination, related research work has great significance and value; and because large-scale examinations and evaluations are very costly, data simulation is necessary and important when conducting such research. The entrance-examination reforms carried out across the country in recent years are a case in point: in the standard specifications of the provinces of the first reform batch (Zhejiang and Shanghai) and the second batch (Beijing, Shandong, Tianjin, Hainan and others), the score-assignment modes in particular differ greatly from the traditional mode.
In related research, qualitative studies generally adopt method routes such as literature review, experience summarization and comprehensive evaluation; quantitative studies generally adopt simulation, using big-data analysis and simulation technology to study how related factors affect score-assignment modes and to propose corresponding solutions.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a large-scale examination data simulation method, device and storage medium that are suitable for research and analysis of large-scale examinations and can effectively improve the quality of simulation data, thereby improving the level and quality of examinations and tests.
An embodiment of the present invention provides a large-scale examination data simulation method, including:
inputting basic condition parameters, wherein the basic condition parameters comprise examinee information, subject information and historical examination data;
calculating the score data of each individual subject from the basic condition parameters using the probability density function of the normal distribution;
and performing optimization matching and iterative processing on the per-subject score data using the least-squares principle to obtain the simulated result data.
The examinee information comprises the number of examinees, the examinee numbers and the subject-selection information; the subject information comprises the basic information, score range and difficulty parameters of each subject's test paper.
Calculating the score data of each individual subject from the basic condition parameters using the probability density function of the normal distribution comprises:
calculating the theoretical distribution probability of each score value using the probability density function of the normal distribution according to the basic condition parameters, and multiplying by the number of examinees to obtain the number of examinees at each score, i.e. the score-count distribution;
and assigning the score-count distribution into a score array whose length equals the number of examinees, obtaining the score data of each individual subject.
Performing optimization matching and iterative processing on the per-subject score data using the least-squares principle to obtain the simulated result data comprises the following steps:
matching the score data of each individual subject with the examinee information and fixing the matching relation;
calculating the Pearson product-moment correlation coefficient between every pair of subjects using the following formula:

r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}

or

r = \frac{n\sum x_i y_i-\sum x_i\sum y_i}{\sqrt{n\sum x_i^2-(\sum x_i)^2}\,\sqrt{n\sum y_i^2-(\sum y_i)^2}}

calculating the sum of squared errors of all correlation coefficients using the least-squares principle, taking the minimum of \sum(Y_i - Y_j)^2 as the optimization criterion, wherein Y_i corresponds to a calculated Pearson product-moment correlation coefficient R1 and Y_j corresponds to the configured correlation coefficient R;
and iterating repeatedly until the simulated result data are obtained.
An embodiment of the present invention further provides a large-scale examination data simulation apparatus, including:
a basic condition parameter input unit, configured to input basic condition parameters, wherein the basic condition parameters comprise examinee information, subject information and historical examination data;
a per-subject score data calculation unit, configured to calculate the score data of each individual subject from the basic condition parameters using the probability density function of the normal distribution;
and a simulation unit, configured to perform optimization matching and iterative processing on the per-subject score data using the least-squares principle to obtain the simulated result data.
The examinee information comprises the number of examinees, the examinee numbers and the subject-selection information; the subject information comprises the basic information, score range and difficulty parameters of each subject's test paper.
An embodiment of the present invention further provides a big data statistics sampling server, including:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the large-scale examination data simulation method described above.
An embodiment of the present invention further provides a computer-readable storage medium. The storage medium includes a stored computer program; when the computer program runs, the device on which the storage medium is located is controlled to execute the large-scale examination data simulation method described above.
The embodiment of the invention has the following beneficial effects:
according to the teaching of the embodiment, the invention is suitable for the research and analysis related to large-scale examinations, and can effectively improve the quality of simulation data, thereby improving the level and quality of examinations and tests.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a simulation method for large-scale test data according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a large-scale examination data simulation apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to fig. 1.
An embodiment of the present invention provides a large-scale examination data simulation method, including:
s100, inputting basic condition parameters, wherein the basic condition parameters comprise examinee information, subject information and historical examination data.
The examinee information comprises the number of examinees, the examinee numbers and the subject-selection information; the subject information comprises the basic information, score range and difficulty parameters of each subject's test paper.
In particular embodiments, the method targets large multi-subject or multi-item examinations and assessments. The data simulation for a single subject or item can draw on multiple sources: it can be generated from historical data (detail records or statistics), or, when no historical data exist, from a standard normal distribution function or another distribution function (such as a beta distribution) or method. For the generated multidimensional data (the per-subject or per-item data sets), a specific algorithm changes the data attribution (ordering) of each single-dimensional data set, so that a multidimensional data set that originally had no correlation (or a correlation far from the theoretical or actual expectation) better matches practical requirements and expectations.
S200: calculating the score data of each individual subject from the basic condition parameters using the probability density function of the normal distribution.
This comprises:
calculating the theoretical distribution probability of each score value using the probability density function of the normal distribution according to the basic condition parameters, and multiplying by the number of examinees to obtain the number of examinees at each score, i.e. the score-count distribution;
and assigning the score-count distribution into a score array whose length equals the number of examinees, obtaining the score data of each individual subject.
In a specific embodiment, the score data of each individual subject are simulated according to the configuration parameters, the subject configuration parameters and the historical data; they can be simulated from historical data, or purely from theory (generally using a normal distribution).
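The two-step procedure of S200 (theoretical probabilities scaled to examinee counts, then expanded into a score array) can be sketched in Python. The function name and the choice to absorb the rounding error at the modal score are illustrative assumptions for this sketch, not details fixed by the invention:

```python
import math

def simulate_subject_scores(n, mu, sigma, lo=0, hi=100):
    """Build n integer scores whose distribution follows N(mu, sigma^2).

    Step 1: evaluate the normal probability density at every integer score
    and scale by the examinee count n to get the score-count distribution.
    Step 2: correct the rounding error so the counts sum to exactly n
    (absorbed at the modal score here, an illustrative choice).
    Step 3: expand the counts into a flat score array of length n.
    """
    def pdf(x):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    counts = {s: round(pdf(s) * n) for s in range(lo, hi + 1)}
    mode = max(counts, key=counts.get)
    counts[mode] += n - sum(counts.values())  # error correction to total n
    return [s for s, c in counts.items() for _ in range(c)]
```

With average score 65, standard deviation 6.5 and 100,000 examinees, this reproduces the score-count distribution of the worked example below (6065 examinees at score 64, 201 at score 48, and so on).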
For the generated multidimensional data (the per-subject or per-item data sets), the algorithm iterates continuously and repeatedly against a configured, relatively constant correlation coefficient (the Pearson product-moment correlation coefficient), using the least-squares principle (Gauss-Markov theorem), and takes the optimal calculation result as the final simulated data.
S300: performing optimization matching and iterative processing on the per-subject score data using the least-squares principle to obtain the simulated result data.
This comprises:
matching the score data of each individual subject with the examinee information and fixing the matching relation;
calculating the Pearson product-moment correlation coefficient between every pair of subjects using the following formula:

r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}

or

r = \frac{n\sum x_i y_i-\sum x_i\sum y_i}{\sqrt{n\sum x_i^2-(\sum x_i)^2}\,\sqrt{n\sum y_i^2-(\sum y_i)^2}}

calculating the sum of squared errors of all correlation coefficients using the least-squares principle, taking the minimum of \sum(Y_i - Y_j)^2 as the optimization criterion, wherein Y_i corresponds to a calculated Pearson product-moment correlation coefficient R1 and Y_j corresponds to the configured correlation coefficient R;
and iterating repeatedly until the simulated result data are obtained.
In a specific embodiment, during the least-squares iterative processing the exchanged data are not chosen completely at random; the exchanges are optimized by sorting, guided by the definition of the correlation coefficient. This avoids iterating with completely random data exchanges and greatly improves the efficiency of the iterative processing.
For the generated multi-subject simulation data, the per-subject score data are matched to the examinee data by continuous iteration under the least-squares principle (Gauss-Markov theorem), against the configured correlation coefficients (Pearson product-moment correlation coefficients). The main operation during iteration is reordering the data; once iteration reaches the optimal result (minimum error), the resulting data, containing each examinee together with his or her score in every subject, are the final simulated data.
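The matching-by-iteration scheme just described can be sketched minimally in Python. This is a hypothetical illustration, not the patented implementation: it fixes one subject, reorders a second subject's scores by pairwise swaps, and keeps a swap only when it reduces the squared error against the configured Pearson coefficient (the patent additionally directs the swaps via sorting and balances several coefficients at once):

```python
import math
import random

def pearson(x, y):
    """Pearson product-moment correlation coefficient (deviation form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def match_by_swaps(fixed, scores, target_r, iters=5000, seed=1):
    """Reorder `scores` against the fixed subject by pairwise swaps,
    keeping a swap only if it reduces (pearson - target_r)**2."""
    rng = random.Random(seed)
    scores = list(scores)
    err = (pearson(fixed, scores) - target_r) ** 2
    for _ in range(iters):
        i, j = rng.randrange(len(scores)), rng.randrange(len(scores))
        scores[i], scores[j] = scores[j], scores[i]
        new_err = (pearson(fixed, scores) - target_r) ** 2
        if new_err < err:
            err = new_err  # keep the improving swap
        else:
            scores[i], scores[j] = scores[j], scores[i]  # revert
    return scores, err
```

Because only the ordering changes, each subject's marginal score distribution is preserved exactly while the correlation is driven toward the configured target.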
For the final simulated data, relevant analysis is then performed according to the scope and content of the quantitative research to produce the final target results.
According to the teachings of this embodiment, the invention is mainly used for data simulation of large-scale examinations and is applicable to research and analysis related to large-scale examinations (such as entrance examinations for high schools and universities), including qualification examinations, graded examinations, the design of rule standards for talent-selection evaluation, the design of score-assignment modes, and related research fields. It suits large-scale, multi-subject (multi-item) examinations and evaluations. It can be combined with modern measurement theories such as CTT (classical test theory) and IRT (item response theory): one-dimensional and multi-dimensional data generated from a Gaussian (normal) distribution or other methods are iterated repeatedly, against configured correlation coefficients (Pearson product-moment correlation coefficients) and under the least-squares principle (Gauss-Markov theorem), to produce simulated data that better match theory and reality. This effectively improves the quality of the simulation data and thus the level and quality of examinations and tests. The method is easy to implement at low cost and high efficiency with modern IT technology, in particular big data and cloud computing, and improves the means and technical level available for meeting the core requirements of large-scale examinations: quality, fairness and efficiency.
The technical solutions in the embodiments of the present invention will be further clearly and completely described below with reference to practical examples.
Suppose that for a college entrance examination in a certain province, the influence of the number of examinees choosing each subject (or of other factors) on the standard specification of the score-assignment mode must be quantitatively analysed through data simulation. One or more groups of examinee score data must be simulated from the configured initial conditions and parameters, and the total score and converted total score calculated.
Calculating the total score and the converted total score poses no technical problem, and the data simulation of any single subject is theoretically unproblematic. The key is the correlation among the subject scores produced by the final simulation: if correlation is ignored, the correlation coefficients among subjects in the final simulated data will certainly be close to 0, which fits neither theory nor reality; and even if some algorithm is applied, can the final result data be guaranteed optimal? That is basically difficult to guarantee. This key problem is solved by the method of the invention, explained step by step below by example.
First, assume that the historical data are insufficient, or that changes in question sources and policy make simulation from historical data far from reliable, so the simulation is carried out entirely from theory.
Configuration of basic condition parameters:
the number of examinees is 10 ten thousand, and the examinee numbers are assumed to be C000001 to C100000; three public basic disciplines: chinese (YW), math (SX), English (YY); the subjects of six to three include: physical (WL), chemical (HX), biological (SW), historical (LS), geographic (DL), political (ZZ);
subject test paper difficulty setting: assuming that a normal distribution is used to perform data simulation of a single subject, the difficulty and the degree of distinction of the test paper are controlled by an average score and a standard deviation, and the average score of each subject is μ _ yw, μ _ SX, μ _ YY, μ _ WL, μ _ HX, μ _ SW, μ _ LS, μ _ DL, and μ _ ZZ; the standard deviation is: σ _ yw, σ _ SX, σ _ YY, σ _ WL, σ _ HX, σ _ SW, σ _ LS, σ _ DL, σ _ ZZ; as shown in the following table:
Subject       Average score   Standard deviation   Score range (lower)   Score range (upper)
Chinese       μ_YW            σ_YW                 0                     100
Mathematics   μ_SX            σ_SX                 0                     100
English       μ_YY            σ_YY                 0                     100
Physics       μ_WL            σ_WL                 0                     100
Chemistry     μ_HX            σ_HX                 0                     100
Biology       μ_SW            σ_SW                 0                     100
History       μ_LS            σ_LS                 0                     100
Geography     μ_DL            σ_DL                 0                     100
Politics      μ_ZZ            σ_ZZ                 0                     100
Setting of the correlation coefficients:
(The configured Pearson product-moment correlation-coefficient matrix R between the subjects was given here as a table; the original image is not reproduced.)
generating subject performance data according to the parameters:
according to the basic configuration, there are 10 thousands of examinees, and the result data of each subject of 10 thousands of examinees are X _ YW [ ], X _ SX [ ], X _ YY [ ], X _ WL [ ], X _ HX [ ], X _ SW [ ], X _ LS [ ], X _ DL [ ], X _ ZZ [ ], each array contains 10 thousands of result data, the examinee list data is KS [ ] array, and 10 thousands of examination numbers of C000001 to C100000 are stored;
probability density function according to normal distribution:
Figure BDA0002189932450000071
wherein, the theoretical distribution probability of each point value can be calculated by substituting sigma into the standard deviation and mu into the average point, and then the number of people is multiplied by 10 ten thousands of sample numbers, so that the number of people distributed in each point can be obtained, wherein, the average point 65 and the standard deviation 6.5 are used for simulation calculation, and the result of the number distribution of the achievements is as follows (the number which is not listed is 0):
Score  Count   Score  Count   Score  Count   Score  Count   Score  Count
0      0       48     201     61     5079    74     2353    87     20
…      0       49     297     62     5516    75     1879    88     12
37     1       50     428     63     5854    76     1466    89     7
38     1       51     603     64     6065    77     1117    90     4
39     2       52     831     65     6138    78     831     91     2
40     4       53     1117    66     6065    79     603     92     1
41     7       54     1466    67     5854    80     428     93     1
42     12      55     1879    68     5517    81     296     94     0
43     20      56     2353    69     5079    82     201     …      0
44     33      57     2878    70     4566    83     133     100    0
45     54      58     3437    71     4008    84     86
46     86      59     4008    72     3437    85     54
47     133     60     4566    73     2878    86     33
Score-count distribution table calculated from the normal distribution
The distribution curve is a standard normal shape. Note that precision errors may arise in the calculation, so the final total can deviate slightly from 100,000; an error correction is applied to bring the total to exactly 100,000. Once the score-count distribution has been calculated, the scores can be assigned, according to their counts, into the 100,000-element score arrays.
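Under the stated assumptions (average score 65, standard deviation 6.5, 100,000 examinees), the table above can be reproduced directly from the normal probability density function; the helper name below is illustrative:

```python
import math

# Worked-example parameters from the text above.
mu, sigma, n = 65.0, 6.5, 100000

def count_at(score):
    """Theoretical examinee count at one integer score value."""
    density = math.exp(-((score - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return round(density * n)

print(count_at(65), count_at(64), count_at(48), count_at(43))  # 6138 6065 201 20

# The rounded counts do not sum to exactly 100000, which is why the text
# applies an error correction afterwards.
total = sum(count_at(s) for s in range(0, 101))
```

The residual |total − 100000| stays small (bounded by half a unit per score value), matching the "small error" the text says must be corrected.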
Following this method and these steps, the score data of all subjects are calculated and simulated separately. At this point the single-subject simulation data exist but have not yet been matched to the examinees. Generally a random matching would be used; if the data were matched uniformly after sorting, the correlation coefficient between subjects would be 1 or -1, and if matched randomly it would be approximately 0. Neither fits theory or reality. The next step is therefore the core processing stage of the method.
Data optimization matching and iteration using the least-squares principle
The mathematical justification of the least-squares principle follows from the Gauss-Markov theorem and is omitted here; the processing of this step is illustrated with the actual data calculation and iterative flow.
First, the data of one subject are matched to the examinee data and the matching relation between that subject's scores and the examinees is fixed. Assuming Chinese is the fixed subject, the simulated result data can be generated as follows:
(Table of examinee numbers and their fixed Chinese scores; the original image is not reproduced.)
the correlation coefficients between all disciplines were calculated:
Figure BDA0002189932450000073
Figure BDA0002189932450000081
here, the correlation coefficient (pearson product-moment correlation coefficient) is calculated as follows:
Figure BDA0002189932450000082
or
Figure BDA0002189932450000083
After all correlation coefficient values R1 are calculated, the sum of squares of errors of all correlation coefficients is calculated using the principle of least squares, and Σ (Y) is usedi-Uj)2]Minimum as the optimal criterion, where Yi is equivalent to R1 (calculated correlation coefficient) and Yj is equivalent to R (configured correlation coefficient);
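The two forms of the coefficient above are algebraically equivalent; a small sketch showing both, together with the least-squares criterion (function names are illustrative):

```python
import math

def pearson_deviation_form(x, y):
    """r = sum((xi - mx)(yi - my)) / sqrt(sum((xi - mx)^2) * sum((yi - my)^2))"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def pearson_raw_form(x, y):
    """Algebraically equivalent raw-sums form of the same coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

def sse(calculated, configured):
    """Least-squares criterion: sum of (Y_i - Y_j)^2 over all subject pairs."""
    return sum((a - b) ** 2 for a, b in zip(calculated, configured))
```

The raw-sums form needs only one pass over the data, which matters when the coefficient is recomputed on every iteration over 100,000-element arrays.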
then, through repeated iteration, firstly, the score sequence of other disciplines of the examinee except the Chinese discipline (the fixed subject) is modified, then the correlation coefficients of all the disciplines are calculated, and according to the optimal criterion, repeated iteration is carried out, and finally optimized data is obtained, namely the final result data of simulation;
it should be noted that there is an extremely important issue to consider when performing the iteration: efficiency; if the algorithm is not optimized, the iteration efficiency is very low and very poor, so that a certain algorithm design is pertinently adopted to optimize the whole iteration process, and the low efficiency of conventional random processing is avoided;
the optimization algorithm of the specific iterative process is as follows:
(1) because a plurality of groups of correlation coefficients exist, the correlation coefficient with the largest error is found out firstly for optimization processing;
(2) after the error of the correlation coefficient of a certain group of data is reduced to a certain degree, when the error is not maximum, the correlation coefficients of another group or two disciplines are adjusted until the errors of all the correlation coefficients are kept at a relatively balanced level;
(3) in order to carry out iteration most efficiently, when subject result data is exchanged, the adjustment is carried out at a destination after the adjustment direction is determined, and the adjustment is avoided by adopting a completely random mode, and the specific method comprises the following steps: one group of data is sorted and fixed (assumed to be in ascending order here), the other group of data is also sorted in ascending order and marked in the order, and when data exchange is carried out and the correlation coefficient needs to be improved, the high-score data arranged in the front is exchanged with the low-score data arranged in the back; when the correlation coefficient needs to be reduced, the operation is performed vice versa, so that the targeted operation is pointed, and the efficiency of iterative processing can be greatly improved;
4. after the simulation of the performance data is completed, subsequent data analysis processing, such as influence of the selection data and the like, can be performed according to research needs, which are not within the scope of the method and are not described here.
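Heuristic (3) above can be sketched as follows: with the fixed subject sorted ascending, a targeted swap moves a high score that sits early to a later position (raising the correlation) or the reverse (lowering it). The function names are illustrative assumptions:

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient (deviation form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def directed_swap(scores, increase=True):
    """One directed exchange against a fixed, ascending-sorted reference:
    to raise the correlation, swap a high score placed early with a lower
    score placed later; to lower it, the reverse. Returns True if a swap
    was made."""
    n = len(scores)
    for i in range(n):
        for j in range(n - 1, i, -1):
            if (scores[i] > scores[j]) == increase:
                scores[i], scores[j] = scores[j], scores[i]
                return True
    return False
```

Each directed swap moves the ordering toward (or away from) the fixed subject's ordering, which is why this avoids the wasted steps of fully random exchanges.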
Please refer to fig. 2.
An embodiment of the present invention further provides a large-scale examination data simulation apparatus, including:
the basic condition parameter input unit 10 is configured to input basic condition parameters, where the basic condition parameters include examinee information, subject information, and historical examination data.
The examinee information comprises the number of examinees, the examinee numbers and the subject-selection information; the subject information comprises the basic information, score range and difficulty parameters of each subject's test paper.
In particular embodiments, the apparatus targets large multi-subject or multi-item examinations and assessments. The data simulation for a single subject or item can draw on multiple sources: it can be generated from historical data (detail records or statistics), or, when no historical data exist, from a standard normal distribution function or another distribution function (such as a beta distribution) or method. For the generated multidimensional data (the per-subject or per-item data sets), a specific algorithm changes the data attribution (ordering) of each single-dimensional data set, so that a multidimensional data set that originally had no correlation (or a correlation far from the theoretical or actual expectation) better matches practical requirements and expectations.
The score data calculation unit 20 for each individual subject is used for calculating the score data of each individual subject by using the probability density function of the normal distribution according to the basic condition parameters.
Wherein the calculating of the score data of each individual subject according to the basic condition parameters, using the probability density function of the normal distribution, comprises the following steps:
calculating the theoretical distribution probability of each score value by using the probability density function of the normal distribution according to the basic condition parameters, and multiplying it by the number of examinees to obtain the head count at each score, namely the score-count distribution;
and assigning the score-count distribution to a score array whose length equals the number of examinees, to obtain the score data of each individual subject.
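The two listed steps can be sketched as follows. The discretisation to integer scores, the normalisation over the valid range, and the repair of rounding drift are assumptions made for the sketch; the patent states only the pdf-times-examinee-count construction.

```python
import numpy as np

def score_array_from_normal(n_examinees, full_score, mean, std):
    """Theoretical probability of every integer score from the normal pdf,
    multiplied by the examinee count to get the score-count distribution,
    then expanded into a score array whose length equals the number of
    examinees."""
    scores = np.arange(full_score + 1)
    pdf = np.exp(-0.5 * ((scores - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    probs = pdf / pdf.sum()                        # normalise over 0..full_score
    counts = np.round(probs * n_examinees).astype(int)
    counts[np.argmax(counts)] += n_examinees - counts.sum()  # fix rounding drift
    return np.repeat(scores, counts)
```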
In a specific embodiment, the score data of each individual subject are simulated according to the configuration parameters, the subject configuration parameters and the historical data; they can be simulated from historical data or purely from theory (generally using a normal distribution).
For the generated multidimensional data (each single-subject or single-item data group), the algorithm iterates repeatedly against a configured, constant correlation coefficient (the Pearson product-moment correlation coefficient), using the principle of the least square method (Gauss-Markov theorem), and takes the optimal calculation result as the final simulated data.
The simulation unit 30 is used for performing optimization matching and iterative processing on the score data of each individual subject by using the principle of the least square method to obtain the simulated result data.
Wherein the performing of optimization matching and iterative processing on the score data of each individual subject by using the least square principle to obtain the simulated result data comprises the following steps:
matching the score data of a single subject with the examinee information, and fixing the matching relation;
calculating the Pearson product-moment correlation coefficient between all subjects by adopting the following formula:
R = Σ(Xi − X̄)(Yi − Ȳ) / √( Σ(Xi − X̄)² · Σ(Yi − Ȳ)² )
or the equivalent computational form
R = ( nΣXiYi − ΣXiΣYi ) / √( (nΣXi² − (ΣXi)²)(nΣYi² − (ΣYi)²) )
Calculating the sum of squared errors of all correlation coefficients by using the principle of least squares, with the minimum of Σ(Yi − Yj)² as the optimality criterion, wherein Yi corresponds to the calculated Pearson product-moment correlation coefficient R1, and Yj corresponds to the configured correlation coefficient R;
and repeating iteration until the result data of the simulation is obtained.
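The steps above can be condensed into a greedy accept/reject loop. This is a sketch, not the patent's exact algorithm: the pairwise-swap proposal, the stopping tolerance, and the function signature are assumptions.

```python
import numpy as np

def iterate_to_target(fixed, other, target_r, max_iter=20000, tol=1e-4, seed=0):
    """Greedy iteration over pairwise swaps: a swap of two scores in `other`
    is kept only if it reduces the squared error (R1 - R)**2 between the
    computed Pearson coefficient and the configured coefficient."""
    rng = np.random.default_rng(seed)
    other = other.copy()
    err = (np.corrcoef(fixed, other)[0, 1] - target_r) ** 2
    for _ in range(max_iter):
        if err < tol:
            break
        i, j = rng.choice(len(other), size=2, replace=False)
        other[i], other[j] = other[j], other[i]
        new_err = (np.corrcoef(fixed, other)[0, 1] - target_r) ** 2
        if new_err < err:
            err = new_err                            # keep the swap
        else:
            other[i], other[j] = other[j], other[i]  # revert it
    return other, err
```

Because only the ordering of `other` changes, the simulated marginal score distribution of each subject is preserved while the cross-subject correlation converges toward the configured value.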
In a specific embodiment, when the least square method is used for the iterative processing, the exchanged data are not chosen completely at random; instead, the exchange is optimized by sorting, in combination with the definition of the correlation coefficient. This avoids the iteration cost of a completely random exchange scheme and greatly improves the efficiency of the iterative processing.
For the generated multi-subject simulation data, the groups of single-subject score data are matched to the examinee data by continuous iteration, using the least-squares principle (Gauss-Markov theorem) against the configured correlation coefficient (Pearson product-moment correlation coefficient). During the iteration the main operation is reordering of the data; once the iteration reaches the optimal result (minimum error), the resulting data, containing the examinees and the scores of each subject, are the final simulated data.
For the final simulated data, relevant analysis is performed according to the scope and content of the related quantitative research to generate the final target result.
According to the above, the embodiment of the invention is mainly used for data simulation of large-scale examinations and is suitable for research and analysis related to such examinations (for example, high school and middle school examinations), including qualification examinations, graded examinations, the design of rules and standards for talent-selection evaluation, the design of score-assignment modes, and related research fields. It is suitable for large-scale multi-subject (multi-item) examinations and assessments. It can be combined with modern measurement theories such as CTT (classical test theory) and IRT (item response theory): one-dimensional and multidimensional data generated from the Gaussian (normal) distribution or other methods are iterated repeatedly, combined with correlation coefficients (Pearson product-moment correlation coefficients), on the basis of the least square method (Gauss-Markov theorem), so as to simulate data that better accord with theory and reality. This effectively improves the quality of the simulated data and, in turn, the level and quality of examination and testing. The method is easy to realize with modern IT technology, particularly big data, cloud computing and related technologies, at low cost and high efficiency, and effectively improves the means and technical level for addressing the core requirements of large-scale examinations: quality, fairness and efficiency.
An embodiment of the present invention further provides a big data statistics sampling server, including:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the large-scale examination data simulation method described above.
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored computer program, and when the computer program runs, the apparatus on which the storage medium is located is controlled to execute the large-scale examination data simulation method described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A large-scale examination data simulation method is characterized by comprising the following steps:
inputting basic condition parameters, wherein the basic condition parameters comprise examinee information, subject information and historical examination data;
according to the basic condition parameters, calculating and obtaining the result data of each individual subject by utilizing a normally distributed probability density function;
and performing optimization matching and iterative processing on the score data of each individual subject by using the least square principle to obtain the simulated result data.
2. The large-scale examination data simulation method according to claim 1, wherein the examinee information comprises the number of examinees, the examinee numbers and the subject-selection information; the subject information comprises the basic information, score range and difficulty parameters of each individual test paper.
3. The large-scale examination data simulation method according to claim 1, wherein the calculating of the performance data of each individual subject according to the basic condition parameters by using a normally distributed probability density function comprises:
calculating the theoretical distribution probability of each score value by using the probability density function of the normal distribution according to the basic condition parameters, and multiplying it by the number of examinees to obtain the head count at each score, namely the score-count distribution;
and assigning the score-count distribution to a score array whose length equals the number of examinees, to obtain the score data of each individual subject.
4. The large-scale examination data simulation method according to claim 1, wherein the performing of optimization matching and iterative processing on the score data of each individual subject using the least square principle to obtain the simulated result data comprises:
matching the score data of a single subject with the examinee information, and fixing the matching relation;
calculating the correlation coefficient of Pearson product moments among all disciplines by adopting the following formula:
R = Σ(Xi − X̄)(Yi − Ȳ) / √( Σ(Xi − X̄)² · Σ(Yi − Ȳ)² )
calculating the sum of squared errors of all correlation coefficients by using the principle of least squares, with the minimum of Σ(Yi − Yj)² as the optimality criterion, wherein Yi corresponds to the calculated Pearson product-moment correlation coefficient R1, and Yj corresponds to the configured correlation coefficient R;
and repeating iteration until the result data of the simulation is obtained.
5. A large-scale examination data simulation apparatus, comprising:
the basic condition parameter input unit is used for inputting basic condition parameters, wherein the basic condition parameters comprise examinee information, subject information and historical examination data;
the achievement data calculation unit of each individual subject is used for measuring and calculating achievement data of each individual subject by utilizing a probability density function of normal distribution according to the basic condition parameters;
and the simulation unit is used for performing optimization matching and iterative processing on the score data of each individual subject by using the principle of the least square method to obtain the simulated result data.
6. The large-scale examination data simulation device according to claim 5, wherein the examinee information comprises the number of examinees, the examinee numbers and the subject-selection information; the subject information comprises the basic information, score range and difficulty parameters of each individual test paper.
7. A big data statistics sampling server, comprising:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the large-scale examination data simulation method according to any one of claims 1 to 4.
8. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls a device on which the storage medium is located to perform the large-scale examination data simulation method according to any one of claims 1 to 4.
CN201910830105.2A 2019-09-03 2019-09-03 Large-scale examination data analog simulation method, device and storage medium Pending CN110705025A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910830105.2A CN110705025A (en) 2019-09-03 2019-09-03 Large-scale examination data analog simulation method, device and storage medium


Publications (1)

Publication Number Publication Date
CN110705025A true CN110705025A (en) 2020-01-17

Family

ID=69193533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910830105.2A Pending CN110705025A (en) 2019-09-03 2019-09-03 Large-scale examination data analog simulation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110705025A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191002A (en) * 2021-05-04 2021-07-30 河南环球优路教育科技有限公司 Examination simulation method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination