CN109145113B

CN109145113B - Student poverty degree prediction method based on machine learning

Info

Publication number: CN109145113B
Application number: CN201810972342.8A
Authority: CN
Inventors: 陈岩; 俞跃舒
Original assignee: Beijing Taohuadao Information Technology Co ltd
Current assignee: Beijing Taohuadao Information Technology Co ltd
Priority date: 2018-08-24
Filing date: 2018-08-24
Publication date: 2021-12-21
Anticipated expiration: 2038-08-24
Also published as: CN109145113A

Abstract

The invention relates to a machine-learning-based method for predicting the degree of student poverty. The method obtains student-related data from multiple channels, parses the data, calculates the characteristic values relevant to student poverty, fills in missing values, standardizes the data so that it maps to a fixed interval, clusters the data into several classes with a fast clustering algorithm using Euclidean distance, and calculates the importance of each class to the poverty evaluation. The matrix formed by each group of classified data is then partitioned into blocks according to correlation, and a comprehensive poverty score is calculated from the block matrix. The score can serve as a reference when deciding subsidy amounts for poor students: the higher the score, the poorer the student and the greater the need for subsidy. The invention also provides schemes for rapidly identifying abnormally poor students and for screening the causes of poverty from the data.

Description

Student poverty degree prediction method based on machine learning

Technical Field

The invention belongs to the technical field of big data application, and particularly relates to a student poverty degree prediction method based on machine learning.

Background

At present, most colleges and universities assess students' family economic circumstances mainly through family economic questionnaires completed by the students, combined with the overall impressions reported by teachers and classmates. The poverty level of subsidized students cannot be quantified, so subsidies inevitably tend toward equal shares regardless of need.

With the development of big data technology, the global backbone communication network transmits tens of thousands of terabytes of data every day, and each person's behavior is recorded in many forms of data. Students generate data in every aspect of campus life, and this data records their various characteristics. It can reflect students' real circumstances and, used reasonably, can to some extent solve the problem of equal-share subsidies for poor students, directing more help to students who are genuinely in need.

Disclosure of Invention

The invention aims to provide a machine-learning-based method for predicting the degree of student poverty, in order to overcome the tendency to distribute poverty subsidies in equal shares.

The invention realizes the purpose through the following technical scheme:

A student poverty degree prediction method based on machine learning comprises the following steps:

step 1, acquiring data related to poverty of students;

step 2, analyzing data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;

step 3, finding and filling missing data;

step 4, carrying out standardization processing on the original structured data to enable the result values to be uniformly mapped to a fixed interval;

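step 5, clustering the data into k classes with a fast clustering algorithm using Euclidean distance:

let the set of k initial cluster points be

$$L^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}\}$$

and denote

$$G_i^{(0)} = \{x : d(x, x_i^{(0)}) \le d(x, x_j^{(0)}),\ j = 1, 2, \ldots, k,\ j \ne i\}, \quad i = 1, 2, \ldots, k,$$

where $d$ is the Euclidean distance. This divides the samples into k disjoint classes and gives the initial classification

$$G^{(0)} = \{G_1^{(0)}, G_2^{(0)}, \ldots, G_k^{(0)}\}.$$

Starting from the initial classification $G^{(0)}$, compute a new set of cluster points $L^{(1)}$ by calculating

$$x_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l, \quad i = 1, 2, \ldots, k,$$

where $n_i$ is the number of samples in $G_i^{(0)}$, which gives the new set

$$L^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_k^{(1)}\}.$$

Starting from $L^{(1)}$, classify again, denoting

$$G_i^{(1)} = \{x : d(x, x_i^{(1)}) \le d(x, x_j^{(1)}),\ j = 1, 2, \ldots, k,\ j \ne i\}, \quad i = 1, 2, \ldots, k,$$

to get a new classification

$$G^{(1)} = \{G_1^{(1)}, G_2^{(1)}, \ldots, G_k^{(1)}\}.$$

Repeating the above steps m times gives

$$G^{(m)} = \{G_1^{(m)}, G_2^{(m)}, \ldots, G_k^{(m)}\},$$

where $x_i^{(m)}$ is the center of gravity of class $G_i^{(m-1)}$. As m gradually increases, the classification tends to stabilize, and $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e.

$$x_i^{(m+1)} \approx x_i^{(m)}, \quad G_i^{(m+1)} \approx G_i^{(m)}.$$

At this point the calculation is finished. Alternatively, if for a certain m

$$G^{(m+1)} = \{G_1^{(m+1)}, \ldots, G_k^{(m+1)}\} \quad \text{and} \quad G^{(m)} = \{G_1^{(m)}, \ldots, G_k^{(m)}\}$$

are the same, the calculation is ended;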

step 6, calculating, for each class obtained by clustering, the relative importance of each factor in that class's evaluation factor system to the evaluation objective, and checking the calculation result to ensure the reliability of the evaluation conclusion;

step 6.1, analyzing the correlation coefficient matrix of the summarized poverty-factor data: two indexes with the same direction (both positive indexes or both negative indexes) should be positively correlated, with a correlation coefficient greater than zero; two indexes with opposite directions (one positive index and one negative index) should be negatively correlated, with a correlation coefficient less than zero; if these conditions are satisfied, proceeding directly to step 7 to calculate the score, and if not, proceeding to step 6.2;

step 6.2, separating the pairs of indexes whose correlation coefficients violate this principle to obtain a block matrix, performing principal component analysis on each block separately, and, if the index weight coefficients of a block still do not satisfy the principle, repeating step 6.1 to obtain the final block matrix;

step 7, calculating the principal component comprehensive evaluation score:

let q be the number of blocks in the final block matrix obtained in step 6, where the $i$-th block ($i = 1, \ldots, q$) contains $a_i$ index items and has weight $w_i$;

the composite score is

$$S = \sum_{i=1}^{q} (-1)^{t} w_i S_i,$$

where $S_i$ is the score of the $i$-th block, with $t = 1$ if the $i$-th block is negatively correlated with the total score and $t = 0$ if it is positively correlated;

$S_i$ is solved as follows: compute the correlation coefficient matrix $R$ of the standardized matrix $\tilde{A}$, where $r_{ij}$ ($i = 1, \ldots, n$, $j = 1, \ldots, m$) is the element in row $i$ and column $j$ of $R$, and solve for the eigenvalues $\lambda_j$ ($j = 1, 2, \ldots, s$, where $s$ is the total number of eigenvalues) and eigenvectors $v_j$ of the correlation matrix;

calculate the contribution rate

$$b_j = \frac{\lambda_j}{\sum_{k=1}^{s} \lambda_k}, \quad j = 1, 2, \ldots, s,$$

and arrange the components in order of eigenvalue size. The eigenvectors form the new index variables:

$$y_1 = v_{11}\tilde{x}_1 + v_{21}\tilde{x}_2 + \cdots + v_{m1}\tilde{x}_m, \quad \ldots, \quad y_s = v_{1s}\tilde{x}_1 + v_{2s}\tilde{x}_2 + \cdots + v_{ms}\tilde{x}_m,$$

where $\tilde{x}_1, \ldots, \tilde{x}_m$ are the standardized index variables, $y_1$ is the first principal component and $y_s$ is the $s$-th principal component; calculate the principal component comprehensive evaluation score

$$S_i = \sum_{j=1}^{s} b_j y_j,$$

where $b_j$ is the information contribution rate of the $j$-th principal component.

The scores S of all students are then sorted; a higher S indicates a greater degree of poverty, and the value of S is used as a reference for allocating subsidies.
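As an illustration of step 7, the sketch below computes contribution rates from a block's eigenvalues, forms a block score as the contribution-weighted sum of principal components, and combines block scores into a composite score before ranking. It is a minimal sketch under stated assumptions, not the patented implementation: the block weight is assumed here to be proportional to the number of index items $a_i$, the sign of each block is taken as $(-1)^t$, and all function names, student names and numbers are hypothetical.

```python
def contribution_rates(eigenvalues):
    """b_j = lambda_j / sum(lambda); eigenvalues are expected in descending order."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]


def block_score(eigenvalues, components):
    """S_i = sum_j b_j * y_j for one student within one block."""
    rates = contribution_rates(eigenvalues)
    return sum(b * y for b, y in zip(rates, components))


def composite_score(blocks):
    """Combine block scores; each block is (a_i, S_i, negatively_correlated)."""
    total_items = sum(a for a, _, _ in blocks)
    score = 0.0
    for a, s, negative in blocks:
        weight = a / total_items          # ASSUMPTION: weight proportional to a_i
        sign = -1.0 if negative else 1.0  # (-1)^t, with t = 1 for negative correlation
        score += sign * weight * s
    return score


# Hypothetical student with two blocks of 3 and 1 index items.
s1 = block_score([3.0, 1.0], [2.0, -1.0])
s2 = block_score([1.5, 0.5], [1.0, -1.0])
student_score = composite_score([(3, s1, False), (1, s2, True)])

# Rank hypothetical students by descending composite score.
scores = {"student_a": student_score, "student_b": 0.2, "student_c": 1.4}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A real pipeline would obtain the eigenvalues and principal components from an eigendecomposition of each block's correlation matrix; they are passed in directly here to keep the example small.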

The method is further improved in that it also comprises extracting the students' on-campus internet access data, including browsing content, the model of electronic device used, the place of access, the types of websites browsed and the time spent online; the browsing information extracted includes data on studying abroad, tourism, IELTS training and New Oriental training.

Preferably, acquiring data in the method includes extracting basic information data of students, including name, native place, ethnic group, health condition, political status, whether the student entered the school through the green channel, whether the student has taken out a loan from the place of origin, whether the student has taken out a campus loan, whether the student receives a subsidy, what rewards or subsidies were received during university, mode of admission, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position, and other information about the guardian;

also comprising extracting student family information data, including: whether the student is an only child; whether the student is a left-behind child; whether the student is a child of migrant workers who moved with them to the city; whether the family receives the minimum living allowance; whether the student is the child of a martyr or of a family entitled to preferential treatment; whether the student is an orphan; whether the family is a registered poor household; whether the family receives the urban or rural minimum living guarantee; whether the family is classed as extremely poor in urban or rural areas; whether the family is a low-income family; whether the family is a rural five-guarantee household; the student's residency status; whether the student is disabled; whether the student is seriously ill; whether a parent is disabled; whether a parent is seriously ill; the main source of family income; the family's annual income; whether the family is a single-parent family; whether the student is different; whether a parent is being nursed; whether the family has lost a member; whether the family is in a poor county; whether the family is in a county; whether the school is in a mountainous area; whether a student aid loan has been taken out; housing conditions; sudden unexpected events in the family; unemployment among family members; the amount of family debt; the reason for the debt; subsidies the student has already received in the school year; the father's occupation; the mother's occupation; whether the family owns a house in town; and the family's annual medical expenditure;

extracting campus one-card consumption data and calculating features, including: total consumption amount; maximum monthly consumption amount; total number of transactions; average daily consumption amount; average daily number of transactions; number of days with consumption; canteen consumption amount; total number of canteen transactions; maximum number of canteen transactions; daily canteen consumption amount; maximum daily canteen consumption amount; daily number of canteen transactions; supermarket consumption amount; number of supermarket transactions; medical consumption amount; number of medical transactions; number of hot-water purchases; book consumption amount; number of book purchases; daily breakfast, lunch and dinner consumption amounts and their variances; total monthly consumption amount; total monthly breakfast, lunch and dinner consumption amounts and numbers of transactions; canteen consumption ratio; supermarket consumption ratio; fruit consumption ratio; and the deviation of total consumption on holidays;

extracting student grade data, including course name, course number, term in which the course is offered, class hours, score, examination score and regular coursework score, and calculating grade point, term average score, academic-year average score and number of failed courses;

further comprising extracting library data, including book title, book number, book type, borrowing time, borrowing place, return time and return place, and calculating features from the data: number of loans, borrowing frequency, loan duration, average loan duration, average number of loans per year, deviation of borrowing time, and number of books borrowed on holidays;

further comprising extracting access control and electricity consumption data, including card swipe time, card swipe place, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory electricity consumption, per-person dormitory electricity consumption and school-wide per-student electricity consumption;

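preferably, the standardization in step 4 is performed as follows: each index value $a_{ij}$ is converted into $a'_{ij}$, where

$$a'_{ij} = \frac{a_{ij} - u_j}{s_j}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, m,$$

and $u_j$ and $s_j$ are the mean and standard deviation of the $j$-th index.

The transformed matrix, recorded as $\tilde{A} = (a'_{ij})_{n \times m}$, is the result of the standardization.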
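The standardization above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes $u_j$ is the column mean and $s_j$ the sample standard deviation of the $j$-th index, and the function name, the zero-variance handling and the sample data are hypothetical.

```python
from statistics import mean, stdev


def standardize(matrix):
    """Column-wise standardization: a'_ij = (a_ij - u_j) / s_j."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stds = [stdev(col) for col in columns]  # ASSUMPTION: sample standard deviation
    return [
        # A constant column (s_j == 0) is mapped to 0.0 to avoid division by zero.
        [(value - u) / s if s else 0.0 for value, u, s in zip(row, means, stds)]
        for row in matrix
    ]


# Hypothetical data: rows are students, columns are two indexes.
data = [[1200.0, 3.1], [800.0, 2.4], [1500.0, 3.8]]
standardized = standardize(data)
```

After the transformation every column has mean zero and unit standard deviation, so indexes measured in different units become comparable.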

After step 4, the data are arranged in ascending order. Let the number of data values in one class be $n$, and denote the value ranked $k$-th by $x_{(k)}$ ($1 \le k \le n$).

The $p$ quantile is calculated as

$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \frac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \tag{1}$$

where $[np]$ denotes the integer part of $np$. From equation (1), calculate the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the student to whom such data belongs is listed as abnormally poor. An abnormally poor student departs severely from the population in some respect and requires special treatment.
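The quantile rule and the two cut-off points can be sketched as follows. This is a minimal sketch that assumes the piecewise quantile definition ($x_{([np]+1)}$ when $np$ is not an integer, otherwise the mean of $x_{(np)}$ and $x_{(np+1)}$) and requires $0 < p < 1$; the function names and the sample figures are hypothetical.

```python
import math


def p_quantile(values, p):
    """M_p for 0 < p < 1, using 1-based order statistics x_(k)."""
    xs = sorted(values)
    np_ = len(xs) * p
    nearest = round(np_)
    if math.isclose(np_, nearest):
        # np is an integer: mean of x_(np) and x_(np+1).
        return (xs[nearest - 1] + xs[nearest]) / 2
    # np is not an integer: x_([np]+1), which is index [np] in 0-based terms.
    return xs[math.floor(np_)]


def cutoff_points(values):
    """Return (lower, upper) = (M_0.25 - 1.5R, M_0.75 + 1.5R)."""
    m25 = p_quantile(values, 0.25)
    m75 = p_quantile(values, 0.75)
    r = m75 - m25
    return m25 - 1.5 * r, m75 + 1.5 * r


def abnormal_values(values):
    """Values below the lower cut-off or above the upper cut-off."""
    lower, upper = cutoff_points(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical monthly canteen spending for eight students.
monthly = [310, 295, 320, 305, 300, 315, 290, 900]
flagged = abnormal_values(monthly)
```

Because the check works on one sorted column at a time, it can be applied to any single feature without requiring the other features to be complete.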

To extract the causes of poverty, after step 2 the text data is extracted and processed with natural language processing (NLP): the snownlp library in Python is called to perform text word segmentation, named entity recognition and syntactic analysis, the described objects and their features are extracted from the text, and the results are output as a table so that staff can quickly see the causes of each student's poverty.

The invention has the beneficial effects that:

1) the method uses a large amount of data when evaluating the degree of poverty, which avoids the one-sidedness of evaluation based on a single data source;

2) the invention adopts machine learning to process data, which can overcome the influence of human subjective factors;

3) the invention also provides a method for rapidly filtering the data of abnormally poor students, so that such students can be identified and given special treatment.

Detailed Description

The present application is described in further detail below, and it should be noted that the following detailed description is provided for illustrative purposes only, and is not intended to limit the scope of the present application, which is defined by the appended claims.

Example 1

A student poverty degree prediction method based on machine learning, characterized by comprising the following steps:

step A1, acquiring data: the data comprises poverty-relief data, civil affairs department data, student aid system data, student status system data, academic affairs system data, campus one-card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data, and a database is established for storage;

acquiring the data comprises extracting basic information data of students, including name, native place, ethnic group, health condition, political status, whether the student entered the school through the green channel, whether the student has taken out a loan from the place of origin, whether the student has taken out a campus loan, whether the student receives a subsidy, what rewards or subsidies were received during university, mode of admission, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position and other information;

the method also comprises extracting the students' on-campus internet access data, including browsing content, the model of electronic device used, the place of access, the types of websites browsed and the time spent online; the browsing information extracted includes data on studying abroad, tourism, IELTS training and New Oriental training, and when two students have the same poverty score in step A7, the student with higher consumption in these areas is regarded as less poor.

Step A2, analyzing data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;

step A3, finding and filling missing data, using different filling strategies depending on the type of missingness, including mean filling, interpolation and fitting, to complete the initialization of the missing data; these strategies are conventional techniques and are not described further;

step A4, standardizing the original structured data to make the result value uniformly mapped to a fixed interval;

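each index value $a_{ij}$ is converted into $a'_{ij}$, where

$$a'_{ij} = \frac{a_{ij} - u_j}{s_j}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, m,$$

and $u_j$ and $s_j$ are the mean and standard deviation of the $j$-th index.

The transformed matrix, recorded as $\tilde{A} = (a'_{ij})_{n \times m}$, is the result of the standardization.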

The data are arranged in ascending order. Let the number of data values in one class be $n$, and denote the value ranked $k$-th by $x_{(k)}$ ($1 \le k \le n$).

The $p$ quantile is calculated as

$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \frac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \tag{1}$$

where $[np]$ denotes the integer part of $np$. From equation (1), calculate the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the student to whom such data belongs is listed as abnormally poor.

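Step A5, clustering the data into k classes with the fast clustering algorithm using Euclidean distance:

let the set of k initial cluster points be

$$L^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}\}$$

and denote

$$G_i^{(0)} = \{x : d(x, x_i^{(0)}) \le d(x, x_j^{(0)}),\ j = 1, 2, \ldots, k,\ j \ne i\}, \quad i = 1, 2, \ldots, k,$$

where $d$ is the Euclidean distance. This divides the samples into k disjoint classes and gives the initial classification

$$G^{(0)} = \{G_1^{(0)}, G_2^{(0)}, \ldots, G_k^{(0)}\}.$$

Starting from the initial classification $G^{(0)}$, compute a new set of cluster points $L^{(1)}$ by calculating

$$x_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l, \quad i = 1, 2, \ldots, k,$$

where $n_i$ is the number of samples in $G_i^{(0)}$, which gives the new set

$$L^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_k^{(1)}\}.$$

Starting from $L^{(1)}$, classify again, denoting

$$G_i^{(1)} = \{x : d(x, x_i^{(1)}) \le d(x, x_j^{(1)}),\ j = 1, 2, \ldots, k,\ j \ne i\}, \quad i = 1, 2, \ldots, k,$$

to get a new classification

$$G^{(1)} = \{G_1^{(1)}, G_2^{(1)}, \ldots, G_k^{(1)}\}.$$

Repeating the above steps m times gives

$$G^{(m)} = \{G_1^{(m)}, G_2^{(m)}, \ldots, G_k^{(m)}\},$$

where $x_i^{(m)}$ is the center of gravity of class $G_i^{(m-1)}$. As m gradually increases, the classification tends to stabilize, and $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e.

$$x_i^{(m+1)} \approx x_i^{(m)}, \quad G_i^{(m+1)} \approx G_i^{(m)}.$$

At this point the calculation is finished. Alternatively, if for a certain m

$$G^{(m+1)} = \{G_1^{(m+1)}, \ldots, G_k^{(m+1)}\} \quad \text{and} \quad G^{(m)} = \{G_1^{(m)}, \ldots, G_k^{(m)}\}$$

are the same, the calculation is ended;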

step A6, calculating, for each class obtained by clustering, the relative importance of each factor in that class's evaluation factor system to the evaluation objective, and checking the calculation result to ensure the reliability of the evaluation conclusion;

step A6.1, analyzing the correlation coefficient matrix of the summarized poverty-factor data: two indexes with the same direction (both positive indexes or both negative indexes) should be positively correlated, with a correlation coefficient greater than zero; two indexes with opposite directions (one positive index and one negative index) should be negatively correlated, with a correlation coefficient less than zero; if these conditions are satisfied, proceeding directly to step A7 to calculate the score, and if not, proceeding to step A6.2;

step A6.2, separating the pairs of indexes whose correlation coefficients violate this principle to obtain a block matrix, performing principal component analysis on each block separately, and, if the index weight coefficients of a block still do not satisfy the principle, repeating step A6.1 to obtain the final block matrix;

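step A7, calculating the principal component comprehensive evaluation score:

let q be the number of blocks in the final block matrix obtained in step A6, where the $i$-th block ($i = 1, \ldots, q$) contains $a_i$ index items and has weight $w_i$;

the composite score is

$$S = \sum_{i=1}^{q} (-1)^{t} w_i S_i,$$

where $S_i$ is the score of the $i$-th block, with $t = 1$ if the $i$-th block is negatively correlated with the total score and $t = 0$ if it is positively correlated;

$S_i$ is solved as follows: compute the correlation coefficient matrix $R$ of the standardized matrix $\tilde{A}$, and solve for its eigenvalues $\lambda_j$ ($j = 1, 2, \ldots, s$, where $s$ is the total number of eigenvalues) and eigenvectors $v_j$;

calculate the contribution rate

$$b_j = \frac{\lambda_j}{\sum_{k=1}^{s} \lambda_k}, \quad j = 1, 2, \ldots, s,$$

and the principal component comprehensive evaluation score

$$S_i = \sum_{j=1}^{s} b_j y_j,$$

where $y_j$ is the $j$-th principal component and $b_j$ is its information contribution rate.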
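The fast clustering procedure of step A5 can be sketched in plain Python. This is an illustrative sketch, not the patented implementation: it assumes the caller supplies the k initial cluster points, keeps the previous center when a class becomes empty, and stops when two successive classifications are identical; the function names and sample points are hypothetical.

```python
import math


def nearest_center(point, centers):
    """Index of the cluster point closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def center_of_gravity(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def fast_cluster(samples, initial_centers, max_iter=100):
    """Alternate classification and center updates until the classes stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest_center(x, centers) for x in samples]
        if new_labels == labels:  # G^(m+1) equals G^(m): the calculation ends
            break
        labels = new_labels
        for i in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:  # ASSUMPTION: an empty class keeps its previous center
                centers[i] = center_of_gravity(members)
    return labels, centers


# Hypothetical standardized samples forming two obvious groups.
samples = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = fast_cluster(samples, [samples[0], samples[3]])
```

The result depends on the initial cluster points, so in practice several starting sets may be tried and the most stable classification kept.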

Example 2, a method for extracting the causes of poverty, comprising the following steps:

step B1, acquiring data: the data comprises poverty-relief data, civil affairs department data, student aid system data, student status system data, academic affairs system data, campus one-card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data, and a database is established for storage;

step B2, analyzing the data and dividing it into unstructured text data and structured data, then processing the text with natural language processing (NLP) by calling the snownlp library in Python to perform text word segmentation, named entity recognition and syntactic analysis, extracting the described objects and object features from the text, and outputting them as a table.

Example 3, a method for rapidly monitoring and filtering the data of abnormally poor students, comprising the following steps:

step C1, acquiring data: the data comprises poverty-relief data, civil affairs department data, student aid system data, student status system data, academic affairs system data, campus one-card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data, and a database is established for storage;

step C2, analyzing the data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;

step C3, finding and filling the missing data, and using different filling strategies according to different missing conditions, including mean value filling, interpolation and fitting, to complete initialization of the missing data;

step C4, carrying out linear transformation on the original structured data to enable the result value to be uniformly mapped to a fixed interval;

step C5, arranging the data in ascending order. Let the number of data values in one class be $n$, and denote the value ranked $k$-th by $x_{(k)}$ ($1 \le k \le n$).

The $p$ quantile is calculated as

$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \frac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \tag{1}$$

where $[np]$ denotes the integer part of $np$. From equation (1), calculate the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the student to whom such data belongs is listed as abnormally poor: the student's behavior departs from the average level of the population and deserves special attention. The method is not limited by data completeness, and a judgment can be made from each data category independently, so the method is fast.
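Step C3 names mean filling and interpolation as filling strategies without detailing them. The sketch below shows one plausible form of each for a single numeric column, with missing entries represented as `None`. It is an assumption-laden illustration: the function names, the use of linear interpolation, and the choice to copy the nearest observed value at the ends of the series are not taken from the source.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


def fill_by_interpolation(values):
    """Linearly interpolate between the nearest observed neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # gap at the start: copy the first observed value
            filled[i] = values[right]
        elif right is None:     # gap at the end: copy the last observed value
            filled[i] = values[left]
        else:
            span = values[right] - values[left]
            filled[i] = values[left] + span * (i - left) / (right - left)
    return filled


# Hypothetical monthly consumption series with gaps.
mean_filled = fill_with_mean([10.0, None, 14.0])
interpolated = fill_by_interpolation([None, 10.0, None, None, 16.0, None])
```

Mean filling suits features with no natural ordering, while interpolation suits time-ordered records such as monthly consumption.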

The above embodiments express only several implementations of the present invention, and although their description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention.

Claims

1. A student poverty degree prediction method based on machine learning is characterized in that: the method comprises the following steps:

step 1, obtaining data related to student poverty, wherein obtaining the data comprises extracting basic information data of students, including name, native place, ethnic group, health condition, political status, whether the student entered the school through the green channel, whether the student has taken out a loan from the place of origin, whether the student has taken out a campus loan, whether the student receives a subsidy, what rewards or subsidies were received during university, mode of admission, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position, and other information about the guardian;

step 3, finding and filling missing data;

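let the set of k initial cluster points be

$$L^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}\}$$

and denote

$$G_i^{(0)} = \{x : d(x, x_i^{(0)}) \le d(x, x_j^{(0)}),\ j = 1, 2, \ldots, k,\ j \ne i\}, \quad i = 1, 2, \ldots, k,$$

where $d$ is the Euclidean distance, dividing the samples into k disjoint classes to obtain the initial classification

$$G^{(0)} = \{G_1^{(0)}, G_2^{(0)}, \ldots, G_k^{(0)}\};$$

starting from the initial classification $G^{(0)}$, computing a new set of cluster points $L^{(1)}$ by calculating

$$x_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l, \quad i = 1, 2, \ldots, k,$$

where $n_i$ is the number of samples in $G_i^{(0)}$, to get the new set

$$L^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_k^{(1)}\};$$

starting from $L^{(1)}$, classifying again, denoting

$$G_i^{(1)} = \{x : d(x, x_i^{(1)}) \le d(x, x_j^{(1)}),\ j = 1, 2, \ldots, k,\ j \ne i\}, \quad i = 1, 2, \ldots, k,$$

to get a new classification

$$G^{(1)} = \{G_1^{(1)}, G_2^{(1)}, \ldots, G_k^{(1)}\};$$

repeating the above steps m times to obtain

$$G^{(m)} = \{G_1^{(m)}, G_2^{(m)}, \ldots, G_k^{(m)}\},$$

where $x_i^{(m)}$ is the center of gravity of class $G_i^{(m-1)}$; when m is gradually increased, the classification tends to be stable, and $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e.

$$x_i^{(m+1)} \approx x_i^{(m)}, \quad G_i^{(m+1)} \approx G_i^{(m)},$$

at which point the calculation is finished; alternatively, if for a certain m

$$G^{(m+1)} = \{G_1^{(m+1)}, \ldots, G_k^{(m+1)}\} \quad \text{and} \quad G^{(m)} = \{G_1^{(m)}, \ldots, G_k^{(m)}\}$$

are the same, the calculation is ended;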

step 6.2, separating two indexes of which the correlation coefficients do not accord with the principle to obtain a block matrix, respectively carrying out principal component analysis on each block, and repeating the step 6.1 if the index weight coefficients of a certain block do not meet the principle to obtain a final block matrix;

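q is the number of blocks in the final block matrix obtained in step 6, the $i$-th block has $a_i$ index items, where $i = 1, \ldots, q$, and the weight of the $i$-th block is $w_i$,

the composite score is

$$S = \sum_{i=1}^{q} (-1)^{t} w_i S_i,$$

where $S_i$ is the score of the $i$-th block, $t = 1$ if the $i$-th block is negatively correlated with the total score, and $t = 0$ if the $i$-th block is positively correlated with the total score,

$S_i$ is solved as follows: computing the correlation coefficient matrix $R$ of the standardized matrix $\tilde{A}$, where $r_{ij}$ is the element in row $i$ and column $j$ of $R$, with $i = 1, \ldots, n$ and $j = 1, \ldots, m$, and solving for the eigenvalues $\lambda_j$ and eigenvectors $v_j$ of the correlation matrix, where $j = 1, \ldots, s$ and $s$ is the total number of eigenvalues;

calculating the contribution rate

$$b_j = \frac{\lambda_j}{\sum_{k=1}^{s} \lambda_k}, \quad j = 1, 2, \ldots, s,$$

arranging the components in order of eigenvalue size, the eigenvectors forming the new index variables:

$$y_1 = v_{11}\tilde{x}_1 + v_{21}\tilde{x}_2 + \cdots + v_{m1}\tilde{x}_m, \quad \ldots, \quad y_s = v_{1s}\tilde{x}_1 + v_{2s}\tilde{x}_2 + \cdots + v_{ms}\tilde{x}_m,$$

where $\tilde{x}_1, \ldots, \tilde{x}_m$ are the standardized index variables, and calculating the principal component comprehensive evaluation score

$$S_i = \sum_{j=1}^{s} b_j y_j,$$

where $b_j$ is the information contribution rate of the $j$-th principal component;

sorting the scores S of all students, wherein a higher S indicates a greater degree of poverty, and allocating subsidies with the value of S as a reference;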

step 1 also comprises extracting the students' on-campus internet access data, including browsing content, the model of electronic device used, the place of access, the types of websites browsed and the time spent online, the browsing information extracted including data on studying abroad, tourism, IELTS training and New Oriental training, and in step 7, when two students have the same poverty score, the student with higher consumption in these areas is regarded as less poor.

2. The machine learning-based student poverty degree prediction method according to claim 1, wherein the standardization in step 4 is performed as follows: each index value $a_{ij}$ is converted into $a'_{ij}$, where

$$a'_{ij} = \frac{a_{ij} - u_j}{s_j}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, m.$$

The transformed matrix, recorded as $\tilde{A} = (a'_{ij})_{n \times m}$, is the result of the standardization.

3. The machine learning-based student poverty degree prediction method according to claim 1, wherein: after step 4, the data are arranged in ascending order, the number of data values in one class is recorded as $n$, and the value ranked $k$-th is denoted $x_{(k)}$, $1 \le k \le n$;

the $p$ quantile is calculated as

$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \frac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \tag{1}$$

where $[np]$ denotes the integer part of $np$; the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated according to equation (1), where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$; data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the student to whom such data belongs is listed as abnormally poor.

4. The machine learning-based student poverty degree prediction method according to claim 1, wherein in step 2 the data is parsed and divided into unstructured text data and structured data, the structured data is stored directly in the database, the text data is then extracted and processed with natural language processing (NLP) by calling the snownlp library in Python to perform text word segmentation, named entity recognition and syntactic analysis, and the described objects and object features in the text are extracted and tabulated for output.