CN109145113B - Student poverty degree prediction method based on machine learning - Google Patents


Info

Publication number
CN109145113B
Authority
CN
China
Prior art keywords
data
consumption
student
poverty
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810972342.8A
Other languages
Chinese (zh)
Other versions
CN109145113A (en
Inventor
陈岩
俞跃舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Taohuadao Information Technology Co ltd
Original Assignee
Beijing Taohuadao Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Taohuadao Information Technology Co ltd filed Critical Beijing Taohuadao Information Technology Co ltd
Priority to CN201810972342.8A priority Critical patent/CN109145113B/en
Publication of CN109145113A publication Critical patent/CN109145113A/en
Application granted granted Critical
Publication of CN109145113B publication Critical patent/CN109145113B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Educational Technology (AREA)
  • Educational Administration (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a machine-learning-based method for predicting the degree of poverty of students. The method obtains data from channels related to the students, analyzes the data, calculates the characteristic values relevant to student poverty, fills missing values, standardizes the data so that they map to a fixed interval, clusters the data into several classes by Euclidean distance according to a fast clustering algorithm, and calculates the importance of each class to the evaluation of the degree of poverty. The matrix formed by each group of classified data is partitioned into blocks according to correlation, and a comprehensive poverty score is finally calculated from the block matrix. The comprehensive score can serve as a reference when deciding subsidy amounts for poverty-stricken students: the higher the score, the poorer the student and the greater the need for subsidy. The invention also provides schemes for quickly finding abnormal poverty-stricken students and for extracting the causes of poverty from the data.

Description

Student poverty degree prediction method based on machine learning
Technical Field
The invention belongs to the technical field of big data application, and particularly relates to a student poverty degree prediction method based on machine learning.
Background
At present, most colleges and universities determine a student's family economic condition mainly from family economic condition questionnaires and from the overall impressions reported by teachers and classmates. The degree of poverty of subsidized students cannot be quantified, so subsidies inevitably tend toward equal shares for everyone.
With the technical development of the big data era, the global backbone communication network transmits tens of thousands of terabytes of data every day, and each person's behavior is recorded in various forms of data. Students generate data in all of their activities on campus, and these data record various characteristics of the students. Used reasonably, the data reflect the students' real conditions and can, to a certain extent, solve the problem of equal-share subsidies for poverty-stricken students, thereby providing more help to the students who are truly poor.
Disclosure of Invention
The invention aims to provide a student poverty degree prediction method based on machine learning, in order to overcome the equal-share distribution of subsidies to poverty-stricken students.
The invention achieves this purpose through the following technical scheme:
a student poverty degree prediction method based on machine learning, comprising the following steps:
step 1, acquiring data related to poverty of students;
step 2, analyzing data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;
step 3, finding and filling missing data;
step 4, carrying out standardization processing on the original structured data to enable the result values to be uniformly mapped to a fixed interval;
step 5, clustering the data into k classes by Euclidean distance according to a fast clustering algorithm:
let the set of k initial points be
$L^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}\}$
Denote
$G_i^{(0)} = \{x : d(x, x_i^{(0)}) \le d(x, x_j^{(0)}),\ j = 1, \ldots, k,\ j \ne i\}, \quad i = 1, \ldots, k$
which divides the samples into k disjoint classes and gives the initial classification
$G^{(0)} = \{G_1^{(0)}, G_2^{(0)}, \ldots, G_k^{(0)}\}$
Starting from the initial classification $G^{(0)}$, compute a new set of points $L^{(1)}$ by calculating
$x_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l, \quad i = 1, \ldots, k$
where $n_i$ is the number of samples in $G_i^{(0)}$, to get the new set
$L^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_k^{(1)}\}$
Starting from $L^{(1)}$, classify again and denote
$G_i^{(1)} = \{x : d(x, x_i^{(1)}) \le d(x, x_j^{(1)}),\ j = 1, \ldots, k,\ j \ne i\}, \quad i = 1, \ldots, k$
to get the new classification
$G^{(1)} = \{G_1^{(1)}, G_2^{(1)}, \ldots, G_k^{(1)}\}$
Repeating the above steps m times gives
$G^{(m)} = \{G_1^{(m)}, G_2^{(m)}, \ldots, G_k^{(m)}\}$
where $x_i^{(m)}$ is the center of gravity of $G_i^{(m-1)}$. As m gradually increases, the classification tends to stabilize. At the same time, $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e.
$x_i^{(m+1)} \approx x_i^{(m)}, \quad G_i^{(m+1)} \approx G_i^{(m)}$
At this point, the calculation is finished; alternatively, if for a certain m the classifications
$G^{(m+1)} = \{G_1^{(m+1)}, \ldots, G_k^{(m+1)}\}$
and
$G^{(m)} = \{G_1^{(m)}, \ldots, G_k^{(m)}\}$
are the same, the calculation is ended;
step 6, calculating the relative importance of each factor in the evaluation factor system of each class obtained by clustering to the realization of the evaluation target and function, and checking the calculation result to ensure the reliability of the evaluation conclusion;
step 6.1, analyzing the correlation coefficient matrix of the summarized poverty factor data: two indexes of the same direction (both positive or both negative) should be positively correlated, with a correlation coefficient greater than zero; two indexes of opposite direction (one positive and one negative) should be negatively correlated, with a correlation coefficient less than zero; if this principle is not satisfied, proceeding to step 6.2;
step 6.2, separating the two indexes whose correlation coefficient does not conform to the principle to obtain a block matrix, performing principal component analysis on each block, and, if the index weight coefficients of any block do not conform, repeating step 6.1 to obtain the final block matrix;
step 7, calculating the principal component comprehensive evaluation score:
q is the number of blocks in the final block matrix obtained in step 6; the i-th block (i = 1, …, q) has $a_i$ index items, and the weight of the i-th block is
Figure BDA0001776514650000031
composite score
Figure BDA0001776514650000032
where $S_i$ is the score of the i-th block, and t = 1 if the i-th block is negatively correlated with the total score and t = 0 if it is positively correlated;
$S_i$ is solved as follows: compute the correlation coefficient matrix R of the standardized matrix
Figure BDA00017765146500000310
where $R_{ij}$ (i = 1, …, n, j = 1, …, m) is the element in row i and column j of R, and solve for the eigenvalues $\lambda_i$ (i = 1, 2, …, s, where s is the total number of eigenvalues) and the eigenvectors $v_i$ of the correlation matrix.
Calculating contribution rate
Figure BDA0001776514650000033
and arrange the contribution rates by eigenvalue size. The new index variables are formed from the eigenvectors:
Figure BDA0001776514650000034
Figure BDA0001776514650000035
Figure BDA0001776514650000039
Figure BDA0001776514650000036
wherein y is1Is the first principal component, ysIs the s-th principal component; calculating the comprehensive evaluation score of the principal component
Figure BDA0001776514650000037
Wherein
Figure BDA0001776514650000038
The scores S of all students are then sorted; a higher S indicates a greater degree of hardship, and the value of S serves as a reference for allocating subsidies.
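The fast clustering of step 5 can be illustrated with a minimal sketch. It assumes the caller supplies the k initial points (the text does not say how they are chosen), and the function names `nearest`, `centroid` and `fast_cluster` are hypothetical. An empty class keeps its previous center, which is also an assumption.

```python
import math


def nearest(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def centroid(points):
    """Center of gravity of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def fast_cluster(samples, initial_points, max_iter=100):
    """Alternate classification and center-of-gravity updates until stable."""
    centers = [tuple(p) for p in initial_points]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(x, centers) for x in samples]
        if new_labels == labels:  # classification unchanged: stop
            break
        labels = new_labels
        for i in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:  # assumption: an empty class keeps its old center
                centers[i] = centroid(members)
    return labels, centers


samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = fast_cluster(samples, initial_points=[(0.0, 0.0), (10.0, 10.0)])
```

The loop stops when two successive classifications coincide, which corresponds to the second stopping condition described in step 5.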
As a further improvement, the method also comprises extracting the internet usage data of students at school, including browsing content, type of electronic product used, place of internet access, type of website browsed and time of internet access, wherein the browsing information includes data relating to study abroad, tourism, IELTS training and New Oriental training.
Preferably, acquiring data in the method includes extracting basic student information, including name, native place, nationality, health condition, political affiliation, whether the student enrolled through the green channel, whether a student-origin-based loan has been taken out, whether a campus-based loan has been taken out, whether the student receives subsidies, what rewards or grants were received during university, mode of enrollment, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position, and other guardian information;
also comprises extracting student family information data including whether the student is a solitary child, whether the student is left to guard, whether the student enters a city service staff along with a mobile child, whether the student is a low-protection family, whether the student is a knight or a careless child, whether the student is an orphan, whether the student is established with a card and is poverty, urban and rural minimum guarantee, whether the student is particularly poverty in urban and rural areas, whether the family is urban and rural minimum life guarantee, whether the student is a low-income family, whether the countryside is five-protection family, the student is resident, whether the student is physically handicapped, whether the student is sick and sick, whether the parent is disabled, whether the student is sick and sick, whether the parent is sick and sick, whether the family is mainly derived from income, whether the family is annual income, whether the parent is single, whether the student is different, whether the parent is nursed, whether the family is a person is lost, whether the family is a poor county, whether the family is a county, Whether the school is mountainous, whether study-assisting loan is transacted, house conditions, sudden unexpected events of families, the condition of unemployment of family members, the amount of debt of the family, the reason of the debt, the condition of the students who have been subsidized in the school year, the occupation of father, the occupation of mother, whether the town has houses or not, and the amount of the medical expenditure years of the family;
extracting campus card consumption data and calculating features, including: total consumption amount; maximum monthly total consumption amount; total number of transactions; mean daily consumption amount; mean daily number of transactions; number of days with consumption; canteen consumption amount; total number of canteen transactions and its maximum; daily canteen consumption amount and its maximum; daily number of canteen transactions; supermarket consumption amount and number of transactions; medical consumption amount and number of transactions; number of boiled-water purchases; book consumption amount and number of transactions; daily consumption amount and consumption variance for each of breakfast, lunch and dinner; monthly total consumption amount and monthly number of transactions for each of breakfast, lunch and dinner; canteen consumption ratio; supermarket consumption ratio; fruit consumption ratio; and degree of deviation of total consumption on holidays;
extracting student grade data, including course name, course number, semester in which the course is offered, class hours, credits, examination score and coursework score, and calculating grade points, semester average score, academic-year average score and number of failed courses;
further comprising extracting library data, including book name, book number, book type, borrowing time, borrowing place, return time and return place, and calculating features from the data: number of loans, borrowing frequency, loan duration, average loan duration, annual average number of loans, degree of deviation of borrowing time, and number of books borrowed on holidays;
further comprising extracting access control and electricity consumption data, including card swiping time, card swiping place, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory electricity consumption, per-capita dormitory electricity consumption and per-capita electricity consumption of students in the school;
preferably, the standardization in step 4 is as follows: each index value $a_{ij}$ is converted into $a'_{ij}$, with $a'_{ij} = (a_{ij} - u_j)/s_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, where $u_j$ and $s_j$ are the mean and the standard deviation of the j-th index:
Figure BDA0001776514650000051
The transformed matrix $(a'_{ij})_{n \times m}$ is the result of the standardization.
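The standardization $a'_{ij} = (a_{ij} - u_j)/s_j$ can be sketched as follows. This is a minimal illustration, assuming that $s_j$ is the sample standard deviation (n − 1 denominator), which the text does not state; the function name `standardize` is hypothetical, and a constant column is mapped to zeros by assumption.

```python
import statistics


def standardize(matrix):
    """Column-wise standardization: a'_ij = (a_ij - u_j) / s_j."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    # Assumption: s_j is the sample standard deviation (n - 1 denominator).
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - u) / s if s > 0 else 0.0  # constant column -> 0.0 (assumption)
         for value, u, s in zip(row, means, stdevs)]
        for row in matrix
    ]


# Rows are students, columns are indexes.
scores = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

After this transformation every index has mean zero and unit standard deviation, so indexes measured in different units become comparable.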
After step 4, the data are arranged from small to large. The data of one class are recorded as
$x_1, x_2, \ldots, x_n$
and the value ranked k-th is denoted $x_{(k)}$ ($1 \le k \le n$);
The p quantile is calculated as follows:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where $[np]$ denotes the integer part of $np$. According to equation (1), calculate the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data smaller than $Q_3$ or larger than $Q_1$ are abnormal values, and the student to whom such data belong is listed as an abnormal poverty-stricken student. Such a student deviates severely from the population in some respect and requires special treatment.
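The cut-off rule above can be sketched in a few lines. The sketch assumes the common textbook form of the p quantile (the $([np]+1)$-th order statistic when $np$ is not an integer, otherwise the mean of the $np$-th and $(np+1)$-th), and it assumes $0 < p < 1$; the function names are hypothetical. The text's naming is kept: $Q_1$ is the upper cut-off and $Q_3$ the lower one.

```python
import math


def quantile(sorted_values, p):
    """p quantile of ascending data, for 0 < p < 1 (assumed textbook form)."""
    n = len(sorted_values)
    np_ = n * p
    k = math.floor(np_)
    if np_ != k:
        # np is not an integer: take x_([np]+1), which is index k when 0-based
        return sorted_values[k]
    # np is an integer: average x_(np) and x_(np+1)
    return (sorted_values[k - 1] + sorted_values[k]) / 2


def find_outliers(values):
    """Values below Q3 = M_0.25 - 1.5R or above Q1 = M_0.75 + 1.5R."""
    xs = sorted(values)
    m25, m75 = quantile(xs, 0.25), quantile(xs, 0.75)
    r = m75 - m25
    upper = m75 + 1.5 * r  # Q1 in the text
    lower = m25 - 1.5 * r  # Q3 in the text
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily canteen spending of eight students.
abnormal = find_outliers([10, 12, 11, 13, 12, 11, 50, 12])
```

Because the rule uses only one class of data at a time, it can be applied to any single feature without waiting for the full data set.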
The method for extracting the causes of student poverty is characterized in that, after step 2, the text data are extracted from the data and, using natural language processing (NLP) technology, the snownlp library in Python is called to perform text word segmentation, named entity recognition and syntactic analysis; the described objects and object features in the text are extracted and output as a table, so that staff can quickly see the causes of a student's poverty.
The invention has the beneficial effects that:
1) the method uses a large amount of data when evaluating the degree of poverty, which avoids the one-sidedness of evaluation based on a single data source;
2) the invention uses machine learning to process the data, which overcomes the influence of human subjective factors;
3) the invention also provides a method for rapidly filtering the abnormal data of poverty-stricken students, which can find abnormal poverty-stricken students so that they receive special treatment.
Detailed Description
The present application is described in further detail below, and it should be noted that the following detailed description is provided for illustrative purposes only, and is not intended to limit the scope of the present application, which is defined by the appended claims.
Example 1
A student poverty degree prediction method based on machine learning is characterized in that: comprises the following steps;
step A1, acquiring data: the data comprise poverty alleviation data, civil affairs department data, student aid system data, student system data, school academic affairs system data, campus card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data, and a database is established for storage;
the data acquisition comprises extracting basic student information, including name, native place, nationality, health condition, political affiliation, whether the student enrolled through the green channel, whether a student-origin-based loan has been taken out, whether a campus-based loan has been taken out, whether the student receives subsidies, what rewards or grants were received during university, mode of enrollment, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position and other information;
also comprises extracting student family information data including whether the student is a solitary child, whether the student is left to guard, whether the student enters a city service staff along with a mobile child, whether the student is a low-protection family, whether the student is a knight or a careless child, whether the student is an orphan, whether the student is established with a card and is poverty, urban and rural minimum guarantee, whether the student is particularly poverty in urban and rural areas, whether the family is urban and rural minimum life guarantee, whether the student is a low-income family, whether the countryside is five-protection family, the student is resident, whether the student is physically handicapped, whether the student is sick and sick, whether the parent is disabled, whether the student is sick and sick, whether the parent is sick and sick, whether the family is mainly derived from income, whether the family is annual income, whether the parent is single, whether the student is different, whether the parent is nursed, whether the family is a person is lost, whether the family is a poor county, whether the family is a county, Whether the school is mountainous, whether study-assisting loan is transacted, house conditions, sudden unexpected events of families, the condition of unemployment of family members, the amount of debt of the family, the reason of the debt, the condition of the students who have been subsidized in the school year, the occupation of father, the occupation of mother, whether the town has houses or not, and the amount of the medical expenditure years of the family;
extracting campus card consumption data and calculating features, including: total consumption amount; maximum monthly total consumption amount; total number of transactions; mean daily consumption amount; mean daily number of transactions; number of days with consumption; canteen consumption amount; total number of canteen transactions and its maximum; daily canteen consumption amount and its maximum; daily number of canteen transactions; supermarket consumption amount and number of transactions; medical consumption amount and number of transactions; number of boiled-water purchases; book consumption amount and number of transactions; daily consumption amount and consumption variance for each of breakfast, lunch and dinner; monthly total consumption amount and monthly number of transactions for each of breakfast, lunch and dinner; canteen consumption ratio; supermarket consumption ratio; fruit consumption ratio; and degree of deviation of total consumption on holidays;
extracting student grade data, including course name, course number, semester in which the course is offered, class hours, credits, examination score and coursework score, and calculating grade points, semester average score, academic-year average score and number of failed courses;
further comprising extracting library data, including book name, book number, book type, borrowing time, borrowing place, return time and return place, and calculating features from the data: number of loans, borrowing frequency, loan duration, average loan duration, annual average number of loans, degree of deviation of borrowing time, and number of books borrowed on holidays;
further comprising extracting access control and electricity consumption data, including card swiping time, card swiping place, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory electricity consumption, per-capita dormitory electricity consumption and per-capita electricity consumption of students in the school;
and further comprising extracting the internet usage data of students at school, including browsing content, type of electronic product used, place of internet access, type of website browsed and time of internet access, wherein the browsing information includes data relating to study abroad, tourism, IELTS training and New Oriental training; when two students have the same degree of poverty in step A7, the student with the higher consumption in these areas is regarded as the less poor of the two.
Step A2, analyzing data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;
step A3, finding and filling missing data, and using different filling strategies including mean filling, interpolation and fitting according to different missing conditions to finish initialization of the missing data, wherein the strategies are conventional technologies and are not explained any more;
step A4, standardizing the original structured data to make the result value uniformly mapped to a fixed interval;
each index value $a_{ij}$ is converted into $a'_{ij}$, with $a'_{ij} = (a_{ij} - u_j)/s_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, where $u_j$ and $s_j$ are the mean and the standard deviation of the j-th index:
Figure BDA0001776514650000091
The transformed matrix $(a'_{ij})_{n \times m}$ is the result of the standardization.
The data are arranged from small to large. The data of one class are recorded as
$x_1, x_2, \ldots, x_n$
and the value ranked k-th is denoted $x_{(k)}$ ($1 \le k \le n$);
The p quantile is calculated as follows:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where $[np]$ denotes the integer part of $np$. According to equation (1), calculate the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data smaller than $Q_3$ or larger than $Q_1$ are abnormal values, and the student to whom such data belong is listed as an abnormal poverty-stricken student.
Step A5, clustering the data into k classes by Euclidean distance according to the fast clustering algorithm:
let the set of k initial points be
$L^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}\}$
Denote
$G_i^{(0)} = \{x : d(x, x_i^{(0)}) \le d(x, x_j^{(0)}),\ j = 1, \ldots, k,\ j \ne i\}, \quad i = 1, \ldots, k$
which divides the samples into k disjoint classes and gives the initial classification
$G^{(0)} = \{G_1^{(0)}, G_2^{(0)}, \ldots, G_k^{(0)}\}$
Starting from the initial classification $G^{(0)}$, compute a new set of points $L^{(1)}$ by calculating
$x_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l, \quad i = 1, \ldots, k$
where $n_i$ is the number of samples in $G_i^{(0)}$, to get the new set
$L^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \ldots, x_k^{(1)}\}$
Starting from $L^{(1)}$, classify again and denote
$G_i^{(1)} = \{x : d(x, x_i^{(1)}) \le d(x, x_j^{(1)}),\ j = 1, \ldots, k,\ j \ne i\}, \quad i = 1, \ldots, k$
to get the new classification
$G^{(1)} = \{G_1^{(1)}, G_2^{(1)}, \ldots, G_k^{(1)}\}$
Repeating the above steps m times gives
$G^{(m)} = \{G_1^{(m)}, G_2^{(m)}, \ldots, G_k^{(m)}\}$
where $x_i^{(m)}$ is the center of gravity of $G_i^{(m-1)}$. As m gradually increases, the classification tends to stabilize. At the same time, $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e.
$x_i^{(m+1)} \approx x_i^{(m)}, \quad G_i^{(m+1)} \approx G_i^{(m)}$
At this point, the calculation is finished; alternatively, if for a certain m the classifications
$G^{(m+1)} = \{G_1^{(m+1)}, \ldots, G_k^{(m+1)}\}$
and
$G^{(m)} = \{G_1^{(m)}, \ldots, G_k^{(m)}\}$
are the same, the calculation is ended;
step A6, calculating the relative importance of each factor in the evaluation factor system of each class obtained by clustering to the realization of the evaluation target and function, and checking the calculation result to ensure the reliability of the evaluation conclusion;
step A6.1, analyzing the correlation coefficient matrix of the summarized poverty factor data: two indexes of the same direction (both positive or both negative) should be positively correlated, with a correlation coefficient greater than zero; two indexes of opposite direction (one positive and one negative) should be negatively correlated, with a correlation coefficient less than zero; if this principle is satisfied, proceeding directly to step A7 to calculate the score, and if it is not satisfied, proceeding to step A6.2;
step A6.2, separating the two indexes whose correlation coefficient does not conform to the principle to obtain a block matrix, performing principal component analysis on each block, and, if the index weight coefficients of any block do not conform, repeating step A6.1 to obtain the final block matrix;
step A7, calculating the principal component comprehensive evaluation score:
q is the number of blocks in the final block matrix obtained in step A6; the i-th block (i = 1, …, q) has $a_i$ index items, and the weight of the i-th block is
Figure BDA0001776514650000111
composite score
Figure BDA0001776514650000112
where $S_i$ is the score of the i-th block, and t = 1 if the i-th block is negatively correlated with the total score and t = 0 if it is positively correlated;
$S_i$ is solved as follows: compute the correlation coefficient matrix R of the standardized matrix
Figure BDA0001776514650000118
where $R_{ij}$ (i = 1, …, n, j = 1, …, m) is the element in row i and column j of R, and solve for the eigenvalues $\lambda_i$ (i = 1, 2, …, s, where s is the total number of eigenvalues) and the eigenvectors $v_i$ of the correlation matrix.
Calculating contribution rate
Figure BDA0001776514650000113
and arrange the contribution rates by eigenvalue size. The new index variables are formed from the eigenvectors:
Figure BDA0001776514650000114
Figure BDA0001776514650000115
Figure BDA0001776514650000116
Figure BDA0001776514650000117
wherein y is1Is the first principal component, ysIs the s-th principal component; calculating the comprehensive evaluation score of the principal component
Figure BDA0001776514650000121
Wherein
Figure BDA0001776514650000122
The scores S of all students are then sorted; a higher S indicates a greater degree of hardship, and the value of S serves as a reference for allocating subsidies.
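The eigenvalue step of step A7 can be illustrated for the first principal component only. This is a sketch under stated assumptions: the text does not say how the eigenvalues are solved, so power iteration is swapped in here; the contribution rate is taken in its usual form (eigenvalue divided by the sum of all eigenvalues); all function names are hypothetical; and every index column is assumed to be non-constant.

```python
import math
import statistics


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (samples x indexes)."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    devs = [[v - mu for v in c] for c, mu in zip(cols, means)]
    norms = [math.sqrt(sum(d * d for d in dev)) for dev in devs]
    m = len(cols)
    return [
        [sum(a * b for a, b in zip(devs[i], devs[j])) / (norms[i] * norms[j])
         for j in range(m)]
        for i in range(m)
    ]


def leading_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and a unit eigenvector, by power iteration."""
    m = len(matrix)
    v = [float(i + 1) for i in range(m)]  # arbitrary start vector (assumption)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    rv = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(a * b for a, b in zip(v, rv)), v  # Rayleigh quotient, vector


def first_component(rows):
    """Contribution rate and eigenvector of the first principal component."""
    r = correlation_matrix(rows)
    eigenvalue, vector = leading_eigenpair(r)
    # The eigenvalues of a correlation matrix sum to its trace, i.e. to m.
    return eigenvalue / len(r), vector


# Two perfectly correlated indexes: one component explains everything.
contribution, vector = first_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

A full implementation would also need the remaining eigenpairs and the block weights, which depend on formulas not reproduced in the text.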
Example 2, a method for extracting the causes of poverty, comprising the following steps:
step B1, acquiring data: the data comprise poverty alleviation data, civil affairs department data, student aid system data, student system data, school academic affairs system data, campus card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data, and a database is established for storage;
step B2, analyzing the data, dividing the data into unstructured text data and structured data, and, using natural language processing (NLP) technology, calling the snownlp library in Python to perform text word segmentation, named entity recognition and syntactic analysis, extracting the described objects and object features in the text, and outputting them as a table.
Example 3, a method for rapidly monitoring and filtering the abnormal data of poverty-stricken students, comprising the following steps:
step C1, acquiring data: the data comprise poverty alleviation data, civil affairs department data, student aid system data, student system data, school academic affairs system data, campus card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data, and a database is established for storage;
step C2, analyzing the data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;
step C3, finding and filling the missing data, and using different filling strategies according to different missing conditions, including mean value filling, interpolation and fitting, to complete initialization of the missing data;
step C4, carrying out linear transformation on the original structured data to enable the result value to be uniformly mapped to a fixed interval;
step C5, sorting the data of one class in ascending order as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, where $n$ is the number of data in the class and $x_{(k)}$ ($1 \le k \le n$) denotes the value ranked $k$-th;
The $p$ quantile is calculated as:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where $[np]$ denotes the integer part of $np$. Using equation (1), the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data smaller than $Q_3$ or greater than $Q_1$ are abnormal values. The subject to whom such data belong is listed as abnormally poor: his or her behavior departs from the general average level and deserves special attention. The method does not depend on data completeness, and the judgment can be made on each data category independently, so it is fast.
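Steps C3 to C5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: mean filling is only one of the filling strategies the text mentions, min-max scaling is one possible linear transformation to a fixed interval, and the quantile rule (take $x_{([np]+1)}$ when $np$ is not an integer, otherwise average $x_{(np)}$ and $x_{(np+1)}$) is the common textbook convention. All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def fill_missing_with_mean(values: List[Optional[float]]) -> List[float]:
    """Step C3 (one strategy): replace missing entries by the mean of the rest."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_scale(values: List[float]) -> List[float]:
    """Step C4: linear map of the data onto the fixed interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def p_quantile(sorted_x: List[float], p: float) -> float:
    """Step C5: p quantile of data already sorted in ascending order."""
    np_ = len(sorted_x) * p
    k = math.floor(np_)
    if np_ != k:                      # np is not an integer: x_([np]+1)
        return sorted_x[k]            # 0-based index k is the ([np]+1)-th value
    return 0.5 * (sorted_x[k - 1] + sorted_x[k])   # average of x_(np), x_(np+1)


def cutoff_points(values: List[float]) -> Tuple[float, float]:
    """Return (lower, upper) cut-off points M0.25 - 1.5R and M0.75 + 1.5R."""
    xs = sorted(values)
    m25, m75 = p_quantile(xs, 0.25), p_quantile(xs, 0.75)
    r = m75 - m25
    return m25 - 1.5 * r, m75 + 1.5 * r


def flag_outliers(values: List[float]) -> List[float]:
    """Values below the lower or above the upper cut-off point."""
    lower, upper = cutoff_points(values)
    return [v for v in values if v < lower or v > upper]
```

Because the cut-off points are built from quantiles, the same records are flagged before and after the linear transformation of step C4, which is why the check can run on each data category on its own.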
The above embodiments express only several implementations of the present invention. Their description is specific and detailed, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention.

Claims (4)

1. A student poverty degree prediction method based on machine learning is characterized in that: the method comprises the following steps:
step 1, obtaining data related to student poverty, wherein the data acquisition comprises extracting basic student information data, including name, native place, nationality, health condition, political status, whether the student enrolled through the green channel, whether a student-origin-based loan has been taken out, whether a campus loan has been taken out, whether a subsidy is received, what awards or grants were received during university, mode of admission, school name, student number, college, sex, major, class, current residence, relationship with the guardian, the guardian's position, and other information about the guardian;
the data acquisition also comprises extracting student family information data, including whether the student is an only child, whether the student is a left-behind child, whether the student is a child who moved to the city with migrant-worker parents, whether the family receives the minimum living allowance, whether the student is the child of a martyr or of a family entitled to preferential treatment, whether the student is an orphan, whether the family is a registered poor household, whether the family receives the urban or rural minimum living guarantee, whether the student is in extreme poverty in an urban or rural area, whether the family is a low-income family, whether the family is a rural five-guarantee household, the student's household registration, whether the student is disabled, whether the student is ill, whether a parent is disabled, whether a parent is ill, the main source of family income, the annual family income, whether the family is a single-parent family, whether the student is different, whether the parent is nursed, whether a family member is lost, whether the home is in a poverty-stricken county, whether the home is in a mountainous area, whether a student loan has been taken out, housing conditions, sudden unexpected family events, unemployment of family members, the amount of family debt, the reason for the debt, the subsidies the student has received in the school year, the father's occupation, the mother's occupation, whether the family owns a house in town, and the family's annual medical expenditure;
extracting campus one-card consumption data and calculating features including total consumption amount, maximum monthly consumption amount, total number of transactions, average daily consumption amount, average daily number of transactions, number of consumption days, canteen consumption amount, total number of canteen transactions, maximum number of canteen transactions, daily canteen consumption amount, maximum daily canteen consumption amount, daily number of canteen transactions, supermarket consumption amount, number of supermarket transactions, medical consumption amount, number of medical transactions, number of hot-water transactions, book consumption amount, number of book transactions, daily breakfast consumption amount, variance of daily breakfast consumption, daily lunch consumption amount, variance of daily lunch consumption, daily dinner consumption amount, variance of daily dinner consumption, total monthly breakfast consumption amount, total monthly number of breakfast transactions, total monthly lunch consumption amount, total monthly number of lunch transactions, total monthly dinner consumption amount, total monthly number of dinner transactions, canteen consumption ratio, supermarket consumption ratio, fruit consumption ratio, and the degree of deviation of total consumption during holidays;
extracting student score data including course name, course number, semester in which the course is offered, class hours, score, examination score and regular coursework score, and calculating the grade point, the semester average score, the academic-year average score and the number of failed courses;
further comprising extracting library data, including book name, book number, book type, borrowing time, borrowing place, return time and return place, and calculating features from the data: number of borrowings, borrowing frequency, borrowing duration, average borrowing duration, annual average number of borrowings, degree of deviation of the borrowing time, and number of books borrowed during holidays;
further comprising extracting access-control and electricity consumption data, including card-swiping time, card-swiping place, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory electricity consumption, per-capita dormitory electricity consumption, and per-capita electricity consumption of students in the school;
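To make the feature calculation in step 1 concrete, the following sketch derives a handful of the listed one-card features from raw transactions. The record layout (ISO date string, merchant category, amount), the category labels and the `card_features` helper are assumptions chosen for illustration; the source does not specify a schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (date "YYYY-MM-DD", merchant category, amount)
Transaction = Tuple[str, str, float]


def card_features(transactions: List[Transaction]) -> Dict[str, float]:
    """Compute a few of the one-card consumption features for one student."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    total = sum(amount for _, _, amount in transactions)
    days = {day for day, _, _ in transactions}
    by_month: Dict[str, float] = defaultdict(float)
    by_category: Dict[str, float] = defaultdict(float)
    for day, category, amount in transactions:
        by_month[day[:7]] += amount        # "YYYY-MM" prefix identifies the month
        by_category[category] += amount
    return {
        "total_amount": total,
        "total_count": len(transactions),
        "consumption_days": len(days),
        "daily_avg_amount": total / len(days),
        "max_monthly_amount": max(by_month.values()),
        "canteen_ratio": by_category["canteen"] / total,
        "supermarket_ratio": by_category["supermarket"] / total,
    }
```

The remaining features in the list (per-meal variances, holiday deviation and so on) would follow the same group-then-aggregate pattern with different grouping keys.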
step 2, analyzing data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;
step 3, finding and filling missing data;
step 4, carrying out standardization processing on the original structured data to enable the result values to be uniformly mapped to a fixed interval;
step 5, clustering the data into k classes according to a fast clustering algorithm using the Euclidean distance $d(\cdot,\cdot)$:
let the set of k initial cluster points be
$$L^{(0)} = \{x_1^{(0)}, x_2^{(0)}, \dots, x_k^{(0)}\}$$
and denote
$$G_i^{(0)} = \{x : d(x, x_i^{(0)}) \le d(x, x_j^{(0)}),\ j = 1, \dots, k,\ j \ne i\}, \quad i = 1, \dots, k,$$
so that the samples are divided into k disjoint classes, giving the initial classification
$$G^{(0)} = \{G_1^{(0)}, G_2^{(0)}, \dots, G_k^{(0)}\};$$
starting from the initial classification $G^{(0)}$, a new set of cluster points $L^{(1)}$ is computed by calculating
$$x_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l, \quad i = 1, \dots, k,$$
where $n_i$ is the number of samples in $G_i^{(0)}$, which gives the new set
$$L^{(1)} = \{x_1^{(1)}, x_2^{(1)}, \dots, x_k^{(1)}\};$$
starting from $L^{(1)}$, the samples are classified again, denoting
$$G_i^{(1)} = \{x : d(x, x_i^{(1)}) \le d(x, x_j^{(1)}),\ j = 1, \dots, k,\ j \ne i\}, \quad i = 1, \dots, k,$$
which gives the new classification
$$G^{(1)} = \{G_1^{(1)}, G_2^{(1)}, \dots, G_k^{(1)}\};$$
repeating the above steps m times gives
$$G^{(m)} = \{G_1^{(m)}, G_2^{(m)}, \dots, G_k^{(m)}\},$$
where $x_i^{(m)}$ is the center of gravity of $G_i^{(m-1)}$; as m increases the classification tends to become stable, and $x_i^{(m)}$ can then be regarded approximately as the center of gravity of $G_i^{(m)}$, i.e.
$$x_i^{(m+1)} \approx x_i^{(m)}, \quad G_i^{(m+1)} \approx G_i^{(m)},$$
at which point the calculation ends; alternatively, if for a certain m the classification $G^{(m+1)} = \{G_1^{(m+1)}, \dots, G_k^{(m+1)}\}$ is the same as $G^{(m)} = \{G_1^{(m)}, \dots, G_k^{(m)}\}$, the calculation ends;
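The iteration in step 5 can be sketched as follows: assign every sample to its nearest cluster point by Euclidean distance, replace each cluster point by the center of gravity of its class, and stop once the classification no longer changes. The function names, the `max_iter` cap and the rule of leaving an empty class's cluster point unchanged are assumptions added for this illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Put each sample into the class of its nearest cluster point."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]


def centers_of_gravity(points: Sequence[Point], labels: Sequence[int],
                       old_centers: Sequence[Point]) -> List[Point]:
    """Recompute each cluster point as the mean of the samples in its class."""
    new_centers: List[Point] = []
    for i, old in enumerate(old_centers):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            new_centers.append(old)     # assumption: keep an empty class's point
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def fast_cluster(points: Sequence[Point], initial_centers: Sequence[Point],
                 max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Iterate classification and re-centering until the classes are stable."""
    centers = list(initial_centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = centers_of_gravity(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:        # same classification twice: stop
            break
        labels = new_labels
    return labels, centers
```

The result depends on the initial cluster points, so in practice several starting sets may be tried; the source does not say how the initial points are chosen.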
step 6, calculating the relative importance degree of each factor in each category evaluation factor system after each cluster to the realization of the evaluation target and the function, and checking the calculation result to ensure the reliability of the evaluation conclusion;
step 6.1, analyzing the correlation coefficient matrix computed from the summarized poverty factor data, wherein two indexes in the same direction (both positive indexes or both negative indexes) should be positively correlated, with a correlation coefficient greater than zero, and two indexes in opposite directions (one positive index and one negative index) should be negatively correlated, with a correlation coefficient less than zero; if this principle is not satisfied, proceeding to step 6.2;
step 6.2, separating the two indexes whose correlation coefficient does not conform to the principle to obtain a block matrix, performing principal component analysis on each block separately, and, if the index weight coefficients of a block do not satisfy the principle, repeating step 6.1 until the final block matrix is obtained;
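The sign check of step 6.1 can be illustrated with a short sketch. Each index is tagged +1 (positive index) or -1 (negative index), and a pair is reported when the sign of its Pearson correlation disagrees with the product of the two tags. The `sign_violations` helper and the +1/-1 encoding are hypothetical; the sketch also assumes no index column is constant, since the correlation is undefined in that case.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sign_violations(columns: Sequence[Sequence[float]],
                    directions: Sequence[int]) -> List[Tuple[int, int]]:
    """Index pairs whose correlation sign contradicts their directions.

    directions[i] is +1 for a positive index and -1 for a negative index.
    Same-direction pairs must correlate positively, opposite-direction
    pairs negatively; everything else is returned as a violation.
    """
    bad = []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if r * directions[i] * directions[j] <= 0:
            bad.append((i, j))
    return bad
```

The pairs returned here are the ones step 6.2 would separate into different blocks before running principal component analysis on each block.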
step 7, calculating the comprehensive evaluation score of the principal components:
let q be the number of blocks in the final block matrix obtained in step 6, the i-th block having $a_i$ index items, where $i = 1, \dots, q$; the weight of the i-th block is
[equation image FDA0003342167290000041: weight of the i-th block]
and the composite score is
[equation image FDA0003342167290000042: composite score]
where $S_i$ is the score of the i-th block, with $t = 1$ if the i-th block is negatively correlated with the total score and $t = 0$ if the i-th block is positively correlated with the total score;
$S_i$ is obtained as follows: solving the correlation coefficient matrix $R$ of the standardized matrix $\tilde{A}$, where $r_{ij}$ is the element in row i and column j of $R$, with $i = 1, \dots, n$ and $j = 1, \dots, m$; solving the eigenvalues $\lambda_i$ and the eigenvectors $v_i$ of the correlation matrix, where $i = 1, \dots, s$ and s is the total number of eigenvalues;
calculating the contribution rate
[equation image FDA0003342167290000043: contribution rate of each eigenvalue]
arranging the eigenvalues in order of size and forming new index variables from the eigenvectors:
[equation image FDA0003342167290000044: principal components $y_1, \dots, y_s$]
where $y_1$ is the first principal component and $y_s$ is the s-th principal component; calculating the comprehensive evaluation score of the principal components
[equation image FDA0003342167290000045: comprehensive evaluation score]
where
[equation image FDA0003342167290000046]
sorting the students by their score S, wherein a higher S indicates a greater degree of hardship, and allocating subsidies with the value of S as a reference;
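For a single block, the scoring in step 7 can be sketched as a standard principal component composite: build the correlation matrix of the standardized data, take its eigenvalues and eigenvectors, and sum the principal components weighted by each eigenvalue's share of the total. This is an assumed reading, since the patent's own formulas are in images that are not reproduced here. The block weights and the sign factor t are left out, the Jacobi rotation routine is just one dependency-free eigen-solver, and the sign convention for eigenvectors (component sum made non-negative) is an added assumption because eigenvector signs are otherwise arbitrary.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def jacobi_eigen(sym: Matrix, sweeps: int = 50,
                 tol: float = 1e-12) -> Tuple[List[float], Matrix]:
    """Eigenvalues and eigenvectors of a symmetric matrix (cyclic Jacobi)."""
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):          # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):          # accumulate eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    eigvals = [a[i][i] for i in range(n)]
    eigvecs = [[v[k][i] for k in range(n)] for i in range(n)]
    return eigvals, eigvecs


def composite_scores(z: Matrix) -> List[float]:
    """Composite score per row of z (standardized: rows students, columns indexes)."""
    n, m = len(z), len(z[0])
    r = [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1) for j in range(m)]
         for i in range(m)]
    eigvals, eigvecs = jacobi_eigen(r)
    total = sum(eigvals)
    order = sorted(range(m), key=lambda i: eigvals[i], reverse=True)
    scores = [0.0] * n
    for i in order:
        vec = eigvecs[i]
        if sum(vec) < 0:                    # assumed sign convention
            vec = [-c for c in vec]
        rate = eigvals[i] / total           # contribution rate of this component
        for k in range(n):
            scores[k] += rate * sum(z[k][j] * vec[j] for j in range(m))
    return scores
```

Sorting students by the returned scores in descending order then gives the ranking the text describes, with the highest score read as the greatest hardship.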
step 1 further comprises extracting the students' on-campus internet data, including browsing content, electronic device model, internet access location, type of website browsed and online time, and extracting data on the browsing of information about going abroad, tourism, IELTS training and New Oriental training; and in step 7, when two students have the same poverty degree, the student with higher consumption in these fields is regarded as the student with the lower poverty degree.
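The tie-breaking rule at the end of claim 1 amounts to a two-key sort. In the sketch below, `Student`, its `flagged_activity` field (some count or amount of activity in the listed fields, such as going abroad, tourism or training courses) and `rank_for_subsidy` are hypothetical names; the source does not say how that activity is measured.

```python
from typing import List, NamedTuple


class Student(NamedTuple):
    student_id: str
    score: float             # composite hardship score S from step 7
    flagged_activity: float   # hypothetical measure of activity in the listed fields


def rank_for_subsidy(students: List[Student]) -> List[Student]:
    """Order by score S descending; on equal S, less flagged activity ranks first."""
    return sorted(students, key=lambda s: (-s.score, s.flagged_activity))
```

A student with equal S but more activity in the listed fields therefore lands lower in the subsidy ranking, matching the rule that such a student is treated as less poor.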
2. The machine learning-based student poverty degree prediction method according to claim 1, wherein the standardization in step 4 is performed as follows: each index value $a_{ij}$ is converted into $a'_{ij}$, with
$$a'_{ij} = \frac{a_{ij} - u_j}{s_j}, \quad i = 1, 2, \dots, n,\ j = 1, 2, \dots, m,$$
where $u_j$ and $s_j$ are the sample mean and the sample standard deviation of the j-th index, respectively.
The transformed matrix $(a'_{ij})_{n \times m}$ is the result of the standardization.
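A minimal sketch of this column-wise standardization, using only the Python standard library. The `standardize` name is hypothetical, and the use of the sample standard deviation (divisor n - 1, as computed by `statistics.stdev`) is an assumption, since the defining formula for $s_j$ is in an image that is not reproduced here.

```python
import statistics
from typing import List


def standardize(matrix: List[List[float]]) -> List[List[float]]:
    """Column-wise z-score: a'_ij = (a_ij - u_j) / s_j."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]
    # Assumption: s_j is the sample standard deviation (divisor n - 1).
    stds = [statistics.stdev(col) for col in columns]
    return [
        [(value - u) / s for value, u, s in zip(row, means, stds)]
        for row in matrix
    ]
```

After this step every index has mean 0 and unit sample variance, so indexes measured in different units contribute on a comparable scale to the clustering and principal component steps.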
3. The machine learning-based student poverty degree prediction method according to claim 1, wherein: after step 4, the data of one class are sorted in ascending order as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, where $n$ is the number of data in the class and $x_{(k)}$, $1 \le k \le n$, denotes the value ranked $k$-th;
The $p$ quantile is calculated as:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where $[np]$ denotes the integer part of $np$; using equation (1), the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$; data smaller than $Q_3$ or greater than $Q_1$ are abnormal values, and the subject to whom such data belong is listed as abnormally poor.
4. The machine learning-based student poverty degree prediction method according to claim 1, wherein in step 2 the data are parsed and divided into unstructured text data and structured data, and the structured data are stored directly in the database; the text data are then extracted and, using natural language processing (NLP) technology, the snownlp library in Python is called to perform text word segmentation, named entity recognition and syntactic analysis, and the described objects and their features in the text are extracted and output as a table.
CN201810972342.8A 2018-08-24 2018-08-24 Student poverty degree prediction method based on machine learning Active CN109145113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810972342.8A CN109145113B (en) 2018-08-24 2018-08-24 Student poverty degree prediction method based on machine learning

Publications (2)

Publication Number Publication Date
CN109145113A CN109145113A (en) 2019-01-04
CN109145113B true CN109145113B (en) 2021-12-21

Family

ID=64827822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810972342.8A Active CN109145113B (en) 2018-08-24 2018-08-24 Student poverty degree prediction method based on machine learning

Country Status (1)

Country Link
CN (1) CN109145113B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754349B (en) * 2019-01-07 2023-09-05 上海复岸网络信息科技有限公司 Intelligent teacher-student matching system for online education
CN109992592B (en) * 2019-04-10 2020-12-08 哈尔滨工业大学 College poverty and poverty identification method based on flow data of campus consumption card
CN110097142A (en) * 2019-05-15 2019-08-06 杭州华网信息技术有限公司 Poor student's prediction technique of behavior is serialized for student
CN112215385B (en) * 2020-03-24 2024-03-19 北京桃花岛信息技术有限公司 Student difficulty degree prediction method based on greedy selection strategy
CN111415099A (en) * 2020-03-30 2020-07-14 西北大学 Poverty-poverty identification method based on multi-classification BP-Adaboost
CN112150111A (en) * 2020-09-23 2020-12-29 沈阳晁圣科技有限公司 Wisdom campus management system based on block chain
CN112416914B (en) * 2020-10-15 2023-07-11 三峡大学 Difficult student identification and early warning method and system based on big data analysis
CN112465300A (en) * 2020-11-05 2021-03-09 贵州广播电视大学(贵州职业技术学院) Campus network-based full-range big data system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682352A (en) * 2011-03-11 2012-09-19 鮑济美 Intelligent campus security information management system based on internet of things
CN105930540A (en) * 2016-03-23 2016-09-07 四川长虹电器股份有限公司 Data processing system
CN106504144A (en) * 2016-10-19 2017-03-15 中国矿业大学 A kind of campus service system based on cloud computing and mobile phone terminal
CN106886922A (en) * 2017-03-01 2017-06-23 安徽大智睿科技技术有限公司 It is a kind of that students ' analysis method and system are paid close attention to based on all-purpose card consumption
CN106934742A (en) * 2017-02-22 2017-07-07 黔南民族师范学院 A kind of Impoverished College Studentss assessment method
WO2017128868A1 (en) * 2016-01-26 2017-08-03 华为技术有限公司 Classification method, device and system for program file
CN108170765A (en) * 2017-12-25 2018-06-15 合肥城市云数据中心股份有限公司 Recommend method based on the poverty-stricken mountains in school behavioral data multidimensional analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160308898A1 (en) * 2015-04-20 2016-10-20 Phirelight Security Solutions Inc. Systems and methods for tracking, analyzing and mitigating security threats in networks via a network traffic analysis platform
US20170032250A1 (en) * 2015-07-29 2017-02-02 Ching-Ping Chang Machine Status And User Behavior Analysis System

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
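Research on the application of the random forest algorithm in predicting the identification of impoverished students at universities of traditional Chinese medicine; Tang Yan et al.; China Medical Herald (《中国医药导报》); 20170515; Vol. 14 (No. 14); 164-168 *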

Also Published As

Publication number Publication date
CN109145113A (en) 2019-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 1218, 12th floor, building 8, East District, yard 9, Linglong Road, Haidian District, Beijing 100089

Applicant after: BEIJING TAOHUADAO INFORMATION TECHNOLOGY Co.,Ltd.

Address before: Room 1503, Yanshan Hotel, No. 38 Guancun Avenue, Haidian District, Beijing

Applicant before: BEIJING TAOHUADAO INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant