CN109145113B - Student poverty degree prediction method based on machine learning - Google Patents
- Publication number
- CN109145113B (application number CN201810972342.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/20—Education
- G06Q50/205—Education administration or guidance
Abstract
The invention relates to a student poverty degree prediction method based on machine learning. The method obtains student-related data from multiple channels, analyzes the data, calculates the feature values relevant to student poverty, fills missing values, standardizes the data so that it maps to a fixed interval, clusters the data into several classes with a fast clustering algorithm using Euclidean distance, and calculates the importance of each class to the evaluation of the poverty degree. The matrix formed by each group of classified data is partitioned into blocks according to correlation, and a comprehensive poverty score is calculated from the block matrix. The score can serve as a reference when deciding subsidy amounts for poverty-stricken students: the higher the score, the poorer the student and the greater the need for subsidy. The invention also provides schemes for quickly identifying abnormal poverty-stricken students and for extracting the causes of poverty from the data.
Description
Technical Field
The invention belongs to the technical field of big data application, and particularly relates to a student poverty degree prediction method based on machine learning.
Background
At present, most colleges and universities determine a student's family economic situation mainly from family economic questionnaires completed by the students, together with the overall impressions reported by teachers and classmates. The poverty level of subsidized students cannot be quantified, so subsidies inevitably drift toward equal distribution regardless of need.
In the era of big data, the global backbone communication network transmits tens of thousands of terabytes of data every day, and each person's behavior is recorded in data of many forms. Students generate data in every aspect of campus life, and these data record their characteristics. Such data reflect students' real circumstances; used reasonably, they can to some extent mitigate the equal distribution of subsidies and direct more help to students who are truly in need.
Disclosure of Invention
The invention aims to provide a student poverty degree prediction method based on machine learning, so as to overcome the tendency toward equal distribution when poverty-stricken students are subsidized.
The invention achieves this purpose through the following technical scheme:
A student poverty degree prediction method based on machine learning comprises the following steps:
step 1, acquiring data related to student poverty;
step 2, analyzing the data, dividing it into unstructured text data and structured data, and storing the structured data directly in a database;
step 3, finding and filling missing data;
step 4, standardizing the original structured data so that the resulting values are uniformly mapped to a fixed interval;
step 5, clustering the data into k classes with a fast clustering algorithm using Euclidean distance:
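Repeating the above steps m times yields the classification $G^{(m)} = \{G_1^{(m)}, \dots, G_k^{(m)}\}$, where $x_i^{(m)}$ is the center of gravity of $G_i^{(m-1)}$. As m gradually increases, the classification tends to stabilize, and $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e. $x_i^{(m+1)} \approx x_i^{(m)}$ and $G_i^{(m+1)} \approx G_i^{(m)}$; at this point the calculation ends. Alternatively, if for some m the classifications $G^{(m+1)}$ and $G^{(m)}$ are the same, the calculation ends;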
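The iteration in step 5 can be sketched as follows. This is a minimal illustration that assumes the usual scheme behind such fast clustering methods: assign each point to the nearest center by Euclidean distance, move each center to the center of gravity of its class, and stop when the classification no longer changes. The function name `kmeans`, the initial centers and the sample points are hypothetical and do not come from the patent.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Cluster points (tuples of floats) around the given initial centers."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest center by Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # classification unchanged between iterations: stop
        labels = new_labels
        # Move each center to the center of gravity (mean) of its class.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty class keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical two-dimensional data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The stopping test compares whole classifications, which corresponds to the second termination condition in the text (two successive classifications being the same).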
step 6, for each category obtained by clustering, calculating the relative importance of each factor in that category's evaluation factor system to achieving the evaluation target and function, and checking the calculation result to ensure the reliability of the evaluation conclusion;
step 6.1, analyzing the correlation coefficient matrix computed from the summarized poverty-factor data: two indexes in the same direction (both positive indexes or both negative indexes) should be positively correlated, with a correlation coefficient greater than zero; two indexes in opposite directions (one positive index and one negative index) should be negatively correlated, with a correlation coefficient less than zero; if these conditions are not satisfied, step 6.2 is carried out;
step 6.2, separating any two indexes whose correlation coefficient violates this principle to obtain a block matrix, then performing principal component analysis on each block separately; if the index weight coefficients of some block do not meet the requirement, step 6.1 is repeated until the final block matrix is obtained;
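The sign check in step 6.1 can be illustrated with a short sketch. It computes Pearson correlation coefficients between pairs of indexes and reports the pairs whose sign contradicts their declared directions; these are the pairs that step 6.2 would place in different blocks. The index names, the direction labels (+1 for a positive index, -1 for a negative index) and the data are hypothetical, and constant columns (zero variance) are not handled.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def violating_pairs(columns, direction):
    """Return index pairs whose correlation sign contradicts their directions.

    columns:   dict mapping index name -> list of values
    direction: dict mapping index name -> +1 (positive) or -1 (negative)
    """
    names = list(columns)
    bad = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            na, nb = names[a], names[b]
            r = pearson(columns[na], columns[nb])
            expected_sign = direction[na] * direction[nb]
            if r * expected_sign <= 0:  # wrong sign (or zero correlation)
                bad.append((na, nb))
    return bad

# Hypothetical indexes: "spend" is declared negative but rises with the others.
columns = {
    "debt": [1.0, 2.0, 3.0, 4.0],
    "medical": [2.0, 3.0, 5.0, 6.0],
    "spend": [2.0, 3.0, 4.0, 6.0],
}
direction = {"debt": +1, "medical": +1, "spend": -1}
bad = violating_pairs(columns, direction)
```

In this example `spend` would be separated from `debt` and `medical`, giving two blocks.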
step 7, calculating the principal component comprehensive evaluation score:
q is the number of blocks in the final block matrix obtained in step 6; the i-th block (i = 1, …, q) has $a_i$ index items, and the weight of the i-th block is $w_i$;
the composite score is $S = \sum_{i=1}^{q} (-1)^{t} w_i S_i$, where $S_i$ is the score of the i-th block; t = 1 if the i-th block is negatively correlated with the total score, and t = 0 if it is positively correlated;
$S_i$ is obtained as follows: compute the correlation coefficient matrix R of the normalized matrix, where $r_{ij}$ (i = 1, …, n; j = 1, …, m) is the element in row i and column j of R, and solve for the eigenvalues $\lambda_i$ (i = 1, 2, …, s, where s is the total number of eigenvalues) and the eigenvectors $v_i$ of the correlation matrix;
calculate the contribution rate $c_i = \lambda_i / \sum_{k=1}^{s} \lambda_k$ and order the components by the size of their eigenvalues. The new index variables are formed from the eigenvectors:
$y_j = v_j^{T} x, \quad j = 1, \dots, s,$
where x is the vector of standardized index values, $y_1$ is the first principal component and $y_s$ is the s-th principal component; calculate the principal component comprehensive evaluation score $S_i = \sum_{j=1}^{s} c_j y_j$, where $c_j$ is the contribution rate of the j-th principal component. The scores S of all students are then ranked; a higher S indicates a higher degree of hardship, and the value of S is used as a reference for allocating subsidies.
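How the block scores combine into the composite score in step 7 can be sketched as below. Two points are assumptions rather than statements of the patent, because the formulas are not recoverable from the text: the block weight is taken as the block's share of index items, $w_i = a_i / \sum_j a_j$, and the sign is applied as $(-1)^t$ with t = 1 for negatively correlated blocks and t = 0 otherwise. The function name and the numbers are hypothetical.

```python
def composite_score(block_scores, block_sizes, negatively_correlated):
    """Combine per-block scores S_i into one composite score S.

    block_scores:          S_i for each block
    block_sizes:           a_i, the number of index items in each block
    negatively_correlated: True where the block lowers the total score (t = 1)
    """
    total_items = sum(block_sizes)
    score = 0.0
    for s_i, a_i, negative in zip(block_scores, block_sizes, negatively_correlated):
        w_i = a_i / total_items      # ASSUMED weight: share of index items
        t = 1 if negative else 0     # t as defined in step 7
        score += (-1) ** t * w_i * s_i
    return score

# Hypothetical example: a 3-index block scoring 2.0 that raises the total,
# and a 1-index block scoring 1.0 that lowers it.
s = composite_score([2.0, 1.0], [3, 1], [False, True])
```

Here the result is 0.75 × 2.0 − 0.25 × 1.0 = 1.25. A different weighting scheme would only change the line that computes `w_i`.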
As a further improvement, the method also includes extracting data on the students' on-campus internet use, including browsing content, types of electronic devices used, places of internet access, types of websites browsed and online time; the browsing information includes data relating to study abroad, tourism, IELTS training and New Oriental training.
Preferably, acquiring data in the method includes extracting the students' basic information, including name, native place, ethnicity, health condition, political status, whether the student enrolled through the green channel, whether a student-origin-based loan was taken out, whether a campus-based loan was taken out, whether the student receives a subsidy, what awards or subsidies were received during university, mode of admission, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position, and other information about the guardian;
it also includes extracting the student's family information, including whether the student is an only child, a left-behind child, or a child accompanying parents who are migrant workers in a city; whether the family receives the minimum living allowance; whether the student is the child of a martyr or of a family entitled to preferential treatment; whether the student is an orphan; whether the family is a registered poor household; urban and rural minimum guarantee status; whether the family is in extreme poverty (urban or rural); whether the family receives the urban or rural minimum living guarantee; whether the family is low-income; whether it is a rural five-guarantee household; the student's place of residence; whether the student or a parent is disabled or seriously ill; the family's main source of income; annual family income; whether the family is single-parent; whether the student is from another region; whether a parent requires care; whether a family member has died; whether the family lives in a poverty-stricken county; whether the area is mountainous; whether a study-assistance loan has been taken out; housing conditions; sudden unexpected family events; unemployment among family members; the amount and cause of family debt; subsidies the student has already received in the current school year; the father's occupation; the mother's occupation; whether the family owns a house in a town; and the family's annual medical expenditure;
it also includes extracting campus one-card consumption data and calculating features, including total consumption amount, maximum monthly consumption amount, total number of transactions, average daily consumption amount, average daily number of transactions, number of days with consumption, canteen consumption amount, total number of canteen transactions, maximum number of canteen transactions, daily canteen consumption amount, maximum daily canteen consumption amount, daily number of canteen transactions, supermarket consumption amount, number of supermarket transactions, medical consumption amount, number of medical transactions, number of boiled-water purchases, book consumption amount, number of book purchases, daily breakfast consumption amount, breakfast consumption variance, daily lunch consumption amount, lunch consumption variance, daily dinner consumption amount, dinner consumption variance, monthly breakfast consumption amount, monthly number of breakfast transactions, monthly lunch consumption amount, monthly number of lunch transactions, monthly dinner consumption amount, monthly number of dinner transactions, canteen consumption ratio, supermarket consumption ratio, fruit consumption ratio, and the deviation of total consumption on holidays;
it also includes extracting student grade data, including course name, course number, term in which the course is offered, class hours, credits, examination score and coursework score, and calculating grade point, term average grade point, academic-year average grade point and the number of failed courses;
it further includes extracting library data, including book title, book number, book type, borrowing time, borrowing place, return time and return place, and calculating features from these data: number of borrowings, borrowing frequency, borrowing duration, average borrowing duration, annual average number of borrowings, deviation of borrowing time, and number of books borrowed on holidays;
it further includes extracting access control and power consumption data, including card-swipe time, card-swipe location, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory power consumption, per-person dormitory power consumption and school-wide per-student power consumption;
Preferably, the standardization in step 4 converts each index $a_{ij}$ into $a_{ij}'$ as follows: $a_{ij}' = (a_{ij} - u_j)/s_j$, i = 1, 2, …, n, j = 1, 2, …, m.
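The standardization formula can be sketched in a few lines. The patent does not define $u_j$ and $s_j$; this sketch assumes they are the mean and the sample standard deviation of index j (a z-score), which is the common reading of this formula. The function name and the data are hypothetical.

```python
import statistics

def standardize_columns(rows):
    """Apply a_ij' = (a_ij - u_j) / s_j to every column j of a row-major table.

    ASSUMPTION: u_j is the column mean and s_j the sample standard deviation.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # needs >= 2 rows, s_j > 0
    return [
        [(value - u) / s for value, u, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical table: 3 students (rows) x 2 indexes (columns).
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = standardize_columns(rows)
```

After this step the two columns are directly comparable even though their original scales differ by a factor of ten.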
After step 4, the data are arranged in ascending order; the number of data in one class is denoted n, and the value ranked k-th is denoted $x_{(k)}$ (1 ≤ k ≤ n).
The p-quantile is calculated as:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where [np] denotes the integer part of np. From equation (1), the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the subject of such data is listed as an abnormal poverty-stricken student. An abnormal poverty-stricken student deviates severely from the population in some respect and requires special treatment.
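The abnormal-value screening can be sketched as follows. The quantile rule used here is an assumption, since the formula itself is not legible in the text: when np is not an integer the value ranked [np]+1 is taken, and when np is an integer the values ranked np and np+1 are averaged. The cut-off points follow the text ($M_{0.75} + 1.5R$ and $M_{0.25} - 1.5R$). Function names and data are hypothetical.

```python
def quantile(sorted_x, p):
    """p-quantile of an ascending list, for 0 < p < 1 (ASSUMED textbook rule)."""
    n = len(sorted_x)
    np_value = n * p
    k = int(np_value)                  # integer part [np]
    if np_value == k:                  # np is an integer: average ranks np, np+1
        return (sorted_x[k - 1] + sorted_x[k]) / 2
    return sorted_x[k]                 # rank [np]+1 (list index k is 0-based)

def abnormal_values(data):
    """Return (lower cut-off, upper cut-off, values outside the cut-offs)."""
    xs = sorted(data)
    m25, m75 = quantile(xs, 0.25), quantile(xs, 0.75)
    r = m75 - m25
    upper = m75 + 1.5 * r              # called Q1 in the text
    lower = m25 - 1.5 * r              # called Q3 in the text
    return lower, upper, [x for x in xs if x < lower or x > upper]

# Hypothetical daily spending figures with one extreme value.
lower, upper, outliers = abnormal_values([1, 2, 3, 4, 5, 6, 7, 100])
```

Because each index is screened on its own, the check can run on a single column without waiting for the full data set.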
The method for extracting the causes of poverty is as follows: after step 2, the text data are extracted and processed with natural language processing (NLP) technology; the snownlp library in Python is called to perform text word segmentation, named entity recognition and syntactic analysis; the described objects and their features are extracted from the text and output as a table, so that a reader can quickly see the causes of a student's poverty.
The invention has the beneficial effects that:
1) the method uses a large amount of data when evaluating the degree of poverty, which avoids the one-sidedness of evaluation based on a single data source;
2) the invention uses machine learning to process the data, which can overcome the influence of subjective human factors;
3) the invention also provides a method for rapidly filtering the data of abnormal poverty-stricken students, so that such students can be identified and given special treatment.
Detailed Description
The present application is described in further detail below, and it should be noted that the following detailed description is provided for illustrative purposes only, and is not intended to limit the scope of the present application, which is defined by the appended claims.
Example 1
A student poverty degree prediction method based on machine learning comprises the following steps:
step A1, acquiring data: the data comprise poverty-relief data, civil affairs department data, student financial aid system data, student records system data, school educational administration system data, campus one-card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data; a database is established to store them;
the data acquisition includes extracting the students' basic information, including name, native place, ethnicity, health condition, political status, whether the student enrolled through the green channel, whether a student-origin-based loan was taken out, whether a campus-based loan was taken out, whether the student receives a subsidy, what awards or subsidies were received during university, mode of admission, school name, student number, college, sex, major, class, current address, relationship with the guardian, the guardian's position, and other information;
it also includes extracting the student's family information, including whether the student is an only child, a left-behind child, or a child accompanying parents who are migrant workers in a city; whether the family receives the minimum living allowance; whether the student is the child of a martyr or of a family entitled to preferential treatment; whether the student is an orphan; whether the family is a registered poor household; urban and rural minimum guarantee status; whether the family is in extreme poverty (urban or rural); whether the family receives the urban or rural minimum living guarantee; whether the family is low-income; whether it is a rural five-guarantee household; the student's place of residence; whether the student or a parent is disabled or seriously ill; the family's main source of income; annual family income; whether the family is single-parent; whether the student is from another region; whether a parent requires care; whether a family member has died; whether the family lives in a poverty-stricken county; whether the area is mountainous; whether a study-assistance loan has been taken out; housing conditions; sudden unexpected family events; unemployment among family members; the amount and cause of family debt; subsidies the student has already received in the current school year; the father's occupation; the mother's occupation; whether the family owns a house in a town; and the family's annual medical expenditure;
it also includes extracting campus one-card consumption data and calculating features, including total consumption amount, maximum monthly consumption amount, total number of transactions, average daily consumption amount, average daily number of transactions, number of days with consumption, canteen consumption amount, total number of canteen transactions, maximum number of canteen transactions, daily canteen consumption amount, maximum daily canteen consumption amount, daily number of canteen transactions, supermarket consumption amount, number of supermarket transactions, medical consumption amount, number of medical transactions, number of boiled-water purchases, book consumption amount, number of book purchases, daily breakfast consumption amount, breakfast consumption variance, daily lunch consumption amount, lunch consumption variance, daily dinner consumption amount, dinner consumption variance, monthly breakfast consumption amount, monthly number of breakfast transactions, monthly lunch consumption amount, monthly number of lunch transactions, monthly dinner consumption amount, monthly number of dinner transactions, canteen consumption ratio, supermarket consumption ratio, fruit consumption ratio, and the deviation of total consumption on holidays;
it also includes extracting student grade data, including course name, course number, term in which the course is offered, class hours, credits, examination score and coursework score, and calculating grade point, term average grade point, academic-year average grade point and the number of failed courses;
it further includes extracting library data, including book title, book number, book type, borrowing time, borrowing place, return time and return place, and calculating features from these data: number of borrowings, borrowing frequency, borrowing duration, average borrowing duration, annual average number of borrowings, deviation of borrowing time, and number of books borrowed on holidays;
it further includes extracting access control and power consumption data, including card-swipe time, card-swipe location, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory power consumption, per-person dormitory power consumption and school-wide per-student power consumption;
it further includes extracting data on the students' on-campus internet use, including browsing content, types of electronic devices used, places of internet access, types of websites browsed and online time; the browsing information includes data relating to study abroad, tourism, IELTS training and New Oriental training. When two students have the same poverty degree in step A7, the student with higher consumption in these areas is regarded as having the lower poverty degree.
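A few of the campus one-card features listed in step A1 can be derived from raw transaction records as sketched below. The record layout (date, category, amount), the category names and the feature keys are hypothetical; the patent lists the features but does not specify a data format. The sketch assumes at least one transaction with a positive total.

```python
from collections import defaultdict

def card_features(transactions):
    """Compute a handful of consumption features from (date, category, amount) records."""
    total = sum(amount for _, _, amount in transactions)
    by_day = defaultdict(float)
    by_category = defaultdict(float)
    for day, category, amount in transactions:
        by_day[day] += amount
        by_category[category] += amount
    days = len(by_day)
    return {
        "total_amount": total,                   # total consumption amount
        "transaction_count": len(transactions),  # total number of transactions
        "consumption_days": days,                # days with any consumption
        "daily_mean_amount": total / days,       # average daily consumption
        "canteen_ratio": by_category["canteen"] / total,
        "supermarket_ratio": by_category["supermarket"] / total,
    }

# Hypothetical records for one student.
transactions = [
    ("2018-03-01", "canteen", 6.0),
    ("2018-03-01", "supermarket", 4.0),
    ("2018-03-02", "canteen", 10.0),
]
features = card_features(transactions)
```

The remaining features (per-meal variances, monthly maxima, holiday deviation) follow the same pattern of grouping by day, month or category and then aggregating.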
step A2, analyzing the data, dividing it into unstructured text data and structured data, and storing the structured data directly in a database;
step A3, finding and filling missing data, using different filling strategies (mean filling, interpolation and fitting) according to the type of missingness to complete the initialization of the missing data; these strategies are conventional techniques and are not described further;
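Two of the filling strategies named in step A3 can be sketched as follows: mean filling and linear interpolation. Missing entries are represented as `None`, which is an assumption about the data layout; the function names are hypothetical, and each function assumes the series contains at least one known value.

```python
def fill_mean(values):
    """Replace each missing entry (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_linear(values):
    """Fill interior gaps by linear interpolation between the nearest known
    neighbours; gaps at either end copy the nearest known value."""
    known_idx = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known_idx if k < i), default=None)
        right = min((k for k in known_idx if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
    return out

# Hypothetical monthly spending series with gaps.
mean_filled = fill_mean([1.0, None, 3.0])
linear_filled = fill_linear([1.0, None, None, 4.0])
```

Mean filling suits unordered attributes, while interpolation is more natural for time-ordered series such as monthly consumption.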
step A4, standardizing the original structured data so that the resulting values are uniformly mapped to a fixed interval;
each index $a_{ij}$ is converted into $a_{ij}'$ as follows: $a_{ij}' = (a_{ij} - u_j)/s_j$, i = 1, 2, …, n, j = 1, 2, …, m.
The data are arranged in ascending order; the number of data in one class is denoted n, and the value ranked k-th is denoted $x_{(k)}$ (1 ≤ k ≤ n).
The p-quantile is calculated as:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where [np] denotes the integer part of np. From equation (1), the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the subject of such data is listed as an abnormal poverty-stricken student.
Step A5, according to the fast clustering algorithm, using Euclidean distance to cluster data into k types:
Repeating the above steps m times yields the classification $G^{(m)} = \{G_1^{(m)}, \dots, G_k^{(m)}\}$, where $x_i^{(m)}$ is the center of gravity of $G_i^{(m-1)}$. As m gradually increases, the classification tends to stabilize, and $x_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e. $x_i^{(m+1)} \approx x_i^{(m)}$ and $G_i^{(m+1)} \approx G_i^{(m)}$; at this point the calculation ends. Alternatively, if for some m the classifications $G^{(m+1)}$ and $G^{(m)}$ are the same, the calculation ends;
step A6, for each category obtained by clustering, calculating the relative importance of each factor in that category's evaluation factor system to achieving the evaluation target and function, and checking the calculation result to ensure the reliability of the evaluation conclusion;
step A6.1, analyzing the correlation coefficient matrix computed from the summarized poverty-factor data: two indexes in the same direction (both positive indexes or both negative indexes) should be positively correlated, with a correlation coefficient greater than zero; two indexes in opposite directions (one positive index and one negative index) should be negatively correlated, with a correlation coefficient less than zero; if these conditions are satisfied, step A7 is carried out directly to calculate the score, and if they are not satisfied, step A6.2 is carried out;
step A6.2, separating any two indexes whose correlation coefficient violates this principle to obtain a block matrix, then performing principal component analysis on each block separately; if the index weight coefficients of some block do not meet the requirement, step A6.1 is repeated until the final block matrix is obtained;
step A7, calculating the principal component comprehensive evaluation score:
q is the number of blocks in the final block matrix obtained in step A6; the i-th block (i = 1, …, q) has $a_i$ index items, and the weight of the i-th block is $w_i$;
the composite score is $S = \sum_{i=1}^{q} (-1)^{t} w_i S_i$, where $S_i$ is the score of the i-th block; t = 1 if the i-th block is negatively correlated with the total score, and t = 0 if it is positively correlated;
$S_i$ is obtained as follows: compute the correlation coefficient matrix R of the normalized matrix, where $r_{ij}$ (i = 1, …, n; j = 1, …, m) is the element in row i and column j of R, and solve for the eigenvalues $\lambda_i$ (i = 1, 2, …, s, where s is the total number of eigenvalues) and the eigenvectors $v_i$ of the correlation matrix;
calculate the contribution rate $c_i = \lambda_i / \sum_{k=1}^{s} \lambda_k$ and order the components by the size of their eigenvalues. The new index variables are formed from the eigenvectors:
$y_j = v_j^{T} x, \quad j = 1, \dots, s,$
where x is the vector of standardized index values, $y_1$ is the first principal component and $y_s$ is the s-th principal component; calculate the principal component comprehensive evaluation score $S_i = \sum_{j=1}^{s} c_j y_j$, where $c_j$ is the contribution rate of the j-th principal component. The scores S of all students are then ranked; a higher S indicates a higher degree of hardship, and the value of S is used as a reference for allocating subsidies.
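The final ranking, together with the tie-breaking rule stated in step A1 (when two students have the same poverty degree, the one with higher consumption in the flagged browsing areas is regarded as less poor), can be sketched as a single sort. The field names `score` and `flagged_spend` and the sample records are hypothetical.

```python
def rank_students(students):
    """Order students from most to least in need.

    Primary key: composite score S, descending (higher S = greater hardship).
    Tie-break:   flagged consumption, ascending (higher spend = less poor).
    """
    return sorted(students, key=lambda s: (-s["score"], s["flagged_spend"]))

# Hypothetical records; "a" and "b" tie on score, "a" spends more.
students = [
    {"id": "a", "score": 0.8, "flagged_spend": 50.0},
    {"id": "b", "score": 0.8, "flagged_spend": 0.0},
    {"id": "c", "score": 0.9, "flagged_spend": 100.0},
]
ranking = [s["id"] for s in rank_students(students)]
```

The tie-break only applies between equal scores, so student "c" stays first despite the highest flagged consumption.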
Embodiment 2, a method for extracting the causes of poverty, comprising the following steps:
step B1, acquiring data: the data comprise poverty-relief data, civil affairs department data, student financial aid system data, student records system data, school educational administration system data, campus one-card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data; a database is established to store them;
step B2, analyzing the data and dividing it into unstructured text data and structured data; using natural language processing (NLP) technology, the snownlp library in Python is called to perform text word segmentation, named entity recognition and syntactic analysis; the described objects and their features are extracted from the text and output as a table.
Embodiment 3, a method for rapidly monitoring and filtering the data of abnormal poverty-stricken students, comprising the following steps:
step C1, acquiring data: the data comprise poverty-relief data, civil affairs department data, student financial aid system data, student records system data, school educational administration system data, campus one-card consumption data, internet behavior data, attendance system data, school forum data, library data and school hospital system data; a database is established to store them;
step C2, analyzing the data, dividing it into unstructured text data and structured data, and storing the structured data directly in a database;
step C3, finding and filling missing data, using different filling strategies (mean filling, interpolation and fitting) according to the type of missingness to complete the initialization of the missing data;
step C4, applying a linear transformation to the original structured data so that the resulting values are uniformly mapped to a fixed interval;
step C5, arranging the data in ascending order; the number of data in one class is denoted n, and the value ranked k-th is denoted $x_{(k)}$ (1 ≤ k ≤ n).
The p-quantile is calculated as:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \tfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where [np] denotes the integer part of np. From equation (1), the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated, where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$. Data less than $Q_3$ or greater than $Q_1$ are abnormal values; the subject of such data is listed as an abnormal poverty-stricken student, whose behavior departs from the general average and deserves special attention. The method does not depend on the data being complete: a judgment can be made from a single data category on its own, which makes it fast.
The above embodiments express only several implementations of the present invention; their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention.
Claims (4)
1. A student poverty degree prediction method based on machine learning is characterized in that: the method comprises the following steps:
step 1, obtaining data related to student poverty, wherein obtaining the data comprises extracting basic student information, including name, native place, ethnicity, health condition, political affiliation, whether the student enrolled through the green channel, whether a student-origin-based loan has been taken out, whether a campus-based loan has been taken out, whether the student receives a subsidy, which awards or grants were received during university, mode of admission, school name, student number, college, sex, major, class, current residence, relationship with the guardian, the guardian's position, and other information about the guardian;
also comprises extracting student family information data, including whether the student is an only child, whether the student is a left-behind child, whether the student is a child accompanying migrant-worker parents to the city, whether the family receives the minimum living allowance, whether the student is the child of a martyr or of a family entitled to preferential treatment, whether the student is an orphan, whether the household is a registered poor household, the urban and rural minimum living guarantee, whether the student is in extreme poverty in an urban or rural area, whether the family is covered by the urban and rural minimum living guarantee, whether the family is a low-income family, whether the household is a rural five-guarantee household, the student's household registration, whether the student is disabled, whether the student is seriously ill, whether a parent is disabled, whether a parent is seriously ill, the main source of family income, the annual family income, whether the student is from a single-parent family, whether the parents are divorced, whether the parents are supported, whether the family has lost a member, whether the home is in a poverty-stricken county, whether the home is in a mountainous area, whether a study-assistance loan has been taken out, housing conditions, sudden unexpected events in the family, unemployment of family members, the amount of family debt, the reason for the debt, the aid received by the student in the current school year, the father's occupation, the mother's occupation, whether the family owns a house in town, and the family's annual medical expenditure;
extracting campus one-card consumption data and calculating features including: total consumption amount, maximum monthly consumption amount, total number of transactions, average daily consumption amount, average daily number of transactions, number of days with consumption, canteen consumption amount, total number of canteen transactions, maximum number of canteen transactions, daily canteen consumption amount, maximum daily canteen consumption amount, daily number of canteen transactions, supermarket consumption amount, number of supermarket transactions, medical consumption amount, number of medical transactions, number of boiled-water transactions, book consumption amount, number of book transactions, daily breakfast consumption amount, daily breakfast consumption variance, daily lunch consumption amount, daily lunch consumption variance, daily dinner consumption amount, daily dinner consumption variance, monthly total breakfast consumption amount, monthly total number of breakfast transactions, monthly total lunch consumption amount, monthly total number of lunch transactions, monthly total dinner consumption amount, monthly total number of dinner transactions, canteen consumption ratio, supermarket consumption ratio, fruit consumption ratio, and the degree of deviation of total consumption on holidays;
extracting student grade data, including course name, course number, semester in which the course is offered, class hours, score, examination score and coursework score, and calculating grade points, semester average grade, academic-year average grade and the number of failed courses;
further comprising extracting library data, including book title, book number, book category, borrowing time, borrowing place, returning time and returning place, and calculating features from the data: number of borrowings, borrowing frequency, borrowing duration, average borrowing duration, annual average number of borrowings, degree of deviation of borrowing time, and number of books borrowed on holidays;
further comprising extracting access-control and electricity consumption data, including card-swiping time, card-swiping place, times of entering and leaving the dormitory building, time spent in the dormitory, time spent in the library, total dormitory electricity consumption, per-capita dormitory electricity consumption and per-capita electricity consumption of students in the school;
step 2, analyzing data, dividing the data into unstructured text data and structured data, and directly storing the structured data into a database;
step 3, finding and filling missing data;
step 4, carrying out standardization processing on the original structured data to enable the result values to be uniformly mapped to a fixed interval;
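step 5, clustering the data into $k$ classes using a fast clustering algorithm with the Euclidean distance:
starting from the initial classes $G^{(0)} = \{G_1^{(0)}, \ldots, G_k^{(0)}\}$, computing a new set of points $L^{(1)}$: calculating $\bar{x}_i^{(1)} = \frac{1}{n_i} \sum_{x_l \in G_i^{(0)}} x_l$, $i = 1, \ldots, k$, where $n_i$ is the number of points in $G_i^{(0)}$, to obtain the new set $L^{(1)} = \{\bar{x}_1^{(1)}, \ldots, \bar{x}_k^{(1)}\}$; starting from $L^{(1)}$, classifying again and denoting $G_i^{(1)} = \{x : d(x, \bar{x}_i^{(1)}) \le d(x, \bar{x}_j^{(1)}),\ j = 1, \ldots, k,\ j \ne i\}$, $i = 1, \ldots, k$, to obtain $G^{(1)} = \{G_1^{(1)}, \ldots, G_k^{(1)}\}$;
repeating the above steps $m$ times to obtain $G^{(m)} = \{G_1^{(m)}, \ldots, G_k^{(m)}\}$, where $\bar{x}_i^{(m)}$ is the center of gravity of $G_i^{(m-1)}$; as $m$ gradually increases, the classification tends to be stable, and $\bar{x}_i^{(m)}$ can be approximately regarded as the center of gravity of $G_i^{(m)}$, i.e. $\bar{x}_i^{(m+1)} \approx \bar{x}_i^{(m)}$ and $G_i^{(m+1)} \approx G_i^{(m)}$, at which point the calculation ends; alternatively, if for a certain $m$, $G^{(m+1)}$ and $G^{(m)}$ are the same, the calculation ends;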
step 6, calculating the relative importance degree of each factor in each category evaluation factor system after each cluster to the realization of the evaluation target and the function, and checking the calculation result to ensure the reliability of the evaluation conclusion;
step 6.1, analyzing the correlation coefficient matrix based on the summarized poverty-factor data, wherein two indexes of the same direction (both positive indexes or both negative indexes) should be positively correlated, with a correlation coefficient greater than zero; two indexes of opposite direction (one positive index and one negative index) should be negatively correlated, with a correlation coefficient less than zero; if this principle is not satisfied, proceeding to step 6.2;
step 6.2, separating the two indexes whose correlation coefficient does not conform to the principle to obtain a block matrix, performing principal component analysis on each block separately, and repeating step 6.1 if the index weight coefficients of a certain block do not satisfy the principle, so as to obtain the final block matrix;
step 7, calculating the comprehensive evaluation score of the principal components:
$q$ is the number of final blocks obtained in step 6, and the $i$-th block has $a_i$ index terms, where $i = 1, \ldots, q$; the weight of the $i$-th block is $w_i = a_i / \sum_{j=1}^{q} a_j$,
and the composite score is $S = \sum_{i=1}^{q} (-1)^{t} w_i S_i$, where $S_i$ is the score of the $i$-th block; if the $i$-th block is negatively correlated with the total score, then $t = 1$, and if the $i$-th block is positively correlated with the total score, then $t = 0$,
$S_i$ is solved as follows: computing the correlation coefficient matrix $R$ of the standardized matrix $\tilde{A}$, where $r_{ij}$ is the element in row $i$ and column $j$ of $R$, $i = 1, \ldots, n$, $j = 1, \ldots, m$; solving for the eigenvalues $\lambda_i$ and the eigenvectors $v_i$ of the correlation matrix, where $i = 1, \ldots, s$ and $s$ is the total number of eigenvalues;
calculating the contribution rate $b_i = \lambda_i / \sum_{k=1}^{s} \lambda_k$, arranging the components in order of eigenvalue size, and forming new index variables from the eigenvectors: $y_j = v_j^{\mathsf{T}} \tilde{x}$, $j = 1, \ldots, s$, where $\tilde{x}$ is the vector of standardized indexes,
where $y_1$ is the first principal component and $y_s$ is the $s$-th principal component; calculating the comprehensive evaluation score of the principal components $S_i = \sum_{j=1}^{s} b_j y_j$, where $b_j$ is the contribution rate of the $j$-th principal component; sorting the value of $S$ corresponding to each student, wherein a higher $S$ indicates a higher degree of difficulty, and distributing subsidies with the value of $S$ as the reference;
step 1 further comprises extracting the students' on-campus internet data, including browsed content, electronic device model, place of internet access, types of websites browsed and time online, and extracting from the browsing information the data relating to going abroad, travel, IELTS training and New Oriental training; and in step 7, when two students have the same poverty degree, the student with higher consumption in these areas is regarded as having the lower poverty degree.
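The fast clustering of step 5 can be sketched in plain Python as follows. This is an illustrative sketch only: it assumes the initial class centers are supplied by the caller, and the helper names `centroid`, `assign` and `fast_cluster`, the `max_iter` limit and the sample points are hypothetical rather than part of the claimed method.

```python
import math

def centroid(points):
    """Center of gravity (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def assign(points, centers):
    """Put each point into the class whose center is nearest (Euclidean distance)."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[nearest].append(p)
    return groups

def fast_cluster(points, initial_centers, max_iter=100):
    """Alternate re-centering and re-classification until the classes stop changing."""
    centers = list(initial_centers)
    groups = assign(points, centers)
    for _ in range(max_iter):
        # An empty class keeps its previous center (an assumption of this sketch).
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        new_groups = assign(points, new_centers)
        if new_groups == groups:          # G(m+1) equals G(m): stop
            return new_groups, new_centers
        groups, centers = new_groups, new_centers
    return groups, centers

# Hypothetical standardized feature vectors for six students, k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
groups, centers = fast_cluster(points, [(0.0, 0.0), (10.0, 10.0)])
```

The loop stops as soon as two successive classifications coincide, which corresponds to the second termination condition in step 5; the iteration cap is only a safeguard.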
2. The machine learning-based student poverty degree prediction method according to claim 1, wherein the standardization in step 4 is performed as follows: each index value $a_{ij}$ is converted into $a_{ij}'$, with $a_{ij}' = (a_{ij} - u_j)/s_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$.
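A short Python sketch of this conversion is given below. The claim text does not define $u_j$ and $s_j$; the sketch assumes they are the mean and the sample standard deviation of the $j$-th index, and the function name `standardize` and the sample matrix are hypothetical.

```python
import statistics

def standardize(matrix):
    """Column-wise conversion a_ij' = (a_ij - u_j) / s_j for a list of rows."""
    columns = list(zip(*matrix))
    means = [statistics.mean(col) for col in columns]   # u_j (assumed: column mean)
    stds = [statistics.stdev(col) for col in columns]   # s_j (assumed: sample std)
    result = []
    for row in matrix:
        # A constant column has s_j = 0; map it to 0.0 instead of dividing by zero.
        result.append([(a - u) / s if s else 0.0
                       for a, u, s in zip(row, means, stds)])
    return result

# Hypothetical data: three students (rows), two indexes (columns).
raw = [[1.0, 10.0],
       [2.0, 20.0],
       [3.0, 30.0]]
scaled = standardize(raw)
```

After the conversion both columns have the same scale, so indexes measured in different units (for example amounts and counts) can be compared in the clustering and principal component steps.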
3. The machine learning-based student poverty degree prediction method according to claim 1, wherein: after step 4, the data are sorted in ascending order, the number of data items in one class is denoted $n$, and the item ranked $k$-th is denoted $x_{(k)}$, $1 \le k \le n$;
The $p$-quantile is calculated as follows:
$$M_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \frac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases} \qquad (1)$$
where $[np]$ denotes the integer part of $np$; the upper cut-off point $Q_1 = M_{0.75} + 1.5R$ and the lower cut-off point $Q_3 = M_{0.25} - 1.5R$ are calculated according to equation (1), where $M_{0.25}$ is the 0.25 quantile, $M_{0.75}$ is the 0.75 quantile, and $R = M_{0.75} - M_{0.25}$; data less than $Q_3$ or greater than $Q_1$ are abnormal values, and the subject of such data is listed as abnormally impoverished.
4. The machine learning-based student poverty degree prediction method according to claim 1, wherein in step 2 the data are parsed and divided into unstructured text data and structured data, and the structured data are stored directly in the database; the text data are then extracted, and, using NLP (natural language processing) technology, the snownlp library in Python is called to implement text word segmentation, named entity recognition and syntactic analysis, so that the described objects and object features in the text are extracted and output as a table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810972342.8A CN109145113B (en) | 2018-08-24 | 2018-08-24 | Student poverty degree prediction method based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109145113A (en) | 2019-01-04 |
CN109145113B (en) | 2021-12-21 |
Family
ID=64827822
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201810972342.8A CN109145113B (en) | Active | 2018-08-24 | 2018-08-24 | Student poverty degree prediction method based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145113B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109754349B (en) * | 2019-01-07 | 2023-09-05 | 上海复岸网络信息科技有限公司 | Intelligent teacher-student matching system for online education |
CN109992592B (en) * | 2019-04-10 | 2020-12-08 | 哈尔滨工业大学 | College poverty and poverty identification method based on flow data of campus consumption card |
CN110097142A (en) * | 2019-05-15 | 2019-08-06 | 杭州华网信息技术有限公司 | Poor student's prediction technique of behavior is serialized for student |
CN112215385B (en) * | 2020-03-24 | 2024-03-19 | 北京桃花岛信息技术有限公司 | Student difficulty degree prediction method based on greedy selection strategy |
CN111415099A (en) * | 2020-03-30 | 2020-07-14 | 西北大学 | Poverty-poverty identification method based on multi-classification BP-Adaboost |
CN112150111A (en) * | 2020-09-23 | 2020-12-29 | 沈阳晁圣科技有限公司 | Wisdom campus management system based on block chain |
CN112416914B (en) * | 2020-10-15 | 2023-07-11 | 三峡大学 | Difficult student identification and early warning method and system based on big data analysis |
CN112465300A (en) * | 2020-11-05 | 2021-03-09 | 贵州广播电视大学(贵州职业技术学院) | Campus network-based full-range big data system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102682352A (en) * | 2011-03-11 | 2012-09-19 | 鮑济美 | Intelligent campus security information management system based on internet of things |
CN105930540A (en) * | 2016-03-23 | 2016-09-07 | 四川长虹电器股份有限公司 | Data processing system |
CN106504144A (en) * | 2016-10-19 | 2017-03-15 | 中国矿业大学 | A kind of campus service system based on cloud computing and mobile phone terminal |
CN106886922A (en) * | 2017-03-01 | 2017-06-23 | 安徽大智睿科技技术有限公司 | It is a kind of that students ' analysis method and system are paid close attention to based on all-purpose card consumption |
CN106934742A (en) * | 2017-02-22 | 2017-07-07 | 黔南民族师范学院 | A kind of Impoverished College Studentss assessment method |
WO2017128868A1 (en) * | 2016-01-26 | 2017-08-03 | 华为技术有限公司 | Classification method, device and system for program file |
CN108170765A (en) * | 2017-12-25 | 2018-06-15 | 合肥城市云数据中心股份有限公司 | Recommend method based on the poverty-stricken mountains in school behavioral data multidimensional analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160308898A1 (en) * | 2015-04-20 | 2016-10-20 | Phirelight Security Solutions Inc. | Systems and methods for tracking, analyzing and mitigating security threats in networks via a network traffic analysis platform |
US20170032250A1 (en) * | 2015-07-29 | 2017-02-02 | Ching-Ping Chang | Machine Status And User Behavior Analysis System |
Non-Patent Citations (1)
Title |
---|
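Application of the random forest algorithm in predicting the identification of impoverished students at universities of traditional Chinese medicine; Tang Yan et al.; China Medical Herald; 2017-05-15; Vol. 14, No. 14; 164-168 * |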
Also Published As
Publication number | Publication date |
---|---|
CN109145113A (en) | 2019-01-04 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 1218, 12th floor, building 8, East District, yard 9, Linglong Road, Haidian District, Beijing 100089; Applicant after: BEIJING TAOHUADAO INFORMATION TECHNOLOGY Co.,Ltd.; Address before: Room 1503, Yanshan Hotel, No. 38 Guancun Avenue, Haidian District, Beijing; Applicant before: BEIJING TAOHUADAO INFORMATION TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |