CN112530546A - Psychological pre-judging method and system based on K-means clustering and XGboost algorithm - Google Patents

Psychological pre-judging method and system based on K-means clustering and XGboost algorithm

Info

Publication number
CN112530546A
Authority
CN
China
Prior art keywords
data
value
psychological
xgboost
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011467838.3A
Other languages
Chinese (zh)
Other versions
CN112530546B (en)
Inventor
邵亚斌
韩雨彤
胡梦圆
李雪莲
钟义菊
方艺添
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202011467838.3A
Publication of CN112530546A
Application granted
Publication of CN112530546B
Active (legal status)
Anticipated expiration (legal status)

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/70 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mental therapies, e.g. psychological therapy or autogenous training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Developmental Disabilities (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention seeks to protect a psychological pre-judging method and system based on K-means clustering and the XGBoost algorithm. The method comprises the following steps: 1) using a machine learning method, establish and train a psychological pre-judging model based on the decisive characteristics of known samples; 2) acquire the decisive characteristic data of a new individual, obtain the new individual's mental health condition from the psychological pre-judging model based on the sample's decisive characteristics, and pre-judge the student's mental health from the result. The invention achieves the following beneficial effects: the XGBoost algorithm in machine learning extracts the key decisive characteristics of the model, so the mental health state can be assessed accurately. After proper parameter tuning, the accuracy on the test set can reach 98-100%, the explanatory effect is clear, and the model generalizes well. The invention can provide an accurate psychological pre-judging model and facilitate the management of students at school.

Description

Psychological pre-judging method and system based on K-means clustering and XGboost algorithm
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a mental health pre-judging method and system based on machine learning over student behavior data.
Background
In the "13th Five-Year" education plan, the state emphasizes and attaches importance to physical and mental health. The physical and mental health education of students is in fact a major trend of education development in the new era, as education adapts to technological progress and to the growth of big data in the artificial intelligence era. The project aims to record student behavior data with big data technology, understand students' mental health, and promote their physical and mental development.
Students' knowledge, skills and intelligence are closely related to their physical and mental condition. Mental health, free of disorders, is the basis of effective learning, while an unhealthy mental state impairs intelligence and learning efficiency. A healthy mind perceives external things correctly, thinks without illusion or hallucination, and reasons in a clear order; it can maintain a deep interest in and desire for learning, devote itself to study and take pleasure in it, and can also overcome difficulties in learning and keep good learning efficiency.
Students leave a series of "digital footprints" in the campus, including behavioral data from the learning process, evaluation data of learning results, social-network relationship data formed through online learning, and so on. In a big data scenario, each student's examination scores, canteen consumption, supermarket consumption, book borrowing, academic results, day-to-day grades and the like can be recorded immediately and added to the database.
The traditional psychological early-warning method in China mainly issues psychological questionnaires to students, such as the University Personality Inventory (UPI) and the Eysenck Personality Questionnaire (EPQ), mines and analyzes the questionnaire data to obtain important indexes affecting students' mental health, and thereby assists the mental health work of colleges and universities. However, this method covers little student data, and the accuracy of the data is also affected by various subjective factors, so the resulting psychological early warning may well be inaccurate.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A psychological pre-judging method and a psychological pre-judging system based on K-means clustering and an XGboost algorithm are provided. The technical scheme of the invention is as follows:
a psychological prejudging method based on K-means clustering and an XGboost algorithm comprises the following steps:
collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space;
clustering and partitioning into three types of data sets with the K-means algorithm, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists.
Selecting label 1 as the sample class "likely to have psychological hidden dangers", performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
Further, the individual behavior-at-school data includes:
basic information data, score data, classroom data, all-purpose card data, dormitory access data, library access and borrowing data and campus activity data.
Further, the basic information data includes: gender, major, age, native place and hobbies; the score data includes: compulsory/elective course scores and day-to-day class grades; the classroom data includes: class attendance and homework completion; the all-purpose card data includes: canteen consumption amount, category and time; water-fetching time; shower consumption amount and time; supermarket consumption amount, category and time; and the balance on the all-purpose card; the dormitory access data includes: dormitory entry/exit time and location; the library access and borrowing data includes: library entry time, borrowed book titles, borrowing time and return time; the campus activity data includes: class and school organization duty positions held; work-study hours; awards and punishments in each semester; and extracurricular activity scores.
Further, the preprocessing step specifically comprises: duplicate-value processing, missing-value processing, noise-value processing, and type conversion.
The duplicate-value processing includes: deleting duplicate rows with the drop_duplicates function, whose parameters are explained as follows:
subset: the column names to consider, defaulting to all columns;
keep: which duplicate to keep, one of {'first', 'last', False}; keep='first' keeps the first row of each group of duplicates and discards the rest, keep='last' keeps the last row of each group and discards the rest, and keep=False discards every row of each group of duplicates;
inplace: whether to replace in place, one of {False, True}; inplace=False leaves the original table untouched after deduplication, while inplace=True overwrites the original table.
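As an illustration only, the following minimal pandas sketch shows this deduplication step; the DataFrame columns used here (student_id, canteen_amount) are hypothetical and not taken from the patent.

```python
import pandas as pd

# Hypothetical behavior records containing duplicated rows.
df = pd.DataFrame({
    "student_id": [1001, 1001, 1002, 1003, 1003],
    "canteen_amount": [12.5, 12.5, 8.0, 15.0, 15.0],
})

# subset=None considers all columns, keep="first" keeps the first row of each
# duplicate group, and inplace=True overwrites the original table.
df.drop_duplicates(subset=None, keep="first", inplace=True)
print(df)
```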
Missing value processing includes: checking the missing condition, and filling the missing value with the specified value;
the noise-value processing includes: handling noise values with the capping method and, at the same time, with the binning method;
the type conversion includes: fast conversion with a LabelEncoder; mapping categories to numerical values through an explicit mapping, a method whose range of applicability is limited; and conversion with the get_dummies method.
Further, in the missing value processing step, the missing condition is checked, and the missing value is filled with the specified value, and the specific steps are as follows:
the missing values are inspected by constructing a lambda function in Python, in which sum(col.isnull()) gives the number of missing entries in the current column and col.size gives the total number of rows in that column; the missing values are then filled with the fillna method:
the syntax of the fillna method for filling missing values is:
fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)
wherein the parameter value specifies the replacement value and can be a scalar, a dictionary, a Series or a DataFrame; the parameter method specifies how missing values are filled: 'pad' or 'ffill' propagates the last valid value forward up to the next valid value, while 'backfill' or 'bfill' fills a run of consecutive missing values with the first valid value that follows them; the parameter limit specifies, when method is set, at most how many consecutive missing values are filled; and inplace=True indicates in-place replacement.
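A minimal sketch of this missing-value step, assuming pandas is used as described; the column name score and the choice of the column mean as the specified fill value are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [85.0, np.nan, 78.0, np.nan, 90.0]})

# Report how many entries are missing in each column and how many rows it has.
print(df.apply(lambda col: f"{sum(col.isnull())} missing of {col.size} rows"))

# Fill missing values with a specified value (here the column mean) ...
df["score"] = df["score"].fillna(value=df["score"].mean())
# ... or, alternatively, forward-fill at most one consecutive gap:
# df["score"] = df["score"].fillna(method="pad", limit=1)
```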
Furthermore, in the capping method, records of a continuous variable lying outside three standard deviations above or below its mean are replaced with the values at three standard deviations above or below the mean; the parameter x is a pd.Series column, quantile is the capping interval, and by default values below the 1st percentile or above the 99th percentile are replaced with the 1st and 99th percentiles, the change in the data's frequency distribution being compared with a histogram. The binning method performs equal-width binning directly with the cut function: cut automatically takes a value slightly below the column minimum as the lower limit and the maximum as the upper limit, divides the range into five equal-width bins, and yields a Categories-type column, similar to a factor in R, representing a categorical variable.
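The following sketch illustrates the capping and equal-width binning described above under the stated percentile defaults; the helper name cap and the simulated spending data are hypothetical.

```python
import numpy as np
import pandas as pd

def cap(x: pd.Series, quantile: tuple = (0.01, 0.99)) -> pd.Series:
    """Replace values below the 1st / above the 99th percentile with those percentiles."""
    low, high = x.quantile(quantile[0]), x.quantile(quantile[1])
    return x.clip(lower=low, upper=high)

spend = pd.Series(np.random.normal(20, 5, 1000))   # simulated canteen spending
capped = cap(spend)

# Equal-width binning into five intervals; the result is a Categorical column,
# similar to a factor in R.
bins = pd.cut(capped, 5)
print(bins.value_counts())
```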
Further, partitioning into the three types of data sets with the K-means algorithm mainly comprises the following sub-steps:
randomly selecting K samples from the data set as the initial cluster centers;
calculating the Euclidean distance between every other sample and each of the K centers;
comparing each sample's K distances to the K centers and assigning the sample to the nearest center;
recalculating the cluster centers and repeating the previous steps until the center positions converge.
At this point, the K-means algorithm has completed the partition into the three types of data sets.
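A minimal sketch of this clustering step, assuming scikit-learn's KMeans is used for the K-means partition into three clusters; the feature columns shown are illustrative and not the patent's actual feature set.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical preprocessed behavior features.
features = pd.DataFrame({
    "gpa":            [3.2, 2.1, 3.8, 1.9, 2.8, 3.5],
    "canteen_amount": [20.5, 5.0, 18.0, 4.2, 15.0, 22.0],
    "gender":         ["M", "F", "M", "F", "M", "F"],
})
# One-hot encode the discrete feature so Euclidean distances are meaningful.
features = pd.get_dummies(features, columns=["gender"])

# Partition the samples into three clusters (cluster ids 0/1/2, mapped to label 1/2/3).
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(features)
print(labels)
```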
Further, the step of adjusting the parameters specifically comprises:
firstly, checking the data condition, and then setting training parameters;
max_depth sets the maximum depth of each tree; the default value is 6 and the value range is [1, +∞);
with min_child_weight, after each boosting step the algorithm can directly obtain the weights of the new features; at the same time, to prevent overfitting, a larger value keeps the model from learning locally special samples;
the learning task and the corresponding objective are defined, where "binary:logistic" denotes a two-class logistic regression problem whose output is a probability;
the other parameters keep their default values, and the model is then trained;
the number of boosting iterations is set to 60;
the parameter dictionary's items are returned as a traversable list of (key, value) tuples;
the output value is the probability that the sample belongs to the first class, and this probability is converted to 0 or 1;
this completes the parameter-adjustment steps.
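A sketch of this parameter setup and training step, assuming the xgboost Python package; the synthetic data and any parameter values other than those named above (max_depth=6, binary:logistic, 60 boosting rounds) are illustrative only.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-ins: X are preprocessed behavior features, y is 1 for the
# "likely psychological risk" cluster and 0 otherwise.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 2, 200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "max_depth": 6,                  # maximum tree depth, default 6
    "min_child_weight": 1,           # larger values guard against overfitting
    "objective": "binary:logistic",  # two-class logistic regression, outputs a probability
}
print(list(params.items()))          # the parameter dict as traversable (key, value) tuples

model = xgb.train(params, dtrain, num_boost_round=60)   # 60 boosting iterations

# The raw output is the probability of the positive class; threshold it to 0/1.
pred = (model.predict(dtrain) > 0.5).astype(int)
```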
Further, the classification with supervised learning by using the XGBoost algorithm to obtain the XGBoost prediction model specifically includes:
The XGBoost algorithm objective function is

$$J(f_t)=\sum_{i=1}^{n} L\!\left(y_i,\;\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t) \tag{1}$$

where $J(f_t)$ is the objective function of the learning model after a total of t iterations, i indexes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of sample $x_i$ by the first t-1 trees, $x_i$ denotes the sample's input features, $f_t$ is the t-th regression tree, and $\Omega(f_t)$ is the regularization term applied to the t-th tree.

Expanding according to Taylor's formula:

$$J(f_t)\approx\sum_{i=1}^{n}\left[L\!\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t) \tag{2}$$

with the first derivative $g_i$ and second derivative $h_i$ defined as

$$g_i=\partial_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right),\qquad h_i=\partial^{2}_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right) \tag{3}$$

The decision-tree complexity is computed as

$$\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2} \tag{4}$$

where T is the number of leaf nodes, $\gamma$ is the coefficient of T, $w_j$ is the score of the j-th leaf node of the t-th tree, and the parameter values are obtained with a grid-search algorithm.

Substituting (2), (3) and (4) into (1) gives the objective function

$$J(f_t)=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T,\qquad G_j=\sum_{i\in I_j}g_i,\;\;H_j=\sum_{i\in I_j}h_i \tag{5}$$

where $I_j$ is the set of samples falling into leaf j. Setting the derivative of (5) with respect to $w_j$ to zero, the optimal leaf score is

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the corresponding minimum value of the objective function is

$$J^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The last expression acts as a scoring function: the smaller its value, the better the tree structure.
The discretized data set is split into training and test sets at a ratio of 7:3; the test samples corresponding to the 3 tasks are input into the trained XGBoost model to obtain the estimated values simultaneously, the required estimated value is finally selected from them, and the selected estimated value is compared with the actually measured data.
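The 7:3 split and accuracy test might look like the following sketch, here using scikit-learn's train_test_split and the XGBClassifier wrapper; the random data merely stands in for the real discretized student data set.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-ins for the discretized data set and its K-means-derived labels.
rng = np.random.default_rng(0)
X = rng.random((300, 10))
y = rng.integers(0, 2, 300)

# 7:3 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = XGBClassifier(max_depth=6, n_estimators=60, objective="binary:logistic")
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```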
A psychological pre-judging system based on K-means clustering and the XGBoost algorithm, implementing any of the above methods, comprises:
an acquisition and preprocessing module: used for collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
a clustering module: for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space; the K-means algorithm then clusters and partitions the data into three types of data sets, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists.
A prediction module: used for selecting label 1 as the sample class with a high possibility of psychological hidden dangers, performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
The invention has the following advantages and beneficial effects:
1. The individual behavior data and basic information data of claims 2 and 3 are generated spontaneously by people in natural situations; the data are authentic and large in volume, and are more representative than the interview mode of traditional psychological counseling in colleges and universities; moreover, the result of the algorithm is produced in a data-driven, a posteriori manner.
2. The K-means algorithm of claim 7 classifies the student behavior data set into three categories by computing Euclidean distances among the samples.
3. With the features of claims 1, 2 and 7, the preprocessed student behavior data are divided into three categories by Euclidean distance, which reduces the drawbacks introduced by individual subjective factors in earlier practice, makes full use of the student behavior data, and establishes a reasonable digital information platform and psychological-crisis early-warning system.
Drawings
FIG. 1 is a flow chart of a psychological prediction method based on K-means clustering and XGboost algorithm according to a preferred embodiment of the present invention;
FIG. 2 is a psychological prediction model based on K-means clustering and XGboost algorithm;
FIG. 3 is a feature variable importance.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, a psychological prediction method based on K-means clustering and XGBoost algorithm includes the following steps:
1) establishing and training a mental health prejudging model based on campus behavior characteristics by using a machine learning method based on behavior characteristics of individuals in known samples;
2) acquiring the behavior data of the new individual, and obtaining the psychological health pre-judging result of the new individual according to the psychological health pre-judging model based on the campus behavior characteristics.
The campus behavior characteristics are a set of behavioral results reflecting how the individual is influenced by various factors and psychological changes on campus; they are acquired from the records kept by the relevant departments of the campus to which the individual belongs.
The process of extracting the campus behavior characteristics comprises the following steps:
11) with the permission of the relevant school departments, students' at-school behavior data are collected; the student individual is used as the classification label, a fixed period is used as the statistical window, the data within that fixed period form one group, and the data are recorded as a single group of data;
12) extracting effective behavior records of the individuals from the behavior data of the individuals, wherein the behavior records of the individuals are relational data stored by taking the individuals as units.
After the initial data is collected, the initial data is preprocessed, namely, operations such as repeated value processing, missing value processing, noise value processing, type conversion and the like are carried out on the data, and a relatively complete data set which can be used for training a model is obtained.
The duplicate-value processing includes: deleting duplicates with the drop_duplicates(subset, keep, inplace) method.
Missing value processing includes: looking at the missing condition, filling the missing value with the specified value.
The missing values can be inspected by constructing a lambda function in Python, where sum(col.isnull()) gives the number of missing entries in the current column and col.size gives the total number of rows in that column.
Missing values are then filled using the fillna method and the quantile method.
The noise-value processing includes: handling noise values with the capping method and with the binning method.
The capping method replaces records of a continuous variable lying outside three standard deviations above or below its mean with the values at three standard deviations above or below the mean; the parameter x is a pd.Series column, quantile is the capping interval, and by default values below the 1st percentile or above the 99th percentile are replaced with the 1st and 99th percentiles, the change in the data's frequency distribution being compared with a histogram.
The binning method performs equal-width binning directly with the cut function: cut automatically takes a value slightly below the column minimum as the lower limit and the maximum as the upper limit, divides the range into five equal-width bins, and yields a Categories-type column, similar to a factor in R, representing a categorical variable.
The type conversion includes: fast conversion with a LabelEncoder; mapping categories to numerical values through an explicit mapping, a method whose range of applicability is limited; and conversion with the get_dummies method.
Meanwhile, one-hot coding is used for the discrete features, so that distance calculations between features are more reasonable.
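For illustration, the three type-conversion routes mentioned above (LabelEncoder, explicit mapping, and get_dummies one-hot encoding) could be applied as in this sketch; the columns major and gender are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"major": ["CS", "Math", "CS", "Art"],
                   "gender": ["M", "F", "F", "M"]})

# 1) LabelEncoder: fast integer codes for a single column.
df["major_code"] = LabelEncoder().fit_transform(df["major"])

# 2) Explicit mapping: only practical when the category set is small and known.
df["gender_code"] = df["gender"].map({"M": 0, "F": 1})

# 3) get_dummies: one-hot columns, so distances between discrete values are symmetric.
df = pd.get_dummies(df, columns=["major", "gender"])
print(df.head())
```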
Partitioning into the three types of data sets with the K-means algorithm mainly comprises the following sub-steps:
randomly selecting K samples from the data set as the initial cluster centers;
calculating the Euclidean distance between every other sample and each of the K centers;
comparing each sample's K distances to the K centers and assigning the sample to the nearest center;
recalculating the cluster centers and repeating the previous steps until the center positions converge.
At this point, the K-means algorithm has completed the partition into the three types of data sets, and the XGBoost algorithm is then used for supervised-learning training.
Firstly, checking the data condition, and then setting training parameters;
max_depth sets the maximum depth of each tree; the default value is 6 and the value range is [1, +∞);
with min_child_weight, after each boosting step the algorithm can directly obtain the weights of the new features; at the same time, to prevent overfitting, a larger value keeps the model from learning locally special samples;
the learning task and the corresponding objective are defined, where "binary:logistic" denotes a two-class logistic regression problem whose output is a probability;
the other parameters keep their default values, and the model is then trained;
the number of boosting iterations is set to 60;
the parameter dictionary's items are returned as a traversable list of (key, value) tuples;
the output value is the probability that the sample belongs to the first class, and this probability is converted to 0 or 1;
with the parameter-adjustment steps completed, the XGBoost algorithm is started.
The XGBoost algorithm objective function is

$$J(f_t)=\sum_{i=1}^{n} L\!\left(y_i,\;\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t) \tag{1}$$

where $J(f_t)$ is the objective function of the learning model after a total of t iterations, i indexes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of sample $x_i$ by the first t-1 trees, $x_i$ denotes the sample's input features, $f_t$ is the t-th regression tree, and $\Omega(f_t)$ is the regularization term applied to the t-th tree.

Expanding according to Taylor's formula:

$$J(f_t)\approx\sum_{i=1}^{n}\left[L\!\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t) \tag{2}$$

with the first derivative $g_i$ and second derivative $h_i$ defined as

$$g_i=\partial_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right),\qquad h_i=\partial^{2}_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right) \tag{3}$$

The decision-tree complexity is computed as

$$\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2} \tag{4}$$

where T is the number of leaf nodes, $\gamma$ is the coefficient of T, $w_j$ is the score of the j-th leaf node of the t-th tree, and the parameter values are obtained with a grid-search algorithm.

Substituting (2), (3) and (4) into (1) gives the objective function

$$J(f_t)=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T,\qquad G_j=\sum_{i\in I_j}g_i,\;\;H_j=\sum_{i\in I_j}h_i \tag{5}$$

where $I_j$ is the set of samples falling into leaf j. Setting the derivative of (5) with respect to $w_j$ to zero, the optimal leaf score is

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the corresponding minimum value of the objective function is

$$J^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The last expression acts as a scoring function: the smaller its value, the better the tree structure.
To sum up, the discretized data set is split into training and test sets at a ratio of 7:3; the test samples corresponding to the 3 tasks are input into the trained XGBoost model to obtain the estimated values simultaneously, and the required estimated value is finally selected from them. Once the required estimated value has been selected, it can be compared with the actually measured data to obtain the prediction accuracy of the algorithm.
Finally, by entering a student's relevant data, the student's mental health condition is obtained directly; if a corresponding hidden danger exists, the teacher is informed so that a reasonable guidance approach can be found in time. This helps the school's mental health education work to find and fill gaps, and improves its efficiency and effect.
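A sketch of how such a query might be wired up, assuming a trained XGBClassifier as in the earlier sketch; the function name prejudge, the 0.5 threshold and the returned messages are illustrative assumptions, not the patent's specification.

```python
import numpy as np
from xgboost import XGBClassifier

def prejudge(clf: XGBClassifier, new_student: np.ndarray) -> str:
    """Return a hypothetical pre-judgment message for one preprocessed feature row."""
    prob = clf.predict_proba(new_student.reshape(1, -1))[0, 1]
    if prob > 0.5:
        return "possible psychological hidden danger - notify the counsellor"
    return "no obvious psychological hidden danger"
```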
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A psychological prejudging method based on K-means clustering and an XGboost algorithm is characterized by comprising the following steps:
collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space;
clustering and partitioning into three types of data sets with the K-means algorithm, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists;
selecting label 1 as the sample class with a high possibility of psychological hidden dangers, performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
2. The psychological prediction method based on K-means clustering and XGboost algorithm as claimed in claim 1, wherein the data of the behavior of the individual at school comprises:
basic information data, score data, classroom data, all-purpose card data, dormitory access data, library access and borrowing data and campus activity data.
3. The psychological prediction method based on K-means clustering and XGboost algorithm according to claim 2, characterized in that the basic information data comprises: gender, major, age, native place and hobbies; the score data comprises: compulsory/elective course scores and day-to-day class grades; the classroom data comprises: class attendance and homework completion; the all-purpose card data comprises: canteen consumption amount, category and time; water-fetching time; shower consumption amount and time; supermarket consumption amount, category and time; and the balance on the all-purpose card; the dormitory access data comprises: dormitory entry/exit time and location; the library access and borrowing data comprises: library entry time, borrowed book titles, borrowing time and return time; and the campus activity data comprises: class and school organization duty positions held; work-study hours; awards and punishments in each semester; and extracurricular activity scores.
4. The psychological prediction method based on K-means clustering and XGboost algorithm according to claim 2, characterized in that the preprocessing step specifically comprises: duplicate-value processing, missing-value processing, noise-value processing, and type conversion.
The duplicate-value processing includes: deleting duplicate rows with the drop_duplicates function, whose parameters are explained as follows:
subset: the column names to consider, defaulting to all columns;
keep: which duplicate to keep, one of {'first', 'last', False}; keep='first' keeps the first row of each group of duplicates and discards the rest, keep='last' keeps the last row of each group and discards the rest, and keep=False discards every row of each group of duplicates;
inplace: whether to replace in place, one of {False, True}; inplace=False leaves the original table untouched after deduplication, while inplace=True overwrites the original table.
Missing value processing includes: checking the missing condition, and filling the missing value with the specified value;
the noise-value processing includes: handling noise values with the capping method and, at the same time, with the binning method;
the type conversion includes: fast conversion with a LabelEncoder; mapping categories to numerical values through an explicit mapping, a method whose range of applicability is limited; and conversion with the get_dummies method.
5. The psychological prejudging method based on K-means clustering and XGboost algorithm according to claim 4, wherein in the missing value processing step, the missing condition is checked, and the missing value is filled with a specified value, and the specific steps are as follows:
the missing values are inspected by constructing a lambda function in Python, in which sum(col.isnull()) gives the number of missing entries in the current column and col.size gives the total number of rows in that column; the missing values are then filled with the fillna method:
the syntax of the fillna method for filling missing values is:
fillna(value=None,method=None,axis=None,inplace=False,limit=None, downcast=None,**kwargs)
wherein the parameter value specifies the replacement value and can be a scalar, a dictionary, a Series or a DataFrame; the parameter method specifies how missing values are filled: 'pad' or 'ffill' propagates the last valid value forward up to the next valid value, while 'backfill' or 'bfill' fills a run of consecutive missing values with the first valid value that follows them; the parameter limit specifies, when method is set, at most how many consecutive missing values are filled; and inplace=True indicates in-place replacement.
6. The psychological prediction method based on K-means clustering and XGboost algorithm as claimed in claim 4, wherein in the capping method, records of a continuous variable lying outside three standard deviations above or below its mean are replaced with the values at three standard deviations above or below the mean; the parameter x is a pd.Series column, quantile is the capping interval, and by default values below the 1st percentile or above the 99th percentile are replaced with the 1st and 99th percentiles, the change in the data's frequency distribution being compared with a histogram; and the binning method performs equal-width binning directly with the cut function, which automatically takes a value slightly below the column minimum as the lower limit and the maximum as the upper limit, divides the range into five equal-width bins, and yields a Categories-type column, similar to a factor in R, representing a categorical variable.
7. The psychological prediction method based on K-means clustering and XGboost algorithm according to any one of claims 1-6, characterized in that partitioning into the three types of data sets with the K-means algorithm mainly comprises the following sub-steps:
randomly selecting K samples from the data set as the initial cluster centers;
calculating the Euclidean distance between every other sample and each of the K centers;
comparing each sample's K distances to the K centers and assigning the sample to the nearest center;
recalculating the cluster centers and repeating the previous steps until the center positions converge.
At this point, the K-means algorithm has completed the partition into the three types of data sets.
8. The psychological prediction method based on K-means clustering and XGboost algorithm according to any one of claims 1-6, characterized in that the step of adjusting parameters specifically comprises:
firstly, checking the data condition, and then setting training parameters;
max_depth sets the maximum depth of each tree; the default value is 6 and the value range is [1, +∞);
with min_child_weight, after each boosting step the algorithm can directly obtain the weights of the new features; at the same time, to prevent overfitting, a larger value keeps the model from learning locally special samples;
the learning task and the corresponding objective are defined, where "binary:logistic" denotes a two-class logistic regression problem whose output is a probability;
the other parameters keep their default values, and the model is then trained;
the number of boosting iterations is set to 60;
the parameter dictionary's items are returned as a traversable list of (key, value) tuples;
the output value is the probability that the sample belongs to the first class, and this probability is converted to 0 or 1;
this completes the parameter-adjustment steps.
9. The psychological prediction method based on K-means clustering and XGboost algorithm according to claim 8, wherein performing supervised-learning classification with the XGBoost algorithm to obtain the XGBoost prediction model specifically comprises:
The XGBoost algorithm objective function is

$$J(f_t)=\sum_{i=1}^{n} L\!\left(y_i,\;\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t) \tag{1}$$

where $J(f_t)$ is the objective function of the learning model after a total of t iterations, i indexes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of sample $x_i$ by the first t-1 trees, $x_i$ denotes the sample's input features, $f_t$ is the t-th regression tree, and $\Omega(f_t)$ is the regularization term applied to the t-th tree.

Expanding according to Taylor's formula:

$$J(f_t)\approx\sum_{i=1}^{n}\left[L\!\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t) \tag{2}$$

with the first derivative $g_i$ and second derivative $h_i$ defined as

$$g_i=\partial_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right),\qquad h_i=\partial^{2}_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right) \tag{3}$$

The decision-tree complexity is computed as

$$\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2} \tag{4}$$

where T is the number of leaf nodes, $\gamma$ is the coefficient of T, $w_j$ is the score of the j-th leaf node of the t-th tree, and the parameter values are obtained with a grid-search algorithm.

Substituting (2), (3) and (4) into (1) gives the objective function

$$J(f_t)=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T,\qquad G_j=\sum_{i\in I_j}g_i,\;\;H_j=\sum_{i\in I_j}h_i \tag{5}$$

where $I_j$ is the set of samples falling into leaf j. Setting the derivative of (5) with respect to $w_j$ to zero, the optimal leaf score is

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the corresponding minimum value of the objective function is

$$J^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The last expression acts as a scoring function: the smaller its value, the better the tree structure.
The discretized data set is split into training and test sets at a ratio of 7:3; the test samples corresponding to the 3 tasks are input into the trained XGBoost model to obtain the estimated values simultaneously, the required estimated value is finally selected from them, and the selected estimated value is compared with the actually measured data.
10. A psychological pre-judging system based on K-means clustering and the XGBoost algorithm according to any of claims 1-9, comprising:
an acquisition and preprocessing module: used for collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
a clustering module: for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space; the K-means algorithm then clusters and partitions the data into three types of data sets, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists.
A prediction module: used for selecting label 1 as the sample class with a high possibility of psychological hidden dangers, performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
CN202011467838.3A 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm Active CN112530546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011467838.3A CN112530546B (en) 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011467838.3A CN112530546B (en) 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm

Publications (2)

Publication Number Publication Date
CN112530546A true CN112530546A (en) 2021-03-19
CN112530546B CN112530546B (en) 2024-03-22

Family

ID=74999564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011467838.3A Active CN112530546B (en) 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm

Country Status (1)

Country Link
CN (1) CN112530546B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160239857A1 (en) * 2013-01-04 2016-08-18 PlaceIQ, Inc. Inferring consumer affinities based on shopping behaviors with unsupervised machine learning models
CN109256192A (en) * 2018-07-24 2019-01-22 同济大学 A kind of undergraduate psychological behavior unusual fluctuation monitoring and pre-alarming method neural network based
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO Chunfang et al.: "Research on an XGBoost ensemble algorithm based on K-means clustering" (基于Kmeans聚类的XGBoost集成算法研究), Computer Era (《计算机时代》), no. 10, 15 October 2020 (2020-10-15), pages 12-14 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807612A (en) * 2021-10-13 2021-12-17 四川久远银海软件股份有限公司 Prediction method and device based on mental scale data

Also Published As

Publication number Publication date
CN112530546B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Anastasopoulos et al. Machine learning for public administration research, with application to organizational reputation
US10356027B2 (en) Location resolution of social media posts
Ward Jr et al. Application of an hierarchical grouping procedure to a problem of grouping profiles
Miller Species distribution models: Spatial autocorrelation and non-stationarity
Li et al. Influence of entrepreneurial experience, alertness, and prior knowledge on opportunity recognition
Tiefelsdorf et al. The exact distribution of Moran's I
US20230289665A1 (en) Failure feedback system for enhancing machine learning accuracy by synthetic data generation
CN109002492B (en) Performance point prediction method based on LightGBM
CN110795613B (en) Commodity searching method, device and system and electronic equipment
Agrawal et al. Using data mining classifier for predicting student’s performance in UG level
Ma Evaluating the fit of sequential G-DINA model using limited-information measures
US20200175052A1 (en) Classification of electronic documents
Ikawati et al. Student behavior analysis to detect learning styles in Moodle learning management system
Ali Khalil et al. Developing machine learning models to predict roadway traffic noise: An opportunity to escape conventional techniques
CN114358014B (en) Work order intelligent diagnosis method, device, equipment and medium based on natural language
CN112530546B (en) Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm
Ladi et al. Applications of machine learning and deep learning methods for climate change mitigation and adaptation
Spithourakis et al. Numerically grounded language models for semantic error correction
Alsultanny Selecting a suitable method of data mining for successful forecasting
Pisutaporn et al. Relevant factors and classification of student alcohol consumption
CN114331789B (en) Intelligent cheap and clean knowledge recommendation method, device, equipment and storage medium
CN109190658A (en) Video degree of awakening classification method, device and computer equipment
Agustin Implementation of Generative Pre-Trained Transformer 3 Classify-Text in Determining Thesis Supervisor
CN117272999A (en) Model training method and device based on class incremental learning, equipment and storage medium
Nichols The Application of Machine Learning Techniques to Empowerment, Stress and Workplace Outcomes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant