CN112530546A - Psychological pre-judging method and system based on K-means clustering and XGboost algorithm - Google Patents

Psychological pre-judging method and system based on K-means clustering and XGboost algorithm

Info

Publication number
CN112530546A
Authority
CN
China
Prior art keywords
data
value
psychological
xgboost
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011467838.3A
Other languages
Chinese (zh)
Other versions
CN112530546B (en)
Inventor
邵亚斌
韩雨彤
胡梦圆
李雪莲
钟义菊
方艺添
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202011467838.3A
Publication of CN112530546A
Application granted
Publication of CN112530546B
Active (legal status)
Anticipated expiration (legal status)

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/70 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mental therapies, e.g. psychological therapy or autogenous training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Developmental Disabilities (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention seeks to protect a psychological pre-judging method and system based on K-means clustering and the XGBoost algorithm. The method comprises the following steps: 1) using a machine learning method, establish and train a psychological pre-judging model based on the decisive characteristics of known samples; 2) acquire the decisive characteristic data of a new individual, obtain the new individual's mental health condition from the psychological pre-judging model based on the sample's decisive characteristics, and pre-judge the student's mental health from the result. The invention achieves the following beneficial effects: the XGBoost algorithm in machine learning extracts the key decisive characteristics of the model, so the mental health state can be assessed accurately. After proper parameter tuning, the accuracy on the test set can reach 98-100%, the explanatory effect is clear, and the model generalizes well. The invention can provide an accurate psychological pre-judging model and facilitate the management of students at school.

Description

Psychological pre-judging method and system based on K-means clustering and XGboost algorithm
Technical Field
The invention belongs to the field of machine learning, and particularly relates to a mental health pre-judging method and system based on machine learning over student behavior data.
Background
In the "13th Five-Year" education plan, the state emphasizes and attaches importance to physical and mental health. The physical and mental health education of students is in fact a major trend of education development in the new era, as education adapts to technological progress and to the growth of big data in the artificial intelligence era. The project aims to record student behavior data with big data technology, understand students' mental health, and promote their physical and mental development.
Students' knowledge, skills and intelligence are closely related to their physical and mental condition. Mental health, free of disorders, is the basis of effective learning, while an unhealthy mental state impairs intelligence and learning efficiency. A healthy mind perceives external things correctly, thinks without illusion or hallucination, and reasons in a clear order; it can maintain a deep interest in and desire for learning, devote itself to study and take pleasure in it, and can also overcome difficulties in learning and keep good learning efficiency.
Students leave a series of "digital footprints" in the campus, including behavioral data from the learning process, evaluation data of learning results, social-network relationship data formed through online learning, and so on. In a big data scenario, each student's examination scores, canteen consumption, supermarket consumption, book borrowing, academic results, day-to-day grades and the like can be recorded immediately and added to the database.
The traditional psychological early-warning method in China mainly issues psychological questionnaires to students, such as the University Personality Inventory (UPI) and the Eysenck Personality Questionnaire (EPQ), mines and analyzes the questionnaire data to obtain important indexes affecting students' mental health, and thereby assists the mental health work of colleges and universities. However, this method covers little student data, and the accuracy of the data is also affected by various subjective factors, so the resulting psychological early warning may well be inaccurate.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A psychological pre-judging method and a psychological pre-judging system based on K-means clustering and an XGboost algorithm are provided. The technical scheme of the invention is as follows:
a psychological prejudging method based on K-means clustering and an XGboost algorithm comprises the following steps:
collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space;
clustering and partitioning into three types of data sets with the K-means algorithm, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists.
Selecting label 1 as the sample class "likely to have psychological hidden dangers", performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
Further, the individual behavior-at-school data includes:
basic information data, score data, classroom data, all-purpose card data, dormitory access data, library access and borrowing data and campus activity data.
Further, the basic information data includes: gender, major, age, native place and hobbies; the score data includes: compulsory/elective course scores and day-to-day class grades; the classroom data includes: class attendance and homework completion; the all-purpose card data includes: canteen consumption amount, category and time; water-fetching time; shower consumption amount and time; supermarket consumption amount, category and time; and the balance on the all-purpose card; the dormitory access data includes: dormitory entry/exit time and location; the library access and borrowing data includes: library entry time, borrowed book titles, borrowing time and return time; the campus activity data includes: class and school organization duty positions held; work-study hours; awards and punishments in each semester; and extracurricular activity scores.
Further, the preprocessing step specifically comprises: duplicate-value processing, missing-value processing, noise-value processing, and type conversion.
The duplicate-value processing includes: deleting duplicate rows with the drop_duplicates function, whose parameters are explained as follows:
subset: the column names to consider, defaulting to all columns;
keep: which duplicate to keep, one of {'first', 'last', False}; keep='first' keeps the first row of each group of duplicates and discards the rest, keep='last' keeps the last row of each group and discards the rest, and keep=False discards every row of each group of duplicates;
inplace: whether to replace in place, one of {False, True}; inplace=False leaves the original table untouched after deduplication, while inplace=True overwrites the original table.
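As an illustration only, the following minimal pandas sketch shows this deduplication step; the DataFrame columns used here (student_id, canteen_amount) are hypothetical and not taken from the patent.

```python
import pandas as pd

# Hypothetical behavior records containing duplicated rows.
df = pd.DataFrame({
    "student_id": [1001, 1001, 1002, 1003, 1003],
    "canteen_amount": [12.5, 12.5, 8.0, 15.0, 15.0],
})

# subset=None considers all columns, keep="first" keeps the first row of each
# duplicate group, and inplace=True overwrites the original table.
df.drop_duplicates(subset=None, keep="first", inplace=True)
print(df)
```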
Missing value processing includes: checking the missing condition, and filling the missing value with the specified value;
the noise-value processing includes: handling noise values with the capping method and, at the same time, with the binning method;
the type conversion includes: fast conversion with a LabelEncoder; mapping categories to numerical values through an explicit mapping, a method whose range of applicability is limited; and conversion with the get_dummies method.
Further, in the missing value processing step, the missing condition is checked, and the missing value is filled with the specified value, and the specific steps are as follows:
the missing values are inspected by constructing a lambda function in Python, in which sum(col.isnull()) gives the number of missing entries in the current column and col.size gives the total number of rows in that column; the missing values are then filled with the fillna method:
the syntax of the fillna method for filling missing values is:
fillna(value=None,method=None,axis=None,inplace=False,limit=None,downcast=None,**kwargs)
wherein the parameter value specifies the replacement value and can be a scalar, a dictionary, a Series or a DataFrame; the parameter method specifies how missing values are filled: 'pad' or 'ffill' propagates the last valid value forward up to the next valid value, while 'backfill' or 'bfill' fills a run of consecutive missing values with the first valid value that follows them; the parameter limit specifies, when method is set, at most how many consecutive missing values are filled; and inplace=True indicates in-place replacement.
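A minimal sketch of this missing-value step, assuming pandas is used as described; the column name score and the choice of the column mean as the specified fill value are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [85.0, np.nan, 78.0, np.nan, 90.0]})

# Report how many entries are missing in each column and how many rows it has.
print(df.apply(lambda col: f"{sum(col.isnull())} missing of {col.size} rows"))

# Fill missing values with a specified value (here the column mean) ...
df["score"] = df["score"].fillna(value=df["score"].mean())
# ... or, alternatively, forward-fill at most one consecutive gap:
# df["score"] = df["score"].fillna(method="pad", limit=1)
```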
Furthermore, in the capping method, records of a continuous variable lying outside three standard deviations above or below its mean are replaced with the values at three standard deviations above or below the mean; the parameter x is a pd.Series column, quantile is the capping interval, and by default values below the 1st percentile or above the 99th percentile are replaced with the 1st and 99th percentiles, the change in the data's frequency distribution being compared with a histogram. The binning method performs equal-width binning directly with the cut function: cut automatically takes a value slightly below the column minimum as the lower limit and the maximum as the upper limit, divides the range into five equal-width bins, and yields a Categories-type column, similar to a factor in R, representing a categorical variable.
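The following sketch illustrates the capping and equal-width binning described above under the stated percentile defaults; the helper name cap and the simulated spending data are hypothetical.

```python
import numpy as np
import pandas as pd

def cap(x: pd.Series, quantile: tuple = (0.01, 0.99)) -> pd.Series:
    """Replace values below the 1st / above the 99th percentile with those percentiles."""
    low, high = x.quantile(quantile[0]), x.quantile(quantile[1])
    return x.clip(lower=low, upper=high)

spend = pd.Series(np.random.normal(20, 5, 1000))   # simulated canteen spending
capped = cap(spend)

# Equal-width binning into five intervals; the result is a Categorical column,
# similar to a factor in R.
bins = pd.cut(capped, 5)
print(bins.value_counts())
```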
Further, partitioning into the three types of data sets with the K-means algorithm mainly comprises the following sub-steps:
randomly selecting K samples from the data set as the initial cluster centers;
calculating the Euclidean distance between every other sample and each of the K centers;
comparing each sample's K distances to the K centers and assigning the sample to the nearest center;
recalculating the cluster centers and repeating the previous steps until the center positions converge.
At this point, the K-means algorithm has completed the partition into the three types of data sets.
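A minimal sketch of this clustering step, assuming scikit-learn's KMeans is used for the K-means partition into three clusters; the feature columns shown are illustrative and not the patent's actual feature set.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical preprocessed behavior features.
features = pd.DataFrame({
    "gpa":            [3.2, 2.1, 3.8, 1.9, 2.8, 3.5],
    "canteen_amount": [20.5, 5.0, 18.0, 4.2, 15.0, 22.0],
    "gender":         ["M", "F", "M", "F", "M", "F"],
})
# One-hot encode the discrete feature so Euclidean distances are meaningful.
features = pd.get_dummies(features, columns=["gender"])

# Partition the samples into three clusters (cluster ids 0/1/2, mapped to label 1/2/3).
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(features)
print(labels)
```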
Further, the step of adjusting the parameters specifically comprises:
firstly, checking the data condition, and then setting training parameters;
max_depth sets the maximum depth of each tree; the default value is 6 and the value range is [1, +∞);
with min_child_weight, after each boosting step the algorithm can directly obtain the weights of the new features; at the same time, to prevent overfitting, a larger value keeps the model from learning locally special samples;
the learning task and the corresponding objective are defined, where "binary:logistic" denotes a two-class logistic regression problem whose output is a probability;
the other parameters keep their default values, and the model is then trained;
the number of boosting iterations is set to 60;
the parameter dictionary's items are returned as a traversable list of (key, value) tuples;
the output value is the probability that the sample belongs to the first class, and this probability is converted to 0 or 1;
this completes the parameter-adjustment steps.
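A sketch of this parameter setup and training step, assuming the xgboost Python package; the synthetic data and any parameter values other than those named above (max_depth=6, binary:logistic, 60 boosting rounds) are illustrative only.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-ins: X are preprocessed behavior features, y is 1 for the
# "likely psychological risk" cluster and 0 otherwise.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 2, 200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "max_depth": 6,                  # maximum tree depth, default 6
    "min_child_weight": 1,           # larger values guard against overfitting
    "objective": "binary:logistic",  # two-class logistic regression, outputs a probability
}
print(list(params.items()))          # the parameter dict as traversable (key, value) tuples

model = xgb.train(params, dtrain, num_boost_round=60)   # 60 boosting iterations

# The raw output is the probability of the positive class; threshold it to 0/1.
pred = (model.predict(dtrain) > 0.5).astype(int)
```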
Further, the classification with supervised learning by using the XGBoost algorithm to obtain the XGBoost prediction model specifically includes:
The XGBoost algorithm objective function is

$$J(f_t)=\sum_{i=1}^{n} L\!\left(y_i,\;\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t) \tag{1}$$

where $J(f_t)$ is the objective function of the learning model after a total of t iterations, i indexes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of sample $x_i$ by the first t-1 trees, $x_i$ denotes the sample's input features, $f_t$ is the t-th regression tree, and $\Omega(f_t)$ is the regularization term applied to the t-th tree.

Expanding according to Taylor's formula:

$$J(f_t)\approx\sum_{i=1}^{n}\left[L\!\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t) \tag{2}$$

with the first derivative $g_i$ and second derivative $h_i$ defined as

$$g_i=\partial_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right),\qquad h_i=\partial^{2}_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right) \tag{3}$$

The decision-tree complexity is computed as

$$\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2} \tag{4}$$

where T is the number of leaf nodes, $\gamma$ is the coefficient of T, $w_j$ is the score of the j-th leaf node of the t-th tree, and the parameter values are obtained with a grid-search algorithm.

Substituting (2), (3) and (4) into (1) gives the objective function

$$J(f_t)=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T,\qquad G_j=\sum_{i\in I_j}g_i,\;\;H_j=\sum_{i\in I_j}h_i \tag{5}$$

where $I_j$ is the set of samples falling into leaf j. Setting the derivative of (5) with respect to $w_j$ to zero, the optimal leaf score is

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the corresponding minimum value of the objective function is

$$J^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The last expression acts as a scoring function: the smaller its value, the better the tree structure.
The discretized data set is split into training and test sets at a ratio of 7:3; the test samples corresponding to the 3 tasks are input into the trained XGBoost model to obtain the estimated values simultaneously, the required estimated value is finally selected from them, and the selected estimated value is compared with the actually measured data.
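The 7:3 split and accuracy test might look like the following sketch, here using scikit-learn's train_test_split and the XGBClassifier wrapper; the random data merely stands in for the real discretized student data set.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-ins for the discretized data set and its K-means-derived labels.
rng = np.random.default_rng(0)
X = rng.random((300, 10))
y = rng.integers(0, 2, 300)

# 7:3 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = XGBClassifier(max_depth=6, n_estimators=60, objective="binary:logistic")
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```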
A psychological pre-judging system based on K-means clustering and the XGBoost algorithm, implementing any of the above methods, comprises:
an acquisition and preprocessing module: used for collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
a clustering module: for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space; the K-means algorithm then clusters and partitions the data into three types of data sets, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists.
A prediction module: used for selecting label 1 as the sample class with a high possibility of psychological hidden dangers, performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
The invention has the following advantages and beneficial effects:
1. The individual behavior data and basic information data of claims 2 and 3 are generated spontaneously by people in natural situations; the data are authentic and large in volume, and are more representative than the interview mode of traditional psychological counseling in colleges and universities; moreover, the result of the algorithm is produced in a data-driven, a posteriori manner.
2. The K-means algorithm of claim 7 classifies the student behavior data set into three categories by computing Euclidean distances among the samples.
3. With the features of claims 1, 2 and 7, the preprocessed student behavior data are divided into three categories by Euclidean distance, which reduces the drawbacks introduced by individual subjective factors in earlier practice, makes full use of the student behavior data, and establishes a reasonable digital information platform and psychological-crisis early-warning system.
Drawings
FIG. 1 is a flow chart of a psychological prediction method based on K-means clustering and XGboost algorithm according to a preferred embodiment of the present invention;
FIG. 2 is a psychological prediction model based on K-means clustering and XGboost algorithm;
FIG. 3 is a feature variable importance.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
as shown in fig. 1, a psychological prediction method based on K-means clustering and XGBoost algorithm includes the following steps:
1) establishing and training a mental health prejudging model based on campus behavior characteristics by using a machine learning method based on behavior characteristics of individuals in known samples;
2) acquiring the behavior data of the new individual, and obtaining the psychological health pre-judging result of the new individual according to the psychological health pre-judging model based on the campus behavior characteristics.
The campus behavior characteristics are a set of behavioral results reflecting how the individual is influenced by various factors and psychological changes on campus; they are acquired from the records kept by the relevant departments of the campus to which the individual belongs.
The process of extracting the campus behavior characteristics comprises the following steps:
11) with the permission of the relevant school departments, students' at-school behavior data are collected; the student individual is used as the classification label, a fixed period is used as the statistical window, the data within that fixed period form one group, and the data are recorded as a single group of data;
12) extracting effective behavior records of the individuals from the behavior data of the individuals, wherein the behavior records of the individuals are relational data stored by taking the individuals as units.
After the initial data is collected, the initial data is preprocessed, namely, operations such as repeated value processing, missing value processing, noise value processing, type conversion and the like are carried out on the data, and a relatively complete data set which can be used for training a model is obtained.
The duplicate-value processing includes: deleting duplicates with the drop_duplicates(subset, keep, inplace) method.
Missing value processing includes: looking at the missing condition, filling the missing value with the specified value.
The missing values can be inspected by constructing a lambda function in Python, where sum(col.isnull()) gives the number of missing entries in the current column and col.size gives the total number of rows in that column.
Missing values are then filled using the fillna method and the quantile method.
The noise-value processing includes: handling noise values with the capping method and with the binning method.
The capping method replaces records of a continuous variable lying outside three standard deviations above or below its mean with the values at three standard deviations above or below the mean; the parameter x is a pd.Series column, quantile is the capping interval, and by default values below the 1st percentile or above the 99th percentile are replaced with the 1st and 99th percentiles, the change in the data's frequency distribution being compared with a histogram.
The binning method performs equal-width binning directly with the cut function: cut automatically takes a value slightly below the column minimum as the lower limit and the maximum as the upper limit, divides the range into five equal-width bins, and yields a Categories-type column, similar to a factor in R, representing a categorical variable.
The type conversion includes: fast conversion with a LabelEncoder; mapping categories to numerical values through an explicit mapping, a method whose range of applicability is limited; and conversion with the get_dummies method.
Meanwhile, one-hot coding is used for the discrete features, so that distance calculations between features are more reasonable.
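For illustration, the three type-conversion routes mentioned above (LabelEncoder, explicit mapping, and get_dummies one-hot encoding) could be applied as in this sketch; the columns major and gender are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"major": ["CS", "Math", "CS", "Art"],
                   "gender": ["M", "F", "F", "M"]})

# 1) LabelEncoder: fast integer codes for a single column.
df["major_code"] = LabelEncoder().fit_transform(df["major"])

# 2) Explicit mapping: only practical when the category set is small and known.
df["gender_code"] = df["gender"].map({"M": 0, "F": 1})

# 3) get_dummies: one-hot columns, so distances between discrete values are symmetric.
df = pd.get_dummies(df, columns=["major", "gender"])
print(df.head())
```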
Partitioning into the three types of data sets with the K-means algorithm mainly comprises the following sub-steps:
randomly selecting K samples from the data set as the initial cluster centers;
calculating the Euclidean distance between every other sample and each of the K centers;
comparing each sample's K distances to the K centers and assigning the sample to the nearest center;
recalculating the cluster centers and repeating the previous steps until the center positions converge.
At this point, the K-means algorithm has completed the partition into the three types of data sets, and the XGBoost algorithm is then used for supervised-learning training.
Firstly, checking the data condition, and then setting training parameters;
max_depth sets the maximum depth of each tree; the default value is 6 and the value range is [1, +∞);
with min_child_weight, after each boosting step the algorithm can directly obtain the weights of the new features; at the same time, to prevent overfitting, a larger value keeps the model from learning locally special samples;
the learning task and the corresponding objective are defined, where "binary:logistic" denotes a two-class logistic regression problem whose output is a probability;
the other parameters keep their default values, and the model is then trained;
the number of boosting iterations is set to 60;
the parameter dictionary's items are returned as a traversable list of (key, value) tuples;
the output value is the probability that the sample belongs to the first class, and this probability is converted to 0 or 1;
with the parameter-adjustment steps completed, the XGBoost algorithm is started.
The XGBoost algorithm objective function is

$$J(f_t)=\sum_{i=1}^{n} L\!\left(y_i,\;\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t) \tag{1}$$

where $J(f_t)$ is the objective function of the learning model after a total of t iterations, i indexes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of sample $x_i$ by the first t-1 trees, $x_i$ denotes the sample's input features, $f_t$ is the t-th regression tree, and $\Omega(f_t)$ is the regularization term applied to the t-th tree.

Expanding according to Taylor's formula:

$$J(f_t)\approx\sum_{i=1}^{n}\left[L\!\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t) \tag{2}$$

with the first derivative $g_i$ and second derivative $h_i$ defined as

$$g_i=\partial_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right),\qquad h_i=\partial^{2}_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right) \tag{3}$$

The decision-tree complexity is computed as

$$\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2} \tag{4}$$

where T is the number of leaf nodes, $\gamma$ is the coefficient of T, $w_j$ is the score of the j-th leaf node of the t-th tree, and the parameter values are obtained with a grid-search algorithm.

Substituting (2), (3) and (4) into (1) gives the objective function

$$J(f_t)=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T,\qquad G_j=\sum_{i\in I_j}g_i,\;\;H_j=\sum_{i\in I_j}h_i \tag{5}$$

where $I_j$ is the set of samples falling into leaf j. Setting the derivative of (5) with respect to $w_j$ to zero, the optimal leaf score is

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the corresponding minimum value of the objective function is

$$J^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The last expression acts as a scoring function: the smaller its value, the better the tree structure.
To sum up, the discretized data set is split into training and test sets at a ratio of 7:3; the test samples corresponding to the 3 tasks are input into the trained XGBoost model to obtain the estimated values simultaneously, and the required estimated value is finally selected from them. Once the required estimated value has been selected, it can be compared with the actually measured data to obtain the prediction accuracy of the algorithm.
Finally, by entering a student's relevant data, the student's mental health condition is obtained directly; if a corresponding hidden danger exists, the teacher is informed so that a reasonable guidance approach can be found in time. This helps the school's mental health education work to find and fill gaps, and improves its efficiency and effect.
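A sketch of how such a query might be wired up, assuming a trained XGBClassifier as in the earlier sketch; the function name prejudge, the 0.5 threshold and the returned messages are illustrative assumptions, not the patent's specification.

```python
import numpy as np
from xgboost import XGBClassifier

def prejudge(clf: XGBClassifier, new_student: np.ndarray) -> str:
    """Return a hypothetical pre-judgment message for one preprocessed feature row."""
    prob = clf.predict_proba(new_student.reshape(1, -1))[0, 1]
    if prob > 0.5:
        return "possible psychological hidden danger - notify the counsellor"
    return "no obvious psychological hidden danger"
```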
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (10)

1. A psychological prejudging method based on K-means clustering and an XGboost algorithm is characterized by comprising the following steps:
collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space;
clustering and partitioning into three types of data sets with the K-means algorithm, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists;
selecting label 1 as the sample class with a high possibility of psychological hidden dangers, performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
2. The psychological prediction method based on K-means clustering and XGboost algorithm as claimed in claim 1, wherein the data of the behavior of the individual at school comprises:
basic information data, score data, classroom data, all-purpose card data, dormitory access data, library access and borrowing data and campus activity data.
3. The psychological prediction method based on K-means clustering and XGboost algorithm according to claim 2, characterized in that the basic information data comprises: gender, major, age, native place and hobbies; the score data comprises: compulsory/elective course scores and day-to-day class grades; the classroom data comprises: class attendance and homework completion; the all-purpose card data comprises: canteen consumption amount, category and time; water-fetching time; shower consumption amount and time; supermarket consumption amount, category and time; and the balance on the all-purpose card; the dormitory access data comprises: dormitory entry/exit time and location; the library access and borrowing data comprises: library entry time, borrowed book titles, borrowing time and return time; and the campus activity data comprises: class and school organization duty positions held; work-study hours; awards and punishments in each semester; and extracurricular activity scores.
4. The psychological prediction method based on K-means clustering and XGboost algorithm according to claim 2, characterized in that the preprocessing step specifically comprises: duplicate-value processing, missing-value processing, noise-value processing, and type conversion.
The duplicate-value processing includes: deleting duplicate rows with the drop_duplicates function, whose parameters are explained as follows:
subset: the column names to consider, defaulting to all columns;
keep: which duplicate to keep, one of {'first', 'last', False}; keep='first' keeps the first row of each group of duplicates and discards the rest, keep='last' keeps the last row of each group and discards the rest, and keep=False discards every row of each group of duplicates;
inplace: whether to replace in place, one of {False, True}; inplace=False leaves the original table untouched after deduplication, while inplace=True overwrites the original table.
Missing value processing includes: checking the missing condition, and filling the missing value with the specified value;
the noise-value processing includes: handling noise values with the capping method and, at the same time, with the binning method;
the type conversion includes: fast conversion with a LabelEncoder; mapping categories to numerical values through an explicit mapping, a method whose range of applicability is limited; and conversion with the get_dummies method.
5. The psychological prejudging method based on K-means clustering and XGboost algorithm according to claim 4, wherein in the missing value processing step, the missing condition is checked, and the missing value is filled with a specified value, and the specific steps are as follows:
the missing values are inspected by constructing a lambda function in Python, in which sum(col.isnull()) gives the number of missing entries in the current column and col.size gives the total number of rows in that column; the missing values are then filled with the fillna method:
the syntax of the fillna method for filling missing values is:
fillna(value=None,method=None,axis=None,inplace=False,limit=None, downcast=None,**kwargs)
wherein the parameter value specifies the replacement value and can be a scalar, a dictionary, a Series or a DataFrame; the parameter method specifies how missing values are filled: 'pad' or 'ffill' propagates the last valid value forward up to the next valid value, while 'backfill' or 'bfill' fills a run of consecutive missing values with the first valid value that follows them; the parameter limit specifies, when method is set, at most how many consecutive missing values are filled; and inplace=True indicates in-place replacement.
6. The psychological prediction method based on K-means clustering and XGboost algorithm as claimed in claim 4, wherein in the capping method, records of a continuous variable lying outside three standard deviations above or below its mean are replaced with the values at three standard deviations above or below the mean; the parameter x is a pd.Series column, quantile is the capping interval, and by default values below the 1st percentile or above the 99th percentile are replaced with the 1st and 99th percentiles, the change in the data's frequency distribution being compared with a histogram; and the binning method performs equal-width binning directly with the cut function, which automatically takes a value slightly below the column minimum as the lower limit and the maximum as the upper limit, divides the range into five equal-width bins, and yields a Categories-type column, similar to a factor in R, representing a categorical variable.
7. The psychological prediction method based on K-means clustering and XGboost algorithm according to any one of claims 1-6, characterized in that partitioning into the three types of data sets with the K-means algorithm mainly comprises the following sub-steps:
randomly selecting K samples from the data set as the initial cluster centers;
calculating the Euclidean distance between every other sample and each of the K centers;
comparing each sample's K distances to the K centers and assigning the sample to the nearest center;
recalculating the cluster centers and repeating the previous steps until the center positions converge.
At this point, the K-means algorithm has completed the partition into the three types of data sets.
8. The psychological prediction method based on K-means clustering and XGboost algorithm according to any one of claims 1-6, characterized in that the step of adjusting parameters specifically comprises:
firstly, checking the data condition, and then setting training parameters;
max_depth sets the maximum depth of each tree; the default value is 6 and the value range is [1, +∞);
with min_child_weight, after each boosting step the algorithm can directly obtain the weights of the new features; at the same time, to prevent overfitting, a larger value keeps the model from learning locally special samples;
the learning task and the corresponding objective are defined, where "binary:logistic" denotes a two-class logistic regression problem whose output is a probability;
the other parameters keep their default values, and the model is then trained;
the number of boosting iterations is set to 60;
the parameter dictionary's items are returned as a traversable list of (key, value) tuples;
the output value is the probability that the sample belongs to the first class, and this probability is converted to 0 or 1;
this completes the parameter-adjustment steps.
9. The psychological prediction method based on K-means clustering and XGboost algorithm according to claim 8, wherein performing supervised-learning classification with the XGBoost algorithm to obtain the XGBoost prediction model specifically comprises:
The XGBoost algorithm objective function is

$$J(f_t)=\sum_{i=1}^{n} L\!\left(y_i,\;\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t) \tag{1}$$

where $J(f_t)$ is the objective function of the learning model after a total of t iterations, i indexes the i-th sample, $\hat{y}_i^{(t-1)}$ is the prediction of sample $x_i$ by the first t-1 trees, $x_i$ denotes the sample's input features, $f_t$ is the t-th regression tree, and $\Omega(f_t)$ is the regularization term applied to the t-th tree.

Expanding according to Taylor's formula:

$$J(f_t)\approx\sum_{i=1}^{n}\left[L\!\left(y_i,\hat{y}_i^{(t-1)}\right)+g_i f_t(x_i)+\tfrac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t) \tag{2}$$

with the first derivative $g_i$ and second derivative $h_i$ defined as

$$g_i=\partial_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right),\qquad h_i=\partial^{2}_{\hat{y}^{(t-1)}}L\!\left(y_i,\hat{y}^{(t-1)}\right) \tag{3}$$

The decision-tree complexity is computed as

$$\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2} \tag{4}$$

where T is the number of leaf nodes, $\gamma$ is the coefficient of T, $w_j$ is the score of the j-th leaf node of the t-th tree, and the parameter values are obtained with a grid-search algorithm.

Substituting (2), (3) and (4) into (1) gives the objective function

$$J(f_t)=\sum_{j=1}^{T}\left[G_j w_j+\tfrac{1}{2}\left(H_j+\lambda\right)w_j^{2}\right]+\gamma T,\qquad G_j=\sum_{i\in I_j}g_i,\;\;H_j=\sum_{i\in I_j}h_i \tag{5}$$

where $I_j$ is the set of samples falling into leaf j. Setting the derivative of (5) with respect to $w_j$ to zero, the optimal leaf score is

$$w_j^{*}=-\frac{G_j}{H_j+\lambda}$$

and the corresponding minimum value of the objective function is

$$J^{*}=-\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$$

The last expression acts as a scoring function: the smaller its value, the better the tree structure.
The discretized data set is split into training and test sets at a ratio of 7:3; the test samples corresponding to the 3 tasks are input into the trained XGBoost model to obtain the estimated values simultaneously, the required estimated value is finally selected from them, and the selected estimated value is compared with the actually measured data.
10. A psychological pre-judging system based on K-means clustering and the XGBoost algorithm according to any of claims 1-9, comprising:
an acquisition and preprocessing module: used for collecting students' at-school behavior data, taking the student individual as the classification label, recording the data, and carrying out data preprocessing on the student behavior data, including duplicate-value processing, missing-value processing, noise-value processing and type conversion;
a clustering module: for the discrete features, one-hot coding is used, so that the values of a discrete feature are mapped into a Euclidean space, each value corresponding to a point of that space; the K-means algorithm then clusters and partitions the data into three types of data sets, namely label 1: psychological hidden dangers are likely to exist; label 2: psychological hidden dangers may exist, but the possibility is not obvious; and label 3: no psychological hidden danger exists.
A prediction module: used for selecting label 1 as the sample class with a high possibility of psychological hidden dangers, performing supervised-learning classification with the XGBoost algorithm to obtain an XGBoost prediction model, inputting a new individual's behavior data into the established XGBoost prediction model to obtain the psychological pre-judgment result, adjusting the parameters, and testing the XGBoost prediction model to obtain the model's accuracy.
CN202011467838.3A 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm Active CN112530546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011467838.3A CN112530546B (en) 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011467838.3A CN112530546B (en) 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm

Publications (2)

Publication Number Publication Date
CN112530546A true CN112530546A (en) 2021-03-19
CN112530546B CN112530546B (en) 2024-03-22

Family

ID=74999564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011467838.3A Active CN112530546B (en) 2020-12-14 2020-12-14 Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm

Country Status (1)

Country Link
CN (1) CN112530546B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160239857A1 (en) * 2013-01-04 2016-08-18 PlaceIQ, Inc. Inferring consumer affinities based on shopping behaviors with unsupervised machine learning models
CN109256192A (en) * 2018-07-24 2019-01-22 同济大学 A kind of undergraduate psychological behavior unusual fluctuation monitoring and pre-alarming method neural network based
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO Chunfang et al.: "Research on an XGBoost ensemble algorithm based on K-means clustering" (基于Kmeans聚类的XGBoost集成算法研究), Computer Era (《计算机时代》), no. 10, 15 October 2020 (2020-10-15), pages 12-14 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807612A (en) * 2021-10-13 2021-12-17 四川久远银海软件股份有限公司 Prediction method and device based on mental scale data

Also Published As

Publication number Publication date
CN112530546B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Anastasopoulos et al. Machine learning for public administration research, with application to organizational reputation
US10356027B2 (en) Location resolution of social media posts
Ward Jr et al. Application of an hierarchical grouping procedure to a problem of grouping profiles
Miller Species distribution models: Spatial autocorrelation and non-stationarity
Li et al. Influence of entrepreneurial experience, alertness, and prior knowledge on opportunity recognition
Tiefelsdorf et al. The exact distribution of Moran's I
US20230289665A1 (en) Failure feedback system for enhancing machine learning accuracy by synthetic data generation
CN109002492B (en) Performance point prediction method based on LightGBM
CN110795613B (en) Commodity searching method, device and system and electronic equipment
Agrawal et al. Using data mining classifier for predicting student’s performance in UG level
Ma Evaluating the fit of sequential G-DINA model using limited-information measures
US20200175052A1 (en) Classification of electronic documents
Ikawati et al. Student behavior analysis to detect learning styles in Moodle learning management system
Ali Khalil et al. Developing machine learning models to predict roadway traffic noise: An opportunity to escape conventional techniques
CN114358014B (en) Work order intelligent diagnosis method, device, equipment and medium based on natural language
CN112530546B (en) Psychological pre-judging method and system based on K-means clustering and XGBoost algorithm
Ladi et al. Applications of machine learning and deep learning methods for climate change mitigation and adaptation
Spithourakis et al. Numerically grounded language models for semantic error correction
Alsultanny Selecting a suitable method of data mining for successful forecasting
Pisutaporn et al. Relevant factors and classification of student alcohol consumption
CN114331789B (en) Intelligent cheap and clean knowledge recommendation method, device, equipment and storage medium
CN109190658A (en) Video degree of awakening classification method, device and computer equipment
Agustin Implementation of Generative Pre-Trained Transformer 3 Classify-Text in Determining Thesis Supervisor
CN117272999A (en) Model training method and device based on class incremental learning, equipment and storage medium
Nichols The Application of Machine Learning Techniques to Empowerment, Stress and Workplace Outcomes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant