CN116611915A - Salary prediction method and device based on statistical reasoning - Google Patents

Info

Publication number
CN116611915A
Authority
CN
China
Prior art keywords
salary
field
data set
fields
payroll
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310709286.XA
Other languages
Chinese (zh)
Inventor
向桥梁
张俊龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liantong Hangzhou Technology Service Co ltd
Original Assignee
Liantong Hangzhou Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liantong Hangzhou Technology Service Co ltd
Priority to CN202310709286.XA
Publication of CN116611915A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03: Credit; Loans; Processing thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12: Accounting
    • G06Q40/125: Finance or payroll

Abstract

The application provides a salary prediction method and device based on statistical reasoning. Position-related information is acquired from recruitment websites to construct a first data set; the weighted salary in the position information of the first data set is corrected to obtain a second data set; field mapping and field grouping are performed in sequence on the fields of the position information in the second data set to obtain a third data set; based on the third data set, the salary distribution in the corresponding salary interval is calculated for each combination of the city, working years, education level, industry classification and position type fields to construct a salary statistics table; and, according to the input value of the target field and the salary statistics table, salary prediction is performed by dimension-reduction recursive reasoning to obtain the predicted salary corresponding to the target field. Statistical reasoning is thereby used to achieve an accurate salary prediction.

Description

Salary prediction method and device based on statistical reasoning
Technical Field
The application relates to the technical field of computers, in particular to a salary prediction method and device based on statistical reasoning.
Background
In the credit field, the applicant's salary is an important piece of information. It affects whether a credit application is approved, as well as later ratings, the initial credit limit and future limit increases. A reasonable way to determine the applicant's salary range is therefore needed.
Common salary estimation methods include the self-reporting method, the social security and housing provident fund method, and the recruitment data statistics method. The self-reporting method tends to overestimate salary: the applicant has an incentive to report a higher figure to improve the approval rate, and the overestimated salary increases the credit risk of the financial institution. The social security and housing provident fund method has limited coverage, since not every applicant can provide social security and provident fund information, and the contributions paid tend to understate salary; an underestimated salary leads the financial institution to reject high-quality clients, which increases marketing costs and reduces profit. The recruitment data statistics method tends to overestimate when it converts a salary range into a single salary by simple averaging, and grouping fields manually does not make full use of the characteristics of the data, so missing values in real data are difficult to handle effectively.
Disclosure of Invention
The application aims to provide a salary prediction method and device based on statistical reasoning, which obtain a reasonable salary for each record of the acquired data set by statistical correction, then perform field mapping, field grouping and field combination in sequence so that the data set is processed intelligently and without manual intervention, and finally predict salary accurately by dimension-reduction recursive reasoning.
According to one aspect of the present application, there is provided a salary prediction method based on statistical reasoning, wherein the method comprises:
acquiring position-related information from a recruitment website to construct a first data set, wherein the first data set comprises at least two pieces of position information, each piece of position information comprises six fields and a weighted salary, and the six fields are the city, working years, education level, industry classification, position type and salary interval;
correcting the weighted salary in each piece of position information in the first data set to obtain a second data set;
performing field mapping and field grouping in sequence on the fields of the position information in the second data set to obtain a third data set;
calculating, based on the third data set, the salary distribution in the corresponding salary interval for each combination of the city, working years, education level, industry classification and position type fields to construct a salary statistics table;
and performing salary prediction by dimension-reduction recursive reasoning according to the input value of the target field and the salary statistics table to obtain the predicted salary corresponding to the target field.
Further, in the above method, correcting the weighted salary in each piece of position information in the first data set to obtain a second data set includes:
computing the salary values at different quantiles of the weighted salary, within the salary interval, for each piece of position information in the first data set;
calculating the correction weight corresponding to each piece of position information based on the quantile salaries at those quantiles;
and correcting and replacing the weighted salary in each piece of position information, based on the lowest salary and the highest salary in its salary interval and the corresponding correction weight, to obtain the corrected salary of each piece of position information and thereby the second data set.
Further, in the above method, performing field mapping and field grouping in sequence on the fields of the position information in the second data set to obtain a third data set includes:
mapping the field values of the position information in the second data set between the recruitment website and the actual business to obtain a mapping data set corresponding to the actual business;
and performing field grouping on the fields of each piece of position information according to the mapped values of those fields in the mapping data set to obtain the third data set.
Further, in the above method, performing field grouping on the fields of each piece of position information according to the mapped values of those fields in the mapping data set to obtain a third data set includes:
performing the following operations on each field of the position information in the mapping data set to obtain the third data set after field grouping:
for each mapped value v of the field, calculating the values of the corrected salary at different quantiles in the salary interval to form a row vector $\bar{V}$ of multidimensional features corresponding to that mapped value v;
combining the row vectors $\bar{V}$ of all mapped values v of the field to form a feature matrix M1 for the field, wherein each row of M1 corresponds to one value of the field and each column corresponds to one feature;
standardizing the feature matrix M1 by standard deviation to obtain the corresponding standardized feature matrix M2;
and clustering the standardized feature matrix M2 with the K-means clustering algorithm using Euclidean distance to group the values of the field.
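The standardization and clustering steps above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the z-score reading of "standardizing by standard deviation", the deterministic choice of initial centers and the sample matrix are all assumptions that do not come from the application.

```python
import math
from statistics import mean, pstdev

def standardize(m1):
    """Column-wise z-score: (x - column mean) / column standard deviation."""
    cols = list(zip(*m1))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - mu) / sd for x, mu, sd in zip(row, mus, sds)] for row in m1]

def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's algorithm with Euclidean distance; returns one label per row."""
    n = len(rows)
    # Assumption: deterministic start from k rows spread evenly over the sorted rows
    # (requires n >= k to give distinct centers).
    order = sorted(range(n), key=lambda i: rows[i])
    centers = [rows[order[(2 * j + 1) * n // (2 * k)]] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(r, centers[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [rows[i] for i in range(n) if labels[i] == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

# Hypothetical feature matrix M1: one row per city value, columns are salary quantiles.
m1 = [
    [5000, 6000, 7000], [5200, 6100, 7200],
    [9000, 11000, 13000], [9100, 11200, 13500],
    [15000, 20000, 26000], [15500, 21000, 27000],
]
m2 = standardize(m1)
labels = kmeans(m2, k=3)
```

Rows that receive the same label would form one group of field values; the number of clusters `k` is a free parameter here.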
Further, in the above method, calculating, based on the third data set, the salary distribution in the corresponding salary interval for each combination of the city, working years, education level, industry classification and position type fields to construct a salary statistics table includes:
taking the city, working years, education level, industry classification and position type fields in the third data set as a field set;
forming every combination of the fields in the field set to obtain all field subsets;
enumerating the values of the fields in each field subset to obtain all field value subsets corresponding to that field subset;
extracting from the third data set the sub data set matching each field value subset, counting its records, and, if the count is greater than a preset threshold, computing the salary distribution of that sub data set;
and merging all field subsets and their corresponding salary distributions to form the salary statistics table.
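The construction above can be illustrated with a short Python sketch. The record layout, the field names, the threshold value and the choice of summary statistics are assumptions made for illustration; grouping by the values actually observed in the data stands in for enumerating every possible field value subset, since unobserved subsets would have no records anyway.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median

# Hypothetical field names for the five grouping fields.
FIELDS = ("city", "working_years", "education", "industry", "job_type")

def build_salary_table(records, min_count=2):
    """Map each (field, value) combination to summary statistics of its salaries."""
    table = {}
    for size in range(len(FIELDS) + 1):
        for subset in combinations(FIELDS, size):  # every field subset
            groups = defaultdict(list)
            for rec in records:
                key = tuple(rec[f] for f in subset)
                groups[key].append(rec["salary"])
            for key, salaries in groups.items():
                # Keep only sub data sets larger than the preset threshold.
                if len(salaries) > min_count:
                    table[frozenset(zip(subset, key))] = {
                        "count": len(salaries),
                        "min": min(salaries),
                        "median": median(salaries),
                        "max": max(salaries),
                    }
    return table

# Hypothetical postings: three in Beijing, one in Shanghai, otherwise identical.
base = {"working_years": "3-5", "education": "bachelor",
        "industry": "finance", "job_type": "sales"}
records = [dict(base, city="Beijing", salary=s) for s in (10000, 12000, 14000)]
records.append(dict(base, city="Shanghai", salary=9000))
table = build_salary_table(records, min_count=2)
```

In this sketch the empty field subset is keyed by `frozenset()` and holds the distribution of the whole data set, which gives a last-resort fallback when no narrower combination has enough data.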
Further, in the above method, performing salary prediction by dimension-reduction recursive reasoning according to the input value of the target field and the salary statistics table to obtain the predicted salary corresponding to the target field includes:
obtaining a target field value set according to the input value of the target field, wherein the target field value set comprises each candidate field and its field value;
querying whether the candidate fields and field values of the target field value set exist in the salary statistics table;
if so, obtaining the predicted salary corresponding to the target field from the table;
if not, removing one candidate field (one dimension) at a time from the target field value set to obtain different subsets, and recursively querying the salary statistics table to obtain the salary distribution corresponding to each subset;
and taking a weighted average or an equally weighted average of the salary distributions of all the subsets to obtain the predicted salary for the subsets, which is determined as the predicted salary corresponding to the target field.
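A minimal Python sketch of this recursive lookup is given below. It assumes, for illustration only, that the salary statistics table stores a single representative salary per combination of field values (keyed by a frozenset of field/value pairs) and that the subsets are combined by a plain equal-weight mean; the application leaves the weighting scheme open.

```python
def predict(query, table):
    """Look up `query` (field -> value); on a miss, drop one field at a time and recurse."""
    key = frozenset(query.items())
    if key in table:
        return table[key]  # exact combination found in the table
    if not query:
        return None  # nothing left to drop and no overall entry
    estimates = []
    for field in query:
        reduced = {f: v for f, v in query.items() if f != field}
        estimate = predict(reduced, table)
        if estimate is not None:
            estimates.append(estimate)
    # Assumption: equal-weight average over the reduced subsets.
    return sum(estimates) / len(estimates) if estimates else None

# Hypothetical table with one representative salary per combination.
table = {
    frozenset({("city", "Beijing"), ("education", "bachelor")}): 12000.0,
    frozenset({("city", "Beijing"), ("industry", "finance")}): 16000.0,
    frozenset({("city", "Beijing")}): 11000.0,
    frozenset(): 9000.0,
}
query = {"city": "Beijing", "education": "bachelor", "industry": "finance"}
predicted = predict(query, table)
```

With at most five fields the recursion visits only a handful of subsets, so no memoization is used in this sketch.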
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement a payroll prediction method based on statistical reasoning as described above.
According to another aspect of the present application, there is also provided a salary prediction apparatus based on statistical reasoning, wherein the apparatus includes:
one or more processors;
a computer readable medium for storing one or more computer readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement a statistical reasoning-based payroll prediction method as described above.
Compared with the prior art, the application acquires position-related information from a recruitment website to construct a first data set, wherein the first data set comprises at least two pieces of position information, each piece of position information comprises six fields and a weighted salary, and the six fields are the city, working years, education level, industry classification, position type and salary interval; corrects the weighted salary in each piece of position information in the first data set to obtain a second data set; performs field mapping and field grouping in sequence on the fields of the position information in the second data set to obtain a third data set; calculates, based on the third data set, the salary distribution in the corresponding salary interval for each combination of the city, working years, education level, industry classification and position type fields to construct a salary statistics table; and performs salary prediction by dimension-reduction recursive reasoning according to the input value of the target field and the salary statistics table to obtain the predicted salary corresponding to the target field. In this way, the data set is refined by correcting the weighted salary, a high-quality position information data set is obtained through field mapping, field grouping and field combination, an accurate and stable salary distribution is obtained by statistics, and, combined with dimension-reduction recursive reasoning, an accurate salary prediction is achieved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a payroll prediction method based on statistical reasoning in accordance with an aspect of the present application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The application is described in further detail below with reference to the accompanying drawings.
In one exemplary configuration of the application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Fig. 1 shows a flow chart of a salary prediction method based on statistical reasoning according to an aspect of the present application. The method includes steps S11, S12, S13, S14 and S15, as follows:
Step S11: acquiring position-related information from a recruitment website to construct a first data set, wherein the first data set comprises at least two pieces of position information, each piece of position information comprises six fields and a weighted salary, and the six fields are the city, working years, education level, industry classification, position type and salary interval.
Here, the fields of the position information are the attributes of a job posting on the recruitment website, and each piece of position information also includes the field value of each field. Specifically, the city field is the city where the position is located (for example, Beijing or Shanghai). The working years field is the required work experience; in practical application scenarios its value may be a specific number or a numerical range (for example, 2 years, 3-5 years, 5-10 years). The education level field is the education required by the position (for example, middle school, high school, university, bachelor's degree, master's degree, doctorate). The industry classification field is the industry to which the recruiting company belongs (for example, the education industry or the finance industry). The position type field is the specific category of the position (for example, salesman, sales person, product sales). The salary interval field is the salary offered for the position, expressed as a range between the lowest salary and the highest salary.
However, because the salary interval field is a range, it cannot be used directly for salary statistics and must be converted into a single representative value. This can be done by weighting the salary, where the weight may be any value between 0 and 1; in a preferred embodiment of the present application the weight is preferably 0.5.
For example, in a preferred embodiment of the present application the weight is set to 0.5, and the weighted salary S1 of each piece of position information is obtained as S1 = lowest salary × 0.5 + highest salary × 0.5. The weighted salary is added to the corresponding piece of position information, so that the six fields together with the weighted salary describe each piece of position information fully and clearly, and a clear and comprehensive first data set is constructed.
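As a minimal sketch of the conversion described above, the following Python function collapses a salary interval into one weighted salary. The function name, its parameters and the sample figures are illustrative assumptions; the application states only the 0.5 weighting, so applying a general weight to the highest salary (and its complement to the lowest) is also an assumption.

```python
def weighted_salary(lowest, highest, weight=0.5):
    """Collapse a salary interval into one value; `weight` applies to the highest salary."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return lowest * (1.0 - weight) + highest * weight

# Hypothetical posting advertising 8,000 to 12,000 per month.
s1 = weighted_salary(8000, 12000)  # midpoint of the interval when weight is 0.5
```

With the default weight of 0.5 the result is simply the midpoint of the advertised range.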
Step S12: correcting the weighted salary in each piece of position information in the first data set to obtain a second data set. Here, correction means further processing the weighted salary to obtain more accurate salary data; the corrected salary replaces the weighted salary in each piece of position information. This avoids the instability and inaccuracy of the weighted salary caused by the large gap between actual salaries and the salaries displayed on recruitment websites, makes the salary of each posting in the second data set more reasonable, and facilitates statistical reasoning on the salary data.
Step S13: performing field mapping and field grouping in sequence on the fields of the position information in the second data set to obtain a third data set.
In practical application scenarios, field mapping is performed mainly according to the meaning of each field value. It resolves the large differences between field values in recruitment website data and field values in the actual business, and unifies and standardizes the position information in the second data set. Field grouping includes, but is not limited to, grouping each field by its values to obtain the groups of that field and collecting the groups of all fields to obtain the third data set, so that each field is grouped independently in an intelligent, unsupervised manner.
Step S14: calculating, based on the third data set, the salary distribution in the corresponding salary interval for each combination of the city, working years, education level, industry classification and position type fields to construct a salary statistics table. The fields may be combined in any way to generate a number of different field combinations. The salary distribution includes, but is not limited to, statistics such as the minimum, the 25%, 50% and 75% quantile values and the maximum of the salaries of the position information falling under each field combination, so that the salary distribution is computed for the data of every field combination and a comprehensive picture of the salary distribution is obtained.
Step S15: performing salary prediction by dimension-reduction recursive reasoning according to the input value of the target field and the salary statistics table to obtain the predicted salary corresponding to the target field.
Through steps S11 to S15, the data set is refined by correcting the weighted salary, a high-quality position information data set is obtained through field mapping, field grouping and field combination, an accurate and stable salary distribution is obtained by statistics, and, combined with dimension-reduction recursive reasoning, an accurate salary prediction is achieved.
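The salary distribution statistics named in step S14 (minimum, 25%, 50% and 75% quantile values, maximum) could be computed as in the following Python sketch. The function name, the dictionary keys, the sample salaries and the choice of the "inclusive" interpolation method are assumptions made for illustration.

```python
from statistics import quantiles

def salary_distribution(salaries):
    """Summarize a list of salaries as minimum, quartiles and maximum."""
    if len(salaries) < 2:
        raise ValueError("need at least two salaries to estimate quartiles")
    # method="inclusive" treats the list as the full population of postings.
    p25, p50, p75 = quantiles(salaries, n=4, method="inclusive")
    return {"min": min(salaries), "p25": p25, "p50": p50,
            "p75": p75, "max": max(salaries)}

# Hypothetical corrected salaries of the postings under one field combination.
dist = salary_distribution([6000, 8000, 10000, 12000, 20000])
```

One such summary per field combination would make up a row of the salary statistics table.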
In a preferred embodiment of the present application, position information 1, position information 2, ……, position information (n-1) and position information n are acquired from the recruitment website, where position information 1 is {city = city 1, working years = years 1, education level = education 1, industry classification = industry 1, position type = position 1, salary interval = salary interval 1}; position information 2 is {city = city 2, working years = years 2, education level = education 2, industry classification = industry 2, position type = position 2, salary interval = salary interval 2}; ……; position information (n-1) is {city = city (n-1), working years = years (n-1), education level = education (n-1), industry classification = industry (n-1), position type = position (n-1), salary interval = salary interval (n-1)}; and position information n is {city = city n, working years = years n, education level = education n, industry classification = industry n, position type = position n, salary interval = salary interval n}. Preferably, the weight is set to 0.5, the weighted salary S1 of each piece of position information is calculated as S1 = lowest salary × 0.5 + highest salary × 0.5, and the weighted salary is added to the corresponding piece of position information to obtain the first data set D1.
The weighted salary S1 of each piece of position information in the first data set D1 is corrected to obtain the corrected salary S2 of that piece of position information, and S1 is replaced by S2 in each piece of position information to obtain the second data set D2 = {position information 1', position information 2', ……, position information (n-1)', position information n'}.
The field values of position information 1' to position information n' in the second data set D2 are mapped according to the meaning of each field value. Once the field values are unified, each of the six fields is grouped based on its field values, giving a grouping for each field (the city field, working years field, education level field, industry classification field, position type field and salary interval field). The groupings of the six fields are then collected to obtain the third data set D4 (D4 = city field grouping + working years field grouping + education level field grouping + industry classification field grouping + position type field grouping + salary interval field grouping).
Based on the third data set D4, the city, working years, education level, industry classification and position type fields are combined in every possible way to obtain 32 field combinations (field combination 1, field combination 2, ……, field combination 31 and field combination 32; five fields yield 2^5 = 32 subsets, counting the empty subset). The salary distribution of each field combination is computed, giving the salary distribution of field combination 1, the salary distribution of field combination 2, ……, the salary distribution of field combination 31 and the salary distribution of field combination 32, thereby constructing the salary statistics table T1.
A target field is then input and salary prediction is performed against the salary statistics table T1 to obtain the predicted salary corresponding to the target field.
Next, in the above embodiment of the present application, step S12 of correcting the weighted salary in each piece of position information in the first data set to obtain a second data set specifically includes:
computing the salary values at different quantiles of the weighted salary, within the salary interval, for each piece of position information in the first data set. Here, the quantiles include, but are not limited to, the 1%, 50% and 99% quantiles; in a preferred embodiment of the present application, the salary values of the weighted salary at the 1%, 50% and 99% quantiles in the salary interval are preferably computed.
Calculating the correction weight corresponding to each piece of position information based on the quantile salaries at those quantiles. Here, the formula for the correction weight may be set by a computer algorithm according to the data set or requirements, or set manually in advance; the correction weight is generally smaller than 0.5. In a preferred embodiment of the present application, the correction weight is preferably w = (50% quantile salary - 1% quantile salary) / (99% quantile salary - 1% quantile salary).
Correcting and replacing the weighted salary in each piece of position information, based on the lowest salary and the highest salary in its salary interval and the corresponding correction weight, to obtain the corrected salary (S2) of each piece of position information and thereby the second data set. In a preferred embodiment of the present application, the corrected salary is preferably calculated as S2 = lowest salary × (1 - correction weight) + highest salary × correction weight. The weighted salary in each piece of position information is thus corrected with reference to the overall salary distribution, so that the salary data in the second data set are reasonable and stable, and salaries in the data set are not misrepresented because they are too high or too low.
In a preferred embodiment of the present application, the 1%, 50% and 99% quantile salary values are computed for the weighted salary S1 of each of position information 1 to position information n in the first data set D1. This gives three quantile salary values (S11, S12 and S13) for the weighted salary S1 of position information 1, where S11 is the 1% quantile salary value, S12 is the 50% quantile salary value and S13 is the 99% quantile salary value; three quantile salary values (S21, S22 and S23) for the weighted salary S1 of position information 2; ……; three quantile salary values [S(n-1)1, S(n-1)2 and S(n-1)3] for the weighted salary S1 of position information (n-1); and three quantile salary values (Sn1, Sn2 and Sn3) for the weighted salary S1 of position information n.
Using the correction weight formula w = (50% quantile salary - 1% quantile salary) / (99% quantile salary - 1% quantile salary) and the 1%, 50% and 99% quantile salaries of each of position information 1 to position information n, the correction weight w1 of position information 1, the correction weight w2 of position information 2, ……, the correction weight w(n-1) of position information (n-1) and the correction weight wn of position information n are calculated. Then, with the corrected salary formula S2 = lowest salary × (1 - w) + highest salary × w, the corrected salaries of position information 1, position information 2, ……, position information (n-1) and position information n are calculated, and the weighted salary S1 of each piece of position information is replaced by the corrected salary S2 to obtain the second data set D2.
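The correction can be illustrated with the following Python sketch. It rests on a loudly stated assumption: the 1%, 50% and 99% quantiles are taken over the weighted salaries of the whole data set, which yields one shared weight w, whereas the text above speaks of a weight per piece of position information without saying which sample each set of quantiles is drawn from. Function names and sample intervals are hypothetical.

```python
from statistics import quantiles

def correction_weight(weighted_salaries):
    """w = (q50 - q1) / (q99 - q1) over the weighted salaries of the data set."""
    cuts = quantiles(weighted_salaries, n=100, method="inclusive")  # 99 cut points
    q01, q50, q99 = cuts[0], cuts[49], cuts[98]
    if q99 == q01:
        return 0.5  # assumption: fall back to the midpoint when all salaries coincide
    return (q50 - q01) / (q99 - q01)

def corrected_salary(lowest, highest, w):
    """S2 = lowest * (1 - w) + highest * w."""
    return lowest * (1.0 - w) + highest * w

# Hypothetical salary intervals (lowest, highest); the last one is an outlier.
intervals = [(3000, 5000), (4000, 6000), (5000, 8000), (6000, 10000), (20000, 40000)]
s1 = [0.5 * lo + 0.5 * hi for lo, hi in intervals]   # weighted salaries
w = correction_weight(s1)
s2 = [corrected_salary(lo, hi, w) for lo, hi in intervals]
```

For a right-skewed sample such as this one, the median sits close to the low end of the range, so w falls below 0.5 and every corrected salary moves from the midpoint toward the lowest salary of its interval.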
Next, in the above embodiment of the present application, the step S13 sequentially and respectively performs field mapping and field grouping on the fields of the position information in the second data set to obtain a third data set, which specifically includes:
performing value mapping between actual service and the recruitment website on the fields of the position information in the second data set to obtain a mapping data set (D3) corresponding to the actual service; in the preferred embodiment of the application, the determination of the field value in the actual service is preferably performed by referring to the national economy industry classification and the national institutes of occupation classification, so that the problems of messy data sets, difficult statistics and the like caused by different field values of recruitment websites and the field values of the actual service are solved, unified standardized processing of each field in the second data set is realized, the obtained mapping data set corresponding to the actual service is enabled to be tidy and clear, and the subsequent data processing operation of the third data set is accelerated.
According to the mapped value corresponding to each position information field in the mapping data set, field grouping is carried out on each position information field to obtain a third data set; in the practical application scenario, when field values are utilized to group fields, the values of some fields are quite large (for example, industry classification and job types are more than 20 industries/job types), so that the data quantity corresponding to the value combination of some fields is quite low or no corresponding data exists, the effect of subsequent establishment of a payroll table is unstable, therefore, after one field is grouped according to the values, the grouping quantity is controlled to be between 3 and 5, the grouping generation of no actual data is effectively avoided, the intelligent grouping is realized, no manual participation is needed, and an unsupervised mode is extended into each field, so that the grouping of the fields is more thorough, and the data analysis is more thorough.
In a preferred embodiment of the present application, the field values in the actual service are determined by reference to the national economy industry classification and the national occupation classification dictionary, and the fields of position information 1' to position information n' in the second data set are value-mapped between the actual service and the recruitment website to obtain the mapping data set D3 corresponding to the actual service.
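The value mapping described above can be sketched as a per-field lookup table. This is only an illustration: the table contents, the name `SITE_TO_SERVICE`, and the choice to keep unmapped values unchanged are assumptions and are not taken from the application.

```python
# Hypothetical lookup tables: recruitment-site value -> actual-service value.
SITE_TO_SERVICE = {
    "industry classification": {
        "Banking": "Finance",
        "Securities": "Finance",
        "Online games": "Information technology services",
    },
    "education level": {
        "Undergraduate": "Bachelor's degree",
    },
}


def map_record(record, mapping=SITE_TO_SERVICE):
    """Return a copy of the record with mapped fields replaced by service values."""
    mapped = dict(record)
    for field, table in mapping.items():
        if field in mapped:
            # Assumption: a value without a mapping entry is kept as it is.
            mapped[field] = table.get(mapped[field], mapped[field])
    return mapped


record = {"city": "Beijing", "industry classification": "Banking"}
mapped_record = map_record(record)
```

Applying `map_record` to every record of the second data set would give a data set in the role of D3.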
The six fields of each piece of position information in the mapping data set D3 are then grouped separately to obtain group a of the city field, group b of the working-years field, group c of the education-degree field, group d of the industry-classification field, group e of the job-type field and group f of the payroll-interval field, and the groups of the six fields are combined to obtain the third data set D4 (namely D4 = a + b + c + d + e + f).
Next, in the foregoing embodiment of the present application, in step S13, according to the mapped value corresponding to the field of each piece of position information in the mapping dataset, the fields of each piece of position information are subjected to field grouping to obtain a third dataset, which specifically includes:
each field in each piece of position information in the mapping data set is respectively subjected to the following operation to obtain a corresponding third data set after field grouping:
for each mapped value V of a field, calculating the values of the corrected payroll at different quantiles within the payroll interval to form the row vector $\bar{V}$ of multi-dimensional features corresponding to that value, i.e., each value V corresponds to one row vector $\bar{V}$ of multi-dimensional features, for example $\bar{V} = (v_{25\%},\ v_{50\%},\ v_{70\%})$, where $v_{25\%}$ is the payroll value at the 25% quantile of the corrected payroll for the value V within the payroll interval, $v_{50\%}$ is the payroll value at the 50% quantile, and $v_{70\%}$ is the payroll value at the 70% quantile.
combining the row vectors $\bar{V}$ corresponding to the mapped values V of the field to form the feature matrix M1 corresponding to the field, in which each row corresponds to one value of the field and each column corresponds to one feature, namely $M_1 = \begin{pmatrix} \bar{V}_1 \\ \bar{V}_2 \\ \vdots \\ \bar{V}_m \end{pmatrix}$, where $\bar{V}_i$ is the row vector of the i-th mapped value and m is the number of mapped values of the field.
standardizing the feature matrix M1 using the standard deviation to obtain the corresponding standardized feature matrix M2, such that each column vector of M2 has a mean of 0 and a standard deviation of 1.
clustering the standardized feature matrix M2 using the K-means clustering algorithm with the Euclidean distance to group the field. The K-means clustering algorithm is an iterative clustering analysis algorithm: to divide the data into K groups, K objects are randomly selected as the initial cluster centers, the distance between each object and each cluster center is calculated, each object is assigned to the closest cluster center, and the centers are then updated and the assignment repeated. The Euclidean distance is used to calculate the distance between two vectors in each iteration, so that vectors close to each other are placed in the same group. This unsupervised clustering yields a more reasonable grouping, which helps in understanding the payroll patterns and correlations within each field group and makes the subsequent payroll prediction faster and more convenient. The whole field-grouping process is automatic, which improves grouping efficiency and saves labor and time.
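The first three operations above (quantile row vectors, the feature matrix M1, and the standardized matrix M2) can be sketched as follows. The salary figures are invented, the linear-interpolation quantile convention and the use of the population standard deviation are assumptions, and all function names are hypothetical.

```python
from statistics import mean, pstdev

QUANTILES = (0.25, 0.50, 0.70)  # quantile levels used in the example in the text


def quantile(ordered, q):
    """Quantile of an ascending list by linear interpolation (assumed convention)."""
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def row_vector(salaries):
    """Row vector of quantile features for one mapped value of a field."""
    ordered = sorted(salaries)
    return [quantile(ordered, q) for q in QUANTILES]


def feature_matrix(salaries_by_value):
    """M1: one row per mapped value, one column per quantile feature."""
    values = sorted(salaries_by_value)
    return values, [row_vector(salaries_by_value[v]) for v in values]


def standardize(matrix):
    """M2: z-score each column using the population standard deviation."""
    columns = list(zip(*matrix))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in matrix
    ]


# Hypothetical corrected salaries for three values of one field.
salaries_by_value = {
    "a": [10, 20, 30, 40, 50],
    "b": [20, 30, 40, 50, 60],
    "c": [60, 70, 80, 90, 100],
}
values, M1 = feature_matrix(salaries_by_value)
M2 = standardize(M1)
```

A constant column is mapped to zeros here to avoid division by zero; that handling is a choice of this sketch, not something the application specifies.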
For example, suppose a field I to be grouped has 5 values, a, b, c, d and e. The payroll values of the corrected payroll at the different quantiles are calculated for each of the 5 values to obtain the row vectors of multi-dimensional features $\bar{V}_a$, $\bar{V}_b$, $\bar{V}_c$, $\bar{V}_d$ and $\bar{V}_e$, which are combined to obtain the feature matrix corresponding to field I, $M_1 = \begin{pmatrix} \bar{V}_a \\ \bar{V}_b \\ \bar{V}_c \\ \bar{V}_d \\ \bar{V}_e \end{pmatrix}$. After the feature matrix M1 is standardized to obtain M2, the K-means clustering algorithm with the Euclidean distance yields two categories, g1 and g2, where g1 contains a, b and c and g2 contains d and e, which completes the grouping of field I into groups g1 and g2.
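The grouping in this example can be reproduced with a small K-means routine. The sketch below is a plain Lloyd-style implementation written for illustration; the rows of M2 are invented, and the seed, the iteration cap and the handling of empty clusters are assumptions.

```python
import math
import random


def kmeans(rows, k, max_iterations=100, seed=0):
    """Return one cluster label per row using K-means with Euclidean distance."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(rows, k)]  # random initial centers
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each row joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(row, centers[j])) for row in rows
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [row for row, label in zip(rows, labels) if label == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


# Hypothetical standardized rows of M2 for the values a, b, c, d and e.
names = ["a", "b", "c", "d", "e"]
M2 = [
    [-1.0, -1.0, -1.0],
    [-0.9, -1.1, -0.8],
    [-1.2, -0.9, -1.0],
    [1.5, 1.4, 1.6],
    [1.6, 1.6, 1.2],
]
labels = kmeans(M2, k=2)
groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)
```

With these well-separated rows the two resulting groups are {a, b, c} and {d, e}; which of them receives label 0 depends on the random initial centers.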
Meanwhile, in a practical application scenario, different numbers of groups, generally between 3 and 6, can be tried in order to determine the optimal number. After grouping with the K-means clustering algorithm and the Euclidean distance, a box plot of each category can be drawn and the optimal number of clusters determined from the degree of distinction between the categories: when the optimal number of groups is K, the K groups show an obvious statistical distinction, but when the number of groups is increased by 1 to K+1, the distinction between the groups decreases.
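The box-plot comparison can be approximated numerically with the five-number summary that a box plot draws. In the sketch below, treating non-overlapping interquartile ranges as a sign of good distinction is an assumption made for illustration; the application only describes inspecting the plots. The salary figures are invented.

```python
from statistics import quantiles


def box_summary(values):
    """Five-number summary of one group, i.e. what a box plot displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def boxes_overlap(first, second):
    """True if the interquartile ranges of two groups overlap (assumed criterion)."""
    return first["q1"] <= second["q3"] and second["q1"] <= first["q3"]


# Hypothetical corrected salaries (in thousands) for two groups of one field.
group_1 = box_summary([1, 2, 3, 4, 5])
group_2 = box_summary([10, 11, 12, 13, 14])
well_separated = not boxes_overlap(group_1, group_2)
```

Repeating this check for each candidate number of groups gives a rough indication of where adding one more group stops improving the separation.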
Next, in the above embodiment of the present application, the step S14 counts payroll distributions in the corresponding payroll intervals based on the combination of the third data set on the fields of the city, the working period, the education program, the industry classification and the job type, so as to construct a payroll statistics table, which specifically includes:
And taking the fields of the city, the working years, the educational program, the industry classification and the job type in the third data set as a field set.
Any combination is carried out on the fields in the field set to obtain all field subsets; here, the corresponding fields in any combination may include one or more, or there may be no fields (i.e., the subset of fields is an empty set).
The values of the fields in each field subset are enumerated exhaustively to obtain all field value subsets corresponding to each field subset. For example, if the field subset is { city, education degree }, the generated field value subsets include { city = Beijing, education degree = senior }, { city = Beijing, education degree = bachelor's degree }, { city = Shanghai, education degree = senior }, { city = Shanghai, education degree = bachelor's degree } and the like.
The third data set is divided into a sub data set according to each field value subset, and the number of records in that sub data set is counted. If the number is greater than a preset number threshold, the payroll distribution corresponding to the field value subset is counted; if it is less than or equal to the preset number threshold, no statistics are computed, i.e., the field value subset has no corresponding payroll distribution.
And combining all the field subsets and the corresponding salary distribution thereof to form a salary statistics table, and realizing the salary statistics of the corresponding field combinations in a field combination mode so as to improve the stability of statistics by adopting the minimum data quantity.
In a preferred embodiment of the present application, the field set corresponding to the third data set D4 is { city, working year, education level, industry classification, job type }. The 5 different fields in the field set are combined arbitrarily to obtain 32 field subsets: { }, { city }, { working year }, { education level }, { industry classification }, { job type }, { city, working year }, { city, education level }, …, { city, working year, education level, industry classification, job type }.
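The count of 32 follows from 2^5, since each of the 5 fields is either included in a subset or not. A short sketch of the enumeration, with a hypothetical function name:

```python
from itertools import combinations

FIELDS = ("city", "working year", "education level", "industry classification", "job type")


def field_subsets(fields=FIELDS):
    """All subsets of the field set, from the empty set up to the full set."""
    return [
        subset
        for size in range(len(fields) + 1)
        for subset in combinations(fields, size)
    ]


subsets = field_subsets()
```

The list starts with the empty subset and ends with the subset containing all five fields.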
According to position information 1' to position information n' in the third data set D4, the values of the fields of each of the 32 field subsets are enumerated to obtain all field value subsets corresponding to the 32 field subsets; for each field subset, x field value subsets are generated by enumerating the values of its fields exhaustively. The third data set D4 is divided into a corresponding sub data set by each field value subset and the number of records in the sub data set is counted; when the number is greater than 30 (the preset number threshold), the payroll distribution of the sub data set is counted. The payroll distributions of all sub data sets with more than 30 records are counted in this way to obtain the payroll distributions corresponding to a plurality of field value subsets, and the payroll distributions of all the field value subsets are combined to form the final payroll statistics table T1.
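Building a table in the role of T1 can be sketched as below. Two simplifications are assumptions of this sketch: the payroll distribution is summarized by a record count and quartiles, and only field value subsets that actually occur in the data are visited, which gives the same table because a subset with no records can never exceed the threshold. The records and the `salary` key are invented.

```python
from itertools import combinations
from statistics import quantiles

FIELDS = ("city", "working year", "education level", "industry classification", "job type")


def build_salary_table(records, fields=FIELDS, min_count=30):
    """Map each field value subset to a salary summary if it has enough records."""
    groups = {}
    for record in records:
        for size in range(len(fields) + 1):
            for subset in combinations(fields, size):
                # A field value subset is stored as a frozenset of (field, value) pairs.
                key = frozenset((field, record[field]) for field in subset)
                groups.setdefault(key, []).append(record["salary"])
    table = {}
    for key, salaries in groups.items():
        if len(salaries) > min_count:  # strictly greater than the preset threshold
            q25, median, q75 = quantiles(salaries, n=4, method="inclusive")
            table[key] = {"count": len(salaries), "q25": q25, "median": median, "q75": q75}
    return table


def make_record(city, salary):
    """Hypothetical record with fixed values for the other four fields."""
    return {
        "city": city,
        "working year": "1-3",
        "education level": "bachelor's degree",
        "industry classification": "finance",
        "job type": "analyst",
        "salary": salary,
    }


records = [make_record("Beijing", 10000 + 100 * i) for i in range(31)]
records += [make_record("Shanghai", 20000) for _ in range(5)]
table = build_salary_table(records)
```

In this example the empty subset covers all 36 records, the Beijing subsets pass the threshold with 31 records, and the Shanghai subsets are left out because 5 records is not more than 30.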
In the following embodiment of the present application, the step S15 performs salary prediction by adopting a dimension-reduction recursive reasoning method according to the input value of the target field and the salary statistics table, to obtain a predicted salary corresponding to the target field, and specifically includes:
and obtaining a target field value set according to the input value of the target field, wherein the target field value set comprises each alternative field and the field value thereof.
querying whether the alternative fields and their field values in the target field value set exist in the payroll statistics table, and if so, obtaining the predicted payroll corresponding to the target field.
If not, removing one dimension (one alternative field) from the target field value set at a time to obtain different subsets, and querying the payroll distribution of each subset recursively based on the payroll statistics table to obtain the payroll distribution corresponding to each subset.
performing a weighted average or an equally weighted average of the payroll distributions corresponding to all the subsets to obtain the predicted payroll for all the subsets, which is determined as the predicted payroll corresponding to the target field. Prediction by dimension-reduction recursive reasoning handles missing data in real fields: the predicted data can still be inferred by weighting, so an accurate prediction is obtained whether or not the input of the target field is complete, which widens the range of use and meets the requirements of various application scenarios.
For example, inputting the value { city = Beijing, working year = , education level = bachelor's degree, industry classification = finance, job type = } yields a target field value set that includes the alternative fields (city, education level, industry classification) and their field values (Beijing, bachelor's degree, finance); the payroll statistics table is then queried for these alternative fields and field values.
If they exist, the predicted payroll corresponding to the target field { city = Beijing, working year = , education level = bachelor's degree, industry classification = finance, job type = } is obtained.
If not, the city dimension is removed to obtain the subset { education level = bachelor's degree, industry classification = finance }; the education level dimension is removed to obtain the subset { city = Beijing, industry classification = finance }; the industry classification dimension is removed to obtain the subset { city = Beijing, education level = bachelor's degree }; ……. The payroll distribution of each subset obtained by removing one dimension in turn is queried recursively. After the payroll distributions of all the subsets are obtained, they are weighted and averaged to obtain the predicted payroll of the subsets, and the predicted payroll for the target field is completed based on the predicted payroll of all the subsets.
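The query-then-reduce procedure of this example can be sketched as a short recursive function. Several points are assumptions of the sketch: each table entry is reduced to a single representative salary, all subsets receive equal weight, and fields left empty in the input are simply omitted from the query. The table contents are invented.

```python
def predict(query, table):
    """Predict a salary for a dict of field -> value by dimension-reduction recursion."""
    key = frozenset(query.items())
    if key in table:
        return table[key]  # direct hit in the statistics table
    if not query:
        return None  # nothing left to remove and no entry found
    estimates = []
    for field in query:
        # Remove one dimension and query the smaller subset recursively.
        reduced = {f: v for f, v in query.items() if f != field}
        estimate = predict(reduced, table)
        if estimate is not None:
            estimates.append(estimate)
    # Equal weights are an assumption; the text also allows a weighted average.
    return sum(estimates) / len(estimates) if estimates else None


# Hypothetical statistics table: field value subset -> representative salary.
table = {
    frozenset({("education level", "bachelor's degree"), ("industry classification", "finance")}): 16000,
    frozenset({("city", "Beijing"), ("industry classification", "finance")}): 20000,
    frozenset({("city", "Beijing"), ("education level", "bachelor's degree")}): 18000,
}
query = {
    "city": "Beijing",
    "education level": "bachelor's degree",
    "industry classification": "finance",
}
prediction = predict(query, table)
```

Here the full three-field combination is absent from the table, so the result is the average of the three two-field subsets. The recursion visits at most every subset of the input fields, which stays small for five fields.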
Meanwhile, in a practical application, the method can be organized into a data acquisition module, a data correction module, a field mapping module, a field grouping module, a salary statistics module and a salary prediction module. Salaries are corrected by a statistical method so that the salary values in the data set are more reasonable, a more reasonable grouping is obtained by an unsupervised clustering method, statistical stability is improved by the subset method, and prediction is finally performed by dimension-reduction recursive reasoning.
Specifically, in the data acquisition module, information related to the job position is acquired from the recruitment site.
In the data correction module, the weight is calculated, so that the salary S2 is obtained through statistical correction, and the accuracy of salary corresponding to each piece of position information is improved.
In the field mapping module, the data is processed according to the meaning of the value of the field, so that the uniformity of the data is improved.
In the field grouping module, a characteristic vector and characteristic matrix mode is adopted, and a K-means clustering algorithm is combined to carry out field grouping.
In the salary statistics module, a salary statistics table is built for a data set after field grouping based on the processing of the data in the field grouping module for subsequent salary prediction.
In the salary prediction module, the salary prediction can be completed by inquiring in a salary statistics table mainly according to the value of the input target field.
Moreover, the application can be implemented with any hardware, software or programming language, and can run on any device capable of running programs, such as a traditional server, a notebook computer, a mobile phone or an embedded device with a chip.
According to another aspect of the present application, there is also provided a non-volatile storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement a payroll prediction method based on statistical reasoning as described above.
According to another aspect of the present application, there is also provided an apparatus for salary prediction based on statistical reasoning, wherein the apparatus includes:
one or more processors;
a computer readable medium for storing one or more computer readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement a statistical reasoning-based payroll prediction method as described above.
Herein, for details of each embodiment of the foregoing statistical reasoning-based payroll prediction apparatus, reference may be specifically made to the corresponding portion of the foregoing embodiment of the statistical reasoning-based payroll prediction method, which is not described herein again.
In summary, information related to positions is acquired from a recruitment website to construct a first data set, where the first data set comprises at least two pieces of position information, each piece of position information comprises six fields and a weighted salary, and the six fields are the city, the working life, the education program, the industry classification, the position type and the salary interval. The weighted salary in each piece of position information in the first data set is corrected to obtain a second data set. Field mapping and field grouping are performed in sequence on the fields of the position information in the second data set to obtain a third data set. Based on the combinations of the city, working years, education program, industry classification and job type fields in the third data set, the payroll distribution within the corresponding payroll interval is counted to construct a payroll statistics table. According to the input value of the target field and the payroll statistics table, payroll prediction is performed by dimension-reduction recursive reasoning to obtain the predicted payroll corresponding to the target field. In this way the data set is corrected and refined through the weighted salary; field mapping, field grouping and field combination then yield a high-quality position information data set; statistics on that data set give an accurate and stable payroll distribution; and the dimension-reduction recursive reasoning achieves an accurate payroll prediction.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, e.g., using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present application may be executed by a processor to perform the steps or functions described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Program instructions for invoking the inventive methods may be stored in fixed or removable recording media and/or transmitted via a data stream in a broadcast or other signal bearing medium and/or stored within a working memory of a computer device operating according to the program instructions. An embodiment according to the application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the application as described above.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (8)

1. A payroll prediction method based on statistical reasoning, wherein the method comprises:
acquiring information related to positions from a recruitment website to construct a first data set, wherein the first data set comprises at least two position information, the position information comprises six fields and weighted salary, and the six fields are respectively a city, a working life, an education program, industry classification, position types and salary intervals;
Correcting the weighted salary in each piece of job information in the first data set to obtain a second data set;
performing field mapping and field grouping on the fields of the position information in the second data set in sequence to obtain a third data set;
based on the combination of the third data set to the fields of the city, the working years, the education program, the industry classification and the job type, calculating payroll distribution in the corresponding payroll interval to construct a payroll statistics table;
and carrying out salary prediction by adopting a dimension-reduction recursive reasoning mode according to the input value of the target field and the salary statistical table to obtain predicted salary corresponding to the target field.
2. The method of claim 1, wherein the modifying the weighted payroll in each of the job information in the first dataset to obtain a second dataset comprises:
counting different quantiles of the weighted payroll within the payroll interval in each piece of position information in the first data set;
calculating a correction weight corresponding to each job information based on quantile salaries corresponding to different quantiles in each job information in a salary interval;
And carrying out correction and replacement on the weighted salaries in the position information based on the lowest salary and the highest salary in the salary interval of the position information and the corresponding correction weight to obtain the correction salary in the position information so as to obtain a second data set.
3. The method according to claim 2, wherein the sequentially performing field mapping and field grouping on the fields of the position information in the second data set respectively to obtain a third data set includes:
performing value mapping between actual service and the recruitment website on the fields of the position information in the second data set to obtain a mapping data set corresponding to the actual service;
and according to the mapped value corresponding to each position information field in the mapping data set, carrying out field grouping on each position information field to obtain a third data set.
4. A method according to claim 3, wherein said grouping fields of each job information according to mapped values corresponding to fields of each job information in the mapping dataset to obtain a third dataset comprises:
each field in each piece of position information in the mapping data set is respectively subjected to the following operation to obtain a corresponding third data set after field grouping:
Calculating the values corresponding to different fractional numbers of the correction salary in the salary interval aiming at each mapped value V of the field so as to form a row vector V-of a multidimensional feature corresponding to each mapped value V;
combining the row vectors V-corresponding to the mapped values V of the fields to form a feature matrix M1 corresponding to the fields, wherein each row in the feature matrix M1 corresponds to the value of one field, and each column corresponds to one feature;
carrying out standardization processing on the feature matrix M1 by adopting standard deviation to obtain a corresponding standardized feature matrix M2;
and clustering the standardized feature matrix M2 by adopting a K-means clustering algorithm and a Euclidean distance algorithm to group fields.
5. The method of claim 4, wherein the counting payroll distributions within corresponding payroll intervals based on the combination of the third dataset for the fields of the city, the working hours, the educational program, the industry classification, and the job type to construct a payroll statistics table comprises:
taking the fields of the city, the working years, the educational program, the industry classification, and the job type in the third dataset as a field set;
Any combination is carried out on the fields in the field set to obtain all field subsets;
the values of the fields in each field subset are exhausted, and all field value subsets corresponding to each field subset are obtained;
dividing the third data set into a sub data set according to each field value subset, counting the number of the sub data sets, and counting payroll distribution corresponding to the field subset if the number of the sub data sets is greater than a preset number threshold;
and merging all the field subsets and the corresponding salary distribution thereof to form a salary statistical table.
6. The method of claim 5, wherein the performing salary prediction by using dimension-reduction recursive reasoning according to the input value of the target field and the salary statistics table to obtain the predicted salary corresponding to the target field comprises:
obtaining a target field value set according to the input value of the target field, wherein the target field value set comprises each alternative field and the field value thereof;
querying whether the alternative fields in the target field value set and the field values exist in the salary statistic table,
if so, obtaining predicted salary corresponding to the target field;
If not, removing the alternative field of one dimension in the target field value set each time to obtain different subsets;
based on the salary statistical table, adopting a recursion mode to inquire the salary distribution of each subset to obtain the salary distribution corresponding to each subset; and carrying out weighted average or average weighting on the salary distribution corresponding to all the subsets to obtain predicted salaries corresponding to all the subsets, and determining the predicted salaries corresponding to all the subsets as the predicted salaries corresponding to the target fields.
7. A non-volatile storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to implement the method of any of claims 1 to 6.
8. A payroll prediction device based on statistical reasoning, wherein the device comprises:
one or more processors;
a computer readable medium for storing one or more computer readable instructions,
the one or more computer-readable instructions, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 6.
CN202310709286.XA 2023-06-14 2023-06-14 Salary prediction method and device based on statistical reasoning Pending CN116611915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310709286.XA CN116611915A (en) 2023-06-14 2023-06-14 Salary prediction method and device based on statistical reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310709286.XA CN116611915A (en) 2023-06-14 2023-06-14 Salary prediction method and device based on statistical reasoning

Publications (1)

Publication Number Publication Date
CN116611915A true CN116611915A (en) 2023-08-18

Family

ID=87683608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310709286.XA Pending CN116611915A (en) 2023-06-14 2023-06-14 Salary prediction method and device based on statistical reasoning

Country Status (1)

Country Link
CN (1) CN116611915A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117217944A (en) * 2023-09-28 2023-12-12 聘聘云(上海)智能科技有限公司 Multidimensional salary distribution information determining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination