WO2021179544A1 - Sample classification method, device, computer equipment, and storage medium - Google Patents

Sample classification method, device, computer equipment, and storage medium

Info

Publication number
WO2021179544A1
WO2021179544A1 PCT/CN2020/111949 CN2020111949W WO2021179544A1 WO 2021179544 A1 WO2021179544 A1 WO 2021179544A1 CN 2020111949 W CN2020111949 W CN 2020111949W WO 2021179544 A1 WO2021179544 A1 WO 2021179544A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
information
feature
classification model
cluster
Prior art date
Application number
PCT/CN2020/111949
Other languages
English (en)
French (fr)
Inventor
万忠伟
甘丽婷
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021179544A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a sample classification method, device, computer equipment, and storage medium.
  • the above-mentioned employees correspond to the samples, and the above-mentioned business processing information corresponds to historical data.
  • the embodiments of the present application provide a sample classification method, device, computer equipment, and storage medium, aiming to solve the problem of the inability to accurately classify samples through historical data in the prior art methods.
  • an embodiment of the present application provides a sample classification method, which includes:
  • the feature information of the newly added sample is input into the sample classification model to obtain a target category corresponding to the attribute information of the newly added sample.
  • an embodiment of the present application provides a sample classification device, which includes:
  • the sample group obtaining unit is configured to classify the sample clusters included in the history information table according to preset classification information to obtain multiple sample groups, wherein each sample group includes at least one sample;
  • the cluster feature information acquiring unit is used to acquire sample attribute information corresponding to each of the sample clusters, and quantify the sample attribute information according to a preset information quantification rule to obtain cluster feature information corresponding to each of the sample clusters;
  • the sample classification model construction unit is used to construct a sample classification model including input nodes, feature units, and output nodes according to the cluster feature information and preset feature unit configuration formulas;
  • a newly-added sample feature information acquiring unit is configured to, if the newly-added sample attribute information of the newly-added sample is received, quantify the newly-added sample attribute information according to the information quantification rule to obtain newly-added sample feature information corresponding to the newly-added sample;
  • the target category obtaining unit is configured to input the feature information of the newly added sample into the sample classification model to obtain the target category corresponding to the attribute information of the newly added sample.
  • an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and running on the processor, and when the processor executes the computer program, it implements the sample classification method described in the first aspect.
  • an embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the sample classification method described in the first aspect.
  • the embodiments of the present application provide a sample classification method, device, computer equipment, and storage medium.
  • according to the information quantification rules, the sample attribute information corresponding to each sample cluster is quantified to obtain the cluster feature information corresponding to each cluster.
  • a sample classification model is constructed according to the cluster feature information and the feature unit configuration formula, the attribute information of the new sample is quantified according to the information quantification rules to obtain the new sample feature information, and the new sample feature information is input into the sample classification model to obtain the corresponding target category.
  • sample clusters can be obtained based on samples with historical records, and a sample classification model can be further constructed to accurately classify newly-added samples without historical records, which improves the accuracy of classifying samples without historical records. Good technical results have been achieved in the actual application process.
  • FIG. 1 is a schematic flowchart of a sample classification method provided by an embodiment of the application
  • FIG. 2 is a schematic diagram of a sub-flow of a sample classification method provided by an embodiment of the application
  • FIG. 3 is a schematic diagram of another sub-process of the sample classification method provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of another sub-flow of the sample classification method provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of another process of a sample classification method provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of another sub-process of the sample classification method provided by an embodiment of the application.
  • FIG. 7 is a schematic block diagram of a sample classification device provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of a sample classification method provided by an embodiment of the present application.
  • the sample classification method is applied to a user terminal, and the user terminal is a terminal device used to perform the sample classification method to complete the classification of the sample, such as a desktop computer, a notebook computer, a tablet computer, or a mobile phone.
  • the method includes steps S110 to S150.
  • S110. Classify the sample cluster included in the history information table according to preset classification information to obtain multiple sample groups.
  • the classification information includes statistical items and level classification rules.
  • the classification information can be input by the user.
  • the user is the user of the user terminal.
  • the user can input classification information including statistical items and level classification rules according to the classification purpose, so as to classify the samples contained in the historical information table; the classification information may also be information pre-configured in the user terminal, and each sample group includes multiple samples.
  • the historical information table is a data table that records the historical processing information of each sample. All the samples contained in the historical information table constitute the sample cluster of the historical information table.
  • the historical information table can be a historical business information table of an enterprise. Each sample corresponds to an employee.
  • the historical business information table contains the business handling information handled by each employee, and the business handling information contains comprehensive information about the business handled. In the process of screening for employees who commit violations, only part of the information in the business handling information needs to be used.
  • the statistical item is the item information used to compile statistics on the part of the business handling information that needs to be used. According to the statistical items, the business handling information handled by each employee can be counted to obtain the corresponding business statistical information. The level classification rule is the rule information for classifying the violation level of the employee. According to the level classification rule and the business statistical information, the employees can be classified by violation level to obtain multiple employee categories, each of which contains the names of multiple employees.
  • step S110 includes sub-steps S111 and S112.
  • S111 Perform statistics on the historical information table according to the statistical items to obtain sample statistical information of each sample.
  • the manager in the generated business processing information is the employee.
  • the statistical items are the rate of signing contracts, the rate of real-name authentication failures, the proportion of blacklisted users, the proportion of abnormal emergency contact numbers, and the downtime rate of new users within 30 days.
  • the classification rules include the item threshold corresponding to each statistical item. Taking the business information table as an example, the statistical items in the business statistical information that exceed the corresponding item threshold are regarded as risk items, and the number of risk items in the business statistical information of each employee can be obtained according to the item thresholds.
  • the classification rules also include multiple violation levels and the number of risk items corresponding to each violation level. According to the number of risk items for each employee, the violation level corresponding to the employee can be obtained; the greater the number of an employee's risk items, the higher the risk of illegal operations by that employee. According to the violation level of each employee, the employee can be classified into the employee category corresponding to the violation level.
  • the violation levels included in the classification rules and the number of risk items corresponding to each violation level are: the first violation level, 0; the second violation level, [1, 2]; the third violation level, [3, +∞). If the number of risk items for an employee is 2, the employee's violation level is the second violation level.
  • the number of associated employees for each employee can also be obtained based on the business statistical information. If the same customer handles multiple businesses through multiple employees at the same time, those employees are associated employees of one another. The number of associated employees for each employee can thus be obtained, and an interval of the number of associated employees can be set for each violation level to classify employees into the employee category corresponding to the violation level. The greater the number of associated employees of an employee, the higher the risk of that employee colluding with other employees to commit fraud.
  • the number of associated employees may also include the number of first-level associated employees, the number of second-level associated employees, and the number of third-level associated employees.
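The risk-item counting and level assignment described above can be sketched as follows. This is only an illustrative sketch: the item names and numeric thresholds in `ITEM_THRESHOLDS` are hypothetical (the text gives no threshold values), while the mapping from risk-item counts to violation levels follows the intervals 0, [1, 2], and [3, +∞) stated above.

```python
from typing import Dict

# Hypothetical item thresholds; the source names the statistical items
# but does not give numeric threshold values.
ITEM_THRESHOLDS: Dict[str, float] = {
    "real_name_auth_failure_rate": 0.10,
    "blacklisted_user_ratio": 0.05,
    "abnormal_emergency_contact_ratio": 0.05,
}


def count_risk_items(stats: Dict[str, float],
                     thresholds: Dict[str, float] = ITEM_THRESHOLDS) -> int:
    """Count the statistical items whose value exceeds the item threshold."""
    return sum(1 for item, limit in thresholds.items()
               if stats.get(item, 0.0) > limit)


def violation_level(risk_items: int) -> int:
    """Map a risk-item count to a violation level: 0 -> 1, [1, 2] -> 2, [3, inf) -> 3."""
    if risk_items == 0:
        return 1
    if risk_items <= 2:
        return 2
    return 3
```

Grouping employees by the value returned from `violation_level` then yields the employee categories used in the later steps.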
  • the sample attribute information corresponding to each sample group is acquired, and the sample attribute information is quantified according to a preset information quantification rule to obtain the group characteristic information corresponding to each sample group.
  • the sample attribute information corresponding to each sample group can be obtained according to the sample data information table prestored in the user terminal.
  • the sample attribute information table is a data table that records the specific attributes of each sample. The group characteristic information can be used to characterize the overall characteristics of the corresponding sample cluster, and the cluster feature information of sample clusters with different violation levels is also different.
  • the sample attribute information table can be the employee information table of the company.
  • the employee information table contains the information of all employees in the company. According to the employee information table, the corresponding employee information of each employee group can be obtained.
  • Each piece of employee information is one piece of sample attribute information, and the sample attribute information of one sample group contains the corresponding multiple pieces of employee information.
  • the employee information table contains the employee's name, ID number, age, number of years in the company, educational background, and number of credit defaults. The younger the employee, the higher the likelihood of violations; the higher the number of years in the company, the higher the possibility of violation; the lower the degree of education, the higher the possibility of violation; the more credit defaults, the higher the possibility of violation.
  • because the computer cannot directly analyze text information, the text information must be converted into a corresponding vector so that it is quantified and can be analyzed and processed by the computer. That is, the group feature information serves as a quantitative representation of the overall characteristics of the employees in the same employee group, so that the computer can recognize those overall characteristics by way of vectors.
  • step S120 includes sub-steps S121 and S122.
  • the information quantification rule is the rule used to convert the attribute information of each sample into characteristic variables.
  • each item of information in the employee information can be converted into a corresponding vector value through the information quantification rule
  • each employee's information can be correspondingly converted into a multi-dimensional feature vector, that is, a feature variable.
  • the employee information table contains the employee information of all employees in the enterprise.
  • the feature variables corresponding to all employee information can be obtained through the information quantification rule; that is, the feature variable corresponding to a piece of employee information serves as a quantitative representation of the features of that employee information, so that the computer can identify the feature information contained in the employee information by way of vectors. Since a unified information quantification rule is adopted to convert all sample attribute information, the feature variables obtained after conversion all contain the same number of quantified values, and quantified values of the same type correspond to the characteristics of multiple samples in the same dimension.
  • a piece of employee information in the employee information table includes the name "XXX”, the ID number "1011XXXXXXXXXXXXX”, the age “25 years old”, the number of years of joining the company "3", the educational background “undergraduate”, and the number of credit defaults "1” .
  • the quantization value corresponding to "25 years old” is "2.5”
  • the quantization value corresponding to "undergraduate” is "4"
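A minimal sketch of such an information quantification rule is shown below. Only two mappings come from the example above (age 25 becomes 2.5, "undergraduate" becomes 4); dividing age by 10, the other entries of `EDUCATION_SCALE`, passing the years in the company and the credit-default count through unchanged, and omitting the name and ID number from the vector are all assumptions made for illustration.

```python
from typing import Dict, List, Union

# Only "undergraduate" -> 4 is taken from the text; the other levels are
# hypothetical placeholders for illustration.
EDUCATION_SCALE: Dict[str, int] = {
    "high school": 2,
    "junior college": 3,
    "undergraduate": 4,
    "master": 5,
    "doctorate": 6,
}


def quantify(record: Dict[str, Union[str, int]]) -> List[float]:
    """Convert one piece of employee information into a feature variable.

    Every record is converted with the same rule, so all feature variables
    have the same number of dimensions in the same order.
    """
    return [
        record["age"] / 10.0,                         # 25 years old -> 2.5
        float(record["years_in_company"]),            # assumed pass-through
        float(EDUCATION_SCALE[record["education"]]),  # undergraduate -> 4
        float(record["credit_defaults"]),             # assumed pass-through
    ]
```

For the employee in the example, `quantify({"age": 25, "years_in_company": 3, "education": "undergraduate", "credit_defaults": 1})` gives a four-dimensional feature variable whose first and third values match the quantized values quoted above.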
  • S122 Calculate the cluster feature information of each sample cluster according to the feature variables of all samples in each sample cluster.
  • the cluster feature information of each sample cluster is calculated according to the feature variables of all samples in each sample cluster.
  • quantified values of the same type correspond to the characteristics of multiple samples in the same dimension. Therefore, by calculating the average or median, in each dimension, of the feature variables of all samples included in the same sample group, the cluster quantified value corresponding to each dimension of the sample group can be obtained and used as the cluster feature information of that sample group.
  • each employee category contains multiple employees. Obtain the feature variables of the multiple employees belonging to the same employee category and calculate the average or median of the quantified values of all employees in that category in each dimension to get the group feature information of the employee group; based on this method, the group feature information of all the employee groups can be obtained.
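The per-dimension aggregation in step S122 can be sketched with the Python standard library. The function name `group_feature` and the `use_median` switch are hypothetical; the text only states that either the average or the median may be used.

```python
from statistics import mean, median
from typing import List, Sequence


def group_feature(vectors: Sequence[Sequence[float]],
                  use_median: bool = False) -> List[float]:
    """Aggregate the feature variables of one sample group per dimension."""
    aggregate = median if use_median else mean
    # zip(*vectors) regroups the values so that each tuple holds one dimension.
    return [aggregate(column) for column in zip(*vectors)]
```

Calling `group_feature` once per sample group produces one aggregated vector per violation level.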
  • S130 Construct a sample classification model including input nodes, characteristic units, and output nodes according to the group characteristic information and preset characteristic unit configuration formulas.
  • a sample classification model including input nodes, feature units and output nodes is constructed.
  • the cluster feature information is used to configure the input node and the output node
  • the feature unit configuration formula is used to configure the feature unit.
  • the sample classification model includes multiple input nodes, multiple output nodes, and multiple feature units.
  • the sample classification model can predict the category to which the sample belongs based on the sample attribute information of a sample. Specifically, an enterprise employee is taken as an example.
  • the feature unit can be used to reflect the relationship between the input employee information and the violation level corresponding to the employee information.
  • Each dimension of the feature vector in the cluster feature information corresponds to an input node, and each employee cluster corresponds to an output node.
  • step S130 includes sub-steps S131, S132, S133, S134, and S135.
  • the input node of the sample classification model is constructed according to the number of dimensions of feature variables in the cluster feature information. Since the dimensions of the feature variables in the obtained cluster feature information are the same, the same number of input nodes can be generated corresponding to the number of dimensions of the feature variables. Each item of information in the employee information input into the sample classification model corresponds to one input node, and the input value corresponding to the input node is the quantified value of the corresponding item of information in the employee information.
  • S132 Construct an output node of the sample classification model according to the number of sample clusters in the cluster feature information.
  • the output node of the sample classification model is constructed according to the number of sample clusters in the cluster feature information.
  • Each employee category corresponds to a violation level, and the same number of output nodes can be generated corresponding to the number of employee categories included in the category feature information, and each output node is the matching rate between the employee and the violation level.
  • the above-mentioned cluster feature information contains three employee clusters, which correspond to the first violation level, the second violation level, and the third violation level, respectively, and three output nodes are generated correspondingly.
  • the number of input nodes and the number of output nodes are input into the feature unit configuration formula to construct a fully connected hidden layer including a corresponding number of feature units according to the calculation result.
  • the fully connected hidden layer is an intermediate layer used to connect the input nodes and output nodes.
  • the fully connected hidden layer contains several feature units, and each feature unit is associated with all input nodes and all output nodes.
  • the number of feature units included in the fully connected hidden layer can be calculated according to the feature unit configuration formula. There is an association between the number of feature units and the number of input nodes and the number of output nodes.
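The text does not reproduce the feature unit configuration formula itself, so the sketch below substitutes a commonly used rule of thumb, the rounded square root of the sum of input and output nodes plus a small constant. Both the formula and the constant `a` are assumptions, not the formula of this application; the sketch only illustrates how the number of feature units can be derived from the two node counts.

```python
import math


def hidden_unit_count(n_inputs: int, n_outputs: int, a: int = 2) -> int:
    """Assumed heuristic: round(sqrt(n_inputs + n_outputs)) + a.

    This stands in for the feature unit configuration formula, which the
    text mentions but does not state.
    """
    return round(math.sqrt(n_inputs + n_outputs)) + a
```

With four input nodes (one per feature dimension) and three output nodes (one per violation level), this heuristic yields five feature units.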
  • a first formula group from the input node to the feature unit is constructed with the input node value as the input value and the feature unit value as the output value.
  • the first formula group includes formulas from all input nodes to all characteristic units.
  • the input node is the node used to input a certain employee information in the sample classification model.
  • the specific value of the input node is the input node value, that is, the quantified value obtained by quantifying a certain employee information.
  • each input node corresponds to one item of information in the employee information, all input nodes together correspond to one piece of employee information, and the characteristic unit value is the calculated value of the characteristic unit in the fully connected hidden layer.
  • the feature unit value is used as the input value and the output node value is used as the output value to construct a second formula group from the feature unit to the output node to obtain a sample classification model.
  • the second formula group includes formulas from all characteristic units to all output nodes.
  • the output node is the node used to output the matching rate between the employee and each violation level in the sample classification model.
  • the specific value of the output node is the output node value, which represents the matching rate between the employee and the violation level corresponding to that output node; the feature unit value is the calculated value of the feature unit in the fully connected hidden layer.
  • step S1310 is further included after step S130.
  • the generated sample classification model is the initial prediction model. Before use, the generated sample classification model can also be trained, that is, the parameter values of the formulas in the sample classification model are adjusted and optimized to obtain a sample classification model whose prediction accuracy meets the requirements.
  • Specifically, the data set contains the target violation level of each employee and the feature variable corresponding to the employee information of each employee.
  • the parameter adjustment rule is the rule for adjusting the parameter value in the sample classification model.
  • step S1310 includes sub-steps S1311, S1312, and S1313.
  • the data set is equally divided into a preset number of sub-data sets.
  • the preset quantity is the quantity information used to split the data set. According to the preset quantity, the employees in the data set can be evenly split into multiple corresponding sub-data sets, and each sub-data set contains the information corresponding to multiple employees.
  • for example, if the preset data set contains 2000 pieces of information corresponding to employees and the preset number is 10, the 2000 pieces are divided into 10 sub-data sets, and each sub-data set contains 200 pieces of information corresponding to employees.
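The equal split can be sketched as below. The function name `split_equally` is hypothetical, and raising an error when the data set does not divide evenly is an assumption; the text only covers the evenly divisible case.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_equally(records: Sequence[T], preset_number: int) -> List[List[T]]:
    """Split the data set into preset_number sub-data sets of equal size."""
    if preset_number <= 0 or len(records) % preset_number != 0:
        # Assumption: uneven splits are rejected rather than padded.
        raise ValueError("data set cannot be split equally")
    size = len(records) // preset_number
    return [list(records[i * size:(i + 1) * size])
            for i in range(preset_number)]
```

With 2000 records and a preset number of 10, each of the 10 returned sub-data sets holds 200 records, as in the example above.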
  • S1312. Perform multiple rounds of training on the sample classification model according to the parameter value adjustment rule and the plurality of sub-data sets, and calculate the accuracy of the sample classification model after each round of training according to the sub-data sets.
  • This training process is also the grid search method, in which one sub-data set is selected as the test data set, the remaining sub-data sets are used as training data sets, and the parameter adjustment rules are combined to perform multiple rounds of training on the sample classification model; the accuracy of the sample classification model after each round of training is calculated according to the sub-data sets. Specifically, if the total number of sub-data sets is k, then k rounds of cross-training are performed on the sample classification model. When the first round of training is performed on the sample classification model, the first sub-data set is used as the test data set, and the remaining k-1 sub-data sets are used as training data sets.
  • the parameter adjustment rules include accuracy threshold, parameter adjustment direction and parameter adjustment range.
  • the parameter adjustment direction includes positive adjustment and negative adjustment.
  • the parameter adjustment range is the specific amplitude value to be adjusted.
  • determine whether the training accuracy obtained when the current training data set is used to train the sample classification model is less than the accuracy threshold. If it is not less, the parameter values in the sample classification model are adjusted according to the positive adjustment in the parameter adjustment direction and the amplitude value in the parameter adjustment range; if it is less, the parameter values in the sample classification model are adjusted according to the reverse adjustment in the parameter adjustment direction and the amplitude value in the parameter adjustment range.
  • for example, if the amplitude value in the parameter adjustment range is 0.03 and the judgment result is that the training accuracy of the current training data set for training the sample classification model is not less than the accuracy threshold, a positive adjustment is needed: the original parameter values in the sample classification model are multiplied by 1.03 to obtain the new parameter values.
  • One training data set can adjust the parameter values in the sample classification model once. After the k-1 training data sets have been used to train the sample classification model, the sample classification model after the first round of training is obtained. The remaining test data set is then input into the sample classification model after the first round of training to calculate the corresponding accuracy, which completes one round of training of the sample classification model. The method of calculating the accuracy of the sample classification model through the test data set is the same as the method of calculating the training accuracy.
  • the parameter value of the training round with the highest accuracy is used as the parameter value of the sample classification model to obtain the sample classification model after training. That is, after the sample classification model undergoes multiple rounds of cross-training, the accuracy of each round is obtained, and the parameter value of the round with the highest accuracy is taken as the optimal parameter value.
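The rotation of sub-data sets, the parameter adjustment rule, and the final selection can be sketched as three small helpers. The names are hypothetical. The positive adjustment (multiplying by 1 + amplitude) follows the 1.03 example above; treating the reverse adjustment as multiplying by 1 - amplitude is an assumption, since the text does not spell it out.

```python
from typing import Iterator, List, Sequence, Tuple


def cross_rounds(parts: Sequence[list]) -> Iterator[Tuple[List[list], list]]:
    """Yield (training sub-data sets, test sub-data set) for each of k rounds."""
    for i, test_part in enumerate(parts):
        training_parts = [p for j, p in enumerate(parts) if j != i]
        yield training_parts, test_part


def adjust_parameters(params: Sequence[float], training_accuracy: float,
                      accuracy_threshold: float,
                      amplitude: float = 0.03) -> List[float]:
    """Apply one adjustment step under the parameter adjustment rule."""
    if training_accuracy >= accuracy_threshold:
        factor = 1.0 + amplitude   # positive adjustment, e.g. x 1.03
    else:
        factor = 1.0 - amplitude   # assumed form of the reverse adjustment
    return [p * factor for p in params]


def best_round(rounds: Sequence[Tuple[float, List[float]]]) -> List[float]:
    """Return the parameter values of the round with the highest accuracy."""
    return max(rounds, key=lambda item: item[0])[1]
```

In each round, `adjust_parameters` would be called once per training sub-data set, the test sub-data set would give that round's accuracy, and `best_round` would pick the parameter values kept for the trained model.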
  • S140 If the newly-added sample attribute information of the newly-added sample is received, quantify the newly-added sample attribute information according to the information quantification rule to obtain the newly-added sample characteristic information corresponding to the newly-added sample.
  • if the newly-added sample attribute information of a newly-added sample is received, it is quantified according to the information quantification rule to obtain the newly-added sample characteristic information corresponding to the newly-added sample.
  • each received piece of newly-added sample attribute information corresponds to one new sample, and the new sample is not included in the historical information table. Because the historical information table does not contain the information of the new sample, it provides no data basis for classifying the new sample, and the new sample cannot be classified using traditional classification methods.
  • the new employee information is the information corresponding to the new employees who join the company.
  • Each item of information in the new employee information can be quantified and represented by the corresponding vector value through the information quantification rule.
  • the specific quantization method is the same as the quantization method in the above steps
  • the feature information of the newly added sample is input into the sample classification model to obtain the target category corresponding to the attribute information of the newly added sample. Input the obtained new sample feature information into the trained sample classification model to get the corresponding target category.
  • the quantified value corresponding to each input node in the new employee feature information is input into the corresponding input node of the sample classification model, so that the input node value of each input node is the quantified value corresponding to that input node.
  • the output node value of each output node can then be obtained through the calculation of the sample classification model.
  • the output node value is the matching rate between the newly added employee and the violation level corresponding to the output node, and the violation level with the highest matching rate is taken as the predicted violation level of the newly added employee. If the predicted violation level of the new employee is too high, corresponding management measures can be taken for the employee, such as restricting the scope of business that the employee can handle.
  • In the sample classification method provided by the embodiments of this application, the sample set in the historical information table is classified according to the classification information to obtain multiple sample clusters; the sample attribute information of each sample cluster is quantified according to the information quantification rule to obtain that cluster's feature information; a sample classification model is constructed from the cluster feature information and the feature unit configuration formula; new-sample attribute information is quantified by the same rule to obtain new-sample feature information; and that feature information is input into the sample classification model to obtain the corresponding target category.
  • With this method, cluster information is obtained from samples that have historical records, and a sample classification model is built from it to classify new samples that have no historical records, which improves the accuracy of classifying such samples. Good technical results have been achieved in practical application.
  • An embodiment of this application also provides a sample classification device for executing any embodiment of the foregoing sample classification method.
  • FIG. 7 is a schematic block diagram of the sample classification device provided in an embodiment of this application.
  • The sample classification device can be configured in the user terminal.
  • The sample classification device 100 includes a sample cluster acquisition unit 110, a cluster feature information acquisition unit 120, a sample classification model construction unit 130, a new-sample feature information acquisition unit 140, and a target category acquisition unit 150.
  • The sample cluster acquisition unit 110 is configured to classify the sample set contained in the historical information table according to preset classification information to obtain multiple sample clusters, where each sample cluster includes at least one sample.
  • The sample cluster acquisition unit 110 includes two sub-units: a sample statistical information acquisition unit and a grade classification unit.
  • The sample statistical information acquisition unit aggregates the historical information table according to the statistical items to obtain the sample statistical information of each sample.
  • The grade classification unit grades the samples according to the grade classification rule and the sample statistical information to obtain multiple sample clusters.
  • The cluster feature information acquisition unit 120 is configured to acquire the sample attribute information corresponding to each sample cluster and to quantify it according to a preset information quantification rule to obtain the cluster feature information corresponding to each sample cluster.
  • The cluster feature information acquisition unit 120 includes two sub-units: a sample attribute information conversion unit and a feature variable average calculation unit.
  • The sample attribute information conversion unit converts each piece of sample attribute information into a corresponding feature variable according to the information quantification rule; the feature variable average calculation unit computes the cluster feature information of each sample cluster from the feature variables of all samples in that cluster.
  • The sample classification model construction unit 130 is configured to construct, according to the cluster feature information and a preset feature unit configuration formula, a sample classification model containing input nodes, feature units, and output nodes.
  • The sample classification model construction unit 130 includes five sub-units: an input node construction unit, an output node construction unit, a fully connected hidden layer construction unit, a first formula group construction unit, and a second formula group construction unit.
  • The input node construction unit constructs the input nodes of the sample classification model according to the number of dimensions of the feature variables in the cluster feature information.
  • The output node construction unit constructs the output nodes of the sample classification model according to the number of sample clusters in the cluster feature information.
  • The fully connected hidden layer construction unit inputs the number of input nodes and the number of output nodes into the feature unit configuration formula and constructs a fully connected hidden layer containing the number of feature units given by the result.
  • The first formula group construction unit constructs, from the feature units in the fully connected hidden layer and the input nodes, a first formula group from input nodes to feature units, with input node values as inputs and feature unit values as outputs.
  • The second formula group construction unit constructs, from the feature units in the fully connected hidden layer and the output nodes, a second formula group from feature units to output nodes, with feature unit values as inputs and output node values as outputs, to obtain the sample classification model.
  • The sample classification device 100 further includes a sample classification model training unit.
  • The sample classification model training unit trains the sample classification model according to the input dataset and the parameter adjustment rule to obtain the trained sample classification model.
  • The sample classification model training unit includes three sub-units: a dataset splitting unit, a training accuracy acquisition unit, and a parameter value determination unit.
  • The dataset splitting unit splits the dataset evenly into a preset number of sub-datasets.
  • The training accuracy acquisition unit trains the sample classification model for multiple rounds according to the parameter adjustment rule and the sub-datasets, and computes the accuracy of the model after each round from the sub-datasets.
  • The parameter value determination unit takes the parameter values of the round with the highest accuracy as the parameter values of the sample classification model to obtain the trained sample classification model.
  • The new-sample feature information acquisition unit 140 is configured to, if new-sample attribute information of a new sample is received, quantify it according to the information quantification rule to obtain the new-sample feature information corresponding to the new sample.
  • The target category acquisition unit 150 is configured to input the new-sample feature information into the sample classification model to obtain the target category corresponding to the new-sample attribute information.
  • The sample classification device provided in the embodiments of this application applies the sample classification method above: the sample set in the historical information table is classified according to the classification information to obtain multiple sample clusters; the sample attribute information of each sample cluster is quantified according to the information quantification rule to obtain that cluster's feature information; a sample classification model is constructed from the cluster feature information and the feature unit configuration formula; new-sample attribute information is quantified by the same rule to obtain new-sample feature information; and that feature information is input into the sample classification model to obtain the corresponding target category.
  • With this method, sample clusters are obtained from samples that have historical records, and a sample classification model is built from them to classify new samples that have no historical records, which improves the accuracy of classifying such samples. Good technical results have been achieved in practical application.
  • The sample classification device described above can be implemented as a computer program, which can run on a computer device as shown in FIG. 8.
  • FIG. 8 is a schematic block diagram of a computer device according to an embodiment of this application.
  • The computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501; the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When executed, the computer program 5032 causes the processor 502 to execute the sample classification method.
  • The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503.
  • The network interface 505 is used for network communication, such as transmitting data.
  • The structure shown in FIG. 8 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device 500 to which the solution is applied.
  • A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • The processor 502 runs the computer program 5032 stored in the memory to implement the corresponding functions of the sample classification method.
  • The embodiment of the computer device shown in FIG. 8 does not limit the specific configuration of the computer device.
  • In other embodiments, the computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • For example, in some embodiments the computer device may include only a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
  • The processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • The general-purpose processor may be a microprocessor or any conventional processor.
  • Another embodiment of this application provides a computer-readable storage medium, which may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the sample classification method.
  • The disclosed equipment, device, and method may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division into units is only a division by logical function. An actual implementation may divide the units differently or merge units with the same function into one; for example, multiple units or components can be combined or integrated into another system, and some features can be omitted or not executed.
  • The mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
  • Units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of this application.
  • The functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The computer software product is stored in a computer-readable storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application.
  • The computer-readable storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Molecular Biology (AREA)
  • Development Economics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

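A sample classification method and device, a computer device, and a storage medium. The method includes: classifying the sample set contained in a historical information table according to preset classification information to obtain multiple sample clusters (S110); acquiring the sample attribute information corresponding to each sample cluster, and quantifying the sample attribute information according to a preset information quantification rule to obtain the cluster feature information corresponding to each sample cluster (S120); constructing, according to the cluster feature information and a preset feature unit configuration formula, a sample classification model containing input nodes, feature units, and output nodes (S130); if new-sample attribute information of a new sample is received, quantifying the new-sample attribute information according to the information quantification rule to obtain new-sample feature information corresponding to the new sample (S140); and inputting the new-sample feature information into the sample classification model to obtain the target category corresponding to the new-sample attribute information (S150).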

Description

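Sample classification method, device, computer device, and storage medium
This application claims priority to Chinese patent application No. 202010171236.7, filed with the China Patent Office on March 12, 2020 and entitled "Sample classification method, device, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a sample classification method, device, computer device, and storage medium.
Background
Enterprises usually analyze or classify samples based on the samples' historical data. For example, an enterprise usually handles business for customers through its employees, and the handling process is recorded as business handling information. Some employees, however, commit violations such as forging business handling information; from the business handling information, the violation level to which an employee belongs can be obtained and used to classify that employee. In some cases, though, a sample cannot be classified accurately from historical data. For example, a newly hired employee has handled no business, or only a little, so whether that employee has committed violations cannot be judged accurately from the employee's business handling information. Here the employee corresponds to a sample and the business handling information corresponds to historical data. The inventors found that when the historical data contains no data for a sample, the sample cannot be classified from historical data at all, and when the historical data for a sample is insufficient, the sample cannot be classified precisely. Prior-art methods therefore cannot classify samples accurately from historical data.
Summary
The embodiments of this application provide a sample classification method, device, computer device, and storage medium, intended to solve the problem that prior-art methods cannot classify samples accurately from historical data.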
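In a first aspect, an embodiment of this application provides a sample classification method, which includes:
classifying the sample set contained in a historical information table according to preset classification information to obtain multiple sample clusters, where each sample cluster includes at least one sample;
acquiring the sample attribute information corresponding to each sample cluster, and quantifying the sample attribute information according to a preset information quantification rule to obtain the cluster feature information corresponding to each sample cluster;
constructing, according to the cluster feature information and a preset feature unit configuration formula, a sample classification model containing input nodes, feature units, and output nodes;
if new-sample attribute information of a new sample is received, quantifying the new-sample attribute information according to the information quantification rule to obtain new-sample feature information corresponding to the new sample;
inputting the new-sample feature information into the sample classification model to obtain the target category corresponding to the new-sample attribute information.
In a second aspect, an embodiment of this application provides a sample classification device, which includes:
a sample cluster acquisition unit, configured to classify the sample set contained in a historical information table according to preset classification information to obtain multiple sample clusters, where each sample cluster includes at least one sample;
a cluster feature information acquisition unit, configured to acquire the sample attribute information corresponding to each sample cluster and to quantify the sample attribute information according to a preset information quantification rule to obtain the cluster feature information corresponding to each sample cluster;
a sample classification model construction unit, configured to construct, according to the cluster feature information and a preset feature unit configuration formula, a sample classification model containing input nodes, feature units, and output nodes;
a new-sample feature information acquisition unit, configured to, if new-sample attribute information of a new sample is received, quantify the new-sample attribute information according to the information quantification rule to obtain new-sample feature information corresponding to the new sample;
a target category acquisition unit, configured to input the new-sample feature information into the sample classification model to obtain the target category corresponding to the new-sample attribute information.
In a third aspect, an embodiment of this application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the sample classification method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to execute the sample classification method of the first aspect.
The embodiments of this application provide a sample classification method, device, computer device, and storage medium. The sample set in the historical information table is classified according to the classification information to obtain multiple sample clusters; the sample attribute information of each sample cluster is quantified according to the information quantification rule to obtain that cluster's feature information; a sample classification model is constructed from the cluster feature information and the feature unit configuration formula; new-sample attribute information is quantified by the same rule to obtain new-sample feature information; and that feature information is input into the sample classification model to obtain the corresponding target category. With this method, sample clusters are obtained from samples that have historical records, and a sample classification model is built from them to classify new samples that have no historical records, which improves the accuracy of classifying such samples. Good technical results have been achieved in practical application.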
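Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the sample classification method provided by an embodiment of this application;
FIG. 2 is a schematic sub-flowchart of the sample classification method provided by an embodiment of this application;
FIG. 3 is another schematic sub-flowchart of the sample classification method provided by an embodiment of this application;
FIG. 4 is another schematic sub-flowchart of the sample classification method provided by an embodiment of this application;
FIG. 5 is another schematic flowchart of the sample classification method provided by an embodiment of this application;
FIG. 6 is another schematic sub-flowchart of the sample classification method provided by an embodiment of this application;
FIG. 7 is a schematic block diagram of the sample classification device provided by an embodiment of this application;
FIG. 8 is a schematic block diagram of the computer device provided by an embodiment of this application.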
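Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
It should be understood that, when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are only for describing particular embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination, and all possible combinations, of one or more of the associated listed items, and includes those combinations.
Refer to FIG. 1, which is a schematic flowchart of the sample classification method provided by an embodiment of this application. The method is applied in a user terminal, that is, a terminal device that executes the sample classification method to classify samples, such as a desktop computer, a laptop, a tablet, or a mobile phone.
As shown in FIG. 1, the method includes steps S110 to S150.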
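S110. Classify the sample set contained in the historical information table according to preset classification information to obtain multiple sample clusters.
The classification information contains statistical items and a grade classification rule. It may be entered by the user, that is, the person operating the user terminal, who supplies classification information suited to the classification purpose so that the samples in the historical information table can be classified; it may also be preconfigured in the user terminal. Each sample cluster contains multiple samples. The historical information table is a data table recording the historical handling information of each sample, and all samples contained in the table make up its sample set. The historical information table may be an enterprise's historical business information table, in which case each sample corresponds to one employee. The historical business information table contains the business handling information of the business handled by each employee, and that information covers the handled business comprehensively. Screening for violating employees requires only part of the business handling information. The statistical items specify which parts are to be aggregated: aggregating each employee's business handling information according to the statistical items yields the corresponding business statistical information. The grade classification rule specifies how employees are classified by violation level: applying it to the business statistical information classifies the employees by violation level into multiple employee clusters, each of which contains the names of multiple employees.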
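In one embodiment, as shown in FIG. 2, step S110 includes sub-steps S111 and S112.
S111. Aggregate the historical information table according to the statistical items to obtain the sample statistical information of each sample.
Specifically, taking the business information table as an example, when an employee handles a piece of business, the handler recorded in the resulting business handling information is that employee. From the handler recorded in each piece of business handling information in the table, the business handling information handled by each employee can be obtained; aggregating it according to the statistical items yields each employee's business statistical information, which is the corresponding sample statistical information.
For example, if the business handled is contract mobile phone installment business, the statistical items are the contract signing rate, the real-name authentication failure rate, the proportion of blacklisted users, the proportion of abnormal emergency-contact contract numbers, and the rate of service suspension within 30 days for new users.
S112. Grade the samples according to the grade classification rule and the sample statistical information to obtain multiple sample clusters.
The grade classification rule contains an item threshold for each statistical item. Taking the business information table as an example, a statistical item whose value in the business statistical information exceeds the corresponding item threshold is treated as a risk item, so the item thresholds give the number of risk items in each employee's business statistical information. The grade classification rule also contains multiple violation levels and the number of risk items corresponding to each level, so each employee's number of risk items determines that employee's violation level. The more risk items an employee has, the higher the risk that the employee commits violations. Each employee is then assigned to the employee cluster corresponding to the employee's violation level.
For example, the violation levels in the grade classification rule and the number of risk items for each are: first violation level, 0; second violation level, [1, 2]; third violation level, [3, +∞). An employee with 2 risk items is at the second violation level.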
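The grading rule in S112 can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the item names and threshold values below are hypothetical, since the application gives no concrete thresholds; only the mapping from risk-item count to violation level (0, [1, 2], [3, +∞)) comes from the example above.

```python
# Hypothetical item thresholds; the application does not give concrete values.
ITEM_THRESHOLDS = {
    "real_name_auth_failure_rate": 0.10,
    "blacklisted_user_share": 0.05,
    "suspension_within_30_days_rate": 0.20,
}


def count_risk_items(statistics, thresholds=ITEM_THRESHOLDS):
    """Count the statistical items whose value exceeds its item threshold."""
    return sum(1 for item, value in statistics.items() if value > thresholds[item])


def violation_level(risk_item_count):
    """Map a risk-item count to a level: 0 -> 1, [1, 2] -> 2, [3, +inf) -> 3."""
    if risk_item_count < 0:
        raise ValueError("risk_item_count must be non-negative")
    if risk_item_count == 0:
        return 1
    if risk_item_count <= 2:
        return 2
    return 3


# Hypothetical business statistical information for one employee.
employee_stats = {
    "real_name_auth_failure_rate": 0.15,
    "blacklisted_user_share": 0.08,
    "suspension_within_30_days_rate": 0.10,
}
level = violation_level(count_risk_items(employee_stats))
```

With these assumed numbers, two items exceed their thresholds, so the employee lands at the second violation level, matching the worked example in the text.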
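In addition, the number of associated employees of each employee can be obtained from the business statistical information. If the same customer handles multiple pieces of business through multiple employees at the same time, those employees are associated with one another. The number of associated employees of each employee can be obtained, and an interval of associated-employee counts can be set for each violation level, so that employees are assigned to the employee cluster of the corresponding violation level. The more associated employees an employee has, the higher the risk that the employee colludes with other employees to commit fraud. The number of associated employees may further be divided into first-level, second-level, and third-level associated-employee counts.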
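One possible way to count associated employees is sketched below. It is an assumption-laden illustration: the `(customer_id, employee_id)` record layout is hypothetical, the "at the same time" condition is ignored for brevity, and only direct (first-level) associations are counted.

```python
from collections import defaultdict


def associated_employee_counts(records):
    """Count, per employee, the other employees who served a shared customer.

    `records` is an iterable of (customer_id, employee_id) pairs, a
    hypothetical layout for the business handling information.
    """
    employees_by_customer = defaultdict(set)
    for customer_id, employee_id in records:
        employees_by_customer[customer_id].add(employee_id)

    associates = defaultdict(set)
    for employees in employees_by_customer.values():
        for employee_id in employees:
            # Everyone else who served this customer is a first-level associate.
            associates[employee_id] |= employees - {employee_id}
    return {employee_id: len(others) for employee_id, others in associates.items()}


records = [("c1", "A"), ("c1", "B"), ("c2", "B"), ("c2", "C"), ("c3", "D")]
counts = associated_employee_counts(records)
```

The resulting counts could then be compared against per-level intervals in the same way as the risk-item counts.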
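S120. Acquire the sample attribute information corresponding to each sample cluster, and quantify it according to a preset information quantification rule to obtain the cluster feature information corresponding to each sample cluster.
Specifically, the sample attribute information corresponding to each sample cluster can be obtained from the sample attribute information table prestored in the user terminal. The sample attribute information table is a data table recording the specific attributes of each sample. Cluster feature information characterizes a sample cluster as a whole, and sample clusters of different violation levels have different cluster feature information. The sample attribute information table may be an enterprise's employee information table, which contains the information of all employees of the enterprise; the employee information corresponding to each employee cluster can be obtained from it. Each employee's information constitutes the attributes of one sample, so the sample attribute information of a sample cluster comprises multiple pieces of employee information. The employee information table contains each employee's name, ID card number, age, years of service, education level, number of credit defaults, and similar information. The younger the employee, the higher the likelihood of violation; the longer the years of service, the higher the likelihood of violation; the lower the education level, the higher the likelihood of violation; and the more credit defaults, the higher the likelihood of violation. After the employee information contained in each employee cluster has been quantified according to the information quantification rule, the cluster feature information of each employee cluster is obtained; it gives a quantified representation of the overall characteristics of the employees in the same cluster. Because a computer cannot analyze textual information directly, the text must be converted into corresponding vectors. The cluster feature information thus serves as a quantified representation of the overall characteristics of the employees in a cluster, which the computer can recognize as vectors.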
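In one embodiment, as shown in FIG. 3, step S120 includes sub-steps S121 and S122.
S121. Convert each piece of sample attribute information into a corresponding feature variable according to the information quantification rule.
Specifically, the information quantification rule is the rule used to convert each piece of sample attribute information into a feature variable. Taking the employee information table as an example, each item in a piece of employee information can be converted by the rule into a corresponding vector value, so each piece of employee information is converted into one multi-dimensional feature vector, that is, a feature variable. The employee information table contains the employee information of all employees in the enterprise, and the information quantification rule yields the feature variable for every piece of employee information. The feature variable serves as the quantified representation of the features in that employee information, which the computer can recognize as a vector. Because a single information quantification rule is applied to all sample attribute information, every resulting feature variable contains the same number of quantized values, and quantized values of the same type describe the samples along the same dimension.
For example, one piece of employee information in the employee information table contains: name "XXX", ID card number "1011XXXXXXXXXXXXXX", age "25 years old", years of service "3", education level "bachelor's degree", and number of credit defaults "1". In the information quantification rule, "25 years old" corresponds to the quantized value "2.5" and "bachelor's degree" to the quantized value "4", so the resulting feature variable is F = {2.5, 3, 4, 1}.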
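A minimal Python sketch of the quantification in S121 follows. The application only states that "25 years old" maps to 2.5 and "bachelor's degree" to 4; dividing age by 10 and the other entries of the education scale are assumptions made here for illustration, as are the field names.

```python
# Only "bachelor" -> 4 comes from the text; the other entries are hypothetical.
EDUCATION_SCALE = {"high_school": 2, "junior_college": 3, "bachelor": 4, "master": 5}


def quantize_employee(info):
    """Convert one employee record into a 4-dimensional feature variable."""
    return [
        info["age"] / 10,                           # assumed rule: 25 years -> 2.5
        float(info["years_of_service"]),            # used as-is
        float(EDUCATION_SCALE[info["education"]]),  # ordinal education code
        float(info["credit_defaults"]),             # used as-is
    ]


employee = {
    "name": "XXX",
    "age": 25,
    "years_of_service": 3,
    "education": "bachelor",
    "credit_defaults": 1,
}
feature_variable = quantize_employee(employee)
```

Name and ID card number are deliberately left out of the vector, consistent with the four-dimensional F = {2.5, 3, 4, 1} in the example.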
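S122. Compute the cluster feature information of each sample cluster from the feature variables of all samples in that cluster.
Quantized values of the same type describe the samples along the same dimension. The cluster quantized value of a sample cluster in each dimension can therefore be obtained by computing the mean or the median, dimension by dimension, of the feature variables of all samples in the cluster, and these values form the cluster feature information of that sample cluster. Taking the employee information table as an example, each employee cluster contains multiple employees; the feature variables of the employees belonging to the same cluster are gathered, and the mean or median of their quantized values in each dimension gives the cluster feature information of that employee cluster. The cluster feature information of all employee clusters is obtained in the same way.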
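The per-dimension mean or median described in S122 can be sketched as follows. The cluster names and feature values are made up for illustration; the only technique taken from the text is reducing each dimension with the mean or the median.

```python
from statistics import mean, median


def cluster_feature_info(feature_variables_by_cluster, reducer=mean):
    """Reduce each cluster's feature variables dimension by dimension."""
    return {
        cluster: [reducer(column) for column in zip(*vectors)]
        for cluster, vectors in feature_variables_by_cluster.items()
    }


# Illustrative feature variables (age/10, years of service, education, defaults).
clusters = {
    "level_1": [[3.5, 6.0, 5.0, 0.0], [4.5, 8.0, 4.0, 0.0]],
    "level_2": [[2.5, 3.0, 4.0, 1.0], [3.0, 2.0, 3.0, 2.0], [2.0, 1.0, 4.0, 3.0]],
}
by_mean = cluster_feature_info(clusters)
by_median = cluster_feature_info(clusters, reducer=median)
```

Passing the reducer as a parameter keeps the two options named in the text interchangeable; the median is less sensitive to a single outlying employee.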
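S130. Construct, according to the cluster feature information and a preset feature unit configuration formula, a sample classification model containing input nodes, feature units, and output nodes.
The cluster feature information is used to configure the input nodes and the output nodes, and the feature unit configuration formula is used to configure the feature units. The sample classification model contains multiple input nodes, multiple output nodes, and multiple feature units, and it predicts the category of a sample from that sample's attribute information. Specifically, taking enterprise employees as an example, the feature units reflect the relationship between the input employee information and the violation level corresponding to it; each dimension of the feature vector in the cluster feature information corresponds to one input node, and each employee cluster corresponds to one output node.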
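In one embodiment, as shown in FIG. 4, step S130 includes sub-steps S131, S132, S133, S134, and S135.
S131. Construct the input nodes of the sample classification model according to the number of dimensions of the feature variables in the cluster feature information.
Because the feature variables in the cluster feature information all have the same number of dimensions, the same number of input nodes is generated. Each item of the employee information fed into the sample classification model corresponds to one input node, and the input value of that node is the quantized value of the corresponding item.
For example, if the feature variables in the cluster feature information have 4 dimensions, 4 input nodes are generated.
S132. Construct the output nodes of the sample classification model according to the number of sample clusters in the cluster feature information.
Each employee cluster corresponds to one violation level, so the same number of output nodes as employee clusters in the cluster feature information is generated. The value of each output node is the matching rate between the employee and the corresponding violation level.
For example, if the cluster feature information contains 3 employee clusters, corresponding to the first, second, and third violation levels, 3 output nodes are generated.
S133. Input the number of input nodes and the number of output nodes into the feature unit configuration formula, and construct a fully connected hidden layer containing the number of feature units given by the result.
The fully connected hidden layer is the intermediate layer that links the input nodes to the output nodes. It contains a number of feature units, each of which is connected to all input nodes and all output nodes. The number of feature units is computed with the feature unit configuration formula and depends on the numbers of input and output nodes. Specifically, the formula may be $S_0 = S_1 \times S_2 / 2$ or $S_0 = 2 \times (S_1 \times S_2)^{1/2}$, where $S_0$ is the number of feature units configured in the fully connected hidden layer, $S_1$ is the number of input nodes, and $S_2$ is the number of output nodes.
For example, with 4 input nodes and 3 output nodes, computing $S_0 = 2 \times (S_1 \times S_2)^{1/2}$ gives approximately 6.93, which rounds to 7 feature units, so a fully connected hidden layer containing 7 feature units is constructed.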
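The two feature unit configuration formulas can be checked with a short Python sketch. The function name and the `formula` switch are illustrative choices; rounding half up is used because the text describes ordinary rounding, whereas Python's built-in `round` rounds halves to the nearest even integer.

```python
import math


def feature_unit_count(n_inputs, n_outputs, formula="sqrt"):
    """Number of feature units S0 from S1 input nodes and S2 output nodes."""
    if formula == "half_product":
        value = n_inputs * n_outputs / 2            # S0 = S1 * S2 / 2
    elif formula == "sqrt":
        value = 2 * math.sqrt(n_inputs * n_outputs)  # S0 = 2 * (S1 * S2)^(1/2)
    else:
        raise ValueError(f"unknown formula: {formula}")
    return math.floor(value + 0.5)                   # round half up


units = feature_unit_count(4, 3)                       # 2 * sqrt(12) is about 6.93
units_alt = feature_unit_count(4, 3, "half_product")   # 4 * 3 / 2 = 6
```

For 4 input nodes and 3 output nodes the square-root formula yields 7 feature units, as in the example, while the half-product formula yields 6.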
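S134. From the feature units in the fully connected hidden layer and the input nodes, construct a first formula group from input nodes to feature units, with input node values as inputs and feature unit values as outputs.
The first formula group contains the formulas from every input node to every feature unit. An input node is the node of the sample classification model through which one item of employee information is entered; its concrete value is the input node value, that is, the quantized value obtained by quantifying that item. Each input node corresponds to one item of the employee information, so all input nodes together correspond to one piece of employee information. A feature unit value is the computed value of a feature unit in the fully connected hidden layer.
For example, if the input node value of an input node is $x_1$ and the feature unit value of a feature unit is $y_1$, the formula from that input node to that feature unit is $y_1 = a \times x_1 + b$, where $a$ and $b$ are the parameters of the formula and their values are randomly generated numbers.
S135. From the feature units in the fully connected hidden layer and the output nodes, construct a second formula group from feature units to output nodes, with feature unit values as inputs and output node values as outputs, to obtain the sample classification model.
The second formula group contains the formulas from every feature unit to every output node. An output node is the node of the sample classification model that outputs the matching rate between an employee and one violation level; its concrete value is the output node value, which represents the matching rate between the employee and the violation level corresponding to that output node. A feature unit value is the computed value of a feature unit in the fully connected hidden layer.
For example, if the feature unit value of a feature unit is $y_1$ and the output node value of an output node is $z_1$, the formula from that feature unit to that output node is $z_1 = c \times y_1 + d$, where $c$ and $d$ are the parameters of the formula and their values are randomly generated numbers.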
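The two formula groups can be sketched as nested lists of per-connection parameter pairs. The application gives only the per-connection formulas $y_1 = a \times x_1 + b$ and $z_1 = c \times y_1 + d$ with random initial parameters; how a node combines the values arriving over its connections is not stated, so summing them is an assumption of this sketch, as are all function names.

```python
import random


def build_formula_group(n_from, n_to, rng):
    """One randomly initialised (scale, offset) pair per connection."""
    return [[(rng.random(), rng.random()) for _ in range(n_from)] for _ in range(n_to)]


def apply_formula_group(group, values):
    # Assumption: a target node sums scale * value + offset over all incoming
    # connections; the application does not specify the combination rule.
    return [
        sum(scale * value + offset for (scale, offset), value in zip(row, values))
        for row in group
    ]


def build_model(n_inputs, n_units, n_outputs, seed=0):
    """Input nodes -> feature units (first group) -> output nodes (second group)."""
    rng = random.Random(seed)
    return {
        "first": build_formula_group(n_inputs, n_units, rng),
        "second": build_formula_group(n_units, n_outputs, rng),
    }


def forward(model, feature_variable):
    """Compute the output node values for one quantized sample."""
    unit_values = apply_formula_group(model["first"], feature_variable)
    return apply_formula_group(model["second"], unit_values)


model = build_model(n_inputs=4, n_units=7, n_outputs=3)
output_values = forward(model, [2.5, 3.0, 4.0, 1.0])
```

The sketch keeps the 4-7-3 shape from the running example; before training, the output values are meaningless because the parameters are random.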
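In one embodiment, as shown in FIG. 5, step S1310 is further included after step S130.
S1310. Train the sample classification model according to an input dataset and a parameter adjustment rule to obtain the trained sample classification model.
The generated sample classification model is an initial prediction model. Before use, it can be trained, that is, the parameter values in its formulas are adjusted and optimized so that its prediction accuracy meets the requirements of use. Specifically, the dataset contains the employees' target violation levels and the feature variable corresponding to each employee's information. The parameter adjustment rule is the rule for adjusting the parameter values in the sample classification model.
In one embodiment, as shown in FIG. 6, step S1310 includes sub-steps S1311, S1312, and S1313.
S1311. Split the dataset evenly into a preset number of sub-datasets.
The preset number specifies how many parts the dataset is split into. The employees in the dataset are split evenly across that many sub-datasets, each of which contains the information of multiple employees.
For example, if the preset dataset contains the information of 2000 employees and the preset number is 10, the information of the 2000 employees is split evenly into 10 sub-datasets, each containing the information of 200 employees.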
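The even split in S1311 is straightforward to sketch. The helper below is illustrative; rejecting a dataset that does not divide evenly is a design choice made here, since the text does not say how a remainder would be handled.

```python
def split_evenly(records, n_parts):
    """Split records into n_parts consecutive sub-datasets of equal size."""
    size, remainder = divmod(len(records), n_parts)
    if remainder:
        raise ValueError("dataset size is not divisible by the preset number")
    return [records[i * size:(i + 1) * size] for i in range(n_parts)]


dataset = list(range(2000))          # stand-in for 2000 employee records
sub_datasets = split_evenly(dataset, 10)
```

With 2000 records and a preset number of 10, each sub-dataset holds 200 records, as in the example.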
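S1312. Train the sample classification model for multiple rounds according to the parameter adjustment rule and the sub-datasets, and compute the accuracy of the model after each round from the sub-datasets.
This training process, which the application calls a grid search, proceeds as k-fold cross-training: in turn, one sub-dataset is selected as the test dataset and the remaining sub-datasets as training datasets, the model is trained for multiple rounds in combination with the parameter adjustment rule, and the accuracy after each round is computed from the sub-datasets. Specifically, if the total number of sub-datasets is k, the model undergoes k rounds of cross-training. In the first round, the first sub-dataset serves as the test dataset and the remaining k-1 sub-datasets as training datasets. The feature vector of each employee in the first training dataset is input into the model to obtain the matching rates between that employee and the violation levels. If the violation level with the highest matching rate for an employee is the same as the employee's target violation level, the employee is counted as a positive-sample employee. The proportion of positive-sample employees in the training dataset gives its training accuracy $Z = S / V$, where $S$ is the number of positive-sample employees in the training dataset and $V$ is the number of employees it contains. The parameter adjustment rule includes an accuracy threshold, an adjustment direction, and an adjustment amplitude. The adjustment direction is either positive or negative, and the adjustment amplitude is the specific magnitude of the adjustment. The training accuracy of the current training dataset is compared with the accuracy threshold: if it is not less than the threshold, the parameter values in the model are adjusted in the positive direction by the amplitude; if it is less than the threshold, they are adjusted in the negative direction by the amplitude.
For example, if the adjustment amplitude is 0.03 and the training accuracy of the current training dataset is not less than the accuracy threshold, a positive adjustment is made: each parameter value in the model is multiplied by 1.03 to obtain the new parameter value.
Each training dataset adjusts the parameter values in the model once. After the model has been trained with the k-1 training datasets, the model after the first round of training is obtained. Inputting the remaining test dataset into that model gives the corresponding accuracy, which completes one round of training. The accuracy on the test dataset is computed in the same way as the training accuracy.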
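The accuracy formula $Z = S / V$ and the threshold-driven adjustment can be sketched as below. The positive adjustment (multiply by 1.03 for an amplitude of 0.03) follows the example in the text; treating the negative adjustment as multiplication by 1 - amplitude is an assumption, since the application does not spell it out.

```python
def training_accuracy(predicted_levels, target_levels):
    """Z = S / V: share of samples whose top-matching level equals the target."""
    positives = sum(p == t for p, t in zip(predicted_levels, target_levels))
    return positives / len(target_levels)


def adjust_parameters(parameters, accuracy, threshold, amplitude=0.03):
    """Scale every parameter up or down depending on the accuracy threshold."""
    if accuracy >= threshold:
        factor = 1 + amplitude   # positive adjustment, e.g. multiply by 1.03
    else:
        factor = 1 - amplitude   # assumed form of the negative adjustment
    return [value * factor for value in parameters]


accuracy = training_accuracy([1, 2, 3, 2], [1, 2, 2, 2])
raised = adjust_parameters([1.0, 2.0], accuracy, threshold=0.7)
lowered = adjust_parameters([1.0, 2.0], accuracy, threshold=0.9)
```

In this toy run three of four predictions match, so the accuracy is 0.75: it clears a threshold of 0.7 and triggers the positive adjustment, but falls below 0.9 and triggers the negative one.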
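S1313. Take the parameter values of the round with the highest accuracy as the parameter values of the sample classification model to obtain the trained sample classification model.
After the sample classification model has gone through multiple rounds of cross-training, the accuracy of each round is available. The parameter values from the round with the highest accuracy are taken as the optimal parameter values of the sample classification model, which gives the trained sample classification model.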
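The round structure of S1312 and the selection in S1313 can be sketched as a generic loop. `train_round` and `evaluate` are hypothetical callbacks standing in for the training and accuracy computations, and starting every round from the same initial parameters is an assumption; the application does not say whether rounds are independent.

```python
def cross_train(sub_datasets, train_round, evaluate, initial_parameters):
    """Run one round per sub-dataset and keep the best round's parameters."""
    best_accuracy, best_parameters = -1.0, None
    for index, test_set in enumerate(sub_datasets):
        # The current sub-dataset is held out; the other k-1 are used to train.
        training_sets = sub_datasets[:index] + sub_datasets[index + 1:]
        parameters = train_round(initial_parameters, training_sets)
        accuracy = evaluate(parameters, test_set)
        if accuracy > best_accuracy:
            best_accuracy, best_parameters = accuracy, parameters
    return best_parameters, best_accuracy


# Toy stand-ins so the loop can run: "training" sums the training values and
# "evaluation" scales that sum into a pseudo-accuracy.
folds = [[1], [2], [3]]
toy_train = lambda params, training_sets: sum(x for part in training_sets for x in part)
toy_evaluate = lambda params, test_set: params / 10
best_parameters, best_accuracy = cross_train(folds, toy_train, toy_evaluate, None)
```

In a real setting `train_round` would apply the parameter adjustment once per training sub-dataset and `evaluate` would compute $Z = S / V$ on the held-out sub-dataset.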
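S140. If new-sample attribute information of a new sample is received, quantify the new-sample attribute information according to the information quantification rule to obtain new-sample feature information corresponding to the new sample.
Each piece of new-sample feature information received corresponds to one new sample. The new sample is not included in the historical information table, that is, the table contains no information corresponding to it. The table therefore provides no data basis for classifying the new sample, and traditional classification methods cannot classify it. Taking enterprise employees as an example, the new employee information is the information corresponding to an employee who has newly joined the enterprise. Each item in the new employee information can be converted by the information quantification rule into a corresponding vector value, which gives the new employee feature information expressed as one multi-dimensional feature vector. The specific quantization method is the same as in the steps above.
S150. Input the new-sample feature information into the sample classification model to obtain the target category corresponding to the new-sample attribute information.
The new-sample feature information is input into the trained sample classification model to obtain the corresponding target category. Specifically, taking enterprise employees as an example, the quantized value in the new employee feature information that corresponds to each input node is fed into that input node of the sample classification model, so the input node value of each input node is the quantized value corresponding to it. Computing through the first formula group and the second formula group gives the output node value of each output node, which is the matching rate between the new employee and the violation level of that output node. The violation level with the highest matching rate is taken as the predicted violation level of the new employee. If the predicted violation level is too high, management measures can be taken for the employee, such as restricting the scope of business the employee may handle.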
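The final selection in S150 is an argmax over the output node values. A minimal sketch, with made-up matching rates and level labels:

```python
def predict_level(output_node_values, levels):
    """Return the level whose output node has the highest matching rate."""
    best_index = max(range(len(output_node_values)), key=output_node_values.__getitem__)
    return levels[best_index]


levels = ["first", "second", "third"]
matching_rates = [0.12, 0.71, 0.17]   # illustrative output node values
predicted = predict_level(matching_rates, levels)
```

Here the second output node has the highest matching rate, so the new employee is assigned the second violation level.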
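In the sample classification method provided by the embodiments of this application, the sample set in the historical information table is classified according to the classification information to obtain multiple sample clusters; the sample attribute information of each sample cluster is quantified according to the information quantification rule to obtain that cluster's feature information; a sample classification model is constructed from the cluster feature information and the feature unit configuration formula; new-sample attribute information is quantified by the same rule to obtain new-sample feature information; and that feature information is input into the sample classification model to obtain the corresponding target category. With this method, cluster information is obtained from samples that have historical records, and a sample classification model is built from it to classify new samples that have no historical records, which improves the accuracy of classifying such samples. Good technical results have been achieved in practical application.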
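An embodiment of this application also provides a sample classification device for executing any embodiment of the foregoing sample classification method. Specifically, refer to FIG. 7, which is a schematic block diagram of the sample classification device provided by an embodiment of this application. The sample classification device can be configured in the user terminal.
As shown in FIG. 7, the sample classification device 100 includes a sample cluster acquisition unit 110, a cluster feature information acquisition unit 120, a sample classification model construction unit 130, a new-sample feature information acquisition unit 140, and a target category acquisition unit 150.
The sample cluster acquisition unit 110 is configured to classify the sample set contained in the historical information table according to preset classification information to obtain multiple sample clusters, where each sample cluster includes at least one sample.
In other embodiments of the application, the sample cluster acquisition unit 110 includes sub-units: a sample statistical information acquisition unit and a grade classification unit.
The sample statistical information acquisition unit is configured to aggregate the historical information table according to the statistical items to obtain the sample statistical information of each sample; the grade classification unit is configured to grade the samples according to the grade classification rule and the sample statistical information to obtain multiple sample clusters.
The cluster feature information acquisition unit 120 is configured to acquire the sample attribute information corresponding to each sample cluster and to quantify it according to a preset information quantification rule to obtain the cluster feature information corresponding to each sample cluster.
In other embodiments of the application, the cluster feature information acquisition unit 120 includes sub-units: a sample attribute information conversion unit and a feature variable average calculation unit.
The sample attribute information conversion unit is configured to convert each piece of sample attribute information into a corresponding feature variable according to the information quantification rule; the feature variable average calculation unit is configured to compute the cluster feature information of each sample cluster from the feature variables of all samples in that cluster.
The sample classification model construction unit 130 is configured to construct, according to the cluster feature information and a preset feature unit configuration formula, a sample classification model containing input nodes, feature units, and output nodes.
In other embodiments of the application, the sample classification model construction unit 130 includes sub-units: an input node construction unit, an output node construction unit, a fully connected hidden layer construction unit, a first formula group construction unit, and a second formula group construction unit.
The input node construction unit is configured to construct the input nodes of the sample classification model according to the number of dimensions of the feature variables in the cluster feature information; the output node construction unit is configured to construct the output nodes of the sample classification model according to the number of sample clusters in the cluster feature information; the fully connected hidden layer construction unit is configured to input the number of input nodes and the number of output nodes into the feature unit configuration formula and to construct a fully connected hidden layer containing the number of feature units given by the result; the first formula group construction unit is configured to construct, from the feature units in the fully connected hidden layer and the input nodes, a first formula group from input nodes to feature units, with input node values as inputs and feature unit values as outputs; the second formula group construction unit is configured to construct, from the feature units in the fully connected hidden layer and the output nodes, a second formula group from feature units to output nodes, with feature unit values as inputs and output node values as outputs, to obtain the sample classification model.
In other embodiments of the application, the sample classification device 100 further includes a sub-unit: a sample classification model training unit.
The sample classification model training unit is configured to train the sample classification model according to the input dataset and the parameter adjustment rule to obtain the trained sample classification model.
In other embodiments of the application, the sample classification model training unit includes sub-units: a dataset splitting unit, a training accuracy acquisition unit, and a parameter value determination unit.
The dataset splitting unit is configured to split the dataset evenly into a preset number of sub-datasets; the training accuracy acquisition unit is configured to train the sample classification model for multiple rounds according to the parameter adjustment rule and the sub-datasets, and to compute the accuracy of the model after each round from the sub-datasets; the parameter value determination unit is configured to take the parameter values of the round with the highest accuracy as the parameter values of the sample classification model to obtain the trained sample classification model.
The new-sample feature information acquisition unit 140 is configured to, if new-sample attribute information of a new sample is received, quantify the new-sample attribute information according to the information quantification rule to obtain new-sample feature information corresponding to the new sample.
The target category acquisition unit 150 is configured to input the new-sample feature information into the sample classification model to obtain the target category corresponding to the new-sample attribute information.
The sample classification device provided by the embodiments of this application applies the sample classification method above: the sample set in the historical information table is classified according to the classification information to obtain multiple sample clusters; the sample attribute information of each sample cluster is quantified according to the information quantification rule to obtain that cluster's feature information; a sample classification model is constructed from the cluster feature information and the feature unit configuration formula; new-sample attribute information is quantified by the same rule to obtain new-sample feature information; and that feature information is input into the sample classification model to obtain the corresponding target category. With this method, sample clusters are obtained from samples that have historical records, and a sample classification model is built from them to classify new samples that have no historical records, which improves the accuracy of classifying such samples. Good technical results have been achieved in practical application.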
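The sample classification device described above can be implemented as a computer program, which can run on a computer device as shown in FIG. 8.
Refer to FIG. 8, which is a schematic block diagram of the computer device provided by an embodiment of this application.
Referring to FIG. 8, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to execute the sample classification method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When executed by the processor 502, the computer program 5032 causes the processor 502 to execute the sample classification method.
The network interface 505 is used for network communication, such as transmitting data. A person skilled in the art will understand that the structure shown in FIG. 8 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
The processor 502 runs the computer program 5032 stored in the memory to implement the corresponding functions of the sample classification method.
A person skilled in the art will understand that the embodiment of the computer device shown in FIG. 8 does not limit the specific configuration of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. It stores a computer program that, when executed by a processor, implements the steps of the sample classification method described above.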
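A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the equipment, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here. A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided by this application, it should be understood that the disclosed equipment, device, and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. The division into units is only a division by logical function; an actual implementation may divide the units differently or merge units with the same function into one. For example, multiple units or components can be combined or integrated into another system, and some features can be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form.
Units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of this application.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. On this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied as a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The computer-readable storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), a magnetic disk, or an optical disc.
The above are only specific implementations of this application, and the scope of protection of this application is not limited to them. Any person skilled in the art can readily conceive of equivalent modifications or replacements within the technical scope disclosed by this application, and these shall fall within the scope of protection of this application. The scope of protection of this application is therefore defined by the claims.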

Claims (20)

  1. 一种样本分类方法,其中,包括:
    根据预置的分类信息对历史信息表中所包含的样本集群进行分类以得到多个样本类群,其中,每一样本类群中包括至少一个样本;
    获取与每一所述样本类群对应的样本属性信息,根据预置的信息量化规则对所述样本属性信息进行量化以得到每一所述样本类群对应的类群特征信息;
    根据所述类群特征信息及预置的特征单元配置公式构建包含输入节点、特征单元和输出节点的样本分类模型;
    若接收到新增样本的新增样本属性信息,根据所述信息量化规则对所述新增样本属性信息进行量化以得到与所述新增样本对应的新增样本特征信息;
    将所述新增样本特征信息输入所述样本分类模型以获取与所述新增样本属性信息对应的目标类别。
  2. 根据权利要求1所述的样本分类方法,其中,所述分类信息包含统计项目及等级分类规则,所述根据预置的分类信息对历史信息表中所包含的样本集群进行分类以得到多个样本类群,包括:
    根据所述统计项目对所述历史信息表进行统计以获取每一所述样本的样本统计信息;
    根据所述等级分类规则及所述样本统计信息对所述样本进行等级分类以得到多个样本类群。
  3. 根据权利要求1所述的样本分类方法,其中,所述根据预置的信息量化规则对所述样本属性信息进行量化以得到每一所述样本类群对应的类群特征信息,包括:
    根据所述信息量化规则将每一所述样本属性信息转换为对应的特征变量;
    根据每一所述样本类群中所有样本的特征变量计算得到每一所述样本类群的类群特征信息。
  4. 根据权利要求1所述的样本分类方法,其中,所述根据所述类群特征信息及预置的特征单元配置公式构建包含输入节点、特征单元和输出节点的样本分类模型,包括:
    根据所述类群特征信息中特征变量的维度数量构建所述样本分类模型的输入节点;
    根据所述类群特征信息中样本类群的数量构建所述样本分类模型的输出节点;
    将所述输入节点的数量及所述输出节点的数量输入所述特征单元配置公式,以根据计算结果构建包含相应数量的特征单元的全连接隐层;
    根据所述全连接隐层中的特征单元及所述输入节点,以输入节点值作为输入值、特征单元值作为输出值构建输入节点至特征单元的第一公式组;
    根据所述全连接隐层中的特征单元及所述输出节点,以特征单元值作为输入值、输出节点值作为输出值构建特征单元至输出节点的第二公式组,以得到样本分类模型。
  5. 根据权利要求1所述的样本分类方法,其中,所述根据所述类群特征信息及预置的特征单元配置公式构建包含输入节点、特征单元和输出节点的样本分类模型之后,还包括:
    根据所输入的数据集及参数调整规则对所述样本分类模型进行训练,以得到训练后的所述样本分类模型。
  6. 根据权利要求5所述的样本分类方法,其中,所述根据所输入的数据集及参数调整规则对所述样本分类模型进行训练,以得到训练后的所述样本分类模型,包括:
    将所述数据集平均拆分为预设数量的子数据集;
    根据所述参数值调整规则及多个所述子数据集对所述样本分类模型进行多轮训练,并根据所述子数据集计算每一轮训练后所述样本分类模型的准确率;
    将准确率最高的一轮训练的参数值作为所述样本分类模型的参数值以得到训练后的所述样本分类模型。
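  7. The sample classification method according to claim 4, wherein the feature unit configuration formula is $S_0 = S_1 \times S_2 / 2$ or $S_0 = 2 \times (S_1 \times S_2)^{1/2}$, where $S_0$ is the configured number of feature units in the fully connected hidden layer, $S_1$ is the number of input nodes, and $S_2$ is the number of output nodes.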
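To get a feel for how the two variants of the configuration formula behave, the short sketch below evaluates both for a few node counts. The function names and the example sizes are illustrative; the claim does not state how a non-integer result is rounded, so the raw values are printed.

```python
import math


def units_product(s1, s2):
    """First variant: S0 = S1 * S2 / 2."""
    return s1 * s2 / 2


def units_sqrt(s1, s2):
    """Second variant: S0 = 2 * (S1 * S2) ** (1/2)."""
    return 2 * math.sqrt(s1 * s2)


# (input nodes, output nodes) pairs chosen only for illustration.
for s1, s2 in [(4, 3), (8, 2), (20, 5)]:
    print(s1, s2, units_product(s1, s2), round(units_sqrt(s1, s2), 2))
```

The product variant grows linearly in $S_1 \times S_2$, while the square-root variant grows much more slowly, so the two coincide only when $S_1 \times S_2 = 16$.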
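  8. A sample classification apparatus, comprising:
    a sample group obtaining unit, configured to classify a sample cluster contained in a history information table according to preset classification information to obtain a plurality of sample groups, wherein each sample group includes at least one sample;
    a group feature information obtaining unit, configured to obtain sample attribute information corresponding to each of the sample groups, and quantize the sample attribute information according to a preset information quantization rule to obtain group feature information corresponding to each of the sample groups;
    a sample classification model construction unit, configured to construct, according to the group feature information and a preset feature unit configuration formula, a sample classification model including input nodes, feature units, and output nodes;
    a new sample feature information obtaining unit, configured to, if new sample attribute information of a newly added sample is received, quantize the new sample attribute information according to the information quantization rule to obtain new sample feature information corresponding to the newly added sample;
    a target category obtaining unit, configured to input the new sample feature information into the sample classification model to obtain a target category corresponding to the new sample attribute information.
  9. The sample classification apparatus according to claim 8, wherein the sample group obtaining unit comprises:
    a sample statistical information obtaining unit, configured to perform statistics on the history information table according to the statistical items to obtain sample statistical information of each of the samples;
    a grade classification unit, configured to perform grade classification on the samples according to the grade classification rule and the sample statistical information to obtain a plurality of sample groups.
  10. The sample classification apparatus according to claim 8, wherein the group feature information obtaining unit comprises:
    a sample attribute information conversion unit, configured to convert each piece of the sample attribute information into a corresponding feature variable according to the information quantization rule;
    a feature variable average calculation unit, configured to calculate the group feature information of each of the sample groups according to the feature variables of all samples in each of the sample groups.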
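  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    classifying a sample cluster contained in a history information table according to preset classification information to obtain a plurality of sample groups, wherein each sample group includes at least one sample;
    obtaining sample attribute information corresponding to each of the sample groups, and quantizing the sample attribute information according to a preset information quantization rule to obtain group feature information corresponding to each of the sample groups;
    constructing, according to the group feature information and a preset feature unit configuration formula, a sample classification model including input nodes, feature units, and output nodes;
    if new sample attribute information of a newly added sample is received, quantizing the new sample attribute information according to the information quantization rule to obtain new sample feature information corresponding to the newly added sample;
    inputting the new sample feature information into the sample classification model to obtain a target category corresponding to the new sample attribute information.
  12. The computer device according to claim 11, wherein the classification information includes statistical items and a grade classification rule, and the classifying a sample cluster contained in a history information table according to preset classification information to obtain a plurality of sample groups comprises:
    performing statistics on the history information table according to the statistical items to obtain sample statistical information of each of the samples;
    performing grade classification on the samples according to the grade classification rule and the sample statistical information to obtain a plurality of sample groups.
  13. The computer device according to claim 11, wherein the quantizing the sample attribute information according to a preset information quantization rule to obtain group feature information corresponding to each of the sample groups comprises:
    converting each piece of the sample attribute information into a corresponding feature variable according to the information quantization rule;
    calculating the group feature information of each of the sample groups according to the feature variables of all samples in each of the sample groups.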
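  14. The computer device according to claim 11, wherein the constructing, according to the group feature information and a preset feature unit configuration formula, a sample classification model including input nodes, feature units, and output nodes comprises:
    constructing the input nodes of the sample classification model according to the number of dimensions of the feature variables in the group feature information;
    constructing the output nodes of the sample classification model according to the number of sample groups in the group feature information;
    inputting the number of input nodes and the number of output nodes into the feature unit configuration formula, and constructing, according to the calculation result, a fully connected hidden layer containing a corresponding number of feature units;
    constructing, according to the feature units in the fully connected hidden layer and the input nodes, a first formula group from the input nodes to the feature units, with input node values as input values and feature unit values as output values;
    constructing, according to the feature units in the fully connected hidden layer and the output nodes, a second formula group from the feature units to the output nodes, with feature unit values as input values and output node values as output values, so as to obtain the sample classification model.
  15. The computer device according to claim 11, wherein, after the constructing, according to the group feature information and a preset feature unit configuration formula, a sample classification model including input nodes, feature units, and output nodes, the steps further comprise:
    training the sample classification model according to an input data set and a parameter adjustment rule to obtain the trained sample classification model.
  16. The computer device according to claim 15, wherein the training the sample classification model according to an input data set and a parameter adjustment rule to obtain the trained sample classification model comprises:
    splitting the data set evenly into a preset number of sub-data sets;
    performing multiple rounds of training on the sample classification model according to the parameter adjustment rule and the plurality of sub-data sets, and calculating, according to the sub-data sets, the accuracy of the sample classification model after each round of training;
    using the parameter values of the round of training with the highest accuracy as the parameter values of the sample classification model to obtain the trained sample classification model.
  17. The computer device according to claim 14, wherein the feature unit configuration formula is $S_0 = S_1 \times S_2 / 2$ or $S_0 = 2 \times (S_1 \times S_2)^{1/2}$, where $S_0$ is the configured number of feature units in the fully connected hidden layer, $S_1$ is the number of input nodes, and $S_2$ is the number of output nodes.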
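  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    classifying a sample cluster contained in a history information table according to preset classification information to obtain a plurality of sample groups, wherein each sample group includes at least one sample;
    obtaining sample attribute information corresponding to each of the sample groups, and quantizing the sample attribute information according to a preset information quantization rule to obtain group feature information corresponding to each of the sample groups;
    constructing, according to the group feature information and a preset feature unit configuration formula, a sample classification model including input nodes, feature units, and output nodes;
    if new sample attribute information of a newly added sample is received, quantizing the new sample attribute information according to the information quantization rule to obtain new sample feature information corresponding to the newly added sample;
    inputting the new sample feature information into the sample classification model to obtain a target category corresponding to the new sample attribute information.
  19. The computer-readable storage medium according to claim 18, wherein the classification information includes statistical items and a grade classification rule, and the classifying a sample cluster contained in a history information table according to preset classification information to obtain a plurality of sample groups comprises:
    performing statistics on the history information table according to the statistical items to obtain sample statistical information of each of the samples;
    performing grade classification on the samples according to the grade classification rule and the sample statistical information to obtain a plurality of sample groups.
  20. The computer-readable storage medium according to claim 18, wherein the quantizing the sample attribute information according to a preset information quantization rule to obtain group feature information corresponding to each of the sample groups comprises:
    converting each piece of the sample attribute information into a corresponding feature variable according to the information quantization rule;
    calculating the group feature information of each of the sample groups according to the feature variables of all samples in each of the sample groups.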
PCT/CN2020/111949 2020-03-12 2020-08-28 Sample classification method and apparatus, computer device, and storage medium WO2021179544A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010171236.7 2020-03-12
CN202010171236.7A CN111461180B (zh) 2020-03-12 2020-03-12 Sample classification method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021179544A1 (zh) 2021-09-16

Family

ID=71683250

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/111949 WO2021179544A1 (zh) 2020-03-12 2020-08-28 Sample classification method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111461180B (zh)
WO (1) WO2021179544A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
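CN111461180B (zh) * 2020-03-12 2024-07-09 平安科技(深圳)有限公司 Sample classification method and apparatus, computer device, and storage medium
CN114077858A (zh) 2020-08-17 2022-02-22 浙江宇视科技有限公司 Vector data processing method, apparatus, device, and storage medium
CN113392294B (zh) * 2020-10-15 2023-11-10 腾讯科技(深圳)有限公司 Sample labeling method and apparatus
CN112017042A (zh) * 2020-10-22 2020-12-01 北京淇瑀信息科技有限公司 Method and apparatus for determining a resource quota based on the Tweedie distribution, and electronic device
CN112348079B (zh) * 2020-11-05 2023-10-31 平安科技(深圳)有限公司 Data dimensionality reduction processing method and apparatus, computer device, and storage medium
CN112561479B (zh) * 2020-12-16 2023-09-19 中国平安人寿保险股份有限公司 Method and apparatus for enterprise headcount expansion based on intelligent decision-making, and computer device
CN113792202B (zh) * 2021-08-31 2023-05-05 中国电子科技集团公司第三十研究所 Screening method for user classification
CN113535964B (zh) * 2021-09-15 2021-12-24 深圳前海环融联易信息科技服务有限公司 Method, apparatus, device, and medium for intelligently constructing an enterprise classification model
CN114443849B (zh) * 2022-02-09 2023-10-27 北京百度网讯科技有限公司 Method and apparatus for selecting labeled samples, electronic device, and storage medium
CN117787815B (zh) * 2024-02-27 2024-05-07 山东杰出人才发展集团有限公司 Big-data-based human resources outsourcing service system and method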

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
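CN108629633A (zh) * 2018-05-09 2018-10-09 浪潮软件股份有限公司 Method and system for building user profiles based on big data
CN109522424A (zh) * 2018-10-16 2019-03-26 北京达佳互联信息技术有限公司 Data processing method and apparatus, electronic device, and storage medium
CN109948730A (zh) * 2019-03-29 2019-06-28 中诚信征信有限公司 Data classification method and apparatus, electronic device, and storage medium
CN110096499A (zh) * 2019-04-10 2019-08-06 华南理工大学 User object identification method and system based on behavioral time-series big data
US20190272419A1 (en) * 2018-03-05 2019-09-05 Shutterfly, Inc. Automated communication design construction system
CN110717503A (zh) * 2018-07-12 2020-01-21 深圳灰猫科技有限公司 Classification method and apparatus, electronic device, and computer storage medium
CN111461180A (zh) * 2020-03-12 2020-07-28 平安科技(深圳)有限公司 Sample classification method and apparatus, computer device, and storage medium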

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042952A1 (en) * 2017-08-03 2019-02-07 Beijing University Of Technology Multi-task Semi-Supervised Online Sequential Extreme Learning Method for Emotion Judgment of User
CN109241288A (zh) * 2018-10-12 2019-01-18 平安科技(深圳)有限公司 Update training method, apparatus, and device for a text classification model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190272419A1 (en) * 2018-03-05 2019-09-05 Shutterfly, Inc. Automated communication design construction system
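CN108629633A (zh) * 2018-05-09 2018-10-09 浪潮软件股份有限公司 Method and system for building user profiles based on big data
CN110717503A (zh) * 2018-07-12 2020-01-21 深圳灰猫科技有限公司 Classification method and apparatus, electronic device, and computer storage medium
CN109522424A (zh) * 2018-10-16 2019-03-26 北京达佳互联信息技术有限公司 Data processing method and apparatus, electronic device, and storage medium
CN109948730A (zh) * 2019-03-29 2019-06-28 中诚信征信有限公司 Data classification method and apparatus, electronic device, and storage medium
CN110096499A (zh) * 2019-04-10 2019-08-06 华南理工大学 User object identification method and system based on behavioral time-series big data
CN111461180A (zh) * 2020-03-12 2020-07-28 平安科技(深圳)有限公司 Sample classification method and apparatus, computer device, and storage medium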

Also Published As

Publication number Publication date
CN111461180A (zh) 2020-07-28
CN111461180B (zh) 2024-07-09

Similar Documents

Publication Publication Date Title
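WO2021179544A1 (zh) Sample classification method and apparatus, computer device, and storage medium
WO2021068513A1 (zh) Abnormal object identification method and apparatus, medium, and electronic device
CN110866782B (zh) Customer classification method and system, and electronic device
US8861691B1 (en) Methods for managing telecommunication service and devices thereof
CN105512465B (zh) Quantitative security assessment method for cloud platforms based on an improved VIKOR method
WO2021098265A1 (zh) Missing information prediction method and apparatus, computer device, and storage medium
CN111079941B (zh) Credit information processing method, system, terminal, and storage medium
TWI505667B (zh) Method and system for performing predictive analysis relating to communication network nodes
CN109063736B (zh) Data classification method and apparatus, electronic device, and computer-readable storage medium
CN112215604A (zh) Method and apparatus for identifying relationship information between transaction parties
WO2021068798A1 (zh) Text-based indicator extraction method and apparatus, computer device, and storage medium
CN111797320A (zh) Data processing method, apparatus, device, and storage medium
CN106408325A (zh) Method and system for predictive analysis of user consumption behavior based on user payment information
KR20150137175A (ko) Apparatus and computer-readable medium for monitoring customer complaints
CN106919564A (zh) Influence measurement method based on mobile user behavior
CN111062564A (zh) Method for calculating sensitivity values of electric power customer appeals
WO2023029065A1 (zh) Dataset quality evaluation method and apparatus, computer device, and storage medium
CN114548118A (zh) Service dialogue detection method and system
CN117371861A (zh) Digitalization-based intelligent analysis method and system for domestic service quality
CN112732886A (zh) Session management method, apparatus, system, and medium
US11250365B2 (en) Systems and methods for utilizing compliance drivers to conserve system resources and reduce compliance violations
CN117172795A (zh) Intelligent online consultation system for technical service fees
CN112085509A (zh) Notification information sending method and apparatus, computer device, and storage medium
CN115170153B (zh) Work order processing method and apparatus based on multi-dimensional attributes, and storage medium
CN114238615B (zh) Method and system for processing enterprise service outcome data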

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20923877; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 20923877; Country of ref document: EP; Kind code of ref document: A1