CN116628465B

CN116628465B - Feature selection method based on screening machine learning user

Info

Publication number: CN116628465B
Application number: CN202310596434.1A
Authority: CN
Inventors: 阮宁; 曹天赐; 倪萌
Original assignee: Henan Shunying Data Technology Co ltd; Henan Normal University
Current assignee: Henan Shunying Data Technology Co ltd; Henan Normal University
Priority date: 2023-05-25
Filing date: 2023-05-25
Publication date: 2024-07-12
Anticipated expiration: 2043-05-25
Also published as: CN116628465A

Abstract

The invention discloses a feature selection method based on screening machine learning users, and belongs to the technical field of machine learning. The invention comprises the following steps: collecting user registration information in real time according to a user registration window, and acquiring user characteristic data in real time based on the user registration information; preprocessing the user characteristic data acquired in real time to determine the user characteristic data with distribution characteristics; and analyzing and evaluating the user characteristic data with the distribution characteristics to determine a corresponding analysis and evaluation report. The invention solves the problem that the screening precision is low and the user feature selection effect is poor when the user feature selection is performed based on the machine learning in the prior art.

Description

Feature selection method based on screening machine learning user

Technical Field

The invention relates to the technical field of machine learning, in particular to a feature selection method based on screening machine learning users.

Background

Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance.

Chinese patent publication No. CN114970312A discloses a method and apparatus for screening machine learning features, comprising a mounting base, an equipment box, a detection and screening computer, a keyboard, a data interface, a data connection line, a support frame, a fixing frame and a display screen. According to that publication, the method is simple and convenient to operate, allows several groups of machines to be tested simultaneously, makes the learning characteristics of the machines intuitive to inspect, reduces surplus resources, and improves the accuracy with which the machines are used. However, the above patent has the following drawback in practical use:

At present, when user feature selection is performed based on machine learning, the screening precision is low, so the user feature selection effect is poor.

Disclosure of Invention

The invention aims to provide a feature selection method based on screening machine learning users, which can improve screening precision and user feature selection effect when user feature selection is performed based on machine learning, and solves the problems in the background technology.

In order to achieve the above purpose, the present invention provides the following technical solutions:

A feature selection method based on screening machine learning users comprises the following steps:

S1: collecting user registration information in real time according to a user registration window, and acquiring user characteristic data in real time based on the user registration information, wherein the user characteristic data comprises, but is not limited to, user name characteristics, user gender characteristics, user age characteristics, user region characteristics, user preference characteristics, user payment characteristics and user social characteristics;

S2: preprocessing the user characteristic data acquired in real time, converting the user characteristic data acquired in real time into a form which can be received by a processor based on the selection requirement of the machine learning user characteristic, and performing filtering, noise reduction and sequencing on the converted user characteristic data to determine the user characteristic data with distributed characteristics;

S3: analyzing and evaluating the user characteristic data with the distribution characteristics: acquiring the user characteristic data with the distribution characteristics, training a user characteristic model on the user characteristic data based on a linear model, determining a linear-model-based user characteristic selection model, analyzing and evaluating the user characteristic data with the distribution characteristics based on the user characteristic selection model, and determining a corresponding analysis and evaluation report;

S4: selecting and controlling the user characteristic data with the distribution characteristics: acquiring the analysis and evaluation report based on the user characteristic selection model, determining, by means of a data mining technology and a correlation analysis method, a user characteristic selection strategy based on the analysis and evaluation report, and intelligently selecting the user characteristic data according to the user characteristic selection strategy.

Preferably, in the step S1, the user registration information is collected in real time according to the user registration window, and the following operations are performed:

acquiring a user registration window;

The user inputs user registration information corresponding to the specific prompt including, but not limited to, user name characteristics, user gender characteristics, user age characteristics, user region characteristics, user preference characteristics, user payment characteristics and user social characteristics in the user registration window according to the specific prompt of the user registration window;

After the user registration information is input, the user registration information is acquired in real time, and the acquired user registration information is extracted to determine the user characteristic data.
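The extraction in step S1 can be illustrated with a minimal sketch. It assumes the registration window delivers each submission as a plain dictionary of form fields. The field names, the `UserFeatures` class and the `extract_features` helper are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical field names mirroring the feature categories listed in S1.
FEATURE_FIELDS = ("name", "gender", "age", "region", "preference", "payment", "social")


@dataclass
class UserFeatures:
    """User characteristic data extracted from one registration submission."""
    values: Dict[str, Any] = field(default_factory=dict)


def extract_features(registration: Dict[str, Any]) -> UserFeatures:
    """Keep only the known feature fields and tidy up string inputs."""
    values: Dict[str, Any] = {}
    for key in FEATURE_FIELDS:
        raw = registration.get(key)
        if isinstance(raw, str):
            raw = raw.strip() or None  # treat empty strings as missing
        values[key] = raw
    return UserFeatures(values)


# Example submission; "captcha" is not a feature field and is ignored.
form = {"name": " Li Hua ", "gender": "F", "age": 29, "region": "Henan", "captcha": "x7Qp"}
features = extract_features(form)
```

Fields the user did not fill in are kept as `None` so that later filtering steps can see which features are missing.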

Preferably, in the step S2, the user feature data acquired in real time is preprocessed, and the following operations are executed:

acquiring user characteristic data, and preprocessing the user characteristic data;

Based on the machine learning user feature selection requirement, converting the user feature data acquired in real time into a form which can be received by a processor;

obtaining converted user characteristic data, and filtering the user characteristic data;

Based on the data filtering model, identifying user characteristic data by adopting a statistical method, filtering out user characteristic data which is useless for machine learning user characteristic selection, and retaining user characteristic data which is useful for machine learning user characteristic selection;

Acquiring filtered user characteristic data, and carrying out noise reduction treatment on the user characteristic data;

Processing the user characteristic data based on a data noise reduction method, removing noise characteristic data in the user characteristic data, and determining user characteristic data without the noise characteristic data;

acquiring user characteristic data after noise reduction, and sequencing the user characteristic data;

Based on an internal sorting method, sorting the noise-free user characteristic data, and determining the user characteristic data with distribution characteristics.
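The four preprocessing stages of S2 (conversion, filtering, noise reduction, sorting) can be sketched as a small pipeline. The patent does not name the statistical filter or the noise reduction method, so the rules used here are assumptions chosen only for illustration: a missing-value ratio for filtering and a z-score cutoff for noise. All function names are hypothetical.

```python
from statistics import mean, pstdev


def convert(records):
    """Convert raw field values to floats; unparseable values become None."""
    out = []
    for rec in records:
        row = {}
        for key, raw in rec.items():
            try:
                row[key] = float(raw)
            except (TypeError, ValueError):
                row[key] = None
        out.append(row)
    return out


def filter_features(records, max_missing=0.5):
    """Drop features whose share of missing values exceeds max_missing (assumed rule)."""
    keep = [k for k in records[0]
            if sum(r[k] is None for r in records) / len(records) <= max_missing]
    return [{k: r[k] for k in keep} for r in records]


def denoise(records, key, z_max=3.0):
    """Drop records whose value for `key` is missing or a z-score outlier (assumed rule)."""
    vals = [r[key] for r in records if r[key] is not None]
    mu, sigma = mean(vals), pstdev(vals)
    return [r for r in records
            if r[key] is not None and (sigma == 0 or abs(r[key] - mu) / sigma <= z_max)]


def preprocess(records, sort_key):
    """Run convert -> filter -> denoise -> in-memory (internal) sort."""
    cleaned = denoise(filter_features(convert(records)), sort_key)
    return sorted(cleaned, key=lambda r: r[sort_key])


raw = [
    {"age": "31", "spend": "120.5", "nick": "a"},
    {"age": "25", "spend": "80", "nick": "b"},
    {"age": "n/a", "spend": "95", "nick": "c"},
    {"age": "40", "spend": "60", "nick": "d"},
]
result = preprocess(raw, sort_key="age")
```

In this example the non-numeric `nick` column is filtered out, the record with an unparseable age is dropped, and the remaining records are ordered by age.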

Preferably, converting the user characteristic data acquired in real time into a form receivable by the processor includes:

setting a unit time length, wherein the value range of the unit time length is 3s-5s;

monitoring the data volume of the user characteristic data acquired in unit time in real time;

Acquiring types of receivable forms of the processors corresponding to the user feature data based on machine learning user feature selection requirements, and taking the types of the receivable forms of each processor as target data forms;

extracting the maximum data conversion amount capable of converting the user characteristic data into each target data form in unit time;

acquiring fit parameters between the user characteristic data and the processor according to the maximum data conversion quantity of each target data form; the fit parameters are obtained through the following formula:

Wherein W represents the fit parameter; C represents the average value of the data amount of the user characteristic data generated per unit time; C_zi represents the maximum data conversion amount corresponding to the i-th target data form per unit time; n represents the number of kinds of target data forms; λ represents an adjustment coefficient; W₀ represents the fit parameter reference constant value; m represents the number of unit times elapsed so far; C_j represents the data amount of the user characteristic data generated in the j-th unit time;

And determining a data conversion time interval corresponding to each target data form according to the fit parameters.
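The monitoring part of this step can be sketched as follows. The expression for W is not reproduced in this text, so the sketch stops at gathering its inputs: the per-window data amounts C_j, the window count m, and the average C. Treating C as the arithmetic mean of C_1..C_m is an assumption based on the symbol definitions, and the `VolumeMonitor` class is hypothetical.

```python
class VolumeMonitor:
    """Tracks C_j, the data amount observed in each elapsed unit time window."""

    def __init__(self, unit_seconds=4.0):
        # The unit time length is stated to lie in the range 3s-5s.
        if not 3.0 <= unit_seconds <= 5.0:
            raise ValueError("unit time length must be within 3s-5s")
        self.unit_seconds = unit_seconds
        self.volumes = []  # C_1 ... C_m

    def close_window(self, data_amount):
        """Record the data amount C_j collected during the window that just ended."""
        self.volumes.append(data_amount)

    @property
    def m(self):
        """Number of unit time windows elapsed so far."""
        return len(self.volumes)

    @property
    def mean_volume(self):
        """C, assumed here to be the arithmetic mean of C_1..C_m."""
        if not self.volumes:
            raise ValueError("no unit time window has elapsed yet")
        return sum(self.volumes) / self.m


monitor = VolumeMonitor(unit_seconds=4.0)
for amount in (1200, 900, 1500):
    monitor.close_window(amount)
```

After three windows, `monitor.m` and `monitor.mean_volume` give the values of m and C that the fit parameter calculation would consume.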

Preferably, determining a data conversion time interval corresponding to each target data form according to the fit parameters includes:

extracting fit parameters between the user characteristic data and the processor;

And setting the fit parameters through a configuration model to determine a data conversion time interval corresponding to each target data form, wherein the configuration model is as follows:

Wherein T_i represents the data conversion time interval corresponding to the i-th target data form; T_d represents the time length of the unit time; C_zi represents the maximum data conversion amount corresponding to the i-th target data form per unit time; C represents the average value of the data amount of the user characteristic data generated per unit time; W represents the fit parameter; W₀ represents the fit parameter reference constant value.

Preferably, in the step S3, the user feature data with the distribution feature is analyzed and evaluated, and the following operations are performed:

acquiring user characteristic data with distribution characteristics;

Training a user characteristic model based on the linear model for the user characteristic data, and determining a user characteristic selection model based on the linear model;

Based on the user feature selection model, analyzing and evaluating the user feature data with the distributed features, and determining a corresponding analysis and evaluation report;

Aiming at the situation that the user characteristic data with the distributed characteristics contains redundant user characteristics, the determined analysis and evaluation report is that the user characteristic data needs to be subjected to machine learning-based user characteristic selection processing;

For the case that the user characteristic data with the distribution characteristics does not contain redundant user characteristics, the determined analysis and evaluation report is that the user characteristic data does not need to be subjected to machine learning-based user characteristic selection processing.
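One way to read the S3 evaluation is sketched below. The patent says only that the selection model is based on a linear model, so the concrete rule here is an assumption: a feature pair is flagged as redundant when a one-variable least-squares line between the two features reaches an R² above a chosen threshold. The function names, the threshold and the report layout are hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """R^2 of the one-variable least-squares line y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no linear relationship to report
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def evaluation_report(columns, threshold=0.9):
    """Report whether any feature is almost a linear function of another."""
    redundant = [(a, b) for a, b in combinations(sorted(columns), 2)
                 if r_squared(columns[a], columns[b]) >= threshold]
    return {"needs_selection": bool(redundant), "redundant_pairs": redundant}


columns = {
    "age": [22, 35, 41, 58, 63],
    "birth_year": [2001, 1988, 1982, 1965, 1960],  # exactly 2023 - age
    "spend": [10, 80, 20, 90, 15],
}
report = evaluation_report(columns)
```

The `needs_selection` flag corresponds to the two report outcomes described above: selection processing is needed when redundant features are present, and not needed otherwise.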

Preferably, in the step S4, the following operations are performed to selectively control the user feature data with the distribution feature:

Acquiring an analysis evaluation report based on a user feature selection model;

Aiming at the situation that the analysis evaluation report is that the user characteristic data needs to be subjected to machine learning-based user characteristic selection processing, determining a user characteristic selection strategy based on the analysis evaluation report based on a data mining technology and an association analysis method, and intelligently selecting the user characteristic data according to the user characteristic selection strategy.

Preferably, in the step S4, the following operations are performed to determine a user feature selection policy based on the analysis evaluation report:

Based on a data mining technology and a correlation analysis method, carrying out deep analysis on user characteristic data with distributed characteristics, and determining redundant user characteristics in the user characteristic data;

Acquiring the redundant user characteristics, sorting the redundant user characteristics based on an internal sorting method, and determining a redundant user characteristic set based on machine learning user characteristic selection requirements;

and carrying out correlation analysis on the redundant user feature sets, determining a user feature selection strategy based on the redundant user feature sets, and carrying out intelligent selection on user feature data according to the user feature selection strategy.

Preferably, in the step S4, the user feature data is intelligently selected according to a user feature selection policy, and the following operations are executed:

acquiring a user feature selection strategy based on the redundant user feature set, and intelligently selecting redundant user features in the user feature data based on the user feature selection strategy;

Extracting user characteristic data one by one, carrying out correlation analysis on the user characteristic data, determining redundant user characteristics with high correlation, carrying out intelligent selection on the user characteristic data containing the redundant user characteristics with high correlation based on a method for eliminating data redundancy, and removing the redundant user characteristics in the user characteristic data.
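The one-by-one redundancy removal can be sketched as a greedy pass over the features. The patent does not specify the correlation measure or the cutoff, so Pearson correlation and the 0.95 threshold are assumptions, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    syy = sum((v - my) ** 2 for v in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def drop_redundant(columns, threshold=0.95):
    """Examine features one by one; keep a feature only if it is not highly
    correlated with any feature that has already been kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


columns = {
    "age": [22, 35, 41, 58, 63],
    "birth_year": [2001, 1988, 1982, 1965, 1960],  # perfectly anti-correlated with age
    "spend": [10, 80, 20, 90, 15],
}
selected = drop_redundant(columns)
```

Because the pass is greedy, the feature that appears first in a correlated pair is the one retained. Ordering the features by priority beforehand, as the sorting step suggests, would therefore decide which member of each pair survives.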

Preferably, in the step S4, after the redundant user features are removed, the following operations are further performed:

Acquiring user characteristic data from which redundant user characteristics are removed;

Comprehensively analyzing the user characteristic data, and checking whether the user characteristic data contains low-value user characteristics;

Aiming at the situation that the user characteristic data contains low-value user characteristics, removing the low-value user characteristics contained in the user characteristic data, and reserving high-value user characteristics;

and for the case that the user characteristic data does not contain low-value user characteristics, the user characteristic data is fully reserved.
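The low-value check can be sketched in the same style. The patent does not define what makes a feature low-value, so the sketch uses near-zero population variance as an assumed proxy. The `split_by_value` helper and its threshold are hypothetical.

```python
from statistics import pvariance


def split_by_value(columns, min_variance=1e-3):
    """Partition features into high-value and low-value sets.

    Assumption: a feature whose values barely vary carries little
    information and is treated as low-value.
    """
    high, low = {}, {}
    for name, values in columns.items():
        target = high if pvariance(values) > min_variance else low
        target[name] = values
    return high, low


columns = {"age": [22, 35, 41, 58], "country": [1, 1, 1, 1]}
high, low = split_by_value(columns)
```

When `low` comes back empty, all features are retained, which matches the second case described above.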

Compared with the prior art, the invention has the beneficial effects that:

1. The invention acquires user registration information in real time according to the user registration window, acquires user characteristic data in real time based on the user registration information, preprocesses the user characteristic data acquired in real time, converts the user characteristic data acquired in real time into a form which can be received by a processor based on machine learning user characteristic selection requirements, and carries out filtering, noise reduction and sorting processing on the converted user characteristic data to determine the user characteristic data with distributed characteristics.

2. According to the invention, the user characteristic data with distributed characteristics is obtained, the user characteristic model training is carried out on the user characteristic data based on the linear model, the user characteristic selection model based on the linear model is determined, the user characteristic data with distributed characteristics is analyzed and evaluated based on the user characteristic selection model, the corresponding analysis and evaluation report is determined, the user characteristic selection strategy based on the analysis and evaluation report is determined based on the data mining technology and the association analysis method, the user characteristic data is intelligently selected according to the user characteristic selection strategy, and the screening precision and the user characteristic selection effect can be improved when the user characteristic selection is carried out based on machine learning.

Drawings

FIG. 1 is a flow chart of the feature selection method based on screening machine learning users of the present invention;

FIG. 2 is a flowchart of an algorithm for analyzing, evaluating, selecting and controlling user feature data according to the present invention;

FIG. 3 is an algorithmic flow chart of the low value user feature processing contained in the user feature data of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to solve the problem that when the user feature selection is performed based on machine learning, the screening precision is low, resulting in poor user feature selection effect, referring to fig. 1-3, the present embodiment provides the following technical solutions:

In S1, user registration information is acquired in real time according to a user registration window, and the following operations are executed:

acquiring a user registration window;

In S2, the user characteristic data acquired in real time is preprocessed, and the following operations are executed:

Specifically, converting the user characteristic data acquired in real time into a form receivable by a processor includes:

The technical effects of the above scheme are as follows. The scheme monitors and processes the user characteristic data in real time and converts the data into a form that the processor can receive, according to the machine learning user feature selection requirements. Setting the unit time length controls the time interval of data processing. Monitoring in real time the data volume of the user characteristic data acquired per unit time allows corresponding processing to be performed, which facilitates real-time response to changes in user characteristics and needs. Acquiring the types of forms receivable by the processor, based on the machine learning user feature selection requirements, allows data conversion to be performed according to different processor requirements, so that the data can be received and processed by the processor. Extracting the maximum data conversion amount with which the user characteristic data can be converted into each target data form per unit time, and determining from it the fit parameter between the user characteristic data and the processor, improves the efficiency and performance of data conversion. Determining the data conversion time interval corresponding to each target data form according to the fit parameter allows the time interval of data processing to be arranged reasonably, meeting the requirements of the processor while maintaining real-time performance.

Meanwhile, the fit parameter W can be adjusted according to the actual data amount, so as to ensure the adaptability between the average value C of the data amount of the user characteristic data generated per unit time and the maximum data conversion amount C_zi of each target data form. By adjusting the fit parameter, the data conversion speed can be increased when the data volume is larger and reduced when the data volume is smaller, so that the data processing requirements under different conditions can be met.

By introducing the adjustment coefficient λ, the fit parameter can be flexibly adjusted to further optimize the effect of data conversion. The adjustment coefficient λ may be set according to specific requirements; for example, increasing λ may increase the data conversion speed, and decreasing λ may decrease the data conversion speed, so as to achieve better data processing performance.

The fit parameter reference constant value W₀ and the number m of unit times elapsed so far may be combined to take into account the influence of historical information on the fit parameter. The reference constant value W₀ provides an initial reference point, and the number m of unit times elapsed so far reflects the history of the data processing. By considering the reference constant value and the historical information together, the fit parameter can be adjusted to adapt to dynamic changes in the data processing.

In summary, by setting the fit parameters according to the above factors, the adaptability, flexibility and optimality of data conversion can be realized, so as to better meet the requirement of data processing, and consider the influence of historical information on data conversion.

Specifically, determining the data conversion time interval corresponding to each target data form according to the fit parameters includes:

The technical effects of the above scheme are as follows. Extracting the fit parameter between the user characteristic data and the processor helps optimize the data conversion process. According to the setting of the fit parameter, the speed and efficiency of data conversion can be adjusted so as to meet the processor's data receiving and processing requirements to the greatest extent. Optimizing the data conversion process in this way improves the efficiency and performance of data processing.

The fit parameter is set through a configuration model, and the data conversion time interval corresponding to each target data form is determined according to it. This allows time intervals to be arranged reasonably during the data conversion process, ensuring timely data transmission and processing. By determining an appropriate data conversion time interval, the real-time performance of data processing and the utilization of system resources can be balanced.

By extracting and configuring the fit parameters, the stability and compatibility of the system can be improved. The fit parameters may be adjusted according to the performance of the processor and the data processing requirements to ensure that the data conversion process does not exceed the load capacity of the processor. This helps to avoid problems such as system crashes, data loss, or processing delays, and to improve reliability and stability of the system. Therefore, by extracting the fit parameters between the user characteristic data and the processor and setting these parameters by configuring the model, the data conversion process can be optimized, the data conversion time interval can be determined, and the stability and compatibility of the system can be improved. Specific technical effects also need to be evaluated and verified according to the details of the actual implementation and application.

Meanwhile, the speed of data conversion can be controlled by setting the data conversion time interval corresponding to each target data form. The value of T_i may be adjusted according to the actual requirements and the processing power of the processor, to ensure that the amount of data processed per unit time does not exceed the maximum data conversion amount C_zi. Thus, data accumulation and processing delay can be avoided, and the timeliness and stability of data conversion are maintained.

By this method, the rationality of the data conversion time interval set for each target data form can be effectively improved, and real-time, efficient data processing can be achieved. Based on the average value C of the data amount of the user characteristic data generated per unit time and the maximum data conversion amount C_zi of each target data form, combined with the fit parameter W and the reference constant value W₀, the delay of data conversion can be reduced as far as possible while the accuracy of data conversion is ensured, improving the real-time performance and efficiency of data processing.

Meanwhile, the load of the system can be balanced by adjusting the data conversion time interval corresponding to each target data form. According to the average value C of the data amount of the user characteristic data generated per unit time and the maximum data conversion amount C_zi of each target data form, the value of T_i can be set reasonably to ensure that the data conversion process does not exceed the load capacity of the processor. This helps to avoid system crashes or processor overload, improving the stability and reliability of the system.

In S3, the user characteristic data with the distribution characteristics is analyzed and evaluated, and the following operations are executed:

acquiring user characteristic data with distribution characteristics;

In S4, the user characteristic data with the distribution characteristics is selected and controlled, and the following operations are executed:

In S4, a user characteristic selection strategy based on the analysis and evaluation report is determined, and the following operations are executed:

In S4, the user characteristic data is intelligently selected according to the user characteristic selection strategy, and the following operations are executed:

In S4, after the redundant user features are removed, the following operations are further performed:

In summary, the feature selection method based on screening machine learning users of the invention proceeds as follows. User registration information is collected in real time according to a user registration window, and user feature data is acquired in real time based on the user registration information. The user feature data acquired in real time is preprocessed: it is converted into a form that can be received by a processor based on machine learning user feature selection requirements, and the converted data is filtered, noise-reduced and sorted to determine the user feature data with distribution features. That data is then analyzed and evaluated: a user feature model is trained on it based on a linear model, a linear-model-based user feature selection model is determined, and a corresponding analysis and evaluation report is determined based on that model. Finally, a user feature selection strategy based on the analysis and evaluation report is determined by means of a data mining technology and a correlation analysis method, and the user feature data is intelligently selected according to that strategy. In this way, the screening precision and the user feature selection effect can be improved when user feature selection is performed based on machine learning.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A feature selection method based on screening machine learning users, comprising the steps of:

wherein, the user registration information is collected in real time according to the user registration window, and the following operations are executed:

acquiring a user registration window;

After the user registration information is input, acquiring the user registration information in real time, extracting the acquired user registration information, and determining user characteristic data;

Preprocessing the user characteristic data acquired in real time, and executing the following operations:

Based on an internal sorting method, user characteristic data of the determined noiseless characteristic data are effectively sorted, and user characteristic data with distributed characteristics are determined;

the method for converting the user characteristic data acquired in real time into a form which can be received by a processor comprises the following steps:

Determining a data conversion time interval corresponding to each target data form according to the fit parameters; determining a data conversion time interval corresponding to each target data form according to the fit parameters, wherein the data conversion time interval comprises the following steps:

Wherein, T _i represents a data conversion time interval corresponding to the ith target data form; t _d represents the corresponding time length of the unit time; c _zi represents the maximum data conversion amount corresponding to the ith target data form in unit time; c represents the average value of the data amount of the user characteristic data generated in unit time; w represents a fitting parameter; w ₀ represents the fitting parameter reference constant value;

the user characteristic data with the distributed characteristics are analyzed and evaluated, and the following operations are executed:

acquiring user characteristic data with distribution characteristics;

Aiming at the situation that the user characteristic data with the distributed characteristics does not contain redundant user characteristics, the determined analysis and evaluation report is that the user characteristic data does not need to be subjected to machine learning-based user characteristic selection processing;

S4: selecting and controlling the user characteristic data with the distribution characteristics, acquiring an analysis and evaluation report based on a user characteristic selection model, determining, based on a data mining technology and a correlation analysis method, a user characteristic selection strategy based on the analysis and evaluation report, and intelligently selecting the user characteristic data according to the user characteristic selection strategy;

The user characteristic data with the distribution characteristics are selected and controlled, and the following operations are executed:

Aiming at the situation that the analysis evaluation report is that the user characteristic data needs to be subjected to machine learning-based user characteristic selection processing, determining a user characteristic selection strategy based on the analysis evaluation report based on a data mining technology and an association analysis method, and intelligently selecting the user characteristic data according to the user characteristic selection strategy;

Determining a user feature selection policy based on the analysis evaluation report, performing the following operations:

Performing correlation analysis on the redundant user feature sets, determining a user feature selection strategy based on the redundant user feature sets, and performing intelligent selection on user feature data according to the user feature selection strategy;

Intelligent selection is carried out on the user characteristic data according to the user characteristic selection strategy, and the following operations are executed:

2. The feature selection method based on screening machine learning users according to claim 1, wherein: in the step S4, after the redundant user features are removed, the following operations are further performed: