CN117235480B - Screening method and system based on big data under data processing

Screening method and system based on big data under data processing

Info

Publication number
CN117235480B
CN117235480B
Authority
CN
China
Prior art keywords
sample set
data
neural network
network model
calculating
Prior art date
Legal status
Active
Application number
CN202311528107.9A
Other languages
Chinese (zh)
Other versions
CN117235480A (en)
Inventor
杨峰
王纪元
Current Assignee
Shenzhen Wugu Big Data Technology Co ltd
Original Assignee
Shenzhen Wugu Big Data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Wugu Big Data Technology Co ltd filed Critical Shenzhen Wugu Big Data Technology Co ltd
Priority to CN202311528107.9A
Publication of CN117235480A
Application granted
Publication of CN117235480B


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of data processing, and discloses a screening method and a screening system for realizing data processing based on big data, wherein the method comprises the following steps: carrying out data cleaning on the original data to obtain cleaned original data, constructing a built-in filtering function of the cleaned original data, and filtering the cleaned original data to obtain filtered original data; carrying out layered acquisition on the filtered original data to obtain a layered sample set; calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics, and determining target characteristics of the sample set characteristics; calculating updated model parameters of the initialized neural network model; generating an updated neural network model of the initialized neural network model, and screening out the target information in the layered sample set by using the updated neural network model. The invention can improve the screening effect of data processing.

Description

Screening method and system based on big data under data processing
Technical Field
The invention relates to the field of data processing, in particular to a screening method and a screening system for realizing data processing based on big data.
Background
Data processing refers to the process of processing raw data so as to better meet specific requirements or improve data quality, and can help organizations to better manage and utilize data and improve the quality and value of the data, thereby providing support and reference for enterprise decision making and business development.
The screening method under current data processing mainly comprises constructing a screening rule for the original data and screening the original data through the screening rule to obtain the required data. Because data attributes are diverse, such rules screen out only the data meeting specific conditions and ignore other potentially useful data. This may result in incomplete screening results that fail to reflect the actual situation of the data, and thus in a poor screening effect under current data processing.
Disclosure of Invention
The invention provides a screening method and a screening system under data processing based on big data, and mainly aims to improve the screening effect of data processing.
In order to achieve the above object, the present invention provides a screening method based on big data under data processing, including:
acquiring original data, performing data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data;
Determining the data attribute of the filtered original data, and carrying out layered acquisition on the filtered original data according to the data attribute to obtain a layered sample set;
calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain;
calculating function output of a preset neural network model through the layered sample set and the target feature, calculating a function gradient value of an initialized neural network model corresponding to the neural network model based on the function output, and calculating updated model parameters of the initialized neural network model through the function gradient value;
when the updated model parameters reach the preset maximum iteration times, generating an updated neural network model of the initialized neural network model, calculating the recall rate of the updated neural network model, and when the recall rate meets the requirements, screening out target information in the layered sample set by using the updated neural network model.
Optionally, the performing data cleansing on the raw data to obtain cleansing raw data includes:
identifying invalid data in the original data;
removing the invalid data to obtain valid original data;
retrieving abnormal data in the effective original data;
performing data replacement on the abnormal data to obtain normal original data;
identifying inconsistent data of the normal original data;
according to the inconsistent data, formulating a data cleaning rule of the normal original data;
and carrying out data cleaning on the inconsistent data through the data cleaning rule to obtain the cleaning original data.
Optionally, the performing layered collection on the filtered raw data according to the data attribute to obtain a layered sample set includes:
determining layering characteristics of the filtered original data according to the data attributes;
determining the layering level of the filtered original data through the layering characteristics;
marking the number of the hierarchical samples of the hierarchy corresponding to the hierarchical level;
and carrying out layered acquisition on the filtered original data based on the layered progression and the layered sampling quantity to obtain the layered sample set.
Optionally, the calculating the sample set information entropy of the hierarchical sample set includes:
determining class labels of the hierarchical sample set;
identifying the number of category samples of the category label in the hierarchical sample set;
calculating the class proportion of the class label in the layered sample set according to the number of the class samples;
and calculating the sample set information entropy of the layered sample set through the category proportion.
Optionally, the calculating, by the class proportion, the sample set information entropy of the hierarchical sample set includes:
calculating the sample set information entropy of the layered sample set by using the following formula according to the category proportion:
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k, \quad p_k = P(C_k) = \frac{|C_k|}{|D|}

wherein H(D) represents the sample set information entropy (the initial information entropy); p_k represents the category proportion of the hierarchical sample set corresponding to the k-th category label; |D| represents the data quantity of the feature data set; \log_2 denotes taking the base-2 logarithm of the probability p_k; C_k represents the k-th category label of the hierarchical sample set; and P represents a probability function.
Optionally, the calculating the feature condition entropy of the sample set feature includes:
calculating a sample set feature value of the sample set feature;
calculating the conditional probability of the sample set characteristic value;
Analyzing class information entropy of the sample set characteristic value through the conditional probability;
and carrying out weighted summation on the class information entropy to obtain the characteristic condition entropy of the sample set characteristic.
Optionally, said calculating a function output of said neural network model from said hierarchical sample set and said target feature comprises:
initializing the weight and bias of the neural network model to obtain an initialized neural network model;
identifying neurons of the initialized neural network model;
calculating a neuron output of the neuron from the hierarchical sample set and the target feature;
and determining a function output of the initialized neural network model through the neuron output.
Optionally, the calculating, based on the function output, a function gradient value of the neural network model corresponding to the initialized neural network model includes:
determining a loss function of the initialized neural network model through the function output;
marking model parameters of the initialized neural network model;
calculating the function gradient value of the loss function to the model parameters by using the following formula:
\nabla_{\theta}L = \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial x}\cdot\frac{\partial x}{\partial \theta}

wherein \nabla_{\theta}L represents the function gradient value of the loss function with respect to the model parameters; x represents the data set, consisting of the hierarchical sample set and the target features, that is input into the initialized neural network model; \theta represents the model parameters of the initialized neural network model; \partial represents the partial derivative; L represents the loss function; \frac{\partial L}{\partial x} represents the partial derivative of the loss function with respect to the input x; and \frac{\partial x}{\partial \theta} represents the partial derivative of the input x with respect to the model parameters.
Optionally, the calculating, by using the function gradient value, updated model parameters of the initialized neural network model includes:
identifying a learning rate of the initialized neural network model;
calculating updated model parameters of the initialized neural network model by using the following formulas through the function gradient values and the learning rate:
\theta' = \theta - \eta\,\nabla_{\theta}L

wherein \theta' represents the updated model parameters of the initialized neural network model; \theta represents the model parameters corresponding to the initialized neural network model; \eta represents the learning rate of the initialized neural network model; and \nabla_{\theta}L represents the function gradient value of the loss function of the initialized neural network model with respect to the model parameters.
In order to solve the above problems, the present invention further provides a screening system for implementing data processing based on big data, the system comprising:
the data processing module is used for acquiring original data, carrying out data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data;
The data layering module is used for determining the data attribute of the filtered original data, and carrying out layering acquisition on the filtered original data according to the data attribute to obtain a layering sample set;
the characteristic selection module is used for calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain;
the model parameter updating module is used for calculating the function output of a preset neural network model through the layered sample set and the target feature, calculating the function gradient value of the initialized neural network model corresponding to the neural network model based on the function output, and calculating the updated model parameter of the initialized neural network model through the function gradient value;
and the target information screening module is used for generating an updated neural network model of the initialized neural network model when the updated model parameters reach the preset maximum iteration times, calculating the recall rate of the updated neural network model, and screening target information in the layered sample set by using the updated neural network model when the recall rate meets the requirement.
According to the embodiments of the invention, performing data cleaning on the original data to obtain the cleaned original data yields a cleaned data set that contains more accurate, consistent and reliable data, thereby providing a reliable basis for subsequent data analysis and modeling; further, according to the embodiment of the invention, collecting the filtered original data in a layered manner according to the data attributes yields a layered sample set, which ensures that each layer has enough samples and better represents the whole data set; further, according to the embodiment of the invention, determining the target features of the sample set features through the information gain makes it possible to select suitable features, which can improve the performance of the later model, reduce the risk of overfitting, improve the calculation efficiency, help interpret the model results, and improve data visualization. Therefore, the screening method and the screening system based on big data under data processing can improve the screening effect of data processing.
Drawings
Fig. 1 is a flow chart of a screening method based on big data under data processing according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a screening system for implementing data processing based on big data according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device for implementing a screening system under data processing based on big data according to an embodiment of the present invention;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a screening method based on big data under data processing. The main execution body of the screening method based on big data processing includes, but is not limited to, at least one of a server, a terminal and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the filtering method based on big data under data processing may be performed by software or hardware installed in a terminal device or a server device, where the software may be a blockchain platform. The service end includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a screening method based on big data processing according to an embodiment of the invention is shown. In this embodiment, the method for implementing screening under data processing based on big data includes:
s1, acquiring original data, performing data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data.
In the embodiment of the invention, the original data refers to a data set to be processed.
Furthermore, according to the embodiment of the invention, performing data cleaning on the original data yields a cleaned original data set that contains more accurate, consistent and reliable data, providing a reliable basis for subsequent data analysis and modeling. The cleaned original data refers to the data set obtained after operations such as removing duplicate data, handling missing values and correcting erroneous data have been performed on the original data.
As an embodiment of the present invention, the performing data cleaning on the raw data to obtain the cleaned raw data includes: identifying invalid data in the original data; removing the invalid data to obtain valid original data; retrieving abnormal data in the valid original data; performing data replacement on the abnormal data to obtain normal original data; identifying inconsistent data in the normal original data; formulating, according to the inconsistent data, a data cleaning rule for the normal original data; and performing data cleaning on the inconsistent data through the data cleaning rule to obtain the cleaned raw data.
The invalid data refers to data in the original data that has no practical value, such as duplicates and blanks; the abnormal data refers to data in the valid original data with anomalies such as missing values or unclear meaning; the inconsistent data refers to data in the normal original data with inconsistent spellings, letter cases, units and the like; and the data cleaning rule refers to a rule for unifying the inconsistent data.
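As an illustration of these cleaning steps, the following is a minimal Python sketch using the pandas library; the outlier handling (median substitution) and the unification mapping are hypothetical assumptions introduced for illustration, not part of the original disclosure.

```python
import pandas as pd

def clean_raw_data(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning sketch: invalid data -> abnormal data -> inconsistent data."""
    # 1. Remove invalid data: duplicates and completely blank rows.
    df = df.drop_duplicates().dropna(how="all")

    # 2. Replace abnormal data: fill missing numeric values with the column median
    #    (the substitution rule is an assumption).
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # 3. Unify inconsistent data with a simple cleaning rule
    #    (hypothetical example: different spellings of the same unit).
    unify_rule = {"KG": "kg", "Kg": "kg"}
    return df.replace(unify_rule)
```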
Furthermore, the embodiment of the invention can screen out the data needed in the cleaning original data by constructing the built-in filtering function of the cleaning original data, and can further improve the reliability of the data. The built-in filter function refers to a function capable of performing data filtering, and the built-in filter function can be realized through a filter script written in a Python language.
Further, in the embodiment of the present invention, the cleaned original data is filtered through the built-in filter function to obtain the filtered original data, that is, the target data retained after filtering. The filtered original data refers to the data set obtained after the filtering operation is performed on the cleaned original data, and the filtering of the cleaned original data can be achieved by running a filter program written in the Python language.
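A built-in filter function of this kind might look like the following Python sketch; the predicate used here (keeping records whose "value" field is positive) is a hypothetical example, not a rule taken from the disclosure.

```python
def build_filter(predicate):
    """Return a built-in filter function that keeps only records satisfying the predicate."""
    def filter_records(records):
        # Python's built-in filter() applies the predicate to every record.
        return list(filter(predicate, records))
    return filter_records

# Hypothetical usage: keep records whose "value" field is positive.
keep_positive = build_filter(lambda record: record.get("value", 0) > 0)
filtered_raw_data = keep_positive([{"value": 3}, {"value": -1}, {"value": 7}])
# filtered_raw_data == [{"value": 3}, {"value": 7}]
```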
S2, determining the data attribute of the filtered original data, and carrying out layered acquisition on the filtered original data according to the data attribute to obtain a layered sample set.
In the embodiment of the invention, the data attribute refers to the property and the characteristic used for representing the entity set. These attributes may be different types of data, character type, numeric type, date type, etc.
Further, according to the embodiment of the invention, the filtering original data is collected in a layering mode according to the data attribute, so that a layering sample set can be obtained, enough samples can be ensured for each level, and the whole data set can be better represented. The hierarchical sample set refers to a data set obtained by performing hierarchical sampling on the filtered original data.
As an embodiment of the present invention, the performing, according to the data attribute, hierarchical collection on the filtered raw data to obtain a hierarchical sample set includes: determining layering characteristics of the filtered original data according to the data attributes; determining the layering level of the filtered original data through the layering characteristics; marking the number of the hierarchical samples of the hierarchy corresponding to the hierarchical level; and carrying out layered acquisition on the filtered original data based on the layered progression and the layered sampling quantity to obtain the layered sample set.
The layering characteristics refer to characteristics such as gender, age and region; the number of layering levels refers to the number of layers into which the filtered original data is divided; and the layered sampling number refers to the number of data items collected for each layer.
Further, in an optional embodiment of the present invention, when the filtered original data is collected in a layered manner to obtain the layered sample set, simple random sampling may be used within each layer to ensure that every sample has an equal probability of being selected.
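The following Python sketch illustrates this kind of layered collection with simple random sampling inside each layer; the stratification column name and the per-layer sample counts are assumptions introduced for illustration.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, strat_col: str, n_per_layer: dict, seed: int = 42) -> pd.DataFrame:
    """Draw a simple random sample of the requested size from every layer."""
    layers = []
    for level, group in df.groupby(strat_col):
        n = min(n_per_layer.get(level, 0), len(group))
        # Simple random sampling: every sample in the layer has an equal selection probability.
        layers.append(group.sample(n=n, random_state=seed))
    return pd.concat(layers, ignore_index=True)

# Hypothetical usage: stratify the filtered data by an "age_group" attribute.
# hierarchical_sample_set = stratified_sample(filtered_df, "age_group",
#                                             {"young": 100, "middle": 100, "senior": 100})
```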
S3, calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain.
Further, the embodiment of the invention determines the uncertainty of the sample distribution of the layered sample set by calculating the sample set information entropy of the layered sample set. The sample set information entropy refers to an index for measuring uncertainty of sample distribution in the layered sample set.
As one embodiment of the present invention, the calculating the sample set information entropy of the hierarchical sample set includes: determining class labels of the hierarchical sample set; identifying the number of category samples of the category label in the hierarchical sample set; calculating the class proportion of the class label in the layered sample set according to the number of the class samples; and calculating the sample set information entropy of the layered sample set through the category proportion.
The class labels refer to the data classes of the hierarchical sample set, such as a personnel information class and a financial data class; the number of class samples refers to the number of data items corresponding to each class label; and the class proportion refers to the proportion of data of each class label in the hierarchical sample set.
Further, in an optional embodiment of the present invention, the calculating, by the class ratio, the sample set information entropy of the hierarchical sample set includes: calculating the sample set information entropy of the layered sample set by using the following formula according to the category proportion:
H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k, \quad p_k = P(C_k) = \frac{|C_k|}{|D|}

wherein H(D) represents the sample set information entropy (the initial information entropy); p_k represents the category proportion of the hierarchical sample set corresponding to the k-th category label; |D| represents the data quantity of the feature data set; \log_2 denotes taking the base-2 logarithm of the probability p_k; C_k represents the k-th category label of the hierarchical sample set; and P represents a probability function.
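A minimal Python sketch of this entropy calculation is given below; the "category" column name in the usage comment is a hypothetical assumption.

```python
import numpy as np
import pandas as pd

def sample_set_entropy(labels: pd.Series) -> float:
    """H(D) = -sum_k p_k * log2(p_k), with p_k the proportion of the k-th class label."""
    proportions = labels.value_counts(normalize=True).to_numpy()
    return float(-np.sum(proportions * np.log2(proportions)))

# Hypothetical usage on a hierarchical sample set with a "category" label column:
# h_d = sample_set_entropy(hierarchical_sample_set["category"])
```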
Further, the embodiment of the invention can provide a data basis for calculating the characteristic condition entropy in the later stage by identifying the sample set characteristics of the layered sample set. The sample set features refer to sample feature attributes of the layered sample set, and the sample set features can be realized through feature functions.
Further, calculating the feature condition entropy of the sample set features helps to better understand the relation between the features and the classification. The feature condition entropy is the information entropy of the sample set given a certain feature, and it is used to measure the degree of contribution of that feature to the classification of the sample set.
As one embodiment of the present invention, the calculating the feature condition entropy of the sample set feature includes: calculating a sample set feature value of the sample set feature; calculating the conditional probability of the sample set characteristic value; analyzing class information entropy of the sample set characteristic value through the conditional probability; and carrying out weighted summation on the class information entropy to obtain the characteristic condition entropy of the sample set characteristic.
Wherein the sample set feature values refer to the different values that a feature may take in the sample set. For example, suppose a hierarchical sample set contains a sample set feature "color" describing the color of a fruit; the sample set feature values of the feature "color" may then be "red", "yellow", "green" and so on. The conditional probability refers to the proportion (frequency) of each feature value in the hierarchical sample set, i.e., the number of samples with that feature value divided by the total number of samples, and the class information entropy refers to the degree of contribution of the sample set feature to the sample set at a given feature value.
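The conditional-entropy computation can be sketched in Python as follows; it reuses the sample_set_entropy helper from the entropy sketch above, and the "color" and "category" column names are hypothetical.

```python
import pandas as pd

def feature_conditional_entropy(df: pd.DataFrame, feature: str, label_col: str) -> float:
    """H(D|A) = sum_v P(A=v) * H(D_v): weighted sum of the per-value class entropies."""
    total = len(df)
    cond_entropy = 0.0
    for value, subset in df.groupby(feature):
        weight = len(subset) / total  # conditional probability (frequency) of this feature value
        cond_entropy += weight * sample_set_entropy(subset[label_col])  # class information entropy
    return cond_entropy

# Hypothetical usage: entropy of the class label conditioned on the "color" feature.
# h_d_given_color = feature_conditional_entropy(hierarchical_sample_set, "color", "category")
```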
Further, according to the embodiment of the invention, the information gain of the sample set features is calculated through the sample set information entropy and the feature condition entropy, thereby providing a data basis for feature selection at a later stage. Wherein the information gain refers to the increase in the amount of information obtained by classifying with a certain feature; it is obtained by subtracting the feature condition entropy from the sample set information entropy.
Further, according to the embodiment of the invention, the target features of the sample set features can be determined through the information gain so as to select suitable features, which can improve the performance of the later model, reduce the risk of overfitting, improve the calculation efficiency, help interpret the model results, and improve data visualization. The target features refer to the features screened out through the information gain; they may be selected by a preset information gain threshold, as illustrated in the sketch below.
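Combining the two quantities, a hedged sketch of the information-gain computation and the threshold-based target-feature selection (the threshold value 0.1 is an assumption) could be:

```python
import pandas as pd

def select_target_features(df: pd.DataFrame, features: list, label_col: str,
                           gain_threshold: float = 0.1) -> list:
    """Keep the features whose information gain IG(A) = H(D) - H(D|A) exceeds the threshold."""
    h_d = sample_set_entropy(df[label_col])  # helper from the entropy sketch above
    target_features = []
    for feature in features:
        gain = h_d - feature_conditional_entropy(df, feature, label_col)
        if gain >= gain_threshold:
            target_features.append(feature)
    return target_features
```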
S4, calculating function output of a preset neural network model through the layered sample set and the target feature, calculating a function gradient value of the neural network model corresponding to the initialized neural network model based on the function output, and calculating updated model parameters of the initialized neural network model through the function gradient value.
According to the embodiment of the invention, the function output of the initialized neural network model is calculated through the layered sample set and the target feature, so that a data basis can be provided for model performance evaluation in the later stage. The function output refers to an output value obtained by inputting the layered sample set and the target feature into the initialized neural network model.
As one embodiment of the present invention, said calculating a function output of said neural network model from said hierarchical sample set and said target feature comprises: initializing the weight and bias of the neural network model to obtain an initialized neural network model; identifying neurons of the initialized neural network model; calculating a neuron output of the neuron from the hierarchical sample set and the target feature; and determining a function output of the initialized neural network model through the neuron output.
The initialized neural network model refers to the neural network model whose weights and biases have been initialized, i.e., restored to an initial state; the neurons are the computing units of the initialized neural network model; and the neuron output is the output result of each neuron, obtained by passing the input features through the neural network.
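The initialization and forward (function-output) computation can be sketched as a small fully connected network in Python with NumPy; the layer sizes, the sigmoid activation and the random initialization scheme are assumptions, since the disclosure does not fix a specific architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(n_inputs: int, n_hidden: int, n_outputs: int) -> dict:
    """Initialize the weights and biases of a one-hidden-layer network."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_inputs, n_hidden)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_outputs)), "b2": np.zeros(n_outputs),
    }

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def forward(params: dict, x: np.ndarray) -> np.ndarray:
    """Function output: hidden-layer neuron outputs fed into the output layer."""
    hidden = sigmoid(x @ params["W1"] + params["b1"])     # neuron outputs
    return sigmoid(hidden @ params["W2"] + params["b2"])  # function output
```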
Further, according to the embodiment of the invention, the function gradient value of the initialized neural network model corresponding to the neural network model is calculated based on the function output and the preset real data labels, so that the gradient of the loss function can be used to optimize the model parameters and thereby improve the performance of the model in data screening. Wherein the gradient value refers to the derivative of the loss function at a certain point.
As one embodiment of the present invention, the calculating, based on the function output, a function gradient value of the neural network model corresponding to the initialized neural network model includes: determining a loss function of the initialized neural network model through the function output; marking model parameters of the initialized neural network model; calculating the function gradient value of the loss function to the model parameters by using the following formula:
\nabla_{\theta}L = \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial x}\cdot\frac{\partial x}{\partial \theta}

wherein \nabla_{\theta}L represents the function gradient value of the loss function with respect to the model parameters; x represents the data set, consisting of the hierarchical sample set and the target features, that is input into the initialized neural network model; \theta represents the model parameters of the initialized neural network model; \partial represents the partial derivative; L represents the loss function; \frac{\partial L}{\partial x} represents the partial derivative of the loss function with respect to the input x; and \frac{\partial x}{\partial \theta} represents the partial derivative of the input x with respect to the model parameters.
Wherein the loss function refers to a function that calculates a loss value between the function output and a true value.
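As a concrete, hedged example of such a gradient computation, the sketch below uses a single linear neuron with a mean-squared-error loss (both choices are assumptions; the disclosure does not fix the loss function or the architecture) and applies the chain rule to obtain the gradient of the loss with respect to the parameters.

```python
import numpy as np

def mse_loss(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Loss value between the function output and the true labels."""
    return float(np.mean((y_pred - y_true) ** 2))

def gradient(theta: np.ndarray, x: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Chain rule: dL/dtheta = dL/dy_pred * dy_pred/dtheta, for y_pred = x @ theta."""
    y_pred = x @ theta
    dL_dy = 2.0 * (y_pred - y_true) / len(y_true)  # derivative of the MSE loss w.r.t. the output
    return x.T @ dL_dy                             # dy_pred/dtheta = x
```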
Furthermore, according to the embodiment of the invention, calculating the updated model parameters of the initialized neural network model through the function gradient values can improve the evaluation performance of the model and the reliability of its screening results. Wherein the updated model parameters refer to the new parameters obtained after updating the model parameters.
As one embodiment of the present invention, the calculating, by using the function gradient value, updated model parameters of the initialized neural network model includes: identifying a learning rate of the initialized neural network model; calculating updated model parameters of the initialized neural network model by using the following formulas through the function gradient values and the learning rate:
\theta' = \theta - \eta\,\nabla_{\theta}L

wherein \theta' represents the updated model parameters of the initialized neural network model; \theta represents the model parameters corresponding to the initialized neural network model; \eta represents the learning rate of the initialized neural network model; and \nabla_{\theta}L represents the function gradient value of the loss function of the initialized neural network model with respect to the model parameters.
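A minimal Python sketch of this update step (plain gradient descent with a fixed learning rate; the learning-rate value is an assumption) is:

```python
import numpy as np

def update_parameters(theta: np.ndarray, grad: np.ndarray, learning_rate: float = 0.01) -> np.ndarray:
    """theta' = theta - eta * grad: step against the gradient of the loss."""
    return theta - learning_rate * grad

# Hypothetical usage together with the gradient sketch above:
# theta = update_parameters(theta, gradient(theta, x_batch, y_batch), learning_rate=0.01)
```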
And S5, when the updated model parameters reach the preset maximum iteration times, generating an updated neural network model of the initialized neural network model, calculating the recall rate of the updated neural network model, and when the recall rate meets the requirement, screening out target information in the layered sample set by using the updated neural network model.
In the embodiment of the invention, the updated neural network model is obtained by continuously updating the model parameters of the initialized neural network model and stopping the updates once the number of updates reaches the preset maximum number of iterations.
Further, the embodiment of the invention can evaluate the performance of the updated neural network model by calculating its recall rate. Wherein the recall rate reflects how completely the updated neural network model retrieves the target data; it can be obtained by dividing the number of correct results generated by the updated neural network model by the total number of results that should have been retrieved.
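Using the standard definition of recall (true positives divided by all actual positives, which is an assumption since the disclosure describes the ratio only loosely), the evaluation can be sketched as:

```python
import numpy as np

def recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Recall = true positives / (true positives + false negatives)."""
    true_positive = np.sum((y_pred == 1) & (y_true == 1))
    actual_positive = np.sum(y_true == 1)
    return float(true_positive / actual_positive) if actual_positive else 0.0

# Hypothetical acceptance check (the 0.9 threshold is an assumption):
# if recall(y_true, y_pred) >= 0.9:
#     target_info = hierarchical_sample_set[y_pred == 1]
```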
Further, according to the embodiment of the invention, when the recall rate meets the requirement, the target information in the layered sample set is screened out by utilizing the updated neural network model, so that the important target information in the layered sample data can be analyzed and screened through the model, and the effect of target data screening under data processing is improved. The target information refers to requirement information obtained after a series of operations are performed on the layered sample set.
According to the embodiments of the invention, performing data cleaning on the original data to obtain the cleaned original data yields a cleaned data set that contains more accurate, consistent and reliable data, thereby providing a reliable basis for subsequent data analysis and modeling; further, according to the embodiment of the invention, collecting the filtered original data in a layered manner according to the data attributes yields a layered sample set, which ensures that each layer has enough samples and better represents the whole data set; further, according to the embodiment of the invention, determining the target features of the sample set features through the information gain makes it possible to select suitable features, which can improve the performance of the later model, reduce the risk of overfitting, improve the calculation efficiency, help interpret the model results, and improve data visualization. Therefore, the screening method based on big data under data processing can improve the screening effect of data processing.
Fig. 2 is a functional block diagram of a screening system for implementing data processing based on big data according to an embodiment of the present invention.
The screening system 200 for realizing data processing based on big data can be installed in electronic equipment. Depending on the implementation, the screening system 200 for implementing data processing based on big data may include a data processing module 201, a data layering module 202, a feature selection module 203, a model parameter updating module 204, and a target information screening module 205. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the data processing module 201 is configured to obtain original data, perform data cleaning on the original data to obtain cleaned original data, construct a built-in filtering function of the cleaned original data, and filter the cleaned original data through the built-in filtering function to obtain filtered original data;
the data layering module 202 is configured to determine a data attribute of the filtered original data, and perform layering acquisition on the filtered original data according to the data attribute to obtain a layering sample set;
The feature selection module 203 is configured to calculate a sample set information entropy of the layered sample set, identify sample set features of the layered sample set, calculate a feature condition entropy of the sample set features, calculate an information gain of the sample set features by using the sample set information entropy and the feature condition entropy, and determine target features of the sample set features by using the information gain;
the model parameter updating module 204 is configured to calculate a function output of a preset neural network model according to the hierarchical sample set and the target feature, calculate a function gradient value of the neural network model corresponding to an initialized neural network model based on the function output, and calculate an updated model parameter of the initialized neural network model according to the function gradient value;
the target information screening module 205 is configured to generate an updated neural network model of the initialized neural network model when the updated model parameter reaches a preset maximum iteration number, calculate a recall rate of the updated neural network model, and screen target information in the hierarchical sample set by using the updated neural network model when the recall rate meets a requirement.
In detail, each module in the screening system 200 for implementing data processing based on big data in the embodiment of the present invention adopts the same technical means as the screening method for implementing data processing based on big data in the drawings when in use, and can produce the same technical effects, which are not described herein.
The embodiment of the invention provides electronic equipment for realizing a screening method under data processing based on big data.
Referring to fig. 3, the electronic device may include a processor 30, a memory 31, a communication bus 32, and a communication interface 33, and may further include a computer program stored in the memory 31 and executable on the processor 30, such as a screening method program under data processing based on big data.
The processor may be formed by an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed by a plurality of integrated circuits packaged with the same function or different functions, including one or more central processing units (Central Processing Unit, CPU), a microprocessor, a digital processing chip, a graphics processor, a combination of various control chips, and the like. The processor is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, executes or executes programs or modules stored in the memory (e.g., executes a filter program under data processing based on big data, etc.), and invokes data stored in the memory to perform various functions of the electronic device and process the data.
The memory includes at least one type of readable storage medium including flash memory, removable hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device. Further, the memory may also include both internal storage units and external storage devices of the electronic device. The memory may be used not only for storing application software installed in an electronic device and various types of data, for example, code based on a filter program under data processing based on big data, etc., but also for temporarily storing data that has been output or is to be output.
The communication bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory and at least one processor or the like.
The communication interface is used for communication between the electronic equipment and other equipment, and comprises a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), or alternatively a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device and for displaying a visual user interface.
For example, although not shown, the electronic device may further include a power source (such as a battery) for powering the respective components, and preferably, the power source may be logically connected to the at least one processor through a power management system, so as to perform functions of charge management, discharge management, and power consumption management through the power management system. The power supply may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device may further include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described herein.
It should be understood that the embodiments described are for illustrative purposes only and do not constitute a limitation on the scope of the patent application.
The filter program stored in the memory of the electronic device and used for realizing data processing based on big data is a combination of a plurality of instructions, and when the filter program runs in the processor, the filter program can realize:
acquiring original data, performing data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data;
determining the data attribute of the filtered original data, and carrying out layered acquisition on the filtered original data according to the data attribute to obtain a layered sample set;
calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain;
calculating function output of a preset neural network model through the layered sample set and the target feature, calculating a function gradient value of an initialized neural network model corresponding to the neural network model based on the function output, and calculating updated model parameters of the initialized neural network model through the function gradient value;
When the updated model parameters reach the preset maximum iteration times, generating an updated neural network model of the initialized neural network model, calculating the recall rate of the updated neural network model, and when the recall rate meets the requirements, screening out target information in the layered sample set by using the updated neural network model.
Specifically, the specific implementation method of the above instruction by the processor may refer to descriptions of related steps in the corresponding embodiment of the drawings, which are not repeated herein.
Further, the modules/units integrated in the electronic device, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or system capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
Acquiring original data, performing data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data;
determining the data attribute of the filtered original data, and carrying out layered acquisition on the filtered original data according to the data attribute to obtain a layered sample set;
calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain;
calculating function output of a preset neural network model through the layered sample set and the target feature, calculating a function gradient value of an initialized neural network model corresponding to the neural network model based on the function output, and calculating updated model parameters of the initialized neural network model through the function gradient value;
when the updated model parameters reach the preset maximum iteration times, generating an updated neural network model of the initialized neural network model, calculating the recall rate of the updated neural network model, and when the recall rate meets the requirements, screening out target information in the layered sample set by using the updated neural network model.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, system and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules is merely a logical function division, and other manners of division may be implemented in practice.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. Multiple units or systems as set forth in the system claims may also be implemented by means of one unit or system in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (4)

1. The screening method based on big data under the data processing is characterized by comprising the following steps:
acquiring original data, performing data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data;
determining the data attribute of the filtered original data, and carrying out layered acquisition on the filtered original data according to the data attribute to obtain a layered sample set;
calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain;
the calculating the sample set information entropy of the layered sample set includes: determining class labels of the hierarchical sample set; identifying the number of category samples of the category label in the hierarchical sample set; calculating the class proportion of the class label in the layered sample set according to the class sample number; calculating sample set information entropy of the layered sample set through the category proportion;
The class label refers to a data class of the hierarchical sample set, which includes: personnel information class and financial data class; the category sample number refers to the data number of the category label corresponding data; the class proportion refers to the data duty ratio of the class label in the layered sample set;
the calculating the feature condition entropy of the sample set feature comprises: calculating a sample set feature value of the sample set feature; calculating the conditional probability of the sample set characteristic value; analyzing class information entropy of the sample set characteristic value through the conditional probability; carrying out weighted summation on the category information entropy to obtain a characteristic condition entropy of the sample set characteristic;
the conditional probability refers to the proportion of the feature value in the hierarchical sample set; the class information entropy refers to the contribution degree of the sample set features to the sample set under the value of the sample set feature value;
understanding the relation between the features by calculating the feature condition entropy of the sample set features; the characteristic condition entropy is the information entropy of the sample set under a given certain characteristic condition and is used for measuring the contribution degree of the characteristics to the classification of the sample set;
determining target characteristics of the sample set characteristics through the information gain, selecting proper characteristics to improve the later model performance, reduce the risk of overfitting, improve the calculation efficiency, explain the model results and improve the data visualization;
Calculating function output of a preset neural network model through the layered sample set and the target feature, calculating a function gradient value of an initialized neural network model corresponding to the neural network model based on the function output, and calculating updated model parameters of the initialized neural network model through the function gradient value;
the computing a functional output of the neural network model from the hierarchical sample set and the target feature comprises: initializing the weight and bias of the neural network model to obtain an initialized neural network model; identifying neurons of the initialized neural network model; calculating a neuron output of the neuron from the hierarchical sample set and the target feature; determining a function output of the initialized neural network model from the neuron outputs;
based on the function output, calculating a function gradient value of the neural network model corresponding to the initialized neural network model, including: determining a loss function of the initialized neural network model through the function output; marking model parameters of the initialized neural network model; calculating the function gradient value of the loss function to the model parameters by using the following formula:
\nabla_{\theta}L = \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial x}\cdot\frac{\partial x}{\partial \theta}

wherein \nabla_{\theta}L represents the function gradient value of the loss function with respect to the model parameters; x represents the data set, consisting of the hierarchical sample set and the target features, that is input into the initialized neural network model; \theta represents the model parameters of the initialized neural network model; \partial represents the partial derivative; L represents the loss function; \frac{\partial L}{\partial x} represents the partial derivative of the loss function with respect to the input x; and \frac{\partial x}{\partial \theta} represents the partial derivative of the input x with respect to the model parameters;
the calculating, by the function gradient value, updated model parameters of the initialized neural network model includes: identifying a learning rate of the initialized neural network model; calculating updated model parameters of the initialized neural network model by using the following formulas through the function gradient values and the learning rate:
\(\theta' = \theta - \eta \, \nabla_{\theta} L\)
wherein \(\theta'\) represents the updated model parameters of the initialized neural network model; \(\theta\) represents the model parameters corresponding to the initialized neural network model; \(\eta\) represents the learning rate of the initialized neural network model; and \(\nabla_{\theta} L\) represents the function gradient value of the corresponding loss function of the initialized neural network model with respect to the model parameters;
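This update rule is plain gradient descent. A minimal sketch on a toy quadratic loss, where the exact gradient is known (the learning rate and iteration count are illustrative assumptions):

```python
import numpy as np

def gradient_descent_step(theta, grad, learning_rate):
    """theta' = theta - eta * grad: one parameter update."""
    return theta - learning_rate * grad

# Toy quadratic loss L(theta) = ||theta - target||^2, so grad = 2 * (theta - target).
target = np.array([1.0, -2.0])
theta = np.zeros(2)
eta = 0.1                      # assumed learning rate
for _ in range(200):           # assumed maximum number of iterations
    grad = 2.0 * (theta - target)
    theta = gradient_descent_step(theta, grad, eta)
print("updated parameters:", theta)   # converges toward [1.0, -2.0]
```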
when the updating of the model parameters reaches the preset maximum number of iterations, generating an updated neural network model from the initialized neural network model, calculating the recall rate of the updated neural network model, and, when the recall rate meets the requirement, screening out the target information in the layered sample set by using the updated neural network model.
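A small sketch of this acceptance check, assuming recall is computed as true positives over true positives plus false negatives and that a fixed threshold (hypothetical here, not given in the patent) decides whether the updated model is used for screening:

```python
def recall(y_true, y_pred):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical validation labels and predictions of the updated model.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
r = recall(y_true, y_pred)
print(f"recall = {r:.2f}")

REQUIRED_RECALL = 0.7          # assumed acceptance threshold
if r >= REQUIRED_RECALL:
    # Screen target information: keep samples the updated model flags as positive.
    samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
    target_info = [s for s, p in zip(samples, y_pred) if p == 1]
    print("screened target information:", target_info)
```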
2. The screening method for implementing data processing based on big data according to claim 1, wherein the performing data cleaning on the original data to obtain the cleaned original data includes the following steps (a code sketch follows the list):
identifying invalid data in the original data;
removing the invalid data to obtain valid original data;
detecting abnormal data in the valid original data;
performing data replacement on the abnormal data to obtain normal original data;
identifying inconsistent data in the normal original data;
formulating a data cleaning rule for the normal original data according to the inconsistent data;
and carrying out data cleaning on the inconsistent data through the data cleaning rule to obtain the cleaned original data.
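A minimal sketch of these cleaning steps. The record fields ("value", "unit"), the plausible-range rule for abnormal values and the unit-spelling rule for inconsistent data are all illustrative assumptions, not the patent's actual rules:

```python
def clean(raw_records):
    """Sketch of the cleaning steps in claim 2 with assumed, illustrative rules."""
    # 1. Remove invalid data (records missing the value entirely).
    valid = [r for r in raw_records if r.get("value") is not None]

    # 2. Replace abnormal data (values outside an assumed plausible range)
    #    with the median of the in-range values.
    in_range = sorted(r["value"] for r in valid if 0 <= r["value"] <= 1000)
    median = in_range[len(in_range) // 2] if in_range else 0.0
    for r in valid:
        if not (0 <= r["value"] <= 1000):
            r["value"] = median

    # 3. Resolve inconsistent data via a cleaning rule
    #    (here: normalize unit spellings to one canonical form).
    unit_rule = {"KG": "kg", "Kg": "kg", "kg": "kg"}
    for r in valid:
        r["unit"] = unit_rule.get(r.get("unit", "kg"), "kg")
    return valid

raw = [
    {"value": 12.0, "unit": "KG"},
    {"value": None, "unit": "kg"},       # invalid -> removed
    {"value": 99999.0, "unit": "Kg"},    # abnormal -> replaced by median
    {"value": 7.5, "unit": "kg"},
]
print(clean(raw))
```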
3. The screening method for implementing data processing based on big data according to claim 1, wherein the performing layered acquisition on the filtered original data according to the data attribute to obtain the layered sample set includes the following steps (a code sketch follows the list):
determining the layering features of the filtered original data according to the data attributes;
determining the number of layering levels of the filtered original data through the layering features;
marking the number of samples to be collected for the level corresponding to each layering level;
and carrying out layered acquisition on the filtered original data based on the number of layering levels and the marked sample numbers to obtain the layered sample set.
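A sketch of this layered (stratified) acquisition. The attribute name "tier", the per-level sample counts and the toy data are hypothetical, used only to show the mechanics:

```python
import random

def stratified_sample(records, strat_key, counts, seed=0):
    """Group records by the layering feature, then draw the marked number of
    samples from each level (counts: level -> sample size)."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(r[strat_key], []).append(r)
    sample = []
    for level, n in counts.items():
        pool = strata.get(level, [])
        sample.extend(rng.sample(pool, min(n, len(pool))))
    return sample

# Assumed data attribute "tier" defines the layering; counts are illustrative.
data = [{"id": i, "tier": "A" if i % 3 == 0 else ("B" if i % 3 == 1 else "C")}
        for i in range(30)]
layered = stratified_sample(data, strat_key="tier", counts={"A": 3, "B": 2, "C": 2})
print(layered)
```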
4. A screening system for implementing data processing based on big data, for performing the screening method for implementing data processing based on big data according to any one of claims 1 to 3, the system comprising the following modules (an illustrative wiring of the modules follows the list):
the data processing module is used for acquiring original data, carrying out data cleaning on the original data to obtain cleaned original data, constructing a built-in filter function of the cleaned original data, and filtering the cleaned original data through the built-in filter function to obtain filtered original data;
the data layering module is used for determining the data attribute of the filtered original data, and carrying out layering acquisition on the filtered original data according to the data attribute to obtain a layering sample set;
the characteristic selection module is used for calculating sample set information entropy of the layered sample set, identifying sample set characteristics of the layered sample set, calculating characteristic condition entropy of the sample set characteristics, calculating information gain of the sample set characteristics through the sample set information entropy and the characteristic condition entropy, and determining target characteristics of the sample set characteristics through the information gain;
the model parameter updating module is used for calculating the function output of a preset neural network model through the layered sample set and the target features, calculating the function gradient value of the initialized neural network model corresponding to the neural network model based on the function output, and calculating the updated model parameters of the initialized neural network model through the function gradient value;
and the target information screening module is used for generating an updated neural network model of the initialized neural network model when the updated model parameters reach the preset maximum iteration times, calculating the recall rate of the updated neural network model, and screening target information in the layered sample set by using the updated neural network model when the recall rate meets the requirement.
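The four modules above could be wired into a simple pipeline along the following lines; the class name, stage names and stage bodies are hypothetical placeholders, not the patent's implementation:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class ScreeningPipeline:
    """Illustrative wiring of the four modules in claim 4; each stage is a
    callable supplied by the corresponding module (placeholders here)."""
    data_processing: Callable[[List[Any]], List[Any]]
    data_layering: Callable[[List[Any]], List[Any]]
    feature_selection: Callable[[List[Any]], List[str]]
    model_update_and_screen: Callable[[List[Any], List[str]], List[Any]]

    def run(self, raw_data: List[Any]) -> List[Any]:
        filtered = self.data_processing(raw_data)       # clean + filter
        layered = self.data_layering(filtered)          # layered acquisition
        targets = self.feature_selection(layered)       # entropy / gain selection
        return self.model_update_and_screen(layered, targets)  # train + screen

# Placeholder stages, just to show the data flow between modules.
pipeline = ScreeningPipeline(
    data_processing=lambda d: [x for x in d if x is not None],
    data_layering=lambda d: d,
    feature_selection=lambda d: ["feature_1"],
    model_update_and_screen=lambda d, f: d[:2],
)
print(pipeline.run([1, None, 2, 3]))
```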
CN202311528107.9A 2023-11-16 2023-11-16 Screening method and system based on big data under data processing Active CN117235480B (en)

Publications (2)

Publication Number Publication Date
CN117235480A (en) 2023-12-15
CN117235480B (en) 2024-02-13

Family

ID=89086614


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677791A (en) * 2015-12-31 2016-06-15 新疆金风科技股份有限公司 Method and system used for analyzing operating data of wind generating set
CN113435521A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Neural network model training method and device and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140043184A (en) * 2012-09-28 2014-04-08 한국전자통신연구원 Apparatus and method for forecasting an energy comsumption

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Soft sensor modeling based on dynamic selective ensemble neural networks; Xia Luyue et al.; Computers and Applied Chemistry; 2016-02-28; Vol. 33, No. 02; pp. 163-167 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant