CN117909333A - Screening method and system for realizing data based on big data combined with artificial intelligence - Google Patents

Publication number
CN117909333A
Authority
CN
China
Prior art keywords
data
screening
feature
optimization
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410145704.1A
Other languages
Chinese (zh)
Inventor
黄泽文
陈军勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tianpu Technology Co ltd
Original Assignee
Shenzhen Tianpu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data screening, and provides a screening method and a screening system for realizing data based on big data combined with artificial intelligence, wherein the method comprises the following steps: obtaining initial data to be screened, carrying out data cleaning on the initial data, carrying out standardization processing on the initial cleaning data, extracting standard data characteristics, identifying multiple collinearity, calculating the characteristic similarity, carrying out data characteristic marking on the standard characteristics to obtain marked characteristic data, constructing a marked database, training a preset data screening model, classifying and initially screening the standardized data by using the model, identifying the data type of the initially screened data, carrying out optimization verification on the data type, obtaining optimization verification data, inquiring corresponding optimization indexes, optimizing the model based on the optimization indexes, obtaining an optimization model, and carrying out secondary screening on the initially screened data by using the optimization model to obtain required screening data. The invention can improve the screening accuracy of data screening.

Description

Screening method and system for realizing data based on big data combined with artificial intelligence
Technical Field
The invention relates to the field of data screening, in particular to a method and a system for realizing data screening based on big data combined with artificial intelligence.
Background
With the rapid development of information technology, big data and artificial intelligence are widely used in various fields. In terms of data processing, big data and artificial intelligence can effectively screen and analyze huge data sets, extracting useful information and patterns.
At present, data screening is commonly implemented with statistical methods and machine learning methods. Statistical methods screen data by analyzing its statistical characteristics and patterns, and can process large-scale data. Machine learning methods learn screening rules from a large amount of labeled data and can automatically adapt to changes in the data. However, when the rules are incomplete or inaccurate, these methods tend to yield poor screening accuracy. A screening method and system for realizing data based on big data combined with artificial intelligence are therefore needed to improve the screening accuracy of data screening.
Disclosure of Invention
The invention provides a screening method and a system for realizing data based on big data combined with artificial intelligence, which mainly aim at improving screening accuracy of data screening.
In order to achieve the above object, the method for screening data based on big data combined with artificial intelligence provided by the invention comprises the following steps:
acquiring initial data to be screened, performing data cleaning on the initial data to obtain initial cleaning data, and performing standardized processing on the initial cleaning data to obtain standardized data;
extracting features of the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features;
Determining screening requirements corresponding to the standardized data based on the feature similarity, and marking the data features of the reference features based on the screening requirements to obtain marked feature data;
Constructing a labeling database corresponding to the labeling feature data, performing model training on a preset data screening model by using data in the labeling database to obtain a trained data screening model, and performing classification primary screening on the standardized data by using the trained data screening model to obtain primary screening type data;
Identifying the data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring optimization indexes corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization indexes to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the initial data.
Optionally, the performing data cleansing on the initial data to obtain initial cleansing data includes:
Identifying missing values in the initial data;
performing function statistics on the missing values to obtain number entries corresponding to the missing values;
based on the number entries, interpolating the initial data by the missing values to obtain filling data;
And taking the filling data as initial cleaning data corresponding to the initial data.
Optionally, the normalizing the initial cleaning data to obtain normalized data includes:
Loading the initial cleaning data into a preset database, and identifying a data column corresponding to the data in the database;
calculating the average value and standard deviation of the data columns;
Calculating the deviation degree corresponding to the data column based on the average value and the standard deviation;
and scaling and normalizing the initial cleaning data based on the deviation degree to obtain normalized data.
Optionally, the identifying the multiple collinearity corresponding to the standard data feature includes:
Inquiring characteristic parameters corresponding to the standard data characteristics, and calculating a correlation coefficient matrix between the characteristic parameters;
Identifying a corresponding variance expansion factor in the correlation coefficient matrix;
and determining the multiple collinearity corresponding to the standard data characteristic based on the variance expansion factor.
Optionally, the calculating the feature similarity between the reference features includes:
calculating the feature similarity between the reference features using the following formula:
wherein Tz represents the feature similarity between the reference features, n represents the number of elements in the feature vectors corresponding to the reference features, i represents the index within the feature vectors, x_i represents the i-th element of the feature vector corresponding to feature x, and y_i represents the i-th element of the feature vector corresponding to feature y.
Optionally, the performing data feature labeling on the reference feature based on the screening requirement to obtain labeled feature data includes:
Determining feature data points in the reference feature based on the screening requirements;
extracting sample data points corresponding to the characteristic data points;
Carrying out data point integration on the sample data points to obtain an integrated data matrix;
And marking the data characteristics of the data in the integrated data matrix to obtain marked characteristic data.
Optionally, the identifying the data category corresponding to the primary screening class data includes:
identifying a corresponding data form in the primary screening class data;
Extracting classification features in the primary screening class data based on the data form;
performing feature coding on the classification features to obtain feature classification codes;
and performing category decoding on the characteristic classification codes to obtain data categories corresponding to the primary screening category data.
Optionally, the calculating, based on the subset parameters, classification accuracy corresponding to the primary screening class data includes:
Calculating the classification accuracy corresponding to the primary screening type data by using the following formula:
wherein Zq represents the classification accuracy corresponding to the primary screening class data, m represents the number of data samples corresponding to the subset parameters, and f(i) represents the classification accuracy of the i-th data sample.
Optionally, the querying the optimization index corresponding to the optimization verification data includes:
Determining a corresponding optimization target in the optimization verification data;
Identifying verification data corresponding to the optimized verification data based on the optimization target;
Based on the verification data, calculating a verification index corresponding to the optimized verification data;
And performing index optimization on the verification index to obtain an optimization index corresponding to the optimization verification data.
In order to solve the above problems, the present invention further provides a screening system for implementing data based on big data in combination with artificial intelligence, the system comprising:
the standardized module is used for acquiring initial data to be screened, carrying out data cleaning on the initial data to obtain initial cleaning data, and carrying out standardized processing on the initial cleaning data to obtain standardized data;
The similarity calculation module is used for carrying out feature extraction on the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features;
the feature labeling module is used for determining screening requirements corresponding to the standardized data based on the feature similarity, and labeling the data features of the reference features based on the screening requirements to obtain labeled feature data;
The classification primary screening module is used for constructing a labeling database corresponding to the labeling feature data, carrying out model training on a preset data screening model by utilizing data in the labeling database to obtain a trained data screening model, and carrying out classification primary screening on the standardized data by utilizing the trained data screening model to obtain primary screening type data;
the model optimizing module is used for identifying the data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring optimization indexes corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization indexes to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the initial data.
By acquiring the initial data to be screened and performing data cleaning on it to obtain initial cleaning data, the invention can improve data quality and the accuracy of analysis results, reduce processing time and resource consumption, protect data privacy and security, and optimize the data structure and format, thereby laying a good foundation for subsequent data processing and analysis. The screening method and system for realizing data based on big data combined with artificial intelligence provided by the invention can therefore improve the screening accuracy of data screening.
Drawings
FIG. 1 is a flow chart of a screening method based on big data combined with artificial intelligence according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a screening system for implementing data based on big data in combination with artificial intelligence according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal structure of an electronic device for implementing the screening method for realizing data based on big data combined with artificial intelligence according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a screening method for realizing data based on big data combined with artificial intelligence. The executing entity of the method includes at least one of a server, a terminal, or another electronic device that can be configured to execute the method provided by the embodiment of the application. In other words, the method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform. The server side includes but is not limited to: a single server, a server cluster, a cloud server, or a cloud server cluster. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a screening method for implementing data based on big data combined with artificial intelligence according to an embodiment of the invention is shown. In this embodiment, the method for screening data based on big data combined with artificial intelligence includes:
S1, acquiring initial data to be screened, performing data cleaning on the initial data to obtain initial cleaning data, and performing standardization processing on the initial cleaning data to obtain standardized data.
According to the invention, the initial data to be screened is obtained, and the data is cleaned to obtain the initial cleaning data, so that the data quality and the accuracy of analysis results can be improved, the processing time and the resource consumption are reduced, the data privacy and safety are protected, and the data structure and format are optimized, thereby laying a good foundation for the subsequent data processing and analysis work.
Wherein the initial data refers to an original data set obtained before data cleaning is performed; the initial cleaning data refers to a data set obtained after preprocessing and cleaning the initial data.
As an embodiment of the present invention, the performing data cleansing on the initial data to obtain initial cleansing data includes: identifying missing values in the initial data; performing function statistics on the missing values to obtain number entries corresponding to the missing values; based on the number entries, interpolating the initial data by the missing values to obtain filling data; and taking the filling data as initial cleaning data corresponding to the initial data.
Wherein, the missing value refers to the condition that the observed value of one or some variables in the data is missing or not recorded; the number entry is the number of records or samples contained in the data set; the padding data is data obtained by replacing or interpolating the missing value.
Further, the missing values may be identified with the pandas library, for example with the isna() function (the dropna() function removes the rows that contain them); the number of entries may be obtained with an entry statistics tool such as pandas, seaborn, or missingno; the filling data may be obtained with an interpolation algorithm such as linear interpolation, quadratic interpolation, or cubic spline interpolation.
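The cleaning steps above (identify missing values, count them, interpolate) can be sketched as follows. This is a minimal illustration, assuming a one-dimensional numeric series in which missing values are encoded as NaN and assuming linear interpolation; the helper name `fill_missing_linear` is hypothetical, and NumPy is used only for brevity.

```python
import numpy as np


def fill_missing_linear(values):
    """Count NaN entries and fill them by linear interpolation.

    Hypothetical helper: returns (filled_array, number_of_missing_entries).
    """
    arr = np.asarray(values, dtype=float)
    missing = np.isnan(arr)                 # step 1: identify missing values
    n_missing = int(missing.sum())          # step 2: count the missing entries
    if n_missing == 0 or n_missing == arr.size:
        # Nothing to fill, or no known points to interpolate from.
        return arr.copy(), n_missing
    idx = np.arange(arr.size)
    filled = arr.copy()
    # step 3: interpolate missing positions from the known neighbours;
    # np.interp holds the edge values constant outside the known range.
    filled[missing] = np.interp(idx[missing], idx[~missing], arr[~missing])
    return filled, n_missing


filled, n_missing = fill_missing_linear([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
print(n_missing, filled)
```

For tabular data, the same idea would be applied per column; a quadratic or spline scheme could replace the linear one without changing the surrounding steps.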
According to the invention, the initial cleaning data is standardized to obtain the standardized data, so that the performance and stability of a subsequent model can be improved, the training process is accelerated, the characteristic comparison and model interpretation are facilitated, and the complexity of the model is reduced.
The standardized data is data obtained by processing the original data according to a certain rule.
As an embodiment of the present invention, the normalizing the initial cleaning data to obtain normalized data includes: loading the initial cleaning data into a preset database; identifying a data column corresponding to the data in the database; calculating the average value and standard deviation of the data columns; calculating the deviation degree corresponding to the data column based on the average value and the standard deviation; and scaling and normalizing the initial cleaning data based on the deviation degree to obtain normalized data.
The preset database is a database which is preset and used for storing and managing data; the data column refers to a column or a field in the database and represents the same type of data; the average value refers to the sum of all values in the data column divided by the number of values; the standard deviation refers to a measure of the degree of dispersion of the values in the data columns; the deviation degree refers to the degree of difference between each value in the data series calculated based on the average value and the standard deviation and the average value.
Further, the preset database may be set up with an RDBMS tool such as MySQL Workbench or Oracle SQL Developer; the data columns may be obtained with a data collection tool such as Web Scraper or Octoparse; the average value and the standard deviation may be computed with statistical functions such as np.mean(), np.std(), and np.var(); the deviation degree may be obtained with a regression analysis method such as linear regression, polynomial regression, or logistic regression.
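The standardization step can be illustrated with a short sketch. It assumes that the "deviation degree" corresponds to the usual z-score, that is, each value minus the column mean, divided by the column standard deviation; this reading and the function name `standardize_columns` are assumptions made for illustration.

```python
import numpy as np


def standardize_columns(data):
    """Return column-wise z-scores: (value - column mean) / column std."""
    x = np.asarray(data, dtype=float)
    mean = x.mean(axis=0)                     # average value of each data column
    std = x.std(axis=0)                       # population standard deviation
    safe_std = np.where(std == 0, 1.0, std)   # constant columns map to 0
    return (x - mean) / safe_std


table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = standardize_columns(table)
print(z.mean(axis=0), z.std(axis=0))
```

After this transformation every non-constant column has mean 0 and standard deviation 1, so features measured on different scales become directly comparable.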
S2, carrying out feature extraction on the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features.
According to the invention, the standard data characteristics are obtained by extracting the characteristics of the standardized data, so that the consumption of computing resources and time can be reduced, the speed of data processing and model training is increased, and the efficiency of the whole data analysis flow is improved.
The standard data features are new feature information obtained by extracting features of the standardized data.
Optionally, the standard data features may be obtained with a feature selection algorithm such as the chi-square test or analysis of variance.
As one embodiment of the present invention, the identifying the multiple collinearity corresponding to the standard data feature includes: inquiring feature parameters corresponding to the standard data features; calculating a correlation coefficient matrix between the characteristic parameters; identifying a corresponding variance expansion factor in the correlation coefficient matrix; and determining the multiple collinearity corresponding to the standard data characteristic based on the variance expansion factor.
Wherein, the characteristic parameter refers to the numerical value or variable of each feature in the standard data; the correlation coefficient matrix refers to a matrix that presents the correlation coefficients among the characteristic parameters; the variance expansion factor (commonly called the variance inflation factor, VIF) refers to a statistical index that measures the degree of multiple collinearity; the multiple collinearity (multicollinearity) refers to a phenomenon in which independent variables are highly correlated with one another.
Further, the characteristic parameters may be obtained with statistical measures such as the Pearson and Spearman correlation coefficients; the correlation coefficient matrix may be computed with Python tools such as NumPy, pandas, or SciPy; the variance expansion factor may be calculated with a regression model such as OLS regression; the multiple collinearity may be assessed with methods such as Durbin-Watson statistics, principal component analysis, or ridge regression.
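The link between the correlation coefficient matrix and the variance expansion factor can be sketched as follows. The sketch relies on the standard identity that, for standardized features, the factor of feature j equals the j-th diagonal element of the inverse correlation matrix. The function names and the threshold of 10 (a common rule of thumb) are assumptions, not values taken from the invention.

```python
import numpy as np


def vif_from_correlation(corr):
    """VIF of each feature = diagonal of the inverse correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    return np.diag(np.linalg.inv(corr))


def collinear_features(data, threshold=10.0):
    """Return (vif, flags); flags marks features whose VIF exceeds threshold."""
    # Columns are features, rows are samples.
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    vif = vif_from_correlation(corr)
    return vif, vif > threshold


# Two features with correlation 0.9 give VIF = 1 / (1 - 0.9**2) for both.
print(vif_from_correlation([[1.0, 0.9], [0.9, 1.0]]))
```

Features flagged as collinear are candidates for removal or merging, which is one way the reference features could be chosen. Perfectly collinear features make the correlation matrix singular, so a practical implementation would need to handle that case separately.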
The method and the device determine the reference characteristics corresponding to the standard data characteristics based on the multiple collinearity, can simplify the subsequent model, reduce multiple interpretation, improve the prediction accuracy, improve the interpretation of the characteristics and improve the stability of the model, thereby being capable of helping better understand and utilize the data.
Wherein the baseline feature refers to a feature or attribute selected as a reference in the dataset.
Optionally, the reference feature may be obtained with a feature processing algorithm such as SIFT or HOG.
As an embodiment of the present invention, the calculating the feature similarity between the reference features includes:
calculating the feature similarity between the reference features using the following formula:
wherein Tz represents the feature similarity between the reference features, n represents the number of elements in the feature vectors corresponding to the reference features, i represents the index within the feature vectors, x_i represents the i-th element of the feature vector corresponding to feature x, and y_i represents the i-th element of the feature vector corresponding to feature y.
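For illustration only, a similarity between two feature vectors x and y of length n can be computed as below. Cosine similarity is used here as an assumption because it fits the symbols described (elements x_i and y_i indexed by i over n elements); it is not necessarily the measure used by the invention, and the function name is hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))        # sum of x_i * y_i
    norm_x = math.sqrt(sum(a * a for a in x))     # Euclidean length of x
    norm_y = math.sqrt(sum(b * b for b in y))     # Euclidean length of y
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values near 1 indicate features that point in nearly the same direction, while values near 0 indicate unrelated features.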
And S3, determining screening requirements corresponding to the standardized data based on the feature similarity, and marking the data features of the reference features based on the screening requirements to obtain marked feature data.
According to the invention, the screening requirements corresponding to the standardized data are determined based on the feature similarity, so that accurate matching can be provided, time and cost are saved, screening result quality is improved, and expandability and flexibility are provided.
Wherein, the screening requirement refers to the requirement of screening and filtering data according to specific conditions or requirements.
Optionally, the screening requirement may be obtained with a clustering algorithm such as k-means clustering.
According to the invention, based on the screening requirement, the data characteristic marking is carried out on the reference characteristic, so that marked characteristic data is obtained, the research range is reduced or redundant data is reduced, the data processing efficiency is improved, and meanwhile, the subsequent analysis, modeling or decision is more accurate and reliable.
The labeling of the characteristic data refers to assigning a label or tag to each sample or data point after screening the data based on the screening requirements.
As an embodiment of the present invention, the performing, based on the screening requirement, data feature labeling on the reference feature to obtain labeled feature data includes: determining feature data points in the reference feature based on the screening requirements; extracting sample data points corresponding to the characteristic data points; carrying out data point integration on the sample data points to obtain an integrated data matrix; and marking the data characteristics of the data in the integrated data matrix to obtain marked characteristic data.
Wherein, the characteristic data points are representative data points selected from the reference characteristics and are used for marking the characteristic data; the sample data points refer to sample data corresponding to the characteristic data points; the integrated data matrix refers to a matrix formed by integrating all sample data points according to a certain rule, wherein each row represents one sample data point, and each column represents one feature.
Further, the feature data points may be obtained with a dimension reduction algorithm such as PCA or LDA; the sample data points may be obtained with a generative model such as a Gaussian mixture model or a variational autoencoder; the integrated data matrix may be obtained with a feature selection algorithm such as the chi-square test, information gain, or LASSO regression.
S4, constructing a labeling database corresponding to the labeling feature data, performing model training on a preset data screening model by using data in the labeling database to obtain a trained data screening model, and performing classification primary screening on the standardized data by using the trained data screening model to obtain primary screening type data.
The method can provide various benefits such as data classification, model training, decision support and the like by constructing the annotation database corresponding to the annotation feature data, and is beneficial to optimizing a data screening flow and improving data processing efficiency.
The annotation database is a database storing a large amount of classification and annotation data, and can be used as a basis for training and verifying a data screening model.
Optionally, the annotation database may be built with an annotation tool such as Labelbox, Supervisely, or Dataturks.
According to the method, the model training is carried out on the preset data screening model by utilizing the data in the annotation database, so that the trained data screening model is obtained, the data screening accuracy can be improved, the data processing efficiency is improved, the repeatability and the consistency are realized, the large-scale data processing requirement can be effectively met, the iterative optimization of the model is supported, and the quality and the effect of data screening are further improved.
The preset data screening model is based on an SVM model and is used for screening, classifying or filtering data, can be designed according to task requirements and data characteristics, and improves screening accuracy and effect through training and optimization; the trained data screening model is a screening model obtained by training a preset data screening model by using data in a labeling database.
Optionally, the trained data screening model may be obtained with a model framework such as TensorFlow, PyTorch, or scikit-learn.
According to the invention, the standardized data is classified and primarily screened by using the trained data screening model to obtain primary screening data, so that the working efficiency can be improved, the accurate screening can be realized, the automatic flow can be realized, the expandability can be enhanced, the accuracy can be improved, and a beneficial foundation is provided for subsequent data processing and analysis.
The primary screening data are data obtained by primarily classifying standardized data through a trained data screening model.
Optionally, the primary screening class data may be obtained with Python libraries such as scikit-learn, TensorFlow, or PyTorch.
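Training an SVM-based screening model on labeled data and using it for the classification primary screening might look like the following sketch. It assumes scikit-learn is available and uses a linear-kernel `SVC`; the toy data, the label convention (1 = keep, 0 = discard), and the variable names are all hypothetical.

```python
from sklearn.svm import SVC

# Hypothetical labeled feature data: two features per sample.
labeled_features = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],   # label 0: discard
    [2.0, 2.1], [2.2, 1.9], [1.9, 2.3],   # label 1: keep
]
labels = [0, 0, 0, 1, 1, 1]

# Train the preset (SVM-based) data screening model on the labeled data.
model = SVC(kernel="linear")
model.fit(labeled_features, labels)

# Classify standardized data and keep only the rows predicted as class 1.
standardized_data = [[0.1, 0.1], [2.1, 2.0], [0.3, 0.2]]
predicted = model.predict(standardized_data)
primary_screened = [
    row for row, keep in zip(standardized_data, predicted) if keep == 1
]
print(primary_screened)
```

In practice the kernel and regularization settings would be tuned on held-out data, which is where the later optimization step would apply.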
S5, identifying a data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring an optimization index corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization index to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the primary data.
According to the invention, by identifying the data category corresponding to the primary screening type data, performance optimization can be performed according to specific requirements, and classification efficiency and accuracy are improved, so that a proper algorithm and data processing mode are selected.
Wherein, the data category refers to different categories or types obtained in the process of classifying or classifying the data according to the characteristics or other distinguishable factors.
As one embodiment of the present invention, the identifying the data category corresponding to the primary screening class data includes: identifying a corresponding data form in the primary screening class data; extracting classification features in the primary screening class data based on the data form; performing feature coding on the classification features to obtain feature classification codes; and performing category decoding on the characteristic classification codes to obtain data categories corresponding to the primary screening category data.
Wherein the data form refers to the type of the primary screening class data; the classification features are representative features extracted from the primary screening type data; the feature classification code refers to a process of encoding the extracted classification features, and the features are converted into a form which can be processed by a computer.
Further, the data form may be identified with a data recognition model such as a bag-of-words model, a TF-IDF model, or a Word2Vec model; the classification features may be extracted with a classification model such as a support vector machine or a random forest; the feature classification code may be obtained with a common encoding method such as one-hot encoding, label encoding, or ordinal encoding.
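The feature coding and category decoding steps can be illustrated with one-hot encoding, one of the encoding methods named above. This is a minimal sketch in plain Python; the function names and the example category labels are hypothetical.

```python
def build_codebook(categories):
    """Map each distinct category to a position, in sorted order."""
    return {cat: i for i, cat in enumerate(sorted(set(categories)))}


def one_hot_encode(category, codebook):
    """Encode a category as a vector with a single 1 at its position."""
    vector = [0] * len(codebook)
    vector[codebook[category]] = 1
    return vector


def decode_category(vector, codebook):
    """Recover the category whose position holds the largest value."""
    inverse = {i: cat for cat, i in codebook.items()}
    return inverse[vector.index(max(vector))]


codebook = build_codebook(["text", "image", "text", "audio"])
code = one_hot_encode("image", codebook)
print(code, decode_category(code, codebook))
```

Decoding by the largest entry also works when the vector holds per-class scores instead of an exact one-hot code.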
The invention carries out optimization verification on the primary screening data based on the data category to obtain the optimization verification data, which can help to determine the optimal feature coding method and parameter configuration, so that the classification data is more accurate and effective in the modeling process, thereby improving the accuracy of the model.
Wherein, the optimized verification data refers to a verified data set obtained by performing a series of optimization and verification steps on the primary screening data.
As an embodiment of the present invention, the performing optimization verification on the primary screening class data based on the data category to obtain optimization verification data includes: determining a classification data set corresponding to the primary screening class data based on the data class; performing variance analysis on the classified data set to obtain a data variance subset; extracting corresponding subset parameters in the data variance subset; calculating classification accuracy corresponding to the primary screening type data based on the subset parameters; and carrying out optimization verification on the primary screening data based on the classification accuracy to obtain optimization verification data.
The classification data set refers to dividing the primary screening type data into different data sets according to different type labels; the data variance subset is a subset formed by selecting features with significant differences according to the result of variance analysis; the subset parameters refer to parameters related to the features or the models extracted from the data variance subset; the classification accuracy refers to the proportion of the number of correctly predicted samples to the total number of samples when performing classification tasks.
Further, the classification data set may be obtained through a data processing model, such as: a decision tree, naive Bayes, logistic regression, etc.; the data variance subset may be obtained through a variance analysis algorithm, such as: one-way ANOVA, two-way ANOVA, etc.; the subset parameters may be obtained through a parameter extraction tool, such as: scikit-learn, TensorFlow, etc.; the classification accuracy can be obtained by the following calculation formula.
As one embodiment of the present invention, the calculating, based on the subset parameters, the classification accuracy corresponding to the primary screening class data includes:
Calculating the classification accuracy corresponding to the primary screening type data by using the following formula:
wherein Zq represents the classification accuracy corresponding to the primary screening class data, m represents the data sample corresponding to the subset parameter, and f(i) represents the classification accuracy of the i-th sample corresponding to the data sample.
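The formula itself is not reproduced in this text. One reading that agrees with the definition of classification accuracy given above (correctly predicted samples divided by the total number of samples) takes m as the number of samples and f(i) as 1 when the i-th sample is classified correctly and 0 otherwise, so that Zq = (1/m) Σ f(i) for i = 1..m. This reading is an assumption, not the formula as filed; the sketch below implements it with hypothetical labels.

```python
def classification_accuracy(predicted, actual):
    """Return Zq = (1/m) * sum of f(i), under the assumed reading of the formula."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    m = len(actual)
    # f(i) is 1 for a correctly classified sample and 0 otherwise (assumption).
    f = [1 if p == a else 0 for p, a in zip(predicted, actual)]
    return sum(f) / m


# Hypothetical screening decisions for four samples; three of four match.
zq = classification_accuracy(
    ["keep", "drop", "keep", "keep"],
    ["keep", "drop", "drop", "keep"],
)
```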
According to the invention, continuous optimization and iteration can be realized by inquiring the optimization index corresponding to the optimization verification data, and the accuracy and efficiency of data screening are continuously improved.
Wherein, the optimization index refers to an index or a metric used for evaluating and measuring the performance of the data screening model.
As one embodiment of the present invention, the querying the optimization index corresponding to the optimization verification data includes: determining a corresponding optimization target in the optimization verification data; identifying verification data corresponding to the optimized verification data based on the optimization target; based on the verification data, calculating a verification index corresponding to the optimized verification data; and performing index optimization on the verification index to obtain an optimization index corresponding to the optimization verification data.
Wherein, the optimization target refers to a specific target or index to be realized in the optimization process; the verification data refers to a data set for evaluating and verifying the performance of a model, an algorithm or a system in the optimization process; the verification index refers to a specific metric value calculated based on verification data.
Further, the optimization target may be determined with a data analysis tool, such as: pandas in Python, tidyverse in the R language, etc.; the verification data may be obtained through a web crawler, such as: Scrapy, BeautifulSoup, etc.; the verification index may be obtained through a model evaluation algorithm, such as: cross-validation, leave-one-out validation, etc.
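As an illustration of computing a verification index by cross-validation, the sketch below averages accuracy over k contiguous folds. It is not the embodiment's evaluation procedure: the threshold classifier, the helper names and the sample values are hypothetical and serve only to make the example self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


def cross_validated_accuracy(features, labels, fit, predict, k=5):
    """Average validation accuracy over k folds."""
    scores = []
    for train, validation in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in validation)
        scores.append(hits / len(validation))
    return sum(scores) / len(scores)


def fit_threshold(xs, ys):
    """Hypothetical model: midpoint between the two class means."""
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict_threshold(threshold, x):
    """Label a sample 1 when it lies above the learned threshold."""
    return 1 if x > threshold else 0


xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
verification_index = cross_validated_accuracy(xs, ys, fit_threshold, predict_threshold, k=5)
```

Setting k equal to the number of samples turns the same routine into leave-one-out validation, provided every training split still contains both classes.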
According to the invention, the trained data screening model is optimized based on the optimization index, so that the optimized data screening model is obtained, the accuracy, the robustness and the resource utilization efficiency of the model can be improved, the corresponding task demands are met, and the continuous improvement of the data screening method and system is promoted.
The optimized data screening model is a model obtained by adjusting, improving or optimizing a trained model to improve the performance, accuracy and adaptability of the model.
Alternatively, the optimized data screening model may be obtained through a gradient descent optimization algorithm, such as: batch gradient descent, stochastic gradient descent, etc.
By using the optimized data screening model to perform secondary screening on the primary screening data and obtain the screening data corresponding to the initial data, the invention can further improve classification accuracy and reduce the misjudgments or missed judgments that may exist in the primary screening data, thereby improving the overall accuracy.
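Batch gradient descent, named above as one possible optimizer, can be sketched on a one-feature logistic model. The model form, learning rate, epoch count and data below are illustrative assumptions; the text does not specify the architecture of the data screening model.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(xs, ys, w, b):
    """Mean negative log-likelihood of a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def batch_gradient_descent(xs, ys, learning_rate=0.5, epochs=500):
    """Fit weight w and bias b using the full data set in every update step."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        # Step against the gradient averaged over the whole batch.
        w -= learning_rate * grad_w / m
        b -= learning_rate * grad_b / m
    return w, b


# Hypothetical one-dimensional screening feature with binary labels.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
initial_loss = log_loss(xs, ys, 0.0, 0.0)
w, b = batch_gradient_descent(xs, ys)
final_loss = log_loss(xs, ys, w, b)
```

Stochastic gradient descent differs only in updating w and b after each individual sample (or small mini-batch) instead of after a full pass over the data.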
The screening data refers to a required data set obtained after the optimized data screening model performs secondary screening.
Alternatively, the screening data may be obtained by a data visualization tool implementation, such as: tableau, power BI, and the like.
By obtaining the initial data to be screened and performing data cleaning on it to obtain initial cleaning data, the invention can improve data quality and the accuracy of analysis results, reduce processing time and resource consumption, protect data privacy and security, and optimize the data structure and format, thereby laying a good foundation for subsequent data processing and analysis. The screening method and system for realizing data based on big data combined with artificial intelligence provided by the invention can therefore improve the screening accuracy of data screening.
FIG. 2 is a functional block diagram of a method and a system for screening data based on big data combined with artificial intelligence according to an embodiment of the present invention.
The screening system 200 for realizing data based on big data and artificial intelligence can be installed in electronic equipment. Depending on the implementation, the screening system 200 based on big data combined with artificial intelligence implementation data may include a normalization module 201, a similarity calculation module 202, a feature labeling module 203, a classification prescreening module 204, and a model optimization module 205. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
The normalization module 201 is configured to obtain initial data to be screened, perform data cleaning on the initial data to obtain initial cleaning data, and perform normalization processing on the initial cleaning data to obtain normalized data;
The similarity calculation module 202 is configured to perform feature extraction on the standardized data to obtain a standard data feature, identify multiple collinearity corresponding to the standard data feature, determine a reference feature corresponding to the standard data feature based on the multiple collinearity, and calculate feature similarity between the reference features;
the feature labeling module 203 is configured to determine a screening requirement corresponding to the standardized data based on the feature similarity, and label the data feature of the reference feature based on the screening requirement to obtain labeled feature data;
The classification and preliminary screening module 204 is configured to construct a labeling database corresponding to the labeling feature data, perform model training on a preset data screening model by using data in the labeling database to obtain a trained data screening model, and perform classification and preliminary screening on the standardized data by using the trained data screening model to obtain preliminary screening class data;
The model optimization module 205 is configured to identify the data category corresponding to the primary screening data, perform optimization verification on the primary screening data based on the data category to obtain optimization verification data, query the optimization index corresponding to the optimization verification data, optimize the trained data screening model based on the optimization index to obtain an optimized data screening model, and perform secondary screening on the primary screening data using the optimized data screening model to obtain the screening data corresponding to the initial data.
In detail, each module in the screening system 200 for implementing data based on big data and artificial intelligence in the embodiment of the present invention adopts the same technical means as the screening method for implementing data based on big data and artificial intelligence in the drawings when in use, and can produce the same technical effects, which are not described herein.
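As an illustration of the multiple collinearity check performed by the similarity calculation module 202, the sketch below derives a variance expansion (inflation) factor from the correlation between two features. It covers only the two-feature case, where the factor reduces to 1 / (1 - r²); the function names and the sample values are hypothetical, and the commonly quoted cut-off of 10 is a rule of thumb rather than a value taken from this text.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(var_a * var_b)


def pairwise_vif(a, b):
    """Variance inflation factor for two features: 1 / (1 - r^2)."""
    r = pearson(a, b)
    if abs(r) >= 1.0:
        # Perfectly collinear features make the factor unbounded.
        return math.inf
    return 1.0 / (1.0 - r * r)


feature_x = [1.0, 2.0, 3.0, 4.0]
uncorrelated = [1.0, -1.0, -1.0, 1.0]   # zero correlation with feature_x
collinear = [2.0, 4.0, 6.0, 8.0]        # exact multiple of feature_x

vif_independent = pairwise_vif(feature_x, uncorrelated)
vif_collinear = pairwise_vif(feature_x, collinear)
```

With more than two features, the factor for each feature is usually read from the diagonal of the inverse of the correlation coefficient matrix; that generalisation is omitted here to keep the sketch dependency-free.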
Fig. 3 is a schematic structural diagram of an electronic device for implementing a data screening method based on big data and artificial intelligence.
The electronic device 1 may comprise a processor 30, a memory 31, a communication bus 32 and a communication interface 33, and may further comprise a computer program stored in the memory 31 and executable on the processor 30, such as a data screening program based on big data combined with artificial intelligence.
The processor 30 may in some embodiments be formed by an integrated circuit, for example a single packaged integrated circuit, or by a plurality of integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 30 is the control unit (Control Unit) of the electronic device 1; it connects the components of the entire electronic device using various interfaces and lines, runs or executes the programs or modules stored in the memory 31 (e.g., the data screening program based on big data combined with artificial intelligence), and invokes data stored in the memory 31 to perform the functions of the electronic device and process data.
The memory 31 includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 31 may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory 31 may in other embodiments be an external storage device of the electronic device, such as a plug-in mobile hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card) or the like provided on the electronic device. Further, the memory 31 may include both an internal storage unit and an external storage device of the electronic device. The memory 31 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the data screening program, but also to temporarily store data that has been output or is to be output.
The communication bus 32 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection and communication between the memory 31 and the at least one processor 30, among other components.
The communication interface 33 is used for communication between the electronic device 1 and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display (Display) or an input unit such as a keyboard (Keyboard), or alternatively a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, as appropriate, and is used for displaying information processed in the electronic device and for displaying a visual user interface.
Fig. 3 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component. Preferably, the power source may be logically connected to the at least one processor 30 through a power management device, so that functions such as charge management, discharge management and power consumption management are implemented through the power management device. The power source may also include one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, etc. The electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described herein.
It should be understood that the examples are for illustrative purposes only.
The data screening program stored in the memory 31 of the electronic device 1 is a combination of a plurality of computer programs which, when run on the processor 30, can implement:
acquiring initial data to be screened, performing data cleaning on the initial data to obtain initial cleaning data, and performing standardized processing on the initial cleaning data to obtain standardized data;
extracting features of the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features;
Determining screening requirements corresponding to the standardized data based on the feature similarity, and marking the data features of the reference features based on the screening requirements to obtain marked feature data;
Constructing a labeling database corresponding to the labeling feature data, performing model training on a preset data screening model by using data in the labeling database to obtain a trained data screening model, and performing classification primary screening on the standardized data by using the trained data screening model to obtain primary screening type data;
Identifying the data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring optimization indexes corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization indexes to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the initial data.
In particular, the specific implementation method of the processor 30 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
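The first of the steps listed above, data cleaning followed by standardization, can be sketched as follows. The sketch assumes that missing values are filled with the column mean and that scaling uses the z-score (deviation from the mean divided by the standard deviation); both are common choices consistent with the missing-value counting and the mean and standard deviation steps described for the method, but the exact interpolation and scaling rules are not fixed by this text. All function names and sample values are hypothetical.

```python
import math


def impute_missing(column):
    """Fill None entries with the mean of the observed values.

    Returns the filled column and the number of missing entries found.
    """
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values")
    missing_count = len(column) - len(observed)
    mean = sum(observed) / len(observed)
    filled = [mean if value is None else value for value in column]
    return filled, missing_count


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((value - mean) ** 2 for value in column) / n)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0] * n
    return [(value - mean) / std for value in column]


raw_column = [2.0, None, 4.0, 6.0]
filled_column, n_missing = impute_missing(raw_column)
standardized_column = standardize(filled_column)
```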
Further, the integrated modules/units of the electronic device 1 may be stored in a computer-readable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. The storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (Read-Only Memory, ROM).
The present invention also provides a storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
acquiring initial data to be screened, performing data cleaning on the initial data to obtain initial cleaning data, and performing standardized processing on the initial cleaning data to obtain standardized data;
extracting features of the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features;
Determining screening requirements corresponding to the standardized data based on the feature similarity, and marking the data features of the reference features based on the screening requirements to obtain marked feature data;
Constructing a labeling database corresponding to the labeling feature data, performing model training on a preset data screening model by using data in the labeling database to obtain a trained data screening model, and performing classification primary screening on the standardized data by using the trained data screening model to obtain primary screening type data;
Identifying the data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring optimization indexes corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization indexes to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the initial data.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. The method for realizing data screening based on big data and artificial intelligence is characterized by comprising the following steps:
acquiring initial data to be screened, performing data cleaning on the initial data to obtain initial cleaning data, and performing standardized processing on the initial cleaning data to obtain standardized data;
extracting features of the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features;
Determining screening requirements corresponding to the standardized data based on the feature similarity, and marking the data features of the reference features based on the screening requirements to obtain marked feature data;
Constructing a labeling database corresponding to the labeling feature data, performing model training on a preset data screening model by using data in the labeling database to obtain a trained data screening model, and performing classification primary screening on the standardized data by using the trained data screening model to obtain primary screening type data;
Identifying the data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring optimization indexes corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization indexes to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the initial data.
2. The method for realizing data screening based on big data combined with artificial intelligence according to claim 1, wherein the step of performing data cleaning on the initial data to obtain initial cleaning data comprises the following steps:
Identifying missing values in the initial data;
performing function statistics on the missing values to obtain number entries corresponding to the missing values;
based on the number entries, interpolating the initial data by the missing values to obtain filling data;
And taking the filling data as initial cleaning data corresponding to the initial data.
3. The method for screening data based on big data combined with artificial intelligence according to claim 1, wherein the normalizing the initial cleaning data to obtain normalized data comprises:
Loading the initial cleaning data into a preset database, and identifying a data column corresponding to the data in the database;
calculating the average value and standard deviation of the data columns;
Calculating the deviation degree corresponding to the data column based on the average value and the standard deviation;
and scaling and normalizing the initial cleaning data based on the deviation degree to obtain normalized data.
4. The method for implementing data screening based on big data combined with artificial intelligence according to claim 1, wherein the identifying multiple collinearity corresponding to the standard data feature comprises:
Inquiring characteristic parameters corresponding to the standard data characteristics, and calculating a correlation coefficient matrix between the characteristic parameters;
Identifying a corresponding variance expansion factor in the correlation coefficient matrix;
and determining the multiple collinearity corresponding to the standard data characteristic based on the variance expansion factor.
5. The method for screening data based on big data combined with artificial intelligence according to claim 1, wherein the calculating feature similarity between the reference features comprises:
calculating the feature similarity between the reference features using the following formula:
wherein Tz represents the feature similarity between the reference features, n represents the feature vectors corresponding to the reference features, i represents the index in the feature vectors, $x_i$ represents the i-th element corresponding to the feature x in the feature vector, and $y_i$ represents the i-th element corresponding to the feature y in the feature vector.
6. The method for realizing data screening based on big data combined with artificial intelligence according to claim 1, wherein the step of labeling the data features of the reference features based on the screening requirement to obtain labeled feature data comprises the following steps:
Determining feature data points in the reference feature based on the screening requirements;
extracting sample data points corresponding to the characteristic data points;
Carrying out data point integration on the sample data points to obtain an integrated data matrix;
And marking the data characteristics of the data in the integrated data matrix to obtain marked characteristic data.
7. The method for implementing data screening based on big data combined with artificial intelligence according to claim 1, wherein the identifying the data category corresponding to the primary screening class data comprises:
identifying a corresponding data form in the primary screening class data;
Extracting classification features in the primary screening class data based on the data form;
performing feature coding on the classification features to obtain feature classification codes;
and performing category decoding on the characteristic classification codes to obtain data categories corresponding to the primary screening category data.
8. The method for screening data based on big data combined with artificial intelligence according to claim 7, wherein the calculating the classification accuracy corresponding to the primary screening class data based on the subset parameters comprises:
Calculating the classification accuracy corresponding to the primary screening type data by using the following formula:
wherein Zq represents the classification accuracy corresponding to the primary screening class data, m represents the data sample corresponding to the subset parameter, and f(i) represents the classification accuracy of the i-th sample corresponding to the data sample.
9. The method for screening data based on big data combined with artificial intelligence according to claim 1, wherein the querying the optimization index corresponding to the optimization verification data comprises:
Determining a corresponding optimization target in the optimization verification data;
Identifying verification data corresponding to the optimized verification data based on the optimization target;
Based on the verification data, calculating a verification index corresponding to the optimized verification data;
And performing index optimization on the verification index to obtain an optimization index corresponding to the optimization verification data.
10. A screening system for implementing data based on big data in combination with artificial intelligence, characterized in that it is used for executing the screening method for implementing data based on big data in combination with artificial intelligence according to any one of claims 1 to 9, the system comprising:
the standardized module is used for acquiring initial data to be screened, carrying out data cleaning on the initial data to obtain initial cleaning data, and carrying out standardized processing on the initial cleaning data to obtain standardized data;
The similarity calculation module is used for carrying out feature extraction on the standardized data to obtain standard data features, identifying multiple collinearity corresponding to the standard data features, determining reference features corresponding to the standard data features based on the multiple collinearity, and calculating feature similarity between the reference features;
the feature labeling module is used for determining screening requirements corresponding to the standardized data based on the feature similarity, and labeling the data features of the reference features based on the screening requirements to obtain labeled feature data;
The classification primary screening module is used for constructing a labeling database corresponding to the labeling feature data, carrying out model training on a preset data screening model by utilizing data in the labeling database to obtain a trained data screening model, and carrying out classification primary screening on the standardized data by utilizing the trained data screening model to obtain primary screening type data;
the model optimizing module is used for identifying the data category corresponding to the primary screening data, carrying out optimization verification on the primary screening data based on the data category to obtain optimization verification data, inquiring optimization indexes corresponding to the optimization verification data, optimizing the trained data screening model based on the optimization indexes to obtain an optimized data screening model, and carrying out secondary screening on the primary screening data by utilizing the optimized data screening model to obtain screening data corresponding to the initial data.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410145704.1A CN117909333A (en) 2024-02-02 2024-02-02 Screening method and system for realizing data based on big data combined with artificial intelligence

Publications (1)

Publication Number Publication Date
CN117909333A true CN117909333A (en) 2024-04-19

Family

ID=90687720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410145704.1A Pending CN117909333A (en) 2024-02-02 2024-02-02 Screening method and system for realizing data based on big data combined with artificial intelligence

Country Status (1)

Country Link
CN (1) CN117909333A (en)

Similar Documents

Publication Publication Date Title
CN112445875B (en) Data association and verification method and device, electronic equipment and storage medium
CN113626607B (en) Abnormal work order identification method and device, electronic equipment and readable storage medium
CN109933502B (en) Electronic device, user operation record processing method and storage medium
CN113254354A (en) Test case recommendation method and device, readable storage medium and electronic equipment
CN115641162A (en) Prediction data analysis system and method based on construction project cost
CN115081025A (en) Sensitive data management method and device based on digital middlebox and electronic equipment
CN113728321A (en) Using a set of training tables to accurately predict errors in various tables
CN113807418A (en) Injection molding machine energy consumption abnormity detection method and system based on Gaussian mixture model
CN116662839A (en) Associated big data cluster analysis method and device based on multidimensional intelligent acquisition
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
CN116244367A (en) Visual big data analysis platform based on multi-model custom algorithm
CN111932146A (en) Method and device for analyzing pollution cause, computer equipment and readable storage medium
CN117319452A (en) Safety inspection method and system applied to barium sulfate preparation
CN111950623B (en) Data stability monitoring method, device, computer equipment and medium
CN112801222A (en) Multi-classification method and device based on two-classification model, electronic equipment and medium
CN113658002A (en) Decision tree-based transaction result generation method and device, electronic equipment and medium
CN111581296B (en) Data correlation analysis method and device, computer system and readable storage medium
CN112966965A (en) Import and export big data analysis and decision method, device, equipment and storage medium
CN116012019B (en) Financial wind control management system based on big data analysis
CN111651625A (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN115034812B (en) Steel industry sales volume prediction method and device based on big data
CN110796381A (en) Method and device for processing evaluation indexes of modeling data, terminal equipment and medium
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN117909333A (en) Screening method and system for realizing data based on big data combined with artificial intelligence
CN112906824B (en) Vehicle clustering method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination