CN113486088A - Data mining method based on complex technology - Google Patents


Info

Publication number
CN113486088A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110759405.3A
Other languages
Chinese (zh)
Inventor
祖玉宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sesns Network Technology Co ltd
Original Assignee
Shanghai Sesns Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-07-05
Filing date
2021-07-05
Publication date
2021-10-08
Application filed by Shanghai Sesns Network Technology Co ltd
Priority to CN202110759405.3A
Publication of CN113486088A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of data mining, and in particular to a data mining method based on a complex technology. The method comprises a data acquisition step, a database establishment step, a data mining step and a redundancy processing step. The method first defines which data are useful in a given field, then identifies by comparison the data that fall outside that field or are repeated, and deletes this useless data through redundancy processing. This prevents useless data from reducing the decision value of the decision data, improves the data mining effect, and solves the problem that mined data cannot support decision making.

Description

Data mining method based on complex technology
Technical Field
The invention relates to the technical field of data mining, in particular to a data mining method based on a complex technology.
Background
In recent years, data mining has attracted great attention in the information industry, mainly because large amounts of data are now available for wide use, and there is an urgent need to convert these data into useful information and knowledge. The information and knowledge obtained can be applied in business management, production control, market analysis, engineering design, scientific exploration and the like, so there is a need to mine such valuable decision data.
Data mining is a decision support process based mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases, visualization technologies and the like. It analyzes enterprise data in a highly automated manner, performs inductive reasoning and mines potential patterns from the data, helping decision makers adjust market strategies, reduce risks and make correct decisions.
However, the data mined by many current data mining methods contain a large amount of repeated or redundant data without decision value, so the data mining effect is greatly reduced and the mined data cannot support decision making.
Disclosure of Invention
The present invention aims to provide a data mining method based on complex technology to solve the problems described in the background art.
In order to achieve the above object, the present invention provides a data mining method based on complex technology, comprising the following steps:
S1.1, data acquisition: collecting data;
S1.2, establishing a database: constructing metadata corresponding to the acquired data, storing data information in the metadata, and loading a mining database according to the metadata;
S1.3, data mining: mining useful data in the mining database to form decision data;
S1.4, redundancy processing: cleaning redundant data out of the mined decision data.
As a further improvement of the technical solution, the steps of establishing the database in S1.2 are as follows:
S2.1, describing the acquired data;
S2.2, performing quality evaluation on the described data, and combining and integrating them to obtain metadata;
S2.3, loading the mining database and maintaining the mining database.
As a further improvement of the technical solution, the data mining in S1.3 adopts an intelligent mining algorithm, which comprises the following steps:
S3.1, defining decision data according to different decision requirements;
S3.2, extracting data from the mining database by taking the defined decision data as the standard, and preprocessing the extracted data to improve their quality;
S3.3, evaluating the extracted data, distinguishing the redundant data, and forming decision data after the distinction;
S3.4, analyzing the decision data and producing a data mining result.
As a further improvement of the present technical solution, the preprocessing of the extracted data in S3.2 includes noise elimination and data type conversion.
As a further improvement of the technical solution, a K nearest neighbor algorithm is used in S3.4 to classify the data to be analyzed, and the algorithm steps are as follows:
S4.1, extracting characteristic values of the data according to the description of the collected data, and re-describing the training data set vectors according to the characteristic values;
S4.2, finding the K data sets in the training data set that are most similar to the newly acquired data set;
S4.3, calculating in turn the weight of each class among the K neighbors of the newly acquired data set, comparing the class weights, and assigning the data set to the class with the maximum weight.
As a further improvement of the present technical solution, the formula for the similarity calculation in S4.2 is as follows:
[Formula image BDA0003148070180000021, not reproduced in this text]
where Sim(d_i, d_j) is the similarity between the j-th acquired data set d_j and the i-th training data set d_i; M is the number of acquired data; W_ik is the total number for the training data set d_i; and W_jk is the total number for the acquired data set d_j.
As a further improvement of the present technical solution, the weight calculation formula in S4.3 is as follows:
[Formula image BDA0003148070180000022, not reproduced in this text]
where [image BDA0003148070180000031] is the feature vector of the acquired data set; [image BDA0003148070180000032] is the feature vector similarity; [image BDA0003148070180000033] is the category attribute function; and C_i is the i-th category. If the acquired data set d_j belongs to class C_i, then [image BDA0003148070180000034]; otherwise [image BDA0003148070180000035].
As a further improvement of the present technical solution, the redundancy processing steps in S1.4 are as follows:
S5.1, comparing the data in the decision data;
S5.2, deleting the decision data found to be redundant in the comparison by using a decision algorithm.
As a further improvement of the technical solution, the decision algorithm formula is as follows:
[Formula image BDA0003148070180000036, not reproduced in this text]
where γ_i is the upper bound of the rule support ultimately produced by the i-th decision data (c_i M), and S_i is the lower bound of the rule support ultimately produced by the i-th decision data (c_i M). If γ_i ≤ γ_0 or S_i ≥ S_0, the i-th decision data (c_i M) is deleted, where γ_0 is the minimum rule support and S_0 is the maximum rule support.
As a further improvement of the present technical solution, the rule support is calculated by the following formula:
[Formula image BDA0003148070180000037, not reproduced in this text]
where X_i is the support set of the i-th decision data, and Y is the total data set.
Compared with the prior art, the invention has the following beneficial effects: useful data in a field are defined, data that fall outside the field or are repeated are identified by comparison, and this useless data is deleted through redundancy processing. This prevents useless data from reducing the decision value of the decision data, improves the data mining effect, and solves the problem that mined data cannot support decision making.
Drawings
FIG. 1 is a flow chart of method steps for data mining of the present invention;
FIG. 2 is a flow chart of the steps of building a database according to the present invention;
FIG. 3 is a flow chart of the steps of the intelligent mining algorithm of the present invention;
FIG. 4 is a flow chart of the data classification steps of the present invention;
FIG. 5 is a flow chart of the redundancy processing steps of the present invention.
Detailed Description
Example 1
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-5, the present invention provides a data mining method based on a complex technology, which comprises the following steps:
S1.1, data acquisition: collecting data, including data from business management, production control, market analysis, engineering design, scientific exploration and the like;
S1.2, establishing a database: constructing metadata corresponding to the acquired data, storing data information in the metadata, and loading a mining database according to the metadata;
S1.3, data mining: mining useful data in the mining database to form decision data;
S1.4, redundancy processing: cleaning redundant data out of the mined decision data.
In addition, the steps of establishing the database in S1.2 are as follows:
S2.1, describing the collected data, so that conceptual data are converted into logical data that can be input into a computer and recognized by it;
S2.2, performing quality evaluation on the described data, and combining and integrating them to obtain metadata;
wherein the quality evaluation flow is as follows:
first, the data quality indexes and evaluation rules to be checked are determined; then corresponding SQL scripts are written to check and analyze the data; finally, the percentage of data satisfying each rule is calculated as that rule's score. The overall score of the system is calculated from the scores of the individual rules, and the average is taken as the final evaluation value.
S2.3, loading the mining database and maintaining the mining database, wherein maintenance of the database comprises backing up system data, recovering the database system, generating a user information table, authorizing the information table, monitoring the operating condition of the system, handling system errors promptly and ensuring the security of the system data.
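The quality evaluation flow of S2.2 can be illustrated with a minimal sketch. The description states that SQL scripts perform the checks; the sketch below instead applies in-memory predicates, purely for illustration. The rule names, the record fields (`price`, `region`) and the function names are hypothetical and do not come from the patent.

```javascript
// Hypothetical evaluation rules: each rule is a name plus a predicate on one record.
const rules = [
  { name: "price is a non-negative number", check: (r) => typeof r.price === "number" && r.price >= 0 },
  { name: "region is present", check: (r) => typeof r.region === "string" && r.region.length > 0 },
];

// Percentage of records that satisfy one rule.
function ruleScore(records, rule) {
  if (records.length === 0) return 0;
  const passed = records.filter(rule.check).length;
  return (100 * passed) / records.length;
}

// Final evaluation value: the mean of the per-rule percentage scores.
function qualityScore(records, ruleList) {
  const scores = ruleList.map((rule) => ruleScore(records, rule));
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Hypothetical sample data.
const records = [
  { price: 3.5, region: "east" },
  { price: -1, region: "west" },
  { price: 2.0, region: "" },
  { price: 4.2, region: "north" },
];
console.log(qualityScore(records, rules));
```

Averaging the per-rule percentages weights every rule equally; the description does not say whether rules carry different weights, so equal weighting is an assumption here.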
Further, the data mining in S1.3 adopts an intelligent mining algorithm, and the algorithm steps are as follows:
S3.1, defining decision data according to different decision requirements. Defining here means specifying which data are useful in each field: for market analysis, for example, market research data, market risk assessment data and the like are defined as useful data, while data unrelated to the market or repeated data are defined as useless data. The useless data are deleted through redundancy processing, which prevents them from reducing the decision value of the decision data, improves the data mining effect, and solves the problem that mined data cannot support decision making;
S3.2, extracting data from the mining database by taking the defined decision data as the standard, and preprocessing the extracted data to improve their quality;
S3.3, evaluating the extracted data, distinguishing the redundant data, and forming decision data after the distinction;
S3.4, analyzing the decision data and producing a data mining result.
Specifically, the preprocessing of the extracted data in S3.2 includes noise elimination and data type conversion.
Noise elimination uses a regression denoising method. If a dependency relationship exists between the data, that relationship is solved for, so that it can be predicted as the data change; the dependency relationship follows a normal distribution. The data are treated as observations that contain noise, and the observed values are updated as the data continue to change so as to remove the random noise in them.
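The regression denoising idea can be sketched as follows. The description does not specify the form of the dependency relationship, so this sketch assumes a simple linear dependency fitted by ordinary least squares; the function names `fitLine` and `regressionDenoise` and the sample values are hypothetical.

```javascript
// Fit y = intercept + slope * x by ordinary least squares.
function fitLine(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  // A constant x column gives no slope information; fall back to a flat line.
  const slope = sxx === 0 ? 0 : sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Replace each noisy observation with the value predicted by the fitted line.
function regressionDenoise(xs, ys) {
  const { slope, intercept } = fitLine(xs, ys);
  return xs.map((x) => intercept + slope * x);
}

// Hypothetical noisy observations of a roughly linear relationship.
const xs = [0, 1, 2, 3, 4];
const noisy = [1.1, 2.9, 5.2, 6.8, 9.0];
console.log(regressionDenoise(xs, noisy));
```

Replacing every observation with its fitted value is the bluntest form of smoothing; a gentler variant would correct only the points whose residual exceeds a chosen threshold.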
In addition, part of the code for data type conversion is as follows:
<script>
// 1. Convert a number to a string with toString()
var num = 10;
var str = num.toString();
console.log(str);
console.log(typeof str);
// 2. Convert with String(variable)
console.log(String(num));
// 3. Implicit conversion by concatenating with an empty string
console.log(num + "");
</script>
In addition, a K nearest neighbor algorithm is used in S3.4 to classify the data to be analyzed, and the algorithm steps are as follows:
S4.1, extracting characteristic values of the data according to the description of the collected data, and re-describing the training data set vectors according to the characteristic values;
S4.2, finding the K data sets in the training data set that are most similar to the newly acquired data set;
S4.3, calculating in turn the weight of each class among the K neighbors of the newly acquired data set, comparing the class weights, and assigning the data set to the class with the maximum weight, so that the data are analyzed class by class according to the assigned classes. This realizes distributed analysis of the data, greatly improves the operation speed, shortens the analysis time and reduces the load during analysis.
In addition, the formula for the similarity calculation in S4.2 is as follows:
[Formula image BDA0003148070180000061, not reproduced in this text]
where Sim(d_i, d_j) is the similarity between the j-th acquired data set d_j and the i-th training data set d_i; M is the number of acquired data; W_ik is the total number for the training data set d_i; and W_jk is the total number for the acquired data set d_j.
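The similarity formula itself is an image that is not reproduced in this text. One formula consistent with the symbols Sim(d_i, d_j), M, W_ik and W_jk is the cosine similarity commonly used with K nearest neighbor classifiers, in which W_ik and W_jk are read as the k-th of M feature weights of d_i and d_j. That reading is an assumption and may differ from the patent's actual formula. A minimal sketch under that assumption:

```javascript
// Cosine similarity between two weight vectors of equal length M.
// ASSUMPTION: the patent's formula image is missing; cosine similarity is a guess
// based on the symbol definitions, not a confirmed reconstruction.
function cosineSimilarity(wi, wj) {
  if (wi.length !== wj.length) {
    throw new Error("vectors must have the same length M");
  }
  let dot = 0;
  let normI = 0;
  let normJ = 0;
  for (let k = 0; k < wi.length; k++) {
    dot += wi[k] * wj[k];
    normI += wi[k] * wi[k];
    normJ += wj[k] * wj[k];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normI === 0 || normJ === 0) return 0;
  return dot / Math.sqrt(normI * normJ);
}

console.log(cosineSimilarity([3, 4], [3, 4]));
console.log(cosineSimilarity([1, 0], [0, 1]));
```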
Further, the weight calculation formula in S4.3 is as follows:
[Formula image BDA0003148070180000062, not reproduced in this text]
where [image BDA0003148070180000063] is the feature vector of the acquired data set; [image BDA0003148070180000064] is the feature vector similarity; [image BDA0003148070180000065] is the category attribute function; and C_i is the i-th category. If the acquired data set d_j belongs to class C_i, then [image BDA0003148070180000066]; otherwise [image BDA0003148070180000067].
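Steps S4.2 and S4.3 can be sketched as a similarity-weighted vote. Because the weight formula and the values of the category attribute function are images missing from this text, the sketch assumes the usual form: the weight of a class is the sum of the similarities of the K neighbors belonging to it, with the attribute function equal to 1 for members and 0 otherwise. The names `kNearest`, `classWeights` and `classify`, and the sample data, are hypothetical.

```javascript
// Keep the K candidates with the highest similarity (step S4.2).
function kNearest(candidates, k) {
  return [...candidates].sort((a, b) => b.similarity - a.similarity).slice(0, k);
}

// Sum the similarities per class label.
// ASSUMPTION: attribute function is 1 if the neighbor belongs to the class, else 0.
function classWeights(neighbors) {
  const weights = new Map();
  for (const { similarity, label } of neighbors) {
    weights.set(label, (weights.get(label) || 0) + similarity);
  }
  return weights;
}

// Assign the class with the largest accumulated weight (step S4.3).
function classify(neighbors) {
  let best = null;
  let bestWeight = -Infinity;
  for (const [label, weight] of classWeights(neighbors)) {
    if (weight > bestWeight) {
      best = label;
      bestWeight = weight;
    }
  }
  return best;
}

// Hypothetical training sets with precomputed similarity to the new data set.
const candidates = [
  { similarity: 0.75, label: "A" },
  { similarity: 0.5, label: "B" },
  { similarity: 0.5, label: "B" },
  { similarity: 0.125, label: "A" },
];
console.log(classify(kNearest(candidates, 3)));
```

With K = 3 in this sample, the single closest neighbor belongs to class A, but the two class-B neighbors together outweigh it, which shows how the vote differs from a plain nearest-neighbor rule.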
In addition, the redundancy processing steps in S1.4 are as follows:
S5.1, comparing the data in the decision data;
S5.2, deleting the decision data found to be redundant in the comparison by using a decision algorithm.
In addition, the decision algorithm formula is as follows:
[Formula image BDA0003148070180000068, not reproduced in this text]
where γ_i is the upper bound of the rule support ultimately produced by the i-th decision data (c_i M), and S_i is the lower bound of the rule support ultimately produced by the i-th decision data (c_i M). If γ_i ≤ γ_0 or S_i ≥ S_0, the i-th decision data (c_i M) is deleted, where γ_0 is the minimum rule support and S_0 is the maximum rule support.
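The deletion condition stated in the text (delete when γ_i ≤ γ_0 or S_i ≥ S_0) can be sketched as a filter. How γ_i and S_i are computed is given only in the missing formula image, so this sketch assumes they are already available on each record; the field names `gamma` and `s`, the function name and the sample values are hypothetical.

```javascript
// Keep only decision data that are NOT redundant under the stated condition:
// an item is deleted if gamma <= gamma0 (minimum rule support)
// or s >= s0 (maximum rule support).
function removeRedundant(decisions, gamma0, s0) {
  return decisions.filter((d) => !(d.gamma <= gamma0 || d.s >= s0));
}

// Hypothetical decision data with precomputed bounds.
const decisions = [
  { id: "a", gamma: 0.5, s: 0.2 },   // kept
  { id: "b", gamma: 0.05, s: 0.01 }, // deleted: gamma <= gamma0
  { id: "c", gamma: 0.95, s: 0.9 },  // deleted: s >= s0
];
console.log(removeRedundant(decisions, 0.1, 0.8).map((d) => d.id));
```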
Specifically, the rule support is calculated by the following formula:
[Formula image BDA0003148070180000071, not reproduced in this text]
where X_i is the support set of the i-th decision data, and Y is the total data set.
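The support formula is likewise an image missing from this text. Given that X_i is a support set and Y is the total data set, a natural reading is the ratio of their sizes, |X_i| / |Y|. This is an assumption and not a confirmed reconstruction; the function name and sample sets below are hypothetical.

```javascript
// Rule support as the fraction of the total data set covered by the support set.
// ASSUMPTION: support = |X_i| / |Y|; the patent's formula image is not available.
function ruleSupport(supportSet, totalSet) {
  if (totalSet.size === 0) return 0;
  return supportSet.size / totalSet.size;
}

// Hypothetical record identifiers.
const totalSet = new Set(["r1", "r2", "r3", "r4"]);
const supportSet = new Set(["r1", "r3"]);
console.log(ruleSupport(supportSet, totalSet));
```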
Example 2
In order to improve the decision quality of market analysis, this embodiment applies Example 1 to market analysis. The working flow is as follows:
First, market data are collected. For example, when the fruit market is analyzed, the collected data are the sales of each variety of fruit (it is worth noting that the fruit data to be analyzed in this embodiment should exclude out-of-season fruits and fruits with short storage periods). The data are represented by a set A = {a1, a2, a3}, where a1 is apple, a2 is watermelon and a3 is strawberry. The data are then mined, specifically:
a1, a2 and a3 are classified, with the following results:
a1 is an everyday fruit with a normal storage period; a2 is an out-of-season fruit with a normal storage period; a3 is an out-of-season fruit with a short storage period;
decision data are then formed: a1, everyday fruit, normal storage period; a2, out-of-season fruit, normal storage period; a3, out-of-season fruit, short storage period;
finally, out-of-season fruits and fruits with short storage periods are deleted by comparison, so the only fruit left to be analyzed is a1. The data mining method based on complex technology thus deletes the out-of-season watermelon and strawberry data, prevents useless data from reducing the decision value of the decision data, and improves the decision quality of the market analysis.
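The fruit example can be traced with a short sketch. The classification results are taken from the text above; the field names `outOfSeason` and `shortStorage` and the function name are hypothetical, and the classification itself is hard-coded rather than computed.

```javascript
// Decision data for A = {a1, a2, a3}, with the classification stated in the example.
const fruits = [
  { id: "a1", name: "apple", outOfSeason: false, shortStorage: false },
  { id: "a2", name: "watermelon", outOfSeason: true, shortStorage: false },
  { id: "a3", name: "strawberry", outOfSeason: true, shortStorage: true },
];

// Redundancy processing for this example: drop out-of-season fruits
// and fruits with short storage periods.
function fruitsToAnalyze(items) {
  return items.filter((f) => !f.outOfSeason && !f.shortStorage);
}

console.log(fruitsToAnalyze(fruits).map((f) => f.id));
```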
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A data mining method based on complex technology, characterized by comprising the following steps:
S1.1, data acquisition: collecting data;
S1.2, establishing a database: constructing metadata corresponding to the acquired data, storing data information in the metadata, and loading a mining database according to the metadata;
S1.3, data mining: mining useful data in the mining database to form decision data;
S1.4, redundancy processing: cleaning redundant data out of the mined decision data.
2. The data mining method based on complex technology according to claim 1, wherein the steps of establishing the database in S1.2 are as follows:
S2.1, describing the acquired data;
S2.2, performing quality evaluation on the described data, and combining and integrating them to obtain metadata;
S2.3, loading the mining database and maintaining the mining database.
3. The data mining method based on complex technology according to claim 2, wherein the data mining in S1.3 adopts an intelligent mining algorithm, and the algorithm steps are as follows:
S3.1, defining decision data according to different decision requirements;
S3.2, extracting data from the mining database by taking the defined decision data as the standard, and preprocessing the extracted data to improve their quality;
S3.3, evaluating the extracted data, distinguishing the redundant data, and forming decision data after the distinction;
S3.4, analyzing the decision data and producing a data mining result.
4. The data mining method based on complex technology according to claim 3, wherein the preprocessing of the extracted data in S3.2 comprises noise elimination and data type conversion.
5. The data mining method based on complex technology according to claim 3, wherein a K nearest neighbor algorithm is used in S3.4 to classify the data to be analyzed, and the algorithm steps are as follows:
S4.1, extracting characteristic values of the data according to the description of the collected data, and re-describing the training data set vectors according to the characteristic values;
S4.2, finding the K data sets in the training data set that are most similar to the newly acquired data set;
S4.3, calculating in turn the weight of each class among the K neighbors of the newly acquired data set, comparing the class weights, and assigning the data set to the class with the maximum weight.
6. The data mining method based on complex technology according to claim 5, wherein the formula for the similarity calculation in S4.2 is as follows:
[Formula image FDA0003148070170000021, not reproduced in this text]
where Sim(d_i, d_j) is the similarity between the j-th acquired data set d_j and the i-th training data set d_i; M is the number of acquired data; W_ik is the total number for the training data set d_i; and W_jk is the total number for the acquired data set d_j.
7. The data mining method based on complex technology according to claim 5, wherein the weight calculation formula in S4.3 is as follows:
[Formula image FDA0003148070170000022, not reproduced in this text]
where [image FDA0003148070170000023] is the feature vector of the acquired data set; [image FDA0003148070170000024] is the feature vector similarity; [image FDA0003148070170000025] is the category attribute function; and C_i is the i-th category.
8. The data mining method based on complex technology according to claim 1, wherein the redundancy processing steps in S1.4 are as follows:
S5.1, comparing the data in the decision data;
S5.2, deleting the decision data found to be redundant in the comparison by using a decision algorithm.
9. The data mining method based on complex technology according to claim 8, wherein the decision algorithm formula is as follows:
[Formula image FDA0003148070170000026, not reproduced in this text]
where γ_i is the upper bound of the rule support ultimately produced by the i-th decision data (c_i M), and S_i is the lower bound of the rule support ultimately produced by the i-th decision data (c_i M).
10. The data mining method based on complex technology according to claim 9, wherein the rule support is calculated by the following formula:
[Formula image FDA0003148070170000031, not reproduced in this text]
where X_i is the support set of the i-th decision data, and Y is the total data set.
CN202110759405.3A 2021-07-05 2021-07-05 Data mining method based on complex technology Pending CN113486088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110759405.3A CN113486088A (en) 2021-07-05 2021-07-05 Data mining method based on complex technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110759405.3A CN113486088A (en) 2021-07-05 2021-07-05 Data mining method based on complex technology

Publications (1)

Publication Number Publication Date
CN113486088A (en) 2021-10-08

Family

ID=77940954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110759405.3A Pending CN113486088A (en) 2021-07-05 2021-07-05 Data mining method based on complex technology

Country Status (1)

Country Link
CN (1) CN113486088A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038165A (en) * 2016-02-03 2017-08-11 腾讯科技(深圳)有限公司 A kind of service parameter acquisition methods and device
CN110188085A (en) * 2019-04-18 2019-08-30 红云红河烟草(集团)有限责任公司 Quality data model method for building up between a kind of tobacco volume hired car
CN112199566A (en) * 2020-09-27 2021-01-08 成都房联云码科技有限公司 City update effect evaluation method and system based on real estate big data
CN112836960A (en) * 2021-02-01 2021-05-25 安徽安医高创信息技术有限公司 Industrial production data scheduling system based on BI technology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
恩索尔: "《Oracle设计》", 30 September 2002 *
袁津生: "《搜索引擎与信息检索教程》", 30 April 2008 *
邱有强: "基于粗糙集的智能数据挖掘算法在风机监测中的应用", 《东北电力大学学报》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination