CN105205107A - Internet of Things data similarity processing method - Google Patents

Info

Publication number
CN105205107A
Authority
CN
China
Prior art keywords
attribute
product
array
product record
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510535354.0A
Other languages
Chinese (zh)
Inventor
谢东
肖东
成运
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Humanities Science and Technology
Original Assignee
Hunan University of Humanities Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Humanities Science and Technology
Priority to CN201510535354.0A
Publication of CN105205107A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an Internet of Things data similarity processing method, comprising the following steps: obtaining multiple product records, selecting a first product record and a second product record with multiple identical attributes; saving the attributes of the first product record in a first array, and saving the attributes of the second product record in a second array; respectively calculating corresponding attribute similarity values for all the attributes of the first product record and the second product record according to corresponding attribute functions; calculating weighting values of all the attributes according to the degree of importance of all the attributes of the first product record and the second product record through weighting functions; and calculating the overall similarity of the first product record and the second product record by combining with a third array of attribute similarity values and a fourth array of weighting values through overall similarity functions. According to the Internet of Things data similarity processing method provided by the invention, the overall similarity of two product records with identical attributes is calculated according to the attribute similarity and attribute weighting values of the two product records, the processing speed is high, and a lot of time cost can be saved.

Description

Internet of Things data similarity processing method
Technical field
The present invention relates to the field of data processing, and in particular to an Internet of Things data similarity processing method.
Background
Since the emergence of the Internet, the number of web pages has grown rapidly, and this growth has produced the largest information resource base in the world. Web information integration technology processes this resource base effectively: it integrates related information and provides data-level support for data mining, so that information services in specialized domains can be better supported. In the current era of rapid network development, information resources are increasingly abundant, web information integration has become an important part of the information age, and it is applied in many fields.
For example, in the Internet of Things field, a product supplier can publish product information on multiple web trading platforms, and a buyer can obtain that information from a platform and use it to contact the supplier and make a purchase. This process involves handling large volumes of data. However, each web trading platform presents information differently, which makes information integration difficult. In addition, the same supplier may publish the same product in different forms on different platforms, so crawling these platforms yields many duplicate records. It is therefore necessary to clean duplicates from product data that comes from different web data sources and is expressed in different forms; such cleaning is an important safeguard for deciding by machine whether records are duplicates.
In cleaning product data, the most important task is to remove duplicate records from among the many product records, so that a comprehensive, accurate and professional product database that meets data quality requirements can be built. This requires computing the similarity between records. At present, data similarity is computed mainly by one-by-one comparison, which is slow and consumes a great deal of time.
Summary of the invention
In view of the above defects and shortcomings of the prior art, the technical problem to be solved by the present invention is to provide an Internet of Things data similarity processing method that can save a large amount of time.
To achieve the above object, the present invention provides an Internet of Things data similarity processing method comprising the following steps:
S1: obtain multiple product records from a web trading platform, and select two product records that share multiple identical attributes as the first product record and the second product record;
S2: store the attributes of the first product record in a first array, and the attributes of the second product record in a second array;
S3: for each attribute of the first and second product records, calculate the corresponding attribute similarity value using the corresponding attribute function, and store the attribute similarity values in a third array;
S4: calculate a weight value for each attribute using a weighting function, according to the importance of each attribute of the first and second product records, and store the weight values in a fourth array;
S5: combine the third array of attribute similarity values with the fourth array of weight values, and calculate the overall similarity of the first and second product records using an overall similarity function.
Further, in step S3, the attribute functions comprise a product alias matching strategy function, a product price conversion matching strategy function, a normalized date matching strategy function, a normalized place-of-origin matching strategy function, and an edit distance algorithm function.
Preferably, in step S2, the attributes of the first product record are placed into multiple first attribute arrays in the order product name, price, production date, place of origin, and the multiple first attribute arrays form the first array.
Preferably, in step S2, the attributes of the second product record are placed into multiple second attribute arrays in the order product name, price, production date, place of origin, and the multiple second attribute arrays form the second array.
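The edit distance algorithm function is named among the attribute functions but is not spelled out in the text. The following is a minimal C++ sketch of the standard Levenshtein distance, offered only as an illustration. The function names `edit_distance` and `edit_similarity`, and the normalisation of the distance into a similarity in [0, 1], are assumptions and do not come from the patent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein edit distance between two byte strings, using two rolling rows.
// It counts bytes, not characters: multi-byte text (for example UTF-8 Chinese)
// would have to be decoded into code points first.
std::size_t edit_distance(const std::string& s, const std::string& t) {
    std::vector<std::size_t> prev(t.size() + 1), cur(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= s.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= t.size(); ++j) {
            std::size_t subst = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[t.size()];
}

// Assumed normalisation (not from the patent): 1 - distance / longer length.
double edit_similarity(const std::string& s, const std::string& t) {
    std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 1.0;  // two empty strings are treated as identical
    std::size_t d = edit_distance(s, t);
    assert(d <= longest);          // the distance never exceeds the longer length
    return 1.0 - static_cast<double>(d) / static_cast<double>(longest);
}
```

Under these assumptions, such a function could serve for the secondary attributes that have no dedicated matching strategy, since it yields a graded value rather than a strict 0 or 1.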
The Internet of Things data similarity processing method of the present invention has the following beneficial effects:
The present application calculates the overall similarity of two product records that share the same attributes from their per-attribute similarities and attribute weight values; processing is fast and calculation accuracy is high, so a large amount of time can be saved.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flowchart of the present application.
Fig. 2 is a flowchart of the product alias matching strategy function in the present application.
Fig. 3 is a flowchart of the product price conversion matching strategy function in the present application.
Fig. 4 is a flowchart of the normalized date matching strategy function in the present application.
Fig. 5 is a flowchart of the normalized place-of-origin matching strategy function in the present application.
Detailed description of the embodiments
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention provides a data similarity processing method comprising the following steps:
S1: obtain multiple product records from a web trading platform, and select two product records that share multiple identical attributes as the first product record A and the second product record B.
S2: store the attributes of the first product record A in a first array a[], and the attributes of the second product record B in a second array b[].
The first product record A and the second product record B each have n attributes, so the first array a[] consists of n first attribute arrays a[0], a[1], a[2], a[3], a[4] ... a[n-1], and the second array b[] consists of n second attribute arrays b[0], b[1], b[2], b[3], b[4] ... b[n-1]. The attributes of the first product record A are stored in the first attribute arrays a[0], a[1], a[2], a[3] in the order product name, price, production date, place of origin, and the first attribute arrays a[4] to a[n-1] hold the other, secondary attributes of A. Likewise, the attributes of the second product record B are stored in the second attribute arrays b[0], b[1], b[2], b[3] in the order product name, price, production date, place of origin, and the second attribute arrays b[4] to b[n-1] hold the other, secondary attributes of B.
S3: for each attribute of the first product record A and the second product record B, calculate the corresponding attribute similarity value using the corresponding attribute function, and store the attribute similarity values in a third array c[], which is an array of type double.
In step S3, the attribute functions comprise the product alias matching strategy function Strategy_Name(), the product price conversion matching strategy function Strategy_Price(), the normalized date matching strategy function Strategy_Date(), the normalized place-of-origin matching strategy function Strategy_Origin(), and the edit distance algorithm function Edit_Distance().
S4: calculate a weight value for each attribute using the weighting function Weight(), according to the importance of each attribute of the first product record A and the second product record B, and store the weight values in a fourth array w[], which is an array of type double.
S5: combine the third array c[] of attribute similarity values with the fourth array w[] of weight values, and calculate the overall similarity Sim(A, B) of the first product record A and the second product record B using the overall similarity function Sim().
The present application calculates the overall similarity of two product records that share the same attributes from their per-attribute similarities and attribute weight values; processing is fast and calculation accuracy is high, so a large amount of time can be saved. The present invention therefore effectively overcomes the shortcomings of the prior art and has high industrial utility.
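The text does not give the formula used by Sim() to combine c[] and w[], nor the rule used by Weight(). The sketch below is one plausible reading, assuming a weighted average of the attribute similarity values. The name `overall_similarity` and the normalisation by the sum of the weights are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// c[k]: similarity value of attribute k, expected in [0, 1] (third array).
// w[k]: non-negative importance weight of attribute k (fourth array).
// Assumed combination rule: Sim(A, B) = sum(w[k] * c[k]) / sum(w[k]).
double overall_similarity(const std::vector<double>& c,
                          const std::vector<double>& w) {
    assert(c.size() == w.size());  // one weight per attribute
    double weighted = 0.0;
    double total = 0.0;
    for (std::size_t k = 0; k < c.size(); ++k) {
        weighted += w[k] * c[k];
        total += w[k];
    }
    // With no attributes or all-zero weights there is nothing to compare.
    return total > 0.0 ? weighted / total : 0.0;
}
```

With weights {2, 1, 1} and similarity values {1, 0, 1}, for instance, this rule gives (2 + 0 + 1) / 4 = 0.75. Dividing by the sum of the weights keeps the result in [0, 1] even when the weights are not already normalised.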
Further, as shown in Fig. 2, the product alias matching strategy function Strategy_Name() comprises the following steps:
N1: take one group of data from a document and place it in a set S;
N2: starting from the first element of the set, store each element in a C++ STL map container, mapping every element to the first element;
N3: for the agricultural product name attribute values of records A and B, look up the corresponding mapped values in the map container and replace the names with them;
N4: compare the replaced agricultural product names; if they are exactly equal, the similarity Sim(Ak, Bk) = 1, otherwise Sim(Ak, Bk) = 0.
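Steps N1 to N4 can be sketched in C++ roughly as follows. This is an illustrative reading, not the applicant's code: the names `AliasMap`, `build_alias_map`, `canonical_name` and `strategy_name` are hypothetical, and keeping a name unchanged when it has no entry in the map is an assumption, since the text does not say what happens in that case.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using AliasMap = std::map<std::string, std::string>;

// Steps N1-N2: every name in a group is mapped to the group's first element.
AliasMap build_alias_map(const std::vector<std::vector<std::string>>& groups) {
    AliasMap canonical;
    for (const auto& group : groups) {
        if (group.empty()) continue;
        for (const auto& name : group) canonical[name] = group.front();
    }
    return canonical;
}

// Step N3: replace a name by its mapped value.
// Assumption: a name that is not in the map is kept unchanged.
std::string canonical_name(const AliasMap& canonical, const std::string& name) {
    auto it = canonical.find(name);
    return it == canonical.end() ? name : it->second;
}

// Step N4: similarity is 1 if the replaced names are equal, otherwise 0.
double strategy_name(const AliasMap& canonical,
                     const std::string& a, const std::string& b) {
    return canonical_name(canonical, a) == canonical_name(canonical, b) ? 1.0 : 0.0;
}
```

Given an alias group such as {"potato", "spud"}, the two names both resolve to "potato", so `strategy_name` returns 1 for that pair. The sample names are made up for the example.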
Preferably, as shown in Fig. 3, the product price conversion matching strategy function Strategy_Price() comprises the following steps:
P1: first define a map object: map<string, double> price;
P2: execute the following statements to map each unit to its conversion factor:
price["yuan/kilogram"] = 1;
price["yuan/jin"] = 2;
price["yuan/kilogram"] = 1;
price["yuan/1000 grams"] = 1;
price["yuan/500 grams"] = 2;
price["yuan/100 grams"] = 10;
price["yuan/gram"] = 1000;
price["yuan/ton"] = 0.001;
That is, to convert "x yuan/kilogram" to the unit "yuan/kilogram", x is multiplied by 1; to convert "x yuan/jin" to "yuan/kilogram", x is multiplied by 2 (one jin is 500 grams); the other units follow by analogy;
P3: for the price attribute value Ak of record A, first separate the numeric value of the price from its unit. To split, search forward from the first character of the string until the first character p[i] is found that is neither between '0' and '9' nor '.'. The part from p[0] to p[i-1] is the numeric value of the price and is stored in string a; the remainder is the unit and is stored in string b;
P4: use the atof() function to convert string a into a double value, stored in the double variable c1;
P5: execute the statement c1 *= price[b], which multiplies c1 by the conversion factor of unit b and stores the result back in c1; c1 is now the converted numeric value of the input price;
P6: apply the same method to record B to obtain the converted value c2 of the price attribute value Bk;
P7: decide whether the two input prices are the same by testing whether |c1 - c2| <= 0.000001. If the test is true, Sim(Ak, Bk) = 1; otherwise Sim(Ak, Bk) = 0.
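Steps P1 to P7 can be put together as in the sketch below. It follows the steps as described, with three labeled assumptions: the helper names `unit_factors`, `to_yuan_per_kg` and `strategy_price` are hypothetical, an unknown unit is treated as a non-match instead of being inserted into the map by price[b], and the comparison uses the absolute difference.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <map>
#include <string>

// Steps P1-P2: factor that converts a price in the given unit to yuan/kilogram.
const std::map<std::string, double>& unit_factors() {
    static const std::map<std::string, double> price = {
        {"yuan/kilogram", 1.0},  {"yuan/jin", 2.0},        {"yuan/1000 grams", 1.0},
        {"yuan/500 grams", 2.0}, {"yuan/100 grams", 10.0}, {"yuan/gram", 1000.0},
        {"yuan/ton", 0.001}};
    return price;
}

// Steps P3-P5: split a string such as "2yuan/jin" into number and unit, then
// convert. The unit must follow the number directly, with no space in between.
// Returns false if there is no number or the unit is unknown (assumed behaviour).
bool to_yuan_per_kg(const std::string& p, double& out) {
    std::size_t i = 0;
    while (i < p.size() && ((p[i] >= '0' && p[i] <= '9') || p[i] == '.')) ++i;
    const std::string a = p.substr(0, i);  // numeric part, p[0] .. p[i-1]
    const std::string b = p.substr(i);     // unit part
    auto it = unit_factors().find(b);
    if (a.empty() || it == unit_factors().end()) return false;
    out = std::atof(a.c_str()) * it->second;
    return true;
}

// Steps P6-P7: 1 if the converted prices agree within 0.000001, otherwise 0.
double strategy_price(const std::string& pa, const std::string& pb) {
    double c1 = 0.0, c2 = 0.0;
    if (!to_yuan_per_kg(pa, c1) || !to_yuan_per_kg(pb, c2)) return 0.0;
    return std::fabs(c1 - c2) <= 0.000001 ? 1.0 : 0.0;
}
```

For example, "2yuan/jin" and "4yuan/kilogram" both convert to 4 yuan per kilogram and are therefore reported as the same price.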
Further, as shown in Fig. 4, the normalized date matching strategy function Strategy_Date() comprises the following steps:
D1: search forward from the first character of the date string r1; the first character r1[i] found that is not between '0' and '9' is the first separator and is converted to '/', i.e. r1[i] = '/'; the characters from the first character to character i-1 are then the year;
D2: if r1[i+1] is not the character '0', go directly to step D3; if r1[i+1] is the character '0', shift the string forward by one position from position i+2 onward, i.e. r1[i+1, i+2, ...] = r1[i+2, i+3, ...];
D3: save the value i+1 in j, then search forward from character i+1 until a character r1[i] is found that is not between '0' and '9'; this is the second separator and is converted to '/', i.e. r1[i] = '/'; the characters from character j to character i-1 are then the month, with its leading 0 removed;
D4: repeat step D2 to remove the leading 0 from the day; at this point the first input date string r1 has had its separators converted and its leading zeros removed;
D5: process the input date string r2 by the same method; after processing, calculate the similarity of r1 and r2 using the formula
$$ Sim(r_i^k, r_j^k) = \begin{cases} \dfrac{10 - |C(r_i^k, r_j^k)|}{10}, & |C(r_i^k, r_j^k)| \le 10 \\ 0, & |C(r_i^k, r_j^k)| > 10 \end{cases} $$
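The normalisation in steps D1 to D4 and the piecewise formula in step D5 can be sketched as follows. The sketch assumes single-byte separators such as '-', '.' or '/', and the names `normalize_date` and `date_similarity` are hypothetical. The text does not define how C(r1, r2) is computed, so the sketch takes that value as an input and does not guess at it.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <string>

// Steps D1-D4: rewrite a date such as "2015-08-07" or "2015.8.07" as "2015/8/7".
// Each run of digits is one field (year, month, day). Every non-digit character
// becomes '/', and one leading '0' is dropped from the month and day fields.
std::string normalize_date(const std::string& r) {
    std::string out;
    int field = 0;  // 0 = year, 1 = month, 2 = day
    std::size_t i = 0;
    while (i < r.size()) {
        if (std::isdigit(static_cast<unsigned char>(r[i]))) {
            std::size_t start = i;
            while (i < r.size() && std::isdigit(static_cast<unsigned char>(r[i]))) ++i;
            // Drop one prefix 0 from month or day, but keep a lone "0" intact.
            if (field > 0 && r[start] == '0' && i - start > 1) ++start;
            out += r.substr(start, i - start);
            ++field;
        } else {
            out += '/';  // separator conversion
            ++i;
        }
    }
    return out;
}

// Step D5 formula: (10 - |c|) / 10 when |c| <= 10, otherwise 0.
// c stands for C(r1, r2); its computation is not specified in the text,
// so the caller has to supply it.
double date_similarity(double c) {
    const double d = std::fabs(c);
    return d <= 10.0 ? (10.0 - d) / 10.0 : 0.0;
}
```

After normalisation, "2015-08-07" and "2015.8.07" become the same string "2015/8/7", which makes the two dates directly comparable before the formula is applied.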
Further, as shown in Fig. 5, the normalized place-of-origin matching strategy function Strategy_Origin() comprises the following steps:
O1: create the sets Sprov, Scity and Scoun, which store all provincial-level, city-level and county-level administrative divisions respectively;
O2: perform Chinese word segmentation on the place-of-origin attribute value of record A, and look up each segmented word in Sprov, Scity and Scoun to determine which administrative level it belongs to, thereby distinguishing province, city and county. Store the province, city and county of record A in Aprov, Acity and Acoun respectively, assigning the null value NULL to any administrative level that is missing. Process the place-of-origin attribute value of record B in the same way, storing the province, city and county of B in Bprov, Bcity and Bcoun respectively;
O3: complete the missing administrative levels. Use the bottom-up hierarchy of administrative divisions to fill in missing levels; parts that cannot be completed are left unprocessed.
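Steps O1 to O3 can be illustrated with the sketch below, which starts from words that have already been segmented. The types `Origin` and `Gazetteer`, the function `parse_origin`, and the two parent tables used for bottom-up completion are assumptions: the text says only that the hierarchy of administrative divisions is used, not how it is stored. An empty string stands in for NULL.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Province, city and county of one record; an empty string stands in for NULL.
struct Origin {
    std::string prov, city, coun;
};

struct Gazetteer {
    std::set<std::string> Sprov, Scity, Scoun;          // step O1
    std::map<std::string, std::string> city_of_county;  // assumed parent table
    std::map<std::string, std::string> prov_of_city;    // assumed parent table
};

// Step O2: classify each segmented word by administrative level.
// Step O3: fill a missing upper level from the level below when the parent is known.
Origin parse_origin(const Gazetteer& g, const std::vector<std::string>& words) {
    Origin o;
    for (const auto& w : words) {
        if (g.Sprov.count(w)) o.prov = w;
        else if (g.Scity.count(w)) o.city = w;
        else if (g.Scoun.count(w)) o.coun = w;
    }
    if (o.city.empty() && !o.coun.empty()) {
        auto it = g.city_of_county.find(o.coun);
        if (it != g.city_of_county.end()) o.city = it->second;
    }
    if (o.prov.empty() && !o.city.empty()) {
        auto it = g.prov_of_city.find(o.city);
        if (it != g.prov_of_city.end()) o.prov = it->second;
    }
    return o;  // levels that cannot be completed stay empty
}
```

Completion runs only upward, from county to city to province, because a lower-level division determines its parent while a province alone does not determine a city. County names that occur in more than one province would need extra disambiguation, which this sketch does not attempt.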
The data similarity processing method provided by the embodiments of the present invention has been described in detail above. For a person of ordinary skill in the art, the specific implementation and the scope of application may vary in accordance with the ideas of the embodiments. In summary, this description should not be construed as limiting the present invention, and any change made in accordance with the design concept of the present invention falls within its scope of protection.

Claims (4)

1. An Internet of Things data similarity processing method, characterized by comprising the following steps:
S1: obtaining multiple product records from a web trading platform, and selecting two product records that share multiple identical attributes as the first product record and the second product record;
S2: storing the attributes of the first product record in a first array, and the attributes of the second product record in a second array;
S3: for each attribute of the first and second product records, calculating the corresponding attribute similarity value using the corresponding attribute function, and storing the attribute similarity values in a third array;
S4: calculating a weight value for each attribute using a weighting function, according to the importance of each attribute of the first and second product records, and storing the weight values in a fourth array;
S5: combining the third array of attribute similarity values with the fourth array of weight values, and calculating the overall similarity of the first and second product records using an overall similarity function.
2. The Internet of Things data similarity processing method according to claim 1, characterized in that, in step S3, the attribute functions comprise a product alias matching strategy function, a product price conversion matching strategy function, a normalized date matching strategy function, a normalized place-of-origin matching strategy function, and an edit distance algorithm function.
3. The Internet of Things data similarity processing method according to claim 1, characterized in that, in step S2, the attributes of the first product record are placed into multiple first attribute arrays in the order product name, price, production date, place of origin, and the multiple first attribute arrays form the first array.
4. The Internet of Things data similarity processing method according to claim 1, characterized in that, in step S2, the attributes of the second product record are placed into multiple second attribute arrays in the order product name, price, production date, place of origin, and the multiple second attribute arrays form the second array.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510535354.0A CN105205107A (en) 2015-08-27 2015-08-27 Internet of Things data similarity processing method

Publications (1)

Publication Number Publication Date
CN105205107A true CN105205107A (en) 2015-12-30

Family

ID=54952791

Country Status (1)

Country Link
CN (1) CN105205107A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080027932A1 (en) * 2006-07-27 2008-01-31 International Business Machines Corporation Apparatus of generating browsing paths for data and method for browsing data
CN101286156A (en) * 2007-05-29 2008-10-15 北大方正集团有限公司 Method for removing repeated object based on metadata
CN101814082A (en) * 2010-01-20 2010-08-25 中国人民解放军总参谋部第六十三研究所 Method for automatic feature weighting and selection in detection of similar and duplicate record based on ant colony optimization
CN102456203A (en) * 2010-10-22 2012-05-16 阿里巴巴集团控股有限公司 Method for determining candidate product linked list as well as related device
CN103455555A (en) * 2013-08-06 2013-12-18 北京大学深圳研究生院 Recommendation method and device based on mobile terminal similarity
CN104615600A (en) * 2013-11-04 2015-05-13 深圳中兴力维技术有限公司 Similar case comparison implementation method and device thereof
CN104035983A (en) * 2014-05-29 2014-09-10 西安理工大学 Classified variable clustering method based on attribute weight similarity

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193860A (en) * 2017-03-31 2017-09-22 苏州艾隆信息技术有限公司 Medicine information multidimensional identification method and system
CN107193860B (en) * 2017-03-31 2021-03-02 苏州艾隆信息技术有限公司 Medicine information multidimensional identification method and system
CN109614614A (en) * 2018-12-03 2019-04-12 焦点科技股份有限公司 A kind of BILSTM-CRF name of product recognition methods based on from attention
CN109614614B (en) * 2018-12-03 2021-04-02 焦点科技股份有限公司 BILSTM-CRF product name identification method based on self-attention
CN111898035A (en) * 2020-06-19 2020-11-06 深圳奇迹智慧网络有限公司 Data processing strategy configuration method and device based on Internet of things and computer equipment
CN111898035B (en) * 2020-06-19 2023-10-31 深圳奇迹智慧网络有限公司 Data processing strategy configuration method and device based on Internet of things and computer equipment
CN113946722A (en) * 2021-10-22 2022-01-18 北京钢研新材科技有限公司 Intelligent welding material matching method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151230
