CN105205107A

CN105205107A - Internet of Things data similarity processing method

Info

Publication number: CN105205107A
Application number: CN201510535354.0A
Authority: CN
Inventors: 谢东; 肖东; 成运
Original assignee: Hunan University of Humanities Science and Technology
Current assignee: Hunan University of Humanities Science and Technology
Priority date: 2015-08-27
Filing date: 2015-08-27
Publication date: 2015-12-30

Abstract

The present invention provides an Internet of Things data similarity processing method comprising the following steps: obtaining multiple product records and selecting a first product record and a second product record that share multiple identical attributes; storing the attributes of the first product record in a first array and the attributes of the second product record in a second array; calculating an attribute similarity value for each attribute of the first and second product records using the corresponding attribute function; calculating a weight value for each attribute through a weight function, according to the importance of each attribute of the two records; and combining the third array of attribute similarity values with the fourth array of weight values to calculate the overall similarity between the first product record and the second product record through an overall similarity function. The method computes the overall similarity of two product records with the same attributes from their per-attribute similarity values and attribute weight values; it processes data quickly and can save a great deal of time.

Description

A method for processing similarity of Internet of Things data

Technical field

The invention relates to the field of data processing, and in particular to an Internet of Things data similarity processing method.

Background

Since the Internet emerged, the number of WEB pages has grown rapidly, and this growth has produced the world's largest information repository. WEB information integration technology processes this repository effectively, integrates related information, and provides data support for data mining, so that the information can better serve specialized fields. In today's fast-developing network era, information resources are increasingly abundant, WEB information integration has become an important topic, and it is applied in many fields.

In the Internet of Things field, for example, product suppliers can publish product information on multiple WEB trading platforms, and buyers can obtain that information from the platforms and use it to contact a supplier and make a purchase. This process involves handling large amounts of data. Because each WEB trading platform expresses information differently, integrating the information is difficult. In addition, the same supplier may publish the same product in different forms on different WEB trading platforms, so crawling these platforms for data yields many duplicate records. It is therefore necessary to clean duplicates from product data that comes from different WEB data sources and is expressed in different forms; this cleaning is an important prerequisite for detecting duplicate data automatically.

In cleaning product data, the main task is to remove similar and duplicate records from among the many product records, so that a comprehensive, accurate, and professional product database meeting data quality requirements can be built. This requires computing the similarity of multiple records. At present, data similarity is computed mainly by one-by-one comparison, which is very slow and consumes a great deal of time.

Summary of the invention

In view of the defects and shortcomings of the prior art described above, the technical problem to be solved by the present invention is to provide an Internet of Things data similarity processing method that saves a great deal of time.

To achieve the above object, the present invention provides an Internet of Things data similarity processing method comprising the following steps:

S1. Obtain multiple product records from a WEB trading platform and select two product records that share multiple identical attributes, namely a first product record and a second product record;

S2. Store the attributes of the first product record in a first array, and store the attributes of the second product record in a second array;

S3. For each attribute of the first product record and the second product record, calculate the attribute similarity value using the corresponding attribute function, and store the attribute similarity values of the attributes in a third array;

S4. Calculate the weight value of each attribute through a weight function, according to the importance of each attribute of the first product record and the second product record, and store the weight values of the attributes in a fourth array;

S5. Combine the third array of attribute similarity values with the fourth array of weight values, and calculate the overall similarity between the first product record and the second product record through an overall similarity function.

Further, in step S3, the attribute functions include a product alias matching strategy function, a product price conversion matching strategy function, a normalized date matching strategy function, a normalized place-of-origin matching strategy function, and an edit distance algorithm function.
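The edit distance algorithm function is named here but not described further. As a minimal sketch, assuming it refers to the standard Levenshtein distance, it could look like the following. The names editDistance and editSimilarity are hypothetical, and the conversion of a distance into a similarity in [0, 1] is an assumption, not something the text specifies. The sketch compares bytes of a std::string, so multi-byte (for example UTF-8 Chinese) text would need to be decoded into characters first.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: minimum number of single-element insertions,
// deletions, and substitutions needed to turn s into t.
std::size_t editDistance(const std::string& s, const std::string& t) {
    std::vector<std::size_t> prev(t.size() + 1), cur(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = j;  // "" -> t[0..j)
    for (std::size_t i = 1; i <= s.size(); ++i) {
        cur[0] = i;  // s[0..i) -> ""
        for (std::size_t j = 1; j <= t.size(); ++j) {
            const std::size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // delete from s
                               cur[j - 1] + 1,       // insert into s
                               prev[j - 1] + cost}); // substitute or keep
        }
        std::swap(prev, cur);
    }
    return prev[t.size()];
}

// Assumed normalization (not from the source): 1 means identical strings,
// 0 means every position differs.
double editSimilarity(const std::string& s, const std::string& t) {
    const std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 1.0;  // two empty strings are identical
    return 1.0 - static_cast<double>(editDistance(s, t)) / longest;
}
```

A function of this kind would plausibly serve the secondary attributes that have no dedicated matching strategy, although the text does not say which attributes it is applied to.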

Preferably, in step S2, the attributes of the first product record are placed into multiple first attribute arrays in the order product name, price, production date, place of origin, and the multiple first attribute arrays constitute the first array.

Preferably, in step S2, the attributes of the second product record are placed into multiple second attribute arrays in the order product name, price, production date, place of origin, and the multiple second attribute arrays constitute the second array.

The Internet of Things data similarity processing method of the present invention has the following beneficial effects:

The method computes the overall similarity of two product records with the same attributes from their per-attribute similarity values and attribute weight values. Processing is fast and the calculation is accurate, which saves a great deal of time.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, the invention is described in detail below through preferred embodiments with reference to the accompanying drawings.

Brief description of the drawings

Fig. 1 is a flow chart of the method of this application.

Fig. 2 is a flow chart of the product alias matching strategy function in this application.

Fig. 3 is a flow chart of the product price conversion matching strategy function in this application.

Fig. 4 is a flow chart of the normalized date matching strategy function in this application.

Fig. 5 is a flow chart of the normalized place-of-origin matching strategy function in this application.

Detailed description

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention provides a data similarity processing method comprising the following steps:

S1. Obtain multiple product records from a WEB trading platform and select two product records that share multiple identical attributes, namely the first product record A and the second product record B.

S2. Store the attributes of the first product record A in the first array a[], and store the attributes of the second product record B in the second array b[].

Both the first product record A and the second product record B have n attributes, so the first array a[] consists of n first attribute arrays a[0], a[1], a[2], a[3], and a[4] to a[n-1], and the second array b[] consists of n second attribute arrays b[0], b[1], b[2], b[3], and b[4] to b[n-1]. The attributes of the first product record A are stored in the order product name, price, production date, place of origin in the first attribute arrays a[0], a[1], a[2], and a[3], while a[4] to a[n-1] hold the other, secondary attributes of record A. Likewise, the attributes of the second product record B are stored in the same order in the second attribute arrays b[0], b[1], b[2], and b[3], while b[4] to b[n-1] hold the other, secondary attributes of record B.

S3. For each attribute of the first product record A and the second product record B, calculate the attribute similarity value using the corresponding attribute function, and store the attribute similarity values in the third array c[], which is an array of type double.

In step S3, the attribute functions include the product alias matching strategy function Strategy_Name(), the product price conversion matching strategy function Strategy_Price(), the normalized date matching strategy function Strategy_Date(), the normalized place-of-origin matching strategy function Strategy_Origin(), and the edit distance algorithm function Edit_Distance().

S4. Calculate the weight value of each attribute through the weight function Weight(), according to the importance of each attribute of the first product record A and the second product record B, and store the weight values in the fourth array w[], which is an array of type double.

S5. Combine the third array c[] of attribute similarity values with the fourth array w[] of weight values, and calculate the overall similarity Sim(A, B) of the first product record A and the second product record B through the overall similarity function Sim().
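The text does not give the formula behind Sim() or Weight(). One plausible reading of step S5, offered only as an assumption, is a weighted average of the per-attribute similarities, Sim(A, B) = (Σ w[k]·c[k]) / (Σ w[k]). The sketch below implements that reading; the function name overallSimilarity and the choice to normalize by the sum of the weights are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumed form of Sim(): weighted average of the attribute similarities.
// c[k] is the similarity of attribute k (third array),
// w[k] is its weight (fourth array).
double overallSimilarity(const std::vector<double>& c,
                         const std::vector<double>& w) {
    if (c.empty() || c.size() != w.size()) {
        throw std::invalid_argument("c and w must be non-empty and equal in length");
    }
    double weighted = 0.0;  // sum of w[k] * c[k]
    double total = 0.0;     // sum of w[k]
    for (std::size_t k = 0; k < c.size(); ++k) {
        weighted += w[k] * c[k];
        total += w[k];
    }
    if (total <= 0.0) {
        throw std::invalid_argument("weights must sum to a positive value");
    }
    return weighted / total;
}
```

With similarities {1, 0, 0.7, 1} for name, price, date, and origin and illustrative weights {0.4, 0.3, 0.2, 0.1}, this reading gives 0.64. Dividing by the weight sum keeps the result in [0, 1] even when the weights are not pre-normalized.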

The method computes the overall similarity of two product records with the same attributes from their per-attribute similarity values and attribute weight values. Processing is fast and the calculation is accurate, which saves a great deal of time. The present invention therefore effectively overcomes the shortcomings of the prior art and has high industrial value.

Further, as shown in Fig. 2, the product alias matching strategy function Strategy_Name() includes the following steps:

N1. Select a group of data from the document and place it in a set S;

N2. Starting from the first element of the set, store each element in a map container from the C++ STL, mapped to the first element;

N3. For the agricultural product name attribute values of records A and B, look up the corresponding mapped values in the map container and replace the names with them;

N4. Compare the replaced agricultural product names. If they are exactly equal, the similarity is Sim(Ak, Bk) = 1; otherwise Sim(Ak, Bk) = 0.
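Steps N1 to N4 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the names AliasMap, addAliasGroup, canonicalName, and strategyName are hypothetical, and a name that has no entry in the map is assumed to stand for itself.

```cpp
#include <map>
#include <string>
#include <vector>

using AliasMap = std::map<std::string, std::string>;

// N1-N2: every name in the group S is mapped to the group's first element.
void addAliasGroup(AliasMap& aliases, const std::vector<std::string>& S) {
    if (S.empty()) return;
    for (const std::string& name : S) {
        aliases[name] = S.front();
    }
}

// N3: replace a name with its mapped value.
// Assumption: names absent from the map are left unchanged.
std::string canonicalName(const AliasMap& aliases, const std::string& name) {
    const auto it = aliases.find(name);
    return it == aliases.end() ? name : it->second;
}

// N4: similarity is 1 when the replaced names are identical, otherwise 0.
int strategyName(const AliasMap& aliases,
                 const std::string& Ak, const std::string& Bk) {
    return canonicalName(aliases, Ak) == canonicalName(aliases, Bk) ? 1 : 0;
}
```

For example, after registering the group {"potato", "spud", "tater"}, the names "spud" and "tater" both map to "potato" and compare as equal, while "spud" and "tomato" do not.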

Preferably, as shown in Fig. 3, the product price conversion matching strategy function Strategy_Price() includes the following steps:

P1. First define a map object: map<string,double> price;

P2. Execute the following statements to associate each unit with its conversion factor:

price["元/公斤"] = 1;      // yuan per gongjin (1 gongjin = 1 kg)

price["元/斤"] = 2;        // yuan per jin (1 jin = 500 g)

price["元/千克"] = 1;      // yuan per kilogram

price["元/1000克"] = 1;    // yuan per 1000 g

price["元/500克"] = 2;     // yuan per 500 g

price["元/100克"] = 10;    // yuan per 100 g

price["元/克"] = 1000;     // yuan per gram

price["元/吨"] = 0.001;    // yuan per tonne

That is, to convert "x 元/公斤" to the unit "元/千克" (yuan per kilogram), multiply x by 1; to convert "x 元/斤" (yuan per jin) to "元/千克", multiply x by 2; and so on for the other units;

P3. For the price attribute value Ak of record A, first separate the numeric value from the unit: starting at the first character of the string, scan forward until reaching the first character p[i] that is neither between '0' and '9' nor '.'. The characters p[0] through p[i-1] form the numeric value and are saved in string a; the remainder is the unit and is saved in string b;

P4. Use the atof() function to convert string a to a double value, and store it in the double variable c1;

P5. Execute the statement c1 *= price[b], which multiplies c1 by the conversion factor of unit b and stores the result back in c1; c1 is now the converted value of the input price;

P6. Apply the same procedure to record B to obtain the converted value c2 of the price attribute value Bk;

P7. Test whether |c1 - c2| <= 0.000001 holds to decide whether the two input prices are the same. If it holds, Sim(Ak, Bk) = 1; otherwise Sim(Ak, Bk) = 0.
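The price steps P1 to P7 can be sketched as below. The helper names unitTable, toYuanPerKg, and strategyPrice are hypothetical. Two choices in the sketch are assumptions rather than statements from the text: a price whose unit is missing from the table is treated as non-matching (instead of silently inserting a zero factor through price[b]), and the tolerance check uses the absolute difference so that the order of the two prices does not matter. The sketch also assumes the source file is UTF-8 encoded and that no whitespace separates the number from the unit.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>

// P1-P2: factor that converts a price in the given unit to yuan per kilogram.
const std::map<std::string, double>& unitTable() {
    static const std::map<std::string, double> price = {
        {"元/公斤", 1},  {"元/斤", 2},     {"元/千克", 1}, {"元/1000克", 1},
        {"元/500克", 2}, {"元/100克", 10}, {"元/克", 1000}, {"元/吨", 0.001}};
    return price;
}

// P3-P5: split "2.5元/斤" into number and unit, then convert to yuan/kg.
// Returns false when the number is missing or the unit is unknown.
bool toYuanPerKg(const std::string& p, double& out) {
    std::size_t i = 0;
    while (i < p.size() && ((p[i] >= '0' && p[i] <= '9') || p[i] == '.')) ++i;
    const std::string a = p.substr(0, i);  // numeric part, p[0] .. p[i-1]
    const std::string b = p.substr(i);     // unit part
    const auto it = unitTable().find(b);
    if (a.empty() || it == unitTable().end()) return false;
    out = std::atof(a.c_str()) * it->second;
    return true;
}

// P6-P7: prices match when the converted values agree within 0.000001.
int strategyPrice(const std::string& Ak, const std::string& Bk) {
    double c1 = 0.0, c2 = 0.0;
    if (!toYuanPerKg(Ak, c1) || !toYuanPerKg(Bk, c2)) return 0;
    return std::fabs(c1 - c2) <= 0.000001 ? 1 : 0;
}
```

For instance, "2.5元/斤" and "5元/公斤" both convert to 5 yuan per kilogram and match, whereas "3元/公斤" and "3元/斤" convert to 3 and 6 and do not.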

Further, as shown in Fig. 4, the normalized date matching strategy function Strategy_Date() includes the following steps:

D1. Starting from the first character of r1, search forward; the first character r1[i] that is not in the range '0' to '9' is the first separator. Convert it to '/', i.e. r1[i] = '/'. The characters from the first through the (i-1)th are the year;

D2. If r1[i+1] is not the character '0', go directly to step D3; if r1[i+1] is the character '0', shift every character from position i+2 to the end of the string forward by one position, i.e. r1[i+1, i+2, ...] = r1[i+2, i+3, ...];

D3. Save the value i+1 in j. Starting from the (i+1)th character, search forward until a character r1[i] that is not in the range '0' to '9' is found; this is the second separator. Convert it to '/', i.e. r1[i] = '/'. The characters from the jth through the (i-1)th are the month, with any leading 0 already removed;

D4. Repeat step D2 to remove the leading 0 from the day. At this point the first input date string r1 has had its separators converted and its leading zeros removed;

D5. Process the second input date string r2 in the same way. Then compute the similarity between r1 and r2 using the formula $\mathrm{Sim}(r_{ik}, r_{jk}) = \begin{cases} \dfrac{10 - |C(r_{ik}, r_{jk})|}{10}, & |C(r_{ik}, r_{jk})| \leq 10 \\ 0, & |C(r_{ik}, r_{jk})| > 10 \end{cases}$.
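A minimal sketch of the date normalization in steps D1 to D4 and of the piecewise similarity formula in D5 follows. The names normalizeDate and dateSimilarity are hypothetical. The sketch assumes single-byte separators such as '-' or '.', so dates written with multi-byte separators would need different handling. Because the function C(r_ik, r_jk) is not defined in this passage, dateSimilarity takes its value as an input instead of guessing how it is computed.

```cpp
#include <cctype>
#include <cmath>
#include <cstddef>
#include <string>

// D1-D4: turn "2015-08-27" into "2015/8/27".
std::string normalizeDate(std::string r) {
    auto isDigit = [](char ch) {
        return std::isdigit(static_cast<unsigned char>(ch)) != 0;
    };
    std::size_t i = 0;
    // D1: the first non-digit character is the first separator.
    while (i < r.size() && isDigit(r[i])) ++i;
    if (i == r.size()) return r;  // no separator found; leave unchanged
    r[i] = '/';
    // D2: drop a leading zero from the month.
    if (i + 1 < r.size() && r[i + 1] == '0') r.erase(i + 1, 1);
    // D3: the next non-digit character is the second separator.
    i = i + 1;
    while (i < r.size() && isDigit(r[i])) ++i;
    if (i == r.size()) return r;  // only year and month present
    r[i] = '/';
    // D4: drop a leading zero from the day.
    if (i + 1 < r.size() && r[i + 1] == '0') r.erase(i + 1, 1);
    return r;
}

// D5: piecewise formula, given the value c of C(r_ik, r_jk).
double dateSimilarity(double c) {
    const double magnitude = std::fabs(c);
    return magnitude <= 10.0 ? (10.0 - magnitude) / 10.0 : 0.0;
}
```

Under the formula, a value of |C| = 0 gives similarity 1, |C| = 3 gives 0.7, and anything above 10 gives 0.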

Further, as shown in Fig. 5, the normalized place-of-origin matching strategy function Strategy_Origin() includes the following steps:

O1. Create the sets Sprov, Scity, and Scoun to hold all province-level, city-level, and county-level administrative divisions, respectively;

O2. Apply Chinese word segmentation to the place-of-origin attribute value of record A, and look up each resulting word in the sets Sprov, Scity, and Scoun to determine which level of administrative division it belongs to, thereby distinguishing province, city, and county. Store the province, city, and county of record A in Aprov, Acity, and Acoun, respectively, and assign the null value NULL to any missing level. Process the place-of-origin attribute value of record B in the same way, storing its province, city, and county in Bprov, Bcity, and Bcoun, respectively;

O3. Fill in the missing administrative division levels. Using the hierarchy of administrative divisions, complete the missing levels from the bottom up; levels that cannot be completed are left as they are.
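Steps O1 to O3 can be sketched as follows, under several assumptions. Word segmentation is taken as already done, so the input is a list of tokens. The struct Origin, the functions classifyOrigin and completeOrigin, and the two parent lookup tables cityOfCounty and provOfCity are all hypothetical; the text only says that missing levels are filled in from the bottom up using the administrative hierarchy. An empty string stands in for NULL.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Province, city, and county of one record; "" plays the role of NULL.
struct Origin {
    std::string prov, city, coun;
};

// O2: assign each segmented token to the level whose set contains it.
Origin classifyOrigin(const std::vector<std::string>& tokens,
                      const std::set<std::string>& Sprov,
                      const std::set<std::string>& Scity,
                      const std::set<std::string>& Scoun) {
    Origin o;
    for (const std::string& t : tokens) {
        if (Sprov.count(t)) o.prov = t;
        else if (Scity.count(t)) o.city = t;
        else if (Scoun.count(t)) o.coun = t;
    }
    return o;
}

// O3: fill gaps bottom-up (county -> city, then city -> province).
// Levels that cannot be derived are left empty.
void completeOrigin(Origin& o,
                    const std::map<std::string, std::string>& cityOfCounty,
                    const std::map<std::string, std::string>& provOfCity) {
    if (o.city.empty() && !o.coun.empty()) {
        const auto it = cityOfCounty.find(o.coun);
        if (it != cityOfCounty.end()) o.city = it->second;
    }
    if (o.prov.empty() && !o.city.empty()) {
        const auto it = provOfCity.find(o.city);
        if (it != provOfCity.end()) o.prov = it->second;
    }
}
```

A record that names only a county can be completed to all three levels, while a record that names only a province keeps its city and county empty, since a higher level does not determine a lower one. The text does not describe how the completed fields are turned into a similarity value, so the sketch stops at completion.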

The data similarity processing method provided by the embodiments of the present invention has been described in detail above. A person of ordinary skill in the art may, following the idea of these embodiments, vary the specific implementation and the scope of application. Accordingly, the contents of this specification should not be construed as limiting the present invention, and any change made in accordance with the design concept of the invention falls within its scope of protection.

Claims

1. An Internet of Things data similarity processing method, characterized in that it comprises the following steps:

S1. obtaining multiple product records from a WEB trading platform and selecting two product records that share multiple identical attributes, namely a first product record and a second product record;

S2. storing the attributes of the first product record in a first array, and storing the attributes of the second product record in a second array;

S3. for each attribute of the first product record and the second product record, calculating the attribute similarity value using the corresponding attribute function, and storing the attribute similarity values of the attributes in a third array;

S4. calculating the weight value of each attribute through a weight function, according to the importance of each attribute of the first product record and the second product record, and storing the weight values of the attributes in a fourth array;

S5. combining the third array of attribute similarity values with the fourth array of weight values, and calculating the overall similarity of the first product record and the second product record through an overall similarity function.

2. The Internet of Things data similarity processing method according to claim 1, characterized in that: in step S3, the attribute functions include a product alias matching strategy function, a product price conversion matching strategy function, a normalized date matching strategy function, a normalized place-of-origin matching strategy function, and an edit distance algorithm function.

3. The Internet of Things data similarity processing method according to claim 1, characterized in that: in step S2, the attributes of the first product record are placed into multiple first attribute arrays in the order product name, price, production date, place of origin, and the multiple first attribute arrays constitute the first array.

4. The Internet of Things data similarity processing method according to claim 1, characterized in that: in step S2, the attributes of the second product record are placed into multiple second attribute arrays in the order product name, price, production date, place of origin, and the multiple second attribute arrays constitute the second array.