CN105069129A - Self-adaptive multi-label prediction method - Google Patents
Self-adaptive multi-label prediction method
- Publication number
- CN105069129A CN201510501816.7A
- Authority
- CN
- China
- Prior art keywords
- gamma
- inst
- num
- voter
- lab
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 26
- 238000012706 support-vector machine Methods 0.000 claims abstract description 8
- 238000006116 polymerization reaction Methods 0.000 claims description 21
- 238000012546 transfer Methods 0.000 claims description 12
- 238000010606 normalization Methods 0.000 claims description 11
- 238000012549 training Methods 0.000 claims description 9
- 230000015572 biosynthetic process Effects 0.000 claims 1
- 230000003044 adaptive effect Effects 0.000 abstract description 7
- 238000012545 processing Methods 0.000 abstract description 6
- 230000002776 aggregation Effects 0.000 description 34
- 238000004220 aggregation Methods 0.000 description 34
- 230000007704 transition Effects 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 5
- 238000004364 calculation method Methods 0.000 description 4
- 230000008878 coupling Effects 0.000 description 2
- 238000010168 coupling process Methods 0.000 description 2
- 238000005859 coupling reaction Methods 0.000 description 2
- 238000002372 labelling Methods 0.000 description 2
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9562—Bookmark management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a self-adaptive multi-label prediction method, characterized by the following steps: 1. obtain an initialized example set; 2. determine the leader examples, outsider examples and voter examples in the initialized example set; 3. determine the clusters to which the voter examples belong; 4. coarsely classify the prediction examples with a support vector machine; 5. perform multi-label prediction on the prediction examples. The invention can accurately attach labels to network information and improves the accuracy, universality, interpretability and transferability of multi-label prediction, thereby achieving intelligent information classification and processing in a big-data environment.
Description
Technical Field

The invention belongs to the field of intelligent information classification and processing, and in particular relates to a self-adaptive multi-label prediction method applicable to the fast clustering of multimedia information and the discovery of density-peak points in a big-data environment.
Background Art

With the rapid development of the Internet, the amount of information is growing geometrically. Today's microblogs, forums, WeChat, online video, online shopping and social networks all rely on tags to make search and classification easier for users. Accurate and detailed tags let users find what they need quickly, and they also allow merchants to segment users and recommend products that match the tastes of each user group, so that users do not have to wade through large amounts of irrelevant information and valuable content is not drowned in the ocean of information. Conversely, a merchant that cannot handle information overload properly will keep losing customers.

Current methods for attaching multiple labels to information mainly either decompose the multi-label problem into independent single-label problems, or transform it into a ranking over labels. Decomposition into single labels completely ignores the correlations between labels, so its accuracy is low. Label ranking not only requires a large amount of computation; once the labels are ordered, one must still decide whether the label before or after a given position is the more similar one, so it likewise suffers from low accuracy.
Compared with the present invention, existing processing methods have the following shortcomings:

1. Many machine-learning methods exist for predicting a single label of network information, i.e. for recognition problems. Because the multiple labels of a piece of information are correlated, however, decomposing the multi-label problem into single-label problems produces labels of low accuracy that cannot serve practical purposes.

2. Current multi-label prediction techniques usually handle only a given static data set. When new information arrives they typically have to relearn and reset their parameters, and cannot adjust the parameters automatically as the data change, so their generalization is weak and their universality is poor.

3. Turning multi-label prediction into an ordering relation between labels not only requires a large amount of computation, but is also hard to interpret, and its prediction accuracy is not high.

4. Most existing multi-label prediction techniques are designed to improve one particular evaluation metric while ignoring the others; this makes them poorly portable and suitable only for data sets that satisfy certain conditions.
Summary of the Invention

To overcome the deficiencies of the prior art, the present invention provides a self-adaptive multi-label prediction method, with the aim of labelling network information accurately and improving the accuracy, universality, interpretability and transferability of multi-label prediction, thereby achieving intelligent information classification and processing in a big-data environment.

The present invention adopts the following technical scheme to solve the technical problem:

The self-adaptive multi-label prediction method of the present invention is characterized by the following steps:
Step 1: obtain the initialized example set D:

Step 1.1: build the original example set D′={inst′1, inst′2, …, inst′a, …, inst′num′} from num′ known objects, where inst′a denotes the original example corresponding to the a-th known object, 1≤a≤num′, and inst′a={attr′a; lab′a}. attr′a is the attribute set describing the features of the a-th known object, lab′a is the label set describing its semantics, attr′a={attr′a,1, attr′a,2, …, attr′a,n} where attr′a,n is the n-th attribute of the a-th known object and n is its number of attributes, and lab′a={lab′a,1, lab′a,2, …, lab′a,x, …, lab′a,m} where lab′a,x is the x-th label of the a-th known object, m is its number of labels and 1≤x≤m. lab′a,x=1 means that the semantics of the a-th known object conforms to the x-th label, and lab′a,x=0 means that it does not.

Step 1.2: normalize the attribute sets {attr′1, attr′2, …, attr′a, …, attr′num′} of the num′ known objects in the original example set D′ to obtain the normalized attribute sets {attr″1, attr″2, …, attr″a, …, attr″num′}. If the m label values associated with the normalized attribute set attr″a of the a-th known object are all 0, delete the original example to which that object belongs. This yields the initialized example set D={inst1, inst2, …, insti, …, instnum} of num examples, where insti is the example corresponding to the i-th initialized known object, insti={attri; labi}, attri is the attribute set of the i-th initialized example, labi is the label set of its semantics, and 1≤i≤num.
Step 2: compute the clustering degree of every example in the initialized example set D, and use it to determine the leader examples, outsider examples and voter examples in D:

Step 2.1: treat the m labels of each of the num examples in D as m-dimensional coordinates, and compute the Euclidean distance dik between the labels of the i-th example insti and the k-th example instk, 1≤k≤num and k≠i.

Step 2.2: define the iteration counter γ and initialize γ=1; define clui as the cluster to which the i-th example insti belongs.

Step 2.3: use formula (1) to obtain the cohesion degree of the i-th example insti in the γ-th iteration, and hence the cohesion degrees of all num examples in the γ-th iteration; record the largest cohesion degree.

In formula (1), … is the threshold of the γ-th iteration; when …

Step 2.4: use formula (2) or formula (3) to obtain the difference degree of the i-th example insti in the γ-th iteration, and hence the difference degrees of all num examples in the γ-th iteration; when …

Step 2.5: normalize the difference degrees of the num examples of the γ-th iteration to obtain the normalized difference degrees.

Step 2.6: use formula (4) to obtain the clustering degree of the i-th example insti in the γ-th iteration, and hence the clustering degrees sco(γ) of all num examples in the γ-th iteration.

Step 2.7: sort the clustering degrees sco(γ) of the num examples of the γ-th iteration in descending order to obtain the clustering-degree sequence sco′(γ), and order the cohesion degrees correspondingly.

Step 2.8: initialize t=1.

Step 2.9: judge whether … and … ≥ num×3% both hold. If they do, the threshold of the γ-th iteration is a valid value; record t and go to step 2.10. Otherwise, judge whether … holds; if it does, assign t+1 to t and repeat step 2.9; if not, modify the threshold, assign γ+1 to γ and return to step 2.3.

Step 2.10: if the cohesion degree of the i-th example insti in the γ-th iteration satisfies …, then insti is an outsider example and its cluster is set to clui=−1. Otherwise, judge whether … holds; if it does, insti is a leader example and clui=i; if not, insti is a voter example.

Step 2.11: count the number of leader examples and the number of voter examples, denoted N and M respectively.

Step 2.12: write the set of N leader examples as D(l), 1≤α≤N; the cohesion degrees corresponding to D(l) are …, where … is the cohesion degree of the α-th leader example, and the label set corresponding to D(l) is …

Step 2.13: write the set of M voter examples as D(v), 1≤β≤M; the cohesion degrees corresponding to D(v) are …, where … is the cohesion degree of the β-th voter example, and the label set corresponding to D(v) is …
Step 3: obtain the clusters clu(v) to which the M voter examples in D(v) belong:

Step 3.1: define the iteration counter χ and initialize χ=1; define the z-th transfer example instz, z≥0; initialize α=1, β=1 and z=0.

Step 3.2: select the α-th leader example from the set of N leader examples D(l), and compute the Euclidean distance between the labels of the α-th leader example and the β-th voter example of the χ-th iteration.

Step 3.3: if …, assign β+1 to β and judge whether β≤M holds; if it does, repeat step 3.3, otherwise go to step 3.5. If …, judge whether the cluster of the β-th voter example of the χ-th iteration is empty; if it is, go to step 3.4; otherwise the value of that cluster is the index of the existing leader example of the χ-th iteration, denoted …, and step 3.11 is executed.

Step 3.4: assign the index α(l) of the α-th leader example to …, assign z+1 to z, and copy the index βχ, label set, cohesion degree and cluster of the β-th voter example of the χ-th iteration to the index, label set, cohesion degree and cluster of the z-th transfer example of the χ-th iteration; then assign β+1 to β and judge whether β≤M holds; if it does, go to step 3.3, otherwise go to step 3.5.

Step 3.5: if z≤0, go to step 3.14. Otherwise assign χ+1 to χ, carry the quantities of iteration χ−1 over to iteration χ, set β=1, compute the Euclidean distance between the labels of the β-th voter example of the χ-th iteration and the z-th transfer example of the χ-th iteration, and assign z−1 to z.

Step 3.6: if …, assign β+1 to β and judge whether β≤M holds; if it does, repeat step 3.6, otherwise go to step 3.5. If …, judge whether the cluster of the β-th voter example of the χ-th iteration is empty; if it is, go to step 3.7; otherwise the value of that cluster is the index of the existing leader example of the χ-th iteration, denoted …, and step 3.8 is executed.

Step 3.7: assign the index z(χ) of the z-th transfer example of the χ-th iteration to …, assign z+1 to z, set …, and assign β+1 to β; judge whether β≤M holds; if it does, repeat step 3.6, otherwise go to step 3.5.

Step 3.8: use formula (5) to obtain the influence between the β-th voter example of the χ-th iteration and the existing leader example of the χ-th iteration.

Step 3.9: use formula (6) to obtain the influence between the β-th voter example of the χ-th iteration and the z-th transfer example of the χ-th iteration.

Step 3.10: if …, assign β+1 to β and go to step 3.6. Otherwise set …, assign z+1 to z, set …, assign β+1 to β and judge whether β≤M holds; if it does, go to step 3.6, otherwise go to step 3.5.

Step 3.11: use formula (7) to obtain the influence between the β-th voter example of the χ-th iteration and the existing leader example of the χ-th iteration.

Step 3.12: use formula (8) to obtain the influence between the β-th voter example of the χ-th iteration and the α-th leader example.

Step 3.13: if …, assign β+1 to β and go to step 3.3. Otherwise assign the index α(l) of the α-th leader example to …, assign z+1 to z, set …, assign β+1 to β and judge whether β≤M holds; if it does, go to step 3.3, otherwise go to step 3.5.

Step 3.14: assign α+1 to α and judge whether α≤N holds; if it does, set β=1 and go to step 3.2, otherwise go to step 3.15.

Step 3.15: assign the clusters obtained for the M voter examples of D(v) in the χ-th iteration, in order, to the clusters of the M voter examples of D(v).

Step 3.16: judge whether any voter example still has an empty cluster; if so, set the cluster of every such voter example to −1.
Step 4: coarsely classify the prediction examples with a support vector machine:

Step 4.1: build the prediction example set P={instp1, instp2, …, instpj, …, instpnump} of nump prediction examples, where instpj is the j-th prediction example, 1≤j≤nump, and instpj={attrpj; labpj}. attrpj is the attribute set of the j-th prediction example instpj, labpj is its label set, and clupj denotes the cluster to which instpj belongs.

Step 4.2: take the num clusters {clu1, clu2, …, clui, …, clunum} corresponding to the initialized example set D as training labels, the num attribute sets {attr1, attr2, …, attri, …, attrnum} of the known objects in D as training samples, and the nump attribute sets {attrp1, attrp2, …, attrpj, …, attrpnump} of the prediction example set P as prediction samples; train with the support-vector-machine method to obtain nump predicted labels, and assign these nump predicted labels to the nump clusters of the prediction example set P, thereby completing the coarse classification of P.

Step 5: perform multi-label prediction on the nump prediction examples:
Step 5.1: merge the num examples of the initialized example set D and the nump examples of the prediction example set P into the ψ-th updated example set.

Step 5.2: treat the n attributes of each of the num+nump examples in the ψ-th updated example set as n-dimensional coordinates, and compute the Euclidean distance between the attributes of the Ω-th and the ξ-th update examples of the ψ-th update, 1≤ξ≤num+nump and ξ≠Ω.

Step 5.3: use formula (9) to obtain the attribute aggregation degree of the Ω-th update example of the ψ-th update, and hence the attribute aggregation degrees of the num+nump update examples of the ψ-th update; when …

Step 5.4: initialize j=1.

Step 5.5: if the cluster clupj of the j-th prediction example instpj in P is the same as the cluster clui of the i-th known example insti in D, use formula (10) to obtain the influence graij between insti and instpj.

In formula (10), Γi is the attribute aggregation degree of the update example corresponding to the known example insti in the ψ-th updated example set, Γj is the attribute aggregation degree of the update example corresponding to the prediction example instpj in the ψ-th updated example set, and dij is the Euclidean distance between the attributes of insti and instpj.

Step 5.6: repeat step 5.5 to obtain the influence between the j-th prediction example instpj and every other known example in D, and record the largest influence gramax.

Step 5.7: if graij=gramax, set labpj=labi, meaning that every label in the label set labpj of the prediction example equals the corresponding label in the label set labi of the initialized example set D; this gives the j-th multi-label prediction result.

Step 5.8: assign j+1 to j and judge whether j≤nump holds; if it does, return to step 5.5; otherwise the multi-label prediction of the nump prediction examples is complete.
The self-adaptive multi-label prediction method of the present invention is further characterized in that:

Step 5 also comprises step 5.9: assign the label sets of the prediction example set P, for which multi-label prediction has been completed, to the corresponding ψ-th updated example set, thereby obtaining the (ψ+1)-th updated example set, and use the (ψ+1)-th updated example set as a new initialized example set for self-adaptive multi-label prediction.

When new prediction examples with the same object features and the same object semantics appear, multi-label prediction for them can be completed simply by starting again from step 4.

In step 2.9, the rule for modifying the threshold is: if …, assign the threshold minus τ2 to the threshold; otherwise assign the threshold plus τ2 to the threshold, with 0.1≤τ2≤0.5 and 75%≤τ1<100%.
Compared with the prior art, the beneficial effects of the present invention are:

1. The invention first performs a coarse classification and then a precise prediction. Thanks to its self-adaptivity, the predicted labels keep evolving over multiple rounds of iteration, so the method obtains more accurate predictions than existing multi-label prediction techniques and can be put to practical use.

2. Through the initialized example set, different initialization sets can be built for different known object features and semantics, so the invention can be applied to most application environments of existing network platforms; from simple textual data to audio and even images, good label predictions can be made, giving the invention stronger universality than the prior art.

3. The invention computes a cohesion degree to express how cohesive an example is and a difference degree to express how strongly it is coupled to others, and derives the clustering degree from the two. Every parameter has a concrete meaning, the classification requirement of high cohesion and low coupling is fully taken into account, and the quantities are easy to understand and explain. This guarantees high prediction accuracy while giving the invention strong portability, so multi-label prediction can be carried out under a variety of conditions.

4. Through the cohesion degree, the invention can accurately find the leader examples in each product domain. For microblogs, forums and social networks, this makes it possible to locate the most influential key users in different topic areas; a detailed study of their behaviour can then predict likely trends in the field and provide accurate recommendations to its users.

5. By computing the influence between examples, the invention can be used not only for multi-label prediction but also to compare examples with known labels of the same semantics, find the examples whose multi-label sets are most similar, and recommend them to users, improving the user experience.

6. When determining the multi-label set of a prediction example, the invention takes the label set of the known example that is most similar to the prediction example. The user group of that known example can therefore be recommended to the newly appearing prediction example, which helps a new product find a fairly accurate market position and discover potential users.

7. Because prediction examples whose multi-label prediction has been completed are added back into the initialized example set, the existing training set is enriched and the accuracy of the next round of prediction improves. The invention therefore has a self-adaptive learning ability: newly added examples further refine the existing data set, and as the number of examples with known labels grows, the prediction accuracy of the method increases further.
Detailed Description of the Embodiments

In this embodiment, a self-adaptive multi-label prediction method is carried out according to the following steps:

Step 1: obtain the initialized example set D:

Step 1.1: build the original example set D′={inst′1, inst′2, …, inst′a, …, inst′num′} from num′ known objects, where inst′a denotes the original example corresponding to the a-th known object, 1≤a≤num′, and inst′a={attr′a; lab′a}. attr′a is the attribute set describing the features of the a-th known object, lab′a is the label set describing its semantics, attr′a={attr′a,1, attr′a,2, …, attr′a,n} where attr′a,n is the n-th attribute and n the number of attributes of the a-th known object, and lab′a={lab′a,1, lab′a,2, …, lab′a,x, …, lab′a,m} where lab′a,x is the x-th label, m the number of labels and 1≤x≤m. lab′a,x=1 means that the semantics of the a-th known object conforms to the x-th label, and lab′a,x=0 means that it does not. Suppose the known objects are pictures: object features that need a detailed description, such as colour difference and size, form the attribute set, with precise numerical values for each attribute; yes-or-no object semantics such as "landscape picture" or "animal picture" form the label set, where 0 means the object does not match the label and 1 means it does.

Step 1.2: normalize the attribute sets {attr′1, attr′2, …, attr′a, …, attr′num′} of the num′ known objects in the original example set D′. Taking the attribute set attr′a of the a-th known object as an example, first record the attribute with the largest value, attr′a,max, in {attr′a,1, attr′a,2, …, attr′a,n}, then divide every attribute in the set by attr′a,max; this gives the normalized attribute set attr″a of the a-th known object, and the normalized attribute sets {attr″1, attr″2, …, attr″a, …, attr″num′} of all num′ known objects are obtained in the same way. If the m label values associated with the normalized attribute set attr″a of the a-th known object are all 0, delete the original example to which that object belongs. This yields the initialized example set D={inst1, inst2, …, insti, …, instnum} of num examples, where insti is the example corresponding to the i-th initialized known object, insti={attri; labi}, attri is the attribute set of the i-th initialized example, labi is the label set of its semantics, and 1≤i≤num, as shown in Table 1:

Table 1: data table of the i-th example insti of the initialized example set D
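The normalization and filtering of step 1.2 can be illustrated with a minimal sketch. The representation of the attribute and label sets as NumPy arrays and the function name are assumptions made here, not part of the patent:

```python
import numpy as np

def build_initial_example_set(attrs, labels):
    """Step 1.2 sketch: divide every attribute of an example by that example's
    largest attribute value, then drop examples whose label vector is all zero.

    attrs  -- array of shape (num', n), the raw attribute sets attr'_a
    labels -- array of shape (num', m), the binary label sets lab'_a
    """
    attrs = np.asarray(attrs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    normalized = attrs / attrs.max(axis=1, keepdims=True)  # attr''_a = attr'_a / attr'_a,max
    keep = labels.sum(axis=1) > 0                          # remove all-zero-label examples
    return normalized[keep], labels[keep]
```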
Step 2: compute the clustering degree of every example in the initialized example set D, and use it to determine the leader examples, outsider examples and voter examples in D:

Step 2.1: treat the m labels of each of the num examples in D as m-dimensional coordinates, and compute the Euclidean distance dik between the labels of the i-th example insti and the k-th example instk, 1≤k≤num and k≠i. For example, to compute the Euclidean distance d12 between the labels of the first and second examples: both examples have m labels with the same names but not necessarily the same values, written as the label set lab1={lab1,1, lab1,2, …, lab1,m} of the first example and the label set lab2={lab2,1, lab2,2, …, lab2,m} of the second; the Euclidean distance between the labels is then d12 = sqrt((lab1,1−lab2,1)² + (lab1,2−lab2,2)² + … + (lab1,m−lab2,m)²).

Step 2.2: define the iteration counter γ and initialize γ=1; define clui as the cluster to which the i-th example insti belongs.

Step 2.3: use formula (1) to obtain the cohesion degree of the i-th example insti in the γ-th iteration, and hence the cohesion degrees of all num examples in the γ-th iteration; record the largest cohesion degree.

In formula (1), … is the threshold of the γ-th iteration; when …

Step 2.4: use formula (2) or formula (3) to obtain the difference degree of the i-th example insti in the γ-th iteration, and hence the difference degrees of all num examples in the γ-th iteration; when …

Step 2.5: normalize the difference degrees of the num examples of the γ-th iteration to obtain the normalized difference degrees. Steps 2.4 and 2.5 give the normalized difference degrees a clear separation: a few values are close to 1 while most are below 0.5, which helps the selection of the leader examples.

Step 2.6: use formula (4) to obtain the clustering degree of the i-th example insti in the γ-th iteration, and hence the clustering degrees sco(γ) of all num examples in the γ-th iteration.

Step 2.7: sort the clustering degrees sco(γ) of the num examples of the γ-th iteration in descending order to obtain the clustering-degree sequence sco′(γ), and order the cohesion degrees correspondingly.

Step 2.8: initialize t=1.

Step 2.9: judge whether … and … ≥ num×3% both hold. If they do, the threshold of the γ-th iteration is a valid value; record t and go to step 2.10. Otherwise, judge whether … holds; if it does, assign t+1 to t and repeat step 2.9; if not, modify the threshold (the rule for modifying the threshold is: if …, assign the threshold minus τ2 to the threshold, otherwise assign the threshold plus τ2 to the threshold, with 0.1≤τ2≤0.5 and 75%≤τ1<100%), assign γ+1 to γ and return to step 2.3. In the condition "… and … ≥ num×3%", the constants 1.25 and 3% are not fixed: the invention assumes that the number of examples is on the order of ten thousand and the number of labels is below 20, for which these values give a good solution. When the numbers of examples and labels change, they can be adjusted as appropriate; the principle is to guarantee that the following steps select as leader examples only a small number of examples whose clustering degree is far larger than that of the others.

Step 2.10: if the cohesion degree of the i-th example insti in the γ-th iteration satisfies …, then insti is an outsider example and its cluster is set to clui=−1. Otherwise, judge whether … holds; if it does, insti is a leader example and clui=i; if not, insti is a voter example.

Step 2.11: count the number of leader examples and the number of voter examples, denoted N and M respectively.

Step 2.12: write the set of N leader examples as D(l), 1≤α≤N; the cohesion degrees corresponding to D(l) are …, where … is the cohesion degree of the α-th leader example, and the label set corresponding to D(l) is …

Step 2.13: write the set of M voter examples as D(v), 1≤β≤M; the cohesion degrees corresponding to D(v) are …, where … is the cohesion degree of the β-th voter example, and the label set corresponding to D(v) is …
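Formulas (1)–(4) themselves are not reproduced in this text, so the following sketch only conveys the general idea of step 2. It assumes, in the spirit of the density-peak clustering mentioned in the technical field, that the cohesion degree counts the examples within the threshold distance, that the difference degree is the distance to the nearest example of higher cohesion, and that the clustering degree is their product; the exact expressions, the conditions of steps 2.9–2.10 and the iterative threshold adjustment are all omitted, so the outsider and leader criteria below are illustrative assumptions only:

```python
import numpy as np

def select_roles(label_matrix, d_c, leader_quantile=0.97):
    """Illustrative leader / outsider / voter assignment for step 2.

    ASSUMED forms (not taken from the patent text): cohesion = number of examples
    within distance d_c, difference = distance to the nearest example of higher
    cohesion, clustering degree = cohesion * normalized difference.
    """
    X = np.asarray(label_matrix, dtype=float)
    num = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # label distances d_ik

    cohesion = (dist < d_c).sum(axis=1) - 1        # exclude the example itself
    diff = np.empty(num)
    for i in range(num):
        higher = cohesion > cohesion[i]
        diff[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    diff = diff / diff.max()                       # step 2.5: normalize the difference degrees

    score = cohesion * diff                        # clustering degree sco (assumed form)
    roles = np.full(num, "voter", dtype=object)
    roles[cohesion == 0] = "outsider"              # isolated examples (illustrative criterion)
    roles[score >= np.quantile(score, leader_quantile)] = "leader"
    return cohesion, diff, score, roles
```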
Step 3: obtain the clusters clu(v) to which the M voter examples in D(v) belong:

Step 3.1: define the iteration counter χ and initialize χ=1; define the z-th transfer example instz, z≥0; initialize α=1, β=1 and z=0. The storage structure of the z-th transfer example instz is similar to a common stack; for clarity of exposition the invention also introduces the iteration counter χ to distinguish transfer examples that share the same z. At this point the clusters of all M voter examples in D(v) are still empty.

Step 3.2: select the α-th leader example from the set of N leader examples D(l), and compute the Euclidean distance between the labels of the α-th leader example and the β-th voter example of the χ-th iteration.

Step 3.3: if …, assign β+1 to β and judge whether β≤M holds; if it does, repeat step 3.3, otherwise go to step 3.5. If …, judge whether the cluster of the β-th voter example of the χ-th iteration is empty; if it is, go to step 3.4; otherwise the value of that cluster is the index of the existing leader example of the χ-th iteration, denoted …, and step 3.11 is executed. For example, if the existing leader example of the χ-th iteration is inst9, then …

Step 3.4: assign the index α(l) of the α-th leader example to …, assign z+1 to z, and copy the index βχ, label set, cohesion degree and cluster of the β-th voter example of the χ-th iteration to the index, label set, cohesion degree and cluster of the z-th transfer example of the χ-th iteration; then assign β+1 to β and judge whether β≤M holds; if it does, go to step 3.3, otherwise go to step 3.5. Saying that one example "equals" another only means that their corresponding values are the same, i.e. the index, label set, cohesion degree and cluster of the example on the right-hand side of the equals sign are assigned to the index, label set, cohesion degree and cluster of the example on the left-hand side.

Step 3.5: if z≤0, go to step 3.14. Otherwise assign χ+1 to χ and carry the quantities of iteration χ−1 over to iteration χ (for the other parameters associated with χ, the values associated with χ−1 must likewise be assigned to the corresponding values associated with χ, so as to keep the data coherent and consistent); set β=1, compute the Euclidean distance between the labels of the β-th voter example of the χ-th iteration and the z-th transfer example of the χ-th iteration, and assign z−1 to z.

Step 3.6: if …, assign β+1 to β and judge whether β≤M holds; if it does, repeat step 3.6, otherwise go to step 3.5. If …, judge whether the cluster of the β-th voter example of the χ-th iteration is empty; if it is, go to step 3.7; otherwise the value of that cluster is the index of the existing leader example of the χ-th iteration, denoted …, and step 3.8 is executed.

Step 3.7: assign the index z(χ) of the z-th transfer example of the χ-th iteration to …, assign z+1 to z, set …, and assign β+1 to β; judge whether β≤M holds; if it does, repeat step 3.6, otherwise go to step 3.5.

Step 3.8: use formula (5) to obtain the influence between the β-th voter example of the χ-th iteration and the existing leader example of the χ-th iteration.

Formula (5) can be generalized to compute the influence between any two examples of the same semantics: it suffices to know the cohesion degrees of the two examples and the Euclidean distance between their labels, or the attribute aggregation degrees of the two examples and the Euclidean distance between their attributes; applying formula (5) then yields the influence between the two examples.

Step 3.9: use formula (6) to obtain the influence between the β-th voter example of the χ-th iteration and the z-th transfer example of the χ-th iteration.

Step 3.10: if …, assign β+1 to β and go to step 3.6. Otherwise set …, assign z+1 to z, set …, assign β+1 to β and judge whether β≤M holds; if it does, go to step 3.6, otherwise go to step 3.5.

Step 3.11: use formula (7) to obtain the influence between the β-th voter example of the χ-th iteration and the existing leader example of the χ-th iteration.

Step 3.12: use formula (8) to obtain the influence between the β-th voter example of the χ-th iteration and the α-th leader example.

Step 3.13: if …, assign β+1 to β and go to step 3.3. Otherwise assign the index α(l) of the α-th leader example to …, assign z+1 to z, set …, and judge whether β≤M holds; if it does, assign β+1 to β and go to step 3.3, otherwise go to step 3.5.

Step 3.14: assign α+1 to α and judge whether α≤N holds; if it does, set β=1 and go to step 3.2, otherwise go to step 3.15.

Step 3.15: assign the clusters obtained for the M voter examples of D(v) in the χ-th iteration, in order, to the clusters of the M voter examples of D(v).

Step 3.16: judge whether any voter example still has an empty cluster; if so, set the cluster of every such voter example to −1. The cluster of a voter example can therefore take N+1 possible values, corresponding to the clusters of the N leader examples and to the value −1.
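Because formulas (5)–(8) and the conditions driving the stack of transfer examples are not reproduced here, the end result of step 3 can only be sketched in a simplified form: every voter example ends up with the index of one leader example, or −1. The stand-in below simply assigns each voter to its nearest leader in label space within an optional cutoff; it deliberately collapses the influence propagation of steps 3.2–3.15 and is not the patent's procedure:

```python
import numpy as np

def assign_voters_to_leaders(voter_labels, leader_labels, leader_indices, d_max=None):
    """Simplified stand-in for step 3: give every voter example the cluster index
    of its closest leader in label space, or -1 if no leader lies within d_max."""
    leaders = np.asarray(leader_labels, dtype=float)
    clusters = np.full(len(voter_labels), -1, dtype=int)
    for b, v in enumerate(np.asarray(voter_labels, dtype=float)):
        dists = np.linalg.norm(leaders - v, axis=1)   # label distances to all leaders
        a = int(np.argmin(dists))
        if d_max is None or dists[a] <= d_max:
            clusters[b] = leader_indices[a]           # clu_beta = index of that leader
    return clusters
```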
Step 4: coarsely classify the prediction examples with a support vector machine:

Step 4.1: build the prediction example set P={instp1, instp2, …, instpj, …, instpnump} of nump prediction examples, where instpj is the j-th prediction example, 1≤j≤nump, instpj={attrpj; labpj}, attrpj is the attribute set of instpj, labpj is its label set, and clupj denotes the cluster to which instpj belongs. In the present invention the prediction examples and the known examples must describe the same kind of object, i.e. the object features and semantics are the same. For example, if the known examples are pictures, the prediction examples must also be pictures: features that need a detailed description, such as colour difference and size, form the attribute set, and yes-or-no object semantics such as "landscape picture" or "animal picture" form the label set. The two example sets have attribute sets and label sets with the same names but different values; for clarity, the invention uses different symbols to distinguish them in the discussion.

Step 4.2: take the num clusters {clu1, clu2, …, clui, …, clunum} corresponding to the initialized example set D as training labels, the num attribute sets {attr1, attr2, …, attri, …, attrnum} of the known objects in D as training samples, and the nump attribute sets {attrp1, attrp2, …, attrpj, …, attrpnump} of the prediction example set P as prediction samples; train with the support-vector-machine method to obtain nump predicted labels, and assign these nump predicted labels to the nump clusters of the prediction example set P, thereby completing the coarse classification of P. A support-vector-machine method usually has three inputs, namely the training labels, the training samples and the prediction samples, and produces one output, the predicted labels.
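A coarse classification of this kind can be written, for instance, with scikit-learn's support vector classifier; the choice of sklearn.svm.SVC and of its default kernel is an assumption made here, not something specified by the patent:

```python
from sklearn.svm import SVC

def coarse_classify(train_attrs, train_clusters, predict_attrs):
    """Step 4.2 sketch: train an SVM on (attribute set, cluster index) pairs from
    the initialized example set D and predict a cluster index clup_j for every
    prediction example in P."""
    svm = SVC()                          # kernel and hyper-parameters are assumptions
    svm.fit(train_attrs, train_clusters)
    return svm.predict(predict_attrs)
```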
Step 5: perform multi-label prediction on the nump prediction examples:

Step 5.1: merge the num examples of the initialized example set D and the nump examples of the prediction example set P into the ψ-th updated example set.

Step 5.2: treat the n attributes of each of the num+nump examples in the ψ-th updated example set as n-dimensional coordinates, and compute the Euclidean distance between the attributes of the Ω-th and the ξ-th update examples of the ψ-th update, 1≤ξ≤num+nump and ξ≠Ω.

Step 5.3: use formula (9) to obtain the attribute aggregation degree of the Ω-th update example of the ψ-th update, and hence the attribute aggregation degrees of the num+nump update examples of the ψ-th update; when …

Step 5.4: initialize j=1.

Step 5.5: if the cluster clupj of the j-th prediction example instpj in P is the same as the cluster clui of the i-th known example insti in D, use formula (10) to obtain the influence graij between insti and instpj.

In formula (10), Γi is the attribute aggregation degree of the update example corresponding to the known example insti in the ψ-th updated example set, Γj is the attribute aggregation degree of the update example corresponding to the prediction example instpj in the ψ-th updated example set, and dij is the Euclidean distance between the attributes of insti and instpj.

Step 5.6: repeat step 5.5 to obtain the influence between the j-th prediction example instpj and every other known example in D, and record the largest influence gramax.

Step 5.7: if graij=gramax, set labpj=labi, meaning that every label in the label set labpj of the prediction example equals the corresponding label in the label set labi of the initialized example set D; this gives the j-th multi-label prediction result.

Step 5.8: assign j+1 to j and judge whether j≤nump holds; if it does, return to step 5.5; otherwise the multi-label prediction of the nump prediction examples is complete.
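Steps 5.5–5.7 can be sketched as follows. The helper influence() stands in for formula (10), which is not reproduced in this text; the only properties used are the ones stated above (it depends on the two attribute aggregation degrees Γi and Γj and on the attribute distance dij), so its concrete form below is an assumption:

```python
import numpy as np

def influence(gamma_i, gamma_j, d_ij, eps=1e-12):
    # Placeholder for formula (10): assumed to grow with the attribute aggregation
    # degrees and to shrink with the attribute distance.  Not the patent's formula.
    return gamma_i * gamma_j / (d_ij ** 2 + eps)

def predict_labels(known_attrs, known_labels, known_clusters, known_gamma,
                   pred_attrs, pred_clusters, pred_gamma):
    """Steps 5.5-5.7 sketch: each prediction example copies the label set of the
    known example, within the same coarse cluster, that has the largest influence."""
    known_labels = np.asarray(known_labels, dtype=int)
    pred_labels = np.zeros((len(pred_attrs), known_labels.shape[1]), dtype=int)
    for j, (pa, pc, pg) in enumerate(zip(pred_attrs, pred_clusters, pred_gamma)):
        best_i, best_gra = -1, -np.inf
        for i, (ka, kc, kg) in enumerate(zip(known_attrs, known_clusters, known_gamma)):
            if kc != pc:                       # step 5.5: only compare within the same cluster
                continue
            d_ij = np.linalg.norm(np.asarray(ka, dtype=float) - np.asarray(pa, dtype=float))
            gra = influence(kg, pg, d_ij)
            if gra > best_gra:                 # step 5.6: keep the largest influence
                best_i, best_gra = i, gra
        if best_i >= 0:
            pred_labels[j] = known_labels[best_i]   # step 5.7: labp_j = lab_i
    return pred_labels
```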
Step 5.9: assign the label sets of the prediction example set P, for which multi-label prediction has been completed, to the corresponding ψ-th updated example set, thereby obtaining the (ψ+1)-th updated example set, and use the (ψ+1)-th updated example set as a new initialized example set for self-adaptive multi-label prediction. This enriches the existing training set and improves the accuracy of the next round of prediction. When new prediction examples with the same object features and the same object semantics appear, multi-label prediction for them can be completed simply by starting again from step 4.
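The self-adaptive update of step 5.9 amounts to appending the freshly labelled prediction examples to the known example set before the next round; a minimal sketch, with the same assumed array representation as above:

```python
import numpy as np

def update_example_set(known_attrs, known_labels, pred_attrs, pred_labels):
    """Step 5.9 sketch: merge the labelled prediction examples into the example set,
    so that the enlarged set serves as the initialization of the next round."""
    new_attrs = np.vstack([known_attrs, pred_attrs])
    new_labels = np.vstack([known_labels, pred_labels])
    return new_attrs, new_labels
```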
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510501816.7A CN105069129B (en) | 2015-06-24 | 2015-08-14 | Adaptive multi-tag Forecasting Methodology |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510355030.9A CN104915436A (en) | 2015-06-24 | 2015-06-24 | Adaptive multi-tag predication method |
CN2015103550309 | 2015-06-24 | ||
CN201510501816.7A CN105069129B (en) | 2015-06-24 | 2015-08-14 | Adaptive multi-tag Forecasting Methodology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105069129A true CN105069129A (en) | 2015-11-18 |
CN105069129B CN105069129B (en) | 2018-05-18 |
Family
ID=54084499
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510355030.9A Withdrawn CN104915436A (en) | 2015-06-24 | 2015-06-24 | Adaptive multi-tag predication method |
CN201510501816.7A Active CN105069129B (en) | 2015-06-24 | 2015-08-14 | Adaptive multi-tag Forecasting Methodology |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510355030.9A Withdrawn CN104915436A (en) | 2015-06-24 | 2015-06-24 | Adaptive multi-tag predication method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN104915436A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108629358A (en) * | 2017-03-23 | 2018-10-09 | 北京嘀嘀无限科技发展有限公司 | The prediction technique and device of object type |
CN110162692A (en) * | 2018-12-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | User tag determines method, apparatus, computer equipment and storage medium |
US11379758B2 (en) | 2019-12-06 | 2022-07-05 | International Business Machines Corporation | Automatic multilabel classification using machine learning |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106909540A (en) * | 2015-12-23 | 2017-06-30 | 神州数码信息系统有限公司 | A kind of smart city citizen's preference discovery technique based on Cooperative Study |
CN106971713B (en) * | 2017-01-18 | 2020-01-07 | 北京华控智加科技有限公司 | Speaker marking method and system based on density peak value clustering and variational Bayes |
CN108647711B (en) * | 2018-05-08 | 2021-04-20 | 重庆邮电大学 | Multi-label classification method of image based on gravity model |
CN110547806B (en) * | 2019-09-11 | 2022-05-31 | 湖北工业大学 | An online gesture recognition method and system based on surface electromyography signals |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090164416A1 (en) * | 2007-12-10 | 2009-06-25 | Aumni Data Inc. | Adaptive data classification for data mining |
CN102004801A (en) * | 2010-12-30 | 2011-04-06 | 焦点科技股份有限公司 | Information classification method |
CN102364498A (en) * | 2011-10-17 | 2012-02-29 | 江苏大学 | A Multi-label Based Image Recognition Method |
CN102945371A (en) * | 2012-10-18 | 2013-02-27 | 浙江大学 | Classifying method based on multi-label flexible support vector machine |
CN103077228A (en) * | 2013-01-02 | 2013-05-01 | 北京科技大学 | Set characteristic vector-based quick clustering method and device |
CN103927394A (en) * | 2014-05-04 | 2014-07-16 | 苏州大学 | Multi-label active learning classification method and system based on SVM |
-
2015
- 2015-06-24 CN CN201510355030.9A patent/CN104915436A/en not_active Withdrawn
- 2015-08-14 CN CN201510501816.7A patent/CN105069129B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090164416A1 (en) * | 2007-12-10 | 2009-06-25 | Aumni Data Inc. | Adaptive data classification for data mining |
CN102004801A (en) * | 2010-12-30 | 2011-04-06 | 焦点科技股份有限公司 | Information classification method |
CN102364498A (en) * | 2011-10-17 | 2012-02-29 | 江苏大学 | A Multi-label Based Image Recognition Method |
CN102945371A (en) * | 2012-10-18 | 2013-02-27 | 浙江大学 | Classifying method based on multi-label flexible support vector machine |
CN103077228A (en) * | 2013-01-02 | 2013-05-01 | 北京科技大学 | Set characteristic vector-based quick clustering method and device |
CN103927394A (en) * | 2014-05-04 | 2014-07-16 | 苏州大学 | Multi-label active learning classification method and system based on SVM |
Non-Patent Citations (2)
Title |
---|
- XIN LI et al.: "Active Learning with Multi-Label SVM Classification", Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence *
- LI Peipei: "Research on Concept Drift Detection and Classification Methods in Data Streams" (数据流中概念漂移检测与分类方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108629358A (en) * | 2017-03-23 | 2018-10-09 | 北京嘀嘀无限科技发展有限公司 | The prediction technique and device of object type |
CN108629358B (en) * | 2017-03-23 | 2020-12-25 | 北京嘀嘀无限科技发展有限公司 | Object class prediction method and device |
CN110162692A (en) * | 2018-12-10 | 2019-08-23 | 腾讯科技(深圳)有限公司 | User tag determines method, apparatus, computer equipment and storage medium |
CN110162692B (en) * | 2018-12-10 | 2021-05-25 | 腾讯科技(深圳)有限公司 | User label determination method and device, computer equipment and storage medium |
US11379758B2 (en) | 2019-12-06 | 2022-07-05 | International Business Machines Corporation | Automatic multilabel classification using machine learning |
Also Published As
Publication number | Publication date |
---|---|
CN105069129B (en) | 2018-05-18 |
CN104915436A (en) | 2015-09-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105069129B (en) | Adaptive multi-tag Forecasting Methodology | |
CN106202054B (en) | A kind of name entity recognition method towards medical field based on deep learning | |
CN106651519B (en) | Personalized recommendation method and system based on label information | |
CN108287864B (en) | Interest group dividing method, device, medium and computing equipment | |
CN110674407B (en) | Hybrid recommendation method based on graph convolutional neural network | |
CN104199826B (en) | A kind of dissimilar medium similarity calculation method and search method based on association analysis | |
CN107506793A (en) | Clothes recognition methods and system based on weak mark image | |
CN106600052A (en) | User attribute and social network detection system based on space-time locus | |
CN110020176A (en) | A kind of resource recommendation method, electronic equipment and computer readable storage medium | |
CN107239993A (en) | A kind of matrix decomposition recommendation method and system based on expansion label | |
Wu et al. | Joint semi-supervised learning and re-ranking for vehicle re-identification | |
CN105869016A (en) | Method for estimating click through rate based on convolution neural network | |
CN113222653B (en) | Method, system, equipment and storage medium for expanding audience of programmed advertisement users | |
CN108804577B (en) | Method for estimating interest degree of information tag | |
CN110033097A (en) | The method and device of the incidence relation of user and article is determined based on multiple data fields | |
CN112380433A (en) | Recommendation meta-learning method for cold-start user | |
CN104778283A (en) | User occupation classification method and system based on microblog | |
US20220156519A1 (en) | Methods and systems for efficient batch active learning of a deep neural network | |
CN104572915B (en) | One kind is based on the enhanced customer incident relatedness computation method of content environment | |
CN109885745A (en) | User portrait method, device, readable storage medium and terminal device | |
Huang et al. | An Ad CTR prediction method based on feature learning of deep and shallow layers | |
Wang et al. | Learning with group noise | |
CN107169830A (en) | A kind of personalized recommendation method based on cluster PU matrix decompositions | |
Chang et al. | Fine-grained butterfly and moth classification using deep convolutional neural networks | |
CN103544500B (en) | Multi-user natural scene mark sequencing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |