CN110766273A - Semi-supervised clustering teaching asset classification method for optimizing feature weight - Google Patents

Semi-supervised clustering teaching asset classification method for optimizing feature weight

Info

Publication number
CN110766273A
Authority
CN
China
Prior art keywords
asset
teaching
cluster
samples
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910871026.6A
Other languages
Chinese (zh)
Inventor
Sun Yao (孙曜)
Sun Shuangping (孙双平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Electronic Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Electronic Science and Technology University filed Critical Hangzhou Electronic Science and Technology University
Priority to CN201910871026.6A priority Critical patent/CN110766273A/en
Publication of CN110766273A publication Critical patent/CN110766273A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semi-supervised clustering teaching asset classification method with optimized feature weights. The method comprises the following steps: analyzing how the attribute features of assets are expressed according to the characteristics of the asset samples, extracting the feature items of the asset samples, introducing a feature-item weight calculation formula, and calculating the corresponding feature-item weights to obtain a vector space representation of the asset samples; performing unsupervised initial clustering on the processed asset samples to obtain initial clusters; performing semi-supervised clustering on the asset samples using a pairwise constraint set over the samples; and classifying newly added asset samples according to the semi-supervised hierarchical clustering result. The invention combines clustering and pairwise constraints for asset classification and provides an asset classification method based on semi-supervised clustering, thereby reducing the time required for manual classification and avoiding the divergent classification results caused by subjective differences; meanwhile, through the pairwise constraint set, the knowledge contained in the supervision information is further mined, achieving higher effectiveness and correctness.

Description

Semi-supervised clustering teaching asset classification method for optimizing feature weight
Technical Field
The invention relates to a semi-supervised clustering teaching asset classification method for optimizing feature weights.
Background
Teaching assets are the various usable conditions, such as the materials provided for the effective development of studies, activities, and so on. With the continuous development of computers and the internet, the number and variety of teaching assets have grown extremely rapidly. In management, if assets are not classified, or are classified inaccurately, the management and use of teaching assets is seriously hindered, and suitable available assets cannot be found among the massive teaching assets.
Most existing asset classification methods are manual and depend on people's subjective experience. They therefore suffer from inconsistent class divisions, inaccurate classification, and even cases that are difficult to classify at all. Moreover, when the number of samples is large, the time and expense required for manual classification are enormous. Applying clustering algorithms to asset classification can greatly reduce the time and cost of manual classification and avoid the influence of subjective experience on the classification results, so it has attracted more and more attention.
At present, most clustering-based classification relies entirely on numerical parameters of the asset samples. For example, clustering algorithms have been used to classify assets into easily-managed, hard-to-manage, large-expense, and small-expense types according to their usage duration and investment amount; to classify teaching assets by management mode according to total equipment amount, number of students, number of teachers, and consumables cost; and to classify assets by fault-maintenance mode according to the IP addresses and fault-alarm counts of hardware assets. Such classification clusters asset characteristics using numerical parameters of the assets, such as usage duration, investment amount, number of users, and consumables cost, which can be directly processed by a computer; these are called numerical features, and they need no further feature weight calculation. However, these features serve specific management and maintenance needs (classification by asset expenditure, ease of management, and usage conditions). They are not inherent characteristics of the assets: they are only reflected during purchase and use and change with the time and place of use, so classifying assets under different conditions yields different results, with no commonality.
In asset management in colleges and universities, assets are generally classified according to their uses, asset specifications, and the like, and the above classification method does not conform to the habit of classifying assets according to their attributes such as uses. For example, the national standard fixed asset classification standard is mainly classified according to the economic use and the use of the assets, which cannot be realized by the classification method.
Therefore, the invention provides the classification of the teaching assets according to the attribute characteristics of the asset samples. The property characteristics of the assets are inherent characteristics of the assets, and the property characteristics exist objectively and do not change with the placement position of the assets, the preference of people, the working demand degree and the like, for example: asset usage, asset form, asset specifications, asset ownership, etc. For example: in manual classification, people generally classify assets into engineering buildings, equipment, book archives, cultural relic specimens, intellectual property rights, natural resources, licensing rights and interests, data information, and the like, and the assets are classified according to attribute characteristics according to the purposes, forms, and the like of the assets. The property features of the assets cannot be directly used for clustering, so that the property features of the assets need to be firstly subjected to feature extraction and feature weight calculation, and are converted into a numerical form which can be processed by a computer. The simple and undifferentiated characteristic weight calculation method comprises the following steps: attribute features that appear in a certain asset sample are assigned a value of 1 and attribute features that do not appear are assigned a value of 0. However, the importance of each attribute in the asset classification is different, and the calculation method does not distinguish the difference of different attribute features in the asset classification, so the classification effect is not ideal. The method calculates the feature weights of different attribute features by introducing a weight calculation formula.
According to the invention, a feature weight formula is introduced according to different sources of the attribute features of the teaching assets, the importance degrees of the attribute features of different sources are distinguished through different feature weight coefficients, and then the teaching asset sample is expressed into a vector space form which can be processed by a computer. In addition, the distance between the clusters is adjusted through the constraint set, and therefore classification of the teaching assets is achieved through a semi-supervised hierarchical clustering method.
The invention has the advantages that: on one hand, the time and capital cost of manual classification are saved, and the asset classification result is not influenced by subjective experience. On the other hand, the teaching assets are classified by using the inherent attribute characteristics of the assets, so that the classification result has universality. In addition, different influence degrees of characteristics of different sources on teaching asset classification are considered, an attribute characteristic weight calculation formula of the teaching asset is introduced, and calculation of the characteristic weight is optimized. And finally, the clustering accuracy is improved through a semi-supervision method, and the operation effect of teaching asset clustering is improved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a large amount of cost is consumed by manual classification in the classification process of the existing teaching assets; the automatic classification method is single and has no universality, and provides a semi-supervised clustering teaching asset classification method for optimizing feature weights, wherein some basic attributes of teaching assets are used as features, an attribute feature weight calculation formula of the teaching assets is introduced, and the attribute feature weights of teaching asset samples are calculated; classifying the teaching assets by adopting a hierarchical clustering method; improving a clustering result through a semi-supervised constraint set; and classifying a certain teaching asset sample to be classified according to the clustering result.
The technical scheme adopted by the invention is as follows:
a semi-supervised clustering teaching asset classification method for optimizing feature weight comprises the following steps:
the method comprises the following steps: and acquiring a teaching asset sample comprising an asset name, an asset attribute set and asset entry information.
An asset attribute set refers to a word set formed by a part of attributes of a certain asset, for example, an attribute set of a certain device may be: device number, power, model, vendor, specification, brand, age, etc.
Asset entry information is the encyclopedia-entry explanation of an asset, e.g., the entry explanation of a certain asset in an encyclopedia such as Sogou Baike.
Step two: the method comprises the following steps of extracting attribute features of the teaching assets from different attribute feature sources according to the characteristics of teaching asset samples, introducing a feature weight calculation formula of the teaching assets, calculating corresponding attribute feature weights, and obtaining vector space representation of the teaching asset samples, wherein the method specifically comprises the following steps:
s21 extracting attribute features of teaching assets
When the asset attribute features are extracted, a plurality of attribute features which can describe the teaching asset sample most are extracted according to the priority order of the asset name, the asset attribute set and the asset entry information, semantic similar attribute features are combined, and irrelevant attributes are removed, so that the running time is reduced, and the running efficiency is improved.
Semantically similar attribute features are, for example: "instrument" and "device" in asset names both express the instrument-type asset characteristic; in this case the two semantically similar features can be merged into the same attribute feature.
Irrelevant attributes are, for example: service age, brand, etc. These need to be removed to prevent an excess of irrelevant attributes from making the algorithm run too long.
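The merging and pruning described in S21 can be sketched as follows; the synonym table and the irrelevant-attribute list here are illustrative assumptions rather than values given in the patent:

```python
# Sketch of S21: merge semantically similar attribute features and
# remove classification-irrelevant ones (both tables are assumed).
SYNONYMS = {"device": "instrument", "machine": "instrument"}
IRRELEVANT = {"age", "brand", "supplier"}

def extract_features(raw_attributes):
    """Map raw attribute words to canonical features, dropping noise."""
    features = []
    for attr in raw_attributes:
        if attr in IRRELEVANT:
            continue                      # drop classification-irrelevant attributes
        canon = SYNONYMS.get(attr, attr)  # merge semantic near-duplicates
        if canon not in features:
            features.append(canon)        # keep first occurrence only
    return features

print(extract_features(["device", "machine", "power", "brand", "model"]))
# ['instrument', 'power', 'model']
```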
S22 obtaining attribute feature weight of teaching assets
Sorting the attribute features according to the source (namely asset name, asset attribute set and/or asset entry information) and the priority order of the asset name, asset attribute set and asset entry information, setting different feature weight coefficients, and calculating the attribute feature weight of the teaching asset according to a formula:
[Formula (1), the attribute feature weight calculation, is rendered only as an image in the original patent]

where ω_ij represents the weight value of the jth attribute feature in the ith teaching asset sample; α^(j) is the attribute feature source coefficient; SD^(i) is the number of indicative attribute features contained in the ith teaching asset sample; and n is the number of all attribute features extracted from the teaching asset sample set. An indicative attribute feature is one that can definitively assign a teaching asset to a certain category; for example, if a teaching asset sample contains attribute features such as "instrument" or "power", the sample can be classified into the instrument-and-equipment class.
S23, using vector space model to represent the attribute feature of the teaching asset, and representing the selected attribute feature and attribute feature weight of the teaching asset into a feature vector form, i.e. the teaching asset is regarded as a vector of multidimensional vector space:
in this model, a sample set of teaching assets containing m samples of teaching assets, n attribute features, can be represented as a vector space:
C = {d_1, d_2, …, d_m}    formula (2)

Each teaching asset sample D_i (1 ≤ i ≤ m) can be expressed as an n-dimensional vector:

d_i = (ω_i1, ω_i2, …, ω_in)^T    formula (3)

where 1 ≤ i ≤ m and T denotes transposition.
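As an illustration of the vector space representation of formulas (2) and (3), the sketch below builds a sample vector d_i. Since the weight formula (1) appears only as an image in the patent, the weight expression used here (a source coefficient divided by one plus the feature's arrangement rank) is purely an assumed stand-in, and the feature list and coefficient values are likewise invented for illustration:

```python
# Sketch of the VSM representation (formulas (2)-(3)). The weight
# expression alpha / (1 + rank) below is an ASSUMED stand-in for the
# patent's formula (1), which is only available as an image.
ALL_FEATURES = ["instrument", "rated power", "rated voltage", "specification"]
SOURCE_ALPHA = {"name": 1.0, "attribute_set": 0.6, "entry": 0.3}  # assumed coefficients

def sample_vector(sample_features):
    """Return the n-dimensional weight vector d_i for one asset sample.

    sample_features maps each feature present in the sample to its source.
    """
    vec = []
    for rank, feat in enumerate(ALL_FEATURES):
        if feat in sample_features:
            alpha = SOURCE_ALPHA[sample_features[feat]]
            vec.append(alpha / (1 + rank))  # assumed stand-in for formula (1)
        else:
            vec.append(0.0)                 # absent features get weight 0
    return vec

projector = {"instrument": "name", "rated power": "attribute_set"}
print(sample_vector(projector))  # [1.0, 0.3, 0.0, 0.0]
```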
Step three: carrying out unsupervised initial clustering on the processed teaching asset sample to obtain an initial clustering cluster; the method comprises the following steps:
s31: for a given sample set, initializing m teaching asset sample points as m cluster types, calculating the distance between every two of the m cluster types, arranging the distance between every two of the m cluster types into a matrix form, and recording the matrix form as an initial distance matrix; the method comprises the following steps:
any two samples D1And D2Expressed as two vectors d in VSM1=(ω1112,…,ω1n)TAnd d2=(ω2122,…,ω2n)TT denotes transposition, then sample D1And D2I.e. to represent a class cluster D1And D2At this time, two kinds of clusters D1And D2The distance calculation formula of (c) is as follows:
Figure BDA0002202805560000041
then an initial distance matrix is obtained:
Figure BDA0002202805560000042
s32: and searching the cluster closest to each cluster through the initial distance matrix obtained in the step S31, and combining the two clusters closest to each cluster to form a new cluster.
By searching the initial distance matrix, the samples with the shortest distance are combined into a class cluster, and then the distance between every two combined class clusters is calculated in sequence, wherein the calculation method comprises the following steps:
Let S be a class cluster containing t samples and let d_x be a sample in S; then the center point of S is:

O(S) = (1/t) Σ_{d_x ∈ S} d_x

The distance between class clusters S_1 and S_2 is then the distance between their center points O(S_1) and O(S_2), namely:

d(S_1, S_2) = d(O(S_1), O(S_2)) = ‖O(S_1) − O(S_2)‖    formula (6)
S33: and repeating the step S32 until the obtained cluster number is the set initial cluster number K1.
Step four: by absorbing empirical knowledge, performing semi-supervised hierarchical clustering on the teaching asset samples by using the pairwise constraint set of the samples so as to improve the accuracy of a clustering effect; the method comprises the following steps:
s41: setting a paired constraint sample set in the sample set by using empirical knowledge;
the pairwise constraints include a must-link constraint and a cannot-link constraint. Where the must-link constraint indicates that two samples must be assigned to the same cluster, and the cannot-link constraint indicates that two samples must be assigned to different clusters. The set of pairwise constraints in a certain class of clusters is denoted as M (S; d) and N (S; d). M (S; d) refers to the set of samples in the cluster S having a multist-link constrained relationship with sample d, and N (S; d) refers to the set of samples in the cluster S having a candot-link constrained relationship with sample d. Accordingly, M (S; S ') represents the set of all samples having a list-link constrained relationship in the cluster S and the cluster S', and N (S; S ') represents the set of all samples having a cannot-link constrained relationship in the cluster S and the cluster S'.
S42: combining the initial unsupervised clustering results in the step three, and changing the distance between the clustering clusters by using pairwise constraint information;
the method for changing the distance between the clusters in the S42 comprises the following steps: introducing KNN algorithm, and the idea of the method is that if a sample dyWhen most of t existing teaching asset samples closest to the sample belong to a certain category, the sample also belongs to the category. By using
Figure BDA0002202805560000051
Representation and sample dyThe nearest t marked samples (namely the samples of the existing teaching assets), then the sample dyThe closeness to the t labeled samples closest to it is expressed as:
Figure BDA0002202805560000052
finally, the degree of constraint between the clusters S and S 'is represented by P (S; S'):
Figure BDA0002202805560000053
where ρ isuRepresents a sample duProximity to the t labeled samples closest thereto; rholRepresents a sample dlProximity to the t labeled samples closest thereto;representing the degree of the must-link constraint between the cluster S and the cluster S';indicates the degree of cannot-link constraint between clusters S and S'.
When P (S; S ') is > 1, then S is considered to be must-link bound to S'; when P (S; S ') <1, then S is considered cannot-link bound to S'.
According to the constraint degree P(S; S′), the distance between clusters S_1 and S_2 is changed to:

d′(S_1, S_2) = d(O(S_1), O(S_2)) · P(S_1; S_2) · P(S_2; S_1)    formula (9)

where O(S_1) and O(S_2) are the center points of clusters S_1 and S_2, respectively, and d(O(S_1), O(S_2)) is the distance between these center points.
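A runnable sketch of the distance adjustment of formula (9). Because formulas (7) and (8) appear only as images in the patent, the constraint degree P below is an assumed stand-in (a ratio of cannot-link to must-link pair counts), chosen so that must-link evidence yields P < 1 and shrinks the adjusted distance while cannot-link evidence yields P > 1 and enlarges it; the patent's own P is defined through the KNN closeness values instead:

```python
# Sketch of formula (9): scaling the inter-cluster distance by the
# pairwise-constraint degree P. The ratio below is an ASSUMED
# stand-in for formulas (7)-(8), which are images in the patent.
def constraint_degree(cluster_a, cluster_b, must_links, cannot_links):
    must = sum(1 for a in cluster_a for b in cluster_b
               if (a, b) in must_links or (b, a) in must_links)
    cannot = sum(1 for a in cluster_a for b in cluster_b
                 if (a, b) in cannot_links or (b, a) in cannot_links)
    return (1 + cannot) / (1 + must)   # assumed stand-in for formula (8)

def adjusted_distance(d, cluster_a, cluster_b, must_links, cannot_links):
    p = constraint_degree(cluster_a, cluster_b, must_links, cannot_links)
    return d * p * p                   # formula (9): d' = d * P(S1;S2) * P(S2;S1)

must = {("d1", "d3")}
cannot = {("d1", "d5")}
print(adjusted_distance(2.0, ["d1", "d2"], ["d3", "d4"], must, cannot))  # 0.5
```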
S43: searching two clustering clusters with the shortest distance, and combining the clustering clusters with the shortest distance into one clustering cluster according to the hierarchical clustering algorithm principle;
s44: repeating the step S43 until the obtained cluster number is the set teaching asset class number K;
Step five: classify the teaching asset samples to be classified according to the step-four semi-supervised hierarchical clustering result.

If the teaching asset to be classified is an existing teaching asset, it is assigned to its class directly according to the semi-supervised hierarchical clustering result. If the teaching asset to be classified is a newly added teaching asset sample D_{m+1}, initialize D_{m+1} as a class cluster S_{m+1}; the distance between the newly added teaching asset D_{m+1} and each of the K teaching asset classes is then:

d(S_{m+1}, S_i) = ‖O(S_{m+1}) − O(S_i)‖    formula (10)
And comparing the distance calculation results of the formula (10), determining a target teaching asset class having the minimum distance with the teaching asset to be classified in the K teaching asset classes, and adding the newly added teaching asset into the target teaching asset class.
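For a newly added sample, step five reduces to nearest-centroid assignment among the K class clusters. A minimal sketch of formula (10):

```python
import math

# Sketch of step five / formula (10): a newly added sample is treated
# as its own cluster and assigned to the nearest of the K class centers.
def classify_new_sample(new_vec, class_centroids):
    """Return the index of the class whose center point is nearest."""
    best_idx, best_d = None, None
    for idx, center in enumerate(class_centroids):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(new_vec, center)))
        if best_d is None or d < best_d:
            best_idx, best_d = idx, d
    return best_idx

centroids = [[0.0, 0.0], [5.0, 5.0]]  # centers of K = 2 classes (toy values)
print(classify_new_sample([4.2, 4.9], centroids))  # 1
```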
According to the invention, a feature weight formula is introduced according to different sources of the attribute features of the teaching assets, the importance degrees of the attribute features of different sources are distinguished through different feature weight coefficients, and then the teaching asset sample is expressed into a vector space form which can be processed by a computer. In addition, the distance between the clusters is adjusted through the constraint set, and therefore classification of the teaching assets is achieved through a semi-supervised hierarchical clustering method.
Compared with the existing manual classification technology, the method greatly saves the manpower and time required by classification, and avoids the classification result difference caused by different subjective experiences of people. Compared with a simple undifferentiated weight assignment method, the method introduces a teaching asset attribute feature weight calculation formula according to different sources of the teaching asset attribute features, and enables the teaching asset feature weights to correspond to different numerical values according to the importance degree by changing the weight coefficient. The method highlights the differences of different teaching asset samples, enables the classification result to be more accurate, and reduces the error of the classification result. Compared with an unsupervised clustering method, the method realizes semi-supervised clustering through the constraint set, and effectively improves the accuracy of a clustering result.
Drawings
FIG. 1 is a flow chart of a method for classifying an instructional asset sample in accordance with the present invention.
FIG. 2 is a sub-flow chart of the semi-supervised clustering algorithm based on constraints of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflicting with each other. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below. Other equivalent or alternative features having similar purposes may be substituted unless specifically stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
The classification method based on semi-supervised clustering is applied to classification of teaching assets, and attribute features of teaching asset samples are extracted and attribute feature weights are calculated according to the characteristics of the asset samples; performing initial clustering calculation according to the sample space vector; adding constraint set information and adjusting the distance of the teaching asset clustering cluster by a semi-supervised learning method so as to optimize the clustering effect of the teaching asset data; and classifying the newly added teaching assets according to the semi-supervised clustering result.
As shown in fig. 1, the classification method for teaching assets sample according to the present invention includes:
(1) acquiring a college teaching asset sample;
(2) and clustering the teaching asset sample set by using a semi-supervised clustering algorithm.
(3) And classifying the to-be-classified teaching asset samples according to the semi-supervised hierarchical clustering result.
The following is a detailed description of the various parts:
(1) Acquire a college teaching asset sample. According to the requirements of college asset management, the asset name and asset attribute set information of the existing teaching assets are already entered when they are recorded in the asset management system, and this information forms the college's own asset database. Therefore, the sample set of a certain college's existing teaching assets can be imported into the clustering algorithm by computer. According to the asset information of the teaching asset database, the teaching asset class names and class number are determined as: houses, instruments, book archives, and furniture, 4 classes in total.
Unlike other samples, the instructional asset attribute features can be obtained in three ways, namely: asset name, asset attribute set, entry information. Features that best represent the concept of the asset class are selected. For example: the class information of the teaching asset sample can be mostly obtained from the name of the teaching asset sample, such as a building, an instrument, a device, a machine, a frame, a table, paper and the like, the class information of the teaching asset sample can be reflected by the attributes of part of teaching asset parameters, such as area, power, publishing houses and the like, and in addition, the class attribute characteristics can be obtained from the interpretation of the asset entry information of the unusual or newly-added unknown teaching asset.
For example, in the teaching asset sample database the information of a certain teaching building is: asset name: Second Teaching and Scientific Research Building; asset attribute set: developer, address, total area, building type, opening date, completion date, etc. The information of a certain brand of projector is: asset name: projector; asset attribute set: model, power, specification, supplier, service age, etc. The information of a certain book is: asset name: Robotics and Applications; asset attribute set: publishing company, publication date, publication number, age, etc. It should be noted that asset entry information interpretation is generally not required for such existing assets. For unusual or newly added unknown teaching assets, the category attribute features can be obtained through the interpretation of their entry information. For example, a survey target (also called a measuring mark) is a frame erected on a triangulation point or traverse point for observation, or used by a survey station as a sighting standard; the keyword feature "frame" can be extracted from this interpretation as the attribute feature.
(2) The teaching asset sample set is clustered as in the semi-supervised clustering algorithm of fig. 2. The semi-supervised clustering algorithm specifically comprises the following steps: extracting attribute features of the teaching asset sample and calculating feature weights; a vector space representation of a sample set of the educational asset; carrying out initial unsupervised clustering; semi-supervised clustering with constraint information.
Extracting attribute features of the teaching asset sample and calculating feature weights:
After the teaching asset sample is obtained, the sample attributes are extracted: semantically identical or similar features are merged, and irrelevant attributes are removed. Take a certain brand of projector among the teaching assets: its name may appear as "projection instrument" or "projection machine", where the suffixes "instrument" and "machine" are synonymous, so the two should be merged into the same feature item. In its asset attribute set, supplier and service age are attributes irrelevant to classification, and such features should be removed to reduce the algorithm running time. Following the priority order of asset name then asset attribute set, the features extracted for this projector are: instrument, rated power, rated voltage, and specification. Among these, "instrument" and "power" are indicative attribute features that can definitively assign the teaching asset to the instrument-and-equipment class, while "model" and "specification" are non-indicative attribute features: by themselves they cannot establish that the asset belongs to the instrument-and-equipment class, but they are still associated with that class, so after feature extraction, teaching assets containing these features are more likely to belong to the same class as the projector. After this processing, the teaching asset attribute features have been extracted according to the set priority order of asset name, asset attribute set, and asset entry information.
After the attribute features are extracted, the feature weights are calculated. Suppose that, after feature extraction, the teaching asset sample set contains 12 attribute features in total and a given teaching asset sample contains 3 indicative attribute features. The extracted attribute features are arranged with the indicative attribute features first and the non-indicative attribute features after, and within each group the features are ordered by source priority: asset-name-derived features, then asset-attribute-set-derived features, then asset-entry-information-derived features. Let
[Formula (1), the teaching-asset attribute feature weight formula; rendered as an image in the original publication.]
The weight of each sample feature item is then calculated according to the teaching-asset attribute feature weight formula (1).
The feature weights of the projector's feature items are calculated as follows. Assume the projector is the 3rd sample, and that among the 12 attribute features of the teaching asset sample set the attribute feature "instrument" has arrangement serial number 1, "rated power" has serial number 2, "rated voltage" has serial number 6, and "specification" has serial number 10. The weights of these four attribute features for the teaching-asset projector are then ω31 = 0.911, ω32 = 0.3205, ω36 = 0.254 and ω3,10 = 0.0519, respectively; the weights of the remaining attribute features, which do not appear in this teaching asset sample, are all 0.
Vector space representation of teaching asset samples:
the teaching asset sample set comprises 15 teaching asset samples and 12 attribute features. Each teaching asset sample Di (1 ≤ i ≤ m) can be represented as an n-dimensional vector di = (ωi1, ωi2, …, ωin)T, where T denotes transposition and ωij denotes the weight of the jth attribute feature in the ith asset sample, obtained from the feature item weight formula. After this vector conversion is completed, the space vector coordinates of each sample are stored in a space vector library for the subsequent inter-sample distance calculation. For example, the teaching asset sample D3, "projector", can be represented as the vector:
d3 = (0.736, 0.575, 0, 0, 0, 0.321, 0, 0, 0, 0.158, 0, 0)T
Among the teaching asset samples, the desktop computer, camera and projector are similar to one another and all belong to the instrument-and-equipment class, so the coordinate differences among these three samples are small; the projector and the camera are the most similar, so the difference between those two vectors is the smallest. For example, teaching asset sample D1, "desktop computer", can be expressed as the vector d1 = (0.692, 0.514, 0, 0, 0.341, 0.302, 0, 0, 0, 0.148, 0, 0)T, and teaching asset sample D2, "camera", as the vector d2 = (0.73, 0.571, 0, 0, 0, 0.318, 0, 0, 0, 0.156, 0, 0.124)T.
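The similarity claims above can be checked numerically. The following is a minimal sketch assuming the inter-sample distance is the plain Euclidean distance, which reproduces the d(D2, D3) = 0.124 quoted in the text exactly, and the other two distances up to rounding of the published vector components:

```python
import math

# The three sample vectors given in the text: D1 desktop computer,
# D2 camera, D3 projector (12-dimensional feature-weight vectors).
d1 = [0.692, 0.514, 0, 0, 0.341, 0.302, 0, 0, 0, 0.148, 0, 0]
d2 = [0.73, 0.571, 0, 0, 0, 0.318, 0, 0, 0, 0.156, 0, 0.124]
d3 = [0.736, 0.575, 0, 0, 0, 0.321, 0, 0, 0, 0.158, 0, 0]

def dist(a, b):
    """Euclidean distance between two feature-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The camera/projector pair is by far the closest, as the text states.
print(round(dist(d2, d3), 3))  # 0.124
```

Running this also confirms the ordering d(D2, D3) < d(D1, D3) < d(D1, D2), i.e. the projector is closer to the camera than either is to the desktop computer.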
Initial unsupervised clustering:
after the teaching asset sample data preprocessing and structured conversion, each teaching asset sample Di corresponds to a vector di. The distance between teaching asset samples is then calculated with the inter-sample distance formula and used as the criterion for whether samples can be clustered into the same cluster. The pairwise distances among teaching asset samples D1, D2 and D3 are d(D1,D2) = 0.366, d(D1,D3) = 0.346 and d(D2,D3) = 0.124. The closest teaching asset samples are merged into one cluster: D2 (camera) and D3 (projector) are combined into cluster S1, while D1 (desktop computer) forms cluster S2 on its own. The center point O(S1) of cluster S1 has coordinates (0.733, 0.573, 0, 0, 0, 0.32, 0, 0, 0, 0.157, 0, 0.062), so the distance between clusters S1 and S2 is d(S1,S2) = 0.351. The distances between all samples and all clusters are calculated, and the two closest clusters are repeatedly merged into a new cluster until the number of clusters equals the set initial cluster number K1 = 7.
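The merge loop described above can be sketched as a small centroid-linkage agglomerative routine. This is an illustrative reading of the procedure on toy data, not the patent's reference implementation:

```python
import math

def centroid(cluster):
    """Mean vector of a cluster's samples (the cluster's center point)."""
    t = len(cluster)
    return [sum(v[j] for v in cluster) / t for j in range(len(cluster[0]))]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, k):
    """Merge the two closest clusters (centroid linkage) until k clusters remain."""
    clusters = [[v] for v in vectors]            # every sample starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Toy data: two tight groups on a line; stopping at k = 2 recovers them.
data = [[0.0], [0.1], [0.2], [5.0], [5.1]]
print([len(c) for c in agglomerate(data, 2)])  # [3, 2]
```

With the sample vectors of the embodiment and k = K1 = 7, the same loop reproduces the merging of D2 and D3 described above, since their distance (0.124) is the smallest.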
Semi-supervised clustering with pairwise constraints:
Initial unsupervised clustering of the given teaching asset sample data yields an initial classification result. Since initial unsupervised clustering has no learning ability, its classification accuracy is not ideal; the following semi-supervised stage is therefore added to further improve the clustering. A sub-flowchart of the constraint-based semi-supervised clustering algorithm is shown in fig. 2.
For the teaching asset sample set {Di} after initial clustering, let the clusters formed be {S1, S2, …, SN}, and set the number of teaching asset classes to 4, i.e. the required number of output clusters is 4. All P(S;S') are calculated using the constraint information, for example: cluster S1 has a must-link constraint relationship with sample d1, and cluster S2 has a cannot-link constraint relationship with sample d1. The formula
d′(S1,S2) = d(O(S1),O(S2)) · P(S1;S2) · P(S2;S1)
is then used to adjust the distance between clusters S1 and S2. The distances among the 7 adjusted clusters are calculated, the two closest clusters (Sp, Sq) are found and merged into a new cluster Sr, and Y is set to Y − 1. If Y = K = 4, the algorithm stops and outputs the result; if Y > K, the algorithm returns to the previous step, recalculates all P(S;S') and all distances d(S,S'), and again finds and merges the two closest clusters, until Y = K = 4.
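One constrained-merge step can be sketched as follows, assuming the constraint degrees P(S;S') have already been computed (their defining formula is rendered only as an image in the original, so all numeric values below are hypothetical):

```python
# A minimal sketch of one constrained-merge step: the raw centroid distances
# are rescaled by both constraint degrees (formula (9)) before the nearest
# pair of clusters is chosen. All numeric values here are hypothetical.
def adjusted_distance(d, p_st, p_ts):
    """d'(S,S') = d(O(S),O(S')) * P(S;S') * P(S';S)."""
    return d * p_st * p_ts

# Hypothetical centroid distances between 3 clusters, keyed by (i, j), i < j.
dist = {(0, 1): 0.351, (0, 2): 0.40, (1, 2): 0.45}
# Hypothetical constraint degrees P(S_i; S_j), both directions per pair.
P = {(0, 1): 1.8, (1, 0): 1.6,
     (0, 2): 1.0, (2, 0): 1.0,
     (1, 2): 0.6, (2, 1): 0.7}

adjusted = {pair: adjusted_distance(d, P[pair], P[pair[::-1]])
            for pair, d in dist.items()}
closest = min(adjusted, key=adjusted.get)  # pair to merge next
print(closest, round(adjusted[closest], 3))
```

Here the constraint degrees change which pair is merged: under the raw distances the pair (0, 1) would be merged first, but after rescaling the pair (1, 2) becomes the closest.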
The semi-supervised clustering result is output, i.e. which of the 15 teaching asset samples each of the 4 teaching asset classes contains:
S1 (instruments and equipment): D1, D2, D3, D6, D7, D13
S2 (houses): D5, D8, D10, D14
S3 (books and archives): D4, D9, D11
S4 (furniture): D12, D15
(3) The to-be-classified teaching asset samples are classified according to the semi-supervised hierarchical clustering result.
A to-be-classified teaching asset Dr is input. If it is an existing teaching asset, for example D3, the sample is classified, according to the semi-supervised hierarchical clustering result, into the class S1 (instruments and equipment) to which D3 belongs.
If the to-be-classified teaching asset is a newly added one, its vector is obtained through teaching asset sample feature extraction and feature item weight calculation; for the newly added sample D16, d16 = (0.738, 0.577, 0, 0, 0, 0.322, 0, 0, 0, 0, 0.141, 0)T. The distances d(D16, Si) (i = 1, 2, 3, 4) between the newly added sample and the 4 finally output teaching asset classes are calculated, the 4 distances are compared, the target teaching asset class with the minimum distance to the newly added asset is determined, and the newly added asset is added to that target class. For example: the distances between sample D16 and clusters S1, S2, S3, S4 are calculated and compared; if D16 is closest to cluster S1, sample D16 is classified into cluster S1.
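The nearest-class assignment of a newly added asset can be sketched as follows. The vector d16 is the one given above, while the class centroids for S2–S4 are hypothetical placeholders (only the S1 centroid matches the value stated earlier in the text):

```python
import math

def dist(a, b):
    """Euclidean distance between two feature-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(new_vec, centroids):
    """Assign a new teaching-asset vector to the class with the nearest centroid."""
    return min(centroids, key=lambda name: dist(new_vec, centroids[name]))

# Class centroids: S1 from the text; S2-S4 are hypothetical placeholders.
centroids = {
    "S1 instruments": [0.733, 0.573, 0, 0, 0, 0.32, 0, 0, 0, 0.157, 0, 0.062],
    "S2 houses":      [0, 0, 0.8, 0, 0, 0, 0.4, 0, 0, 0, 0, 0],
    "S3 books":       [0, 0, 0, 0.7, 0, 0, 0, 0.5, 0, 0, 0, 0],
    "S4 furniture":   [0, 0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0.3],
}
d16 = [0.738, 0.577, 0, 0, 0, 0.322, 0, 0, 0, 0, 0.141, 0]
print(classify(d16, centroids))  # S1 instruments
```

As expected from the text, the new sample's nonzero weights on the instrument-related features place it closest to the S1 centroid.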
According to the different sources of the teaching asset attribute features, the invention introduces a formula for the teaching asset attribute feature weight, whose weight coefficients give each attribute feature a value corresponding to its importance. This highlights the differences between teaching asset samples, making the classification result more accurate and reducing classification error. Compared with unsupervised clustering, the method combines empirical knowledge and realizes semi-supervised clustering through a constraint set, effectively improving the accuracy of teaching asset classification.
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature, or any novel combination of features, disclosed in this specification, and to any novel method or process step, or any novel combination of steps, disclosed.

Claims (4)

1. A semi-supervised clustering teaching asset classification method for optimizing feature weight is characterized by comprising the following steps:
the method comprises the following steps: acquiring a teaching asset sample comprising an asset name, an asset attribute set and asset entry information;
step two: according to the characteristics of the teaching asset sample, extracting the attribute features of the teaching asset from the different attribute feature sources, introducing a teaching asset feature weight formula, calculating the corresponding attribute feature weights, and obtaining a vector space representation of the teaching asset sample;
Step three: carrying out unsupervised initial clustering on the processed teaching asset sample to obtain an initial clustering cluster;
step four: by absorbing empirical knowledge, performing semi-supervised hierarchical clustering on the teaching asset samples by using the pairwise constraint set of the samples so as to improve the accuracy of a clustering effect; the method comprises the following steps:
s41: setting a paired constraint sample set in the sample set by using empirical knowledge;
pairwise constraints include the must-link constraint and the cannot-link constraint; the must-link constraint indicates that two samples must be assigned to the same cluster, and the cannot-link constraint indicates that two samples must be assigned to different clusters; the pairwise constraint sets within a cluster are denoted M(S;d) and N(S;d): M(S;d) is the set of samples having a must-link constraint relationship with sample d in cluster S, and N(S;d) is the set of samples having a cannot-link constraint relationship with sample d in cluster S; accordingly, M(S;S') denotes the set of all samples in clusters S and S' having a must-link constraint relationship, and N(S;S') denotes the set of all samples in clusters S and S' having a cannot-link constraint relationship;
s42: combining the initial unsupervised clustering results in the step three, and changing the distance between the clustering clusters by using pairwise constraint information;
s43: searching two closest clustering clusters, and combining the closest clustering clusters into one clustering cluster;
s44: repeating the step S43 until the obtained cluster number is the set teaching asset class number K;
step five: classifying the to-be-classified teaching asset samples according to the semi-supervised hierarchical clustering result of step four;
if the to-be-classified teaching asset is an existing teaching asset, it is classified into its class according to the semi-supervised hierarchical clustering result; if the to-be-classified teaching asset is a newly added teaching asset, the distances between it and the K teaching asset classes are calculated, the target teaching asset class with the minimum distance to it among the K classes is determined from the distance comparison, and the newly added teaching asset is added to that target class.
2. The semi-supervised clustering teaching asset classification method for optimizing feature weights as claimed in claim 1, wherein the second step is specifically:
s21 extracting attribute features of teaching assets
when the asset attribute features are extracted, the attribute features that best describe the teaching asset sample are first extracted according to the priority order of asset name, asset attribute set and asset entry information, semantically similar attribute features are merged, and irrelevant attributes are removed;
s22 obtaining attribute feature weight of teaching assets
Sorting the attribute features according to the source thereof and the priority sequence of the asset name, the asset attribute set and the asset entry information, setting different feature weight coefficients, and calculating the attribute feature weight of the teaching asset according to a formula:
[Formula (1), defining ωij in terms of α(j), SD(i) and n; rendered as an image in the original publication.]
wherein ωij represents the weight of the jth attribute feature in the ith teaching asset sample; α(j) is the attribute feature source coefficient; SD(i) is the number of indicative attribute features contained in the ith teaching asset sample; and n is the number of all attribute features extracted from the teaching asset sample set;
s23: using a vector space model (VSM) to represent the attribute features of the teaching asset, expressing the selected attribute features and their weights as a feature vector, i.e. each teaching asset is regarded as a vector in a multidimensional vector space:
in this model, a teaching asset sample set containing m teaching asset samples and n attribute features can be represented as the vector space:
C = {d1, d2, …, dm}   formula (2)
each teaching asset sample Di (1 ≤ i ≤ m) can be expressed as an n-dimensional vector:
di = (ωi1, ωi2, …, ωin)T   formula (3)
wherein 1 ≤ i ≤ m and T denotes transposition.
3. The semi-supervised clustering teaching asset classification method for optimizing feature weights as claimed in claim 2, wherein the third step is specifically:
s31: for a given sample set, initializing the m teaching asset sample points as m clusters, calculating the pairwise distances among the m clusters, and recording them as the initial distance matrix; specifically:
any two samples D1 and D2 are expressed in the VSM as the two vectors d1 = (ω11, ω12, …, ω1n)T and d2 = (ω21, ω22, …, ω2n)T, where T denotes transposition; samples D1 and D2 then each represent a cluster, and at this point the distance between the two clusters D1 and D2 is calculated as follows:
d(D1,D2) = ||d1 − d2|| = sqrt( Σ_{j=1}^{n} (ω1j − ω2j)^2 )   formula (4)
s32: using the distance matrix d(D1,D2), finding the closest cluster to each cluster, and combining the two closest clusters to form a new cluster;
by searching the initial distance matrix, the samples with the shortest distance are combined into one cluster, and the distances between the merged clusters are then calculated in turn, as follows:
let S be a cluster containing t samples and dx be a sample in S; then the center point of S is:
O(S) = (1/t) Σ_{dx ∈ S} dx   formula (5)
the distance between clusters S1 and S2 is then:
d(S1,S2) = d(O(S1),O(S2)) = ||O(S1) − O(S2)||   formula (6)
S33: and repeating the step S32 until the obtained cluster number is the set initial cluster number K1.
4. The semi-supervised clustering teaching asset classification method for optimizing feature weight of claim 3, wherein the method for changing the distance between clusters in S42 is as follows:
if most of the t labeled teaching asset samples closest to a sample dy belong to a certain class, then sample dy also belongs to that class; the closeness of sample dy to the t labeled samples nearest to it is expressed as:
[Formula (7), defining the closeness ρy of sample dy to its t nearest labeled samples; rendered as an image in the original publication.]
finally, the degree of constraint between clusters S and S' is represented by P(S;S'):
[Formula (8), defining P(S;S') from the must-link and cannot-link constraint degrees and the closeness values; rendered as an image in the original publication.]
where ρu represents the closeness of sample du to the t labeled samples nearest to it, and ρl represents the closeness of sample dl to the t labeled samples nearest to it;
[an expression rendered as an image in the original] represents the degree of the must-link constraint between clusters S and S', and [an expression rendered as an image in the original] represents the degree of the cannot-link constraint between clusters S and S';
when P(S;S') > 1, S and S' are considered must-link constrained; when P(S;S') < 1, S and S' are considered cannot-link constrained;
according to the constraint degree P(S;S'), the distance between clusters S1 and S2 is changed to:
d′(S1,S2) = d(O(S1),O(S2)) · P(S1;S2) · P(S2;S1)   formula (9)
wherein O(S1) and O(S2) are the center points of clusters S1 and S2 respectively, and d(O(S1),O(S2)) is the distance between the center points O(S1) and O(S2) of clusters S1 and S2.
CN201910871026.6A 2019-09-16 2019-09-16 Semi-supervised clustering teaching asset classification method for optimizing feature weight Pending CN110766273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910871026.6A CN110766273A (en) 2019-09-16 2019-09-16 Semi-supervised clustering teaching asset classification method for optimizing feature weight


Publications (1)

Publication Number Publication Date
CN110766273A true CN110766273A (en) 2020-02-07

Family

ID=69329951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910871026.6A Pending CN110766273A (en) 2019-09-16 2019-09-16 Semi-supervised clustering teaching asset classification method for optimizing feature weight

Country Status (1)

Country Link
CN (1) CN110766273A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897962A (en) * 2020-07-27 2020-11-06 绿盟科技集团股份有限公司 Internet of things asset marking method and device
CN111897962B (en) * 2020-07-27 2024-03-15 绿盟科技集团股份有限公司 Asset marking method and device for Internet of things
CN112200212A (en) * 2020-08-17 2021-01-08 广州市自来水有限公司 Artificial intelligence-based enterprise material classification catalogue construction method
CN112506930A (en) * 2020-12-15 2021-03-16 北京三维天地科技股份有限公司 Data insight platform based on machine learning technology
CN113052534A (en) * 2021-03-30 2021-06-29 上海东普信息科技有限公司 Address allocation method, device, equipment and storage medium based on semi-supervised clustering
CN113052534B (en) * 2021-03-30 2023-08-01 上海东普信息科技有限公司 Address allocation method, device, equipment and storage medium based on semi-supervised clustering
CN113239968A (en) * 2021-04-15 2021-08-10 国家计算机网络与信息安全管理中心 Method, device, computer storage medium and terminal for realizing server classification
CN113052266A (en) * 2021-04-27 2021-06-29 中国工商银行股份有限公司 Transaction mode type identification method and device
CN113035281A (en) * 2021-05-24 2021-06-25 浙江中科华知科技股份有限公司 Medical data processing method and device
CN115310879A (en) * 2022-10-11 2022-11-08 浙江浙石油综合能源销售有限公司 Multi-fueling-station power consumption control method based on semi-supervised clustering algorithm
CN115310879B (en) * 2022-10-11 2022-12-16 浙江浙石油综合能源销售有限公司 Multi-fueling-station power consumption control method based on semi-supervised clustering algorithm

Similar Documents

Publication Publication Date Title
CN110766273A (en) Semi-supervised clustering teaching asset classification method for optimizing feature weight
Roy et al. Inferring concept prerequisite relations from online educational resources
Chen et al. General functional matrix factorization using gradient boosting
KR20190118477A (en) Entity recommendation method and apparatus
Ma et al. Course recommendation based on semantic similarity analysis
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
Karan et al. FAQIR–a frequently asked questions retrieval test collection
Santoso et al. The analysis of student performance using data mining
Isljamovıc et al. PREDICTING STUDENTS’ACADEMIC PERFORMANCE USING ARTIFICIAL NEURAL NETWORK: A CASE STUDY FROM FACULTY OF ORGANIZATIONAL SCIENCES
Ramachandran et al. Integration of machine learning algorithms for E-Learning System course recommendation based on Data Science
Wang et al. Data-driven flow cytometry analysis
García-Romero et al. Another brick in the wall: a new ranking of academic journals in Economics using FDH
Alsultanny Selecting a suitable method of data mining for successful forecasting
CN103279549A (en) Method and device for acquiring target data of target objects
Sasmita et al. Development of machine learning implementation in engineering education: A literature review
Rashid et al. Student Career Recommendation System Using Content-Based Filtering Method
Niswatin et al. Classification of category selection title undergraduate thesis using k-nearest neighbor method
Tone et al. How to deal with non-convex frontiers in data envelopment analysis
Zahir et al. Access plan recommendation: A clustering based approach using queries similarity
Hafdi et al. Student Performance Prediction in Learning Management System Using Small Dataset
Siahaan et al. Implementation of Data Mining Using the K-Nearest Neighbor Method to Determine the feasibility of a lecturer's functional promotion
Prakash et al. App Review Prediction Using Machine Learning
Rianti et al. Machine Learning Journal Article Recommendation System using Content based Filtering
Göksün et al. The role of learning analytics in distance learning: a SWOT analysis
Siren Statistical models for inferring the structure and history of populations from genetic data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200207