CN110766273A - Semi-supervised clustering teaching asset classification method for optimizing feature weight - Google Patents

Semi-supervised clustering teaching asset classification method for optimizing feature weight

Info

Publication number
CN110766273A
Authority
CN
China
Prior art keywords
asset
teaching
cluster
samples
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910871026.6A
Other languages
Chinese (zh)
Inventor
Sun Yao (孙曜)
Sun Shuangping (孙双平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Electronic Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Electronic Science and Technology University filed Critical Hangzhou Electronic Science and Technology University
Priority to CN201910871026.6A priority Critical patent/CN110766273A/en
Publication of CN110766273A publication Critical patent/CN110766273A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semi-supervised clustering teaching asset classification method with optimized feature weights. The method comprises the following steps: analyzing how the attribute features of assets are expressed according to the characteristics of the asset samples, extracting the feature items of the asset samples, introducing a feature-item weight calculation formula, and calculating the corresponding feature-item weights to obtain a vector space representation of the asset samples; performing unsupervised initial clustering on the processed asset samples to obtain initial clusters; performing semi-supervised clustering on the asset samples using a pairwise constraint set over the samples; and classifying newly added asset samples according to the semi-supervised hierarchical clustering result. The invention combines clustering and pairwise constraints for asset classification and provides an asset classification method based on semi-supervised clustering, thereby reducing the time required for manual classification and avoiding the divergent classification results caused by subjective differences; meanwhile, through the pairwise constraint set, the knowledge contained in the supervision information is further mined, achieving higher effectiveness and correctness.

Description

Semi-supervised clustering teaching asset classification method for optimizing feature weight
Technical Field
The invention relates to a semi-supervised clustering teaching asset classification method for optimizing feature weights.
Background
Teaching assets are the various usable conditions, such as the materials provided for the effective development of studies, activities, and so on. With the continuous development of computers and the internet, the number and variety of teaching assets have grown extremely rapidly. In management, if assets are not classified, or are classified inaccurately, the management and use of teaching assets is seriously hindered, and suitable available assets cannot be found among the massive teaching assets.
Most existing asset classification methods are manual and depend on people's subjective experience. They therefore suffer from inconsistent class divisions, inaccurate classification, and even cases that are difficult to classify at all. Moreover, when the number of samples is large, the time and expense required for manual classification are enormous. Applying clustering algorithms to asset classification can greatly reduce the time and cost of manual classification and avoid the influence of subjective experience on the classification results, so it has attracted more and more attention.
At present, most clustering-based classification relies entirely on numerical parameters of the asset samples. For example, clustering algorithms have been used to classify assets into easily-managed, hard-to-manage, large-expense, and small-expense types according to their usage duration and investment amount; to classify teaching assets by management mode according to total equipment amount, number of students, number of teachers, and consumables cost; and to classify assets by fault-maintenance mode according to the IP addresses and fault-alarm counts of hardware assets. Such classification clusters asset characteristics using numerical parameters of the assets, such as usage duration, investment amount, number of users, and consumables cost, which can be directly processed by a computer; these are called numerical features, and they need no further feature weight calculation. However, these features serve specific management and maintenance needs (classification by asset expenditure, ease of management, and usage conditions). They are not inherent characteristics of the assets: they are only reflected during purchase and use and change with the time and place of use, so classifying assets under different conditions yields different results, with no commonality.
In asset management in colleges and universities, assets are generally classified according to their uses, asset specifications, and the like, and the above classification method does not conform to the habit of classifying assets according to their attributes such as uses. For example, the national standard fixed asset classification standard is mainly classified according to the economic use and the use of the assets, which cannot be realized by the classification method.
Therefore, the invention provides the classification of the teaching assets according to the attribute characteristics of the asset samples. The property characteristics of the assets are inherent characteristics of the assets, and the property characteristics exist objectively and do not change with the placement position of the assets, the preference of people, the working demand degree and the like, for example: asset usage, asset form, asset specifications, asset ownership, etc. For example: in manual classification, people generally classify assets into engineering buildings, equipment, book archives, cultural relic specimens, intellectual property rights, natural resources, licensing rights and interests, data information, and the like, and the assets are classified according to attribute characteristics according to the purposes, forms, and the like of the assets. The property features of the assets cannot be directly used for clustering, so that the property features of the assets need to be firstly subjected to feature extraction and feature weight calculation, and are converted into a numerical form which can be processed by a computer. The simple and undifferentiated characteristic weight calculation method comprises the following steps: attribute features that appear in a certain asset sample are assigned a value of 1 and attribute features that do not appear are assigned a value of 0. However, the importance of each attribute in the asset classification is different, and the calculation method does not distinguish the difference of different attribute features in the asset classification, so the classification effect is not ideal. The method calculates the feature weights of different attribute features by introducing a weight calculation formula.
According to the invention, a feature weight formula is introduced according to different sources of the attribute features of the teaching assets, the importance degrees of the attribute features of different sources are distinguished through different feature weight coefficients, and then the teaching asset sample is expressed into a vector space form which can be processed by a computer. In addition, the distance between the clusters is adjusted through the constraint set, and therefore classification of the teaching assets is achieved through a semi-supervised hierarchical clustering method.
The invention has the advantages that: on one hand, the time and capital cost of manual classification are saved, and the asset classification result is not influenced by subjective experience. On the other hand, the teaching assets are classified by using the inherent attribute characteristics of the assets, so that the classification result has universality. In addition, different influence degrees of characteristics of different sources on teaching asset classification are considered, an attribute characteristic weight calculation formula of the teaching asset is introduced, and calculation of the characteristic weight is optimized. And finally, the clustering accuracy is improved through a semi-supervision method, and the operation effect of teaching asset clustering is improved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: a large amount of cost is consumed by manual classification in the classification process of the existing teaching assets; the automatic classification method is single and has no universality, and provides a semi-supervised clustering teaching asset classification method for optimizing feature weights, wherein some basic attributes of teaching assets are used as features, an attribute feature weight calculation formula of the teaching assets is introduced, and the attribute feature weights of teaching asset samples are calculated; classifying the teaching assets by adopting a hierarchical clustering method; improving a clustering result through a semi-supervised constraint set; and classifying a certain teaching asset sample to be classified according to the clustering result.
The technical scheme adopted by the invention is as follows:
a semi-supervised clustering teaching asset classification method for optimizing feature weight comprises the following steps:
the method comprises the following steps: and acquiring a teaching asset sample comprising an asset name, an asset attribute set and asset entry information.
An asset attribute set refers to a word set formed by a part of attributes of a certain asset, for example, an attribute set of a certain device may be: device number, power, model, vendor, specification, brand, age, etc.
Asset entry information is the encyclopedia-entry explanation of an asset, e.g., the entry explanation of a certain asset in an encyclopedia such as Sogou Baike.
Step two: the method comprises the following steps of extracting attribute features of the teaching assets from different attribute feature sources according to the characteristics of teaching asset samples, introducing a feature weight calculation formula of the teaching assets, calculating corresponding attribute feature weights, and obtaining vector space representation of the teaching asset samples, wherein the method specifically comprises the following steps:
s21 extracting attribute features of teaching assets
When the asset attribute features are extracted, a plurality of attribute features which can describe the teaching asset sample most are extracted according to the priority order of the asset name, the asset attribute set and the asset entry information, semantic similar attribute features are combined, and irrelevant attributes are removed, so that the running time is reduced, and the running efficiency is improved.
Semantically similar attribute features are, for example: "instrument" and "device" in asset names both express the instrument-type asset characteristic; in this case the two semantically similar features can be merged into the same attribute feature.
Irrelevant attributes are, for example: service age, brand, etc. These need to be removed to prevent an excess of irrelevant attributes from making the algorithm run too long.
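The merging and pruning described in S21 can be sketched as follows; the synonym table and the irrelevant-attribute list here are illustrative assumptions rather than values given in the patent:

```python
# Sketch of S21: merge semantically similar attribute features and
# remove classification-irrelevant ones (both tables are assumed).
SYNONYMS = {"device": "instrument", "machine": "instrument"}
IRRELEVANT = {"age", "brand", "supplier"}

def extract_features(raw_attributes):
    """Map raw attribute words to canonical features, dropping noise."""
    features = []
    for attr in raw_attributes:
        if attr in IRRELEVANT:
            continue                      # drop classification-irrelevant attributes
        canon = SYNONYMS.get(attr, attr)  # merge semantic near-duplicates
        if canon not in features:
            features.append(canon)        # keep first occurrence only
    return features

print(extract_features(["device", "machine", "power", "brand", "model"]))
# ['instrument', 'power', 'model']
```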
S22 obtaining attribute feature weight of teaching assets
Sorting the attribute features according to the source (namely asset name, asset attribute set and/or asset entry information) and the priority order of the asset name, asset attribute set and asset entry information, setting different feature weight coefficients, and calculating the attribute feature weight of the teaching asset according to a formula:
[Formula (1), the attribute feature weight calculation, is rendered only as an image in the original patent]

where ω_ij represents the weight value of the jth attribute feature in the ith teaching asset sample; α^(j) is the attribute feature source coefficient; SD^(i) is the number of indicative attribute features contained in the ith teaching asset sample; and n is the number of all attribute features extracted from the teaching asset sample set. An indicative attribute feature is one that can definitively assign a teaching asset to a certain category; for example, if a teaching asset sample contains attribute features such as "instrument" or "power", the sample can be classified into the instrument-and-equipment class.
S23, using vector space model to represent the attribute feature of the teaching asset, and representing the selected attribute feature and attribute feature weight of the teaching asset into a feature vector form, i.e. the teaching asset is regarded as a vector of multidimensional vector space:
in this model, a sample set of teaching assets containing m samples of teaching assets, n attribute features, can be represented as a vector space:
C = {d_1, d_2, …, d_m}    formula (2)

Each teaching asset sample D_i (1 ≤ i ≤ m) can be expressed as an n-dimensional vector:

d_i = (ω_i1, ω_i2, …, ω_in)^T    formula (3)

where 1 ≤ i ≤ m and T denotes transposition.
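As an illustration of the vector space representation of formulas (2) and (3), the sketch below builds a sample vector d_i. Since the weight formula (1) appears only as an image in the patent, the weight expression used here (a source coefficient divided by one plus the feature's arrangement rank) is purely an assumed stand-in, and the feature list and coefficient values are likewise invented for illustration:

```python
# Sketch of the VSM representation (formulas (2)-(3)). The weight
# expression alpha / (1 + rank) below is an ASSUMED stand-in for the
# patent's formula (1), which is only available as an image.
ALL_FEATURES = ["instrument", "rated power", "rated voltage", "specification"]
SOURCE_ALPHA = {"name": 1.0, "attribute_set": 0.6, "entry": 0.3}  # assumed coefficients

def sample_vector(sample_features):
    """Return the n-dimensional weight vector d_i for one asset sample.

    sample_features maps each feature present in the sample to its source.
    """
    vec = []
    for rank, feat in enumerate(ALL_FEATURES):
        if feat in sample_features:
            alpha = SOURCE_ALPHA[sample_features[feat]]
            vec.append(alpha / (1 + rank))  # assumed stand-in for formula (1)
        else:
            vec.append(0.0)                 # absent features get weight 0
    return vec

projector = {"instrument": "name", "rated power": "attribute_set"}
print(sample_vector(projector))  # [1.0, 0.3, 0.0, 0.0]
```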
Step three: carrying out unsupervised initial clustering on the processed teaching asset sample to obtain an initial clustering cluster; the method comprises the following steps:
s31: for a given sample set, initializing m teaching asset sample points as m cluster types, calculating the distance between every two of the m cluster types, arranging the distance between every two of the m cluster types into a matrix form, and recording the matrix form as an initial distance matrix; the method comprises the following steps:
any two samples D1And D2Expressed as two vectors d in VSM1=(ω1112,…,ω1n)TAnd d2=(ω2122,…,ω2n)TT denotes transposition, then sample D1And D2I.e. to represent a class cluster D1And D2At this time, two kinds of clusters D1And D2The distance calculation formula of (c) is as follows:
Figure BDA0002202805560000041
then an initial distance matrix is obtained:
Figure BDA0002202805560000042
s32: and searching the cluster closest to each cluster through the initial distance matrix obtained in the step S31, and combining the two clusters closest to each cluster to form a new cluster.
By searching the initial distance matrix, the samples with the shortest distance are combined into a class cluster, and then the distance between every two combined class clusters is calculated in sequence, wherein the calculation method comprises the following steps:
Let S be a class cluster containing t samples and let d_x be a sample in S; then the center point of S is:

O(S) = (1/t) Σ_{d_x ∈ S} d_x

The distance between class clusters S_1 and S_2 is then the distance between their center points O(S_1) and O(S_2), namely:

d(S_1, S_2) = d(O(S_1), O(S_2)) = ‖O(S_1) − O(S_2)‖    formula (6)
S33: and repeating the step S32 until the obtained cluster number is the set initial cluster number K1.
Step four: by absorbing empirical knowledge, performing semi-supervised hierarchical clustering on the teaching asset samples by using the pairwise constraint set of the samples so as to improve the accuracy of a clustering effect; the method comprises the following steps:
s41: setting a paired constraint sample set in the sample set by using empirical knowledge;
the pairwise constraints include a must-link constraint and a cannot-link constraint. Where the must-link constraint indicates that two samples must be assigned to the same cluster, and the cannot-link constraint indicates that two samples must be assigned to different clusters. The set of pairwise constraints in a certain class of clusters is denoted as M (S; d) and N (S; d). M (S; d) refers to the set of samples in the cluster S having a multist-link constrained relationship with sample d, and N (S; d) refers to the set of samples in the cluster S having a candot-link constrained relationship with sample d. Accordingly, M (S; S ') represents the set of all samples having a list-link constrained relationship in the cluster S and the cluster S', and N (S; S ') represents the set of all samples having a cannot-link constrained relationship in the cluster S and the cluster S'.
S42: combining the initial unsupervised clustering results in the step three, and changing the distance between the clustering clusters by using pairwise constraint information;
the method for changing the distance between the clusters in the S42 comprises the following steps: introducing KNN algorithm, and the idea of the method is that if a sample dyWhen most of t existing teaching asset samples closest to the sample belong to a certain category, the sample also belongs to the category. By using
Figure BDA0002202805560000051
Representation and sample dyThe nearest t marked samples (namely the samples of the existing teaching assets), then the sample dyThe closeness to the t labeled samples closest to it is expressed as:
Figure BDA0002202805560000052
finally, the degree of constraint between the clusters S and S 'is represented by P (S; S'):
Figure BDA0002202805560000053
where ρ isuRepresents a sample duProximity to the t labeled samples closest thereto; rholRepresents a sample dlProximity to the t labeled samples closest thereto;representing the degree of the must-link constraint between the cluster S and the cluster S';indicates the degree of cannot-link constraint between clusters S and S'.
When P (S; S ') is > 1, then S is considered to be must-link bound to S'; when P (S; S ') <1, then S is considered cannot-link bound to S'.
According to the constraint degree P(S; S′), the distance between clusters S_1 and S_2 is changed to:

d′(S_1, S_2) = d(O(S_1), O(S_2)) · P(S_1; S_2) · P(S_2; S_1)    formula (9)

where O(S_1) and O(S_2) are the center points of clusters S_1 and S_2, respectively, and d(O(S_1), O(S_2)) is the distance between these center points.
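A runnable sketch of the distance adjustment of formula (9). Because formulas (7) and (8) appear only as images in the patent, the constraint degree P below is an assumed stand-in (a ratio of cannot-link to must-link pair counts), chosen so that must-link evidence yields P < 1 and shrinks the adjusted distance while cannot-link evidence yields P > 1 and enlarges it; the patent's own P is defined through the KNN closeness values instead:

```python
# Sketch of formula (9): scaling the inter-cluster distance by the
# pairwise-constraint degree P. The ratio below is an ASSUMED
# stand-in for formulas (7)-(8), which are images in the patent.
def constraint_degree(cluster_a, cluster_b, must_links, cannot_links):
    must = sum(1 for a in cluster_a for b in cluster_b
               if (a, b) in must_links or (b, a) in must_links)
    cannot = sum(1 for a in cluster_a for b in cluster_b
                 if (a, b) in cannot_links or (b, a) in cannot_links)
    return (1 + cannot) / (1 + must)   # assumed stand-in for formula (8)

def adjusted_distance(d, cluster_a, cluster_b, must_links, cannot_links):
    p = constraint_degree(cluster_a, cluster_b, must_links, cannot_links)
    return d * p * p                   # formula (9): d' = d * P(S1;S2) * P(S2;S1)

must = {("d1", "d3")}
cannot = {("d1", "d5")}
print(adjusted_distance(2.0, ["d1", "d2"], ["d3", "d4"], must, cannot))  # 0.5
```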
S43: searching two clustering clusters with the shortest distance, and combining the clustering clusters with the shortest distance into one clustering cluster according to the hierarchical clustering algorithm principle;
s44: repeating the step S43 until the obtained cluster number is the set teaching asset class number K;
Step five: classify the teaching asset samples to be classified according to the step-four semi-supervised hierarchical clustering result.

If the teaching asset to be classified is an existing teaching asset, it is assigned to its class directly according to the semi-supervised hierarchical clustering result. If the teaching asset to be classified is a newly added teaching asset sample D_{m+1}, initialize D_{m+1} as a class cluster S_{m+1}; the distance between the newly added teaching asset D_{m+1} and each of the K teaching asset classes is then:

d(S_{m+1}, S_i) = ‖O(S_{m+1}) − O(S_i)‖    formula (10)
And comparing the distance calculation results of the formula (10), determining a target teaching asset class having the minimum distance with the teaching asset to be classified in the K teaching asset classes, and adding the newly added teaching asset into the target teaching asset class.
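For a newly added sample, step five reduces to nearest-centroid assignment among the K class clusters. A minimal sketch of formula (10):

```python
import math

# Sketch of step five / formula (10): a newly added sample is treated
# as its own cluster and assigned to the nearest of the K class centers.
def classify_new_sample(new_vec, class_centroids):
    """Return the index of the class whose center point is nearest."""
    best_idx, best_d = None, None
    for idx, center in enumerate(class_centroids):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(new_vec, center)))
        if best_d is None or d < best_d:
            best_idx, best_d = idx, d
    return best_idx

centroids = [[0.0, 0.0], [5.0, 5.0]]  # centers of K = 2 classes (toy values)
print(classify_new_sample([4.2, 4.9], centroids))  # 1
```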
According to the invention, a feature weight formula is introduced according to different sources of the attribute features of the teaching assets, the importance degrees of the attribute features of different sources are distinguished through different feature weight coefficients, and then the teaching asset sample is expressed into a vector space form which can be processed by a computer. In addition, the distance between the clusters is adjusted through the constraint set, and therefore classification of the teaching assets is achieved through a semi-supervised hierarchical clustering method.
Compared with the existing manual classification technology, the method greatly saves the manpower and time required by classification, and avoids the classification result difference caused by different subjective experiences of people. Compared with a simple undifferentiated weight assignment method, the method introduces a teaching asset attribute feature weight calculation formula according to different sources of the teaching asset attribute features, and enables the teaching asset feature weights to correspond to different numerical values according to the importance degree by changing the weight coefficient. The method highlights the differences of different teaching asset samples, enables the classification result to be more accurate, and reduces the error of the classification result. Compared with an unsupervised clustering method, the method realizes semi-supervised clustering through the constraint set, and effectively improves the accuracy of a clustering result.
Drawings
FIG. 1 is a flow chart of a method for classifying an instructional asset sample in accordance with the present invention.
FIG. 2 is a sub-flow chart of the semi-supervised clustering algorithm based on constraints of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflicting with each other. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below. Other equivalent or alternative features having similar purposes may be substituted unless specifically stated otherwise. That is, unless expressly stated otherwise, each feature is only an example of a generic series of equivalent or similar features.
The classification method based on semi-supervised clustering is applied to classification of teaching assets, and attribute features of teaching asset samples are extracted and attribute feature weights are calculated according to the characteristics of the asset samples; performing initial clustering calculation according to the sample space vector; adding constraint set information and adjusting the distance of the teaching asset clustering cluster by a semi-supervised learning method so as to optimize the clustering effect of the teaching asset data; and classifying the newly added teaching assets according to the semi-supervised clustering result.
As shown in fig. 1, the classification method for teaching assets sample according to the present invention includes:
(1) acquiring a college teaching asset sample;
(2) and clustering the teaching asset sample set by using a semi-supervised clustering algorithm.
(3) And classifying the to-be-classified teaching asset samples according to the semi-supervised hierarchical clustering result.
The following is a detailed description of the various parts:
(1) Acquire a college teaching asset sample. According to the requirements of college asset management, the asset name and asset attribute set information of the existing teaching assets are already entered when they are recorded in the asset management system, and this information forms the college's own asset database. Therefore, the sample set of a certain college's existing teaching assets can be imported into the clustering algorithm by computer. According to the asset information of the teaching asset database, the teaching asset class names and class number are determined as: houses, instruments, book archives, and furniture, 4 classes in total.
Unlike other samples, the instructional asset attribute features can be obtained in three ways, namely: asset name, asset attribute set, entry information. Features that best represent the concept of the asset class are selected. For example: the class information of the teaching asset sample can be mostly obtained from the name of the teaching asset sample, such as a building, an instrument, a device, a machine, a frame, a table, paper and the like, the class information of the teaching asset sample can be reflected by the attributes of part of teaching asset parameters, such as area, power, publishing houses and the like, and in addition, the class attribute characteristics can be obtained from the interpretation of the asset entry information of the unusual or newly-added unknown teaching asset.
For example, in the teaching asset sample database the information of a certain teaching building is: asset name: Second Teaching and Scientific Research Building; asset attribute set: developer, address, total area, building type, opening date, completion date, etc. The information of a certain brand of projector is: asset name: projector; asset attribute set: model, power, specification, supplier, service age, etc. The information of a certain book is: asset name: Robotics and Applications; asset attribute set: publishing company, publication date, publication number, age, etc. It should be noted that asset entry information interpretation is generally not required for such existing assets. For unusual or newly added unknown teaching assets, the category attribute features can be obtained through the interpretation of their entry information. For example, a survey target (also called a measuring mark) is a frame erected on a triangulation point or traverse point for observation, or used by a survey station as a sighting standard; the keyword feature "frame" can be extracted from this interpretation as the attribute feature.
(2) The teaching asset sample set is clustered as in the semi-supervised clustering algorithm of fig. 2. The semi-supervised clustering algorithm specifically comprises the following steps: extracting attribute features of the teaching asset sample and calculating feature weights; a vector space representation of a sample set of the educational asset; carrying out initial unsupervised clustering; semi-supervised clustering with constraint information.
Extracting attribute features of the teaching asset sample and calculating feature weights:
After the teaching asset sample is obtained, the sample attributes are extracted: semantically identical or similar features are merged, and irrelevant attributes are removed. Take a certain brand of projector among the teaching assets: its name may appear as "projection instrument" or "projection machine", where the suffixes "instrument" and "machine" are synonymous, so the two should be merged into the same feature item. In its asset attribute set, supplier and service age are attributes irrelevant to classification, and such features should be removed to reduce the algorithm running time. Following the priority order of asset name then asset attribute set, the features extracted for this projector are: instrument, rated power, rated voltage, and specification. Among these, "instrument" and "power" are indicative attribute features that can definitively assign the teaching asset to the instrument-and-equipment class, while "model" and "specification" are non-indicative attribute features: by themselves they cannot establish that the asset belongs to the instrument-and-equipment class, but they are still associated with that class, so after feature extraction, teaching assets containing these features are more likely to belong to the same class as the projector. After this processing, the teaching asset attribute features have been extracted according to the set priority order of asset name, asset attribute set, and asset entry information.
After the attribute features are extracted, the feature weights are calculated. Suppose that, after feature extraction, the teaching asset sample set contains 12 attribute features in total and a given teaching asset sample contains 3 indicative attribute features. The extracted attribute features are arranged with the indicative attribute features first and the non-indicative attribute features after, and within each group the features are ordered by source priority: asset-name-derived features, then asset-attribute-set-derived features, then asset-entry-information-derived features. Let
[Formula (1), the teaching-asset attribute feature weight formula; rendered as an image in the original publication.]
The weight of each sample feature item is then calculated according to the teaching-asset attribute feature weight formula (1).
The feature weights of the projector's feature items are calculated as follows. Assume the projector is the 3rd sample, and that among the 12 attribute features of the teaching asset sample set the attribute feature "instrument" has arrangement serial number 1, "rated power" has serial number 2, "rated voltage" has serial number 6, and "specification" has serial number 10. The weights of these four attribute features for the teaching-asset projector are then ω31 = 0.911, ω32 = 0.3205, ω36 = 0.254 and ω3,10 = 0.0519, respectively; the weights of the remaining attribute features, which do not appear in this teaching asset sample, are all 0.
Vector space representation of teaching asset samples:
the teaching asset sample set comprises 15 teaching asset samples and 12 attribute features. Each teaching asset sample Di (1 ≤ i ≤ m) can be represented as an n-dimensional vector di = (ωi1, ωi2, …, ωin)T, where T denotes transposition and ωij denotes the weight of the jth attribute feature in the ith asset sample, obtained from the feature item weight formula. After this vector conversion is completed, the space vector coordinates of each sample are stored in a space vector library for the subsequent inter-sample distance calculation. For example, the teaching asset sample D3, "projector", can be represented as the vector:
d3 = (0.736, 0.575, 0, 0, 0, 0.321, 0, 0, 0, 0.158, 0, 0)T
Among the teaching asset samples, the desktop computer, camera and projector are similar to one another and all belong to the instrument-and-equipment class, so the coordinate differences among these three samples are small; the projector and the camera are the most similar, so the difference between those two vectors is the smallest. For example, teaching asset sample D1, "desktop computer", can be expressed as the vector d1 = (0.692, 0.514, 0, 0, 0.341, 0.302, 0, 0, 0, 0.148, 0, 0)T, and teaching asset sample D2, "camera", as the vector d2 = (0.73, 0.571, 0, 0, 0, 0.318, 0, 0, 0, 0.156, 0, 0.124)T.
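The similarity claims above can be checked numerically. The following is a minimal sketch assuming the inter-sample distance is the plain Euclidean distance, which reproduces the d(D2, D3) = 0.124 quoted in the text exactly, and the other two distances up to rounding of the published vector components:

```python
import math

# The three sample vectors given in the text: D1 desktop computer,
# D2 camera, D3 projector (12-dimensional feature-weight vectors).
d1 = [0.692, 0.514, 0, 0, 0.341, 0.302, 0, 0, 0, 0.148, 0, 0]
d2 = [0.73, 0.571, 0, 0, 0, 0.318, 0, 0, 0, 0.156, 0, 0.124]
d3 = [0.736, 0.575, 0, 0, 0, 0.321, 0, 0, 0, 0.158, 0, 0]

def dist(a, b):
    """Euclidean distance between two feature-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The camera/projector pair is by far the closest, as the text states.
print(round(dist(d2, d3), 3))  # 0.124
```

Running this also confirms the ordering d(D2, D3) < d(D1, D3) < d(D1, D2), i.e. the projector is closer to the camera than either is to the desktop computer.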
Initial unsupervised clustering:
after the teaching asset sample data preprocessing and structured conversion, each teaching asset sample Di corresponds to a vector di. The distance between teaching asset samples is then calculated with the inter-sample distance formula and used as the criterion for whether samples can be clustered into the same cluster. The pairwise distances among teaching asset samples D1, D2 and D3 are d(D1,D2) = 0.366, d(D1,D3) = 0.346 and d(D2,D3) = 0.124. The closest teaching asset samples are merged into one cluster: D2 (camera) and D3 (projector) are combined into cluster S1, while D1 (desktop computer) forms cluster S2 on its own. The center point O(S1) of cluster S1 has coordinates (0.733, 0.573, 0, 0, 0, 0.32, 0, 0, 0, 0.157, 0, 0.062), so the distance between clusters S1 and S2 is d(S1,S2) = 0.351. The distances between all samples and all clusters are calculated, and the two closest clusters are repeatedly merged into a new cluster until the number of clusters equals the set initial cluster number K1 = 7.
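The merge loop described above can be sketched as a small centroid-linkage agglomerative routine. This is an illustrative reading of the procedure on toy data, not the patent's reference implementation:

```python
import math

def centroid(cluster):
    """Mean vector of a cluster's samples (the cluster's center point)."""
    t = len(cluster)
    return [sum(v[j] for v in cluster) / t for j in range(len(cluster[0]))]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, k):
    """Merge the two closest clusters (centroid linkage) until k clusters remain."""
    clusters = [[v] for v in vectors]            # every sample starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Toy data: two tight groups on a line; stopping at k = 2 recovers them.
data = [[0.0], [0.1], [0.2], [5.0], [5.1]]
print([len(c) for c in agglomerate(data, 2)])  # [3, 2]
```

With the sample vectors of the embodiment and k = K1 = 7, the same loop reproduces the merging of D2 and D3 described above, since their distance (0.124) is the smallest.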
Semi-supervised clustering with pairwise constraints:
Initial unsupervised clustering of the given teaching asset sample data yields an initial classification result. Since initial unsupervised clustering has no learning ability, its classification accuracy is not ideal; the following semi-supervised stage is therefore added to further improve the clustering. A sub-flowchart of the constraint-based semi-supervised clustering algorithm is shown in fig. 2.
For the teaching asset sample set {Di} after initial clustering, let the clusters formed be {S1, S2, …, SN}, and set the number of teaching asset classes to 4, i.e. the required number of output clusters is 4. All P(S;S') are calculated using the constraint information, for example: cluster S1 has a must-link constraint relationship with sample d1, and cluster S2 has a cannot-link constraint relationship with sample d1. The formula
d′(S1,S2) = d(O(S1),O(S2)) · P(S1;S2) · P(S2;S1)
is then used to adjust the distance between clusters S1 and S2. The distances among the 7 adjusted clusters are calculated, the two closest clusters (Sp, Sq) are found and merged into a new cluster Sr, and Y is set to Y − 1. If Y = K = 4, the algorithm stops and outputs the result; if Y > K, the algorithm returns to the previous step, recalculates all P(S;S') and all distances d(S,S'), and again finds and merges the two closest clusters, until Y = K = 4.
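One constrained-merge step can be sketched as follows, assuming the constraint degrees P(S;S') have already been computed (their defining formula is rendered only as an image in the original, so all numeric values below are hypothetical):

```python
# A minimal sketch of one constrained-merge step: the raw centroid distances
# are rescaled by both constraint degrees (formula (9)) before the nearest
# pair of clusters is chosen. All numeric values here are hypothetical.
def adjusted_distance(d, p_st, p_ts):
    """d'(S,S') = d(O(S),O(S')) * P(S;S') * P(S';S)."""
    return d * p_st * p_ts

# Hypothetical centroid distances between 3 clusters, keyed by (i, j), i < j.
dist = {(0, 1): 0.351, (0, 2): 0.40, (1, 2): 0.45}
# Hypothetical constraint degrees P(S_i; S_j), both directions per pair.
P = {(0, 1): 1.8, (1, 0): 1.6,
     (0, 2): 1.0, (2, 0): 1.0,
     (1, 2): 0.6, (2, 1): 0.7}

adjusted = {pair: adjusted_distance(d, P[pair], P[pair[::-1]])
            for pair, d in dist.items()}
closest = min(adjusted, key=adjusted.get)  # pair to merge next
print(closest, round(adjusted[closest], 3))
```

Here the constraint degrees change which pair is merged: under the raw distances the pair (0, 1) would be merged first, but after rescaling the pair (1, 2) becomes the closest.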
The semi-supervised clustering result is output, i.e. which of the 15 teaching asset samples each of the 4 teaching asset classes contains:
S1 (instruments and equipment): D1, D2, D3, D6, D7, D13
S2 (houses): D5, D8, D10, D14
S3 (books and archives): D4, D9, D11
S4 (furniture): D12, D15
(3) The to-be-classified teaching asset samples are classified according to the semi-supervised hierarchical clustering result.
A to-be-classified teaching asset Dr is input. If it is an existing teaching asset, for example D3, the sample is classified, according to the semi-supervised hierarchical clustering result, into the class S1 (instruments and equipment) to which D3 belongs.
If the to-be-classified teaching asset is a newly added one, its vector is obtained through teaching asset sample feature extraction and feature item weight calculation; for the newly added sample D16, d16 = (0.738, 0.577, 0, 0, 0, 0.322, 0, 0, 0, 0, 0.141, 0)T. The distances d(D16, Si) (i = 1, 2, 3, 4) between the newly added sample and the 4 finally output teaching asset classes are calculated, the 4 distances are compared, the target teaching asset class with the minimum distance to the newly added asset is determined, and the newly added asset is added to that target class. For example: the distances between sample D16 and clusters S1, S2, S3, S4 are calculated and compared; if D16 is closest to cluster S1, sample D16 is classified into cluster S1.
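The nearest-class assignment of a newly added asset can be sketched as follows. The vector d16 is the one given above, while the class centroids for S2–S4 are hypothetical placeholders (only the S1 centroid matches the value stated earlier in the text):

```python
import math

def dist(a, b):
    """Euclidean distance between two feature-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(new_vec, centroids):
    """Assign a new teaching-asset vector to the class with the nearest centroid."""
    return min(centroids, key=lambda name: dist(new_vec, centroids[name]))

# Class centroids: S1 from the text; S2-S4 are hypothetical placeholders.
centroids = {
    "S1 instruments": [0.733, 0.573, 0, 0, 0, 0.32, 0, 0, 0, 0.157, 0, 0.062],
    "S2 houses":      [0, 0, 0.8, 0, 0, 0, 0.4, 0, 0, 0, 0, 0],
    "S3 books":       [0, 0, 0, 0.7, 0, 0, 0, 0.5, 0, 0, 0, 0],
    "S4 furniture":   [0, 0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0.3],
}
d16 = [0.738, 0.577, 0, 0, 0, 0.322, 0, 0, 0, 0, 0.141, 0]
print(classify(d16, centroids))  # S1 instruments
```

As expected from the text, the new sample's nonzero weights on the instrument-related features place it closest to the S1 centroid.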
According to the different sources of the teaching asset attribute features, the invention introduces a formula for the teaching asset attribute feature weight, whose weight coefficients give each attribute feature a value corresponding to its importance. This highlights the differences between teaching asset samples, making the classification result more accurate and reducing classification error. Compared with unsupervised clustering, the method combines empirical knowledge and realizes semi-supervised clustering through a constraint set, effectively improving the accuracy of teaching asset classification.
The invention is not limited to the foregoing embodiments. The invention extends to any novel feature, or any novel combination of features, disclosed in this specification, and to any novel method or process step, or any novel combination of steps, disclosed.

Claims (4)

1. A semi-supervised clustering teaching asset classification method for optimizing feature weight is characterized by comprising the following steps:
the method comprises the following steps: acquiring a teaching asset sample comprising an asset name, an asset attribute set and asset entry information;
step two: according to the characteristics of the teaching asset sample, extracting the attribute features of the teaching asset from the different attribute feature sources, introducing a teaching asset feature weight formula, calculating the corresponding attribute feature weights, and obtaining a vector space representation of the teaching asset sample;
Step three: carrying out unsupervised initial clustering on the processed teaching asset sample to obtain an initial clustering cluster;
step four: by absorbing empirical knowledge, performing semi-supervised hierarchical clustering on the teaching asset samples by using the pairwise constraint set of the samples so as to improve the accuracy of a clustering effect; the method comprises the following steps:
s41: setting a paired constraint sample set in the sample set by using empirical knowledge;
pairwise constraints include the must-link constraint and the cannot-link constraint; the must-link constraint indicates that two samples must be assigned to the same cluster, and the cannot-link constraint indicates that two samples must be assigned to different clusters; the pairwise constraint sets within a cluster are denoted M(S;d) and N(S;d): M(S;d) is the set of samples having a must-link constraint relationship with sample d in cluster S, and N(S;d) is the set of samples having a cannot-link constraint relationship with sample d in cluster S; accordingly, M(S;S') denotes the set of all samples in clusters S and S' having a must-link constraint relationship, and N(S;S') denotes the set of all samples in clusters S and S' having a cannot-link constraint relationship;
s42: combining the initial unsupervised clustering results in the step three, and changing the distance between the clustering clusters by using pairwise constraint information;
s43: searching two closest clustering clusters, and combining the closest clustering clusters into one clustering cluster;
s44: repeating the step S43 until the obtained cluster number is the set teaching asset class number K;
step five: classifying the to-be-classified teaching asset samples according to the semi-supervised hierarchical clustering result of step four;
if the to-be-classified teaching asset is an existing teaching asset, it is classified into its class according to the semi-supervised hierarchical clustering result; if the to-be-classified teaching asset is a newly added teaching asset, the distances between it and the K teaching asset classes are calculated, the target teaching asset class with the minimum distance to it among the K classes is determined from the distance comparison, and the newly added teaching asset is added to that target class.
2. The semi-supervised clustering teaching asset classification method for optimizing feature weights as claimed in claim 1, wherein the second step is specifically:
s21 extracting attribute features of teaching assets
when the asset attribute features are extracted, the attribute features that best describe the teaching asset sample are first extracted according to the priority order of asset name, asset attribute set and asset entry information, semantically similar attribute features are merged, and irrelevant attributes are removed;
s22 obtaining attribute feature weight of teaching assets
Sorting the attribute features according to the source thereof and the priority sequence of the asset name, the asset attribute set and the asset entry information, setting different feature weight coefficients, and calculating the attribute feature weight of the teaching asset according to a formula:
[Formula (1), defining ωij in terms of α(j), SD(i) and n; rendered as an image in the original publication.]
wherein ωij represents the weight of the jth attribute feature in the ith teaching asset sample; α(j) is the attribute feature source coefficient; SD(i) is the number of indicative attribute features contained in the ith teaching asset sample; and n is the number of all attribute features extracted from the teaching asset sample set;
s23: using a vector space model (VSM) to represent the attribute features of the teaching asset, expressing the selected attribute features and their weights as a feature vector, i.e. each teaching asset is regarded as a vector in a multidimensional vector space:
in this model, a teaching asset sample set containing m teaching asset samples and n attribute features can be represented as the vector space:
C = {d1, d2, …, dm}   formula (2)
each teaching asset sample Di (1 ≤ i ≤ m) can be expressed as an n-dimensional vector:
di = (ωi1, ωi2, …, ωin)T   formula (3)
wherein 1 ≤ i ≤ m and T denotes transposition.
3. The semi-supervised clustering teaching asset classification method for optimizing feature weights as claimed in claim 2, wherein the third step is specifically:
s31: for a given sample set, initializing the m teaching asset sample points as m clusters, calculating the pairwise distances among the m clusters, and recording them as the initial distance matrix; specifically:
any two samples D1 and D2 are expressed in the VSM as the two vectors d1 = (ω11, ω12, …, ω1n)T and d2 = (ω21, ω22, …, ω2n)T, where T denotes transposition; samples D1 and D2 then each represent a cluster, and at this point the distance between the two clusters D1 and D2 is calculated as follows:
d(D1,D2) = ||d1 − d2|| = sqrt( Σ_{j=1}^{n} (ω1j − ω2j)^2 )   formula (4)
s32: using the distance matrix d(D1,D2), finding the closest cluster to each cluster, and combining the two closest clusters to form a new cluster;
by searching the initial distance matrix, the samples with the shortest distance are combined into one cluster, and the distances between the merged clusters are then calculated in turn, as follows:
let S be a cluster containing t samples and dx be a sample in S; then the center point of S is:
O(S) = (1/t) Σ_{dx ∈ S} dx   formula (5)
the distance between clusters S1 and S2 is then:
d(S1,S2) = d(O(S1),O(S2)) = ||O(S1) − O(S2)||   formula (6)
S33: and repeating the step S32 until the obtained cluster number is the set initial cluster number K1.
4. The semi-supervised clustering teaching asset classification method for optimizing feature weight of claim 3, wherein the method for changing the distance between clusters in S42 is as follows:
if most of the t labeled teaching asset samples closest to a sample dy belong to a certain class, then sample dy also belongs to that class; the closeness of sample dy to the t labeled samples nearest to it is expressed as:
[Formula (7), defining the closeness ρy of sample dy to its t nearest labeled samples; rendered as an image in the original publication.]
finally, the degree of constraint between clusters S and S' is represented by P(S;S'):
[Formula (8), defining P(S;S') from the must-link and cannot-link constraint degrees and the closeness values; rendered as an image in the original publication.]
where ρu represents the closeness of sample du to the t labeled samples nearest to it, and ρl represents the closeness of sample dl to the t labeled samples nearest to it;
[an expression rendered as an image in the original] represents the degree of the must-link constraint between clusters S and S', and [an expression rendered as an image in the original] represents the degree of the cannot-link constraint between clusters S and S';
when P(S;S') > 1, S and S' are considered must-link constrained; when P(S;S') < 1, S and S' are considered cannot-link constrained;
according to the constraint degree P(S;S'), the distance between clusters S1 and S2 is changed to:
d′(S1,S2) = d(O(S1),O(S2)) · P(S1;S2) · P(S2;S1)   formula (9)
wherein O(S1) and O(S2) are the center points of clusters S1 and S2 respectively, and d(O(S1),O(S2)) is the distance between the center points O(S1) and O(S2) of clusters S1 and S2.
CN201910871026.6A 2019-09-16 2019-09-16 Semi-supervised clustering teaching asset classification method for optimizing feature weight Pending CN110766273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910871026.6A CN110766273A (en) 2019-09-16 2019-09-16 Semi-supervised clustering teaching asset classification method for optimizing feature weight


Publications (1)

Publication Number Publication Date
CN110766273A true CN110766273A (en) 2020-02-07

Family

ID=69329951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910871026.6A Pending CN110766273A (en) 2019-09-16 2019-09-16 Semi-supervised clustering teaching asset classification method for optimizing feature weight

Country Status (1)

Country Link
CN (1) CN110766273A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897962A (en) * 2020-07-27 2020-11-06 绿盟科技集团股份有限公司 Internet of things asset marking method and device
CN111897962B (en) * 2020-07-27 2024-03-15 绿盟科技集团股份有限公司 Asset marking method and device for Internet of things
CN112200212A (en) * 2020-08-17 2021-01-08 广州市自来水有限公司 Artificial intelligence-based enterprise material classification catalogue construction method
CN112506930A (en) * 2020-12-15 2021-03-16 北京三维天地科技股份有限公司 Data insight platform based on machine learning technology
CN113052534A (en) * 2021-03-30 2021-06-29 上海东普信息科技有限公司 Address allocation method, device, equipment and storage medium based on semi-supervised clustering
CN113052534B (en) * 2021-03-30 2023-08-01 上海东普信息科技有限公司 Address allocation method, device, equipment and storage medium based on semi-supervised clustering
CN113239968A (en) * 2021-04-15 2021-08-10 国家计算机网络与信息安全管理中心 Method, device, computer storage medium and terminal for realizing server classification
CN113052266A (en) * 2021-04-27 2021-06-29 中国工商银行股份有限公司 Transaction mode type identification method and device
CN113035281A (en) * 2021-05-24 2021-06-25 浙江中科华知科技股份有限公司 Medical data processing method and device
CN115310879A (en) * 2022-10-11 2022-11-08 浙江浙石油综合能源销售有限公司 Multi-fueling-station power consumption control method based on semi-supervised clustering algorithm
CN115310879B (en) * 2022-10-11 2022-12-16 浙江浙石油综合能源销售有限公司 Multi-fueling-station power consumption control method based on semi-supervised clustering algorithm

Similar Documents

Publication Publication Date Title
CN110766273A (en) Semi-supervised clustering teaching asset classification method for optimizing feature weight
Roy et al. Inferring concept prerequisite relations from online educational resources
Chen et al. General functional matrix factorization using gradient boosting
KR20190118477A (en) Entity recommendation method and apparatus
Ma et al. Course recommendation based on semantic similarity analysis
CN110310012B (en) Data analysis method, device, equipment and computer readable storage medium
Karan et al. FAQIR–a frequently asked questions retrieval test collection
Santoso et al. The analysis of student performance using data mining
Isljamovıc et al. PREDICTING STUDENTS’ACADEMIC PERFORMANCE USING ARTIFICIAL NEURAL NETWORK: A CASE STUDY FROM FACULTY OF ORGANIZATIONAL SCIENCES
Ramachandran et al. Integration of machine learning algorithms for E-Learning System course recommendation based on Data Science
Wang et al. Data-driven flow cytometry analysis
García-Romero et al. Another brick in the wall: a new ranking of academic journals in Economics using FDH
Alsultanny Selecting a suitable method of data mining for successful forecasting
CN103279549A (en) Method and device for acquiring target data of target objects
Sasmita et al. Development of machine learning implementation in engineering education: A literature review
Rashid et al. Student Career Recommendation System Using Content-Based Filtering Method
Niswatin et al. Classification of category selection title undergraduate thesis using k-nearest neighbor method
Tone et al. How to deal with non-convex frontiers in data envelopment analysis
Zahir et al. Access plan recommendation: A clustering based approach using queries similarity
Hafdi et al. Student Performance Prediction in Learning Management System Using Small Dataset
Siahaan et al. Implementation of Data Mining Using the K-Nearest Neighbor Method to Determine the feasibility of a lecturer's functional promotion
Prakash et al. App Review Prediction Using Machine Learning
Rianti et al. Machine Learning Journal Article Recommendation System using Content based Filtering
Göksün et al. The role of learning analytics in distance learning: a SWOT analysis
Siren Statistical models for inferring the structure and history of populations from genetic data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200207