CN110704624A - Geographic information service metadata text multi-level multi-label classification method - Google Patents
Geographic information service metadata text multi-level multi-label classification method
- Publication number
- CN110704624A (application CN201910942287.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- theme
- samples
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/387—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-level, multi-label classification method for geographic information service metadata text, comprising the following steps: 1) acquiring a geographic information service metadata text set, preprocessing the text, and splitting each data sample into a combination of text feature words; 2) setting a first-level classification catalog and generating a list of typical words semantically related to the classification categories; 3) screening the text feature words against the typical word list; 4) selecting ML-KNN as one base model for co-training; 5) establishing a topic prediction model, ML-CSW, as the other base model for co-training; 6) designing a co-training mechanism that matches multi-label topics to each metadata text, the result serving as the primary coarse-grained topic classification; 7) selecting the metadata texts corresponding to a given classification label and obtaining fine-grained topic category catalogs at successive levels. The method takes into account the domain characteristics and text semantics of geographic information service metadata, relies on only a small number of labeled samples, and achieves better overall classification performance than traditional multi-label classification methods.
Description
Technical Field
The invention relates to a natural language processing technology, in particular to a method for classifying geographic information service metadata texts in a multi-level and multi-label manner.
Background
Accurate text classification is an important means of data analysis and is key to improving the retrieval quality of geographic information resources; it has a wide range of application scenarios. Traditional classification methods are mostly suited to binary or single-label scenarios, and their heavy reliance on large numbers of labeled samples to train the classification model limits both the accuracy and comprehensiveness of text classification and the scenarios to which the model can be applied. For geographic information service metadata in particular, sample data sets labeled with topics are usually lacking, and the text content is heterogeneous, with a complex feature vocabulary that mixes geoscience terms with general vocabulary. In addition, the overlap and subordination relations among topics give metadata text topics multi-granularity and multi-class characteristics, which further increases the difficulty of topic classification. To address the lack of training samples and the need for multi-class matching, some scholars have proposed mechanisms such as semi-supervised and weakly supervised learning to reduce the classifier's dependence on training samples, and have implemented multi-label text classification with methods such as ML-KNN, BR-KNN and TSVM. However, these methods usually do not incorporate domain features or consider the semantics of the technical terms in the text, and therefore do not fit the textual characteristics of geographic information service metadata well.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for classifying geographic information service metadata texts in a multi-level and multi-label manner aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: a geographic information service metadata text multi-level multi-label classification method comprises the following steps:
1) acquiring a geographic information service metadata text set containing unmarked samples and marked samples to perform text preprocessing, and dividing each data sample into text feature word combinations;
2) defining a primary classification catalogue based on the domain application theme category of the geographic information resource, and generating a typical word list which is closely associated with the semantics of the classification category (hereinafter referred to as theme);
3) screening text characteristic words according to the typical word list, filtering out the characteristics of which the distance from the typical words is greater than a threshold value, and obtaining a characteristic subset screened according to the theme classification;
4) selecting the classical multi-label classification algorithm ML-KNN (Multi-Label K-Nearest Neighbors) as a base model $H_1$ for co-training;
5) calculating the semantic distance from the features to the topics according to the corpus, and establishing a topic prediction model ML-CSW (Multi-label Classification based on SWEET & WordNet), which is used as the other base model $H_2$ for co-training;
6) Designing a cooperative mechanism based on the two basic models, and matching a multi-label theme for the metadata text to serve as a primary coarse-grained theme classification result;
7) selecting a metadata text corresponding to a certain classification label according to a primary coarse-grained theme classification result, extracting a text theme to serve as a fine-grained theme of a next level, and simultaneously obtaining a matching relation between the metadata text and a double-layer theme catalog;
8) repeating step 7) to obtain fine-grained topic category catalogs at different levels and the matching relation between the metadata text and the topic catalogs.
According to the scheme, the primary classification catalog of step 2), which is based on the domain application topic categories of geographic information resources, is obtained by extending the Societal Benefit Areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience field.
According to the scheme, the typical word list in step 2) is generated as follows:
taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic as defined in SWEET and WordNet are extracted as typical words semantically related to the topic, producing the typical word list.
According to the scheme, the text characteristic words are screened according to the typical word list in the step 3), which specifically comprises the following steps:
S31, representing the typical words and the text feature words as two-dimensional word vectors based on the Word2vec algorithm;
S32, calculating the cosine distance between the typical word vectors and the text feature word vectors;
S33, setting a distance threshold T, and filtering out the text feature words whose cosine distance to the typical words is greater than T.
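Steps S31-S33 can be illustrated with a minimal Python sketch. It assumes the word vectors already exist (the toy two-dimensional vectors below are invented, not Word2vec output) and that a feature is kept when its cosine distance to the nearest typical word does not exceed T. The nearest-word rule, the function names and the threshold value are illustrative assumptions rather than details given in the text.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_u * norm_v)


def screen_features(feature_vecs, typical_vecs, threshold):
    """Keep features whose distance to the nearest typical word is <= threshold."""
    kept = []
    for word, vec in feature_vecs.items():
        nearest = min(cosine_distance(vec, t) for t in typical_vecs.values())
        if nearest <= threshold:
            kept.append(word)
    return kept


# Toy 2-D vectors, invented for illustration only.
typical = {"agriculture": [1.0, 0.0], "water": [0.0, 1.0]}
features = {"crop": [0.9, 0.1], "river": [0.1, 0.9], "http": [-1.0, -1.0]}
selected = screen_features(features, typical, threshold=0.3)
```

In this toy run "crop" and "river" lie close to a typical word and are kept, while "http" points away from both and is filtered out.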
According to the scheme, the method for establishing the topic model in the step 5) is as follows:
according to the network definitions of the SWEET ontology library and the WordNet English lexical network, calculating the semantic distance $d_{p_i}$ between the text feature $f$ and each topic $p_i$;
finding the minimum of the semantic distances $d_{p_i}$ between the feature $f$ and each topic $p_i$, and taking it as the maximum semantic relevance $s_f$ of the text feature $f$ to all topics $P$, where $P$ is the set of all topics;
defining feature weights based on the shortest distance between the text features and the topics, establishing a topic prediction model, and predicting multi-label topics for the unlabeled samples;
assuming the training set contains $n$ text features in total, the vector of maximum semantic relevance from all features to all topics in the training set, $S = [s_1, s_2, \ldots, s_n]$, can be calculated; the weight $w(x)$ of a single data item $x$ is defined as a $1 \times n$ vector whose entries correspond to the weights of the $n$ text features: the entry for feature $f$ is $s_f$ if $f$ appears in sample $x$, and 0 otherwise;
establishing the topic prediction model $Y$ given below, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter; based on the labeled sample data, the model $Y$ is trained by iterative optimization with a BP neural network, the optimal $F$ and $\alpha$ at minimum loss are calculated to obtain the final model, and the category set of an unlabeled sample $t$ is predicted with this model:
$Y = w(x) \cdot F + \alpha$.
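The construction of $s_f$, $w(x)$ and the forward pass of $Y$ can be sketched in Python as follows. This is an illustration under stated assumptions, not the trained model: $F$ is taken here as an n × q matrix (one column per topic) so that $Y$ yields one score per topic, the distances other than Glacier-to-Water are invented, the values of $F$ and $\alpha$ are untrained placeholders, and the BP training step is omitted.

```python
def max_relevance(distances):
    """s_f: the smallest semantic distance from feature f to any topic."""
    return min(distances.values())


def weight_vector(sample_tokens, vocabulary, s):
    """w(x): the entry is s_f when feature f occurs in the sample, else 0."""
    present = set(sample_tokens)
    return [s[f] if f in present else 0.0 for f in vocabulary]


def predict_scores(w, F, alpha):
    """Y = w(x) * F + alpha, with F taken as an n x q matrix (assumption)."""
    q = len(F[0])
    return [sum(w[i] * F[i][j] for i in range(len(w))) + alpha for j in range(q)]


# Feature-to-topic distances; only Glacier -> Water = 3 comes from the text,
# every other number is invented for illustration.
dist = {
    "glacier": {"Water": 3, "Climate": 4},
    "drought": {"Water": 2, "Climate": 1},
    "crop": {"Water": 5, "Climate": 3},
}
vocab = ["glacier", "drought", "crop"]
s = {f: max_relevance(d) for f, d in dist.items()}
w = weight_vector(["glacier", "crop", "map"], vocab, s)
F = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.2]]  # untrained placeholder values
scores = predict_scores(w, F, alpha=0.1)  # one score per topic column
```

How the scores are turned into a label set (for example by thresholding) is not specified in the text and is left out of the sketch.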
According to the scheme, step 6) designs a co-training mechanism and matches multi-label topics to the metadata text as the primary coarse-grained topic classification result; the specific steps are as follows:
S61, generating two subsets $L_1$ and $L_2$ from the labeled samples in the geographic information service metadata text set, to serve as the training sets of the co-training base models $H_1$ and $H_2$ respectively;
S62, training the base models $H_1$ and $H_2$ on their training sets, and predicting the category vectors of the unlabeled samples with the trained base models;
S63, selecting, from the unlabeled samples, those for which classifiers $H_1$ and $H_2$ give the same prediction, assigning them pseudo-labels, adding the pseudo-labeled samples to both training subsets $L_1$ and $L_2$ to update the training sets, and repeating steps S62-S63 until the classification results of the two classifiers no longer change significantly, thereby obtaining the category sets of all unlabeled samples and the final updated training sets;
S64, training classifier $H_1$ on all labeled samples and matching a set of topic categories to each test sample.
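The loop in S61-S64 can be sketched in Python as below. The sketch makes several assumptions that are not in the text: both subsets start as copies of the full labeled set, the loop stops when no new sample receives identical predictions (a stand-in for "no significant change"), and `KeywordModel` is a hypothetical toy classifier used only so the example runs; it is neither ML-KNN nor ML-CSW.

```python
def co_train(h1, h2, labeled, unlabeled, max_rounds=10):
    """Co-training loop following S61-S64; h1/h2 expose fit(samples), predict(x)."""
    l1, l2 = list(labeled), list(labeled)   # S61 (the source draws two subsets)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        h1.fit(l1)                          # S62: train both base models
        h2.fit(l2)
        agreed, rest = [], []
        for x in pool:
            y1, y2 = h1.predict(x), h2.predict(x)
            if y1 == y2:
                agreed.append((x, y1))      # S63: identical predictions
            else:
                rest.append(x)
        if not agreed:                      # assumed stopping rule
            break
        l1.extend(agreed)                   # pseudo-labeled samples join both sets
        l2.extend(agreed)
        pool = rest
    h1.fit(l1)                              # S64: final training of H1
    return h1, l1, pool


class KeywordModel:
    """Hypothetical stand-in base model: a label is predicted if any of its
    training tokens (as seen through `view`) occurs in the sample."""

    def __init__(self, view):
        self.view = view
        self.keywords = {}

    def fit(self, samples):
        self.keywords = {}
        for tokens, labels in samples:
            for label in labels:
                self.keywords.setdefault(label, set()).update(self.view(tokens))

    def predict(self, tokens):
        seen = set(self.view(tokens))
        return frozenset(l for l, kws in self.keywords.items() if kws & seen)


labeled = [(("river", "flood"), frozenset({"Water"})),
           (("crop", "soil"), frozenset({"Agriculture"}))]
unlabeled = [("river", "basin"), ("soil", "map")]
h1 = KeywordModel(view=lambda toks: toks)
h2 = KeywordModel(view=lambda toks: [t for t in toks if len(t) > 4])
h1, train_set, remaining = co_train(h1, h2, labeled, unlabeled)
```

Here the two models agree on the first unlabeled sample, which is pseudo-labeled and added to both training sets, and disagree on the second, which stays unlabeled.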
According to the scheme, selecting the classical multi-label classification algorithm ML-KNN as a base model for co-training in step 4) specifically comprises the following steps:
S41, selecting the ML-KNN algorithm as the co-training base model $H_1$; specifying the number k of neighbor samples; letting N(x) denote the set of the k nearest neighbor samples of a sample x in the training set; counting the number c[j] of samples in N(x) that belong to topic category l and the number c'[j] of samples in N(x) that do not belong to topic category l. In the formulas of this step, when a sample x belongs to topic category l, its membership indicator is 1 and its non-membership indicator is 0; otherwise its membership indicator is 0 and its non-membership indicator is 1;
S42, calculating the prior probability that an unlabeled sample t belongs to topic category l and the corresponding posterior probability, where b takes the values 0 and 1, b = 1 corresponds to the event that sample t belongs to topic category l and b = 0 to the event that it does not, s is a smoothing parameter, m is the number of training samples, and the neighbor event is the event that j of the k neighboring samples of sample t belong to category l;
S43, predicting the category set of the unlabeled sample t according to the maximum a posteriori probability and Bayes' rule.
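For reference, the standard ML-KNN formulation from the literature is reproduced below, on the assumption that it is what steps S41-S43 refer to. The symbols $H_b^{l}$, $E_j^{l}$, $y_{x_i}(l)$ and $C_t(l)$ are that formulation's notation and are not taken from this text: $y_{x_i}(l)$ is 1 if training sample $x_i$ belongs to category $l$ and 0 otherwise, and $C_t(l)$ is the number of the k neighbors of $t$ that belong to $l$. In that formulation $c[j]$ counts the training samples that belong to $l$ and have exactly $j$ neighbors in $l$, and $c'[j]$ counts those that do not belong to $l$ and have exactly $j$ such neighbors.

```latex
P(H_1^{l}) = \frac{s + \sum_{i=1}^{m} y_{x_i}(l)}{2s + m},
\qquad
P(H_0^{l}) = 1 - P(H_1^{l})

P(E_j^{l} \mid H_1^{l}) = \frac{s + c[j]}{s\,(k+1) + \sum_{p=0}^{k} c[p]},
\qquad
P(E_j^{l} \mid H_0^{l}) = \frac{s + c'[j]}{s\,(k+1) + \sum_{p=0}^{k} c'[p]}

y_t(l) = \arg\max_{b \in \{0,1\}} \; P(H_b^{l}) \, P\!\left(E_{C_t(l)}^{l} \mid H_b^{l}\right)
```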
According to the scheme, the text theme extracted in the step 7) is extracted based on a Latent Dirichlet Allocation (LDA) algorithm.
The invention has the following beneficial effects: it provides a new multi-level, multi-label classification workflow for the metadata text of geographic information web resources such as the OGC Web Map Service (WMS). The workflow introduces the geoscience ontology library SWEET and the general English lexical network WordNet into the classification process, and co-trains the traditional classification algorithm ML-KNN with the classification algorithm ML-CSW, which closely fits the domain characteristics and text semantics, to obtain the matching relation between geographic information service metadata text and a multi-level topic catalog. By taking the domain characteristics and text semantics of the metadata into account, the method relies on only a small number of labeled samples; and compared with traditional multi-label classification algorithms such as classifier chains and voting classifiers, its classification results show better overall performance.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of an embodiment of the invention;
FIG. 3 shows examples of typical words according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an example of calculating a shortest distance between a text feature and a topic in the ML-CSW algorithm according to an embodiment of the present invention;
FIG. 5 is a classification result of an exemplary text of an embodiment of the present invention;
FIG. 6 is a comparison of classification results of different classification algorithms according to an embodiment of the present invention;
FIG. 7 is a comparison of classification results based on different feature selection algorithms according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The data set comprises 46000 items of Web Map Service (WMS) text data, 400 of which are labeled with SBAs topics, with all topics uniformly distributed. The text content comes from the URL, Abstract, Keywords and Title fields of the Service tag in the WMS GetCapabilities capability document. Because the text content is heterogeneous, the passages vary in length, a single data item corresponds to several topic categories, and the number of samples with labeled topics is small, traditional multi-label classification algorithms struggle to classify the data accurately and comprehensively, and cannot produce multi-level topic matching results.
Building on the theory of co-training in semi-supervised learning, the invention introduces a geoscience ontology library and a general English lexical network to design a base classification model that fits the characteristics of the geoscience domain. In the classification process, this model is co-trained with a widely used classical multi-label classification model, and multi-level fine-grained topics are extracted so that multi-level, multi-label topics are matched to the WMS metadata text.
The algorithm process of the present invention will be described in detail below with reference to the accompanying drawings, in which:
as shown in fig. 1 and 2, a method for multi-level and multi-label classification of meta-data text of geographic information service includes the following steps:
1) performing text preprocessing on all WMS metadata in three steps, namely word segmentation, stop-word removal and lemmatization, and splitting each text into a combination of text feature words;
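The three preprocessing steps can be sketched in Python as follows. The stop-word list and the lemma lookup table are tiny hypothetical stand-ins chosen for the example; the text does not say which tokenizer, stop-word list or lemmatizer is used.

```python
import re

# Both tables are hypothetical, kept tiny for illustration.
STOP_WORDS = {"the", "of", "and", "a", "for", "in", "is"}
LEMMAS = {"maps": "map", "glaciers": "glacier", "services": "service"}


def preprocess(text):
    """Split a metadata text into a list of text feature words."""
    tokens = re.findall(r"[a-z]+", text.lower())          # word segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization by lookup


features = preprocess("Maps of the glaciers and water services")
```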
2) the primary classification is obtained by extending the Societal Benefit Areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience field. The SBAs comprise 9 major topics of interest: Agriculture, Biodiversity, Climate, Disaster, Ecosystem, Energy, Health, Water and Weather. The topic classification catalog of this embodiment extends the SBAs by adding Geology as the 10th topic, so every reference in this embodiment to the topic classification catalog or the primary topic classification catalog refers to these 10 topics.
Using SBAs as a topic classification directory, extracting hypernyms, hyponyms and synonyms of topics in the SWEET and WordNet definitions as typical words related to topic semantics, and generating a typical word list, wherein a diagram in FIG. 3(a) is a typical word example corresponding to a topic "Agriculture" extracted from the SWEET, a diagram in FIG. 3(b) is a typical word example corresponding to a topic "Agriculture" extracted from the WordNet, and different colors represent different semantic sets;
3) the CBOW model based on the Word2vec algorithm represents the typical words and the text characteristic words as two-dimensional space Word vectors, and calculates cosine distances between the typical words and the text characteristic Word vectors;
4) setting a distance threshold, screening text feature words based on the distance threshold, and filtering features with the distance from the typical words larger than the threshold, thereby obtaining a feature subset with larger contribution to topic classification as model input of a classification algorithm;
5) designing a multi-label classification algorithm, ML-CSW, that fits the characteristics of the WMS domain and takes text semantics into account, as the co-training base model $H_1$; the topic prediction model is trained using, as feature weights, the semantic relevance between text features and topics calculated from the corpus:
5.1) taking the network definition of SWEET as a main part and WordNet as an auxiliary part to calculate the semantic shortest distance between text features and a theme;
if the text feature word is included in SWEET, the shortest distance between the feature word and the topic is obtained from the SWEET network definition; as shown in FIG. 4(a), the distance between the feature "Glacier" and the topic "Water" is 3;
if the text feature is not included in SWEET, its hypernyms are searched upward layer by layer in WordNet as substitute words for the text feature until a substitute word included in SWEET is found, and the shortest distance $D_1$ from the feature to the substitute word under the WordNet definition is calculated; as shown in FIG. 4(b), the substitute word of the feature "new (snow)" is "Ice", and the shortest distance is 1. The shortest distance $D_2$ between the substitute word and the topic is then calculated with the Dijkstra algorithm according to the SWEET network definition; in FIG. 4(b), the shortest distance from the substitute word "Ice" to the topic "Water" is 2. The final distance between the text feature and the topic is the sum of the distance from the text feature to the substitute word and the distance from the substitute word to the topic, namely $D = D_1 + D_2$; in FIG. 4(b), the shortest distance from the feature "new" to the topic "Water" is 3.
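The distance rule $D = D_1 + D_2$ can be sketched in Python as below. The graph is a toy built only to reproduce the distances quoted from FIG. 4: the intermediate node "FrozenWater" is invented, edges are assumed undirected with unit weight, and the hypernym table is a one-entry stand-in for WordNet. The real SWEET and WordNet structures are not reproduced here.

```python
import heapq


def dijkstra(graph, source):
    """Shortest hop counts from source over an undirected, unit-weight graph."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nxt in graph.get(node, ()):
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                heapq.heappush(heap, (d + 1, nxt))
    return dist


def feature_topic_distance(feature, topic, sweet, hypernym):
    """D = D1 + D2: climb hypernyms until the word is in SWEET, then run Dijkstra."""
    d1, word = 0, feature
    while word not in sweet:
        if word not in hypernym:
            return None          # no substitute word can be found
        word = hypernym[word]
        d1 += 1
    d2 = dijkstra(sweet, word).get(topic)
    return None if d2 is None else d1 + d2


# Toy stand-in for the SWEET network; "FrozenWater" is an invented node.
sweet = {
    "Water": ["FrozenWater"],
    "FrozenWater": ["Water", "Ice"],
    "Ice": ["FrozenWater", "Glacier"],
    "Glacier": ["Ice"],
}
hypernym = {"new": "Ice"}  # feature spelled as in the text; stand-in for WordNet

d_glacier = feature_topic_distance("Glacier", "Water", sweet, hypernym)
d_new = feature_topic_distance("new", "Water", sweet, hypernym)
```

With this toy graph both calls return 3, matching the two worked examples in FIG. 4(a) and FIG. 4(b).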
5.2) defining feature weight based on the shortest distance between text features and topics, establishing a topic prediction model, and predicting multi-label topics for unmarked samples;
a) according to step 5.1), the semantic distance between the text feature $f$ and each topic $p_i$ can be calculated, and the shortest of these distances is taken as the maximum semantic relevance $s_f$ of the text feature $f$ to all topics $P$, where $P$ is the set of all topics;
b) if all the texts contain $n$ text features in total, the vector of maximum semantic relevance from all features to all topics in the training set, $S = [s_1, s_2, \ldots, s_n]$, can be calculated. The weight $w(x)$ of a single data item $x$ is defined as a $1 \times n$ vector whose entries correspond to the weights of the $n$ text features: the entry for feature $f$ is $s_f$ if $f$ appears in sample $x$, and 0 otherwise.
c) establishing the topic prediction model $Y$ given below, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter. Based on the labeled sample data, the topic prediction model is trained by iterative optimization with a BP neural network, and the optimal $F$ and $\alpha$ at minimum loss are calculated to obtain the final model, with which the category set of an unlabeled sample $t$ is predicted:
$Y = w(x) \cdot F + \alpha$
6) selecting the widely used classical multi-label classification algorithm ML-KNN as the co-training base model $H_2$:
The number k of neighbor samples is specified, and N(x) denotes the set of the k nearest neighbor samples of a sample x in the training set $L_1$; the number c[j] of samples in N(x) that belong to topic category l and the number c'[j] of samples in N(x) that do not belong to topic category l are counted. In the formulas of this step, when a sample x belongs to topic category l, its membership indicator is 1 and its non-membership indicator is 0; when x does not belong to l, its membership indicator is 0 and its non-membership indicator is 1;
calculating the prior probability that an unlabeled sample t belongs to topic category l and the corresponding posterior probability, where s is a smoothing parameter, m is the number of training samples, and the events involved are that sample t belongs to topic category l, that sample t does not belong to topic category l, and that j of the k neighboring samples of sample t belong to category l;
predicting the category set of the unlabeled sample t according to the maximum a posteriori probability and Bayes' rule.
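A compact Python sketch of ML-KNN is given below for orientation. It follows the standard published formulation, in which c[j] counts training samples that carry label l and have exactly j neighbors carrying l; this is assumed to be what the step above describes. The squared Euclidean distance, the function names and the toy coordinates are illustrative assumptions.

```python
def neighbors(X, query, k, skip=None):
    """Indices of the k training vectors closest to query (squared Euclidean)."""
    order = sorted(
        (i for i in range(len(X)) if i != skip),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)),
    )
    return order[:k]


def ml_knn_fit(X, Y, labels, k, s=1.0):
    """Return, per label, the prior P(H1) and the count arrays c and c'."""
    m = len(X)
    model = {}
    for l in labels:
        prior1 = (s + sum(l in y for y in Y)) / (2 * s + m)
        c = [0] * (k + 1)       # samples with l that have j neighbours with l
        c_not = [0] * (k + 1)   # samples without l that have j neighbours with l
        for i in range(m):
            j = sum(l in Y[n] for n in neighbors(X, X[i], k, skip=i))
            if l in Y[i]:
                c[j] += 1
            else:
                c_not[j] += 1
        model[l] = (prior1, c, c_not)
    return model


def ml_knn_predict(model, X, Y, query, k, s=1.0):
    """Label set chosen by the maximum a posteriori rule."""
    result = set()
    for l, (prior1, c, c_not) in model.items():
        j = sum(l in Y[n] for n in neighbors(X, query, k))
        post1 = (s + c[j]) / (s * (k + 1) + sum(c))
        post0 = (s + c_not[j]) / (s * (k + 1) + sum(c_not))
        if prior1 * post1 > (1 - prior1) * post0:
            result.add(l)
    return result


# Toy data: three clusters of invented 2-D points.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [10, 0], [10, 1], [11, 0]]
Y = [{"Water"}] * 3 + [{"Energy"}] * 3 + [{"Water", "Energy"}] * 3
model = ml_knn_fit(X, Y, ["Water", "Energy"], k=2)
near_first = ml_knn_predict(model, X, Y, [0.2, 0.2], k=2)
near_third = ml_knn_predict(model, X, Y, [10.2, 0.2], k=2)
```

A query near the first cluster receives the single label Water, while a query near the third cluster receives both Water and Energy, which is the multi-label behavior the step relies on.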
7) drawing 80% of all labeled samples by repeated random sampling to form two subsets $L_1$ and $L_2$, which serve as the training sets of classifiers $H_1$ and $H_2$ respectively, and predicting the category sets of all unlabeled samples with the two classifiers;
8) selecting the samples for which classifiers $H_1$ and $H_2$ give the same prediction, assigning them pseudo-labels, adding the pseudo-labeled samples to both training subsets $L_1$ and $L_2$ to update the training sets, and repeating 7) until the classification results of the two classifiers no longer change significantly, thereby obtaining the category sets of the unlabeled samples.
9) taking 10% of all labeled samples as test samples, and matching a set of topic categories to each test sample with the trained classifier; for instance, the SBAs category labels of the example text in FIG. 5 are Biodiversity, Climate, Disaster, Ecosystem, Water and Weather.
10) specifying the number N of topic levels; for each level, selecting the metadata texts of a single topic category and extracting fine-grained text topics with the Latent Dirichlet Allocation (LDA) algorithm until an N-level topic catalog is generated, and matching the WMS metadata text with the N levels of topics. In FIG. 5, the secondary topics corresponding to Biodiversity are wildlife, specie and diversity; those corresponding to Climate are forest and metology; that corresponding to Disaster is polarization; those corresponding to Ecosystem are bittaat, resource and containment; that corresponding to Water is rain; and that corresponding to Weather is metology.
The method considers the field characteristics and text semantics of the geographic information service metadata, and only depends on a small number of marked data samples; as shown in fig. 6, compared with the conventional multi-label classification algorithm such as a classifier chain and a voting classifier, the classification result of the method of the present invention is better in overall performance.
As shown in fig. 7, the text feature selection process of the present invention can filter out features that do not contribute to the classification result compared to the chi-square test and WordNet-based feature selection method. The method can be popularized and applied to geographic information portals and data directory services, and assists in the retrieval and discovery of various geographic information resources.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
Claims (8)
1. A geographic information service metadata text multi-level multi-label classification method is characterized by comprising the following steps:
1) acquiring a geographic information service metadata text set containing unmarked samples and marked samples to perform text preprocessing, and dividing each data sample into text feature word combinations;
2) setting a primary classification catalogue based on the field application topic category of the geographic information resource, obtaining a classification category, namely a topic, and then generating a typical word list semantically associated with the classification category;
3) screening text characteristic words according to the typical word list, filtering out the characteristics of which the distance from the typical words is greater than a threshold value, and obtaining a characteristic subset screened according to the theme classification;
4) selecting a classic multi-label classification algorithm ML-KNN as a base model of the cooperative training, and recording as H1;
5) Calculating the semantic distance from the features to the theme according to the corpus, establishing a theme prediction model ML-CSW, and taking the model as another base model of collaborative training, which is marked as H2;
6) Designing a cooperative mechanism based on the two basic models, and matching a multi-label theme for the metadata text to serve as a primary coarse-grained theme classification result;
7) selecting a metadata text corresponding to a certain classification label, extracting a text theme as a fine-grained theme of a next level, and simultaneously obtaining a matching relation between the metadata text and a double-layer theme directory;
8) repeating step 7) to obtain fine-grained topic category catalogs at different levels and the matching relation between the metadata text and the topic catalogs.
2. The method as claimed in claim 1, wherein the primary classification catalog of step 2), which is based on the domain application topic categories of geographic information resources, is obtained by extending the Societal Benefit Areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience field.
3. The method as claimed in claim 1, wherein the typical vocabulary in step 2) is generated as follows:
taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic as defined in SWEET and WordNet are extracted as typical words semantically related to the topic, producing the typical word list.
4. The method as claimed in claim 1, wherein the step 3) is a step of screening the text feature words according to a typical word list, and the method comprises the following steps:
S31, representing the typical words and the text feature words as two-dimensional word vectors based on the Word2vec algorithm;
S32, calculating the cosine distance between the typical word vectors and the text feature word vectors;
S33, setting a distance threshold T, and filtering out the text feature words whose cosine distance to the typical words is greater than T.
5. The method for multi-level and multi-label classification of metadata text of geographic information services according to claim 1, wherein the topic model in step 5) is established as follows:
S51, calculating the semantic distance between a text feature $f$ and each topic $p_i$ according to the network definitions of the SWEET ontology library and the WordNet English lexical network: if the feature $f$ is included in SWEET, the semantic distance between $f$ and each topic $p_i$ is obtained directly from the SWEET network using the Dijkstra algorithm; if the feature $f$ is not included in SWEET, its hypernyms are searched upward layer by layer until one included in SWEET is found and used as a substitute word for $f$, and the sum of the distance between $f$ and the substitute word in WordNet and the distance between the substitute word and each topic $p_i$ in SWEET is taken as the semantic distance between $f$ and each topic $p_i$;
S52, calculating the minimum of the semantic distances between the feature $f$ and each topic $p_i$, and taking it as the maximum semantic relevance $s_f$ between the text feature $f$ and the topics in $P$, where $P$ is the set of all topics;
S53, defining feature weights based on the shortest distance between the text features and the topics, establishing a topic prediction model, and predicting multi-label topics for unlabeled samples;
S54, assuming that the training set contains $n$ text features, calculating the vector $S = [s_1, s_2, \ldots, s_n]$ of the maximum semantic relevance of every feature in the training set to the topics; defining the weight $w(x)$ of a single data item $x$ as a $1 \times n$ vector whose entries correspond to the $n$ text features, the entry for feature $f$ being $s_f$ if $f$ appears in the sample $x$ and 0 otherwise;
S55, establishing the topic prediction model $Y = w(x) \cdot F + \alpha$, where $F$ is an adjustment vector of the features and $\alpha$ is a smoothing parameter; based on the labeled sample data, iteratively optimizing the model $Y$ with a BP neural network, calculating the optimal $F$ and $\alpha$ under minimum loss to obtain the final model, and predicting the category set of an unlabeled sample $t$ with that model.
6. The method for multi-level and multi-label classification of metadata text of geographic information services according to claim 1, wherein step 6) designs a collaborative mechanism to match multi-label topics to the metadata text as the primary coarse-grained topic classification result, with the following specific steps:
S61, generating two subsets $L_1$ and $L_2$ from the labeled samples in the geographic information service metadata text set, as the training sets of the co-training base models $H_1$ and $H_2$ respectively;
S62, training the base models $H_1$ and $H_2$ on their training sets, and using the trained base models to predict the category vectors of the unlabeled samples;
S63, selecting from the unlabeled samples those for which the classifiers $H_1$ and $H_2$ give the same prediction, assigning them pseudo-labels, adding the pseudo-labeled samples to both training subsets $L_1$ and $L_2$ to update the training sets, and repeating steps S62 to S63 until the classification results of the two classifiers no longer change significantly, thereby obtaining the category sets of all unlabeled samples;
S64, training the classifier $H_1$ on all labeled samples and matching a set of topic categories to the test sample.
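The co-training loop of steps S61 to S64 can be sketched as below. Several simplifications are assumptions made for illustration: a 1-nearest-neighbour rule (`nearest_label_set`) stands in for the base models, the labeled set is split by alternating samples, and the loop stops when a round yields no newly agreed samples rather than when the predictions "no longer change significantly". All names and data are hypothetical.

```python
def nearest_label_set(train, x):
    """1-NN stand-in for a base model: label set of the closest training sample."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], x))[1]

def co_train(labeled, unlabeled, max_rounds=10):
    """Pseudo-label the samples on which both base models agree (S61 to S63)."""
    l1, l2 = list(labeled[0::2]), list(labeled[1::2])  # S61: two training subsets
    pending = list(unlabeled)
    pseudo = []
    for _ in range(max_rounds):
        agreed, still_pending = [], []
        for x in pending:
            y1 = nearest_label_set(l1, x)  # S62: prediction of H1
            y2 = nearest_label_set(l2, x)  # S62: prediction of H2
            (agreed if y1 == y2 else still_pending).append((x, y1))
        if not agreed:
            break  # no new agreement in this round
        l1.extend(agreed)  # S63: add pseudo-labeled samples to both subsets
        l2.extend(agreed)
        pseudo.extend(agreed)
        pending = [x for x, _ in still_pending]
    return pseudo, pending

water = frozenset({"Water"})
storm = frozenset({"Disasters", "Weather"})
labeled = [((0, 0), water), ((0, 1), water), ((5, 5), storm), ((5, 6), storm)]
unlabeled = [(0.2, 0.3), (4.8, 5.5), (2.5, 3.0)]
pseudo, pending = co_train(labeled, unlabeled)

# S64: a final model trained on all labeled and pseudo-labeled samples.
prediction = nearest_label_set(labeled + pseudo, (4.0, 4.0))
print(pseudo, pending, prediction)
```

The sample at (2.5, 3.0) lies between the two clusters, so the two base models keep disagreeing on it and it never receives a pseudo-label in this sketch.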
7. The method for multi-level and multi-label classification of metadata text of geographic information services according to claim 1, wherein in step 4) the classic multi-label classification algorithm ML-KNN is selected as the base model for co-training, specifically as follows:
S41, specifying the number k of neighboring samples, denoting by N(x) the set of the k neighboring samples of a sample x in the training set, counting the number c[j] of samples in N(x) that belong to topic category l, and counting the number c'[j] of samples in N(x) that do not belong to topic category l; when the sample x belongs to topic category l, its membership indicator is 1 and the complementary indicator is 0, and otherwise the membership indicator is 0 and the complementary indicator is 1;
S42, calculating the prior probability and the posterior probability that an unlabeled sample t belongs to topic category l, where b takes the values 0 and 1, the two events considered being that the sample t belongs to topic category l and that it does not; s is a smoothing parameter, m is the number of training samples, and the posterior probability is conditioned on the event that j of the k neighboring samples of the sample t belong to category l;
S43, predicting the category set of the unlabeled sample t according to the maximum a posteriori probability and Bayes' rule.
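The following sketch illustrates steps S41 to S43. Because the formulas of this claim are not reproduced in the text, the sketch assumes the commonly published ML-KNN formulation: a smoothed prior (s + count) / (2s + m) and smoothed likelihoods (s + c[j]) / (s(k + 1) + sum of c). The function names and toy data are hypothetical.

```python
def knn_indices(points, x, k, skip=None):
    """Indices of the k training points closest to x (optionally skipping one)."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], x)),
    )
    return order[:k]

def mlknn_fit(points, label_sets, labels, k, s=1.0):
    """Estimate the smoothed prior and neighbour-count tables for each label (S41, S42)."""
    m = len(points)
    model = {}
    for l in labels:
        has = [l in ys for ys in label_sets]
        prior1 = (s + sum(has)) / (2 * s + m)  # prior that a sample has label l
        c = [0] * (k + 1)      # c[j]: samples with l having exactly j neighbours with l
        c_not = [0] * (k + 1)  # c'[j]: the same count for samples without l
        for i in range(m):
            j = sum(has[n] for n in knn_indices(points, points[i], k, skip=i))
            (c if has[i] else c_not)[j] += 1
        model[l] = (prior1, c, c_not)
    return model

def mlknn_predict(model, points, label_sets, x, k, s=1.0):
    """Return the labels for which the MAP estimate favours membership (S43)."""
    neighbours = knn_indices(points, x, k)
    predicted = set()
    for l, (prior1, c, c_not) in model.items():
        j = sum(l in label_sets[n] for n in neighbours)
        post1 = (s + c[j]) / (s * (k + 1) + sum(c))          # likelihood given label l
        post0 = (s + c_not[j]) / (s * (k + 1) + sum(c_not))  # likelihood given no label l
        if prior1 * post1 > (1 - prior1) * post0:
            predicted.add(l)
    return predicted

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
label_sets = [{"Water"}] * 3 + [{"Disasters", "Weather"}] * 3
labels = ["Water", "Disasters", "Weather"]
model = mlknn_fit(points, label_sets, labels, k=2)
near_b = mlknn_predict(model, points, label_sets, (5.1, 5.2), k=2)
near_a = mlknn_predict(model, points, label_sets, (0.1, 0.2), k=2)
print(near_b, near_a)
```

Each label is decided independently, which is what allows a single sample to receive several topic categories at once.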
8. The method as claimed in claim 1, wherein step 7) extracts the text topics based on the latent Dirichlet allocation (LDA) algorithm.
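As an illustration of LDA topic extraction, the sketch below uses collapsed Gibbs sampling. The claim does not state which inference procedure is used, so the sampler, the hyperparameters `alpha` and `beta`, the function name `lda_gibbs` and the toy documents are all assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    doc_topic = [[0] * n_topics for _ in docs]                # token counts per (doc, topic)
    topic_word = [[0] * len(vocab) for _ in range(n_topics)]  # token counts per (topic, word)
    topic_total = [0] * n_topics
    assignment = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignment.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, z = word_id[w], assignment[d][n]
                doc_topic[d][z] -= 1  # remove the token from the counts
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + beta * len(vocab))
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignment[d][n] = z  # resample, then add the token back
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

docs = [
    ["river", "flood", "water"],
    ["flood", "river", "rain"],
    ["road", "map", "route"],
    ["map", "road", "traffic"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2, n_iter=50)
for t, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print(t, [w for _, w in top])
```

The highest-count words of each topic can then serve as its description; which words end up in which topic depends on the random seed.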
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910942287.2A CN110704624B (en) | 2019-09-30 | 2019-09-30 | Geographic information service metadata text multi-level multi-label classification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110704624A (en) | 2020-01-17 |
CN110704624B (en) | 2021-08-10 |
Family
ID=69197772
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910942287.2A Active CN110704624B (en) | 2019-09-30 | 2019-09-30 | Geographic information service metadata text multi-level multi-label classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110704624B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101283353A (en) * | 2005-08-03 | 2008-10-08 | 温克科技公司 | Systems for and methods of finding relevant documents by analyzing tags |
US7958068B2 (en) * | 2007-12-12 | 2011-06-07 | International Business Machines Corporation | Method and apparatus for model-shared subspace boosting for multi-label classification |
US7975039B2 (en) * | 2003-12-01 | 2011-07-05 | International Business Machines Corporation | Method and apparatus to support application and network awareness of collaborative applications using multi-attribute clustering |
CN102129470A (en) * | 2011-03-28 | 2011-07-20 | 中国科学技术大学 | Tag clustering method and system |
US8340405B2 (en) * | 2009-01-13 | 2012-12-25 | Fuji Xerox Co., Ltd. | Systems and methods for scalable media categorization |
CN104408153A (en) * | 2014-12-03 | 2015-03-11 | 中国科学院自动化研究所 | Short text hash learning method based on multi-granularity topic models |
CN104850650A (en) * | 2015-05-29 | 2015-08-19 | 清华大学 | Short-text expanding method based on similar-label relation |
CN104951554A (en) * | 2015-06-29 | 2015-09-30 | 浙江大学 | Method for matching landscape with verses according with artistic conception of landscape |
CN104991974A (en) * | 2015-07-31 | 2015-10-21 | 中国地质大学(武汉) | Particle swarm algorithm-based multi-label classification method |
CN105354593A (en) * | 2015-10-22 | 2016-02-24 | 南京大学 | NMF (Non-negative Matrix Factorization)-based three-dimensional model classification method |
CN105868415A (en) * | 2016-05-06 | 2016-08-17 | 黑龙江工程学院 | Microblog real-time filtering model based on historical microblogs |
CN105868905A (en) * | 2016-03-28 | 2016-08-17 | 国网天津市电力公司 | Managing and control system based on sensitive content perception |
US20180089540A1 (en) * | 2016-09-23 | 2018-03-29 | International Business Machines Corporation | Image classification utilizing semantic relationships in a classification hierarchy |
Non-Patent Citations (2)
Title |
---|
DJAVAN DE CLERCQ ET AL.: ""Multi-label classification and interactive NLP-based visualization of electric"", 《HTTPS://DOI.ORG/10.1016/J.WPI.2019.101903》 * |
LIU PEIQI: ""Label propagation algorithm based on the LDA topic model"", 《Journal of Computer Applications》 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460097A (en) * | 2020-03-26 | 2020-07-28 | 华泰证券股份有限公司 | Small sample text classification method based on TPN |
CN111460097B (en) * | 2020-03-26 | 2024-06-07 | 华泰证券股份有限公司 | TPN-based small sample text classification method |
CN111611801A (en) * | 2020-06-02 | 2020-09-01 | 腾讯科技(深圳)有限公司 | Method, device, server and storage medium for identifying text region attribute |
CN112464010B (en) * | 2020-12-17 | 2021-08-27 | 中国矿业大学(北京) | Automatic image labeling method based on Bayesian network and classifier chain |
CN112464010A (en) * | 2020-12-17 | 2021-03-09 | 中国矿业大学(北京) | Automatic image labeling method based on Bayesian network and classifier chain |
CN112256938A (en) * | 2020-12-23 | 2021-01-22 | 畅捷通信息技术股份有限公司 | Message metadata processing method, device and medium |
CN112465075A (en) * | 2020-12-31 | 2021-03-09 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN112465075B (en) * | 2020-12-31 | 2021-05-25 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN113792081A (en) * | 2021-08-31 | 2021-12-14 | 吉林银行股份有限公司 | Method and system for automatically checking data assets |
CN114358208A (en) * | 2022-01-13 | 2022-04-15 | 辽宁工程技术大学 | Science and collaboration activity text title recognition method based on deep learning |
CN115408525B (en) * | 2022-09-29 | 2023-07-04 | 中电科新型智慧城市研究院有限公司 | Letters and interviews text classification method, device, equipment and medium based on multi-level label |
CN116343104A (en) * | 2023-02-03 | 2023-06-27 | 中国矿业大学 | Map scene recognition method and system for visual feature and vector semantic space coupling |
CN116343104B (en) * | 2023-02-03 | 2023-09-15 | 中国矿业大学 | Map scene recognition method and system for visual feature and vector semantic space coupling |
CN116541752A (en) * | 2023-07-06 | 2023-08-04 | 杭州美创科技股份有限公司 | Metadata management method, device, computer equipment and storage medium |
CN116541752B (en) * | 2023-07-06 | 2023-09-15 | 杭州美创科技股份有限公司 | Metadata management method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110704624B (en) | 2021-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110704624B (en) | Geographic information service metadata text multi-level multi-label classification method | |
Gao et al. | Visual-textual joint relevance learning for tag-based social image search | |
CN110298042A (en) | Based on Bilstm-crf and knowledge mapping video display entity recognition method | |
Li et al. | News text classification model based on topic model | |
CN100583101C (en) | Text categorization feature selection and weight computation method based on field knowledge | |
CN107577785A (en) | A kind of level multi-tag sorting technique suitable for law identification | |
CN112256939B (en) | Text entity relation extraction method for chemical field | |
CN104035996B (en) | Field concept abstracting method based on Deep Learning | |
CN107220237A (en) | A kind of method of business entity's Relation extraction based on convolutional neural networks | |
US9870376B2 (en) | Method and system for concept summarization | |
CN111832289A (en) | Service discovery method based on clustering and Gaussian LDA | |
CN103678422A (en) | Web page classification method and device and training method and device of web page classifier | |
CN112836029A (en) | Graph-based document retrieval method, system and related components thereof | |
CN116186381A (en) | Intelligent retrieval recommendation method and system | |
Kordumova et al. | Best practices for learning video concept detectors from social media examples | |
Shen et al. | Accurate online video tagging via probabilistic hybrid modeling | |
Zhang et al. | Multi-label learning with Relief-based label-specific feature selection | |
Meng et al. | Semi-supervised hierarchical clustering for personalized web image organization | |
Kumar et al. | Web 2.0 social bookmark selection for tag clustering | |
Tian et al. | Automatic image annotation with real-world community contributed data set | |
CN108804524B (en) | Emotion distinguishing and importance dividing method based on hierarchical classification system | |
Park et al. | Estimating comic content from the book cover information using fine-tuned VGG model for comic search | |
Wei et al. | Coached active learning for interactive video search | |
Wang et al. | Research on pseudo-label technology for multi-label news classification | |
Chen et al. | Multi-modal multi-layered topic classification model for social event analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||