CN107862089B - Label extraction method based on perception data - Google Patents


Info

Publication number
CN107862089B
CN107862089B (application CN201711253610.2A)
Authority
CN
China
Prior art keywords
numerical
similarity
value
label
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711253610.2A
Other languages
Chinese (zh)
Other versions
CN107862089A (en)
Inventor
丁治明
刘凡
才智
曹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201711253610.2A
Publication of CN107862089A
Application granted
Publication of CN107862089B
Active legal status
Anticipated expiration legal status

Classifications

    • G06F16/35 Clustering; Classification
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    (hierarchy: G Physics > G06 Computing; calculating or counting > G06F Electric digital data processing > G06F16/00 Information retrieval; database structures therefor; file system structures therefor > G06F16/30 of unstructured textual data > G06F16/33 Querying)

Abstract

The invention discloses a label extraction method based on perception data. The object data comprises numerical data and text data; label extraction first processes the two kinds of data separately, extracts a label for each, and then associates the labels through probability statistics. For numerical feature labels, extraction selects a centroid through clustering as the final label of each class, so clustering quality directly determines the label extraction effect. Each label represents the most prominent feature of its class, i.e., the point with the smallest semantic difference from all instances in the class. Most clustering algorithms follow this principle, although in practice a numerical feature label need not itself express the feature semantics of the cluster. The sequence feature label extraction process mainly comprises two parts: clustering and centroid selection. Incoming sensing data is matched by similarity against the labels in the label library to obtain the corresponding numerical label; the corresponding text label is then obtained through the association with the text label library and returned.

Description

Label extraction method based on perception data
Technical Field
The invention belongs to the field of label extraction, and particularly relates to a label extraction method based on perception data.
Background
With the rapid development of the internet, the internet of things, cloud computing and related technologies, and the spread of smart terminals, the networked society and the digital earth, the global data volume has grown explosively. The cost of finding the information a user wants in such vast data sets keeps rising; people face the embarrassment of abundant data but scarce effective information, and how to quickly find the knowledge a user is really interested in within huge data volumes has become a problem of growing concern. Moreover, the data generated by sensors and mobile devices comes in many formats, containing both numerical data and text data. Clustering instance records by comparing their similarity and extracting class labels to express the class information is the main approach to this kind of problem; however, neither instance similarity calculation nor clustering currently offers an algorithm targeted at label extraction.
Current information carriers are mainly numerical or textual. The semantic information of a single text or a single numerical value is one-sided; relatively complete information about a record is obtained only after combining it with the information extracted from the other type. Inferring the actual semantics of recorded information by processing and analysing it is thus a core problem of the label extraction field. In natural language processing, linguistic research has tried to introduce rich linguistic features to improve the performance of information analysis, but the results are not ideal, and complex linguistic features seriously reduce system efficiency. The basic statistical approach takes word frequency as the only semantic basis, which discards much of a document collection's semantics and performs poorly. Topic models then emerged, adding an intermediate topic layer between documents and words: a document is composed of topics, and a topic is composed of several words. Because topics sit between words and documents, topic models can alleviate polysemy and synonymy to some extent. For short text data, however, the existing natural language processing algorithms are hard to apply accurately, and the extracted tags end up highly random, mutually unrelated, and difficult to extend and manage. On the numerical side, it is likewise difficult to capture the semantic features of static numerical data in the time dimension, and hard to define the similarity of numerical data reasonably. Expressing the text semantics of large volumes of perception data or other natural language data with a few labels can greatly reduce the time that users or other systems spend on query and management, and improve their efficiency.
Disclosure of Invention
The invention provides a label extraction method based on similarity calculation and clustering: perception numerical data, together with the text data associated with it, is clustered according to the calculated similarity between numerical records to obtain a numerical label and a text label extracted from the text, converting the perception data into clear and simple text-type labels algorithmically.
A label extraction method based on perception data is realized by the following steps:
Step one: the object data comprises numerical data and text data; label extraction first processes the two kinds of data separately and, after extracting their respective labels, associates them through probability statistics. For the numerical label part, an object similarity calculation method combining scalar similarity and vector similarity is designed, by analogy with appearance similarity and character similarity, and the similarity between objects is calculated;
Step 1.1: the similarity of numerical entities refers to the semantic similarity between instances; the higher the similarity, the more likely the instances belong to the same class. A numerical entity is composed of several attributes, and an attribute value may be either a single value or a numerical sequence composed of multiple values; accordingly, the similarity calculation for data entities is divided into single-value similarity calculation, numerical-sequence similarity calculation, and structure matching.
When judging whether two single values are similar, what mainly matters is the difference between them relative to their magnitude. When the difference is small, say less than 10% of the smaller value, the similarity should change sharply with the difference; when the difference is large, say more than 10 times the smaller value, the change in similarity should flatten out. For example, although (10, 20) and (1010, 1020) both differ by 10, the former pair is far less similar than the latter. On this basis a single-value similarity formula S is proposed:
[Formula (1): single-value similarity S(x, y); equation image not reproduced]
where x and y are any two numbers greater than zero, and max() returns the larger of its arguments. The formula satisfies the following points:
1) its value ranges between 0 and 1;
2) the similarity between two single values is inversely related to their difference and is referenced to the magnitude of the values;
3) the function is symmetric, i.e., S(x, y) = S(y, x);
4) the rate at which the similarity changes decreases as the difference grows;
These points accord with everyday intuition. After the similarities between attribute values are obtained, they are combined into a new numerical sequence, and the similarity between the new sequences is computed with the numerical-sequence similarity method below to obtain the final similarity result between entities.
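As an illustration of these four properties: the patent's own formula (1) appears only as an equation image and is not reproduced in this text, so the Python sketch below assumes min(x, y) / max(x, y), one simple function that satisfies all four points; it is a stand-in, not necessarily the patented formula.

```python
# A sketch only: min(x, y) / max(x, y) is assumed here as one function
# satisfying the four stated properties; the patent's formula (1) is an
# equation image that is not reproduced in this text.

def single_value_similarity(x: float, y: float) -> float:
    """Similarity of two positive scalars: 1 when equal, decreasing as the
    relative difference grows; symmetric by construction."""
    if x <= 0 or y <= 0:
        raise ValueError("defined for numbers larger than zero")
    return min(x, y) / max(x, y)

# Same absolute difference of 10, very different similarity, matching the
# (10, 20) vs (1010, 1020) example in the text:
print(single_value_similarity(10, 20))      # 0.5
print(single_value_similarity(1010, 1020))  # ~0.990
```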
Step 1.2, calculating the numerical sequence similarity;
The similarity of numerical entities rests mainly on similarity calculation between numerical sequences. A numerical sequence is characterised by two aspects: first its numerical characteristic S1, second its waveform characteristic S2. The numerical characteristic consists of the mean, maximum, minimum and variance of the sequence; the fluctuation characteristic of the sequence is calculated by function fitting or cosine similarity; the final similarity value is then obtained by weighing the two characteristic values. A non-time sequence consists of individual numerical attribute values and has no waveform characteristic in the time dimension; its similarity can be obtained by weighted summation of the similarities between attribute values according to the numerical-characteristic formula.
The specific calculation process is as follows: for two sequences of length n, X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>, take the numerical characteristic based on the sequence means as S1 and the cosine similarity between the sequences as the fluctuation characteristic S2; the final similarity is S = θ1*S1 + θ2*S2, where θ1, θ2 are weight parameters summing to 1. First the minimum value of each sequence (positive or negative) is subtracted from it, i.e., X - min(X) and Y - min(Y). This minimises the cross-influence between the numerical and waveform characteristics, sharpens the difference between the two sequences' numerical and waveform characteristics, and also prevents negative values and related problems. Then the similarity of the numerical characteristics is calculated according to the single-value similarity formula:
S1 = S(mean(X), mean(Y)) (2)
S2 = Σ(xi * yi) / (sqrt(Σ xi²) * sqrt(Σ yi²)) (3)
where mean() is the averaging function and max() is the function returning the larger value. Simple derivation shows that S1 and S2 both range over (0, 1), which guarantees that the final similarity value also falls in (0, 1). Finally, the optimal parameter values θ1, θ2 are obtained by a supervised learning algorithm, such as gradient-descent training, and the final similarity follows from the formula.
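A minimal sketch of this sequence-similarity computation under the stated reading: the minimum is subtracted from each sequence, S1 is the single-value similarity of the two means (assumed form as in the earlier sketch; the max, min and variance components mentioned above are omitted), and S2 is plain cosine similarity. The function names are illustrative, not from the patent.

```python
import math

def single_value_similarity(x: float, y: float) -> float:
    return min(x, y) / max(x, y)  # assumed form; see the earlier sketch

def cosine(xs, ys):
    dot = sum(a * b for a, b in zip(xs, ys))
    return dot / (math.sqrt(sum(a * a for a in xs)) *
                  math.sqrt(sum(b * b for b in ys)))

def sequence_similarity(xs, ys, theta1=0.5, theta2=0.5):
    # Subtract each sequence's own minimum, as the text prescribes.
    # (Assumes non-constant sequences, so no division by zero below.)
    mx, my = min(xs), min(ys)
    xs = [a - mx for a in xs]
    ys = [b - my for b in ys]
    s1 = single_value_similarity(sum(xs) / len(xs), sum(ys) / len(ys))
    s2 = cosine(xs, ys)                     # waveform characteristic
    return theta1 * s1 + theta2 * s2        # S = θ1*S1 + θ2*S2
```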
Step two: extracting numerical characteristic labels;
Feature label extraction selects a centroid through clustering as the final label of each class, so cluster quality directly determines the label extraction effect. Each label represents the most prominent feature of a class, i.e., the point with the smallest semantic difference from all instances in the class. Most clustering algorithms follow this principle, but in practice the numerical feature label itself need not express the cluster's feature semantics. The clustering algorithm therefore adds a feature based on the distance to the adjacent cluster's centroid, in order to select the optimal class division point. The sequence feature label extraction process mainly comprises two parts: clustering and centroid selection.
Step 2.1 clustering;
The clustering process is based on the similarity calculation for numerical sequences and treats the similarity between records as a distance, i.e., dist(x, y) = S(x, y). Preliminary classification is completed with a density-based algorithm: set a radius parameter R and a minimum instance count MinPts, group points whose similarity exceeds R and whose neighbour count exceeds MinPts into one class, and select the point with the minimum summed in-cluster distance as the preliminary centroid, namely:
c = argmin_xi Σ_{xn ≠ xi} S(xi, xn) (4)
where xi is any numerical entity, xn is any entity other than xi, and S is the similarity calculation function.
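The following Python sketch illustrates this density-based grouping and preliminary centroid choice. It is an interpretation, not the patent's code: neighbourhoods use similarity greater than R, and the preliminary centroid is read as the member most similar in total to the rest of its cluster; cluster, sim and preliminary_centroid are illustrative names.

```python
def cluster(points, sim, R, min_pts):
    """Density-based grouping: neighbours are points whose similarity
    exceeds R; a point with at least min_pts neighbours seeds a cluster."""
    labels, next_id = {}, 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = [j for j in range(len(points))
                 if j != i and sim(points[i], points[j]) > R]
        if len(seeds) < min_pts:
            continue                      # not a core point; noise for now
        labels[i] = next_id
        while seeds:                      # expand the cluster outward
            j = seeds.pop()
            if j in labels:
                continue
            labels[j] = next_id
            more = [k for k in range(len(points))
                    if k != j and sim(points[j], points[k]) > R]
            if len(more) >= min_pts:      # j is itself a core point
                seeds.extend(more)
        next_id += 1
    return labels                         # point index -> cluster id

def preliminary_centroid(members, sim):
    """Member with the largest summed similarity to the rest of its cluster,
    read here as the point at minimum total distance from the cluster."""
    return max(members,
               key=lambda p: sum(sim(p, q) for q in members if q is not p))
```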
Step 2.2, adjusting the mass center;
When the system is used for result prediction, classification uses the similarity to the labels as the distance; that is, the space of a class is a quasi-circular region centred on the class label, with radius equal to half the distance between that label and the adjacent class's label. Therefore, after clustering, in order to find the point that best separates the regions as the centroid, a feature based on the distance to the adjacent cluster's centroid is added: according to the formula F = θ1*C1 + θ2*C2, where C1 is the distance to the members of this class, C2 the distance to the adjacent class's centroid, and θ1, θ2 weight parameters, the point with the maximum F value is selected as the class label. Iterate these steps until convergence, and take the centroid at that point as the final class label.
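A minimal sketch of this adjustment criterion, assuming C1 is the summed distance from a candidate point to its own cluster's members and C2 the distance to the nearest adjacent-cluster centroid; dist, members and other_centroids are illustrative names. This paragraph selects the maximum F, while the embodiment later minimises it; flipping max to min covers the other reading.

```python
def adjust_centroid(members, other_centroids, dist, theta1=0.5, theta2=0.5):
    """Pick the member optimising F = theta1*C1 + theta2*C2, where C1 sums
    the distances to this cluster's members and C2 is the distance to the
    nearest adjacent-cluster centroid."""
    def f(p):
        c1 = sum(dist(p, q) for q in members if q is not p)
        c2 = min(dist(p, c) for c in other_centroids)
        return theta1 * c1 + theta2 * c2
    return max(members, key=f)   # this paragraph selects the maximum F
```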
Step three: processing a text;
Because the text targets here are short text objects, it is difficult to locate an accurate semantic label with the previous label extraction approaches, so text label extraction here is mainly based on word frequency statistics and matching against a subject-word lexicon. Text data is processed, topics are extracted from it, and the subject words serve as the semantic labels of the corresponding numerical entities. The overall structure is a three-layer Bayesian network: an attribute-word layer, a subject-word layer and a category layer. Text label extraction corresponds to the attribute word-subject word layers, and the numerical-text semantic association corresponds to the subject word-category layers.
Step 3.1: text word segmentation;
The text is first split into blocks using stop words, then the blocks are matched against a lexicon; successfully matched blocks are subject words, and the remaining blocks serve as attribute words. Finally, word frequency statistics are taken over the attribute words; attribute words with frequency above a threshold α or below a threshold β are removed and added to the stop words.
Step 3.2: calculating the contribution degree of the theme;
The degree of relation between an attribute word and a subject word is calculated from how often they appear separately or together; a TF-IDF-like value serves as the contribution degree of the attribute word to the subject word, calculated as follows:
tf_ij = n_ij / n_j (5)
idf_i = log(D / w_i) (6)
tfidf_ij = tf_ij * idf_i (7)
where n_ij is the number of times attribute word i and subject word j appear together, n_j is the total number of occurrences of subject word j, D is the total number of documents, and w_i is the total number of occurrences of attribute word i.
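In code, formulas (5)-(7) reduce to a few lines; the sketch below is checked against the worked example given later in the embodiment (tf = 20/50, idf = log10(40000/400) = 2, contribution 0.8), which fixes the logarithm base at 10.

```python
import math

def contribution(n_ij: int, n_j: int, D: int, w_i: int) -> float:
    """TF-IDF-like contribution of attribute word i to subject word j."""
    tf = n_ij / n_j              # co-occurrences over subject-word count, (5)
    idf = math.log10(D / w_i)    # base 10 reproduces the example's idf = 2, (6)
    return tf * idf              # (7)

# The embodiment's "basketball" -> "sports" example: 0.4 * 2 = 0.8
print(contribution(n_ij=20, n_j=50, D=40000, w_i=400))
```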
Step 3.3: extracting a text label;
Text label extraction here differs from general natural language processing technology: it mainly targets short text data with a determinate semantic direction, such as microblog posts, medical cases or other comments.
When the subject term is extracted, the term is directly selected as a label and the contribution degree of the corresponding attribute term is updated, and when the successfully matched subject term does not exist, one or more subject terms with the highest contribution degree are selected as text labels according to the contribution degree of each attribute term to the subject term.
Step 3.4: semantic association of numerical labels with text labels;
Semantic association here mainly concerns associating a numerical sequence with the text data generated by the same instance. Since the numerical instances have been clustered and the text data has gone through subject-word extraction, the resulting data structure maps one category to multiple subject words, and one subject word may correspond to multiple categories.
If the feature labels of the numerical instances were directly associated one-to-one with subject words, a large amount of other association information would be lost. The subject words and categories are therefore kept in a many-to-many structural relationship, retaining semantic information to the maximum extent in probabilistic form.
The Bayesian network has the formula P(C, T, W) = P(T | W) * P(C | T), where C is a numerical-sequence category, T is a subject word, and W is an attribute word. Statistical analysis yields each corresponding probability value; given a numerical-instance feature label, the predicted subject word (such as an illness state, a loan result, and so on) follows the formula:
T* = argmax_T P(C | T) * P(T) (8)
When a class C is given, the one or more subject words T with the largest value of P(C | T) * P(T) are returned.
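A sketch of this prediction step: given a numerical category label C, subject words are ranked by P(C|T) * P(T), which by Bayes' rule orders P(T|C). The probability tables are assumed to be plain dicts obtained by counting; names are illustrative.

```python
def predict_topics(c, p_c_given_t, p_t, top_k=1):
    """Rank subject words t by P(c|t) * P(t), i.e. P(t|c) up to the
    constant factor P(c); return the top_k best-scoring subject words."""
    scores = {t: p_c_given_t.get((c, t), 0.0) * p_t[t] for t in p_t}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```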
Drawings
FIG. 1: a system flow diagram.
FIG. 2: a system training flow chart.
FIG. 3: example of the clustering algorithm embodiment: (A) original data set; (B) preliminary classification completed; (C) initial centroids selected; (D) final centroids obtained by iteration.
FIG. 4: numerical-semantic association structure chart.
Detailed Description
The invention is explained and illustrated below with reference to the accompanying drawings:
the data set adopted by the invention is semi-structured sensing data, namely, the data set comprises numerical data and text data.
The method comprises the following steps: according to the enlightening of similarity between people, the similarity is divided into two characteristics of vector similarity and scalar similarity, and a similarity calculation formula between numerical examples is designed to be used as the distance between each node;
Scalar similarity of the two numerical sequences is calculated according to formula (1), and vector similarity between them according to formula (3). The final similarity then follows from S = θ1*S1 + θ2*S2, where θ1, θ2 are weight parameters summing to 1.
First the sequence means are computed: mean(x1) = 3.03571428571 and mean(x2) = 2.96428571429. Then the minimum value of each sequence is subtracted from it, giving the new sequences:
x1 = [4.0, 2.0, 4.0, 3.0, 3.5, 5.0, 3.0, 4.0, 3.0, 3.0, 4.0, 1.0, 0.0, 3.0]
x2 = [6.0, 6.5, 5.5, 4.0, 2.0, 4.0, 1.5, 1.5, 0.5, 3.0, 1.0, 0.0, 1.5, 4.5]
Then, according to the formulas, S1 = 0.963211521552 and S2 = 0.823280350237; with both weights set to 0.5, S<x1, x2> = 0.5 × 0.963211521552 + 0.5 × 0.823280350237 = 0.893245935895. In the same way S<x2, x3> = 0.371814751237 is obtained, and the results agree with human intuition.
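The quoted S2 can be checked directly: plain cosine similarity over the two shifted sequences reproduces it (S1 depends on the unreproduced single-value formula and is not recomputed here).

```python
import math

x1 = [4.0, 2.0, 4.0, 3.0, 3.5, 5.0, 3.0, 4.0, 3.0, 3.0, 4.0, 1.0, 0.0, 3.0]
x2 = [6.0, 6.5, 5.5, 4.0, 2.0, 4.0, 1.5, 1.5, 0.5, 3.0, 1.0, 0.0, 1.5, 4.5]

dot = sum(a * b for a, b in zip(x1, x2))
s2 = dot / (math.sqrt(sum(a * a for a in x1)) *
            math.sqrt(sum(b * b for b in x2)))
print(s2)  # ≈ 0.823280350237, the S2 value quoted above
```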
Step two: extracting a numerical label;
Label extraction mainly uses a density-based algorithm to complete clustering, then weighs the intra-cluster distances against the distance to adjacent cluster centroids to obtain the final centroid as the label.
Step 2.1 clustering;
Assume the existing instance data is as shown in FIG. 3(A); first set the radius parameter R and the minimum instance count MinPts. For instances x1 and x2, if S(x1, x2) > R and the number of instances around x1 (or x2) with similarity greater than R exceeds MinPts, then x2 (or x1) is density-reachable from x1 (or x2). By this procedure all mutually reachable instances are grouped into one class, completing the preliminary classification shown in FIG. 3(B), and the instance with the minimum sum of in-cluster distances is selected as the preliminary centroid, as shown in FIG. 3(C).
Step 2.2, adjusting the mass center;
After the initial centroids are selected, the adjustment phase begins. First the weighting between the intra-cluster distance and the adjacent-centroid distance is set according to the actual training effect. Assume there are 5 clusters (C1, C2, C3, C4, C5) with initial centroids c1, c2, c3, c4, c5. Adjustment starts with cluster C1: for each point in C1, the value F = θ1*D1 + θ2*D2 is calculated, where D1 is the sum of the distances to the members of this class, D2 is the distance between the point and the nearest adjacent cluster centroid, and θ1, θ2 are the related weight parameters. It is worth noting that when calculating D2, since the nearest centroid may be the point's own, the distances from the point to the centroids of all clusters need to be calculated and the minimum over the other clusters selected, that is:
D2(xi) = min_{k ≠ 1} dist(xi, ck)
Cluster C1's centroid is adjusted to the point with the smallest F value, namely:
c1 = argmin_{xi ∈ C1} F(xi) = argmin_{xi ∈ C1} [θ1 * Σ_{xj ≠ xi} dist(xi, xj) + θ2 * D2(xi)]
where xi is any instance in C1 and xj is any entity in C1 other than xi. The centroids c2, c3, c4, c5 of the remaining clusters are adjusted in turn in the same way; iteration then continues until the centroids are stationary, and finally the centroid of each cluster is returned, with the final result shown in FIG. 3(D).
Step three: extracting and associating text labels;
Text label extraction is completed mainly on the basis of word frequency statistics and matching against a subject-word lexicon; two lexicons are used, a stop-word lexicon and a subject-word lexicon.
Step 3.1, text word segmentation;
A text is input and split into blocks by the stop words, the initial stop-word lexicon being an existing one called directly; the blocks are then matched against the subject-word lexicon, successfully matched blocks being subject words and the remainder attribute words; finally, word frequency statistics are taken over the attribute words, attribute words with frequency above threshold α or below threshold β are removed, and they are added to the stop words.
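A minimal sketch of this segmentation step; the stop-word and subject-word lexicons below are stand-in examples, not the patent's lexicons, and the function names are illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"的", "了", "和", "是"}           # stand-in stop-word lexicon
SUBJECT_LEXICON = {"体育", "心脏病", "肺结核"}    # stand-in subject lexicon

def segment(text):
    """Split the text into blocks at stop words, then sort the blocks into
    subject words (lexicon hits) and attribute words (everything else)."""
    pattern = "|".join(map(re.escape, STOP_WORDS))
    blocks = [b for b in re.split(pattern, text) if b]
    subjects = [b for b in blocks if b in SUBJECT_LEXICON]
    attributes = [b for b in blocks if b not in SUBJECT_LEXICON]
    return subjects, attributes

def prune_attributes(attribute_lists, alpha, beta):
    """Drop attribute words whose corpus frequency is above alpha or below
    beta, and move them into the stop-word lexicon."""
    freq = Counter(w for attrs in attribute_lists for w in attrs)
    dropped = {w for w, f in freq.items() if f > alpha or f < beta}
    STOP_WORDS.update(dropped)
    return [[w for w in attrs if w not in dropped] for attrs in attribute_lists]
```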
Step 3.2: calculating the contribution degree of the theme;
The degree of relation between attribute words and subject words is calculated from how often each word appears alone or together with another; formulas (5), (6) and (7) are invoked to compute each attribute word's contribution to the subject words. Once the contribution degrees are computed, a subject word-attribute word matrix TW is formed, where TW(i, j) represents the contribution of attribute word j to subject word i. For example, suppose that across 40000 documents the subject word "sports" appears 50 times and the word "basketball" appears 400 times, with "sports" and "basketball" co-occurring 20 times. Then tf = 20/50 = 0.4 and idf = log(40000/400) = 2, so the final contribution of "basketball" to the topic "sports" is 0.4 × 2 = 0.8.
Step 3.3: extracting a text label;
Text label extraction falls into two cases. In one, the segmented words contain a word from the subject-word lexicon, and that subject word is returned directly. In the other, the segmentation contains no subject word; then the contribution degrees of the document's attribute words to the subject words are aggregated, and the one or more topics with the highest contribution are selected as the document's label.
Step 3.4: semantic association of numerical labels with text labels;
A numerical label corresponds to one or more subject words, and a subject word corresponds to one or more attribute words. The subject word-attribute word matrix TW stores the relationship between subject words and attribute words; similarly, a numerical label (category label)-subject word probability matrix is obtained by word frequency statistics, finally producing the overall structure shown in FIG. 4. When a numerical sequence x1 is input, its subject-word label is obtained according to formula (8), i.e., the numerical label is converted into a text label. Suppose that in an existing medical database the probability of heart disease is P(t1) = f_heart/f_total = 0.02 and the probability of tuberculosis is P(t2) = f_tuberculosis/f_total = 0.06; for heart disease t1, the probability that its electrocardiogram matches sequence x1 is CT(t1, x1) = 60% and CT(t1, x2) = 40%, while for tuberculosis t2, CT(t2, x1) = 30% and CT(t2, x2) = 70%. If matching the user's electrocardiogram against the numerical label library yields x1, then p(t1|x1) = CT(t1, x1) * P(t1) = 0.012 and p(t2|x1) = CT(t2, x1) * P(t2) = 0.018, so the tuberculosis label is returned.
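Re-running this medical example in code, using exactly the probabilities quoted above; the dictionary keys are illustrative English names for t1, t2, x1 and x2.

```python
P = {"heart disease": 0.02, "tuberculosis": 0.06}          # P(t)
CT = {("heart disease", "x1"): 0.6, ("heart disease", "x2"): 0.4,
      ("tuberculosis", "x1"): 0.3, ("tuberculosis", "x2"): 0.7}

def best_label(x):
    # p(t|x) is proportional to CT(t, x) * P(t); return the argmax.
    return max(P, key=lambda t: CT[(t, x)] * P[t])

print(best_label("x1"))  # tuberculosis: 0.3*0.06 = 0.018 > 0.6*0.02 = 0.012
```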

Claims (1)

1. A label extraction method based on perception data is characterized in that: the method comprises the following implementation steps:
step one: the object data comprises numerical data and text data; label extraction first processes the two kinds of data separately and, after extracting their respective labels, associates them through probability statistics; for the numerical label part, an object similarity calculation method combining scalar similarity and vector similarity is designed, by analogy with appearance similarity and character similarity, and the similarity between objects is calculated;
step 1.1: the similarity of numerical entities refers to the semantic similarity between instances; the higher the similarity, the more likely the instances belong to the same class; a numerical entity is composed of several attributes, and an attribute value may be either a single value or a numerical sequence composed of multiple values, so the similarity calculation for data entities is divided into single-value similarity calculation, numerical-sequence similarity calculation and structure matching;
when comparing whether two single values are similar, the difference between them and their magnitude are considered; when the difference is small, less than 10% of the smaller value, the change in similarity is pronounced, and when the difference is large, more than 10 times the smaller value, the change in similarity flattens; the single-value similarity S is calculated by the formula:
[Formula (1): single-value similarity S(x, y); equation image not reproduced]
where x and y are any two numbers greater than zero, and max() returns the larger of its arguments; the formula satisfies the following points:
1) its value ranges between 0 and 1;
2) the similarity between two single values is inversely related to their difference and is referenced to the magnitude of the values;
3) the function is symmetric, i.e., S(x, y) = S(y, x);
4) the rate at which the similarity changes decreases as the difference grows;
these points accord with everyday intuition; after the similarities between attribute values are obtained, they are combined into a new numerical sequence, and the similarity between the new sequences is calculated with the numerical-sequence similarity method to obtain the final similarity result between entities;
step 1.2, calculating the numerical sequence similarity;
the similarity of numerical entities rests mainly on similarity calculation between numerical sequences; a numerical sequence is characterised by two aspects: first its numerical characteristic S1, second its waveform characteristic S2; the numerical characteristic consists of the mean, maximum, minimum and variance of the sequence, the fluctuation characteristic of the sequence is calculated by function fitting or cosine similarity, and the final similarity value is obtained by weighing the two characteristic values; a non-time sequence consists of individual numerical attribute values and has no waveform characteristic in the time dimension, and its similarity need only be obtained by weighted summation of the similarities between attribute values according to the numerical-characteristic formula;
the specific calculation process is as follows: for two sequences of length n, X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>, take the numerical characteristic based on the sequence means as S1 and the cosine similarity between the sequences as the fluctuation characteristic S2; the final similarity is S = θ1*S1 + θ2*S2, where θ1, θ2 are weight parameters summing to 1; the minimum value of each sequence is then subtracted from it, i.e., X - min(X), Y - min(Y); this minimises the cross-influence between the numerical and waveform characteristics, sharpens the difference between the two sequences' numerical and waveform characteristics, and prevents negative values; the similarity of the numerical characteristics is then calculated according to the single-value similarity formula:
S1 = S(mean(X), mean(Y)) (2)
S2 = Σ(xi * yi) / (sqrt(Σ xi²) * sqrt(Σ yi²)) (3)
where mean() is the averaging function and max() is the function returning the larger value; simple derivation shows that S1 and S2 both range over (0, 1), which guarantees that the final similarity value also falls in (0, 1); finally, the optimal parameter values θ1, θ2 are obtained by a supervised learning algorithm, such as gradient-descent training, and the final similarity follows from the formula;
step two: extracting numerical characteristic labels;
feature label extraction selects a centroid through clustering as the final label of each class, so cluster quality directly determines the label extraction effect; each label represents the most prominent feature of a class, i.e., the point with the smallest semantic difference from all instances in the class; most clustering algorithms follow this principle, but in practice the numerical feature label itself need not express the cluster's feature semantics; the clustering algorithm therefore adds a feature based on the distance to the adjacent cluster's centroid, in order to select the optimal class division point; the sequence feature label extraction process comprises two parts, clustering and centroid selection;
step 2.1 clustering;
the clustering process is based on the similarity calculation for numerical sequences and treats the similarity between records as a distance, i.e., dist(x, y) = S(x, y); preliminary classification is completed with a density-based algorithm: a radius parameter R and a minimum instance count MinPts are set, points whose similarity exceeds R and whose instance count exceeds MinPts are grouped into one class, and the point with the minimum summed in-cluster distance is selected as the preliminary centroid, namely:
c = argmin_xi Σ_{xn ≠ xi} S(xi, xn) (4)
where xi is any numerical entity, xn is any entity other than xi, and S is the similarity calculation function;
step 2.2, adjusting the mass center;
when the system is used for result prediction, classification uses the similarity to the labels as the distance, that is, the space of a class is a quasi-circular region centred on the class label with radius equal to half the distance between that label and the adjacent class's label; therefore, after clustering, in order to find the point that best separates the regions as the centroid, a feature based on the distance to the adjacent cluster's centroid is added: according to the formula F = θ1*C1 + θ2*C2, where C1 is the distance to the members of this class, C2 the distance to the adjacent class's centroid, and θ1, θ2 weight parameters, the point with the maximum F value is selected as the class label; these steps are iterated until convergence, and the centroid at that point is taken as the final class label;
step three: processing a text;
because the text targets are short text objects, it is difficult to locate an accurate semantic label with the prior label extraction approaches, so text label extraction is mainly based on word frequency statistics and matching against a subject-word lexicon; text data is processed, topics are extracted from it, and the subject words serve as the semantic labels of the corresponding numerical entities; the overall structure is a three-layer Bayesian network, specifically an attribute-word layer, a subject-word layer and a category layer, in which text label extraction corresponds to the attribute word-subject word layers and the numerical-text semantic association corresponds to the subject word-category layers;
step 3.1: text word segmentation;
the text is first segmented into blocks using stop words, the blocks are then matched against a lexicon, successfully matched blocks being subject words and the remaining blocks serving as attribute words; finally, word frequency statistics are taken over the attribute words, and attribute words with frequency above a threshold α or below a threshold β are removed and added to the stop words;
step 3.2: calculating the contribution degree of the theme;
the degree of relation between an attribute word and a subject word is calculated from how often they appear separately or together; a TF-IDF-like value serves as the contribution degree of the attribute word to the subject word, calculated as follows:
tf_ij = n_ij / n_j (5)
idf_i = log(D / w_i) (6)
tfidf_ij = tf_ij * idf_i (7)
where n_ij is the number of times attribute word i and subject word j appear together, n_j is the total number of occurrences of subject word j, D is the total number of documents, and w_i is the total number of occurrences of attribute word i;
step 3.3: extracting a text label;
text label extraction differs from general natural language processing technology and targets short text data with a determinate semantic direction;
when a subject word is matched during extraction, that word is directly selected as the label and the contribution degrees of the corresponding attribute words are updated; when no subject word matches, one or more subject words with the highest contribution degree are selected as text labels according to each attribute word's contribution to the subject words;
step 3.4: semantic association of numerical labels with text labels;
semantic association mainly concerns associating a numerical sequence with the text data generated by the same instance; since the numerical instances have been clustered and the text data has gone through subject-word extraction, the resulting data structure maps one category to multiple subject words, and one subject word may also correspond to multiple categories;
if the feature labels of the numerical instances were directly associated one-to-one with subject words, a large amount of other association information would be lost; the subject words and categories are therefore kept in a many-to-many structural relationship, retaining semantic information to the maximum extent in probabilistic form;
the Bayesian network formula is P(C, T, W) = P(T | W) * P(C | T), where C is a numerical-sequence category, T a subject word, and W an attribute word; statistical analysis yields each corresponding probability value, and given a numerical-instance feature label, the predicted subject word follows the formula:
T* = argmax_T P(C | T) * P(T) (8)
when a class C is given, the one or more subject words with the largest value of P(C | T) * P(T) are returned.
CN201711253610.2A 2017-12-02 2017-12-02 Label extraction method based on perception data Active CN107862089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711253610.2A CN107862089B (en) 2017-12-02 2017-12-02 Label extraction method based on perception data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711253610.2A CN107862089B (en) 2017-12-02 2017-12-02 Label extraction method based on perception data

Publications (2)

Publication Number Publication Date
CN107862089A CN107862089A (en) 2018-03-30
CN107862089B (en) 2020-03-13

Family

ID=61704739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711253610.2A Active CN107862089B (en) 2017-12-02 2017-12-02 Label extraction method based on perception data

Country Status (1)

Country Link
CN (1) CN107862089B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555101A (en) * 2019-09-09 2019-12-10 浙江诺诺网络科技有限公司 customer service knowledge base updating method, device, equipment and storage medium
US11392769B2 (en) * 2020-07-15 2022-07-19 Fmr Llc Systems and methods for expert driven document identification
CN113159802A (en) * 2021-04-15 2021-07-23 武汉白虹软件科技有限公司 Algorithm model and system for realizing fraud-related application collection and feature extraction clustering
CN113343980B (en) * 2021-06-10 2023-06-09 西安邮电大学 Natural scene text detection method and system
CN113553429B (en) * 2021-07-07 2023-09-29 北京计算机技术及应用研究所 Normalized label system construction and text automatic labeling method
CN115600945B (en) * 2022-09-07 2023-06-30 淮阴工学院 Cold chain loading user image construction method and device based on multiple granularities

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081642A (en) * 2010-10-28 2011-06-01 华南理工大学 Chinese label extraction method for clustering search results of search engine
CN104598543A (en) * 2014-11-28 2015-05-06 广东工业大学 Social matching data mining system
CN105138510A (en) * 2015-08-10 2015-12-09 昆明理工大学 Microblog-based neologism emotional tendency judgment method
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166565A1 (en) * 2011-12-23 2013-06-27 Kevin LEPSOE Interest based social network system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081642A (en) * 2010-10-28 2011-06-01 华南理工大学 Chinese label extraction method for clustering search results of search engine
CN104598543A (en) * 2014-11-28 2015-05-06 广东工业大学 Social matching data mining system
CN105138510A (en) * 2015-08-10 2015-12-09 昆明理工大学 Microblog-based neologism emotional tendency judgment method
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Probabilistic topic model-based service discovery for the Internet of Things (基于概率主题模型的物联网服务发现); 魏强 et al.; 《软件学报》 (Journal of Software); 2014-12-31; full text *

Also Published As

Publication number Publication date
CN107862089A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
CN107862089B (en) Label extraction method based on perception data
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108132968B (en) Weak supervision learning method for associated semantic elements in web texts and images
CN110059181B (en) Short text label method, system and device for large-scale classification system
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN103761286B (en) A kind of Service Source search method based on user interest
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Mahmood et al. Semi-supervised evolutionary ensembles for Web video categorization
CN116383430A (en) Knowledge graph construction method, device, equipment and storage medium
Hossny et al. Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study
Ding et al. Context-aware semantic type identification for relational attributes
Chen et al. A new method to estimate null values in relational database systems based on automatic clustering techniques
Guo Intelligent sports video classification based on deep neural network (DNN) algorithm and transfer learning
Cao et al. Detecting communities on topic of transportation with sparse crowd annotations
Kumar et al. Multi document summarization based on cross-document relation using voting technique
Wang et al. High-level semantic image annotation based on hot Internet topics
Yu et al. A Multi-Directional Search technique for image annotation propagation
CN113761192A (en) Text processing method, text processing device and text processing equipment
CN110750963B (en) News document duplication removing method, device and storage medium
Castano et al. A new approach to security system development
Vu et al. Density-based clustering with side information and active learning
Fan et al. A text clustering algorithm hybriding invasive weed optimization with K-means
Liu et al. Classification of Medical Text Data Using Convolutional Neural Network-Support Vector Machine Method
Kaur et al. Text document clustering and classification using k-means algorithm and neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant