CN107862089B - Label extraction method based on perception data - Google Patents


Info

Publication number
CN107862089B
CN107862089B (application CN201711253610.2A)
Authority
CN
China
Prior art keywords
numerical
similarity
value
label
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711253610.2A
Other languages
Chinese (zh)
Other versions
CN107862089A (en)
Inventor
丁治明
刘凡
才智
曹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201711253610.2A
Publication of CN107862089A
Application granted
Publication of CN107862089B
Active legal status
Anticipated expiration legal status

Classifications

    • G06F16/35 Clustering; Classification
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    (hierarchy: G Physics > G06 Computing; calculating or counting > G06F Electric digital data processing > G06F16/00 Information retrieval; database structures therefor; file system structures therefor > G06F16/30 of unstructured textual data > G06F16/33 Querying)

Abstract

The invention discloses a label extraction method based on perception data. The object data comprises numerical data and text data; label extraction first processes the two kinds of data separately, extracts a label for each, and then associates the labels through probability statistics. For numerical feature labels, extraction selects a centroid through clustering as the final label of each class, so clustering quality directly determines the label extraction effect. Each label represents the most prominent feature of its class, i.e., the point with the smallest semantic difference from all instances in the class. Most clustering algorithms follow this principle, although in practice a numerical feature label need not itself express the feature semantics of the cluster. The sequence feature label extraction process mainly comprises two parts: clustering and centroid selection. Incoming sensing data is matched by similarity against the labels in the label library to obtain the corresponding numerical label; the corresponding text label is then obtained through the association with the text label library and returned.

Description

Label extraction method based on perception data
Technical Field
The invention belongs to the field of label extraction, and particularly relates to a label extraction method based on perception data.
Background
With the rapid development of the internet, the internet of things, cloud computing and related technologies, and the spread of smart terminals, the networked society and the digital earth, the global data volume has grown explosively. The cost of finding the information a user wants in such vast data sets keeps rising; people face the embarrassment of abundant data but scarce effective information, and how to quickly find the knowledge a user is really interested in within huge data volumes has become a problem of growing concern. Moreover, the data generated by sensors and mobile devices comes in many formats, containing both numerical data and text data. Clustering instance records by comparing their similarity and extracting class labels to express the class information is the main approach to this kind of problem; however, neither instance similarity calculation nor clustering currently offers an algorithm targeted at label extraction.
Current information carriers are mainly numerical or textual. The semantic information of a single text or a single numerical value is one-sided; relatively complete information about a record is obtained only after combining it with the information extracted from the other type. Inferring the actual semantics of recorded information by processing and analysing it is thus a core problem of the label extraction field. In natural language processing, linguistic research has tried to introduce rich linguistic features to improve the performance of information analysis, but the results are not ideal, and complex linguistic features seriously reduce system efficiency. The basic statistical approach takes word frequency as the only semantic basis, which discards much of a document collection's semantics and performs poorly. Topic models then emerged, adding an intermediate topic layer between documents and words: a document is composed of topics, and a topic is composed of several words. Because topics sit between words and documents, topic models can alleviate polysemy and synonymy to some extent. For short text data, however, the existing natural language processing algorithms are hard to apply accurately, and the extracted tags end up highly random, mutually unrelated, and difficult to extend and manage. On the numerical side, it is likewise difficult to capture the semantic features of static numerical data in the time dimension, and hard to define the similarity of numerical data reasonably. Expressing the text semantics of large volumes of perception data or other natural language data with a few labels can greatly reduce the time that users or other systems spend on query and management, and improve their efficiency.
Disclosure of Invention
The invention provides a label extraction method based on similarity calculation and clustering: perception numerical data, together with the text data associated with it, is clustered according to the calculated similarity between numerical records to obtain a numerical label and a text label extracted from the text, converting the perception data into clear and simple text-type labels algorithmically.
A label extraction method based on perception data is realized by the following steps:
Step one: the object data comprises numerical data and text data; label extraction first processes the two kinds of data separately and, after extracting their respective labels, associates them through probability statistics. For the numerical label part, an object similarity calculation method combining scalar similarity and vector similarity is designed, by analogy with appearance similarity and character similarity, and the similarity between objects is calculated;
Step 1.1: the similarity of numerical entities refers to the semantic similarity between instances; the higher the similarity, the more likely the instances belong to the same class. A numerical entity is composed of several attributes, and an attribute value may be either a single value or a numerical sequence composed of multiple values; accordingly, the similarity calculation for data entities is divided into single-value similarity calculation, numerical-sequence similarity calculation, and structure matching.
When judging whether two single values are similar, what mainly matters is the difference between them relative to their magnitude. When the difference is small, say less than 10% of the smaller value, the similarity should change sharply with the difference; when the difference is large, say more than 10 times the smaller value, the change in similarity should flatten out. For example, although (10, 20) and (1010, 1020) both differ by 10, the former pair is far less similar than the latter. On this basis a single-value similarity formula S is proposed:
[Formula (1): single-value similarity S(x, y); equation image not reproduced]
where x and y are any two numbers greater than zero, and max() returns the larger of its arguments. The formula satisfies the following points:
1) its value ranges between 0 and 1;
2) the similarity between two single values is inversely related to their difference and is referenced to the magnitude of the values;
3) the function is symmetric, i.e., S(x, y) = S(y, x);
4) the rate at which the similarity changes decreases as the difference grows;
These points accord with everyday intuition. After the similarities between attribute values are obtained, they are combined into a new numerical sequence, and the similarity between the new sequences is computed with the numerical-sequence similarity method below to obtain the final similarity result between entities.
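As an illustration of these four properties: the patent's own formula (1) appears only as an equation image and is not reproduced in this text, so the Python sketch below assumes min(x, y) / max(x, y), one simple function that satisfies all four points; it is a stand-in, not necessarily the patented formula.

```python
# A sketch only: min(x, y) / max(x, y) is assumed here as one function
# satisfying the four stated properties; the patent's formula (1) is an
# equation image that is not reproduced in this text.

def single_value_similarity(x: float, y: float) -> float:
    """Similarity of two positive scalars: 1 when equal, decreasing as the
    relative difference grows; symmetric by construction."""
    if x <= 0 or y <= 0:
        raise ValueError("defined for numbers larger than zero")
    return min(x, y) / max(x, y)

# Same absolute difference of 10, very different similarity, matching the
# (10, 20) vs (1010, 1020) example in the text:
print(single_value_similarity(10, 20))      # 0.5
print(single_value_similarity(1010, 1020))  # ~0.990
```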
Step 1.2, calculating the numerical sequence similarity;
The similarity of numerical entities rests mainly on similarity calculation between numerical sequences. A numerical sequence is characterised by two aspects: first its numerical characteristic S1, second its waveform characteristic S2. The numerical characteristic consists of the mean, maximum, minimum and variance of the sequence; the fluctuation characteristic of the sequence is calculated by function fitting or cosine similarity; the final similarity value is then obtained by weighing the two characteristic values. A non-time sequence consists of individual numerical attribute values and has no waveform characteristic in the time dimension; its similarity can be obtained by weighted summation of the similarities between attribute values according to the numerical-characteristic formula.
The specific calculation process is as follows: for two sequences of length n, X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>, take the numerical characteristic based on the sequence means as S1 and the cosine similarity between the sequences as the fluctuation characteristic S2; the final similarity is S = θ1*S1 + θ2*S2, where θ1, θ2 are weight parameters summing to 1. First the minimum value of each sequence (positive or negative) is subtracted from it, i.e., X - min(X) and Y - min(Y). This minimises the cross-influence between the numerical and waveform characteristics, sharpens the difference between the two sequences' numerical and waveform characteristics, and also prevents negative values and related problems. Then the similarity of the numerical characteristics is calculated according to the single-value similarity formula:
S1 = S(mean(X), mean(Y)) (2)
S2 = Σ(xi * yi) / (sqrt(Σ xi²) * sqrt(Σ yi²)) (3)
where mean() is the averaging function and max() is the function returning the larger value. Simple derivation shows that S1 and S2 both range over (0, 1), which guarantees that the final similarity value also falls in (0, 1). Finally, the optimal parameter values θ1, θ2 are obtained by a supervised learning algorithm, such as gradient-descent training, and the final similarity follows from the formula.
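A minimal sketch of this sequence-similarity computation under the stated reading: the minimum is subtracted from each sequence, S1 is the single-value similarity of the two means (assumed form as in the earlier sketch; the max, min and variance components mentioned above are omitted), and S2 is plain cosine similarity. The function names are illustrative, not from the patent.

```python
import math

def single_value_similarity(x: float, y: float) -> float:
    return min(x, y) / max(x, y)  # assumed form; see the earlier sketch

def cosine(xs, ys):
    dot = sum(a * b for a, b in zip(xs, ys))
    return dot / (math.sqrt(sum(a * a for a in xs)) *
                  math.sqrt(sum(b * b for b in ys)))

def sequence_similarity(xs, ys, theta1=0.5, theta2=0.5):
    # Subtract each sequence's own minimum, as the text prescribes.
    # (Assumes non-constant sequences, so no division by zero below.)
    mx, my = min(xs), min(ys)
    xs = [a - mx for a in xs]
    ys = [b - my for b in ys]
    s1 = single_value_similarity(sum(xs) / len(xs), sum(ys) / len(ys))
    s2 = cosine(xs, ys)                     # waveform characteristic
    return theta1 * s1 + theta2 * s2        # S = θ1*S1 + θ2*S2
```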
Step two: extracting numerical characteristic labels;
Feature label extraction selects a centroid through clustering as the final label of each class, so cluster quality directly determines the label extraction effect. Each label represents the most prominent feature of a class, i.e., the point with the smallest semantic difference from all instances in the class. Most clustering algorithms follow this principle, but in practice the numerical feature label itself need not express the cluster's feature semantics. The clustering algorithm therefore adds a feature based on the distance to the adjacent cluster's centroid, in order to select the optimal class division point. The sequence feature label extraction process mainly comprises two parts: clustering and centroid selection.
Step 2.1 clustering;
The clustering process is based on the similarity calculation for numerical sequences and treats the similarity between records as a distance, i.e., dist(x, y) = S(x, y). Preliminary classification is completed with a density-based algorithm: set a radius parameter R and a minimum instance count MinPts, group points whose similarity exceeds R and whose neighbour count exceeds MinPts into one class, and select the point with the minimum summed in-cluster distance as the preliminary centroid, namely:
c = argmin_xi Σ_{xn ≠ xi} S(xi, xn) (4)
where xi is any numerical entity, xn is any entity other than xi, and S is the similarity calculation function.
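The following Python sketch illustrates this density-based grouping and preliminary centroid choice. It is an interpretation, not the patent's code: neighbourhoods use similarity greater than R, and the preliminary centroid is read as the member most similar in total to the rest of its cluster; cluster, sim and preliminary_centroid are illustrative names.

```python
def cluster(points, sim, R, min_pts):
    """Density-based grouping: neighbours are points whose similarity
    exceeds R; a point with at least min_pts neighbours seeds a cluster."""
    labels, next_id = {}, 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = [j for j in range(len(points))
                 if j != i and sim(points[i], points[j]) > R]
        if len(seeds) < min_pts:
            continue                      # not a core point; noise for now
        labels[i] = next_id
        while seeds:                      # expand the cluster outward
            j = seeds.pop()
            if j in labels:
                continue
            labels[j] = next_id
            more = [k for k in range(len(points))
                    if k != j and sim(points[j], points[k]) > R]
            if len(more) >= min_pts:      # j is itself a core point
                seeds.extend(more)
        next_id += 1
    return labels                         # point index -> cluster id

def preliminary_centroid(members, sim):
    """Member with the largest summed similarity to the rest of its cluster,
    read here as the point at minimum total distance from the cluster."""
    return max(members,
               key=lambda p: sum(sim(p, q) for q in members if q is not p))
```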
Step 2.2, adjusting the mass center;
When the system is used for result prediction, classification uses the similarity to the labels as the distance; that is, the space of a class is a quasi-circular region centred on the class label, with radius equal to half the distance between that label and the adjacent class's label. Therefore, after clustering, in order to find the point that best separates the regions as the centroid, a feature based on the distance to the adjacent cluster's centroid is added: according to the formula F = θ1*C1 + θ2*C2, where C1 is the distance to the members of this class, C2 the distance to the adjacent class's centroid, and θ1, θ2 weight parameters, the point with the maximum F value is selected as the class label. Iterate these steps until convergence, and take the centroid at that point as the final class label.
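A minimal sketch of this adjustment criterion, assuming C1 is the summed distance from a candidate point to its own cluster's members and C2 the distance to the nearest adjacent-cluster centroid; dist, members and other_centroids are illustrative names. This paragraph selects the maximum F, while the embodiment later minimises it; flipping max to min covers the other reading.

```python
def adjust_centroid(members, other_centroids, dist, theta1=0.5, theta2=0.5):
    """Pick the member optimising F = theta1*C1 + theta2*C2, where C1 sums
    the distances to this cluster's members and C2 is the distance to the
    nearest adjacent-cluster centroid."""
    def f(p):
        c1 = sum(dist(p, q) for q in members if q is not p)
        c2 = min(dist(p, c) for c in other_centroids)
        return theta1 * c1 + theta2 * c2
    return max(members, key=f)   # this paragraph selects the maximum F
```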
Step three: processing a text;
Because the text targets here are short text objects, it is difficult to locate an accurate semantic label with the previous label extraction approaches, so text label extraction here is mainly based on word frequency statistics and matching against a subject-word lexicon. Text data is processed, topics are extracted from it, and the subject words serve as the semantic labels of the corresponding numerical entities. The overall structure is a three-layer Bayesian network: an attribute-word layer, a subject-word layer and a category layer. Text label extraction corresponds to the attribute word-subject word layers, and the numerical-text semantic association corresponds to the subject word-category layers.
Step 3.1: text word segmentation;
The text is first split into blocks using stop words, then the blocks are matched against a lexicon; successfully matched blocks are subject words, and the remaining blocks serve as attribute words. Finally, word frequency statistics are taken over the attribute words; attribute words with frequency above a threshold α or below a threshold β are removed and added to the stop words.
Step 3.2: calculating the contribution degree of the theme;
The degree of relation between an attribute word and a subject word is calculated from how often they appear separately or together; a TF-IDF-like value serves as the contribution degree of the attribute word to the subject word, calculated as follows:
tf_ij = n_ij / n_j (5)
idf_i = log(D / w_i) (6)
tfidf_ij = tf_ij * idf_i (7)
where n_ij is the number of times attribute word i and subject word j appear together, n_j is the total number of occurrences of subject word j, D is the total number of documents, and w_i is the total number of occurrences of attribute word i.
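In code, formulas (5)-(7) reduce to a few lines; the sketch below is checked against the worked example given later in the embodiment (tf = 20/50, idf = log10(40000/400) = 2, contribution 0.8), which fixes the logarithm base at 10.

```python
import math

def contribution(n_ij: int, n_j: int, D: int, w_i: int) -> float:
    """TF-IDF-like contribution of attribute word i to subject word j."""
    tf = n_ij / n_j              # co-occurrences over subject-word count, (5)
    idf = math.log10(D / w_i)    # base 10 reproduces the example's idf = 2, (6)
    return tf * idf              # (7)

# The embodiment's "basketball" -> "sports" example: 0.4 * 2 = 0.8
print(contribution(n_ij=20, n_j=50, D=40000, w_i=400))
```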
Step 3.3: extracting a text label;
Text label extraction here differs from general natural language processing technology: it mainly targets short text data with a determinate semantic direction, such as microblog posts, medical cases or other comments.
When the subject term is extracted, the term is directly selected as a label and the contribution degree of the corresponding attribute term is updated, and when the successfully matched subject term does not exist, one or more subject terms with the highest contribution degree are selected as text labels according to the contribution degree of each attribute term to the subject term.
Step 3.4: semantic association of numerical labels with text labels;
Semantic association here mainly concerns associating a numerical sequence with the text data generated by the same instance. Since the numerical instances have been clustered and the text data has gone through subject-word extraction, the resulting data structure maps one category to multiple subject words, and one subject word may correspond to multiple categories.
If the feature labels of the numerical instances were directly associated one-to-one with subject words, a large amount of other association information would be lost. The subject words and categories are therefore kept in a many-to-many structural relationship, retaining semantic information to the maximum extent in probabilistic form.
The Bayesian network has the formula P(C, T, W) = P(T | W) * P(C | T), where C is a numerical-sequence category, T is a subject word, and W is an attribute word. Statistical analysis yields each corresponding probability value; given a numerical-instance feature label, the predicted subject word (such as an illness state, a loan result, and so on) follows the formula:
T* = argmax_T P(C | T) * P(T) (8)
When a class C is given, the one or more subject words T with the largest value of P(C | T) * P(T) are returned.
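A sketch of this prediction step: given a numerical category label C, subject words are ranked by P(C|T) * P(T), which by Bayes' rule orders P(T|C). The probability tables are assumed to be plain dicts obtained by counting; names are illustrative.

```python
def predict_topics(c, p_c_given_t, p_t, top_k=1):
    """Rank subject words t by P(c|t) * P(t), i.e. P(t|c) up to the
    constant factor P(c); return the top_k best-scoring subject words."""
    scores = {t: p_c_given_t.get((c, t), 0.0) * p_t[t] for t in p_t}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```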
Drawings
FIG. 1: a system flow diagram.
FIG. 2: a system training flow chart.
FIG. 3: example of the clustering algorithm embodiment: (A) original data set; (B) preliminary classification completed; (C) initial centroids selected; (D) final centroids obtained by iteration.
FIG. 4: numerical-semantic association structure chart.
Detailed Description
The invention is explained and illustrated below with reference to the accompanying drawings:
the data set adopted by the invention is semi-structured sensing data, namely, the data set comprises numerical data and text data.
The method comprises the following steps: according to the enlightening of similarity between people, the similarity is divided into two characteristics of vector similarity and scalar similarity, and a similarity calculation formula between numerical examples is designed to be used as the distance between each node;
Scalar similarity of the two numerical sequences is calculated according to formula (1), and vector similarity between them according to formula (3). The final similarity then follows from S = θ1*S1 + θ2*S2, where θ1, θ2 are weight parameters summing to 1.
First the sequence means are computed: mean(x1) = 3.03571428571 and mean(x2) = 2.96428571429. Then the minimum value of each sequence is subtracted from it, giving the new sequences:
x1 = [4.0, 2.0, 4.0, 3.0, 3.5, 5.0, 3.0, 4.0, 3.0, 3.0, 4.0, 1.0, 0.0, 3.0]
x2 = [6.0, 6.5, 5.5, 4.0, 2.0, 4.0, 1.5, 1.5, 0.5, 3.0, 1.0, 0.0, 1.5, 4.5]
Then, according to the formulas, S1 = 0.963211521552 and S2 = 0.823280350237; with both weights set to 0.5, S<x1, x2> = 0.5 × 0.963211521552 + 0.5 × 0.823280350237 = 0.893245935895. In the same way S<x2, x3> = 0.371814751237 is obtained, and the results agree with human intuition.
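The quoted S2 can be checked directly: plain cosine similarity over the two shifted sequences reproduces it (S1 depends on the unreproduced single-value formula and is not recomputed here).

```python
import math

x1 = [4.0, 2.0, 4.0, 3.0, 3.5, 5.0, 3.0, 4.0, 3.0, 3.0, 4.0, 1.0, 0.0, 3.0]
x2 = [6.0, 6.5, 5.5, 4.0, 2.0, 4.0, 1.5, 1.5, 0.5, 3.0, 1.0, 0.0, 1.5, 4.5]

dot = sum(a * b for a, b in zip(x1, x2))
s2 = dot / (math.sqrt(sum(a * a for a in x1)) *
            math.sqrt(sum(b * b for b in x2)))
print(s2)  # ≈ 0.823280350237, the S2 value quoted above
```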
Step two: extracting a numerical label;
Label extraction mainly uses a density-based algorithm to complete clustering, then weighs the intra-cluster distances against the distance to adjacent cluster centroids to obtain the final centroid as the label.
Step 2.1 clustering;
Assume the existing instance data is as shown in FIG. 3(A); first set the radius parameter R and the minimum instance count MinPts. For instances x1 and x2, if S(x1, x2) > R and the number of instances around x1 (or x2) with similarity greater than R exceeds MinPts, then x2 (or x1) is density-reachable from x1 (or x2). By this procedure all mutually reachable instances are grouped into one class, completing the preliminary classification shown in FIG. 3(B), and the instance with the minimum sum of in-cluster distances is selected as the preliminary centroid, as shown in FIG. 3(C).
Step 2.2, adjusting the mass center;
After the initial centroids are selected, the adjustment phase begins. First the weighting between the intra-cluster distance and the adjacent-centroid distance is set according to the actual training effect. Assume there are 5 clusters (C1, C2, C3, C4, C5) with initial centroids c1, c2, c3, c4, c5. Adjustment starts with cluster C1: for each point in C1, the value F = θ1*D1 + θ2*D2 is calculated, where D1 is the sum of the distances to the members of this class, D2 is the distance between the point and the nearest adjacent cluster centroid, and θ1, θ2 are the related weight parameters. It is worth noting that when calculating D2, since the nearest centroid may be the point's own, the distances from the point to the centroids of all clusters need to be calculated and the minimum over the other clusters selected, that is:
D2(xi) = min_{k ≠ 1} dist(xi, ck)
Cluster C1's centroid is adjusted to the point with the smallest F value, namely:
c1 = argmin_{xi ∈ C1} F(xi) = argmin_{xi ∈ C1} [θ1 * Σ_{xj ≠ xi} dist(xi, xj) + θ2 * D2(xi)]
where xi is any instance in C1 and xj is any entity in C1 other than xi. The centroids c2, c3, c4, c5 of the remaining clusters are adjusted in turn in the same way; iteration then continues until the centroids are stationary, and finally the centroid of each cluster is returned, with the final result shown in FIG. 3(D).
Step three: extracting and associating text labels;
Text label extraction is completed mainly on the basis of word frequency statistics and matching against a subject-word lexicon; two lexicons are used, a stop-word lexicon and a subject-word lexicon.
Step 3.1, text word segmentation;
A text is input and split into blocks by the stop words, the initial stop-word lexicon being an existing one called directly; the blocks are then matched against the subject-word lexicon, successfully matched blocks being subject words and the remainder attribute words; finally, word frequency statistics are taken over the attribute words, attribute words with frequency above threshold α or below threshold β are removed, and they are added to the stop words.
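A minimal sketch of this segmentation step; the stop-word and subject-word lexicons below are stand-in examples, not the patent's lexicons, and the function names are illustrative.

```python
import re
from collections import Counter

STOP_WORDS = {"的", "了", "和", "是"}           # stand-in stop-word lexicon
SUBJECT_LEXICON = {"体育", "心脏病", "肺结核"}    # stand-in subject lexicon

def segment(text):
    """Split the text into blocks at stop words, then sort the blocks into
    subject words (lexicon hits) and attribute words (everything else)."""
    pattern = "|".join(map(re.escape, STOP_WORDS))
    blocks = [b for b in re.split(pattern, text) if b]
    subjects = [b for b in blocks if b in SUBJECT_LEXICON]
    attributes = [b for b in blocks if b not in SUBJECT_LEXICON]
    return subjects, attributes

def prune_attributes(attribute_lists, alpha, beta):
    """Drop attribute words whose corpus frequency is above alpha or below
    beta, and move them into the stop-word lexicon."""
    freq = Counter(w for attrs in attribute_lists for w in attrs)
    dropped = {w for w, f in freq.items() if f > alpha or f < beta}
    STOP_WORDS.update(dropped)
    return [[w for w in attrs if w not in dropped] for attrs in attribute_lists]
```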
Step 3.2: calculating the contribution degree of the theme;
The degree of relation between attribute words and subject words is calculated from how often each word appears alone or together with another; formulas (5), (6) and (7) are invoked to compute each attribute word's contribution to the subject words. Once the contribution degrees are computed, a subject word-attribute word matrix TW is formed, where TW(i, j) represents the contribution of attribute word j to subject word i. For example, suppose that across 40000 documents the subject word "sports" appears 50 times and the word "basketball" appears 400 times, with "sports" and "basketball" co-occurring 20 times. Then tf = 20/50 = 0.4 and idf = log(40000/400) = 2, so the final contribution of "basketball" to the topic "sports" is 0.4 × 2 = 0.8.
Step 3.3: extracting a text label;
Text label extraction falls into two cases. In one, the segmented words contain a word from the subject-word lexicon, and that subject word is returned directly. In the other, the segmentation contains no subject word; then the contribution degrees of the document's attribute words to the subject words are aggregated, and the one or more topics with the highest contribution are selected as the document's label.
Step 3.4: semantic association of numerical labels with text labels;
A numerical label corresponds to one or more subject words, and a subject word corresponds to one or more attribute words. The subject word-attribute word matrix TW stores the relationship between subject words and attribute words; similarly, a numerical label (category label)-subject word probability matrix is obtained by word frequency statistics, finally producing the overall structure shown in FIG. 4. When a numerical sequence x1 is input, its subject-word label is obtained according to formula (8), i.e., the numerical label is converted into a text label. Suppose that in an existing medical database the probability of heart disease is P(t1) = f_heart/f_total = 0.02 and the probability of tuberculosis is P(t2) = f_tuberculosis/f_total = 0.06; for heart disease t1, the probability that its electrocardiogram matches sequence x1 is CT(t1, x1) = 60% and CT(t1, x2) = 40%, while for tuberculosis t2, CT(t2, x1) = 30% and CT(t2, x2) = 70%. If matching the user's electrocardiogram against the numerical label library yields x1, then p(t1|x1) = CT(t1, x1) * P(t1) = 0.012 and p(t2|x1) = CT(t2, x1) * P(t2) = 0.018, so the tuberculosis label is returned.
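Re-running this medical example in code, using exactly the probabilities quoted above; the dictionary keys are illustrative English names for t1, t2, x1 and x2.

```python
P = {"heart disease": 0.02, "tuberculosis": 0.06}          # P(t)
CT = {("heart disease", "x1"): 0.6, ("heart disease", "x2"): 0.4,
      ("tuberculosis", "x1"): 0.3, ("tuberculosis", "x2"): 0.7}

def best_label(x):
    # p(t|x) is proportional to CT(t, x) * P(t); return the argmax.
    return max(P, key=lambda t: CT[(t, x)] * P[t])

print(best_label("x1"))  # tuberculosis: 0.3*0.06 = 0.018 > 0.6*0.02 = 0.012
```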

Claims (1)

1. A label extraction method based on perception data is characterized in that: the method comprises the following implementation steps:
step one: the object data comprises numerical data and text data; label extraction first processes the two kinds of data separately and, after extracting their respective labels, associates them through probability statistics; for the numerical label part, an object similarity calculation method combining scalar similarity and vector similarity is designed, by analogy with appearance similarity and character similarity, and the similarity between objects is calculated;
step 1.1: the similarity of numerical entities refers to the semantic similarity between instances; the higher the similarity, the more likely the instances belong to the same class; a numerical entity is composed of several attributes, and an attribute value may be either a single value or a numerical sequence composed of multiple values, so the similarity calculation for data entities is divided into single-value similarity calculation, numerical-sequence similarity calculation and structure matching;
when comparing whether two single values are similar, the difference between them and their magnitude are considered; when the difference is small, less than 10% of the smaller value, the change in similarity is pronounced, and when the difference is large, more than 10 times the smaller value, the change in similarity flattens; the single-value similarity S is calculated by the formula:
[Formula (1): single-value similarity S(x, y); equation image not reproduced]
where x and y are any two numbers greater than zero, and max() returns the larger of its arguments; the formula satisfies the following points:
1) its value ranges between 0 and 1;
2) the similarity between two single values is inversely related to their difference and is referenced to the magnitude of the values;
3) the function is symmetric, i.e., S(x, y) = S(y, x);
4) the rate at which the similarity changes decreases as the difference grows;
these points accord with everyday intuition; after the similarities between attribute values are obtained, they are combined into a new numerical sequence, and the similarity between the new sequences is calculated with the numerical-sequence similarity method to obtain the final similarity result between entities;
step 1.2, calculating the numerical sequence similarity;
the similarity of numerical entities rests mainly on similarity calculation between numerical sequences; a numerical sequence is characterised by two aspects: first its numerical characteristic S1, second its waveform characteristic S2; the numerical characteristic consists of the mean, maximum, minimum and variance of the sequence, the fluctuation characteristic of the sequence is calculated by function fitting or cosine similarity, and the final similarity value is obtained by weighing the two characteristic values; a non-time sequence consists of individual numerical attribute values and has no waveform characteristic in the time dimension, and its similarity need only be obtained by weighted summation of the similarities between attribute values according to the numerical-characteristic formula;
the specific calculation process is as follows: for two sequences of length n, X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>, take the numerical characteristic based on the sequence means as S1 and the cosine similarity between the sequences as the fluctuation characteristic S2; the final similarity is S = θ1*S1 + θ2*S2, where θ1, θ2 are weight parameters summing to 1; the minimum value of each sequence is then subtracted from it, i.e., X - min(X), Y - min(Y); this minimises the cross-influence between the numerical and waveform characteristics, sharpens the difference between the two sequences' numerical and waveform characteristics, and prevents negative values; the similarity of the numerical characteristics is then calculated according to the single-value similarity formula:
S1 = S(mean(X), mean(Y)) (2)
S2 = Σ(xi * yi) / (sqrt(Σ xi²) * sqrt(Σ yi²)) (3)
where mean() is the averaging function and max() is the function returning the larger value; simple derivation shows that S1 and S2 both range over (0, 1), which guarantees that the final similarity value also falls in (0, 1); finally, the optimal parameter values θ1, θ2 are obtained by a supervised learning algorithm, such as gradient-descent training, and the final similarity follows from the formula;
step two: extracting numerical characteristic labels;
feature label extraction selects a centroid through clustering as the final label of each class, so cluster quality directly determines the label extraction effect; each label represents the most prominent feature of a class, i.e., the point with the smallest semantic difference from all instances in the class; most clustering algorithms follow this principle, but in practice the numerical feature label itself need not express the cluster's feature semantics; the clustering algorithm therefore adds a feature based on the distance to the adjacent cluster's centroid, in order to select the optimal class division point; the sequence feature label extraction process comprises two parts, clustering and centroid selection;
step 2.1 clustering;
the clustering process is based on the similarity calculation for numerical sequences and treats the similarity between records as a distance, i.e., dist(x, y) = S(x, y); preliminary classification is completed with a density-based algorithm: a radius parameter R and a minimum instance count MinPts are set, points whose similarity exceeds R and whose instance count exceeds MinPts are grouped into one class, and the point with the minimum summed in-cluster distance is selected as the preliminary centroid, namely:
c = argmin_xi Σ_{xn ≠ xi} S(xi, xn) (4)
where xi is any numerical entity, xn is any entity other than xi, and S is the similarity calculation function;
step 2.2, adjusting the mass center;
when the system is used for result prediction, classification uses the similarity to the labels as the distance, that is, the space of a class is a quasi-circular region centred on the class label with radius equal to half the distance between that label and the adjacent class's label; therefore, after clustering, in order to find the point that best separates the regions as the centroid, a feature based on the distance to the adjacent cluster's centroid is added: according to the formula F = θ1*C1 + θ2*C2, where C1 is the distance to the members of this class, C2 the distance to the adjacent class's centroid, and θ1, θ2 weight parameters, the point with the maximum F value is selected as the class label; these steps are iterated until convergence, and the centroid at that point is taken as the final class label;
step three: processing a text;
because the text targets are short text objects, it is difficult to locate an accurate semantic label with the prior label extraction approaches, so text label extraction is mainly based on word frequency statistics and matching against a subject-word lexicon; text data is processed, topics are extracted from it, and the subject words serve as the semantic labels of the corresponding numerical entities; the overall structure is a three-layer Bayesian network, specifically an attribute-word layer, a subject-word layer and a category layer, in which text label extraction corresponds to the attribute word-subject word layers and the numerical-text semantic association corresponds to the subject word-category layers;
step 3.1: text word segmentation;
the text is first segmented into blocks using stop words, the blocks are then matched against a lexicon, successfully matched blocks being subject words and the remaining blocks serving as attribute words; finally, word frequency statistics are taken over the attribute words, and attribute words with frequency above a threshold α or below a threshold β are removed and added to the stop words;
step 3.2: calculating the contribution degree of the theme;
the degree of relation between an attribute word and a subject word is calculated from how often they appear separately or together; a TF-IDF-like value serves as the contribution degree of the attribute word to the subject word, calculated as follows:
tf_ij = n_ij / n_j (5)
idf_i = log(D / w_i) (6)
tfidf_ij = tf_ij * idf_i (7)
where n_ij is the number of times attribute word i and subject word j appear together, n_j is the total number of occurrences of subject word j, D is the total number of documents, and w_i is the total number of occurrences of attribute word i;
step 3.3: extracting a text label;
text label extraction differs from general natural language processing technology and targets short text data with a determinate semantic direction;
when a subject word is matched during extraction, that word is directly selected as the label and the contribution degrees of the corresponding attribute words are updated; when no subject word matches, one or more subject words with the highest contribution degree are selected as text labels according to each attribute word's contribution to the subject words;
step 3.4: semantic association of numerical labels with text labels;
semantic association mainly concerns associating a numerical sequence with the text data generated by the same instance; since the numerical instances have been clustered and the text data has gone through subject-word extraction, the resulting data structure maps one category to multiple subject words, and one subject word may also correspond to multiple categories;
if the feature labels of the numerical instances were directly associated one-to-one with subject words, a large amount of other association information would be lost; the subject words and categories are therefore kept in a many-to-many structural relationship, retaining semantic information to the maximum extent in probabilistic form;
the Bayesian network formula is P(C, T, W) = P(T | W) * P(C | T), where C is a numerical-sequence category, T a subject word, and W an attribute word; statistical analysis yields each corresponding probability value, and given a numerical-instance feature label, the predicted subject word follows the formula:
T* = argmax_T P(C | T) * P(T) (8)
when a class C is given, the one or more subject words with the largest value of P(C | T) * P(T) are returned.
CN201711253610.2A 2017-12-02 2017-12-02 Label extraction method based on perception data Active CN107862089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711253610.2A CN107862089B (en) 2017-12-02 2017-12-02 Label extraction method based on perception data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711253610.2A CN107862089B (en) 2017-12-02 2017-12-02 Label extraction method based on perception data

Publications (2)

Publication Number Publication Date
CN107862089A CN107862089A (en) 2018-03-30
CN107862089B (en) 2020-03-13

Family

ID=61704739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711253610.2A Active CN107862089B (en) 2017-12-02 2017-12-02 Label extraction method based on perception data

Country Status (1)

Country Link
CN (1) CN107862089B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555101A (en) * 2019-09-09 2019-12-10 浙江诺诺网络科技有限公司 customer service knowledge base updating method, device, equipment and storage medium
US11392769B2 (en) * 2020-07-15 2022-07-19 Fmr Llc Systems and methods for expert driven document identification
CN113159802A (en) * 2021-04-15 2021-07-23 武汉白虹软件科技有限公司 Algorithm model and system for realizing fraud-related application collection and feature extraction clustering
CN113343980B (en) * 2021-06-10 2023-06-09 西安邮电大学 Natural scene text detection method and system
CN113553429B (en) * 2021-07-07 2023-09-29 北京计算机技术及应用研究所 Normalized label system construction and text automatic labeling method
CN115600945B (en) * 2022-09-07 2023-06-30 淮阴工学院 Cold chain loading user image construction method and device based on multiple granularities

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081642A (en) * 2010-10-28 2011-06-01 华南理工大学 Chinese label extraction method for clustering search results of search engine
CN104598543A (en) * 2014-11-28 2015-05-06 广东工业大学 Social matching data mining system
CN105138510A (en) * 2015-08-10 2015-12-09 昆明理工大学 Microblog-based neologism emotional tendency judgment method
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166565A1 (en) * 2011-12-23 2013-06-27 Kevin LEPSOE Interest based social network system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081642A (en) * 2010-10-28 2011-06-01 华南理工大学 Chinese label extraction method for clustering search results of search engine
CN104598543A (en) * 2014-11-28 2015-05-06 广东工业大学 Social matching data mining system
CN105138510A (en) * 2015-08-10 2015-12-09 昆明理工大学 Microblog-based neologism emotional tendency judgment method
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Probabilistic topic model-based service discovery for the Internet of Things (基于概率主题模型的物联网服务发现); 魏强 et al.; 《软件学报》 (Journal of Software); 2014-12-31; full text *

Also Published As

Publication number Publication date
CN107862089A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
CN107862089B (en) Label extraction method based on perception data
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN108132968B (en) Weak supervision learning method for associated semantic elements in web texts and images
CN110059181B (en) Short text label method, system and device for large-scale classification system
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN103761286B (en) A kind of Service Source search method based on user interest
US11886515B2 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Mahmood et al. Semi-supervised evolutionary ensembles for Web video categorization
CN116383430A (en) Knowledge graph construction method, device, equipment and storage medium
Hossny et al. Enhancing keyword correlation for event detection in social networks using SVD and k-means: Twitter case study
Ding et al. Context-aware semantic type identification for relational attributes
Chen et al. A new method to estimate null values in relational database systems based on automatic clustering techniques
Guo Intelligent sports video classification based on deep neural network (DNN) algorithm and transfer learning
Cao et al. Detecting communities on topic of transportation with sparse crowd annotations
Kumar et al. Multi document summarization based on cross-document relation using voting technique
Wang et al. High-level semantic image annotation based on hot Internet topics
Yu et al. A Multi-Directional Search technique for image annotation propagation
CN113761192A (en) Text processing method, text processing device and text processing equipment
CN110750963B (en) News document duplication removing method, device and storage medium
Castano et al. A new approach to security system development
Vu et al. Density-based clustering with side information and active learning
Fan et al. A text clustering algorithm hybriding invasive weed optimization with K-means
Liu et al. Classification of Medical Text Data Using Convolutional Neural Network-Support Vector Machine Method
Kaur et al. Text document clustering and classification using k-means algorithm and neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant