CN115357708A - Hot line data extraction and data element analysis method

Info

Publication number: CN115357708A
Application number: CN202211038229.5A
Authority: CN
Inventors: 王伟; 金婷; 石爱辉; 李逸玮; 施茜
Original assignee: Clp Hongxin Information Technology Co ltd
Current assignee: Clp Hongxin Information Technology Co ltd
Priority date: 2022-08-29
Filing date: 2022-08-29
Publication date: 2022-11-18

Abstract

A hot line data extraction and data element analysis method comprises the following steps. S1: collecting data to form data sets, and segmenting and preprocessing the data to obtain feature word data to be extracted; S2: calculating the word frequency and the frequency value of each feature word; S3: calculating the weight value of each feature word from its word frequency and frequency value, sorting the feature words by weight value, and outputting the top N feature words; S4: inputting the feature words processed in step S3 into a vector model for training to obtain a vectorized representation of the feature words; S5: calculating the similarity between the feature word vectors of step S4 based on the Word Mover's Distance (WMD) method; S6: calculating, with a K-means clustering algorithm, the similar feature words of the feature word corresponding to each vector to form the cluster set to which the vector belongs; S7: identifying potentially different types of feature words through cluster analysis to generate related events.

Description

Hot line data extraction and data element analysis method

Technical Field

The invention relates to the technical field of data analysis, in particular to a data extraction and data element analysis method of a hotline.

Background

With the rapid development of Internet and computer technologies, data analysis is moving from business intelligence toward user intelligence. The data holdings of industries and government departments keep growing, and their data types are increasingly varied. Using existing data to discover relationships within the data and its deeper value according to different requirements, and rapidly detecting hot-spot, emergency, sensitive and other events in a government hotline system, serves leadership decision-making and the government hotline business, improves the efficiency of public government services, and provides more intelligent data support. This data mining and analysis capability is urgently needed at present.

Various data analysis systems have appeared in the industry. Although data analysis tools and mining channels are increasingly rich and diverse, their main principle is to preset event feature words and then match those feature words against the data to form event data. For example, Chinese patent application No. 2016105990149 identifies the event components contained in data according to a preset event recognition model. However, for real-time hot-spot, emergency, sensitive and similar events, the preset event feature words may not be updated in time, so event feature words may fail to be extracted or the extracted feature words may be invalid, and effective related events cannot be generated in time. In addition, accurate event clustering across different data texts requires semantic understanding, so it cannot be achieved by simply matching preset feature words.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a hot line data extraction and data element analysis method. Rather than being limited to matching preset event feature words against the data, the method starts from the original data and matches the feature words found in that data to form effective related events. It can quickly screen effective and important content from massive data to support decision-making for all parties, and is particularly suitable for government hotline services, improving the efficiency of public government services and providing more intelligent data support.

In order to achieve the purpose, the invention adopts the following technical scheme:

A hot line data extraction and data element analysis method comprises the following steps:

S1: collecting the hotline data, constructing a plurality of data sets, and segmenting and preprocessing the data in the data sets to obtain the feature word data to be extracted in each data set;

S2: for the feature word data to be extracted, calculating the word frequency WF_a and the frequency value RDF_a of each feature word, and generating a frequency value dictionary;

S3: selecting the parts of speech to be filtered according to the current hot line service scenario, further screening part of the feature word data to be extracted, then querying the word frequency WF_a and the frequency value RDF_a of each feature word in the frequency value dictionary, calculating the weight value V of the corresponding feature word, sorting the feature words from large to small according to the weight value V of each feature word, and outputting the top N feature words;

S4: establishing a vector model, inputting the feature words processed in step S3 into the vector model for training, and obtaining a vectorized representation of the feature words;

S5: calculating the similarity between the feature word vectors of step S4 based on the Word Mover's Distance (WMD) method;

S6: setting K cluster sets by using a K-means clustering algorithm, randomly selecting K vectors from the feature word vector data set formed in step S4 as the initial centroids of the clustering algorithm, and calculating, by the method of step S5, the similar feature words of the feature word corresponding to each vector to form the cluster set to which the vector belongs;

S7: identifying potentially different types of the feature word data through the cluster analysis in step S6, and generating related events under different conditions.

To further optimize the technical scheme, the following specific measures are also adopted:

Further, the specific content of obtaining the feature word data to be extracted in each data set after the preprocessing in step S1 is as follows: the data in each data set are segmented into a plurality of feature words, and the stop words and interference words in each data set are deleted, so that each data set yields its feature word data to be extracted.
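The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `segment` function uses a simple regular expression as a stand-in for a real word segmenter, and the `STOP_WORDS` and `INTERFERENCE_WORDS` lists are hypothetical examples, not lists given in the source.

```python
import re

# Hypothetical word lists; a real deployment would load lists suited to the
# hotline scenario.
STOP_WORDS = {"the", "a", "is", "on", "of"}
INTERFERENCE_WORDS = {"hello", "thanks"}


def segment(text):
    """Split text into lowercase word tokens (stand-in for a real segmenter)."""
    return re.findall(r"\w+", text.lower())


def preprocess(data_set):
    """Return the feature words of one data set after removing unwanted words."""
    drop = STOP_WORDS | INTERFERENCE_WORDS
    words = []
    for record in data_set:
        words.extend(w for w in segment(record) if w not in drop)
    return words


data_set = ["Hello, the street light is broken", "Street light broken on Main Road"]
feature_words = preprocess(data_set)
```

For Chinese hotline text, the regular expression would presumably be replaced by a dedicated word segmentation tool; the filtering step stays the same.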

Further, the specific content of calculating the word frequency WF_a and the frequency value RDF_a of each feature word in step S2 is as follows:

calculating the word frequency WF_a of each feature word, where k denotes the number of occurrences of any feature word a in a data set, and Σ_M j_m is the sum of the respective occurrence counts j_m of the M feature words in the data set;

calculating the frequency value RDF_a of each feature word, where D denotes the total number of data sets, and d denotes the number of data sets containing the feature word.
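The symbol definitions here suggest a pair of quantities in the style of TF-IDF. As a hedged reading only, the word frequency could be taken as the share of occurrences of feature word a in its data set, and the frequency value as a logarithmic inverse frequency over the data sets. The logarithmic form of RDF_a is an assumption that the text does not confirm. Here d stands for the number of data sets containing the feature word.

```latex
\mathrm{WF}_a = \frac{k}{\sum_{m=1}^{M} j_m},
\qquad
\mathrm{RDF}_a = \log\frac{D}{d}
```

Under this reading, a word that appears in every data set has d = D and therefore a frequency value of zero, so it contributes nothing to the ranking.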

Further, the specific content of calculating the weight value V of the corresponding feature word in step S3 is:

$$V = \mathrm{WF}_a \times \mathrm{RDF}_a$$

Further, when the feature words are sorted from large to small by weight value V in step S3, feature words with the same weight value are sorted according to their ASCII codes.
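The frequency value dictionary, the weight V = WF_a × RDF_a and the tie-breaking rule can be illustrated with a short Python sketch. The function names `build_frequency_dictionary` and `top_n` are hypothetical, and the logarithmic form used for RDF_a is an assumption borrowed from conventional inverse document frequency, not a formula stated in the source.

```python
import math
from collections import Counter


def build_frequency_dictionary(data_sets):
    """For each data set, map every feature word to its (WF, RDF) pair."""
    total_sets = len(data_sets)              # D: total number of data sets
    sets_containing = Counter()              # d: data sets containing each word
    for words in data_sets:
        sets_containing.update(set(words))
    dictionary = []
    for words in data_sets:
        counts = Counter(words)
        total = sum(counts.values())         # sum of j_m over the M feature words
        dictionary.append({
            # ASSUMPTION: RDF is taken as log(D / d); the source gives no formula.
            w: (k / total, math.log(total_sets / sets_containing[w]))
            for w, k in counts.items()
        })
    return dictionary


def top_n(entry, n):
    """Rank words by V = WF * RDF, largest first; ties follow code-point order."""
    weights = {w: wf * rdf for w, (wf, rdf) in entry.items()}
    return sorted(weights, key=lambda w: (-weights[w], w))[:n]


data_sets = [
    ["light", "broken", "light", "road"],
    ["noise", "road", "noise"],
    ["water", "road", "leak"],
]
dictionary = build_frequency_dictionary(data_sets)
print(top_n(dictionary[0], 2))
print(top_n(dictionary[2], 2))
```

Python compares strings by code point, which coincides with ASCII order for ASCII-only words. In the third data set "leak" and "water" have equal weights, so the tie-break places "leak" first.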

Further, the specific content of step S5 is: based on the Word Mover's Distance (WMD) method, all the feature word vectors of step S4 are summed and averaged, and the Euclidean distance between every two vectors is then calculated; if the Euclidean distance is smaller than the average value, the feature words corresponding to the two vectors are considered similar, and if the Euclidean distance is larger than the average value, they are considered not similar.

The invention has the beneficial effects that:

1. The main principle of traditional data analysis tools is to preset event feature words and then recognize and match feature words in the data to form event data. For real-time hot-spot, emergency, sensitive and similar events, however, the preset event feature words may not be updated in time, so event feature words cannot be extracted or the extracted feature words are invalid, and effective related events cannot be generated in time. The present method starts from the original data, does not preset event feature words, and uses word frequency and frequency value to match feature words and form effective related events, so that effective and important content can be quickly screened from massive data to support decision-making for all parties.

2. Extracting the feature word set from the service data avoids problems such as high-dimensional feature word vectors and low calculation efficiency. The vector model represents feature words as low-dimensional vectors, so that semantically similar words lie closer together. This effectively addresses the problem of semantic loss and also avoids the semantic recognition failures that occur when new Internet words cannot be matched because the word bank is not updated in time. Analyzing the data in this way effectively overcomes the shortcomings of matching preset feature words against text data. Compared with the traditional method of matching hit feature words, the method is more accurate, more intelligent, and of greater practical guidance value.

Drawings

FIG. 1 is a schematic flow diagram of the overall process of the present invention.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings.

Referring to FIG. 1, the main technical solution of the present application is as follows:

Step 1: Data word segmentation: reading a data set consisting of a plurality of data items, segmenting all input data, and then removing stop words and interference words to form the feature word data to be extracted.

Step 2: Generating a frequency value dictionary: counting all words appearing in the preprocessed data together with their frequencies, and generating a corresponding frequency value dictionary for the data set, wherein the dictionary changes with the preprocessed data.

Step 3: Extracting feature words: determining the set of words to be extracted according to the service scenario; calculating the weight value of each word from the word frequency (where k is the number of occurrences of entry a in a data set, and Σ_M j_m is the sum of the occurrence counts j_m of the M entries in the data set) and the frequency value (where D is the total number of data sets, and d is the number of data sets containing the entry) recorded in the dictionary; sorting the words in descending order of weight value and, where weight values are equal, by the ASCII codes of the feature words; and outputting the top N keywords.

Step 4: Establishing a vector model: adopting a mapping process of "word → vector 1 → vector 2", each word of the word set obtained in step 1 is first taken as a feature column, and a (0, 1) word vector is initialized so that each word lies on a coordinate axis; SVD is then used to decompose vector 1 into vector 2, mapping the word vectors to a lower-dimensional space through dimension reduction and forming a vector model based on a neural network language model.

Step 5: Generating feature word vectors: inputting the feature words into the vector model for training to obtain a vectorized representation of the words, and storing the generated feature word-vector values as key-value pairs. The deep learning model is trained with the Word2Vec tool using the Skip-gram architecture, so that the vector model generalizes better.
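Skip-gram learns word vectors by predicting the words surrounding each center word. The training pairs it consumes can be sketched as follows. The function name `skipgram_pairs` and the window size are assumptions chosen for illustration; the source does not state a window size, and the actual training would be carried out by the Word2Vec tool rather than by this code.

```python
def skipgram_pairs(words, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, words[j]))
    return pairs


pairs = skipgram_pairs(["street", "light", "broken"], window=1)
```

With a window of 1, each word is paired only with its immediate neighbours, so the three-word example yields four pairs.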

Step 6: Text similarity calculation: calculating the similarity among the feature word vectors obtained in step 5. Building on the Word Mover's Distance (WMD) method, the word vectors of all feature words are respectively summed and averaged, and the Euclidean distance between the two resulting vectors is calculated in place of the WMD result. At the cost of some precision, this greatly increases running speed and reduces the time complexity to O(dp), which effectively addresses the model's computational cost on large data sets.
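The averaging shortcut described in this step resembles what is commonly called the word centroid distance, a cheap approximation to WMD. A minimal sketch, assuming each side is given as a list of equal-length word vectors, is shown below. The function names and the toy two-dimensional vectors are invented for illustration.

```python
import math


def centroid(vectors):
    """Average a list of equal-length word vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_distance(vectors_a, vectors_b):
    """Euclidean distance between the mean vectors of two word sets."""
    return math.dist(centroid(vectors_a), centroid(vectors_b))


# Toy vectors, invented for illustration only.
text_a = [[1.0, 0.0], [3.0, 0.0]]   # mean vector is (2, 0)
text_b = [[2.0, 3.0], [2.0, 5.0]]   # mean vector is (2, 4)
distance = centroid_distance(text_a, text_b)
```

Each mean vector is computed in a single pass over the word vectors, which is why this replacement is so much cheaper than solving the transport problem that full WMD requires.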

Step 7: Clustering feature words: setting K cluster sets by using the K-means clustering algorithm, randomly selecting K word vectors from the feature word vector data set of step 5 as the initial centroids of the clustering algorithm, and calculating the cluster set to which each feature word belongs by using the method of step 6. Step 7 is repeated until the centroids no longer change.
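The clustering loop of this step can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the function name `k_means`, the `seed` and `max_iter` parameters, and the rule of keeping the previous centroid for an empty cluster are all assumptions, and plain Euclidean distance between vectors is used for the assignment.

```python
import math
import random


def k_means(vectors, k, seed=0, max_iter=100):
    """Cluster vectors into k sets; stop when the centroids no longer change."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)       # K random vectors as initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to the nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster; an empty
        # cluster keeps its previous centroid (an assumption of this sketch).
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, clusters = k_means(vectors, k=2)
```

On this toy input the two well-separated pairs end up in separate clusters regardless of which vectors are drawn as initial centroids, although the order of the returned clusters depends on the draw.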

Step 8: Result analysis: based on the cluster analysis of step 7, potentially different types of data are identified, and related events under different conditions are generated.

Compared with the prior art, extracting the feature word set from the service data avoids problems such as high-dimensional feature word vectors and low calculation efficiency. The deep learning model represents feature words as low-dimensional vectors, so that semantically similar words lie closer together. This effectively addresses the problem of semantic loss and also avoids the semantic recognition failures that occur when new Internet words cannot be matched because the word bank is not updated in time. The data are analyzed through the pre-trained deep learning model, which effectively overcomes the shortcomings of matching preset feature words against text data. Compared with the traditional method of matching hit feature words, the method is more accurate, more intelligent, and of greater practical guidance value.

It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive change to the technical content, shall also be regarded as within the implementable scope of the invention.

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the idea of the present invention belong to its protection scope. It should be noted that modifications and adaptations that those skilled in the art may make without departing from the principles of the present invention are also considered to be within the protection scope of the present invention.

Claims

1. A hot line data extraction and data element analysis method is characterized by comprising the following steps:

S3: selecting the parts of speech to be filtered according to the current hot line service scenario, further screening part of the feature word data to be extracted, then querying the word frequency WF_a and the frequency value RDF_a of each feature word in the frequency value dictionary, calculating the weight value V of the corresponding feature word, sorting the feature words from large to small according to the weight value V of each feature word, and outputting the top N feature words;

S7: identifying potentially different types of the feature word data through the cluster analysis in step S6, and generating related events under different conditions.

2. The method for data extraction and data element analysis of hot line according to claim 1, wherein the specific content of the feature word data to be extracted in each data set after the preprocessing in step S1 is: the data in the data sets form a plurality of characteristic words after word segmentation, stop words and interference words in each data set are deleted, and therefore each data set forms characteristic word data to be extracted.

3. The method for hot line data extraction and data element analysis according to claim 1, wherein the specific content of calculating the word frequency WF_a and the frequency value RDF_a of each feature word in step S2 is as follows:

calculating the word frequency WF_a of each feature word, where k denotes the number of occurrences of any feature word a in a data set, and Σ_M j_m is the sum of the respective occurrence counts j_m of the M feature words in the data set;

and calculating the frequency value RDF_a of each feature word.

4. The method for data extraction and data element analysis of a hotline according to claim 1, wherein the specific content of calculating the weight value V of the corresponding feature word in step S3 is:

$$V = \mathrm{WF}_a \times \mathrm{RDF}_a$$

5. The method for hot line data extraction and data element analysis according to claim 1, wherein in step S3, when the feature words are sorted from large to small by weight value V, feature words with the same weight value are sorted according to their ASCII codes.

6. The hot line data extraction and data element analysis method according to claim 1, wherein the specific content of step S5 is: based on the Word Mover's Distance (WMD) method, all the feature word vectors of step S4 are summed and averaged, and the Euclidean distance between every two vectors is then calculated; if the Euclidean distance is smaller than the average value, the feature words corresponding to the two vectors are considered similar, and if the Euclidean distance is larger than the average value, they are considered not similar.