CN115357708A - Hot line data extraction and data element analysis method - Google Patents

Publication number
CN115357708A
Authority
CN
China
Prior art keywords
data
word
feature
words
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211038229.5A
Other languages
Chinese (zh)
Inventor
王伟
金婷
石爱辉
李逸玮
施茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Clp Hongxin Information Technology Co ltd
Original Assignee
Clp Hongxin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Clp Hongxin Information Technology Co ltd filed Critical Clp Hongxin Information Technology Co ltd
Priority to CN202211038229.5A priority Critical patent/CN115357708A/en
Publication of CN115357708A publication Critical patent/CN115357708A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A hotline data extraction and data element analysis method comprises the following steps. S1: collecting data to form data sets, then segmenting and preprocessing the data to obtain the feature word data to be extracted. S2: calculating the word frequency and the frequency value of each feature word. S3: calculating the weight value V of each feature word from its word frequency and frequency value, sorting the feature words by weight value V, and outputting the top topN feature words. S4: inputting the feature words processed in step S3 into a vector model for training to obtain a vectorized representation of the feature words. S5: calculating the similarity between the feature word vectors of step S4 based on the word mover's distance (WMD) method. S6: using a K-means clustering algorithm to find, for the feature word corresponding to each vector, its similar feature words, forming the cluster set to which each vector belongs. S7: identifying potentially different types of feature words through cluster analysis to generate related events.

Description

Hot line data extraction and data element analysis method
Technical Field
The invention relates to the technical field of data analysis, and in particular to a hotline data extraction and data element analysis method.
Background
With the rapid development of Internet and computer technologies, data analysis is moving from business intelligence toward user intelligence. The data holdings of industries and government departments keep growing, and the data types are varied. What is urgently needed is a data mining and analysis capability that uses existing data to discover the relationships among data and their deeper value according to different requirements, quickly finds hot-spot, emergency, sensitive and other events in a government hotline system, supports leadership decision-making and the government hotline business, improves the efficiency of public government services, and provides more intelligent data support.
Various data analysis systems have appeared in the industry. Although data analysis tools and mining channels are increasingly rich and diverse, their main principle is to preset event feature words and then match those feature words against the data to form event data. For example, Chinese patent No. 2016105990149 identifies the event components contained in data according to a preset event identification model. However, for real-time hot-spot, emergency, sensitive and similar events, the preset event feature words may not be updated in time, so that event feature words cannot be extracted or the extracted feature words are invalid, and effective related events cannot be generated in time. In addition, accurate event clustering across different data texts requires semantic understanding, so it cannot be achieved simply by matching preset feature words.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a hotline data extraction and data element analysis method. Rather than being limited to matching preset event feature words against the data, the method starts from the original data and matches the feature words found in it to form effective related events. It can quickly screen effective and important content from massive data to serve decision-making by all parties, and is particularly suitable for government hotline services, improving the efficiency of public government services and providing more intelligent data support.
In order to achieve the purpose, the invention adopts the following technical scheme:
A hotline data extraction and data element analysis method comprises the following steps:
S1: collecting hotline data, constructing a plurality of data sets, and segmenting and preprocessing the data in the data sets to obtain the feature word data to be extracted in each data set;
S2: for the feature word data to be extracted, calculating the word frequency WF_a and the frequency value RDF_a of each feature word, and generating a frequency value dictionary;
S3: selecting the parts of speech to be filtered according to the current hotline service scenario, thereby screening out part of the feature word data to be extracted; then looking up the word frequency WF_a and the frequency value RDF_a of each feature word in the frequency value dictionary, calculating the weight value V of each feature word, sorting the feature words in descending order of weight value V, and outputting the top topN feature words;
S4: establishing a vector model, inputting the feature words processed in step S3 into the vector model for training, and obtaining a vectorized representation of the feature words;
S5: calculating the similarity between the feature word vectors of step S4 based on the word mover's distance (WMD) method;
S6: setting K cluster sets using a K-means clustering algorithm, randomly selecting K vectors from the set of feature word vectors produced in step S4 as the initial centroids of the clustering algorithm, and using the method of step S5 to find, for the feature word corresponding to each vector, its similar feature words, forming the cluster set to which each vector belongs;
S7: identifying potentially different types of feature word data through the cluster analysis of step S6, and generating related events under different conditions.
In order to optimize the technical scheme, the specific measures adopted further comprise:
Further, the feature word data to be extracted in each data set after the preprocessing in step S1 is obtained as follows: the data in each data set is segmented into a plurality of feature words, and the stop words and interference words in each data set are deleted, so that each data set yields its feature word data to be extracted.
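The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the whitespace-based `segment` function stands in for a real word segmenter (hotline records would normally be Chinese text), and the `STOP_WORDS` and `INTERFERENCE_WORDS` sets are hypothetical examples.

```python
from typing import Iterable, List, Set

# Hypothetical word lists for illustration only; a real deployment would
# load lists suited to the hotline service scenario.
STOP_WORDS: Set[str] = {"the", "a", "is", "of", "to"}
INTERFERENCE_WORDS: Set[str] = {"hello", "thanks"}


def segment(record: str) -> List[str]:
    """Split one record into lowercase tokens.

    Assumption: whitespace splitting stands in for a real word segmenter.
    """
    return record.lower().split()


def preprocess(data_set: Iterable[str]) -> List[str]:
    """Return the feature words of one data set, with stop words and
    interference words removed."""
    drop = STOP_WORDS | INTERFERENCE_WORDS
    return [tok for record in data_set for tok in segment(record) if tok not in drop]


features = preprocess(["Hello the water supply is cut", "water pipe leak"])
print(features)
```

Repeated words are kept on purpose, since the later word-frequency step needs the occurrence counts.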
Further, the word frequency WF_a and the frequency value RDF_a of each feature word in step S2 are calculated as follows:
calculating the word frequency WF_a of each feature word:
$$WF_a = \frac{k}{\sum_{m=1}^{M} j_m}$$
where k denotes the number of occurrences of a feature word a in a data set, and $\sum_{m=1}^{M} j_m$ is the sum of the occurrence counts $j_m$ of the M feature words in that data set;
calculating the frequency value RDF_a of each feature word:
(formula image for RDF_a not recovered in the source text)
where D denotes the total number of data sets and d denotes the number of data sets containing the feature word.
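As a rough sketch of how a frequency value dictionary could be built from these definitions: the word frequency follows the stated definition (occurrences of a word divided by the total occurrences of all feature words in the data set). The RDF formula itself is not legible in the source, so the code below assumes a conventional inverse-document-frequency form, log(D / d); this is an assumption for illustration, not the patent's formula. The function names are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def word_frequency(feature_words: List[str]) -> Dict[str, float]:
    """WF_a = k / sum_m j_m, computed within a single data set."""
    counts = Counter(feature_words)
    total = sum(counts.values())
    return {word: k / total for word, k in counts.items()}


def frequency_value(data_sets: List[List[str]]) -> Dict[str, float]:
    """ASSUMPTION: RDF_a = log(D / d), a conventional IDF form.

    D is the total number of data sets and d the number of data sets
    containing the word. The patent's actual formula is not recoverable.
    """
    D = len(data_sets)
    containing = Counter(word for ds in data_sets for word in set(ds))
    return {word: math.log(D / d) for word, d in containing.items()}


def build_dictionary(data_sets: List[List[str]]) -> List[Dict[str, Tuple[float, float]]]:
    """One dictionary per data set, mapping word -> (WF_a, RDF_a)."""
    rdf = frequency_value(data_sets)
    return [
        {word: (wf, rdf[word]) for word, wf in word_frequency(ds).items()}
        for ds in data_sets
    ]


data_sets = [["water", "water", "leak"], ["water", "noise"], ["noise", "road"]]
dictionary = build_dictionary(data_sets)
```

Under the assumed form, a word that appears in every data set gets a frequency value of zero, so it contributes nothing to the later weight.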
Further, the weight value V of the corresponding feature word in step S3 is calculated as:
$$V = WF_a \cdot RDF_a$$
Further, when the feature words are sorted in descending order of weight value V in step S3, feature words with the same weight value are sorted according to their ASCII codes.
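The weighting and ranking rule can be sketched as below. Two points are assumptions rather than statements from the source: ties are broken in ascending order, and Python's built-in string comparison (Unicode code point order, which matches ASCII order for ASCII text) is used as the stand-in for "ASCII code" ordering. The function name `top_n_feature_words` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def top_n_feature_words(freq_dict: Dict[str, Tuple[float, float]], top_n: int) -> List[str]:
    """Rank words by V = WF_a * RDF_a in descending order.

    Assumption: words with equal V are ordered by ascending code point.
    """
    weights = {word: wf * rdf for word, (wf, rdf) in freq_dict.items()}
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_n]


# word -> (WF_a, RDF_a); made-up values chosen so that two ties occur.
freq_dict = {
    "leak": (0.25, 2.0),
    "water": (0.5, 1.0),
    "noise": (0.25, 1.0),
    "road": (0.125, 2.0),
}
top = top_n_feature_words(freq_dict, 3)
print(top)  # ['leak', 'water', 'noise']
```

Sorting on the tuple `(-weight, word)` keeps the ordering deterministic, which matters when the topN cut falls inside a group of equally weighted words.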
Further, step S5 specifically comprises: based on the word mover's distance (WMD) method, summing the vectors of the feature words from step S4 and taking the average value, and then calculating the Euclidean distance between every two vectors; if the Euclidean distance is smaller than the average value, the feature words corresponding to the two vectors are considered similar; if the Euclidean distance is larger than the average value, the feature words corresponding to the two vectors are considered not similar.
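The text leaves open exactly which "average value" serves as the threshold. The sketch below takes one possible reading, labeled as an assumption: the threshold is the mean of all pairwise Euclidean distances between feature word vectors, and a pair counts as similar when its distance falls below that mean. The sample vectors and function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similar_pairs(vectors: Dict[str, List[float]]) -> List[Tuple[str, str]]:
    """Return the word pairs whose distance is below the threshold.

    ASSUMPTION: the threshold is the mean of all pairwise distances,
    which is only one interpretation of the "average value" in the text.
    """
    pairs = list(combinations(sorted(vectors), 2))
    if not pairs:
        return []
    dists = {pair: euclidean(vectors[pair[0]], vectors[pair[1]]) for pair in pairs}
    threshold = sum(dists.values()) / len(dists)
    return [pair for pair in pairs if dists[pair] < threshold]


vectors = {"leak": [0.0, 0.0], "pipe": [0.0, 1.0], "noise": [4.0, 0.0]}
print(similar_pairs(vectors))  # [('leak', 'pipe')]
```

With the sample vectors the three distances are 4, 1 and about 4.12, so only the leak/pipe pair falls under the mean of about 3.04.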
The invention has the beneficial effects that:
1. The main principle of traditional data analysis tools is to preset event feature words and then recognize and match those feature words in the data to form event data. For real-time hot-spot, emergency, sensitive and similar events, however, the preset event feature words may not be updated in time, so event feature words cannot be extracted or the extracted feature words are invalid, and effective related events cannot be generated in time. The present method starts from the original data, does not preset event feature words, and uses measures such as word frequency and frequency value to match feature words and form effective related events, so that effective and important content can be quickly screened from massive data to serve decision-making by all parties.
2. Extracting the feature word set from the service data avoids problems such as high-dimensional feature word vectors and low computational efficiency. The vector model represents the feature words as low-dimensional vectors, so that words with similar semantics lie closer together. This effectively addresses the problem of semantic loss and also avoids the semantic recognition problem that arises when new Internet words cannot be matched because the word bank is not updated in time. Analyzing the data in this way overcomes the shortcomings of matching preset feature words against text data. Compared with the traditional approach of matching feature word hits, the method is more accurate, more intelligent, and of greater practical guidance value.
Drawings
FIG. 1 is a schematic flow diagram of the overall process of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, the main technical solution of the present application is as follows:
Step 1: Data word segmentation: read a data set consisting of a plurality of data records, segment all input data, and then remove stop words and interference words to form the feature word data to be extracted.
Step 2: Generating a frequency value dictionary: count all words appearing in the preprocessed data and their frequencies, and generate the corresponding frequency value dictionary for the data set; the dictionary changes with the preprocessed data.
And step 3: extracting feature words: determining a set of words to be extracted according to a service scene, and determining a word frequency according to the word frequency
Figure BDA0003819437870000031
(where k is the frequency of occurrence of the entry a in a data set, ∑ o M j m For the number j of occurrences of each of the M entries in the data set m Sum of) and frequency values
Figure BDA0003819437870000032
(wherein D is the total data set quantity, and D is the quantity of the data sets containing the entry) the dictionary calculates the weighted value of a single word, firstly carries out reverse ordering according to the weighted value, and when the phases are the same, carries out ordering according to the ASCII code of the characteristic word, and outputs a topN keyword.
Step 4: Establishing the vector model: a mapping process of "word → vector 1 → vector 2" is adopted. First, each word of the word set obtained in step 1 is taken as a feature column and a (0,1) word vector is initialized, so that each word lies on a coordinate axis. SVD is then used to solve the decomposition from vector 1 to vector 2, and the word vectors are mapped to a lower-dimensional space through dimension reduction to form a vector model based on a neural network language model.
Step 5: Generating feature word vectors: input the feature words into the vector model for training to obtain a vectorized representation of the words, and store the generated feature word and vector values as key-value pairs. With the support of the Word2Vec tool, the deep learning model trains the word vector model on the data using Skip-gram, which improves the generalization performance of the vector model.
Step 6: Text similarity calculation: calculate the similarity among the feature word vectors obtained in step 5. Building on the Word Mover's Distance (WMD) method, the word vectors of all feature words on each side are summed and averaged, and the Euclidean distance between the two resulting vectors is calculated in place of the WMD result. At the cost of some precision, this greatly speeds up the computation and reduces the time complexity to O(dp), which makes the model practical for large data sets.
Step 7: Clustering feature words: set K cluster sets using a K-means clustering algorithm, randomly select K word vectors from the set of feature word vectors of step 5 as the initial centroids of the clustering algorithm, and use the method of step 6 to calculate the cluster set to which each feature word belongs. Repeat step 7 until the centroids no longer change.
Step 8: Analyzing results: based on the cluster analysis of step 7, identify potentially different types of data and generate related events under different conditions.
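The clustering loop of step 7 can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: plain Euclidean distance to the centroid is used as the similarity measure, a fixed random seed makes the run repeatable, and a `max_iter` cap guards against non-termination. The function names, the seed, and the sample vectors are all hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vs: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(component) / len(vs) for component in zip(*vs)]


def k_means(vectors: Dict[str, List[float]], k: int,
            seed: int = 0, max_iter: int = 100) -> List[List[str]]:
    """Cluster feature words; K initial centroids are drawn at random.

    The loop stops when the centroids no longer change (or at max_iter,
    which must be at least 1).
    """
    words = sorted(vectors)
    rng = random.Random(seed)
    centroids = [vectors[w] for w in rng.sample(words, k)]
    for _ in range(max_iter):
        clusters: List[List[str]] = [[] for _ in range(k)]
        for w in words:
            nearest = min(range(k), key=lambda i: euclidean(vectors[w], centroids[i]))
            clusters[nearest].append(w)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_vector([vectors[w] for w in c]) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters


vectors = {
    "leak": [0.0, 0.0], "pipe": [0.0, 1.0], "water": [1.0, 0.0],
    "noise": [10.0, 10.0], "road": [10.0, 11.0],
}
clusters = k_means(vectors, k=2)
```

In practice a library implementation such as scikit-learn's `KMeans` would usually be preferred; the hand-written loop is shown only to make the centroid update and stopping condition explicit.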
Compared with the prior art, extracting the feature word set from the service data avoids problems such as high-dimensional feature word vectors and low computational efficiency. The deep learning model represents the feature words as low-dimensional vectors, so that words with similar semantics lie closer together. This effectively addresses the problem of semantic loss and also avoids the semantic recognition problem that arises when new Internet words cannot be matched because the word bank is not updated in time. Analyzing the data with the pre-trained deep learning model overcomes the shortcomings of matching preset feature words against text data. Compared with the traditional approach of matching feature word hits, the method is more accurate, more intelligent, and of greater practical guidance value.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the scope of the invention; changes or adjustments of their relative relationships, without substantive change to the technical content, are also regarded as within the scope of the invention.
The above is only a preferred embodiment of the present invention. The protection scope of the invention is not limited to this embodiment, and all technical solutions that fall under the idea of the invention belong to its protection scope. Modifications and adaptations that those skilled in the art may make without departing from the principles of the invention are also intended to be within its protection scope.

Claims (6)

1. A hotline data extraction and data element analysis method, characterized by comprising the following steps:
S1: collecting hotline data, constructing a plurality of data sets, and segmenting and preprocessing the data in the data sets to obtain the feature word data to be extracted in each data set;
S2: for the feature word data to be extracted, calculating the word frequency WF_a and the frequency value RDF_a of each feature word, and generating a frequency value dictionary;
S3: selecting the parts of speech to be filtered according to the current hotline service scenario, thereby screening out part of the feature word data to be extracted; then looking up the word frequency WF_a and the frequency value RDF_a of each feature word in the frequency value dictionary, calculating the weight value V of each feature word, sorting the feature words in descending order of weight value V, and outputting the top topN feature words;
S4: establishing a vector model, inputting the feature words processed in step S3 into the vector model for training, and obtaining a vectorized representation of the feature words;
S5: calculating the similarity between the feature word vectors of step S4 based on the word mover's distance (WMD) method;
S6: setting K cluster sets using a K-means clustering algorithm, randomly selecting K vectors from the set of feature word vectors produced in step S4 as the initial centroids of the clustering algorithm, and using the method of step S5 to find, for the feature word corresponding to each vector, its similar feature words, forming the cluster set to which each vector belongs;
S7: identifying potentially different types of feature word data through the cluster analysis of step S6, and generating related events under different conditions.
2. The hotline data extraction and data element analysis method according to claim 1, wherein the feature word data to be extracted in each data set after the preprocessing in step S1 is obtained as follows: the data in each data set is segmented into a plurality of feature words, and the stop words and interference words in each data set are deleted, so that each data set yields its feature word data to be extracted.
3. The hotline data extraction and data element analysis method according to claim 1, wherein the word frequency WF_a and the frequency value RDF_a of each feature word in step S2 are calculated as follows:
calculating the word frequency WF_a of each feature word:
$$WF_a = \frac{k}{\sum_{m=1}^{M} j_m}$$
where k denotes the number of occurrences of a feature word a in a data set, and $\sum_{m=1}^{M} j_m$ is the sum of the occurrence counts $j_m$ of the M feature words in that data set;
calculating the frequency value RDF_a of each feature word:
(formula image for RDF_a not recovered in the source text)
where D denotes the total number of data sets and d denotes the number of data sets containing the feature word.
4. The hotline data extraction and data element analysis method according to claim 1, wherein the weight value V of the corresponding feature word in step S3 is calculated as:
$$V = WF_a \cdot RDF_a$$
5. The hotline data extraction and data element analysis method according to claim 1, wherein, when the feature words are sorted in descending order of weight value V in step S3, feature words with the same weight value are sorted according to their ASCII codes.
6. The hotline data extraction and data element analysis method according to claim 1, wherein step S5 specifically comprises: based on the word mover's distance (WMD) method, summing the vectors of the feature words from step S4 and taking the average value, and then calculating the Euclidean distance between every two vectors; if the Euclidean distance is smaller than the average value, the feature words corresponding to the two vectors are considered similar; if the Euclidean distance is larger than the average value, the feature words corresponding to the two vectors are considered not similar.
CN202211038229.5A 2022-08-29 2022-08-29 Hot line data extraction and data element analysis method Pending CN115357708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211038229.5A CN115357708A (en) 2022-08-29 2022-08-29 Hot line data extraction and data element analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211038229.5A CN115357708A (en) 2022-08-29 2022-08-29 Hot line data extraction and data element analysis method

Publications (1)

Publication Number Publication Date
CN115357708A true CN115357708A (en) 2022-11-18

Family

ID=84004940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211038229.5A Pending CN115357708A (en) 2022-08-29 2022-08-29 Hot line data extraction and data element analysis method

Country Status (1)

Country Link
CN (1) CN115357708A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination