CN112560469A - Method and system for automatically exploring Chinese text topics - Google Patents
Method and system for automatically exploring Chinese text topics
- Publication number
- CN112560469A
- Authority
- CN
- China
- Prior art keywords
- clustering
- text
- chinese text
- central point
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/49—Data-driven translation using very large corpora, e.g. the web
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method and a system for automated Chinese text topic exploration. The system comprises a word vector construction module, a text clustering module and a visualization module. It provides more classification feature information, so that text topics can be extracted manually more conveniently and quickly.
Description
Technical Field
The invention relates to the field of text topic exploration, and in particular to a method and a system for automated Chinese text topic exploration.
Background
Various methods exist for topic exploration, such as LDA topic extraction and K-Means text clustering based on unsupervised learning. The LDA topic model infers topics from a probabilistic and statistical perspective using Bayesian ideas, whereas the K-Means clustering model clusters scattered points according to their distances in vector space. Either way, the texts are finally divided into different clusters or classes; on this basis, further information extraction and summarization are performed manually, and the purpose of text topic extraction is finally achieved. In this context, K-Means has the following disadvantages:
the objective of K-Means is to minimize the sum of squared distances from each cluster member to the centroid of the cluster containing it; the distances from all data points to the centroids must be calculated in every iteration, so as the analyzed data set grows, the amount of calculation and the time consumed keep increasing;
K-Means can only divide the texts into a given number of clusters or classes and provides no further classification information, which makes it inconvenient for people to summarize and extract text topics quickly.
To overcome these disadvantages of the K-Means clustering method, an automated Chinese text topic exploration method and system are constructed.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method and a system for automatically exploring Chinese text topics.
The technical scheme adopted by the embodiment of the invention for solving the technical problem is as follows: a method for automatically exploring Chinese text topics comprises the following steps:
step 1, segmenting the Chinese text into words, and screening out the nouns, verbs, adjectives and adverbs;
step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using the TF-IDF algorithm, thereby vectorizing and preprocessing the Chinese text and converting the text data into points in a vector space, and reducing the dimension of the high-dimensional vector space data by using the TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;
step 3, clustering the texts by using the Mini Batch K-Means clustering method;
step 4, analyzing the sentiment tendency of the texts by using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment tendency distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, according to the central point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the central point or nearest to the central point as the representative article of that category;
step 5, displaying the obtained clustering information by a visualization method, so that a user summarizes the topic information of each category from the clustering result and the auxiliary information provided for each category to complete the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
The visualization method includes word clouds, pie charts and/or lists.
N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
An automated Chinese text topic exploration system, applying the above method for automated Chinese text topic exploration, comprises:
the word vector construction module, which is used for segmenting the Chinese text into words, screening out the nouns, verbs, adjectives and adverbs, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using the TF-IDF algorithm, thereby vectorizing and preprocessing the Chinese text and converting the text data into points in a vector space, and reducing the dimension of the high-dimensional vector space data by using the TruncatedSVD singular value decomposition method;
the text clustering module, which is used for clustering the texts by using the Mini Batch K-Means clustering method; analyzing the sentiment tendency of the texts by using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment tendency distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, according to the central point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the central point or nearest to the central point as the representative article of that category;
and the visualization module, which is used for displaying the obtained clustering information by a visualization method, so that a user summarizes the topic information of each category from the clustering result and the auxiliary information provided for each category to complete the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
The beneficial effects of the invention are as follows: the system for automated Chinese text topic exploration comprises a word vector construction module, a text clustering module and a visualization module; applying the method for automated Chinese text topic exploration in the system addresses the long calculation time of the K-Means clustering method; more classification feature information is provided, so that text topics can be extracted manually more conveniently and quickly.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram of an automated Chinese text topic exploration system.
Detailed Description
Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
In the description of the present invention, "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including the stated number. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the technical features indicated, or the order of the technical features indicated.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the present invention, unless explicitly defined otherwise, the terms "disposed," "mounted," "connected," and the like are to be understood in a broad sense. For example, a connection may be direct or indirect through an intermediate; it may be fixed, detachable or integrally formed; it may be mechanical; and it may be internal communication between two elements or an interaction between two elements. The specific meaning of the above terms in the present invention can be reasonably determined by those skilled in the art in combination with the detailed contents of the technical solutions.
Referring to FIG. 1, a method for automated Chinese text topic exploration includes the following steps:
step 1, segmenting the Chinese text into words, and screening out the nouns, verbs, adjectives and adverbs;
step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using the TF-IDF algorithm, thereby vectorizing and preprocessing the Chinese text and converting the text data into points in a vector space, and reducing the dimension of the high-dimensional vector space data by using the TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;
step 3, clustering the texts by using the Mini Batch K-Means clustering method;
step 4, analyzing the sentiment tendency of the texts by using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment tendency distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, according to the central point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the central point or nearest to the central point as the representative article of that category;
step 5, displaying the obtained clustering information by a visualization method, so that a user summarizes the topic information of each category from the clustering result and the auxiliary information provided for each category to complete the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
The visualization method includes word clouds, pie charts and/or lists.
N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
An automated Chinese text topic exploration system, applying the above method for automated Chinese text topic exploration, comprises:
the word vector construction module, which is used for segmenting the Chinese text into words, screening out the nouns, verbs, adjectives and adverbs, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using the TF-IDF algorithm, thereby vectorizing and preprocessing the Chinese text and converting the text data into points in a vector space, and reducing the dimension of the high-dimensional vector space data by using the TruncatedSVD singular value decomposition method;
the text clustering module, which is used for clustering the texts by using the Mini Batch K-Means clustering method; analyzing the sentiment tendency of the texts by using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment tendency distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, according to the central point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the central point or nearest to the central point as the representative article of that category;
and the visualization module, which is used for displaying the obtained clustering information by a visualization method, so that a user summarizes the topic information of each category from the clustering result and the auxiliary information provided for each category to complete the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
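The part-of-speech screening performed by the word vector construction module can be sketched as follows. This is only an illustration: it assumes the segmenter has already produced (word, part-of-speech flag) pairs, and that the flags follow an ICTCLAS-style tag set in which nouns, verbs, adjectives and adverbs start with `n`, `v`, `a` and `d`. The source names neither a segmenter nor a tag set, so `screen_words`, `KEPT_PREFIXES` and the sample data are hypothetical.

```python
# Assumed tag convention (not specified in the source): ICTCLAS-style flags,
# where n* = noun, v* = verb, a* = adjective, d* = adverb.
KEPT_PREFIXES = ("n", "v", "a", "d")


def screen_words(tagged_words):
    """Keep only words whose part-of-speech flag marks a noun, verb, adjective or adverb."""
    return [word for word, flag in tagged_words if flag.startswith(KEPT_PREFIXES)]


# Hypothetical output of a Chinese word segmenter with part-of-speech tagging.
tagged = [
    ("我们", "r"), ("非常", "d"), ("喜欢", "v"), ("这", "r"), ("部", "q"),
    ("精彩", "a"), ("的", "uj"), ("电影", "n"), ("。", "x"),
]
kept = screen_words(tagged)
print(kept)
```

Pronouns, measure words, particles and punctuation are dropped, so only content-bearing words reach the TF-IDF stage.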
Text representation methods include the traditional one-hot model, the bag-of-words model and the TF-IDF model, as well as models such as Word2Vec, FastText, GloVe, ELMo, GPT and BERT. The system adopts the TF-IDF method, which jointly considers how often each word occurs in a text (the TF value) and how well the word distinguishes between texts (the IDF value); the word vector is represented by the product of the two. The importance of a word increases in proportion to the number of times it appears in a text, but decreases in inverse proportion to the frequency with which it appears across the corpus:
TF = (number of times the term appears in the text) / (total number of terms in the text)
IDF = log((total number of documents in the corpus) / (number of documents containing the term + 1))
TF-IDF = TF × IDF
After the TF-IDF value of each word is obtained, the TF-IDF value of each keyword is placed at the position given by the keyword ordering to construct a sentence vector, so that the text data is expressed as vector data, which facilitates subsequent calculation.
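The TF-IDF weighting described above can be illustrated with a small dependency-free sketch. It assumes each text has already been segmented and screened into a list of words, and it reads the IDF expression as log(N / (df + 1)); the function name `tf_idf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    TF  = occurrences of the term in the document / number of terms in the document
    IDF = log(number of documents / (number of documents containing the term + 1))
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs at least once.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / (df[word] + 1)) for word in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty document
        vectors.append([counts[word] / total * idf[word] for word in vocab])
    return vocab, vectors


# Hypothetical pre-segmented documents.
docs = [["电影", "精彩", "电影"], ["天气", "晴朗"], ["电影", "天气"], ["音乐", "好听"]]
vocab, vectors = tf_idf_vectors(docs)
```

With this form of IDF, a word that occurs in every document receives a slightly negative weight, which is one reason library implementations often use a smoothed variant.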
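Singular value decomposition (SVD) is an important matrix factorization method in linear algebra. A real m × n matrix A is decomposed into the product A = U Σ V^T of two orthogonal matrices and a diagonal matrix, where U is an m × m orthogonal matrix, Σ is an m × n diagonal matrix, V is an n × n orthogonal matrix and V^T is the transpose of V. TruncatedSVD retains only the largest singular values and their corresponding singular vectors, which gives the low-dimensional representation used in step 2.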
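For reference, the decomposition and its truncated form can be written out as follows. The symbol k (the number of retained singular values) and the convention that each row of A is one text's TF-IDF vector are assumptions made for this illustration; the source fixes neither.

```latex
% Full SVD of the m x n TF-IDF matrix A (rows = texts, columns = terms)
A = U \Sigma V^{\mathsf{T}}, \qquad
U \in \mathbb{R}^{m \times m},\;
\Sigma \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n}

% Truncated SVD: keep only the k largest singular values, k < \min(m, n)
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Reduced k-dimensional representation of the texts
Z = A V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}
```

Each row of Z is then the k-dimensional point that represents one text in the clustering step.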
The Mini Batch K-Means algorithm is built on the K-Means algorithm. In K-Means, the number of classes K is set in advance and K sample points are randomly selected as the initial central points (centroids). The Euclidean distance between every sample point in the set and each of the K central points is then calculated, and each point is assigned to the subset of the central point closest to it; within each subset, the central point is updated as the mean of its members. After repeated iteration, the calculation ends when the central points of all subsets are stable or a specified threshold condition is reached, and the clustering is complete. Evidently, K-Means uses the full data set in every iteration while searching for stable central points, so the calculation time grows greatly when the data volume or the number of iterations is large. Mini Batch K-Means adopts a sampling idea: each iteration only needs a small batch sampled from the full data, which greatly shortens the calculation time. Although iterating on samples loses some clustering quality, the difference in precision is negligible when the data volume is large.
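The sampling idea can be sketched in a few lines of plain Python. This is not the implementation used by the system: it assumes a per-centre running-mean update (one common formulation of mini-batch K-Means; the source does not specify the update rule), and the names `mini_batch_kmeans`, `batch_size`, `n_iter`, `init` and `seed` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the centre closest to the point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def mini_batch_kmeans(points, k, batch_size=4, n_iter=50, init=None, seed=0):
    rng = random.Random(seed)
    # Initial centres: given explicitly, or k sample points chosen at random.
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    counts = [0] * k  # how many points each centre has absorbed so far
    for _ in range(n_iter):
        # Only a small random batch is used per iteration, not the full data.
        batch = rng.sample(points, min(batch_size, len(points)))
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centre learning rate (running mean)
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    # Final assignment of every point to its nearest centre.
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Two well-separated toy groups of 2-D points.
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (10.0, 10.0), (10.2, 9.9), (9.8, 10.1)]
centers, labels = mini_batch_kmeans(pts, k=2, init=[pts[0], pts[3]])
```

The full data set is touched only once, for the final assignment; every update step costs `batch_size × k` distance computations regardless of how many texts there are.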
The invention provides a method for automated Chinese text topic exploration and an automated Chinese text topic exploration system applying the method. The system comprises a word vector construction module, a text clustering module and a visualization module; it provides more classification feature information, so that text topics can be extracted manually more conveniently and quickly.
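The additional classification information of step 4 (the cross statistics of clusters and sentiment, the top N keywords per category, and the representative article per category) can be sketched as below. The sentiment labels are taken as given, since the source does not name a sentiment analysis method; all function names and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def cross_tab(cluster_labels, sentiment_labels):
    """Count, for each cluster, how many articles carry each sentiment label."""
    table = defaultdict(Counter)
    for cluster, sentiment in zip(cluster_labels, sentiment_labels):
        table[cluster][sentiment] += 1
    return {cluster: dict(counter) for cluster, counter in table.items()}


def top_keywords(docs, cluster_labels, n=3):
    """Top n words of each cluster ranked by word frequency (the source limits N to 1..10)."""
    freq = defaultdict(Counter)
    for doc, cluster in zip(docs, cluster_labels):
        freq[cluster].update(doc)
    return {c: [w for w, _ in counter.most_common(n)] for c, counter in freq.items()}


def representative_docs(vectors, cluster_labels, centers):
    """Index of the article closest to the central point of its own cluster."""
    best = {}
    for i, (vec, cluster) in enumerate(zip(vectors, cluster_labels)):
        dist = sum((x - y) ** 2 for x, y in zip(vec, centers[cluster]))
        if cluster not in best or dist < best[cluster][0]:
            best[cluster] = (dist, i)
    return {cluster: i for cluster, (_, i) in best.items()}


# Hypothetical inputs: segmented articles, cluster labels, sentiment labels,
# reduced vectors and the central points reported by the clustering step.
docs = [["电影", "精彩", "电影"], ["电影", "演员"], ["天气", "晴朗", "天气"], ["天气", "下雨"]]
labels = [0, 0, 1, 1]
sentiments = ["positive", "neutral", "positive", "negative"]
vectors = [(0.9, 0.1), (0.6, 0.2), (0.1, 0.8), (0.2, 0.5)]
centers = [(0.8, 0.1), (0.1, 0.7)]

table = cross_tab(labels, sentiments)
keywords = top_keywords(docs, labels, n=1)
representatives = representative_docs(vectors, labels, centers)
```

These three dictionaries are the kind of per-category data a word cloud, pie chart or list view could display in step 5.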
It is to be understood that the present invention is not limited to the above-described embodiments, and that equivalent modifications and substitutions may be made by those skilled in the art without departing from the spirit of the present invention, and that such equivalent modifications and substitutions are to be included within the scope of the appended claims.
Claims (4)
1. A method for automated Chinese text topic exploration, characterized by comprising the following steps:
step 1, segmenting the Chinese text into words, and screening out the nouns, verbs, adjectives and adverbs;
step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using the TF-IDF algorithm, thereby vectorizing and preprocessing the Chinese text and converting the text data into points in a vector space, and reducing the dimension of the high-dimensional vector space data by using the TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;
step 3, clustering the texts by using the Mini Batch K-Means clustering method;
step 4, analyzing the sentiment tendency of the texts by using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment tendency distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, according to the central point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the central point or nearest to the central point as the representative article of that category;
step 5, displaying the obtained clustering information by a visualization method, so that a user summarizes the topic information of each category from the clustering result and the auxiliary information provided for each category to complete the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
2. The method of claim 1, wherein the visualization method includes word clouds, pie charts and/or lists.
3. The method of claim 1, wherein N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
4. An automated Chinese text topic exploration system employing the method for automated Chinese text topic exploration of any of claims 1-3, comprising:
the word vector construction module, which is used for segmenting the Chinese text into words, screening out the nouns, verbs, adjectives and adverbs, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using the TF-IDF algorithm, thereby vectorizing and preprocessing the Chinese text and converting the text data into points in a vector space, and reducing the dimension of the high-dimensional vector space data by using the TruncatedSVD singular value decomposition method;
the text clustering module, which is used for clustering the texts by using the Mini Batch K-Means clustering method; analyzing the sentiment tendency of the texts by using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment tendency distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, according to the central point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the central point or nearest to the central point as the representative article of that category;
and the visualization module, which is used for displaying the obtained clustering information by a visualization method, so that a user summarizes the topic information of each category from the clustering result and the auxiliary information provided for each category to complete the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011603044.5A CN112560469B (en) | 2020-12-29 | 2020-12-29 | Method and system for automatically exploring Chinese text theme |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011603044.5A CN112560469B (en) | 2020-12-29 | 2020-12-29 | Method and system for automatically exploring Chinese text theme |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112560469A (en) | 2021-03-26 |
CN112560469B (en) | 2023-07-04 |
Family
ID=75034320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011603044.5A Active CN112560469B (en) | 2020-12-29 | 2020-12-29 | Method and system for automatically exploring Chinese text theme |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112560469B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104899335A (en) * | 2015-06-25 | 2015-09-09 | 四川友联信息技术有限公司 | Method for performing sentiment classification on network public sentiment of information |
CN107103043A (en) * | 2017-03-29 | 2017-08-29 | 国信优易数据有限公司 | A kind of Text Clustering Method and system |
CN108536762A (en) * | 2018-03-21 | 2018-09-14 | 上海蔚界信息科技有限公司 | A kind of high-volume text data automatically analyzes scheme |
CN109299280A (en) * | 2018-12-12 | 2019-02-01 | 河北工程大学 | Short text clustering analysis method, device and terminal device |
WO2020101477A1 (en) * | 2018-11-14 | 2020-05-22 | Mimos Berhad | System and method for dynamic entity sentiment analysis |
Non-Patent Citations (1)
Title |
---|
徐莉莎: "《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》", 15 June 2020, pages: 138 - 1311 * |
Also Published As
Publication number | Publication date |
---|---|
CN112560469B (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||