CN112560469A - Method and system for automatically exploring Chinese text topics - Google Patents

Publication number
CN112560469A
Authority
CN
China
Prior art keywords
clustering
text
chinese text
central point
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011603044.5A
Other languages
Chinese (zh)
Other versions
CN112560469B (en)
Inventor
张荣显
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Hengqin Boyi Data Technology Co ltd
Original Assignee
Zhuhai Hengqin Boyi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-29
Filing date
2020-12-29
Publication date
2021-03-26
Application filed by Zhuhai Hengqin Boyi Data Technology Co ltd
Priority to CN202011603044.5A
Publication of CN112560469A
Application granted
Publication of CN112560469B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/49 Data-driven translation using very large corpora, e.g. the web
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for automated Chinese text topic exploration. The system comprises a word vector construction module, a text clustering module and a visualization module. It provides additional descriptive information about each category, so that a person can extract text topics conveniently and quickly.

Description

Method and system for automatically exploring Chinese text topics
Technical Field
The invention relates to the field of text topic exploration, and in particular to a method and a system for automated Chinese text topic exploration.
Background
Many methods exist for topic exploration, such as LDA topic extraction and K-Means text clustering based on unsupervised learning. The LDA topic model infers topics from a probabilistic and statistical perspective using Bayesian ideas, while the K-Means clustering model clusters scattered points according to their distances in a vector space. Either method ultimately divides the texts into different clusters or classes; on that basis, a person extracts and summarizes further information to arrive at the text topics. In this context, K-Means has the following disadvantages:
the objective of K-Means is to minimize the sum of squared distances from each cluster member to the centroid of the cluster containing it. Every iteration must compute the distances from all data points to the centroids, so as the analyzed data set grows, the amount of computation and the time consumed grow with it;
K-Means can only divide the texts into a given number of clusters or classes. It provides no further information about each class, which makes it inconvenient for a person to summarize and extract text topics quickly.
To overcome these disadvantages of the K-Means clustering method, a method and a system for automated Chinese text topic exploration are constructed.
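For reference, the objective that K-Means minimizes can be written as below. The symbols are notation introduced here for illustration and do not appear in the patent: $x_i$ is a text vector, $C_k$ is the set of vectors assigned to cluster $k$, and $\mu_k$ is the centroid of that cluster.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Each full iteration evaluates the distance from all $n$ points to all $K$ centroids, so the cost per iteration grows on the order of $n \cdot K \cdot d$ for $d$-dimensional vectors, which is the time cost described in the first disadvantage.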
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method and a system for automatically exploring Chinese text topics.
The technical solution adopted by the embodiments of the invention to solve the technical problem is as follows: a method for automated Chinese text topic exploration comprises the following steps:
step 1, segmenting the Chinese text into words, and screening out the nouns, verbs, adjectives and adverbs;
step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs using the TF-IDF algorithm, thereby vectorizing the Chinese text as a preprocessing step and converting the text data into points in a vector space, and reducing the dimensionality of the high-dimensional vector data using the TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;
step 3, clustering the texts using the Mini Batch K-Means clustering method;
step 4, analyzing the sentiment tendency of each text using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, based on the center point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the center point or nearest to it as the representative article of that category;
step 5, displaying the obtained clustering information by a visualization method, so that the user summarizes the topic of each category from the clustering result and the auxiliary information provided for each category, completing the text topic exploration; the clustering result data is stored in a variable that the user can call and cross-analyze with variables of other dimensions.
The visualization method includes word clouds, pie charts and/or lists.
N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
An automated Chinese text topic exploration system, which applies the above method for automated Chinese text topic exploration, comprises:
the word vector construction module, which segments the Chinese text into words, screens out the nouns, verbs, adjectives and adverbs, constructs word vectors for the screened words using the TF-IDF algorithm, thereby vectorizing the Chinese text as a preprocessing step and converting the text data into points in a vector space, and reduces the dimensionality of the high-dimensional vector data using the TruncatedSVD singular value decomposition method;
the text clustering module, which clusters the texts using the Mini Batch K-Means clustering method; analyzes the sentiment tendency of each text using a sentiment analysis method; cross-tabulates the clustering result with the sentiment analysis result to obtain the overall sentiment distribution of the articles in each category; extracts the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, based on the center point of each category calculated by the Mini Batch K-Means clustering method, selects the article at the center point or nearest to it as the representative article of that category;
and the visualization module, which displays the obtained clustering information by a visualization method, so that the user summarizes the topic of each category from the clustering result and the auxiliary information provided for each category, completing the text topic exploration; the clustering result data can be stored in a variable that the user can call and cross-analyze with variables of other dimensions.
The beneficial effects of the invention are as follows: the system comprises a word vector construction module, a text clustering module and a visualization module. Applying the method for automated Chinese text topic exploration in the system addresses the long computation time of the K-Means clustering method, and provides additional descriptive information about each category, so that a person can extract text topics conveniently and quickly.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a block diagram of an automated Chinese text topic exploration system.
Detailed Description
Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.
In the description of the present invention, "a plurality" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "or more", "or less", "within" and the like are understood as including the stated number. Where "first" and "second" are used, they serve only to distinguish technical features, and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their order of precedence.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the present invention, unless explicitly defined otherwise, the terms "disposed," "mounted," "connected," and the like are to be understood in a broad sense, and for example, may be directly connected or indirectly connected through an intermediate; can be fixedly connected, can also be detachably connected and can also be integrally formed; may be a mechanical connection; either as communication within the two elements or as an interactive relationship of the two elements. The specific meaning of the above-mentioned words in the present invention can be reasonably determined by those skilled in the art in combination with the detailed contents of the technical solutions.
Referring to FIG. 1, a method for automated Chinese text topic exploration includes the following steps:
step 1, segmenting the Chinese text into words, and screening out the nouns, verbs, adjectives and adverbs;
step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs using the TF-IDF algorithm, thereby vectorizing the Chinese text as a preprocessing step and converting the text data into points in a vector space, and reducing the dimensionality of the high-dimensional vector data using the TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;
step 3, clustering the texts using the Mini Batch K-Means clustering method;
step 4, analyzing the sentiment tendency of each text using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, based on the center point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the center point or nearest to it as the representative article of that category;
step 5, displaying the obtained clustering information by a visualization method, so that the user summarizes the topic of each category from the clustering result and the auxiliary information provided for each category, completing the text topic exploration; the clustering result data is stored in a variable that the user can call and cross-analyze with variables of other dimensions.
The visualization method includes word clouds, pie charts and/or lists.
N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
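The auxiliary information of step 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names and sample data are hypothetical, the per-document sentiment labels are assumed to come from some sentiment analysis method that the patent does not specify, and the cluster labels, vectors and center points are assumed to come from step 3.

```python
import math
from collections import Counter, defaultdict


def top_keywords(docs_tokens, labels, n=5):
    """Top-n words by frequency within each cluster."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs_tokens, labels):
        counts[label].update(tokens)
    return {label: [w for w, _ in c.most_common(n)] for label, c in counts.items()}


def representative_docs(vectors, labels, centers):
    """Index of the document closest (Euclidean) to each cluster center."""
    best = {}
    for idx, (vec, label) in enumerate(zip(vectors, labels)):
        dist = math.dist(vec, centers[label])
        if label not in best or dist < best[label][0]:
            best[label] = (dist, idx)
    return {label: idx for label, (_, idx) in best.items()}


def sentiment_by_cluster(labels, sentiments):
    """Cross-tabulate cluster labels against per-document sentiment labels."""
    table = defaultdict(Counter)
    for label, sentiment in zip(labels, sentiments):
        table[label][sentiment] += 1
    return {label: dict(c) for label, c in table.items()}


# Hypothetical inputs: four segmented documents in two clusters.
docs_tokens = [["比赛", "进球", "比赛"], ["比赛", "球迷"],
               ["股票", "上涨"], ["股票", "市场", "股票"]]
labels = [0, 0, 1, 1]
vectors = [(1.0, 0.1), (0.6, 0.0), (0.0, 0.9), (0.3, 1.4)]
centers = [(0.9, 0.1), (0.1, 1.0)]
sentiments = ["positive", "negative", "positive", "positive"]

keywords = top_keywords(docs_tokens, labels, n=1)
representatives = representative_docs(vectors, labels, centers)
crosstab = sentiment_by_cluster(labels, sentiments)
```

The three results correspond to the semantic keywords, the representative article and the sentiment distribution of each category, which are then passed to the visualization of step 5.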
An automated Chinese text topic exploration system, which applies the above method for automated Chinese text topic exploration, comprises:
the word vector construction module, which segments the Chinese text into words, screens out the nouns, verbs, adjectives and adverbs, constructs word vectors for the screened words using the TF-IDF algorithm, thereby vectorizing the Chinese text as a preprocessing step and converting the text data into points in a vector space, and reduces the dimensionality of the high-dimensional vector data using the TruncatedSVD singular value decomposition method;
the text clustering module, which clusters the texts using the Mini Batch K-Means clustering method; analyzes the sentiment tendency of each text using a sentiment analysis method; cross-tabulates the clustering result with the sentiment analysis result to obtain the overall sentiment distribution of the articles in each category; extracts the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, based on the center point of each category calculated by the Mini Batch K-Means clustering method, selects the article at the center point or nearest to it as the representative article of that category;
and the visualization module, which displays the obtained clustering information by a visualization method, so that the user summarizes the topic of each category from the clustering result and the auxiliary information provided for each category, completing the text topic exploration; the clustering result data can be stored in a variable that the user can call and cross-analyze with variables of other dimensions.
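The screening performed by the word vector construction module might look like the sketch below. It assumes the text has already been segmented and tagged by some part-of-speech tagger, and that tags follow the common convention in which noun, verb, adjective and adverb tags begin with n, v, a and d respectively. The patent names neither a tagger nor a tag set, so the tag prefixes, the function name `screen_words` and the sample tags are all assumptions.

```python
# Tag prefixes assumed to mark nouns, verbs, adjectives and adverbs.
# This convention is an assumption; the source does not specify a tag set.
KEPT_TAG_PREFIXES = ("n", "v", "a", "d")


def screen_words(tagged_words):
    """Keep only words whose POS tag starts with one of the kept prefixes."""
    return [word for word, tag in tagged_words if tag.startswith(KEPT_TAG_PREFIXES)]


# Hypothetical tagger output for one sentence: (word, tag) pairs.
tagged = [("我们", "r"), ("喜欢", "v"), ("美丽", "a"),
          ("的", "uj"), ("城市", "n"), ("非常", "d")]
kept = screen_words(tagged)
```

Matching on prefixes is a simplification: it also keeps subtypes such as proper nouns, and a real tag set may need an explicit list of tags instead.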
Text representation methods include the traditional one-hot model, the bag-of-words model, the TF-IDF model, the Word2Vec model, the FastText model, the GloVe model, the ELMo model, the GPT model, the BERT model and the like. The system adopts the TF-IDF method, which jointly considers how often each word occurs in a text (the TF value) and how well the word distinguishes between texts (the IDF value); the product of the two gives the word's weight in the vector representation. The importance of a word increases in proportion to the number of times it appears in a text, but decreases as the word appears in more texts of the corpus. TF = (number of times the term appears in the text) / (total number of terms in the text); IDF = log(total number of documents in the corpus / (number of documents containing the term + 1)); TF-IDF = TF × IDF. After the TF-IDF value of each word is obtained, the values are placed at the positions given by a fixed keyword order to construct a vector for each text, so that text data is expressed as vector data, which facilitates subsequent calculation.
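The TF-IDF formula in this paragraph can be worked through in a few lines of plain Python. This is a sketch under stated assumptions: the "+1" is read as belonging to the denominator of the IDF, the logarithm is taken as the natural logarithm, and the function name `tf_idf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs_tokens):
    """Build one TF-IDF vector per document over a shared, sorted vocabulary.

    TF  = term count in the document / total terms in the document
    IDF = log(number of documents / (documents containing the term + 1))
    """
    vocabulary = sorted({w for tokens in docs_tokens for w in tokens})
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for tokens in docs_tokens for w in set(tokens))
    idf = {w: math.log(n_docs / (doc_freq[w] + 1)) for w in vocabulary}

    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        # Each vocabulary position holds that word's TF-IDF value.
        vectors.append([counts[w] / total * idf[w] if total else 0.0
                        for w in vocabulary])
    return vocabulary, vectors


# Hypothetical, already-segmented sample documents.
docs_tokens = [["球队", "比赛", "比赛", "进球"], ["股票", "市场"],
               ["市场", "上涨"], ["投资", "上涨"]]
vocabulary, vectors = tf_idf_vectors(docs_tokens)
```

Under this reading of the formula, a word that occurs in all but one document gets an IDF of zero and a word that occurs in every document gets a slightly negative IDF, which is one reason library implementations often use a smoothed variant.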
Singular value decomposition (SVD) is an important matrix factorization method in linear algebra. A real m × n matrix A is decomposed into the product A = U Σ V^T of two orthogonal matrices and a diagonal matrix, where U is an m × m orthogonal matrix, Σ is an m × n diagonal matrix, and V is an n × n orthogonal matrix. TruncatedSVD keeps only the largest singular values and their singular vectors, which reduces the dimensionality of the text vectors.
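The dimensionality reduction can be written out as follows. The symbols $k$ (the number of retained dimensions) and $Z$ (the reduced data) are notation introduced here for illustration; $A$ is taken to be the $m \times n$ TF-IDF matrix with one row per text, and $U_k$, $V_k$ denote the first $k$ columns of $U$ and $V$, with $\Sigma_k$ the leading $k \times k$ block of $\Sigma$.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
Z = A V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}
```

Each text is thus represented by a row of $Z$ with $k$ coordinates instead of $n$, and the identity $A V_k = U_k \Sigma_k$ follows from the orthogonality of $V$.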
The Mini Batch K-Means algorithm builds on the K-Means algorithm. K-Means presets the number of classes K and randomly selects K sample points as the initial center points (centroids). It then computes the Euclidean distance between every sample point in the set and each of the K center points, and assigns each point to the subset of its nearest center point. The center point of each subset is then updated to the mean of the points in that subset. This is iterated until the center points of all subsets are stable or a specified threshold condition is reached, at which point the calculation ends and the clustering is complete. Because K-Means uses the full data set in every iteration while searching for stable center points, the calculation time increases greatly when the data volume or the number of iterations is large. Mini Batch K-Means instead adopts a sampling idea: each iteration draws only a small batch from the full data, which greatly shortens the calculation time. Iterating on samples loses some clustering quality, but with a large data volume the difference in precision is negligible.
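The sampling idea can be illustrated with a small plain-Python sketch. It uses a common formulation of mini-batch K-Means in which each center keeps a count of the points it has absorbed and moves toward each new point with step size 1/count; the patent does not give the update rule, so that rule, the function name `mini_batch_kmeans`, the parameter values and the sample points are all assumptions.

```python
import math
import random


def mini_batch_kmeans(points, k, batch_size=10, n_iter=100, seed=0):
    """Mini-batch K-Means with per-center step sizes (illustrative sketch)."""
    rng = random.Random(seed)
    # Randomly select k sample points as the initial center points.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of points each center has absorbed so far

    for _ in range(n_iter):
        # Draw a small random batch instead of using the full data set.
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign each sampled point to its nearest center (Euclidean distance).
        nearest = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                   for p in batch]
        # Move each center toward its assigned points with a shrinking step.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]

    # Final assignment of every point to its nearest center.
    labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
              for p in points]
    return centers, labels


# Hypothetical data: two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (0.3, 0.3),
          (5.0, 5.0), (5.1, 5.1), (5.2, 5.2), (5.3, 5.3)]
centers, labels = mini_batch_kmeans(points, k=2)
```

Each iteration touches only `batch_size` points rather than the full data set, which is where the time saving over full K-Means comes from.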
The invention provides a method for automated Chinese text topic exploration and an automated Chinese text topic exploration system applying the method. The system comprises a word vector construction module, a text clustering module and a visualization module. It provides additional descriptive information about each category, so that a person can extract text topics conveniently and quickly.
It is to be understood that the present invention is not limited to the above-described embodiments, and that equivalent modifications and substitutions may be made by those skilled in the art without departing from the spirit of the present invention, and that such equivalent modifications and substitutions are to be included within the scope of the appended claims.

Claims (4)

1. A method for automated Chinese text topic exploration, characterized by comprising the following steps:
step 1, segmenting the Chinese text into words, and screening out the nouns, verbs, adjectives and adverbs;
step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs using the TF-IDF algorithm, thereby vectorizing the Chinese text as a preprocessing step and converting the text data into points in a vector space, and reducing the dimensionality of the high-dimensional vector data using the TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;
step 3, clustering the texts using the Mini Batch K-Means clustering method;
step 4, analyzing the sentiment tendency of each text using a sentiment analysis method; cross-tabulating the clustering result with the sentiment analysis result to obtain the overall sentiment distribution of the articles in each category; extracting the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, based on the center point of each category calculated by the Mini Batch K-Means clustering method, selecting the article at the center point or nearest to it as the representative article of that category;
step 5, displaying the obtained clustering information by a visualization method, so that the user summarizes the topic of each category from the clustering result and the auxiliary information provided for each category, completing the text topic exploration; the clustering result data is stored in a variable that the user can call and cross-analyze with variables of other dimensions.
2. The method of claim 1, wherein the visualization method includes word clouds, pie charts and/or lists.
3. The method of claim 1, wherein N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
4. An automated Chinese text topic exploration system employing the method for automated Chinese text topic exploration of any of claims 1-3, comprising:
the word vector construction module, which segments the Chinese text into words, screens out the nouns, verbs, adjectives and adverbs, constructs word vectors for the screened words using the TF-IDF algorithm, thereby vectorizing the Chinese text as a preprocessing step and converting the text data into points in a vector space, and reduces the dimensionality of the high-dimensional vector data using the TruncatedSVD singular value decomposition method;
the text clustering module, which clusters the texts using the Mini Batch K-Means clustering method; analyzes the sentiment tendency of each text using a sentiment analysis method; cross-tabulates the clustering result with the sentiment analysis result to obtain the overall sentiment distribution of the articles in each category; extracts the top N keywords of each category, ranked by word frequency, as the semantic keywords of that category; and, based on the center point of each category calculated by the Mini Batch K-Means clustering method, selects the article at the center point or nearest to it as the representative article of that category;
and the visualization module, which displays the obtained clustering information by a visualization method, so that the user summarizes the topic of each category from the clustering result and the auxiliary information provided for each category, completing the text topic exploration; the clustering result data can be stored in a variable that the user can call and cross-analyze with variables of other dimensions.
CN202011603044.5A 2020-12-29 2020-12-29 Method and system for automatically exploring Chinese text theme Active CN112560469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011603044.5A CN112560469B (en) 2020-12-29 2020-12-29 Method and system for automatically exploring Chinese text theme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011603044.5A CN112560469B (en) 2020-12-29 2020-12-29 Method and system for automatically exploring Chinese text theme

Publications (2)

Publication Number Publication Date
CN112560469A true CN112560469A (en) 2021-03-26
CN112560469B CN112560469B (en) 2023-07-04

Family

ID=75034320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011603044.5A Active CN112560469B (en) 2020-12-29 2020-12-29 Method and system for automatically exploring Chinese text theme

Country Status (1)

Country Link
CN (1) CN112560469B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899335A (en) * 2015-06-25 2015-09-09 四川友联信息技术有限公司 Method for performing sentiment classification on network public sentiment of information
CN107103043A (en) * 2017-03-29 2017-08-29 国信优易数据有限公司 A kind of Text Clustering Method and system
CN108536762A (en) * 2018-03-21 2018-09-14 上海蔚界信息科技有限公司 A kind of high-volume text data automatically analyzes scheme
CN109299280A (en) * 2018-12-12 2019-02-01 河北工程大学 Short text clustering analysis method, device and terminal device
WO2020101477A1 (en) * 2018-11-14 2020-05-22 Mimos Berhad System and method for dynamic entity sentiment analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐莉莎: "《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》", 15 June 2020, pages: 138 - 1311 *

Also Published As

Publication number Publication date
CN112560469B (en) 2023-07-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant