CN112560469A

CN112560469A - Method and system for automatically exploring Chinese text topics

Info

Publication number: CN112560469A
Application number: CN202011603044.5A
Authority: CN
Inventors: 张荣显
Original assignee: Zhuhai Hengqin Boyi Data Technology Co ltd
Current assignee: Zhuhai Hengqin Boyi Data Technology Co ltd
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-03-26
Anticipated expiration: 2040-12-29
Also published as: CN112560469B

Abstract

The invention discloses a method and a system for automatically exploring Chinese text topics. The system comprises a word vector construction module, a text clustering module and a visualization module. It provides more classification feature information, so that text topics can be extracted manually more conveniently and quickly.

Description

Method and system for automatically exploring Chinese text topics

Technical Field

The invention relates to the field of text topic exploration, and in particular to a method and a system for automatically exploring Chinese text topics.

Background

There are various topic exploration methods, such as LDA topic extraction and K-Means text clustering based on unsupervised learning. The LDA topic model infers topics from a probabilistic and statistical perspective using Bayesian ideas, while the K-Means clustering model clusters scattered points according to their distances in a vector space. Either way, the texts are finally divided into different clusters or classes; on this basis, further information extraction and induction are performed manually, which achieves the purpose of text topic extraction. In this context, K-Means has the following disadvantages:

the objective of K-Means is to minimize the sum of squared distances from each cluster member to the centroid of the cluster containing it; every iteration requires calculating the distances from all data points to the centroids, so as the analyzed data set keeps growing, the amount of calculation and the time consumed keep increasing;

K-Means can only divide the texts into a given number of clusters or classes; it provides no further classification information, which makes it inconvenient to summarize and extract text topics quickly.

In order to overcome the defects of the K-Means clustering method, an automatic Chinese text topic exploration method and system are constructed.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method and a system for automatically exploring Chinese text topics.

The technical scheme adopted by the embodiment of the invention for solving the technical problem is as follows: a method for automatically exploring Chinese text topics comprises the following steps:

step 1, segmenting words of a Chinese text, and screening out nouns, verbs, adjectives and adverbs;

step 2, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using a TF-IDF algorithm, vectorizing and preprocessing the Chinese text, converting text data into vector points of a space, and reducing the dimension of high-dimensional vector space data by using a TruncatedSVD singular value decomposition method so as to improve the calculation speed of the model;

step 3, clustering the texts by using a Mini Batch K-Means clustering method;

step 4, analyzing the emotional tendency of the texts by using a sentiment analysis method; carrying out cross statistics on the clustering result and the sentiment analysis result to obtain the overall sentiment tendency distribution of each class of articles; extracting the top N keywords of each class, ranked by word frequency, as the semantic keywords of that class; and, according to the center point of each class calculated by the Mini Batch K-Means clustering method, selecting the article at the center point or nearest to it as the representative article of that class;

step 5, displaying the obtained clustering information through a visualization method; the user summarizes and induces the topic information of each class from the clustering result and the auxiliary information provided for each subclass to finish the text topic exploration; the clustering result data are stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
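As an illustration of step 1, the part-of-speech screening can be sketched as below. This is a minimal sketch, not the patented implementation: it assumes the text has already been segmented and tagged by some Chinese segmenter, and the tag prefixes (`n`, `v`, `a`, `d`) are an assumed ICTCLAS-style convention. The function name `filter_content_words` and the sample data are hypothetical.

```python
# Assumption: input is a list of (word, part-of-speech tag) pairs produced by
# a Chinese word segmenter. The tag prefixes below follow an ICTCLAS-style
# convention (n = noun, v = verb, a = adjective, d = adverb); adjust them to
# the tag set of the segmenter actually used.
KEPT_PREFIXES = ("n", "v", "a", "d")


def filter_content_words(tagged_tokens):
    """Keep only nouns, verbs, adjectives and adverbs (step 1)."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEPT_PREFIXES)]


# Hypothetical tagged sentence: particles ("uj") and punctuation ("x") are dropped.
tagged = [("服务", "n"), ("的", "uj"), ("非常", "d"), ("好", "a"), ("，", "x"), ("推荐", "v")]
print(filter_content_words(tagged))
```

Filtering by tag prefix keeps sub-categories (for example a proper-noun tag that starts with `n`) without listing every tag explicitly.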

The visualization method includes word cloud pictures, pie charts and/or lists.

N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.
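The keyword and representative-article parts of step 4 can be sketched as follows. This is a hedged illustration using only the Python standard library; the function names, the toy documents and the toy center coordinates are hypothetical, and the centers are assumed to come from an earlier clustering step.

```python
import math
from collections import Counter


def top_n_keywords(docs_tokens, labels, n=5):
    """Return {cluster label: top-n words of that cluster by raw word frequency}."""
    if not 1 <= n <= 10:
        raise ValueError("N is bounded to 1..10 in the text")
    counts = {}
    for tokens, label in zip(docs_tokens, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {label: [w for w, _ in c.most_common(n)] for label, c in counts.items()}


def representative_docs(vectors, labels, centers):
    """Return {cluster label: index of the document nearest to that cluster's center}."""
    best = {}
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        dist = math.dist(vec, centers[label])  # Euclidean distance
        if label not in best or dist < best[label][0]:
            best[label] = (dist, i)
    return {label: idx for label, (_, idx) in best.items()}


# Hypothetical toy data: four segmented documents in two clusters.
docs_tokens = [["价格", "便宜", "价格"], ["价格", "优惠"], ["物流", "慢", "物流"], ["物流", "快递"]]
labels = [0, 0, 1, 1]
vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
centers = [(0.9, 0.0), (5.1, 5.0)]  # assumed output of the clustering step

print(top_n_keywords(docs_tokens, labels, n=1))
print(representative_docs(vectors, labels, centers))
```

Because a cluster center is a mean and rarely coincides with an actual document, the sketch always returns the nearest document, which also covers the case where a document sits exactly on the center.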

An automated Chinese text topic exploration system, which uses the above method for automated Chinese text topic exploration, comprises:

the word vector construction module, which is used for segmenting the Chinese text into words, screening out nouns, verbs, adjectives and adverbs, constructing word vectors for the screened nouns, verbs, adjectives and adverbs by using a TF-IDF algorithm, vectorizing and preprocessing the Chinese text, converting text data into vector points of a space, and reducing the dimension of high-dimensional vector space data by using a TruncatedSVD singular value decomposition method;

the text clustering module, which is used for clustering the texts by using a Mini Batch K-Means clustering method; analyzing the emotional tendency of the texts by using a sentiment analysis method; carrying out cross statistics on the clustering result and the sentiment analysis result to obtain the overall sentiment tendency distribution of each class of articles; extracting the top N keywords of each class, ranked by word frequency, as the semantic keywords of that class; and, according to the center point of each class calculated by the Mini Batch K-Means clustering method, selecting the article at the center point or nearest to it as the representative article of that class;

and the visualization module, which is used for displaying the obtained clustering information through a visualization method; the user summarizes and induces the topic information of each class from the clustering result and the auxiliary information provided for each subclass to finish the text topic exploration; the clustering result data can be stored in a variable for the user to call and can be cross-analyzed with variables of other dimensions.
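The cross statistics between the clustering result and another variable, such as the sentiment result, amount to a contingency table. The sketch below is one possible way to compute it with the standard library; the function names and the sentiment labels ("positive", "negative", "neutral") are hypothetical, since the text does not specify the output format of the sentiment analysis.

```python
from collections import Counter, defaultdict


def cross_tabulate(cluster_labels, other_labels):
    """Count how often each value of another variable occurs within each cluster."""
    table = defaultdict(Counter)
    for cluster, other in zip(cluster_labels, other_labels):
        table[cluster][other] += 1
    return {cluster: dict(counts) for cluster, counts in table.items()}


def row_shares(table):
    """Convert counts to per-cluster proportions (the distribution within each class)."""
    return {
        cluster: {value: n / sum(counts.values()) for value, n in counts.items()}
        for cluster, counts in table.items()
    }


# Hypothetical stored clustering result and per-document sentiment labels.
cluster_labels = [0, 0, 1, 1, 1]
sentiments = ["positive", "negative", "negative", "negative", "neutral"]

table = cross_tabulate(cluster_labels, sentiments)
print(table)
print(row_shares(table))
```

The same function can be reused for any other dimension variable (for example, a publication source per document) because it only requires two label sequences of equal length.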

The invention has the following beneficial effects: the system comprises a word vector construction module, a text clustering module and a visualization module, and by applying the method for automatically exploring Chinese text topics in the system, the problem that the K-Means clustering method takes a long time to compute can be solved; in addition, more classification feature information is provided, so that text topics can be extracted manually more conveniently and quickly.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a block diagram of an automated Chinese text topic exploration system.

Detailed Description

Reference will now be made in detail to the present preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout.

In the description of the present invention, "a plurality" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, and "above", "below", "within" and the like are understood as including the stated number. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the technical features indicated, or the precedence of the technical features indicated.

In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.

In the present invention, unless explicitly defined otherwise, the terms "disposed," "mounted," "connected," and the like are to be understood in a broad sense. For example, a connection may be direct or indirect through an intermediate; it may be fixed, detachable or integrally formed; it may be mechanical; and it may be an internal communication between two elements or an interaction between two elements. The specific meaning of the above terms in the present invention can be reasonably determined by those skilled in the art in combination with the detailed contents of the technical solutions.

Referring to FIG. 1, a method for automated Chinese text topic exploration includes the following steps:

step 3, clustering the texts by using a Mini Batch K-Means clustering method;

The visualization method includes word cloud pictures, pie charts and/or lists.

Text representation methods include the traditional one-hot model, the bag-of-words model and the TF-IDF model, as well as models such as Word2Vec, FastText, GloVe, ELMo, GPT and BERT. The system adopts the TF-IDF method, which considers both the frequency with which a word occurs in a text (the TF value) and the word's ability to distinguish between texts (the IDF value); the weight of a word is the product of the two. The importance of a word increases in proportion to the number of times it appears in a text, but decreases with the frequency with which it appears across the corpus:

TF = (number of times the term appears in the text) / (total number of terms in the text)

IDF = log(total number of documents in the corpus / (number of documents containing the term + 1))

TF-IDF = TF × IDF

After the TF-IDF value of each word is obtained, the values are placed at the positions given by the keyword ordering to construct a sentence vector, so that text data can be expressed as vector data, which facilitates subsequent calculation.
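The TF-IDF weighting described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch that follows the formulas as stated in the text, reading the "+1" as part of the denominator; the function name `tfidf_vectors` and the toy corpus are hypothetical. Note that library implementations often use different smoothing and normalization, so their values will generally differ from this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs_tokens):
    """Build one TF-IDF vector per document over a shared, sorted vocabulary."""
    vocab = sorted({word for doc in docs_tokens for word in doc})
    n_docs = len(docs_tokens)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs_tokens for word in set(doc))
    # IDF = log(total documents / (documents containing the word + 1)).
    idf = {word: math.log(n_docs / (df[word] + 1)) for word in vocab}
    vectors = []
    for doc in docs_tokens:
        counts = Counter(doc)
        total = len(doc)
        # TF = occurrences of the word / total number of words in the document.
        vectors.append([counts[w] / total * idf[w] if total else 0.0 for w in vocab])
    return vocab, vectors


# Hypothetical corpus of four already segmented and filtered documents.
docs = [["价格", "便宜"], ["价格", "优惠"], ["物流", "慢"], ["物流", "快"]]
vocab, vectors = tfidf_vectors(docs)
for word, weight in zip(vocab, vectors[0]):
    print(word, round(weight, 4))
```

With this definition, a word that occurs in every document gets a slightly negative IDF, because the denominator exceeds the document count; this is a property of the formula as written, not of the sketch.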

SVD (singular value decomposition) is an important matrix decomposition method in linear algebra. A real m × n matrix A is decomposed into the product A = U Σ V^T of an orthogonal matrix, a diagonal matrix and the transpose of an orthogonal matrix, where U is an m × m orthogonal matrix, Σ is an m × n diagonal matrix, and V is an n × n orthogonal matrix.
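For reference, the way a truncated decomposition reduces dimension can be written as below. The rank k and the symbols U_k, Σ_k, V_k and Z are notation assumed here for illustration (they do not appear in the original text): U_k and V_k hold the first k columns of U and V, and Σ_k is the leading k × k block of Σ with the k largest singular values.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
Z = A V_k = U_k \Sigma_k \in \mathbb{R}^{m \times k}
```

Each of the m texts, originally a row of A with n TF-IDF components, is then represented by the corresponding row of Z with only k components, which is what makes the subsequent clustering cheaper.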

The Mini Batch K-Means algorithm is built on the K-Means algorithm. In K-Means, a number of classes K is preset and K sample points are randomly selected as initial center points (centroids). The Euclidean distance between every sample point in the set and each of the K center points is then calculated, and each point is assigned to the subset of its nearest center point. Within each subset, the center point is updated as the mean of the subset's points. Iteration continues until the center points of all subsets are stable or a specified threshold condition is reached, at which point the calculation ends and the clustering is complete. K-Means therefore uses the full data set in every iteration while searching for stable center points, so the calculation time increases greatly when the data volume or the number of iterations is large. Mini Batch K-Means adopts a sampling idea: each iteration only samples a small batch from the full data, which greatly shortens the calculation time. Although iterating on samples loses some clustering quality, the difference in precision is negligible when the data volume is large.
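The sampling idea can be illustrated with a small standard-library sketch. It is an assumption-laden illustration rather than the patented implementation: the per-center running-mean update (learning rate 1 / count) is one common mini-batch variant that the text does not specify, and the function name, parameters and the optional `init_centers` argument (added only to make demonstrations reproducible) are hypothetical.

```python
import math
import random


def mini_batch_kmeans(points, k, batch_size=10, n_iter=100, seed=0, init_centers=None):
    """Cluster points by updating centers from small random batches."""
    rng = random.Random(seed)
    # Default: pick K sample points at random as the initial centers.
    start = init_centers if init_centers is not None else rng.sample(points, k)
    if len(start) != k:
        raise ValueError("need exactly k initial centers")
    centers = [list(c) for c in start]
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the batch using the centers as they stand before this update.
        nearest = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in batch]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # assumed per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    # Final pass: label every point with its nearest center.
    labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
    return centers, labels


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, labels = mini_batch_kmeans(points, k=2, batch_size=4)
print(centers)
print(labels)
```

Each iteration computes distances only for `batch_size` points instead of the whole data set, which is where the time saving over full K-Means comes from; only the final labeling pass touches every point.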

The invention provides a method for automatically exploring Chinese text topics and an automated Chinese text topic exploration system applying the method. The system comprises a word vector construction module, a text clustering module and a visualization module. It provides more classification feature information, so that text topics can be extracted manually more conveniently and quickly.

It is to be understood that the present invention is not limited to the above-described embodiments, and that equivalent modifications and substitutions may be made by those skilled in the art without departing from the spirit of the present invention, and that such equivalent modifications and substitutions are to be included within the scope of the appended claims.

Claims

1. A method for automatically exploring Chinese text topics is characterized by comprising the following steps:

step 3, clustering the texts by using a Mini Batch K-Means clustering method;

2. The method of claim 1, wherein the visualization method includes word cloud pictures, pie charts and/or lists.

3. The method of claim 1, wherein N in step 4 is a positive integer greater than or equal to 1 and less than or equal to 10.

4. An automated Chinese text topic exploration system employing the method of automated Chinese text topic exploration of any of claims 1-3, comprising: