WO2017148267A1 - Text information clustering method and text information clustering system - Google Patents

Text information clustering method and text information clustering system

Info

Publication number
WO2017148267A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
topics
clustering
level
words
Prior art date
Application number
PCT/CN2017/073720
Other languages
English (en)
French (fr)
Inventor
付子豪
张凯
蔡宁
杨旭
褚崴
Original Assignee
Alibaba Group Holding Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Priority to JP2018544207A (published as JP2019511040A)
Publication of WO2017148267A1
Priority to US16/116,851 (published as US20180365218A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Definitions

  • the present application relates to the field of text processing, and in particular, to a text information clustering method and a text information clustering system.
  • clustering text information by corresponding topic has very important applications in the field of text processing.
  • because text information covers a very wide range, the amount of text information generated every day is also very large; therefore, large-scale text clustering analysis is of great significance.
  • embodiments of the present application have been made in order to provide a text information clustering method and a text information clustering system that overcome the above problems or at least partially solve the above problems.
  • an embodiment of the present application discloses a text information clustering method, including the following steps:
  • performing word segmentation on each piece of text information among multiple pieces of text information;
  • performing initial clustering on the multiple pieces of text information after word segmentation to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
  • determining, according to the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
  • performing secondary clustering on the at least two pieces of text information included in each first-level topic, according to the number of second-level topics under it, to form a plurality of second-level topics.
  • Another embodiment of the present invention discloses a text information clustering system, including:
  • a word segmentation processing module configured to perform word segmentation processing on each piece of text information in multiple pieces of text information
  • a primary clustering module configured to perform initial clustering on the plurality of text information processed by the word segmentation to form a plurality of first-level topics, each of the first-level topics including at least two pieces of text information;
  • a subject number determining module configured to determine, according to a preset rule, the number of second-level topics under each of the first-level topics based on the number of text information under each of the first-level topics;
  • a secondary clustering module configured to perform secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic, to form a plurality of second-level topics.
  • the text information clustering method and the text information clustering system proposed in the embodiments of the present application have at least the following advantages:
  • the hierarchical clustering method reduces the total number of first-level topics in the initial clustering, which improves computational efficiency and reduces system resource consumption;
  • in the secondary clustering, the number of second-level topics is determined dynamically according to the number of pieces of text information, which reduces the average number of pieces of text information under each second-level topic and speeds up the computation of the second-level topics.
  • FIG. 1 is a schematic diagram of the principle of an LDA algorithm used in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a text information clustering method according to a first embodiment of the present invention.
  • FIG. 3 is a flow chart of a text information clustering method according to a second embodiment of the present invention.
  • FIG. 4 is a flow chart of a text information clustering method according to a third embodiment of the present invention.
  • Figure 5 is a block diagram of a text information clustering system in accordance with a fourth embodiment of the present invention.
  • Figure 6 is a block diagram of a text information clustering system in accordance with a fifth embodiment of the present invention.
  • One of the core ideas of the present application is to cluster multiple pieces of text information two or more times by an algorithm: a plurality of first-level topics are generated after the initial clustering; the number of second-level topics under each first-level topic is then determined according to the number of pieces of text information under that topic; and afterwards, according to the number of second-level topics under each first-level topic, secondary clustering is performed on the at least two pieces of text information under that first-level topic to generate a plurality of second-level topics.
  • for example, suppose the system needs to cluster 5000 pieces of text information.
  • according to the text information clustering method provided in the present application, an algorithm can first cluster the 5000 pieces of text information into 5 first-level topics.
  • after the initial clustering, the numbers of pieces of text information included in the five first-level topics are, for example, 1000, 1500, 500, 1800, and 200; the number of second-level topics into which each first-level topic should be divided is then determined according to the number of pieces of text information under it.
  • for example, manual analysis or algorithm parameter settings can be used to determine that the above five first-level topics should be divided into 10, 15, 5, 18, and 2 second-level topics, respectively.
  • afterwards, according to the numbers of second-level topics above, secondary clustering is performed on each first-level topic, generating 10, 15, 5, 18, and 2 second-level topics, each of which includes several pieces of text information.
  • as those skilled in the art know, the number of pieces of text information to be processed in practice is usually far more than 5000 and may be of a higher order of magnitude;
  • the above example is given only for ease of understanding and is not a particular limitation of the present invention.
  • in the embodiments of the present application, multiple pieces of text information can be clustered with the LDA algorithm.
  • the LDA (Latent Dirichlet Allocation) algorithm is a document topic model algorithm.
  • the algorithm introduces a Bayesian framework into the existing pLSA algorithm and can better represent the document generation model.
  • the specific implementation steps are as follows:
  • Figure 1 shows the schematic of the LDA algorithm.
  • the topics of a piece of text information follow a multinomial distribution with parameter θ
  • the prior of this topic distribution is a Dirichlet distribution with parameter α
  • z represents a topic drawn from the topic distribution; for each topic,
  • the words under that topic likewise follow a multinomial distribution with parameter φ, and the prior of this distribution is a Dirichlet distribution with parameter β.
  • for each randomly selected topic, the corresponding words are drawn from its corresponding distribution.
  • M indicates the number of articles
  • N indicates the number of words
  • K indicates the number of topics
  • w indicates a word
  • a shaded node indicates an observable quantity
  • a rectangle indicates repetition
  • the number of repetitions is given by the letter in its lower-right corner; after the modeling is completed, the final parameter estimation is performed by Gibbs sampling.
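  • as an illustration only (the patent does not name any library), the LDA clustering described above can be sketched with the open-source gensim package; the toy corpus, the topic count, and the hyperparameters below are assumptions of this sketch, not the patent's implementation.

        # A minimal LDA clustering sketch with gensim (an assumed library).
        from gensim import corpora, models

        docs = [["python", "object", "oriented", "language"],
                ["news", "topic", "cluster", "text"],
                ["language", "program", "computer", "python"]]

        dictionary = corpora.Dictionary(docs)              # word <-> id mapping
        corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words vectors

        # "alpha" corresponds to the Dirichlet prior α above (eta would set β)
        lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                              alpha="auto", passes=10)

        for bow in corpus:                                 # dominant topic per doc
            print(max(lda.get_document_topics(bow), key=lambda t: t[1]))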
  • in the text information clustering method proposed in the present application, a hierarchical LDA clustering framework is constructed by adopting hierarchical clustering.
  • in the initial clustering, the total number of first-level topics is reduced, which improves computational efficiency and reduces system resource consumption; in the secondary clustering, the number of second-level topics is determined dynamically according to the number of pieces of text information, which reduces the average number of pieces of text information under each second-level topic, decouples the first-level topics from one another, and speeds up the computation of the second-level topics through parallel processing.
  • FIG. 2 is a flowchart of a text information clustering method according to an embodiment of the present application.
  • the text information clustering method of the first embodiment of the present application includes the following steps:
  • Step S101: performing word segmentation on each piece of text information among the multiple pieces of text information to form a plurality of words;
  • in this step, word segmentation can first be performed on each piece of text information.
  • "Python is an object-oriented, interpreted computer programming language” can be divided into "Python / yes / one / oriented / object / interpretation / type / computer / program / design / language”.
  • in this step, the words appearing in the text information can be compared with the words in a preset word library.
  • when a word appearing in the text information matches a word in the word library, that word is segmented out.
  • it is worth noting that the "words" mentioned throughout this text may be single characters or multi-character words.
  • the "face” in the text information coincides with the "face” in the word library, the "face” in the text information is separately segmented.
  • the "type” in the text information coincides with the "type” in the word library, the "type” in the text information is separately segmented.
  • next, step S102 may be performed: performing initial clustering on the plurality of pieces of text information after word segmentation according to the plurality of words, forming a plurality of first-level topics, each first-level topic including at least two pieces of text information;
  • all text information can be initially clustered using the aforementioned LDA algorithm.
  • in this clustering pass, given the large amount of text information, the number of first-level topics can be set relatively small, to avoid consuming excessive computational resources and slowing down the computation.
  • through the initial clustering, the text information can be roughly divided into several first-level topics of different sizes, each containing a different number of pieces of text information.
  • for example, following the earlier example, when 5000 pieces of text information are clustered, the LDA algorithm clusters them in this step into 5 first-level topics.
  • the numbers of pieces of text information under these first-level topics are, for example, 1000, 1500, 500, 1800, and 200.
  • then, step S103 may be performed: determining, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
  • in this step, the number of second-level topics under each first-level topic can be determined from the number of pieces of text information under it, using the parameter settings of the LDA algorithm or manual settings.
  • the number of secondary topics under each level of topic may be the same or different.
  • the preset rule here may be, for example: each second-level topic should contain X pieces of text information, where M ≤ X ≤ N and M and N are values specified by the developer or the user. For example, if 90 ≤ X ≤ 110, X can be chosen as the mean value 100; on this basis, the numbers of second-level topics in the above example are 1000/100 = 10, 1500/100 = 15, 500/100 = 5, 1800/100 = 18, and 200/100 = 2.
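  • a sketch of this preset rule, under the stated assumption that X is taken as the mean of M and N:

        # Number of second-level topics: aim for about X documents per topic.
        def num_secondary_topics(doc_count, m=90, n=110):
            x = (m + n) // 2                    # X = 100 when 90 <= X <= 110
            return max(1, round(doc_count / x))

        print([num_secondary_topics(c) for c in [1000, 1500, 500, 1800, 200]])
        # -> [10, 15, 5, 18, 2]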
  • afterwards, step S104 may be performed: according to the number of second-level topics under each first-level topic, performing secondary clustering on the pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics.
  • in this step, the aforementioned LDA algorithm can be used for the secondary clustering.
  • in this clustering pass, for the pieces of text information under each first-level topic, clustering is performed (for example, with the LDA algorithm) according to the number of second-level topics into which that first-level topic should be divided, forming the specified number of second-level topics.
  • for example, following the earlier example, each first-level topic is secondarily clustered, generating 10, 15, 5, 18, and 2 second-level topics respectively, each second-level topic including several pieces of text information. Since the secondary clustering of each first-level topic is an independent process, these clustering jobs can be processed simultaneously, i.e., in parallel, which increases the computation speed.
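  • because each first-level topic is clustered independently, the jobs can run in parallel; the sketch below (toy data, assumed gensim library, fixed topic counts) illustrates this decoupling with a process pool:

        # Secondary clustering of each first-level topic, run in parallel.
        from multiprocessing import Pool
        from gensim import corpora, models

        # toy data: first-level topic id -> tokenized documents under it
        first_level = {
            0: [["python", "code"], ["java", "code"], ["rust", "code"], ["go", "code"]],
            1: [["news", "sport"], ["news", "music"], ["news", "film"], ["news", "art"]],
        }

        def cluster_topic(args):
            topic_id, docs, k = args                   # k: 2nd-level topic count
            dictionary = corpora.Dictionary(docs)
            corpus = [dictionary.doc2bow(d) for d in docs]
            lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary)
            return topic_id, lda.show_topics(num_topics=k)

        if __name__ == "__main__":                     # needed for multiprocessing
            jobs = [(tid, docs, 2) for tid, docs in first_level.items()]
            with Pool() as pool:                       # decoupled -> parallel
                second_level = dict(pool.map(cluster_topic, jobs))
            print(sorted(second_level))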
  • in the text information clustering method proposed in this embodiment, hierarchical clustering reduces the total number of first-level topics in the initial clustering, improving computational efficiency and reducing system resource consumption; in the secondary clustering, the number of second-level topics is determined dynamically from the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
  • FIG. 3 is a flowchart of a text information clustering method according to a second embodiment of the present application.
  • the text information clustering method of the second embodiment of the present application includes the following steps:
  • Step S201 performing segmentation processing on each piece of text information in the plurality of pieces of text information to form a plurality of words
  • Step S202 using the LDA algorithm, the plurality of text information processed by the word segmentation is initially clustered according to the plurality of words to form a plurality of first-level topics, and each of the first-level topics includes at least two pieces of text information;
  • Step S203: determining, according to the preset rule, the number of second-level topics under each first-level topic based on the number of pieces of text information under that topic;
  • Step S204: according to the number of second-level topics under each first-level topic, performing secondary clustering (using the LDA algorithm) on the pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics, each of which includes multiple pieces of text information.
  • steps S201 to S204 are the same as or similar to the steps S101 to S104 in the first embodiment, and are not described herein again.
  • in this embodiment, after step S201, the method may further include the following steps:
  • S201a: when a symbol, an English word and/or a number is detected in the text information during word segmentation, judging the degree of relevance between that symbol, English word and/or number and the text information;
  • S201b: when the degree of relevance is judged to be below a specified value, deleting the symbol, English word and/or number.
  • the symbol may be a standalone symbol, such as "&" or "%", or content composed of symbols, numbers and letters, such as a link.
  • the degree of correlation between the symbol and the content of the text information is determined in step S201a by a specific method, and when it is judged that the degree of correlation is low, the symbol is deleted.
  • similarly, the degree of relevance between an English word and the content of the text information is judged; for example, when the text information contains "厄尔尼诺现象 (El Nino)", the English words serve only as an annotation, and once they are judged to be a mere annotation, they can be deleted.
  • the degree of correlation between the number and the content of the text information can be judged in the same manner, and when it is judged that the degree of correlation is low, the number is deleted.
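  • the patent leaves the relevance measure unspecified; the following crude sketch simply drops bare symbols, links, and standalone numbers while keeping alphabetic tokens such as "Python":

        # Heuristic cleanup of low-relevance symbols/numbers (an assumption;
        # the patent's own relevance judgment is not specified).
        import re

        DROP = re.compile(r"^(https?://\S+|[0-9]+|[^\w]+)$")

        def clean_tokens(tokens):
            return [t for t in tokens if not DROP.match(t)]

        print(clean_tokens(["Python", "是", "&", "100", "http://t.cn/x", "语言"]))
        # -> ['Python', '是', '语言']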
  • in this embodiment, after step S201, the method may further include the following step:
  • S201c: when an English word is detected in the text information during word segmentation, segmenting that English word out as a single word. For example, in the earlier example, "Python" is highly relevant to the content of the text information; deleting it would make the exact meaning of the text unrecoverable and the classification incorrect, so "Python" is segmented out and kept as a single word.
  • in this embodiment, after step S201, the method may further include the following steps:
  • S201d: detecting whether each word after word segmentation is identical to a word in a preset stop table;
  • S201e: when any word after word segmentation is detected to be identical to a word in the preset stop table, deleting that word.
  • the word segmentation result usually contains a number of meaningless words such as "的", "了" and "过"; these words not only contribute nothing to the result but also occupy a great deal of computation and storage resources, so they need to be filtered out before the computation.
  • specifically, meaningless words such as "的", "了" and "过" can be collected in a preset stop table, and when such a word appears in the text information, it is deleted from the text information.
  • in addition, in practice some words interfere with normal classification, such as the source markers of some text information; these words can also be collected in the preset stop table, and when they are detected in the text information, they are deleted from it.
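  • a minimal sketch of the stop-table filtering; the three words below come from the description itself, the example sentence is illustrative:

        # Remove words that appear in the preset stop table.
        STOP_TABLE = {"的", "了", "过"}

        def remove_stop_words(tokens):
            return [t for t in tokens if t not in STOP_TABLE]

        print(remove_stop_words(["我", "看", "过", "这", "则", "新闻", "了"]))
        # -> ['我', '看', '这', '则', '新闻']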
  • it is also worth noting that steps S201a and S201b, step S201c, and steps S201d and S201e are not necessarily performed in sequence; rather, steps S201a and S201b, step S201c, and/or steps S201d and S201e may be performed selectively.
  • in this embodiment, after step S202 (initially clustering the word-segmented text information with the LDA algorithm to form a plurality of first-level topics), the text information clustering method may further include the following step:
  • S202a: merging two or more first-level topics whose numbers of pieces of text information are less than a first value into one first-level topic.
  • in this step, an algorithm or manual inspection can be used to detect whether the number of pieces of text information under each first-level topic is less than a first threshold; if so, that first-level topic is merged with other first-level topics for the subsequent computation.
  • for example, following the earlier example, the numbers of pieces of text information under the first-level topics formed by the clustering in step S202 are 1000, 1500, 500, 1800, and 200. If the first threshold is set to 300, the last first-level topic is found to include fewer pieces of text information than the first threshold; it can then be merged with another topic, for example the third first-level topic above, before the second-level topics are clustered.
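  • a sketch of this merging rule; the choice of merge target (here simply the largest remaining topic) is an assumption, since the patent only requires merging with some other first-level topic:

        # Merge first-level topics whose document count is below the threshold.
        def merge_small_topics(topic_docs, first_threshold=300):
            small = [t for t, d in topic_docs.items() if len(d) < first_threshold]
            kept = {t: list(d) for t, d in topic_docs.items()
                    if len(d) >= first_threshold}
            target = max(kept, key=lambda t: len(kept[t]))  # assumed merge target
            for t in small:
                kept[target].extend(topic_docs[t])
            return kept

        sizes = {0: [0] * 1000, 1: [0] * 1500, 2: [0] * 500,
                 3: [0] * 1800, 4: [0] * 200}
        print({t: len(d) for t, d in merge_small_topics(sizes).items()})
        # -> {0: 1000, 1: 1500, 2: 500, 3: 2000}   (topic 4 merged into 3)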
  • in the text information clustering method proposed in this embodiment, hierarchical clustering reduces the total number of first-level topics in the initial clustering, improving computational efficiency and reducing system resource consumption; in the secondary clustering, the number of second-level topics is determined dynamically from the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
  • meanwhile, meaningless words and/or symbols are deleted during clustering, and first-level topics containing few pieces of text information are merged, further optimizing the computation and reducing its cost.
  • FIG. 4 is a flowchart of a text information clustering method according to a third embodiment of the present application.
  • the text information clustering method of the third embodiment of the present application includes the following steps:
  • Step S301 performing segmentation processing on each piece of text information in the plurality of pieces of text information to form a plurality of words
  • Step S302 using the LDA algorithm, the plurality of text information processed by the word segmentation is initially clustered according to the plurality of words to form a plurality of first-level topics, and each of the first-level topics includes at least two pieces of text information;
  • Step S303 determining, according to the preset rule, the number of secondary topics under each of the first-level topics based on the number of text information under each of the first-level topics;
  • Step S304: according to the number of second-level topics under each first-level topic, performing secondary clustering (using the LDA algorithm) on the at least two pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics.
  • after step S104 of the first embodiment or step S204 of the second embodiment is completed, step S305 is performed: evaluating the degree of matching of the plurality of second-level topics generated by the secondary clustering; and
  • Step S306: obtaining a matching-degree evaluation result. When the result indicates that the clustering is unsatisfactory, the parameters of the LDA algorithm are adjusted according to the evaluation result.
  • when the matching evaluation indicates that the clustering is unsatisfactory, the adjustable parameters include, for example, the number of topics, the frequency threshold for low-frequency words, the threshold on the number of pieces of text information a topic must contain before being merged, and the contents of the stop table.
  • the number of topics is, for example, the value K in FIG. 1;
  • the frequency threshold for low-frequency words may be, for example, a manually or machine-set threshold: after all the text information has been segmented, words whose frequency of occurrence falls below the threshold can be regarded as low-frequency words;
  • in this step, the frequency threshold for low-frequency words can be adjusted to increase or decrease the number of low-frequency words, thereby influencing the clustering result;
  • the threshold on the number of pieces of text information a topic must contain before being merged may likewise be set manually or by machine: when one or more topics contain fewer pieces of text information than this threshold, those topics are considered to need merging, and modifying the threshold raises or lowers the merging bar, thereby influencing the clustering result;
  • the stop table may be, for example, the table provided in the second embodiment, which may store a plurality of stop words; adjusting the stop words' content influences the clustering result.
  • in this step, the second-level topics generated by the clustering can be evaluated manually or with machine algorithms; since the result of the secondary clustering varies considerably with the text information, it needs to be evaluated.
  • the specific evaluation may include checking whether the pieces of text information under several second-level topics concern the same content, and using this criterion to judge whether the clustering is appropriate, whether unsuitable words have been selected as keywords, whether second-level topics are aliased with one another, and whether the chosen numbers of first-level and second-level topics are appropriate. If the result does not meet expectations, further adjustments can be made manually or by machine algorithms as needed, for example adjusting the parameters of the LDA algorithm.
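  • one possible machine evaluation (an assumption; the patent only requires manual or algorithmic evaluation) is topic coherence, here reusing lda, docs, and dictionary from the earlier gensim sketch:

        # Score the clustering; a low score suggests re-tuning the parameters
        # (topic counts, low-frequency threshold, stop table, merge threshold).
        from gensim.models import CoherenceModel

        cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                            coherence="c_v")
        print(cm.get_coherence())      # higher is better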
  • in this embodiment, after step S304 (secondarily clustering the pieces of text information included in each first-level topic according to the number of second-level topics under it, forming a plurality of second-level topics), the method may further include the following step:
  • S307: judging whether a second-level topic is a hot topic according to whether the number of pieces of text information under it exceeds a second threshold.
  • in this step, when the number of pieces of text information under a certain second-level topic is greater than the second threshold, that second-level topic can be judged to be a hot topic. Once a hot topic has been identified, subsequent operations can be performed, for example automatically or manually displaying the hot topic on the front page of a website or adding a conspicuous marker to it; the present invention is not limited in this respect.
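  • a sketch of the hot-topic test; the threshold value and topic names are illustrative assumptions:

        # Flag second-level topics whose document count exceeds the 2nd threshold.
        def hot_topics(topic_counts, second_threshold=500):
            return [t for t, n in topic_counts.items() if n > second_threshold]

        print(hot_topics({"t1": 120, "t2": 860, "t3": 40}))   # -> ['t2']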
  • in the text information clustering method proposed in this embodiment, hierarchical clustering reduces the total number of first-level topics in the initial clustering, improving computational efficiency and reducing system resource consumption; in the secondary clustering, the number of second-level topics is determined dynamically from the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
  • in addition, after the secondary clustering is completed, the method enters an evaluation stage to assess whether the clustering of the second-level topics is appropriate; adding this evaluation stage further optimizes the clustering method and improves clustering accuracy.
  • furthermore, after the secondary clustering is completed, comparison with the second threshold determines which second-level topics are hot topics, which facilitates subsequent processing.
  • the text information clustering method can be applied, for example, to clustering of news. That is, the text information described above may be, for example, news.
  • This method can be used to cluster a large amount of news.
  • the clustering method may include at least the following steps: performing word segmentation on each news item among multiple news items to form a plurality of words; performing initial clustering on the word-segmented news items according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two news items; determining, according to a preset rule and based on the number of news items under each first-level topic, the number of second-level topics under that first-level topic; and performing secondary clustering, according to the number of second-level topics under each first-level topic, on the news items included in that first-level topic according to the plurality of words, forming a plurality of second-level topics. Since a large amount of news is generated every day, these steps allow news to be clustered more quickly, avoiding the tedium and inefficiency of manual classification, letting users obtain categorized news faster, and improving the user experience.
  • The fourth embodiment of the present application provides a text information clustering system, shown in the block diagram of FIG. 5.
  • the text information clustering system 400 of the fourth embodiment of the present application includes:
  • the word segmentation processing module 401 is configured to perform word segmentation processing on each piece of text information in the plurality of pieces of text information to form a plurality of words;
  • the initial clustering module 402 is configured to initially cluster the plurality of pieces of text information after word segmentation according to the plurality of words to form a plurality of first-level topics, each first-level topic including multiple pieces of text information;
  • a subject number determining module 403 configured to determine, according to a preset rule, the number of second-level topics under each of the first-level topics based on the number of text information under each of the first-level topics;
  • a secondary clustering module 404 configured to secondarily cluster, according to the number of second-level topics under each first-level topic, the pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics, each of which includes multiple pieces of text information.
  • in the text information clustering system proposed in this embodiment, hierarchical clustering reduces the total number of first-level topics in the initial clustering, improving computational efficiency and reducing system resource consumption; in the secondary clustering, the number of second-level topics is determined dynamically from the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
  • FIG. 6 is a block diagram of a text information clustering system according to a fifth embodiment of the present application.
  • the text information clustering system of the fifth embodiment of the present application includes a word segmentation processing module 501, a primary clustering module 502, a topic number determining module 503, and a quadratic clustering module 504.
  • the above modules 501-504 are the same as or similar to the modules 401-404 in the fourth embodiment, and are not described herein again.
  • in this embodiment, preferably, both the initial clustering and the secondary clustering are performed with the LDA algorithm.
  • the system further includes:
  • a correlation determining module configured to determine a degree of correlation between the symbol, the English word and/or the number and the text information when a symbol, an English word, and/or a number appear in the text information is detected;
  • the first deleting module is configured to delete the symbol, the English word and/or the number when it is determined that the degree of correlation between the symbol, the English word and/or the number and the text information content is lower than a specified value.
  • the system further includes:
  • a detecting module configured to detect whether each word after word segmentation is identical to a word in a preset stop table; and
  • a second deleting module configured to delete a word after word segmentation when it is detected to be identical to a word in the preset stop table.
  • the system further includes:
  • the merging module 505 is configured to merge two or more first-level topics whose numbers of pieces of text information are less than a first value into one first-level topic.
  • the secondary clustering module 504 is configured to perform any two or more secondary clustering operations simultaneously.
  • the system further includes:
  • An evaluation module 506, configured to evaluate a plurality of secondary topics generated after the secondary clustering
  • the adjusting module 507 is configured to adjust parameters of the LDA algorithm according to the evaluation result.
  • the system further includes:
  • the hotspot judging module 508 is configured to determine whether the second-level topic is a hot topic by using the number of text information under each second-level topic.
  • in the text information clustering system proposed in this embodiment, a hierarchical clustering scheme is adopted in the manner described above: the initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
  • the system of the embodiment dynamically determines the number of secondary topics according to the number of text information in the secondary clustering, reduces the average number of text information under each secondary theme, and speeds up the calculation of the secondary theme.
  • moreover, meaningless words and/or symbols are deleted during clustering, and first-level topics containing few pieces of text information are merged, further optimizing the computation and reducing its cost.
  • the system of the embodiment may include an evaluation module for evaluating whether clustering of the secondary topics is appropriate. Adding the above evaluation link can further optimize the above clustering method and improve the accuracy of clustering.
  • the system of this embodiment may include a hotspot determination module, which may determine which secondary topics are hot topics by comparing with the second threshold, which provides convenience for subsequent processing.
  • the text information clustering system can be applied, for example, to clustering of news. That is, the text information described above may be, for example, news. With this system, a large amount of news can be clustered.
  • the clustering system can at least include:
  • a word segmentation processing module for segmenting each news item in a plurality of news articles to form a plurality of words
  • a primary clustering module configured to initially cluster the plurality of news items after word segmentation according to the plurality of words to form a plurality of first-level topics, each first-level topic including multiple news items;
  • a subject number determining module configured to determine, according to a preset rule, the number of secondary topics under each of the first-level topics based on the number of news under each of the first-level topics;
  • a secondary clustering module configured to secondarily cluster, according to the number of second-level topics under each first-level topic, the news items included in that first-level topic according to the plurality of words, forming a plurality of second-level topics, each of which includes multiple news items.
  • since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
  • embodiments of the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology.
  • the information may be computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be stored in a computer readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Abstract

A text information clustering method and system. The clustering method includes the following steps: performing word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words (S101); performing initial clustering on the word-segmented pieces of text information to form a plurality of first-level topics, each first-level topic including at least two pieces of text information (S102); determining, according to the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic (S103); and performing secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic, to form a plurality of second-level topics (S104). With this hierarchical clustering method, the total number of first-level topics is reduced in the initial clustering, which improves computational efficiency; in the secondary clustering, the number of second-level topics is determined dynamically according to the number of pieces of text information, which speeds up the computation of the second-level topics.

Description

Text information clustering method and text information clustering system
This application claims priority to Chinese Patent Application No. 201610112522.X, filed on February 29, 2016 and entitled "Text information clustering method and text information clustering system", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of text processing, and in particular to a text information clustering method and a text information clustering system.
Background
Clustering text information by corresponding topic has very important applications in the field of text processing. However, because text information covers a very wide range and a very large amount of text information is generated every day, large-scale text clustering analysis is of great significance.
When the number of topics grows, existing text clustering analysis becomes slow and consumes excessive computing resources; yet if the number of topics is limited, articles belonging to different topics are mixed together, which affects the final result.
Therefore, a new text information clustering technique is needed to solve the problems of slow computation and excessive consumption of computing resources in the prior art.
Summary of the Invention
In view of the above problems, embodiments of the present application are proposed in order to provide a text information clustering method and a text information clustering system that overcome, or at least partially solve, the above problems.
To solve the above problems, an embodiment of the present application discloses a text information clustering method, including the following steps:
performing word segmentation on each piece of text information among multiple pieces of text information;
performing initial clustering on the multiple pieces of text information after word segmentation to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
determining, according to the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
performing secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic, to form a plurality of second-level topics.
Another embodiment of the present invention discloses a text information clustering system, including:
a word segmentation module configured to perform word segmentation on each piece of text information among multiple pieces of text information;
an initial clustering module configured to perform initial clustering on the multiple pieces of text information after word segmentation to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
a topic number determining module configured to determine, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
a secondary clustering module configured to perform secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic, to form a plurality of second-level topics.
In summary, the text information clustering method and text information clustering system proposed in the embodiments of the present application have at least the following advantages:
In the text information clustering method and clustering system proposed in the embodiments, hierarchical clustering is adopted: the initial clustering reduces the total number of first-level topics, which improves computational efficiency and reduces system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, which reduces the average number of pieces of text information under each second-level topic and speeds up the computation of the second-level topics.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the LDA algorithm used in an embodiment of the present invention.
FIG. 2 is a flowchart of a text information clustering method according to a first embodiment of the present invention.
FIG. 3 is a flowchart of a text information clustering method according to a second embodiment of the present invention.
FIG. 4 is a flowchart of a text information clustering method according to a third embodiment of the present invention.
FIG. 5 is a block diagram of a text information clustering system according to a fourth embodiment of the present invention.
FIG. 6 is a block diagram of a text information clustering system according to a fifth embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application fall within the scope of protection of the present application.
One of the core ideas of the present application is to cluster multiple pieces of text information two or more times by an algorithm: a plurality of first-level topics are generated after the initial clustering; the number of second-level topics under each first-level topic is then determined according to the number of pieces of text information under that topic; and afterwards, according to the number of second-level topics under each first-level topic, secondary clustering is performed on the at least two pieces of text information under that first-level topic to generate a plurality of second-level topics.
For example, suppose the system needs to cluster 5000 pieces of text information. According to the text information clustering method provided in the present application, an algorithm can first cluster the 5000 pieces into 5 first-level topics. After the initial clustering, the first-level topics contain, respectively, 1000, 1500, 500, 1800, and 200 pieces of text information. The number of second-level topics into which each first-level topic should be divided is then determined from the number of pieces of text information it contains; for example, manual analysis or algorithm parameter settings may determine that the 5 first-level topics should be divided into 10, 15, 5, 18, and 2 second-level topics, respectively. Afterwards, each first-level topic is secondarily clustered according to these numbers, generating 10, 15, 5, 18, and 2 second-level topics, each containing several pieces of text information.
As those skilled in the art know, the number of pieces of text information that must be processed in practice is usually far more than 5000 and may be of a higher order of magnitude; the above example is given only for ease of understanding and is not a particular limitation of the present invention.
In the embodiments of the present application, the pieces of text information can be clustered with the LDA algorithm. The LDA (Latent Dirichlet Allocation) algorithm is a document topic model algorithm. It introduces a Bayesian framework into the existing pLSA algorithm and can better represent the document generation model. Its concrete implementation is as follows:
First, it is assumed that every word in all the documents is chosen from one of the topics of a piece of text information, and that the topic itself satisfies a certain probability distribution. FIG. 1 shows the schematic diagram of the LDA algorithm. As shown in FIG. 1, the topics of a piece of text information are assumed to follow a multinomial distribution with parameter θ, whose prior is a Dirichlet distribution with parameter α; z denotes a topic drawn from this topic distribution. For each topic, the words under it are assumed to follow a multinomial distribution with parameter φ, whose prior is a Dirichlet distribution with parameter β. Assuming K topics in total, for each randomly selected topic the corresponding words are drawn from its corresponding distribution. In the figure, M denotes the number of articles, N the number of words, K the number of topics, and w a word; a shaded node indicates an observable quantity, and a rectangle indicates repetition, with the number of repetitions given by the letter in its lower-right corner. After the modeling is completed, the final parameter estimation is done by Gibbs sampling. After LDA clustering, the pieces of text information are clustered into specific topics, and each topic includes multiple related pieces of text information.
In practice, once the number of topics exceeds 300, the LDA clustering algorithm runs very slowly and consumes a great deal of resources; at the same time, because the number of topics is limited, the ideal number of topics cannot be reached. Consequently, pieces of text information are noticeably mixed in the final result, and many unrelated topics are grouped under a single topic, causing great difficulty for text information clustering.
In the text information clustering method proposed in the present application, a hierarchical LDA clustering framework is constructed by adopting hierarchical clustering. The initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic, decoupling the first-level topics from one another, and speeding up the computation of the second-level topics through parallelism.
The text information clustering method and text information clustering system proposed in the present application are described in detail below through several specific embodiments.
First Embodiment
The first embodiment of the present application provides a text information clustering method; FIG. 2 is a flowchart of the text information clustering method according to an embodiment of the present application. The method includes the following steps:
Step S101: performing word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words;
In this step, each piece of text information can first be word-segmented. For example, "Python是一种面向对象、解释型计算机程序设计语言" ("Python is an object-oriented, interpreted computer programming language") can be segmented into "Python/是/一种/面向/对象/解释/型/计算机/程序/设计/语言".
This step splits a sentence into a number of words, which facilitates subsequent processing.
In this step, the words appearing in the text information can be compared with the words in a preset word library; when a word appearing in the text information matches a word in the library, that word is segmented out. Note that the "words" mentioned throughout this text may be single characters or multi-character words. For example, when "面向" in the text information matches "面向" in the word library, "面向" is segmented out of the text information on its own; when "型" in the text information matches "型" in the word library, "型" is segmented out on its own.
Then step S102 may be performed: performing initial clustering on the word-segmented pieces of text information according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
In this step, for example, the aforementioned LDA algorithm can be used to initially cluster all the text information. In this clustering pass, given the large amount of text information, the number of first-level topics can be set relatively small, to avoid consuming excessive computing resources and slowing down the computation. Through the initial clustering, the text information can be roughly divided into several first-level topics of different sizes, each containing a different number of pieces of text information.
For example, following the earlier example, when 5000 pieces of text information are clustered, the LDA algorithm clusters them in this step into 5 first-level topics containing, for example, 1000, 1500, 500, 1800, and 200 pieces of text information respectively.
Then step S103 may be performed: determining, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
In this step, the number of second-level topics under each first-level topic can be determined from the number of pieces of text information under it, using the parameter settings of the LDA algorithm or manual settings. The numbers of second-level topics under different first-level topics may be the same or different.
The preset rule here may be, for example: each second-level topic should contain X pieces of text information, where M ≤ X ≤ N and M and N are values specified by the developer or the user. For example, if 90 ≤ X ≤ 110, X may be chosen as the mean value 100; on this basis, the numbers of second-level topics under the first-level topics in the above example are 1000/100 = 10, 1500/100 = 15, 500/100 = 5, 1800/100 = 18, and 200/100 = 2.
Afterwards, step S104 may be performed: performing secondary clustering, according to the number of second-level topics under each first-level topic, on the pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics.
In this step, the aforementioned LDA algorithm can be used for the secondary clustering: for the pieces of text information under each first-level topic, clustering is performed, for example with the LDA algorithm, according to the number of second-level topics into which that topic should be divided, forming the specified number of second-level topics.
For example, following the earlier example, each first-level topic is secondarily clustered, generating 10, 15, 5, 18, and 2 second-level topics respectively, each containing several pieces of text information.
In this step, since the secondary clustering of the pieces of text information in each first-level topic is an independent process, these secondary clustering jobs can be processed simultaneously, i.e., in parallel, which increases the computation speed.
In the text information clustering method proposed in this embodiment, hierarchical clustering is adopted: the initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
Second Embodiment
The second embodiment of the present application provides a text information clustering method; FIG. 3 is a flowchart of the text information clustering method according to the second embodiment of the present application. The method includes the following steps:
Step S201: performing word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words;
Step S202: using the LDA algorithm, performing initial clustering on the word-segmented pieces of text information according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
Step S203: determining, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
Step S204: according to the number of second-level topics under each first-level topic, performing secondary clustering (with the LDA algorithm) on the pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics, each second-level topic including multiple pieces of text information.
The above steps S201 to S204 are the same as or similar to steps S101 to S104 of the first embodiment and are not described here again.
In this embodiment, after step S201, the method may further include the following steps:
S201a: when a symbol, an English word and/or a number is detected in the text information during word segmentation, judging the degree of relevance between the symbol, English word and/or number and the text information;
S201b: when the degree of relevance between the symbol, English word and/or number and the text information is judged to be below a specified value, deleting the symbol, English word and/or number.
In the above steps, the symbol may be a standalone symbol, such as "&" or "%", or content composed of symbols, numbers and letters, such as a link. In step S201a, the degree of relevance between the symbol and the content of the text information is judged by a specific method, and when the relevance is judged to be low, the symbol is deleted.
Similarly, the degree of relevance between an English word and the content of the text information is judged. For example, when the text information contains "厄尔尼诺现象(El Nino)", the English words serve only as an annotation; when it is judged that the English words are merely an annotation, they can be deleted.
Similarly, the degree of relevance between a number and the content of the text information can be judged in the same way, and when the relevance is judged to be low, the number is deleted.
In this embodiment, after step S201, the method may further include the following step:
S201c: when an English word is detected in the text information during word segmentation, segmenting the English word out as a single word.
In this step, for example in the above example, "Python" is highly relevant to the content of the text information; if it were deleted, the exact meaning of the text could not be understood and a correct classification could not be obtained. In this embodiment, the word "Python" can be segmented out as a single word and kept.
In this embodiment, after step S201, the method may further include the following steps:
S201d: detecting whether each word after word segmentation is identical to a word in a preset stop table;
S201e: when any word after word segmentation is detected to be identical to a word in the preset stop table, deleting that word.
In the above steps, the segmentation result usually contains a number of meaningless words such as "的", "了" and "过". These words not only contribute nothing to the result but also occupy a great deal of computation and storage resources, so they need to be filtered out before the computation. Specifically, meaningless words such as "的", "了" and "过" can be collected in a preset stop table, and when such a word appears in the text information, it is deleted from the text information. In addition, in actual practice some words interfere with normal classification, such as the source markers of some text information; these words can also be collected in the preset stop table, and when they are detected in the text information, they are deleted from it.
It is also worth noting that steps S201a and S201b, step S201c, and steps S201d and S201e are not necessarily performed in sequence; rather, steps S201a and S201b, step S201c, and/or steps S201d and S201e may be performed selectively.
In this embodiment, after step S202, i.e., after the step of initially clustering the word-segmented pieces of text information with the LDA algorithm to form a plurality of first-level topics, the text information clustering method may further include the following step:
S202a: merging two or more first-level topics whose numbers of pieces of text information are less than a first value into one first-level topic.
In this step, an algorithm or manual inspection can be used to detect whether the number of pieces of text information under each first-level topic is less than a first threshold; if so, that first-level topic is merged with other first-level topics for the subsequent computation.
For example, following the earlier example, the first-level topics formed by the clustering in step S202 contain 1000, 1500, 500, 1800, and 200 pieces of text information respectively. If the first threshold is set to 300, the last first-level topic is found to contain fewer pieces of text information than the first threshold; it can then be merged with another topic, for example the third first-level topic above, before the second-level topics are clustered.
In the text information clustering method proposed in this embodiment, hierarchical clustering is adopted: the initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics. Meanwhile, meaningless words and/or symbols are deleted during clustering and first-level topics containing few pieces of text information are merged, further optimizing the computation and reducing its cost.
Third Embodiment
The third embodiment of the present application provides a text information clustering method; FIG. 4 is a flowchart of the text information clustering method according to the third embodiment of the present application. The method includes the following steps:
Step S301: performing word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words;
Step S302: using the LDA algorithm, performing initial clustering on the word-segmented pieces of text information according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
Step S303: determining, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
Step S304: according to the number of second-level topics under each first-level topic, performing secondary clustering (with the LDA algorithm) on the at least two pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics.
The above steps S301 to S304 are the same as or similar to steps S101 to S104 of the first embodiment and are not described here again.
In this embodiment, after step S104 of the first embodiment or step S204 of the second embodiment is completed, step S305 is performed: performing a matching-degree evaluation on the plurality of second-level topics generated by the secondary clustering; and
Step S306: obtaining a matching-degree evaluation result, and, when the result indicates that the clustering is unsatisfactory, adjusting the parameters of the LDA algorithm according to the evaluation result.
In this step, when the matching-degree evaluation indicates that the clustering is unsatisfactory, adjustments can be made, for example, to the number of topics, the frequency threshold for low-frequency words, the threshold on the number of pieces of text information a topic must contain before being merged, the contents of the stop table, and so on. The number of topics is, for example, the value K in FIG. 1. The frequency threshold for low-frequency words may be, for example, a manually or machine-set threshold: after all the text information has been segmented, words whose frequency of occurrence is below the threshold can be regarded as low-frequency words; in this step, this threshold can be adjusted to increase or decrease the number of low-frequency words and thereby influence the clustering result. The threshold on the number of pieces of text information a topic must contain before being merged may likewise be a manually or machine-set threshold: when one or more topics contain fewer pieces of text information than a specific threshold, those topics are considered to need merging; modifying this threshold sets a higher or lower merging bar and thereby influences the clustering result. The stop table may be, for example, the table provided in the second embodiment, which may store a plurality of stop words; adjusting the stop words' content influences the clustering result.
In this step, the second-level topics generated by the clustering can be evaluated manually or with machine algorithms. Since the result of the secondary clustering varies considerably with the text information, it needs to be evaluated. The specific evaluation may include checking whether the pieces of text information under several second-level topics concern the same content, and using this criterion to judge whether the clustering is appropriate, whether unsuitable words have been selected as keywords, whether second-level topics are aliased with one another, and whether the chosen numbers of first-level and second-level topics are appropriate. If the result does not meet expectations, further adjustments can be made manually or by machine algorithms as needed, for example adjusting the parameters of the LDA algorithm.
In this embodiment, after step S304, i.e., after secondarily clustering the pieces of text information included in each first-level topic according to the number of second-level topics under it to form a plurality of second-level topics, the method may further include the following step:
S307: judging whether a second-level topic is a hot topic according to whether the number of pieces of text information under it exceeds a second threshold.
In this step, when the number of pieces of text information under a certain second-level topic is greater than the second threshold, that second-level topic can be judged to be a hot topic. Once a hot topic has been identified, subsequent operations can be performed, for example automatically or manually displaying the hot topic on the front page of a website or adding a conspicuous marker to it; the present invention is not limited in this respect.
In the text information clustering method proposed in this embodiment, hierarchical clustering is adopted: the initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics. Meanwhile, after the secondary clustering is completed, the method enters an evaluation stage to assess whether the clustering of the second-level topics is appropriate; adding this evaluation stage further optimizes the clustering method and improves clustering accuracy. Furthermore, after the secondary clustering is completed, comparison with the second threshold determines which second-level topics are hot topics, facilitating subsequent processing.
In the above embodiments, the text information clustering method can be applied, for example, to the clustering of news; that is, the text information described above may be news, and a large amount of news can be clustered with this method. The clustering method may include at least the following steps: performing word segmentation on each news item among multiple news items to form a plurality of words; performing initial clustering on the word-segmented news items according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two news items; determining, according to a preset rule and based on the number of news items under each first-level topic, the number of second-level topics under that first-level topic; and performing secondary clustering, according to the number of second-level topics under each first-level topic, on the news items included in that first-level topic according to the plurality of words, forming a plurality of second-level topics. Since a large amount of news is generated every day, the above steps allow news to be clustered more quickly, avoiding the tedium and inefficiency of manual classification, letting users obtain categorized news faster, and improving the user experience. A compact end-to-end sketch of this flow follows.
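The following is a minimal end-to-end sketch of the news-clustering flow, combining segmentation, stop-word filtering, two-level LDA clustering, and the dynamic second-level topic count. It assumes the third-party libraries jieba and gensim, which the patent does not name; the toy data, the first-level topic count of 2, and the target of roughly 100 news items per second-level topic are illustrative assumptions.

    # Two-level news clustering: segment, filter, cluster, then re-cluster.
    import jieba
    from gensim import corpora, models

    news = ["苹果发布新手机", "谷歌发布新系统", "球队赢得比赛", "球员打破纪录"] * 50
    stop = {"的", "了", "过"}                            # preset stop table

    docs = [[w for w in jieba.lcut(n) if w not in stop] for n in news]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    first = models.LdaModel(corpus, num_topics=2, id2word=dictionary)  # level 1
    groups = {}
    for d, bow in zip(docs, corpus):
        tid = max(first.get_document_topics(bow), key=lambda t: t[1])[0]
        groups.setdefault(tid, []).append(d)

    for tid, ds in groups.items():                 # level 2, per first-level topic
        k = max(1, round(len(ds) / 100))           # ~100 news items per topic
        sub_dict = corpora.Dictionary(ds)
        sub_corpus = [sub_dict.doc2bow(d) for d in ds]
        sub = models.LdaModel(sub_corpus, num_topics=k, id2word=sub_dict)
        print(tid, sub.show_topics(num_topics=k))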
Fourth Embodiment
The fourth embodiment of the present application provides a text information clustering system; FIG. 5 is a block diagram of the text information clustering system according to the fourth embodiment of the present application. The text information clustering system 400 of the fourth embodiment includes:
a word segmentation module 401 configured to perform word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words;
an initial clustering module 402 configured to perform initial clustering on the word-segmented pieces of text information according to the plurality of words to form a plurality of first-level topics, each first-level topic including multiple pieces of text information;
a topic number determining module 403 configured to determine, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
a secondary clustering module 404 configured to perform secondary clustering, according to the number of second-level topics under each first-level topic, on the pieces of text information included in that first-level topic according to the plurality of words, forming a plurality of second-level topics, each second-level topic including multiple pieces of text information.
In the text information clustering system proposed in this embodiment, hierarchical clustering is adopted: the initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
Fifth Embodiment
The fifth embodiment of the present application provides a text information clustering system; FIG. 6 is a block diagram of the text information clustering system according to the fifth embodiment of the present application. The text information clustering system of the fifth embodiment includes a word segmentation module 501, an initial clustering module 502, a topic number determining module 503, and a secondary clustering module 504. The modules 501-504 are the same as or similar to the modules 401-404 of the fourth embodiment and are not described here again.
In this embodiment, preferably, both the initial clustering and the secondary clustering are performed with the LDA algorithm.
In this embodiment, preferably, the system further includes:
a relevance judging module configured to judge, when a symbol, an English word and/or a number is detected in the text information, the degree of relevance between that symbol, English word and/or number and the text information; and
a first deleting module configured to delete the symbol, English word and/or number when the degree of relevance between it and the content of the text information is judged to be below a specified value.
In this embodiment, preferably, the system further includes:
a detecting module configured to detect whether each word after word segmentation is identical to a word in a preset stop table; and
a second deleting module configured to delete a word after word segmentation when it is detected to be identical to a word in the preset stop table.
In this embodiment, preferably, the system further includes:
a merging module 505 configured to merge two or more first-level topics whose numbers of pieces of text information are less than a first value into one first-level topic.
In this embodiment, preferably, the secondary clustering module 504 is configured to perform any two or more secondary clustering operations simultaneously.
In this embodiment, preferably, the system further includes:
an evaluation module 506 configured to evaluate the plurality of second-level topics generated by the secondary clustering; and
an adjusting module 507 configured to adjust the parameters of the LDA algorithm according to the evaluation result.
In this embodiment, preferably, the system further includes:
a hot topic judging module 508 configured to judge, using the number of pieces of text information under each second-level topic, whether that second-level topic is a hot topic.
In the text information clustering system proposed in this embodiment, hierarchical clustering is adopted in the above manner: the initial clustering reduces the total number of first-level topics, improving computational efficiency and reducing system resource consumption; the secondary clustering determines the number of second-level topics dynamically according to the number of pieces of text information, reducing the average number of pieces of text information under each second-level topic and speeding up the computation of the second-level topics.
Meanwhile, the system of this embodiment deletes meaningless words and/or symbols during clustering and merges first-level topics containing few pieces of text information, further optimizing the computation and reducing its cost.
The system of this embodiment may also include an evaluation module for assessing whether the clustering of the second-level topics is appropriate; adding this evaluation stage further optimizes the clustering method and improves clustering accuracy. In addition, the system may include a hot topic judging module, which determines, by comparison with the second threshold, which second-level topics are hot topics, facilitating subsequent processing.
Likewise, in the above embodiments, the text information clustering system can be applied, for example, to the clustering of news; that is, the text information described above may be news. With this system, a large amount of news can be clustered. The clustering system may include at least:
a word segmentation module configured to perform word segmentation on each news item among multiple news items to form a plurality of words;
an initial clustering module configured to perform initial clustering on the word-segmented news items according to the plurality of words to form a plurality of first-level topics, each first-level topic including multiple news items;
a topic number determining module configured to determine, according to a preset rule and based on the number of news items under each first-level topic, the number of second-level topics under that first-level topic;
a secondary clustering module configured to perform secondary clustering, according to the number of second-level topics under each first-level topic, on the news items included in that first-level topic according to the plurality of words, forming a plurality of second-level topics, each second-level topic including multiple news items.
Since a large amount of news is generated every day, the above steps allow news to be clustered more quickly, avoiding the tedium and inefficiency of manual classification, letting users obtain categorized news faster, and improving the user experience.
As for the device embodiments, since they are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include non-persistent storage in computer readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read only memory (ROM) or flash RAM. Memory is an example of a computer readable medium. Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory media, such as modulated data signals and carrier waves.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general purpose computer, a special purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are executed on the computer or other programmable terminal device to produce computer-implemented processing; the instructions executed on the computer or other programmable terminal device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element.
The text information clustering method and clustering system provided by the present application have been described above in detail. Specific examples have been used herein to explain the principles and implementations of the present application; the descriptions of the above embodiments are only intended to help understand the method of the present application and its core ideas. Meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (18)

  1. A text information clustering method, comprising:
    performing word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words;
    performing initial clustering on the word-segmented pieces of text information according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
    determining, according to a preset rule and based on the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
    performing secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic according to the plurality of words, to form a plurality of second-level topics.
  2. The text information clustering method according to claim 1, wherein both the initial clustering and the secondary clustering use the LDA algorithm to cluster, according to the plurality of words, into the plurality of first-level topics and the plurality of second-level topics.
  3. The text information clustering method according to claim 1, wherein, after the step of performing word segmentation on each piece of text information among the multiple pieces of text information, the method further comprises:
    when a symbol, an English word and/or a number is detected in the text information, judging the degree of relevance between the symbol, English word and/or number and the text information; and
    when the degree of relevance between the symbol, English word and/or number and the text information is judged to be below a specified value, deleting the symbol, English word and/or number.
  4. The text information clustering method according to claim 1, wherein, after the step of performing word segmentation on each piece of text information among the multiple pieces of text information to form a plurality of words, the method further comprises:
    detecting whether each word after word segmentation is identical to a word in a preset stop table; and
    when any word after word segmentation is detected to be identical to a word in the preset stop table, deleting that word.
  5. The text information clustering method according to claim 1, wherein, after the step of performing initial clustering on the word-segmented pieces of text information according to the words to form a plurality of first-level topics, the method further comprises:
    merging two or more first-level topics whose numbers of pieces of text information are less than a first value into one first-level topic.
  6. The text information clustering method according to claim 1, wherein, in the step of performing secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic according to the plurality of words to form a plurality of second-level topics, any two or more of the secondary clustering operations are performed simultaneously.
  7. The text information clustering method according to claim 1, wherein, after the step of performing secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic to form a plurality of second-level topics, the method further comprises:
    judging, using the number of pieces of text information under each second-level topic, whether that second-level topic is a hot topic.
  8. The text information clustering method according to claim 2, wherein, after the step of performing secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic according to the plurality of words to form a plurality of second-level topics, the method further comprises:
    performing a matching-degree evaluation on the plurality of second-level topics generated by the secondary clustering;
    adjusting one or more parameters of the LDA algorithm according to the matching-degree evaluation result.
  9. The text information clustering method according to claim 1, wherein the text information is news.
  10. A text information clustering system, comprising:
    a word segmentation module configured to perform word segmentation on each piece of text information among multiple pieces of text information to form a plurality of words;
    an initial clustering module configured to perform initial clustering on the word-segmented pieces of text information according to the plurality of words to form a plurality of first-level topics, each first-level topic including at least two pieces of text information;
    a topic number determining module configured to determine, according to the number of pieces of text information under each first-level topic, the number of second-level topics under that first-level topic;
    a secondary clustering module configured to perform secondary clustering, according to the number of second-level topics under each first-level topic, on the at least two pieces of text information included in that first-level topic according to the plurality of words, to form a plurality of second-level topics.
  11. The text information clustering system according to claim 10, wherein both the initial clustering and the secondary clustering use the LDA algorithm to cluster, according to the plurality of words, into the plurality of first-level topics and the plurality of second-level topics.
  12. The text information clustering system according to claim 10, further comprising:
    a relevance judging module configured to judge, when a symbol, an English word and/or a number is detected in the text information, the degree of relevance between the symbol, English word and/or number and the text information; and
    a first deleting module configured to delete the symbol, English word and/or number when the degree of relevance between it and the content of the text information is judged to be below a specified value.
  13. The text information clustering system according to claim 10, further comprising:
    a detecting module configured to detect whether each word after word segmentation is identical to a word in a preset stop table; and
    a second deleting module configured to delete a word after word segmentation when it is detected to be identical to a word in the preset stop table.
  14. The text information clustering system according to claim 10, further comprising:
    a merging module configured to merge two or more first-level topics whose numbers of pieces of text information are less than a first value into one first-level topic.
  15. The text information clustering system according to claim 10, wherein the secondary clustering module is configured to perform any two or more secondary clustering operations simultaneously.
  16. The text information clustering system according to claim 10, further comprising:
    a hot topic judging module configured to judge, using the number of pieces of text information under each second-level topic, whether that second-level topic is a hot topic.
  17. The text information clustering system according to claim 11, further comprising:
    an evaluation module configured to perform a matching-degree evaluation on the plurality of second-level topics generated by the secondary clustering; and
    an adjusting module configured to adjust the parameters of the LDA algorithm according to the matching-degree evaluation result.
  18. The text information clustering system according to claim 10, wherein the text information is news.
PCT/CN2017/073720 2016-02-29 2017-02-16 Text information clustering method and text information clustering system WO2017148267A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2018544207A JP2019511040A (ja) 2016-02-29 2017-02-16 テキスト情報クラスタ化方法及びテキスト情報クラスタ化システム
US16/116,851 US20180365218A1 (en) 2016-02-29 2018-08-29 Text information clustering method and text information clustering system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610112522.XA CN107133238A (zh) 2016-02-29 2016-02-29 一种文本信息聚类方法和文本信息聚类系统
CN201610112522.X 2016-02-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/116,851 Continuation US20180365218A1 (en) 2016-02-29 2018-08-29 Text information clustering method and text information clustering system

Publications (1)

Publication Number Publication Date
WO2017148267A1 (zh)

Family

ID=59721328

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/073720 WO2017148267A1 (zh) 2016-02-29 2017-02-16 一种文本信息聚类方法和文本信息聚类系统

Country Status (5)

Country Link
US (1) US20180365218A1 (zh)
JP (1) JP2019511040A (zh)
CN (1) CN107133238A (zh)
TW (1) TW201734850A (zh)
WO (1) WO2017148267A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209419A (zh) * 2018-11-20 2020-05-29 浙江宇视科技有限公司 Image data storage method and apparatus

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108255978A (zh) * 2017-12-28 2018-07-06 曙光信息产业(北京)有限公司 Method and system for topic clustering of news manuscripts
CN109101633B (zh) * 2018-08-15 2019-08-27 北京神州泰岳软件股份有限公司 Hierarchical clustering method and apparatus
CN110069772B (zh) * 2019-03-12 2023-10-20 平安科技(深圳)有限公司 Apparatus, method and storage medium for predicting the score of question-and-answer content
CN110309504B (zh) * 2019-05-23 2023-10-31 平安科技(深圳)有限公司 Word-segmentation-based text processing method, apparatus, device and storage medium
CN110597986A (zh) * 2019-08-16 2019-12-20 杭州微洱网络科技有限公司 Text clustering system and method based on fine-tuned features
CN111353028B (zh) * 2020-02-20 2023-04-18 支付宝(杭州)信息技术有限公司 Method and apparatus for determining customer-service script clusters
CN113806524A (zh) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Method and apparatus for building hierarchical categories of text content and adjusting the hierarchical structure
CN111813935B (zh) * 2020-06-22 2024-04-30 贵州大学 Multi-source text clustering method based on a hierarchical Dirichlet multinomial allocation model
CN112036176A (zh) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Text clustering method and apparatus
CN112948579A (zh) * 2021-01-29 2021-06-11 广东海洋大学 Message text information processing method, apparatus, system and computer device
CN112597313B (zh) * 2021-03-03 2021-06-29 北京沃丰时代数据科技有限公司 Short text clustering method, apparatus, electronic device and storage medium
CN113515593A (zh) * 2021-04-23 2021-10-19 平安科技(深圳)有限公司 Topic detection method, apparatus and computer device based on a clustering model
CN113420723A (zh) * 2021-07-21 2021-09-21 北京有竹居网络技术有限公司 Method, apparatus, readable medium and electronic device for obtaining video hotspots

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050141497A1 (en) * 2004-06-18 2005-06-30 Yi-Chieh Wu Data classification management system and method thereof
CN102411638A (zh) * 2011-12-30 2012-04-11 中国科学院自动化研究所 Multimedia summary generation method for news retrieval results
CN103514183A (zh) * 2012-06-19 2014-01-15 北京大学 Information retrieval method and system based on interactive document clustering
CN103870474A (zh) * 2012-12-11 2014-06-18 北京百度网讯科技有限公司 News topic organization method and apparatus
CN104239539A (zh) * 2013-09-22 2014-12-24 中科嘉速(北京)并行软件有限公司 Microblog information filtering method based on fusion of multiple kinds of information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101989289B (zh) * 2009-08-06 2014-05-07 富士通株式会社 Data clustering method and apparatus
CN104199974A (zh) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Dynamic topic detection and evolution tracking method for microblogs
CN104216954B (zh) * 2014-08-20 2017-07-14 北京邮电大学 Apparatus and method for predicting the topic state of emergency events
CN104462286A (zh) * 2014-11-27 2015-03-25 重庆邮电大学 Microblog topic discovery method based on improved LDA
CN104850615A (zh) * 2015-05-14 2015-08-19 西安电子科技大学 G2O-based SLAM back-end optimization algorithm method


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209419A (zh) * 2018-11-20 2020-05-29 浙江宇视科技有限公司 Image data storage method and apparatus
CN111209419B (zh) * 2018-11-20 2023-09-19 浙江宇视科技有限公司 Image data storage method and apparatus

Also Published As

Publication number Publication date
JP2019511040A (ja) 2019-04-18
CN107133238A (zh) 2017-09-05
TW201734850A (zh) 2017-10-01
US20180365218A1 (en) 2018-12-20

Similar Documents

Publication Publication Date Title
WO2017148267A1 (zh) 一种文本信息聚类方法和文本信息聚类系统
Suttles et al. Distant supervision for emotion classification with discrete binary values
US9542477B2 (en) Method of automated discovery of topics relatedness
US8990327B2 (en) Location estimation of social network users
CN109815336B (zh) 一种文本聚合方法及系统
US20160162802A1 (en) Active Machine Learning
CN106598999B (zh) 一种计算文本主题归属度的方法及装置
JP6335898B2 (ja) 製品認識に基づく情報分類
JP5534280B2 (ja) テキストクラスタリング装置、テキストクラスタリング方法、およびプログラム
US20180081861A1 (en) Smart document building using natural language processing
CN108959474B (zh) 实体关系提取方法
CN106610931B (zh) 话题名称的提取方法及装置
CN104850617A (zh) 短文本处理方法及装置
WO2022228371A1 (zh) 恶意流量账号检测方法、装置、设备和存储介质
CN106815190B (zh) 一种词语识别方法、装置及服务器
Karimi et al. Evaluation methods for statistically dependent text
CN110895654A (zh) 分段方法、分段系统及非暂态电脑可读取媒体
Zhang et al. Ideagraph plus: A topic-based algorithm for perceiving unnoticed events
CN110442863B (zh) 一种短文本语义相似度计算方法及其系统、介质
US11631021B1 (en) Identifying and ranking potentially privileged documents using a machine learning topic model
Wang et al. Sparse multi-task learning for detecting influential nodes in an implicit diffusion network
CN106599002B (zh) 话题演化分析的方法及装置
CN106776529B (zh) 业务情感分析方法及装置
KR20200088164A (ko) 소셜 네트워크 서비스 메시지의 감정 분석을 위한 POS(part of speech) 특징기반의 감정 분석 방법 및 이를 수행하는 감정 분석 장치
CN117171653B (zh) 一种识别信息关系的方法、装置、设备及存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2018544207

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17759118

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17759118

Country of ref document: EP

Kind code of ref document: A1