WO2017101541A1 - Text clustering method, apparatus, and computing device - Google Patents

Text clustering method, apparatus, and computing device

Info

Publication number
WO2017101541A1
Authority
WO
WIPO (PCT)
Prior art keywords
text, cluster, new, texts, processed
Application number
PCT/CN2016/099584
Other languages
English (en)
French (fr)
Inventor
胡斐然
王楠楠
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2017101541A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F16/355: Class or cluster creation or modification

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a text clustering method, a text clustering apparatus, and a computing device for text clustering.
  • the process of clustering texts is the process of bringing similar texts together.
  • the similarity between texts is often calculated according to the content of the texts; generally, texts that share more of the same content are regarded as more similar.
  • the application provides a text clustering method, a text clustering device and a computing device for text clustering to improve the accuracy of text clustering.
  • a first aspect of the present application provides a text clustering method, executed by a computer, comprising: receiving N texts to be clustered, N being an integer greater than 1; replacing the numbers in the N texts with a first identifier; and performing a pre-processing operation on the N texts, merging the adjacent first identifiers in the N texts, to obtain N pre-processed texts corresponding to the N texts.
  • the N preprocessed texts are segmented, the segmentation results of the N preprocessed texts are obtained, and the statistical features of the words in the segmentation results of the N preprocessed texts are obtained.
  • the N texts are clustered according to the statistical characteristics of each word in the word segmentation result of the N preprocessed texts.
  • the pre-processed text of a text no longer reflects the content of the text itself but its format; the texts are then clustered according to their pre-processed texts, so that the clustering process can take the format of each text into account and improve the accuracy of text clustering.
  • the pre-processing operation further includes combining the two adjacent second identifiers into one second identifier.
  • after the N texts are clustered, M text clusters are obtained. A regular expression corresponding to each text cluster is extracted from the texts of that cluster. When new text is obtained, it is determined whether the new text satisfies the regular expression corresponding to any of the M text clusters; if the new text matches the regular expression corresponding to a text cluster, the new text belongs to that text cluster.
  • Extracting a regular expression from each already-obtained text cluster captures the commonality of that cluster's content. After new text is obtained, it does not need to be re-clustered together with the texts that have already been clustered; instead, the new text is matched against the regular expression corresponding to each text cluster, which greatly improves the clustering speed for new text.
  • in a third implementation manner of the first aspect, after the N texts are clustered, M text clusters are obtained. A regular expression corresponding to each text cluster is extracted from the pre-processed texts of the texts included in that cluster. When new text is obtained, it is determined whether the new text satisfies the regular expression corresponding to any of the M text clusters; if it does, the new text belongs to that text cluster.
  • Extracting a regular expression from the pre-processed texts of each obtained text cluster captures the commonality of that cluster's pre-processed texts. When new text is obtained, it does not need to be re-clustered together with the already-clustered texts; the new text is simply matched against the regular expression corresponding to each text cluster, which greatly improves the clustering speed for new text.
  • a second aspect of the present application provides a text clustering apparatus including an acquisition unit and a processing unit.
  • the obtaining unit is configured to receive N texts to be clustered, where N is an integer greater than 1, and replace the numbers in the N texts with the first identifier.
  • the processing unit is configured to perform a pre-processing operation on the N texts, merging the adjacent first identifiers in the N texts to obtain N pre-processed texts corresponding to the N texts; to segment the N pre-processed texts to obtain their word segmentation results; to obtain the statistical features of each word in those segmentation results; and then to cluster the N texts according to the statistical features of the words in the segmentation results of the N pre-processed texts.
  • the apparatus is for implementing the text clustering method provided by the first aspect.
  • a third aspect of the present application provides a computing device including a processor and a memory.
  • the computing device can implement the text clustering method provided by the first aspect, and the program code for implementing the text clustering method provided by the first aspect can be saved in a memory and executed by the processor.
  • a fourth aspect of the present application provides a storage medium capable of implementing the text clustering method provided by the first aspect when the program code stored in the storage medium is executed.
  • the program code is comprised of computer instructions implementing the text clustering method provided by the first aspect.
  • FIG. 1 is a schematic diagram showing the organization structure of a text clustering system provided by the present invention.
  • FIG. 2 is a schematic structural diagram of a computing device provided by the present invention.
  • FIG. 3 is a schematic flowchart diagram of a text clustering method provided by the present invention.
  • FIG. 4 is a schematic diagram showing the organization structure of a text clustering apparatus provided by the present invention.
  • borderless language refers to a language in which there are no punctuation or spaces for demarcation between characters.
  • Common borderless languages include Chinese, Japanese, and the like.
  • bordered languages refer to languages in which punctuation marks or spaces define word boundaries. The most common bordered language is English.
  • clustering refers to the process of classifying objects into different clusters according to the characteristics of different objects. Each cluster contains multiple objects with a certain degree of commonality or a high degree of similarity.
  • a regular expression is a string of characters that describes a set of syntactic rules, such as which characters a string contains, their positions, and their order.
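As a concrete illustration of this definition (the strings and the pattern are hypothetical, using Python's re module):

```python
import re

# A rule set: the string starts with "Aug", then a space, then one or more digits.
pattern = re.compile(r"^Aug \d+")

assert pattern.match("Aug 17 04:27:22") is not None  # satisfies the rules
assert pattern.match("Sep 17 04:27:22") is None      # violates them
```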
  • FIG. 1 is an implementation of a text clustering system 200, including a storage device 206 and a text clustering device 202.
  • the storage device 206 stores a text library for storing text to be clustered, and the storage device 206 can establish communication with the text clustering device 202 through the communication network 204.
  • the storage device 206 can also be deployed directly in the text clustering device 202, in which case communication with the text clustering device 202 is established through the input and output unit 2021.
  • the text clustering device 202 includes an input and output unit 2021 and a processing unit 2022.
  • the input and output unit 2021 may be a network interface; if the storage device 206 is deployed within the text clustering device 202, the input and output unit 2021 may also be the interface through which the text clustering device 202 accesses the local storage device.
  • the processor 402, the memory 404, and the communication interface 406 can implement communication connection with each other through the bus 408, and can also implement communication by other means such as wireless transmission.
  • the memory 404 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also include a combination of the above types of memory.
  • memory 404 loads text stored in a text library in storage device 206 for use by processor 402.
  • the program code for implementing the text clustering method provided by FIG. 3 of the present invention may be stored in the memory 404 and executed by the processor 402.
  • the computing device 400 obtains the text to be processed through the communication interface 406; after obtaining the text clustering result, the computing device 400 can also return the result to the user through the communication interface 406.
  • the processor 402 can be a central processing unit (CPU).
  • the processor 402 retrieves a plurality of texts stored in the text library and replaces the numbers in the text with a first identifier, which may be a specific character, such as the letter d.
  • the pre-processing operation is then performed on the texts on which the replacement operation was performed; the pre-processing operation combines two adjacent first identifiers in each such text into one first identifier. If there are more than two adjacent first identifiers in a text, they may all be combined into one first identifier. Spaces and punctuation in the text are preserved.
  • N texts correspond to N preprocessed texts
  • N is a positive integer and N is equal to the number of texts to be clustered.
  • in the word segmentation result, the pre-processed text of each text is divided into a plurality of words, for example M words, and a statistical feature is obtained for each word; thus M statistical features can be extracted from the pre-processed text of each text. The plurality of texts to be clustered are then clustered according to the statistical features of the words in the segmentation results of their pre-processed texts.
  • since M statistical features can be extracted from the pre-processed text of each text, the pre-processed texts are clustered according to these M statistical features; if the pre-processed texts of multiple texts are clustered into one cluster, those texts are also clustered into one cluster.
  • the multiple texts to be clustered can thus each be represented by a series of statistical features of words, and the texts are clustered according to these statistical features, so that clustering is no longer based only on the similarity of the texts' content.
  • by replacing parts of the text's content with identifiers, merging the adjacent identifiers, and using the identifiers to represent the format of the content, the texts are clustered by their format, which can improve the clustering precision.
  • before the pre-processing operation is performed on each text, the method further includes replacing the letters in the plurality of texts to be clustered with a second identifier, and the pre-processing operation further includes combining two adjacent second identifiers into one second identifier. In this way, not only the numbers but also the letters in the text are replaced, so that the resulting pre-processed text better represents the format of the text and improves the clustering precision.
  • the processor 402 clusters the plurality of texts to be clustered into a plurality of text clusters
  • a regular expression corresponding to each text cluster is extracted from the texts included in that cluster; the regular expression reflects commonalities in the content of the texts in the cluster.
  • alternatively, after the processor 402 clusters the plurality of texts to be clustered into a plurality of text clusters, the regular expression corresponding to each text cluster is extracted from the pre-processed texts of the texts included in that cluster; this regular expression reflects commonalities among the pre-processed texts in the cluster.
  • after new text is obtained, if the new text needs to be clustered into an existing text cluster, it is determined whether the pre-processed text of the new text satisfies the regular expression corresponding to any text cluster; if it satisfies the regular expression corresponding to a text cluster, the new text belongs to that cluster.
  • after the texts to be clustered have been classified into different text clusters, if the text clustering system acquires new text, it is not necessary to re-cluster all the texts; the regular expressions have already been extracted from each text cluster or from the pre-processed texts corresponding to it, and the new text is assigned to whichever cluster's regular expression it satisfies, which speeds up the clustering of new text.
  • the present invention also provides a text clustering method.
  • both the text clustering device 202 in FIG. 1 and the computing device 400 in FIG. 2 can execute this text clustering method; its flowchart is shown in FIG. 3.
  • Step 602 Replace the numbers in the plurality of texts with the first identifier.
  • the plurality of texts to be clustered are obtained, and the numbers in the plurality of texts to be clustered are replaced with the first identifier.
  • the first identifier is the character “d” as an example.
  • Text 1 is one of the plurality of texts to be clustered, and text 1 includes Aug 17 04:27:2203 peloton kernel:[pid]uid tgid totalvm; after replacing the numbers in text 1 with the first identifier, text 1 includes Aug dd dd:dd:dddd peloton kernel:[pid]uid tgid totalvm.
  • in addition, the letters in the plurality of texts to be clustered may be replaced with the second identifier.
  • taking the character “w” as the second identifier, after step 602 is performed, text 1 includes www dd dd:dd:dddd wwwwww wwwwww:[www]www wwww wwwww ww.
  • Step 604 Perform a pre-processing operation on each text to obtain pre-processed text of each text.
  • the pre-processing operation includes: combining two adjacent first identifiers into one first identifier.
  • a pre-processing operation is performed on each text, and the pre-processing operation combines the two adjacent first identifiers in each text into one first identifier. If there are multiple adjacent first identifiers in the text, the plurality of adjacent first identifiers may be combined into one first identifier. Spaces and punctuation in the text can be preserved.
  • the pre-processed text of text 1 then includes Aug d d:d:ddd peloton kernel:[pid]uid tgid totalvm; the adjacent first identifiers in text 1 may be merged further until no adjacent first identifiers remain, i.e. the pre-processed text of text 1 includes Aug d d:d:d peloton kernel:[pid]uid tgid totalvm. Two identifiers are considered adjacent when there is no punctuation, space, or any other character between them.
  • the pre-processing operation further includes: combining the two adjacent second identifiers into one second identifier.
  • the merging process is the same as the process of merging two adjacent first identifiers into one first identifier.
  • if the partially merged text includes ww d d:d:ddd wwwww wwwww:[ww]ww www wwwww, then after merging is complete the pre-processed text of text 1 includes w d d:d:d w w:[w]w w w.
  • Step 606 Segment the pre-processed text of each text to obtain the word segmentation result of the pre-processed text of each text.
  • there are many methods for segmenting the pre-processed text into words. Common word segmentation methods for bordered languages include N-gram segmentation; word segmentation methods for borderless languages generally need to rely on known words in a dictionary. After the pre-processed text is segmented, its word segmentation result contains the words into which the pre-processed text was divided.
  • the word segmentation result of the pre-processed text w d d:d:d w w:[w]w w w of text 1 includes w d d:d:d, d d:d:d w, d:d:d w w:, w w:[w], w:[w]w, [w]w w, and w w w, a total of 7 words.
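One way to realize the N-gram segmentation mentioned above is a sliding window over the space-separated units of the pre-processed text. The window size n = 3 and the tokenization are assumptions; the patent's own example uses a slightly different windowing, so this is only an approximation of the idea:

```python
def ngram_segment(preprocessed, n=3):
    # Sliding window of n space-separated units over the pre-processed
    # text; each window is one "word" of the segmentation result.
    tokens = preprocessed.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ngram_segment("w d d:d:d w w:[w]w w w")
# len(words) == number of tokens - n + 1
```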
  • Step 608 Acquire a statistical feature of each word in the word segmentation result of the preprocessed text of each text.
  • the statistical features of each word in the word segmentation result are then obtained; the statistical features include the word's frequency, the variance of the word, the word's term frequency-inverse document frequency (TF-IDF), and so on.
  • if the word segmentation result of the pre-processed text of a text includes K words, and L statistical features are extracted for each of the K words, then K*L statistical features in total can be extracted from the pre-processed text, and the pre-processed text of the text can be expressed by a K*L-dimensional vector.
  • the pre-processed text of each text to be clustered can be expressed by a vector.
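As a sketch of one such statistical feature, TF-IDF can be computed over the word segmentation results with only the standard library; here L = 1 (one feature per word), and the sample segmentation results are hypothetical:

```python
import math

def tf_idf(docs):
    # docs: list of word-segmentation results (each a list of words).
    # Returns, per document, a TF-IDF value for each distinct word.
    n = len(docs)
    df = {}                      # document frequency of each word
    for words in docs:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    out = []
    for words in docs:
        out.append({w: (words.count(w) / len(words)) * math.log(n / df[w])
                    for w in set(words)})
    return out

feats = tf_idf([["w d d:d:d", "d d:d:d w"], ["w d d:d:d", "w w w"]])
```

A word shared by every document gets IDF log(1) = 0, so it contributes nothing to distinguishing the texts, which is the intended behavior of this feature.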
  • Step 610 Cluster multiple texts according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.
  • the texts can be clustered by a clustering algorithm operating on these vectors.
  • clustering algorithms include k-means, k-medoids, CLARANS, BIRCH, CURE, Chameleon, DBSCAN, OPTICS, DENCLUE, and the like.
  • a text corresponds to one pre-processed text
  • a pre-processed text corresponds to one word segmentation result
  • a word segmentation result corresponds to the statistical features of a series of words; therefore, if the statistical features of the words in the segmentation results of two texts are recognized by the clustering algorithm as belonging to the same cluster, the two texts belong to the same text cluster.
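A minimal sketch of step 610 with k-means (one of the algorithms listed above), using deterministic farthest-point initialization so the example is reproducible; the vectors are hypothetical stand-ins for the K*L-dimensional feature vectors:

```python
def kmeans(vectors, k, iters=20):
    # Squared Euclidean distance between two equal-length vectors.
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Deterministic farthest-point initialization (an assumption; any
    # standard initialization scheme works).
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(d2(v, c) for c in centers)))
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: d2(v, centers[c]))
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Texts whose feature vectors land in the same cluster form one text cluster.
labels = kmeans([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]], k=2)
```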
  • the preprocessed text corresponding to text 1 to text 7 respectively is:
  • the text clusters can be further merged according to the obtained clustering results. For example, if the texts included in text cluster 1 and text cluster 3 are related, text cluster 1 and text cluster 3 can be merged again; the merged text cluster 4 includes text 1, text 2, text 5, text 6, and text 7.
  • the merging may be performed according to preset conditions, or the user may merge the resulting text clusters according to their needs.
  • the text in the present invention may also be a part of an object to be clustered.
  • a common object to be clustered, such as a log, may include: CPU:90 MEM:80 info:Aug 17 04:27:22 peloton kernel:[pid]uid tgid totalvm, where the info field contains the text described in the present invention. The object to be clustered thus includes, in addition to the text, the content of other fields; for example, the content of the CPU field indicates the CPU utilization when the log was generated, and the content of the MEM field indicates the memory utilization at that time.
  • this content is already numeric and can be used directly for clustering.
  • the pre-processed text of the text can be expressed by a K*L-dimensional vector. Therefore, if the text is part of a log to be clustered, the log can be expressed by a (K*L+2)-dimensional vector: K*L dimensions for the statistical features extracted from the text's pre-processed text, plus the contents of the log's existing CPU and MEM fields.
  • because the pre-processed text of a text tends to be segmented into many words, and more than one statistical feature can be extracted for each word, K*L is often large. If the log is expressed by a (K*L+2)-dimensional vector and every dimension carries the same weight in clustering, the clustering result will be dominated by the content of the text, and the content of the log's other fields will have too little influence. Therefore, after the K*L statistical features are extracted from the text, weights can be set for them; for example, each of the K*L dimensions is given weight P/(K*L), so that the sum of the weights of the K*L statistical features extracted from the text is P. By adjusting P, the influence of the text's content on the clustering result can be tuned, avoiding the case where a large K*L excessively dilutes the influence of the fields other than the text, and improving clustering precision.
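The weighting described above can be sketched directly; the function and variable names are illustrative, and appending CPU and MEM unweighted is an assumption consistent with the text:

```python
def log_vector(text_feats, cpu, mem, p=1.0):
    # Each of the K*L text-derived features gets weight P/(K*L), so the
    # weights of the text features sum to P; CPU and MEM are appended as-is.
    kl = len(text_feats)
    weight = p / kl
    return [f * weight for f in text_feats] + [cpu, mem]

v = log_vector([2.0, 4.0], cpu=90, mem=80, p=1.0)  # each weight is 0.5
```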
  • after step 610, steps 612 to 616 may further be performed.
  • Step 612 After multiple texts are clustered, multiple text clusters are acquired.
  • Step 614 Extract a regular expression corresponding to each text cluster from the texts of that cluster. For each text cluster obtained in step 612, the corresponding regular expression is extracted; for example, if all the texts included in a text cluster start with "mytime" and end with "anomalyScore", the regular expression "^mytime.*anomalyScore$" can be extracted for that cluster.
  • the regular expression corresponding to a text cluster matches every text in the cluster, or at least more than a certain proportion of the texts in the cluster.
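The patent does not specify how the regular expression is extracted; one simple sketch that covers the "^mytime.*anomalyScore$" example is to escape the longest common prefix and suffix of the cluster's texts (assuming prefix and suffix do not overlap):

```python
import re
from os.path import commonprefix

def cluster_regex(texts):
    # Longest common prefix, plus longest common suffix (computed as the
    # common prefix of the reversed strings), joined by ".*".
    prefix = commonprefix(texts)
    suffix = commonprefix([t[::-1] for t in texts])[::-1]
    return "^" + re.escape(prefix) + ".*" + re.escape(suffix) + "$"

pattern = cluster_regex(["mytime 1 anomalyScore", "mytime 22 anomalyScore"])
```

Every text used to build the pattern then matches it, as does new text with the same shape.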
  • Step 616 Obtain new text, determine whether the new text satisfies a regular expression corresponding to the first text cluster, and the first text cluster is any text cluster of the plurality of text clusters, if the new text meets the regularity corresponding to the first text cluster An expression, the new text belongs to the first text cluster.
  • after the device for text clustering acquires the new text to be clustered, it determines whether the new text satisfies the regular expression corresponding to any text cluster. If the new text satisfies the regular expression corresponding to a certain text cluster, the new text belongs to that cluster. If the new text satisfies the regular expressions corresponding to several text clusters at the same time, it can be classified into any one of them.
  • the device for text clustering may perform steps 602 through 612 with a portion of its existing text to obtain a plurality of text clusters, and then perform steps 614 and 616 with another portion of the existing text, Then another part of the existing text is the new text.
  • this process is similar to training a model on a subset of samples in a machine learning algorithm and using the trained model to classify the remaining samples; the device for text clustering therefore does not need to cluster all the text, which improves clustering efficiency.
  • alternatively, the device may perform steps 602 to 612 on all existing texts to obtain a plurality of text clusters, and then perform steps 614 and 616 on newly generated or newly user-entered text; the newly generated or newly entered text is the new text.
  • the new text can thus be classified into an existing text cluster; if the new text cannot satisfy the regular expression corresponding to any text cluster, it can instead form a new text cluster. When new text is obtained, it is not necessary to re-cluster it together with the texts that have already been clustered; the new text is clustered using the regular expressions of the already-obtained text clusters, which improves the clustering speed for new text.
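Steps 614 and 616 together amount to a lookup over the clusters' regular expressions; the cluster names and patterns here are illustrative:

```python
import re

def classify(new_text, cluster_patterns):
    # Return the first cluster whose regular expression the new text
    # satisfies; None signals that a new text cluster may be created.
    for cluster_id, pattern in cluster_patterns.items():
        if re.match(pattern, new_text):
            return cluster_id
    return None

patterns = {"cluster1": r"^mytime.*anomalyScore$", "cluster2": r"^Aug .*"}
print(classify("mytime 42 anomalyScore", patterns))  # -> cluster1
```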
  • after step 610, steps 618 to 624 may alternatively be performed.
  • Step 618 After clustering the plurality of texts, acquiring a plurality of text clusters.
  • Step 620 Extract a regular expression corresponding to the text cluster from the pre-processed text of the text of each text cluster.
  • the regular expression corresponding to a text cluster matches the pre-processed text of every text in the cluster, or at least of more than a certain proportion of the texts in the cluster.
  • Step 622 Obtain new text, perform pre-processing on the new text, and obtain pre-processed text corresponding to the new text.
  • the pre-processing operation on the new text in step 622 follows step 604 and its alternatives.
  • Step 624 Determine whether the pre-processed text corresponding to the new text satisfies the regular expression corresponding to the second text cluster, where the second text cluster is any text cluster of the plurality of text clusters; if the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.
  • after the device for text clustering obtains the new text to be clustered, it determines whether the pre-processed text corresponding to the new text satisfies the regular expression corresponding to any text cluster. If it satisfies the regular expression corresponding to a certain text cluster, the new text belongs to that cluster. If it satisfies the regular expressions corresponding to several text clusters, the new text may be classified into any one of them.
  • the device for text clustering may perform steps 602 through 618 with a portion of its existing text to obtain a plurality of text clusters, and then perform steps 620 through 624 with another portion of the existing text, Then another part of the existing text is the new text.
  • this process is similar to training a model on a subset of samples in a machine learning algorithm and using the trained model to classify the remaining samples; the device for text clustering therefore does not need to cluster all the text, which improves clustering efficiency.
  • alternatively, the device may perform steps 602 to 618 on all existing texts to obtain a plurality of text clusters, and then perform steps 620 to 624 on newly generated or newly user-entered text; the newly generated or newly entered text is the new text.
  • the new text can thus be classified into an existing text cluster; if the new text cannot satisfy the regular expression corresponding to any text cluster, it can instead form a new text cluster. When new text is obtained, it is not necessary to re-cluster it together with the texts that have already been clustered; the new text is clustered using the regular expressions of the already-obtained text clusters, which improves the clustering speed for new text.
  • the above embodiment provides a text clustering method: after the texts to be clustered are pre-processed, the pre-processed texts are segmented and clustered, so that the texts can be clustered according to their format, which improves the accuracy of text clustering.
  • the embodiment of the present invention further provides a text clustering apparatus 800, which may be implemented by the text clustering device 202 shown in FIG. 1 or by the computing device 400 shown in FIG. 2, and may also be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the text clustering apparatus 800 is used to implement the text clustering method shown in FIG. 3.
  • the text clustering apparatus 800 includes an obtaining unit 802 for replacing numbers in the plurality of texts with the first identifier, and a processing unit 804 for performing a pre-processing operation on each text to obtain pre-processed text of each text,
  • the pre-processing operation includes: combining two adjacent first identifiers into one first identifier; and is further used for segmenting the pre-processed text of each text, obtaining a word segmentation result of the pre-processed text of each text; Obtaining the statistical characteristics of each word in the word segmentation result of the preprocessed text of each text; and also for clustering the plurality of texts according to the statistical features of each word in the segmentation result of the preprocessed text of each text.
  • the processing unit 804 is further configured to replace the letters in the plurality of texts with the second identifier; the pre-processing operation further includes combining two adjacent second identifiers into one second identifier.
  • the processing unit 804 is further configured to: after the plurality of texts are clustered, acquire a plurality of text clusters; and further, use to extract a regular expression corresponding to the text cluster from the text of each text cluster; Obtaining a new text, determining whether the new text satisfies a regular expression corresponding to the first text cluster, the first text cluster is any text cluster of the plurality of text clusters, and if the new text conforms to the regular expression corresponding to the first text cluster, the new text The text belongs to the first text cluster.
  • The processing unit 804 is further configured to: obtain multiple text clusters after the multiple texts are clustered; extract, from the pre-processed texts of the texts of each text cluster, a regular expression corresponding to that text cluster; obtain a new text, perform the pre-processing operation on the new text, and obtain the pre-processed text corresponding to the new text; and determine whether the pre-processed text corresponding to the new text satisfies the regular expression corresponding to a second text cluster, the second text cluster being any one of the multiple text clusters. If the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.
  • The foregoing embodiment provides a text clustering apparatus which, after pre-processing the texts to be clustered, performs word segmentation and clustering on the pre-processed texts of the texts, so that the texts can be clustered according to their format, improving the precision of text clustering.
  • The methods described in connection with the present disclosure may be implemented by a processor executing software instructions.
  • The software instructions may consist of corresponding software modules, which may be stored in RAM, flash memory, ROM, erasable programmable read-only memory (English: erasable programmable read only memory, abbreviation: EPROM), electrically erasable programmable read-only memory (English: electrically erasable programmable read only memory, abbreviation: EEPROM), a hard disk, an optical disc, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text clustering method: after a device (202) for text clustering obtains the texts to be clustered, it replaces the numbers in the texts to be clustered with a first identifier (602), merges adjacent first identifiers in the texts to be clustered to obtain pre-processed texts of the texts to be clustered (604), and clusters the pre-processed texts of the texts to be clustered (610). By pre-processing the texts to be clustered, the format of the texts to be clustered is extracted, and the texts to be clustered are clustered according to their format, which improves the precision of text clustering.

Description

Text clustering method, apparatus and computing device

Technical Field

The present invention relates to the field of computer technology, and in particular to a text clustering method, a text clustering apparatus, and a computing device for text clustering.

Background

When a large number of texts exist, it is often necessary to cluster them, i.e. to sort the large number of texts into a certain number of clusters (English: cluster), so as to facilitate subsequent processing of these texts.

The process of clustering texts is the process of gathering similar texts together. In the prior art, the similarity between texts is usually computed from the content the texts contain; generally, multiple texts that share more of the same content are considered to have a higher degree of similarity.

However, the content of some types of text, such as logs, varies with input parameters and output parameters, so clustering such texts according to their content yields low precision.

Summary

The present application provides a text clustering method, a text clustering apparatus, and a computing device for text clustering, to improve the precision of text clustering.
A first aspect of the present application provides a text clustering method executed by a computer, comprising: receiving N texts to be clustered, N being an integer greater than 1, and replacing the numbers in the N texts with a first identifier; performing a pre-processing operation on the N texts, merging adjacent first identifiers in the N texts to obtain N pre-processed texts corresponding to the N texts; segmenting the N pre-processed texts to obtain word segmentation results of the N pre-processed texts, and obtaining statistical features of each word in the word segmentation results of the N pre-processed texts; and clustering the N texts according to the statistical features of each word in the word segmentation results of the N pre-processed texts.

By performing the pre-processing operation on the texts to be clustered, what the pre-processed text of each text retains is no longer the content of the text itself but the format of the text; the texts are then clustered according to their pre-processed texts, so that the clustering process can take the format of the texts into account, improving the precision of text clustering.

With reference to the first aspect, in a first implementation of the first aspect, not only are the numbers in the N texts replaced with the first identifier, but the graphemes in the N texts are also replaced with a second identifier. Accordingly, the pre-processing operation further comprises: merging two adjacent second identifiers into one second identifier.

Further, not only the numbers in the texts to be clustered but also the graphemes in the texts to be clustered are processed, which further abstracts the format of the texts to be processed for use in subsequent clustering and can further improve the precision of text clustering.

With reference to the first aspect and the first implementation of the first aspect, in a second implementation of the first aspect, after the N texts are clustered, M text clusters are obtained. A regular expression corresponding to each text cluster is extracted from the texts of that text cluster; a new text is obtained, and it is determined whether the new text satisfies the regular expression corresponding to any of the M text clusters; if the new text matches the regular expression corresponding to a text cluster, the new text belongs to that text cluster.

By extracting regular expressions from the text clusters already obtained, the commonalities in content of each text cluster are captured. After a new text is obtained, there is no need to re-cluster the new text together with the texts on which clustering has already been performed; instead, the new text is matched against the regular expressions corresponding to the text clusters, which greatly increases the clustering speed for new texts.

With reference to the first aspect and the first implementation of the first aspect, in a third implementation of the first aspect, after the N texts are clustered, M text clusters are obtained. A regular expression corresponding to each text cluster is extracted from the pre-processed texts of the texts included in that text cluster; a new text is obtained, and it is determined whether the new text satisfies the regular expression corresponding to any of the M text clusters; if the new text matches the regular expression corresponding to a text cluster, the new text belongs to that text cluster.

By extracting regular expressions from the pre-processed texts of the text clusters already obtained, the commonalities in format of the pre-processed texts of each text cluster are captured. After a new text is obtained, there is no need to re-cluster the new text together with the texts on which clustering has already been performed; instead, the new text is matched against the regular expressions corresponding to the text clusters, which greatly increases the clustering speed for new texts.

A second aspect of the present application provides a text clustering apparatus comprising an obtaining unit and a processing unit. The obtaining unit is configured to receive N texts to be clustered, N being an integer greater than 1, and to replace the numbers in the N texts with a first identifier. The processing unit is configured to perform a pre-processing operation on the N texts, merging adjacent first identifiers in the N texts to obtain N pre-processed texts corresponding to the N texts; to segment the N pre-processed texts to obtain word segmentation results of the N pre-processed texts and the statistical features of each word in those results; and then to cluster the N texts according to the statistical features of each word in the word segmentation results of the N pre-processed texts. The apparatus is used to implement the text clustering method provided in the first aspect.

A third aspect of the present application provides a computing device comprising a processor and a memory. When running, the computing device can implement the text clustering method provided in the first aspect; the program code for implementing the text clustering method provided in the first aspect may be stored in the memory and executed by the processor.

A fourth aspect of the present application provides a storage medium; when the program code stored in the storage medium is executed, the text clustering method provided in the first aspect can be implemented. The program code consists of computer instructions that implement the text clustering method provided in the first aspect.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of the organizational structure of a text clustering system provided by the present invention;

FIG. 2 is a schematic diagram of the organizational structure of a computing device provided by the present invention;

FIG. 3 is a schematic flowchart of a text clustering method provided by the present invention;

FIG. 4 is a schematic diagram of the organizational structure of a text clustering apparatus provided by the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings.

Throughout this specification, the term "boundary-free language" refers to a language in which there are no delimiting punctuation marks or spaces between characters; common boundary-free languages include Chinese and Japanese. Correspondingly, a bounded language is a language in which punctuation marks or spaces delimit the characters; the most common bounded language is English.

Throughout this specification, the term "clustering" refers to the process of sorting objects into different clusters according to the features of those objects. Each cluster contains multiple objects that share certain commonalities or have a high degree of similarity.

Throughout this specification, the term "regular expression" refers to a character string that describes a set of syntactic rules, for example which characters are included, the positions of the characters, the order of the characters, and so on.
FIG. 1 shows an implementation of a text clustering system 200, comprising a storage device 206 and a text clustering device 202. The storage device 206 stores a text library for storing the texts to be clustered. The storage device 206 may establish communication with the text clustering device 202 through a communication network 204, or may be arranged directly in the text clustering device 202 and communicate with the text clustering device 202 through an input/output unit 2021. The text clustering device 202 includes the input/output unit 2021 and a processing unit 2022. If the storage device 206 communicates with the text clustering device 202 through the communication network 204, the input/output unit 2021 may be a network interface; if the storage device 206 is deployed inside the text clustering device 202, the input/output unit 2021 may also be an interface through which the text clustering device 202 accesses the local storage device.
The processor 402, the memory 404 and the communication interface 406 may be communicatively connected to one another through a bus 408, or may communicate by other means such as wireless transmission.

The memory 404 may include volatile memory (English: volatile memory), for example random-access memory (English: random-access memory, abbreviation: RAM); the memory may also include non-volatile memory (English: non-volatile memory), for example read-only memory (English: read-only memory, abbreviation: ROM), flash memory (English: flash memory), a hard disk drive (English: hard disk drive, abbreviation: HDD) or a solid-state drive (English: solid-state drive, abbreviation: SSD); the memory 404 may also include a combination of the above kinds of memory. When the computing device 400 runs, the memory 404 loads the texts stored in the text library of the storage device 206 for use by the processor 402. When the technical solution provided by the present invention is implemented in software, the program code for implementing the text clustering method of FIG. 3 of the present invention may be stored in the memory 404 and executed by the processor 402.

The computing device 400 obtains the texts to be processed through the communication interface 406, and after the text clustering result is obtained, the result may also be returned to the user through the communication interface 406.
The processor 402 may be a central processing unit (English: central processing unit, abbreviation: CPU). The processor 402 obtains the multiple texts stored in the text library and replaces the numbers in these texts with a first identifier; the first identifier may be a specific character, for example the letter d. A pre-processing operation is then performed on each text on which the replacement has been performed, namely merging two adjacent first identifiers in each such text into one first identifier. If there are multiple adjacent first identifiers in a text, the multiple adjacent first identifiers may be merged into one first identifier. Spaces and punctuation marks in the text may be retained.

After the pre-processing operation is performed on a text, a pre-processed text corresponding to that text is generated. Therefore, N texts correspond to N pre-processed texts, where N is a positive integer equal to the number of texts to be clustered. The pre-processed text of each text is segmented into words. If the pre-processed text includes only punctuation marks and the first identifier, or includes only a bounded language such as English, it suffices to segment the text by spaces; if the text includes a boundary-free language, segmentation of the pre-processed text also needs to rely on the known words in a lexicon, a preset word segmentation method, and so on.
In the word segmentation result of the pre-processed text of each text, the pre-processed text of the text is split into multiple words, for example M words. The statistical features of each word in the pre-processed text of each text are obtained; for example, if one statistical feature is extracted per word, M statistical features can be extracted from the pre-processed text of each text. The multiple texts to be clustered are then clustered according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.

If M statistical features can be extracted from the pre-processed text of each text, the pre-processed texts of the texts are clustered according to these M statistical features; if the pre-processed texts of multiple texts are clustered into one cluster, those multiple texts are also clustered into one cluster.

Through the pre-processing operation and the word segmentation, the multiple texts to be clustered can be represented by the statistical features of a series of words, and the texts are clustered according to the statistical features of these words. Clustering is thus no longer performed only according to the similarity of the texts' content; instead, the content of a text is replaced with identifiers and adjacent identifiers are merged, so that the identifiers express the format of the text's content. Clustering texts according to their format in this way can improve the precision of text clustering.

Optionally, before the pre-processing operation is performed on each text, the method further includes: replacing the graphemes in the multiple texts to be clustered with a second identifier; the pre-processing operation then further includes merging two adjacent second identifiers into one second identifier. Further, replacing not only the numbers in the texts but also the graphemes in the texts with the second identifier allows the resulting pre-processed texts to better express the format of the texts, improving clustering precision.
After the processor 402 clusters the multiple texts to be clustered into multiple text clusters, a regular expression corresponding to each text cluster is extracted from the texts included in that text cluster; the regular expression corresponding to each text cluster reflects some commonalities in the content of that text cluster. After a new text is obtained, if the new text also needs to be clustered into an existing text cluster, it is determined whether the new text satisfies the regular expression corresponding to any text cluster; if the new text satisfies the regular expression corresponding to some text cluster, the new text belongs to that text cluster.

Alternatively, after the processor 402 clusters the multiple texts to be clustered into multiple text clusters, a regular expression corresponding to each text cluster is extracted from the pre-processed texts of the texts included in that text cluster; the regular expression corresponding to each text cluster reflects some commonalities in the content of the pre-processed texts of the texts in that cluster. After a new text is obtained, if the new text also needs to be clustered into an existing text cluster, it is determined whether the pre-processed text of the new text satisfies the regular expression corresponding to any text cluster; if the pre-processed text of the new text satisfies the regular expression corresponding to some text cluster, the new text belongs to that text cluster.

After the texts to be clustered have been sorted into different text clusters, if the text clustering system obtains a new text, there is no need to re-cluster all the texts; it suffices to extract regular expressions from the text clusters already obtained, or from the pre-processed texts corresponding to the text clusters, and the new text is classified into whichever text cluster's extracted regular expression it satisfies, which speeds up the clustering of new texts.
The present invention further provides a text clustering method, which is executed by the text clustering device 202 in FIG. 1 and the computing device 400 in FIG. 2 when they run; a schematic flowchart of the method is shown in FIG. 3.

Step 602: replace the numbers in the multiple texts with a first identifier.

The multiple texts to be clustered are obtained, and the numbers in the multiple texts to be clustered are replaced with the first identifier; in this specification the character "d" is taken as the first identifier by way of example. Text 1 is one of the multiple texts to be clustered and includes Aug 17 04:27:2203peloton kernel:[pid]uid tgid totalvm; after the numbers in Text 1 are replaced with the first identifier, Text 1 includes Aug dd dd:dd:dddd peloton kernel:[pid]uid tgid totalvm.

Optionally, in step 602 the graphemes in the multiple texts to be clustered may also be replaced with a second identifier; in this specification the character "w" is taken as the second identifier by way of example. After step 602 is performed, Text 1 then includes www dd dd:dd:dddd wwwwww wwwwww:[www]www wwww wwwww ww.
Step 604: perform a pre-processing operation on each text to obtain the pre-processed text of each text, the pre-processing operation including: merging two adjacent first identifiers into one first identifier.

After the numbers in the multiple texts to be clustered have all been replaced with the first identifier, the pre-processing operation is performed on each text, namely merging two adjacent first identifiers in each text into one first identifier. If there are multiple adjacent first identifiers in a text, the multiple adjacent first identifiers may all be merged into one first identifier. Spaces and punctuation marks in the text may be retained. Taking Text 1 as an example, after the pre-processing operation is performed on Text 1, the pre-processed text of Text 1 includes Aug d d:d:ddd peloton kernel:[pid]uid tgid totalvm; the adjacent first identifiers in Text 1 may also be merged further, until there are no adjacent first identifiers in the pre-processed text of Text 1, i.e. the pre-processed text of Text 1 includes Aug d d:d:d peloton kernel:[pid]uid tgid totalvm. Two digits are said to be adjacent if there is no punctuation mark, no space, and no other character between the two characters.

Optionally, if in step 602 the graphemes in the multiple texts to be clustered were also replaced with the second identifier, the pre-processing operation further includes: merging two adjacent second identifiers into one second identifier. For the merging process, refer to the process of merging two adjacent first identifiers into one first identifier. The adjacent first identifiers may be merged further and the adjacent second identifiers may be merged further, until the pre-processed text of Text 1 contains no adjacent first identifiers and no adjacent second identifiers; for example, if the pre-processed text of Text 1 includes ww d d:d:ddd wwwww wwwww:[ww]ww www wwwww, then the pre-processed text of Text 1 becomes w d d:d:d w w:[w]w w w.
Step 606: segment the pre-processed text of each text to obtain the word segmentation result of the pre-processed text of each text.

There are various methods for segmenting the pre-processed text of a text. A common segmentation method for bounded languages is N-gram segmentation; segmentation methods for boundary-free languages generally need to rely on the known words in a lexicon. After the pre-processed text is segmented, the word segmentation result of the pre-processed text contains the words into which the pre-processed text has been split. Taking 3-gram segmentation as an example, the segmentation result of the pre-processed text of Text 1, w d d:d:d w w:[w]w w w, includes w d d:d:d, d d:d:d w, d:d:d w w:, w w:[w], w:[w]w, [w]w w, w w w, seven words in total.
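A token-level 3-gram sketch of the segmentation above (the exact token boundaries, e.g. splitting w:[w]w into w:, [w] and w, are inferred from the seven listed words and are an assumption about the tokenizer, which the description does not specify):

```python
def ngrams(tokens, n=3):
    """Each 'word' in the segmentation result is n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Token sequence inferred from the 3-gram example for Text 1:
tokens = ["w", "d", "d:d:d", "w", "w:", "[w]", "w", "w", "w"]
words = ngrams(tokens, 3)
print(len(words))  # 7 words, as in the example
print(words[0])    # w d d:d:d
```

Nine tokens yield 9 - 3 + 1 = 7 trigrams, matching the count given in the description.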
Step 608: obtain the statistical features of each word in the word segmentation result of the pre-processed text of each text.

After the word segmentation result of the pre-processed text of each text is obtained, the statistical features of each word in the segmentation result are further obtained. The statistical features include the term frequency, the variance of the word, the word's term frequency-inverse document frequency (English: term frequency-inverse document frequency, abbreviation: TF-IDF), and so on. If the word segmentation result of the pre-processed text of a text includes K words and L statistical features are extracted for each of the K words, then K*L statistical features in total can be extracted from the pre-processed text of that text; the pre-processed text of the text can therefore be expressed as a K*L-dimensional vector. After the corresponding statistical features have been extracted for the pre-processed text of each text to be clustered, the pre-processed text of each text to be clustered can be expressed as a vector.
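As a sketch, the TF-IDF statistic mentioned above can be computed in plain Python as follows (the particular tf and idf normalizations used here are assumptions; many variants exist and the description does not fix one):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: one token list (segmentation result) per pre-processed text.
    Returns one {word: tf-idf} mapping per document; with one statistic
    per word this corresponds to the K*L-dimensional vector with L = 1."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return vectors

docs = [["w d d", "d d w"], ["w d d", "d w w"]]
vectors = tfidf_vectors(docs)
# "w d d" occurs in every document, so its idf factor log(2/2) is 0
```

Words shared by every pre-processed text score zero and thus contribute nothing to separating clusters, which is the intended effect of the inverse-document-frequency factor.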
Step 610: cluster the multiple texts according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.

After the statistical features corresponding to the pre-processed text of each text to be clustered are obtained, the texts to be clustered can be clustered by a clustering algorithm according to the statistical features of each word in the word segmentation result of the pre-processed text of each text. Clustering algorithms include k-means, k-medoid, CLARANS, BIRCH, CURE, Chameleon, DBSCAN, OPTICS, DENCLUE, and so on. One text corresponds to one pre-processed text, one pre-processed text corresponds to one word segmentation result, and one word segmentation result corresponds to the statistical features of a series of words; therefore, if the statistical features of the words included in the segmentation results of two texts are identified by the clustering algorithm as belonging to the same cluster, the two texts belong to the same text cluster.
Take Text 1 to Text 7 below as the texts to be clustered:

Text 1: Aug 17 04:27:22peloton kernel:[pid]uid tgid totalvm

Text 2: Aug 17 03:41:44peloton kernel:[pid]uid tgid totalvm

Text 3: Aug 17 03:26:41peloton kernel:Free swap

Text 4: Aug 17 03:37:33peloton kernel:Total swap

Text 5: Sep 17 08:51:66peloton kernel:[pid]uid tgid total

Text 6: Jan 23 08:51:66peloton kernel:?do_page

Text 7: Jan 27 11:51:66peloton kernel:?security_real
After text pre-processing, the pre-processed texts corresponding to Text 1 to Text 7 are respectively:

Pre-processed text of Text 1: w d d:d:d w w:[w]w w w

Pre-processed text of Text 2: w d d:d:d w w:[w]w w w

Pre-processed text of Text 3: w d d:d:d w w:w w

Pre-processed text of Text 4: w d d:d:d w w:w w

Pre-processed text of Text 5: w d d:d:d w w:[w]w w w

Pre-processed text of Text 6: w d d:d:d w w:?w_w

Pre-processed text of Text 7: w d d:d:d w w:?w_w

After the pre-processed texts of Text 1 to Text 7 are clustered, Text 1, Text 2 and Text 5 are clustered into text cluster 1, Text 3 and Text 4 are clustered into text cluster 2, and Text 6 and Text 7 are clustered into text cluster 3.
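For these seven example texts the cluster assignment can be reproduced simply by grouping identical pre-processed forms. This is a deliberate simplification of step 610, which clusters statistical-feature vectors with algorithms such as k-means, but texts whose formats are identical necessarily land in the same cluster:

```python
from collections import defaultdict

# Pre-processed forms of Text 1 .. Text 7 from the example above
pre = {
    1: "w d d:d:d w w:[w]w w w",
    2: "w d d:d:d w w:[w]w w w",
    3: "w d d:d:d w w:w w",
    4: "w d d:d:d w w:w w",
    5: "w d d:d:d w w:[w]w w w",
    6: "w d d:d:d w w:?w_w",
    7: "w d d:d:d w w:?w_w",
}

clusters = defaultdict(list)
for text_id, form in pre.items():
    clusters[form].append(text_id)

print(sorted(clusters.values()))  # [[1, 2, 5], [3, 4], [6, 7]]
```

The three groups match text cluster 1, text cluster 2 and text cluster 3 described above.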
After multiple text clusters are obtained, the text clusters may be further merged according to the clustering results already obtained. For example, if the texts included in text cluster 1 and text cluster 3 are related, text cluster 1 and text cluster 3 may be merged again; the merged text cluster 4 then includes Text 1, Text 2, Text 5, Text 6 and Text 7. Specifically, the text clusters obtained by clustering may be merged according to preset conditions or according to the user's needs.
It should be noted that the text in the present invention may also be a part of an object to be clustered. A common object to be clustered, such as a log, includes: CPU:90MEM:80info:Aug 17 04:27:22peloton kernel:[pid]uid tgid totalvm, where the content included in the info field may be the text described in the present invention. Therefore, in addition to the text, the object to be clustered also includes the content of other fields; for example, the content of the CPU field indicates the CPU utilization when the log was generated, and the content of the MEM field indicates the memory utilization when the log was generated. This content is already numeric and can be used directly for clustering.

After the text has been pre-processed and segmented and statistical features have been further extracted for the words, as described in step 608, the pre-processed text of the text can be expressed as a K*L-dimensional vector. Therefore, if the text is part of a log to be clustered, the log can be expressed as a (K*L+2)-dimensional vector, namely the K*L statistical features extracted from the pre-processed text of the text, plus the content included in the existing CPU and MEM fields of the log.

Since the number of words obtained by segmenting the pre-processed text of a text is often large, and multiple statistical features can be extracted for each word, K*L is often large. If the log is expressed as a (K*L+2)-dimensional vector and every dimension of the vector carries the same weight in the clustering result, the log will be influenced too much by the content of the text during clustering, and the content of the log's other fields will have too little influence on the clustering result. Therefore, after the K*L-dimensional statistical features are extracted from the text, weights may also be set for these K*L dimensions: for example, the weight of each of the K*L dimensions is P/(K*L), where P is a preset parameter, while the weight of the content included in the CPU and MEM fields is 1. The sum of the weights of the K*L statistical features extracted from the text is then P, and by setting P the influence of the text's content on the clustering result can be adjusted. This avoids the situation where, when K*L is too large, the influence of the content of the fields other than the text in the object to be clustered on the clustering result is excessively diluted, and improves clustering precision.
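The weighting scheme above can be sketched as follows (the function and parameter names are illustrative assumptions; the description only fixes the idea that the K*L text-format dimensions share a total weight of P while each extra field keeps weight 1):

```python
def feature_weights(k_times_l, p, extra_fields=2):
    """Weights for a (K*L + extra_fields)-dimensional log vector:
    each of the K*L text statistics gets P / (K*L), so together they
    contribute exactly P; the extra fields (e.g. CPU, MEM) keep 1."""
    return [p / k_times_l] * k_times_l + [1.0] * extra_fields

weights = feature_weights(k_times_l=100, p=2.0)
# the text features sum to 2.0 no matter how large K*L grows
```

Scaling each text dimension by P/(K*L) keeps the total influence of the text format constant, so a log line that segments into many words cannot drown out the CPU and MEM fields.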
Optionally, step 610 is followed by steps 612 to 616.

Step 612: after the multiple texts are clustered, obtain multiple text clusters.

Step 614: extract, from the texts of each text cluster, the regular expression corresponding to that text cluster. The regular expression corresponding to each text cluster obtained in step 612 is extracted from that text cluster. For example, if all the texts included in a text cluster begin with "mytime" and end with "anomalyScore", the regular expression "^mytime.*anomalyScore$" can be extracted for this text cluster. The regular expression extracted for each text cluster either matches every text in the cluster, or matches more than a certain proportion of the texts in the cluster.
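One simple way to obtain such a cluster regular expression is to take the longest common prefix and suffix of the cluster's texts, as in the "^mytime.*anomalyScore$" example above (a sketch only; the description does not mandate a particular extraction algorithm, and a production extractor would also handle clusters whose texts are identical, where prefix and suffix overlap):

```python
import os.path
import re

def cluster_regex(texts):
    """Step 614: build '^<common prefix>.*<common suffix>$' for a cluster."""
    prefix = os.path.commonprefix(texts)
    suffix = os.path.commonprefix([t[::-1] for t in texts])[::-1]
    return "^" + re.escape(prefix) + ".*" + re.escape(suffix) + "$"

cluster = ["mytime 17 total anomalyScore", "mytime 23 free anomalyScore"]
pattern = cluster_regex(cluster)

# Step 616: a new text joins the cluster whose regex it matches
assert re.match(pattern, "mytime 99 swap anomalyScore")
assert re.match(pattern, "other 99 swap anomalyScore") is None
```

Matching a new text against a handful of precompiled patterns is far cheaper than re-running the clustering algorithm over the full corpus, which is the speed-up claimed for steps 612 to 616.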
Step 616: obtain a new text and determine whether the new text satisfies the regular expression corresponding to a first text cluster, the first text cluster being any one of the multiple text clusters; if the new text matches the regular expression corresponding to the first text cluster, the new text belongs to the first text cluster.

After the device for text clustering obtains a new text to be clustered, it determines whether the new text can satisfy the regular expression corresponding to any text cluster. If the new text can satisfy the regular expression corresponding to some text cluster, the new text belongs to that text cluster. If the new text simultaneously satisfies the regular expressions corresponding to multiple text clusters, the new text may be classified into any one of those text clusters.

The device for text clustering may use part of its existing texts to perform steps 602 to 612 to obtain multiple text clusters, and then use another part of the existing texts to perform steps 614 and 616; that other part of the existing texts then serves as the new texts. This process is similar to the process in machine learning algorithms of using part of the samples to train a model and using the trained model to recognize the remaining samples, so that the device for text clustering does not need to cluster all the texts, improving the efficiency of text clustering. Alternatively, the device for text clustering may perform steps 602 to 612 on all of its existing texts to obtain multiple text clusters, and then, after obtaining newly generated texts or texts newly input by a user, perform steps 614 and 616 on the newly generated or newly input texts; the newly generated or newly input texts then serve as the new texts.

Through steps 612 to 616, a new text is classified into some text cluster; if the new text cannot satisfy the regular expression corresponding to any text cluster, the new text may also belong to a new text cluster. Thus, when a new text is obtained, there is no need to re-cluster the new text together with the texts on which clustering has already been performed; the new text is clustered by means of the regular expressions of the text clusters already obtained, increasing the clustering speed for new texts.
Optionally, step 610 is followed by steps 618 to 624.

Step 618: after the multiple texts are clustered, obtain multiple text clusters.

Step 620: extract, from the pre-processed texts of the texts of each text cluster, the regular expression corresponding to that text cluster.

The regular expression corresponding to each text cluster obtained in step 618 is extracted from the pre-processed texts of the texts included in that text cluster. For example, if the pre-processed texts of all the texts in a text cluster begin with "mytime" and end with "d:d", the regular expression "^mytime.*d:d$" can be extracted from the pre-processed texts of this text cluster. The regular expression corresponding to each text cluster either matches the pre-processed text of every text in the cluster, or matches the pre-processed texts of more than a certain proportion of the texts in the cluster.

Step 622: obtain a new text, perform the pre-processing operation on the new text, and obtain the pre-processed text corresponding to the new text.

For the pre-processing operation performed on the new text in step 622, refer to step 604 and the optional variants of step 604.

Step 624: determine whether the pre-processed text corresponding to the new text satisfies the regular expression corresponding to a second text cluster, the second text cluster being any one of the multiple text clusters; if the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.

After the device for text clustering obtains a new text to be clustered, it determines whether the pre-processed text corresponding to the new text can satisfy the regular expression corresponding to any text cluster. If the pre-processed text corresponding to the new text can satisfy the regular expression corresponding to some text cluster, the new text belongs to that text cluster. If the pre-processed text corresponding to the new text simultaneously satisfies the regular expressions corresponding to multiple text clusters, the new text may be classified into any one of those text clusters.

The device for text clustering may use part of its existing texts to perform steps 602 to 618 to obtain multiple text clusters, and then use another part of the existing texts to perform steps 620 to 624; that other part of the existing texts then serves as the new texts. This process is similar to the process in machine learning algorithms of using part of the samples to train a model and using the trained model to recognize the remaining samples, so that the device for text clustering does not need to cluster all the texts, improving the efficiency of text clustering. Alternatively, the device for text clustering may perform steps 602 to 618 on all of its existing texts to obtain multiple text clusters, and then, after obtaining newly generated texts or texts newly input by a user, perform steps 620 to 624 on the newly generated or newly input texts; the newly generated or newly input texts then serve as the new texts.

Through steps 618 to 624, a new text is classified into some text cluster; if the pre-processed text of the new text cannot satisfy the regular expression corresponding to any text cluster, the new text may also belong to a new text cluster. Thus, when a new text is obtained, there is no need to re-cluster the new text together with the texts on which clustering has already been performed; the new text is clustered by means of the regular expressions of the text clusters already obtained, increasing the clustering speed for new texts.

The above embodiment provides a text clustering method: after the texts to be clustered are pre-processed, the pre-processed texts of the texts are segmented and clustered, so that the texts can be clustered according to the format of the texts, improving the precision of text clustering.
An embodiment of the present invention further provides a text clustering apparatus 800. The text clustering apparatus 800 may be implemented by the text clustering device 202 shown in FIG. 1, by the computing device 400 shown in FIG. 2, by an application-specific integrated circuit (English: application-specific integrated circuit, abbreviation: ASIC), or by a programmable logic device (English: programmable logic device, abbreviation: PLD). The above PLD may be a complex programmable logic device (English: complex programmable logic device, abbreviation: CPLD), a field-programmable gate array (English: field-programmable gate array, abbreviation: FPGA), generic array logic (English: generic array logic, abbreviation: GAL), or any combination thereof. The text clustering apparatus 800 is used to implement the text clustering method shown in FIG. 3.

The text clustering apparatus 800 includes an obtaining unit 802, configured to replace the numbers in the multiple texts with a first identifier, and a processing unit 804, configured to perform a pre-processing operation on each text to obtain the pre-processed text of each text, the pre-processing operation including: merging two adjacent first identifiers into one first identifier. The processing unit 804 is further configured to segment the pre-processed text of each text to obtain the word segmentation result of the pre-processed text of each text; to obtain the statistical features of each word in the word segmentation result of the pre-processed text of each text; and to cluster the multiple texts according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.

Optionally, before performing the pre-processing operation on each text, the processing unit 804 is further configured to replace the graphemes in the multiple texts with a second identifier; the pre-processing operation further includes: merging two adjacent second identifiers into one second identifier.

Optionally, the processing unit 804 is further configured to obtain multiple text clusters after the multiple texts are clustered; to extract, from the texts of each text cluster, the regular expression corresponding to that text cluster; and to obtain a new text and determine whether the new text satisfies the regular expression corresponding to a first text cluster, the first text cluster being any one of the multiple text clusters; if the new text matches the regular expression corresponding to the first text cluster, the new text belongs to the first text cluster.

Optionally, the processing unit 804 is further configured to obtain multiple text clusters after the multiple texts are clustered; to extract, from the pre-processed texts of the texts of each text cluster, the regular expression corresponding to that text cluster; to obtain a new text, perform the pre-processing operation on the new text, and obtain the pre-processed text corresponding to the new text; and to determine whether the pre-processed text corresponding to the new text satisfies the regular expression corresponding to a second text cluster, the second text cluster being any one of the multiple text clusters; if the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.

The above embodiment provides a text clustering apparatus which, after pre-processing the texts to be clustered, segments and clusters the pre-processed texts of the texts, so that the texts can be clustered according to the format of the texts, improving the precision of text clustering.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions; however, a person skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Further, a person skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and units involved are not necessarily required by the present invention.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments.

The methods described in connection with the disclosure of the present invention may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in RAM, flash memory, ROM, erasable programmable read-only memory (English: erasable programmable read only memory, abbreviation: EPROM), electrically erasable programmable read-only memory (English: electrically erasable programmable read only memory, abbreviation: EEPROM), a hard disk, an optical disc, or any other form of storage medium well known in the art.

Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

  1. A text clustering method, the text clustering method being executed by a computer, characterized in that the method comprises:
    replacing numbers in multiple texts with a first identifier;
    performing a pre-processing operation on each text to obtain a pre-processed text of each text, the pre-processing operation comprising: merging two adjacent ones of the first identifiers into one first identifier;
    segmenting the pre-processed text of each text to obtain a word segmentation result of the pre-processed text of each text;
    obtaining statistical features of each word in the word segmentation result of the pre-processed text of each text;
    clustering the multiple texts according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.
  2. The method according to claim 1, characterized in that before the pre-processing operation is performed on each text, the method further comprises: replacing graphemes in the multiple texts with a second identifier;
    the pre-processing operation further comprises: merging two adjacent ones of the second identifiers into one second identifier.
  3. The method according to claim 1 or 2, characterized in that after the multiple texts are clustered, multiple text clusters are obtained;
    a regular expression corresponding to each text cluster is extracted from the texts of that text cluster;
    a new text is obtained, and it is determined whether the new text satisfies a regular expression corresponding to a first text cluster, the first text cluster being any one of the multiple text clusters; if the new text matches the regular expression corresponding to the first text cluster, the new text belongs to the first text cluster.
  4. The method according to claim 1 or 2, characterized in that after the multiple texts are clustered, multiple text clusters are obtained;
    a regular expression corresponding to each text cluster is extracted from the pre-processed texts of the texts of that text cluster;
    a new text is obtained, the pre-processing operation is performed on the new text, and a pre-processed text corresponding to the new text is obtained;
    it is determined whether the pre-processed text corresponding to the new text satisfies a regular expression corresponding to a second text cluster, the second text cluster being any one of the multiple text clusters; if the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.
  5. A text clustering apparatus, characterized by comprising:
    an obtaining unit, configured to replace numbers in multiple texts with a first identifier;
    a processing unit, configured to perform a pre-processing operation on each text to obtain a pre-processed text of each text, the pre-processing operation comprising: merging two adjacent ones of the first identifiers into one first identifier; further configured to segment the pre-processed text of each text to obtain a word segmentation result of the pre-processed text of each text; further configured to obtain statistical features of each word in the word segmentation result of the pre-processed text of each text; and further configured to cluster the multiple texts according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.
  6. The apparatus according to claim 5, characterized in that before the pre-processing operation is performed on each text, the processing unit is further configured to replace graphemes in the multiple texts with a second identifier; the pre-processing operation further comprises: merging two adjacent ones of the second identifiers into one second identifier.
  7. The apparatus according to claim 5 or 6, characterized in that the processing unit is further configured to obtain multiple text clusters after the multiple texts are clustered; further configured to extract, from the texts of each text cluster, a regular expression corresponding to that text cluster; and further configured to obtain a new text and determine whether the new text satisfies a regular expression corresponding to a first text cluster, the first text cluster being any one of the multiple text clusters; if the new text matches the regular expression corresponding to the first text cluster, the new text belongs to the first text cluster.
  8. The apparatus according to claim 5 or 6, characterized in that the processing unit is further configured to obtain multiple text clusters after the multiple texts are clustered; further configured to extract, from the pre-processed texts of the texts of each text cluster, a regular expression corresponding to that text cluster; further configured to obtain a new text, perform the pre-processing operation on the new text, and obtain a pre-processed text corresponding to the new text; and further configured to determine whether the pre-processed text corresponding to the new text satisfies a regular expression corresponding to a second text cluster, the second text cluster being any one of the multiple text clusters; if the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.
  9. A computing device, characterized by comprising a processor and a memory;
    the processor is configured to read a program in the memory to perform the following operations: replacing numbers in multiple texts with a first identifier; performing a pre-processing operation on each text to obtain a pre-processed text of each text, the pre-processing operation comprising: merging two adjacent ones of the first identifiers into one first identifier; segmenting the pre-processed text of each text to obtain a word segmentation result of the pre-processed text of each text; obtaining statistical features of each word in the word segmentation result of the pre-processed text of each text; and clustering the multiple texts according to the statistical features of each word in the word segmentation result of the pre-processed text of each text.
  10. The computing device according to claim 9, characterized in that before performing the pre-processing operation on each text, the processor further replaces graphemes in the multiple texts with a second identifier; the pre-processing operation further comprises: merging two adjacent ones of the second identifiers into one second identifier.
  11. The computing device according to claim 9 or 10, characterized in that after clustering the multiple texts, the processor obtains multiple text clusters; extracts, from the texts of each text cluster, a regular expression corresponding to that text cluster; obtains a new text and determines whether the new text satisfies a regular expression corresponding to a first text cluster, the first text cluster being any one of the multiple text clusters; if the new text matches the regular expression corresponding to the first text cluster, the new text belongs to the first text cluster.
  12. The computing device according to claim 9 or 10, characterized in that after clustering the multiple texts, the processor obtains multiple text clusters; extracts, from the pre-processed texts of the texts of each text cluster, a regular expression corresponding to that text cluster; obtains a new text, performs the pre-processing operation on the new text, and obtains a pre-processed text corresponding to the new text; and determines whether the pre-processed text corresponding to the new text satisfies a regular expression corresponding to a second text cluster, the second text cluster being any one of the multiple text clusters; if the pre-processed text corresponding to the new text matches the regular expression corresponding to the second text cluster, the new text belongs to the second text cluster.
PCT/CN2016/099584 2015-12-16 2016-09-21 Text clustering method, apparatus and computing device WO2017101541A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510944341.9A CN105574156B (zh) 2015-12-16 2015-12-16 Text clustering method, apparatus and computing device
CN201510944341.9 2015-12-16

Publications (1)

Publication Number Publication Date
WO2017101541A1 true WO2017101541A1 (zh) 2017-06-22

Family

ID=55884287

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/099584 WO2017101541A1 (zh) 2015-12-16 2016-09-21 Text clustering method, apparatus and computing device

Country Status (2)

Country Link
CN (1) CN105574156B (zh)
WO (1) WO2017101541A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3713151A1 (en) * 2018-06-29 2020-09-23 AO Kaspersky Lab System and method of blocking network connections

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574156B (zh) * 2015-12-16 2019-03-26 华为技术有限公司 文本聚类方法、装置及计算设备
CN107680579B (zh) * 2017-09-29 2020-08-14 百度在线网络技术(北京)有限公司 文本正则化模型训练方法和装置、文本正则化方法和装置
CN108717461B (zh) * 2018-05-25 2021-03-26 平安科技(深圳)有限公司 海量数据结构化方法、装置、计算机设备及存储介质
CN109344139A (zh) * 2018-11-01 2019-02-15 浪潮电子信息产业股份有限公司 一种存储系统操作日志的聚合方法及相关装置
CN110472031A (zh) * 2019-08-13 2019-11-19 北京知道创宇信息技术股份有限公司 一种正则表达式获得方法、装置、电子设备及存储介质
CN111143312A (zh) * 2019-12-24 2020-05-12 广东电科院能源技术有限责任公司 一种电力日志的格式解析方法、装置、设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6578032B1 (en) * 2000-06-28 2003-06-10 Microsoft Corporation Method and system for performing phrase/word clustering and cluster merging
US20110173197A1 (en) * 2010-01-12 2011-07-14 Yahoo! Inc. Methods and apparatuses for clustering electronic documents based on structural features and static content features
CN103514174A (zh) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text classification method and apparatus
CN104750833A (zh) * 2015-04-03 2015-07-01 浪潮集团有限公司 Text classification method and apparatus
CN105574156A (zh) * 2015-12-16 2016-05-11 华为技术有限公司 Text clustering method, apparatus and computing device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102542209B (zh) * 2010-12-21 2015-03-11 日电(中国)有限公司 Data anonymization method and system
CN104461484B (zh) * 2013-09-16 2019-03-01 腾讯科技(深圳)有限公司 Method and apparatus for implementing a front-end template
CN104408033A (zh) * 2014-11-25 2015-03-11 中国人民解放军国防科学技术大学 Method and system for text information extraction
CN104933023B (zh) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address word segmentation and labeling method
CN104850650B (zh) * 2015-05-29 2018-04-10 清华大学 Short text expansion method based on class label relations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6578032B1 (en) * 2000-06-28 2003-06-10 Microsoft Corporation Method and system for performing phrase/word clustering and cluster merging
US20110173197A1 (en) * 2010-01-12 2011-07-14 Yahoo! Inc. Methods and apparatuses for clustering electronic documents based on structural features and static content features
CN103514174A (zh) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text classification method and apparatus
CN104750833A (zh) * 2015-04-03 2015-07-01 浪潮集团有限公司 Text classification method and apparatus
CN105574156A (zh) * 2015-12-16 2016-05-11 华为技术有限公司 Text clustering method, apparatus and computing device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3713151A1 (en) * 2018-06-29 2020-09-23 AO Kaspersky Lab System and method of blocking network connections
US11089006B2 (en) 2018-06-29 2021-08-10 AO Kaspersky Lab System and method of blocking network connections

Also Published As

Publication number Publication date
CN105574156A (zh) 2016-05-11
CN105574156B (zh) 2019-03-26

Similar Documents

Publication Publication Date Title
WO2017101541A1 (zh) Text clustering method, apparatus and computing device
US11544459B2 (en) Method and apparatus for determining feature words and server
CN110348214B (zh) Method and system for detecting malicious code
CN108304442B (zh) Text information processing method, apparatus and storage medium
CN108710611B (zh) Short text topic model generation method based on word networks and word vectors
WO2020114100A1 (zh) Information processing method and apparatus, and computer storage medium
JP6912488B2 (ja) Character string distance calculation method and apparatus
CN111444330A (zh) Method, apparatus, device and storage medium for extracting keywords from short text
US20180173694A1 (en) Methods and computer systems for named entity verification, named entity verification model training, and phrase expansion
CN108052500B (zh) Method and apparatus for extracting key text information based on semantic analysis
CN109558482B (zh) Parallelization method of the text clustering model PW-LDA based on the Spark framework
US11036764B1 (en) Document classification filter for search queries
JP2017068833A (ja) Keyword extraction apparatus and method for a single document
KR101509727B1 (ko) Apparatus and method for generating an alignment corpus based on unsupervised-learning alignment, and apparatus and method for morphological analysis of broken expressions using the alignment corpus
US20150347406A1 (en) Corpus Generation Based Upon Document Attributes
CN111177375A (zh) Electronic document classification method and apparatus
US11574004B2 (en) Visual image search using text-based search engines
CN112084308A (zh) Method, system and storage medium for recognizing text-type data
CN113282717B (zh) Method and apparatus for extracting entity relations from text, electronic device and storage medium
CN109753646B (zh) Article attribute recognition method and electronic device
US10191786B2 (en) Application program interface mashup generation
CN110750984A (zh) Command-line string processing method, terminal, apparatus and readable storage medium
CN107798004B (zh) Keyword search method, apparatus and terminal
Hejazi et al. Deep learning for Arabic image captioning: A comparative study of main factors and preprocessing recommendations
CN114328885A (zh) Information processing method, apparatus and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16874601

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16874601

Country of ref document: EP

Kind code of ref document: A1