CN110866097A - Text clustering method and device and computer equipment - Google Patents

Text clustering method and device and computer equipment

Info

Publication number
CN110866097A
Authority
CN
China
Prior art keywords
text
clustering
texts
character
clustered
Prior art date
Legal status
Pending
Application number
CN201911030513.6A
Other languages
Chinese (zh)
Inventor
曹绍升
张赏
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911030513.6A priority Critical patent/CN110866097A/en
Publication of CN110866097A publication Critical patent/CN110866097A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

Embodiments of this specification provide a text clustering method, apparatus, and computer device. During text clustering, character segments are extracted from each text and used as its features, a feature matrix is constructed from the number of times each extracted character segment occurs in each text, and the texts are then clustered based on that feature matrix. Because features are extracted directly at character granularity, no word segmentation is needed. This makes the method well suited to newly generated customer-service question sentences: it mitigates, to a certain extent, the low accuracy that conventional word segmentation suffers on new words and new sentence patterns, and avoids the clustering errors that mis-segmentation would otherwise introduce.

Description

Text clustering method and device and computer equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to a text clustering method, apparatus, and computer device.
Background
For example, in a scenario in which a customer service robot answers questions automatically, new questions continually appear that the robot cannot answer. These new questions can be collected and clustered, and the clustered questions placed into a new standard question bank, which effectively improves the robot's answer rate on new questions. Clearly, the accuracy of clustering customer-service questions strongly affects the accuracy with which the robot answers them. To improve the accuracy of text clustering in such application scenarios, the text clustering method itself needs to be improved.
Disclosure of Invention
Based on the above, this specification provides a text clustering method, a text clustering apparatus, and a computer device.
According to a first aspect of embodiments of the present specification, there is provided a text clustering method, the method including:
extracting character segments from each of a plurality of texts to be clustered, wherein each character segment comprises a plurality of consecutive characters of the text;
constructing a feature matrix of the texts based on the texts and the number of occurrences of each character segment in each text;
and clustering the texts based on the feature matrix and a pre-trained clustering model.
According to a second aspect of embodiments herein, there is provided a text clustering apparatus, the apparatus including:
an extraction module, configured to extract character segments from each of a plurality of texts to be clustered, wherein each character segment comprises a plurality of consecutive characters of the text;
a construction module, configured to construct a feature matrix of the texts based on the texts and the number of occurrences of each character segment in each text;
and a clustering module, configured to cluster the texts based on the feature matrix and a pre-trained clustering model.
According to a third aspect of the embodiments of the present specification, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the embodiments when executing the program.
By applying the scheme of the embodiments of this specification, no word segmentation is performed on the text during clustering. Instead, character segments are extracted directly from the text at character granularity and used as the text's features, a feature matrix is constructed from the number of times each extracted character segment occurs in each text, and the texts are then clustered based on that feature matrix. Because features are extracted directly at character granularity, no word segmentation is needed; the method is well suited to newly generated customer-service question sentences, mitigates to a certain extent the low accuracy of conventional word segmentation on new words and new sentence patterns, and avoids the clustering errors that mis-segmentation would otherwise introduce.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart of a text clustering method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of an extraction method for extracting a character segment from a text according to an embodiment of the present specification.
Fig. 3 is a flowchart of a text clustering method according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a logical structure of a text clustering apparatus according to an embodiment of the present disclosure.
FIG. 5 is a schematic block diagram of a computer device for implementing the methods of the present description, according to one embodiment of the present description.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining", depending on the context.
For example, in a scenario in which a customer service robot answers questions automatically, questions with the same meaning are often expressed in many different ways, so new questions continually appear; for a new question, the robot may not find a corresponding answer and therefore cannot respond. The common solution is to collect these new questions, cluster them, and place the clusters into a new standard question bank, so that the robot can find the standard questions and their corresponding answers in the question bank, effectively improving its answer rate on new questions.
When the questions are clustered, each question sentence usually needs to be segmented into words first. For example, the question "怎么开通支付宝" ("how do I activate Alipay") needs to be divided into the words "怎么 / 开通 / 支付宝" ("how / activate / Alipay"). Segmentation may be performed against a word bank table, i.e., character sequences appearing in the table are split off as words; other word segmentation algorithms may of course also be used. After each sentence is divided into words, it can be represented as a vector according to the position of each of its words in the word bank table and the number of times each word occurs in the sentence, and a clustering algorithm can then cluster these sentence vectors, thereby clustering the questions.
Because word segmentation algorithms have low accuracy on newly appeared sentence patterns and new words, mis-segmentation is common. For example, assuming "支付宝" (Alipay) is a new word, the sentence "怎么开通支付宝" is likely to be mis-segmented into "怎么 / 开通 / 支付" ("how / activate / pay"), splitting the new word apart. Obviously, once a sentence is segmented incorrectly, its vectorized representation is also affected, so the final sentence clustering is wrong and clustering accuracy drops sharply. To improve the accuracy of text clustering in such application scenarios, the text clustering method itself needs to be improved.
Based on this, the embodiments of the present specification provide a text clustering method, which can improve the accuracy of text clustering. As shown in fig. 1, the method may include the steps of:
s102, extracting character segments from a plurality of texts to be clustered respectively, wherein each character segment comprises a plurality of continuous characters of the texts;
s104, constructing a feature matrix of the text based on the text and the occurrence frequency of each character segment in the text;
and S106, clustering the texts based on the feature matrix and a pre-trained clustering model.
To avoid the low clustering accuracy caused by mis-segmentation, the text clustering method provided in the embodiments of this specification performs no word segmentation on the text. After a plurality of texts to be clustered are obtained, one or more character segments can be extracted from each text, where a character segment consists of one or more consecutive characters of the text: for example, a segment may be a single character, or any two or more consecutive characters. The segment length can be set flexibly according to the actual application scenario and is not limited by the embodiments of this specification.
Taking the text to be clustered "怎么开通支付宝" ("how do I activate Alipay") as an example, the character segments may be all pairs of consecutive characters: "怎么, 么开, 开通, 通支, 支付, 付宝". They may also be all runs of three consecutive characters: "怎么开, 么开通, 开通支, 通支付, 支付宝". Such character segments reflect the structure of the text to a certain degree and represent its features at character granularity.
In some embodiments, when extracting character segments from a text to be clustered, a sliding window of specified length may be slid along the character sequence of the text with a preset step length, and each run of characters falling within the window is extracted as a character segment while the window traverses the text. For example, as shown in fig. 2, for the text "怎么开通支付宝", assuming a window length of 2 characters and a step length of 1 character, the segments falling within the window as it slides from the first character to the last are "怎么, 么开, 开通, 通支, 支付, 付宝", so these are the character segments extracted from the text. Of course, the window may slide from the first character to the last, from the last character to the first, or outward in both directions from some character in the middle of the text; this is not limited by the specification. In this way, every consecutive run of characters of the given length can be extracted, so the extracted segments reflect the features of the text comprehensively.
Of course, the window length and the sliding step length may be set according to actual requirements. For example, in some embodiments the window length may be 2 characters and the step length 1 character, yielding every segment of two consecutive characters in the text. In other embodiments the window length may be 3 characters or more, yielding segments of 3 or more characters. The step length may also be 2 characters or another value, but to extract more comprehensive feature information, a step length of 1 character is typically used. Taking "怎么开通支付宝" as an example again, with a window length of 3 characters and a step length of 1 character, the extracted segments are "怎么开, 么开通, 开通支, 通支付, 支付宝".
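To make the sliding-window extraction concrete, here is a minimal Python sketch; the function name and defaults are illustrative, not from the patent:

```python
def extract_char_segments(text, window=2, step=1):
    """Slide a fixed-length window along the text and collect every
    character segment (character n-gram) that falls inside the window."""
    return [text[i:i + window]
            for i in range(0, len(text) - window + 1, step)]

# window length 2, step 1 -> the six bigrams of the example text
bigrams = extract_char_segments("怎么开通支付宝", window=2, step=1)
# window length 3, step 1 -> the five trigrams
trigrams = extract_char_segments("怎么开通支付宝", window=3, step=1)
```

With a step length of 1, adjacent segments overlap by `window - 1` characters, which is exactly the redundancy discussed later in connection with dimension reduction.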
After the character segments of each text to be clustered have been extracted, a feature matrix can be constructed based on the texts and the number of occurrences of each character segment in each text. Each row (or each column) of the feature matrix corresponds to the features of one text and represents the text's semantic features numerically. In some embodiments, the feature matrix has the number of texts as its number of rows and the number of distinct extracted character segments as its number of columns, and each element is the number of times a character segment occurs in a text. For example, for an N × M feature matrix, where N is the number of texts and M is the number of character segments, the value in row n, column m is the number of occurrences of the m-th character segment in the n-th text, and row n represents the features of the n-th text.
For example, suppose there are 3 texts to be clustered: "怎么开通支付宝" ("how do I activate Alipay"), "支付宝开通方法" ("how to activate Alipay"), and "支付宝使用方法" ("how to use Alipay"). The character segments extracted from "怎么开通支付宝" are: 怎么, 么开, 开通, 通支, 支付, 付宝. Those extracted from "支付宝开通方法" are: 支付, 付宝, 宝开, 开通, 通方, 方法. Those extracted from "支付宝使用方法" are: 支付, 付宝, 宝使, 使用, 用方, 方法. After merging identical segments, 12 distinct character segments are extracted from the 3 texts: 怎么, 么开, 开通, 通支, 支付, 付宝, 宝开, 通方, 方法, 宝使, 使用, 用方. Numbering the segments 1-12 in this order, a 3 × 12 matrix can be constructed in which row i, column j gives the number of occurrences of the j-th segment in the i-th text; the resulting feature matrix is:
    1 1 1 1 1 1 0 0 0 0 0 0
    0 0 1 0 1 1 1 1 1 0 0 0
    0 0 0 0 1 1 0 0 1 1 1 1
the first behavior of the feature matrix is the feature of the text to be clustered, namely 'how to open a Payment treasure', the second behavior is the feature of the text to be clustered, namely 'Payment treasure opening method', and the third behavior is the feature of the text to be clustered, namely 'Payment treasure use method'.
Of course, the above is only an illustrative example; in an actual application scenario there may be many more texts and character segments to be clustered, and the feature matrix is correspondingly more complex.
Because the embodiments of this specification do not segment the text into words, there is no mis-segmentation problem; the character-segment extraction described here derives the features of each text directly at character granularity. The method is therefore well suited to clustering texts that contain new words and new sentence patterns, and effectively avoids the low segmentation accuracy, and the resulting inaccurate clustering, that new words and new sentence patterns cause for conventional word segmentation.
After the feature matrix of the texts to be clustered is obtained, it can be input into a pre-trained clustering model, which clusters the texts based on the features of each text in the matrix. The clustering model can be trained on a feature matrix of training texts, constructed in the same way from the training texts and the number of occurrences of the character segments extracted from them. The clustering model may use the K-means algorithm or another algorithm with a similar function. Since each row (or column) of the feature matrix represents the features of one text, the clustering algorithm can cluster the texts based on those features.
In some embodiments, after the texts are clustered based on the feature matrix and the pre-trained clustering model, a text category matrix representing the clustering result can be obtained. The text category matrix has the number of texts as its number of rows and the number of categories after clustering as its number of columns, with matrix elements indicating the category each text belongs to. For example, the N × M feature matrix (N texts, M character segments, element (n, m) giving the number of occurrences of the m-th segment in the n-th text) can be input to the pre-trained clustering model, which then outputs an N × K matrix, where K is the number of categories and the element in row n, column k indicates whether the n-th text belongs to the k-th category.
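Assuming the N × K output is a one-hot encoding (one plausible reading of the description above; the patent does not fix the exact encoding), the conversion from per-text cluster labels can be sketched as:

```python
def labels_to_category_matrix(labels, k):
    """Turn per-text cluster labels into the N x K text-category matrix:
    row n has a 1 in the column of the category text n belongs to."""
    return [[1 if labels[n] == c else 0 for c in range(k)]
            for n in range(len(labels))]

# e.g. texts 0 and 1 fall in cluster 0, text 2 in cluster 1
C = labels_to_category_matrix([0, 0, 1], k=2)
# C == [[1, 0], [1, 0], [0, 1]]
```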
Because the embodiments of this specification extract character segments directly from the texts rather than segmenting them into words, each character segment shares one or more characters with other segments; the segments therefore contain a great deal of redundant information, and their number is larger than the number of words that word segmentation would produce. For example, word segmentation of "怎么开通支付宝" yields three words, "怎么 / 开通 / 支付宝", whereas character-segment extraction yields six segments, "怎么, 么开, 开通, 通支, 支付, 付宝". The resulting feature matrix therefore has more dimensions and is more complex, and because the segments overlap heavily, the matrix carries considerable redundant information. To reduce the computation required when the clustering model clusters the texts based on the feature matrix, in some embodiments the feature matrix may first be reduced in dimension. Dimension reduction condenses the redundant information in the feature matrix, yielding a smaller and simpler matrix and greatly reducing the computation needed to cluster it.
For example, for an N × M feature matrix (N texts, M character segments, element (n, m) giving the number of occurrences of the m-th segment in the n-th text), dimension reduction produces an N × S matrix with S much smaller than M. Each row of the reduced matrix still represents the features of one text; only the redundant feature information has been condensed, so the number of columns is greatly reduced.
The feature matrix may be reduced using a preset dimension reduction algorithm such as PCA (principal component analysis), a commonly used data analysis method. PCA linearly transforms the raw data into a set of linearly independent dimensions and can be used to extract the principal feature components of the data; it is often used for dimensionality reduction of high-dimensional data. Of course, other dimension reduction algorithms with similar functions may also be used; the embodiments of this specification are not limited in this respect.
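A minimal PCA sketch via centered SVD (illustrative; equivalent in spirit to off-the-shelf PCA implementations, and the helper name is an assumption):

```python
import numpy as np

def pca_reduce(F, d):
    """Project the N x M feature matrix onto its top-d principal
    components, yielding an N x d matrix of component scores."""
    F = np.asarray(F, dtype=float)
    X = F - F.mean(axis=0)                      # center each column
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:d].T                         # scores on top-d components

F = [[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],     # the 3 x 12 example matrix
     [0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0],
     [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]]
S = pca_reduce(F, d=2)                          # 3 x 2 reduced matrix
```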
Texts often contain redundant characters with no actual meaning. Such characters contribute little to the meaning of the text, but their presence greatly increases the number of extracted character segments and therefore the dimension and complexity of the feature matrix, increasing the amount of computation. Hence, in some embodiments, before character segments are extracted, the texts to be clustered may be preprocessed to delete redundant characters. In some embodiments, the redundant characters include stop words, punctuation marks, and other characters with no actual meaning. For example, the particle "的" in "支付宝的开通方法" ("the activation method of Alipay") carries no real meaning and can be deleted during preprocessing, and the question mark in "怎么开通支付宝？" can also be deleted. Deleting redundant characters does not affect the semantics of the text, but it greatly reduces the number of extracted character segments, which lowers the dimension of the feature matrix, reduces computation, and improves clustering efficiency.
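A preprocessing sketch; the stop-word set below is a toy list for illustration only — a real system would use a curated Chinese stop-word dictionary:

```python
import re

# illustrative stop-word characters (assumption, not from the patent)
STOP_WORDS = {"我", "的", "了", "吗", "呢"}

def preprocess(text):
    """Strip punctuation and stop-word characters before
    character-segment extraction."""
    text = re.sub(r"[^\w]", "", text)   # drop punctuation (non-word chars)
    return "".join(ch for ch in text if ch not in STOP_WORDS)

preprocess("我的快递到哪里了？")   # yields "快递到哪里"
```

Note that Python's `\w` matches Unicode word characters by default, so CJK characters survive the punctuation pass while "？" is removed.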
To further explain the text clustering method of the embodiments of the present disclosure, a specific embodiment is explained below with reference to fig. 3.
A robot serves as customer service, automatically answering questions entered by customers. Because customer questions are highly varied, new sentence patterns and new words are continually produced; for the robot to find corresponding answers among the standard question-answer pairs in a question-answer bank, the questions need to be clustered. To cluster the questions accurately, so that the robot can answer customer questions accurately, the following method for clustering customer-service questions is provided.
1. Customer-service question text collection (S301): collect customer questions that the customer service robot could not resolve, or whose answers the user was not satisfied with;
2. Text preprocessing, deleting redundant characters (S302): delete stop words, punctuation marks, and the like from the customer question. For example, if the original sentence is "我的快递到哪里了？" ("where has my parcel got to?"), preprocessing deletes the stop words "我", "的", and "了", and the question mark is also deleted, finally yielding "快递到哪里".
3. Extracting character segments from the text (S303): extract character segments from the preprocessed text. During extraction, a sliding window of 2 characters may be slid along the character sequence of the text with a step length of 1 character, and each segment falling within the window is extracted as the window traverses the text. For example, for "快递到哪里", the extracted segments are "快递", "递到", "到哪", and "哪里". A 2-character segment extracted this way may be called a BiGram; the set of BiGrams obtained for a text can be regarded as the text's features.
4. Constructing the text feature matrix (S304): assuming M distinct BiGrams are extracted from N customer-service question texts, the features of all texts can be represented by an N × M feature matrix F, where the value in row i, column j of F is the number of occurrences of the BiGram with feature id j in the i-th sentence. Each row of the feature matrix corresponds to one text and can be regarded as that text's features.
5. Reducing the feature matrix with PCA (S305): because the extracted BiGrams overlap heavily in their characters, their number M is large, so the constructed feature matrix carries considerable redundant information and has many dimensions, and clustering on it directly would be computationally expensive. The BiGram feature matrix F (dimension N × M) is therefore first input to a PCA model, which outputs a reduced matrix S of dimension N × d (d << M).
6. Clustering the texts with k-means on the reduced feature matrix (S306): each row of the reduced matrix S still represents the features of one text. S is input to a k-means clustering model, which clusters the texts according to their features and outputs a matrix C (N × k), where k is the number of categories and row i, column j of C indicates that the i-th sentence belongs to the j-th category. Customer-service questions of the same category are thus clustered together.
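Steps S303-S306 can be tied together in a short pipeline sketch, assuming scikit-learn is available; the function name, parameter defaults, and vocabulary ordering are illustrative, not from the patent:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_questions(texts, window=2, d=2, k=2):
    """S303-S306 sketch: extract BiGrams, build the count matrix,
    reduce with PCA, cluster with k-means; returns one label per text."""
    segs = lambda t: [t[i:i + window] for i in range(len(t) - window + 1)]
    vocab = sorted({s for t in texts for s in segs(t)})
    F = np.array([[segs(t).count(s) for s in vocab] for t in texts], float)
    d = min(d, len(texts), len(vocab))      # PCA rank cannot exceed the data
    S = PCA(n_components=d).fit_transform(F)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(S)

labels = cluster_questions(
    ["怎么开通支付宝", "支付宝开通方法", "支付宝使用方法"], k=2)
```

In practice the label vector would then be expanded into the N × k category matrix C described in step 6.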
The customer-service question clustering method of this embodiment extracts text features directly at character granularity, requires no word segmentation, is well suited to newly generated customer-service question sentences, and mitigates to a certain extent the low accuracy of conventional word segmentation caused by new words and new sentence patterns. In addition, the method is simple, computes very quickly, and clusters well.
The technical features of the above embodiments may be combined arbitrarily; as long as a combination involves no conflict or contradiction, it falls within the scope disclosed in this specification, even though, for brevity, not every combination is described individually.
As shown in fig. 4, corresponding to the text clustering method above, this specification also provides a text clustering apparatus. The apparatus 40 may include:
an extraction module 41, configured to extract character segments from each of a plurality of texts to be clustered, where each character segment comprises a plurality of consecutive characters of the text;
a construction module 42, configured to construct a feature matrix of the texts based on the texts and the number of occurrences of each character segment in each text;
and a clustering module 43, configured to cluster the texts based on the feature matrix and a pre-trained clustering model.
In an embodiment, when extracting the character segments from the plurality of texts to be clustered, the apparatus is specifically configured to:
sliding a sliding window with a specified length along the character arrangement path of the text according to a preset sliding step length;
and in the process of traversing each character of the text, extracting the character falling into the sliding window as the character segment.
In one embodiment, the specified length is a length of two characters, and the step size of sliding is a length of one character.
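As a sketch, the sliding-window extraction with a two-character window and a one-character step can be written as follows (the function name is illustrative, not from the patent):

```python
def extract_segments(text, window=2, step=1):
    """Slide a window of `window` characters along the text with stride
    `step`, collecting the characters that fall inside the window."""
    return [text[i:i + window]
            for i in range(0, len(text) - window + 1, step)]
```

With the defaults of this embodiment, a four-character sentence yields three overlapping two-character segments.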
In one embodiment, the number of rows of the feature matrix is the number of the texts, the number of columns is the number of the character segments, and the matrix elements are the number of times that each character segment appears in each text.
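Under this layout, the feature matrix can be built as in the following sketch (helper names are ours, not from the patent):

```python
from collections import Counter

def build_feature_matrix(texts, window=2, step=1):
    """Rows correspond to texts, columns to the distinct character segments
    found across all texts; element [i][j] is the number of times segment j
    appears in text i."""
    def segments(t):
        return [t[i:i + window] for i in range(0, len(t) - window + 1, step)]
    vocab = sorted({s for t in texts for s in segments(t)})
    col = {s: j for j, s in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in texts]
    for i, t in enumerate(texts):
        for seg, n in Counter(segments(t)).items():
            matrix[i][col[seg]] = n
    return matrix, vocab
```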
In one embodiment, after clustering the plurality of texts based on the feature matrix and a pre-trained clustering model, the apparatus is further configured to:
and acquiring a text category matrix output by the clustering model, where the number of rows of the text category matrix is the number of the texts, the number of columns is the number of categories after clustering, and the matrix elements are values indicating the category to which each text belongs.
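Reading categories back out of such a one-hot matrix is straightforward; a sketch (names ours):

```python
def group_by_category(C):
    """Given the N x k text category matrix, map each category index j to
    the indices of the texts whose row has a 1 in column j."""
    groups = {}
    for i, row in enumerate(C):
        j = list(row).index(1)  # column holding this text's category
        groups.setdefault(j, []).append(i)
    return groups
```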
In one embodiment, before clustering the plurality of texts based on the feature matrix and a pre-trained clustering model, the apparatus is further configured to:
and performing dimension reduction processing on the feature matrix.
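The specification does not fix a particular reduction technique; one common choice, shown here purely as an assumption, is truncated SVD:

```python
import numpy as np

def reduce_dim(feature_matrix, d):
    """Project the N x V count matrix onto its top-d left singular
    directions, yielding an N x d matrix with one row per text."""
    F = np.asarray(feature_matrix, dtype=float)
    U, sigma, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :d] * sigma[:d]
```

Texts with identical segment counts remain identical after projection, so the reduction preserves the similarity structure the clustering step relies on.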
In one embodiment, before extracting the character segments from the plurality of texts to be clustered, the apparatus is further configured to:
and deleting redundant characters from the plurality of texts to be clustered.
In one embodiment, the redundant characters include: stop words and/or punctuation.
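A sketch of this cleanup step (the stop-word list here is illustrative, not from the patent):

```python
import re

STOP_WORDS = {"的", "了", "吗", "呢"}  # illustrative, not an exhaustive list

def clean_text(text, stop_words=STOP_WORDS):
    """Delete punctuation and stop words before segment extraction."""
    text = re.sub(r"[^\w]|_", "", text)  # drop punctuation and whitespace
    return "".join(ch for ch in text if ch not in stop_words)
```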
For details of the implementation of the functions and roles of the modules in the above apparatus, refer to the implementation of the corresponding steps in the method above, which are not repeated here.
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts shown as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this specification. A person of ordinary skill in the art can understand and implement it without creative effort.
The apparatus embodiments of this specification can be applied to computer equipment, such as a server or an intelligent terminal. The apparatus embodiments may be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, as a logical apparatus, it is formed by the processor of the computer device in which it is located reading corresponding computer program instructions from the non-volatile memory into memory for execution. In terms of hardware, fig. 5 is a hardware structure diagram of the computer device in which the apparatus of this specification is located; in addition to the processor 502, memory 504, network interface 506, and non-volatile memory 508 shown in fig. 5, the server or electronic device in which the apparatus is located may further include other hardware according to the actual function of the computer device, which is not described again. The non-volatile memory 508 stores a computer program, and the processor 502, when executing the computer program, implements the text clustering method of any of the embodiments of this specification.
Accordingly, the embodiments of the present specification also provide a computer storage medium, in which a program is stored, and the program, when executed by a processor, implements the method in any of the above embodiments.
Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
Other embodiments of this specification will be apparent to those skilled in the art from consideration of the specification and practice of what is disclosed herein. The embodiments of this specification are intended to cover any variations, uses, or adaptations that follow their general principles, including such departures from the present disclosure as come within known or customary practice in the art. The specification and examples are to be considered exemplary only, with the true scope and spirit of the embodiments indicated by the following claims.
It is to be understood that the embodiments of the present specification are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the embodiments of the present specification is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (10)

1. A method of text clustering, the method comprising:
extracting character segments from a plurality of texts to be clustered respectively, wherein each character segment comprises a plurality of continuous characters of the texts;
constructing a feature matrix of the text based on the text and the occurrence times of each character segment in the text;
and clustering the texts based on the feature matrix and a pre-trained clustering model.
2. The text clustering method according to claim 1, wherein extracting character segments from a plurality of texts to be clustered respectively comprises:
sliding a sliding window with a specified length along the character arrangement path of the text according to a preset sliding step length;
and in the process of traversing each character of the text, extracting the character falling into the sliding window as the character segment.
3. The text clustering method according to claim 2, wherein the specified length is a length of two characters, and the step size of sliding is a length of one character.
4. The text clustering method according to any one of claims 1 to 3, wherein the number of rows of the feature matrix is the number of the texts, the number of columns is the number of the character segments, and the matrix elements are the number of times that each character segment appears in each text.
5. The text clustering method according to claim 4, further comprising, after clustering the plurality of texts based on the feature matrix and a pre-trained clustering model:
and acquiring a text category matrix output by the clustering model, wherein the number of rows of the text category matrix is the number of the texts, the number of columns is the number of categories after clustering, and the matrix elements are values indicating the category to which each text belongs.
6. The text clustering method according to claim 1, further comprising, before clustering the plurality of texts based on the feature matrix and a pre-trained clustering model:
and performing dimension reduction processing on the feature matrix.
7. The text clustering method according to claim 1, before extracting a plurality of character segments from a plurality of texts to be clustered, respectively, further comprising:
and deleting redundant characters from the plurality of texts to be clustered.
8. The text clustering method of claim 7, wherein the redundant characters comprise: stop words and/or punctuation.
9. A text clustering apparatus, the apparatus comprising:
the device comprises an extraction module, a clustering module and a clustering module, wherein the extraction module is used for respectively extracting character segments from a plurality of texts to be clustered, and each character segment comprises a plurality of continuous characters of the texts;
the construction module is used for constructing a feature matrix of the text based on the text and the occurrence frequency of each character segment in the text;
and the clustering module is used for clustering the texts based on the characteristic matrix and a pre-trained clustering model.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 8 when executing the program.
CN201911030513.6A 2019-10-28 2019-10-28 Text clustering method and device and computer equipment Pending CN110866097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911030513.6A CN110866097A (en) 2019-10-28 2019-10-28 Text clustering method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN110866097A true CN110866097A (en) 2020-03-06

Family

ID=69653398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911030513.6A Pending CN110866097A (en) 2019-10-28 2019-10-28 Text clustering method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN110866097A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110044492A1 (en) * 2009-08-18 2011-02-24 Wesley Kenneth Cobb Adaptive voting experts for incremental segmentation of sequences with prediction in a video surveillance system
CN105488033A (en) * 2016-01-26 2016-04-13 中国人民解放军国防科学技术大学 Preprocessing method and device for correlation calculation
CN106997339A (en) * 2016-01-22 2017-08-01 阿里巴巴集团控股有限公司 Text feature, file classification method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200306