CN113449102A - Text clustering method, equipment and storage medium

Text clustering method, equipment and storage medium

Info

Publication number
CN113449102A
CN113449102A
Authority
CN
China
Prior art keywords
text
vector
cluster
clustering
similarity
Prior art date
Legal status
Pending
Application number
CN202010228254.4A
Other languages
Chinese (zh)
Inventor
姚亦周
郭彦涛
Current Assignee
Beijing Jingdong Tuoxian Technology Co Ltd
Original Assignee
Beijing Jingdong Tuoxian Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Tuoxian Technology Co Ltd
Priority to CN202010228254.4A
Publication of CN113449102A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a text clustering method, device, and storage medium. All texts included in a text set to be processed are processed to obtain a text vector matrix corresponding to the text set; a text similarity matrix corresponding to the text set is calculated from the text vector matrix; and finally all texts in the text set are cluster-analyzed based on the text similarity matrix to obtain a text clustering result. In this technical scheme, texts are classified using the similarity between texts, so an unsupervised, high-precision clustering result is obtained at low cost, the computational complexity of text clustering is reduced, clustering can be computed in batches, noise texts have little influence on the overall model, and the accuracy of the clustering result is improved.

Description

Text clustering method, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a text clustering method, text clustering equipment and a storage medium.
Background
In natural language processing, text semantic clustering is in great demand in application scenarios such as text classification, conversational robots, and sentiment analysis. Text clustering serves as an intermediate step of natural language processing, and the accuracy of this analysis directly determines the computational accuracy of the related natural-language models. Therefore, realizing fast, accurate, batch text clustering analysis is of great significance.
In the prior art, a common text clustering method is the k-means clustering algorithm. The clustering process is as follows: for a given text set, first determine the value of k, i.e., the number of cluster sets expected from clustering; second, randomly select k texts from the text set as centroids; then calculate the distance (for example, the Euclidean distance) between each text in the text set and each centroid, and assign each text to the set of its nearest centroid, obtaining k cluster sets; finally, recalculate the centroid of each set, and if the distance between the newly calculated centroid and the original centroid is less than a set threshold, clustering is determined to be finished and the clustering result is obtained. The method has a simple principle, is easy to implement, and converges quickly.
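As a rough illustration of the k-means procedure described above (prior art, not the method claimed in this application), the following Python sketch shows the centroid-update loop; the value of k, the random initialization, and the stopping tolerance are assumptions a developer supplies.

```python
import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    """Plain k-means on row vectors X: random initial centroids,
    Euclidean assignment, and centroid recomputation until the shift
    between old and new centroids falls below tol."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # assign every text vector to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster went empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```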
However, in the process of implementing this clustering method, the inventors found that in many cases the value of k cannot be determined in advance, and the accuracy of the obtained clustering result is greatly affected by the initially selected centroids, noise, and outliers, which may lead to inaccurate clustering results.
Disclosure of Invention
The embodiment of the application provides a text clustering method, text clustering equipment and a storage medium, which are used for solving the problem of inaccurate clustering results in the existing text clustering method.
In a first aspect, an embodiment of the present application provides a text clustering method, including:
processing all texts included in a text set to be processed to obtain a text vector matrix corresponding to the text set, wherein the number of rows of the text vector matrix is the same as the number of texts included in the text set, and each row vector of the text vector matrix is the vector representation of one text;
calculating, according to the text vector matrix, a text similarity matrix corresponding to the text set, wherein each element value of the text similarity matrix is the similarity value between the two row vectors of the text vector matrix at the corresponding positions;
and performing clustering analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result.
In this embodiment, a text clustering result is obtained through clustering analysis by determining a text vector matrix and a text similarity matrix corresponding to a text set, so that an unsupervised and high-precision clustering result is obtained with less cost, the computational complexity of text clustering is reduced, and the accuracy of the clustering result is improved.
In a possible design of the first aspect, the processing all texts included in the text set to be processed to obtain a text vector matrix corresponding to the text set includes:
based on a preset word segmentation rule, carrying out word segmentation processing on each text in the text set to obtain each text word set, wherein each text word set comprises at least one word;
obtaining a dictionary corresponding to the text set according to words included in all text word sets, wherein each word in the dictionary has a unique identifier;
obtaining each line text vector according to the mapping relationship between the words included in each text word set and the dictionary;
and obtaining a text vector matrix corresponding to the text set according to all the line text vectors.
In this embodiment, after each text is segmented according to a preset segmentation rule, a dictionary corresponding to a text set is obtained by combining all the texts, so that a text vector matrix is obtained, low-density text clusters can be well recognized, and the text recognition accuracy is improved.
Optionally, the obtaining a dictionary corresponding to the text set according to the words included in all the text word sets includes:
collecting words included in all text word sets into a preset set;
deleting repeated words in the preset set to obtain a preset target set;
and adding a unique identifier to each word included in the preset target set to obtain a dictionary corresponding to the text set.
In this embodiment, the dictionary corresponding to the text set is obtained according to all the texts in the text set, so that the accuracy of the text vector matrix corresponding to the text set is improved, and the accuracy of the subsequent text clustering result is improved.
Optionally, obtaining each line of text vector according to the mapping relationship between the words included in each text word set and the dictionary includes:
mapping words included in each text word set into identifications in the dictionary to obtain each text identification set;
and converting each text identification set into a line text vector according to the identification included in each text identification set, wherein the element number of each line text vector is the same as that of the dictionary, and the number of non-zero elements of each line text vector is the same as that of the words of the corresponding text word set.
In this embodiment, based on the identifier carried by each text, each text can be converted into a one-hot coding form, so as to obtain each line of text vectors, thereby achieving batch processing of text information and weakening the influence of a "noise" text on the overall semantics.
Optionally, before performing word segmentation processing on each text in the text set based on the preset word segmentation rule to obtain each text word set, the method further includes:
preprocessing each text of the text set, and deleting preset attribute contents appearing in each text, wherein the preset attribute contents at least comprise one of the following contents: stop words, comment words, symbols.
In this embodiment, by deleting the preset attribute content appearing in each text, the importance of the processed text is improved, and the accuracy of the dictionary and the row vector matrix generated subsequently is improved.
In another possible design of the first aspect, the calculating a text similarity matrix corresponding to the text set according to the text vector matrix includes:
according to a plurality of line text vectors included in the text vector matrix, calculating similarity values between each line text vector and the text vectors of the line text vector and other line text vectors to obtain each line similarity vector;
and obtaining a text similarity matrix corresponding to the text set according to all the line similarity vectors.
In the embodiment, the text similarity matrix corresponding to the text set is calculated based on the plurality of line text vectors included in the text vector matrix, so that the complexity of text processing is reduced, and a foundation is laid for accurate clustering of subsequent texts.
In yet another possible design of the first aspect, the performing, based on the text similarity matrix, a clustering analysis on all texts in the text set to obtain a text clustering result includes:
determining a classification threshold and a first cluster, wherein the first cluster comprises any one text in the text set;
and performing cluster analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain a text clustering result.
In the embodiment, the text similarity matrix can be calculated in batches, the classification threshold of similarity clustering is determined, unsupervised clustering of texts is realized, the calculation complexity is greatly reduced, the influence of 'noise' texts on the overall semantics is weakened, and the clustering accuracy is improved.
Optionally, the performing cluster analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain the text clustering result includes:
determining a similarity value between a first text and each text in the first cluster according to the text similarity matrix, wherein the first text is any text which is not clustered and divided in the text set;
if the similarity values between the first text and the texts in the first cluster in the preset proportion are all larger than or equal to the classification threshold value, clustering the first text into the first cluster;
if the similarity values between the first text and all texts in the first cluster are smaller than the classification threshold value, generating a second cluster, wherein the second cluster comprises the first text;
when all texts in the text set have participated in cluster division, obtaining the text clustering result, wherein the text clustering result comprises: all of the determined clusters and all of the texts that each cluster comprises.
In the embodiment, by determining the classification threshold and the first cluster, on the premise of ensuring the clustering precision, the calculation time and the computer resources of text clustering are saved, and meanwhile, the clustering efficiency is improved.
In yet another possible design of the first aspect, the method further includes:
obtaining an object recommendation request sent by a user, wherein the object recommendation request comprises: a description text;
determining a target cluster matched with the object recommendation request according to the description text and the text clustering result;
and determining a target text from the target cluster, and recommending an object corresponding to the target text to the user.
In the embodiment of the application, the precision of the text clustering result is improved, so that the precision of object recommendation is improved, the user experience is improved, and a foundation is laid for improving the competitiveness of a product.
In a second aspect, an embodiment of the present application provides a text clustering device, including: the device comprises a processing module, a calculation module and a clustering module;
the processing module is configured to process all texts included in a text set to be processed to obtain a text vector matrix corresponding to the text set, where the number of rows of the text vector matrix is the same as the number of texts included in the text set, and each row vector of the text vector matrix is the vector representation of one text;
the calculation module is configured to calculate, according to the text vector matrix, a text similarity matrix corresponding to the text set, where each element value of the text similarity matrix is the similarity value between the two row vectors of the text vector matrix at the corresponding positions;
and the clustering module is used for carrying out clustering analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result.
In a possible design of the second aspect, the processing module is specifically configured to:
based on a preset word segmentation rule, carrying out word segmentation processing on each text in the text set to obtain each text word set, wherein each text word set comprises at least one word;
obtaining a dictionary corresponding to the text set according to words included in all text word sets, wherein each word in the dictionary has a unique identifier;
obtaining each line text vector according to the mapping relationship between the words included in each text word set and the dictionary;
and obtaining a text vector matrix corresponding to the text set according to all the line text vectors.
Optionally, the processing module is configured to obtain a dictionary corresponding to the text set according to words included in all text word sets, and specifically:
the processing module is specifically configured to:
collecting words included in all text word sets into a preset set;
deleting repeated words in the preset set to obtain a preset target set;
and adding a unique identifier to each word included in the preset target set to obtain a dictionary corresponding to the text set.
Optionally, the processing module is configured to obtain each line text vector from a mapping relationship between a word included in each text word set and the dictionary, specifically:
the processing module is specifically configured to:
mapping words included in each text word set into identifications in the dictionary to obtain each text identification set;
and converting each text identification set into a line text vector according to the identification included in each text identification set, wherein the element number of each line text vector is the same as that of the dictionary, and the number of non-zero elements of each line text vector is the same as that of the words of the corresponding text word set.
Optionally, the processing module is further configured to, before performing word segmentation processing on each text in the text set based on a preset word segmentation rule to obtain each text word set, perform preprocessing on each text in the text set, and delete preset attribute content appearing in each text, where the preset attribute content at least includes one of the following contents: stop words, comment words, symbols.
In another possible design of the second aspect, the calculation module is specifically configured to:
according to a plurality of line text vectors included in the text vector matrix, calculating similarity values between each line text vector and the text vectors of the line text vector and other line text vectors to obtain each line similarity vector;
and obtaining a text similarity matrix corresponding to the text set according to all the line similarity vectors.
In yet another possible design of the second aspect, the clustering module is specifically configured to:
determining a classification threshold and a first cluster, wherein the first cluster comprises any one text in the text set;
and performing cluster analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain a text clustering result.
Optionally, the clustering module is configured to perform clustering analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain the text clustering result, and specifically includes:
the clustering module is specifically configured to:
determining a similarity value between a first text and each text in the first cluster according to the text similarity matrix, wherein the first text is any text which is not clustered and divided in the text set;
if the similarity values between the first text and the texts in the first cluster in the preset proportion are all larger than or equal to the classification threshold value, clustering the first text into the first cluster;
if the similarity values between the first text and all texts in the first cluster are smaller than the classification threshold value, generating a second cluster, wherein the second cluster comprises the first text;
when all texts in the text set have participated in cluster division, obtaining the text clustering result, wherein the text clustering result comprises: all of the determined clusters and all of the texts that each cluster comprises.
In yet another possible design of the second aspect, the apparatus further includes: the system comprises an acquisition module, a determination module and a recommendation module;
the obtaining module is configured to obtain an object recommendation request sent by a user, where the object recommendation request includes: a description text;
the determining module is used for determining a target cluster matched with the object recommendation request according to the description text and the text clustering result, and determining a target text from the target cluster;
and the recommending module is used for recommending the object corresponding to the target text to the user.
The apparatus provided in the second aspect of the present application may be configured to perform the method provided in the first aspect, and the implementation principle and the technical effect are similar, which are not described herein again.
In a third aspect, embodiments of the present application further provide an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the method according to the first aspect and possible designs.
In a fourth aspect, embodiments of the present application further provide a computer-readable storage medium, in which computer instructions are stored, and when the computer instructions are executed on a computer, the computer is caused to execute the method according to the first aspect and each possible design.
According to the text clustering method, device, and storage medium provided by the embodiments of the present application, all texts in a text set to be processed are processed to obtain a text vector matrix corresponding to the text set, a text similarity matrix corresponding to the text set is calculated according to the text vector matrix, and finally all texts in the text set are cluster-analyzed based on the text similarity matrix to obtain a text clustering result. In this technical scheme, texts are classified using the similarity between texts, so an unsupervised, high-precision clustering result is obtained at low cost, the computational complexity of text clustering is reduced, clustering can be computed in batches, noise texts have little influence on the overall model, and the accuracy of the clustering result is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a schematic diagram of an application scenario of a text clustering method provided in the present application;
fig. 2 is a schematic flowchart of a first embodiment of a text clustering method provided in the embodiment of the present application;
fig. 3 is a schematic flowchart of a second embodiment of a text clustering method provided in the embodiment of the present application;
fig. 4 is a schematic flowchart of a third embodiment of a text clustering method provided in the embodiment of the present application;
fig. 5 is a schematic flowchart of a fourth embodiment of a text clustering method provided in the embodiment of the present application;
fig. 6 is a schematic structural diagram of an embodiment of a text clustering device provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present application.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Exemplarily, fig. 1 is a schematic view of an application scenario of the text clustering method provided in the present application. As shown in fig. 1, the application scenario may include: at least one terminal device (fig. 1 shows three terminal devices, respectively terminal device 111, terminal device 112, terminal device 113), network 12 and server 13. Wherein each terminal device and the server 13 can communicate through the network 12. Optionally, the application scenario shown in fig. 1 may further include a data storage device 14 connected to the server 13.
For example, in the application scenario shown in fig. 1, the server 13 may obtain a text set to be processed from the network 12, and store the text set to the data storage device 14 so as to be directly used in the subsequent clustering analysis of the text set, and the server 13 may also receive an object recommendation request sent by a user through a terminal device through the network 12, process a description text included in the object recommendation request, and store a processing result in the data storage device 14.
In this embodiment, the data storage device 14 may store a large number of text sets for cluster analysis, and may also store the processing result of the server 13, and the server 13 may execute a program code of a text clustering method based on the text set to be processed in the data storage device 14 to obtain a text clustering result; the server 13 may also determine that an object matching the object recommendation request is recommended to the user based on the program code of the text clustering method executed by the object recommendation request sent by the user in the data storage device 14.
It should be noted that fig. 1 is only a schematic diagram of an application scenario provided by an embodiment of the present application, and the embodiment of the present application does not limit the devices included in fig. 1, nor does it limit the positional relationship between the devices in fig. 1, for example, in fig. 1, the data storage device 14 may be an external memory with respect to the server 13, and in other cases, the data storage device 14 may also be disposed in the server 13.
In practical applications, since the terminal device is also a processing device with data processing capability, the server in the application scenario shown in fig. 1 can also be implemented by the terminal device. In the embodiments of the present application, the server and the terminal device for data processing may be collectively referred to as an electronic device. Optionally, in the embodiment of the present application, an execution subject of the text clustering method is used as an electronic device, for example, a background processing platform and the like for explanation.
Illustratively, the specific application scenarios of the embodiments of the present application may be as follows:
With the rapid development of internet technology, more and more online inquiry clients, for example internet hospital clients, are emerging. In practical application, each doctor in each department has certain text content, such as a specialty description. A user can make an inquiry through the internet hospital client: specifically, the user sends an inquiry description in the internet hospital client, and the background server matches departments/doctors suitable for receiving the inquiry according to the inquiry description and returns them to the client, so as to recommend the doctors to the user.
In practical application, firstly, the data volume of the inquiry description information sent by a user is limited, and the rule matching relation between the inquiry description and departments/doctors established according to the limited data is usually inaccurate; secondly, the inquiry information is various, and how to match the appropriate department/doctor in the changeable inquiry description is a great challenge.
In contrast, in the prior art, the matching process between the inquiry description of the user and the departments/doctors is usually based on a text clustering result obtained by a k-means clustering method, and the matched departments/doctors receiving the inquiry are determined by combining the inquiry description of the user. However, the matching department/doctor has low accuracy, which is caused by inaccurate text clustering result for department and/or doctor.
For the problem of the application scenario, the text clustering method provided in the embodiment of the present application may determine the text clustering result based on all text sets used for describing doctors in the internet hospital inquiry platform by using an internet hospital inquiry platform as a background and using a Natural Language Processing (NLP) technology and an unsupervised clustering method. Practice proves that the text clustering result for describing doctors is applied to a doctor recommendation module of an Internet hospital, and a good application effect is achieved.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a schematic flowchart of a first embodiment of a text clustering method provided in the embodiment of the present application. As shown in fig. 2, the method may include the steps of:
s201, processing all texts included in the text set to be processed to obtain a text vector matrix corresponding to the text set.
The number of rows of the text vector matrix is the same as the number of texts included in the text set, and each row vector of the text vector matrix is the vector representation of one text.
In the embodiment of the application, for a given service platform, when there is a text clustering requirement, the electronic device may obtain a text set to be processed from a network or a storage location where the text set is stored, where the text set generally includes a plurality of pieces of text, and each piece of text is used to describe an object.
For example, in an application scenario of internet hospital inquiry, each piece of text may be a brief description of each doctor in the internet hospital, and the set of texts to be processed may be a set of brief descriptions of all doctors.
Optionally, after obtaining the text set to be processed, the electronic device may perform word segmentation on the texts in the text set, and then represent each text in a form of a line text vector based on words included in each text, so as to obtain a text vector matrix corresponding to the text set based on the line text vectors corresponding to all the texts.
Optionally, for a specific implementation principle of this step, reference may be made to the following description of the embodiment shown in fig. 3, which is not described herein again.
And S202, calculating a text similarity matrix corresponding to the text set according to the text vector matrix.
Each element value of the text similarity matrix is the similarity value between the two row vectors of the text vector matrix at the corresponding positions.
For example, in an embodiment of the present application, when the electronic device obtains a text vector matrix corresponding to a text set, the corresponding text similarity vector may be determined based on a content overlapping degree between every two lines of text vectors, and then the text similarity matrix corresponding to the text set is obtained according to all the determined text similarity vectors.
For example, assume that the text set includes text 1 to text 3, the text line vector corresponding to text 1 is [1 1 1], the text line vector corresponding to text 2 is [1 1 0], and the text line vector corresponding to text 3 is [1 0 0]. Then the text vector matrix corresponding to the text set is:
[1 1 1]
[1 1 0]
[1 0 0]
For this text vector matrix, by calculating the cosine values between the row vectors, the cosine value of [1 1 1] and [1 1 0] is approximately 0.816, the cosine value of [1 1 1] and [1 0 0] is approximately 0.577, and the cosine value of [1 1 0] and [1 0 0] is approximately 0.707, so the text similarity matrix corresponding to the text set is:
[1     0.816 0.577]
[0.816 1     0.707]
[0.577 0.707 1    ]
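The calculation above can be reproduced with a short Python sketch (a minimal illustration, not the claimed implementation): it normalizes the row vectors and takes pairwise dot products, which yields the 0.816 / 0.577 / 0.707 values quoted in the example.

```python
import numpy as np

text_vectors = np.array([[1, 1, 1],   # text 1
                         [1, 1, 0],   # text 2
                         [1, 0, 0]])  # text 3

# cosine similarity: normalize each row, then take pairwise dot products
norms = np.linalg.norm(text_vectors, axis=1, keepdims=True)
similarity_matrix = (text_vectors / norms) @ (text_vectors / norms).T

print(np.round(similarity_matrix, 3))
# [[1.    0.816 0.577]
#  [0.816 1.    0.707]
#  [0.577 0.707 1.   ]]
```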
And S203, based on the text similarity matrix, performing clustering analysis on all texts in the text set to obtain a text clustering result.
In an embodiment of the application, the electronic device may determine a similarity value between every two texts according to the text similarity matrix, and then determine whether the two texts may be grouped into one type based on a size relationship between the similarity value between the two texts and a preset similarity threshold.
Optionally, if the similarity value between the two texts is greater than or equal to the preset similarity threshold, the two texts are considered to be similar, and are grouped into one type. And if the similarity value between the two texts is smaller than a preset similarity threshold value, the two texts are considered to be dissimilar, and the two texts are respectively dispersed into different categories.
For example, assume that the preset similarity threshold is 0.8, and consider the text similarity matrix obtained above.
It can be known that the similarity value between text 1 and text 2 is 0.816, which is greater than the preset 0.8, the similarity value between text 1 and text 3 is 0.577, which is less than the preset 0.8, the similarity value between text 2 and text 3 is 0.707, which is also less than the preset 0.8, so the text clustering results of the text set including text 1 to text 3 are as follows: text 1 and text 2 may be grouped into one category and text 3 may be a separate category.
The text clustering method provided by the embodiment of the present application processes all texts included in a text set to be processed to obtain a text vector matrix corresponding to the text set, calculates a text similarity matrix corresponding to the text set according to the text vector matrix, and finally performs clustering analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result. In this technical scheme, texts are classified using the similarity between texts, so an unsupervised, high-precision clustering result is obtained at low cost, the computational complexity of text clustering is reduced, clustering can be computed in batches, noise texts have little influence on the overall model, and the accuracy of the clustering result is improved.
Exemplarily, on the basis of the above embodiments, fig. 3 is a schematic flow diagram of a second embodiment of a text clustering method provided in the embodiment of the present application. As shown in fig. 3, in this embodiment, the above S201 can be implemented by the following steps:
s301, based on a preset word segmentation rule, carrying out word segmentation processing on each text in the text set to obtain each text word set, wherein each text word set comprises at least one word.
In the embodiment of the application, a preset word segmentation rule may be stored in the electronic device in advance, so that after the electronic device obtains a text set to be processed, word segmentation processing may be performed on each text in the text set based on the preset word segmentation rule to obtain a text word set corresponding to each text.
Optionally, the word segmentation rule may include vertical domain non-word segmentation, text word segmentation, and the like.
For example, in this embodiment, the electronic device first identifies professional words in the vertical domain; for instance, medical terms such as "seborrheic alopecia" and "pityriasis rosea" may exist in the text, and such words determine the semantics of the text to a large extent. Therefore, before the text is segmented, it is first determined that these specified words are not to be split. Word segmentation is then performed on the remaining text according to the semantics of the words.
Optionally, in this embodiment of the application, the electronic device may add the accumulated professional vocabulary to the word segmenter (for example, a jieba-style segmenter) so that the vertical-domain words are not split.
Optionally, in an embodiment of the present application, before the step S301, the electronic device may further perform the following processing for the text in the text set:
preprocessing each text of the text set, and deleting preset attribute contents appearing in each text, wherein the preset attribute contents at least comprise one of the following contents: stop words, comment words, symbols.
Optionally, stop words are not keywords of sentences, which not only occupy a large amount of computing storage resources, but also cause semantic confusion to a large extent, for example, nonsense words such as "you", "i", "of", "having", "too", and the like, so that these words can be removed before the text is segmented.
In the present embodiment, the symbol may be a single symbol such as "&", "%" or the like, or may be contents composed of various symbols and numbers, letters, such as links or the like. The comment word may be a word for explaining a certain word, or may be a content having a low degree of correlation with the text content, such as a certain number, and thus may be deleted.
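A minimal Python sketch of the preprocessing and word-segmentation step is given below; it assumes the jieba segmenter and an illustrative stop-word list, neither of which is mandated by the embodiment, and the protected medical terms are the two examples mentioned above.

```python
import re
import jieba

# protect vertical-domain medical terms from being split (assumed examples)
for term in ["脂溢性脱发", "玫瑰糠疹"]:   # seborrheic alopecia, pityriasis rosea
    jieba.add_word(term)

STOP_WORDS = {"你", "我", "的", "了", "太"}  # illustrative stop-word list

def preprocess_and_segment(text):
    """Remove links, symbols, and stop words, then segment into a word list."""
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"[&%#@\d]+", " ", text)      # drop symbols and numbers
    return [w for w in jieba.lcut(text)
            if w.strip() and w not in STOP_WORDS]
```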
S302, obtaining a dictionary corresponding to the text set according to words included in all the text word sets, wherein each word in the dictionary has a unique identification.
Optionally, in this embodiment of the application, the electronic device may integrate all words in the text word set corresponding to all the texts to obtain a dictionary corresponding to the text set.
As an example, the specific implementation manner of this step S302 is as follows: firstly, words included in all text word sets are collected into a preset set, then repeated words in the preset set are deleted to obtain a preset target set, and then unique identifiers are added to all words included in the preset target set respectively to obtain a dictionary corresponding to the text set.
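As a sketch under the same assumptions, the dictionary of step S302 can be built by collecting the words of all text word sets, dropping duplicates, and assigning each remaining word a unique integer identifier (here 1-based, matching the numbering used in Table 1 below); the function name is illustrative.

```python
def build_dictionary(text_word_sets):
    """text_word_sets: list of word lists, one per text.
    Returns {word: id}, ids assigned in order of first appearance (1-based)."""
    dictionary = {}
    for words in text_word_sets:      # collect words from every text word set
        for w in words:
            if w not in dictionary:   # skip repeated words
                dictionary[w] = len(dictionary) + 1
    return dictionary
```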
For example, a specific implementation of obtaining a dictionary is described below in conjunction with three texts.
Assume that the text word set corresponding to text 1 includes the words: children, development, retardation, epilepsy, cerebral palsy, myasthenia, heredity, metabolism, tic disorder, pediatric, nervous system, pediatric internal medicine; the text word set corresponding to text 2 includes the words: children, tic disorder, hyperactivity disorder, development, retardation, heredity, metabolism, pediatric, nervous system; and the text word set corresponding to text 3 includes the words: adult, heredity, alopecia, plastic surgery.
In this embodiment, for text 1 to text 3, the preset set collects the words of all three text word sets, including repeated words such as "heredity" and "metabolism". Correspondingly, the preset target set is obtained after deleting the repeated words, and adding a unique identifier to each remaining word gives the dictionary corresponding to the text set. The word composition of the dictionary is shown in Table 1.
TABLE 1
1. children    2. development    3. retardation    4. epilepsy    5. myasthenia    6. heredity
7. metabolism    8. tic disorder    9. pediatric    10. nervous system    11. pediatric internal medicine    12. hyperactivity disorder
13. adult    14. alopecia    15. plastic surgery
And S303, obtaining the text vector of each line according to the mapping relation between the words included in each text word set and the dictionary.
In this embodiment, for each text, the electronic device may perform dictionary mapping on words in each text word set, and map each word to an id of an int type, that is, words in each text word set are replaced with ids, and each text is converted into a one-hot encoded form, so as to obtain each line of text vector.
As an example, the specific implementation manner of step S303 is as follows: firstly, mapping words included in each text word set into identifications in the dictionary to obtain each text identification set, and secondly, converting each text identification set into a line text vector according to the identifications included in each text identification set. The number of elements of each line text vector is the same as that of the elements of the dictionary, and the number of non-zero elements of each line text vector is the same as that of the words of the corresponding text word set.
For example, for text 1 to text 3 in step S302, the text identification set corresponding to text 1 includes: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and its line text vector is [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0]; the text identification set corresponding to text 2 includes: 1, 4, 12, 2, 3, 6, 7, 9, 10, and its line text vector is [1 1 1 1 0 1 1 0 1 1 0 1 0 0 0]; the text identification set corresponding to text 3 includes: 13, 6, 14, 15, and its line text vector is [0 0 0 0 0 1 0 0 0 0 0 0 1 1 1].
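A sketch of the mapping in step S303 (continuing the dictionary sketch above; the helper name is illustrative): each text word set is mapped to dictionary identifiers and then expanded into a one-hot style row whose length equals the dictionary size.

```python
def text_to_row_vector(words, dictionary):
    """Map a text word set to identifiers, then to a 0/1 row vector
    whose length equals the number of entries in the dictionary."""
    ids = [dictionary[w] for w in words if w in dictionary]
    row = [0] * len(dictionary)
    for i in ids:
        row[i - 1] = 1          # ids are 1-based, list indices 0-based
    return row

# e.g. a text containing dictionary words 13, 6, 14, 15 (text 3 above)
# yields [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```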
S304, obtaining a text vector matrix corresponding to the text set according to all the line text vectors.
In this embodiment, the electronic device integrates all the line text vectors, and may obtain a text vector matrix corresponding to the text set.
For example, for the text 1 to the text 3 included in the text set in step S302, the text vector matrix corresponding to the text set is as follows:
[1 1 1 1 1 1 1 1 1 1 1 0 0 0 0]
[1 1 1 1 0 1 1 0 1 1 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 1 1 1]
the text clustering method provided by the embodiment of the application includes the steps of firstly carrying out word segmentation processing on each text in a text set based on a preset word segmentation rule to obtain each text word set, then obtaining a dictionary corresponding to the text set according to words included in all the text word sets, then obtaining a text vector of each line according to a mapping relation between the words included in each text word set and the dictionary, and finally obtaining a text vector matrix corresponding to the text set according to all the line text vectors. In the technical scheme, the dictionary corresponding to the text set is obtained according to all the texts in the text set, so that the accuracy of the text vector matrix corresponding to the text set is improved, and the accuracy of the subsequent text clustering result is improved.
Exemplarily, on the basis of the above embodiments, fig. 4 is a schematic flow diagram of a third embodiment of a text clustering method provided in the embodiment of the present application. As shown in fig. 4, in this embodiment, the step S202 may be implemented by:
S401, according to the plurality of line text vectors included in the text vector matrix, calculating the similarity values between each line text vector and itself and between each line text vector and the other line text vectors, to obtain each line similarity vector.
S402, obtaining a text similarity matrix corresponding to the text set according to all the line similarity vectors.
In this embodiment, for each line text vector in the text vector matrix, the electronic device calculates, according to the text vector matrix corresponding to the text set, the cosine values between that line text vector and itself and between that line text vector and the other line text vectors in the text vector matrix, so as to obtain each line similarity element corresponding to that line text vector; all of these line similarity elements constitute the line similarity vector.
It can be understood that the calculation manner of the line similarity vectors corresponding to other line text vectors in the text vector matrix is similar, and the description thereof is omitted here.
Optionally, in an embodiment of the present application, each row similarity element in the row similarity vector is obtained by solving a cosine value between vectors, and thus a value range of each element value in the row similarity vector is between 0 and 1.
For example, for the text set in the embodiment shown in fig. 3, the similarity values between the line text vector corresponding to text 1 and the line text vectors corresponding to text 1, text 2, and text 3 are 1, 0.80403025, and 0.15075567, respectively; similarity values between the line text vector corresponding to the text 2 and the line text vectors corresponding to the text 1, the text 2 and the text 3 are 0.80403025, 1 and 0.16666667 respectively; similarity values between the line text vector corresponding to the text 3 and the line text vectors corresponding to the text 1, the text 2 and the text 3 are 0.15075567, 0.16666667 and 1 respectively.
Thus, the line similarity vector of text 1 is [1 0.80403025 0.15075567], the line similarity vector of text 2 is [0.80403025 1 0.16666667], and the line similarity vector of text 3 is [0.15075567 0.16666667 1]. Therefore, the text similarity matrix corresponding to the text set is as follows:
[1          0.80403025 0.15075567]
[0.80403025 1          0.16666667]
[0.15075567 0.16666667 1         ]
therefore, the similarity value of the text 1 and the text 2 is about 0.804; the similarity value of text 1 and text 3 is about 0.15; the similarity value of text 2 and text 3 is about 0.16.
Optionally, referring to fig. 4, in this embodiment, the step S203 may be implemented by:
s403, determining a classification threshold and a first cluster, wherein the first cluster comprises any one text in the text set.
Specifically, in the embodiment of the present application, in order to implement automatic aggregation classification on a text, a developer may first preset a classification threshold and a first cluster in an electronic device. The classification threshold is used to determine whether two texts can be classified into the same class, and the first cluster is a preset classification reference, which may be any one of the texts in the text set.
It will be appreciated that the classification threshold and the text included in the first cluster may be interpreted as an initial condition for the text cluster.
S404, according to the text similarity matrix, the classification threshold and the first cluster, performing cluster analysis on all texts in the text set to obtain a text cluster result.
For example, after the electronic device determines the text similarity matrix corresponding to the text set, the similarity value between any two texts can be determined, and then the automatic clustering of the texts can be realized by combining the determined classification threshold and the first clustering.
Optionally, in an embodiment of the present application, the step S404 may be implemented by:
and A1, determining a similarity value between a first text and each text in the first cluster according to the text similarity matrix, wherein the first text is any text which is not clustered and divided in the text set.
And for any text which does not participate in the clustering in the text set, marking as a first text, and obtaining a similarity value between the first text and each text in the first clustering by the electronic equipment by querying the determined text similarity matrix.
It can be understood that, in this embodiment, when the determined cluster further includes a cluster other than the first cluster, and when the electronic device determines that the first text cannot be clustered into the first cluster, it needs to obtain a similarity value between the first text and each text in other clusters by querying the text similarity matrix until determining that the first text belongs to the cluster or does not belong to any determined cluster.
And A2, if the similarity values between the first text and the texts in the first cluster in the preset proportion are all larger than or equal to the classification threshold, clustering the first text into the first cluster.
The preset proportion may be 40%, 50%, or another value, but the preset proportion must be a value less than or equal to 100%.
It can be understood that the higher the value of the preset ratio, the higher the accuracy of the clustering result. The specific value of the preset ratio can be set according to the precision of the actual requirement, and is not described herein again.
In this embodiment, when the electronic device determines that the similarity values between the first text and the texts in the preset proportion in the first cluster are all greater than or equal to the classification threshold, the first text is directly clustered into the first cluster, so that on the premise of ensuring the clustering accuracy, the computing time and the computer resources can be saved, and meanwhile, the clustering efficiency is improved.
And A3, if the similarity values between the first text and all texts in the first cluster are less than the classification threshold value, generating a second cluster, wherein the second cluster comprises the first text.
In this embodiment, when it is determined by calculation that the similarity values between the first text and all the texts in the first cluster are smaller than the preset classification threshold, a new cluster, for example, a second cluster, may be generated from the first text.
It is understood that the first and second ones of the first and second clusters in this application are used to denote different clusters only, and do not denote other meanings.
A4, when all texts in the text set have participated in cluster division, obtaining the text clustering result, wherein the text clustering result comprises: all of the determined clusters and all of the texts that each cluster comprises.
In this embodiment, the electronic device may perform cluster partitioning on all texts in the text set based on the implementation steps of a1 to A3 described above until all texts in the text set participate in the cluster partitioning. And when all texts in the text set participate in cluster division, determining all the divided clusters and each text included by each cluster.
In plain terms, step S404 proceeds as follows for a developer: first, a classification threshold is determined, mainly by observation; specifically, several pairs of semantically similar texts are picked out from all the texts included in the text set, the similarity matrix is queried for the similarity values between them, so that a rough value range can be determined, and the threshold is fine-tuned in the subsequent clustering process.
Second, after determining the classification threshold, the electronic device can perform an unsupervised clustering process. For example, a first text is designated as a first cluster, a semantic similarity value comparison is performed between a second text and the text included in the first cluster, if the similarity value is greater than a classification threshold value, clustering combination is performed, otherwise, a new cluster is generated.
In the subsequent clustering process, the current text is compared with each cluster; if the number of texts in the current cluster whose similarity values to the current text are greater than the classification threshold exceeds the preset proportion of the number of texts in that cluster, the current text is added to that cluster, and if no cluster meets the requirement, a new cluster is generated, thereby completing the unsupervised clustering process.
Exemplarily, for the text set in the embodiment shown in fig. 3, taking the text similarity matrix obtained in step S401 as an example: if text 1 forms the first cluster, and the similarity between text 2 and text 1 is about 0.804, text 1 and text 2 are considered to belong to the same cluster; the similarities between text 3 and text 1 and between text 3 and text 2 are about 0.151 and 0.167, respectively, so a new class, i.e., the second cluster, is added, and the second cluster includes text 3. Thus, the text clustering result of the text set is:
first cluster: text 1 and text 2; second cluster: text 3.
According to the text clustering method provided by the embodiment of the application, firstly, according to a plurality of line text vectors included in the text vector matrix, the similarity value between each line text vector and between each line text vector and each other line text vectors is calculated to obtain each line similarity vector, then, according to all the line similarity vectors, the text similarity matrix corresponding to the text set is obtained, finally, a classification threshold value and a first cluster are determined, and according to the text similarity matrix, the classification threshold value and the first cluster, all texts in the text set are subjected to clustering analysis to obtain a text clustering result. In the technical scheme, the classification threshold of similarity clustering is determined by calculating the text similarity matrix in batches, unsupervised clustering of the text is realized, the calculation complexity is greatly reduced, the influence of noise text on the whole semantics is weakened, and the clustering accuracy is improved.
Further, on the basis of any one of the above embodiments, fig. 5 is a schematic flow chart of a fourth embodiment of the text clustering method provided in the embodiment of the present application. As shown in fig. 5, in this embodiment, the text clustering method may further include the following steps:
s501, acquiring an object recommendation request sent by a user, wherein the object recommendation request comprises: a description text.
In the embodiment of the application, a user can send an object recommendation request to the electronic device through the service terminal, and the description text included in the object recommendation request can be a text describing the inquiry information of the user or a description text requesting to recommend an object. The embodiment of the application does not limit the concrete representation form of the description text, and the description text can be determined according to an actual scene.
For example, for a user service platform of an internet hospital, the description text may be a patient condition introduction of a patient user or an adequacy description for requesting a recommended doctor.
And S502, determining a target cluster matched with the object recommendation request according to the description text and the text clustering result.
In this embodiment, the electronic device may determine, by combining the obtained description text and the text clustering result, a matching cluster of the description text, that is, a target cluster that matches the object recommendation request.
Illustratively, a user sends an inquiry description in the service platform APP, and the background electronic device matches, from the text set describing the doctors of the internet hospital, a target cluster suitable for receiving the inquiry according to the inquiry description, where the target cluster includes the profile (or specialty description) of at least one department doctor.
Wherein the doctor text set may be a doctor text set consisting of the profiles of all doctors in the internet hospital. Accordingly, the text clustering result may be a doctor clustering result determined based on the doctor text set and the text clustering method provided in the present application.
Illustratively, the explanation continues with the doctor recommendation of the internet hospital mentioned above and the text set included in the embodiment shown in fig. 3. If the description text includes words such as children, development, retardation, epilepsy, and cerebral palsy, the matching cluster of the description text can be determined, through this step S502, to be the first cluster including text 1 and text 2.
S503, determining a target text from the target cluster, and recommending an object corresponding to the target text to the user.
In this embodiment, after the electronic device determines the target cluster according to the description text sent by the user, a target text may be randomly selected from the target cluster, or the target text may be determined based on the comprehensive index of each text in the target cluster. The embodiment of the present application does not limit the specific method for determining the target text.
Illustratively, the comprehensive index of each text includes, but is not limited to, information such as the degree of relevance, the rating, the number of visits, and shift information. The specific content of the comprehensive index may be determined according to the actual scene, and is not described herein again.
In this embodiment, after the electronic device determines the target text, the object corresponding to the target text may be recommended to the user. For example, if the target text is the profile of a doctor, the recommended object is that doctor, so that accurate recommendation of doctors is realized.
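Illustratively, the flow of steps S501 to S503 may be sketched in the following Python snippet. This is only a minimal illustration under stated assumptions: the word-overlap similarity, the rule of choosing the cluster with the highest average similarity to the description text, and the optional composite_score function (a hypothetical stand-in for an index combining relevance, rating, number of visits and shift information) are not prescribed by the embodiment.

```python
# Minimal sketch of S501-S503 (assumptions noted above).

def overlap(query_text, candidate_text):
    # Fraction of the query's words that also appear in the candidate text.
    qw, cw = set(query_text.split()), set(candidate_text.split())
    return len(qw & cw) / len(qw) if qw else 0.0

def recommend(description_text, texts, clusters, composite_score=None):
    # S502: the target cluster is the one whose member texts are, on average,
    # most similar to the description text. 'clusters' is a list of lists of
    # indices into 'texts', as produced by the clustering stage.
    target_cluster = max(
        clusters,
        key=lambda c: sum(overlap(description_text, texts[i]) for i in c) / len(c))
    # S503: pick a member text (here simply the first one), or the member with
    # the best value of a hypothetical comprehensive index.
    if composite_score is None:
        return texts[target_cluster[0]]
    return texts[max(target_cluster, key=composite_score)]

# Hypothetical usage:
# texts = ["pediatric development retardation epilepsy", "cardiology hypertension"]
# clusters = [[0], [1]]
# recommend("children development retardation epilepsy cerebral palsy", texts, clusters)
```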
Practice verifies that, with the inquiry platform of a certain internet hospital as the background, real online inquiries are subjected to data cleaning and cluster analysis by using NLP technology and the unsupervised clustering method to obtain a text clustering result, and the text clustering result is applied to the doctor recommendation module of the internet hospital with a good application effect.
According to the text clustering method provided by the embodiment of the application, an object recommendation request sent by a user is obtained, where the object recommendation request includes a description text; a target cluster matched with the object recommendation request is determined according to the description text and the text clustering result; a target text is then determined from the target cluster, and the object corresponding to the target text is recommended to the user. In the technical scheme, the improved precision of the text clustering result improves the precision of object recommendation, improves the user experience, and lays a foundation for improving the competitiveness of the product.
In summary, the advantages of the text clustering method provided in the embodiment of the present application can be summarized as follows: low-density text clusters can be identified well, text information is processed in batches, the influence of "noise" texts on the overall semantics is weakened, only one clustering parameter is needed and it is easy to understand, the classification result is not affected by sample density, and the classification threshold ensures that the clustering process can be executed without supervision.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the methods of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 6 is a schematic structural diagram of an embodiment of a text clustering device provided in the embodiment of the present application. Referring to fig. 6, the apparatus may include: a processing module 601, a calculation module 602 and a clustering module 603.
The processing module 601 is configured to process all texts included in a text set to be processed to obtain a text vector matrix corresponding to the text set, where the number of rows of the text vector matrix is the same as the number of text entries included in the text set, and a row vector of the text vector matrix is a vector representation of each text;
the calculating module 602 is configured to calculate a text similarity matrix corresponding to the text set according to the text vector matrix, where each element value of the text similarity matrix is a similarity value between row vectors at corresponding positions of two text vector matrices;
the clustering module 603 is configured to perform clustering analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result.
For example, in a possible design of this embodiment, the processing module 601 is specifically configured to:
based on a preset word segmentation rule, carrying out word segmentation processing on each text in the text set to obtain each text word set, wherein each text word set comprises at least one word;
obtaining a dictionary corresponding to the text set according to words included in all text word sets, wherein each word in the dictionary has a unique identifier;
obtaining each line text vector according to a mapping relation between words included in each text word set and the dictionary;
and obtaining a text vector matrix corresponding to the text set according to all the line text vectors.
Optionally, the processing module 601 is configured to obtain a dictionary corresponding to the text set according to words included in all text word sets, specifically:
the processing module 601 is specifically configured to:
collecting words included in all text word sets into a preset set;
deleting repeated words in the preset set to obtain a preset target set;
and adding a unique identifier to each word included in the preset target set to obtain a dictionary corresponding to the text set.
Optionally, the processing module 601 is configured to obtain each line text vector by using a mapping relationship between a word included in each text word set and the dictionary, specifically:
the processing module 601 is specifically configured to:
mapping words included in each text word set into identifications in the dictionary to obtain each text identification set;
and converting each text identification set into a line text vector according to the identification included in each text identification set, wherein the element number of each line text vector is the same as that of the dictionary, and the number of non-zero elements of each line text vector is the same as that of the words of the corresponding text word set.
Optionally, the processing module 601 is further configured to, before performing word segmentation processing on each text in the text set based on a preset word segmentation rule to obtain each text word set, perform preprocessing on each text in the text set, and delete preset attribute content appearing in each text, where the preset attribute content at least includes one of the following contents: stop words, comment words, symbols.
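Illustratively, the behavior of the processing module 601 may be sketched as follows. The whitespace tokenizer and the placeholder stop-word list are assumptions made to keep the example self-contained; an actual implementation would apply the preset word segmentation rule (for Chinese text, a dedicated segmenter) and the preset attribute content configured for the scene.

```python
# Minimal sketch of the processing module 601 (assumptions noted above).

STOP_WORDS = {"the", "a", "of"}  # placeholder preset attribute content (assumption)

def build_text_vector_matrix(texts):
    # Preprocess and segment each text into a text word set.
    word_sets = []
    for text in texts:
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        word_sets.append(words)

    # Build the dictionary: every distinct word gets a unique identifier,
    # equivalent to collecting all words and removing duplicates.
    dictionary = {}
    for words in word_sets:
        for w in words:
            if w not in dictionary:
                dictionary[w] = len(dictionary)

    # Map each text word set to a line text vector whose number of elements
    # equals the dictionary size; non-zero elements mark the words present.
    matrix = []
    for words in word_sets:
        row = [0] * len(dictionary)
        for w in words:
            row[dictionary[w]] += 1
        matrix.append(row)
    return matrix, dictionary
```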
For example, in another possible design of this embodiment, the calculating module 602 is specifically configured to:
according to a plurality of line text vectors included in the text vector matrix, calculating a similarity value between each line text vector and itself and each of the other line text vectors, to obtain each line similarity vector;
and obtaining a text similarity matrix corresponding to the text set according to all the line similarity vectors.
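Illustratively, the calculating module 602 may be sketched as follows, taking the matrix produced by the previous sketch as input. Cosine similarity is an assumption; the embodiment only requires some similarity value between every pair of line text vectors.

```python
import math

# Minimal sketch of the calculating module 602 (cosine similarity assumed).

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_similarity_matrix(text_vector_matrix):
    # Element (i, j) is the similarity between line text vectors i and j,
    # so row i is the line similarity vector of text i.
    n = len(text_vector_matrix)
    return [[cosine(text_vector_matrix[i], text_vector_matrix[j])
             for j in range(n)]
            for i in range(n)]
```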
For example, in another possible design of this embodiment, the clustering module 603 is specifically configured to:
determining a classification threshold and a first cluster, wherein the first cluster comprises any one text in the text set;
and performing cluster analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain a text clustering result.
Optionally, the clustering module 603 is configured to perform clustering analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain the text clustering result, and specifically includes:
the clustering module 603 is specifically configured to:
determining a similarity value between a first text and each text in the first cluster according to the text similarity matrix, wherein the first text is any text which is not clustered and divided in the text set;
if the similarity values between the first text and a preset proportion of the texts in the first cluster are all greater than or equal to the classification threshold, clustering the first text into the first cluster;
if the similarity values between the first text and all texts in the first cluster are smaller than the classification threshold value, generating a second cluster, wherein the second cluster comprises the first text;
when all texts in the text set have participated in cluster division, obtaining the text clustering result, wherein the text clustering result comprises: all determined clusters and all texts included in each cluster.
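Illustratively, the clustering module 603 may be sketched as follows. The preset proportion (0.5 here), the order in which existing clusters are examined, and the handling of the case where some but fewer than the preset proportion of similarities reach the threshold (a new cluster is opened) are assumptions made for illustration.

```python
# Minimal sketch of the clustering module 603 (assumptions noted above).

def cluster_texts(similarity_matrix, classification_threshold, proportion=0.5):
    n = len(similarity_matrix)
    if n == 0:
        return []
    clusters = [[0]]  # the first cluster holds an arbitrary text (index 0)
    for i in range(1, n):
        placed = False
        for cluster in clusters:
            sims = [similarity_matrix[i][j] for j in cluster]
            # Join the cluster if at least the preset proportion of its
            # members reach the classification threshold.
            if sum(s >= classification_threshold for s in sims) >= proportion * len(sims):
                cluster.append(i)
                placed = True
                break
        if not placed:
            # Below the threshold for every existing cluster: the text
            # seeds a new cluster.
            clusters.append([i])
    # The text clustering result: all determined clusters and the texts
    # (indices) each cluster includes.
    return clusters
```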
For example, in another possible design of this embodiment, the text clustering device further includes: the system comprises an acquisition module, a determination module and a recommendation module;
the obtaining module is configured to obtain an object recommendation request sent by a user, where the object recommendation request includes: a description text;
the determining module is used for determining a target cluster matched with the object recommendation request according to the description text and the text clustering result, and determining a target text from the target cluster;
the recommending module is used for recommending the object corresponding to the target text to the user.
The device provided in the embodiment of the present application may be used to execute the method in the embodiments shown in fig. 2 to fig. 5, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the determining module may be a processing element that is separately set up, or may be implemented by being integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and the function of the determining module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), or one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when some of the above modules are implemented in the form of a processing element scheduling program code, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call program code. As another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present application. As shown in fig. 7, the electronic device may include: the system comprises a processor 71, a memory 72, a communication interface 73 and a system bus 74, wherein the memory 72 and the communication interface 73 are connected with the processor 71 through the system bus 74 and complete mutual communication, the memory 72 is used for storing computer execution instructions, the communication interface 73 is used for communicating with other devices, and the processor 71 implements the scheme of the embodiment shown in fig. 2 to 5 when executing the computer program.
In fig. 7, the processor 71 may be a general-purpose processor including a central processing unit CPU, a Network Processor (NP), and the like; but also a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components.
The memory 72 may comprise Random Access Memory (RAM), Read-Only Memory (ROM), and non-volatile memory, such as at least one disk memory.
The communication interface 73 is used to enable communication between the database access device and other devices (e.g., clients, read-write libraries, and read-only libraries).
The system bus 74 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
Optionally, an embodiment of the present application further provides a computer-readable storage medium, where computer instructions are stored, and when the computer instructions are executed on a computer, the computer is caused to execute the method according to the embodiment shown in fig. 2 to 5.
Optionally, an embodiment of the present application further provides a chip for executing the instruction, where the chip is configured to execute the method in the embodiment shown in fig. 2 to 5.
Embodiments of the present application further provide a program product, where the program product includes a computer program, where the computer program is stored in a computer-readable storage medium, and the computer program can be read by at least one processor from the computer-readable storage medium, and the at least one processor can implement the method in the embodiments shown in fig. 2 to 5 when executing the computer program.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A text clustering method, comprising:
processing all texts included in a text set to be processed to obtain a text vector matrix corresponding to the text set, wherein the number of rows of the text vector matrix is the same as the number of text strips included in the text set, and a row vector of the text vector matrix is a vector representation of each text;
according to the text vector matrix, calculating a text similarity matrix corresponding to the text set, wherein each element value of the text similarity matrix is a similarity value between line vectors at corresponding positions of the two text vector matrices;
and performing clustering analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result.
2. The method according to claim 1, wherein the processing all texts included in the text set to be processed to obtain a text vector matrix corresponding to the text set comprises:
based on a preset word segmentation rule, carrying out word segmentation processing on each text in the text set to obtain each text word set, wherein each text word set comprises at least one word;
obtaining a dictionary corresponding to the text set according to words included in all text word sets, wherein each word in the dictionary has a unique identifier;
obtaining each line text vector according to a mapping relation between words included in each text word set and the dictionary;
and obtaining a text vector matrix corresponding to the text set according to all the line text vectors.
3. The method according to claim 2, wherein the obtaining a dictionary corresponding to the text set according to the words included in all the text word sets comprises:
collecting words included in all text word sets into a preset set;
deleting repeated words in the preset set to obtain a preset target set;
and adding a unique identifier to each word included in the preset target set to obtain a dictionary corresponding to the text set.
4. The method of claim 2, wherein the obtaining each line text vector according to the mapping relationship between the words included in each text word set and the dictionary comprises:
mapping words included in each text word set into identifications in the dictionary to obtain each text identification set;
and converting each text identification set into a line text vector according to the identification included in each text identification set, wherein the element number of each line text vector is the same as that of the dictionary, and the number of non-zero elements of each line text vector is the same as that of the words of the corresponding text word set.
5. The method according to claim 2, wherein before performing word segmentation processing on each text in the text set based on a preset word segmentation rule to obtain each text word set, the method further comprises:
preprocessing each text of the text set, and deleting preset attribute contents appearing in each text, wherein the preset attribute contents at least comprise one of the following contents: stop words, comment words, symbols.
6. The method according to any one of claims 1-5, wherein said calculating a text similarity matrix corresponding to the text set according to the text vector matrix comprises:
according to a plurality of line text vectors included in the text vector matrix, calculating a similarity value between each line text vector and itself and each of the other line text vectors, to obtain each line similarity vector;
and obtaining a text similarity matrix corresponding to the text set according to all the line similarity vectors.
7. The method according to any one of claims 1 to 5, wherein the performing a cluster analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result comprises:
determining a classification threshold and a first cluster, wherein the first cluster comprises any one text in the text set;
and performing cluster analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain a text clustering result.
8. The method of claim 7, wherein the performing cluster analysis on all texts in the text set according to the text similarity matrix, the classification threshold and the first cluster to obtain the text clustering result comprises:
determining a similarity value between a first text and each text in the first cluster according to the text similarity matrix, wherein the first text is any text which is not clustered and divided in the text set;
if the similarity values between the first text and a preset proportion of the texts in the first cluster are all greater than or equal to the classification threshold, clustering the first text into the first cluster;
if the similarity values between the first text and all texts in the first cluster are smaller than the classification threshold value, generating a second cluster, wherein the second cluster comprises the first text;
when all texts in the text set have participated in cluster division, obtaining the text clustering result, wherein the text clustering result comprises: all determined clusters and all texts included in each cluster.
9. The method of claim 1, further comprising:
obtaining an object recommendation request sent by a user, wherein the object recommendation request comprises: a description text;
determining a target cluster matched with the object recommendation request according to the description text and the text clustering result;
and determining a target text from the target cluster, and recommending an object corresponding to the target text to the user.
10. A text clustering apparatus, comprising: the device comprises a processing module, a calculation module and a clustering module;
the processing module is configured to process all texts included in a text set to be processed to obtain a text vector matrix corresponding to the text set, where the number of rows of the text vector matrix is the same as the number of text strips included in the text set, and a row vector of the text vector matrix is a vector representation of each text;
the calculation module is used for calculating a text similarity matrix corresponding to the text set according to the text vector matrix, wherein each element value of the text similarity matrix is a similarity value between the row vectors of the corresponding positions of the two text vector matrices;
and the clustering module is used for carrying out clustering analysis on all texts in the text set based on the text similarity matrix to obtain a text clustering result.
11. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of the claims 1-9 when executing the program.
12. A computer-readable storage medium having stored thereon computer instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-9.
CN202010228254.4A 2020-03-27 2020-03-27 Text clustering method, equipment and storage medium Pending CN113449102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010228254.4A CN113449102A (en) 2020-03-27 2020-03-27 Text clustering method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010228254.4A CN113449102A (en) 2020-03-27 2020-03-27 Text clustering method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113449102A true CN113449102A (en) 2021-09-28

Family

ID=77807877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010228254.4A Pending CN113449102A (en) 2020-03-27 2020-03-27 Text clustering method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449102A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361671A (en) * 2023-06-01 2023-06-30 浪潮通用软件有限公司 Post-correction-based high-entropy KNN clustering method, equipment and medium
CN116361671B (en) * 2023-06-01 2023-08-22 浪潮通用软件有限公司 Post-correction-based high-entropy KNN clustering method, equipment and medium

Similar Documents

Publication Publication Date Title
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
WO2020007138A1 (en) Method for event identification, method for model training, device, and storage medium
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN104077723B (en) A kind of social networks commending system and method
WO2020114100A1 (en) Information processing method and apparatus, and computer storage medium
CN109726391B (en) Method, device and terminal for emotion classification of text
CN111651641B (en) Graph query method, device and storage medium
CN111191276A (en) Data desensitization method and device, storage medium and computer equipment
CN110969172A (en) Text classification method and related equipment
CN112214576B (en) Public opinion analysis method, public opinion analysis device, terminal equipment and computer readable storage medium
CN115905630A (en) Graph database query method, device, equipment and storage medium
CN113449102A (en) Text clustering method, equipment and storage medium
CN111597336B (en) Training text processing method and device, electronic equipment and readable storage medium
CN111523309A (en) Medicine information normalization method and device, storage medium and electronic equipment
CN112148880A (en) Customer service dialogue corpus clustering method, system, equipment and storage medium
CN111967045A (en) Big data-based data publishing privacy protection algorithm and system
CN115905885A (en) Data identification method, device, storage medium and program product
CN112836057B (en) Knowledge graph generation method, device, terminal and storage medium
WO2021135103A1 (en) Method and apparatus for semantic analysis, computer device, and storage medium
CN113609363A (en) User searching method and device
CN112541069A (en) Text matching method, system, terminal and storage medium combined with keywords
CN114579762B (en) Knowledge graph alignment method, device, equipment, storage medium and program product
CN112685574B (en) Method and device for determining hierarchical relationship of domain terms
CN112269860B (en) Automatic response processing method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination