CN117493560A - Text processing method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN117493560A
CN117493560A
Authority
CN
China
Prior art keywords
text
clusters
target
cluster
text feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310928489.8A
Other languages
Chinese (zh)
Inventor
王梦宇
蒋宁
陆全
夏粉
肖冰
李宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd filed Critical Mashang Consumer Finance Co Ltd
Priority to CN202310928489.8A
Publication of CN117493560A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a text processing method and apparatus, an electronic device, and a computer readable storage medium. The method includes: acquiring text feature vectors of a text to be processed; performing clustering iterative processing on the text feature vectors to obtain k target clusters; and determining text labels of the text to be processed according to the k target clusters. The clustering iterative processing of the text feature vectors includes: in the i-th clustering process, obtaining target correction values of k first clusters and k first cluster mean values, where a first cluster mean value is the mean of the text feature vectors in the corresponding first cluster; correcting the k first cluster mean values according to the target correction values to obtain k target cluster center points; clustering the plurality of text feature vectors according to the k target cluster center points to obtain k second clusters; and, if the k second clusters satisfy a preset convergence condition, taking the k second clusters as the k target clusters. According to the embodiments of the present disclosure, the speed of acquiring text labels can be improved.

Description

Text processing method and device, electronic equipment and computer readable storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a text processing method and device, an electronic device, and a computer readable storage medium.
Background
Currently, in the field of natural language processing (NLP), unsupervised learning methods are often used to obtain the label corresponding to a text and thereby label the text automatically, because manually labeling text generally incurs a large cost.
When performing unsupervised learning on text, a text clustering method may generally be used to obtain the labels of the text. For example, the k-means clustering algorithm (kmeans), or the kmeans++ method derived from it, may be used to cluster the text and so obtain its labels.
Disclosure of Invention
The disclosure provides a text processing method and device, electronic equipment and a computer readable storage medium.
In a first aspect, the present disclosure provides a text processing method, including:
obtaining text feature vectors of a text to be processed, wherein the text feature vectors are in one-to-one correspondence with the text to be processed;
performing clustering iterative processing on the text feature vectors to obtain k target clusters, wherein k is a positive integer;
determining text labels of the text to be processed according to the k target clusters;
The clustering iterative processing for the text feature vector comprises the following steps:
in the ith clustering process, obtaining target correction values of k first clusters and k first cluster mean values, wherein the first cluster mean values are mean values of text feature vectors in the corresponding first clusters, and i is a positive integer;
correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points;
clustering the text feature vectors according to the k target cluster center points to obtain k second clusters;
and if the k second clusters meet a preset convergence condition, taking the k second clusters as the k target clusters.
In a second aspect, the present disclosure provides a text processing apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring text feature vectors of a text to be processed, and the text feature vectors are in one-to-one correspondence with the text to be processed;
the processing unit is used for carrying out clustering iterative processing on the text feature vectors to obtain k target clusters, wherein k is a positive integer;
the label determining unit is used for determining text labels of the texts to be processed according to the k target clusters;
The processing unit is used for performing clustering iterative processing on the text feature vectors, and is used for:
in the ith clustering process, obtaining target correction values of k first clusters and k first cluster mean values, wherein the first cluster mean values are mean values of text feature vectors in the corresponding first clusters, and i is a positive integer;
correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points;
clustering the text feature vectors according to the k target cluster center points to obtain k second clusters;
and if the k second clusters meet a preset convergence condition, taking the k second clusters as the k target clusters.
In a third aspect, the present disclosure provides an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executable by the at least one processor to enable the at least one processor to perform the text processing method of the first aspect described above.
In a fourth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the text processing method of the first aspect described above.
The embodiments provided by the present disclosure address the following problem in the related art: during clustering iterative processing of the text to be processed or of its text feature vectors, the cluster center point may deviate from the actual cluster center among the text feature vectors, which slows the convergence of the clustering result. In the current i-th clustering process, the mean value of the text feature vectors in each of the k first clusters is not used directly as a cluster center point; instead, a target correction value of the k first clusters is acquired, and the k first cluster mean values are corrected based on it, which speeds up the clustering iteration over the text feature vectors. The k second clusters are then obtained based on the k corrected target cluster center points and, when the k second clusters satisfy the preset convergence condition, are used as the k target clusters from which the text labels of the text to be processed are determined, so that the text labels are obtained more quickly.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
FIG. 1 is a schematic diagram of an implementation environment of a text processing method provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of a text processing method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a primary clustering process provided by an embodiment of the present disclosure;
FIG. 4 is a flowchart for determining an initial cluster center point provided by an embodiment of the present disclosure;
FIG. 5 is a block diagram of a text processing apparatus provided by an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the related art, when performing unsupervised learning on a text, a text clustering method may be generally used to obtain a label of the text, for example, a k-means clustering method (kmeans, kmeans clustering algorithm) or a kmeans++ method modified from the kmeans method may be used to cluster the text to obtain a label of the text.
In the related art, when the kmeans method is used to cluster a text to obtain its label, the following steps are generally taken:
1. Determine the number of clusters k, and randomly initialize k cluster centers, i.e., cluster center points.
2. Assign each text to a cluster according to its distance, for example the Euclidean distance, to each of the k cluster center points, completing one round of clustering.
3. Within each new cluster, re-determine the cluster center point as the mean of the texts in the cluster, then re-assign each text to the nearest cluster according to its Euclidean distance to the new cluster center points.
4. Repeat step 3 until k target clusters satisfying the convergence condition are obtained.
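The four steps above can be sketched as follows. This is a minimal, illustrative kmeans with random initialization, not code from the patent; the function names and the convergence tolerance are assumptions:

```python
import math
import random

def euclidean(a, b):
    # Step 2's distance measure: Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly initialize k cluster center points from the data.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: euclidean(p, centers[c]))
            clusters[j].append(p)
        # Step 3: recompute each center as the mean of its cluster.
        new_centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer move (convergence).
        if all(euclidean(a, b) < 1e-9 for a, b in zip(centers, new_centers)):
            break
        centers = new_centers
    return centers, clusters
```

With two well-separated groups of three points each, the iteration settles into one cluster per group regardless of which points the random initialization happens to pick.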
In this kmeans-based text processing method, the initial cluster center points are selected randomly, so the clustering result depends on that initial selection. This can slow the convergence of the clustering result and leaves the result insufficiently stable. Consequently, when the labels of the text must be determined from the k target clusters obtained, both the speed at which the text labels are obtained and the accuracy of those labels may be affected.
When texts are clustered based on the improved kmeans++ method, the improvement mainly concerns step 1: after the number of clusters k is determined, the initial k cluster center points are selected as follows. One text is first selected at random as the first cluster center point; then the distance between each text and the existing cluster center points is calculated, the probability of each text being selected as the next cluster center point is computed according to a formula, and the text with the highest probability is selected as the second cluster center point. Iterating in this way yields k initial cluster center points, after which the k target clusters are obtained via steps 2-4.
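The kmeans++ initialization described above can be sketched as follows. Note one nuance: standard kmeans++ samples the next center with probability proportional to D(x)², the squared distance to the nearest chosen center, while the description's variant of always taking the highest-probability text would instead pick the argmax. This is an illustrative sketch under that standard-sampling assumption, not the patent's code:

```python
import random

def kmeanspp_init(points, k, seed=0):
    rng = random.Random(seed)
    # The first center is still chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = [min(sum((px - cx) ** 2 for px, cx in zip(p, c)) for c in centers)
              for p in points]
        total = sum(d2)
        # Sample the next center with probability proportional to D(x)^2, so
        # points far from existing centers are more likely to be picked.
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

Because a point already chosen as a center has D(x)² = 0, it is (almost surely) never chosen again, so the k initial centers are distinct.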
In this kmeans++-based text processing method, because the initial cluster center points are selected step by step, the stability of the clustering result improves to some extent; however, the first cluster center point is still selected randomly, so the result remains insufficiently stable. Moreover, the step-by-step selection of initial cluster center points does not speed up the clustering; it further slows it down. This method therefore still obtains text labels slowly, and the labels obtained may still be inaccurate.
To solve at least one of the problems in the related art, the embodiments of the present disclosure start from the observation that, when the label of a text is acquired by text clustering, convergence is often hindered by outliers among the texts: if a randomly initialized cluster center point is an outlier, or lies close to one, then even if each iteration takes the mean of the samples in the previous round's clusters as the new cluster center points, the clustering result is still affected by the outlier and converges slowly. Therefore, in the embodiments of the present disclosure, in the current i-th clustering process, the mean values of the text feature vectors in the k first clusters are not used directly as the cluster center points of the current clustering process; instead, a target correction value of the k first clusters is acquired and the k first cluster mean values are corrected based on it, so that the k target cluster center points determined for the current round lie closer to the actual cluster centers. This reduces the number of clustering iterations, increases the speed of the clustering iterative processing of the text feature vectors, and thus increases the speed at which the text labels are acquired.
Please refer to fig. 1, which is a schematic diagram of an implementation environment of a text processing method according to an embodiment of the present disclosure. As shown in fig. 1, the implementation environment may include a server 101, a terminal device 102, and a network 103.
The server 101 may be, for example, a physical server, for example, a blade server, a rack server, or the like, or the server 101 may be a virtual server, for example, a server cluster deployed in the cloud, which is not limited herein. In the embodiment of the present disclosure, the server 101 may be configured to receive a text to be processed sent by the terminal device 102, and process the text to be processed based on the text processing method of any embodiment of the present disclosure to obtain a text label of the processed text.
The terminal device 102 may be a smart phone, a portable computer, a desktop computer, a tablet computer, etc. In the embodiment of the present disclosure, the terminal device 102 may be configured to obtain a text to be processed and send it to the server 101 for text processing, so as to obtain a text label of the text to be processed; it may also be configured to receive the text label of the text to be processed returned by the server 101, and to display the text to be processed together with its text label for the user to view.
The network 103 may be a wireless network or a wired network, and may be a local area network or a wide area network. Communication between the server 101 and the terminal device 102 may be performed via a network 103.
In the disclosed embodiments, the server 101 may be used to participate in implementing text processing methods according to any of the embodiments of the disclosure. For example, can be used to: acquiring a text to be processed sent by the terminal equipment 102, and acquiring text feature vectors of the text to be processed, wherein the text feature vectors are in one-to-one correspondence with the text to be processed; clustering iterative processing is carried out on the text feature vectors to obtain k target clusters, wherein k is a positive integer; determining text labels of texts to be processed according to the k target clusters; the clustering iterative processing is carried out on the text feature vectors, and the clustering iterative processing comprises the following steps: in the ith clustering process, obtaining target correction values of k first clusters and k first cluster mean values, wherein the first cluster mean values are mean values of text feature vectors in the corresponding first clusters, and i is a positive integer; correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points; clustering the text feature vectors according to the center points of the k target clusters to obtain k second clusters; and if the k second clusters meet the preset convergence condition, taking the k second clusters as k target clusters. After obtaining the text label of the text to be processed, the server 101 may also be configured to send the text label of the text to be processed to the terminal device 102, so that the terminal device 102 displays the text label of the text to be processed for confirmation by the user.
It will be appreciated that the implementation environment shown in fig. 1 is merely illustrative and is in no way intended to limit the disclosure, its application or uses. For example, although fig. 1 shows only one server 101 and one terminal device 102, it is not meant to limit the respective numbers, and a plurality of servers 101 and a plurality of terminal devices 102 may be included in the implementation environment.
To solve at least one problem in the related art when performing text processing, an embodiment of the present disclosure provides a text processing method, please refer to fig. 2, which is a flowchart of a text processing method provided in an embodiment of the present disclosure. It should be noted that, the text processing method provided in the embodiment of the present disclosure may be applied to an electronic device, which may be a server, for example, may be the server 101 shown in fig. 1, and of course, in actual implementation, the electronic device may also be a terminal device, which is not limited in particular herein.
As shown in fig. 2, the text processing method provided by the embodiments of the present disclosure may include the following steps S201 to S203.
Step S201, obtaining text feature vectors of a text to be processed, wherein the text feature vectors are in one-to-one correspondence with the text to be processed.
In the embodiment of the present disclosure, the text to be processed may be any type of text, and a plurality of different texts may be included in the text to be processed. For example, the text to be processed may be a plurality of evaluation texts of users in the financial field for a certain type of financial product; for another example, the text to be processed may be a plurality of comment texts of a user on a certain commodity in the e-commerce field.
For convenience of explanation, in the following explanation, the number of texts in the text to be processed is described as N, where N is a positive integer, for example, N may be 5000, that is, the text to be processed may include 5000 texts.
A text feature vector is the feature vector obtained after feature extraction processing is performed on a text to be processed; it can be understood that, where the text to be processed includes N texts, the number of text feature vectors is also N.
In some embodiments, the text feature vector of the text to be processed may be a feature vector obtained by performing feature extraction processing on the text to be processed using the BERT model, and of course, in actual implementation, the text feature vector may also be obtained by other manners, which is not limited herein specifically.
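A real pipeline would obtain the text feature vectors from a pretrained BERT encoder, as the paragraph notes. As a self-contained stand-in for illustration only, the sketch below hashes tokens into a fixed-size bag-of-words vector and L2-normalizes it; the function name and the dimensionality are assumptions, not part of the disclosure:

```python
import hashlib

def text_feature_vector(text, dim=16):
    # Illustrative stand-in for a BERT encoder: each token bumps one of
    # `dim` hash buckets, so texts sharing words get nearby vectors.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    # L2-normalize so vector magnitude does not depend on text length.
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]
```

The resulting fixed-length vectors can then be fed to the clustering iteration exactly as BERT embeddings would be; only the quality of the representation differs.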
Step S202, clustering iterative processing is carried out on the text feature vectors to obtain k target clusters, wherein k is a positive integer.
In the embodiment of the disclosure, the clustering may also be called a clustering cluster, which means that a clustering is obtained after a text feature vector is clustered; correspondingly, the cluster center point refers to a text feature vector at the center position of the cluster; the samples in the clusters described in the following description refer to text feature vectors included in the clusters unless otherwise specified.
That is, after the text feature vector of the text to be processed is obtained in step S201, the clustering iterative process may be performed on the text feature vector to obtain k target clusters.
As shown in fig. 2, in order to increase the speed of clustering the text feature vectors of the text to be processed, and thus the speed of obtaining the text labels of the text to be processed, the clustering iterative processing of the text feature vectors in step S202 may include the following steps S2021-S2024:
in step S2021, in the ith clustering process, the target correction values of k first clusters and k first cluster means are obtained, where the first cluster means is the mean of text feature vectors in the corresponding first clusters, and i is a positive integer.
In the embodiment of the present disclosure, the k first clusters refer to k clusters obtained in the i-1 th clustering process. The first cluster mean is the mean of all text feature vectors in the corresponding first cluster. The first cluster mean may also be a mean of a plurality of text feature vectors in the corresponding first cluster.
In step S2022, correction processing is performed on the k first cluster means according to the target correction value, so as to obtain k target cluster center points.
Step S2023, clustering the plurality of text feature vectors according to the center points of the k target clusters to obtain k second clusters;
in step S2024, if the k second clusters satisfy the preset convergence condition, the k second clusters are taken as k target clusters.
In the embodiment of the present disclosure, the preset convergence condition may be that the number of clustering iterations reaches a preset number of times; alternatively, the preset convergence condition may be: the distance between the k target cluster center points obtained by recalculation in the ith clustering process and the corresponding cluster center points in the k first cluster center points in the i-1 th clustering process is smaller than a preset threshold, namely, when the clustering process of the current round is carried out, the position of each cluster center point redetermined in the current round is not greatly changed from the position of each cluster center point determined in the previous round.
For example, in the case where the number of clusters is 2, let the cluster center points recalculated in the i-th clustering process be x11 and x12, and the cluster center points in the (i-1)-th clustering process be x21 and x22; if the distances between x11 and x21 and between x12 and x22 are both smaller than the preset threshold, it may be determined that the 2 clusters obtained in the i-th clustering process have converged.
In the embodiment of the present disclosure, in the process of performing the clustering process, when calculating the distance between text feature vectors, the euclidean distance between the text feature vectors may be calculated. Of course, in practical implementation, the distance may be calculated according to other manners, for example, or may be calculated by calculating a manhattan distance (Manhattan Distance) between text feature vectors, which is not limited herein.
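The convergence check and the two distance measures mentioned above can be sketched as follows; the threshold value is an assumption, since the disclosure only speaks of a "preset threshold":

```python
import math

def euclidean(a, b):
    # Euclidean distance between two text feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Manhattan distance, mentioned as an alternative distance measure.
    return sum(abs(x - y) for x, y in zip(a, b))

def has_converged(new_centers, old_centers, dist=euclidean, threshold=1e-4):
    # Converged when every re-determined center point moved less than
    # `threshold` from its counterpart in the previous clustering round.
    return all(dist(a, b) < threshold
               for a, b in zip(new_centers, old_centers))
```

The iteration-count criterion is simpler still: stop once the loop counter reaches the preset number of rounds, whichever condition is configured.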
In the embodiment of the disclosure, the target correction value is a value used to correct the position of each first cluster mean x_{j,i-1} obtained in the (i-1)-th clustering process when re-determining the target cluster center point x_{j,i} in the i-th clustering process, where i denotes the round (or number of rounds) of the current clustering iteration and j denotes any one of the k clusters.
For example, in the case where k is 2, i.e., the number of clusters is 2, suppose the clusters obtained in the (i-1)-th, e.g., 3rd, clustering process are C_13 and C_23, with corresponding cluster mean values x_13 and x_23. In the related art, when the cluster center points need to be re-determined in the i-th, e.g., 4th, clustering process, x_13 and x_23 are generally determined directly as the cluster center points of the 4th clustering process, and each text feature vector is clustered again based on x_13 and x_23.
However, in the embodiment of the present disclosure, it is considered that the distribution of text feature vectors is often not uniform. If the text feature vectors are clustered based on the clustering method in the related art and an outlier (Outlier) exists in a cluster, redetermining the cluster center point based on the mean value of the sample points in the cluster may produce a large deviation from the actual cluster center point, so that multiple iterations are required for the clustering result to converge. An outlier may be a point in a cluster obtained by clustering whose position deviates far from the other sample points in the cluster, or a point, among a plurality of sample points to be clustered, whose position deviates far from the other sample points.
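The effect of an outlier on the in-cluster mean can be illustrated numerically (the sample points are hypothetical):

```python
# Three points form a dense group near (1, 1); the fourth is an outlier.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (10.0, 10.0)]
mean = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
print(mean)  # the mean lands far from the dense region near (1, 1)
```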
To solve the above problem, in the embodiments of the present disclosure, for the k first cluster means obtained after the (i-1)th, e.g., 3rd, clustering process, such as x_{1,3} and x_{2,3} described above, the k first cluster means are not directly determined as the target cluster center points of the ith, i.e., 4th, clustering process. Instead, a target correction value des_chk of the k first clusters, e.g., C_{1,3} and C_{2,3}, is obtained; the cluster mean x_{1,3} of C_{1,3} is corrected based on des_chk to obtain x_{new,1,4}, and the cluster mean x_{2,3} of C_{2,3} is corrected based on des_chk to obtain x_{new,2,4}; and x_{new,1,4} and x_{new,2,4} are taken as the target cluster center points of the current ith, i.e., 4th, clustering process.
In the embodiment of the present disclosure, during the clustering iteration, after the k target cluster center points are redetermined in a clustering process, the clustering processing performed on the text feature vectors in step S2023 may be: calculating the Euclidean distance between each text feature vector and the k target cluster center points, and re-clustering the text feature vectors into k clusters according to the calculated Euclidean distances, thereby obtaining k second clusters.
After the k second clusters are obtained, if the k second clusters meet the preset convergence condition, the k second clusters are taken as the k target clusters; if any one of the k second clusters does not meet the preset convergence condition, the k second clusters may be taken as new first clusters, i is incremented by 1, and the above steps S2021-S2024 are repeated until k target clusters meeting the preset convergence condition are obtained.
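Steps S2021-S2024 can be sketched as one loop as follows. This is a hedged reading of the method, not a definitive implementation: the correction rule (blending each cluster mean with the mean of the k cluster means), the 0.9/0.1 weights, and the convergence threshold are assumptions based on the examples given elsewhere in this disclosure.

```python
import math

def corrected_kmeans(points, centers, w1=0.9, w2=0.1, threshold=1e-3, max_iters=100):
    """One reading of steps S2021-S2024: in each round, the new target
    cluster center is a weighted blend of the cluster's own mean (the
    first cluster mean) and the mean of the k cluster means (taken here
    as the target correction value)."""
    clusters = [[] for _ in centers]
    for _ in range(max_iters):
        # Assign each text feature vector to its nearest center (Euclidean).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # First cluster means (keep the old center if a cluster is empty).
        means = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        # Target correction value: mean of the k first cluster means.
        corr = tuple(sum(dim) / len(means) for dim in zip(*means))
        # Corrected target cluster center points.
        new_centers = [
            tuple(w1 * m + w2 * c for m, c in zip(mean, corr)) for mean in means
        ]
        # Preset convergence condition: every center moved less than threshold.
        if all(math.dist(o, n) < threshold for o, n in zip(centers, new_centers)):
            return new_centers, clusters
        centers = new_centers
    return centers, clusters
```

On well-separated data this converges within a few rounds; the per-cluster memberships are returned alongside the final center points.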
After obtaining k target clusters based on step S202, step S203 may be executed, where text labels of the text to be processed are determined according to the k target clusters.
As can be seen, the text processing method provided in the embodiments of the present disclosure addresses the problem in the related art that the clustering result converges slowly when clustering iterative processing is performed on the text to be processed or on its text feature vectors. When clustering iterative processing is performed on the text feature vectors of the text to be processed, in the current ith clustering process, the mean value of the text feature vectors in each of the k first clusters is not directly used as a cluster center point of the current clustering process; instead, a target correction value of the k first clusters is acquired, and the k first cluster means are corrected based on the target correction value, so as to avoid slow convergence of the clustering result caused by a cluster center point deviating from the actual cluster center due to outliers in the text feature vectors, thereby improving the speed of the clustering iterative processing of the text feature vectors. The text feature vectors are then clustered based on the k target cluster center points obtained by correction to obtain k second clusters; when the k second clusters satisfy the preset convergence condition, the k second clusters are taken as the k target clusters, so that the text labels of the text to be processed can be quickly obtained based on the k target clusters.
Please refer to fig. 3, which is a flowchart of a primary clustering process provided in an embodiment of the present disclosure. As shown in fig. 3, in some embodiments, the method may further perform the primary clustering process on the text feature vectors through the following steps S301-S303 before performing the above step S2021.
In step S301, cluster number analysis processing is performed on the text feature vector, and the value of k is determined.
The cluster number analysis process refers to a process for determining the number of clusters, that is, the value of k. In the embodiment of the present disclosure, the cluster number analysis process may be an analysis process that determines the value of k by analyzing the text feature vectors based on the silhouette coefficient method. In actual processing, the value of k may also be determined in other manners; for example, the elbow method (Elbow Method) may be used to determine the value of k, which is not particularly limited herein.
And step S302, sampling analysis processing is carried out on the text feature vectors, and k initial cluster center points are obtained.
And step S303, clustering the text feature vectors according to the k initial cluster center points.
After the k initial cluster center points are obtained in step S302, the text feature vectors may be subjected to primary clustering based on the k initial cluster center points, that is, euclidean distances between each text feature vector and the k initial cluster center points may be calculated, and each text feature vector may be classified into k clusters according to the calculated euclidean distances.
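The primary clustering step can be sketched as a nearest-center assignment (the vectors and centers shown are illustrative only):

```python
import math

def primary_clustering(vectors, initial_centers):
    """Assign each text feature vector to the initial cluster center point
    with the smallest Euclidean distance, yielding k primary clusters."""
    clusters = [[] for _ in initial_centers]
    for p in vectors:
        nearest = min(range(len(initial_centers)),
                      key=lambda j: math.dist(p, initial_centers[j]))
        clusters[nearest].append(p)
    return clusters

clusters = primary_clustering([(0.1, 0.0), (9.9, 10.0)],
                              [(0.0, 0.0), (10.0, 10.0)])
print([len(c) for c in clusters])  # [1, 1]
```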
In the related art, when clustering is initially performed, k cluster center points are generally selected at random for the primary clustering, as in the k-means clustering method; or one cluster center point is first selected at random and the other k-1 cluster center points are then calculated step by step, as in the k-means++ clustering method. As can be seen, in the related art at least 1 cluster center point is selected at random. When a randomly selected cluster center point is far from the other sample points, that is, if there is an outlier among the randomly selected cluster center points, multiple iterative clustering processes are generally required to bring the cluster center points close to the actual cluster centers. Therefore, randomly selecting the initial cluster center points tends to slow down the clustering, and may also make the clustering result unstable.
For this reason, the embodiment of the present disclosure considers the following: if a text feature vector is the center point of a certain cluster, then when the text feature vectors are randomly ordered and sampled in a random sampling manner, the text feature vector located at the cluster center is generally drawn more often than the other text feature vectors that are not cluster center points; that is, the distribution of text feature vectors is generally concentrated around the center of each cluster. Therefore, when selecting the k initial cluster center points, the method provided in the embodiment of the present disclosure does not adopt random selection, but determines the k initial cluster center points through sampling analysis processing of the text feature vectors, so that the initially determined k cluster centers are closer to the actual cluster centers and interference caused by outliers is avoided. Clustering based on these k initial cluster center points can improve not only the convergence speed of the clustering result but also its stability.
Referring to fig. 4, a flowchart of determining an initial cluster center point is provided in an embodiment of the present disclosure. As shown in fig. 4, in some embodiments, the sample analysis processing is performed on the text feature vector in the step S302 to obtain k initial cluster center points, which may include the following steps S401 to S403:
step S401, sampling the text feature vector for multiple times to obtain a sampling result.
In the embodiment of the disclosure, the multiple sampling processes may be sampling with replacement; that is, after a text feature vector is drawn in one sampling, the drawn text feature vector is recorded and then put back.
In some embodiments, when the text feature vectors are sampled multiple times, the text feature vectors may be sampled k×N times with replacement. In this embodiment, performing k×N sampling processes with replacement is considered to correspond to the determined number of clusters k; that is, if it is determined through analysis that the N samples can be divided into k clusters, then statistically, when the N samples are sampled k×N times, the sample points at the central position of each cluster are generally drawn more often than the other sample points. Therefore, the accuracy of the initially determined cluster center points can be improved through this sampling process, so as to improve the speed of the clustering iteration and the stability of the clustering result.
Step S402, obtaining the frequency number of each text feature vector in the sampling result, wherein the frequency number of the text feature vector is used for representing the number of times the text feature vector is extracted in multiple sampling processes; and, executing step S403, determining the text feature vectors corresponding to the k frequency numbers satisfying the preset condition as k initial cluster center points.
The preset condition may be top-k; that is, after the frequency of each text feature vector is obtained, the text feature vectors corresponding to the k highest frequencies may be determined as the k initial cluster center points.
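Steps S401-S403 can be sketched as follows. Note that the disclosure does not specify the sampling distribution; with plain uniform sampling with replacement, draw frequencies would not reflect the cluster structure, so this sketch models the stated behavior, that densely surrounded vectors are drawn more often, by weighting each vector by a local neighbor count. The `radius` parameter and the density weighting itself are assumptions, not part of the source.

```python
import math
import random
from collections import Counter

def initial_centers(vectors, k, radius=1.0, seed=0):
    """Steps S401-S403: sample with replacement k*N times, count how often
    each vector is drawn, and take the k most frequently drawn vectors as
    the initial cluster center points."""
    rng = random.Random(seed)
    n = len(vectors)
    # Weight each vector by how many vectors (including itself) lie within
    # `radius`, so that center-like vectors are more likely to be drawn.
    weights = [sum(1 for v in vectors if math.dist(u, v) <= radius) for u in vectors]
    # k*N draws with replacement (step S401).
    draws = rng.choices(range(n), weights=weights, k=k * n)
    # Frequency of each drawn vector (step S402), then top-k (step S403).
    freq = Counter(draws)
    return [vectors[i] for i, _ in freq.most_common(k)]
```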
Therefore, unlike the related art, in which randomly selecting the initial cluster center points for clustering leads to slow clustering and unstable clustering results, the method provided by the embodiment of the disclosure determines the initial k cluster center points by sampling the text feature vectors, which can improve the accuracy of the obtained k initial cluster center points, thereby improving the clustering speed and the stability of the clustering result.
In some embodiments, the acquiring the target correction values of the k first clusters described in the step S2021 may include: and acquiring the average value of the central points of each of the k first clusters as a first average value, and taking the first average value as a target correction value.
That is, in the embodiment of the present disclosure, the target correction value in the current round may be the average value of the center points of each of the k first clusters obtained through the previous round of clustering.
In addition, in some embodiments, the acquiring the target correction values of the k first clusters in the step S2021 may be: under the condition that i is 1, acquiring an average value of text feature vectors as a second average value, and taking the second average value as a target correction value; and under the condition that i is greater than 1, acquiring the average value of the central points of each of the k first clusters as a first average value, and taking the first average value as a target correction value.
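The two branches above can be sketched in one helper (a hedged reading; the function name and argument layout are assumptions):

```python
def target_correction(i, vectors, first_cluster_means):
    """Round-dependent target correction value, following the two branches
    above: for i == 1 use the mean of all text feature vectors (the second
    average value); for i > 1 use the mean of the k first cluster means
    (the first average value)."""
    pts = vectors if i == 1 else first_cluster_means
    return tuple(sum(dim) / len(pts) for dim in zip(*pts))

print(target_correction(1, [(0.0, 0.0), (2.0, 2.0)], []))  # (1.0, 1.0)
print(target_correction(2, [], [(0.0, 0.0), (4.0, 4.0)]))  # (2.0, 2.0)
```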
That is, in the embodiment of the present disclosure, it is considered that even if the text feature vectors are sampled to determine the initial k cluster center points, those k initial cluster center points may still deviate somewhat from the actual cluster center points. Therefore, after the primary clustering, in the first round of the clustering iterative processing, the average value of the text feature vectors, that is, the center point of all text feature vectors, may be obtained as the target correction value, so that the position of each cluster center point is pulled toward the center point of all the text feature vectors. This further improves the accuracy of the cluster center points, and thus the clustering speed and the accuracy of the clustering result.
Of course, after the k cluster center points of the first clustering process in the clustering iteration are corrected based on the whole-sample center point, that is, the average value of the text feature vectors, as the target correction value, it is considered that if the k first cluster means of the previous round were corrected based on the whole-sample center point (the second average value) in every subsequent round, the amplitude of each correction might remain unchanged. Therefore, in order to increase the clustering speed, in the clustering processes after the first one, the first cluster means of the previous round may be corrected based on the first average value, that is, the average value of the cluster center points determined in the previous round of clustering, as the target correction value.
It should be noted that, in actual implementation, the target correction value may also be determined in other manners. For example, both the first average value and the second average value may be obtained, and the target correction value may be obtained from them; that is, different weights may be set for the first average value and the second average value, and the target correction value may be obtained by their weighted sum. By obtaining the target correction value from multiple dimensions and obtaining the target cluster center points in each round of clustering based on it, the clustering speed and the accuracy of the clustering result can be further improved.
In some embodiments, the correcting the k first cluster means according to the target correction value in step S2022 to obtain k target cluster center points includes: acquiring a first preset weight of the first cluster mean value and a second preset weight of the target correction value; and obtaining a target cluster center point of the first cluster mean value based on the target correction value, the first preset weight and the second preset weight.
The first preset weight may be, for example, 0.9, and the second preset weight may be, for example, 0.1, and in actual implementation, the values of the first preset weight and the second preset weight may be set as required, which is not limited herein.
In some embodiments, the obtaining, based on the target correction value, the first preset weight, and the second preset weight, the target cluster center point corresponding to the first cluster mean may include: obtaining a first numerical value according to the first cluster mean value and a first preset weight; obtaining a second value according to the target correction value and the second preset weight; and obtaining a target cluster center point of the first cluster mean value according to the first numerical value and the second numerical value.
In practical implementation, the calculation may be performed by the following formula: target cluster center point = corresponding first cluster mean × first preset weight + target correction value × second preset weight.
For example, in the case where k is 2, i.e., the number of clusters is 2, and the clusters obtained in the 3rd clustering process are C_{1,3} and C_{2,3}, with corresponding cluster means x_{1,3} and x_{2,3}, the target correction value x_d in the fourth clustering process may be the average value of x_{1,3} and x_{2,3}, and the target cluster center points in the fourth clustering process may be: x_{new,1,4} = x_{1,3}*0.9 + x_d*0.1 and x_{new,2,4} = x_{2,3}*0.9 + x_d*0.1.
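The weighted-sum formula can be checked directly (the 0.9/0.1 weights follow the example in the text; the vector values are illustrative):

```python
def corrected_center(first_cluster_mean, correction, w1=0.9, w2=0.1):
    """Target cluster center point = first cluster mean * w1 + target
    correction value * w2, applied per dimension."""
    return tuple(w1 * m + w2 * c for m, c in zip(first_cluster_mean, correction))

x_13 = (2.0, 2.0)  # mean of one cluster after the 3rd round (illustrative)
x_d = (1.0, 1.0)   # target correction value (illustrative)
print(corrected_center(x_13, x_d))  # approximately (1.9, 1.9)
```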
After performing clustering iterative processing on the text feature vector of the text to be processed based on any embodiment to obtain k target clusters, in the embodiment of the present disclosure, determining the text label of the text to be processed according to the k target clusters in the above step S203 may be: acquiring a target text feature vector of a cluster center point of a target cluster of a text to be processed; and taking the text corresponding to the target text feature vector as a text label of the text to be processed.
That is, after the k target clusters are obtained, the text corresponding to the target text feature vector at the cluster center point of any one target cluster may be used as the text label of the texts corresponding to the other text feature vectors in that target cluster.
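The label-assignment step can be sketched as follows (the example texts and feature vectors are hypothetical):

```python
import math

def label_for_cluster(texts, vectors, center):
    """Use the text whose feature vector lies closest to the cluster center
    point as the label for every text in that cluster (ties broken by first
    occurrence)."""
    best = min(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))
    return texts[best]

texts = ["check my balance", "what is my balance", "balance inquiry"]
vectors = [(0.9, 0.1), (1.0, 0.0), (0.95, 0.05)]
print(label_for_cluster(texts, vectors, (0.95, 0.05)))  # balance inquiry
```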
It should be noted that, in actual implementation, after obtaining the text label of the text to be processed according to the method provided by the embodiment of the present disclosure, the text to be processed may be further processed according to the text label corresponding to each text in the text to be processed, for example, the text in which no intention is recognized in the plurality of texts may be mined based on the text label of the text obtained by clustering, so as to mine a new intention included in the text.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from their principles and logic; for brevity, the combinations are not described one by one in the present disclosure. It will be appreciated by those skilled in the art that, in the above methods of the embodiments, the particular order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the disclosure further provides a text processing apparatus, an electronic device, and a computer readable storage medium, all of which may be used to implement any text processing method provided in the disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions of the method parts, which are not repeated here.
Fig. 5 is a block diagram of a text processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 5, an embodiment of the present disclosure provides a text processing apparatus 500 including: an acquisition unit 501, a processing unit 502, and a tag determination unit 503.
The obtaining unit 501 is configured to obtain a text feature vector of a text to be processed, where the text feature vector corresponds to the text to be processed one by one.
The processing unit 502 is configured to perform clustering iterative processing on the text feature vectors to obtain k target clusters, where k is a positive integer.
The label determining unit 503 is configured to determine a text label of the text to be processed according to the k target clusters.
In this text processing apparatus 500, when performing clustering iterative processing on text feature vectors, the processing unit 502 may be configured to: in the ith clustering process, obtaining target correction values of k first clusters and k first cluster mean values, wherein the first cluster mean values are mean values of text feature vectors in the corresponding first clusters, and i is a positive integer; correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points; clustering the text feature vectors according to the center points of the k target clusters to obtain k second clusters; and if the k second clusters meet the preset convergence condition, taking the k second clusters as k target clusters.
In some embodiments, the apparatus 500 further comprises an initial cluster processing unit, which may be configured to: before clustering iterative processing is carried out on the text feature vectors, cluster number analysis processing is carried out on the text feature vectors, and the value of k is determined; sampling analysis processing is carried out on the text feature vector to obtain k initial cluster center points; and clustering the text feature vectors according to the k initial cluster center points.
In some embodiments, the initial clustering unit may be configured to, when performing sample analysis processing on the text feature vectors to obtain k initial cluster centers: sampling the text feature vector for multiple times to obtain a sampling result; obtaining the frequency number of each text feature vector in the sampling result, wherein the frequency number of the text feature vector is used for representing the number of times the text feature vector is extracted in multiple sampling processes; and determining the text feature vectors corresponding to the k frequency numbers meeting the preset condition as k initial cluster center points.
In some embodiments, the processing unit 502, when acquiring the target correction values for the k first clusters, may be configured to: and acquiring the average value of the central points of each of the k first clusters as a first average value, and taking the first average value as a target correction value.
In some embodiments, the processing unit 502, when acquiring the target correction values for the k first clusters, may be configured to: under the condition that i is 1, acquiring an average value of text feature vectors as a second average value, and taking the second average value as a target correction value; and under the condition that i is greater than 1, acquiring the average value of the central points of each of the k first clusters as a first average value, and taking the first average value as a target correction value.
In some embodiments, when the processing unit 502 performs correction processing on the k first cluster means according to the target correction value to obtain k target cluster center points, the processing unit may be configured to: acquiring a first preset weight of the first cluster mean value and a second preset weight of the target correction value; and obtaining a target cluster center point of the first cluster mean value based on the target correction value, the first preset weight and the second preset weight.
In some embodiments, when the processing unit 502 obtains the target cluster center point corresponding to the first cluster mean based on the target correction value, the first preset weight, and the second preset weight, the processing unit may be configured to: obtaining a first numerical value according to the first cluster mean value and a first preset weight; obtaining a second value according to the target correction value and the second preset weight; and obtaining a target cluster center point of the first cluster mean value according to the first numerical value and the second numerical value.
In some embodiments, the label determining unit 503 may be configured to, when determining a text label of the text to be processed according to the k target clusters: acquire the target text feature vector at the cluster center point of the target cluster of the text to be processed; and take the text corresponding to the target text feature vector as the text label of the text to be processed.
As can be seen, in the text processing apparatus provided in the embodiments of the present disclosure, for the problem in the related art that the clustering result converges slowly when clustering iterative processing is performed on the text feature vectors of the text to be processed, when the processing unit performs clustering iterative processing on the text feature vectors, the mean value of the text feature vectors in each of the k first clusters is not directly used as a cluster center point of the current clustering process; instead, a target correction value of the k first clusters is obtained, and the k first cluster means are corrected based on the target correction value, so as to avoid slow convergence of the clustering result caused by a cluster center point deviating from the actual cluster center due to outliers in the text feature vectors, thereby improving the speed of the clustering iterative processing of the text feature vectors.
The respective modules in the above-described text processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Referring to fig. 6, an embodiment of the present disclosure provides an electronic device 600 including: at least one processor 601; at least one memory 602, and one or more I/O interfaces 603, connected between the processor 601 and the memory 602; the memory 602 stores one or more computer programs executable by the at least one processor 601, and the one or more computer programs are executed by the at least one processor 601 to enable the at least one processor 601 to perform the text processing method as described above.
The various modules in the electronic device described above may be implemented in whole or in part in software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The disclosed embodiments also provide a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the above-described text processing method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor of an electronic device, performs the above-described text processing method.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, random Access Memory (RAM), read Only Memory (ROM), erasable Programmable Read Only Memory (EPROM), static Random Access Memory (SRAM), flash memory or other memory technology, portable compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
The computer program product described herein may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purpose of limitation. In some instances, it will be apparent to one skilled in the art that features, characteristics, and/or elements described in connection with a particular embodiment may be used alone or in combination with other embodiments unless explicitly stated otherwise. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (11)

1. A text processing method, comprising:
obtaining text feature vectors of a text to be processed, wherein the text feature vectors are in one-to-one correspondence with the text to be processed;
performing clustering iterative processing on the text feature vectors to obtain k target clusters, wherein k is a positive integer;
determining text labels of the text to be processed according to the k target clusters;
the clustering iterative processing for the text feature vector comprises the following steps:
in the ith clustering process, obtaining a target correction value of k first clusters and k first cluster mean values, wherein each first cluster mean value is the mean value of the text feature vectors in the corresponding first cluster, and i is a positive integer;
correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points;
clustering the text feature vectors according to the k target cluster center points to obtain k second clusters;
and if the k second clusters meet a preset convergence condition, taking the k second clusters as the k target clusters.
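Read as an algorithm, claim 1 describes a k-means-style loop whose centers are nudged by a correction term before reassignment. The following is a minimal Python sketch of one such iteration, purely for illustration and not the patent's implementation: Euclidean distance, the mean-of-centers correction of claim 4, and the weighted combination of claims 6-7 with hypothetical weights `w1` and `w2` are all assumptions.

```python
import numpy as np

def cluster_iteration(X, centers, w1=0.9, w2=0.1):
    """One clustering iteration: assign, average, correct (a sketch of claim 1)."""
    k = centers.shape[0]
    # Assign every text feature vector to its nearest current center
    # (the resulting groups play the role of the k "first clusters").
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # First cluster mean values: mean of the vectors in each first cluster.
    means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else centers[j] for j in range(k)])
    # Target correction value per claim 4: average of the k cluster center points.
    correction = centers.mean(axis=0)
    # Target cluster center points per claims 6-7: weighted combination of
    # each first cluster mean and the shared correction value.
    new_centers = w1 * means + w2 * correction
    return new_centers, labels
```

Iterating `cluster_iteration` until the assignments stop changing would yield the k target clusters of the claim; the convergence test itself is left unspecified by the claim.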
2. The method of claim 1, wherein prior to clustering the text feature vectors, the method further comprises:
performing cluster number analysis processing on the text feature vectors to determine the value of k;
performing sampling analysis processing on the text feature vectors to obtain k initial cluster center points;
and clustering the text feature vectors according to the k initial cluster center points.
3. The method according to claim 2, wherein the performing sampling analysis processing on the text feature vectors to obtain k initial cluster center points comprises:
sampling the text feature vectors a plurality of times to obtain a sampling result;
obtaining a frequency of each text feature vector in the sampling result, wherein the frequency of a text feature vector represents the number of times that text feature vector is extracted in the multiple sampling processes;
and determining the text feature vectors corresponding to the k frequencies that meet a preset condition as the k initial cluster center points.
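One plausible reading of this initialization, sketched in Python for illustration: draw repeated uniform samples, count how often each vector is drawn, and keep the k most frequently drawn vectors as initial centers. The uniform sampling distribution, the round and sample-size parameters, and top-k frequency as the "preset condition" are all assumptions, since the claim fixes none of them.

```python
import numpy as np

def init_centers(X, k, n_rounds=100, sample_size=None, seed=0):
    """Frequency-based center initialization (a sketch of claim 3)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample_size = sample_size or max(1, n // 2)
    counts = np.zeros(n, dtype=int)
    for _ in range(n_rounds):
        # One sampling pass; each draw increments the drawn vector's frequency.
        idx = rng.choice(n, size=sample_size, replace=True)
        counts += np.bincount(idx, minlength=n)
    # "Preset condition" assumed here to mean: the k largest frequencies.
    top = np.argsort(counts)[-k:]
    return X[top]
```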
4. The method of claim 1, wherein the obtaining target correction values for k first clusters comprises:
acquiring an average value of the center points of the k first clusters as a first average value, and taking the first average value as the target correction value.
5. The method of claim 1, wherein the obtaining target correction values for k first clusters comprises:
under the condition that i is 1, acquiring an average value of the text feature vector as a second average value, and taking the second average value as the target correction value;
and under the condition that i is greater than 1, acquiring an average value of the center points of the k first clusters as a first average value, and taking the first average value as the target correction value.
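Claims 4 and 5 can be folded into one small helper, sketched here only for illustration: in the first iteration the global mean of all text feature vectors serves as the correction value, and in later iterations the mean of the current k center points does.

```python
import numpy as np

def target_correction(centers, X=None, i=1):
    """Target correction value per claims 4-5 (illustrative sketch)."""
    if i == 1 and X is not None:
        # First iteration: second average value = mean of all feature vectors.
        return X.mean(axis=0)
    # Later iterations: first average value = mean of the k center points.
    return centers.mean(axis=0)
```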
6. The method of claim 1, wherein the correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points comprises:
acquiring a first preset weight of the first cluster mean value and a second preset weight of the target correction value;
and obtaining a target cluster center point corresponding to the first cluster mean value based on the target correction value, the first preset weight and the second preset weight.
7. The method of claim 6, wherein the obtaining a target cluster center point corresponding to the first cluster mean based on the target correction value, the first preset weight, and the second preset weight comprises:
obtaining a first numerical value according to the first cluster mean value and the first preset weight;
obtaining a second value according to the target correction value and the second preset weight;
and obtaining the target cluster center point corresponding to the first cluster mean value according to the first numerical value and the second numerical value.
8. The method of claim 1, wherein the determining the text label of the text to be processed according to the k target clusters comprises:
acquiring a target text feature vector of the cluster center point of the target cluster to which the text to be processed belongs;
and taking the text corresponding to the target text feature vector as a text label of the text to be processed.
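Since a cluster center produced by averaging need not coincide with any actual text feature vector, one way to realize claim 8 (an assumed interpretation, not stated by the claim) is to take the member vector closest to the center as the "target text feature vector" and return its text as the label:

```python
import numpy as np

def label_for(text_idx, X, labels, centers, texts):
    """Label a text with its cluster's most central member text (sketch of claim 8)."""
    c = labels[text_idx]                # target cluster of this text
    members = np.where(labels == c)[0]  # indices of vectors in that cluster
    # The member vector nearest the cluster center point stands in for the
    # "target text feature vector of the cluster center point".
    d = np.linalg.norm(X[members] - centers[c], axis=1)
    rep = members[d.argmin()]
    return texts[rep]
```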
9. A text processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring text feature vectors of a text to be processed, and the text feature vectors are in one-to-one correspondence with the text to be processed;
the processing unit is used for carrying out clustering iterative processing on the text feature vectors to obtain k target clusters, wherein k is a positive integer;
the label determining unit is used for determining text labels of the texts to be processed according to the k target clusters;
wherein, when performing the clustering iterative processing on the text feature vectors, the processing unit is used for:
in the ith clustering process, obtaining a target correction value of k first clusters and k first cluster mean values, wherein each first cluster mean value is the mean value of the text feature vectors in the corresponding first cluster, and i is a positive integer;
correcting the k first cluster mean values according to the target correction value to obtain k target cluster center points;
clustering the text feature vectors according to the k target cluster center points to obtain k second clusters;
and if the k second clusters meet a preset convergence condition, taking the k second clusters as the k target clusters.
10. An electronic device, comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the text processing method of any of claims 1-8.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the text processing method according to any of claims 1-8.
CN202310928489.8A 2023-07-26 2023-07-26 Text processing method and device, electronic equipment and computer readable storage medium Pending CN117493560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310928489.8A CN117493560A (en) 2023-07-26 2023-07-26 Text processing method and device, electronic equipment and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN117493560A true CN117493560A (en) 2024-02-02

Family

ID=89675119


Country Status (1)

Country Link
CN (1) CN117493560A (en)

Similar Documents

Publication Publication Date Title
CN110278175B (en) Graph structure model training and garbage account identification method, device and equipment
EP3678058A1 (en) Data sample label processing method and apparatus
US20150213376A1 (en) Methods and systems for generating classifiers for software applications
US20180197106A1 (en) Training data set determination
CN112232439B (en) Pseudo tag updating method and system in unsupervised ReID
CN112036509A (en) Method and apparatus for training image recognition models
US11157380B2 (en) Device temperature impact management using machine learning techniques
US11281999B2 (en) Predictive accuracy of classifiers using balanced training sets
CN111667056A (en) Method and apparatus for searching model structure
US20180081986A1 (en) Technologies for node-degree based clustering of data sets
JP2012208924A (en) Document comparison method and document comparison system based on various inter-document similarity calculation method using adaptive weighting
CN112183326A (en) Face age recognition model training method and related device
CN110335165B (en) Link prediction method and device
US11281867B2 (en) Performing multi-objective tasks via primal networks trained with dual networks
CN111104874A (en) Face age prediction method, training method and device of model and electronic equipment
US20220067443A1 (en) System and method for model-agnostic meta-learner for noisy data with label errors
CN112015439A (en) Embedding method, device and equipment for user APP interest and storage medium
CN117493560A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN116562357A (en) Click prediction model training method and device
WO2020157731A1 (en) Performing multi-objective tasks via primal networks trained with dual networks
WO2019121142A1 (en) Approaching homeostasis in a binary neural network
CN110704619B (en) Text classification method and device and electronic equipment
CN113313049A (en) Method, device, equipment, storage medium and computer program product for determining hyper-parameters
US11177018B2 (en) Stable genes in comparative transcriptomics
US20210165592A1 (en) Method, electronic device and computer program product for storing files

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination