CN112231471A - Text processing method and device, computer equipment and storage medium - Google Patents

Text processing method and device, computer equipment and storage medium

Info

Publication number
CN112231471A
CN112231471A (application CN202010920127.0A; granted publication CN112231471B)
Authority
CN
China
Prior art keywords
text
target
data set
sub
cluster
Prior art date
Legal status
Granted
Application number
CN202010920127.0A
Other languages
Chinese (zh)
Other versions
CN112231471B (en)
Inventor
俞子轩
Current Assignee
Dazhu Hangzhou Technology Co ltd
Original Assignee
Dazhu Hangzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Dazhu Hangzhou Technology Co., Ltd.
Priority to CN202010920127.0A
Publication of CN112231471A
Application granted
Publication of CN112231471B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text processing method and device, computer equipment and a storage medium. The method comprises the following steps: performing a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector; constructing a data set from the plurality of sentence vectors, wherein each sample point in the data set corresponds to one sentence vector; dividing the sample points in the data set into a preset number of clusters according to a parallelized k-median model, and determining the centroid of each cluster; and determining the sample point closest to each centroid, and outputting the corresponding sub-text according to that sample point. The invention solves the technical problem of low text-processing efficiency in the related art.

Description

Text processing method and device, computer equipment and storage medium
Technical Field
The invention relates to the field of computers, in particular to a text processing method and device, computer equipment and a storage medium.
Background
Today's society is an information age in which information is abundant and complex, which makes the processing of text data particularly time-consuming: a huge volume of text is difficult to process in a short time, and counting it manually produces an extremely high error rate.
In the related art, a K-means algorithm is used to cluster text data: a given sample set is divided into K clusters according to the distances between samples, so that the points within a cluster are as close together as possible and the distance between clusters is as large as possible, the goal being compact, well-separated clusters. However, the most time-consuming and computation-intensive part of K-means is calculating the distance between every point and the centroids; if the centroids are determined inaccurately, a great deal of computation time is wasted and text-processing efficiency suffers.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiments of the invention provide a text processing method and device, computer equipment and a storage medium, which at least solve the technical problem of low text-processing efficiency in the related art.
According to an embodiment of the present invention, there is provided a text processing method comprising: performing a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector; constructing a data set from the plurality of sentence vectors, wherein each sample point in the data set corresponds to one sentence vector; dividing the sample points in the data set into a preset number of clusters according to a parallelized k-median model, and determining the centroid of each cluster; and determining the sample point closest to each centroid, and outputting the corresponding sub-text according to that sample point.
Optionally, performing the word segmentation operation on the target text to form a plurality of sentence vectors includes: converting the target text into a plurality of fixed-length character string sequences, wherein each sub-text corresponds to one character string sequence; mapping each character string sequence into a target sentence vector based on a preset dictionary, wherein the words in the preset dictionary are represented by word vectors; and compressing each target sentence vector to obtain the plurality of sentence vectors, wherein each sentence vector represents an array of a preset dimension.
Optionally, converting the target text into a plurality of fixed-length character string sequences includes: performing word segmentation on a first sub-text to obtain a word segmentation set, wherein the first sub-text is any one of the sub-texts in the target text, and if the length of the first sub-text is smaller than a preset length, the end of the word segmentation set is padded with a preset word; matching the word segmentation set against a preset word group, wherein the preset word group represents a mapping relation between words and sequence numbers; if a first word contained in the first sub-text is matched in the preset word group, replacing the first word with the sequence number corresponding to the first word; if a second word contained in the first sub-text is not matched in the preset word group, replacing the second word with a preset character string, wherein the first word and the second word are each any word in the first sub-text; and generating the plurality of character string sequences according to the matching results.
Optionally, dividing the sample points in the data set into a preset number of clusters according to the parallelized k-median model and determining the centroid of each cluster includes: step a1, inputting the data set into a K-median model and setting the initial value of the preset number to K, wherein K is a natural number greater than 1; step b1, selecting K first centroids from the data set, and determining the first K cluster classes to which each first sample point belongs according to the median distances between that first sample point and the first centroids, wherein the first sample points are the sample points in the data set other than the first centroids; step c1, calculating the median of all sample points in a first cluster and taking the sample point corresponding to that median as the second centroid of the first cluster, wherein the first cluster is any one of the first K cluster classes; step d1, calculating the distances from the second sample points, i.e. the sample points other than the K second centroids, to the second centroids, so as to determine the second K cluster classes to which the second sample points belong; and iteratively executing steps c1 and d1 until the Nth centroids determined in the Nth iteration are the same as the (N-1)th centroids determined in the (N-1)th iteration, whereupon all the target clusters and the corresponding centroid of each target cluster are determined.
Optionally, the method further includes: calculating the mean value P1 of the distances from the sample points in a target cluster to the centroid of that target cluster, calculating the median O of the centroids, and calculating the mean value P2 of the distances from the centroids to the median O; calculating the ratio of P1 to P2, denoted M; comparing M with a threshold; if M is smaller than the threshold, determining the current K as the preset number; otherwise, cyclically executing the clustering steps a1 to d1 above with an updated K.
Optionally, selecting K first centroids from the data set and determining the first K cluster classes to which each first sample point belongs according to the median distances between that sample point and the first centroids includes the following steps: step a2, randomly selecting one sample point from the data set as centroid K1; step b2, traversing the data set and selecting the sample point farthest from centroid K1 as centroid K2; step c2, iteratively executing the operation of step b2 until K centroids have been selected; step d2, calculating, for any sample point in the data set, the median distances to the K centroids; and step e2, selecting the target centroid corresponding to the minimum of those median distances and assigning the corresponding sample point to the cluster class of that target centroid.
Optionally, determining the sample point closest to each centroid and outputting the corresponding sub-text according to that sample point includes: for a target cluster, calculating the median distances between all sample points in the target cluster and the target centroid of the target cluster; selecting the target sample point corresponding to the minimum of those median distances; determining the first sentence vector corresponding to the target sample point based on the mapping relation between the data set and the sentence vectors; and outputting the corresponding target sub-text based on the first sentence vector.
According to an embodiment of the present invention, there is provided a text processing apparatus comprising: a word segmentation module configured to perform a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector; a construction module configured to construct a data set from the plurality of sentence vectors, wherein each sample point in the data set corresponds to one sentence vector; a first determining module configured to divide the sample points in the data set into a preset number of clusters according to a parallelized k-median model and to determine the centroid of each cluster; and a second determining module configured to determine the sample point closest to each centroid and to output the corresponding sub-text according to that sample point.
Optionally, the word segmentation module includes: a conversion unit configured to convert the target text into a plurality of fixed-length character string sequences, wherein each sub-text corresponds to one character string sequence; a mapping unit configured to map each character string sequence into a target sentence vector based on a preset dictionary, wherein the words in the preset dictionary are represented by word vectors; and a compression unit configured to compress each target sentence vector to obtain the plurality of sentence vectors, wherein each sentence vector represents an array of a preset dimension.
Optionally, the conversion unit is configured to: perform word segmentation on a first sub-text to obtain a word segmentation set, wherein the first sub-text is any one of the sub-texts in the target text, and if the length of the first sub-text is smaller than a preset length, pad the end of the word segmentation set with a preset word; match the word segmentation set against a preset word group, wherein the preset word group represents a mapping relation between words and sequence numbers; if a first word contained in the first sub-text is matched in the preset word group, replace the first word with the sequence number corresponding to the first word; if a second word contained in the first sub-text is not matched in the preset word group, replace the second word with a preset character string, wherein the first word and the second word are each any word in the first sub-text; and generate the plurality of character string sequences according to the matching results.
Optionally, the first determining module is configured to perform the following steps: step a1, inputting the data set into a K-median model and setting the initial value of the preset number to K, wherein K is a natural number greater than 1; step b1, selecting K first centroids from the data set, and determining the first K cluster classes to which each first sample point belongs according to the median distances between that first sample point and the first centroids, wherein the first sample points are the sample points in the data set other than the first centroids; step c1, calculating the median of all sample points in a first cluster and taking the sample point corresponding to that median as the second centroid of the first cluster, wherein the first cluster is any one of the first K cluster classes; step d1, calculating the distances from the second sample points, i.e. the sample points other than the K second centroids, to the second centroids, so as to determine the second K cluster classes to which the second sample points belong; and iteratively executing steps c1 and d1 until the Nth centroids determined in the Nth iteration are the same as the (N-1)th centroids determined in the (N-1)th iteration, whereupon all the target clusters and the corresponding centroid of each target cluster are determined.
Optionally, the apparatus further comprises: a first calculating module configured to calculate the mean value P1 of the distances from the sample points in a target cluster to the centroid of that target cluster, to calculate the median O of the centroids, and to calculate the mean value P2 of the distances from the centroids to the median O; a second calculating module configured to calculate the ratio of P1 to P2, denoted M; a comparison module configured to compare M with a threshold; and a processing module configured to determine the current K as the preset number if M is smaller than the threshold, and otherwise to cyclically execute the clustering steps a1 to d1 above with an updated K.
Optionally, the first determining module is further configured to perform the following steps: step a2, randomly selecting one sample point from the data set as centroid K1; step b2, traversing the data set and selecting the sample point farthest from centroid K1 as centroid K2; step c2, iteratively executing the operation of step b2 until K centroids have been selected; step d2, calculating, for any sample point in the data set, the median distances to the K centroids; and step e2, selecting the target centroid corresponding to the minimum of those median distances and assigning the corresponding sample point to the cluster class of that target centroid.
Optionally, the second determining module includes: a calculating unit configured to calculate, for a target cluster, the median distances between all sample points in the target cluster and the target centroid of the target cluster; a selecting unit configured to select the target sample point corresponding to the minimum of those median distances; a determining unit configured to determine, based on the mapping relation between the data set and the sentence vectors, the first sentence vector corresponding to the target sample point; and an output unit configured to output the corresponding target sub-text based on the first sentence vector.
According to yet another embodiment of the present invention, there is also provided a computer device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps in any of the method embodiments described above when executed.
According to the invention, a word segmentation system is used to perform a word segmentation operation on the target text and generate sentence vectors; a data set of the sentence vectors is then constructed, wherein each sample point in the data set corresponds to one sentence vector; the sample points in the data set are clustered into a preset number of clusters according to a parallelized k-median model and the centroid of each cluster is determined, thereby clustering the plurality of sentence vectors; finally the sample point closest to each centroid, i.e. the representative sentence vector within each cluster, is determined and the corresponding sub-text is output. In this way representative sentences can be extracted from large batches of text data by clustering, which solves the technical problem of low text-processing efficiency in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure in which a text processing method according to an embodiment of the present invention is applied to a computer terminal;
FIG. 2 is a flow diagram of a method of text processing according to an embodiment of the invention;
FIG. 3 is a flow chart of Chinese text clustering according to an embodiment of the present invention;
fig. 4 is a block diagram of a text processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a server, a computer terminal, or a similar computing device. Taking the example of running on a computer terminal, fig. 1 is a block diagram of a hardware structure of a text processing method applied to a computer terminal according to an embodiment of the present invention. As shown in fig. 1, the computer terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the computer terminal. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the text processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In order to solve the technical problems in the related art, a text processing method is provided in the present embodiment, and fig. 2 is a flowchart of a text processing method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, performing word segmentation operation on a target text to form a plurality of statement vectors, wherein the target text comprises a plurality of sub-texts, and each sub-text corresponds to one statement vector;
in this embodiment, a word segmentation system is adopted to convert the imported large batch of text data into sentence vectors with a uniform data format.
Step S204, constructing a data set from the plurality of sentence vectors, wherein each sample point in the data set corresponds to one sentence vector;
Preferably, a mapping table between the sentence vectors and the data set is constructed to associate each sentence vector with a sample point in the data set.
Step S206, dividing the sample points in the data set into a preset number of clusters according to a parallelized k-median model, and determining the centroid of each cluster;
The parallelized k-median model in this embodiment is a K-median model based on the MapReduce (map-reduce) paradigm: for a given sample set, the set is divided into K clusters according to the distances between samples, with the centroid of each cluster determined by a median rather than a mean.
MapReduce is used for parallel processing of large-scale data sets (for example, data sets larger than 1 TB): a Map function is specified that maps a group of key-value pairs into a new group of key-value pairs, and a concurrent Reduce function is specified that merges all mapped values sharing the same key, so that the centroids are optimized and centroid accuracy improves.
Step S208, determining the sample point closest to each centroid and outputting the corresponding sub-text according to that sample point.
According to this embodiment of the invention, a word segmentation system is used to perform a word segmentation operation on the target text and generate sentence vectors; a data set of the sentence vectors is then constructed, wherein each sample point in the data set corresponds to one sentence vector; the sample points in the data set are clustered into a preset number of clusters according to a parallelized k-median model and the centroid of each cluster is determined, thereby clustering the plurality of sentence vectors; finally the sample point closest to each centroid, i.e. the representative sentence vector within each cluster, is determined and the corresponding sub-text is output. In this way representative sentences can be extracted from large batches of text data by clustering, which solves the technical problem of low text-processing efficiency in the related art.
In an alternative embodiment of the present disclosure, performing the word segmentation operation on the target text to form a plurality of sentence vectors includes: converting the target text into a plurality of fixed-length character string sequences, wherein each sub-text corresponds to one character string sequence; mapping each character string sequence into a target sentence vector based on a preset dictionary, wherein the words in the preset dictionary are represented by word vectors; and compressing each target sentence vector to obtain the plurality of sentence vectors, wherein each sentence vector represents an array of a preset dimension.
In an example of the present disclosure, taking Chinese text as an example, fig. 3 is a flow chart of Chinese text clustering according to an embodiment of the present invention. As shown in fig. 3, after a large batch of Chinese text is imported, data cleaning is performed on it: the text is reviewed and verified, duplicate information is deleted, errors are corrected, format consistency is ensured, and the sentence length is fixed. The cleaned Chinese text is then segmented to form a character string (token) sequence for the text.
According to the above embodiment, converting the target text into a plurality of fixed-length character string sequences includes: performing word segmentation on a first sub-text to obtain a word segmentation set, wherein the first sub-text is any one of the sub-texts in the target text, and if the length of the first sub-text is smaller than a preset length, the end of the word segmentation set is padded with a preset word; matching the word segmentation set against a preset word group, wherein the preset word group represents a mapping relation between words and sequence numbers; if a first word contained in the first sub-text is matched in the preset word group, replacing the first word with the sequence number corresponding to the first word; if a second word contained in the first sub-text is not matched in the preset word group, replacing the second word with a preset character string, wherein the first word and the second word are each any word in the first sub-text; and generating the plurality of character string sequences according to the matching results.
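A minimal sketch of this conversion step, under stated assumptions, is given below: `segment` stands in for whatever word segmentation tool is used (the patent does not name one), `word_to_index` is a tiny illustrative version of the preset word group, and the token names `pad` and `unk` follow the description above.

```python
# Sketch of converting sub-texts into fixed-length sequences of word indices.
# Assumptions: `segment` stands in for a real word segmentation tool, and
# `word_to_index` is a tiny illustrative version of the preset word group.

FIXED_LEN = 20          # preset sentence length in words
PAD, UNK = "pad", "unk" # padding word and unknown-word placeholder

word_to_index = {"me": 1, "day": 2, "today": 3, "good": 4, PAD: 100, UNK: 0}

def segment(sub_text: str) -> list[str]:
    # Placeholder segmenter: splits on whitespace. A real system would use a
    # Chinese word segmentation tool trained for the target industry.
    return sub_text.split()

def to_index_sequence(sub_text: str) -> list[int]:
    words = segment(sub_text)[:FIXED_LEN]          # drop over-length tails
    words += [PAD] * (FIXED_LEN - len(words))      # pad short sentences at the end
    # Replace each word by its number; unmatched words become the unk index.
    return [word_to_index.get(w, word_to_index[UNK]) for w in words]

sequences = [to_index_sequence(s) for s in ["today weather good", "me today"]]
```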
In this embodiment, sentence vectors composed of word groups are formed by mapping and splitting the cleaned Chinese text. In one example, different word segmentation training sets can be selected for different industries to ensure segmentation accuracy; the sentences in the Chinese text are converted into the corresponding token numbers, and the length and format of the sentences are fixed and unified.
The mapping in this embodiment uses a preset mapping dictionary in which each word corresponds to a multidimensional vector; the dictionary may be trained in-house or be an open-source dictionary from the internet. The words in the mapping dictionary correspond to vectors of a fixed dimension (each word vector is typically 300-dimensional by default). There are also two special word vectors: one called unk (unknown word), corresponding to words not found in the dictionary, and one used for padding (pad), which makes up for insufficient sentence length. In general a sentence is segmented into words, so a sentence becomes a sequence of words; preferably the lengths are uniform, firstly because this makes computation convenient and secondly because over-long sentences can be removed.
In one example, take sentences with a fixed length of 20 words: each sentence consists of 20 words, and any shortfall is filled with pad tokens. A sentence token sequence is then constructed: pre-trained word vectors (for example 300-dimensional) serve as a large dictionary (for example 300,000 words), and one's own word-vector matrix is built through mapping (mapping the converted word matrix against the large dictionary yields a smaller dictionary of one's own, for example 100,000 words). Each word is looked up in the dictionary and corresponds to a 300-dimensional word vector, so each sentence forms a 20 x 300 matrix (i.e. the target sentence vector). The target sentence vector is then compressed column-wise by averaging or taking the maximum, i.e. combining the 20 word vectors into one, to obtain a 1 x 300-dimensional sentence vector; finally, the sentence vector is output in numpy format.
For example, take a sentence s1: "today the weather is good"; after word segmentation it becomes s1 = today / weather / good, i.e. three words. Step 1: perform data cleaning and then word segmentation, and complete the sequence with pad (padding) to obtain s1 = [today, weather, good, pad, pad, pad, ...], where the number of trailing pads makes the sentence 20 words long. Step 2: replace each word with its number by looking it up in the array (i.e. the preset word group) and taking out the corresponding number. For example, if the word array is [1: me, 2: day, 3: today, 4: good, ..., 100: pad], then s1 = [3, unknown, 4, 100, 100, 100, ...]; here "weather" is replaced by unknown (unk) because it has no corresponding entry. Then s1 is looked up in its own dictionary of 100,000 x 300 dimensions, so s1 forms a 20 x 300 sentence matrix; the 20 x 300 matrix is then compressed into a 1 x 300-dimensional sentence vector. Preferably the compression uses max (maximum) or average, i.e. taking the maximum or mean of each column; finally, the result is converted to numpy format for storage.
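The sentence-vector construction just described can be sketched with numpy as follows; the randomly initialised embedding matrix merely stands in for the pre-trained word-vector dictionary, and the max/average pooling matches the compression step above.

```python
import numpy as np

VOCAB, DIM, FIXED_LEN = 100_000, 300, 20

# Stand-in for the pre-trained word-vector dictionary (rows indexed by word number).
embedding = np.random.rand(VOCAB, DIM).astype(np.float32)

def sentence_vector(index_seq, mode="max"):
    """Map a fixed-length index sequence to a 1 x 300 sentence vector."""
    matrix = embedding[np.asarray(index_seq)]            # 20 x 300 sentence matrix
    return matrix.max(axis=0) if mode == "max" else matrix.mean(axis=0)

vec = sentence_vector([3, 0, 4] + [99] * 17)             # -> shape (300,)
```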
In this embodiment, the numpy package in Python (an object-oriented, dynamically typed language) is used to handle large amounts of data by converting them to numpy format (an open-source numerical computation extension of Python), i.e. an array, for example the 1 x 300-dimensional vector above. NumPy provides a variety of file-manipulation functions for accessing array content, and the file holding the array data may be in binary or text format.
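For instance, the resulting arrays can be saved either in NumPy's binary format or as plain text (file names here are illustrative):

```python
import numpy as np

vectors = np.random.rand(5, 300).astype(np.float32)  # five 1 x 300 sentence vectors
np.save("sentence_vectors.npy", vectors)              # binary format
np.savetxt("sentence_vectors.txt", vectors)           # plain-text format
loaded = np.load("sentence_vectors.npy")
```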
Referring to fig. 3, a mapping dictionary is constructed. In this embodiment, after the source text is converted into sentence vectors, the source text and the sentence vectors are in one-to-one correspondence; by building a mapping table, the sentence vectors and texts are converted into the corresponding data set, in which each sample point corresponds to one sentence vector. These pairs are then saved as CSV (Comma-Separated Values) records to form the mapping dictionary: the key is a sentence vector and the value is the source text, giving a key-value CSV lookup table. CSV is a plain-text format with small file size; it is easy to create, distribute and read, is suitable for storing structured information, and on the Windows platform a CSV file opens in Excel by default.
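A minimal sketch of such a key-value mapping table follows; serialising the sentence vector as a comma-joined string to serve as the key is an assumption made for illustration, since the description only specifies that the key is the sentence vector and the value is the source text.

```python
import csv
import numpy as np

def vector_key(vec: np.ndarray) -> str:
    # Serialize the sentence vector so it can serve as a CSV key (assumed format).
    return ",".join(f"{x:.6f}" for x in vec)

def save_mapping(pairs, path="mapping_dict.csv"):
    # pairs: iterable of (sentence_vector, source_text)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "value"])
        for vec, text in pairs:
            writer.writerow([vector_key(vec), text])

def load_mapping(path="mapping_dict.csv") -> dict:
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return {row["key"]: row["value"] for row in reader}
```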
In another optional embodiment of the present disclosure, dividing the sample points in the data set into a preset number of clusters according to the parallelized k-median model and determining the centroid of each cluster includes: step a1, inputting the data set into a K-median model and setting the initial value of the preset number to K, wherein K is a natural number greater than 1; step b1, selecting K first centroids from the data set, and determining the first K cluster classes to which each first sample point belongs according to the median distances between that first sample point and the first centroids, wherein the first sample points are the sample points in the data set other than the first centroids; step c1, calculating the median of all sample points in a first cluster and taking the sample point corresponding to that median as the second centroid of the first cluster, wherein the first cluster is any one of the first K cluster classes; step d1, calculating the distances from the second sample points, i.e. the sample points other than the K second centroids, to the second centroids, so as to determine the second K cluster classes to which the second sample points belong; and iteratively executing steps c1 and d1 until the Nth centroids determined in the Nth iteration are the same as the (N-1)th centroids determined in the (N-1)th iteration, whereupon all the target clusters and the corresponding centroid of each target cluster are determined.
According to the above embodiment, referring to fig. 3, after the sentence vectors formed above are mapped into the corresponding data set, the data set is input into the k-median model. To avoid excessive iterations over the data set and the resulting waste of time and resources, the initial value of k is preset; optionally, a starting k value and the increment of k for each round can be chosen according to the size of the data set (for example, the number of sentences) and a k-value comparison chart. The k-value comparison chart in this example is an empirical tuning range.
According to the initial value of K, K points are selected from the sample points of the data set as the centroids of the K clusters; the determined centroids are then optimized in the parallel MapReduce mode, and the cluster to which each sample point belongs is calculated. The Map function assigns each sample point to the nearest centre, and the Reduce function is responsible for updating the cluster centres. To reduce the network load, a combiner (merging) function is needed to partially merge the intermediate results of the same Map task that share the same key. This allows fast clustering. When the centroids of the last two iterations are unchanged, a stable state for this k value has been reached and is recorded in a log; the log records, for the centroids determined at that moment, the sentence vector corresponding to the sample point closest to each centroid and the source text corresponding to that sentence vector.
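The Map / combiner / Reduce division of labour described here can be sketched in plain Python as below. This is a single-process illustration of the data flow only; an actual deployment would distribute the map tasks across a MapReduce framework, and the use of the L1 (Manhattan) distance as the "median distance" is an assumption.

```python
import numpy as np
from collections import defaultdict

def map_assign(points, centroids):
    # Map: emit (nearest-centroid id, point) for every sample point,
    # using L1 distance as the median-style distance (assumption).
    for p in points:
        dists = [np.abs(p - c).sum() for c in centroids]
        yield int(np.argmin(dists)), p

def combine(pairs):
    # Combiner: partially merge results sharing the same key on one mapper,
    # reducing the amount of data sent over the network.
    partial = defaultdict(list)
    for key, p in pairs:
        partial[key].append(p)
    return partial

def reduce_update(partials):
    # Reduce: merge every partial list that shares a key and update the cluster
    # centre; following step c1, the new centroid is the member point closest
    # (in L1 distance) to the component-wise median of the cluster.
    merged = defaultdict(list)
    for partial in partials:
        for key, pts in partial.items():
            merged[key].extend(pts)
    new_centroids = {}
    for key, pts in merged.items():
        arr = np.stack(pts)
        med = np.median(arr, axis=0)
        new_centroids[key] = arr[np.abs(arr - med).sum(axis=1).argmin()]
    return new_centroids

# One clustering round over chunks of the data set:
# new_centroids = reduce_update([combine(map_assign(chunk, centroids)) for chunk in chunks])
```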
In this embodiment, parallel computation is used to calculate the distances from the different objects to the centre point of each cluster; once a round of calculation is complete, however, the next set of centroids is calculated iteratively using serial computation.
For example, suppose there are 1000 sample points in total and the initial preset value k is 3, i.e. the points are to be divided into 3 clusters. Three centroids (the first centroids) are then selected from sample points 1-1000: assume point 15 is chosen at random from points 1-1000 as the 1st centroid; the point farthest from point 15 is then chosen from points 1-14 and 16-1000 as the 2nd centroid, and the 3rd centroid is chosen in the same way. The distances from sample points 1-30 to point 15 are calculated and are all smaller than their distances to the 2nd and 3rd centroids, so it is determined that sample points 1-30 belong to the first cluster (one of the first K cluster classes). Processing the remaining points in the same way, suppose the cluster of the 2nd centroid contains points 31-60 and the cluster of the 3rd centroid contains points 61-1000; this is step b1 described above.
Then, for the 1st cluster (the first cluster), the median of points 1-30 in that cluster is calculated; if the calculated median corresponds to sample point 18, point 18 is taken as the new centroid (the second centroid) of the 1st cluster. The new centroids of the 2nd and 3rd clusters are calculated in the same way; this is step c1 above.
Parallel computation is then performed: based on the newly determined 3 centroids, the distances (median distances) from all sample points 1-1000 to the three centroids are calculated, and each point takes the mark (colour or label) of the centroid at the shortest distance, i.e. the sample point is assigned to the cluster class of that centroid, so that points 1-1000 are clustered again; this is step d1 described above. The iterations continue until the centroids no longer change. In this embodiment, computing distances with the median reduces the influence of noise.
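Putting steps a2-c2 and b1-d1 together, a serial sketch of the whole iteration might look as follows; again the L1 distance stands in for the median distance described above, which is an assumption, and the new centroid is taken as the member point closest to the cluster median, following step c1.

```python
import numpy as np

def l1(a, b):
    # L1 (Manhattan) distance, standing in for the "median distance" (assumption).
    return np.abs(a - b).sum(axis=-1)

def init_centroids(points, k, rng=None):
    # Steps a2-c2: pick one point at random, then repeatedly pick the point
    # farthest from the centroids already chosen.
    if rng is None:
        rng = np.random.default_rng(0)
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        dists = np.min([l1(points, c) for c in centroids], axis=0)
        centroids.append(points[int(dists.argmax())])
    return np.stack(centroids)

def k_median(points, k, max_iter=100):
    centroids = init_centroids(points, k)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Steps b1/d1: assign every point to its nearest centroid.
        dists = np.stack([l1(points, c) for c in centroids])      # shape (k, n)
        labels = dists.argmin(axis=0)
        # Step c1: the new centroid is the member point closest to the cluster median.
        new_centroids = []
        for j in range(k):
            members = points[labels == j]
            if len(members) == 0:                                  # keep empty clusters unchanged
                new_centroids.append(centroids[j])
                continue
            med = np.median(members, axis=0)
            new_centroids.append(members[l1(members, med).argmin()])
        new_centroids = np.stack(new_centroids)
        if np.array_equal(new_centroids, centroids):               # Nth centroids == (N-1)th centroids
            break
        centroids = new_centroids
    return centroids, labels

# Example: cluster 1000 random 300-dimensional sentence vectors into 3 clusters.
# centroids, labels = k_median(np.random.rand(1000, 300), k=3)
```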
In another optional embodiment, the method further includes: calculating the mean value P1 of the distances from the sample points in a target cluster to the centroid of that target cluster, calculating the median O of the centroids, and calculating the mean value P2 of the distances from the centroids to the median O; calculating the ratio of P1 to P2, denoted M; comparing M with a threshold; if M is smaller than the threshold, determining the current K as the preset number; otherwise, cyclically executing the clustering steps a1 to d1 above with an updated K.
In this embodiment, after the centroids and the cluster class of each sample point have been determined, whether the desired effect has been achieved is judged as follows: if the ratio of the intra-cluster distance (i.e. P1) to the inter-cluster distance (i.e. P2) is smaller than a threshold, the optimal clustering effect has been reached; otherwise the k value is determined again. Here the intra-cluster distance is based on the median distances from all points in a cluster to its centroid, and the inter-cluster distance is based on the median of the centroids. Once the k value has been fixed according to this judgement, the corresponding k and centroids are looked up in the log. For large batches of data it is not realistic to inspect the data manually; clustering the text data in this way divides disordered data into several classes.
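One reading of this check, under the same L1-distance assumption and with P1 averaged over all sample points, is sketched below:

```python
import numpy as np

def clustering_ratio(points, labels, centroids):
    """M = P1 / P2, under an L1-distance reading of the description (assumption)."""
    # P1: mean distance from every sample point to the centroid of its own cluster.
    p1 = np.abs(points - centroids[labels]).sum(axis=1).mean()
    # O: median of the centroids; P2: mean distance from the centroids to O.
    o = np.median(centroids, axis=0)
    p2 = np.abs(centroids - o).sum(axis=1).mean()
    return p1 / p2

# Accept the current K when M is below the chosen threshold; otherwise re-cluster with a new K:
# if clustering_ratio(points, labels, centroids) < THRESHOLD: ...
```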
Optionally, selecting K first centroids from the data set and determining the first K cluster classes to which each first sample point belongs according to the median distances between that sample point and the first centroids includes the following steps: step a2, randomly selecting one sample point from the data set as centroid K1; step b2, traversing the data set and selecting the sample point farthest from centroid K1 as centroid K2; step c2, iteratively executing the operation of step b2 until K centroids have been selected; step d2, calculating, for any sample point in the data set, the median distances to the K centroids; and step e2, selecting the target centroid corresponding to the minimum of those median distances and assigning the corresponding sample point to the cluster class of that target centroid.
According to the above embodiment, referring to fig. 3, an initial k value is set; a point is then randomly selected from all sentence vectors (i.e. the sample points in the data set), denoted point A, and taken as the first centroid; all points are traversed and the point farthest away is selected as the second centroid, denoted point B. By analogy, in the third traversal the point farthest from the first two is selected as point C, the third centroid, and so on until k centroids have been determined.
Optionally, determining the sample point closest to each centroid and outputting the corresponding sub-text according to that sample point includes: for a target cluster, calculating the median distances between all sample points in the target cluster and the target centroid of the target cluster; selecting the target sample point corresponding to the minimum of those median distances; determining the first sentence vector corresponding to the target sample point based on the mapping relation between the data set and the sentence vectors; and outputting the corresponding target sub-text based on the first sentence vector.
Referring to fig. 3, after the centroids and the cluster class of each sample point have been determined, the sentence closest to each centroid location is obtained. The log table is consulted with the determined centroid to find the nearest sample point, the sentence vector corresponding to that sample point is taken out and used as the key to look up the corresponding Chinese sentence in the mapping dictionary, and that Chinese sentence is then output. Finally, the data are exported to a database or a CSV lookup table; the sentences that represent each cluster (for example, key sentences) are then imported into the database or CSV together with the sentence values in that category.
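A sketch of this final step is given below; it reuses the key format assumed for the mapping table earlier, and the function name is illustrative.

```python
import numpy as np

def representative_sentences(points, labels, centroids, vector_to_text):
    """Return, per cluster, the source sentence whose vector lies closest to the centroid."""
    representatives = {}
    for j, c in enumerate(centroids):
        members = np.where(labels == j)[0]
        if len(members) == 0:
            continue
        dists = np.abs(points[members] - c).sum(axis=1)       # L1 distance (assumption)
        best = members[int(dists.argmin())]
        key = ",".join(f"{x:.6f}" for x in points[best])      # same key format as the mapping table
        representatives[int(j)] = vector_to_text.get(key)
    return representatives
```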
Through this embodiment, the text is preprocessed and converted into sentence vectors by the word segmentation system to form a mapping table between the sentence vectors and the original sentences; the result is then processed in the optimized, MapReduce-based parallel clustering system (the k-median model above); finally the clustered texts are exported to a database. In this way the Chinese text is first clustered so that a single representative item stands for the thousands of items behind it, which also assists tasks such as named entity recognition; the centroids are optimized using the MapReduce mode, computation time is saved, and the computational difficulty of k-means-style clustering is alleviated.
Example 2
In this embodiment a text processing apparatus is further provided, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a text processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes: a word segmentation module 40 configured to perform a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector; a construction module 42, connected to the word segmentation module 40, configured to construct a data set from the plurality of sentence vectors, wherein each sample point in the data set corresponds to one sentence vector; a first determining module 44, connected to the construction module 42, configured to divide the sample points in the data set into a preset number of clusters according to a parallelized k-median model and to determine the centroid of each cluster; and a second determining module 46, connected to the first determining module 44, configured to determine the sample point closest to each centroid and to output the corresponding sub-text according to that sample point.
Optionally, the word segmentation module 40 includes: a conversion unit configured to convert the target text into a plurality of fixed-length character string sequences, wherein each sub-text corresponds to one character string sequence; a mapping unit configured to map each character string sequence into a target sentence vector based on a preset dictionary, wherein the words in the preset dictionary are represented by word vectors; and a compression unit configured to compress each target sentence vector to obtain the plurality of sentence vectors, wherein each sentence vector represents an array of a preset dimension.
Optionally, the conversion unit is configured to: perform word segmentation on a first sub-text to obtain a word segmentation set, wherein the first sub-text is any one of the sub-texts in the target text, and if the length of the first sub-text is smaller than a preset length, pad the end of the word segmentation set with a preset word; match the word segmentation set against a preset word group, wherein the preset word group represents a mapping relation between words and sequence numbers; if a first word contained in the first sub-text is matched in the preset word group, replace the first word with the sequence number corresponding to the first word; if a second word contained in the first sub-text is not matched in the preset word group, replace the second word with a preset character string, wherein the first word and the second word are each any word in the first sub-text; and generate the plurality of character string sequences according to the matching results.
Optionally, the first determining module 44 is configured to perform the following steps: step a1, inputting the data set into a K-median model and setting the initial value of the preset number to K, wherein K is a natural number greater than 1; step b1, selecting K first centroids from the data set, and determining the first K cluster classes to which each first sample point belongs according to the median distances between that first sample point and the first centroids, wherein the first sample points are the sample points in the data set other than the first centroids; step c1, calculating the median of all sample points in a first cluster and taking the sample point corresponding to that median as the second centroid of the first cluster, wherein the first cluster is any one of the first K cluster classes; step d1, calculating the distances from the second sample points, i.e. the sample points other than the K second centroids, to the second centroids, so as to determine the second K cluster classes to which the second sample points belong; and iteratively executing steps c1 and d1 until the Nth centroids determined in the Nth iteration are the same as the (N-1)th centroids determined in the (N-1)th iteration, whereupon all the target clusters and the corresponding centroid of each target cluster are determined.
Optionally, the apparatus further comprises: a first calculating module configured to calculate the mean value P1 of the distances from the sample points in a target cluster to the centroid of that target cluster, to calculate the median O of the centroids, and to calculate the mean value P2 of the distances from the centroids to the median O; a second calculating module configured to calculate the ratio of P1 to P2, denoted M; a comparison module configured to compare M with a threshold; and a processing module configured to determine the current K as the preset number if M is smaller than the threshold, and otherwise to cyclically execute the clustering steps a1 to d1 above with an updated K.
Optionally, the first determining module 44 is further configured to perform the following steps: step a2, randomly selecting one sample point from the data set as centroid K1; step b2, traversing the data set and selecting the sample point farthest from centroid K1 as centroid K2; step c2, iteratively executing the operation of step b2 until K centroids have been selected; step d2, calculating, for any sample point in the data set, the median distances to the K centroids; and step e2, selecting the target centroid corresponding to the minimum of those median distances and assigning the corresponding sample point to the cluster class of that target centroid.
Optionally, the second determining module 46 includes: a calculating unit configured to calculate, for a target cluster, the median distances between all sample points in the target cluster and the target centroid of the target cluster; a selecting unit configured to select the target sample point corresponding to the minimum of those median distances; a determining unit configured to determine, based on the mapping relation between the data set and the sentence vectors, the first sentence vector corresponding to the target sample point; and an output unit configured to output the corresponding target sub-text based on the first sentence vector.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, performing a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector;
S2, constructing a data set from the sentence vectors, wherein each sample point in the data set corresponds to one sentence vector;
S3, dividing the sample points in the data set into a preset number of clusters according to a parallelized k-median model, and determining the centroid of each cluster;
and S4, determining the sample point closest to each centroid, and outputting the corresponding sub-text according to that sample point.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, performing a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector;
S2, constructing a data set from the sentence vectors, wherein each sample point in the data set corresponds to one sentence vector;
S3, dividing the sample points in the data set into a preset number of clusters according to a parallelized k-median model, and determining the centroid of each cluster;
and S4, determining the sample point closest to each centroid, and outputting the corresponding sub-text according to that sample point.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of text processing, comprising:
performing a word segmentation operation on a target text to form a plurality of sentence vectors, wherein the target text comprises a plurality of sub-texts and each sub-text corresponds to one sentence vector;
constructing a data set from the plurality of sentence vectors, wherein each sample point in the data set corresponds to one sentence vector;
dividing the sample points in the data set into a preset number of clusters according to a parallelized k-median model, and determining the centroid of each cluster;
and determining the sample point closest to each centroid, and outputting the corresponding sub-text according to that sample point.
2. The method of claim 1, wherein performing a word segmentation operation on the target text to form a plurality of sentence vectors comprises:
converting the target text into a plurality of fixed-length character string sequences, wherein each sub-text corresponds to one character string sequence;
mapping each character string sequence into a target sentence vector based on a preset dictionary, wherein the words in the preset dictionary are represented by word vectors;
and compressing each target sentence vector to obtain the plurality of sentence vectors, wherein each sentence vector represents an array of a preset dimension.
3. The method of claim 2, wherein converting the target text into a plurality of fixed-length sub-text character string sequences comprises:
performing word segmentation on a first sub-text to obtain a word segmentation set, wherein the first sub-text is any one of the sub-texts in the target text, and if the sub-text length of the first sub-text is smaller than a preset length, filling the tail end of the word segmentation set by using a preset word;
matching the word segmentation set with a preset word group, wherein the preset word group is used for representing a mapping relation between words and sequences;
if a first word contained in the first sub-text is matched in the preset phrase, replacing the first word with a sequence corresponding to the first word; if a second word contained in the first sub-text is not matched in the preset phrase, replacing the second word with a preset character string, wherein the first word and the second word are any word in the first sub-text;
and generating the plurality of character string sequences according to the matching result.
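A minimal sketch of the conversion described in claim 3, assuming a caller-supplied segmentation function; PAD_WORD, UNK_SEQ and the other names are placeholders, not terms from the application.

PAD_WORD = "<pad>"   # preset word used to pad short sub-texts (placeholder)
UNK_SEQ = "0"        # preset character string for unmatched words (placeholder)

def sub_text_to_sequence(sub_text, word_to_seq, fixed_len, segment):
    """Turn one sub-text into a fixed-length sequence via the word-to-sequence mapping."""
    words = segment(sub_text)[:fixed_len]              # word segmentation set, truncated to the preset length
    words += [PAD_WORD] * (fixed_len - len(words))     # pad the tail if the sub-text is short
    # matched words are replaced by their mapped sequence, unmatched ones by the preset string
    return [word_to_seq.get(w, UNK_SEQ) for w in words]

For Chinese text, segment could be as simple as character-level splitting (segment=list); any tokenizer with the same interface would fit this sketch.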
4. The method of claim 1, wherein dividing the sample points in the data set into a preset number of clusters according to the parallelized k-median model and determining the centroid of each cluster comprises:
step a1, inputting the data set into the k-median model, and setting the initial value of the preset number to K, wherein K is a natural number greater than 1;
step b1, selecting K first centroids from the data set, and determining the first cluster class, among K first cluster classes, to which each first sample point belongs according to the first sample point and the median of the first centroids, wherein the first sample points are the sample points in the data set other than the first centroids;
step c1, calculating the median of all sample points in a first cluster, and determining the sample point corresponding to the median as a second centroid of the first cluster, wherein the first cluster is any one of the first K cluster classes;
step d1, calculating the distances from the second sample points, namely the sample points in the data set other than the K second centroids, to the second centroids, so as to determine the second K cluster classes to which the second sample points belong;
and iteratively executing steps c1 and d1 until the Nth centroids determined at the Nth iteration are the same as the (N-1)th centroids determined at the (N-1)th iteration, thereby determining all the target clusters and the centroid corresponding to each target cluster.
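An illustrative, non-parallel sketch of steps a1-d1 (the application describes a parallelized model; the parallelization is omitted here). L1 distance and the coordinate-wise median are assumptions, the new centroid of each cluster is taken as the member closest to that median, and farthest_point_init is the initialization sketched after claim 6; the name k_median is hypothetical.

import numpy as np

def k_median(data, k, max_iter=100):
    """Steps a1-d1: assign each sample point to the nearest centroid, move each centroid
    to the cluster member closest to the cluster's coordinate-wise median, and repeat
    until the centroids no longer change."""
    centroids = farthest_point_init(data, k)                 # sketched after claim 6
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        dists = np.abs(data[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)                        # cluster class of each sample point
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) == 0:
                continue
            med = np.median(members, axis=0)                 # median of all points in the cluster
            # the sample point closest to the median becomes the new centroid
            new_centroids[j] = members[np.abs(members - med).sum(axis=1).argmin()]
        if np.array_equal(new_centroids, centroids):         # Nth centroids equal the (N-1)th
            break
        centroids = new_centroids
    return centroids, labels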
5. The method of claim 4, further comprising:
calculating a mean value P1 of the distances from the sample points in a target cluster to the centroid of the target cluster, calculating a median O of the respective centroids, and calculating a mean value P2 of the distances from the respective centroids to the median O;
calculating the ratio of P1 to P2, denoted as M;
comparing M with a threshold;
and if M is smaller than the threshold, determining K as the preset number; otherwise, cyclically executing the method steps of claim 4.
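A sketch of the selection criterion of claim 5. The threshold value and the strategy for the "otherwise" branch (here, simply trying the next K) are assumptions, since the claim only requires the steps of claim 4 to be executed again; k_median is the sketch given after claim 4, and the function names are hypothetical.

import numpy as np

def cluster_quality_ratio(data, centroids, labels):
    """M = P1 / P2: mean within-cluster distance over mean centroid-to-median-O distance."""
    p1 = np.abs(data - centroids[labels]).sum(axis=1).mean()   # distance of each point to its own centroid
    o = np.median(centroids, axis=0)                            # median O of the centroids
    p2 = np.abs(centroids - o).sum(axis=1).mean()
    return p1 / p2

def choose_k(data, threshold, k_start=2, k_max=50):
    """Increase K until M drops below the threshold (incrementing K is an assumption)."""
    for k in range(k_start, k_max + 1):
        centroids, labels = k_median(data, k)                   # sketched after claim 4
        if cluster_quality_ratio(data, centroids, labels) < threshold:
            return k, centroids, labels
    return k_max, centroids, labels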
6. The method according to claim 4, wherein selecting K first centroids from the data set and determining the first cluster class, among the K first cluster classes, to which each first sample point belongs according to the first sample point and the median of the first centroids, wherein the first sample points are the sample points in the data set other than the first centroids, comprises the following steps:
step a2, randomly selecting a sample point from the data set as a centroid K1;
step b2, traversing the data set, and selecting the sample point farthest from the centroid K1 as a centroid K2;
step c2, iteratively executing the operation of step b2 until K centroids are selected;
step d2, for each sample point in the data set, calculating a plurality of median values from the sample point to the K centroids;
and step e2, selecting the target centroid corresponding to the minimum value among the plurality of median values, and determining that the sample point belongs to the cluster class corresponding to the target centroid.
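A sketch of the initialization in steps a2-c2. The claim iterates step b2; the sketch reads this as the usual farthest-point heuristic, taking each new centroid as the sample point farthest from all centroids chosen so far, and assumes L1 distance. The name farthest_point_init is hypothetical; steps d2-e2 correspond to the assignment already shown in the k-median sketch after claim 4.

import numpy as np

def farthest_point_init(data, k, seed=0):
    """Steps a2-c2: one random centroid, then repeatedly the farthest remaining sample point."""
    rng = np.random.default_rng(seed)
    centroids = [data[rng.integers(len(data))]]          # step a2: random sample point as centroid K1
    while len(centroids) < k:
        # distance of every sample point to its nearest already-chosen centroid
        dists = np.min([np.abs(data - c).sum(axis=1) for c in centroids], axis=0)
        centroids.append(data[dists.argmax()])           # steps b2/c2: farthest point becomes the next centroid
    return np.asarray(centroids)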
7. The method of claim 1, wherein determining the sample point closest to each centroid and outputting the corresponding sub-text according to the sample point comprises:
for a target cluster, calculating a plurality of median values between all sample points in the target cluster and a target centroid of the target cluster;
selecting a target sample point corresponding to the minimum value among the plurality of median values;
determining a first statement vector corresponding to the target sample point based on a mapping relation between the data set and the statement vectors;
and outputting the corresponding target sub-text based on the first statement vector.
8. A text processing apparatus, comprising:
a word segmentation module configured to perform a word segmentation operation on a target text to form a plurality of statement vectors, wherein the target text comprises a plurality of sub-texts, and each sub-text corresponds to one statement vector;
a construction module configured to construct a data set of the plurality of statement vectors, wherein each sample point in the data set corresponds to one statement vector;
a first determining module configured to divide the sample points in the data set into a preset number of clusters according to a parallelized k-median model and determine the centroid of each cluster;
and a second determining module configured to determine the sample point closest to each centroid and output the corresponding sub-text according to the sample point.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010920127.0A 2020-09-04 2020-09-04 Text processing method and device, computer equipment and storage medium Active CN112231471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010920127.0A CN112231471B (en) 2020-09-04 2020-09-04 Text processing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112231471A true CN112231471A (en) 2021-01-15
CN112231471B CN112231471B (en) 2022-06-07

Family

ID=74115916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010920127.0A Active CN112231471B (en) 2020-09-04 2020-09-04 Text processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112231471B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005122510A (en) * 2003-10-17 2005-05-12 Nippon Telegr & Teleph Corp <Ntt> Topic structure extracting method and device and topic structure extracting program and computer-readable storage medium with topic structure extracting program recorded thereon
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN106469192A (en) * 2016-08-30 2017-03-01 北京奇艺世纪科技有限公司 A kind of determination method and device of text relevant
CN107908624A (en) * 2017-12-12 2018-04-13 太原理工大学 A kind of K medoids Text Clustering Methods based on all standing Granule Computing
CN110287312A (en) * 2019-05-10 2019-09-27 平安科技(深圳)有限公司 Calculation method, device, computer equipment and the computer storage medium of text similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cai Yue et al.: "Text Clustering Based on Improved DBSCAN Algorithm", Computer Engineering *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076984A (en) * 2021-03-29 2021-07-06 海南智晶科技有限公司 Cross-domain pedestrian re-identification method based on median clustering and global classification
CN113342930A (en) * 2021-05-24 2021-09-03 北京明略软件系统有限公司 String vector-based text representation method and device, electronic equipment and storage medium
CN113342930B (en) * 2021-05-24 2024-03-08 北京明略软件系统有限公司 Text representing method and device based on string vector, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112231471B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
WO2020140386A1 (en) Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium
CN111597209B (en) Database materialized view construction system, method and system creation method
CN112231471B (en) Text processing method and device, computer equipment and storage medium
Begleiter et al. On prediction using variable order Markov models
Ravi et al. Large scale distributed semi-supervised learning using streaming approximation
WO2016170422A1 (en) System and method for training a machine translation system
CN108921188B (en) Parallel CRF method based on Spark big data platform
CN110458187A (en) A kind of malicious code family clustering method and system
CN107273352B (en) Word embedding learning model based on Zolu function and training method
CN110728313B (en) Classification model training method and device for intention classification recognition
CN110795526A (en) Mathematical formula index creating method and system for retrieval system
CN108549696B (en) Time series data similarity query method based on memory calculation
Ding et al. A polynomial‐time approximation scheme for the maximal overlap of two independent Erdős–Rényi graphs
Hasemann et al. The wiselib tuplestore: a modular RDF database for the internet of things
CN113344074A (en) Model training method, device, equipment and storage medium
CN113408301A (en) Sample processing method, device, equipment and medium
Lifshits et al. Speeding up HMM decoding and training by exploiting sequence repetitions
Goto et al. Fast q-gram mining on SLP compressed strings
CN114691875A (en) Data classification and classification processing method and device
CN113641705A (en) Marketing disposal rule engine method based on calculation engine
Vitale et al. Space-efficient representation of truncated suffix trees, with applications to Markov order estimation
CN115168326A (en) Hadoop big data platform distributed energy data cleaning method and system
CN112287005A (en) Data processing method, device, server and medium
Wu et al. Research and improve on K-means algorithm based on hadoop
Ochoa-Alvarez Genomic data compression and processing: theory, models, algorithms, and experiments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant