WO2016180268A1

WO2016180268A1 - Text aggregate method and device

Info

Publication number: WO2016180268A1
Application number: PCT/CN2016/081090
Authority: WO
Inventors: 冯文镛
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2015-05-13
Filing date: 2016-05-05
Publication date: 2016-11-17
Also published as: CN106294350A; CN106294350B

Abstract

A text aggregate method and device. The method comprises: when a text feature set corresponding to a text to be aggregated is acquired, employing a determination method combining a locality sensitive hashing algorithm and a similarity check to perform a similarity analysis on the text to be aggregated so as to aggregate the text to be aggregated. Therefore, the present invention can address the problem of a low-accuracy and high-latency text aggregate result due to performing a short-text similarity analysis on the basis of a vector space model or a probabilistic model, thus accurately and quickly aggregating a short text.

Description

Text aggregation method and device

The present application claims priority to a Chinese patent application with a priority date of May 13, 2015.

Technical field

The present application relates to the field of Internet technologies, and in particular, to a text aggregation method and apparatus.

Background art

Traditional communication applications (such as text messages and emails) and newer Internet social applications (such as WeChat, Weibo, and forums) continuously generate large amounts of short text data, for example Chinese text whose length is not greater than a set length threshold (such as 150 to 200 characters, where an English word or a run of consecutive digits counts as one Chinese character). These texts contain a great deal of valuable information, and aggregating them can reveal potential hot topics or patterns in that information.

Specifically, text aggregation is a technique that groups a collection of texts under a given similarity measure so that texts close to one another fall into the same group. Text aggregation may include steps such as text feature extraction and text similarity analysis.

At present, the similarity analysis used to aggregate texts is mainly based on a vector space model or a probability model. In the vector space model, the characters or words in a text are used as features to represent the text, and the similarity between feature vectors measures the relatedness of texts. For texts that are too short, the feature vectors are too sparse, the computed similarity cannot meet the requirements of similarity analysis, and the final aggregation result is inaccurate. In the probability model, if the text is too short, most feature values are the result of probability smoothing and do not reflect the real data, so the aggregation result is again inaccurate and cannot satisfy users' needs. Furthermore, both kinds of conventional text similarity algorithm are computationally intensive, which makes it difficult to analyze in real time short text data whose volume commonly reaches tens of millions or even billions of items, so the aggregation effect is poor.

In other words, when short text data is aggregated at present, poor text similarity analysis leads to low aggregation accuracy and poor real-time performance. A new text aggregation method is therefore urgently needed to solve this problem.

Summary of the invention

The embodiments of the present application provide a text aggregation method and apparatus to solve the problem that existing text aggregation has low accuracy and poor real-time performance because of poor text similarity analysis.

The embodiment of the present application provides a text aggregation method, including:

Performing feature extraction on a to-be-aggregated text whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the to-be-aggregated text;

Calculating a hash value of the first text feature set based on a set locality-sensitive hash algorithm, and determining, according to the calculated hash value, whether a constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance;

If so, selecting, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculating the similarity between the first text feature set and a second text feature set corresponding to that minimum-distance matching value;

And if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregating the to-be-aggregated text into the text class corresponding to the second text feature set.

Correspondingly, the embodiment of the present application further provides a text aggregation apparatus, including:

a feature extraction unit, configured to perform feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain a first text feature set corresponding to the to-be-aggregated text;

a text aggregation unit, configured to: calculate a hash value of the first text feature set based on a set locality-sensitive hash algorithm; determine, according to the calculated hash value, whether a constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and a second text feature set corresponding to that minimum-distance matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the to-be-aggregated text into the text class corresponding to the second text feature set.

The beneficial effects of the application are as follows:

The embodiments of the present application provide a text aggregation method and apparatus. In the technical solution of the embodiments, after the text feature set corresponding to the text to be aggregated is obtained, similarity analysis is performed on the text using a determination method that combines a locality-sensitive hash algorithm with a similarity check, so as to aggregate the text. This solves the problem of low accuracy and poor real-time performance of aggregation results when short-text similarity analysis is based on a vector space model or a probability model, and achieves accurate and fast aggregation of short texts.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart diagram of a text aggregation method according to Embodiment 1 of the present application;

FIG. 2 is a schematic structural diagram of a text aggregation apparatus according to Embodiment 2 of the present application.

Detailed description

The present application is described in further detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

Embodiment 1:

Embodiment 1 of the present application provides a text aggregation method. FIG. 1 is a schematic flowchart of the method, which may include the following steps:

Step 101: Perform feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain a first text feature set corresponding to the to-be-aggregated text.

Optionally, the text to be aggregated may be Chinese text data whose length is not greater than a set length threshold (for example, 150 to 200 characters, where an English word or a run of consecutive digits counts as one Chinese character). This is not elaborated further in the embodiments of the present application.

Further, a large amount of short text data on the Internet is characterized by irregular wording and many variant forms. When features are extracted with a traditional word segmentation method (for example, segmenting the text with a general-purpose word segmenter and using the resulting words as the features of the text), good feature extraction results may not be obtained, and the resulting text aggregation is inaccurate.

Therefore, to improve text feature extraction, in the embodiments of the present application feature extraction may be performed on the to-be-aggregated text whose length is not greater than the set length threshold in the following manner, to obtain the first text feature set corresponding to the to-be-aggregated text:

Feature extraction is performed on the to-be-aggregated text whose length is not greater than the set length threshold using a feature extraction method that combines mechanical segmentation with an N-gram model, to obtain the first text feature set corresponding to the to-be-aggregated text, where N is a natural number greater than 1.

It should be noted that, compared with extracting features from short text data by traditional word segmentation, the feature extraction method that combines mechanical segmentation with the N-gram model achieves better text feature extraction. Mechanical segmentation ignores semantics and splits the text mechanically, while the N-gram model establishes dependencies between otherwise isolated features. Together they yield a larger feature set with richer information, which compensates well for the small amount of information in a short text. The method therefore works well for feature extraction from irregular short texts and in turn improves the accuracy of text aggregation.

Optionally, performing feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold using the feature extraction method that combines mechanical segmentation with the N-gram model, to obtain the text feature set corresponding to the to-be-aggregated text, may include:

Segmenting the to-be-aggregated text with each Chinese character and each continuous string (such as a continuous Latin string, a continuous numeric string, or a continuous mixed Latin and numeric string) as the minimum segmentation unit, to obtain a plurality of segments. For example, if the text to be aggregated is "My birthday is 1989-01-22" (in Chinese, 我的生日是1989-01-22), it may be segmented as "我/的/生/日/是/1989-01-22": each Chinese character is one segment and the continuous string "1989-01-22" is one segment.

Based on the N-gram model, combining every N consecutive segments of the obtained segments into one text feature, to obtain the text feature set corresponding to the to-be-aggregated text. For example, with N equal to 2 (that is, the N-gram model is a bigram model) and the same text "My birthday is 1989-01-22", the resulting text feature set can be expressed as {我的, 的生, 生日, 日是, 是1989-01-22}, which the English rendering shows as {my, birthday, birthday, day, is 1989-01-22}.
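The two sub-steps above (mechanical segmentation, then N-gram combination) can be sketched in Python as follows. This is only an illustrative sketch and not the applicant's implementation. The function names `segment` and `ngram_features` are hypothetical, and the regular expression encodes an assumption: a "continuous string" is a run of ASCII letters and digits, optionally joined by `-`, `.` or `/`, so that a date such as 1989-01-22 stays one unit.

```python
import re

# Assumption: a "continuous string" is a run of ASCII letters/digits, optionally
# joined by '-', '.' or '/'. Every CJK character is its own segment.
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-./][A-Za-z0-9]+)*|[\u4e00-\u9fff]")


def segment(text):
    """Mechanical segmentation: one segment per Chinese character or continuous string."""
    return _TOKEN.findall(text)


def ngram_features(text, n=2):
    """Combine every n consecutive segments into one feature and return the feature set."""
    tokens = segment(text)
    if len(tokens) < n:
        # Hypothetical fallback for inputs shorter than n segments.
        return {"".join(tokens)} if tokens else set()
    return {"".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


if __name__ == "__main__":
    print(segment("我的生日是1989-01-22"))
    print(sorted(ngram_features("我的生日是1989-01-22", n=2)))
```

With N = 2 the sketch yields five features for the birthday example, one per pair of adjacent segments.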

Further, to improve text quality and the accuracy of text aggregation, the method may further include the following step before feature extraction is performed on the to-be-aggregated text whose length is not greater than the set length threshold:

Pre-processing the text to be aggregated, so that text feature extraction is performed on the pre-processed text. The pre-processing may include one or more of the following operations, which the embodiments of the present application do not limit:

Removing special tags (such as HTML tags) from the text to be aggregated; removing non-text special symbols (such as & and *) from the text to be aggregated; converting traditional Chinese characters in the text to be aggregated into simplified characters; and normalizing continuous runs of Latin letters and/or digits in the text to be aggregated into a set string (for example, normalizing "Abc1234" or "1989-01-22" to "xxxxxxx").
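A minimal Python sketch of these pre-processing operations is given below, under several assumptions that are not stated in the source: tags are matched with a simple regular expression, the "set string" is the fixed placeholder `xxxxxxx` regardless of the length of the run it replaces, and traditional-to-simplified conversion is delegated to an optional caller-supplied function (`to_simplified`, a hypothetical parameter) because the standard library has no such converter.

```python
import html
import re

PLACEHOLDER = "xxxxxxx"  # assumption: one fixed "set string" for every Latin/digit run

_TAG = re.compile(r"<[^>]+>")                                # HTML-style tags
_ALNUM_RUN = re.compile(r"[A-Za-z0-9]+(?:[-./][A-Za-z0-9]+)*")  # Latin/digit runs
_SYMBOL = re.compile(r"[^\w\s]")                             # non-text special symbols


def preprocess(text, to_simplified=None):
    """Clean a short text before feature extraction."""
    text = _TAG.sub(" ", text)        # remove special tags
    text = html.unescape(text)        # decode entities such as &amp;
    if to_simplified is not None:     # optional traditional-to-simplified hook
        text = to_simplified(text)
    # Normalize runs first, so that "1989-01-22" is replaced as a single unit
    # before its hyphens are stripped as symbols.
    text = _ALNUM_RUN.sub(PLACEHOLDER, text)
    text = _SYMBOL.sub(" ", text)     # remove remaining special symbols
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(preprocess("<b>我的生日是1989-01-22</b> &amp; *Abc1234*"))
```

Normalizing every Latin or digit run to the same placeholder makes texts that differ only in such details (dates, order numbers, and the like) produce identical features.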

Step 102: Calculate a hash value of the first text feature set based on a set locality-sensitive hash algorithm, and determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance.

Specifically, the set locality-sensitive hash algorithm may be, but is not limited to, the Simhash algorithm or the Minhash algorithm. The Simhash algorithm is commonly used for deduplicating web pages: it generates a digital signature from the content of a web page and then judges how similar two pages are by calculating the difference between their signatures. Like Simhash, the Minhash algorithm is a locality-sensitive hash algorithm that can be used to quickly estimate the similarity of two sets. It was originally used to detect duplicate web pages in search engines and can also be applied to problems such as large-scale clustering. These algorithms are not described in further detail in the embodiments of the present application.
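To illustrate how Minhash estimates the similarity of two sets, the following Python sketch builds a signature from the minimum hash value of a feature set under several seeded hash functions. It is a generic textbook-style sketch, not taken from the application: the seeded MD5 hashing, the signature length `num_perm=64`, and the names `minhash_signature` and `estimate_jaccard` are all assumptions chosen for illustration.

```python
import hashlib


def _seeded_hash(feature, seed):
    """64-bit hash of a feature under a given seed (seeded MD5 is an assumption)."""
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(features, num_perm=64):
    """For each seeded hash function, keep the minimum hash over the feature set."""
    if not features:
        raise ValueError("feature set must not be empty")
    return [min(_seeded_hash(f, seed) for f in features) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = {"我的", "的生", "生日", "日是"}
    b = {"的生", "生日", "日是", "是我"}
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

A longer signature gives a tighter estimate at the cost of more hashing per text.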

Preferably, because the Simhash algorithm is faster, in the embodiments of the present application the Simhash algorithm may be used preferentially to calculate the hash value of the first text feature set. Correspondingly, taking the Simhash algorithm as the set locality-sensitive hash algorithm, step 102 may be performed as follows: calculating a Simhash value of the first text feature set based on the Simhash algorithm, and determining, according to the calculated Simhash value, whether the constructed Simhash index contains a matching value whose distance from the calculated Simhash value (specifically, the Hamming distance) is not greater than the set distance.

The set distance can be chosen flexibly according to the actual situation. For example, the Hamming distance threshold may be set to a value from 3 to 5. In information theory, the Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ, that is, the number of characters that must be replaced to transform one string into the other. This is not described in further detail in the embodiments of the present application.
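As a small illustration of this definition, the Python sketch below computes the Hamming distance both for equal-length strings and for integer hash values (where it is the number of differing bits), together with the threshold test used in step 102. The function names and the default threshold of 3 are illustrative choices.

```python
def hamming_str(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def hamming_int(a, b):
    """Number of differing bits between two non-negative integer hash values."""
    return bin(a ^ b).count("1")


def within_distance(a, b, max_distance=3):
    """Step 102 test: is the stored value close enough to the calculated one?"""
    return hamming_int(a, b) <= max_distance


if __name__ == "__main__":
    print(hamming_str("karolin", "kathrin"))  # 3
    print(hamming_int(0b1011, 0b1001))        # 1
```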

Step 103: If it is determined that the constructed hash index corresponding to the set locality-sensitive hash algorithm contains matching values whose distance from the calculated hash value is not greater than the set distance, select, from those matching values, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and the second text feature set corresponding to that minimum-distance matching value.

Optionally, the similarity between the first text feature set and the second text feature set may be represented by one or more of the following similarity measures: Jaccard similarity, Euclidean distance, Hamming distance, and so on. That is, when calculating the similarity between the first text feature set and the second text feature set corresponding to the minimum-distance matching value, the Jaccard similarity, the Euclidean distance, or the Hamming distance between the two feature sets may be calculated. This is not described in further detail in the embodiments of the present application.
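Of these measures, Jaccard similarity is the one the embodiment later pairs with Simhash. It is the size of the intersection of two sets divided by the size of their union, which the following Python sketch computes. Returning 1.0 for two empty sets is a convention assumed here, not something the source specifies.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    first = {"我的", "的生", "生日"}
    second = {"生日", "日是"}
    print(jaccard(first, second))  # one shared feature out of four distinct ones
```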

Step 104: If the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the to-be-aggregated text into the text class corresponding to the second text feature set.

The set similarity threshold can be chosen flexibly according to the actual situation. For example, when high aggregation accuracy is required, the similarity threshold may be set to a relatively high value, and when the accuracy requirement is lower, it may be set to a relatively low value. This is not described in further detail in the embodiments of the present application.

It should be noted that, in the embodiments of the present application, the similarity between the first text feature set and the second text feature set is checked mainly to eliminate the misjudgments that the collision probability of the locality-sensitive hash algorithm causes when the algorithm is applied to aggregating short text data, thereby improving the accuracy of text aggregation.

For example, after the Simhash algorithm is used to calculate the hash value of the first text feature set and the corresponding matching value is selected, the similarity (such as the Jaccard similarity) between the first text feature set and the second text feature set corresponding to the selected matching value is further checked, to eliminate misjudgments caused by Simhash collisions.

It should be noted that Jaccard similarity is one of the most common methods of measuring the similarity of two sets and is also suitable for measuring the similarity of short texts, but because its computational cost is too high it cannot be applied directly to aggregating large volumes of text. A Jaccard similarity check, however, resolves the collision problem of the Simhash algorithm and eliminates the misjudgments caused by Simhash collisions. Therefore, when the Simhash algorithm is combined with a Jaccard similarity check to analyze the similarity of the text to be aggregated, short texts can be aggregated both accurately and quickly.

Further, in the embodiment described in this application, the method may further include the following steps:

If it is determined that the constructed hash index corresponding to the set locality-sensitive hash algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance, or if it is determined that the index does contain such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold, then the calculated hash value is added to the constructed hash index corresponding to the set locality-sensitive hash algorithm, a new text class is created based on the to-be-aggregated text, and the to-be-aggregated text is assigned to the newly created text class.

In other words, if it is determined that the to-be-aggregated text does not belong to any existing text class, the hash value corresponding to the to-be-aggregated text may be added to the corresponding hash index and the to-be-aggregated text assigned to a newly created text class. This is not described in further detail in the embodiments of the present application.
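Steps 102 to 104 and the new-class branch above can be put together in a short Python sketch. It is a simplified illustration under stated assumptions, not the applicant's implementation: the hash index is a plain list scanned linearly (a production index would use a faster lookup structure), Simhash uses unweighted features hashed with truncated MD5, the similarity check is Jaccard, and the class name `TextAggregator`, its parameters, and the default thresholds are hypothetical.

```python
import hashlib


def simhash(features, bits=64):
    """Unweighted Simhash over a feature set (truncated MD5 per feature)."""
    counts = [0] * bits
    for feature in features:
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a, b):
    """Number of differing bits between two hash values."""
    return bin(a ^ b).count("1")


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets, by convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class TextAggregator:
    """Linear-scan sketch of the index lookup, similarity check and class update."""

    def __init__(self, max_distance=3, min_similarity=0.5):
        self.max_distance = max_distance      # the "set distance"
        self.min_similarity = min_similarity  # the "set similarity threshold"
        self.entries = []                     # hash index: (hash, features, class id)
        self.classes = []                     # class id -> list of texts

    def add(self, text, features):
        features = frozenset(features)
        value = simhash(features)
        # Step 102: collect matching values within the set distance.
        matches = [(hamming(value, h), i) for i, (h, _, _) in enumerate(self.entries)]
        matches = [m for m in matches if m[0] <= self.max_distance]
        if matches:
            # Step 103: take the closest match and check the real similarity.
            _, best = min(matches)
            _, best_features, class_id = self.entries[best]
            if jaccard(features, best_features) >= self.min_similarity:
                # Step 104: aggregate into the existing text class.
                self.classes[class_id].append(text)
                return class_id
        # No match, or the check failed: extend the index and open a new class.
        class_id = len(self.classes)
        self.classes.append([text])
        self.entries.append((value, features, class_id))
        return class_id


if __name__ == "__main__":
    agg = TextAggregator()
    print(agg.add("t1", {"ab", "bc", "cd"}))
    print(agg.add("t2", {"ab", "bc", "cd"}))
    print(agg.add("t3", {"xx", "yy", "zz"}))
```

The cheap Hamming test prunes almost all candidates, so the more expensive set comparison runs at most once per incoming text.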

Further, it should be noted that the solution described in the embodiments of the present application is not limited to any particular language, software, or hardware. However, to improve the efficiency of text aggregation, implementation in a high-performance programming language (such as C++ or Java) is preferred. This is not described in further detail in the embodiments of the present application.

Embodiment 1 of the present application provides a text aggregation method. In its technical solution, feature extraction is performed on a to-be-aggregated text whose length is not greater than a set length threshold, and after the text feature set corresponding to the text is obtained, similarity analysis is performed on the text using a determination method that combines a locality-sensitive hash algorithm with a similarity check, so as to aggregate the text. This solves the problem of low accuracy and poor real-time performance of aggregation results when short-text similarity analysis is based on a vector space model or a probability model, and achieves accurate and fast aggregation of short texts, for example real-time aggregation of high-volume short-text streams (such as more than 10,000 items per second) in support of real-time analysis of data streams.

Embodiment 2:

Based on the same inventive concept, Embodiment 2 of the present application provides a text aggregation apparatus. For the specific implementation of the apparatus, refer to the related description in method Embodiment 1; repeated details are omitted. As shown in FIG. 2, the text aggregation apparatus may mainly include:

The feature extraction unit 21 is configured to perform feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain a first text feature set corresponding to the to-be-aggregated text;

The text aggregation unit 22 is configured to: calculate a hash value of the first text feature set based on a set locality-sensitive hash algorithm; determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and the second text feature set corresponding to that minimum-distance matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the to-be-aggregated text into the text class corresponding to the second text feature set.

The set locality-sensitive hash algorithm may be, but is not limited to, the Simhash algorithm or the Minhash algorithm. The similarity between the first text feature set and the second text feature set may be represented by one or more of the following similarity measures: Jaccard similarity, Euclidean distance, Hamming distance, and so on.

Further, the text aggregation unit 22 may be further configured to: if it is determined that the constructed hash index corresponding to the set locality-sensitive hash algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance, or if it is determined that the index does contain such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold, add the calculated hash value to the constructed hash index corresponding to the set locality-sensitive hash algorithm, create a new text class based on the to-be-aggregated text, and assign the to-be-aggregated text to the newly created text class.

Further, to improve text feature extraction, in the embodiments of the present application the feature extraction unit 21 may be specifically configured to perform feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold using a feature extraction method that combines mechanical segmentation with an N-gram model, to obtain the first text feature set corresponding to the to-be-aggregated text, where N is a natural number greater than 1.

Optionally, the feature extraction unit 21 is specifically configured to segment the to-be-aggregated text with each Chinese character and each continuous string as the minimum segmentation unit, to obtain a plurality of segments, and, based on the N-gram model, to combine every N consecutive segments of the obtained segments into one text feature, to obtain the text feature set corresponding to the to-be-aggregated text.

Further, the apparatus may further include a pre-processing unit 23:

The pre-processing unit 23 may be configured to pre-process the to-be-aggregated text before feature extraction is performed on the to-be-aggregated text whose length is not greater than the set length threshold. The pre-processing may include one or more of the following: removing special tags from the text to be aggregated, removing non-text special symbols from the text to be aggregated, converting traditional Chinese characters in the text to be aggregated into simplified characters, and normalizing continuous runs of Latin letters and/or digits in the text to be aggregated into a set string.

Those skilled in the art will appreciate that embodiments of the present application can be provided as a method, apparatus (device), or computer program product. Thus, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment in combination of software and hardware. Moreover, the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, apparatus (device), and computer program product according to the embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although the preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they grasp the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from its spirit and scope. Thus, the present application is intended to cover such modifications and variations.

Claims

A text aggregation method, comprising:

Performing feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain a first text feature set corresponding to the to-be-aggregated text;

Calculating a hash value of the first text feature set based on a set locality-sensitive hash algorithm, and determining, according to the calculated hash value, whether a constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance;

If so, selecting, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculating the similarity between the first text feature set and a second text feature set corresponding to that minimum-distance matching value;

And if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregating the to-be-aggregated text into the text class corresponding to the second text feature set.
The method of claim 1 wherein the method further comprises:

If it is determined that the constructed hash index corresponding to the set locality-sensitive hash algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance, or if it is determined that the index does contain such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold, then

Adding the calculated hash value to the constructed hash index corresponding to the set locality-sensitive hash algorithm, creating a new text class based on the to-be-aggregated text, and assigning the to-be-aggregated text to the newly created text class.
The method according to claim 1 or 2, wherein performing feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain the first text feature set corresponding to the to-be-aggregated text, comprises:

performing, using a feature extraction method based on mechanical word segmentation and an N-gram model, feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain the first text feature set corresponding to the to-be-aggregated text, where N is a natural number greater than 1.
The method according to claim 3, wherein performing, using the feature extraction method based on mechanical word segmentation and the N-gram model, feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain the first text feature set corresponding to the to-be-aggregated text, comprises:

segmenting the to-be-aggregated text, using a Chinese character and a continuous character string as the minimum segmentation units, to obtain a plurality of word segments;

combining, based on the N-gram model, every N consecutive word segments among the obtained plurality of word segments into one text feature, to obtain the first text feature set corresponding to the to-be-aggregated text.
The method according to claim 1 or 2, wherein the set locality-sensitive hash algorithm includes, but is not limited to, a Simhash algorithm or a Minhash algorithm.
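As an illustration of the Minhash alternative named above, a minimal sketch follows. It assumes string features, MD5-based seeded hash functions, and a distance defined as the number of signature positions that differ. The names `minhash` and `signature_distance` and the default of 16 hash functions are hypothetical, and the claim does not specify any of these details.

```python
import hashlib


def minhash(features, num_hashes=16):
    """Minhash signature of a non-empty set of string features."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{feature}".encode("utf-8")).digest()[:8],
                "big",
            )
            for feature in features
        ))
    return tuple(signature)


def signature_distance(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))
```

Two feature sets with a high Jaccard similarity tend to agree at most signature positions, so a small `signature_distance` can play the role of the set distance in the index lookup.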
The method according to claim 1 or 2, wherein the similarity between the first text feature set and the second text feature set is represented by at least one or more of the following similarity metric parameters: Jaccard similarity, Euclidean distance, and Hamming distance.
The method of claim 1 or 2, wherein before performing feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, the method further comprises:

pre-processing the to-be-aggregated text, wherein the pre-processing includes at least one or more of: removing special tags from the to-be-aggregated text, removing non-text special symbols from the to-be-aggregated text, performing simplified/traditional Chinese character conversion on the to-be-aggregated text, and normalizing consecutive Latin characters and/or digits in the to-be-aggregated text to a set character string.
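The pre-processing operations listed above can be sketched as one Python function. Everything specific here is an assumption for illustration: "special tags" are treated as HTML-like tags, the conversion direction is taken to be traditional to simplified, the five-character conversion table is a toy stand-in for a full mapping, and the placeholder strings `ENG` and `NUM` are hypothetical choices for the "set character string".

```python
import re

# Toy traditional-to-simplified table; a real system would use a full mapping.
_T2S = str.maketrans("優請電訪問", "优请电访问")


def preprocess(text):
    """Apply the four pre-processing steps in a fixed (assumed) order."""
    # 1. Remove HTML-like tags.
    text = re.sub(r"<[^>]+>", "", text)
    # 2. Convert traditional characters to simplified ones.
    text = text.translate(_T2S)
    # 3. Replace non-text symbols with a space.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9\s]+", " ", text)
    # 4. Normalize runs of Latin letters or digits to placeholder strings.
    text = re.sub(
        r"[A-Za-z]+|[0-9]+",
        lambda m: "NUM" if m.group()[0].isdigit() else "ENG",
        text,
    )
    # Collapse the whitespace left behind by the earlier steps.
    return " ".join(text.split())
```

Normalizing letter and digit runs means that two messages differing only in, say, a phone number produce the same pre-processed text, which makes them more likely to receive nearby hash values.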
A text aggregation device, comprising:

a feature extraction unit, configured to perform feature extraction on a to-be-aggregated text whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the to-be-aggregated text;

a text aggregating unit, configured to: calculate a hash value of the first text feature set based on a set locality-sensitive hash algorithm, and determine, according to the calculated hash value, whether a constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate a similarity between the first text feature set and a second text feature set corresponding to the selected matching value; and if it is determined that the similarity between the first text feature set and the second text feature set is not less than a set similarity threshold, aggregate the to-be-aggregated text into the text class corresponding to the second text feature set.
The device of claim 8, wherein:

the text aggregating unit is further configured to: if it is determined that the constructed hash index corresponding to the set locality-sensitive hash algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance; or if it is determined that the constructed hash index corresponding to the set locality-sensitive hash algorithm contains a matching value whose distance from the calculated hash value is not greater than the set distance, and that the similarity between the first text feature set and the second text feature set is less than the set similarity threshold,

update the calculated hash value into the constructed hash index corresponding to the set locality-sensitive hash algorithm, create a new text class based on the to-be-aggregated text, and assign the to-be-aggregated text to the newly created text class.
The device according to claim 8 or 9, wherein:

the feature extraction unit is specifically configured to perform, using a feature extraction method based on mechanical word segmentation and an N-gram model, feature extraction on the to-be-aggregated text whose length is not greater than the set length threshold, to obtain the first text feature set corresponding to the to-be-aggregated text, where N is a natural number greater than 1.
The device of claim 10, wherein:

the feature extraction unit is specifically configured to: segment the to-be-aggregated text, using a Chinese character and a continuous character string as the minimum segmentation units, to obtain a plurality of word segments; and combine, based on the N-gram model, every N consecutive word segments among the obtained plurality of word segments into one text feature, to obtain the first text feature set corresponding to the to-be-aggregated text.
The device according to claim 8 or 9, wherein the set locality-sensitive hash algorithm includes, but is not limited to, a Simhash algorithm or a Minhash algorithm.
The device according to claim 8 or 9, wherein the similarity between the first text feature set and the second text feature set is represented by at least one or more of the following similarity metric parameters: Jaccard similarity, Euclidean distance, and Hamming distance.
The device according to claim 8 or 9, wherein the device further comprises a pre-processing unit:

the pre-processing unit is configured to pre-process the to-be-aggregated text before feature extraction is performed on the to-be-aggregated text whose length is not greater than the set length threshold;

wherein the pre-processing includes at least one or more of: removing special tags from the to-be-aggregated text, removing non-text special symbols from the to-be-aggregated text, performing simplified/traditional Chinese character conversion on the to-be-aggregated text, and normalizing consecutive Latin characters and/or digits in the to-be-aggregated text to a set character string.