WO2016180268A1 - Text aggregate method and device - Google Patents

Text aggregate method and device

Info

Publication number
WO2016180268A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
aggregated
similarity
feature
distance
Prior art date
Application number
PCT/CN2016/081090
Other languages
French (fr)
Chinese (zh)
Inventor
冯文镛
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2016180268A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • The present application relates to the field of Internet technologies, and in particular to a text aggregation method and apparatus.
  • Text aggregation is a technique for grouping a collection of texts under a given similarity measure so that texts that are close to each other fall into the same group.
  • Text aggregation may specifically include steps such as text feature extraction and text similarity analysis.
  • At present, similarity analysis of texts for the purpose of aggregation is mainly based on the vector space model or the probability model.
  • In the vector space model, the characters or words in a text are used as features to represent the text, and the similarity between feature vectors is used to measure the relevance of texts. For texts that are too short, the feature vectors are therefore too sparse, the computed result cannot meet the requirements of similarity analysis, and the final text aggregation result is inaccurate.
  • In the probability model, if the text is too short, most of the features are the result of probability smoothing and cannot reflect the information in the real data.
  • The embodiments of the present application provide a text aggregation method and apparatus to solve the problem that current text aggregation approaches have low accuracy and poor real-time performance because of their poor text similarity analysis.
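  • An embodiment of the present application provides a text aggregation method, including:
  • performing feature extraction on a text to be aggregated whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
  • calculating a hash value of the first text feature set based on a set locality-sensitive hashing algorithm, and determining, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance;
  • if so, selecting, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculating the similarity between the first text feature set and the second text feature set corresponding to that matching value;
  • if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregating the text to be aggregated into the text class corresponding to the second text feature set.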
  • Correspondingly, an embodiment of the present application further provides a text aggregation apparatus, including:
  • a feature extraction unit, configured to perform feature extraction on a text to be aggregated whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
  • a text aggregation unit, configured to calculate a hash value of the first text feature set based on a set locality-sensitive hashing algorithm, and determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and the second text feature set corresponding to that matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
  • The embodiments of the present application provide a text aggregation method and apparatus. In the technical solution described in the embodiments, after the text feature set corresponding to the text to be aggregated is obtained, a determination method that combines a locality-sensitive hashing algorithm with a similarity check is used to perform similarity analysis on the text to be aggregated and thereby aggregate it. This solves the problem of low accuracy and poor real-time performance of text aggregation results when short-text similarity analysis is based on the vector space model or the probability model, and achieves accurate and fast aggregation of short texts.
  • FIG. 1 is a schematic flowchart diagram of a text aggregation method according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic structural diagram of a text aggregation apparatus according to Embodiment 2 of the present application.
  • Embodiment 1:
  • Embodiment 1 of the present application provides a text aggregation method. FIG. 1 is a schematic flowchart of the text aggregation method in Embodiment 1.
  • The text aggregation method may include the following steps:
  • Step 101: Perform feature extraction on the text to be aggregated whose length is not greater than the set length threshold, to obtain a first text feature set corresponding to the text to be aggregated.
  • Optionally, the text to be aggregated may be Chinese text data whose length is not greater than a set length threshold (for example, 150 to 200 characters, where an English word or a run of consecutive digits counts as one Chinese character); this is not described in further detail in the embodiments of the present application.
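  • Optionally, feature extraction may be performed on the text to be aggregated whose length is not greater than the set length threshold in the following manner, to obtain the first text feature set corresponding to the text to be aggregated: using a feature extraction method that combines mechanical word segmentation with an N-gram model, where N is a natural number greater than 1.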
  • A feature extraction method that combines mechanical word segmentation with an N-gram model can achieve a better text feature extraction result. Mechanical word segmentation ignores semantics and splits the text mechanically, while the N-gram model establishes a degree of dependency between the otherwise isolated features. Together they provide a larger feature set and enrich the information it carries, which compensates well for the small amount of information in a short text. The method can therefore achieve good results when extracting features from non-standard short texts, and thus improves the accuracy of text aggregation.
  • Performing feature extraction on the text to be aggregated whose length is not greater than the set length threshold, using the feature extraction method that combines mechanical word segmentation with the N-gram model, to obtain the text feature set corresponding to the text to be aggregated, may include:
  • segmenting the text to be aggregated, with a single Chinese character or a continuous character string as the minimum segmentation unit, to obtain a plurality of word segments;
  • based on the N-gram model, combining every N consecutive word segments of the obtained word segments into one text feature, to obtain the text feature set corresponding to the text to be aggregated.
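  • For example, assuming that the N-gram model is a bi-gram and the text to be aggregated is "My birthday is 1989-01-22", the text feature set finally obtained for the text to be aggregated can be expressed as {my, birthday, birthday, day, is 1989-01-22}, where each element is a pair of adjacent segments of the Chinese source sentence as rendered in English.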
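  • The segmentation and N-gram combination described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names `segment` and `ngram_features`, the regular expression, and the handling of texts with fewer than N segments are all hypothetical, and the sample sentence is an assumed Chinese back-translation of the example sentence.

```python
import re

# One token per CJK character, or per run of consecutive non-space,
# non-CJK characters (assumed reading of "continuous character string").
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def segment(text):
    """Mechanically split text into minimal units, ignoring semantics."""
    return _TOKEN.findall(text)


def ngram_features(text, n=2):
    """Combine every n consecutive segments into one feature (n > 1)."""
    tokens = segment(text)
    if len(tokens) < n:
        # Assumption: a text shorter than n segments yields one feature.
        return {"".join(tokens)} if tokens else set()
    return {"".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Assumed back-translation of "My birthday is 1989-01-22".
print(segment("我的生日是1989-01-22"))
print(sorted(ngram_features("我的生日是1989-01-22")))
```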
  • Before feature extraction is performed on the text to be aggregated whose length is not greater than the set length threshold, the method may further include the following step:
  • pre-processing the text to be aggregated, so that text feature extraction can then be performed on the pre-processed text. The pre-processing may include one or more of the following operations, which the embodiments of the present application do not limit:
  • removing special tags (such as HTML tags) from the text to be aggregated;
  • removing non-text special symbols (such as & and *) from the text to be aggregated;
  • performing traditional-to-simplified conversion on the text to be aggregated (for example, converting traditional Chinese characters in the text into simplified characters);
  • normalizing runs of consecutive Latin letters and/or digits in the text to be aggregated into a set string (for example, normalizing "Abc1234" or "1989-01-22" to "xxxxxxx").
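  • Three of the pre-processing operations listed above can be sketched in Python as follows. This is only an illustration: the function name `preprocess`, the regular expressions, the small set of symbols that is stripped, and the order of the operations are assumptions. Traditional-to-simplified conversion is left out because it needs a character mapping table that the text does not specify.

```python
import re

# The "set string"; the value is taken from the example in the text.
PLACEHOLDER = "xxxxxxx"

_TAG = re.compile(r"<[^>]+>")                              # HTML-style tags
_LATIN_DIGIT_RUN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")
_SYMBOLS = re.compile(r"[&*#@^~]+")                        # assumed symbol set


def preprocess(text):
    """Strip tags, normalize Latin/digit runs, then drop special symbols."""
    text = _TAG.sub("", text)
    # Normalize before removing symbols so hyphenated runs such as
    # "1989-01-22" collapse into a single placeholder.
    text = _LATIN_DIGIT_RUN.sub(PLACEHOLDER, text)
    text = _SYMBOLS.sub("", text)
    return text


print(preprocess("<b>我的生日是1989-01-22</b>&*"))
```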
  • Step 102: Calculate a hash value of the first text feature set based on the set locality-sensitive hashing algorithm, and determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than the set distance.
  • The set locality-sensitive hashing algorithm may be, but is not limited to, the Simhash algorithm or the Minhash algorithm.
  • The Simhash algorithm is a commonly used method for deduplicating web pages: it generates a digital signature from the content of a web page, and then judges how similar two pages are by calculating the difference between their signatures.
  • The Minhash algorithm is also a locality-sensitive hashing algorithm and can be used to quickly estimate the similarity of two sets. It was originally used to detect duplicate web pages in search engines, and can also be applied to problems such as large-scale clustering; this is not described in further detail in the embodiments of the present application.
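  • As an illustration of how Minhash estimates the similarity of two sets, the following Python sketch builds a signature from the minimum hash value under several seeded hash functions and compares two signatures position by position. It is a simplified sketch: the function names, the use of seeded MD5 digests in place of random permutations, and the signature length of 64 are assumptions.

```python
import hashlib


def _seeded_hash(seed, feature):
    """64-bit hash of a feature under a given seed (assumed construction)."""
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(features, num_hashes=64):
    """Return the minimum seeded hash per seed; features must be non-empty."""
    return [min(_seeded_hash(seed, f) for f in features)
            for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature({"我的", "的生", "生日", "日是"})
b = minhash_signature({"我的", "的生", "生日", "日快"})
print(estimate_similarity(a, b))
```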
  • Preferably, the Simhash algorithm may be used to calculate the hash value of the first text feature set.
  • In that case, step 102 may specifically be performed as follows: calculate the Simhash value of the first text feature set based on the Simhash algorithm and, according to the calculated Simhash value, determine whether the constructed Simhash index contains a matching value whose distance (specifically, the Hamming distance) from the calculated Simhash value is not greater than the set distance.
  • The set distance can be set flexibly according to the actual situation.
  • For example, the Hamming distance may be set to 3 to 5; this is not described in further detail in the embodiments of the present application.
  • The Hamming distance between two equal-length strings is the number of positions at which the corresponding characters differ, that is, the number of characters that must be replaced to transform one string into the other.
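  • A Simhash signature and the Hamming distance between two signatures can be sketched in Python as follows. The sketch makes several assumptions that the text does not state: the function names, a 64-bit signature, MD5 as the per-feature hash, and equal weight for every feature.

```python
import hashlib


def simhash(features, bits=64):
    """Fold per-feature hashes into one signature by a per-bit vote."""
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 for a set bit and -1 for a clear bit.
            votes[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            value |= 1 << i
    return value


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


sig_a = simhash({"我的", "的生", "生日", "日是"})
sig_b = simhash({"我的", "的生", "生日", "日快"})
print(hamming_distance(sig_a, sig_b))
```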
  • Step 103: If it is determined that the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains matching values whose distance from the calculated hash value is not greater than the set distance, select, from those matching values, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and the second text feature set corresponding to that matching value.
  • The similarity between the first text feature set and the second text feature set may be represented by one or more of the following similarity measures: Jaccard similarity, Euclidean distance, Hamming distance, and so on. That is, when calculating the similarity between the first text feature set and the second text feature set corresponding to the selected matching value, the Jaccard similarity, the Euclidean distance, or the Hamming distance between the two feature sets may be calculated; this is not described in further detail in the embodiments of the present application.
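  • Of the measures named above, the Jaccard similarity of two feature sets is the size of their intersection divided by the size of their union. A minimal Python sketch follows; the function name and the convention that two empty sets have similarity 1.0 are assumptions.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


first = {"我的", "的生", "生日"}
second = {"生日", "日是"}
print(jaccard_similarity(first, second))  # one shared feature out of four
```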
  • Step 104: If it is determined that the similarity between the first text feature set and the second text feature set is not less than the set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
  • The set similarity threshold can be set flexibly according to the actual situation. For example, when high accuracy of text aggregation is required, the similarity threshold may be set to a relatively high value, and when lower accuracy is acceptable, it may be set to a relatively low value; this is not described in further detail in the embodiments of the present application.
  • The similarity between the first text feature set and the second text feature set is checked mainly to eliminate the misjudgments caused by the collision probability of the locality-sensitive hashing algorithm when it is applied to the aggregation of short text data, which improves the accuracy of text aggregation.
  • For example, after the Simhash algorithm is used to calculate the hash value of the first text feature set and the corresponding matching value is selected, the Jaccard similarity may be used for the check.
  • Jaccard similarity is the most common way to measure the similarity of two sets and is also suitable for measuring the similarity of short texts, but because its computational cost is too high it cannot be used directly to aggregate large volumes of text. A Jaccard similarity check can, however, resolve the collision problem of the Simhash algorithm and eliminate the misjudgments caused by Simhash collisions. Therefore, when the Simhash algorithm is combined with a Jaccard similarity check to analyze the similarity of the text to be aggregated, short texts can be aggregated both accurately and quickly.
  • Further, the method may also include the following step:
  • If it is determined that the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance; or that the index contains such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold; then the calculated hash value is added to the constructed hash index corresponding to the set locality-sensitive hashing algorithm, a new text class is created from the text to be aggregated, and the text to be aggregated is assigned to the newly created text class.
  • That is, the hash value corresponding to the text to be aggregated may be added to the corresponding hash index and the text to be aggregated returned; this is not described in further detail in the embodiments of the present application.
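  • Steps 102 to 104 and the fallback that creates a new text class can be put together in one short Python sketch. Several simplifications are assumed and are not taken from the application: the class and method names, a plain list scanned linearly in place of a real hash index, a set distance of 3, a similarity threshold of 0.6, and an unweighted 64-bit Simhash built on MD5.

```python
import hashlib


def simhash(features, bits=64):
    """Unweighted Simhash signature (assumed construction)."""
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, vote in enumerate(votes) if vote > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


class Aggregator:
    def __init__(self, max_distance=3, min_similarity=0.6):
        self.max_distance = max_distance      # the "set distance"
        self.min_similarity = min_similarity  # the "set similarity threshold"
        # Each entry is (hash, feature_set, class_id); a linear scan over
        # this list stands in for a real hash index.
        self.index = []

    def add(self, features):
        """Return the class id that the feature set is aggregated into."""
        features = set(features)
        h = simhash(features)
        # Step 102: matching values within the set distance.
        near = [(hamming(h, eh), i)
                for i, (eh, _, _) in enumerate(self.index)
                if hamming(h, eh) <= self.max_distance]
        if near:
            # Step 103: closest matching value, then the similarity check.
            _, best = min(near)
            _, best_features, class_id = self.index[best]
            # Step 104: aggregate when the check passes.
            if jaccard(features, best_features) >= self.min_similarity:
                return class_id
        # Fallback: add the hash to the index and create a new class.
        class_id = len(self.index)
        self.index.append((h, features, class_id))
        return class_id


agg = Aggregator()
print(agg.add({"我的", "的生", "生日"}))   # first text creates class 0
print(agg.add({"我的", "的生", "生日"}))   # identical features join class 0
```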
  • Embodiment 1 of the present application provides a text aggregation method. In the technical solution described in Embodiment 1, feature extraction may be performed on a text to be aggregated whose length is not greater than a set length threshold, and after the text feature set corresponding to the text to be aggregated is obtained, a determination method that combines a locality-sensitive hashing algorithm with a similarity check may be used to perform similarity analysis on the text to be aggregated and thereby aggregate it. This solves the problem of low accuracy and poor real-time performance of text aggregation results when short-text similarity analysis is based on the vector space model or the probability model, and achieves accurate and fast aggregation of short texts, such as real-time aggregation of short texts under big-data traffic (for example, more than 10,000 items per second).
  • Embodiment 2:
  • The text aggregation apparatus may mainly include:
  • a feature extraction unit 21, configured to perform feature extraction on the text to be aggregated whose length is not greater than the set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
  • a text aggregation unit 22, configured to calculate a hash value of the first text feature set based on the set locality-sensitive hashing algorithm, and determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than the set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and the second text feature set corresponding to that matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than the set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
  • The locality-sensitive hashing algorithm may be, but is not limited to, the Simhash algorithm or the Minhash algorithm.
  • The similarity between the first text feature set and the second text feature set may be represented by one or more of the following similarity measures: Jaccard similarity, Euclidean distance, Hamming distance, and so on.
  • The text aggregation unit 22 may be further configured to: if it is determined that the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance; or that the index contains such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold; add the calculated hash value to the constructed hash index corresponding to the set locality-sensitive hashing algorithm, create a new text class from the text to be aggregated, and assign the text to be aggregated to the newly created text class.
  • The feature extraction unit 21 may be specifically configured to perform feature extraction on the text to be aggregated whose length is not greater than the set length threshold, using a feature extraction method that combines mechanical word segmentation with an N-gram model, to obtain the first text feature set corresponding to the text to be aggregated, where N is a natural number greater than 1.
  • The feature extraction unit 21 may be specifically configured to segment the text to be aggregated, with a single Chinese character or a continuous character string as the minimum segmentation unit, to obtain a plurality of word segments; and, based on the N-gram model, combine every N consecutive word segments of the obtained word segments into one text feature, to obtain the text feature set corresponding to the text to be aggregated.
  • The apparatus may further include a pre-processing unit 23:
  • The pre-processing unit 23 may be configured to pre-process the text to be aggregated before feature extraction is performed on the text to be aggregated whose length is not greater than the set length threshold. The pre-processing may include one or more of the following: removing special tags from the text to be aggregated, removing non-text special symbols from the text to be aggregated, performing traditional-to-simplified conversion on the text to be aggregated, and normalizing runs of consecutive Latin letters and/or digits in the text to be aggregated into a set string.
  • Embodiments of the present application can be provided as a method, an apparatus (device), or a computer program product.
  • The present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • The application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

A text aggregate method and device. The method comprises: when a text feature set corresponding to a text to be aggregated is acquired, employing a determination method combining a locality sensitive hashing algorithm and a similarity check to perform a similarity analysis on the text to be aggregated so as to aggregate the text to be aggregated. Therefore, the present invention can address the problem of a low-accuracy and high-latency text aggregate result due to performing a short-text similarity analysis on the basis of a vector space model or a probabilistic model, thus accurately and quickly aggregating a short text.

Description

Text aggregation method and device
The present application claims priority to Chinese Patent Application No. 201510242860.0, filed on May 13, 2015 and entitled "Text aggregation method and device", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of Internet technologies, and in particular to a text aggregation method and apparatus.
Background
In scenarios such as traditional communication applications (for example, SMS and e-mail) and newer Internet social applications (for example, WeChat, Weibo, and forums), large amounts of short text data are generated all the time, for example Chinese text data whose length is not greater than a set length threshold (such as 150 to 200 characters, where an English word or a run of consecutive digits counts as one Chinese character). This text data contains a large amount of valuable information, and aggregating it can reveal potential hot topics or patterns in the information.
Specifically, text aggregation is a technique for grouping a collection of texts under a given similarity measure so that texts that are close to each other fall into the same group. Text aggregation may specifically include steps such as text feature extraction and text similarity analysis.
At present, similarity analysis of texts for the purpose of aggregation is mainly based on the vector space model or the probability model. In the vector space model, the characters or words in a text are used as features to represent the text, and the similarity between feature vectors is used to measure the relevance of texts. For texts that are too short, the feature vectors are therefore too sparse, the computed result cannot meet the requirements of similarity analysis, and the final text aggregation result is inaccurate. In the probability model, if the text is too short, most of the features are the result of probability smoothing and cannot reflect the information in the real data, so the aggregation result is again inaccurate and cannot meet user needs. Furthermore, because both of these traditional kinds of text similarity algorithm are computationally expensive, they can hardly support real-time analysis of short text data, whose volume commonly reaches tens of millions or even hundreds of millions of items, so the text aggregation result is poor.
In other words, when short text data is aggregated at present, the poor manner of text similarity analysis leads to low accuracy and poor real-time performance of text aggregation. A new text aggregation method is therefore urgently needed to solve these problems.
Summary of the invention
The embodiments of the present application provide a text aggregation method and apparatus to solve the problem that current text aggregation approaches have low accuracy and poor real-time performance because of their poor text similarity analysis.
An embodiment of the present application provides a text aggregation method, including:
performing feature extraction on a text to be aggregated whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
calculating a hash value of the first text feature set based on a set locality-sensitive hashing algorithm, and determining, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance;
if so, selecting, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculating the similarity between the first text feature set and the second text feature set corresponding to that matching value;
if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregating the text to be aggregated into the text class corresponding to the second text feature set.
Correspondingly, an embodiment of the present application further provides a text aggregation apparatus, including:
a feature extraction unit, configured to perform feature extraction on a text to be aggregated whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
a text aggregation unit, configured to calculate a hash value of the first text feature set based on a set locality-sensitive hashing algorithm, and determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and the second text feature set corresponding to that matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
The beneficial effects of the present application are as follows:
The embodiments of the present application provide a text aggregation method and apparatus. In the technical solution described in the embodiments, after the text feature set corresponding to the text to be aggregated is obtained, a determination method that combines a locality-sensitive hashing algorithm with a similarity check is used to perform similarity analysis on the text to be aggregated and thereby aggregate it. This solves the problem of low accuracy and poor real-time performance of text aggregation results when short-text similarity analysis is based on the vector space model or the probability model, and achieves accurate and fast aggregation of short texts.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the text aggregation method according to Embodiment 1 of the present application;
FIG. 2 is a schematic structural diagram of the text aggregation apparatus according to Embodiment 2 of the present application.
具体实施方式detailed description
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。The present invention will be further described in detail with reference to the accompanying drawings, in which FIG. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without departing from the inventive scope are the scope of the present application.
Embodiment 1:
Embodiment 1 of the present application provides a text aggregation method. FIG. 1 is a schematic flowchart of the text aggregation method according to Embodiment 1. The text aggregation method may include the following steps:
Step 101: Perform feature extraction on the text to be aggregated, whose length does not exceed a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated.
Optionally, the text to be aggregated may specifically be Chinese text data whose length does not exceed a set length threshold (for example, 150 to 200 characters, where an English word or a run of consecutive digits counts as one Chinese character); this is not elaborated further in the embodiments of the present application.
Further, a large amount of the short text data on the Internet uses non-standard wording and contains various deformations. Consequently, when a traditional word segmentation method is used for feature extraction (for example, segmenting the text with an ordinary word segmenter and using the segmentation result as the feature description of the text), good feature extraction results may not be obtained, and the final text aggregation result may therefore be inaccurate.
Therefore, to improve text feature extraction, in the embodiments of the present application, feature extraction may be performed as follows on the text to be aggregated whose length does not exceed the set length threshold, to obtain the text feature set corresponding to the text to be aggregated:
Perform feature extraction on the text to be aggregated, whose length does not exceed the set length threshold, using mechanical word segmentation combined with an N-gram model, to obtain the first text feature set corresponding to the text to be aggregated, where N is a natural number greater than 1.
It should be noted that, compared with traditional word segmentation, feature extraction using mechanical word segmentation combined with an N-gram model achieves better results on short text data. This is because mechanical segmentation splits the text mechanically while ignoring semantics, whereas the N-gram model establishes a degree of dependency between otherwise isolated features. Together they yield a larger feature set carrying richer information, which usefully supplements short texts that contain little information to begin with. Good results can therefore be obtained when extracting features from non-standard short texts, which in turn improves the accuracy of text aggregation.
Optionally, performing feature extraction on the text to be aggregated, whose length does not exceed the set length threshold, using mechanical word segmentation combined with an N-gram model, to obtain the text feature set corresponding to the text to be aggregated, may include:
segmenting the text to be aggregated, with each Chinese character and each continuous character string (such as a continuous run of Latin letters, a continuous run of digits, or a continuous run of mixed Latin letters and digits) as the minimum segmentation unit, to obtain multiple segments; for example, the text to be aggregated "我的生日是1989-01-22" ("My birthday is 1989-01-22") may be segmented as "我/的/生/日/是/1989-01-22";
based on the N-gram model, combining every N consecutive segments among the obtained segments into one text feature, to obtain the text feature set corresponding to the text to be aggregated. For example, with N = 2 (that is, the N-gram model is a bigram model) and the text to be aggregated "我的生日是1989-01-22", the resulting text feature set can be expressed as {我的, 的生, 生日, 日是, 是1989-01-22}.
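The two sub-steps above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names `mechanical_segment` and `ngram_features` are hypothetical, and the regular expression assumes that a single CJK character in the range U+4E00 to U+9FFF is one unit and that any other run of non-whitespace characters (including the hyphens in "1989-01-22") is one unit.

```python
import re

# Assumption: one CJK character is one unit; any other run of
# non-whitespace characters (letters, digits, hyphens, ...) is one unit.
_UNIT = re.compile(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+")


def mechanical_segment(text):
    """Split text into minimum segmentation units, ignoring semantics."""
    return _UNIT.findall(text)


def ngram_features(text, n=2):
    """Join every n consecutive segments into one feature; return the set."""
    segments = mechanical_segment(text)
    return {"".join(segments[i:i + n]) for i in range(len(segments) - n + 1)}
```

With `n=2`, `ngram_features("我的生日是1989-01-22")` yields the five bigram features listed in the example above. A text with fewer than `n` segments yields an empty set, which a caller would need to handle separately.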
Further, to improve text quality and thereby the accuracy of text aggregation, before feature extraction is performed on the text to be aggregated whose length does not exceed the set length threshold, the method may further include the following step:
preprocessing the text to be aggregated, so that the subsequent text feature extraction can be performed on the preprocessed text; the preprocessing may include at least any one or more of the following operations, and the embodiments of the present application place no limitation on this:
removing special tags (such as HTML tags) from the text to be aggregated; removing non-text special symbols (such as & and *) from the text to be aggregated; converting the text to be aggregated between traditional and simplified Chinese characters (for example, converting traditional characters in the text to simplified characters); and normalizing continuous runs of Latin letters and/or digits in the text to be aggregated to a set character string (for example, normalizing "Abc1234" or "1989-01-22" to "xxxxxxx").
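Three of the preprocessing operations above can be sketched with regular expressions. The patterns below are assumptions chosen for illustration: the symbol list is a small hypothetical sample, the tag pattern is a naive one that is not a full HTML parser, and a hyphen between alphanumeric runs is treated as part of the run so that "1989-01-22" normalizes as a whole. Traditional-to-simplified conversion needs a character mapping table or a dedicated library and is left out of this sketch.

```python
import re

TAG = re.compile(r"<[^>]+>")                          # naive HTML-like tag pattern
RUN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")   # Latin/digit run, hyphens allowed inside
SYMBOL = re.compile(r"[&*#@^~|]+")                    # hypothetical sample of special symbols
PLACEHOLDER = "xxxxxxx"                               # the set string used in the example above


def preprocess(text):
    """Strip tags, normalize Latin/digit runs, then drop special symbols."""
    text = TAG.sub("", text)
    text = RUN.sub(PLACEHOLDER, text)
    text = SYMBOL.sub("", text)
    return text
```

Normalizing every Latin/digit run to the same placeholder means that texts differing only in a date, an order number, or a user name produce identical features, which is presumably why the operation helps group templated short messages.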
Step 102: Compute a hash value of the first text feature set using a configured locality-sensitive hashing algorithm and, based on the computed hash value, determine whether an already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains any matching value whose distance from the computed hash value does not exceed a set distance.
Specifically, the configured locality-sensitive hashing algorithm may be, but is not limited to, the Simhash algorithm or the Minhash algorithm. The Simhash algorithm is a commonly used method for web page deduplication: it generates a digital signature from the content of a web page and then judges how similar the contents of two pages are by computing the difference between their signatures. Like Simhash, the Minhash algorithm is a locality-sensitive hashing algorithm; it can be used to quickly estimate the similarity of two sets, was originally used to detect duplicate web pages in search engines, and can also be applied to large-scale clustering problems. These algorithms are not elaborated further in the embodiments of the present application.
Preferably, because the Simhash algorithm is relatively fast, the Simhash algorithm may be chosen in the embodiments of the present application to compute the hash value of the first text feature set. Accordingly, taking Simhash as the configured locality-sensitive hashing algorithm, step 102 may specifically be performed as follows: compute the Simhash value of the first text feature set using the Simhash algorithm and, based on the computed Simhash value, determine whether the already-built Simhash index contains any matching value whose distance from the computed Simhash value (specifically, the Hamming distance) does not exceed the set distance.
The set distance can be chosen flexibly according to the actual situation; for the Hamming distance, for example, it may be set to 3 to 5. This is not elaborated further in the embodiments of the present application. It should also be noted that, in information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ, that is, the number of characters that must be replaced to transform one string into the other.
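As a rough illustration of how a Simhash value and a Hamming distance can be computed, the sketch below follows the commonly described form of Simhash with unit feature weights. The 64-bit width and the use of MD5 as the per-feature hash are assumptions made here; the application does not specify either, and the function names are hypothetical.

```python
import hashlib


def feature_hash(feature, bits=64):
    """Deterministic per-feature hash: the first bits/8 bytes of MD5 (an assumption)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def simhash(features, bits=64):
    """Each feature votes +1 or -1 on every bit; positive totals become 1 bits."""
    weights = [0] * bits
    for feature in features:
        h = feature_hash(feature, bits)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming_distance(a, b):
    """Number of bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")
```

Because each bit is decided by a majority vote over all features, two feature sets that share most of their features tend to agree on most bits, so a small Hamming distance between Simhash values suggests, but does not guarantee, similar texts.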
Step 103: If it is determined that the already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains matching values whose distance from the computed hash value does not exceed the set distance, select, from those matching values, the one with the smallest distance from the computed hash value, and compute the similarity between the first text feature set and the second text feature set corresponding to that closest matching value.
Optionally, the similarity between the first text feature set and the second text feature set may be expressed by at least any one or more of the following similarity measures: Jaccard similarity, Euclidean distance, and Hamming distance. That is, when computing the similarity between the first text feature set and the second text feature set corresponding to the closest matching value, the Jaccard similarity, Euclidean distance, Hamming distance, or the like between the two sets may be computed; this is not elaborated further in the embodiments of the present application.
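Of the measures listed, the Jaccard similarity is the simplest to state: the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; treating two empty sets as identical is a convention chosen here, not something the application states.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two feature sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)
```

For instance, the sets {我的, 的生, 生日} and {我的, 的生, 生日, 日是} share three of four distinct features, giving a similarity of 0.75.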
Step 104: If the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
The set similarity threshold can be chosen flexibly according to the actual situation. For example, when high aggregation accuracy is required, the similarity threshold may be set to a relatively high value; when the accuracy requirement is lower, it may be set to a relatively low value. This is not elaborated further in the embodiments of the present application.
It should be noted that, in the embodiments of the present application, the similarity between the first text feature set and the second text feature set is checked mainly to eliminate the misjudgments caused by the collision probability of the locality-sensitive hashing algorithm when that algorithm is applied to aggregating short text data, and thereby to improve the accuracy of text aggregation.
For example, when the Simhash algorithm is used to compute the hash value of the first text feature set and a matching value is then selected, the similarity (such as the Jaccard similarity) between the first text feature set and the second text feature set corresponding to the selected matching value may be further checked, to eliminate misjudgments caused by Simhash collisions.
It should be noted that the Jaccard similarity is one of the most common measures of the similarity of two sets and is well suited to measuring the similarity of short texts; however, because it is computationally expensive, it cannot be applied directly to aggregating large volumes of text. A Jaccard similarity check, on the other hand, can fully resolve the collision problem of the Simhash algorithm and eliminate the misjudgments that Simhash collisions cause. Therefore, when the similarity of the text to be aggregated is analyzed using the Simhash algorithm combined with a Jaccard similarity check, short texts can be aggregated both accurately and quickly.
Further, in the embodiments of the present application, the method may also include the following step:
if it is determined that the already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains no matching value whose distance from the computed hash value does not exceed the set distance; or if it is determined that the index contains such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold; then updating (that is, adding) the computed hash value to the already-built hash index corresponding to the configured locality-sensitive hashing algorithm, creating a new text class based on the text to be aggregated, and assigning the text to be aggregated to the newly created text class.
That is, if it is determined that the text to be aggregated does not belong to any existing text class, the hash value corresponding to the text may be added to the corresponding hash index and the text assigned to a newly created text class; this is not elaborated further in the embodiments of the present application.
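Steps 102 to 104 and the new-class branch described above can be put together in one short sketch. Everything named here (`TextAggregator`, `aggregate`, the default thresholds, MD5 as the per-feature hash) is hypothetical. The index is a plain list scanned linearly, which keeps the example short but is not how a high-throughput index would be organized.

```python
import hashlib


def simhash(features, bits=64):
    """64-bit Simhash with unit weights; MD5 per feature is an assumption."""
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def jaccard_similarity(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class TextAggregator:
    def __init__(self, max_distance=3, min_similarity=0.8):
        self.max_distance = max_distance      # set distance (3 to 5 in the text)
        self.min_similarity = min_similarity  # assumed similarity threshold
        self.index = []  # entries: (hash value, feature set, class id)

    def aggregate(self, features):
        """Return the class id that the given feature set is assigned to."""
        features = frozenset(features)
        h = simhash(features)
        # Step 102/103: find the closest indexed hash within the set distance.
        best = None
        for entry in self.index:
            d = hamming_distance(h, entry[0])
            if d <= self.max_distance and (best is None or d < best[0]):
                best = (d, entry)
        # Step 104: confirm the candidate with a Jaccard similarity check.
        if best is not None:
            _, (_, matched_features, class_id) = best
            if jaccard_similarity(features, matched_features) >= self.min_similarity:
                return class_id
        # No match, or the check failed: index the hash and open a new class.
        class_id = len(self.index)
        self.index.append((h, features, class_id))
        return class_id
```

In this sketch each index entry corresponds to exactly one class, so the entry count doubles as the next class id. Feeding the same feature set twice returns the same class id, while a disjoint feature set always opens a new class because its Jaccard similarity to any indexed set is 0.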
Further, it should be noted that the solution described in the embodiments of the present application is not limited to any particular language, software, or hardware. However, to improve the efficiency of text aggregation, a high-performance programming language (such as C++ or Java) and high-performance hardware are preferred for the implementation; this is not elaborated further in the embodiments of the present application.
Embodiment 1 of the present application provides a text aggregation method. In the technical solution of Embodiment 1, feature extraction may be performed on the text to be aggregated whose length does not exceed a set length threshold and, after the text feature set corresponding to the text is obtained, a determination method that combines a locality-sensitive hashing algorithm with a similarity check may be used to analyze the similarity of the text and thereby aggregate it. This addresses the low accuracy and poor real-time performance of text aggregation results that arise when short-text similarity analysis is based on a vector space model or a probability model, so that short texts can be aggregated both accurately and quickly. For example, short texts can be aggregated in real time under heavy data traffic (such as more than 10,000 items per second), supporting real-time analysis of data streams.
Embodiment 2:
Based on the same inventive concept, Embodiment 2 of the present application provides a text aggregation apparatus. For the specific implementation of the apparatus, refer to the related description in method Embodiment 1 above; repeated content is not described again. As shown in FIG. 2, the text aggregation apparatus may mainly include:
a feature extraction unit 21, which may be configured to perform feature extraction on the text to be aggregated, whose length does not exceed a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
a text aggregation unit 22, which may be configured to: compute a hash value of the first text feature set using a configured locality-sensitive hashing algorithm and, based on the computed hash value, determine whether an already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains any matching value whose distance from the computed hash value does not exceed a set distance; if so, select, from the matching values whose distance from the computed hash value does not exceed the set distance, the matching value with the smallest distance from the computed hash value, and compute the similarity between the first text feature set and the second text feature set corresponding to that closest matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
The configured locality-sensitive hashing algorithm may be, but is not limited to, the Simhash algorithm or the Minhash algorithm. In addition, the similarity between the first text feature set and the second text feature set may be expressed by at least any one or more of the following similarity measures: Jaccard similarity, Euclidean distance, and Hamming distance.
Further, the text aggregation unit 22 may also be configured to: if it is determined that the already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains no matching value whose distance from the computed hash value does not exceed the set distance; or if it is determined that the index contains such a matching value but the similarity between the first text feature set and the second text feature set is less than the set similarity threshold; then update the computed hash value into the already-built hash index corresponding to the configured locality-sensitive hashing algorithm, create a new text class based on the text to be aggregated, and assign the text to be aggregated to the newly created text class.
Further, to improve text feature extraction, in the embodiments of the present application, the feature extraction unit 21 may specifically be configured to perform feature extraction on the text to be aggregated, whose length does not exceed the set length threshold, using mechanical word segmentation combined with an N-gram model, to obtain the first text feature set corresponding to the text to be aggregated, where N is a natural number greater than 1.
Optionally, the feature extraction unit 21 may specifically be configured to segment the text to be aggregated, with each Chinese character and each continuous character string as the minimum segmentation unit, to obtain multiple segments; and, based on the N-gram model, combine every N consecutive segments among the obtained segments into one text feature, to obtain the text feature set corresponding to the text to be aggregated.
Further, the apparatus may also include a preprocessing unit 23:
the preprocessing unit 23 may be configured to preprocess the text to be aggregated before feature extraction is performed on the text to be aggregated whose length does not exceed the set length threshold; the preprocessing may include at least one or more of: removing special tags from the text to be aggregated, removing non-text special symbols from the text to be aggregated, converting the text to be aggregated between traditional and simplified Chinese characters, and normalizing continuous runs of Latin letters and/or digits in the text to be aggregated to a set character string.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus (device), or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, apparatuses (devices), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.

Claims (14)

  1. A text aggregation method, comprising:
    performing feature extraction on a text to be aggregated, whose length does not exceed a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
    computing a hash value of the first text feature set using a configured locality-sensitive hashing algorithm and, based on the computed hash value, determining whether an already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains any matching value whose distance from the computed hash value does not exceed a set distance;
    if so, selecting, from the matching values whose distance from the computed hash value does not exceed the set distance, the matching value with the smallest distance from the computed hash value, and computing the similarity between the first text feature set and a second text feature set corresponding to that closest matching value;
    if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregating the text to be aggregated into the text class corresponding to the second text feature set.
  2. The method according to claim 1, wherein the method further comprises:
    if it is determined that the already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains no matching value whose distance from the computed hash value does not exceed the set distance; or if it is determined that the already-built hash index corresponding to the configured locality-sensitive hashing algorithm contains a matching value whose distance from the computed hash value does not exceed the set distance and that the similarity between the first text feature set and the second text feature set is less than the set similarity threshold, then
    updating the computed hash value into the already-built hash index corresponding to the configured locality-sensitive hashing algorithm, creating a new text class based on the text to be aggregated, and assigning the text to be aggregated to the newly created text class.
  3. The method according to claim 1 or 2, wherein performing feature extraction on the text to be aggregated, whose length does not exceed the set length threshold, to obtain the text feature set corresponding to the text to be aggregated comprises:
    performing feature extraction on the text to be aggregated, whose length does not exceed the set length threshold, using mechanical word segmentation combined with an N-gram model, to obtain the first text feature set corresponding to the text to be aggregated, where N is a natural number greater than 1.
  4. The method according to claim 3, wherein performing feature extraction on the text to be aggregated, whose length does not exceed the set length threshold, using mechanical word segmentation combined with an N-gram model, to obtain the first text feature set corresponding to the text to be aggregated comprises:
    segmenting the text to be aggregated, with each Chinese character and each continuous character string as the minimum segmentation unit, to obtain multiple segments;
    based on the N-gram model, combining every N consecutive segments among the obtained segments into one text feature, to obtain the text feature set corresponding to the text to be aggregated.
  5. The method according to claim 1 or 2, wherein the set locality-sensitive hashing algorithm includes, but is not limited to, the Simhash algorithm or the Minhash algorithm.
  6. The method according to claim 1 or 2, wherein the similarity between the first text feature set and the second text feature set is represented by at least any one or more of the following similarity measures: Jaccard similarity, Euclidean distance, and Hamming distance.
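To make the hashing and similarity measures named in claims 5 and 6 concrete, here is a small Python sketch of a Simhash fingerprint, the Hamming distance between two fingerprints, and the Jaccard similarity between two feature sets. It rests on several assumptions that the claims do not specify: every feature carries weight 1, MD5 (truncated to 64 bits) serves as the per-feature hash, and the function names are invented for the example.

```python
import hashlib


def simhash(features, bits=64):
    """Unweighted Simhash over an iterable of hashable features.

    Each feature votes +1 or -1 on every bit position; the fingerprint
    has a 1 wherever the total vote is positive. `bits` must not exceed
    128 here because the per-feature hash is a truncated MD5 digest.
    """
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(repr(feature).encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    f1 = {("你", "好"), ("好", "世"), ("世", "界")}
    f2 = {("你", "好"), ("好", "世"), ("世", "人")}
    print(hamming_distance(simhash(f1), simhash(f2)))
    print(jaccard_similarity(f1, f2))
```

The fingerprint comparison is cheap and suits a first-pass lookup, while the set-based Jaccard check is the more precise measure applied to the single closest candidate.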
  7. The method according to claim 1 or 2, wherein, before performing feature extraction on the text to be aggregated whose length is not greater than the set length threshold, the method further comprises:
    preprocessing the text to be aggregated, wherein the preprocessing comprises at least one or more of: removing special tags from the text to be aggregated, removing non-text special symbols from the text to be aggregated, converting the text to be aggregated between traditional and simplified Chinese characters, and normalizing runs of consecutive Latin letters and/or digits in the text to be aggregated to a set character string.
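The preprocessing steps listed in claim 7 might look like the following Python sketch. Everything concrete in it is an assumption for illustration: the function name `preprocess`, the treatment of "special tags" as angle-bracket markup, the placeholder string `"X"`, and above all the four-entry traditional-to-simplified table, which merely stands in for a full conversion table.

```python
import re

# Assumption: "special tags" are angle-bracket markup such as <b>...</b>.
_TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "non-text special symbols" are characters that are neither
# word characters (letters, digits, ideographs, underscore) nor whitespace.
_SYMBOL_RE = re.compile(r"[^\w\s]")
# Runs of consecutive ASCII letters and/or digits.
_ALNUM_RUN_RE = re.compile(r"[A-Za-z0-9]+")

# Hypothetical, deliberately tiny traditional -> simplified mapping.
_T2S = str.maketrans({"體": "体", "簡": "简", "轉": "转", "換": "换"})


def preprocess(text, placeholder="X"):
    """Apply the four preprocessing steps in a fixed order."""
    text = _TAG_RE.sub("", text)        # 1. remove tags
    text = text.translate(_T2S)         # 2. traditional -> simplified
    text = _SYMBOL_RE.sub("", text)     # 3. remove punctuation/symbols
    # 4. Normalise letter/digit runs last, so the placeholder itself
    #    is never touched by the earlier steps.
    return _ALNUM_RUN_RE.sub(placeholder, text)


if __name__ == "__main__":
    print(preprocess("<b>繁體</b>轉換！！ call 13800138000 now"))
```

Collapsing every letter or digit run to one placeholder makes messages that differ only in a phone number, order ID, or URL fragment produce the same features.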
  8. A text aggregation apparatus, comprising:
    a feature extraction unit, configured to perform feature extraction on a text to be aggregated whose length is not greater than a set length threshold, to obtain a first text feature set corresponding to the text to be aggregated;
    a text aggregation unit, configured to: calculate a hash value of the first text feature set based on a set locality-sensitive hashing algorithm; determine, according to the calculated hash value, whether the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than a set distance; if so, select, from the matching values whose distance from the calculated hash value is not greater than the set distance, the matching value with the smallest distance from the calculated hash value, and calculate the similarity between the first text feature set and a second text feature set corresponding to that closest matching value; and, if the similarity between the first text feature set and the second text feature set is determined to be not less than a set similarity threshold, aggregate the text to be aggregated into the text class corresponding to the second text feature set.
  9. The apparatus according to claim 8, wherein
    the text aggregation unit is further configured to: if it is determined that the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains no matching value whose distance from the calculated hash value is not greater than the set distance; or if it is determined that the constructed hash index corresponding to the set locality-sensitive hashing algorithm contains a matching value whose distance from the calculated hash value is not greater than the set distance, and that the similarity between the first text feature set and the second text feature set is less than the set similarity threshold, then
    update the calculated hash value into the constructed hash index corresponding to the set locality-sensitive hashing algorithm, create a new text class based on the text to be aggregated, and assign the text to be aggregated to the newly created text class.
  10. The apparatus according to claim 8 or 9, wherein
    the feature extraction unit is specifically configured to perform feature extraction on the text to be aggregated whose length is not greater than the set length threshold, using a feature extraction approach that combines mechanical word segmentation with an N-gram model, to obtain the first text feature set corresponding to the text to be aggregated, where N is a natural number greater than 1.
  11. The apparatus according to claim 10, wherein
    the feature extraction unit is specifically configured to segment the text to be aggregated, with each Chinese character and each continuous character string as the minimum segmentation unit, to obtain a plurality of tokens; and, based on the N-gram model, to combine any N consecutive tokens among the obtained tokens into one text feature, to obtain the text feature set corresponding to the text to be aggregated.
  12. The apparatus according to claim 8 or 9, wherein the set locality-sensitive hashing algorithm includes, but is not limited to, the Simhash algorithm or the Minhash algorithm.
  13. The apparatus according to claim 8 or 9, wherein the similarity between the first text feature set and the second text feature set is represented by at least any one or more of the following similarity measures: Jaccard similarity, Euclidean distance, and Hamming distance.
  14. The apparatus according to claim 8 or 9, wherein the apparatus further comprises a preprocessing unit:
    the preprocessing unit is configured to preprocess the text to be aggregated before feature extraction is performed on the text to be aggregated whose length is not greater than the set length threshold;
    wherein the preprocessing comprises at least one or more of: removing special tags from the text to be aggregated, removing non-text special symbols from the text to be aggregated, converting the text to be aggregated between traditional and simplified Chinese characters, and normalizing runs of consecutive Latin letters and/or digits in the text to be aggregated to a set character string.
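The decision flow that the text aggregation unit performs in claims 8 and 9 can be sketched in Python as below. This is an illustrative reading of the claims, not the patented implementation. The class name `TextAggregator`, the threshold defaults (distance 3, similarity 0.7), the unweighted MD5-based Simhash, the choice of Jaccard similarity, and the linear scan standing in for a real hash index are all assumptions.

```python
import hashlib


def simhash(features, bits=64):
    """Unweighted Simhash; MD5 as the per-feature hash is an assumption."""
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(repr(feature).encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class TextAggregator:
    """Hypothetical aggregator; `index` is a plain list scanned linearly."""

    def __init__(self, max_distance=3, min_similarity=0.7):
        self.max_distance = max_distance
        self.min_similarity = min_similarity
        self.index = []    # entries: (fingerprint, feature set, class id)
        self.classes = {}  # class id -> list of member texts

    def add(self, text, features):
        fp = simhash(features)
        # Find the indexed fingerprint closest to fp within max_distance.
        best = None
        for entry in self.index:
            d = hamming(fp, entry[0])
            if d <= self.max_distance and (best is None or d < best[0]):
                best = (d, entry)
        if best is not None:
            _, (_, other_features, class_id) = best
            # Confirm the candidate with the set-level similarity check.
            if jaccard(features, other_features) >= self.min_similarity:
                self.classes[class_id].append(text)
                return class_id
        # No candidate, or the candidate was not similar enough:
        # index the new fingerprint and open a new class.
        class_id = len(self.classes)
        self.index.append((fp, features, class_id))
        self.classes[class_id] = [text]
        return class_id


if __name__ == "__main__":
    agg = TextAggregator()
    print(agg.add("t1", {("a", "b"), ("b", "c")}))
    print(agg.add("t2", {("a", "b"), ("b", "c")}))
    print(agg.add("t3", {("x", "y"), ("y", "z")}))
```

As in the claims, the index gains an entry only when a new class is created, so each class is represented by the feature set of its first member.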
PCT/CN2016/081090 2015-05-13 2016-05-05 Text aggregate method and device WO2016180268A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510242860.0 2015-05-13
CN201510242860.0A CN106294350B (en) 2015-05-13 2015-05-13 A kind of text polymerization and device

Publications (1)

Publication Number Publication Date
WO2016180268A1 true WO2016180268A1 (en) 2016-11-17

Family

ID=57248581

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081090 WO2016180268A1 (en) 2015-05-13 2016-05-05 Text aggregate method and device

Country Status (2)

Country Link
CN (1) CN106294350B (en)
WO (1) WO2016180268A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959440A (en) * 2018-06-13 2018-12-07 福建新大陆软件工程有限公司 A kind of short message clustering method and device
CN109190117A (en) * 2018-08-10 2019-01-11 中国船舶重工集团公司第七〇九研究所 A kind of short text semantic similarity calculation method based on term vector
CN109299260A (en) * 2018-09-29 2019-02-01 上海晶赞融宣科技有限公司 Data classification method, device and computer readable storage medium
CN109445844A (en) * 2018-11-05 2019-03-08 浙江网新恒天软件有限公司 Code Clones detection method based on cryptographic Hash, electronic equipment, storage medium
CN109657202A (en) * 2017-10-10 2019-04-19 北京国双科技有限公司 The method and device of text-processing
CN110147531A (en) * 2018-06-11 2019-08-20 广州腾讯科技有限公司 A kind of recognition methods, device and the storage medium of Similar Text content
CN110321433A (en) * 2019-06-26 2019-10-11 阿里巴巴集团控股有限公司 Determine the method and device of text categories
CN110991358A (en) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 Text comparison method and device based on block chain
CN111444325A (en) * 2020-03-30 2020-07-24 湖南工业大学 Method for measuring document similarity by position coding single random permutation hash
CN111506708A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Text auditing method, device, equipment and medium
CN111738437A (en) * 2020-07-17 2020-10-02 支付宝(杭州)信息技术有限公司 Training method, text generation device and electronic equipment
CN113420141A (en) * 2021-06-24 2021-09-21 中国人民解放军陆军工程大学 Sensitive data searching method based on Hash clustering and context information
CN113688629A (en) * 2021-08-04 2021-11-23 德邦证券股份有限公司 Text deduplication method and device and storage medium
CN113704465A (en) * 2021-07-21 2021-11-26 大箴(杭州)科技有限公司 Text clustering method and device, electronic equipment and storage medium
CN116341566A (en) * 2023-05-29 2023-06-27 中债金科信息技术有限公司 Text deduplication method and device, electronic equipment and storage medium
CN110147531B (en) * 2018-06-11 2024-04-23 广州腾讯科技有限公司 Method, device and storage medium for identifying similar text content

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108572961A (en) * 2017-03-08 2018-09-25 北京嘀嘀无限科技发展有限公司 A kind of the vectorization method and device of text
CN106951865B (en) * 2017-03-21 2020-04-07 东莞理工学院 Privacy protection biological identification method based on Hamming distance
CN110019531B (en) * 2017-12-29 2021-11-02 北京京东尚科信息技术有限公司 Method and device for acquiring similar object set
CN108399163B (en) * 2018-03-21 2021-01-12 北京理工大学 Text similarity measurement method combining word aggregation and word combination semantic features
CN109241505A (en) * 2018-10-09 2019-01-18 北京奔影网络科技有限公司 Text De-weight method and device
CN110134768B (en) * 2019-05-13 2023-05-26 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN110532389B (en) * 2019-08-22 2023-07-14 北京睿象科技有限公司 Text clustering method and device and computing equipment
CN110516157B (en) * 2019-08-30 2022-04-01 盈盛智创科技(广州)有限公司 Document retrieval method, document retrieval equipment and storage medium
CN111241275B (en) * 2020-01-02 2022-12-06 厦门快商通科技股份有限公司 Short text similarity evaluation method, device and equipment
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN111861201A (en) * 2020-07-17 2020-10-30 南京汇宁桀信息科技有限公司 Intelligent government affair order dispatching method based on big data classification algorithm
CN116450918B (en) * 2023-06-09 2023-08-25 辰风策划(深圳)有限公司 Online information consultation method and device and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012118659A (en) * 2010-11-30 2012-06-21 Nippon Telegr & Teleph Corp <Ntt> Information search device, information search method and program
CN103064887A (en) * 2012-12-10 2013-04-24 华为技术有限公司 Method and device for recommending information
CN103646080A (en) * 2013-12-12 2014-03-19 北京京东尚科信息技术有限公司 Microblog duplication-eliminating method and system based on reverse-order index
CN103744964A (en) * 2014-01-06 2014-04-23 同济大学 Webpage classification method based on locality sensitive Hash function
CN103914463A (en) * 2012-12-31 2014-07-09 北京新媒传信科技有限公司 Method and device for retrieving similarity of picture messages
CN104391963A (en) * 2014-12-01 2015-03-04 北京中科创益科技有限公司 Method for constructing correlation networks of keywords of natural language texts

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8060747B1 (en) * 2005-09-12 2011-11-15 Microsoft Corporation Digital signatures for embedded code
CN101477563B (en) * 2009-01-21 2010-11-10 北京百问百答网络技术有限公司 Short text clustering method and system, and its data processing device
CN102929906B (en) * 2012-08-10 2015-07-22 北京邮电大学 Text grouped clustering method based on content characteristic and subject characteristic
CN103441924B (en) * 2013-09-03 2016-06-08 盈世信息科技(北京)有限公司 A kind of rubbish mail filtering method based on short text and device
CN103970722B (en) * 2014-05-07 2017-04-05 江苏金智教育信息技术有限公司 A kind of method of content of text duplicate removal

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657202A (en) * 2017-10-10 2019-04-19 北京国双科技有限公司 The method and device of text-processing
CN109657202B (en) * 2017-10-10 2022-10-28 北京国双科技有限公司 Text processing method and device
CN110147531A (en) * 2018-06-11 2019-08-20 广州腾讯科技有限公司 A kind of recognition methods, device and the storage medium of Similar Text content
CN110147531B (en) * 2018-06-11 2024-04-23 广州腾讯科技有限公司 Method, device and storage medium for identifying similar text content
CN108959440A (en) * 2018-06-13 2018-12-07 福建新大陆软件工程有限公司 A kind of short message clustering method and device
CN109190117A (en) * 2018-08-10 2019-01-11 中国船舶重工集团公司第七〇九研究所 A kind of short text semantic similarity calculation method based on term vector
CN109299260A (en) * 2018-09-29 2019-02-01 上海晶赞融宣科技有限公司 Data classification method, device and computer readable storage medium
CN109445844A (en) * 2018-11-05 2019-03-08 浙江网新恒天软件有限公司 Code Clones detection method based on cryptographic Hash, electronic equipment, storage medium
CN110321433A (en) * 2019-06-26 2019-10-11 阿里巴巴集团控股有限公司 Determine the method and device of text categories
CN110321433B (en) * 2019-06-26 2023-04-07 创新先进技术有限公司 Method and device for determining text category
CN110991358A (en) * 2019-12-06 2020-04-10 腾讯科技(深圳)有限公司 Text comparison method and device based on block chain
CN110991358B (en) * 2019-12-06 2024-03-19 腾讯科技(深圳)有限公司 Text comparison method and device based on blockchain
CN111444325A (en) * 2020-03-30 2020-07-24 湖南工业大学 Method for measuring document similarity by position coding single random permutation hash
CN111444325B (en) * 2020-03-30 2023-06-20 湖南工业大学 Method for measuring document similarity by position coding single random replacement hash
CN111506708A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Text auditing method, device, equipment and medium
CN111738437A (en) * 2020-07-17 2020-10-02 支付宝(杭州)信息技术有限公司 Training method, text generation device and electronic equipment
CN111738437B (en) * 2020-07-17 2020-11-20 支付宝(杭州)信息技术有限公司 Training method, text generation device and electronic equipment
CN113420141B (en) * 2021-06-24 2022-10-04 中国人民解放军陆军工程大学 Sensitive data searching method based on Hash clustering and context information
CN113420141A (en) * 2021-06-24 2021-09-21 中国人民解放军陆军工程大学 Sensitive data searching method based on Hash clustering and context information
CN113704465A (en) * 2021-07-21 2021-11-26 大箴(杭州)科技有限公司 Text clustering method and device, electronic equipment and storage medium
CN113688629A (en) * 2021-08-04 2021-11-23 德邦证券股份有限公司 Text deduplication method and device and storage medium
CN116341566A (en) * 2023-05-29 2023-06-27 中债金科信息技术有限公司 Text deduplication method and device, electronic equipment and storage medium
CN116341566B (en) * 2023-05-29 2023-10-20 中债金科信息技术有限公司 Text deduplication method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106294350A (en) 2017-01-04
CN106294350B (en) 2019-10-11

Similar Documents

Publication Publication Date Title
WO2016180268A1 (en) Text aggregate method and device
JP6526329B2 (en) Web page training method and apparatus, search intention identification method and apparatus
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
US11544459B2 (en) Method and apparatus for determining feature words and server
US9197665B1 (en) Similarity search and malware prioritization
CN109635296B (en) New word mining method, device computer equipment and storage medium
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
WO2019041521A1 (en) Apparatus and method for extracting user keyword, and computer-readable storage medium
WO2021227831A1 (en) Method and apparatus for detecting subject of cyber threat intelligence, and computer storage medium
WO2017080220A1 (en) Knowledge data processing method and apparatus
CN110413787B (en) Text clustering method, device, terminal and storage medium
US20190121868A1 (en) Data clustering
WO2018095411A1 (en) Web page clustering method and device
WO2019028990A1 (en) Code element naming method, device, electronic equipment and medium
CN109271624B (en) Target word determination method, device and storage medium
WO2021143009A1 (en) Text clustering method and apparatus
CN112905753A (en) Method and device for distinguishing text information
CN113408660A (en) Book clustering method, device, equipment and storage medium
JP2021501387A (en) Methods, computer programs and computer systems for extracting expressions for natural language processing
US11500942B2 (en) Focused aggregation of classification model outputs to classify variable length digital documents
CN111160445A (en) Bid document similarity calculation method and device
Zhang et al. Effective and fast near duplicate detection via signature-based compression metrics
CN110598115A (en) Sensitive webpage identification method and system based on artificial intelligence multi-engine
US11347928B2 (en) Detecting and processing sections spanning processed document partitions
WO2015159702A1 (en) Partial-information extraction system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16792118

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16792118

Country of ref document: EP

Kind code of ref document: A1