CN117093717A

CN117093717A - Similar text aggregation method, device, equipment and storage medium thereof

Info

Publication number: CN117093717A
Application number: CN202311363982.6A
Authority: CN
Inventors: 姜桂林; 贵照众; 贺泽州; 聂萼辉
Original assignee: Hunan Caixin Digital Technology Co ltd
Current assignee: Hunan Data Industry Group Co.,Ltd.
Priority date: 2023-10-20
Filing date: 2023-10-20
Publication date: 2023-11-21
Anticipated expiration: 2043-10-20
Also published as: CN117093717B

Abstract

The embodiments of the application belong to the technical field of data processing, are applied to the de-duplication and aggregation of multi-source data texts, and relate to a similar text aggregation method, device, equipment and storage medium. The method comprises: obtaining texts to be aggregated from multi-source data ends; performing preliminary de-duplication on the texts to be aggregated according to a first screening strategy; separately extracting the text title and the text body of each text to be aggregated; preprocessing the text body; calculating, according to a preset hash coding algorithm, hash code values for the text title and the text body of each text to be aggregated; and aggregating the texts to be aggregated using the hash code values and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts. Compared with the K-Shingle algorithm, the method allows similar texts to be aggregated quickly and accurately while consuming fewer computing resources.

Description

Similar text aggregation method, device, equipment and storage medium thereof

Technical Field

The application relates to the technical field of data processing, and is applied to a de-duplication aggregation scene of multi-source data texts, in particular to a similar text aggregation method, a device, equipment and a storage medium thereof.

Background

Internet text de-duplication can be carried out in many ways, and the core approach is similarity-based text de-duplication. This approach can be abstracted as a text-to-text similarity matching problem, which is mainly solved by matching, or measuring similarity, at the vocabulary level. One example is the K-Shingle algorithm.

However, the K-Shingle algorithm needs to generate a huge shingle phrase library. When the number and length of the texts are large, building this library requires a great deal of time and space. Because the feature vector of each document depends on the shared phrase library, the feature vector calculation is difficult to parallelize fully, which slows the calculation down. The prior art therefore suffers from heavy consumption of computing resources and slow computation when aggregating similar texts.

Disclosure of Invention

The embodiment of the application aims to provide a similar text aggregation method, a device, equipment and a storage medium thereof, which are used for solving the problems that huge computing resources are consumed and the computing speed is low in the aggregation of similar texts in the prior art.

In order to solve the above technical problems, the embodiment of the present application provides a similar text aggregation method, which adopts the following technical scheme:

a method of similar text aggregation comprising the steps of:

acquiring texts to be aggregated from multi-source data ends, together with distinguishing identification information of all the texts to be aggregated, wherein each text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification;

performing preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the text to be aggregated after the preliminary de-duplication is completed, wherein the first screening strategy specifically comprises: identifying whether the identification information of the texts to be aggregated obtained from the same data end is the same or not according to the identification information of the texts to be aggregated, and if the identification information of the texts to be aggregated obtained from the same data end is the same, performing preliminary duplicate removal processing;

respectively extracting the text titles and text bodies from all the texts to be aggregated after the preliminary de-duplication is completed;

performing text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed to obtain the texts to be aggregated after the text body preprocessing is completed, wherein the text body preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text body;

According to a preset Hash coding algorithm, carrying out Hash coding value calculation on all texts to be aggregated after text preprocessing is completed one by one, and obtaining a Hash coding value corresponding to the texts to be aggregated;

and aggregating the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts, wherein the second screening strategy specifically comprises: segmenting a target hash code value according to a preset segmentation parameter to obtain the code segments of the target hash code value and the code segment distinguishing identifier of each code segment; calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm to obtain a coding distance calculation result; screening the texts to be aggregated that meet a preset first requirement according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared meets the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.

Further, the step of performing preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy to obtain the text to be aggregated after the preliminary de-duplication processing is completed specifically includes:

splitting the distinguishing identification information according to a preset splitting component to obtain source identifications and text identifications respectively corresponding to all texts to be aggregated;

based on the source identification, obtaining all texts to be aggregated corresponding to the same source identification, and generating a same source text set;

identifying whether texts with the same text identification exist in the text sets with the same source according to the text identification;

if texts with the same text identification exist in the text sets with the same source, acquiring the texts with the same text identification from the text sets with the same source, and constructing a text set to be subjected to preliminary de-duplication;

and selecting one text from the text set to be subjected to preliminary de-duplication as a target text, deleting other texts, and completing the preliminary de-duplication processing.

Further, the preset Hash coding algorithm includes a simHash coding algorithm, and the step of calculating Hash coding values one by one for all texts to be aggregated after the text body pretreatment is completed according to the preset Hash coding algorithm to obtain corresponding Hash coding values specifically includes:

Inputting all texts to be aggregated after the text pretreatment is completed one by one into a preset Hash coding algorithm component, wherein the simHash coding algorithm is built in the Hash coding algorithm component;

respectively calculating hash code values for the text title and the text body of each text to be aggregated according to the simHash coding algorithm built into the hash coding algorithm component, to generate a hash code value corresponding to the text title and a hash code value corresponding to the text body, wherein each hash code value consists of the code characters 0 and 1 and has 64 code character bits;

acquiring hash code values corresponding to text titles of all texts to be aggregated, constructing a first hash code value set, and setting distinguishing identification information for elements in the first hash code value set according to distinguishing identification information of all texts to be aggregated;

and acquiring the hash code values corresponding to the text bodies of all texts to be aggregated, constructing a second hash code value set, and setting distinguishing identification information for the elements in the second hash code value set according to the distinguishing identification information of all texts to be aggregated.

Further, the step of performing segmentation processing on the target hash code value according to a preset segmentation parameter to obtain a code segment and a code segment distinguishing identifier corresponding to the target hash code value respectively specifically includes:

equally dividing a target hash code value into M code segments according to the segmentation parameter, wherein M is the value of the segmentation parameter, and M is a positive integer greater than 1 that evenly divides the number of code character bits of the hash code value;

and setting coding segment distinguishing identifiers for the M segments respectively according to the position information of the M segments in the target hash code value in a left-to-right or right-to-left mode.
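The segmentation step above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `split_into_segments` and the use of a 0-based left-to-right position as the code segment distinguishing identifier are assumptions made for the example.

```python
def split_into_segments(code: str, m: int) -> list[tuple[int, str]]:
    """Split a binary hash string into m equal segments, labelled left to right."""
    if m <= 1 or len(code) % m != 0:
        raise ValueError("m must be greater than 1 and divide the code length evenly")
    width = len(code) // m
    # Assumption: the segment identifier is the 0-based position from the left.
    return [(i, code[i * width:(i + 1) * width]) for i in range(m)]


code = "1010" * 16  # hypothetical 64-character hash code value
segments = split_into_segments(code, 4)  # four 16-character segments
```

With a 64-bit code, valid values of M under this sketch are 2, 4, 8, 16, 32 and 64.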

Further, the step of calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifier and a preset distance algorithm to obtain a coding distance calculation result specifically includes:

randomly screening two hash code values from the second hash code value set to serve as hash code values to be compared;

acquiring coding segments corresponding to the hash coding values to be compared respectively and coding segment distinguishing identifiers corresponding to the coding segments;

counting the number of the coding characters 1 contained in the coding segment, and recording the number as a first number;

taking the code segment distinguishing identifiers and the first quantity as calculation parameters of the distance algorithm, calculating the code distance between the hash code values to be compared according to the distance algorithm,

The calculating step of the distance algorithm specifically comprises the following steps: identifying the code segments with the same code segment distinguishing identification and the same first number from the code segments corresponding to the hash code values to be compared as target code segments; counting the number of target coding segments of the hash code values to be compared, and recording the number as a second number; and calculating the difference value between the second quantity and the M through difference value operation, and taking the difference value as the coding distance.
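The distance calculation described above can be sketched as follows. The names `segment_one_counts` and `coding_distance` are hypothetical, and the sketch assumes the "difference" is taken as M minus the second number, so that two identical codes have distance 0; the text does not state the sign explicitly.

```python
def segment_one_counts(code: str, m: int) -> list[int]:
    """Count the code character '1' in each of the m equal segments (the first number)."""
    width = len(code) // m
    return [code[i * width:(i + 1) * width].count("1") for i in range(m)]


def coding_distance(code_a: str, code_b: str, m: int) -> int:
    counts_a = segment_one_counts(code_a, m)
    counts_b = segment_one_counts(code_b, m)
    # Second number: segments at the same position whose counts of '1' are equal.
    matching = sum(1 for a, b in zip(counts_a, counts_b) if a == b)
    # Assumption: distance = M - second number.
    return m - matching


a = "1111000011110000"  # short 16-character codes, for readability only
b = "1111000011100000"
distance = coding_distance(a, b, 4)  # only the third segment differs
```

Note that, as described, segments are compared by their count of 1s rather than bit by bit, so two segments with the same count but different bit positions are treated as matching; this differs from a plain Hamming distance.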

Further, the step of screening the text to be aggregated meeting the preset first requirement according to the preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group specifically includes:

identifying whether the coding distance calculation result exceeds the coding distance threshold value by comparison;

if the coding distance calculation result exceeds the coding distance threshold, the coding distance between the hash coding values to be compared does not meet the coding distance threshold, and the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which does not meet the preset first requirement;

if the coding distance calculation result does not exceed the coding distance threshold, the coding distance between the hash coding values to be compared meets the coding distance threshold, the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which meets the preset first requirement, and the text to be aggregated corresponding to the hash coding values to be compared is added into a preset text comparison group.
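The threshold screening above can be sketched as follows. The function `build_comparison_group`, the sample identifiers and the threshold value are assumptions for illustration; the sketch also enumerates every pair rather than picking pairs at random, and repeats a compact version of the segment-count distance so that the block is self-contained.

```python
from itertools import combinations


def coding_distance(code_a: str, code_b: str, m: int) -> int:
    """Segments whose counts of '1' differ, assuming distance = M - matching segments."""
    width = len(code_a) // m
    matching = sum(
        1
        for i in range(m)
        if code_a[i * width:(i + 1) * width].count("1")
        == code_b[i * width:(i + 1) * width].count("1")
    )
    return m - matching


def build_comparison_group(body_hashes: dict[str, str], m: int, threshold: int) -> set[str]:
    """Collect identifiers of texts whose body hash lies within the threshold of another."""
    group: set[str] = set()
    for (id_a, code_a), (id_b, code_b) in combinations(body_hashes.items(), 2):
        # First requirement: the coding distance does not exceed the threshold.
        if coding_distance(code_a, code_b, m) <= threshold:
            group.update((id_a, id_b))
    return group


body_hashes = {  # hypothetical distinguishing identifiers and short codes
    "srcA:1": "1111000011110000",
    "srcB:7": "1111000011100000",
    "srcC:3": "0000111100001111",
}
group = build_comparison_group(body_hashes, m=4, threshold=1)
```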

Further, the step of aggregating the text to be aggregated meeting the preset second requirement in the text comparison group as a similar text specifically includes:

acquiring all texts to be aggregated contained in the text comparison group to serve as texts to be compared;

according to the distinguishing identification information of the texts to be compared, acquiring hash code values corresponding to text titles of all the texts to be compared from the first hash code value set;

identifying texts to be compared with the same text title by comparing hash code values corresponding to the text titles of all the texts to be compared, and carrying out aggregation treatment on the texts to be compared with the same text title as similar texts;

respectively extracting the first N characters from the text bodies of all the texts to be compared according to the distinguishing identification information of the texts to be compared;

and identifying the texts to be compared with the same first N characters by comparing the first N characters respectively corresponding to all the texts to be compared, and carrying out aggregation processing on the texts to be compared with the same first N characters as similar texts.
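The second requirement can be sketched as a pair of grouping passes over the text comparison group. This is an illustrative sketch under stated assumptions: the record layout (`title_hash`, `body`), the function name `aggregate_similar` and the sample data are hypothetical, and the sketch does not merge clusters that overlap, since the text does not say whether they should be merged.

```python
from collections import defaultdict


def aggregate_similar(texts: dict[str, dict], n: int) -> list[set[str]]:
    """Group texts that share a title hash or the first n body characters."""
    by_title: dict[str, set[str]] = defaultdict(set)
    by_prefix: dict[str, set[str]] = defaultdict(set)
    for text_id, record in texts.items():
        by_title[record["title_hash"]].add(text_id)
        by_prefix[record["body"][:n]].add(text_id)
    # Keep only groups that actually contain more than one text.
    clusters = [ids for ids in by_title.values() if len(ids) > 1]
    clusters += [ids for ids in by_prefix.values() if len(ids) > 1]
    return clusters


texts = {  # hypothetical members of a text comparison group
    "a": {"title_hash": "0101", "body": "marketrallies on tuesday"},
    "b": {"title_hash": "0101", "body": "stocks rose"},
    "c": {"title_hash": "1111", "body": "marketrallies again"},
}
clusters = aggregate_similar(texts, n=6)
```

Here "a" and "b" are grouped because their title hash values are equal, and "a" and "c" are grouped because their first six body characters are equal.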

In order to solve the above technical problems, the embodiment of the present application further provides a similar text aggregation device, which adopts the following technical scheme:

A similar text aggregation apparatus, comprising:

the to-be-aggregated text acquisition module is configured to obtain texts to be aggregated from multi-source data ends, together with distinguishing identification information of all the texts to be aggregated, wherein each text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification;

the preliminary de-duplication processing module is configured to perform preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy, so as to obtain a text to be aggregated after preliminary de-duplication is completed, where the first screening policy specifically is: identifying whether the identification information of the texts to be aggregated obtained from the same data end is the same or not according to the identification information of the texts to be aggregated, and if the identification information of the texts to be aggregated obtained from the same data end is the same, performing preliminary duplicate removal processing;

the title and body extraction module is configured to respectively extract the text titles and text bodies from all the texts to be aggregated after the preliminary de-duplication is completed;

the text body preprocessing module is configured to perform text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed to obtain the texts to be aggregated after the text body preprocessing is completed, wherein the text body preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text body;

The Hash coding value calculation module is used for calculating Hash coding values of all texts to be aggregated after text pretreatment is completed according to a preset Hash coding algorithm one by one to obtain Hash coding values corresponding to the texts to be aggregated;

the text aggregation processing module is configured to aggregate the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts, wherein the second screening strategy specifically comprises: segmenting a target hash code value according to a preset segmentation parameter to obtain the code segments of the target hash code value and the code segment distinguishing identifier of each code segment; calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm to obtain a coding distance calculation result; screening the texts to be aggregated that meet a preset first requirement according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared meets the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.

In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:

a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the similar text aggregation method described above.

In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:

a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor perform the steps of a similar text aggregation method as described above.

Compared with the prior art, the embodiment of the application has the following main beneficial effects:

according to the similar text aggregation method, the texts to be aggregated in the multi-source data ends and the distinguishing identification information of all the texts to be aggregated are obtained; preliminary de-duplication is performed on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the texts to be aggregated after the preliminary de-duplication is completed; the text titles and text bodies are respectively extracted from all the texts to be aggregated after the preliminary de-duplication is completed; text body preprocessing is performed on the texts to be aggregated after the preliminary de-duplication is completed to obtain the texts to be aggregated after the text body preprocessing is completed; hash code values are calculated one by one, according to a preset hash coding algorithm, for all the texts to be aggregated after the text body preprocessing is completed, to obtain the hash code values corresponding to the texts to be aggregated; and the texts to be aggregated are aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts. Because hash code values are calculated separately for the text title and the text body of each text to be aggregated, and similar texts are screened out through the second screening strategy with its first and second requirements, the method allows similar texts to be aggregated quickly and accurately while consuming fewer computing resources than the K-Shingle algorithm.

Drawings

In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of one embodiment of a similar text aggregation method in accordance with the present application;

FIG. 3 is a flow chart of one embodiment of step 202 of FIG. 2;

FIG. 4 is a flow chart of one embodiment of step 205 of FIG. 2;

FIG. 5 is a schematic diagram illustrating the construction of one embodiment of a similar text aggregation device in accordance with the present application;

FIG. 6 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the similar text aggregation method provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the similar text aggregation apparatus is generally set in the server/terminal device.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flowchart of one embodiment of a similar text aggregation method in accordance with the present application is shown. The similar text aggregation method comprises the following steps:

Step 201, obtaining a text to be aggregated in a multi-source data end and distinguishing identification information of all the texts to be aggregated, wherein the text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification.

In this embodiment, the method for acquiring the text to be aggregated in the multi-source data end includes two modes of active acquisition and passive reception.

The active acquisition mode comprises acquisition through a synchronous acquisition mode and acquisition through a data grabbing mode, and the passive receiving mode comprises passive receiving of the text to be aggregated through an asynchronous acquisition mode;

specifically, the step of acquiring in a synchronous acquisition mode specifically includes: sending a text acquisition request to be aggregated to the multi-source data end, wherein the multi-source data end responds to the text acquisition request to be aggregated and sends the text to be aggregated to a unique receiving end, and the multi-source data end can be a plurality of databases, data warehouses, cloud databases or the like;

specifically, the step of acquiring in a data grabbing manner specifically includes: capturing the text to be aggregated from the multi-source data end through a preset page crawler component, wherein the multi-source data end can be a plurality of web page browsing ends with different IP addresses;

Specifically, the step of passively receiving the text to be aggregated in an asynchronous acquisition mode specifically includes: and transmitting/pushing the text to be aggregated from the multi-source data terminal to the unique receiving terminal in a message stream mode.

Specifically, the source identifier is different according to the difference of the multi-source data terminal, for example, the multi-source data terminal comprises three databases, the source identifier is the distinguishing identifier of the three databases, the text identifier is correspondingly different according to the difference of texts, for example, 10 texts are obtained from a certain database, and the 10 texts can correspond to 10 text identifiers, namely the distinguishing identifier of the texts.
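The splicing of a source identifier and a text identifier into distinguishing identification information can be sketched as follows. The separator character and both function names are assumptions for illustration; the text does not specify how the two parts are joined.

```python
SEPARATOR = ":"  # hypothetical delimiter between source identifier and text identifier


def make_distinguishing_id(source_id: str, text_id: str) -> str:
    """Splice the source identifier and the text identifier into one string."""
    return f"{source_id}{SEPARATOR}{text_id}"


def split_distinguishing_id(dist_id: str) -> tuple[str, str]:
    """Recover (source_id, text_id); split only on the first separator."""
    source_id, text_id = dist_id.split(SEPARATOR, 1)
    return source_id, text_id


dist_id = make_distinguishing_id("db1", "0007")
parts = split_distinguishing_id(dist_id)
```

Under this scheme, two texts with the same text identifier but different sources keep distinct distinguishing identification information.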

Step 202, performing preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the text to be aggregated after the preliminary de-duplication is completed,

the first screening strategy specifically comprises the following steps: and identifying whether the texts to be aggregated, which are acquired from the same data end, have the same condition of the distinguishing identification information according to the distinguishing identification information, and if the texts to be aggregated, which are acquired from the same data end, have the same condition of the distinguishing identification information, performing preliminary duplicate removal processing.

With continued reference to FIG. 3, FIG. 3 is a flow chart of one embodiment of step 202 shown in FIG. 2, comprising:

step 301, splitting the distinguishing identification information according to a preset splitting component to obtain source identifications and text identifications corresponding to all texts to be aggregated respectively;

step 302, based on the source identifier, obtaining all texts to be aggregated corresponding to the same source identifier, and generating the same source text set;

step 303, identifying whether texts with the same text identification exist in the text set with the same source according to the text identification;

step 304, if texts with the same text identification exist in the text sets with the same source, acquiring the texts with the same text identification from the text sets with the same source, and constructing a text set to be preliminarily de-duplicated;

and step 305, selecting one text from the text set to be subjected to preliminary duplicate removal as a target text, deleting other texts, and completing the preliminary duplicate removal processing.

In this embodiment, the preliminary de-duplication processing is performed on the text to be aggregated according to the distinguishing identifier information and a preset first screening policy, which aims to prevent obtaining the text with repeated distinguishing identifier information from the same data source.
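Steps 301 to 305 can be sketched as a single pass over the acquired texts. This is a minimal sketch under stated assumptions: the record layout (`dist_id`, `title`), the `:` separator and the choice to keep the first occurrence as the target text are all hypothetical; the text only says that one text is selected and the others are deleted.

```python
def preliminary_dedup(texts: list[dict], sep: str = ":") -> list[dict]:
    """Keep one text per (source identifier, text identifier) pair."""
    seen: set[tuple[str, str]] = set()
    kept: list[dict] = []
    for text in texts:
        # Step 301: split the distinguishing identification information.
        source_id, text_id = text["dist_id"].split(sep, 1)
        key = (source_id, text_id)
        # Steps 302-304: same source and same text identifier means a duplicate.
        if key in seen:
            continue
        # Step 305: the first occurrence is kept as the target text (an assumption).
        seen.add(key)
        kept.append(text)
    return kept


texts = [
    {"dist_id": "db1:1", "title": "A"},
    {"dist_id": "db1:1", "title": "A"},  # duplicate from the same source
    {"dist_id": "db2:1", "title": "A"},  # same text identifier, different source
]
deduped = preliminary_dedup(texts)
```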

Step 203, respectively extracting the text titles and text bodies from the texts to be aggregated after the preliminary de-duplication is completed.

In this embodiment, the text titles and text bodies are extracted separately from all the texts to be aggregated after the preliminary de-duplication is completed, in order to provide the data basis for the subsequent similar text aggregation operation.

Step 204, performing text body pretreatment on the text to be aggregated after the preliminary de-duplication is completed to obtain the text to be aggregated after the text body pretreatment is completed,

the text body preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text body.

Through this preprocessing, punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text body are cleaned and normalized, so that only literal characters are retained in the text body and the character forms are unified (for example, English is uniformly converted into Chinese, and traditional Chinese is converted into simplified Chinese), which facilitates the subsequent similar text aggregation.
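A possible shape for the text body preprocessing is sketched below using only the Python standard library. Several points are assumptions rather than details from the application: full-width forms are folded with NFKC normalization, English letters are lowercased, and the three-entry traditional-to-simplified table `T2S` is a tiny stand-in for a full conversion dictionary.

```python
import unicodedata

# Tiny illustrative traditional-to-simplified table (assumption); a real system
# would need a complete conversion dictionary.
T2S = str.maketrans({"國": "国", "學": "学", "體": "体"})


def preprocess_body(body: str) -> str:
    """Normalize a text body so that only letters and digits remain."""
    text = unicodedata.normalize("NFKC", body)  # fold full-width forms to half-width
    text = text.translate(T2S)                  # traditional -> simplified (partial table)
    text = text.lower()                         # unify English letter case
    # Keep Unicode letters (L*) and numbers (N*); drop punctuation, whitespace, symbols.
    return "".join(ch for ch in text if unicodedata.category(ch)[0] in ("L", "N"))


cleaned = preprocess_body("中國， Hello　World！")
```

In this example the full-width comma, the ideographic space and the full-width exclamation mark are removed, and 國 is mapped to 国.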

Step 205, calculating, one by one according to a preset Hash coding algorithm, the Hash code values of all texts to be aggregated after the text body preprocessing is completed, to obtain the Hash code value corresponding to each text to be aggregated.

In this embodiment, the preset Hash coding algorithm includes a simHash coding algorithm.

With continued reference to fig. 4, fig. 4 is a flow chart of one embodiment of step 205 shown in fig. 2, comprising:

step 401, inputting all texts to be aggregated after text preprocessing is completed one by one into a preset Hash coding algorithm component, wherein the simHash coding algorithm is built in the Hash coding algorithm component;

step 402, respectively performing Hash coding value calculation on a text title and a text body of the text to be aggregated according to the simHash coding algorithm built in the Hash coding algorithm component, and generating a Hash coding value corresponding to the text title and a Hash coding value corresponding to the text body, wherein the Hash coding value is composed of coding characters 0 and 1, and the number of coding character bits of the Hash coding value is 64 bits;

step 403, obtaining hash code values corresponding to text titles of all texts to be aggregated, constructing a first hash code value set, and setting distinguishing identification information for elements in the first hash code value set according to distinguishing identification information of all texts to be aggregated;

and step 404, obtaining the hash code values corresponding to the text bodies of all texts to be aggregated, constructing a second hash code value set, and setting distinguishing identification information for the elements in the second hash code value set according to the distinguishing identification information of all texts to be aggregated.

The simHash coding algorithm is used to calculate hash code values separately for the text title and the text body of each text to be aggregated, generating a hash code value corresponding to the text title and a hash code value corresponding to the text body, so that similar text aggregation can conveniently combine the two. In addition, compared with the K-shift algorithm, the simHash coding algorithm consumes less computing space and less computing time in text aggregation, and can therefore aggregate similar texts more quickly.
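For illustration, a 64-bit simHash value made of the code characters 0 and 1 could be computed roughly as follows. This is a sketch rather than the component used in the embodiment: the choice of overlapping character bigrams as features, the unit feature weight, and the use of the first 64 bits of an MD5 digest as the per-feature hash are all assumptions, and `simhash64` is a hypothetical name.

```python
import hashlib


def simhash64(text, ngram=2):
    """Compute a 64-bit simHash and return it as a string of '0'/'1'."""
    # Assumption: features are overlapping character n-grams with weight 1.
    features = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for bit in range(64):
            # Each feature votes +1 or -1 on every bit position.
            if (value >> (63 - bit)) & 1:
                weights[bit] += 1
            else:
                weights[bit] -= 1
    # A bit of the result is 1 when the votes for that position are positive.
    return "".join("1" if w > 0 else "0" for w in weights)


title_hash = simhash64("相似文本聚合方法")
body_hash = simhash64("本申请涉及数据处理技术领域")
```

The title and the body are hashed separately, mirroring step 402, so that each text ends up with one entry in the first hash code value set and one in the second.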

Step 206, aggregating the texts to be aggregated through the hash code value, the distinguishing identification information, the text title, the text body and a preset second screening policy to obtain a text aggregation result, completing the screening aggregation of similar texts,

wherein the second screening strategy specifically comprises: segmenting the target hash code value according to a preset segmentation parameter, to obtain the code segments corresponding to the target hash code value and their code segment distinguishing identifiers; calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm, to obtain a coding distance calculation result; screening out, according to a preset coding distance threshold and the coding distance calculation result, the texts to be aggregated that meet a preset first requirement, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared satisfies the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.

In this embodiment, the step of segmenting the target hash code value according to a preset segmentation parameter, to obtain the code segments corresponding to the target hash code value and their code segment distinguishing identifiers, specifically includes: equally dividing the target hash code value into M code segments according to the segmentation parameter, wherein M is the parameter value of the segmentation parameter, and M is a positive integer greater than 1 that evenly divides the number of code character bits of the hash code value; and setting code segment distinguishing identifiers for the M code segments respectively, from left to right or from right to left, based on the position information of the M code segments in the target hash code value.

In this embodiment, the target hash code value includes the hash code value corresponding to a text title and the hash code value corresponding to a text body, that is, the elements in the first hash code value set and the elements in the second hash code value set.

Specifically, for example, when the segmentation parameter is 4, the target hash code value is equally divided into 4 code segments, each of which is a 16-bit code consisting of the code characters 0 and 1. Of course, the segmentation parameter can be freely set to other values that evenly divide 64, such as 8 or 16, and the specific setting is chosen by the administrator.

Continuing with the above example, when the segmentation parameter is 4, four 16-bit code segments are obtained. Based on the position of each code segment in the hash code value, the four code segments are assigned distinguishing identifiers from left to right or from right to left, such as hash_1, hash_2, hash_3 and hash_4.
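The equal segmentation in this example can be sketched as follows, assuming the hash code value is held as a string of the characters 0 and 1 and that identifiers are assigned from left to right. The function name `split_equal` is hypothetical.

```python
def split_equal(hash_value, m=4):
    """Split a '0'/'1' hash string into m equal segments keyed hash_1..hash_m."""
    if m <= 1 or len(hash_value) % m != 0:
        raise ValueError("m must be greater than 1 and divide the hash length")
    size = len(hash_value) // m
    # Identifiers are assigned left to right by segment position.
    return {
        f"hash_{i + 1}": hash_value[i * size:(i + 1) * size]
        for i in range(m)
    }


equal_segments = split_equal("1" * 16 + "0" * 16 + "10" * 8 + "01" * 8, m=4)
```

A segmentation parameter that does not evenly divide the 64 code character bits, such as 12, is rejected by the check at the top of the function.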

In this embodiment, the step of performing the segmentation processing on the target hash code value according to the preset segmentation parameter to obtain the code segment and the code segment distinguishing identifier corresponding to the target hash code value respectively may further include: randomly dividing a target hash code value into L segments of code segments according to the segment parameters, wherein L is a parameter value of the segment parameters, and L is a positive integer greater than 1 and less than or equal to the code character bit number of the hash code value; and setting coding segment distinguishing identifiers for the L segments respectively according to the left-to-right or right-to-left mode based on the position information of the L segments in the target hash code value.
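The random segmentation variant might look like the sketch below. The embodiment does not say how the random cut positions are chosen or shared; the sketch assumes that one set of cut positions is drawn once and then reused for every hash code value, so that segments with the same identifier remain comparable. The names `random_cut_points` and `split_at` and the `seed` parameter are hypothetical.

```python
import random


def random_cut_points(length, num_segments, seed=None):
    """Pick num_segments - 1 distinct cut positions inside the hash."""
    if not 1 < num_segments <= length:
        raise ValueError("num_segments must be in the range 2..length")
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, length), num_segments - 1))


def split_at(hash_value, cuts):
    """Split the hash at the cut positions; identifiers run left to right."""
    bounds = [0] + list(cuts) + [len(hash_value)]
    return {
        f"hash_{i + 1}": hash_value[bounds[i]:bounds[i + 1]]
        for i in range(len(bounds) - 1)
    }


# Assumption: the same cuts are reused for all hash code values.
cuts = random_cut_points(64, num_segments=5, seed=42)
random_segments = split_at("01" * 32, cuts)
```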

In this embodiment, the step of calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm, to obtain a coding distance calculation result, specifically includes: arbitrarily selecting two hash code values from the second hash code value set as the hash code values to be compared; acquiring the code segments corresponding to each of the hash code values to be compared and the code segment distinguishing identifiers corresponding to the code segments; counting the number of code characters 1 contained in each code segment and recording it as a first number; and taking the code segment distinguishing identifiers and the first numbers as calculation parameters of the distance algorithm, and calculating the coding distance between the hash code values to be compared according to the distance algorithm.

Specifically, by counting the number of code characters 1 contained in each code segment, the code segments having the same code segment distinguishing identifier and the same first number are identified, among the code segments corresponding to the hash code values to be compared, as target code segments. The aim is to obtain the comparison result for the whole hash code values corresponding to the texts from the segment-wise comparison results, so that, compared with directly comparing the hash code values, the comparison is performed more quickly and computing resources are saved.

In this embodiment, the step of calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and the preset distance algorithm, to obtain a coding distance calculation result, may further include: counting the number of code characters 1 contained in each hash code value in the second hash code value set and recording it as a third number; screening out, from the second hash code value set, the hash code values having the same third number, to construct a hash code value set to be compared; arbitrarily selecting two hash code values from the hash code value set to be compared as the hash code values to be compared; acquiring the code segments corresponding to each of the hash code values to be compared and the code segment distinguishing identifiers corresponding to the code segments; for code segments having the same code segment distinguishing identifier, counting the difference in the number of code characters 1 they contain and recording it as a fourth number; and calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and the fourth numbers,

wherein the step of calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and the fourth numbers specifically includes: obtaining the fourth numbers corresponding to all the code segment distinguishing identifiers and summing them to obtain a sum value; and taking the sum value as the coding distance.
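The distance calculation based on the third and fourth numbers can be sketched as follows. Taking the per-segment difference as an absolute value is an assumption made here so that the sum is non-negative; equal segmentation with four segments is also assumed, and all function names are hypothetical.

```python
def split_equal(hash_value, m=4):
    """Split a '0'/'1' string into m equal segments keyed hash_1..hash_m."""
    size = len(hash_value) // m
    return {f"hash_{i + 1}": hash_value[i * size:(i + 1) * size] for i in range(m)}


def same_total_ones(hashes):
    """Group hash values by their total count of '1' (the third number)."""
    groups = {}
    for h in hashes:
        groups.setdefault(h.count("1"), []).append(h)
    return groups


def coding_distance(hash_a, hash_b, m=4):
    """Sum, over segments with the same identifier, the difference in '1' counts."""
    seg_a, seg_b = split_equal(hash_a, m), split_equal(hash_b, m)
    # Assumption: the per-segment difference (the fourth number) is taken
    # as an absolute value so that the distance is non-negative.
    return sum(abs(seg_a[k].count("1") - seg_b[k].count("1")) for k in seg_a)


hash_a = "1" * 16 + "0" * 48
hash_b = "0" * 16 + "1" * 16 + "0" * 32
hash_c = "10" * 8 + "0" * 48
groups = same_total_ones([hash_a, hash_b, hash_c])
distance_ab = coding_distance(hash_a, hash_b)
distance_ac = coding_distance(hash_a, hash_c)
```

Because only the counts of the code character 1 are compared, two segments that contain the same number of 1s in different positions contribute 0 to this distance, which differs from a bit-by-bit comparison of the hash code values.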

In this embodiment, the step of screening the text to be aggregated, which meets a preset first requirement, according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group specifically includes: identifying whether the coding distance calculation result exceeds the coding distance threshold value by comparison; if the coding distance calculation result exceeds the coding distance threshold, the coding distance between the hash coding values to be compared does not meet the coding distance threshold, and the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which does not meet the preset first requirement; if the coding distance calculation result does not exceed the coding distance threshold, the coding distance between the hash coding values to be compared meets the coding distance threshold, the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which meets the preset first requirement, and the text to be aggregated corresponding to the hash coding values to be compared is added into a preset text comparison group.
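The threshold screening can be illustrated with the sketch below. The threshold value of 3, the identifiers such as `A:1`, and the compact per-segment distance function are assumptions chosen for the example; the embodiment leaves the threshold as a preset parameter.

```python
from itertools import combinations


def coding_distance(hash_a, hash_b, m=4):
    """Per-segment difference in the count of '1', summed over m segments."""
    size = len(hash_a) // m
    return sum(
        abs(hash_a[i:i + size].count("1") - hash_b[i:i + size].count("1"))
        for i in range(0, len(hash_a), size)
    )


def build_comparison_group(body_hashes, threshold):
    """Return the identifier pairs whose coding distance does not exceed the threshold."""
    pairs = []
    for (id_a, hash_a), (id_b, hash_b) in combinations(body_hashes.items(), 2):
        # First requirement: the distance must not exceed the threshold.
        if coding_distance(hash_a, hash_b) <= threshold:
            pairs.append((id_a, id_b))
    return pairs


# Keys are distinguishing identifiers; values are body hash code values.
body_hashes = {
    "A:1": "1" * 8 + "0" * 56,
    "B:7": "1" * 9 + "0" * 55,
    "C:3": "1" * 40 + "0" * 24,
}
comparison_group = build_comparison_group(body_hashes, threshold=3)
```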

In this embodiment, the step of aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet the preset second requirement specifically includes: acquiring all the texts to be aggregated contained in the text comparison group as the texts to be compared; acquiring, from the first hash code value set and according to the distinguishing identification information of the texts to be compared, the hash code values corresponding to the text titles of all the texts to be compared; identifying the texts to be compared that have the same text title by comparing the hash code values corresponding to the text titles of all the texts to be compared, and aggregating the texts to be compared that have the same text title as similar texts; extracting, according to the distinguishing identification information of the texts to be compared, the first N characters from the text body of each text to be compared; and identifying the texts to be compared that have the same first N characters by comparing the first N characters of all the texts to be compared, and aggregating the texts to be compared that have the same first N characters as similar texts.

In essence, after text similarity calculation, the embodiment identifies the texts to be compared with the same text title by comparing the hash code values corresponding to the text titles of all the texts to be compared, or identifies the texts to be compared with the same first N text characters by comparing the first N text characters respectively corresponding to all the texts to be compared, and then aggregates the texts to be compared according to the identified texts to be compared, thereby further ensuring the accuracy of similar text aggregation.
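One way to sketch this second-stage check is shown below. It assumes that the second requirement is tested on each pair that already met the first requirement, that identical title hash code values stand for identical titles, and that matching pairs are merged transitively with a small union-find structure; none of these details is fixed by the embodiment, and the names and sample data are hypothetical.

```python
def aggregate(pairs, title_hashes, bodies, n=8):
    """Merge pairs that share a title hash or the first n body characters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        same_title = title_hashes[a] == title_hashes[b]
        same_prefix = bodies[a][:n] == bodies[b][:n]
        # Second requirement: identical title OR identical first n characters.
        if same_title or same_prefix:
            parent[find(a)] = find(b)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return [sorted(c) for c in clusters.values()]


pairs = [("A:1", "B:7"), ("B:7", "C:3"), ("D:4", "E:5")]
title_hashes = {"A:1": "0101", "B:7": "0101", "C:3": "1111", "D:4": "0000", "E:5": "0011"}
bodies = {
    "A:1": "今日股市大幅上涨成交量放大",
    "B:7": "今日股市大幅上涨成交活跃",
    "C:3": "今日股市大幅上涨板块轮动",
    "D:4": "天气预报",
    "E:5": "体育新闻",
}
clusters = aggregate(pairs, title_hashes, bodies, n=8)
```

In the sample data, A:1 and B:7 share a title hash code value, B:7 and C:3 share their first eight body characters, and D:4 and E:5 satisfy neither condition, so only the first three texts are aggregated.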

In this embodiment, the step of identifying the text to be compared with the same text title by comparing the hash code values corresponding to the text titles of all the texts to be compared specifically includes: randomly screening two hash code values from the first hash code value set to serve as hash code values to be compared; acquiring coding segments corresponding to the hash coding values to be compared respectively and coding segment distinguishing identifiers corresponding to the coding segments; counting the number of the coding characters 1 contained in the coding segment, and recording the number as a first number; and taking the code segment distinguishing identifiers and the first quantity as calculation parameters of the distance algorithm, calculating the code distance between the hash code values to be compared according to the distance algorithm, and if the code distance between the hash code values to be compared is 0, the text titles corresponding to the hash code values to be compared are the same.

In this embodiment, after the step of aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet the preset second requirement, the method further includes: acquiring the aggregation processing result; acquiring, according to the aggregation processing result, the distinguishing identification information of all texts in the same aggregation set; and selecting, according to a preset election rule, a target distinguishing identifier from the distinguishing identification information of all texts in the same aggregation set as the identification information of the aggregation set.
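The election step can be sketched as follows. The embodiment does not state the election rule, so choosing the lexicographically smallest distinguishing identifier is purely an assumption for illustration, as are the function name and the sample identifiers.

```python
def elect_cluster_id(cluster_ids):
    """Assumed election rule: the lexicographically smallest identifier wins."""
    return min(cluster_ids)


# Each inner list holds the distinguishing identifiers of one aggregation set.
aggregation_sets = [["B:7", "A:1", "C:3"], ["E:5", "D:4"]]
labelled = {elect_cluster_id(ids): sorted(ids) for ids in aggregation_sets}
```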

In this method, hash code values are calculated separately for the text title and the text body of each text to be aggregated, and similar texts are screened out through the second screening strategy, the first requirement and the second requirement, so that, compared with the K-shift algorithm, similar texts can be aggregated quickly and accurately with less consumption of computing resources.

The method obtains the texts to be aggregated from the multi-source data end together with the distinguishing identification information of all the texts to be aggregated; performs preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy, to obtain the texts to be aggregated after the preliminary de-duplication is completed; extracts the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed; performs text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed, to obtain the texts to be aggregated after the text body preprocessing is completed; calculates, one by one according to a preset Hash coding algorithm, the Hash code values of all the texts to be aggregated after the text body preprocessing is completed, to obtain the Hash code value corresponding to each text to be aggregated; and aggregates the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy, to obtain a text aggregation result and complete the screening and aggregation of similar texts.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.

Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.

With further reference to fig. 5, as an implementation of the method shown in fig. 2 described above, the present application provides an embodiment of a similar text aggregation apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.

As shown in fig. 5, the similar text aggregation apparatus 500 according to the present embodiment includes: a text to be aggregated acquisition module 501, a preliminary deduplication processing module 502, a title and text extraction module 503, a text preprocessing module 504, a Hash code value calculation module 505 and a text aggregation processing module 506. Wherein:

the text to be aggregated obtaining module 501 is configured to obtain a text to be aggregated in a multi-source data end and distinguishing identification information of all the texts to be aggregated, where the text to be aggregated includes a text title and a text body, and the distinguishing identification information is formed by splicing a source identifier and a text identifier;

the preliminary de-duplication processing module 502 is configured to perform preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy, so as to obtain a text to be aggregated after the preliminary de-duplication is completed, where the first screening policy specifically is: identifying whether the identification information of the texts to be aggregated obtained from the same data end is the same or not according to the identification information of the texts to be aggregated, and if the identification information of the texts to be aggregated obtained from the same data end is the same, performing preliminary duplicate removal processing;

the title and text extraction module 503 is configured to respectively extract the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed;

the text preprocessing module 504 is configured to perform text body preprocessing on the text to be aggregated after the preliminary de-duplication is completed, so as to obtain the text to be aggregated after the text body preprocessing is completed, where the text body preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text body;

the Hash code value calculation module 505 is configured to perform Hash code value calculation on the text to be aggregated after all text body preprocessing is completed one by one according to a preset Hash code algorithm, so as to obtain a Hash code value corresponding to the text to be aggregated;

the text aggregation processing module 506 is configured to aggregate the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy, so as to obtain a text aggregation result and complete the screening and aggregation of similar texts, where the second screening strategy specifically is: segmenting the target hash code value according to a preset segmentation parameter, to obtain the code segments corresponding to the target hash code value and their code segment distinguishing identifiers; calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm, to obtain a coding distance calculation result; screening out, according to a preset coding distance threshold and the coding distance calculation result, the texts to be aggregated that meet a preset first requirement, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared satisfies the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.

Those skilled in the art will appreciate that all or part of the above embodiment methods may be implemented by computer readable instructions stored on a computer readable storage medium, and that the instructions, when executed, may include the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or may be a random access memory (Random Access Memory, RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 6, fig. 6 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 6 comprises a memory 6a, a processor 6b and a network interface 6c that are communicatively connected to each other via a system bus. It should be noted that only the computer device 6 having components 6a-6c is shown in the figure, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices and the like.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 6a includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 6a may be an internal storage unit of the computer device 6, such as a hard disk or an internal memory of the computer device 6. In other embodiments, the memory 6a may also be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device 6. Of course, the memory 6a may also comprise both an internal storage unit of the computer device 6 and an external storage device. In this embodiment, the memory 6a is typically used to store the operating system and the various types of application software installed on the computer device 6, such as the computer readable instructions of the similar text aggregation method. Further, the memory 6a may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 6b may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 6b is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 6b is configured to execute the computer readable instructions stored in the memory 6a or to process data, for example to execute the computer readable instructions of the similar text aggregation method.

The network interface 6c may comprise a wireless network interface or a wired network interface, which network interface 6c is typically used to establish a communication connection between the computer device 6 and other electronic devices.

The computer device provided by this embodiment belongs to the technical field of data processing and is applied to the de-duplication and aggregation scenario of multi-source data texts. The method obtains the texts to be aggregated from the multi-source data end together with the distinguishing identification information of all the texts to be aggregated; performs preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy, to obtain the texts to be aggregated after the preliminary de-duplication is completed; extracts the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed; performs text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed, to obtain the texts to be aggregated after the text body preprocessing is completed; calculates, one by one according to a preset Hash coding algorithm, the Hash code values of all the texts to be aggregated after the text body preprocessing is completed, to obtain the Hash code value corresponding to each text to be aggregated; and aggregates the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy, to obtain a text aggregation result and complete the screening and aggregation of similar texts. Hash code values are calculated separately for the text title and the text body of each text to be aggregated, and similar texts are screened out through the second screening strategy, the first requirement and the second requirement, so that, compared with the K-shift algorithm, similar texts can be aggregated quickly and accurately with less consumption of computing resources.

The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by a processor to cause the processor to perform steps of a similar text aggregation method as described above.

The computer readable storage medium provided by this embodiment belongs to the technical field of data processing and is applied to the de-duplication and aggregation scenario of multi-source data texts. The method obtains the texts to be aggregated from the multi-source data end together with the distinguishing identification information of all the texts to be aggregated; performs preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy, to obtain the texts to be aggregated after the preliminary de-duplication is completed; extracts the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed; performs text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed, to obtain the texts to be aggregated after the text body preprocessing is completed; calculates, one by one according to a preset Hash coding algorithm, the Hash code values of all the texts to be aggregated after the text body preprocessing is completed, to obtain the Hash code value corresponding to each text to be aggregated; and aggregates the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy, to obtain a text aggregation result and complete the screening and aggregation of similar texts. Hash code values are calculated separately for the text title and the text body of each text to be aggregated, and similar texts are screened out through the second screening strategy, the first requirement and the second requirement, so that, compared with the K-shift algorithm, similar texts can be aggregated quickly and accurately with less consumption of computing resources.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the methods according to the embodiments of the present application.

It is apparent that the above-described embodiments are only some, and not all, of the embodiments of the present application. The preferred embodiments of the application are shown in the drawings, which do not limit the scope of the patent claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may be modified, or that equivalents may be substituted for some of their technical features. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims

1. A method of similar text aggregation comprising the steps of:

2. The method for aggregating similar texts according to claim 1, wherein the step of performing preliminary deduplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy to obtain the text to be aggregated after the preliminary deduplication is completed specifically comprises:

3. The method for aggregating similar texts according to claim 1, wherein the preset hash coding algorithm includes a simHash coding algorithm, and the step of calculating hash code values, one by one and according to the preset hash coding algorithm, for all texts to be aggregated after text preprocessing is completed, to obtain the corresponding hash code values, specifically includes:

4. The method for aggregating similar texts according to claim 3, wherein the step of performing segmentation processing on the target hash code values according to preset segmentation parameters to obtain code segments and code segment distinguishing identifiers corresponding to the target hash code values respectively specifically comprises the following steps:

5. The method for aggregating similar texts according to claim 4, wherein the step of calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifier and a preset distance algorithm to obtain a coding distance calculation result specifically comprises the following steps:

6. The method for aggregating similar texts according to claim 5, wherein the step of screening texts to be aggregated meeting a preset first requirement according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group specifically comprises the following steps:

7. The method for aggregating similar texts according to any one of claims 3 to 6, wherein the step of aggregating the texts to be aggregated, which meet the preset second requirement, in the text comparison group as similar texts specifically includes:

8. A similar text aggregation apparatus, comprising:

9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the similar text aggregation method of any one of claims 1 to 7.

10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the similar text aggregation method of any one of claims 1 to 7.