CN117093717A - Similar text aggregation method, device, equipment and storage medium thereof - Google Patents

Info

Publication number
CN117093717A
Authority
CN
China
Prior art keywords
text
texts
aggregated
coding
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311363982.6A
Other languages
Chinese (zh)
Other versions
CN117093717B (en)
Inventor
姜桂林
贵照众
贺泽州
聂萼辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Data Industry Group Co.,Ltd.
Original Assignee
Hunan Caixin Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Caixin Digital Technology Co ltd filed Critical Hunan Caixin Digital Technology Co ltd
Priority to CN202311363982.6A priority Critical patent/CN117093717B/en
Publication of CN117093717A publication Critical patent/CN117093717A/en
Application granted granted Critical
Publication of CN117093717B publication Critical patent/CN117093717B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application belongs to the technical field of data processing, is applied to a de-duplication and aggregation scenario for multi-source data texts, and relates to a similar text aggregation method, device, equipment and storage medium. The method comprises: obtaining texts to be aggregated from a multi-source data end; performing preliminary de-duplication on the texts to be aggregated according to a first screening strategy; extracting the text title and the text body of each text to be aggregated; preprocessing the text body; calculating, according to a preset hash coding algorithm, hash code values for the text title and the text body of each text to be aggregated; and aggregating the texts to be aggregated through the hash code values and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts. Compared with the K-Shingle algorithm, the method allows similar texts to be aggregated quickly and accurately while consuming fewer computing resources.

Description

Similar text aggregation method, device, equipment and storage medium thereof
Technical Field
The application relates to the technical field of data processing, and is applied to a de-duplication aggregation scene of multi-source data texts, in particular to a similar text aggregation method, a device, equipment and a storage medium thereof.
Background
There are many methods for de-duplicating internet text, and the core approach is similarity-based text de-duplication. This approach can be abstracted as a text-to-text similarity matching problem, which mainly addresses matching or similarity at the vocabulary level. One example is the K-Shingle algorithm.
However, the K-Shingle algorithm needs to generate a huge shingle phrase library. When the number and length of the texts are large, computing this phrase library takes a great deal of time and space. In addition, the feature vector of each document depends on the shared phrase library, so the feature vector calculation is difficult to parallelize fully, which slows the calculation down. The prior art therefore suffers from heavy consumption of computing resources and slow calculation when aggregating similar texts.
Disclosure of Invention
The embodiment of the application aims to provide a similar text aggregation method, a device, equipment and a storage medium thereof, which are used for solving the problems that huge computing resources are consumed and the computing speed is low in the aggregation of similar texts in the prior art.
In order to solve the above technical problems, the embodiment of the present application provides a similar text aggregation method, which adopts the following technical scheme:
a method of similar text aggregation comprising the steps of:
acquiring texts to be aggregated from a multi-source data end, together with distinguishing identification information of all the texts to be aggregated, wherein each text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification;
performing preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the texts to be aggregated after the preliminary de-duplication is completed, wherein the first screening strategy specifically comprises: identifying, according to the distinguishing identification information, whether texts to be aggregated obtained from the same data end have the same distinguishing identification information, and if so, performing the preliminary de-duplication processing;
extracting the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed;
performing text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed to obtain the texts to be aggregated after the text body preprocessing is completed, wherein the text body preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese in the text body;
calculating, according to a preset hash coding algorithm, hash code values one by one for all texts to be aggregated after the text body preprocessing is completed, to obtain the hash code values corresponding to the texts to be aggregated;
and aggregating the texts to be aggregated through the hash code values, the distinguishing identification information, the text title, the text body and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts, wherein the second screening strategy specifically comprises: segmenting a target hash code value according to a preset segmentation parameter to obtain the code segments of the target hash code value and the code segment distinguishing identifier of each code segment; calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm to obtain a coding distance calculation result; screening, according to a preset coding distance threshold and the coding distance calculation result, the texts to be aggregated that meet a preset first requirement, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared satisfies the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.
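The text body preprocessing step listed above (cleaning punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese) could be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess_body`, the lower-casing of English letters, and the three-entry `TRAD_TO_SIMP` table are all assumptions made for the example, and a real system would need a complete traditional-to-simplified dictionary.

```python
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified table (assumption);
# a production system would load a full conversion dictionary instead.
TRAD_TO_SIMP = {"體": "体", "龍": "龙", "後": "后"}


def preprocess_body(body: str) -> str:
    """Normalize a text body before hashing (illustrative only)."""
    # NFKC folds full-width Latin letters, digits and punctuation to ASCII forms.
    text = unicodedata.normalize("NFKC", body)
    cleaned = []
    for ch in text:
        # Drop whitespace and every Unicode punctuation character.
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue
        # Map traditional characters to simplified ones where the table knows them.
        cleaned.append(TRAD_TO_SIMP.get(ch, ch))
    # Lower-case English letters so case differences do not change the hash.
    return "".join(cleaned).lower()
```

Normalizing before hashing means that two copies of an article that differ only in spacing, punctuation style or script variant yield the same input to the hash step.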
Further, the step of performing preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy to obtain the text to be aggregated after the preliminary de-duplication processing is completed specifically includes:
splitting the distinguishing identification information according to a preset splitting component to obtain source identifications and text identifications respectively corresponding to all texts to be aggregated;
based on the source identification, obtaining all texts to be aggregated corresponding to the same source identification, and generating a same source text set;
identifying whether texts with the same text identification exist in the text sets with the same source according to the text identification;
if texts with the same text identification exist in the text sets with the same source, acquiring the texts with the same text identification from the text sets with the same source, and constructing a text set to be subjected to preliminary de-duplication;
and selecting one text from the text set to be subjected to preliminary de-duplication as a target text, deleting other texts, and completing the preliminary de-duplication processing.
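The preliminary de-duplication steps above can be sketched in a few lines. The sketch assumes each text is a dictionary with an `id` field holding the spliced identification, that the splice separator is `"::"`, and that the first text encountered is the one kept; none of these details are specified in the source.

```python
SEPARATOR = "::"  # assumed splice separator; the source does not name one


def split_identifier(identifier: str) -> tuple:
    """Split a spliced identification into (source_id, text_id)."""
    source_id, text_id = identifier.split(SEPARATOR, 1)
    return source_id, text_id


def preliminary_dedup(texts: list) -> list:
    """Keep one text per (source_id, text_id) pair, preserving input order."""
    kept = {}
    for text in texts:
        key = split_identifier(text["id"])
        # setdefault keeps the first text seen for this key and ignores repeats.
        kept.setdefault(key, text)
    return list(kept.values())
```

Grouping by the pair (source identification, text identification) has the same effect as first building a same-source text set and then looking for repeated text identifications inside it.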
Further, the preset Hash coding algorithm includes a simHash coding algorithm, and the step of calculating Hash coding values one by one for all texts to be aggregated after the text body pretreatment is completed according to the preset Hash coding algorithm to obtain corresponding Hash coding values specifically includes:
Inputting all texts to be aggregated after the text pretreatment is completed one by one into a preset Hash coding algorithm component, wherein the simHash coding algorithm is built in the Hash coding algorithm component;
respectively carrying out hash code value calculation on the text title and the text body of each text to be aggregated according to the simHash coding algorithm built into the Hash coding algorithm component, to generate a hash code value corresponding to the text title and a hash code value corresponding to the text body, wherein each hash code value consists of the coding characters 0 and 1, and the number of coding character bits of each hash code value is 64;
acquiring hash code values corresponding to text titles of all texts to be aggregated, constructing a first hash code value set, and setting distinguishing identification information for elements in the first hash code value set according to distinguishing identification information of all texts to be aggregated;
and acquiring hash code values corresponding to the text bodies of all texts to be aggregated, constructing a second hash code value set, and setting distinguishing identification information for elements in the second hash code value set according to the distinguishing identification information of all texts to be aggregated.
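As an illustration of the hashing steps above, the following sketch computes a 64-character string of 0s and 1s with a basic simHash and builds the two hash code value sets keyed by distinguishing identification. The tokenization (character bigrams with equal weights) and the use of MD5 as the per-token hash are assumptions for the example; the source only states that a simHash algorithm producing 64 bits is used.

```python
import hashlib


def simhash64(text: str, ngram: int = 2) -> str:
    """Return a 64-character string of '0'/'1' (basic, unweighted simHash)."""
    weights = [0] * 64
    # Character n-grams as features (assumption); a short text yields one token.
    tokens = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of MD5
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    value = 0
    for bit in range(64):
        if weights[bit] > 0:
            value |= 1 << bit
    return format(value, "064b")


def build_hash_sets(texts: list) -> tuple:
    """Return (first_set, second_set): title hashes and body hashes by text id."""
    first_set = {t["id"]: simhash64(t["title"]) for t in texts}
    second_set = {t["id"]: simhash64(t["body"]) for t in texts}
    return first_set, second_set
```

Because each text is hashed independently of every other text, this step can be run in parallel, which is the property the description contrasts with a shared phrase library.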
Further, the step of performing segmentation processing on the target hash code value according to a preset segmentation parameter to obtain a code segment and a code segment distinguishing identifier corresponding to the target hash code value respectively specifically includes:
equally dividing the target hash code value into M code segments according to the segmentation parameter, wherein M is the parameter value of the segmentation parameter, and M is a positive integer greater than 1 that evenly divides the number of coding character bits of the hash code value;
and setting coding segment distinguishing identifiers for the M segments respectively according to the position information of the M segments in the target hash code value in a left-to-right or right-to-left mode.
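The segmentation step can be sketched as below. The sketch assumes the hash code value is held as a string of 0s and 1s and that segment identifiers are 0-based integers assigned from left to right; the source allows either direction and does not fix the identifier format.

```python
def split_code(code: str, m: int) -> dict:
    """Split a bit string into m equal segments keyed by a left-to-right index."""
    if m <= 1 or len(code) % m != 0:
        raise ValueError("m must be greater than 1 and evenly divide the code length")
    size = len(code) // m
    # Segment identifier i covers characters [i * size, (i + 1) * size).
    return {i: code[i * size:(i + 1) * size] for i in range(m)}
```

For a 64-bit code, valid values of M under this reading are 2, 4, 8, 16, 32 and 64.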
Further, the step of calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifier and a preset distance algorithm to obtain a coding distance calculation result specifically includes:
randomly screening two hash code values from the second hash code value set to serve as hash code values to be compared;
acquiring coding segments corresponding to the hash coding values to be compared respectively and coding segment distinguishing identifiers corresponding to the coding segments;
counting the number of the coding characters 1 contained in the coding segment, and recording the number as a first number;
taking the code segment distinguishing identifiers and the first quantity as calculation parameters of the distance algorithm, and calculating the code distance between the hash code values to be compared according to the distance algorithm,
wherein the calculating step of the distance algorithm specifically comprises the following steps: identifying, from the code segments corresponding to the hash code values to be compared, the code segments that have the same code segment distinguishing identifier and the same first quantity, as target code segments; counting the number of target code segments of the hash code values to be compared, and recording the number as a second quantity; and calculating the difference between the second quantity and M through a difference operation, and taking the difference as the coding distance.
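A minimal sketch of the distance calculation described above follows. It reads "the difference between the second quantity and M" as M minus the second quantity so that the distance is non-negative; that reading, the function name, and the bit-string representation are assumptions.

```python
def coding_distance(code_a: str, code_b: str, m: int) -> int:
    """Count the segments whose number of '1' characters differs between two codes."""
    if len(code_a) != len(code_b) or m <= 1 or len(code_a) % m != 0:
        raise ValueError("codes must have equal length that m evenly divides")
    size = len(code_a) // m
    second_quantity = 0
    for i in range(m):
        seg_a = code_a[i * size:(i + 1) * size]
        seg_b = code_b[i * size:(i + 1) * size]
        # "First quantity": the count of '1' characters in each segment.
        if seg_a.count("1") == seg_b.count("1"):
            second_quantity += 1  # same identifier i and same first quantity
    return m - second_quantity
```

Note that this measure compares only the count of 1s in each segment, not their positions, so it is coarser than a bitwise Hamming distance: two segments with the same number of 1s in different positions are treated as matching.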
Further, the step of screening the text to be aggregated meeting the preset first requirement according to the preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group specifically includes:
identifying whether the coding distance calculation result exceeds the coding distance threshold value by comparison;
if the coding distance calculation result exceeds the coding distance threshold, the coding distance between the hash coding values to be compared does not meet the coding distance threshold, and the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which does not meet the preset first requirement;
if the coding distance calculation result does not exceed the coding distance threshold, the coding distance between the hash coding values to be compared meets the coding distance threshold, the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which meets the preset first requirement, and the text to be aggregated corresponding to the hash coding values to be compared is added into a preset text comparison group.
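The threshold screening above might look like the following sketch. The source describes picking two hash code values at a time from the second hash code value set; enumerating every pair with `itertools.combinations` is an assumption about how that selection is repeated, and the distance function is passed in as a parameter so that the sketch stays independent of any particular distance algorithm.

```python
from itertools import combinations


def build_comparison_group(second_set: dict, distance_fn, threshold: int) -> set:
    """Return the ids of texts whose body hash is within threshold of another's."""
    group = set()
    for (id_a, code_a), (id_b, code_b) in combinations(second_set.items(), 2):
        # A pair meets the first requirement when its distance does not
        # exceed the threshold; both texts then join the comparison group.
        if distance_fn(code_a, code_b) <= threshold:
            group.update((id_a, id_b))
    return group
```

Checking all pairs is quadratic in the number of texts, so for large collections some form of bucketing by segment would likely be needed; the source does not describe one in this passage.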
Further, the step of aggregating the text to be aggregated meeting the preset second requirement in the text comparison group as a similar text specifically includes:
acquiring all texts to be aggregated contained in the text comparison group to serve as texts to be compared;
according to the distinguishing identification information of the texts to be compared, acquiring hash code values corresponding to text titles of all the texts to be compared from the first hash code value set;
identifying texts to be compared with the same text title by comparing hash code values corresponding to the text titles of all the texts to be compared, and carrying out aggregation treatment on the texts to be compared with the same text title as similar texts;
respectively extracting the first N characters from the text bodies of all the texts to be compared according to the distinguishing identification information of the texts to be compared;
and identifying the texts to be compared with the same first N characters by comparing the first N characters respectively corresponding to all the texts to be compared, and carrying out aggregation processing on the texts to be compared with the same first N characters as similar texts.
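The second requirement can be sketched as a grouping over two keys: the title hash and the first N characters of the body. The sketch below merges groups transitively with a small union-find, so that a text sharing a title with one text and a prefix with another lands in the same cluster; that transitive merging, along with the field names `id` and `body`, is an assumption rather than something the source states.

```python
from collections import defaultdict


def aggregate_similar(texts: list, title_hashes: dict, n: int) -> list:
    """Cluster texts that share a title hash or the first n body characters."""
    parent = {t["id"]: t["id"] for t in texts}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for t in texts:
        keys = (("title", title_hashes[t["id"]]), ("prefix", t["body"][:n]))
        for key in keys:
            if key in first_seen:
                # Same title hash or same prefix: merge into the earlier cluster.
                parent[find(t["id"])] = find(first_seen[key])
            else:
                first_seen[key] = t["id"]

    clusters = defaultdict(list)
    for t in texts:
        clusters[find(t["id"])].append(t["id"])
    # Only clusters with at least two members represent aggregated similar texts.
    return [ids for ids in clusters.values() if len(ids) > 1]
```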
In order to solve the above technical problems, the embodiment of the present application further provides a similar text aggregation device, which adopts the following technical scheme:
A similar text aggregation apparatus, comprising:
a to-be-aggregated text acquisition module, configured to acquire texts to be aggregated from a multi-source data end, together with distinguishing identification information of all the texts to be aggregated, wherein each text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification;
a preliminary de-duplication processing module, configured to perform preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy, so as to obtain the texts to be aggregated after the preliminary de-duplication is completed, wherein the first screening strategy is specifically: identifying, according to the distinguishing identification information, whether texts to be aggregated obtained from the same data end have the same distinguishing identification information, and if so, performing the preliminary de-duplication processing;
a title and body extraction module, configured to extract the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed;
a text body preprocessing module, configured to perform text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed, so as to obtain the texts to be aggregated after the text body preprocessing is completed, wherein the text body preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese in the text body;
a hash code value calculation module, configured to calculate, according to a preset hash coding algorithm, hash code values one by one for all texts to be aggregated after the text body preprocessing is completed, so as to obtain the hash code values corresponding to the texts to be aggregated;
a text aggregation processing module, configured to aggregate the texts to be aggregated through the hash code values, the distinguishing identification information, the text title, the text body and a preset second screening strategy, so as to obtain a text aggregation result and complete the screening and aggregation of similar texts, wherein the second screening strategy is specifically: segmenting a target hash code value according to a preset segmentation parameter to obtain the code segments of the target hash code value and the code segment distinguishing identifier of each code segment; calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm to obtain a coding distance calculation result; screening, according to a preset coding distance threshold and the coding distance calculation result, the texts to be aggregated that meet a preset first requirement, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared satisfies the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the similar text aggregation method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor perform the steps of a similar text aggregation method as described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
according to the similar text aggregation method, the texts to be aggregated in the multi-source data end and the distinguishing identification information of all the texts to be aggregated are obtained; preliminary de-duplication processing is performed on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the texts to be aggregated after the preliminary de-duplication is completed; the text title and the text body are extracted from each text to be aggregated after the preliminary de-duplication is completed; text body preprocessing is performed on the texts to be aggregated after the preliminary de-duplication is completed to obtain the texts to be aggregated after the text body preprocessing is completed; hash code values are calculated one by one, according to a preset hash coding algorithm, for all texts to be aggregated after the text body preprocessing is completed, to obtain the hash code values corresponding to the texts to be aggregated; and the texts to be aggregated are aggregated through the hash code values, the distinguishing identification information, the text title, the text body and a preset second screening strategy to obtain a text aggregation result, completing the screening and aggregation of similar texts. Because hash code values are calculated separately for the text title and the text body of each text to be aggregated, and similar texts are screened out through the second screening strategy, the first requirement and the second requirement, the method allows similar texts to be aggregated quickly and accurately while consuming fewer computing resources than the K-Shingle algorithm.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a similar text aggregation method in accordance with the present application;
FIG. 3 is a flow chart of one embodiment of step 202 of FIG. 2;
FIG. 4 is a flow chart of one embodiment of step 205 of FIG. 2;
FIG. 5 is a schematic diagram illustrating the construction of one embodiment of a similar text aggregation device in accordance with the present application;
FIG. 6 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the similar text aggregation method provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the similar text aggregation apparatus is generally set in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flowchart of one embodiment of a similar text aggregation method in accordance with the present application is shown. The similar text aggregation method comprises the following steps:
Step 201, obtaining a text to be aggregated in a multi-source data end and distinguishing identification information of all the texts to be aggregated, wherein the text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification.
In this embodiment, the method for acquiring the text to be aggregated in the multi-source data end includes two modes of active acquisition and passive reception.
The active acquisition mode comprises acquisition in a synchronous acquisition mode and acquisition in a data grabbing mode, and the passive receiving mode comprises passively receiving the texts to be aggregated in an asynchronous acquisition mode.
Specifically, acquisition in the synchronous acquisition mode includes: sending a to-be-aggregated text acquisition request to the multi-source data end, whereupon the multi-source data end responds to the request and sends the texts to be aggregated to a single receiving end; here the multi-source data end may be a plurality of databases, data warehouses, cloud databases, or the like.
Specifically, acquisition in the data grabbing mode includes: capturing the texts to be aggregated from the multi-source data end through a preset page crawler component; here the multi-source data end may be a plurality of web page ends with different IP addresses.
Specifically, passive reception in the asynchronous acquisition mode includes: transmitting or pushing the texts to be aggregated from the multi-source data end to the single receiving end as a message stream.
Specifically, the source identification differs from one data end to another. For example, if the multi-source data end comprises three databases, the source identifications are the distinguishing identifiers of those three databases. Likewise, the text identification differs from one text to another. For example, if 10 texts are obtained from a given database, they correspond to 10 text identifications, which are the distinguishing identifiers of those texts.
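To make the splicing concrete, the sketch below builds distinguishing identification information from a source identification and a text identification. The `"::"` separator, the function name, and the database names are hypothetical; the source says only that the two identifications are spliced together.

```python
SEPARATOR = "::"  # assumed; the source does not name the splice character


def make_identifier(source_id: str, text_id: str) -> str:
    """Splice a source identification and a text identification into one string."""
    if SEPARATOR in source_id or SEPARATOR in text_id:
        raise ValueError("identifier parts must not contain the separator")
    return f"{source_id}{SEPARATOR}{text_id}"


# Three hypothetical databases, each supplying texts numbered 1 and 2.
identifiers = [
    make_identifier(db, str(n)) for db in ("db1", "db2", "db3") for n in (1, 2)
]
```

Texts that share a text identification but come from different sources keep distinct spliced identifiers, so only repeats within one source are candidates for preliminary de-duplication.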
Step 202, performing preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the texts to be aggregated after the preliminary de-duplication is completed.
The first screening strategy is specifically: identifying, according to the distinguishing identification information, whether texts to be aggregated acquired from the same data end have the same distinguishing identification information, and if so, performing the preliminary de-duplication processing.
With continued reference to FIG. 3, FIG. 3 is a flow chart of one embodiment of step 202 shown in FIG. 2, comprising:
Step 301, splitting the distinguishing identification information with a preset splitting component to obtain the source identifier and the text identifier corresponding to each text to be aggregated;
Step 302, based on the source identifiers, obtaining all texts to be aggregated that share the same source identifier and generating a same-source text set;
Step 303, identifying, according to the text identifiers, whether texts with the same text identifier exist in the same-source text set;
Step 304, if texts with the same text identifier exist in the same-source text set, acquiring those texts from the same-source text set and constructing a text set to be preliminarily de-duplicated;
Step 305, selecting one text from the text set to be preliminarily de-duplicated as the target text and deleting the other texts, thereby completing the preliminary de-duplication processing.
In this embodiment, the preliminary de-duplication processing is performed on the text to be aggregated according to the distinguishing identification information and the preset first screening strategy, so that texts with repeated distinguishing identification information acquired from the same data source are not retained more than once.
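Steps 301 to 305 can be sketched as follows. This is only an illustrative sketch: the `::` separator, the dictionary layout of a text, and the choice to keep the first text of each duplicate set are assumptions, since the embodiment does not specify how the identifiers are spliced or which text is selected as the target text.

```python
from collections import defaultdict

SEPARATOR = "::"  # hypothetical separator between source identifier and text identifier


def split_identifier(distinguishing_id):
    """Step 301: split a spliced identifier into (source identifier, text identifier)."""
    source_id, text_id = distinguishing_id.split(SEPARATOR, 1)
    return source_id, text_id


def preliminary_dedup(texts):
    """Keep one text per (source identifier, text identifier) pair."""
    # Step 302: build one same-source text set per source identifier.
    same_source = defaultdict(list)
    for text in texts:
        source_id, _ = split_identifier(text["id"])
        same_source[source_id].append(text)

    kept = []
    for group in same_source.values():
        # Steps 303-304: collect texts that share a text identifier.
        by_text_id = defaultdict(list)
        for text in group:
            _, text_id = split_identifier(text["id"])
            by_text_id[text_id].append(text)
        # Step 305: keep one target text (here, the first seen) and drop the rest.
        for duplicates in by_text_id.values():
            kept.append(duplicates[0])
    return kept
```

Texts that carry the same text identifier but come from different sources are deliberately kept, because the first screening strategy only compares texts acquired from the same data end.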
Step 203, extracting the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed.
In this embodiment, the text titles and text bodies of all texts to be aggregated after the preliminary de-duplication are extracted separately, in order to provide the data basis for the subsequent similar text aggregation operation.
Step 204, performing text body preprocessing on the text to be aggregated after the preliminary de-duplication is completed, to obtain the text to be aggregated after the text body preprocessing is completed,
where the text body preprocessing specifically includes cleaning and normalizing punctuation, whitespace, Chinese and English, and simplified and traditional Chinese in the text body.
Through this preprocessing, punctuation, whitespace, Chinese and English, and simplified and traditional Chinese in the text body are cleaned and normalized, so that only literal characters are retained in the text body and the form of those characters is unified (for example, English is uniformly converted into Chinese, and traditional Chinese is converted into simplified Chinese), which facilitates the subsequent similar text aggregation.
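A minimal sketch of such a cleaning step is given below, using only the Python standard library. The three-entry traditional-to-simplified table is a hypothetical stand-in for a full conversion dictionary, and lower-casing English letters is an assumption of this sketch; the English-to-Chinese conversion mentioned above is not attempted here.

```python
import unicodedata

# Hypothetical, deliberately tiny traditional-to-simplified table (assumption);
# a real system would need a complete conversion dictionary.
TRAD_TO_SIMP = str.maketrans({"體": "体", "國": "国", "這": "这"})


def preprocess_body(body):
    """Drop punctuation and whitespace and unify character forms in a text body."""
    # NFKC folds full-width letters, digits and punctuation to their half-width forms.
    text = unicodedata.normalize("NFKC", body)
    # Map traditional characters to simplified ones (only the sample table above).
    text = text.translate(TRAD_TO_SIMP)
    # Assumption: English letters are unified by lower-casing.
    text = text.lower()
    # Keep only letters and digits, which removes punctuation and whitespace.
    return "".join(ch for ch in text if ch.isalnum())
```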
Step 205, calculating, one by one and according to a preset hash coding algorithm, the hash code values of all texts to be aggregated after the text body preprocessing is completed, to obtain the hash code value corresponding to each text to be aggregated.
In this embodiment, the preset Hash coding algorithm includes a simHash coding algorithm.
With continued reference to fig. 4, fig. 4 is a flow chart of one embodiment of step 205 shown in fig. 2, comprising:
Step 401, inputting, one by one, all texts to be aggregated after the text body preprocessing is completed into a preset hash coding algorithm component, where the simHash coding algorithm is built into the hash coding algorithm component;
Step 402, performing hash code value calculation separately on the text title and the text body of each text to be aggregated according to the simHash coding algorithm built into the hash coding algorithm component, and generating a hash code value corresponding to the text title and a hash code value corresponding to the text body, where each hash code value is composed of the code characters 0 and 1 and is 64 code characters (bits) long;
Step 403, obtaining the hash code values corresponding to the text titles of all texts to be aggregated, constructing a first hash code value set, and setting distinguishing identification information for the elements in the first hash code value set according to the distinguishing identification information of the texts to be aggregated;
Step 404, obtaining the hash code values corresponding to the text bodies of all texts to be aggregated, constructing a second hash code value set, and setting distinguishing identification information for the elements in the second hash code value set according to the distinguishing identification information of the texts to be aggregated.
By using the simHash coding algorithm to calculate hash code values separately for the text title and the text body of each text to be aggregated, a hash code value corresponding to the text title and a hash code value corresponding to the text body are generated, so that similar text aggregation can conveniently be performed by combining the two. In addition, in text aggregation, compared with a K-shift algorithm, the simHash coding algorithm consumes less computation space and computation time, and can aggregate similar texts more quickly.
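For illustration, a 64-bit simHash value can be computed roughly as sketched below. The embodiment does not state how a text is tokenized, how tokens are weighted, or which underlying hash function is used, so the character bigrams, the unit weights and the MD5-based token hash here are all assumptions.

```python
import hashlib


def tokenize(text, n=2):
    """Split text into overlapping character n-grams (an assumed tokenization)."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash64(text, n=2):
    """Return a 64-character string of '0'/'1' code characters for the text."""
    weights = [0] * 64
    for token in tokenize(text, n):
        # Take the first 8 bytes of an MD5 digest as a 64-bit token hash (assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(64):
            if (token_hash >> bit) & 1:
                weights[bit] += 1
            else:
                weights[bit] -= 1
    # A bit of the fingerprint is 1 where the accumulated vote is positive.
    value = 0
    for bit in range(64):
        if weights[bit] > 0:
            value |= 1 << bit
    return format(value, "064b")
```

Calling `simhash64` once on the text title and once on the preprocessed text body yields the two hash code values that would populate the first and second hash code value sets.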
Step 206, aggregating the texts to be aggregated through the hash code value, the distinguishing identification information, the text title, the text body and a preset second screening policy to obtain a text aggregation result, completing the screening aggregation of similar texts,
where the second screening strategy is specifically: performing segmentation processing on a target hash code value according to a preset segmentation parameter to obtain the code segments of the target hash code value and the code segment distinguishing identifier of each code segment; calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm, to obtain a coding distance calculation result; screening, according to a preset coding distance threshold and the coding distance calculation result, the texts to be aggregated that meet a preset first requirement, and constructing a text comparison group, where the preset first requirement is that the coding distance between the hash code values to be compared satisfies the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, where the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.
In this embodiment, the step of performing segmentation processing on the target hash code value according to the preset segmentation parameter to obtain the code segments of the target hash code value and their code segment distinguishing identifiers specifically includes: dividing the target hash code value equally into M code segments according to the segmentation parameter, where M is the value of the segmentation parameter and is a positive integer greater than 1 that evenly divides the number of code characters (bits) of the hash code value; and, based on the position of each of the M code segments within the target hash code value, assigning a code segment distinguishing identifier to each of the M code segments in left-to-right or right-to-left order.
In this embodiment, the target hash code value includes a hash code value corresponding to a text header and a hash code value corresponding to a text body, that is, elements in the first hash code value set and elements in the second hash code value set.
Specifically, for example, if the segmentation parameter is 4, the target hash code value is divided equally into 4 code segments, each of which is a 16-bit code consisting of the code characters 0 and 1. Of course, the segmentation parameter can be freely set to other values that divide 64 evenly, such as 8 or 16; the specific setting is chosen by the administrator.
Continuing the above example with a segmentation parameter of 4, four 16-bit code segments are obtained. Based on the positions of the code segments within the target hash code value, distinguishing identifiers are assigned to the four 16-bit code segments in left-to-right or right-to-left order, for example hash_1, hash_2, hash_3 and hash_4.
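The equal segmentation in this example can be sketched as below. The function name is hypothetical; the hash_1 to hash_4 identifiers follow the example above and are assigned from left to right.

```python
def split_into_segments(code, m=4):
    """Split a hash code string into m equal code segments keyed by identifier."""
    # M must be greater than 1 and must divide the code length evenly.
    if m <= 1 or len(code) % m != 0:
        raise ValueError("m must be > 1 and divide the code length evenly")
    size = len(code) // m
    # Identifiers are assigned left to right: hash_1, hash_2, ...
    return {f"hash_{i + 1}": code[i * size:(i + 1) * size] for i in range(m)}
```

With a 64-character code and `m=4`, each of the four returned segments is 16 characters long; a value such as 12, which does not divide 64, is rejected.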
In this embodiment, the step of performing the segmentation processing on the target hash code value according to the preset segmentation parameter to obtain the code segment and the code segment distinguishing identifier corresponding to the target hash code value respectively may further include: randomly dividing a target hash code value into L segments of code segments according to the segment parameters, wherein L is a parameter value of the segment parameters, and L is a positive integer greater than 1 and less than or equal to the code character bit number of the hash code value; and setting coding segment distinguishing identifiers for the L segments respectively according to the left-to-right or right-to-left mode based on the position information of the L segments in the target hash code value.
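The random segmentation variant might look like the following sketch. Drawing the cut points once from a seeded generator and reusing the same boundaries for every hash code value is an assumption made here so that segments with the same identifier stay comparable; the embodiment does not say how the random split is drawn.

```python
import random


def random_cut_points(code_length=64, l=4, seed=0):
    """Pick boundaries that split a code of the given length into l non-empty segments."""
    if l <= 1 or l > code_length:
        raise ValueError("l must be > 1 and <= the code length")
    rng = random.Random(seed)
    # l - 1 distinct interior cut positions give l non-empty segments.
    cuts = sorted(rng.sample(range(1, code_length), l - 1))
    return [0] + cuts + [code_length]


def split_at(code, bounds):
    """Cut the code at the given boundaries and label the segments left to right."""
    return {
        f"hash_{i + 1}": code[bounds[i]:bounds[i + 1]]
        for i in range(len(bounds) - 1)
    }
```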
In this embodiment, the step of calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifier and a preset distance algorithm to obtain a coding distance calculation result specifically includes: randomly screening two hash code values from the second hash code value set to serve as hash code values to be compared; acquiring coding segments corresponding to the hash coding values to be compared respectively and coding segment distinguishing identifiers corresponding to the coding segments; counting the number of the coding characters 1 contained in the coding segment, and recording the number as a first number; taking the code segment distinguishing identifiers and the first quantity as calculation parameters of the distance algorithm, calculating the code distance between the hash code values to be compared according to the distance algorithm,
where the calculating step of the distance algorithm specifically includes: identifying, among the code segments of the hash code values to be compared, the code segments that have the same code segment distinguishing identifier and the same first number, as target code segments; counting the number of target code segments of the hash code values to be compared, and recording it as a second number; and calculating the difference between M and the second number, and taking this difference as the coding distance.
Counting the number of code characters 1 contained in each code segment, and identifying as target code segments those code segments that have the same code segment distinguishing identifier and the same first number, aims to obtain a comparison result for the whole hash code values corresponding to the texts from the segment-level comparison results; compared with comparing the hash code values directly, the comparison is performed more quickly and computing resources are saved.
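The distance algorithm described above can be sketched as follows, reading the coding distance as M minus the second number, so that a distance of 0 means every segment pair matches. The function names are hypothetical, and equal segmentation from left to right is assumed.

```python
def segment(code, m):
    """Equal segmentation into m code segments labelled hash_1..hash_m."""
    size = len(code) // m
    return {f"hash_{i + 1}": code[i * size:(i + 1) * size] for i in range(m)}


def coding_distance(code_a, code_b, m=4):
    """Return m minus the number of same-identifier segments with equal counts of '1'."""
    seg_a = segment(code_a, m)
    seg_b = segment(code_b, m)
    # First number: count of code character 1 in each segment.
    # Second number: how many identifiers have equal first numbers in both codes.
    second_number = sum(
        1 for key in seg_a if seg_a[key].count("1") == seg_b[key].count("1")
    )
    return m - second_number
```

Because only the counts of code character 1 are compared, two segments such as `10` and `01` are treated as matching; the comparison is coarser than a bit-by-bit comparison, which is the trade-off the embodiment accepts for speed.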
In this embodiment, the step of calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and the preset distance algorithm to obtain a coding distance calculation result may alternatively include: counting the number of code characters 1 contained in each hash code value in the second hash code value set, and recording it as a third number; screening out, from the second hash code value set, the hash code values that have the same third number, to construct a hash code value set to be compared; arbitrarily selecting two hash code values from the hash code value set to be compared as the hash code values to be compared; acquiring the code segments of each of the hash code values to be compared and the code segment distinguishing identifier of each code segment; for each pair of code segments having the same code segment distinguishing identifier, counting the difference between the numbers of code characters 1 they contain, and recording it as a fourth number; and calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and the fourth numbers,
the step of calculating the coding distance between the hash coding values to be compared according to the coding segment distinguishing identifiers and the fourth number specifically includes: obtaining the fourth quantity corresponding to all the code segment distinguishing identifiers, and accumulating and summing to obtain a sum value; and taking the sum value as the coding distance.
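This alternative calculation can be sketched as below. Taking the per-segment difference as an absolute value is an assumption of this sketch, as are the function names and the equal left-to-right segmentation.

```python
from collections import defaultdict


def group_by_total_ones(codes):
    """Group text identifiers whose hash codes contain the same number of '1' (third number)."""
    groups = defaultdict(list)
    for text_id, code in codes.items():
        groups[code.count("1")].append(text_id)
    # Only groups with at least two members yield pairs to compare.
    return [ids for ids in groups.values() if len(ids) > 1]


def summed_segment_distance(code_a, code_b, m=4):
    """Sum, over the m segments, the absolute difference in the count of '1' (fourth numbers)."""
    size = len(code_a) // m
    total = 0
    for i in range(m):
        ones_a = code_a[i * size:(i + 1) * size].count("1")
        ones_b = code_b[i * size:(i + 1) * size].count("1")
        total += abs(ones_a - ones_b)
    return total
```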
In this embodiment, the step of screening the text to be aggregated, which meets a preset first requirement, according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group specifically includes: identifying whether the coding distance calculation result exceeds the coding distance threshold value by comparison; if the coding distance calculation result exceeds the coding distance threshold, the coding distance between the hash coding values to be compared does not meet the coding distance threshold, and the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which does not meet the preset first requirement; if the coding distance calculation result does not exceed the coding distance threshold, the coding distance between the hash coding values to be compared meets the coding distance threshold, the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which meets the preset first requirement, and the text to be aggregated corresponding to the hash coding values to be compared is added into a preset text comparison group.
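The threshold screening can be illustrated with the sketch below, which compares every pair of body hash code values and collects the texts whose coding distance does not exceed the threshold. The pairwise loop, the dictionary keyed by distinguishing identification information, and the segment-count distance used here are assumptions for illustration.

```python
from itertools import combinations


def segment_distance(code_a, code_b, m=4):
    """m minus the number of same-position segments with equal counts of '1'."""
    size = len(code_a) // m
    matches = sum(
        code_a[i * size:(i + 1) * size].count("1")
        == code_b[i * size:(i + 1) * size].count("1")
        for i in range(m)
    )
    return m - matches


def build_comparison_group(body_codes, threshold, m=4):
    """Collect identifiers of texts whose pairwise coding distance is within the threshold."""
    group = set()
    for (id_a, code_a), (id_b, code_b) in combinations(body_codes.items(), 2):
        # First requirement: the coding distance does not exceed the threshold.
        if segment_distance(code_a, code_b, m) <= threshold:
            group.update((id_a, id_b))
    return group
```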
In this embodiment, the step of aggregating the text to be aggregated, which satisfies the preset second requirement, in the text comparison set as a similar text specifically includes: acquiring all texts to be aggregated contained in the text comparison group to serve as texts to be compared; according to the distinguishing identification information of the texts to be compared, acquiring hash code values corresponding to text titles of all the texts to be compared from the first hash code value set; identifying texts to be compared with the same text title by comparing hash code values corresponding to the text titles of all the texts to be compared, and carrying out aggregation treatment on the texts to be compared with the same text title as similar texts; respectively extracting the first N characters from the text bodies of all the texts to be compared according to the distinguishing identification information of the texts to be compared; and identifying the texts to be compared with the same first N characters by comparing the first N characters respectively corresponding to all the texts to be compared, and carrying out aggregation processing on the texts to be compared with the same first N characters as similar texts.
In essence, after text similarity calculation, the embodiment identifies the texts to be compared with the same text title by comparing the hash code values corresponding to the text titles of all the texts to be compared, or identifies the texts to be compared with the same first N text characters by comparing the first N text characters respectively corresponding to all the texts to be compared, and then aggregates the texts to be compared according to the identified texts to be compared, thereby further ensuring the accuracy of similar text aggregation.
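A sketch of this second-stage check follows. For brevity it compares title strings directly rather than title hash code values, and it merges texts transitively with a small union-find structure; both choices, along with the data layout and the default `n=10`, are assumptions and not stated in the embodiment.

```python
from collections import defaultdict


def aggregate_similar(texts, n=10):
    """Group texts that share a title or the first n body characters.

    texts maps distinguishing identification information to
    {"title": ..., "body": ...} for members of the text comparison group.
    """
    parent = {text_id: text_id for text_id in texts}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_by_title = {}
    first_by_prefix = {}
    for text_id, text in texts.items():
        keys = ((first_by_title, text["title"]), (first_by_prefix, text["body"][:n]))
        for index, key in keys:
            if key in index:
                # Same title or same first n characters: merge the two sets.
                parent[find(text_id)] = find(index[key])
            else:
                index[key] = text_id

    clusters = defaultdict(list)
    for text_id in texts:
        clusters[find(text_id)].append(text_id)
    # Only sets with at least two texts are aggregation results.
    return [sorted(members) for members in clusters.values() if len(members) > 1]
```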
In this embodiment, the step of identifying the text to be compared with the same text title by comparing the hash code values corresponding to the text titles of all the texts to be compared specifically includes: randomly screening two hash code values from the first hash code value set to serve as hash code values to be compared; acquiring coding segments corresponding to the hash coding values to be compared respectively and coding segment distinguishing identifiers corresponding to the coding segments; counting the number of the coding characters 1 contained in the coding segment, and recording the number as a first number; and taking the code segment distinguishing identifiers and the first quantity as calculation parameters of the distance algorithm, calculating the code distance between the hash code values to be compared according to the distance algorithm, and if the code distance between the hash code values to be compared is 0, the text titles corresponding to the hash code values to be compared are the same.
In this embodiment, after executing the step of aggregating the texts to be aggregated in the text comparison set, which meet the preset second requirement, as similar texts, the method further includes: acquiring an aggregation treatment result; acquiring distinguishing identification information of all texts in the same aggregation set according to the aggregation processing result; and selecting a target distinguishing identifier from the distinguishing identifier information of all texts in the same aggregation set as the identifier information of the aggregation set according to a preset election rule.
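The election rule itself is left open by the embodiment. As one hypothetical rule, the sketch below picks the lexicographically smallest distinguishing identifier in an aggregation set as the identifier of that set.

```python
def elect_set_identifier(member_ids):
    """Pick the identifier of an aggregation set (hypothetical rule: smallest identifier)."""
    if not member_ids:
        raise ValueError("an aggregation set must contain at least one text")
    return min(member_ids)
```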
The method comprises: obtaining the texts to be aggregated in a multi-source data end and the distinguishing identification information of all texts to be aggregated; performing preliminary de-duplication processing on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy, to obtain the texts to be aggregated after the preliminary de-duplication is completed; extracting the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed; performing text body preprocessing on the texts to be aggregated after the preliminary de-duplication is completed, to obtain the texts to be aggregated after the text body preprocessing is completed; calculating, one by one and according to a preset hash coding algorithm, the hash code values of all texts to be aggregated after the text body preprocessing is completed, to obtain the hash code value corresponding to each text to be aggregated; and aggregating the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy, to obtain a text aggregation result and complete the screening and aggregation of similar texts. By calculating hash code values separately for the text title and the text body of each text to be aggregated, and screening out similar texts through the second screening strategy, the first requirement and the second requirement, the method ensures, compared with a K-shift algorithm, that similar texts can be aggregated together quickly and accurately while consuming fewer computing resources.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
With further reference to fig. 5, as an implementation of the method shown in fig. 2 described above, the present application provides an embodiment of a similar text aggregation apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the similar text aggregation apparatus 500 according to the present embodiment includes: a text to be aggregated acquisition module 501, a preliminary deduplication processing module 502, a title and text extraction module 503, a text preprocessing module 504, a Hash code value calculation module 505 and a text aggregation processing module 506. Wherein:
the text to be aggregated obtaining module 501 is configured to obtain a text to be aggregated in a multi-source data end and distinguishing identification information of all the texts to be aggregated, where the text to be aggregated includes a text title and a text body, and the distinguishing identification information is formed by splicing a source identifier and a text identifier;
the preliminary de-duplication processing module 502 is configured to perform preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy, so as to obtain a text to be aggregated after the preliminary de-duplication is completed, where the first screening policy specifically is: identifying whether the identification information of the texts to be aggregated obtained from the same data end is the same or not according to the identification information of the texts to be aggregated, and if the identification information of the texts to be aggregated obtained from the same data end is the same, performing preliminary duplicate removal processing;
the title and body extraction module 503 is configured to extract the text title and the text body from each text to be aggregated after the preliminary de-duplication is completed;
the text preprocessing module 504 is configured to perform text body preprocessing on the text to be aggregated after the preliminary de-duplication is completed, so as to obtain the text to be aggregated after the text body preprocessing is completed, where the text body preprocessing specifically includes cleaning and normalizing punctuation, whitespace, Chinese and English, and simplified and traditional Chinese in the text body;
the Hash code value calculation module 505 is configured to perform Hash code value calculation on the text to be aggregated after all text body preprocessing is completed one by one according to a preset Hash code algorithm, so as to obtain a Hash code value corresponding to the text to be aggregated;
the text aggregation processing module 506 is configured to aggregate the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy, so as to obtain a text aggregation result and complete the screening and aggregation of similar texts, where the second screening strategy is specifically: performing segmentation processing on a target hash code value according to a preset segmentation parameter to obtain the code segments of the target hash code value and the code segment distinguishing identifier of each code segment; calculating the coding distance between the hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm, to obtain a coding distance calculation result; screening, according to a preset coding distance threshold and the coding distance calculation result, the texts to be aggregated that meet a preset first requirement, and constructing a text comparison group, where the preset first requirement is that the coding distance between the hash code values to be compared satisfies the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, where the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by computer readable instructions stored on a computer readable storage medium, and that the instructions, when executed, may carry out the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a read-only memory (Read-Only Memory, ROM), or may be a random access memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 6, fig. 6 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 6 comprises a memory 6a, a processor 6b and a network interface 6c that are communicatively connected to each other via a system bus. It should be noted that only a computer device 6 having components 6a-6c is shown in the figure, but it should be understood that not all of the illustrated components need be implemented, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device here is a device capable of automatically performing numerical calculation and/or information processing in accordance with predetermined or stored instructions, and that its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 6a includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 6a may be an internal storage unit of the computer device 6, such as its hard disk or internal memory. In other embodiments, the memory 6a may be an external storage device of the computer device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 6. The memory 6a may also comprise both an internal storage unit and an external storage device of the computer device 6. In this embodiment, the memory 6a is typically used to store the operating system and the various application software installed on the computer device 6, such as the computer-readable instructions of the similar text aggregation method. The memory 6a may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 6b may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 6b is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 6b is configured to execute the computer-readable instructions stored in the memory 6a or to process data, for example to execute the computer-readable instructions of the similar text aggregation method.
The network interface 6c may comprise a wireless network interface or a wired network interface, which network interface 6c is typically used to establish a communication connection between the computer device 6 and other electronic devices.
The computer device provided by this embodiment belongs to the technical field of data processing and is applied to scenarios in which texts from multiple data sources are deduplicated and aggregated. The method comprises: acquiring the texts to be aggregated from multi-source data ends, together with the distinguishing identification information of each text; performing preliminary deduplication on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy; extracting the text title and the text body from each text that remains after preliminary deduplication; preprocessing those texts; calculating, one by one and according to a preset hash coding algorithm, a hash code value for each preprocessed text; and aggregating the texts to be aggregated according to the hash code values, the distinguishing identification information, the text titles, the text bodies, and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts. Because hash code values are calculated separately for the text title and the text body of each text, and similar texts are screened out through the second screening strategy with its first requirement and second requirement, the method, compared with a K-shift algorithm, allows similar texts to be aggregated quickly and accurately while consuming fewer computing resources.
The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that are executable by a processor to cause the processor to perform the steps of the similar text aggregation method described above.
The computer-readable storage medium provided by this embodiment belongs to the technical field of data processing and is applied to scenarios in which texts from multiple data sources are deduplicated and aggregated. The method comprises: acquiring the texts to be aggregated from multi-source data ends, together with the distinguishing identification information of each text; performing preliminary deduplication on the texts to be aggregated according to the distinguishing identification information and a preset first screening strategy; extracting the text title and the text body from each text that remains after preliminary deduplication; preprocessing those texts; calculating, one by one and according to a preset hash coding algorithm, a hash code value for each preprocessed text; and aggregating the texts to be aggregated according to the hash code values, the distinguishing identification information, the text titles, the text bodies, and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts. Because hash code values are calculated separately for the text title and the text body of each text, and similar texts are screened out through the second screening strategy with its first requirement and second requirement, the method, compared with a K-shift algorithm, allows similar texts to be aggregated quickly and accurately while consuming fewer computing resources.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, a magnetic disk, or an optical disk) and comprising instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
It is apparent that the embodiments described above are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application but do not limit the scope of the patent claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims (10)

1. A method of similar text aggregation comprising the steps of:
acquiring texts to be aggregated from multi-source data ends and distinguishing identification information of all the texts to be aggregated, wherein each text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification;
performing preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening strategy to obtain the text to be aggregated after the preliminary de-duplication is completed, wherein the first screening strategy specifically comprises: identifying whether the identification information of the texts to be aggregated obtained from the same data end is the same or not according to the identification information of the texts to be aggregated, and if the identification information of the texts to be aggregated obtained from the same data end is the same, performing preliminary duplicate removal processing;
respectively extracting the text title and the text body from each text to be aggregated after the preliminary deduplication is completed;
performing text preprocessing on the texts to be aggregated after the preliminary deduplication is completed to obtain the texts to be aggregated after the text preprocessing is completed, wherein the text preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text;
according to a preset hash coding algorithm, calculating hash code values one by one for all the texts to be aggregated after the text preprocessing is completed, to obtain the hash code values corresponding to the texts to be aggregated;
and aggregating the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts, wherein the second screening strategy specifically comprises: segmenting a target hash code value according to a preset segmentation parameter to obtain the code segments corresponding to the target hash code value and the code segment distinguishing identifiers respectively corresponding to those code segments; calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm to obtain a coding distance calculation result; screening out the texts to be aggregated that meet a preset first requirement according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared meets the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.
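The text preprocessing step recited in claim 1 can be illustrated with a minimal Python sketch. The function name `preprocess_text` and the specific choices (NFKC normalization, lower-casing, dropping whitespace and punctuation) are assumptions made for illustration and are not taken from the patent. Simplified/traditional Chinese conversion is left out because it would need an external conversion table or library.

```python
import unicodedata


def preprocess_text(text: str) -> str:
    """Normalize a title or body before hashing (illustrative sketch only)."""
    # NFKC folds full-width letters, digits and punctuation to half-width forms.
    normalized = unicodedata.normalize("NFKC", text)
    # Lower-case so that English words compare case-insensitively.
    normalized = normalized.lower()
    # Drop whitespace and Unicode punctuation; keep letters, digits and CJK characters.
    kept = [
        ch for ch in normalized
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    return "".join(kept)


print(preprocess_text("Hello， World！ 你好。"))
```

Under these assumptions, two copies of a text that differ only in spacing, letter case, or full-width versus half-width punctuation reduce to the same string before their hash code values are calculated.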
2. The method for aggregating similar texts according to claim 1, wherein the step of performing preliminary deduplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy to obtain the text to be aggregated after the preliminary deduplication is completed specifically comprises:
splitting the distinguishing identification information according to a preset splitting component to obtain source identifications and text identifications respectively corresponding to all texts to be aggregated;
based on the source identification, obtaining all texts to be aggregated corresponding to the same source identification, and generating a same source text set;
identifying whether texts with the same text identification exist in the text sets with the same source according to the text identification;
if texts with the same text identification exist in the text sets with the same source, acquiring the texts with the same text identification from the text sets with the same source, and constructing a text set to be subjected to preliminary de-duplication;
and selecting one text from the text set to be subjected to preliminary de-duplication as a target text, deleting other texts, and completing the preliminary de-duplication processing.
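The preliminary deduplication of claim 2 might be sketched as follows. The record layout (a dict with an `id` field), the `#` separator used to split the distinguishing identification information, and the choice to keep the first occurrence as the target text are all hypothetical; the claim only requires that one text be kept for each repeated source and text identification pair.

```python
def preliminary_dedup(records, separator="#"):
    """Keep one text per (source identification, text identification) pair.

    records: list of dicts whose 'id' is source id + separator + text id
    (an assumed layout used only for this sketch).
    """
    seen = set()
    kept = []
    for record in records:
        # Split the spliced identifier back into its two parts.
        source_id, text_id = record["id"].split(separator, 1)
        key = (source_id, text_id)
        if key in seen:
            # Same source and same text identification: treat as a duplicate.
            continue
        seen.add(key)
        kept.append(record)
    return kept


sample = [
    {"id": "siteA#001", "title": "first"},
    {"id": "siteA#001", "title": "first (repeat)"},
    {"id": "siteB#001", "title": "second"},
]
print([r["id"] for r in preliminary_dedup(sample)])
```

Texts that share a text identification but come from different sources are kept, since the first screening strategy only compares texts obtained from the same data end.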
3. The method for aggregating similar texts according to claim 1, wherein the preset Hash coding algorithm includes a simHash coding algorithm, and the step of calculating Hash coding values one by one for the texts to be aggregated after preprocessing all the texts according to the preset Hash coding algorithm to obtain corresponding Hash coding values specifically includes:
inputting all the texts to be aggregated after the text preprocessing is completed, one by one, into a preset hash coding algorithm component, wherein the simHash coding algorithm is built into the hash coding algorithm component;
respectively calculating hash code values for the text title and the text body of each text to be aggregated according to the simHash coding algorithm built into the hash coding algorithm component, to generate a hash code value corresponding to the text title and a hash code value corresponding to the text body, wherein each hash code value consists of the coding characters 0 and 1, and the number of coding character bits of each hash code value is 64;
acquiring hash code values corresponding to text titles of all texts to be aggregated, constructing a first hash code value set, and setting distinguishing identification information for elements in the first hash code value set according to distinguishing identification information of all texts to be aggregated;
and acquiring the hash code values corresponding to the text bodies of all texts to be aggregated, constructing a second hash code value set, and setting distinguishing identification information for the elements in the second hash code value set according to the distinguishing identification information of all texts to be aggregated.
4. The method for aggregating similar texts according to claim 3, wherein the step of performing segmentation processing on the target hash code values according to preset segmentation parameters to obtain code segments and code segment distinguishing identifiers corresponding to the target hash code values respectively specifically comprises the following steps:
equally dividing the target hash code value into M code segments according to the segmentation parameter, wherein M is the parameter value of the segmentation parameter, M is a positive integer greater than 1, and the number of coding character bits of the hash code value is divisible by M;
and setting coding segment distinguishing identifiers for the M segments respectively according to the position information of the M segments in the target hash code value in a left-to-right or right-to-left mode.
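The segmentation of claim 4 amounts to slicing the bit string into M equal parts and labeling each part by its position. In this sketch the function name, the default of four segments, and the use of zero-based left-to-right indices as the code segment distinguishing identifiers are illustrative choices.

```python
from typing import Dict


def split_into_segments(hash_code: str, m: int = 4) -> Dict[int, str]:
    """Split a bit string into m equal segments keyed by position (left to right)."""
    if m <= 1 or len(hash_code) % m != 0:
        raise ValueError("m must be greater than 1 and divide the code length")
    size = len(hash_code) // m
    # The dictionary key plays the role of the code segment distinguishing identifier.
    return {index: hash_code[index * size:(index + 1) * size] for index in range(m)}


code = "0" * 16 + "1" * 16 + "01" * 16  # a 64-character example code
for identifier, segment in split_into_segments(code, 4).items():
    print(identifier, segment)
```

With a 64-bit code, valid values of M under the claim's constraint are 2, 4, 8, 16, 32 and 64.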
5. The method for aggregating similar texts according to claim 4, wherein the step of calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifier and a preset distance algorithm to obtain a coding distance calculation result specifically comprises the following steps:
randomly screening two hash code values from the second hash code value set to serve as hash code values to be compared;
acquiring coding segments corresponding to the hash coding values to be compared respectively and coding segment distinguishing identifiers corresponding to the coding segments;
counting the number of the coding characters 1 contained in the coding segment, and recording the number as a first number;
taking the code segment distinguishing identifiers and the first number as calculation parameters of the distance algorithm, and calculating the coding distance between the hash code values to be compared according to the distance algorithm,
wherein the calculating steps of the distance algorithm specifically comprise: identifying, among the code segments corresponding to the hash code values to be compared, the code segments that have the same code segment distinguishing identifier and the same first number as target code segments; counting the number of target code segments of the hash code values to be compared, and recording it as a second number; and calculating the difference between the second number and M by a difference operation, and taking the difference as the coding distance.
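The distance algorithm of claim 5 can be sketched as follows. The claim does not state the sign of the difference, so this sketch assumes the coding distance is M minus the second number, which makes identical codes have distance 0. The function names are hypothetical.

```python
from typing import List


def segment_one_counts(hash_code: str, m: int) -> List[int]:
    """First number per segment: how many '1' characters each segment contains."""
    size = len(hash_code) // m
    return [hash_code[i * size:(i + 1) * size].count("1") for i in range(m)]


def coding_distance(code_a: str, code_b: str, m: int = 4) -> int:
    """Number of same-position segments whose counts of '1' differ."""
    if len(code_a) != len(code_b) or m <= 1 or len(code_a) % m != 0:
        raise ValueError("codes must have equal length divisible by m, with m > 1")
    counts_a = segment_one_counts(code_a, m)
    counts_b = segment_one_counts(code_b, m)
    # Second number: segments with the same identifier and the same first number.
    second_number = sum(1 for a, b in zip(counts_a, counts_b) if a == b)
    # Assumed sign convention: distance = M - second number.
    return m - second_number


a = "1" * 16 + "0" * 48
b = "1" * 16 + "0" * 47 + "1"
print(coding_distance(a, a), coding_distance(a, b))
```

Note that, as described, the comparison uses only the count of 1s in each segment, so two segments with the same count but different bit positions are treated as matching; this is coarser than a bitwise Hamming distance and ranges from 0 to M.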
6. The method for aggregating similar texts according to claim 5, wherein the step of screening texts to be aggregated meeting a preset first requirement according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group specifically comprises the following steps:
identifying whether the coding distance calculation result exceeds the coding distance threshold value by comparison;
if the coding distance calculation result exceeds the coding distance threshold, the coding distance between the hash coding values to be compared does not meet the coding distance threshold, and the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which does not meet the preset first requirement;
if the coding distance calculation result does not exceed the coding distance threshold, the coding distance between the hash coding values to be compared meets the coding distance threshold, the text to be aggregated corresponding to the hash coding values to be compared is the text to be aggregated which meets the preset first requirement, and the text to be aggregated corresponding to the hash coding values to be compared is added into a preset text comparison group.
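The threshold screening of claim 6 could look like the sketch below. It compares every pair of body hash codes instead of drawing pairs at random, takes the distance function as a parameter, and returns the identifiers that belong in the text comparison group; these choices and all names are assumptions. The stand-in `bit_difference` function is only there to make the example runnable and is not the patent's distance algorithm.

```python
from itertools import combinations
from typing import Callable, Dict, Set


def build_comparison_group(
    body_hashes: Dict[str, str],
    distance_fn: Callable[[str, str], int],
    threshold: int,
) -> Set[str]:
    """Collect identifiers of texts whose body codes lie within the threshold."""
    group: Set[str] = set()
    for (id_a, code_a), (id_b, code_b) in combinations(body_hashes.items(), 2):
        # "Does not exceed the threshold" is read as distance <= threshold.
        if distance_fn(code_a, code_b) <= threshold:
            group.update((id_a, id_b))
    return group


def bit_difference(code_a: str, code_b: str) -> int:
    # Stand-in distance for the demo: count of differing positions.
    return sum(x != y for x, y in zip(code_a, code_b))


hashes = {"siteA#1": "0000", "siteA#2": "0001", "siteB#1": "1111"}
print(sorted(build_comparison_group(hashes, bit_difference, threshold=1)))
```

Comparing all pairs costs a number of distance evaluations that grows quadratically with the number of texts, which is one reason a cheap per-pair distance matters in this step.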
7. The method for aggregating similar texts according to any one of claims 3 to 6, wherein the step of aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet the preset second requirement specifically comprises:
acquiring all texts to be aggregated contained in the text comparison group to serve as texts to be compared;
according to the distinguishing identification information of the texts to be compared, acquiring hash code values corresponding to text titles of all the texts to be compared from the first hash code value set;
identifying texts to be compared with the same text title by comparing hash code values corresponding to the text titles of all the texts to be compared, and carrying out aggregation treatment on the texts to be compared with the same text title as similar texts;
respectively extracting the first N characters from the text bodies of all the texts to be compared according to the distinguishing identification information of the texts to be compared;
and identifying the texts to be compared with the same first N characters by comparing the first N characters respectively corresponding to all the texts to be compared, and carrying out aggregation processing on the texts to be compared with the same first N characters as similar texts.
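The second requirement of claim 7 (identical title hash code values, or identical first N body characters) might be applied as in this sketch. The record layout, the default N of 20, and the use of a union-find structure to merge texts transitively are design assumptions; the claim itself does not say whether matches are chained.

```python
def aggregate_similar(texts, n=20):
    """Group texts that share a title hash or the first n body characters.

    texts: list of dicts with 'id', 'title_hash' and 'body' (assumed layout).
    Returns lists of ids for groups containing more than one text.
    """
    parent = {t["id"]: t["id"] for t in texts}

    def find(x):
        # Follow parent links to the group representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_by_title, first_by_prefix = {}, {}
    for t in texts:
        for index, key in ((first_by_title, t["title_hash"]),
                           (first_by_prefix, t["body"][:n])):
            if key in index:
                # Same title hash or same prefix: merge the two groups.
                parent[find(t["id"])] = find(index[key])
            else:
                index[key] = t["id"]

    clusters = {}
    for t in texts:
        clusters.setdefault(find(t["id"]), []).append(t["id"])
    return [ids for ids in clusters.values() if len(ids) > 1]


docs = [
    {"id": "a", "title_hash": "h1", "body": "The quick brown fox jumps"},
    {"id": "b", "title_hash": "h1", "body": "Completely different body"},
    {"id": "c", "title_hash": "h2", "body": "The quick brown fox runs"},
    {"id": "d", "title_hash": "h3", "body": "Unrelated"},
]
print(aggregate_similar(docs, n=15))
```

In this example `a` and `b` share a title hash, while `a` and `c` share their first 15 body characters, so all three end up in one group under the transitive-merge assumption. Bodies shorter than N are compared by their whole content.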
8. A similar text aggregation apparatus, comprising:
a to-be-aggregated text acquisition module, configured to acquire texts to be aggregated from multi-source data ends and distinguishing identification information of all the texts to be aggregated, wherein each text to be aggregated comprises a text title and a text body, and the distinguishing identification information is formed by splicing a source identification and a text identification;
the preliminary de-duplication processing module is configured to perform preliminary de-duplication processing on the text to be aggregated according to the distinguishing identification information and a preset first screening policy, so as to obtain a text to be aggregated after preliminary de-duplication is completed, where the first screening policy specifically is: identifying whether the identification information of the texts to be aggregated obtained from the same data end is the same or not according to the identification information of the texts to be aggregated, and if the identification information of the texts to be aggregated obtained from the same data end is the same, performing preliminary duplicate removal processing;
the title and body extraction module, configured to respectively extract the text title and the text body from each text to be aggregated after the preliminary deduplication is completed;
the text preprocessing module, configured to perform text preprocessing on the texts to be aggregated after the preliminary deduplication is completed to obtain the texts to be aggregated after the text preprocessing is completed, wherein the text preprocessing specifically comprises cleaning and normalizing punctuation, whitespace, Chinese and English characters, and simplified and traditional Chinese characters in the text;
the hash code value calculation module, configured to calculate, according to a preset hash coding algorithm, hash code values one by one for all the texts to be aggregated after the text preprocessing is completed, to obtain the hash code values corresponding to the texts to be aggregated;
the text aggregation processing module, configured to aggregate the texts to be aggregated through the hash code values, the distinguishing identification information, the text titles, the text bodies and a preset second screening strategy to obtain a text aggregation result, thereby completing the screening and aggregation of similar texts, wherein the second screening strategy specifically comprises: segmenting a target hash code value according to a preset segmentation parameter to obtain the code segments corresponding to the target hash code value and the code segment distinguishing identifiers respectively corresponding to those code segments; calculating the coding distance between hash code values to be compared according to the code segment distinguishing identifiers and a preset distance algorithm to obtain a coding distance calculation result; screening out the texts to be aggregated that meet a preset first requirement according to a preset coding distance threshold and the coding distance calculation result, and constructing a text comparison group, wherein the preset first requirement is that the coding distance between the hash code values to be compared meets the preset coding distance threshold; and aggregating, as similar texts, the texts to be aggregated in the text comparison group that meet a preset second requirement, wherein the preset second requirement is that the text titles of the texts to be aggregated are identical, or that the first N characters of the text bodies of the texts to be aggregated are identical, N being a positive integer greater than 1.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the similar text aggregation method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the similar text aggregation method of any one of claims 1 to 7.
CN202311363982.6A 2023-10-20 2023-10-20 Similar text aggregation method, device, equipment and storage medium thereof Active CN117093717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311363982.6A CN117093717B (en) 2023-10-20 2023-10-20 Similar text aggregation method, device, equipment and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311363982.6A CN117093717B (en) 2023-10-20 2023-10-20 Similar text aggregation method, device, equipment and storage medium thereof

Publications (2)

Publication Number Publication Date
CN117093717A true CN117093717A (en) 2023-11-21
CN117093717B CN117093717B (en) 2024-01-30

Family

ID=88771998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311363982.6A Active CN117093717B (en) 2023-10-20 2023-10-20 Similar text aggregation method, device, equipment and storage medium thereof

Country Status (1)

Country Link
CN (1) CN117093717B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009079875A1 (en) * 2007-12-14 2009-07-02 Shanghai Hewlett-Packard Co., Ltd Systems and methods for extracting phrases from text
CN101246501A (en) * 2008-03-27 2008-08-20 腾讯科技(深圳)有限公司 Method and system for polymerizing the same subject network document files
US20170161375A1 (en) * 2015-12-07 2017-06-08 Adlib Publishing Systems Inc. Clustering documents based on textual content
CN108334513A (en) * 2017-01-20 2018-07-27 阿里巴巴集团控股有限公司 A kind of identification processing method of Similar Text, apparatus and system
CN110750731A (en) * 2019-09-27 2020-02-04 成都数联铭品科技有限公司 Duplicate removal method and system for news public sentiment
CN111666575A (en) * 2020-04-15 2020-09-15 中国人民解放军战略支援部队信息工程大学 Text carrier-free information hiding method based on word element coding
CN111814465A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Information extraction method and device based on machine learning, computer equipment and medium
WO2021135469A1 (en) * 2020-06-17 2021-07-08 平安科技(深圳)有限公司 Machine learning-based information extraction method, apparatus, computer device, and medium
CN111985491A (en) * 2020-09-03 2020-11-24 深圳壹账通智能科技有限公司 Similar information merging method, device, equipment and medium based on deep learning
CN114282511A (en) * 2021-10-26 2022-04-05 腾讯科技(深圳)有限公司 Text duplicate removal method and device, electronic equipment and storage medium
CN114417102A (en) * 2021-12-27 2022-04-29 北京清格科技有限公司 Text duplicate removal method and device and electronic equipment
CN114547384A (en) * 2022-02-25 2022-05-27 联想(北京)有限公司 Resource object processing method and device and computer equipment
CN116028618A (en) * 2022-12-27 2023-04-28 百度国际科技(深圳)有限公司 Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium
CN116881446A (en) * 2023-05-05 2023-10-13 中国平安财产保险股份有限公司 Semantic classification method, device, equipment and storage medium thereof
CN116561298A (en) * 2023-05-11 2023-08-08 中国平安财产保险股份有限公司 Title generation method, device, equipment and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN117093717B (en) 2024-01-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 410000 room 3601, building T2 (block B), Binjiang financial center, No. 112, chazi Shandong Road, guanshaling street, Yuelu District, Changsha City, Hunan Province

Patentee after: Hunan Data Industry Group Co.,Ltd.

Country or region after: China

Address before: 410000 room 3601, building T2 (block B), Binjiang financial center, No. 112, chazi Shandong Road, guanshaling street, Yuelu District, Changsha City, Hunan Province

Patentee before: Hunan Caixin Digital Technology Co.,Ltd.

Country or region before: China
