CN114742042A - Text duplicate removal method and device, electronic equipment and storage medium - Google Patents

Text duplicate removal method and device, electronic equipment and storage medium

Info

Publication number
CN114742042A
Authority
CN
China
Prior art keywords
text
repeated
deduplicated
title
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210283294.8A
Other languages
Chinese (zh)
Inventor
潘帅
陈家银
张伟
陈曦
麻志毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Institute of Information Technology AIIT of Peking University, Hangzhou Weiming Information Technology Co Ltd
Priority to CN202210283294.8A
Publication of CN114742042A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text deduplication method and device, electronic equipment, and a storage medium. The method comprises the following steps: determining a representative word in the title of a text to be deduplicated; judging whether the representative word exists among the indexes of an index space constructed from the titles of deduplicated texts; if not, determining that the text to be deduplicated is not a repeated text; if so, judging whether the text to be deduplicated is a repeated text based on the parts of speech of the title words; and when the text to be deduplicated is determined not to be a repeated text, adding the representative word as an index, and the title with its part-of-speech tagging result as a key value, into the index space. Obtaining the most influential representative word in the title further reduces deduplication complexity and improves deduplication efficiency. Based on the assumption that the most influential word is the same among semantically similar texts, when the representative word is found in the index space constructed from the titles of deduplicated texts, deduplication is performed based on the parts of speech of the title words, thereby achieving semantics-aware deduplication.

Description

Text duplicate removal method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of text processing, in particular to a text duplicate removal method and device, electronic equipment and a storage medium.
Background
At present, electronic bidding texts are increasingly popular, millions of bidding texts are published in the whole network every day, and enterprises can obtain a great amount of potential business situation information from the bidding texts. However, due to the phenomena of network transfer, plagiarism and the like, a large amount of bidding texts crawled by enterprises have a repeated problem, and the redundant bidding texts are stored in the database, so that the storage space is wasted, and the efficiency of downstream data processing tasks is reduced. The problem of de-duplicating the bidding text is therefore a challenge for the enterprise.
Traditional text deduplication methods include the following:
1. The Jaccard similarity coefficient method judges the text repetition rate by calculating the ratio of the intersection of two texts to their union. This method cannot capture the semantics of the text, and each new text must be compared with all previous texts, so the required processing time grows linearly with the number of texts. The method therefore cannot be applied to deduplication tasks on large-scale texts.
2. The Simhash method, proposed by Google, achieves efficient indexing by hash-coding the text and completes deduplication using the Hamming distance. However, Simhash is only suitable for English; for Chinese, the text must first be segmented to obtain the weights of the feature words. For texts as long as a bidding text, segmentation incurs a high computational cost. For short texts such as bidding titles, which contain few feature words, Simhash cannot distinguish the semantics of the short texts, so a large number of misjudgments or missed judgments occur.
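The brute-force character of the Jaccard approach described above can be illustrated with a minimal Python sketch. The function names and the 0.8 threshold are hypothetical choices made for illustration and do not come from the source; texts are assumed to be already split into tokens.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Ratio of the size of the token intersection to the size of the token union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts are treated as identical
    return len(set_a & set_b) / len(union)


def is_duplicate_bruteforce(new_tokens, past_texts, threshold=0.8):
    """Compare the new text with every stored text, so cost grows with len(past_texts)."""
    return any(jaccard_similarity(new_tokens, old) >= threshold for old in past_texts)
```

Because every query scans all stored texts, the per-text cost grows with the size of the collection, which is the scalability problem the background section points out.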
Disclosure of Invention
The present invention provides a text deduplication method, apparatus, electronic device and storage medium for overcoming the above-mentioned deficiencies in the prior art, and the object is achieved by the following technical solutions.
A first aspect of the present invention provides a text deduplication method, including:
determining a representative word in a title of a text to be deduplicated;
judging whether the representative words exist in the index space constructed by the titles of the de-duplicated texts;
if not, determining that the text to be deduplicated is not a repeated text;
if so, judging whether the text to be deduplicated is a repeated text or not based on the title part-of-speech mode;
and when the text to be deduplicated is determined not to be the repeated text, adding the representative word as an index and the title and the part-of-speech tagging result of the title as key values into the index space.
In some embodiments of the present application, determining representative words in a title of text to be deduplicated comprises:
performing word segmentation on the title to obtain a word segmentation result; determining a term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result; and determining the word corresponding to the maximum TF-IDF value as the representative word.
In some embodiments of the present application, determining the term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result comprises:
determining a term frequency (TF) value of each word in the word segmentation result; acquiring an inverse document frequency (IDF) value corresponding to each word from a preset vocabulary; and determining the TF-IDF value of each word using the TF value and the IDF value of that word.
In some embodiments of the present application, determining whether the text to be deduplicated is a repeated text based on the part-of-speech manner of the title includes:
acquiring an existing title represented by the key value corresponding to the representative word in the index space; comparing the existing title with the title to find the non-repeated words between them; if non-repeated words exist, judging whether the text to be deduplicated is a repeated text according to the parts of speech of the non-repeated words; and if no non-repeated word exists, determining that the text to be deduplicated is a repeated text.
In some embodiments of the present application, determining whether the text to be deduplicated is a repeated text according to the part of speech of the non-repeated word includes:
judging whether the part of speech of the non-repeated word is a preset part of speech; if it is a preset part of speech, determining that the text to be deduplicated is not a repeated text; if it is not a preset part of speech, judging whether the text to be deduplicated is a repeated text according to the ratio of the number of non-repeated words to the total number of words in the title; if the ratio exceeds a preset value, determining that the text to be deduplicated is not a repeated text; and if the ratio does not exceed the preset value, determining that the text to be deduplicated is a repeated text.
In some embodiments of the present application, the preset parts of speech include nouns, English words, and numerals.
A second aspect of the present invention provides a text deduplication apparatus, the apparatus comprising:
the first determining module is used for determining representative words in the title of the text to be deduplicated;
the first judgment module is used for judging whether the representative words exist in the index space constructed by the titles of the de-duplicated texts;
the second determining module is used for determining that the text to be deduplicated is not a repeated text when the representative word is judged not to exist;
the second judgment module is used for judging whether the text to be deduplicated is a repeated text or not based on the title part-of-speech mode when the representative word is judged to exist;
and the space adding module is used for adding the representative word as an index and the title and the part-of-speech tagging result of the title as a key value into the index space when the text to be deduplicated is determined not to be the repeated text.
A third aspect of the present invention proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the program.
A fourth aspect of the present invention proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method according to the first aspect as described above.
Based on the text deduplication method and the text deduplication device in the first aspect and the second aspect, the text deduplication method and the text deduplication system have at least the following advantages:
By deduplicating on titles, the invention can achieve the same effect as deduplicating on full texts. Meanwhile, a title is a short text, and processing a title is far more efficient than processing the body text; for large-scale text data, title-based deduplication can greatly reduce the required processing time.
In addition, obtaining the most influential representative word in the title of the text to be deduplicated further reduces deduplication complexity and improves deduplication efficiency. Based on the assumption that the most influential word is the same among semantically similar texts, when the representative word is found in the index space constructed from the titles of deduplicated texts, deduplication is performed based on the parts of speech of the title words, thereby achieving semantics-aware deduplication.
Therefore, the scheme is both semantics-aware and highly efficient.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow diagram illustrating an embodiment of a text deduplication method in accordance with an illustrative embodiment of the present invention;
FIG. 2 is a content diagram illustrating an index space according to an exemplary embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a text deduplication apparatus according to an exemplary embodiment of the present invention;
FIG. 4 is a diagram illustrating a hardware configuration of an electronic device according to an exemplary embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a structure of a storage medium according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
To address the problems that traditional deduplication methods cannot perceive the semantics among texts and have low deduplication efficiency, the invention provides an improved text deduplication method: determining a representative word in the title of a text to be deduplicated; judging whether the representative word exists in an index space constructed from the titles of deduplicated texts; if not, determining that the text to be deduplicated is not a repeated text; if so, judging whether the text to be deduplicated is a repeated text based on the parts of speech of the title words; and when the text to be deduplicated is determined not to be a repeated text, adding the representative word as an index, and the title with its part-of-speech tagging result as a key value, into the index space.
The technical effects that can be achieved based on the above description are:
By deduplicating on titles, the invention can achieve the same effect as deduplicating on full texts. Meanwhile, a title is a short text, and processing a title is far more efficient than processing the body text; for large-scale text data, title-based deduplication can greatly reduce the required processing time.
In addition, obtaining the most influential representative word in the title of the text to be deduplicated further reduces deduplication complexity and improves deduplication efficiency. Based on the assumption that the most influential word is the same among semantically similar texts, when the representative word is found in the index space constructed from the titles of deduplicated texts, deduplication is performed based on the parts of speech of the title words, thereby achieving semantics-aware deduplication.
Therefore, the method and the device perform semantics-aware deduplication with high efficiency, perform well on deduplication tasks over large-scale texts, and can meet actual production requirements.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The first embodiment is as follows:
Fig. 1 is a flowchart illustrating an embodiment of a text deduplication method according to an exemplary embodiment of the present invention, including the following steps:
Step 101: Determine a representative word in the title of the text to be deduplicated.
The representative word is the most influential word in the title and can represent the semantics of the title.
In an optional embodiment, word segmentation is performed on the title to obtain a word segmentation result, the term frequency-inverse document frequency (TF-IDF) value of each word in the result is determined, and the word corresponding to the maximum TF-IDF value is determined as the representative word.
The larger the TF-IDF value, the greater the influence of the word in the title, which is why the word with the maximum TF-IDF value is determined as the representative word.
The calculation flow of the TF-IDF value is described below.
First, an IDF vocabulary of the target domain is constructed; the present invention takes the bidding domain as an example.
The inverse document frequency (IDF) of a term is derived from the ratio of the total number of documents to the number of documents containing the term, and measures how common the term is. The higher the IDF value, the less common the word, and the less common a word is, the better it represents the semantics of a passage. To obtain an IDF vocabulary suited to the bidding field, the invention performs word segmentation and part-of-speech tagging on a large corpus of bidding-field titles (for example, 50 million titles) and compiles the IDF vocabulary of the bidding field from the resulting statistics. The IDF is calculated as follows:
$$\mathrm{IDF}_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}$$
where $|D|$ represents the total number of titles, and $|\{ j : t_i \in d_j \}|$ represents the total number of titles containing the word $t_i$.
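As an illustration of how such an IDF vocabulary might be compiled, the following Python sketch counts, for each word, the number of titles containing it. It assumes the titles are already segmented into word lists (the source does not name a segmentation tool) and assumes the conventional logarithmic form of IDF; `build_idf_vocabulary` is a hypothetical name.

```python
import math
from collections import Counter


def build_idf_vocabulary(segmented_titles):
    """Map each word to log(total titles / number of titles containing the word)."""
    total = len(segmented_titles)
    doc_freq = Counter()
    for words in segmented_titles:
        doc_freq.update(set(words))  # count each word at most once per title
    return {word: math.log(total / df) for word, df in doc_freq.items()}


# Toy corpus standing in for a large collection of bidding titles.
corpus = [
    ["laboratory", "purchase", "announcement"],
    ["hospital", "purchase", "announcement"],
    ["bridge", "construction", "announcement"],
]
idf_vocab = build_idf_vocabulary(corpus)
```

In this toy corpus, a word that appears in every title, such as "announcement", receives an IDF of zero, while a word that appears in a single title receives the largest value.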
Then, the IDF value of each word in the word segmentation result is obtained from the IDF vocabulary, and the TF value of each word is determined. The TF value is calculated as follows:
$$\mathrm{TF}_i = \frac{n_i}{\sum_k n_k}$$
where $n_i$ represents the number of occurrences of word $i$ in the title, and $\sum_k n_k$ represents the total number of occurrences of all words in the title.
Finally, the TF-IDF value of each word is determined from the TF value and the IDF value of that word according to the following formula:
$$\mathrm{TF\text{-}IDF}_i = \mathrm{TF}_i \times \mathrm{IDF}_i$$
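The selection of the representative word in step 101 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the title is already segmented, the IDF vocabulary is a plain dictionary, words missing from the vocabulary fall back to a hypothetical `default_idf`, and ties go to the earliest word in the title. None of these details are specified in the source.

```python
from collections import Counter


def representative_word(words, idf_vocab, default_idf=0.0):
    """Return the word of the title with the largest TF-IDF value (None if empty)."""
    counts = Counter(words)
    total = sum(counts.values())
    best_word, best_score = None, float("-inf")
    for word in words:  # title order, so ties are resolved by first occurrence
        tf = counts[word] / total
        score = tf * idf_vocab.get(word, default_idf)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical IDF values, chosen only for illustration.
example_idf = {
    "water conservancy project": 5.0,
    "laboratory": 3.0,
    "purchase": 0.5,
    "announcement": 0.2,
}
example_title = ["water conservancy project", "laboratory", "purchase", "announcement"]
chosen = representative_word(example_title, example_idf)
```

Since every word of a short title usually occurs once, the TF values are equal and the choice is driven by the IDF values, which is why a domain-specific IDF vocabulary matters here.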
Step 102: Judge whether the representative word exists in the index space constructed from the titles of the deduplicated texts; if not, execute step 103, and if so, execute step 104.
The index space is constructed before step 102 is executed. The word segmentation result of the title of each deduplicated text and the TF-IDF value of each word are obtained; then, based on the assumption that the most influential word is the same among semantically similar texts, the word with the largest TF-IDF value in the title is used as the index, and the title after word segmentation and part-of-speech tagging is used as the key value.
As shown in fig. 2, an index space entry is created from the title "water conservancy project research laboratory's purchase announcement about experimental equipment": the first word, "water conservancy project", is the index, and the content in [ ], namely the title with its part-of-speech tagging result, is the key value.
In step 102, based on the same assumption, when the representative word is not in the index space, the text to be deduplicated can be determined to have no semantic duplicate; when the representative word is in the index space, deduplication proceeds to the further judgment based on the parts of speech of the title words.
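One possible in-memory layout for such an index space is a dictionary from representative word to the list of stored titles, each title kept as (word, part-of-speech) pairs. The Python sketch below is an assumption about the data structure, not the layout shown in fig. 2; the function names and the part-of-speech tags in the example are hypothetical.

```python
from collections import defaultdict


def add_title(index_space, rep_word, tagged_title):
    """Store a segmented, part-of-speech-tagged title under its representative word."""
    index_space[rep_word].append(tagged_title)


def candidates(index_space, rep_word):
    """Return the stored titles that share the representative word (empty if none)."""
    return index_space.get(rep_word, [])  # .get avoids creating an empty entry


index_space = defaultdict(list)
tagged = [
    ("water conservancy project", "n"),
    ("research", "vn"),
    ("laboratory", "n"),
    ("about", "p"),
    ("experimental equipment", "n"),
    ("purchase", "v"),
    ("announcement", "n"),
]
add_title(index_space, "water conservancy project", tagged)
```

A lookup by representative word touches only the titles stored under that word, so the number of comparisons depends on the size of one bucket rather than on the whole collection.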
Step 103: Determine that the text to be deduplicated is not a repeated text.
Step 104: Judge whether the text to be deduplicated is a repeated text based on the parts of speech of the title words.
In an optional embodiment, an existing title represented by the key value corresponding to the representative word in the index space is obtained, and the existing title is compared with the title of the text to be deduplicated to find the non-repeated words between them. If non-repeated words exist, whether the text to be deduplicated is a repeated text is judged according to the parts of speech of the non-repeated words; if no non-repeated word exists, the text to be deduplicated is determined to be a repeated text.
A non-repeated word between the existing title and the title of the text to be deduplicated may come from either title.
It should be noted that, according to extensive experience, titles that differ in product words or quantity words have different semantics in most cases. The possible parts of speech of product words are noun (n) and English (eng). Therefore, when the non-repeated words between titles are nouns, English words, or numerals (m), the input title can be considered a non-repeated title.
Based on this, in judging whether the text to be deduplicated is a repeated text according to the parts of speech of the non-repeated words, it is first judged whether the part of speech of a non-repeated word is a preset part of speech. If it is a preset part of speech, the text to be deduplicated is determined not to be a repeated text. If it is not a preset part of speech, the judgment is made according to the ratio of the number of non-repeated words to the total number of words in the title: if the ratio exceeds a preset value, the text to be deduplicated is determined not to be a repeated text, and if the ratio does not exceed the preset value, the text to be deduplicated is determined to be a repeated text.
The preset parts of speech are nouns, English words, and numerals.
Further, when the proportion of non-repeated words in the total number of words of the title exceeds a certain value, the titles can no longer keep the same semantics, so the text to be deduplicated is judged not to be a repeated text; otherwise, it is a repeated text.
Optionally, statistics over a large amount of data indicate that a preset proportion of 1/3 gives the best results.
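The judgment in step 104 can be sketched in Python as follows. Several details are assumptions because the text leaves them open: a single non-repeated word with a preset part of speech is enough to treat the titles as different, the ratio is taken against the word count of the title being checked, and titles are lists of (word, part-of-speech) pairs. The tags `n`, `eng`, and `m` and the 1/3 threshold come from the description; the function name is hypothetical.

```python
PRESET_POS = {"n", "eng", "m"}  # noun, English word, numeral


def is_duplicate_title(new_title, existing_title, ratio_threshold=1 / 3):
    """Judge whether new_title repeats existing_title using non-repeated words."""
    new_words = dict(new_title)
    old_words = dict(existing_title)
    # Non-repeated words may come from either title (symmetric difference).
    diff = set(new_words) ^ set(old_words)
    if not diff:
        return True  # no non-repeated word: repeated text
    for word in diff:
        pos = new_words.get(word, old_words.get(word))
        if pos in PRESET_POS:
            return False  # a differing noun, English word, or numeral changes the meaning
    ratio = len(diff) / max(len(new_title), 1)
    return ratio <= ratio_threshold  # too many differing words: not a repeated text


base = [("laboratory", "n"), ("about", "p"), ("equipment", "n"),
        ("purchase", "v"), ("announcement", "n")]
other_product = [("laboratory", "n"), ("about", "p"), ("reagent", "n"),
                 ("purchase", "v"), ("announcement", "n")]
extra_particle = base + [("of", "u")]
```

With these inputs, `other_product` differs from `base` by a noun and is treated as a different title, while `extra_particle` differs only by one non-preset word out of six and is treated as a repeat.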
It should be noted that, when it is determined that the text to be deduplicated is a repeated text, the text to be deduplicated may be discarded.
Step 105: When the text to be deduplicated is determined not to be a repeated text, add the representative word as an index, and the title with its part-of-speech tagging result as a key value, into the index space.
When the text to be deduplicated is determined not to be a repeated text, it is a valuable text and becomes one of the deduplicated texts; its title is processed and added to the index space so that subsequently input texts can be deduplicated against it.
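The control flow of steps 101 to 105 can be summarized in a short Python sketch. The class and method names are hypothetical, and the two judgment functions are passed in as parameters so that the sketch stays independent of any particular TF-IDF or part-of-speech implementation. It also assumes that a new title counts as repeated if it repeats any stored title under the same index, which the text does not state explicitly.

```python
from collections import defaultdict


class TitleDeduplicator:
    """Minimal sketch of the step 101-105 flow over (word, part-of-speech) titles."""

    def __init__(self, pick_representative, is_duplicate):
        self.pick_representative = pick_representative  # words -> representative word
        self.is_duplicate = is_duplicate                # (new, existing) -> bool
        self.index_space = defaultdict(list)

    def check_and_add(self, tagged_title):
        """Return True if the title is judged repeated; otherwise store it."""
        words = [word for word, _ in tagged_title]
        rep = self.pick_representative(words)              # step 101
        existing = self.index_space.get(rep, [])           # step 102
        if any(self.is_duplicate(tagged_title, old) for old in existing):
            return True                                    # step 104: repeated, may be discarded
        self.index_space[rep].append(tagged_title)         # steps 103 and 105
        return False


# Stand-ins for illustration only: first word as representative, exact match as duplicate.
dedup = TitleDeduplicator(lambda words: words[0], lambda new, old: new == old)
first = [("laboratory", "n"), ("purchase", "v")]
second = [("laboratory", "n"), ("tender", "v")]
results = [dedup.check_and_add(first), dedup.check_and_add(first), dedup.check_and_add(second)]
```

In practice the stand-ins would be replaced by a TF-IDF-based selector and a part-of-speech-based comparison; only the bucket for the chosen representative word is scanned on each call.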
Aiming at the process from the step 101 to the step 105, in order to verify the efficiency and the accuracy of the algorithm provided by the invention, the invention tests in the large-scale corpus de-duplication task of the bidding short text and performs de-duplication processing on 5000 titles in the bidding field.
Experiments show that the invention processes 100 titles per second on average, that the time complexity does not increase as the number of titles grows, which meets actual production requirements, and that the deduplication accuracy can exceed 95 percent.
This completes the deduplication process shown in fig. 1. A title is usually a summary of the body and contains the core information of the body. Meanwhile, a title is a short text, and processing a title is far more efficient than processing the body text; for large-scale text data, title-based deduplication can greatly reduce the required processing time.
In addition, obtaining the most influential representative word in the title of the text to be deduplicated further reduces deduplication complexity and improves deduplication efficiency. Based on the assumption that the most influential word is the same among semantically similar texts, when the representative word is found in the index space constructed from the titles of deduplicated texts, deduplication is performed based on the parts of speech of the title words, thereby achieving semantics-aware deduplication.
Therefore, the scheme is both semantics-aware and highly efficient.
Corresponding to the embodiment of the text deduplication method, the invention also provides an embodiment of a text deduplication device.
Fig. 3 is a schematic structural diagram of a text deduplication device according to an exemplary embodiment of the present invention, the text deduplication device is configured to perform the text deduplication method provided in any of the above embodiments, as shown in fig. 3, the text deduplication device includes:
a first determining module 310, configured to determine a representative word in a title of a text to be deduplicated;
a first judging module 320, configured to judge whether the representative word exists in an index space constructed by titles of deduplicated texts;
a second determining module 330, configured to determine that the text to be deduplicated is not a repeated text when it is determined that the representative word does not exist;
the second judging module 340 is configured to, when judging that the representative word exists, judge whether the text to be deduplicated is a repeated text based on a heading part-of-speech manner;
and a space adding module 350, configured to, when it is determined that the text to be deduplicated is not a repeated text, add the representative word as an index to the index space, and add the title and the part-of-speech tagging result of the title as a key value.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides electronic equipment corresponding to the text deduplication method provided by the embodiment, so as to execute the text deduplication method.
Fig. 4 is a hardware block diagram of an electronic device according to an exemplary embodiment of the present invention, the electronic device including: a communication interface 601, a processor 602, a memory 603, and a bus 604; the communication interface 601, the processor 602 and the memory 603 communicate with each other via a bus 604. The processor 602 may execute the text deduplication method described above by reading and executing machine-executable instructions in the memory 603 corresponding to the control logic of the text deduplication method, and the details of the method are described in the above embodiments and will not be described again here.
The memory 603 referred to in this disclosure may be any electronic, magnetic, optical, or other physical storage device that can contain stored information, such as executable instructions, data, and so forth. Specifically, the Memory 603 may be a RAM (Random Access Memory), a flash Memory, a storage drive (e.g., a hard disk drive), any type of storage disk (e.g., an optical disk, a DVD, etc.), or similar storage medium, or a combination thereof. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 601 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
Bus 604 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 603 is used for storing a program, and the processor 602 executes the program after receiving the execution instruction.
The processor 602 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 602 or by instructions in the form of software. The processor 602 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor.
The electronic device provided by the embodiment of the application is based on the same inventive concept as the text deduplication method provided by the embodiment of the application, and therefore has the same beneficial effects as the method that it adopts, runs, or implements.
Referring to fig. 5, the computer-readable storage medium is an optical disc 30, and a computer program (i.e., a program product) is stored thereon, and when being executed by a processor, the computer program may execute the text deduplication method provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application has the same beneficial effects as the text deduplication method adopted, executed, or implemented by the application program stored on it.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for text deduplication, the method comprising:
determining a representative word in a title of a text to be deduplicated;
judging whether the representative word exists in an index space constructed from the titles of already deduplicated texts;
if not, determining that the text to be deduplicated is not a repeated text;
if so, judging whether the text to be deduplicated is a repeated text based on the parts of speech of the title;
and when the text to be deduplicated is determined not to be the repeated text, adding the representative word as an index and the title and the part-of-speech tagging result of the title as key values into the index space.
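The control flow of claim 1 can be sketched as follows. This is only an illustrative sketch, not the applicant's implementation: the function name `deduplicate`, the type aliases, and the three callables passed in (`pick_representative`, `tag_title`, `is_repeat`) are hypothetical, and the sketch assumes that each representative word may map to several stored titles.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a tagged title is a list of (word, part_of_speech) pairs.
TaggedTitle = List[Tuple[str, str]]
# Assumed index space layout: representative word -> list of (title, tagged title).
IndexSpace = Dict[str, List[Tuple[str, TaggedTitle]]]


def deduplicate(
    title: str,
    index_space: IndexSpace,
    pick_representative: Callable[[str], str],
    tag_title: Callable[[str], TaggedTitle],
    is_repeat: Callable[[TaggedTitle, TaggedTitle], bool],
) -> bool:
    """Return True if the title is judged a repeat; otherwise index it and return False."""
    word = pick_representative(title)
    tagged = tag_title(title)
    # If the representative word is absent, the list is empty and the text is not a repeat.
    existing = index_space.get(word, [])
    # If present, compare the new title with each stored title under that word.
    repeated = any(is_repeat(old_tagged, tagged) for _, old_tagged in existing)
    if not repeated:
        # Non-repeated text: representative word as index, title and tags as key value.
        index_space.setdefault(word, []).append((title, tagged))
    return repeated
```

Passing the three steps in as callables keeps the sketch independent of any particular word segmenter or part-of-speech tagger.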
2. The method of claim 1, wherein determining representative words in a title of text to be deduplicated comprises:
performing word segmentation on the title to obtain a word segmentation result;
determining a term frequency-inverse document frequency (TF-IDF) value of each word in the word segmentation result;
and determining the word corresponding to the maximum TF-IDF value as the representative word.
3. The method of claim 2, wherein determining a term frequency-inverse document frequency (TF-IDF) value for each word in the word segmentation result comprises:
determining a term frequency (TF) value of each word in the word segmentation result;
acquiring an inverse document frequency (IDF) value corresponding to each word from a preset vocabulary table;
and determining the TF-IDF value of each word using the TF value and the IDF value of that word.
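A minimal sketch of claims 2 and 3 is given below. It assumes the title has already been segmented into a list of tokens, that TF is the token count divided by the title length, and that TF and IDF are combined as a product; the claims fix none of these details. The names `representative_word`, `idf_table`, and `default_idf` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def representative_word(tokens: List[str], idf_table: Dict[str, float],
                        default_idf: float = 0.0) -> str:
    """Return the token with the largest TF-IDF value among the title's tokens."""
    if not tokens:
        raise ValueError("title has no tokens")
    counts = Counter(tokens)
    total = len(tokens)
    # Assumed TF: occurrences in the title / number of tokens in the title.
    # IDF is looked up from the preset vocabulary table (default for unknown words).
    scores = {w: (c / total) * idf_table.get(w, default_idf)
              for w, c in counts.items()}
    # max() returns the first maximal key, so ties resolve to the earliest token.
    return max(scores, key=scores.get)
```

Looking IDF up from a preset table, as claim 3 describes, avoids recomputing document frequencies for every incoming title.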
4. The method of claim 1, wherein judging whether the text to be deduplicated is a repeated text based on the parts of speech of the title comprises:
acquiring an existing title represented by a key value corresponding to the representative word in the index space;
comparing the existing title with the title to find the non-repeated words between them;
if the non-repeated words exist, judging whether the text to be deduplicated is a repeated text or not according to the part of speech of the non-repeated words;
and if no non-repeated word exists, determining that the text to be deduplicated is a repeated text.
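The comparison in claim 4 can be illustrated with the sketch below. It assumes that "non-repeated words" means words appearing in exactly one of the two titles and that word order and repetition counts are ignored; the claim does not state this. The function name `non_repeated_words` and the `TaggedTitle` alias are hypothetical.

```python
from typing import List, Tuple

TaggedTitle = List[Tuple[str, str]]  # (word, part_of_speech) pairs


def non_repeated_words(existing: TaggedTitle, new: TaggedTitle) -> TaggedTitle:
    """Return the tagged words that appear in exactly one of the two titles."""
    existing_words = {w for w, _ in existing}
    new_words = {w for w, _ in new}
    only_new = [(w, p) for w, p in new if w not in existing_words]
    only_existing = [(w, p) for w, p in existing if w not in new_words]
    # An empty result means the titles share all words, i.e. a repeated text.
    return only_new + only_existing
```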
5. The method of claim 4, wherein determining whether the text to be deduplicated is repeated text based on parts of speech of non-repeating words comprises:
judging whether the part of speech of the non-repeated word is a preset part of speech or not;
if it is a preset part of speech, determining that the text to be deduplicated is not a repeated text;
if it is not a preset part of speech, judging whether the text to be deduplicated is a repeated text according to the ratio of the number of non-repeated words to the total number of segmented words in the title;
if the proportion exceeds a preset value, determining that the text to be deduplicated is not a repeated text;
and if the proportion does not exceed the preset value, determining that the text to be deduplicated is a repeated text.
6. The method of claim 5, wherein the preset parts of speech include nouns, English words, and numerals.
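The decision rule of claims 5 and 6 might be sketched as follows. The tag names in `PRESET_POS`, the default threshold `max_ratio=0.3`, and the choice to treat the text as non-repeated when any one non-repeated word has a preset part of speech are all assumptions made for illustration; the claims specify neither a tag set nor a threshold value.

```python
from typing import List, Tuple

TaggedTitle = List[Tuple[str, str]]  # (word, part_of_speech) pairs

# Hypothetical tag names standing in for noun, English word, and numeral.
PRESET_POS = {"n", "eng", "m"}


def is_repeated(diff: TaggedTitle, title_token_count: int,
                max_ratio: float = 0.3) -> bool:
    """Judge repetition from the non-repeated words between two titles."""
    if title_token_count <= 0:
        raise ValueError("title_token_count must be positive")
    # No non-repeated words: the titles match, so the text is a repeat.
    if not diff:
        return True
    # A differing noun, English word, or numeral marks a distinct text.
    if any(pos in PRESET_POS for _, pos in diff):
        return False
    # Otherwise decide by the share of non-repeated words in the title.
    ratio = len(diff) / title_token_count
    return ratio <= max_ratio
```

Under these assumptions, titles that differ only in a few words outside the preset parts of speech are still treated as repeats, while a change in a noun, English word, or numeral is enough to keep both texts.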
7. A text deduplication apparatus, the apparatus comprising:
the first determining module is used for determining a representative word in a title of the text to be deduplicated;
the first judgment module is used for judging whether the representative word exists in an index space constructed from the titles of already deduplicated texts;
the second determining module is used for determining that the text to be deduplicated is not a repeated text when the representative word is judged not to exist;
the second judgment module is used for judging whether the text to be deduplicated is a repeated text based on the parts of speech of the title when the representative word is judged to exist;
and the space adding module is used for adding the representative word as an index and the title and the part-of-speech tagging result of the title as a key value into the index space when the text to be deduplicated is determined not to be the repeated text.
8. The apparatus according to claim 7, wherein the second judgment module is specifically configured to: obtain an existing title represented by a key value corresponding to the representative word in the index space; compare the existing title with the title to find the non-repeated words between them; if non-repeated words exist, judge whether the text to be deduplicated is a repeated text according to the parts of speech of the non-repeated words; and if no non-repeated word exists, determine that the text to be deduplicated is a repeated text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-6 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202210283294.8A 2022-03-22 2022-03-22 Text duplicate removal method and device, electronic equipment and storage medium Pending CN114742042A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210283294.8A CN114742042A (en) 2022-03-22 2022-03-22 Text duplicate removal method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114742042A true CN114742042A (en) 2022-07-12

Family

ID=82276380

Country Status (1)

Country Link
CN (1) CN114742042A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination