CN109241008B - Document de-duplication method and device - Google Patents

Publication number
CN109241008B
Authority
CN
China
Prior art keywords
document
attribute
target
target document
author
Prior art date
Legal status
Active
Application number
CN201810893169.2A
Other languages
Chinese (zh)
Other versions
CN109241008A (en
Inventor
赵荣生
宋再伟
黄振城
周旻
Current Assignee
Peking University Third Hospital
Original Assignee
Beijing Nuodao Cognitive Medical Technology Co ltd
Peking University Third Hospital
Priority date
Filing date
Publication date
Application filed by Beijing Nuodao Cognitive Medical Technology Co ltd, Peking University Third Hospital
Priority to CN201810893169.2A
Publication of CN109241008A
Application granted
Publication of CN109241008B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a document deduplication method and device. Attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the deduplication result for the first target document and the second target document is obtained from the attribute tag repetition rate. This speeds up duplicate checking and saves labor and time.

Description

Document de-duplication method and device
Technical Field
The invention relates to the technical field of information processing, in particular to a document duplicate removal method and device.
Background
Screening duplicate documents is important but time-consuming work. If manual screening can be replaced by automated screening, the workload of scientific research can be reduced considerably. At present, listing redundancy is the major issue in this process.
Listing redundancy refers to redundancy in cross-database retrieval results caused by overlap among the journals that the databases index. Unlike general web page information, which can be uploaded and reposted freely, bibliographic information usually involves copyright and is therefore bound to a specific publication, so its provenance is unique and it cannot be freely uploaded or reposted. However, a given publication is always indexed by one or more online databases, and the sets of publications indexed by different databases often intersect. Because personal or institutional publication information is usually gathered by searching several databases at once, the overlap of database listings is the most fundamental cause of redundancy in cross-database document retrieval. The most common remedy for listing redundancy is to check for duplicates manually by ISBN, but this approach is inefficient.
Disclosure of Invention
The invention provides a document duplicate removal method and a document duplicate removal device, which are used for solving the problem of low document duplicate detection efficiency in the prior art.
In a first aspect, an embodiment of the present invention provides a document deduplication method, including:
acquiring attribute tags and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with repeated attribute contents;
screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels;
obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;
and obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate.
Optionally, when the target attribute tag includes a document author and a document title, the obtaining a repetition rate of the corresponding attribute tag according to the attribute content corresponding to the target attribute tag includes:
obtaining the character string length of the document title of the first target document, the character string length of the document title of the second target document and the total character string length of the repeated content of the first target document and the second target document under the document title according to the attribute content corresponding to the document title tag;
obtaining a document title repetition rate by adopting a first calculation formula according to the respective character string lengths of the document titles of the first target document and the second target document and the total character string length of the repeated content;
acquiring the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are mutually repeated according to the attribute content corresponding to the document author tags;
and obtaining the document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of the character string in which the author names of the first target document and the author names of the second target document are repeated mutually.
Optionally, the first calculation formula includes:
Figure BDA0001757479410000021
wherein TR-Rate is the document title repetition rate, $L_{TM}$ is the total character string length of the repeated content of the first target document and the second target document under the document title, $L_{paper1}$ is the character string length of the document title of the first target document, and $L_{paper2}$ is the character string length of the document title of the second target document.
Optionally, the second calculation formula includes:
Figure BDA0001757479410000031
wherein AR-Rate is the document author repetition rate; (Figure BDA0001757479410000032) is the set of authors of the first target document; (Figure BDA0001757479410000033) is the set of authors of the second target document; $n_1$ and $n_2$ are the numbers of authors; $a$ is the subscript corresponding to the smaller of $n_1$ and $n_2$; (Figure BDA0001757479410000034) is the character string length of the author (Figure BDA0001757479410000035); (Figure BDA0001757479410000036) is the character string length of the author (Figure BDA0001757479410000037); and (Figure BDA0001757479410000038) is the character string length of the content repeated between the author (Figure BDA0001757479410000039) and the author (Figure BDA00017574794100000310).
Optionally, the method further comprises:
acquiring attribute content of the document corresponding to the duplicate removal result; and obtaining and storing a reserved document according to the attribute content and a preset screening index.
In a second aspect, an embodiment of the present invention provides a document deduplication apparatus, including:
an acquisition module, configured to acquire attribute tags and attribute contents corresponding to a first target document and a second target document respectively, wherein the first target document and the second target document are documents whose attribute contents are mutually duplicated;
the screening module is used for screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels;
the computing module is used for obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;
and the judging module is used for obtaining the duplicate removal results of the first target document and the second target document according to the attribute label repetition rate.
Optionally, when the target attribute tag includes a document author and a document title, the calculation module is specifically configured to:
obtaining the character string length of the document title of the first target document, the character string length of the document title of the second target document and the total character string length of the repeated content of the first target document and the second target document under the document title according to the attribute content corresponding to the document title tag;
obtaining a document title repetition rate by adopting a first calculation formula according to the respective character string lengths of the document titles of the first target document and the second target document and the total character string length of the repeated content;
acquiring the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are mutually repeated according to the attribute content corresponding to the document author tags;
and obtaining the document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of the character string in which the author names of the first target document and the author names of the second target document are repeated mutually.
Optionally, the first calculation formula includes:
Figure BDA0001757479410000041
wherein TR-Rate is the document title repetition rate, $L_{TM}$ is the total character string length of the repeated content of the first target document and the second target document under the document title, $L_{paper1}$ is the character string length of the document title of the first target document, and $L_{paper2}$ is the character string length of the document title of the second target document.
Optionally, the second calculation formula includes:
Figure BDA0001757479410000042
wherein AR-Rate is the document author repetition rate; (Figure BDA0001757479410000043) is the set of authors of the first target document; (Figure BDA0001757479410000044) is the set of authors of the second target document; $n_1$ and $n_2$ are the numbers of authors; $a$ is the subscript corresponding to the smaller of $n_1$ and $n_2$; (Figure BDA0001757479410000045) is the character string length of the author (Figure BDA0001757479410000046); (Figure BDA0001757479410000047) is the character string length of the author (Figure BDA0001757479410000048); and (Figure BDA0001757479410000049) is the character string length of the content repeated between the author (Figure BDA00017574794100000410) and the author (Figure BDA00017574794100000411).
Optionally, the system further comprises a screening module, configured to: acquiring attribute content of the document corresponding to the duplicate removal result; and obtaining and storing a reserved document according to the attribute content and a preset screening index.
As can be seen from the foregoing technical solutions, in the document deduplication method and device provided by the embodiments of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the deduplication result for the first target document and the second target document is obtained from the attribute tag repetition rate. This speeds up duplicate checking and saves labor and time.
Drawings
FIG. 1 is a schematic flow chart of a document deduplication process according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of extracting repeated character strings of a document title according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a document deduplication method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a document deduplication apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a document deduplication apparatus according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Fig. 1 shows that an embodiment of the present invention provides a document deduplication method, including:
S11, acquiring attribute labels and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with mutually repeated attribute contents;
S12, screening out a target attribute label from the attribute labels and attribute content corresponding to the target attribute label;
S13, obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;
S14, obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate.
Regarding steps S11 to S14, it should be noted that, for a document database that needs to be checked for duplicates, the embodiment of the invention may first screen out the documents that contain duplicated information and then compare those documents in pairs until it has been determined for each pair whether the two documents are duplicates.
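The pre-screening and pairwise comparison described here can be sketched as follows. This is only an illustration: the text does not say how documents "with duplicate information" are identified, so the sketch assumes a pair is a candidate when the two records share at least one identical attribute content under the same tag. The function name `candidate_pairs` and the record layout (a dict of tag to content) are hypothetical.

```python
from itertools import combinations


def candidate_pairs(docs):
    """Return index pairs (i, j) of documents that share at least one
    identical attribute content under the same attribute tag.

    ASSUMPTION: "documents with duplicate information" is interpreted as
    "any attribute content is identical"; the source does not define it.
    """
    pairs = []
    for (i, doc_a), (j, doc_b) in combinations(enumerate(docs), 2):
        shared = any(
            tag in doc_a and doc_a[tag] == content
            for tag, content in doc_b.items()
        )
        if shared:
            pairs.append((i, j))
    return pairs


# Hypothetical records, each mapping an attribute tag to its content.
docs = [
    {"title": "A", "year": "2018"},
    {"title": "B", "year": "2018"},
    {"title": "C", "year": "2017"},
]
print(candidate_pairs(docs))
```

Comparing every pair is quadratic in the number of documents, so for a large database an index keyed on attribute content would likely be preferable; the exhaustive loop is kept here only for clarity.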
Two documents are obtained: a first target document and a second target document, whose attribute contents duplicate each other. The basic information of each document is expressed as attribute tags and their corresponding attribute contents.
And screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels. In an embodiment of the present invention, the target attribute tags for screening may be document authors and document titles.
Obtaining the character string length of the document title of the first target document, the character string length of the document title of the second target document and the total character string length of the repeated content of the first target document and the second target document under the document title according to the attribute content corresponding to the document title tag;
and obtaining the document title repetition rate by adopting a first calculation formula according to the respective character string lengths of the document titles of the first target document and the second target document and the total character string length of the repeated content.
The first calculation formula includes:
Figure BDA0001757479410000061
wherein TR-Rate is the document title repetition rate, $L_{TM}$ is the total character string length of the repeated content of the first target document and the second target document under the document title, $L_{paper1}$ is the character string length of the document title of the first target document, and $L_{paper2}$ is the character string length of the document title of the second target document.
FIG. 2 is a schematic diagram of extracting the repeated character string of a document title. As shown in the figure, the title of document 1 is "influence of MTHFR gene polymorphism in children with acute lymphoblastic leukemia on adverse reaction of high-dose methotrexate", and the title of document 2 is "the influence of GSTP1 and MTHFR gene polymorphism in children with acute lymphoblastic leukemia on the toxic and side effects of high-dose methotrexate". The repeated content of document 1 and document 2 is extracted by the longest common subsequence (LCS) dynamic programming method: "influence of MTHFR gene polymorphism in children with acute lymphoblastic leukemia on response to high-dose methotrexate". The character string length of each of these strings is then known, and the document title repetition rate can be obtained with the first calculation formula.
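The LCS extraction step can be illustrated with a short dynamic programming sketch. The LCS length computation is standard; the normalization in `title_repetition_rate` is an assumption, because the first calculation formula survives only as an image reference in this text. Dividing by the longer title's length is one plausible choice and should not be read as the patented formula. Both function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Row-by-row dynamic programming: after processing a[:i], prev[j]
    holds the LCS length of a[:i] and b[:j].
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def title_repetition_rate(title1: str, title2: str) -> float:
    """ASSUMED normalization (not taken from the source): the LCS length
    L_TM divided by the larger of L_paper1 and L_paper2."""
    if not title1 or not title2:
        return 0.0
    return lcs_length(title1, title2) / max(len(title1), len(title2))


print(lcs_length("ABCBDAB", "BDCABA"))
print(title_repetition_rate("abcd", "ab"))
```

Keeping only two rows of the table makes memory use proportional to the shorter of the two loops' inner string, which is ample for title-length inputs.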
Acquiring the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are mutually repeated according to the attribute content corresponding to the document author tags;
and obtaining the document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of the character string in which the author names of the first target document and the author names of the second target document are repeated mutually.
The second calculation formula includes:
Figure BDA0001757479410000071
wherein AR-Rate is the document author repetition rate; (Figure BDA0001757479410000072) is the set of authors of the first target document; (Figure BDA0001757479410000073) is the set of authors of the second target document; $n_1$ and $n_2$ are the numbers of authors; $a$ is the subscript corresponding to the smaller of $n_1$ and $n_2$; (Figure BDA0001757479410000074) is the character string length of the author (Figure BDA0001757479410000075); (Figure BDA0001757479410000076) is the character string length of the author (Figure BDA0001757479410000077); and (Figure BDA0001757479410000078) is the character string length of the content repeated between the author (Figure BDA0001757479410000079) and the author (Figure BDA00017574794100000710).
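The quantities used by the second calculation formula (author counts, per-author string lengths, and the length of content repeated between two author names) can be combined in many ways, and the formula itself is only available as an image reference here. The sketch below is therefore an assumed stand-in, not the patented formula: for each author in the shorter author list it takes the best-matching author in the other list, scores the match as LCS length over the longer name's length, and averages over the shorter list. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def author_repetition_rate(authors1: list, authors2: list) -> float:
    """ASSUMED scoring, for illustration only: average best-match
    similarity over the shorter author list."""
    # Drop empty names so the per-name normalization never divides by zero.
    names1 = [n.strip() for n in authors1 if n.strip()]
    names2 = [n.strip() for n in authors2 if n.strip()]
    if not names1 or not names2:
        return 0.0
    # The list with fewer authors drives the sum (the role of subscript a).
    shorter, longer = sorted((names1, names2), key=len)
    total = 0.0
    for name in shorter:
        total += max(
            lcs_length(name, other) / max(len(name), len(other))
            for other in longer
        )
    return total / len(shorter)


rate = author_repetition_rate(
    ["Zhao R", "Song Z"], ["Song Z", "Zhao R", "Zhou M"]
)
print(rate)
```

Matching each author against every author in the other list makes the score insensitive to author order, which may or may not match the intent of the original formula.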
In this embodiment, after the document title repetition rate and the document author repetition rate are obtained, they are compared with preset repetition rate thresholds to determine whether the two documents are duplicates. Each repetition rate may have its own preset threshold, or both may share a single preset threshold.
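A minimal sketch of this threshold comparison is given below. Two points are assumptions rather than statements from the text: the default threshold value of 0.8 is purely illustrative, and the sketch requires both rates to reach their thresholds (the text does not say whether the two comparisons are combined with "and" or "or"). The function name `is_duplicate` is hypothetical.

```python
def is_duplicate(title_rate, author_rate,
                 title_threshold=0.8, author_threshold=None):
    """Decide whether two documents are duplicates.

    If author_threshold is omitted, both rates share title_threshold,
    mirroring the "share one preset threshold" option.
    ASSUMPTIONS: the 0.8 default and the AND combination are illustrative.
    """
    if author_threshold is None:
        author_threshold = title_threshold
    return title_rate >= title_threshold and author_rate >= author_threshold


print(is_duplicate(0.9, 0.85))                        # shared threshold
print(is_duplicate(0.9, 0.5, author_threshold=0.4))   # separate thresholds
```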
After deduplication is finished, the documents that remain are screened according to PICO (a clinical question framed as Population, Intervention, Comparison and Outcome). Specifically: the attribute content of each document in the deduplication result is acquired, and the documents to be retained are obtained and stored according to that attribute content and preset screening indexes.
The following conditions are checked in sequence against the preset PICO indexes:
(1) Determining whether the study is a clinical study
PICO index: patient, case, person, infant, child, clinic
(2) Determining whether the population is relevant
PICO index: methotrexate, MTX, methotrexate, High-dose, HD, leukemia, ALL, AL, lymphoma, NHL, osteosarcoma, OS, hematologic tumors
(3) Determining whether the intervention and comparison are relevant
PICO index: genes, polymorphisms, methylenetetrahydrofolate reductase, reduced folate carriers, glycoproteins, multiple drug resistance genes, SNPs, MTHFR, RFC, SLC19A1, ABCB1, MDR
(4) Determining whether the outcome indicators are relevant
PICO index: toxicity, toxic and side effects, adverse events, adverse reactions, side effects
Documents that meet all four conditions are retained; a document that fails any one of them is deleted.
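The four-condition screening can be sketched as a keyword filter. The sketch makes several assumptions: it uses a subset of the translated index terms, it matches terms as case-insensitive word prefixes (so "child" also matches "children"), and it omits very short abbreviations such as "AL" and "OS", which would need stricter matching to avoid false hits. The names `PICO_INDEX` and `passes_pico` are hypothetical.

```python
import re

# Subset of the index terms listed above, grouped by condition.
PICO_INDEX = {
    "clinical study": ["patient", "case", "person", "infant", "child", "clinic"],
    "population": ["methotrexate", "MTX", "high-dose", "leukemia",
                   "lymphoma", "osteosarcoma"],
    "intervention/comparison": ["gene", "polymorphism", "SNP", "MTHFR",
                                "RFC", "SLC19A1", "ABCB1", "MDR"],
    "outcome": ["toxicity", "adverse", "side effect"],
}


def matches_any(text, terms):
    """True if any term occurs in text at the start of a word."""
    return any(
        re.search(r"\b" + re.escape(term), text, re.IGNORECASE)
        for term in terms
    )


def passes_pico(text, index=PICO_INDEX):
    """Retain a document only if every condition has at least one hit."""
    return all(matches_any(text, terms) for terms in index.values())


kept = passes_pico(
    "Effect of MTHFR gene polymorphism on toxicity of high-dose "
    "methotrexate in children with leukemia"
)
dropped = passes_pico(
    "MTHFR gene polymorphism and methotrexate toxicity in a mouse model"
)
print(kept, dropped)
```

The second example fails only the clinical-study condition, which is enough for the document to be deleted under the all-four rule.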
In the document deduplication method provided by the embodiment of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the deduplication result for the first target document and the second target document is obtained from the attribute tag repetition rate. This speeds up duplicate checking and saves labor and time.
Fig. 3 shows that an embodiment of the present invention provides a document deduplication method, including:
S21, acquiring the attribute labels and the attribute contents of the documents in each document database, performing label unification processing on the attribute labels of the documents in each document database, acquiring unified attribute labels, and configuring corresponding attribute contents;
S22, acquiring attribute labels and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with mutually repeated attribute contents;
S23, screening out a target attribute label from the attribute labels and attribute content corresponding to the target attribute label;
S24, obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;
S25, obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate.
Steps S22 to S25 follow the same principle as steps S11 to S14 of the embodiment above and are not described again here.
Regarding step S21, it should be noted that the documents come from different third-party databases, which store document attribute tags differently. The attribute tags of the documents in each database therefore need to be unified so that they conform to the preset unified attribute tags, after which the corresponding attribute contents are configured.
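Tag unification of this kind can be sketched as a lookup from database-specific tags to a preset unified tag set. The alias table below is hypothetical: the text does not list the source databases or their tag names, so the tags shown (for example "TI" and "AU") are only illustrative, as are the names `TAG_ALIASES` and `unify_record`.

```python
# Hypothetical alias table: database-specific tag -> unified tag.
TAG_ALIASES = {
    "TI": "title", "T1": "title", "Title": "title",
    "AU": "author", "A1": "author", "Author": "author",
}


def unify_record(fields):
    """Map a record's (tag, content) pairs onto the unified tags.

    fields is a list of pairs rather than a dict so that repeated tags
    (several authors) are preserved. Tags without a known alias are
    skipped in this sketch.
    """
    unified = {}
    for tag, content in fields:
        key = TAG_ALIASES.get(tag.strip())
        if key is None:
            continue
        unified.setdefault(key, []).append(content)
    return unified


record = [("TI", "A"), ("AU", "X"), ("AU", "Y"), ("PY", "2018")]
print(unify_record(record))
```

Once every record uses the same tag names, the later steps can select the title and author attributes without knowing which database a record came from.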
In the document deduplication method provided by the embodiment of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the deduplication result for the first target document and the second target document is obtained from the attribute tag repetition rate. This speeds up duplicate checking and saves labor and time.
Fig. 4 shows a document deduplication apparatus provided in an embodiment of the present invention, which includes an obtaining module 31, a screening module 32, a calculating module 33, and a determining module 34, where:
an obtaining module 31, configured to obtain attribute tags and attribute contents corresponding to a first target document and a second target document, where the first target document and the second target document are documents whose attribute contents are mutually duplicated;
a screening module 32, configured to screen out a target attribute tag and attribute content corresponding to the target attribute tag from the attribute tags;
a calculating module 33, configured to obtain a corresponding attribute tag repetition rate according to the attribute content corresponding to the target attribute tag;
and the judging module 34 is configured to obtain a duplicate removal result of the first target document and the second target document according to the attribute tag repetition rate.
Since the apparatus of this embodiment works on the same principle as the method of the above embodiment, the details are not repeated here.
It should be noted that, in the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.
In the document deduplication device provided by the embodiment of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the deduplication result for the first target document and the second target document is obtained from the attribute tag repetition rate. This speeds up duplicate checking and saves labor and time.
Fig. 5 shows a document deduplication apparatus provided in an embodiment of the present invention, which includes a processing module 41, an obtaining module 42, a screening module 43, a calculating module 44, and a determining module 45, where:
the processing module 41 is configured to obtain attribute tags and attribute contents of documents in each document database, perform tag unification processing on the attribute tags of the documents in each document database, obtain unified attribute tags, and configure corresponding attribute contents;
an obtaining module 42, configured to obtain attribute tags and attribute contents corresponding to a first target document and a second target document, where the first target document and the second target document are documents whose attribute contents are mutually duplicated;
a screening module 43, configured to screen out a target attribute tag and attribute content corresponding to the target attribute tag from the attribute tags;
a calculating module 44, configured to obtain a corresponding attribute tag repetition rate according to the attribute content corresponding to the target attribute tag;
and a determining module 45, configured to obtain a duplicate removal result of the first target document and the second target document according to the attribute tag repetition rate.
Since the apparatus of this embodiment works on the same principle as the method of the above embodiment, the details are not repeated here.
It should be noted that, in the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.
In the document deduplication device provided by the embodiment of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the deduplication result for the first target document and the second target document is obtained from the attribute tag repetition rate. This speeds up duplicate checking and saves labor and time.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (6)

1. A document deduplication process, comprising:
acquiring attribute tags and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with repeated attribute contents;
screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels;
obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;
obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate;
when the target attribute tag comprises a document author and a document title, the obtaining of the corresponding attribute tag repetition rate according to the attribute content corresponding to the target attribute tag comprises:
obtaining the character string length of the document title of the first target document, the character string length of the document title of the second target document and the total character string length of the repeated content of the first target document and the second target document under the document title according to the attribute content corresponding to the document title tag;
obtaining a document title repetition rate by adopting a first calculation formula according to the respective character string lengths of the document titles of the first target document and the second target document and the total character string length of the repeated content;
acquiring the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are mutually repeated according to the attribute content corresponding to the document author tags;
obtaining a document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are repeated;
the second calculation formula includes:
[formula image FDA0002614760130000021]
wherein AR-Rate is the document author repetition rate; [symbol image FDA0002614760130000022] is the set of authors of the first target document; [symbol image FDA0002614760130000023] is the set of authors of the second target document; n_1 and n_2 are the numbers of authors; a is the subscript corresponding to the minimum of n_1 and n_2; [symbol image FDA0002614760130000024] is the character string length of the author [symbol image FDA0002614760130000025]; [symbol image FDA0002614760130000026] is the character string length of the author [symbol image FDA0002614760130000027]; and [symbol image FDA0002614760130000028] is the character string length of the content repeated between the author [symbol image FDA0002614760130000029] and the author [symbol image FDA00026147601300000210].
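By way of illustration only, the author comparison recited in claim 1 might be sketched as follows. The second calculation formula itself survives here only as an image reference, so the scoring rule below (average, over the shorter author list, of each name's best overlap with a name in the other list) is an assumption and not the claimed formula. The helper names `repeated_length` and `author_repetition_rate`, and the use of `difflib.SequenceMatcher` to measure repeated content, are likewise hypothetical choices.

```python
from difflib import SequenceMatcher
from typing import Sequence


def repeated_length(a: str, b: str) -> int:
    """Summed size of the matching blocks shared by a and b.

    Assumed reading of 'string length of the repeated content'.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def author_repetition_rate(authors1: Sequence[str], authors2: Sequence[str]) -> float:
    """Hypothetical AR-Rate in [0, 1]; NOT the formula from the patent image."""
    if not authors1 or not authors2:
        return 0.0
    # The claim's subscript 'a' picks out the list with fewer authors.
    shorter, longer = sorted((authors1, authors2), key=len)
    total = 0.0
    for name in shorter:
        best = 0.0
        for other in longer:
            denom = max(len(name), len(other))
            if denom:
                # Overlap of the two names relative to the longer name.
                best = max(best, repeated_length(name, other) / denom)
        total += best
    return total / len(shorter)
```

Under these assumptions, two records whose shorter author list is fully contained in the longer one score 1.0, and records with no shared characters in any author name score 0.0.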
2. The method of claim 1, wherein the first calculation formula comprises:
[formula image FDA00026147601300000211]
wherein TR-Rate is the document title repetition rate, L_TM is the total character string length of the repeated content of the first target document and the second target document under the document title, L_paper1 is the character string length of the document title of the first target document, and L_paper2 is the character string length of the document title of the second target document.
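By way of illustration only, a title repetition rate built from the three quantities named in claim 2 might be computed as sketched below. The first calculation formula is available here only as an image reference, so dividing the repeated length by the longer of the two title lengths is an assumption, as is measuring the repeated content with `difflib.SequenceMatcher`. The function name `title_repetition_rate` is hypothetical.

```python
from difflib import SequenceMatcher


def title_repetition_rate(title1: str, title2: str) -> float:
    """Hypothetical TR-Rate in [0, 1]; NOT the formula from the patent image."""
    l_paper1, l_paper2 = len(title1), len(title2)
    longest = max(l_paper1, l_paper2)
    if longest == 0:
        return 0.0
    matcher = SequenceMatcher(None, title1, title2, autojunk=False)
    # Assumed reading of L_TM: total size of all matching blocks.
    l_tm = sum(block.size for block in matcher.get_matching_blocks())
    return l_tm / longest
```

Normalising by the longer title keeps the value at or below 1 and penalises a short title that merely appears inside a much longer one; other normalisations, such as the sum of both lengths, would be equally consistent with the claim wording.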
3. The method of claim 1, further comprising:
acquiring attribute content of the document corresponding to the duplicate removal result;
and obtaining and storing a retained document according to the attribute content and a preset screening index.
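By way of illustration only, the retention step of claim 3 might look like the sketch below. The claim does not say what the preset screening index is, so the rule used here (keep the record with more non-empty values among a chosen set of attribute tags) is an assumption, and the tag names `abstract`, `doi` and `year` as well as the function names are hypothetical.

```python
from typing import Mapping, Sequence


def completeness(record: Mapping[str, str], index_tags: Sequence[str]) -> int:
    """Count how many of the screening tags carry non-empty content."""
    return sum(1 for tag in index_tags if record.get(tag, "").strip())


def choose_retained(
    first: Mapping[str, str],
    second: Mapping[str, str],
    index_tags: Sequence[str] = ("abstract", "doi", "year"),
) -> Mapping[str, str]:
    """Return the record to keep from a duplicate pair; ties keep the first."""
    if completeness(second, index_tags) > completeness(first, index_tags):
        return second
    return first
```

For example, given two duplicate records of which only one carries a `doi` value, the record with the `doi` is the one retained and stored.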
4. A document de-duplication device, comprising:
an acquisition module, used for acquiring attribute labels and attribute contents corresponding to a first target document and a second target document respectively, wherein the first target document and the second target document are documents with mutually repeated attribute contents;
the screening module is used for screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels;
the computing module is used for obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;
the judging module is used for obtaining the duplicate removal results of the first target document and the second target document according to the attribute label repetition rate;
when the target attribute tag includes a document author and a document title, the calculation module is specifically configured to:
obtaining the character string length of the document title of the first target document, the character string length of the document title of the second target document and the total character string length of the repeated content of the first target document and the second target document under the document title according to the attribute content corresponding to the document title tag;
obtaining a document title repetition rate by adopting a first calculation formula according to the respective character string lengths of the document titles of the first target document and the second target document and the total character string length of the repeated content;
acquiring the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are mutually repeated according to the attribute content corresponding to the document author tags;
obtaining a document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are repeated;
the second calculation formula includes:
[formula image FDA0002614760130000031]
wherein AR-Rate is the document author repetition rate; [symbol image FDA0002614760130000032] is the set of authors of the first target document; [symbol image FDA0002614760130000033] is the set of authors of the second target document; n_1 and n_2 are the numbers of authors; a is the subscript corresponding to the minimum of n_1 and n_2; [symbol image FDA0002614760130000034] is the character string length of the author [symbol image FDA0002614760130000035]; [symbol image FDA0002614760130000036] is the character string length of the author [symbol image FDA0002614760130000037]; and [symbol image FDA0002614760130000038] is the character string length of the content repeated between the author [symbol image FDA0002614760130000039] and the author [symbol image FDA00026147601300000310].
5. The apparatus of claim 4, wherein the first calculation formula comprises:
[formula image FDA00026147601300000311]
wherein TR-Rate is the document title repetition rate, L_TM is the total character string length of the repeated content of the first target document and the second target document under the document title, L_paper1 is the character string length of the document title of the first target document, and L_paper2 is the character string length of the document title of the second target document.
6. The apparatus of claim 4, further comprising a screening module to: acquiring attribute content of the document corresponding to the duplicate removal result; and obtaining and storing a reserved document according to the attribute content and a preset screening index.
CN201810893169.2A 2018-08-07 2018-08-07 Document de-duplication method and device Active CN109241008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810893169.2A CN109241008B (en) 2018-08-07 2018-08-07 Document de-duplication method and device

Publications (2)

Publication Number Publication Date
CN109241008A CN109241008A (en) 2019-01-18
CN109241008B true CN109241008B (en) 2020-10-27

Family

ID=65071023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810893169.2A Active CN109241008B (en) 2018-08-07 2018-08-07 Document de-duplication method and device

Country Status (1)

Country Link
CN (1) CN109241008B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126997B (en) * 2023-04-04 2023-06-13 北京洞悉网络有限公司 Document deduplication storage method, system, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350032A (en) * 2008-09-23 2009-01-21 胡辉 Method for judging whether web page content is identical or not
CN105653590A (en) * 2015-12-21 2016-06-08 青岛智能产业技术研究院 Name duplication disambiguation method of Chinese literature authors
CN107870991A (en) * 2017-10-27 2018-04-03 湖南纬度信息科技有限公司 A kind of similarity calculating method and computer-readable recording medium of paper metadata

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5745745A (en) * 1994-06-29 1998-04-28 Hitachi, Ltd. Text search method and apparatus for structured documents
CN102945244A (en) * 2012-09-24 2013-02-27 南京大学 Chinese web page repeated document detection and filtration method based on full stop characteristic word string
CN103577581B (en) * 2013-11-08 2016-09-28 南京绿色科技研究院有限公司 Agricultural product price trend forecasting method
CN107122949B (en) * 2016-02-25 2021-02-26 阿里巴巴集团控股有限公司 E-mail screening method and device

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right
Effective date of registration: 2019-01-18
Address after: 100191 49 Garden Road North, Haidian District, Beijing
Applicant after: The Third Affiliated Hospital of Peking University
Applicant after: Beijing Nuodao Cognitive Medical Technology Co., Ltd.
Address before: 100080 Beijing Haidian District North Fourth Ring West Road, No. 9, No. 18, No. 1812
Applicant before: Beijing Nuodao Cognitive Medical Technology Co., Ltd.
SE01 Entry into force of request for substantive examination
GR01 Patent grant