CN109241008B - Document de-duplication method and device

Info

Publication number: CN109241008B
Application number: CN201810893169.2A
Authority: CN
Inventors: 赵荣生; 宋再伟; 黄振城; 周旻
Original assignee: Beijing Nuodao Cognitive Medical Technology Co ltd; Peking University Third Hospital
Current assignee: Peking University Third Hospital
Priority date: 2018-08-07
Filing date: 2018-08-07
Publication date: 2020-10-27
Anticipated expiration: 2038-08-07
Also published as: CN109241008A

Abstract

In the document deduplication method and device provided by embodiments of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the duplicate removal result of the first target document and the second target document is obtained according to the attribute tag repetition rate. This increases the speed of duplicate checking and saves labor and time.

Description

Document de-duplication method and device

Technical Field

The invention relates to the technical field of information processing, in particular to a document duplicate removal method and device.

Background

Screening duplicate documents is important but time-consuming work. If manual screening can be replaced by automated screening, the workload of scientific research can be reduced considerably. In this process, inclusion redundancy is currently a major issue.

Inclusion redundancy refers to redundancy in cross-database retrieval results caused by overlap among the journals that the databases include. Unlike general web page information, which can be freely uploaded and reposted, bibliographic information is bound to a specific publication, usually for copyright reasons, so its provenance is unique and it cannot be freely uploaded or reposted. However, a given publication is always included in one or more online databases, and the sets of publications included in different databases often intersect. Because the publication records of an individual or institution are usually gathered by searching several databases at once, the overlap of database inclusions is the most fundamental cause of redundancy in cross-database document retrieval. For inclusion redundancy, the most common method is to check for duplicates manually using the ISBN, but this approach is inefficient.

Disclosure of Invention

The invention provides a document duplicate removal method and a document duplicate removal device, which are used for solving the problem of low document duplicate detection efficiency in the prior art.

In a first aspect, an embodiment of the present invention provides a document deduplication method, including:

acquiring attribute tags and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with repeated attribute contents;

screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels;

obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;

and obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate.

Optionally, when the target attribute tag includes a document author and a document title, the obtaining a repetition rate of the corresponding attribute tag according to the attribute content corresponding to the target attribute tag includes:

obtaining the character string length of the document title of the first target document, the character string length of the document title of the second target document and the total character string length of the repeated content of the first target document and the second target document under the document title according to the attribute content corresponding to the document title tag;

obtaining a document title repetition rate by adopting a first calculation formula according to the respective character string lengths of the document titles of the first target document and the second target document and the total character string length of the repeated content;

acquiring the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are mutually repeated according to the attribute content corresponding to the document author tags;

and obtaining the document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of the character string in which the author names of the first target document and the author names of the second target document are repeated mutually.

Optionally, the first calculation formula includes:

wherein TR-Rate is the document title repetition rate, L_TM is the total string length of the repeated content of the first target document and the second target document under the document title, L_paper1 is the string length of the document title of the first target document, and L_paper2 is the string length of the document title of the second target document.

Optionally, the second calculation formula includes:

wherein AR-Rate is the document author repetition rate, and the remaining terms denote, in order: the set of authors of the first target document; the set of authors of the second target document; n₁ and n₂, the numbers of authors; a, the subscript corresponding to the minimum of n₁ and n₂; the string length of one author's name; the string length of the other author's name; and the string length of the content repeated between the two authors' names.

Optionally, the method further comprises:

acquiring attribute content of the document corresponding to the duplicate removal result; and obtaining and storing a reserved document according to the attribute content and a preset screening index.

In a second aspect, an embodiment of the present invention provides a document deduplication apparatus, including:

an acquisition module, configured to acquire attribute tags and attribute contents corresponding to a first target document and a second target document respectively, wherein the first target document and the second target document are documents with mutually repeated attribute contents;

the screening module is used for screening out a target attribute label and attribute content corresponding to the target attribute label from the attribute labels;

the computing module is used for obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;

and the judging module is used for obtaining the duplicate removal results of the first target document and the second target document according to the attribute label repetition rate.

Optionally, when the target attribute tag includes a document author and a document title, the calculation module is specifically configured to:

Optionally, the first calculation formula includes:

Optionally, the second calculation formula includes:

wherein AR-Rate is the document author repetition rate, and the remaining terms denote, in order: the set of authors of the first target document; the string length of one author's name; the string length of the other author's name; and the string length of the content repeated between the two authors' names.

Optionally, the device further comprises a screening module configured to: acquire attribute content of the document corresponding to the duplicate removal result; and obtain and store a reserved document according to the attribute content and a preset screening index.

As can be seen from the foregoing technical solutions, in the document deduplication method and device provided by embodiments of the present invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the duplicate removal result of the first target document and the second target document is obtained according to the attribute tag repetition rate. This increases the speed of duplicate checking and saves labor and time.

Drawings

FIG. 1 is a schematic flow chart of a document deduplication process according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of extracting repeated character strings of a document title according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a document deduplication method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a document deduplication apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a document deduplication apparatus according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Fig. 1 shows that an embodiment of the present invention provides a document deduplication method, including:

S11, acquiring attribute labels and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with mutually repeated attribute contents;

S12, screening out a target attribute label from the attribute labels and attribute content corresponding to the target attribute label;

S13, obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;

S14, obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate.

Regarding steps S11 to S14, it should be noted that, in the embodiment of the present invention, documents with duplicate information may first be screened out of the document database to be checked, and the screened documents are then compared pairwise until a result is obtained as to whether each pair is a duplicate.

Two documents are obtained: a first target document and a second target document, whose attribute contents duplicate each other. The basic information of each document may be represented as attribute tags and their corresponding attribute contents.
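The pre-screening and pairwise comparison described above can be sketched as follows. This is only an illustrative sketch: the dictionary representation of a document, the function names `shares_attribute_content` and `candidate_pairs`, and the rule that "duplicate information" means an identical value under a shared tag are all assumptions, not details given in this text.

```python
from itertools import combinations


def shares_attribute_content(doc_a, doc_b):
    """Return True if any tag present in both documents has identical content.

    Assumption: a document is a dict mapping attribute tag -> attribute content.
    """
    common_tags = doc_a.keys() & doc_b.keys()
    return any(doc_a[tag] == doc_b[tag] for tag in common_tags)


def candidate_pairs(documents):
    """Return every pair of documents that repeats some attribute content.

    Each returned pair plays the role of (first target document,
    second target document) in the later comparison steps.
    """
    return [
        (doc_a, doc_b)
        for doc_a, doc_b in combinations(documents, 2)
        if shares_attribute_content(doc_a, doc_b)
    ]
```

Exact equality is the simplest possible pre-screen; a real system would likely normalize case, whitespace, and punctuation before comparing.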

A target attribute tag and the attribute content corresponding to it are then screened out from the attribute tags. In an embodiment of the present invention, the target attribute tags used for screening may be the document author and the document title.

The document title repetition rate is obtained by adopting a first calculation formula according to the character string lengths of the document titles and the total character string length of the repeated content of the first target document and the second target document.

The first calculation formula includes:

FIG. 2 is a schematic diagram of repeated character string extraction for a document title. As can be seen from the figure, the title of document 1 is "influence of MTHFR gene polymorphism in children with acute lymphoblastic leukemia on adverse reaction of high-dose methotrexate", and the title of document 2 is "the influence of GSTP1 and MTHFR gene polymorphism in children with acute lymphoblastic leukemia on the toxic and side effects of high-dose methotrexate". The repeated content of document 1 and document 2 is then extracted by the longest common subsequence (LCS) dynamic programming method: "influence of MTHFR gene polymorphism in children with acute lymphoblastic leukemia on response to high-dose methotrexate". The character string length of each of these contents is thus known, and the document title repetition rate can then be obtained by using the first calculation formula.
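As an illustration of the extraction step, a textbook dynamic-programming implementation of the longest common subsequence might look like the sketch below. The function names are hypothetical, and the sketch only produces the three string lengths that feed the first calculation formula; it does not attempt to reproduce that formula itself.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of strings a and b."""
    rows, cols = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    chars = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


def title_lengths(title1: str, title2: str):
    """Return (repeated-content length, title1 length, title2 length)."""
    repeated = longest_common_subsequence(title1, title2)
    return len(repeated), len(title1), len(title2)
```

The table costs time and memory proportional to the product of the two title lengths, which is negligible for titles. When several subsequences share the maximum length, which one is returned depends on the tie-breaking in the walk-back, but the length is the same.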

The second calculation formula includes:

wherein AR-Rate is the document author repetition rate, and the remaining terms denote, in order: the set of authors of the first target document; the string length of one author's name; the string length of the other author's name; and the string length of the content repeated between the two authors' names.

In this embodiment, after the document title repetition rate and the document author repetition rate are obtained, they are compared with a preset repetition rate threshold to determine whether the two documents are duplicates. For this judgment, each repetition rate may have its own preset threshold, or both may share a single preset threshold.
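A minimal sketch of this threshold judgment is given below. The function name, the default value 0.8, the use of "greater than or equal to", and the requirement that both rates pass are all assumptions made for illustration; the text only states that the rates are compared with preset thresholds.

```python
def is_duplicate(title_rate, author_rate, title_threshold=0.8, author_threshold=None):
    """Decide whether two documents are duplicates from their repetition rates.

    If author_threshold is omitted, both rates share title_threshold,
    mirroring the "shared threshold" option; otherwise each rate is
    checked against its own threshold.
    """
    if author_threshold is None:
        author_threshold = title_threshold
    # Assumption: both rates must reach their thresholds.
    return title_rate >= title_threshold and author_rate >= author_threshold
```

For example, `is_duplicate(0.9, 0.5)` is rejected under the shared threshold, while `is_duplicate(0.9, 0.5, author_threshold=0.4)` is accepted under separate thresholds.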

After deduplication is finished, the documents obtained are screened according to PICO (a clinical question framed as Population, Intervention, Comparison, and Outcome), specifically: acquiring attribute content of the document corresponding to the duplicate removal result; and obtaining and storing a reserved document according to the attribute content and a preset screening index.

Whether the following conditions are met is judged in sequence according to a preset PICO index:

(1) Judging whether the study is a clinical study

PICO index: patient, case, person, infant, child, clinic

(2) Judging whether the population is relevant

PICO index: methotrexate, MTX, methotrexate, High-dose, HD, leukemia, ALL, AL, lymphoma, NHL, osteosarcoma, OS, hematologic tumors

(3) Judging whether the intervention and comparison are relevant

PICO index: genes, polymorphisms, methylenetetrahydrofolate reductase, reduced folate carriers, glycoproteins, multiple drug resistance genes, SNPs, MTHFR, RFC, SLC19A1, ABCB1, MDR

(4) Judging whether the outcome indicators are relevant

PICO index: toxicity, toxic and side effects, adverse events, adverse reactions and side effects

Documents that meet all four conditions are finally retained, and documents that fail any one of the conditions are deleted.
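The four-condition screen above could be sketched as a keyword check, as below. The keyword groups are taken from the PICO index lists above; the dictionary layout, the function name `passes_pico`, the choice of which attribute content is passed in as `text`, and the case-insensitive substring matching are assumptions for illustration only.

```python
# Keyword groups copied from the PICO index above (duplicates dropped).
PICO_INDEX = {
    "clinical_study": ["patient", "case", "person", "infant", "child", "clinic"],
    "population": ["methotrexate", "MTX", "High-dose", "HD", "leukemia", "ALL",
                   "AL", "lymphoma", "NHL", "osteosarcoma", "OS",
                   "hematologic tumors"],
    "intervention_comparison": ["genes", "polymorphisms",
                                "methylenetetrahydrofolate reductase",
                                "reduced folate carriers", "glycoproteins",
                                "multiple drug resistance genes", "SNPs",
                                "MTHFR", "RFC", "SLC19A1", "ABCB1", "MDR"],
    "outcome": ["toxicity", "toxic and side effects", "adverse events",
                "adverse reactions", "side effects"],
}


def passes_pico(text, index=PICO_INDEX):
    """Return True only if text matches at least one keyword in every group."""
    lowered = text.lower()
    return all(
        any(keyword.lower() in lowered for keyword in keywords)
        for keywords in index.values()
    )
```

Plain substring matching is deliberately crude: short abbreviations such as "AL" or "OS" will match inside unrelated words, so a practical screen would probably match whole tokens or treat abbreviations case-sensitively.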

In the document deduplication method provided by the embodiment of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the duplicate removal result of the first target document and the second target document is obtained according to the attribute tag repetition rate. This increases the speed of duplicate checking and saves labor and time.

Fig. 3 shows that an embodiment of the present invention provides a document deduplication method, including:

S21, acquiring the attribute labels and the attribute contents of the documents in each document database, performing label unification processing on the attribute labels of the documents in each document database, acquiring unified attribute labels, and configuring corresponding attribute contents;

S22, acquiring attribute labels and attribute contents corresponding to a first target document and a second target document, wherein the first target document and the second target document are documents with mutually repeated attribute contents;

S23, screening out a target attribute label from the attribute labels and attribute content corresponding to the target attribute label;

S24, obtaining the corresponding attribute label repetition rate according to the attribute content corresponding to the target attribute label;

S25, obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate.

The principle of steps S22 to S25 is the same as that of steps S11 to S14 in the above embodiment and is not described again here.

In step S21, it should be noted that, because the third-party databases of the document sources are different, the stored attribute tags of the documents are also different, so that the attribute tags of the documents in the document databases need to be processed uniformly to conform to the preset uniform attribute tags, and then the corresponding attribute contents are configured.
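The label unification of step S21 can be pictured as a mapping from each database's own tag names onto one preset vocabulary. In the sketch below, the alias table `TAG_ALIASES`, its specific entries, and the function name `unify_record` are hypothetical examples; the text does not specify which tags the source databases use.

```python
# Hypothetical alias table: source-database tag -> unified attribute tag.
TAG_ALIASES = {
    "TI": "title",
    "T1": "title",
    "AU": "author",
    "A1": "author",
    "PY": "year",
}


def unify_record(record, aliases=TAG_ALIASES):
    """Return a copy of record whose tags follow the unified vocabulary.

    Tags without an alias are kept, lower-cased, so that no attribute
    content is lost during unification.
    """
    unified = {}
    for tag, content in record.items():
        cleaned = tag.strip()
        unified_tag = aliases.get(cleaned.upper(), cleaned.lower())
        unified[unified_tag] = content
    return unified
```

After every record has passed through such a mapping, the title and author contents of documents from different databases sit under the same tags and can be compared directly.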

Fig. 4 shows a document deduplication apparatus provided in an embodiment of the present invention, which includes an obtaining module 31, a screening module 32, a calculating module 33, and a determining module 34, where:

an obtaining module 31, configured to obtain attribute tags and attribute contents corresponding to a first target document and a second target document, where the first target document and the second target document are documents whose attribute contents are mutually duplicated;

a screening module 32, configured to screen out a target attribute tag and attribute content corresponding to the target attribute tag from the attribute tags;

a calculating module 33, configured to obtain a corresponding attribute tag repetition rate according to the attribute content corresponding to the target attribute tag;

and the judging module 34 is configured to obtain a duplicate removal result of the first target document and the second target document according to the attribute tag repetition rate.

Since the principle of the apparatus in this embodiment is the same as that of the method in the above embodiment, the details are not repeated here.

It should be noted that, in the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.

In the document deduplication device provided by the embodiment of the invention, the attribute tags and attribute contents corresponding to a first target document and a second target document are obtained; target attribute tags and their corresponding attribute contents are screened out from the attribute tags; the corresponding attribute tag repetition rate is obtained from the attribute contents of the target attribute tags; and the duplicate removal result of the first target document and the second target document is obtained according to the attribute tag repetition rate. This increases the speed of duplicate checking and saves labor and time.

Fig. 5 shows a document deduplication apparatus provided in an embodiment of the present invention, which includes a processing module 41, an obtaining module 42, a screening module 43, a calculating module 44, and a determining module 45, where:

the processing module 41 is configured to obtain attribute tags and attribute contents of documents in each document database, perform tag unification processing on the attribute tags of the documents in each document database, obtain unified attribute tags, and configure corresponding attribute contents;

an obtaining module 42, configured to obtain attribute tags and attribute contents corresponding to a first target document and a second target document, where the first target document and the second target document are documents whose attribute contents are mutually duplicated;

a screening module 43, configured to screen out a target attribute tag and attribute content corresponding to the target attribute tag from the attribute tags;

a calculating module 44, configured to obtain a corresponding attribute tag repetition rate according to the attribute content corresponding to the target attribute tag;

and a determining module 45, configured to obtain a duplicate removal result of the first target document and the second target document according to the attribute tag repetition rate.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims

1. A document deduplication process, comprising:

obtaining the duplicate removal result of the first target document and the second target document according to the attribute label repetition rate;

when the target attribute tag comprises a document author and a document title, the obtaining of the corresponding attribute tag repetition rate according to the attribute content corresponding to the target attribute tag comprises:

obtaining a document author repetition rate by adopting a second calculation formula according to the number of authors corresponding to the first target document and the second target document and the length of a character string in which the author names of the first target document and the author names of the second target document are repeated;

the second calculation formula includes:

wherein AR-Rate is the document author repetition rate, and the remaining terms denote, in order: the set of authors of the first target document; the string length of one author's name; the string length of the other author's name; and the string length of the content repeated between the two authors' names.

2. The method of claim 1, wherein the first calculation formula comprises:

3. The method of claim 1, further comprising:

acquiring attribute content of the document corresponding to the duplicate removal result;

and obtaining and storing a reserved document according to the attribute content and a preset screening index.

4. A document deduplication device, comprising:

the judging module is used for obtaining the duplicate removal results of the first target document and the second target document according to the attribute label repetition rate;

when the target attribute tag includes a document author and a document title, the calculation module is specifically configured to:

the second calculation formula includes:

wherein AR-Rate is the document author repetition rate, and the remaining terms denote, in order: the set of authors of the first target document; the string length of one author's name; the string length of the other author's name; and the string length of the content repeated between the two authors' names.

5. The apparatus of claim 4, wherein the first calculation formula comprises:

wherein TR-Rate is the document title repetition rate, L_TM is the total string length of the repeated content of the first target document and the second target document under the document title, L_paper1 is the string length of the document title of the first target document, and L_paper2 is the string length of the document title of the second target document.

6. The apparatus of claim 4, further comprising a screening module to: acquiring attribute content of the document corresponding to the duplicate removal result; and obtaining and storing a reserved document according to the attribute content and a preset screening index.