CN112989793B - Article detection method and device - Google Patents


Info

Publication number
CN112989793B
Authority
CN
China
Prior art keywords
fingerprint
article
detected
comparison result
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110531324.8A
Other languages
Chinese (zh)
Other versions
CN112989793A (en)
Inventor
杨阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Developer Technology Co ltd
Beijing Innovation Lezhi Network Technology Co ltd
Original Assignee
Changsha Developer Technology Co ltd
Beijing Innovation Lezhi Network Technology Co ltd
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The application provides an article detection method and an article detection device. The article detection method comprises the following steps: a server generates, for an article to be detected sent by user equipment, a first fingerprint in a first construction mode and a second fingerprint in a second construction mode, wherein a plurality of first index records generated based on the first construction mode are provided for the first fingerprint, and a plurality of second index records generated based on the second construction mode are provided for the second fingerprint; the server compares the similarity of the first fingerprint with the plurality of first index records and the similarity of the second fingerprint with the plurality of second index records to obtain a first comparison result and a second comparison result, respectively; and the detection result of the article to be detected is determined according to the first comparison result and the second comparison result. By constructing and checking two different fingerprints, the technical scheme effectively reduces the misjudgment rate for original articles and improves the recall rate of similar texts.

Description

Article detection method and device
Technical Field
The application relates to the technical field of text detection, in particular to an article detection method and device.
Background
With the growing number of blog articles published in internet communities, protecting original articles is becoming increasingly difficult. Many authors mark an article as original when it is actually a reprint of another article on the same site rather than an original article. Therefore, how to quickly identify whether a new article published by a user is original has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present application provide an article detection method and apparatus, which can effectively reduce a misjudgment rate of an original article.
In a first aspect, an embodiment of the present application provides an article detection method, including: a server generates, for an article to be detected sent by user equipment, a first fingerprint in a first construction mode and a second fingerprint in a second construction mode, wherein a plurality of first index records generated based on the first construction mode are provided for the first fingerprint, and a plurality of second index records generated based on the second construction mode are provided for the second fingerprint; the server compares the similarity of the first fingerprint with the plurality of first index records and the similarity of the second fingerprint with the plurality of second index records to obtain a first comparison result and a second comparison result, respectively; and the detection result of the article to be detected is determined according to the first comparison result and the second comparison result.
In some embodiments of the present application, generating the first fingerprint in the first construction mode comprises: acquiring at least one clause with a preset length based on the article to be detected; generating, based on the at least one clause, fingerprint information and a weight corresponding to each clause, wherein the weight of each clause is the length of the clause; and merging the fingerprint information and weights corresponding to the at least one clause to generate the first fingerprint of the article to be detected. Generating the second fingerprint in the second construction mode comprises: extracting at least one keyword according to the association relationship between words in the article to be detected; determining the word frequency of each keyword based on the number of times the keyword appears in the article to be detected, and setting the word frequency as the weight of the corresponding keyword; and generating the second fingerprint based on the at least one keyword and the weight corresponding to each keyword.
In some embodiments of the present application, the server comparing the similarity of the first fingerprint with the plurality of first index records and of the second fingerprint with the plurality of second index records to obtain the first comparison result and the second comparison result includes: comparing the similarity of the first fingerprint with the plurality of first index records to obtain the first comparison result; and comparing the Hamming distance between the second fingerprint and the plurality of second index records to obtain the second comparison result, wherein the detection result comprises original or non-original.
In some embodiments of the present application, comparing the similarity of the first fingerprint with the plurality of first index records to obtain the first comparison result includes: obtaining, based on the first fingerprint and the plurality of first index records, at least one article corresponding to a first predetermined number of first index records, wherein the article among them that is most similar to the first fingerprint is the first article; generating a first index result when the ratio of the length of the part shared by the first fingerprint and a third fingerprint, which is generated for the first article based on the first construction mode, to the length of the third fingerprint exceeds a first preset threshold; and/or generating a second index result when the ratio of the length of the part shared by the first fingerprint and the fourth fingerprints, which are generated based on the first construction mode for the at least one article respectively, to the total length of the fourth fingerprints exceeds a second preset threshold; and/or generating a third index result when the Hamming distance between the second fingerprint and a fifth fingerprint, which is generated for the first article based on the second construction mode, exceeds a third preset threshold; and determining the first comparison result based on the first index result and/or the second index result and/or the third index result, wherein the first comparison result comprises the first article.
In some embodiments of the present application, comparing the hamming distance of the second fingerprint to the plurality of second index records, and obtaining the second comparison result comprises: dividing the second fingerprint into four groups of fingerprints; comparing the Hamming distance of each group of fingerprints in the four groups of fingerprints with a plurality of second index records respectively to obtain a plurality of articles corresponding to a second predetermined number of second index records; and when the Hamming distance between the fingerprint generated by the plurality of articles based on the second construction mode and the second fingerprint respectively does not exceed a fourth preset threshold, obtaining a second comparison result, wherein the second comparison result comprises a second article with the smallest Hamming distance from the second fingerprint in the plurality of articles.
In some embodiments of the present application, determining the detection result of the article to be detected according to the first comparison result and the second comparison result includes: generating the detection result according to the release time of the first article and the release time of the second article, wherein the detection result comprises non-original together with whichever of the first article and the second article has the earlier release time.
In some embodiments of the present application, after determining the detection result of the article to be detected according to the first comparison result and the second comparison result, the method further includes: and when the detection result is original, storing the first fingerprint of the article to be detected in the first index structure as a new first index record, and storing the second fingerprint in the second index structure as a new second index record.
In some embodiments of the present application, the first fingerprint is an MD5 (Message-Digest Algorithm 5) fingerprint and the second fingerprint is a simhash fingerprint.
In a second aspect, an embodiment of the present application provides an article detection apparatus, including: a generation module, used for generating, for an article to be detected sent by user equipment, a first fingerprint in a first construction mode and a second fingerprint in a second construction mode, wherein a plurality of first index records generated based on the first construction mode are provided for the first fingerprint, and a plurality of second index records generated based on the second construction mode are provided for the second fingerprint; a comparison module, used for comparing the similarity of the first fingerprint with the plurality of first index records and the similarity of the second fingerprint with the plurality of second index records to obtain a first comparison result and a second comparison result, respectively; and a determining module, used for determining the detection result of the article to be detected according to the first comparison result and the second comparison result.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing processor executable instructions, wherein the processor is configured to perform the article detection method of the first aspect.
The embodiment of the application provides an article detection method and device, two different fingerprints are generated through two different construction modes, and the detection result of an article to be detected is determined by combining two fingerprint detection schemes, so that the misjudgment rate of an original article is effectively reduced, the recall rate of similar texts is improved, and the purposes of protecting the rights and interests of an article author and supporting originality are achieved.
Drawings
Fig. 1 is a flowchart illustrating an article detection method according to an exemplary embodiment of the present application.
Fig. 2 is a schematic flowchart of generating a first fingerprint according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart illustrating a process of generating a second fingerprint according to an exemplary embodiment of the present application.
Fig. 4 is a flowchart illustrating an article detection method according to another exemplary embodiment of the present application.
Fig. 5 is a flowchart illustrating a process of determining a first comparison result according to an exemplary embodiment of the present application.
Fig. 6 is a flowchart illustrating a process of determining a second comparison result according to an exemplary embodiment of the present application.
Fig. 7 is a schematic structural diagram of an article detection apparatus according to an exemplary embodiment of the present application.
Fig. 8 is a block diagram of an electronic device for article detection provided by an exemplary embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, article detection mainly relies on the simhash algorithm. However, simhash is a locality-sensitive hashing algorithm: its judgment of the repetition degree of short texts has a large error, and its accuracy rises only as the text becomes longer. As a result, it cannot effectively identify a pieced-together article (for example, an article a assembled from two articles b and c). If MD5 fingerprints are used for article detection instead, only some of the sentences in the article are extracted for detection, so a certain probability of misjudgment remains for longer articles.
In addition, when the articles published by the user are non-original articles, how to find out the corresponding original articles published for the first time is also a problem to be solved urgently.
Fig. 1 is a flowchart illustrating an article detection method according to an exemplary embodiment of the present application. The method of Fig. 1 is performed by a computing device, e.g., a server. As shown in Fig. 1, the article detection method includes the following.
110: The server generates, for an article to be detected sent by user equipment, a first fingerprint in a first construction mode and a second fingerprint in a second construction mode, wherein a plurality of first index records generated based on the first construction mode are provided for the first fingerprint, and a plurality of second index records generated based on the second construction mode are provided for the second fingerprint.
Specifically, the first construction mode and the second construction mode are different methods for constructing fingerprints, and the present application does not particularly limit either of them. For example, when the first fingerprint is an MD5 fingerprint and the second fingerprint is a simhash fingerprint, the first construction mode may be a manner of constructing an MD5 fingerprint, and the second construction mode may be a manner of constructing a simhash fingerprint.
According to the first construction mode and the second construction mode, two search cluster indexes (such as Elasticsearch cluster indexes) can be established respectively, for example, an MD5 fingerprint index and a simhash fingerprint index.
For the first construction mode, a first index structure may be provided. The first index structure may include a plurality of first index records. Each first index record may be the fingerprint information, constructed based on the first construction mode, of one of the other articles (i.e., articles other than the article to be detected) with which the article to be detected needs to be compared. Each first index record may include the identifier (ID) of that article, its creation time, and the fingerprint generated based on the first construction mode. The fields of each first index record may be separated by spaces and stored as a text type.
For example, the first fingerprint is an MD5 fingerprint and the first index structure is an MD5 fingerprint index. Accordingly, the first index record includes the article ID, the creation time, and the generated MD5 fingerprint.
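The space-separated record layout described above can be sketched as follows. This is only an illustration: the class name `FirstIndexRecord`, its field names, and the timestamp format are assumptions that do not come from the application, and the actual storage is left to the search cluster index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstIndexRecord:
    """Hypothetical in-memory view of one first index record."""
    article_id: str
    created_at: str   # assumed: a timestamp string that contains no spaces
    fingerprint: str  # space-separated "clause-digest clause-length" pairs

    def to_text(self) -> str:
        # Fields are separated by spaces and stored as plain text.
        return f"{self.article_id} {self.created_at} {self.fingerprint}"

    @classmethod
    def from_text(cls, line: str) -> "FirstIndexRecord":
        # Split off only the first two fields; the fingerprint itself contains spaces.
        article_id, created_at, fingerprint = line.split(" ", 2)
        return cls(article_id, created_at, fingerprint)
```

Splitting at most twice keeps the fingerprint intact even though it contains spaces of its own.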
It should be noted that the first index structure includes the established search cluster index. The first index structure may also include a first article fingerprint repository storing the plurality of first index records. When articles similar to the first fingerprint are searched for in the first index structure, the search cluster index is used to compare the similarity of the first fingerprint with the plurality of first index records in the first article fingerprint repository.
For the second construction mode, a second index structure may be provided. The second index structure may include a plurality of second index records. In the second index structure, a second fingerprint generated based on the second construction mode (e.g., the second fingerprint may be a 64-bit binary-coded fingerprint) may be divided into multiple (e.g., 4) segments. Each segment is then stored as a second index record. That is, in the second index structure, one complete article may correspond to a plurality of second index records.
For example, the second fingerprint is a simhash fingerprint. The second index structure is a simhash index structure. In the second index structure, the 64-bit binary code generated by the simhash algorithm is divided into 4 segments, i.e., four 16-bit binary codes, and each 16-bit code is stored as a second index record.
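The division of a 64-bit fingerprint into four 16-bit segments can be sketched as below. The function names and the `(position, segment, article_id)` tuple layout are illustrative assumptions; the application does not specify how a segment is tied back to its article.

```python
from typing import List, Tuple


def split_fingerprint(fp: int, parts: int = 4, bits: int = 64) -> List[int]:
    """Split a `bits`-wide fingerprint into `parts` equal segments, most significant first."""
    width = bits // parts
    mask = (1 << width) - 1
    return [(fp >> (width * (parts - 1 - i))) & mask for i in range(parts)]


def second_index_records(article_id: str, fp: int) -> List[Tuple[int, int, str]]:
    """One hypothetical record per segment: (segment position, segment value, article ID)."""
    return [(pos, seg, article_id) for pos, seg in enumerate(split_fingerprint(fp))]
```

One reason such a split is commonly used: if two 64-bit fingerprints differ in at most 3 bits, at least one of the four segments must be identical, so candidates can be looked up by segment before the full Hamming distance is computed.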
It should be noted that the relationship between the second index structure and the search cluster index is substantially the same as that described for the first index structure, and for details, please refer to the description of the first index structure, which is not described herein again to avoid repetition.
It should be understood that the number of articles stored in the first index structure and the second index structure may be the same or different. By combining the first fingerprint detection scheme with the second fingerprint detection scheme and retrieving index records in the first index structure and the second index structure at the same time, the embodiments of the present application expand the article detection range and ensure the similarity of the recalled articles.
120: The server compares the similarity of the first fingerprint with the plurality of first index records and the similarity of the second fingerprint with the plurality of second index records to obtain a first comparison result and a second comparison result, respectively.
Specifically, the server may compare the similarity of the first fingerprint with the plurality of first index records to obtain a first comparison result, and compare the similarity of the second fingerprint with the plurality of second index records to obtain a second comparison result.
130: The detection result of the article to be detected is determined according to the first comparison result and the second comparison result.
Specifically, the first comparison result may be original or non-original, and the second comparison result may likewise be original or non-original. The detection result of the article to be detected is original if and only if the first comparison result and the second comparison result are both original; otherwise, the detection result is non-original.
Therefore, the embodiment of the application generates two different fingerprints through two different construction modes, and combines two fingerprint detection schemes, so that the misjudgment rate of original articles is effectively reduced, the recall rate of similar texts is improved, and the purposes of protecting the rights and interests of article authors and supporting originality are achieved.
Fig. 2 is a schematic flowchart of generating a first fingerprint according to an exemplary embodiment of the present application. As shown in Fig. 2, the method for generating the first fingerprint includes the following steps.
210: At least one clause with a preset length is acquired based on the article to be detected.
Specifically, at least one clause may be extracted according to the length of the article to be detected, wherein symbols and stop words may be removed before the clauses are extracted.
In an example, when the length of the article to be detected is greater than or equal to a first preset threshold (for example, the length of 30 clauses), the 30 longest complete sentences in the article to be detected may be extracted as the at least one clause, that is, the number of clauses is 30. The number of clauses is not particularly limited and can be set flexibly according to actual conditions.
When the length of the article to be detected is smaller than the first preset threshold and the full-text length is greater than or equal to a second preset threshold (for example, 120 bytes), all sentences in the article to be detected are extracted as the at least one clause.
When the full-text length of the article to be detected is lower than the second preset threshold (for example, 120 bytes), the whole article to be detected is taken as one clause.
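The three length cases above can be sketched as follows. Several points are assumptions made only for illustration: the first preset threshold is read as a sentence count of 30, sentence boundaries are taken to be the punctuation in `_SENTENCE_END`, byte length is measured in UTF-8, and the removal of symbols and stop words is omitted.

```python
import re
from typing import List

MAX_CLAUSES = 30  # example value from the text
MIN_BYTES = 120   # example value from the text

# Hypothetical sentence delimiters; the application does not specify them.
_SENTENCE_END = re.compile(r"[.!?;。！？；\n]+")


def select_clauses(article: str) -> List[str]:
    """Pick the clauses that will be fingerprinted, following the three length cases."""
    sentences = [s.strip() for s in _SENTENCE_END.split(article) if s.strip()]
    if len(sentences) >= MAX_CLAUSES:
        # Long article: keep the 30 longest sentences (ties keep document order).
        return sorted(sentences, key=len, reverse=True)[:MAX_CLAUSES]
    if len(article.encode("utf-8")) >= MIN_BYTES:
        # Medium article: keep every sentence.
        return sentences
    # Short article: the whole text is a single clause.
    return [article.strip()]
```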
220: Fingerprint information and a weight corresponding to each of the at least one clause are generated based on the at least one clause.
In one embodiment, the weight of each clause in at least one clause is the length of the clause.
Specifically, the fingerprint information and the weight corresponding to at least one clause (e.g., 30 clauses) may be generated for the at least one clause obtained in step 210 according to the first construction manner, where the weight may be the length of the clause itself.
230: The first fingerprint of the article to be detected is generated by merging the fingerprint information and weights corresponding to the at least one clause.
Specifically, according to fingerprint information corresponding to each clause of at least one clause (e.g., 30 clauses) and length (i.e., weight) thereof, a first fingerprint corresponding to the article to be detected is generated by combination.
It should be understood that the first fingerprint of the article may be represented as: the fingerprint information of the first clause, the length of the first clause, the fingerprint information of the second clause, the length of the second clause, the fingerprint information of the third clause, the length of the third clause, and so on, with adjacent items separated by a space.
It should be noted that the first fingerprint may be an MD5 fingerprint. Owing to the hashing property of the MD5 algorithm, even a small difference between articles can result in completely different MD5 fingerprint information.
Therefore, using the length of the clause as its weight ties the fingerprint information of the article to be detected to both the content and the length of the extracted clauses, so that a small difference between articles produces a large difference in the generated fingerprint information.
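A minimal sketch of steps 220 and 230, assuming that the fingerprint information of a clause is the hexadecimal MD5 digest of its UTF-8 bytes and that the clause length is counted in characters; both choices, and the function name `first_fingerprint`, are illustrative rather than taken from the application.

```python
import hashlib
from typing import List


def first_fingerprint(clauses: List[str]) -> str:
    """Merge per-clause MD5 digests and lengths into one space-separated string."""
    parts = []
    for clause in clauses:
        digest = hashlib.md5(clause.encode("utf-8")).hexdigest()
        # The weight of a clause is its own length.
        parts.append(f"{digest} {len(clause)}")
    return " ".join(parts)
```

Because each clause is hashed separately, changing one clause alters only that clause's digest, which is what makes a clause-by-clause comparison between two fingerprints possible.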
Fig. 3 is a flowchart illustrating a process of generating a second fingerprint according to an exemplary embodiment of the present application. As shown in Fig. 3, the method for generating the second fingerprint includes the following steps.
310: At least one keyword is extracted according to the association relationship between words in the article to be detected.
Specifically, at least one keyword in the article to be detected can be extracted with the TextRank algorithm according to the association relationship between words in the article. The number of keywords can be adjusted dynamically based on the length of the article to be detected and is not specifically limited in the embodiment of the application.
320: The word frequency of each of the at least one keyword is determined based on the number of times the keyword appears in the article to be detected, and the word frequency is set as the weight of the corresponding keyword.
Specifically, it should be noted that, when extracting the at least one keyword, the TextRank algorithm calculates a weight for each keyword based on the content of the article. In order to reduce the randomness of second fingerprint generation and weaken the influence of some high-frequency words on fingerprint generation, the embodiment of the application applies the TF-IDF algorithm to calculate the weight of each keyword over the whole text, thereby determining the word frequency of each keyword in the article to be detected, sets an upper limit on the word frequency, and uses the word frequency in place of the weight generated by the TextRank algorithm.
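The capped word-frequency weighting can be sketched as below. The sketch assumes the article is already tokenized and the keywords are already extracted; the cap value of 5 and the function name `keyword_weights` are hypothetical, since the application gives no concrete upper limit.

```python
from collections import Counter
from typing import Dict, Iterable, List


def keyword_weights(tokens: Iterable[str], keywords: List[str], cap: int = 5) -> Dict[str, int]:
    """Weight each keyword by its occurrence count in the article, limited to `cap`."""
    counts = Counter(tokens)
    # Counter returns 0 for a keyword that never occurs in the tokens.
    return {kw: min(counts[kw], cap) for kw in keywords}
```

The cap keeps a single very frequent word from dominating the accumulated bit sums when the fingerprint is built.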
330: The second fingerprint is generated based on the at least one keyword and the weight corresponding to each keyword.
In an embodiment, the second fingerprint is a simhash fingerprint.
Specifically, each keyword is converted into a series of numbers using a specified function; for example, a hash function is used to calculate a hash value for each keyword, the hash value being an n-bit signature consisting of the binary digits 0 and 1. The series of numbers corresponding to each keyword is then weighted by the weight of that keyword. The weighted results calculated for all keywords are accumulated into a single sequence. Finally, dimension reduction is performed on the sequence, for example by setting each position whose accumulated result is greater than 0 to 1 and every other position to 0, so as to obtain the second fingerprint corresponding to the article to be detected.
Therefore, using the word frequency as the weight of the keyword improves the weighting of some high-frequency words and reduces the randomness of generating the second fingerprint.
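The hashing, weighting, accumulation and dimension-reduction steps can be sketched as follows. The sketch follows the usual simhash convention that a hash bit of 1 contributes +weight and a hash bit of 0 contributes -weight. Taking the first 8 bytes of an MD5 digest as the 64-bit keyword signature is an assumption made for illustration; the application does not name a hash function.

```python
import hashlib
from typing import Dict


def simhash(weights: Dict[str, int], bits: int = 64) -> int:
    """Build a simhash fingerprint from keyword weights (bits must be a multiple of 8, at most 128)."""
    totals = [0] * bits
    for keyword, weight in weights.items():
        # Assumed signature: the leading bits of the keyword's MD5 digest.
        digest = hashlib.md5(keyword.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Bit set: add the weight; bit clear: subtract it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, total in enumerate(totals):
        if total > 0:  # dimension reduction: positive sum -> 1, otherwise 0
            fp |= 1 << i
    return fp
```

With a single keyword the fingerprint equals that keyword's own signature, which gives a quick sanity check on the accumulation logic.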
Fig. 4 is a flowchart illustrating an article detection method according to another exemplary embodiment of the present application. The method of Fig. 4 is performed by a computing device, e.g., a server. As shown in Fig. 4, the article detection method includes the following.
410: The server generates, for an article to be detected sent by user equipment, a first fingerprint in a first construction mode and a second fingerprint in a second construction mode, wherein a plurality of first index records generated based on the first construction mode are provided for the first fingerprint, and a plurality of second index records generated based on the second construction mode are provided for the second fingerprint.
Specifically, this step is substantially the same as step 110 in Fig. 1. Please refer to the related description of Fig. 1 for details, which are not repeated here.
420: The similarity of the first fingerprint is compared with the plurality of first index records to obtain a first comparison result.
In particular, the first comparison result comprises original or non-original.
In an example, when the similarity between the first fingerprint and the plurality of first index records exceeds a threshold, the first comparison result is non-original and further includes the first article recalled from the first index structure as most similar to the article to be detected. For details, please refer to the description of the embodiment of Fig. 5, which is not repeated here.
In an example, the first comparison result is original when the similarity comparison of the first fingerprint to the plurality of first index records does not exceed the threshold.
430: The Hamming distance between the second fingerprint and the plurality of second index records is compared to obtain a second comparison result.
In particular, the second comparison result includes original or non-original.
In an example, when the second comparison result is determined to be non-original, the second comparison result further includes the second article recalled from the second index structure as most similar to the article to be detected. For the specific process of determining the second comparison result, please refer to the description of the embodiment of Fig. 6; details are not repeated here.
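The Hamming-distance comparison in step 430 can be sketched as below, using the rule summarized in the disclosure: among candidate articles whose distance does not exceed the fourth preset threshold, the one with the smallest distance is the second article. The threshold value of 3, the function names, and the dictionary of candidate fingerprints are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def nearest_article(
    query: int, candidates: Dict[str, int], max_distance: int = 3
) -> Optional[Tuple[str, int]]:
    """Return (article ID, distance) of the closest candidate within the threshold, or None."""
    best: Optional[Tuple[str, int]] = None
    for article_id, fp in candidates.items():
        d = hamming_distance(query, fp)
        if d <= max_distance and (best is None or d < best[1]):
            best = (article_id, d)
    return best
```

A return value of `None` corresponds to a second comparison result of original; any other value carries the second article.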
440: determining a detection result based on the first comparison result and the second comparison result.
In an embodiment, the detection result includes original or non-original.
Specifically, the detection result may be determined based on the first comparison result and the second comparison result.
In an example, when the first comparison result and the second comparison result are both original, the detection result is original.
In an example, when either of the first comparison result and the second comparison result is non-original, the detection result is non-original, and the detection result may further include the recalled article that is most similar to the article to be detected. When the first article and the second article are different, the recalled article can be determined according to the release times of the first article and the second article, that is, the article with the earlier release time is sent to the user as the most similar recalled article. When the first article is the same as the second article, the first article (or the second article) is sent directly to the user as the most similar article.
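The decision rule of step 440 can be sketched as follows. A comparison result of original is modeled as `None`, and a non-original result as a `Match` carrying the recalled article; the `Match` class, its field names, and the use of a numeric timestamp are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Match:
    """Hypothetical recalled article attached to a non-original comparison result."""
    article_id: str
    published_at: float  # assumed: release time as a Unix timestamp


def detect(first: Optional[Match], second: Optional[Match]) -> Tuple[str, Optional[Match]]:
    """Combine the two comparison results; None means that comparison found no match."""
    matches = [m for m in (first, second) if m is not None]
    if not matches:
        # Original only when both comparison results are original.
        return ("original", None)
    # Otherwise non-original; report the recalled article released earliest.
    return ("non-original", min(matches, key=lambda m: m.published_at))
```

When both comparisons recall the same article, `min` simply returns that article, which matches the case where the first article and the second article are the same.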
Therefore, the embodiment of the application effectively improves the recall rate of similar articles by combining two fingerprint schemes. Meanwhile, similar articles can be recalled quickly by the two fingerprint schemes, so that the misjudgment rate of detection results is greatly reduced, and the reliability of the recalled most similar articles is high.
Fig. 5 is a flowchart illustrating a process of determining a first comparison result according to an exemplary embodiment of the present application. The method for determining the first comparison result includes the following steps.
510: based on the first fingerprint and the plurality of first index records, at least one article corresponding to a first predetermined number of first index records is obtained.
In one embodiment, the most similar of the at least one article to the first fingerprint is the first article.
Specifically, each first index record corresponds to one article. To find articles similar to the first fingerprint, the first fingerprint is compared for similarity against the plurality of first index records in the first index structure; first index records that contain the same fingerprint information as the first fingerprint are located, and the articles corresponding to those records are returned. The more fingerprint information a record shares with the first fingerprint, the higher its return priority.
For example, the first predetermined number may be 50, 60, or 100, i.e., the articles ranked in the top 50, 60, or 100 by similarity may be considered similar articles and returned in order of priority. The first predetermined number is not specifically limited, and a user can set it flexibly according to actual conditions.
Preferably, the first predetermined number is set to 50.
520: when the ratio of the length of the portion shared by the first fingerprint and a third fingerprint, generated from the first article based on the first construction mode, to the length of the third fingerprint exceeds a first preset threshold, generating a first index result.
In particular, a third fingerprint of the same type as the first fingerprint is generated from the first article based on the first construction mode. For example, if the first fingerprint is an MD5 fingerprint, the third fingerprint is also an MD5 fingerprint.
In an example, when the ratio of the length of the overlapping fingerprint information of the first fingerprint and the third fingerprint to the total length of the third fingerprint exceeds the first preset threshold, a first index result indicating non-original is generated. Illustratively, the first preset threshold may be 25%; the first preset threshold is not particularly limited in the embodiments of the present application.
In an example, when that ratio does not exceed the first preset threshold, a first index result indicating original is generated.
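The ratio test of step 520 can be illustrated with a short sketch. It assumes, for illustration only, that a fingerprint is modelled as a mapping from the MD5 digest of each clause to the clause length (its weight), and that the "length of the overlapping portion" is the summed weight of the digests shared by both fingerprints. The function names and the default threshold of 0.25 are hypothetical choices that mirror the 25% example above.

```python
import hashlib


def build_fingerprint(clauses):
    """Map each clause's MD5 digest to its weight (the clause length)."""
    return {hashlib.md5(c.encode("utf-8")).hexdigest(): len(c) for c in clauses}


def overlap_ratio(first_fp, third_fp):
    """Weight shared with first_fp divided by the total weight of third_fp."""
    total = sum(third_fp.values())
    if total == 0:
        return 0.0
    shared = sum(w for digest, w in third_fp.items() if digest in first_fp)
    return shared / total


def first_index_result(first_fp, third_fp, threshold=0.25):
    """Non-original when the overlap ratio exceeds the threshold."""
    ratio = overlap_ratio(first_fp, third_fp)
    return "non-original" if ratio > threshold else "original"
```

For instance, if the first article's fingerprint has a total weight of 20 and a clause of length 6 is shared, the ratio is 0.3, which exceeds 25% and yields a non-original first index result.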
530: when the ratio of the length of the portion shared by the first fingerprint and the fourth fingerprints, generated from the at least one article based on the first construction mode, to the total length of the plurality of fourth fingerprints exceeds a second preset threshold, generating a second index result.
In particular, the at least one article may be the top 50 recalled articles that are most similar to the article to be detected. For each of the top 50 articles, a fourth fingerprint of the same type as the first fingerprint is generated based on the first construction mode.
In an example, when the ratio of the length of the overlapping fingerprint information of the first fingerprint and the plurality of fourth fingerprints corresponding to the top 50 articles to the total length of the fourth fingerprints of the top 50 articles exceeds the second preset threshold, a second index result indicating non-original is generated. The numerical value of the second preset threshold is not particularly limited in the embodiments of the present application.
In an example, when that ratio does not exceed the second preset threshold, a second index result indicating original is generated.
It should be noted that, to preserve detection speed, when the length of the overlapping portion across the top 50 articles is calculated, an overlapping fingerprint segment that appears in several recalled articles is counted only once. This avoids repeated calculation when the 50 recalled articles all contain the same overlapping fingerprint segment (i.e., the top 50 articles all contain the same sentence).
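The count-once rule for step 530 can be sketched as follows. The sketch makes several assumptions that are not stated in the application: each fingerprint is a mapping from clause digest to clause length, the numerator counts each shared digest once across all recalled articles, and the denominator is the plain sum of all fourth-fingerprint weights. The function name `pooled_overlap_ratio` is hypothetical.

```python
def pooled_overlap_ratio(first_fp, recalled_fps):
    """Overlap between first_fp and a list of recalled fingerprints.

    A digest shared with first_fp is counted once, even if it occurs
    in several recalled articles.
    """
    total = sum(sum(fp.values()) for fp in recalled_fps)
    if total == 0:
        return 0.0
    seen = set()
    shared = 0
    for fp in recalled_fps:
        for digest, weight in fp.items():
            if digest in first_fp and digest not in seen:
                seen.add(digest)
                shared += weight
    return shared / total
```

The second index result would then be non-original when this ratio exceeds the second preset threshold, whose value the application leaves open.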
540: when the Hamming distance between the second fingerprint and a fifth fingerprint generated from the first article based on the second construction mode exceeds a third preset threshold, generating a third index result.
Specifically, the first fingerprint can represent at most a certain number (e.g., 30) of long sentences of the article to be detected, so misjudgment remains possible for long articles. Therefore, when the first index result and/or the second index result is determined to be non-original, the Hamming distance between the second fingerprint of the article to be detected and a fifth fingerprint of the same type and length, generated from the first article based on the second construction mode, is further compared.
The hamming distance refers to the number of different characters at the corresponding positions of two character strings. That is, it is the number of characters that need to be replaced to convert one string into another.
In an example, when the Hamming distance between the second fingerprint of the article to be detected and the fifth fingerprint generated from the first article based on the second construction mode exceeds the third preset threshold, a third index result indicating original is generated. Illustratively, the third preset threshold may be 24; the third preset threshold is not specifically limited in the embodiments of the present application.
In an example, when that Hamming distance does not exceed the third preset threshold, a third index result indicating non-original is generated.
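For binary fingerprints, the Hamming distance of step 540 is the number of differing bits, which can be computed with an exclusive OR. The following is a minimal sketch that assumes the fingerprints are held as Python integers; the function names and the default threshold of 24 are illustrative and follow the example value above.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which the two fingerprints differ."""
    return bin(a ^ b).count("1")


def third_index_result(second_fp: int, fifth_fp: int, threshold: int = 24) -> str:
    """Original when the fingerprints differ in more than `threshold` bits."""
    distance = hamming_distance(second_fp, fifth_fp)
    return "original" if distance > threshold else "non-original"
```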
550: determining a first comparison result based on the first index result and/or the second index result and/or the third index result.
In one embodiment, the first comparison result includes a first article.
Specifically, the first comparison result is original if and only if the first, second, and third index results are all original; otherwise it is non-original. For example, when the first index result is non-original, the second index result is non-original, and the third index result is original, the first comparison result is non-original.
It should be understood that step 520, step 530 and step 540 are parallel steps, and may be performed simultaneously in an actual execution process, or may be performed in a specified order, which is not specifically limited in this embodiment of the application.
Therefore, the embodiment of the application reduces the misjudgment rate for longer articles (for example, more than 30 sentences) by calculating the overlap ratio and the Hamming distance between the article to be detected and the predetermined number of recalled articles.
Fig. 6 is a flowchart illustrating a process of determining a second comparison result according to an exemplary embodiment of the present application. The method for determining the second comparison result comprises the following steps.
For convenience of description, the method of specifically determining the second comparison result is as follows.
610: the second fingerprint is divided into four groups of fingerprints.
Specifically, the 64-bit simhash fingerprint (i.e., the second fingerprint) generated from the article to be detected based on the second construction mode is divided into four 16-bit binary codes.
620: comparing each of the four groups of fingerprints with the plurality of second index records by Hamming distance to obtain a plurality of articles corresponding to a second predetermined number of second index records.
Specifically, each of the four groups of fingerprints is looked up in the second index structure and compared by Hamming distance with the plurality of second index records, yielding the similar articles corresponding to the second index records recalled by each group. The similar articles recalled by the groups are then ranked by priority, and the top second predetermined number of articles are extracted. The second predetermined number is not particularly limited in the embodiments of the present application and may be, for example, 40, 50, or 60.
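The segmented lookup of steps 610 and 620 can be sketched as follows. The sketch assumes that a candidate is recalled when any of its four 16-bit segments exactly matches the query segment at the same position, which is a common way to index simhash fingerprints but is not spelled out in this application. `SegmentIndex`, `split_segments`, and the `limit` parameter are hypothetical names.

```python
from collections import defaultdict


def split_segments(fp64: int):
    """Split a 64-bit fingerprint into four 16-bit segments, high bits first."""
    return [(fp64 >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


class SegmentIndex:
    """Toy stand-in for the second index structure."""

    def __init__(self):
        self._buckets = defaultdict(set)  # (position, segment) -> article ids
        self._fingerprints = {}           # article id -> 64-bit fingerprint

    def add(self, article_id: str, fp64: int) -> None:
        self._fingerprints[article_id] = fp64
        for pos, seg in enumerate(split_segments(fp64)):
            self._buckets[(pos, seg)].add(article_id)

    def candidates(self, fp64: int, limit: int = 50):
        """Recall articles sharing a segment, nearest by Hamming distance first."""
        found = set()
        for pos, seg in enumerate(split_segments(fp64)):
            found |= self._buckets.get((pos, seg), set())
        ranked = sorted(
            found,
            key=lambda a: (bin(self._fingerprints[a] ^ fp64).count("1"), a),
        )
        return ranked[:limit]
```

With exact segment matching, any stored fingerprint within Hamming distance 3 of the query shares at least one segment with it, since three differing bits can touch at most three of the four segments; larger distances may require a finer-grained scheme.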
630: when the Hamming distance between the second fingerprint and the fingerprint generated from each of the plurality of articles based on the second construction mode does not exceed a fourth preset threshold, obtaining a second comparison result.
In one embodiment, the second comparison result includes a second article of the plurality of articles having a minimum hamming distance from the second fingerprint.
Specifically, for each of the plurality of articles, a fingerprint of the same type and length as the second fingerprint (a 64-bit encoded fingerprint) is generated based on the second construction mode. Each generated 64-bit fingerprint is then compared with the second fingerprint by Hamming distance, and when the Hamming distance does not exceed a fourth preset threshold (e.g., 6), a second comparison result indicating non-original is obtained. The fourth preset threshold is not specifically limited in the embodiments of the application and can be set flexibly according to actual conditions.
Preferably, the embodiment of the present application sets the fourth preset threshold to 6.
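Step 630 can be sketched as a final filter over the recalled candidates. The following minimal example assumes the candidate fingerprints are supplied as a mapping from article identifier to 64-bit integer; the function name and return shape are hypothetical, and the default threshold of 6 follows the preferred value above.

```python
def second_comparison(second_fp: int, candidate_fps: dict, threshold: int = 6):
    """Return (verdict, second_article_id) for the recalled candidates.

    The second article is the candidate with the smallest Hamming distance
    to second_fp, provided that distance does not exceed the threshold.
    """
    best_id, best_dist = None, None
    for article_id, fp in candidate_fps.items():
        dist = bin(second_fp ^ fp).count("1")
        if dist <= threshold and (best_dist is None or dist < best_dist):
            best_id, best_dist = article_id, dist
    if best_id is None:
        return "original", None
    return "non-original", best_id
```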
Therefore, the misjudgment rate for original articles is reduced by checking the second fingerprint (simhash fingerprint) against the second index structure.
In an embodiment of the present application, determining the detection result based on the first comparison result and the second comparison result includes: generating the detection result according to the release time of the first article and the release time of the second article, wherein the detection result includes a non-original determination and whichever of the first article and the second article was released earlier.
Specifically, if the articles returned by the two discrimination methods are the same, that is, the first article and the second article are the same, the detection result may include the first article (or the second article) while including the non-original prompt sentence.
If the articles returned by the two discrimination modes are different, namely the first article is different from the second article, the detection result comprises a non-original prompt statement, and the article with the earlier release time in the first article and the second article can be included according to the release time for the user to check.
Therefore, the embodiments of the present application determine the most similar article based on release time, which helps ensure the authenticity and validity of the detection result.
In an embodiment of the present application, after the server compares the similarity between the first fingerprint and the second fingerprint with the plurality of first index records and the plurality of second index records to obtain a detection result of the article to be detected, the method further includes: and when the detection result is original, storing the first fingerprint of the article to be detected in the first index structure as a new first index record, and storing the second fingerprint in the second index structure as a new second index record.
Specifically, an original detection result indicates that the article to be detected is an original article (i.e., the first article); that is, neither the first index structure nor the second index structure already contains the article to be detected.
In one example, the first fingerprint of the article to be detected is stored as a new first index record in a fingerprint library in the first index structure, for example the first article MD5 fingerprint library; and the second fingerprint of the article to be detected is stored as a new second index record in a fingerprint library in the second index structure, for example the first article simhash fingerprint library.
It should be understood that if an article is deleted, its index records in both index structures are deleted at the same time.
Therefore, original articles are recorded in the first index structure and the second index structure, and the databases contained in the two index structures are continuously updated and improved, so that the accuracy of the detection result is higher.
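The index maintenance described above can be sketched with two in-memory mappings standing in for the first and second index structures. This is an illustrative assumption only: the application does not specify the storage layout, and the function names below are hypothetical.

```python
def register_if_original(verdict, article_id, first_fp, second_fp,
                         first_index, second_index):
    """Store both fingerprints as new index records when the verdict is original."""
    if verdict != "original":
        return False
    first_index[article_id] = first_fp
    second_index[article_id] = second_fp
    return True


def delete_article(article_id, first_index, second_index):
    """Remove the article's records from both index structures together."""
    first_index.pop(article_id, None)
    second_index.pop(article_id, None)
```

Keeping the two updates in one function makes it harder for the index structures to drift apart when articles are added or deleted.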
In an embodiment of the present application, the first fingerprint is an MD5 fingerprint of the kind used for file fingerprint verification, and the second fingerprint is a simhash fingerprint.
Fig. 7 is a schematic structural diagram of an article detection apparatus 700 according to an exemplary embodiment of the present application. As shown in fig. 7, the article detection apparatus 700 includes: a generation module 710, a comparison module 720, and a determination module 730.
The generating module 710 is configured to generate a first fingerprint in a first building manner and generate a second fingerprint in a second building manner for an article to be detected sent by user equipment, where multiple first index records generated based on the first building manner are set for the first fingerprint and multiple second index records generated based on the second building manner are set for the second fingerprint; a comparing module 720, configured to perform similarity comparison on the first fingerprint and the second fingerprint with the plurality of first index records and the plurality of second index records, respectively, to obtain a first comparison result and a second comparison result; the determining module 730 is configured to determine a detection result of the article to be detected according to the first comparison result and the second comparison result.
The embodiment of the application provides an article detection device in which two different fingerprints are generated through two different construction modes and the two fingerprint detection schemes are combined, so that the misjudgment rate for original articles is effectively reduced, the recall rate of similar texts is improved, and the purposes of protecting the rights and interests of article authors and supporting originality are achieved.
According to an embodiment of the present application, the generating module 710 is further configured to obtain at least one clause with a preset length based on the article to be detected; generating fingerprint information and weight respectively corresponding to at least one clause based on at least one clause, wherein the weight of each clause in the at least one clause is the length of the clause; merging and generating a first fingerprint of the article to be detected based on fingerprint information and weight respectively corresponding to at least one clause; extracting at least one keyword according to the incidence relation of the words in the article to be detected; determining word frequency numbers corresponding to the at least one keyword respectively based on the frequency of the at least one keyword appearing in the article to be detected, and setting the word frequency numbers as the weight of the corresponding at least one keyword; and generating a second fingerprint based on the at least one keyword and the weight corresponding to the at least one keyword.
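The two construction modes handled by the generating module can be sketched as follows. The sketch assumes that the clauses and keywords have already been extracted (the association-based keyword extraction is not reproduced here), that repeated keywords in the input list give the word frequency, and that each keyword is hashed to 64 bits by truncating its MD5 digest. The truncation and the function names are illustrative choices, not details taken from the application.

```python
import hashlib
from collections import Counter


def first_fingerprint(clauses):
    """Return (md5_hex, weight) pairs; the weight is the clause length."""
    return [(hashlib.md5(c.encode("utf-8")).hexdigest(), len(c)) for c in clauses]


def simhash64(keywords):
    """Weighted 64-bit simhash; each keyword's weight is its frequency."""
    weights = Counter(keywords)
    tally = [0] * 64
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for bit in range(64):
            # Add the weight for a 1 bit, subtract it for a 0 bit.
            tally[bit] += weight if (h >> bit) & 1 else -weight
    fp = 0
    for bit in range(64):
        if tally[bit] > 0:
            fp |= 1 << bit
    return fp
```

Because each bit of the result is decided by a weighted vote, articles with largely the same keywords and frequencies tend to produce fingerprints that differ in only a few bits, which is what makes Hamming distance a useful similarity measure for the second fingerprint.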
According to an embodiment of the present application, the comparing module 720 is further configured to perform similarity comparison between the first fingerprint and the plurality of first index records to obtain a first comparison result; and comparing the second fingerprint with the plurality of second index records to obtain a second comparison result, wherein the detection result comprises originality or non-originality.
According to an embodiment of the present application, the comparing module 720 is further configured to obtain at least one article corresponding to a first predetermined number of first index records based on the first fingerprint and the plurality of first index records, where the article that is most similar to the first fingerprint is the first article; when the ratio of the length of the portion shared by the first fingerprint and a third fingerprint, generated from the first article based on the first construction mode, to the length of the third fingerprint exceeds a first preset threshold, generate a first index result; and/or when the ratio of the length of the portion shared by the first fingerprint and the fourth fingerprints, generated from the at least one article based on the first construction mode, to the total length of the plurality of fourth fingerprints exceeds a second preset threshold, generate a second index result; and/or when the Hamming distance between the second fingerprint and a fifth fingerprint generated from the first article based on the second construction mode exceeds a third preset threshold, generate a third index result; and determine a first comparison result based on the first index result and/or the second index result and/or the third index result, wherein the first comparison result comprises the first article.
The comparing module 720 is further configured to divide the second fingerprint into four groups of fingerprints according to an embodiment of the present application; comparing the Hamming distance of each group of fingerprints in the four groups of fingerprints with a plurality of second index records respectively to obtain a plurality of articles corresponding to a second predetermined number of second index records; and when the Hamming distance between the fingerprint generated by the plurality of articles based on the second construction mode and the second fingerprint respectively does not exceed a fourth preset threshold, obtaining a second comparison result, wherein the second comparison result comprises a second article with the smallest Hamming distance from the second fingerprint in the plurality of articles.
According to an embodiment of the present application, the comparing module 720 is further configured to generate a detection result according to the release time of the first article and the release time of the second article, where the detection result includes a non-original determination and whichever of the first article and the second article was released earlier.
According to an embodiment of the present application, the apparatus further includes a storage module 740, configured to store the first fingerprint of the article to be detected as a new first index record in the first index structure and store the second fingerprint as a new second index record in the second index structure when the detection result is original.
According to an embodiment of the application, the first fingerprint is an MD5 fingerprint of the kind used for file fingerprint verification, and the second fingerprint is a simhash fingerprint.
It should be understood that, for specific working processes and functions of the generating module 710, the comparing module 720, the determining module 730, and the storing module 740 in the foregoing embodiments, reference may be made to the description of the article detection method provided in the foregoing embodiments of fig. 1 to 6, and in order to avoid repetition, details are not repeated herein.
Fig. 8 is a block diagram of an electronic device 800 for article detection provided by an exemplary embodiment of the present application.
Referring to fig. 8, electronic device 800 includes a processing component 810 that further includes one or more processors, and memory resources, represented by memory 820, for storing instructions, such as applications, that are executable by processing component 810. The application programs stored in memory 820 may include one or more modules that each correspond to a set of instructions. Further, the processing component 810 is configured to execute instructions to perform the article detection method described above.
The electronic device 800 may also include a power supply component configured to perform power management of the electronic device 800, a wired or wireless network interface configured to connect the electronic device 800 to a network, and an input-output (I/O) interface. The electronic device 800 may be operated based on an operating system stored in the memory 820, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Also provided is a non-transitory computer-readable storage medium having instructions stored thereon that, when executed by a processor of the electronic device 800, enable the electronic device 800 to perform an article detection method comprising: generating, by a server, a first fingerprint in a first construction mode and a second fingerprint in a second construction mode for an article to be detected sent by user equipment, wherein a plurality of first index records generated based on the first construction mode are set for the first fingerprint, and a plurality of second index records generated based on the second construction mode are set for the second fingerprint; comparing, by the server, the first fingerprint and the second fingerprint for similarity with the plurality of first index records and the plurality of second index records, respectively, to obtain a first comparison result and a second comparison result; and determining the detection result of the article to be detected according to the first comparison result and the second comparison result.
All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only one logical division, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in the description of the present application, the terms "first", "second", "third", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modifications, equivalents and the like that are within the spirit and principle of the present application should be included in the scope of the present application.

Claims (8)

1. An article detection method, comprising:
the method comprises the steps that a server generates a first fingerprint of an article to be detected sent by user equipment in a first construction mode, and generates a second fingerprint in a second construction mode, wherein a plurality of first index records generated based on the first construction mode are set for the first fingerprint, and a plurality of second index records generated based on the second construction mode are set for the second fingerprint;
the server respectively compares the first fingerprint and the second fingerprint with the plurality of first index records and the plurality of second index records to obtain a first comparison result and a second comparison result;
determining the detection result of the article to be detected according to the first comparison result and the second comparison result,
wherein the generating of the first fingerprint in the first construction mode comprises:
acquiring at least one clause with a preset length based on the article to be detected; generating fingerprint information and weight respectively corresponding to the at least one clause based on the at least one clause, wherein the weight of each clause in the at least one clause is the length of the clause; merging and generating the first fingerprint of the article to be detected based on the fingerprint information and the weight respectively corresponding to the at least one clause,
the obtaining of at least one clause of a preset length based on the article to be detected comprises:
when the length of the article to be detected is greater than or equal to a first preset threshold value, selecting, as the at least one clause, a predetermined number of the top-ranked complete sentences of the article to be detected when ranked by length; when the length of the article to be detected is smaller than the first preset threshold value and is larger than or equal to a second preset threshold value, selecting all sentences in the article to be detected as the at least one clause; when the length of the article to be detected is smaller than the second preset threshold value, taking the whole article to be detected as a clause,
the generating of the second fingerprint in the second construction mode comprises:
extracting at least one keyword according to the incidence relation of the words in the article to be detected; determining word frequency numbers corresponding to the at least one keyword respectively based on the frequency of the at least one keyword appearing in the article to be detected, and setting the word frequency numbers as the weight of the corresponding at least one keyword; generating the second fingerprint based on the at least one keyword and a weight corresponding to the at least one keyword,
the determining the detection result of the article to be detected according to the first comparison result and the second comparison result comprises:
when the first comparison result and the second comparison result are both original, the detection result of the article to be detected is original; otherwise, the detection result is non-original, wherein the first comparison result comprises original or non-original, and the second comparison result comprises original or non-original,
the first fingerprint is a file fingerprint verification MD5 fingerprint, and the second fingerprint is a simhash fingerprint.
2. The article detection method of claim 1, wherein the server compares the first fingerprint and the second fingerprint with the plurality of first index records and the plurality of second index records respectively for similarity, and obtaining a first comparison result and a second comparison result comprises:
comparing the similarity of the first fingerprint with the plurality of first index records to obtain a first comparison result;
comparing the second fingerprint with the plurality of second index records for Hamming distance to obtain the second comparison result,
wherein the detection result comprises original or non-original.
3. The article detection method of claim 2, wherein the comparing the first fingerprint with the plurality of first index records to obtain a first comparison result comprises:
obtaining at least one article corresponding to a first preset number of first index records based on the first fingerprint and the plurality of first index records, wherein the article which is most similar to the first fingerprint is the first article;
when the ratio of the length of the same part of the first fingerprint and a third fingerprint generated by the first article based on the first construction mode to the length of the third fingerprint exceeds a first preset threshold value, generating a first index result; and/or
when the ratio of the length of the portion shared by the first fingerprint and the fourth fingerprints, generated from the at least one article based on the first construction mode, to the total length of the fourth fingerprints exceeds a second preset threshold value, generating a second index result; and/or
when the Hamming distance between the second fingerprint and a fifth fingerprint generated from the first article based on the second construction mode exceeds a third preset threshold value, generating a third index result;
determining the first comparison result based on the first index result and/or the second index result and/or the third index result, wherein the first comparison result comprises the first article.
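The length-ratio test in claim 3 might be sketched as follows. The claim does not say how the "same part" of two fingerprints is measured, so this sketch assumes, purely for illustration, that fingerprints are strings and that the shared part is their longest common contiguous substring; `shared_ratio` and `exceeds_threshold` are hypothetical names.

```python
from difflib import SequenceMatcher


def shared_ratio(candidate_fp: str, reference_fp: str) -> float:
    """Length of the longest common substring divided by len(reference_fp).

    Assumption: "same part" means the longest contiguous run shared by both strings.
    """
    if not reference_fp:
        return 0.0
    matcher = SequenceMatcher(None, candidate_fp, reference_fp, autojunk=False)
    match = matcher.find_longest_match(0, len(candidate_fp), 0, len(reference_fp))
    return match.size / len(reference_fp)


def exceeds_threshold(candidate_fp: str, reference_fp: str, threshold: float) -> bool:
    """True when the shared-length ratio is strictly above the preset threshold."""
    return shared_ratio(candidate_fp, reference_fp) > threshold
```

Other readings of "same part" (for example, counting matching clause hashes) would need a different measure; the threshold comparison itself would stay the same.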
4. The article detection method of claim 3, wherein the comparing the second fingerprint to the plurality of second index records to obtain the second comparison result comprises:
dividing the second fingerprint into four groups of fingerprints;
comparing each of the four groups of fingerprints with the plurality of second index records by Hamming distance to obtain a plurality of articles corresponding to a second preset number of second index records;
when the Hamming distances between the second fingerprint and the fingerprints generated from the plurality of articles based on the second construction mode do not exceed a fourth preset threshold, obtaining the second comparison result, wherein the second comparison result comprises a second article with the smallest Hamming distance from the second fingerprint in the plurality of articles.
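The four-group lookup in claim 4 resembles the common banded simhash index, sketched below under stated assumptions: the fingerprint is 64 bits wide, each group is 16 bits, candidates are found by exact match on any one group, and the final filter uses a threshold of 3. None of these numbers appear in the claim, and `BandIndex`, `split_bands`, and `max_distance` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

BANDS = 4
BAND_BITS = 16  # assumption: a 64-bit simhash split into four 16-bit groups
BAND_MASK = (1 << BAND_BITS) - 1


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def split_bands(fp: int) -> List[int]:
    """Split a 64-bit fingerprint into four 16-bit groups, lowest bits first."""
    return [(fp >> (i * BAND_BITS)) & BAND_MASK for i in range(BANDS)]


class BandIndex:
    def __init__(self) -> None:
        self._buckets: Dict[Tuple[int, int], Set[str]] = defaultdict(set)
        self._fingerprints: Dict[str, int] = {}

    def add(self, article_id: str, fp: int) -> None:
        """Store the fingerprint and register it under each of its four groups."""
        self._fingerprints[article_id] = fp
        for pos, band in enumerate(split_bands(fp)):
            self._buckets[(pos, band)].add(article_id)

    def closest(self, fp: int, max_distance: int = 3) -> Optional[Tuple[str, int]]:
        """Return (article_id, distance) of the nearest candidate within max_distance."""
        candidates: Set[str] = set()
        for pos, band in enumerate(split_bands(fp)):
            candidates |= self._buckets.get((pos, band), set())
        best: Optional[Tuple[str, int]] = None
        for article_id in candidates:
            d = hamming_distance(fp, self._fingerprints[article_id])
            if d <= max_distance and (best is None or d < best[1]):
                best = (article_id, d)
        return best
```

With four groups, any two fingerprints that differ in at most three bits must agree exactly on at least one group, which is why exact group matching is enough to collect candidates for that threshold.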
5. The article detection method according to claim 4, wherein the determining the detection result of the article to be detected according to the first comparison result and the second comparison result comprises:
generating the detection result according to the release time of the first article and the release time of the second article, wherein the detection result comprises the non-original and whichever of the first article and the second article has the earlier release time.
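The release-time rule in claim 5 can be sketched as a small selection function. The `Article` record, its `published_at` field, and the handling of a missing match are assumptions added for illustration; the claim only states that the earlier-released of the two matched articles is reported.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Article:
    article_id: str
    published_at: datetime  # hypothetical field holding the release time


def earlier_source(first: Optional[Article], second: Optional[Article]) -> Optional[Article]:
    """Return whichever matched article was released earlier.

    Assumption: either match may be absent (None); ties go to `first`.
    """
    candidates = [a for a in (first, second) if a is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda a: a.published_at)
```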
6. The article detection method according to claim 1, further comprising, after determining the detection result of the article to be detected based on the first comparison result and the second comparison result:
when the detection result is original, storing the first fingerprint of the article to be detected in a first index structure as a new first index record, and storing the second fingerprint in a second index structure as a new second index record.
7. An article detection device, comprising:
the generation module is used for generating a first fingerprint in a first construction mode and generating a second fingerprint in a second construction mode for an article to be detected sent by user equipment, wherein a plurality of first index records generated based on the first construction mode are set for the first fingerprint, and a plurality of second index records generated based on the second construction mode are set for the second fingerprint;
the comparison module is used for comparing the first fingerprint with the plurality of first index records, and the second fingerprint with the plurality of second index records, for similarity to obtain a first comparison result and a second comparison result, respectively;
a determining module, configured to determine a detection result of the article to be detected according to the first comparison result and the second comparison result,
the generation module is used for acquiring at least one clause with a preset length based on the article to be detected; generating fingerprint information and weight respectively corresponding to the at least one clause based on the at least one clause, wherein the weight of each clause in the at least one clause is the length of the clause; merging and generating the first fingerprint of the article to be detected based on the fingerprint information and the weight respectively corresponding to the at least one clause,
the generation module is further configured to take a predetermined number of the leading complete sentences of the article to be detected as the at least one clause when the length of the article to be detected is greater than or equal to a first preset threshold; when the length of the article to be detected is smaller than the first preset threshold value and is larger than or equal to a second preset threshold value, take all sentences in the article to be detected as the at least one clause; when the length of the article to be detected is smaller than the second preset threshold value, take the whole article to be detected as a clause,
the generation module is further used for extracting at least one keyword according to the association relationships among the words in the article to be detected; determining a word frequency for each of the at least one keyword based on how often that keyword appears in the article to be detected, and setting the word frequency as the weight of the corresponding keyword; generating the second fingerprint based on the at least one keyword and the weight corresponding to the at least one keyword,
the determining module is configured to determine that the detection result of the article to be detected is original when the first comparison result and the second comparison result are both original, and otherwise that the detection result is non-original, wherein the first comparison result comprises original or non-original, and the second comparison result comprises original or non-original,
the first fingerprint is an MD5 (file fingerprint verification) fingerprint, and the second fingerprint is a simhash fingerprint.
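The keyword-weighted second fingerprint described for the generation module can be sketched as a standard weighted simhash. This is an illustration under assumptions: keywords are supplied by the caller (the extraction step is not shown), each keyword is hashed to 64 bits by truncating its MD5 digest, and the weight is a plain substring count. The names `token_hash64`, `keyword_weights`, and `simhash` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable


def token_hash64(token: str) -> int:
    """Hash a keyword to 64 bits (assumption: first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def keyword_weights(keywords: Iterable[str], text: str) -> Dict[str, int]:
    """Weight each keyword by the number of times it occurs in the text."""
    return {keyword: text.count(keyword) for keyword in keywords}


def simhash(weights: Dict[str, int], bits: int = 64) -> int:
    """Weighted simhash: each keyword votes +weight or -weight on every bit position."""
    totals = [0] * bits
    for token, weight in weights.items():
        h = token_hash64(token)
        for i in range(bits):
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because every bit is decided by a weighted vote, articles that share most of their high-frequency keywords tend to produce fingerprints that differ in only a few bits, which is what makes the Hamming-distance comparison meaningful.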
8. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions,
wherein the processor is configured to perform the article detection method of any one of claims 1 to 6.
CN202110531324.8A 2021-05-17 2021-05-17 Article detection method and device Active CN112989793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110531324.8A CN112989793B (en) 2021-05-17 2021-05-17 Article detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110531324.8A CN112989793B (en) 2021-05-17 2021-05-17 Article detection method and device

Publications (2)

Publication Number Publication Date
CN112989793A (en) 2021-06-18
CN112989793B (en) 2021-08-06

Family

ID=76336617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110531324.8A Active CN112989793B (en) 2021-05-17 2021-05-17 Article detection method and device

Country Status (1)

Country Link
CN (1) CN112989793B (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740266A (en) * 2014-12-10 2016-07-06 国际商业机器公司 Data deduplication method and device
CN107229939B (en) * 2016-03-24 2020-12-04 北大方正集团有限公司 Similar document judgment method and device
CN107491424B (en) * 2016-06-12 2020-11-06 北京云量数盟科技有限公司 Chinese document gene matching method based on multi-weight system
CN106294861B (en) * 2016-08-23 2019-08-09 武汉烽火普天信息技术有限公司 Text polymerize and shows method and system in intelligence channel towards large-scale data
CN110019674A (en) * 2017-11-21 2019-07-16 盛霆信息技术(上海)有限公司 A kind of text plagiarizes detection method and system
CN110489745B (en) * 2019-07-31 2020-12-22 北京大学 Paper text similarity detection method based on citation network
CN110990676A (en) * 2019-11-28 2020-04-10 福建亿榕信息技术有限公司 Social media hotspot topic extraction method and system
CN112084448B (en) * 2020-08-31 2024-05-07 北京金堤征信服务有限公司 Similar information processing method and device
CN112380342A (en) * 2020-11-10 2021-02-19 福建亿榕信息技术有限公司 Electric power document theme extraction method and device

Also Published As

Publication number Publication date
CN112989793A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
KR101627592B1 (en) Detection of confidential information
US20220038424A1 (en) Pattern-based malicious url detection
US10579661B2 (en) System and method for machine learning and classifying data
Chakrabarti et al. An efficient filter for approximate membership checking
US8359472B1 (en) Document fingerprinting with asymmetric selection of anchor points
Urvoy et al. Tracking web spam with html style similarities
Fu et al. Privacy-preserving smart similarity search based on simhash over encrypted data in cloud computing
CN111581355A (en) Method, device and computer storage medium for detecting subject of threat intelligence
CN112579155A (en) Code similarity detection method and device and storage medium
CN112364625A (en) Text screening method, device, equipment and storage medium
US20200125532A1 (en) Fingerprints for open source code governance
CN112131249A (en) Attack intention identification method and device
Sutoyo et al. Detecting documents plagiarism using winnowing algorithm and k-gram method
CN117251879A (en) Secure storage and query method and system based on trust extension and computer storage medium
Shivaji et al. Plagiarism detection by using karp-rabin and string matching algorithm together
Han et al. Towards effective extraction and linking of software mentions from user-generated support tickets
Sindhu et al. Fingerprinting based detection system for identifying plagiarism in Malayalam text documents
CN117827952A (en) Data association analysis method, device, equipment and medium
Zhang et al. Effective and Fast Near Duplicate Detection via Signature‐Based Compression Metrics
CN112989793B (en) Article detection method and device
Rodier et al. Online near-duplicate detection of news articles
Prilepok et al. Spam detection using data compression and signatures
Rodrigues et al. Removing DUST using multiple alignment of sequences
CN112347477A (en) Family variant malicious file mining method and device
CN111625825B (en) Virus detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Article detection method and device

Effective date of registration: 20230112

Granted publication date: 20210806

Pledgee: Zhongguancun Beijing technology financing Company limited by guarantee

Pledgor: Beijing Innovation Lezhi Network Technology Co.,Ltd.|Changsha developer Technology Co.,Ltd.

Registration number: Y2023990000072
