CN112380833A - Similar text searching method and device for sentence-by-sentence comparison - Google Patents

Publication number
CN112380833A
Authority
CN
China
Prior art keywords
text
processed
paragraph
contrast
verified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011309156.XA
Other languages
Chinese (zh)
Other versions
CN112380833B (en
Inventor
贺倩明
雷宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Deli Technology Co ltd
Original Assignee
Shenzhen Deli Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Deli Technology Co ltd
Priority to CN202011309156.XA
Publication of CN112380833A
Application granted
Publication of CN112380833B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to a similar text searching method and device for sentence-by-sentence comparison. The method comprises: segmenting a text to be processed and a plurality of comparison texts to obtain a plurality of paragraphs to be processed and a plurality of comparison paragraphs for each comparison text; calculating the digital fingerprints of all paragraphs; determining, on the basis of the digital fingerprints, the comparison paragraphs that are identical to paragraphs to be processed; processing the remaining paragraphs with a dynamic programming algorithm to obtain the similarity between the text to be processed and each comparison text; and determining the comparison texts with high similarity. Compared with prior-art duplicate checking based on minimum edit distance calculation or dimension-reduction comparison, the method first performs preliminary processing on sentence segments as units using digital fingerprints and then processes the remaining segments precisely with a dynamic programming algorithm, which preserves searching and comparison efficiency while making the resulting similarity more accurate.

Description

Similar text searching method and device for sentence-by-sentence comparison
Technical Field
The invention relates to the technical field of data processing, in particular to a similar text searching method and device for sentence-by-sentence comparison.
Background
At present, there are two main approaches to the task of searching for similar texts: calculating the minimum edit distance, and dimension-reduction comparison.
The minimum edit distance method calculates the minimum number of editing operations required to convert one text document into another. The editing operations are insertion, deletion, and replacement; the smaller the edit distance, the greater the similarity of the two documents. The limitation of this method is that it only performs a sequential word-by-word comparison, so it cannot identify similar texts whose paragraphs have the same content but appear in a different order. It must also traverse the text word by word from the beginning, which is time-consuming for long texts.
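The minimum edit distance described above can be sketched as follows. This is a generic illustration of the classic dynamic programming formulation, not code from the patent; the function name `edit_distance` is an assumption made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] holds the distance between the first i-1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i items of a into an empty text takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb, or keep it if equal
            ))
        prev = curr
    return prev[-1]
```

Because the comparison is strictly sequential, two texts that contain the same units in swapped order (for instance "AB" and "BA") still receive a non-zero distance, which is the limitation noted above.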
The dimension-reduction method maps the text content into a low-dimensional vector space. It assumes that if two texts are similar, the vectors obtained by mapping them into the low-dimensional space are also similar. The texts are converted into vectors or hash values, and the degree of similarity is judged by the cosine of the angle between the two vectors or by the Hamming distance between the two hash values, respectively. The limitation of this method is that dimension reduction blurs the text, so the similarity cannot be expressed precisely; whether two texts are similar can only be decided by comparing a numerical value in the low-dimensional space against a set threshold.
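The two measures mentioned above can be illustrated with a minimal sketch. The bag-of-words vectorisation and the function names `cosine_similarity` and `hamming_distance` are assumptions chosen for the example; the background section does not specify how texts are mapped to vectors or hash values.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between whitespace-token count vectors of a and b."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hamming_distance(h1: int, h2: int) -> int:
    """Number of bit positions in which two integer hash values differ."""
    return bin(h1 ^ h2).count("1")
```

In this sketch, texts that differ only in token order map to the same vector and score as fully similar, which illustrates why a value in the reduced space cannot express the exact differences between two texts.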
In view of this, it is necessary for those skilled in the art to provide an efficient and accurate similar text searching scheme.
Disclosure of Invention
The invention aims to provide a similar text searching method and device for sentence-by-sentence comparison.
In a first aspect, an embodiment of the present invention provides a similar text searching method for sentence-by-sentence comparison, which is applied to a computer device, where the computer device is in communication connection with a text database server, and the text database server stores a plurality of comparison texts;
the method comprises the following steps:
acquiring a text to be processed;
segmenting the text to be processed and the target contrast text based on a preset separator to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text, wherein the target contrast text is any one of the plurality of contrast texts;
calculating to obtain the digital fingerprints of each paragraph to be processed and each target contrast paragraph;
determining a first to-be-processed paragraph from the plurality of to-be-processed paragraphs and a first comparison paragraph from the plurality of target comparison paragraphs, and counting to obtain the same paragraph parameter, wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph;
when the proportion of second contrast paragraphs among the plurality of target contrast paragraphs does not exceed a preset proportion, calculating the second contrast paragraph and a second to-be-processed paragraph by using a dynamic programming algorithm to obtain similar paragraph parameters, wherein the second contrast paragraph is a paragraph except the first contrast paragraph in the plurality of target contrast paragraphs, and the second to-be-processed paragraph is a paragraph except the first to-be-processed paragraph in the plurality of to-be-processed paragraphs;
calculating text similarity of the text to be processed and the target contrast text according to the same paragraph parameters and the similar paragraph parameters;
and repeating the steps until the text similarity between each comparison text and the text to be processed is determined.
Optionally, the text database server further stores a text length of each comparison text, each comparison text includes a corresponding mark and a corresponding comparison prefix, and the method further includes:
acquiring a text to be processed and determining the text length and the mark of the text to be processed, wherein the text to be processed comprises a prefix to be processed;
constructing a text processing list according to the text length of the text to be processed and the text length of each comparison text, wherein the mark of the text to be processed and the marks of the comparison texts are sorted in the text processing list by text length;
respectively determining a first mark and a second mark from the text processing list, wherein the text length of a contrast text corresponding to the first mark is greater than the text length of a text to be processed, and the text length of the contrast text corresponding to the second mark is less than the text length of the text to be processed;
determining a first similar text from a contrast text corresponding to the first mark according to the text to be processed and the prefix to be processed;
and determining a second similar text from the contrast texts corresponding to the second marks according to the text to be processed and the contrast prefixes of the contrast texts corresponding to the second marks.
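The list construction in the steps above can be sketched as follows. Only the length-sorted list and the split into first marks (longer texts) and second marks (shorter texts) are shown; the later prefix-based selection of similar texts is not specified in enough detail here to illustrate. All names (`build_processing_list`, `split_marks`, the example marks) are hypothetical.

```python
def build_processing_list(pending_mark, pending_len, comparison_lengths):
    """Return all marks sorted by text length, plus the mark -> length mapping."""
    lengths = dict(comparison_lengths)
    lengths[pending_mark] = pending_len
    ordered = sorted(lengths, key=lambda mark: lengths[mark])
    return ordered, lengths


def split_marks(ordered, lengths, pending_mark):
    """Split comparison marks into those longer and those shorter than the pending text."""
    pending_len = lengths[pending_mark]
    first_marks = [m for m in ordered if m != pending_mark and lengths[m] > pending_len]
    second_marks = [m for m in ordered if m != pending_mark and lengths[m] < pending_len]
    return first_marks, second_marks


# Example with made-up lengths: "q" is the text to be processed.
ordered, lengths = build_processing_list("q", 100, {"a": 120, "b": 80, "c": 300})
first_marks, second_marks = split_marks(ordered, lengths, "q")
```

Sorting once by length means the longer and shorter candidates sit on either side of the text to be processed in the list, so each group can be handled with its own prefix rule.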
Optionally, each contrast text includes a corresponding contrast prefix, and the method further includes:
acquiring a comparison prefix of a text to be processed;
confirming an undetermined contrast text from the plurality of contrast texts, wherein the contrast prefix of the undetermined contrast text is consistent with the contrast prefix of the text to be processed;
and comparing the text to be processed with each undetermined contrast text word by word until an identical text that is completely the same as the text to be processed is obtained.
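A minimal sketch of this prefix filter followed by a full comparison is given below. The prefix length of 8 and the names `find_identical` and `prefix_len` are assumptions for illustration; the text does not state how the comparison prefix is defined.

```python
def find_identical(pending, comparisons, prefix_len=8):
    """Return the mark of a comparison text identical to `pending`, or None.

    `comparisons` maps a mark to the full comparison text.
    """
    prefix = pending[:prefix_len]
    # Cheap filter: keep only texts whose prefix matches the pending prefix.
    undetermined = {m: t for m, t in comparisons.items() if t[:prefix_len] == prefix}
    # Full comparison, unit by unit, on the remaining candidates only.
    for mark, text in undetermined.items():
        if len(text) == len(pending) and all(x == y for x, y in zip(text, pending)):
            return mark
    return None
```

The prefix check discards most candidates without reading them in full, so the expensive comparison runs only on texts that could plausibly be identical.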
Optionally, the method further comprises:
acquiring a text to be compared, and performing segmentation processing on the text to be compared to obtain a plurality of paragraphs to be compared;
comparing the digital fingerprint of each paragraph to be processed with the digital fingerprint of each paragraph to be compared to obtain original paragraphs to be processed and original paragraphs to be compared, wherein the digital fingerprints of an original paragraph to be processed and its original paragraph to be compared are the same, and the original paragraphs to be processed and the original paragraphs to be compared are in one-to-one correspondence;
performing sentence segmentation processing on the original paragraphs to be processed to obtain a plurality of original sentence segments to be processed;
carrying out sentence segmentation processing on the original paragraphs to be compared to obtain a plurality of original sentence segments to be compared;
configuring labels for each original sentence segment to be processed and each original sentence segment to be compared;
and, in response to a comparison operation, determining modification operations between the target original sentence segment to be processed and the target original sentence segment to be compared according to the labels, wherein the target original sentence segment to be processed is any original sentence segment to be processed in any original paragraph to be processed in the plurality of original paragraphs to be processed, and the target original sentence segment to be compared is any original sentence segment to be compared in the target original paragraph to be compared corresponding to the target original paragraph to be processed.
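The sentence segmentation, labelling, and modification lookup above might be sketched as follows. The punctuation set used for splitting, the `(paragraph index, sentence index)` label format, and the use of Python's standard `difflib.SequenceMatcher` to derive modification operations are all assumptions for illustration; the text does not specify them.

```python
import difflib
import re


def split_sentences(paragraph):
    """Split a paragraph into sentence segments at common end punctuation."""
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", paragraph)
    return [p.strip() for p in parts if p.strip()]


def label_sentences(paragraphs):
    """Map a (paragraph index, sentence index) label to each sentence segment."""
    labelled = {}
    for p, paragraph in enumerate(paragraphs):
        for s, sentence in enumerate(split_sentences(paragraph)):
            labelled[(p, s)] = sentence
    return labelled


def modifications(processed, compared):
    """List (operation, old, new) tuples describing how the two segments differ."""
    ops = difflib.SequenceMatcher(None, processed, compared).get_opcodes()
    return [(tag, processed[i1:i2], compared[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]
```

With labels in place, a comparison request only needs the label of the target segment to fetch both sides and compute their differences.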
Optionally, before the step of obtaining the text to be processed, the method further includes:
if an initialization request from the text to be processed is obtained, initializing an authorization component according to the initialization request, wherein the authorization component is used for determining an authorization result corresponding to the information to be verified;
when the authorization component completes initialization, initializing the text query item;
when the text query item is initialized, displaying an initialization result, wherein the initialization result is used for indicating a text to be processed to send a text query instruction by calling an interface of the text query item;
when a text query instruction for a text to be processed is obtained, obtaining information to be verified according to the text query instruction, wherein the information to be verified is generated according to a preset vector and a preset knowledge graph, the preset knowledge graph is obtained after a second encryption rule is adopted to perform correlation operation on the preset vector, the preset vector is obtained after a first encryption rule is adopted to encrypt a vector to be configured of an encryption element, the vector to be configured of the encryption element is obtained after vectorization is performed on the pre-constructed encryption element, and the pre-constructed encryption element meets legal configuration conditions;
acquiring a first encryption rule and a second encryption rule;
analyzing the information to be verified to obtain a preset vector and a preset knowledge graph;
performing association operation on the preset vector by adopting a second encryption rule to obtain a preset knowledge graph to be verified;
if the preset knowledge graph to be verified is matched with the preset knowledge graph consistently, decrypting the preset vector by adopting a first encryption rule to obtain an encrypted element vector to be configured;
decoding the vector to be configured of the encrypted element to obtain the encrypted element, wherein the encrypted element comprises a user identifier and an encryption time limit, the user identifier is used for determining the user identity of the text query item, and the encryption time limit is used for determining the starting time and the ending time of the information to be verified;
acquiring an instruction trigger node, a to-be-verified user identifier and a to-be-verified text query item identifier, wherein the encryption element further comprises the text query item identifier, the to-be-verified text query item identifier and the text query item have a corresponding relation, the instruction trigger node is used for acquiring the time corresponding to the text query instruction, and the to-be-verified user identifier is determined according to the to-be-processed text;
if the instruction triggering node does not exceed the encryption time limit, the user identification to be verified is matched with the user identification in a consistent manner, and the text query item identification to be verified is matched with the text query item identification in a consistent manner, determining an authorization result corresponding to the information to be verified as a first authorization result, wherein the first authorization result represents that the information to be verified is verified successfully;
if the instruction triggering node exceeds the encryption time limit, or the user identifier to be verified is not matched with the user identifier, or the text query item identifier to be verified is not matched with the text query item identifier, determining the authorization result corresponding to the information to be verified as a second authorization result, wherein the second authorization result represents that the information to be verified fails to be verified;
and if the authorization result is used for indicating that the information to be verified is successfully verified, starting a calling function of the text query item aiming at the text to be processed.
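The final authorization decision in the steps above reduces to three checks, sketched below. The decryption of the preset vector and the knowledge graph matching are omitted, and the class `EncryptionElement`, its field names, and the function `authorize` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncryptionElement:
    """Decoded encryption element (field names are assumptions)."""
    user_id: str
    query_item_id: str
    start_time: float  # start of the encryption time limit
    end_time: float    # end of the encryption time limit


def authorize(element, trigger_time, user_id, query_item_id):
    """Return True for the first authorization result (success), False for the second."""
    within_limit = element.start_time <= trigger_time <= element.end_time
    return (within_limit
            and user_id == element.user_id
            and query_item_id == element.query_item_id)
```

The calling function of the text query item would be started only when `authorize` returns True.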
Optionally, obtaining the first encryption rule includes:
acquiring a first private key and an encrypted private key;
decrypting the encrypted private key by using the first private key to obtain a first encryption rule;
obtaining a second encryption rule comprising:
acquiring a second private key and encrypted public key information;
and decrypting the encrypted public key information by adopting a second private key to obtain a second encryption rule.
Optionally, when a text query instruction for a text to be processed is obtained, obtaining information to be verified according to the text query instruction includes:
when a text query instruction aiming at a text to be processed sent by a terminal device is received, acquiring information to be verified according to the text query instruction, wherein if an authorization result is used for indicating that the information to be verified is successfully verified, a calling function of text query items is started aiming at the text to be processed, and the method comprises the following steps:
if the authorization result is used for indicating that the information to be verified is successfully verified, starting a calling function of the text query item aiming at the text to be processed, or sending the authorization result to the terminal equipment so that the terminal equipment starts the calling function of the text query item aiming at the text to be processed;
the method further comprises the following steps:
and if the authorization result is used for indicating that the to-be-verified information is failed to be verified, refusing to call the text query item aiming at the to-be-processed text, or sending the authorization result to the terminal equipment so that the terminal equipment refuses to call the text query item aiming at the to-be-processed text.
The second aspect of the present invention provides a similar text searching apparatus for sentence-by-sentence comparison, which is applied to a computer device, wherein the computer device is in communication connection with a text database server, and the text database server stores a plurality of comparison texts;
the device comprises:
the acquisition module is used for acquiring a text to be processed;
the segmentation module is used for segmenting the text to be processed and the target contrast text based on a preset separator to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text, wherein the target contrast text is any one of the plurality of contrast texts;
the calculation module is used for calculating and obtaining the digital fingerprints of each paragraph to be processed and each target contrast paragraph; determining a first to-be-processed paragraph from the plurality of to-be-processed paragraphs and a first comparison paragraph from the plurality of target comparison paragraphs, and counting to obtain the same paragraph parameter, wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph; when the proportion of second contrast paragraphs among the plurality of target contrast paragraphs does not exceed a preset proportion, calculating the second contrast paragraph and a second to-be-processed paragraph by using a dynamic programming algorithm to obtain similar paragraph parameters, wherein the second contrast paragraph is a paragraph except the first contrast paragraph in the plurality of target contrast paragraphs, and the second to-be-processed paragraph is a paragraph except the first to-be-processed paragraph in the plurality of to-be-processed paragraphs; calculating text similarity of the text to be processed and the target contrast text according to the same paragraph parameters and the similar paragraph parameters;
and the determining module is used for repeating the steps until the text similarity between each contrast text and the text to be processed is determined.
Optionally, the text database server further stores a text length of each comparison text, each comparison text includes a corresponding mark and a corresponding comparison prefix, and the determining module is further configured to:
acquiring a text to be processed and determining the text length and the mark of the text to be processed, wherein the text to be processed comprises a prefix to be processed; constructing a text processing list according to the text length of the text to be processed and the text length of each comparison text, wherein the mark of the text to be processed and the marks of the comparison texts are sorted in the text processing list by text length; respectively determining a first mark and a second mark from the text processing list, wherein the text length of a contrast text corresponding to the first mark is greater than the text length of the text to be processed, and the text length of the contrast text corresponding to the second mark is less than the text length of the text to be processed; determining a first similar text from a contrast text corresponding to the first mark according to the text to be processed and the prefix to be processed; and determining a second similar text from the contrast texts corresponding to the second marks according to the text to be processed and the contrast prefixes of the contrast texts corresponding to the second marks.
Optionally, each contrast text includes a corresponding contrast prefix, and the determining module is further configured to:
acquiring a comparison prefix of a text to be processed; confirming an undetermined contrast text from the plurality of contrast texts, wherein the contrast prefix of the undetermined contrast text is consistent with the contrast prefix of the text to be processed; and comparing the text to be processed with each undetermined contrast text word by word until an identical text that is completely the same as the text to be processed is obtained.
Compared with the prior art, the beneficial effects provided by the invention include the following. With the similar text searching method and device for sentence-by-sentence comparison provided by the embodiments of the invention, a text to be processed is obtained; the text to be processed and a target contrast text are segmented based on preset separators to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text, wherein the target contrast text is any one of the plurality of contrast texts; the digital fingerprints of each paragraph to be processed and each target contrast paragraph are calculated; a first to-be-processed paragraph is determined from the plurality of to-be-processed paragraphs and a first comparison paragraph is determined from the plurality of target comparison paragraphs, and the same paragraph parameter is obtained by counting, wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph; when the proportion of second contrast paragraphs among the plurality of target contrast paragraphs does not exceed a preset proportion, similar paragraph parameters are calculated for the second contrast paragraph and a second to-be-processed paragraph by using a dynamic programming algorithm, wherein the second contrast paragraph is a paragraph except the first contrast paragraph in the plurality of target contrast paragraphs, and the second to-be-processed paragraph is a paragraph except the first to-be-processed paragraph in the plurality of to-be-processed paragraphs; the text similarity of the text to be processed and the target contrast text is then calculated according to the same paragraph parameters and the similar paragraph parameters; finally, the above steps are repeated until the text similarity between each comparison text and the text to be processed is determined. Through these steps, sentence segments are used as units, preliminary processing is carried out with digital fingerprints, and precise processing is then carried out with a dynamic programming algorithm, so that searching and comparison remain efficient while the resulting similarity is more accurate.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments will be briefly described below. It is appreciated that the following drawings depict only certain embodiments of the invention and are therefore not to be considered limiting of its scope. For a person skilled in the art, it is possible to derive other relevant figures from these figures without inventive effort.
Fig. 1 is an interaction diagram of a similar text searching system for sentence-by-sentence comparison according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart illustrating steps of a similar text searching method for sentence-by-sentence comparison according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an example of an alternative matrix provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of an alternative proof-reading backtracking process according to an embodiment of the present invention;
Fig. 5 is a block diagram schematically illustrating a structure of a similar text searching apparatus for sentence-by-sentence comparison according to an embodiment of the present invention;
Fig. 6 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it is also to be noted that, unless otherwise explicitly stated or limited, the terms "disposed" and "connected" are to be interpreted broadly, and for example, "connected" may be a fixed connection, a detachable connection, or an integral connection; can be mechanically or electrically connected; the connection may be direct or indirect via an intermediate medium, and may be a communication between the two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Fig. 1 is an interaction diagram of a similar text searching system for sentence-by-sentence comparison according to an embodiment of the present disclosure. The system may include a computer device 100 and a text database server 200 communicatively coupled to the computer device 100, where the text database server 200 stores a plurality of comparison texts. The system shown in Fig. 1 is only one possible example; in other possible embodiments, the system may include only one of the components shown in Fig. 1 or may include other components.
In this embodiment, the computer device 100 and the text database server 200 may cooperate to perform the similar text searching method for sentence-by-sentence comparison described in the following method embodiment; for the specific steps executed by the computer device 100 and the text database server 200, refer to the detailed description of that method embodiment.
To solve the technical problem described in the Background, Fig. 2 is a schematic flow chart of a similar text searching method for sentence-by-sentence comparison according to an embodiment of the present disclosure. The method may be executed by the computer device 100 shown in Fig. 1 and is described in detail below.
Step 201, obtaining a text to be processed.
In the embodiment of the present invention, the text to be processed may refer to a text that the user wants to perform duplicate checking or search for a similar text, and may be obtained by the user in advance, or may be selected from the text database server 200, which is not limited herein.
Step 202, segmenting the text to be processed and the target contrast text based on the preset separators to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text.
Wherein the target contrast text is any one of the plurality of contrast texts.
As described above, there may be a plurality of contrast texts in the text database server 200. Because these contrast texts are all pre-stored in the text database server 200, they may be segmented in advance, and the segmented contrast texts may then be obtained directly. It should be understood that, in practice, both the text to be processed and the comparison texts are usually divided with conventional separators for ease of reading; in the embodiment of the present invention, the preset separators serve as the basis for segmenting the text to be processed and the comparison texts.
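Segmentation on preset separators can be sketched as below. The default separator (a newline) and the function name `segment` are assumptions; the embodiment does not list which separators are preset.

```python
import re


def segment(text, separators=("\n",)):
    """Split text on any of the preset separators and drop empty pieces."""
    pattern = "|".join(re.escape(sep) for sep in separators)
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]
```

For example, `segment("甲。乙\n丙", ("。", "\n"))` yields three segments. Applying the same function to the stored contrast texts ahead of time matches the pre-segmentation described above.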
Step 203, calculating the digital fingerprints of each paragraph to be processed and each target contrast paragraph.
In the embodiment of the invention, the digital fingerprint is an MD5 (Message-Digest Algorithm 5, a cryptographic hash function) value. The identifier of the text to be processed and of each of the other texts may be stored together with the list of MD5 values of its sentence segments in the form of key-value pairs. Because the number of texts is large and the texts are long, multi-threaded parallel processing is adopted to reduce the overall time consumption.
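As an illustration of step 203, the following sketch computes MD5 values for the sentence segments of several texts in parallel and stores them as key-value pairs (text identifier to list of MD5 values). The function names, the worker count, and the use of a thread pool from the Python standard library are assumptions made for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_fingerprint(segment):
    """Return the MD5 value of one sentence segment as a hex string."""
    return hashlib.md5(segment.encode("utf-8")).hexdigest()


def fingerprint_texts(texts, max_workers=4):
    """texts maps a text identifier to its list of sentence segments.

    Returns a dict mapping each identifier to the list of MD5 values.
    """
    def fingerprint_one(item):
        text_id, segments = item
        return text_id, [md5_fingerprint(s) for s in segments]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fingerprint_one, texts.items()))


table = fingerprint_texts({"doc-a": ["hello", "world"], "doc-b": ["hello"]})
```

Two segments with the same content always receive the same MD5 value, which is what allows the later steps to match segments by comparing short fixed-length strings instead of the full text.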
Step 204, determining a first to-be-processed paragraph from the plurality of to-be-processed paragraphs, determining a first comparison paragraph from the plurality of target comparison paragraphs, and counting to obtain the same paragraph parameter.
Wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph.
Optionally, when the text to be processed is compared with each contrast text, each sentence segment of the text to be processed is checked against each sentence segment of the contrast text to determine whether the two MD5 values are equal. If they are equal, the number of words of the sentence segment is accumulated into the same-text-length variable; if not, it is accumulated into the different-text-length variable.
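The accumulation described above might look like the sketch below. Two simplifications are assumptions of the example and not part of the description: the number of words is approximated by the character count `len(segment)`, and the MD5 values of the contrast text are placed in a set so that each segment is looked up once rather than traversing every contrast segment.

```python
import hashlib


def md5_hex(segment):
    return hashlib.md5(segment.encode("utf-8")).hexdigest()


def count_same_and_different(pending_segments, contrast_segments):
    """Return (same_length, different_length) for the text to be processed."""
    contrast_hashes = {md5_hex(s) for s in contrast_segments}
    same_length = 0
    different_length = 0
    for segment in pending_segments:
        if md5_hex(segment) in contrast_hashes:
            same_length += len(segment)       # identical segment found
        else:
            different_length += len(segment)  # no identical segment
    return same_length, different_length


same, different = count_same_and_different(
    ["abc", "de", "fghi"], ["de", "xyz", "abc"]
)
```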
Step 205, when the proportion of the second contrast paragraphs among the plurality of target contrast paragraphs does not exceed a preset proportion, calculating the second contrast paragraph and the second to-be-processed paragraph by using a dynamic programming algorithm to obtain a similar paragraph parameter.
The second contrast paragraph is a paragraph of the plurality of target contrast paragraphs except the first contrast paragraph, and the second to-be-processed paragraph is a paragraph of the plurality of to-be-processed paragraphs except the first to-be-processed paragraph.
Because the established similarity standard is usually high, fully comparing a contrast text that comes close to the standard but does not reach it takes a long time and is unnecessary. Therefore, during comparison, the lengths of differing sentence segments are accumulated alongside the lengths of identical sentence segments, and after each accumulation it is judged whether the proportion of differing text has reached the dissimilarity ratio corresponding to the established similarity. This avoids unnecessary comparison and greatly reduces the time spent on documents whose similarity does not reach the established standard.
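The early exit can be sketched as follows, under the assumption that the dissimilarity ratio is simply one minus the required similarity and that lengths are measured against the total length of the text to be processed. The function name and the `(md5, length)` input layout are hypothetical.

```python
def stops_early(pending, contrast_hashes, required_similarity):
    """Return True as soon as the differing length rules out the standard.

    pending: list of (md5_hex, length) pairs for the text to be processed.
    contrast_hashes: set of MD5 values of the target contrast paragraphs.
    """
    total = sum(length for _, length in pending)
    allowed_difference = (1.0 - required_similarity) * total
    different_length = 0
    for digest, length in pending:
        if digest not in contrast_hashes:
            different_length += length
            # Checked after every accumulation, as described above.
            if different_length > allowed_difference:
                return True
    return False


pending = [("a", 40), ("b", 30), ("c", 30)]
skip = stops_early(pending, {"a"}, required_similarity=0.8)
```

When the function returns True, the more expensive dynamic programming comparison of the remaining paragraphs can be skipped for this contrast text.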
Optionally, for the remaining sentence segments (i.e., the second contrast paragraphs), a dynamic programming method is adopted: the highest alignment score of the two character sequences is calculated using a replacement matrix over the characters of the whole sentences, and the number of words of similar continuous text in the two sentence segments is obtained by this sequence alignment method and accumulated as the similar part of the two sentence segments.
To explain the scheme of the present invention more clearly, refer to fig. 3, which takes the two text strings AGCTAGCT and AGTCTGCAT as an example: a replacement matrix is calculated by the recursive formula of sequence alignment. The trace-back process from the lower right corner of the replacement matrix is shown in fig. 4, where the bold positions mark the discontinuous longest common substring of the two text strings, i.e., GCTGCT.
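A minimal sketch of such a dynamic programming alignment is given below. The description does not state the scoring used for the replacement matrix of fig. 3, so the sketch assumes the simplest scoring (1 for a matching character, 0 otherwise, no gap penalty), which yields a longest common subsequence; its result therefore need not coincide with the bold positions of fig. 4.

```python
def longest_common_subsequence(a, b):
    """Fill the score matrix, then trace back from the lower right corner."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                score[i][j] = score[i - 1][j - 1] + 1
            else:
                score[i][j] = max(score[i - 1][j], score[i][j - 1])

    # Trace-back: follow the cells that produced the final score.
    i, j = len(a), len(b)
    matched = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            matched.append(a[i - 1])
            i -= 1
            j -= 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(matched))


common = longest_common_subsequence("AGCTAGCT", "AGTCTGCAT")
similar_length = len(common)
```

The length of the returned string is what would be accumulated as the similar part of the two sentence segments.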
Step 206, calculating the text similarity of the text to be processed and the target contrast text according to the same paragraph parameter and the similar paragraph parameter.
On this basis, after the traversal of the sentence segments is finished, the proportion of the same text length to the whole text length is calculated to obtain the similarity ratio of the two text documents. The similarity ratio is then compared with the predetermined similarity standard; if it exceeds the standard, the similarity standard is determined to be met.
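As a small worked example of step 206, the sketch below combines the same paragraph parameter and the similar paragraph parameter into one ratio. How exactly the two parameters are combined is not spelled out in the text; adding them and dividing by the whole text length is an assumption of the example.

```python
def text_similarity(same_length, similar_length, total_length):
    """Share of the whole text that is identical or similar."""
    if total_length == 0:
        return 0.0
    return (same_length + similar_length) / total_length


def meets_standard(similarity, standard):
    """The description says the standard is met when it is exceeded."""
    return similarity > standard


ratio = text_similarity(same_length=60, similar_length=20, total_length=100)
ok = meets_standard(ratio, standard=0.7)
```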
Step 207, repeating the above steps until the text similarity between each contrast text and the text to be processed is determined.
Through the above steps, the contrast texts highly similar to the text to be processed can be determined from the text database server 200. When searching for similar texts, the MD5 value generated from a sentence segment is used as the unique identifier of that segment, and matching is carried out in units of sentence segments. This effectively alleviates the time-consumption problem caused by the large number and length of texts in the prior art and yields a better result.
On the basis of the foregoing, the text database server 200 further stores the text length of each contrast text, and each contrast text includes a corresponding mark and a corresponding contrast prefix. The method further provides the following example.
Step 301, obtaining a text to be processed and determining a text length and a mark of the text to be processed, wherein the text to be processed includes a prefix to be processed.
Step 302, a text processing list is constructed according to the text length of the text to be processed and the text length of each comparison text.
The marks of the text to be processed and of each contrast text are sorted in the text processing list according to text length.
Step 303, determining a first mark and a second mark from the text processing list respectively.
The text length of the contrast text corresponding to the first mark is larger than that of the text to be processed, and the text length of the contrast text corresponding to the second mark is smaller than that of the text to be processed.
Step 304, determining a first similar text from the contrast text corresponding to the first mark according to the text to be processed and the prefix to be processed.
Step 305, determining a second similar text from the contrast text corresponding to the second mark according to the text to be processed and the contrast prefix of the contrast text corresponding to the second mark.
For each text document for which an inclusion relationship is to be searched, the position of the current text document in the length-sorted text list is found by binary search. The list is thereby divided into two parts, texts longer than the current text and texts shorter than it, in which the texts containing the current text and the texts contained by the current text are searched respectively. This reduces the time spent comparing text lengths while traversing the text documents to be searched.
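The binary search over the length-sorted list can be sketched with the `bisect` module of the Python standard library. The `(length, mark)` layout of the list entries and the function name are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def split_by_length(sorted_entries, pending_length):
    """sorted_entries: list of (text_length, mark), ascending by length.

    Returns (shorter_marks, longer_marks) relative to pending_length.
    """
    lengths = [length for length, _ in sorted_entries]
    low = bisect_left(lengths, pending_length)    # first entry >= length
    high = bisect_right(lengths, pending_length)  # first entry > length
    shorter = [mark for _, mark in sorted_entries[:low]]
    longer = [mark for _, mark in sorted_entries[high:]]
    return shorter, longer


entries = [(10, "a"), (20, "b"), (20, "c"), (35, "d")]
shorter, longer = split_by_length(entries, 25)
```

Only the texts in `longer` can contain the current text, and only the texts in `shorter` can be contained by it, so neither group needs a per-text length check during the later traversal.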
On the basis of the foregoing, each contrast text includes a corresponding contrast prefix, and the method further provides the following embodiments.
Step 401, obtaining a comparison prefix of a text to be processed.
Step 402, confirming a pending contrast text from a plurality of contrast texts.
The contrast prefix of the pending contrast text is consistent with the contrast prefix of the text to be processed.
Step 403, comparing the text to be processed with each pending contrast text word by word until an identical text that is completely the same as the text to be processed is obtained.
In addition to similar texts, more serious plagiarism may occur in practice, and the above scheme can determine whether a contrast text completely identical to the text to be processed exists. Optionally, a prefix is taken from the text to be processed and from each contrast text. When searching, the prefixes are compared first, and only texts with the same prefix are then compared word by word to obtain a text document with identical content. That is, preprocessing stores each text prefix together with the unique identifiers of the texts having that prefix in the form of a key-value pair, and when an identical document is searched for, the key-value pairs are traversed first to look up the prefix of the document. If no key for the document prefix exists, no text document completely identical to the text to be processed exists among the plurality of contrast texts. If the key exists, every text document that may be identical to the text to be processed is in the value list corresponding to that key, and each text document in the list is compared in turn with the body of the current text document to obtain the identical text document.
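The prefix lookup described above can be sketched as follows. The prefix length of 16 characters and all names are hypothetical, and a dictionary lookup is used in place of an explicit traversal of the key-value pairs.

```python
from collections import defaultdict

PREFIX_LENGTH = 16  # hypothetical; the description does not give a length


def build_prefix_index(contrast_texts, prefix_length=PREFIX_LENGTH):
    """Map each prefix to the identifiers of the texts that start with it."""
    index = defaultdict(list)
    for text_id, body in contrast_texts.items():
        index[body[:prefix_length]].append(text_id)
    return index


def find_identical(pending, contrast_texts, index, prefix_length=PREFIX_LENGTH):
    """Compare full bodies only for texts sharing the pending text's prefix."""
    candidates = index.get(pending[:prefix_length], [])
    return [tid for tid in candidates if contrast_texts[tid] == pending]


contrast_texts = {
    "t1": "The quick brown fox jumps over the lazy dog.",
    "t2": "The quick brown fox sleeps.",
    "t3": "An unrelated document.",
}
index = build_prefix_index(contrast_texts)
matches = find_identical("The quick brown fox sleeps.", contrast_texts, index)
```

Here `t1` and `t2` share the same 16-character prefix, so both are candidates, but only `t2` survives the full comparison; `t3` is never compared at all.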
In addition to the foregoing, in the examples of the present invention, the following embodiments are also provided.
Step 501, obtaining a text to be compared, and performing segmentation processing on the text to be compared to obtain a plurality of paragraphs to be compared.
Step 502, comparing the digital fingerprint of each paragraph to be processed with the digital fingerprint of each paragraph to be compared to obtain an original paragraph to be processed and an original paragraph to be compared.
The original paragraphs to be processed and the original paragraphs to be compared have the same digital fingerprints, and the original paragraphs to be processed and the original paragraphs to be compared correspond to each other one by one.
Step 503, performing sentence segmentation processing on the original paragraphs to be processed to obtain a plurality of original sentence fragments to be processed.
Step 504, sentence dividing processing is performed on the original paragraphs to be compared to obtain a plurality of original sentence segments to be compared.
Step 505, configuring labels for each original sentence segment to be processed and each original sentence segment to be compared.
Step 506, in response to the comparison operation, determining a modification operation between the target original sentence segment to be processed and the target original sentence segment to be compared according to the label.
The target original sentence segment to be processed is any original sentence segment to be processed in any original paragraph to be processed in the plurality of original paragraphs to be processed, and the target original sentence segment to be compared is any original sentence segment to be compared in the target original paragraph to be compared corresponding to the target original paragraph to be processed.
In addition to the foregoing, embodiments of the present invention provide an example capable of quickly determining the modified parts. Optionally, preprocessing divides the two texts (the text to be processed and the text to be compared) into sentence segments, traverses the sentence segments of the two texts, compares their MD5 values, and marks the sentence segments that appear repeatedly, each repetition being recorded only once. When the sentence segments of the two texts are then traversed again in order, the label indicating whether a segment appears repeatedly determines whether the cursor moves backward during the traversal, and the current sentence segment is classified as unchanged, newly added, deleted, or changed accordingly. When the previous sentence segment is a changed segment, deletion or addition of the next sentence segment cannot be identified during the traversal, so the labels need to be corrected after the traversal is finished.
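One possible reading of this two-cursor traversal is sketched below. It is an interpretation, not the method of the description: "appears repeatedly" is taken to mean that a fingerprint occurs in both texts, the inputs are lists of segment fingerprints, and the correction pass after the traversal is omitted.

```python
def classify_segments(old, new):
    """old, new: lists of segment fingerprints (any hashable values).

    Returns (label, old_item, new_item) triples in traversal order.
    """
    in_old = set(old)
    in_new = set(new)
    result = []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            result.append(("unchanged", old[i], new[j]))
            i += 1
            j += 1
        elif old[i] not in in_new and new[j] not in in_old:
            # Neither side occurs in the other text: treat as a rewrite.
            result.append(("changed", old[i], new[j]))
            i += 1
            j += 1
        elif new[j] not in in_old:
            result.append(("added", None, new[j]))
            j += 1  # only the cursor of the new text moves
        else:
            result.append(("deleted", old[i], None))
            i += 1  # only the cursor of the old text moves
    result.extend(("deleted", item, None) for item in old[i:])
    result.extend(("added", None, item) for item in new[j:])
    return result


labels = classify_segments(["a", "b", "c", "d"], ["a", "x", "c", "e", "d"])
```

In this example `b` and `x` are paired as a changed segment, `e` is reported as added, and `a`, `c`, `d` are unchanged.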
As an alternative embodiment, before the foregoing step 201, the following detailed description is included in the embodiments of the present invention.
Step 601, if an initialization request from the text to be processed is obtained, initializing the authorization component according to the initialization request.
The authorization component is used for determining an authorization result corresponding to the information to be verified.
It should be understood that text similarity queries are mostly applied in fields such as paper duplicate checking and academic research, and therefore generally involve permissions, that is, not everyone may use the scheme. On this basis, whether the subsequent operations are performed can be determined before the text to be processed is obtained and processed. In the embodiment of the present invention, the initialization request may be made by the user in advance, before the user queries the text to be processed.
Step 602, when the authorization component has completed initialization, perform initialization processing on the text query item.
After the associated authorization component completes initialization, the text query entry may be further initialized.
Step 603, when the text query item has completed initialization, displaying the initialization result.
The initialization result is used to instruct the text to be processed to send a text query instruction through the interface that calls the text query item.
After the text query item is initialized, the initialization result can be displayed, and when the initialization result is normal, the subsequent operation can be considered to be performed normally.
In step 604, when a text query instruction for the text to be processed is obtained, the information to be verified is obtained according to the text query instruction.
The information to be verified is generated according to a preset vector and a preset knowledge graph. The preset knowledge graph is obtained by performing an association operation on the preset vector using a second encryption rule; the preset vector is obtained by encrypting the vector to be configured of the encryption element using a first encryption rule; the vector to be configured of the encryption element is obtained by vectorizing a pre-constructed encryption element; and the pre-constructed encryption element meets legal configuration conditions.
In the embodiment of the present invention, the text query instruction may be issued by a user who wants to perform operations such as duplicate checking on the text to be processed, and the information to be verified may refer to the information related to the authority of the user.
Step 605, obtain the first encryption rule and the second encryption rule.
In the embodiment of the invention, the first encryption rule and the second encryption rule are both used to keep the user's authorization and private information confidential and to ensure the security of the user's authority.
Step 606, analyzing the information to be verified to obtain the preset vector and the preset knowledge graph.
Step 607, performing the association operation on the preset vector by using the second encryption rule to obtain a preset knowledge graph to be verified.
Step 608, if the preset knowledge graph to be verified is matched with the preset knowledge graph consistently, decrypting the preset vector by using the first encryption rule to obtain the encrypted element to-be-configured vector.
Through the above steps, when the preset knowledge graph to be verified matches the preset knowledge graph, the user's request can be considered to preliminarily meet the conditions, so the preset vector can be decrypted using the first encryption rule to obtain the vector to be configured of the encrypted element.
Step 609, decoding the vector to be configured of the encrypted element to obtain the encrypted element.
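The recompute-and-compare check of steps 607 and 608 can be illustrated as follows. The description does not define the "association operation" or the form of the "second encryption rule", so the sketch substitutes a keyed hash (HMAC-SHA256 from the Python standard library) purely as a stand-in; the names `associate`, `graph_matches`, and `second_rule_key` are hypothetical.

```python
import hashlib
import hmac


def associate(preset_vector: bytes, second_rule_key: bytes) -> bytes:
    """Stand-in for the association operation: a keyed hash of the vector."""
    return hmac.new(second_rule_key, preset_vector, hashlib.sha256).digest()


def graph_matches(preset_vector: bytes, preset_graph: bytes,
                  second_rule_key: bytes) -> bool:
    """Recompute the value to be verified and compare it with the received one."""
    to_verify = associate(preset_vector, second_rule_key)
    return hmac.compare_digest(to_verify, preset_graph)


key = b"hypothetical-second-rule"
vector = b"encrypted-preset-vector"
graph = associate(vector, key)            # produced when the info is issued
accepted = graph_matches(vector, graph, key)
```

Only when the comparison succeeds would the preset vector be decrypted with the first encryption rule; a modified vector or a wrong rule leads to a mismatch and therefore to the second authorization result.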
The encryption element comprises a user identifier and an encryption time limit, wherein the user identifier is used for determining the user identity of the text query item, and the encryption time limit is used for determining the starting time and the ending time of the information to be verified.
After the vector to be configured of the encrypted element is decoded, a user identifier and an encryption time limit can be obtained, the user identifier may specifically refer to identity information of a user, including a registration ID, and the like, and the encryption time limit may be considered as an effective time limit of the information to be verified, that is, the effective time limit includes a start time and an end time of the information to be verified.
Step 610, acquiring an instruction trigger node, a user identifier to be verified and a text query item identifier to be verified.
The encryption element further comprises a text query item identifier, the text query item identifier to be verified and the text query item have a corresponding relation, the instruction trigger node is time corresponding to the text query instruction, and the user identifier to be verified is determined according to the text to be processed.
In the embodiment of the present invention, the instruction trigger node may represent the current time, and the user identifier to be verified may be associated with the text to be processed, for example, the identity information of the author of the text or of the author's supervisor, which is not limited herein.
In step 611, if the instruction trigger node does not exceed the encryption time limit, the to-be-verified user identifier matches the user identifier consistently, and the to-be-verified text query item identifier matches the text query item identifier consistently, it is determined that the authorization result corresponding to the to-be-verified information is the first authorization result.
And the first authorization result represents that the information to be verified is verified successfully.
Step 612, if the instruction trigger node exceeds the encryption time limit, or the to-be-verified user identifier does not match the user identifier, or the to-be-verified text query item identifier does not match the text query item identifier, determining that the authorization result corresponding to the to-be-verified information is a second authorization result.
And the second authorization result represents that the information to be verified fails to be verified.
Through the steps, the success or the failure of the authorization result corresponding to the information to be verified can be determined.
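The decision of steps 611 and 612 can be sketched as follows. The class and field names are assumptions; `True` stands for the first authorization result and `False` for the second.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EncryptionElement:
    user_id: str
    query_item_id: str
    start_time: datetime  # start of the encryption time limit
    end_time: datetime    # end of the encryption time limit


def authorize(element, trigger_time, pending_user_id, pending_query_item_id):
    """Return True (first result) only if all three checks pass."""
    within_limit = element.start_time <= trigger_time <= element.end_time
    return (
        within_limit
        and pending_user_id == element.user_id
        and pending_query_item_id == element.query_item_id
    )


element = EncryptionElement(
    "user-1", "query-1", datetime(2020, 1, 1), datetime(2020, 12, 31)
)
granted = authorize(element, datetime(2020, 6, 1), "user-1", "query-1")
```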
Step 613, if the authorization result is used to indicate that the information to be verified is successfully verified, starting a function for calling the text query item for the text to be processed.
Once the information to be verified is successfully verified, the calling function of the text query item can be started for the text to be processed, that is, execution of the foregoing step 201 begins. Other embodiments of the present invention provide a scheme for determining the authorization result of the information to be verified: if the instruction trigger node does not exceed the encryption time limit and the to-be-verified user identifier matches the user identifier, the authorization result corresponding to the to-be-verified information is determined to be a first authorization result, which indicates that the to-be-verified information is successfully verified; if the instruction trigger node exceeds the encryption time limit, or the to-be-verified user identifier does not match the user identifier, the authorization result corresponding to the to-be-verified information is determined to be a second authorization result, which indicates that the to-be-verified information fails verification.
Through the above steps, the duplicate-checking steps for the text to be processed are carried out only when authorization passes, which improves the security of the scheme and addresses the problem that the proposed scheme could easily be cracked in subsequent use.
On the basis of the foregoing, in order to more clearly explain the scheme provided by the present invention, the foregoing step 605 includes the following embodiments.
Sub-step 605-1, the first private key and the encrypted private key are obtained.
Substep 605-2, the encrypted private key is decrypted by the first private key to obtain a first encryption rule.
Accordingly, the foregoing step 605 also includes the following embodiments.
Sub-step 605-3, the second private key and the encrypted public key information are obtained.
Substep 605-4, decrypting the encrypted public key information by using the second private key to obtain a second encryption rule.
As an alternative embodiment, the foregoing step 604 may be embodied by the following steps.
Sub-step 604-1, when a text query instruction for the text to be processed sent by the terminal device is received, acquiring the information to be verified according to the text query instruction.
Starting the calling function of the text query item for the text to be processed if the authorization result indicates that the information to be verified is successfully verified includes the following:
(1) If the authorization result indicates that the information to be verified is successfully verified, starting the calling function of the text query item for the text to be processed, or sending the authorization result to the terminal device so that the terminal device starts the calling function of the text query item for the text to be processed.
(2) If the authorization result indicates that the information to be verified fails verification, refusing to call the text query item for the text to be processed, or sending the authorization result to the terminal device so that the terminal device refuses to call the text query item for the text to be processed.
On the basis of the foregoing, in order to accurately determine the second authorization result, the embodiment of the present invention further provides the following solution: if the preset knowledge graph to be verified does not match the preset knowledge graph, the authorization result corresponding to the information to be verified is determined to be a second authorization result, which indicates that the information to be verified fails verification. Correspondingly, decrypting the preset vector by using the first encryption rule to obtain the vector to be configured of the encrypted element may be implemented by the following steps: (1) Decrypting the preset vector by using the first encryption rule. (2) If the decryption succeeds, obtaining the vector to be configured of the encrypted element. (3) If the decryption fails, determining that the authorization result corresponding to the information to be verified is a second authorization result.
In order to clearly describe the foregoing steps, the encryption element further includes a type of information to be verified, where the type of information to be verified is a weak correlation type or a strong correlation type. Based on this, the embodiments of the present invention also provide the following solutions, for example.
And if the type of the information to be verified is a weak correlation type, acquiring the first use time of the information to be verified.
And determining the used time according to the instruction trigger node and the first use time.
And if the used time is less than the time threshold corresponding to the weak correlation type, executing a step of determining an authorization result corresponding to the information to be verified according to the matching relationship between the instruction trigger node and the encryption time limit and the matching relationship between the user identifier to be verified and the user identifier.
Correspondingly, if the type of the information to be verified is a strong correlation type, the step of determining the authorization result corresponding to the information to be verified according to the matching relationship between the instruction trigger node and the encryption time limit and the matching relationship between the user identifier to be verified and the user identifier is executed.
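The branch on the type of the information to be verified can be sketched as follows. The string values `"weak"` and `"strong"`, the function name, and the behaviour when a weak-type credential has been in use for too long (the check simply does not proceed) are assumptions, since the description does not state what happens in that case.

```python
from datetime import datetime, timedelta


def may_proceed_to_matching(info_type, trigger_time, first_use_time,
                            weak_threshold):
    """Decide whether the time-limit and identifier matching step is executed."""
    if info_type == "strong":
        return True
    if info_type == "weak":
        used_time = trigger_time - first_use_time
        return used_time < weak_threshold
    raise ValueError("unknown type of information to be verified")


proceed = may_proceed_to_matching(
    "weak",
    trigger_time=datetime(2020, 6, 2),
    first_use_time=datetime(2020, 6, 1),
    weak_threshold=timedelta(days=7),
)
```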
In addition to the foregoing steps, an example is provided in an embodiment of the present invention, where the information to be verified is generated according to a preset vector and a preset knowledge graph, the preset knowledge graph is obtained by performing association operation on the preset vector by using a second encryption rule, the preset vector is obtained by encrypting a vector to be configured of an encryption element by using a first encryption rule, and the vector to be configured of the encryption element is obtained by vectorizing the encryption element.
An embodiment of the present invention provides a similar text searching apparatus 110 for sentence-by-sentence comparison, which is applied to a computer device 100, wherein the computer device 100 is in communication connection with a text database server 200, the text database server 200 stores a plurality of comparison texts, and referring to fig. 5, the similar text searching apparatus 110 for sentence-by-sentence comparison includes:
an obtaining module 1101, configured to obtain a text to be processed.
The segmenting module 1102 is configured to segment the to-be-processed text and the target contrast text based on a preset delimiter to obtain a plurality of to-be-processed paragraphs of the to-be-processed text and a plurality of target contrast paragraphs of the target contrast text, where the target contrast text is any one of the plurality of contrast texts.
A calculating module 1103, configured to calculate digital fingerprints of each to-be-processed paragraph and each target comparison paragraph; determining a first to-be-processed paragraph from the plurality of to-be-processed paragraphs and a first comparison paragraph from the plurality of target comparison paragraphs, and counting to obtain the same paragraph parameter, wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph; under the condition that the second contrast paragraph occupies the plurality of target contrast paragraphs and the occupation ratio does not exceed the preset occupation ratio, calculating the second contrast paragraph and a second to-be-processed paragraph by using a dynamic programming algorithm to obtain similar paragraph parameters, wherein the second contrast paragraph is a paragraph except the first contrast paragraph in the plurality of target contrast paragraphs, and the second to-be-processed paragraph is a paragraph except the first to-be-processed paragraph in the plurality of to-be-processed paragraphs; and calculating the text similarity of the text to be processed and the target contrast text according to the same paragraph parameters and the similar paragraph parameters.
A determining module 1104, configured to repeat the above steps until the text similarity between each contrast text and the text to be processed is determined.
Further, the text database server 200 further stores a text length of each comparison text, each comparison text includes a corresponding tag, each comparison text includes a corresponding comparison prefix, and the determining module 1104 is further configured to:
acquiring a text to be processed and determining the text length and the mark of the text to be processed, wherein the text to be processed comprises a prefix to be processed; constructing a text processing list according to the text length of the text to be processed and the text length of each comparison text, wherein the marks of the text to be processed and the marks of each comparison text are sequenced in the text processing list according to the file lengths; respectively determining a first mark and a second mark from the text processing list, wherein the text length of a contrast text corresponding to the first mark is greater than the text length of a text to be processed, and the text length of the contrast text corresponding to the second mark is less than the text length of the text to be processed; determining a first similar text from a contrast text corresponding to the first mark according to the text to be processed and the prefix to be processed; and determining a second similar text from the contrast texts corresponding to the second marks according to the text to be processed and the contrast prefixes of the contrast texts corresponding to the second marks.
Further, each contrast text includes a corresponding contrast prefix, and the determining module 1104 is further configured to:
acquiring a comparison prefix of a text to be processed; confirming an undetermined contrast text from the plurality of contrast texts, wherein the contrast prefix of the undetermined contrast text is consistent with the contrast prefix of the text to be processed; and comparing the text to be processed with each text to be compared word by word until the same text which is completely the same as the text to be processed is obtained.
Further, the determining module 1104 is further configured to:
acquiring a text to be compared, and performing segmentation processing on the text to be compared to obtain a plurality of paragraphs to be compared; comparing the digital fingerprint of each paragraph to be processed with the digital fingerprint of each paragraph to be compared to obtain an original paragraph to be processed and an original paragraph to be compared, wherein the digital fingerprints of the original paragraph to be processed and the original paragraph to be compared are the same, and the original paragraph to be processed and the original paragraph to be compared are in one-to-one correspondence; performing sentence segmentation processing on the original paragraphs to be processed to obtain a plurality of original sentence segments to be processed; performing sentence segmentation processing on the original paragraphs to be compared to obtain a plurality of original sentence segments to be compared; configuring labels for each original sentence segment to be processed and each original sentence segment to be compared; and in response to a comparison operation, determining a modification operation between the target original sentence segment to be processed and the target original sentence segment to be compared according to the labels, wherein the target original sentence segment to be processed is any original sentence segment to be processed in any original paragraph to be processed in the plurality of original paragraphs to be processed, and the target original sentence segment to be compared is any original sentence segment to be compared in the target original paragraph to be compared corresponding to the target original paragraph to be processed.
Further, the determining module 1104 is further configured to:
if an initialization request from the text to be processed is obtained, initializing an authorization component according to the initialization request, wherein the authorization component is used for determining an authorization result corresponding to the information to be verified;
when the authorization component completes initialization, initializing the text query item;
when the text query item is initialized, displaying an initialization result, wherein the initialization result is used for indicating the text to be processed to send a text query instruction by calling an interface of the text query item;
when a text query instruction for the text to be processed is obtained, obtaining information to be verified according to the text query instruction, wherein the information to be verified is generated according to a preset vector and a preset knowledge graph, the preset knowledge graph is obtained by performing an association operation on the preset vector by adopting a second encryption rule, the preset vector is obtained by encrypting a vector to be configured of an encryption element by adopting a first encryption rule, the vector to be configured of the encryption element is obtained by vectorizing a pre-constructed encryption element, and the pre-constructed encryption element meets legal configuration conditions;
acquiring the first encryption rule and the second encryption rule;
analyzing the information to be verified to obtain the preset vector and the preset knowledge graph;
performing the association operation on the preset vector by adopting the second encryption rule to obtain a preset knowledge graph to be verified;
if the preset knowledge graph to be verified is consistent with the preset knowledge graph, decrypting the preset vector by adopting the first encryption rule to obtain the vector to be configured of the encryption element;
decoding the vector to be configured of the encryption element to obtain the encryption element, wherein the encryption element comprises a user identifier and an encryption time limit, the user identifier is used for determining the user identity of the text query item, and the encryption time limit is used for determining the starting time and the ending time of the information to be verified;
acquiring an instruction trigger node, a to-be-verified user identifier and a to-be-verified text query item identifier, wherein the encryption element further comprises a text query item identifier, the to-be-verified text query item identifier corresponds to the text query item, the instruction trigger node is used for acquiring the time corresponding to the text query instruction, and the to-be-verified user identifier is determined according to the text to be processed;
if the instruction trigger node does not exceed the encryption time limit, the to-be-verified user identifier is consistent with the user identifier, and the to-be-verified text query item identifier is consistent with the text query item identifier, determining that the authorization result corresponding to the information to be verified is a first authorization result, wherein the first authorization result represents that the information to be verified is verified successfully;
if the instruction trigger node exceeds the encryption time limit, or the to-be-verified user identifier does not match the user identifier, or the to-be-verified text query item identifier does not match the text query item identifier, determining that the authorization result corresponding to the information to be verified is a second authorization result, wherein the second authorization result represents that the information to be verified fails to be verified;
and if the authorization result indicates that the information to be verified is successfully verified, starting a calling function of the text query item for the text to be processed.
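The final checks in this flow (encryption time limit, user identifier, text query item identifier) can be illustrated with a minimal Python sketch. The names `EncryptionElement` and `authorize` are hypothetical and do not come from the patent. The sketch assumes the encryption element has already been decrypted and decoded; the first and second encryption rules and the knowledge-graph consistency check are not modeled.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EncryptionElement:
    """Decoded encryption element (field names are illustrative assumptions)."""
    user_id: str
    query_item_id: str
    start_time: datetime  # start of the encryption time limit
    end_time: datetime    # end of the encryption time limit


def authorize(element: EncryptionElement,
              trigger_time: datetime,
              user_id_to_verify: str,
              query_item_id_to_verify: str) -> bool:
    """Return True for the first authorization result (success), False for the second."""
    # The instruction trigger node must fall inside the encryption time limit.
    within_limit = element.start_time <= trigger_time <= element.end_time
    # Both identifiers must match the ones carried by the encryption element.
    return (within_limit
            and user_id_to_verify == element.user_id
            and query_item_id_to_verify == element.query_item_id)


element = EncryptionElement("user-1", "query-item-1",
                            datetime(2020, 11, 1), datetime(2020, 12, 1))
granted = authorize(element, datetime(2020, 11, 20), "user-1", "query-item-1")
expired = authorize(element, datetime(2021, 1, 1), "user-1", "query-item-1")
```

In this sketch a single boolean stands in for the first and second authorization results; a real implementation might return a richer result so the terminal device can report why verification failed.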
Further, the determining module 1104 is specifically configured to:
acquiring a first private key and an encrypted private key; decrypting the encrypted private key by using the first private key to obtain a first encryption rule; acquiring a second private key and encrypted public key information; and decrypting the encrypted public key information by adopting a second private key to obtain a second encryption rule.
Further, the determining module 1104 is specifically configured to: when a text query instruction for the text to be processed is received from a terminal device, acquire the information to be verified according to the text query instruction. In this case, starting the calling function of the text query item for the text to be processed when the authorization result indicates successful verification comprises: if the authorization result indicates that the information to be verified is successfully verified, starting the calling function of the text query item for the text to be processed, or sending the authorization result to the terminal device so that the terminal device starts the calling function of the text query item for the text to be processed.
The determining module 1104 is further specifically configured to:
if the authorization result indicates that verification of the information to be verified has failed, refuse to call the text query item for the text to be processed, or send the authorization result to the terminal device so that the terminal device refuses to call the text query item for the text to be processed.
It should be noted that the implementation principle of the similar text searching apparatus 110 for sentence-by-sentence comparison is the same as that of the similar text searching method for sentence-by-sentence comparison described above and is not repeated here. It should be understood that the division of the above apparatus into modules is only a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software called by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software called by a processing element and the others in hardware. For example, the obtaining module 1101 may be a separately provided processing element, may be integrated into a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code that a processing element of the apparatus calls to execute the functions of the similar text searching apparatus 110 for sentence-by-sentence comparison. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in a processing element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
An embodiment of the present invention provides a computer device 100. The computer device 100 includes a processor and a non-volatile memory storing computer instructions, and when the computer instructions are executed by the processor, the computer device 100 runs the similar text searching apparatus 110 for sentence-by-sentence comparison. FIG. 6 is a block diagram of the computer device 100 according to an embodiment of the present invention. As shown in FIG. 6, the computer device 100 comprises the similar text searching apparatus 110 for sentence-by-sentence comparison, a memory 111, a processor 112 and a communication unit 113.
To facilitate the transfer or interaction of data, the memory 111, the processor 112 and the communication unit 113 are electrically connected to each other, directly or indirectly. For example, these components may be electrically connected to each other via one or more communication buses or signal lines. The similar text searching apparatus 110 for sentence-by-sentence comparison comprises at least one software functional module that can be stored in the memory 111 in the form of software or firmware, or built into the Operating System (OS) of the computer device 100. The processor 112 is used for executing the executable modules stored in the memory 111, for example, the software functional modules and computer programs included in the similar text searching apparatus 110 for sentence-by-sentence comparison.
The embodiment of the present invention provides a readable storage medium, where the readable storage medium includes a computer program, and when the computer program runs, the computer device 100 where the readable storage medium is located is controlled to execute the above similar text searching method for sentence-by-sentence comparison.
In summary, embodiments of the present invention provide a similar text searching method and apparatus for sentence-by-sentence comparison. A text to be processed is first obtained. The text to be processed and a target contrast text are then segmented based on a preset separator to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text, wherein the target contrast text is any one of the plurality of contrast texts. Next, a digital fingerprint is calculated for each paragraph to be processed and each target contrast paragraph. A first to-be-processed paragraph is determined from the plurality of paragraphs to be processed and a first contrast paragraph is determined from the plurality of target contrast paragraphs, wherein the digital fingerprint of the first contrast paragraph is the same as that of the first to-be-processed paragraph, and the same paragraph parameter is obtained by counting. When the proportion of second contrast paragraphs among the plurality of target contrast paragraphs does not exceed a preset proportion, a similar paragraph parameter is calculated for the second contrast paragraphs and the second to-be-processed paragraphs by using a dynamic programming algorithm, wherein a second contrast paragraph is a target contrast paragraph other than the first contrast paragraph, and a second to-be-processed paragraph is a paragraph to be processed other than the first to-be-processed paragraph. The text similarity between the text to be processed and the target contrast text is then calculated according to the same paragraph parameter and the similar paragraph parameter. Finally, the above steps are repeated until the text similarity between each contrast text and the text to be processed is determined. By taking sentence segments as the unit of comparison, performing a preliminary pass with digital fingerprints, and then performing a precise pass with a dynamic programming algorithm, the method maintains search and comparison efficiency while producing a more accurate similarity result.
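The two-stage comparison summarized above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: SHA-256 is an assumed fingerprint function, Levenshtein distance is an assumed choice of dynamic programming algorithm, the default `preset_ratio` of 0.5 is arbitrary, and the final formula that combines the same paragraph parameter with the similar paragraph parameter is an assumption, since this section does not specify it. All function names are hypothetical.

```python
import hashlib
import re
from typing import List


def split_paragraphs(text: str, separator: str = r"\n+") -> List[str]:
    """Split text on a preset separator (a regex here) and drop empty pieces."""
    return [p.strip() for p in re.split(separator, text) if p.strip()]


def fingerprint(paragraph: str) -> str:
    """Digital fingerprint of a paragraph; SHA-256 is an assumed choice."""
    return hashlib.sha256(paragraph.encode("utf-8")).hexdigest()


def edit_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from Levenshtein distance, computed by dynamic programming."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def text_similarity(pending: str, contrast: str, preset_ratio: float = 0.5) -> float:
    pending_paras = split_paragraphs(pending)
    contrast_paras = split_paragraphs(contrast)
    if not pending_paras or not contrast_paras:
        return 0.0
    shared = ({fingerprint(p) for p in pending_paras}
              & {fingerprint(p) for p in contrast_paras})

    # Stage 1: count paragraphs whose fingerprints match exactly.
    same_count = sum(1 for p in pending_paras if fingerprint(p) in shared)
    rest_pending = [p for p in pending_paras if fingerprint(p) not in shared]
    rest_contrast = [p for p in contrast_paras if fingerprint(p) not in shared]

    # Stage 2: run the DP comparison only when the unmatched contrast
    # paragraphs stay within the preset ratio.
    similar_score = 0.0
    if len(rest_contrast) / len(contrast_paras) <= preset_ratio:
        for p in rest_pending:
            similar_score += max(
                (edit_similarity(p, c) for c in rest_contrast), default=0.0)

    # Combine both parameters (this formula is an assumption).
    return (same_count + similar_score) / max(len(pending_paras), len(contrast_paras))


score = text_similarity("The cat sat.\nHello world.", "The cat sat.\nHello World!")
unrelated = text_similarity("abc", "xyz")
```

The fingerprint pass costs one hash per paragraph, so the quadratic dynamic programming step only runs on the paragraphs that did not match exactly, and only when the two texts already share enough paragraphs to make a detailed comparison worthwhile.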
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical applications, to thereby enable others skilled in the art to best utilize the disclosure and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A similar text searching method for sentence-by-sentence comparison is characterized in that the method is applied to computer equipment, the computer equipment is in communication connection with a text database server, and a plurality of comparison texts are stored in the text database server;
the method comprises the following steps:
acquiring a text to be processed;
segmenting the text to be processed and a target contrast text based on a preset separator to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text, wherein the target contrast text is any one of the plurality of contrast texts;
calculating a digital fingerprint of each paragraph to be processed and each target contrast paragraph;
determining a first to-be-processed paragraph from the plurality of to-be-processed paragraphs, determining a first comparison paragraph from the plurality of target comparison paragraphs, and counting to obtain the same paragraph parameter, wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph;
under the condition that a second contrast paragraph occupies no more than a preset occupation ratio in the plurality of target contrast paragraphs, calculating a similar paragraph parameter by using a dynamic programming algorithm for the second contrast paragraph and a second to-be-processed paragraph, wherein the second contrast paragraph is a paragraph of the plurality of target contrast paragraphs except the first contrast paragraph, and the second to-be-processed paragraph is a paragraph of the plurality of to-be-processed paragraphs except the first to-be-processed paragraph;
calculating text similarity of the text to be processed and the target contrast text according to the same paragraph parameters and the similar paragraph parameters;
and repeating the steps until the text similarity of each contrast text and the text to be processed is determined.
2. The method of claim 1, wherein the text database server further stores a text length for each of the comparison texts, each of the comparison texts comprises a corresponding tag, each of the comparison texts comprises a corresponding comparison prefix, and the method further comprises:
acquiring a text to be processed and determining the text length and the mark of the text to be processed, wherein the text to be processed comprises a prefix to be processed;
constructing a text processing list according to the text length of the text to be processed and the text length of each contrast text, wherein the marks of the text to be processed and the marks of each contrast text are sequenced in the text processing list according to the text lengths;
respectively determining a first mark and a second mark from the text processing list, wherein the text length of a contrast text corresponding to the first mark is greater than the text length of the text to be processed, and the text length of the contrast text corresponding to the second mark is less than the text length of the text to be processed;
determining a first similar text from a contrast text corresponding to the first mark according to the text to be processed and the prefix to be processed;
and determining a second similar text from the contrast text corresponding to the second mark according to the contrast prefix of the text to be processed and the contrast text corresponding to the second mark.
3. The method of claim 1, wherein each of the contrasting texts comprises a corresponding contrasting prefix, the method further comprising:
acquiring a comparison prefix of the text to be processed;
confirming an undetermined contrast text from the plurality of contrast texts, wherein the contrast prefix of the undetermined contrast text is consistent with the contrast prefix of the text to be processed;
and comparing the text to be processed with each text to be compared word by word until the same text which is completely the same as the text to be processed is obtained.
4. The method of claim 1, further comprising:
acquiring a text to be compared, and performing segmentation processing on the text to be compared to obtain a plurality of paragraphs to be compared;
comparing the digital fingerprint of each paragraph to be processed with the digital fingerprint of each paragraph to be compared to obtain an original paragraph to be processed and an original paragraph to be compared, wherein the digital fingerprints of the original paragraph to be processed and the original paragraph to be compared are the same, and the original paragraph to be processed and the original paragraph to be compared are in one-to-one correspondence;
performing sentence segmentation processing on the original paragraphs to be processed to obtain a plurality of original sentence segments to be processed;
performing sentence segmentation processing on the original paragraphs to be compared to obtain a plurality of original sentence segments to be compared;
configuring labels for each original sentence segment to be processed and each original sentence segment to be compared;
and in response to a comparison operation, determining, according to the labels, a modification operation between a target original sentence segment to be processed and a target original sentence segment to be compared, wherein the target original sentence segment to be processed is any original sentence segment to be processed in a target original paragraph to be processed, the target original paragraph to be processed is any one of the plurality of original paragraphs to be processed, and the target original sentence segment to be compared is any original sentence segment to be compared in the target original paragraph to be compared corresponding to the target original paragraph to be processed.
5. The method of claim 1, wherein prior to the step of obtaining the text to be processed, the method further comprises:
if an initialization request from the text to be processed is acquired, initializing an authorization component according to the initialization request, wherein the authorization component is used for determining an authorization result corresponding to information to be verified;
when the authorization component is initialized, initializing the text query item;
when the text query item is initialized, displaying an initialization result, wherein the initialization result is used for indicating the text to be processed to send a text query instruction by calling an interface of the text query item;
when the text query instruction for a text to be processed is obtained, obtaining the information to be verified according to the text query instruction, wherein the information to be verified is generated according to a preset vector and a preset knowledge graph, the preset knowledge graph is obtained by performing association operation on the preset vector by adopting a second encryption rule, the preset vector is obtained by encrypting a vector to be configured of an encryption element by adopting a first encryption rule, the vector to be configured of the encryption element is obtained by vectorizing a pre-constructed encryption element, and the pre-constructed encryption element meets legal configuration conditions;
acquiring the first encryption rule and the second encryption rule;
analyzing the information to be verified to obtain the preset vector and the preset knowledge graph;
performing association operation on the preset vector by adopting the second encryption rule to obtain a preset knowledge graph to be verified;
if the preset knowledge graph to be verified is matched with the preset knowledge graph consistently, decrypting the preset vector by adopting the first encryption rule to obtain the encrypted element to-be-configured vector;
decoding the vector to be configured of the encrypted element to obtain the encrypted element, wherein the encrypted element comprises a user identifier and an encryption time limit, the user identifier is used for determining the user identity of a text query item, and the encryption time limit is used for determining the starting time and the ending time of the information to be verified;
acquiring an instruction trigger node, a to-be-verified user identifier and a to-be-verified text query item identifier, wherein the encryption element further comprises a text query item identifier, the to-be-verified text query item identifier has a corresponding relation with the text query item, the instruction trigger node is used for acquiring the time corresponding to the text query instruction, and the to-be-verified user identifier is determined according to the to-be-processed text;
if the instruction trigger node does not exceed the encryption time limit, the to-be-verified user identifier is matched with the user identifier in a consistent manner, and the to-be-verified text query item identifier is matched with the text query item identifier in a consistent manner, determining that an authorization result corresponding to the to-be-verified information is a first authorization result, wherein the first authorization result represents that the to-be-verified information is verified successfully;
if the instruction trigger node exceeds the encryption time limit, or the to-be-verified user identifier is not matched with the user identifier, or the to-be-verified text query item identifier is not matched with the text query item identifier, determining that an authorization result corresponding to the to-be-verified information is a second authorization result, wherein the second authorization result represents that the to-be-verified information is failed to be verified;
and if the authorization result is used for indicating that the information to be verified is successfully verified, starting a calling function of the text query item aiming at the text to be processed.
6. The method of claim 5, wherein obtaining the first encryption rule comprises:
acquiring a first private key and an encrypted private key;
decrypting the encrypted private key by using the first private key to obtain the first encryption rule;
obtaining a second encryption rule comprising:
acquiring a second private key and encrypted public key information;
and decrypting the encrypted public key information by adopting the second private key to obtain the second encryption rule.
7. The method according to claim 5, wherein when the text query instruction for the text to be processed is obtained, obtaining the information to be verified according to the text query instruction comprises:
when receiving the text query instruction aiming at the text to be processed sent by the terminal device, acquiring the information to be verified according to the text query instruction, wherein if the authorization result is used for indicating that the information to be verified is successfully verified, a call function of the text query item is started aiming at the text to be processed, and the call function comprises the following steps:
if the authorization result is used for indicating that the information to be verified is successfully verified, starting a calling function of the text query item aiming at the text to be processed, or sending the authorization result to the terminal equipment so that the terminal equipment starts the calling function of the text query item aiming at the text to be processed;
the method further comprises the following steps:
if the authorization result is used for indicating that the to-be-verified information is failed to be verified, refusing to invoke the text query item aiming at the to-be-processed text, or sending the authorization result to the terminal device so that the terminal device refuses to invoke the text query item aiming at the to-be-processed text.
8. A similar text searching device for sentence-by-sentence comparison is characterized by being applied to computer equipment, wherein the computer equipment is in communication connection with a text database server, and a plurality of comparison texts are stored in the text database server;
the device comprises:
the acquisition module is used for acquiring a text to be processed;
the segmentation module is used for carrying out segmentation processing on the text to be processed and the target contrast text based on a preset separator to obtain a plurality of paragraphs to be processed of the text to be processed and a plurality of target contrast paragraphs of the target contrast text, wherein the target contrast text is any one of the plurality of contrast texts;
the calculation module is used for calculating and obtaining the digital fingerprints of each paragraph to be processed and each target contrast paragraph; determining a first to-be-processed paragraph from the plurality of to-be-processed paragraphs, determining a first comparison paragraph from the plurality of target comparison paragraphs, and counting to obtain the same paragraph parameter, wherein the digital fingerprint of the first comparison paragraph is the same as the digital fingerprint of the first to-be-processed paragraph; under the condition that a second contrast paragraph occupies no more than a preset occupation ratio in the plurality of target contrast paragraphs, calculating a similar paragraph parameter by using a dynamic programming algorithm for the second contrast paragraph and a second to-be-processed paragraph, wherein the second contrast paragraph is a paragraph of the plurality of target contrast paragraphs except the first contrast paragraph, and the second to-be-processed paragraph is a paragraph of the plurality of to-be-processed paragraphs except the first to-be-processed paragraph; calculating text similarity of the text to be processed and the target contrast text according to the same paragraph parameters and the similar paragraph parameters;
and the determining module is used for repeating the steps until the text similarity between each contrast text and the text to be processed is determined.
9. The apparatus of claim 8, wherein the text database server further stores a text length of each of the comparison texts, each of the comparison texts comprises a corresponding tag, each of the comparison texts comprises a corresponding comparison prefix, and the determining module is further configured to:
acquiring a text to be processed and determining the text length and the mark of the text to be processed, wherein the text to be processed comprises a prefix to be processed; constructing a text processing list according to the text length of the text to be processed and the text length of each contrast text, wherein the marks of the text to be processed and the marks of each contrast text are sequenced in the text processing list according to the text lengths; respectively determining a first mark and a second mark from the text processing list, wherein the text length of a contrast text corresponding to the first mark is greater than the text length of the text to be processed, and the text length of the contrast text corresponding to the second mark is less than the text length of the text to be processed; determining a first similar text from a contrast text corresponding to the first mark according to the text to be processed and the prefix to be processed; and determining a second similar text from the contrast text corresponding to the second mark according to the contrast prefix of the text to be processed and the contrast text corresponding to the second mark.
10. The apparatus of claim 8, wherein each of the comparison texts comprises a corresponding comparison prefix, and wherein the determining module is further configured to:
acquiring a comparison prefix of the text to be processed; confirming an undetermined contrast text from the plurality of contrast texts, wherein the contrast prefix of the undetermined contrast text is consistent with the contrast prefix of the text to be processed; and comparing the text to be processed with each text to be compared word by word until the same text which is completely the same as the text to be processed is obtained.
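The prefix-based lookup recited in claims 3 and 10 can be illustrated with a short Python sketch. It is a simplified reading of those claims, not the claimed implementation: the prefix length `PREFIX_LEN`, the in-memory index, and the function names are assumptions, and plain string equality stands in for the word-by-word comparison.

```python
from collections import defaultdict
from typing import Dict, List, Optional

PREFIX_LEN = 8  # assumed prefix length; the claims do not fix a value


def build_prefix_index(contrast_texts: List[str]) -> Dict[str, List[str]]:
    """Group contrast texts by their comparison prefix."""
    index: Dict[str, List[str]] = defaultdict(list)
    for text in contrast_texts:
        index[text[:PREFIX_LEN]].append(text)
    return index


def find_identical(pending: str, index: Dict[str, List[str]]) -> Optional[str]:
    """Compare in full only against candidates that share the pending text's prefix."""
    for candidate in index.get(pending[:PREFIX_LEN], []):
        if candidate == pending:  # stands in for the word-by-word comparison
            return candidate
    return None


index = build_prefix_index(["alpha beta gamma", "alpha beta delta", "omega"])
hit = find_identical("alpha beta delta", index)
miss = find_identical("alpha beta zzz", index)
```

Filtering on the prefix first means the full comparison runs only against the contrast texts that could possibly be identical, rather than against every stored text.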
CN202011309156.XA 2020-11-20 2020-11-20 Similar text searching method and device for sentence-by-sentence comparison Active CN112380833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011309156.XA CN112380833B (en) 2020-11-20 2020-11-20 Similar text searching method and device for sentence-by-sentence comparison

Publications (2)

Publication Number Publication Date
CN112380833A true CN112380833A (en) 2021-02-19
CN112380833B CN112380833B (en) 2021-05-14

Family

ID=74584482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011309156.XA Active CN112380833B (en) 2020-11-20 2020-11-20 Similar text searching method and device for sentence-by-sentence comparison

Country Status (1)

Country Link
CN (1) CN112380833B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653700A (en) * 2015-03-13 2016-06-08 Tcl集团股份有限公司 Video search method and system
CN109359183A (en) * 2018-10-11 2019-02-19 南京中孚信息技术有限公司 The duplicate checking method, apparatus and electronic equipment of text information
CN111538803A (en) * 2020-04-20 2020-08-14 京东方科技集团股份有限公司 Method, device, equipment and medium for acquiring candidate question text to be matched

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MANUEL ZINI ET AL.: "Plagiarism Detection Through Multilevel Text Comparison", Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS'06) *
MA HUIDONG ET AL.: "Research on Chinese Document Copy Detection Based on Extracted Keywords", Computer Engineering & Science (计算机工程与科学) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268564A (en) * 2021-05-24 2021-08-17 平安科技(深圳)有限公司 Method, device and equipment for generating similar problems and storage medium
CN113268564B (en) * 2021-05-24 2023-07-21 平安科技(深圳)有限公司 Method, device, equipment and storage medium for generating similar problems
CN113590467A (en) * 2021-06-30 2021-11-02 平安健康保险股份有限公司 Data comparison method, system, computer equipment and computer readable storage medium
CN113590467B (en) * 2021-06-30 2023-07-21 平安健康保险股份有限公司 Data comparison method, system, computer device and computer readable storage medium
CN113949765A (en) * 2021-10-18 2022-01-18 北京博瑞彤芸科技股份有限公司 Cloud address book implementation method and device
CN115017269A (en) * 2022-08-05 2022-09-06 中科雨辰科技有限公司 Data processing system for determining similar texts
CN115017269B (en) * 2022-08-05 2022-10-25 中科雨辰科技有限公司 Data processing system for determining similar texts
CN116204918A (en) * 2023-01-17 2023-06-02 内蒙古科技大学 Text similarity secret calculation method and equipment in natural language processing
CN116204918B (en) * 2023-01-17 2024-03-26 内蒙古科技大学 Text similarity secret calculation method and equipment in natural language processing
CN117375627A (en) * 2023-12-08 2024-01-09 深圳市纷享互联科技有限责任公司 Lossless compression method and system for plain text format data suitable for character strings
CN117375627B (en) * 2023-12-08 2024-04-05 深圳市纷享互联科技有限责任公司 Lossless compression method and system for plain text format data suitable for character strings

Also Published As

Publication number Publication date
CN112380833B (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112380833B (en) Similar text searching method and device for sentence-by-sentence comparison
Chi et al. Hashing techniques: A survey and taxonomy
CN105718502B (en) Method and apparatus for efficient feature matching
US8699799B2 (en) Fingerprint verification method and apparatus with high security
JP2017528070A (en) Information encryption and decryption
CN110019640B (en) Secret-related file checking method and device
US10083194B2 (en) Process for obtaining candidate data from a remote storage server for comparison to a data to be identified
CN113656807A (en) Vulnerability management method, device, equipment and storage medium
CN107819748B (en) Anti-cracking verification code implementation method and device
Lutsenko et al. Biometric cryptosystems: overview, state-of-the-art and perspective directions
Rathgeb et al. Preventing the cross-matching attack in Bloom filter-based cancelable biometrics
You et al. A transformer based approach for image manipulation chain detection
CN116055067B (en) Weak password detection method, device, electronic equipment and medium
CN111414621B (en) Malicious webpage file identification method and device
CN115859370B (en) Transaction data processing method, device, computer equipment and storage medium
CN109359481B (en) Anti-collision search reduction method based on BK tree
CN115314268B (en) Malicious encryption traffic detection method and system based on traffic fingerprint and behavior
CN115567212A (en) File processing method and device, computer equipment and computer readable storage medium
CN115565222A (en) Face recognition method, face recognition system, terminal device and storage medium
CN115935299A (en) Authorization control method, device, computer equipment and storage medium
CN115695054B (en) WAF interception page identification method and device based on machine learning and related components
CN113052157B (en) Label detection method, apparatus, computer device and storage medium
CN116756718B (en) U-Sketch-based biological feature data error correction method, system and tool
CN115408720A (en) Data encryption method and device, processing equipment and storage medium
CN112926422B (en) Template protection method capable of revocating binary features based on OPH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant