CN116451092A

CN116451092A - Text difference rate determination method and device and electronic equipment

Info

Publication number: CN116451092A
Application number: CN202310445881.7A
Authority: CN
Inventors: 康伟; 薛景文; 刘晨; 郝豪红; 赵洪洋; 黄晨光; 王剑龙; 王新哲; 杨淋淋
Original assignee: Weichai Power Co Ltd
Current assignee: Weichai Power Co Ltd
Priority date: 2023-04-20
Filing date: 2023-04-20
Publication date: 2023-07-18

Abstract

The invention discloses a text difference rate determination method and device and electronic equipment. The method comprises: parsing the two files to be compared to obtain a first text content and a second text content; processing the first text content and the second text content based on a file difference analysis algorithm to determine a text matching result; dividing the text matching result into a plurality of sub-text sequences based on text paragraph information in the first text content, and determining the paragraph groups to be verified corresponding to the sub-text sequences; determining a target similar paragraph group based on the paragraph groups to be verified and a similarity judgment model, and determining the total number of characters corresponding to the target similar paragraph group; and determining the difference rate based on the number of common characters, the numbers of unique characters, the total number of characters and a difference rate function. This addresses the heavy workload, low efficiency and error-proneness of file auditing, improves the accuracy and efficiency of determining file differences, and reduces the error frequency.

Description

Text difference rate determination method and device and electronic equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for determining a text difference rate, and an electronic device.

Background

Today, many enterprises need to process large numbers of files, such as contracts, specifications and labels. Different revisions of such files are highly similar and differ only in small parts; if the reviewed version and the signed version are inconsistent, the enterprise may face cooperation disputes or even immeasurable losses.

At present, different versions of a file are audited and verified manually: staff page through the files, search for and compare the differences, and mark them by hand. In addition, some document comparison tools now exist that extract the characters of a standard-format document and then determine the differing content using a common-sequence-based method or a semantic recognition model.

However, manual auditing involves a heavy workload, is inefficient and is error-prone, while file comparison tools either consider only local differences between text fields or determine only global differences between files based on a semantic recognition model, so neither can accurately measure the differences between versions of a file.

Disclosure of Invention

The invention provides a text difference rate determining method, a text difference rate determining device and electronic equipment, which improve the accuracy of determining the file difference rate, improve the efficiency of determining the file difference and reduce the error frequency.

In a first aspect, the present invention provides a text difference rate determining method, the method comprising:

respectively analyzing and processing the two files to be compared to obtain a first text content and a second text content; the text content comprises text paragraph information, wherein the essential content corresponding to the two files to be compared is the same and the text is different;

processing the first text content and the second text content based on a file difference analysis algorithm, and determining a text matching result; the text matching result is a long sequence comprising common characters, unique characters of the first text content and unique characters of the second text content;

dividing the text matching result into at least one sub-text sequence based on text paragraph information in the first text content, and determining at least one paragraph group to be verified corresponding to the at least one sub-text sequence;

determining a target similar paragraph group based on the at least one paragraph group to be verified and a similarity judgment model, and determining the total number of characters corresponding to the target similar paragraph group;

and determining the difference rate between the two files to be compared based on the number of common characters, the number of unique characters of the first text content, the number of unique characters of the second text content, the total number of characters corresponding to the target similar paragraph group and a preset difference rate function.

In a second aspect, the present invention provides a text difference rate determining apparatus, the apparatus comprising:

the text content determining module is used for respectively analyzing and processing the two files to be compared to obtain a first text content and a second text content; the text content comprises text paragraph information, wherein the essential content corresponding to the two files to be compared is the same and the text is different;

the matching result determining module is used for processing the first text content and the second text content based on a file difference analysis algorithm and determining a text matching result; the text matching result is a long sequence comprising common characters, unique characters of the first text content and unique characters of the second text content;

the paragraph group determining module is used for dividing the text matching result into at least one sub-text sequence based on text paragraph information in the first text content, and determining at least one paragraph group to be verified corresponding to the at least one sub-text sequence;

the similarity paragraph determining module is used for determining a target similarity paragraph group based on the at least one paragraph group to be verified and the similarity judging model and determining the total number of characters corresponding to the target similarity paragraph group;

the difference rate determining module is used for determining the difference rate between the two files to be compared based on the number of common characters, the number of unique characters of the first text content, the number of unique characters of the second text content, the total number of characters corresponding to the target similar paragraph group and a preset difference rate function.

In a third aspect, the present invention provides a data processing electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the text difference rate determination method of any one of the embodiments of the present invention.

In a fourth aspect, the present invention provides a computer readable storage medium storing computer instructions for causing a processor to perform the text difference rate determination method of any of the embodiments of the present invention.

In a fifth aspect, the present invention provides a computer program product comprising a computer program which, when executed by a processor, implements the text difference rate determination method of any of the embodiments of the present invention.

According to the technical scheme provided by the embodiment of the invention, the first text content and the second text content are obtained by respectively analyzing and processing the two files to be compared, wherein the essential content corresponding to the two files to be compared is the same, the characters are different, and the text content comprises text paragraph information; processing the first text content and the second text content based on a file difference analysis algorithm, and determining a text matching result, wherein the text matching result is a long sequence comprising common characters, unique characters of the first text content and unique characters of the second text content; further, based on text paragraph information in the first text content, dividing a text matching result into at least one sub-text sequence, determining at least one paragraph group to be verified corresponding to the at least one sub-text sequence, then, based on the at least one paragraph group to be verified and a similarity judgment model, determining a target similar paragraph group, and determining the total number of characters corresponding to the target similar paragraph group, thereby determining the difference rate between the two files to be compared based on the common character number, the unique character number of the first text content, the unique character number of the second text content, the total number of characters corresponding to the target similar paragraph group and a preset difference rate function. The technical scheme provided by the invention solves the problems of large workload, low efficiency and easy error in the file auditing process, improves the accuracy of determining the file difference rate, improves the efficiency of determining the file difference, and reduces the error frequency.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a text difference rate determining method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a text difference rate determining method according to a second embodiment of the present invention;

Fig. 3 is a flowchart of a text difference rate determining method according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a text difference rate determining apparatus according to a fourth embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that, in the description and claims of the present invention and the above figures, the terms "first preset condition", "second preset condition", and the like are used to distinguish similar objects, and are not necessarily used to describe a specific order or precedence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Before the technical solution is introduced, the application scenario is described. Today, many enterprises need to process large numbers of files, such as contracts, specifications and labels. Different revisions of such files are highly similar and differ only in small parts; if the reviewed version is inconsistent with the signed version, the enterprise may face cooperation disputes or even immeasurable losses, so the differences between versions of a file need to be determined. At present, auditing and verification of different versions mainly rely on staff paging through the files, searching and comparing, and identifying the differences manually. In addition, some document comparison tools exist that extract the text of a standard-format document and then determine the document differences using a common-sequence-based method or a semantic recognition model. These automatic methods usually either consider only local differences between text fields or determine only global differences between documents based on the semantic recognition model, so they cannot accurately measure the differences between versions of a file.

Example 1

Fig. 1 is a flowchart of a text difference rate determining method according to an embodiment of the present invention, where the embodiment is applicable to a situation where a difference of a file is evaluated. The method may be performed by a text difference rate determining means, which may be implemented in hardware and/or software, which may be arranged on a computer device, which may be a notebook, a desktop computer, a smart tablet or the like. As shown in fig. 1, the method includes:

S110, respectively analyzing and processing the two files to be compared to obtain a first text content and a second text content.

The two files to be compared comprise a first file and a second file. The corresponding substantial content of the two files to be compared is the same, and the characters are different. For example, for a contract, the first draft of the contract is the first file, then the partial content of the first file is changed, added or removed according to the actual situation, and the determined file is the second file.

The first text content comprises the text, paragraph information, sequence information and version revision information of the first file; the second text content comprises the text, paragraph information, sequence information and version revision information of the second file. For example, if the first file has 3 natural paragraphs of 25 words each, the first text content includes a character sequence of 75 words, 3 paragraph marks, the front-to-back sequence information of each word, and the revision annotation text.

On the basis of the above embodiment, determining the first text content and the second text content specifically includes: determining the file types of two files to be compared; and analyzing and processing according to each file to be compared and the corresponding file type, and determining text content corresponding to each file to be compared.

In this embodiment, the files to be compared may be of several types, including but not limited to Word documents, PDF documents and scanned document images. Accordingly, the file types of the two files to be compared are determined first, and each file is then parsed by the analyzer corresponding to its file type to obtain the first text content and the second text content.

For example, if the first file is a Word document, a first analyzer corresponding to Word documents can be used to parse the first file and obtain the first text content; if the second file is a PDF document, a second analyzer corresponding to PDF documents can be used to parse the second file and obtain the second text content.
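The type-based selection of an analyzer can be sketched as a lookup table from file extension to parser function. This is only an illustration under stated assumptions: the names `PARSERS`, `parse_plain_text` and `parse_file` are hypothetical, only a plain-text stand-in parser is provided, and real Word, PDF or scanned-image analyzers would have to be registered separately.

```python
from pathlib import Path


def parse_plain_text(path):
    """Stand-in analyzer: one paragraph per non-empty line of a UTF-8 text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line.strip()]


# Hypothetical registry: file extension -> analyzer. Analyzers for Word, PDF
# or scanned images are not part of this sketch and would be added here.
PARSERS = {".txt": parse_plain_text}


def parse_file(path):
    """Determine the file type from the extension and run the matching analyzer."""
    suffix = Path(path).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"no analyzer registered for file type {suffix!r}")
    return parser(path)
```

Keeping the mapping in one table means a new file type only needs a new entry, without touching the comparison steps that follow.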

S120, processing the first text content and the second text content based on a file difference analysis algorithm, and determining a text matching result.

The file difference analysis algorithm is a predefined operation method. The file difference analysis algorithm is used for carrying out operation processing on the first text content and the second text content, so that a text matching result can be obtained. The text matching result is a long sequence comprising common characters, first text content unique characters, second text content unique characters.

Specifically, on the basis of determining the first text content and the second text content, the text content in the first text content is used as a first sequence, the text content in the second text content is used as a second sequence, the first sequence and the second sequence are input into a file difference analysis algorithm, and a text matching result can be determined through operation.

For example, if the text in the first text content is dabbcd, the first sequence is S₁ = (d, a, b, b, c, d); if the text in the second text content is ddbbca, the second sequence is S₂ = (d, d, b, b, c, a). Running the file difference analysis algorithm on S₁ and S₂ yields the text matching result:

S = ('=,d', '-,a', '+,d', '=,b', '=,b', '=,c', '-,d', '+,a')

where "=" marks a common character of the two text contents, "-" marks a unique character of the first text content, and "+" marks a unique character of the second text content.
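A matching result of this shape can be produced with Python's standard `difflib.SequenceMatcher`. The sketch below is a stand-in, not the file difference analysis algorithm of the invention; the function name `text_match` and the `(marker, character)` tuple representation are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def text_match(first, second):
    """Return a list of (marker, character) pairs.

    '=' marks a common character, '-' a character unique to the first
    text and '+' a character unique to the second text.
    """
    tokens = []
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            tokens += [("=", ch) for ch in first[i1:i2]]
        else:
            # 'replace', 'delete' and 'insert' all reduce to removed
            # characters followed by added characters.
            tokens += [("-", ch) for ch in first[i1:i2]]
            tokens += [("+", ch) for ch in second[j1:j2]]
    return tokens


result = text_match("dabbcd", "ddbbca")
```

For the two example sequences, `result` holds the same eight entries as the sequence S above.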

S130, dividing a text matching result into at least one sub-text sequence based on text paragraph information in the first text content, and determining at least one paragraph group to be verified corresponding to the at least one sub-text sequence.

The long sequence corresponding to the text matching result is divided into a plurality of sequences, and each small sequence obtained is a sub-text sequence. The paragraph group to be verified is two paragraphs reconstructed from characters in the sub-text sequence.

Specifically, the first text content includes paragraph marks corresponding to the text content, and the text matching result can be divided into a plurality of sub-text sequences by using characters corresponding to the paragraph marks as boundaries. For example, the first text content includes 3 paragraphs, and there are 2 paragraph marks, the first paragraph mark corresponds to the last character of the first paragraph, the second paragraph mark corresponds to the last character of the second paragraph, based on which, as long as the character corresponding to the paragraph mark is found in the long sequence corresponding to the text matching result, one long sequence can be divided into 3 sub-sequences.

On the basis of the above embodiment, determining at least one paragraph group to be verified corresponding to at least one sub-text sequence includes: determining a first text passage based on the common characters in the sub-text sequence and the unique characters of the first text content in the sub-text sequence; determining a second text passage based on the common characters in the sub-text sequence and the unique characters of the second text content in the sub-text sequence; the first text paragraph and the second text paragraph are used as a paragraph group to be verified corresponding to at least one sub-text sequence.

Taking one sub-text sequence as an example: it contains common characters, unique characters of the first text content and unique characters of the second text content. The common characters and the unique characters of the first text content are reassembled in their original order to form the first text paragraph, and the common characters and the unique characters of the second text content are reassembled in their original order to form the second text paragraph. The first text paragraph and the second text paragraph form the paragraph group to be verified corresponding to that sub-text sequence. In practice, as many paragraph groups to be verified are determined as there are paragraphs in the first text content.
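The splitting and reconstruction described above can be sketched as follows. The representation is assumed, not taken from the source: the matching result is a list of `(marker, character)` pairs, the paragraph information of the first text content is given as a list of paragraph lengths, and the names `split_by_paragraphs` and `rebuild_pair` are hypothetical. The sketch also assumes that characters unique to the second text which follow a boundary character fall into the next sub-sequence.

```python
def split_by_paragraphs(tokens, first_paragraph_lengths):
    """Cut the matching result after the last first-text character of each paragraph."""
    boundaries = set()
    position = 0
    for length in first_paragraph_lengths[:-1]:
        position += length
        boundaries.add(position)

    groups, current, seen = [], [], 0
    for marker, ch in tokens:
        current.append((marker, ch))
        if marker in ("=", "-"):          # character belongs to the first text
            seen += 1
            if seen in boundaries:        # paragraph mark reached: close the sub-sequence
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups


def rebuild_pair(sub_sequence):
    """Reconstruct the (first paragraph, second paragraph) pair of one sub-sequence."""
    first = "".join(ch for marker, ch in sub_sequence if marker in ("=", "-"))
    second = "".join(ch for marker, ch in sub_sequence if marker in ("=", "+"))
    return first, second


tokens = [("=", "d"), ("-", "a"), ("+", "d"), ("=", "b"),
          ("=", "b"), ("=", "c"), ("-", "d"), ("+", "a")]
# Assume the first text "dabbcd" has two paragraphs, "da" and "bbcd".
pairs = [rebuild_pair(sub) for sub in split_by_paragraphs(tokens, [2, 4])]
```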

S140, determining a target similar paragraph group based on at least one paragraph group to be verified and the similarity judgment model, and determining the total number of characters corresponding to the target similar paragraph group.

The similarity judgment model is a pre-trained model that outputs a similarity value; for example, it may be a semantic model or a model trained on a large-scale corpus. A target similar paragraph group is a paragraph group to be verified whose similarity value is greater than a preset threshold.

Specifically, the first text paragraph and the second text paragraph of a paragraph group to be verified are input into the similarity judgment model. The two paragraphs are first converted into numeric vectors of a fixed length, $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$, and the similarity of the two vectors is then calculated with the cosine function:

$$\mathrm{sim}(x, y) = \frac{\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2}\,\sqrt{\sum_{i=1}^{k} y_i^2}}$$

The similarity judgment model thus outputs the similarity value of the first text paragraph and the second text paragraph. In practice, the similarity value of each paragraph group to be verified is calculated separately, and each paragraph group to be verified whose similarity value is greater than the preset threshold is taken as a target similar paragraph group. The sub-text sequence corresponding to each target similar paragraph group is then identified, and the total number of characters contained in those sub-text sequences is determined.
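The threshold test on the cosine similarity can be sketched as below. The pre-trained model is not available here, so a character-frequency vector stands in for the model's encoding step; `char_frequency_vector`, `total_similar_characters`, the alphabet and the threshold value are all assumptions for illustration. The sketch also assumes that the character total of a target group is the length of its sub-text sequence.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


def char_frequency_vector(text, alphabet):
    """Stand-in encoder: counts of each alphabet character (not a semantic model)."""
    return [text.count(ch) for ch in alphabet]


def total_similar_characters(sub_sequences, threshold, alphabet):
    """Sum the lengths of the sub-sequences whose paragraph pair exceeds the threshold."""
    total = 0
    for sub in sub_sequences:
        first = "".join(ch for marker, ch in sub if marker in ("=", "-"))
        second = "".join(ch for marker, ch in sub if marker in ("=", "+"))
        similarity = cosine_similarity(
            char_frequency_vector(first, alphabet),
            char_frequency_vector(second, alphabet),
        )
        if similarity > threshold:
            total += len(sub)
    return total
```

A real deployment would replace `char_frequency_vector` with the encoder of the chosen similarity judgment model; the thresholding logic stays the same.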

S150, determining the difference rate between the two files to be compared based on the number of common characters, the number of unique characters of the first text content, the number of unique characters of the second text content, the total number of characters corresponding to the target similar paragraph group and a preset difference rate function.

In this embodiment, the preset difference rate function is:

where α is an optional parameter, $C_L$ is the number of unique characters of the first text content, $C_R$ is the number of unique characters of the second text content, $C_A$ is the number of common characters, and $C_S$ is the total number of characters corresponding to the target similar paragraph group.

In this embodiment, the first term of the difference rate function characterizes the local difference between the texts of the two files to be compared. The second term of the difference rate function reflects that, as long as the similarity value of a target similar paragraph group is greater than the threshold, the content of the whole paragraph can be regarded as substantively unchanged regardless of differences in individual characters; this term therefore characterizes the global difference between the files.

In this embodiment, the text matching result is a long sequence comprising common characters, unique characters of the first text content and unique characters of the second text content. Once the text matching result has been determined, the number of common characters, the number of unique characters of the first text content and the number of unique characters of the second text content can be counted from it. Substituting these three counts, together with the total number of characters corresponding to the target similar paragraph group, into the difference rate function gives the difference rate between the two files to be compared.
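Counting the three character totals from a matching result is straightforward. The sketch below assumes the matching result is a list of `(marker, character)` pairs and uses a hypothetical function name, `count_characters`; it covers only the counts that feed the difference rate function.

```python
from collections import Counter


def count_characters(tokens):
    """Count common and unique characters in a matching result."""
    counts = Counter(marker for marker, _ in tokens)
    return {
        "C_A": counts["="],  # common characters
        "C_L": counts["-"],  # unique to the first text content
        "C_R": counts["+"],  # unique to the second text content
    }


tokens = [("=", "d"), ("-", "a"), ("+", "d"), ("=", "b"),
          ("=", "b"), ("=", "c"), ("-", "d"), ("+", "a")]
counts = count_characters(tokens)
```

For the example sequences dabbcd and ddbbca, this gives 4 common characters and 2 unique characters on each side.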

On the basis of the embodiment, the method further comprises the following steps: determining a first key field and a second key field corresponding to the preset key word based on at least one preset key word, the first text content and the second text content; and examining the first key field and the second key field, determining a key difference field, and feeding back.

In this embodiment, keywords may be predefined; the text adjacent to a keyword often carries important information in the file. For example, the preset keywords may be: amount, money, responsible person, legal person, and so on. Once the first text content and the second text content have been obtained, the first key field and the second key field corresponding to each preset keyword are located by searching, and the first key field and the second key field are then examined closely to determine the key difference field. If a key difference field exists, it is fed back promptly to the terminal device of the staff member concerned. Checking the key information in this way ensures its accuracy.

For example, if the preset keyword is "responsible person", "responsible person" is used as the search term in both the first text content and the second text content to determine the text associated with the keyword. Searching the first text content returns "responsible person Wang Gong", and searching the second text content returns "responsible person Li Gong". The first key field is therefore "responsible person Wang Gong" and the second key field is "responsible person Li Gong"; further examination determines that the key difference field is "Wang" versus "Li", which is fed back.
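The keyword check can be sketched as a plain substring search that captures a fixed window of text after each keyword. The window width, the function names `key_fields` and `key_differences`, and the sample sentences are assumptions for illustration; the source does not specify how far a key field extends.

```python
def key_fields(text, keyword, width=10):
    """Return each occurrence of the keyword plus the `width` characters after it."""
    fields = []
    start = text.find(keyword)
    while start != -1:
        fields.append(text[start:start + len(keyword) + width])
        start = text.find(keyword, start + 1)
    return fields


def key_differences(first_text, second_text, keywords, width=10):
    """List (keyword, first key fields, second key fields) where the fields differ."""
    differences = []
    for keyword in keywords:
        first = key_fields(first_text, keyword, width)
        second = key_fields(second_text, keyword, width)
        if first != second:
            differences.append((keyword, first, second))
    return differences


first_text = "Responsible person: Wang Gong. Amount: 100"
second_text = "Responsible person: Li Gong. Amount: 100"
diffs = key_differences(first_text, second_text, ["Responsible person", "Amount"])
```

Here only the "Responsible person" keyword is reported, since the text following "Amount" is identical in both versions.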

Example 2

Fig. 2 is a flowchart of a text difference rate determining method according to a second embodiment of the present invention. This embodiment further refines step S120 of the foregoing embodiment and may be combined with any of the alternatives in one or more of the embodiments above. As shown in fig. 2, the method includes:

S210, respectively analyzing and processing the two files to be compared to obtain a first text content and a second text content.

S220, dividing the first text content and the second text content into at least two sub-text contents based on the longest common character.

The longest common character is the longest contiguous character segment shared by the first text content and the second text content.

For example, if the first text content is adabbcd and the second text content is cdbbca, the longest common character is bbc. bbc divides adabbcd into ada and d, and divides cdbbca into cd and a; ada and d are the sub-text contents of the first text content, and cd and a are the sub-text contents of the second text content.

S230, dividing the sub-text content into at least two sub-text contents again based on the sub-longest common character corresponding to the two sub-text contents.

Continuing the above example, the longest common sub-character of the sub-text contents ada and cd is d; d divides ada into a and a, and divides cd into c and "space", where "space" denotes a blank (empty) segment.

S240, repeatedly executing the step of dividing the first text content and the second text content based on the longest common character in the sub-text content until the first text content and the second text content have no common character, and obtaining all the common characters.

Continuing the above example, the final common characters are d, b, b, c.

S250, based on the first text content and the common characters, determining unique characters of the first text content. Based on the second text content and the common characters, unique characters of the second text content are determined.

In this embodiment, the common characters d, b, b, c are removed from the text of the first text content, and the remaining characters, a, a and d, are the unique characters of the first text content; the common characters d, b, b, c are removed from the text of the second text content, and the remaining characters, c and a, are the unique characters of the second text content. Note that, because the first text content and the second text content obtained by parsing the two files include the order information of each character, the unique characters of both text contents include a, but the position information of each a is different.

And S260, sequentially splicing the common characters, the unique characters of the first text content and the unique characters of the second text content, and determining a text matching result.

In this embodiment, on the basis of determining the common character, the unique character of the first text content and the unique character of the second text content, the common character, the unique character of the first text content and the unique character of the second text content are sequentially spliced according to the sequence information corresponding to each character in the original file, and meanwhile, different identifiers are used for defining the common character, the unique character of the first text content and the unique character of the second text content respectively, so that a text matching result is obtained.

Continuing the above example, the text matching result may be expressed as:

S = ('-,a', '+,c', '=,d', '-,a', '=,b', '=,b', '=,c', '-,d', '+,a')

where "=" marks a common character of the two text contents, "-" marks a unique character of the first text content, and "+" marks a unique character of the second text content. "=", "-" and "+" are the different identifiers used to distinguish common characters, unique characters of the first text content and unique characters of the second text content.
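Steps S220 to S260 can be sketched as a recursion: find the longest common contiguous segment, then repeat on the text to its left and to its right. The function names `longest_common_segment` and `match` and the `(marker, character)` tuple representation are assumptions for illustration; when several segments tie for longest, this sketch takes the first one found, which the source does not specify.

```python
def longest_common_segment(a, b):
    """Return (i, j, k) such that a[i:i+k] == b[j:j+k] is a longest common run."""
    best = (0, 0, 0)
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                current[j] = previous[j - 1] + 1
                if current[j] > best[2]:
                    best = (i - current[j], j - current[j], current[j])
        previous = current
    return best


def match(a, b):
    """Recursively split both texts around their longest common segment."""
    i, j, k = longest_common_segment(a, b)
    if k == 0:
        # No common character left: everything remaining is unique.
        return [("-", ch) for ch in a] + [("+", ch) for ch in b]
    return (
        match(a[:i], b[:j])
        + [("=", ch) for ch in a[i:i + k]]
        + match(a[i + k:], b[j + k:])
    )


result = match("adabbcd", "cdbbca")
```

For adabbcd and cdbbca, `result` reproduces the nine entries of the sequence S above, with bbc found first and d found in the left-hand remainder.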

S270, dividing a text matching result into at least one sub-text sequence based on text paragraph information in the first text content, and determining at least one paragraph group to be verified corresponding to the at least one sub-text sequence.

S280, determining a target similar paragraph group based on at least one paragraph group to be verified and the similarity judgment model, and determining the total number of characters corresponding to the target similar paragraph group.

S290, determining the difference rate between the two files to be compared based on the number of common characters, the number of unique characters of the first text content, the number of unique characters of the second text content, the total number of characters corresponding to the target similar paragraph group and a preset difference rate function.

According to the technical scheme provided by the embodiment of the invention, when the text matching result is determined, the first text content and the second text content are divided into at least two sub-text contents based on the longest common character; the sub-text contents are then divided again into at least two sub-text contents based on the sub-longest common characters corresponding to the two sub-text contents; and the step of dividing based on the longest common character in the sub-text contents is repeated until the first text content and the second text content have no common character left, so that all common characters are obtained. The unique characters of the first text content are then determined based on the first text content and the common characters, and the unique characters of the second text content are determined based on the second text content and the common characters. Finally, the common characters, the unique characters of the first text content and the unique characters of the second text content are spliced in sequence to determine the text matching result. Because the common characters are determined by successively locating the longest common characters, the common characters and the unique characters of the two text contents can be determined quickly and conveniently, which supports rapid determination of the text matching result.
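The recursive division described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the standard-library `difflib.SequenceMatcher.find_longest_match` stands in for the longest-common-character search, the function name `match_result` and the (identifier, character) pair layout are hypothetical, and ties between equally long matches are resolved by `difflib`'s own rule.

```python
from difflib import SequenceMatcher


def match_result(first: str, second: str) -> list[tuple[str, str]]:
    """Build a match sequence of ('=' | '-' | '+', character) pairs."""
    # autojunk=False keeps frequent characters from being ignored in long texts.
    matcher = SequenceMatcher(None, first, second, autojunk=False)

    def split(alo, ahi, blo, bhi):
        m = matcher.find_longest_match(alo, ahi, blo, bhi)
        if m.size == 0:
            # No common character remains: everything left is unique.
            return ([("-", ch) for ch in first[alo:ahi]]
                    + [("+", ch) for ch in second[blo:bhi]])
        # Divide around the longest common run and recurse on both sides.
        left = split(alo, m.a, blo, m.b)
        common = [("=", ch) for ch in first[m.a:m.a + m.size]]
        right = split(m.a + m.size, ahi, m.b + m.size, bhi)
        return left + common + right

    return split(0, len(first), 0, len(second))


print(match_result("abcd", "abxd"))
```

Recursion depth grows with the number of divisions, so a production version for very long files would likely use an explicit stack instead.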

Example III

Fig. 3 is a flowchart of a text difference rate determining method according to a third embodiment of the present invention, where step S140 of the embodiment of the present invention is further refined based on the foregoing embodiments, and the embodiment of the present invention may be combined with each of the alternatives in one or more embodiments. As shown in fig. 3, the method includes:

S310, respectively analyzing and processing the two files to be compared to obtain a first text content and a second text content.

S320, processing the first text content and the second text content based on a file difference analysis algorithm, and determining a text matching result.

S330, dividing the text matching result into at least one sub-text sequence based on the text paragraph information in the first text content, and determining at least one paragraph group to be verified corresponding to the at least one sub-text sequence.

S340, inputting at least one paragraph group to be verified into a predetermined similarity judgment model, and determining a similarity value corresponding to the at least one paragraph group to be verified.

In this embodiment, each determined paragraph group to be verified is input into the similarity judgment model, and the similarity judgment model outputs the similarity value corresponding to that paragraph group to be verified.

For example, the to-be-verified paragraph group includes to-be-verified paragraph group 1, to-be-verified paragraph group 2 and to-be-verified paragraph group 3, text contents corresponding to the to-be-verified paragraph group 1, to-be-verified paragraph group 2 and to-be-verified paragraph group 3 are respectively input into the similarity judgment model, the similarity judgment model outputs that the similarity value of the to-be-verified paragraph group 1 is 70%, the similarity value of the to-be-verified paragraph group 2 is 96%, and the similarity value of the to-be-verified paragraph group 3 is 95%.

S350, if the similarity value is greater than a preset threshold value, taking at least one paragraph group to be verified as a target similar paragraph group.

On the basis of the above exemplary embodiments, the preset threshold is a value set in advance, and it can be adjusted as needed during application. For example, with a preset threshold of 90%, the paragraph group 2 to be verified (96%) and the paragraph group 3 to be verified (95%) are the target similar paragraph groups, while the paragraph group 1 to be verified (70%) is not.
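The selection in S350 can be sketched as a strict greater-than filter over the similarity values. The similarity judgment model itself is not reproduced here; its outputs are simply given as a dictionary. The function name `select_target_groups` and the threshold value 0.90 are assumptions chosen for illustration, the latter so that groups 2 and 3 of the example both pass the strict comparison.

```python
def select_target_groups(similarity_by_group, threshold):
    """Return ids of paragraph groups whose similarity strictly exceeds threshold."""
    return [group_id
            for group_id, value in similarity_by_group.items()
            if value > threshold]


# Similarity values from the example: groups 1, 2 and 3.
similarities = {1: 0.70, 2: 0.96, 3: 0.95}
ASSUMED_THRESHOLD = 0.90  # hypothetical value, adjustable in application

targets = select_target_groups(similarities, ASSUMED_THRESHOLD)
print(targets)
```

Because the comparison is strict, a group whose similarity value equals the threshold is not selected.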

S360, determining a sub-text sequence corresponding to the target similar paragraph group.

In this embodiment, on the basis of determining the target similar paragraph group, the sub-text sequence corresponding to the target similar paragraph group is further determined.

S370, taking the total number of characters contained in the sub-text sequence as the total number of characters corresponding to the target similar paragraph group.

Based on the above exemplary embodiments, the to-be-verified paragraph group 2 and the to-be-verified paragraph group 3 are target similar paragraph groups, and the sum of the total number of characters contained in the sub-text sequence corresponding to the to-be-verified paragraph group 2 and the total number of characters contained in the sub-text sequence corresponding to the to-be-verified paragraph group 3 is the total number of characters corresponding to the target similar paragraph group.
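A minimal sketch of S360 and S370 follows, assuming each sub-text sequence is stored as a list of (identifier, character) pairs keyed by its paragraph group id, and assuming every entry of a sub-text sequence counts as one character. The sample data and the name `total_similar_characters` are hypothetical.

```python
def total_similar_characters(sub_sequences, target_group_ids):
    """Sum the lengths of the sub-text sequences of the target similar groups."""
    return sum(len(sub_sequences[group_id]) for group_id in target_group_ids)


# Hypothetical sub-text sequences for paragraph groups 1, 2 and 3.
sub_sequences = {
    1: [("=", "a"), ("-", "b")],
    2: [("=", "c"), ("=", "d"), ("+", "e")],
    3: [("=", "f")],
}

# Groups 2 and 3 are the target similar paragraph groups in the example.
print(total_similar_characters(sub_sequences, [2, 3]))
```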

S380, determining the difference rate between the two files to be compared based on the number of common characters, the number of unique characters of the first text content, the number of unique characters of the second text content, the total number of characters corresponding to the target similar paragraph group and a preset difference rate function.

According to the technical scheme provided by the embodiment of the invention, when the total number of characters corresponding to the target similar paragraph group is determined, at least one paragraph group to be verified is input into a predetermined similarity judgment model, a similarity value corresponding to the at least one paragraph group to be verified is determined, if the similarity value is larger than a preset threshold value, the at least one paragraph group to be verified is used as the target similar paragraph group, a sub-text sequence corresponding to the target similar paragraph group is further determined, and the total number of characters contained in the sub-text sequence is used as the total number of characters corresponding to the target similar paragraph group.

Example IV

Fig. 4 is a schematic structural diagram of a text difference rate determining apparatus according to a fourth embodiment of the present invention, where the apparatus may execute the text difference rate determining method according to the embodiment of the present invention. The device comprises: a text content determination module 410, a match result determination module 420, a paragraph group determination module 430, a similar paragraph determination module 440, and a difference rate determination module 450.

The text content determining module 410 is configured to analyze and process the two files to be compared respectively to obtain a first text content and a second text content; the text content comprises text paragraph information, wherein the essential content corresponding to the two files to be compared is the same and the text is different;

a matching result determining module 420, configured to process the first text content and the second text content based on a file difference analysis algorithm, and determine a text matching result; the text matching result is a long sequence comprising common characters, unique characters of the first text content and unique characters of the second text content;

a paragraph group determining module 430, configured to divide the text matching result into at least one sub-text sequence based on text paragraph information in the first text content, and determine at least one paragraph group to be verified corresponding to the at least one sub-text sequence;

a similar paragraph determining module 440, configured to determine a target similar paragraph group based on the at least one paragraph group to be verified and the similarity judgment model, and determine the total number of characters corresponding to the target similar paragraph group;

the difference rate determining module 450 is configured to determine a difference rate between the two files to be compared based on the number of common characters, the number of unique characters of the first text content, the number of unique characters of the second text content, the total number of characters corresponding to the target similar paragraph group, and a preset difference rate function.

Based on the above aspects, the text content determining module 410 includes:

a file type determining unit, configured to determine file types to which two files to be compared belong;

and the text content determining unit is used for analyzing and processing according to each file to be compared and the corresponding file type to determine the text content corresponding to each file to be compared.
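The two units above can be sketched as a small dispatch table from file type to parser. This is an assumption-laden illustration: the file type is taken from the file extension, only plain-text files are handled, and the names `PARSERS`, `file_type` and `extract_text` are hypothetical. Parsers for other file types would be registered in the same table.

```python
from pathlib import Path


def parse_plain_text(path: Path) -> str:
    """Parser for plain-text files (assumed UTF-8 encoded)."""
    return path.read_text(encoding="utf-8")


# Hypothetical registry: file type -> parser function.
PARSERS = {".txt": parse_plain_text}


def file_type(path: Path) -> str:
    """Determine the file type; here simply the lower-cased extension."""
    return path.suffix.lower()


def extract_text(path: Path) -> str:
    """Parse a file to be compared according to its file type."""
    kind = file_type(path)
    parser = PARSERS.get(kind)
    if parser is None:
        raise ValueError(f"no parser registered for file type {kind!r}")
    return parser(path)
```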

Based on the above technical solutions, the matching result determining module 420 includes:

a sub-text determining unit for dividing the first text content and the second text content into at least two sub-text contents based on the longest common character;

The sub-text dividing unit is used for dividing the sub-text content into at least two sub-text contents again based on sub-longest common characters corresponding to the two sub-text contents;

a common character determining unit, configured to repeatedly execute the step of dividing the first text content and the second text content based on the longest common character in the sub-text content until the first text content and the second text content have no common character, so as to obtain all common characters;

a unique content determining unit configured to determine unique characters of the first text content based on the first text content and the common characters; determining unique characters of the second text content based on the second text content and the common characters;

and the matching result determining unit is used for sequentially splicing the common characters, the unique characters of the first text content and the unique characters of the second text content to determine the text matching result.

Based on the above technical solutions, the paragraph group determining module 430 includes:

a paragraph determining unit, configured to determine a first text paragraph based on the common characters in the sub-text sequence and the unique characters of the first text content in the sub-text sequence;

determine a second text paragraph based on the common characters in the sub-text sequence and the unique characters of the second text content in the sub-text sequence;

and take the first text paragraph and the second text paragraph as the paragraph group to be verified corresponding to the at least one sub-text sequence.
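As a sketch of how a paragraph group to be verified can be rebuilt from one sub-text sequence, the following assumes the sequence is a list of (identifier, character) pairs with "=" for common characters, "-" for unique characters of the first text content and "+" for unique characters of the second text content. The function name `paragraph_group` is hypothetical.

```python
def paragraph_group(sub_sequence):
    """Return (first_paragraph, second_paragraph) for one sub-text sequence."""
    # First paragraph: common characters plus unique characters of the first text.
    first = "".join(ch for tag, ch in sub_sequence if tag in ("=", "-"))
    # Second paragraph: common characters plus unique characters of the second text.
    second = "".join(ch for tag, ch in sub_sequence if tag in ("=", "+"))
    return first, second


sub_sequence = [("=", "a"), ("=", "b"), ("-", "c"), ("+", "x"), ("=", "d")]
print(paragraph_group(sub_sequence))
```

The resulting pair is what would be passed to the similarity judgment model.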

Based on the above aspects, the similar paragraph determining module 440 includes:

the similarity value determining unit is used for inputting at least one paragraph group to be verified into a predetermined similarity judging model and determining a similarity value corresponding to the at least one paragraph group to be verified;

and the target paragraph group determining unit is used for taking at least one paragraph group to be verified as a target similar paragraph group if the similarity value is larger than a preset threshold value.

Based on the above aspects, the similar paragraph determining module 440 further includes:

a sub-sequence determining unit, configured to determine a sub-text sequence corresponding to the target similar paragraph group;

and the similar character determining unit is used for taking the total number of characters contained in the sub-text sequence as the total number of characters corresponding to the target similar paragraph group.

On the basis of the above technical solutions, the text difference rate determining device further includes:

a key field determining module, configured to determine a first key field and a second key field corresponding to a preset keyword based on at least one preset keyword, the first text content and the second text content;

and a difference field feedback module, configured to check the first key field and the second key field, determine the key difference field, and feed back the key difference field.
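The key field check can be sketched as follows. How a key field is delimited is not specified here, so this sketch assumes that a key field is the text following the keyword up to the end of the line; the function names `key_field` and `key_differences` and the sample texts are hypothetical.

```python
import re


def key_field(text, keyword):
    """Return the text after `keyword` up to end of line, or None if absent."""
    match = re.search(re.escape(keyword) + r"\s*[:：]?\s*(.*)", text)
    return match.group(1).strip() if match else None


def key_differences(first_text, second_text, keywords):
    """Map each keyword whose key fields differ to the (first, second) pair."""
    differences = {}
    for keyword in keywords:
        first = key_field(first_text, keyword)
        second = key_field(second_text, keyword)
        if first != second:
            differences[keyword] = (first, second)
    return differences


first_text = "Part number: A100\nTorque: 50 Nm"
second_text = "Part number: A100\nTorque: 55 Nm"
print(key_differences(first_text, second_text, ["Part number", "Torque"]))
```

The returned mapping corresponds to the key difference fields that would be fed back to the reviewer.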

The text difference rate determining device provided by the embodiment of the disclosure can execute the text difference rate determining method provided by any embodiment of the disclosure, and has the corresponding functional modules and beneficial effects of the executing method.

It should be noted that each unit and module included in the above apparatus are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for convenience of distinguishing from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.

Example V

Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM12 and the RAM13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as the text difference rate determination method.

In some embodiments, the text difference rate determination method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM12 and/or the communication unit 19. When the computer program is loaded into RAM13 and executed by processor 11, one or more steps of the text difference rate determination method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text difference rate determination method in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable text difference rate determination apparatus, such that the computer programs, when executed by the processor, cause the functions/operations specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for determining a file difference rate, comprising:

2. The method of claim 1, wherein the parsing the two files to be compared to obtain the first text content and the second text content includes:

determining file types of the two files to be compared;

and analyzing and processing according to each file to be compared and the corresponding file type, and determining text content corresponding to each file to be compared.

3. The method of claim 1, wherein the processing the first text content and the second text content based on the file difference analysis algorithm to determine a text match result comprises:

dividing the first text content and the second text content into at least two sub-text contents based on the longest common character;

dividing the sub-text content into at least two sub-text contents again based on sub-longest common characters corresponding to the two sub-text contents;

repeating the step of dividing the first text content and the second text content based on the longest common character in the sub-text content until the first text content and the second text content have no common character, and obtaining all common characters;

determining unique characters of the first text content based on the first text content and the common characters;

determining unique characters of the second text content based on the second text content and the common characters;

and sequentially splicing the common characters, the unique characters of the first text content and the unique characters of the second text content, and determining the text matching result.

4. The method of claim 1, wherein the determining at least one set of paragraphs to be verified that corresponds to the at least one sequence of sub-text comprises:

determining a first text passage based on the common characters in the sub-text sequence and the unique characters of the first text content in the sub-text sequence;

5. The method of claim 1, wherein the determining the target similar paragraph group based on the at least one paragraph group to be verified and the similarity determination model comprises:

inputting the at least one paragraph group to be verified into a predetermined similarity judgment model, and determining a similarity value corresponding to the at least one paragraph group to be verified;

and if the similarity value is greater than a preset threshold value, taking the at least one paragraph group to be verified as a target similar paragraph group.

6. The method of claim 1, wherein the determining the total number of characters corresponding to the set of target similar paragraphs comprises:

determining a sub-text sequence corresponding to the target similar paragraph group;

and taking the total number of characters contained in the sub-text sequence as the total number of characters corresponding to the target similar paragraph group.

7. The method of claim 1, wherein the preset difference rate function is:

wherein α is an optional parameter, C_L is the number of unique characters of the first text content, C_R is the number of unique characters of the second text content, C_A is the number of common characters, and C_S is the total number of characters corresponding to the target similar paragraph group.

8. The method as recited in claim 1, further comprising:

determining a first key field and a second key field corresponding to the preset key word based on at least one preset key word, the first text content and the second text content;

and checking the first key field and the second key field, determining a key difference field, and feeding back the key difference field.

9. A text difference rate determination apparatus, comprising:

10. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the text difference rate determination method of any one of claims 1-8.