CN109657202B - Text processing method and device - Google Patents

Publication number
CN109657202B
Authority
CN
China
Prior art keywords
text
clause
marked
matching
original text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710936432.7A
Other languages
Chinese (zh)
Other versions
CN109657202A (en
Inventor
石鹏
王福伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710936432.7A
Publication of CN109657202A
Application granted
Publication of CN109657202B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/106 Display of layout of documents; Previewing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text processing method and device, relates to the technical field of data processing, and aims to solve the problem that marking fails in existing text marking approaches. The method of the invention comprises the following steps: after the original text is obtained, parsing the text to be marked from the original text; searching for the original text content corresponding to the text to be marked based on a text similarity algorithm; and splicing the text in the original text other than that original text content with the marked text to be marked to obtain the target text. The invention is suitable for marking court judgment documents.

Description

Text processing method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for text processing.
Background
In the process of text analysis, some of the content of a given text usually requires special marking. For example, for a given court judgment document, it may be desirable to highlight certain characters in the section that states the court's opinion (the "court holds" section).
For such special marking of part of a text, the prior-art solution is to parse the text to be marked from the original text and then apply the special marking to it, for example by highlighting specific characters. The original text content corresponding to the parsed text to be marked is then located by string matching and replaced with the marked text content, finally yielding a text in which part of the original content has been specially marked.
In order to obtain the text to be marked, the original text may be slightly modified without affecting the semantics in the process of parsing the text to be marked from the original text, so that the finally parsed text to be marked and the corresponding original text have a certain difference. For the above prior art, since the string matching method is only suitable for matching the same strings, if the analyzed text to be marked has a difference from the corresponding original text content, the original text content corresponding to the analyzed text to be marked cannot be found through the string matching method, and thus subsequent replacement steps cannot be performed, resulting in failure of marking.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for processing a text, so as to solve the problem of failure in the marking process in the existing text marking process.
In order to solve the above technical problem, in a first aspect, the present invention provides a method for text processing, where the method includes:
after the original text is obtained, analyzing the text to be marked from the original text;
marking the text to be marked;
searching original text content corresponding to the text to be marked and processed based on a text similarity algorithm;
and splicing the texts except the original text content corresponding to the text to be marked in the original text with the marked text to be marked to obtain the target text.
Optionally, the searching for the original text content corresponding to the text to be marked based on the text similarity algorithm includes:
the original text and the text to be marked are respectively processed by sentence division;
traversing each clause in the original text, and respectively carrying out similarity matching with a first clause in the text to be marked;
if the initial matching clause matched with the first clause in the text to be marked appears in the original text, ending the similarity matching of the first clause;
traversing each remaining clause in the original text from the initial matching clause, and respectively performing similarity matching with the last clause in the text to be marked;
if a final matching clause matching the last clause in the text to be marked appears in the original text, ending the similarity matching of the last clause;
calculating the total number of clauses from the initial matching clause to the final matching clause, and determining it as the first total number of clauses;
and if the first total number of clauses is equal to the total number of clauses in the text to be marked, determining the initial matching clause, the final matching clause, and the clauses between them as the original text content corresponding to the text to be marked.
Optionally, the method further includes:
if the first total number of clauses is not equal to the total number of clauses in the text to be marked, traversing all clauses in the original text after the most recently determined initial matching clause, and performing similarity matching for the first clause and the last clause again;
re-determining a new initial matching clause, a new final matching clause, and a new first total number of clauses;
ending the similarity matching for the first clause and the last clause once the new first total number of clauses is equal to the total number of clauses in the text to be marked;
and determining the new initial matching clause, the new final matching clause, and the clauses between them as the original text content corresponding to the text to be marked.
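The retry procedure above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_span_with_retry` and the `is_match` predicate (any clause-level similarity test supplied by the caller) are hypothetical, and the search for the last clause is assumed to include the initial matching clause itself so that a one-clause text can be located.

```python
from typing import Callable, Optional

def find_span_with_retry(
    orig: list[str],
    sub: list[str],
    is_match: Callable[[str, str], bool],
) -> Optional[tuple[int, int]]:
    """Return inclusive (start, end) clause indices in `orig`, or None."""
    if not sub:
        return None
    search_from = 0
    while search_from < len(orig):
        # Find the next original clause that matches the first clause.
        start = next((i for i in range(search_from, len(orig))
                      if is_match(sub[0], orig[i])), None)
        if start is None:
            return None
        # From that clause onward, find the first clause matching the last clause.
        # ASSUMPTION: the initial matching clause itself is included in this scan.
        end = next((i for i in range(start, len(orig))
                    if is_match(sub[-1], orig[i])), None)
        if end is None:
            return None  # no later start can find a final match either
        # Accept only when the clause counts agree.
        if end - start + 1 == len(sub):
            return start, end
        # Otherwise retry with the clauses after the latest initial match.
        search_from = start + 1
    return None
```

For instance, with exact equality as the predicate, searching `["a", "b"]` in `["x", "a", "y", "a", "b"]` first pairs index 1 with index 4, rejects the span because it holds four clauses, and then settles on `(3, 4)`.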
Optionally, performing similarity matching on the first clause or the last clause in the text to be marked includes:
calculating the editing distance of the first clause or the last clause and the clause subjected to similarity matching in the original text based on an editing distance function Levenshtein;
calculating the character string length of the first clause or the last clause;
determining the ratio of the editing distance to the character string length of the first clause or the last clause as a matching result corresponding to the clause for similarity matching;
and if the matching result is greater than or equal to a preset threshold value, determining that the clause subjected to similarity matching in the original text is matched with the first clause or the last clause.
Optionally, the method further includes:
if the text to be marked parsed from the original text is discontinuous text content, dividing the text to be marked into a plurality of sub-texts to be marked, wherein the content of each sub-text to be marked is continuous;
respectively searching for the original text content corresponding to each sub-text to be marked based on a text similarity algorithm;
and splicing the text in the original text other than the original text content corresponding to the sub-texts to be marked with each marked sub-text to be marked to obtain the target text.
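As a rough sketch of the discontinuous case, suppose each sub-text has already been located as an inclusive range of clause indices in the original text. The function name `splice_many` and the `(start, end, marked_text)` tuple layout are assumptions made for illustration only.

```python
def splice_many(orig_clauses: list[str],
                spans: list[tuple[int, int, str]]) -> str:
    """Replace each inclusive clause range with its marked sub-text.

    `spans` holds (start, end, marked_text) triples (hypothetical layout);
    ranges must not overlap.
    """
    out: list[str] = []
    cursor = 0
    for start, end, marked in sorted(spans, key=lambda s: s[0]):
        if start < cursor:
            raise ValueError("spans overlap")
        out.extend(orig_clauses[cursor:start])  # untouched original clauses
        out.append(marked)                      # marked sub-text in place
        cursor = end + 1
    out.extend(orig_clauses[cursor:])           # tail of the original text
    return "".join(out)
```

Sorting the spans first means the sub-texts can be supplied in any order while the output still follows the sentence order of the original text.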
Optionally, the marking the text to be marked includes:
highlighting and marking the preset character strings in the text to be marked.
In a second aspect, the present invention also provides a text processing apparatus, including:
the analysis unit is used for analyzing the text to be marked from the original text after the original text is obtained;
the marking processing unit is used for marking the text to be marked;
the searching unit is used for searching the original text content corresponding to the text to be marked based on a text similarity algorithm;
and the splicing unit is used for splicing the texts in the original texts except the original text content corresponding to the text to be marked with the text to be marked after the marking treatment to obtain the target text.
Optionally, the searching unit includes:
the sentence dividing module is used for respectively carrying out sentence dividing processing on the original text and the text to be marked;
the matching module is used for traversing each clause in the original text and respectively carrying out similarity matching with a first clause in the text to be marked;
the ending module is used for ending the similarity matching of the first clause if an initial matching clause matching the first clause in the text to be marked appears in the original text;
the matching module is also used for traversing each remaining clause in the original text from the initial matching clause and respectively carrying out similarity matching with the last clause in the text to be marked;
the ending module is also used for ending the similarity matching of the last clause if the ending matching clause matched with the last clause in the text to be marked and processed appears in the original text;
the calculating module is used for calculating the total number of clauses from the initial matching clause to the final matching clause and determining the total number as the first total number of clauses;
and the determining module is used for determining the clauses between the initial matching clause and the final matching clause, the initial matching clause and the final matching clause as the original text content corresponding to the text to be marked if the total number of the first clauses is equal to the total number of the clauses in the text to be marked.
Optionally, the apparatus further comprises:
the matching unit is used for traversing all clauses in the original text after the most recently determined initial matching clause, and performing similarity matching for the first clause and the last clause again, if the first total number of clauses is not equal to the total number of clauses in the text to be marked;
a first determining unit, configured to re-determine a new initial matching clause, a new final matching clause, and a new first total number of clauses;
the ending unit is used for ending the similarity matching for the first clause and the last clause once the new first total number of clauses is equal to the total number of clauses in the text to be marked;
and the second determining unit is used for determining the new initial matching clause, the new final matching clause, and the clauses between them as the original text content corresponding to the text to be marked.
Optionally, the matching module is configured to:
calculating the editing distance of the first clause or the last clause and the clause subjected to similarity matching in the original text based on an editing distance function Levenshtein;
calculating the character string length of the first clause or the last clause;
determining the ratio of the editing distance to the character string length of the first clause or the last clause as a matching result corresponding to the clause for similarity matching;
and if the matching result is greater than or equal to a preset threshold value, determining that the clause subjected to similarity matching in the original text is matched with the first clause or the last clause.
Optionally, the apparatus further comprises:
the dividing unit is used for dividing the text to be marked into a plurality of sub-texts to be marked if the text to be marked parsed from the original text is discontinuous text content, wherein the content of each sub-text to be marked is continuous;
the searching unit is also used for respectively searching the original text content corresponding to each sub-text to be marked based on a text similarity algorithm;
the splicing unit is further configured to splice texts in the original text, except for the original text content corresponding to the sub-text to be marked, with the marked sub-texts to be marked, so as to obtain a target text.
Optionally, the mark processing unit is further configured to:
highlighting and marking the preset character strings in the text to be marked.
In order to achieve the above object, according to a third aspect of the present invention, there is provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the above-mentioned text processing method.
In order to achieve the above object, according to a fourth aspect of the present invention, there is provided a processor for executing a program, wherein the program executes the method for text processing described above.
By means of the above technical solution, the text processing method and device provided by the invention search for the original text content corresponding to the text to be marked based on a text similarity algorithm. Compared with the prior art, even if the parsed text to be marked differs from the corresponding original text content, the difference is usually small (for example, some characters are modified, or some characters without actual meaning are deleted), so the two remain similar to a certain extent and the corresponding original text content can still be found in the original text through the similarity algorithm. The text processing method provided by the invention therefore effectively avoids the prior-art situation in which the original text content corresponding to the text to be marked cannot be found by string matching.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for processing text according to an embodiment of the present invention;
FIG. 2 is a flow diagram illustrating another method of text processing provided by embodiments of the present invention;
FIG. 3 is a block diagram illustrating a text processing apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating another apparatus for processing text according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the problem of marking failure in existing text marking approaches, an embodiment of the present invention provides a text processing method. As shown in FIG. 1, the method includes:
101. After the original text is obtained, the text to be marked is parsed from the original text.
The original text is the text before any processing; the term is used to distinguish it from the subsequent target text. The text to be marked is part of the content of the original text. Because actual text analysis often requires that only some of the text in the original be marked, the text to be marked must first be parsed from the original text and then marked separately. Specifically, the features of the text to be marked are set in advance according to requirements, and the original text is then parsed according to those features to obtain the text to be marked. It should be noted that, in practical applications, the parsing process may need to make small changes that do not affect the semantics of the original text, such as modifying or deleting characters. The finally parsed text to be marked may therefore not be completely consistent with the corresponding original text content, where complete consistency means that every character is the same.
102. And marking the text to be marked.
After the text to be marked is obtained through parsing in step 101, it needs to be marked. This embodiment does not limit the form of the marking, and any marking may be applied according to actual needs. For example, certain specific characters may be specially marked (e.g., with a different color or font style) in order to highlight them.
103. And searching the original text content corresponding to the text to be marked based on the text similarity algorithm.
The purpose of text marking is to turn the original text into a target text in which the text to be marked has been marked. The original text content corresponding to the text to be marked therefore needs to be replaced with the marked text to be marked, and before the replacement, that original text content must be located in the original text.
However, because of the needs of the parsing process, the finally parsed text to be marked may differ from the corresponding original text content. The difference is usually small (for example, some characters are modified, or some characters without actual meaning are deleted), so the text to be marked and the corresponding original text content remain similar to a certain extent, and the corresponding original text content can be found using a similarity algorithm. In this embodiment, the search performs similarity matching between content in the original text and the text to be marked; when content is found in the original text whose similarity to the text to be marked meets a certain condition, that content is regarded as the original text content corresponding to the text to be marked. The similarity algorithm in this embodiment may be any algorithm capable of computing text similarity, such as one based on the cosine of the included angle or one based on edit distance.
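To illustrate the cosine option mentioned above, here is a minimal sketch that treats each text as a vector of character counts. The character-level representation and the function name `cosine_similarity` are assumptions; the source does not say how the vectors would be built.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the character-count vectors of a and b."""
    va, vb = Counter(a), Counter(b)
    # Dot product over the characters the two texts share.
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

A count-based vector ignores character order, so two texts that are anagrams of each other score 1.0. An edit-distance measure does not have that blind spot.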
104. And splicing the texts in the original text except the original text content corresponding to the text to be marked with the marked text to be marked to obtain the target text.
After the original text content corresponding to the text to be marked is found in the original text, it is replaced with the marked text to be marked to obtain the target text. The replacement is implemented by splicing the text in the original text other than that original text content with the marked text to be marked. Splicing here means joining the marked text to be marked and the remaining original text according to the corresponding sentence positions in the original text.
In the text processing method provided by the embodiment of the invention, the original text content corresponding to the text to be marked is searched for based on a text similarity algorithm. Compared with the prior art, even if the parsed text to be marked differs from the corresponding original text content, the difference is usually small (for example, some characters are modified, or some characters without actual meaning are deleted), so the two remain similar to a certain extent and the corresponding original text content can still be found in the original text through the similarity algorithm. The text processing method provided by the invention therefore effectively avoids the prior-art situation in which the original text content corresponding to the text to be marked cannot be found by string matching.
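The splicing step can be pictured with a short sketch, assuming the original text is held as a list of clauses and the located content is an inclusive range of clause indices. The name `splice` and its parameters are illustrative, not taken from the source.

```python
def splice(orig_clauses: list[str], start: int, end: int, marked_text: str) -> str:
    """Replace clauses start..end (inclusive) with marked_text, keep the rest."""
    before = "".join(orig_clauses[:start])    # original text ahead of the match
    after = "".join(orig_clauses[end + 1:])   # original text behind the match
    return before + marked_text + after
```

For example, `splice(["A,", "B,", "C."], 1, 1, "<mark>B</mark>,")` keeps the first and third clauses and puts the marked clause in the middle position.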
Further, as a refinement and an extension of the embodiment shown in FIG. 1, an embodiment of the present invention provides another text processing method, as shown in FIG. 2.
201. After the original text is obtained, the text to be marked is parsed from the original text.
The implementation of this step is the same as that of step 101 in FIG. 1 and is not described again here.
202. And marking the text to be marked.
In practical applications, the form and manner of marking the text to be marked are not limited. The embodiment of the invention provides one specific marking approach: highlighting the preset character strings in the text to be marked to obtain the marked text to be marked. As a specific example, assume the original text is a court judgment document and the text to be marked is the content of the "court holds" section. Suppose the actual requirement is that "trademark" be highlighted within the "court holds" section while "trademark" in the rest of the text is displayed normally. The "trademark" characters are then matched in the parsed "court holds" section and highlighted, which yields the marked "court holds" section, that is, the marked text to be marked for this step.
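A minimal sketch of such highlight marking follows. The source does not specify the markup, so wrapping each occurrence in an HTML `<mark>` tag is an assumption, as is the function name `highlight`.

```python
def highlight(text: str, keyword: str,
              open_tag: str = "<mark>", close_tag: str = "</mark>") -> str:
    """Wrap every occurrence of keyword in the given tags.

    ASSUMPTION: HTML-style tags stand in for whatever highlight
    format the real system uses.
    """
    if not keyword:
        return text  # nothing to highlight
    return text.replace(keyword, f"{open_tag}{keyword}{close_tag}")
```

Because only the parsed section is passed in, occurrences of the keyword elsewhere in the original text are left untouched, which matches the requirement in the example above.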
203. And performing sentence division processing on the original text and the text to be marked.
Sentence division is performed separately on the original text and the text to be marked; that is, each is divided into a number of independent clauses according to the punctuation marks it contains (commas, periods, semicolons, and the like). The division can be carried out with a segmentation tool. This embodiment does not limit the tool, which may be any existing one, such as the NLTK text segmenter or the TestPhrase segmentation tool.
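As a simple stand-in for the tools named above, the following sketch splits on punctuation with a regular expression. The delimiter set (ASCII and full-width commas, periods, and semicolons) and the name `split_clauses` are assumptions chosen for illustration.

```python
import re

# ASSUMPTION: clause delimiters are ASCII and full-width , . ;
_CLAUSE = re.compile(r"[^,.;，。；]*[,.;，。；]|[^,.;，。；]+")

def split_clauses(text: str) -> list[str]:
    """Split text into clauses, keeping each delimiter with its clause.

    No characters are dropped, so "".join(split_clauses(t)) == t.
    """
    return _CLAUSE.findall(text)
```

Keeping the delimiters attached makes the split lossless, which is convenient later when untouched clauses are spliced back together. A real tool would also need to avoid splitting inside numbers such as "3.5", which this sketch does not attempt.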
204. And traversing each clause in the original text, and respectively carrying out similarity matching with the first clause in the text to be marked.
Specifically, in this embodiment, similarity matching is performed based on a similarity algorithm of an edit distance function Levenshtein. The implementation manner of matching the similarity between each clause in the original text and the first clause in the text to be marked is the same, and this embodiment takes matching of the similarity between a certain clause in the original text and the first clause in the text to be marked as an example to explain:
Firstly, the edit distance between the first clause and a clause in the original text is calculated based on Levenshtein.
The first clause and the clause in the original text are passed as arguments to the Levenshtein function, which returns the edit distance between them. The edit distance is the minimum number of edit operations required to convert one character string into another, where the edit operations are replacing one character with another, inserting one character, and deleting one character. In this embodiment, the edit distance is the minimum number of edit operations required to convert the first clause into the clause in the original text, or equivalently to convert the clause in the original text into the first clause.
Secondly, the character string length of the first clause is calculated.
Thirdly, the ratio of the edit distance to the character string length of the first clause is determined as the matching result for the clause being matched.
A specific example: assume that the set of clauses obtained by splitting the original text is [S1, S2, …, Sn], and the set of clauses obtained by splitting the text to be marked is [SubS1, SubS2, …, SubSm], where Si (i = 1, 2, …, n) and SubSj (j = 1, 2, …, m) respectively denote the i-th clause of the original text and the j-th clause of the text to be marked. If the edit distance between the first clause SubS1 of the text to be marked and a clause Si of the original text, calculated as described above, is lev, and the string length of SubS1 is len(SubS1), then Similar = lev / len(SubS1) is the matching result for the similarity matching between SubS1 and Si.
And finally, if the matching result is greater than or equal to a preset threshold value, determining that the first clause is matched with one clause in the original text.
Continuing the above example, if the value of Similar is greater than or equal to the preset threshold, the first clause is determined to match the clause in the original text; otherwise it does not match. The preset threshold can be set freely according to actual requirements, for example to 0.95 or 0.97.
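The clause-level match can be sketched as below. One point needs care: the text defines the matching result as lev / len(SubS1) and accepts it when it reaches a threshold such as 0.95, yet that ratio grows as the two clauses diverge. This sketch therefore assumes the intended score is the complement, 1 - lev / len(SubS1). That reading is an interpretation, not something the source states. The function names are also illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def is_match(sub_clause: str, orig_clause: str, threshold: float = 0.95) -> bool:
    """Decide whether orig_clause corresponds to sub_clause."""
    if not sub_clause:
        return not orig_clause
    ratio = levenshtein(sub_clause, orig_clause) / len(sub_clause)  # lev / len(SubS1)
    # ASSUMPTION: similarity is taken as 1 - ratio so that near-identical
    # clauses score close to 1 and can clear a threshold such as 0.95.
    return 1.0 - ratio >= threshold
```

Under this reading, a 100-character clause with a single changed character scores 0.99 and matches at the default threshold, while unrelated clauses score near 0 and do not.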
205. And if the initial matching clause matched with the first clause in the text to be marked appears in the original text, ending the similarity matching of the first clause.
Following the matching process in step 204, the clauses in the original text are matched against the first clause in the order in which they appear in the original text. The first clause of the original text that matches the first clause is determined as the initial matching clause, and the similarity matching of the first clause ends. The initial matching clause is the content of the original text corresponding to the first clause.
206. And traversing each remaining clause in the original text from the initial matching clause, and respectively performing similarity matching with the last clause in the text to be marked.
The clause of the original text that matches the last clause necessarily comes after the initial matching clause, so the clauses before the initial matching clause need not be matched against the last clause, which speeds up the similarity matching.
Specifically, starting from the initial matching clause, each remaining clause in the original text is matched against the last clause in the same way that a clause in the original text is matched against the first clause in step 204, which is not described again here.
207. And if the final matching clause matched with the last clause in the text to be marked appears in the original text, ending the similarity matching of the last clause.
Starting from the initial matching clause, the remaining clauses in the original text are matched against the last clause in order. The first clause of the original text that matches the last clause is determined as the final matching clause, and the similarity matching of the last clause ends. The final matching clause is the content of the original text corresponding to the last clause.
208. Calculate the total number of clauses from the initial matching clause to the final matching clause, and determine it as the first total number of clauses.
After the contents of the original text corresponding to the first clause and the last clause are determined, the content from the initial matching clause to the final matching clause can provisionally be taken as the original text content corresponding to the text to be marked. In practice, however, other clauses of the original text located after the initial matching clause may also match the first clause. Therefore, to determine whether the obtained initial matching clause really belongs to the original text content corresponding to the text to be marked, a further check is made by calculating the total number of clauses from the initial matching clause to the final matching clause.
209. If the first total number of clauses is equal to the total number of clauses in the text to be marked, determine the initial matching clause, the final matching clause, and the clauses between them as the original text content corresponding to the text to be marked.
The original text content corresponding to the text to be marked has the same number of clauses as the text to be marked, both before and after parsing. Therefore, if the total number of clauses from the initial matching clause to the final matching clause, that is, the first total number of clauses, is equal to the total number of clauses in the text to be marked, the obtained initial matching clause can be judged to belong to the original text content that really corresponds to the text to be marked. Accordingly, the initial matching clause, the final matching clause, and the clauses between them can be determined as the original text content corresponding to the text to be marked.
210. Splice the text in the original text other than the original text content corresponding to the text to be marked with the marked text, to obtain the target text.
The implementation of this step is the same as that of step 104 in fig. 1, and is not described here again.
In addition, in step 209, if the first total number of clauses is not equal to the total number of clauses in the text to be marked, it can be determined that the obtained initial matching clause does not belong to the original text content that really corresponds to the text to be marked. Therefore, the remaining clauses of the original text after the previously obtained initial matching clause must be traversed, and similarity matching for the first clause and the last clause must be performed again. Specifically: starting after the previous initial matching clause, the remaining clauses of the original text are matched against the first clause in order, and a new initial matching clause is determined; then, starting from the new initial matching clause, the remaining clauses of the original text are matched against the last clause in order, and a new final matching clause is determined.
After a new initial matching clause and a new final matching clause are determined, the corresponding new first total number of clauses is calculated, and it is again judged whether this number is equal to the total number of clauses in the text to be marked. If so, similarity matching for the first clause and the last clause ends, and the new initial matching clause, the new final matching clause, and the clauses between them are determined as the original text content corresponding to the text to be marked. If not, the original text is traversed again from the most recently determined initial matching clause, and similarity matching for the first clause and the last clause is repeated until the first total number of clauses is equal to the total number of clauses in the text to be marked.
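The search in steps 204 to 209, together with the retry described above, can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `find_span`, the pluggable `matches` predicate, and the choice to let the search for the last clause begin at the initial matching clause itself (so that a one-clause text can match) are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple


def find_span(
    original: List[str],
    marked: List[str],
    matches: Callable[[str, str], bool],
) -> Optional[Tuple[int, int]]:
    """Return inclusive (start, end) clause indices in `original`, or None.

    `matches(original_clause, marked_clause)` is a hypothetical similarity
    predicate supplied by the caller.
    """
    if not marked:
        return None
    first, last, wanted = marked[0], marked[-1], len(marked)
    search_from = 0
    while search_from < len(original):
        # Steps 204-205: first original clause that matches the first clause.
        start = next(
            (i for i in range(search_from, len(original))
             if matches(original[i], first)),
            None,
        )
        if start is None:
            return None
        # Steps 206-207: first clause from `start` onward matching the last clause.
        end = next(
            (j for j in range(start, len(original))
             if matches(original[j], last)),
            None,
        )
        if end is None:
            return None
        # Steps 208-209: accept only if the clause counts agree.
        if end - start + 1 == wanted:
            return start, end
        # Otherwise retry with the clauses after the rejected initial match.
        search_from = start + 1
    return None
```

With exact equality as the predicate, `find_span(["a", "b", "x", "a", "b", "c", "d"], ["a", "b", "c"], lambda o, m: o == m)` rejects the first candidate (six clauses from index 0 to index 5) and returns `(3, 5)` on the retry.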
In addition, for step 101 in fig. 1 or step 201 in fig. 2, if the text to be marked that is parsed from the original text is discontinuous text content, the text to be marked is divided into a plurality of sub-texts to be marked, the content of each of which is continuous. It should be noted that the text to be marked being continuous means that its content is semantically continuous, that is, the original text content corresponding to the text to be marked is continuous in the original text.
The reason for dividing the text to be marked into a plurality of sub-texts to be marked is as follows. When the original text content corresponding to the text to be marked is searched for in the original text, it is determined from the original text content found for the first clause and the last clause of the text to be marked. If the text to be marked is discontinuous, the original text content finally found will necessarily contain text that does not correspond to the text to be marked. The text to be marked must therefore be continuous, so a discontinuous text to be marked needs to be divided into a plurality of continuous sub-texts to be marked.
After the text to be marked is divided into a plurality of sub-texts to be marked, the original text content corresponding to each sub-text to be marked is searched for based on a text similarity algorithm. Then the text in the original text other than the original text content corresponding to all the sub-texts to be marked is spliced with all the marked sub-texts to obtain the target text. Splicing here means joining all the marked sub-texts and the text in the original text other than the original text content corresponding to the sub-texts, according to the corresponding sentence positions in the original text.
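The splicing by sentence position can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `splice` is hypothetical, the original text is assumed to be already divided into clauses that keep their trailing punctuation, and each marked sub-text is assumed to arrive with the inclusive clause span it replaces.

```python
from typing import List, Tuple


def splice(original: List[str], replacements: List[Tuple[int, int, str]]) -> str:
    """Rebuild the target text from original clauses and marked sub-texts.

    Each replacement is (start, end, marked_text) with inclusive clause
    indices; spans are assumed not to overlap.
    """
    parts: List[str] = []
    cursor = 0
    for start, end, marked_text in sorted(replacements):
        parts.extend(original[cursor:start])  # untouched clauses before the span
        parts.append(marked_text)             # marked sub-text takes the span's place
        cursor = end + 1
    parts.extend(original[cursor:])           # untouched clauses after the last span
    return "".join(parts)
```

Sorting the replacements by start index keeps every marked sub-text at the position its original content occupied, regardless of the order in which the sub-texts were located.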
Further, as an implementation of the method shown in fig. 1 and fig. 2, another embodiment of the present invention provides a text processing apparatus for implementing that method. The apparatus embodiment corresponds to the method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the apparatus in this embodiment can implement everything in the method embodiment. As shown in fig. 3, the apparatus includes: a parsing unit 31, a marking processing unit 32, a searching unit 33 and a splicing unit 34.
The parsing unit 31 is configured to parse a text to be marked from an original text after the original text is obtained;
The text to be marked is parsed from the original text as follows: the characteristics of the text to be marked are set in advance according to requirements, and the original text is then parsed according to those characteristics to obtain the text to be marked. It should be noted that, in practical applications, the parsing process may need to make small changes to the original text that do not affect its semantics, such as modifying or deleting characters. As a result, the text to be marked that is finally parsed out is not completely consistent with the corresponding original text, where complete consistency means that every character is the same.
A marking processing unit 32, configured to perform marking processing on a text to be marked;
In this embodiment, the form of the marking process is not limited, and any marking process may be performed according to actual needs, for example, specially marking certain characters (with a different color, font type, and so on) in order to highlight them.
The searching unit 33 is configured to search, based on a text similarity algorithm, original text content corresponding to a text to be marked;
In this embodiment, when searching with the similarity algorithm, the content of the original text is matched for similarity against the text to be marked. When content in the original text is found whose similarity to the text to be marked meets a certain condition, that content can be considered the original text content corresponding to the text to be marked. The similarity algorithm in this embodiment may be any algorithm capable of calculating text similarity, such as one based on the cosine of the included angle or one based on edit distance.
And the splicing unit 34 is configured to splice a text in the original text except for the original text content corresponding to the text to be marked with the marked text to be marked, so as to obtain a target text.
As shown in fig. 4, the searching unit 33 includes:
a sentence dividing module 331, configured to perform sentence dividing processing on the original text and the text to be marked;
Sentence division is performed separately on the original text and the text to be marked; that is, each is divided into a plurality of independent clauses according to the punctuation marks (comma, period, semicolon, and so on) it contains. The division can be carried out with a segmentation tool. This embodiment does not limit the tool, which can be any existing segmentation tool, such as an NLTK text segmenter, a TestPhrase segmentation tool, and the like.
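As a rough illustration of punctuation-based sentence division, the sketch below uses a regular expression rather than any of the tools named above. The function name `split_clauses` and the exact delimiter set (ASCII and full-width comma, period and semicolon) are assumptions; a real deployment would likely need more delimiters and special handling for cases such as decimal points.

```python
import re
from typing import List

# A clause is a run of non-delimiter characters ending in one delimiter,
# or a trailing run with no delimiter at the end of the text.
_CLAUSE = re.compile(r"[^,.;，。；]*[,.;，。；]|[^,.;，。；]+$")


def split_clauses(text: str) -> List[str]:
    """Split text into clauses, keeping each clause's trailing punctuation."""
    return [m.group(0) for m in _CLAUSE.finditer(text) if m.group(0).strip()]
```

Keeping the delimiter attached to each clause means the clauses can later be concatenated back into text without losing punctuation.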
The matching module 332 is configured to traverse each clause in the original text, and perform similarity matching with a first clause in the text to be marked;
an ending module 333, configured to end similarity matching for the first clause in the text to be marked if an initial matching clause that matches the first clause appears in the original text;
The first clause of the original text that matches the first clause is determined to be the initial matching clause, and similarity matching for the first clause ends. The initial matching clause is the content in the original text corresponding to the first clause.
The matching module 332 is further configured to traverse each remaining clause in the original text from the initial matching clause, and perform similarity matching with the last clause in the text to be marked;
the ending module 333 is further configured to end similarity matching for the last clause if a final matching clause that matches the last clause in the text to be marked appears in the original text;
Starting from the initial matching clause, the remaining clauses of the original text are matched against the last clause in order. The first clause of the original text that matches the last clause is determined to be the final matching clause, and similarity matching for the last clause ends. The final matching clause is the content in the original text corresponding to the last clause.
A calculating module 334, configured to calculate a total number of clauses from an initial matching clause to an end matching clause, and determine the total number of clauses as a first total number of clauses;
After the original text content corresponding to the first clause and the last clause is determined, the content from the initial matching clause to the final matching clause can provisionally be taken as the original text content corresponding to the text to be marked. In practice, however, other clauses of the original text located after the initial matching clause may also match the first clause. Therefore, to determine whether the obtained initial matching clause really belongs to the original text content corresponding to the text to be marked, a further check is made by calculating the total number of clauses from the initial matching clause to the final matching clause.
A determining module 335, configured to determine, if the first clause total number is equal to the total number of clauses in the text to be marked and processed, the clause between the starting matching clause and the ending matching clause, the starting matching clause, and the ending matching clause as the original text content corresponding to the text to be marked and processed.
Before and after the original text is analyzed, the quantity of the clauses of the original text content corresponding to the text to be marked is the same as that of the clauses in the text to be marked. Therefore, if the total number of clauses from the initial matching clause to the final matching clause, namely the total number of the first clause, is equal to the total number of clauses in the text to be marked, the obtained initial matching clause can be judged to be the original text content really corresponding to the text to be marked. Therefore, the clauses between the initial matching clause and the final matching clause, the initial matching clause and the final matching clause can be determined as the original text content corresponding to the text to be marked.
As shown in fig. 4, the apparatus further includes:
the matching unit 35 is configured to traverse all clauses after the last determined initial matching clause in the original text and perform similarity matching between the first clause and the last clause again if the total number of the first clause is not equal to the total number of clauses in the text to be marked;
a first determining unit 36, configured to re-determine a new starting matching clause, a new ending matching clause, and a new first clause total number;
an ending unit 37, configured to end similarity matching for the first clause and the last clause once the new first total number of clauses is equal to the total number of clauses in the text to be marked;
and a second determining unit 38, configured to determine a clause between the new starting matching clause and the new ending matching clause, the new starting matching clause, and the new ending matching clause as the original text content corresponding to the text to be marked.
The matching module 332 is configured to:
calculating the editing distance between the first clause or the last clause and a clause subjected to similarity matching in the original text based on an editing distance function Levenshtein;
The first clause or the last clause, together with one clause of the original text, is passed as the arguments of the Levenshtein function, which returns the edit distance between them. The edit distance is the minimum number of edit operations required to convert one character string into the other, where the edit operations are replacing one character with another, inserting one character, and deleting one character.
Calculating the character string length of the first clause or the last clause;
determining the ratio of the editing distance to the character string length of the first clause or the last clause as a matching result corresponding to the clause subjected to similarity matching;
and if the matching result is greater than or equal to a preset threshold value, determining that the clause subjected to similarity matching in the original text is matched with the first clause or the last clause.
The preset threshold may be set freely according to actual requirements, for example to 0.95, 0.97, or another value.
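The edit-distance matching can be sketched as follows. The `levenshtein` function is a standard dynamic-programming implementation, not a call into any particular library. One point is an explicit assumption of this sketch: the text above compares the ratio of edit distance to string length directly with the threshold, whereas the hypothetical `clause_matches` below compares `1 - ratio` with the threshold, so that a higher value means a closer match and thresholds such as 0.95 select near-identical clauses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character replacements, insertions and deletions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


def clause_matches(candidate: str, target: str, threshold: float = 0.95) -> bool:
    """Assumed normalization: similarity = 1 - distance / len(target)."""
    if not target:
        return candidate == target
    similarity = 1.0 - levenshtein(candidate, target) / len(target)
    return similarity >= threshold
```

Here `target` stands for the first or last clause of the text to be marked and `candidate` for a clause of the original text; for example, `levenshtein("kitten", "sitting")` is 3.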
As shown in fig. 4, the apparatus further includes:
the dividing unit 39 is configured to divide the text to be marked into a plurality of sub-texts to be marked if the text to be marked that is parsed from the original text is discontinuous text content, where the content of each sub-text to be marked is continuous;
the searching unit 33 is further configured to search, based on a text similarity algorithm, for the original text content corresponding to each sub-text to be marked;
the splicing unit 34 is further configured to splice the text in the original text other than the original text content corresponding to the sub-texts to be marked with each marked sub-text, so as to obtain a target text.
The marking processing unit 32 is further configured to:
highlight the preset character strings in the text to be marked.
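A minimal sketch of highlighting preset character strings is given below. The function name `highlight` and the HTML-style `<mark>` tags are assumptions chosen for illustration only; the embodiment leaves the form of the mark (color, font type, and so on) open.

```python
import re
from typing import Iterable


def highlight(
    text: str,
    preset: Iterable[str],
    open_tag: str = "<mark>",
    close_tag: str = "</mark>",
) -> str:
    """Wrap every occurrence of a preset string in the given tags."""
    # Longest strings first, so a longer preset string wins over its own prefix.
    terms = sorted((t for t in preset if t), key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)
```

Because marking inserts extra characters, the marked text no longer matches the original character for character, which is one reason the corresponding original content is located by similarity rather than by exact string matching.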
The text processing device provided by the embodiment of the invention searches, during text processing, for the original text content corresponding to the text to be marked based on a text similarity algorithm. Compared with the prior art, because the original text content is searched for based on a similarity algorithm, it can still be found even if the parsed text to be marked differs from the corresponding original text content: such differences are usually small (for example, some characters modified, or some characters without actual meaning deleted), so the two remain similar to a certain extent and the corresponding content can be located in the original text by the similarity algorithm. The text processing method provided by the invention therefore effectively avoids the situation in the prior art in which the original text content corresponding to the text to be marked cannot be found by character string matching.
The text processing device comprises a processor and a memory, the parsing unit 31, the marking processing unit 32, the searching unit 33, the splicing unit 34 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels can be set, and the accuracy of the analysis result required by the user is improved by adjusting the kernel parameters.
The memory may include volatile memory in the form of a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, which, when executed by a processor, implements the method of text processing.
The embodiment of the invention provides a processor, which is used for running a program, wherein the method for processing texts is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps: after the original text is obtained, analyzing the text to be marked from the original text; marking the text to be marked; searching original text content corresponding to the text to be marked and processed based on a text similarity algorithm; and splicing the texts except the original text content corresponding to the text to be marked in the original text with the marked text to be marked to obtain the target text.
Further, the searching for the original text content corresponding to the text to be marked based on the text similarity algorithm includes:
sentence division processing is respectively carried out on the original text and the text to be marked;
traversing each clause in the original text, and respectively performing similarity matching with a first clause in the text to be marked;
if an initial matching clause matched with a first clause in the text to be marked and processed appears in the original text, ending the similarity matching of the first clause;
traversing each remaining clause in the original text from the initial matching clause, and respectively performing similarity matching with the last clause in the text to be marked;
if a final matching clause matched with the last clause in the text to be marked appears in the original text, ending the similarity matching of the last clause;
calculating the total number of clauses from the initial matching clause to the final matching clause, and determining the total number as the first total number of clauses;
and if the total number of the first clauses is equal to the total number of the clauses in the text to be marked and processed, determining the clauses between the initial matching clause and the final matching clause, the initial matching clause and the final matching clause as the original text content corresponding to the text to be marked and processed.
Further, the method further comprises:
if the total number of the first clauses is not equal to the total number of the clauses in the text to be marked, traversing all clauses after the initial matching clause determined for the last time in the original text, and re-performing similarity matching between the first clause and the last clause;
re-determining new initial matching clauses, termination matching clauses and the total number of new first clauses;
ending the similarity matching between the first clause and the last clause until the total number of the new first clause is equal to the total number of the clauses in the text to be marked;
and determining the clauses between the new initial matching clause and the new termination matching clause, the new initial matching clause and the new termination matching clause as the original text content corresponding to the text to be marked.
Further, the similarity matching of the first clause or the last clause in the text to be marked comprises:
calculating the editing distance of the first clause or the last clause and the clause subjected to similarity matching in the original text based on an editing distance function Levenshtein;
calculating the character string length of the first clause or the last clause;
determining the ratio of the editing distance to the character string length of the first clause or the last clause as a matching result corresponding to the clause for similarity matching;
and if the matching result is greater than or equal to a preset threshold value, determining that the clause subjected to similarity matching in the original text is matched with the first clause or the last clause.
Further, the method further comprises:
if the text to be marked is analyzed from the original text and is discontinuous text content, dividing the text to be marked into a plurality of sub texts to be marked, wherein the content in each sub text to be marked is continuous;
respectively searching the original text content corresponding to each sub-text to be marked based on a text similarity algorithm;
and splicing the text in the original text other than the original text content corresponding to the sub-texts to be marked with each marked sub-text to obtain the target text.
Further, the marking the text to be marked comprises:
and highlighting the preset character strings in the text to be marked.
The device in the embodiment of the invention can be a server, a PC, a PAD, a mobile phone and the like.
An embodiment of the present invention further provides a computer program product, which, when executed on a data processing apparatus, is adapted to execute a program that initializes the following method steps: after the original text is obtained, analyzing the text to be marked from the original text; marking the text to be marked; searching original text content corresponding to the text to be marked and processed based on a text similarity algorithm; and splicing the texts except the original text content corresponding to the text to be marked in the original text with the marked text to be marked to obtain the target text.
Further, the searching for the original text content corresponding to the text to be marked based on the text similarity algorithm includes:
the original text and the text to be marked are respectively processed by sentence division;
traversing each clause in the original text, and respectively performing similarity matching with a first clause in the text to be marked;
if an initial matching clause matched with a first clause in the text to be marked and processed appears in the original text, ending the similarity matching of the first clause;
traversing each remaining clause in the original text from the initial matching clause, and respectively performing similarity matching with the last clause in the text to be marked;
if a final matching clause matched with the last clause in the text to be marked appears in the original text, ending the similarity matching of the last clause;
calculating the total number of clauses from the initial matching clause to the final matching clause, and determining the total number of the clauses as the total number of the first clause;
and if the total number of the first clauses is equal to the total number of the clauses in the text to be marked and processed, determining the clauses between the initial matching clause and the final matching clause, the initial matching clause and the final matching clause as the original text content corresponding to the text to be marked and processed.
Further, the method further comprises:
if the total number of the first clauses is not equal to the total number of the clauses in the text to be marked, traversing all clauses after the initial matching clause determined for the last time in the original text, and re-performing similarity matching between the first clause and the last clause;
re-determining new initial matching clauses, termination matching clauses and the total number of new first clauses;
ending the similarity matching between the first clause and the last clause until the total number of the new first clauses is equal to the total number of the clauses in the text to be marked;
and determining the clauses between the new initial matching clause and the new termination matching clause, the new initial matching clause and the new termination matching clause as the original text content corresponding to the text to be marked and processed.
Further, the matching of the similarity of the first clause or the last clause in the text to be marked comprises:
calculating the editing distance of the first clause or the last clause and the clause subjected to similarity matching in the original text based on an editing distance function Levenshtein;
calculating the character string length of the first clause or the last clause;
determining the ratio of the editing distance to the character string length of the first clause or the last clause as a matching result corresponding to the clause subjected to similarity matching;
and if the matching result is greater than or equal to a preset threshold value, determining that the clause subjected to similarity matching in the original text is matched with the first clause or the last clause.
Further, the method further comprises:
if the text to be marked is analyzed from the original text and is discontinuous text content, dividing the text to be marked into a plurality of sub texts to be marked, wherein the content in each sub text to be marked is continuous;
respectively searching the original text content corresponding to each sub-text to be marked based on a text similarity algorithm;
and splicing the text in the original text other than the original text content corresponding to the sub-texts to be marked with each marked sub-text to obtain the target text.
Further, the marking the text to be marked comprises:
highlighting and marking the preset character strings in the text to be marked.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus comprising the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (8)

1. A method of text processing, the method comprising:
after the original text is obtained, analyzing the text to be marked from the original text;
marking the text to be marked;
searching original text content corresponding to the text to be marked based on a text similarity algorithm;
splicing texts except the original text content corresponding to the text to be marked in the original text with the marked text to be marked to obtain a target text;
wherein searching for the original text content corresponding to the text to be marked based on the text similarity algorithm comprises:
performing clause segmentation on the original text and on the text to be marked respectively;
traversing each clause in the original text, and performing similarity matching of each against the first clause in the text to be marked;
if an initial matching clause that matches the first clause in the text to be marked appears in the original text, ending the similarity matching for the first clause;
traversing each remaining clause in the original text starting from the initial matching clause, and performing similarity matching of each against the last clause in the text to be marked;
if a final matching clause that matches the last clause in the text to be marked appears in the original text, ending the similarity matching for the last clause;
calculating the total number of clauses from the initial matching clause to the final matching clause, and determining that number as a first total number of clauses;
and if the first total number of clauses is equal to the total number of clauses in the text to be marked, determining the initial matching clause, the final matching clause, and the clauses between them as the original text content corresponding to the text to be marked.
2. The method of claim 1, further comprising:
if the first total number of clauses is not equal to the total number of clauses in the text to be marked, traversing all clauses in the original text after the most recently determined initial matching clause, and performing the similarity matching for the first clause and the last clause again;
re-determining a new initial matching clause, a new final matching clause and a new first total number of clauses;
repeating until the new first total number of clauses is equal to the total number of clauses in the text to be marked, and then ending the similarity matching for the first clause and the last clause;
and determining the new initial matching clause, the new final matching clause, and the clauses between them as the original text content corresponding to the text to be marked.
3. The method of claim 1, wherein performing similarity matching with the first clause or the last clause in the text to be marked comprises:
calculating, based on the Levenshtein edit distance function, the edit distance between the first clause or the last clause and the clause in the original text that is being similarity-matched;
calculating the character string length of the first clause or the last clause;
determining the ratio of the edit distance to the character string length of the first clause or the last clause as the matching result for the clause being similarity-matched;
and if the matching result is greater than or equal to a preset threshold value, determining that the clause being similarity-matched in the original text matches the first clause or the last clause.
4. The method according to any one of claims 1-3, further comprising:
if the text to be marked that is analyzed from the original text is discontinuous text content, dividing the text to be marked into a plurality of sub-texts to be marked, wherein the content in each sub-text to be marked is continuous;
respectively searching for the original text content corresponding to each sub-text to be marked based on a text similarity algorithm;
and splicing the texts in the original text, other than the original text content corresponding to the sub-texts to be marked, with each marked sub-text to obtain the target text.
5. The method according to any one of claims 1-3, wherein the marking the text to be marked comprises:
highlighting preset character strings in the text to be marked.
6. An apparatus for text processing, the apparatus comprising:
the analysis unit is used for analyzing the text to be marked from the original text after the original text is obtained;
the marking processing unit is used for marking the text to be marked;
the searching unit is used for searching the original text content corresponding to the text to be marked based on a text similarity algorithm;
the splicing unit is used for splicing texts in the original texts except the original text content corresponding to the text to be marked with the text to be marked after the marking treatment to obtain a target text;
the search unit includes:
the sentence dividing module is used for respectively carrying out sentence dividing processing on the original text and the text to be marked;
the matching module is used for traversing each clause in the original text and respectively carrying out similarity matching with a first clause in the text to be marked;
the ending module is used for ending the similarity matching for the first clause if an initial matching clause that matches the first clause in the text to be marked appears in the original text;
the matching module is also used for traversing each remaining clause in the original text from the initial matching clause and respectively carrying out similarity matching with the last clause in the text to be marked;
the ending module is also used for ending the similarity matching for the last clause if a final matching clause that matches the last clause in the text to be marked appears in the original text;
the calculating module is used for calculating the total number of clauses from the initial matching clause to the final matching clause and determining that number as a first total number of clauses;
and the determining module is used for determining the initial matching clause, the final matching clause, and the clauses between them as the original text content corresponding to the text to be marked if the first total number of clauses is equal to the total number of clauses in the text to be marked.
7. A storage medium comprising a stored program, wherein the program when executed controls an apparatus on which the storage medium is located to perform the method of text processing according to any one of claims 1 to 5.
8. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of text processing according to any one of claims 1 to 5.
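
The clause-level search recited in claims 1 to 3, together with the splicing step of claim 1 and the highlighting of claim 5, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`levenshtein`, `clause_matches`, `split_clauses`, `find_span`, `splice`), the punctuation-based clause splitter, the 0.8 threshold, and the `[[...]]` highlight marker are all hypothetical. Claim 3 defines the matching result as the ratio of edit distance to clause length; the sketch additionally assumes that this ratio is converted to a similarity of 1 minus the ratio before comparison with the threshold, so that identical clauses score 1.0.

```python
import re

# Assumption: a clause ends at ., !, ?, ; or the Chinese equivalents.
CLAUSE_RE = re.compile(r"[^.!?;。！？；]+[.!?;。！？；]?")


def split_clauses(text):
    """Hypothetical clause splitter; the claims do not specify one."""
    return [c.strip() for c in CLAUSE_RE.findall(text) if c.strip()]


def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def clause_matches(target, candidate, threshold=0.8):
    # Ratio from claim 3: edit distance / length of the first or last clause.
    ratio = levenshtein(target, candidate) / max(len(target), 1)
    # Assumption of this sketch: compare 1 - ratio against the threshold.
    return 1.0 - ratio >= threshold


def find_span(original, marked, threshold=0.8):
    """Return (start, end) clause indices in `original`, or None if not found."""
    if not marked:
        return None
    start_from = 0
    while start_from < len(original):
        # Claim 1: first original clause matching the first marked clause.
        start = next((i for i in range(start_from, len(original))
                      if clause_matches(marked[0], original[i], threshold)), None)
        if start is None:
            return None
        # Claim 1: from there on, first clause matching the last marked clause.
        end = next((j for j in range(start, len(original))
                    if clause_matches(marked[-1], original[j], threshold)), None)
        if end is not None and end - start + 1 == len(marked):
            return start, end
        # Claim 2: clause counts differ, so retry after the last initial match.
        start_from = start + 1
    return None


def splice(original_text, marked_text, highlight):
    """Replace the matched span with the marked clauses, highlighting a string."""
    original = split_clauses(original_text)
    marked = split_clauses(marked_text)
    span = find_span(original, marked)
    if span is None:
        return original_text
    start, end = span
    # Claim 5 style marking; the [[...]] wrapper is a stand-in for highlighting.
    processed = [c.replace(highlight, "[[" + highlight + "]]") for c in marked]
    return " ".join(original[:start] + processed + original[end + 1:])


if __name__ == "__main__":
    source = ("The court heard the case. The defendant paid 500 yuan. "
              "The plaintiff agreed. The case is closed.")
    extracted = "The defendant paid 500 yuan! The plaintiff agreed."
    print(splice(source, extracted, "500 yuan"))
```

Matching only the first and last clauses and then checking the clause count keeps the number of edit-distance computations low compared with matching every clause of the text to be marked. Joining the clauses with a single space is a simplification of this sketch and does not preserve the original whitespace.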
CN201710936432.7A 2017-10-10 2017-10-10 Text processing method and device Active CN109657202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710936432.7A CN109657202B (en) 2017-10-10 2017-10-10 Text processing method and device


Publications (2)

Publication Number Publication Date
CN109657202A CN109657202A (en) 2019-04-19
CN109657202B true CN109657202B (en) 2022-10-28

Family

ID=66108723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710936432.7A Active CN109657202B (en) 2017-10-10 2017-10-10 Text processing method and device

Country Status (1)

Country Link
CN (1) CN109657202B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395851B (en) * 2020-11-18 2024-12-06 北京北大英华科技有限公司 A text comparison method, device, computer equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11282845A (en) * 1998-03-30 1999-10-15 Brother Ind Ltd Machine translation apparatus and computer-readable recording medium recording machine translation processing program
CN105719217A (en) * 2016-01-25 2016-06-29 山东海博科技信息系统有限公司 Legal medical expert injury identification management method and system
WO2016180268A1 (en) * 2015-05-13 2016-11-17 阿里巴巴集团控股有限公司 Text aggregate method and device
CN106598997A (en) * 2015-10-19 2017-04-26 北京国双科技有限公司 Method and device for computing membership degree of text subject


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
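Research and Application of Text Verification Technology Based on Edit Distance Similarity (基于编辑距离相似度的文本校验技术研究与应用); He Feng et al.; Journal of Spacecraft TT&C Technology (《飞行器测控学报》); 2015-08-12 (No. 04); full text *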



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

GR01 Patent grant