CN112183032A - Text processing method and device - Google Patents


Info

Publication number
CN112183032A
Authority
CN
China
Prior art keywords
text
processed
chapter
sentence
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011136241.0A
Other languages
Chinese (zh)
Inventor
任宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co Ltd
Original Assignee
Dingfu Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co Ltd filed Critical Dingfu Intelligent Technology Co Ltd
Priority to CN202011136241.0A priority Critical patent/CN112183032A/en
Publication of CN112183032A publication Critical patent/CN112183032A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a text processing method and a text processing device. The text processing method comprises the following steps: deleting all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark; adding the target symbols to the beginning part and the end part of the text to be processed, respectively, according to a first rule; and dividing a body part included in the text to be processed into at least one chapter according to a classification model, and adding the line feed symbol to the end of each chapter in the at least one chapter, wherein the body part is the remainder of the text to be processed other than the beginning part and the end part. Some embodiments of the application process the beginning part, the end part and the body part of the text to be processed into text with a standard format, which can effectively solve the problem that a document is difficult to read, or difficult to use for machine learning, because of its non-standard format.

Description

Text processing method and device
Technical Field
The present application relates to the field of natural language processing, and in particular, to a text processing method and apparatus.
Background
A referee document (that is, a written court judgment or ruling) is a document with strict wording and a standard format. For example, in a standard referee document, the beginning part includes the judging court, the document type and the case number; the end part includes the trial personnel, the trial date and the bookmarker (that is, the court clerk); and each natural paragraph in the rest of the document is presented as one line, where a line refers to text that ends with a line feed symbol. The standard format not only makes the document easy to read but also facilitates machine analysis.
However, when standard legal documents undergo format conversion, collection or entry (for example, HTML-based format transcoding, OCR-based format transcoding, or non-standard entry of individual documents), the formats of a large number of legal documents (for example, referee documents) become disordered (for example, line feeds are lost or illegal line feeds are introduced). This not only makes the legal documents difficult to read but also degrades machine analysis and learning based on them.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for processing a document, which can convert an irregular document into a document with a standard format, so as to improve the effect of machine learning or analysis of legal documents based on the standard format.
In a first aspect, some embodiments of the present application provide a text processing method, including: deleting all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark; respectively adding the target symbols to the beginning part and the end part of the text to be processed according to a first rule; dividing a body part included in the text to be processed into at least one chapter according to a classification model, and adding the line feed symbol to the tail of each chapter in the at least one chapter, wherein the body part is the rest part of the text to be processed except for the beginning part and the end part.
Some embodiments of the application process the beginning part, the end part and the body part of the text to be processed into text with a standard format, which can effectively solve the problem that the text is difficult to read, or difficult to use for machine learning, because of its non-standard format.
In some embodiments, before deleting all target symbols included in the text to be processed, the text processing method further includes: confirming that the input text belongs to the text to be processed.
Some embodiments of the application analyze the input text to confirm whether it is an irregular text, and then perform the formatting processing of adding target symbols only on the screened text in an irregular format (i.e., the text to be processed).
In some embodiments, the confirming input text belongs to the text to be processed, including: confirming that the document type of the input text belongs to a target document type, wherein the target document type comprises: a decision book, an arbitration book, a mediation book, or an execution book; confirming that the input text meets a preset condition, wherein the preset condition comprises confirming that the format of at least one part of the head part and the tail part of the input text meets a first set condition, or confirming that the body part of the input text meets a second set condition.
Some embodiments of the application obtain the text to be processed by identifying non-standard formats in input text of the screened target document types, so that only input text that actually needs processing is formatted. This improves the processing efficiency of the system and avoids both the waste of resources caused by processing standard documents that need no processing and the resulting delay in processing the non-standard documents (i.e., the text to be processed).
In some embodiments, the confirming that the input text satisfies a preset condition includes: deleting the beginning part and the end part from the input text according to a rule vocabulary to obtain the body part; and confirming that the ratio of the number of lines in the body part that have no punctuation mark before the line feed symbol to the total number of lines of the body part is greater than a set first threshold; or confirming that the ratio of the total number of lines to the total number of words included in the body part is greater than a second threshold.
Some embodiments of the present application may improve the accuracy of identifying the text to be processed by identifying the problem of the irregular format of the text content through a quantization method (e.g., setting a threshold).
In some embodiments, the text to be processed comprises a decision book or an adjudication, the beginning part comprises a judging court, a document type and a case number, and the end part comprises trial personnel, a trial date and a bookmarker; the first set condition includes that the line feed symbol after at least one of the judging court, the document type, the case number, the trial date and the bookmarker is non-standard.
Some embodiments of the application screen the text to be processed by identifying missing or redundant line feed symbols (i.e., non-standard line feed symbols) in the beginning part and the end part of the text.
In some embodiments, the text to be processed is a decision book or an adjudication, the beginning part of the decision book or the adjudication includes a judging court, a document type and a case number, and the end part of the decision book or the adjudication includes trial personnel, a trial date and a bookmarker; the adding the target symbols to the beginning part and the end part of the text to be processed, respectively, according to the first rule comprises: identifying the judging court, the document type, the case number, the trial personnel, the trial date and the bookmarker according to a plurality of regular expressions, and adding the line feed symbol after each of the identified items.
Some embodiments of the application use a plurality of regular expressions to normalize beginning and end parts that lack line feeds or contain too many line feeds, thereby obtaining a more standard beginning part and end part of the text.
In some embodiments, the classification model comprises sentence classification algorithm models in one-to-one correspondence with multiple document types, the document types including: a first-instance decision book, a second-instance decision book, a mediation book or an execution book; the dividing a body part included in the text to be processed into at least one chapter according to a classification model, and adding the target symbol to the end of each chapter in the at least one chapter, includes: confirming the document type corresponding to the text to be processed; confirming the target sentence classification algorithm model to be used according to the document type; identifying, according to the target sentence classification algorithm model, the starting sentence of each chapter from a plurality of sentences obtained by segmenting the body part; and dividing the chapters according to the starting sentences, and adding the line feed symbol to the end of each chapter.
Some embodiments of the present application may divide the chapters by identifying the starting sentence of each chapter, thereby increasing the speed and accuracy of chapter identification.
In some embodiments, the dividing the chapters according to the starting sentence and adding the line feed symbol to the end of each chapter includes: when it is confirmed that a first sentence is identified as a starting sentence of a first section and a second sentence is identified as a starting sentence of a second section adjacent to the first section, it is confirmed that the first sentence and all sentences located between the first sentence and the second sentence belong to the first section.
Some embodiments of the present application treat a starting sentence together with all sentences between it and the next starting sentence as one chapter, which improves the completeness of the content of each identified chapter.
In some embodiments, the identifying the starting sentence of each chapter from the plurality of sentences segmented from the body part according to the target sentence classification algorithm model includes: when a third sentence and a fourth sentence positioned after the third sentence are both identified as the starting sentences of a third section, confirming that the third sentence is the starting sentence of the third section.
Some embodiments of the present application provide a way to handle the case in which two starting sentences are identified for the same chapter, so that the identified chapters are distinct and non-overlapping.
In some embodiments, the text processing method further comprises: confirming the name of each chapter and adding a corresponding chapter name for each chapter; providing the beginning portion, the ending portion, and the body portion for further machine analysis of the beginning portion, the ending portion, and the body portion.
Some embodiments of the present application add a chapter name to each chapter, which may speed up subsequent machine learning.
In a second aspect, some embodiments of the present application provide a text processing apparatus, including: a symbol processing module configured to delete all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark; a head and tail processing module configured to add the target symbols to the beginning part and the end part of the text to be processed according to a first rule; and a text processing module configured to divide a body part included in the text to be processed into at least one chapter according to a classification model, and add the line feed symbol to the end of each chapter in the at least one chapter, wherein the body part is the remainder of the text to be processed other than the beginning part and the end part.
In a third aspect, some embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect described above.
In a fourth aspect, some embodiments of the present application provide an information processing apparatus, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, can implement the text processing method according to the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a schematic diagram of a text processing system according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a text processing method provided in an embodiment of the present application;
FIG. 3 is a block diagram of a text processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an information processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
A referee document is a document with strict wording and a standard format. However, because documents are transcoded between different formats, or are recognized by Optical Character Recognition (OCR), or are entered in a non-standard way, some referee documents have a disordered format, which degrades subsequent machine learning or analysis based on those documents.
The difference between a standard referee document and a referee document made irregular by wrong line feed symbols is illustrated below by comparing two referee documents.
(I) An example of a standard referee document is as follows:
national court of certain district of Nanjing City
Book for civil affairs
(2011) First number of Pumin
Yuanjing Nanjing company, residence place in Nanjing City, a certain way, a certain number and a certain building.
Legal representative Ni is a manager.
The proxy agent is attentive to a law firm attorney in Jiangsu.
The quilt is invar and grows in a certain month and a certain day in a certain year.
In the case of dispute between a company in south-Yangjing and a contract of the defended people, the company in south-Yangjing of the original government makes a withdrawal application to the company in a month and a day of the year.
The institute believes that the withdrawal application of a company in Nanjing, as originally reported, does not violate the legal requirements, and is in compliance with the law to exercise the lawsuit rights and shall grant the application. According to the stipulations of the first one hundred thirty items and the first one hundred forty items of the first one hundred and the fifth item of the litigation law of the people's republic of China, the following are adjudged:
a company, Yuantong Nanjing, was granted the withdrawal of prosecution.
The case acceptance fee is reduced by half and is charged by 25 yuan, which is borne by a company of Nanjing, Yuanzhang.
The agent judges slowly
A certain day of a certain month of a certain year
Tan stackers in Tan
(II) An irregular referee document produced by converting a referee document from HTML format to text format (i.e., txt format) is as follows:
national institute of civil affairs adjudication book (2011) proma first national word of south Beijing, a management company Limited of Naja, and a certain road and a certain building in a certain area of Nanjing.
Legal representative Ni is the manager.
The delegate agent xus a particular law firm in Jiangsu.
The quilt is invar and grows in a certain month and a certain day in a certain year.
In the case of dispute between a company in south-Yangjing and a contract of the defended people, the company in south-Yangjing of the original government makes a withdrawal application to the company in a month and a day of the year.
The institute believes that the withdrawal application of a company in Nanjing, as originally reported, does not violate the legal requirements, and is in compliance with the law to exercise the lawsuit rights and shall grant the application. According to the stipulations of the first one hundred thirty items and the first one hundred forty items of the first one hundred and the fifth item of the litigation law of the people's republic of China, the following are adjudged:
a company, Yuantong Nanjing, was granted the withdrawal of prosecution.
The case acceptance fee is reduced by half and is charged by 25 yuan, which is borne by a company of Nanjing, Yuanzhang. Takeman who stouts in a certain time and a certain month in a certain year
It can be seen from the above two referee documents that, in the standard referee document, the judging court, the document type and the case number in the beginning part are each presented on a separate line, the trial personnel, the trial date and the bookmarker in the end part are each presented on a separate line, and every natural paragraph of the remaining content (i.e., the body) is presented as one line, where a line refers to text that ends with a line feed symbol. The standard format not only makes the document easy to read but also facilitates machine analysis. However, a referee document converted from HTML format to txt format, or produced by OCR processing, contains errors such as lost line feeds and illegal line feeds. For example, some referee documents lose all of the line feed symbols in the middle of the text, so that the entire text ends up with only one line feed symbol. Lost line feeds not only hinder reading but also obstruct machine analysis of the referee document and cause errors. Illegal line feeds are generally caused by Optical Character Recognition (OCR) programs that recognize characters in an image without applying natural language post-processing to the recognition results. Illegal line feeds can cause information extraction and matching to fail.
The technical solutions of the embodiments of the present application are explained below with reference to the drawings. Compared with approaches that do not adopt these technical solutions, they can effectively solve the problem of symbol confusion, thereby improving the effect of machine learning or the accuracy of text information extraction and matching based on natural language processing.
Referring to fig. 1, fig. 1 is a block diagram of a text processing system according to some embodiments of the present application. The text processing system of fig. 1 includes an input device 100 that provides the input text requiring formatting, a processing server 200 that normalizes the input text received from the input device 100, and a training server 300 that generates classification models and provides them to the processing server 200. As shown in fig. 1, the processing server 200 also feeds the processed text file in the standard format back to the input device 100; the input device 100 and the processing server 200 are therefore interconnected through a network. The network includes, but is not limited to, a mobile communication access network (e.g., a 4G or 5G communication network) and a core network.
The input device 100, the processing server 200, and the training server 300 may each be a computing device that includes a processor and memory.
It should be noted that fig. 1 only shows a schematic composition diagram of the text processing system according to some embodiments of the present application, and it is understood that, in some embodiments of the present application, the functions of at least two of the input device 100, the processing server 200, and the training server 300 may be integrated into the same hardware entity. That is, in some embodiments of the present application, the input device 100 and the processing server 200 are located on the same hardware entity, and in still other embodiments of the present application, the processing server 200 and the training server 300 belong to the same hardware entity.
The text processing method performed on the processing server 200 is exemplarily set forth below in connection with fig. 2.
As shown in fig. 2, some embodiments of the present application provide a text processing method, including:
S110, deleting all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark;
S120, respectively adding the target symbols to the beginning part and the end part of the text to be processed according to a first rule;
S130, dividing a body part included in the text to be processed into at least one chapter according to a classification model, and adding the line feed symbol to the end of each chapter in the at least one chapter, wherein the body part is the remainder of the text to be processed other than the beginning part and the end part.
The following exemplarily illustrates S110, S120, and S130 included in the text processing method implemented by the present application.
The target symbol in S110 of some embodiments of the present application includes a line feed symbol or one or more punctuation marks. For example, in some embodiments, the target symbol is a line feed symbol, and S110 specifically deletes all line feed symbols included in the text to be processed. In other embodiments, the target symbol is a period, and S110 deletes all periods included in the text to be processed. In still other embodiments, the target symbol includes both a line feed symbol and a period (i.e., one kind of punctuation mark), and S110 deletes all line feed symbols and periods included in the text to be processed. For example, when the target symbol is a line feed symbol and the irregular text has only one line because the other line feed symbols have been lost, executing S110 deletes the last remaining line feed symbol in the full text.
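As a minimal sketch of S110, the deletion can be written as a plain string replacement. The function name `delete_target_symbols` and the use of `"\n"` as the line feed symbol are assumptions made for illustration; the application does not prescribe a particular implementation.

```python
def delete_target_symbols(text, target_symbols=("\n",)):
    """Remove every occurrence of each target symbol from the text (S110)."""
    for symbol in target_symbols:
        text = text.replace(symbol, "")
    return text


# Target symbol is the line feed only.
no_breaks = delete_target_symbols("Plaintiff A.\nDefendant B.\n")

# Target symbols are the line feed and the period.
no_breaks_or_periods = delete_target_symbols(
    "Plaintiff A.\nDefendant B.\n", target_symbols=("\n", ".")
)
```

Passing the symbols as a tuple keeps the three variants described above (line feed only, period only, or both) behind one call.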
The first rule of S120 in some embodiments of the present application includes a plurality of regular expressions to be replaced and replacement expressions in one-to-one correspondence with them; Table 1 below shows part of the regular expressions for the beginning part of a decision book. A regular expression to be replaced is composed of keywords extracted from the beginning part and the end part together with wildcards, and a replacement expression is composed of the standard wording extracted from the beginning part and the end part together with standard punctuation marks. The regular expressions to be replaced identify the irregular descriptions in the beginning part and the end part of the text, and each identified irregular description is replaced with the corresponding replacement expression in the standard format. For example, when the target symbol is a line feed symbol, the beginning part after replacement contains standard line feed symbols, that is, line feed symbols in the proper number and at the proper positions. For example, the string to be replaced in row 1 of Table 1 is: (line \ s \ s \, and the result after replacement is: \1 followed by the word for "adjudication" and \n, where the characters inside the parentheses of the string to be replaced form one group, and the "\1" in the result after replacement refers to the first group. In row 1 of Table 1, "\1" refers to the two-character word for "administrative". Referring to the first group as "\1" is the backreference syntax of Python regular expressions; in Java and some other tools it is written as $1.
For example, using the regular expression in the first row of Table 1, an irregularly formatted "administrative adjudication" heading becomes, after substitution:
administrative adjudication
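The replacement step can be sketched with Python's `re.sub`. The rule below is a hypothetical English stand-in, not a row of Table 1 (whose actual patterns are not recoverable here); the names `HEAD_RULES` and `normalize_head` are likewise assumptions. It only illustrates how a captured group is kept through `\1` while the standard wording and line feeds are written back.

```python
import re

# Hypothetical rule vocabulary: (pattern to be replaced, replacement).
# Group 1 captures the case category; the replacement keeps it via \1
# and puts the normalized document type on a line of its own.
HEAD_RULES = [
    (r"\s*(Administrative|Civil|Criminal)\s*Adjudication\s*",
     r"\n\1 Adjudication\n"),
]


def normalize_head(text):
    """Apply every (pattern, replacement) pair of the rule vocabulary in order."""
    for pattern, replacement in HEAD_RULES:
        text = re.sub(pattern, replacement, text)
    return text


fixed = normalize_head("Some District Court Civil   Adjudication (2011) No. 1")
print(fixed)
```

In this sketch the court name, the document type and the case number end up on three separate lines, which is the effect S120 aims for in the beginning part.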
As an example of the present application, the target symbol includes a line feed symbol, the text to be processed is a decision book or an adjudication, the beginning part of the decision book or the adjudication includes a judging court, a document type and a case number, and the end part of the decision book or the adjudication includes trial personnel, a trial date and a bookmarker; S120 comprises: identifying the judging court, the document type, the case number, the trial personnel, the trial date and the bookmarker according to a plurality of regular expressions, and adding the line feed symbol after each of the identified items.
Table 1: Part of the contents of the replacement rule vocabulary
[Table 1 appears in the original publication as images BDA0002735987870000101 and BDA0002735987870000111.]
The symbols in Table 1 have the following meanings: "\s" matches a whitespace character; "*" matches the preceding item 0 to n times; "[ ]" matches any single character listed inside the brackets; "|" means "or"; "?" matches the preceding item 0 or 1 time; "\n" matches a line feed; and "\1", "\2" and "\3" retain the content of the first, second and third groups of parentheses, respectively.
It should be noted that in some embodiments of the present application, S120 further includes correcting errors in the OCR corpus according to the rule vocabulary. For example, an "O" or "o" recognized inside a date in the OCR corpus is replaced with the digit 0 according to the rule vocabulary.
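A minimal sketch of this OCR correction follows. The date layout (`YYYY-MM-DD`) and the names `DATE_LIKE` and `fix_ocr_zeros` are assumptions chosen for illustration; the application only states that a misrecognized "O" or "o" in a date is replaced with 0.

```python
import re

# Assumed date layout: four, then one or two, then one or two characters,
# each of which is a digit or a misrecognized letter O / o.
DATE_LIKE = re.compile(r"[0-9Oo]{4}-[0-9Oo]{1,2}-[0-9Oo]{1,2}")


def fix_ocr_zeros(text):
    """Replace O/o with 0, but only inside substrings that look like dates."""
    return DATE_LIKE.sub(
        lambda m: m.group(0).replace("O", "0").replace("o", "0"), text
    )


corrected = fix_ocr_zeros("Dated 2O11-o3-05, Court of Ohio")
print(corrected)
```

Restricting the replacement to date-like substrings leaves ordinary words that contain the letter O untouched.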
In some embodiments, the classification model comprises sentence classification algorithm models in one-to-one correspondence with multiple document types, the document types including: a first-instance decision book, a second-instance decision book, a mediation book or an execution book; S130 includes: confirming the document type corresponding to the text to be processed; confirming the target sentence classification algorithm model to be used according to the document type; identifying, according to the target sentence classification algorithm model, the starting sentence of each chapter from a plurality of sentences obtained by segmenting the body part; and dividing the chapters according to the starting sentences, and adding the line feed symbol to the end of each chapter.
In order to accurately determine the sentences included in each chapter, as an example, the dividing of the chapters according to the starting sentences and the adding of the line feed symbol to the end of each chapter in S130 includes: when it is confirmed that a first sentence is identified as the starting sentence of a first chapter and a second sentence is identified as the starting sentence of a second chapter adjacent to the first chapter, confirming that the first sentence and all sentences located between the first sentence and the second sentence belong to the first chapter.
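The grouping rule can be sketched as follows, assuming the sentence classification model has already produced one boolean per sentence that marks starting sentences. The function name `split_into_chapters` and the flag list `is_start` are hypothetical; the classifier itself is outside this sketch.

```python
def split_into_chapters(sentences, is_start):
    """Group sentences into chapters and end each chapter with a line feed.

    A chapter consists of a starting sentence plus every following
    sentence up to, but not including, the next starting sentence.
    """
    chapters = []
    for sentence, start in zip(sentences, is_start):
        if start or not chapters:
            chapters.append([sentence])      # open a new chapter
        else:
            chapters[-1].append(sentence)    # continue the current chapter
    return "".join("".join(chapter) + "\n" for chapter in chapters)


body = split_into_chapters(
    ["Claim stated.", "Evidence listed.", "The court holds.", "Reasons given."],
    [True, False, True, False],
)
print(body)
```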
To avoid duplicated chapters, as an example, the identifying in S130, according to the target sentence classification algorithm model, of the starting sentence of each chapter from a plurality of sentences obtained by segmenting the body part includes: when a third sentence and a fourth sentence located after the third sentence are both identified as the starting sentence of a third chapter, confirming that the third sentence is the starting sentence of the third chapter.
In order to save resources of the processing server 200 and to process only the irregular text that actually needs processing, as shown in fig. 2, before S110 of some embodiments of the present application, the text processing method further includes: S100, confirming that the input text belongs to the text to be processed. That is, some embodiments of the present application analyze the input text to determine whether it is an irregular text, and then perform the formatting processing of adding target symbols on the screened text in an irregular format. Some embodiments of the application determine the text to be processed by identifying the document type of the input text and the non-standard format of a document of that type, which improves the processing efficiency of the system and avoids both the waste of resources caused by normalizing standard documents that need no processing and the resulting delay in processing the non-standard documents that really do need processing.
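The tie-breaking rule (keep the earlier sentence when two sentences are both labeled as the start of the same chapter) can be sketched as below. The per-sentence label list, the chapter names and the function name `keep_first_start` are hypothetical; `None` stands for a sentence that the model did not label as a starting sentence.

```python
def keep_first_start(labels):
    """Keep only the first starting sentence predicted for each chapter.

    labels[i] is the chapter name if sentence i was classified as that
    chapter's starting sentence, otherwise None.
    """
    seen = set()
    resolved = []
    for label in labels:
        if label is not None and label in seen:
            resolved.append(None)   # a later duplicate start is demoted
        else:
            if label is not None:
                seen.add(label)
            resolved.append(label)
    return resolved


resolved = keep_first_start(["facts", None, "facts", "reasoning", None])
print(resolved)
```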
The following exemplifies S100 provided by some embodiments of the present application.
In some embodiments, S100 comprises: confirming that the document type of the input text belongs to a target document type, wherein the target document type comprises a judgment, a ruling, a mediation document, or an enforcement document; and confirming that the input text meets a preset condition, wherein the preset condition comprises that the format of at least one of the beginning part and the end part of the input text meets a first set condition, or that the body part of the input text meets a second set condition. For example, the target symbols comprise line feed symbols, the text to be processed comprises judgments and rulings, the beginning part comprises the trial court, the document type, and the case number, and the end part comprises the trial personnel, the trial date, and the court clerk; the first set condition is that the line feed symbol after at least one of the trial court, the document type, the case number, the trial date, and the court clerk is non-standard, where non-standard means that the line feed symbol is missing or redundant.
In some embodiments in which the target symbol is a line feed symbol, confirming in S100 that the input text meets the preset condition includes: deleting the beginning part and the end part from the input text according to a rule word list to obtain the body part; and confirming that the ratio of the number of lines in the body part that have no punctuation mark before the line feed symbol to the total number of lines in the body part is greater than a set first threshold, or confirming that the ratio of the total number of lines to the total number of words in the body part is greater than a second threshold. For example, the first threshold is one tenth and the second threshold is one fiftieth; that is, after the beginning part and the end part are removed using the rule word list, either (number of lines with no punctuation before the line feed)/(total number of lines) > 1/10, or (total number of lines)/(total number of words) > 1/50. When the target document type of the input text is a judgment or a ruling, the beginning part comprises the trial court, the document type, and the case number, and the end part comprises the trial personnel, the trial date, and the court clerk.
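The two body-part ratios and the example thresholds of 1/10 and 1/50 can be sketched as below. This is only an illustration: the set of characters treated as punctuation, the choice to count every character of a line as one "word", and the function names are assumptions that do not come from the application.

```python
from fractions import Fraction
from typing import Tuple

FIRST_THRESHOLD = Fraction(1, 10)   # example value given in the description
SECOND_THRESHOLD = Fraction(1, 50)  # example value given in the description

# Assumed set of marks that count as "punctuation before the line feed".
END_PUNCTUATION = set("。；？！…：，、.;?!:,")


def body_ratios(body: str) -> Tuple[Fraction, Fraction]:
    """Return (unpunctuated lines / total lines, total lines / total words)."""
    lines = [line.strip() for line in body.split("\n") if line.strip()]
    if not lines:
        return Fraction(0), Fraction(0)
    unpunctuated = sum(1 for line in lines if line[-1] not in END_PUNCTUATION)
    total_words = sum(len(line) for line in lines)  # assumption: one character = one word
    return Fraction(unpunctuated, len(lines)), Fraction(len(lines), total_words)


def body_needs_processing(body: str,
                          first_threshold: Fraction = FIRST_THRESHOLD,
                          second_threshold: Fraction = SECOND_THRESHOLD) -> bool:
    """True when either ratio exceeds its threshold (the second set condition)."""
    ratio_1, ratio_2 = body_ratios(body)
    return ratio_1 > first_threshold or ratio_2 > second_threshold
```

Exact fractions are used instead of floats so that a ratio equal to a threshold is never misclassified by rounding.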
The first threshold and the second threshold may be obtained as follows. A certain amount of corpus is annotated with a binary label: the text needs preprocessing or does not need preprocessing. For each annotated text, the value of (lines with no punctuation before the line feed)/(total lines) and the value of (total lines)/(total words) are calculated. Starting from a candidate first threshold of 0 and a candidate second threshold of 1/100, the two candidates are each increased by a suitable step (the learning rate), such as 0.01 and 0.001 respectively (as shown in Table 2), until they reach 1; the annotated corpus is classified with the candidates at each step, and the values that give the highest accuracy are taken as the first threshold and the second threshold.
Table 2. Correspondence between the first and second thresholds and accuracy
[Table 2 is reproduced only as images in the original publication: Figure BDA0002735987870000131, Figure BDA0002735987870000141]
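The threshold search described before Table 2 can be pictured as a grid search over the two candidate thresholds. The sketch below is an assumption-laden illustration: it searches the two thresholds jointly and classifies a text as "needs preprocessing" when either ratio exceeds its candidate, which is one possible reading of the procedure; the sample format and function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

# (ratio 1, ratio 2, annotated label "needs preprocessing")
Sample = Tuple[float, float, bool]


def candidates(start: float, step: float, stop: float = 1.0) -> List[float]:
    """Evenly spaced candidate thresholds from start up to stop."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


def accuracy(samples: Sequence[Sample], t1: float, t2: float) -> float:
    """Share of samples whose prediction (either ratio above its threshold) matches the label."""
    correct = sum(((r1 > t1) or (r2 > t2)) == label for r1, r2, label in samples)
    return correct / len(samples)


def search_thresholds(samples: Sequence[Sample],
                      step_1: float = 0.01,
                      step_2: float = 0.001) -> Tuple[float, float, float]:
    """Return (first threshold, second threshold, accuracy) with the highest accuracy."""
    best = (-1.0, 0.0, 0.0)
    for t1, t2 in product(candidates(0.0, step_1), candidates(0.01, step_2)):
        score = accuracy(samples, t1, t2)
        if score > best[0]:  # ties keep the smallest thresholds found first
            best = (score, t1, t2)
    return best[1], best[2], best[0]
```

With the default steps the grid has roughly 100,000 pairs, which is small enough to evaluate exhaustively on a modest annotated corpus.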
That is, in some specific examples of the present application, official document text produced by OCR may be regarded directly as text to be processed. When it has not been determined whether an input document belongs to the text to be processed, the document type is judged from the case number: if the type is not a judgment, a ruling, a mediation document, or an enforcement document, the document does not belong to the text to be processed. If the type is one of these four, the input text belongs to the text to be processed when it meets any one of the following three conditions: 1) the rule word list shows that the format of the beginning or the end of the text is non-standard; 2) after the beginning and the end are removed using the rule word list, (lines with no punctuation before the line feed)/(total lines) > the first threshold (e.g., 1/10); 3) after the beginning and the end are removed using the rule word list, (total lines)/(total words) > the second threshold (e.g., 1/50). The first and second thresholds may be selected as described above.
The reasons for choosing one fiftieth as the second threshold for OCR text include at least the following. Total words divided by total lines is the average number of words per line. According to incomplete statistics, in OCR text every line ends with a line feed, and after the beginning part and the end part are removed the average number of words per line is about 34.5, so (total lines)/(total words) is about 1/34.5. In standard text, after the beginning part and the end part are removed, the average number of words per line is about 121.2, so (total lines)/(total words) is about 1/121.2. The empirical value 1/50 is chosen in view of these characteristics of OCR text.
Regarding the first threshold: in general, once the beginning and the end are removed, a standard document consists of natural paragraphs; a natural paragraph necessarily ends at the end of a sentence, and a sentence generally ends with a semicolon, period, question mark, ellipsis, or exclamation mark, so the threshold could be set to 0. However, some documents list evidence or legal provisions line by line, for example:
The evidence provided includes:
Identity card
Original household register booklet
Hospital diagnosis certificate
In this case, removing the line feeds would turn originally standard text into non-standard text that cannot be restored. The first threshold is therefore set to 1/10 to guard against such false positives to some extent. Regarding the second threshold: in theory, every line of OCR text ends with a line feed character, so the average number of words per line is larger when the lines are longer and the characters are smaller. A line of judgment text generally holds no more than 50 words, so the second threshold is set to 1/50.
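As a quick check of the figures quoted above, the second threshold sits between the line-to-word ratios reported for standard text and for OCR text (the decimal values are rounded):

```latex
\[
\underbrace{\frac{1}{121.2}}_{\text{standard text}} \approx 0.0083
\;<\; \frac{1}{50} = 0.02 \;<\;
\underbrace{\frac{1}{34.5}}_{\text{OCR text}} \approx 0.0290
\]
```

Equivalently, in average words per line, 34.5 < 50 < 121.2, so a text whose lines average fewer than 50 words is treated as likely to carry OCR-style line feeds.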
To perform S130 (the analysis process based on the trained model), some embodiments of the present application train the relevant model in advance to obtain the classification model used in S130; that is, some embodiments of the present application also include S135, training to obtain the classification model (the modeling process).
As an example, S135 includes: screening corpora that can be used as training corpora and classifying them by document type; labeling the corpora using a chapter analysis rule model to form the training corpus; and training a sentence classification algorithm model on the training corpus to locate the starting sentence of each chapter.
The steps involved in S135 are exemplarily set forth below.
First, corpora suitable for use as training corpora are screened and classified by document type.
A corpus suitable for training must meet all of the following requirements at the same time: the rule word list identifies the format of the beginning or the end of the text as standard; after the beginning and the end of the text are removed using the rule word list, (lines with no punctuation before the line feed)/(total lines) < 1/20; after the beginning and the end of the text are removed using the rule word list, (total lines)/(total words) > 1/50; and the total number of words of the text > 50.
For the corpora screened out as suitable for training, the document type is judged from the case number, and the categories are: first-instance judgments, second-instance judgments, rulings, mediation documents, and enforcement documents.
Second, each type of corpus is labeled using the chapter analysis rule model to form the training corpus. For example, the chapter analysis rule model works as follows. First, the document type is determined from the case number, and the sub-model used for analysis is selected according to the document type. Second, each sub-model uses the same basic analysis method: the document is divided into several chapters according to the character string that begins each chapter; the category of a chapter's starting string is the name of the chapter, and the content from that starting string up to the starting string of the next chapter (that is, the first chapter-starting string after it) is the content of the chapter.
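The basic analysis method of a sub-model (cutting the document at the strings that begin each chapter) might look like the sketch below. The chapter names and starting strings are invented English placeholders for a single document type, since the application does not list the actual strings; the function name is likewise hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical chapter-starting patterns for one document type (not from the source).
CHAPTER_STARTS: Dict[str, str] = {
    "party_information": r"Plaintiff:",
    "trial_proceedings": r"This court accepted the case",
    "plaintiff_claims": r"The plaintiff claims",
}


def parse_chapters(body: str, starts: Dict[str, str] = CHAPTER_STARTS) -> List[Tuple[str, str]]:
    """Split body into (chapter name, chapter content) at each chapter-starting string."""
    hits = []
    for name, pattern in starts.items():
        match = re.search(pattern, body)  # first occurrence of the starting string
        if match:
            hits.append((match.start(), name))
    hits.sort()
    chapters = []
    for index, (position, name) in enumerate(hits):
        # A chapter ends where the next detected chapter begins, or at the end of the text.
        end = hits[index + 1][0] if index + 1 < len(hits) else len(body)
        chapters.append((name, body[position:end]))
    return chapters
```

Selecting a different `starts` dictionary per document type would correspond to choosing the sub-model from the case number.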
Third, the corpora of first-instance judgments, second-instance judgments, retrial judgments, rulings, mediation documents, and enforcement documents are trained separately to obtain a sentence classification algorithm model for each type. Specifically, the training corpus is cut into sentences using the period and "\n" as cut points, and the analysis result of the chapter analysis rule model shows whether each resulting sentence is the starting sentence of some chapter (for example, the chapters of a judgment include party information, trial proceedings, and the plaintiff's claims, and the corresponding starting sentences are the party-information starting sentence, the trial-proceedings starting sentence, the plaintiff's-claims starting sentence, and so on). A sentence that is not the starting sentence of any chapter is labeled "other".
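Building labeled training sentences from the rule model's output can be sketched as follows. The sketch assumes the full-width period "。" and "\n" as cut points and assumes that the rule model's result is available as a mapping from character offset to chapter name; both the data format and the function names are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple

OTHER = "other"  # label for sentences that do not start any chapter


def cut_sentences(text: str, delimiters: Sequence[str] = ("。", "\n")) -> List[Tuple[int, str]]:
    """Cut text after every delimiter and return (start offset, sentence) pairs."""
    sentences, start = [], 0
    for index, char in enumerate(text):
        if char in delimiters:
            piece = text[start:index + 1]
            if piece.strip():  # skip pieces that hold only whitespace
                sentences.append((start, piece))
            start = index + 1
    if text[start:].strip():
        sentences.append((start, text[start:]))
    return sentences


def label_sentences(sentences: Sequence[Tuple[int, str]],
                    chapter_starts: Dict[int, str]) -> List[str]:
    """Label a sentence with a chapter name if a chapter starts inside it, else 'other'."""
    labels = []
    for offset, sentence in sentences:
        end = offset + len(sentence)
        name = next((chapter for position, chapter in sorted(chapter_starts.items())
                     if offset <= position < end), OTHER)
        labels.append(name)
    return labels
```

The resulting (sentence, label) pairs are the kind of supervised examples a sentence classifier for one document type could be trained on.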
It can be understood that, once the sentence classification algorithm model for each document type has been trained, the models produced by the modeling process are used to classify sentences, and chapter analysis is performed on that basis; that is, S130 is performed. As indicated above, in some embodiments of the present application S130 includes: first, determining the document type, such as first-instance judgment, second-instance judgment, mediation document, or enforcement document; second, selecting the sentence classification algorithm model to use according to the document type; then, cutting the text into sentences and classifying the sentences with the selected model; finally, performing chapter analysis on the basis of the sentence cutting, as follows: if a sentence is classified as the starting sentence of a chapter, the content from the start of that sentence up to the start of the next sentence identified as the beginning of another chapter is regarded as the content of the chapter; if two or more sentences are classified as the starting sentence of the same chapter, the earliest one prevails.
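Choosing the sentence classification model by document type can be sketched with a simple registry. The keyword-based classifiers below are toy stand-ins for the trained models, and the document type keys, keywords, and function names are all hypothetical.

```python
from typing import Callable, Dict, List

OTHER = "other"
SentenceClassifier = Callable[[str], str]


def keyword_model(rules: Dict[str, str]) -> SentenceClassifier:
    """Toy stand-in for a trained model: label a sentence by its opening keyword."""
    def classify(sentence: str) -> str:
        for keyword, label in rules.items():
            if sentence.startswith(keyword):
                return label
        return OTHER
    return classify


# One model per document type (contents invented for illustration).
MODELS: Dict[str, SentenceClassifier] = {
    "first_instance_judgment": keyword_model({
        "Plaintiff": "party_information",
        "This court accepted": "trial_proceedings",
    }),
    "mediation_document": keyword_model({"Applicant": "party_information"}),
}


def classify_sentences(doc_type: str, sentences: List[str]) -> List[str]:
    """Pick the model registered for doc_type and label every sentence with it."""
    if doc_type not in MODELS:
        raise ValueError(f"no sentence classification model for {doc_type!r}")
    model = MODELS[doc_type]
    return [model(sentence) for sentence in sentences]
```

The labels returned here are what the chapter analysis step consumes: each non-"other" label marks a candidate starting sentence of the chapter it names.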
To facilitate subsequent information extraction or machine learning, in some embodiments of the present application the text processing method further includes: confirming the name of each chapter and adding the corresponding chapter name to each chapter; and providing the beginning part, the end part, and the body part for further machine analysis. Adding a chapter name to each chapter can speed up subsequent machine learning.
Fig. 3 shows a text processing apparatus provided by an embodiment of the present application. It should be understood that the apparatus corresponds to the method embodiment of fig. 2 and can perform the steps of that method embodiment; for the specific functions of the apparatus, refer to the description above, and detailed description is omitted here where appropriate to avoid repetition. The apparatus comprises at least one software functional module that can be stored in a memory in the form of software or firmware, or built into the operating system of the apparatus. The text processing apparatus comprises: a symbol processing module 101 configured to delete all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark; a beginning and end processing module 102 configured to add the target symbols to the beginning part and the end part of the text to be processed according to a first rule; and a text processing module 103 configured to divide the body part included in the text to be processed into at least one chapter according to a classification model and to add the line feed symbol to the end of each of the at least one chapter, wherein the body part is the remaining part of the text to be processed other than the beginning part and the end part.
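The first two modules can be pictured with the sketch below: one function deletes the target symbols, and another re-inserts a line feed after each item recognized at the beginning of the text. The regular expressions for the court name, document type, and case number are illustrative guesses at typical Chinese court-document headings, not the rules of the application, and the function names are hypothetical.

```python
import re
from typing import Sequence, Tuple

# Hypothetical patterns for the beginning part: court, document type, case number.
HEAD_PATTERNS = [
    r".+?人民法院",
    r"(?:民事|刑事|行政)(?:判决书|裁定书)",
    r"[（(]\d{4}[）)].+?号",
]


def delete_target_symbols(text: str, targets: Sequence[str] = ("\n",)) -> str:
    """Module 101 (sketch): remove every target symbol from the text."""
    for symbol in targets:
        text = text.replace(symbol, "")
    return text


def add_line_feeds_to_head(text: str,
                           patterns: Sequence[str] = HEAD_PATTERNS) -> Tuple[str, str]:
    """Module 102 (sketch): match the heading items in order and end each with a line feed.

    Returns (formatted beginning part, remaining text).
    """
    position, items = 0, []
    for pattern in patterns:
        match = re.compile(pattern).match(text, position)
        if match is None:
            break  # stop at the first item that cannot be recognized
        items.append(match.group())
        position = match.end()
    head = "".join(item + "\n" for item in items)
    return head, text[position:]
```

The end part (trial personnel, trial date, court clerk) could be handled the same way with its own pattern list, leaving the remaining text as the body part for module 103.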
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the above-described apparatus may refer to the corresponding process in fig. 2, and will not be described in detail herein.
Some embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, may implement the method described above with respect to fig. 2.
As shown in fig. 4, some embodiments of the present application provide an information processing apparatus 400, which includes a memory 410, a processor 420, and a computer program stored on the memory 410 and executable on the processor 420, wherein the processor 420 can implement the text processing method described in fig. 2 when reading the program from the memory 410 through a bus 430 and executing the computer program.
Processor 420 may process digital signals and may include various computing architectures, such as a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture that implements a combination of instruction sets. In some examples, processor 420 may be a microprocessor.
Memory 410 may be used to store instructions that are executed by processor 420 or data related to the execution of instructions. The instructions and/or data may include code for performing some or all of the functions of one or more of the modules described in embodiments of the application. The processor 420 of the disclosed embodiment may be used to execute instructions in the memory 410 to implement the method shown in fig. 2. Memory 410 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A text processing method, characterized in that the text processing method comprises:
deleting all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark;
respectively adding the target symbols to the beginning part and the end part of the text to be processed according to a first rule;
dividing a body part included in the text to be processed into at least one chapter according to a classification model, and adding the line feed symbol to the tail of each chapter in the at least one chapter, wherein the body part is the rest part of the text to be processed except for the beginning part and the end part.
2. The text processing method according to claim 1, wherein before deleting all the target symbols included in the text to be processed, the text processing method further comprises: confirming that the input text belongs to the text to be processed.
3. The text processing method of claim 2, wherein confirming that the input text belongs to the text to be processed comprises:
confirming that the document type of the input text belongs to a target document type, wherein the target document type comprises: a judgment, a ruling, a mediation document, or an enforcement document;
confirming that the input text meets a preset condition, wherein the preset condition comprises that the format of at least one of the beginning part and the end part of the input text meets a first set condition, or that the body part of the input text meets a second set condition.
4. The text processing method of claim 3, wherein the confirming that the input text satisfies a preset condition comprises:
deleting the beginning part and the end part from the input text according to a rule word list to obtain the body part;
confirming that the ratio of the number of lines in the body part that have no punctuation mark before the line feed symbol to the total number of lines in the body part is greater than a set first threshold; or confirming that the ratio of the total number of lines to the total number of words included in the body part is greater than a second threshold.
5. The text processing method according to claim 3, wherein the text to be processed comprises a judgment and a ruling, the beginning part comprises a trial court, a document type and a case number, and the end part comprises trial personnel, a trial date and a court clerk;
the first set condition includes that the line feed symbol after at least one of the trial court, the document type, the case number, the trial date and the court clerk is non-standard.
6. The text processing method of claim 1, wherein the text to be processed is a judgment or a ruling, the beginning parts of the judgment and the ruling include a trial court, a document type and a case number, and the end parts of the judgment and the ruling include: trial personnel, a trial date and a court clerk;
the adding the target symbols to the beginning part and the end part of the text to be processed according to the first rule respectively comprises:
adding the line feed symbol after each of the trial court, the document type, the case number, the trial personnel, the trial date and the court clerk, each of these items being identified according to a plurality of regular expressions.
7. The text processing method of claim 1, wherein the classification model comprises sentence classification algorithm models in one-to-one correspondence with a plurality of document types, the plurality of document types comprising: a first-instance judgment, a second-instance judgment, a mediation document or an enforcement document;
the dividing a body part included in the text to be processed into at least one chapter according to a classification model, and adding the target symbol to the end of each chapter in the at least one chapter, includes:
confirming the document type corresponding to the text to be processed;
confirming an adopted target sentence classification algorithm model according to the document type;
identifying the starting sentence of each chapter from a plurality of sentences obtained by segmenting the body part according to the target sentence classification algorithm model;
and dividing the chapters according to the starting sentence, and adding the line feed symbols to the tail of each chapter.
8. The text processing method of claim 7, wherein the dividing the chapters according to the starting sentence and adding the line feed symbol for the end of each chapter comprises:
when a first sentence is identified as the starting sentence of a first chapter and a second sentence is identified as the starting sentence of a second chapter adjacent to the first chapter, confirming that the first sentence and all sentences located between the first sentence and the second sentence belong to the first chapter.
9. A text processing apparatus, characterized in that the text processing apparatus comprises:
a symbol processing module configured to delete all target symbols included in the text to be processed, wherein the target symbols include line feed symbols or at least one punctuation mark;
a beginning and end processing module configured to add the target symbols to the beginning part and the end part of the text to be processed according to a first rule;
a text processing module configured to divide a body part included in the text to be processed into at least one chapter according to a classification model, and add the target symbol to the end of each chapter in the at least one chapter, wherein the body part is the remaining part of the text to be processed other than the beginning part and the end part.
10. An information processing apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, can implement the text processing method of any one of claims 1 to 8.
CN202011136241.0A 2020-10-21 2020-10-21 Text processing method and device Pending CN112183032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011136241.0A CN112183032A (en) 2020-10-21 2020-10-21 Text processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011136241.0A CN112183032A (en) 2020-10-21 2020-10-21 Text processing method and device

Publications (1)

Publication Number Publication Date
CN112183032A true CN112183032A (en) 2021-01-05

Family

ID=73922037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011136241.0A Pending CN112183032A (en) 2020-10-21 2020-10-21 Text processing method and device

Country Status (1)

Country Link
CN (1) CN112183032A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117010390A (en) * 2023-07-04 2023-11-07 北大荒信息有限公司 Company entity identification method, device, equipment and medium based on bidding information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763483A (en) * 2018-05-25 2018-11-06 南京大学 A kind of Text Information Extraction method towards judgement document
CN109359288A (en) * 2018-08-16 2019-02-19 上海绿狮智能信息科技股份有限公司 A method of for law works field document quantitative evaluation
CN110599289A (en) * 2019-07-31 2019-12-20 长春市万易科技有限公司 Method for formatting official document
CN110705264A (en) * 2019-09-27 2020-01-17 上海智臻智能网络科技股份有限公司 Punctuation correction method, punctuation correction apparatus, and punctuation correction medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination