CN111259645A - Referee document structuring method and device

Info

Publication number: CN111259645A
Application number: CN202010041170.XA
Authority: CN
Inventors: 席丽娜; 王文军; 晋耀红
Original assignee: Dinfo Beijing Science Development Co ltd
Current assignee: Dinfo Beijing Science Development Co ltd
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2020-06-09

Abstract

The application provides a referee document structuring method and device. A first extraction template is used to extract block texts from a referee document to be processed, yielding a first structured text. A second extraction template is then used to extract from specified block texts of the first structured text, yielding a sub-structured text. Finally, the corresponding content in the first structured text is updated with the sub-structured text to obtain a second structured text. Through this secondary structuring, the method can further extract information hidden in the referee document to be processed, so that the resulting second structured text presents the content of the document more completely.

Description

Referee document structuring method and device

Technical Field

The application relates to the technical field of text processing, in particular to a referee document structuring method and device.

Background

Legal documents such as referee documents are usually lengthy and use obscure terminology, which makes it difficult to quickly locate the content of interest within a whole document. Moreover, while browsing a referee document, a user usually also needs to browse precedents, i.e. referee documents for cases similar to the current one, to help understand and compare the current document. For some more specialized referee documents, such as civil referee documents, implicit information must additionally be extracted in a targeted manner from parts of the text after the full text has been browsed. For such documents, browsing even a single referee document is difficult, and finding similar referee documents among a large number of documents is harder still: it consumes a great deal of time and may still fail to find the document with the highest similarity.

For example, a user who needs to find content related to the dispute focus must read the referee document from its first character, work out which parts the document contains, judge which part may contain the dispute focus, and then refine and analyze that part to obtain the relevant content. Manually analyzing the structure of a referee document in this way is time-consuming and is also affected by uncertain factors such as the reader's knowledge and reasoning, so the result is prone to low accuracy and may have little reference value. Existing ways of browsing referee documents are therefore low in both efficiency and quality.

Disclosure of Invention

The application provides a referee document structuring method and a referee document structuring device, which are used for improving the format standardization of a referee document and facilitating browsing of a user.

In a first aspect, the present application provides a method for structuring a referee document, the method comprising:

extracting block texts in a referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block texts in the referee document to be processed;

extracting from the specified block text of the first structured text by using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and a corresponding sub-text in the specified block text;

and updating the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.

In a second aspect, the present application provides a referee document structuring device, the device comprising:

a first extraction unit, configured to extract block texts in a referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block texts in the referee document to be processed;

a second extraction unit, configured to extract, by using a second extraction template, a sub-structured text from a specified block text of the first structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and a corresponding sub-text in the specified block text;

and an updating unit, configured to update the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.

According to the above technical solutions, the present application provides a referee document structuring method and device. A first extraction template is used to extract block texts from a referee document to be processed, yielding a first structured text. A second extraction template is then used to extract from specified block texts of the first structured text, yielding a sub-structured text. Finally, the corresponding content in the first structured text is updated with the sub-structured text to obtain a second structured text. Through this secondary structuring, the method can further extract information hidden in the referee document to be processed, so that the resulting second structured text presents the content of the document more completely.

Drawings

To explain the technical solutions of the present application more clearly, the drawings used in the embodiments are briefly described below. Those skilled in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a flowchart of a method for structuring a referee document according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for generating an extraction template according to an embodiment of the present application;

Fig. 3 is a flowchart of a method for determining a first extraction template according to an embodiment of the present application;

Fig. 4 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application;

Fig. 5 is a flowchart of a method for generating a sub-structured text according to an embodiment of the present application;

Fig. 6 is a flowchart of another method for generating a sub-structured text according to an embodiment of the present application;

Fig. 7 is a flowchart of a method for replacing text content according to an embodiment of the present application;

Fig. 8 is a schematic diagram of a first embodiment of a referee document structuring device according to the present application;

Fig. 9 is a schematic diagram of a second embodiment of a referee document structuring device according to the present application;

Fig. 10 is a schematic diagram of a third embodiment of a referee document structuring device according to the present application;

Fig. 11 is a schematic diagram of a fourth embodiment of a referee document structuring device according to the present application;

Fig. 12 is a schematic diagram of a fifth embodiment of a referee document structuring device according to the present application;

Fig. 13 is a schematic diagram of a sixth embodiment of a referee document structuring device according to the present application;

Fig. 14 is a schematic diagram of a seventh embodiment of a referee document structuring device according to the present application;

Fig. 15 is a schematic diagram of an eighth embodiment of a referee document structuring device according to the present application;

Fig. 16 is a schematic diagram of a ninth embodiment of a referee document structuring device according to the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to solve the above problems, the present application provides a referee document structuring method and device that convert a referee document into a structured text, so that a user can quickly locate the required content in the referee document.

Fig. 1 is a flowchart of a method for structuring a referee document according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

S1, extracting block texts in the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block texts in the referee document to be processed.

The referee document to be processed is input into a referee document structuring device, which may be a server, a personal computer (PC), a tablet computer, a mobile phone, or other text processing equipment. The referee documents to be processed may be, for example, all the trial and judgment documents of a civil case. After receiving the referee document to be processed, the device needs to preprocess it and determine the text to be structured; for example, the referee documents input into the device may include a criminal first-instance judgment, a criminal second-instance judgment, and a criminal final judgment. A block text is the text content in the referee document to be processed that corresponds to an extraction node in the first extraction template. For example, if the referee document to be processed contains "party × … found upon trial × …" and the first extraction template includes the extraction nodes "party information" and "trial finding", then "party × …" is the block text corresponding to "party information", and "found upon trial × …" is the block text corresponding to "trial finding".

Specifically, Fig. 2 is a flowchart of a method for generating an extraction template according to an embodiment of the present application. As shown in Fig. 2, the method includes:

S001, obtaining referee document samples, wherein the referee document samples belong to the same category;

S002, dividing each referee document sample into sample block texts according to a preset text division rule;

S003, setting a node title for each sample block text;

S004, combining all the node titles of the same referee document sample to generate a corresponding extracted template sample;

S005, combining the extracted template samples to generate an extraction template.

A referee document is a text with standardized content: whatever the format, referee documents of the same category cover substantially the same types of content, such as party information, trial process, the plaintiff's claims, the defendant's defense, trial finding, court opinion, and judgment result. An extraction template can therefore be generated by training on a large number of referee document samples.

Generally, different categories of referee documents have different extraction templates. The category refers to the case field, trial level, and the like of the referee document; for example, criminal first-instance judgments, criminal second-instance judgments, and civil first-instance judgments belong to three different categories.

Before training an extraction template for a category of referee documents, a large number of referee document samples of that category must first be obtained, preferably in a format in which each title is paired with its specific text content, such as "party information - party × …; trial finding - found upon trial …". This sample format is closest to the format of the extraction template to be generated, which effectively improves training efficiency.

If the selected referee document samples do not have this format, each sample can first be divided into sample block texts according to a preset text division rule. The sample block texts are the block texts contained in each selected sample, and the text division rule may include, for example, division by paragraph, division by subtitles in the text, and division by the starting characters of specified paragraphs. A node title is then set for each sample block text. The node title is usually a character string that summarizes the semantics of the sample block text; for example, if the sample block text is "party × …", the node title can be set to "party information". Further, if node titles with the same semantics appear within one referee document sample, the sample block texts corresponding to those node titles can be merged, and one of the node titles is selected as the node title of the merged sample block text.

After the node titles corresponding to the sample block texts of one referee document sample are obtained, they can be combined to generate the extracted template sample for that referee document sample. An extraction template is obtained by training on a large number of such extracted template samples. Further, the generated extraction template can be continuously optimized by continuously enriching the referee document samples.
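Steps S001 to S005 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the training procedure of the application: the prefix table `TITLE_RULES`, the paragraph-based division rule, and the `min_share` threshold used to combine template samples are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import List, Optional

# Hypothetical rule table: opening words of a block -> node title (assumption).
TITLE_RULES = {
    "Party": "party information",
    "Found upon trial": "trial finding",
    "The court holds": "court opinion",
}

def split_into_blocks(sample: str) -> List[str]:
    """S002: divide a sample by paragraph (one possible division rule)."""
    return [p.strip() for p in sample.split("\n") if p.strip()]

def title_for(block: str) -> Optional[str]:
    """S003: set a node title from the block's opening words."""
    for prefix, title in TITLE_RULES.items():
        if block.startswith(prefix):
            return title
    return None

def template_sample(sample: str) -> List[str]:
    """S004: ordered node titles of one sample; repeated titles are merged."""
    titles: List[str] = []
    for block in split_into_blocks(sample):
        title = title_for(block)
        if title and title not in titles:
            titles.append(title)
    return titles

def build_template(samples: List[str], min_share: float = 0.5) -> List[str]:
    """S005: keep node titles present in at least min_share of the samples."""
    template_samples = [template_sample(s) for s in samples]
    counts = Counter(t for ts in template_samples for t in ts)
    order: List[str] = []
    for ts in template_samples:
        for t in ts:
            if t not in order:
                order.append(t)
    return [t for t in order if counts[t] / len(samples) >= min_share]
```

Raising `min_share` keeps only node titles shared by most samples, while lowering it makes the template more inclusive; adding more samples changes the counts and so refines the result.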

For different categories of referee documents, the method can be adopted to generate corresponding extraction templates.

The extraction templates generated in this way can be used by the referee document structuring device at any time without being regenerated. When using them, the device therefore needs to select, from all the extraction templates, a first extraction template suited to the referee document to be processed.

Specifically, Fig. 3 is a flowchart of a method for determining a first extraction template according to an embodiment of the present application. As shown in Fig. 3, the method includes:

S011, extracting, from the referee document to be processed, target keywords that match words in the keyword library;

S012, calculating the semantic similarity between each target keyword and the template title of each of the extraction templates;

S013, calculating the matching degree between the referee document to be processed and each extraction template by combining the weight and the semantic similarity corresponding to each target keyword;

S014, determining a first extraction template, wherein the first extraction template is the extraction template with the highest matching degree.

Words that indicate the category of the referee document to be processed usually appear in its title or body. These words may differ in wording while having the same meaning, such as "first instance" and "first trial". The word segments of the referee document to be processed can therefore be matched against the words in the keyword library, and the target keywords whose semantic similarity is higher than a threshold are taken to represent the category of the referee document to be processed.

Each extraction template usually has a corresponding template title. The target keywords of the referee document to be processed are matched against the template titles to find the template title with the highest matching degree, and the extraction template with that title is the first extraction template applicable to the referee document to be processed.
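The selection in steps S011 to S014 can be illustrated as a weighted sum of per-keyword similarities. The sketch below is an assumption-laden illustration: the keyword library `KEYWORD_WEIGHTS` and its weights are hypothetical, keywords are found by plain substring search, and the token-overlap function `similarity` is only a stand-in for a real semantic similarity measure.

```python
from typing import List

# Hypothetical keyword library with per-keyword weights (assumption).
KEYWORD_WEIGHTS = {
    "criminal": 1.0,
    "civil": 1.0,
    "first instance": 0.8,
    "second instance": 0.8,
}

def target_keywords(document: str) -> List[str]:
    """S011: keywords from the library that occur in the document."""
    text = document.lower()
    return [k for k in KEYWORD_WEIGHTS if k in text]

def similarity(keyword: str, title: str) -> float:
    """S012 stand-in: share of the keyword's tokens that occur in the title."""
    tokens = keyword.lower().split()
    title_tokens = set(title.lower().split())
    return sum(t in title_tokens for t in tokens) / len(tokens)

def matching_degree(document: str, title: str) -> float:
    """S013: weighted sum of similarities over all target keywords."""
    return sum(KEYWORD_WEIGHTS[k] * similarity(k, title)
               for k in target_keywords(document))

def select_first_template(document: str, template_titles: List[str]) -> str:
    """S014: the template title with the highest matching degree."""
    return max(template_titles, key=lambda t: matching_degree(document, t))
```

In practice the similarity function would be replaced by whatever semantic measure is available; the selection logic around it stays the same.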

After the first extraction template is determined, node characters need to be determined in the referee document to be processed by using the first extraction template. Specifically, Fig. 4 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application. As shown in Fig. 4, the method includes:

S101, determining node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings corresponding to the contents of the parts of the referee document to be processed, and the node characters are the starting characters of the part of the referee document to be processed that corresponds to an extraction node;

S102, determining the block text corresponding to each extraction node, wherein the block text comprises all characters from the node characters corresponding to the extraction node up to the next node characters;

S103, associating each extraction node with its block text to generate a first structured text.

Specifically, the first extraction template consists of a plurality of extraction nodes that represent the texts to be extracted. For example, if the extraction nodes in the first extraction template are "header, party information, trial finding" and the referee document to be processed contains "×× court …, party ×× …, found upon trial × …", then the part extracted for the extraction node "header" is "×× court …", the part extracted for "party information" is "party ×× …", and the part extracted for "trial finding" is "found upon trial × …".

Specifically, the node characters may be determined as follows.

S1011, obtaining the extraction expression corresponding to each extraction node;

S1012, sequentially matching each extraction expression against the first-line characters of each unmatched paragraph in the referee document to be processed to obtain matched paragraphs, wherein an unmatched paragraph is a paragraph that has not yet been matched by any extraction expression;

S1013, extracting the first-line characters of the corresponding matched paragraph by using the extraction expression to obtain the node characters.

By writing convention, the characters within one paragraph usually express one semantic unit, so a paragraph is the smallest unit of complete semantics. A paragraph can therefore be used as the search unit, and node characters can be looked for in each search unit. Because the node characters are the key to dividing the referee document to be processed, they must contain the word segments or phrases corresponding to an extraction node; the node characters can thus be determined by recognizing these word segments or phrases, which is generally done with an extraction expression. For example, for the extraction node "trial finding", the corresponding extraction expression may be @ \ n [ "n". Is? (authorized? (home)? Checking and finding out: is @ or @ \ n classic (act)? And (4) examining and finding @ and the like. One extraction node generally corresponds to a plurality of extraction expressions so as to cover the various ways in which the node may be expressed. The first-line characters of each paragraph are matched against the extraction expressions, and the matched first-line characters are extracted to obtain the node characters. For example, if a paragraph of the referee document to be processed is "found upon trial, ×× has a debt relationship with ×× …", the node characters "found upon trial" can be extracted by the extraction expression.

It should be noted that, when matching with the extraction expressions, paragraphs are matched one by one and only paragraphs that have not yet been matched are tested. This preserves the extraction order and prevents omissions, and it also prevents paragraphs whose node characters have already been determined from being extracted again, avoiding wasted time and extraction errors.
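The matching loop of steps S1011 to S1013 can be sketched as follows. The regular expressions in `EXTRACTION_EXPRESSIONS` are hypothetical English stand-ins written for this example; they are not the extraction expressions of the application, and the function names are likewise illustrative.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical extraction expressions; one node may have several variants.
EXTRACTION_EXPRESSIONS: Dict[str, List[str]] = {
    "party information": [r"(Plaintiff|Defendant|Party)\b"],
    "trial finding": [
        r"(It is )?found upon trial\b",
        r"Upon trial,? (the|this) court finds\b",
    ],
}

def match_first_line(patterns: List[str], first_line: str) -> Optional[str]:
    """Return the node characters if any pattern matches the start of the line."""
    for pattern in patterns:
        hit = re.match(pattern, first_line, re.IGNORECASE)
        if hit:
            return hit.group(0)
    return None

def find_node_characters(
    paragraphs: List[str],
    expressions: Dict[str, List[str]] = EXTRACTION_EXPRESSIONS,
) -> Dict[str, Tuple[int, str]]:
    """Map each extraction node to (paragraph index, node characters)."""
    matched = set()  # indexes of paragraphs that already yielded node characters
    found: Dict[str, Tuple[int, str]] = {}
    for node, patterns in expressions.items():
        for index, paragraph in enumerate(paragraphs):
            if index in matched:
                continue  # only unmatched paragraphs are tested (S1012)
            lines = paragraph.splitlines()
            first_line = lines[0].strip() if lines else ""
            node_chars = match_first_line(patterns, first_line)
            if node_chars is not None:
                matched.add(index)
                found[node] = (index, node_chars)
                break
    return found
```

Tracking the `matched` set is what keeps a paragraph from being claimed by two extraction nodes.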

After the node characters are determined, the corresponding block texts can be determined by dividing the document at the node characters. Specifically, a block text lies between two adjacent node characters: it starts at the preceding node characters and ends just before the next node characters. For example, if the referee document to be processed contains "party × …, found upon trial × …", the above process determines that "party" and "found upon trial" are adjacent node characters, so "party × …" is the block text corresponding to the extraction node "party information".

After the block text of each extraction node is determined, the name of the extraction node can be used as a title, and a correspondence is established between each title and its block text. The referee document to be processed is thereby structured into a first structured text consisting of a plurality of extraction nodes and block texts. For example, for a civil judgment, a first extraction template consisting of the extraction nodes "header, party information, trial process, plaintiff's claims, defendant's defense, trial finding, court opinion, judgment result, and footer" may be selected; the block texts corresponding to these extraction nodes are extracted, and the first structured text is generated.
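Steps S102 and S103 amount to slicing the document at the node character positions. A minimal sketch, assuming the positions have already been found and are passed in as hypothetical `(extraction node, start offset)` pairs:

```python
from typing import Dict, List, Tuple

def build_first_structured_text(
    document: str, node_positions: List[Tuple[str, int]]
) -> Dict[str, str]:
    """Slice the document into block texts keyed by extraction node.

    Each block runs from its own node characters up to, but not including,
    the next node characters; the last block runs to the end of the document.
    """
    ordered = sorted(node_positions, key=lambda item: item[1])
    structured: Dict[str, str] = {}
    for i, (node, start) in enumerate(ordered):
        end = ordered[i + 1][1] if i + 1 < len(ordered) else len(document)
        structured[node] = document[start:end].strip()
    return structured
```

A dictionary keyed by extraction node is one simple representation of the first structured text; any title-to-block mapping would serve equally well.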

S2, extracting from the specified block text of the first structured text by using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and a corresponding sub-text in the specified block text.

S3, updating the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.

Some block texts in the first structured text may contain implicit information. Implicit information generally refers to text content that the user needs but that is dispersed across the block texts and can only be obtained through further browsing and extraction. For example, a user may want to obtain the evidence list of the referee document to be processed directly from the structured text, while the pieces of evidence making up the evidence list are dispersed across block texts such as the plaintiff's claims and the defendant's defense. These block texts are the specified block texts, and they need to be further structured to refine and complete the first structured text.

In one implementation, Fig. 5 is a flowchart of a method for generating a sub-structured text according to an embodiment of the present application. As shown in Fig. 5, the method includes:

S201, determining the corresponding extraction formula according to each extraction node in the second extraction template;

S202, extracting from the specified block text by using each extraction formula to obtain the corresponding target character string;

S203, determining a sub-text, wherein the sub-text is all characters from the target character string to a preset termination symbol;

S204, associating each extraction node in the second extraction template with its sub-text to generate a sub-structured text.

In this implementation, each extraction node in the second extraction template corresponds to a sub-text to be extracted from the specified block text of the first structured text. For example, the extraction nodes in the second extraction template may include evidence directory, cause of the event, plaintiff's attitude, defendant's attitude, and the like. Each of these extraction nodes has corresponding extraction formulas; for example, an extraction formula for matching the dispute focus may be @ \ n [ "\ n. (ii) a Disputed? Focus [ yes ]? : @ or @ \ n [ ", n. (ii) a The focus of the present controversy is @ and the like. Generally, one extraction node can correspond to a plurality of extraction formulas so as to cover the various ways in which the node may be expressed. The extraction formulas are used to match and extract within the specified block text, and the target character string matched by an extraction formula can appear at any position in the block text. For example, suppose the specified block text is "According to the parties' claims and defenses, and with the plaintiff's consent, the dispute focus of this case is determined as follows: 1. …; 2. …; 3. …. Around the dispute focus, the plaintiff provides the following evidence: …." Matching with the extraction formula determines the target character string "the dispute focus of this case is determined as follows".

The sub-text corresponding to the extraction node is then determined from the target character string and a preset termination symbol. The preset termination symbol may be a specified punctuation mark, character, word, phrase, sentence, text format, or the like. By writing convention, a passage on one topic usually ends with a full stop, so the full stop may be set as the termination symbol. The sub-text corresponding to the dispute focus content is then "the dispute focus of this case is determined as follows: 1. …; 2. …; 3. …." With the above method, the sub-text corresponding to each extraction node in the second extraction template can be extracted, and a correspondence is established between each extraction node and its sub-text to generate the sub-structured text.
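Steps S202 and S203 can be sketched as a pattern search followed by a scan for the termination symbol. The function name, the English pattern in the usage note, and the default terminator are assumptions made for illustration only.

```python
import re
from typing import List, Optional

def extract_sub_text(
    block_text: str, patterns: List[str], terminator: str = "."
) -> Optional[str]:
    """Return the text from the first pattern match through the terminator.

    The target character string may occur anywhere in the block text, so
    re.search is used rather than re.match. If no terminator follows the
    match, the sub-text runs to the end of the block text.
    """
    for pattern in patterns:
        hit = re.search(pattern, block_text)
        if hit:
            end = block_text.find(terminator, hit.end())
            end = len(block_text) if end == -1 else end + len(terminator)
            return block_text[hit.start():end]
    return None
```

For example, with the hypothetical pattern `r"the dispute focus of this case is( determined)?( as follows)?:"`, the call returns the sentence that lists the dispute focus items. The terminator must not also occur inside the sub-text: an enumeration written as "1." would end the match early if "." were the terminator, so a distinct symbol should be chosen in that case.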

For the establishment and determination of the second extraction template, reference may be made to the process described above for the first extraction template, which is not repeated here.

In another implementation, Fig. 6 is a flowchart of a method for generating a sub-structured text according to an embodiment of the present application. As shown in Fig. 6, the method includes:

S211, determining the corresponding extraction formula according to each extraction node in the second extraction template;

S212, extracting from the specified block text by using each extraction formula to obtain the corresponding target character string;

S213, determining contents to be processed, wherein a content to be processed is all characters between the target character string and a preset termination symbol;

S214, determining sub-texts from each content to be processed by using a feature matching model;

S215, associating each extraction node in the second extraction template with all the sub-texts obtained from the same content to be processed to generate a sub-structured text.

The process of determining the content to be processed in this implementation is the same as the process of determining the sub-text in the previous implementation and is not repeated here. Unlike the previous implementation, after the content to be processed is determined, a feature matching model is used to match further within the content to be processed to determine the sub-texts. This amounts to a third extraction from the referee document to be processed. For example, for the extraction node "evidence directory", there are feature matching models such as @ [ "n. Original is referred to as "@ etc. With these feature matching models the content to be processed can be extracted precisely, and the sub-texts obtained in this way are usually short or have certain features.
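Step S214 can be illustrated by running a further set of patterns over the content to be processed and collecting every hit. In this sketch the feature matching model is represented simply as a list of regular expressions, which is an assumption; the pattern shown in the usage note is hypothetical.

```python
import re
from typing import List

def refine_sub_texts(content: str, feature_patterns: List[str]) -> List[str]:
    """Collect every span of the content that matches a feature pattern."""
    sub_texts: List[str] = []
    for pattern in feature_patterns:
        sub_texts.extend(m.group(0).strip() for m in re.finditer(pattern, content))
    return sub_texts
```

For instance, the hypothetical pattern `r"Evidence \d+, [^;.]+"` would pull the individual evidence items out of a sentence that enumerates them, and all of those items would then be attached to the same extraction node as in S215.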

Further, the sub-texts in the sub-structured text generated by the above steps are all extracted from the specified block text, so they overlap with the specified block text. To avoid redundancy in the structured text, Fig. 7 is a flowchart of a method for replacing text content according to an embodiment of the present application. As shown in Fig. 7, the method includes:

S205, determining a preposed extraction node, wherein the preposed extraction node is the extraction node corresponding to the block text that contains the sub-text;

S206, replacing the preposed extraction node and its block text in the first structured text with the sub-structured text to obtain a second structured text.

In the above example, the sub-structured text is "content of dispute focus — the dispute focus in this case is determined as follows: 1.…, respectively; 2.…, respectively; 3.… are provided. "wherein subforms" determine the dispute focus of the present case: 1.…, respectively; 2.…, respectively; 3.… are provided. The corresponding appointed block text is' solicited the consent of the original report according to the complaint of the party, and the dispute focus of the scheme is determined as follows: 1.…, respectively; 2.…, respectively; 3.… are provided. Around the focus of dispute, the plaintiff provides evidence as follows: … are provided. ", the extraction node in the first extraction template corresponding to the specified block text is the informed resolution, so the informed resolution is the front extraction node. In order to solve the redundancy problem of the structured text, the part corresponding to the dispute in the first structured text needs to be replaced by the sub-structured text, namely the dispute is called, the dispute obtains the consent of the original report according to the dispute opinion of the party, and the dispute focus of the scheme is determined as follows: 1.…, respectively; 2.…, respectively; 3.… are provided. Around the focus of dispute, the plaintiff provides evidence as follows: … are provided. "replace with" the dispute focus content-determine the dispute focus in this case is: 1.…, respectively; 2.…, respectively; 3.… are provided. ".

At this time, the user can directly locate the contents of the dispute focus by browsing the extraction nodes.

Note that, because the entire content of the pre-extraction node is replaced, other information in that block text may be replaced along with it. To avoid losing information, the replacement may be performed after the extraction work of the other extraction nodes is completed.
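Steps S205 and S206 can be sketched as follows. This is a minimal illustration rather than the implementation of the application: it assumes the structured text is held as an ordered list of (extraction node, text) pairs, and the function name `replace_block` and the node label "pre-extraction node" used as sample data are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: ordered (extraction node, text) pairs.
StructuredText = List[Tuple[str, str]]


def replace_block(first: StructuredText, pre_node: str,
                  sub: StructuredText) -> StructuredText:
    """Replace the pre-extraction node and its block text with the sub-structured text."""
    second: StructuredText = []
    for node, text in first:
        if node == pre_node:
            # S206: both the node and its block text are dropped
            # and the sub-structured entries take their place.
            second.extend(sub)
        else:
            second.append((node, text))
    return second


focus = "the dispute focus of this case is determined as follows: 1. ...; 2. ...; 3. ...."
first = [
    ("other node", "other block text ..."),
    ("pre-extraction node", "... " + focus + " Around the dispute focus, "
     "the plaintiff provides evidence as follows: ...."),
]
sub = [("dispute focus content", focus)]

second = replace_block(first, "pre-extraction node", sub)
```

Because the whole block is discarded, the trailing evidence sentence in the sample block disappears from `second`, which is the information loss that the note above warns about.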

In an implementation manner, if the sub-text corresponding to each extraction node in the sub-structured text covers a part of the text in the specified block text, the sub-structured text is used to replace the part of the text in the specified block text in the first structured text, so as to obtain a second structured text.

The sub-texts extracted by some extraction nodes in the second extraction template may be only part of the specified block text. For example, when the extraction node in the second extraction template is the evidence catalog, each piece of evidence making up the evidence catalog comes from part of a different specified block text. To keep the remaining text in the block text fully displayed, the specified block text cannot be replaced as a whole by the sub-structured text. Instead, only the part of the specified block text covered by the sub-text is replaced.

This implementation is described by taking the extraction node evidence catalog as an example. Specifically, each extraction node has a corresponding extraction formula; for example, the extraction formula corresponding to the evidence catalog may be @ \ n [ "\ n. (ii) a Original proof @ or @ \ n [ ", n. (ii) a 0,10 original [ "n,. (ii) a {0,10} to (court | court) (filing | providing | presentation) @ or @ \ n court [ ", n. (ii) a Evidence {0,15} (certifying I authentication) is as follows: @ and the like. Matching and extraction are performed in the specified block text by using the extraction formula, so that each evidence text in the specified block text, namely each sub-text, can be determined. For example, the specified block text is "The plaintiff claims: the defendant …. In support of its litigation request, the plaintiff submitted evidence to the court as follows: …." Using the extraction formula corresponding to the extraction node evidence catalog in the second extraction template, the sub-text can be determined as "In support of its litigation request, the plaintiff submitted evidence to the court as follows: …." To keep the rest of the specified block text fully displayed, the part of the text covered by the sub-text is deleted from the specified block text, and the sub-structured text is then added to the first structured text after the deletion, which completes the replacement.

In the above example, the extraction node corresponding to the specified block text is the plaintiff's claim, and the corresponding entry in the first structured text is "plaintiff's claim - The plaintiff claims: the defendant …. In support of its litigation request, the plaintiff submitted evidence to the court as follows: …." The covered text "In support of its litigation request, the plaintiff submitted evidence to the court as follows: …." is deleted, and then the sub-structured text "evidence catalog - In support of its litigation request, the plaintiff submitted evidence to the court as follows: …." is added to obtain the second structured text, namely:

Plaintiff's claim - The plaintiff claims: the defendant ….

Evidence catalog - In support of its litigation request, the plaintiff submitted evidence to the court as follows: ….

At this time, the user can directly locate the content corresponding to the evidence catalog by browsing the extraction nodes.
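The partial replacement described above can be sketched as follows. This is a minimal illustration under assumptions not stated in the application: the structured text is an ordered list of (extraction node, text) pairs, each sub-text appears verbatim in the block text, and the sub-structured entries are placed directly after the trimmed block. The function name `replace_covered_part` is hypothetical.

```python
from typing import List, Tuple

# Assumed representation: ordered (extraction node, text) pairs.
StructuredText = List[Tuple[str, str]]


def replace_covered_part(first: StructuredText, pre_node: str,
                         sub: StructuredText) -> StructuredText:
    """Delete the covered sub-texts from the block, then add the sub-structured text."""
    second: StructuredText = []
    for node, text in first:
        if node != pre_node:
            second.append((node, text))
            continue
        remaining = text
        for _, sub_text in sub:
            # Remove only the first verbatim occurrence of each sub-text.
            remaining = remaining.replace(sub_text, "", 1)
        second.append((node, remaining.strip()))
        second.extend(sub)  # assumption: placed right after the trimmed block
    return second


evidence = ("In support of its litigation request, the plaintiff submitted "
            "evidence to the court as follows: ...")
first = [("plaintiff's claim", "The plaintiff claims: the defendant ... " + evidence)]
sub = [("evidence catalog", evidence)]

second = replace_covered_part(first, "plaintiff's claim", sub)
```

Here `second` keeps the plaintiff's claim without the evidence sentence and gains a separate evidence catalog entry, mirroring the two-entry result shown in the example above.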

In an implementation manner, if a sub-text corresponding to each extraction node in the sub-structured text corresponds to a part of content in the specified block text, and a reference relationship exists between the sub-text and content in the specified block text except for the part of content, the sub-structured text is added to the first structured text, so as to obtain a second structured text.

In this implementation, the sub-text is associated with the remaining text in the specified block text, so the sub-structured text can neither replace the entire specified block text nor replace the repeated part of it. Instead, the specified block text is kept in full and the sub-structured text is added alongside it, so that the sub-structured text is displayed separately.

This implementation is illustrated by taking the extraction node dispute focus content as an example. Specifically, the specified block text is "This court holds that …. The dispute focus of this case is as follows: 1. …; 2. …; 3. …. Regarding dispute focus 1, …; regarding dispute focus 2, …." The sub-text of the dispute focus content is extracted by the method described above, and the details are not repeated here. The extracted sub-text corresponding to the dispute focus content is "The dispute focus of this case is as follows: 1. …; 2. …; 3. …." Although this sub-text is only part of the specified block text, it is associated with "Regarding dispute focus 1, …; regarding dispute focus 2, …." If "The dispute focus of this case is as follows: 1. …; 2. …; 3. …." were deleted from the specified block text, then "Regarding dispute focus 1, …; regarding dispute focus 2, …." could no longer be fully understood, because the statement it refers to would be missing. To avoid this, the content of the first structured text is kept and the sub-structured text is added to the first structured text to obtain the second structured text.

That is:

Court opinion - This court holds that …. The dispute focus of this case is as follows: 1. …; 2. …; 3. …. Regarding dispute focus 1, …; regarding dispute focus 2, ….

Dispute focus content - The dispute focus of this case is as follows: 1. …; 2. …; 3. ….

At this point, the user can directly locate the dispute focus content by browsing the extraction nodes, and the display of the court opinion is not affected.
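The adding mode can be sketched as follows. This is a minimal illustration rather than the implementation of the application: it assumes the structured text is an ordered list of (extraction node, text) pairs and that the sub-structured entries are inserted directly after the block they were extracted from. How a reference relationship is detected is not covered here, and the function name `add_sub_structured` is hypothetical.

```python
from typing import List, Tuple

# Assumed representation: ordered (extraction node, text) pairs.
StructuredText = List[Tuple[str, str]]


def add_sub_structured(first: StructuredText, pre_node: str,
                       sub: StructuredText) -> StructuredText:
    """Keep every block unchanged and add the sub-structured text after its source block."""
    second: StructuredText = []
    for node, text in first:
        second.append((node, text))  # the block text is never modified
        if node == pre_node:
            second.extend(sub)
    return second


focus = "The dispute focus of this case is as follows: 1. ...; 2. ...; 3. ...."
opinion = ("This court holds that .... " + focus +
           " Regarding dispute focus 1, ...; regarding dispute focus 2, ....")
first = [("court opinion", opinion)]
sub = [("dispute focus content", focus)]

second = add_sub_structured(first, "court opinion", sub)
```

The court opinion stays intact, so the passages that refer back to the dispute focus remain readable, while the dispute focus content is also reachable through its own node.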

Fig. 8 is a schematic diagram of a first embodiment of a referee document structuring apparatus provided by an embodiment of the present application. The apparatus includes: a first extraction unit 1, configured to extract block texts in a referee document to be processed by using a first extraction template to obtain a first structured text, where the first structured text is composed of the extraction nodes in the first extraction template and the corresponding block texts in the referee document to be processed; a second extraction unit 2, configured to extract, by using a second extraction template, a sub-structured text from a specified block text of the first structured text, where the sub-structured text is composed of the extraction nodes in the second extraction template and the corresponding sub-texts in the specified block text; and an updating unit 3, configured to update the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.

Fig. 9 is a schematic diagram of a second embodiment of a referee document structuring apparatus provided by an embodiment of the present application. The apparatus further includes: a sample acquisition unit 01, configured to acquire referee document samples, where the referee document samples are of the same type; a dividing unit 02, configured to divide each referee document sample into sample block texts according to a preset text division rule; a node title setting unit 03, configured to set a node title for each sample block text; an extraction template sample generating unit 04, configured to combine all node titles of the same referee document sample to generate a corresponding extraction template sample; and an extraction template generating unit 05, configured to combine the extraction template samples to generate an extraction template.

Fig. 10 is a schematic diagram of a third embodiment of a referee document structuring apparatus provided by an embodiment of the present application. The apparatus further includes: a matching unit 06, configured to extract, from the referee document to be processed, target keywords that match words in a keyword library; a similarity calculation unit 07, configured to calculate the semantic similarity between each target keyword and the template title of each extraction template; a matching degree calculation unit 08, configured to calculate, by combining the weight and the semantic similarity corresponding to each target keyword, the matching degree between the referee document to be processed and each extraction template; and a first extraction template determining unit 09, configured to determine a first extraction template, where the first extraction template is the extraction template with the highest matching degree.
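The template selection performed by units 06 to 09 can be sketched as follows. This is only a sketch under stated assumptions: the way weight and semantic similarity are combined is taken here to be a weighted sum, and a word-overlap (Jaccard) score stands in for the semantic similarity, which this passage does not specify. The names `jaccard` and `pick_first_template` and all sample data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (assumption): word overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pick_first_template(
    document: str,
    keyword_weights: Dict[str, float],
    template_titles: List[str],
    similarity: Callable[[str, str], float] = jaccard,
) -> Tuple[str, Dict[str, float]]:
    # Unit 06: keep the library keywords that occur in the document.
    targets = [k for k in keyword_weights if k in document]
    # Units 07 and 08: weighted sum of similarities (assumed combination rule).
    scores = {
        title: sum(keyword_weights[k] * similarity(k, title) for k in targets)
        for title in template_titles
    }
    # Unit 09: the template with the highest matching degree.
    best = max(scores, key=lambda title: scores[title])
    return best, scores


document = "civil judgment: loan contract dispute between ..."
keyword_weights = {"loan contract": 2.0, "divorce": 2.0, "civil": 1.0}
titles = ["civil loan contract dispute", "civil divorce dispute", "criminal theft"]

best, scores = pick_first_template(document, keyword_weights, titles)
```

In this sample, "divorce" does not occur in the document and therefore contributes nothing, so the loan contract template receives the highest matching degree.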

Fig. 11 is a schematic diagram of a fourth embodiment of an apparatus for structuring a referee document according to an embodiment of the present application, where the first extraction unit 1 includes: a node character determining unit 11, configured to determine, according to each extraction node in a first extraction template, a node character in a referee document to be processed, where the extraction node is a character string having a corresponding relationship with each part of content in the referee document to be processed, and the node character is a start character of a part of content in the referee document to be processed corresponding to the extraction node; a block text determining unit 12, configured to determine a block text corresponding to each extracted node, where the block text includes all characters from a node character corresponding to the extracted node to a next node character; a first structured text generating unit 13, configured to correspond each of the extraction nodes to the block text, and generate a first structured text.
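The work of units 11 to 13 can be sketched as follows. This is a minimal illustration under assumptions not stated in the application: each extraction node is assumed to occur literally in the document, the position of its first occurrence is taken as the node character, and a block text runs up to, but does not include, the next node character. The function name `split_into_blocks` is hypothetical.

```python
from typing import List, Tuple


def split_into_blocks(document: str, nodes: List[str]) -> List[Tuple[str, str]]:
    """Return (extraction node, block text) pairs in document order."""
    # Unit 11: locate the start character of each node (first occurrence).
    positions = sorted(
        (document.find(node), node) for node in nodes if document.find(node) != -1
    )
    structured: List[Tuple[str, str]] = []
    for i, (start, node) in enumerate(positions):
        # Unit 12: the block ends where the next node starts, or at the end of the text.
        end = positions[i + 1][0] if i + 1 < len(positions) else len(document)
        # Unit 13: pair the node with its block text.
        structured.append((node, document[start:end]))
    return structured


document = ("The plaintiff claims: A. "
            "The defendant argues: B. "
            "This court holds: C.")
nodes = ["This court holds", "The plaintiff claims", "The defendant argues"]

blocks = split_into_blocks(document, nodes)
```

Sorting by position means the order of `nodes` in the template does not have to match the order in which the parts appear in the document.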

Fig. 12 is a schematic diagram of a fifth embodiment of an apparatus for structuring a referee document according to an embodiment of the present application, where the second extraction unit 2 includes: a first extraction formula determining unit 21, configured to determine a corresponding extraction formula according to each extraction node in the second extraction template; a first target character string determining unit 22, configured to extract from the specified block text by using each extraction formula, so as to obtain a corresponding target character string; a first sub-text determining unit 23 configured to determine a sub-text, which is all characters from the target character string to a preset end symbol; and a first sub-structured text generating unit 24, configured to correspond each extraction node in the second extraction template to the sub-text, and generate a sub-structured text.
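The work of units 21 to 24 can be sketched as follows. This is a minimal illustration, not the extraction formula of the application: the regular expression, the choice of a newline as the preset end symbol, and the decision to exclude the end symbol from the sub-text are all assumptions made for the example. The function name `extract_sub_text` is hypothetical.

```python
import re
from typing import Optional


def extract_sub_text(block_text: str, formula: str, end_symbol: str) -> Optional[str]:
    """Return the text from the matched target string up to the end symbol."""
    match = re.search(formula, block_text)  # unit 22: find the target character string
    if match is None:
        return None
    # Unit 23: look for the end symbol after the target string.
    end = block_text.find(end_symbol, match.end())
    if end == -1:
        end = len(block_text)
    return block_text[match.start():end]


block_text = (
    "The plaintiff claims: ...\n"
    "In support of its request, the plaintiff submitted evidence to the court "
    "as follows: 1. contract; 2. receipt.\n"
    "The defendant argues: ..."
)
# Hypothetical formula for the extraction node "evidence catalog".
formula = r"In support of [^\n.;]{0,40}submitted evidence to the (court|tribunal) as follows:"

sub_text = extract_sub_text(block_text, formula, "\n")
# Unit 24: pair the extraction node with its sub-text.
sub_structured = [("evidence catalog", sub_text)]
```

The bounded character class in the formula keeps the match within a single sentence, so the target string cannot span unrelated parts of the block.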

Fig. 13 is a schematic diagram of a sixth embodiment of an apparatus for structuring a referee document according to an embodiment of the present application, where the second extraction unit 2 includes: a second extraction formula determining unit 25, configured to determine a corresponding extraction formula according to each extraction node in the second extraction template; a second target character string determining unit 26, configured to extract from the specified block text by using each extraction formula, so as to obtain a corresponding target character string; a to-be-processed content determining unit 27 configured to determine to-be-processed content, where the to-be-processed content is all characters from the target character string to a preset termination symbol; a second sub-text determining unit 28 configured to determine sub-texts from each of the contents to be processed by using a feature matching model; a second sub-structured text generating unit 29, configured to correspond each extraction node in the second extraction template to all the sub-texts corresponding to the same content to be processed, so as to generate a sub-structured text.

Fig. 14 is a schematic diagram of a seventh embodiment of an apparatus for structuring a referee document according to an embodiment of the present application, where the updating unit 3 includes: a pre-extraction node determining unit 31, configured to determine a pre-extraction node, where the pre-extraction node is an extraction node corresponding to a block text where the sub-text is located; a first replacing unit 32, configured to replace the pre-extraction node and the corresponding block text in the first structured text with the sub-structured text, so as to obtain a second structured text.

Fig. 15 is a schematic diagram of an eighth embodiment of an apparatus for structuring a referee document according to an embodiment of the present application, where the updating unit 3 includes: a second replacing unit 33, configured to replace, if the sub-text corresponding to each extraction node in the sub-structured text covers a part of the text in the specified block text, the part of the text in the specified block text in the first structured text with the sub-structured text, so as to obtain a second structured text.

Fig. 16 is a schematic diagram of a ninth embodiment of an apparatus for structuring a referee document according to an embodiment of the present application, where the updating unit 3 includes: an adding unit 34, configured to add the sub-structured text to the first structured text to obtain a second structured text if the sub-text corresponding to each extraction node in the sub-structured text corresponds to a part of the content in the specified block text, and a reference relationship exists between the sub-text and the content in the specified block text except for the part of the content.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for structuring a referee document, the method comprising:

2. The method according to claim 1, wherein the extracting the block text in the referee document to be processed by using the first extraction template to obtain the first structured text comprises:

acquiring referee document samples, wherein the referee document samples are of the same category;

dividing each referee document sample into sample block texts according to a preset text division rule;

setting a node title for each sample block text;

combining all the node titles of the same referee document sample to generate a corresponding extraction template sample;

combining the extraction template samples to generate an extraction template.

3. The method according to claim 2, wherein, before extracting the block text in the referee document to be processed by using the first extraction template to obtain the first structured text, the method further comprises:

extracting, from the referee document to be processed, target keywords matched with words in a keyword library;

calculating the semantic similarity between each target keyword and the template title of each extraction template;

calculating the matching degree of the referee document to be processed and each extraction template by combining the weight and the semantic similarity corresponding to each target keyword;

and determining a first extraction template, wherein the first extraction template is the extraction template with the highest matching degree.

4. The method according to claim 1, wherein the extracting the block text in the referee document to be processed by using the first extraction template to obtain the first structured text comprises:

determining node characters in a referee document to be processed according to each extraction node in a first extraction template, wherein the extraction nodes are character strings which have corresponding relations with contents of all parts in the referee document to be processed, and the node characters are initial characters of the contents of the parts, corresponding to the extraction nodes, in the referee document to be processed;

determining a block text corresponding to each extraction node, wherein the block text comprises all characters from the node character corresponding to the extraction node to the next node character;

and corresponding each extraction node to the block text to generate a first structured text.

5. The method of claim 1, wherein extracting from the specified block text of the first structured text using the second extraction template to obtain a sub-structured text comprises:

determining a corresponding extraction formula according to each extraction node in the second extraction template;

extracting from the specified block text by using each extraction formula to obtain a corresponding target character string;

determining a sub-text, wherein the sub-text is all characters from the target character string to a preset termination symbol;

and corresponding each extraction node in the second extraction template to the sub-text to generate a sub-structured text.

6. The method of claim 1, wherein extracting from the specified block text of the first structured text using the second extraction template to obtain a sub-structured text comprises:

determining content to be processed, wherein the content to be processed is all characters between the target character string and a preset termination symbol;

determining a sub-text from each content to be processed by using a feature matching model;

and corresponding each extraction node in the second extraction template to all the sub-texts corresponding to the same content to be processed to generate a sub-structured text.

7. The method of claim 6, wherein the updating the corresponding content in the first structured text with the sub-structured text to obtain a second structured text comprises:

determining a pre-extraction node, wherein the pre-extraction node is the extraction node corresponding to the block text in which the sub-text is located;

and replacing the pre-extraction node and the corresponding block text in the first structured text with the sub-structured text to obtain a second structured text.

8. The method of claim 5, wherein the updating the corresponding content in the first structured text with the sub-structured text to obtain a second structured text comprises:

and if the sub-texts corresponding to the extraction nodes in the sub-structured text cover part of the text in the specified block text, replacing that part of the text in the specified block text in the first structured text with the sub-structured text to obtain a second structured text.

9. The method of claim 5, wherein the updating the corresponding content in the first structured text with the sub-structured text to obtain a second structured text comprises:

and if the sub-texts corresponding to the extraction nodes in the sub-structured text correspond to partial contents in the specified block text and reference relations exist between the sub-texts and the contents in the specified block text except the partial contents, adding the sub-structured text to the first structured text to obtain a second structured text.

10. An apparatus for structuring a referee document, comprising: