CN111259631B - Referee document structuring method and referee document structuring device

Info

Publication number: CN111259631B
Application number: CN202010041736.9A
Authority: CN
Inventors: 席丽娜; 王文军; 晋耀红
Original assignee: Dingfu Intelligent Technology Co ltd
Current assignee: Dingfu Intelligent Technology Co ltd
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2023-08-25
Anticipated expiration: 2040-01-15
Also published as: CN111259631A

Abstract

The application provides a method and a device for structuring a referee document. A first extraction template is used to extract block texts from the referee document to be processed, yielding a first structured text. A second extraction template is then used to extract from specified block texts of the first structured text, yielding a first sub-structured text, and the sub-block texts of the first sub-structured text are converted into texts with a preset feature expression format, yielding a second sub-structured text. Finally, the corresponding content in the first structured text is updated with the second sub-structured text to obtain a second structured text. The method thus further extracts from the first structured text and converts the extracted text into a format better suited to display, so that a user can quickly locate the required content while browsing.

Description

Referee document structuring method and referee document structuring device

Technical Field

The application relates to the technical field of text processing, in particular to a referee document structuring method and device.

Background

In general, legal documents such as referee documents are lengthy and obscure, which makes it difficult for a reader to quickly locate, within the whole referee document, the content that needs to be viewed carefully. Moreover, while browsing a referee document, the user typically also needs to browse several referee documents from cases similar to the current one to help understand and simulate the current referee document. For some special referee documents, such as civil referee documents, some hidden information has to be extracted in a targeted manner from part of the text after all of the text has been browsed. For such referee documents, browsing even a single document is difficult, and finding documents similar to the current one among a large number of referee documents is harder still; this wastes a large amount of time and does not necessarily find the referee document with the highest similarity.

Specifically, if, for example, the user needs to find evidence-related content in a referee document, the user must read from the first character of the document, learn what each part of the document sets forth, judge which parts may contain evidence, and then extract the evidence-related content from those parts. Manually analyzing the structure of a referee document is time-consuming and is affected by uncertain factors such as the reader's knowledge and reasoning, so the result is prone to low accuracy and may have no reference value. The existing way of browsing referee documents is therefore inefficient and yields low-quality results.

Disclosure of Invention

The application provides a method and a device for structuring a referee document, which are used for improving format standardization of the referee document and facilitating browsing of users.

In a first aspect, the present application provides a method for structuring referee documents, said method comprising:

extracting block texts in the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed;

extracting from a specified block text of the first structured text by using a second extraction template to obtain a first sub-structured text, wherein the first sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text;

converting the sub-block text of the first sub-structured text into a text with a preset feature expression format to obtain a second sub-structured text;

and updating the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

In a second aspect, the present application provides a referee document structuring device comprising:

the first extraction unit is used for extracting block texts in the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed;

the second extraction unit is used for extracting from the specified block text of the first structured text by using a second extraction template to obtain a first sub-structured text, wherein the first sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text;

the conversion unit is used for converting the sub-block text of the first sub-structured text into a text with a preset feature expression format to obtain a second sub-structured text;

and the updating unit is used for updating the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

According to the method and the device for structuring a referee document, block texts in the referee document to be processed are first extracted by using a first extraction template to obtain a first structured text. Extraction is then performed on the specified block texts of the first structured text by using a second extraction template to obtain a first sub-structured text, and the sub-block texts of the first sub-structured text are converted into texts with a preset feature expression format to obtain a second sub-structured text. Finally, the corresponding content in the first structured text is updated with the second sub-structured text to obtain a second structured text. The method thus further extracts from the first structured text and converts the extracted text into a format better suited to display, so that a user can quickly locate the required content while browsing.

Drawings

In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a flowchart of a method for structuring referee documents according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for generating a first sub-structured text according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for converting text feature expression format according to an embodiment of the present application;

FIG. 5 is a flowchart of a method for converting text feature expression format according to an embodiment of the present application;

FIG. 6 is a flowchart of a method for converting text feature expression format according to an embodiment of the present application;

FIG. 7 is a flowchart of a method for converting text feature expression format according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a referee document structuring device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to solve the above problems, the application provides a referee document structuring method and device that turn a referee document into structured text, so that a user can quickly find the required content in the referee document.

FIG. 1 is a flowchart of a method for structuring a referee document according to an embodiment of the present application. As shown in FIG. 1, the method includes:

S1, extracting block texts in a referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed.

The referee document to be processed is input into a referee document structuring device, which may be a server, a PC (personal computer), a tablet computer, a mobile phone, or another text processing device. The referee document to be processed may be a judgment of any trial level in a civil case or the like. After receiving the referee document to be processed, the referee document structuring device needs to preprocess it to determine the text to be structured. For example, the referee documents input into the referee document structuring device include a criminal first-trial judgment, a criminal second-trial judgment, and a criminal final-trial judgment, but at present only the civil first-trial judgment needs to be structured; in this case, the text of the civil first-trial judgment needs to be extracted through preprocessing, and the document to be structured can be determined by matching against the document title in the referee document to be processed.

The block text is the text content in the referee document to be processed that corresponds to each extraction node in the first extraction template. For example, the content of the referee document to be processed includes "principal x …, found upon trial x …", and the first extraction template includes the extraction nodes "principal information" and "trial finding"; then "principal x …" is the block text corresponding to "principal information", and "found upon trial x …" is the block text corresponding to "trial finding".

The first extraction template may be an extraction model that is established in advance, before the referee document to be processed is structured. Specifically:

S001, acquiring referee document samples, wherein the referee document samples belong to the same category;

S002, dividing each referee document sample into sample block texts according to a preset text division rule;

S003, setting a node title for each sample block text;

S004, combining all node titles of the same referee document sample to generate a corresponding extraction template sample;

S005, combining the extraction template samples to generate an extraction template.

A referee document is a text with standardized content; that is, for referee documents of the same category, the types of content involved are roughly the same regardless of format changes. For example, a referee document basically involves principal information, the course of the trial, the litigant's claims, the litigant's defense, trial findings, the court's view, the decision result, and other types of content, so an extraction template can be generated by training on a large number of referee document samples.

In general, referee documents of different categories correspond to different extraction templates, where the category refers to the case field, the trial level, and the like of the referee document. For example, criminal first-trial judgments, criminal second-trial judgments, and civil first-trial judgments belong to three categories.

Before the extraction template for a certain category of referee document is trained, a large number of referee document samples of that category need to be obtained first, preferably in a format in which each title corresponds to specific text content, such as "principal information-principal x …; trial finding-found upon trial …". Samples in this format are closest to the finally generated extraction template, which can effectively improve training efficiency.

If the selected referee document samples do not have the format described above, each sample may first be divided into sample block texts according to a preset text division rule, where a sample block text is a block of text contained in a referee document sample. Examples of text division rules include dividing by paragraph, dividing by subtitles within the text, and dividing by specified paragraph start characters. A node title is then set for each sample block text; typically the node title is a string that summarizes the semantics of the sample block text. For example, for the sample block text "principal x …", the node title "principal information" can be set. Further, if semantically repeated node titles appear among the node titles set for the same referee document sample, the sample block texts corresponding to those node titles can be merged, and one of the node titles is selected as the node title of the merged sample block text.

After the node titles corresponding to the sample block texts of a referee document sample are obtained, they may be combined to generate the extraction template sample corresponding to that referee document sample. An extraction template can then be obtained by training on a large number of such extraction template samples. Further, by continuously enriching the referee document samples, the generated extraction template can be continuously optimized.
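The template generation in steps S001 to S005 can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained model the application describes: the rule table `TITLE_RULES`, the English paragraph prefixes, the "other" fallback title, and the combination of template samples as an ordered union are all hypothetical choices made for the example.

```python
from typing import Dict, List

# Hypothetical rule table (assumption): paragraph-start string -> node title.
TITLE_RULES: Dict[str, str] = {
    "Plaintiff": "principal information",
    "Defendant": "principal information",
    "Upon trial": "trial finding",
    "This court holds": "court view",
}


def split_into_blocks(sample: str) -> List[str]:
    """S002: divide a sample by paragraph, one of the division rules in the text."""
    return [p.strip() for p in sample.split("\n") if p.strip()]


def title_for(block: str) -> str:
    """S003: assign a node title that summarizes the sample block text."""
    for prefix, title in TITLE_RULES.items():
        if block.startswith(prefix):
            return title
    return "other"  # fallback title, an assumption of this sketch


def template_sample(sample: str) -> List[str]:
    """S004: combine the node titles of one sample, merging repeated titles."""
    titles: List[str] = []
    for block in split_into_blocks(sample):
        title = title_for(block)
        if title not in titles:
            titles.append(title)
    return titles


def build_template(samples: List[str]) -> List[str]:
    """S005: combine template samples; here simply an ordered union (assumption)."""
    template: List[str] = []
    for sample in samples:
        for title in template_sample(sample):
            if title not in template:
                template.append(title)
    return template


samples = [
    "Plaintiff A ...\nDefendant B ...\nUpon trial it is found ...",
    "Plaintiff C ...\nUpon trial it is found ...\nThis court holds ...",
]
template = build_template(samples)
print(template)
```

Adding more samples to `samples` extends the template with any node titles not seen before, which loosely mirrors the continuous enrichment mentioned above.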

For each category of referee document, the corresponding extraction template can be generated by the above method.

The extraction templates generated in this way can be reused by the referee document structuring device at any time without being regenerated. Therefore, when the device uses the extraction templates, it needs to select, from all the extraction templates, the first extraction template suitable for the referee document to be processed.

Specifically:

S011, extracting target keywords matched with words in a keyword library from the referee document to be processed;

S012, calculating the semantic similarity between each target keyword and the template title of each of the extraction templates;

S013, calculating the matching degree between the referee document to be processed and each extraction template by combining the weight and the semantic similarity corresponding to each target keyword;

S014, determining a first extraction template, wherein the first extraction template is the extraction template with the highest matching degree.

Words that indicate the category of the referee document to be processed will necessarily appear in its title or body. Such words may differ in form while expressing the same meaning, for example two different wordings that both mean "first trial". The words in the referee document to be processed can therefore be matched against the words in the keyword library to determine the target keywords, that is, the words whose semantic similarity to a library word is higher than a threshold; the target keywords represent the category of the referee document to be processed.

Each extraction template has a corresponding template title. The target keywords of the referee document to be processed can be matched against the template titles to find the template title with the highest matching degree; the extraction template corresponding to that template title is the first extraction template applicable to the referee document to be processed.
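Steps S011 to S014 can be illustrated with the sketch below. The text does not specify how semantic similarity or the keyword weights are computed, so the word-overlap `similarity` function, the `KEYWORD_WEIGHTS` table, and the sample titles are assumptions standing in for whatever the actual device uses.

```python
from typing import Dict, List

# Hypothetical keyword library with per-keyword weights (assumption).
KEYWORD_WEIGHTS: Dict[str, float] = {
    "civil": 0.5,
    "first trial": 0.3,
    "judgment": 0.2,
    "criminal": 0.5,
}


def target_keywords(document: str) -> List[str]:
    """S011: keep the library keywords that occur in the document."""
    text = document.lower()
    return [kw for kw in KEYWORD_WEIGHTS if kw in text]


def similarity(keyword: str, title: str) -> float:
    """S012: stand-in for semantic similarity, the share of keyword words found in the title."""
    kw_words = set(keyword.lower().split())
    title_words = set(title.lower().split())
    return len(kw_words & title_words) / len(kw_words)


def matching_degree(keywords: List[str], title: str) -> float:
    """S013: weighted sum of similarities over all target keywords."""
    return sum(KEYWORD_WEIGHTS[kw] * similarity(kw, title) for kw in keywords)


def select_template(document: str, template_titles: List[str]) -> str:
    """S014: pick the template title with the highest matching degree."""
    keywords = target_keywords(document)
    return max(template_titles, key=lambda t: matching_degree(keywords, t))


titles = [
    "criminal first trial judgment",
    "civil first trial judgment",
    "civil second trial judgment",
]
doc = "Civil Judgment of First Trial\nPlaintiff ..."
chosen = select_template(doc, titles)
print(chosen)
```

In this toy setup the civil first-trial title scores highest because all three target keywords match it fully; a real similarity measure would also have to handle differently worded synonyms, which this word-overlap stand-in does not.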

After the first extraction template is determined, it is used to determine the node characters in the referee document to be processed. Specifically, FIG. 2 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application. As shown in FIG. 2, the method includes:

S101, determining node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings that correspond to the parts of the content of the referee document to be processed, and the node characters are the initial characters of the part of the content corresponding to an extraction node in the referee document to be processed;

S102, determining the block text corresponding to each extraction node, wherein the block text is all characters from the node characters corresponding to the extraction node up to the next node characters;

S103, associating each extraction node with its block text to generate a first structured text.

Specifically, the first extraction template is composed of a plurality of extraction nodes, and the extraction nodes represent the texts to be extracted. For example, if the extraction nodes in the first extraction template are "head", "principal information", and "trial finding", the corresponding texts can be extracted from the referee document to be processed according to these extraction nodes. If the referee document to be processed includes "x court …, principal x …, found upon trial x …", then the part corresponding to the extraction node "head" is "x court …", the part corresponding to "principal information" is "principal x …", and the part corresponding to "trial finding" is "found upon trial x …".

Specifically, the node character may be determined as follows.

S1011, obtaining the extraction expressions corresponding to each extraction node;

S1012, matching each extraction expression against the first-line characters of each unmatched paragraph in the referee document to be processed to obtain a matched paragraph, wherein an unmatched paragraph is a paragraph that has not yet been matched by any extraction expression;

S1013, extracting the first-line characters of the corresponding matched paragraph by using the extraction expression to obtain the node characters.

Writing habits dictate that the characters in one paragraph usually form the smallest unit of complete semantics, so the paragraph can be used as the search unit in which node characters are searched. Because the node characters are the key to dividing the referee document to be processed, they must contain a word or phrase corresponding to the extraction node; the node characters can therefore be determined by recognizing these words or phrases, and recognition and extraction can generally be performed with extraction expressions. For example, for the extraction node "trial finding", the corresponding extraction expression may be @\n[^\n.]?(upon)?(this court)?trial finding:@ or @\nupon (court)?trial finding@, and so on. One extraction node typically corresponds to multiple extraction expressions to accommodate the multiple ways in which the extraction node may be expressed. The first-line characters of each paragraph can thus be matched with the extraction expressions, and the matched first-line characters are extracted to obtain the node characters. For example, if a paragraph of the referee document to be processed is "found upon trial, x and x have a liability relationship …", the node characters "found upon trial" can be extracted with the extraction expression.

It should be noted that, when matching with the extraction expressions, the paragraphs are matched one by one, and only unmatched paragraphs are matched. This preserves the extraction order and prevents omissions, and it also prevents paragraphs whose node characters have already been determined from being extracted again, avoiding wasted time and extraction errors.
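The paragraph-by-paragraph matching of steps S1011 to S1013 might look like the following sketch. The English regular expressions in `EXPRESSIONS` are hypothetical stand-ins for the extraction expressions of the application, and the choice to locate each extraction node only once is an assumption of this example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical extraction expressions (assumption); one node may have several.
EXPRESSIONS: Dict[str, List[str]] = {
    "principal information": [r"^(Plaintiff|Defendant)\b"],
    "trial finding": [
        r"^Upon trial,? (this court|it is) found",
        r"^The court found",
    ],
}


def find_node_characters(paragraphs: List[str]) -> List[Tuple[int, str, str]]:
    """Return (paragraph index, extraction node, node characters) in document order."""
    located = set()  # nodes whose node characters are already determined
    results: List[Tuple[int, str, str]] = []
    for index, paragraph in enumerate(paragraphs):
        lines = paragraph.splitlines()
        first_line = lines[0] if lines else ""
        for node, patterns in EXPRESSIONS.items():
            if node in located:
                continue
            # Try each expression of this node against the first line only.
            match = next(
                (m for m in (re.search(p, first_line) for p in patterns) if m),
                None,
            )
            if match:
                results.append((index, node, match.group(0)))
                located.add(node)
                break  # this paragraph is now matched and is not tried again
    return results


paragraphs = [
    "XX Court Civil Judgment",
    "Plaintiff A ...",
    "Defendant B ...",
    "Upon trial, this court found that A and B have a liability relationship ...",
]
node_characters = find_node_characters(paragraphs)
print(node_characters)
```

Because each paragraph is visited once and leaves the loop as soon as it matches, the results stay in document order and no paragraph is extracted twice.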

After the node characters are determined, the corresponding block texts can be determined from them. A block text is the part of the referee document to be processed that lies between two adjacent node characters and starts from the former node characters. For example, the content of the referee document to be processed includes "principal x …, found upon trial x …"; through the above process, "principal" and "found upon trial" are determined to be node characters, and the two node characters are adjacent, so "principal x …" is the block text corresponding to the extraction node "principal information".

After the block text corresponding to each extraction node is determined, the name of the extraction node may be used as a title, and a correspondence between each title and its block text may be established, so that the referee document to be processed is structured into a first structured text composed of a plurality of "extraction node-block text" pairs. For example, for a civil first-trial judgment, a first extraction template consisting of extraction nodes such as "head, principal information, course of the trial, plaintiff's claims, defendant's defense, trial finding, court view, decision result, tail" may be selected for extraction to obtain the block text corresponding to each extraction node, thereby generating the first structured text.
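Steps S102 and S103 amount to slicing the document between consecutive node characters. The sketch below assumes the start offset of each node's characters is already known; the function name `build_first_structured_text`, the dictionary representation of the first structured text, and the sample document are hypothetical.

```python
from typing import Dict, List, Tuple


def build_first_structured_text(
    document: str, node_starts: List[Tuple[str, int]]
) -> Dict[str, str]:
    """Map each extraction node to the text from its node characters to the next ones."""
    ordered = sorted(node_starts, key=lambda item: item[1])
    structured: Dict[str, str] = {}
    for i, (node, start) in enumerate(ordered):
        # The block ends where the next node's characters begin, or at the document end.
        end = ordered[i + 1][1] if i + 1 < len(ordered) else len(document)
        structured[node] = document[start:end].strip()
    return structured


document = (
    "XX Court civil judgment\n"
    "Plaintiff A ...\n"
    "Defendant B ...\n"
    "Upon trial it is found that ...\n"
)
node_starts = [
    ("head", 0),
    ("principal information", document.index("Plaintiff")),
    ("trial finding", document.index("Upon trial")),
]
first_structured = build_first_structured_text(document, node_starts)
for node, block in first_structured.items():
    print(node, "->", repr(block))
```

A dictionary keeps the "extraction node-block text" correspondence explicit, which also makes it straightforward to replace one block later when the first structured text is updated.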

S2, extracting from the specified block text of the first structured text by using a second extraction template to obtain a first sub-structured text, wherein the first sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text.

Some block texts in the first structured text may also contain implicit information, which generally refers to text content that is scattered within a block text and that the user needs to pay attention to, but that can be obtained only by further browsing and extraction. In this embodiment, a block text in the first structured text that meets this condition is defined as a specified block text. For example, the user needs to obtain the evidence catalogue of the referee document to be processed directly from the structured text, and the evidence constituting the evidence catalogue is scattered in block texts such as those corresponding to the plaintiff's claims and the defendant's defense; these block texts are then specified block texts and need to be further structured to refine and complete the first structured text.

After the first structured text is obtained, extraction continues on the specified block texts in the first structured text. The specified block text may be determined by the following method:

S211, acquiring first reference samples, wherein the first reference samples have the same text structure as the first structured text;

S212, obtaining the feature to be extracted corresponding to the feature model;

S213, determining the feature block text corresponding to the feature to be extracted in each first reference sample;

S214, counting the number of feature block texts corresponding to the same feature to be extracted;

S215, determining a specified block text, wherein the specified block text is a feature block text for which the ratio of the corresponding number to the total number of first reference samples is greater than or equal to a preset threshold.

In this embodiment, a feature block text is a block text in a first reference sample that corresponds to the feature to be extracted, and the specified block text corresponding to a feature model can be determined by learning from a large number of first reference samples. A feature model is a model for extracting a specific feature from a block text, and the feature to be extracted by a given feature model usually occurs in relatively fixed block texts. For example, if the feature to be extracted corresponding to the feature model is "evidence", this feature usually occurs in the block texts corresponding to the plaintiff's claims, the defendant's defense, and the like, but does not occur in the block texts corresponding to the head, the tail, and the like. To increase the accuracy with which the specified block text is determined, a large number of first reference samples may be used. The first reference samples have the same text structure as the first structured text; that is, because the first structured text consists of extraction nodes and their corresponding block texts, the first reference samples must be texts with this structure as well. By determining the position of the feature to be extracted in each first reference sample, the proportion in which the feature occurs in each block text can be obtained, namely the ratio of the number of feature block texts corresponding to the same feature to be extracted to the total number of first reference samples. To avoid selecting block texts in which the feature to be extracted appears only by chance, for example because of an abnormal document, a preset threshold can be used to screen the specified block texts: the feature block texts whose ratio is greater than or equal to the preset threshold are taken as the specified block texts.
For example, if the total number of first reference samples is 100, the feature to be extracted is "evidence", the feature block text is the block text corresponding to "defendant's defense", and the number of such feature block texts is 80, then the ratio is 0.8; assuming the preset threshold is 0.75, the block text corresponding to "defendant's defense" is a specified block text.
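The ratio test of steps S211 to S215 can be sketched as below. Detecting the feature by a plain substring check stands in for the feature model, and the four toy reference samples are invented for illustration; both are assumptions of this example.

```python
from collections import Counter
from typing import Dict, List


def specified_block_nodes(
    reference_samples: List[Dict[str, str]], feature_word: str, threshold: float
) -> List[str]:
    """Return the extraction nodes whose block text contains the feature often enough."""
    counts: Counter = Counter()
    for sample in reference_samples:
        for node, block in sample.items():
            # S213: a block containing the feature is a feature block text.
            if feature_word in block:
                counts[node] += 1  # S214: count per extraction node
    total = len(reference_samples)
    # S215: keep nodes whose ratio reaches the preset threshold.
    return sorted(node for node, n in counts.items() if n / total >= threshold)


reference_samples = [
    {"head": "XX Court", "plaintiff's claims": "claims ...; evidence: contract",
     "defendant's defense": "denies ...; evidence: receipt"},
    {"head": "XX Court", "plaintiff's claims": "evidence: invoice",
     "defendant's defense": "evidence: voucher"},
    {"head": "XX Court", "plaintiff's claims": "evidence: ledger",
     "defendant's defense": "evidence: letter"},
    {"head": "XX Court, evidence list attached", "plaintiff's claims": "evidence: photo",
     "defendant's defense": "no defense filed"},
]
specified = specified_block_nodes(reference_samples, "evidence", 0.75)
print(specified)
```

Here "head" contains the feature in only one of four samples (ratio 0.25) and is filtered out, which is the kind of accidental occurrence the threshold is meant to exclude.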

After the specified block text is determined, extraction needs to be performed on it. Specifically, FIG. 3 is a flowchart of a method for generating a first sub-structured text according to an embodiment of the present application. As shown in FIG. 3, the method includes:

S221, determining the feature extraction model corresponding to each extraction node in the second extraction template;

S222, determining a target character string and a target terminator from the specified block text by using the feature extraction model, wherein the target character string is a character string matched by an extraction expression in the feature extraction model, and the target terminator is a preset symbol that represents the end of the sub-block text;

S223, determining a sub-block text, wherein the sub-block text consists of the characters from the target character string to the target terminator that correspond to the same extraction node;

S224, associating each extraction node in the second extraction template with its sub-block text to generate a first sub-structured text.

Typically, the second extraction template consists of a plurality of extraction nodes, each corresponding to content to be extracted from the specified block text. For example, the second extraction template is composed of extraction nodes such as "plaintiff's evidence", "defendant's evidence", "court certification", and the like. The texts corresponding to these extraction nodes need to be extracted from the specified block text. Typically, each extraction node has a corresponding feature extraction model, which extracts the matching character string from the specified block text by matching feature words. For example, the specified block text is "plaintiff ×× claims …. To support its litigation request, the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. …." and the feature extraction model of the extraction node "plaintiff's evidence" is @\n[^\n.;]{0,10}plaintiff[^\n.;]{0,10}to(court|)(provide|submit|show)@; the target character string can then be determined to be "the plaintiff provides the following evidence to this court". The preset terminator may be a specified punctuation mark, a specified word, a specified phrase, a specified sentence, a specified text format, or the like. Because, by writing habit, related content is usually grouped into one sentence ending with a period, the period may be set as the terminator. The sub-block text in the above example is then "the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. ….".

If the second extraction template has multiple extraction nodes, each extraction node needs to be associated with its sub-block text to obtain a first sub-structured text with the corresponding relationships, for example, "plaintiff's evidence-the plaintiff provides the following evidence to this court: 1. …; 2. …; 3. ….".
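Steps S221 to S224 can be sketched as a regular-expression search followed by a scan for the terminator. The English patterns in `SECOND_TEMPLATE`, the period terminator, and the "1) … 2) …" numbering (chosen so that list numbers do not contain the terminator) are assumptions made for this example rather than the expressions used in the application.

```python
import re
from typing import Dict

# Hypothetical feature extraction expressions per extraction node (assumption).
SECOND_TEMPLATE: Dict[str, str] = {
    "plaintiff's evidence": r"plaintiff[^.\n;]{0,40}(?:provides|submits|presents)",
    "defendant's evidence": r"defendant[^.\n;]{0,40}(?:provides|submits|presents)",
}
TERMINATOR = "."  # preset target terminator, assumed to be a period


def extract_sub_blocks(block_text: str) -> Dict[str, str]:
    """Build a first sub-structured text: extraction node -> sub-block text."""
    result: Dict[str, str] = {}
    for node, pattern in SECOND_TEMPLATE.items():
        match = re.search(pattern, block_text)  # S222: target character string
        if match is None:
            continue  # this node has no content in the specified block text
        end = block_text.find(TERMINATOR, match.end())  # S222: target terminator
        end = len(block_text) if end == -1 else end + len(TERMINATOR)
        result[node] = block_text[match.start():end]  # S223: sub-block text
    return result


block = (
    "The plaintiff claims that the defendant owes 1000 yuan. "
    "To support the claim, the plaintiff submits the following evidence "
    "to this court: 1) loan receipt; 2) transfer record. "
    "The defendant argues otherwise."
)
first_sub_structured = extract_sub_blocks(block)
print(first_sub_structured)
```

The first mention of "plaintiff" is skipped because no submitting verb follows it before the sentence ends, so the sub-block starts at the sentence that actually introduces the evidence.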

S3, converting the sub-block text of the first sub-structured text into a text with a preset feature expression format, and obtaining a second sub-structured text.

As can be seen from the above, in the first sub-structured text obtained so far, multiple pieces of detailed information, namely multiple pieces of evidence, are still mixed together, which is unfavorable for browsing. The feature expression format of the first sub-structured text therefore needs to be converted.

In one implementation, FIG. 4 is a flowchart of a method for converting a text feature expression format according to an embodiment of the present application. As shown in FIG. 4, the method includes:

S311, determining a first-type sub-block text from the sub-block texts of the first sub-structured text, wherein a first-type sub-block text is a sub-block text whose corresponding extraction node matches a first-type keyword;

S312, determining target category keywords from the first-type sub-block text, wherein the target category keywords are segmented words whose matching degree with preset category keywords is greater than or equal to a preset matching threshold;

S313, determining classified texts, wherein a classified text is the text in the sub-block text that has the same target category keyword;

S314, determining first serial number identifiers from each classified text;

S315, dividing the classified texts by taking the first serial number identifiers as separation nodes to obtain first sub-texts;

S316, adding a line feed character between two adjacent first sub-texts so that one first sub-text corresponds to one paragraph;

S317, generating a second sub-structured text by combining the target category keywords, the first serial number identifiers, and the corresponding first sub-texts.

The first sub-structured text typically contains multiple types of sub-block text, and the conversion result differs by type. In general, a keyword library may be established for each type of sub-block text. For example, the first type of sub-block text is a plain presentation of evidence, so words such as "evidence information" and "evidence presentation" may be used as first-type keywords. In the example above, the first sub-structured text is "plaintiff's evidence - the plaintiff submits the following evidence to this court: 1. …; 2. …; 3. ….". The corresponding extraction node is "plaintiff's evidence"; because it matches a first-type keyword, the sub-block text is determined to be a first-type sub-block text. The sub-block text then needs to be converted into the corresponding feature expression format.

In the example above, the first-type sub-block text is "the plaintiff submits the following evidence to this court: 1. …; 2. …; 3. ….". In a referee document, the party that performs an action is very important, so this party may be defined as the target, and different targets form different categories, such as the plaintiff, the defendant and the court. Different sub-block texts correspond to different targets, so corresponding category keywords can be preset for them. Since the present example concerns the plaintiff, the category keywords are set to "plaintiff" and similar words. Word matching then determines the target category keyword "plaintiff" from "the plaintiff submits the following evidence to this court: 1. …; 2. …; 3. ….", which indicates that this evidence was submitted by the plaintiff.

Further, if the first-type sub-block text contains several target category keywords, it needs to be divided into several classified texts with the target category keywords as dividing points, for example "the plaintiff submits … to the court" and "the defendant submits … to the court".
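The keyword matching and division described above can be sketched as follows. This is a minimal illustration, not the method of the application: the matching degree is not specified in the text, so `difflib.SequenceMatcher` is used as a stand-in, and the keyword list, the threshold value and all function names are hypothetical. The sample text is an English rendering of the kind of sub-block text discussed above.

```python
from difflib import SequenceMatcher
import re

# Hypothetical preset category keywords; "court" is left out of this sketch
# because it usually appears as the recipient rather than the sender.
CATEGORY_KEYWORDS = ["plaintiff", "defendant"]
MATCH_THRESHOLD = 0.8  # hypothetical preset matching threshold


def match_degree(word, keyword):
    """Similarity in [0, 1]; stands in for the unspecified matching degree."""
    return SequenceMatcher(None, word.lower(), keyword.lower()).ratio()


def find_target_keywords(text):
    """Return (position, keyword) for every word that reaches the threshold."""
    hits = []
    for m in re.finditer(r"[A-Za-z]+", text):
        for keyword in CATEGORY_KEYWORDS:
            if match_degree(m.group(), keyword) >= MATCH_THRESHOLD:
                hits.append((m.start(), keyword))
                break
    return hits


def split_into_classified_texts(text):
    """Cut the text at each target keyword and group the segments by keyword."""
    hits = find_target_keywords(text)
    classified = {}
    for i, (pos, keyword) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        classified.setdefault(keyword, []).append(text[pos:end].strip(" ;,"))
    return classified


sample = ("Plaintiff submitted: 1. contract; 2. receipt. "
          "Defendant submitted: 1. transfer record.")
print(split_into_classified_texts(sample))
```

In this sketch every keyword hit acts as a dividing point, so a party that is merely mentioned in the middle of a sentence would also start a new segment; a practical rule for telling the sender apart from other mentions would have to be added.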

Each classified text is then split further. First serial number identifiers can be determined from the classified text, for example the "1", "2" and "3" in "the plaintiff submits the following evidence to this court: 1. …; 2. …; 3. ….". Using these first serial number identifiers as separation points, the text is divided into the first sub-texts "1. …", "2. …" and "3. …". A line feed character is then added between every two adjacent first sub-texts, so that each first sub-text occupies its own paragraph; a first sub-text may consist of one or more lines of characters. The resulting expression format is as follows:

1、…；

2、…；

3、…。

Meanwhile, for a clearer presentation, the list is displayed together with the target category keyword, namely

The plaintiff submits the following evidence to this court:

1、…；

2、…；

3、…。

In this way, the evidence is displayed as a list within the block text, so that the user can take it in at a glance while browsing.
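Steps S314 to S317 can be sketched as follows. The sketch assumes that a serial number identifier is a number followed by "." or the enumeration comma "、"; this shape, the function names and the sample evidence items are assumptions made for illustration only.

```python
import re

# Assumed shape of a serial number identifier: digits followed by "." or "、".
SERIAL_RE = re.compile(r"(\d+)\s*[.、]\s*")


def split_by_serial(classified_text):
    """Return (heading, {serial number: first sub-text})."""
    matches = list(SERIAL_RE.finditer(classified_text))
    if not matches:
        return classified_text.strip(), {}
    heading = classified_text[: matches[0].start()].strip()
    first_sub_texts = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(classified_text)
        first_sub_texts[int(m.group(1))] = classified_text[m.end():end].strip()
    return heading, first_sub_texts


def to_list_format(classified_text):
    """Put the heading and each first sub-text on its own line."""
    heading, first_sub_texts = split_by_serial(classified_text)
    lines = [heading] + [f"{n}. {body}" for n, body in first_sub_texts.items()]
    return "\n".join(lines)


sample = ("The plaintiff submits the following evidence to this court: "
          "1. loan contract; 2. transfer receipt; 3. chat records.")
print(to_list_format(sample))
```

A pattern this simple would also treat a decimal such as "3.5" as a serial number identifier, so a practical pattern would need to be stricter.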

In one implementation, Fig. 5 shows a flowchart of a method for converting the text feature expression format according to an embodiment of the present application. The method includes:

S321, determining a second-type sub-block text from the sub-block texts of the first sub-structured text, wherein the second-type sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a second-type keyword;

S322, dividing the second-type sub-block text with a preset separator as a node to obtain second sub-texts;

S323, extracting a third sub-text from each second sub-text by using the first feature extraction model;

S324, acquiring a second serial number identifier from each third sub-text;

S325, determining a target first sub-text corresponding to the third sub-text, wherein the target first sub-text is the first sub-text corresponding to the first serial number identifier that is identical to the second serial number identifier;

S326, extracting a first tag keyword from each second sub-text, wherein the first tag keyword is a segmented word that matches a preset tag keyword;

S327, combining the third sub-text, the target first sub-text and the first tag keyword to generate a second sub-structured text.

In this implementation, sub-block texts that express opinions are set as second-type sub-block texts, following the same principle used to set the first-type keywords in the previous implementation. The second-type keywords may be "opinion", "attitude" and the like. The second-type sub-block text in the first sub-structured text can be determined by matching.

By way of example, the second-type sub-block text is "the plaintiff has no objection to evidence 1-3 provided by the defendant; the plaintiff objects to evidence 4 provided by the defendant, considering the forty-thousand-yuan debt to be the defendant's individual debt.". The second-type sub-block text may be divided with a preset separator, for example ";", to obtain the second sub-texts "the plaintiff has no objection to evidence 1-3 provided by the defendant" and "the plaintiff objects to evidence 4 provided by the defendant, considering the forty-thousand-yuan debt to be the defendant's individual debt.".

A corresponding third sub-text may then be extracted from each second sub-text by using the first feature extraction model. Specifically, the first feature extraction model extracts the third sub-text from the second sub-text by matching a feature extraction expression. For example, if the first feature extraction model has the form "target + to evidence + serial number", the third sub-texts "to evidence 1-3" and "to evidence 4" may be extracted from the second sub-texts.

From the third sub-texts, second serial number identifiers such as "1-3" and "4" can be determined. These can be associated with the first serial number identifiers determined in the previous implementation: every serial number identifier denotes a piece of evidence, that is, a first sub-text, and identical numbers or characters are considered to correspond to the same first sub-text. By comparing the first and second serial number identifiers, the target first sub-text corresponding to each second serial number identifier can be determined, so that the evidence referred to in the third sub-text is presented as concrete text.

For the second-type sub-block text, what matters most is the opinion or attitude expressed toward each piece of evidence. These opinions and attitudes can serve as tag keywords, referred to in this implementation as first tag keywords. They are determined from the second sub-texts by matching against preset tag keywords. For example, if the preset tag keywords are "objection" and "no objection", matching them against the second sub-texts "the plaintiff has no objection to evidence 1-3 provided by the defendant" and "the plaintiff objects to evidence 4 provided by the defendant, considering the forty-thousand-yuan debt to be the defendant's individual debt." determines the first tag keyword of each second sub-text. Each first tag keyword also corresponds to the third sub-text extracted from the same second sub-text.

The third sub-text, the target first sub-text and the first tag keyword are then combined to obtain a clearly presented second sub-structured text.

For example, the plaintiff has no objection to evidence 1 …, evidence 2 …, evidence 3 …;

the plaintiff objects to evidence 4 ….
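Steps S322 to S327 can be sketched as follows. The regular expression is an assumed English rendering of the "target + to evidence + serial number" model, and the evidence contents, tag keywords, output layout and function names are hypothetical; the text does not specify how the model is actually expressed.

```python
import re

# Assumed English rendering of the "target + to evidence + serial number" model.
FIRST_MODEL = re.compile(
    r"(?P<target>plaintiff|defendant|court)\b.*?\bto evidence (?P<serial>\d+(?:-\d+)?)",
    re.IGNORECASE,
)
# Hypothetical preset tag keywords; the longer one is tried first so that
# "no objection" is not mistaken for "objection".
TAG_KEYWORDS = ["no objection", "objection"]


def expand_serials(identifier):
    """Expand "1-3" to [1, 2, 3] and "4" to [4]."""
    parts = [int(p) for p in re.split(r"\s*-\s*", identifier)]
    return list(range(parts[0], parts[-1] + 1))


def structure_opinions(sub_block, first_sub_texts, separator=";"):
    """Combine target, tag keyword and referenced evidence for each second sub-text."""
    lines = []
    for second in (s.strip() for s in sub_block.split(separator)):
        if not second:
            continue
        m = FIRST_MODEL.search(second)
        tag = next((k for k in TAG_KEYWORDS if k in second), None)
        if m is None or tag is None:
            lines.append(second)  # keep text the model cannot parse unchanged
            continue
        evidence = ", ".join(
            f"evidence {n} ({first_sub_texts.get(n, '…')})"
            for n in expand_serials(m.group("serial"))
        )
        lines.append(f"{m.group('target').lower()}: {tag} to {evidence}")
    return lines


# Hypothetical first sub-texts, keyed by first serial number identifier.
evidence_items = {1: "loan contract", 2: "transfer receipt", 3: "chat records", 4: "IOU"}
sample = ("The plaintiff has no objection to evidence 1-3 provided by the defendant; "
          "the plaintiff raises an objection to evidence 4 provided by the defendant, "
          "considering the forty-thousand-yuan debt to be the defendant's individual debt.")
for line in structure_opinions(sample, evidence_items):
    print(line)
```

Looking up the referenced evidence by serial number is what turns a bare reference such as "evidence 1-3" into concrete text, which is the purpose of step S325.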

In one implementation, Fig. 6 shows a flowchart of a method for converting the text feature expression format according to an embodiment of the present application. The method includes:

S331, determining a second-type sub-block text from the sub-block texts of the first sub-structured text, wherein the second-type sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a second-type keyword;

S332, dividing the second-type sub-block text with a preset separator as a node to obtain fourth sub-texts;

S333, extracting a fifth sub-text from each fourth sub-text by using a second feature extraction model;

S334, combining all the fifth sub-texts to generate a second sub-structured text.

This implementation provides another method for structuring the second-type sub-block text. Compared with the previous implementation, the second feature extraction model takes the form "target + to + evidence + tag keyword". When the second-type sub-block text already has a text format that conforms to the second feature extraction model, a fifth sub-text such as "the plaintiff objects to evidence 4 …" can be extracted directly, and the fifth sub-texts can be used directly as the second sub-structured text.
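Steps S332 to S334 can be sketched as follows. The pattern below is an assumed English rendering of the "target + to + evidence + tag keyword" model, in which English word order places the tag keyword before "to evidence"; the pattern, the function name and the sample text are hypothetical.

```python
import re

# Assumed English rendering of the second feature extraction model.
SECOND_MODEL = re.compile(
    r"(?:plaintiff|defendant|court) (?:has no objection|objects) to evidence \d+(?:-\d+)?",
    re.IGNORECASE,
)


def extract_fifth_sub_texts(sub_block, separator=";"):
    """Extract one fifth sub-text per fourth sub-text and join them line by line."""
    fifth = []
    for fourth in sub_block.split(separator):
        m = SECOND_MODEL.search(fourth)
        if m:
            fifth.append(m.group(0))
    return "\n".join(fifth)


sample = ("The plaintiff has no objection to evidence 1-3 provided by the defendant; "
          "the plaintiff objects to evidence 4, considering the debt to be "
          "the defendant's individual debt.")
print(extract_fifth_sub_texts(sample))
```

Unlike the previous sketch, no lookup of the evidence content is performed here: the matched span itself serves as the output.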

In one implementation, Fig. 7 shows a flowchart of a method for converting the text feature expression format according to an embodiment of the present application. The method includes:

S341, determining a third-type sub-block text from the sub-block texts of the first sub-structured text, wherein the third-type sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a third-type keyword;

S342, dividing the third-type sub-block text with a preset separator to obtain sixth sub-texts;

S343, extracting a seventh sub-text from each sixth sub-text by using a third feature extraction model;

S344, acquiring a third serial number identifier from each seventh sub-text;

S345, determining a target first sub-text corresponding to the seventh sub-text, wherein the target first sub-text is the first sub-text corresponding to the first serial number identifier that is identical to the second serial number identifier;

S346, extracting a result text from each sixth sub-text by using a feature matching formula;

S347, combining the seventh sub-text, the target first sub-text and the result text to generate a second sub-structured text.

In this implementation, sub-block texts that express recognition or a ruling, such as the court's assessment of the evidence, are set as third-type sub-block texts, following the same principle used to set the first-type and second-type keywords in the above implementations. The third-type keywords may be "recognition", "decision" and the like. The third-type sub-block text in the first sub-structured text can be determined by matching.

Compared with the structuring of the second-type sub-block text in the above implementation, this implementation additionally extracts a result text from the sixth sub-text after the seventh sub-text and the corresponding target first sub-text have been determined. For example, the sixth sub-text is "as to evidence 1, this court recognizes that it can be taken as a factual basis". The feature matching formula matches the corresponding characters in the sixth sub-text, for example with a pattern of the form "(recognizes | judges) … factual basis", where "…" stands for any run of characters up to the next delimiter. The result text "recognizes that it can be taken as a factual basis" can thus be extracted from the sixth sub-text. The seventh sub-text, the target first sub-text and the result text are then combined to obtain the second sub-structured text.

For example, this court recognizes evidence 1 … as a factual basis.
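Steps S342 to S347 can be sketched as follows. Both regular expressions are assumptions: the first is an English rendering of a "to evidence + serial number" model, and the second is one possible feature matching formula of the kind described above. The evidence contents, output layout and function name are hypothetical.

```python
import re

# Assumed third feature extraction model: "to evidence" + serial number.
THIRD_MODEL = re.compile(r"to evidence (?P<serial>\d+)", re.IGNORECASE)
# Assumed feature matching formula: a finding verb, any characters that are
# not a sentence delimiter, then "factual basis".
RESULT_FORMULA = re.compile(r"(?:recognizes|determines)[^;.]*?factual basis")


def structure_findings(sub_block, first_sub_texts, separator=";"):
    """Combine the seventh sub-text, the referenced evidence and the result text."""
    lines = []
    for sixth in (s.strip() for s in sub_block.split(separator)):
        seventh = THIRD_MODEL.search(sixth)
        result = RESULT_FORMULA.search(sixth)
        if not (seventh and result):
            continue  # skip sixth sub-texts that the patterns cannot parse
        n = int(seventh.group("serial"))
        lines.append(
            f"{seventh.group(0)} ({first_sub_texts.get(n, '…')}): {result.group(0)}"
        )
    return lines


# Hypothetical first sub-texts, keyed by first serial number identifier.
evidence_items = {1: "loan contract", 4: "IOU"}
sample = ("As to evidence 1, this court recognizes that it can be taken as a factual basis; "
          "as to evidence 4, this court determines that it cannot serve as a factual basis.")
for line in structure_findings(sample, evidence_items):
    print(line)
```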

It should be noted that the feature extraction model provided in the above implementation may be adjusted according to the actual requirements to extract different objects.

S4, updating the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

In the referee document structuring method provided by the application, the first sub-structured text covers only some of the sub-block texts in the specified block text rather than all of its text. Therefore, once the second sub-structured text is obtained, it only needs to replace the corresponding content in the first structured text to obtain the second structured text.

For example, the second sub-structured text is:

the plaintiff submits the following evidence to this court:

1、…；

2、…；

3、…。

The corresponding content in the first structured text is "plaintiff's claims - plaintiff × claims that …. To support its litigation request, the plaintiff submits the following evidence to this court: 1. …; 2. …; 3. ….", in which "the plaintiff submits the following evidence to this court: 1. …; 2. …; 3. …." is the content corresponding to the second sub-structured text and needs to be replaced by it, namely

Plaintiff's claims - plaintiff × claims that …. To support its litigation request,

the plaintiff submits the following evidence to this court:

1、…；

2、…；

3、…。

Therefore, the structured referee document presents text information to the user at a finer granularity, so that the user can quickly locate the required content.
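The replacement in step S4 can be sketched as follows, assuming that the first structured text is held as a mapping from extraction node to block text. This representation, the function name and the sample strings are assumptions made for illustration.

```python
def update_structured_text(first_structured, node, original, replacement):
    """Return a second structured text in which `original` inside the block
    text of `node` is replaced by the second sub-structured text."""
    block = first_structured[node]
    if original not in block:
        raise ValueError("original sub-block text not found in the block text")
    second_structured = dict(first_structured)  # leave the input unchanged
    second_structured[node] = block.replace(original, replacement, 1)
    return second_structured


first = {
    "plaintiff's claims": (
        "Plaintiff X claims that …. To support its litigation request, "
        "the plaintiff submits the following evidence to this court: 1. A; 2. B."
    )
}
original = "the plaintiff submits the following evidence to this court: 1. A; 2. B."
replacement = "\nthe plaintiff submits the following evidence to this court:\n1. A;\n2. B."

second = update_structured_text(first, "plaintiff's claims", original, replacement)
print(second["plaintiff's claims"])
```

Only the matched span is replaced, so the rest of the block text, which was never converted, is carried over unchanged.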

Fig. 8 is a schematic structural diagram of a referee document structuring device according to an embodiment of the present application. The device includes: a first extraction unit 1, configured to extract block texts from the referee document to be processed by using a first extraction template to obtain a first structured text, where the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed; a second extraction unit 2, configured to extract from a specified block text of the first structured text by using a second extraction template to obtain a first sub-structured text, where the first sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-block text in the specified block text; a conversion unit 3, configured to convert the sub-block text of the first sub-structured text into text with a preset feature expression format to obtain a second sub-structured text; and an updating unit 4, configured to update the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text.

Optionally, the first extraction unit includes: a node character determining unit, configured to determine node characters in the referee document to be processed according to each extraction node in the first extraction template, where the extraction nodes are character strings corresponding to the respective parts of the content of the referee document to be processed, and the node characters are the starting characters of the part of the content corresponding to each extraction node; a block text determining unit, configured to determine the block text corresponding to each extraction node, where the block text is all characters from the node characters corresponding to that extraction node up to the next node characters; and a first structured text generation unit, configured to generate a first structured text by mapping each extraction node to its block text.
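The behaviour of the first extraction unit can be sketched as follows. The sketch assumes that the first extraction template is a mapping from each extraction node to its node characters; this representation, the function name and the sample document are hypothetical.

```python
def extract_blocks(document, template):
    """Map each extraction node to the text running from its node characters
    up to the next node characters found in the document."""
    found = []
    for node, node_chars in template.items():
        pos = document.find(node_chars)
        if pos != -1:
            found.append((pos, node))
    found.sort()  # order the nodes by where they appear in the document

    first_structured = {}
    for i, (pos, node) in enumerate(found):
        end = found[i + 1][0] if i + 1 < len(found) else len(document)
        first_structured[node] = document[pos:end].strip()
    return first_structured


# Hypothetical template: extraction node -> node characters.
template = {
    "plaintiff's claims": "The plaintiff claims",
    "defendant's defense": "The defendant argues",
    "court's findings": "This court finds",
}
document = ("The plaintiff claims that the defendant borrowed money. "
            "The defendant argues that the debt was repaid. "
            "This court finds that the loan is established.")
for node, block in extract_blocks(document, template).items():
    print(f"{node} - {block}")
```

Nodes whose node characters do not occur in the document are simply omitted from the result in this sketch.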

Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method of structuring referee documents, the method comprising:

updating corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text;

extracting from the specified block text of the first structured text by using a second extraction template, and obtaining a first sub-structured text includes:

determining a feature extraction model corresponding to each extraction node in the second extraction template;

determining a target character string and a target terminator from the appointed block text by utilizing the characteristic extraction model, wherein the target character string is a character string matched with an extraction expression in the characteristic extraction model, and the target terminator is a preset symbol representing the end of the sub-block text;

determining a sub-block text, wherein the sub-block text is a character from the target character string to the target terminator, which corresponds to the same extraction node;

and corresponding each extraction node in the second extraction template to the sub-block text to generate a first sub-structured text.

2. The method of claim 1, wherein extracting the block text in the referee document to be processed using the first extraction template to obtain the first structured text comprises:

determining node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings corresponding to the respective parts of the content of the referee document to be processed, and the node characters are the starting characters of the part of the content corresponding to each extraction node in the referee document to be processed;

determining a block text corresponding to each extraction node, wherein the block text is all characters from node characters corresponding to the extraction node to next node characters;

and corresponding each extraction node to the block text, and generating a first structured text.

3. The method of claim 1, wherein converting the sub-block text of the first sub-structured text into text having a preset feature expression format to obtain a second sub-structured text comprises:

determining a first type of sub-block text from the sub-block text of the first sub-structured text, wherein the first type of sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a first-type keyword;

determining target category keywords from the first type of sub-block text, wherein the target category keywords are segmented words whose matching degree with preset category keywords is greater than or equal to a preset matching threshold;

determining a classified text, wherein the classified text is a text with the same target category keyword in the sub-block text;

determining a first sequence number identifier from each of the classified texts;

dividing the classified text by taking the first serial number identifier as a separation node to obtain a first sub-text;

adding a line feed character between two adjacent first sub-texts so that one first sub-text corresponds to one paragraph;

and generating a second sub-structured text by combining the target category keyword, the serial number identifier and the corresponding first sub-text.

4. The method of claim 3, wherein converting the sub-block text of the first sub-structured text into text having a preset feature expression format to obtain a second sub-structured text comprises:

determining a second type of sub-block text from the sub-block text of the first sub-structured text, wherein the second type of sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a second-type keyword;

dividing the second type of sub-block text by taking a preset separator as a node to obtain a second sub-text;

extracting a third sub-text from the second sub-text by using the first feature extraction model;

acquiring a second serial number identifier from each third sub-text;

determining a target first sub-text corresponding to the third sub-text, wherein the target first sub-text is a first sub-text corresponding to the first serial number identifier which is identical to the second serial number identifier;

extracting a first tag keyword from each second sub-text, wherein the first tag keyword is a word segment matched with a preset tag keyword;

and generating a second sub-structured text by combining the third sub-text, the target first sub-text and the first tag keyword.

5. The method of claim 1, wherein converting the sub-block text of the first sub-structured text into text having a preset feature expression format to obtain a second sub-structured text comprises:

dividing the second type of sub-block text by taking a preset separator as a node to obtain a fourth sub-text;

extracting a fifth sub-text from each of the fourth sub-texts by using a second feature extraction model;

and generating a second sub-structured text by combining all the fifth sub-texts.

6. The method of claim 4, wherein converting the sub-block text of the first sub-structured text into text having a preset feature expression format to obtain a second sub-structured text comprises:

determining a third type of sub-block text from the sub-block text of the first sub-structured text, wherein the third type of sub-block text is a sub-block text whose corresponding extraction node in the specified block text matches a third-type keyword;

dividing the third type of sub-block texts by using preset separators to obtain sixth sub-texts;

extracting a seventh sub-text from the sixth sub-text by using a third feature extraction model;

acquiring a third serial number identifier from each seventh sub-text;

determining a target first sub-text corresponding to the seventh sub-text, wherein the target first sub-text is a first sub-text corresponding to the first serial number identifier which is identical to the second serial number identifier;

extracting a result text from each sixth sub-text by using a feature matching formula;

and generating a second sub-structured text by combining the seventh sub-text, the target first sub-text and the result text.

7. The method of claim 1, wherein updating the corresponding content in the first structured text with the second sub-structured text to obtain a second structured text comprises:

and replacing corresponding content in the first structured text by the second sub-structured text to obtain a second structured text.

8. A referee document structuring device, comprising:

an updating unit, configured to update the corresponding content in the first structured text by using the second sub-structured text to obtain a second structured text;

the first extraction unit includes:

a node character determining unit, configured to determine node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings corresponding to the respective parts of the content of the referee document to be processed, and the node characters are the starting characters of the part of the content corresponding to each extraction node in the referee document to be processed;

the block text determining unit is used for determining block texts corresponding to each extracting node, wherein the block texts are all characters from node characters corresponding to the extracting node to the next node characters;

and the first structured text generation unit is used for generating a first structured text by corresponding each extraction node to the block text.