CN111274354B - Referee document structuring method and referee document structuring device - Google Patents


Info

Publication number
CN111274354B
Authority
CN
China
Prior art keywords
text
extraction
structured
block
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010041739.2A
Other languages
Chinese (zh)
Other versions
CN111274354A (en)
Inventor
席丽娜
王文军
刘大双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Intelligent Technology Co ltd
Original Assignee
Dingfu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co ltd
Priority to CN202010041739.2A
Publication of CN111274354A
Application granted
Publication of CN111274354B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/358 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for structuring a referee document. A first extraction template is used to extract block texts from the referee document to be processed, yielding a first structured text; a feature model is then used to determine target block texts from the specified block texts of the first structured text; and a second extraction template is used to extract each target block text, yielding sub-structured texts. Finally, the corresponding content in the first structured text is updated with the sub-structured texts to obtain a second structured text. The method can therefore further extract and classify the content of the first structured text to obtain a more refined second structured text, so that the content of the referee document to be processed is presented more completely.

Description

Referee document structuring method and referee document structuring device
Technical Field
The application relates to the technical field of text processing, and in particular to a referee document structuring method and device.
Background
In general, legal documents such as referee documents are lengthy and hard to read, making it difficult to quickly locate the content in a referee document that needs careful review. Moreover, while browsing a referee document, the user typically also needs to browse several referee documents for cases similar to the current one, to help understand the current referee document by analogy. For some special referee documents, such as civil referee documents, hidden information must additionally be extracted in a targeted manner from part of the text after all of the text has been browsed. For such referee documents, browsing even a single document is difficult, and finding documents similar to the current one among a large number of referee documents is harder still: it wastes a great deal of time and does not necessarily find the referee document with the highest similarity.
For example, to find the content related to the dispute focus, the user has to read the referee document from its first character, learn what each part of the document sets out, identify the parts that may contain the dispute focus, and then refine and analyze those parts to obtain the relevant content. Manually analyzing the structure of a referee document in this way is time-consuming and is affected by subjective factors such as the reader's knowledge and reasoning, so the result is prone to low accuracy and may have no reference value. The existing way of browsing referee documents is therefore inefficient and of low quality.
Disclosure of Invention
The application provides a method and a device for structuring a referee document, which standardize the format of the referee document and make it easier for users to browse.
In a first aspect, the present application provides a method for structuring referee documents, said method comprising:
extracting block texts from the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed;
determining a target block text from the specified block texts of the first structured text by using a feature model;
extracting each target block text by using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-text in the target block text;
and updating the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.
In a second aspect, the present application provides a referee document structuring device comprising:
a first extraction unit, configured to extract block texts from the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed;
a target block text determining unit, configured to determine a target block text from the specified block texts of the first structured text by using a feature model;
a second extraction unit, configured to extract each target block text by using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-text in the target block text;
and an updating unit, configured to update the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.
According to the method and the device for structuring a referee document, block texts in the referee document to be processed are first extracted by using a first extraction template to obtain a first structured text; target block texts are then determined from the specified block texts of the first structured text by using a feature model; and each target block text is extracted by using a second extraction template to obtain sub-structured texts. Finally, the corresponding content in the first structured text is updated with the sub-structured texts to obtain a second structured text. The method can therefore further extract and classify the content of the first structured text to obtain a more refined second structured text, so that the content of the referee document to be processed is presented more completely.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a method for structuring referee documents according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining specified block text according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining target block text according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for creating a second extraction template according to an embodiment of the present application;
FIG. 6 is a flow chart of a method of generating sub-structured text according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for updating a first structured text according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a referee document structuring device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
To solve the problems described above, the application provides a referee document structuring method and device that convert the referee document into structured text, so that a user can quickly locate the content they need in the referee document.
Fig. 1 is a flowchart of a method for structuring a referee document according to an embodiment of the present application. As shown in Fig. 1, the method includes:
S1, extracting block texts from the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed.
The referee document to be processed is input into a referee document structuring device, which may be a server, a PC (personal computer), a tablet computer, a mobile phone or any other text processing device. The referee document to be processed may be a judgment of any trial level in civil cases and the like. After receiving the referee document to be processed, the device first preprocesses it to determine the text to be structured. For example, the referee document to be processed that is input into the device includes a criminal first-trial judgment, a criminal second-trial judgment and a criminal final-trial judgment, but at present only the civil first-trial judgment needs to be structured; in that case the text of the civil first-trial judgment is extracted through preprocessing, and the document to be structured can be identified by matching against the document titles in the referee document to be processed. A block text is the text content in the referee document to be processed that corresponds to an extraction node in the first extraction template. For example, the content of the referee document to be processed includes "principal x …, found by trial x …", and the first extraction template includes the extraction nodes "principal information" and "trial findings"; then "principal x …" is the block text corresponding to "principal information", and "found by trial x …" is the block text corresponding to "trial findings".
The first extraction template may be an extraction model that is established in advance, before the referee document to be processed is structured. Specifically:
S001, acquiring referee document samples, wherein the referee document samples belong to the same category;
S002, dividing each referee document sample into sample block texts according to a preset text division rule;
S003, setting a node title for each sample block text;
S004, combining all node titles of the same referee document sample to generate a corresponding extraction template sample;
S005, combining the extraction template samples to generate an extraction template.
A referee document is a text with standardized content; that is, for referee documents of the same category, the types of content involved are roughly the same regardless of differences in format. For example, a referee document generally covers principal information, the course of the trial, the litigants' claims, the litigants' defenses, trial findings, the court's opinion, the judgment result and similar types of content. An extraction template can therefore be generated by training on a large number of referee document samples.
In general, referee documents of different categories correspond to different extraction templates, where a category refers to the case field, the trial level and the like of the referee document. For example, criminal first-trial judgments, criminal second-trial judgments and civil first-trial judgments belong to three different categories.
Before the extraction template for a certain category of referee document is trained, a large number of referee document samples of that category must first be obtained, preferably in a format in which each title is paired with its specific text content, such as "principal information-principal x …; trial findings-found by trial …". Samples in this format are closest in form to the extraction template that is finally generated, so training efficiency can be effectively improved.
If the selected referee document samples do not have the format described above, they may first be divided into sample block texts according to a preset text division rule, where a sample block text is a block text contained in a selected referee document sample. Examples of text division rules include division by paragraph, division by subtitle within the text, and division by a specified paragraph-start character. A node title is then set for each sample block text; typically the node title is a string that summarizes the semantics of the sample block text, e.g. the node title "principal information" can be set for the sample block text "principal x …". Further, if semantically repeated node titles appear among the node titles set for the same referee document sample, the sample block texts corresponding to those node titles can be merged, and one of the node titles is selected as the node title of the merged sample block text.
After obtaining node titles corresponding to each sample block text of a referee document sample, the node titles may be summarized to generate an extraction template sample corresponding to the referee document sample. By training a large number of extraction template samples as described above, an extraction template can be obtained. Further, by continuously enriching referee document samples, the generated extraction template can be continuously optimized.
For different types of referee documents, the corresponding extraction template can be generated by adopting the method.
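The template-building steps S001 to S005 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the application: the paragraph-based division rule, the hypothetical `TITLE_RULES` prefix table, and the choice to combine template samples as an ordered union of node titles are all invented for the example.

```python
from typing import Dict, List

# Hypothetical mapping from a paragraph's leading phrase to a node title (assumption).
TITLE_RULES: Dict[str, str] = {
    "Principal": "principal information",
    "Found by trial": "trial findings",
    "The court holds": "court opinion",
}


def split_into_blocks(sample: str) -> List[str]:
    """S002: divide a sample by paragraph (one possible text division rule)."""
    return [p.strip() for p in sample.split("\n") if p.strip()]


def node_title_for(block: str) -> str:
    """S003: assign a node title that summarizes the sample block text."""
    for prefix, title in TITLE_RULES.items():
        if block.startswith(prefix):
            return title
    return "other"


def template_sample(sample: str) -> List[str]:
    """S004: collect the node titles of one sample, merging repeated titles."""
    titles: List[str] = []
    for block in split_into_blocks(sample):
        title = node_title_for(block)
        if title not in titles:
            titles.append(title)
    return titles


def build_template(samples: List[str]) -> List[str]:
    """S005: combine all extraction template samples into one extraction template."""
    template: List[str] = []
    for sample in samples:
        for title in template_sample(sample):
            if title not in template:
                template.append(title)
    return template
```

Feeding more samples into `build_template` can only add node titles, which loosely mirrors the idea that the template improves as the sample set grows.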
The extraction templates generated by the above method can be reused by the referee document structuring device at any time without being regenerated. When the device uses them, it therefore needs to select, from all the extraction templates, the first extraction template that suits the referee document to be processed.
Specifically:
S011, extracting, from the referee document to be processed, target keywords that match words in a keyword library;
S012, calculating the semantic similarity between each target keyword and the template title of each of the extraction templates;
S013, calculating the matching degree between the referee document to be processed and each extraction template by combining the weight and the semantic similarity of each target keyword;
S014, determining the first extraction template, wherein the first extraction template is the extraction template with the highest matching degree.
Words that indicate the category of the referee document to be processed will necessarily appear in its title or body. Such words may differ in form while carrying the same meaning, for example "first instance" and "first trial". The words in the referee document to be processed can therefore be matched against the words in the keyword library to determine the target keywords whose semantic similarity is higher than a threshold; these target keywords represent the category of the referee document to be processed.
Each extraction template has a corresponding template title. The target keywords of the referee document to be processed can be matched against the template titles to find the template title with the highest matching degree; the extraction template corresponding to that template title is the first extraction template applicable to the referee document to be processed.
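The selection steps S011 to S014 can be sketched as a weighted score per template title. The application does not specify how semantic similarity is computed, so this sketch substitutes a plain character-level ratio from Python's `difflib` as a stand-in; the keyword library `KEYWORD_WEIGHTS` and its weights are hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Hypothetical keyword library with per-keyword weights (assumption).
KEYWORD_WEIGHTS: Dict[str, float] = {
    "civil": 1.0,
    "criminal": 1.0,
    "first instance": 0.5,
    "second instance": 0.5,
}


def target_keywords(document: str) -> List[str]:
    """S011: keep the library keywords that occur in the document."""
    text = document.lower()
    return [kw for kw in KEYWORD_WEIGHTS if kw in text]


def similarity(a: str, b: str) -> float:
    """S012: stand-in for semantic similarity (character-level ratio, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_degree(keywords: List[str], template_title: str) -> float:
    """S013: weighted sum of keyword-to-title similarities."""
    return sum(KEYWORD_WEIGHTS[kw] * similarity(kw, template_title) for kw in keywords)


def select_template(document: str, template_titles: List[str]) -> str:
    """S014: pick the template title with the highest matching degree."""
    keywords = target_keywords(document)
    return max(template_titles, key=lambda title: matching_degree(keywords, title))
```

A real system would likely replace `similarity` with an embedding-based or synonym-aware measure; the weighted-sum structure stays the same.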
After the first extraction template is determined, it is used to determine the node characters in the referee document to be processed. Fig. 2 is a flowchart of a method for extracting a first structured text according to an embodiment of the present application. As shown in Fig. 2, the method includes:
S101, determining node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein an extraction node is a character string that corresponds to a part of the content of the referee document to be processed, and a node character is the initial characters of the part of the content that corresponds to the extraction node;
S102, determining the block text corresponding to each extraction node, wherein the block text is all characters from the node character corresponding to the extraction node up to the next node character;
S103, associating each extraction node with its block text to generate the first structured text.
Specifically, the first extraction template is composed of a plurality of extraction nodes, each of which represents a text to be extracted. For example, if the extraction nodes in the first extraction template are "head, principal information, trial findings", the corresponding texts can be extracted from the referee document to be processed according to these nodes: if the referee document to be processed includes "x court …, principal x …, found by trial x …", then the part corresponding to the extraction node "head" is "x court …", the part corresponding to "principal information" is "principal x …", and the part corresponding to "trial findings" is "found by trial x …".
Specifically, the node characters may be determined as follows:
S1011, acquiring the extraction expressions corresponding to each extraction node;
S1012, matching each extraction expression against the first-line characters of each unmatched paragraph in the referee document to be processed to obtain a matched paragraph, wherein an unmatched paragraph is a paragraph that has not yet been matched by any extraction expression;
S1013, extracting the first-line characters of the matched paragraph by using the extraction expression to obtain the node character.
By writing convention, the characters in one paragraph usually form the smallest unit of complete meaning, so the paragraph can be used as the search unit in which node characters are looked for. Because the node characters are the key to dividing the referee document to be processed, they must contain the words or phrases that correspond to the extraction node; the node characters can therefore be determined by recognizing these words or phrases, and recognition and extraction are generally performed with extraction expressions. For example, for the extraction node "trial findings", the corresponding extraction expression may be a regular expression anchored at a line break that matches "found by trial:", optionally preceded by other characters of the same sentence and by "this court", or one that matches "found by (the court's) trial" directly after the line break. Typically, one extraction node corresponds to multiple extraction expressions so as to accommodate the various ways in which the node is worded. The first-line characters of each paragraph can thus be matched with the extraction expressions, and the matched first-line characters are extracted to obtain the node character. For example, if a paragraph of the referee document to be processed is "found by trial, x and x have a debt relationship …", the node character "found by trial" can be extracted with the extraction expression.
Note that when matching with the extraction expressions, the paragraphs are matched one by one and only paragraphs that are still unmatched are tested. This preserves the extraction order and prevents omissions, and it also prevents paragraphs whose node characters have already been determined from being extracted again, avoiding wasted time and extraction errors.
After the node characters are determined, the corresponding block texts can be determined from them. A block text is the part of the referee document to be processed that lies between two adjacent node characters, starting from the earlier node character. For example, the content of the referee document to be processed includes "principal x …, found by trial x …"; through the above procedure, "principal" and "found by trial" are determined to be node characters, and the two node characters are adjacent, so "principal x …" is the block text corresponding to the extraction node "principal information".
After the block text corresponding to each extraction node is determined, the name of the extraction node may be used as a title, and a correspondence between each title and its block text is established, so that the referee document to be processed is structured into a first structured text composed of a plurality of "extraction node-block text" pairs. For example, for a civil first-trial judgment, a first extraction template consisting of extraction nodes such as "head, principal information, course of trial, plaintiff's claim, defendant's defense, trial findings, court's opinion, judgment result, tail" may be selected for extraction, and the block texts corresponding to the extraction nodes are obtained, thereby generating the first structured text.
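Steps S101 to S103 (with S1011 to S1013) can be sketched with regular expressions matched against the start of each paragraph. The node names reuse the examples in the text, but the English patterns in `EXTRACTION_EXPRESSIONS` are hypothetical placeholders, and treating each line as one paragraph is an assumption of this sketch.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical extraction expressions; each node may have several (assumption).
EXTRACTION_EXPRESSIONS: Dict[str, List[str]] = {
    "principal information": [r"^(Plaintiff|Defendant|Principal)\b"],
    "trial findings": [r"^Found by trial", r"^Upon trial, (this|the) court found"],
    "court opinion": [r"^(This|The) court holds"],
}


def extract_first_structured_text(document: str) -> Dict[str, str]:
    paragraphs = [p for p in document.split("\n") if p.strip()]

    # S101: locate the paragraph whose opening characters match each node,
    # testing only paragraphs that no earlier expression has matched.
    starts: List[Tuple[int, str]] = []
    matched: Set[int] = set()
    for node, expressions in EXTRACTION_EXPRESSIONS.items():
        for index, paragraph in enumerate(paragraphs):
            if index in matched:
                continue
            if any(re.search(expr, paragraph) for expr in expressions):
                starts.append((index, node))
                matched.add(index)
                break
    starts.sort()

    # S102 and S103: a block text runs from its node character up to the
    # next node character; pair each node with its block text.
    structured: Dict[str, str] = {}
    for position, (index, node) in enumerate(starts):
        if position + 1 < len(starts):
            end = starts[position + 1][0]
        else:
            end = len(paragraphs)
        structured[node] = "\n".join(paragraphs[index:end])
    return structured
```

Text that precedes the first matched node is dropped in this sketch; a fuller version would assign it to a "head" node.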
S2, determining a target block text from the specified block text of the first structured text by utilizing a feature model.
Some block texts in the first structured text may also contain implicit information, which generally means text content that the user needs to pay attention to but that is scattered within the block texts and can only be obtained by further browsing and extraction. In this embodiment, a block text in the first structured text that meets this description is defined as a specified block text. For example, the user needs to obtain the dispute focus of the referee document to be processed directly from the structured text, while the text related to the dispute focus is scattered across several block texts, such as those for the defendant's defense, the trial findings and the court's opinion; these block texts are then specified block texts and need to be further structured to refine and complete the first structured text.
Fig. 3 is a flowchart of a method for determining a specified block text according to an embodiment of the present application. As shown in Fig. 3, the method includes:
S211, acquiring first reference samples, wherein each first reference sample has the same text structure as the first structured text;
S212, acquiring the feature to be extracted that corresponds to the feature model;
S213, determining, in each first reference sample, the feature block text corresponding to the feature to be extracted;
S214, counting the number of feature block texts corresponding to the same feature to be extracted;
S215, determining the specified block text, wherein the specified block text is a feature block text for which the ratio of its number to the total number of first reference samples is greater than or equal to a preset threshold.
In this embodiment, a feature block text is the block text in a first reference sample that corresponds to the feature to be extracted, and the specified block text can be determined by learning from a large number of first reference samples. Specifically, the feature model is a model for extracting a specific feature from block texts, and a given feature to be extracted usually appears in a relatively fixed set of block texts. For example, if the feature to be extracted is "the dispute focus of this case is", it usually appears in the block texts corresponding to the defendant's defense, the trial findings, the court's opinion and the like, but not in the block texts for the head, the tail and the like. To increase the accuracy with which the specified block text is determined, a large number of first reference samples may be used, each having the same text structure as the first structured text: since the first structured text consists of extraction nodes and their corresponding block texts, the first reference samples must have that structure too. By locating the feature to be extracted in each first reference sample, the proportion in which the feature occurs in each block text can be obtained, that is, the ratio of the number of feature block texts corresponding to the same feature to the total number of first reference samples. To exclude block texts in which the feature occurs only by chance, for example because of an abnormal document, a preset threshold is used to screen the specified block texts: the feature block texts whose ratio is greater than or equal to the preset threshold are taken as the specified block texts.
For example, the total number of first reference samples is 100, the feature to be extracted is "the dispute focus of this case is", and the feature block text is the block text corresponding to "defendant's defense", with a count of 80. The ratio for this feature block text is therefore 80/100 = 0.8; assuming a preset threshold of 0.75, the block text corresponding to "defendant's defense" is a specified block text.
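The counting and thresholding in S211 to S215 can be sketched as below. Each first reference sample is assumed to be a dictionary from extraction node to block text, and the feature is located by a plain substring test; both are simplifications made for the example rather than details given in the application.

```python
from collections import Counter
from typing import Dict, List


def specified_block_nodes(
    reference_samples: List[Dict[str, str]],
    feature: str,
    threshold: float,
) -> List[str]:
    """Return the extraction nodes whose block text qualifies as specified block text."""
    total = len(reference_samples)
    if total == 0:
        return []

    # S213 and S214: count, per node, the samples whose block text contains the feature.
    counts: Counter = Counter()
    for sample in reference_samples:
        for node, block_text in sample.items():
            if feature in block_text:
                counts[node] += 1

    # S215: keep the nodes whose ratio reaches the preset threshold.
    return [node for node, count in counts.items() if count / total >= threshold]
```

With the numbers from the example above (80 of 100 samples, threshold 0.75), the node "defendant's defense" is kept, while a node in which the feature shows up in only 20 samples is filtered out.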
Because the specified block text contains implicit information, it needs to be further extracted. A specified block text typically contains a lot of content, so the smallest span of text that contains the implicit information, namely the target block text, must be extracted first. Fig. 4 is a flowchart of a method for determining a target block text according to an embodiment of the present application. As shown in Fig. 4, the method includes:
S221, matching the specified block texts with each feature expression in the feature model to obtain the feature character string in each specified block text;
S222, determining the target block text, wherein the target block text is all characters from the feature character string to a preset termination symbol.
Typically, a feature model is made up of a plurality of feature expressions, which can be matched against text to extract character strings. For example, the feature expressions may be regular expressions anchored at a line break that match, within a single sentence, phrases such as "dispute focus is:" or "the disputed issue is". Suppose the content of the specified block text is "According to the claims and defenses of the parties and with the consent of the plaintiff and the defendant, it is determined that the dispute focus of this case is: 1. …; 2. …; 3. …. Around the dispute focus, the plaintiff provides the following evidence: …." The feature character string can then be determined with the feature expression as "it is determined that the dispute focus of this case is". The preset termination symbol may be a specified punctuation mark, a specified word, a specified phrase, a specified sentence, a specified text format, or the like. By writing convention, related content is generally grouped together and closed with a full stop, so the full stop can be set as the termination symbol. The target block text in the above example is then "it is determined that the dispute focus of this case is: 1. …; 2. …; 3. …."
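Steps S221 and S222 can be sketched as a regular-expression search followed by a scan for the termination symbol. The English patterns in `FEATURE_EXPRESSIONS` are hypothetical placeholders for the feature expressions. One assumption deserves attention: in Chinese text the full stop "。" does not collide with list numbering, whereas an English period does, so the demonstration input numbers its items as "(1)", "(2)".

```python
import re
from typing import List, Optional

# Hypothetical feature expressions of the feature model (assumption).
FEATURE_EXPRESSIONS: List[str] = [
    r"the dispute focus of this case is\s*:",
    r"the disputed issues? (is|are)\s*:",
]


def target_block_text(specified_block: str, terminator: str = ".") -> Optional[str]:
    """Return the span from the feature character string to the first terminator."""
    for expression in FEATURE_EXPRESSIONS:
        # S221: find the feature character string.
        match = re.search(expression, specified_block, flags=re.IGNORECASE)
        if match is None:
            continue
        # S222: extend to the first termination symbol after the feature string.
        end = specified_block.find(terminator, match.end())
        if end == -1:
            return specified_block[match.start():]
        return specified_block[match.start():end + len(terminator)]
    return None
```

Returning `None` when no feature expression matches lets the caller skip specified block texts that do not actually contain the feature.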
S3, extracting each target block text using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-text in the target block text.
As the above example shows, the target block text still contains a lot of information. Displaying all of it at once is still inconvenient for the user to browse, so the target block text needs to be structured further.
For example, suppose the target block text reads: 1. the defendant committed domestic violence against the plaintiff, but the defendant does not admit the domestic violence, and the defendant does not agree to a divorce on the ground of domestic violence; 2. the defendant does not agree to the plaintiff having custody of the child, and the defendant does not agree to the plaintiff visiting the child; 3. the plaintiff has a debt relationship with Party A, and the plaintiff and the defendant jointly own a house as common property. Each dispute focus consists of several points, and each point can be regarded as a sub-text. For the user's convenience, these sub-texts need to be shown clearly, i.e. extracted to form a sub-structured text.
Specifically, before the sub-texts are extracted, a second extraction template must first be established. Fig. 5 is a flowchart of a method for establishing the second extraction template according to an embodiment of the present application. The method includes:
S301, acquiring a second reference sample, wherein the second reference sample and the target block text have the same content category;
S302, dividing each second reference sample into sample block texts according to a preset text classification rule;
S303, setting a classification label for each sample block text;
S304, combining all classification labels of the same second reference sample to generate a corresponding extraction template sample;
S305, combining the extraction template samples to generate a second extraction template.
The second extraction template is used to structure the target block text. As the above example shows, each dispute focus in the target block text generally corresponds to one content category. For example, "1. the defendant committed domestic violence against the plaintiff, but the defendant does not admit the domestic violence, and the defendant does not agree to a divorce on the ground of domestic violence" corresponds to the event category; "2. the defendant does not agree to the plaintiff having custody of the child, and the defendant does not agree to the plaintiff visiting the child" corresponds to the child category; and "3. the plaintiff has a debt relationship with Party A, and the plaintiff and the defendant jointly own a house as common property" corresponds to the property category. Therefore, to structure block text of these content categories, a second extraction template with extraction nodes corresponding to these content categories is needed. The second extraction template must cover the content categories to be structured, and its extraction nodes must be designed as classification labels tied to specific content. For example, in the target block text, "the defendant does not agree to the plaintiff having custody of the child" is given the custody classification label, and "the defendant does not agree to the plaintiff visiting the child" is given the visitation classification label. The second reference samples of the same content category are divided into sample block texts according to a text classification rule, for example by using punctuation marks as separators; a classification label is set for each sample block text; and a corresponding extraction template sample is generated from all classification labels of the same second reference sample.
Finally, the second extraction template is generated by learning from a plurality of extraction template samples. The principles of sample selection and template generation in this embodiment are the same as those for the first extraction template; reference may be made to the description above, which is not repeated here.
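The flow of steps S301 to S305 can be sketched as follows. This is a simplified illustration under stated assumptions: the semicolon separator, the keyword-based `keyword_label` function (a stand-in for whatever labelling process is actually used), and the merge rule (an ordered union of labels rather than a learned template) are all hypothetical.

```python
import re


def split_sample(sample, separators=";"):
    """S302: divide a reference sample into sample block texts at separators."""
    parts = re.split("[" + re.escape(separators) + "]", sample)
    return [p.strip() for p in parts if p.strip()]


def keyword_label(block):
    """S303: hypothetical labelling rule standing in for manual annotation."""
    if "custody" in block:
        return "custody"
    if "visit" in block:
        return "visitation"
    if "property" in block:
        return "common property"
    return "other"


def build_template_sample(sample, label_fn=keyword_label):
    """S304: combine the labels of one reference sample into a template sample."""
    return [label_fn(block) for block in split_sample(sample)]


def build_second_template(samples, label_fn=keyword_label):
    """S305: merge template samples; here an ordered union of labels (assumed)."""
    template = []
    for sample in samples:
        for label in build_template_sample(sample, label_fn):
            if label not in template:
                template.append(label)
    return template
```

Each label in the resulting list plays the role of one extraction node in the second extraction template.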
Further, after the second extraction template is determined, each sub-text in the target block text is extracted using the second extraction template. Specifically, Fig. 6 is a flowchart of a method for generating the sub-structured text according to an embodiment of the present application. The method includes:
S311, determining the extraction expression corresponding to each classification label in the second extraction template;
S312, matching the extraction expression against each target block text to obtain a sub-text;
S313, associating the classification label with the sub-text to generate the sub-structured text.
Specifically, the classification label "common property" may have a corresponding extraction expression of the form @common property.{0,6}(identification|scope)@, among others. Matching the extraction expression against the target block text yields the sub-text "the plaintiff and the defendant jointly own a house as common property". In the same way, the corresponding sub-texts can be obtained using the extraction expressions of the other classification labels in the second extraction template. Finally, associating each classification label with its sub-text yields the sub-structured text.
For example: domestic violence - the defendant committed domestic violence against the plaintiff;
domestic violence - the defendant does not admit the domestic violence;
whether to grant divorce - the defendant does not agree to a divorce on the ground of domestic violence;
custody - the defendant does not agree to the plaintiff having custody of the child;
visitation - the defendant does not agree to the plaintiff visiting the child;
debt issue - the plaintiff has a debt relationship with Party A;
common property - the plaintiff and the defendant jointly own a house as common property.
Therefore, the method provided by this embodiment can accurately extract the implicit information from the specified block text.
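Steps S311 to S313 amount to running one pattern per classification label over the target block text and pairing each hit with its label. The sketch below assumes English sample text and hypothetical regular expressions; the names `EXTRACTION_EXPRESSIONS` and `extract_sub_structured` are illustrative and do not come from the patent.

```python
import re

# Hypothetical extraction expressions keyed by classification label (S311).
# Each pattern captures one clause, bounded by ';', '.' or ':'.
EXTRACTION_EXPRESSIONS = {
    "custody": r"[^;.:]*\bcustody\b[^;.]*",
    "visitation": r"[^;.:]*\bvisit\w*[^;.]*",
    "common property": r"[^;.:]*\bcommon property\b[^;.]*",
}


def extract_sub_structured(target_block, expressions=EXTRACTION_EXPRESSIONS):
    """Return a list of (classification label, sub-text) pairs."""
    result = []
    for label, expr in expressions.items():
        # S312: match the extraction expression against the target block text.
        for match in re.finditer(expr, target_block):
            sub_text = match.group(0).strip()
            # Drop a leading list number such as "(1)".
            sub_text = re.sub(r"^\(\d+\)\s*", "", sub_text)
            # S313: associate the label with its sub-text.
            result.append((label, sub_text))
    return result
```

A list of pairs, rather than a dictionary, keeps repeated labels such as the two "domestic violence" entries in the example above.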
S4, updating the corresponding content in the first structured text using the sub-structured text to obtain a second structured text.
After determining the sub-structured text, the first structured text needs to be updated with the sub-structured text to obtain the final second structured text.
In one implementation, Fig. 7 is a flowchart of a method for updating the first structured text according to an embodiment of the present application. The method includes:
S411, if the target block text covers the specified block text, determining a pre-extraction node, wherein the pre-extraction node is the extraction node corresponding to the specified block text;
S412, replacing the pre-extraction node and its corresponding block text in the first structured text with the sub-structured text to obtain a second structured text.
In the scheme provided by this embodiment, the sub-structured text must appear in the final structured text. Because the sub-structured text is derived from the target block text, a target block text that covers the specified block text means that the content of the sub-structured text repeats the content of the specified block text. To avoid redundancy in the displayed text, the content corresponding to the specified block text must be deleted from the first structured text.
In the example above, suppose the specified block text is "determine that the dispute focus of this case is: 1. …; 2. …; 3. …." and the target block text is also "determine that the dispute focus of this case is: 1. …; 2. …; 3. …." The target block text then covers the specified block text. If the extraction node of this specified block text is "defendant's defense", then "defendant's defense" is the pre-extraction node. The sub-structured text "domestic violence - the defendant committed domestic violence against the plaintiff; domestic violence - the defendant does not admit the domestic violence; whether to grant divorce - the defendant does not agree to a divorce on the ground of domestic violence; custody - the defendant does not agree to the plaintiff having custody of the child; visitation - the defendant does not agree to the plaintiff visiting the child; debt issue - the plaintiff has a debt relationship with Party A; common property - the plaintiff and the defendant jointly own a house as common property." then replaces "defendant's defense - … determine that the dispute focus of this case is: 1. …; 2. …; 3. …", yielding the second structured text.
At this point, the user may directly locate specific dispute focus content by browsing the extraction nodes.
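The replacement described in S411 and S412 can be sketched as a simple list operation. The representation of a structured text as a list of (extraction node, text) pairs and the function name `replace_with_sub_structured` are assumptions made for this illustration.

```python
def replace_with_sub_structured(first_structured, pre_node, sub_structured):
    """Replace the pre-extraction node and its block text with the sub-structured text.

    first_structured: list of (extraction node, block text) pairs.
    pre_node: the extraction node whose block text is fully covered (S411).
    sub_structured: list of (classification label, sub-text) pairs.
    """
    second_structured = []
    for node, text in first_structured:
        if node == pre_node:
            # S412: the covered block is dropped and its refined entries inserted.
            second_structured.extend(sub_structured)
        else:
            second_structured.append((node, text))
    return second_structured
```

Building a new list leaves the first structured text untouched, which is convenient if both versions need to be kept.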
In another implementation, if the target block text covers a portion of the content in the specified block text and a reference relationship exists between the target block text and the content in the specified block text other than the portion of the content, the sub-structured text is added to the first structured text to obtain a second structured text.
Specifically, suppose the specified block text is "The court holds that …. The dispute focus of this case is: 1. …; 2. …; 3. …. Regarding dispute focus 1, …. Regarding dispute focus 2, …." The sub-structured text extracted for the dispute focus content is "domestic violence - the defendant committed domestic violence against the plaintiff; domestic violence - the defendant does not admit the domestic violence; whether to grant divorce - the defendant does not agree to a divorce on the ground of domestic violence; custody - the defendant does not agree to the plaintiff having custody of the child; visitation - the defendant does not agree to the plaintiff visiting the child; debt issue - the plaintiff has a debt relationship with Party A; common property - the plaintiff and the defendant jointly own a house as common property." The sub-texts come from only part of the specified block text, namely "The dispute focus of this case is: 1. …; 2. …; 3. …." However, the sub-structured text is related to "Regarding dispute focus 1, …. Regarding dispute focus 2, …." If "The dispute focus of this case is: 1. …; 2. …; 3. …." were deleted from the specified block text, the explanation in "Regarding dispute focus 1, …. Regarding dispute focus 2, …." would be incomplete and would lack its basis. To avoid this, the content of the first structured text is preserved and the sub-structured text is added to the first structured text, yielding the second structured text.
That is: court's opinion - The court holds that …. The dispute focus of this case is: 1. …; 2. …; 3. …. Regarding dispute focus 1, …. Regarding dispute focus 2, ….
Dispute focus content: domestic violence - the defendant committed domestic violence against the plaintiff;
domestic violence - the defendant does not admit the domestic violence;
whether to grant divorce - the defendant does not agree to a divorce on the ground of domestic violence;
custody - the defendant does not agree to the plaintiff having custody of the child;
visitation - the defendant does not agree to the plaintiff visiting the child;
debt issue - the plaintiff has a debt relationship with Party A;
common property - the plaintiff and the defendant jointly own a house as common property.
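The second update mode, in which the original block is kept and the sub-structured text is added, can be sketched as below. The reference check is a deliberately crude assumption (the remainder of the block mentions a numbered dispute focus); the source does not say how the reference relationship is detected, and the names `has_reference` and `add_sub_structured` are hypothetical.

```python
import re


def has_reference(specified_block, target_block):
    """Assumed heuristic: the rest of the block refers back to the target
    block if it mentions a numbered dispute focus."""
    remainder = specified_block.replace(target_block, "")
    return re.search(r"dispute focus \d+", remainder) is not None


def add_sub_structured(first_structured, node, target_block, sub_structured):
    """Keep every original entry and append the sub-structured text after
    the node whose block only partly consists of the target block text."""
    second_structured = []
    for current_node, text in first_structured:
        second_structured.append((current_node, text))  # original is preserved
        partial = target_block in text and text != target_block
        if current_node == node and partial and has_reference(text, target_block):
            second_structured.extend(sub_structured)
    return second_structured
```

When the conditions are not met, the function returns a copy of the first structured text unchanged, so the replacement mode can be handled separately.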
In addition, Fig. 8 is a structural schematic diagram of a referee document structuring device provided by an embodiment of the present application. The device comprises: a first extraction unit 1, configured to extract block texts in the referee document to be processed using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed; a target block text determining unit 2, configured to determine a target block text from the specified block text of the first structured text using a feature model; a second extraction unit 3, configured to extract each target block text using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and the corresponding sub-text in the target block text; and an updating unit 4, configured to update the corresponding content in the first structured text using the sub-structured text to obtain a second structured text.
Optionally, the first extraction unit includes: a node character determining unit, configured to determine node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings that correspond to the respective parts of the referee document to be processed, and the node characters are the initial characters of the part of the referee document that corresponds to an extraction node; a block text determining unit, configured to determine the block text corresponding to each extraction node, wherein the block text is all characters from the node characters of that extraction node up to the next node characters; and a first structured text generation unit, configured to associate each extraction node with its block text to generate the first structured text.
According to the referee document structuring method and device provided by the present application, block texts in the referee document to be processed are first extracted using a first extraction template to obtain a first structured text. A target block text is then determined from the specified block text of the first structured text using a feature model, and each target block text is extracted using a second extraction template to obtain a sub-structured text. Finally, the corresponding content in the first structured text is updated using the sub-structured text to obtain a second structured text. The method can therefore further extract and classify the first structured text to obtain a more refined second structured text, so that the content of the referee document to be processed is displayed more completely.
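The three sub-units of the first extraction unit can be sketched as one function: locate the node characters, cut the document at each of them, and pair each extraction node with its block text. The mapping from extraction nodes to node characters and the function name `first_extraction` are assumptions for this illustration.

```python
def first_extraction(document, node_markers):
    """Build a first structured text as a list of (extraction node, block text).

    node_markers: dict mapping each extraction node to its node characters,
    i.e. the characters that start the corresponding part of the document.
    """
    # Node character determining unit: find where each part begins.
    positions = []
    for node, marker in node_markers.items():
        index = document.find(marker)
        if index != -1:
            positions.append((index, node))
    positions.sort()

    # Block text determining unit: a block runs up to the next node characters.
    structured = []
    for i, (start, node) in enumerate(positions):
        end = positions[i + 1][0] if i + 1 < len(positions) else len(document)
        # First structured text generation unit: pair node and block text.
        structured.append((node, document[start:end].strip()))
    return structured
```

Extraction nodes whose node characters do not occur in the document are simply skipped in this sketch.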
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (9)

1. A method of structuring referee documents, the method comprising:
extracting block texts in the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed;
defining a part of the block texts containing implicit information in the first structured text as specified block texts, wherein the implicit information comprises text content that is scattered within the block texts, is of interest to the user, and can be obtained only through further browsing and extraction;
matching the specified block texts by using each feature expression in the feature model to obtain the feature character string in each specified block text;
determining a target block text, wherein the target block text is all characters from the feature character string to a preset terminator;
extracting each target block text by using a second extraction template to obtain a sub-structured text, wherein the sub-structured text consists of each extraction node in the second extraction template and a corresponding sub-text in the target block text;
and updating the corresponding content in the first structured text by using the sub-structured text to obtain a second structured text.
2. The method of claim 1, wherein extracting the block text in the referee document to be processed using the first extraction template to obtain the first structured text comprises:
determining node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings that correspond to the respective parts of the referee document to be processed, and the node characters are the initial characters of the part of the referee document that corresponds to an extraction node;
determining a block text corresponding to each extraction node, wherein the block text is all characters from the node characters corresponding to the extraction node to the next node characters;
and associating each extraction node with its block text to generate the first structured text.
3. The method of claim 1, wherein determining the target block text from the specified block text of the first structured text using the feature model comprises:
obtaining a first reference sample, wherein the first reference sample has the same text structure as the first structured text;
acquiring the feature to be extracted corresponding to the feature model;
determining a corresponding feature block text of the feature to be extracted in each first reference sample;
summarizing the number of feature block texts corresponding to the same feature to be extracted;
and determining a specified block text, wherein the specified block text is a feature block text for which the ratio of said number to the total number of first reference samples is greater than or equal to a preset threshold.
4. The method of claim 1, wherein, before extracting each target block text using the second extraction template to obtain the sub-structured text, the method further comprises:
acquiring a second reference sample, wherein the second reference sample and the target block text have the same content category;
dividing each second reference sample into sample block texts according to a preset text classification rule;
setting a classification label for each sample block text;
combining all classification labels of the same second reference sample to generate a corresponding extraction template sample;
and combining the extraction template samples to generate a second extraction template.
5. The method of claim 4, wherein extracting each of the target block text using a second extraction template to obtain a sub-structured text comprises:
determining an extraction expression corresponding to each classification label in the second extraction template;
matching the extraction expression with each target block text to obtain a sub-text;
and associating the classification label with the sub-text to generate the sub-structured text.
6. The method of claim 5, wherein updating the corresponding content in the first structured text with the sub-structured text to obtain a second structured text comprises:
if the target block text covers the specified block text, determining a pre-extraction node, wherein the pre-extraction node is the extraction node corresponding to the specified block text;
and replacing the pre-extraction node and its corresponding block text in the first structured text with the sub-structured text to obtain a second structured text.
7. The method of claim 5, wherein updating the corresponding content in the first structured text with the sub-structured text to obtain a second structured text comprises:
if the target block text covers part of the content in the specified block text, and a reference relationship exists between the target block text and the content of the specified block text other than that part, adding the sub-structured text to the first structured text to obtain a second structured text.
8. A referee document structuring device, comprising:
a first extraction unit, configured to extract block texts in the referee document to be processed by using a first extraction template to obtain a first structured text, wherein the first structured text consists of each extraction node in the first extraction template and the corresponding block text in the referee document to be processed;
a target block text determining unit, configured to define a part of the block texts containing implicit information in the first structured text as specified block texts, wherein the implicit information comprises text content that is scattered within the block texts, is of interest to the user, and can be obtained only through further browsing and extraction; and to determine a target block text from the specified block text of the first structured text by using a feature model;
the block text determining unit is used for determining block texts corresponding to each extracting node, wherein the block texts are all characters from node characters corresponding to the extracting node to the next node characters;
the second extraction unit is used for extracting each target block text by using a second extraction template to obtain a sub-structured text, and the sub-structured text consists of each extraction node in the second extraction template and a corresponding sub-text in the target block text;
and the updating unit is used for updating the corresponding content in the first structured text by utilizing the sub-structured text to obtain a second structured text.
9. The apparatus of claim 8, wherein the first extraction unit comprises:
a node character determining unit, configured to determine node characters in the referee document to be processed according to each extraction node in the first extraction template, wherein the extraction nodes are character strings that correspond to the respective parts of the referee document to be processed, and the node characters are the initial characters of the part of the referee document that corresponds to an extraction node;
and the first structured text generation unit is used for generating a first structured text by corresponding each extraction node to the block text.
CN202010041739.2A 2020-01-15 2020-01-15 Referee document structuring method and referee document structuring device Active CN111274354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010041739.2A CN111274354B (en) 2020-01-15 2020-01-15 Referee document structuring method and referee document structuring device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010041739.2A CN111274354B (en) 2020-01-15 2020-01-15 Referee document structuring method and referee document structuring device

Publications (2)

Publication Number Publication Date
CN111274354A CN111274354A (en) 2020-06-12
CN111274354B true CN111274354B (en) 2023-08-11

Family

ID=71002188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010041739.2A Active CN111274354B (en) 2020-01-15 2020-01-15 Referee document structuring method and referee document structuring device

Country Status (1)

Country Link
CN (1) CN111274354B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784505A (en) * 2020-06-30 2020-10-16 鼎富智能科技有限公司 Loan dispute decision book extraction method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08263518A (en) * 1995-03-24 1996-10-11 Fuji Xerox Co Ltd Hypertext device
DE19955717A1 (en) * 1998-11-11 2000-08-24 Ibm Converting unstructured data into structured data involves suggesting data structure element for selected input data segment that can be structured, allocating structure element as target element
CN106815207A (en) * 2015-12-01 2017-06-09 北京国双科技有限公司 For the information processing method and device of law judgement document
CN107590131A (en) * 2017-10-16 2018-01-16 北京神州泰岳软件股份有限公司 A kind of specification document processing method, apparatus and system
CN108009137A (en) * 2017-12-22 2018-05-08 中科鼎富(北京)科技发展有限公司 A kind of specification document processing method, apparatus and system based on configuration file
CN108197163A (en) * 2017-12-14 2018-06-22 上海银江智慧智能化技术有限公司 A kind of structuring processing method based on judgement document

Also Published As

Publication number Publication date
CN111274354A (en) 2020-06-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co.,Ltd.

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant