CN111222326A - Information extraction method and device for referee document

Info

Publication number: CN111222326A
Application number: CN202010042493.0A
Authority: CN
Inventors: 席丽娜; 王文军; 李德彦
Original assignee: Dinfo Beijing Science Development Co ltd
Current assignee: Dinfo Beijing Science Development Co ltd
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2020-06-02

Abstract

The application discloses an information extraction method and device of a referee document, which comprises the steps of firstly obtaining at least one target block containing target document elements from the referee document, wherein each target block corresponds to a content theme; then selecting an element tree corresponding to the target block according to the content theme corresponding to the target block, wherein the element tree comprises at least one element node and an extraction rule corresponding to the element node; and extracting at least one target document element from the target block by using the element tree. The method and the device can automatically extract the basic document elements from the referee document, thereby realizing the comprehensive understanding of the referee document.

Description

Information extraction method and device for referee document

Technical Field

The present application relates to the field of text processing technologies, and in particular, to a method and an apparatus for extracting information from a referee document.

Background

A referee document is the carrier that records litigation activities of the people's court, such as the trial process and its result, and is also the sole evidence by which the people's court determines and allocates the substantive rights and obligations of the parties. Referee documents usually have a regular structural framework and writing format, which may differ slightly between document types. Common types include civil referee documents (e.g., civil judgments), criminal referee documents (e.g., criminal judgments), administrative referee documents (e.g., administrative judgments), and other general litigation documents.

Because referee documents record important information such as the trial process and its result, and this information is valuable for analysis, for example for case analysis and case retrieval, extracting valuable information (i.e., document elements) from referee documents is a basic requirement of practitioners in the field. Existing techniques for extracting information from a referee document take the full text of the document as the analysis target and can obtain only a single or a few specified items of information, such as the judgment result; they cannot automatically structure the complete set of case elements, so the results obtained are unsatisfactory.

Therefore, in order to fully understand the content of a referee document, how to extract complete document elements from the referee document has become a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application provides an information extraction method and device of a referee document, which aim to solve the problem of how to extract complete document elements from the referee document.

In a first aspect, the present application provides a method for extracting information from a referee document, the method comprising:

acquiring at least one target block containing target document elements from a referee document, wherein each target block corresponds to a content subject;

selecting an element tree corresponding to the target block according to a content theme corresponding to the target block, wherein the element tree comprises at least one element node and an extraction rule corresponding to the element node;

and extracting at least one target document element from the target block by using the element tree, wherein each target document element corresponds to one element node.

In a second aspect, the present application also provides an information extraction apparatus for a referee document, the apparatus comprising:

an acquisition module, used for acquiring at least one target block containing target document elements from a referee document, wherein each target block corresponds to a content subject;

the selection module is used for selecting an element tree corresponding to the target block according to a content theme corresponding to the target block, wherein the element tree comprises at least one element node and an extraction rule corresponding to the element node;

and the extraction module is used for extracting at least one target document element from the target block by using the element tree, wherein each target document element corresponds to one element node.

According to the technical scheme, the embodiment of the application provides an information extraction method and device for a referee document, and the method comprises the steps of firstly obtaining at least one target block containing target document elements from the referee document, wherein each target block corresponds to a content theme; then selecting an element tree corresponding to the target block according to the content theme corresponding to the target block, wherein the element tree comprises at least one element node and an extraction rule corresponding to the element node; and extracting at least one target document element from the target block by using the element tree. The method can automatically extract the basic document elements from the referee document, thereby realizing the comprehensive understanding of the referee document.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.

FIG. 1 is a flow chart illustrating a method for extracting information from a referee document according to an exemplary embodiment of the present application;

FIG. 2 is a schematic flow chart of a refinement of step 100 in the embodiment shown in FIG. 1;

FIG. 3 is a block diagram of an information extraction apparatus for a referee document according to an exemplary embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In the judicial field, a referee document is a special document that records litigation activities of the people's court, such as the trial process and its result, and generally has a uniform structure and writing format. The referee documents to which this application relates include, but are not limited to, civil judgments and criminal judgments.

Because a referee document records important information such as the trial process and its result, which is valuable for analysis, the referee document can be comprehensively understood by extracting that information from it, for example basic document elements such as the case type, case number, trial court name, trial court level, region, collegial panel members, acceptance time, and trial time.

The embodiment of the application provides an information extraction method of a referee document, which is suitable for extracting basic document elements in the referee document and realizing automatic structuring of the referee document. Fig. 1 is a flowchart illustrating an information extraction method for a referee document according to an exemplary embodiment of the present application. As shown in fig. 1, the method may include:

Step 100, at least one target block containing target document elements is obtained from the referee document, and each target block corresponds to a content theme.

As mentioned above, a referee document has a uniform structure and writing format, and each part of the document (i.e., each text block) corresponds to a content subject that represents the topic of the content covered by that part.

Taking a criminal judgment as an example, it consists of header information, party information, trial process, litigant request, litigant defense, dispute focus, evidence catalogue, trial finding, court opinion, judgment result and tail information. The component corresponding to each topic has a specific writing format or narrative mode, and each component contains set elements; for example, the header information necessarily contains the trial court name and the case number.

In the prior art, the full text of a referee document is taken as the analysis target, and a single or a few document elements are extracted from the full text. This approach not only increases the complexity of analysis and calculation and consumes a large amount of unnecessary computing resources, but also, when several similar document elements exist in the referee document, interferes with the extraction of the target document element and reduces the accuracy of the extraction result. For example, if the target document element to be extracted is the "acceptance time", the "trial time" or "filing time" in the document is easily confused with it.

To avoid these problems, the present application exploits the fact that a referee document has a regular structure in which each component covers set elements: the referee document to be processed is cut into blocks, the target blocks containing the target document elements are selected from the resulting text blocks as analysis targets, and the target document elements are extracted from the target blocks, which improves the accuracy of the extraction result.

Fig. 2 is a schematic flowchart of a detailed process of step 100 in the embodiment shown in fig. 1, and as shown in fig. 2, the obtaining at least one target block including a target document element from a referee document by using a directory tree in the embodiment of the present application may specifically include:

Step 110, acquiring the document type of the referee document.

In the present application, the document type of the referee document covers document forms such as judgment, ruling, and decision, and may also cover case types such as criminal, civil, and administrative. The structure of referee documents of different document types may differ slightly, so the document type of the referee document to be processed needs to be acquired so that directory trees of different structures can be used according to the document type.

In a concrete implementation, the document name is first obtained from the referee document, and type keywords are then extracted from the document name; different type keywords represent different document types. Because a referee document has a uniform writing format, the document name that indicates the document type is written at a specific position in the referee document; for example, the name "criminal judgment" of a criminal judgment is written on the second line. The document name can therefore be acquired at a designated position of the referee document.

In addition, in order to extract the type keyword from the document name, a type keyword set may be preset, the type keyword in the type keyword set may be matched with the document name, and the type keyword may be extracted from the document name according to a matching result. For example, when "civil" and "judgment" are matched in the document name, the document type is determined as a civil judgment, and when "criminal" and "judgment" are matched in the document name, the document type is determined as a criminal judgment.
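The keyword matching of step 110 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the keyword sets, the function names, and the Chinese sample strings (an assumed original-language rendering of the example document name) are all hypothetical.

```python
# Hypothetical preset type keyword sets (the source only states that such a set is preset).
CASE_KEYWORDS = {"刑事": "criminal", "民事": "civil", "行政": "administrative"}
FORM_KEYWORDS = {"判决": "judgment", "裁定": "ruling", "决定": "decision"}


def get_document_name(text, line_index=1):
    """Read the document name from a designated line (the second non-empty line by default)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[line_index] if len(lines) > line_index else ""


def detect_document_type(document_name):
    """Match the preset keywords against the document name; return (case, form) or None."""
    case = next((v for k, v in CASE_KEYWORDS.items() if k in document_name), None)
    form = next((v for k, v in FORM_KEYWORDS.items() if k in document_name), None)
    if case is None or form is None:
        return None
    return (case, form)
```

For instance, a document name containing both the "criminal" and "judgment" keywords is classified as a criminal judgment, and a name matching neither set yields no type.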

Step 120, selecting a directory tree corresponding to the referee document according to the document type, wherein the directory tree comprises at least one directory node corresponding to the content subject, and each directory node corresponds to at least one extraction expression.

In the present application, in order to divide a complete referee document into at least one text block whose content can be summarized under a general content topic, a directory tree structure is created in advance according to the regular structure of referee documents of a given document type and the content topic corresponding to each part. The created directory tree comprises at least one directory node, and each directory node corresponds to at least one extraction expression.

In some embodiments, the directory nodes list, in order, the content topics of the text blocks that may exist in the referee document, and the extraction expression under a directory node is used to extract from the referee document the text block corresponding to that directory node, i.e., to that content topic. A text block includes one or more paragraphs.

Illustratively, one possible directory tree structure is as follows:

Criminal judgment

  Header information -- <extraction expression>

  Party information -- <extraction expression>

  Trial process -- <extraction expression>

  Litigant request -- <extraction expression>

  Litigant defense -- <extraction expression>

  Trial finding -- <extraction expression>

  Dispute focus -- <extraction expression>

  Court opinion -- <extraction expression>

  Judgment result -- <extraction expression>

  Tail information -- <extraction expression>

Where "criminal judgment" is the name of a directory tree selected according to the type of document, "header information" and the like are directory nodes included in the directory tree.
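One possible in-memory representation of such a directory tree, and of its selection by document type in step 120, is sketched below. The class names, the `TREES` registry, and the placeholder expression string are assumptions made for illustration; the source does not specify a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryNode:
    topic: str                                        # content topic of the text block
    expressions: list = field(default_factory=list)   # one or more extraction expressions


@dataclass
class DirectoryTree:
    name: str                                         # named after the document type
    nodes: list = field(default_factory=list)         # directory nodes in document order


PLACEHOLDER = "<extraction expression>"  # the concrete expressions are not given in the source
TOPICS = [
    "header information", "party information", "trial process",
    "litigant request", "litigant defense", "trial finding",
    "dispute focus", "court opinion", "judgment result", "tail information",
]

TREES = {
    "criminal judgment": DirectoryTree(
        "criminal judgment", [DirectoryNode(t, [PLACEHOLDER]) for t in TOPICS]
    ),
}


def select_directory_tree(document_type):
    """Return the directory tree registered for the document type, or None."""
    return TREES.get(document_type)
```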

Step 130, performing block cutting processing on the referee document according to the directory tree to obtain at least one text block corresponding to the directory node.

In some embodiments, the extraction expression corresponding to each directory node is used to extract the block header information of each text block, so that the start position of each text block can be determined according to the block header information, and paragraph contents between two adjacent start positions are extracted to obtain the corresponding text block.
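The block cutting described here can be sketched as below: each directory node's expression locates the header of its block, and the text between two adjacent start positions becomes one block. The regular expressions and the function name `cut_blocks` are hypothetical; real expressions would be tuned to the writing format of each document type.

```python
import re

# Hypothetical (topic, extraction expression) pairs; each expression matches
# the block header, i.e. the opening words of the block.
NODES = [
    ("header information", r"^.*人民法院$"),
    ("party information", r"^公诉机关"),
    ("trial process", r"^.*人民检察院以"),
    ("judgment result", r"^判决如下"),
]


def cut_blocks(text, nodes):
    """Cut text into blocks that run from one matched block header to the next."""
    starts = []
    for topic, pattern in nodes:
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:  # a directory node whose block is absent is simply skipped
            starts.append((match.start(), topic))
    starts.sort()
    ends = [pos for pos, _ in starts[1:]] + [len(text)]
    return {topic: text[pos:end].strip() for (pos, topic), end in zip(starts, ends)}
```

The returned dictionary is keyed by directory node name, so the content topic of every block is known directly from its key.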

An exemplary segmentation result obtained by performing the block cutting of step 130 on a certain criminal judgment is as follows:

<header information>

People's Court of XX District, Yongzhou City, Hunan Province

Criminal Judgment

(2014) Yong-Leng Xing Chu Zi No. 89

<party information>

Public prosecution organ: People's Procuratorate of XX District, Yongzhou City.

Defendant Zhang XX, male, born on January 8, 1978, Han nationality, from XX City, Hunan Province, junior middle school education, unemployed; on suspicion of theft, he was placed on bail pending trial by the XX Branch of the Yongzhou Public Security Bureau on January 3, 2014.

<trial process>

The People's Procuratorate of XX District, Yongzhou City, by indictment Yong-Leng Jian Xing Su (2014) No. 90, charged the defendant Zhang XX with the crime of theft and initiated a public prosecution with this court on May 13, 2014. After accepting the case, this court formed, according to law, a collegial panel with judge Gao XX as presiding judge and people's assessors Guan XX and He XX as members, and heard the case in open court in the First Trial Courtroom of this court on May 29 and July 21, 2014. Acting clerk Lu XX served as court recorder. The People's Procuratorate of XX District, Yongzhou City, assigned prosecutor Zhao XX to appear in court in support of the public prosecution, and the defendant Zhang XX appeared in court to participate in the proceedings. The trial of this case has now concluded.

<litigant request>

The People's Procuratorate of XX District, Yongzhou City, charges that the defendant Zhang XX, together with XX …, on May 10, 2011 and …, and requests that this court render judgment according to law.

<litigant defense>

The defendant Zhang XX confessed to the criminal facts charged by the People's Procuratorate of XX District, Yongzhou City, and raised no objection.

<evidence catalogue>

The above facts are proved by the following evidence, which was submitted by the public prosecution organ and cross-examined and authenticated in court:

…

<trial finding>

Upon trial it is found that the defendant Zhang XX, together with Zhang XX …. The details are as follows:

<court opinion>

This court holds that the defendant X ….

<judgment result>

The judgment is as follows:

The defendant Zhang XX committed the crime of theft and is sentenced to …. If dissatisfied with this judgment, an appeal may be filed ….

<tail information>

Presiding judge Gao XX

People's assessor Guan XX

People's assessor He XX

July 21, 2014

Acting clerk Lu XX

In step 130, the directory nodes of the directory tree are designed according to the block composition of referee documents of the given type, so the directory nodes correspond to the blocks of the referee document. The text blocks obtained by cutting with the directory tree therefore correspond one-to-one to the directory nodes, and the content subject of each text block can be obtained from the name of its directory node.

Step 140, determining at least one target block containing the target document element according to the directory node corresponding to the text block.

It can be understood that different users may wish to acquire different document elements, and different document elements are contained in different text blocks. For example, the trial court name, trial court level and region are contained in the header information block, and collegial panel members such as the presiding judge and the judges are contained in the tail information block. Therefore, depending on which target document elements are to be examined, the target blocks containing them can be selected through the visualized directory nodes and used as the analysis targets for the subsequent element extraction. For example, when the trial court name needs to be viewed or acquired, the header information block is selected as the target block.
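The selection of target blocks in step 140 can be illustrated with a small lookup. The mapping `ELEMENT_TO_BLOCK` and the function name are hypothetical; in the described method the user makes the choice through the directory nodes, whereas this sketch derives it from the requested elements.

```python
# Hypothetical mapping from a document element to the directory node
# (text block) that contains it.
ELEMENT_TO_BLOCK = {
    "trial court name": "header information",
    "case number": "header information",
    "acceptance time": "trial process",
    "presiding judge": "tail information",
}


def select_target_blocks(blocks, target_elements):
    """Keep only the text blocks that contain at least one requested element."""
    selected = {}
    for element in target_elements:
        topic = ELEMENT_TO_BLOCK.get(element)
        if topic in blocks and topic not in selected:
            selected[topic] = blocks[topic]
    return selected
```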

Step 200, selecting an element tree corresponding to the target block according to the content subject corresponding to the target block, wherein the element tree comprises at least one element node and an extraction rule corresponding to the element node.

As can be seen from the text blocks obtained by the block cutting process in step 100, different text blocks contain different document elements. For example, the header information contains document elements such as "document name", "case number", "case type", "region", "trial court name", and "trial court level"; the trial process contains document elements such as "prosecution time", "acceptance time", "filing time", "trial time", "judgment time", and "trial period"; and the tail information contains document elements such as "presiding judge", "judge", "acting judge", "people's assessor", "clerk", "acting clerk", and "judgment date".

In order to extract document elements included in a given target block from the target block, an element tree structure matched with the specific target block is created in advance, so that different document elements can be extracted from different text blocks by using different element trees. Each element tree comprises at least one element node, each element node corresponds to at least one extraction rule, and the extraction rules are used for extracting the document elements corresponding to the element nodes from the target blocks.

Based on this, in step 200, the element tree corresponding to the target block is selected according to the content subject corresponding to the target block.

Illustratively, for the header information block, the pre-created element tree is as follows:

Header information block

  Document name -- <extraction rule>

  Case number -- <extraction rule>

  Case type -- <extraction rule>

  Region -- <extraction rule>

  Trial court name -- <extraction rule>

  Trial court level -- <extraction rule>

…

For another example, for the trial process block, the pre-created element tree is as follows:

Trial process block

  Prosecution time -- <extraction rule>

  Acceptance time -- <extraction rule>

  Filing time -- <extraction rule>

  Trial time -- <extraction rule>

  Judgment time -- <extraction rule>

  Trial period -- <extraction rule>

…

For another example, for the tail information block, the pre-created element tree is as follows:

Tail information block

  Presiding judge -- <extraction rule>

  Judge -- <extraction rule>

  Acting judge -- <extraction rule>

  People's assessor -- <extraction rule>

  Clerk -- <extraction rule>

  Acting clerk -- <extraction rule>

  Judgment date -- <extraction rule>

…

In some embodiments, the element nodes of the three element trees above may exist in a single element tree structure, as child nodes under different parent nodes.

Take the following element tree structure as an example. The "criminal judgment" and "civil judgment" nodes are at the same level and can be called root nodes; a user can select the corresponding root node according to the document type of the referee document. The "header information block", "trial process block" and "tail information block" nodes are at the same level; they are child nodes of the "criminal judgment" root node and can be called parent nodes. The "document name", "prosecution time" and "presiding judge" nodes are at the same level and are child nodes of the "header information block", "trial process block" and "tail information block", respectively.

Civil judgment

  …

Criminal judgment

  Header information block

    Document name -- <extraction rule>

    …

  Trial process block

    Prosecution time -- <extraction rule>

    …

  Tail information block

    Presiding judge -- <extraction rule>

    …
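A combined element tree of this kind, together with the selection of step 200, might be represented as nested dictionaries, as in the sketch below. The structure, the names, and the placeholder rule string are assumptions; the civil judgment subtree is left empty because the source elides it.

```python
RULE = "<extraction rule>"  # concrete rules are not given in the source

# Root node (document type) -> parent node (block) -> element node -> rule.
ELEMENT_TREES = {
    "civil judgment": {},  # elided in the source
    "criminal judgment": {
        "header information block": {
            "document name": RULE, "case number": RULE, "case type": RULE,
            "region": RULE, "trial court name": RULE, "trial court level": RULE,
        },
        "trial process block": {
            "prosecution time": RULE, "acceptance time": RULE, "filing time": RULE,
            "trial time": RULE, "judgment time": RULE, "trial period": RULE,
        },
        "tail information block": {
            "presiding judge": RULE, "judge": RULE, "acting judge": RULE,
            "people's assessor": RULE, "clerk": RULE, "acting clerk": RULE,
            "judgment date": RULE,
        },
    },
}


def select_element_tree(document_type, block_name):
    """Return the element nodes registered for a block of a document type, or None."""
    return ELEMENT_TREES.get(document_type, {}).get(block_name)
```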

Step 300, extracting at least one target document element from the target block by using the element tree, wherein each target document element corresponds to one element node.

In the embodiment of the present application, since different element nodes are used for extracting different target document elements, the extraction rule corresponding to each element node is different, and the extraction rule may be: a positioning rule, a time extraction rule, or a normalized element matching rule.

The positioning rule comprises a pre-positioning rule and a post-positioning rule, both based on regular expressions. Its main principle is to determine the start position of the target document element in the target block with the pre-positioning rule and the end position of the target document element in the target block with the post-positioning rule.

In some embodiments, determining the start position of the target document element in the target block by using the pre-positioning rule comprises: identifying pre-positioning information of the target document element by using the pre-positioning rule; and determining the start position of the target document element in the target block according to the pre-positioning information. The pre-positioning information may be a specific Chinese word or a specific Chinese context, such as the role label in front of the name of a collegial panel member, or it may be a Chinese or non-Chinese character at a specific position index, for example the first Chinese character in the header information block serving as the pre-positioning information of the trial court name.

In some embodiments, determining the end position of the target document element in the target block by using the post-positioning rule comprises: identifying post-positioning information of the target document element by using the post-positioning rule; and determining the end position of the target document element in the target block according to the post-positioning information. The post-positioning information may be a specific suffix word, such as "court" or "division" for the trial court name, or a non-Chinese character at a specific position index, such as a line-feed character.
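A generic positioning rule can be sketched as a pair of regular expressions, one for the pre-positioning information and one for the post-positioning information. The function name, the `include_pre` and `include_post` flags, and the Chinese sample strings in the usage note are assumptions introduced for illustration; the source does not say whether the anchors themselves belong to the extracted element, so the flags make that choice explicit.

```python
import re


def extract_by_positioning(block, pre, post, include_pre=False, include_post=False):
    """Return the text between the pre- and post-positioning information, or None."""
    pre_match = re.search(pre, block)
    if pre_match is None:
        return None
    # Start before the anchor (anchor is part of the element) or after it.
    start = pre_match.start() if include_pre else pre_match.end()
    # Look for the post-positioning information only after the pre anchor.
    post_match = re.compile(post).search(block, pre_match.end())
    if post_match is None:
        return None
    end = post_match.end() if include_post else post_match.start()
    return block[start:end]
```

With a role label as the pre anchor and a line feed as the post anchor, `extract_by_positioning("审判长 高XX\n", "审判长", "\n")` returns the text that follows the label on that line; with the first Chinese character and a court suffix as anchors and both flags set, it returns a court name including its suffix.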

With reference to the criminal judgment described in this application, a specific implementation of extracting target document elements from the header information block by using the positioning rule is described below. The content of the header information block is as follows:

People's Court of XX District, Yongzhou City, Hunan Province

Criminal Judgment

(2014) Yong-Leng Xing Chu Zi No. 89

Illustratively, the trial court name is extracted from the header information block by using the extraction rule corresponding to the element node "trial court name" in the element tree corresponding to the header information block, where the extraction rule is a positioning rule comprising a pre-positioning rule and a post-positioning rule.

Specifically, the pre-positioning rule locates the position index of the first Chinese character in the header information block, identifies that character as the pre-positioning information of the trial court name, and takes the position immediately before it as the start position of the trial court name. The post-positioning rule recognizes a preset suffix word of the trial court name, such as "court" or "division", as the post-positioning information, and takes the position immediately after the suffix word as the end position of the trial court name (in the original Chinese text the court name opens the block and ends with this suffix). The text between the start position and the end position is then extracted from the header information block to obtain the trial court name.

As another example, the case number is extracted from the header information block by using the extraction rule corresponding to the element node "case number" in the element tree corresponding to the header information block, where the extraction rule is a positioning rule comprising a pre-positioning rule and a post-positioning rule.

Specifically, the pre-positioning rule finds the position index of the last line-feed character matched in the header information block, which gives the start position of the case number. The post-positioning rule recognizes a preset suffix word of the case number, such as "number" (the word that closes a case number in the original Chinese), as the post-positioning information, and takes the position of the suffix word as the end position of the case number. The text between the start position and the end position is then extracted from the header information block to obtain the case number.
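The case number rule can be sketched with plain string searches. The function name and the Chinese sample text used in the test (an assumed original-language rendering of the example header) are hypothetical; the suffix character "号" is the word translated above as "number".

```python
def extract_case_number(header_block, suffix="号"):
    """Take the text from the last line feed up to and including the suffix word."""
    block = header_block.strip()       # drop a trailing line feed, if any
    start = block.rfind("\n") + 1      # just after the last line feed (0 if there is none)
    end = block.find(suffix, start)    # post-positioning information
    if end == -1:
        return None
    return block[start:end + len(suffix)]
```

This sketch relies on the case number being the last line of the header information block, which holds for the example document but is an assumption in general.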

As another example, the region is extracted from the header information block by using the extraction rule corresponding to the element node "region" in the element tree corresponding to the header information block, where the extraction rule is a positioning rule comprising a pre-positioning rule and a post-positioning rule.

Specifically, the pre-positioning rule locates the position index of the first Chinese character in the header information block, identifies that character as the pre-positioning information of the region, and takes the position immediately before it as the start position of the region. Under the post-positioning rule, at least one preset suffix word of the region is matched against the header information block in a preset order; the first suffix word that matches successfully is taken as the post-positioning information of the region, and the position immediately after it is taken as the end position of the region. The text between the start position and the end position is then extracted from the header information block to obtain the region.

In addition, because the target document elements corresponding to some element nodes are contained in the extraction results of one or more other element nodes, those target document elements can be extracted directly from the extraction results of the other element nodes. For example, the "region" is contained in the "trial court name", so the region can be extracted directly from the extracted trial court name by using the pre-positioning rule and the post-positioning rule.
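Extracting the region from an already extracted trial court name might look like the sketch below. The suffix words and their order (district, county, city, province) are hypothetical, since the source does not list the preset suffix words; the sample court names are assumed Chinese renderings.

```python
import re

# Hypothetical preset suffix words, tried in this order.
REGION_SUFFIXES = ("区", "县", "市", "省")


def extract_region(court_name, suffixes=REGION_SUFFIXES):
    """Return the text from the first Chinese character through the first suffix that matches."""
    first = re.search(r"[\u4e00-\u9fff]", court_name)  # pre-positioning information
    if first is None:
        return None
    start = first.start()
    for suffix in suffixes:                            # preset matching order
        index = court_name.find(suffix, start)
        if index != -1:
            return court_name[start:index + len(suffix)]
    return None
```

Trying the most specific suffix first means a district-level court yields a region down to the district, while a provincial court falls through to the province suffix.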

A specific implementation of extracting target document elements from the tail information block by using the positioning rule is described below with reference to the criminal judgment described in this application. The content of the tail information block is as follows:

Presiding judge Gao XX

People's assessor Guan XX

People's assessor He XX

July 21, 2014

Acting clerk Lu XX

It can be seen that the document elements in the tail information block include the collegial panel members and the judgment date. Each collegial panel member's name corresponds to a role label; for example, "Gao XX" corresponds to the role label "presiding judge".

Illustratively, the name of the presiding judge is extracted from the tail information block by using the extraction rule corresponding to the element node "presiding judge" in the element tree corresponding to the tail information block, where the extraction rule is a positioning rule comprising a pre-positioning rule and a post-positioning rule.

Specifically, the pre-positioning rule identifies the role label "presiding judge" as the pre-positioning information of the corresponding member name, which determines the start position of the member name. The post-positioning rule takes the line-feed character closest to the identified role label as the post-positioning information of the member name, which determines the end position of the member name. Finally, the text between the start position and the end position is extracted from the tail information block to obtain the name of the member serving as presiding judge.
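The role-label rule can be sketched as follows. The function name and the Chinese sample block in the test (an assumed original-language rendering of the tail information) are hypothetical; the sketch also strips surrounding whitespace from each name and returns every match, since one role label such as the people's assessor label may occur more than once.

```python
def extract_members(tail_block, role_label):
    """Return the names that follow each occurrence of role_label, up to the next line feed."""
    names, position = [], 0
    while True:
        hit = tail_block.find(role_label, position)   # pre-positioning information
        if hit == -1:
            return names
        start = hit + len(role_label)
        end = tail_block.find("\n", start)            # nearest line feed after the label
        if end == -1:
            end = len(tail_block)
        names.append(tail_block[start:end].strip())   # strip redundant blank spaces
        position = end
```

A plain substring search would also find a short label inside a longer one (for example a "judge" label inside an "acting judge" label), so a production rule would need a stricter pattern than this sketch uses.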

It should be noted that the method of the present application further includes performing data cleaning on the target document element to remove redundant information, such as blank spaces, from the target document element.

In some embodiments, the extraction rule corresponding to one or more element nodes in the element tree is a time extraction rule. Specifically, the time extraction rule is at least one time extraction expression for extracting time elements from the trial process block and the tail information block, where the time elements are the "prosecution time", "acceptance time", "filing time", "trial time", and "trial period" contained in the trial process block, and the "referee time" contained in the tail information block.

Specifically, the time extraction expression is a regular expression that supports multiple date formats and recognizes numeric information written in Chinese numerals, Arabic numerals, and full-width or half-width characters.
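A minimal sketch of such a time extraction expression is given below. It assumes Chinese-language source text with dates in the "年/月/日" form; the pattern, the helper names, and the use of Unicode NFKC normalization to fold full-width digits into half-width digits are illustrative assumptions, not the expression actually used in the present application.

```python
import re
import unicodedata
from typing import List, Tuple

# Year, month, and day each accept either Arabic digits or Chinese numerals.
DATE_RE = re.compile(
    r"([0-9]{4}|[〇零一二三四五六七八九]{4})年"
    r"([0-9]{1,2}|十[一二]?|[一二三四五六七八九])月"
    r"([0-9]{1,2}|[一二三]?十[一二三四五六七八九]?|[一二三四五六七八九])[日号]"
)

_CN_DIGITS = {ch: i for i, ch in enumerate("〇一二三四五六七八九")}
_CN_DIGITS["零"] = 0


def _to_int(token: str) -> int:
    """Convert an Arabic or Chinese numeral token matched by DATE_RE to an int."""
    if token.isascii():
        return int(token)
    if "十" in token:  # e.g. 十 -> 10, 十三 -> 13, 二十一 -> 21
        tens, _, ones = token.partition("十")
        return (_CN_DIGITS[tens] if tens else 1) * 10 + (_CN_DIGITS[ones] if ones else 0)
    value = 0
    for ch in token:  # digit-by-digit form, e.g. 二〇一四 -> 2014
        value = value * 10 + _CN_DIGITS[ch]
    return value


def extract_dates(text: str) -> List[Tuple[int, int, int]]:
    """Return every (year, month, day) found in the text, in order of appearance."""
    # NFKC folds full-width digits such as ２０１４ into half-width 2014.
    normalized = unicodedata.normalize("NFKC", text)
    return [
        (_to_int(y), _to_int(m), _to_int(d))
        for y, m, d in DATE_RE.findall(normalized)
    ]
```

Normalizing before matching keeps the pattern itself small: only two numeral systems have to be handled in the expression, while the full-width and half-width variants are unified in a single preprocessing step.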

Because the trial process block contains time elements of several categories (for example, the acceptance time and the filing time belong to different categories), the categories of the extracted time elements need to be distinguished. For this purpose, in some embodiments of the present application, the aforementioned time extraction rule includes at least one time extraction expression and a post-processing rule.

With reference to the content of the criminal judgment, a specific implementation of extracting time elements from the trial process block by using the time extraction rule is described below, where the content of the trial process block is as follows:

The People's Procuratorate of XX District, Yongzhou City, by Yong-Leng Procuratorate Criminal Indictment (2014) No. 90, charged the defendant Zhang XX with theft and, on May 13, 2014, initiated a public prosecution with this court. After accepting the case, this court lawfully formed a collegial panel, with judge Gao XX as presiding judge and people's assessors Guan XX and He XX participating, and, on May 29, 2014, tried the case in open court in the First Tribunal of this court. Acting clerk Lu XX served as the court recorder. The People's Procuratorate of XX District, Yongzhou City, assigned prosecutor Zhao XX to appear in court in support of the public prosecution, and the defendant Zhang XX appeared in court to participate in the proceedings. The trial of this case has now concluded.

For the trial process block, the time extraction expression is first used to extract more than one time element, yielding May 13, 2014 and May 29, 2014.

To distinguish the categories of these time elements, a post-processing rule is used to disambiguate them. Each category can be characterized by at least one time feature word; for example, the time feature words of the prosecution time are "initiate a public prosecution" and/or "prosecute", the time feature word of the acceptance time is "accept", and the time feature word of the trial time is "try". In addition, because referee documents are written in a conventional, formulaic style, it can be concluded that each time element is preceded by the same information, such as "on", and that the time feature word of each time element tends to appear immediately after it.

Based on this, in the embodiment of the present application, for each time element, the post-processing rule is used to obtain the time feature word that is located after the time element and closest to it in the target block. The category of the time element is then determined from the obtained time feature word, and the time element corresponding to the element node is selected from the one or more time elements according to its category. For example, in "on May 13, 2014, initiated a public prosecution with this court", the feature word "initiate a public prosecution" appears immediately after "May 13, 2014"; because "initiate a public prosecution" characterizes the prosecution time, May 13, 2014 is determined to be the time element corresponding to the element node "prosecution time".
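The post-processing rule can be illustrated with the following sketch. It works on an assumed English rendering of a trial process block; the date pattern, the feature-word strings in `FEATURE_WORDS`, and the function names are hypothetical and serve only to show the "nearest feature word after the time element" logic.

```python
import re
from typing import Dict, List, Optional, Tuple

# Simplified date pattern for the English sample text (an assumption).
DATE_RE = re.compile(
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}"
)

# Hypothetical time feature words for each category of time element.
FEATURE_WORDS: Dict[str, List[str]] = {
    "prosecution time": ["initiated a public prosecution", "prosecuted"],
    "acceptance time": ["accepted"],
    "trial time": ["tried"],
}


def classify_time_elements(block: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each time element with the category of the nearest feature word after it."""
    results = []
    for match in DATE_RE.finditer(block):
        best_pos, best_category = None, None
        for category, words in FEATURE_WORDS.items():
            for word in words:
                pos = block.find(word, match.end())  # search only after the time element
                if pos != -1 and (best_pos is None or pos < best_pos):
                    best_pos, best_category = pos, category
        results.append((match.group(), best_category))
    return results


def select_for_node(block: str, node: str) -> Optional[str]:
    """Select the time element whose category equals the element node name."""
    for value, category in classify_time_elements(block):
        if category == node:
            return value
    return None


trial_process_block = (
    "The procuratorate charged the defendant Zhang XX with theft and, "
    "on May 13, 2014, initiated a public prosecution with this court. "
    "After this court accepted the case, a collegial panel was formed and, "
    "on May 29, 2014, tried the case in open court."
)
```

Here `select_for_node(trial_process_block, "prosecution time")` picks the first date because "initiated a public prosecution" is the closest feature word following it, even though "accepted" and "tried" also occur later in the block.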

For example, the element tree corresponding to the trial process block includes a plurality of element nodes, such as "prosecution time" and "acceptance time", and the extraction rule corresponding to each of these element nodes is a time extraction rule comprising a post-processing rule and at least one time extraction expression. The time element corresponding to an element node is extracted by using that node's extraction rule: one or more time elements are first extracted from the trial process block by using the time extraction expression, and the time element corresponding to the element node is then selected from them by using the post-processing rule.

Trial process block

Prosecution time -- extraction rule

Acceptance time -- extraction rule

Filing time -- extraction rule

Trial time -- extraction rule

Asserted time -- extraction rule

Trial period -- extraction rule

…

In the trial process block of a referee document, "prosecution" and "acceptance" may appear together; in this case, the extracted time element is used as both the prosecution time and the acceptance time.

In some embodiments, the extraction rule corresponding to one or more element nodes in the element tree is a normalized element matching rule. A normalized element is a document element that is necessarily expressed by a standard word in the referee document. For example, the level of a court above the basic level is necessarily expressed by a standard word such as "highest level", "high level", or "intermediate level", and the case type is necessarily expressed by a standard word such as "criminal", "civil", or "administrative".

In a specific implementation, a standard word set comprising at least one standard word is preset for the target normalized element. The standard word set is then matched against the target block according to the matching rule, and the document element is extracted from the target block according to the matching result. The matching rule may be priority matching or sequential matching.

When the matching rule is sequential matching, standard words are obtained from the standard word set one at a time in a preset order, and each obtained standard word is matched against the target block. If the matching succeeds, no further standard word is obtained and the matching process ends; if the matching fails, the next standard word is obtained from the standard word set and the matching process continues until no unmatched standard word remains in the standard word set.

Illustratively, the standard word set preset for the "trial court level" includes "highest level", "high level", and "intermediate level". According to the sequential matching rule, these standard words are obtained from the standard word set in that order and matched against the header information block: if "highest level" matches, the matching ends; if it fails, "high level" is matched against the header information block; if "high level" matches, the matching ends; if it fails, "intermediate level" is matched against the header information block, and the matching process ends. It should be noted that, in this example, if "highest level", "high level", and "intermediate level" all fail to match, the "trial court level" is determined to be the basic level.
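Sequential matching can be sketched as a simple ordered scan. In the sketch below, the function name, the shortened standard words ("highest", "high", "intermediate"), and the English court names are assumptions made for illustration; the fallback to the basic level follows the note above.

```python
from typing import Optional, Sequence


def sequential_match(
    standard_words: Sequence[str], block: str, default: Optional[str] = None
) -> Optional[str]:
    """Try each standard word in the preset order and stop at the first match."""
    text = block.lower()
    for word in standard_words:
        if word in text:
            return word      # matching succeeded: no further standard word is tried
    return default           # every standard word failed to match


# Hypothetical standard word set for the "trial court level", in its preset order.
COURT_LEVELS = ["highest", "high", "intermediate"]

level = sequential_match(COURT_LEVELS, "XX Province High People's Court", default="basic")
```

Because "high" is a substring of "highest" in this rendering, the preset order matters: trying "highest" before "high" avoids reporting a highest-level court as a high-level one.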

When the matching rule is priority matching, standard words are obtained from the standard word set in order of their priority, and each obtained standard word is matched against the target block. If the matching succeeds, no standard word of the next priority is obtained and the matching process ends; if the matching fails, the standard word of the next priority is obtained from the standard word set and the matching process continues until no unmatched standard word remains in the standard word set. When the standard word set contains many standard words, the priority matching rule can reduce the number of matching operations.

Illustratively, the standard words in the standard word set preset for the "trial court level" are arranged by priority as follows: "high level", "intermediate level", and "highest level". According to the priority matching rule, the standard words are obtained from the standard word set in this order and matched against the header information block. If "high level" matches, the matching result can only be "highest level" or "high level"; in this case, only "highest level" needs to be obtained from the standard word set for matching, and "intermediate level" does not need to be matched. If "high level" fails to match, only "intermediate level" needs to be obtained from the standard word set for matching, and "highest level" does not need to be matched.
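One possible way to encode this pruning is sketched below. The `PriorityWord` structure, in which each standard word carries a list of more specific words that are tried only after it has matched, is an assumption introduced for this example and is not described in the present application; the shortened standard words and English court names are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class PriorityWord:
    word: str
    # More specific standard words that can only match if `word` itself matched.
    refinements: List[str] = field(default_factory=list)


def priority_match(
    words: Sequence[PriorityWord], block: str, default: Optional[str] = None
) -> Optional[str]:
    """Match standard words in priority order, skipping words that cannot apply."""
    text = block.lower()
    for entry in words:
        if entry.word in text:
            for refined in entry.refinements:
                if refined in text:
                    return refined   # e.g. "high" matched and "highest" also matches
            return entry.word        # lower-priority words are never tried
        # entry.word failed, so its refinements cannot match and are skipped
    return default


# "high" has top priority; "highest" is checked only when "high" has matched.
COURT_LEVELS = [
    PriorityWord("high", refinements=["highest"]),
    PriorityWord("intermediate"),
]

level = priority_match(COURT_LEVELS, "Highest People's Court", default="basic")
```

With this arrangement, a header naming an intermediate-level court costs two comparisons ("high", then "intermediate") and "highest" is never tested, which mirrors the reduction in matching operations described above.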

In some embodiments, because the target document elements corresponding to some element nodes are contained in the extraction results of one or more other element nodes, these target document elements can be extracted directly from those extraction results. For example, the "trial court level" is contained in the "trial court name".

Based on this, in some embodiments, matching at least one preset standard word with the target block according to the matching rule specifically includes: matching at least one preset standard word with the extraction result of a designated element node according to the matching rule. For the specific matching process, refer to the foregoing embodiments; details are not repeated here.

It can be seen from the above embodiments that the present application provides an information extraction method for a referee document. The method first obtains at least one target block containing target document elements from the referee document, where each target block corresponds to a content theme; then selects an element tree corresponding to the target block according to the content theme corresponding to the target block, where the element tree comprises at least one element node and an extraction rule corresponding to the element node; and finally extracts at least one target document element from the target block by using the element tree. The method can automatically extract the basic document elements from the referee document, thereby enabling a comprehensive understanding of the referee document.

Corresponding to the method for extracting information from a referee document provided by the embodiments of the present application, an embodiment of the present application further provides an apparatus for extracting information from a referee document. FIG. 3 is an exemplary block diagram of the apparatus. As shown in FIG. 3, the apparatus may include:

An obtaining module 310, configured to obtain at least one target block including a target document element from a referee document, where each target block corresponds to a content theme.

A selecting module 320, configured to select, according to the content theme corresponding to the target block, an element tree corresponding to the target block, where the element tree includes at least one element node and an extraction rule corresponding to the element node.

An extracting module 330, configured to extract at least one target document element from the target block by using the element tree, where each target document element corresponds to one element node.
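How the selecting and extracting modules could cooperate is sketched below. The sketch assumes that an element tree can be represented as a mapping from element node names to rule functions; the names `ELEMENT_TREES`, `between`, `first_match`, and `extract_elements`, as well as the sample block, are hypothetical and not taken from the present application.

```python
import re
from typing import Callable, Dict, Optional

# An extraction rule takes a target block and returns the element text, if any.
Rule = Callable[[str], Optional[str]]


def between(pre: str, post: str = "\n") -> Rule:
    """Build a positioning rule from pre- and post-positioning strings."""
    def rule(block: str) -> Optional[str]:
        at = block.find(pre)
        if at == -1:
            return None
        start = at + len(pre)
        end = block.find(post, start)
        return block[start:end if end != -1 else len(block)].strip()
    return rule


def first_match(pattern: str) -> Rule:
    """Build a rule that returns the first match of a regular expression."""
    compiled = re.compile(pattern)
    def rule(block: str) -> Optional[str]:
        found = compiled.search(block)
        return found.group() if found else None
    return rule


# One element tree per content theme: element node name -> extraction rule.
ELEMENT_TREES: Dict[str, Dict[str, Rule]] = {
    "tail information": {
        "presiding judge": between("Presiding judge"),
        "referee date": first_match(r"[A-Z][a-z]+ \d{1,2}, \d{4}"),
    },
}


def extract_elements(theme: str, block: str) -> Dict[str, Optional[str]]:
    tree = ELEMENT_TREES[theme]                                 # selecting module
    return {node: rule(block) for node, rule in tree.items()}  # extracting module


elements = extract_elements(
    "tail information",
    "Presiding judge Gao XX\nJuly 21, 2014\nActing clerk Lu XX\n",
)
```

Keeping the rules as plain callables means positioning rules, time extraction rules, and normalized element matching rules can sit side by side in the same tree, with each target document element keyed by its element node.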

In some embodiments, the extraction rules corresponding to the element nodes include a pre-positioning rule and a post-positioning rule; the extracting module 330 is specifically configured to determine the start position of the target document element by using the pre-positioning rule; determine the end position of the target document element by using the post-positioning rule; and extract the text information between the start position and the end position to obtain the target document element.

In some embodiments, the extracting module 330 is specifically configured to identify the pre-positioning information of the target document element by using the pre-positioning rule; determine the start position of the target document element according to the pre-positioning information; identify the post-positioning information of the target document element by using the post-positioning rule; and determine the end position of the target document element according to the post-positioning information.

In some embodiments, the extraction rule corresponding to the element node comprises at least one time extraction expression; the extracting module 330 is specifically configured to extract at least one time element from the target block by using the time extraction expression.

In some embodiments, the extraction rules corresponding to the element nodes comprise post-processing rules and at least one time extraction expression; the extracting module 330 is specifically configured to extract one or more time elements from the target block by using the time extraction expression; for each time element, acquiring a time feature word which is positioned behind the time element and is closest to the time element in the target block; and determining the category of the time element according to the time feature words so as to select the time element corresponding to the current element node from the more than one time elements according to the category of the time element.

In some embodiments, the extraction rule corresponding to the element node comprises a matching rule of a normalized element; the extracting module 330 is specifically configured to match at least one preset standard word with the target block according to the matching rule, and to extract the normalized element from the target block according to the matching result.

In some embodiments, the extracting module 330 is specifically configured to match at least one preset standard word with the extraction result of a designated element node according to the matching rule.

In some embodiments, the obtaining module 310 is specifically configured to obtain the document type of the referee document; select a directory tree corresponding to the referee document according to the document type, where the directory tree comprises at least one directory node corresponding to a content theme, and each directory node corresponds to at least one extraction expression; perform block cutting on the referee document according to the directory tree to obtain at least one text block corresponding to a directory node; and select at least one target block containing the target document element according to the directory node corresponding to each text block.
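The block cutting performed by the obtaining module can be sketched as follows. The sketch assumes that each directory node's extraction expression is a regular expression marking where the corresponding text block begins, which is only one possible reading; the directory node names, the patterns, and the sample document are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

# Directory tree per document type: (content theme, expression marking the block start).
DIRECTORY_TREES: Dict[str, List[Tuple[str, str]]] = {
    "criminal judgment": [
        ("head information", r"\A"),
        ("trial process", r"^The People's Procuratorate"),
        ("judgment result", r"^The judgment is as follows"),
        ("tail information", r"^Presiding judge"),
    ],
}


def cut_blocks(document: str, doc_type: str) -> Dict[str, str]:
    """Cut the document into text blocks, one per directory node that is found."""
    hits = []
    for theme, pattern in DIRECTORY_TREES[doc_type]:
        found = re.search(pattern, document, re.MULTILINE)
        if found:
            hits.append((found.start(), theme))
    hits.sort()  # order the blocks by where they start in the document
    blocks = {}
    for i, (start, theme) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(document)
        blocks[theme] = document[start:end].strip()
    return blocks


def select_target_blocks(blocks: Dict[str, str], target_themes: Set[str]) -> Dict[str, str]:
    """Keep only the text blocks whose directory node is a target theme."""
    return {theme: text for theme, text in blocks.items() if theme in target_themes}


document = (
    "XX District People's Court\n"
    "Criminal Judgment\n"
    "The People's Procuratorate of XX District charged the defendant Zhang XX with theft.\n"
    "The judgment is as follows: (omitted)\n"
    "Presiding judge Gao XX\n"
    "July 21, 2014\n"
)

blocks = cut_blocks(document, "criminal judgment")
targets = select_target_blocks(blocks, {"trial process", "tail information"})
```

Each resulting target block keeps its content theme as the key, so the matching element tree can be looked up directly in the next step.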

In a specific implementation, the present invention further provides a computer storage medium. The computer storage medium may store a program which, when executed, may perform some or all of the steps in the embodiments of the information extraction method provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

For identical or similar parts of the various embodiments in this specification, the embodiments may be referred to one another. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.

The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims

1. An information extraction method for a referee document, the method comprising:

2. The method according to claim 1, wherein the extraction rules corresponding to the element nodes comprise a pre-positioning rule and a post-positioning rule;

the extracting of the target document element from the target block by using the element tree comprises:

determining the start position of the target document element in the target block by using the pre-positioning rule;

determining the end position of the target document element in the target block by using the post-positioning rule;

extracting text information between the start position and the end position to obtain the target document element.

3. The method according to claim 2, wherein said determining the start position of the target document element in the target block by using the pre-positioning rule comprises:

identifying pre-positioning information of the target document element by using the pre-positioning rule;

and determining the start position of the target document element in the target block according to the pre-positioning information.

4. The method of claim 2, wherein said determining the end position of the target document element in the target block using a post-positioning rule comprises:

utilizing the post-positioning rule to identify post-positioning information of the target document element;

and determining the end position of the target document element in the target block according to the post-positioning information.

5. The method according to claim 1, wherein the extraction rule corresponding to the element node comprises at least one time extraction expression;

extracting at least one time element from the target block using the time extraction expression.

6. The method according to claim 1, wherein the extraction rules corresponding to the element nodes comprise a post-processing rule and at least one time extraction expression;

extracting more than one time element from the target block by using the time extraction expression corresponding to the element node;

for each time element, acquiring a time feature word which is positioned behind the time element and is closest to the time element in the target block by utilizing the post-processing rule;

and determining the category of the time element according to the time feature words, so as to select the time element corresponding to the element node from the more than one time elements according to the category of the time element.

7. The method according to claim 1, wherein the extraction rule corresponding to the element node comprises a matching rule of a normalized element;

matching at least one preset standard word with the target block according to the matching rule;

and extracting the normalized elements from the target block according to the matching result.

8. The method according to claim 7, wherein said matching at least one preset standard word with the target block according to the matching rule comprises:

and matching at least one preset standard word with the extraction result of the designated element node according to the matching rule.

9. The method according to any one of claims 1 to 8, wherein said obtaining at least one target block containing target document elements from the referee document comprises:

acquiring the document type of the referee document;

selecting a directory tree corresponding to the referee document according to the document type, wherein the directory tree comprises at least one directory node corresponding to the content subject, and each directory node corresponds to at least one extraction expression;

performing block cutting processing on the referee document according to the directory tree to obtain at least one text block corresponding to the directory node;

and selecting at least one target block containing the target document element according to the directory node corresponding to the text block.

10. An information extraction apparatus of a referee document, the apparatus comprising: