CN109992761A - A rule-based adaptive text information extraction method and software memory - Google Patents

A rule-based adaptive text information extraction method and software memory

Info

Publication number
CN109992761A
Authority
CN
China
Prior art keywords
text
template
rule
keyword
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910223558.9A
Other languages
Chinese (zh)
Inventor
李晓林
李道庆
张彦铎
田英明
刘玮
姚峰
范佳莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI HUACHUAN ENVIRONMENTAL PROTECTION TECHNOLOGY Co Ltd
Wuhan Institute of Technology
Original Assignee
SHANGHAI HUACHUAN ENVIRONMENTAL PROTECTION TECHNOLOGY Co Ltd
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI HUACHUAN ENVIRONMENTAL PROTECTION TECHNOLOGY Co Ltd and Wuhan Institute of Technology
Priority to CN201910223558.9A
Publication of CN109992761A

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a rule-based adaptive text information extraction method and a software memory. The method includes the following steps: constructing text information extraction rules for text objects in a professional domain and consolidating the rules into a template; arranging the template rules in a tree-structured order to form a text template with a four-level structure of paragraph, line, sentence, and word; statistically analyzing the text objects to be extracted and presetting representative keywords, which consist of related words and unrelated words; extracting information from the text to be extracted using the constructed template, performing text matching by keyword in the order of the template's four levels; when multiple matching results exist, filtering at each level of the template with the keywords to locate the target information precisely; and outputting the text extraction results that contain the keywords. The invention adapts automatically to changes in text content and structure and extracts target text information efficiently and accurately.

Description

A rule-based adaptive text information extraction method and software memory
Technical field
The present invention relates to the technical field of information processing, and in particular to a rule-based adaptive text information extraction method and a software memory.
Background
At present, the texts of many professional domains contain a large amount of valuable information, for example court hearing records that document court proceedings in detail, ruling records, and mediation records. However, manually combing through legal documents and extracting the content of interest, especially when processing massive numbers of documents, consumes a great deal of manpower and material resources and is inefficient.
Current text extraction techniques mainly target text with a fixed structure, the extraction of text keywords, topic discovery, or adaptive information extraction from short text. These methods are not suitable for long text objects such as court hearing records, whose form is not fixed and from which a relatively large amount of information, such as paragraphs and sentences, must be extracted.
Summary of the invention
The technical problem to be solved by the present invention is to provide, in view of the defects in the prior art, a rule-based adaptive text information extraction method and a software memory.
The technical solution adopted by the present invention to solve this technical problem is as follows.
The present invention provides a rule-based adaptive text information extraction method, which includes the following steps:
Statistically comparing, analyzing, and summarizing text objects in a professional domain to construct rules for text information extraction;
Arranging the rules in a tree-structured order to form an adaptive text template, where templates fall into multiple categories according to the professional domain, templates of different categories correspond to text objects of different categories, and each template has a four-level structure of paragraph, line, sentence, and word;
Statistically analyzing the text objects to be extracted and presetting representative keywords, where the keywords consist of related words and unrelated words;
Extracting information from the text to be extracted using the constructed template, performing text matching by keyword in the order of the template's four levels;
When multiple matching results exist, filtering at each level of the template with the keywords to locate the target information precisely;
Outputting the text extraction results that contain the keywords.
Further, in the method of the present invention, when keywords are chosen they are matched against the paragraph level of the text object and the information corresponding to the text paragraph is then extracted, according to the following rules:
Choosing a keyword and matching it against the paragraph level of the text to obtain the text paragraph content corresponding to the keyword;
When two or more text paragraph contents are obtained, applying keyword filtering to them to obtain accurate text paragraph position information.
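The two paragraph-level rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: paragraphs are taken to be separated by blank lines, and the function name locate_paragraph, its parameters, and the sample text are hypothetical rather than taken from the patent.

```python
import re


def locate_paragraph(text, keyword, related=(), unrelated=()):
    """Return (position, paragraph) pairs for paragraphs that contain `keyword`.

    If two or more paragraphs match, they are narrowed down with the
    related / unrelated filter words (an assumed reading of "keyword filtering").
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    hits = [(i, p) for i, p in enumerate(paragraphs) if keyword in p]
    if len(hits) >= 2:
        filtered = [
            (i, p)
            for i, p in hits
            if (not related or any(w in p for w in related))
            and not any(w in p for w in unrelated)
        ]
        # Fall back to the unfiltered hits if the filter removes everything.
        if filtered:
            hits = filtered
    return hits


SAMPLE = (
    "Plaintiff: Zhang San\n\n"
    "Plaintiff claims damages from the defendant.\n\n"
    "Defendant: Li Si"
)
print(locate_paragraph(SAMPLE, "Plaintiff", unrelated=["claims"]))
```

The returned position is the paragraph index, which stands in for the "position information" mentioned in the rule.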
Further, in the method of the present invention, when text matching is performed, the text information that matches the paragraph, line, sentence, and word corresponding to the text paragraph is selected using the preset related words and unrelated words, according to the following rules:
Obtaining multiple related words and unrelated words for the paragraph, line, sentence, and word levels corresponding to the text paragraph;
Constructing a multi-level rule template from the obtained paragraph, line, sentence, and word levels;
Matching and comparing the text object against the template, and extracting the parts of the text object that contain related words but do not contain unrelated words;
Taking the selected text content and its position information as the points of interest of the corresponding text object in the rule template.
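A minimal sketch of the related-word and unrelated-word filter described above, applied at the sentence level. The sentence splitter, the function names, and the sample words are assumptions made for illustration; the patent does not specify how sentences are delimited.

```python
import re


def split_sentences(paragraph):
    """Split on common Western and Chinese sentence-ending punctuation (an assumption)."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", paragraph)
    return [p.strip() for p in parts if p.strip()]


def points_of_interest(paragraph, related, unrelated):
    """Keep sentences that contain a related word and no unrelated word.

    Each result carries its position so that content and location stay together.
    """
    result = []
    for pos, sentence in enumerate(split_sentences(paragraph)):
        has_related = any(w in sentence for w in related)
        has_unrelated = any(w in sentence for w in unrelated)
        if has_related and not has_unrelated:
            result.append((pos, sentence))
    return result


PARAGRAPH = (
    "The defendant agreed to pay. "
    "The defendant's lawyer objected. "
    "The court adjourned."
)
print(points_of_interest(PARAGRAPH, related=["defendant"], unrelated=["lawyer"]))
```

The same predicate could be applied to paragraphs, lines, or words; only the unit being tested changes.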
Further, in the method of the present invention, the text is divided according to the text object into four levels, namely paragraph, line, sentence, and word, according to the following rules:
Determining from the text objects the keywords corresponding to the different text types, including related words and unrelated words;
Constructing the template from the collected keywords according to the four levels of paragraph, line, sentence, and word.
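To make the four levels concrete, the sketch below splits a text into nested lists of paragraphs, lines, sentences, and words. The delimiters are assumptions: blank lines separate paragraphs, and words are approximated with the regular expression \w+, which suits whitespace-delimited text. Chinese text would need a word segmenter, which is not shown here.

```python
import re


def split_levels(text):
    """Return nested lists: paragraphs -> lines -> sentences -> words."""
    tree = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        lines = []
        for line in paragraph.splitlines():
            sentences = []
            for sentence in re.findall(r"[^.!?。！？]+[.!?。！？]?", line):
                words = re.findall(r"\w+", sentence)
                if words:
                    sentences.append(words)
            if sentences:
                lines.append(sentences)
        if lines:
            tree.append(lines)
    return tree


TEXT = (
    "Court: District Court\n"
    "Date: March 2019\n"
    "\n"
    "The parties agreed. The case closed."
)
tree = split_levels(TEXT)
print(len(tree), "paragraphs;", len(tree[0]), "lines in the first paragraph")
```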
Further, in the method of the present invention, text information matching also includes a template selection process, which proceeds as follows:
Screening out, from a preset template library and according to the type of the text object, two or more templates that match that type;
Selecting, from those two or more templates and according to the paragraph topic corresponding to each text paragraph, the template with the highest degree of match to the paragraph topics, and using it as the rule template for summarizing each piece of text information.
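Template selection as described above might look like the following sketch. The patent does not define how the matching degree is computed, so the score used here (the fraction of a template's expected topics that appear in the document) is an assumption, as are the field names "type" and "topics" and the sample library.

```python
def match_degree(template, paragraph_topics):
    """Assumed score: share of the template's expected topics found in the document."""
    expected = set(template["topics"])
    if not expected:
        return 0.0
    return len(expected & set(paragraph_topics)) / len(expected)


def select_template(library, text_type, paragraph_topics):
    """Filter the library by text type, then pick the best-matching template."""
    candidates = [t for t in library if t["type"] == text_type]
    if not candidates:
        return None
    return max(candidates, key=lambda t: match_degree(t, paragraph_topics))


# Hypothetical library; type 3 stands for mediation records, type 1 for hearing records.
LIBRARY = [
    {"name": "mediation_a", "type": 3, "topics": ["parties", "dispute", "agreement"]},
    {"name": "mediation_b", "type": 3, "topics": ["parties", "evidence", "verdict"]},
    {"name": "hearing_a", "type": 1, "topics": ["parties", "evidence"]},
]

chosen = select_template(LIBRARY, 3, ["parties", "dispute", "agreement", "signatures"])
print(chosen["name"])
```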
The present invention also provides a software memory storing the rule-based adaptive text information extraction method. The software in the software memory executes the following procedure:
Statistically comparing, analyzing, and summarizing text objects in a professional domain to construct rules for text information extraction;
Arranging the rules in a tree-structured order to form an adaptive text template, where templates fall into multiple categories according to the professional domain, templates of different categories correspond to text objects of different categories, and each template has a four-level structure of paragraph, line, sentence, and word;
Statistically analyzing the text objects to be extracted and presetting representative keywords, where the keywords consist of related words and unrelated words;
Extracting information from the text to be extracted using the constructed template, performing text matching by keyword in the order of the template's four levels;
When multiple matching results exist, filtering at each level of the template with the keywords to locate the target information precisely;
Outputting the text extraction results that contain the keywords.
The beneficial effects of the present invention are as follows. The rule-based adaptive text information extraction method and software memory of the present invention construct a rule-based computation model, which is divided according to the extraction requirements into four levels: paragraph, line, sentence, and word. Corresponding matching rules defined on these four levels let the model adapt automatically to changes in text content and structure and extract target text information efficiently and accurately. The method has the following advantages: 1. its coverage is comprehensive and accurate; 2. once a complete template has been constructed, a change of domain requires changing only template content such as feature words, not the template structure, which greatly eases the work of ordinary technical personnel; 3. for big data processing, it can meet users' needs for extracting information from text in a large number of domains.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:
Fig. 1 is a schematic diagram of the overall flow of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the template structure of an embodiment of the present invention;
Fig. 3 shows the adaptive text information model of an embodiment of the present invention;
Fig. 4 is a screenshot of part of the information in an embodiment of the present invention;
Fig. 5 is a screenshot of part of the template information in an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The rule-based adaptive text information extraction method of the embodiment of the present invention includes the following steps:
Statistically comparing, analyzing, and summarizing text objects in a professional domain to construct rules for text information extraction;
Arranging the rules in a tree-structured order to form an adaptive text template, where templates fall into multiple categories according to the professional domain, templates of different categories correspond to text objects of different categories, and each template has a four-level structure of paragraph, line, sentence, and word;
Statistically analyzing the text objects to be extracted and presetting representative keywords, where the keywords consist of related words and unrelated words;
Extracting information from the text to be extracted using the constructed template, performing text matching by keyword in the order of the template's four levels;
When multiple matching results exist, filtering at each level of the template with the keywords to locate the target information precisely;
Outputting the text extraction results that contain the keywords.
In one specific embodiment of the present invention, the emphasis of the rule-based adaptive text information extraction method is on inserting the corresponding rules into each of the four constructed levels.
Rule one: An attribute is set for the category of the whole document, for example 1, 2, and 3 corresponding respectively to court hearing records, ruling records, and mediation records. Different templates are selected according to the category;
Rule two: The levels form a progressive structure, with parallel structures between levels; the template as a whole is a tree structure;
Rule three: The paragraph, line, and sentence levels each contain the seven attributes "id", "Pos", "name", "title", "mode", "matches", and "words", where "Pos" is the position number in the text, "name" is the function name, "title" is the short name of the extracted content, "mode" is the extraction mode, "matches" is the regular expression used for extraction, and "words" holds the feature words and anti-feature words;
Rule four: The three attributes "Pos", "name", and "title" together identify the function of each level and record the position of the extracted content;
Rule five: "mode" determines the matching mode of "matches" (global matching or first match only); "matches" applies a regular expression to the text content to extract the main information of interest;
Rule six: "words" records a certain number of feature words and keywords, which are used to screen and filter the results of "matches" matching so that more precise text information is extracted;
Rule seven: The extraction results are arranged and output in the order the user requires.
The above rules can be divided into three parts according to function. Rules one and two are the base rules, rules three to six are the extraction rules, and rule seven is the tactical rule.
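The extraction rules (rules three to six) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the mode values "global" and "first", the layout of "words" as separate feature and anti-feature lists, the function name apply_node, and the sample pattern and text do not come from the patent. Only the seven attribute names do.

```python
import re

# One hypothetical template node carrying the seven attributes from rule three.
NODE = {
    "id": 1,
    "Pos": 3,
    "name": "get_defendant",
    "title": "defendant",
    "mode": "global",                 # assumed values: "global" or "first"
    "matches": r"Defendant:\s*.+",    # regular expression used for extraction
    "words": {"feature": ["Defendant"], "anti": ["unknown"]},
}


def apply_node(node, text):
    """Run one node: match per "mode" (rule five), then filter by "words" (rule six)."""
    pattern = re.compile(node["matches"])
    if node["mode"] == "global":
        found = [m.group(0) for m in pattern.finditer(text)]
    else:
        first = pattern.search(text)
        found = [first.group(0)] if first else []

    feature = node["words"]["feature"]
    anti = node["words"]["anti"]
    kept = [
        f for f in found
        if (not feature or any(w in f for w in feature))
        and not any(w in f for w in anti)
    ]
    # "Pos" and "title" travel with the result so it can be ordered and labelled later.
    return {"Pos": node["Pos"], "title": node["title"], "values": kept}


DOC = "Defendant: Li Si\nDefendant: unknown party\n"
print(apply_node(NODE, DOC))
```

Keeping "Pos" in the result is what allows the output stage (rule seven) to order results without re-reading the text.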
In another specific embodiment of the present invention, information extraction is performed mainly on text objects that are court hearing records. As shown in Fig. 1, the rule-based adaptive text information extraction method includes the following steps:
Step1: Obtain a certain number of representative document objects and determine the structural model features of the document objects, improving the ability to adapt to the documents;
Step2: Formulate text information extraction rules one to six, setting the rule content flexibly according to the features of the text objects and the requirements. Convert the rules to .json format, organize them in the order of the text, and build the matching template;
Step3: Run the program, make appropriate adjustments to rules three to six in the template, and match and output the corresponding information. Taking a court mediation record as an example, the extracted information includes the court, the trial time, the plaintiff, the defendant, the points of dispute, the judgment result, and the trial personnel;
Step4: Analyze content such as the time and the trial type in the extraction results, convert the analysis into a program and integrate it into the output interface, and, in combination with the "Pos" attribute of the extracted text information, output according to rule seven and the user's requirements.
As shown in Fig. 2, the first row represents the article-level attributes; the second row represents the attributes at the paragraph level; the third row represents the attributes at the line level; and the fourth row represents the attribute structure for keyword extraction. The rule template is constructed as a tree structure.
As shown in Fig. 3, the left half is the template built from the rules, and the right half is the target text divided by structure. The two-way arrow in the middle indicates that the rules are formulated according to the text and the template is constructed from them, and that text information is then extracted adaptively according to the template.
Specific steps:
Step 1: Determine the target information according to the base rules. Based on the legal provisions governing how mediation records are written, determine the written format of court mediation records and, in combination with example mediation records, construct the main body of the rule template (as shown in Fig. 2);
Step 2: Obtain the matching results according to the extraction rules. Formulate rules by combining regular expressions, related words, and unrelated words, for example defendant's ^ (:)? | defendant's s+ (:)? | two case defendants, and construct a rule template document stored in .json format (as shown in Fig. 5);
Step 3: Write the reading and matching programs, and extract the text information by means of the rule template;
Step 4: Output the target information according to the tactical rule. The extraction results are arranged in text order and output (as shown in Fig. 4).
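The four steps above can be sketched end to end as follows. The JSON layout, the rule contents, the mode values, and the sample record are all hypothetical; the patent states only that the template is stored in .json format and that results are output in text order using the "Pos" attribute.

```python
import json
import re

# Hypothetical rule template as it might be stored in a .json file.
TEMPLATE_JSON = r"""
{
  "type": 3,
  "rules": [
    {"Pos": 2, "title": "defendant", "mode": "first", "matches": "Defendant:\\s*(.+)"},
    {"Pos": 1, "title": "plaintiff", "mode": "first", "matches": "Plaintiff:\\s*(.+)"},
    {"Pos": 0, "title": "court",     "mode": "first", "matches": "^(.+ Court)$"}
  ]
}
"""


def extract(template, text):
    """Apply every rule in the template and return rows ordered by "Pos"."""
    rows = []
    for rule in template["rules"]:
        pattern = re.compile(rule["matches"], re.MULTILINE)
        if rule["mode"] == "global":
            values = [m.group(1) for m in pattern.finditer(text)]
        else:
            first = pattern.search(text)
            values = [first.group(1)] if first else []
        rows.append((rule["Pos"], rule["title"], values))
    # Output stage: arrange results in text order using the position number.
    return sorted(rows, key=lambda row: row[0])


RECORD = "Example District Court\nPlaintiff: Zhang San\nDefendant: Li Si\n"
template = json.loads(TEMPLATE_JSON)
for pos, title, values in extract(template, RECORD):
    print(pos, title, values)
```

Because only the template changes between domains, the same extract function could in principle be reused with a different .json file, which is the adaptability the description emphasizes.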
It should be understood that those of ordinary skill in the art can make improvements or changes based on the above description, and all such improvements and changes shall fall within the protection scope of the appended claims of the present invention.

Claims (6)

1. A rule-based adaptive text information extraction method, characterized in that the method includes the following steps:
Statistically comparing, analyzing, and summarizing text objects in a professional domain to construct rules for text information extraction;
Arranging the rules in a tree-structured order to form an adaptive text template, where templates fall into multiple categories according to the professional domain, templates of different categories correspond to text objects of different categories, and each template has a four-level structure of paragraph, line, sentence, and word;
Statistically analyzing the text objects to be extracted and presetting representative keywords, where the keywords consist of related words and unrelated words;
Extracting information from the text to be extracted using the constructed template, performing text matching by keyword in the order of the template's four levels;
When multiple matching results exist, filtering at each level of the template with the keywords to locate the target information precisely;
Outputting the text extraction results that contain the keywords.
2. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, when keywords are chosen they are matched against the paragraph level of the text object and the information corresponding to the text paragraph is then extracted, according to the following rules:
Choosing a keyword and matching it against the paragraph level of the text to obtain the text paragraph content corresponding to the keyword;
When two or more text paragraph contents are obtained, applying keyword filtering to them to obtain accurate text paragraph position information.
3. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, when text matching is performed, the text information that matches the paragraph, line, sentence, and word corresponding to the text paragraph is selected using the preset related words and unrelated words, according to the following rules:
Obtaining multiple related words and unrelated words for the paragraph, line, sentence, and word levels corresponding to the text paragraph;
Constructing a multi-level rule template from the obtained paragraph, line, sentence, and word levels;
Matching and comparing the text object against the template, and extracting the parts of the text object that contain related words but do not contain unrelated words;
Taking the selected text content and its position information as the points of interest of the corresponding text object in the rule template.
4. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, the text is divided according to the text object into four levels, namely paragraph, line, sentence, and word, according to the following rules:
Determining from the text objects the keywords corresponding to the different text types, including related words and unrelated words;
Constructing the template from the collected keywords according to the four levels of paragraph, line, sentence, and word.
5. The rule-based adaptive text information extraction method according to claim 1, characterized in that, in the method, text information matching also includes a template selection process, which proceeds as follows:
Screening out, from a preset template library and according to the type of the text object, two or more templates that match that type;
Selecting, from those two or more templates and according to the paragraph topic corresponding to each text paragraph, the template with the highest degree of match to the paragraph topics, and using it as the rule template for summarizing each piece of text information.
6. A software memory storing the rule-based adaptive text information extraction method, characterized in that the software in the software memory executes the following procedure:
Statistically comparing, analyzing, and summarizing text objects in a professional domain to construct rules for text information extraction;
Arranging the rules in a tree-structured order to form an adaptive text template, where templates fall into multiple categories according to the professional domain, templates of different categories correspond to text objects of different categories, and each template has a four-level structure of paragraph, line, sentence, and word;
Statistically analyzing the text objects to be extracted and presetting representative keywords, where the keywords consist of related words and unrelated words;
Extracting information from the text to be extracted using the constructed template, performing text matching by keyword in the order of the template's four levels;
When multiple matching results exist, filtering at each level of the template with the keywords to locate the target information precisely;
Outputting the text extraction results that contain the keywords.
CN201910223558.9A 2019-03-22 2019-03-22 A rule-based adaptive text information extraction method and software memory Pending CN109992761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910223558.9A CN109992761A (en) 2019-03-22 2019-03-22 A rule-based adaptive text information extraction method and software memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910223558.9A CN109992761A (en) 2019-03-22 2019-03-22 A rule-based adaptive text information extraction method and software memory

Publications (1)

Publication Number Publication Date
CN109992761A true CN109992761A (en) 2019-07-09

Family

ID=67130813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910223558.9A Pending CN109992761A (en) 2019-03-22 2019-03-22 A rule-based adaptive text information extraction method and software memory

Country Status (1)

Country Link
CN (1) CN109992761A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597959A (en) * 2019-09-17 2019-12-20 北京百度网讯科技有限公司 Text information extraction method and device and electronic equipment
CN111460083A (en) * 2020-03-31 2020-07-28 北京百度网讯科技有限公司 Document title tree construction method and device, electronic equipment and storage medium
CN113704805A (en) * 2021-10-27 2021-11-26 华控清交信息科技(北京)有限公司 Wind control rule matching method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140074889A1 (en) * 2012-09-07 2014-03-13 Splunk Inc. Generation of a data model for searching machine data
CN107729481A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 The Text Information Extraction result screening technique and device of a kind of custom rule
CN108197163A (en) * 2017-12-14 2018-06-22 上海银江智慧智能化技术有限公司 A kind of structuring processing method based on judgement document
CN108536678A (en) * 2018-04-12 2018-09-14 腾讯科技(深圳)有限公司 Text key message extracting method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
晏文坛: "半结构化中文简历的信息抽取", 《信息科技》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190709