CN110162765A - Machine-assisted reading and review method and system based on summarization - Google Patents

Machine-assisted reading and review method and system based on summarization

Info

Publication number
CN110162765A
Authority
CN
China
Prior art keywords
text
abstract
content
module
parsing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810142416.5A
Other languages
Chinese (zh)
Inventor
韩中华
姜伟
徐福海
吴雪军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingfu Data Technology (beijing) Co Ltd
Original Assignee
Dingfu Data Technology (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Data Technology (beijing) Co Ltd
Priority to CN201810142416.5A
Publication of CN110162765A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

The invention discloses a machine-assisted reading and review method and system based on summarization. The process is as follows: input a text and parse its data content and format; classify the parsed text content, aggregate content of the same category and mark it with a category label to form functional blocks with category labels; extract the corresponding summary content from each functional block; output the summary content and, combined with the reviewer's opinion, form the review result. A machine model extracts a summary of the original text in advance, together with the source text that supports the summary, which helps the user complete review work quickly by reading the summary. Even if the automatic summary is unclear or wrongly extracted, it can be corrected quickly against the corresponding source text, which substantially reduces manual review cost and improves review efficiency.

Description

Machine-assisted reading and review method and system based on summarization
Technical field
The present invention relates to the field of document processing, and in particular to a machine-assisted reading and review method and system based on summarization.
Background
Many industries need to read and review large volumes of documents. Traditional document review is performed mainly by people: the documents to be reviewed are exported from a business information system, and industry experts then examine them subjectively. With massive data the amount of reading is huge, and the reviewer must understand the document content, make judgments, and reach decisions. Because most of the document content is unstructured or semi-structured data, and because authors differ in skill and in the way they think, reviewers have to read and check all of the content, even though the content that actually needs attention is only a small part of it. Time and labor are wasted and efficiency is low.
With the rapid development of information technology in recent years, information is acquired and delivered ever faster, which further increases the complexity and difficulty of professional review. Traditional text review methods alone can no longer keep pace with this development or meet the actual needs of enterprises. At present the review industry has no mature solution for review.
In view of these problems, there is a need for a machine-assisted reading and review method or system that accurately identifies the important content of a document and gives reviewers concise, accurate, and important content, improving their working efficiency.
Summary of the invention
To overcome the above problems, the inventors conducted intensive research and provide a machine-assisted reading and review method and system based on summarization. The input document is divided into blocks, classified, and assigned category labels; summaries are extracted to obtain the information of interest; and the summaries are finally edited and modified to give the data the user wants, so that a summary of the review report is output. The present invention was completed on this basis.
The object of the present invention is to provide the following technical solutions:
(1) A machine-assisted reading and review method based on summarization, the method comprising the following steps:
Step 100, input a text and parse its data content and format;
Step 200, classify the parsed text content, aggregate content of the same category and mark it with a category label, forming functional blocks with category labels;
Step 300, extract the corresponding summary content from each functional block;
Step 400, output the summary content and, combined with the reviewer's opinion, form the review result.
(2) A system for implementing the method of (1), the system comprising:
An input and parsing module, for inputting a text and parsing its data content and format;
A block classification module, for classifying the parsed text content, aggregating content of the same category and marking it with a category label, to form functional blocks with category labels;
A summary extraction module, for extracting the corresponding summary content from each functional block;
A summary output and editing module, for outputting the summary content and, combined with the reviewer's opinion, forming the review result.
The machine-assisted reading and review method and system based on summarization provided by the present invention have the following beneficial effects:
(1) In the present invention, a machine model extracts a summary of the original text in advance, together with the source text that supports the summary, which helps the user complete review work quickly by reading the summary. Even if the automatic summary is unclear or wrongly extracted, it can be corrected quickly against the corresponding source text, which substantially reduces manual review cost and improves review efficiency;
(2) In the present invention, a Word or PDF document is first converted to XML and then to plain text, so that the original data is not lost and parsing quality is ensured;
(3) In the present invention, the whole text is divided into different functional blocks, which not only lets subsequent operations find the content to be extracted clearly, comprehensively, and quickly, but also makes the type of the extracted data clear;
(4) In the present invention, a corresponding machine model (sentence selection model) is trained for each functional block according to the characteristics of the block, so the extracted summary content is more targeted and more accurate. Moreover, determining the summary content of each functional block jointly with the block-specific machine model and a general-purpose machine model not only further improves extraction accuracy but, more importantly, addresses the cases where a functional block has no corresponding machine model or where a small training sample leaves the machine model insufficiently accurate.
Brief description of the drawings
Fig. 1 shows a flow chart of a machine-assisted reading and review method based on summarization according to a preferred embodiment of the present invention;
Fig. 2 shows a schematic diagram of the Word document input in the example;
Fig. 3 shows a schematic diagram of the input Word document parsed into XML data;
Fig. 4 shows the software interface that outputs the summary;
Fig. 5 shows a schematic diagram of a Word document input during model training or actual classification;
Fig. 6 shows a schematic diagram of the tree-shaped document structure formed after the document structure is parsed;
Fig. 7 shows a schematic diagram of the document after structure information is added before the body text;
Fig. 8 shows the result of word segmentation of the text after structure information has been added;
Fig. 9 shows a schematic diagram of the result of assigning a feature value to each segmented word with a statistical algorithm;
Fig. 10 shows a schematic diagram of the summary software interface after editing mode is started;
Fig. 11 shows, for Embodiment 2, the classification accuracy of the classification model on pre-loan review reports in credit review;
Fig. 12 shows, for Embodiment 2, the NDCG@5 evaluation of the ranking model for each category.
Detailed description of embodiments
The present invention is described in detail below by way of the drawings and embodiments. The features and advantages of the present invention will become clearer from this description.
The word "exemplary" is used here to mean "serving as an example, embodiment, or illustration". Any embodiment described here as "exemplary" should not necessarily be construed as preferred over or better than other embodiments.
The present invention provides a machine-assisted reading and review method based on summarization. The method extracts important information from documents during review work and presents it to the user as a summary, so that the user can complete the review quickly and accurately. Text review is the rapid reading and review of unstructured or semi-structured texts of the same class in a specific industry, ending in a final review conclusion and opinion. Unstructured text is data text that is not easily represented by the two-dimensional logical tables of a database (structured data). Semi-structured text is structured data whose structure varies greatly: because the details of the data must be understood, it cannot simply be organized into a file and processed as unstructured data, and because the structure varies so much, a two-dimensional logical table cannot simply be set up to correspond to it.
In the present invention, the unstructured or semi-structured texts to be processed are batches of texts within the same industry that are highly similar in structure and follow some guiding specification, that is, texts of the "same class". For example, in the loan review industry, a "credit investigation report on financing for a project" or an "examination report on a company's loan application" generally has a fixed, clear structure within the relevant department, and the main content of interest in the industry is similar, which favors batch processing.
As shown in Fig. 1, the machine-assisted reading and review method based on summarization provided by the present invention comprises the following steps:
Step 100, input a text and parse its data content and format;
Step 200, classify the parsed text content, aggregate content of the same category and mark it with a category label, forming functional blocks with category labels;
Step 300, extract the corresponding summary content from each functional block;
Step 400, output the summary content and, combined with the reviewer's opinion, form the review result.
Step 100, text input and parsing: input a text and parse its data content and format.
In the present invention, the text may be in any existing document format. The original text is preferably input as a Word or PDF document, since these two formats are also the main forms in which review reports are presented.
Word and PDF documents present visual information to people, but a machine cannot directly identify the information content they carry, so both formats need to be converted to a machine-processable format, i.e. a plain text format such as txt.
In a preferred embodiment, if the input text is a Word or PDF document, the document is converted directly to plain text format.
In a further preferred embodiment, if the input text is a Word or PDF document, the content of the document is first parsed into XML (Extensible Markup Language) data, and plain text is then obtained by parsing the XML data. For example, a Word document is input (see Fig. 2), the LibreOffice program is called to parse the content of the Word document into XML data (see Fig. 3), and the final text document is then obtained by parsing the XML data according to the OASIS standard "Open Document Format for Office Applications (OpenDocument) Version 1.2". The XML route is preferred because other tools that extract text data with a Word extraction tool can lose part of the original data or formatting. Converting to XML first preserves the integrity of the original data; and because the subsequent XML parsing is not limited by the parsing quality of a third-party tool, the conversion from XML to plain text can be defined according to project requirements, so text and format information is less likely to be lost and parsing quality is ensured.
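The second half of this route, reading headings and paragraphs back out of OpenDocument XML, can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `odf_xml_to_items` and the sample document are hypothetical, the LibreOffice conversion itself is not shown, and only the `text:h` and `text:p` elements of the OpenDocument text namespace are handled.

```python
import xml.etree.ElementTree as ET

# OpenDocument "text" namespace; headings are text:h, paragraphs are text:p.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
HEADING_TAG = f"{{{TEXT_NS}}}h"
PARAGRAPH_TAG = f"{{{TEXT_NS}}}p"
LEVEL_ATTR = f"{{{TEXT_NS}}}outline-level"


def odf_xml_to_items(xml_string):
    """Return (kind, level, text) tuples in document order.

    Hypothetical helper: keeps heading levels so that the document
    structure survives the conversion to plain text. Simplification:
    paragraphs nested inside other paragraphs are not de-duplicated.
    """
    root = ET.fromstring(xml_string)
    items = []
    for elem in root.iter():
        text = "".join(elem.itertext()).strip()
        if not text:
            continue  # skip empty paragraphs and headings
        if elem.tag == HEADING_TAG:
            items.append(("heading", int(elem.get(LEVEL_ATTR, "1")), text))
        elif elem.tag == PARAGRAPH_TAG:
            items.append(("paragraph", 0, text))
    return items


# Illustrative input only; a real content.xml carries many more elements.
SAMPLE = """<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:text>
    <text:h text:outline-level="1">Company background</text:h>
    <text:p>The company's registered capital is 5 million yuan.</text:p>
    <text:p/>
  </office:text></office:body>
</office:document-content>"""

for kind, level, text in odf_xml_to_items(SAMPLE):
    print(kind, level, text)
```

Because the parsing rules live in code the project controls, which elements to keep or drop can be adjusted to project requirements, which is the advantage the text attributes to the XML route.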
In the present invention, the text finally obtained after parsing retains the headings at all levels of the original document, i.e. the document structure information, and its paragraph structure is identical to that of the original document: the text content that makes up each paragraph is the same as in the original. In addition, the parsed text is organized with the clause as the basic unit, i.e. the text is divided into clauses at commas, full stops, question marks, exclamation marks, and semicolons.
Further, the parsing also assigns a sequential number to each clause in the parsed text, and the numbers form a sentence index. Through the sentence index, a summary extracted later can be linked to the original clause in the text, so the source of the summary can be found clearly and efficiently.
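Clause splitting and numbering can be sketched in a few lines. The function names below are hypothetical, and the treatment of ASCII punctuation (split only when followed by whitespace or the end of the text, so that figures such as 4.5 or 500,000 stay intact) is an assumption made for this illustration.

```python
import re

# Full-width delimiters always end a clause. ASCII delimiters end a clause
# only before whitespace or end of text (assumption, protects numbers).
_DELIMITER = re.compile(r"[，。？！；]|[,.?!;](?=\s|$)")


def split_clauses(text):
    """Split text into clauses, keeping the closing punctuation."""
    clauses, start = [], 0
    for match in _DELIMITER.finditer(text):
        clause = text[start:match.end()].strip()
        if clause:
            clauses.append(clause)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        clauses.append(tail)  # trailing text without closing punctuation
    return clauses


def build_sentence_index(text):
    """Number clauses sequentially from 1: {clause number: clause text}."""
    return dict(enumerate(split_clauses(text), start=1))


paragraph = (
    "By the reporting period, the company's registered capital is "
    "5 million yuan, of which Li Si contributed 4.5 million yuan, "
    "accounting for 90%; Zhang San contributed 500,000 yuan, "
    "accounting for 10%."
)
sentence_index = build_sentence_index(paragraph)
for number, clause in sentence_index.items():
    print(number, clause)
```

A summary then only needs to store clause numbers; the clause text and its position in the source are recovered by looking the number up in the index.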
Fig. 4 shows the software interface that outputs the summary. In the summary display, the summary area is on the left and the parsed text area is on the right. The summary result for each category on the left is extracted from the content with the same label in the text on the right, so every sentence in the text that was selected for the summary is covered with colored shading and corresponds one-to-one to the summary content on the left. To make it easier to confirm the source of a summary, the reviewer can click summary content on the left, and the right side jumps directly to the position from which that summary was extracted and highlights it with colored shading. This correspondence is implemented through the sentence index: the document on the right is organized by clause and each clause has a sequential number, so the left side only needs to record the clause numbers of the right-hand text to obtain the clause content and the correspondence.
Step 200, block classification: classify the parsed text content, aggregate content of the same category and mark it with a category label, forming functional blocks with category labels.
To distinguish the types of text data and locate the target content (the content of interest) clearly, the content of the parsed text needs to be classified, and the classified content is aggregated and marked with a category label, so that different text content is presented as functional blocks.
Taking a credit review report as an example, the document can generally be divided into the following ten categories: summary information, enterprise background, business situation, credit status, account analysis, guarantee analysis, mortgage analysis, project situation, risk analysis, and branch opinion, forming ten corresponding functional blocks. After block classification, the type to which each piece of text data belongs is clear, which makes it easier for the reviewer to handle the review report.
In the present invention, aggregation means gathering adjacent content of the same category, so that the content can be presented while keeping the order of the original text; alternatively, it means gathering content of the same category whether adjacent or not, which may change the order of the original text but makes it easier to process content of the same category together.
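The two aggregation options can be sketched as follows, assuming each paragraph has already been given a category label. The function names and sample labels are hypothetical.

```python
from itertools import groupby


def aggregate_adjacent(labelled_paragraphs):
    """Merge runs of neighbouring paragraphs that share a label.

    The original order is kept, so a label can appear in several blocks.
    """
    return [
        (label, [text for _, text in run])
        for label, run in groupby(labelled_paragraphs, key=lambda pair: pair[0])
    ]


def aggregate_global(labelled_paragraphs):
    """Merge all paragraphs that share a label, adjacent or not.

    Blocks are ordered by first appearance; the original order of
    paragraphs across blocks may change.
    """
    blocks = {}
    for label, text in labelled_paragraphs:
        blocks.setdefault(label, []).append(text)
    return list(blocks.items())


paragraphs = [
    ("enterprise background", "p1"),
    ("enterprise background", "p2"),
    ("risk analysis", "p3"),
    ("enterprise background", "p4"),
]
print(aggregate_adjacent(paragraphs))
print(aggregate_global(paragraphs))
```

With the sample input, the first function yields three blocks (the background label appears twice), while the second yields two blocks and moves `p4` ahead of `p3`.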
In a preferred embodiment of the present invention, the text content is classified with the paragraph as the basic unit.
In a preferred embodiment of the present invention, a classification model built with the logistic regression method classifies the text content. Building the classification model comprises a training process and a test process:
Training process: the corpus is labelled with the categories to which it belongs, forming training samples; the features of the training samples are extracted to train the model; the model is trained by calling the interface of the corresponding model, for example the linear_model classifiers of the third-party open-source package sklearn, which the present invention may use;
Test process: labelled or unlabelled corpus is used as test samples; after the features of the test samples are extracted, the model is loaded to obtain classification results; the model is adjusted according to the classification results until a model with high classification accuracy is obtained.
In a preferred embodiment of the present invention, during model training or actual classification, the features that indicate category membership are extracted from the document structure information and the body text information. The document structure information is the headings at all levels of the document; the body text information is the body content of the document excluding those headings.
Because the document structure information guides or summarizes the content that follows it, it is important classification information, and including it in feature extraction improves classification accuracy.
Feature extraction from the document structure information and body text information during model training or actual classification comprises the following steps:
I) To represent the structure information of the document (training sample or test sample) completely, the document structure is parsed and the structure information is formed into a tree-shaped document structure. As shown in Fig. 5, the training or test sample is a Word document; after parsing, the structure information forms the tree-shaped document structure shown in Fig. 6;
The parsing converts the document (training sample or test sample) into XML data and then extracts the body text information and the document structure information from the XML data. After parsing, the document structure information is converted into the tree-shaped document structure.
II) Using the tree-shaped document structure, the headings at all levels are placed before the corresponding body text, giving content in the form heading + body text, as shown in Fig. 7, so that structure information is added;
III) The text is segmented into words, as shown in Fig. 8; a statistical algorithm then assigns a feature value to each segmented word to form features. As shown in Fig. 9, the feature values are computed with TF-IDF; words of certain set classes are generalized, e.g. numeric tokens are generalized to <num> and person names to <person>, and punctuation and the like are removed as stop words;
IV) In model training, the features are input into a logistic regression model (LR model) for training to obtain the classification model; in testing, the features are input into the classification model for classification.
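Steps II) and III) can be sketched with the standard library alone. This is a simplified illustration under several assumptions: the input is already word-segmented, only numeric generalization is shown (recognizing person names for `<person>` would need a name lexicon or tagger), the stop-word set is a placeholder, and the TF-IDF formula is the plain textbook variant, which differs from the smoothed variant some libraries use. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Placeholder stop-word set: punctuation only.
STOP_WORDS = {",", ".", ";", ":", "?", "!"}


def generalize(token):
    """Map numeric tokens (e.g. 500, 4.5, 10%) to the class symbol <num>."""
    return "<num>" if re.fullmatch(r"\d+(\.\d+)?%?", token) else token


def prepare(heading_tokens, body_tokens):
    """Step II: put heading tokens before the body; then drop stop words
    and generalize what remains (part of step III)."""
    merged = list(heading_tokens) + list(body_tokens)
    return [generalize(tok) for tok in merged if tok not in STOP_WORDS]


def tf_idf(documents):
    """Step III: one {token: weight} dict per document.

    weight = (count / document length) * log(N / document frequency)
    """
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    return [
        {
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in Counter(doc).items()
        }
        for doc in documents
    ]


docs = [
    prepare(["company", "background"], ["registered", "capital", "500", ","]),
    prepare(["risk", "analysis"], ["capital", "risk", "10%", "."]),
]
features = tf_idf(docs)
print(features[1])
```

The resulting dictionaries are the per-paragraph features that step IV) would pass to the logistic regression model. Tokens that occur in every document, such as `capital` and `<num>` here, receive weight zero, while tokens that come from the prefixed headings keep a non-zero weight and so help separate the categories.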
Step 300, summary extraction: extract the corresponding summary content from each functional block.
In step 300, a corresponding machine model is trained for each functional block according to the characteristics of the block and is used to extract the corresponding summary content.
Summary extraction uses a sentence selection model (Rank), and the minimum selection unit is the clause (a short segment delimited by a comma, full stop, question mark, exclamation mark, or semicolon). After processing by the sentence selection model, the clauses corresponding to the top n ranked items are taken as the summary result (n can be adjusted as required, and the sentence length is adjustable).
In the present invention the sentence selection model is preferably trained by a boosting regression method (such as sklearn's Gradient Boosting Regressor) that performs regression training on lexical content, or by another ranking model.
Building the sentence selection model comprises a training process and a test process:
Training process: sentences in the corpus are labelled with the category to which they belong (such as "summary", "non-summary", or "important summary"), forming training samples; the features of the training samples are extracted; the model is trained;
Test process: labelled or unlabelled corpus is used as test samples; after the features of the test samples are extracted, the model is loaded to obtain test results, which are then ranked; the model is adjusted according to the accuracy of the ranking to obtain the final sentence selection model.
In a preferred embodiment, when the sentence selection model is built or used, the unit of feature extraction is the clause rather than the paragraph. After word segmentation, features are extracted, i.e. each segmented word is assigned a feature value to form features. In practice it was found that feature extraction can use word-frequency statistics or the TF-IDF algorithm, with word-frequency statistics preferred. The features are input into the ranking model to build the sentence selection model, or are input into the built sentence selection model to extract summaries.
In a further preferred embodiment, sentences in the training and test samples are marked with a start marker and an end marker, and both markers are included in feature extraction. The start marker is a special symbol indicating that a sentence begins; the end marker is a special symbol indicating that a sentence ends. A sentence is a character string that ends with sentence-final punctuation such as a full stop or question mark, and it may contain several clauses.
Marking start and end markers also helps to obtain the document structure information: structure information is neither a sentence nor a clause, so no start marker (such as <s>) or end marker (such as </s>) is placed before or after it. A character string with no start or end marker around it can therefore be identified quickly as structure information, which is then used in feature extraction to obtain summary content that carries structure information.
The feature extraction steps are illustrated below:
Input the clause "The company's registered capital is 5 million yuan," and mark the start marker: "<s> The company's registered capital is 5 million yuan,";
Word segmentation gives: "<s> / the / company / registered / capital / 500 / ten-thousand yuan ," (in the Chinese original, 5 million yuan is written as the number 500 followed by the unit "ten thousand yuan");
Feature extraction gives: {"<s>": 1, "the": 1, "ten-thousand yuan": 1, "<num>": 1, "registered": 1, "capital": 1, "company": 1}.
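The worked example can be reproduced with a small word-frequency feature extractor. This sketch assumes the clause has already been segmented (a Chinese word segmenter would do that in practice), uses English glosses as tokens, and drops punctuation as stop words; the function name and flag names are hypothetical.

```python
import re
from collections import Counter

START, END = "<s>", "</s>"
PUNCTUATION = {",", ".", ";", "?", "!"}  # treated as stop words


def clause_features(tokens, starts_sentence=False, ends_sentence=False):
    """Word-frequency features for one clause.

    The start/end markers are added only when the clause opens or closes
    a sentence, and they are counted like ordinary tokens.
    """
    marked = list(tokens)
    if starts_sentence:
        marked.insert(0, START)
    if ends_sentence:
        marked.append(END)

    counts = Counter()
    for token in marked:
        if token in PUNCTUATION:
            continue
        if re.fullmatch(r"\d+(\.\d+)?", token):
            counts["<num>"] += 1  # generalize numbers
        else:
            counts[token] += 1
    return dict(counts)


tokens = ["the", "company", "registered", "capital", "500", "ten-thousand-yuan", ","]
print(clause_features(tokens, starts_sentence=True))
```

Since headings never receive markers, a feature dictionary containing neither `<s>` nor `</s>` for a whole string is a cheap signal that the string is structure information.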
Because the documents processed by the present invention are of the same class, the functional blocks obtained by block classification are limited in number and fixed, which favors training machine models of high accuracy.
In a preferred embodiment of the present invention, in addition to the machine model trained for each functional block, a general-purpose machine model is trained on the whole text. During training of the general-purpose machine model, features are extracted from the entire document, so the model is suitable for extracting summary content from every functional block.
Preferably, the summary content of each functional block is determined jointly by the sentence selection model for that block and the general-purpose machine model. For example, the sentence selection model and the general-purpose machine model are each given a weight; the result measured by the sentence selection model is multiplied by its weight to give result A, and the result measured by the general-purpose machine model is multiplied by its weight to give result B; result A and result B are then combined by calculation and conversion to give the final test result for a clause, and this final test result is the basis for ranking.
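A minimal sketch of the weighted combination follows. The text does not specify how result A and result B are combined or what the weights are, so the simple sum and the default weights 0.7 and 0.3 below are assumptions, as are the function names.

```python
def combined_score(block_score, general_score, w_block=0.7, w_general=0.3):
    """Combine the two model outputs for one clause.

    result A = block-specific score * its weight
    result B = general-purpose score * its weight
    The final score is taken as A + B (assumed; the text leaves it open).
    """
    result_a = block_score * w_block
    result_b = general_score * w_general
    return result_a + result_b


def rank_clauses(clauses, block_scores, general_scores, **weights):
    """Return clauses ordered from highest to lowest combined score."""
    scored = [
        (combined_score(block, general, **weights), clause)
        for clause, block, general in zip(clauses, block_scores, general_scores)
    ]
    return [clause for _, clause in sorted(scored, key=lambda p: p[0], reverse=True)]


clauses = ["a", "b", "c"]
block_scores = [0.9, 0.2, 0.5]
general_scores = [0.1, 0.8, 0.6]

print(rank_clauses(clauses, block_scores, general_scores))
# A block with no trained model can fall back to the general model alone.
print(rank_clauses(clauses, block_scores, general_scores, w_block=0.0, w_general=1.0))
```

Setting the block weight to zero shows the fallback case: ranking then relies entirely on the general-purpose model.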
The reason for training a general-purpose machine model on the whole text is as follows. Although documents of the "same class" are highly similar in structure, the document structure inevitably varies with the choices of individual authors. As a result, a functional block may have no trained sentence selection model, or the sample available for training its sentence selection model may be small, so that the stability and accuracy of the trained model cannot meet requirements. The general-purpose machine model can measure the importance of a clause to the whole document. In general, if a clause is highly important relative to the whole document, it is very likely to be highly important within its functional block as well. The general-purpose machine model can therefore address both the lack of a corresponding sentence selection model and the insufficient accuracy of a sentence selection model trained on few samples. Even when a mature sentence selection model has been trained, the general-purpose machine model can work with it to adjust the extracted summary content and improve extraction accuracy.
Step 400, forming the conclusion: output the summary content and, combined with the reviewer's opinion, form the review result.
The reviewer can also manually modify the summary content formed in step 300 and add conclusion information, to obtain the data the reviewer wants. Modification here means adding or deleting clauses.
Fig. 10 shows the software interface that outputs the summary: the summary area is on the left and the parsed text area is on the right. Clicking the "Edit summary" label starts editing mode, in which the summary page on the left can be edited. Clicking the "X" label issues a delete instruction and the summary item is deleted. Clicking or drag-selecting the content to be added in the original text on the right issues an add instruction, and the selected content is added to the summary area on the left.
Another aspect of the present invention provides a machine-assisted reading and review system based on summarization for implementing the above method. The system comprises:
An input and parsing module, for inputting a text and parsing its data content and format;
A block classification module, for classifying the parsed text content, aggregating content of the same category and marking it with a category label, to form functional blocks with category labels;
A summary extraction module, for extracting the corresponding summary content from each functional block;
A summary output and editing module, for outputting the summary content and, combined with the reviewer's opinion, forming the review result.
In the present invention, the input and parsing module converts the input text into a plain-text document, such as a txt document. The input and parsing module accepts any existing document format, preferably Word or PDF documents.
If the input text is a Word or PDF document, the input and parsing module first parses the content of the document into XML data and then obtains the plain-text document by parsing the XML data.
Further, the input and parsing module also assigns a sequential number to each clause in the converted text, and the numbers form a sentence index.
In the present invention, the block classification module classifies the text content with the paragraph as the minimum unit. Preferably, the block classification module classifies the text content with a classification model built with the logistic regression method.
In the present invention, the summary extraction module extracts summaries in each functional block with the clause as the minimum selection unit.
In a preferred embodiment, the summary extraction module trains a corresponding machine model for each functional block according to the characteristics of the block and uses it to extract the corresponding summary content.
In a further preferred embodiment, in addition to the machine model trained for each functional block, a general-purpose machine model is trained on the whole text; the summary content of each functional block is determined jointly by the sentence selection model for that block and the general-purpose machine model.
In the present invention, the summary output and editing module comprises a summary output submodule, a summary display submodule, and a summary editing submodule:
A summary output submodule, for receiving instructions from the summary extraction module and, according to the extracted content determined by the summary extraction module, sending the corresponding clause numbers to the summary display submodule;
A summary display submodule, for receiving the clause numbers sent by the summary output submodule and displaying the clause content; and for receiving edit instructions sent by the summary editing submodule and deleting the corresponding clause or displaying the opinion edited by the reviewer;
A summary editing submodule, for receiving an instruction to start editing mode and starting it, and for receiving edit instructions and passing them to the summary display submodule, which edits the displayed content (deleting clauses or adding the reviewer's opinion).
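How the three submodules might cooperate around clause numbers can be sketched with a single small class. This is an illustrative data model only: the class name, method names, and the choice to keep added clauses in source order are all assumptions, and no user interface is shown.

```python
class SummaryView:
    """Holds the clause numbers chosen for a summary plus reviewer opinions."""

    def __init__(self, sentence_index, summary_ids):
        self.sentence_index = sentence_index  # {clause number: clause text}
        self.summary_ids = list(summary_ids)  # numbers sent by the output step
        self.opinions = []                    # free text added by the reviewer
        self.editing = False

    def start_editing(self):
        """Corresponds to receiving the start-editing instruction."""
        self.editing = True

    def _require_editing(self):
        if not self.editing:
            raise RuntimeError("editing mode has not been started")

    def delete_clause(self, clause_id):
        self._require_editing()
        self.summary_ids.remove(clause_id)

    def add_clause(self, clause_id):
        """Add a clause selected in the source text to the summary."""
        self._require_editing()
        if clause_id not in self.sentence_index:
            raise KeyError(clause_id)
        if clause_id not in self.summary_ids:
            self.summary_ids.append(clause_id)
            self.summary_ids.sort()  # keep source order (assumption)

    def add_opinion(self, text):
        self._require_editing()
        self.opinions.append(text)

    def render(self):
        """Text shown in the summary area: clauses first, then opinions."""
        return [self.sentence_index[i] for i in self.summary_ids] + self.opinions


index = {
    1: "By the reporting period,",
    2: "the company's registered capital is 5 million yuan,",
    3: "of which Li Si contributed 4.5 million yuan,",
}
view = SummaryView(index, [2])
view.start_editing()
view.add_clause(3)
view.add_opinion("Capital structure verified.")
print(view.render())
```

Because the view stores numbers rather than text, clicking a summary item can be mapped straight back to the clause to highlight in the source document.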
Embodiment
Embodiment 1
Take as an example an input Word document, "Examination report on Company A's loan application". Machine-assisted reading is performed on the text to obtain the summary content the user cares about. The content of the report is shown in Fig. 2:
First step: call the LibreOffice program to parse the content of the Word document into XML data (see Fig. 3), then obtain the final plain text document by parsing the XML;
Second step: use the trained LR (logistic regression) model to classify the parsed plain text document into blocks, obtaining the "business background" functional block;
Third step: use the sentence selection model, trained with a boosting regression method, to extract a summary from the "business background" functional block;
During training of the summary model, the manually labeled sample set is marked with only two classes, summary (1) and non-summary (0). The Gradient Boosting Regressor method predicts a real value between 0 and 1 for each clause; the clauses are ranked by this value and the top n are taken as the summary result, as shown in Table 1. If n is 1, the best summary obtained is "The company's registered capital is 5 million yuan,";
Table 1
Label | Predicted score | Clause | Note
0 | 0.245879352093 | As of the reporting period, | Non-summary
1 | 0.886647164822 | The company's registered capital is 5 million yuan, | Summary
0 | 0.677558422089 | of which Li Si contributed 4.5 million yuan, | Non-summary
0 | 0.0709818303585 | accounting for 90%, | Non-summary
0 | 0.538590252399 | Zhang San contributed 500,000 yuan, | Non-summary
0 | 0.0706759169698 | accounting for 10%. | Non-summary
Fourth step: output the summary content "The company's registered capital is 5 million yuan," and, in combination with the reviewer's opinions, form the review result.
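The ranking and top-n selection in the third step can be reproduced directly from the scores in Table 1. The sketch below only covers the selection; the scores are copied from the table rather than produced by a trained model, and the function name `top_n_summary` is an assumption.

```python
# (clause, manual label, predicted score), copied from Table 1.
TABLE_1 = [
    ("As of the reporting period,", 0, 0.245879352093),
    ("The company's registered capital is 5 million yuan,", 1, 0.886647164822),
    ("of which Li Si contributed 4.5 million yuan,", 0, 0.677558422089),
    ("accounting for 90%,", 0, 0.0709818303585),
    ("Zhang San contributed 500,000 yuan,", 0, 0.538590252399),
    ("accounting for 10%.", 0, 0.0706759169698),
]

def top_n_summary(rows, n=1):
    """Rank clauses by predicted score, highest first, and keep the top n."""
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    return [clause for clause, _label, _score in ranked[:n]]

summary = top_n_summary(TABLE_1, n=1)
```

With n = 1 this selects the clause about registered capital, matching the result stated in the example; raising n adds the next highest scoring clauses in rank order.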
Embodiment 2
Pre-loan audit documents in credit auditing can be divided into the following categories: summary information, business background, business situation, project situation, account analysis, credit status, borrowing arrangements, repayment schedule, guarantee analysis, mortgage analysis, risk analysis, risk prevention, overall assessment and branch opinion.
Using the feature extraction method of the present invention, the trained classification model (LR model) divides the documents into functional blocks. As shown in Fig. 11, the overall average classification accuracy on the test set reaches 93.1%.
Summary information is extracted using the method of the present invention. Because a ranking model is used, the summaries are evaluated with NDCG, a standard metric for ranking. The NDCG results (top 5) for summary extraction over all pre-loan audit documents are given in Table 2 below:
Table 2
Metric | NDCG@1 | NDCG@2 | NDCG@3 | NDCG@4 | NDCG@5
Result | 0.816782 | 0.814651 | 0.821526 | 0.821179 | 0.826785
Taking NDCG@5 as an example, the results for each category are shown in Fig. 12. Table 2 and Fig. 12 show that the summary extraction of the present invention is fairly accurate and meets the processing requirements of the credit audit industry.
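For reference, NDCG@k can be computed as in the sketch below. The patent does not say which NDCG variant was used; this version assumes the common form with linear gain and a log2 position discount, and it is applied to the labels and scores of Table 1 rather than to the evaluation data behind Table 2.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(labels, scores, k):
    """NDCG@k for one ranked list: DCG of the predicted order over the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

# Manual labels and predicted scores from Table 1.
labels = [0, 1, 0, 0, 0, 0]
scores = [0.245879352093, 0.886647164822, 0.677558422089,
          0.0709818303585, 0.538590252399, 0.0706759169698]
ndcg_1 = ndcg_at_k(labels, scores, 1)
```

For the Table 1 example the only clause labeled as summary is ranked first, so NDCG@1 is 1.0; the figures in Table 2 would be averages of such values over many functional blocks.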
The present invention has been described above with reference to preferred embodiments, but these embodiments are merely exemplary and serve only for illustration. On this basis, various substitutions and improvements can be made to the present invention, all of which fall within its scope of protection.

Claims (10)

1. A machine-assisted reading and review method based on a summary mode, characterized in that the method comprises the following steps:
Step 100: input a text, and parse its data content and format;
Step 200: classify the parsed text content, aggregate content of the same category and mark it with a category label, forming functional blocks with category labels;
Step 300: extract the corresponding summary content from each functional block;
Step 400: output the summary content and, in combination with the reviewer's opinions, form the review result.
2. The method according to claim 1, characterized in that, in step 100, the parsing comprises converting the format of the input text into plain text format;
Preferably, the input text is in Word or PDF document format; the content of the document is parsed into XML data, and the plain text is then obtained by parsing the XML data.
3. The method according to claim 1, characterized in that, in step 100, the parsing further comprises numbering the clauses of the parsed text in order and forming a sentence index from the numbers.
4. The method according to claim 1, characterized in that, in step 200, a classification model is built using logistic regression to classify the text content; preferably, the text content is classified with the paragraph as the basic unit.
5. The method according to claim 1, characterized in that, in step 200, building the classification model comprises a training process and a testing process:
Training process: label the corpus with the categories it belongs to, forming training samples; extract features from the training samples to train the model;
Testing process: use labeled or unlabeled corpus as test samples; extract features from the test samples and load the model to obtain classification results; adjust the model according to the classification results until a model with high classification accuracy is obtained;
Wherein the feature extraction process, during model training or actual classification, comprises: parsing the document structure and forming a tree-shaped document structure from the chapter structure information; using the tree-shaped document structure, placing the document titles of each level before the corresponding document body text to form a title + body content form, and performing feature extraction on the text in this content form.
6. The method according to claim 1, characterized in that, in step 300, a corresponding machine model is trained for each functional block according to the features of the block, and is used to extract the corresponding summary content.
7. The method according to claim 6, characterized in that, in step 300, in addition to the machine model trained for each functional block, a general-purpose machine model is trained on the full text;
Preferably, the machine model for each functional block and the general-purpose machine model jointly determine the summary content of each functional block.
8. The method according to claim 1, characterized in that, in step 300, the smallest selection unit for summary extraction is the clause, a clause being a short sentence delimited by a comma, full stop, question mark, exclamation mark or semicolon; after processing by the sentence selection model, the clauses corresponding to the top n ranked items are taken as the summary result, where the value of n can be adjusted as required.
9. A system for implementing the method according to any one of claims 1 to 8, the system comprising:
An input parsing module, for inputting a text and parsing its data content and format;
A block classification module, for classifying the parsed text content, aggregating content of the same category and marking it with a category label, forming functional blocks with category labels;
A summary extraction module, for extracting the corresponding summary content from each functional block;
A summary output and editing module, for outputting the summary content and, in combination with the reviewer's opinions, forming the review result.
10. The system according to claim 9, characterized in that the summary output and editing module comprises a summary output submodule, a summary display submodule and a summary editing submodule:
The summary output submodule receives instructions from the summary extraction module and, according to the extracted content determined by the summary extraction module, sends the corresponding clause numbers to the summary display submodule;
The summary display submodule receives the clause numbers sent by the summary output submodule and displays the corresponding clause content; it receives the edit instructions sent by the summary editing submodule, and deletes the corresponding clauses or displays the opinions entered by the reviewer;
The summary editing submodule receives an instruction to start editing and enters editing mode; it receives edit instructions and passes them to the summary display submodule, which applies the edits to the displayed content.
CN201810142416.5A 2018-02-11 2018-02-11 A kind of machine aid reading auditing method and system based on abstract mode Pending CN110162765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810142416.5A CN110162765A (en) 2018-02-11 2018-02-11 A kind of machine aid reading auditing method and system based on abstract mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810142416.5A CN110162765A (en) 2018-02-11 2018-02-11 A kind of machine aid reading auditing method and system based on abstract mode

Publications (1)

Publication Number Publication Date
CN110162765A true CN110162765A (en) 2019-08-23

Family

ID=67635126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810142416.5A Pending CN110162765A (en) 2018-02-11 2018-02-11 A kind of machine aid reading auditing method and system based on abstract mode

Country Status (1)

Country Link
CN (1) CN110162765A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312725A1 (en) * 2009-06-08 2010-12-09 Xerox Corporation System and method for assisted document review
CN104657347A (en) * 2015-02-06 2015-05-27 北京中搜网络技术股份有限公司 News optimized reading mobile application-oriented automatic summarization method
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN107403375A (en) * 2017-04-19 2017-11-28 北京文因互联科技有限公司 A kind of listed company's bulletin classification and abstraction generating method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination