CN109190098A

CN109190098A - Method and system for automatic document generation based on natural language processing

Info

Publication number: CN109190098A
Application number: CN201810928628.6A
Authority: CN
Inventors: 姚毅
Original assignee: Shanghai Wisdom-Only Laojian Information Technology Co Ltd
Current assignee: Beijing youfatian Technology Co.,Ltd.
Priority date: 2018-08-15
Filing date: 2018-08-15
Publication date: 2019-01-11

Abstract

The invention discloses a method and system for automatic document generation based on natural language processing, capable of automatically generating report documents for a professional domain. The technical solution is as follows: input original documents are classified automatically and processed according to their classification, yielding intermediate data and structured data respectively; word segmentation, entity recognition, relation extraction, event extraction, and knowledge base construction are performed on the intermediate data, and the extracted data is stored in a database as structured data; a document template is selected according to the output document type, the document is assembled from the acquired structured data, and the final target document is output.

Description

Method and system for automatic document generation based on natural language processing

Technical field

The present invention relates to the field of automatic document generation, and in particular to automatic document generation for legal analysis.

Background art

In the legal field, lawyers usually need to review a large volume of documents covering the legal entity's status, equity structure, business scope, business licenses, major assets and business contracts, litigation and arbitration cases, and so on. They confirm the facts through field investigation, interviews, and similar methods, and then write the corresponding legal report to support decision analysis.

To produce an analysis report, a legal practitioner must manually analyze documents on the legal entity's status, equity structure, business scope, business licenses, major assets and business contracts, litigation and arbitration cases, and so on, confirm the facts through field investigation and interviews, extract the key information by hand, and generate the report with the required conclusions. This approach depends on the practitioner's many years of accumulated industry experience, is difficult to apply at scale, requires information from across the whole domain, and has a high learning threshold.

Summary of the invention

A brief summary of one or more aspects is given below to provide a basic understanding of these aspects. This summary is not an extensive overview of all contemplated aspects, and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description given later.

The purpose of the present invention is to solve the above problems by providing a method and system for automatic document generation based on natural language processing, which can automatically generate report documents for a professional domain (for example, an intelligent legal report at the level of a junior legal analyst).

The technical solution of the present invention is as follows. The present invention discloses a method for automatic document generation based on natural language processing, comprising:

Step 1: automatically classifying the input original documents and processing them according to their classification, yielding intermediate data and structured data respectively;

Step 2: performing word segmentation, entity recognition, relation extraction, event extraction, and knowledge base construction on the intermediate data, and storing the extracted data in a database as structured data;

Step 3: selecting a document template according to the output document type, assembling the document from the acquired structured data, and outputting the final target document.

According to an embodiment of the method for automatic document generation based on natural language processing of the present invention, step 1 further comprises:

Determining the data acquisition requirements;

Obtaining the file type of each input original document, so that the different types of original documents can be distinguished;

Determining whether the document is an image document; if it is not, first converting the original document to an image and then proceeding to the subsequent steps; if it is, proceeding directly to the subsequent steps;

Classifying the document based on image processing;

Determining from the document classification whether the document is a fixed-format document; if it is, extracting information from the fixed-format document based on machine learning to obtain structured data; if it is not, proceeding to the subsequent steps;

Determining whether the document supports direct text extraction; if it does, obtaining the text content from the original document and storing it as intermediate data; if it does not, proceeding to the subsequent steps;

Performing recognition on the document, converting the text in the image into text format;

Repairing the content of the recognized text based on natural language processing, and storing the repaired data as intermediate data.

According to an embodiment of the method for automatic document generation based on natural language processing of the present invention, step 2 further comprises:

Performing word segmentation on the intermediate data;

Performing entity recognition on the segmented data;

Performing relation extraction on the data after entity recognition, to obtain the grammatical or semantic connections that exist between entities in the text;

Performing event extraction on the data after relation extraction, extracting the event information of interest from text that contains event information, and presenting events expressed in natural language in structured form;

Performing knowledge graph verification on the data after event extraction, building the relevant knowledge graph from the entity, relation, and event information obtained from multiple documents, for cross-validation of information and automatic discovery of anomalous events;

Forming structured data from the data after knowledge graph verification.

According to an embodiment of the method for automatic document generation based on natural language processing of the present invention, coreference resolution is also performed before relation extraction, to improve the accuracy of the subsequent extraction results.

According to an embodiment of the method for automatic document generation based on natural language processing of the present invention, step 3 further comprises:

Based on the structured data, selecting a different task tree path to generate the report according to the target document type to be output;

Processing according to the current stage of the document: if the document is in the intermediate processing stage, automatically generating a draft document for the professional domain from the template; if the document is in the final output stage, automatically generating the formal document for the professional domain from the template.

The present invention further discloses a system for automatic document generation based on natural language processing, the system comprising:

An original document processing module, which automatically classifies the input original documents and processes them according to their classification, yielding intermediate data and structured data respectively;

An intermediate data processing module, which performs word segmentation, entity recognition, relation extraction, event extraction, and knowledge base construction on the intermediate data, and stores the extracted data in a database as structured data; and

A target document automatic generation module, which selects a document template according to the output document type, assembles the document from the acquired structured data, and outputs the final target document.

According to an embodiment of the system for automatic document generation based on natural language processing of the present invention, the original document processing module further comprises:

A requirement formulation unit, which determines the data acquisition requirements;

A document type analysis unit, which obtains the file type of each input original document, so that the different types of original documents can be distinguished;

A document image-conversion unit, which first determines whether the document is an image document; if it is not, the unit converts the original document to an image and then passes it to the subsequent units; if it is, the document is passed directly to the subsequent units;

A document classification unit, which classifies the document based on image processing;

A fixed-format document information extraction unit, which first determines from the document classification whether the document is a fixed-format document; if it is, the unit extracts information from the fixed-format document based on machine learning to obtain structured data; if it is not, the document is passed to the subsequent units;

A direct text extraction unit, which first determines whether the document supports direct text extraction; if it does, the unit obtains the text content from the original document and stores it as intermediate data; if it does not, the document is passed to the subsequent units;

A text recognition unit, which performs recognition on the document, converting the text in the image into text format;

A content repair unit, which repairs the content of the recognized text based on natural language processing and stores the repaired data as intermediate data.

According to an embodiment of the system for automatic document generation based on natural language processing of the present invention, the intermediate data processing module further comprises:

A word segmentation unit, which performs word segmentation on the intermediate data;

An entity recognition unit, which performs entity recognition on the segmented data;

A relation extraction unit, which performs relation extraction on the data after entity recognition, to obtain the grammatical or semantic connections that exist between entities in the text;

An event extraction unit, which performs event extraction on the data after relation extraction, extracting the event information of interest from text that contains event information and presenting events expressed in natural language in structured form;

A knowledge graph verification unit, which performs knowledge graph verification on the data after event extraction, building the relevant knowledge graph from the entity, relation, and event information obtained from multiple documents, for cross-validation of information and automatic discovery of anomalous events;

A structured data storage unit, which stores the data after knowledge graph verification as structured data.

According to an embodiment of the system for automatic document generation based on natural language processing of the present invention, the intermediate data processing module further comprises:

A coreference resolution unit, which performs coreference resolution before relation extraction, to improve the accuracy of the subsequent extraction results.

According to an embodiment of the system for automatic document generation based on natural language processing of the present invention, the target document automatic generation module further comprises:

A template selection unit, which, based on the structured data, selects a different task tree path to generate the report according to the target document type to be output, including selecting a different template;

A target document generation unit, which processes according to the current stage of the document: if the document is in the intermediate processing stage, it automatically generates a draft document for the professional domain from the template; if the document is in the final output stage, it automatically generates the formal document for the professional domain from the template.

The present invention discloses a system for automatic document generation based on natural language processing, comprising:

A processor; and

A memory configured to store a series of computer-executable instructions and computer-accessible data associated with the series of computer-executable instructions,

Wherein the series of computer-executable instructions, when executed by the processor, cause the processor to carry out the foregoing method.

The present invention discloses a non-transitory computer-readable storage medium, characterized in that a series of computer-executable instructions is stored on the non-transitory computer-readable storage medium, and the series of executable instructions, when executed by a computing device, cause the computing device to carry out the foregoing method.

Compared with the prior art, the present invention has the following beneficial effects. The invention combines knowledge of a professional domain (for example, the legal field) with technologies such as natural language processing. Through a combination of multiple processing paths, including classification of massive numbers of documents, OCR extraction, NLP repair, segmentation of Chinese and domain-specific terms, entity recognition, event extraction, and template-based generation, it ultimately generates and puts to use intelligent reports at the level of a junior analyst in the professional domain (for example, an intelligent legal report at the level of a junior legal analyst). Intelligent legal reports can be widely applied in legal scenarios such as mergers and acquisitions, securities IPOs (initial public offerings), financial institution loans, restructuring, and major asset transfers.

Brief description of the drawings

The above features and advantages of the present invention can be better understood after reading the detailed description of the embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components with similar properties or features may have the same or similar reference numerals.

Fig. 1 shows a flowchart of an embodiment of the method for automatic document generation based on natural language processing of the present invention.

Fig. 2 shows a flowchart of step S1 in the embodiment shown in Fig. 1.

Fig. 3 shows a flowchart of step S2 in the embodiment shown in Fig. 1.

Fig. 4 shows a flowchart of step S3 in the embodiment shown in Fig. 1.

Fig. 5 shows a schematic diagram of an embodiment of the system for automatic document generation based on natural language processing of the present invention.

Fig. 6 shows a schematic diagram of the original document processing module in the embodiment shown in Fig. 5.

Fig. 7 shows a schematic diagram of the intermediate data processing module in the embodiment shown in Fig. 5.

Fig. 8 shows a schematic diagram of the target document automatic generation module in the embodiment shown in Fig. 5.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and specific embodiments. Note that the aspects described below in conjunction with the drawings and specific embodiments are merely exemplary and should not be understood as limiting the scope of protection of the present invention in any way.

Fig. 1 shows the flow of an embodiment of the method for automatic document generation based on natural language processing of the present invention. Referring to Fig. 1, the implementation steps of the method of this embodiment are detailed as follows.

Step S1: automatically classify the input original documents and process them according to their classification, yielding intermediate data and structured data respectively.

Specifically, Fig. 2 shows step S1 in more detail.

Step S101: determine the data acquisition requirements.

Based on the scenarios of the professional domain (for example, the legal field), the expert team and the product team discuss and determine the specific requirements for key data acquisition.

Step S102: analyze and determine the original document type.

The file type of each input original document is obtained, so that the different types of original documents can be distinguished.

Step S103: determine whether the document is an image document.

Based on the original document type obtained, determine whether the document is an image document. If it is, go to step S105; if it is not, go to step S104.

Step S104: convert the original document to an image.

Image conversion generally includes correction of the original document, such as offset correction, denoising, and pooling, and finally outputs the converted image document.

Step S105: classify the document based on image processing.

Document classification in this step uses a convolutional neural network model from deep learning: a document recognition model is built from a multi-layer convolutional neural network, the image document is input to the network for recognition, and once recognition is complete the document is assigned to its category.

Step S106: determine from the document classification whether the document is a fixed-format document. If it is, go to step S112; if it is not, go to step S107.

In the legal field, fixed-format documents include business licenses, tax registration certificates, patent certificates, identity cards, and the like.

Step S107: determine whether the document supports direct text extraction. If it does, go to step S110; if it does not, go to step S108.

Step S108: perform OCR on the document.

For documents whose text content cannot be obtained directly, such as scanned PDFs and pictures, OCR is performed to recognize the text portions of the original document.

OCR stands for Optical Character Recognition. It refers to the technology of optically converting the text of a paper document into a black-and-white dot-matrix image file and then using recognition software to convert the text in the image into text format, so that word-processing software can edit and process it further.

Step S109: repair the OCR-recognized content based on NLP.

Because of image quality and other factors, the text recognized by OCR contains a number of errors. The recognition problem is therefore combined with an NLP model: transition probabilities are computed over the OCR-recognized text, and words are then judged according to semantics. Words with very low transition probability and words judged semantically wrong are corrected, improving the accuracy of the text conversion.
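As an illustration only, the transition-probability part of step S109 could be sketched as follows. The patent does not specify the model; this sketch assumes a character-level bigram model with add-one smoothing, and the `confusable` table of look-alike characters, the threshold value, and all function names are hypothetical. The semantic judgment described above is not modelled here.

```python
from collections import Counter


def train_bigrams(corpus):
    """Count character bigrams in a small reference corpus (assumed clean text)."""
    pairs, singles = Counter(), Counter()
    for line in corpus:
        for a, b in zip(line, line[1:]):
            pairs[(a, b)] += 1
            singles[a] += 1
    return pairs, singles


def transition_prob(pairs, singles, a, b):
    """Add-one smoothed estimate of P(b | a)."""
    vocab = len(singles) + 1
    return (pairs[(a, b)] + 1) / (singles[a] + vocab)


def repair(text, pairs, singles, confusable, threshold=0.1):
    """Replace characters whose transition probability is below `threshold`
    with the most probable look-alike from the (hypothetical) confusable table."""
    chars = list(text)
    for i in range(1, len(chars)):
        prev = chars[i - 1]
        p = transition_prob(pairs, singles, prev, chars[i])
        if p >= threshold:
            continue
        candidates = confusable.get(chars[i], [])
        best = max(candidates,
                   key=lambda c: transition_prob(pairs, singles, prev, c),
                   default=None)
        if best is not None and transition_prob(pairs, singles, prev, best) > p:
            chars[i] = best
    return "".join(chars)


pairs, singles = train_bigrams(["the loan amount", "the loan term"])
print(repair("the l0an", pairs, singles, {"0": ["o"]}))
```

A production system would presumably train on a large domain corpus and operate on words as well as characters; the tiny corpus here only keeps the example self-contained.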

NLP stands for Natural Language Processing. NLP is a subfield of artificial intelligence (AI), an interdisciplinary technology that brings together artificial intelligence, linguistics, computer science, and related theory. It includes techniques such as word segmentation, part-of-speech tagging, entity recognition, keyword extraction, dependency parsing, temporal phrase recognition, clustering, and reasoning, and has been applied successfully in fields such as recommender systems, public opinion monitoring, and voice interaction. The present invention applies natural language processing to professional domains such as legal document analysis: it processes the large volume of information about an enterprise, extracts the data that lawyers care about, and performs the corresponding situation analysis and document output, minimizing unnecessary manual labor and helping lawyers work more efficiently.

Step S110: for documents that support direct access to their text content (such as Word, Excel, and PDFs with a text layer), read the original document, obtain the text content, and store it as intermediate data.

Step S111: initial usable data is produced either by direct text extraction from the document or by OCR recognition followed by NLP content repair. This data already has preliminary value for analysis and research.

Step S112: extract information from the fixed-format document based on machine learning.

For documents already determined to be of the same fixed format, a convolutional neural network model from machine learning is used: a recognition model for the specific document format is built from a multi-layer convolutional neural network, the fixed-format document is input to the network to obtain images of specific regions, text recognition is then performed on those region images, and the recognized text is output as structured data.
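The branching in steps S103 to S112 can be summarized in a short routing sketch. This is a minimal illustration, not the patented implementation: the `Doc` fields, the `FIXED_FORMAT` category names, and the `classify` callback (standing in for the convolutional neural network classifier of step S105) are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Doc:
    name: str
    is_image: bool                 # S103: is the input already an image?
    has_text_layer: bool           # S107: can text be read directly?
    category: Optional[str] = None


# Hypothetical category labels, based on the fixed-format examples in the text.
FIXED_FORMAT = {"business_license", "tax_registration", "patent_certificate", "id_card"}


def route(doc: Doc, classify: Callable[[Doc], str]) -> List[str]:
    """Return the ordered step labels a document passes through in step S1."""
    steps = ["S103"]
    if not doc.is_image:
        steps.append("S104")            # convert to image first
    steps.append("S105")                # image-based classification
    doc.category = classify(doc)
    steps.append("S106")
    if doc.category in FIXED_FORMAT:
        steps.append("S112")            # field extraction, yields structured data
        return steps
    steps.append("S107")
    if doc.has_text_layer:
        steps.append("S110")            # direct text extraction
    else:
        steps += ["S108", "S109"]       # OCR, then NLP-based repair
    steps.append("S111")                # intermediate data is now available
    return steps


scan = Doc("contract_scan.pdf", is_image=True, has_text_layer=False)
print(route(scan, lambda d: "contract"))
```

Returning the step labels rather than performing the work keeps the control flow easy to check against Fig. 2; in a real system each label would dispatch to the corresponding processing unit.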

Step S2: perform word segmentation (with regular-expression matching), entity recognition, coreference resolution, relation extraction, event extraction, and knowledge base construction on the intermediate data, and store the extracted data in a database as structured data.

Specifically, Fig. 3 shows step S2 in more detail.

Step S201: perform word segmentation on the intermediate data.

Intermediate data refers to the usable data obtained after processing the original documents. Part of this data is text obtained directly from the original documents, and part is text first recognized by OCR and then output after NLP content repair. This information is the input to intermediate data processing.

Word segmentation uses the segmentation techniques of natural language processing to split the text into words and phrases. This embodiment combines the forward maximum matching method and the reverse maximum matching method into a bidirectional matching method to improve segmentation accuracy. This embodiment can prepare in advance a basic general-purpose dictionary and a dictionary of domain-specific terms (for example, for the legal industry), which helps improve segmentation of documents from professional domains such as the legal industry.
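A minimal sketch of bidirectional maximum matching with a general dictionary plus a domain dictionary is given below. The patent does not state how the forward and reverse results are reconciled; this sketch assumes a common heuristic (prefer fewer tokens, then fewer single-character tokens, then the reverse result). The function names and the toy dictionaries are illustrative.

```python
def forward_mm(text, vocab, max_len):
    """Forward maximum matching: take the longest dictionary word from the left."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:   # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_mm(text, vocab, max_len):
    """Reverse maximum matching: take the longest dictionary word from the right."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_mm(text, general_dict, domain_dict):
    """Run both directions and pick one result using the assumed heuristic."""
    vocab = set(general_dict) | set(domain_dict)
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_mm(text, vocab, max_len)
    bwd = backward_mm(text, vocab, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    singles = lambda toks: sum(1 for t in toks if len(t) == 1)
    return fwd if singles(fwd) < singles(bwd) else bwd


general = {"研究", "研究生", "生命", "的", "起源", "公司", "股东"}
legal = {"有限责任公司"}
print(bidirectional_mm("研究生命的起源", general, legal))
print(bidirectional_mm("有限责任公司股东", general, legal))
```

The second call shows the effect of the domain dictionary: the legal term is kept as a single token instead of being split into general-vocabulary pieces.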

Step S202: entity recognition.

Named Entity Recognition (NER) is the foundation of information extraction. Its task is to identify items in the text such as Party A, Party B, the subject matter, the amount, liability for breach of contract, and the time in a contract, or the date, accounting firm name, report number, and capital verification result in a capital verification report, and to add the corresponding annotation to each, which facilitates the subsequent information extraction work.
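To make the annotation step concrete, the following is a small rule-based sketch that tags a few contract entities with regular expressions. It is a stand-in for illustration, not the recognition model used by the invention, which the text does not specify. The label names, the patterns, and the English sample layout ("Party A: ...") are all assumptions.

```python
import re

# Hypothetical patterns for a few contract fields; real documents would need
# patterns (or a trained model) matched to their actual language and layout.
PATTERNS = {
    "PARTY_A": re.compile(r"Party A:\s*(?P<value>[^\n;]+)"),
    "PARTY_B": re.compile(r"Party B:\s*(?P<value>[^\n;]+)"),
    "AMOUNT": re.compile(r"(?P<value>(?:RMB|CNY)\s?[\d,]+(?:\.\d+)?)"),
    "DATE": re.compile(r"(?P<value>\d{4}-\d{2}-\d{2})"),
}


def tag_entities(text):
    """Return labelled entities ordered by their position in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({
                "label": label,
                "text": match.group("value").strip(),
                "start": match.start("value"),
            })
    return sorted(entities, key=lambda e: e["start"])


sample = ("Party A: Alpha Co., Ltd.\n"
          "Party B: Beta LLC\n"
          "Amount: RMB 1,200,000.00, signed on 2018-08-15.")
for entity in tag_entities(sample):
    print(entity["label"], "->", entity["text"])
```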

Step S203: coreference resolution.

Reference is a common linguistic phenomenon, generally divided into two kinds: anaphora and coreference. Anaphora means that the current referring expression is closely related in meaning to a word, phrase, or sentence that appeared earlier in the text. Coreference mainly means that multiple nouns (including pronouns and noun phrases) point to the same referent in the real world. Coreference resolution simplifies and unifies the way entities are expressed, which greatly helps improve the accuracy of information extraction results.

Step S204: relation extraction.

The purpose of relation extraction is to obtain the grammatical or semantic connections that exist between entities in the text; it is a key link in information extraction. This embodiment combines pattern matching, dictionary-driven methods, and the machine-learning methods MBL and SVM, then evaluates the results of the different methods and outputs the best one.

Step S205: event extraction.

In information extraction, an event is something that happens within a specific period of time and geographic scope, involves one or more participants, and consists of one or more actions; it is usually expressed at the sentence level. The main goal of event extraction is to extract the event information of interest from text that contains event information, and to present events expressed in natural language in structured form.
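The definition above suggests what a structured event record might contain: a time, a place, participants with roles, and the triggering action. The following sketch shows one possible record layout. The field names and the sample sentence are assumptions, and the record is filled in by hand here rather than by an extraction model.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, Optional


@dataclass
class Event:
    event_type: str                     # e.g. "equity_transfer" (illustrative label)
    trigger: str                        # the word or phrase expressing the action
    time: Optional[str] = None          # when the event happened, if stated
    place: Optional[str] = None         # where it happened, if stated
    roles: Dict[str, str] = field(default_factory=dict)  # role name -> entity


# Sentence: "On 2017-03-01, Alpha Co. transferred 30% of its equity
# in Beta LLC to Gamma Fund."
event = Event(
    event_type="equity_transfer",
    trigger="transferred",
    time="2017-03-01",
    roles={
        "transferor": "Alpha Co.",
        "transferee": "Gamma Fund",
        "target": "Beta LLC",
        "share": "30%",
    },
)
print(asdict(event))
```

A flat record like this can be stored in a database row or a JSON field, which fits the later step of forming structured data.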

Step S206: knowledge graph verification.

Knowledge graph verification builds the relevant knowledge graph from the entity, relation, and event information obtained from multiple documents, and uses it for cross-validation of information and automatic discovery of anomalous events. In the legal field, for example, multi-level shareholder information can be combined with contract information to discover related-party transactions, or missing supporting documents for a property ownership certificate can be detected.
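The related-party example can be sketched as a small graph check. This is an illustrative simplification under stated assumptions: shareholdings are given as (shareholder, company) pairs, two contracting parties are treated as related when they share a direct or indirect shareholder or one holds shares in the other, and all names and data are invented for the example.

```python
from collections import defaultdict


def build_ownership_graph(shareholdings):
    """Map each company to the set of its direct shareholders."""
    owners = defaultdict(set)
    for shareholder, company in shareholdings:
        owners[company].add(shareholder)
    return owners


def all_owners(company, owners):
    """Collect direct and indirect (multi-level) shareholders of a company."""
    seen, stack = set(), [company]
    while stack:
        node = stack.pop()
        for parent in owners.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def related_party_contracts(contracts, owners):
    """Flag contracts whose two parties share an owner or own one another."""
    flagged = []
    for a, b in contracts:
        group_a = all_owners(a, owners) | {a}
        group_b = all_owners(b, owners) | {b}
        common = group_a & group_b
        if common:
            flagged.append((a, b, sorted(common)))
    return flagged


shareholdings = [
    ("Holding X", "Alpha"),
    ("Holding X", "Mid Y"),
    ("Mid Y", "Beta"),
    ("Person Z", "Gamma"),
]
contracts = [("Alpha", "Beta"), ("Alpha", "Gamma")]
graph = build_ownership_graph(shareholdings)
print(related_party_contracts(contracts, graph))
```

Here the Alpha and Beta contract is flagged because both trace back to Holding X through different levels of ownership, which neither document shows on its own.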

A knowledge graph (KG, in full Knowledge Graph/Vault) is a series of graphs that show the development process and structural relationships of knowledge. It uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections among pieces of knowledge.

Step S207: form the structured data.

The structured data contains the required information that has been extracted; this information is used for automatic document generation in the later stage.

Step S3: select a document template according to the output document type, assemble the document from the acquired structured data, and output the final document.

Specifically, Fig. 4 shows step S3 in more detail.

Step S301: determine the target document type.

Based on the structured data, a different task tree path is selected to generate the report according to the target document type to be output.

In this embodiment, the types are Excel documents, Word documents, and PPT documents. If the target output type is a Word document, the corresponding Word document template is selected. If the target output type is an Excel document, the corresponding Excel document template is selected. If the target output type is a PPT document, the corresponding PPT document template is selected.

Step S302: determine the document processing stage.

Determine the current document generation stage. The document may be in the intermediate processing stage or in the final formal output stage. If it is in the intermediate processing stage, go to step S304; if it is in the final formal output stage, go to step S303.

Step S303: if the document is currently in the final output stage, automatically generate the formal document for the professional domain from the template.

This embodiment automatically generates the formal document of a legal report. A legal report is a comprehensive written document through which a lawyer provides legal services; its content includes the legal basis, legal advice, and proposed solutions provided to the client. Legal reports are widely used in mergers and acquisitions, securities IPOs (initial public offerings), financial institution loans, restructuring, major asset transfers, and the like.

Automatic report generation is a technique that fills in the layout of a document template with the information acquired and thereby automatically generates the required report. The technique is commonly applied across many industries.

Step S304: if the document is currently in the intermediate processing stage, automatically generate a draft document for the professional domain from the template.

This embodiment automatically generates the draft of a legal report.
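Steps S301 to S304 amount to choosing a template by target type, filling it with structured data, and marking the result as a draft or a formal document depending on the stage. The sketch below illustrates that flow with plain-text templates from Python's standard `string.Template`. The template contents, the type and stage keys, and the `[DRAFT]` marker are assumptions; generating real Word, Excel, or PPT files would need dedicated libraries and is outside this sketch.

```python
from string import Template

# Plain-text stand-ins for the Word, Excel, and PPT templates (hypothetical).
TEMPLATES = {
    "word": Template("LEGAL REPORT\nCompany: $company\nRegistered capital: $capital\n"),
    "excel": Template("company\tcapital\n$company\t$capital\n"),
    "ppt": Template("Slide 1: $company\nSlide 2: Registered capital $capital\n"),
}


def generate(target_type, stage, data):
    """Select a template (S301), fill it, and apply the stage rule (S302 to S304)."""
    if target_type not in TEMPLATES:
        raise ValueError(f"unsupported target document type: {target_type}")
    body = TEMPLATES[target_type].substitute(data)
    if stage == "intermediate":
        return "[DRAFT]\n" + body       # S304: draft document
    if stage == "final":
        return body                     # S303: formal document
    raise ValueError(f"unknown processing stage: {stage}")


data = {"company": "Alpha Co., Ltd.", "capital": "RMB 1,000,000"}
print(generate("word", "intermediate", data))
```

`Template.substitute` raises `KeyError` when a placeholder has no value, which is a convenient way to surface missing structured data before a formal document is issued.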

Fig. 5 shows a schematic of an embodiment of the system for automatic document generation based on natural language processing of the present invention. Referring to Fig. 5, the system of this embodiment comprises an original document processing module, an intermediate data processing module, and a target document automatic generation module.

The original document processing module automatically classifies the input original documents and processes them according to their classification, yielding intermediate data and structured data respectively.

The intermediate data processing module performs word segmentation, entity recognition, relation extraction, event extraction, and knowledge base construction on the intermediate data, and stores the extracted data in a database as structured data.

The target document automatic generation module selects a document template according to the output document type, assembles the document from the acquired structured data, and outputs the final target document.

As shown in Fig. 6, the original document processing module of this embodiment includes: a requirement formulation unit, a document type analysis unit, a document image conversion unit, a document classification unit, a fixed-format document information extraction unit, a direct text content extraction unit, a text recognition unit, and a content repair unit.

The requirement formulation unit determines the data acquisition requirements.

The document type analysis unit obtains the file type of each input original document, so that the different kinds of original documents can be distinguished.

The document image conversion unit first judges whether the document is an image document. If it is not, the unit first converts the original document into images and then passes it to the subsequent units; if it is, processing proceeds directly to the subsequent units.

The document classification unit classifies documents based on image processing.

The fixed-format document information extraction unit first judges, according to the document classification, whether the document is a fixed-format document. If it is, the unit extracts information from the fixed-format document based on machine learning to obtain structured data; if it is not, processing proceeds to the subsequent units.

The direct text content extraction unit first judges whether the document supports direct text extraction. If it does, the unit obtains the text content from the original document and stores it as intermediate data; if it does not, processing proceeds to the subsequent units.

The text recognition unit recognizes the document and converts the text in the image into text format.

The content repair unit repairs the recognized text based on natural language processing, and the repaired data is stored as intermediate data.
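The branching performed by the units above can be sketched as a routing function. This is an illustrative sketch under stated assumptions: `OriginalDocument`, `FIXED_FORMAT_CLASSES`, `classify`, and `route` are hypothetical names, the classifier is a filename-based stand-in for the image-based classification described above, and the function records only which processing steps would run rather than performing them.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class OriginalDocument:
    name: str
    is_image: bool              # True if the input is already an image (scan, photo)
    text_layer: Optional[str]   # embedded text, if direct extraction is supported
    doc_class: str = ""         # filled in by classification


# Hypothetical set of classes treated as fixed-format documents.
FIXED_FORMAT_CLASSES = {"business_license"}


def classify(doc: OriginalDocument) -> str:
    # Stand-in for the image-based classifier; the filename rule is an assumption.
    return "business_license" if "license" in doc.name else "contract"


def route(doc: OriginalDocument) -> tuple[str, list[str]]:
    """Return the kind of data produced and the ordered list of steps applied."""
    steps: list[str] = []
    if not doc.is_image:
        steps.append("render_to_image")          # document image conversion unit
    doc.doc_class = classify(doc)                # document classification unit
    steps.append("classify")
    if doc.doc_class in FIXED_FORMAT_CLASSES:
        steps.append("fixed_format_extraction")  # yields structured data directly
        return "structured", steps
    if doc.text_layer is not None:
        steps.append("direct_text_extraction")   # yields intermediate data
        return "intermediate", steps
    steps += ["text_recognition", "content_repair"]  # recognition, then repair
    return "intermediate", steps
```

Only fixed-format documents bypass the intermediate data stage; every other branch ends in text that still needs the natural language processing of the next module.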

As shown in Fig. 7, the intermediate data processing module of this embodiment includes: a word segmentation unit, an entity recognition unit, a coreference resolution unit, a relation extraction unit, an event extraction unit, a knowledge graph verification unit, and a structured data storage unit.

The word segmentation unit performs word segmentation on the intermediate data.

The entity recognition unit performs entity recognition on the segmented data.

The coreference resolution unit performs coreference resolution before relation extraction, to improve the accuracy of the subsequent extraction results.

The relation extraction unit performs relation extraction on the data after entity recognition, obtaining the grammatical or semantic connections that exist between entities in the text.

The event extraction unit performs event extraction on the data after relation extraction, extracting the event information of interest from text containing event information and presenting events expressed in natural language in a structured form.

The knowledge graph verification unit performs knowledge graph verification on the data after event extraction: it builds a knowledge graph from the entities, relations, and events already obtained from multiple documents, and uses it for mutual corroboration of information and automatic discovery of anomalous events.

The structured data storage unit stores the data that has passed knowledge graph verification as structured data.
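The idea of mutual corroboration and anomaly discovery across documents can be illustrated with a small sketch. It rests on assumptions not stated in the source: facts are represented as hypothetical `(source, subject, relation, value)` tuples, every relation is treated as single-valued, a fact counts as corroborated when at least two documents agree on it, and any disagreement on the same subject and relation is flagged as an anomaly. The function name `verify_facts` is likewise hypothetical.

```python
from collections import defaultdict


def verify_facts(facts):
    """Cross-check extracted facts from several documents.

    facts: iterable of (source_doc, subject, relation, value) tuples.
    Returns (confirmed, anomalies).
    """
    # (subject, relation) -> value -> set of source documents asserting it
    seen = defaultdict(lambda: defaultdict(set))
    for source, subject, relation, value in facts:
        seen[(subject, relation)][value].add(source)

    confirmed, anomalies = [], []
    for (subject, relation), values in seen.items():
        if len(values) > 1:
            # Documents disagree: flag for review instead of picking a winner.
            anomalies.append((subject, relation, sorted(values)))
        else:
            (value, sources), = values.items()
            if len(sources) > 1:
                # The same value is asserted by at least two documents.
                confirmed.append((subject, relation, value))
    return confirmed, anomalies
```

A fact that appears in only one document is neither confirmed nor anomalous in this sketch; a fuller system would likely keep it with a lower confidence mark.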

As shown in Fig. 8, the target document automatic generation module of this embodiment includes a template selection unit and a target document generation unit.

The template selection unit, based on the structured data, selects a different task tree path for generating the report according to the type of target document to be output, including selecting a different template.

The target document generation unit performs processing according to the current document's processing stage: if the document is in the intermediate processing stage, a draft document for the professional domain is automatically generated from the template; if the document is in the final output stage, a formal document for the professional domain is automatically generated from the template.
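Template selection and stage-dependent generation can be sketched with Python's standard `string.Template`. This is a simplified illustration, not the method of the invention: the task tree path is reduced to a dictionary lookup, and the template texts, the document type names, the `[DRAFT]` marker, and the function `generate_document` are all assumptions. The sketch also assumes that a draft may contain unfilled placeholders while a formal document may not.

```python
from string import Template

# Hypothetical templates keyed by target document type.
TEMPLATES = {
    "due_diligence": Template(
        "Due diligence report on $company\n"
        "Legal representative: $legal_representative"
    ),
    "litigation_summary": Template(
        "Litigation summary for $company\n"
        "Open cases: $open_cases"
    ),
}


def generate_document(doc_type: str, stage: str, record: dict) -> str:
    """Fill the template chosen by doc_type according to the processing stage."""
    if stage not in ("intermediate", "final"):
        raise ValueError(f"unknown stage: {stage}")
    template = TEMPLATES[doc_type]  # template selection
    if stage == "intermediate":
        # Draft: missing fields stay visible as $placeholders for later review.
        return "[DRAFT]\n" + template.safe_substitute(record)
    # Formal document: substitute() raises KeyError if any field is missing.
    return template.substitute(record)
```

Using `safe_substitute` for drafts and `substitute` for formal output means an incomplete record can still be reviewed as a draft, while a formal document is never emitted with gaps.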

In addition, the present invention further discloses an automatic document generation system based on natural language processing. The system includes a processor and a memory; the memory is configured to store a series of computer-executable instructions and computer-accessible data associated with those instructions, wherein the instructions, when executed by the processor, cause the processor to carry out the foregoing method.

The present invention further discloses a non-transitory computer-readable storage medium storing a series of computer-executable instructions which, when executed by a computing device, cause the computing device to carry out the foregoing method.

The specific implementation of the method has been described in detail in the foregoing embodiments and is not repeated here.

Besides the automatic generation of legal-industry reports described in the foregoing embodiments, the invention can also be applied to the news field. Building on a news mining and search system, a news collection is organized by news topic and entity; topic mining, analysis of the relations between news entities, and extraction of topic sentences are performed on the news events, yielding statistics and semantic descriptions for a large number of news events. The information mined from the news collection is described in the form of charts, tables, and text paragraphs and serves as material for the news roundup. Finally, the material is organized according to the writing conventions of a roundup, and a brief, objective, multi-perspective news roundup report is generated automatically.

Although, for simplicity of explanation, the above methods are illustrated and described as a series of actions, it should be understood and appreciated that these methods are not limited by the order of the actions, because, in accordance with one or more embodiments, some actions may occur in different orders and/or concurrently with other actions that are illustrated and described herein, or that are not illustrated and described herein but would be understood by those skilled in the art.

Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An automatic document generation method based on natural language processing, characterized by comprising:

Step 1: automatically classifying the input original documents and processing each original document according to its classification, to obtain intermediate data and structured data respectively;

Step 2: performing word segmentation, entity recognition, relation extraction, event extraction, and knowledge base construction on the intermediate data, and storing the extracted data in a database as structured data;

Step 3: selecting a document template according to the type of document to be output, assembling the document from the acquired structured data, and outputting the final target document.

2. The automatic document generation method based on natural language processing according to claim 1, characterized in that step 1 further comprises:

determining the data acquisition requirements;

obtaining, from the input original documents, the file type of each original document, so that the different kinds of original documents can be distinguished;

judging whether the document is an image document; if it is not, first converting the original document into images and then carrying out the subsequent steps; if it is, carrying out the subsequent steps directly;

classifying the document based on image processing;

judging, according to the document classification, whether the document is a fixed-format document; if it is, extracting information from the fixed-format document based on machine learning to obtain structured data; if it is not, carrying out the subsequent steps;

judging whether the document supports direct text extraction; if it does, obtaining the text content from the original document and storing it as intermediate data; if it does not, carrying out the subsequent steps;

recognizing the document and converting the text in the image into text format;

repairing the recognized text based on natural language processing, and storing the repaired data as intermediate data.

3. The automatic document generation method based on natural language processing according to claim 1, characterized in that step 2 further comprises:

performing word segmentation on the intermediate data;

performing relation extraction on the data after entity recognition, to obtain the grammatical or semantic connections that exist between entities in the text;

performing event extraction on the data after relation extraction, extracting the event information of interest from text containing event information, and presenting events expressed in natural language in a structured form;

performing knowledge graph verification on the data after event extraction, building a knowledge graph from the entities, relations, and events already obtained from multiple documents, for mutual corroboration of information and automatic discovery of anomalous events;

forming structured data from the data after knowledge graph verification.

4. The automatic document generation method based on natural language processing according to claim 3, characterized in that coreference resolution is also performed before relation extraction, to improve the accuracy of the subsequent extraction results.

5. The automatic document generation method based on natural language processing according to claim 1, characterized in that step 3 further comprises:

based on the structured data, selecting a different task tree path for generating the report according to the type of target document to be output;

performing processing according to the current document's processing stage: if the document is in the intermediate processing stage, automatically generating a draft document for the professional domain from the template; if the document is in the final output stage, automatically generating a formal document for the professional domain from the template.

6. An automatic document generation system based on natural language processing, characterized in that the system comprises:

an original document processing module, which automatically classifies the input original documents and processes each original document according to its classification, to obtain intermediate data and structured data respectively;

an intermediate data processing module, which performs word segmentation, entity recognition, relation extraction, event extraction, and knowledge base construction on the intermediate data, and stores the extracted data in a database as structured data; and

a target document automatic generation module, which selects a document template according to the type of document to be output, assembles the document from the acquired structured data, and outputs the final target document.

7. The automatic document generation system based on natural language processing according to claim 6, characterized in that the original document processing module further comprises:

a requirement formulation unit, which determines the data acquisition requirements;

a document type analysis unit, which obtains the file type of each input original document, so that the different kinds of original documents can be distinguished;

a document image conversion unit, which first judges whether the document is an image document; if it is not, first converts the original document into images and then passes it to the subsequent units; if it is, passes it directly to the subsequent units;

a fixed-format document information extraction unit, which first judges, according to the document classification, whether the document is a fixed-format document; if it is, extracts information from the fixed-format document based on machine learning to obtain structured data; if it is not, passes it to the subsequent units;

a direct text content extraction unit, which first judges whether the document supports direct text extraction; if it does, obtains the text content from the original document and stores it as intermediate data; if it does not, passes it to the subsequent units;

a content repair unit, which repairs the recognized text based on natural language processing, the repaired data being stored as intermediate data.

8. The automatic document generation system based on natural language processing according to claim 6, characterized in that the intermediate data processing module further comprises:

a relation extraction unit, which performs relation extraction on the data after entity recognition, to obtain the grammatical or semantic connections that exist between entities in the text;

an event extraction unit, which performs event extraction on the data after relation extraction, extracts the event information of interest from text containing event information, and presents events expressed in natural language in a structured form;

a knowledge graph verification unit, which performs knowledge graph verification on the data after event extraction and builds a knowledge graph from the entities, relations, and events already obtained from multiple documents, for mutual corroboration of information and automatic discovery of anomalous events;

9. The automatic document generation system based on natural language processing according to claim 8, characterized in that the intermediate data processing module further comprises:

a coreference resolution unit, which performs coreference resolution before relation extraction, to improve the accuracy of the subsequent extraction results.

10. The automatic document generation system based on natural language processing according to claim 6, characterized in that the target document automatic generation module further comprises:

a template selection unit, which, based on the structured data, selects a different task tree path for generating the report according to the type of target document to be output, including selecting a different template;

a target document generation unit, which performs processing according to the current document's processing stage: if the document is in the intermediate processing stage, automatically generates a draft document for the professional domain from the template; if the document is in the final output stage, automatically generates a formal document for the professional domain from the template.

11. An automatic document generation system based on natural language processing, characterized by comprising:

a processor; and

a memory configured to store a series of computer-executable instructions and computer-accessible data associated with the series of computer-executable instructions,

wherein the series of computer-executable instructions, when executed by the processor, cause the processor to carry out the method according to any one of claims 1 to 5.

12. A non-transitory computer-readable storage medium, characterized in that a series of computer-executable instructions is stored on the non-transitory computer-readable storage medium, and the series of computer-executable instructions, when executed by a computing device, cause the computing device to carry out the method according to any one of claims 1 to 5.