CN108920656A - Method and device for extracting document attribute description content

Info

Publication number: CN108920656A
Application number: CN201810718897.XA
Authority: CN
Inventors: 郑权; 张峰; 聂颖
Original assignee: Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Current assignee: Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Priority date: 2018-07-03
Filing date: 2018-07-03
Publication date: 2018-11-30

Abstract

The invention discloses a method and device for extracting document attribute description content. The method includes: obtaining the document information from which attribute text is to be extracted; inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result; and determining, according to the computation result, the content in the document information that corresponds to the attribute to be extracted. The invention thereby makes it possible to read document attribute information quickly.

Description

Method and device for extracting document attribute description content

Technical field

The present invention relates to the field of information processing, and in particular to a method and device for extracting document attribute description content.

Background art

When a user reads a large number of documents on one topic, the user cares most about a few points of interest. These points of interest are the attributes of the text. For example, a user who wants to read tens of thousands of bidding documents could quickly find the specific bidding documents of interest by reading only those points of interest. However, because the points of interest cannot be located quickly in the text, the user's reading speed is greatly reduced. If the points of interest in a file could be listed explicitly, the user could quickly navigate to the files of interest.

No effective solution has yet been proposed for the problem in the related art that document content cannot be extracted quickly.

Summary of the invention

The main purpose of the present invention is to provide a method and device for extracting document attribute description content, so as to solve the problem that document content cannot be extracted quickly.

To achieve the above purpose, according to one aspect of the present invention, a method for extracting document attribute description content is provided. The method includes: obtaining the document information from which attribute text is to be extracted; inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result; and determining, according to the computation result, the description content in the document information that corresponds to the document attributes.

Further, after the description content corresponding to the document attributes in the document information is determined according to the computation result, the method also includes: marking, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted.

Further, marking, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted includes: marking the description content corresponding to each document attribute to be extracted in the document information with a background color of a different color.

Further, before the document information is input into the pre-trained attribute extraction model for model computation, the method also includes: collecting a preset number of model training samples; labeling the paragraphs and sentences in the model training samples to obtain labeled sample content; and performing deep learning on the labeled sample content through a neural network to obtain the trained attribute extraction model.

Further, performing deep learning on the labeled sample content through a neural network to obtain the trained attribute extraction model includes: converting the words in the labeled samples into numeric vectors; and training on the numeric vectors through LSTM learning to obtain the trained attribute extraction model.

To achieve the above purpose, according to another aspect of the present invention, a device for extracting document attribute description content is also provided. The device includes: an acquiring unit, configured to obtain the document information from which attribute text is to be extracted; a computing unit, configured to input the document information into a pre-trained attribute extraction model for model computation to obtain a computation result; and a determination unit, configured to determine, according to the computation result, the description content in the document information that corresponds to the document attributes.

Further, the device also includes: a marking unit, configured to mark, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted, after the description content corresponding to the document attributes in the document information has been determined according to the computation result.

Further, the marking unit is configured to: mark the description content corresponding to each document attribute to be extracted in the document information with a background color of a different color.

To achieve the above purpose, according to another aspect of the present invention, a storage medium is also provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the document attribute description content extraction method of the present invention.

To achieve the above purpose, according to another aspect of the present invention, a processor is also provided. The processor is used to run a program, wherein the program, when running, executes the document attribute description content extraction method of the present invention.

By obtaining the document information from which attribute text is to be extracted, inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result, and determining, according to the computation result, the description content in the document information that corresponds to the document attributes, the present invention solves the problem that document content cannot be extracted quickly and thereby achieves the effect of reading document attribute information quickly.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flowchart of a method for extracting document attribute description content according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the result of extracting text attribute description paragraphs according to an embodiment of the present invention; and

Fig. 3 is a schematic diagram of a device for extracting document attribute description content according to an embodiment of the present invention.

Detailed description of the embodiments

It should be noted that, provided there is no conflict, the embodiments of this application and the features in those embodiments can be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second" and the like in the description and claims of this application and in the above drawings are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of this application described herein can be implemented. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.

The embodiments of the present invention provide a method for extracting document attribute description content.

Fig. 1 is a flowchart of a method for extracting document attribute description content according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S102: obtain the document information from which attribute text is to be extracted;

Step S104: input the document information into a pre-trained attribute extraction model for model computation to obtain a computation result;

Step S106: determine, according to the computation result, the description content in the document information that corresponds to the document attributes.
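The three steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the per-sentence label interface and the keyword-based stand-in model are all hypothetical and do not come from the patent, which uses a trained neural network in step S104.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interface: a model returns one attribute name (or None) per sentence.
Model = Callable[[List[str]], List[Optional[str]]]


def acquire_document(text: str) -> List[str]:
    """Step S102: obtain the document information, here as non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def run_model(model: Model, sentences: List[str]) -> List[Optional[str]]:
    """Step S104: input the document information into the pre-trained model."""
    labels = model(sentences)
    if len(labels) != len(sentences):
        raise ValueError("the model must return exactly one label per sentence")
    return labels


def determine_content(
    sentences: List[str], labels: List[Optional[str]]
) -> Dict[str, List[str]]:
    """Step S106: collect the description content for each attribute."""
    content: Dict[str, List[str]] = {}
    for sentence, label in zip(sentences, labels):
        if label is not None:
            content.setdefault(label, []).append(sentence)
    return content


def keyword_stub_model(sentences: List[str]) -> List[Optional[str]]:
    """Toy stand-in for the trained model (a keyword rule, not an LSTM)."""
    rules = {"Project name": "project name", "Budget": "budget amount"}
    labels: List[Optional[str]] = []
    for sentence in sentences:
        label = None
        for prefix, attribute in rules.items():
            if sentence.startswith(prefix):
                label = attribute
        labels.append(label)
    return labels


def extract_attributes(text: str, model: Model) -> Dict[str, List[str]]:
    """Run steps S102, S104 and S106 in order."""
    sentences = acquire_document(text)
    labels = run_model(model, sentences)
    return determine_content(sentences, labels)
```

Keeping the model behind a plain callable means the stand-in can later be replaced by a trained model without changing the surrounding steps.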

By obtaining the document information from which attribute text is to be extracted, inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result, and determining, according to the computation result, the description content in the document information that corresponds to the document attributes, this embodiment solves the problem that document content cannot be extracted quickly and thereby achieves the effect of reading document attribute information quickly.

In the embodiments of the present invention, the document from which attribute text is to be extracted may be a file in any of several formats, such as Word format or table format. After the document information is obtained, it can be input into the pre-trained attribute extraction model for model computation. The attribute extraction model is trained on a large number of documents, each of which carries the attributes to be extracted together with the position and content of each attribute in that document. After training on a large number of documents, the text content corresponding to an attribute to be extracted can be determined from its likely position in the document, from nearby keywords, or from the keywords it contains. In this way the attributes the user cares about can be extracted in the shortest time, which improves reading efficiency.

Optionally, after the description content in the document information that corresponds to the document attributes to be extracted is determined according to the computation result, that description content is marked in a preset manner.

Optionally, marking, in a preset manner, the content in the document information that corresponds to the attributes to be extracted includes: marking the description content corresponding to each document attribute to be extracted in the document information with a background color of a different color.

After the document attributes to be extracted have been determined, they can be displayed in several ways. For example, the content corresponding to each attribute can be marked in a different color: an attribute name and the content corresponding to that attribute are displayed in the same color in the document, and different attributes are distinguished by different colors. This makes it convenient for the user to quickly read the content corresponding to each kind of attribute.
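One possible way to realize background-color marking is to render the document as HTML. The sketch below assumes sentence-level labels and an arbitrary color palette; the palette, the function name and the HTML output format are illustrative choices, not something the patent specifies.

```python
import html
from typing import Dict, List, Optional

# Hypothetical palette; any set of distinguishable background colors would do.
PALETTE = ["#fff2a8", "#b8e6b8", "#b8d8f5", "#f5c2c2", "#e0c8f0"]


def highlight(sentences: List[str], labels: List[Optional[str]]) -> str:
    """Render sentences as HTML, giving each attribute its own background color.

    The attribute name is shown inside the same colored span as its content,
    so that name and content share one color.
    """
    colors: Dict[str, str] = {}
    parts: List[str] = []
    for sentence, label in zip(sentences, labels):
        text = html.escape(sentence)
        if label is None:
            # Sentences outside every attribute are left unmarked.
            parts.append(f"<p>{text}</p>")
            continue
        if label not in colors:
            # Assign colors in order of first appearance, cycling if needed.
            colors[label] = PALETTE[len(colors) % len(PALETTE)]
        parts.append(
            f'<p><span style="background-color:{colors[label]}">'
            f"[{html.escape(label)}] {text}</span></p>"
        )
    return "\n".join(parts)
```

With more attributes than palette entries the colors repeat, so a real system would need a palette at least as large as its attribute set.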

Optionally, before the document information is input into the pre-trained attribute extraction model for model computation, a preset number of model training samples are collected; the paragraphs and sentences in the model training samples are labeled to obtain labeled sample content; and deep learning is performed on the labeled sample content through a neural network to obtain the trained attribute extraction model.

Optionally, performing deep learning on the labeled sample content through a neural network to obtain the trained attribute extraction model includes: converting the words in the labeled samples into numeric vectors; and training on the numeric vectors through LSTM learning to obtain the trained attribute extraction model.
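To make the two sub-steps concrete, the following standard-library sketch converts words to vectors through an embedding table and runs a single-layer LSTM forward pass over the sentences of a document. Several things here are assumptions rather than details from the patent: the weights are random and untrained, each sentence is represented by the average of its word vectors, and no output layer or training loop is shown. A practical system would use a deep learning framework and learn the embeddings and LSTM weights from the labeled samples.

```python
import math
import random


def build_vocab(sentences):
    """Map each distinct word to an integer index; index 0 is for unknown words."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def make_embeddings(vocab_size, dim, rng):
    """Create a randomly initialized embedding table (one vector per word)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def sentence_vector(sentence, vocab, embeddings):
    """Average the word vectors of a sentence (an assumed pooling choice)."""
    rows = [embeddings[vocab.get(word, 0)] for word in sentence.split()]
    dim = len(embeddings[0])
    if not rows:
        return [0.0] * dim
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps the concatenation [x, h] to four gate blocks."""
    n = len(h)
    z = x + h  # list concatenation of the input and the previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:n]]              # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]          # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]      # output gate
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]    # candidate cell state
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(n)]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(n)]
    return h_new, c_new


def encode_document(sentences, vocab, embeddings, W, b, hidden):
    """Run the LSTM over the sentence vectors; return one hidden state per sentence."""
    h, c = [0.0] * hidden, [0.0] * hidden
    states = []
    for sentence in sentences:
        h, c = lstm_step(sentence_vector(sentence, vocab, embeddings), h, c, W, b)
        states.append(h)
    return states


# Usage with untrained, randomly initialized parameters.
rng = random.Random(0)
sentences = ["project name ultracentrifuge procurement", "budget amount 70"]
vocab = build_vocab(sentences)
dim, hidden = 4, 3
embeddings = make_embeddings(len(vocab), dim, rng)
W = [[rng.uniform(-0.1, 0.1) for _ in range(dim + hidden)] for _ in range(4 * hidden)]
b = [0.0] * (4 * hidden)
states = encode_document(sentences, vocab, embeddings, W, b, hidden)
```

Each entry of `states` would feed a classification layer that predicts the sentence's attribute tag; training that layer together with the LSTM is what produces the attribute extraction model.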

In the model training process, representative training documents are first collected and the data in the documents is labeled, with each sentence treated as one unit. The paragraph corresponding to each class of document attribute begins with a B- tag: for example, the opening sentence of "project name" is tagged B-title, the following sentences are tagged I-title, and the closing sentence is tagged E-title. If a document attribute corresponds to a single sentence, that sentence is tagged S-title. Sentences that do not belong to any attribute are tagged O. The words in the documents are converted into numeric vectors (Word Embedding), and an attribute labeling model is then trained through LSTM learning (labeling here means tagging each sentence with an attribute). Training is repeated until a model that meets the requirements is obtained.
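The B-/I-/E-/S-/O tagging scheme described above can be written down directly. In this sketch the input format (a list of attribute and sentence-list pairs) and the function names are assumptions made for illustration; only the tag scheme itself comes from the description.

```python
from typing import List, Optional, Tuple


def tag_paragraph(attribute: str, n_sentences: int) -> List[str]:
    """Return the tags for an attribute paragraph of n_sentences sentences."""
    if n_sentences <= 0:
        return []
    if n_sentences == 1:
        # A single-sentence attribute gets the S- tag.
        return [f"S-{attribute}"]
    middle = [f"I-{attribute}"] * (n_sentences - 2)
    return [f"B-{attribute}"] + middle + [f"E-{attribute}"]


def tag_document(
    segments: List[Tuple[Optional[str], List[str]]]
) -> List[Tuple[str, str]]:
    """Tag every sentence of a document.

    segments is a list of (attribute, sentences) pairs in document order;
    attribute is None for text that belongs to no attribute.
    """
    tagged: List[Tuple[str, str]] = []
    for attribute, sentences in segments:
        if attribute is None:
            tags = ["O"] * len(sentences)
        else:
            tags = tag_paragraph(attribute, len(sentences))
        tagged.extend(zip(sentences, tags))
    return tagged
```

For example, `tag_paragraph("title", 4)` yields `["B-title", "I-title", "I-title", "E-title"]`.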

Optionally, during attribute extraction, if two or more pieces of text content are extracted for one attribute, the probability that each of these pieces of text is the attribute can be calculated, and the one with the highest probability is chosen as the text content corresponding to the attribute.
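A minimal sketch of this selection rule follows, assuming each candidate arrives as a (text, probability) pair. How the probabilities are computed is left open in the description, so the scores in any usage are placeholders.

```python
from typing import Dict, List, Tuple


def pick_best(candidates: Dict[str, List[Tuple[str, float]]]) -> Dict[str, str]:
    """For each attribute, keep the candidate text with the highest probability.

    Attributes with no candidates are omitted from the result.
    """
    best: Dict[str, str] = {}
    for attribute, options in candidates.items():
        if options:
            text, _probability = max(options, key=lambda option: option[1])
            best[attribute] = text
    return best
```

When two candidates tie, `max` keeps the first one encountered, so the order of extraction acts as the tie-breaker.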

The embodiments of the present invention also provide a specific implementation, and the technical solution of the present invention is described below with reference to it.

The technical solution of the embodiments of the present invention can serve as a dictionary-based method for extracting text attribute description paragraphs, in which a neural-network-based deep learning method identifies text attribute description sentences or paragraphs. The overall flow is as follows:

1. Collect representative training documents.

2. Label the sample data. The sample documents are labeled according to the different attributes: each attribute description sentence or paragraph is tagged with its attribute, and sentences that do not belong to any attribute are tagged as other.

3. Learn from the labeled data with a neural-network-based deep learning method and train the attribute labeling model.

4. Use the trained model to extract feature attributes from documents.

The method of the embodiments of the present invention for identifying text attribute description sentences or paragraphs with a neural-network-based deep learning method can be implemented through the following steps:

Step 1: before training the model, first collect representative training documents.

Step 2: label the data. The specific steps are as follows:

Each sentence is treated as one unit. The paragraph corresponding to each class of document attribute begins with a B- tag: for example, the opening sentence of "project name" is tagged B-title, the following sentences are tagged I-title, and the closing sentence is tagged E-title. If a document attribute corresponds to a single sentence, that sentence is tagged S-title. Sentences that do not belong to any attribute are tagged O.

Step 3: learn from the labeled data. Here a neural-network-based deep learning method is used, for example Word Embedding + LSTM. The specific steps are:

1. First, convert the words into numeric vectors (Word Embedding).

2. Then train the attribute labeling model through LSTM learning (labeling here means tagging each sentence with an attribute).

Step 4: use the trained model to extract feature attributes from documents.
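Once a trained model has assigned a tag to every sentence, step 4 amounts to reading the attribute paragraphs back out of the tag sequence. The decoder below is one possible sketch; the function name and the policy of discarding inconsistent tag runs (for example an I- tag with no preceding B- tag) are assumptions, since the description does not say how such cases are handled.

```python
from typing import List, Optional, Tuple


def decode_spans(sentences: List[str], tags: List[str]) -> List[Tuple[str, List[str]]]:
    """Collect (attribute, sentences) spans from B-/I-/E-/S-/O sentence tags."""
    spans: List[Tuple[str, List[str]]] = []
    current_attr: Optional[str] = None
    current: List[str] = []
    for sentence, tag in zip(sentences, tags):
        prefix, _, attr = tag.partition("-")
        if prefix == "S":
            # A single-sentence attribute is a complete span on its own.
            spans.append((attr, [sentence]))
            current_attr, current = None, []
        elif prefix == "B":
            # Start a new span, dropping any unfinished one.
            current_attr, current = attr, [sentence]
        elif prefix in ("I", "E") and attr == current_attr:
            current.append(sentence)
            if prefix == "E":
                spans.append((current_attr, current))
                current_attr, current = None, []
        else:
            # "O" or an inconsistent tag: discard any unfinished span.
            current_attr, current = None, []
    return spans
```

Each returned span pairs an attribute with the sentences that describe it, which is the form needed for marking the content in the document.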

Fig. 2 is a schematic diagram of the result of extracting text attribute description paragraphs according to an embodiment of the present invention. As shown in Fig. 2, identifying text attribute description sentences or paragraphs means finding the descriptions of the relevant attributes in a natural-language text and marking their positions and types. For each category, such as project name, budget amount, project content description, bidding document price, contact information and qualification requirements, the corresponding content is identified and marked in the text. In the example, the project name corresponds to "Shandong Provincial Maternal and Child Health Care Hospital, Key Laboratory of Fertility Regulation project equipment procurement (second batch), ultracentrifuge procurement two"; the budget amount corresponds to 70.000000 ten-thousand yuan; the project content description is "ultracentrifuge"; the bidding document price is 300 yuan per package; and the marked contact information includes the purchaser (Shandong Provincial Maternal and Child Health Care Hospital), its address and contact person, and the agency's address, contact person and telephone number. This lets the user read the information in the document in the shortest time.

With the above method, the user can read the text rapidly and quickly locate the points of interest that require attention, which improves reading efficiency.

It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.

The embodiments of the present invention provide a device for extracting document attribute description content, which can be used to execute the document attribute description content extraction method of the embodiments of the present invention.

Fig. 3 is a schematic diagram of a device for extracting document attribute description content according to an embodiment of the present invention. As shown in Fig. 3, the device includes:

an acquiring unit 10, configured to obtain the document information from which attribute text is to be extracted;

a computing unit 20, configured to input the document information into a pre-trained attribute extraction model for model computation to obtain a computation result;

a determination unit 30, configured to determine, according to the computation result, the description content in the document information that corresponds to the document attributes.

In this embodiment, the acquiring unit 10 obtains the document information from which attribute text is to be extracted; the computing unit 20 inputs the document information into a pre-trained attribute extraction model for model computation to obtain a computation result; and the determination unit 30 determines, according to the computation result, the content in the document information that corresponds to the attributes to be extracted. This solves the problem that document content cannot be extracted quickly and thereby achieves the effect of reading document attribute information quickly.

Optionally, the device also includes: a marking unit, configured to mark, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted, after the description content corresponding to the document attributes in the document information has been determined according to the computation result.

Optionally, the marking unit is configured to mark the description content corresponding to each document attribute to be extracted in the document information with a background color of a different color.

The device for extracting document attribute description content includes a processor and a memory. The acquiring unit, computing unit, determination unit and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.

The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and document attribute information is read quickly by adjusting kernel parameters.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.

The embodiments of the present invention provide a storage medium on which a program is stored. When the program is executed by a processor, it implements the document attribute description content extraction method.

The embodiments of the present invention provide a processor for running a program, wherein the program, when running, executes the document attribute description content extraction method.

The embodiments of the present invention provide a device that includes a processor, a memory, and a program stored in the memory and runnable on the processor. When executing the program, the processor implements the following steps: obtaining the document information from which attribute text is to be extracted; inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result; and determining, according to the computation result, the description content in the document information that corresponds to the document attributes. The device here can be a server, a PC, a PAD, a mobile phone, or the like.

This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining the document information from which attribute text is to be extracted; inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result; and determining, according to the computation result, the description content in the document information that corresponds to the document attributes.

Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system or a computer program product. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.

This application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means. The instruction means implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and a memory.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes that element.

Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system or a computer program product. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) that contain computer-usable program code.

The above are only embodiments of this application and are not intended to limit it. Those skilled in the art may make various modifications and changes to this application. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of this application shall fall within the scope of the claims of this application.

Claims

1. A method for extracting document attribute description content, characterized by comprising:

obtaining the document information from which attribute text is to be extracted;

inputting the document information into a pre-trained attribute extraction model for model computation to obtain a computation result;

determining, according to the computation result, the description content in the document information that corresponds to the document attributes.

2. The method according to claim 1, characterized in that, after the description content corresponding to the document attributes in the document information is determined according to the computation result, the method further comprises:

marking, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted.

3. The method according to claim 2, characterized in that marking, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted comprises:

marking the description content corresponding to each document attribute to be extracted in the document information with a background color of a different color.

4. The method according to claim 1, characterized in that, before the document information is input into the pre-trained attribute extraction model for model computation, the method further comprises:

collecting a preset number of model training samples;

labeling the paragraphs and sentences in the model training samples to obtain labeled sample content;

performing deep learning on the labeled sample content through a neural network to obtain the trained attribute extraction model.

5. The method according to claim 4, characterized in that performing deep learning on the labeled sample content through a neural network to obtain the trained attribute extraction model comprises:

converting the words in the labeled samples into numeric vectors;

training on the numeric vectors through LSTM learning to obtain the trained attribute extraction model.

6. A device for extracting document attribute description content, characterized by comprising:

an acquiring unit, configured to obtain the document information from which attribute text is to be extracted;

a computing unit, configured to input the document information into a pre-trained attribute extraction model for model computation to obtain a computation result;

a determination unit, configured to determine, according to the computation result, the description content in the document information that corresponds to the document attributes.

7. The device according to claim 6, characterized in that the device further comprises:

a marking unit, configured to mark, in a preset manner, the description content in the document information that corresponds to the document attributes to be extracted, after the description content corresponding to the document attributes in the document information has been determined according to the computation result.

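8. The device according to claim 7, characterized in that the marking unit is configured to:

mark the description content corresponding to each document attribute to be extracted in the document information with a background color of a different color.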

9. A storage medium, characterized in that the storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the document attribute description content extraction method according to any one of claims 1 to 5.

10. A processor, characterized in that the processor is used to run a program, wherein the program, when running, executes the document attribute description content extraction method according to any one of claims 1 to 5.