CN109815500A - Management method, device, computer equipment and storage medium for unstructured official documents
- Publication number: CN109815500A
- Application number: CN201910074336.5A
- Authority: CN (China)
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The application proposes a management method, device, computer equipment and storage medium for unstructured official documents. The method includes: obtaining an unstructured official document to be identified; identifying the document according to a preset identification model to obtain the attribute information in the document; and storing the document according to the attribute information. This improves the effectiveness and accuracy of unstructured official document management.
Description
Technical field
This application relates to the technical field of e-government, and in particular to a management method, device, computer equipment and storage medium for unstructured official documents.
Background
At present, government document processing usually relies on one of two approaches: management means or technical solutions. With management means, every issuing department turns each document it sends into a managed object, mainly by manually entering the document abstract, the receiving department, the personnel involved, contact details and so on into a management system. This approach is inefficient, and without dedicated staff to do the entry, historical documents are abandoned, so the many documents of parallel and cross-cutting departments across the government system cannot be entered effectively. Technical solutions mainly enter each document in full and support only simple matching queries on parts of documents or their content. They provide no effective identification or organized management during processing, and they cannot manage the relationships between documents of cross-cutting and parallel departments or the associations between documents. Neither approach can therefore manage official documents effectively.
Summary of the invention
The application aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, the application proposes a management method, device and storage medium for unstructured official documents, to solve the technical problem in the prior art that unstructured official documents cannot be managed effectively.
To achieve the above object, an embodiment of the first aspect of the application proposes a management method for unstructured official documents, including:
obtaining an unstructured official document to be identified;
identifying the unstructured official document to be identified according to a preset identification model to obtain the attribute information in the document;
storing the unstructured official document to be identified according to the attribute information.
The management method of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This improves the effectiveness and accuracy of unstructured official document management.
To achieve the above object, an embodiment of the second aspect of the application proposes a management device for unstructured official documents, including:
an obtaining module, for obtaining an unstructured official document to be identified;
an identification module, for identifying the unstructured official document to be identified according to a preset identification model to obtain the attribute information in the document;
a storage module, for storing the unstructured official document to be identified according to the attribute information.
The management device of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This improves the effectiveness and accuracy of unstructured official document management.
To achieve the above object, an embodiment of the third aspect of the application proposes computer equipment, including a processor and a memory. The processor reads the executable program code stored in the memory and runs the program corresponding to that code, so as to implement the management method for unstructured official documents described in the embodiment of the first aspect.
To achieve the above object, an embodiment of the fourth aspect of the application proposes a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the management method for unstructured official documents described in the embodiment of the first aspect.
To achieve the above object, an embodiment of the fifth aspect of the application proposes a computer program product. When the instructions in the computer program product are executed by a processor, the management method for unstructured official documents described in the embodiment of the first aspect is implemented.
Additional aspects and advantages of the application will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the application.
Brief description of the drawings
The above and/or additional aspects and advantages of the application will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow diagram of the management method for unstructured official documents provided by Embodiment 1 of the application;
Fig. 2 is a flow diagram of the management method for unstructured official documents provided by Embodiment 2 of the application;
Fig. 3 is a flow diagram of the management method for unstructured official documents provided by Embodiment 3 of the application;
Fig. 4 is a flow diagram of the management method for unstructured official documents provided by Embodiment 4 of the application;
Fig. 5 is a structural diagram of the management device for unstructured official documents provided by Embodiment 1 of the application;
Fig. 6 is a structural diagram of the management device for unstructured official documents provided by Embodiment 2 of the application;
Fig. 7 is a structural diagram of the management device for unstructured official documents provided by Embodiment 3 of the application;
Fig. 8 is a structural diagram of the management device for unstructured official documents provided by Embodiment 4 of the application;
Fig. 9 is a structural diagram of the computer equipment provided by an embodiment of the application.
Detailed description
Embodiments of the application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the application and should not be understood as limiting it.
The management method, device, computer equipment and storage medium for unstructured official documents of the embodiments of the application are described below with reference to the drawings.
Fig. 1 is a flow diagram of the management method for unstructured official documents provided by Embodiment 1 of the application.
As shown in Fig. 1, the management method may include the following steps:
Step 101: obtain an unstructured official document to be identified.
In practice, many government documents are not stored in any defined format. They are unstructured, and no dedicated staff enter them effectively, so they cannot be managed effectively. By establishing a preset identification model, the application can effectively identify and store a large number of unstructured official documents, improving the efficiency and accuracy of their management.
First, the unstructured official document to be identified is obtained. It can be understood that there may be one or more unstructured official documents that are not stored in any defined format, and one or more of them can be selected for identification according to the actual application needs.
Step 102: identify the unstructured official document to be identified according to the preset identification model to obtain the attribute information in the document.
Step 103: store the unstructured official document to be identified according to the attribute information.
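Steps 101 to 103 can be sketched as a small pipeline. This is a minimal illustration rather than the patented implementation: `DocumentStore`, `manage_document` and `toy_model` are hypothetical names, the attribute name `issuing_department` is an assumption, and the toy model is only a stand-in for the preset identification model.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical type: an identification model maps document text to attributes.
IdentificationModel = Callable[[str], Dict[str, str]]


class DocumentStore:
    """Stores documents and indexes them by (attribute name, attribute value)."""

    def __init__(self) -> None:
        self.documents: List[dict] = []
        self.index = defaultdict(list)

    def store(self, text: str, attributes: Dict[str, str]) -> int:
        doc_id = len(self.documents)
        self.documents.append({"text": text, "attributes": attributes})
        for name, value in attributes.items():
            self.index[(name, value)].append(doc_id)
        return doc_id


def manage_document(text: str, model: IdentificationModel, store: DocumentStore) -> int:
    """Step 101 supplies `text`; step 102 identifies it; step 103 stores it."""
    attributes = model(text)               # step 102: obtain attribute information
    return store.store(text, attributes)   # step 103: store according to attributes


def toy_model(text: str) -> Dict[str, str]:
    # Stand-in for the preset identification model, for illustration only.
    attributes = {}
    if "Finance Bureau" in text:
        attributes["issuing_department"] = "Finance Bureau"
    return attributes
```

Indexing by attribute at storage time is one possible design choice; it lets later lookups by department or contact avoid scanning every document.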
Specifically, the identification model is generated in advance. As one possible implementation, as shown in Fig. 2, generating it includes:
Step 201: determine an annotated corpus.
Step 202: perform word segmentation on multiple unstructured official documents for training to obtain multiple training word segments in each training document.
Step 203: process the annotated corpus and the training word segments according to a preset algorithm to generate the preset identification model.
Specifically, the annotated corpus can be determined in many ways. An existing annotated corpus, such as the People's Daily corpus, can be used directly; multiple unannotated official documents can be selected and annotated manually; or part of the corpus can be taken from existing annotated material and part annotated manually. The choice depends on the actual application needs.
There are also many ways to annotate. One example is BiLSTM+CRF, a model made of a bidirectional LSTM (Long Short-Term Memory network) followed by a CRF (Conditional Random Field) layer. The bidirectional LSTM captures context information in both directions, which supports better deep learning, reduces manual involvement in later annotation and further improves the efficiency of generating the identification model.
It should be noted that the annotated corpus stores language material that has actually occurred in real language use. It is a basic resource that carries linguistic knowledge on an electronic computer, and the raw corpus must be processed (for example, analyzed and handled) before it becomes a useful resource.
The annotation can be part-of-speech tagging, which is the process of determining the grammatical category of each word in a given sentence, that is, determining its part of speech and marking it, for example with position attribute vectors, part-of-speech tag sequence vectors, or clustering or classification algorithms.
It can be understood that, before the identification model is generated, multiple unstructured official documents for training must be determined, and word segmentation must be performed on each of them to obtain its training word segments.
As an example, the document content of training document A is obtained and segmented. The content can be segmented with a preset segmentation mode, for example the NlpAnalysis mode (segmentation with new word discovery) of Ansj Chinese word segmentation, a Java-based Chinese word segmentation tool. More specifically, the corresponding Ansj jar (a software package file format) is imported and the NlpAnalysis segmentation method is executed. This mode can identify out-of-vocabulary words, performs well on person names, organization names and numbers, and supports user-defined dictionaries.
Chinese word segmentation means cutting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a certain specification.
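To make the idea of cutting a character sequence into words concrete, here is a minimal sketch of dictionary-based forward maximum matching. It is a much simpler technique than the NlpAnalysis mode of Ansj mentioned above and is not that library's algorithm; the function name, the dictionary contents and the `max_word_len` parameter are assumptions for illustration.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_word_len: int = 4) -> List[str]:
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With a dictionary containing 非结构化, 公文, 管理 and 方法, the string 非结构化公文管理方法 is cut into those four words in order. Unlike a statistical segmenter, this approach cannot discover new words, which is why the source prefers a mode with new word discovery.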
Finally, the annotated corpus and the training word segments are processed by the preset algorithm to generate the preset identification model. The preset algorithm can be selected as needed, for example a deep neural network model or a program algorithm written in advance.
Specifically, the generated identification model stores the attribute information corresponding to different corpus entries or different word segments, such as location information, name information and contact information. Identifying the unstructured official document to be identified with the preset identification model therefore yields the attribute information in the document, and the document can then be stored according to that attribute information.
The attribute information can be determined according to the actual application needs, for example names, contact details, document abstracts and membership relationships extracted from the unstructured official document.
For example, the issuing/receiving organ can be extracted to analyze the organizations a document goes to; a personnel dictionary can be extracted to classify personnel attributes; and document abstracts can be extracted to classify similar documents.
To further ensure the quality of stored documents, documents can also be filtered for violations, and violating documents placed in a specific category for subsequent processing.
It should be noted that, to guarantee the validity of the identification model, the model also needs to be tested after it is generated, as shown in Fig. 3:
Step 301: obtain an unstructured official document to be tested.
Step 302: perform word segmentation on the unstructured official document to be tested to obtain multiple test word segments in the document.
Step 303: identify the test word segments according to the preset identification model to obtain a test value.
Step 304: judge the validity of the preset identification model according to the test value and a preset threshold.
Specifically, after the identification model is generated, an unstructured official document to be tested is determined and segmented; the detailed process is the same as described for step 202. The test word segments are then identified according to the preset identification model, which makes it possible to determine how many pieces of attribute information were identified, whether they are correct, and so on, and thus to determine the test value. Finally, the test value is compared with the preset threshold to judge the validity of the preset identification model.
As an example, the test value includes precision (accuracy rate) and recall. The ratio of precision to recall is obtained, and if the ratio is greater than or equal to the preset threshold, the preset identification model is determined to be valid.
For example, the identification model may be LDA (Latent Dirichlet Allocation, a document topic generation model), also called a three-layer Bayesian probability model, with a three-layer structure of words, topics and documents. In the application, the identification model can be regarded as treating each word in an official document as obtained through a process of "selecting a topic with a certain probability, and then selecting a word from that topic with a certain probability". The metrics are defined as follows:
precision = number of correctly identified individuals / total number of identified individuals;
recall = number of correctly identified individuals / total number of individuals present in the test set;
F value = precision * recall * 2 / (precision + recall).
Therefore, one or more of precision, recall and F value, or the ratios between them, can be chosen as the test value according to the actual application needs, and compared with a corresponding preset threshold to determine the validity of the identification model.
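The three metrics and the threshold check can be sketched as follows. The function names and the representation of identified items as sets of `(attribute name, value)` pairs are assumptions made for illustration; the source does not fix a data format or a threshold value.

```python
from typing import Iterable, Tuple


def evaluate(predicted: Iterable, gold: Iterable) -> Tuple[float, float, float]:
    """Return (precision, recall, F value) for identified items against a test set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # correctly identified individuals
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = precision * recall * 2 / (precision + recall)
    return precision, recall, f_value


def is_model_valid(test_value: float, threshold: float) -> bool:
    """The model is judged valid when the test value reaches the preset threshold."""
    return test_value >= threshold
```

If the model identifies three items of which two are correct, and the test set contains four items, precision is 2/3, recall is 1/2 and the F value is 4/7. Which of these values, or which ratio between them, is passed to `is_model_valid` depends on the application.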
Based on the description of the above embodiments, after step 103, as shown in Fig. 4, the method further includes:
Step 401: obtain extraction keywords.
Step 402: extract the target unstructured official documents according to the extraction keywords.
That is, by determining extraction keywords such as an organization or a contact method, the target unstructured official documents can be extracted from the multiple unstructured official documents stored according to attribute information, which improves the accuracy of document extraction.
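Steps 401 and 402 can be sketched as a lookup over documents that were stored together with their attribute information. The record layout (a dict with `text` and `attributes` keys) and the function name are hypothetical; the source does not describe how matching is performed, so simple substring matching on attribute values is assumed here.

```python
from typing import Dict, List


def extract_documents(stored_documents: List[Dict], keyword: str) -> List[Dict]:
    """Return the stored documents with any attribute value containing the keyword."""
    return [
        doc for doc in stored_documents
        if any(keyword in value for value in doc["attributes"].values())
    ]
```

Because matching runs against the identified attributes rather than the raw text, a keyword such as an organization name only returns documents in which that organization was actually identified as an attribute.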
This addresses the most difficult parts of official document management, namely document identification and object extraction, so that documents can subsequently be managed in a hierarchical way and managed effectively while they are received and issued.
The management method of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This makes the management of unstructured official documents effective and accurate.
To realize the above embodiments, the application also proposes a management device for unstructured official documents.
Fig. 5 is a structural diagram of the management device for unstructured official documents provided by Embodiment 1 of the application.
As shown in Fig. 5, the management device 50 includes an obtaining module 510, an identification module 520 and a storage module 530, where:
the obtaining module 510 is for obtaining an unstructured official document to be identified;
the identification module 520 is for identifying the unstructured official document to be identified according to a preset identification model to obtain the attribute information in the document;
the storage module 530 is for storing the unstructured official document to be identified according to the attribute information.
Further, in one possible implementation of the embodiment of the application, as shown in Fig. 6, on the basis of the embodiment shown in Fig. 5, the device further includes a determining module 540, a word segmentation module 550 and a generation module 560, where:
the determining module 540 is for determining an annotated corpus;
the word segmentation module 550 is for performing word segmentation on multiple unstructured official documents for training to obtain multiple training word segments in each training document;
the generation module 560 is for processing the annotated corpus and the training word segments according to a preset algorithm to generate the preset identification model.
In one possible implementation of the embodiment of the application, as shown in Fig. 7, on the basis of the embodiment shown in Fig. 6, the device further includes a judgment module 570, where:
the obtaining module 510 is also for obtaining an unstructured official document to be tested;
the word segmentation module 550 is also for performing word segmentation on the unstructured official document to be tested to obtain multiple test word segments in the document;
the identification module 520 is also for identifying the test word segments according to the preset identification model to obtain a test value;
the judgment module 570 is for judging the validity of the preset identification model according to the test value and a preset threshold.
In one possible implementation of the embodiment of the application, the test value includes precision (accuracy rate) and recall. The judgment module 570 is specifically for obtaining the ratio of precision to recall and, if the ratio is greater than or equal to the preset threshold, determining that the preset identification model is valid.
In one possible implementation of the embodiment of the application, as shown in Fig. 8, on the basis of the embodiment shown in Fig. 5, the device further includes an extraction module 580, where:
the obtaining module 510 is also for obtaining extraction keywords;
the extraction module 580 is for extracting the target unstructured official documents according to the extraction keywords.
It should be noted that the foregoing explanation of the method embodiments also applies to the management device of this embodiment. The implementation principle is similar and is not repeated here.
The management device of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This makes the management of unstructured official documents effective and accurate.
To realize the above embodiments, the application also proposes computer equipment, including a processor and a memory. The processor reads the executable program code stored in the memory and runs the program corresponding to that code, so as to implement the management method for unstructured official documents described in the foregoing embodiments.
Fig. 9 is a structural diagram of the computer equipment provided by an embodiment of the application, showing a block diagram of exemplary computer equipment 90 suitable for implementing the embodiments of the application. The computer equipment 90 shown in Fig. 9 is only an example and should not impose any restriction on the functions or scope of use of the embodiments of the application.
As shown in Fig. 9, the computer equipment 90 takes the form of a general-purpose computing device. Its components may include, but are not limited to, one or more processors or processing units 906, a system memory 910, and a bus 908 connecting the different system components (including the system memory 910 and the processing unit 906).
The bus 908 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer equipment 90 typically includes a variety of computer-system-readable media. These can be any available media accessible by the computer equipment 90, including volatile and non-volatile media and removable and non-removable media.
The system memory 910 may include computer-system-readable media in the form of volatile memory, such as Random Access Memory (RAM) 911 and/or cache memory 912. The computer equipment 90 may further include other removable/non-removable, volatile/non-volatile computer system storage media. Merely as an example, the storage system 913 can be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 9, commonly called a "hard disk drive"). Although not shown in Fig. 9, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk") can be provided, as can an optical disc drive for reading and writing a removable non-volatile optical disc (for example a Compact Disc Read Only Memory (CD-ROM), a Digital Video Disc Read Only Memory (DVD-ROM) or other optical media). In these cases, each drive can be connected to the bus 908 through one or more data media interfaces. The system memory 910 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the application.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take many forms, including but not limited to an electromagnetic signal, an optical signal or any suitable combination of these. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of these.
The computer program code for performing the operations of the application can be written in one or more programming languages or a combination of them, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
A program/utility 914 having a set of (at least one) program modules 9140 can be stored, for example, in the system memory 910. Such program modules 9140 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 9140 generally perform the functions and/or methods of the embodiments described in the application.
The computer equipment 90 may also communicate with one or more external devices 10 (such as a keyboard, a pointing device, a display 100, etc.), with one or more devices that enable a user to interact with the computer equipment 90, and/or with any device (such as a network card, a modem, etc.) that enables the computer equipment 90 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 902. In addition, the computer equipment 90 may communicate with one or more networks (for example a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 900. As shown in Fig. 9, the network adapter 900 communicates with the other modules of the computer equipment 90 through a bus 908. It should be understood that, although not shown in Fig. 9, other hardware and/or software modules may be used together with the computer equipment 90, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 906 runs the programs stored in the system memory 910 to execute various functional applications and data processing, for example to implement the management method of unstructured official documents described in the foregoing embodiments.
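As a rough illustration of the method the processing unit would run (obtain a document, identify its attribute information, store the document according to that information), the following is a minimal Python sketch. Everything named here is an assumption made for illustration: the regular expressions in `PATTERNS` merely stand in for the preset identification model, which the application describes as a trained model, and `DocumentStore`, `identify`, and `manage` are hypothetical names that do not come from the application.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "preset identification model": a few regular
# expressions keyed by attribute name. The application describes a trained
# model; this rule set only makes the sketch runnable.
PATTERNS = {
    "doc_number": re.compile(r"No\.\s*(\d{4}-\d+)"),
    "issuer": re.compile(r"Issued by:\s*(.+)"),
    "date": re.compile(r"(\d{4}-\d{2}-\d{2})"),
}


def identify(text):
    """Return the attribute information found in one unstructured document."""
    attributes = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            attributes[name] = match.group(1).strip()
    return attributes


class DocumentStore:
    """In-memory store that indexes each document by its attribute values."""

    def __init__(self):
        self.documents = []
        self.index = defaultdict(list)

    def store(self, text, attributes):
        doc_id = len(self.documents)
        self.documents.append((text, attributes))
        for value in attributes.values():
            self.index[value].append(doc_id)
        return doc_id

    def extract(self, keyword):
        """Return the documents that have an attribute value equal to keyword."""
        return [self.documents[i][0] for i in self.index.get(keyword, [])]


def manage(text, store):
    """Obtain -> identify -> store, returning the identified attributes."""
    attributes = identify(text)
    store.store(text, attributes)
    return attributes
```

Under these assumptions, calling `manage(text, store)` on a notice and then `store.extract("City Planning Office")` would return every stored document whose identified issuer is that office, which is one simple way to read "storing according to the attribute information".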
To implement the above embodiments, the present application further proposes a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the management method of unstructured official documents described in the foregoing embodiments.
To implement the above embodiments, the present application further proposes a computer program product. When the instructions in the computer program product are executed by a processor, the management method of unstructured official documents described in the foregoing embodiments is implemented.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
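To illustrate the point that independent steps need not run in the order shown, the following Python sketch performs the same per-document step once sequentially and once concurrently and arrives at the same result. It is only a sketch under stated assumptions: `segment` is a hypothetical whitespace splitter standing in for word segmentation, and the use of a thread pool is an illustrative choice, not something the application prescribes.

```python
from concurrent.futures import ThreadPoolExecutor


def segment(text):
    # Hypothetical stand-in for word segmentation: split on whitespace.
    return text.split()


def segment_sequential(documents):
    """Segment the documents one after another, in the order given."""
    return [segment(doc) for doc in documents]


def segment_concurrent(documents, workers=4):
    """Segment the documents with overlapping execution.

    executor.map returns results in input order even though the individual
    calls may run at the same time or finish in a different order.
    """
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(segment, documents))
```

Because each document is segmented independently of the others, the order in which the individual calls actually execute does not affect the output.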
The logic and/or steps represented in a flow chart or otherwise described herein, for example an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by or in connection with such an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and can then be stored in a computer memory.
It should be understood that each part of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
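For the software case, the difference between separate units and one integrated module can be sketched as follows. This is an assumed arrangement for illustration only: `ManagementModule` and its three fields are hypothetical names, and the callables passed in are placeholders rather than the acquisition, identification, and storage modules as the application implements them.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Attributes = Dict[str, str]


@dataclass
class ManagementModule:
    """One integrated software module assembled from three separate units.

    Each unit is an ordinary callable, so it can also be used on its own,
    which mirrors the "exists physically on its own" alternative.
    """

    acquire: Callable[[], str]
    identify: Callable[[str], Attributes]
    store: Callable[[str, Attributes], None]

    def run(self) -> Attributes:
        text = self.acquire()
        attributes = self.identify(text)
        self.store(text, attributes)
        return attributes
```

Any one of the three units can be swapped (for example, a different `identify` callable) without touching the other two, which is the practical benefit of keeping the units separable even when they ship as a single module.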
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the application; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the application.
Claims (10)
1. A management method for unstructured official documents, characterized by comprising the following steps:
obtaining an unstructured official document to be identified;
identifying the unstructured official document to be identified according to a preset identification model, to obtain attribute information in the unstructured official document to be identified;
storing the unstructured official document to be identified according to the attribute information.
2. The method according to claim 1, characterized in that, before identifying the plurality of word segments according to the preset identification model to obtain the attribute information in the unstructured official document to be identified, the method further comprises:
determining a labeled corpus;
performing word segmentation on a plurality of training unstructured official documents, to obtain a plurality of training word segments in each training unstructured official document;
processing the labeled corpus and the plurality of training word segments according to a preset algorithm, to generate the preset identification model.
3. The method according to claim 2, characterized in that, after generating the preset identification model, the method further comprises:
obtaining an unstructured official document to be tested;
performing word segmentation on the unstructured official document to be tested, to obtain a plurality of test word segments in the unstructured official document to be tested;
identifying the plurality of test word segments according to the preset identification model, to obtain a test value;
judging the validity of the preset identification model according to the test value and a preset threshold.
4. The method according to claim 3, characterized in that the test value comprises an accuracy rate and a recall rate;
judging the validity of the preset identification model according to the test value and the preset threshold comprises:
obtaining the ratio of the accuracy rate to the recall rate;
if the ratio is greater than or equal to the preset threshold, determining that the preset identification model is valid.
5. The method according to claim 1, characterized in that, after storing the unstructured official document to be identified according to the attribute information corresponding to the target word segments, the method further comprises:
obtaining an extraction keyword;
extracting a target unstructured official document according to the extraction keyword.
6. A management device for unstructured official documents, characterized by comprising:
an acquisition module, configured to obtain an unstructured official document to be identified;
an identification module, configured to identify the unstructured official document to be identified according to a preset identification model, to obtain attribute information in the unstructured official document to be identified;
a storage module, configured to store the unstructured official document to be identified according to the attribute information.
7. The device according to claim 6, characterized by further comprising:
a determining module, configured to determine a labeled corpus;
a word segmentation module, configured to perform word segmentation on a plurality of training unstructured official documents, to obtain a plurality of training word segments in each training unstructured official document;
a generation module, configured to process the labeled corpus and the plurality of training word segments according to a preset algorithm, to generate the preset identification model.
8. The device according to claim 7, characterized in that:
the acquisition module is further configured to obtain an unstructured official document to be tested;
the word segmentation module is further configured to perform word segmentation on the unstructured official document to be tested, to obtain a plurality of test word segments in the unstructured official document to be tested;
the identification module is further configured to identify the plurality of test word segments according to the preset identification model, to obtain a test value;
and the device further comprises a judgment module, configured to judge the validity of the preset identification model according to the test value and a preset threshold.
9. Computer equipment, characterized by comprising a processor and a memory;
wherein the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code, so as to implement the management method of unstructured official documents according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the management method of unstructured official documents according to any one of claims 1 to 5.
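The validity check recited in claims 3 and 4 can be worked through with a small Python sketch. Several things here are assumptions rather than statements from the claims: the "accuracy rate" is read as precision (correct identifications divided by all identifications), the test results are modelled as sets of (word segment, attribute) pairs, the names `precision_recall` and `model_is_valid` are hypothetical, and a recall of zero is treated as an invalid model because the ratio is then undefined.

```python
def precision_recall(predicted, expected):
    """Compute precision and recall over sets of (segment, attribute) pairs.

    Assumption: the claimed "accuracy rate" is interpreted as precision.
    """
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def model_is_valid(precision, recall, threshold):
    """Claim 4 rule: valid when precision / recall >= threshold."""
    if recall == 0:
        # The ratio is undefined; this sketch treats that case as invalid.
        return False
    return precision / recall >= threshold


# Worked example with made-up test data and a hypothetical threshold of 1.0.
predicted = {
    ("2019-15", "doc_number"),
    ("City Planning Office", "issuer"),
    ("next month", "date"),
}
expected = {
    ("2019-15", "doc_number"),
    ("City Planning Office", "issuer"),
    ("2019-01-25", "date"),
    ("Notice", "doc_type"),
}
precision, recall = precision_recall(predicted, expected)
valid = model_is_valid(precision, recall, threshold=1.0)
```

In this example two of the three predicted pairs are correct (precision 2/3) and two of the four expected pairs are found (recall 1/2), so the ratio is 4/3 and the model passes a threshold of 1.0.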
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910074336.5A CN109815500A (en) | 2019-01-25 | 2019-01-25 | Management method, device, computer equipment and the storage medium of unstructured official document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109815500A true CN109815500A (en) | 2019-05-28 |
Family
ID=66604984
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910074336.5A Pending CN109815500A (en) | 2019-01-25 | 2019-01-25 | Management method, device, computer equipment and the storage medium of unstructured official document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109815500A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8407217B1 (en) * | 2010-01-29 | 2013-03-26 | Guangsheng Zhang | Automated topic discovery in documents |
CN102184262A (en) * | 2011-06-15 | 2011-09-14 | 悠易互通(北京)广告有限公司 | Web-based text classification mining system and web-based text classification mining method |
CN103310025A (en) * | 2013-07-08 | 2013-09-18 | 北京邮电大学 | Unstructured-data description method and device |
CN107992597A (en) * | 2017-12-13 | 2018-05-04 | 国网山东省电力公司电力科学研究院 | A kind of text structure method towards electric network fault case |
CN108228101A (en) * | 2017-12-28 | 2018-06-29 | 北京盛和大地数据科技有限公司 | A kind of method and system for managing data |
CN108416032A (en) * | 2018-03-12 | 2018-08-17 | 腾讯科技(深圳)有限公司 | A kind of file classification method, device and storage medium |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Recognition methods, electronic device and the readable storage medium storing program for executing of text key message |
CN108875067A (en) * | 2018-06-29 | 2018-11-23 | 北京百度网讯科技有限公司 | text data classification method, device, equipment and storage medium |
CN108920656A (en) * | 2018-07-03 | 2018-11-30 | 龙马智芯(珠海横琴)科技有限公司 | Document properties description content extracting method and device |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541373A (en) * | 2019-09-20 | 2021-03-23 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method and related equipment |
WO2021051957A1 (en) * | 2019-09-20 | 2021-03-25 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method, and related device |
CN112541373B (en) * | 2019-09-20 | 2023-10-31 | 北京国双科技有限公司 | Judicial text recognition method, text recognition model obtaining method and related equipment |
CN112948347A (en) * | 2019-12-11 | 2021-06-11 | 北京懿医云科技有限公司 | Text data structuring processing method, device, equipment and storage medium |
CN112507968A (en) * | 2020-12-24 | 2021-03-16 | 成都网安科技发展有限公司 | Method and device for identifying official document text based on feature association |
CN112507968B (en) * | 2020-12-24 | 2024-03-05 | 成都网安科技发展有限公司 | Document text recognition method and device based on feature association |
CN113449525A (en) * | 2021-07-08 | 2021-09-28 | 安徽商信政通信息技术股份有限公司 | Intelligent file transfer method and system based on entity identification |
CN113656353A (en) * | 2021-08-03 | 2021-11-16 | 煤炭科学研究总院 | BIM model processing method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: Room 1901, building 1, No. 1782 Jiangling Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province Applicant after: HANGZHOU LVWAN NETWORK TECHNOLOGY Co.,Ltd. Address before: 2, No. 2630, building 2, superior Science Park, No. 310026 South Ring Road, Hangzhou, Binjiang District, Zhejiang, China Applicant before: HANGZHOU LVWAN NETWORK TECHNOLOGY Co.,Ltd. |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190528 |