CN109815500A - Method, apparatus, computer device and storage medium for managing unstructured official documents - Google Patents

Method, apparatus, computer device and storage medium for managing unstructured official documents

Info

Publication number
CN109815500A
Authority
CN
China
Prior art keywords
unstructured
official document
identified
identification model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910074336.5A
Other languages
Chinese (zh)
Inventor
吴雄辉
王丽娟
秦锋剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Green Bay Network Technology Co Ltd
Original Assignee
Hangzhou Green Bay Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Green Bay Network Technology Co Ltd
Priority to CN201910074336.5A
Publication of CN109815500A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application proposes a method, an apparatus, a computer device and a storage medium for managing unstructured official documents. The method includes: obtaining an unstructured official document to be identified; identifying the document according to a preset identification model to obtain the attribute information in the document; and storing the document according to the attribute information. This improves the effectiveness and accuracy of unstructured official document management.

Description

Method, apparatus, computer device and storage medium for managing unstructured official documents
Technical field
This application relates to the technical field of e-government, and in particular to a method, an apparatus, a computer device and a storage medium for managing unstructured official documents.
Background
At present, government documents are usually handled in one of two ways: by management means or by technical solutions. With management means, every department that issues documents turns each issued document into a managed object, mainly by manually entering the document abstract, receiving department, personnel involved, contact details and so on into a management system. This approach is inefficient, and without dedicated staff to do the entry, historical documents are abandoned, so the many documents of parallel and cross-cutting departments across the whole government system are never effectively entered. Technical solutions mainly enter documents in full and support simple match queries over parts of documents or their content. They provide no effective identification or organized management during the management process, and they cannot manage the relationships between documents of cross-cutting and parallel departments or the associations between documents. Neither approach, therefore, manages official documents effectively.
Summary of the invention
The application aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, the application proposes a method, an apparatus and a storage medium for managing unstructured official documents, to solve the technical problem that unstructured official documents cannot be managed effectively in the prior art.
To achieve the above object, an embodiment of the first aspect of the application proposes a method for managing unstructured official documents, including:
obtaining an unstructured official document to be identified;
identifying the unstructured official document to be identified according to a preset identification model, to obtain the attribute information in the unstructured official document to be identified;
storing the unstructured official document to be identified according to the attribute information.
The method of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This improves the effectiveness and accuracy of unstructured official document management.
To achieve the above object, an embodiment of the second aspect of the application proposes an apparatus for managing unstructured official documents, including:
an acquisition module, for obtaining an unstructured official document to be identified;
an identification module, for identifying the unstructured official document to be identified according to a preset identification model, to obtain the attribute information in the unstructured official document to be identified;
a storage module, for storing the unstructured official document to be identified according to the attribute information.
The apparatus of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This improves the effectiveness and accuracy of unstructured official document management.
To achieve the above object, an embodiment of the third aspect of the application proposes a computer device, including a processor and a memory, wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the method for managing unstructured official documents described in the embodiment of the first aspect.
To achieve the above object, an embodiment of the fourth aspect of the application proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the method for managing unstructured official documents described in the embodiment of the first aspect.
To achieve the above object, an embodiment of the fifth aspect of the application proposes a computer program product; when the instructions in the computer program product are executed by a processor, they implement the method for managing unstructured official documents described in the embodiment of the first aspect.
Additional aspects and advantages of the application will be set forth in part in the following description, will in part become apparent from that description, or will be learned through practice of the application.
Brief description of the drawings
The above and/or additional aspects and advantages of the application will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow diagram of the method for managing unstructured official documents provided by Embodiment 1 of the application;
Fig. 2 is a flow diagram of the method for managing unstructured official documents provided by Embodiment 2 of the application;
Fig. 3 is a flow diagram of the method for managing unstructured official documents provided by Embodiment 3 of the application;
Fig. 4 is a flow diagram of the method for managing unstructured official documents provided by Embodiment 4 of the application;
Fig. 5 is a structural diagram of the apparatus for managing unstructured official documents provided by Embodiment 1 of the application;
Fig. 6 is a structural diagram of the apparatus for managing unstructured official documents provided by Embodiment 2 of the application;
Fig. 7 is a structural diagram of the apparatus for managing unstructured official documents provided by Embodiment 3 of the application;
Fig. 8 is a structural diagram of the apparatus for managing unstructured official documents provided by Embodiment 4 of the application;
Fig. 9 is a structural diagram of the computer device provided by an embodiment of the application.
Detailed description of the embodiments
Embodiments of the application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended to explain the application; they should not be construed as limiting it.
The method, apparatus, computer device and storage medium for managing unstructured official documents of the embodiments of the application are described below with reference to the drawings.
Fig. 1 is a flow diagram of the method for managing unstructured official documents provided by Embodiment 1 of the application.
As shown in Fig. 1, the method for managing unstructured official documents may include the following steps:
Step 101: obtain an unstructured official document to be identified.
In practice, many government documents are not stored in any fixed manner; they are unstructured, and no dedicated staff enter them effectively, so the documents cannot be managed effectively. By establishing a preset identification model, the application can effectively identify and store large numbers of unstructured official documents, improving the efficiency and accuracy of unstructured official document management.
First, the unstructured official document to be identified is obtained. It will be understood that there may be one or more unstructured official documents that are not stored in any fixed manner; one or more of them can be selected for identification according to the needs of the actual application.
Step 102: identify the unstructured official document to be identified according to the preset identification model, to obtain the attribute information in the unstructured official document to be identified.
Step 103: store the unstructured official document to be identified according to the attribute information.
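Steps 101 to 103 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `identify` function below is a hypothetical regular-expression stand-in for the preset identification model, and `DocumentStore`, `manage` and the attribute names `phone` and `organ` are invented for the example; none of them come from the application.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DocumentStore:
    """Hypothetical store that files documents under their attributes."""
    documents: dict = field(default_factory=dict)
    # Maps (attribute name, attribute value) -> list of document ids.
    index: dict = field(default_factory=dict)

    def save(self, doc_id, text, attributes):
        self.documents[doc_id] = text
        for name, values in attributes.items():
            for value in values:
                self.index.setdefault((name, value), []).append(doc_id)


def identify(text):
    """Stand-in for the preset identification model (assumption: regex rules)."""
    return {
        "phone": re.findall(r"\d{3,4}-\d{7,8}", text),
        "organ": re.findall(r"[A-Z][a-z]+ Bureau", text),
    }


def manage(doc_id, text, store):
    """Step 101 supplies the text; step 102 identifies; step 103 stores."""
    attributes = identify(text)           # step 102
    store.save(doc_id, text, attributes)  # step 103
    return attributes
```

A caller would create one `DocumentStore` and pass each incoming document through `manage`; the index then allows later lookups by attribute value.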
Specifically, the identification model is generated in advance. One possible implementation, shown in Fig. 2, includes:
Step 201: determine an annotated corpus.
Step 202: perform word segmentation on multiple unstructured training documents, to obtain multiple training tokens in each unstructured training document.
Step 203: process the annotated corpus and the training tokens according to a preset algorithm, to generate the preset identification model.
Specifically, the annotated corpus can be determined in several ways: an existing annotated corpus, such as the People's Daily corpus, can be used directly; an annotated corpus can be generated by manually selecting and annotating a number of unannotated official documents; or part of an existing annotated corpus can be combined with manually annotated material. The choice depends on the needs of the actual application.
It wherein, the use of bilstm+crf is a two-way LSTM (Long for example there are many kinds of the modes of mark Short-Term Memory, shot and long term memory network)+CRF (Conditional Random Field, condition random field) layer The two-way available context of LSTM of model information, can better deep learning, reduce the artificial ginseng of later period mark With, further increase identification model generation efficiency.
It should be noted that is stored in tagged corpus is the language material really occurred in the actual use of language Material, tagged corpus are the basic resources that linguistry is carried using electronic computer as carrier, and real corpus is needed by processing (such as analysis and processing), could become useful resource.
Wherein, it is exactly the grammatical category that each word is determined in given sentence that mark, which can be part-of-speech tagging, determines its word Property and the process that is marked, such as position attribution vector, part-of-speech tagging sequence vector, cluster or sorting algorithm etc..
It will be understood that, before the identification model is generated, multiple unstructured training documents need to be determined, and word segmentation is performed on each of them to obtain multiple training tokens in each unstructured training document.
As an example, the content of unstructured training document A is obtained and segmented. Segmentation can be performed in a preset manner, for example with NlpAnalysis, the segmenter with new-word discovery in Ansj (a Java-based Chinese word segmentation tool). More specifically, the corresponding Ansj jar (a software package file format) is imported and the NlpAnalysis segmentation method is executed. It can identify out-of-vocabulary words, performs well at recognizing person names, organization names and numbers, and supports user-defined dictionaries.
Chinese word segmentation means cutting a sequence of Chinese characters into individual words; that is, recombining a continuous character sequence into a word sequence according to a given specification.
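To make the idea of cutting a character sequence into words concrete, the following sketch uses forward maximum matching against a small dictionary. This is a swapped-in textbook technique chosen for brevity, not the NlpAnalysis segmenter from Ansj that the description names, and it has no new-word discovery; the function name and the sample dictionary are assumptions for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, preferring the longest dictionary word.

    Characters that start no dictionary word are emitted as
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

With the dictionary `{"杭州", "市政府", "发布", "通知"}`, the string `杭州市政府发布通知` is split into `杭州`, `市政府`, `发布` and `通知`.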
Finally, the annotated corpus and the training tokens are processed by the preset algorithm to generate the preset identification model. The preset algorithm can be chosen as needed, for example a deep neural network model or a program algorithm written in advance.
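As a minimal sketch of step 203, the code below "trains" a model by recording, for each token in an annotated corpus, the label it most often carries. This majority-vote lookup is an assumption made to keep the example short; it stands in for the preset algorithm and is not the deep neural network or BiLSTM+CRF model mentioned in the description. The labels `PER`, `ORG` and `O` and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_tagger(annotated_corpus):
    """Build a token -> most frequent label table from annotated sentences.

    annotated_corpus is a list of sentences, each a list of
    (token, label) pairs.
    """
    counts = defaultdict(Counter)
    for sentence in annotated_corpus:
        for token, label in sentence:
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def tag(model, tokens, default="O"):
    """Label segmented tokens; unseen tokens receive the default label."""
    return [(token, model.get(token, default)) for token in tokens]
```

The tokens produced by segmenting a training or test document can be passed to `tag` to obtain candidate attribute labels.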
Specifically, the generated identification model stores the attribute information corresponding to different corpus entries or different tokens, such as location information, name information and contact information. Identifying the unstructured official document with the preset identification model therefore yields the attribute information in the document, so the document can be stored according to that attribute information.
The attribute information can be determined according to the needs of the actual application; for example, names, contact details, document abstracts and affiliations can be extracted from the unstructured official document.
For example, the issuing or receiving organ can be extracted to analyze which organizations a document goes to; a personnel dictionary can be extracted to classify personnel by attribute; and the document abstract can be extracted to group similar documents.
To further ensure the quality of stored documents, non-compliant documents can also be filtered out and placed in a specific category for subsequent handling.
It should be noted that, to ensure the validity of the identification model, the model also needs to be tested after it is generated. As shown in Fig. 3, this includes:
Step 301: obtain an unstructured official document to be tested.
Step 302: perform word segmentation on the unstructured official document to be tested, to obtain multiple test tokens in the document.
Step 303: identify the test tokens according to the preset identification model, to obtain a test value.
Step 304: judge the validity of the preset identification model according to the test value and a preset threshold.
Specifically, after the identification model is generated, an unstructured official document to be tested is determined and segmented; the detailed process is as described for step 202. The test tokens are then identified according to the preset identification model, which makes it possible to determine how many attribute items were identified and whether they are correct, and thus to determine the test value. Finally, the test value is compared with the preset threshold to judge the validity of the preset identification model.
As an example, the test value includes precision (the accuracy rate) and recall. The ratio of precision to recall is obtained, and if the ratio is greater than or equal to the preset threshold, the preset identification model is determined to be valid.
For example, the identification model may be LDA (Latent Dirichlet Allocation, a document topic generation model), also known as a three-layer Bayesian probability model, which comprises a three-layer structure of words, topics and documents. In the identification model of the application, each word in an official document is considered to be produced by the process of "selecting a topic with a certain probability, and then selecting a word from that topic with a certain probability". The metrics are defined as follows: precision = number of correctly identified items / total number of identified items; recall = number of correctly identified items / total number of items present in the test set; F value = precision × recall × 2 / (precision + recall).
Therefore, depending on the needs of the actual application, one or more of precision, recall and F value, or a ratio between them, can be selected as the test value and compared with a corresponding preset threshold to determine the validity of the identification model.
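The three metrics and the threshold check can be sketched as follows. The function names, the representation of identified items as `(text, label)` pairs, and the choice to return 0.0 when a denominator is zero are assumptions for the example; the formulas themselves follow the definitions given above.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f_value) for identified items.

    predicted: items the model identified; gold: items actually present
    in the test set. Both are iterables of hashable items.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


def is_valid(test_value, threshold):
    """Step 304: the model is judged valid if the test value meets the threshold."""
    return test_value >= threshold
```

Any of the three returned values, or a ratio between them, can be passed to `is_valid` together with a threshold chosen for the application.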
Building on the above embodiments, after step 103 the method further includes, as shown in Fig. 4:
Step 401: obtain an extraction keyword.
Step 402: extract a target unstructured official document according to the extraction keyword.
That is, by determining an extraction keyword such as an organization or a contact detail, the target unstructured official document can be extracted from the multiple unstructured official documents that have been stored according to their attribute information, which improves the accuracy of unstructured official document extraction.
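Steps 401 and 402 can be sketched as a lookup over an attribute index. The index layout (a dictionary from `(attribute name, attribute value)` to document ids), the substring match and the function name `extract` are assumptions for the example, not details given in the application.

```python
def extract(index, documents, keyword):
    """Return stored documents whose attribute values contain the keyword.

    index: dict mapping (attribute name, attribute value) -> list of doc ids.
    documents: dict mapping doc id -> document text.
    """
    doc_ids = []
    for (_name, value), ids in index.items():
        if keyword in value:
            for doc_id in ids:
                # Keep first-seen order and avoid returning a document twice.
                if doc_id not in doc_ids:
                    doc_ids.append(doc_id)
    return [documents[doc_id] for doc_id in doc_ids]
```

Because the lookup runs over attribute values rather than full text, a keyword such as an organization name only matches documents in which that organization was actually identified as an attribute.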
This addresses the most difficult parts of official document management, namely document identification and object extraction, and makes subsequent hierarchical management of documents possible, so that documents can be managed effectively as they are received and issued.
The method of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This makes the management of unstructured official documents effective and accurate.
To implement the above embodiments, the application also proposes an apparatus for managing unstructured official documents.
Fig. 5 is a structural diagram of the apparatus for managing unstructured official documents provided by Embodiment 1 of the application.
As shown in Fig. 5, the apparatus 50 for managing unstructured official documents includes an acquisition module 510, an identification module 520 and a storage module 530, wherein:
the acquisition module 510 is for obtaining an unstructured official document to be identified;
the identification module 520 is for identifying the unstructured official document to be identified according to a preset identification model, to obtain the attribute information in the unstructured official document to be identified;
the storage module 530 is for storing the unstructured official document to be identified according to the attribute information.
Further, in one possible implementation of the embodiment of the application, as shown in Fig. 6 and on the basis of the embodiment shown in Fig. 5, the apparatus further includes a determining module 540, a word segmentation module 550 and a generation module 560.
The determining module 540 is for determining an annotated corpus.
The word segmentation module 550 is for performing word segmentation on multiple unstructured training documents, to obtain multiple training tokens in each unstructured training document.
The generation module 560 is for processing the annotated corpus and the training tokens according to a preset algorithm, to generate the preset identification model.
In one possible implementation of the embodiment of the application, as shown in Fig. 7 and on the basis of the embodiment shown in Fig. 6, the apparatus further includes a judgment module 570, wherein:
the acquisition module 510 is also for obtaining an unstructured official document to be tested;
the word segmentation module 550 is also for performing word segmentation on the unstructured official document to be tested, to obtain multiple test tokens in the document;
the identification module 520 is also for identifying the test tokens according to the preset identification model, to obtain a test value;
the judgment module 570 is for judging the validity of the preset identification model according to the test value and a preset threshold.
In one possible implementation of the embodiment of the application, the test value includes precision (the accuracy rate) and recall; the judgment module 570 is specifically for obtaining the ratio of precision to recall and, if the ratio is greater than or equal to the preset threshold, determining that the preset identification model is valid.
In one possible implementation of the embodiment of the application, as shown in Fig. 8 and on the basis of the embodiment shown in Fig. 5, the apparatus further includes an extraction module 580.
The acquisition module 510 is also for obtaining an extraction keyword.
The extraction module 580 is for extracting a target unstructured official document according to the extraction keyword.
It should be noted that the foregoing explanation of the method embodiments also applies to the apparatus of this embodiment; the implementation principle is similar and is not repeated here.
The apparatus of the embodiment of the application obtains an unstructured official document to be identified, identifies it according to a preset identification model to obtain the attribute information in the document, and stores the document according to the attribute information. This makes the management of unstructured official documents effective and accurate.
To implement the above embodiments, the application also proposes a computer device, including a processor and a memory, wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the method for managing unstructured official documents described in the foregoing embodiments.
Fig. 9 is a structural diagram of the computer device provided by an embodiment of the application, showing a block diagram of an exemplary computer device 90 suitable for implementing embodiments of the application. The computer device 90 shown in Fig. 9 is only an example and should not impose any restriction on the function or scope of use of the embodiments of the application.
As shown in Fig. 9, the computer device 90 takes the form of a general-purpose computing device. The components of the computer device 90 may include, but are not limited to: one or more processors or processing units 906, a system memory 910, and a bus 908 connecting the different system components (including the system memory 910 and the processing unit 906).
The bus 908 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The computer device 90 typically includes a variety of computer-system-readable media. These media can be any available media that the computer device 90 can access, including volatile and non-volatile media, and removable and non-removable media.
The system memory 910 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 911 and/or cache memory 912. The computer device 90 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 913 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 9, commonly called a "hard disk drive"). Although not shown in Fig. 9, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") can be provided, as can an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a Compact Disc Read Only Memory (CD-ROM), a Digital Video Disc Read Only Memory (DVD-ROM) or other optical media). In these cases, each drive can be connected to the bus 908 through one or more data media interfaces. The system memory 910 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the application.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take many forms, including, but not limited to, an electromagnetic signal, an optical signal or any suitable combination of these. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium can be transmitted over any suitable medium, including, but not limited to, wireless, wire, optical cable, RF and so on, or any suitable combination of these.
The computer program code for carrying out the operations of the application can be written in one or more programming languages or a combination of them, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
A program/utility 914 having a set of (at least one) program modules 9140 can be stored in, for example, the system memory 910. Such program modules 9140 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 9140 generally carry out the functions and/or methods of the embodiments described in the application.
The computer device 90 can also communicate with one or more external devices 10 (such as a keyboard, a pointing device, a display 100 and so on), with one or more devices that enable a user to interact with the computer device 90, and/or with any device (such as a network card or a modem) that enables the computer device 90 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 902. The computer device 90 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network, for example the Internet) through a network adapter 900. As shown in Fig. 9, the network adapter 900 communicates with the other modules of the computer device 90 through the bus 908. It should be understood that, although not shown in Fig. 9, other hardware and/or software modules can be used in conjunction with the computer device 90, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
The processing unit 906 runs the programs stored in the system memory 910 to execute various functional applications and data processing, for example to implement the method for managing unstructured official documents described in the foregoing embodiments.
To implement the above embodiments, the application also proposes a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, it implements the management method for unstructured official documents described in the foregoing embodiments.
To implement the above embodiments, the application also proposes a computer program product. When the instructions in the computer program product are executed by a processor, the management method for unstructured official documents described in the foregoing embodiments is implemented.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with one another, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance, or as implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the application, "multiple" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the application belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions that may be considered to implement logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps carried by the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the application may be integrated in one processing module, each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the application have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the application; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the application.

Claims (10)

1. A management method for unstructured official documents, characterized by comprising the following steps:
Obtaining an unstructured official document to be identified;
Identifying the unstructured official document to be identified according to a preset identification model, to obtain attribute information in the unstructured official document to be identified;
Storing the unstructured official document to be identified according to the attribute information.
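For illustration only, the three steps of claim 1 (obtain, identify, store) can be sketched as below. The claims do not specify the identification model or the storage layout, so the regex rules in `RULES`, the attribute names `doc_number` and `issuer`, and the in-memory `storage` index are all hypothetical stand-ins; a trained model and a real document store would take their place.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "preset identification model": two regex rules.
# The claim does not say what the model is; a trained model would replace this.
RULES = {
    "doc_number": re.compile(r"No\.\s*(\d+)"),
    "issuer": re.compile(r"Issued by:\s*(.+)"),
}


def recognize(text):
    """Return the attribute information found in one document."""
    attributes = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            attributes[name] = match.group(1).strip()
    return attributes


def store(storage, text, attributes):
    """Index the document under each (attribute name, value) pair."""
    for item in attributes.items():
        storage[item].append(text)


def manage(storage, text):
    """Obtain -> identify -> store, mirroring the three claimed steps."""
    attributes = recognize(text)
    store(storage, text, attributes)
    return attributes


storage = defaultdict(list)
doc = "Notice No. 12\nIssued by: City Office\nBody text of the notice."
attrs = manage(storage, doc)
print(attrs)
```

Indexing by (attribute name, value) pairs is one possible design; it makes later lookup by attribute a single dictionary access.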
2. The method according to claim 1, characterized in that, before identifying the multiple word segments according to the preset identification model to obtain the attribute information in the unstructured official document to be identified, the method further comprises:
Determining a tagged corpus;
Performing word segmentation on multiple training unstructured official documents, to obtain multiple training word segments in each training unstructured official document;
Processing the tagged corpus and the multiple training word segments according to a preset algorithm, to generate the preset identification model.
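As a rough illustration of the training steps in claim 2, the sketch below builds a lookup model from a tagged corpus and segmented training documents. The claim leaves the "preset algorithm" open; a most-frequent-tag baseline is swapped in here purely as an assumption, and the whitespace `segment` function, the label names, and the sample data are hypothetical (Chinese text would need a real word segmenter).

```python
from collections import Counter, defaultdict


def segment(text):
    # Assumption: whitespace tokenization stands in for word segmentation.
    return text.split()


def train(tagged_corpus, training_docs):
    """Build a token -> label lookup using a most-frequent-tag baseline."""
    label_counts = defaultdict(Counter)
    for token, label in tagged_corpus:
        label_counts[token][label] += 1

    # Keep only tokens that actually occur in the segmented training documents.
    vocabulary = {tok for doc in training_docs for tok in segment(doc)}
    return {
        token: counts.most_common(1)[0][0]
        for token, counts in label_counts.items()
        if token in vocabulary
    }


tagged_corpus = [
    ("Office", "ISSUER"),
    ("Office", "ISSUER"),
    ("Office", "OTHER"),
    ("Notice", "DOC_TYPE"),
    ("Unused", "OTHER"),
]
training_docs = ["Notice from the City Office", "Office Notice"]
model = train(tagged_corpus, training_docs)
print(model)
```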
3. The method according to claim 2, characterized in that, after generating the preset identification model, the method further comprises:
Obtaining an unstructured official document to be tested;
Performing word segmentation on the unstructured official document to be tested, to obtain multiple test word segments in the unstructured official document to be tested;
Identifying the multiple test word segments according to the preset identification model, to obtain a test value;
Judging the validity of the preset identification model according to the test value and a preset threshold.
4. The method according to claim 3, characterized in that the test value includes an accuracy rate and a recall rate;
Judging the validity of the preset identification model according to the test value and the preset threshold comprises:
Obtaining the ratio of the accuracy rate to the recall rate;
If the ratio is greater than or equal to the preset threshold, determining that the preset identification model is valid.
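The validity check of claims 3 and 4 can be illustrated with a small sketch. It assumes that the "accuracy rate" corresponds to precision over identified attributes, which the claims do not state; the function names, the sample attribute sets, and the handling of a zero recall rate are likewise assumptions made for the example.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (attribute, value) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def is_model_valid(precision, recall, threshold):
    """Apply the claimed rule: the model is valid if precision / recall >= threshold."""
    if recall == 0:
        # Assumption: an undefined ratio is treated as an invalid model.
        return False
    return precision / recall >= threshold


predicted = {("issuer", "City Office"), ("doc_number", "12"), ("date", "2019")}
gold = {
    ("issuer", "City Office"),
    ("doc_number", "12"),
    ("title", "Notice"),
    ("date", "2018"),
}
precision, recall = precision_recall(predicted, gold)
print(is_model_valid(precision, recall, threshold=1.0))
```

Here two of the three predicted pairs are correct and two of the four gold pairs are found, so the ratio is (2/3) / (1/2), roughly 1.33.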
5. The method according to claim 1, characterized in that, after storing the unstructured official document to be identified according to the attribute information corresponding to the target word segments, the method further comprises:
Obtaining an extraction keyword;
Extracting a target unstructured official document according to the extraction keyword.
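A minimal sketch of the keyword-based extraction in claim 5 follows. It assumes that stored documents are kept as a mapping from a document identifier to its attribute information and that a keyword matches when it appears as a substring of any attribute value; the claims specify neither, and the names `extract` and `document_store` are hypothetical.

```python
def extract(document_store, keyword):
    """Return the ids of stored documents whose attribute values contain the keyword."""
    return [
        doc_id
        for doc_id, attributes in document_store.items()
        if any(keyword in value for value in attributes.values())
    ]


# Hypothetical store: document id -> attribute information saved at storage time.
document_store = {
    "doc-1": {"issuer": "City Office", "doc_number": "12"},
    "doc-2": {"issuer": "County Bureau", "doc_number": "7"},
}
print(extract(document_store, "City"))
```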
6. A management device for unstructured official documents, characterized by comprising:
An obtaining module, configured to obtain an unstructured official document to be identified;
An identification module, configured to identify the unstructured official document to be identified according to a preset identification model, to obtain attribute information in the unstructured official document to be identified;
A storage module, configured to store the unstructured official document to be identified according to the attribute information.
7. The device according to claim 6, characterized by further comprising:
A determining module, configured to determine a tagged corpus;
A word segmentation module, configured to perform word segmentation on multiple training unstructured official documents, to obtain multiple training word segments in each training unstructured official document;
A generation module, configured to process the tagged corpus and the multiple training word segments according to a preset algorithm, to generate the preset identification model.
8. The device according to claim 7, characterized in that:
The obtaining module is further configured to obtain an unstructured official document to be tested;
The word segmentation module is further configured to perform word segmentation on the unstructured official document to be tested, to obtain multiple test word segments in the unstructured official document to be tested;
The identification module is further configured to identify the multiple test word segments according to the preset identification model, to obtain a test value;
The device further comprises a judgment module, configured to judge the validity of the preset identification model according to the test value and a preset threshold.
9. A computer equipment, characterized by comprising a processor and a memory;
Wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the management method for unstructured official documents according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the management method for unstructured official documents according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074336.5A CN109815500A (en) 2019-01-25 2019-01-25 Management method, device, computer equipment and the storage medium of unstructured official document

Publications (1)

Publication Number Publication Date
CN109815500A true CN109815500A (en) 2019-05-28

Family

ID=66604984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074336.5A Pending CN109815500A (en) 2019-01-25 2019-01-25 Management method, device, computer equipment and the storage medium of unstructured official document

Country Status (1)

Country Link
CN (1) CN109815500A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184262A (en) * 2011-06-15 2011-09-14 悠易互通(北京)广告有限公司 Web-based text classification mining system and web-based text classification mining method
US8407217B1 (en) * 2010-01-29 2013-03-26 Guangsheng Zhang Automated topic discovery in documents
CN103310025A (en) * 2013-07-08 2013-09-18 北京邮电大学 Unstructured-data description method and device
CN107992597A (en) * 2017-12-13 2018-05-04 国网山东省电力公司电力科学研究院 A kind of text structure method towards electric network fault case
CN108228101A (en) * 2017-12-28 2018-06-29 北京盛和大地数据科技有限公司 A kind of method and system for managing data
CN108416032A (en) * 2018-03-12 2018-08-17 腾讯科技(深圳)有限公司 A kind of file classification method, device and storage medium
CN108664473A (en) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Recognition methods, electronic device and the readable storage medium storing program for executing of text key message
CN108875067A (en) * 2018-06-29 2018-11-23 北京百度网讯科技有限公司 text data classification method, device, equipment and storage medium
CN108920656A (en) * 2018-07-03 2018-11-30 龙马智芯(珠海横琴)科技有限公司 Document properties description content extracting method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541373A (en) * 2019-09-20 2021-03-23 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method and related equipment
WO2021051957A1 (en) * 2019-09-20 2021-03-25 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method, and related device
CN112541373B (en) * 2019-09-20 2023-10-31 北京国双科技有限公司 Judicial text recognition method, text recognition model obtaining method and related equipment
CN112948347A (en) * 2019-12-11 2021-06-11 北京懿医云科技有限公司 Text data structuring processing method, device, equipment and storage medium
CN112507968A (en) * 2020-12-24 2021-03-16 成都网安科技发展有限公司 Method and device for identifying official document text based on feature association
CN112507968B (en) * 2020-12-24 2024-03-05 成都网安科技发展有限公司 Document text recognition method and device based on feature association
CN113449525A (en) * 2021-07-08 2021-09-28 安徽商信政通信息技术股份有限公司 Intelligent file transfer method and system based on entity identification
CN113656353A (en) * 2021-08-03 2021-11-16 煤炭科学研究总院 BIM model processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 1901, building 1, No. 1782 Jiangling Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: HANGZHOU LVWAN NETWORK TECHNOLOGY Co.,Ltd.

Address before: 2, No. 2630, building 2, superior Science Park, No. 310026 South Ring Road, Hangzhou, Binjiang District, Zhejiang, China

Applicant before: HANGZHOU LVWAN NETWORK TECHNOLOGY Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190528
