CN110175322A - Method and device for structuring a document - Google Patents

Method and device for structuring a document

Info

Publication number
CN110175322A
Authority
CN
China
Prior art keywords
sentence
template
similarity
title
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910430088.3A
Other languages
Chinese (zh)
Inventor
晋耀红
李健铨
赵红红
陈夏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dinfo Beijing Science Development Co ltd
Original Assignee
Beijing Shenzhou Taiyue Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenzhou Taiyue Software Co Ltd
Priority to CN201910430088.3A
Publication of CN110175322A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a method and device for structuring a document. The method comprises: dividing a document to be structured into several single-chapter documents according to a text structure recognition model; calculating the similarity between the chapter title and each template name in a structured template to obtain an adapted template name; calculating the similarity between the elements corresponding to the adapted template name and the subordinate sentences of the corresponding chapter title to obtain adapted sentences; and filling the adapted sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document. The method and device can thus accurately divide an unstructured document according to a preset structured template and accurately generate a structured document whose content corresponds to the template names and elements, which helps ensure the accuracy of subsequently determining key points.

Description

Method and device for structuring a document
Technical field
This application relates to the technical field of natural language processing, and in particular to a method and device for structuring a document.
Background
Document review means reviewing the content and format of a document according to laws, regulations, and the agreement of the parties. In general, the key points in the document must be identified, and each key point is reviewed for the legality of its wording and for potential legal risk.
At present, document review is mainly manual: after reading the document, a reviewer relies on professional experience to locate each key point, judge whether its wording is lawful, and judge whether it carries legal risk. However, documents are usually long, so reviewers spend a great deal of time reading. Because the content is extensive, it is difficult for a reviewer to identify every key point without omission and to assign each key point to the proper category. In addition, a reviewer's experience is highly subjective, which greatly reduces the accuracy of judging whether the wording of a key point carries legal risk.
To improve the efficiency and accuracy of document review, automated review, in which a computer reviews the document, is gradually being adopted. Through semantic analysis, keywords can be determined objectively from the whole document, and these keywords serve as the key points. However, documents are long, and running semantic analysis over the whole document at once takes a long time. Moreover, documents come in many forms, and it is hard for a computer to understand and recognize the various unstructured documents. This increases the difficulty of semantic analysis of the whole document and greatly reduces the accuracy of determining key points.
Summary of the invention
This application provides a method and device for structuring a document, to solve the problem that existing computers review unstructured documents with low accuracy.
In a first aspect, an embodiment of this application provides a method for structuring a document, comprising:
dividing a document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to that chapter title;
calculating the similarity between the chapter title and each template name in a structured template to obtain an adapted template name, wherein the structured template consists of template names, the elements corresponding to each template name, and the fillable region corresponding to each template name, and the adapted template name is the template name whose similarity to the chapter title is greater than a preset title similarity threshold;
calculating the similarity between the elements corresponding to the adapted template name and the subordinate sentences of the corresponding chapter title to obtain adapted sentences, wherein an adapted sentence is a subordinate sentence whose similarity to an element is greater than a preset sentence similarity threshold; and
filling the adapted sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
In a second aspect, this application provides a device for structuring a document, comprising:
a division module, configured to divide a document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to that chapter title;
an adapted template name determination module, configured to calculate the similarity between the chapter title and each template name in a structured template to obtain an adapted template name, wherein the structured template consists of template names, the elements corresponding to each template name, and the fillable region corresponding to each template name, and the adapted template name is the template name whose similarity to the chapter title is greater than a preset title similarity threshold;
an adapted sentence determination module, configured to calculate the similarity between the elements corresponding to the adapted template name and the subordinate sentences of the corresponding chapter title to obtain adapted sentences, wherein an adapted sentence is a subordinate sentence whose similarity to an element is greater than a preset sentence similarity threshold; and
a filling module, configured to fill the adapted sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
As the above technical solution shows, this application provides a method and device for structuring a document. The structure of the document to be structured is recognized according to a text structure recognition model, and the document is divided into several single-chapter documents. A structured template is preset, and the similarity between the chapter title of each single-chapter document and each template name in the structured template is calculated to determine the adapted template name corresponding to each single-chapter document. The similarity between the elements corresponding to each adapted template name and the subordinate sentences of the corresponding chapter title is then calculated to determine the adapted sentences that match the elements. All adapted sentences are filled into the corresponding fillable regions of the structured template, which structures the document and yields a structured document: a document that has been divided in detail by template name and element and then reassembled. The method and device can thus accurately divide an unstructured document according to the preset structured template and accurately generate a structured document whose content corresponds to the template names and elements, which helps ensure the accuracy of subsequently determining key points.
Brief description of the drawings
To describe the technical solutions of this application more clearly, the drawings used in the embodiments are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a method for structuring a document provided by an embodiment of this application;
Fig. 2(1) is a schematic diagram of a document with numbering identification features provided by an embodiment of this application;
Fig. 2(2) is a schematic diagram of a document with text formatting identification features provided by an embodiment of this application;
Fig. 2(3) is a schematic diagram of a document with special character identification features provided by an embodiment of this application;
Fig. 3 is a schematic flowchart of dividing single-chapter documents provided by an embodiment of this application;
Fig. 4 is a schematic structural diagram of a structured template provided by an embodiment of this application;
Fig. 5 is a schematic diagram of a structured document provided by an embodiment of this application;
Fig. 6 is a flowchart of a method for establishing a structured template provided by an embodiment of this application;
Fig. 7 is a flowchart of another method for establishing a structured template provided by an embodiment of this application;
Fig. 8 is a flowchart of a method for dividing a document to be structured provided by an embodiment of this application;
Fig. 9 is a flowchart of a method for determining chapter titles provided by an embodiment of this application;
Fig. 10 is a flowchart of a method for weakening specified chapter titles provided by an embodiment of this application;
Fig. 11 is a flowchart of a method for determining an adapted template name provided by an embodiment of this application;
Fig. 12 is a flowchart of a method for determining an adapted template name provided by an embodiment of this application;
Fig. 13 is a flowchart of a method for calculating the similarity between subordinate sentences and template sentences provided by an embodiment of this application;
Fig. 14 is a schematic diagram of a device for structuring a document provided by an embodiment of this application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
As described in the Background, manual document review is slow, prone to omissions, and subjective, while automated review that runs semantic analysis over a whole document at once is time-consuming and handles the many forms of unstructured documents poorly. Thus, although review by existing computers is sufficiently objective, it still takes a long time and cannot process unstructured documents accurately.
To solve the above problems, the embodiments of this application provide a method and device for structuring a document.
The following are the method embodiments of this application.
Fig. 1 is a flowchart of a method for structuring a document provided by an embodiment of this application. The method can be applied to servers, personal computers (PCs), tablet computers, mobile phones, and other devices.
Referring to Fig. 1, the method includes the following steps:
S1: divide the document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to that chapter title.
The document to be structured is uploaded to the structuring device. It may take various forms, such as an electronic document, a paper document, an electronic picture, or a paper picture, and may be of various types, such as a contract, legal instrument, paper, book, or article. It may be uploaded by wired network transmission, wireless network transmission, or real-time network transmission.
A text structure recognition model is a model or tool that can detect, recognize, and locate the format of a document. One example is a model that recognizes text structure from numbering. Documents usually use numeric numbering to separate parts of the content, as shown in Fig. 2(1): following the sequence "一 (one), 1, 1.1, 1.1.1", the document can be divided into a four-level hierarchy, so the model only needs to recognize the numbering in the document to be structured to determine its structure. A second example is a model that recognizes text structure from text formatting. A document without obvious numbering may instead separate parts of the content by formatting, such as the number of blank characters before a paragraph, the number of blank characters after a paragraph, the font, or the font size. As shown in Fig. 2(2), the parts of the text differ in the number of leading blank characters and in font size. A third example is a model that recognizes text structure from special characters. A document may use special characters to separate parts of the content, as shown in Fig. 2(3): according to "*, &, #", the document can be divided into a three-level hierarchy, so the model only needs to recognize the special characters in the document to be structured to determine its structure. Alternatively, a text structure recognition model that combines the three kinds of identification features above can be used to recognize the structure of the document to be structured. Note that the embodiments of this application describe only some text structure recognition models; any other model with a text structure recognition function can serve as the text structure recognition model disclosed in this application.
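The numbering and special-character features described above can be illustrated with a small rule-based sketch. This is not the patent's recognition model: the regular expressions, the level assignments, and the function name `heading_level` are assumptions chosen to mirror the "一, 1, 1.1, 1.1.1" and "*, &, #" examples, and formatting features such as font size are not covered.

```python
import re
from typing import Optional

# Hypothetical numbering patterns, checked from most to least specific so that
# "1.1.1" is not mistaken for "1.1" or "1". A space is required after Arabic numbering.
NUMBER_PATTERNS = [
    (1, re.compile(r"^[一二三四五六七八九十]+[、.]")),  # Chinese numerals: top level
    (4, re.compile(r"^\d+\.\d+\.\d+\s")),             # 1.1.1
    (3, re.compile(r"^\d+\.\d+\s")),                  # 1.1
    (2, re.compile(r"^\d+[.、]?\s")),                 # 1 or 1.
]

# Hypothetical mapping of special characters to levels, following "*, &, #".
SPECIAL_LEVELS = {"*": 1, "&": 2, "#": 3}


def heading_level(line: str) -> Optional[int]:
    """Return the hierarchy level signalled by a line's prefix, or None for body text."""
    text = line.strip()
    if not text:
        return None
    for level, pattern in NUMBER_PATTERNS:
        if pattern.match(text):
            return level
    return SPECIAL_LEVELS.get(text[0])
```

A line such as "1.1 Scope" maps to level 3 and "# Item" to level 3 under these assumed rules, while an ordinary sentence yields `None`. A production recognizer would also need to guard against body sentences that happen to start with a number.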
After the text structure recognition model has recognized and determined the identification features of each part of the document to be structured, these features can be used as split points to divide the document into several single-chapter documents. As shown in Fig. 3, each single-chapter document includes a chapter title and the subordinate sentences corresponding to that chapter title, and the chapter title may carry an identification feature.
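A minimal sketch of this splitting step follows. It assumes the document is already available as a list of text lines and that some predicate can tell whether a line is a chapter title; the names `ChapterDoc`, `split_into_chapters`, and `is_title` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class ChapterDoc:
    """One single-chapter document: a chapter title plus its subordinate sentences."""
    title: str
    sentences: List[str] = field(default_factory=list)


def split_into_chapters(lines: Iterable[str],
                        is_title: Callable[[str], bool]) -> List[ChapterDoc]:
    """Use title lines as split points; every other line joins the current chapter."""
    chapters: List[ChapterDoc] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if is_title(line):
            chapters.append(ChapterDoc(title=line))
        elif chapters:
            chapters[-1].sentences.append(line)
        # Text that appears before the first title is ignored in this sketch.
    return chapters
```

Passing the title test in as a function keeps the split independent of which identification feature (numbering, formatting, or special characters) the recognition model relies on.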
The single-chapter documents obtained in S1 still have a scattered structure and order and contain a large amount of useless information. They therefore need to be rearranged into a document that follows the structure and order of the structured template and from which the useless information has been removed. This lets the reviewing computer understand the document quickly and determine its key points accurately.
S2: calculate the similarity between the chapter title and each template name in the structured template to obtain an adapted template name. The structured template consists of template names, the elements corresponding to each template name, and the fillable region corresponding to each template name. The adapted template name is the template name whose similarity to the chapter title is greater than a preset title similarity threshold.
A structured template is a preset document template whose structure matches the logic by which the reviewing computer understands documents. As shown in Fig. 4, a structured template includes template names, the elements corresponding to each template name, and the fillable region corresponding to each template name. A template name corresponds to a chapter title in a single-chapter document, an element corresponds to the semantics of the subordinate sentences in a single-chapter document, and a fillable region corresponds to the subordinate sentences in a single-chapter document.
To match each single-chapter document to a suitable template name, the representative of the single-chapter document, namely its chapter title, is matched against each template name. By calculating the similarity between the chapter title and each template name, the template name that satisfies the matching rule can be determined as the adapted template name of that chapter title, and the single-chapter document containing the chapter title is assigned to that adapted template name. The matching rule may be, for example, the highest similarity or a similarity greater than a preset value. In this way, each single-chapter document as a whole is preliminarily matched to a template name.
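The title-matching rule can be sketched as follows. The source does not fix a similarity measure at this point, so the sketch substitutes character-level matching from Python's standard `difflib` as a stand-in; the function names and the default threshold of 0.6 are assumptions. It combines both matching rules mentioned above by returning the highest-scoring template name, provided its score exceeds the threshold.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def title_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a semantic model could be swapped in here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def adapted_template_name(chapter_title: str,
                          template_names: List[str],
                          threshold: float = 0.6) -> Optional[str]:
    """Return the best-scoring template name above the threshold, or None if there is none."""
    best_name: Optional[str] = None
    best_score = threshold
    for name in template_names:
        score = title_similarity(chapter_title, name)
        if score > best_score:  # strictly greater than the preset threshold
            best_name, best_score = name, score
    return best_name
```

Returning `None` when no template name clears the threshold leaves the caller free to skip the chapter or flag it for manual handling; the source does not say which is intended.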
S3: calculate the similarity between the elements corresponding to the adapted template name and the subordinate sentences of the corresponding chapter title to obtain adapted sentences. An adapted sentence is a subordinate sentence whose similarity to an element is greater than a preset sentence similarity threshold.
The subordinate sentences of a single-chapter document include some useless information. This information is usually unrelated to the legal review points, but its presence consumes review time, so it needs to be removed from the subordinate sentences. Put another way, the sentences relevant to the key points, namely the adapted sentences, are selected from the subordinate sentences.
To determine the adapted sentences, the similarity between each subordinate sentence in the single-chapter document and each element can be calculated, and every subordinate sentence whose similarity is greater than the preset sentence similarity threshold is taken as an adapted sentence.
For example, the subordinate sentences in a single-chapter document are:
Has the right to require Party B to submit technical service results as agreed in this contract.
Pay Party B remuneration as agreed.
Provide Party A with the corresponding working conditions.
The elements are:
Party A's right, Party A's obligation.
By calculation, the similarity between "Has the right to require Party B to submit technical service results as agreed in this contract." and "Party A's right" is greater than the preset sentence similarity threshold, and the similarity between "Pay Party B remuneration as agreed." and "Party A's obligation" is greater than the preset sentence similarity threshold. These two subordinate sentences are therefore adapted sentences. The similarity between "Provide Party A with the corresponding working conditions." and every element, however, does not reach the preset sentence similarity threshold, so this subordinate sentence is useless information here.
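The selection in this example can be sketched as a threshold filter. The similarity function is passed in as a parameter because the source does not specify it here; `TOY_SCORES` holds made-up numbers that stand in for a semantic similarity model, and all names in the block are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def select_adapted_sentences(sentences: List[str],
                             elements: List[str],
                             similarity: Callable[[str, str], float],
                             threshold: float) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group sentences under each element they match; collect the rest as useless."""
    adapted: Dict[str, List[str]] = {element: [] for element in elements}
    discarded: List[str] = []
    for sentence in sentences:
        matched = False
        for element in elements:
            if similarity(sentence, element) > threshold:
                adapted[element].append(sentence)
                matched = True
        if not matched:
            discarded.append(sentence)
    return adapted, discarded


# Illustrative scores only: invented values standing in for a semantic model.
TOY_SCORES = {
    ("Has the right to require Party B to submit technical service results.",
     "Party A's right"): 0.9,
    ("Pay Party B remuneration as agreed.", "Party A's obligation"): 0.8,
}


def toy_similarity(sentence: str, element: str) -> float:
    """Look up an invented score; unknown pairs get a low default of 0.1."""
    return TOY_SCORES.get((sentence, element), 0.1)
```

With a threshold of 0.5 and these invented scores, the first two sentences are kept under their respective elements and the third is set aside, which matches the outcome described in the example.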
S4: fill the adapted sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
As S3 shows, each single-chapter document corresponds to a unique template name, but because one template name can correspond to several elements, a single-chapter document can contain several adapted sentences. During filling in S4, all adapted sentences of a single-chapter document must be filled into the fillable region, so that no adapted sentence is lost and the computer does not subsequently miss a legal review point. As shown in Fig. 5, the adapted sentences of every single-chapter document are matched and filled into the fillable regions of the structured template, finally forming a structured contract document. The structured document has a structure that matches the logic by which the reviewing computer understands documents, which helps the reviewing computer determine key points quickly and accurately.
Further, after filling is complete, the content and frame of the structured template can be hidden, leaving only the adapted sentences.
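The filling step and the optional hiding of the template frame can be sketched with plain dictionaries. The layout used here (template name, then element, then a list acting as the fillable region) is one assumed representation; the patent does not prescribe a data structure, and the function names are hypothetical.

```python
from typing import Dict, List

# Assumed layout: template name -> element -> list of filled sentences.
Template = Dict[str, Dict[str, List[str]]]


def new_template(names_to_elements: Dict[str, List[str]]) -> Template:
    """Create a template whose fillable regions start empty."""
    return {name: {element: [] for element in elements}
            for name, elements in names_to_elements.items()}


def fill_template(template: Template,
                  template_name: str,
                  adapted: Dict[str, List[str]]) -> None:
    """Write every adapted sentence of one chapter into its fillable region."""
    regions = template[template_name]
    for element, sentences in adapted.items():
        regions[element].extend(sentences)


def adapted_sentences_only(template: Template) -> List[str]:
    """Hide the template frame and return only the filled sentences, in template order."""
    return [sentence
            for regions in template.values()
            for sentences in regions.values()
            for sentence in sentences]
```

Because Python dictionaries preserve insertion order, the flattened output follows the order of the template names and elements rather than the order of the original document.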
As the above technical solution shows, the embodiments of this application provide a method for structuring a document. The structure of the document to be structured is recognized according to a text structure recognition model, and the document is divided into several single-chapter documents. A structured template is preset, and the similarity between the chapter title of each single-chapter document and each template name in the structured template is calculated to determine the adapted template name corresponding to each single-chapter document. The similarity between the elements corresponding to each adapted template name and the subordinate sentences of the corresponding chapter title is then calculated to determine the adapted sentences that match the elements. All adapted sentences are filled into the corresponding fillable regions of the structured template, which structures the document and yields a structured document: a document that has been divided in detail by template name and element and then reassembled. The method and device can thus accurately divide an unstructured document according to the preset structured template and accurately generate a structured document whose content corresponds to the template names and elements, which helps ensure the accuracy of subsequently determining key points.
Fig. 6 is a flowchart of a method for establishing a structured template provided by an embodiment of this application.
In one embodiment, as shown in Fig. 6, the method may include the following steps:
S101: obtain samples to be processed from a document library, a sample to be processed being a contract document that contains graded titles and the subordinate sentences corresponding to the graded titles.
When establishing a structured template, the selected samples to be processed need to have a structured format: graded titles, which serve as clear split points for dividing the sample and correspond to template names, and subordinate sentences corresponding to the graded titles, which serve as the basis for refining elements. The quality of the samples directly affects the quality of the structured template, so suitable samples must be screened carefully.
S102: using semantic analysis, extract the elements in the subordinate sentences, an element being a keyword in a subordinate sentence that corresponds to the graded title.
The subordinate sentences contain many sentences, some of which do not correspond to the graded title. These sentences are excluded, and keywords are extracted from the sentences that do correspond to the graded title to serve as elements.
For example, this embodiment provides the graded title:
Party A's rights and obligations;
and the subordinate sentences:
Has the right to require Party B to submit technical service results as agreed in this contract.
Pay Party B remuneration as agreed.
Provide Party A with the corresponding working conditions.
Semantic analysis shows that "Provide Party A with the corresponding working conditions." does not correspond to "Party A's rights and obligations", so keywords are extracted only from "Has the right to require Party B to submit technical service results as agreed in this contract." and "Pay Party B remuneration as agreed.", giving "Party A's right" and "Party A's obligation" respectively. The elements corresponding to "Party A's rights and obligations" are therefore "Party A's right" and "Party A's obligation".
S103: generate training samples in a format in which each template name corresponds to its elements, where the template name corresponds to the graded title.
When generating training samples, each template name must be matched accurately with its elements to avoid confusion and ensure the accuracy of the structured template.
S104: train on the training samples to generate a template to be processed, which consists of template names and the elements corresponding to each template name.
By training on a large number of training samples, a template to be processed is generated that is representative of the formats and content of the many training samples.
S105: for each template name, add a fillable region to the template to be processed to generate the structured template, the fillable region being used to hold the text content of the structured contract document.
The structured template needs regions into which the text content of the structured document can be filled. Fillable regions are therefore added to the template to be processed so that the adapted sentences can later be filled in.
This embodiment yields an accurate structured template, which ensures the accuracy of the subsequent structuring process.
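The data flow of S103 to S105 can be sketched as below. The source does not describe how training is performed, so this sketch replaces training with a plain merge of the samples, which is a deliberate simplification and not the patent's procedure. The function name and the dictionary keys `elements` and `fillable_region` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, List[str]]  # (template name, elements), as generated in S103


def build_structured_template(samples: Iterable[Sample]) -> Dict[str, dict]:
    """Merge samples per template name, then attach an empty fillable region to each."""
    merged: Dict[str, List[str]] = {}
    for name, elements in samples:
        known = merged.setdefault(name, [])
        for element in elements:
            if element not in known:  # keep each element once, in first-seen order
                known.append(element)
    # S105: add a fillable region for every template name.
    return {name: {"elements": elements, "fillable_region": []}
            for name, elements in merged.items()}
```

A learned model would presumably generalize across differently worded titles and elements, which a literal merge cannot do; the sketch only shows how template names, elements, and fillable regions relate.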
A structured template generated by S101-S105 contains only elements, so the only basis for subsequently determining adapted sentences is the element, and matching takes place between a sentence and a word. Because the two differ greatly in character count and in the range of semantics they carry, using elements alone as the basis for determining adapted sentences gives low accuracy.
Fig. 7 is a flowchart of another method for establishing a structured template provided by an embodiment of this application. The method includes:
S111: obtain samples to be processed from a document library, a sample to be processed being a contract document that contains graded titles and the subordinate sentences corresponding to the graded titles;
S112: using semantic analysis, extract the elements in the subordinate sentences, an element being a keyword in a subordinate sentence that corresponds to the graded title;
S113: generate template sentences for each element, a template sentence being a sentence that contains the element and/or whose semantic similarity to the element is greater than a preset template-sentence similarity threshold;
S114: generate training samples in a format in which each template name corresponds to its elements and template sentences;
S115: train on the training samples to generate a template to be processed, which consists of template names and the elements and template sentences corresponding to each template name;
S116: for each template name, add a fillable region to the template to be processed to generate the structured template, the fillable region being used to hold the text content of the structured document.
In this embodiment, S111 and S112 are the same as S101 and S102 in the previous embodiment; refer to the description above, which is not repeated here.
To improve the accuracy of matching between subordinate sentences and elements, the scope of the "element" is expanded: template sentences are generated for each element. A template sentence is a sentence that contains the element and/or whose semantic similarity to the element is greater than the preset template sentence similarity threshold.
For example, the element provided in this embodiment is:
Party A's rights;
and the template sentences are:
Party A has the right to require Party B to submit the technical service results as agreed in this contract.
Party A has the right to require Party B to rectify problems in its service process.
At this point, the "element" used as the basis for matching subordinate sentences has two parts: the element (keyword) and the template sentences.
The subsequent S114-S116 are the same as S103-S105 in the previous embodiment; refer to the description above, which is not repeated here.
The structured template provided by this embodiment additionally contains template sentences, which effectively improves the accuracy of the subsequent determination of adaptation sentences and thereby ensures that key points are determined accurately in the computer audit.
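As an illustration only, a structured template with template sentences can be pictured as a small record type. The sketch below assumes one possible in-memory layout and is not the representation used in the application; the names TemplateEntry and add_fillable_regions, and the template name "Rights and obligations of Party A", are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TemplateEntry:
    """One template name with its element, template sentences and fillable region."""
    template_name: str
    element: str                                               # keyword
    template_sentences: List[str] = field(default_factory=list)
    fillable_region: List[str] = field(default_factory=list)   # empty until filled


def add_fillable_regions(pending: Dict[str, dict]) -> List[TemplateEntry]:
    """Attach an empty fillable region to every template name (cf. S116)."""
    return [
        TemplateEntry(name, body["element"], list(body.get("template_sentences", [])))
        for name, body in pending.items()
    ]


# Hypothetical template to be processed, built from the example element above.
pending = {
    "Rights and obligations of Party A": {
        "element": "Party A's rights",
        "template_sentences": [
            "Party A has the right to require Party B to submit the technical service results.",
            "Party A has the right to require Party B to rectify problems in its service process.",
        ],
    }
}
structured_template = add_fillable_regions(pending)
```

Keeping the fillable region as an initially empty list makes the later fill-in step a plain append per template name.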
Fig. 8 is a flowchart of a method for dividing a document to be structured provided by an embodiment of the present application. The method includes:
S131: determining the structure type of the document to be structured by using a text structure recognition model, the structure type being either a formatted type or an unformatted type;
S132: if the structure type of the document to be structured is the formatted type, determining the chapter titles of the document to be structured by using the text structure recognition model, each chapter title consisting of a title identifier and title content;
S133: parsing the title identifier to obtain a normalized title, the normalized title being a title with a title number in a unified form;
S134: dividing the document to be structured into several single-chapter documents, using the normalized titles as split points.
Documents are of two types, formatted and unformatted. A formatted document carries a certain format logic of its own through numbers, and/or special symbols, and/or text formatting (see above); an unformatted document has no format logic at all. The text structure recognition model can detect and identify the structure type of a document. For a formatted document to be structured, the chapter titles are clearly easier to obtain, by recognizing identification features such as numbers, special symbols and text formatting. However, the chapter titles of different documents to be structured correspond to different identification features, and even the chapter titles within a single document may correspond to different identification features. Such differences in format affect the efficiency and accuracy of the structuring device. Therefore, each chapter title needs to be parsed to obtain a normalized title with a title number in a unified form.
For example, chapter titles may be graded using numbering such as "1", "one" (a Chinese numeral), "1." or "I"; using leading whitespace in a paragraph; or using special symbols. All of these are parsed into normalized titles graded with Arabic numerals. The document to be structured is then divided into several single-chapter documents, using these normalized titles as split points.
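The parsing of title identifiers into a unified Arabic form can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser of the application: the function names are hypothetical, Chinese numerals are covered only from one to ten, and Roman numerals only with the letters I, V and X.

```python
from typing import Optional

# Assumed lookup tables; a real system would cover far more forms.
CN_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
             "六": 6, "七": 7, "八": 8, "九": 9, "十": 10}
ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(token: str) -> int:
    """Convert a Roman numeral made of I, V and X to an integer."""
    total = 0
    for ch, nxt in zip(token, token[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        total += -value if nxt in ROMAN and ROMAN[nxt] > value else value
    return total


def normalize_identifier(identifier: str) -> Optional[int]:
    """Map a title identifier such as '1', '一', '1.' or 'I' to an Arabic number."""
    token = identifier.strip().rstrip(".、,) ")
    if token.isdigit():
        return int(token)
    if token in CN_DIGITS:
        return CN_DIGITS[token]
    if token and all(ch in ROMAN for ch in token):
        return roman_to_int(token)
    return None  # not a recognized identifier


def normalize_title(identifier: str, content: str) -> str:
    """Build a normalized title; the identifier is kept as-is if unrecognized."""
    number = normalize_identifier(identifier)
    return f"{number} {content}" if number is not None else f"{identifier} {content}"
```

For instance, the identifiers "1", "一", "1." and "I" all normalize to the number 1 under these tables.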
If the structure type of the document to be structured is the unformatted type, chapter titles cannot be determined directly from identification features. Therefore, as shown in Fig. 9, the chapter titles are determined by the following method:
S135: if the structure type of the document to be structured is the unformatted type, determining the chapter titles of the document to be structured according to preset regular expressions;
S136: dividing the document to be structured into several single-chapter documents, using the chapter titles as split points.
For example, the preset regular expressions are "(.*)[article|chapter]" and "(\d+)[ ,]". The document to be structured is matched against these regular expressions, and the text that matches a regular expression is taken as a chapter title.
A chapter title obtained by matching a preset regular expression has a standard format of its own, so the document to be structured can be divided directly according to the chapter titles obtained.
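Splitting an unformatted document at regular-expression matches can be sketched as below. The two patterns are hypothetical English-language stand-ins chosen for this illustration, not the preset expressions of the application, and split_by_titles is an assumed helper name.

```python
import re

# Hypothetical patterns: "Chapter N ..." / "Article N ..." and "N. ..." style headings.
TITLE_PATTERNS = [
    re.compile(r"^(?:Chapter|Article)\s+\S+"),
    re.compile(r"^\d+[.,、]\s*\S"),
]


def is_chapter_title(line: str) -> bool:
    """Return True if the line matches any preset title pattern."""
    stripped = line.strip()
    return any(p.match(stripped) for p in TITLE_PATTERNS)


def split_by_titles(text: str):
    """Split text into (chapter_title, subordinate_sentences) pairs.

    Lines that appear before the first chapter title are ignored in this sketch.
    """
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if is_chapter_title(line):
            sections.append((line, []))      # a title opens a new single-chapter document
        elif sections:
            sections[-1][1].append(line)     # other lines belong to the current chapter
    return sections


doc = (
    "Article 1 Scope\n"
    "This contract covers X.\n"
    "Article 2 Payment\n"
    "Party A pays Party B.\n"
    "Payment is due in 30 days."
)
sections = split_by_titles(doc)
```

A pattern this simple will also match a body sentence that happens to begin with "Article 3 ...", so real patterns would need to be stricter.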
Further, some documents to be structured have multi-level titles. If the document were divided at titles of every level, there would be too many single-chapter documents, which increases the time needed to determine adaptation sentences later. Moreover, too fine a division splits semantically similar text into different single-chapter documents, making it harder for the subsequent semantic analysis to determine adaptation elements and adaptation sentences. Therefore, as shown in Fig. 10, an embodiment of the present application provides a method for weakening specified chapter titles, the method including:
S137: determining titles to be weakened, a title to be weakened being a chapter title whose level is lower than a preset title level;
S138: setting the titles to be weakened to body-text level.
Chapter titles whose level is lower than the preset title level are determined as titles to be weakened.
For example, given a first-level chapter title "1, ××", a second-level chapter title "1.1, ××" and a third-level chapter title "1.1.1, ××", with the preset title level set to the second level, the third-level chapter title is determined as a title to be weakened.
The level of the titles to be weakened determined in S137 is set to body-text level, so that these titles are weakened in the document to be structured, which effectively reduces the number of divisions of the document.
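Title weakening can be sketched as follows, assuming that normalized title numbers are dot-separated (so "1.1.1" is a third-level title). The function names and the (number, text) block representation are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def title_level(number: str) -> int:
    """Level of a dotted title number: '1' -> 1, '1.1' -> 2, '1.1.1' -> 3."""
    return len([part for part in number.strip(".").split(".") if part])


def weaken_titles(blocks: List[Tuple[Optional[str], str]],
                  preset_level: int = 2) -> List[Tuple[str, str]]:
    """Demote titles deeper than preset_level to body text.

    blocks holds (number, text) pairs; number is None for ordinary body text.
    Returns (kind, text) pairs where kind is 'title' or 'body'.
    """
    result = []
    for number, text in blocks:
        if number is not None and title_level(number) <= preset_level:
            result.append(("title", f"{number} {text}"))
        else:
            # Body text, or a title below the preset level that is weakened.
            label = text if number is None else f"{number} {text}"
            result.append(("body", label))
    return result


blocks = [("1", "General"), ("1.1", "Scope"), ("1.1.1", "Details"), (None, "Some sentence.")]
weakened = weaken_titles(blocks, preset_level=2)
```

With the preset level set to two, only "1" and "1.1" remain split points; "1.1.1" stays in the text of its parent chapter.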
Fig. 11 is a flowchart of a method for determining an adaptation template name provided by an embodiment of the present application. The method includes:
S201: calculating in turn, by a semantic similarity method, the similarity between the chapter title and each template name in the structured template;
S202: determining the target template names, a target template name being a template name whose similarity is greater than a preset title similarity threshold;
S203: determining the adaptation template name, the adaptation template name being the target template name with the highest similarity among the target template names.
For example, the similarity between the chapter title and each template name in the structured template is calculated by formula (1),
wherein score denotes the similarity between the chapter title and the template name, t denotes the characters of the chapter title, o denotes the template name, and e denotes the element corresponding to o.
The sim values in formula (1) are calculated by formula (2),
wherein s(1) and s(2) denote the calculation parameters, and the two remaining pairs of parameters denote the word-level and the character-level calculation parameters, respectively.
With the two formulas above, the similarity between the chapter title and each template name can be calculated accurately. First, the target template names, whose similarity is greater than the preset title similarity threshold, are determined from all the similarities.
For example, if the similarities between the chapter title and template names A, B and C are 0.2, 0.8 and 0.6 respectively and the preset title similarity threshold is 0.5, then template names B and C are the target template names.
Then, among target template names B and C, the one with the highest similarity is determined as the adaptation template name; that is, target template name B is the adaptation template name.
The method for determining an adaptation template name provided by this embodiment determines a unique adaptation template name accurately, so that each single-chapter document is matched to the template name with the highest similarity.
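The threshold-then-maximum selection of S202 and S203 can be sketched as follows. The similarity scores are taken as given inputs, and the function name pick_adaptation_template is hypothetical.

```python
from typing import Dict, Optional


def pick_adaptation_template(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the template name with the highest similarity above the threshold.

    scores maps each template name to its similarity with the chapter title.
    Returns None when no template name exceeds the threshold.
    """
    # S202: keep only target template names whose similarity exceeds the threshold.
    targets = {name: s for name, s in scores.items() if s > threshold}
    if not targets:
        return None
    # S203: the target with the highest similarity is the adaptation template name.
    return max(targets, key=targets.get)


# The worked example from the text: A, B, C with a threshold of 0.5.
adaptation = pick_adaptation_template({"A": 0.2, "B": 0.8, "C": 0.6}, threshold=0.5)
```

If two target template names tie, max returns the one that appears first in the dictionary; the text does not say how ties should be resolved.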
Fig. 12 is a flowchart of a method for determining adaptation sentences provided by an embodiment of the present application. The method includes:
S301: calculating the similarity between a subordinate sentence and each corresponding template sentence, a corresponding template sentence being a template sentence that corresponds to the element corresponding to the subordinate sentence;
S302: calculating the average of the similarity between the element and the subordinate sentence and the similarity between the subordinate sentence and the template sentence, to obtain the final sentence similarity;
S303: determining adaptation sentences, an adaptation sentence being a subordinate sentence whose final sentence similarity is greater than a preset sentence similarity threshold.
For example, the final sentence similarity is calculated by a formula in which score denotes the final sentence similarity; sim(text, e) denotes the similarity between the subordinate sentence and the element, which can be calculated by formula (2); and multipleSim(text, s) denotes the similarity between the subordinate sentence and the template sentence.
Finally, the subordinate sentences whose final sentence similarity is greater than the preset sentence similarity threshold are determined as the adaptation sentences corresponding to the element.
The method for determining adaptation sentences provided by this embodiment takes into account the similarity of a subordinate sentence to both the element and the template sentences, which improves the accuracy with which adaptation sentences are determined.
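Steps S302 and S303 can be sketched as below. The text specifies an average of the element similarity and the template-sentence similarity; how several template sentences enter that average is not spelled out here, so this sketch assumes a plain mean over the element similarity and all template-sentence similarities. The function names are hypothetical.

```python
from typing import List, Tuple


def final_sentence_similarity(element_sim: float, template_sims: List[float]) -> float:
    """Mean of sim(text, e) and every multipleSim(text, s) (assumed averaging rule)."""
    values = [element_sim, *template_sims]
    return sum(values) / len(values)


def select_adaptation_sentences(
    candidates: List[Tuple[str, float, List[float]]], threshold: float
) -> List[str]:
    """Keep subordinate sentences whose final similarity exceeds the threshold.

    Each candidate is (sentence, element_similarity, template_sentence_similarities).
    """
    return [
        sentence
        for sentence, element_sim, template_sims in candidates
        if final_sentence_similarity(element_sim, template_sims) > threshold
    ]


candidates = [
    ("Party A may require Party B to deliver the results.", 0.5, [1.0]),
    ("This contract is made in two copies.", 0.25, [0.25]),
]
adaptation_sentences = select_adaptation_sentences(candidates, threshold=0.5)
```

With a single template sentence, this reduces to the two-term average described in S302.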
Fig. 13 is a flowchart of a method for calculating the similarity between a subordinate sentence and a template sentence provided by an embodiment of the present application. The method includes:
S3011: calculating the Jaccard similarity between the subordinate sentence and the corresponding template sentence;
S3012: calculating the bert similarity between the subordinate sentence and the corresponding template sentence;
S3013: calculating, according to preset weight values, the weighted sum of the Jaccard similarity and the bert similarity to obtain the similarity between the subordinate sentence and the corresponding template sentence.
For example, the Jaccard similarity takes into account how often the words or characters of the subordinate sentence occur in the corresponding template sentence and is an upgrade of formula (2). It is calculated by a formula in which
weightJac(s(1), s(2)) denotes the Jaccard similarity, tf(w) denotes the frequency with which a word of the subordinate sentence occurs in the corresponding template sentence, and tf(c) denotes the frequency with which a character of the subordinate sentence occurs in the corresponding template sentence.
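A frequency-aware Jaccard similarity over words and characters can be sketched as follows. This uses the standard weighted Jaccard (sum of minimum counts over sum of maximum counts) as a stand-in and averages the word-level and character-level values; both choices are assumptions of this example and may differ from the formula of the application. Words are obtained by whitespace splitting here, whereas Chinese text would need a word segmenter.

```python
from collections import Counter


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Weighted Jaccard of two frequency tables: sum(min) / sum(max)."""
    keys = set(a) | set(b)
    denominator = sum(max(a[k], b[k]) for k in keys)
    if denominator == 0:
        return 0.0
    return sum(min(a[k], b[k]) for k in keys) / denominator


def weight_jac(s1: str, s2: str) -> float:
    """Average of word-level and character-level weighted Jaccard (assumed combination)."""
    word_level = weighted_jaccard(Counter(s1.split()), Counter(s2.split()))
    char_level = weighted_jaccard(
        Counter(s1.replace(" ", "")), Counter(s2.replace(" ", ""))
    )
    return (word_level + char_level) / 2
```

Identical sentences score 1.0 and sentences with no shared words or characters score 0.0; repeated tokens lower the score only when their counts differ between the two sentences.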
The bert similarity between the subordinate sentence and the template sentence is calculated with a BERT model; specifically, the softmax result of the classification predicted by the BERT model is used directly as the bert similarity.
The similarity between the subordinate sentence and the corresponding template sentence is calculated by the following formula:
multipleSim(text, s) = α · weightJac(text, s) + β · bertSim(text, s)    (5)
wherein multipleSim(text, s) denotes the similarity between the subordinate sentence and the corresponding template sentence, bertSim(text, s) denotes the bert similarity, and α and β denote the weight values corresponding to the Jaccard similarity and the bert similarity, respectively.
The method for calculating the similarity between a subordinate sentence and the corresponding template sentence provided by this embodiment uses two similarity calculation methods, which effectively avoids the error that a single similarity calculation method can introduce and thereby ensures the accuracy of the similarity value.
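Formula (5) is a plain weighted sum, which can be sketched with the two similarity functions passed in as callables. The default weights of 0.5 and the stub scorers are assumptions for illustration; in practice the second callable would wrap the softmax output of a BERT classifier.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def multiple_sim(text: str, s: str, jaccard_fn: Scorer, bert_fn: Scorer,
                 alpha: float = 0.5, beta: float = 0.5) -> float:
    """multipleSim(text, s) = alpha * weightJac(text, s) + beta * bertSim(text, s)."""
    return alpha * jaccard_fn(text, s) + beta * bert_fn(text, s)


# Stub scorers with fixed outputs, used only to show the arithmetic.
def stub_jaccard(text: str, s: str) -> float:
    return 0.5


def stub_bert(text: str, s: str) -> float:
    return 1.0


combined = multiple_sim("subordinate sentence", "template sentence",
                        stub_jaccard, stub_bert, alpha=0.5, beta=0.5)
```

Choosing α and β so that they sum to one keeps the combined score in the same range as its two inputs, although the text does not state that constraint.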
Fig. 14 is a schematic diagram of a structuring device for contract documents provided by an embodiment of the present application. The device can be applied to a variety of equipment such as servers, personal computers (PCs), tablet computers and mobile phones.
As shown in Fig. 14, the device includes:
a division module 1, configured to divide a document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to the chapter title;
an adaptation template name determining module 2, configured to calculate the similarity between the chapter title and each template name in a structured template to obtain an adaptation template name, the structured template consisting of template names, the elements corresponding to the template names and the fillable regions corresponding to the template names, the adaptation template name being a template name whose similarity to the chapter title is greater than a preset title similarity threshold;
an adaptation sentence determining module 3, configured to calculate the similarity between the elements corresponding to the adaptation template name and the subordinate sentences of the corresponding chapter title to obtain adaptation sentences, an adaptation sentence being a subordinate sentence whose similarity to the element is greater than a preset sentence similarity threshold;
a fill-in module 4, configured to fill the adaptation sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
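The work of the fill-in module can be sketched as collecting adaptation sentences under their adaptation template names. The dictionary layout and the name fill_template are assumptions for this example; the outputs of the other three modules are taken as given.

```python
from typing import Dict, List, Tuple


def fill_template(template_names: List[str],
                  matches: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Fill adaptation sentences into the fillable region of each template name.

    matches holds one (adaptation_template_name, adaptation_sentences) pair
    per single-chapter document.
    """
    structured = {name: [] for name in template_names}   # empty fillable regions
    for name, sentences in matches:
        if name in structured:                           # ignore names not in the template
            structured[name].extend(sentences)
    return structured


structured_document = fill_template(
    ["Rights of Party A", "Payment"],
    [("Rights of Party A", ["s1"]), ("Rights of Party A", ["s2"]), ("Unknown", ["s3"])],
)
```

Template names that receive no adaptation sentence keep an empty region, which makes missing content easy to spot in a later review.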
As can be seen from the above technical solutions, the present application provides a method and device for structuring a document. The structure of the document to be structured is identified according to the text structure recognition model, and the document is divided into several single-chapter documents. A structured template is preset, and the similarity between the chapter title of each single-chapter document and each template name in the structured template is calculated to determine the adaptation template name corresponding to each single-chapter document. The similarity between the elements corresponding to each adaptation template name and the subordinate sentences of the corresponding chapter title is calculated to determine the adaptation sentences that match the elements. All adaptation sentences are filled into the corresponding fillable regions of the structured template, which structures the document to be structured and yields the structured document. The structured document at this point is a document that has been divided in detail according to template names and elements and then integrated again. The method and device provided by the present application can thus divide an unstructured document accurately according to the preset structured template and accurately generate a structured document in which template names and elements correspond to each other, which ensures the accuracy of the subsequent determination of key points.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the invention. This application is intended to cover any variations, uses or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and examples are to be regarded as illustrative only; the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the application is not limited to the precise structure that has been described above and shown in the drawings, and And various modifications and changes may be made without departing from the scope thereof.Scope of the present application is only limited by the accompanying claims.

Claims (10)

1. A method for structuring a document, characterized in that the method comprises:
dividing a document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to the chapter title;
calculating the similarity between the chapter title and each template name in a structured template to obtain an adaptation template name, the structured template consisting of template names, the elements corresponding to the template names and the fillable regions corresponding to the template names, the adaptation template name being a template name whose similarity to the chapter title is greater than a preset title similarity threshold;
calculating the similarity between the elements corresponding to the adaptation template name and the subordinate sentences of the corresponding chapter title to obtain adaptation sentences, an adaptation sentence being a subordinate sentence whose similarity to the element is greater than a preset sentence similarity threshold;
filling the adaptation sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
2. The method according to claim 1, characterized in that, before calculating the similarity between the chapter title and each template name in the structured template to obtain the adaptation template name, the method comprises:
obtaining samples to be processed from a document library, each sample to be processed being a document that contains graded titles and the subordinate sentences corresponding to the graded titles;
extracting, by semantic analysis, the elements in the subordinate sentences, an element being a keyword in a subordinate sentence that corresponds to a graded title;
generating training samples according to the format in which a template name corresponds to elements, wherein the template name corresponds to a graded title;
training on the training samples to generate a template to be processed, the template to be processed consisting of template names and the elements corresponding to the template names;
for each template name, adding a fillable region to the template to be processed to generate the structured template, the fillable region being used for filling in the text content of the document to be structured.
3. The method according to claim 1, characterized in that, before calculating the similarity between the chapter title and each template name in the structured template to obtain the adaptation template name, the method comprises:
obtaining samples to be processed from a document library, each sample to be processed being a document that contains graded titles and the subordinate sentences corresponding to the graded titles;
extracting, by semantic analysis, the elements in the subordinate sentences, an element being a keyword in a subordinate sentence that corresponds to a graded title;
generating template sentences corresponding to the elements, a template sentence being a sentence that contains the element and/or whose semantic similarity to the element is greater than a preset template sentence similarity threshold;
generating training samples according to the format in which a template name corresponds to elements and template sentences;
training on the training samples to generate a template to be processed, the template to be processed consisting of template names and the elements and template sentences corresponding to the template names;
for each template name, adding a fillable region to the template to be processed to generate the structured template, the fillable region being used for filling in the text content of the document to be structured.
4. The method according to claim 1, characterized in that dividing the document to be structured into several single-chapter documents according to the text structure recognition model comprises:
determining the structure type of the document to be structured by using the text structure recognition model, the structure type being either a formatted type or an unformatted type;
if the structure type of the document to be structured is the formatted type, determining the chapter titles of the document to be structured by using the text structure recognition model, each chapter title consisting of a title identifier and title content;
parsing the title identifier to obtain a normalized title, the normalized title being a title with a title number in a unified form;
dividing the document to be structured into several single-chapter documents, using the normalized titles as split points.
5. The method according to claim 4, characterized in that dividing the document to be structured into several single-chapter documents according to the text structure recognition model comprises:
if the structure type of the document to be structured is the unformatted type, determining the chapter titles of the document to be structured according to preset regular expressions;
dividing the document to be structured into several single-chapter documents, using the chapter titles as split points.
6. The method according to claim 4 or 5, characterized in that dividing the document to be structured into several single-chapter documents according to the text structure recognition model further comprises:
determining titles to be weakened, a title to be weakened being a chapter title whose level is lower than a preset title level;
setting the titles to be weakened to body-text level.
7. The method according to claim 1, characterized in that calculating the similarity between the chapter title and each template name in the structured template to obtain the adaptation template name comprises:
calculating in turn, by a semantic similarity method, the similarity between the chapter title and each template name in the structured template;
determining the target template names, a target template name being a template name whose similarity is greater than the preset title similarity threshold;
determining the adaptation template name, the adaptation template name being the target template name with the highest similarity among the target template names.
8. The method according to claim 3, characterized in that calculating the similarity between the elements corresponding to the adaptation template name and the subordinate sentences of the corresponding chapter title to obtain the adaptation sentences comprises:
calculating the similarity between a subordinate sentence and each corresponding template sentence, a corresponding template sentence being a template sentence that corresponds to the element corresponding to the subordinate sentence;
calculating the average of the similarity between the element and the subordinate sentence and the similarity between the subordinate sentence and the template sentence, to obtain the final sentence similarity;
determining adaptation sentences, an adaptation sentence being a subordinate sentence whose final sentence similarity is greater than the preset sentence similarity threshold.
9. The method according to claim 8, characterized in that calculating the similarity between the subordinate sentence and each corresponding template sentence comprises:
calculating the Jaccard similarity between the subordinate sentence and the corresponding template sentence;
calculating the bert similarity between the subordinate sentence and the corresponding template sentence;
calculating, according to preset weight values, the weighted sum of the Jaccard similarity and the bert similarity to obtain the similarity between the subordinate sentence and the corresponding template sentence.
10. A device for structuring a document, characterized in that the device comprises:
a division module, configured to divide a document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to the chapter title;
an adaptation template name determining module, configured to calculate the similarity between the chapter title and each template name in a structured template to obtain an adaptation template name, the structured template consisting of template names, the elements corresponding to the template names and the fillable regions corresponding to the template names, the adaptation template name being a template name whose similarity to the chapter title is greater than a preset title similarity threshold;
an adaptation sentence determining module, configured to calculate the similarity between the elements corresponding to the adaptation template name and the subordinate sentences of the corresponding chapter title to obtain adaptation sentences, an adaptation sentence being a subordinate sentence whose similarity to the element is greater than a preset sentence similarity threshold;
a fill-in module, configured to fill the adaptation sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
CN201910430088.3A 2019-05-22 2019-05-22 A kind of structural method and device of document Pending CN110175322A (en)

Publications (1)

Publication Number Publication Date
CN110175322A true CN110175322A (en) 2019-08-27

Family

ID=67691880

Country Status (1)

Country Link
CN (1) CN110175322A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190906

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: ULTRAPOWER SOFTWARE Co.,Ltd.

CB02 Change of applicant information

Address after: Zone B, 19th Floor, Building A1, 3333 Xiyou Road, Hi-tech Zone, Hefei City, Anhui Province 230000

Applicant after: Dingfu Intelligent Technology Co.,Ltd.

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190827