Detailed Description
To help those skilled in the art better understand the technical solutions of this application, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the present invention.
Document audit is the review of a document's content and format against laws, regulations, and the agreement between the parties. In general, the key points in the document must be identified, and each key point must be reviewed for the legality of its wording and for potential legal risk.
At present, documents are mainly audited manually: after reading the document, the auditor relies on professional experience to locate each key point, to judge whether each key point is expressed lawfully, and to judge whether it carries legal risk. Documents are usually long, however, so the auditor spends a great deal of time reading. Because a document contains so much content, it is also difficult for the auditor to identify every key point without omission and to assign each key point to the appropriate category. In addition, an auditor's experience is highly subjective, which considerably reduces the accuracy of judging whether the wording of a key point carries legal risk.
To improve the efficiency and accuracy of document audit, automated audit is gradually being adopted, that is, a computer audits the document. Through semantic analysis, keywords can be determined objectively from the whole document, and these keywords are the key points. Documents are long, however, and running semantic analysis on a whole document in one pass takes a long time. Moreover, documents come in many forms, and computers have difficulty understanding and recognizing the various unstructured documents. This makes semantic analysis of the whole document harder and greatly reduces the accuracy with which key points are determined.
It can be seen that, although the auditing subject is sufficiently objective, auditing documents with existing computer methods still takes a long time and cannot handle unstructured documents accurately.
To solve the above problems, the embodiments of this application provide a method and an apparatus for structuring a document.
The method embodiments of this application are described below.
Fig. 1 is a flow chart of a method for structuring a document provided by an embodiment of this application. The method can be applied to various devices such as servers, personal computers (PCs), tablet computers, and mobile phones.
Referring to Fig. 1, the method includes the following steps:
S1: according to a text structure recognition model, divide the document to be structured into several single-chapter documents, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to that chapter title.
The document to be structured is uploaded to the structuring apparatus. The document to be structured may take various forms, such as an electronic document, a paper document, an electronic picture, or a paper picture, and may be of various types, such as a contract, a legal instrument, a paper, a book, or an article. It may be uploaded by wired network transmission, wireless network transmission, or real-time network transmission.
A text structure recognition model is a model or tool capable of detecting, recognizing, and locating the format of a document. One example is a text structure recognition model that recognizes text structure from numbering. Documents usually use numbering to distinguish the content of each part, as shown in Fig. 2 (1): following the sequence "one, 1, 1.1, 1.1.1", the document can be divided into a four-level hierarchy, so the model only needs to recognize the numbering in the document to be structured to determine its structure. Another example is a text structure recognition model that recognizes text structure from document formatting. Where a document has no obvious numbering, text formatting may still distinguish the content of each part, for example the number of blank characters at the start of a paragraph, the number of blank characters after a paragraph, the font, and the font size. As shown in Fig. 2 (2), the number of leading blank characters differs from part to part, and so does the font size. A further example is a text structure recognition model that recognizes text structure from special symbols. A document may use special symbols to distinguish the content of each part, as shown in Fig. 2 (3): according to "*, &, #", the document can be divided into a three-level structure, so the model only needs to recognize the special symbols in the document to be structured to determine its structure. Alternatively, a text structure recognition model that combines the three kinds of identification features above may be used to recognize the structure of the document to be structured. It should be noted that the embodiments of this application give only some text structure recognition models; any other model capable of recognizing text structure can be used as the text structure recognition model disclosed in this application.
After the text structure recognition model has recognized and determined the identification features of each part of the document to be structured, these identification features can be used as cut points to divide the document to be structured into several single-chapter documents. As shown in Fig. 3, each single-chapter document includes a chapter title and the subordinate sentences corresponding to that chapter title, and the chapter title may carry an identification feature.
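The division at cut points can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the recognition model itself: it assumes chapter titles are lines that begin with a dotted Arabic number (the numbering case of Fig. 2 (1)), and the names HEADING, SingleChapterDocument, and split_document are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumed identification feature: a line starting with a dotted Arabic number
# such as "1", "1.1" or "1.1.1", followed by whitespace and the title text.
HEADING = re.compile(r"^\d+(?:\.\d+)*\s+\S")


@dataclass
class SingleChapterDocument:
    title: str  # chapter title, including its identification feature
    sentences: list = field(default_factory=list)  # subordinate sentences


def split_document(lines):
    """Divide lines into single-chapter documents, using titles as cut points."""
    chapters = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if HEADING.match(line):
            chapters.append(SingleChapterDocument(title=line))
        elif chapters:
            chapters[-1].sentences.append(line)
        # Text before the first title belongs to no chapter and is ignored here.
    return chapters
```

A sentence that happens to start with a number would be mistaken for a title by this pattern, which is one reason a real recognition model would combine several identification features.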
The single-chapter documents obtained in S1 still have a scattered structure and ordering and contain a large amount of useless information. Each single-chapter document therefore needs to be arranged into a document that follows the structure and ordering of the structured template and from which the useless information has been removed. This allows the auditing computer to understand the document quickly and to determine its key points accurately.
S2: calculate the similarity between the chapter title and each template name in a structured template to obtain an adapted template name. The structured template consists of template names, the elements corresponding to each template name, and the fillable region corresponding to each template name. The adapted template name is the template name whose similarity to the chapter title is greater than a preset title similarity threshold.
The structured template is a preset document template whose structure matches the logic by which the auditing computer understands documents. As shown in Fig. 4, the structured template includes template names, the elements corresponding to each template name, and the fillable region corresponding to each template name. A template name corresponds to the chapter title of a single-chapter document, an element corresponds to the semantics of the subordinate sentences in the single-chapter document, and a fillable region corresponds to the subordinate sentences in the single-chapter document.
To match a single-chapter document to a suitable template name, the representative of the single-chapter document, namely its chapter title, is matched against each template name. By calculating the similarity between the chapter title and each template name, the template name that satisfies the matching rule can be determined as the adapted template name of that chapter title, and the single-chapter document containing the chapter title is assigned to that adapted template name. The matching rule may be, for example, the highest similarity or a similarity greater than a preset value. In this way, each single-chapter document as a whole is preliminarily matched to its template name.
S3: calculate the similarity between the elements corresponding to the adapted template name and the subordinate sentences of the corresponding chapter title to obtain adapted sentences. An adapted sentence is a subordinate sentence whose similarity to an element is greater than a preset sentence similarity threshold.
The subordinate sentences of a single-chapter document include some useless information. This information is usually unrelated to the legal review points, but its presence costs the computer audit time, so it needs to be removed from the subordinate sentences. Put another way, the sentences relevant to the key points, namely the adapted sentences, are selected from the subordinate sentences.
To determine the adapted sentences, the similarity between each subordinate sentence in the single-chapter document and each element can be calculated, and the subordinate sentences whose similarity is greater than the preset sentence similarity threshold are determined to be adapted sentences.
For example, the subordinate sentences in a single-chapter document are:
Has the right to require Party B to submit technical service results as agreed in this contract.
Pay remuneration to Party B as agreed.
Provide corresponding working conditions to Party A.
The elements are:
Party A's right, Party A's obligation.
By calculation, the similarity between "Has the right to require Party B to submit technical service results as agreed in this contract." and "Party A's right" is greater than the preset sentence similarity threshold, and the similarity between "Pay remuneration to Party B as agreed." and "Party A's obligation" is greater than the preset sentence similarity threshold. These two subordinate sentences are therefore adapted sentences. The similarity between "Provide corresponding working conditions to Party A." and every element, however, does not reach the preset sentence similarity threshold, so this subordinate sentence is useless information here.
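The threshold test in this example can be sketched in Python as follows. It is a minimal sketch, not the similarity calculation of this application: the similarity function is passed in as a parameter, the toy scores are made up purely for illustration, and keeping only the best-scoring element per sentence is an assumption. All names are hypothetical.

```python
def select_adapted_sentences(sentences, elements, similarity, threshold):
    """Map each adapted sentence to the element it matched.

    A subordinate sentence is kept when its similarity to at least one element
    is greater than the threshold; all other sentences are treated as useless.
    """
    adapted = {}
    for sentence in sentences:
        # Assumption: when several elements pass, keep the best-scoring one.
        best = max(elements, key=lambda e: similarity(sentence, e))
        if similarity(sentence, best) > threshold:
            adapted[sentence] = best
    return adapted


# Toy scores standing in for a real similarity model (values are made up).
SCORES = {
    ("Has the right to require Party B to submit technical service results.",
     "Party A's right"): 0.9,
    ("Pay remuneration to Party B as agreed.", "Party A's obligation"): 0.8,
}


def toy_similarity(sentence, element):
    """Look up a toy score; unknown pairs get a low default score."""
    return SCORES.get((sentence, element), 0.1)
```

With a threshold of 0.5, the first two sentences are kept and "Provide corresponding working conditions to Party A." is dropped, because its score against every element stays at the low default.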
S4: fill the adapted sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain the structured document.
As can be seen from the steps above, the template name corresponding to a single-chapter document is unique, but because one template name can correspond to multiple elements, a single-chapter document can contain multiple adapted sentences. During filling in S4, all adapted sentences of a single-chapter document must be filled into the fillable region, so that no legal review point is later missed by the computer because an adapted sentence was omitted. As shown in Fig. 5, all adapted sentences of each single-chapter document are matched and filled into the fillable regions of the structured template, finally forming the structured contract document. The structured document has a document structure that matches the logic by which the auditing computer understands documents and can help the auditing computer determine key points quickly and accurately.
Further, after filling is complete, the content and frame of the structured template can be hidden entirely, leaving only the adapted sentences.
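The filling step and the optional hiding of the template frame can be sketched as follows. The dictionary layout of the template and the function names are assumptions made for this illustration only.

```python
import copy


def fill_template(template, matches):
    """Return a filled copy of the structured template.

    template: {template_name: {"elements": [...], "fillable": []}}
    matches:  {template_name: [adapted sentences of the matched chapter]}
    """
    filled = copy.deepcopy(template)  # leave the preset template untouched
    for name, sentences in matches.items():
        # Fill in every adapted sentence so that none is omitted.
        filled[name]["fillable"].extend(sentences)
    return filled


def adapted_sentences_only(filled):
    """Hide template content and frame, leaving only the adapted sentences."""
    return [s for entry in filled.values() for s in entry["fillable"]]
```

Copying the template before filling keeps the preset template reusable for the next document to be structured.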
As can be seen from the above technical solution, the embodiments of this application provide a method for structuring a document. The method recognizes the structure of the document to be structured according to the text structure recognition model and divides that document into several single-chapter documents. A structured template is preset, and the similarity between each chapter title in the single-chapter documents and each template name in the structured template is calculated to determine the adapted template name corresponding to each single-chapter document. The similarity between the elements corresponding to each adapted template name and the subordinate sentences of the corresponding chapter title is then calculated to determine the adapted sentences that match the elements. All adapted sentences are filled into the corresponding fillable regions of the structured template, which structures the document to be structured and yields the structured document. The structured document at this point has been divided in detail according to template names and elements and then integrated again into a single document. It can be seen that the method and apparatus for structuring a document provided by this application can divide an unstructured document accurately according to the preset structured template and accurately generate a structured document that corresponds to the template names and elements, thereby ensuring the accuracy of the subsequent determination of key points.
Fig. 6 is a flow chart of a method for establishing a structured template provided by an embodiment of this application.
In one embodiment, as shown in Fig. 6, the method may include the following steps:
S101: obtain samples to be processed from a document library, each sample to be processed being a contract document that contains graded titles and the subordinate sentences corresponding to the graded titles.
The samples to be processed that are selected when establishing the structured template must have a structured format. That is, they must have graded titles, which serve as clear cut points for dividing the sample and correspond to template names, and subordinate sentences corresponding to the graded titles, which serve as the basis for extracting elements. The quality of the samples to be processed directly affects the quality of the structured template, so suitable samples must be selected with care.
S102: using semantic analysis, extract the elements from the subordinate sentences, an element being a keyword in a subordinate sentence that corresponds to the graded title.
The subordinate sentences include a large number of sentences, some of which do not correspond to the graded title. These sentences must therefore be excluded, and keywords are extracted as elements only from the sentences that correspond to the graded title.
For example, an embodiment of this application provides the graded title:
Party A's rights and duties;
and the subordinate sentences:
Has the right to require Party B to submit technical service results as agreed in this contract.
Pay remuneration to Party B as agreed.
Provide corresponding working conditions to Party A.
Semantic analysis shows that "Provide corresponding working conditions to Party A." does not correspond to "Party A's rights and duties". It is therefore only necessary to extract the keywords of "Has the right to require Party B to submit technical service results as agreed in this contract." and "Pay remuneration to Party B as agreed.", which are "Party A's right" and "Party A's obligation" respectively. The elements corresponding to "Party A's rights and duties" are therefore "Party A's right" and "Party A's obligation".
S103: generate training samples in a format in which each template name corresponds to its elements, where the template name corresponds to the graded title.
When generating the training samples, each template name must be matched accurately with its elements to prevent confusion and thereby ensure the accuracy of the structured template.
S104: train on the training samples to generate a template to be processed, the template to be processed consisting of template names and the elements corresponding to each template name.
By training on a large number of training samples, a template to be processed is generated that is representative of the format and content of the many training samples.
S105: for each template name, add a fillable region to the template to be processed to generate the structured template, the fillable region being used to hold the text content of the structured contract document.
The structured template needs regions into which the text content of the structured document can be filled, so the corresponding fillable regions must be added to the template to be processed so that the adapted sentences of the structured document can be filled in later.
The embodiments of this application make it possible to obtain an accurate structured template, thereby ensuring the accuracy of the subsequent structuring process.
The structured template generated by S101-S105 includes only elements, so elements are the only basis for the subsequent determination of adapted sentences. Matching is then performed between a sentence and a word. Because the two differ considerably in character count and in the variety of semantics they contain, using elements alone as the basis for determining adapted sentences gives low accuracy.
Fig. 7 is a flow chart of another method for establishing a structured template provided by an embodiment of this application. The method includes:
S111: obtain samples to be processed from a document library, each sample to be processed being a contract document that contains graded titles and the subordinate sentences corresponding to the graded titles;
S112: using semantic analysis, extract the elements from the subordinate sentences, an element being a keyword in a subordinate sentence that corresponds to the graded title;
S113: generate template sentences corresponding to the elements, a template sentence being a sentence that contains the element and/or whose semantic similarity to the element is greater than a preset template sentence similarity threshold;
S114: generate training samples in a format in which each template name corresponds to its elements and template sentences;
S115: train on the training samples to generate a template to be processed, the template to be processed consisting of template names and the elements and template sentences corresponding to each template name;
S116: for each template name, add a fillable region to the template to be processed to generate the structured template, the fillable region being used to hold the text content of the structured document.
S111 and S112 in this embodiment are the same as S101 and S102 in the previous embodiment; refer to the description above, which is not repeated here.
To improve the accuracy of matching between subordinate sentences and "elements", the scope of "element" is broadened: template sentences are generated for each element. A template sentence is a sentence that contains the element and/or whose semantic similarity to the element is greater than the preset template sentence similarity threshold.
For example, the element provided in this embodiment is:
Party A's right;
and the template sentences are:
Has the right to require Party B to submit technical service results as agreed in this contract.
Party A's right is to require Party B to rectify problems in its service process.
In this case, the "element" used as the basis for matching subordinate sentences has two parts: the element (keyword) and the template sentences.
The subsequent S114-S116 are the same as S103-S105 in the previous embodiment; refer to the description above, which is not repeated here.
Adding template sentences to the structured template provided by this embodiment effectively improves the accuracy of the subsequent determination of adapted sentences, thereby ensuring the accuracy with which the auditing computer determines key points.
Fig. 8 is a flow chart of a method for dividing the document to be structured provided by an embodiment of this application. The method includes:
S131: determine the structure type of the document to be structured using the text structure recognition model, the structure type being either a formatted type or an unformatted type;
S132: if the structure type of the document to be structured is the formatted type, determine the chapter titles of the document to be structured using the text structure recognition model, each chapter title consisting of a title identifier and title content;
S133: parse the title identifiers to obtain normalized titles, a normalized title being a title with a title number in a uniform format;
S134: using the normalized titles as cut points, divide the document to be structured into several single-chapter documents.
Documents are of two types, formatted and unformatted. A formatted document uses numbering and/or special symbols and/or text formatting (see above), so that the document itself carries a certain format logic; an unformatted document carries no format logic of its own. The text structure recognition model can detect and recognize the structure type of a document. Clearly, for a formatted document to be structured, the chapter titles can be obtained relatively easily by recognizing identification features such as numbering, special symbols, and text formatting. However, the chapter titles of different documents to be structured use different identification features, and even the chapter titles within a single document to be structured may use different identification features. These format differences affect the efficiency and accuracy of the structuring apparatus, so each chapter title must be parsed to obtain a normalized title with a title number in a uniform format.
For example, chapter titles may be graded using the numbers "1, one, 1., I", using leading blank characters at the start of a paragraph, or using special symbols. All of them are parsed into normalized titles graded with Arabic numerals. These normalized titles are then used as cut points to divide the document to be structured into several single-chapter documents.
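Parsing number-style title identifiers into Arabic numerals can be sketched as follows. This is a minimal sketch with deliberately small lookup tables; the tables, the handled punctuation, and the function names are assumptions, and identifiers based on leading blanks or special symbols are not covered.

```python
import re

# Assumed lookup tables; a real parser would cover far more identifiers.
SPELLED = {"one": 1, "two": 2, "three": 3, "一": 1, "二": 2, "三": 3}
ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(text):
    """Convert a Roman numeral made of I, V and X to an integer."""
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN[ch]
        # A smaller numeral before a larger one is subtracted (IV = 4).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def normalize_identifier(identifier):
    """Return the title identifier as an Arabic-numeral string."""
    core = identifier.strip().rstrip(".、,)")  # drop trailing punctuation
    if re.fullmatch(r"[0-9]+(?:\.[0-9]+)*", core):
        return core  # already Arabic, possibly dotted such as "1.1"
    if core.lower() in SPELLED:
        return str(SPELLED[core.lower()])
    if core and all(ch in ROMAN for ch in core):
        return str(roman_to_int(core))
    raise ValueError(f"unrecognized title identifier: {identifier!r}")
```

Under these assumptions, the four identifiers "1", "one", "1." and "I" from the example all normalize to the same string "1", so the later division step only has to handle one title format.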
If the structure type of the document to be structured is the unformatted type, the chapter titles cannot be determined directly from identification features. Therefore, as shown in Fig. 9, the chapter titles are determined by the following method:
S135: if the structure type of the document to be structured is the unformatted type, determine the chapter titles of the document to be structured according to preset regular expressions;
S136: using the chapter titles as cut points, divide the document to be structured into several single-chapter documents.
For example, the preset regular expressions are "(.*) [article | chapter]" and "(\d+) [ ,]". The document to be structured is matched against these regular expressions, and the text that matches a regular expression is taken as a chapter title.
A chapter title obtained by matching a preset regular expression already has a standard format, so the document to be structured can be divided directly according to the chapter titles obtained.
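Locating chapter titles with preset regular expressions can be sketched as follows. The two patterns below are hypothetical English stand-ins chosen for this illustration; they are not the preset expressions of this application, and the function name is likewise an assumption.

```python
import re

# Hypothetical English stand-ins for the preset regular expressions.
PRESET_PATTERNS = [
    re.compile(r"^(?:Article|Chapter)\s+\S+"),  # e.g. "Chapter 1 General"
    re.compile(r"^\d+[,.、]"),                  # e.g. "2, Payment"
]


def find_chapter_titles(lines):
    """Return the indexes of lines that match any preset regular expression."""
    return [
        index
        for index, line in enumerate(lines)
        if any(pattern.match(line.strip()) for pattern in PRESET_PATTERNS)
    ]
```

The returned indexes can serve directly as cut points, since every matched line already has one of the expected title formats.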
Further, some documents to be structured have multi-level titles. If the document to be structured is divided at the titles of every level, the number of single-chapter documents becomes excessive, which increases the time needed to determine adapted sentences later. Division that is too fine also separates semantically similar content into different single-chapter documents, making the subsequent semantic analysis and the determination of adapted elements and adapted sentences more difficult. Therefore, Fig. 10 is a flow chart of a method for weakening specified chapter titles provided by an embodiment of this application. The method for weakening specified chapter titles includes:
S137: determine the titles to be weakened, a title to be weakened being a chapter title whose level is lower than a preset title level;
S138: set the titles to be weakened to body-text level.
Chapter titles whose level is lower than the preset title level are determined to be titles to be weakened.
For example, given a first-level chapter title "1, ××", a second-level chapter title "1.1, ××", and a third-level chapter title "1.1.1, ××", with the preset title level set to the second level, the third-level chapter title is determined to be a title to be weakened.
The level of each title to be weakened determined in S137 is set to body-text level, so that the title is weakened in the document to be structured, which effectively reduces the number of divisions of the document to be structured.
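Steps S137 and S138 can be sketched as follows, assuming titles use the dotted numbering of the example ("1, ××", "1.1, ××", "1.1.1, ××") so that the level can be read from the number of dots. The pattern and function names are hypothetical.

```python
import re

# Assumed title form: dotted Arabic number, then a comma or whitespace.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)[\s,、]")


def title_level(line):
    """Return 1 for "1, ..", 2 for "1.1, ..", and None for body text."""
    match = NUMBERED.match(line)
    return match.group(1).count(".") + 1 if match else None


def weaken_titles(lines, preset_level):
    """Label each line "title" or "text".

    Titles whose level is lower (deeper) than the preset level are weakened,
    that is, labeled as body text so they no longer act as cut points.
    """
    labeled = []
    for line in lines:
        level = title_level(line)
        is_title = level is not None and level <= preset_level
        labeled.append(("title" if is_title else "text", line))
    return labeled
```

With the preset level set to 2, a line such as "1.1.1, Details" is labeled as text and stays inside the single-chapter document of its second-level title.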
Fig. 11 is a flow chart of a method for determining the adapted template name provided by an embodiment of this application. The method includes:
S201: using a semantic analysis method, calculate in turn the similarity between the chapter title and each template name in the structured template;
S202: determine the target template names, whose similarity is greater than the preset title similarity threshold;
S203: determine the adapted template name, the adapted template name being the target template name with the highest similarity among the target template names.
For example, the similarity between the chapter title and each template name in the structured template is calculated using formula (1),
where score denotes the similarity between the chapter title and the template name, t denotes the characters of the chapter title, o denotes the template name, and e denotes the element corresponding to o.
The sim values in formula (1) are calculated using formula (2),
where s(1) and s(2) denote the calculation parameters, and the two further symbol pairs denote the word calculation parameters and the character calculation parameters, respectively.
With the two formulas above, the similarity between the chapter title and each template name can be calculated accurately. First, the target template names, whose similarity is greater than the preset title similarity threshold, are determined from all the similarities.
For example, if the similarities between the chapter title and template names A, B, and C are 0.2, 0.8, and 0.6 respectively, and the preset title similarity threshold is 0.5, then template names B and C are target template names.
Among target template names B and C, the one with the highest similarity is determined to be the adapted template name, that is, target template name B is the adapted template name.
The method for determining the adapted template name provided by this embodiment accurately determines a unique adapted template name, so that each single-chapter document is matched to the template name with the highest similarity.
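Steps S202 and S203 can be sketched as follows, taking the similarities as already computed. The function name is hypothetical, and returning None when no template name passes the threshold is an assumption, since that case is not described here.

```python
def select_adapted_template_name(similarities, threshold):
    """Pick the adapted template name from {template_name: similarity}.

    S202: keep the target template names whose similarity exceeds the threshold.
    S203: among them, return the one with the highest similarity.
    """
    targets = {name: s for name, s in similarities.items() if s > threshold}
    if not targets:
        return None  # assumption: no template name is adapted in this case
    return max(targets, key=targets.get)
```

For the example values, select_adapted_template_name({"A": 0.2, "B": 0.8, "C": 0.6}, 0.5) keeps B and C as target template names and returns "B".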
Fig. 12 is a flow chart of a method for determining adapted sentences provided by an embodiment of this application. The method includes:
S301: calculate the similarity between the subordinate sentence and each corresponding template sentence, a corresponding template sentence being a template sentence of the element to which the subordinate sentence corresponds;
S302: calculate the average of the similarity between the element and the subordinate sentence and the similarity between the subordinate sentence and the template sentences to obtain the final sentence similarity;
S303: determine the adapted sentences, an adapted sentence being a subordinate sentence whose final sentence similarity is greater than the preset sentence similarity threshold.
For example, the final sentence similarity is calculated using formula (3),
where score denotes the final sentence similarity; sim(text, e) denotes the similarity between the subordinate sentence and the element, which can be calculated by formula (2); and multipleSim(text, s) denotes the similarity between the subordinate sentence and a template sentence.
Finally, the subordinate sentences whose final sentence similarity is greater than the preset sentence similarity threshold are determined to be the adapted sentences corresponding to the element.
The method for determining adapted sentences provided by this embodiment takes into account both the similarity between a subordinate sentence and the element and the similarity between the subordinate sentence and the template sentences, thereby improving the accuracy with which adapted sentences are determined.
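Steps S302 and S303 can be sketched as follows. The exact averaging of formula (3) is not reproduced here, so this sketch assumes a plain arithmetic mean over the element similarity and all template sentence similarities; that choice and the function names are assumptions.

```python
def final_sentence_similarity(element_sim, template_sims):
    """Average the element similarity with the template sentence similarities.

    element_sim:   similarity between the subordinate sentence and the element
    template_sims: similarities between the subordinate sentence and each
                   corresponding template sentence (may be empty)
    """
    scores = [element_sim, *template_sims]
    return sum(scores) / len(scores)  # assumed: plain arithmetic mean


def is_adapted(element_sim, template_sims, threshold):
    """S303: the sentence is adapted when the final similarity exceeds the threshold."""
    return final_sentence_similarity(element_sim, template_sims) > threshold
```

A sentence that scores modestly against the keyword element can still pass the threshold when it closely resembles a template sentence, which is the effect the template sentences are meant to provide.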
Fig. 13 is a flow chart of a method for calculating the similarity between a subordinate sentence and a template sentence provided by an embodiment of this application. The method includes:
S3011: calculate the Jaccard similarity between the subordinate sentence and the corresponding template sentence;
S3012: calculate the BERT similarity between the subordinate sentence and the corresponding template sentence;
S3013: according to preset weights, calculate the weighted sum of the Jaccard similarity and the BERT similarity to obtain the similarity between the subordinate sentence and the corresponding template sentence.
For example, the Jaccard similarity used here takes into account how often the words and characters of the subordinate sentence occur in the corresponding template sentence and is an extension of formula (2). The Jaccard similarity is calculated using formula (4),
where weightJac(s(1), s(2)) denotes the Jaccard similarity, tf(w) denotes the number of times a word of the subordinate sentence occurs in the corresponding template sentence, and tf(c) denotes the number of times a character of the subordinate sentence occurs in the corresponding template sentence.
The BERT similarity between the subordinate sentence and the template sentence is calculated using a BERT model; specifically, the softmax output of the BERT model's classification prediction is used directly as the BERT similarity.
The similarity between the subordinate sentence and the corresponding template sentence is calculated using the following formula:
multipleSim(text, s) = α · weightJac(text, s) + β · bertSim(text, s)    (5)
where multipleSim(text, s) denotes the similarity between the subordinate sentence and the corresponding template sentence, bertSim(text, s) denotes the BERT similarity, and α and β denote the weights of the Jaccard similarity and the BERT similarity respectively.
The method for calculating the similarity between a subordinate sentence and the corresponding template sentence provided by this embodiment uses two similarity calculation methods, which effectively avoids the error that a single similarity calculation method can introduce and thereby ensures the accuracy of the similarity value.
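The weighted sum of formula (5) can be sketched as follows. Two simplifications are assumed: a plain set-based Jaccard similarity over whitespace-separated words stands in for the frequency-weighted weightJac, and the BERT similarity is supplied by the caller as a function rather than computed by a model. The names and the default weights are hypothetical.

```python
def jaccard(a, b):
    """Plain Jaccard similarity over whitespace-separated words.

    Stand-in for weightJac: it ignores the word and character frequencies
    that the frequency-weighted variant takes into account.
    """
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def multiple_sim(text, template_sentence, bert_sim, alpha=0.5, beta=0.5):
    """Formula (5): alpha * Jaccard similarity + beta * BERT similarity.

    bert_sim is a caller-supplied function (text, sentence) -> float,
    for example a wrapper around a fine-tuned classifier's softmax output.
    """
    return (alpha * jaccard(text, template_sentence)
            + beta * bert_sim(text, template_sentence))
```

Passing the BERT similarity in as a function keeps the sketch runnable without a model and makes the weighting itself easy to check in isolation.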
Fig. 14 is a schematic diagram of an apparatus for structuring contract documents provided by an embodiment of this application. The apparatus can be applied to various devices such as servers, personal computers (PCs), tablet computers, and mobile phones.
As shown in Fig. 14, the apparatus includes:
a division module 1, configured to divide a document to be structured into several single-chapter documents according to a text structure recognition model, each single-chapter document consisting of a chapter title and the subordinate sentences corresponding to that chapter title;
an adapted template name determining module 2, configured to calculate the similarity between the chapter title and each template name in a structured template to obtain an adapted template name, the structured template consisting of template names, the elements corresponding to each template name, and the fillable regions corresponding to each template name, and the adapted template name being a template name whose similarity to the chapter title is greater than a preset title similarity threshold;
an adapted sentence determining module 3, configured to calculate the similarity between the elements corresponding to the adapted template name and the subordinate sentences of the corresponding chapter title to obtain adapted sentences, an adapted sentence being a subordinate sentence whose similarity to the element is greater than a preset sentence similarity threshold;
a filling module 4, configured to fill the adapted sentences of all single-chapter documents into the corresponding fillable regions of the structured template to obtain a structured document.
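The cooperation of modules 2 to 4 can be sketched as follows. This is a simplified illustration under several assumptions, not the device itself: the output of division module 1 is represented as a list of (chapter title, subordinate sentences) pairs, the structured template as a dictionary mapping each template name to its elements, and each fillable region as a list. The default similarity is a plain character-level ratio from Python's `difflib`, used only as a placeholder for the similarity measures described in the application. All names and threshold values are hypothetical.

```python
from difflib import SequenceMatcher


def char_ratio(a, b):
    """Placeholder similarity in [0, 1]; not the application's measure."""
    return SequenceMatcher(None, a, b).ratio()


def structure_document(chapters, template, title_threshold=0.6,
                       sentence_threshold=0.5, similarity=char_ratio):
    """Fill a structured template from single-chapter documents.

    chapters: list of (chapter_title, [subordinate sentences]),
              i.e. the assumed output of division module 1.
    template: dict mapping template name -> list of elements.
    Returns:  dict mapping template name -> element -> list of adapted
              sentences (the filled regions).
    """
    filled = {name: {el: [] for el in elements}
              for name, elements in template.items()}
    for title, sentences in chapters:
        # Module 2: keep template names whose similarity to the chapter
        # title exceeds the preset title similarity threshold.
        adapted_names = [name for name in template
                         if similarity(title, name) > title_threshold]
        for name in adapted_names:
            for element in template[name]:
                for sentence in sentences:
                    # Module 3: a subordinate sentence is an adapted sentence
                    # if its similarity to the element exceeds the threshold.
                    if similarity(sentence, element) > sentence_threshold:
                        # Module 4: fill it into the matching region.
                        filled[name][element].append(sentence)
    return filled
```

Passing the similarity function as a parameter keeps the control flow independent of how similarity is computed, so a combined Jaccard and BERT score could be substituted without changing the loop.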
As can be seen from the above technical solutions, the present application provides a document structuring method and device. The structure of a document to be structured is recognized according to a text structure recognition model, and the document is divided into several single-chapter documents. A structured template is preset, and the similarity between the chapter title of each single-chapter document and each template name in the structured template is calculated to determine the adapted template name corresponding to each single-chapter document. By calculating the similarity between the elements corresponding to each adapted template name and the subordinate sentences of the corresponding chapter title, the adapted sentences that match each element are determined. All adapted sentences are filled into the corresponding fillable regions of the structured template, which completes the structuring of the document and yields a structured document. The structured document obtained in this way has been finely divided according to template names and elements and then reassembled. It can be seen that the document structuring method and device provided by the present application can accurately divide an unstructured document according to a preset structured template and accurately generate a structured document in which content corresponds to template names and elements, thereby ensuring the accuracy of the subsequent determination of key points.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional technical means in the art not disclosed herein. The description and examples are to be regarded as illustrative only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the present application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.