CN113642320A - Method, device, equipment and medium for extracting document directory structure - Google Patents

Method, device, equipment and medium for extracting document directory structure

Info

Publication number
CN113642320A
CN113642320A (application CN202010344802.XA)
Authority
CN
China
Prior art keywords
title
component
logic
previous
assembly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010344802.XA
Other languages
Chinese (zh)
Inventor
林得苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pai Tech Co ltd
Original Assignee
Pai Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pai Tech Co ltd filed Critical Pai Tech Co ltd
Priority to CN202010344802.XA priority Critical patent/CN113642320A/en
Publication of CN113642320A publication Critical patent/CN113642320A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method, a device, equipment and a medium for extracting a document directory structure. The method comprises the following steps: acquiring an ordered sequence of title components of a document to be processed; establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components; and generating a directory structure of the document to be processed according to the title logic tree. The method, device, equipment and medium for extracting a document directory structure provided by the embodiments of the invention can improve the accuracy of directory structure extraction.

Description

Method, device, equipment and medium for extracting document directory structure
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for extracting a document directory structure.
Background
A title is a short phrase that summarizes the content of a document. In order to enhance readability, a document typically contains multiple levels of titles: the document content under a title of a given level can be subdivided into several parts by establishing subordinate titles beneath it.
The document directory structure records the membership between titles of different levels, with lower-level titles subordinate to higher-level titles. FIG. 1 shows an exemplary document directory structure comprising three levels of titles, from highest to lowest: first-level, second-level and third-level titles. In FIG. 1, the second-level title "1. Second-level title" is subordinate to the first-level title "One, First-level title", and the third-level titles "2.1 Third-level title" and "2.2 Third-level title" are subordinate to the second-level title "2. Second-level title".
In existing directory extraction methods, the titles in a document need to be manually marked, for example by being set to a heading style. When the directory is generated, the directory structure is built from the paragraphs marked with the heading style, and the accuracy of the extracted directory structure is low.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for extracting a document directory structure, which can improve the extraction accuracy of the directory structure.
In a first aspect, a method for extracting a document directory structure is provided, including: acquiring an ordered sequence of title components of a document to be processed; establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components; and generating a directory structure of the document to be processed according to the title logic tree.
According to the method for extracting a document directory structure in the embodiment of the invention, the ordered sequence of title components of the document to be processed can first be obtained, and a title logic tree is then established using each title component in the sequence. Because the title component corresponding to any node in the title logic tree is the upper-level title of the title components corresponding to the child nodes of that node, establishing the title logic tree determines the hierarchical relationship among the title components, thereby improving the accuracy of directory structure extraction.
In an alternative embodiment, establishing the title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components specifically includes: sequentially taking the title components in the ordered sequence as the first title component; and, for each first title component, performing the following operations: if a second title component with the same level as the first title component exists among the previous title component of the first title component in the title logic tree and the ancestor nodes of that previous title component, inserting the first title component into the title logic tree as a sibling node of the second title component; and if no such second title component exists among the previous title component and its ancestor nodes, inserting the first title component into the title logic tree as a child node of the previous title component.
In a method that generates a directory structure from a title template, the number of levels of the directory structure is limited to the number of levels set in the template. For example, if only three title levels are set in the template, at most three title levels can be generated. With the method for extracting a document directory structure in the embodiment of the invention, each first title component is compared with the title components already added to the title logic tree, and if no title component of the same level exists, the first title component is used as a child node of the previous title component. Even if the document to be processed has more title levels, for example eight or nine levels, the corresponding levels can still be generated. Compared with generating the directory structure from a title template, this improves the flexibility, accuracy and depth of directory structure generation. A minimal illustration of this insertion rule is given below.
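The following minimal Python sketch illustrates the insertion rule described above; it is an illustration under assumptions, not the claimed implementation. The names TitleNode, build_title_logic_tree and is_same_level are hypothetical, and is_same_level stands in for whatever hierarchy judgment is used (for example the title hierarchy binary classification model described later).

```python
class TitleNode:
    def __init__(self, component=None, parent=None):
        self.component = component  # the title component (None for the root node)
        self.parent = parent
        self.children = []

def build_title_logic_tree(title_components, is_same_level):
    """Insert each title component either as a sibling of a same-level node found
    among the previous title component and its ancestors, or as a child of the
    previous title component, following the rule described above."""
    root = TitleNode()
    previous = root
    for component in title_components:
        node = TitleNode(component)
        # Walk from the previous title component up through its ancestor nodes,
        # looking for a second title component on the same level.
        candidate, second = previous, None
        while candidate is not root:
            if is_same_level(candidate.component, component):
                second = candidate
                break
            candidate = candidate.parent
        if second is not None:
            # Insert as a sibling node of the second title component.
            node.parent = second.parent
            second.parent.children.append(node)
        else:
            # Insert as a child node of the previous title component.
            node.parent = previous
            previous.children.append(node)
        previous = node
    return root
```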
In an optional embodiment, the method further comprises: obtaining hierarchical classification scores of the previous title component and of the title components in the ancestor nodes of the previous title component by using target data and a title hierarchy binary classification model, wherein the target data comprise the features of the title component in question (the previous title component or one of its ancestor nodes) and the features of the first title component; if a title component whose hierarchical classification score is greater than a preset score threshold exists among the previous title component and the ancestor nodes of the previous title component, determining that a second title component exists among them; and if the hierarchical classification scores of all title components among the previous title component and the ancestor nodes of the previous title component are less than the preset score threshold, determining that no second title component exists among them.
In this embodiment, whether each title component is the second title component can be judged using the title hierarchy binary classification model, ensuring judgment precision. In particular, if a deep learning model is selected as the title hierarchy binary classification model, the accuracy of directory structure identification can be improved.
In an alternative embodiment, using the target data and the title hierarchy binary classification model to obtain the hierarchical classification scores of the previous title component and of the title components in the ancestor nodes of the previous title component comprises: inputting the features of the title component in question and the features of the first title component into a first relational feature generation model to obtain a first relational feature characterizing the relationship between the two; and inputting the first relational feature into the title hierarchy binary classification model to obtain the hierarchical classification score.
In another alternative embodiment, using the target data and the title hierarchy binary classification model to obtain the hierarchical classification scores comprises: inputting the features of the title component in question and the features of its sibling nodes into a second relational feature generation model to obtain a second relational feature characterizing the relationship between the title component and its siblings; inputting the second relational feature and the features of the first title component into a third relational feature generation model to generate a third relational feature characterizing the relationship between them; and inputting the third relational feature into the title hierarchy binary classification model to obtain the hierarchical classification score.
In an alternative embodiment, obtaining the ordered sequence of title components of the document to be processed includes: acquiring an ordered sequence of logical components of the document to be processed; and inputting the ordered sequence of logical components into a title detection model to obtain the ordered sequence of title components. If the title detection model comprises a first feature extraction submodel and a title binary classification submodel, inputting the ordered sequence of logical components into the title detection model comprises: inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components in the sequence; for each logical component, inputting its features into the title binary classification submodel to obtain a title classification result, the result being either title or non-title; and adding the logical components whose title classification result is title to the ordered sequence of title components. Alternatively, if the title detection model comprises a first feature extraction submodel, a second feature extraction submodel and a title binary classification submodel, inputting the ordered sequence of logical components into the title detection model comprises: inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components; inputting the features of each logical component and the features of its adjacent logical components into the second feature extraction submodel to obtain the context features of each logical component; inputting the context features into the title binary classification submodel to obtain a title classification result of each logical component, the result being either title or non-title; and adding the logical components whose title classification result is title to the ordered sequence of title components.
In the embodiment of the invention, the features of the logical components can be obtained first, and the classification results of the logical components are then determined using those features, ensuring classification efficiency and accuracy. In particular, if deep learning models are adopted for the first feature extraction submodel and the title binary classification submodel, the accuracy of directory structure identification can be improved.
A document may also contain lists and other logical components whose text structure closely resembles that of a title component; for example, each line in a list may be a combination of a number and text. The method takes into account the correlation between adjacent logical components in the document to be processed. For example, the logical components before and after a title component are usually title components of a different level, document content paragraphs, charts, pictures and so on, whereas the lines before and after a list line may carry the same kind of number. Even when the features of a list line are similar to those of a title component, their context features differ significantly. Therefore, by using the context features of a logical component, the recognition accuracy of the title detection model can be improved on the basis of the features of the surrounding logical components. In particular, if deep learning models are adopted for the first feature extraction submodel, the second feature extraction submodel and the title binary classification submodel, the accuracy of directory structure identification can be improved.
In an alternative embodiment, inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components includes: acquiring a text feature vector of the logical component and a format feature vector of the logical component; and splicing the text feature vector and the format feature vector into the feature vector of the logical component, wherein the text feature vector is generated based on the ordered character sequence of the logical component, and the format feature vector characterizes at least one of the following pieces of format information: whether the text of the logical component is bold, the text font size of the logical component, whether the text of the logical component is centered, and the category to which the logical component belongs, the categories including paragraphs, tables, charts and pictures.
In the embodiment of the invention, the text features and the format features of the logical components can be used together, improving the accuracy of directory structure identification.
In a second aspect, an apparatus for extracting a document directory structure is provided, including: a title sequence acquisition module for acquiring the ordered sequence of title components of the document to be processed; a logic tree building module for establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components; and a directory structure generating module for generating a directory structure of the document to be processed according to the title logic tree.
In a third aspect, a device for extracting a document directory structure is provided, including: a memory for storing a program; and a processor for running the program stored in the memory to execute the method for extracting a document directory structure provided in the first aspect or any optional implementation of the first aspect.
In a fourth aspect, a computer storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the method for extracting a document directory structure provided in the first aspect or any optional implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 illustrates an exemplary document directory structure;
FIG. 2 is a schematic diagram illustrating an exemplary logical tree structure of a title in an embodiment of the present invention;
FIG. 3 is a schematic flow chart diagram illustrating a method of extracting a document directory structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an exemplary process of generating a title logic tree according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram showing an extracting apparatus of a document directory structure according to an embodiment of the present invention;
fig. 6 is a block diagram of an exemplary hardware architecture of an extracting apparatus of a document directory structure in the embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a scheme for extracting a document directory structure, which is suitable for scenarios in which a directory structure needs to be extracted from a document, for example from complex financial texts such as share prospectuses, bond offering documents, annual reports, financial reports, merger and reorganization reports, rating reports, research reports, legal contract documents and public opinion news. After the title components are extracted from the document, a title logic tree can be generated using the hierarchical relationships among the title components. The title logic tree consists of a root node and N subtrees, where N is an integer and the N subtrees have no direct connection with one another.
To facilitate understanding of the title logic tree in the embodiment of the present invention, FIG. 2 shows a schematic structural diagram of a title logic tree provided in an embodiment of the present invention. As shown in FIG. 2, the title logic tree is composed of a root node R0 and three subtrees: a subtree composed of child nodes A1-A7, a subtree composed of child nodes A8-A13, and a subtree composed of child nodes A14-A19. The three child nodes directly connected to R0 are A1, A8 and A14, respectively. Specifically, the three subtrees are: the subtree composed of child node A1 and all child nodes A2 to A7 connected to A1 directly or indirectly; the subtree composed of child node A8 and all child nodes A9 to A13 connected to A8 directly or indirectly; and the subtree composed of child node A14 and all child nodes A15 to A19 connected to A14 directly or indirectly. There is no direct connection between the three subtrees.
In the title logic tree shown in FIG. 2, the root node R0 may be the subject name of the document or the theme of the document. All child nodes composing the three subtrees headed by A1, A8 and A14 are titles. For any child node in a subtree, its parent node is the title of the level above it, and its child nodes are titles of the level below it. For example, child node A1 is a first-level title and child node A2 is the first second-level title under that first-level title. Alternatively, the root node R0 shown in FIG. 2 may be left vacant, i.e., the root node R0 is not used to represent a level of the directory structure.
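To make the structure of FIG. 2 concrete, the following self-contained sketch represents a title logic tree as nested dictionaries and prints it as an indented directory. The description of FIG. 2 does not specify how A2-A7, A9-A13 and A15-A19 are nested inside their subtrees, so the nesting shown here is an assumption for illustration only.

```python
# A title logic tree as nested dicts: R0 is the root, A1/A8/A14 head the three subtrees.
# The internal arrangement of each subtree is assumed, not taken from FIG. 2.
title_logic_tree = {
    "R0": {
        "A1": {"A2": {}, "A3": {}, "A4": {"A5": {}, "A6": {}, "A7": {}}},
        "A8": {"A9": {}, "A10": {}, "A11": {"A12": {}, "A13": {}}},
        "A14": {"A15": {}, "A16": {"A17": {}, "A18": {}}, "A19": {}},
    }
}

def print_directory(tree, depth=0):
    """Pre-order traversal: each parent node is the upper-level title of its children."""
    for title, children in tree.items():
        if depth > 0:  # skip the (possibly vacant) root node R0
            print("  " * (depth - 1) + title)
        print_directory(children, depth + 1)

print_directory(title_logic_tree)
```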
In order to better understand the technical solution of the embodiment of the present invention, a method, an apparatus, a device, and a medium for extracting a document directory structure according to the embodiment of the present invention will be described in detail below with reference to the accompanying drawings, and it should be noted that these embodiments are not intended to limit the scope of the present invention.
Fig. 3 is a flowchart illustrating an extracting method of a document directory structure according to an embodiment of the present invention. As shown in fig. 3, the method 300 for extracting a document directory structure may include S310 to S330.
S310, acquiring the ordered sequence of the title components of the document to be processed.
The document to be processed is an electronic document from which the text information of the document can be obtained, for example an electronic document in WORD, PDF or TXT format.
The document to be processed may include at least one title paragraph, and each title paragraph is referred to as a title component. The front-to-back order of the title components in the ordered sequence of title components is the same as the order in which they appear in the document to be processed. Illustratively, if the title components are, in order of appearance in the document to be processed, title paragraph A1, title paragraph A2, ..., title paragraph Am, where the subscript of each title component indicates the order in which it appears in the document, then the ordered sequence of title components is {title paragraph A1, title paragraph A2, ..., title paragraph Am}, where m is a positive integer.
A title component is a single title in the document and may include a number and text. The number may be a numeral, such as an Arabic numeral ("123"), a Chinese numeral ("二十三", twenty-three), or a Roman numeral. The number may also be a combination of numerals and symbols, the symbols being, for example, a pause mark, an English period, a Chinese period, a colon or a comma, such as "1.1", "two, one" or "2.2.1". The number may also be a combination of a numeral and a word denoting a title structural unit such as a volume, chapter, section or subsection, for example "Chapter Three". In the same document, titles with different kinds of numbers belong to different levels. For example, "Chapter One", "Section One", "1.1" and "1.1.1" represent different levels.
In S310, many documents to be processed are not composed directly of title components but of a plurality of logical components that include the title components. Accordingly, in the process of obtaining the title components, the ordered sequence of logical components of the document to be processed can first be extracted from the document, and the ordered sequence of title components can then be screened out of the ordered sequence of logical components. Accordingly, S310 specifically includes S311 and S312.
S311, acquiring the logic component ordered sequence of the document to be processed.
First, in S311, the document to be processed may be divided into a plurality of mutually independent logical components, for example paragraphs, tables, charts and pictures, where paragraphs can be further subdivided into document content paragraphs and title paragraphs. As with the ordered sequence of title components described above, the front-to-back order of the logical components in the ordered sequence of logical components is the same as the order in which they appear in the document to be processed. Illustratively, if the logical components are, in order of appearance in the document to be processed, title paragraph A1, document content paragraph B1, document content paragraph B2, table C1, title paragraph A2, chart D1, title paragraph A3 and picture E1, then the ordered sequence of logical components is {title paragraph A1, document content paragraph B1, document content paragraph B2, table C1, title paragraph A2, chart D1, title paragraph A3, picture E1}.
Secondly, in the process of obtaining the ordered sequence of logical components, the document to be processed can be input into a logical structure analysis model to obtain the ordered sequence of logical components of the document. The logical structure analysis model can be trained with document samples in which the logical components are labeled. For example, the logical structure analysis model may cluster the content of the document to be processed according to its character text features and/or structural features, and take each clustering result as a logical component.
In addition, in the process of obtaining the ordered sequence of logical components, if the document to be processed is a multi-column document, each page of the document needs to be divided into a plurality of column regions using a column model. The column regions of each page are then sorted from left to right, and the logical components within each column are arranged from top to bottom. Finally, the logical components of each column region are analyzed, and the ordered sequence of logical components of the document is obtained according to the page order, the left-to-right order of the columns on each page, and the top-to-bottom order of the logical components in each column.
Illustratively, suppose a page in a document has two columns, column 1 and column 2, from left to right. Column 1 includes document content paragraph B2 and table C1, and column 2 includes title paragraph A2 and chart D1. The logical components in the page are then, in order, document content paragraph B2, table C1, title paragraph A2 and chart D1. A minimal sketch of this ordering rule is given below.
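The following sketch applies the ordering rule just described (page order, then columns from left to right, then components from top to bottom); the record fields page, column and top are hypothetical.

```python
def order_logical_components(components):
    """Sort logical components by page, then column (numbered from the left),
    then vertical position within the column (top to bottom).

    Each component is assumed to carry hypothetical fields:
    page (int), column (int), top (float, offset from the top of the column).
    """
    return sorted(components, key=lambda c: (c["page"], c["column"], c["top"]))

# Example with the page described above: column 1 holds B2 and C1, column 2 holds A2 and D1.
page = [
    {"id": "A2", "page": 1, "column": 2, "top": 0.10},
    {"id": "B2", "page": 1, "column": 1, "top": 0.12},
    {"id": "D1", "page": 1, "column": 2, "top": 0.55},
    {"id": "C1", "page": 1, "column": 1, "top": 0.60},
]
print([c["id"] for c in order_logical_components(page)])  # ['B2', 'C1', 'A2', 'D1']
```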
S312, inputting the ordered sequence of logical components into the title detection model to obtain the ordered sequence of title components.
In the process of obtaining the ordered sequence of title components, the title detection model can extract relevant features of the logical components, perform binary classification on the logical components, and confirm whether each logical component is a title component. Two possible embodiments of S312 are explained below.
In a first embodiment, the title detection model includes a first feature extraction submodel for extracting the features of the logical components and a title binary classification submodel. S312 may specifically include S3121 to S3123.
S3121, inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components in the sequence. The features of a logical component include its text features and its format features. In the calculation of the first feature extraction submodel, these features may be expressed in the form of vectors. Accordingly, S3121 may include the following two steps.
Firstly, acquiring a text feature vector of a logic component and a format feature vector of the logic component.
First, the text feature vector of a logical component represents its text features in vector form. Suppose the document to be processed comprises K logical components, and the ith logical component x_i includes M_i characters w_1 to w_{Mi}, where i is a positive integer no greater than K and M_i is a positive integer. According to the order of the characters in x_i, an ordered character sequence s_i = {w_1, w_2, ..., w_{Mi}} can be generated for x_i. The character sequence s_i can be input into the text feature extraction submodel to obtain the text feature vector t_i of the logical component.

Illustratively, taking the association between adjacent characters into account, the text feature extraction submodel may be a Recurrent Neural Network (RNN) model, so that t_i = RNN(s_i), where RNN() is the mapping function of the recurrent neural network layer and the parameters of the mapping function include a weight matrix W_R and an offset vector b_R. In training the RNN model, if N logical components are selected as training samples, the weight matrix W_R and the offset vector b_R are iteratively updated over the sample data with a gradient descent algorithm until the loss function satisfies a stop condition. The loss function may be the L2 loss function, which can be expressed as equation (1):

L = Σ_{j=1,...,N} || y_j − ŷ_j ||²  (1)

where y_j denotes the target text feature vector of the jth logical component and ŷ_j denotes the predicted text feature vector of the jth logical component.
Secondly, the format feature vector of a logical component represents its format features in vector form. Optionally, the format features of a logical component may include features of at least one of the following dimensions: whether the text is bold, the text font size, whether the text is centered, and the category to which the logical component belongs. Accordingly, the format feature vector characterizes at least one of the following pieces of format information: whether the text of the logical component is bold, the text font size of the logical component, whether the text of the logical component is centered, and the category to which the logical component belongs.
Illustratively, if the format features of logical component x_i include the four dimensions of whether the text is bold, the text font size, whether the text is centered, and the category to which the logical component belongs, then a bold format vector f_bold, a font-size format vector f_size, a centered format vector f_center and a component category vector f_cat can first be generated, and the four are then spliced into the format feature vector f_i of logical component x_i, i.e., f_i = [f_bold, f_size, f_center, f_cat].

For the bold format feature, the bold format vector f_bold has size 1, and two different values may be used to represent bold and not bold respectively; for example, f_bold = 1 may represent that the text of logical component x_i is bold, and f_bold = 0 that it is not bold.

For the text font size format feature, the corresponding font-size format vector f_size has size 1, and different values may correspond to different font sizes. For example, each font size may be normalized to a real number in the interval 0-1; one value of f_size may then represent that the text font size of x_i is "size four", and another that the text font size of x_i is 18.

For the centered-or-not format feature, the corresponding centered format vector f_center has size 1, and two different values may be used to represent centered and not centered respectively; for example, f_center = 1 may represent that the text of logical component x_i is centered, and f_center = 0 that it is not centered.

For the format feature of the category to which the logical component belongs, the size of the corresponding component category vector f_cat may be related to the number of component categories. If a component can be classified into five categories, namely text paragraph, table, chart, picture, and categories other than these four, then the dimension of the component category vector f_cat is 5. Illustratively, since each logical component can belong to only one category, f_cat may be a One-Hot vector: for example, (1,0,0,0,0) may represent that the category of the logical component is text paragraph, (0,1,0,0,0) a table, (0,0,1,0,0) a chart, (0,0,0,1,0) a picture, and (0,0,0,0,1) a category other than text paragraph, table, chart and picture.
Secondly, the text feature vector t_i and the format feature vector f_i are spliced into the feature vector v_i of logical component x_i, i.e., v_i = [t_i, f_i].
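A sketch of the format feature vector and of the splicing with the text feature vector under the four-dimension example above; the 0-1 normalization of the font size, the category order of the one-hot vector and the field names on the component record are assumptions.

```python
import numpy as np

CATEGORIES = ["paragraph", "table", "chart", "picture", "other"]  # assumed ordering

def format_feature_vector(component):
    """Concatenate bold / font-size / centered / category features (sizes 1 + 1 + 1 + 5 = 8)."""
    f_bold = [1.0 if component["bold"] else 0.0]
    f_size = [min(component["font_size"], 72) / 72.0]      # assumed 0-1 normalization
    f_center = [1.0 if component["centered"] else 0.0]
    f_cat = [1.0 if component["category"] == c else 0.0 for c in CATEGORIES]  # one-hot
    return np.array(f_bold + f_size + f_center + f_cat, dtype=np.float32)

def component_feature_vector(text_feature, component):
    """Feature vector v_i of logical component x_i: text feature t_i spliced with format feature f_i."""
    return np.concatenate([text_feature, format_feature_vector(component)])
```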
s3122, for each logical component' S characteristics,inputting the characteristics of each logic component into a title two-classification submodel to obtain the title classification result of each logic component. Wherein the title classification result comprises a title or a non-title. The title two classification submodel may select the first Softmax classifier. Wherein the score function of the first Softmax classifier
Figure BDA00024697690900001121
Satisfies formula (2):
Figure BDA00024697690900001118
wherein Softmax () is the first Softmax function.
Figure BDA00024697690900001119
In order to be a weight matrix, the weight matrix,
Figure BDA00024697690900001120
is a bias vector.
In addition, the title binary classification submodel may also be another classifier capable of binary classification, such as a sigmoid classifier, which is not limited here.
When training the first Softmax classifier, the title components may be used as positive samples and the other logical components as negative samples. Illustratively, the target classification score of a title component is labeled 1, and the target classification scores of the other logical components are labeled 0. In training the Softmax classifier, the L2 loss function may be selected; for its specific content, reference may be made to the relevant description above, which is not repeated here.
In addition, a classification score is obtained by inputting the features of a logical component into the title binary classification submodel. If the classification score is 1, the classification result of the logical component is title; otherwise, the classification result is non-title.
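A sketch of the title binary classification submodel of equation (2) as a linear layer followed by Softmax, with PyTorch assumed; the feature dimension shown is arbitrary.

```python
import torch
import torch.nn as nn

class TitleBinaryClassifier(nn.Module):
    """Equation (2): score = Softmax(W_S1 * v_i + b_S1) over {non-title, title}."""
    def __init__(self, feat_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 2)   # weight matrix W_S1 and bias vector b_S1

    def forward(self, v):                      # v: (batch, feat_dim) logical component features
        return torch.softmax(self.linear(v), dim=-1)[:, 1]  # score of the "title" class

classifier = TitleBinaryClassifier(feat_dim=136)   # e.g. 128-dim text feature + 8-dim format feature
# A component is added to the title component ordered sequence when its score indicates "title".
```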
S3123, adding the logical components in the ordered sequence of logical components whose title classification result is title to the ordered sequence of title components.
In a second embodiment, the features of a logical component and its context features can both be used to determine whether the logical component is a title component. If the title detection model includes a first feature extraction submodel for extracting the features of the logical components, a second feature extraction submodel for extracting the context features of the logical components, and a title binary classification submodel, S312 may specifically include S3124 to S3127:
and S3124, inputting the logic component ordered sequence into the first feature extraction submodel to obtain the features of the logic components in the logic component ordered sequence. For specific description of the first feature extraction submodel, reference may be made to relevant contents in the foregoing embodiments of the present invention, and details are not described herein again.
S3125, for each logical component, inputting its features and the features of its adjacent logical components into the second feature extraction submodel to obtain the context features of the logical component. Illustratively, if the ordered sequence of features {v_1, v_2, ..., v_K} of the K logical components of the document to be processed is obtained in S3124, it is input into the second feature extraction submodel to obtain the ordered sequence of context features {c_1, c_2, ..., c_K} of the logical components, where c_i is the context feature of the ith logical component.

The second feature extraction submodel may be a multi-layer convolutional neural network (multi-layer CNN) model. If the convolution kernel size of the multi-layer convolutional neural network model is L, then for any logical component x_i, the context feature of x_i can be generated using the features of the L-1 logical components adjacent to it before and after. Illustratively, if the number of layers of the multi-layer convolutional neural network model is 2, the mapping functions of the two convolutional layers are denoted CNN_1() and CNN_2(), the convolution kernel size of each layer is 3, and the convolution kernel and offset vector of the kth layer are W_C^k and b_C^k, then for any logical component x_i, the feature v_{i-2} of the (i-2)th logical component, the feature v_{i-1} of the (i-1)th logical component, the feature v_{i+1} of the (i+1)th logical component and the feature v_{i+2} of the (i+2)th logical component can be used to generate the context feature c_i of x_i, which satisfies equation (3):

c_i = CNN_2(CNN_1(v_{i-2}, v_{i-1}, v_i, v_{i+1}, v_{i+2}; W_C^1, b_C^1); W_C^2, b_C^2)  (3)
in the process of training the second feature extraction submodel, an L2 loss function can be selected and a gradient descent algorithm is used to continuously perform convolution kernel
Figure BDA00024697690900001310
The offset vector is
Figure BDA00024697690900001311
And carrying out iterative updating on the parameters. For details of the L2 loss function, reference may be made to the related description in the above embodiments of the present invention, and further description thereof is omitted.
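A sketch of the second feature extraction submodel as two Conv1d layers with kernel size 3 over the ordered component features, matching the receptive field v_{i-2} to v_{i+2} described above; PyTorch and the zero padding at the sequence boundaries are assumptions.

```python
import torch
import torch.nn as nn

class ContextFeatureExtractor(nn.Module):
    """Two Conv1d layers with kernel size 3: the context feature c_i of component x_i
    depends on the features of the two components before and after it."""
    def __init__(self, feat_dim, ctx_dim=128):
        super().__init__()
        self.cnn1 = nn.Conv1d(feat_dim, ctx_dim, kernel_size=3, padding=1)
        self.cnn2 = nn.Conv1d(ctx_dim, ctx_dim, kernel_size=3, padding=1)

    def forward(self, v_seq):                   # v_seq: (batch, K, feat_dim), K logical components
        x = v_seq.transpose(1, 2)               # Conv1d expects (batch, channels, length)
        x = torch.relu(self.cnn1(x))
        x = self.cnn2(x)
        return x.transpose(1, 2)                # context features: (batch, K, ctx_dim)
```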
S3126, inputting the context features of each logical component into the title binary classification submodel to obtain the title classification result of the logical component, the result being either title or non-title. In the title binary classification submodel, the feature vector v_i in equation (2) is replaced with the context feature c_i, and the title classification result of each logical component is then calculated using the replaced equation (2). For other details of the title binary classification submodel, reference may be made to the related description above, which is not repeated here.
S3127, adding the logical components in the ordered sequence of logical components whose title classification result is title to the ordered sequence of title components.
A document may also contain lists and other logical components whose text structure closely resembles that of a title component; for example, each line in a list may be a combination of a number and text. The method takes into account the correlation between adjacent logical components in the document to be processed. For example, the logical components before and after a title component are usually title components of a different level, document content paragraphs, charts, pictures and so on, whereas the lines before and after a list line may carry the same kind of number. Even when the features of a list line are similar to those of a title component, their context features differ significantly. Therefore, by using the context features of a logical component, the recognition accuracy of the title detection model can be improved on the basis of the features of the surrounding logical components.
S320, establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components.
In some embodiments of the present invention, the title components in the ordered sequence of title components may be taken in turn as the first title component.
S322, for each first title component, S3221 and S3222 are performed.
S3221, if a second title component having the same level as the first title component exists among the previous title component of the first title component in the title logic tree and the ancestor nodes of that previous title component, inserting the first title component into the title logic tree as a sibling node of the second title component. Inserting the first title component as a sibling node of the second title component indicates that the first title component and the second title component share the same upper-level title.
First, to fully understand S322, it is explained with reference to FIG. 4, which is a schematic diagram of an exemplary process of generating a title logic tree provided by an embodiment of the present invention. Referring to FIG. 4, suppose the title components in the document to be processed are, in order, title component a, title component b, title component c, title component d, title component e, title component f, title component g, and so on. In S322, title components a to g are inserted into the title logic tree in turn as the first title component. With continued reference to FIG. 4, title components a to f have already been inserted into the title logic tree, and title component g now needs to be inserted into the title logic tree as the first title component.
In S3221, it is determined whether a second title component exists among the title components in the rightmost branch of the existing title logic tree (shaded in FIG. 4), that is, among the previous title component f of the first title component g and the ancestor title components a and c of title component f. Here, ancestor title component a may be a first-level title, such as "Section Three: Company Financial Situation"; ancestor title component c may be a second-level title, such as "Two, Liabilities"; and the previous title component f of the first title component g may be "1. Current liabilities".
If a second title component exists in the rightmost branch of the existing title logic tree, there are three possible insertion positions for the first title component g. Specifically: (1) if title component a is the second title component, the first title component g needs to be inserted below the root node r, i.e., at position p1 in FIG. 4; (2) if title component c is the second title component, the first title component g needs to be inserted below node a as a child node of node a, i.e., at position p2 in FIG. 4. For example, if the first title component g is "Three, Shareholders' Equity" and it is a title of the same level as "Two, Liabilities" (title component c), then "Three, Shareholders' Equity" is inserted as a sibling node of "Two, Liabilities"; (3) if title component f is the second title component, the first title component g needs to be inserted below node c as a child node of node c, i.e., at position p3 in FIG. 4.
Next, the following steps one to three are performed when determining whether each node is a same-level node of the first title component.
Step one: obtaining the hierarchical classification scores of the title components among the previous title component of the first title component and the ancestor nodes of that previous title component, using the target data and the title hierarchy binary classification model.
First, regarding the hierarchical classification score: the hierarchical classification score may be a value in the interval [0,1] reflecting the probability that the title component to be classified is the second title component. The higher the hierarchical classification score, the higher the probability that the title component to be classified is the second title component.
In the first case, whether a second title component exists is determined by traversing all nodes among the previous title component of the first title component and the ancestor nodes of the previous title component: each of these title components is taken in turn as the title component to be classified and its hierarchical classification score is obtained through step one. If several nodes among the previous title component and the ancestor nodes of the previous title component have scores exceeding the preset score threshold, the node with the highest score may be selected as the second title component. Illustratively, with continued reference to FIG. 4, if the preset score threshold is 0.5, the hierarchical classification score of title component a is 0.6, the hierarchical classification score of title component c is 0.8, and the hierarchical classification score of title component f is 0.7, then title component c may be taken as the second title component.
In the second case, whether the second title component exists among the previous title component of the first title component and the ancestor nodes of the previous title component is judged in sequence starting from the root node of the title logic tree and proceeding downward. The first node whose hierarchical classification score exceeds the preset score threshold may be selected as the second title component. Continuing the example of the previous case, it can be determined in turn whether title component a, title component c and title component f are the second title component. Since the hierarchical classification score of title component a is 0.6, which is greater than the preset score threshold of 0.5, title component a can be selected as the second title component, and the hierarchical classification scores of title components c and f no longer need to be calculated.
In the third case, starting from the previous title component of the first title component, whether the second title component exists among the previous title component and the ancestor nodes of the previous title component is judged in sequence from low level to high level, that is, in the direction from the leaf node toward the root node. The first node whose hierarchical classification score exceeds the preset score threshold may be selected as the second title component. Continuing the same example, it can be determined in turn whether title component f, title component c and title component a are the second title component. Since the hierarchical classification score of title component f is 0.7, which is greater than the preset score threshold of 0.5, title component f can be selected as the second title component, and the hierarchical classification scores of title components c and a no longer need to be calculated. A minimal sketch of this bottom-up search is given after this paragraph.
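The sketch below combines step one with the bottom-up search of the third case: it walks from the previous title component toward the root and stops at the first node whose hierarchical classification score exceeds the threshold. The node structure follows the earlier tree sketch, and level_score stands in for the title hierarchy binary classification model; both are hypothetical.

```python
def find_second_title_component(previous_node, root, first_component, level_score, threshold=0.5):
    """Return the first node on the path from the previous title component toward the root
    whose hierarchical classification score with the first title component exceeds the
    threshold, or None if no second title component exists on that path."""
    node = previous_node
    while node is not None and node is not root:
        if level_score(node.component, first_component) > threshold:
            return node        # a second title component on the same level as the first
        node = node.parent
    return None
```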
Second, regarding the title hierarchy binary classification model: optionally, the title hierarchy binary classification model may include a feed-forward neural network (FNN) model and a second Softmax classifier. The feature of the first title component is denoted v_first, and the feature of each title component to be classified (the previous title component or one of its ancestor nodes) is denoted v_cand. It should be noted that "first" and "second" in the first Softmax classifier and the second Softmax classifier are only used to distinguish the two classifiers and do not limit their functions or features.
The target data include the features of the title component to be classified, i.e., a title component among the previous title component of the first title component and the ancestor nodes of that previous title component, and the features of the first title component. Specifically, the hierarchical classification scores of the title components among the previous title component and its ancestor nodes may be calculated one by one; accordingly, the target data used to calculate the hierarchical classification score of any such title component include the features of that title component and the features of the first title component.
Step one is described in detail below for two different kinds of target data, divided into two cases.
In the first case, the target data include only the features of the title component to be classified and the features of the first title component. In this case, step one includes: inputting the features v_cand of the title component and the features v_first of the first title component into the first relational feature generation model to obtain a first relational feature r_1 characterizing the relationship between the two; and then inputting the obtained first relational feature into the pre-trained title hierarchy binary classification model.

The first relational feature generation model may be embodied as an FNN model, that is, the FNN model may be used to obtain the relational feature between the features of the title component and the features of the first title component. Specifically, for any title component among the previous title component and the ancestor nodes of the previous title component, its features v_cand and the features v_first of the first title component can first be spliced into a first spliced vector [v_cand, v_first]; the relational feature vector r_1 between the two is then obtained from the first spliced vector through the mapping function FNN() of the FNN model, i.e., r_1 = FNN([v_cand, v_first]), where the parameters of the FNN model include a weight matrix W_F and an offset vector b_F. In addition, in training the FNN model, the L2 loss function can be selected and the parameters of the FNN model iteratively updated with a gradient descent algorithm. For details of the L2 loss function, reference may be made to the related description above, which is not repeated here.
Secondly, after the relational feature r_1 is obtained, it can be input into the second Softmax classifier for hierarchical classification recognition. The score function of the second Softmax classifier satisfies equation (4):

score(r_1) = Softmax(W_S2 · r_1 + b_S2)  (4)

where Softmax() is the second Softmax function, W_S2 is a weight matrix and b_S2 is a bias vector. For the specific content of the training process of the second Softmax classifier, reference may be made to the description of the first Softmax classifier above, which is not repeated here.

In addition, if the score output by the second Softmax classifier is 1, the title component is characterized as a second title component, i.e., a title component of the same level as the first title component; otherwise, the title component is characterized as not being a second title component, i.e., not a same-level title component of the first title component. A sketch of this first case is given below.
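A sketch of the first case, with PyTorch assumed: a feed-forward layer over the spliced features plays the role of the first relational feature generation model, and a linear layer with Softmax plays the role of the title hierarchy binary classification model of equation (4); layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TitleLevelClassifier(nn.Module):
    """First relational feature r_1 = FNN([v_cand, v_first]); hierarchical classification
    score = Softmax(W_S2 * r_1 + b_S2), as in equation (4)."""
    def __init__(self, feat_dim, rel_dim=128):
        super().__init__()
        self.fnn = nn.Sequential(nn.Linear(2 * feat_dim, rel_dim), nn.ReLU())
        self.classifier = nn.Linear(rel_dim, 2)

    def forward(self, v_cand, v_first):                    # both: (batch, feat_dim)
        r1 = self.fnn(torch.cat([v_cand, v_first], dim=-1))
        return torch.softmax(self.classifier(r1), dim=-1)[:, 1]  # score that both are same-level titles
```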
In the second case, the target data includes the characteristics of the first title component, the characteristics of the title components among the previous title component and the ancestor nodes of the previous title component, and the characteristics of the sibling nodes of those title components. The first step in the second case is substantially similar to that in the first case, except that the relational feature vector between the characteristics of a title component and the characteristics of the first title component is calculated differently. In this case, the first step may specifically include: inputting the characteristics of each title component among the previous title component and the ancestor nodes of the previous title component, together with the characteristics of the sibling nodes of that title component in the title logic tree, into a second relational feature generation model to obtain a second relational feature r2 characterizing the relationship between the characteristics of that title component and the characteristics of its sibling nodes; and inputting the second relational feature r2 and the characteristics h_1 of the first title component into a third relational feature generation model to generate a third relational feature r3 characterizing the relationship between the second relational feature and the characteristics of the first title component.
In particular, the characteristics of the first title component are denoted as h_1, and the characteristics of the arbitrary title component are denoted as h_a. The arbitrary title component has M sibling nodes S1, S2, ..., SM, where M is a positive integer, and the characteristics of the M sibling nodes are denoted as h_S1, h_S2, ..., h_SM respectively.
the second relational feature generation model may be embodied as an RNN model. That is, for any title component in the previous title component and the ancestor node of the previous title component, any title group may be precededFeatures of the elements
Figure BDA0002469769090000187
Characteristics of sibling nodes of arbitrary title component
Figure BDA0002469769090000188
Inputting RNN model, and obtaining second relation feature vector of relation between the arbitrary title component and brother node thereof
Figure BDA0002469769090000189
Then, a third relation characteristic is calculated
Figure BDA00024697690900001810
Then, the FNN model may be selected as the third correlation characteristic generation model. That is, the FNN model is first used to obtain the characteristics of the first header component
Figure BDA00024697690900001811
Relationship feature vector with the arbitrary title component and its sibling nodes
Figure BDA00024697690900001812
Characteristic of the relationship between
Figure BDA00024697690900001813
Figure BDA00024697690900001814
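The second case might be sketched as follows; the choice of a GRU as the RNN, the ordering of the input sequence, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SiblingRelationEncoder(nn.Module):
    """Encodes a candidate title component and its sibling nodes into a second
    relational feature r2 via an RNN, then fuses r2 with the feature of the first
    title component into a third relational feature r3 via an FNN (a sketch)."""

    def __init__(self, feat_dim: int, rel_dim: int):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, rel_dim, batch_first=True)
        self.fnn = nn.Linear(rel_dim + feat_dim, rel_dim)

    def forward(self, h_a, sibling_feats, h_1):
        # Sequence: the candidate component followed by its M sibling nodes S1..SM.
        seq = torch.cat([h_a.unsqueeze(1), sibling_feats], dim=1)
        _, r2 = self.rnn(seq)                  # second relational feature r2
        r2 = r2.squeeze(0)
        # Third relational feature r3 = FNN([r2; h_1]).
        return torch.relu(self.fnn(torch.cat([r2, h_1], dim=-1)))

encoder = SiblingRelationEncoder(feat_dim=128, rel_dim=64)
h_a = torch.randn(1, 128)            # candidate title component
siblings = torch.randn(1, 3, 128)    # M = 3 sibling nodes
h_1 = torch.randn(1, 128)            # first title component
r3 = encoder(h_a, siblings, h_1)     # fed to the same level classifier as in the first case
```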
In step two, if a title component whose hierarchical classification score is greater than a preset score threshold exists among the previous title component of the first title component and the ancestor nodes of that previous title component, it is determined that a second title component exists in the subtree where the previous title component is located. The related description of step two can refer to the related description of the hierarchical classification score in step one, and is not repeated here.
In step three, if the hierarchical classification scores of all the title components among the previous title component of the first title component and the ancestor nodes of that previous title component are smaller than the preset score threshold, it is determined that no second title component exists among the previous title component of the first title component and the ancestor nodes of that previous title component.
Illustratively, if the preset score threshold is 0.5, the hierarchical classification score corresponding to title component a is 0.4, the hierarchical classification score corresponding to title component c is 0.3, and the hierarchical classification score corresponding to title component f is 0.2, then none of title component a, title component c, and title component f is the second title component.
S3222, if no second title component exists in the subtree where the previous title component is located, the first title component is taken as a child node of the previous title component.
Illustratively, with continued reference to FIG. 4, if the hierarchical classification score of the title component f, the hierarchical classification score of the title component a, and the hierarchical classification score of the ancestor title component c are all less than the preset score threshold, then the first title component g is inserted as a child node of the previous title component f, below node f, i.e., at position p4 in FIG. 4.
For example, if the inserted title component g is "(1) Short-term notes", then, since it is judged that "(1) Short-term notes" is not at the same level as "Section 3 Company financial situation", "II. Liabilities", or "1. Current liabilities", it is inserted as a child node of the previous title component f, "1. Current liabilities".
In addition, for the first title component xi, after S3221 and S3222 are performed, xi+1 may be taken as the next first title component, and S3221 and S3222 are performed again.
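The insertion logic of S3221 and S3222 can be sketched in plain Python as below; the TitleNode class, the same_level_score callback (e.g. backed by the classifier sketched above), and the 0.5 threshold are hypothetical illustrations, not elements of this disclosure.

```python
class TitleNode:
    def __init__(self, component=None, parent=None):
        self.component = component      # None marks the root of the title logic tree
        self.parent = parent
        self.children = []

def insert_title(prev_node, component, same_level_score):
    """Insert one first title component into the title logic tree."""
    # S3221: walk from the previous title component up through its ancestor nodes.
    candidate = prev_node
    while candidate is not None and candidate.component is not None:
        if same_level_score(candidate, component) > 0.5:   # assumed preset threshold
            # A second title component exists: insert as its sibling node.
            node = TitleNode(component, parent=candidate.parent)
            candidate.parent.children.append(node)
            return node
        candidate = candidate.parent
    # S3222: no second title component, insert as a child of the previous component.
    node = TitleNode(component, parent=prev_node)
    prev_node.children.append(node)
    return node

def build_title_logic_tree(title_components, same_level_score):
    root = TitleNode()
    prev = root
    for comp in title_components:       # x1, x2, ... taken in turn as the first title component
        prev = insert_title(prev, comp, same_level_score)
    return root
```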
In a method for generating a directory structure using a title template, the number of levels in the generated directory structure is the same as the number of levels set in the title template. For example, if only 3 title levels are set in the template, at most 3 title levels can be generated. With the method for extracting a document directory structure in the embodiment of the invention, each title component is compared with the title components already added to the title logic tree to determine whether a same-level title component exists, and if not, it is inserted as a child node of the previous title component. Therefore, even if the document to be processed has more title levels, for example 8 or 9 levels, the corresponding title levels can still be generated. Compared with the method of generating the directory structure using a title template, this can improve the flexibility, accuracy, and depth of directory structure generation.
S330, generating a directory structure of the document to be processed according to the title logic tree. Specifically, after the title logic tree is obtained, the title components corresponding to the child nodes directly connected to the root node are taken as first-level titles. For example, title component A1, title component A8, and title component A14 in FIG. 2 are first-level titles. The child nodes of a first-level title are second-level titles. For example, title component A2, title component A3, and title component A7 are second-level titles under the first-level title A1.
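S330 then reduces to a pre-order traversal of the title logic tree, where depth below the root gives the title level. A sketch, continuing the hypothetical TitleNode example above (the printed layout is illustrative only):

```python
def generate_directory(root):
    """Flatten the title logic tree into (level, title component) pairs:
    children of the root are level-1 titles, their children level-2, and so on."""
    entries = []

    def walk(node, level):
        for child in node.children:
            entries.append((level, child.component))
            walk(child, level + 1)

    walk(root, 1)
    return entries

# Usage, continuing the previous sketch:
# tree = build_title_logic_tree(title_components, same_level_score)
# for level, comp in generate_directory(tree):
#     print("  " * (level - 1) + str(comp))
```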
According to the method for extracting a document directory structure in the embodiment of the invention, the ordered sequence of title components in the document to be processed can be obtained first, and the title logic tree is established using each title component in the ordered sequence of title components. Because the title component corresponding to any node in the title logic tree is the upper-level title of the title components corresponding to the child nodes of that node, the hierarchical relationship among the title components can be determined by establishing the title logic tree, thereby improving the extraction accuracy of the directory structure.
In addition, compared with a method for generating a directory structure using a title template, the embodiment of the invention generates the directory hierarchy using pre-trained learning models (the first feature extraction submodel, the second feature extraction submodel, the title binary classification submodel, the title level binary classification model, and the like), thereby ensuring the generalization capability of the extraction method of the document directory structure. Especially when a deep learning model is selected, the accuracy of directory structure identification can be further improved.
In some embodiments of the present invention, part or all of the directory structure of the document to be processed may be displayed on a display interface of the terminal. For example, if the hierarchy of the directory structure is complex, only the titles of the first 3 levels may be displayed. The format of titles at different levels may differ, for example in the number of characters of indentation before the paragraph, or in the font size. The format of the titles at different levels can be set according to specific needs and is not limited herein.
In addition, in order to facilitate indexing of the document to be processed, when a trigger operation is performed on a title component on the terminal, the user can jump directly to the page where the title component is located.
An apparatus according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Based on the same inventive concept, the embodiment of the invention provides an extracting device of a document directory structure. Fig. 5 is a structural schematic diagram showing an extracting apparatus of a document directory structure according to an embodiment of the present invention. As shown in fig. 5, the apparatus 500 for extracting a document directory structure includes a title sequence acquiring module 510, a logical tree building module 520, and a target structure generating module 530.
The title sequence acquiring module 510 is configured to acquire an ordered sequence of title components of the document to be processed.
And a logical tree building module 520, configured to build the title logical tree based on the hierarchical relationship between the title components in the ordered sequence of the title components.
And the target structure generating module 530 is configured to generate a directory structure of the document to be processed according to the title logic tree.
In some embodiments of the present invention, the logic tree building module 520 is specifically configured to:
the title components in the ordered sequence of title components are in turn taken as the first title component.
For each first title component, performing the following operations:
and if a second title component with the same level as the first title component exists in the ancestor nodes of the previous title component and the previous title component of the first title component in the title logical tree, inserting the first title component into the title logical tree as the brother node of the second title component. And if the second title component does not exist in the ancestor nodes of the previous title component and the previous title component, inserting the first title component into the title logical tree as a child node of the previous title component.
In some embodiments of the present invention, the apparatus 500 for extracting a document directory structure further comprises:
and the hierarchical classification module is used for obtaining the hierarchical classification scores of the previous title assembly and the title assembly in the ancestor node of the previous title assembly by using the target data and the title hierarchical two-classification model.
Wherein the target data includes the characteristics of the previous title component and the characteristics of the first title component in the ancestor nodes of the previous title component and the previous title component.
And the first determination module is used for determining that a second title component exists in the previous title component and the ancestor nodes of the previous title component if a title component with a hierarchical classification score greater than a preset score threshold exists among the previous title component and the ancestor nodes of the previous title component.
And the second determination module is used for determining that the second title component does not exist in the previous title component and the ancestor nodes of the previous title component if the hierarchical classification scores of all the title components among the previous title component and the ancestor nodes of the previous title component are less than the preset score threshold.
In some embodiments, the hierarchical classification module is specifically configured to:
and inputting the characteristics of the title assembly in the ancestor nodes of the previous title assembly and the characteristics of the first title assembly into a first relation characteristic generation model to obtain a first relation characteristic representing the relation between the characteristics of the title assembly in the ancestor nodes of the previous title assembly and the characteristics of the first title assembly.
And inputting the first relation characteristic into the title level binary classification model to obtain the level classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
In other embodiments, the hierarchical classification module is specifically configured to:
and inputting the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the previous title assembly, and the characteristics of the brother nodes of the title assemblies in the ancestor nodes of the previous title assembly and the previous title assembly in the title node into a second relational feature generation model to obtain a second relational feature representing the relationship between the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the characteristics of the brother nodes.
And inputting the second relational feature and the feature of the first title assembly into a third relational feature generation model to generate a third relational feature which characterizes the relationship between the second relational feature and the feature of the first title assembly.
And inputting the third relation characteristic into the title level binary classification model to obtain the hierarchical classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
In some embodiments of the present invention, the title sequence obtaining module 510 is specifically configured to:
acquiring a logic component ordered sequence of a document to be processed; and inputting the logic component ordered sequence into a title detection model to obtain the title component ordered sequence.
In some embodiments, the title detection model includes a first feature extraction sub-model and a title binary classification sub-model, and the title sequence obtaining module 510 is specifically configured to:
and inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence.
And for each logic component, inputting the characteristics of the logic component into the title binary classification submodel to obtain a title classification result of the logic component, where the title classification result is title or non-title.
And adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence.
In other embodiments, the title detection model includes a first feature extraction sub-model, a second feature extraction sub-model, and a title binary classification sub-model, and the title sequence obtaining module 510 is specifically configured to:
and inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence.
And for each logic component, inputting the characteristics of the logic component and the characteristics of its adjacent logic components into the second feature extraction submodel to obtain the contextual characteristics of the logic component.
Inputting the context characteristics into a title binary classification submodel to obtain a title classification result of each logic component, wherein the title classification result comprises a title or a non-title;
and adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence.
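The title detection path just described might be sketched as follows; the bidirectional GRU standing in for the context-aware feature extraction, the classifier form, and all dimensions are assumptions made for illustration, not details of this disclosure.

```python
import torch
import torch.nn as nn

class TitleDetector(nn.Module):
    """Sketch of the two-stage title detection: per-component features, contextual
    features from adjacent logic components (here via a bidirectional GRU), then a
    binary title / non-title classifier."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.context = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)

    def forward(self, component_feats: torch.Tensor) -> torch.Tensor:
        # component_feats: (1, num_components, feat_dim), the ordered sequence of logic components
        ctx, _ = self.context(component_feats)        # contextual characteristics of each component
        return torch.softmax(self.classifier(ctx), dim=-1)

detector = TitleDetector(feat_dim=128, hidden_dim=64)
feats = torch.randn(1, 10, 128)                       # features of 10 logic components
probs = detector(feats)
is_title = probs[0, :, 1] > 0.5                       # components classified as titles, in order
title_indices = [i for i, t in enumerate(is_title.tolist()) if t]
```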
In some embodiments of the present invention, the logic component is characterized by a feature vector, and the header sequence obtaining module 510 is specifically configured to:
and acquiring the text feature vector of the logic component and the format feature vector of the logic component.
And splicing the text feature vector and the format feature vector into a feature vector of the logic component.
Wherein the text feature vector of the logical component is generated based on the ordered sequence of characters of the logical component.
The format feature vector characterizes at least one of the following format information: whether the logic component is thickened, the text word size of the logic component, whether the text of the logic component is centered and represents the category to which the logic component belongs, wherein the category to which the logic component belongs comprises: paragraphs, tables, charts, pictures.
Other details of the apparatus for extracting a document directory structure according to the embodiment of the present invention are similar to the method according to the embodiment of the present invention described above with reference to fig. 1 to 4, and are not repeated herein.
Fig. 6 is a block diagram of an exemplary hardware architecture of an extracting apparatus of a document directory structure in the embodiment of the present invention.
As shown in fig. 6, the extracting apparatus 600 of the document directory structure includes an input apparatus 601, an input interface 602, a central processor 603, a memory 604, an output interface 605, and an output apparatus 606. The input interface 602, the central processing unit 603, the memory 604, and the output interface 605 are connected to each other via a bus 610, and the input device 601 and the output device 606 are connected to the bus 610 via the input interface 602 and the output interface 605, respectively, and further connected to other components of the document directory structure extraction device 600.
Specifically, the input device 601 receives input information from the outside, and transmits the input information to the central processor 603 through the input interface 602; the central processor 603 processes input information based on computer-executable instructions stored in the memory 604 to generate output information, stores the output information temporarily or permanently in the memory 604, and then transmits the output information to the output device 606 through the output interface 605; the output device 606 outputs the output information to the outside of the extracting device 600 of the document directory structure for use by the user.
That is, the extracting device of the document directory structure shown in fig. 6 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement the method for extracting a document directory structure described in conjunction with fig. 1 to 4.
In one embodiment, the extracting apparatus 600 of the document directory structure shown in fig. 6 may be implemented as an apparatus that may include: a memory for storing a program; and the processor is used for operating the program stored in the memory so as to execute the extraction method of the document directory structure of the embodiment of the invention.
The embodiment of the invention also provides a computer storage medium, wherein computer program instructions are stored on the computer storage medium, and when being executed by a processor, the computer program instructions realize the extraction method of the document directory structure of the embodiment of the invention.
It is to be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via a computer network such as the internet, an intranet, etc.
As described above, only the specific embodiments of the present invention are provided, and it is clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Claims (10)

1. A method for extracting a document directory structure, the method comprising:
acquiring a title component ordered sequence of a document to be processed;
establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of the title components;
and generating a directory structure of the document to be processed according to the title logic tree.
2. The method according to claim 1, wherein the establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of the title components comprises:
sequentially taking the title components in the ordered sequence of title components as first title components;
for each first title component, performing the following operations:
if a second title component with the same level as the first title component exists in a previous title component of the first title component and an ancestor node of the previous title component in the title logical tree, inserting the first title component into the title logical tree as a brother node of the second title component;
and if the second title component does not exist in the ancestor nodes of the previous title component and the previous title component, inserting the first title component into the title logical tree as a child node of the previous title component.
3. The method of claim 2, further comprising:
obtaining hierarchical classification scores of the previous title component and the title components in the ancestor nodes of the previous title component by using target data and a title hierarchical classification model, wherein the target data comprises the characteristics of the title components in the ancestor nodes of the previous title component and the characteristics of the first title component;
determining that the second title component exists in the ancestor nodes of the previous title component and the previous title component if a title component with a hierarchical classification score greater than a preset score threshold exists in the ancestor nodes of the previous title component and the previous title component;
determining that the second title component is not present in the ancestor nodes of the previous title component and the previous title component if the hierarchical classification scores of all the title components in the ancestor nodes of the previous title component and the previous title component are less than the preset score threshold.
4. The method of claim 3, wherein obtaining hierarchical classification scores for the previous title component and the title component in the ancestor node of the previous title component using the target data and a title hierarchical classification model comprises:
inputting the characteristics of the title components in the previous title component and the ancestor nodes of the previous title component and the characteristics of the first title component into a first relational feature generation model to obtain a first relational feature representing the relationship between the characteristics of the title components in the previous title component and the ancestor nodes of the previous title component and the characteristics of the first title component;
and inputting the first relation characteristic into the title level binary classification model to obtain the level classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
5. The method of claim 3, wherein obtaining hierarchical classification scores for the previous title component and the title component in the ancestor node of the previous title component using the target data and a title hierarchical classification model comprises:
inputting the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the characteristics of the brother nodes of the title assemblies in the ancestor nodes of the previous title assembly and the previous title assembly in the title node into a second relational characteristic generation model to obtain a second relational characteristic representing the relationship between the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the characteristics of the brother nodes;
inputting the second relational feature and the feature of the first title assembly into a third relational feature generation model, and generating a third relational feature which represents the relationship between the second relational feature and the feature of the first title assembly;
inputting the third relation characteristic into the title level binary classification model to obtain the level classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
6. The method of claim 1, wherein obtaining an ordered sequence of header components of the document to be processed comprises:
acquiring a logic component ordered sequence of the document to be processed;
inputting the logic component ordered sequence into a title detection model to obtain the title component ordered sequence; wherein, if the title detection model comprises a first feature extraction submodel and a title binary classification submodel,
inputting the logic component ordered sequence into a title detection model to obtain the title component ordered sequence, wherein the method comprises the following steps:
inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence;
inputting the characteristics of each logic component into the title two-classification submodel aiming at the characteristics of each logic component to obtain a title classification result of each logic component, wherein the title classification result is a title or a non-title;
adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence;
or, alternatively,
if the title detection model comprises a first feature extraction submodel, a second feature extraction submodel and a title binary classification submodel,
inputting the ordered sequence of logical components into a title detection model to obtain the ordered sequence of title components, including:
inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence;
aiming at the characteristics of each logic assembly, inputting the characteristics of each logic assembly and the characteristics of adjacent logic assemblies of each logic assembly into a second characteristic extraction submodel to obtain the context characteristics of each logic assembly;
inputting the context characteristics into the title binary classification submodel to obtain a title classification result of each logic component, wherein the title classification result comprises a title or a non-title;
and adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence.
7. The method of claim 5 or claim 6, wherein the features of the logic component are feature vectors,
inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence, wherein the method comprises the following steps:
acquiring a text feature vector of the logic component and a format feature vector of the logic component;
concatenating the text feature vector and the format feature vector into a feature vector for the logical component,
wherein the text feature vector of the logical component is generated based on the ordered sequence of characters of the logical component,
the format feature vector characterizes at least one of the following format information:
whether the text of the logic component is bold, the font size of the text of the logic component, whether the text of the logic component is centered, and the category to which the logic component belongs, wherein the category to which the logic component belongs comprises: paragraphs, tables, charts, pictures.
8. An apparatus for extracting a document directory structure, the apparatus comprising:
the title sequence acquisition module is used for acquiring the ordered sequence of the title components of the document to be processed;
the logic tree building module is used for building a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of the title components;
and the target structure generating module is used for generating the directory structure of the document to be processed according to the title logic tree.
9. An apparatus for extracting a document directory structure, the apparatus comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method for extracting a document directory structure of any one of claims 1 to 8.
10. A computer storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of extracting a document directory structure of any one of claims 1 to 8.
CN202010344802.XA 2020-04-27 2020-04-27 Method, device, equipment and medium for extracting document directory structure Pending CN113642320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010344802.XA CN113642320A (en) 2020-04-27 2020-04-27 Method, device, equipment and medium for extracting document directory structure

Publications (1)

Publication Number Publication Date
CN113642320A true CN113642320A (en) 2021-11-12

Family

ID=78415101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010344802.XA Pending CN113642320A (en) 2020-04-27 2020-04-27 Method, device, equipment and medium for extracting document directory structure

Country Status (1)

Country Link
CN (1) CN113642320A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007286861A (en) * 2006-04-17 2007-11-01 Hitachi Ltd Method for extracting document structure and document search method
CN102467496A (en) * 2010-11-17 2012-05-23 北大方正集团有限公司 Method and device for converting stream mode typeset content into block mode typeset document
CN102567394A (en) * 2010-12-30 2012-07-11 国际商业机器公司 Method and device for obtaining hierarchical information of plane data
JP5433764B1 (en) * 2012-11-01 2014-03-05 ヤフー株式会社 Hierarchical structure modification processing apparatus, hierarchical structure modification method, and program
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
CN106649464A (en) * 2016-09-26 2017-05-10 深圳市数字城市工程研究中心 Method of building Chinese address tree and device
CN107145479A (en) * 2017-05-04 2017-09-08 北京文因互联科技有限公司 Structure of an article analysis method based on text semantic
CN107391675A (en) * 2017-07-21 2017-11-24 百度在线网络技术(北京)有限公司 Method and apparatus for generating structure information
CN109977366A (en) * 2017-12-27 2019-07-05 珠海金山办公软件有限公司 A kind of catalogue generation method and device
CN110532834A (en) * 2018-05-24 2019-12-03 北京庖丁科技有限公司 Table extracting method, device, equipment and medium based on rich text format document
CN110704573A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Directory storage method and device, computer equipment and storage medium
CN110852079A (en) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 Document directory automatic generation method and device and computer readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LAN WEI et al.: "Research on hierarchical tree structure of Multi Dimension Data Index in Centralized Environment", Iberian Journal of Information Systems and Technologies, pages 345-358 *
MUHAMMAD MAHBUBUR RAHMAN; TIM FININ: "Unfolding the structure of a document using deep learning", https://arxiv.org/pdf/1910.03678v1.pdf, pages 1-16 *
NAJAH-IMANE BENTABET et al.: "Table-Of-Contents generation on contemporary documents", https://arxiv.linfen3.top/abs/1911.08836, pages 1-8 *
LUO SHUANGLING; WANG TAO; KUANG HAIBO: "Research on hierarchical annotation system and folksonomy generation algorithm based on hierarchical tags", Systems Engineering - Theory & Practice, vol. 38, no. 7, pages 1862-1869 *
HUANG SHENG; WANG BOBO; ZHU JING: "Information extraction from financial announcements based on document structure and deep learning", Computer Engineering and Design, vol. 41, no. 1, pages 115-121 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114741147A (en) * 2022-03-30 2022-07-12 阿里巴巴(中国)有限公司 Method for displaying page on mobile terminal and mobile terminal
CN114741147B (en) * 2022-03-30 2023-11-14 阿里巴巴(中国)有限公司 Method for displaying page on mobile terminal and mobile terminal
CN116127079A (en) * 2023-04-20 2023-05-16 中电科大数据研究院有限公司 Text classification method
CN116127079B (en) * 2023-04-20 2023-06-20 中电科大数据研究院有限公司 Text classification method

Similar Documents

Publication Publication Date Title
KR101312770B1 (en) Information classification paradigm
US8468167B2 (en) Automatic data validation and correction
US8539349B1 (en) Methods and systems for splitting a chinese character sequence into word segments
US10956673B1 (en) Method and system for identifying citations within regulatory content
CN112131920A (en) Data structure generation for table information in scanned images
RU2760471C1 (en) Methods and systems for identifying fields in a document
CN111062451B (en) Image description generation method based on text guide graph model
US10963717B1 (en) Auto-correction of pattern defined strings
US20210110153A1 (en) Heading Identification and Classification for a Digital Document
Sinha et al. Visual text recognition through contextual processing
US20220335073A1 (en) Fuzzy searching using word shapes for big data applications
CN109165373B (en) Data processing method and device
US11615244B2 (en) Data extraction and ordering based on document layout analysis
CN113642320A (en) Method, device, equipment and medium for extracting document directory structure
CN108959204B (en) Internet financial project information extraction method and system
CN111488400B (en) Data classification method, device and computer readable storage medium
US20230134218A1 (en) Continuous learning for document processing and analysis
RU2703270C1 (en) Optical character recognition using specialized confidence functions, implemented on the basis of neural networks
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN113392189A (en) News text processing method based on automatic word segmentation
Klaiman et al. DocReader: bounding-box free training of a document information extraction model
WO2021154238A1 (en) A transferrable neural architecture for structured data extraction from web documents
CN112651590A (en) Instruction processing flow recommending method
CN116912867B (en) Teaching material structure extraction method and device combining automatic labeling and recall completion
Idziak et al. Scalable handwritten text recognition system for lexicographic sources of under-resourced languages and alphabets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination