CN113642320A - Method, device, equipment and medium for extracting document directory structure - Google Patents

Method, device, equipment and medium for extracting document directory structure

Info

Publication number
CN113642320A
CN113642320A (application CN202010344802.XA)
Authority
CN
China
Prior art keywords
title
component
logic
previous
assembly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010344802.XA
Other languages
Chinese (zh)
Inventor
林得苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pai Tech Co ltd
Original Assignee
Pai Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pai Tech Co ltd filed Critical Pai Tech Co ltd
Priority to CN202010344802.XA priority Critical patent/CN113642320A/en
Publication of CN113642320A publication Critical patent/CN113642320A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method, a device, equipment and a medium for extracting a document directory structure. The method comprises the following steps: acquiring an ordered sequence of title components of a document to be processed; establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components; and generating a directory structure of the document to be processed according to the title logic tree. The method, device, equipment and medium for extracting a document directory structure provided by the embodiments of the invention can improve the accuracy of directory structure extraction.

Description

Method, device, equipment and medium for extracting document directory structure
Technical Field
The present invention relates to the field of data processing, and in particular, to a method, an apparatus, a device, and a medium for extracting a document directory structure.
Background
A title is a short phrase that summarizes the content of a document. In order to enhance readability, a document typically contains multiple levels of titles: the document content under a title of a given level can be subdivided into several parts by establishing subordinate titles beneath it.
The document directory structure records the membership between titles of different levels, with lower-level titles subordinate to higher-level titles. FIG. 1 shows an exemplary document directory structure comprising three levels of titles, from highest to lowest: first-level, second-level and third-level titles. In FIG. 1, the second-level title "1. Second-level title" is subordinate to the first-level title "One, First-level title", and the third-level titles "2.1 Third-level title" and "2.2 Third-level title" are subordinate to the second-level title "2. Second-level title".
In existing directory extraction methods, the titles in a document need to be manually marked, for example by being set to a heading style. When the directory is generated, the directory structure is built from the paragraphs marked with the heading style, and the accuracy of the extracted directory structure is low.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for extracting a document directory structure, which can improve the extraction accuracy of the directory structure.
In a first aspect, a method for extracting a document directory structure is provided, including: acquiring an ordered sequence of title components of a document to be processed; establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components; and generating a directory structure of the document to be processed according to the title logic tree.
According to the method for extracting a document directory structure in the embodiment of the invention, the ordered sequence of title components of the document to be processed can first be obtained, and a title logic tree is then established using each title component in the sequence. Because the title component corresponding to any node in the title logic tree is the upper-level title of the title components corresponding to the child nodes of that node, establishing the title logic tree determines the hierarchical relationship among the title components, thereby improving the accuracy of directory structure extraction.
In an alternative embodiment, establishing the title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components specifically includes: sequentially taking the title components in the ordered sequence as the first title component; and, for each first title component, performing the following operations: if a second title component with the same level as the first title component exists among the previous title component of the first title component in the title logic tree and the ancestor nodes of that previous title component, inserting the first title component into the title logic tree as a sibling node of the second title component; and if no such second title component exists among the previous title component and its ancestor nodes, inserting the first title component into the title logic tree as a child node of the previous title component.
In a method that generates a directory structure from a title template, the number of levels of the directory structure is limited to the number of levels set in the template. For example, if only three title levels are set in the template, at most three title levels can be generated. With the method for extracting a document directory structure in the embodiment of the invention, each first title component is compared with the title components already added to the title logic tree, and if no title component of the same level exists, the first title component is used as a child node of the previous title component. Even if the document to be processed has more title levels, for example eight or nine levels, the corresponding levels can still be generated. Compared with generating the directory structure from a title template, this improves the flexibility, accuracy and depth of directory structure generation. A minimal illustration of this insertion rule is given below.
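The following minimal Python sketch illustrates the insertion rule described above; it is an illustration under assumptions, not the claimed implementation. The names TitleNode, build_title_logic_tree and is_same_level are hypothetical, and is_same_level stands in for whatever hierarchy judgment is used (for example the title hierarchy binary classification model described later).

```python
class TitleNode:
    def __init__(self, component=None, parent=None):
        self.component = component  # the title component (None for the root node)
        self.parent = parent
        self.children = []

def build_title_logic_tree(title_components, is_same_level):
    """Insert each title component either as a sibling of a same-level node found
    among the previous title component and its ancestors, or as a child of the
    previous title component, following the rule described above."""
    root = TitleNode()
    previous = root
    for component in title_components:
        node = TitleNode(component)
        # Walk from the previous title component up through its ancestor nodes,
        # looking for a second title component on the same level.
        candidate, second = previous, None
        while candidate is not root:
            if is_same_level(candidate.component, component):
                second = candidate
                break
            candidate = candidate.parent
        if second is not None:
            # Insert as a sibling node of the second title component.
            node.parent = second.parent
            second.parent.children.append(node)
        else:
            # Insert as a child node of the previous title component.
            node.parent = previous
            previous.children.append(node)
        previous = node
    return root
```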
In an optional embodiment, the method further comprises: obtaining hierarchical classification scores of the previous title component and of the title components in the ancestor nodes of the previous title component by using target data and a title hierarchy binary classification model, wherein the target data comprise the features of the title component in question (the previous title component or one of its ancestor nodes) and the features of the first title component; if a title component whose hierarchical classification score is greater than a preset score threshold exists among the previous title component and the ancestor nodes of the previous title component, determining that a second title component exists among them; and if the hierarchical classification scores of all title components among the previous title component and the ancestor nodes of the previous title component are less than the preset score threshold, determining that no second title component exists among them.
In this embodiment, whether each title component is the second title component can be judged using the title hierarchy binary classification model, ensuring judgment precision. In particular, if a deep learning model is selected as the title hierarchy binary classification model, the accuracy of directory structure identification can be improved.
In an alternative embodiment, using the target data and the title hierarchy binary classification model to obtain the hierarchical classification scores of the previous title component and of the title components in the ancestor nodes of the previous title component comprises: inputting the features of the title component in question and the features of the first title component into a first relational feature generation model to obtain a first relational feature characterizing the relationship between the two; and inputting the first relational feature into the title hierarchy binary classification model to obtain the hierarchical classification score.
In another alternative embodiment, using the target data and the title hierarchy binary classification model to obtain the hierarchical classification scores comprises: inputting the features of the title component in question and the features of its sibling nodes into a second relational feature generation model to obtain a second relational feature characterizing the relationship between the title component and its siblings; inputting the second relational feature and the features of the first title component into a third relational feature generation model to generate a third relational feature characterizing the relationship between them; and inputting the third relational feature into the title hierarchy binary classification model to obtain the hierarchical classification score.
In an alternative embodiment, obtaining the ordered sequence of title components of the document to be processed includes: acquiring an ordered sequence of logical components of the document to be processed; and inputting the ordered sequence of logical components into a title detection model to obtain the ordered sequence of title components. If the title detection model comprises a first feature extraction submodel and a title binary classification submodel, inputting the ordered sequence of logical components into the title detection model comprises: inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components in the sequence; for each logical component, inputting its features into the title binary classification submodel to obtain a title classification result, the result being either title or non-title; and adding the logical components whose title classification result is title to the ordered sequence of title components. Alternatively, if the title detection model comprises a first feature extraction submodel, a second feature extraction submodel and a title binary classification submodel, inputting the ordered sequence of logical components into the title detection model comprises: inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components; inputting the features of each logical component and the features of its adjacent logical components into the second feature extraction submodel to obtain the context features of each logical component; inputting the context features into the title binary classification submodel to obtain a title classification result of each logical component, the result being either title or non-title; and adding the logical components whose title classification result is title to the ordered sequence of title components.
In the embodiment of the invention, the features of the logical components can be obtained first, and the classification results of the logical components are then determined using those features, ensuring classification efficiency and accuracy. In particular, if deep learning models are adopted for the first feature extraction submodel and the title binary classification submodel, the accuracy of directory structure identification can be improved.
A document may also contain lists and other logical components whose text structure closely resembles that of a title component; for example, each line in a list may be a combination of a number and text. The method takes into account the correlation between adjacent logical components in the document to be processed. For example, the logical components before and after a title component are usually title components of a different level, document content paragraphs, charts, pictures and so on, whereas the lines before and after a list line may carry the same kind of number. Even when the features of a list line are similar to those of a title component, their context features differ significantly. Therefore, by using the context features of a logical component, the recognition accuracy of the title detection model can be improved on the basis of the features of the surrounding logical components. In particular, if deep learning models are adopted for the first feature extraction submodel, the second feature extraction submodel and the title binary classification submodel, the accuracy of directory structure identification can be improved.
In an alternative embodiment, inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components includes: acquiring a text feature vector of the logical component and a format feature vector of the logical component; and splicing the text feature vector and the format feature vector into the feature vector of the logical component, wherein the text feature vector is generated based on the ordered character sequence of the logical component, and the format feature vector characterizes at least one of the following pieces of format information: whether the text of the logical component is bold, the text font size of the logical component, whether the text of the logical component is centered, and the category to which the logical component belongs, the categories including paragraphs, tables, charts and pictures.
In the embodiment of the invention, the text features and the format features of the logical components can be used together, improving the accuracy of directory structure identification.
In a second aspect, an apparatus for extracting a document directory structure is provided, including: a title sequence acquisition module for acquiring the ordered sequence of title components of the document to be processed; a logic tree building module for establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components; and a directory structure generating module for generating a directory structure of the document to be processed according to the title logic tree.
In a third aspect, a device for extracting a document directory structure is provided, including: a memory for storing a program; and a processor for running the program stored in the memory to execute the method for extracting a document directory structure provided in the first aspect or any optional implementation of the first aspect.
In a fourth aspect, a computer storage medium is provided, on which computer program instructions are stored, and the computer program instructions, when executed by a processor, implement the method for extracting a document directory structure provided in the first aspect or any optional implementation manner of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 illustrates an exemplary document directory structure;
FIG. 2 is a schematic diagram illustrating an exemplary logical tree structure of a title in an embodiment of the present invention;
FIG. 3 is a schematic flow chart diagram illustrating a method of extracting a document directory structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an exemplary process of generating a title logic tree according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram showing an extracting apparatus of a document directory structure according to an embodiment of the present invention;
fig. 6 is a block diagram of an exemplary hardware architecture of an extracting apparatus of a document directory structure in the embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a scheme for extracting a document directory structure, which is suitable for scenarios in which a directory structure needs to be extracted from a document, for example from complex financial texts such as share prospectuses, bond offering documents, annual reports, financial reports, merger and reorganization reports, rating reports, research reports, legal contract documents and public opinion news. After the title components are extracted from the document, a title logic tree can be generated using the hierarchical relationships among the title components. The title logic tree consists of a root node and N subtrees, where N is an integer and the N subtrees have no direct connection with one another.
To facilitate understanding of the title logic tree in the embodiment of the present invention, FIG. 2 shows a schematic structural diagram of a title logic tree provided in an embodiment of the present invention. As shown in FIG. 2, the title logic tree is composed of a root node R0 and three subtrees: a subtree composed of child nodes A1-A7, a subtree composed of child nodes A8-A13, and a subtree composed of child nodes A14-A19. The three child nodes directly connected to R0 are A1, A8 and A14, respectively. Specifically, the three subtrees are: the subtree composed of child node A1 and all child nodes A2 to A7 connected to A1 directly or indirectly; the subtree composed of child node A8 and all child nodes A9 to A13 connected to A8 directly or indirectly; and the subtree composed of child node A14 and all child nodes A15 to A19 connected to A14 directly or indirectly. There is no direct connection between the three subtrees.
In the title logic tree shown in FIG. 2, the root node R0 may be the subject name of the document or the theme of the document. All child nodes composing the three subtrees headed by A1, A8 and A14 are titles. For any child node in a subtree, its parent node is the title of the level above it, and its child nodes are titles of the level below it. For example, child node A1 is a first-level title and child node A2 is the first second-level title under that first-level title. Alternatively, the root node R0 shown in FIG. 2 may be left vacant, i.e., the root node R0 is not used to represent a level of the directory structure.
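To make the structure of FIG. 2 concrete, the following self-contained sketch represents a title logic tree as nested dictionaries and prints it as an indented directory. The description of FIG. 2 does not specify how A2-A7, A9-A13 and A15-A19 are nested inside their subtrees, so the nesting shown here is an assumption for illustration only.

```python
# A title logic tree as nested dicts: R0 is the root, A1/A8/A14 head the three subtrees.
# The internal arrangement of each subtree is assumed, not taken from FIG. 2.
title_logic_tree = {
    "R0": {
        "A1": {"A2": {}, "A3": {}, "A4": {"A5": {}, "A6": {}, "A7": {}}},
        "A8": {"A9": {}, "A10": {}, "A11": {"A12": {}, "A13": {}}},
        "A14": {"A15": {}, "A16": {"A17": {}, "A18": {}}, "A19": {}},
    }
}

def print_directory(tree, depth=0):
    """Pre-order traversal: each parent node is the upper-level title of its children."""
    for title, children in tree.items():
        if depth > 0:  # skip the (possibly vacant) root node R0
            print("  " * (depth - 1) + title)
        print_directory(children, depth + 1)

print_directory(title_logic_tree)
```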
In order to better understand the technical solution of the embodiment of the present invention, a method, an apparatus, a device, and a medium for extracting a document directory structure according to the embodiment of the present invention will be described in detail below with reference to the accompanying drawings, and it should be noted that these embodiments are not intended to limit the scope of the present invention.
Fig. 3 is a flowchart illustrating an extracting method of a document directory structure according to an embodiment of the present invention. As shown in fig. 3, the method 300 for extracting a document directory structure may include S310 to S330.
S310, acquiring the ordered sequence of the title components of the document to be processed.
The document to be processed is an electronic document from which the text information of the document can be obtained, for example an electronic document in WORD, PDF or TXT format.
The document to be processed may include at least one title paragraph, and each title paragraph is referred to as a title component. The front-to-back order of the title components in the ordered sequence of title components is the same as the order in which they appear in the document to be processed. Illustratively, if the title components are, in order of appearance in the document to be processed, title paragraph A1, title paragraph A2, ..., title paragraph Am, where the subscript of each title component indicates the order in which it appears in the document, then the ordered sequence of title components is {title paragraph A1, title paragraph A2, ..., title paragraph Am}, where m is a positive integer.
A title component is a single title in the document and may include a number and text. The number may be a numeral, such as an Arabic numeral ("123"), a Chinese numeral ("二十三", twenty-three), or a Roman numeral. The number may also be a combination of numerals and symbols, the symbols being, for example, a pause mark, an English period, a Chinese period, a colon or a comma, such as "1.1", "two, one" or "2.2.1". The number may also be a combination of a numeral and a word denoting a title structural unit such as a volume, chapter, section or subsection, for example "Chapter Three". In the same document, titles with different kinds of numbers belong to different levels. For example, "Chapter One", "Section One", "1.1" and "1.1.1" represent different levels.
In S310, many documents to be processed are not composed directly of title components but of a plurality of logical components that include the title components. Accordingly, in the process of obtaining the title components, the ordered sequence of logical components of the document to be processed can first be extracted from the document, and the ordered sequence of title components can then be screened out of the ordered sequence of logical components. Accordingly, S310 specifically includes S311 and S312.
S311, acquiring the logic component ordered sequence of the document to be processed.
First, in S311, the document to be processed may be divided into a plurality of mutually independent logical components, for example paragraphs, tables, charts and pictures, where paragraphs can be further subdivided into document content paragraphs and title paragraphs. As with the ordered sequence of title components described above, the front-to-back order of the logical components in the ordered sequence of logical components is the same as the order in which they appear in the document to be processed. Illustratively, if the logical components are, in order of appearance in the document to be processed, title paragraph A1, document content paragraph B1, document content paragraph B2, table C1, title paragraph A2, chart D1, title paragraph A3 and picture E1, then the ordered sequence of logical components is {title paragraph A1, document content paragraph B1, document content paragraph B2, table C1, title paragraph A2, chart D1, title paragraph A3, picture E1}.
Secondly, in the process of obtaining the ordered sequence of logical components, the document to be processed can be input into a logical structure analysis model to obtain the ordered sequence of logical components of the document. The logical structure analysis model can be trained with document samples in which the logical components are labeled. For example, the logical structure analysis model may cluster the content of the document to be processed according to its character text features and/or structural features, and take each clustering result as a logical component.
In addition, in the process of obtaining the ordered sequence of logical components, if the document to be processed is a multi-column document, each page of the document needs to be divided into a plurality of column regions using a column model. The column regions of each page are then sorted from left to right, and the logical components within each column are arranged from top to bottom. Finally, the logical components of each column region are analyzed, and the ordered sequence of logical components of the document is obtained according to the page order, the left-to-right order of the columns on each page, and the top-to-bottom order of the logical components in each column.
Illustratively, suppose a page in a document has two columns, column 1 and column 2, from left to right. Column 1 includes document content paragraph B2 and table C1, and column 2 includes title paragraph A2 and chart D1. The logical components in the page are then, in order, document content paragraph B2, table C1, title paragraph A2 and chart D1. A minimal sketch of this ordering rule is given below.
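The following sketch applies the ordering rule just described (page order, then columns from left to right, then components from top to bottom); the record fields page, column and top are hypothetical.

```python
def order_logical_components(components):
    """Sort logical components by page, then column (numbered from the left),
    then vertical position within the column (top to bottom).

    Each component is assumed to carry hypothetical fields:
    page (int), column (int), top (float, offset from the top of the column).
    """
    return sorted(components, key=lambda c: (c["page"], c["column"], c["top"]))

# Example with the page described above: column 1 holds B2 and C1, column 2 holds A2 and D1.
page = [
    {"id": "A2", "page": 1, "column": 2, "top": 0.10},
    {"id": "B2", "page": 1, "column": 1, "top": 0.12},
    {"id": "D1", "page": 1, "column": 2, "top": 0.55},
    {"id": "C1", "page": 1, "column": 1, "top": 0.60},
]
print([c["id"] for c in order_logical_components(page)])  # ['B2', 'C1', 'A2', 'D1']
```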
S312, inputting the ordered sequence of logical components into the title detection model to obtain the ordered sequence of title components.
In the process of obtaining the ordered sequence of title components, the title detection model can extract relevant features of the logical components, perform binary classification on the logical components, and confirm whether each logical component is a title component. Two possible embodiments of S312 are explained below.
In a first embodiment, the title detection model includes a first feature extraction submodel for extracting the features of the logical components and a title binary classification submodel. S312 may specifically include S3121 to S3123.
S3121, inputting the ordered sequence of logical components into the first feature extraction submodel to obtain the features of the logical components in the sequence. The features of a logical component include its text features and its format features. In the calculation of the first feature extraction submodel, these features may be expressed in the form of vectors. Accordingly, S3121 may include the following two steps.
Firstly, acquiring a text feature vector of a logic component and a format feature vector of the logic component.
First, the text feature vector of a logical component represents its text features in vector form. Suppose the document to be processed comprises K logical components, and the ith logical component x_i includes M_i characters w_1 to w_{Mi}, where i is a positive integer no greater than K and M_i is a positive integer. According to the order of the characters in x_i, an ordered character sequence s_i = {w_1, w_2, ..., w_{Mi}} can be generated for x_i. The character sequence s_i can be input into the text feature extraction submodel to obtain the text feature vector t_i of the logical component.

Illustratively, taking the association between adjacent characters into account, the text feature extraction submodel may be a Recurrent Neural Network (RNN) model, so that t_i = RNN(s_i), where RNN() is the mapping function of the recurrent neural network layer and the parameters of the mapping function include a weight matrix W_R and an offset vector b_R. In training the RNN model, if N logical components are selected as training samples, the weight matrix W_R and the offset vector b_R are iteratively updated over the sample data with a gradient descent algorithm until the loss function satisfies a stop condition. The loss function may be the L2 loss function, which can be expressed as equation (1):

L = Σ_{j=1,...,N} || y_j − ŷ_j ||²  (1)

where y_j denotes the target text feature vector of the jth logical component and ŷ_j denotes the predicted text feature vector of the jth logical component.
Secondly, the format feature vector of a logical component represents its format features in vector form. Optionally, the format features of a logical component may include features of at least one of the following dimensions: whether the text is bold, the text font size, whether the text is centered, and the category to which the logical component belongs. Accordingly, the format feature vector characterizes at least one of the following pieces of format information: whether the text of the logical component is bold, the text font size of the logical component, whether the text of the logical component is centered, and the category to which the logical component belongs.
Illustratively, if the format features of logical component x_i include the four dimensions of whether the text is bold, the text font size, whether the text is centered, and the category to which the logical component belongs, then a bold format vector f_bold, a font-size format vector f_size, a centered format vector f_center and a component category vector f_cat can first be generated, and the four are then spliced into the format feature vector f_i of logical component x_i, i.e., f_i = [f_bold, f_size, f_center, f_cat].

For the bold format feature, the bold format vector f_bold has size 1, and two different values may be used to represent bold and not bold respectively; for example, f_bold = 1 may represent that the text of logical component x_i is bold, and f_bold = 0 that it is not bold.

For the text font size format feature, the corresponding font-size format vector f_size has size 1, and different values may correspond to different font sizes. For example, each font size may be normalized to a real number in the interval 0-1; one value of f_size may then represent that the text font size of x_i is "size four", and another that the text font size of x_i is 18.

For the centered-or-not format feature, the corresponding centered format vector f_center has size 1, and two different values may be used to represent centered and not centered respectively; for example, f_center = 1 may represent that the text of logical component x_i is centered, and f_center = 0 that it is not centered.

For the format feature of the category to which the logical component belongs, the size of the corresponding component category vector f_cat may be related to the number of component categories. If a component can be classified into five categories, namely text paragraph, table, chart, picture, and categories other than these four, then the dimension of the component category vector f_cat is 5. Illustratively, since each logical component can belong to only one category, f_cat may be a One-Hot vector: for example, (1,0,0,0,0) may represent that the category of the logical component is text paragraph, (0,1,0,0,0) a table, (0,0,1,0,0) a chart, (0,0,0,1,0) a picture, and (0,0,0,0,1) a category other than text paragraph, table, chart and picture.
Secondly, the text feature vector t_i and the format feature vector f_i are spliced into the feature vector v_i of logical component x_i, i.e., v_i = [t_i, f_i].
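A sketch of the format feature vector and of the splicing with the text feature vector under the four-dimension example above; the 0-1 normalization of the font size, the category order of the one-hot vector and the field names on the component record are assumptions.

```python
import numpy as np

CATEGORIES = ["paragraph", "table", "chart", "picture", "other"]  # assumed ordering

def format_feature_vector(component):
    """Concatenate bold / font-size / centered / category features (sizes 1 + 1 + 1 + 5 = 8)."""
    f_bold = [1.0 if component["bold"] else 0.0]
    f_size = [min(component["font_size"], 72) / 72.0]      # assumed 0-1 normalization
    f_center = [1.0 if component["centered"] else 0.0]
    f_cat = [1.0 if component["category"] == c else 0.0 for c in CATEGORIES]  # one-hot
    return np.array(f_bold + f_size + f_center + f_cat, dtype=np.float32)

def component_feature_vector(text_feature, component):
    """Feature vector v_i of logical component x_i: text feature t_i spliced with format feature f_i."""
    return np.concatenate([text_feature, format_feature_vector(component)])
```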
s3122, for each logical component' S characteristics,inputting the characteristics of each logic component into a title two-classification submodel to obtain the title classification result of each logic component. Wherein the title classification result comprises a title or a non-title. The title two classification submodel may select the first Softmax classifier. Wherein the score function of the first Softmax classifier
Figure BDA00024697690900001121
Satisfies formula (2):
Figure BDA00024697690900001118
wherein Softmax () is the first Softmax function.
Figure BDA00024697690900001119
In order to be a weight matrix, the weight matrix,
Figure BDA00024697690900001120
is a bias vector.
In addition, the title binary classification submodel may also be another classifier capable of binary classification, such as a sigmoid classifier, which is not limited here.
When training the first Softmax classifier, the title components may be used as positive samples and the other logical components as negative samples. Illustratively, the target classification score of a title component is labeled 1, and the target classification scores of the other logical components are labeled 0. In training the Softmax classifier, the L2 loss function may be selected; for its specific content, reference may be made to the relevant description above, which is not repeated here.
In addition, a classification score is obtained by inputting the features of a logical component into the title binary classification submodel. If the classification score is 1, the classification result of the logical component is title; otherwise, the classification result is non-title.
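A sketch of the title binary classification submodel of equation (2) as a linear layer followed by Softmax, with PyTorch assumed; the feature dimension shown is arbitrary.

```python
import torch
import torch.nn as nn

class TitleBinaryClassifier(nn.Module):
    """Equation (2): score = Softmax(W_S1 * v_i + b_S1) over {non-title, title}."""
    def __init__(self, feat_dim):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 2)   # weight matrix W_S1 and bias vector b_S1

    def forward(self, v):                      # v: (batch, feat_dim) logical component features
        return torch.softmax(self.linear(v), dim=-1)[:, 1]  # score of the "title" class

classifier = TitleBinaryClassifier(feat_dim=136)   # e.g. 128-dim text feature + 8-dim format feature
# A component is added to the title component ordered sequence when its score indicates "title".
```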
S3123, adding the logical components in the ordered sequence of logical components whose title classification result is title to the ordered sequence of title components.
In a second embodiment, the features of a logical component and its context features can both be used to determine whether the logical component is a title component. If the title detection model includes a first feature extraction submodel for extracting the features of the logical components, a second feature extraction submodel for extracting the context features of the logical components, and a title binary classification submodel, S312 may specifically include S3124 to S3127:
and S3124, inputting the logic component ordered sequence into the first feature extraction submodel to obtain the features of the logic components in the logic component ordered sequence. For specific description of the first feature extraction submodel, reference may be made to relevant contents in the foregoing embodiments of the present invention, and details are not described herein again.
S3125, for each logical component, inputting its features and the features of its adjacent logical components into the second feature extraction submodel to obtain the context features of the logical component. Illustratively, if the ordered sequence of features {v_1, v_2, ..., v_K} of the K logical components of the document to be processed is obtained in S3124, it is input into the second feature extraction submodel to obtain the ordered sequence of context features {c_1, c_2, ..., c_K} of the logical components, where c_i is the context feature of the ith logical component.

The second feature extraction submodel may be a multi-layer convolutional neural network (multi-layer CNN) model. If the convolution kernel size of the multi-layer convolutional neural network model is L, then for any logical component x_i, the context feature of x_i can be generated using the features of the L-1 logical components adjacent to it before and after. Illustratively, if the number of layers of the multi-layer convolutional neural network model is 2, the mapping functions of the two convolutional layers are denoted CNN_1() and CNN_2(), the convolution kernel size of each layer is 3, and the convolution kernel and offset vector of the kth layer are W_C^k and b_C^k, then for any logical component x_i, the feature v_{i-2} of the (i-2)th logical component, the feature v_{i-1} of the (i-1)th logical component, the feature v_{i+1} of the (i+1)th logical component and the feature v_{i+2} of the (i+2)th logical component can be used to generate the context feature c_i of x_i, which satisfies equation (3):

c_i = CNN_2(CNN_1(v_{i-2}, v_{i-1}, v_i, v_{i+1}, v_{i+2}; W_C^1, b_C^1); W_C^2, b_C^2)  (3)
in the process of training the second feature extraction submodel, an L2 loss function can be selected and a gradient descent algorithm is used to continuously perform convolution kernel
Figure BDA00024697690900001310
The offset vector is
Figure BDA00024697690900001311
And carrying out iterative updating on the parameters. For details of the L2 loss function, reference may be made to the related description in the above embodiments of the present invention, and further description thereof is omitted.
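A sketch of the second feature extraction submodel as two Conv1d layers with kernel size 3 over the ordered component features, matching the receptive field v_{i-2} to v_{i+2} described above; PyTorch and the zero padding at the sequence boundaries are assumptions.

```python
import torch
import torch.nn as nn

class ContextFeatureExtractor(nn.Module):
    """Two Conv1d layers with kernel size 3: the context feature c_i of component x_i
    depends on the features of the two components before and after it."""
    def __init__(self, feat_dim, ctx_dim=128):
        super().__init__()
        self.cnn1 = nn.Conv1d(feat_dim, ctx_dim, kernel_size=3, padding=1)
        self.cnn2 = nn.Conv1d(ctx_dim, ctx_dim, kernel_size=3, padding=1)

    def forward(self, v_seq):                   # v_seq: (batch, K, feat_dim), K logical components
        x = v_seq.transpose(1, 2)               # Conv1d expects (batch, channels, length)
        x = torch.relu(self.cnn1(x))
        x = self.cnn2(x)
        return x.transpose(1, 2)                # context features: (batch, K, ctx_dim)
```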
S3126, inputting the context features of each logical component into the title binary classification submodel to obtain the title classification result of the logical component, the result being either title or non-title. In the title binary classification submodel, the feature vector v_i in equation (2) is replaced with the context feature c_i, and the title classification result of each logical component is then calculated using the replaced equation (2). For other details of the title binary classification submodel, reference may be made to the related description above, which is not repeated here.
S3127, adding the logical components in the ordered sequence of logical components whose title classification result is title to the ordered sequence of title components.
A document may also contain lists and other logical components whose text structure closely resembles that of a title component; for example, each line in a list may be a combination of a number and text. The method takes into account the correlation between adjacent logical components in the document to be processed. For example, the logical components before and after a title component are usually title components of a different level, document content paragraphs, charts, pictures and so on, whereas the lines before and after a list line may carry the same kind of number. Even when the features of a list line are similar to those of a title component, their context features differ significantly. Therefore, by using the context features of a logical component, the recognition accuracy of the title detection model can be improved on the basis of the features of the surrounding logical components.
S320, establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of title components.
In some embodiments of the present invention, the title components in the ordered sequence of title components may be taken in turn as the first title component.
S322, for each first title component, S3221 and S3222 are performed.
S3221, if a second title component having the same level as the first title component exists among the previous title component of the first title component in the title logic tree and the ancestor nodes of that previous title component, inserting the first title component into the title logic tree as a sibling node of the second title component. Inserting the first title component as a sibling node of the second title component indicates that the first title component and the second title component share the same upper-level title.
First, to fully understand S322, it is explained with reference to FIG. 4, which is a schematic diagram of an exemplary process of generating a title logic tree provided by an embodiment of the present invention. Referring to FIG. 4, suppose the title components in the document to be processed are, in order, title component a, title component b, title component c, title component d, title component e, title component f, title component g, and so on. In S322, title components a to g are inserted into the title logic tree in turn as the first title component. With continued reference to FIG. 4, title components a to f have already been inserted into the title logic tree, and title component g now needs to be inserted into the title logic tree as the first title component.
In S3221, it is determined whether a second title component exists among the title components in the rightmost branch of the existing title logic tree (shaded in FIG. 4), that is, among the previous title component f of the first title component g and the ancestor title components a and c of title component f. Here, ancestor title component a may be a first-level title, such as "Section Three: Company Financial Situation"; ancestor title component c may be a second-level title, such as "Two, Liabilities"; and the previous title component f of the first title component g may be "1. Current liabilities".
If a second title component exists in the rightmost branch of the existing title logic tree, there are three possible insertion positions for the first title component g. Specifically: (1) if title component a is the second title component, the first title component g needs to be inserted below the root node r, i.e., at position p1 in FIG. 4; (2) if title component c is the second title component, the first title component g needs to be inserted below node a as a child node of node a, i.e., at position p2 in FIG. 4. For example, if the first title component g is "Three, Shareholders' Equity" and it is a title of the same level as "Two, Liabilities" (title component c), then "Three, Shareholders' Equity" is inserted as a sibling node of "Two, Liabilities"; (3) if title component f is the second title component, the first title component g needs to be inserted below node c as a child node of node c, i.e., at position p3 in FIG. 4.
Next, the following steps one to three are performed when determining whether each node is a same-level node of the first title component.
Step one: obtaining the hierarchical classification scores of the title components among the previous title component of the first title component and the ancestor nodes of that previous title component, using the target data and the title hierarchy binary classification model.
First, regarding the hierarchical classification score: the hierarchical classification score may be a value in the interval [0,1] reflecting the probability that the title component to be classified is the second title component. The higher the hierarchical classification score, the higher the probability that the title component to be classified is the second title component.
In the first case, whether a second title component exists is determined by traversing all nodes among the previous title component of the first title component and the ancestor nodes of the previous title component: each of these title components is taken in turn as the title component to be classified and its hierarchical classification score is obtained through step one. If several nodes among the previous title component and the ancestor nodes of the previous title component have scores exceeding the preset score threshold, the node with the highest score may be selected as the second title component. Illustratively, with continued reference to FIG. 4, if the preset score threshold is 0.5, the hierarchical classification score of title component a is 0.6, the hierarchical classification score of title component c is 0.8, and the hierarchical classification score of title component f is 0.7, then title component c may be taken as the second title component.
In the second case, whether the second title component exists among the previous title component of the first title component and the ancestor nodes of the previous title component is judged in sequence starting from the root node of the title logic tree and proceeding downward. The first node whose hierarchical classification score exceeds the preset score threshold may be selected as the second title component. Continuing the example of the previous case, it can be determined in turn whether title component a, title component c and title component f are the second title component. Since the hierarchical classification score of title component a is 0.6, which is greater than the preset score threshold of 0.5, title component a can be selected as the second title component, and the hierarchical classification scores of title components c and f no longer need to be calculated.
In the third case, starting from the previous title component of the first title component, whether the second title component exists among the previous title component and the ancestor nodes of the previous title component is judged in sequence from low level to high level, that is, in the direction from the leaf node toward the root node. The first node whose hierarchical classification score exceeds the preset score threshold may be selected as the second title component. Continuing the same example, it can be determined in turn whether title component f, title component c and title component a are the second title component. Since the hierarchical classification score of title component f is 0.7, which is greater than the preset score threshold of 0.5, title component f can be selected as the second title component, and the hierarchical classification scores of title components c and a no longer need to be calculated. A minimal sketch of this bottom-up search is given after this paragraph.
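The sketch below combines step one with the bottom-up search of the third case: it walks from the previous title component toward the root and stops at the first node whose hierarchical classification score exceeds the threshold. The node structure follows the earlier tree sketch, and level_score stands in for the title hierarchy binary classification model; both are hypothetical.

```python
def find_second_title_component(previous_node, root, first_component, level_score, threshold=0.5):
    """Return the first node on the path from the previous title component toward the root
    whose hierarchical classification score with the first title component exceeds the
    threshold, or None if no second title component exists on that path."""
    node = previous_node
    while node is not None and node is not root:
        if level_score(node.component, first_component) > threshold:
            return node        # a second title component on the same level as the first
        node = node.parent
    return None
```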
Second, regarding the title hierarchy binary classification model: optionally, the title hierarchy binary classification model may include a feed-forward neural network (FNN) model and a second Softmax classifier. The feature of the first title component is denoted v_first, and the feature of each title component to be classified (the previous title component or one of its ancestor nodes) is denoted v_cand. It should be noted that "first" and "second" in the first Softmax classifier and the second Softmax classifier are only used to distinguish the two classifiers and do not limit their functions or features.
The target data include the features of the title component to be classified, i.e., a title component among the previous title component of the first title component and the ancestor nodes of that previous title component, and the features of the first title component. Specifically, the hierarchical classification scores of the title components among the previous title component and its ancestor nodes may be calculated one by one; accordingly, the target data used to calculate the hierarchical classification score of any such title component include the features of that title component and the features of the first title component.
Step one is described in detail below for two different kinds of target data, divided into two cases.
In the first case, the target data include only the features of the title component to be classified and the features of the first title component. In this case, step one includes: inputting the features v_cand of the title component and the features v_first of the first title component into the first relational feature generation model to obtain a first relational feature r_1 characterizing the relationship between the two; and then inputting the obtained first relational feature into the pre-trained title hierarchy binary classification model.

The first relational feature generation model may be embodied as an FNN model, that is, the FNN model may be used to obtain the relational feature between the features of the title component and the features of the first title component. Specifically, for any title component among the previous title component and the ancestor nodes of the previous title component, its features v_cand and the features v_first of the first title component can first be spliced into a first spliced vector [v_cand, v_first]; the relational feature vector r_1 between the two is then obtained from the first spliced vector through the mapping function FNN() of the FNN model, i.e., r_1 = FNN([v_cand, v_first]), where the parameters of the FNN model include a weight matrix W_F and an offset vector b_F. In addition, in training the FNN model, the L2 loss function can be selected and the parameters of the FNN model iteratively updated with a gradient descent algorithm. For details of the L2 loss function, reference may be made to the related description above, which is not repeated here.
Secondly, after the relational feature r_1 is obtained, it can be input into the second Softmax classifier for hierarchical classification recognition. The score function of the second Softmax classifier satisfies equation (4):

score(r_1) = Softmax(W_S2 · r_1 + b_S2)  (4)

where Softmax() is the second Softmax function, W_S2 is a weight matrix and b_S2 is a bias vector. For the specific content of the training process of the second Softmax classifier, reference may be made to the description of the first Softmax classifier above, which is not repeated here.

In addition, if the score output by the second Softmax classifier is 1, the title component is characterized as a second title component, i.e., a title component of the same level as the first title component; otherwise, the title component is characterized as not being a second title component, i.e., not a same-level title component of the first title component. A sketch of this first case is given below.
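A sketch of the first case, with PyTorch assumed: a feed-forward layer over the spliced features plays the role of the first relational feature generation model, and a linear layer with Softmax plays the role of the title hierarchy binary classification model of equation (4); layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TitleLevelClassifier(nn.Module):
    """First relational feature r_1 = FNN([v_cand, v_first]); hierarchical classification
    score = Softmax(W_S2 * r_1 + b_S2), as in equation (4)."""
    def __init__(self, feat_dim, rel_dim=128):
        super().__init__()
        self.fnn = nn.Sequential(nn.Linear(2 * feat_dim, rel_dim), nn.ReLU())
        self.classifier = nn.Linear(rel_dim, 2)

    def forward(self, v_cand, v_first):                    # both: (batch, feat_dim)
        r1 = self.fnn(torch.cat([v_cand, v_first], dim=-1))
        return torch.softmax(self.classifier(r1), dim=-1)[:, 1]  # score that both are same-level titles
```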
In the second case, the target data includes the characteristics of the first title component, the characteristics of the title components among the previous title component and the ancestor nodes of the previous title component, and the characteristics of the sibling nodes of those title components. The first step in the second case is substantially similar to that in the first case, except that the relational feature vector between the characteristics of a title component and the characteristics of the first title component is calculated differently. In this case, the first step may specifically include: inputting the characteristics of each title component among the previous title component and the ancestor nodes of the previous title component, together with the characteristics of the sibling nodes of that title component in the title logic tree, into a second relational feature generation model to obtain a second relational feature r2 characterizing the relationship between the characteristics of that title component and the characteristics of its sibling nodes; and inputting the second relational feature r2 and the characteristics h_1 of the first title component into a third relational feature generation model to generate a third relational feature r3 characterizing the relationship between the second relational feature and the characteristics of the first title component.
In particular, the characteristics of the first title component are denoted as h_1, and the characteristics of the arbitrary title component are denoted as h_a. The arbitrary title component has M sibling nodes S1, S2, ..., SM, where M is a positive integer, and the characteristics of the M sibling nodes are denoted as h_S1, h_S2, ..., h_SM respectively.
the second relational feature generation model may be embodied as an RNN model. That is, for any title component in the previous title component and the ancestor node of the previous title component, any title group may be precededFeatures of the elements
Figure BDA0002469769090000187
Characteristics of sibling nodes of arbitrary title component
Figure BDA0002469769090000188
Inputting RNN model, and obtaining second relation feature vector of relation between the arbitrary title component and brother node thereof
Figure BDA0002469769090000189
Then, a third relation characteristic is calculated
Figure BDA00024697690900001810
Then, the FNN model may be selected as the third correlation characteristic generation model. That is, the FNN model is first used to obtain the characteristics of the first header component
Figure BDA00024697690900001811
Relationship feature vector with the arbitrary title component and its sibling nodes
Figure BDA00024697690900001812
Characteristic of the relationship between
Figure BDA00024697690900001813
Figure BDA00024697690900001814
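The second case might be sketched as follows; the choice of a GRU as the RNN, the ordering of the input sequence, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SiblingRelationEncoder(nn.Module):
    """Encodes a candidate title component and its sibling nodes into a second
    relational feature r2 via an RNN, then fuses r2 with the feature of the first
    title component into a third relational feature r3 via an FNN (a sketch)."""

    def __init__(self, feat_dim: int, rel_dim: int):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, rel_dim, batch_first=True)
        self.fnn = nn.Linear(rel_dim + feat_dim, rel_dim)

    def forward(self, h_a, sibling_feats, h_1):
        # Sequence: the candidate component followed by its M sibling nodes S1..SM.
        seq = torch.cat([h_a.unsqueeze(1), sibling_feats], dim=1)
        _, r2 = self.rnn(seq)                  # second relational feature r2
        r2 = r2.squeeze(0)
        # Third relational feature r3 = FNN([r2; h_1]).
        return torch.relu(self.fnn(torch.cat([r2, h_1], dim=-1)))

encoder = SiblingRelationEncoder(feat_dim=128, rel_dim=64)
h_a = torch.randn(1, 128)            # candidate title component
siblings = torch.randn(1, 3, 128)    # M = 3 sibling nodes
h_1 = torch.randn(1, 128)            # first title component
r3 = encoder(h_a, siblings, h_1)     # fed to the same level classifier as in the first case
```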
In step two, if a title component whose hierarchical classification score is greater than a preset score threshold exists among the previous title component of the first title component and the ancestor nodes of that previous title component, it is determined that a second title component exists in the subtree where the previous title component is located. The related description of step two can refer to the related description of the hierarchical classification score in step one, and is not repeated here.
In step three, if the hierarchical classification scores of all the title components among the previous title component of the first title component and the ancestor nodes of that previous title component are smaller than the preset score threshold, it is determined that no second title component exists among the previous title component of the first title component and the ancestor nodes of that previous title component.
Illustratively, if the preset score threshold is 0.5, the hierarchical classification score corresponding to title component a is 0.4, the hierarchical classification score corresponding to title component c is 0.3, and the hierarchical classification score corresponding to title component f is 0.2, then none of title component a, title component c, and title component f is the second title component.
S3222, if no second title component exists in the subtree where the previous title component is located, the first title component is taken as a child node of the previous title component.
Illustratively, with continued reference to FIG. 4, if the hierarchical classification score of the title component f, the hierarchical classification score of the title component a, and the hierarchical classification score of the ancestor title component c are all less than the preset score threshold, then the first title component g is inserted as a child node of the previous title component f, below node f, i.e., at position p4 in FIG. 4.
For example, if the inserted title component g is "(1) Short-term notes", then, since it is judged that "(1) Short-term notes" is not at the same level as "Section 3 Company financial situation", "II. Liabilities", or "1. Current liabilities", it is inserted as a child node of the previous title component f, "1. Current liabilities".
In addition, for the first title component xi, after S3221 and S3222 are performed, xi+1 may be taken as the next first title component, and S3221 and S3222 are performed again.
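The insertion logic of S3221 and S3222 can be sketched in plain Python as below; the TitleNode class, the same_level_score callback (e.g. backed by the classifier sketched above), and the 0.5 threshold are hypothetical illustrations, not elements of this disclosure.

```python
class TitleNode:
    def __init__(self, component=None, parent=None):
        self.component = component      # None marks the root of the title logic tree
        self.parent = parent
        self.children = []

def insert_title(prev_node, component, same_level_score):
    """Insert one first title component into the title logic tree."""
    # S3221: walk from the previous title component up through its ancestor nodes.
    candidate = prev_node
    while candidate is not None and candidate.component is not None:
        if same_level_score(candidate, component) > 0.5:   # assumed preset threshold
            # A second title component exists: insert as its sibling node.
            node = TitleNode(component, parent=candidate.parent)
            candidate.parent.children.append(node)
            return node
        candidate = candidate.parent
    # S3222: no second title component, insert as a child of the previous component.
    node = TitleNode(component, parent=prev_node)
    prev_node.children.append(node)
    return node

def build_title_logic_tree(title_components, same_level_score):
    root = TitleNode()
    prev = root
    for comp in title_components:       # x1, x2, ... taken in turn as the first title component
        prev = insert_title(prev, comp, same_level_score)
    return root
```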
In a method for generating a directory structure using a title template, the number of levels in the generated directory structure is the same as the number of levels set in the title template. For example, if only 3 title levels are set in the template, at most 3 title levels can be generated. With the method for extracting a document directory structure in the embodiment of the invention, each title component is compared with the title components already added to the title logic tree to determine whether a same-level title component exists, and if not, it is inserted as a child node of the previous title component. Therefore, even if the document to be processed has more title levels, for example 8 or 9 levels, the corresponding title levels can still be generated. Compared with the method of generating the directory structure using a title template, this can improve the flexibility, accuracy, and depth of directory structure generation.
S330, generating a directory structure of the document to be processed according to the title logic tree. Specifically, after the title logic tree is obtained, the title components corresponding to the child nodes directly connected to the root node are taken as first-level titles. For example, title component A1, title component A8, and title component A14 in FIG. 2 are first-level titles. The child nodes of a first-level title are second-level titles. For example, title component A2, title component A3, and title component A7 are second-level titles under the first-level title A1.
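S330 then reduces to a pre-order traversal of the title logic tree, where depth below the root gives the title level. A sketch, continuing the hypothetical TitleNode example above (the printed layout is illustrative only):

```python
def generate_directory(root):
    """Flatten the title logic tree into (level, title component) pairs:
    children of the root are level-1 titles, their children level-2, and so on."""
    entries = []

    def walk(node, level):
        for child in node.children:
            entries.append((level, child.component))
            walk(child, level + 1)

    walk(root, 1)
    return entries

# Usage, continuing the previous sketch:
# tree = build_title_logic_tree(title_components, same_level_score)
# for level, comp in generate_directory(tree):
#     print("  " * (level - 1) + str(comp))
```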
According to the method for extracting a document directory structure in the embodiment of the invention, the ordered sequence of title components in the document to be processed can be obtained first, and the title logic tree is established using each title component in the ordered sequence of title components. Because the title component corresponding to any node in the title logic tree is the upper-level title of the title components corresponding to the child nodes of that node, the hierarchical relationship among the title components can be determined by establishing the title logic tree, thereby improving the extraction accuracy of the directory structure.
In addition, compared with a method for generating a directory structure using a title template, the embodiment of the invention generates the directory hierarchy using pre-trained learning models (the first feature extraction submodel, the second feature extraction submodel, the title binary classification submodel, the title level binary classification model, and the like), thereby ensuring the generalization capability of the extraction method of the document directory structure. Especially when a deep learning model is selected, the accuracy of directory structure identification can be further improved.
In some embodiments of the present invention, part or all of the directory structure of the document to be processed may be displayed on a display interface of the terminal. For example, if the hierarchy of the directory structure is complex, only the titles of the first 3 levels may be displayed. The format of titles at different levels may differ, for example in the number of characters of indentation before the paragraph, or in the font size. The format of the titles at different levels can be set according to specific needs and is not limited herein.
In addition, in order to facilitate indexing of the document to be processed, when a trigger operation is performed on a title component on the terminal, the user can jump directly to the page where the title component is located.
An apparatus according to an embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
Based on the same inventive concept, the embodiment of the invention provides an extracting device of a document directory structure. Fig. 5 is a structural schematic diagram showing an extracting apparatus of a document directory structure according to an embodiment of the present invention. As shown in fig. 5, the apparatus 500 for extracting a document directory structure includes a title sequence acquiring module 510, a logical tree building module 520, and a target structure generating module 530.
The title sequence acquiring module 510 is configured to acquire an ordered sequence of title components of the document to be processed.
And a logical tree building module 520, configured to build the title logical tree based on the hierarchical relationship between the title components in the ordered sequence of the title components.
And the target structure generating module 530 is configured to generate a directory structure of the document to be processed according to the title logic tree.
In some embodiments of the present invention, the logic tree building module 520 is specifically configured to:
the title components in the ordered sequence of title components are in turn taken as the first title component.
For each first title component, performing the following operations:
and if a second title component with the same level as the first title component exists in the ancestor nodes of the previous title component and the previous title component of the first title component in the title logical tree, inserting the first title component into the title logical tree as the brother node of the second title component. And if the second title component does not exist in the ancestor nodes of the previous title component and the previous title component, inserting the first title component into the title logical tree as a child node of the previous title component.
In some embodiments of the present invention, the apparatus 500 for extracting a document directory structure further comprises:
and the hierarchical classification module is used for obtaining the hierarchical classification scores of the previous title assembly and the title assembly in the ancestor node of the previous title assembly by using the target data and the title hierarchical two-classification model.
Wherein the target data includes the characteristics of the previous title component and the characteristics of the first title component in the ancestor nodes of the previous title component and the previous title component.
And the first determination module is used for determining that a second title component exists in the previous title component and the ancestor nodes of the previous title component if a title component with a hierarchical classification score greater than a preset score threshold exists among the previous title component and the ancestor nodes of the previous title component.
And the second determination module is used for determining that the second title component does not exist in the previous title component and the ancestor nodes of the previous title component if the hierarchical classification scores of all the title components among the previous title component and the ancestor nodes of the previous title component are less than the preset score threshold.
In some embodiments, the hierarchical classification module is specifically configured to:
and inputting the characteristics of the title assembly in the ancestor nodes of the previous title assembly and the characteristics of the first title assembly into a first relation characteristic generation model to obtain a first relation characteristic representing the relation between the characteristics of the title assembly in the ancestor nodes of the previous title assembly and the characteristics of the first title assembly.
And inputting the first relation characteristic into the title level binary classification model to obtain the level classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
In other embodiments, the hierarchical classification module is specifically configured to:
and inputting the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the previous title assembly, and the characteristics of the brother nodes of the title assemblies in the ancestor nodes of the previous title assembly and the previous title assembly in the title node into a second relational feature generation model to obtain a second relational feature representing the relationship between the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the characteristics of the brother nodes.
And inputting the second relational feature and the feature of the first title assembly into a third relational feature generation model to generate a third relational feature which characterizes the relationship between the second relational feature and the feature of the first title assembly.
And inputting the third relation characteristic into the title level binary classification model to obtain the hierarchical classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
In some embodiments of the present invention, the title sequence obtaining module 510 is specifically configured to:
acquiring a logic component ordered sequence of a document to be processed; and inputting the logic component ordered sequence into a title detection model to obtain the title component ordered sequence.
In some embodiments, the title detection model includes a first feature extraction sub-model and a title binary classification sub-model, and the title sequence obtaining module 510 is specifically configured to:
and inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence.
And for each logic component, inputting the characteristics of the logic component into the title binary classification submodel to obtain a title classification result of the logic component, where the title classification result is title or non-title.
And adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence.
In other embodiments, the title detection model includes a first feature extraction sub-model, a second feature extraction sub-model, and a title binary classification sub-model, and the title sequence obtaining module 510 is specifically configured to:
and inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence.
And for each logic component, inputting the characteristics of the logic component and the characteristics of its adjacent logic components into the second feature extraction submodel to obtain the contextual characteristics of the logic component.
Inputting the context characteristics into a title binary classification submodel to obtain a title classification result of each logic component, wherein the title classification result comprises a title or a non-title;
and adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence.
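The title detection path just described might be sketched as follows; the bidirectional GRU standing in for the context-aware feature extraction, the classifier form, and all dimensions are assumptions made for illustration, not details of this disclosure.

```python
import torch
import torch.nn as nn

class TitleDetector(nn.Module):
    """Sketch of the two-stage title detection: per-component features, contextual
    features from adjacent logic components (here via a bidirectional GRU), then a
    binary title / non-title classifier."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.context = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)

    def forward(self, component_feats: torch.Tensor) -> torch.Tensor:
        # component_feats: (1, num_components, feat_dim), the ordered sequence of logic components
        ctx, _ = self.context(component_feats)        # contextual characteristics of each component
        return torch.softmax(self.classifier(ctx), dim=-1)

detector = TitleDetector(feat_dim=128, hidden_dim=64)
feats = torch.randn(1, 10, 128)                       # features of 10 logic components
probs = detector(feats)
is_title = probs[0, :, 1] > 0.5                       # components classified as titles, in order
title_indices = [i for i, t in enumerate(is_title.tolist()) if t]
```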
In some embodiments of the present invention, the logic component is characterized by a feature vector, and the header sequence obtaining module 510 is specifically configured to:
and acquiring the text feature vector of the logic component and the format feature vector of the logic component.
And splicing the text feature vector and the format feature vector into a feature vector of the logic component.
Wherein the text feature vector of the logical component is generated based on the ordered sequence of characters of the logical component.
The format feature vector characterizes at least one of the following format information: whether the logic component is thickened, the text word size of the logic component, whether the text of the logic component is centered and represents the category to which the logic component belongs, wherein the category to which the logic component belongs comprises: paragraphs, tables, charts, pictures.
Other details of the apparatus for extracting a document directory structure according to the embodiment of the present invention are similar to the method according to the embodiment of the present invention described above with reference to fig. 1 to 4, and are not repeated herein.
Fig. 6 is a block diagram of an exemplary hardware architecture of an extracting apparatus of a document directory structure in the embodiment of the present invention.
As shown in fig. 6, the extracting apparatus 600 of the document directory structure includes an input apparatus 601, an input interface 602, a central processor 603, a memory 604, an output interface 605, and an output apparatus 606. The input interface 602, the central processing unit 603, the memory 604, and the output interface 605 are connected to each other via a bus 610, and the input device 601 and the output device 606 are connected to the bus 610 via the input interface 602 and the output interface 605, respectively, and further connected to other components of the document directory structure extraction device 600.
Specifically, the input device 601 receives input information from the outside, and transmits the input information to the central processor 603 through the input interface 602; the central processor 603 processes input information based on computer-executable instructions stored in the memory 604 to generate output information, stores the output information temporarily or permanently in the memory 604, and then transmits the output information to the output device 606 through the output interface 605; the output device 606 outputs the output information to the outside of the extracting device 600 of the document directory structure for use by the user.
That is, the extracting device of the document directory structure shown in fig. 6 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, may implement the method for extracting a document directory structure described in conjunction with fig. 1 to 4.
In one embodiment, the extracting apparatus 600 of the document directory structure shown in fig. 6 may be implemented as an apparatus that may include: a memory for storing a program; and the processor is used for operating the program stored in the memory so as to execute the extraction method of the document directory structure of the embodiment of the invention.
The embodiment of the invention also provides a computer storage medium, wherein computer program instructions are stored on the computer storage medium, and when being executed by a processor, the computer program instructions realize the extraction method of the document directory structure of the embodiment of the invention.
It is to be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
The functional blocks shown in the above structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via a computer network such as the internet, an intranet, etc.
As described above, only the specific embodiments of the present invention are provided, and it is clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

Claims (10)

1. A method for extracting a document directory structure, the method comprising:
acquiring a title component ordered sequence of a document to be processed;
establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of the title components;
and generating a directory structure of the document to be processed according to the title logic tree.
2. The method according to claim 1, wherein the establishing a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of the title components comprises:
sequentially taking the title components in the ordered sequence of title components as first title components;
for each first title component, performing the following operations:
if a second title component with the same level as the first title component exists in a previous title component of the first title component and an ancestor node of the previous title component in the title logical tree, inserting the first title component into the title logical tree as a brother node of the second title component;
and if the second title component does not exist in the ancestor nodes of the previous title component and the previous title component, inserting the first title component into the title logical tree as a child node of the previous title component.
3. The method of claim 2, further comprising:
obtaining hierarchical classification scores of the previous title component and the title components in the ancestor nodes of the previous title component by using target data and a title hierarchical classification model, wherein the target data comprises the characteristics of the title components in the ancestor nodes of the previous title component and the characteristics of the first title component;
determining that the second title component exists in the ancestor nodes of the previous title component and the previous title component if a title component with a hierarchical classification score greater than a preset score threshold exists in the ancestor nodes of the previous title component and the previous title component;
determining that the second title component is not present in the ancestor nodes of the previous title component and the previous title component if the hierarchical classification scores of all the title components in the ancestor nodes of the previous title component and the previous title component are less than the preset score threshold.
4. The method of claim 3, wherein obtaining hierarchical classification scores for the previous title component and the title component in the ancestor node of the previous title component using the target data and a title hierarchical classification model comprises:
inputting the characteristics of the title components in the previous title component and the ancestor nodes of the previous title component and the characteristics of the first title component into a first relational feature generation model to obtain a first relational feature representing the relationship between the characteristics of the title components in the previous title component and the ancestor nodes of the previous title component and the characteristics of the first title component;
and inputting the first relation characteristic into the title level binary classification model to obtain the level classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
5. The method of claim 3, wherein obtaining hierarchical classification scores for the previous title component and the title component in the ancestor node of the previous title component using the target data and a title hierarchical classification model comprises:
inputting the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the characteristics of the brother nodes of the title assemblies in the ancestor nodes of the previous title assembly and the previous title assembly in the title node into a second relational characteristic generation model to obtain a second relational characteristic representing the relationship between the characteristics of the title assemblies in the ancestor nodes of the previous title assembly and the characteristics of the brother nodes;
inputting the second relational feature and the feature of the first title assembly into a third relational feature generation model, and generating a third relational feature which represents the relationship between the second relational feature and the feature of the first title assembly;
inputting the third relation characteristic into the title level binary classification model to obtain the level classification scores of the previous title component and the title components in the ancestor nodes of the previous title component.
6. The method of claim 1, wherein obtaining an ordered sequence of header components of the document to be processed comprises:
acquiring a logic component ordered sequence of the document to be processed;
inputting the logic component ordered sequence into a title detection model to obtain the title component ordered sequence; wherein, if the title detection model comprises a first feature extraction submodel and a title binary classification submodel,
inputting the logic component ordered sequence into a title detection model to obtain the title component ordered sequence, wherein the method comprises the following steps:
inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence;
inputting the characteristics of each logic component into the title two-classification submodel aiming at the characteristics of each logic component to obtain a title classification result of each logic component, wherein the title classification result is a title or a non-title;
adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence;
or, alternatively,
if the title detection model comprises a first feature extraction submodel, a second feature extraction submodel and a title binary classification submodel,
inputting the ordered sequence of logical components into a title detection model to obtain the ordered sequence of title components, including:
inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence;
aiming at the characteristics of each logic assembly, inputting the characteristics of each logic assembly and the characteristics of adjacent logic assemblies of each logic assembly into a second characteristic extraction submodel to obtain the context characteristics of each logic assembly;
inputting the context characteristics into the title binary classification submodel to obtain a title classification result of each logic component, wherein the title classification result comprises a title or a non-title;
and adding the logic component with the title classification result as the title in the logic component ordered sequence into the title component ordered sequence.
7. The method of claim 5 or claim 6, wherein the features of the logic component are feature vectors,
inputting the logic assembly ordered sequence into the first feature extraction submodel to obtain the features of the logic assemblies in the logic assembly ordered sequence, wherein the method comprises the following steps:
acquiring a text feature vector of the logic component and a format feature vector of the logic component;
concatenating the text feature vector and the format feature vector into a feature vector for the logical component,
wherein the text feature vector of the logical component is generated based on the ordered sequence of characters of the logical component,
the format feature vector characterizes at least one of the following format information:
whether the text of the logic component is bold, the font size of the text of the logic component, whether the text of the logic component is centered, and the category to which the logic component belongs, wherein the category to which the logic component belongs comprises: paragraphs, tables, charts, pictures.
8. An apparatus for extracting a document directory structure, the apparatus comprising:
the title sequence acquisition module is used for acquiring the ordered sequence of the title components of the document to be processed;
the logic tree building module is used for building a title logic tree based on the hierarchical relationship among the title components in the ordered sequence of the title components;
and the target structure generating module is used for generating the directory structure of the document to be processed according to the title logic tree.
9. An apparatus for extracting a document directory structure, the apparatus comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method for extracting a document directory structure of any one of claims 1 to 8.
10. A computer storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of extracting a document directory structure of any one of claims 1 to 8.
CN202010344802.XA 2020-04-27 2020-04-27 Method, device, equipment and medium for extracting document directory structure Pending CN113642320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010344802.XA CN113642320A (en) 2020-04-27 2020-04-27 Method, device, equipment and medium for extracting document directory structure

Publications (1)

Publication Number Publication Date
CN113642320A true CN113642320A (en) 2021-11-12

Family

ID=78415101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010344802.XA Pending CN113642320A (en) 2020-04-27 2020-04-27 Method, device, equipment and medium for extracting document directory structure

Country Status (1)

Country Link
CN (1) CN113642320A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007286861A (en) * 2006-04-17 2007-11-01 Hitachi Ltd Method for extracting document structure and document search method
CN102467496A (en) * 2010-11-17 2012-05-23 北大方正集团有限公司 Method and device for converting stream mode typeset content into block mode typeset document
CN102567394A (en) * 2010-12-30 2012-07-11 国际商业机器公司 Method and device for obtaining hierarchical information of plane data
JP5433764B1 (en) * 2012-11-01 2014-03-05 ヤフー株式会社 Hierarchical structure modification processing apparatus, hierarchical structure modification method, and program
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
CN106649464A (en) * 2016-09-26 2017-05-10 深圳市数字城市工程研究中心 Method of building Chinese address tree and device
CN107145479A (en) * 2017-05-04 2017-09-08 北京文因互联科技有限公司 Structure of an article analysis method based on text semantic
CN107391675A (en) * 2017-07-21 2017-11-24 百度在线网络技术(北京)有限公司 Method and apparatus for generating structure information
CN109977366A (en) * 2017-12-27 2019-07-05 珠海金山办公软件有限公司 A kind of catalogue generation method and device
CN110532834A (en) * 2018-05-24 2019-12-03 北京庖丁科技有限公司 Table extracting method, device, equipment and medium based on rich text format document
CN110704573A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Directory storage method and device, computer equipment and storage medium
CN110852079A (en) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 Document directory automatic generation method and device and computer readable storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LAN WEI et al.: "Research on hierarchical tree structure of Multi Dimension Data Index in Centralized Environment", Iberian Journal of Information Systems and Technologies, pages 345-358 *
MUHAMMAD MAHBUBUR RAHMAN; TIM FININ: "Unfolding the structure of a document using deep learning", https://arxiv.org/pdf/1910.03678v1.pdf, pages 1-16 *
NAJAH-IMANE BENTABET et al.: "Table-Of-Contents generation on contemporary documents", https://arxiv.linfen3.top/abs/1911.08836, pages 1-8 *
LUO SHUANGLING; WANG TAO; KUANG HAIBO: "Research on hierarchical annotation system and folksonomy generation algorithm based on hierarchical tags", Systems Engineering - Theory & Practice, vol. 38, no. 7, pages 1862-1869 *
HUANG SHENG; WANG BOBO; ZHU JING: "Information extraction from financial announcements based on document structure and deep learning", Computer Engineering and Design, vol. 41, no. 1, pages 115-121 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114741147A (en) * 2022-03-30 2022-07-12 阿里巴巴(中国)有限公司 Method for displaying page on mobile terminal and mobile terminal
CN114741147B (en) * 2022-03-30 2023-11-14 阿里巴巴(中国)有限公司 Method for displaying page on mobile terminal and mobile terminal
CN116127079A (en) * 2023-04-20 2023-05-16 中电科大数据研究院有限公司 Text classification method
CN116127079B (en) * 2023-04-20 2023-06-20 中电科大数据研究院有限公司 Text classification method

Similar Documents

Publication Publication Date Title
KR101312770B1 (en) Information classification paradigm
US8468167B2 (en) Automatic data validation and correction
US8539349B1 (en) Methods and systems for splitting a chinese character sequence into word segments
US10956673B1 (en) Method and system for identifying citations within regulatory content
CN112131920A (en) Data structure generation for table information in scanned images
RU2760471C1 (en) Methods and systems for identifying fields in a document
CN111062451B (en) Image description generation method based on text guide graph model
US10963717B1 (en) Auto-correction of pattern defined strings
US20210110153A1 (en) Heading Identification and Classification for a Digital Document
Sinha et al. Visual text recognition through contextual processing
US20220335073A1 (en) Fuzzy searching using word shapes for big data applications
CN109165373B (en) Data processing method and device
US11615244B2 (en) Data extraction and ordering based on document layout analysis
CN113642320A (en) Method, device, equipment and medium for extracting document directory structure
CN108959204B (en) Internet financial project information extraction method and system
CN111488400B (en) Data classification method, device and computer readable storage medium
US20230134218A1 (en) Continuous learning for document processing and analysis
RU2703270C1 (en) Optical character recognition using specialized confidence functions, implemented on the basis of neural networks
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN113392189A (en) News text processing method based on automatic word segmentation
Klaiman et al. DocReader: bounding-box free training of a document information extraction model
WO2021154238A1 (en) A transferrable neural architecture for structured data extraction from web documents
CN112651590A (en) Instruction processing flow recommending method
CN116912867B (en) Teaching material structure extraction method and device combining automatic labeling and recall completion
Idziak et al. Scalable handwritten text recognition system for lexicographic sources of under-resourced languages and alphabets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination