CN117709337A

CN117709337A - Directory extraction method, device and storage medium based on chapter-level situation

Info

Publication number: CN117709337A
Application number: CN202311740980.4A
Authority: CN
Inventors: 冯卫强; 张友豪; 徐旺; 朱珊珊
Original assignee: Hefei Da Zhi Cai Hui Data Technology Co ltd
Current assignee: Hefei Da Zhi Cai Hui Data Technology Co ltd
Priority date: 2023-12-18
Filing date: 2023-12-18
Publication date: 2024-03-15

Abstract

The invention relates to the technical field of text processing and discloses a directory extraction method, device and storage medium based on chapter-level context. The method comprises the following steps. S1: acquire the article from which a directory is to be extracted, together with the text paragraph information of the article. S2: perform semantic extraction on each paragraph of the article to obtain the paragraph semantic features of the article. S3: input the paragraph semantic features into a paragraph interaction model to extract paragraph features based on chapter-level context. S4: input the paragraph features into a conditional random field to calculate the probability that the text paragraph information obtains the correct directory-level link, so that the directory-level label of each paragraph is obtained recursively and the directory extraction task is completed. The invention models the task directly with deep learning and dispenses with the laborious construction of a rule feature library. Replacing manual induction with a data-driven approach greatly reduces labor cost and improves the generalization and universality of the directory extraction method.

Description

Directory extraction method, device and storage medium based on chapter-level situation

Technical Field

The invention relates to the technical field of text processing, in particular to a chapter-level-context-based directory extraction method, equipment and a storage medium.

Background

Listed companies disclose a large number of documents each year to inform the public about their business operations. Many of these documents are PDF files and scanned documents, which often lack structural information and therefore pose a serious challenge to subsequent document processing. Of all document structure information, directory information is of critical importance. A directory helps divide the overall structure and logic of a document, so that readers can more easily understand the organization and context of the article and read it selectively; it also helps a text processing program quickly locate the range of text to analyze and filter out irrelevant content, improving the efficiency and accuracy of information extraction. How to obtain the directory structure information of documents such as PDFs efficiently and accurately is therefore a problem to be solved.

To solve this problem, existing methods manually construct a directory rule feature library and identify and extract the directory by the degree of rule matching, as in the directory extraction technologies disclosed in application numbers 201910500726.4, 201910973998.6 and 202310163468.1. These methods manually summarize feature words that indicate the directory level of a paragraph, such as "Chapter 1" and "Section 1", traverse all paragraphs of an article, and use regular-expression matching to judge whether the corresponding feature words are present, thereby completing the directory extraction task. The disadvantages of this approach are equally apparent. On the one hand, the rule feature library takes great manual effort to construct, the features can never be guaranteed to be exhaustive, and any directory feature missing from the library leads directly to a directory identification error. On the other hand, the approach cannot guarantee the accuracy of the directory hierarchy it produces, because the directory label hierarchy is specific to each document. In one document "Section 1" may belong to the primary directory, whereas in another it may not; determining the directory hierarchy solely from rule features is therefore unreasonable. Furthermore, feature engineering methods generally do not generalize, because different sets of documents may have different directory label systems and require different rule bases.

To overcome these drawbacks, some researchers have introduced machine learning into the directory extraction task. Instead of relying on manually built rule bases, such methods let a model learn patterns from data, which improves applicability and generalization. Typical cases are the technologies disclosed in application numbers 202211734526.3 and 202310291320.6, which mainly use classification models to determine whether each paragraph is a directory entry and, if so, its directory level. By introducing paragraph semantic representations to identify the directory of an article, these methods reduce the excessive dependence on rule features, but they still cannot guarantee the accuracy of directory-level identification. In different articles, similar semantic expressions may belong to different directory levels, and the specificity of an article cannot be handled by local semantic representations alone. The above solutions consider only the "commonality" of directory rules and semantics, not the "individuality" of each article. In addition, directory levels constrain one another; for example, a tertiary directory does not appear directly after a primary directory, skipping the secondary level. The above methods do not consider this relationship between directory levels. How to handle the directory extraction task effectively therefore remains a problem to be solved.

Disclosure of Invention

To avoid and overcome the technical problems in the prior art, the invention provides a directory extraction method, device and storage medium based on chapter-level context. When handling the directory extraction task, the method dispenses with the laborious construction of a rule base, learns the specificity of each article from its chapter-level context, and finally uses a conditional random field to account for the correlation between directory levels, thereby improving the accuracy of the directory extraction task.

In order to achieve the above purpose, the present invention provides the following technical solutions:

The invention discloses a directory extraction method based on chapter-level context, comprising the following steps S1 to S4.

S1, acquiring the article from which a directory is to be extracted, together with the text paragraph information of the article.

S2, performing semantic extraction on each paragraph of the article to obtain the paragraph semantic features of the article.

S3, inputting the paragraph semantic features into a preselected paragraph interaction model to extract paragraph features based on chapter-level context.

S4, inputting the paragraph features into a conditional random field to calculate the probability that the text paragraph information obtains the correct directory-level link, so that the directory-level label of each paragraph is obtained recursively and the directory extraction task is completed.

As a further improvement of the above scheme, in step S4, the probability that the text paragraph information obtains the correct directory-level link is calculated as follows:

where $m \in [1, \dots, M-1]$ and $M$ is the maximum number of paragraphs of the article; $y^m$ is the directory-level label of the $m$-th paragraph of the article and $t^p$ is the text paragraph information of the article; $h(y^m; t^p)$ is the score of predicting directory level $y^m$ for the $m$-th paragraph $t^{pm}$, obtained by passing the paragraph feature through a fully connected layer with sigmoid as the activation function; $g(y^m; y^{m+1})$ is the score of transitioning from the directory-level label $y^m$ of the $m$-th paragraph to the directory-level label $y^{m+1}$ of the $(m+1)$-th paragraph; and $Z(t^p)$ is the normalization factor that makes the final result a valid probability distribution.
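A linear-chain conditional random field consistent with the symbol definitions above can be written as follows. This is a reconstruction that assumes the standard linear-chain form; the symbol $P(y \mid t^p)$ for the probability and the explicit expression for $Z(t^p)$ are assumptions of this sketch rather than notation taken from the source.

```latex
P(y \mid t^p) = \frac{1}{Z(t^p)}
  \exp\Big( h(y^1; t^p) + \sum_{m=1}^{M-1} \big[ g(y^m; y^{m+1}) + h(y^{m+1}; t^p) \big] \Big),
\qquad
Z(t^p) = \sum_{y'}
  \exp\Big( h(y'^1; t^p) + \sum_{m=1}^{M-1} \big[ g(y'^m; y'^{m+1}) + h(y'^{m+1}; t^p) \big] \Big)
```

Here the sum in $Z(t^p)$ runs over every possible label sequence $y'$ of length $M$, which is what makes the probabilities of all directory-level links sum to one.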

As a further improvement of the above scheme, in step S4, the Viterbi algorithm is used to recursively obtain the directory-level label of each paragraph.

As a further improvement of the above scheme, the paragraph interaction model and the conditional random field are trained jointly before application to obtain an optimal model. The training process is as follows:

Using maximum likelihood, an objective function is set with the aim of maximizing the probability of the directory-level link.

A maximum number of iterations is set, and the model is trained on the objective function by back propagation and gradient descent; training stops when the maximum number of iterations is reached, so that the objective function is minimized and the optimal model is obtained.

As a further improvement of the above scheme, in step S3, the paragraph interaction model is selected according to the maximum number of paragraphs of the article. The specific process is as follows:

When the maximum number of paragraphs of the article is not higher than a preset paragraph number threshold, a paragraph interaction model with a BERT structure is adopted.

When the maximum number of paragraphs of the article is higher than the paragraph number threshold, a paragraph interaction model with a Longformer structure is adopted.

As a further improvement of the above scheme, in step S3, the structure of the selected paragraph interaction model is adjusted: the word embedding layer is deleted, and the paragraph semantic features are input into the model in place of the original word embedding result, thereby extracting paragraph features based on chapter-level context.

As a further improvement of the above scheme, in step S1, the text paragraph information of the article is extracted through the following steps:

S11, forming the text line information of the article from all text lines of the article.

S12, restoring the paragraph layout of the text line information, using a paragraph recognition model to merge text from lines into paragraphs and thereby acquire the text paragraph information of the article; the text paragraph information consists of all text paragraphs of the article.
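The line-to-paragraph merge in S12 can be sketched as follows. This is a minimal illustration, assuming the paragraph recognition model emits one tag per line, "B" for a line that begins a paragraph and "I" for a line that continues one. The tag scheme and the function name `merge_lines` are hypothetical, and the model that would produce the tags is not shown.

```python
from typing import List


def merge_lines(lines: List[str], tags: List[str]) -> List[str]:
    """Merge text lines into paragraphs from per-line B/I tags.

    "B" starts a new paragraph and "I" continues the current one. The tag
    scheme is an assumption of this sketch; in the described method the tags
    would come from the trained paragraph recognition model.
    """
    if len(lines) != len(tags):
        raise ValueError("lines and tags must have the same length")
    paragraphs: List[str] = []
    for line, tag in zip(lines, tags):
        if tag == "B" or not paragraphs:
            paragraphs.append(line.strip())
        else:
            # Continuation line: join with a space (for Chinese text one
            # would join without a separator).
            paragraphs[-1] = paragraphs[-1] + " " + line.strip()
    return paragraphs


lines = ["Chapter 1 Overview of the", "Company", "The company was founded", "in 2001."]
tags = ["B", "I", "B", "I"]
paragraphs = merge_lines(lines, tags)
```

In this example the heading that was split across two lines is restored to a single paragraph, which is the case the merge step is meant to handle.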

As a further improvement of the scheme, in step S2, each paragraph of the article is sequentially input into a pre-training language model to obtain an embedded vector of the paragraph, so that the paragraph semantic features of the article are obtained.

The invention also discloses a computer device comprising a processor and a memory, wherein the memory stores a computer program which can be executed by the processor, and the processor can execute the computer program to realize the directory extraction method based on chapter level situation.

The invention also discloses a storage medium, on which a computer program is stored, which when executed by a processor implements the above-mentioned chapter-level context-based directory extraction method.

Compared with the prior art, the invention has the beneficial effects that:

1. The invention converts the directory extraction problem into a sequence labeling problem of predicting the directory-level label of each paragraph. The task is modeled directly with deep learning, dispensing with the laborious construction of a rule feature library. Replacing manual induction with a data-driven approach greatly reduces labor cost and improves the generalization and universality of the directory extraction method.

2. The invention fully considers the specificity of each article. All paragraphs of the article are modeled interactively through a Transformer structure, which effectively extracts the full-text context of the article. Working from the full text allows the individuality of different articles to be taken into account, so that a directory hierarchy is tailored to each article and the directory-level problem is handled better.

3. The invention considers the correlation between directory levels. This correlation is encoded explicitly through a dedicated model layer: a conditional random field models the transitions between levels and yields a unique directory-level link, so that the complex relationships between directory levels are fully considered and accuracy is higher.

Drawings

Fig. 1 is a flowchart of a chapter-level context-based directory extraction method in embodiment 1 of the present invention.

Fig. 2 is a schematic block diagram of a computer device 10 according to embodiment 1 of the present invention.

Fig. 3 is a flowchart of a chapter-level context-based directory extraction method in embodiment 2 of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The embodiment of the invention provides a chapter-level context-based catalog extraction method, which converts a catalog extraction task into a sequence labeling problem of searching an optimal catalog level link, abandons a complicated rule feature construction process, considers the specificity of articles by chapter-level context modeling, and considers the relevance between catalog levels by means of a conditional random field, thereby obtaining a more accurate hierarchical catalog system.

Referring to fig. 1, the chapter-level context-based directory extraction method includes the following steps S1 to S4.

S1, acquiring the article from which a directory is to be extracted, together with the text paragraph information of the article. The text paragraph information of the article is extracted through the following steps S11 to S12.

In some embodiments, the embedded vector of the paragraph is obtained by sequentially inputting each paragraph of the article into the pre-trained language model, thereby obtaining the paragraph semantic features of the article.

The paragraph interaction model is selected according to the maximum number of paragraphs of the article. The specific process is as follows:

When the maximum number of paragraphs of the article is not higher than a preset paragraph number threshold, a paragraph interaction model with a BERT (Bidirectional Encoder Representations from Transformers) structure is adopted.

When the maximum number of paragraphs of the article is higher than the paragraph number threshold, a paragraph interaction model with a Longformer (Long-document Transformer) structure is adopted.

In some embodiments, the article paragraph number threshold may be set to 512, although other numbers may be set.
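The selection rule can be expressed in a few lines. This is a sketch only: the function name `select_interaction_model` and the returned strings are hypothetical, and the threshold defaults to the value 512 given in the text.

```python
PARAGRAPH_THRESHOLD = 512  # value given in the text; other values may be set


def select_interaction_model(num_paragraphs: int,
                             threshold: int = PARAGRAPH_THRESHOLD) -> str:
    """Return the name of the paragraph interaction structure to use."""
    if num_paragraphs < 0:
        raise ValueError("num_paragraphs must be non-negative")
    # "Not higher than" the threshold selects the BERT structure.
    return "bert" if num_paragraphs <= threshold else "longformer"


short_choice = select_interaction_model(300)
long_choice = select_interaction_model(2000)
```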

In step S3, the selected interactive model structure of the paragraph can be adjusted, the word embedding layer is deleted, and the semantic features of the paragraph are input into the model to replace the original word embedding result, thereby extracting paragraph features based on the chapter level situation.

The probability that the text paragraph information obtains the correct directory-level link is calculated as follows:

where $m \in [1, \dots, M-1]$ and $M$ is the maximum number of paragraphs of the article; $y^m$ is the directory-level label of the $m$-th paragraph of the article and $t^p$ is the text paragraph information of the article; $h(y^m; t^p)$ is the score of predicting directory level $y^m$ for the $m$-th paragraph $t^{pm}$, obtained by passing the paragraph feature through a fully connected layer with sigmoid as the activation function; $g(y^m; y^{m+1})$ is the score of transitioning from the directory-level label $y^m$ of the $m$-th paragraph to the directory-level label $y^{m+1}$ of the $(m+1)$-th paragraph; and $Z(t^p)$ is the normalization factor that makes the final result a valid probability distribution.

In some embodiments, the directory-level label of each paragraph may be obtained recursively with the Viterbi algorithm, a dynamic programming algorithm for finding the Viterbi path, that is, the hidden-state sequence most likely to have produced the observed sequence of events. Other typical methods for solving dynamic programming problems may of course also be employed.
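The following sketch shows Viterbi decoding over per-paragraph scores h and transition scores g. The label encoding (0 for a non-directory paragraph, 1 to 3 for directory levels) and the numeric scores are assumptions chosen for illustration, not values from the method.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[m][a]   : score h of giving label a to paragraph m
    transitions[a][b] : score g of moving from label a to label b
    """
    num_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for m in range(1, len(emissions)):
        new_score, pointers = [], []
        for b in range(num_labels):
            best_a = max(range(num_labels),
                         key=lambda a: score[a] + transitions[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emissions[m][b])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(range(num_labels), key=lambda a: score[a])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Assumed labels: 0 = not a directory entry, 1..3 = directory levels.
transitions = [[0.0] * 4 for _ in range(4)]
transitions[1][3] = -10.0  # discourage jumping from level 1 straight to level 3
emissions = [
    [0.0, 2.0, 0.0, 0.0],  # paragraph 0 looks like a level-1 heading
    [0.0, 0.0, 1.0, 1.5],  # paragraph 1 alone slightly prefers level 3
    [1.0, 0.0, 0.0, 0.0],  # paragraph 2 looks like body text
]
best_path = viterbi(emissions, transitions)
```

Taken paragraph by paragraph, the second paragraph would be labelled level 3, but the transition penalty makes the globally best sequence assign it level 2. This is the kind of constraint between directory levels that the conditional random field is meant to capture.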

Referring to fig. 2, the present embodiment further provides a computer device 10, where the computer device 10 includes a processor 11, a memory 12, a communication interface 13, and a bus 14.

The computer device 10 may be a smart phone, tablet computer, notebook computer, etc. capable of executing a program. The processor 11, the memory 12 and the communication interface 13 are electrically connected, either directly or indirectly, to enable transmission and/or interaction of data.

In this embodiment, the processor 11 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chip. The processor is typically used to control the overall operation of the computer device and may implement or perform the steps of the chapter-level context-based directory extraction method described above.

The memory 12 may be a nonvolatile memory such as a hard disk (HDD) or a Solid State Disk (SSD), or the like, or may be a volatile memory such as RAM. In this embodiment, the memory 12 may also be implemented as circuitry or any other element capable of performing a memory function for storing instructions and/or data.

Bus 14 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 14 may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration only one thick line is shown in fig. 2, but this does not mean that there is only one bus or one type of bus.

The present embodiment also provides a storage medium having a computer program stored thereon, which, when executed by a processor, can implement the above-mentioned chapter-level context-based directory extraction method.

Example 2

Referring to fig. 3, the present embodiment provides a chapter-level context-based directory extraction method, which is different from the directory extraction method of embodiment 1 in that the present embodiment further provides detailed steps of data set construction, model training, etc., and the directory extraction method includes steps 1 to 4.

Step 1, constructing a data set A required by directory extraction, which can specifically comprise steps 1.1-1.4.

Step 1.1, constructing the text line representation of the data required by directory extraction, recorded as $T^l = \{t_1^l, t_2^l, \dots, t_k^l, \dots, t_K^l\}$, where $t_k^l$ denotes the text line information of the $k$-th sample and $K$ is the total number of samples. The text line information $t_k^l$ consists of all text lines of an article and is recorded as $t_k^l = \{t_k^{l1}, t_k^{l2}, \dots, t_k^{ln}, \dots, t_k^{lN}\}$, where $t_k^{ln}$ denotes the $n$-th text line of the $k$-th text line information $t_k^l$ and $N$ denotes the maximum number of lines of the text information.

In this embodiment, PDF documents may be used as the data source. For text that can be copied, the text line information in the PDF file is obtained through an open-source PDF parsing framework such as PDFBox; for text that cannot be copied, such as scanned documents, OCR is used to obtain the text line information. In other embodiments, documents in other formats, such as DOC or DOCX, may of course be used.

Step 1.2, restoring the paragraph layout of the text line data. Because a directory entry may span several lines, the text line data must be merged to obtain the paragraph representation of the text information set, recorded as $T^p = \{t_1^p, t_2^p, \dots, t_k^p, \dots, t_K^p\}$, where $t_k^p$ denotes the text paragraph information of the $k$-th sample and $K$ is the total number of samples. The text paragraph information $t_k^p$ consists of all text paragraphs of an article and is recorded as $t_k^p = \{t_k^{p1}, t_k^{p2}, \dots, t_k^{pm}, \dots, t_k^{pM}\}$, where $t_k^{pm}$ denotes the $m$-th paragraph of the $k$-th text paragraph information and $M$ denotes the maximum number of paragraphs of the text information. $t_k^p$ is obtained by merging the text line set $t_k^l$ of step 1.1 with a paragraph recognition model; in this embodiment, the paragraph recognition model can be trained with a sequence labeling method to implement the merging of text from lines into paragraphs.

Step 1.3, constructing the label information set of the directory extraction data, recorded as $Y = \{Y_1, Y_2, \dots, Y_k, \dots, Y_K\}$, where $Y_k = \{y_k^1, \dots, y_k^m, \dots, y_k^M\}$ denotes the label information of the $k$-th sample and $y_k^m$ denotes the label value of the $m$-th paragraph in the $k$-th sample. The label value indicates either that paragraph $t_k^{pm}$ is not a directory entry or that paragraph $t_k^{pm}$ belongs to the level-$i$ directory, where $I$ is the maximum directory level. The label data in this embodiment are manually annotated.

Step 1.4, constructing the directory extraction dataset $A = \{T^p, Y\}$ from the text paragraph set $T^p$ and the label information set $Y$.
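As a rough illustration of the dataset built in step 1, the sketch below pairs each article's paragraphs with its labels and checks that the two are aligned. The function name `build_dataset`, the label encoding (0 for a non-directory paragraph, positive integers for directory levels) and the sample text are assumptions made for this example.

```python
from typing import List, Tuple

Sample = Tuple[List[str], List[int]]  # (paragraphs of one article, their labels)


def build_dataset(paragraph_sets: List[List[str]],
                  label_sets: List[List[int]]) -> List[Sample]:
    """Pair T^p with Y, one (paragraphs, labels) entry per article."""
    if len(paragraph_sets) != len(label_sets):
        raise ValueError("T^p and Y must contain the same number of samples")
    dataset: List[Sample] = []
    for k, (paragraphs, labels) in enumerate(zip(paragraph_sets, label_sets)):
        if len(paragraphs) != len(labels):
            raise ValueError(f"sample {k}: one label is required per paragraph")
        dataset.append((list(paragraphs), list(labels)))
    return dataset


# Assumed encoding: 0 = not a directory entry, 1 and 2 = directory levels.
T_p = [["Chapter 1 Overview", "The company was founded in 2001.", "Section 1 History"]]
Y = [[1, 0, 2]]
A = build_dataset(T_p, Y)
```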

Step 2, obtaining the paragraph semantic feature representation.

Specifically, each paragraph $t_k^{pm}$ in the $k$-th sample is input in turn into the pre-trained language model to obtain the embedding vector of $t_k^{pm}$, thereby obtaining the paragraph semantic feature representation of the $k$-th sample.

Step 3, paragraph interaction modeling based on chapter-level context, which specifically comprises steps 3.1 to 3.2.

Step 3.1, selecting the paragraph interaction model. Specifically, the model structure is selected according to the number of paragraphs of the article to be processed. When the maximum paragraph number M of the article is less than or equal to 512, a BERT structure is adopted, and the interaction among paragraphs is modeled fully with a fully connected self-attention mechanism. When the maximum paragraph number M of the article is greater than 512, a Longformer structure is adopted, which combines local and global sparse self-attention and greatly reduces the time complexity of the model while preserving the paragraph interaction modeling effect as far as possible.

Step 3.2, adjusting the structure of the paragraph interaction model: the word embedding layer is deleted, and the paragraph semantic feature representation obtained in step 2 is input into the model directly in place of the original word embedding result, finally yielding the paragraph feature representation based on chapter-level context.
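The idea of step 3, letting every paragraph vector attend to every other paragraph vector with no word embedding layer in between, can be illustrated with a single self-attention pass. This is a deliberately simplified sketch: it uses one head and identity projections with no learned weights, whereas a real BERT or Longformer structure stacks several layers with learned projections. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(paragraph_vectors: List[Vector]) -> List[Vector]:
    """One unparameterised self-attention pass over paragraph embeddings.

    The paragraph embeddings are used directly as the input sequence, in
    place of the word embeddings a language model would normally receive.
    """
    dim = len(paragraph_vectors[0])
    scale = math.sqrt(dim)
    outputs: List[Vector] = []
    for query in paragraph_vectors:
        # Scaled dot-product score of this paragraph against every paragraph.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in paragraph_vectors]
        weights = softmax(scores)
        # Weighted mix of all paragraph vectors: the contextual feature.
        outputs.append([sum(w * v[d] for w, v in zip(weights, paragraph_vectors))
                        for d in range(dim)])
    return outputs


contextual = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output vector is a weighted mixture of all paragraph vectors in the article, so the feature of one paragraph depends on the whole document rather than on that paragraph alone.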

Step 4, paragraph catalog level label prediction based on conditional random fields, which specifically comprises steps 4.1-4.4.

Step 4.1, inputting the paragraph feature representation of the $k$-th sample into the conditional random field layer to obtain the probability that sample $t_k^p$ obtains the correct directory-level link; the directory-level label of each paragraph $t_k^{pm}$ is then obtained recursively by the Viterbi algorithm. The probability is calculated as follows:

In the formula, $h(y_k^m; t_k^p)$ is the score of predicting directory level $y_k^m$ for paragraph $t_k^{pm}$, obtained by passing the paragraph feature representation through a fully connected layer with sigmoid as the activation function; $g(y_k^m; y_k^{m+1})$ is the score of transitioning from the directory-level label $y_k^m$ of the $m$-th paragraph to the directory-level label $y_k^{m+1}$ of the $(m+1)$-th paragraph, and is essentially an $I \times I$ learnable matrix; and $Z(t_k^p)$ is the normalization factor that makes the final result a valid probability distribution.
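To make the roles of h, g and the normalization factor concrete, the sketch below scores one label sequence and normalizes it by brute-force enumeration of every possible sequence. It assumes the standard linear-chain form of a conditional random field; the function names and numeric scores are hypothetical, and enumeration is only feasible for toy sizes.

```python
import math
from itertools import product
from typing import Sequence


def path_score(labels: Sequence[int],
               emissions: Sequence[Sequence[float]],
               transitions: Sequence[Sequence[float]]) -> float:
    """Sum of per-paragraph scores h and transition scores g along one path."""
    score = emissions[0][labels[0]]
    for m in range(len(labels) - 1):
        score += transitions[labels[m]][labels[m + 1]]
        score += emissions[m + 1][labels[m + 1]]
    return score


def path_probability(labels: Sequence[int],
                     emissions: Sequence[Sequence[float]],
                     transitions: Sequence[Sequence[float]]) -> float:
    """exp(score) of one path divided by the sum over all label paths (Z)."""
    num_labels = len(transitions)
    z = sum(math.exp(path_score(p, emissions, transitions))
            for p in product(range(num_labels), repeat=len(emissions)))
    return math.exp(path_score(labels, emissions, transitions)) / z


# Emission scores lie in (0, 1), as a sigmoid output would.
emissions = [[0.9, 0.1], [0.2, 0.7], [0.6, 0.4]]
transitions = [[0.5, -0.5], [0.3, 0.0]]
total = sum(path_probability(p, emissions, transitions)
            for p in product(range(2), repeat=3))
```

Summing `path_probability` over all label sequences gives one, which is the property the normalization factor is there to guarantee.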

Step 4.2, using maximum likelihood estimation with the aim of maximizing the probability of the correct directory-level link, the following loss function is used as the objective function J:
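One objective consistent with maximum likelihood estimation is the negative log-likelihood of the annotated label sequences, summed over the $K$ samples. The form below is a reconstruction under that assumption; the symbol $P(Y_k \mid t_k^p)$ for the probability of the directory-level link of the $k$-th sample is introduced here for illustration.

```latex
J = -\sum_{k=1}^{K} \log P\left(Y_k \mid t_k^p\right)
```

Minimizing $J$ in this form is equivalent to maximizing the product of the link probabilities over all samples.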

Step 4.3, setting the maximum number of iterations epoch_number and training the model on the objective function J by back propagation and gradient descent; training stops when the number of iterations reaches epoch_number, so that J is minimized and the optimal model is obtained.
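The quantity minimized during training can be sketched for a single sample as a negative log-likelihood, with the normalization factor computed by the forward algorithm rather than by enumeration. This is a minimal sketch that assumes the standard linear-chain form; the function names are hypothetical, and the gradient step itself is left to an automatic differentiation framework and is not shown.

```python
import math
from typing import List, Sequence


def log_sum_exp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def log_partition(emissions: Sequence[Sequence[float]],
                  transitions: Sequence[Sequence[float]]) -> float:
    """log Z via the forward algorithm, linear in the number of paragraphs."""
    num_labels = len(transitions)
    alpha: List[float] = list(emissions[0])
    for m in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[a] + transitions[a][b] for a in range(num_labels)])
            + emissions[m][b]
            for b in range(num_labels)
        ]
    return log_sum_exp(alpha)


def negative_log_likelihood(labels: Sequence[int],
                            emissions: Sequence[Sequence[float]],
                            transitions: Sequence[Sequence[float]]) -> float:
    """Loss for one sample: log Z minus the score of the annotated path."""
    score = emissions[0][labels[0]]
    for m in range(len(labels) - 1):
        score += transitions[labels[m]][labels[m + 1]] + emissions[m + 1][labels[m + 1]]
    return log_partition(emissions, transitions) - score


# With all scores zero, each of the 2**3 label paths is equally likely.
zeros_e = [[0.0, 0.0] for _ in range(3)]
zeros_t = [[0.0, 0.0], [0.0, 0.0]]
uniform_loss = negative_log_likelihood([0, 1, 0], zeros_e, zeros_t)
```

In the uniform example every one of the eight label paths has probability 1/8, so the loss equals log 8. Training lowers this value for the annotated path by adjusting the scores.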

Step 4.4, processing the article from which a directory is to be extracted with the optimal model obtained in step 4.3, and using the Viterbi algorithm to recursively find the directory-level link with the highest probability among all directory-level links, thereby completing the directory extraction task.
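Once each paragraph has a directory-level label, the directory itself can be read off the label sequence. The sketch below assumes the label encoding 0 for a non-directory paragraph and positive integers for directory levels; the function names and sample text are hypothetical.

```python
from typing import List, Tuple


def extract_directory(paragraphs: List[str], labels: List[int]) -> List[Tuple[int, str]]:
    """Keep only paragraphs labelled as directory entries, with their level."""
    if len(paragraphs) != len(labels):
        raise ValueError("one label is required per paragraph")
    return [(level, text) for text, level in zip(paragraphs, labels) if level > 0]


def render(entries: List[Tuple[int, str]], indent: str = "  ") -> str:
    """Render the directory as indented text, one entry per line."""
    return "\n".join(indent * (level - 1) + text for level, text in entries)


paragraphs = ["Chapter 1 Overview", "The company was founded in 2001.",
              "Section 1 History", "It listed in 2010."]
labels = [1, 0, 2, 0]  # assumed encoding: 0 = body text, 1 and 2 = directory levels
entries = extract_directory(paragraphs, labels)
outline = render(entries)
```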

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A directory extraction method based on chapter-level context, characterized by comprising the following steps:

S1, acquiring the article from which a directory is to be extracted, together with the text paragraph information of the article;

S2, performing semantic extraction on each paragraph of the article to obtain the paragraph semantic features of the article;

S3, inputting the paragraph semantic features into a preselected paragraph interaction model to extract paragraph features based on chapter-level context;

S4, inputting the paragraph features into a conditional random field to calculate the probability that the text paragraph information obtains the correct directory-level link, so that the directory-level label of each paragraph is obtained recursively and the directory extraction task is completed.

2. The chapter-level context-based directory extraction method of claim 1, wherein, in step S4, the probability that the text paragraph information obtains the correct directory-level link is calculated as follows:

3. The chapter-level context-based directory extraction method of claim 2, wherein, in step S4, the Viterbi algorithm is employed to recursively obtain the directory-level label of each paragraph.

4. The chapter-level context-based directory extraction method of claim 2, wherein the paragraph interaction model and the conditional random field are trained jointly before application to obtain an optimal model, the training process being as follows:

using maximum likelihood, setting an objective function with the aim of maximizing the probability of the directory-level link;

setting a maximum number of iterations and training the model on the objective function by back propagation and gradient descent, training being stopped when the maximum number of iterations is reached, so that the objective function is minimized and the optimal model is obtained.

5. The chapter-level context-based directory extraction method of claim 1, wherein, in step S3, the paragraph interaction model is selected according to the maximum number of paragraphs of the article, the specific process being as follows:

when the maximum number of paragraphs of the article is not higher than a preset paragraph number threshold, a paragraph interaction model with a BERT structure is adopted;

when the maximum number of paragraphs of the article is higher than the paragraph number threshold, a paragraph interaction model with a Longformer structure is adopted.

6. The chapter-level context-based directory extraction method of claim 5, wherein in step S3, the selected paragraph interaction model structure is adjusted, the word embedding layer is deleted, and paragraph semantic features are input into the model to replace the original word embedding result, thereby extracting paragraph features based on chapter-level context.

7. The chapter-level context-based directory extraction method of claim 1, wherein, in step S1, the text paragraph information of the article is extracted through the following steps:

S11, forming the text line information of the article from all text lines of the article;

S12, restoring the paragraph layout of the text line information, using a paragraph recognition model to merge text from lines into paragraphs and thereby acquire the text paragraph information of the article; the text paragraph information consists of all text paragraphs of the article.

8. The chapter-level context-based catalog extraction method of claim 1 wherein in step S2, each paragraph of an article is sequentially input into a pre-trained language model to obtain an embedded vector of the paragraph, thereby obtaining paragraph semantic features of the article.

9. A computer device comprising a processor and a memory, wherein the memory stores a computer program executable by the processor, and the processor executes the computer program to implement the chapter-level context-based directory extraction method as claimed in any one of claims 1 to 8.

10. A storage medium having stored thereon a computer program which when executed by a processor implements a chapter-level context based directory extraction method as claimed in any one of claims 1 to 8.