CN114356924A - Method and apparatus for extracting data from structured documents - Google Patents


Info

Publication number
CN114356924A
CN114356924A (application number CN202111658801.3A)
Authority
CN
China
Prior art keywords
data
text
text data
field type
model
Prior art date
Legal status
Pending
Application number
CN202111658801.3A
Other languages
Chinese (zh)
Inventor
凌悦 (Ling Yue)
Current Assignee
Shengdoushi Shanghai Science and Technology Development Co Ltd
Original Assignee
Shengdoushi Shanghai Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shengdoushi Shanghai Technology Development Co Ltd filed Critical Shengdoushi Shanghai Technology Development Co Ltd
Priority to CN202111658801.3A priority Critical patent/CN114356924A/en
Publication of CN114356924A publication Critical patent/CN114356924A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for extracting data from a structured document. The method comprises: obtaining a text data set of a structured document; determining sequence annotation data of the text data; determining a first data field type and a second data field type of the text data based on the sequence annotation data; and extracting text data corresponding to a preset data field type from the text data set based on the first and second data field types. With the method and the device, valuable information can be parsed and extracted from structured documents quickly, accurately, and intelligently.

Description

Method and apparatus for extracting data from structured documents
Technical Field
The present application relates to data extraction, and more particularly, to a method, apparatus, and computer storage medium for extracting text data having a preset data field type from a structured document.
Background
In automated information processing applications, information must be parsed and extracted from contract text, such as a rental contract, according to predetermined data types. Contract text is one type of structured document; its information typically comprises a collection of data corresponding to data fields of a number of specific contract information item types.
For rental contracts, which contain comparatively rich information, the extraction task is usually either completed manually or handled by treating the rental contract as a generic contract for information extraction. Manual parsing and extraction are inefficient and costly. When a rental contract is treated as a generic contract, only a limited set of data fields can be extracted, and content corresponding to the customized requirements of the rental contract cannot be obtained, so the complete information in the rental contract cannot be accurately extracted.
Accordingly, there is a need for improved data parsing and extraction of structured text, such as rental contracts.
Disclosure of Invention
Embodiments of the present application provide a method and a device for extracting data from a structured document, which can at least partially solve the problem of extracting text data from structured documents such as contract text (especially complex leasing contracts) according to data field type, thereby parsing and extracting valuable information in the structured document quickly, accurately, and intelligently.
According to one aspect of the application, a method for extracting data from a structured document is proposed, comprising: acquiring a text data set of a structured document, wherein the text data set comprises a plurality of text data; determining sequence annotation data of the text data; determining a first data field type of the text data based on the sequence annotation data, wherein the first data field type is associated with text features located adjacent or close to the text data in the text data set; determining a second data field type of the text data, wherein the second data field type is associated with text features located remotely from the text data in the text data set; and extracting text data corresponding to a preset data field type from the text data set based on the first data field type and the second data field type.
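The claimed steps can be sketched as a minimal pipeline. This is an illustrative assumption only: the function names, the BIO-style tagging rule, the co-occurrence rule for the second (distant) field type, and the field-type names are all hypothetical stand-ins for the trained models the application actually describes.

```python
def tag_sequence(text):
    """Stand-in sequence labeler: BIO-tag each whitespace token.

    A real system would use a trained model (e.g. BERT); here, as a toy
    rule, any capitalized token is treated as part of a 'party_name' entity.
    """
    return [(tok, "B-party_name" if tok[:1].isupper() else "O")
            for tok in text.split()]

def first_field_type(tags):
    """First data field type: decided from adjacent/nearby features."""
    for _, tag in tags:
        if tag != "O":
            return tag.split("-", 1)[1]
    return None

def second_field_type(text, dataset):
    """Second data field type: decided from distant features elsewhere
    in the text data set -- here a toy co-occurrence rule."""
    if "due" in text and any("rent" in other for other in dataset if other != text):
        return "rent_payment"
    return None

def extract(dataset, preset_types):
    """Keep only text data whose first or second field type is preset."""
    kept = []
    for text in dataset:
        tags = tag_sequence(text)
        types = {first_field_type(tags), second_field_type(text, dataset)}
        if types & preset_types:
            kept.append(text)
    return kept

docs = ["Lessee Acme Corp signs", "payment due on day one", "rent is fixed monthly"]
print(extract(docs, {"party_name", "rent_payment"}))
# -> ['Lessee Acme Corp signs', 'payment due on day one']
```

Note how the second clause is kept only because of a distant feature ("rent" in another clause), mirroring the claim's distinction between locally and remotely associated field types.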
According to another aspect of the application, a device for extracting data from a structured document is proposed, comprising: an acquisition unit configured to acquire a text data set of a structured document, wherein the text data set includes a plurality of text data; and an extraction unit configured to determine sequence annotation data of the text data, determine a first data field type of the text data based on the sequence annotation data, wherein the first data field type is associated with text features located adjacent or close to the text data in the text data set, determine a second data field type of the text data, wherein the second data field type is associated with text features located remotely from the text data in the text data set, and extract text data corresponding to a preset data field type from the text data set based on the first data field type and the second data field type.
According to yet another aspect of the application, a computer-readable storage medium is proposed, on which a computer program is stored, the computer program comprising executable instructions which, when executed by a processor, carry out the method as described above.
According to yet another aspect of the present application, an electronic device is provided, including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to execute executable instructions to implement the method as described above.
With the data extraction scheme of the embodiments of the application, an improved text parsing and extraction model is introduced for the text data sets of complex structured documents with a large number of data field types. Using a transfer-learning strategy, structured documents such as lease contracts are combined with a large document-level corpus for task-specific pre-training, enabling entity recognition and relation extraction on a small target corpus and yielding fast, accurate, and intelligent information recognition and extraction for structured documents. The scheme can also extend the set of data field types and supports evaluation and calibration against business data to further optimize model performance. After valuable information has been parsed and extracted, the extraction results can be post-processed for readability, providing a more user-friendly experience on top of the improved recognition and extraction quality.
Drawings
The above and other features and advantages of the present application will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 is a schematic logical block diagram of a system for extracting data from a structured document according to one embodiment of the present application.
FIG. 2 is a schematic flow diagram of a method of extracting data from a structured document according to one embodiment of the present application.
FIG. 3 is a schematic block diagram of an apparatus for extracting data from a structured document according to one embodiment of the present application.
FIG. 4 is a block diagram of a schematic structure of an electronic device according to one embodiment of the present application.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. In the drawings, the size of some of the elements may be exaggerated or distorted for clarity. The same reference numerals denote the same or similar structures in the drawings, and thus detailed descriptions thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, etc. In other instances, well-known structures, methods, or operations are not shown or described in detail to avoid obscuring aspects of the technology of the present application.
In the present application, the systems, methods, and devices of the embodiments are described using the parsing and extraction of data of interest from a rental contract as an example. This example is illustrative, not restrictive. The data extraction scheme provided herein can be applied to intelligent parsing and extraction of key information from contract documents such as leasing contracts, and equally to other structured documents or texts. Such structured documents generally contain multiple types of data fields, with the key data divided and stored in corresponding parts of the document according to classification rules defined by the data field types. Structured documents include, for example, contracts, resumes, and reports, and may be used across industries such as catering, retail, rental, transportation, and trade.
Structured documents typically include various information items represented as text data. These items correspond to information types characterizing different subjects and contents; hereinafter, the information type of text data is denoted by its data field type. A complete structured document includes multiple data field types, and each data field type covers text data of various kinds and lengths. Text in natural-language form may be one or more characters or words, a phrase or sentence composed of them, or one or more paragraphs composed of multiple sentences. Text data may include characters, numeric data, and various combinations of the two, and may also include character and numeric data overlaid on, superimposed on, or embedded in pictures. Within words, phrases, sentences, or paragraphs, numeric and non-numeric symbols may also serve as components, forming text data in the broad sense. Herein, "text data" generally refers to a continuous string of characters and values conveying some subject or piece of information, such as a continuous phrase or sentence comprising multiple tokens (characters and values). A paragraph is regarded as a set of sentences, so the term "text data set" refers to a paragraph, or a combination of paragraphs, whose character string is longer than a phrase or sentence. The paragraphs making up a text data set may fall under the same data field type or under different data field types.
Each information item or data field under a data field type in a structured document may include one or more words, phrases, and sentences, and one or more paragraphs containing them; that is, all text data under a given data field type may form a text data set composed of one or more text data. For the structured document as a whole, the text data included under all data field types constitutes the document's text data set. The concept of a text data set therefore covers not only the text data belonging to a single data field type but also the text data belonging to all data field types of a document. When a structured document comprises multiple associated documents (e.g., a contract with multiple sub-contracts), a text data set may also refer to the combination of all text data included in those documents. Text data belonging to a given data field type may be distributed within the same structured document or across different documents in an associated set. Ideally, the information items in a structured document correspond one-to-one with data field types. In practice, the text data under one information item may belong to the same or different data field types, or the text data under multiple information items may all belong to the same data field type.
The task of parsing and extracting key information from structured documents such as rental contracts has progressed from inefficient manual approaches to intelligent information processing systems supported by techniques such as Artificial Intelligence (AI). In automated parsing and extraction, a structured document with complex content is usually treated as a generic contract document, and text features are extracted from it according to the data field types common to generic contracts. The amount, efficiency, and accuracy of the information an AI model can extract depend primarily on the preset configuration of data field types associated with the document type, and the data field types of a contract depend on the information to be parsed and extracted from it. Leasing contracts, particularly those used by restaurant stores in the catering industry, involve a very large number of data field types, cover wide-ranging content, and are logically complex, so AI schemes based on the common data field types of generic contracts extract key information inefficiently and inaccurately. For example, the extraction fields of a generic contract mainly cover items such as customer name, amount, contract time, and property title holder, and cannot meet the customized requirements of a restaurant lease, such as whether the rent is a fixed rent or a percentage (turnover-based) rent, or whether payment is made by calendar month or by lease month, which are specific to restaurant leasing contracts.
The preset data field types of generic contracts are few in number and simple in logic, and cannot accommodate the complex data field type system of a leasing contract in a specific scenario.
For a complex structured document of the rental-contract type, more elaborate data field types may be defined to characterize its complex information items. For example, 182 data field types may be set for the rental contracts of restaurant stores: besides the inherent data field type of the contract name, 181 data field types specific to rental contracts can be included to meet the special requirements of the rental scenario. These specific data field types enable intelligent parsing and extraction of text data sets for many (e.g., 116) logically complicated and highly variable data field types whose subjects must be understood in combination with contextual semantics. The preset specific data field types include specialized contract information items such as rent payment mode, whether there is an independent third-party management fee, general-ledger sales account access rules, compensation funds, security deposits, and decoration time. Extracting data against such a large number of lease-specific preset data field types poses a great challenge to manual approaches and simple intelligent contract-extraction schemes alike.
In addition, the text data contained in structured documents with large data content and complex logical relationships (e.g., detailed rental contracts) are interrelated, so that text data and text data sets within the same paragraph, across different paragraphs, and even across paragraphs, sentences, and phrases in different contract information items may involve text data belonging to different preset field types. The contract information items in a rental contract are then no longer limited to the data field types defined for the document's parsing and extraction. This is because many information items in a complex structured document are organized according to document-specific logical relationships, which differ from the needs of intelligent information parsing and extraction. Other causes include a lack of standardization in how rental contracts are drafted, and logical recursive references and branches between contract information items. For example, within a contract or an associated set of contracts, some contract information item or (sub-)contract may contain a reference such as "see another contract information item, or a certain information item or term in another (sub-)contract", making the data field types of the text data in these items more complex.
A contract information item may combine text data belonging to several data field types, and the same text data may even belong to several data field types at once. As a result, text data relating to the same or similar contract information can appear in different contract information items, at multiple positions within the same item, or even in different (sub-)contracts; viewed at the level of words, phrases, sentences, and paragraphs, these occurrences are not always adjacent or close, and in some cases lie far apart.
How to accurately and efficiently determine the data field types of text data in structured documents with complex logical relationships and large amounts of data, and to present business personnel with the complete text data corresponding to the preset data field types, is one of the key tasks addressed by the scheme of the embodiments of the present application.
The logical architecture of a system for extracting data from a structured document according to an embodiment of the present application is described below with reference to fig. 1, taking a rental contract as an example.
As indicated by the dashed box 100, the system first obtains a text data set 101 of the plurality of text data contained in a structured document or an associated set of structured documents. The rental contract whose information is to be extracted may come from a restaurant store or restaurant business, or from a subject or entity associated with the contract. The restaurant, as lessee of the business location (i.e., Party B of the lease contract), submits a lease request to the building owner, property manager, or similar party acting as lessor (i.e., Party A of the lease contract), so the contract may also come from the lessor. It may likewise be provided by other participants related to the contract, such as a supervising party.
If the rental contract is not in the form of an electronic document, it must first be digitized for subsequent processing. Typically, digitization includes scanning the contract to obtain a scan file (e.g., in PDF or picture format). The scanned document may then be further converted, by full-text or partial-text recognition (e.g., using OCR technology), into an encoded file from which text data can be identified and extracted.
The dashed box 100 also contains the rules for parsing and extracting information from the document, i.e., the preset data field types 102. The preset data field types 102 may be configured according to different requirements when the extraction model or system is built, or may be entered or updated from user settings. They are associated with the information parsing and extraction requirements for the structured document (the rental contract) and, as one of the parsing and extraction rules, can be provided and/or maintained by the business department of a restaurant store or restaurant enterprise, or by any user needing to extract information from the document. In general, the preset data field types 102 should be the same as, similar to, or at least associated with the contract information items in the rental contract so as to adequately reflect the contract's actual information. Ideally, the preset data field types 102 match the contract items one-to-one, and the main goal of the parsing and extraction task is to distill the logically complex contract text into a combination of key information as intelligently and automatically as possible, to facilitate review and processing by the user.
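A configuration of preset data field types 102 might look as follows. The field names are examples drawn from the description (rent payment mode, third-party management fee, deposit, decoration time); the dictionary shape and the attributes per field are hypothetical, not an authoritative schema from the application.

```python
# Illustrative preset data field types (102). Names come from examples in
# the description; the per-field attributes are assumptions for this sketch.
PRESET_DATA_FIELD_TYPES = {
    "contract_name":        {"required": True,  "multi_valued": False},
    "rent_payment_mode":    {"required": True,  "multi_valued": False},  # fixed vs. percentage rent
    "third_party_mgmt_fee": {"required": False, "multi_valued": False},
    "deposit":              {"required": False, "multi_valued": True},
    "decoration_time":      {"required": False, "multi_valued": False},
}

def is_preset(field_type):
    """Check whether a field type is one of the preset extraction rules."""
    return field_type in PRESET_DATA_FIELD_TYPES

print(is_preset("rent_payment_mode"), is_preset("unknown_field"))
# -> True False
```

Because the types are plain configuration data, adding a new lease-specific field is a one-line change, which matches the document's claim that the data field types can be extended and user-maintained.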
After the text data set 101 of the rental contract is acquired, the system enters the data extraction process. The data extraction process may be implemented using a Named Entity Recognition (NER) model.
The NER model extracts text data from a text passage (i.e., a text data set) according to predefined entity types (e.g., the preset data field types 102) and determines the entity type to which each piece of text data belongs. The output text data may be a sentence, phrase, or word in a paragraph, or a combination thereof. Entity types take different forms in different application scenarios. For a rental contract, the entity types are contract-related data field types, such as lessor and lessee names, lease term, rent payment, and management fee. For resume documents, the entity types may be names, educational background, professional experience, skills, and so on. NER models are also applied to information extraction, relation extraction, parsing, information retrieval, question-answering systems, machine translation, and the like. In the application scenario of the embodiments herein, information extraction from complex structured documents by the NER model mainly means extracting and classifying the lease-contract information carried by text data with specific semantic relationships within the document's text data set, especially semantic relationships between text data and text features located far apart in the set. Here, a semantic relationship is one between text data such as words, phrases, or sentences, or between text features extracted from the text data that characterize attributes of the text along one or more dimensions. The determined entity type (data field type) of a piece of text data corresponds to a binary, multi-class, or multi-label classification result.
When text data carries information relevant to several data field types, multi-label classification is the better fit. The classification result includes not only the data field type the text data has or belongs to, but also the probability or likelihood of belonging to that type. In the simple binary case, the model decides whether a given piece of text data belongs to a given data field type or not.
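The multi-label case described above can be illustrated with a minimal sketch: each field type gets an independent probability (here via a sigmoid over a toy logit), and every label whose probability clears a threshold is kept. The logit values, label names, and threshold are invented for illustration; a real model would produce the logits from text features.

```python
import math

def sigmoid(x):
    """Logistic function, mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_classify(logits, labels, threshold=0.5):
    """Multi-label classification: each data field type gets an independent
    probability, and all labels at or above the threshold are kept."""
    probs = {lab: sigmoid(z) for lab, z in zip(labels, logits)}
    return sorted(lab for lab, p in probs.items() if p >= threshold)

labels = ["rent_payment_mode", "deposit", "decoration_time"]
print(multilabel_classify([2.0, -1.5, 0.4], labels))
# -> ['decoration_time', 'rent_payment_mode']
```

With independent sigmoids, one piece of text data can legitimately carry zero, one, or several field types, which is exactly the behavior the multi-label mode is chosen for; binary classification is the degenerate case with a single label.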
The NER model may be implemented with conventional algorithmic approaches such as an expert-maintained rule knowledge base, clustering, or probabilistic sequence algorithms, or with machine learning or neural network models. Neural network models include convolutional neural networks (CNN), recurrent neural networks (RNN), and the like. Neural network models based on natural language processing (NLP) techniques offer notable speed and accuracy advantages for classifying and extracting information from text data sets containing natural-language text, and when combined with statistical algorithms bring many benefits to an NER model. The neural network model may be pre-trained and trained with supervised, semi-supervised, or unsupervised training data sets to determine its optimal parameters. In particular, NLP-based neural network models can be trained on large-scale corpora without supervision, acquiring sufficient text representation capacity and performing well at entity recognition and relation extraction on small target corpora.
A deep-learning NER model may use neural network structures such as ELMo, GPT, and BERT, among which BERT performs relatively better. The BERT model can learn from massive data without supervision, yielding strong text feature extraction and classification prediction. For parsing and extracting information from Chinese structured documents, the BERT model can be combined with models such as LSTM to form composite NER models, for example BERT + LSTM, LSTM + Conditional Random Field (CRF), or BERT + CRF, for better text information extraction.
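The CRF layer in a BERT + CRF combination decodes the best tag sequence with the Viterbi algorithm. The sketch below shows that decoding step in isolation, with hand-picked toy emission and transition scores (a real model learns both); the only structural constraint encoded here is the standard BIO rule that an "I" tag cannot follow an "O".

```python
def viterbi(emissions, transitions, tags):
    """Viterbi decoding as used by the CRF layer of a BERT+CRF model:
    pick the highest-scoring tag sequence given per-token emission scores
    and tag-to-tag transition scores (all values here are toy numbers)."""
    score = {t: emissions[0][t] for t in tags}  # best score of a path ending in t
    back = []                                   # back-pointers per position
    for i in range(1, len(emissions)):
        new_score, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[best_prev] + transitions[(best_prev, t)] + emissions[i][t]
            pointers[t] = best_prev
        score, back = new_score, back + [pointers]
    # follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

tags = ["B", "I", "O"]
# Forbid the invalid BIO transition O -> I with a large penalty.
transitions = {(p, t): (-5.0 if (p == "O" and t == "I") else 0.0)
               for p in tags for t in tags}
emissions = [{"B": 2.0, "I": 0.0, "O": 0.5},
             {"B": 0.0, "I": 1.5, "O": 1.0},
             {"B": 0.0, "I": 0.2, "O": 1.0}]
print(viterbi(emissions, transitions, tags))
# -> ['B', 'I', 'O']
```

This is why a CRF on top of BERT helps: the per-token BERT scores alone could emit an ill-formed tag sequence, while the transition scores let the decoder enforce sequence-level consistency.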
In the process of analyzing and extracting the information of the text data set by the NER model, a modular processing mode can be adopted.
First, the NER model performs feature representation, processing, and transformation. The model applies an embedding operation to the character sequence of the raw input text data, optionally adding features used in traditional shallow supervised neural network models. Here, an embedding operation maps high-dimensional raw text data with multi-dimensional features or attributes (e.g., a sentence with multiple words) to low-dimensional representations (e.g., low-dimensional or single-dimensional feature vectors). Embedding operations include word embedding, character embedding, and the like. For character embedding, Chinese characters can be embedded based on UTF-8 encoding or on radicals and strokes; for example, the BERT model mentioned above converts Chinese characters into encodings in order to extract semantic relationships between text features.
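The embedding step amounts to a learned lookup from tokens to low-dimensional vectors. The toy sketch below stands in for that lookup with random (but seeded, hence reproducible) vectors; a trained model like BERT would instead use learned values, and the vocabulary, dimension, and seed here are arbitrary choices for illustration.

```python
import random

def build_embedding_table(vocab, dim=4, seed=0):
    """Toy embedding table: map each token to a low-dimensional vector,
    standing in for the learned word/character embeddings of a real model."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}

def embed(tokens, table):
    """Replace each token with its vector (the embedding lookup)."""
    return [table[t] for t in tokens]

table = build_embedding_table(["rent", "deposit", "[UNK]"])
vectors = embed(["rent", "deposit"], table)
print(len(vectors), len(vectors[0]))  # 2 tokens, 4 dimensions each
```

The point of the mapping is exactly what the paragraph describes: variable, high-dimensional character sequences become fixed-size numeric vectors that downstream layers can transform and encode.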
After feature representation, the model encodes the features. The NER model may use a CNN, RNN, or Transformer structure to transform and encode the output of the embedding operation. When used for text feature extraction and text classification prediction, the NLP-based BERT model employs a Transformer-structured encoder as its feature extractor and supports partial model updates by fine-tuning (for example, through a softmax sub-network). The Transformer is preferred because, for feature extraction, performance decreases in the order Transformer, RNN, CNN. The Transformer architecture can combine (e.g., concatenate) the embedded outputs to obtain feature maps or feature vectors of larger size and dimension.
Finally, the NER model performs label decoding and outputs the type or label corresponding to the text data. When the BERT model serves as the body of the NER model, it can carry out the analogous feature representation, processing, transformation, and encoding steps; during label decoding, however, its output is usually the tag or type of a text feature extracted from the text data, and generally cannot directly determine or predict the types of multiple text data in a text data set (for example, the data field types of text data in a contract document). Furthermore, the BERT model can only extract semantic relationships between text features located adjacent or very close within a piece of text data (e.g., a sentence or phrase); this falls short for the words, phrases, and sentences at multiple, especially distant, locations that occur in logically complex, data-rich structured documents, or even across multiple structured documents.
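The label-decoding step described above commonly turns per-token tags into entity spans. A minimal sketch, assuming the standard BIO tagging scheme (the entity names "rent" and "pay" are invented for the example):

```python
def decode_bio(tokens, tags):
    """Decode token-level BIO tags into (entity_type, text) spans --
    the label-decoding step at the end of the NER pipeline."""
    spans, cur_type, cur_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # a new entity starts
            if cur_type:
                spans.append((cur_type, "".join(cur_toks)))
            cur_type, cur_toks = tag[2:], [tok]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_toks.append(tok)               # entity continues
        else:                                  # "O" or inconsistent tag
            if cur_type:
                spans.append((cur_type, "".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type:                               # flush a trailing entity
        spans.append((cur_type, "".join(cur_toks)))
    return spans

tokens = list("租金按月支付")  # character-level tokens, as for Chinese text
tags = ["B-rent", "I-rent", "O", "O", "B-pay", "I-pay"]
print(decode_bio(tokens, tags))
# -> [('rent', '租金'), ('pay', '支付')]
```

Character-level tokens are used here because, as the description notes, Chinese BERT variants typically tokenize per character.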
In the embodiment shown in Fig. 1, the NER model implementing the data extraction process combines a multi-layer perceptron (MLP) and a CRF structure on top of a BERT model structure. This combination is exemplary; those skilled in the art may adopt other model structures or combinations according to the functions and requirements of the NER model. The NER model mainly includes the portions shown by dashed boxes 110, 120, and 130.
Before the first BERT model 121 is used to extract the sequence annotation data 103 of the text data in the text data set 101, the text data may be preprocessed 110. The preprocessing may include a long-document-split operation 111, among others.
Text data can be divided by length into text data of ordinary length (e.g., words and simple phrases) and long text data (e.g., long complex phrases and sentences). Before the text data is input to the first BERT model 121 to determine the corresponding sequence annotation data 103, unnecessary data may be filtered out, such as conjunctions, punctuation marks, and filler words. Ordinary-length text data contains little such unnecessary data and is easy to process, so its text features (e.g., words) can be determined by simple filtering. Long text data is lengthy (e.g., includes multiple complex phrases and sentences), and text feature extraction remains inconvenient even after unnecessary data is removed; it therefore needs to be divided into multiple words, or groups of simple words, that can be processed directly or simply to extract text features. Ordinary-length text data typically carries a single topic or semantics, or clearly related ones, and can thus be provided as an input vector to the first BERT model 121. Long text data (e.g., sentences) may carry multiple topics or semantics with no correlation between them, so the role of the long text data segmentation operation 111 is to split such data into multiple text data, each corresponding to a single topic or clearly correlated semantics, facilitating extraction of the sequence annotation data 103 by the first BERT model 121.
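A simplified long-document-split operation 111 might cut on sentence boundaries and then pack sentences into chunks under a length limit. The sentence-boundary regex (Chinese and Western end punctuation) and the length limit are assumptions of this sketch, not details from the application.

```python
import re

def split_long_text(text, max_len=20):
    """Split long text data into sentence-level chunks no longer than
    max_len characters (a simplified long-document-split operation)."""
    # Split AFTER each sentence-ending mark, keeping the mark (lookbehind).
    sentences = [s for s in re.split(r"(?<=[。.!?！？])", text) if s]
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + len(s) > max_len:
            chunks.append(cur)   # current chunk is full; start a new one
            cur = s
        else:
            cur += s
    if cur:
        chunks.append(cur)
    return chunks

doc = "The lease runs five years. Rent is fixed. Deposit due at signing."
print(split_long_text(doc, max_len=30))
```

Because the split keeps whole sentences together, each chunk tends to carry a single topic, which is the property the description says the first BERT model 121 needs from its inputs.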
Specifically, word embedding implemented by the BERT model may include tokenization (also referred to as token embedding), sequence embedding (segment embedding), position embedding, and the like. The tokenization operation may be used to convert character-type text data, such as the Chinese characters of natural language text, into encoded (token-type) text data, to facilitate text feature extraction and subsequent classification. Text features may be represented by the words, numbers, and symbols corresponding to the encoded tokens. The preprocessing operation 110 may be performed independently of the BERT model before the first BERT model 121 determines the sequence annotation data 103 corresponding to the text data, or may be integrated into the first BERT model 121 as a sub-function or sub-step that the first BERT model 121 executes when determining the sequence annotation data 103.
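A minimal sketch of the tokenization step, assuming a character-level vocabulary in the style of BERT's Chinese tokenizers; the vocabulary contents and the `[UNK]` fallback convention are illustrative, not details fixed by this embodiment.

```python
def tokenize(text, vocab):
    # Map each character to its token id; unknown characters fall back
    # to the [UNK] id, mirroring BERT-style character tokenization.
    unk = vocab.get("[UNK]", 0)
    return [vocab.get(ch, unk) for ch in text]
```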
Text data of the normal-length, long and/or character-type text data types in the text data set 101 are input, either directly or after processing, to the first BERT model 121 in the determination stage (indicated by dashed box 120) of the data field types, including the first data field type and the second data field type. As described above, the BERT model, as an NLP feature extraction and classification prediction model, may perform natural language embedding based on the input (preprocessed) text data, extract associated text features representing the text data, and output the text data in the text data set 101 together with the corresponding sequence annotation data 103. The first BERT model 121 has been trained on large-scale corpora, and its Transformer structure can better capture the semantic relationships between the tokens representing Chinese characters, so that text features associated with the preset data field types can be accurately extracted for subsequent processing.
According to the embodiment of the present application, the text features extracted by the first BERT model 121 may be further processed, for example, including sequence tagging (sequence tagging) and position embedding (position embedding), to obtain sequence tagging data 103 representing position information of these text data. As mentioned above, BERT models are generally good at extracting semantic relationships for keywords that are located adjacent or very close to each other in text data (e.g., sentences or phrases), and thus it is necessary to further extract semantic relationships for those keywords that are located farther away in the text data and in the text data set, so as to more fully extract text features related to the preset data field type 102 and the text data from or to which the text features belong.
More information for semantic extraction may be introduced by a sequence labeling operation and a position embedding operation, wherein the sequence labeling may mark a sequence label (e.g., a BIO label) of each text feature in the corresponding text data to obtain sequence labeling data 103 of the text data, and the position embedding may extract positions of the text features in the text data and the text data set to represent distances between semantic relationships of a plurality of text features.
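The position embedding operation can be illustrated with the fixed sinusoidal encoding from the original Transformer; note that this is a stand-in sketch, since BERT models actually learn their position embeddings during pre-training.

```python
import math

def position_embedding(seq_len, dim):
    # Fixed sinusoidal position encoding: even dimensions use sine,
    # odd dimensions use cosine, at wavelengths growing with dimension,
    # so that relative distances between positions are recoverable.
    emb = [[0.0] * dim for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            emb[pos][i] = math.sin(angle)
            if i + 1 < dim:
                emb[pos][i + 1] = math.cos(angle)
    return emb
```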
The sequence annotation may adopt the annotation and word segmentation method of BIO (Begin-Inside-Outside) labeling. In the sequence annotation process, the text sequence of the text data corresponds to a sentence or phrase in which each word is an element, and each element receives one of the BIO labels. Here, a word generally refers to a text feature or element having a unique token, or a word, number or symbol characterized by a token, that characterizes a certain dimensional feature of the text data. The first word of an annotated segment receives a B-tag, subsequent words within the segment receive I-tags, and words outside any segment receive O-tags. A sentence or phrase having multiple words thus has a label set consisting of multiple BIO labels. The BIO labels may characterize the multi-dimensional or multi-attribute text features of the text data represented by the sentence or phrase, and are particularly suitable for characterizing semantic relationships between location-dependent text features of the text data. The preset data field type may also be regarded as the data type corresponding to a sentence or phrase having the multi-dimensional features or attributes of multiple BIO labels, or as a code or numerical value corresponding to that sentence or phrase.
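Under this scheme, assigning BIO labels to a token sequence can be sketched as below; the span-to-field-type mapping used as input is an assumption made for illustration.

```python
def bio_tag(tokens, entities):
    # entities maps (start, end) token spans to a field-type name.
    # Tokens outside every span keep the O label; the first token of a
    # span gets B-<type>, subsequent tokens in the span get I-<type>.
    labels = ["O"] * len(tokens)
    for (start, end), field_type in entities.items():
        labels[start] = f"B-{field_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{field_type}"
    return labels
```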
The rules of the sequence annotation (e.g., BIO annotation) characterize the mapping relationship between the location-dependent text features of the text data to the preset data field type 102, and may determine the structural parameters of the model that extracts the semantic relationship between the text features. For example, when a CRF neural network model is employed to determine semantic relationships between text features having locations that are relatively far apart, a corresponding number of network layers of the CRF neural network model may be selected according to the sequence labeling rules. When the BERT model is adopted for determining the semantic relation between text features with positions at close distances, parameters related to the network structure of the BERT model can also be selected according to the sequence labeling rule.
According to an embodiment of the present application, part or all of the operations of the preprocessing operation section 110, and at least one of the operations of sequence labeling and position embedding of text data (text features) may be integrated into the first BERT model 121.
Based on the location data extracted by the location embedding operation, the location relationship between the text features may be determined. The sequence annotation data 103 output by the first BERT model 121 and the text data set 101 (and the preprocessed text data) including a plurality of text data are supplied to the second BERT model 122. The second BERT model 122 performs first feature extraction on the text data based on the text data set 101 and the sequence labeling data 103 corresponding to the text data in the text data set 101 to obtain first feature data. Since the BERT model is focused on semantic relationship extraction between text features that are located adjacently or proximately, information characterizing the semantic relationship of the text features of the text data, in particular the semantic relationship of text features associated with those text features that are located adjacently or proximately in the text data set, is included in the first feature data.
The second BERT model 122 predicts the corresponding preset data field type 102 of the text data having the text features with adjacent or similar positions, or marks the corresponding preset data field type 102 for the text data. The data field type of the text data determined by the second BERT model 122 is referred to as the first data field type 104, and represents a data field type corresponding to a subject or semantic relationship represented by a text feature having a closer relationship in the text data and/or the text data set. In this context, adjacent or proximate in position of text features means that at least two text features are located in the same text data, i.e. in the same words, phrases and sentences. At least two text features (e.g., words, numbers, and symbols corresponding to the code Token) may be text features within the same phrase or sentence. In the same phrase or sentence, two text features may be adjacent or not, and the position relationship of the text features is called adjacent or similar. According to an embodiment of the present application, the at least two text features may also be text features within two adjacent phrases or sentences. In the case of being located in an adjacent phrase or sentence, the two text features may be adjacent or not, and the positional relationship of the text features is also referred to as adjacent or similar. In a particular case, text features that are adjacent or near to each other within adjacent phrases or sentences refer primarily to at least two text features that are in different phrases or sentences, respectively, in adjacent phrases or sentences, but are still adjacent. For example, one text feature is located in a preceding phrase or sentence of two adjacent phrases or sentences, and the other text feature is located in a succeeding phrase or sentence. 
In the process of determining the data field type information of the text data based on the position semantic relationship between the text features provided by the sequence annotation data 103, the second BERT model 122 can process and determine the data field type only for the semantic relationship related to the text features adjacent or close to the position, and the semantic relationship related to the text features not adjacent or close to the position cannot be extracted by the second BERT model 122. However, the first feature data output via the second BERT model 122 includes both textual feature relationships that can be extracted by the second BERT model 122 and textual feature relationships that cannot be extracted by the second BERT model 122 (the partial feature relationships may be extracted by the CRF model below) in order to provide the complete set of information from the set of textual data 101 and the sequence annotation data 103 to subsequent models and modules for processing.
In general, the performance of the BERT model for extracting text data of various types and lengths in a structured document under the application scenario of the present application may not be good enough, so that adding an additional model to improve the extraction performance of semantic relations related to text features adjacent or close to positions may be considered. According to an embodiment of the application, the second BERT model 122 is followed by an MLP model 123 (comprising an input layer, a plurality of hidden layers and an output layer). The MLP model 123 is used to perform further feature extraction (which may also be referred to as additional feature extraction) on the first feature data from the second BERT model 122 to obtain additional first feature data that is more suitable for predicting the first data field type 104. Additional first feature extraction is used to extract further information on the subject matter or semantic relationship of text features associated with those text features that are adjacent or near in position in the text data set based on the first feature data in order to predict the first data field type of the text data. Thus, the MLP model 123 may provide a more optimal extraction of text feature relationships and determination of the first data field type on the basis of the second BERT model 122. Like the second BERT model 122, the additional first feature data output by the MLP model 123 also includes both textual feature relationships that can be extracted by the MLP model 123 and textual feature relationships that cannot be extracted by the MLP model 123 (the partial feature relationships can still be extracted by the CRF model below).
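A minimal sketch of the forward pass of such an MLP (input layer, hidden layers, output layer); the layer sizes, the ReLU activation and the plain-list arithmetic are illustrative choices, not details fixed by this embodiment.

```python
def mlp_forward(x, layers):
    # layers: list of (weights, biases) pairs; each weights row produces
    # one output unit. ReLU is applied on all but the final (output)
    # layer, whose raw scores are returned for classification.
    for idx, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi
             for row, bi in zip(w, b)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x
```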
In FIG. 1, the process of predicting the first data field type 104 using the second BERT model 122 in conjunction with the MLP model 123 is represented by the solid arrows between the second BERT model 122, the MLP model 123, and the CRF model 124, where the solid arrow from the MLP model 123 to the first data field type 104 represents the MLP model 123 determining a more accurate first data field type 104 for the text data. If a scheme without the MLP model 123 is employed, the first feature data output by the second BERT model 122 is directly input into the CRF model 124, and the first data field type 104 of the text data is determined directly by the second BERT model 122, as indicated by the dashed arrow.
Compared with the MLP model 123, the CRF model 124 is more suitable for extracting semantic relationships between keywords that are farther apart. The first feature data (from the second BERT model 122) or the additional first feature data (from the MLP model 123) received by the CRF model 124 includes both information embodying topics or semantic relationships expressed by text features that are closer together in the text data and/or the text data set, and information embodying topics or semantic relationships expressed by text features that are farther apart. Thus, the input to the CRF model 124 still contains virtually all the information of the sequence annotation data 103 output by the first BERT model 121 and of the text data set 101 (and the preprocessed text data) comprising a plurality of text data, except that this information has been further processed (e.g., feature extracted) by the second BERT model 122 and/or the MLP model 123.
The far or distant location of the text feature means that at least two text features (e.g., words, numbers, and symbols corresponding to the code Token) are located at a far location of the text data, or are located in different text data, and the different text data may belong to different paragraphs in the same data field type (e.g., contract information items), or respectively belong to words, phrases, sentences, and paragraphs in different data field types in the structured document, or respectively belong to different structured documents, and the like. At least two text features that are located far apart may be located in adjacent words or sentences respectively but the text features are not adjacent, or in non-adjacent phrases or sentences respectively, in the text data set, compared to adjacent or similar locations of the text features. Non-adjacent phrases or sentences may be in the same paragraph or in different paragraphs, respectively. In the case of different paragraphs, the different paragraphs may be paragraphs belonging to the same data field type, paragraphs belonging to different data field types, and even paragraphs belonging to different documents in a set of structured documents. For example, the text features are respectively located in different paragraphs of the same contract information item, or are dispersed in different words, phrases, sentences or paragraphs under different contract information items.
The data field type ultimately output by the CRF model 124 is likewise selected from the preset data field types 102, which represent data field types of text data identified and determined for subject or semantic relationships between text features that are not adjacent or not closely located, i.e., that are remotely or far away (e.g., words, numbers, and symbols corresponding to the encoding Token), referred to as the second data field type 105.
The CRF model 124 may generate a rule for extracting semantic relationships between text features with longer distance positions and determining the second data field type 105 based on the mapping relationship between the sequence annotation data 103 output from the first BERT model 121 and the second data field type 105 corresponding to the text data. Based on the rule, the CRF model 124 performs a second feature extraction on the first feature data, thereby determining a second data field type 105 of the text data. Thus, the CRF model 124 actually determines, based on the information of the sequence annotation data 103 provided by the first BERT model 121, the second data field types 105 associated with those text features that the text data in the text data set 101 has a location in the text data set that is far away.
Compared with the second BERT model 122, the CRF model 124 focuses more on semantic relationships between text data and more distant text features in the text data set 101. These may be text features in two distant words or phrases within a longer sentence, text features in different, distant words, phrases and sentences within a paragraph, or even text features in cross-sentence words, phrases and sentences of a preset data field type 102 whether or not they belong to the same paragraph, or further, text features in words, phrases, sentences and paragraphs of different preset data field types 102 within the whole structured document. Through the semantic relationships between text features that are far or even very far apart, classification prediction of distant text features can be achieved. By using the CRF model 124, the data field types extracted for semantic relationships between farther text features can be added to the data field types extracted by the second BERT model 122 (and/or the MLP model 123) for semantic relationships between closer text features, covering a wider range of the text data and of the text-feature information of the text data set and more accurately identifying and extracting the at least one data field type 104 and 105 associated with such text data. The CRF model 124 is highly robust, so its analysis and identification performance for the data field types corresponding to the text data is stable and reliable, which improves the effectiveness of the whole system.
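The core of CRF label prediction, Viterbi decoding over emission and transition scores, can be sketched as follows; the scores here are illustrative, and a trained CRF layer would learn the transition matrix from sequence-labelled training data.

```python
def viterbi_decode(emissions, transitions, labels):
    # emissions[t][j]: score of label j at step t (e.g. from BERT/MLP);
    # transitions[i][j]: learned score of moving from label i to label j.
    n = len(labels)
    score = list(emissions[0])   # best path score ending in each label
    history = []                 # back pointers, one list per step
    for emit in emissions[1:]:
        new_score, back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            back.append(best_i)
        score, history = new_score, history + [back]
    # Trace back the globally best label sequence.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for back in reversed(history):
        best = back[best]
        path.append(best)
    return [labels[i] for i in reversed(path)]
```

With zero transition scores, the decode reduces to a per-step argmax; non-zero transitions let the model penalize label sequences that violate the annotation rules (e.g., an I-tag not preceded by a B-tag).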
From the proximity of the semantic relationships of the text features that each model can extract, it can be seen that the semantic relationships determining the first data field type, i.e., those of text features located adjacently or proximately, can be extracted by the BERT model, whereas the semantic relationships determining the second data field type, i.e., those of text features located remotely, cannot be extracted by the BERT model and need to be extracted by the CRF model.
After obtaining the first data field type 104 and the second data field type 105 respectively representing the closer characteristic relationship and the farther characteristic relationship of the text data, the system integrates the text data with the same data field type in all the text data sets of the structured document, extracts and labels at least one text data corresponding to the preset data field type 102, and forms the complete document information analyzed and extracted based on the preset data field type 102, as shown by a dashed box 130 as a post-classification processing part. The text data corresponding to the data field types can be part or all of text contents in words, phrases, sentences and paragraphs in original information items (for example, rental contract information items) of the document, or part or all of text contents in words, phrases, sentences and paragraphs in different original information items scattered in the same structured document or different structured documents in a group of structured documents.
The post-classification processing section 130 of the text data or the text data set 101 based on the data field types 104 and 105 may include sequence annotation decoding (sequence labeling decode) 131 and multi-text data fusion 132. The sequence annotation decoding 131 is used to decode the correspondence between the sequence annotation data 103 and the data field types 104 and 105 into a correspondence between text data and the data field types 104 and 105 that is understandable and recognizable by the user, based on the rules of the sequence annotation (e.g., the rules of BIO annotation). The multi-text data fusion 132 is used to fuse text data belonging to the same preset data field type 102 into a complete piece of text data or a complete text data set, for example, fusing a plurality of words, phrases, sentences, paragraphs, etc. marked with the same preset data field type 102 into a complete phrase, sentence, paragraph or sequence of consecutive paragraphs. The fusion operation may include concatenation of vectors, summation of numerical values, and the like. For example, for a rental-area data field type in a rental contract, if the text data items in data field types 104 and 105 belonging to the same rental-area-related data field type are 40 square meters on the first floor and 30 square meters on the second floor, the fused prediction output of the NER model may be a rental area of 40 + 30 = 70 square meters (the fusion operation here is a numerical sum). Text data corresponding to simple sequence annotation data 103 may skip the sequence annotation decoding operation 131 and directly undergo the multi-text data fusion operation 132.
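The fusion operation 132 for text data sharing a field type can be sketched as below, reproducing the rental-area example (a numerical sum); the item representation and the field-type keys are illustrative assumptions.

```python
def fuse_text_data(items):
    # items: (field_type, value) pairs predicted for individual text
    # data. Numeric values under the same field type are summed (e.g.
    # rental areas across floors); text fragments are concatenated.
    fused = {}
    for field_type, value in items:
        if field_type not in fused:
            fused[field_type] = value
        elif isinstance(value, (int, float)):
            fused[field_type] += value          # numerical sum
        else:
            fused[field_type] = f"{fused[field_type]} {value}"
    return fused
```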
Ideally, when the preset data field type is the same as or semantically similar to each information item in the structured document, the relevant data field type extracted by the NER model and the fused text data 107 corresponding to the data field type can accurately cover the information and content of the corresponding information item in the document, and even more information and content across information items can be parsed and extracted from the document. Compared with the conventional AI model, the average recognition accuracy of the text data corresponding to the preset data field type of the structured document data extraction system according to the embodiment of the present application is higher, for example, up to 80% or more.
According to the embodiment of the application, after the NER model outputs the data extraction result, the system can also add rules to adjust the type of the preset data field and optimize the output result so as to be convenient for a user to understand and read. For example, replacing the contract terms for party a and party b in the body of the lease contract with the entity names referred to in the contract (party a of the lease being a particular property, party b of the lease being a restaurant) makes the text data or text data set under the respective data field type more readable.
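The rule-based adjustment described above, replacing contract role terms with the entity names they refer to, might look like the following sketch; the role-to-entity mapping is an illustrative assumption.

```python
def apply_readability_rules(text, entity_map):
    # Replace contract role terms (e.g. Party A / Party B) with the
    # concrete entity names resolved from the contract, making the
    # extracted text data more readable for the user.
    for role, entity in entity_map.items():
        text = text.replace(role, entity)
    return text
```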
Before the NER model is used, a transfer learning technique may be applied: for the text data types of entities such as the contract information items of a lease contract, a large text data corpus on the order of hundreds of thousands of documents is used as the training data set for targeted pre-training of the model, so that the model can subsequently be trained with a smaller target corpus and still obtain a good effect on a new text data recognition and relation extraction task.
The training data set for pre-training may employ training data that labels text data and the data field type to which the text data corresponds on the results of OCR recognition of the document. Labeled data of the entire document or a portion thereof may be used as training data. And a better model training effect can be obtained by mass document data corpora.
The NER model may be pre-trained, trained and adapted separately for at least one of the first BERT model 121, the second BERT model 122, the MLP model 123 and the CRF model 124 in the NER model, or may be pre-trained, trained and adapted as a whole.
During use of the system, service personnel can view and modify the extraction results of the document information through a background management system capable of real-time online interaction. The records of viewed and modified extraction results are collected to form calibration data 108, which guides the system models through self-learning iteration in the model updating, iteration and optimization stage (shown by dashed box 140) to optimize the model parameters, as indicated by the dashed arrows from the calibration data 108 to the models 121, 122, 123 and 124 in fig. 1. The preset data field types 102 and/or the update data of the generated model training data set can be adjusted periodically or in real time according to updated structured document data to fine-tune the model parameters, thereby improving the data analysis and extraction effect of the system models on structured text.
FIG. 2 illustrates a method of extracting information of a structured document according to an embodiment of the application. The steps of the method that are the same as or similar to the system of fig. 1 will not be described in detail.
The method of extracting data from a structured document comprises: a step S210 of obtaining a text data set of the structured document; a step S220 of determining text features (e.g., words, numbers and symbols corresponding to encoded tokens) of the text data set and the sequence annotation data corresponding to the text data, using, for example, a first BERT model; a step S230 of determining a first data field type of the text data and a step S240 of determining a second data field type of the text data, each based on the sequence annotation data corresponding to the text data; and a step S250 of extracting text data corresponding to a preset data field type from the text data set based on the first field type and the second field type. The first and second data field types are associated with text features located adjacently or proximately in the text data set and with text features located remotely, respectively.
Step S210 may include an operation of digitizing the non-electronic structured document to obtain the text data set, wherein the digitizing may include OCR scanning or the like.
In step S220, before determining the sequence annotation data, preprocessing the text data set to conform to the data input format of the BERT model may be further included. These preprocessing operations include, for example, segmentation of long text data, and the like. After extracting the text features from the text data, step S220 may further include performing a sequence labeling operation and a position embedding operation on the text features of the text data, and determining a sequence label of the text features according to positions of the text features in the text data (set) (for example, obtaining a sequence label such as a BIO label), thereby determining sequence labeling data of the text data.
Step S230 further comprises performing a first feature extraction, for example using a second BERT model, based on (text data in) the set of text data and the sequence annotation data, such as from the first BERT model, to obtain first feature data, and for those text features that are located in close or adjacent positions in the set of text data, determining a semantic relationship between the text features and determining and tagging a first data field type of the text data having the text features based on the semantic relationship. Step S230 may also further include additional feature extraction of the first feature data using, for example, the MLP model 123 to obtain further information on the subject matter or semantic relationship of text features associated with those text features that are adjacent or near in position in the text data set, thereby obtaining additional first feature data suitable for more accurately predicting the first data field type.
After step S230, step S240 further comprises performing a second feature extraction, for example using a CRF model, based on the first feature data or the additional first feature data from step S230, to determine a second data field type for text data having text features located far apart; these text features are different from the close or adjacent text features handled in step S230.
After the first and second data field types of the text data are obtained, the text data having the same data field type are further gathered to obtain the extraction result in step S250. The extraction result can be further processed: a plurality of text data corresponding to the preset data field type are determined based on the annotation rules of the sequence annotation data, and the text data are fused so as to be convenient for a user to understand and read.
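Steps S210 through S250 can be sketched end to end as follows; the three callables stand in for the first BERT model, the second BERT/MLP models, and the CRF model, and their interfaces here are assumptions made for illustration.

```python
def extract_fields(text_data_set, preset_field_types,
                   annotate, predict_first, predict_second):
    # annotate: sequence annotation (S220); predict_first: field type
    # from adjacent/proximate features (S230); predict_second: field
    # type from distant features (S240).
    results = {}
    for text in text_data_set:
        tags = annotate(text)                       # S220
        field_types = {predict_first(text, tags),   # S230
                       predict_second(text, tags)}  # S240
        for ft in field_types & set(preset_field_types):
            results.setdefault(ft, []).append(text)  # S250: keep matches
    return results
```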
The method may further comprise, before using the BERT models (in particular the first and second BERT models), the MLP model, the CRF model and the overall NER model composed of these models, steps of pre-training and/or training the model parameters with labeled training data, as well as steps of fine-tuning and updating the models using the calibration data provided by service personnel.
Further, the method may further include a step S260 of adjusting the data field type to optimize the output result using the extracted text data content corresponding to the preset data field type, as shown by the dotted box of fig. 2.
FIG. 3 illustrates an apparatus 300 for extracting data from a structured document according to an embodiment of the application.
The device 300 comprises an obtaining unit 310 for obtaining a text data set of a structured document and a preset data field type, and an extracting unit 320 for extracting data of the structured document.
The extraction unit 320 further comprises: a first BERT model 321 for determining text features of the text data of the text data set and the sequence annotation data corresponding to the text data; a second BERT model 322 for performing a first feature extraction based on the sequence annotation data to obtain first feature data and determining a first data field type of the text data; an MLP model 323 for performing additional feature extraction on the basis of the second BERT model 322 to obtain additional first feature data and determining a more accurate first data field type of the text data based on the additional first feature data; a CRF model 324 for performing a second feature extraction based on the first feature data or the additional first feature data from the second BERT model 322 and/or the MLP model 323 to determine a second data field type of the text data; and a training unit 325 for pre-training based on the labeled training data and optimizing the models 321 to 324 based on updated calibration data. When the models 321 through 324 make up an overall NER model that implements the data extraction functionality, the NER model may be pre-trained and updated using the training unit 325. The extraction unit 320 may further extract text data corresponding to a preset data field type from the text data set based on the first field type output by the second BERT model 322 or the MLP model 323 and the second field type output by the CRF model 324. According to an embodiment of the present application, the extraction unit 320 may also implement the functions of the dashed box portions 110 to 140 shown in fig. 1 and the further details of steps S220 to S260 shown in fig. 2.
The device 300 may comprise an output unit (not shown) for providing the data extraction results to the user in a visual or readable form or the like. The output unit may be a display or a touch screen on which a user interface for interacting with a user to receive user input and feedback may also be provided, in which case the output unit corresponds to an input-output unit.
By adopting the data extraction scheme of the embodiments of the present application, an improved text analysis and extraction model is introduced for the text data sets of complex structured documents with a large number of data field types. Using a transfer learning strategy, the model is specifically pre-trained for structured documents such as lease contracts with a large document-level corpus, realizing entity recognition and relation extraction from small target corpora and obtaining quick, accurate and intelligent information recognition and extraction for structured documents. The scheme of the present application can also extend the data field types and support the evaluation and calibration of service data to further optimize the performance of the system models. After valuable information is parsed and extracted, the extraction results of the text data can be optimized for the user to understand and read, providing a more user-friendly experience on top of the improved data recognition and extraction effects.
It should be noted that although several modules or units of the apparatus for extracting data from a structured document are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the present application, the features and functionality of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units. The components shown as modules or units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the present application. One of ordinary skill in the art can understand and implement this without inventive effort.
In an exemplary embodiment of the present application, there is also provided a computer-readable storage medium on which a computer program is stored, the program comprising executable instructions that, when executed by, for example, a processor, may implement the steps of the method for extracting data from a structured document described in any of the above embodiments. In some possible implementations, the various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device, when the program product is run on the terminal device, to perform the steps of the method for extracting data from a structured document according to the various exemplary embodiments of the present disclosure described in this specification.
A program product for implementing the above method according to an embodiment of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In an exemplary embodiment of the present application, there is also provided an electronic device that may include a processor, and a memory for storing executable instructions of the processor. Wherein the processor is configured to perform the steps of the method for extracting data from a structured document in any of the above embodiments via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method, or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
An electronic device 400 according to this embodiment of the present application is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one memory unit 420, a bus 430 that connects the various system components (including the memory unit 420 and the processing unit 410), a display unit 440, and the like.
The storage unit stores program code executable by the processing unit 410, causing the processing unit 410 to perform the steps of the method for extracting data from a structured document according to the various exemplary embodiments of the present disclosure described in this specification. For example, the processing unit 410 may perform the steps shown in fig. 2.
The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM)4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203.
The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 430 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices 500 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 400 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 450. Also, the electronic device 400 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 460. The network adapter 460 may communicate with other modules of the electronic device 400 via the bus 430. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a network device, etc.) to execute the method for extracting data from a structured document according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the application and include such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.

Claims (33)

1. A method for extracting data from a structured document, comprising:
acquiring a text data set of the structured document, wherein the text data set comprises a plurality of text data;
determining sequence labeling data of the text data;
determining a first data field type of the text data based on the sequence annotation data, wherein the first data field type is associated with a text feature of the text data that is adjacent or proximate in position in the set of text data;
determining a second data field type for the text data, wherein the second data field type is associated with a remotely located text feature of the text data in the set of text data; and
extracting text data corresponding to a preset data field type from the text data set based on the first field type and the second field type.
2. The method of claim 1, wherein the adjacent or proximate positions comprise at least one of the following positional relationships:
at least two text features are located within the same phrase or sentence;
at least two text features are respectively located in different phrases or sentences in adjacent phrases or sentences, and the at least two text features are adjacent.
3. The method of claim 1, wherein the remote location comprises at least one of the following positional relationships:
at least two text features are respectively positioned in different phrases or sentences in adjacent phrases or sentences, and the at least two text features are not adjacent;
at least two text features are respectively located in different phrases or sentences in non-adjacent phrases or sentences.
4. The method of claim 1, wherein obtaining the text data set of the structured document further comprises:
and carrying out digital processing on the structured document to obtain the text data set.
5. The method of claim 4, wherein the digital processing comprises OCR.
6. The method according to any of claims 1-5, further comprising pre-processing the set of text data, the pre-processing comprising at least one of:
segmenting long text data in the text data set;
the character type text data is converted into encoding type text data.
7. The method of any of claims 1-5, wherein determining sequence annotation data for the text data further comprises:
extracting the text features of the text data;
determining a sequence label of the text feature in the corresponding text data.
8. The method of claim 7, further comprising determining the sequence tag based on a position of the textual feature in the set of textual data.
9. The method of claim 7, wherein the sequence tag is a BIO tag.
10. The method of claim 7, wherein determining a first data field type of the text data based on the sequence annotation data further comprises:
performing a first feature extraction on the text data based on the sequence annotation data to obtain first feature data and determining the first data field type of the text data.
11. The method of claim 10, wherein performing a first feature extraction on the text data based on the sequence annotation data to obtain first feature data and determining the first data field type of the text data further comprises:
performing additional feature extraction based on the first feature data to obtain additional first feature data and determining the first data field type of the text data.
12. The method of claim 10, wherein determining the second data field type of the text data further comprises:
performing a second feature extraction based on the first feature data to determine the second data field type of the text data.
13. The method of claim 11, wherein determining the second data field type of the text data further comprises:
performing a second feature extraction based on the additional first feature data to determine the second data field type of the text data.
14. The method of claim 12 or 13, wherein the remotely located text features belong to different text data.
15. The method of claim 1, wherein extracting text data corresponding to a preset data field type from the set of text data based on the first field type and the second field type further comprises:
determining text data having a same data field type based on the first field type and the second field type;
determining a plurality of text data corresponding to preset data field types based on the labeling rule of the sequence labeling data;
and fusing the plurality of text data.
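The fusion step of claim 15 can be sketched with a hedged illustration of the BIO labeling rule: adjacent tokens labeled as the beginning ("B-") and inside ("I-") of the same field type are merged into one extracted value. The function name, sample tokens, and labels below are illustrative assumptions, not the claimed implementation:

```python
def fuse_bio_spans(tokens, labels):
    """Merge BIO-labeled tokens into (field_type, text) spans:
    'B-X' opens a span, 'I-X' of the same type extends it, anything
    else closes the current span."""
    spans, current_type, current_toks = [], None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current_toks:                       # close the previous span
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = lab[2:], [tok]
        elif lab.startswith("I-") and current_type == lab[2:]:
            current_toks.append(tok)               # extend the open span
        else:
            if current_toks:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = None, []
    if current_toks:                               # flush a trailing span
        spans.append((current_type, " ".join(current_toks)))
    return spans

tokens = ["monthly", "rent", ":", "RMB", "12,000", "per", "month"]
labels = ["O", "O", "O", "B-AMOUNT", "I-AMOUNT", "O", "O"]
print(fuse_bio_spans(tokens, labels))  # → [('AMOUNT', 'RMB 12,000')]
```

Here the two fragments "RMB" and "12,000", which share the AMOUNT field type, are fused into a single extracted value according to the labeling rule of the sequence annotation data.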
16. The method of claim 1, further comprising adjusting a preset data field type based on text data corresponding to the preset data field type.
17. The method of any one of claims 1 to 16, wherein the method is implemented using a machine learning model or a neural network model.
18. The method of claim 17, wherein the neural network model is a Named Entity Recognition (NER) model.
19. The method of claim 18, wherein the named entity recognition model comprises a first BERT model, a second BERT model, and a Conditional Random Field (CRF) model, wherein,
the first BERT model is configured to determine sequence annotation data for the text data;
the second BERT model is configured to perform a first feature extraction based on the sequence annotation data to obtain first feature data and determine the first data field type of the text data;
the conditional random field model is configured to perform the second feature extraction and determine the second data field type.
20. The method of claim 19, wherein the named entity recognition model further comprises a multi-layered perceptron (MLP) model configured for additional feature extraction based on the first feature data and determining the first data field type.
21. The method of claim 20, wherein the conditional random field model is further configured to perform the second feature extraction based on the additional first feature data to determine the second data field type.
22. The method of claim 19, wherein at least one of the named entity recognition model, the first BERT model, the second BERT model, and the conditional random field model is pre-trained and/or trained using a training data set.
23. The method according to claim 20, wherein the multi-layered perceptron model is pre-trained and/or trained using a training data set.
24. The method of claim 17, wherein the machine learning model or neural network model is updated based on calibration data for the first field type and the second field type of the text data.
25. The method of any of claims 1-16, wherein the structured document comprises a rental contract, and wherein the data field type comprises a contract field type.
26. The method of any of claims 1-16, wherein the structured document is a combination comprising a plurality of structured documents.
27. An apparatus for extracting data from a structured document, comprising:
an acquisition unit configured to acquire a text data set of the structured document, wherein the text data set includes a plurality of text data;
an extraction unit configured to determine sequence annotation data of the text data; determining a first data field type of the text data based on the sequence annotation data, wherein the first data field type is associated with a text feature of the text data that is adjacent or proximate in position in the set of text data; determining a second data field type for the text data, wherein the second data field type is associated with a remotely located text feature of the text data in the set of text data; and extracting text data corresponding to a preset data field type from the text data set based on the first field type and the second field type.
28. The apparatus of claim 27, wherein the adjacent or proximate positions comprise at least one of the following positional relationships:
at least two text features are located within the same phrase or sentence;
at least two text features are respectively located in different phrases or sentences in adjacent phrases or sentences, and the at least two text features are adjacent.
29. The apparatus of claim 27, wherein the remote location comprises at least one of the following positional relationships:
at least two text features are respectively positioned in different phrases or sentences in adjacent phrases or sentences, and the at least two text features are not adjacent;
at least two text features are respectively located in different phrases or sentences in non-adjacent phrases or sentences.
30. The apparatus according to any of the claims 27 to 29, wherein said extraction unit comprises a named entity recognition model comprising at least one of a first BERT model, a second BERT model, a multilayer perceptron (MLP) model and a Conditional Random Field (CRF) model.
31. The apparatus of claim 30, further comprising a training unit configured to pre-train and/or train at least one of the named entity recognition model, the first BERT model, the second BERT model, the multi-layer perceptron model, and the conditional random field model using a training data set.
32. A computer-readable storage medium, having stored thereon a computer program comprising executable instructions that, when executed by a processor, carry out the method according to any one of claims 1 to 26.
33. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the executable instructions to implement the method of any of claims 1 to 26.
CN202111658801.3A 2021-12-31 2021-12-31 Method and apparatus for extracting data from structured documents Pending CN114356924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111658801.3A CN114356924A (en) 2021-12-31 2021-12-31 Method and apparatus for extracting data from structured documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111658801.3A CN114356924A (en) 2021-12-31 2021-12-31 Method and apparatus for extracting data from structured documents

Publications (1)

Publication Number Publication Date
CN114356924A true CN114356924A (en) 2022-04-15

Family

ID=81104805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111658801.3A Pending CN114356924A (en) 2021-12-31 2021-12-31 Method and apparatus for extracting data from structured documents

Country Status (1)

Country Link
CN (1) CN114356924A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115145928A (en) * 2022-08-01 2022-10-04 支付宝(杭州)信息技术有限公司 Model training method and device and structured abstract acquisition method and device
CN115600155A (en) * 2022-11-09 2023-01-13 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment
CN115600155B (en) * 2022-11-09 2023-05-12 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220415

Assignee: Baisheng Consultation (Shanghai) Co.,Ltd.

Assignor: Shengdoushi (Shanghai) Technology Development Co.,Ltd.

Contract record no.: X2023310000138

Denomination of invention: Method and device for extracting data from structured documents

License type: Common License

Record date: 20230714
