CN114090776A - Document analysis method, system and device - Google Patents

Info

Publication number
CN114090776A
Authority
CN
China
Prior art keywords
text
information
document
category
key information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111424953.7A
Other languages
Chinese (zh)
Inventor
毛璐
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202111424953.7A priority Critical patent/CN114090776A/en
Publication of CN114090776A publication Critical patent/CN114090776A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a document analysis method, system and device. The document analysis method comprises: acquiring a target document and extracting at least one piece of text information contained in the target document; performing text classification on the at least one piece of text information to obtain a text category of each piece of text information; and performing information extraction on the at least one piece of text information according to the text category to obtain key information contained in the target document. This improves the accuracy of extracting the key information from the target document.

Description

Document analysis method, system and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method, a system, and an apparatus for document parsing, a computing device, and a computer-readable storage medium.
Background
With the development of information technology, document types have become increasingly diverse, and documents of different types contain different kinds of information (such as text or images). In addition, information may be stored in these documents in various ways, for example in tables or directly as plain text. Existing document parsing methods are mostly based on parsing plain text; for image-type documents such as PDF, or documents containing special elements such as tables, the key information they parse out may be incomplete and inaccurate.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, a system, and an apparatus for document parsing, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
According to a first aspect of embodiments of the present application, a document parsing method is provided, including:
acquiring a target document, and extracting at least one piece of text information contained in the target document;
performing text classification on the at least one piece of text information to obtain a text category of each piece of text information;
and performing information extraction on the at least one piece of text information according to the text category to obtain key information contained in the target document.
Optionally, the extracting at least one piece of text information contained in the target document includes:
determining a document type of the target document;
and extracting information of the target document according to the document type to obtain at least one piece of text information.
Optionally, the extracting information of the target document according to the document type to obtain at least one piece of text information includes:
under the condition that the document type is a text type, performing text extraction on the target document to obtain a target text;
performing sentence division processing on the target text to generate a sentence sequence;
sequentially inputting each text sentence in the sentence sequence into a sentence feature extraction model, and obtaining the sentence feature vector corresponding to each text sentence output by the sentence feature extraction model;
inputting the sentence feature vector corresponding to each text sentence into a feature classification model, and obtaining the sentence category corresponding to each text sentence output by the feature classification model;
and carrying out block processing on the target text according to the sentence type corresponding to each text sentence to obtain at least one text message.
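The block processing step above can be pictured with a minimal sketch. It assumes, as one plausible reading, that consecutive sentences sharing a sentence category are merged into one piece of text information; the function name `group_into_blocks` and the category labels are hypothetical, and the sentence categories are taken as given rather than produced by a real feature classification model.

```python
from itertools import groupby
from typing import List, Tuple


def group_into_blocks(sentences: List[str], categories: List[str]) -> List[Tuple[str, str]]:
    """Merge runs of consecutive sentences that share a category into one block.

    Assumption: a block is a maximal run of adjacent sentences with the same
    sentence category. The source does not fix the exact blocking rule.
    """
    if len(sentences) != len(categories):
        raise ValueError("each sentence needs exactly one category")
    blocks = []
    pairs = zip(categories, sentences)
    # groupby only groups adjacent items, which is exactly the run behavior wanted here.
    for category, run in groupby(pairs, key=lambda pair: pair[0]):
        text = " ".join(sentence for _, sentence in run)
        blocks.append((category, text))
    return blocks
```

For instance, three sentences labeled `edu`, `edu`, `work` would yield two blocks, one per run of identical labels.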
Optionally, the text classifying the at least one text message to obtain a text category of each text message includes:
and determining a text category corresponding to each text message according to the sentence feature vector and the at least one text message.
Optionally, the extracting information of the target document according to the document type to obtain at least one piece of text information includes:
determining at least one text region in the target document through a target detection model under the condition that the document type is an image type;
and identifying text content in the at least one text region, and determining text information in the at least one text region.
Optionally, performing text classification on the at least one piece of text information to obtain a text category of each piece of text information includes:
inputting the text information in each text region into a text classification model, and determining a plurality of category confidences for each piece of text information, wherein each category confidence represents the probability that the text information belongs to a reference category;
and determining a text category for each piece of text information based on its plurality of category confidences.
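As an illustration of the confidence-based decision, the sketch below simply picks the reference category with the highest confidence. This argmax rule is an assumption; the text only says the category is determined "based on" the confidences. The name `pick_text_category` and the example categories are hypothetical.

```python
from typing import Dict


def pick_text_category(confidences: Dict[str, float]) -> str:
    """Return the reference category with the highest confidence.

    Assumption: the highest-probability category wins; thresholds or
    tie-breaking policies are not modeled here.
    """
    if not confidences:
        raise ValueError("no category confidences supplied")
    return max(confidences, key=confidences.get)
```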
Optionally, in a case that any piece of text information is single-element information, information extraction on that piece of text information is implemented by:
inputting text information to be processed into a first keyword recognition model corresponding to a first text category of the text information to be processed, and obtaining first key information contained in the text information to be processed; or,
performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; or,
inputting the text information to be processed into a first keyword recognition model corresponding to the first text category which is trained in advance, and acquiring first key information contained in the text information to be processed; performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; and determining key information contained in the text information to be processed based on the first key information and the second key information.
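The third option, combining model output with rule output, might look like the following sketch. The keyword recognition model is represented by an arbitrary callable, the preset rule by regular expressions, and the merge policy (rule results override model results on conflict) is purely an assumption, since the text does not say how the two results are reconciled. All names are hypothetical.

```python
import re
from typing import Callable, Dict

KeyInfo = Dict[str, str]


def extract_by_rule(text: str, patterns: Dict[str, str]) -> KeyInfo:
    """Apply one regex per field; each pattern must contain one capture group."""
    found = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            found[field] = match.group(1)
    return found


def merge_key_info(model_info: KeyInfo, rule_info: KeyInfo) -> KeyInfo:
    """Combine both results; on conflict the rule result wins (assumption)."""
    merged = dict(model_info)
    merged.update(rule_info)
    return merged


def extract_single_element(text: str, model: Callable[[str], KeyInfo],
                           patterns: Dict[str, str]) -> KeyInfo:
    """Run the (stand-in) model and the rules on the same text, then merge."""
    return merge_key_info(model(text), extract_by_rule(text, patterns))
```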
Optionally, in a case that any piece of text information is multi-element information, information extraction on that piece of text information is implemented by:
splitting text information to be processed into at least one piece of element information;
inputting each piece of element information into a second keyword recognition model corresponding to a second text category of the text information to be processed, and obtaining third key information contained in each piece of element information; combining the third key information contained in each piece of element information to obtain first combined key information contained in the text information to be processed; or,
performing key information identification on the at least one piece of element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each piece of element information; combining the fourth key information contained in each piece of element information to obtain second combined key information contained in the text information to be processed; or,
inputting each piece of element information into a pre-trained second keyword recognition model corresponding to the second text category to obtain third key information contained in each piece of element information; combining the third key information contained in each piece of element information to obtain first combined key information contained in the text information to be processed; performing key information identification on the at least one piece of element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each piece of element information; combining the fourth key information contained in each piece of element information to obtain second combined key information contained in the text information to be processed; and determining key information contained in the text information to be processed based on the first combined key information and the second combined key information.
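The split-extract-combine flow for multi-element information can be sketched as follows. It assumes each element occupies one non-empty line and that "combining" means collecting the per-element results into an ordered list; both are illustrative choices, and `split_elements` and `extract_multi_element` are hypothetical names. The per-element extractor stands in for either the keyword recognition model or the preset rule.

```python
from typing import Callable, Dict, List


def split_elements(text: str) -> List[str]:
    """Split multi-element text into elements (assumption: one element per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def extract_multi_element(text: str,
                          extractor: Callable[[str], Dict[str, str]]) -> List[Dict[str, str]]:
    """Extract key information per element, then combine in document order."""
    combined = []
    for element in split_elements(text):
        info = extractor(element)
        if info:  # skip elements where nothing was recognized
            combined.append(info)
    return combined
```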
Optionally, before extracting information from the at least one piece of text information according to the text category and obtaining key information included in the target document, the method further includes:
determining a document type of the target document, and determining a sentence unit contained in text information of at least one text area in the target document under the condition that the document type is an image type;
inputting adjacent sentence units into a first classification model for classification processing to obtain a first classification result;
under the condition that the first classification result is the same sentence, splicing adjacent sentence units of which the first classification result is the same sentence to obtain spliced text information corresponding to the at least one text region;
correspondingly, the extracting information of the at least one text message according to the text category to obtain the key information contained in the target document includes:
and extracting information of at least one spliced text message according to the text category to obtain key information contained in the target document.
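A minimal sketch of the splicing step: adjacent sentence units are passed pairwise to a classifier, and units judged to belong to the same sentence are concatenated. The first classification model is replaced here by any callable returning a boolean; `splice_units` is a hypothetical name.

```python
from typing import Callable, List, Optional


def splice_units(units: List[str],
                 same_sentence: Callable[[str, str], bool]) -> List[str]:
    """Concatenate adjacent units that the classifier marks as one sentence.

    `same_sentence(previous_unit, current_unit)` stands in for the first
    classification model described above.
    """
    spliced: List[str] = []
    previous: Optional[str] = None
    for unit in units:
        if previous is not None and same_sentence(previous, unit):
            spliced[-1] += unit  # continue the sentence started earlier
        else:
            spliced.append(unit)  # start a new sentence
        previous = unit
    return spliced
```

A toy stand-in for the classifier could treat a unit that does not end in terminal punctuation as unfinished, which is useful for trying the function out on OCR line fragments.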
Optionally, after the information extraction is performed on the at least one piece of text information according to the text category and key information included in the target document is obtained, the method further includes:
determining a third text category corresponding to a first preset field in the text categories;
determining first target text information corresponding to the third text category in the at least one text information;
judging whether first field information corresponding to a first preset field exists in key information contained in the first target text information or not;
if not, determining a fourth text type associated with the first preset field in the text types;
determining the first field information corresponding to the first preset field in the second target text information corresponding to the fourth text type according to the extraction rule corresponding to the first preset field;
and updating the key information according to the first preset field and the first field information.
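The fallback lookup described above can be sketched like this. Key information is modeled as a mapping from text category to extracted fields, the extraction rule for the preset field as a regular expression with one capture group, and the associated categories as an explicit list; all of these representations, and the name `fill_missing_field`, are assumptions made for illustration.

```python
import re
from typing import Dict, List


def fill_missing_field(key_info: Dict[str, Dict[str, str]],
                       texts: Dict[str, str],
                       field: str,
                       primary: str,
                       associated: List[str],
                       pattern: str) -> Dict[str, Dict[str, str]]:
    """If `field` is missing under its primary category, search associated categories.

    Updates `key_info` in place and also returns it.
    """
    if field in key_info.get(primary, {}):
        return key_info  # field already extracted, nothing to do
    for category in associated:
        match = re.search(pattern, texts.get(category, ""))
        if match:
            key_info.setdefault(primary, {})[field] = match.group(1)
            break
    return key_info
```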
Optionally, after the information extraction is performed on the at least one piece of text information according to the text category and key information included in the target document is obtained, the method further includes:
determining a document type of the target document, and determining a key sentence unit contained in the key information under the condition that the document type is an image type;
inputting adjacent key sentence units into a second classification model to perform first classification processing to obtain a second classification result;
under the condition that the second classification result is the same statement, splicing adjacent key statement units of which the second classification result is the same statement to obtain spliced key information;
correspondingly, the determining whether the key information included in the first target text information includes first field information corresponding to a first preset field includes:
and judging whether first field information corresponding to a first preset field exists in the spliced key information contained in the first target text information.
Optionally, after the information extraction is performed on the at least one piece of text information according to the text category and key information included in the target document is obtained, the method further includes:
performing keyword matching in the at least one text message according to a second preset field to obtain second field information corresponding to the second preset field;
and updating the key information according to the second preset field and the second field information.
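Keyword matching for a second preset field might be approximated as below. The sketch assumes the field is associated with a fixed keyword list and uses case-insensitive substring matching, which is a simplification; the function name and the example field are hypothetical.

```python
from typing import Dict, List


def match_keywords(texts: List[str], keywords: List[str]) -> List[str]:
    """Return the keywords that occur in any piece of text information.

    Assumption: plain case-insensitive substring matching, preserving the
    order of the keyword list and skipping duplicates.
    """
    lowered = [text.lower() for text in texts]
    hits: List[str] = []
    for keyword in keywords:
        if keyword in hits:
            continue
        if any(keyword.lower() in text for text in lowered):
            hits.append(keyword)
    return hits


def update_key_info(key_info: Dict[str, List[str]], field: str,
                    texts: List[str], keywords: List[str]) -> Dict[str, List[str]]:
    """Store the matched keywords under the preset field name."""
    key_info[field] = match_keywords(texts, keywords)
    return key_info
```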
According to a second aspect of embodiments of the present application, there is provided a document parsing system, including:
a client and a document analysis end;
the client is configured to send an analysis request carrying a target document to the document analysis end;
the document analysis end is configured to receive the analysis request carrying the target document and extract at least one piece of text information contained in the target document; performing text classification on the at least one text message to obtain a text category of each text message; and extracting information from the at least one text message according to the text category to obtain key information contained in the target document, and returning the key information to the client as a response of the analysis request.
According to a third aspect of the embodiments of the present application, there is provided a document parsing apparatus, including:
the extraction module is configured to acquire a target document and extract at least one piece of text information contained in the target document;
the classification module is configured to perform text classification on the at least one piece of text information to obtain a text category of each piece of text information;
and the extraction module is configured to extract information of the at least one text message according to the text category to obtain key information contained in the target document.
According to a fourth aspect of embodiments herein, there is provided a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, the processor implementing the steps of the document parsing method when executing the computer instructions.
According to a fifth aspect of embodiments herein, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the document parsing method.
According to a sixth aspect of embodiments of the present application, there is provided a chip storing computer instructions which, when executed by the chip, implement the steps of the document parsing method.
In the embodiments of the present application, a target document is acquired and at least one piece of text information contained in it is extracted; text classification is performed on the at least one piece of text information to obtain the text category of each piece; and information extraction is performed on the at least one piece of text information according to the text category to obtain the key information contained in the target document. Because the target document is first divided into pieces of text information and key information is then extracted according to the text category of each piece, the accuracy of key information extraction is improved.
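The three-stage flow summarized above (extract text information, classify it, extract key information per category) can be expressed as a small skeleton. Every stage is injected as a callable because the concrete models are not fixed here; `parse_document` and the per-category extractor table are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

KeyInfo = Dict[str, Dict[str, str]]


def parse_document(document: str,
                   extract_text: Callable[[str], List[str]],
                   classify: Callable[[str], str],
                   extractors: Dict[str, Callable[[str], Dict[str, str]]]) -> KeyInfo:
    """Extract text pieces, classify each, and run the extractor for its category."""
    key_info: KeyInfo = {}
    for text in extract_text(document):
        category = classify(text)
        extractor = extractors.get(category)
        if extractor is None:
            continue  # no extraction strategy registered for this category
        key_info.setdefault(category, {}).update(extractor(text))
    return key_info
```

Selecting the extractor by category is the point of the design: each category can use a model, a rule set, or both, without the caller needing to know which.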
Drawings
FIG. 1 is a block diagram of a computing device provided by an embodiment of the present application;
FIG. 2 is a flowchart of a document parsing method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a document parsing method applied in a resume document scene according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a document parsing method applied to a resume document scenario according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a document parsing system according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a document parsing apparatus according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments of the application is for the purpose of describing the particular embodiments only and is not intended to be limiting of the application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application is intended to encompass any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application. The word "if," as used herein, may be interpreted as "responsive to a determination," depending on the context.
First, the terms involved in one or more embodiments of the present application are explained.
TIFF: (Tag Image File Format), a flexible bitmap format, mainly used to store images such as photographs and artwork.
LibreOffice: an office suite with strong data import and export functions; it can directly import PDF documents and Microsoft Word (.doc) files, and supports the main OpenXML formats.
Sequence labeling model: simply put, given a sequence, the model assigns a label to each element in the sequence. Basic NLP tasks such as named entity recognition, Chinese word segmentation and part-of-speech tagging are sequence labeling tasks.
BERT: (Bidirectional Encoder Representations from Transformers), an open-source pre-trained language model.
NER: (Named Entity Recognition), the recognition of entities with specific meaning in text, mainly including names of people, places and organizations, proper nouns, etc.
CRF: (Conditional Random Field), an undirected graphical model that combines characteristics of the maximum entropy model and the hidden Markov model, and can be used in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition.
CNN: (Convolutional Neural Network), used to extract features from the input and then classify, recognize or predict based on the extracted features.
RNN: (Recurrent Neural Network), a type of neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and connects all of its nodes (recurrent units) in a chain.
OCR: (Optical Character Recognition), the process of analyzing and recognizing a file in image form to acquire character and layout information.
Keyword recognition model: a model for identifying key information in text; its input is text and its output is the key information identified in that text.
SVM: (Support Vector Machine), a pattern recognition method based on statistical learning theory, mainly applied in the field of pattern recognition.
LSTM: (Long Short-Term Memory network), a recurrent neural network specially designed to solve the long-term dependency problem of ordinary RNNs; like all RNNs, it has the form of a chain of repeating neural network modules.
BiLSTM: (Bidirectional Long Short-Term Memory network).
YOLOv4: a target detection algorithm obtained by applying recent optimization strategies from the CNN field to the original YOLO target detection architecture, with improvements to data processing, the backbone network, network training, the activation function, the loss function and other aspects.
Fast R-CNN: (Fast Region-based Convolutional Neural Network), a region-based target detection algorithm.
RPN: (Region Proposal Network), which can be used to determine the regions in an image where objects exist.
CRNN: (Convolutional Recurrent Neural Network), which can be used for character recognition.
In the present application, a document parsing method, a system and an apparatus, a computing device and a computer readable storage medium are provided, which are described in detail in the following embodiments one by one.
Fig. 1 illustrates a block diagram of a computing device 100 provided according to an embodiment of the present application. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, which enables computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
In one embodiment of the present application, the above-mentioned components of the computing device 100 and other components not shown in fig. 1 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the document parsing method shown in FIG. 2. FIG. 2 shows a flowchart of a document parsing method provided in an embodiment of the present application, which specifically includes the following steps:
Step 202: acquire a target document, and extract at least one piece of text information contained in the target document.
The target document is the document to be parsed. It may be a news document, a paper document, a report document, a resume document, etc., which is not limited herein. In practical applications, the target document contains rich information, but a user or enterprise usually focuses on only some key information, and the key information of interest varies across documents, users and enterprises. If a user or enterprise has a high reading demand for a certain kind of document (such as news documents or resume documents), documents of that kind may be obtained from a document library, website or other source that stores them, and at least one of them may be used as the target document. A target document uploaded by a client may also be obtained. Key information is then extracted according to the content storage rule of the target document, which improves the efficiency of reading it.
Specifically, the target document may include different kinds of information, such as text information, picture information, table information, and the like. These pieces of information may all contain text information, and therefore, in order to better parse the target document, the text information needs to be extracted from the target document.
In practical application, in consideration of different storage modes of text information in different types of target documents, in order to accurately identify the text information contained in the different types of target documents, a document type of the target document may be determined first, and then the text information is extracted by adopting a corresponding extraction mode according to the document type, which is specifically implemented by the following modes:
determining a document type of a target document;
and extracting information of the target document according to the document type to obtain at least one piece of text information.
Specifically, the document type of the target document may be a text type or an image type. The text type includes a plurality of text subtypes, such as a TXT subtype, a DOC subtype, a DOCX subtype, a WPS subtype, an EXCEL subtype, and the like, which are not limited herein; the image type may also include a plurality of image subtypes, such as a PDF subtype, a PNG subtype, a JPG subtype, a JPEG subtype, a TIFF subtype, and the like, which are not limited herein, and in specific implementation, the subtype of the target document may be determined according to a document suffix of the target document, and then the document type to which the subtype belongs may be determined, for example, the document suffix of the target document is.
Further, since each document type includes a plurality of document sub-types, the target documents of some document sub-types may not be able to be extracted or the result of information extraction may be inaccurate. In addition, there may be some target documents of document sub-types that need to consume a large amount of computing resources when performing information extraction. Therefore, in order to improve the efficiency and accuracy of information extraction, the target documents of the document sub-types can be converted into other document sub-types capable of efficiently extracting information, and then information extraction processing is performed. Specifically, a document subtype which is high in information extraction efficiency and easy to convert can be selected from document subtypes contained in each document type, and a document subtype which is difficult to extract information is converted into the document subtype.
In particular implementations, the manner in which information is stored in documents of different document types may vary. And the extraction modes required by different storage modes are different. In order to accurately extract text information from different types of documents, a corresponding extraction mode can be determined according to the document type, and then information extraction is performed on a target document through the determined extraction mode. For example, the information in the target document of text type is stored in text mode, so that the text content in the target document can be directly extracted. And the information in the target document of the image type is stored in an image mode, and text content cannot be directly extracted, so that OCR (optical character recognition) is required to be performed on the target document, and then information extraction is performed according to a recognition result.
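The type-then-mode decision can be sketched as a simple lookup on the document suffix. The grouping of subtypes follows the text above (PDF is treated as an image type, as the source states); the concrete suffix strings for the EXCEL and TIFF subtypes, the function names, and the mode labels are assumptions for illustration.

```python
from pathlib import Path

# Suffix-to-type tables; the exact suffix spellings are assumptions.
TEXT_SUFFIXES = {".txt", ".doc", ".docx", ".wps", ".xls", ".xlsx"}
IMAGE_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def document_type(filename: str) -> str:
    """Map a file name to 'text' or 'image' based on its suffix."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return "text"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    raise ValueError(f"unsupported document suffix: {suffix!r}")


def extraction_mode(filename: str) -> str:
    """Choose the extraction mode: direct text extraction, or OCR for images."""
    modes = {"text": "direct_text_extraction", "image": "ocr"}
    return modes[document_type(filename)]
```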
The at least one piece of text information is obtained by dividing the content extracted from the target document by category. For example, if the document content of the target document includes text belonging to 3 categories, extracting information from the target document yields 3 pieces of text information.
It is contemplated that different categories of information in the target document may be stored in multiple paragraphs in the case where the document type is a text type, or multiple categories of information may be stored in a single paragraph. In order to accurately obtain text information, in the embodiment of the present application, information extraction may be performed on a target document of a text type through a sequence tagging model, which is specifically implemented through the following steps 202-1 to 202-5:
Step 202-1: in the case that the document type is a text type, perform text extraction on the target document to obtain a target text.
In specific implementation, text extraction of the target document refers to extracting the character information (i.e., the target text) contained in the target document. Specifically, the character information includes text characters, punctuation marks, and the like.
For example, a target document TD is acquired and determined to be of the text type according to its document suffix .doc. The target document is converted into a document with the suffix .docx using the LibreOffice tool, and text extraction is then performed on the target document TD to obtain the target text TT contained in it.
Step 202-2: Perform sentence division processing on the target text to generate a sentence sequence.
On the basis of extracting the target text, sentence division processing can be performed on the target text according to preset sentence division identifiers (such as periods, semicolons, line feed characters or other special characters).
Specifically, when sentence division processing is performed, each character in the target text may be recognized in sequence, and each time a sentence division identifier is recognized, one sentence is divided off. After all the characters in the target text have been recognized, a plurality of sentences corresponding to the target text are obtained. The sentences are then arranged according to their positions in the target text to form a sentence sequence.
For example, the target text TT is subjected to sentence division processing according to the preset sentence division identifiers to obtain m sentences, which are, in sequence, S_1, S_2, S_3, ..., S_m. The m sentences form the sentence sequence SS = (S_1, S_2, S_3, ..., S_m).
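The character-by-character sentence division described above can be sketched as follows. The concrete identifier set SENTENCE_DELIMITERS and the function name split_sentences are assumptions made for illustration; the preset identifiers are left open in this description.

```python
# Hypothetical set of preset sentence division identifiers.
SENTENCE_DELIMITERS = frozenset({"。", ".", ";", "；", "!", "?", "\n"})


def split_sentences(target_text, delimiters=SENTENCE_DELIMITERS):
    """Scan the text one character at a time and cut after each identifier."""
    sentences, current = [], []
    for ch in target_text:
        current.append(ch)
        if ch in delimiters:
            sentence = "".join(current).strip()
            if sentence:  # skip empty pieces, e.g. a bare line feed
                sentences.append(sentence)
            current = []
    tail = "".join(current).strip()
    if tail:  # text after the last identifier still forms a sentence
        sentences.append(tail)
    return sentences  # order follows the position in the target text
```

Note that treating "." as an identifier would also split decimal numbers; a practical identifier set would need to account for such cases.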
Step 202-3: Sequentially input each text sentence in the sentence sequence into the sentence feature extraction model to obtain the sentence feature vector corresponding to each text sentence output by the sentence feature extraction model.
On the basis of obtaining the sentence sequence, each text sentence in the sentence sequence is input into the sentence feature extraction model. The feature extraction model is a neural network model for extracting feature information of each text sentence, and may be a BERT model, an ERNIE model, a RoBERTa model, or the like. Preferably, the feature extraction model is a BERT model. In the embodiment of the present application, a BERT model that includes an embedding layer and an encoder, where the encoder includes 12 sequentially connected encoding layers, is taken as an example for explanation.
In the BERT model, a [CLS] token is usually added before each sentence input to BERT, where [CLS] is the start symbol of the sentence. The sentence carrying [CLS] is processed by the multiple encoding layers to obtain a sentence vector carrying [CLS]. After each text sentence in the sentence sequence is processed by the feature extraction model, the sentence feature vector corresponding to each text sentence is obtained, that is, the vector corresponding to the [CLS] token of each text sentence, which is referred to as the CLS vector (sentence feature vector).
For example, each text sentence in the sentence sequence SS = (S_1, S_2, S_3, ..., S_m) is sequentially input into the sentence feature extraction model to obtain the corresponding sentence feature vector, so that m sentence feature vectors are obtained in total, which are, in sequence, V_1, V_2, V_3, ..., V_m.
Step 202-4: Input the sentence feature vector corresponding to each text sentence into the feature classification model to obtain the sentence category corresponding to each text sentence output by the feature classification model.
On the basis of obtaining the sentence feature vector corresponding to each text sentence, the sentence feature vectors may be input into a feature classification model, which is used to determine the category (i.e., sentence category) of each text sentence according to its feature vector. Specifically, the feature classification model may be a model based on a recurrent neural network, such as an RNN network, an LSTM network, or a BiLSTM network. Such a model can take into account the context between the feature vectors of the sentences and thus better understand the semantics of each text sentence.
In specific implementation, the feature classification model may use binary output in the output layer, that is, it determines from the CLS vector of each text sentence whether the sentence belongs to a first category or a second category. Specifically, the first category is a starting sentence, which can be represented by 1 and indicates that the sentence is the first sentence in a text block; the second category is an intermediate sentence, which can be represented by 0 and indicates that the sentence is a sentence other than the first sentence in a text block.
For example, the m sentence feature vectors V_1, V_2, V_3, ..., V_m are sequentially input into the feature classification model, and the sentence categories corresponding to the sentence feature vectors are obtained as 1_1, 0_2, 0_3, ..., 0_m, where the subscript denotes the position of the sentence in the sentence sequence.
Step 202-5: Perform block processing on the target text according to the sentence category corresponding to each text sentence to obtain at least one piece of text information.
The target text is divided into blocks according to the first category or the second category corresponding to each text sentence. Each sentence of the first category in the sentence sequence is used as the first sentence of a block (i.e., a piece of text information), and the sentences after this first sentence and before the next first sentence are used as the subsequent sentences of the block. In other words, the sentences of the second category located between two sentences of the first category belong to the text information that starts with the earlier of the two sentences of the first category. The target text is thereby divided into at least one piece of text information.
For example, the target text TT is divided into blocks according to the sentence categories 1_1, 0_2, 0_3, ..., 0_m corresponding to the m sentence feature vectors. Assume that 3 of the m sentence categories are 1, namely 1_1, 1_8 and 1_22. The target text TT is then divided into 3 blocks. The sentence S_1 corresponding to 1_1 is the first sentence of the first text information, and S_2 to S_7 are its subsequent sentences. The sentence S_8 corresponding to 1_8 is the first sentence of the second text information, and S_9 to S_21 are its subsequent sentences. The sentence S_22 corresponding to 1_22 is the first sentence of the third text information, and S_23 to S_m are its subsequent sentences.
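The block processing of step 202-5 can be sketched as follows, given the sentence sequence and the 0/1 sentence categories. The function name group_into_blocks is hypothetical, and opening a block when the very first sentence is labeled 0 is an assumption of this sketch rather than something stated in the description.

```python
def group_into_blocks(sentences, categories):
    """Group sentences into blocks; category 1 marks the first sentence of a block."""
    if len(sentences) != len(categories):
        raise ValueError("each sentence needs exactly one category")
    blocks = []
    for sentence, category in zip(sentences, categories):
        # Assumption: a leading sentence labeled 0 still opens the first block.
        if category == 1 or not blocks:
            blocks.append([sentence])
        else:
            blocks[-1].append(sentence)  # intermediate sentence joins current block
    return blocks
```

With the categories of the example above (1 at positions 1, 8 and 22), the function would return three blocks beginning with S_1, S_8 and S_22 respectively.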
In summary, the text in the target document of the text type is extracted, and at least one piece of text information contained in the target document is obtained by classifying the sentences in the text, so that the accuracy of extracting the text information from the document of the text type is guaranteed.
In addition, in practical applications, there are also cases where the document type is an image type. In this case, the target document is generally divided into at least one text region by a form image or a line image, and each text region stores a part of text information, so that the content of each text region can be regarded as a type of text information, and the embodiment of the present application is implemented specifically by the following manner:
determining at least one text region in the target document through the target detection model under the condition that the document type is the image type;
and identifying text content in the at least one text region, and determining text information in the at least one text region.
Specifically, the target detection model may be a YOLOv4 model, Fast R-CNN, EfficientDet, CenterNet, CTPN, etc., and is not limited herein.
In the case where the target detection model is YOLOv4, the target detection model includes a Backbone network, a Neck layer, and a Head layer. The Backbone network is a convolutional neural network that aggregates and forms image features at different image granularities, and can be composed of a plurality of CSPDarknet53 networks; the Neck layer is a network layer that mixes and combines a series of image features and passes them to the prediction layer; the Head layer, also called the prediction layer, is used for predicting on the image features, generating bounding boxes and predicting categories. After data enhancement and preprocessing, the target document is input into the Backbone network for feature extraction. The Backbone network may include three feature layers, namely a middle layer, a middle-lower layer and a bottom layer, and three layers of image features of the target document are obtained through layer-by-layer feature extraction. The three layers of image features are input into the Neck layer for a series of mixing and combination processing to obtain the processed image features. The processed image features are then input into the Head layer, which may include PANet and can extract output feature vectors of three channels containing the generated candidate regions and region features. The output feature vectors are decoded by using the configured prior boxes to obtain candidate regions containing a plurality of prediction boxes and prediction categories. Finally, the plurality of prediction boxes are processed by a non-maximum suppression algorithm to obtain the finally output detection box coordinates and the corresponding category information. Based on the detection box coordinates, the area enclosed by each detection box whose category is text is determined as a text region.
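The final non-maximum suppression step can be illustrated with a minimal greedy sketch. The box format (x1, y1, x2, y2), the threshold of 0.5 and the function names are assumptions for illustration; the description does not specify how suppression is configured.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the prediction boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

Of two heavily overlapping prediction boxes, only the one with the higher score survives, while boxes covering other text regions are unaffected.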
Furthermore, the text content in the text region may be recognized by an OCR algorithm, or may be recognized by a text recognition model, and the like, which is not limited herein.
In the case where the text content of the text region is recognized by a text recognition model and the text recognition model is a CRNN model, the CRNN model may include a convolutional layer, a recurrent layer, and a transcription layer. The convolutional layer uses a ResNet-34 network: the text region (image) is input into the convolutional layer, which extracts the features of the image and outputs the feature sequence of the text region. The recurrent layer uses a bidirectional RNN (BiLSTM): the feature sequence output by the convolutional layer is input into the recurrent layer, which learns each feature vector in the feature sequence and outputs the distribution of predicted labels (true values) for the feature sequence. The transcription layer uses CTC loss to convert the series of label distributions obtained from the recurrent layer into a final label sequence, which is then converted into the final recognition result through operations such as de-duplication and merging, so that the text information of the text region is obtained.
For example, when the document type of the target document TD is the image type, YOLOv4 is used to determine that the target document TD includes three text regions, and OCR is performed on the three text regions respectively to obtain the text information corresponding to each text region.
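The de-duplication and merging performed by the transcription layer can be illustrated with greedy CTC decoding, which is one common way to turn per-frame label distributions into a label sequence. Treating index 0 as the blank label, and the function name ctc_greedy_decode, are assumptions of this sketch; the description does not state which decoding strategy is used.

```python
BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_probs, alphabet):
    """Pick the best label per frame, collapse repeats, then drop blanks.

    frame_probs: one probability list per frame; index 0 is the blank,
    index k (k >= 1) corresponds to alphabet[k - 1].
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, previous = [], None
    for label in best:
        # A label is emitted only when it differs from the previous frame
        # (de-duplication) and is not the blank (merging).
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)
```

A blank between two identical labels is what allows a genuinely doubled character to survive the de-duplication.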
In conclusion, for the target document of the image type, the text region in the target document is determined first, and then the text content in the text region is identified to obtain at least one piece of text information, so that the accuracy of extracting the text information from the document of the image type is guaranteed.
Step 204: Perform text classification on the at least one piece of text information to obtain the text category of each piece of text information.
Specifically, on the basis of at least one piece of text information contained in the target document, in consideration of the difference in semantics expressed by each piece of text information, in order to better extract information from the target document, it is necessary to accurately identify text categories of the text information.
The text category refers to a semantic category obtained by classifying the text information according to the semantic meaning expressed by the text information. In specific implementation, the processing of text classification is roughly divided into text preprocessing, text feature extraction, classification model construction and the like. And the classification model can be obtained by conventional machine learning algorithms such as: a Bayesian model, a random forest model (RF), an SVM classification model, a neural network classification model and the like. In addition, the classification model can also adopt a deep learning text classification model, such as: fastText, TextCNN, TextRNN, and the like, without limitation.
On the basis of extracting the text information from the target document of the text type, because the feature information of each text statement in the target document is obtained in the process of extracting the text information, the feature extraction does not need to be repeated, and the embodiment of the application is specifically realized by the following method:
and determining a text category corresponding to each text message according to the sentence feature vector and at least one text message.
Because the sentence characteristic vector represents the characteristic information of the corresponding sentence, the text type corresponding to the text information can be determined according to the characteristic information and at least one text information. In a specific implementation, a text message may include a plurality of sentences, and the text category of the text message may be determined by performing statistics or deep learning on the feature vectors of the sentences.
Furthermore, depending on the writing habits of the target document, it may be that the feature information of one or several sentences (such as the first sentence or the last sentence, etc.) in the text information is more representative of the semantic representation of the text information. That is, the text type of the text information can be represented by using the feature information of the sentence. The text type of the text information can be determined according to the feature information represented by the sentence feature vector corresponding to the sentence in each text information.
For example, in the case where the target document TD is of the text type and three pieces of text information are extracted from the target document TD: if the semantic representation corresponding to the first sentence S_1 of the first text information is "product composition", the text category corresponding to the first text information is determined to be the product composition category; if the semantic representation corresponding to the first sentence S_8 of the second text information is "product function", the text category corresponding to the second text information is determined to be the product function category; and if the semantic representation corresponding to the first sentence S_22 of the third text information is "product after-sale", the text category corresponding to the third text information is determined to be the product after-sale category.
In conclusion, the text category corresponding to each text message is determined according to the sentence characteristic vector and at least one text message, so that the accuracy and the efficiency of determining the text category to which the text message belongs are improved.
On the basis of extracting the text information from the target document of the image type, the text information of each text region is obtained by recognition, and no sentence feature vectors are produced in the process of extracting the text information. The text information can therefore be classified by a text classification model, and the embodiment of the application is specifically realized by the following method:
inputting the text information in each text region into a text classification model, and determining a plurality of category confidence degrees of each text information, wherein each category confidence degree of each text information is used for representing the probability that the text information belongs to a reference category;
a text category for each text information is determined based on a plurality of category confidences for each text information.
The reference category may be understood as a text category (such as a product composition category, a product function category, a product after-sale category, etc.) to which the text information may belong. Accordingly, each category confidence corresponds to one reference category and represents the probability that the text information belongs to that reference category.
In practical application, each piece of text information can be input into the text classification model, and the probability that it belongs to each reference category is determined, so that multiple probabilities are obtained for each piece of text information. The reference category with the highest probability is then determined as the text category of the text information.
For example, in the case where the target document TD is of the image type and three pieces of text information are extracted from the target document TD, the three pieces of text information are respectively input into the text classification model, and the category confidence of each piece of text information for each reference category is determined. Assume that the reference categories include the product composition category, the product function category and the product after-sale category, and that the text classification model outputs, for the first text information, a category confidence of 0.22 for the product composition category, 0.15 for the product function category, and 0.88 for the product after-sale category. Based on these three category confidences, the text category of the first text information is determined to be the product after-sale category, which has the highest category confidence.
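Selecting the text category from the category confidences amounts to taking the reference category with the highest confidence. A minimal sketch, using the confidence values of the example above; the function name pick_text_category and the dictionary representation are assumptions:

```python
def pick_text_category(category_confidences):
    """Return the reference category with the highest category confidence."""
    if not category_confidences:
        raise ValueError("at least one category confidence is required")
    return max(category_confidences, key=category_confidences.get)


# Category confidences of the first text information, as given in the example.
first_text_confidences = {
    "product composition": 0.22,
    "product function": 0.15,
    "product after-sale": 0.88,
}
first_text_category = pick_text_category(first_text_confidences)
```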
In conclusion, the class confidence of the text information is output through the text classification model, and the text class of the text information is determined according to the class confidence, so that the text classification of the text information is realized, and the accuracy of the text classification is guaranteed.
Step 206: Perform information extraction on the at least one piece of text information according to the text category to obtain the key information contained in the target document.
Specifically, on the basis of obtaining the text category corresponding to each text information, it is considered that the key information corresponding to the information of different text categories is different, and therefore, the information of the text information can be extracted according to the text category to obtain the key information contained in the target document.
The key information is information indicating the subject content of the text information or important information included in the text information.
In specific implementation, in order to ensure the accuracy of key information extraction (i.e., the accuracy of identifying the first field information corresponding to the first preset field), it is considered that, in the case where the document type is the image type, a sentence between two adjacent punctuation marks may have been split into at least two parts (i.e., unnaturally interrupted information) in the at least one piece of text information extracted from the target document. Therefore, the sentence units contained in the text information of at least one text region in the target document can be determined first, and adjacent sentence units can then be classified to determine whether a sentence unit and the preceding sentence unit belong to the same sentence. The embodiment of the application is specifically realized by the following method:
determining a sentence unit contained in text information of at least one text area in the target document under the condition that the document type is the image type;
inputting adjacent sentence units into a first classification model for classification processing to obtain a first classification result;
under the condition that the first classification result is the same sentence, splicing adjacent sentence units of which the first classification result is the same sentence to obtain spliced text information corresponding to at least one text region;
correspondingly, extracting information from at least one text message according to the text category to obtain key information contained in the target document, including:
and extracting information of at least one spliced text message according to the text category to obtain key information contained in the target document.
Sentence unit refers to the sentence information between two adjacent punctuations. The punctuation marks may be clause marks (such as periods, exclamations, question marks, etc.), or punctuation marks (such as commas or semicolons, etc.), and are not limited herein.
The first classification model is a binary classification model for classifying adjacent sentence units included in the text information. Accordingly, the first classification result is whether the first classification model outputs the same sentence or not for the input adjacent sentence unit.
Specifically, the adjacent sentence units are input into the first classification model for classification processing (i.e., splicing judgment), that is, it is judged whether the adjacent sentence units belong to the same sentence, where the same sentence can be a complete sentence delimited by sentence division symbols (such as a period, an exclamation mark, a question mark, etc.), or a short sentence within a complete sentence delimited by punctuation marks (such as a comma or a semicolon, etc.). If the first classification result is the same sentence, the two adjacent sentence units are actually one sentence, and they are spliced to form one sentence unit; if the first classification result is different sentences, the two adjacent sentence units are different sentences, and no processing is needed. The first classification result may be represented by a classification value of 1 or True for the same sentence and by a classification value of 0 or False for different sentences, which is not limited herein.
Furthermore, on the basis of splicing processing, information extraction is carried out on the spliced text information according to the text type, so that key information contained in the target document is obtained, and the accuracy and the integrity of key information extraction are guaranteed.
In addition, if the sentences in all the text information are spliced before the key information is extracted, a large amount of content may need to be spliced. Alternatively, after the key information is extracted, the splicing processing can be performed only on the text information corresponding to the text regions that need to be processed subsequently, so that the splicing efficiency is improved.
For example, in the case where the document type is the image type, the sentence units contained in each of the three pieces of text information in the target document TD are determined. Taking the first text information as an example, it contains the sentence units S_u1 to S_u15. The adjacent sentence units S_u1 and S_u2 are input into the first classification model for classification processing; in the case that the first classification result is the same sentence, S_u1 and S_u2 are spliced. Similarly, the adjacent sentence units S_u2 and S_u3 are input into the first classification model for classification processing; when the first classification result output by the first classification model is the same sentence, S_u2 and S_u3 are spliced. By analogy, the spliced text information corresponding to the at least one text region is obtained.
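The pairwise splicing procedure can be sketched as follows. The first classification model is represented by a callable same_sentence(previous, current); the punctuation-based stand-in used here is purely hypothetical and only serves to make the sketch runnable, since the actual model is a trained binary classifier.

```python
def splice_sentence_units(units, same_sentence):
    """Splice adjacent sentence units that the classifier judges to be one sentence."""
    if not units:
        return []
    spliced = [units[0]]
    # Each unit is compared with its immediate predecessor, as in the example.
    for previous, current in zip(units, units[1:]):
        if same_sentence(previous, current):
            spliced[-1] += current  # same sentence: append to the open one
        else:
            spliced.append(current)  # different sentences: no processing
    return spliced


def toy_same_sentence(previous, current):
    """Hypothetical stand-in for the first classification model."""
    return not previous.rstrip().endswith((".", "!", "?", ",", ";"))
```

In practice toy_same_sentence would be replaced by a call to the trained first classification model, returning True (1) for the same sentence and False (0) otherwise.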
In conclusion, on the basis of splicing the text information which is not naturally interrupted, the key information of the spliced text information is extracted, so that the accuracy of extracting the key information is improved.
In practical applications, the key information contained in some text information is unique, while the key information contained in other text information may not be unique. Based on this, text information can be divided into single-element information and multi-element information. Single-element information contains a single element, meaning that the key information corresponding to the element is unique, such as gender, age, etc. Multi-element information contains multiple elements, meaning that the key information corresponding to the elements is not unique, such as work experience, education experience and the like. In order to improve the accuracy of information extraction, different extraction methods may be required for different kinds of text information. In the case where any text information is single-element information, the embodiment of the application is specifically implemented in the following manner:
inputting the text information to be processed into a first keyword recognition model corresponding to a first text category of the text information to be processed, and obtaining first key information contained in the text information to be processed; or,
performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; or,
inputting the text information to be processed into a first keyword recognition model corresponding to a first text category trained in advance, and obtaining first key information contained in the text information to be processed; performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; and determining key information contained in the text information to be processed based on the first key information and the second key information.
In specific implementation, the corresponding relationship between the text type of the text information and the single-element information or the multi-element information can be preset. And determining the text information as single element information or multi-element information according to the corresponding relation. For example, the text information of the basic information category in the resume text usually contains single elements. Therefore, the corresponding relationship between the basic information category and the single element information can be established in advance, and the text information corresponding to the basic information category can be determined to be the single element information according to the corresponding relationship.
The text information to be processed refers to any text information to be subjected to information extraction. The first text category refers to a text category of the text information to be processed.
The first keyword recognition model refers to the keyword recognition model corresponding to the first text category, and the keyword recognition model may be an NER model. The NER model may consist of a BERT model and a CRF layer, or of an RNN model and a CRF layer. In the case where the NER model consists of a BERT model and a CRF layer, the BERT model is followed by the CRF layer. CRF by itself is the conditional random field algorithm; in the NER model it serves as a downstream task layer that constrains the label transition probabilities. Accordingly, the first key information refers to the key information identified in any text information by the first keyword recognition model.
In practical application, in order to increase the accuracy of identifying the key information for the text information of each text category, the corresponding keyword identification model can be trained for each text category separately. In addition, a plurality of text types can correspond to the same keyword recognition model, and the method is not limited herein.
In addition, considering that some key information may follow a certain storage rule in the text information, the extraction rule (i.e., the first preset rule) for the key information may be preset according to the way the key information is stored, and this part of the key information may be extracted directly based on the first preset rule. For example, the first preset rule corresponding to the first text category may be to extract the information before and after ":" or to extract the word following "name"; the key information can then be extracted according to the first preset rule. Accordingly, the second key information refers to the key information identified in any text information by the first preset rule.
Further, in order to improve the accuracy of extracting the key information, the first keyword recognition model recognition and the first preset rule recognition may be combined to jointly recognize the key information included in the text information. And determining key information contained in the text information to be processed based on the key information (first key information) identified by the first keyword identification model and the key information (second key information) identified by the first preset rule. In specific implementation, both the first key information and the second key information may be used as key information included in the text information to be processed, or deduplication or combination processing may be performed based on the first key information and the second key information to determine the key information included in the text information to be processed, which is not limited herein.
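A minimal sketch of rule-based extraction and of combining it with model output is given below. The regular expression stands in for a first preset rule of the "information before and after the colon" kind, and the function names, the sample text and the stand-in model output are all hypothetical.

```python
import re

# Hypothetical first preset rule: "field: value" pairs, one per line.
FIELD_RULE = re.compile(r"([^:：\n]+)[:：][ \t]*([^\n]+)")


def extract_by_rule(text):
    """Second key information: (field, value) pairs matched by the preset rule."""
    return [(field.strip(), value.strip()) for field, value in FIELD_RULE.findall(text)]


def merge_key_information(first_key_info, second_key_info):
    """Combine model output and rule output, removing duplicates in order."""
    merged = []
    for item in list(first_key_info) + list(second_key_info):
        if item not in merged:
            merged.append(item)
    return merged


# Stand-in for the output of a first keyword recognition model.
model_output = [("name", "A1")]
rule_output = extract_by_rule("name: A1\nmodel: X-200")
key_information = merge_key_information(model_output, rule_output)
```

Here the pair found by both approaches appears once in the result, and the pair found only by the rule is added to it.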
Continuing the above example, assume that the first text information is single-element information and that its text category is determined to be the product composition category. The first text information is input into the first keyword recognition model M1 trained in advance for the product composition category, and the first keyword recognition model M1 outputs the first key information contained in the first text information, namely "name" and "a 1".
In conclusion, the text information of the single element information is subjected to key information extraction through the first keyword identification model and/or the first preset rule, so that the flexibility and diversity of key information extraction are improved. And under the condition of combining the two extraction modes, the accuracy of extracting the key information is guaranteed.
In addition, in the case that any text information is multi-element information, the embodiment of the present application is specifically implemented as follows:
splitting text information to be processed into at least one element information;
inputting each element information into a second keyword recognition model corresponding to a second text category of the text information to be processed, respectively, to obtain third key information contained in each element information; combining the third key information contained in each element information to obtain first combined key information contained in the text information to be processed; or,
performing key information identification on the at least one element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each element information; combining the fourth key information contained in each element information to obtain second combined key information contained in the text information to be processed; or,
inputting each element information into a pre-trained second keyword recognition model corresponding to the second text category, respectively, to obtain third key information contained in each element information; combining the third key information contained in each element information to obtain first combined key information contained in the text information to be processed; performing key information identification on the at least one element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each element information; combining the fourth key information contained in each element information to obtain second combined key information contained in the text information to be processed; and determining the key information contained in the text information to be processed based on the first combined key information and the second combined key information.
In particular, in multi-element information, one content item (such as education experience or work experience) may include several pieces of parallel or progressive information (for example, progressing by time stage), and each piece contains key information of the text information to be processed. Specifically, the time entities, institution entities and/or school entities that may appear in the text information to be processed may be preset. The parallel or progressive information in the text information to be processed is then split according to these entities to obtain at least one piece of independent information (namely, element information). Each element information is then identified, and the identification results are combined to obtain the key information contained in the text information to be processed.
The second text category refers to a text category to which the text information to be processed of the multi-element information belongs.
The second keyword recognition model corresponding to the second text category is implemented in a manner similar to the first keyword recognition model; refer to the implementation of the first keyword recognition model, which is not repeated herein. Accordingly, the third key information refers to key information identified in the element information by the second keyword recognition model. The second preset rule is a preset rule for identifying the element information. Accordingly, the fourth key information refers to key information identified in the element information by the second preset rule. Since each element information may contain key information, the identified key information needs to be combined to obtain the key information in the text information to be processed.
In addition, in order to improve the accuracy of extracting the key information, recognition by the second keyword recognition model and recognition by the second preset rule can be combined to jointly identify the key information contained in the text information to be processed. That is, the key information contained in the text information to be processed is determined based on the key information identified by the second keyword recognition model (third key information) and the key information identified by the second preset rule (fourth key information).
Continuing the above example, the third text information includes two pieces of element information. The two pieces of element information are respectively input into the second keyword recognition model M2 pre-trained for the product after-sale category. The keyword recognition model M2 outputs that the key information contained in the first piece of element information is "first year contract" and that the key information contained in the second piece of element information is "second year warranty". These two pieces of key information are combined as the key information contained in the third text information.
In practical applications, when the text information to be processed is multi-element information, it contains several pieces of element information. For each element information, key information is extracted through the second keyword recognition model corresponding to the second text category and/or the second preset rule corresponding to the second text category. The key information extracted from each element information is then combined to obtain the key information corresponding to the text information to be processed.
In summary, key information is extracted from text information to be processed that is multi-element information through the second keyword recognition model and/or the second preset rule, which improves the flexibility and diversity of key information extraction. When the two extraction modes are combined, the accuracy of key information extraction is further ensured.
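The split, extract and combine flow for multi-element information can be sketched as below. The sketch assumes that a year range such as "2010-2014" is the preset time entity that opens each element; `TIME_ENTITY`, `split_elements`, `extract_element` and `extract_multi_element` are hypothetical names, and `extract_element` is a simple stand-in for the second keyword recognition model and/or the second preset rule.

```python
import re

# Assumed time-entity pattern: a year range such as "2010-2014" opens a new element.
TIME_ENTITY = re.compile(r"\d{4}\s*-\s*\d{4}")

def split_elements(text):
    """Split multi-element text into element information at each time entity."""
    starts = [m.start() for m in TIME_ENTITY.finditer(text)]
    if not starts:
        return [text.strip()]
    # Any text before the first time entity stays with the first element.
    bounds = [0] + starts[1:] + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

def extract_element(element):
    """Stand-in for the per-element model and/or rule: pull the period and the rest."""
    info = {}
    period = TIME_ENTITY.search(element)
    if period:
        info["period"] = period.group(0)
        info["detail"] = element[period.end():].strip(" ,;")
    return info

def extract_multi_element(text):
    """Extract key information per element, then combine the results in order."""
    return [extract_element(e) for e in split_elements(text)]

# Made-up sample text with two parallel pieces of experience.
text = "2010-2014 XX University; 2014-2017 YY University"
result = extract_multi_element(text)
print(result)
```

Combining here simply means collecting the per-element results into one list; other merge strategies are possible.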
In practical applications, the extracted key information should be presented intuitively under the information the user needs to know, or under the text category of the text information to which it belongs, so as to match the user's viewing habits and viewing requirements. To this end, the embodiment of the present application is specifically implemented as follows:
determining a third text category corresponding to the first preset field in the text categories;
determining first target text information corresponding to a third text category in at least one text information;
judging whether first field information corresponding to a first preset field exists in key information contained in the first target text information or not;
if not, determining a fourth text type associated with the first preset field in the text types;
according to the extraction rule corresponding to the first preset field, determining first field information corresponding to the first preset field in second target text information corresponding to the fourth text type;
and updating the key information according to the first preset field and the first field information.
The first preset field may be understood as a name of information generally contained in the target document, or a name of information generally required to be known in the target document, such as a name, a material, a phone, and the like. Accordingly, the third text category refers to a desired category to which the first preset field belongs in the target document. The relationship (correspondence) may be set in advance.
The first target text information refers to the text information corresponding to the third text category. Further, it is judged whether an information value (namely, first field information) corresponding to the first preset field exists in the key information contained in the first target text information. If so, the key information extracted from the text information corresponding to the third text category already contains the first field information corresponding to the first preset field, and no further extraction is required. If not, the first field information corresponding to the first preset field needs to be determined from text information of other categories, so a fourth text category associated with the first preset field (i.e., a text category that may store the key information of the first preset field) is determined among the text categories of the at least one text information. The key information corresponding to the first preset field (namely, the first field information) is then extracted from the text information corresponding to the fourth text category (namely, the second target text information). Specifically, the extraction rule for extracting the first field information from the second target text information may be preset according to the way information is normally stored in the fourth text category, so as to ensure that the first field information can be extracted.
In addition, the number of first preset fields may be one or more. When there are multiple first preset fields, the first field information corresponding to different first preset fields is located at different positions in the text information, and may also be stored in different ways, so a corresponding extraction rule needs to be determined for each first preset field to ensure that its first field information can be accurately extracted.
For example, when the first preset field is "material", the user usually expects to see the key information for the "material" field within the key information of the "product composition category", so a first association relationship between the "product composition category" and the "material" field is preset. Among the text categories (product composition category, product function category and after-sale category), the third text category corresponding to the "material" field is determined to be the "product composition category" according to the first association relationship, and the text information corresponding to the "product composition category" is determined to be the first target text information. It is then judged whether the key information of the first target text information contains the first field information corresponding to the "material" field. If not, the fourth text category associated with the "material" field is determined, according to a preset second association relationship, to be the "product function category". According to the extraction rule corresponding to the "material" field, the first field information corresponding to the "material" field is determined to be "delinted" in the text information corresponding to the "product function category", and the "material" field and the first field information "delinted" are added to the key information extracted for the target document.
In summary, when the key information of the text information of the third text category does not include the first field information of the first preset field, the first field information is determined in the text information of other categories, and the key information of the target document is updated based on the first preset field and its corresponding first field information. This ensures the completeness of the key information and meets the user's expectations when viewing it.
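The fallback lookup for a preset field can be sketched as follows. Everything here is illustrative: `fill_preset_field`, the dictionary layout, the two association tables (`primary` for the first association relationship, `fallback` for the second) and the regular expression used as the extraction rule are assumptions, and the sample text is made up around the "material" example.

```python
import re

def fill_preset_field(field, category_info, category_text, primary, fallback, rules):
    """Return {field: value} to add to the key information, or {} if nothing is needed or found."""
    # Already present in the key information of the expected (third) text category.
    if field in category_info.get(primary[field], {}):
        return {}
    # Otherwise apply the field's extraction rule to the associated (fourth) category.
    match = rules[field].search(category_text.get(fallback[field], ""))
    return {field: match.group(1).strip()} if match else {}

# Key information already extracted per text category (made-up values).
category_info = {"product composition": {"name": "A1"}}
# Raw text information per text category (made-up values).
category_text = {"product function": "material: delinted, keeps warm"}
primary = {"material": "product composition"}    # first association relationship
fallback = {"material": "product function"}      # second association relationship
rules = {"material": re.compile(r"material\s*[:：]\s*([^,;]+)")}

key_info = {"name": "A1"}
key_info.update(
    fill_preset_field("material", category_info, category_text, primary, fallback, rules)
)
print(key_info)
```

With several preset fields, the same function would be called once per field, each with its own rule.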
In specific implementation, when the document type is the image type, performing sentence-break splicing directly on the OCR recognition result takes a large amount of time. To reduce the number of splicing operations, sentence-break splicing can instead be performed on the key information after the key information of the target document has been extracted, which the embodiment of the present application implements as follows:
determining a key sentence unit contained in the key information under the condition that the document type is the image type;
inputting adjacent key sentence units into a second classification model to perform first classification processing to obtain a second classification result;
under the condition that the second classification result is the same statement, splicing adjacent key statement units of which the second classification result is the same statement to obtain spliced key information;
correspondingly, judging whether the key information contained in the first target text information has first field information corresponding to a first preset field or not includes:
and judging whether first field information corresponding to a first preset field exists in the spliced key information contained in the first target text information.
The second classification model is a classification model for classifying adjacent key sentence units included in the key information. In practical applications, the second classification model may be the same as or different from the first classification model, which is not limited herein. Correspondingly, the second classification result is the result output by the second classification model indicating whether the input adjacent key sentence units belong to the same sentence.
Specifically, the adjacent key sentence units are input into the second classification model for classification processing (namely, splicing judgment), that is, it is judged whether the adjacent key sentence units belong to the same sentence. If the second classification result is the same sentence, the two adjacent key sentence units are actually one sentence and are spliced to form a text sentence; if the second classification result is different sentences, the two adjacent key sentence units are different sentences and no processing is needed. The second classification result may be represented by a classification value of 1 or True for the same sentence and by a classification value of 0 or False for different sentences, which is not limited herein.
Furthermore, after the splicing processing, it is judged whether the first field information corresponding to the first preset field exists in the spliced key information contained in the first target text information, which ensures the accuracy of identifying the first field information.
For example, when the document type is the image type, the key sentence units contained in the key information are determined. Suppose the key sentence units are PS1 to PS5. The adjacent key sentence units PS1 and PS2 are input into the second classification model for classification processing. When the second classification result output by the second classification model is the same sentence, PS1 and PS2 are spliced. Similarly, the next pair of adjacent key sentence units is input into the second classification model, and when the second classification result is the same sentence, that pair is spliced as well. By analogy, the spliced key information is obtained.
In summary, when the key information may contain unnatural interruptions, the interrupted key information is spliced first and the subsequent identification of the first field information is performed on the spliced key information, which ensures the accuracy of identifying the first field information.
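The splicing loop over adjacent key sentence units can be sketched as below. The real second classification model is a trained classifier; `toy_classifier` is a hypothetical stand-in that only checks for terminal punctuation, and `splice_units` and its `joiner` parameter are likewise assumptions made for illustration.

```python
def splice_units(units, same_sentence, joiner=""):
    """Join adjacent units whenever the classifier says they form one sentence."""
    spliced = []
    for unit in units:
        if spliced and same_sentence(spliced[-1], unit):
            # Classification value True/1: both units belong to the same sentence.
            spliced[-1] = spliced[-1] + joiner + unit
        else:
            # Classification value False/0: different sentences, nothing to splice.
            spliced.append(unit)
    return spliced

def toy_classifier(left, right):
    """Stand-in for the second classification model:
    True (same sentence) when the left unit lacks terminal punctuation."""
    return not left.rstrip().endswith((".", "!", "?", ";"))

# Made-up OCR output with unnatural line breaks.
units = [
    "Worked at XX Company as",
    "a software engineer.",
    "Responsible for",
    "backend services.",
]
result = splice_units(units, toy_classifier, joiner=" ")
print(result)
```

Note that this variant compares the already spliced unit with the next one, so three or more fragments of one sentence are merged in a single pass.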
In practical applications, the key information extracted in the above manner may still fail to meet the extraction requirement, or it may be impossible to determine whether the requirement is met. To further satisfy the extraction requirement, custom fields may be added and the key information supplemented through them, so the embodiment of the present application further includes:
performing keyword matching in at least one text message according to a second preset field to obtain second field information corresponding to the second preset field;
and updating the key information according to the second preset field and the second field information.
The second preset field refers to a field predefined by a user or an enterprise. This field may be one that does not normally appear in the target document but is of particular interest to the user or enterprise. To further improve the flexibility and diversity of key information extraction, key information can be extracted on demand through such fields.
In specific implementation, the synonyms or near-synonyms of the second preset field are acquired, the second preset field and its synonyms or near-synonyms are used as keywords for matching in the at least one text information, and the matched information is used as the second field information corresponding to the second preset field. The keyword matching may be performed through a pre-trained semantic matching model or through a preset rule, which is not limited herein.
For example, when the second preset field is the "price" field, the synonyms or near-synonyms corresponding to "price" are determined to include "price", "selling price" and "RMB". Matching is performed with these words in the three text information contained in the target document TT, and the second field information corresponding to the "price" field is obtained as "500 yuan".
In summary, the second field information corresponding to the second preset field is obtained by performing keyword matching on the second preset field, and the key information is updated according to the second preset field and the second field information, so that the flexibility of key information extraction is improved, and the personalized requirements of users on key information extraction are met.
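A rule-based variant of the keyword matching for a custom field can be sketched as follows. The function name `match_custom_field`, the value pattern (everything after the keyword up to the next delimiter) and the sample texts are assumptions for illustration; a pre-trained semantic matching model, which the text also allows, is not shown.

```python
import re

def match_custom_field(field, synonyms, texts):
    """Return the first value that follows the field name or one of its synonyms."""
    # Longest keyword first so that "selling price" is preferred over "price".
    keywords = sorted([field] + list(synonyms), key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile("(?:" + alternation + r")\s*[:：]?\s*([^,;，；。\n]+)")
    for text in texts:
        match = pattern.search(text)
        if match:
            return match.group(1).strip()
    return None

# Made-up text information of a target document.
texts = [
    "name: A1",
    "material: delinted",
    "selling price: 500 yuan, free shipping",
]
key_info = {"name": "A1"}
value = match_custom_field("price", ["selling price", "RMB"], texts)
if value is not None:
    key_info["price"] = value   # update the key information with the custom field
print(key_info)
```

The synonym list would normally come from a thesaurus or from the user; here it is passed in directly.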
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a document parsing method provided in an embodiment of the present application in a resume document scene application.
Specifically, the resume documents are acquired first and classified by document type. If the resume document is of the DOC, DOCX, WPS or TXT subtype, it is treated as belonging to the DOC series (type). If the resume document belongs to the DOC series and its text can be read directly, the text is extracted directly. If the resume document belongs to the DOC series but is not of the DOCX subtype and its text cannot be read directly, it is converted into the DOCX subtype and the text is extracted from the resulting DOCX resume document.
If the resume document is of a subtype such as PDF (Portable Document Format), PNG, JPG, JPEG or TIFF, it is treated as belonging to the PDF series (type). If the resume document belongs to the PDF series and the text it contains can be identified directly, the text is identified directly. If the resume document belongs to the PDF series but is not of the PDF subtype and its text cannot be identified directly, it is converted into the PDF subtype, and target detection and OCR (optical character recognition) are performed on the PDF resume document to extract the text block by block.
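The routing by document subtype described above can be sketched as a small dispatch function. The function name `route_document`, the `text_readable` flag and the returned action strings are hypothetical; the actual conversion, text extraction and OCR steps are not implemented here.

```python
from pathlib import Path

DOC_SERIES = {".doc", ".docx", ".wps", ".txt"}
PDF_SERIES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}

def route_document(filename, text_readable):
    """Return (series, action) for a resume file based on its suffix."""
    suffix = Path(filename).suffix.lower()
    if suffix in DOC_SERIES:
        if text_readable:
            return ("DOC series", "extract text directly")
        return ("DOC series", "convert to DOCX, then extract text")
    if suffix in PDF_SERIES:
        if text_readable:
            return ("PDF series", "identify text directly")
        return ("PDF series", "convert to PDF, then target detection and OCR")
    raise ValueError("unsupported document subtype: " + suffix)

print(route_document("resume.DOC", text_readable=False))
print(route_document("resume.png", text_readable=False))
```

Whether the text is directly readable would in practice be determined by attempting to read the file, which this sketch leaves to the caller.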
The text information extracted from the DOCX-type resume document is divided into blocks by block model 1, and the text information extracted from the PDF-type resume document is divided into blocks by block model 2. Both blocking processes yield the text information corresponding to the resume document, such as basic information, skill certificates, self-evaluation, education experience, work experience, at-school experience, practice experience and project experience.
Further, in the information extraction module, both the basic information and the skill certificates can be treated as basic information, and their key information is extracted through the NER model plus rules. For the education experience, work experience, at-school experience, practice experience, project experience and the like, a long experience passage can be divided into several independent short experiences, and the NER model plus rules are then used to extract the key information.
After the key information is extracted, the post-processing module can splice sentences in text that OCR recognition has interrupted unnaturally, and reasoning refinement is performed after sentence splicing. Examples include filling the highest degree in the education experience into the degree field of the basic information, judging from the last time period in the work experience whether the candidate is currently employed, and judging from the last time period in the education experience whether the candidate is a fresh graduate. Next, the custom field expansion module can extract fields quickly by using rules; such fields often have definite key values, such as job-hunting intention, target city and expected salary. Finally, the information processed by the post-processing module is stored in a database.
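One of the reasoning refinements mentioned above, filling the highest degree from the education experience into the basic information, can be sketched as follows. The degree names and their ordering in `DEGREE_RANK`, the function name `fill_highest_degree` and the record layout are assumptions made for illustration only.

```python
# Assumed ordering of degrees; a real system would use its own vocabulary.
DEGREE_RANK = {"associate": 1, "bachelor": 2, "master": 3, "doctorate": 4}

def fill_highest_degree(basic_info, education):
    """Copy the highest known degree from the education entries into basic_info."""
    degrees = [e["degree"] for e in education if e.get("degree") in DEGREE_RANK]
    if degrees and "degree" not in basic_info:
        basic_info["degree"] = max(degrees, key=DEGREE_RANK.get)
    return basic_info

# Made-up extracted key information.
basic_info = {"name": "XXXX"}
education = [
    {"school": "XX University", "degree": "bachelor"},
    {"school": "YY University", "degree": "master"},
]
print(fill_highest_degree(basic_info, education))
```

An existing degree field is left untouched, on the assumption that directly extracted information takes precedence over inferred information.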
In practical applications, document types are varied, and any type of document can be used to store information, yet existing approaches restrict the format of the target document to be parsed. In addition, multi-element information in the document, such as "work experience", "education experience" and "project experience", is extracted poorly, and resume parsing lacks extensibility and flexibility, so the user's personalized information needs cannot be met quickly.
In view of this, in order to support parsing of documents of different document types, different information extraction methods are adopted for documents of different document types. Determining the document parsing mode according to the document type improves parsing accuracy, and supporting user-defined fields meets users' personalized extraction requirements.
In the embodiment of the application, the extraction of the text information contained in the target document is realized by acquiring the target document and extracting at least one text information contained in the target document; the text classification is carried out on at least one text message to obtain the text category of each text message, so that the classification of the text messages is realized; and extracting information of at least one text message according to the text category to obtain key information contained in the target document. The method and the device realize extraction of the key information of the target document according to the text category of the text information on the basis of dividing the target document into at least one text information, and improve the accuracy of extraction of the key information.
The following describes the document parsing method with reference to fig. 4 by taking an application of the document parsing method provided in the present application in a resume parsing scenario as an example. Fig. 4 shows a processing flow chart of a document parsing method applied to a resume document scene according to an embodiment of the present application, which specifically includes the following steps:
Step 402: Acquire a target document and determine the document type of the target document.
Specifically, a newly added resume document D is obtained from the resume library. If the file suffix of the resume document D is DOC, the target document is determined to be of the DOC subtype; since the DOC subtype belongs to the text type, the document type of the target document is determined to be the text type.
Step 404: Extract information from the target document according to the document type to obtain at least one piece of text information.
Specifically, because target documents of different document types require different information extraction modes, the corresponding extraction mode is first determined according to the document type, and information is then extracted from the target document in that mode. In specific implementation, when the document type is the text type, information is extracted from the target document by performing the following steps 404-1 to 404-6:
Step 404-1: When the document type is the text type, perform text extraction on the target document to obtain a target text.
Specifically, text extraction is performed on the resume document D to obtain a resume text T to be extracted.
Step 404-2: Perform sentence segmentation on the target text to generate a sentence sequence.
Specifically, after sentence segmentation of the resume text T, n sentences are obtained to form a sentence sequence (S01, S02, ..., Sn), where 01 in S01 indicates the 1st sentence in the resume text T, 02 in S02 indicates the 2nd sentence in the resume text T, and so on.
Step 404-3: Input each text sentence in the sentence sequence into the feature extraction model in turn to obtain the sentence feature vector that the feature extraction model outputs for each text sentence.
Specifically, after the sentence sequence (S01, S02, ..., Sn) is passed through the feature extraction model, the CLS vectors corresponding to the text sentences form a vector sequence (B_CLS-01, B_CLS-02, ..., B_CLS-n), where B_CLS-01 denotes the CLS vector of the 1st sentence.
Step 404-4: Input the sentence feature vector corresponding to each text sentence into the feature classification model to obtain the sentence category that the feature classification model outputs for each text sentence.
Specifically, the vector sequence (B_CLS-01, B_CLS-02, ..., B_CLS-n) is input into a feature classification model built on a BiLSTM network. The BiLSTM extracts features from each CLS vector and classifies it, and the sentence categories corresponding to the CLS vectors in the vector sequence are determined as (1_01, 0_02, ..., 1_05, ..., 0_n).
Step 404-5: Perform block processing on the target text according to the sentence category corresponding to each text sentence to obtain at least one text information.
Specifically, in the above sentence categories (1_01, 0_02, ..., 1_05, ..., 0_n), the 1st, 5th, 10th, 16th, 20th, 25th, 30th, 33rd and 36th sentences are of the first category (i.e., starting sentences) and the other sentences are of the second category (i.e., intermediate sentences). It can therefore be determined that the 1st to 4th sentences are the first text information, the 5th to 9th sentences are the second text information, the 10th to 15th sentences are the third text information, the 16th to 19th sentences are the fourth text information, the 20th to 24th sentences are the fifth text information, the 25th to 29th sentences are the sixth text information, the 30th to 32nd sentences are the seventh text information, the 33rd to 35th sentences are the eighth text information, and the 36th to nth sentences are the ninth text information. The first to ninth text information are taken as the text information obtained by block processing.
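The block processing in step 404-5 amounts to opening a new block at every starting sentence. A minimal sketch, assuming the labels 1 (starting sentence) and 0 (intermediate sentence) have already been produced by the feature classification model; `split_into_blocks` is a hypothetical name and the nine-sentence input is a shortened, made-up example.

```python
def split_into_blocks(sentences, labels):
    """Group sentences into blocks; label 1 marks the first sentence of a block."""
    blocks = []
    for sentence, label in zip(sentences, labels):
        if label == 1 or not blocks:
            blocks.append([sentence])      # starting sentence opens a new block
        else:
            blocks[-1].append(sentence)    # intermediate sentence joins the current block
    return blocks

# Shortened example: sentences 1 and 5 are starting sentences.
sentences = ["S01", "S02", "S03", "S04", "S05", "S06", "S07", "S08", "S09"]
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0]
blocks = split_into_blocks(sentences, labels)
print(blocks)
```

Each resulting block corresponds to one text information; here the 1st to 4th sentences form the first block and the 5th to 9th sentences form the second.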
Step 404-6: and determining a text category corresponding to each text message according to the sentence feature vector and at least one text message.
Specifically, if the semantic representation (feature information) corresponding to the CLS vector corresponding to the 1 st sentence is "basic information", determining that the text category corresponding to the first text information is a basic information category; the semantic representation corresponding to the CLS vector corresponding to the 5 th statement is 'skill certificate information', and then the text category corresponding to the second text information is determined to be the skill certificate information category; the semantic representation corresponding to the CLS vector corresponding to the 10 th statement is self-evaluation, and then the text category corresponding to the third text information is determined as the evaluation category; the semantic meaning corresponding to the CLS vector corresponding to the 16 th sentence is expressed as 'education experience', and then the text category corresponding to the fourth text information is determined to be the education experience category; if the semantic meaning corresponding to the CLS vector corresponding to the 20 th statement is represented as 'working experience', determining the text category corresponding to the fifth text information as a working experience category; the semantic representation corresponding to the CLS vector corresponding to the 25 th statement is 'at-school experience', and then the text category corresponding to the sixth text information is determined to be at-school experience category; the semantic representation corresponding to the CLS vector corresponding to the 30 th sentence is 'practice experience', and then the text type corresponding to the seventh text information is determined to be the practice experience type; the semantic representation corresponding to the CLS vector corresponding to the 33 th statement is 'project experience', and then the text type corresponding to the eighth text information is determined to be a practice experience type; and if 
the semantic representation corresponding to the CLS vector corresponding to the 36 th sentence is other, determining that the text category corresponding to the eighth text information is other.
Further, in the case where the document type is an image type, information extraction may be performed on the target document by performing steps 404-7 to 404-10 as follows:
step 404-7: and determining at least one text region in the target document through the target detection model under the condition that the document type is the image type.
Specifically, assuming that the resume document D includes at least two text pages, the at least two text pages may be respectively input into the target detection model, and at least one text region of each text page is determined.
Step 404-8: Recognize the text content in the at least one text region, and determine the text information in the at least one text region.
Specifically, assuming that one text page in the resume document D includes at least two text regions, the at least two text regions may be respectively input into the text recognition model to determine the text information of each text region. For example, the first text region includes the text information "XXXX", and the fourth text region includes the text information "Education experience: studied for a bachelor's degree at XX University in 2010-2014, major XXXXXXXX".
Step 404-9: Input the text information in each text region into a text classification model, and determine a plurality of category confidences for each piece of text information, wherein each category confidence of a piece of text information represents the probability that the text information belongs to the corresponding reference category.
Specifically, the text information of the fourth text region is input into the text classification model, so that 5 category confidences can be obtained, which are respectively 0.2, 0.25, 0.86, 0.5, and 0.1.
Step 404-10: a text category for each text information is determined based on a plurality of category confidences for each text information.
Specifically, among the 5 category confidences, the highest is 0.86, and the reference category corresponding to this category confidence is the education experience category; the education experience category is therefore taken as the text category corresponding to the text information in the fourth text region.
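Steps 404-9 and 404-10 can be illustrated with a minimal sketch that selects the reference category with the highest confidence. The list of reference categories and the function name `pick_text_category` are assumptions made for illustration; the text only states that the category at confidence 0.86 is the education experience category.

```python
# Hypothetical reference categories; only the position of "education experience"
# (the category with confidence 0.86) is taken from the example in the text.
REFERENCE_CATEGORIES = [
    "basic information",
    "skill certificate information",
    "education experience",
    "working experience",
    "other",
]


def pick_text_category(confidences, categories=REFERENCE_CATEGORIES):
    """Return the (category, confidence) pair with the highest confidence."""
    if len(confidences) != len(categories):
        raise ValueError("one confidence is expected per reference category")
    best_index = max(range(len(confidences)), key=lambda i: confidences[i])
    return categories[best_index], confidences[best_index]


# Category confidences quoted for the fourth text region.
category, confidence = pick_text_category([0.2, 0.25, 0.86, 0.5, 0.1])
print(category, confidence)
```

The quoted confidences do not sum to 1, so the sketch treats them as independent per-category scores and simply takes the maximum.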
Step 406: Perform information extraction on the at least one piece of text information according to the text category to obtain the key information contained in the target document.
In a specific implementation, information extraction needs to be performed on each piece of text information; for any piece of text information, it is first determined whether that text information is single-element information or multi-element information.
In the case where the text information is determined to be single-element information, the following steps 406-1 to 406-3 are executed with that text information as the text information to be processed:
Step 406-1: Input the text information to be processed into a pre-trained first keyword recognition model corresponding to the first text category, and obtain the first key information contained in the text information to be processed.
Specifically, on the basis of determining that a first text category corresponding to the first text information is a "basic information category", the first text information is input into a first keyword recognition model trained in advance for the "basic information category", and first key information included in the first text information is obtained, wherein the first key information may include information corresponding to entity types such as a person name, a place name, a school name, a numerical value, and the like.
Step 406-2: Perform key information identification on the text information to be processed according to a first preset rule corresponding to the first text category of the text information to be processed, to obtain second key information contained in the text information to be processed.
In practical applications, a preset rule may be, for example, to directly extract character content in the format "xxxx-xx-xx", or to directly extract the content enclosed in a specified symbol, and the like, which is not limited herein. Correspondingly, the second key information is the key information extracted according to the first preset rule corresponding to the first text category.
Based on this, assuming that the first text information contains "date of birth: 1990-12-11", "1990-12-11" can be directly extracted from it as the second key information in the first text information.
Step 406-3: Combine the first key information and the second key information as the key information contained in the text information to be processed.
Based on this, the key information included in the first text information can be obtained by combining the information corresponding to the entity types including the person name, the place name, the school name, the numerical value, and the like included in the first key information with the second key information "1990-12-11".
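Steps 406-1 to 406-3 can be sketched as follows. The function `recognize_keywords` is a hypothetical stand-in for the first keyword recognition model (a real model would return entities such as person or place names), and the input string is illustrative; only the "xxxx-xx-xx" rule and the date "1990-12-11" come from the text.

```python
import re

# Rule from the text: extract character content in the format "xxxx-xx-xx".
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def recognize_keywords(text):
    """Hypothetical stand-in for the first keyword recognition model.

    For illustration it only picks up a 'name: value' pair.
    """
    match = re.search(r"name:\s*([^,;]+)", text)
    return {"person_name": match.group(1).strip()} if match else {}


def extract_single_element(text):
    first_key_info = recognize_keywords(text)             # step 406-1: model
    second_key_info = {"dates": DATE_RULE.findall(text)}  # step 406-2: rule
    return {**first_key_info, **second_key_info}          # step 406-3: combine


result = extract_single_element("name: XXX; date of birth: 1990-12-11")
print(result)
```

Keeping the model output and the rule output in separate keys is one possible way to combine them; the text does not specify how conflicts between the two sources would be resolved.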
Specifically, when the text information is multi-element information, the following steps 406-4 to 406-7 are performed with that text information as the text information to be processed:
Step 406-4: Split the text information to be processed into at least one piece of element information.
Specifically, for a fourth text information of the at least one text information, a fourth text category corresponding to the fourth text information is "education experience category". And splitting the fourth text information according to the time entity to obtain two element information.
Step 406-5: Input each piece of element information into a pre-trained second keyword recognition model corresponding to the second text category, to obtain the third key information contained in each piece of element information.
Specifically, it is assumed that the fourth text information includes two pieces of element information, namely element information 1 and element information 2. Element information 1 is: "Education experience: studied for a bachelor's degree at University A in 2010-2014, major XXXXXX". Element information 2 is: "Education experience: studied as a postgraduate at University B in 2014-2017, major YYYYYYY". Element information 1 is input into the second keyword recognition model corresponding to the "education experience category", and the third key information contained in element information 1 is obtained as "University A"; element information 2 is input into the same model, and the third key information contained in element information 2 is obtained as "University B".
Step 406-6: Perform key information identification on the at least one piece of element information according to a second preset rule corresponding to the second text category, to obtain the fourth key information contained in each piece of element information.
Specifically, assuming that the preset rule corresponding to the "education experience category" is "xxxx year-xxxx year", key information identification is performed on element information 1 based on this rule to obtain its fourth key information "2010-2014", and on element information 2 to obtain its fourth key information "2014-2017".
Step 406-7: Combine the third key information and the fourth key information as the key information contained in the text information to be processed.
Specifically, the third key information ("University A" and "University B") and the fourth key information ("2010-2014" and "2014-2017") are taken together as the key information contained in the fourth text information.
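Steps 406-4 to 406-7 can be sketched as below, under several assumptions: the time entity is simplified to a "yyyy-yyyy" span, the regular expression `SCHOOL` is a hypothetical stand-in for the second keyword recognition model, and the sample text is a paraphrase of the education experience example.

```python
import re

TIME_SPAN = re.compile(r"\d{4}-\d{4}")      # simplified "xxxx year-xxxx year" rule
SCHOOL = re.compile(r"University [A-Z]\b")  # stand-in for the keyword model


def split_by_time_entity(text):
    """Step 406-4: start a new element at every time span.

    Text before the first time span (for example a heading) is dropped.
    """
    starts = [m.start() for m in TIME_SPAN.finditer(text)]
    if not starts:
        return [text.strip()]
    bounds = starts + [len(text)]
    return [text[a:b].strip(" ,;") for a, b in zip(bounds, bounds[1:])]


def extract_multi_element(text):
    records = []
    for element in split_by_time_entity(text):
        third_key_info = SCHOOL.findall(element)      # step 406-5: model stand-in
        fourth_key_info = TIME_SPAN.findall(element)  # step 406-6: rule
        records.append({"school": third_key_info, "period": fourth_key_info})
    return records                                    # step 406-7: combined result


text = (
    "Education experience: 2010-2014 University A, bachelor's degree, major XXXXXX; "
    "2014-2017 University B, postgraduate, major YYYYYY"
)
records = extract_multi_element(text)
print(records)
```

Keeping one record per element preserves the pairing between each school and its period, which a flat merge of all key information would lose.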
Step 408: Determine, among the text categories, a third text category corresponding to the first preset field.
The first preset field can be understood as a field usually contained in a resume document, or a field that an enterprise usually needs to know from a resume, such as name, gender, or education level. Accordingly, the third text category refers to the category to which the first preset field in a resume document generally belongs. This correspondence may be set in advance.
In practice, steps 408 to 418 can be understood as inference-based refinement of the extracted key information.
Specifically, according to the preset correspondence, the third text category corresponding to the "education level" field is determined, among the text categories, to be the "basic information category".
Step 410: Determine, in the at least one piece of text information, the first target text information corresponding to the third text category.
Specifically, the first target text information corresponding to the third text category "basic information category" is determined to be the first text information in the at least one piece of text information.
Step 412: judging whether first field information corresponding to a first preset field exists in key information contained in the first target text information or not;
if not, the first field information corresponding to the first preset field needs to be determined from the text information of other text types, and then the following step 414 is executed;
if yes, it indicates that the first field information corresponding to the first preset field does not need to be determined from other text information, then the following step 420 is performed.
Specifically, assuming that no field value corresponding to the "education level" field exists in the key information contained in the first text information, the following step 414 is performed.
Step 414: Determine, among the text categories, a fourth text category associated with the first preset field.
Specifically, according to the preset association, the fourth text category associated with the "education level" field is determined to be the "education experience category".
Step 416: Determine, according to the extraction rule corresponding to the first preset field, the first field information corresponding to the first preset field in the second target text information corresponding to the fourth text category.
Specifically, the extraction rule corresponding to the "education level" field is to extract the education level information from the text sentence with the most recent year in the text information. In the fourth text information, which corresponds to the "education experience category", the text sentence with the most recent year is determined to be "Education experience: studied as a postgraduate at University B in 2014-2017, major YYYYYY"; the education level information "postgraduate" is extracted from this sentence, that is, the first field information corresponding to the "education level" field is determined to be "postgraduate".
Step 418: Update the key information according to the first preset field and the first field information.
Specifically, the key information of the resume document D is updated with the first preset field "education level" and the first field information "postgraduate" to obtain the updated key information.
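The refinement in steps 408 to 418 can be sketched as a lookup with a fallback. The data layout (a dictionary keyed by text category), the field name `education_level`, and the helper names are all assumptions for illustration; the text does not specify where the inferred field is written, so this sketch stores it under the third text category.

```python
def infer_field(field, key_info, primary_category, fallback_category, extraction_rule):
    """Fill `field` from the fallback category when the primary one lacks it."""
    primary = key_info.get(primary_category, {})
    if field in primary:                        # step 412: field already present
        return key_info
    fallback = key_info.get(fallback_category)  # steps 414 and 416
    value = extraction_rule(fallback) if fallback else None
    if value is None:
        return key_info
    updated = dict(key_info)                    # step 418: update the key information
    updated[primary_category] = {**primary, field: value}
    return updated


def latest_education_level(records):
    """Rule from the text: use the education level of the most recent entry."""
    latest = max(records, key=lambda record: record["end_year"])
    return latest.get("level")


key_info = {
    "basic information": {"date_of_birth": "1990-12-11"},
    "education experience": [
        {"end_year": 2014, "level": "bachelor"},
        {"end_year": 2017, "level": "postgraduate"},
    ],
}
updated = infer_field(
    "education_level", key_info,
    "basic information", "education experience",
    latest_education_level,
)
print(updated["basic information"]["education_level"])
```

Returning a new dictionary rather than mutating the input keeps the originally extracted key information available for comparison.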
Step 420: Perform keyword matching in the at least one piece of text information according to the second preset field to obtain second field information corresponding to the second preset field.
In practical applications, because of recruitment requirements, some enterprises may pay special attention to certain information about an applicant, such as job intention, target city, and/or expected salary. To avoid the extracted key information lacking such information, or presenting it inconspicuously, a custom field (i.e., the second preset field) can be set so that the information corresponding to the custom field is acquired accurately.
Specifically, in the case where the second preset field is the "expected salary" field, the synonyms or near-synonyms of "expected salary" are determined, keyword matching is performed with these words in the at least one piece of text information contained in the resume document D, and the second field information corresponding to the second preset field is obtained as "9000".
Step 422: Update the key information according to the second preset field and the second field information.
Specifically, the updated key information is updated again by using the second preset field "expected salary" and the second field information "9000", and the final key information corresponding to the resume document D is obtained.
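Steps 420 and 422 can be sketched as keyword matching over the text information. The synonym list is hypothetical (the text does not enumerate the synonyms), and capturing the digits that follow the keyword is an assumed way of reading off the field value.

```python
import re

# Hypothetical synonym list; the source does not enumerate the synonyms.
SYNONYMS = {
    "expected salary": ["expected salary", "desired salary", "salary expectation"],
}


def match_custom_field(field, texts, synonyms=SYNONYMS):
    """Return the first numeric value that follows the field or one of its synonyms."""
    keywords = synonyms.get(field, [field])
    pattern = re.compile(
        r"(?:%s)\s*[:：]?\s*(\d+)" % "|".join(re.escape(k) for k in keywords),
        re.IGNORECASE,
    )
    for text in texts:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


texts = ["Job intention: engineer", "Expected salary: 9000"]
key_info = {}
value = match_custom_field("expected salary", texts)
if value is not None:
    key_info["expected salary"] = value  # step 422: update the key information
print(key_info)
```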
Step 424: Store the target document and the key information contained in the target document into a database.
Specifically, the resume document D and the key information identified in the resume document D are stored in the resume database, so that the resume information can be queried or acquired through the resume database.
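Step 424 can be sketched with Python's built-in `sqlite3` module. The table layout, the file name `resume_D.pdf`, and the choice of SQLite are assumptions; the text only requires that the document and its key information be stored together and be queryable.

```python
import json
import sqlite3


def store_resume(conn, document_name, document_bytes, key_info):
    """Store a document together with its key information (serialized as JSON)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resumes ("
        "id INTEGER PRIMARY KEY, name TEXT, document BLOB, key_info TEXT)"
    )
    cursor = conn.execute(
        "INSERT INTO resumes (name, document, key_info) VALUES (?, ?, ?)",
        (document_name, document_bytes, json.dumps(key_info, ensure_ascii=False)),
    )
    conn.commit()
    return cursor.lastrowid


def load_key_info(conn, resume_id):
    """Query the stored key information for one resume."""
    row = conn.execute(
        "SELECT key_info FROM resumes WHERE id = ?", (resume_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # in-memory database for the example
resume_id = store_resume(conn, "resume_D.pdf", b"%PDF-", {"expected salary": "9000"})
stored = load_key_info(conn, resume_id)
print(stored)
```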
In summary, resume parsing that supports multiple document types is adopted, with different blocking modes and information extraction modes for different types of resumes, which improves the accuracy of resume parsing. In addition, to further meet the requirements of enterprises or users for resume extraction, the field information corresponding to a custom field in the resume document can be quickly extracted by setting the custom field.
Corresponding to the above method embodiment, the present application further provides a document parsing system embodiment, and fig. 5 shows a schematic structural diagram of the document parsing system provided in an embodiment of the present application. As shown in fig. 5, the document parsing system includes:
a client 502 and a document parsing end 504;
the client 502 is configured to send a parsing request carrying a target document to the document parsing end 504;
the document parsing end 504 is configured to receive the parsing request carrying the target document, and extract at least one piece of text information contained in the target document; perform text classification on the at least one piece of text information to obtain a text category of each piece of text information; and perform information extraction on the at least one piece of text information according to the text category to obtain key information contained in the target document, and return the key information to the client 502 as a response to the parsing request.
Specifically, the client 502 may be an intelligent device such as a mobile phone, a tablet computer, a desktop computer, and a notebook computer, which is not limited herein. Accordingly, the parsing request refers to a request for document parsing of a target document.
In practical applications, when a user or an enterprise needs to parse a target document, the target document may be uploaded through the client 502, and a parsing request for the target document may be sent. After receiving the parsing request, the document parsing end 504 parses out the key information contained in the target document and returns the parsed key information to the client 502 as the parsing result of the parsing request.
In the embodiment of the application, through interaction between the client 502 and the document parsing end 504, the document parsing end 504 parses the target document sent by the client 502 based on the parsing request sent by the client 502. During parsing, at least one piece of text information contained in the target document is extracted; text classification is performed on the at least one piece of text information to obtain the text category of each piece of text information; and information extraction is performed on the at least one piece of text information according to the text category to obtain the key information contained in the target document. In this way, on the basis of dividing the target document into at least one piece of text information, the key information of the target document is extracted according to the text category of each piece of text information, which improves the accuracy of key information extraction. Finally, by returning the parsed key information to the client 502 as the parsing result of the parsing request, the client 502 can obtain the parsing result of the target document in real time.
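The interaction between the client 502 and the document parsing end 504 can be sketched as a request handler that chains the three stages. The class and field names below are hypothetical, the transport layer is omitted, and the three stage functions are toy stand-ins rather than the models described in the method embodiment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ParseRequest:
    document: bytes  # the target document carried by the parsing request


@dataclass
class ParseResponse:
    key_info: Dict[str, str]  # returned to the client as the parsing result


class DocumentParsingEnd:
    def __init__(
        self,
        extract_text: Callable[[bytes], List[str]],
        classify: Callable[[str], str],
        extract_key_info: Callable[[str, str], Dict[str, str]],
    ):
        self.extract_text = extract_text
        self.classify = classify
        self.extract_key_info = extract_key_info

    def handle(self, request: ParseRequest) -> ParseResponse:
        blocks = self.extract_text(request.document)           # text information
        categories = [self.classify(block) for block in blocks]  # text categories
        key_info: Dict[str, str] = {}
        for block, category in zip(blocks, categories):        # key information
            key_info.update(self.extract_key_info(block, category))
        return ParseResponse(key_info)


# Toy stages, for illustration only.
parsing_end = DocumentParsingEnd(
    extract_text=lambda doc: doc.decode("utf-8").splitlines(),
    classify=lambda block: "basic information" if "birth" in block else "other",
    extract_key_info=lambda block, category: {category: block},
)
response = parsing_end.handle(ParseRequest(b"date of birth: 1990-12-11\nhobbies: reading"))
print(response.key_info)
```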
The above is a schematic scheme of a document parsing system of the present embodiment. It should be noted that the technical solution of the document parsing system and the technical solution of the document parsing method belong to the same concept, and details that are not described in detail in the technical solution of the document parsing system can be referred to the description of the technical solution of the document parsing method.
Corresponding to the above method embodiment, the present application further provides a document parsing apparatus embodiment, and fig. 6 shows a schematic structural diagram of the document parsing apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:
an extraction module 602 configured to obtain a target document and extract at least one piece of text information contained in the target document;
a classification module 604 configured to perform text classification on the at least one piece of text information to obtain a text category of each piece of text information;
an extracting module 606 configured to perform information extraction on the at least one piece of text information according to the text category to obtain key information contained in the target document.
Optionally, the extraction module 602 includes:
a determine type sub-module configured to determine a document type of the target document;
and the extraction submodule is configured to extract information of the target document according to the document type to obtain at least one piece of text information.
Optionally, the extraction sub-module is further configured to:
under the condition that the document type is a text type, performing text extraction on the target document to obtain a target text;
performing sentence division processing on the target text to generate a sentence sequence;
sequentially inputting each text sentence in the sentence sequence into a sentence feature extraction model, and obtaining a sentence feature vector corresponding to each text sentence output by the sentence feature extraction model;
inputting the sentence characteristic vector corresponding to each text sentence into a characteristic classification model, and obtaining the sentence category corresponding to each text sentence output by the characteristic classification model;
and carrying out block processing on the target text according to the sentence type corresponding to each text sentence to obtain at least one text message.
Optionally, the classification module 604 is further configured to:
determining a text category corresponding to each piece of text information according to the sentence feature vector and the at least one piece of text information.
Optionally, the extraction sub-module is further configured to:
determining at least one text region in the target document through a target detection model under the condition that the document type is an image type;
and identifying text content in the at least one text region, and determining text information in the at least one text region.
Optionally, the classification module 604 is further configured to:
inputting the text information in each text region into a text classification model, and determining a plurality of category confidence degrees of each text information, wherein each category confidence degree of each text information is used for representing the probability that the text information belongs to a reference category;
a text category for each text information is determined based on a plurality of category confidences for each text information.
Optionally, in the case where any piece of the text information is single-element information, the extracting module 606 is further configured to:
inputting the text information to be processed into a pre-trained first keyword recognition model corresponding to a first text category of the text information to be processed, and acquiring first key information contained in the text information to be processed; or
performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; or
inputting the text information to be processed into the pre-trained first keyword recognition model corresponding to the first text category, and acquiring first key information contained in the text information to be processed; performing key information identification on the text information to be processed according to the first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; and determining the key information contained in the text information to be processed based on the first key information and the second key information.
Optionally, in the case where any piece of the text information is multi-element information, the extracting module 606 is further configured to:
splitting the text information to be processed into at least one piece of element information;
inputting each piece of element information into a pre-trained second keyword recognition model corresponding to a second text category of the text information to be processed, and obtaining third key information contained in each piece of element information; combining the third key information contained in each piece of element information to obtain first combined key information contained in the text information to be processed; or
performing key information identification on the at least one piece of element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each piece of element information; combining the fourth key information contained in each piece of element information to obtain second combined key information contained in the text information to be processed; or
inputting each piece of element information into the pre-trained second keyword recognition model corresponding to the second text category to obtain third key information contained in each piece of element information; combining the third key information contained in each piece of element information to obtain first combined key information contained in the text information to be processed; performing key information identification on the at least one piece of element information according to the second preset rule corresponding to the second text category to obtain fourth key information contained in each piece of element information; combining the fourth key information contained in each piece of element information to obtain second combined key information contained in the text information to be processed; and determining the key information contained in the text information to be processed based on the first combined key information and the second combined key information.
Optionally, the document parsing apparatus further includes:
the first sentence determining module is configured to determine a document type of the target document, and determine a sentence unit contained in text information of at least one text area in the target document when the document type is an image type;
the first classification module is configured to input the adjacent sentence units into the first classification model for classification processing to obtain a first classification result;
the first splicing module is configured to splice the adjacent sentence units in the case where the first classification result indicates that they belong to the same sentence, so as to obtain spliced text information corresponding to the at least one text region;
accordingly, the extracting module 606 is further configured to:
performing information extraction on the at least one piece of spliced text information according to the text category to obtain key information contained in the target document.
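The splicing of adjacent sentence units can be sketched as a single left-to-right pass driven by a pairwise classifier. The function `ends_without_punctuation` is a toy stand-in for the first classification model, and the sample sentence units are illustrative.

```python
def splice_sentence_units(units, same_sentence):
    """Merge adjacent units whenever same_sentence(left, right) returns True."""
    spliced = []
    for unit in units:
        if spliced and same_sentence(spliced[-1], unit):
            spliced[-1] = spliced[-1] + unit  # same sentence: splice onto the left unit
        else:
            spliced.append(unit)              # different sentence: start a new one
    return spliced


def ends_without_punctuation(left, right):
    """Toy stand-in for the first classification model."""
    return not left.rstrip().endswith((".", "!", "?", ";"))


units = ["Studied at University A ", "from 2010 to 2014.", "Major: XXXXXX."]
spliced = splice_sentence_units(units, ends_without_punctuation)
print(spliced)
```

In this sketch the classifier sees the already spliced left-hand text rather than only the original neighbouring unit, which lets a sentence broken across more than two text regions be reassembled in one pass.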
Optionally, the document parsing apparatus further includes:
the first category determining module is configured to determine a third text category corresponding to the first preset field in the text categories;
a first determination information module configured to determine, in the at least one text message, a first target text message corresponding to the third text category;
the judging module is configured to judge whether first field information corresponding to a first preset field exists in key information contained in the first target text information;
if not, operating a second category determining module, wherein the second category determining module is configured to determine a fourth text category associated with the first preset field in the text categories;
a second information determining module configured to determine, according to an extraction rule corresponding to the first preset field, the first field information corresponding to the first preset field in second target text information corresponding to the fourth text category;
an updating module configured to update the key information according to the first preset field and the first field information.
Optionally, the document parsing apparatus further includes:
a second sentence determining module configured to determine a document type of the target document, and to determine the key sentence units contained in the key information in the case where the document type is an image type;
a second classification module configured to input adjacent key sentence units into a second classification model for classification processing to obtain a second classification result;
a second splicing module configured to splice the adjacent key sentence units in the case where the second classification result indicates that they belong to the same sentence, so as to obtain spliced key information;
accordingly, the judging module is further configured to:
judging whether first field information corresponding to the first preset field exists in the spliced key information contained in the first target text information.
Optionally, the document parsing apparatus further includes:
the matching module is configured to perform keyword matching in the at least one text message according to a second preset field to obtain second field information corresponding to the second preset field;
an information updating module configured to update the key information according to the second preset field and the second field information.
In the embodiment of the application, the extraction of the text information contained in the target document is realized by acquiring the target document and extracting at least one text information contained in the target document; the text classification is carried out on at least one text message to obtain the text category of each text message, so that the classification of the text messages is realized; and extracting information of at least one text message according to the text category to obtain key information contained in the target document. The method and the device realize extraction of the key information of the target document according to the text category of the text information on the basis of dividing the target document into at least one text information, and improve the accuracy of extraction of the key information.
The above is a schematic scheme of a document parsing apparatus of the present embodiment. It should be noted that the technical solution of the document analysis apparatus and the technical solution of the document analysis method belong to the same concept, and details that are not described in detail in the technical solution of the document analysis apparatus can be referred to the description of the technical solution of the document analysis method.
It should be noted that the components in the device claims should be understood as functional blocks which are necessary to implement the steps of the program flow or the steps of the method, and each functional block is not actually defined by functional division or separation. The device claims defined by such a set of functional modules are to be understood as a functional module framework for implementing the solution mainly by means of a computer program as described in the specification, and not as a physical device for implementing the solution mainly by means of hardware.
An embodiment of the present application further provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, where the processor implements the steps of the document parsing method when executing the computer instructions.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the document parsing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the document parsing method.
An embodiment of the present application further provides a computer readable storage medium, which stores computer instructions, and the computer instructions, when executed by a processor, implement the steps of the document parsing method as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned document parsing method, and for details that are not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the above-mentioned document parsing method.
An embodiment of the present application further discloses a chip, which stores computer instructions that, when executed by a processor, implement the steps of the document parsing method described above.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, because some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, and thereby to enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (16)

1. A document parsing method, comprising:
acquiring a target document, and extracting at least one piece of text information contained in the target document;
performing text classification on the at least one piece of text information to obtain a text category of each piece of text information;
and extracting information from the at least one piece of text information according to the text category to obtain key information contained in the target document.
2. The method of claim 1, wherein the extracting at least one piece of text information contained in the target document comprises:
determining a document type of the target document;
and extracting information from the target document according to the document type to obtain the at least one piece of text information.
3. The method of claim 2, wherein the extracting information from the target document according to the document type to obtain the at least one piece of text information comprises:
in a case where the document type is a text type, performing text extraction on the target document to obtain a target text;
performing sentence segmentation on the target text to generate a sentence sequence;
sequentially inputting each text sentence in the sentence sequence into a sentence feature extraction model, and obtaining a sentence feature vector, output by the sentence feature extraction model, corresponding to each text sentence;
inputting the sentence feature vector corresponding to each text sentence into a feature classification model, and obtaining a sentence category, output by the feature classification model, corresponding to each text sentence;
and dividing the target text into blocks according to the sentence category corresponding to each text sentence to obtain the at least one piece of text information.
4. The method of claim 3, wherein the performing text classification on the at least one piece of text information to obtain a text category of each piece of text information comprises:
determining a text category corresponding to each piece of text information according to the sentence feature vector and the at least one piece of text information.
5. The method of claim 2, wherein the extracting information from the target document according to the document type to obtain the at least one piece of text information comprises:
in a case where the document type is an image type, determining at least one text region in the target document through a target detection model;
and recognizing text content in the at least one text region to determine the text information in the at least one text region.
6. The method of claim 5, wherein the performing text classification on the at least one piece of text information to obtain a text category of each piece of text information comprises:
inputting the text information in each text region into a text classification model, and determining a plurality of category confidences for each piece of text information, wherein each category confidence of a piece of text information represents the probability that the piece of text information belongs to one reference category;
and determining a text category of each piece of text information based on the plurality of category confidences of that piece of text information.
7. The document parsing method according to claim 1, wherein in a case where any piece of text information is single-element information, information extraction is performed on that piece of text information by:
inputting text information to be processed into a first keyword recognition model corresponding to a first text category of the text information to be processed, and obtaining first key information contained in the text information to be processed; or
performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; or
inputting the text information to be processed into a pre-trained first keyword recognition model corresponding to the first text category, and obtaining first key information contained in the text information to be processed; performing key information identification on the text information to be processed according to a first preset rule corresponding to the first text category to obtain second key information contained in the text information to be processed; and determining key information contained in the text information to be processed based on the first key information and the second key information.
8. The document parsing method according to claim 1, wherein in a case where any piece of text information is multi-element information, information extraction is performed on that piece of text information by:
splitting text information to be processed into at least one piece of element information;
inputting each piece of element information into a second keyword recognition model corresponding to a second text category of the text information to be processed, and obtaining third key information contained in each piece of element information; combining the third key information contained in each piece of element information to obtain first combined key information contained in the text information to be processed; or
performing key information identification on the at least one piece of element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each piece of element information; combining the fourth key information contained in each piece of element information to obtain second combined key information contained in the text information to be processed; or
inputting each piece of element information into a pre-trained second keyword recognition model corresponding to the second text category to obtain third key information contained in each piece of element information; combining the third key information contained in each piece of element information to obtain first combined key information contained in the text information to be processed; performing key information identification on the at least one piece of element information according to a second preset rule corresponding to the second text category to obtain fourth key information contained in each piece of element information; combining the fourth key information contained in each piece of element information to obtain second combined key information contained in the text information to be processed; and determining key information contained in the text information to be processed based on the first combined key information and the second combined key information.
9. The method of claim 1, wherein before the extracting information from the at least one piece of text information according to the text category to obtain key information contained in the target document, the method further comprises:
determining a document type of the target document, and in a case where the document type is an image type, determining sentence units contained in the text information of at least one text region in the target document;
inputting adjacent sentence units into a first classification model for classification processing to obtain a first classification result;
in a case where the first classification result indicates the same sentence, splicing the adjacent sentence units whose first classification result indicates the same sentence to obtain spliced text information corresponding to the at least one text region;
and wherein the extracting information from the at least one piece of text information according to the text category to obtain key information contained in the target document comprises:
extracting information from at least one piece of spliced text information according to the text category to obtain the key information contained in the target document.
10. The method of claim 1, wherein after the extracting information from the at least one piece of text information according to the text category to obtain key information contained in the target document, the method further comprises:
determining, among the text categories, a third text category corresponding to a first preset field;
determining, in the at least one piece of text information, first target text information corresponding to the third text category;
determining whether first field information corresponding to the first preset field exists in the key information contained in the first target text information;
if not, determining, among the text categories, a fourth text category associated with the first preset field;
determining, according to an extraction rule corresponding to the first preset field, the first field information corresponding to the first preset field in second target text information corresponding to the fourth text category;
and updating the key information according to the first preset field and the first field information.
11. The method of claim 10, wherein after the extracting information from the at least one piece of text information according to the text category to obtain key information contained in the target document, the method further comprises:
determining a document type of the target document, and in a case where the document type is an image type, determining key sentence units contained in the key information;
inputting adjacent key sentence units into a second classification model for classification processing to obtain a second classification result;
in a case where the second classification result indicates the same sentence, splicing the adjacent key sentence units whose second classification result indicates the same sentence to obtain spliced key information;
and wherein the determining whether first field information corresponding to the first preset field exists in the key information contained in the first target text information comprises:
determining whether first field information corresponding to the first preset field exists in the spliced key information contained in the first target text information.
12. The method of claim 1, wherein after the extracting information from the at least one piece of text information according to the text category to obtain key information contained in the target document, the method further comprises:
performing keyword matching in the at least one piece of text information according to a second preset field to obtain second field information corresponding to the second preset field;
and updating the key information according to the second preset field and the second field information.
13. A document parsing system, comprising:
a client and a document analysis end;
the client is configured to send an analysis request carrying a target document to the document analysis end;
the document analysis end is configured to receive the analysis request carrying the target document and extract at least one piece of text information contained in the target document; perform text classification on the at least one piece of text information to obtain a text category of each piece of text information; extract information from the at least one piece of text information according to the text category to obtain key information contained in the target document; and return the key information to the client as a response to the analysis request.
14. A document parsing apparatus, comprising:
an extraction module configured to acquire a target document and extract at least one piece of text information contained in the target document;
a classification module configured to perform text classification on the at least one piece of text information to obtain a text category of each piece of text information;
and an extraction module configured to extract information from the at least one piece of text information according to the text category to obtain key information contained in the target document.
15. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-12 when executing the computer instructions.
16. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 12.
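
The flow recited in claims 1 and 6 (extract text information, classify each piece by category confidence, then extract key information with a category-specific extractor) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the application: the two reference categories, the keyword-based `classify` stand-in for the text classification model, and the regular-expression extractors standing in for the keyword recognition models or preset rules.

```python
import re
from typing import Callable, Dict, List


def classify(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for the text classification model.

    Returns one confidence per reference category; the categories
    "contact" and "education" are illustrative assumptions.
    """
    lowered = text.lower()
    scores = {
        "contact": 0.9 if ("@" in text or "phone" in lowered) else 0.1,
        "education": 0.9 if "university" in lowered else 0.1,
    }
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}


def pick_category(confidences: Dict[str, float]) -> str:
    """Choose the reference category with the highest confidence."""
    return max(confidences, key=confidences.get)


def extract_contact(text: str) -> Dict[str, str]:
    """Illustrative rule-based extractor for the assumed 'contact' category."""
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return {"email": emails[0]} if emails else {}


def extract_education(text: str) -> Dict[str, str]:
    """Illustrative rule-based extractor for the assumed 'education' category."""
    match = re.search(r"((?:[A-Z]\w* )+University)", text)
    return {"school": match.group(1)} if match else {}


# One extractor per text category, so extraction depends on the category.
EXTRACTORS: Dict[str, Callable[[str], Dict[str, str]]] = {
    "contact": extract_contact,
    "education": extract_education,
}


def parse_document(text_blocks: List[str]) -> Dict[str, str]:
    """Classify each piece of text information, then extract key information."""
    key_info: Dict[str, str] = {}
    for block in text_blocks:
        category = pick_category(classify(block))
        key_info.update(EXTRACTORS[category](block))
    return key_info


blocks = ["Email: jane@example.com", "Graduated from Example University in 2015"]
result = parse_document(blocks)
print(result)
```

Dispatching through a per-category table keeps each extractor small and lets a trained model replace a rule for one category without touching the others.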
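
The splicing step of claims 9 and 11, in which adjacent sentence units recognized from an image are merged when a classification model judges them to belong to the same sentence, might look like the following sketch. The `same_sentence` function is a hypothetical punctuation heuristic standing in for the first (or second) classification model, whose structure the claims do not specify.

```python
from typing import Callable, List


def same_sentence(left: str, right: str) -> bool:
    """Hypothetical stand-in for the pairwise classification model.

    Assumption: two adjacent units belong to the same sentence when the
    left unit does not end with terminal punctuation.
    """
    return not left.rstrip().endswith((".", "!", "?", ";", ":"))


def splice_units(
    units: List[str],
    is_same_sentence: Callable[[str, str], bool] = same_sentence,
) -> List[str]:
    """Merge adjacent sentence units that are classified as one sentence."""
    spliced: List[str] = []
    for unit in units:
        # Compare the running (possibly already spliced) unit with the next one.
        if spliced and is_same_sentence(spliced[-1], unit):
            spliced[-1] = spliced[-1].rstrip() + " " + unit.lstrip()
        else:
            spliced.append(unit)
    return spliced


ocr_lines = [
    "Party A shall deliver the goods",
    "within 30 days of signing.",
    "Party B shall pay on receipt.",
]
spliced = splice_units(ocr_lines)
print(spliced)
```

Passing the classifier as a parameter means a trained model can be substituted for the heuristic without changing the splicing loop.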
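
The fallback of claim 10, in which a preset field missing from the key information of its primary text category is looked up in the text of an associated category using that field's extraction rule, can be sketched as follows. The function name, the dictionary layout, the category names, and the regular-expression form of the extraction rule are all illustrative assumptions.

```python
import re
from typing import Dict


def fill_missing_field(
    field: str,
    primary_category: str,
    fallback_category: str,
    key_info_by_category: Dict[str, Dict[str, str]],
    text_by_category: Dict[str, str],
    extraction_rule: str,
) -> Dict[str, str]:
    """Return the primary category's key information, filling `field`
    from the fallback category's text when it is missing.

    `extraction_rule` is assumed to be a regular expression with one
    capture group holding the field value.
    """
    key_info = dict(key_info_by_category.get(primary_category, {}))
    if field in key_info:
        return key_info  # field already extracted; nothing to update
    fallback_text = text_by_category.get(fallback_category, "")
    match = re.search(extraction_rule, fallback_text)
    if match:
        key_info[field] = match.group(1)
    return key_info


updated = fill_missing_field(
    field="phone",
    primary_category="contact",
    fallback_category="header",
    key_info_by_category={"contact": {"email": "jane@example.com"}},
    text_by_category={
        "contact": "Email: jane@example.com",
        "header": "Jane Doe, Tel 555-0100",
    },
    extraction_rule=r"Tel\s*([\d-]+)",
)
print(updated)
```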
CN202111424953.7A 2021-11-26 2021-11-26 Document analysis method, system and device Pending CN114090776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111424953.7A CN114090776A (en) 2021-11-26 2021-11-26 Document analysis method, system and device

Publications (1)

Publication Number Publication Date
CN114090776A true CN114090776A (en) 2022-02-25

Family

ID=80305103

Country Status (1)

Country Link
CN (1) CN114090776A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114548072A (en) * 2022-04-25 2022-05-27 杭州实在智能科技有限公司 Automatic content analysis and information evaluation method and system for contract files
CN114861657A (en) * 2022-05-18 2022-08-05 北京金山数字娱乐科技有限公司 Conference key sentence extraction method and device
CN115130989A (en) * 2022-06-24 2022-09-30 北京百度网讯科技有限公司 Method, device and equipment for auditing service document and storage medium
CN115995087A (en) * 2023-03-23 2023-04-21 杭州实在智能科技有限公司 Document catalog intelligent generation method and system based on fusion visual information
CN116627907A (en) * 2023-04-10 2023-08-22 甘肃中电瓜州风力发电有限公司 Settlement data analysis method and system based on electric power transaction platform

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN109766438A (en) * 2018-12-12 2019-05-17 平安科技(深圳)有限公司 Biographic information extracting method, device, computer equipment and storage medium
CN112668316A (en) * 2020-11-17 2021-04-16 国家计算机网络与信息安全管理中心 word document key information extraction method
CN113033204A (en) * 2021-03-24 2021-06-25 广州万孚生物技术股份有限公司 Information entity extraction method and device, electronic equipment and storage medium
CN113094509A (en) * 2021-06-08 2021-07-09 明品云(北京)数据科技有限公司 Text information extraction method, system, device and medium
WO2021203581A1 (en) * 2020-04-10 2021-10-14 深圳壹账通智能科技有限公司 Key information extraction method based on fine annotation text, and apparatus and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination