CN114547301A - Document processing and recognition model training method, apparatus, device and storage medium - Google Patents

Document processing and recognition model training method, apparatus, device and storage medium

Info

Publication number
CN114547301A
CN114547301A (Application CN202210159137.6A)
Authority
CN
China
Prior art keywords
identification
text
document
processed
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210159137.6A
Other languages
Chinese (zh)
Inventor
李硕
陈禹燊
韩光耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210159137.6A
Publication of CN114547301A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a document processing method, a recognition model training method, an apparatus, a device and a storage medium, and relates to the technical field of data processing, in particular to the technical fields of deep learning, natural language processing and deep search. The document processing method includes: processing an acquired document to be processed to obtain a recognition object set of the document to be processed, and determining a recognition result of the document to be processed according to the recognition objects included under the object categories in the recognition object set and the recognition scores of those recognition objects. The recognition model training method includes: inputting the text samples of an acquired text sample set into a preset network to obtain object recognition results of the text samples, and adjusting parameters of the preset network in combination with the object labeling information carried by the text samples to obtain an object recognition model. The technical scheme can accurately identify the object categories in a document and the recognition objects corresponding to the object categories, improving the information extraction effect for documents.

Description

Document processing and recognition model training method, apparatus, device and storage medium
Technical Field
The present disclosure relates to the technical field of deep learning, natural language processing, and deep search in data processing, and in particular, to a method, an apparatus, a device, and a storage medium for training a document processing and recognition model.
Background
Document intelligence refers to the process of automatically reading, understanding and analyzing documents by a computer, and the popularization of deep learning technology greatly promotes the development of the field of document intelligence represented by document information extraction. Document information extraction refers to identifying or extracting key information from a document.
In the related art, document information extraction mainly uses Named Entity Recognition (NER) schemes and machine reading comprehension (MRC) methods to extract key information from documents. However, these methods place requirements on the length of the document being processed, and when entity nesting exists in the document, training and prediction results may be inconsistent, resulting in a poor information extraction effect.
Disclosure of Invention
The present disclosure provides a document processing method, a recognition model training method, an apparatus, a device and a storage medium.
According to a first aspect of the present disclosure, there is provided a document processing method including:
acquiring a document to be processed;
processing the document to be processed to obtain an identification object set in the document to be processed, wherein the identification object set comprises: an object category, an identification object included in the object category, and an identification score of the identification object;
and determining the recognition result of the document to be processed according to the recognition objects included in the object categories in the recognition object set and the recognition scores of the recognition objects.
According to a second aspect of the present disclosure, there is provided a recognition model training method, including:
acquiring a text sample set, wherein text samples in the text sample set carry object labeling information;
inputting the text samples in the text sample set into a preset network to obtain object identification results of the text samples, wherein the target identification objects corresponding to the object identification categories in the object identification results are determined based on identification scores;
and adjusting parameters of the preset network according to the object marking information carried by the text sample and the object identification result of the text sample to obtain an object identification model.
According to a third aspect of the present disclosure, there is provided a document processing apparatus comprising:
the acquisition unit is used for acquiring a document to be processed;
a processing unit, configured to process the to-be-processed document to obtain an identification object set in the to-be-processed document, where the identification object set includes: an object category, an identification object included in the object category, and an identification score of the identification object;
and the determining unit is used for determining the identification result of the document to be processed according to the identification objects included in the object categories in the identification object set and the identification scores of the identification objects.
According to a fourth aspect of the present disclosure, there is provided a recognition model training apparatus, including:
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a text sample set, and text samples in the text sample set carry object labeling information;
the processing unit is used for inputting the text samples in the text sample set into a preset network to obtain object recognition results of the text samples, and target recognition objects corresponding to object recognition categories in the object recognition results are determined based on recognition scores;
and the adjusting unit is used for adjusting the parameters of the preset network according to the object marking information carried by the text sample and the object identification result of the text sample to obtain an object identification model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or to perform the method of the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or the method of the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of an electronic device can read the computer program, execution of the computer program by the at least one processor causing the electronic device to perform the method of the first aspect or to perform the method of the second aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a document processing method according to a first embodiment of the disclosure;
FIG. 3 is a flowchart illustrating a document processing method according to a second embodiment of the disclosure;
FIG. 4 is a flowchart illustrating a document processing method according to a third embodiment of the disclosure;
FIG. 5 is a flowchart illustrating a recognition model training method according to a first embodiment of the disclosure;
FIG. 6 is a flowchart illustrating a recognition model training method according to a second embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a recognition model training method according to a third embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an architecture provided by an embodiment of the present disclosure;
FIG. 9 is a flow diagram of a plain text document splitting process;
FIG. 10 is a schematic structural diagram of a document processing apparatus provided in an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a recognition model training apparatus provided in an embodiment of the present disclosure;
FIG. 12 shows a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, for document information extraction, for example extracting information from Word documents, PDF documents and plain-text documents, the industry generally adopts NER schemes and MRC methods to extract key information from documents. However, mainstream information extraction is generally aimed at processing shorter documents. For example, models based on Bidirectional Encoder Representations from Transformers (BERT) typically process at most 512 tokens, which is far from sufficient for document data (e.g., Word documents) containing thousands or tens of thousands of tokens. Moreover, real document data is not only long but also has multiple formats, complex article structures and other characteristics, all of which bring challenges to information extraction. Here, BERT is a pre-trained language representation model.
In related deep-learning work, extended position encoding is generally adopted to extract key information from long text, for example using relative position encoding or rotary position encoding (RoPE) instead of absolute position encoding to extend the number of tokens a model such as BERT can handle from 512 to a very long sequence. The advantage of this method is that an ultra-long text (all of the text of a Word document) can be fed in for encoding at once. The disadvantages are that increasing the input length sharply increases the (graphics card) resources consumed during training and inference, and that for ultra-long text the model still cannot capture context semantics across very large spans, so the extraction effect is poor.
For example, the NER task used for entity extraction in the related art may suffer from inconsistency between training and prediction. Entity extraction with the NER task typically uses the BIO labeling method, in which each element is labeled "B-X", "I-X" or "O": "B-X" indicates that the fragment containing the element belongs to type X and the element is at the beginning of the fragment, "I-X" indicates that the fragment containing the element belongs to type X and the element is in the middle of the fragment, and "O" indicates that the element does not belong to any type. That is, when entity extraction is performed with the NER task, each character of an entity is labeled as one of the types B, I or O, and the model training stage maximizes the probability that the characters of each entity are assigned B or I. When the model is used for prediction and evaluation, however, evaluation must be performed at the entity level, namely whether an entity is identified and whether it is identified accurately, from which evaluation indexes such as accuracy, precision, recall and the F1 value are computed. Thus the model is optimized at the character level during training but evaluated at the entity level, so training and prediction are inconsistent.
Among these evaluation indexes, accuracy is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples for a given test data set; precision is the proportion of correctly retrieved items (TP) among all actually retrieved items (correctly retrieved items TP + wrongly retrieved items FP); recall is the proportion of correctly retrieved items (TP) among all items that should have been retrieved (TP + FN); and F1 = 2 × precision × recall / (precision + recall).
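As an illustration only, these entity-level indexes can be computed from TP/FP/FN counts; the following minimal sketch uses assumed function and variable names, not names from the disclosure:

```python
def entity_level_metrics(true_entities, predicted_entities):
    """Entity-level precision, recall and F1.

    true_entities / predicted_entities: sets of (head, tail, category) tuples.
    """
    tp = len(true_entities & predicted_entities)   # correctly retrieved items (TP)
    fp = len(predicted_entities - true_entities)   # wrongly retrieved items (FP)
    fn = len(true_entities - predicted_entities)   # items that should have been retrieved but were not (FN)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```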
Illustratively, the BIO tagging approach also fails to address the problem of entity nesting. For example, "xx university" nests two entities, a place name and an organization name: "xx" is the place name and "xx university" is the organization name, and both need to be recognized during extraction. When labeling before training, however, the sequence can be labeled in only one form, either as [B-Location, I-Location, O, O] or as [B-Organization, I-Organization, I-Organization, I-Organization]. Since one tagging sequence cannot label the place name and the organization name at the same time, the problem of entity nesting cannot be solved.
In addition, in the related art, when the MRC method is used for entity extraction or question answering, a question (or query) is posed about a text, and the answer to the query is the target entity, which is one or more subsequences of the text. In essence this uses a Pointer Network: two modules are generally needed to identify the indices of the head and the tail of the entity respectively, and prediction is likewise performed at the entity level, so the problem of inconsistent training and prediction also exists.
In view of the above technical problems, the technical conception of the present disclosure is as follows. The inventors found that extracting entities with the NER task or the MRC task faces, to a certain extent, inconsistency between training and prediction, mainly because one labeling sequence cannot label multiple entities at the same time, so entity nesting cannot be solved. Therefore, in the process of processing a document, the object categories in the document to be processed, the recognition objects included under each object category and the recognition scores of those recognition objects can be determined, and the final recognition objects of each object category can then be determined based on the recognition scores.
Based on the above conception, an embodiment of the present disclosure provides a document processing method: an acquired document to be processed is processed to obtain a recognition object set of the document, where the recognition object set includes the object categories, the recognition objects included under each object category and the recognition scores of those recognition objects; the recognition result of the document to be processed is then determined according to the recognition objects included under the object categories and their recognition scores, thereby improving the information processing effect.
Further, an embodiment of the present disclosure also provides a recognition model training method: a text sample set is acquired, where the text samples in the set carry object labeling information; the text samples are input into a preset network to obtain object recognition results of the text samples, where the target recognition object corresponding to each object recognition category in the result is determined based on recognition scores; and the parameters of the preset network are adjusted according to the object labeling information carried by the text samples and the object recognition results of the text samples to obtain an object recognition model, thereby solving the problem of inconsistency between training and prediction.
It is understood that in the embodiments of the present disclosure, the "object recognition model" is also simply referred to as the "model"; it can receive the document to be processed and determine the recognition object set in the document to be processed according to the received document and the current model parameters. Alternatively, the object recognition model may be a regression model, an artificial neural network (ANN), a deep neural network (DNN), a support vector machine (SVM) or another machine learning model. The embodiments of the present disclosure are not limited thereto.
Exemplarily, fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure. As shown in fig. 1, the application scenario may include: two stages; wherein:
the first stage is the training stage of the recognition model.
In the training phase of the recognition model, the object recognition model is a model for recognizing the object class included in the document to be processed, the recognition object included in the object class, and the recognition score of the recognition object. In the application scenario of the present disclosure, the text samples in the text sample set are input into the preset network, and the preset network is trained based on the object recognition result of the text sample and the object labeling information carried by the text sample, so as to obtain the object recognition model.
Optionally, the text sample in the embodiment of the present disclosure is a text after segmentation, and a labeling method for ensuring the context of the object is adopted for labeling the text sample, so that consistency during training and subsequent application can be ensured.
For example, in an embodiment of the present disclosure, referring to fig. 1, a training device may obtain a document set from N document libraries, extract at least one document from the document set, obtain a text set through segmentation, obtain a text sample set formed after performing object labeling on a text in the text set, perform object recognition on a text sample in the text sample set by using a preset network, obtain an object recognition result of the text sample, and finally adjust parameters of the preset network according to object labeling information carried by the text sample and the object recognition result of the text sample, to obtain an object recognition model.
The second stage is a stage of performing object recognition using an object recognition model.
In the stage of object recognition using the object recognition model, with continued reference to fig. 1, the object recognition model trained in the first stage may be loaded into the processing device. The processing device processes the document to be processed using the object recognition model. Alternatively, the processing device may also be referred to as a smart device.
Exemplarily, a document to be processed is input to a processing device for processing, and an identification object set in the document to be processed is obtained, where the identification object set includes: and determining the identification result of the document to be processed according to the identification objects and the identification scores of the identification objects included in the object categories in the identification object set.
It is understood that the number of object categories included in the set of identification objects and the number of identification objects included in each object category are not limited by the embodiments of the present disclosure, and may be determined according to an actual scene, for example, the number of object categories and the number of identification objects included in each object category may be at least one.
It should be noted that fig. 1 is only an application scenario schematic diagram provided by the embodiment of the present disclosure, and the embodiment of the present disclosure does not limit specific devices included in an application scenario, for example, the application scenario may further include: document parsing devices, storage devices, and the like.
For example, in the application scenario shown in fig. 1, the document parsing device may parse the acquired non-text document based on the received parsing instruction, and transmit the parsed text document to the processing device for processing, so as to obtain an identification result of the text document.
Optionally, the storage device in this embodiment may be used to store the recognition result, and may be an independent device or may be integrated in the processing platform.
It will be appreciated that the positional relationship between the devices shown in fig. 1 does not constitute any limitation, for example, when the application scenario also includes a storage device, the storage device may be an external memory with respect to the training device or the processing device, and in other cases, the storage device may also be disposed in the processing device.
It should be further noted that in the embodiments of the present disclosure, the training device and the processing device may be the same device or different devices. The training device and/or the processing device may be a terminal device, including but not limited to a smart phone, a notebook computer, a desktop computer, a tablet computer, a vehicle-mounted device or a smart wearable device; it may also be a server, a virtual machine, or a distributed computer system composed of one or more servers and/or computers, and the embodiments of the present disclosure are not limited thereto. The server may be an ordinary server or a cloud server; a cloud server, also called a cloud computing server or cloud host, is a host product in the cloud computing service system. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that the product implementation form of the present disclosure is a program code included in machine learning and deep learning platform software and deployed on a server (which may also be hardware with computing capability such as a computing cloud or a mobile terminal). In the system architecture diagram shown in FIG. 1, the program code of the present disclosure may be stored within the processing device and the training device. During operation, the program code is run in the host memory and/or the GPU memory of the server.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
It should be noted that the object recognition model in this embodiment is not a recognition model for a specific object, and cannot reflect information of a specific object; moreover, the document or text sample set to be processed in the present embodiment is from a public data set.
The embodiment of the disclosure provides a method, a device, equipment and a storage medium for document processing and recognition model training, which are applied to the technical fields of deep learning, natural language processing and deep search in data processing so as to improve the effect of document information extraction.
In the embodiments of the present disclosure, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Hereinafter, the technical solution of the present disclosure will be described in detail by specific examples. It should be noted that the following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
By way of example, the document processing process is first described in detail below with reference to several specific embodiments.
Fig. 2 is a flowchart illustrating a document processing method according to a first embodiment of the disclosure. The method of this embodiment may be executed by the processing device in fig. 1, or may be executed by a processor in the processing device. In this embodiment, the processing device executes the method. As shown in fig. 2, the document processing method provided by the present embodiment may include:
s201, obtaining a document to be processed.
For example, the processing device may receive the document to be processed transmitted from another device, read the document to be processed from a document library stored in the processing device (in this case, the document library is deployed in the processing device), or generate the document to be processed based on a document component of the processing device. The embodiment of the present disclosure does not limit how the document to be processed is acquired, which may be determined according to the actual scene.
S202, processing the document to be processed to obtain an identification object set in the document to be processed, wherein the identification object set comprises: an object category, an identification object included in the object category, and an identification score of the identification object.
In the present embodiment, the processing device has functions of document processing, for example, document parsing, text serialization, text category recognition, text object recognition, and so on. The present embodiment does not limit the functions of the processing apparatus.
For example, when the length of the document to be processed is smaller than the length that can be processed by the processing device, and the document to be processed is a text document, the processing device may process the document to be processed by using the object recognition model loaded on the processing device, so as to obtain the recognition object set in the document to be processed.
Optionally, when an object category exists in the text of the document to be processed and the object category includes the identification object, the processing device may identify at least one object category in the document to be processed, at least one identification object included in the at least one object category, and the identification score of the at least one identification object.
S203, determining the identification result of the document to be processed according to the identification objects included in the object categories in the identification object set and the identification scores of the identification objects.
In the disclosed embodiment, for an object class in the set of identification objects, a final identification object of the object class may be determined based on an identification score of an identification object included in the object class.
For example, when the object class included in the recognition object set is multiple, the processing device may compare scores of all recognition objects in the object class for at least one object class in the multiple object classes, and determine a final recognition object of the object class based on the scores of all recognition objects.
In the embodiment of the present disclosure, the acquired document to be processed is processed to obtain a recognition object set in the document, where the recognition object set includes the object categories, the recognition objects included under each object category and the recognition scores of those recognition objects, and the recognition result of the document to be processed is determined accordingly. This technical scheme can accurately identify the recognition objects under each object category and improves the information extraction effect for documents.
On the basis of the embodiment shown in fig. 2, the following describes the document processing method provided by the embodiment of the present disclosure in more detail.
Exemplarily, fig. 3 is a flowchart illustrating a document processing method according to a second embodiment of the present disclosure. As shown in fig. 3, in the embodiment of the present disclosure, the above S202 may be implemented by the following steps:
s301, splitting a text sequence corresponding to the document to be processed to obtain at least one text to be processed.
Illustratively, after the document to be processed is serialized, a text sequence corresponding to the document to be processed can be obtained. In order to solve the problem that the maximum number of sequences that can be processed in the conventional information extraction scheme is limited, in this embodiment, if the length of the text sequence is greater than the longest sequence that can be processed by the processing device, the processing device may split the text sequence to obtain at least one text to be processed.
In a possible design of this embodiment, a text sequence corresponding to a document to be processed may be split based on a preset sliding window length and a preset sliding step length, so as to obtain at least one text to be processed.
Wherein the length of the sliding window is greater than or equal to the sliding step length.
By way of example, the embodiment adopts the idea of sliding window, and can split a text with a length exceeding the maximum sequence number that can be processed by the processing device into a text to be processed with a shorter length, and then perform object recognition, that is, can process an ultra-long text (word document with hundreds of pages and ten thousand characters), thereby greatly improving the recognition performance.
For example, the embodiment of the present disclosure performs sliding window on the text sequence (long sequence) corresponding to the document to be processed, and splits the text sequence into a plurality of texts to be processed. For example, taking a word plain text with a length of 10000 as an example, the length of a sliding window can be set to 512, the sliding step length is 384, a sequence from a position of 0 to a position of 9999 is subjected to sliding window processing, the word plain text sequence with the length of 10000 is split into a plurality of texts to be processed with the length of 512, and about 10000/384 ≈ 26 texts to be processed can be obtained.
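A minimal sketch of this sliding-window split, assuming the window length 512 and step 384 from the example (the function name is an assumption, not part of the disclosure):

```python
def split_by_sliding_window(text, window=512, step=384):
    """Split a long text sequence into overlapping texts to be processed."""
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + window])
        if start + window >= len(text):
            break
    return pieces

# A plain text of length 10000 yields roughly 10000 / 384, i.e. about 26 texts to be processed.
texts_to_process = split_by_sliding_window("x" * 10000)
print(len(texts_to_process))  # 26
```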
S302, inputting at least one text to be processed into a pre-trained object recognition model, and determining a recognition object set in the document to be processed.
The training principle of the object recognition model includes Bidirectional Encoder Representations from Transformers (BERT) and a global pointer (GlobalPointer) module.
In an embodiment of the present disclosure, a pre-trained object recognition model is deployed or loaded on a processing device. In a possible design of this embodiment, after the processing device obtains the document to be processed, if the document to be processed is a text document and the text sequence length of the text document meets the requirement, the document to be processed may be input into the object recognition model, so that the recognition object set corresponding to the document to be processed may be directly output.
In an example of this embodiment, when the text sequence corresponding to the document to be processed does not meet the requirement, after at least one text to be processed is obtained through the splitting process in S301, the at least one text to be processed may be respectively input into the object recognition model, and a recognition object set corresponding to the at least one text to be processed is output.
It can be understood that, in order to solve the problems of object nesting (entity nesting) and of inconsistency between training and application, the object recognition model in this embodiment is trained with a combination of BERT and GlobalPointer for object recognition: the head and tail of each candidate object are judged as a whole, which gives the object recognition model a more global view.
Specifically, suppose the text sequence of a certain text to be processed has length n. For simplicity, assume that only one category of object in the text is to be recognized, that each object to be recognized is a continuous fragment of the text sequence with unrestricted length, and that objects may be nested within each other (two different recognition objects may intersect). Then the number of "candidate recognition objects" in the sequence is n(n+1)/2: a text sequence of length n has n(n+1)/2 different continuous subsequences, and these subsequences contain all possible objects. What the processing device needs to do is pick out the real objects from these n(n+1)/2 candidate recognition objects, which is an "n(n+1)/2 choose k" multi-label classification problem. If m object categories in the text need to be recognized, then m such "n(n+1)/2 choose k" multi-label classification problems are made. This is the basic idea of the GlobalPointer module: the object is used as the basic unit of judgment.
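The span-level view can be illustrated with a short sketch that enumerates the n(n+1)/2 candidate subsequences; the names here are assumptions for illustration:

```python
def candidate_spans(n):
    """All continuous subsequences of a length-n sequence, as (head, tail) index pairs with head <= tail."""
    return [(head, tail) for head in range(n) for tail in range(head, n)]

spans = candidate_spans(4)
print(len(spans))  # 10 == 4 * (4 + 1) / 2 candidate recognition objects
# For m object categories, picking the real objects out of these candidates amounts to
# m multi-label classification problems over the spans; spans are free to nest or overlap,
# which is why judging the object as a whole avoids the entity-nesting problem.
```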
In the embodiment of the present disclosure, at least one text to be processed is obtained by splitting the text sequence corresponding to the document to be processed, the at least one text to be processed is input into the pre-trained object recognition model, and the recognition object set of the document to be processed is determined. Since the training principle of the object recognition model includes BERT and GlobalPointer, the problems of recognition object nesting and of inconsistency between training and application can be solved, which improves the recognition precision of the object recognition model.
Optionally, in this embodiment, at least one text to be processed may be input into the corresponding object recognition model, and the model may independently predict a sample to be processed, so that parallelization processing may be performed, and the model inference efficiency is greatly improved.
Optionally, in the embodiment of the disclosure, as shown in fig. 3, before step S202, that is, before step S301, the document processing method may further include the following steps:
s300a, determining whether the document to be processed is a plain text document; if yes, go to S300c first, then to S301; if not, S300b and S300c are executed first, and then S301 is executed.
In the embodiment of the disclosure, since the pre-trained object recognition model is obtained by training based on the text sequence, after the processing device obtains the document to be processed, it is first determined whether the document to be processed is a plain text document, and then the subsequent operation is executed based on the determination result.
S300b, analyzing the document to be processed to obtain a plain text document corresponding to the document to be processed.
In one example, in response to that the document to be processed is a non-plain text document, the document to be processed is parsed to obtain a plain text document corresponding to the document to be processed.
For example, if the document to be processed is a Word document, the Word document may be read with the open-source module python-docx to obtain a Document object, the text attribute of each Paragraph object is then read, and all the texts are spliced into one long text string, giving the plain text document corresponding to the Word document.
It is understood that a document object, which may be interpreted as a file object, is a computer term that refers to an object in an HTML file. A Paragraph object (Paragraph object) represents a Paragraph in the selected content, scope, or document. The Paragraph object is a member of a Paragraph collection that contains all of the Paragraphs in the selected content, scope, or document.
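A minimal sketch of this parsing step with the open-source python-docx module mentioned above (the file name is an assumption):

```python
from docx import Document  # open-source module python-docx

def word_to_plain_text(path):
    """Read a Word document and splice all paragraph texts into one long text string."""
    document = Document(path)                                       # Document object
    texts = [paragraph.text for paragraph in document.paragraphs]   # text attribute of each Paragraph object
    return " ".join(texts)

plain_text = word_to_plain_text("document_to_be_processed.docx")
```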
S300c, processing the plain text document to obtain a text sequence corresponding to the document to be processed.
For example, after the document to be processed is a plain text document or a plain text document corresponding to a non-plain text document is obtained by analyzing the document to be processed of the non-plain text document, the processing device may perform operations such as space symbol cleaning on the plain text document.
For example, the processing device may perform whitespace cleaning for Word-document plain text: uniformly replace all \n, \t, \r, space characters and characters whose Unicode category is Zs (separators and blanks) with a single blank, and replace any run of consecutive blanks with one blank, thereby cleaning the text document and obtaining the text sequence corresponding to the document to be processed.
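A sketch of this cleaning rule with Python's re module; the exact rule used in the disclosure may differ:

```python
import re

# Unicode category-Zs space separators, in addition to \n, \t, \r and the ordinary space.
ZS_SEPARATORS = "\u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u202f\u205f\u3000"

def clean_plain_text(text):
    text = re.sub("[\n\t\r " + ZS_SEPARATORS + "]", " ", text)  # replace separators and blanks with one blank
    text = re.sub(" {2,}", " ", text)                           # replace consecutive blanks with one blank
    return text.strip()
```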
In the embodiment of the disclosure, a text sequence corresponding to a plain text document is obtained by processing a document to be processed, then the text sequence is split to obtain at least one text to be processed, and finally the at least one text to be processed is input into a pre-trained object recognition model to determine a recognition object set in the document to be processed. According to the technical scheme, the document to be processed is processed, so that the validity of the text to be processed input into the object recognition model can be guaranteed to a certain extent, and the information processing effect is improved.
Exemplarily, fig. 4 is a flowchart illustrating a document processing method according to a third embodiment of the present disclosure. As shown in fig. 4, in the embodiment of the present disclosure, the above S203 may be implemented by the following steps:
s401, determining the number of the identification objects included in the object category.
For example, at least one text to be processed included in the document to be processed is input to the object recognition model for recognition, and a recognition object set, that is, an object category included in the recognition object set, at least one recognition object included in the object category, and a prediction score (referred to as a recognition score in this embodiment) of the GlobalPointer for each recognition object can be obtained. At this time, the number of identification objects included in the object class may be determined first.
For example, suppose an announcement document of a certain school is processed; two object categories can be obtained, namely "announcement organization" and "announcement place". The recognition objects and recognition scores included under "announcement organization" and "announcement place" are respectively as follows:

[{"announcement organization": {"first grade second grade third grade fourth grade fifth grade": 0.98, "Academic Affairs Office": 0.18, "Medical Office": 0.09}},
{"announcement place": {"blackboard bulletin": 0.78}}]
S402, in response to the fact that the object category comprises at least two recognition objects, determining a target recognition object in the at least two recognition objects according to the recognition scores of the at least two recognition objects.
S403, determining that the recognition result of the document to be processed includes the target recognition object of the object category.

Optionally, for each object category, the recognition scores of the at least one recognition object under the object category may be obtained, and the recognition object with the highest recognition score in the object category may be determined as the target recognition object of that category. Correspondingly, the recognition result of the document to be processed includes the target recognition object of the object category.
For example, for the "announcement organization" category, the recognition object with the maximum GlobalPointer score, that is, the recognition object scored 0.98, is selected as the recognition result, namely: "first grade second grade third grade fourth grade fifth grade".
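A minimal sketch of this per-category selection, using a dictionary shaped like the announcement example above (the key names are assumptions):

```python
recognition_object_set = {
    "announcement organization": {
        "first grade second grade third grade fourth grade fifth grade": 0.98,
        "Academic Affairs Office": 0.18,
        "Medical Office": 0.09,
    },
    "announcement place": {"blackboard bulletin": 0.78},
}

def pick_target_recognition_objects(recognition_object_set):
    """For each object category, keep the recognition object with the highest recognition score."""
    return {category: max(objects, key=objects.get)
            for category, objects in recognition_object_set.items()}

print(pick_target_recognition_objects(recognition_object_set))
# {'announcement organization': 'first grade second grade third grade fourth grade fifth grade',
#  'announcement place': 'blackboard bulletin'}
```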
It can be understood that, in this embodiment, if a certain object class only includes one identification object, the identification object is the identification result of the object class.
Optionally, as shown in fig. 4, in an embodiment of the present disclosure, the document processing method may further include the following steps:
s404, determining whether an integral object exists in at least one identification object included in the identification result, wherein the integral object comprises at least two sub-objects with the same context.
For example, in the embodiment of the present disclosure, after the recognition result of the document to be processed is determined, it may be determined whether at least one recognition object included in the recognition result is an integral object.
It can be understood that the object recognition model is trained on a text sample set labeled with a labeling method based on object context, in which multiple objects with the same context are treated as one integral object. Therefore, when the text sequence corresponding to the document to be processed is processed with the object recognition model, the obtained recognition result may include an integral object formed by at least two sub-objects with the same context.
S405, responding to the integral object existing in the identification object, and performing object segmentation on the integral object to obtain a final identification result of the document to be processed.
In one possible design, an integral object exists in the identification objects included in the identification result, and at this time, the integral object is segmented according to the blank space to obtain each sub-object, namely, the final identification result of the document to be processed.
Illustratively, for the recognition result of the "announcement organization" category, "first grade second grade third grade fourth grade fifth grade", cutting at the blanks gives the final output of "announcement organization": ["first grade", "second grade", "third grade", "fourth grade", "fifth grade"].
In the embodiment of the present disclosure, the number of recognition objects included under an object category is determined; in response to the object category including at least two recognition objects, a target recognition object is determined among them according to their recognition scores, so that the recognition result of the document to be processed includes the target recognition object of the object category; and in response to an integral object (an object comprising at least two sub-objects with the same context) existing in the recognition result, the integral object is segmented to obtain the final recognition result of the document to be processed. This technical scheme improves object recognition accuracy, meets service requirements and improves information processing efficiency.
The above embodiments describe the process of document processing. The following describes a process for training an object recognition model utilized in a document processing process, in conjunction with several specific embodiments.
Fig. 5 is a schematic flowchart of a recognition model training method according to a first embodiment of the present disclosure. The method of this embodiment may be executed by the training apparatus in fig. 1, and may also be executed by a processor in the training apparatus. In this embodiment, the training apparatus executes the method. As shown in fig. 5, the recognition model training method provided in this embodiment may include:
s501, a text sample set is obtained, and the text samples in the text sample set carry object labeling information.
For example, the training device may obtain a large number of text samples from a plurality of text repositories.
In a possible design of this embodiment, the text sample acquired by the training device is labeled, and optionally, the labeled text sample carries object labeling information.
Optionally, the object labeling information may be labeled with a labeling method that preserves the entity context.
S502, inputting the text samples in the text sample set into a preset network to obtain the object recognition results of the text samples.
Wherein the target recognition object corresponding to the object recognition category in the object recognition result is determined based on the recognition score.
In the embodiment of the disclosure, in the training process of the object recognition model, the training device may input the text samples in the text sample set into the preset network (when there are a plurality of text samples, the text samples may be input in parallel or in series), and may output the object recognition result of the text sample, that is, the object recognition category in the text sample and the target recognition object corresponding to the object recognition category.
Optionally, in this embodiment, the format of two text samples input into the preset network is as follows:

["text sample 1", [object head 11, object tail 11, labeled object 1], [object head 12, object tail 12, labeled object 2], ...];
["text sample 2", [object head 21, object tail 21, labeled object 1], [object head 22, object tail 22, labeled object 2], ...].
S503, adjusting parameters of a preset network according to the object marking information carried by the text sample and the object identification result of the text sample to obtain an object identification model.
In this embodiment, the training device may compare the object recognition result of the text sample with the object labeling information carried by the text sample, determine a degree of consistency of the preset network for the object recognition result of the text sample and the object labeling information, and adjust a parameter of the preset network when the degree of consistency is lower than a preset requirement, to obtain the object recognition model.
In the embodiment of the disclosure, a text sample set is obtained, the text samples in the text sample set carry object labeling information, the text samples in the text sample set are input to a preset network, an object recognition result of the text sample is obtained, and then parameters of the preset network are adjusted according to the object labeling information carried by the text sample and the object recognition result of the text sample, so as to obtain an object recognition model.
On the basis of the embodiment shown in fig. 5, the following describes the recognition model training method provided by the embodiment of the present disclosure in more detail.
Fig. 6 is a flowchart illustrating a recognition model training method according to a second embodiment of the present disclosure. In an embodiment of the present disclosure, the preset network includes: a BERT part and a GlobalPointer part. The disclosed embodiments mainly illustrate the training of the GlobalPointer part.
In practical applications, because using the NER task or the MRC task for entity (object) extraction in the related art faces inconsistency between training and prediction to a certain extent, this embodiment changes the traditional BERT + CRF for NER into BERT + GlobalPointer for NER. When BERT + GlobalPointer is used for the entity extraction task, the beginning and end of each candidate entity are judged as a whole, giving the method a more global view.
Accordingly, as shown in fig. 6, the above S502 may be implemented by the following steps:
s601, carrying out object recognition on the text samples in the text sample set by using a BERT part, and determining all recognition objects included in the text samples.
Illustratively, the BERT part generates deep bidirectional language representations using a masked language model (MLM); specifically, it considers the context on both sides (left and right) of each word, which helps the model better understand the word in context. Therefore, when the BERT part is used to perform object recognition on a text sample in the text sample set, all recognition objects included in the text sample can be determined.
S602, classifying all the recognition objects included in the text sample based on the global pointer part, and determining an object recognition category included in the text sample and at least one recognition object included in the object recognition category.
For example, the basic idea of the global pointer part is a multi-label classification problem, and therefore, after all the recognition objects included in the text sample are determined, the global pointer part may be used to perform class division on all the recognition objects included in the text sample to obtain an object recognition class included in the text sample, and then at least one recognition object included in the object recognition class is determined for the object recognition class.
S603, determining a target identification object corresponding to the object identification category according to the identification score of at least one identification object in the object identification category.
For example, in the embodiment of the present disclosure, since the global pointer part can give the identification score of each identification object in the object identification category, the identification scores of the identification objects may be sorted from high to low, and the identification object with the highest score may be used as the target identification object corresponding to the current object identification category.
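A simplified, schematic sketch of how the global pointer part can score every (category, head, tail) triple and how the highest-scoring span is then selected; the tensor shapes and projection scheme here are assumptions, not the exact formulation of the disclosure:

```python
import torch

def global_pointer_scores(hidden, w_head, w_tail):
    """hidden: [seq_len, dim] token representations from the BERT part.
    w_head / w_tail: [num_categories, dim, head_dim] projection weights.
    Returns [num_categories, seq_len, seq_len] scores, where entry (c, i, j)
    scores the span from position i to position j for object category c."""
    q = torch.einsum("nd,cde->cne", hidden, w_head)  # per-category head projections
    k = torch.einsum("nd,cde->cne", hidden, w_tail)  # per-category tail projections
    return torch.einsum("cne,cme->cnm", q, k)        # head-tail inner products

def pick_target_span(scores, category):
    """(head, tail) indices of the highest-scoring span for one category, with head <= tail."""
    s = scores[category].clone()
    invalid = torch.ones_like(s).tril(-1).bool()     # spans with head > tail are not real subsequences
    s[invalid] = float("-inf")
    flat = s.argmax().item()
    return divmod(flat, s.size(-1))
```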
Optionally, in an embodiment of the present disclosure, the object tagging information includes: the object marking type and the marking object corresponding to the object marking type, the object identification result comprises: an object identification category and an identification object corresponding to the object identification category; accordingly, as shown in fig. 6, the above S503 may be implemented by the following steps:
s604, determining a category identification result of the text sample according to the object marking category carried by the text sample and the object identification category of the text sample.
For example, in this embodiment, the object identification category of the text sample is compared with the object labeling category carried by the text sample, and it is determined whether the category identification of the text sample is correct, that is, the category identification result.
And S605, determining the category identification accuracy of the preset network according to the category identification results of at least two text samples in the text sample set.
Optionally, after the category identification results of the at least two text samples are determined, the number of text samples with correct category identification in the text sample set may be counted, and the percentage of the number of text samples with correct category identification in all the text sample sets is calculated, so as to obtain the category identification accuracy of the preset network.
S606, judging whether the category identification accuracy of the preset network is greater than or equal to a category accuracy threshold value or not; if yes, go to S607; if not, go to step S610 and then to step S601.
As an example, a category accuracy threshold may be preset in the training device, so that after the category identification accuracy of the preset network is determined, the preset network may be compared with the category accuracy threshold, and then the subsequent operation is determined based on the comparison result.
S607, determining the object recognition result of the text sample according to the labeling object corresponding to the object labeling type and the target recognition object corresponding to the object recognition type.
As an example, when the category identification accuracy of the preset network is greater than or equal to the category accuracy threshold, the category identification metric of the preset network has met the requirement, and the object identification result of the text sample is then calculated.
Optionally, in this embodiment, the identification object corresponding to an object identification category in the text sample may be compared with the labeled object corresponding to the object labeling category in the text sample, and it is determined whether the identification object and the labeled object of the text sample are consistent, so as to obtain the object identification result.
S608, determining the object recognition accuracy of the preset network according to the object recognition results of at least two text samples in the text sample set.
Optionally, after the object identification results of at least two text samples are determined, the number of text samples with correct object identification in the text sample set may be counted, and the percentage of that number among all text samples in the text sample set is calculated, so as to obtain the object identification accuracy of the preset network.
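A minimal sketch of how the two accuracies of S606 and S608 might be computed and compared is given below; the per-sample comparison results and the threshold values are placeholders, not values taken from this disclosure:

    def accuracy(flags):
        # flags: one boolean per text sample, True when identification was correct
        return sum(flags) / len(flags) if flags else 0.0

    # hypothetical per-sample comparison results
    category_ok = [True, True, False, True]   # predicted category == labelled category
    object_ok = [True, False, True, True]     # predicted target object == labelled object

    CATEGORY_ACC_THRESHOLD = 0.90   # placeholder category accuracy threshold
    OBJECT_ACC_THRESHOLD = 0.85     # placeholder object accuracy threshold

    if accuracy(category_ok) >= CATEGORY_ACC_THRESHOLD and accuracy(object_ok) >= OBJECT_ACC_THRESHOLD:
        print("both thresholds reached: keep the current parameters as the object recognition model")
    else:
        print("adjust the parameters of the preset network and train for another round")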
S609, judging whether the object identification accuracy of the preset network is greater than or equal to the object accuracy threshold; if yes, go to S611; if not, go to step S610 and then to step S601.
As an example, an object accuracy threshold may be preset in the training device, so that after the object recognition accuracy of the preset network is determined, the object recognition accuracy may be compared with the object accuracy threshold, and then the subsequent operation may be determined based on the comparison result.
S610, adjusting parameters of a preset network.
And S611, obtaining the object recognition model.
For example, when the category identification accuracy of the preset network is smaller than the category accuracy threshold and/or the object identification accuracy of the preset network is smaller than the object accuracy threshold, the current parameter values of the preset network cannot meet the preset accuracy requirements. Therefore, the parameters of the preset network are adjusted and the above steps are executed in a loop until the category identification accuracy of the preset network is greater than or equal to the category accuracy threshold and the object identification accuracy of the preset network is greater than or equal to the object accuracy threshold, so as to obtain the object identification model.
Further, in the model training phase of the embodiment of the present disclosure, in order to ensure the generalization of the general NER extraction, a Fast Gradient Method (FGM) is added for adversarial training: a. the parameters of the preset network are kept unchanged, and for each text sample, a tiny perturbation Δx is added to the text sample at the embedding layer of the preset network, so that an adversarial sample corresponding to each text sample is obtained, that is, an adversarial sample that maximizes the loss of the preset network; b. the adversarial samples are input into the preset network, gradient descent is performed to minimize the loss, and the parameters θ of the preset network are further optimized. Steps a and b are executed repeatedly, forming the adversarial training, which ensures the generalization of object recognition.
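The following is a commonly used PyTorch-style sketch consistent with steps a and b above; the embedding parameter name word_embeddings and the perturbation size epsilon are assumptions made for the example:

    import torch

    class FGM:
        # perturbs the embedding layer along the gradient direction to build adversarial samples
        def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
            self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
            self.backup = {}

        def attack(self):
            for name, param in self.model.named_parameters():
                if param.requires_grad and self.emb_name in name and param.grad is not None:
                    self.backup[name] = param.data.clone()
                    norm = torch.norm(param.grad)
                    if norm != 0 and not torch.isnan(norm):
                        param.data.add_(self.epsilon * param.grad / norm)  # the tiny perturbation Δx

        def restore(self):
            for name, param in self.model.named_parameters():
                if name in self.backup:
                    param.data = self.backup[name]
            self.backup = {}

    # one training step (sketch):
    #   loss = model(batch).loss; loss.backward()       # normal forward/backward
    #   fgm.attack()                                     # step a: perturb the embeddings
    #   model(batch).loss.backward()                     # step b: accumulate the adversarial gradient
    #   fgm.restore(); optimizer.step(); optimizer.zero_grad()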
In this embodiment, Named Entity Recognition (NER) is performed with a global normalization idea, so that nested and non-nested entities can be recognized without distinction; a better effect is obtained in the non-nested (flat) NER case, and a good effect is also obtained in the nested NER case. In addition, the design of GlobalPointer is theoretically sound, and in practice its implementation is fully parallel with low complexity.
In this embodiment, a BERT + GlobalPointer model is used to train the NER task; the results are aggregated after inference is completed, and the final result is obtained by taking the maximum score. Extremely high accuracy can be obtained for each object type, the consistency between the training and inference tasks greatly improves the recall rate, and FGM adversarial training greatly improves the generalization capability of the model.
Fig. 7 is a flowchart illustrating a recognition model training method according to a third embodiment of the present disclosure. In the embodiment of the present disclosure, as shown in fig. 7, the above S501 may be implemented by the following steps:
S701, acquiring a document sample set.
Optionally, in this embodiment, the training device may obtain a document sample set from other devices or from its own storage, where the document samples in the document sample set may be documents in various formats, for example, a PDF document, a Word document, or a plain text document. The format of the document sample is not limited in this embodiment.
S702, determining whether the document samples in the document sample set are plain text documents; if not, executing S703 first, and then executing S704; if yes, go to step S704.
Optionally, in practical application, the preset network is a structure of BERT plus a global pointer, and its general processing object is a text sequence. Therefore, after the document sample set is obtained, it is first detected whether the document samples in the document sample set are plain text documents; if yes, the documents are serialized directly, and if not, the non-plain text documents are first converted into plain text documents and then serialized.
And S703, converting the non-plain text documents in the document sample set into plain text documents to obtain a text document sample set.
As an example, in response to a non-plain text document existing in the sample set of documents, the non-plain text documents in the sample set of documents are converted into plain text documents, resulting in a sample set of text documents.
Exemplarily, a PDF document is already a paged document, whereas a Word document has no fixed pagination and may paginate differently depending on the software that opens it. Therefore, in order to improve processing accuracy and model applicability, a Word document parser may be used to parse the Word document to obtain a plain text document; similarly, a PDF document is parsed with a PDF document parser, or the PDF document is first converted into a Word document and then parsed with the Word document parser, to obtain a plain text document.
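Purely as an illustration, converting Word and PDF samples to plain text could be sketched as follows, assuming the third-party python-docx and pdfminer.six packages are available; dispatching on the file extension is a simplification of the parsers described above:

    from docx import Document                      # python-docx, assumed available
    from pdfminer.high_level import extract_text   # pdfminer.six, assumed available

    def to_plain_text(path: str) -> str:
        if path.lower().endswith(".docx"):
            # Word document parser: concatenate paragraph text
            return "\n".join(p.text for p in Document(path).paragraphs)
        if path.lower().endswith(".pdf"):
            # PDF document parser
            return extract_text(path)
        # already a plain text document
        with open(path, encoding="utf-8") as f:
            return f.read()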
And S704, serializing the plain text documents in the text document sample set to obtain a text sequence corresponding to the plain text documents.
Optionally, when it is determined that all document samples in the text document sample set are plain text documents, the text therein may be extracted and all the text content processed into one long plain-text character string to obtain the text sequence corresponding to each plain text document; the number of characters may range from hundreds to tens of thousands.
In addition, since the maximum sequence length that can be processed by the conventional key information extraction technology is limited, in this embodiment the text sequence corresponding to the plain text document is split into a plurality of text samples by a sliding window.
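A minimal sketch of such sliding-window splitting is given below; the window length and sliding step are placeholder values, with the window length kept greater than or equal to the step so that adjacent text samples overlap:

    def sliding_window(text: str, window: int = 510, stride: int = 400) -> list:
        # split one long text sequence into overlapping text samples
        if len(text) <= window:
            return [text]
        samples, start = [], 0
        while start < len(text):
            samples.append(text[start:start + window])
            if start + window >= len(text):
                break
            start += stride
        return samples

    # a 1200-character sequence with window=510 and stride=400 yields samples
    # covering [0:510], [400:910] and [800:1200]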
S705, obtaining object labeling information of the text sequence corresponding to the plain text document to obtain a text sample set.
Wherein the object labeling information is labeled based on the context of the object.
Optionally, in the training stage of this embodiment, annotators may first label the objects in the text samples and then input the text samples carrying the object labeling information into the preset network for model training, so that the preset network can learn the object information features in the text samples. Since the text samples in the text sample set can be trained independently, multiple text samples can be trained in parallel at this step, which greatly improves the efficiency of model training.
It can be understood that when a text sample contains no object, it can be regarded as a negative text sample for data enhancement.
Optionally, the embodiment of the present disclosure adopts a labeling method that preferentially preserves the context of the entity. If, for example, the "announcement mechanism" objects were instead labeled at the finest granularity, with the 5 announcement mechanisms labeled separately, the context semantics near the object would not be learned during model training, and model inference would then produce a large number of false recalls. For example, when "department" and "medical room" appear in a text sample, they may be falsely recalled even though neither is an "announcement mechanism".
Therefore, the embodiment of the present disclosure provides a labeling method that preferentially preserves the context of the object: although the names of the "announcement mechanisms" differ, announcement documents have similar contexts, so a plurality of sub-objects sharing the same context can be labeled as one identification object. When trained and at inference time, the model can then learn the whole object satisfying this context. In the post-processing after the whole object is obtained by inference, the whole object can be split by spaces to obtain each finer-grained object, finally meeting the service requirement.
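The space-based post-processing mentioned above can be sketched as follows (illustrative only; the entity names in the example are placeholders):

    def split_whole_object(whole_object: str) -> list:
        # a whole object labelled over one shared context may contain several
        # finer-grained sub-objects separated by spaces
        parts = [p for p in whole_object.split(" ") if p]
        return parts if len(parts) > 1 else [whole_object]

    # e.g. split_whole_object("Institution-A Institution-B Institution-C")
    # returns ["Institution-A", "Institution-B", "Institution-C"]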
In the embodiment of the disclosure, a labeling method that preferentially preserves the context of the entity is adopted during labeling, so that entities under a fixed context can be accurately learned and predicted in the training and inference stages, which solves the problem of false recalls of fine-grained entities.
The following explains the overall scheme of the embodiment of the present disclosure with a specific example. Fig. 8 is a schematic structural diagram provided in an embodiment of the present disclosure. For explanation, the document to be processed is taken as a Word document whose corresponding text sequence exceeds the preset processing length. As shown in fig. 8, the architecture diagram may include four parts: a Word parsing module 801, a preprocessing module 802, a model training module 803, and an object recognition module 804.
The Word parsing module 801 mainly converts a Word document into a plain text document, and processes the plain text document into a text which can be used for labeling.
The preprocessing module 802 performs data preprocessing, mainly sliding-window processing on the plain text obtained by the Word parsing module 801, splitting it into a plurality of texts (entity labeling is performed after splitting in the training stage), so as to generate the input format required for model training and inference.
Illustratively, fig. 9 is a flow chart illustrating a process of splitting a plain text document. As shown in fig. 9, the plain text document is subjected to segmentation processing to obtain n parts included in the plain text document, that is, subsamples corresponding to the plain text document.
As shown in fig. 8 and 9, the model training module 803 is a simple version of automated modeling, which includes functions such as model parameter adjustment, global pointer model training, model publishing, and the like, and implements training through data (with labels) provided by the preprocessing module 802 to obtain a model file, i.e., an object recognition model (BERT + global pointer model).
The object recognition module 804 is applied in the inference stage, and inputs the subsamples corresponding to the plain text documents into the object recognition model to obtain an object recognition set.
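Tying the four modules together, an inference-stage sketch (reusing the illustrative helpers sketched above and a hypothetical model.predict interface that yields (category, object, score) triples) might look like this:

    def recognize_document(path, model):
        text = to_plain_text(path)                    # Word parsing module 801
        merged = {}
        for sample in sliding_window(text):           # preprocessing module 802
            for category, obj, score in model.predict(sample):   # object recognition module 804
                best = merged.get(category)
                if best is None or score > best[1]:
                    merged[category] = (obj, score)   # aggregate by maximum score
        # final recognition result: the target identification object of every category
        return {category: obj for category, (obj, _) in merged.items()}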
For specific implementation of each module, reference may be made to the descriptions in the above embodiments, and details are not described here.
Fig. 10 is a schematic structural diagram of a document processing apparatus according to an embodiment of the disclosure. The document processing apparatus provided by the embodiment may be the processing device in fig. 1 or an apparatus in a processing device. As shown in fig. 10, a document processing apparatus 1000 provided in an embodiment of the present disclosure may include:
an acquisition unit 1001 configured to acquire a document to be processed;
a processing unit 1002, configured to process the to-be-processed document to obtain an identification object set in the to-be-processed document, where the identification object set includes: an object category, an identification object included in the object category, and an identification score of the identification object;
a determining unit 1003, configured to determine an identification result of the to-be-processed document according to the identification object included in the object category in the identification object set and the identification score of the identification object.
In a possible implementation of this embodiment, the processing unit 1002 includes:
the splitting module is used for splitting the text sequence corresponding to the document to be processed to obtain at least one text to be processed;
the recognition module is used for inputting the at least one text to be processed into a pre-trained object recognition model and determining a recognition object set in the document to be processed, and the training principle of the object recognition model comprises the following steps: the converter-based bi-directional encoding characterizes BERT and global pointers.
Optionally, the splitting module is specifically configured to split a text sequence corresponding to the document to be processed based on a preset sliding window length and a preset sliding step length to obtain at least one text to be processed, where the sliding window length is greater than or equal to the sliding step length.
In a possible implementation of this embodiment, the determining unit 1003 includes:
the first determining module is used for determining the number of the identification objects included in the object category;
the second determination module is used for determining a target identification object in the at least two identification objects according to the identification scores of the at least two identification objects in response to the object category comprising the at least two identification objects;
a third determining module, configured to determine that the recognition result of the to-be-processed document includes the target recognition object in the object class.
In one possible implementation of this embodiment, the document processing apparatus further includes:
a detecting unit (not shown) for determining whether an entire object exists in the recognition objects included in the recognition result, the entire object including at least two sub-objects having the same context;
and a segmentation unit (not shown) configured to perform object segmentation on the whole object in response to that the whole object exists in the identification object, so as to obtain a final identification result of the to-be-processed document.
In one possible implementation of this embodiment, the document processing apparatus further includes:
a detection unit (not shown) for determining whether the document to be processed is a plain text document;
a parsing unit (not shown) for:
responding to the fact that the document to be processed is a non-plain text document, analyzing the document to be processed to obtain a plain text document corresponding to the document to be processed;
and processing the plain text document to obtain a text sequence corresponding to the document to be processed.
The document processing apparatus provided in this embodiment may be configured to execute the document processing method executed by the processing device in any method embodiment described above, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 11 is a schematic structural diagram of a recognition model training apparatus according to an embodiment of the present disclosure. The recognition model training device provided in this embodiment may be the training apparatus in fig. 1 or a device in the training apparatus. As shown in fig. 11, the recognition model training apparatus 1100 provided in the embodiment of the present disclosure may include:
an obtaining unit 1101, configured to obtain a text sample set, where text samples in the text sample set carry object tagging information;
the processing unit 1102 is configured to input the text samples in the text sample set to a preset network, so as to obtain an object recognition result of the text samples, where a target recognition object corresponding to an object recognition category in the object recognition result is determined based on a recognition score;
an adjusting unit 1103, configured to adjust a parameter of the preset network according to the object labeling information carried by the text sample and the object identification result of the text sample, to obtain an object identification model.
In a possible implementation of this embodiment, the preset network includes: characterizing a BERT portion and a global pointer portion based on bi-directional encoding of the converter;
accordingly, the processing unit 1102 includes:
the first processing module is used for carrying out object recognition on the text samples in the text sample set by utilizing the bidirectional coding representation BERT part based on the converter and determining all recognition objects included in the text samples;
the second processing module is used for classifying all the identification objects included in the text sample based on the global pointer part and determining an object identification category included in the text sample and at least one identification object included in the object identification category;
and the third processing module is used for determining a target identification object corresponding to the object identification category according to the identification score of the at least one identification object in the object identification category.
Optionally, the object tagging information includes: the object labeling type and the labeling object corresponding to the object labeling type, wherein the object identification result comprises: the object identification category and the target identification object corresponding to the object identification category;
correspondingly, the adjusting unit 1103 includes:
the first determining module is used for determining a category identification result of the text sample according to the object marking category carried by the text sample and the object identification category of the text sample;
the second determining module is used for determining the category identification accuracy of the preset network according to the category identification results of at least two text samples in the text sample set;
a third determining module, configured to determine, in response to that the category identification accuracy of the preset network is greater than or equal to a category accuracy threshold, an object identification result of the text sample according to a label object corresponding to the object label category and a target identification object corresponding to the object identification category;
the fourth determining module is used for determining the object recognition accuracy of the preset network according to the object recognition results of at least two text samples in the text sample set;
a fifth determining module, configured to adjust a parameter of the preset network in response to that the category identification accuracy of the preset network is less than the category accuracy threshold and/or that the object identification accuracy of the preset network is less than the object accuracy threshold, until the category identification accuracy of the preset network is greater than or equal to the category accuracy threshold and the object identification accuracy of the preset network is greater than or equal to the object accuracy threshold, so as to obtain an object identification model.
In a possible implementation of the embodiment of the present disclosure, the obtaining unit 1101 includes:
the acquisition module is used for acquiring a document sample set;
the detection module is used for determining whether the document samples in the document sample set are plain text documents;
the conversion module is used for responding to the existence of the non-plain text documents in the document sample set, converting the non-plain text documents in the document sample set into plain text documents, and obtaining a text document sample set;
the serialization module is used for serializing the plain text documents in the text document sample set to obtain text sequences corresponding to the plain text documents;
the obtaining module is further configured to obtain object labeling information of a text sequence corresponding to the plain text document to obtain the text sample set, where the object labeling information is labeled based on an object context.
The recognition model training apparatus provided in this embodiment may be configured to execute the recognition model training method executed by the training device in any of the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.
FIG. 12 shows a schematic block diagram of an example electronic device used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the device 1200 can also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205, including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 1201 performs the respective methods and processes described above, such as a document processing method, a recognition model training method. For example, in some embodiments, the document processing method, the recognition model training method, may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 1200 via ROM 1202 and/or communications unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the document processing method, the recognition model training method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured in any other suitable manner (e.g., by means of firmware) to perform a document processing method, a recognition model training method.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A document processing method, comprising:
acquiring a document to be processed;
processing the document to be processed to obtain an identification object set in the document to be processed, wherein the identification object set comprises: an object category, an identification object included in the object category, and an identification score of the identification object;
and determining the recognition result of the document to be processed according to the recognition objects included in the object categories in the recognition object set and the recognition scores of the recognition objects.
2. The method according to claim 1, wherein the processing the document to be processed to obtain the set of identification objects in the document to be processed comprises:
splitting a text sequence corresponding to the document to be processed to obtain at least one text to be processed;
inputting the at least one text to be processed into a pre-trained object recognition model, and determining a recognition object set in the document to be processed, wherein the training principle of the object recognition model comprises the following steps: the converter-based bi-directional encoding characterizes BERT and global pointers.
3. The method according to claim 2, wherein the splitting the text sequence corresponding to the document to be processed to obtain at least one text to be processed comprises:
splitting a text sequence corresponding to the document to be processed based on a preset sliding window length and a preset sliding step length to obtain at least one text to be processed, wherein the sliding window length is greater than or equal to the sliding step length.
4. The method according to any one of claims 1 to 3, wherein the determining the recognition result of the document to be processed according to the recognition objects included in the object categories in the recognition object set and the recognition scores of the recognition objects comprises:
determining the number of the identification objects included in the object category;
in response to the object category comprising at least two identification objects, determining a target identification object in the at least two identification objects according to the identification scores of the at least two identification objects;
determining that the recognition result of the document to be processed includes the target recognition object in the object class.
5. The method of any of claims 1 to 4, further comprising:
determining whether an overall object exists in the identified objects included in the identification result, wherein the overall object comprises at least two sub-objects with the same context;
and responding to the existence of the whole object in the identification object, and performing object segmentation on the whole object to obtain a final identification result of the document to be processed.
6. The method according to any one of claims 1 to 5, before processing the document to be processed to obtain the set of identified objects in the document to be processed, further comprising:
determining whether the document to be processed is a plain text document;
responding to the fact that the document to be processed is a non-plain text document, analyzing the document to be processed to obtain a plain text document corresponding to the document to be processed;
and processing the plain text document to obtain a text sequence corresponding to the document to be processed.
7. A recognition model training method, comprising:
acquiring a text sample set, wherein text samples in the text sample set carry object labeling information;
inputting the text samples in the text sample set into a preset network to obtain object identification results of the text samples, wherein the target identification objects corresponding to the object identification categories in the object identification results are determined based on identification scores;
and adjusting parameters of the preset network according to the object marking information carried by the text sample and the object identification result of the text sample to obtain an object identification model.
8. The method of claim 7, the pre-set network comprising: characterizing a BERT portion and a global pointer portion based on bi-directional encoding of the converter;
the step of inputting the text samples in the text sample set to a preset network to obtain the object recognition results of the text samples includes:
performing object recognition on the text samples in the text sample set by using the bidirectional coding representation BERT part based on the converter, and determining all recognition objects included in the text samples;
classifying all identification objects included in the text sample based on the global pointer part, and determining an object identification category included in the text sample and at least one identification object included in the object identification category;
and determining a target identification object corresponding to the object identification category according to the identification score of the at least one identification object in the object identification category.
9. The method of claim 7 or 8, wherein the object annotation information comprises: the object labeling type and the labeling object corresponding to the object labeling type, wherein the object identification result comprises: the object identification category and the target identification object corresponding to the object identification category;
the adjusting the parameters of the preset network according to the object labeling information carried by the text sample and the object identification result of the text sample to obtain an object identification model comprises:
determining a category identification result of the text sample according to the object marking category carried by the text sample and the object identification category of the text sample;
determining the category identification accuracy of the preset network according to the category identification results of at least two text samples in the text sample set;
in response to the fact that the category identification accuracy of the preset network is larger than or equal to a category accuracy threshold, determining an object identification result of the text sample according to the labeled object corresponding to the object labeling category and the target identification object corresponding to the object identification category;
determining the object identification accuracy of the preset network according to the object identification results of at least two text samples in the text sample set;
and in response to the fact that the class identification accuracy of the preset network is smaller than a class accuracy threshold and/or the object identification accuracy of the preset network is smaller than an object accuracy threshold, adjusting parameters of the preset network until the class identification accuracy of the preset network is larger than or equal to the class accuracy threshold and the object identification accuracy of the preset network is larger than or equal to the object accuracy threshold, and obtaining an object identification model.
10. The method of any of claims 7 to 9, wherein the obtaining a text sample set comprises:
acquiring a document sample set;
determining whether document samples in the document sample set are plain text documents;
responding to the existence of non-plain text documents in the document sample set, converting the non-plain text documents in the document sample set into plain text documents to obtain a text document sample set;
serializing plain text documents in the text document sample set to obtain text sequences corresponding to the plain text documents;
and acquiring object labeling information of a text sequence corresponding to the plain text document to obtain the text sample set, wherein the object labeling information is labeled based on an object context.
11. A document processing apparatus comprising:
the acquisition unit is used for acquiring a document to be processed;
a processing unit, configured to process the to-be-processed document to obtain an identification object set in the to-be-processed document, where the identification object set includes: an object category, an identification object included in the object category, and an identification score of the identification object;
and the determining unit is used for determining the identification result of the document to be processed according to the identification objects included in the object categories in the identification object set and the identification scores of the identification objects.
12. The apparatus of claim 11, wherein the processing unit comprises:
the splitting module is used for splitting the text sequence corresponding to the document to be processed to obtain at least one text to be processed;
the recognition module is used for inputting the at least one text to be processed into a pre-trained object recognition model and determining a recognition object set in the document to be processed, and the training principle of the object recognition model comprises the following steps: the converter-based bi-directional encoding characterizes BERT and global pointers.
13. The apparatus according to claim 12, wherein the splitting module is specifically configured to split a text sequence corresponding to the document to be processed based on a preset sliding window length and a sliding step length to obtain at least one text to be processed, where the sliding window length is greater than or equal to the sliding step length.
14. The apparatus according to any one of claims 11 to 13, wherein the determining unit comprises:
the first determining module is used for determining the number of the identification objects included in the object category;
a second determining module, configured to determine, in response to that the object category includes at least two identification objects, a target identification object in the at least two identification objects according to the identification scores of the at least two identification objects;
a third determining module, configured to determine that the recognition result of the to-be-processed document includes the target recognition object in the object class.
15. The apparatus of any of claims 11 to 14, further comprising:
a detection unit, configured to determine whether an entire object exists in the recognition objects included in the recognition result, where the entire object includes at least two sub-objects having the same context;
and the segmentation unit is used for responding to the existence of the whole object in the identification object, and performing object segmentation on the whole object to obtain a final identification result of the document to be processed.
16. The apparatus of any of claims 11 to 15, further comprising:
the detection unit is used for determining whether the document to be processed is a plain text document;
an analysis unit configured to:
responding to the fact that the document to be processed is a non-plain text document, analyzing the document to be processed to obtain a plain text document corresponding to the document to be processed;
and processing the plain text document to obtain a text sequence corresponding to the document to be processed.
17. A recognition model training apparatus comprising:
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a text sample set, and text samples in the text sample set carry object labeling information;
the processing unit is used for inputting the text samples in the text sample set into a preset network to obtain object recognition results of the text samples, and target recognition objects corresponding to object recognition categories in the object recognition results are determined based on recognition scores;
and the adjusting unit is used for adjusting the parameters of the preset network according to the object marking information carried by the text sample and the object identification result of the text sample to obtain an object identification model.
18. The apparatus of claim 17, the pre-set network comprising: characterizing a BERT portion and a global pointer portion based on bi-directional encoding of the converter;
the processing unit includes:
the first processing module is used for carrying out object recognition on the text samples in the text sample set by utilizing the bidirectional coding representation BERT part based on the converter and determining all recognition objects included in the text samples;
the second processing module is used for classifying all the identification objects included in the text sample based on the global pointer part and determining an object identification category included in the text sample and at least one identification object included in the object identification category;
and the third processing module is used for determining a target identification object corresponding to the object identification category according to the identification score of the at least one identification object in the object identification category.
19. The apparatus of claim 17 or 18, wherein the object annotation information comprises: the object labeling type and the labeling object corresponding to the object labeling type, wherein the object identification result comprises: the object identification category and the target identification object corresponding to the object identification category;
the adjusting unit includes:
the first determining module is used for determining a category identification result of the text sample according to the object marking category carried by the text sample and the object identification category of the text sample;
the second determining module is used for determining the category identification accuracy of the preset network according to the category identification results of at least two text samples in the text sample set;
a third determining module, configured to determine, in response to that the category identification accuracy of the preset network is greater than or equal to a category accuracy threshold, an object identification result of the text sample according to a labeled object corresponding to the object labeling category and a target identification object corresponding to the object identification category;
the fourth determining module is used for determining the object recognition accuracy of the preset network according to the object recognition results of at least two text samples in the text sample set;
a fifth determining module, configured to adjust a parameter of the preset network in response to that the category identification accuracy of the preset network is less than the category accuracy threshold and/or that the object identification accuracy of the preset network is less than the object accuracy threshold, until the category identification accuracy of the preset network is greater than or equal to the category accuracy threshold and the object identification accuracy of the preset network is greater than or equal to the object accuracy threshold, so as to obtain an object identification model.
20. The apparatus of any one of claims 17 to 19, wherein the obtaining unit comprises:
the acquisition module is used for acquiring a document sample set;
the detection module is used for determining whether the document samples in the document sample set are plain text documents;
the conversion module is used for responding to the existence of the non-plain text documents in the document sample set, converting the non-plain text documents in the document sample set into plain text documents, and obtaining a text document sample set;
the serialization module is used for serializing the plain text documents in the text document sample set to obtain text sequences corresponding to the plain text documents;
the obtaining module is further configured to obtain object labeling information of a text sequence corresponding to the plain text document to obtain the text sample set, where the object labeling information is labeled based on an object context.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6 or to perform the method of any one of claims 7 to 10.
22. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6 or the method of any one of claims 7 to 10.
23. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6 or carries out the steps of the method of any one of claims 7 to 10.
CN202210159137.6A 2022-02-21 2022-02-21 Document processing method, document processing device, recognition model training equipment and storage medium Pending CN114547301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210159137.6A CN114547301A (en) 2022-02-21 2022-02-21 Document processing method, document processing device, recognition model training equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210159137.6A CN114547301A (en) 2022-02-21 2022-02-21 Document processing method, document processing device, recognition model training equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114547301A true CN114547301A (en) 2022-05-27

Family

ID=81676964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210159137.6A Pending CN114547301A (en) 2022-02-21 2022-02-21 Document processing method, document processing device, recognition model training equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114547301A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672251B1 (en) * 2014-09-29 2017-06-06 Google Inc. Extracting facts from documents
US20210383070A1 (en) * 2020-06-03 2021-12-09 Digital Asset Capital, Inc. Semantic map generation from natural-language-text documents
CN112527992A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Long text processing method, related device and readable storage medium
CN112733541A (en) * 2021-01-06 2021-04-30 重庆邮电大学 Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN112861539A (en) * 2021-03-16 2021-05-28 云知声智能科技股份有限公司 Nested named entity recognition method and device, electronic equipment and storage medium
CN113326701A (en) * 2021-06-17 2021-08-31 广州华多网络科技有限公司 Nested entity recognition method and device, computer equipment and storage medium
CN113268569A (en) * 2021-07-19 2021-08-17 中国电子科技集团公司第十五研究所 Semantic-based related word searching method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏剑林: "GlobalPointer: handling nested and non-nested NER in a unified way", pages 1 - 7, Retrieved from the Internet <URL:https://spaces.ac.cn/archives/8373> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130435A (en) * 2022-06-27 2022-09-30 北京百度网讯科技有限公司 Document processing method and device, electronic equipment and storage medium
CN115130435B (en) * 2022-06-27 2023-08-11 北京百度网讯科技有限公司 Document processing method, device, electronic equipment and storage medium
CN116306581A (en) * 2023-05-08 2023-06-23 中新宽维传媒科技有限公司 Event extraction method and device

Similar Documents

Publication Publication Date Title
JP2018503206A (en) Technical and semantic signal processing in large unstructured data fields
CN111428514A (en) Semantic matching method, device, equipment and storage medium
CN112528677B (en) Training method and device of semantic vector extraction model and electronic equipment
CN114547301A (en) Document processing method, document processing device, recognition model training equipment and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN114495143A (en) Text object identification method and device, electronic equipment and storage medium
CN114416976A (en) Text labeling method and device and electronic equipment
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN115017898A (en) Sensitive text recognition method and device, electronic equipment and storage medium
CN112699237B (en) Label determination method, device and storage medium
CN114724156A (en) Form identification method and device and electronic equipment
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
US20230186613A1 (en) Sample Classification Method and Apparatus, Electronic Device and Storage Medium
CN116383382A (en) Sensitive information identification method and device, electronic equipment and storage medium
CN116226315A (en) Sensitive information detection method and device based on artificial intelligence and related equipment
CN115934852A (en) Tax registration address space-time clustering method, device, server and storage medium
CN116416640A (en) Method, device, equipment and storage medium for determining document element
CN113361522B (en) Method and device for determining character sequence and electronic equipment
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN114020904A (en) Test question file screening method, model training method, device, equipment and medium
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN113886543A (en) Method, apparatus, medium, and program product for generating an intent recognition model
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination