WO2022134805A1

WO2022134805A1 - Document classification prediction method and apparatus, and computer device and storage medium

Info

Publication number: WO2022134805A1
Application number: PCT/CN2021/125227
Authority: WO
Inventors: 刘玉; 徐国强
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2020-12-21
Filing date: 2021-10-21
Publication date: 2022-06-30
Also published as: CN112699923A

Abstract

A document classification prediction method and apparatus, and a computer device and a storage medium. The method comprises: receiving a prediction request instruction that contains a target document (S10); performing document parsing on the target document by means of a preset document parsing model, so as to obtain text information corresponding to the target document and coordinate information corresponding to the text information (S20); inputting the text information and the coordinate information into a preset pretrained language model, and performing vector extraction on the text information and the coordinate information, so as to obtain a document representation vector corresponding to the target document (S30); acquiring a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and one sample document vector is associated with one document category (S40); and determining a document vector distance between the document representation vector and each sample document vector, and determining, according to each document vector distance, a document category corresponding to the target document (S50). By means of the method, the efficiency of document classification is improved.

Description

Document classification prediction method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202011521171.0, filed on December 21, 2020 and entitled "Document classification prediction method, device, computer equipment and storage medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of classification models, and in particular, to a document classification prediction method, apparatus, computer equipment and storage medium.

Background

At present, there are tens of thousands of PDF documents in various fields, such as PDF papers in academic fields and PDF data reports in professional fields. As more and more PDF documents are generated, effectively classifying them and predicting the document category of new documents becomes a challenge.

The inventor realized that document classification models in the prior art generally require a large amount of labeled data for training in order to achieve good classification accuracy, and that these models are easily affected by data imbalance: if a certain category has very little training data, the model's accuracy on that category will be low, which lowers the overall document classification accuracy. In addition, manually labeling data takes a lot of time, which hinders the deployment and application of such models in various fields.

Summary of the application

Embodiments of the present application provide a document classification prediction method, apparatus, computer equipment, and storage medium, so as to solve the problem of low document classification accuracy caused by a shortage of manually annotated data.

A document classification prediction method, comprising:

receiving a prediction request instruction containing a target document;

performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

obtaining a sample document vector set; the sample document vector set includes at least one sample document vector; each sample document vector is associated with one document category;

determining a document vector distance between the document representation vector and each of the sample document vectors, and determining a document category corresponding to the target document according to each of the document vector distances.

A document classification prediction device, comprising:

a prediction request instruction receiving module, configured to receive a prediction request instruction containing a target document;

a document parsing module, configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

a first vector extraction module, configured to input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

a document vector set acquisition module, configured to acquire a sample document vector set; the sample document vector set includes at least one sample document vector; each sample document vector is associated with one document category;

a document category determination module, configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.

A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:

Receive a prediction request instruction containing the target document;

One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:

Receive a prediction request instruction containing the target document;

With the above document classification prediction method, device, computer equipment and storage medium, the method receives a prediction request instruction containing a target document; performs document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information; inputs the text information and the coordinate information into a preset pre-trained language model and performs vector extraction on them to obtain a document representation vector corresponding to the target document; obtains a sample document vector set that contains at least one sample document vector, each sample document vector being associated with one document category; and determines a document vector distance between the document representation vector and each sample document vector, and determines the document category corresponding to the target document according to the document vector distances.

The present application introduces the text information of a document together with the corresponding coordinate information, and determines the document category of the target document according to the document vector distance between each sample document vector and the document representation vector extracted from that text information and coordinate information. In this way, new documents can still be classified when there are few sample documents. A new document that does not match any sample document can be treated as a new document category, and as new documents continue to be classified, the number of documents in each document category grows, without any need to keep replacing the preset document parsing model or the preset pre-trained language model. This improves the efficiency and convenience of document classification.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below, and other features and advantages of the application will become apparent from the description, drawings, and claims.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments of the present application. Obviously, the drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of an application environment of a document classification prediction method in an embodiment of the present application;

FIG. 2 is a flowchart of a document classification prediction method in an embodiment of the present application;

FIG. 3 is a flowchart of step S50 in the document classification prediction method in an embodiment of the present application;

FIG. 4 is another flowchart of a document classification prediction method in an embodiment of the present application;

FIG. 5 is a schematic block diagram of a document classification prediction device in an embodiment of the present application;

FIG. 6 is another schematic block diagram of a document classification prediction device in an embodiment of the present application;

FIG. 7 is a schematic block diagram of a document category determination module in a document classification prediction device according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

The document classification prediction method provided by the embodiment of the present application can be applied in the application environment shown in FIG. 1. Specifically, the document classification prediction method is applied in a document classification prediction system. The document classification prediction system includes a client and a server as shown in FIG. 1, which communicate over a network, so as to solve the problem of low document classification accuracy caused by a shortage of manually annotated data. The client, also called the user terminal, refers to a program that corresponds to the server and provides local services for the user. Clients can be installed on, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as an independent server or a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a document classification prediction method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:

S10: Receive a prediction request instruction including the target document;

Understandably, the prediction request instruction may be sent by a preset sender (e.g., the author of the target document or a document manager). In this embodiment, the target document is a document that has a regular title and has not yet been classified. A regular title is a title with several fill-in areas, such as a company name area and a year area; the document creator fills in each area according to the content of the document. An example is a title in the style of "Rongsheng Petrochemical (company name area): 2020 (year area) semi-annual report".

S20: Perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

The preset document parsing model is used to extract the text information and the coordinate information of the target document. Exemplarily, when the target document is a PDF document, the preset document parsing model may be a parsing model based on PyMuPDF (an open-source PDF parsing library). The text information is the text content of the first five pages of the target document. The coordinate information is, for each word in those five pages, the page number it belongs to and its specific position on that page.

Specifically, the text content of the first five pages of the target document is extracted through the preset document parsing model to obtain the text information, and the page number of each word in the text information is associated with its position on that page and recorded as the coordinate information. Understandably, only the first five pages are used for two reasons. First, since the preset document parsing model generally only supports input with a length of 512, the full text of a real PDF cannot be used as input. Second, the first five pages generally contain the title of the document, and the title is an important piece of information for determining the PDF category.
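As a rough illustration of step S20, the sketch below reads the words and their positions from the first five pages of a PDF with PyMuPDF. The names `WordBox`, `words_to_boxes` and `parse_pdf` are hypothetical and are not part of the application; the application does not specify how the parsing model is structured, so this is only one plausible arrangement.

```python
from typing import Iterable, List, NamedTuple, Tuple

MAX_PAGES = 5  # the embodiment reads only the first five pages


class WordBox(NamedTuple):
    """One word plus its coordinate information (hypothetical record type)."""
    text: str
    page: int  # zero-based page index
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page


def words_to_boxes(page: int, words: Iterable[tuple]) -> List[WordBox]:
    """Convert PyMuPDF word tuples (x0, y0, x1, y1, word, ...) to WordBox records."""
    return [WordBox(w[4], page, (w[0], w[1], w[2], w[3])) for w in words]


def parse_pdf(path: str, max_pages: int = MAX_PAGES) -> List[WordBox]:
    """Extract words and their positions from the first `max_pages` pages."""
    import fitz  # PyMuPDF; imported lazily so words_to_boxes works without it

    boxes: List[WordBox] = []
    with fitz.open(path) as doc:
        for page_index in range(min(max_pages, doc.page_count)):
            page = doc[page_index]
            boxes.extend(words_to_boxes(page_index, page.get_text("words")))
    return boxes
```

Keeping the tuple conversion separate from the file access makes the record format easy to check without a PDF at hand.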

S30: Input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

The preset pre-trained language model may be a LayoutLM model.

Specifically, after document parsing is performed on the target document by the preset document parsing model to obtain the text information corresponding to the target document and the coordinate information corresponding to the text information, the text information and the coordinate information are input into the pre-trained language model, which generates a target word sequence corresponding to the target document; the target word sequence represents the words of the target document sorted according to the coordinate information. The target high-order features corresponding to the target word sequence are then determined, and average pooling is performed on the target high-order features to obtain the document representation vector.

S40: Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with one document category;

The sample document vector set is a set of sample document vectors corresponding to each sample document obtained by inputting the sample document into a preset pre-trained language model.

Understandably, after the training of the preset pre-trained language model is completed, all sample documents are input into the preset document parsing model, which parses each sample document to obtain the sample text information corresponding to the sample document and the sample coordinate information corresponding to the sample text information. The sample text information and the sample coordinate information are then input into the preset pre-trained language model, and vector extraction is performed on them to obtain the sample document vector corresponding to each sample document.

Further, after each sample document is acquired, the classification of each sample document can be determined according to the document title associated with the sample document, and then each sample document is classified so that one sample document is associated with one document category.

S50: Determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.

Specifically, after obtaining the sample document vector set, the document vector distance between the document representation vector and each of the sample document vectors is determined, and the document category corresponding to the target document is determined according to each of the document vector distances.

In one embodiment, as shown in FIG. 3, each sample document vector is also associated with a sample document, and determining the document category corresponding to the target document according to each document vector distance includes:

S501: Select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;

Wherein, the preset number may be determined according to a specific scenario, and for example, the preset number may be 10, 20, etc. The preset distance threshold can be 0.5, 0.7, etc.

Understandably, after determining the document vector distance between the document representation vector and each of the sample document vectors, a preset number of sample documents whose document vector distance is less than or equal to a preset distance threshold are selected as candidate documents. When the number of sample documents whose document vector distance is less than or equal to the preset distance threshold does not meet the preset number, all sample documents that satisfy the condition that the document vector distance is less than or equal to the preset distance threshold may be used as candidate documents.

Further, if all the document vector distances are greater than the preset distance threshold, the document categories currently associated with the sample documents cannot characterize the document category of the target document. In that case, a new document category is established according to the document title of the target document, and the target document is classified under the new document category. When a prediction request instruction containing a new target document is next received, if the document vector distance between the document representation vector of the new target document and that of the earlier target document is less than or equal to the preset distance threshold, the document category of the earlier target document can be used as the document category of the new target document, which improves the efficiency of document classification.

S502: Obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.

It can be understood that, after a preset number of sample documents are selected from the sample documents whose document vector distance is less than or equal to the preset distance threshold and recorded as candidate documents, the proportion of candidate documents of each document category among all the candidate documents is obtained, and the document category with the highest proportion is recorded as the document category of the target document.
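Steps S501 and S502 can be sketched as a thresholded nearest-neighbour vote. The sketch assumes Euclidean distance at prediction time and assumes that the preset number of candidates is taken from the nearest sample documents first; the embodiment does not state either choice, and the function names are hypothetical. Returning `None` stands in for the case where every distance exceeds the threshold and a new document category would be established.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(
    doc_vector: Sequence[float],
    samples: List[Tuple[Sequence[float], str]],
    max_distance: float = 0.5,   # preset distance threshold (example value)
    max_candidates: int = 10,    # preset number of candidates (example value)
) -> Optional[str]:
    """Return the most frequent category among the candidate documents.

    Returns None when no sample document lies within max_distance.
    """
    scored = [(euclidean(doc_vector, vec), cat) for vec, cat in samples]
    # S501: keep samples within the threshold, nearest first (assumption).
    within = sorted(pair for pair in scored if pair[0] <= max_distance)
    candidates = within[:max_candidates]
    if not candidates:
        return None
    # S502: the category with the highest share of candidates wins.
    counts = Counter(cat for _, cat in candidates)
    return counts.most_common(1)[0][0]
```

Because all candidates share the same denominator, picking the highest count is equivalent to picking the highest proportion.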

In this embodiment, the text information of a document is introduced together with the corresponding coordinate information, and the document category of the target document is determined according to the document vector distance between each sample document vector and the document representation vector extracted from that text information and coordinate information. In this way, new documents can still be classified when there are few sample documents. A new document that does not match any sample document can be treated as a new document category, and as new documents continue to be classified, the number of documents in each document category grows, without any need to keep replacing the preset document parsing model or the preset pre-trained language model. This improves the efficiency and convenience of document classification.

In one embodiment, as shown in FIG. 4 , before the inputting the text information and the coordinate information into the preset pre-trained language model, the method further includes:

S01: Acquire a training document triplet; the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;

Here, a positive sample document is a document that has the same document category as the training document, and a negative sample document is a document that does not have the same document category as the training document.
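One way to assemble such a triplet from categorized sample documents is sketched below. The application does not say how triplets are chosen, so uniform random sampling is an assumption, and `sample_triplet` and the document identifiers are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def sample_triplet(
    docs_by_category: Dict[str, List[str]], rng: random.Random
) -> Tuple[str, str, str]:
    """Pick (training, positive, negative) document ids at random."""
    non_empty = [c for c, docs in docs_by_category.items() if docs]
    eligible = [c for c in non_empty if len(docs_by_category[c]) >= 2]
    if not eligible or len(non_empty) < 2:
        raise ValueError("need one category with two documents and a second category")
    category = rng.choice(eligible)
    # Training document and positive sample share the same category.
    training, positive = rng.sample(docs_by_category[category], 2)
    # Negative sample comes from any other non-empty category.
    other = rng.choice([c for c in non_empty if c != category])
    negative = rng.choice(docs_by_category[other])
    return training, positive, negative
```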

S02: Input the training document triplet into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;

Exemplarily, the initial language model may be a LayoutLM model. A detailed explanation of this step can be found in the following embodiment.

In one embodiment, inputting the training document triplet into an initial language model including initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, includes:

S011: Extract the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document;

The word sequence refers to the words in the training document, the positive sample document or the negative sample document together with their order. Exemplarily, assume that after the word sequences of the three documents are extracted, the obtained training word sequence is

$$(w^a_1, w^a_2, \ldots, w^a_x)$$

where $a$ denotes the training document and $x$ is the length of the word sequence of the training document. Since the initial language model needs to distinguish the beginning of a document ([CLS] below) from its end ([SEP] below), the final training word sequence is

$$([\mathrm{CLS}], w^a_1, w^a_2, \ldots, w^a_x, [\mathrm{SEP}])$$

In the same way, assume that the obtained positive sample word sequence is

$$(w^p_1, w^p_2, \ldots, w^p_y)$$

where $p$ denotes the positive sample document and $y$ is the word sequence length of the positive sample document; the final positive sample word sequence is

$$([\mathrm{CLS}], w^p_1, w^p_2, \ldots, w^p_y, [\mathrm{SEP}])$$

In the same way, assume that the obtained negative sample word sequence is

$$(w^n_1, w^n_2, \ldots, w^n_s)$$

where $n$ denotes the negative sample document and $s$ is the word sequence length of the negative sample document; the final negative sample word sequence is

$$([\mathrm{CLS}], w^n_1, w^n_2, \ldots, w^n_s, [\mathrm{SEP}])$$
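The construction of a final word sequence can be sketched in a few lines. Truncating to 512 entries is an assumption drawn from the input length mentioned for step S20; the embodiment does not say how over-long sequences are handled, and `build_word_sequence` is a hypothetical helper.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"
MAX_LENGTH = 512  # input length mentioned in the description


def build_word_sequence(words: List[str], max_length: int = MAX_LENGTH) -> List[str]:
    """Wrap words in [CLS] ... [SEP], truncating so the total fits max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    return [CLS] + words[: max_length - 2] + [SEP]
```

The same helper would apply unchanged to the training, positive sample and negative sample documents.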

S012: Determine the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;

Specifically, the initial language model determines a high-order feature representation for the i-th word of each word sequence, where i indexes the words. The features obtained from the training word sequence are the training high-order features, those obtained from the positive sample word sequence are the positive sample high-order features, and those obtained from the negative sample word sequence are the negative sample high-order features.

S013: Perform an average pooling process on the training high-order features, the positive sample high-order features, and the negative sample high-order features, respectively, to obtain the first training vector, the second training vector, and the third training vector.

Specifically, after the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence are determined, average pooling is used to obtain the first training vector, the second training vector and the third training vector.

Optionally, each vector can be obtained by applying the average pooling function $\mathrm{MEAN\_POOLING}_i(\cdot)$ to the corresponding high-order features, where $i$ denotes the i-th word; $S_a$ is the first training vector, $S_p$ is the second training vector, and $S_n$ is the third training vector.
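Average pooling over per-word features amounts to a dimension-wise mean. The following minimal sketch uses plain Python lists for the feature vectors; in practice these would be the hidden states produced by the language model, and `mean_pooling` is a hypothetical name.

```python
from typing import List, Sequence


def mean_pooling(features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-word feature vectors dimension by dimension."""
    if not features:
        raise ValueError("at least one feature vector is required")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same dimension")
    count = len(features)
    return [sum(f[d] for f in features) / count for d in range(dim)]
```

For instance, pooling the two word features `[1.0, 2.0]` and `[3.0, 4.0]` gives the document-level vector `[2.0, 3.0]`.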

S03: Determine a total loss value of the language model according to the first training vector, the second training vector and the third training vector.

Specifically, after average pooling is performed on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively to obtain the first training vector, the second training vector and the third training vector, the total loss value of the language model is determined according to the first training vector, the second training vector and the third training vector.

In one embodiment, in step S03, determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector includes:

determining the first document distance between the first training vector and the second training vector, and determining the second document distance between the first training vector and the third training vector;

determining the total loss value through a triplet loss function according to the first document distance and the second document distance.

The first document distance and the second document distance are essentially Euclidean distances.

Specifically, the total loss value can be determined according to the following triplet loss function:

$$L = \max\left(\lVert S_a - S_p \rVert - \lVert S_a - S_n \rVert + \varepsilon,\ 0\right)$$

where $S_a$ is the first training vector; $S_p$ is the second training vector; $S_n$ is the third training vector; $\lVert S_a - S_p \rVert$ is the first document distance; $\lVert S_a - S_n \rVert$ is the second document distance; and $\varepsilon$ is a real number, taken as 1 in this embodiment. Intuitively, minimizing the total loss pulls the positive sample document closer to the training document and pushes the negative sample document farther away from it, thereby improving the document classification accuracy of the model.
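A small numerical sketch of this loss follows, using Euclidean distances and a margin of 1 as in the embodiment. The function names are hypothetical, and plain lists stand in for the training vectors.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    s_a: Sequence[float],
    s_p: Sequence[float],
    s_n: Sequence[float],
    margin: float = 1.0,  # epsilon in the formula
) -> float:
    """max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0)."""
    first_distance = euclidean(s_a, s_p)   # training vs. positive sample
    second_distance = euclidean(s_a, s_n)  # training vs. negative sample
    return max(first_distance - second_distance + margin, 0.0)
```

With the training vector at the origin, a positive sample at distance 5 and a negative sample at distance 1 give a loss of 5, while swapping the two distances gives 1 - 5 + 1 < 0 and therefore a loss of 0: the loss vanishes once the negative sample is farther away than the positive sample by at least the margin.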

S04: When the total loss value does not reach a preset convergence condition, iteratively update the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and record the converged initial language model as the preset pre-trained language model.

Understandably, the convergence condition may be that the total loss value is less than a set threshold, that is, training stops when the total loss value falls below the set threshold. The convergence condition may also be that, after 10,000 calculations, the total loss value is very small and no longer decreases; in that case training stops, and the converged initial language model is recorded as the preset pre-trained language model.
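The two convergence conditions can be sketched as a check over the history of total loss values. The threshold, `patience` window and `tolerance` below are illustrative assumptions, since the embodiment gives only the figure of 10,000 calculations; the sketch also approximates "very small and no longer decreases" by the plateau test alone, and `has_converged` is a hypothetical name.

```python
from typing import Sequence


def has_converged(
    losses: Sequence[float],
    threshold: float = 0.01,    # assumed value for the set threshold
    min_steps: int = 10_000,    # number of calculations named in the embodiment
    patience: int = 100,        # assumed window for "no longer decreases"
    tolerance: float = 1e-6,    # assumed minimum improvement that counts
) -> bool:
    """Check the threshold condition, then the plateau condition."""
    if not losses:
        return False
    # Condition 1: the latest total loss is below the set threshold.
    if losses[-1] < threshold:
        return True
    # Condition 2: enough steps have run and the loss has stopped improving.
    if len(losses) < max(min_steps, patience + 1):
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent <= tolerance
```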

Further, after the total loss value is determined from the training document, positive sample document and negative sample document of a training document triplet, if the total loss value does not reach the preset convergence condition, the initial parameters of the initial language model are adjusted according to the total loss value, and the training document triplet is input again into the adjusted initial language model. Once the total loss value corresponding to that training document triplet reaches the preset convergence condition, another training document triplet is selected (for example, by replacing the negative sample document or the positive sample document), and steps S01 to S04 are performed on it to obtain its total loss value. If that total loss value does not reach the preset convergence condition, the initial parameters of the initial language model are adjusted again according to the total loss value, until the total loss value corresponding to this training document triplet also reaches the preset convergence condition.

In this way, as the initial language model is trained on all the training document triplets, its output moves steadily closer to the accurate result and its recognition accuracy keeps improving. When the total loss values corresponding to all the training document triplets reach the preset convergence condition, the converged initial language model is recorded as the preset pre-trained language model.

Further, in this embodiment, an Adam optimizer may also be used. This optimizer is a parameter update method based on gradient descent, and it keeps updating the initial parameters until the total loss value is less than the set threshold.

In one embodiment, before acquiring the training document triplet, the method further includes:

(1) obtaining a preset sample document set; the sample document set includes at least one sample document; each sample document is associated with a document title;

The sample documents in the preset sample document set can be obtained by crawling PDF documents from major websites with conventional crawling technology; the crawled information includes the sample documents and the document title associated with each sample document.

(2) performing normalization processing on each of the document titles, and performing document classification on each of the sample documents according to each document title after the normalization processing, to obtain a document category corresponding to each of the sample documents;

Specifically, in an embodiment, the normalization process for each of the document titles includes:

Detecting whether the document title contains preset special symbols;

When the document title includes the preset special symbol, remove the preset special symbol and all characters before the preset special symbol to obtain the culled title;

The preset special symbol may be ":". Understandably, although the content of each PDF document differs, the structure is mostly the same. For example, in a PDF document titled "XXX Company: 2020 Annual Report", the text before ":" only indicates which company the report belongs to, so the preset special symbol and all characters before it can be removed without affecting the subsequent document classification.

Detecting whether the culled title contains a preset year character and/or a preset number-of-times character;

When the culled title contains the preset year character and/or the preset number-of-times character, replace the preset year character with the first preset character and replace the preset number-of-times character with the second preset character, at which point the normalization processing of the document title is complete.

It is understandable that the preset year character is the character in the title that denotes a year, and the preset number-of-times character is the character in the title that denotes a count or ordinal, as in "XXX Company: 2020 X Quarterly Report". The first preset character and the second preset character may be English letters or other special characters; they are used to eliminate the influence of the year and the count on document classification.

Exemplarily, suppose that removing the preset special symbol and all characters before it yields the culled title "Announcement on Holding the Eighth Meeting in 2020". Then "2020" can be replaced by X and "Eighth" by Y, so the title becomes "Announcement on Holding the Yth Meeting in Year X".
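The normalization steps above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `normalize_title`, the regular expressions, the handling of the full-width colon, and the placeholder characters X and Y are assumptions chosen to match the example in the text. The sketch only recognizes four-digit years and English ordinal words up to "tenth".

```python
import re

# Assumed patterns: the text only says the special symbol may be ":" and that
# year and number-of-times characters are replaced by preset characters.
SPECIAL_SYMBOLS = (":", "\uff1a")  # ASCII colon and full-width colon
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
ORDINAL_PATTERN = re.compile(
    r"\b(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth)\b",
    re.IGNORECASE,
)


def normalize_title(title, year_char="X", count_char="Y"):
    """Cull the prefix up to a special symbol, then mask years and ordinals."""
    # Step 1: remove the special symbol and all characters before it.
    for symbol in SPECIAL_SYMBOLS:
        if symbol in title:
            title = title.split(symbol, 1)[1]
            break
    title = title.strip()
    # Step 2: replace year characters and number-of-times characters.
    title = YEAR_PATTERN.sub(year_char, title)
    title = ORDINAL_PATTERN.sub(count_char, title)
    return title
```

With these assumed patterns, titles that differ only in company name, year or meeting number normalize to the same string, which is what allows them to be grouped into one category later.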

Further, after the normalization processing is performed on each document title, document classification is performed on each of the sample documents according to the normalized document titles. That is, documents are classified according to the degree of character matching between the normalized titles: documents whose matching degree is higher than a preset threshold are classified into one category, which yields the document category corresponding to each sample document. The preset threshold may be set to 90%, 95%, and so on.

Exemplarily, if document classification yields many categories, the 500 document categories containing the most sample documents can be retained and the remaining document categories removed, so that an excessive number of document categories does not burden the computer system.
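One possible way to group normalized titles by matching degree and keep only the largest categories is sketched below. The text does not specify how the matching degree is computed, so this sketch assumes the similarity ratio from Python's standard `difflib.SequenceMatcher`. The greedy comparison against the first title of each group and the function names `group_titles` and `keep_largest` are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher


def group_titles(titles, threshold=0.9):
    """Greedily group normalized titles whose similarity exceeds the threshold."""
    groups = []  # each group is a list of titles; group[0] is its representative
    for title in titles:
        for group in groups:
            # Assumed matching degree: difflib's similarity ratio in [0, 1].
            if SequenceMatcher(None, group[0], title).ratio() > threshold:
                group.append(title)
                break
        else:
            # No existing group matched, so this title starts a new category.
            groups.append([title])
    return groups


def keep_largest(groups, top_n=500):
    """Keep only the top_n groups that contain the most sample documents."""
    return sorted(groups, key=len, reverse=True)[:top_n]
```

A greedy pass like this depends on input order; a production system might prefer a proper clustering method, but the threshold and top-N filtering would work the same way.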

(3) selecting a document category from the document categories as a positive document category; selecting a document category from the document categories other than the positive document category as a negative document category;

(4) selecting a sample document from the positive document category and recording it as the training document; meanwhile, selecting a sample document other than the training document from the positive document category and recording it as the positive sample document; selecting a sample document from the negative document category and recording it as the negative sample document;

(5) constructing the training document triplet according to the training document, the positive sample document and the negative sample document.

It can be understood that, after document classification is performed on each of the sample documents according to the normalized document titles and the document category corresponding to each sample document is obtained, any document category can be selected, a sample document can be selected from it as the training document, and another sample document can be selected from the same document category as the positive sample document. A document category is then selected from the remaining document categories, and a sample document is selected from that category as the negative sample document.
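Steps (3) to (5) can be sketched as follows, assuming the document categories are held in a dictionary that maps a category name to its list of sample documents. The function name `build_triplet`, the dictionary layout and the use of uniform random sampling are assumptions for illustration; the text does not prescribe a sampling strategy.

```python
import random


def build_triplet(categories, rng=random):
    """Return (training, positive, negative) documents sampled from categories.

    categories maps a category name to a list of sample documents.
    """
    non_empty = [name for name, docs in categories.items() if docs]
    # The positive category must supply two distinct documents.
    eligible = [name for name in non_empty if len(categories[name]) >= 2]
    if not eligible or len(non_empty) < 2:
        raise ValueError(
            "need one category with at least two documents and a second category"
        )
    positive_category = rng.choice(eligible)
    negative_category = rng.choice(
        [name for name in non_empty if name != positive_category]
    )
    # Training and positive documents come from the same category.
    training, positive = rng.sample(categories[positive_category], 2)
    negative = rng.choice(categories[negative_category])
    return training, positive, negative
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.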

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

In one embodiment, a document classification prediction apparatus is provided, and the document classification prediction apparatus corresponds one-to-one with the document classification prediction method in the above embodiment. As shown in FIG. 5, the document classification prediction apparatus includes a prediction request instruction receiving module 10, a document parsing module 20, a first vector extraction module 30, a document vector set acquisition module 40 and a document category determination module 50. The detailed description of each functional module is as follows:

A prediction request instruction receiving module 10, configured to receive a prediction request instruction including a target document;

The document parsing module 20 is configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

The first vector extraction module 30 is configured to input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

a document vector set obtaining module 40, configured to obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;

The document category determination module 50 is configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.

Preferably, as shown in FIG. 6, the document classification prediction device further includes:

Document triplet acquisition module 01, configured to acquire a training document triplet; the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;

The second vector extraction module 02 is configured to input the training document triplet into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document;

A total loss value determination module 03, configured to determine the total loss value of the language model according to the first training vector, the second training vector and the third training vector;

Language model training module 04, configured to iteratively update the initial parameters of the initial language model when the total loss value does not reach the preset convergence condition, until the total loss value reaches the preset convergence condition, whereupon the converged initial language model is recorded as the preset pre-trained language model.

Preferably, the second vector extraction module includes:

A word sequence extraction unit, used for extracting the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain the training word sequence corresponding to the training document and the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document;

a high-order feature determination unit, configured to determine, by using a preset feature representation method, the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;

The average pooling processing unit is used to perform average pooling processing on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
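The average pooling performed by the last unit above can be illustrated with a short sketch. It assumes that the high-order features of a document are available as a list of equal-length numeric vectors, one per word; the function name `average_pool` and the plain-list representation are assumptions made for illustration.

```python
def average_pool(word_features):
    """Average per-word feature vectors into a single document vector."""
    if not word_features:
        raise ValueError("word_features must contain at least one vector")
    dim = len(word_features[0])
    if any(len(vector) != dim for vector in word_features):
        raise ValueError("all feature vectors must have the same dimension")
    count = len(word_features)
    # Element-wise mean across all words in the sequence.
    return [sum(vector[i] for vector in word_features) / count for i in range(dim)]
```

Applying the same function to the training, positive sample and negative sample features would give the first, second and third training vectors respectively.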

Preferably, the document classification prediction device further includes:

a sample document set acquisition module, used for acquiring a preset sample document set; the sample document set includes at least one sample document; one of the sample documents is associated with a document title;

A normalization processing module, configured to perform normalization processing on each of the document titles and, according to each document title after the normalization processing, perform document classification on each of the sample documents to obtain the document category corresponding to each of the sample documents;

A document category selection module for selecting a document category from each of the document categories as a positive document category; selecting a document category from other document categories except the positive document category as a negative document category;

The document selection module is used to select a sample document from the positive document category and record it as the training document; meanwhile, select a sample document other than the training document from the positive document category and record it as the positive sample document; and select a sample document from the negative document category and record it as the negative sample document;

A triplet building module is configured to construct the training document triplet according to the training document, the positive sample document and the negative sample document.

Preferably, the normalization processing module includes:

a special symbol detection unit for detecting whether the document title contains a preset special symbol;

a character culling unit, configured to cull the preset special symbol and all characters before the preset special symbol when the preset special symbol is included in the document title, to obtain the culled title;

a special character detection unit for detecting whether the culled title contains a preset year character and/or a preset number-of-times character;

A character replacement unit, configured to, when the culled title contains the preset year character and/or the preset number-of-times character, replace the preset year character with the first preset character and replace the preset number-of-times character with the second preset character, at which point the normalization processing of the document title is complete.

Preferably, as shown in FIG. 7, the document category determination module 50 includes:

The sample document selection unit 501 is used to select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;

The document category determining unit 502 is configured to obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
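The two units above can be sketched as a thresholded nearest-neighbour vote. The sketch rests on several assumptions that the text does not fix: Euclidean distance is used as the document vector distance, the "preset number" of candidates is taken to be the closest ones within the threshold, and the names `euclidean` and `predict_category` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_category(doc_vector, samples, distance_threshold, preset_number):
    """Vote among the nearest sample vectors that lie within the threshold.

    samples is a list of (vector, category) pairs; returns None when no
    sample vector is within distance_threshold.
    """
    within = [
        (euclidean(doc_vector, vector), category)
        for vector, category in samples
    ]
    within = [pair for pair in within if pair[0] <= distance_threshold]
    # Assumption: the preset number of candidates are the closest ones.
    within.sort(key=lambda pair: pair[0])
    candidates = within[:preset_number]
    if not candidates:
        return None
    counts = Counter(category for _, category in candidates)
    # The category with the highest share of candidates wins.
    return counts.most_common(1)[0][0]
```

Returning None when no sample falls within the threshold is a design choice of this sketch; the text does not say how that case is handled.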

For the specific limitation of the document classification prediction apparatus, reference may be made to the definition of the document classification prediction method above, which will not be repeated here. Each module in the above-mentioned document classification prediction apparatus may be implemented in whole or in part by software, hardware and combinations thereof. The above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions and a database. The internal memory provides an environment for the execution of the operating system and computer-readable instructions in the readable storage medium. The database of the computer device is used to store the data used in the document classification prediction method in the above embodiment. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a document classification prediction method. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.

In one embodiment, there is provided a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:

Receive a prediction request instruction containing the target document;

In one embodiment, one or more readable storage media storing computer-readable instructions are provided; the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:

Receive a prediction request instruction containing the target document;

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, and when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, database or other medium used in the various embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division of functional units and modules is used as an example. In practical applications, the above functions can be allocated to different functional units and modules as required; that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the technical solutions described in the above embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the scope of protection of the present application.

Claims

A document classification prediction method, comprising:

Receive a prediction request instruction containing the target document;

Performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

Inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;

A document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
The document classification prediction method according to claim 1, wherein before the inputting the text information and the coordinate information into a preset pre-trained language model, the method further comprises:

obtaining a training document triplet; the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;

inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;

determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector;

when the total loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
The document classification prediction method according to claim 2, wherein the inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document, comprises:

extracting the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document;

determining, by using a preset feature representation method, the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;

performing average pooling processing on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
The document classification prediction method according to claim 2, wherein the determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector comprises:

determining a first document distance between the first training vector and the second training vector; simultaneously determining a second document distance between the first training vector and the third training vector;

The total loss value is determined by a triplet loss function according to the first document distance and the second document distance.
The document classification prediction method according to claim 2, wherein, before the obtaining a training document triplet, the method further comprises:

obtaining a preset sample document set; the sample document set includes at least one sample document; each sample document is associated with a document title;

Performing normalization processing on each of the document titles, and performing document classification on each of the sample documents according to each document title after the normalization processing, to obtain a document category corresponding to each of the sample documents;

A document category is selected from each of the document categories as a positive document category; a document category is selected from other document categories except the positive document category as a negative document category;

A sample document is selected from the positive document category and recorded as the training document; at the same time, a sample document other than the training document is selected from the positive document category and recorded as the positive sample document; Select a sample document from the negative document category and record it as the negative sample document;

The training document triplet is constructed from the training document, the positive sample document, and the negative sample document.
The document classification prediction method according to claim 5, wherein the normalization processing for each of the document titles comprises:

Detecting whether the document title contains preset special symbols;

When the document title includes the preset special symbol, remove the preset special symbol and all characters before the preset special symbol to obtain the culled title;

Detecting whether the culled title contains a preset year character and/or a preset number-of-times character;

When the culled title contains the preset year character and/or the preset number-of-times character, replace the preset year character with the first preset character, and replace the preset number-of-times character with the second preset character, at which point the normalization processing of the document title is complete.
The document classification prediction method according to claim 1, wherein the sample document vector is further associated with a sample document; the determining the document category corresponding to the target document according to each of the document vector distances comprises:

Select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to the preset distance threshold, and record the selected sample documents as candidate documents;

Obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
A document classification prediction device, comprising:

The prediction request instruction receiving module is used to receive the prediction request instruction including the target document;

a document parsing module, configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

The first vector extraction module is used for inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

a document vector set acquisition module, configured to acquire a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;

A document category determination module, configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

Receive a prediction request instruction containing the target document;

Performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

Inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;

A document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
The computer device according to claim 9, wherein before the inputting the text information and the coordinate information into the preset pre-trained language model, the processor further implements the following steps when executing the computer-readable instructions:

obtaining a training document triplet; the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;

inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;

determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector;

when the total loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
The computer device according to claim 10, wherein the inputting the training document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document, comprises:

extracting the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document;

determining, by using a preset feature representation method, the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;

performing average pooling processing on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
The computer device of claim 10, wherein the determining a total loss value of the language model according to the first training vector, the second training vector and the third training vector comprises:

determining a first document distance between the first training vector and the second training vector; simultaneously determining a second document distance between the first training vector and the third training vector;

The total loss value is determined by a triplet loss function according to the first document distance and the second document distance.
The computer device according to claim 10, wherein, before the obtaining a training document triplet, the processor further implements the following steps when executing the computer-readable instructions:

Obtaining a preset sample document set; the sample document set includes at least one sample document; one of the sample documents is associated with a document title;

Performing normalization processing on each of the document titles, and performing document classification on each of the sample documents according to the document titles after the normalization processing, to obtain a document category corresponding to each of the sample documents;

A document category is selected from each of the document categories as a positive document category; a document category is selected from other document categories except the positive document category as a negative document category;

A sample document is selected from the positive document category and recorded as the training document; at the same time, a sample document other than the training document is selected from the positive document category and recorded as the positive sample document; and a sample document is selected from the negative document category and recorded as the negative sample document;

The sample document triplet is constructed from the training document, the positive sample document, and the negative sample document.
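The triplet construction described in this claim can be sketched in Python as follows. The sketch assumes random selection, documents identified by unique strings, and a positive category holding at least two documents; `build_triplet` and the sample data are hypothetical and are not taken from the application.

```python
import random
from typing import Dict, List, Optional, Tuple


def build_triplet(docs_by_category: Dict[str, List[str]],
                  rng: Optional[random.Random] = None) -> Tuple[str, str, str]:
    """Return (training document, positive sample, negative sample)."""
    if rng is None:
        rng = random.Random()
    # The positive category must supply two distinct documents.
    positive_candidates = [c for c, docs in docs_by_category.items()
                           if len(docs) >= 2]
    if not positive_candidates:
        raise ValueError("no category contains two or more documents")
    positive_category = rng.choice(positive_candidates)
    # The negative category is any other non-empty category.
    negative_candidates = [c for c, docs in docs_by_category.items()
                           if c != positive_category and docs]
    if not negative_candidates:
        raise ValueError("no other category is available as a negative")
    negative_category = rng.choice(negative_candidates)
    training_doc, positive_doc = rng.sample(
        docs_by_category[positive_category], 2)
    negative_doc = rng.choice(docs_by_category[negative_category])
    return training_doc, positive_doc, negative_doc


# Hypothetical document identifiers grouped by document category.
sample_docs = {
    "annual report": ["report_a", "report_b"],
    "prospectus": ["prospectus_a"],
}
print(build_triplet(sample_docs, random.Random(0)))
```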
The computer device of claim 13, wherein the normalizing processing for each of the document titles comprises:

Detecting whether the document title contains preset special symbols;

When the document title contains the preset special symbol, remove the preset special symbol and all characters before the preset special symbol to obtain the culled title;

Detecting whether the culled title contains a preset year character and/or a preset number of times character;

When the culled title contains the preset year character and/or the preset number of times character, replace the preset year character with the first preset character, and replace the preset number of times character with the second preset character, at which point the normalization processing of the document title is complete.
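The title normalization steps above can be sketched in Python with regular expressions. Every concrete value here is an assumption made for illustration: a colon as the preset special symbol, four-digit years as the preset year character, English ordinals such as "3rd" as the preset number of times character, and `<YEAR>` and `<NUM>` as the first and second preset characters. The sketch also cuts at the first occurrence of the symbol and trims surrounding whitespace, neither of which the claims specify.

```python
import re

SPECIAL_SYMBOL = ":"                                    # assumed special symbol
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")        # assumed year form
COUNT_PATTERN = re.compile(r"\b\d+(?:st|nd|rd|th)\b")   # assumed ordinal form
YEAR_PLACEHOLDER = "<YEAR>"                             # assumed first preset character
COUNT_PLACEHOLDER = "<NUM>"                             # assumed second preset character


def normalize_title(title: str) -> str:
    """Strip the prefix up to the special symbol, then mask years and ordinals."""
    if SPECIAL_SYMBOL in title:
        # Drop the symbol and everything before its first occurrence.
        _, _, title = title.partition(SPECIAL_SYMBOL)
    title = title.strip()
    title = YEAR_PATTERN.sub(YEAR_PLACEHOLDER, title)
    title = COUNT_PATTERN.sub(COUNT_PLACEHOLDER, title)
    return title


print(normalize_title("Acme Corp: 2020 Annual Report"))
# <YEAR> Annual Report
print(normalize_title("Beta Ltd: 2021 Annual Report"))
# <YEAR> Annual Report
print(normalize_title("Acme Corp: Notice of the 3rd Shareholders Meeting in 2021"))
# Notice of the <NUM> Shareholders Meeting in <YEAR>
```

Because the first two hypothetical titles normalize to the same string, they would fall into the same document category when sample documents are grouped by normalized title.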
One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:

Receive a prediction request instruction containing the target document;

Performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;

Inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;

Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;

A document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
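As a sketch of the final prediction step, the Python example below compares a document representation vector with one sample vector per category and returns the category of the closest one. Euclidean distance and the "smallest distance wins" rule are assumptions; the claim only states that the category is determined according to the document vector distances. The names `predict_category` and `sample_document_vectors` are hypothetical.

```python
import math
from typing import Dict, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_category(document_vector: Sequence[float],
                     sample_vectors: Dict[str, Sequence[float]]) -> str:
    """Return the category whose sample vector is nearest to the document."""
    if not sample_vectors:
        raise ValueError("the sample document vector set is empty")
    distances = {category: euclidean_distance(document_vector, vector)
                 for category, vector in sample_vectors.items()}
    return min(distances, key=distances.get)


# Hypothetical sample document vector set: one vector per document category.
sample_document_vectors = {
    "annual report": [1.0, 0.0],
    "prospectus": [0.0, 1.0],
}
print(predict_category([0.9, 0.2], sample_document_vectors))  # annual report
```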
16. The readable storage medium of claim 15, wherein, before the text information and the coordinate information are input into the preset pre-trained language model, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:

obtaining a sample document triplet; the sample document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;

The sample document triplet is input into the initial language model containing the initial parameters, and the training document, the positive sample document and the negative sample document are respectively subjected to vector extraction to obtain the first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;

determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector;

When the total loss value does not reach the preset convergence condition, update and iterate the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and record the converged initial language model as the preset pre-trained language model.
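The iterate-until-convergence logic of this claim can be sketched generically in Python. The sketch assumes that the preset convergence condition is "total loss at or below a threshold" and adds an iteration cap as a safeguard; both are assumptions. A toy quadratic loss stands in for the language model and its triplet loss, which are not reproduced here, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def train_until_converged(
    initial_parameters: Sequence[float],
    compute_total_loss: Callable[[List[float]], float],
    update_parameters: Callable[[List[float]], List[float]],
    loss_threshold: float = 1e-3,      # assumed convergence condition
    max_iterations: int = 1000,        # assumed safeguard, not in the claim
) -> Tuple[List[float], float, int]:
    """Update parameters until the total loss meets the convergence condition."""
    parameters = list(initial_parameters)
    for iteration in range(max_iterations):
        total_loss = compute_total_loss(parameters)
        if total_loss <= loss_threshold:
            return parameters, total_loss, iteration
        parameters = update_parameters(parameters)
    return parameters, compute_total_loss(parameters), max_iterations


# Toy stand-ins for the model loss and its parameter update.
def toy_loss(parameters: List[float]) -> float:
    return sum(p * p for p in parameters)


def toy_update(parameters: List[float], learning_rate: float = 0.1) -> List[float]:
    # Gradient of sum(p^2) is 2p, so each step shrinks every parameter.
    return [p - learning_rate * 2.0 * p for p in parameters]


final_parameters, final_loss, iterations = train_until_converged(
    [1.0, -2.0], toy_loss, toy_update)
print(final_loss <= 1e-3, iterations < 1000)  # True True
```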
The readable storage medium according to claim 16, wherein inputting the sample document triplet into the initial language model containing the initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:

Extract the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document;

Determine, by using a preset feature representation method, the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;

The average pooling process is performed on the training high-order features, the positive sample high-order features and the negative sample high-order features, respectively, to obtain the first training vector, the second training vector and the third training vector.
The readable storage medium of claim 16, wherein the determining a total loss value of the language model according to the first training vector, the second training vector and the third training vector comprises:

determining a first document distance between the first training vector and the second training vector; simultaneously determining a second document distance between the first training vector and the third training vector;

According to the first document distance and the second document distance, the total loss value is determined through a triplet loss function.
19. The readable storage medium of claim 16, wherein the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:

Obtaining a preset sample document set; the sample document set includes at least one sample document; one of the sample documents is associated with a document title;

Performing normalization processing on each of the document titles, and performing document classification on each of the sample documents according to each document title after the normalization processing, to obtain a document category corresponding to each of the sample documents;

A document category is selected from each of the document categories as a positive document category; a document category is selected from other document categories except the positive document category as a negative document category;

A sample document is selected from the positive document category and recorded as the training document; at the same time, a sample document other than the training document is selected from the positive document category and recorded as the positive sample document; and a sample document is selected from the negative document category and recorded as the negative sample document;

The sample document triplet is constructed from the training document, the positive sample document, and the negative sample document.
The readable storage medium of claim 19, wherein said normalizing each of said document titles comprises:

Detecting whether the document title contains preset special symbols;

When the document title contains the preset special symbol, remove the preset special symbol and all characters before the preset special symbol to obtain the culled title;

Detecting whether the culled title contains a preset year character and/or a preset number of times character;

When the culled title contains the preset year character and/or the preset number of times character, the first preset character is substituted for the preset year character, and the second preset character is substituted for the preset number of times character, at which point the normalization processing of the document title is complete.