WO2022134805A1 - Document classification prediction method and apparatus, and computer device and storage medium - Google Patents


Info

Publication number
WO2022134805A1
WO2022134805A1 PCT/CN2021/125227 CN2021125227W WO2022134805A1 WO 2022134805 A1 WO2022134805 A1 WO 2022134805A1 CN 2021125227 W CN2021125227 W CN 2021125227W WO 2022134805 A1 WO2022134805 A1 WO 2022134805A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
sample
training
vector
preset
Prior art date
Application number
PCT/CN2021/125227
Other languages
French (fr)
Chinese (zh)
Inventor
刘玉
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022134805A1 publication Critical patent/WO2022134805A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • the present application relates to the technical field of classification models, and in particular, to a document classification prediction method, apparatus, computer equipment and storage medium.
  • Document classification models in the prior art generally need a large amount of labeled training data to reach acceptable classification accuracy, and they are easily affected by data imbalance: if a category has very little training data, the model's accuracy on that category is low, which lowers overall document classification accuracy. Manually labeling data also takes a great deal of time, which hinders the deployment and application of such models in various fields.
  • Embodiments of the present application provide a document classification prediction method, apparatus, computer device, and storage medium, so as to solve the problem of low document classification accuracy caused by scarce manually annotated data.
  • a document classification prediction method comprising:
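  • receiving a prediction request instruction containing a target document;
  • performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • obtaining a sample document vector set; the sample document vector set includes at least one sample document vector; each sample document vector is associated with one document category;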
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • a document classification prediction device comprising:
  • the prediction request instruction receiving module is used to receive the prediction request instruction including the target document;
  • a document parsing module configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • a first vector extraction module configured to input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • a document vector set acquisition module configured to acquire a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • a document category determination module configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
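  • receiving a prediction request instruction containing a target document;
  • performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • obtaining a sample document vector set; the sample document vector set includes at least one sample document vector; each sample document vector is associated with one document category;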
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
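  • receiving a prediction request instruction containing a target document;
  • performing document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • obtaining a sample document vector set; the sample document vector set includes at least one sample document vector; each sample document vector is associated with one document category;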
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • In the above document classification prediction method, apparatus, computer device and storage medium, a prediction request instruction containing a target document is received; document parsing is performed on the target document through a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information; the text information and the coordinate information are input into a preset pre-trained language model, and vector extraction is performed on them to obtain a document representation vector corresponding to the target document; a sample document vector set is obtained, the sample document vector set containing at least one sample document vector, each associated with one document category; the document vector distance between the document representation vector and each sample document vector is determined, and the document category corresponding to the target document is determined according to the document vector distances.
  • the present application determines the document category of the target document by introducing the text information of the document and the corresponding coordinate information, and according to the document vector distance between the document representation vector corresponding to the text information and the coordinate information and the sample document vector. In this way, in the case of few sample documents, new documents can still be classified. If they do not match the sample documents, they can be regarded as a new document category, and the new documents are continuously classified. During the classification process, the number of documents in each document category can be supplemented without the need to constantly replace the preset document parsing model or the preset pre-trained language model to classify new documents, which improves the efficiency and convenience of document classification.
  • FIG. 1 is a schematic diagram of an application environment of a document classification prediction method in an embodiment of the present application
  • FIG. 2 is a flowchart of a document classification prediction method in an embodiment of the present application.
  • step S50 in the document classification prediction method in an embodiment of the present application
  • FIG. 5 is a schematic block diagram of a document classification prediction device in an embodiment of the present application.
  • FIG. 6 is another schematic block diagram of a document classification prediction apparatus in an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a document category determination module in a document category prediction device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the document classification prediction method provided by the embodiment of the present application can be applied in the application environment shown in FIG. 1 .
  • the document classification prediction method is applied in a document classification prediction system.
  • The document classification prediction system includes a client and a server as shown in FIG. 1; the client and the server communicate over a network, and the system is used to solve the problem of low document classification accuracy caused by scarce manually annotated data.
  • The client, also called the user terminal, refers to the program that corresponds to the server and provides local services to the user.
  • Clients can be installed on, but not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • In an embodiment, a document classification prediction method is provided. The method is described by taking its application to the server in FIG. 1 as an example, and includes the following steps:
  • S10 Receive a prediction request instruction containing a target document;
  • The prediction request instruction may be an instruction sent by a preset sender (e.g., the author of the target document, or the document manager).
  • The target document refers to a document that has a regular title and has not yet been classified. A regular title is a title with several fill-in areas, such as a company name area and a year area; the document creator fills in each area according to the content of the document. An example is a document titled in the style of "Rongsheng Petrochemical (company name area): 2020 (year area) semi-annual report".
  • S20 Perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the preset document parsing model is used to extract text information and coordinate information of the target document.
  • For example, when the target document is a PDF document, the preset document parsing model may be a parsing model based on PyMuPDF (an open-source PDF parsing library).
  • Text information refers to the text content of the first five pages in the target document.
  • the coordinate information refers to the page number of each word in the content of the first five pages and the specific position in the corresponding page number.
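  • The parsing step can be sketched as follows. This is a minimal illustration, not the applicant's parsing model: it assumes PyMuPDF's `page.get_text("words")` call, which yields tuples beginning with `(x0, y0, x1, y1, word, ...)`, and the names `Word`, `to_words`, `parse_pdf` and `max_pages` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Word(NamedTuple):
    """One word of text information plus its coordinate information."""
    text: str
    page: int   # zero-based page index
    x0: float   # left edge of the word's bounding box
    y0: float   # top edge
    x1: float   # right edge
    y1: float   # bottom edge


def to_words(page_index: int, raw_words: Iterable[tuple]) -> List[Word]:
    """Convert PyMuPDF word tuples (x0, y0, x1, y1, text, ...) to Word records."""
    return [Word(w[4], page_index, w[0], w[1], w[2], w[3]) for w in raw_words]


def parse_pdf(path: str, max_pages: int = 5) -> List[Word]:
    """Return the words and coordinates of the first `max_pages` pages."""
    import fitz  # PyMuPDF; imported lazily so to_words() works without it

    words: List[Word] = []
    with fitz.open(path) as doc:
        for page_index in range(min(max_pages, doc.page_count)):
            page = doc.load_page(page_index)
            words.extend(to_words(page_index, page.get_text("words")))
    return words
```

  • Calling `parse_pdf("report.pdf")` (a hypothetical file name) would return one `Word` per word on the first five pages, which is the form of input the later steps expect.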
  • S30 Input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • the preset pre-trained language model may be a LayoutLM model.
  • Specifically, after document parsing yields the text information and the coordinate information, both are input into the pre-trained language model, which generates a target word sequence corresponding to the target document; the target word sequence represents the words in the target document sorted according to the coordinate information. A preset feature representation method then determines the target high-order features corresponding to the target word sequence, and average pooling is applied to the target high-order features to obtain the document representation vector.
  • S40 Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with one document category;
  • the sample document vector set is a set of sample document vectors corresponding to each sample document obtained by inputting the sample document into a preset pre-trained language model.
  • Specifically, all sample documents are input into the preset document parsing model, which parses each sample document to obtain the sample text information corresponding to the sample document and the sample coordinate information corresponding to the sample text information. The sample text information and sample coordinate information are then input into the preset pre-trained language model, and vector extraction is performed on them to obtain the sample document vector corresponding to each sample document.
  • After each sample document is acquired, its classification can be determined from the document title associated with it, so that each sample document is associated with one document category.
  • S50 Determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • the document vector distance between the document representation vector and each of the sample document vectors is determined, and the document category corresponding to the target document is determined according to each of the document vector distances.
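  • A minimal sketch of the distance computation in step S50 follows. The source states that the training distances are essentially Euclidean; using the Euclidean distance at prediction time as well is an assumption here, and the function names and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distances_to_samples(
    doc_vector: Sequence[float],
    sample_vectors: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Map each sample document id to its distance from the target document."""
    return {
        sample_id: euclidean_distance(doc_vector, vector)
        for sample_id, vector in sample_vectors.items()
    }
```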
  • the sample document vector is also associated with a sample document; the determining the document category corresponding to the target document according to the distance of each document vector includes:
  • S501 Select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
  • the preset number may be determined according to a specific scenario, and for example, the preset number may be 10, 20, etc.
  • the preset distance threshold can be 0.5, 0.7, etc.
  • a preset number of sample documents whose document vector distance is less than or equal to a preset distance threshold are selected as candidate documents.
  • all sample documents that satisfy the condition that the document vector distance is less than or equal to the preset distance threshold may be used as candidate documents.
  • If all document vector distances are greater than the preset distance threshold, the document categories currently associated with the sample documents cannot characterize the target document. A new document category is then established according to the document title of the target document, and the target document is classified under it. The next time a prediction request instruction containing a new target document is received, if the document vector distance between the document representation vector of the new target document and that of the earlier target document is less than or equal to the preset distance threshold, the document category of the earlier target document can be used as the document category of the new target document, which improves the efficiency of document classification.
  • S502 Obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
  • Specifically, after the preset number of candidate documents is selected, the proportion of candidate documents of each document category among all candidate documents is obtained, and the document category with the highest proportion is recorded as the document category of the target document.
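  • Steps S501 and S502 can be sketched as a thresholded vote. This is an illustration under stated assumptions: the source does not say which sample documents are picked when more than the preset number fall within the threshold, so taking the nearest ones is an assumption, and `predict_category` and its parameter names are hypothetical. A return value of `None` signals the case where every distance exceeds the threshold and the caller would create a new document category.

```python
from collections import Counter
from typing import List, Optional, Tuple


def predict_category(
    scored_samples: List[Tuple[float, str]],
    distance_threshold: float = 0.5,
    preset_number: int = 10,
) -> Optional[str]:
    """Pick a category from (distance, category) pairs of the sample documents.

    Returns None when no sample document is within the distance threshold.
    """
    # S501: keep samples within the threshold, nearest first (assumption),
    # and take at most `preset_number` of them as candidate documents.
    within = sorted((d, c) for d, c in scored_samples if d <= distance_threshold)
    candidates = within[:preset_number]
    if not candidates:
        return None
    # S502: the category with the highest share among the candidates wins.
    counts = Counter(category for _, category in candidates)
    return counts.most_common(1)[0][0]
```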
  • In this embodiment, the document category of the target document is determined by introducing the text information of the document and the corresponding coordinate information, and by using the document vector distance between the document representation vector (derived from the text information and the coordinate information) and each sample document vector. In this way, new documents can still be classified even when few sample documents are available. A document that matches no sample document can be treated as a new document category, and as new documents continue to be classified, the number of documents in each document category is supplemented without having to keep replacing the preset document parsing model or the preset pre-trained language model, which improves the efficiency and convenience of document classification.
  • In an embodiment, before the text information and the coordinate information are input into the preset pre-trained language model, the method further includes:
  • S01 Acquire a training document triplet;
  • the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
  • A positive sample document is a document with the same document category as the training document.
  • A negative sample document is a document whose document category differs from that of the training document.
  • S02 Input the training document triplet into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
  • the initial language model may be a LayoutLM model.
  • a detailed explanation of this step can be found in the following examples.
  • In an embodiment, inputting the training document triplet into an initial language model including initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, includes:
  • S011 Extract the word sequences of the training document, the positive sample document and the negative sample document, respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document.
  • the word sequence refers to each word in the training document, the positive sample document, and the negative sample document and the corresponding ranking relationship.
  • Exemplarily, assume the obtained training word sequence is $(w^a_1, w^a_2, \ldots, w^a_x)$ (where $a$ denotes the training document and $x$ is the length of the word sequence of the training document). Because the initial language model needs to distinguish the beginning of a document ([CLS] below) from its end ([SEP] below), the final training word sequence is $([\mathrm{CLS}], w^a_1, \ldots, w^a_x, [\mathrm{SEP}])$. In the same way, assume the obtained positive sample word sequence is $(w^p_1, w^p_2, \ldots, w^p_y)$ (where $p$ denotes the positive sample document and $y$ is the word sequence length of the positive sample document); the final positive sample word sequence is $([\mathrm{CLS}], w^p_1, \ldots, w^p_y, [\mathrm{SEP}])$. Likewise, assume the obtained negative sample word sequence is $(w^n_1, w^n_2, \ldots, w^n_s)$ (where $n$ denotes the negative sample document and $s$ is the word sequence length of the negative sample document); the final negative sample word sequence is $([\mathrm{CLS}], w^n_1, \ldots, w^n_s, [\mathrm{SEP}])$.
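  • Building such a word sequence can be sketched as follows. The source says the words are ordered by their coordinate information but does not give the exact rule, so the reading order used here (page, then top to bottom, then left to right) is an assumption, as are the function name and the tuple layout.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_word_sequence(words: List[Tuple[str, int, float, float]]) -> List[str]:
    """Order (text, page, x0, y0) words and wrap them in [CLS] ... [SEP].

    Assumed reading order: by page, then top to bottom, then left to right.
    """
    ordered = sorted(words, key=lambda w: (w[1], w[3], w[2]))
    return [CLS] + [w[0] for w in ordered] + [SEP]
```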
  • S012 Determine the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;
  • the high-level feature representation corresponding to each word in each word sequence can be determined by the following expression:
  • S013 Perform an average pooling process on the training high-order features, the positive sample high-order features, and the negative sample high-order features, respectively, to obtain the first training vector, the second training vector, and the third training vector.
  • the average pooling processing method is used to obtain the first training vector, the second training vector and the third training vector.
  • where $\mathrm{MEAN\_POOLING}_i(\cdot)$ is the average pooling function; $i$ denotes the $i$-th word; $S_a$ is the first training vector; $S_p$ is the second training vector; $S_n$ is the third training vector.
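  • Average pooling over per-word features can be sketched in plain Python as below. It assumes each word's high-order feature is a fixed-length list of floats; the function name `mean_pooling` is hypothetical and the sketch does not reproduce the language model that produces the features.

```python
from typing import List, Sequence


def mean_pooling(features: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-word feature vectors into one document vector."""
    if not features:
        raise ValueError("at least one word feature is required")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all word features must have the same dimension")
    count = len(features)
    # Component j of the result is the mean of component j over all words.
    return [sum(f[j] for f in features) / count for j in range(dim)]
```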
  • S03 Determine a total loss value of the language model according to the first training vector, the second training vector and the third training vector.
  • a total loss value of the language model is determined according to the first training vector, the second training vector and the third training vector.
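  • In an embodiment, in step S03, determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector includes:
  • determining the total loss value through a triplet loss function from the first document distance (between the first training vector and the second training vector) and the second document distance (between the first training vector and the third training vector).
  • The first document distance and the second document distance are essentially Euclidean distances.
  • The total loss value can be determined according to the following triplet loss function:
  • $L = \max\left(\lVert S_a - S_p \rVert - \lVert S_a - S_n \rVert + \epsilon,\ 0\right)$
  • where $S_a$ is the first training vector; $S_p$ is the second training vector; $S_n$ is the third training vector; $\lVert S_a - S_p \rVert$ is the first document distance; $\lVert S_a - S_n \rVert$ is the second document distance; and $\epsilon$ is a real number, which is taken as 1 in this embodiment.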
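  • A minimal sketch of a triplet loss of this kind follows. It assumes the standard form max(d(a, p) - d(a, n) + margin, 0) with Euclidean distances and a margin of 1, which matches the description of the two document distances and the real-valued constant; the function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    s_a: Sequence[float],
    s_p: Sequence[float],
    s_n: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Loss for one (training, positive, negative) triplet of vectors."""
    first_distance = euclidean(s_a, s_p)    # training vs. positive sample
    second_distance = euclidean(s_a, s_n)   # training vs. negative sample
    # Zero once the negative is at least `margin` farther away than the positive.
    return max(first_distance - second_distance + margin, 0.0)
```

  • The loss is zero when the negative sample document is already at least the margin farther from the training document than the positive sample document, so only triplets that violate the margin contribute to parameter updates.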
  • Intuitively, minimizing the total loss pulls the positive sample document closer to the training document and pushes the negative sample document farther from it, which improves the document classification accuracy of the model.
  • S04 When the total loss value does not reach a preset convergence condition, update and iterate the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and record the initial language model after convergence as the preset pre-trained language model.
  • The convergence condition may be that the total loss value is less than a set threshold, in which case training stops as soon as the total loss value falls below that threshold. The convergence condition may also be that, after 10,000 computations, the total loss value is small and no longer decreases, in which case training stops at that point. In either case, the initial language model after convergence is recorded as the preset pre-trained language model.
  • Further, after the total loss value is determined from the training document, positive sample document and negative sample document in a training document triplet, if the total loss value has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted according to the total loss value, and the training document triplet is re-input into the adjusted model. Once the total loss value for that triplet reaches the preset convergence condition, another training document triplet is selected (for example, by replacing the negative sample document or the positive sample document), and steps S01 to S04 are performed to obtain its total loss value; if that value has not reached the preset convergence condition, the initial parameters are adjusted again according to it until the condition is reached.
  • In this way, after the initial language model has been trained on all training document triplets, its output keeps moving closer to the accurate result and its recognition accuracy keeps rising, until the total loss values corresponding to all training document triplets reach the preset convergence condition; the initial language model after convergence is then recorded as the preset pre-trained language model.
  • In this embodiment an Adam optimizer may also be used; it is a gradient-descent-based parameter update method that keeps updating the initial parameters until the total loss value is less than the set threshold.
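  • The two convergence conditions described above can be sketched as a single check on the recorded loss values. This is only an illustration: the threshold, the tolerance and the name `has_converged` are hypothetical, and the second condition is simplified to "at least `max_steps` computations have been made and the loss has stopped decreasing".

```python
from typing import List


def has_converged(
    loss_history: List[float],
    loss_threshold: float = 0.01,
    max_steps: int = 10000,
    tolerance: float = 1e-6,
) -> bool:
    """Return True when training should stop, given the losses so far."""
    if not loss_history:
        return False
    # Condition 1: the latest total loss value is below the set threshold.
    if loss_history[-1] < loss_threshold:
        return True
    # Condition 2: enough computations were made and the loss no longer falls.
    if len(loss_history) >= max(max_steps, 2):
        return loss_history[-1] >= loss_history[-2] - tolerance
    return False
```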
  • In an embodiment, before the training document triplet is acquired, the method further includes:
  • acquiring a preset sample document set; the sample document set includes at least one sample document; each sample document is associated with a document title;
  • The sample documents in the preset sample document set can be crawled as PDF documents from major websites using conventional crawling techniques; the crawled information includes the sample documents and the document titles associated with them.
  • the normalization process for each of the document titles includes:
  • The preset special symbol may be ":". Although the content of each PDF document differs, the structure of the titles is mostly the same. For example, in PDF documents titled like "XXX Company: 2020 Annual Report", the text before ":" only identifies which company the report belongs to, so the preset special symbol and all characters before it can be removed without affecting subsequent document classification.
  • If the culled title contains a preset year character and/or a preset count character, the preset year character is replaced with a first preset character and the preset count character is replaced with a second preset character, which completes the normalization of the document title.
  • The preset year character is the character in the title that contains the year; the preset count character is the character in the title that expresses an ordinal or frequency, such as the quarter in "XXX Company: 2020 X Quarterly Report".
  • The first preset character and the second preset character may be English letters or other special characters.
  • The first preset character and the second preset character are used to eliminate the influence of the year and the count on document classification.
  • For example, if the title is "Announcement on Holding the Eighth Meeting in 2020", 2020 is replaced by X and "eighth" by Y, giving "Announcement on Holding the Yth Meeting in Year X".
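  • The title normalization can be sketched with regular expressions. The patterns below are stand-ins chosen for English titles: a four-digit year beginning with 19 or 20 plays the role of the preset year character, and a short list of English ordinals plays the role of the preset count character. The original titles are Chinese, so the real patterns would differ; all names here are hypothetical.

```python
import re

# Assumed stand-ins for the preset year and count characters.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
ORDINAL_PATTERN = re.compile(
    r"\b(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth"
    r"|\d+(?:st|nd|rd|th))\b",
    re.IGNORECASE,
)


def normalize_title(
    title: str,
    special_symbol: str = ":",
    year_char: str = "X",
    count_char: str = "Y",
) -> str:
    """Strip the prefix up to the special symbol, then mask year and ordinal."""
    # Remove the special symbol and everything before it (e.g. the company name).
    _, found, rest = title.partition(special_symbol)
    culled = rest if found else title
    culled = YEAR_PATTERN.sub(year_char, culled)
    culled = ORDINAL_PATTERN.sub(count_char, culled)
    return culled.strip()
```

  • With these stand-in patterns, titles that differ only in company name, year or meeting number normalize to the same string, which is what the later title matching relies on.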
  • After each document title is normalized, document classification is performed on the sample documents according to the matching degree between the characters of the normalized document titles: documents whose titles match to a degree higher than a preset threshold are placed in the same category, which yields the document category corresponding to each sample document.
  • the preset threshold can be set to 90%, 95% and so on.
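  • Grouping by title matching degree can be sketched as follows. The source does not specify how the matching degree is measured, so the similarity ratio from Python's `difflib.SequenceMatcher` is used here as an assumption, together with a simple greedy pass; `group_titles` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def group_titles(normalized_titles: List[str], threshold: float = 0.9) -> Dict[str, List[int]]:
    """Greedily group title indices under the first sufficiently similar title.

    Keys are representative titles; values are indices into the input list.
    """
    categories: Dict[str, List[int]] = {}
    for index, title in enumerate(normalized_titles):
        for representative in categories:
            # "Matching degree higher than the preset threshold" joins the group.
            if SequenceMatcher(None, representative, title).ratio() > threshold:
                categories[representative].append(index)
                break
        else:
            # No existing category matched, so this title starts a new one.
            categories[title] = [index]
    return categories
```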
  • Further, after the document category of each sample document is obtained, the 500 document categories containing the most sample documents can be kept and the remaining categories removed, so that an excessive number of categories does not burden the computer system.
  • After document classification, one document category can be chosen at will; a sample document from it is selected as the training document, and another document from the same category is selected as the positive sample document. A document category is then chosen from the remaining categories, and a sample document from it is selected as the negative sample document.
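  • Category pruning and triplet selection can be sketched as below. The sketch assumes categories are held as a mapping from category name to a list of document ids, and it requires the training document and the positive sample document to be two different documents, which the source does not state explicitly; all names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def keep_largest_categories(categories: Dict[str, List[str]], top_k: int = 500) -> Dict[str, List[str]]:
    """Keep the `top_k` categories that contain the most sample documents."""
    ranked = sorted(categories.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ranked[:top_k])


def sample_triplet(
    categories: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str, str]:
    """Draw one (training, positive, negative) triplet of document ids."""
    rng = rng or random.Random()
    # The positive category needs two documents: training and positive sample.
    eligible = [name for name, docs in categories.items() if len(docs) >= 2]
    if not eligible:
        raise ValueError("no category has two or more documents")
    positive_category = rng.choice(eligible)
    others = [name for name, docs in categories.items()
              if name != positive_category and docs]
    if not others:
        raise ValueError("no other non-empty category for the negative sample")
    training, positive = rng.sample(categories[positive_category], 2)
    negative = rng.choice(categories[rng.choice(others)])
    return training, positive, negative
```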
  • a document classification prediction apparatus is provided, and the document classification prediction apparatus corresponds one-to-one with the document classification prediction method in the above embodiment.
  • the document classification prediction apparatus includes a prediction request instruction receiving module 10 , a document parsing module 20 , a first vector extraction module 30 , a document vector set acquisition module 40 and a document category determination module 50 .
  • the detailed description of each functional module is as follows:
  • a prediction request instruction receiving module 10 configured to receive a prediction request instruction including a target document
  • the document parsing module 20 is configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the first vector extraction module 30 is configured to input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • a document vector set obtaining module 40 configured to obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • the document category determination module 50 is configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • the document classification prediction device further includes:
  • a document triplet acquisition module 01 configured to acquire a training document triplet;
  • the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
  • the second vector extraction module 02 is configured to input the training document triplet into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document;
  • a total loss value determination module 03 configured to determine the total loss value of the language model according to the first training vector, the second training vector and the third training vector;
  • Language model training module 04 configured to update and iterate the initial parameters of the initial language model when the total loss value does not reach the preset convergence condition, until the total loss value reaches the preset convergence condition, The initial language model after convergence is recorded as the preset pre-trained language model.
  • the second vector extraction module includes:
  • a word sequence extraction unit used for extracting the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain the training word sequence corresponding to the training document and the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document;
  • a high-order feature determination unit configured to determine, by a preset feature representation method, the training high-order features corresponding to each word in the training word sequence, the positive sample high-order features corresponding to each word in the positive sample word sequence, and the negative sample high-order features corresponding to each word in the negative sample word sequence;
  • an average pooling processing unit configured to perform average pooling on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
  • the document classification prediction device further includes:
  • a sample document set acquisition module used for acquiring a preset sample document set;
  • the sample document set includes at least one sample document; each sample document is associated with one document title;
  • a normalization processing module configured to normalize each document title and, according to the normalized document titles, classify the sample documents to obtain the document category corresponding to each sample document;
  • a document category selection module for selecting a document category from each of the document categories as a positive document category; selecting a document category from other document categories except the positive document category as a negative document category;
  • a document selection module configured to select a sample document from the positive document category and record it as the training document; select another sample document, other than the training document, from the positive document category and record it as the positive sample document; and select a sample document from the negative document category and record it as the negative sample document;
  • a triplet building module is configured to construct the training document triplet according to the training document, the positive sample document and the negative sample document.
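The category selection, document selection and triplet building modules above can be sketched as follows. This is a minimal illustration, assuming the sample documents are already grouped by category in a dictionary; the name `build_triplet` and the use of uniform random sampling are assumptions, not details taken from the text.

```python
import random


def build_triplet(docs_by_category, rng=random):
    """Return (training_doc, positive_doc, negative_doc) from categorized samples.

    docs_by_category: dict mapping a category name to a list of sample documents.
    """
    # The positive category needs at least two documents: the training document
    # and a different positive sample document.
    eligible = [c for c, docs in docs_by_category.items() if len(docs) >= 2]
    if not eligible:
        raise ValueError("no category has two or more sample documents")
    positive_category = rng.choice(eligible)

    # The negative category is any other non-empty category.
    others = [c for c, docs in docs_by_category.items()
              if c != positive_category and docs]
    if not others:
        raise ValueError("no other category is available as the negative category")
    negative_category = rng.choice(others)

    training_doc, positive_doc = rng.sample(docs_by_category[positive_category], 2)
    negative_doc = rng.choice(docs_by_category[negative_category])
    return training_doc, positive_doc, negative_doc
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.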
  • the normalization processing module includes:
  • a special symbol detection unit configured to detect whether the document title contains a preset special symbol;
  • a character removal unit configured to, when the document title contains the preset special symbol, remove the preset special symbol and all characters before it, to obtain a trimmed title;
  • a special character detection unit configured to detect whether the trimmed title contains a preset year character and/or a preset number-of-times character;
  • a character replacement unit configured to, when the trimmed title contains the preset year character and/or the preset number-of-times character, replace the preset year character with a first preset character and/or replace the preset number-of-times character with a second preset character, which indicates that the normalization of the document title is complete.
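The normalization units above can be sketched with regular expressions. The concrete choices here are assumptions for illustration only: the colon (full-width or ASCII) as the preset special symbol, a four-digit year optionally followed by "年" as the preset year character, "第…次" as the preset number-of-times character, and `<YEAR>` / `<COUNT>` as the first and second preset characters.

```python
import re

# All of the following constants are hypothetical presets.
SPECIAL_SYMBOLS = (":", ":")
YEAR_PATTERN = re.compile(r"(?:19|20)\d{2}年?")
COUNT_PATTERN = re.compile(r"第[一二三四五六七八九十\d]+次")
YEAR_TOKEN = "<YEAR>"
COUNT_TOKEN = "<COUNT>"


def normalize_title(title):
    """Strip the leading fill-in region and replace year / count characters."""
    # Remove the first special symbol and everything before it, if present.
    positions = [i for i in (title.find(s) for s in SPECIAL_SYMBOLS) if i != -1]
    if positions:
        title = title[min(positions) + 1:]
    # Replace the variable parts with fixed placeholder tokens.
    title = YEAR_PATTERN.sub(YEAR_TOKEN, title)
    title = COUNT_PATTERN.sub(COUNT_TOKEN, title)
    return title.strip()
```

Titles that differ only in company name, year or meeting count then normalize to the same string, so they can be grouped into one document category.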
  • the document category determination module 50 includes:
  • the sample document selection unit 501 is used to select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
  • the document category determining unit 502 is configured to obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
  • Each module in the above-mentioned document classification prediction apparatus may be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8 .
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the readable storage medium.
  • the database of the computer device is used to store the data used in the document classification prediction method in the above embodiment.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions when executed by a processor, implement a document classification prediction method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
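  • a computer device is provided, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
  • receiving a prediction request instruction containing a target document;
  • performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
  • obtaining a sample document vector set, wherein the sample document vector set includes at least one sample document vector, and each sample document vector is associated with one document category;
  • determining a document vector distance between the document representation vector and each of the sample document vectors, and determining the document category corresponding to the target document according to the document vector distances.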
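  • one or more readable storage media are provided that store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • receiving a prediction request instruction containing a target document;
  • performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
  • obtaining a sample document vector set, wherein the sample document vector set includes at least one sample document vector, and each sample document vector is associated with one document category;
  • determining a document vector distance between the document representation vector and each of the sample document vectors, and determining the document category corresponding to the target document according to the document vector distances.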
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A document classification prediction method and apparatus, and a computer device and a storage medium. The method comprises: receiving a prediction request instruction that contains a target document (S10); performing document parsing on the target document by means of a preset document parsing model, so as to obtain text information corresponding to the target document and coordinate information corresponding to the text information (S20); inputting the text information and the coordinate information into a preset pretrained language model, and performing vector extraction on the text information and the coordinate information, so as to obtain a document representation vector corresponding to the target document (S30); acquiring a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and one sample document vector is associated with one document category (S40); and determining a document vector distance between the document representation vector and each sample document vector, and determining, according to each document vector distance, a document category corresponding to the target document (S50). By means of the method, the efficiency of document classification is improved.

Description

Document classification prediction method, apparatus, computer device and storage medium
This application claims priority to Chinese patent application No. 202011521171.0, filed with the China Patent Office on December 21, 2020 and entitled "Document classification prediction method, apparatus, computer device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of classification models, and in particular to a document classification prediction method, apparatus, computer device and storage medium.
Background
At present, every field contains tens of thousands of PDF documents, for example PDF papers in academia and PDF data reports in professional fields. As more and more PDF documents are produced, classifying them effectively and predicting the category of new documents becomes a challenge.
The inventors realized that prior-art document classification models generally need a large amount of labeled data for training before they reach reasonable classification accuracy. These models are also easily affected by data imbalance: if a category has very little training data, the model's accuracy on that category is low, which lowers overall document classification accuracy. In addition, manually labeling data takes a great deal of time, which hinders deploying the model across fields.
Summary
Embodiments of the present application provide a document classification prediction method, apparatus, computer device and storage medium, to solve the problem of low document classification accuracy caused by scarce manually labeled data.
A document classification prediction method, comprising:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with one document category;
determining a document vector distance between the document representation vector and each of the sample document vectors, and determining the document category corresponding to the target document according to the document vector distances.
A document classification prediction apparatus, comprising:
a prediction request instruction receiving module, configured to receive a prediction request instruction containing a target document;
a document parsing module, configured to perform document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
a first vector extraction module, configured to input the text information and the coordinate information into a preset pre-trained language model, and to perform vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
a document vector set acquisition module, configured to obtain a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with one document category;
a document category determination module, configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and to determine the document category corresponding to the target document according to the document vector distances.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with one document category;
determining a document vector distance between the document representation vector and each of the sample document vectors, and determining the document category corresponding to the target document according to the document vector distances.
One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a prediction request instruction containing a target document;
performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
obtaining a sample document vector set, wherein the sample document vector set contains at least one sample document vector, and each sample document vector is associated with one document category;
determining a document vector distance between the document representation vector and each of the sample document vectors, and determining the document category corresponding to the target document according to the document vector distances.
In the above document classification prediction method, apparatus, computer device and storage medium, the method receives a prediction request instruction containing a target document; performs document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information; inputs the text information and the coordinate information into a preset pre-trained language model and performs vector extraction on them, to obtain a document representation vector corresponding to the target document; obtains a sample document vector set that contains at least one sample document vector, each associated with one document category; and determines a document vector distance between the document representation vector and each sample document vector, determining the document category corresponding to the target document according to those distances.
The present application introduces the text information of a document together with the corresponding coordinate information, and determines the document category of the target document from the document vector distances between the document representation vector derived from that text and coordinate information and the sample document vectors. In this way, new documents can still be classified when few sample documents are available. A document that matches none of the sample documents can be treated as a new document category, so that as new documents are classified the number of documents under each category is gradually built up, without repeatedly replacing the preset document parsing model or the preset pre-trained language model. This improves the efficiency and convenience of document classification.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings and the claims.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the document classification prediction method in an embodiment of the present application;
FIG. 2 is a flowchart of the document classification prediction method in an embodiment of the present application;
FIG. 3 is a flowchart of step S50 of the document classification prediction method in an embodiment of the present application;
FIG. 4 is another flowchart of the document classification prediction method in an embodiment of the present application;
FIG. 5 is a schematic block diagram of the document classification prediction apparatus in an embodiment of the present application;
FIG. 6 is another schematic block diagram of the document classification prediction apparatus in an embodiment of the present application;
FIG. 7 is a schematic block diagram of the document category determination module of the document classification prediction apparatus in an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present application.
The document classification prediction method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a document classification prediction system that includes the client and the server shown in FIG. 1. The client and the server communicate over a network, and the system is used to solve the problem of low document classification accuracy caused by scarce manually labeled data. The client, also called the user terminal, is the program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, personal computers, laptops, smartphones, tablets and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a document classification prediction method is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
S10: receive a prediction request instruction containing a target document;
Understandably, the prediction request instruction may be sent by a preset sender (for example, the author of the target document or a document administrator). In this embodiment, the target document is a document that has a regular title and has not yet been classified. A regular title is a title with several fill-in regions, such as a company name region and a year region, which the document creator fills in according to what each region requires and the content of the document. An example is a document titled in the style of "Rongsheng Petrochemical (company name region): 2020 (year region) Semi-annual Report".
S20: perform document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
The preset document parsing model is used to extract the text information and the coordinate information of the target document. For example, when the target document is a PDF document, the preset document parsing model may be a parsing model based on PyMuPDF (an open-source PDF parsing library). The text information is the text content of the first five pages of the target document. The coordinate information is the page number on which each word in those five pages appears and its specific position on that page.
Specifically, the text content of the first five pages of the target document is extracted by the preset document parsing model to obtain the text information; the page number of each word in the text information and its position on that page are recorded in association as the coordinate information. Understandably, only the first five pages are used for two reasons. First, the preset document parsing model generally supports inputs of length 512 only, so the text of an entire PDF cannot be used as input. Second, the first five pages generally contain the title of the document, and the title is important information for determining the PDF's category.
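As an illustration of step S20, the sketch below collects words and their page-level coordinates from the first five pages. The helper names `collect_words` and `parse_pdf` are hypothetical. `parse_pdf` assumes PyMuPDF is installed and uses its `get_text("words")` call, which yields tuples beginning with `(x0, y0, x1, y1, word, ...)`; how the actual preset document parsing model is built on PyMuPDF is not specified in the text.

```python
def collect_words(pages, max_pages=5):
    """Gather text and coordinates from the first `max_pages` pages.

    pages: iterable of per-page word lists; each word is a tuple that starts
           with (x0, y0, x1, y1, text).
    Returns (words, coords), where coords[i] = (page_no, x0, y0, x1, y1).
    """
    words, coords = [], []
    for page_no, page_words in enumerate(pages):
        if page_no >= max_pages:
            break
        for x0, y0, x1, y1, text, *_ in page_words:
            words.append(text)
            coords.append((page_no, x0, y0, x1, y1))
    return words, coords


def parse_pdf(path, max_pages=5):
    """Extract words and coordinates from a PDF file using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so collect_words works without it

    with fitz.open(path) as doc:
        page_count = min(max_pages, doc.page_count)
        pages = (doc[i].get_text("words") for i in range(page_count))
        return collect_words(pages, max_pages)
```

Keeping the page number alongside the bounding box mirrors the description of the coordinate information as "page number plus position on that page".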
S30: input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
The preset pre-trained language model may be a LayoutLM model.
Specifically, after the target document has been parsed by the preset document parsing model to obtain the text information and the corresponding coordinate information, the text information and the coordinate information are input into the pre-trained language model, which generates a target word sequence corresponding to the target document from them. The target word sequence represents the words of the target document ordered according to the coordinate information. The target high-order features corresponding to the target word sequence are then determined by a preset feature representation method, and average pooling is applied to the target high-order features to obtain the document representation vector.
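The average pooling at the end of step S30 can be sketched as follows, assuming the high-order features are given as one fixed-length list of floats per word. The function name `mean_pool` is hypothetical, and in practice the features would come from the pre-trained language model rather than being written by hand.

```python
def mean_pool(token_features):
    """Average per-word feature vectors into a single document vector."""
    if not token_features:
        raise ValueError("at least one token feature vector is required")
    dim = len(token_features[0])
    count = len(token_features)
    # Average each dimension across all words in the sequence.
    return [sum(vec[i] for vec in token_features) / count for i in range(dim)]
```

Because the result has the same dimension regardless of how many words the document contains, vectors of different documents can be compared directly by distance.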
S40: obtain a sample document vector set; the sample document vector set contains at least one sample document vector, and each sample document vector is associated with one document category;
The sample document vector set is the set of sample document vectors obtained by inputting the sample documents into the preset pre-trained language model, one vector per sample document.
Understandably, after the preset pre-trained language model has been trained, all sample documents are input into the preset document parsing model one by one and parsed, to obtain the sample text information corresponding to each sample document and the sample coordinate information corresponding to that sample text information. The sample text information and the sample coordinate information are then input into the preset pre-trained language model, which performs vector extraction on them to obtain the sample document vector corresponding to each sample document.
Further, after the sample documents are obtained, the category of each sample document can be determined from the document title associated with it, and the sample documents are classified accordingly, so that each sample document is associated with one document category.
S50: determine a document vector distance between the document representation vector and each of the sample document vectors, and determine the document category corresponding to the target document according to the document vector distances.
Specifically, after the sample document vector set is obtained, the document vector distance between the document representation vector and each sample document vector is determined, and the document category corresponding to the target document is determined according to those distances.
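The distance computation in step S50 could look like the sketch below. The text does not say which distance is used, so cosine distance is an assumption here, and the names `cosine_distance` and `document_vector_distances` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def document_vector_distances(doc_vector, sample_vectors):
    """Distance from the document representation vector to every sample vector."""
    return [cosine_distance(doc_vector, s) for s in sample_vectors]
```

Any other metric, such as Euclidean distance, could be substituted as long as the same metric is used consistently with the chosen distance threshold.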
In an embodiment, as shown in FIG. 3, each sample document vector is also associated with a sample document; determining the document category corresponding to the target document according to the document vector distances includes:
S501: select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
The preset number may be determined according to the specific scenario; for example, it may be 10 or 20. The preset distance threshold may be, for example, 0.5 or 0.7.
Understandably, after the document vector distance between the document representation vector and each sample document vector is determined, the preset number of sample documents whose document vector distance is less than or equal to the preset distance threshold are selected as candidate documents. If fewer sample documents than the preset number satisfy this distance condition, all sample documents that satisfy it are used as candidate documents.
Further, if every document vector distance is greater than the preset distance threshold, none of the document categories currently associated with the sample documents can represent the category of the target document. A new document category is then created according to the document title of the target document, and the target document is classified under it. The next time a prediction request instruction containing a new target document is received, if the document vector distance between the new target document's document vector and this target document's document representation vector is less than or equal to the preset distance threshold, the category of this target document can be used as the category of the new target document, which improves the efficiency of document classification.
S502: Obtain the proportion of candidate documents of each document category among all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
Understandably, after a preset number of sample documents are selected from the sample documents whose document vector distance is less than or equal to the preset distance threshold and recorded as candidate documents, the proportion of candidate documents of each document category among all the candidate documents is obtained, and the document category with the highest proportion is recorded as the document category of the target document.
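Steps S501 and S502 can be sketched as a thresholded nearest-neighbour vote. The following is a minimal illustration, not the filing's implementation: the function names `euclidean` and `predict_category`, the use of Euclidean distance at prediction time, and the choice to keep the nearest candidates when more than the preset number fall within the threshold are all assumptions. Returning `None` stands for the case in which every distance exceeds the threshold, where the method would create a new document category from the target document's title.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_category(doc_vec, samples, preset_number=10, dist_threshold=0.5):
    """Vote among sample documents close to doc_vec.

    samples: list of (sample_document_vector, document_category) pairs.
    Returns the category with the highest share among the candidates,
    or None when no sample lies within dist_threshold.
    """
    scored = [(euclidean(doc_vec, vec), cat) for vec, cat in samples]
    # S501: keep samples within the threshold; if there are more than
    # preset_number of them, keep the nearest ones (an assumption).
    within = sorted((s for s in scored if s[0] <= dist_threshold),
                    key=lambda s: s[0])
    candidates = within[:preset_number]
    if not candidates:
        return None  # caller would create a new category here
    # S502: the category with the highest proportion among candidates.
    counts = Counter(cat for _, cat in candidates)
    return counts.most_common(1)[0][0]
```

Because every candidate counts once, the category with the highest proportion is simply the one with the highest count; ties are resolved here by whichever category was encountered first, which the source does not specify.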
In this embodiment, the text information of a document and the corresponding coordinate information are introduced, and the document category of the target document is determined from the document vector distance between the document representation vector derived from that text and coordinate information and the sample document vectors. In this way, new documents can still be classified when few sample documents are available. A document that matches none of the sample documents can be treated as a new document category, so that the number of documents under each document category is replenished as new documents continue to be classified, without repeatedly replacing the preset document parsing model or the preset pre-trained language model. This improves the efficiency and convenience of document classification.
In one embodiment, as shown in FIG. 4, before the text information and the coordinate information are input into the preset pre-trained language model, the method further includes:
S01: Acquire a training document triplet; the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
Here, a positive sample document is a document that has the same document category as the training document. A negative sample document is a document that does not have the same document category as the training document.
S02: Input the training document triplet into an initial language model containing initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
For example, the initial language model may be a LayoutLM model. This step is explained in detail in the following embodiment.
In one embodiment, inputting the training document triplet into the initial language model containing initial parameters and performing vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, includes:
S011: Extract the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
Here, a word sequence refers to the words in the training document, the positive sample document or the negative sample document, together with their order. For example, suppose that after the word sequences of the training document, the positive sample document and the negative sample document are extracted, the resulting training word sequence is
Figure PCTCN2021125227-appb-000001
(where a denotes the training document and x is the length of its word sequence). Because the initial language model needs to distinguish the beginning of a document ([CLS]) from its end ([SEP]), the final training word sequence is
Figure PCTCN2021125227-appb-000002
Similarly, suppose the resulting positive sample word sequence is
Figure PCTCN2021125227-appb-000003
(where p denotes the positive sample document and y is the length of its word sequence); the final positive sample word sequence is
Figure PCTCN2021125227-appb-000004
Similarly, suppose the resulting negative sample word sequence is
Figure PCTCN2021125227-appb-000005
(where n denotes the negative sample document and s is the length of its word sequence); the final negative sample word sequence is
Figure PCTCN2021125227-appb-000006
S012: Using a preset feature representation method, determine the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;
Specifically, the high-order feature representation corresponding to each word in each word sequence can be determined by the following expressions:
Figure PCTCN2021125227-appb-000007
Figure PCTCN2021125227-appb-000008
Figure PCTCN2021125227-appb-000009
where i denotes the i-th word;
Figure PCTCN2021125227-appb-000010
is the training high-order feature;
Figure PCTCN2021125227-appb-000011
is the positive sample high-order feature;
Figure PCTCN2021125227-appb-000012
is the negative sample high-order feature.
S013: Perform average pooling on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
Specifically, after the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence are determined, average pooling is applied to obtain the first training vector, the second training vector and the third training vector.
Optionally, these vectors can be determined by the following expressions:
Figure PCTCN2021125227-appb-000013
Figure PCTCN2021125227-appb-000014
Figure PCTCN2021125227-appb-000015
where $\mathrm{MEAN\_POOLING}_i(\cdot)$ is the average pooling function; i denotes the i-th word; $S_a$ is the first training vector; $S_p$ is the second training vector; and $S_n$ is the third training vector.
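The average pooling of step S013 can be illustrated with a few lines of plain Python. This is a sketch under assumptions: the function name `mean_pooling` is hypothetical, the per-word high-order features are taken to be equal-length lists of floats, and whether the [CLS] and [SEP] positions are included in the average is not stated in the source and is left to the caller.

```python
def mean_pooling(features):
    """Element-wise average of per-word feature vectors.

    features: non-empty list of equal-length lists of floats, one per word.
    Returns a single vector of the same dimension.
    """
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    count = len(features)
    return [sum(f[k] for f in features) / count for k in range(dim)]
```

Applied to the training, positive sample and negative sample high-order features in turn, this would yield the three vectors $S_a$, $S_p$ and $S_n$, each with the same dimension regardless of document length.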
S03: Determine the total loss value of the language model according to the first training vector, the second training vector and the third training vector.
Specifically, after average pooling is performed on the training high-order features, the positive sample high-order features and the negative sample high-order features to obtain the first training vector, the second training vector and the third training vector, the total loss value of the language model is determined according to these three vectors.
In one embodiment, in step S03, determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector includes:
determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
determining the total loss value through a triplet loss function according to the first document distance and the second document distance.
Both the first document distance and the second document distance are Euclidean distances.
Specifically, the total loss value can be determined according to the following triplet loss function:
$$L = \max\left(\lVert S_a - S_p \rVert - \lVert S_a - S_n \rVert + \varepsilon,\ 0\right)$$
where $S_a$ is the first training vector; $S_p$ is the second training vector; $S_n$ is the third training vector; $\lVert S_a - S_p \rVert$ is the first document distance; $\lVert S_a - S_n \rVert$ is the second document distance; and $\varepsilon$ is a real number, set to 1 in this embodiment. Intuitively, minimizing this loss pulls the positive sample document closer to the training document and pushes the negative sample document farther away from it, which improves the document classification accuracy of the model.
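The triplet loss above can be checked numerically with a short sketch. The function names `euclidean` and `triplet_loss` are illustrative; the margin default of 1 follows the value of ε given for this embodiment, and the vectors are assumed to be equal-length lists of floats.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(s_a, s_p, s_n, margin=1.0):
    """L = max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0)."""
    first_distance = euclidean(s_a, s_p)   # training vs. positive sample
    second_distance = euclidean(s_a, s_n)  # training vs. negative sample
    return max(first_distance - second_distance + margin, 0.0)
```

The loss is zero once the negative sample document is at least `margin` farther from the training document than the positive sample document is, so such triplets contribute no further gradient.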
S04: When the total loss value has not reached a preset convergence condition, iteratively update the initial parameters of the initial language model until the total loss value reaches the preset convergence condition, and record the initial language model after convergence as the preset pre-trained language model.
Understandably, the convergence condition may be that the total loss value is less than a set threshold, in which case training stops once the total loss value falls below the set threshold. The convergence condition may also be that, after 10,000 calculations, the total loss value is very small and no longer decreases; in that case training stops, and the initial language model after convergence is recorded as the preset pre-trained language model.
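The two convergence conditions can be expressed as a single check over the history of loss values. The sketch below is only one possible reading: the function name `has_converged` and the parameters `loss_threshold`, `min_steps`, `window` and `tol` are hypothetical, and the source's "very small and no longer decreases" is approximated here purely as "no improvement within the most recent window", since the filing gives no numeric definition of "very small".

```python
def has_converged(losses, loss_threshold=0.01, min_steps=10000,
                  window=100, tol=1e-6):
    """Return True when training may stop.

    losses: list of total loss values, one per calculation so far.
    Condition 1: the latest loss is below loss_threshold.
    Condition 2: at least min_steps calculations have been made and the
    best loss in the last `window` steps is not better (by more than tol)
    than the best loss seen before that window.
    """
    if not losses:
        return False
    if losses[-1] < loss_threshold:
        return True
    if len(losses) >= min_steps and len(losses) > window:
        best_before = min(losses[:-window])
        best_recent = min(losses[-window:])
        return best_before - best_recent < tol
    return False
```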
Further, after the total loss value is determined from the training document, positive sample document and negative sample document in a training document triplet, if the total loss value has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted according to the total loss value, and the training document triplet is input again into the adjusted initial language model. Once the total loss value corresponding to that training document triplet reaches the preset convergence condition, another training document triplet is selected (for example, by replacing the negative sample document or the positive sample document), and steps S01 to S04 are performed to obtain the total loss value corresponding to the new triplet. If that total loss value has not reached the preset convergence condition, the initial parameters are adjusted again according to it, until the total loss value corresponding to the new triplet also reaches the preset convergence condition.
In this way, as the initial language model is trained on all the training document triplets, its output moves steadily closer to the correct result and the recognition accuracy keeps improving. When the total loss values corresponding to all the training document triplets have reached the preset convergence condition, the initial language model after convergence is recorded as the preset pre-trained language model.
Further, this embodiment may also use the Adam optimizer, which updates parameters by a gradient-descent-based method and keeps updating the initial parameters until the total loss value is less than the set threshold.
In one embodiment, before the training document triplet is acquired, the method further includes:
(1) acquiring a preset sample document set; the sample document set contains at least one sample document; each sample document is associated with one document title;
The sample documents in the preset sample document set can be obtained by crawling pdf documents from major websites with conventional crawler technology; the crawled information includes each sample document and the document title associated with it.
(2) normalizing each document title, and classifying the sample documents according to the normalized document titles, to obtain the document category corresponding to each sample document;
Specifically, in one embodiment, normalizing each document title includes:
detecting whether the document title contains a preset special symbol;
when the document title contains the preset special symbol, removing the preset special symbol and all characters before it, to obtain a stripped title;
The preset special symbol may be ":". Understandably, although the content of each pdf document differs, the structure of the titles is mostly consistent. For example, in a pdf document titled "XXX Company: 2020 Annual Report", the text before the ":" merely identifies the company that the report belongs to, so the preset special symbol and all characters before it should be removed; doing so does not affect subsequent document classification.
detecting whether the stripped title contains a preset year character and/or a preset count character;
when the stripped title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, at which point normalization of the document title is complete.
Understandably, a preset year character is a character in the title that expresses a year, and a preset count character is a character in the title that expresses an ordinal count, as in "XXX Company: 2020 X-th Quarter Report". The first preset character and the second preset character may be English letters or other special characters; they serve to eliminate the influence of the year and the count on document classification.
For example, if the stripped title obtained after removing the preset special symbol and all characters before it is "Announcement on Convening the Eighth Meeting of 2020", the "2020" can be replaced with X and the "Eighth" with Y, giving "Announcement on Convening the Y-th Meeting of Year X".
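The title normalization steps can be sketched with regular expressions. The sketch assumes Chinese-language titles of the kind used in the filing's original examples (for instance 《关于召开2020年度第八次会议公告》); the function name `normalize_title`, the regular expressions, and the handling of both the fullwidth and the ASCII colon are illustrative assumptions rather than details given in the source.

```python
import re

# Everything up to and including the first colon (fullwidth or ASCII).
PREFIX_RE = re.compile(r"^.*?[::]")
# A four-digit year directly followed by the character 年 ("year").
YEAR_RE = re.compile(r"(?:19|20)\d{2}(?=年)")
# An ordinal count: 第 ("number") followed by Arabic or Chinese numerals.
COUNT_RE = re.compile(r"第[0-9一二三四五六七八九十百零〇两]+")


def normalize_title(title, year_token="X", count_token="Y"):
    """Strip the company prefix, then mask the year and the ordinal count."""
    # Step 1: remove the special symbol and all characters before it.
    stripped = PREFIX_RE.sub("", title, count=1)
    # Step 2: replace the year with the first preset character.
    stripped = YEAR_RE.sub(year_token, stripped)
    # Step 3: replace the count with the second preset character.
    return COUNT_RE.sub("第" + count_token, stripped)
```

With these assumptions, a title such as "XXX公司:关于召开2020年度第八次会议公告" would normalize to "关于召开X年度第Y次会议公告", so announcements from different companies, years and meeting numbers share one normalized title.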
Further, after each document title is normalized, the sample documents are classified according to the normalized document titles, that is, according to the degree of character matching between the normalized titles. Documents whose matching degree is higher than a preset threshold are placed in the same category, which yields the document category corresponding to each sample document. The preset threshold may be set to, for example, 90% or 95%.
For example, if the classification produces a large number of categories, the 500 document categories containing the most sample documents can be kept and the remaining document categories discarded, which avoids burdening the computer system with too many document categories.
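The grouping of normalized titles into categories can be sketched as a greedy clustering. The source does not say how the matching degree is computed, so the use of `difflib.SequenceMatcher.ratio()` from the Python standard library, the greedy comparison against the first title of each group, and the function name `group_titles` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def group_titles(titles, match_threshold=0.9, max_categories=500):
    """Group titles whose similarity to a group's first title exceeds
    match_threshold, then keep the largest max_categories groups.

    Returns a list of (representative_title, member_indices) pairs,
    ordered from the largest group to the smallest.
    """
    groups = []
    for index, title in enumerate(titles):
        for representative, members in groups:
            ratio = SequenceMatcher(None, representative, title).ratio()
            if ratio > match_threshold:
                members.append(index)
                break
        else:
            # No existing group matched: this title starts a new category.
            groups.append((title, [index]))
    groups.sort(key=lambda group: len(group[1]), reverse=True)
    return groups[:max_categories]
```

A greedy pass of this kind depends on the order of the titles and compares every title with every group representative, so a production system would likely want a more scalable and order-independent grouping; the sketch only makes the thresholding and the top-500 cut concrete.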
(3) selecting one document category from the document categories as a positive document category; selecting one document category from the document categories other than the positive document category as a negative document category;
(4) selecting one sample document from the positive document category and recording it as the training document; selecting one sample document other than the training document from the positive document category and recording it as the positive sample document; selecting one sample document from the negative document category and recording it as the negative sample document;
(5) constructing the training document triplet from the training document, the positive sample document and the negative sample document.
Understandably, after the sample documents are classified according to the normalized document titles and the document category corresponding to each sample document is obtained, any one of the document categories can be chosen, one sample document from it selected as the training document, and another document from the same category selected as the positive sample document; then one document category is chosen from the remaining document categories, and one sample document from it is selected as the negative sample document.
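Steps (3) to (5) can be sketched as a small sampling routine. The function name `build_triplet`, the dictionary layout mapping each document category to its list of sample documents, and the uniform random choices are assumptions; the source only requires that the training and positive sample documents share a category and that the negative sample document comes from a different one.

```python
import random


def build_triplet(docs_by_category, rng=random):
    """Return (training_doc, positive_doc, negative_doc).

    docs_by_category: dict mapping a category name to a list of documents.
    rng: a random.Random instance or the random module itself.
    """
    # The positive category needs two distinct documents.
    positive_options = [c for c, docs in docs_by_category.items()
                        if len(docs) >= 2]
    if not positive_options:
        raise ValueError("no category has two or more documents")
    positive_category = rng.choice(positive_options)
    # The negative category is any other non-empty category.
    negative_options = [c for c, docs in docs_by_category.items()
                        if c != positive_category and docs]
    if not negative_options:
        raise ValueError("need a second non-empty category")
    negative_category = rng.choice(negative_options)
    training_doc, positive_doc = rng.sample(
        docs_by_category[positive_category], 2)
    negative_doc = rng.choice(docs_by_category[negative_category])
    return training_doc, positive_doc, negative_doc
```

Passing a seeded `random.Random` instance as `rng` makes the sampled triplets reproducible, which can help when comparing training runs.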
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a document classification prediction apparatus is provided, which corresponds one-to-one to the document classification prediction method in the above embodiments. As shown in FIG. 5, the document classification prediction apparatus includes a prediction request instruction receiving module 10, a document parsing module 20, a first vector extraction module 30, a document vector set acquisition module 40 and a document category determination module 50. The functional modules are described in detail as follows:
the prediction request instruction receiving module 10, configured to receive a prediction request instruction containing a target document;
the document parsing module 20, configured to parse the target document through a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
the first vector extraction module 30, configured to input the text information and the coordinate information into a preset pre-trained language model and perform vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document;
the document vector set acquisition module 40, configured to acquire a sample document vector set; the sample document vector set contains at least one sample document vector; each sample document vector is associated with one document category;
the document category determination module 50, configured to determine the document vector distance between the document representation vector and each of the sample document vectors, and determine the document category corresponding to the target document according to the document vector distances.
Preferably, as shown in FIG. 6, the document classification prediction apparatus further includes:
a document triplet acquisition module 01, configured to acquire a training document triplet; the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
a second vector extraction module 02, configured to input the training document triplet into an initial language model containing initial parameters and perform vector extraction on the training document, the positive sample document and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
a total loss value determination module 03, configured to determine the total loss value of the language model according to the first training vector, the second training vector and the third training vector;
a language model training module 04, configured to iteratively update the initial parameters of the initial language model when the total loss value has not reached a preset convergence condition, until the total loss value reaches the preset convergence condition, and to record the initial language model after convergence as the preset pre-trained language model.
Preferably, the second vector extraction module includes:
a word sequence extraction unit, configured to extract the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
a high-order feature determination unit, configured to determine, using a preset feature representation method, the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;
an average pooling unit, configured to perform average pooling on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
优选地,所述文档分类预测装置还包括:Preferably, the document classification prediction device further includes:
样本文档集合获取模块,用于获取预设样本文档集合;所述样本文档集合中包含至少一个样本文档;一个所述样本文档关联一个文档标题;a sample document set acquisition module, used for acquiring a preset sample document set; the sample document set includes at least one sample document; one of the sample documents is associated with a document title;
归一化处理模块,用于对各所述文档标题进行归一化处理,并根据归一化处理之后的各文档标题,对各所述样本文档进行文档分类,得到与各所述样本文档对应的文档类别;A normalization processing module is used for normalizing each of the document titles, and according to each document title after the normalization processing, the document classification is performed on each of the sample documents, and the corresponding sample documents are obtained. the document category;
文档类别选取模块,用于自各所述文档类别中选取一个文档类别作为正文档类别;自除所述正文档类别之外的其它文档类别中选取一个文档类别作为负文档类别;A document category selection module for selecting a document category from each of the document categories as a positive document category; selecting a document category from other document categories except the positive document category as a negative document category;
文档选取模块,用于自所述正文档类别中选取一个样本文档并记录为所述训练文档;同时,自所述正文档类别中选取除所述训练文档外的一个样本文档并记录为所述正样本文档;自所述负文档类别中选取一个样本文档并记录为所述负样本文档;The document selection module is used to select a sample document from the positive document category and record it as the training document; meanwhile, select a sample document other than the training document from the positive document category and record it as the training document Positive sample document; select a sample document from the negative document category and record it as the negative sample document;
三元组构建模块,用于根据所述训练文档、正样本文档以及所述负样本文档构建所述训练文档三元组。A triplet building module is configured to construct the training document triplet according to the training document, the positive sample document and the negative sample document.
Preferably, the normalization module includes:
a special symbol detection unit, configured to detect whether the document title contains a preset special symbol;
a character removal unit, configured to, when the document title contains the preset special symbol, remove the preset special symbol and all characters preceding it, to obtain a stripped title;
a special character detection unit, configured to detect whether the stripped title contains a preset year character and/or a preset count character;
a character replacement unit, configured to, when the stripped title contains the preset year character and/or the preset count character, replace the preset year character with a first preset character and replace the preset count character with a second preset character, which indicates that normalization of the document title is complete.
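The title normalization performed by these units can be sketched as below. The concrete choices are assumptions made for illustration only: a colon stands in for the preset special symbol, four-digit years and English ordinals such as `3rd` stand in for the preset year and count characters, and `<YEAR>` and `<NUM>` stand in for the first and second preset characters. Cutting at the first occurrence of the symbol is likewise an assumption.

```python
import re

# All four constants are illustrative assumptions, not values from the application.
SPECIAL_SYMBOL = ":"
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")
COUNT_PATTERN = re.compile(r"\b\d+(?:st|nd|rd|th)\b")
YEAR_PLACEHOLDER = "<YEAR>"
COUNT_PLACEHOLDER = "<NUM>"


def normalize_title(title: str) -> str:
    """Strip the prefix up to the special symbol, then mask years and counts."""
    if SPECIAL_SYMBOL in title:
        # Drop the symbol and everything before its first occurrence.
        title = title.split(SPECIAL_SYMBOL, 1)[1]
    title = title.strip()
    title = YEAR_PATTERN.sub(YEAR_PLACEHOLDER, title)
    title = COUNT_PATTERN.sub(COUNT_PLACEHOLDER, title)
    return title
```

Under these assumptions, two titles that differ only in issuer prefix, year, and meeting number normalize to the same string, so their documents fall into the same category.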
Preferably, as shown in FIG. 7, the document category determination module 50 includes:
a sample document selection unit 501, configured to select a preset number of sample documents from among the sample documents whose document vector distance is less than or equal to a preset distance threshold, and to record the selected sample documents as candidate documents;
a document category determination unit 502, configured to obtain, for each document category, the proportion of candidate documents of that category among all the candidate documents, and to record the document category with the highest proportion as the document category of the target document.
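The candidate selection and proportion-based decision can be sketched as a thresholded nearest-neighbour vote. Several details are assumptions made for the example: Euclidean distance is used as the document vector distance, the preset number of candidates is taken nearest-first, and `None` is returned when no sample falls within the threshold. The function name `predict_category` is hypothetical.

```python
import math
from collections import Counter
from typing import Optional, Sequence, Tuple


def predict_category(
    query: Sequence[float],
    samples: Sequence[Tuple[Sequence[float], str]],
    max_distance: float,
    preset_number: int,
) -> Optional[str]:
    """Vote among the nearest samples that lie within the distance threshold."""
    # Distance from the target document vector to every sample vector.
    scored = sorted(
        ((math.dist(query, vector), category) for vector, category in samples),
        key=lambda pair: pair[0],
    )
    # Keep at most `preset_number` samples within the threshold (nearest first).
    candidates = [c for d, c in scored if d <= max_distance][:preset_number]
    if not candidates:
        return None  # Assumption: no prediction when nothing is close enough.
    # The category with the largest share of the candidates wins.
    return Counter(candidates).most_common(1)[0][0]
```

Because every candidate counts once, the category with the highest count is also the one with the highest proportion. When two categories tie, `Counter.most_common` returns the one encountered first, here the one with the nearer sample.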
For specific limitations on the document classification prediction apparatus, refer to the limitations on the document classification prediction method above; they are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The database of the computer device stores the data used by the document classification prediction method of the above embodiments. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a document classification prediction method. The readable storage medium provided in this embodiment includes non-volatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
receiving a prediction request instruction containing a target document;
parsing the target document with a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
acquiring a sample document vector set, where the sample document vector set contains at least one sample document vector and each sample document vector is associated with one document category;
determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to the document vector distances.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a prediction request instruction containing a target document;
parsing the target document with a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
acquiring a sample document vector set, where the sample document vector set contains at least one sample document vector and each sample document vector is associated with one document category;
determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to the document vector distances.
Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments may be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile or volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is used only as an example. In practical applications, the functions may be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and all of them shall fall within the scope of protection of this application.

Claims (20)

  1. A document classification prediction method, comprising:
    receiving a prediction request instruction containing a target document;
    parsing the target document with a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
    acquiring a sample document vector set, the sample document vector set containing at least one sample document vector, each sample document vector being associated with one document category;
    determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to the document vector distances.
  2. The document classification prediction method according to claim 1, wherein, before the text information and the coordinate information are input into the preset pre-trained language model, the method further comprises:
    acquiring a training document triplet, the sample document triplet comprising a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the sample document;
    inputting the sample document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
    determining a total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
    when the total loss value does not satisfy a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value satisfies the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
  3. The document classification prediction method according to claim 2, wherein inputting the sample document triplet into the initial language model containing initial parameters and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:
    extracting word sequences from the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
    determining, by a preset feature representation method, training high-order features corresponding to the words in the training word sequence, positive-sample high-order features corresponding to the words in the positive sample word sequence, and negative-sample high-order features corresponding to the words in the negative sample word sequence;
    performing average pooling on the training high-order features, the positive-sample high-order features, and the negative-sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
  4. The document classification prediction method according to claim 2, wherein determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector comprises:
    determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
    determining the total loss value by a triplet loss function according to the first document distance and the second document distance.
  5. The document classification prediction method according to claim 2, wherein, before acquiring the sample document triplet, the method further comprises:
    acquiring a preset sample document set, the sample document set containing at least one sample document, each sample document being associated with one document title;
    normalizing each document title and, according to the normalized document titles, classifying the sample documents to obtain the document category corresponding to each sample document;
    selecting one document category from the document categories as a positive document category, and selecting one document category from the remaining document categories as a negative document category;
    selecting one sample document from the positive document category and recording it as the training document; selecting, from the positive document category, one sample document other than the training document and recording it as the positive sample document; and selecting one sample document from the negative document category and recording it as the negative sample document;
    constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
  6. The document classification prediction method according to claim 5, wherein normalizing each document title comprises:
    detecting whether the document title contains a preset special symbol;
    when the document title contains the preset special symbol, removing the preset special symbol and all characters preceding it, to obtain a stripped title;
    detecting whether the stripped title contains a preset year character and/or a preset count character;
    when the stripped title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, which indicates that normalization of the document title is complete.
  7. The document classification prediction method according to claim 1, wherein each sample document vector is further associated with one sample document, and determining the document category corresponding to the target document according to the document vector distances comprises:
    selecting a preset number of sample documents from among the sample documents whose document vector distance is less than or equal to a preset distance threshold, and recording the selected sample documents as candidate documents;
    obtaining, for each document category, the proportion of candidate documents of that category among all the candidate documents, and recording the document category with the highest proportion as the document category of the target document.
  8. A document classification prediction apparatus, comprising:
    a prediction request instruction receiving module, configured to receive a prediction request instruction containing a target document;
    a document parsing module, configured to parse the target document with a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    a first vector extraction module, configured to input the text information and the coordinate information into a preset pre-trained language model, and to perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
    a document vector set acquisition module, configured to acquire a sample document vector set, the sample document vector set containing at least one sample document vector, each sample document vector being associated with one document category;
    a document category determination module, configured to determine a document vector distance between the document representation vector and each sample document vector, and to determine the document category corresponding to the target document according to the document vector distances.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    receiving a prediction request instruction containing a target document;
    parsing the target document with a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
    acquiring a sample document vector set, the sample document vector set containing at least one sample document vector, each sample document vector being associated with one document category;
    determining a document vector distance between the document representation vector and each sample document vector, and determining the document category corresponding to the target document according to the document vector distances.
  10. The computer device according to claim 9, wherein, before the text information and the coordinate information are input into the preset pre-trained language model, the processor further implements the following steps when executing the computer-readable instructions:
    acquiring a training document triplet, the sample document triplet comprising a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the sample document;
    inputting the sample document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
    determining a total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
    when the total loss value does not satisfy a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value satisfies the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
  11. The computer device according to claim 10, wherein inputting the sample document triplet into the initial language model containing initial parameters and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:
    extracting word sequences from the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
    determining, by a preset feature representation method, training high-order features corresponding to the words in the training word sequence, positive-sample high-order features corresponding to the words in the positive sample word sequence, and negative-sample high-order features corresponding to the words in the negative sample word sequence;
    performing average pooling on the training high-order features, the positive-sample high-order features, and the negative-sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
  12. The computer device according to claim 10, wherein determining the total loss value of the language model according to the first training vector, the second training vector, and the third training vector comprises:
    determining a first document distance between the first training vector and the second training vector, and determining a second document distance between the first training vector and the third training vector;
    determining the total loss value by a triplet loss function according to the first document distance and the second document distance.
  13. The computer device according to claim 10, wherein, before acquiring the sample document triplet, the processor further implements the following steps when executing the computer-readable instructions:
    acquiring a preset sample document set, the sample document set containing at least one sample document, each sample document being associated with one document title;
    normalizing each document title and, according to the normalized document titles, classifying the sample documents to obtain the document category corresponding to each sample document;
    selecting one document category from the document categories as a positive document category, and selecting one document category from the remaining document categories as a negative document category;
    selecting one sample document from the positive document category and recording it as the training document; selecting, from the positive document category, one sample document other than the training document and recording it as the positive sample document; and selecting one sample document from the negative document category and recording it as the negative sample document;
    constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
  14. The computer device according to claim 13, wherein normalizing each document title comprises:
    detecting whether the document title contains a preset special symbol;
    when the document title contains the preset special symbol, removing the preset special symbol and all characters preceding it, to obtain a stripped title;
    detecting whether the stripped title contains a preset year character and/or a preset count character;
    when the stripped title contains the preset year character and/or the preset count character, replacing the preset year character with a first preset character and replacing the preset count character with a second preset character, which indicates that normalization of the document title is complete.
  15. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a prediction request instruction containing a target document;
    parsing the target document with a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
    inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
    obtaining a sample document vector set; the sample document vector set contains at least one sample document vector; each sample document vector is associated with one document category;
    determining a document vector distance between the document representation vector and each of the sample document vectors, and determining the document category corresponding to the target document according to the document vector distances.
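The final step above can be sketched as a nearest-sample lookup. This is a minimal illustration under stated assumptions: the claim does not fix the distance metric or the decision rule, so Euclidean distance and "take the category of the closest sample vector" are assumptions, and the function name `classify_by_nearest_sample` is hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def classify_by_nearest_sample(
    doc_vector: Sequence[float],
    samples: List[Tuple[Sequence[float], str]],
) -> Optional[str]:
    """Return the category of the sample vector closest to doc_vector.

    `samples` pairs each sample document vector with its document category.
    Euclidean distance is an assumption; the claim only says "distance".
    """
    best_category: Optional[str] = None
    best_distance = math.inf
    for sample_vector, category in samples:
        distance = math.dist(doc_vector, sample_vector)
        if distance < best_distance:
            best_distance = distance
            best_category = category
    return best_category
```

Because the categories live in the sample vector set rather than in a fixed classifier head, a new document category can be supported by adding sample vectors, without retraining the language model.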
  16. The readable storage medium of claim 15, wherein, before the text information and the coordinate information are input into the preset pre-trained language model, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    obtaining a training document triplet; the sample document triplet comprises a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the sample document;
    inputting the sample document triplet into an initial language model containing initial parameters, and performing vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
    determining a total loss value of the language model according to the first training vector, the second training vector, and the third training vector;
    when the total loss value does not meet a preset convergence condition, iteratively updating the initial parameters of the initial language model until the total loss value meets the preset convergence condition, and recording the converged initial language model as the preset pre-trained language model.
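The update-until-convergence step above can be sketched as a generic loop. This is a skeleton under assumptions, not the claimed training procedure: the convergence condition is taken to be "loss at or below a threshold", and `loss_fn`, `update_fn`, `threshold`, and `max_steps` are hypothetical names standing in for the model's loss computation and parameter update.

```python
from typing import Callable, TypeVar

P = TypeVar("P")  # whatever object holds the model parameters


def train_until_converged(
    params: P,
    loss_fn: Callable[[P], float],
    update_fn: Callable[[P], P],
    threshold: float = 1e-6,
    max_steps: int = 10_000,
) -> P:
    """Update params until the loss meets the (assumed) convergence condition."""
    for _ in range(max_steps):
        if loss_fn(params) <= threshold:
            break  # converged: these parameters become the "pre-trained" model
        params = update_fn(params)
    return params


# Toy usage: minimize p**2 by gradient descent with learning rate 0.1.
final = train_until_converged(1.0, lambda p: p * p, lambda p: p - 0.1 * 2 * p)
```

The `max_steps` cap is an addition for safety so the loop terminates even if the loss never reaches the threshold.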
  17. The readable storage medium of claim 16, wherein the inputting of the sample document triplet into the initial language model containing initial parameters, and the performing of vector extraction on the training document, the positive sample document, and the negative sample document respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, comprises:
    extracting word sequences from the training document, the positive sample document, and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
    determining, by a preset feature representation method, training high-order features corresponding to the words in the training word sequence, positive-sample high-order features corresponding to the words in the positive sample word sequence, and negative-sample high-order features corresponding to the words in the negative sample word sequence;
    performing average pooling on the training high-order features, the positive-sample high-order features, and the negative-sample high-order features respectively, to obtain the first training vector, the second training vector, and the third training vector.
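The average pooling step above reduces a variable-length sequence of per-word feature vectors to a single fixed-length document vector. A minimal sketch, assuming each word's high-order feature is a plain list of floats of equal length (the function name `mean_pool` is hypothetical):

```python
from typing import List, Sequence


def mean_pool(word_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-word feature vectors element-wise into one document vector."""
    if not word_features:
        raise ValueError("cannot pool an empty word sequence")
    count = len(word_features)
    dim = len(word_features[0])
    return [sum(feature[i] for feature in word_features) / count for i in range(dim)]
```

Applied separately to the training, positive-sample, and negative-sample features, this would yield the three training vectors named in the claim.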
  18. The readable storage medium of claim 16, wherein the determining of the total loss value of the language model according to the first training vector, the second training vector, and the third training vector comprises:
    determining a first document distance between the first training vector and the second training vector; at the same time, determining a second document distance between the first training vector and the third training vector;
    determining the total loss value through a triplet loss function according to the first document distance and the second document distance.
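The loss described above can be sketched with the standard margin-based triplet loss, max(0, d_pos - d_neg + margin). The claim names a triplet loss over the two document distances but does not give the formula, so the Euclidean metric and the `margin` value here are assumptions.

```python
import math
from typing import Sequence


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed hyperparameter, not stated in the claim
) -> float:
    """Margin-based triplet loss over one (training, positive, negative) triplet."""
    first_distance = math.dist(anchor, positive)   # training vs. positive sample
    second_distance = math.dist(anchor, negative)  # training vs. negative sample
    return max(0.0, first_distance - second_distance + margin)
```

The loss is zero once the negative sample is farther from the training document than the positive sample by at least the margin, which is what pushes same-category documents together in the vector space.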
  19. The readable storage medium of claim 16, wherein, before the sample document triplet is obtained, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    obtaining a preset sample document set; the sample document set contains at least one sample document; each sample document is associated with one document title;
    normalizing each of the document titles, and classifying each of the sample documents according to the normalized document titles to obtain a document category corresponding to each of the sample documents;
    selecting one document category from the document categories as a positive document category; selecting one document category from the document categories other than the positive document category as a negative document category;
    selecting one sample document from the positive document category and recording it as the training document; at the same time, selecting one sample document other than the training document from the positive document category and recording it as the positive sample document; selecting one sample document from the negative document category and recording it as the negative sample document;
    constructing the training document triplet from the training document, the positive sample document, and the negative sample document.
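The triplet construction above can be sketched as follows, assuming the sample documents have already been grouped by category. Random selection is an assumption (the claim only says "select"), as are the name `build_triplet` and the requirement, implied by the claim, that the positive category hold at least two documents.

```python
import random
from typing import Dict, List, Tuple


def build_triplet(
    docs_by_category: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[str, str, str]:
    """Pick (training, positive, negative) documents from categorized samples."""
    # The positive category must supply two distinct documents.
    positive_candidates = [c for c, docs in docs_by_category.items() if len(docs) >= 2]
    if not positive_candidates:
        raise ValueError("need a category with at least two documents")
    positive_category = rng.choice(positive_candidates)

    # The negative category is any other non-empty category.
    negative_candidates = [
        c for c, docs in docs_by_category.items() if c != positive_category and docs
    ]
    if not negative_candidates:
        raise ValueError("need a second, non-empty category")
    negative_category = rng.choice(negative_candidates)

    training_doc, positive_doc = rng.sample(docs_by_category[positive_category], 2)
    negative_doc = rng.choice(docs_by_category[negative_category])
    return training_doc, positive_doc, negative_doc
```

Passing an explicit `random.Random` instance keeps the sampling reproducible when a seed is supplied.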
  20. The readable storage medium of claim 19, wherein the normalizing of each of the document titles comprises:
    detecting whether the document title contains a preset special symbol;
    when the document title contains the preset special symbol, removing the preset special symbol and all characters preceding the preset special symbol to obtain a stripped title;
    detecting whether the stripped title contains a preset year character and/or a preset number-of-times character;
    when the stripped title contains the preset year character and/or the preset number-of-times character, replacing the preset year character with a first preset character and replacing the preset number-of-times character with a second preset character, which indicates that normalization of the document title is complete.
PCT/CN2021/125227 2020-12-21 2021-10-21 Document classification prediction method and apparatus, and computer device and storage medium WO2022134805A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011521171.0 2020-12-21
CN202011521171.0A CN112699923A (en) 2020-12-21 2020-12-21 Document classification prediction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022134805A1 WO2022134805A1 (en) 2022-06-30

Family

ID=75509652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125227 WO2022134805A1 (en) 2020-12-21 2021-10-21 Document classification prediction method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112699923A (en)
WO (1) WO2022134805A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699923A (en) * 2020-12-21 2021-04-23 深圳壹账通智能科技有限公司 Document classification prediction method and device, computer equipment and storage medium
CN113505579A (en) * 2021-06-03 2021-10-15 北京达佳互联信息技术有限公司 Document processing method and device, electronic equipment and storage medium
CN115578739A (en) * 2022-09-16 2023-01-06 上海来也伯特网络科技有限公司 Training method and device for realizing IA classification model by combining RPA and AI

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090216693A1 (en) * 1999-04-28 2009-08-27 Pal Rujan Classification method and apparatus
CN110298338A (en) * 2019-06-20 2019-10-01 北京易道博识科技有限公司 A kind of file and picture classification method and device
CN111400499A (en) * 2020-03-24 2020-07-10 网易(杭州)网络有限公司 Training method of document classification model, document classification method, device and equipment
CN112016273A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Document directory generation method and device, electronic equipment and readable storage medium
CN112052331A (en) * 2019-06-06 2020-12-08 武汉Tcl集团工业研究院有限公司 Method and terminal for processing text information
CN112699923A (en) * 2020-12-21 2021-04-23 深圳壹账通智能科技有限公司 Document classification prediction method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587175A (en) * 2022-12-08 2023-01-10 阿里巴巴达摩院(杭州)科技有限公司 Man-machine conversation and pre-training language model training method and system and electronic equipment
CN115587175B (en) * 2022-12-08 2023-03-14 阿里巴巴达摩院(杭州)科技有限公司 Man-machine conversation and pre-training language model training method and system and electronic equipment

Also Published As

Publication number Publication date
CN112699923A (en) 2021-04-23

Similar Documents

Publication Publication Date Title
WO2022134805A1 (en) Document classification prediction method and apparatus, and computer device and storage medium
WO2022142613A1 (en) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
WO2020147238A1 (en) Keyword determination method, automatic scoring method, apparatus and device, and medium
WO2021042503A1 (en) Information classification extraction method, apparatus, computer device and storage medium
WO2018153265A1 (en) Keyword extraction method, computer device, and storage medium
WO2020199591A1 (en) Text categorization model training method, apparatus, computer device, and storage medium
WO2019136993A1 (en) Text similarity calculation method and device, computer apparatus, and storage medium
CN111444723B (en) Information extraction method, computer device, and storage medium
CN111666401B (en) Document recommendation method, device, computer equipment and medium based on graph structure
CN112926654B (en) Pre-labeling model training and certificate pre-labeling method, device, equipment and medium
WO2022227162A1 (en) Question and answer data processing method and apparatus, and computer device and storage medium
WO2022116436A1 (en) Text semantic matching method and apparatus for long and short sentences, computer device and storage medium
WO2021169423A1 (en) Quality test method, apparatus and device for customer service recording, and storage medium
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
WO2022141864A1 (en) Conversation intent recognition model training method, apparatus, computer device, and medium
WO2022142108A1 (en) Method and apparatus for training interview entity recognition model, and method and apparatus for extracting interview information entity
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
US20200082213A1 (en) Sample processing method and device
CN112100377A (en) Text classification method and device, computer equipment and storage medium
CN114187595A (en) Document layout recognition method and system based on fusion of visual features and semantic features
CN110598210B (en) Entity recognition model training, entity recognition method, entity recognition device, entity recognition equipment and medium
CN109992778B (en) Resume document distinguishing method and device based on machine learning
CN113806613B (en) Training image set generation method, training image set generation device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908806

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.10.2023)