WO2022134805A1 - Document classification prediction method and apparatus, computer device, and storage medium - Google Patents

Document classification prediction method and apparatus, computer device, and storage medium

Info

Publication number
WO2022134805A1
WO2022134805A1 PCT/CN2021/125227 CN2021125227W WO2022134805A1 WO 2022134805 A1 WO2022134805 A1 WO 2022134805A1 CN 2021125227 W CN2021125227 W CN 2021125227W WO 2022134805 A1 WO2022134805 A1 WO 2022134805A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
sample
training
vector
preset
Prior art date
Application number
PCT/CN2021/125227
Other languages
English (en)
Chinese (zh)
Inventor
刘玉
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134805A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present application relates to the technical field of classification models, and in particular, to a document classification prediction method, apparatus, computer equipment and storage medium.
  • Document classification models in the prior art generally require a large amount of labeled data for training in order to reach a useful classification accuracy, and they are easily affected by data imbalance: if a certain category has very little training data, the classification accuracy of the model for that category will be low, which lowers the overall document classification accuracy. Manually labeling the data also takes a lot of time, which hinders the deployment and application of the model in various fields.
  • Embodiments of the present application provide a document classification prediction method, apparatus, computer device, and storage medium, so as to solve the problem of low document classification accuracy caused by a scarcity of manually annotated data.
  • a document classification prediction method comprising:
  • acquiring a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • a document classification prediction device comprising:
  • the prediction request instruction receiving module is used to receive the prediction request instruction including the target document;
  • a document parsing module configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the first vector extraction module is used for inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • a document vector set acquisition module configured to acquire a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • a document category determination module configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
  • acquiring a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • acquiring a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • In the above document classification prediction method, apparatus, computer device, and storage medium, a prediction request instruction containing a target document is received; document parsing is performed on the target document through a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information; the text information and the coordinate information are input into a preset pre-trained language model, and vector extraction is performed on them to obtain a document representation vector corresponding to the target document; a sample document vector set is obtained, which contains at least one sample document vector, each associated with a document category; the document vector distance between the document representation vector and each sample document vector is determined, and the document category corresponding to the target document is determined according to the document vector distances.
  • The present application determines the document category of the target document by introducing the text information of the document and the corresponding coordinate information, and by using the document vector distance between the sample document vectors and the document representation vector derived from the text information and the coordinate information. In this way, even when few sample documents are available, new documents can still be classified; a document that matches no sample document can be treated as a new document category. As new documents continue to be classified, the number of documents in each document category is supplemented during the classification process, without the need to constantly replace the preset document parsing model or the preset pre-trained language model, which improves the efficiency and convenience of document classification.
  • FIG. 1 is a schematic diagram of an application environment of a document classification prediction method in an embodiment of the present application
  • FIG. 2 is a flowchart of a document classification prediction method in an embodiment of the present application.
  • step S50 in the document classification prediction method in an embodiment of the present application
  • FIG. 5 is a schematic block diagram of a document classification prediction device in an embodiment of the present application.
  • FIG. 6 is another principle block diagram of a document classification prediction apparatus in an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a document category determination module in a document category prediction device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device in an embodiment of the present application.
  • the document classification prediction method provided by the embodiment of the present application can be applied in the application environment shown in FIG. 1 .
  • the document classification prediction method is applied in a document classification prediction system.
  • the document classification prediction system includes a client and a server as shown in FIG. 1; the client and the server communicate through a network, so as to solve the problem of low document classification accuracy caused by a scarcity of manually annotated data.
  • The client, also called the user terminal, refers to a program that corresponds to the server and provides local services to the user.
  • Clients can be installed on, but not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a document classification prediction method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • the prediction request instruction may be an instruction sent by a preset sender (e.g., the author of the target document, or the document manager).
  • The target document is a document that has a regular title and has not yet been classified. A regular title is a title with several fill-in areas, such as a company name area and a year area; the document creator fills each area with the content appropriate to the document. An example of a document with this title style is "Rongsheng Petrochemical (company name area): 2020 (year area) semi-annual report".
  • S20 Perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the preset document parsing model is used to extract text information and coordinate information of the target document.
  • For example, when the target document is a PDF document, the preset document parsing model may be a parsing model based on PyMuPDF (an open-source PDF parsing library).
  • Text information refers to the text content of the first five pages in the target document.
  • the coordinate information refers to the page number of each word in the content of the first five pages and the specific position in the corresponding page number.
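  • The parsing step described above can be sketched as follows, assuming PyMuPDF is installed. `Page.get_text("words")` is a PyMuPDF call that yields one tuple per word, beginning with `(x0, y0, x1, y1, word, ...)`. The names `Word`, `scale_box`, and `parse_pdf`, and the scaling of boxes to a 0 to 1000 range (the input convention of LayoutLM-style models), are illustrative assumptions and are not taken from the application.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    page: int                        # zero-based page number
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1), scaled to 0..1000


def scale_box(box, page_width, page_height, scale=1000):
    """Scale an absolute (x0, y0, x1, y1) box to the 0..scale range."""
    def clamp(value):
        return max(0, min(scale, int(round(value))))

    x0, y0, x1, y1 = box
    return (
        clamp(scale * x0 / page_width),
        clamp(scale * y0 / page_height),
        clamp(scale * x1 / page_width),
        clamp(scale * y1 / page_height),
    )


def parse_pdf(path: str, max_pages: int = 5) -> List[Word]:
    """Collect every word on the first `max_pages` pages with its position."""
    import fitz  # PyMuPDF; imported lazily so scale_box works without it

    words: List[Word] = []
    with fitz.open(path) as doc:
        for page_no in range(min(max_pages, doc.page_count)):
            page = doc[page_no]
            width, height = page.rect.width, page.rect.height
            # Each entry starts with x0, y0, x1, y1, word; the rest is ignored.
            for x0, y0, x1, y1, text, *_ in page.get_text("words"):
                box = scale_box((x0, y0, x1, y1), width, height)
                words.append(Word(text, page_no, box))
    return words
```

  • The `max_pages=5` default mirrors the "first five pages" in the text; the page number and box together play the role of the coordinate information.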
  • S30 Input the text information and the coordinate information into a preset pre-trained language model, and perform vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • the preset pre-trained language model may be a LayoutLM model.
  • Specifically, the text information and the coordinate information are input into the preset pre-trained language model, which generates a target word sequence corresponding to the target document from the text information and the coordinate information; the target word sequence represents the words of the target document ordered according to the coordinate information. The target high-order features corresponding to the target word sequence are then determined by a preset feature representation method, and average pooling is applied to the target high-order features to obtain the document representation vector.
  • S40 Obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with one document category;
  • the sample document vector set is the set of sample document vectors obtained by inputting each sample document into the preset pre-trained language model.
  • Specifically, all sample documents are input into the preset document parsing model, which parses each sample document to obtain the sample text information corresponding to it and the sample coordinate information corresponding to that sample text information; the sample text information and sample coordinate information are then input into the preset pre-trained language model, which performs vector extraction on them to obtain the sample document vector corresponding to each sample document.
  • After each sample document is acquired, its category can be determined according to the document title associated with it, so that each sample document is associated with one document category.
  • S50 Determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • the document vector distance between the document representation vector and each of the sample document vectors is determined, and the document category corresponding to the target document is determined according to each of the document vector distances.
  • each sample document vector is also associated with a sample document; determining the document category corresponding to the target document according to each of the document vector distances includes:
  • S501 Select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
  • the preset number may be determined according to a specific scenario, and for example, the preset number may be 10, 20, etc.
  • the preset distance threshold can be 0.5, 0.7, etc.
  • a preset number of sample documents whose document vector distance is less than or equal to a preset distance threshold are selected as candidate documents.
  • all sample documents that satisfy the condition that the document vector distance is less than or equal to the preset distance threshold may be used as candidate documents.
  • If the document vector distances are all greater than the preset distance threshold, the document categories currently associated with the sample documents cannot characterize the document category of the target document. A new document category is then established according to the document title of the target document, and the target document is classified under the new document category. When a prediction request instruction containing a new target document is next received, if the document vector distance between the document representation vector of the new target document and that of the earlier target document is less than or equal to the preset distance threshold, the document category of the earlier target document can be used as the document category of the new target document, which improves the efficiency of document classification.
  • S502 Obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
  • Specifically, after the candidate documents are selected, the proportion of candidate documents of each document category among all the candidate documents is obtained, and the document category with the highest proportion is recorded as the document category of the target document.
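  • Steps S501 and S502, together with the case in which no sample document lies within the threshold, can be sketched as below. This is a minimal illustration under stated assumptions: the distance is taken to be Euclidean (the text only states this for training), the preset number of candidates is filled with the nearest qualifying samples, and the names `euclidean` and `predict_category` are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_category(doc_vec, samples, max_distance=0.5, preset_number=10):
    """Return the majority category among nearby samples, or None.

    samples: list of (sample_vector, category) pairs.
    None means every sample is farther than max_distance, so the caller
    should create a new category for the target document.
    """
    scored = sorted(
        ((euclidean(doc_vec, vec), category) for vec, category in samples),
        key=lambda pair: pair[0],
    )
    # S501: keep at most preset_number samples within the distance threshold.
    candidates = [cat for dist, cat in scored if dist <= max_distance]
    candidates = candidates[:preset_number]
    if not candidates:
        return None
    # S502: the category with the highest share of candidates wins.
    return Counter(candidates).most_common(1)[0][0]
```

  • Because the vote is taken over counts within one candidate list, the category with the highest count is also the one with the highest proportion.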
  • In this embodiment, the document category of the target document is determined by introducing the text information of the document and the corresponding coordinate information, and by using the document vector distance between the sample document vectors and the document representation vector derived from the text information and the coordinate information. In this way, even when few sample documents are available, new documents can still be classified; a document that matches no sample document can be treated as a new document category. As new documents continue to be classified, the number of documents in each document category is supplemented, without the need to constantly replace the preset document parsing model or the preset pre-trained language model, which improves the efficiency and convenience of document classification.
  • the method before the inputting the text information and the coordinate information into the preset pre-trained language model, the method further includes:
  • S01 Acquire a training document triplet;
  • the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
  • positive sample documents refer to documents with the same document category as the training documents.
  • Negative sample documents are documents whose document category differs from that of the training document.
  • S02 Input the training document triplet into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain a first training vector corresponding to the training document, a second training vector corresponding to the positive sample document, and a third training vector corresponding to the negative sample document;
  • the initial language model may be a LayoutLM model.
  • a detailed explanation of this step can be found in the following examples.
  • In step S02, inputting the training document triplet into an initial language model including initial parameters, and performing vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document, includes:
  • S011 Extract the word sequences of the training document, the positive sample document and the negative sample document, respectively, to obtain the training word sequence corresponding to the training document, the positive sample word sequence corresponding to the positive sample document, and the negative sample word sequence corresponding to the negative sample document.
  • the word sequence refers to each word in the training document, the positive sample document, and the negative sample document and the corresponding ranking relationship.
  • Exemplarily, assume the obtained training word sequence is $(w^a_1, w^a_2, \dots, w^a_x)$, where $a$ denotes the training document and $x$ is the length of its word sequence. Because the initial language model needs to distinguish the beginning of a document ([CLS] below) from its end ([SEP] below), the final training word sequence is $([\mathrm{CLS}], w^a_1, \dots, w^a_x, [\mathrm{SEP}])$. In the same way, assume the obtained positive sample word sequence is $(w^p_1, w^p_2, \dots, w^p_y)$, where $p$ denotes the positive sample document and $y$ is the length of its word sequence; the final positive sample word sequence is $([\mathrm{CLS}], w^p_1, \dots, w^p_y, [\mathrm{SEP}])$. Likewise, assume the obtained negative sample word sequence is $(w^n_1, w^n_2, \dots, w^n_s)$, where $n$ denotes the negative sample document and $s$ is the length of its word sequence; the final negative sample word sequence is $([\mathrm{CLS}], w^n_1, \dots, w^n_s, [\mathrm{SEP}])$.
  • S012 Determine the training high-order feature corresponding to each word in the training word sequence, the positive sample high-order feature corresponding to each word in the positive sample word sequence, and the negative sample high-order feature corresponding to each word in the negative sample word sequence;
  • the high-level feature representation corresponding to each word in each word sequence can be determined by the following expression:
  • S013 Perform an average pooling process on the training high-order features, the positive sample high-order features, and the negative sample high-order features, respectively, to obtain the first training vector, the second training vector, and the third training vector.
  • the average pooling processing method is used to obtain the first training vector, the second training vector and the third training vector.
  • MEAN_POOLING_i(·) is the average pooling function; i denotes the i-th word; S_a is the first training vector; S_p is the second training vector; S_n is the third training vector.
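  • The average pooling of step S013 can be illustrated with a small sketch: the per-word high-order features of one document are averaged, dimension by dimension, into a single vector. The function name `mean_pool` is hypothetical, and whether the [CLS] and [SEP] positions take part in the average is not stated in the text, so the sketch simply averages whatever feature vectors it is given.

```python
def mean_pool(token_features):
    """Average a list of equal-length feature vectors into one vector."""
    if not token_features:
        raise ValueError("need at least one token feature vector")
    count = len(token_features)
    dim = len(token_features[0])
    return [sum(vec[d] for vec in token_features) / count for d in range(dim)]
```

  • Applying the same function to the training, positive sample, and negative sample features would yield S_a, S_p, and S_n respectively.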
  • S03 Determine a total loss value of the language model according to the first training vector, the second training vector and the third training vector.
  • a total loss value of the language model is determined according to the first training vector, the second training vector and the third training vector.
  • In step S03, determining the total loss value of the language model according to the first training vector, the second training vector and the third training vector includes:
  • the total loss value is determined through a triplet loss function.
  • the first document distance and the second document distance are essentially Euclidean distances.
  • the total loss value can be determined according to the following triplet loss function:
  • $L = \max\left(\lVert S_a - S_p \rVert - \lVert S_a - S_n \rVert + \epsilon,\ 0\right)$
  • where $S_a$ is the first training vector; $S_p$ is the second training vector; $S_n$ is the third training vector; $\lVert S_a - S_p \rVert$ is the first document distance; $\lVert S_a - S_n \rVert$ is the second document distance; and $\epsilon$ is a real number, which is taken as 1 in this embodiment.
  • the intuitive meaning of the total loss is that training pulls the positive sample document closer to the training document and pushes the negative sample document farther away from it, thereby improving the document classification accuracy of the model.
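  • As a hedged illustration, the standard form of a triplet loss with Euclidean distances and a margin of 1 can be written as below. The text names the ingredients (two Euclidean distances and a real number set to 1) but the exact expression is not legible in this copy, so treating it as the standard hinge form is an assumption, as are the function names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(s_a, s_p, s_n, margin=1.0):
    """Hinge-style triplet loss.

    s_a: vector of the training (anchor) document
    s_p: vector of the positive sample document
    s_n: vector of the negative sample document
    """
    first_distance = euclidean(s_a, s_p)    # anchor to positive
    second_distance = euclidean(s_a, s_n)   # anchor to negative
    return max(first_distance - second_distance + margin, 0.0)
```

  • The loss is zero once the negative sample is at least `margin` farther from the training document than the positive sample, which matches the intuition stated above.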
  • the convergence condition may be that the total loss value is less than a set threshold, that is, training stops when the total loss value falls below the set threshold; it may also be that, after 10,000 calculations, the total loss value is very small and no longer decreases, in which case training stops and the converged initial language model is recorded as the preset pre-trained language model.
  • After the total loss value is determined from the training document, positive sample document and negative sample document in a training document triplet, if the total loss value has not reached the preset convergence condition, the initial parameters of the initial language model are adjusted according to the total loss value, and the training document triplet is re-input into the adjusted initial language model. Once the total loss value for that triplet reaches the preset convergence condition, another training document triplet is selected (for example, by replacing the negative sample document or the positive sample document) and steps S01 to S04 are performed to obtain its total loss value; if that value has not reached the preset convergence condition, the initial parameters are adjusted again according to the total loss value until it does.
  • In this way, the output of the initial language model keeps moving closer to the accurate result and its recognition accuracy keeps increasing, until the total loss values corresponding to all training document triplets reach the preset convergence condition, at which point the converged initial language model is recorded as the preset pre-trained language model.
  • an Adam optimizer may also be used; it is a parameter update method based on gradient descent, and it keeps updating the initial parameters until the total loss value satisfies the set threshold condition.
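  • The two convergence conditions described above can be sketched as a single check over the recorded loss values. This is only an illustration: the function name `has_converged`, the default threshold, and the use of `min_delta` to decide that the loss "no longer decreases" are assumptions, and the sketch does not try to encode the additional "very small" qualifier of the second condition.

```python
def has_converged(losses, threshold=0.01, patience=10_000, min_delta=1e-6):
    """Decide whether training may stop.

    losses: total loss values recorded so far, oldest first.
    Stops when the latest loss is below `threshold`, or when the best loss
    of the last `patience` calculations is no better (by `min_delta`) than
    the best loss seen before them.
    """
    if not losses:
        return False
    if losses[-1] < threshold:
        return True
    if len(losses) > patience:
        best_before = min(losses[:-patience])
        best_recent = min(losses[-patience:])
        return best_before - best_recent < min_delta
    return False
```

  • A training loop would append each new total loss value to `losses` and keep adjusting the parameters while this function returns `False`.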
  • the method before acquiring the triplet of the sample document, the method further includes:
  • acquiring a preset sample document set; the sample document set includes at least one sample document; one of the sample documents is associated with a document title;
  • the sample documents in the preset sample document set can be obtained by crawling PDF documents from major websites with conventional crawling techniques; the crawled information includes the sample documents and the document titles associated with them.
  • the normalization process for each of the document titles includes:
  • the preset special symbol can be ":". Understandably, although the content of each PDF document differs, the structure of the content is mostly the same. For example, for PDF documents with titles like "XXX Company: 2020 Annual Report", the text before ":" only identifies which company the report belongs to, so the preset special symbol and all characters before it can be removed without affecting the subsequent document classification.
  • If the culled title contains a preset year character and/or a preset number of times character, replace the preset year character with the first preset character and replace the preset number of times character with the second preset character, which completes the normalization processing of the document title.
  • the preset year character is the character containing the year in the title;
  • the preset number of times character is the character in the title that indicates an ordinal or a frequency, such as the X in "XXX Company: 2020 X Quarterly Report".
  • the first preset characters and the second preset characters can be replaced by English characters or other special characters.
  • the first preset characters and the second preset characters are used to eliminate the influence of the year and the number of times on the document classification.
  • For example, if the title is "Announcement on Holding the Eighth Meeting in 2020", then "2020" is replaced by X and "Eighth" by Y, giving "Announcement on Holding the Yth Meeting in Year X".
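  • The title normalization described above can be sketched with regular expressions. The patterns below are illustrative assumptions for English titles (a four-digit year starting with 19 or 20, and ordinals such as "Eighth" or "3rd"); the original titles are presumably Chinese, so a real implementation would need patterns for that language, and the English output differs slightly in wording from the translated example above. The name `normalize_title` is hypothetical.

```python
import re

# Assumed patterns: four-digit years and simple English ordinals.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")
ORDINAL_RE = re.compile(
    r"\b(?:\d+(?:st|nd|rd|th)|first|second|third|fourth|fifth|"
    r"sixth|seventh|eighth|ninth|tenth)\b",
    re.IGNORECASE,
)


def normalize_title(title, special=":", year_char="X", count_char="Y"):
    """Drop the prefix up to the special symbol, then mask years and ordinals."""
    _, found, rest = title.partition(special)
    culled = rest if found else title          # title after culling
    culled = YEAR_RE.sub(year_char, culled)    # first preset character
    culled = ORDINAL_RE.sub(count_char, culled)  # second preset character
    return culled.strip()
```

  • With these patterns, "XXX Company: 2020 Annual Report" and "YYY Company: 2021 Annual Report" both normalize to the same string, which is what lets titles from different companies and years fall into one category.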
  • document classification is performed on each of the sample documents according to the normalized document titles: the matching degree between the characters of the titles is computed, and documents whose matching degree is higher than a preset threshold are grouped into one category, which yields the document category corresponding to each sample document.
  • the preset threshold can be set to 90%, 95% and so on.
  • the top 500 document categories with the most sample documents can be selected and the remaining document categories removed, to avoid burdening the computer system with too many document categories.
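  • One way to realize the title-based grouping is sketched below. The text does not say how the matching degree is measured, so the use of `difflib.SequenceMatcher.ratio()` from the Python standard library, the greedy comparison against one representative title per group, and the name `group_by_title` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def group_by_title(titles, threshold=0.9, keep_top=500):
    """Greedily group normalized titles by character-level similarity.

    Returns {representative_title: [indices of member documents]},
    keeping only the `keep_top` largest groups.
    """
    groups = {}
    for index, title in enumerate(titles):
        for representative in groups:
            ratio = SequenceMatcher(None, representative, title).ratio()
            if ratio > threshold:  # matching degree higher than threshold
                groups[representative].append(index)
                break
        else:
            # No existing group matched, so this title starts a new one.
            groups[title] = [index]
    largest = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(largest[:keep_top])
```

  • Each key of the returned dictionary acts as a document category, and the index lists identify the sample documents that belong to it.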
  • After the document categories are obtained, any document category can be selected, a sample document chosen from it as the training document, and another document chosen from the same category as the positive sample document; then a document category other than the selected one is chosen, and a sample document is picked from it as the negative sample document.
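  • The triplet construction described above can be sketched as follows. The function name `sample_triplet` is hypothetical, and requiring the positive sample to be a different document from the training document is an assumption; the text only says it comes from the same category.

```python
import random


def sample_triplet(docs_by_category, rng=random):
    """Draw (training_doc, positive_doc, negative_doc).

    docs_by_category: {category: [document, ...]}
    rng: a random.Random instance or the random module itself.
    """
    # The positive category must hold at least two documents.
    eligible = [c for c, docs in docs_by_category.items() if len(docs) >= 2]
    non_empty = [c for c, docs in docs_by_category.items() if docs]
    if not eligible or len(non_empty) < 2:
        raise ValueError(
            "need one category with two documents and a second non-empty category"
        )
    positive_category = rng.choice(eligible)
    training_doc, positive_doc = rng.sample(docs_by_category[positive_category], 2)
    negative_category = rng.choice(
        [c for c in non_empty if c != positive_category]
    )
    negative_doc = rng.choice(docs_by_category[negative_category])
    return training_doc, positive_doc, negative_doc
```

  • Passing a seeded `random.Random` makes the sampling reproducible, which is convenient when comparing training runs.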
  • a document classification prediction apparatus is provided, and the document classification prediction apparatus corresponds one-to-one with the document classification prediction method in the above embodiment.
  • the document classification prediction apparatus includes a prediction request instruction receiving module 10, a document parsing module 20, a first vector extraction module 30, a document vector set acquisition module 40 and a document category determination module 50.
  • the detailed description of each functional module is as follows:
  • a prediction request instruction receiving module 10 configured to receive a prediction request instruction including a target document
  • the document parsing module 20 is configured to perform document parsing on the target document by using a preset document parsing model to obtain text information corresponding to the target document and coordinate information corresponding to the text information;
  • the first vector extraction module 30 is used for inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information to obtain a document representation vector corresponding to the target document;
  • a document vector set obtaining module 40 configured to obtain a sample document vector set; the sample document vector set includes at least one sample document vector; one of the sample document vectors is associated with a document category;
  • the document category determination module 50 is configured to determine a document vector distance between the document representation vector and each of the sample document vectors, and determine a document category corresponding to the target document according to each of the document vector distances.
  • the document classification prediction device further includes:
  • Document triplet acquisition module 01 used to acquire training document triples;
  • the training document triplet includes a training document, a positive sample document corresponding to the training document, and a negative sample document corresponding to the training document;
  • the second vector extraction module 02 is configured to input the training document triplet into an initial language model including initial parameters, and perform vector extraction on the training document, the positive sample document and the negative sample document, respectively, to obtain the first training vector corresponding to the training document, the second training vector corresponding to the positive sample document, and the third training vector corresponding to the negative sample document;
  • a total loss value determination module 03 configured to determine the total loss value of the language model according to the first training vector, the second training vector and the third training vector;
  • a language model training module 04, configured to iteratively update the initial parameters of the initial language model when the total loss value has not reached the preset convergence condition, until the total loss value reaches the preset convergence condition, and to record the converged initial language model as the preset pre-trained language model.
  • the second vector extraction module includes:
  • a word sequence extraction unit, configured to extract the word sequences of the training document, the positive sample document and the negative sample document respectively, to obtain a training word sequence corresponding to the training document, a positive sample word sequence corresponding to the positive sample document, and a negative sample word sequence corresponding to the negative sample document;
  • a high-order feature determination unit, configured to determine, by using a preset feature representation method, the training high-order features corresponding to each word in the training word sequence, the positive sample high-order features corresponding to each word in the positive sample word sequence, and the negative sample high-order features corresponding to each word in the negative sample word sequence;
  • an average pooling processing unit, configured to perform average pooling on the training high-order features, the positive sample high-order features and the negative sample high-order features respectively, to obtain the first training vector, the second training vector and the third training vector.
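The average pooling step can be illustrated with a short sketch. It assumes that the high-order features of a document are available as one fixed-length list of floats per word; the function name `average_pool` and this data layout are assumptions made for illustration only.

```python
def average_pool(word_features):
    """Average per-word feature vectors into a single document vector.

    word_features: list of equal-length lists of floats, one per word.
    """
    if not word_features:
        raise ValueError("at least one word feature vector is required")
    dim = len(word_features[0])
    if any(len(features) != dim for features in word_features):
        raise ValueError("all word feature vectors must have the same length")
    count = len(word_features)
    # Take the mean of each feature dimension across all words.
    return [sum(features[i] for features in word_features) / count
            for i in range(dim)]
```

Applying the same function to the training, positive sample and negative sample features would yield the first, second and third training vectors respectively.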
  • the document classification prediction device further includes:
  • a sample document set acquisition module, configured to acquire a preset sample document set;
  • the sample document set includes at least one sample document; each sample document is associated with one document title;
  • a normalization processing module, configured to normalize each document title and to classify each sample document according to its normalized document title, to obtain the document category corresponding to each sample document;
  • a document category selection module, configured to select one document category from the document categories as a positive document category, and to select one document category from the remaining document categories as a negative document category;
  • a document selection module, configured to select a sample document from the positive document category and record it as the training document; to select another sample document, other than the training document, from the positive document category and record it as the positive sample document; and to select a sample document from the negative document category and record it as the negative sample document;
  • a triplet building module, configured to construct the training document triplet from the training document, the positive sample document and the negative sample document.
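The selection steps performed by these modules can be sketched as follows. The sketch assumes the sample documents have already been grouped into a mapping from document category to a list of documents, and that selection is uniformly random; the source does not say how the selection is made, and `build_triplet` is a hypothetical name.

```python
import random


def build_triplet(docs_by_category, rng):
    """Build one (training, positive, negative) document triplet.

    docs_by_category: dict mapping a category name to a list of documents.
    rng: a random.Random instance (random selection is an assumption here).
    """
    # The positive category must hold at least two documents so that the
    # training document and the positive sample document can differ.
    positive_candidates = sorted(
        c for c, docs in docs_by_category.items() if len(docs) >= 2)
    if not positive_candidates:
        raise ValueError("no category contains two or more documents")
    positive_category = rng.choice(positive_candidates)

    negative_candidates = sorted(
        c for c, docs in docs_by_category.items()
        if c != positive_category and docs)
    if not negative_candidates:
        raise ValueError("no other non-empty category is available")
    negative_category = rng.choice(negative_candidates)

    # Two distinct documents from the positive category, one from the negative.
    training_doc, positive_doc = rng.sample(
        docs_by_category[positive_category], 2)
    negative_doc = rng.choice(docs_by_category[negative_category])
    return training_doc, positive_doc, negative_doc
```

Passing a seeded `random.Random` makes the sampled triplets reproducible, which is convenient when debugging a training run.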
  • the normalization processing module includes:
  • a special symbol detection unit, configured to detect whether the document title contains a preset special symbol;
  • a character culling unit, configured to cull the preset special symbol and all characters before it when the document title contains the preset special symbol, to obtain a culled title;
  • a special character detection unit, configured to detect whether the culled title contains a preset year character and/or a preset number-of-times character;
  • a character replacement unit, configured to replace the preset year character with a first preset character and/or replace the preset number-of-times character with a second preset character when the culled title contains the preset year character and/or the preset number-of-times character;
  • completion of the replacement of the preset year character and/or the preset number-of-times character indicates that the normalization processing of the document title is completed.
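The normalization units can be sketched in a few lines. The source does not specify the preset special symbol, the year and number-of-times patterns, or the replacement characters, so every constant below (a colon as the special symbol, four-digit years, English ordinals such as "3rd", and the `<YEAR>` and `<N>` placeholders) is a hypothetical choice, as is culling up to the first occurrence of the symbol.

```python
import re
from collections import defaultdict

# All constants are illustrative assumptions, not values from the source.
SPECIAL_SYMBOL = ":"                                     # preset special symbol
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")         # preset year character
COUNT_PATTERN = re.compile(r"\b\d+(?:st|nd|rd|th)\b")    # number-of-times character
YEAR_TOKEN = "<YEAR>"                                    # first preset character
COUNT_TOKEN = "<N>"                                      # second preset character


def normalize_title(title):
    """Cull the prefix up to the special symbol, then mask years and ordinals."""
    if SPECIAL_SYMBOL in title:
        # Drop the first special symbol and everything before it.
        title = title.split(SPECIAL_SYMBOL, 1)[1]
    title = COUNT_PATTERN.sub(COUNT_TOKEN, title)
    title = YEAR_PATTERN.sub(YEAR_TOKEN, title)
    return title.strip()


def group_by_normalized_title(titles):
    """Group titles so that each normalized title acts as a document category."""
    categories = defaultdict(list)
    for title in titles:
        categories[normalize_title(title)].append(title)
    return dict(categories)
```

With these assumed patterns, "ACME Corp: 2020 Annual Report of the 3rd Board Meeting" and "Beta Ltd: 2019 Annual Report of the 12th Board Meeting" normalize to the same string and therefore fall into the same document category.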
  • the document category determination module 50 includes:
  • the sample document selection unit 501 is used to select a preset number of sample documents from the sample documents whose document vector distance is less than or equal to a preset distance threshold, and record the selected sample documents as candidate documents;
  • the document category determining unit 502 is configured to obtain the proportion of candidate documents of the same document category in all the candidate documents, and record the document category with the highest proportion as the document category of the target document.
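The two units above amount to a thresholded nearest-neighbour vote. The sketch below assumes Euclidean distance and assumes that the preset number of candidates is taken from the nearest qualifying samples; the source only says that a preset number is selected, so both choices, along with the name `predict_category`, are illustrative.

```python
import math
from collections import Counter


def predict_category(doc_vector, sample_vectors, distance_threshold, preset_number):
    """Vote on the category of a target document.

    sample_vectors: list of (vector, category) pairs.
    Returns the majority category among the candidates, or None when no
    sample lies within the distance threshold.
    """
    # Compute the distance to every sample and sort from nearest to farthest.
    scored = sorted(
        ((math.dist(doc_vector, vec), category) for vec, category in sample_vectors),
        key=lambda pair: pair[0],
    )
    # Keep at most `preset_number` samples within the threshold.
    candidates = [category for distance, category in scored
                  if distance <= distance_threshold][:preset_number]
    if not candidates:
        return None
    # The category with the highest proportion is the one with the highest
    # count, since all proportions share the same denominator.
    return Counter(candidates).most_common(1)[0][0]
```

When two categories tie, `Counter.most_common` returns the one encountered first, which in this sketch is the category of the nearer sample.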
  • Each module in the above-mentioned document classification prediction apparatus may be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8 .
  • the computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the readable storage medium.
  • the database of the computer device is used to store the data used in the document classification prediction method in the above embodiment.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions, when executed by a processor, implement a document classification prediction method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device is provided, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; the processor implements the following steps when executing the computer-readable instructions:
  • a sample document vector set is obtained; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with one document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • one or more readable storage media are provided that store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • a sample document vector set is obtained; the sample document vector set includes at least one sample document vector, and each sample document vector is associated with one document category;
  • a document vector distance between the document representation vector and each of the sample document vectors is determined, and a document category corresponding to the target document is determined according to each of the document vector distances.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a document classification prediction method and apparatus, a computer device, and a storage medium. The method comprises: receiving a prediction request instruction containing a target document (S10); performing document parsing on the target document by means of a preset document parsing model, to obtain text information corresponding to the target document and coordinate information corresponding to the text information (S20); inputting the text information and the coordinate information into a preset pre-trained language model, and performing vector extraction on the text information and the coordinate information, to obtain a document representation vector corresponding to the target document (S30); acquiring a sample document vector set, the sample document vector set containing at least one sample document vector, each sample document vector being associated with a document category (S40); and determining a document vector distance between the document representation vector and each sample document vector, and then determining, according to each document vector distance, a document category corresponding to the target document (S50). The method can improve the efficiency of document classification.
PCT/CN2021/125227 2020-12-21 2021-10-21 Document classification prediction method and apparatus, computer device, and storage medium WO2022134805A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011521171.0A CN112699923A (zh) 2020-12-21 2020-12-21 Document classification prediction method and apparatus, computer device, and storage medium
CN202011521171.0 2020-12-21

Publications (1)

Publication Number Publication Date
WO2022134805A1 (fr) 2022-06-30

Family

ID=75509652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/125227 WO2022134805A1 (fr) 2020-12-21 2021-10-21 Document classification prediction method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112699923A (fr)
WO (1) WO2022134805A1 (fr)

Also Published As

Publication number Publication date
CN112699923A (zh) 2021-04-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908806

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.10.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21908806

Country of ref document: EP

Kind code of ref document: A1