CN112434157A - Document multi-label classification method and device, electronic equipment and storage medium


Info

Publication number
CN112434157A
Authority
CN
China
Prior art keywords
document
standard
classification
label
original
Prior art date
Legal status
Granted
Application number
CN202011220204.8A
Other languages
Chinese (zh)
Other versions
CN112434157B (en)
Inventor
邵博
Current Assignee
Ping An Zhitong Consulting Co Ltd Shanghai Branch
Original Assignee
Ping An Zhitong Consulting Co Ltd Shanghai Branch
Priority date
Filing date
Publication date
Application filed by Ping An Zhitong Consulting Co Ltd Shanghai Branch filed Critical Ping An Zhitong Consulting Co Ltd Shanghai Branch
Priority to CN202011220204.8A
Publication of CN112434157A
Application granted
Publication of CN112434157B
Status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/18Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to data processing technology and discloses a document multi-label classification method comprising the following steps: preprocessing an original document set to obtain a standard document set; performing multi-label processing on the standard document set to obtain a document label set; dividing the standard document set according to a preset number of batches to obtain a plurality of document subsets; inputting the document subsets into a constructed original document multi-classification model for training; calculating an error value between the resulting training value set and the document label set; adjusting the internal parameters of the document multi-classification model whenever the error value is greater than a preset error threshold, until the error value is less than or equal to the error threshold, thereby obtaining a standard document multi-classification model; and inputting a document to be classified into the standard document multi-classification model to obtain a plurality of classification results. The invention also relates to blockchain technology, on which the original document set can be stored. The invention further discloses a document multi-label classification device, an electronic device and a storage medium. The invention can improve the diversity of document classification.

Description

Document multi-label classification method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a document multi-label classification method and device, electronic equipment and a computer-readable storage medium.
Background
A legal document records the trial process and verdict of a people's court. It is the carrier of the results of litigation activity and the sole certificate by which a people's court determines and distributes the substantive rights and obligations of the parties, and it plays an important role in case examination.
The big-data era has brought great convenience: if contents of a case such as the litigation requests, disputes and dispute focuses are tagged with corresponding labels as one of the features used to retrieve similar cases, documents of similar cases can be found more quickly, which improves the efficiency of case handlers and shortens case-handling time.
At present, document classification relies on methods such as naive Bayes classifiers and support vector machine algorithms, but their classification results are poor: the features in a document are either not exploited effectively or only a few are used, which both wastes features and leaves the classification insufficiently comprehensive.
Disclosure of Invention
The invention provides a document multi-label classification method, a document multi-label classification device, electronic equipment and a computer-readable storage medium, and mainly aims to solve the problem of incomplete document classification.
In order to achieve the above object, the present invention provides a document multi-label classification method, which comprises:
acquiring an original document set, and preprocessing the original document set to obtain a standard document set;
performing multi-label processing on the standard document set to obtain a document label set;
constructing an original document multi-classification model;
dividing the standard document set according to a preset batch number to obtain a plurality of document subsets;
inputting a plurality of document subsets into the original document multi-classification model for training to obtain a training value set;
calculating the difference value between the training value set and the document label set to obtain an error value;
when the error value is larger than a preset error threshold value, adjusting internal parameters of the original document multi-classification model, returning to the step of dividing the standard document set according to a preset batch number to obtain a plurality of document subsets, and obtaining the standard document multi-classification model when the error value is smaller than or equal to the error threshold value;
and acquiring a document to be classified, and inputting the document to be classified into the standard document multi-classification model to obtain various classification results.
Optionally, the preprocessing the original document set to obtain a standard document set includes:
removing non-character parts in the original document set to obtain a first document set;
performing word segmentation on the first document set to obtain a second document set;
and removing stop words in the second document set to obtain a standard document set.
Optionally, the constructing of the original document multi-classification model includes:
constructing an original BERT model;
adding an attention mechanism in the original BERT model to obtain a primary BERT model;
and connecting the primary BERT model by using a pre-constructed full-connection layer to obtain the original document multi-classification model.
Optionally, the inputting the plurality of document subsets into the original document multi-classification model for training to obtain a training value set includes:
performing byte coding on the document subset by using a coding layer in the original document multi-classification model to obtain an original byte coding set;
performing padding truncation operation on the original byte coding set according to a preset length by using a padding truncation layer in the original document multi-classification model to obtain a standard byte coding set;
embedding the standard byte coding set by utilizing an embedding layer in the original document multi-classification model to obtain a standard byte sequence set and calculating a training value set corresponding to the standard byte sequence set.
Optionally, the performing padding truncation operation on the original byte code set according to a preset length by using a padding truncation layer in the original document multi-classification model to obtain a standard byte code set includes:
when the length of byte codes in the original byte code set is greater than the preset length, truncation is carried out from the middle of the byte codes, and head and tail information of the byte codes is reserved to obtain standard byte codes;
and summarizing the standard byte codes to obtain a standard byte code set.
Optionally, the embedding the standard byte encoding set by using an embedding layer in the original document multi-classification model to obtain a standard byte sequence set, including:
embedding a preset code into the head of the standard byte code to obtain a first embedded byte code;
embedding the tail part of the first embedded byte code by using the preset code to obtain a second embedded byte code;
and summarizing the second embedded byte codes subjected to the embedding operation to obtain a standard byte sequence set.
Optionally, the calculating a difference between the training value set and the document label set to obtain an error value includes:
calculating the error value using the following error value calculation formula:
$C = \frac{1}{2n}\sum_{x}\left\|y(x)-a\right\|^{2}$
wherein C is the error value, n is the number of document labels in the document label set, the summation index x runs over the training values in the training value set, y represents a training value, and a is the corresponding document label value.
In order to solve the above problem, the present invention further provides a document multi-label classification apparatus, comprising:
the data processing module is used for acquiring an original document set and preprocessing the original document set to obtain a standard document set; performing multi-label processing on the standard document set to obtain a document label set;
the model construction module is used for constructing an original document multi-classification model;
the model training module is used for dividing the standard document set according to a preset batch number to obtain a plurality of document subsets; inputting a plurality of document subsets into the original document multi-classification model for training to obtain a training value set; calculating the difference value between the training value set and the document label set to obtain an error value; when the error value is larger than a preset error threshold value, adjusting internal parameters of the original document multi-classification model, returning to the step of dividing the standard document set according to a preset batch number to obtain a plurality of document subsets, and obtaining the standard document multi-classification model when the error value is smaller than or equal to the error threshold value;
and the classification module is used for acquiring the document to be classified and inputting the document to be classified into the standard document multi-classification model to obtain various classification results.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to implement the method of multi-label classification of documents as described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium storing a computer program, which when executed by a processor, implements the document multi-label classification method described above.
In the embodiment of the invention, multi-label processing is performed on the document set to obtain a document label set; multi-label processing can classify and label documents along different dimensions, so that more features are used and the classification is more comprehensive. The standard document set is divided into a plurality of document subsets, which improves the efficiency of the subsequent model training. The document subsets are input into a pre-constructed original document multi-classification model for training, and the internal parameters of the model are adjusted according to the training value set and the document label set to obtain a standard document multi-classification model. The standard document multi-classification model can then classify a document to be classified along multiple dimensions to obtain a plurality of classification results. Therefore, the document multi-label classification method and device, the electronic device and the computer-readable storage medium proposed by the invention can improve the diversity of document classification.
Drawings
FIG. 1 is a flow chart of a document multi-label classification method according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of the preprocessing step of the document multi-label classification method according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart of the model construction step of the document multi-label classification method according to an embodiment of the present invention;
FIG. 4 is a detailed flowchart of the model training step of the document multi-label classification method according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of the document multi-label classification device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the internal structure of an electronic device implementing the document multi-label classification method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a document multi-label classification method, and an execution subject of the document multi-label classification method includes but is not limited to at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiment of the application. In other words, the document multi-label classification method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a document multi-label classification method according to an embodiment of the present invention. In this embodiment, the document multi-label classification method includes:
and S1, acquiring an original document set, and preprocessing the original document set to obtain a standard document set.
In the preferred embodiment of the present invention, the original document set can be obtained by manual input or by a crawler program.
In the embodiment of the present invention, referring to fig. 2, the preprocessing the original document set to obtain a standard document set includes:
S101, removing non-character parts in the original document set to obtain a first document set;
S102, performing word segmentation on the first document set to obtain a second document set;
S103, removing stop words in the second document set to obtain a standard document set.
In the embodiment of the present invention, for example, a document A includes the following parts:
the name of the document: a civil judgment on a contract dispute between Company A and Company B;
the content of the document: consistent with the document content delivered to the parties;
the header of the document: 'Civil Judgment of a Certain People's Court';
other contents of the document, including: format settings, case numbers, body text, etc.
The non-character parts include punctuation marks, garbled characters and the like; removing these non-character parts from the original document set yields the first document set.
Further, word segmentation is performed on the first document set to obtain the second document set. The publicly available jieba word segmentation tool can be used.
For example, segmenting the document name 'civil judgment on a contract dispute between Company A and Company B' can yield: [Company A], [Company B], [and], [contract], [dispute], [civil], [judgment].
In the embodiment of the present invention, stop words can be removed in turn using a pre-constructed stop-word list. If the stop-word list contains common function words such as [and], the segmented judgment after removal becomes: [Company A], [Company B], [contract], [dispute], [civil], [judgment]; the phrases are then aggregated to obtain the standard document set.
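As an illustration of S101-S103, the following is a minimal preprocessing sketch in Python, assuming the publicly available jieba library; the regular expression, the sample document and the stop-word list are illustrative placeholders rather than the patent's actual configuration.

```python
import re
import jieba  # publicly available Chinese word-segmentation library

STOP_WORDS = {"的", "地", "和", "与"}  # hypothetical stop-word list

def preprocess(document: str) -> list:
    # S101: remove non-character parts (punctuation, garbled symbols);
    # the kept character ranges are an illustrative assumption
    text = re.sub(r"[^0-9A-Za-z\u4e00-\u9fa5]", "", document)
    # S102: word segmentation with jieba
    tokens = jieba.lcut(text)
    # S103: remove stop words to obtain the standard token sequence
    return [t for t in tokens if t not in STOP_WORDS]

original_document_set = ["A公司与B公司合同纠纷民事判决书。"]
standard_document_set = [preprocess(doc) for doc in original_document_set]
print(standard_document_set)  # e.g. [['A', '公司', 'B', '公司', '合同', '纠纷', '民事', '判决书']]
```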
And S2, carrying out multi-label processing on the standard document set to obtain a document label set.
In the embodiment of the invention, expert panel members annotate the standard document set across multiple categories, thereby obtaining the document label set.
In detail, the categories vary with the content of the standard document set; for litigation documents, for example, the categories include dimensions such as litigation request, dispute and dispute focus.
The document label set can be divided into a training set, a validation set and a test set according to a preset ratio; the training set is used to train the model, the test set is used to test the model's robustness, and the validation set is used to verify the model's accuracy during training.
Preferably, the preset ratio may be 6:2:2.
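A short sketch of the 6:2:2 split described above, assuming scikit-learn's train_test_split; the fixed random seed is an illustrative choice.

```python
from sklearn.model_selection import train_test_split

def split_622(samples, labels):
    # Carve off the 60% training portion first
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, train_size=0.6, random_state=42)
    # Split the remaining 40% evenly: 20% validation, 20% test
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=42)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```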
And S3, constructing an original document multi-classification model.
Further, referring to fig. 3, constructing a multi-classification model of the original document includes:
S311, constructing an original BERT model according to the standard document set and a preset classification function;
in detail, the BERT (bidirectional encoderpressationfrom transformer) model is a language characterization model, the BERT model includes a data receiving layer and a classification layer, the size of the data receiving layer is determined according to the standard document set, and the classification layer may use a preset classification function.
Preferably, the preset classification function may be a softmax function.
S312, adding an attention mechanism into the original BERT model to obtain a primary BERT model;
preferably, Attention mechanism (Attention) is a data processing method in machine learning, and is widely applied to various different types of machine learning tasks such as natural language processing, image recognition and voice recognition.
And S313, connecting the primary BERT model by using the pre-constructed full-connection layer to obtain the original document multi-classification model.
In a preferred embodiment of the present invention, the original document multi-classification model is obtained by connecting the fully connected layer to the primary BERT model.
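The following is a minimal sketch of S311-S313, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint. The head count of the added attention layer and the label count are illustrative, and a per-label sigmoid is used in place of the softmax named above so that several labels can fire at once in the multi-label setting; that substitution is the sketch's own choice, not the patent's.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class DocumentMultiClassificationModel(nn.Module):
    def __init__(self, num_labels: int, model_name: str = "bert-base-chinese"):
        super().__init__()
        # S311: the original BERT model with its data receiving layer
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # S312: the added attention mechanism (8 heads is an assumption)
        self.attention = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        # S313: the pre-constructed fully connected layer
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        seq = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
        attn_out, _ = self.attention(seq, seq, seq)
        # Score each label from the first (pooled) position
        logits = self.classifier(attn_out[:, 0, :])
        return torch.sigmoid(logits)  # independent per-label probabilities
```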
And S4, dividing the standard document set according to a preset batch number to obtain a plurality of document subsets.
The number of batches can be set according to the actual application scenario.
For example, a standard document set may contain 90000 training samples; dividing them into 100 batches yields 100 document subsets, each containing 900 training samples.
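A sketch of the batch division, using the 100-batch example above; the integer placeholder data stands in for encoded documents.

```python
def divide_into_batches(documents, num_batches: int = 100):
    batch_size = len(documents) // num_batches  # 90000 // 100 = 900 per subset
    return [documents[i * batch_size:(i + 1) * batch_size]
            for i in range(num_batches)]

document_subsets = divide_into_batches(list(range(90000)))
assert len(document_subsets) == 100 and len(document_subsets[0]) == 900
```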
And S5, inputting the plurality of document subsets into the original document multi-classification model for training to obtain a training value set.
In the embodiment of the present invention, referring to fig. 4, the inputting a plurality of document subsets into the original document multi-classification model for training to obtain a training value set includes:
S501, performing byte coding on the document subsets by using the coding layer in the original document multi-classification model to obtain an original byte code set;
In the preferred embodiment of the present invention, the coding layer in the original document multi-classification model byte-codes the document subsets in WordPiece fashion, i.e., with double-byte encoding. Double-byte encoding effectively reduces the data volume of the document subsets and, to a certain extent, reduces the influence of similar documents on the overall model training.
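A byte-coding sketch for S501, assuming the HuggingFace BertTokenizer, whose Chinese vocabulary is built with the WordPiece scheme mentioned above; the sample sentence is illustrative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# Encode one document of the subset into its original byte encoding
encoded = tokenizer.encode("A公司与B公司合同纠纷民事判决书",
                           add_special_tokens=False)
print(encoded)  # a list of WordPiece ids, one entry per token
```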
S502, performing filling truncation operation on the original byte code set according to a preset length by using a filling truncation layer in the original document multi-classification model to obtain a standard byte code set.
In this embodiment of the present invention, the performing padding truncation operation on the original byte code set according to a preset length by using a padding truncation layer in the original document multi-classification model to obtain a standard byte code set includes:
judging whether the length of byte codes in the original byte code set is greater than the preset length or not;
when the length of byte codes in the original byte code set is greater than the preset length, truncation is carried out from the middle of the byte codes, and head and tail information of the byte codes is reserved to obtain standard byte codes;
when the length of the byte codes in the original byte code set is smaller than or equal to the preset length, the codes in the original byte code set are standard byte codes;
and summarizing the standard byte codes to obtain a standard byte code set.
For example, with a preset length of 256 bytes, any code longer than 256 bytes is truncated. When the padding truncation layer in the original document multi-classification model performs the padding truncation operation on the original byte code set, sentence truncation is adopted, so the information at the head and tail of the code is retained.
The truncation operation has four modes: head truncation, tail truncation, two-side truncation and sentence truncation.
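A sketch of the padding truncation layer under the 256-byte preset length: codes longer than the limit are truncated from the middle so that head and tail information is kept, and shorter codes are padded. The even head/tail split and the padding code are illustrative assumptions.

```python
PRESET_LENGTH = 256
PAD_ID = 0  # hypothetical padding code

def pad_or_truncate(byte_codes: list) -> list:
    if len(byte_codes) > PRESET_LENGTH:
        # Truncate from the middle, keeping head and tail information
        head = PRESET_LENGTH // 2
        tail = PRESET_LENGTH - head
        return byte_codes[:head] + byte_codes[-tail:]
    # Codes not exceeding the preset length are already standard; pad them
    return byte_codes + [PAD_ID] * (PRESET_LENGTH - len(byte_codes))
```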
S503, embedding the standard byte coding set by utilizing an embedding layer in the original document multi-classification model to obtain a standard byte sequence set and calculating a training value set corresponding to the standard byte sequence set.
In this embodiment of the present invention, the embedding the standard byte encoding set by using the embedding layer in the original document multi-classification model to obtain a standard byte sequence set includes:
embedding a preset code into the head of the standard byte code to obtain a first embedded byte code;
embedding the tail part of the first embedded byte code by using the preset code to obtain a second embedded byte code;
and summarizing the second embedded byte codes subjected to the embedding operation to obtain a standard byte sequence set.
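A sketch of the head/tail embedding just described, assuming the preset codes are BERT's [CLS] and [SEP] token ids (101 and 102 in the bert-base-chinese vocabulary); the patent does not name the codes, so this mapping is an assumption.

```python
CLS_ID, SEP_ID = 101, 102  # assumed preset codes

def embed_head_and_tail(standard_byte_code: list) -> list:
    first = [CLS_ID] + standard_byte_code   # embed a preset code at the head
    second = first + [SEP_ID]               # embed a preset code at the tail
    return second

standard_byte_sequence = embed_head_and_tail([2769, 812, 3300])  # toy ids
```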
Preferably, the training value set is a set of predicted labels. For example, a standard document training set contains document A, document B and document C; after each is passed through the original document multi-classification model, the predicted label of document A is a litigation request label, that of document B a dispute label, and that of document C a dispute focus label, and these predictions are combined to obtain the training value set.
And S6, calculating the difference value between the training value set and the document label set to obtain an error value.
In the embodiment of the present invention, calculating a difference between the training value set and the document label set to obtain an error value includes:
calculating the error value using the following error value calculation formula:
$C = \frac{1}{2n}\sum_{x}\left\|y(x)-a\right\|^{2}$
wherein C is the error value, n is the number of document labels in the document label set, the summation index x runs over the training values in the training value set, y represents a training value, and a is the corresponding document label value.
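A numeric sketch of the error-value calculation under the formula as reconstructed above; the arrays are illustrative.

```python
import numpy as np

def error_value(training_values: np.ndarray, label_values: np.ndarray) -> float:
    n = label_values.shape[1]  # number of document labels
    # Sum the squared differences over every training value x, scaled by 1/(2n)
    return float(np.sum((training_values - label_values) ** 2) / (2 * n))

y = np.array([[0.9, 0.1, 0.8]])  # training values (predicted label scores)
a = np.array([[1.0, 0.0, 1.0]])  # document label values
print(error_value(y, a))         # 0.01
```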
And S7, judging whether the error value is greater than a preset error threshold; when the error value is greater than the preset error threshold, executing S8: adjusting the internal parameters of the original document multi-classification model and returning to S4; when the error value is less than or equal to the error threshold, executing S9: obtaining the standard document multi-classification model.
In a preferred embodiment of the present invention, if the error value is greater than the error threshold, the internal parameters of the original document multi-classification model are adjusted, where the internal parameters include training batch number, learning rate, iteration number, and the like. And if the error value is less than or equal to the error threshold value, parameters do not need to be adjusted, and the standard document multi-classification model is obtained.
For example, if the error value is 0.3 and the error threshold is 0.5, the error value is smaller than the threshold and the trained standard document multi-classification model is obtained; if the error value is 0.3 and the error threshold is 0.2, the error value is larger than the threshold and the process returns to S4.
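Putting S4-S9 together, a training-loop sketch assuming the model class from the earlier sketch, document subsets already encoded as (input_ids, attention_mask, labels) tensors, and an illustrative optimizer, learning rate and threshold:

```python
import torch

ERROR_THRESHOLD = 0.5  # preset error threshold (illustrative)
model = DocumentMultiClassificationModel(num_labels=3)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

error = float("inf")
while error > ERROR_THRESHOLD:
    # each subset is assumed to hold (input_ids, attention_mask, labels) tensors
    for input_ids, attention_mask, labels in document_subsets:
        training_values = model(input_ids, attention_mask)
        # error value per the formula above: C = sum((y - a)^2) / (2n)
        loss = ((training_values - labels) ** 2).sum() / (2 * labels.shape[1])
        optimizer.zero_grad()
        loss.backward()   # adjust the internal parameters
        optimizer.step()
        error = loss.item()
# model is now the standard document multi-classification model
```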
And S10, obtaining the document to be classified, and inputting the document to be classified into the standard document multi-classification model to obtain various classification results.
For example, a document X to be classified input by a user is acquired and classified by the standard document multi-classification model; the classification results for document X are a litigation request label and a dispute focus label.
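Finally, an inference sketch for S10, assuming the trained model and tokenizer from the sketches above; the label names and decision threshold are illustrative.

```python
import torch

LABELS = ["litigation request", "dispute", "dispute focus"]  # assumed labels

def classify(document: str, model, tokenizer, threshold: float = 0.5):
    enc = tokenizer(document, truncation=True, max_length=256,
                    padding="max_length", return_tensors="pt")
    with torch.no_grad():
        probs = model(enc["input_ids"], enc["attention_mask"])[0]
    # Every label whose score clears the threshold is returned: multi-label output
    return [LABELS[i] for i, p in enumerate(probs) if p > threshold]
```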
Fig. 5 is a schematic block diagram of the document multi-label classification device according to the present invention.
The document multi-label classification device 100 of the present invention may be installed in an electronic device. According to the implemented functions, the document multi-label classification device 100 may include a data processing module 101, a model construction module 102, a model training module 103 and a classification module 104. A module of the present invention, which may also be referred to as a unit, is a series of computer program segments that can be executed by the processor of an electronic device to perform a fixed function, and that are stored in the memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the data processing module 101 is configured to obtain an original document set, and pre-process the original document set to obtain a standard document set; performing multi-label processing on the standard document set to obtain a document label set;
the model building module 102 is used for building an original document multi-classification model;
the model training module 103 is configured to divide the standard document set according to a preset number of batches to obtain a plurality of document subsets; inputting a plurality of document subsets into the original document multi-classification model for training to obtain a training value set; calculating the difference value between the training value set and the document label set to obtain an error value; when the error value is larger than a preset error threshold value, adjusting internal parameters of the original document multi-classification model, returning to the step of dividing the standard document set according to a preset batch number to obtain a plurality of document subsets, and obtaining the standard document multi-classification model when the error value is smaller than or equal to the error threshold value;
the classification module 104 is configured to obtain documents to be classified, and input the documents to be classified into the standard document multi-classification model to obtain a plurality of classification results.
In detail, the document multi-label classification apparatus 100 may be used to perform the document multi-label classification method as described in fig. 1 to 4. When the document multi-label classification method is executed, each module in the document multi-label classification device 100 specifically executes the following operations:
step one, the data processing model 101 obtains an original document set, and preprocesses the original document set to obtain a standard document set.
In a preferred embodiment of the present invention, the data processing model 101 may use human input or a crawler program to obtain the original text set.
In the embodiment of the present invention, the preprocessing the original document set by the data processing model 101 to obtain a standard document set, includes:
removing non-character parts in the original document set to obtain a first document set;
performing word segmentation on the first document set to obtain a second document set;
and removing stop words in the second document set to obtain the standard document set.
In the embodiment of the present invention, for example, a document A includes the following parts:
the name of the document: a civil judgment on a contract dispute between Company A and Company B;
the content of the document: consistent with the document content delivered to the parties;
the header of the document: 'Civil Judgment of a Certain People's Court';
other contents of the document, including: format settings, case numbers, body text, etc.
The non-character parts include punctuation marks, garbled characters and the like; removing these non-character parts from the original document set yields the first document set.
Further, word segmentation is performed on the first document set to obtain the second document set. The publicly available jieba word segmentation tool can be used.
For example, segmenting the document name 'civil judgment on a contract dispute between Company A and Company B' can yield: [Company A], [Company B], [and], [contract], [dispute], [civil], [judgment].
In the embodiment of the present invention, stop words can be removed in turn using a pre-constructed stop-word list. If the stop-word list contains common function words such as [and], the segmented judgment after removal becomes: [Company A], [Company B], [contract], [dispute], [civil], [judgment]; the phrases are then aggregated to obtain the standard document set.
And step two, the data processing module 101 performs multi-label processing on the standard document set to obtain a document label set.
In the embodiment of the present invention, expert panel members may annotate the standard document set across multiple categories, and the data processing module 101 receives these annotations of the standard document set, thereby obtaining the document label set.
In detail, the categories vary with the content of the standard document set; for litigation documents, for example, the categories include dimensions such as litigation request, dispute and dispute focus.
And step three, the model construction module 102 constructs an original document multi-classification model.
Further, the model building module 102 builds the original document multi-classification model by the following method, including:
step A: constructing an original BERT model;
In detail, the BERT (Bidirectional Encoder Representations from Transformers) model is a language representation model.
And B: adding an attention mechanism in the original BERT model to obtain a primary BERT model;
Preferably, the attention mechanism (Attention) is a data processing method in machine learning, widely applied to various types of machine learning tasks such as natural language processing, image recognition and speech recognition.
And C: and connecting the primary BERT model by using a pre-constructed full-connection layer to obtain the original document multi-classification model.
In a preferred embodiment of the present invention, the original document multi-classification model is obtained by connecting the fully connected layer to the primary BERT model.
And fourthly, the model training module 103 divides the standard document set according to a preset batch number to obtain a plurality of document subsets.
The number of batches can be set according to the actual application scenario.
For example, a standard document set may contain 90000 training samples; dividing them into 100 batches yields 100 document subsets, each containing 900 training samples.
And step five, the model training module 103 inputs the plurality of document subsets into the original document multi-classification model for training to obtain a training value set.
In the embodiment of the present invention, the model training module 103 inputs a plurality of document subsets to the original document multi-classification model for training by the following operations to obtain a training value set:
step a: performing byte coding on the document subset by using a coding layer in the original document multi-classification model to obtain an original byte coding set;
In the preferred embodiment of the present invention, the coding layer in the original document multi-classification model byte-codes the document subsets in WordPiece fashion, i.e., with double-byte encoding. Double-byte encoding effectively reduces the data volume of the document subsets and, to a certain extent, reduces the influence of similar documents on the overall model training.
Step b: performing padding truncation operation on the original byte coding set according to a preset length by using a padding truncation layer in the original document multi-classification model to obtain a standard byte coding set;
in this embodiment of the present invention, the performing padding truncation operation on the original byte code set according to a preset length by using a padding truncation layer in the original document multi-classification model to obtain a standard byte code set includes:
step c: judging whether the length of byte codes in the original byte code set is greater than the preset length or not;
step d: when the length of byte codes in the original byte code set is greater than the preset length, truncation is carried out from the middle of the byte codes, and head and tail information of the byte codes is reserved to obtain standard byte codes;
step e: when the length of the byte codes in the original byte code set is smaller than or equal to the preset length, the codes in the original byte code set are standard byte codes;
step f: and summarizing the standard byte codes to obtain a standard byte code set.
For example, with a preset length of 256 bytes, any code longer than 256 bytes is truncated. When the padding truncation layer in the original document multi-classification model performs the padding truncation operation on the original byte code set, sentence truncation is adopted, so the information at the head and tail of the code is retained.
The truncation operation has four modes: head truncation, tail truncation, two-side truncation and sentence truncation.
The embedding layer in the original document multi-classification model then embeds the standard byte code set to obtain a standard byte sequence set, and the training value set corresponding to the standard byte sequence set is calculated.
In this embodiment of the present invention, the embedding the standard byte encoding set by using the embedding layer in the original document multi-classification model to obtain a standard byte sequence set includes: embedding a preset code into the head of the standard byte code to obtain a first embedded byte code; embedding the tail part of the first embedded byte code by using the preset code to obtain a second embedded byte code; and summarizing the second embedded byte codes subjected to the embedding operation to obtain a standard byte sequence set.
Preferably, the training value set is a set of predicted labels. For example, a standard document training set contains document A, document B and document C; after each is passed through the original document multi-classification model, the predicted label of document A is a litigation request label, that of document B a dispute label, and that of document C a dispute focus label, and these predictions are combined to obtain the training value set.
And step six, the model training module 103 calculates the difference between the training value set and the document label set to obtain an error value.
In the embodiment of the present invention, the model training module 103 calculates the error value using the following formula:
$C = \frac{1}{2n}\sum_{x}\left\|y(x)-a\right\|^{2}$
wherein C is the error value, n is the number of document labels in the document label set, the summation index x runs over the training values in the training value set, y represents a training value, and a is the corresponding document label value.
And step seven, when the error value is larger than a preset error threshold value, the model training module 103 adjusts internal parameters of the original document multi-classification model until the error value is smaller than or equal to the error threshold value, and a standard document multi-classification model is obtained.
In a preferred embodiment of the present invention, if the error value is greater than the error threshold, the internal parameters of the original document multi-classification model are adjusted, where the internal parameters include training batch number, learning rate, iteration number, and the like. And if the error value is less than or equal to the error threshold value, parameters do not need to be adjusted, and the standard document multi-classification model is obtained.
For example, if the error value is 0.3 and the error threshold is 0.5, the error value is smaller than the threshold and the trained standard document multi-classification model is obtained; if the error value is 0.3 and the error threshold is 0.2, the error value is larger than the threshold and the process returns to step four.
And step eight, the classification module 104 acquires the document to be classified, and inputs the document to be classified into the standard document multi-classification model to obtain various classification results.
For example, a document X to be classified input by a user is acquired and classified by the standard document multi-classification model; the classification results for document X are a litigation request label and a dispute focus label.
In the embodiment of the invention, multi-label processing is performed on the document set to obtain a document label set; multi-label processing can classify and label documents along different dimensions, so that more features are used and the classification is more comprehensive. The standard document set is divided into a plurality of document subsets, which improves the efficiency of the subsequent model training. The document subsets are input into a pre-constructed original document multi-classification model for training, and the internal parameters of the model are adjusted according to the training value set and the document label set to obtain a standard document multi-classification model. The standard document multi-classification model can then classify a document to be classified along multiple dimensions to obtain a plurality of classification results. Therefore, the document multi-label classification method and device, the electronic device and the computer-readable storage medium proposed by the invention can improve the diversity of document classification.
Fig. 6 is a schematic structural diagram of an electronic device for implementing the document multi-label classification method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as a document multi-label classification program 12.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the document multi-label classification program 12, but also to temporarily store data that has been or will be output.
In some embodiments the processor 10 may be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors and combinations of various control chips. The processor 10 is the control unit of the electronic device: it connects the various components of the electronic device using various interfaces and lines, and executes the functions and processes the data of the electronic device 1 by running or executing programs or modules stored in the memory 11 (e.g., the document multi-label classification program) and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 6 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The document multi-label classification program 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, can implement:
acquiring an original document set, and preprocessing the original document set to obtain a standard document set;
performing multi-label processing on the standard document set to obtain a document label set;
constructing an original document multi-classification model;
dividing the standard document set according to a preset batch number to obtain a plurality of document subsets;
inputting a plurality of document subsets into the original document multi-classification model for training to obtain a training value set;
calculating the difference value between the training value set and the document label set to obtain an error value;
when the error value is larger than a preset error threshold value, adjusting internal parameters of the original document multi-classification model, returning to the step of dividing the standard document set according to a preset batch number to obtain a plurality of document subsets, and obtaining the standard document multi-classification model when the error value is smaller than or equal to the error threshold value;
and acquiring a document to be classified, and inputting the document to be classified into the standard document multi-classification model to obtain various classification results.
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (ROM).
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any accompanying claims should not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A document multi-label classification method, characterized in that the method comprises:
acquiring an original document set, and preprocessing the original document set to obtain a standard document set;
performing multi-label processing on the standard document set to obtain a document label set;
constructing an original document multi-classification model;
dividing the standard document set according to a preset batch number to obtain a plurality of document subsets;
inputting a plurality of document subsets into the original document multi-classification model for training to obtain a training value set;
calculating the difference value between the training value set and the document label set to obtain an error value;
when the error value is larger than a preset error threshold value, adjusting internal parameters of the original document multi-classification model, returning to the step of dividing the standard document set according to a preset batch number to obtain a plurality of document subsets, and obtaining the standard document multi-classification model when the error value is smaller than or equal to the error threshold value;
and acquiring a document to be classified, and inputting the document to be classified into the standard document multi-classification model to obtain various classification results.
2. The method of multi-label classification of a document as claimed in claim 1, wherein said pre-processing said original document set to obtain a standard document set comprises:
removing non-character parts in the original document set to obtain a first document set;
performing word segmentation on the first document set to obtain a second document set;
and removing stop words in the second document set to obtain a standard document set.
3. The method of multi-label classification of a document as claimed in claim 1 wherein said constructing an original document multi-classification model comprises:
constructing an original BERT model;
adding an attention mechanism in the original BERT model to obtain a primary BERT model;
and connecting the primary BERT model by using a pre-constructed full-connection layer to obtain the original document multi-classification model.
4. The method of claim 1 wherein said inputting a plurality of said subsets of documents into said original document multi-classification model for training to obtain a set of training values comprises:
performing byte coding on the document subsets by using a coding layer in the original document multi-classification model to obtain an original byte code set;
performing a padding-truncation operation on the original byte code set according to a preset length by using a padding-truncation layer in the original document multi-classification model to obtain a standard byte code set;
and embedding the standard byte code set by using an embedding layer in the original document multi-classification model to obtain a standard byte sequence set, and calculating a training value set corresponding to the standard byte sequence set.
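The three layers of claim 4 chain together as sketched below; byte_encode, pad_or_truncate and embed_with_markers are hypothetical names for the coding, padding-truncation and embedding layers (the latter two are sketched after claims 5 and 6).

def training_values_for(subset, model, byte_encode, pad_or_truncate,
                        embed_with_markers, max_len=512):
    encoded = [byte_encode(doc) for doc in subset]                   # coding layer
    standard = [pad_or_truncate(seq, max_len) for seq in encoded]    # padding-truncation layer
    sequences = [embed_with_markers(seq) for seq in standard]        # embedding layer
    return [model(seq) for seq in sequences]                         # training value set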
5. The document multi-label classification method of claim 4, wherein the performing a padding-truncation operation on the original byte code set according to a preset length by using a padding-truncation layer in the original document multi-classification model to obtain a standard byte code set comprises:
when the length of a byte code in the original byte code set is greater than the preset length, truncating the byte code from the middle while retaining its head and tail information to obtain a standard byte code;
and summarizing the standard byte codes to obtain a standard byte code set.
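A minimal sketch of claim 5's middle truncation, assuming an even head/tail split and zero-padding of short byte codes (the claim fixes neither):

def pad_or_truncate(seq, max_len, pad_value=0):
    if len(seq) > max_len:
        head = max_len // 2
        tail = max_len - head
        return seq[:head] + seq[-tail:]   # cut from the middle, keep head and tail information
    return seq + [pad_value] * (max_len - len(seq))   # pad short byte codes to the preset length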
6. The document multi-label classification method of claim 4, wherein said embedding the standard byte code set by using an embedding layer in the original document multi-classification model to obtain a standard byte sequence set comprises:
embedding a preset code into the head of the standard byte code to obtain a first embedded byte code;
embedding the preset code into the tail of the first embedded byte code to obtain a second embedded byte code;
and summarizing the second embedded byte codes to obtain a standard byte sequence set.
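Claim 6 then reduces to wrapping each standard byte code with the same preset code at head and tail; the value 101 below is only an illustrative placeholder for that preset code.

PRESET_CODE = 101   # illustrative placeholder for the preset code

def embed_with_markers(seq):
    return [PRESET_CODE] + seq + [PRESET_CODE]   # embed at the head, then at the tail

def standard_byte_sequence_set(standard_byte_codes):
    return [embed_with_markers(seq) for seq in standard_byte_codes]   # summarize into a set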
7. The document multi-label classification method of claim 1, wherein said calculating the difference value between the training value set and the document label set to obtain an error value comprises:
calculating the error value using the following error value calculation formula:
C = -\frac{1}{n}\sum_{x}\left[\, y \ln a + (1-y)\ln(1-a) \,\right]
wherein C is the error value, n is the number of document labels in the document label set, the sum runs over the training values x in the training value set, y denotes a training value, and a denotes the corresponding document label value.
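Reading the formula of claim 7 as the standard cross-entropy cost it resembles, a direct transcription follows; taking the logarithm of the model outputs rather than the 0/1 label values is an assumption made so the expression stays well-defined.

import math

def cross_entropy_error(training_values, label_values):
    # C = -(1/n) * sum over x of [ y*ln(a) + (1-y)*ln(1-a) ]
    n = len(label_values)
    total = 0.0
    for output, label in zip(training_values, label_values):   # outputs must lie in (0, 1)
        total += label * math.log(output) + (1 - label) * math.log(1 - output)
    return -total / n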
8. A document multi-label classification device, characterized in that the device comprises:
the data processing module is used for acquiring an original document set and preprocessing the original document set to obtain a standard document set; performing multi-label processing on the standard document set to obtain a document label set;
the model construction module is used for constructing an original document multi-classification model;
the model training module is used for dividing the standard document set according to a preset batch number to obtain a plurality of document subsets; inputting a plurality of the document subsets into the original document multi-classification model for training to obtain a training value set; calculating a difference value between the training value set and the document label set to obtain an error value; adjusting internal parameters of the original document multi-classification model when the error value is greater than a preset error threshold value and returning to the step of dividing the standard document set according to the preset batch number to obtain a plurality of document subsets; and obtaining the standard document multi-classification model when the error value is less than or equal to the error threshold value;
and the classification module is used for acquiring the document to be classified and inputting the document to be classified into the standard document multi-classification model to obtain various classification results.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to cause the at least one processor to perform the document multi-label classification method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the document multi-label classification method according to any one of claims 1 to 7.
CN202011220204.8A 2020-11-05 2020-11-05 Document multi-label classification method and device, electronic equipment and storage medium Active CN112434157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011220204.8A CN112434157B (en) 2020-11-05 2020-11-05 Document multi-label classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112434157A true CN112434157A (en) 2021-03-02
CN112434157B CN112434157B (en) 2024-05-17

Family

ID=74695448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011220204.8A Active CN112434157B (en) Document multi-label classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112434157B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291152A (en) * 2018-12-07 2020-06-16 Peking University Founder Group Co., Ltd. Case document recommendation method, device, equipment and storage medium
CN110334710A (en) * 2019-07-10 2019-10-15 Shenzhen Huayun Zhongsheng Technology Co., Ltd. Legal document recognition method, device, computer equipment and storage medium
CN110442722A (en) * 2019-08-13 2019-11-12 Beijing Kingsoft Digital Entertainment Technology Co., Ltd. Method and device for training classification model and method and device for data classification
CN110717333A (en) * 2019-09-02 2020-01-21 Ping An Technology (Shenzhen) Co., Ltd. Method and device for automatically generating article abstract and computer readable storage medium
CN110807495A (en) * 2019-11-08 2020-02-18 Tencent Technology (Shenzhen) Co., Ltd. Multi-label classification method and device, electronic equipment and storage medium
CN111177324A (en) * 2019-12-31 2020-05-19 Alipay (Hangzhou) Information Technology Co., Ltd. Method and device for classifying intentions based on voice recognition result
CN111428485A (en) * 2020-04-22 2020-07-17 Shenzhen Huayun Zhongsheng Technology Co., Ltd. Method and device for classifying judicial literature paragraphs, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076426A (en) * 2021-06-07 2021-07-06 Tencent Technology (Shenzhen) Co., Ltd. Multi-label text classification and model training method, device, equipment and storage medium
CN113076426B (en) * 2021-06-07 2021-08-13 Tencent Technology (Shenzhen) Co., Ltd. Multi-label text classification and model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112434157B (en) 2024-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant