CN114003720A - Business document classification method, device, equipment and storage medium

Info

Publication number: CN114003720A
Application number: CN202111272362.2A
Authority: CN
Inventors: 叶思涛
Original assignee: Ping An International Smart City Technology Co Ltd
Current assignee: Ping An International Smart City Technology Co Ltd
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-02-01

Abstract

The invention relates to artificial intelligence technology and discloses a business document classification method comprising the following steps: vectorizing an initial document text obtained after data cleaning to obtain text characterization vectors; clustering the text characterization vectors and labeling the resulting clusters; taking the labeled text clusters as a training data set; inputting the training data set into a preset text classification model for text classification to obtain a predicted classification result; optimizing the text classification model according to a comparison of the predicted classification result with a preset real classification result to obtain a standard classification model; and inputting documents to be classified into the standard classification model to obtain the categories corresponding to the documents to be classified. In addition, the invention also relates to blockchain technology: the document clusters can be stored in a node of a blockchain. The invention also provides a business document classification device, electronic equipment and a storage medium. The invention can improve the efficiency of business document classification.

Description

Business document classification method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a business document classification method, a business document classification device, electronic equipment and a computer readable storage medium.

Background

When an organization carries out business processing, it generally needs to cross-reference document materials from different data sources. Most of these materials carry no category information, so documents of different kinds are mixed together. At present, part of the document material is handed to the corresponding business departments for processing only after office staff have classified and catalogued it. Automatic classification and cataloguing by the system has not been realized, and the excessive reliance on manual classification and cataloguing affects the quality and effect of subsequent case handling. Manual classification and cataloguing also consumes considerable human resources, and the document materials can be classified and catalogued efficiently and accurately only by staff with the corresponding professional skills. An efficient document classification method is therefore urgently needed.

Disclosure of Invention

The invention provides a business document classification method, a business document classification device and a computer readable storage medium, and mainly aims to improve the document classification efficiency.

In order to achieve the above object, the present invention provides a business document classification method, comprising:

acquiring an original document text, and performing data cleaning on the original document text to obtain an initial document text;

vectorizing the initial document text to obtain a text representation vector, and clustering the text representation vector based on a preset clustering algorithm to obtain a plurality of document clustering clusters;

labeling the plurality of text clustering clusters, and taking the labeled text clustering clusters as a training data set;

inputting the training data set into a preset text classification model for text classification to obtain a prediction classification result, and optimizing the text classification model according to a comparison result obtained by comparing the prediction classification result with a preset real classification result to obtain a standard classification model;

and acquiring a document to be classified, and inputting the document to be classified into the standard classification model to obtain the category corresponding to the document to be classified.

Optionally, the vectorizing the initial document text to obtain a text characterization vector includes:

performing word segmentation and stop-word removal on the initial document text to obtain an initial text sequence;

computing a static word vector for each word in the initial text sequence by using a preset word embedding algorithm;

and carrying out average pooling on the static word vectors of each word to obtain a text representation vector.

Optionally, the clustering the text characterization vectors based on a preset clustering algorithm to obtain a plurality of document clustering clusters includes:

obtaining a plurality of initial clustering centers, and respectively calculating distance values between the text characterization vector and the plurality of initial clustering centers;

taking the initial clustering center corresponding to the minimum distance value as a cluster to be clustered, and classifying the text characterization vector into the cluster to be clustered;

and recalculating the clustering center of the cluster to be clustered according to the text characterization vectors contained in the cluster to be clustered, and repeatedly executing clustering operation until the distribution of the plurality of text characterization vectors is finished to obtain a plurality of document clustering clusters.

Optionally, the inputting the training data set into a preset text classification model for text classification to obtain a prediction classification result includes:

performing convolution processing on the training data set by using a convolution layer in the text classification model to obtain a convolution data set;

inputting the convolution data set into a pooling layer in the text classification model for pooling treatment to obtain a pooled data set;

and calculating an activation value corresponding to the pooled data set by using a preset activation function, and classifying according to the size of the activation value to obtain a prediction classification result.

Optionally, the optimizing the text classification model according to a comparison result obtained by comparing the predicted classification result with a preset real classification result to obtain a standard classification model includes:

judging whether the prediction classification result is consistent with the real classification result;

if the predicted classification result is consistent with the real classification result, outputting the text classification model as a standard classification model;

and if the predicted classification result is inconsistent with the real classification result, adjusting the model parameters of the text classification model, re-executing text classification operation by using the adjusted text classification model until the predicted classification result is consistent with the real classification result, and outputting the adjusted text classification model as a standard classification model.
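As a non-limiting illustration of the judge-and-adjust loop described above, the following Python sketch repeats classification and parameter adjustment until every predicted result is consistent with its real result. It rests on assumptions that are not part of the disclosure: a multiclass perceptron stands in for the text classification model, the names `predict` and `fit_until_consistent` are hypothetical, and a `max_epochs` cap is added because full consistency may never be reached on data that is not separable.

```python
def predict(weights, x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))


def fit_until_consistent(samples, labels, n_classes, max_epochs=100, lr=1.0):
    """Adjust parameters until every prediction matches its real label.

    Stand-in model: a multiclass perceptron (assumption, not the TextCNN
    of the embodiment). Returns (weights, epochs_used).
    """
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    for epoch in range(max_epochs):
        mismatches = 0
        for x, y in zip(samples, labels):
            y_hat = predict(weights, x)
            if y_hat != y:  # predicted result inconsistent with real result
                mismatches += 1
                for i, xi in enumerate(x):
                    weights[y][i] += lr * xi      # raise the score of the real class
                    weights[y_hat][i] -= lr * xi  # lower the score of the wrong class
        if mismatches == 0:
            return weights, epoch  # consistent: output as the standard model
    return weights, max_epochs  # stopped by the epoch cap, not by consistency


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    y = [0, 0, 1, 1]
    model, epochs = fit_until_consistent(X, y, n_classes=2)
    print([predict(model, x) for x in X], epochs)
```

In practice the stopping rule would more likely be a loss threshold or a validation accuracy target than exact agreement on every training sample.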

Optionally, the labeling the plurality of text cluster clusters, and using the labeled text cluster clusters as a training data set, includes:

acquiring a preset text type and a document text corresponding to the text type;

and taking the text category corresponding to the document text consistent with the text cluster as the labeling category of the text cluster.

Optionally, the performing data cleaning on the original document text includes:

detecting an error text in the original document text by using a preset error detection statement, and correcting the error text; or

judging whether the original document text contains invalid data, and deleting the invalid data from the original document text when it does.

In order to solve the above problem, the present invention further provides a device for classifying service documents, the device comprising:

the data cleaning module is used for acquiring an original document text and cleaning the original document text to obtain an initial document text;

the text clustering module is used for vectorizing the initial document text to obtain a text characterization vector, and clustering the text characterization vector based on a preset clustering algorithm to obtain a plurality of document clustering clusters;

the text labeling module is used for labeling the text clustering clusters and taking the labeled text clustering clusters as a training data set;

the model training module is used for inputting the training data set into a preset text classification model for text classification to obtain a prediction classification result, and optimizing the text classification model according to a comparison result obtained by comparing the prediction classification result with a preset real classification result to obtain a standard classification model;

and the document classification module is used for acquiring the document to be classified, inputting the document to be classified into the standard classification model and obtaining the category corresponding to the document to be classified.

In order to solve the above problem, the present invention also provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of classification of a business document as described above.

In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the business document classification method described above.

In the embodiment of the invention, the initial document text obtained by data cleaning is vectorized to obtain text characterization vectors, and cluster analysis is performed on the text characterization vectors to obtain a plurality of document clusters; the cluster analysis groups documents of the same category together. The text clusters are then labeled, and the labeled text clusters are used as a training data set to train a preset text classification model, yielding a standard classification model. Classifying documents with the standard classification model is efficient: a document to be classified is input into the standard classification model to obtain its category. Therefore, the business document classification method, the business document classification device, the electronic equipment and the computer readable storage medium provided by the invention can solve the problem that document classification efficiency is not high enough.

Drawings

Fig. 1 is a schematic flow chart of a method for classifying a business document according to an embodiment of the present invention;

Fig. 2 is a functional block diagram of a business document classification device according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an electronic device for implementing the business document classification method according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The embodiment of the application provides a business document classification method. The executing body of the business document classification method includes, but is not limited to, at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiments of the present application. In other words, the business document classification method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

Fig. 1 is a schematic flow chart of a method for classifying a business document according to an embodiment of the present invention. In this embodiment, the method for classifying the service documents includes:

S1, acquiring an original document text, and performing data cleaning on the original document text to obtain an initial document text.

In the embodiment of the present invention, the original document text may be document material derived from different data sources, and generally includes judgment documents from inside a court, judgment documents published on the judgment documents website, administrative penalty decisions published by administrative law enforcement agencies, and document material from forums and from the public-opinion reporting content of government service websites.

Specifically, the data cleaning of the original document text includes:

In detail, the preset error detection statement may be a Java statement with an error detection function. The error text in the original document text may be of an error type such as a term abbreviation error or a common spelling error, and may be corrected based on an existing reference term or spelling text. The invalid data may be text data in the original document text whose content is empty, whose description fields are too few, or which is repeated; when such invalid data exists in the original document text, it is deleted.
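As a non-limiting illustration, the two cleaning operations can be sketched in Python as follows. The correction table `CORRECTIONS`, the threshold `MIN_WORDS` and the function names are hypothetical; the embodiment describes a Java error detection statement and presents the two operations as alternatives, whereas this sketch uses Python and applies both in sequence.

```python
import re

# Hypothetical correction table: known misspellings -> reference terms.
CORRECTIONS = {
    "defendent": "defendant",
    "plantiff": "plaintiff",
}

MIN_WORDS = 3  # assumed threshold for "too few description fields"


def correct_errors(text: str) -> str:
    """Replace each known error term with its reference spelling."""
    for wrong, right in CORRECTIONS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text


def clean_documents(raw_docs: list[str]) -> list[str]:
    """Correct error text, then drop empty, too-short and repeated documents."""
    seen = set()
    cleaned = []
    for doc in raw_docs:
        doc = correct_errors(doc).strip()
        if len(doc.split()) < MIN_WORDS:
            continue  # empty or too few fields: treated as invalid data
        if doc in seen:
            continue  # repeated text: treated as invalid data
        seen.add(doc)
        cleaned.append(doc)
    return cleaned


if __name__ == "__main__":
    raw = [
        "The defendent appealed the ruling.",
        "",
        "ok",
        "The defendant appealed the ruling.",
        "The plantiff filed a civil claim.",
    ]
    print(clean_documents(raw))
```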

S2, vectorizing the initial document text to obtain a text characterization vector, and clustering the text characterization vector based on a preset clustering algorithm to obtain a plurality of document clusters.

In the embodiment of the present invention, the vectorizing the initial document text to obtain a text characterization vector includes:

In detail, the initial document text may be segmented by using a reference word segmenter, for example the Jieba segmenter or the Stanford segmenter. The word embedding algorithm may be a TF-IDF algorithm, a Word2Vec algorithm, a GloVe algorithm, or a FastText algorithm.
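As a non-limiting illustration, the static-vector variant of the vectorization (word segmentation, stop-word removal, word vector lookup, average pooling) can be sketched in Python as follows. The embedding table, the stop-word list and the function names are hypothetical, the vector values are made up, and a whitespace split stands in for a real word segmenter such as Jieba.

```python
# Toy static word vectors (made-up values). In practice these would come
# from a trained Word2Vec, GloVe or FastText model.
EMBEDDINGS = {
    "court":   [0.9, 0.1, 0.0],
    "penalty": [0.7, 0.3, 0.1],
    "fine":    [0.6, 0.4, 0.0],
    "forum":   [0.1, 0.8, 0.2],
}
STOP_WORDS = {"the", "a", "of", "and", "is"}
DIM = 3


def to_sequence(text: str) -> list[str]:
    """Word segmentation (whitespace split as a stand-in) plus stop-word removal."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def text_vector(text: str) -> list[float]:
    """Average-pool the static vectors of all in-vocabulary words."""
    vectors = [EMBEDDINGS[w] for w in to_sequence(text) if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM  # no known words: fall back to the zero vector
    return [sum(col) / len(vectors) for col in zip(*vectors)]


if __name__ == "__main__":
    print(to_sequence("the court is a forum"))
    print(text_vector("the court and the penalty"))
```

Average pooling discards word order, which is usually acceptable for clustering documents by topic but loses phrase-level information.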

In another embodiment of the present invention, the vectorizing the initial document text to obtain a text characterization vector includes:

inputting the initial text sequence into a preset pre-training model to obtain a dynamic word vector of each word in the initial text sequence;

and determining a dynamic word vector corresponding to the preset identifier as a text characterization vector.

In detail, the preset pre-training model may be a BERT model, a RoBERTa model, an ALBERT model, or an ERNIE model. The preset identifier is the [CLS] identifier.

Specifically, the clustering process is performed on the text characterization vectors based on a preset clustering algorithm to obtain a plurality of document clustering clusters, and the clustering process includes:

In detail, in the embodiment of the present invention, the preset clustering algorithm is a K-Means clustering algorithm.
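As a non-limiting illustration, the K-Means procedure (obtain initial centers, assign each vector to the nearest center, recompute the centers, repeat) can be sketched in pure Python as follows. The function name `kmeans`, the random choice of initial centers, the Euclidean distance and the `max_iter` cap are assumptions of this sketch; the embodiment does not fix them.

```python
import math
import random


def kmeans(vectors, k, max_iter=100, seed=0):
    """Return (centers, assignments) for a list of equal-length vectors."""
    rng = random.Random(seed)
    # Initial cluster centers: k distinct input vectors chosen at random.
    centers = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign each vector to the center at minimum Euclidean distance.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(v, centers[c])) for v in vectors
        ]
        if new_assignments == assignments:
            break  # assignments are stable: clustering is finished
        assignments = new_assignments
        # Recompute each center as the mean of its member vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```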

S3, labeling the plurality of text clusters, and taking the labeled text clusters as a training data set.

In the embodiment of the present invention, the labeling a plurality of text cluster clusters, and using the labeled text cluster clusters as a training data set includes:

In detail, the preset text categories include, but are not limited to, criminal legal documents, civil legal documents and administrative legal documents divided according to case litigation properties, litigation legal documents and internal work legal documents divided according to legal document applicable procedures and ranges, and text narrative documents, space filling documents and tabular documents divided according to different document making forms.
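As a non-limiting illustration, labeling can be sketched in Python by comparing each cluster with one reference document per preset text category and adopting the category of the most similar reference. The disclosure does not state how "consistent with the text cluster" is measured; cosine similarity against the cluster center is an assumption of this sketch, and all function names and reference vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_clusters(centers, reference_vectors):
    """Map each cluster index to the category of its most similar reference."""
    labels = {}
    for idx, center in enumerate(centers):
        labels[idx] = max(
            reference_vectors,
            key=lambda cat: cosine(center, reference_vectors[cat]),
        )
    return labels


def build_training_set(vectors, assignments, labels):
    """Pair every text vector with the label of the cluster it belongs to."""
    return [(v, labels[a]) for v, a in zip(vectors, assignments)]


if __name__ == "__main__":
    centers = [[1.0, 0.1], [0.1, 1.0]]
    # Hypothetical vectors of one reference document per preset category.
    refs = {"criminal": [0.9, 0.0], "civil": [0.0, 0.8]}
    cluster_labels = label_clusters(centers, refs)
    print(build_training_set([[1, 0], [0, 1]], [0, 1], cluster_labels))
```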

S4, inputting the training data set into a preset text classification model for text classification to obtain a prediction classification result, and optimizing the text classification model according to a comparison result obtained by comparing the prediction classification result with a preset real classification result to obtain a standard classification model.

In the embodiment of the invention, the preset text classification model is a TextCNN network. The TextCNN network classifies texts by using a convolutional neural network; its network structure is simple, its training is fast, and its classification is accurate.

Specifically, the step of inputting the training data set into a preset text classification model for text classification to obtain a prediction classification result includes:

In detail, the text classification model comprises a convolution layer, a max-pooling layer and a softmax function applied at the output, wherein the convolution layer performs convolution processing through convolution kernels, the max-pooling layer can reduce the risk of overfitting, and the softmax function is used for classification.
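As a non-limiting illustration, the forward pass through the three components (convolution, max pooling, softmax) can be sketched in pure Python as follows. The kernel values, the ReLU applied after each convolution window, the linear layer before the softmax and all function names are assumptions of this sketch, and the input sequence is assumed to be at least as long as the tallest kernel.

```python
import math


def conv1d(seq, kernel, bias=0.0):
    """Slide an h x d kernel over an n x d sequence; apply ReLU to each window."""
    h = len(kernel)
    out = []
    for i in range(len(seq) - h + 1):
        s = bias + sum(
            seq[i + r][c] * kernel[r][c]
            for r in range(h)
            for c in range(len(kernel[0]))
        )
        out.append(max(0.0, s))
    return out


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def textcnn_forward(seq, kernels, weights):
    """Convolution -> max-over-time pooling -> linear layer -> softmax."""
    pooled = [max(conv1d(seq, k)) for k in kernels]  # one value per kernel
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in weights]
    probs = softmax(scores)
    # The class with the largest activation value is the predicted class.
    return probs.index(max(probs)), probs


if __name__ == "__main__":
    sequence = [[1, 0], [0, 1], [0, 0]]            # three words, 2-d vectors
    kernels = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]  # two kernels of height 2
    weights = [[1, 0], [0, 1]]                      # 2 classes x 2 pooled features
    print(textcnn_forward(sequence, kernels, weights))
```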

Further, the optimizing the text classification model according to a comparison result obtained by comparing the predicted classification result with a preset real classification result to obtain a standard classification model includes:

S5, obtaining the document to be classified, and inputting the document to be classified into the standard classification model to obtain the category corresponding to the document to be classified.

In the embodiment of the invention, the documents to be classified are document materials that have not yet been classified and catalogued. The standard classification model is a trained document classification model that can classify and catalogue the documents to be classified efficiently and accurately; inputting a document to be classified into the standard classification model yields the category corresponding to that document.

For example, the document to be classified is input into the standard classification model, and the category obtained for the document is criminal legal document.

In the embodiment of the invention, the initial document text obtained by data cleaning is vectorized to obtain text characterization vectors, and cluster analysis is performed on the text characterization vectors to obtain a plurality of document clusters; the cluster analysis groups documents of the same category together. The text clusters are then labeled, and the labeled text clusters are used as a training data set to train a preset text classification model, yielding a standard classification model. Classifying documents with the standard classification model is efficient: a document to be classified is input into the standard classification model to obtain its category. Therefore, the business document classification method provided by the invention can solve the problem that document classification efficiency is not high enough.

Fig. 2 is a functional block diagram of a service document classifying device according to an embodiment of the present invention.

The apparatus 100 for classifying a business document according to the present invention may be installed in an electronic device. According to the realized functions, the business document classification device 100 can comprise a data cleaning module 101, a text clustering module 102, a text labeling module 103, a model training module 104 and a document classification module 105. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.

In the present embodiment, the functions regarding the respective modules/units are as follows:

the data cleaning module 101 is configured to acquire an original document text, and perform data cleaning on the original document text to obtain an initial document text;

the text clustering module 102 is configured to perform vectorization on the initial document text to obtain a text characterization vector, and perform clustering processing on the text characterization vector based on a preset clustering algorithm to obtain a plurality of document clustering clusters;

the text labeling module 103 is configured to label a plurality of text cluster clusters, and use the labeled text cluster clusters as a training data set;

the model training module 104 is configured to input the training data set into a preset text classification model for text classification to obtain a predicted classification result, and optimize the text classification model according to a comparison result obtained by comparing the predicted classification result with a preset real classification result to obtain a standard classification model;

the document classification module 105 is configured to obtain a document to be classified, and input the document to be classified into the standard classification model to obtain a category corresponding to the document to be classified.

In detail, the specific implementation of each module of the service document classification device 100 is as follows:

Step one, obtaining an original document text, and carrying out data cleaning on the original document text to obtain an initial document text.

Specifically, the data cleaning of the original document text includes:

Step two, vectorizing the initial document text to obtain a text characterization vector, and clustering the text characterization vector based on a preset clustering algorithm to obtain a plurality of document clusters.

Step three, labeling the plurality of text clusters, and taking the labeled text clusters as a training data set.

Step four, inputting the training data set into a preset text classification model for text classification to obtain a prediction classification result, and optimizing the text classification model according to a comparison result obtained by comparing the prediction classification result with a preset real classification result to obtain a standard classification model.

Step five, acquiring the document to be classified, and inputting the document to be classified into the standard classification model to obtain the category corresponding to the document to be classified.

In the embodiment of the invention, the initial document text obtained by data cleaning is vectorized to obtain text characterization vectors, and cluster analysis is performed on the text characterization vectors to obtain a plurality of document clusters; the cluster analysis groups documents of the same category together. The text clusters are then labeled, and the labeled text clusters are used as a training data set to train a preset text classification model, yielding a standard classification model. Classifying documents with the standard classification model is efficient: a document to be classified is input into the standard classification model to obtain its category. Therefore, the business document classification device provided by the invention can solve the problem that document classification efficiency is not high enough.

Fig. 3 is a schematic structural diagram of an electronic device for implementing a method for classifying a business document according to an embodiment of the present invention.

The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a business document classification program, stored in the memory 11 and executable on the processor 10.

In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is the control unit of the electronic device: it connects the various components of the electronic device by using various interfaces and lines, and executes the various functions of the electronic device and processes its data by running or executing programs or modules stored in the memory 11 (e.g., executing the business document classification program) and calling data stored in the memory 11.

The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a business document classification program, but also to temporarily store data that has been output or is to be output.

The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.

The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a display or an input unit such as a keyboard, and optionally may be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visualized user interface.

Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.

For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.

The business document classification program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:

Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.

Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).

The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.

The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, in which each data block contains information about a batch of network transactions that is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
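As a purely illustrative sketch of the idea of data blocks associated by a cryptographic method, the following Python fragment links each block to the hash of its predecessor. It is not the storage scheme of the present invention: the names `Block`, `append_block`, `is_valid` and `GENESIS_PREV`, the use of SHA-256, and the string payloads are all assumptions made for the example, and no consensus mechanism or network layer is modeled.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder used as the "previous hash" of the first block.
GENESIS_PREV = "0" * 64


@dataclass(frozen=True)
class Block:
    index: int
    prev_hash: str
    payload: str  # assumed: a serialized batch of records, e.g. a document cluster

    def digest(self) -> str:
        # Hash a canonical JSON form so the same block always yields the same digest.
        body = json.dumps(
            {"index": self.index, "prev_hash": self.prev_hash, "payload": self.payload},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: List[Block], payload: str) -> List[Block]:
    # The new block stores the digest of the current last block.
    prev_hash = chain[-1].digest() if chain else GENESIS_PREV
    return chain + [Block(len(chain), prev_hash, payload)]


def is_valid(chain: List[Block]) -> bool:
    # Each block must point at the digest of the block before it.
    expected_prev = GENESIS_PREV
    for position, block in enumerate(chain):
        if block.index != position or block.prev_hash != expected_prev:
            return False
        expected_prev = block.digest()
    return True
```

In this sketch, altering the payload of any block that has a successor changes its digest, so the successor's `prev_hash` no longer matches and `is_valid` returns `False`. The last block is not protected by the link alone, which is one reason real systems add signatures and consensus on top of the hash chain.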

The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use that knowledge to obtain the best result.

Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A method for classifying a business document, the method comprising:

2. The method of classifying a business document according to claim 1, wherein said vectorizing said initial document text to obtain a text characterization vector comprises:

3. The method of classifying a business document according to claim 1, wherein said clustering said text characterization vectors based on a preset clustering algorithm to obtain a plurality of document clusters comprises:

4. The method for classifying business documents according to claim 1, wherein said inputting said training data set into a preset text classification model for text classification to obtain a predicted classification result comprises:

5. The method of claim 1, wherein the optimizing the text classification model according to the comparison result obtained by comparing the predicted classification result with a preset real classification result to obtain a standard classification model comprises:

6. The method for classifying a business document according to claim 1, wherein said labeling a plurality of said text clusters and using the labeled text clusters as a training data set comprises:

7. The method of any of claims 1 to 6, wherein said data cleaning of said original document text comprises:

8. An apparatus for classifying a business document, the apparatus comprising:

9. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of classifying a business document according to any one of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method of classifying a business document according to any one of claims 1 to 7.