CN114722189B

CN114722189B - Multi-label unbalanced text classification method in budget execution audit

Info

Publication number: CN114722189B
Application number: CN202111534284.9A
Authority: CN
Inventors: 伍之昂; 张璐; 方昌健
Original assignee: Guangdong Weishen Information Technology Co ltd; NANJING AUDIT UNIVERSITY
Current assignee: Guangdong Weishen Information Technology Co ltd; NANJING AUDIT UNIVERSITY
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2023-06-23
Anticipated expiration: 2041-12-15
Also published as: CN114722189A

Abstract

The invention discloses a multi-label unbalanced text classification method for budget execution audit, comprising the following steps: constructing a keyword library for the budget execution audit field, selecting seed words from the keyword library as label descriptions, segmenting the text using a word segmentation tool together with the keyword library, and computing the embedding matrices of the labels and of the segmented words; constructing a neural network that computes the similarity matrices between words, phrases, and labels (that is, the label descriptions), deriving the word weights through a purpose-built pooling layer, combining the weights with the word embedding matrix to obtain the sentence embedding matrix, and feeding the sentence embedding matrix to a classifier to obtain the prediction; and introducing imbalance weights into the loss function, adding the label descriptions to the loss function to strengthen learning of the minority classes and the labels, training the model to minimize the loss function, and then classifying payment summary text whose labels are unknown. The invention effectively solves the problem of multi-label unbalanced classification of payment voucher summary text in budget execution audit.

Description

Multi-label unbalanced text classification method in budget execution audit

Technical Field

The invention relates to the field of text classification, and in particular to a multi-label unbalanced text classification method for budget execution audit.

Background

In auditing the execution of a fiscal budget, the payment summary of each expenditure must be classified to determine whether the use of the funds is consistent with the budget item, so as to review whether the expenditure is compliant and even to identify high-risk transactions. At present, much of this text classification work still depends on manual labeling by auditors, which copes less and less well with the explosive growth of audit data in a big-data environment. Although text classification has long been studied, research and applications that specifically target payment summary text classification for budget execution audit remain rare, and general-purpose text classification algorithms and tools are difficult to apply fully to so specialized a field. Text analysis in budget execution audit faces several problems: the text contains many domain-specific terms, the budget subjects span many categories, and the sample sizes are unbalanced. Meanwhile, traditional text classification methods use an unsupervised sentence representation based on averaged word vectors, which makes it difficult to capture how strongly different words influence the classification model. To address these problems, the invention provides a multi-label unbalanced text classification method for budget execution audit that integrates sentence representation learning with the training of a multi-label unbalanced classification model in a supervised manner, with the aim of classifying payment summaries quickly and accurately and improving the efficiency of audit work.

Disclosure of Invention

Object of the invention: the invention provides a multi-label unbalanced text classification method for budget execution audit, which addresses the problem of multi-label unbalanced text classification in budget execution audit.

The technical scheme is as follows:

A multi-label unbalanced text classification method in budget execution audit, comprising the following steps:

Step one: data preprocessing and word embedding training to obtain the input data of the model. Labeled text data of payment voucher summaries is given, in which the number of samples differs across categories and the number of categories is $K$. A keyword library for budget execution audit, that is, the proper nouns of the field, is constructed from the given text, and representative seed words are selected from the keyword library as the descriptions of the labels. The text is segmented using the word library and a word segmentation tool, and word embedding vectors are pre-trained on the full audit text data to obtain the word matrix $E_i = [e_{i1}, \dots, e_{iL}]^T$, where $i$ is the sequence number of the sentence, $l$ is the sequence number of a word in the sentence, and $L$ is the length of the sentence. The seed words are mapped to word embeddings, and the embeddings of the seed words of each category are then averaged to obtain the embedding matrix of all labels, $L = [l_1, \dots, l_K]^T$;
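The averaging of seed-word embeddings into one embedding per label can be illustrated with a minimal sketch. The function names `mean_vector` and `label_embeddings`, the toy vocabulary, and the 2-dimensional vectors are all hypothetical and chosen only for illustration; a real system would use the pre-trained word vectors described above.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def label_embeddings(seed_words: Dict[str, List[str]],
                     word_vectors: Dict[str, Vector]) -> Dict[str, Vector]:
    """Average the embeddings of each category's seed words (one vector per label)."""
    return {
        label: mean_vector([word_vectors[word] for word in words])
        for label, words in seed_words.items()
    }


# Toy data: illustrative values, not taken from the patent.
word_vectors = {"hotel": [1.0, 0.0], "flight": [0.0, 1.0], "paper": [2.0, 2.0]}
seed_words = {"travel": ["hotel", "flight"], "office": ["paper"]}
label_matrix = label_embeddings(seed_words, word_vectors)
```

Each entry of `label_matrix` plays the role of one row $l_k$ of the label embedding matrix.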

Step two: constructing the model, that is, a classification framework for multi-label unbalanced text. First, a similarity matrix is computed from the words in the sentence and the labels; a neural network then computes the similarity between context information, that is, phrases, and the labels, where two groups of parameters, $W_1$ and $b_1$, need to be trained. Next, a newly constructed pooling layer computes the weight vector between the phrases and all category labels. Finally, the weight vector is used to weight the original words, and once training is complete a suitable sentence embedding matrix is obtained, that is, a sentence embedding matrix that incorporates domain knowledge. The formula is as follows:

$Z_i = f_1(E_i, L)$ (1)

where $Z_i$ is the embedding matrix of the $i$-th sentence, and $f_1$ is the mapping function that takes $E_i$ and $L$ as input and produces $Z_i$ as output;

Then, taking the sentence embedding matrix as input, a classifier is used to classify the sentence, where two groups of parameters, $W_2$ and $b_2$, need to be trained. The formula is as follows:

$\hat{y}_i = f_2(Z_i)$ (2)

where $\hat{y}_i$ is the predicted class probability matrix for the sentence embedding $Z_i$, and $f_2$ is the mapping function that takes $Z_i$ as input and produces $\hat{y}_i$ as output;

Step three: constructing a unified objective function for sentence embedding and unbalanced multi-class classification to guide the training of the neural network. A cross-entropy loss function is used as the base objective; weights are introduced to bias the loss toward the minority classes and strengthen the classifier's training on them; finally, the label word embeddings are introduced into the loss function to strengthen learning of the labels. The model is trained to minimize the resulting unbalanced objective function. After training, payment summary text with unknown labels can be classified effectively;

Further, in step two, the model is constructed, that is, a classification framework for multi-label unbalanced text: first, a similarity matrix is computed from the words in the sentence and the labels; a neural network then computes the similarity between context information, that is, phrases, and the labels, where two groups of parameters, $W_1$ and $b_1$, need to be trained; a newly constructed pooling layer then computes the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and once training is complete a suitable sentence embedding matrix is obtained, that is, a sentence embedding matrix that incorporates domain knowledge;

Specifically, in the first stage the similarity matrix is computed first, as follows:

The similarity matrix $G_i$ has size $L \times K$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm.
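One reading consistent with the stated $L \times K$ size and the use of the $\ell_2$ norm is a cosine similarity between every word embedding and every label embedding. The sketch below works under that assumption; the function name `cosine_similarity_matrix` and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity_matrix(words: List[Vector], labels: List[Vector]) -> List[List[float]]:
    """Return G with G[j][k] = cos(word_j, label_k); shape is (num_words, num_labels)."""
    def l2_norm(v: Vector) -> float:
        return math.sqrt(sum(x * x for x in v))

    return [
        [sum(a * b for a, b in zip(w, lab)) / (l2_norm(w) * l2_norm(lab)) for lab in labels]
        for w in words
    ]


# Toy sentence of 3 words and 2 labels (illustrative values only).
E = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
label_vectors = [[1.0, 0.0], [0.0, 1.0]]
G = cosine_similarity_matrix(E, label_vectors)
```

Dividing by both norms keeps every entry in $[-1, 1]$, so long or high-magnitude word vectors do not dominate the similarity.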

The similarity between the phrases in the sentence, which carry contextual semantics, and the labels is then computed as follows:

$c_j = \mathrm{ReLU}(G_{i,\,j-p:j+p} W_1^T + b_1), \quad 1 \le j \le L$ (4)

where $j$ is the index of the word at the center of the phrase, $j-p$ and $j+p$ are the indices of the leftmost and rightmost words of the phrase, and $W_1$ and $b_1$ are the two groups of neural network parameters, which are updated iteratively during training;
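The windowed computation can be sketched as follows. Several details are assumptions rather than statements of the method: $W_1$ is treated as a vector of $2p+1$ window weights shared across labels, $b_1$ as one bias per label, and positions outside the sentence are treated as zeros. The function name `phrase_label_similarity` is hypothetical.

```python
from typing import List


def phrase_label_similarity(G: List[List[float]], W1: List[float],
                            b1: List[float], p: int) -> List[List[float]]:
    """For each center word j, combine rows j-p..j+p of G with window weights W1,
    add the per-label bias b1, and apply ReLU. Out-of-range rows count as zero."""
    num_words, num_labels = len(G), len(G[0])
    C = []
    for j in range(num_words):
        row = []
        for k in range(num_labels):
            s = b1[k]
            for t in range(-p, p + 1):
                if 0 <= j + t < num_words:
                    s += W1[t + p] * G[j + t][k]
            row.append(max(0.0, s))  # ReLU
        C.append(row)
    return C


# Toy similarity matrix (3 words, 2 labels) and window radius p = 1.
G_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = phrase_label_similarity(G_toy, W1=[1.0, 1.0, 1.0], b1=[0.0, -1.5], p=1)
```

The output keeps the $L \times K$ shape of the input, but each row now reflects a phrase of up to $2p+1$ words rather than a single word.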

The weight value of each word is then computed:

where $c_{jk}$ is the similarity between the phrase centered on the $j$-th word and the $k$-th class label;

$\beta_j$ is then normalized as follows:

where $\exp$ denotes the exponential function with base $e$, and $\beta_{j'}$ is the similarity value of the $j'$-th word in the sentence;

Finally, the embedding matrix of the sentence is obtained as follows:

The above process as a whole is expressed as formula (1);
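The pooling, normalization, and weighting steps can be read as: take one score per word from its phrase-label similarities, normalize the scores with a softmax over the words of the sentence, and sum the word embeddings with those weights. The sketch below assumes the pooling is a maximum over labels, which is an assumption; the function names are hypothetical.

```python
import math
from typing import List


def attention_weights(C: List[List[float]]) -> List[float]:
    """Max-pool each word's label similarities, then softmax across the words."""
    scores = [max(row) for row in C]
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_embedding(E: List[List[float]], beta: List[float]) -> List[float]:
    """Weighted sum of word embeddings, using beta as the per-word weights."""
    dim = len(E[0])
    return [sum(b * e[d] for b, e in zip(beta, E)) for d in range(dim)]


# Toy inputs: 3 words, 2 labels, 2-dimensional embeddings (illustrative only).
C_toy = [[1.0, 0.0], [2.0, 0.5], [1.0, 0.5]]
E_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
beta = attention_weights(C_toy)
z = sentence_embedding(E_toy, beta)
```

Words whose phrases match some label strongly receive larger weights, which is how domain knowledge from the seed words enters the sentence representation.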

In the second stage, a three-layer fully connected neural network classifier is built; the sentence embedding matrix $Z_i$ is input to the classifier, which is trained to produce the prediction output $\hat{y}_i$.

The overall process is expressed as formula (2);
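A forward pass through a three-layer fully connected classifier might look like the sketch below. The ReLU hidden activations, the softmax output, the layer sizes, and the representation of $W_2$ and $b_2$ as a list of per-layer `(W, b)` pairs are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, bias)


def dense(x: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(v: List[float]) -> List[float]:
    shift = max(v)
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z: List[float], layers: List[Layer]) -> List[float]:
    """ReLU after every layer except the last, softmax on the final logits."""
    h = z
    for W, b in layers[:-1]:
        h = [max(0.0, x) for x in dense(h, W, b)]
    W, b = layers[-1]
    return softmax(dense(h, W, b))


# Toy network: 2 -> 2 -> 2 -> 3 classes, hand-picked weights (illustrative only).
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [(identity, [0.0, 0.0]),
          (identity, [0.0, 0.0]),
          ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])]
probs = classify([0.5, 0.5], layers)
```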

Further, in step three, a unified objective function for sentence embedding and unbalanced multi-class classification is constructed to guide the training of the neural network. A cross-entropy loss function is used as the base objective; weights are introduced to bias the loss toward the minority classes and strengthen the classifier's training on them; finally, the label word embedding matrix is introduced into the loss function to strengthen learning of the labels. The model is trained to minimize the resulting unbalanced objective function. After training, payment summary text with unknown labels can be classified effectively;

Specifically, the inverse weights of the classes are first computed as follows:

where $c(\cdot)$ is the number of samples in a class, $\mathrm{median}(\cdot)$ denotes the median, $y_k$ is the label vector of the $k$-th class, the $k'$-th class is the class whose sample count is the median of the sample counts of all classes, and $y_{k'}$ is the label vector of the $k'$-th class;
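Given these definitions, one natural reading is that the inverse weight of a class is the median class size divided by the size of that class, so that classes smaller than the median receive weights above 1. The sketch below assumes that reading; `inverse_class_weights` and the toy counts are hypothetical.

```python
from statistics import median
from typing import Dict


def inverse_class_weights(class_counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed form: r_k = median(class sizes) / size of class k."""
    mid = median(class_counts.values())
    return {label: mid / count for label, count in class_counts.items()}


# Toy class sizes (illustrative only, not the patent's data).
counts = {"large": 100, "medium": 10, "small": 1}
r = inverse_class_weights(counts)
```

With these toy counts the median class gets weight 1.0, the large class 0.1, and the small class 10.0, which shows why a subsequent smoothing step is useful before the weights enter the loss.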

The inverse weights are then smoothed to obtain the final weight vector, as follows:

where $S(\cdot)$ denotes the sigmoid function, $r_k$ is the inverse weight of the $k$-th class, and $r_{k'}$ is the inverse weight of the $k'$-th class;

The weight vector is then introduced to construct the loss function, as follows:

where $N$ is the total number of sentences in the dataset and $\mathrm{CE}(\cdot)$ is the cross-entropy loss function;

the function $f$ can be decomposed into two parts, $f_1$ and $f_2$, with the output of $f_1$ serving as the input of $f_2$; $y_i$ is the actual label matrix of the $i$-th sentence, $\Sigma$ is the weight vector, $\Sigma^T$ is the transpose of the weight vector, $y_{ik}$ is the value of the $k$-th label of the $i$-th sentence, which is 1 at the position of the actual label and 0 elsewhere, and $\hat{y}_{ik}$ is the predicted probability of the $k$-th label of the $i$-th sentence;
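A class-weighted cross-entropy built from these quantities can be sketched as below. The exact form is an assumption: each term $y_{ik}\log\hat{y}_{ik}$ is scaled by the weight of class $k$ and the result is averaged over the $N$ sentences. The function name, the `eps` guard, and the toy values are hypothetical.

```python
import math
from typing import List


def weighted_cross_entropy(Y: List[List[float]], Y_hat: List[List[float]],
                           weights: List[float], eps: float = 1e-12) -> float:
    """Mean over sentences of -sum_k weights[k] * y_ik * log(y_hat_ik)."""
    total = 0.0
    for y, y_hat in zip(Y, Y_hat):
        total -= sum(w * yk * math.log(max(pk, eps))
                     for w, yk, pk in zip(weights, y, y_hat))
    return total / len(Y)


# Two sentences, two classes; the second class is weighted twice as heavily.
Y = [[1.0, 0.0], [0.0, 1.0]]
Y_hat = [[0.5, 0.5], [0.5, 0.5]]
loss_weighted = weighted_cross_entropy(Y, Y_hat, weights=[1.0, 2.0])
loss_uniform = weighted_cross_entropy(Y, Y_hat, weights=[1.0, 1.0])
```

With identical predictions, the weighted loss is larger than the uniform one because the error on the up-weighted class counts double, which is the mechanism that biases training toward minority classes.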

To increase the importance of the labels in training, a dedicated label loss function is added, as follows:

where $k$ is the index of the class, $\alpha$ is the penalty coefficient, and $y_k$ is the class label matrix;
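One plausible form of such a label term, offered only as an assumption, asks the classifier to assign each label embedding $l_k$ to its own class $k$ and scales the resulting cross-entropy by $\alpha$. In the sketch below, `label_probs[k]` stands for the classifier output on $l_k$; summing over classes rather than averaging is also an assumption, and all names are hypothetical.

```python
import math
from typing import List


def label_regularizer(label_probs: List[List[float]], alpha: float,
                      eps: float = 1e-12) -> float:
    """alpha * sum_k -log(probability that label embedding k is assigned to class k)."""
    return alpha * sum(-math.log(max(probs[k], eps))
                       for k, probs in enumerate(label_probs))


# Classifier outputs for two label embeddings (illustrative values only).
label_probs = [[0.5, 0.5], [0.25, 0.75]]
penalty = label_regularizer(label_probs, alpha=0.1)
```

The penalty shrinks as each label embedding is classified more confidently into its own class, so minimizing it pulls the label representations and the classifier into agreement.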

Finally, the model is trained with the Adam algorithm, with the aim of minimizing equation (11).
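For reference, a single Adam update for one scalar parameter can be sketched as follows. The hyperparameter defaults shown are the commonly used ones and are assumptions here, since the text does not state the values used; the function name `adam_step` is hypothetical.

```python
import math
from typing import Tuple


def adam_step(theta: float, grad: float, m: float, v: float, t: int,
              lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[float, float, float]:
    """One Adam update: moving averages of the gradient and its square,
    bias correction for step t (starting at 1), then the parameter update."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from theta = 1.0 with gradient 1.0.
theta, m_state, v_state = adam_step(theta=1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected moments equal the gradient and its square, so the parameter moves by almost exactly the learning rate.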

Beneficial effects: the invention effectively solves the problem of multi-label unbalanced classification of payment voucher summary text in budget execution audit. By introducing label similarity computation, it significantly improves recall and overall performance on the minority classes, thereby greatly improving the efficiency with which auditors check budget execution compliance and identify high-risk transactions.

Drawings

Fig. 1 is a flowchart of an unbalanced text classification method for an audit field according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a neural network framework in accordance with an embodiment of the present invention.

Fig. 3 is a schematic diagram of a model training process according to an embodiment of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings. Fig. 1 shows the unbalanced text classification method for the audit field according to an embodiment of the present invention. As shown in Fig. 1, the embodiment includes the following steps:

Step one: data preprocessing and word embedding training to obtain the input data of the model. Labeled text data of payment voucher summaries is given, in which the number of samples differs across categories and the number of categories is $K$. A keyword library for budget execution audit, that is, the proper nouns of the field, is constructed from the given text, and representative seed words are selected from the keyword library as the descriptions of the labels. The text is segmented using the word library and a word segmentation tool, and word embedding vectors are pre-trained on the full audit text data to obtain the word matrix $E_i = [e_{i1}, \dots, e_{iL}]^T$, where $i$ is the sequence number of the sentence and $l$ is the sequence number of a word in the sentence. The seed words are mapped to word embeddings, and the embeddings of the seed words of each category are then averaged to obtain the embedding matrix of all labels, $L = [l_1, \dots, l_K]^T$;

Step two: constructing the model, that is, a classification framework for multi-label unbalanced text. The model is constructed as shown in Fig. 2. First, a similarity matrix is computed from the words in the sentence and the labels; a neural network then computes the similarity between context information, that is, phrases, and the labels, where two groups of parameters, $W_1$ and $b_1$, need to be trained. Next, a newly constructed pooling layer computes the weight vector between the phrases and all category labels. Finally, the weight vector is used to weight the original words, and once training is complete a suitable sentence embedding matrix is obtained, that is, a sentence embedding matrix that incorporates domain knowledge. The formula is as follows:

$Z_i = f_1(E_i, L)$ (1)

Finally, taking the sentence embedding matrix as input, a classifier is used to classify the sentence, where two groups of parameters, $W_2$ and $b_2$, need to be trained. The formula is as follows:

$\hat{y}_i = f_2(Z_i)$ (2)

where $\hat{y}_i$ is the predicted class probability matrix for the sentence embedding $Z_i$, and $f_2$ is the mapping function that takes $Z_i$ as input and produces $\hat{y}_i$ as output;

Step three: constructing a unified objective function for sentence embedding and unbalanced multi-class classification to guide the training of the neural network. A cross-entropy loss function is used as the base objective; weights are introduced to bias the loss toward the minority classes and strengthen the classifier's training on them; finally, the label word embedding matrix is introduced into the loss function to strengthen learning of the labels. The model is trained to minimize the resulting unbalanced objective function. After training, payment summary text with unknown labels can be classified effectively;

In a specific embodiment, the multi-label unbalanced text classification method in budget execution audit is described in detail as follows:

First, based on existing budget execution audit text data, sentences are segmented with the word segmentation tool LAC (Lexical Analysis of Chinese), the word frequencies within each category are counted, and a keyword library and seed words for the budget execution audit field are constructed from the segmentation results and a collected domain-specific word library:

The keyword library and seed words for the budget execution audit field are shown in the following table:

The word segmentation results obtained with LAC, using the budget execution audit word library and a conventional stop-word list, are shown in the following table:

Sequence number	Sentence	Word segmentation result
1	Shenzhen specialist attends to the ball sea following project accommodation fee	Shenzhen specialist attends to the ball sea following project accommodation fee
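The per-category word-frequency count used to build the keyword library can be sketched with the standard library. The sketch starts from text that has already been segmented, so it makes no assumption about the LAC interface; the function name `category_word_frequencies` and the toy tokens are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple


def category_word_frequencies(
        samples: Iterable[Tuple[str, List[str]]]) -> DefaultDict[str, Counter]:
    """Count how often each token occurs within each category."""
    freq: DefaultDict[str, Counter] = defaultdict(Counter)
    for category, tokens in samples:
        freq[category].update(tokens)
    return freq


# Already-segmented toy samples as (category, tokens); illustrative only.
samples = [
    ("travel", ["expert", "accommodation", "fee"]),
    ("travel", ["project", "accommodation", "fee"]),
    ("office", ["printer", "paper"]),
]
freq = category_word_frequencies(samples)
top_travel = {word for word, _ in freq["travel"].most_common(2)}
```

High-frequency tokens within a category, such as those in `top_travel`, are natural candidates for that category's seed words, subject to review against the domain word library.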

The seed words are represented with CBOW (Continuous Bag of Words) to obtain the embedding matrix corresponding to each label. Taking the travel class as an example, the embedding matrices of the seed words and of the label are shown in the following table:

The embedding matrices of the seed words in the travel class are averaged to obtain the embedding matrix of the label, as shown in the following table:

The word segmentation results are then represented with CBOW to obtain the embedding matrix corresponding to each word, as shown in the following table:

The data is divided by proportion into a training set and a test set, and the training set is input into the model for training; the training process is shown in Fig. 3.

After training, the test set is input into the trained model, and the computed $\beta_j$ is applied as the weight to obtain the sentence embedding matrix, as shown in the following table:

The final prediction obtained after the sentence embedding matrix is input to the classifier is shown in the following table:

The overall prediction results are shown in the following table:

Category	Precision	Recall	F1-score	Support
Five insurances and one fund	0.965	0.971	0.968	17573
Personnel wages and assistance	0.905	0.907	0.906	11075
Office expenses	0.931	0.905	0.918	3955
Property management fee	0.874	0.873	0.874	1983
Foundation fee	0.896	0.791	0.840	826
Travel fee	0.780	0.751	0.765	719
Special purchasing	0.697	0.685	0.677	691
Official business expense	0.645	0.690	0.667	519
Others	0.500	0.757	0.602	189
Macro Avg	0.799	0.811	0.805	37530
Weighted Avg	0.922	0.921	0.921	37530
Big Avg	0.911	0.867	0.888	15856
Small Avg	0.743	0.783	0.759	21674
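For readers checking the table, the F1-score of a class is the harmonic mean of its precision and recall. The sketch below applies that standard definition to the first row of the table; the function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall of the first row of the table above.
f1_first_row = round(f1_score(0.965, 0.971), 3)
```

Because the harmonic mean is dominated by the smaller of the two values, a class with high recall but low precision, such as the smallest class in the table, still ends up with a modest F1-score.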

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and these are intended to fall within the scope of the present invention.

Claims

1. A method for classifying multi-label unbalanced text in budget execution audit, which is characterized by comprising the following steps:

Step two: constructing the model, that is, a classification framework for multi-label unbalanced text; first, a similarity matrix is computed from the words in the sentence and the labels; a neural network then computes the similarity between context information, that is, phrases, and the labels, where two groups of parameters, $W_1$ and $b_1$, need to be trained; a newly constructed pooling layer then computes the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and once training is complete a suitable sentence embedding matrix is obtained, that is, a sentence embedding matrix that incorporates domain knowledge, the formula being as follows:

$Z_i = f_1(E_i, L)$ (1)

then, taking the sentence embedding matrix as input, a classifier is used to classify the sentence, where two groups of parameters, $W_2$ and $b_2$, need to be trained, the formula being as follows:

$\hat{y}_i = f_2(Z_i)$ (2)

where $f_2$ is the mapping function that takes $Z_i$ as input and produces $\hat{y}_i$ as output;

Step three: constructing a unified objective function for sentence embedding and unbalanced multi-class classification to guide the training of the neural network; a cross-entropy loss function is used as the base objective, weights are introduced to bias the loss toward the minority classes and strengthen the classifier's training on them, and finally the label word embeddings are introduced into the loss function to strengthen learning of the labels; the model is trained to minimize the resulting unbalanced objective function; after training, payment summary text with unknown labels is classified effectively.

2. The method for classifying multi-label unbalanced text in budget execution audit according to claim 1, wherein in step two the model is constructed, that is, a classification framework for multi-label unbalanced text: first, a similarity matrix is computed from the words in the sentence and the labels; a neural network then computes the similarity between context information, that is, phrases, and the labels, where two groups of parameters, $W_1$ and $b_1$, need to be trained; a newly constructed pooling layer then computes the weight vector between the phrases and all category labels; finally, the weight vector is used to weight the original words, and once training is complete a suitable sentence embedding matrix is obtained, that is, a sentence embedding matrix that incorporates domain knowledge;

the similarity matrix $G_i$ has size $L \times K$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm;

$c_j = \mathrm{ReLU}(G_{i,\,j-p:j+p} W_1^T + b_1), \quad 1 \le j \le L$ (4)

the weight value of each word is then computed:

where $c_{jk}$ is the similarity between the phrase centered on the $j$-th word and the $k$-th class label;

$\beta_j$ is then normalized as follows:

the above process as a whole is expressed as formula (1);

the overall process is expressed as formula (2).

3. The method for classifying multi-label unbalanced text in budget execution audit according to claim 1, wherein in step three a unified objective function for sentence embedding and unbalanced multi-class classification is constructed to guide the training of the neural network; a cross-entropy loss function is used as the base objective, weights are introduced to bias the loss toward the minority classes and strengthen the classifier's training on them, and finally the label word embeddings are introduced into the loss function to strengthen learning of the labels; the model is trained to minimize the resulting unbalanced objective function; after training, payment summary text with unknown labels is classified effectively;

the function $f$ can be decomposed into two parts, $f_1$ and $f_2$, with the output of $f_1$ serving as the input of $f_2$; $y_i$ is the actual label matrix of the $i$-th sentence, $\Sigma$ is the weight vector, $\Sigma^T$ is the transpose of the weight vector, $y_{ik}$ is the value of the $k$-th label of the $i$-th sentence, which is 1 at the position of the actual label and 0 elsewhere, and $\hat{y}_{ik}$ is the predicted probability of the $k$-th label of the $i$-th sentence;

to increase the importance of the labels in training, a label loss function is added, as follows: