CN114881173A

CN114881173A - Resume classification method and device based on self-attention mechanism

Info

Publication number: CN114881173A
Application number: CN202210616140.6A
Authority: CN
Inventors: 马涛; 李小伟; 刘金红; 何劲; 许四毛; 马春来; 常超; 杨方
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-08-09

Abstract

The invention discloses a resume classification method and device based on a self-attention mechanism. The method comprises: acquiring resume information and extracting text from it to obtain plain text information; cleaning the plain text information; extracting information from the cleaned plain text to obtain work information; and classifying the work information with a convolutional neural network resume classification model. The model comprises an embedding layer, a convolutional layer, a self-attention layer, a pooling layer and a fully connected layer. The self-attention layer applies a self-attention mechanism to the local features to acquire long-distance dependency information, and uses that information to capture key classification features in the resume text. The fully connected layer fuses the down-sampled output of the convolutional layer with the down-sampled output of the self-attention layer and feeds the fused features into a Softmax function to obtain the final classification result of the resume. The method and device can classify resumes effectively and offer better classification performance and efficiency than other classification methods.

Description

Resume classification method and device based on self-attention mechanism

Technical Field

The invention relates to the technical field of text classification, in particular to a resume classification method and device based on a self-attention mechanism.

Background

With the rapid development of the internet, electronic resume information is in explosive growth. The resume is a formal document for showing the work experience and skill of the job seeker to an online recruitment website or the human resources of the company. In order to realize effective analysis and management of the resume, improve the accurate matching of talents and posts and further improve the recruitment efficiency, one important link is to perform accurate industry category division on the resume. The traditional resume classification method mainly performs resume classification based on knowledge engineering by means of manual construction rules, and the method cannot effectively deal with massive resume files with different formats in a big data era.

The task of text classification refers to a process of automatically classifying a text set according to a certain classification system or rule. Currently, text classification has become a hot research problem in the field of natural language processing. The resume text classification refers to the division of the work industry category or the work post category based on the content of job skills, work experience and the like of job seekers in the resume text. Common work industry categories include information technology, education, finance, engineering, medical, art, and the like. The resume text classification technology realizes automatic classification of resumes and provides an information source for follow-up talent recommendation and the like.

Resume text classification can be implemented with text classification methods from natural language processing. Text classification typically includes text preprocessing, word segmentation, text representation, model construction and classification. Machine-learning text classification models mainly include naive Bayes, the K-nearest neighbor algorithm, logistic regression, decision trees and support vector machines. In recent years, thanks to the development of word vector technology and deep learning, deep-learning text classification models have avoided the laborious feature engineering required by machine-learning methods and have achieved better classification accuracy. Existing deep-learning text classification methods generally apply word embedding to the resume text, feed it into a model such as a convolutional neural network (CNN) or a recurrent neural network (RNN) for training, and finally use the trained model to predict the text class.

Classifying resume files by work industry and post provides support for subsequent resume recommendation. Depending on whether a labeled data set exists, resume classification methods based on natural language processing can be divided into unsupervised resume clustering and supervised resume classification. When there is no labeled resume data set or the resume categories are uncertain, clustering algorithms such as K-means are generally used to cluster resumes by similarity. Unsupervised clustering divides resumes into different categories by similarity, but the clustered data must still be analyzed manually to determine the categories. When the resume categories are known, a supervised resume classification model is used to predict the category of unseen resumes so that they are classified more accurately. Machine-learning resume classification requires feature engineering for feature extraction and selection, whereas deep-learning text classification models extract text features automatically. In recent years, with the rise of deep learning in natural language processing, research on deep-learning resume classification has been increasing. Existing methods still leave room for improvement in classification accuracy.

Disclosure of Invention

To address the insufficient accuracy of existing resume classification methods, the present application takes into account that features formed by different words in a resume contribute to resume classification to different degrees, and introduces a self-attention mechanism into the classical convolutional neural network text classification model so that the text features of the resume are represented more richly. Experimental verification on a resume data set shows that the resulting CNN-Attention model helps to further improve the accuracy of resume classification. By introducing the self-attention mechanism into the classification model, the features formed by different words are weighted according to their influence on the classification result, so that the model extracts the important features in the resume text and classifies resumes more accurately.

The application discloses a resume classification method based on a self-attention mechanism, which comprises the steps of obtaining resume information, carrying out text extraction on the resume information to obtain pure text information, carrying out data cleaning on the pure text information, carrying out information extraction on the pure text information after the data cleaning to obtain work information, and classifying the work information by adopting a convolutional neural network resume classification model.

The data cleaning of the plain text information comprises removing special symbols, non-printable characters, redundant blank lines and basic personal information from the text.

The data cleaning further comprises smoothing the acquired plain text information to eliminate acquisition errors that occur within a certain period of time. Specifically, the plain text information acquired for the same resume within a period of time is segmented, each segment forming a text vector $y_i$; the N text vectors obtained within that period form a text vector group $[y_1, y_2, \dots, y_N]$; a cross-correlation matrix C of the text vector group is calculated; and eigenvalue decomposition is performed on C to obtain:

$$C = V D V^{H},$$

where V is the eigenvector matrix and D is the eigenvalue matrix. The diagonal elements of D are normalized and used as a weight vector, and the text vector group acquired within the period is weighted and summed to obtain a smoothed value of the plain text information acquired for the same resume, which is used as the cleaned data.
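One possible reading of the weighting step can be written out as follows. This is a sketch under assumptions rather than the method's definitive form: the symbols $\lambda_i$, $w_i$ and $\hat{y}$ are introduced here for illustration, normalization is taken to mean division by the sum of the eigenvalues, and the i-th normalized eigenvalue is assumed to weight the i-th text vector.

```latex
\lambda_i = D_{ii}, \qquad
w_i = \frac{\lambda_i}{\sum_{j=1}^{N} \lambda_j}, \qquad
\hat{y} = \sum_{i=1}^{N} w_i \, y_i
```

Under this reading the weights are non-negative and sum to 1 whenever C is positive semidefinite, so $\hat{y}$ is a weighted average of the N text vectors.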

The work information comprises work skill information and work experience information.

The convolutional neural network resume classification model comprises an embedding layer, a convolutional layer, a self-attention layer, a pooling layer and a fully connected layer. The embedding layer obtains the word embedding vectors corresponding to the work information input into the model and passes them to the convolutional layer. The convolutional layer performs a one-dimensional convolution on the word embedding vectors to extract their local features and passes the local features to the self-attention layer. The self-attention layer applies a self-attention mechanism to the local features to acquire long-distance dependency information, and uses that information to capture key classification features in the resume text. The pooling layer down-samples the output of the self-attention layer and the output of the convolutional layer separately using max pooling and sends the two down-sampled results to the fully connected layer. The fully connected layer fuses the two down-sampled results and feeds them into a Softmax function to obtain the final classification result of the resume.

The embedding layer uses the trained continuous bag-of-words (CBOW) model in Word2Vec to obtain a word embedding vector for each word in the work information, forms a matrix from these vectors, and takes the matrix as the input of the convolutional layer. The matrix formed by the word embedding vectors is an n × d two-dimensional matrix X:

$$X = [x_1, x_2, \dots, x_i, \dots, x_n]^{T} \in \mathbb{R}^{n \times d},$$

where n is the number of word embedding vectors contained in the work information, d is the dimension of the word embedding vectors, and $x_i$ is the word embedding vector of the i-th word in the work information.
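To make the shape of X concrete, the following minimal sketch builds the n × d matrix from a tokenized text. The lookup table `EMBEDDINGS`, its values, the dimension `D = 3` and the zero vector used for unknown words are all hypothetical stand-ins for a trained Word2Vec CBOW model; they are not taken from the source.

```python
# Toy lookup table standing in for a trained Word2Vec CBOW model (hypothetical values).
EMBEDDINGS = {
    "python": [0.9, 0.1, 0.0],
    "teaching": [0.1, 0.8, 0.2],
    "audit": [0.0, 0.2, 0.9],
}
D = 3  # embedding dimension d (assumed for this toy example)


def build_embedding_matrix(tokens, table=EMBEDDINGS, dim=D):
    """Return X as a list of n rows, one d-dimensional word vector per token.

    Words missing from the table are mapped to a zero vector (an assumption).
    """
    return [list(table.get(tok.lower(), [0.0] * dim)) for tok in tokens]


X = build_embedding_matrix(["Python", "teaching", "blockchain"])
n, d = len(X), len(X[0])
print(n, d)  # number of words and embedding dimension
```

Each row of `X` corresponds to one $x_i$, so the matrix has one row per word and one column per embedding dimension.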

The fused feature vector is fed into a Softmax function for classification:

$$y_i = \mathrm{softmax}\left(\sum W_s \cdot V + b_s\right),$$

where $y_i$ represents the class obtained by classification prediction, $W_s$ is the weight matrix of the fully connected layer, V is the feature vector after feature fusion, and $b_s$ is the bias of the fully connected layer.
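The classification step can be illustrated with a small sketch. The weights, bias and fused vector below are made-up numbers, and laying out `W_s` with one row per class is an assumption made for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(V, W_s, b_s):
    """Fully connected layer followed by softmax.

    W_s holds one weight row per class (assumed layout); returns class probabilities.
    """
    logits = [sum(w * v for w, v in zip(row, V)) + b for row, b in zip(W_s, b_s)]
    return softmax(logits)


V = [1.0, 0.0, 2.0]                        # fused feature vector (toy values)
W_s = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]   # two classes, three features
b_s = [0.0, 0.0]

probs = classify(V, W_s, b_s)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

The predicted class is the index with the highest probability; the probabilities always sum to 1.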

The self-attention layer first maps the text sequence contained in the local features into a query variable Q, a key variable K and a value variable V. It then performs a dot product between Q and K, normalizes the dot product result, and uses the normalized values as weight coefficients for the value variable. Finally, the weight coefficients and the value variable are weighted and summed, and the result is processed with a softmax function to obtain the output of the self-attention layer.

The text sequence contained in the local features is mapped into the query variable Q, the key variable K and the value variable V as follows:

$$Q = S W_Q, \quad K = S W_K, \quad V = S W_V,$$

where $W_Q$, $W_K$ and $W_V$ are the linear transformation matrices of the query variable Q, the key variable K and the value variable V, respectively, and S is the text sequence matrix contained in the local features.
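The mapping and weighting can be sketched in plain Python. Note the assumption: this sketch follows the standard scaled dot-product form of self-attention, in which the dot products are divided by the square root of the key dimension and normalized row by row with softmax before the weighted sum. The source describes its normalization and softmax steps only in prose, so this is one plausible reading, and the identity matrices used as inputs are toy values.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax_rows(M):
    """Apply a numerically stable softmax to each row of M."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def self_attention(S, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over the rows of S (assumed standard form)."""
    Q, K, V = matmul(S, W_Q), matmul(S, W_K), matmul(S, W_V)
    d_k = len(K[0])  # dimension of the key vectors
    scores = [[x / math.sqrt(d_k) for x in row] for row in matmul(Q, transpose(K))]
    weights = softmax_rows(scores)      # weight coefficients for the values
    return matmul(weights, V), weights  # weighted sum of the values


S = [[1.0, 0.0], [0.0, 1.0]]            # toy sequence of two feature vectors
I = [[1.0, 0.0], [0.0, 1.0]]            # identity used for W_Q, W_K and W_V
H, weights = self_attention(S, I, I, I)
print(weights)
```

Each row of `weights` sums to 1, and every position attends most strongly to the position it is most similar to, here itself.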

The resume classification model adopts cross entropy as the loss function of the model:

$$\mathrm{Loss} = -\sum_{i=1}^{n} \sum_{k=1}^{m} p_{ik} \log y_{ik},$$

where Loss denotes the loss function, n is the number of resume samples, m is the number of resume categories, $p_{ik}$ is the probability that the i-th resume sample belongs to class k, and $y_{ik}$ is the probability that the resume classification model predicts the i-th sample as class k.
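A cross-entropy loss of this kind can be sketched as follows. The function sums over samples and classes without averaging, and clips predictions at a small `eps` to avoid taking the logarithm of zero; both choices are assumptions of this sketch, as are the parameter names `targets` and `predictions`.

```python
import math


def cross_entropy(targets, predictions, eps=1e-12):
    """Cross entropy summed over all samples and classes.

    targets[i][k]     : true probability that sample i belongs to class k
    predictions[i][k] : probability the model assigns to class k for sample i
    """
    loss = 0.0
    for t_row, p_row in zip(targets, predictions):
        for t, p in zip(t_row, p_row):
            loss -= t * math.log(max(p, eps))  # clip to avoid log(0)
    return loss


uncertain = cross_entropy([[1.0, 0.0]], [[0.5, 0.5]])
confident = cross_entropy([[1.0, 0.0]], [[0.9, 0.1]])
print(uncertain, confident)
```

A confident correct prediction yields a smaller loss than an uncertain one, which is the behavior the training procedure relies on.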

During model training, the performance of the convolutional neural network resume classification model is monitored through the change in its loss values on the resume text training set and validation set. After each round of training, the loss values on the training set and the validation set are calculated and the model parameters are updated. After training starts, the loss values on both sets decrease continuously. When the loss value on the validation set is found to be larger than that of the previous round, training is stopped, the model is considered to have been fitted, and the parameters from the previous round are used as the final parameters of the model.
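The stopping rule can be sketched as a small training loop. The callables `train_one_epoch` and `validation_loss` are hypothetical placeholders for the real training and evaluation code, and the list of loss values is invented for the demonstration.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100):
    """Stop as soon as the validation loss rises above the previous round's value.

    Returns the parameters from the round before the rise and the round index
    at which training stopped.
    """
    kept_params = None
    previous_loss = float("inf")
    for epoch in range(max_epochs):
        params = train_one_epoch(epoch)     # one round of training
        loss = validation_loss(params)      # loss on the validation set
        if loss > previous_loss:
            return kept_params, epoch       # keep the previous round's parameters
        kept_params, previous_loss = params, loss
    return kept_params, max_epochs


# Toy demonstration: the "parameters" are just the epoch index.
losses = [0.9, 0.6, 0.5, 0.55, 0.4]
result = train_with_early_stopping(
    train_one_epoch=lambda epoch: epoch,
    validation_loss=lambda params: losses[params],
    max_epochs=len(losses),
)
print(result)
```

In the demonstration the validation loss first rises in round 3, so the parameters from round 2 are kept.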

The performance of the convolutional neural network resume classification model is monitored using the test data, label data and training data of the resume text, and the model parameters are updated when abnormal performance is detected. Specifically:

Performance monitoring is realized by judging the difference among the test data, the label data and the training data. The resume text data is regarded as a stationary random process, and an autoregressive moving average (ARMA) model is established for each of the test data, the label data and the training data, giving a first, a second and a third ARMA model. The cross-correlation matrix of the coefficients of the three ARMA models is calculated, and its maximum eigenvalue is used to judge the difference among the three data sets. When the maximum eigenvalue is greater than the difference discrimination threshold, a significant difference is judged to have occurred among the test data, the label data and the training data, and the performance of the model is considered abnormal. The model is then trained with the label data and the training data, its parameters are updated according to the training result, and training stops once no abnormal performance is detected.

The application discloses a resume classification device based on a self-attention mechanism, which comprises a resume information extraction module and a convolutional neural network resume classification model, wherein the resume information extraction module is used for acquiring resume information, performing text extraction on the resume information to obtain pure text information, performing data cleaning on the pure text information, performing information extraction on the pure text information after the data cleaning to obtain working information, and outputting the working information to the convolutional neural network resume classification model; the convolutional neural network resume classification model is used for classifying the working information to obtain a resume classification result.

The invention has the beneficial effects that:

the method and the device can effectively classify the resume, and have better classification performance and classification efficiency compared with other classification methods.

Drawings

FIG. 1 is a flow chart of the overall implementation of the method of the present invention;

FIG. 2 is a diagram of the basic components of the convolutional neural network resume classification model of the present invention;

FIG. 3 is a diagram of the composition of a convolutional layer of the present invention;

FIG. 4 is a one-dimensional convolution operation process (step size 1) of the present invention;

FIG. 5 is a diagram illustrating a calculation process of a self-attention layer according to the present invention;

FIG. 6 is a diagram illustrating training scenarios of 4 models;

FIG. 7 is a graph of the results of the 4 models performed on the test set.

Detailed Description

For a better understanding of the present disclosure, an example is given here.

The present application discloses a resume classification method based on a self-attention mechanism; the overall framework of the proposed resume classification scheme is shown in FIG. 1. Electronic resumes submitted by job seekers come in various formats, such as PDF, DOC and DOCX, so plain text information must be extracted from the resumes in their different formats before classification. The resume text is then cleaned, for example by removing special symbols, non-printable characters and redundant blank lines. Basic personal information in the resume, such as the email address, telephone number and personal preferences, has no influence on the classification result; it is treated as interference and ignored, and only the work skill and work experience information is considered. An information extraction module is therefore added before classification. It extracts the work experience and work skill information from the cleaned resume text and sends the extracted text to the CNN-Attention resume classification model proposed in the present application.
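The cleaning step can be sketched with the Python standard library. The regular expressions for email addresses and telephone numbers and the set of bullet symbols are hypothetical and deliberately crude; they illustrate the kind of rule involved and would need tuning on real resumes.

```python
import re

# Crude illustrative patterns (assumptions, not taken from the source).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d ().-]{8,}\d")
SYMBOLS = re.compile("[•■◆★●]")


def clean_resume_text(text):
    """Remove contact details, non-printable characters, symbols and extra blank lines."""
    text = EMAIL.sub(" ", text)
    text = PHONE.sub(" ", text)
    # Keep printable characters plus newlines and tabs; drop control characters.
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    text = SYMBOLS.sub(" ", text)
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.split("\n")]
    cleaned = []
    for line in lines:
        # Keep a blank line only if the previous kept line is not blank.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()


sample = "Jane Doe\x00\njane@example.com\n+1 (555) 123-4567\n\n\n\n• Python developer\x0b\n"
print(clean_resume_text(sample))
```

Names and other free-form personal details cannot be removed reliably by patterns like these, which is one reason a separate information extraction step follows the cleaning.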

The data cleaning of the plain text information further comprises smoothing the acquired plain text information to eliminate acquisition errors that occur within a certain period of time. Specifically, the plain text information acquired for the same resume within a period of time is segmented, each segment forming a text vector $y_i$; the N text vectors obtained within that period form a text vector group $[y_1, y_2, \dots, y_N]$; a cross-correlation matrix C of the text vector group is calculated; and eigenvalue decomposition is performed on C to obtain:

$$C = V D V^{H},$$

where V is the eigenvector matrix and D is the eigenvalue matrix. The diagonal elements of D are normalized and used as a weight vector, and the text vector group is weighted and summed to obtain the smoothed value used as the cleaned data.

The convolutional neural network resume classification model comprises an embedding layer, a convolutional layer, a self-attention layer, a pooling layer and a fully connected layer. The embedding layer obtains the word embedding vectors corresponding to the work information input into the model and passes them to the convolutional layer. The convolutional layer performs a one-dimensional convolution on the word embedding vectors to extract their local features and passes the local features to the self-attention layer. The self-attention layer applies a self-attention mechanism to the local features to acquire long-distance dependency information, and uses that information to capture key classification features in the resume text. The pooling layer down-samples the output of the self-attention layer and the output of the convolutional layer using max pooling and sends the two down-sampled results to the fully connected layer. The fully connected layer fuses the down-sampled output of the convolutional layer with the down-sampled output of the self-attention layer and feeds the fused features into a Softmax function to obtain the final classification result of the resume.

The word embedding vectors corresponding to the work information input into the model are obtained with the continuous bag-of-words (CBOW) model in Word2Vec.

For resume files in different formats, text is extracted from PDF, DOC and DOCX files using two tools, docx2txt and pdftotext. After the resume text is obtained, it is cleaned to remove non-printable characters and the like. Information such as the name, email address, telephone number and graduating institution does not help the classification result; it may reduce classification accuracy and also increases the computational load of the classification model. The scheme finally adopted therefore classifies resumes using only the work skill and work experience information obtained through information extraction. Resume information extraction may be implemented using rule-based or statistics-based methods.
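A rule-based extraction of the work information might look like the sketch below. It assumes that resume sections are introduced by heading lines and that the heading keywords in `KEEP` and `DROP` are known in advance; both the keyword lists and the sample resume are hypothetical.

```python
# Hypothetical section headings; real resumes would need a richer list.
KEEP = ("skills", "work experience", "experience", "employment")
DROP = ("education", "contact", "hobbies", "interests", "personal information")


def extract_work_information(text):
    """Keep only the lines that fall under work skill or work experience headings."""
    kept, keeping = [], False
    for line in text.splitlines():
        heading = line.strip().lower().rstrip(":")
        if heading in KEEP:
            keeping = True      # start of a section to keep
            continue
        if heading in DROP:
            keeping = False     # start of a section to ignore
            continue
        if keeping and line.strip():
            kept.append(line.strip())
    return "\n".join(kept)


resume = (
    "Contact\njane@example.com\n"
    "Skills:\nPython, SQL\n"
    "Education\nBSc Physics\n"
    "Work Experience\nData analyst, 2019-2022"
)
print(extract_work_information(resume))
```

Text that appears before the first recognized heading is discarded, which suits the goal of ignoring basic personal information at the top of a resume.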

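The output of the embedding layer is the n × d matrix X formed by the word embedding vectors:

$$X = [x_1, x_2, \dots, x_i, \dots, x_n]^{T} \in \mathbb{R}^{n \times d},$$

where $x_i$ is the word embedding vector of the i-th word in the work information, n is the number of word embedding vectors and d is their dimension.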

A convolutional neural network is a neural network characterized by local connections and weight sharing, and it performs well in fields such as computer vision and natural language processing. It is typically composed of an input layer, convolutional layers, pooling layers, fully connected layers and an output layer. Local perception means that each neuron does not need to perceive all of the input text data; it perceives only local information, and the local information is merged at higher layers to obtain a global representation of the resume text. Weight sharing reduces the number of weights and the complexity of the network model. Because the convolutional neural network takes the word embeddings of the resume text as input, it can learn the relevant features directly from the resume text and avoids a complex feature extraction process. The structure of the convolutional layer is shown in FIG. 3: 3 sub convolutional layers are designed, the convolution kernel size is 3 × 100, and the 3 sub convolutional layers have 264, 128 and 64 convolution kernels, respectively.

The present application uses one-dimensional convolution to process the n × d resume text data output by the embedding layer. The one-dimensional convolution operation is shown in FIG. 4: the width of the convolution kernel equals the dimension of the word vector, and the kernel moves along the sequence direction of the resume text. The one-dimensional convolution operation of the CNN is as follows:

$$c_i = f\left(\sum W_c \cdot X_{i:i+h-1} + b_c\right),$$

$$C = [c_1, c_2, \dots, c_i, \dots, c_{n-h+1}],$$

where $c_i$ is the result of the convolution operation, that is, the activation output obtained by taking the dot product of the input word vector matrix with the convolution kernel and adding the bias; f denotes the activation function, which in the present application is the ReLU function; $W_c$ is the weight matrix of the convolution kernel; h is the height of the convolution kernel, that is, the number of words it covers; $X_{i:i+h-1}$ is the word vector matrix in the input window from i to i+h-1; $b_c$ is the bias; and C is the output of the convolutional layer.

After the local features are obtained through the 3 layers of convolution, a pooling layer compresses the extracted local features to reduce the computational complexity of the network. The pooling layer uses max pooling. The pooled output of the convolutional layer and the pooled output of the self-attention layer are spliced to obtain a 128 × 1 one-dimensional feature vector V:

$$V = [\mathrm{maxpool}(C);\ \mathrm{maxpool}(H)],$$

where $\mathrm{maxpool}(C)$ represents the max-pooled convolutional layer output and $\mathrm{maxpool}(H)$ represents the max-pooled self-attention layer output. Finally, the fused feature vector is fully connected to the class neurons, and the resume class is predicted through a softmax function.
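The convolution, max pooling and splicing steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the toy input, the two kernels and the `attention_maps` values standing in for the self-attention output are all hypothetical, and pooling is taken to be global max pooling over each feature map.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def conv1d(X, W_c, b_c):
    """Slide one kernel of height h, spanning the full embedding width, over X (stride 1)."""
    h = len(W_c)
    out = []
    for i in range(len(X) - h + 1):
        window = X[i:i + h]
        total = sum(
            w * x
            for w_row, x_row in zip(W_c, window)
            for w, x in zip(w_row, x_row)
        )
        out.append(relu(total + b_c))   # activation of dot product plus bias
    return out


def max_pool(feature_maps):
    """Global max pooling: keep the largest value of each feature map."""
    return [max(fm) for fm in feature_maps]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # n = 4 words, d = 2
kernel_a = [[1.0, 0.0], [0.0, 1.0]]                     # h = 2
kernel_b = [[-1.0, 0.0], [0.0, -1.0]]

conv_maps = [conv1d(X, kernel_a, 0.0), conv1d(X, kernel_b, 0.5)]
attention_maps = [[0.3, 0.7, 0.1]]      # hypothetical self-attention output

V = max_pool(conv_maps) + max_pool(attention_maps)      # spliced feature vector
print(conv_maps[0], V)
```

Each kernel yields n - h + 1 values, and after pooling every feature map contributes exactly one entry to the fused vector.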

The prediction is computed as:

$$y_i = \mathrm{softmax}\left(\sum W_s \cdot V + b_s\right),$$

where $y_i$ represents the class obtained by classification prediction, $W_s$ is the weight matrix of the fully connected layer, V is the feature vector after feature fusion, and $b_s$ is the bias of the fully connected layer.

After the convolutional neural network extracts the features of the resume text, a self-attention mechanism is added to further extract classification features. The self-attention layer is responsible for extracting long-distance dependency information in the resume text. The attention mechanism is derived from the human visual attention mechanism and can extract and focus on the important classification features in the resume text; applying it to the resume classification task therefore allows the important features of the resume text to be extracted further. The calculation process of the self-attention layer is shown in FIG. 5. The self-attention layer first maps the text sequence contained in the local features into a query variable Q, a key variable K and a value variable V. It then performs a dot product between Q and K, normalizes the dot product result, and uses the normalized values as weight coefficients for the value variable. Finally, the weight coefficients and the value variable are weighted and summed, and the result is processed with a softmax function to obtain the output of the self-attention layer. The mapping of the text sequence contained in the local features into Q, K and V is calculated as follows:

$$Q = S W_Q, \quad K = S W_K, \quad V = S W_V,$$

where $W_Q$, $W_K$ and $W_V$ are the linear transformation matrices of the query variable Q, the key variable K and the value variable V, respectively, and S is the text sequence matrix contained in the local features. By adopting a self-attention mechanism, the method reduces its dependence on external information and is better at capturing the internal correlations of data or features. Applied to the resume text, the self-attention mechanism fully captures the long-distance dependencies in the resume text sequence and assigns different weights to different attributes. The self-attention mechanism of the present application maps the output C of the convolutional layer into a set of query Q, key K and value V vectors. The output H of the self-attention layer is calculated as:

$$H = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$

where $d_k$ is the dimension of K.

Cross entropy is adopted as the loss function of the model:

$$\mathrm{Loss} = -\sum_{i=1}^{n} \sum_{k=1}^{m} t_{ik} \log p_{ik},$$

where Loss denotes the loss function, n is the number of resume samples, m is the number of resume categories, $t_{ik}$ is the probability that the i-th resume sample belongs to class k, and $p_{ik}$ is the probability that the resume classification model predicts the i-th sample as class k.

The performance of the convolutional neural network resume classification model is monitored using the test data, the label data, and the training data of the resume text, and the parameters of the model are updated when abnormal performance is detected. Performance monitoring is realized by judging the differences among the test data, the label data, and the training data. The text data over a short period is treated as a stationary random process, and an autoregressive moving average (ARMA) model is established for each of the test data, the label data, and the training data, giving a first, a second, and a third ARMA model. The cross-correlation matrix of the coefficients of the three ARMA models is computed, its maximum eigenvalue is obtained, and this maximum eigenvalue is used to discriminate the differences among the test data, the label data, and the training data. When the maximum eigenvalue exceeds the difference discrimination threshold, a significant difference is judged to have occurred among the three data sets, and the performance of the model is considered abnormal. The model is then trained with the label data and the training data, and its parameters are updated according to the training result; training stops once abnormal performance is no longer detected.

$$C = V D V^{H},$$
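The discrimination step of the monitoring procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the ARMA coefficients are assumed to have been estimated already and to be of equal length, the cross-correlation matrix is assumed to be the matrix of inner products between the three coefficient vectors, and the function names and the threshold value are hypothetical.

```python
import numpy as np

def max_eigenvalue_of_coefficients(coef_test, coef_label, coef_train):
    """Largest eigenvalue of the cross-correlation matrix of three coefficient vectors."""
    # Stack the three coefficient vectors as rows: shape (3, p).
    A = np.vstack([coef_test, coef_label, coef_train]).astype(float)
    # Assumed definition: cross-correlation as pairwise inner products, shape (3, 3).
    C = A @ A.T
    # C is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(C)[-1])

def performance_is_abnormal(coef_test, coef_label, coef_train, threshold):
    """True when the maximum eigenvalue exceeds the (hypothetical) discrimination threshold."""
    return max_eigenvalue_of_coefficients(coef_test, coef_label, coef_train) > threshold
```

When `performance_is_abnormal` returns true, the description calls for retraining the model on the label data and training data until the condition no longer holds.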

The convolutional neural network resume classification model comprises an embedding layer, a convolutional layer, a self-attention layer, a pooling layer, and a fully connected layer. The embedding layer obtains the corresponding word embedding vectors from the working information input into the model and feeds them into the convolutional layer. The convolutional layer performs a one-dimensional convolution on the word embedding vectors to extract their local features and feeds the local features into the self-attention layer. The self-attention layer applies a self-attention mechanism to the local features to obtain long-distance dependency information, and uses that information to capture the key classification features in the resume text. The pooling layer down-samples the outputs of the self-attention layer and of the convolutional layer by max pooling and sends the two down-sampled results to the fully connected layer, which fuses the features of the two results and feeds them into a Softmax function for classification to obtain the final classification result of the resume.

$$X = [x_1, x_2, \ldots, x_i, \ldots, x_n]^{T} \in \mathbb{R}^{n \times d},$$

$$y_i = \mathrm{softmax}\left(\sum W_s \cdot V + b_s\right),$$
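The pooling, feature fusion, and Softmax classification steps can be illustrated with the following NumPy sketch. It assumes that max pooling is taken over the whole sequence axis and that feature fusion is a concatenation of the two pooled vectors; neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def classify(conv_out, attn_out, W_s, b_s):
    """Fuse pooled convolution and self-attention features, then classify.

    conv_out, attn_out: arrays of shape (sequence_length, channels).
    W_s: weight matrix of shape (num_classes, 2 * channels); b_s: bias of shape (num_classes,).
    """
    # Max pooling over the sequence axis (assumed to be global pooling).
    pooled_conv = conv_out.max(axis=0)
    pooled_attn = attn_out.max(axis=0)
    # Feature fusion, assumed here to be concatenation into the vector V.
    V = np.concatenate([pooled_conv, pooled_attn])
    probs = softmax(W_s @ V + b_s)
    return int(np.argmax(probs)), probs
```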

$$Q = S W_Q, \quad K = S W_K, \quad V = S W_V,$$
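The self-attention computation can be sketched in NumPy as follows. The sketch assumes the standard scaled dot-product form, H = softmax(Q Kᵀ / √d_k) V, in which the softmax supplies the normalization of the dot products; the function names and the matrix shapes used in the usage note are illustrative only.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax applied to each row."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(S, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a sequence matrix S of shape (n, d)."""
    # Linear projections of the sequence into query, key, and value variables.
    Q, K, V = S @ W_Q, S @ W_K, S @ W_V
    d_k = K.shape[-1]
    # Dot products between queries and keys, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Normalized scores act as weight coefficients for the value variable.
    weights = softmax_rows(scores)
    # Weighted summation of the values gives the layer output H.
    return weights @ V, weights
```

For a sequence of n positions, `weights` has shape (n, n), so every position can draw on every other position regardless of distance, which is what allows long-distance dependencies to be captured.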

The resume classification model of the present application uses cross entropy as its loss function, calculated as follows, where Loss denotes the loss function, n is the number of resume samples, m is the number of resume categories, t_ik is the probability that the i-th resume sample belongs to class k, and p_ik is the probability with which the resume classification model predicts the i-th sample as class k:
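For illustration, a cross-entropy loss of this kind can be computed as in the following plain-Python sketch. It assumes the usual unaveraged double sum over samples and categories, takes the true class probabilities as `targets` and the model outputs as `predicted`, and clips probabilities with a small `eps` to avoid taking the logarithm of zero; these names and the clipping are assumptions, not details of the application.

```python
import math

def cross_entropy(targets, predicted, eps=1e-12):
    """Sum over samples i and classes k of -target[i][k] * log(predicted[i][k])."""
    total = 0.0
    for t_row, p_row in zip(targets, predicted):
        for t, p in zip(t_row, p_row):
            # Clip the predicted probability so that log(0) never occurs.
            total -= t * math.log(max(p, eps))
    return total
```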

In order to classify resumes, the resume classification model must be trained and tested on a resume corpus labeled with categories. The resume data set used in the present application comes from a public resume data set provided on the Kaggle platform. The labeled categories include Data Science, HR, Advocate, Arts, and others, 25 categories in total. The number of resumes in each category of the data set is shown in Table 1.

TABLE 1 Resume data set category statistics

Three indexes, precision, recall, and F1 value (F1-score), were selected to evaluate model performance; their definitions are shown in Table 2. TP is the number of resumes whose category is correctly judged by the resume classification model, FP is the number of irrelevant resumes whose category is wrongly judged by the model, and FN is the number of resumes of the relevant category that the model fails to identify correctly.
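The three indexes follow from TP, FP, and FN by their standard definitions, as in the sketch below. The function name is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here; how the per-category values are averaged over the 25 categories is not stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 computed from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```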

TABLE 2 evaluation index definition

The models are built with Python 3.7 and TensorFlow under a 64-bit Ubuntu 20.04 operating system. The computer is configured with an 8-core, 16-thread CPU, 16 GB of memory, and an NVIDIA GeForce RTX 2080 graphics card. For model training and evaluation, the ratio of the training set to the validation set to the test set is 6:2:2. The collected resume data set is first cleaned, including the removal of line feed characters and certain special interference characters, and the resume data is then fed into the classification model for training. To compare and evaluate the performance of the CNN-Attention resume classification model, comparison experiments were carried out on the following 4 models using the resume data set, and the evaluation index results of the experiments were observed.

1) BiLSTM model: the resume text is word-embedded and then fed into a bidirectional long short-term memory neural network for resume classification.

2) BiGRU model: the resume text is word-embedded and then fed into a bidirectional gated recurrent unit neural network for resume classification.

3) CNN model: the resume text is word-embedded and then fed into a convolutional neural network for resume classification.

4) The model of the present application: the resume classification model proposed in the present application, which combines a self-attention mechanism.
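The data cleaning and the 6:2:2 split described above can be sketched as follows. The text does not list which characters count as special interference characters, so the sketch assumes everything outside printable ASCII; the function names, the fixed random seed, and the use of shuffling before splitting are likewise assumptions.

```python
import random
import re

def clean_text(text):
    """Remove line feeds and (assumed) interference characters from resume text."""
    # Replace line feeds, carriage returns, and tabs with a space.
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Assumption: any character outside printable ASCII is an interference character.
    text = re.sub(r"[^\x20-\x7E]", " ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r" {2,}", " ", text).strip()

def split_6_2_2(samples, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```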

The dimension of the word vectors obtained through Word2vec is set to 100. Based on the statistics of the resume text lengths, the sequence length of the model is set to 1148; longer sequences are truncated and shorter ones are zero-padded. The input to the convolutional layer is therefore a vector matrix of dimension 1148 × 100. The batch size is 64, the number of epochs is 25, the optimizer is Adam, and the initial learning rate is 0.001. Dropout is used to avoid overfitting: part of the node data of the neural network is randomly discarded during training, while all nodes are used during model testing. The dropout rate is set to 0.4.
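The truncation and zero-padding to a fixed length of 1148 can be sketched as below. The sketch assumes that each resume is represented as a list of integer token indices and that both truncation and padding act on the end of the sequence; the text does not say which end is used, and the function name is hypothetical.

```python
def pad_or_truncate(token_ids, max_len=1148, pad_id=0):
    """Return a list of exactly max_len token ids, truncating or zero-padding at the end."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))
```

After each of the 1148 positions is replaced by its 100-dimensional word vector, the result is the 1148 × 100 input matrix of the convolutional layer.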

Fig. 6 shows the training behavior of the 4 models over 25 epochs. As can be seen from Fig. 6, of the two classification models based on recurrent neural networks, the BiGRU model performs better on the training set than the BiLSTM model. Of the two models that use a convolutional neural network for feature extraction, the CNN-Attention model, which incorporates the attention mechanism, performs better than the CNN model that uses the convolutional neural network alone; in the 15th epoch, the accuracy of the CNN-Attention model on the training set reaches 100%, which indicates that adding the self-attention mechanism can further improve the performance of the resume classification model. Fig. 7 shows the F1-scores of the 4 models on the resume test set, from which it can be seen that the two resume classification models based on convolutional neural networks classify more accurately than the BiLSTM and BiGRU models.

Table 3 shows the evaluation index results of the 4 models on the test set under the same experimental environment and training parameters. As can be seen from Table 3, after 25 epochs of training the CNN-Attention resume classification model achieves a precision of 97.61%, a recall of 97.14%, and an F1 value of 97.26%, the highest of the four models on every index. The comparison experiment shows that, for resume classification, the CNN-Attention model combined with the self-attention mechanism has stronger feature extraction capability and higher classification accuracy than the other models.

TABLE 3 comparative experimental results (unit:%)

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims

1. A resume classification method based on a self-attention mechanism is characterized by comprising the following steps:

acquiring resume information, and performing text extraction on the resume information to obtain plain text information;

cleaning data of the plain text information;

extracting information of the plain text information subjected to data cleaning to obtain working information;

and classifying the working information by adopting a convolutional neural network resume classification model.

2. The resume classification method based on the self-attention mechanism as claimed in claim 1, comprising:

the data cleaning of the plain text information comprises removing special symbols, non-printing characters, redundant blank lines, and basic personal information from the text.

3. The resume classification method based on the self-attention mechanism as claimed in claim 1, comprising:

the data cleaning of the plain text information comprises performing data smoothing on the obtained plain text information so as to eliminate acquisition errors occurring within a certain period of time; specifically, the plain text information acquired for the same resume within a period of time is segmented, each segment forming a text vector y_i, and the N text vectors acquired within the period form a text vector group [y_1, y_2, …, y_N]; a cross-correlation matrix C of the text vector group is computed, and eigenvalue decomposition of the cross-correlation matrix C gives:

$$C = V D V^{H},$$

where V is the eigenvector matrix and D is the eigenvalue matrix; the diagonal elements of the matrix D are normalized and used as a weight vector, and the text vector group collected within the period is weighted and summed to obtain a smoothed value of the plain text information acquired for the same resume within the period, which is used as the cleaned data.
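By way of illustration only, the smoothing step can be sketched in NumPy as follows. The sketch assumes real-valued text vectors of equal length, takes the cross-correlation matrix to be the N × N matrix of inner products between the vectors, and pairs the i-th normalized eigenvalue (in the ascending order returned by the solver) with the i-th text vector; the text does not specify this pairing, and the function name is hypothetical.

```python
import numpy as np

def smooth_text_vectors(Y):
    """Weighted sum of N text vectors, weighted by normalized eigenvalues of C.

    Y: array of shape (N, d) whose rows are the text vectors y_1 ... y_N.
    """
    Y = np.asarray(Y, dtype=float)
    # Assumed definition of the cross-correlation matrix: pairwise inner products, shape (N, N).
    C = Y @ Y.T
    # Eigenvalue decomposition C = V D V^H; C is symmetric, so eigh applies.
    eigvals, _ = np.linalg.eigh(C)
    # Remove tiny negative values caused by rounding before normalizing.
    eigvals = np.clip(eigvals, 0.0, None)
    # Normalized diagonal of D, used as the weight vector (requires a non-zero Y).
    weights = eigvals / eigvals.sum()
    return weights @ Y, weights
```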

4. The resume classification method based on the self-attention mechanism as claimed in claim 1, comprising:

5. The resume classification method based on the self-attention mechanism as claimed in claim 1, comprising:

the convolutional neural network resume classification model comprises an embedding layer, a convolutional layer, a self-attention layer, a pooling layer, and a fully connected layer; the embedding layer obtains the corresponding word embedding vectors from the working information input into the convolutional neural network resume classification model and feeds them into the convolutional layer; the convolutional layer performs a one-dimensional convolution on the word embedding vectors to extract their local features and feeds the local features into the self-attention layer; the self-attention layer applies a self-attention mechanism to the local features to obtain long-distance dependency information, and uses that information to capture the key classification features in the resume text; the pooling layer down-samples the output of the self-attention layer and the output of the convolutional layer separately by max pooling to obtain two down-sampled results and sends them to the fully connected layer; and the fully connected layer fuses the features of the two down-sampled results and feeds them into a Softmax function for classification to obtain the final classification result of the resume.

6. The resume classification method based on the self-attention mechanism as claimed in claim 5, comprising:

the embedding layer uses the trained continuous bag-of-words (CBOW) model of Word2Vec to obtain the word embedding vector of each word in the working information, forms a matrix from the word embedding vectors, and takes the matrix as the input of the convolutional layer; the matrix formed by the word embedding vectors is a two-dimensional matrix X of dimension n × d, expressed as:

$$X = [x_1, x_2, \ldots, x_i, \ldots, x_n]^{T} \in \mathbb{R}^{n \times d},$$

7. The resume classification method based on the self-attention mechanism as claimed in claim 5, comprising:

$$y_i = \mathrm{softmax}\left(\sum W_s \cdot V + b_s\right),$$

where y_i denotes the class obtained by classification prediction, W_s is the weight matrix of the fully connected layer, V is the feature vector after feature fusion, and b_s is the bias of the fully connected layer.

8. The resume classification method based on the self-attention mechanism as claimed in claim 5, comprising:

the self-attention layer first maps the text sequence contained in the local features into a query variable Q, a key variable K, and a value variable V, performs a dot product operation on the query variable Q and the key variable K, and normalizes the result of the dot product; the normalized values are used as the weight coefficients of the value variable; finally, the weight coefficients and the value variable are combined by weighted summation and processed with a softmax function to obtain the output of the self-attention layer.

9. The resume classification method based on the self-attention mechanism as claimed in claim 8, comprising:

$$Q = S W_Q, \quad K = S W_K, \quad V = S W_V,$$

10. The resume information extraction module is used for acquiring resume information, performing text extraction on the resume information to obtain plain text information, performing data cleaning on the plain text information, performing information extraction on the cleaned plain text information to obtain working information, and outputting the working information to the convolutional neural network resume classification model; the convolutional neural network resume classification model is used for classifying the working information to obtain a resume classification result;

the convolutional neural network resume classification model comprises an embedding layer, a convolutional layer, a self-attention layer, a pooling layer, and a fully connected layer; the embedding layer obtains the corresponding word embedding vectors from the working information input into the convolutional neural network resume classification model and feeds them into the convolutional layer; the convolutional layer performs a one-dimensional convolution on the word embedding vectors to extract their local features and feeds the local features into the self-attention layer; the self-attention layer applies a self-attention mechanism to the local features to obtain long-distance dependency information, and uses that information to capture the key classification features in the resume text; the pooling layer down-samples the output of the self-attention layer and the output of the convolutional layer separately by max pooling to obtain two down-sampled results and sends them to the fully connected layer; the fully connected layer fuses the features of the two down-sampled results and feeds them into a Softmax function for classification to obtain the final classification result of the resume;

$$X = [x_1, x_2, \ldots, x_i, \ldots, x_n]^{T} \in \mathbb{R}^{n \times d},$$

where n is the number of word embedding vectors contained in the working information, d is the dimension of the word embedding vectors, and x_i is the word embedding vector of the i-th word in the working information;

$$y_i = \mathrm{softmax}\left(\sum W_s \cdot V + b_s\right),$$

where y_i denotes the class obtained by classification prediction, W_s is the weight matrix of the fully connected layer, V is the feature vector after feature fusion, and b_s is the bias of the fully connected layer;

the self-attention layer first maps the text sequence contained in the local features into a query variable Q, a key variable K, and a value variable V, performs a dot product operation on the query variable Q and the key variable K, and normalizes the result of the dot product; the normalized values are used as the weight coefficients of the value variable; finally, the weight coefficients and the value variable are combined by weighted summation and processed with a softmax function to obtain the output of the self-attention layer;

$$Q = S W_Q, \quad K = S W_K, \quad V = S W_V,$$