CN111966785A - Resume information extraction method based on stacking sequence labeling - Google Patents

Resume information extraction method based on stacking sequence labeling

Info

Publication number
CN111966785A
Authority
CN
China
Prior art keywords
resume
text
sentence
information
project
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010756164.2A
Other languages
Chinese (zh)
Other versions
CN111966785B (en)
Inventor
徐建
郭培胜
徐琳
李晓冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202010756164.2A priority Critical patent/CN111966785B/en
Publication of CN111966785A publication Critical patent/CN111966785A/en
Application granted granted Critical
Publication of CN111966785B publication Critical patent/CN111966785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06F16/35 - Clustering; Classification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a resume information extraction method based on stacked sequence labeling, which comprises the following steps. Step 1: parse the PDF resume with pdfminer, converting the original PDF into a multi-line text representation; this step mainly addresses out-of-order reading and erroneous line breaks. Step 2: training-data annotation: back-label the data by distant supervision and merge same-type items during labeling. Step 3: resume block segmentation: for the sentences obtained from pdfminer, classify each sentence to decide which block it belongs to. Step 4: extract information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model; the resume block information is then used for filtering, which effectively improves recall without greatly reducing precision. Through these 4 stages, the method effectively extracts resume information.

Description

Resume information extraction method based on stacking sequence labeling
Technical Field
The invention relates to a resume information extraction method based on stacking sequence labeling.
Background
Resume key information extraction covers four major categories: attribute information, education experience, work experience, and project experience. Attribute information comprises: name, date of birth, gender, telephone, highest degree, native place, city/county, and political status. Education experience comprises: graduation institution, degree, and graduation date. Work experience comprises: employer, work content, position, and working period. Project experience comprises: project name, project responsibility, and project period. Of these 18 kinds of information, work content and project responsibility are extracted at the level of key sentences, while the other attributes are extracted as relatively short text fragments.
Existing information extraction techniques target only short text fragments; they cannot handle the extraction of long, sentence-level fragments and do not take the block structure of resume text into account.
Disclosure of Invention
Object of the invention: to overcome the shortcomings of the prior art, the invention provides a resume information extraction method based on stacked sequence labeling, which comprises the following steps:
Step 1: parse PDF resume files with pdfminer, converting the rich-text resume into a plain-text representation;
Step 2: training-data annotation: back-label the data by distant supervision and merge same-type items during labeling;
Step 3: resume block segmentation: divide the resume into 4 blocks and train a classifier to assign each piece of text to a block;
Step 4: extract information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model.
Step 1 comprises the following steps:
A PDF is a rich-text document that must first be parsed into plain text. The parsing may run into multi-column layouts, page breaks, and wrong line breaks.
Step 1-1: parse the PDF resume file into a sentence sequence with pdfminer, and use the LTTextBox component of the pdfminer parser to obtain the horizontal and vertical coordinates of the text in order to correct erroneous line breaks introduced during parsing;
Step 1-2: because of multi-column layouts and out-of-order reading during parsing, resumes fall into three templates: the ordinary sequential resume, the two-column resume whose main content is on the right, and the two-column resume whose main content is on the left.
Step 1-1 comprises the following steps:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, may produce erroneous line breaks.
First obtain all text-box (LTTextBox) components of each page of the PDF resume file, and record the lower-left and upper-right coordinates of each text box as (x0, y0) and (x1, y1) (in pdfminer the coordinate origin is at the lower-left corner of the PDF page). Then sort the text boxes by y1 descending and x0 ascending and compute each text box's height; for two text boxes with the same abscissa, merge them if the line spacing between them is less than twice the text-box height. This resolves the erroneous line breaks, as sketched below.
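A minimal sketch of this merging step is given below, assuming the pdfminer.six package; the function name merge_wrong_breaks and the tuple layout are illustrative, not mandated by the method.

```python
# Sketch of step 1-1: merge LTTextBox components that were split by wrong line breaks.
# Assumes pdfminer.six; merge_wrong_breaks and the tuple layout are illustrative.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

def merge_wrong_breaks(pdf_path):
    lines = []
    for page in extract_pages(pdf_path):
        boxes = [b for b in page if isinstance(b, LTTextBox)]
        # Sort by y1 descending (top of the page first), then x0 ascending.
        boxes.sort(key=lambda b: (-b.y1, b.x0))
        merged = []  # entries: (x0, y0, x1, y1, text)
        for box in boxes:
            height = box.y1 - box.y0
            if merged:
                px0, py0, px1, py1, ptext = merged[-1]
                # Same abscissa and a vertical gap below twice the box height: wrong break.
                if abs(px0 - box.x0) < 1e-3 and (py0 - box.y1) < 2 * height:
                    merged[-1] = (px0, box.y0, max(px1, box.x1), py1,
                                  ptext + box.get_text().strip())
                    continue
            merged.append((box.x0, box.y0, box.x1, box.y1, box.get_text().strip()))
        lines.extend(merged)
    return lines  # one tuple per corrected text line
```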
Step 1-2 comprises the following steps: traverse the text boxes (LTTextBox) in the order produced in step 1-1, recording the lower-left coordinates (x_maxlong_0, y_maxlong_0) and upper-right coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traverse each text box, recording the lower-left and upper-right coordinates of the current box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right abscissa x_cur_1 of the current text box is smaller than the lower-left abscissa x_maxlong_0 of the longest text box, the current resume is a two-column resume whose main content is on the right. A sketch of this check follows.
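The column-layout check can be sketched as follows, operating on the (x0, y0, x1, y1, text) tuples produced above; the detect_layout name and the returned labels are illustrative assumptions.

```python
# Sketch of step 1-2: detect a two-column, right-dominant resume layout.
def detect_layout(boxes):
    # "Longest" text box interpreted here as the widest one.
    x_maxlong_0, _, x_maxlong_1, _, _ = max(boxes, key=lambda b: b[2] - b[0])
    for x_cur_0, _, x_cur_1, _, _ in boxes:
        # Some box lies entirely to the left of the longest box:
        # two columns with the main content on the right.
        if x_cur_1 < x_maxlong_0:
            return "two-column, right dominant"
    return "sequential"  # the left-dominant case can be detected symmetrically
```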
Step 2 comprises the following steps:
Step 2-1: distant-supervision back-labeling: since the annotation data does not give the specific position at which each entity appears in the original resume text, the annotations must be mapped back to concrete positions according to each entity description. Traverse every entity description in each training example (a training example is a given text together with labeled fields such as working period and project experience); if an entity description appears multiple times in the resume text, all of its occurrence positions are taken as correct occurrences of the entity. Although this sacrifices a small amount of precision, it greatly improves recall. A back-labeling sketch follows;
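A minimal back-labeling sketch under these assumptions is shown below; the back_label name and the (label, description) input format are illustrative.

```python
# Sketch of step 2-1: distant-supervision back-labeling.
# Every occurrence of a labeled entity description in the resume text is
# marked as a gold span, trading a little precision for recall.
import re

def back_label(resume_text, entities):
    """entities: list of (label, description) pairs, e.g. ("time", "2016.07-2019.06")."""
    spans = []
    for label, description in entities:
        for match in re.finditer(re.escape(description), resume_text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```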
Step 2-2: merging same-type items during back-labeling: project period, education period, and working period in the resume text are merged and uniformly labeled as time, and project content and work content are uniformly labeled as content. A label-merging sketch follows.
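The merging can be sketched as a simple mapping applied to the back-labeled spans; the fine-grained label names below are illustrative assumptions.

```python
# Sketch of step 2-2: merge same-type items into the unified labels "time" and "content".
LABEL_MERGE = {
    "project_time": "time", "education_time": "time", "work_time": "time",
    "project_content": "content", "work_content": "content",
}

def merge_labels(spans):
    return [(start, end, LABEL_MERGE.get(label, label)) for start, end, label in spans]
```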
Step 3 comprises the following steps:
The training data is divided by rules into 4 blocks: basic information, education information, work information, and project information. Because the training data and the test data differ considerably, a 4-way classifier is built on the training data and then used to partition the test data into blocks. This is a text classification task consisting mainly of feature selection and classification.
In step 3, the resume text extracted in step 1 is traversed line by line; each line undergoes 4-way classification to decide whether it belongs to basic attributes, education experience, work experience, or project experience.
The method specifically comprises the following steps:
Step 3-1: because the vocabulary is large, keywords for each category are selected by the chi-square test. The chi-square statistic χ²(t, c) of term t with respect to category c is computed by the following formula:
χ²(t, c) = N*(A*D - C*B)² / [(A + C)*(B + D)*(A + B)*(C + D)]
where the parameters in the formula are defined as follows:
N: total number of documents in the training set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t.
The null hypothesis is that term t is independent of category c. The chi-square value of every term with respect to category c is computed, the results are sorted from largest to smallest, and the top k terms by chi-square value are kept as keywords (a selection sketch follows);
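A minimal sketch of the chi-square keyword selection follows; the chi2_keywords name and the (tokens, category) input format are illustrative.

```python
# Sketch of step 3-1: select the top-k keywords for one category by chi-square value.
from collections import Counter

def chi2_keywords(docs, category, k):
    """docs: list of (tokens, category) pairs, one per training document."""
    n = len(docs)
    vocab = {t for tokens, _ in docs for t in tokens}
    scores = {}
    for t in vocab:
        a = sum(1 for tokens, c in docs if t in tokens and c == category)          # A
        b = sum(1 for tokens, c in docs if t in tokens and c != category)          # B
        c_neg = sum(1 for tokens, c in docs if t not in tokens and c == category)  # C
        d = n - a - b - c_neg                                                       # D
        denom = (a + c_neg) * (b + d) * (a + b) * (c_neg + d)
        scores[t] = n * (a * d - c_neg * b) ** 2 / denom if denom else 0.0
    return [t for t, _ in Counter(scores).most_common(k)]
```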
Step 3-2: text classification model and threshold selection: xgboost, based on gradient-boosted trees, is adopted as the classifier; the data set is split into a training set and a validation set at a 9:1 ratio, and the corresponding precision, recall, and F1 value are computed:
F1 = 2*pre*recall/(pre + recall),
where pre is the precision and recall is the recall; the threshold whose corresponding F1 value is largest is selected as the classifier threshold.
The resume is tokenized and represented with the features selected in the previous step, then input to the classifier (xgboost) for 4-way classification; for each line the classifier decides whether it belongs to basic information, education information, work information, or project information. A training and threshold-selection sketch follows.
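A sketch of the classifier training and threshold selection follows, assuming xgboost and scikit-learn and a feature matrix X built from the chi-square keywords (for example, keyword indicator features); all names are illustrative, and the interpretation that lines whose top class probability falls below the threshold are treated as undecided is an assumption.

```python
# Sketch of step 3-2: train an xgboost block classifier and pick the probability
# threshold that maximizes macro F1 on a 9:1 validation split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from xgboost import XGBClassifier

def train_block_classifier(X, y):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=0)
    clf = XGBClassifier()            # 4 classes: basic, education, work, project
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)
    best_f1, best_thr = 0.0, 0.0
    for thr in np.arange(0.1, 0.95, 0.05):
        # Predictions below the confidence threshold are treated as abstentions (-1).
        pred = np.where(proba.max(axis=1) >= thr, proba.argmax(axis=1), -1)
        pre = precision_score(y_va, pred, labels=[0, 1, 2, 3], average="macro", zero_division=0)
        rec = recall_score(y_va, pred, labels=[0, 1, 2, 3], average="macro", zero_division=0)
        f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return clf, best_thr
```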
Step 4 comprises the following steps:
Step 4-1: this stage mainly extracts the work content and project experience in the resume; because these two fields are generally long, extraction is performed in units of sentences at the sentence level: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is taken as the sentence representation, and this representation is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence indicating whether the sentence belongs to work content or project experience (a model sketch follows);
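A model sketch for this sentence-level labeler is given below, assuming the transformers and pytorch-crf packages; the class name, tag count, and the choice of bert-base-chinese are illustrative assumptions.

```python
# Sketch of step 4-1: BERT [CLS] sentence vectors -> BiLSTM -> CRF over the
# sequence of sentences in one resume, tagging each sentence as work content,
# project experience, or other.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchcrf import CRF  # pytorch-crf package (assumed)

class SentenceLabeler(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_tags=3, hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, sentences, tags=None):
        # Encode each sentence independently and keep its [CLS] vector.
        enc = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0, :]    # (num_sentences, hidden_size)
        feats, _ = self.lstm(cls.unsqueeze(0))                # the resume is one sequence of sentences
        emissions = self.proj(feats)                          # (1, num_sentences, num_tags)
        if tags is not None:                                  # training: negative log-likelihood
            return -self.crf(emissions, tags.unsqueeze(0))
        return self.crf.decode(emissions)[0]                  # inference: best tag per sentence
```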
Step 4-2: this stage mainly extracts short phrase fragments in the resume, with extraction performed character by character: after the sentence-level work content and project experience have been extracted, that content is replaced by the special tokens [NUM] and [WN], and on this basis a character-level CRF extracts the remaining fields; BERT uses a 12-layer encoding network internally, and learnable parameters are introduced to weight the 12 layer outputs and obtain the final output representation:
o = γ * Σ_{i=1}^{m} softmax(s)_i * b_i
where m = 12 is the number of hidden-vector layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12 layer outputs.
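A minimal sketch of this layer weighting (a learnable softmax-weighted sum over the 12 hidden layers, consistent with the formula above) is shown below; the LayerMix name is illustrative.

```python
# Sketch of the 12-layer weighting in step 4-2: o = gamma * sum_i softmax(s)_i * b_i.
import torch
import torch.nn as nn

class LayerMix(nn.Module):
    def __init__(self, num_layers=12):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # one learnable weight per layer
        self.gamma = nn.Parameter(torch.ones(1))        # learnable global scale

    def forward(self, layer_outputs):
        # layer_outputs: list of 12 tensors of shape (batch, seq_len, hidden)
        weights = torch.softmax(self.s, dim=0)
        stacked = torch.stack(layer_outputs, dim=0)     # (12, batch, seq_len, hidden)
        return self.gamma * (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```

With BertModel called with output_hidden_states=True, the 12 transformer-layer outputs (excluding the embedding layer) can be fed into such a module before the character-level CRF.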
Advantageous effects:
The method effectively handles the multi-column, pagination, and line-break problems encountered when reading PDF rich text, and corrects the out-of-order reading sequence; by modeling at the sentence level and at the character level respectively, it completes the extraction of both long and short text fragments; and by partitioning the resume into blocks it extracts specific information from specific regions, which effectively resolves the confusion caused by similar fields (for example, working period and project period look alike, so both are labeled with the unified label time and the resume block information is then used to tell them apart).
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of sentence-level attribute extraction.
Fig. 3 is a schematic diagram of the weighted representation of the 12-layer network outputs.
Fig. 4 is a schematic diagram of a line of text.
Fig. 5 is a schematic diagram of an original resume.
FIG. 6 is a diagram of the structured format obtained after extraction from the original resume.
Detailed Description
As shown in fig. 1, the present invention provides a resume information extraction method based on stacked sequence labeling, which includes the following steps:
Step 1: parse PDF resume files with pdfminer, converting the rich-text resume into a plain-text representation;
Step 2: training-data annotation: back-label the data by distant supervision and merge same-type items during labeling;
Step 3: resume block segmentation: divide the resume into 4 blocks and train a classifier to assign each piece of text to a block;
Step 4: extract information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model.
Step 1 comprises the following steps:
A PDF is a rich-text document that must first be parsed into plain text. The parsing may run into multi-column layouts, page breaks, and wrong line breaks.
Step 1-1: parse the PDF resume file into a sentence sequence with pdfminer, and use the LTTextBox component of the pdfminer parser to obtain the horizontal and vertical coordinates of the text in order to correct erroneous line breaks introduced during parsing. The specific steps are as follows:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, may produce erroneous line breaks; for example, as shown in Fig. 4, the month part of a date may be pushed onto the next line, causing a parsing error.
First obtain all LTTextBox components of each page of the PDF, and record the lower-left and upper-right coordinates of each text box as (x0, y0) and (x1, y1) (in pdfminer the coordinate origin is at the lower-left corner of the page). Then sort the text boxes by y1 descending and x0 ascending and compute each text box's height; for two text boxes with the same abscissa, merge them if the line spacing between them is less than twice the text-box height. This resolves the erroneous line breaks.
Step 1-2: because of multi-column layouts and out-of-order reading during parsing, resumes fall into three templates: the ordinary sequential resume, the two-column resume whose main content is on the right, and the two-column resume whose main content is on the left. The specific steps are as follows:
For the text boxes (LTTextBox) in the order produced in the previous step, record the lower-left coordinates (x_maxlong_0, y_maxlong_0) and upper-right coordinates (x_maxlong_1, y_maxlong_1) of the longest text box. Then traverse each text box, recording the current box's coordinates (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right abscissa x_cur_1 of the current text box is smaller than the lower-left abscissa x_maxlong_0 of the longest text box, the current resume is a two-column resume whose main content is on the right. Resumes in other formats can be determined similarly.
Step 2 comprises the following steps:
Step 2-1: distant-supervision back-labeling: since the annotation data does not give the specific position at which each entity appears in the original text, each entity description must be mapped back to its concrete positions in the original text. If an entity description appears multiple times in the original text, all of its occurrence positions are taken as correct occurrences of the entity; although this sacrifices a small amount of precision, it greatly improves recall;
Step 2-2: merging same-type items during back-labeling: project period, education period, and working period in the resume text are merged and uniformly labeled as time, and project content and work content are uniformly labeled as content.
Step 3 comprises the following steps: the training data is divided by rules into 4 blocks: basic information, education information, work information, and project information. Because the training data and the test data differ considerably, a 4-way classifier is built on the training data and then used to partition the test data into blocks. This is a text classification task consisting mainly of feature selection and classification.
Step 3-1: because the vocabulary is large, keywords for each category are selected by the chi-square test. The chi-square statistic of term t with respect to category c is computed by the following formula:
χ²(t, c) = N*(A*D - C*B)² / [(A + C)*(B + D)*(A + B)*(C + D)]
where the parameters in the formula are defined as follows:
N: total number of documents in the training set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t.
The null hypothesis is that term t is independent of category c. The chi-square value of every term with respect to category c is computed, the results are sorted from largest to smallest, and the top k terms by chi-square value are kept as keywords;
Step 3-2: each extracted line of the resume is classified to decide whether it belongs to basic attributes, education experience, work experience, or project experience. Text classification model and threshold selection: xgboost, based on gradient-boosted trees, is used as the classifier; the data set is split into a training set and a validation set at a 9:1 ratio; different thresholds are set for the classifier, the precision, recall, and F1 value on the validation set are computed for each threshold, and the threshold with the largest F1 is selected as the decision criterion;
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall;
The original text, represented with the features selected in the previous step, is input to the classifier (xgboost) for 4-way classification; for each line the classifier decides which block the line belongs to: basic information, education information, work information, or project information.
As shown in fig. 2 and 3, step 4 includes:
Step 4-1: this stage mainly extracts the work content and project experience in the resume; because these two fields are generally long, extraction is performed in units of sentences at the sentence level: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is taken as the sentence representation, and this representation is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, i.e. whether the sentence belongs to work content or project experience (as shown in FIG. 2);
Step 4-2: this stage mainly extracts short phrase fragments in the resume, with extraction performed character by character: after the sentence-level work content and project experience have been extracted, that content is replaced by the special tokens [NUM] and [WN], and on this basis a character-level CRF extracts the remaining fields; BERT uses a 12-layer encoding network internally, and learnable parameters are introduced to weight the 12 layer outputs and obtain the final output representation:
o = γ * Σ_{i=1}^{m} softmax(s)_i * b_i
where m = 12 is the number of hidden-vector layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12 layer outputs. Fig. 5 is a schematic diagram of an original resume, and Fig. 6 is a schematic diagram of the result of extracting the resume information with the method of the invention.
The invention provides a resume information extraction method based on stacked sequence labeling, and there are many specific methods and ways to implement this technical solution; the above is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be implemented with the prior art.

Claims (7)

1. A resume information extraction method based on stacked sequence labeling, characterized by comprising the following steps:
step 1, parsing PDF resume files with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, training-data annotation: back-labeling the data by distant supervision and merging same-type items during labeling;
step 3, resume block segmentation: dividing the resume into 4 blocks and training a classifier to assign the text to blocks;
and step 4, extracting information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model.
2. The method of claim 1, wherein step 1 comprises:
step 1-1, parsing the PDF resume file into a sentence sequence with pdfminer, and using the LTTextBox component of the pdfminer parser to obtain the horizontal and vertical coordinates of the text in order to correct erroneous line breaks introduced during parsing;
step 1-2, dividing resumes into three templates: the ordinary sequential resume, the two-column resume whose main content is on the right, and the two-column resume whose main content is on the left.
3. The method of claim 2, wherein step 1-1 comprises:
first obtaining all text-box components of each page of the PDF resume file, then obtaining the lower-left and upper-right coordinates of each text box as (x0, y0) and (x1, y1), respectively, then sorting by y1 descending and x0 ascending and computing the text-box heights, and merging two text boxes with the same abscissa if the line spacing between them is less than twice the text-box height.
4. The method of claim 3, wherein step 1-2 comprises: traversing the text boxes in the order produced in step 1-1, recording the lower-left coordinates (x_maxlong_0, y_maxlong_0) and upper-right coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traversing each text box, recording the lower-left and upper-right coordinates of the current text box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); and if the upper-right abscissa x_cur_1 of the current text box is smaller than the lower-left abscissa x_maxlong_0 of the longest text box, determining that the current resume is a two-column resume whose main content is on the right.
5. The method of claim 4, wherein step 2 comprises:
step 2-1, distant-supervision back-labeling: traversing every entity description in each training example, and if an entity description appears multiple times in the resume text, taking all of its occurrence positions as correct occurrences of the entity;
step 2-2, merging same-type items during back-labeling: merging and uniformly labeling project period, education period, and working period in the resume text as time, and uniformly labeling project content and work content as content.
6. The method of claim 5, wherein step 3 comprises:
traversing the resume text extracted in step 1 line by line and performing 4-way classification of each line to decide whether the line belongs to basic attributes, education experience, work experience, or project experience, specifically comprising:
step 3-1, selecting keywords for each category by the chi-square test: computing the chi-square statistic χ²(t, c) of term t with respect to category c by the following formula:
χ²(t, c) = N*(A*D - C*B)² / [(A + C)*(B + D)*(A + B)*(C + D)]
wherein the parameters in the formula are defined as follows:
N: total number of documents in the training set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis being that term t is independent of category c; computing the chi-square value of every term with respect to category c, sorting the results from largest to smallest, and keeping the top k terms by chi-square value;
step 3-2, adopting xgboost, based on gradient-boosted trees, as the classifier, splitting the training data set into a training set and a validation set at a 9:1 ratio, and computing the corresponding precision, recall, and F1 value:
F1 = 2*pre*recall/(pre + recall),
where pre is the precision and recall is the recall; selecting the threshold whose corresponding F1 value is largest as the classifier threshold;
and tokenizing the resume, performing feature selection by the chi-square test of step 3-1, and inputting the representation into the classifier (xgboost) for 4-way classification, wherein for each line the classifier decides whether the line belongs to basic information, education information, work information, or project information.
7. The method of claim 6, wherein step 4 comprises:
step 4-1, extracting the work content and project experience in the resume at the sentence level, in units of sentences: first encoding each sentence with BERT, taking the [CLS] encoding vector at the beginning of the sentence as the sentence representation, and then passing this representation through a bidirectional LSTM and a CRF network to obtain a label for each sentence, the label indicating whether the sentence belongs to work content or project experience;
step 4-2, extracting phrase fragments in the resume: after the sentence-level work content and project experience have been extracted, replacing that content with special tokens, the work content with [NUM] and the project experience with [WN], and on this basis extracting the remaining fields with a character-level CRF; wherein BERT uses a 12-layer encoding network internally, and learnable parameters are introduced to weight the 12 layer outputs to obtain the final output representation:
o = γ * Σ_{i=1}^{m} softmax(s)_i * b_i
wherein m = 12 is the number of hidden-vector layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; and o is the weighted representation of the 12 layer outputs.
CN202010756164.2A 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling Active CN111966785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Publications (2)

Publication Number Publication Date
CN111966785A true CN111966785A (en) 2020-11-20
CN111966785B CN111966785B (en) 2023-06-20

Family

ID=73363289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010756164.2A Active CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Country Status (1)

Country Link
CN (1) CN111966785B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861630A (en) * 2022-05-10 2022-08-05 马上消费金融股份有限公司 Information acquisition and related model training method and device, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170315984A1 (en) * 2016-04-29 2017-11-02 Cavium, Inc. Systems and methods for text analytics processor
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Identify method and device, the computer equipment, storage medium of resume

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170315984A1 (en) * 2016-04-29 2017-11-02 Cavium, Inc. Systems and methods for text analytics processor
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Identify method and device, the computer equipment, storage medium of resume

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张博: "Design and Implementation of a Resume Information Extraction System Based on a Domain Knowledge Base", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861630A (en) * 2022-05-10 2022-08-05 马上消费金融股份有限公司 Information acquisition and related model training method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN111966785B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
US11782928B2 (en) Computerized information extraction from tables
CN106384282A (en) Method and device for building decision-making model
CN112287916B (en) Video image text courseware text extraction method, device, equipment and medium
JP2005526314A (en) Document structure identifier
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
JP2004139484A (en) Form processing device, program for implementing it, and program for creating form format
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN105677638B (en) Web information abstracting method
WO2014050774A1 (en) Document classification assisting apparatus, method and program
US7046847B2 (en) Document processing method, system and medium
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN116205211A (en) Document level resume analysis method based on large-scale pre-training generation model
CN112241730A (en) Form extraction method and system based on machine learning
JP4787955B2 (en) Method, system, and program for extracting keywords from target document
WO2020252931A1 (en) Pdf file data extraction method and apparatus, device, and storage medium
CN111966785A (en) Resume information extraction method based on stacking sequence labeling
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
Long An agent-based approach to table recognition and interpretation
JPH0821057B2 (en) Document image analysis method
CN116629258A (en) Structured analysis method and system for judicial document based on complex information item data
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN110765107A (en) Question type identification method and system based on digital coding
CN113642291B (en) Method, system, storage medium and terminal for constructing logical structure tree reported by listed companies
CN115659989A (en) Web table abnormal data discovery method based on text semantic mapping relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant