CN111966785B - Resume information extraction method based on stacking sequence labeling - Google Patents

Resume information extraction method based on stacking sequence labeling

Info

Publication number
CN111966785B
CN111966785B
Authority
CN
China
Prior art keywords
resume
text
sentence
information
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010756164.2A
Other languages
Chinese (zh)
Other versions
CN111966785A (en)
Inventor
徐建
郭培胜
徐琳
李晓冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202010756164.2A priority Critical patent/CN111966785B/en
Publication of CN111966785A publication Critical patent/CN111966785A/en
Application granted granted Critical
Publication of CN111966785B publication Critical patent/CN111966785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention provides a resume information extraction method based on stacking sequence labeling, which comprises the following steps. Step 1: parse the pdf resume with pdfminer and convert the original pdf into a multi-line text representation; this stage mainly resolves the out-of-order reading and wrong line-break problems that arise during parsing. Step 2: label the training data, using remotely supervised data for back-labeling and merging similar items during labeling. Step 3: divide the resume into information blocks by classifying each sentence obtained from pdfminer and judging which block it belongs to. Step 4: use a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level, and filter the results with the resume block information, which effectively improves recall without a large drop in precision. Through these four stages, the invention can effectively extract resume information.

Description

Resume information extraction method based on stacking sequence labeling
Technical Field
The invention relates to a resume information extraction method based on stacking sequence labeling.
Background
The key information to be extracted from a resume falls into four major categories: attribute information, education experience, work experience, and project experience. The attribute information includes: name, date of birth, sex, telephone, highest education, native place, city, county, and political status. The education experience includes: graduation university, degree, and graduation time. The work experience includes: work unit, work content, job title, and work time. The project experience includes: project name, project responsibility, and project time. Among these types of information, work content and project responsibility are extracted at the sentence level, while the other attributes are extracted as relatively short text segments.
Existing information extraction techniques target only short text fragments; they cannot handle the extraction of long, sentence-level fragments and do not take into account the block structure of the resume text itself.
Disclosure of Invention
Purpose of the invention: to address the shortcomings of the prior art, the invention provides a resume information extraction method based on stacking sequence labeling, which comprises the following steps:
step 1, parsing the resume file in pdf format with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, labeling the training data: remotely supervised data are back-labeled onto the text, and similar items are merged during labeling;
step 3, dividing the resume into information blocks: the resume is divided into 4 blocks, and a classifier is trained to assign text to blocks;
step 4, using a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level.
The step 1 comprises the following steps:
A pdf is a rich-text document that must first be parsed into plain text; the parsing process has to deal with multi-column layouts, section splitting and wrongly broken lines.
Step 1-1, parse the pdf resume file into a sequence of sentences with pdfminer, and use the horizontal and vertical coordinates of text obtained from pdfminer's LTTextBox components to correct the wrong line breaks introduced during parsing;
step 1-2, to handle the column-split and out-of-order reading problems that arise during parsing, classify resumes into three templates: ordinary sequential resumes, resumes split into left and right columns with the right column as the main body, and resumes split into left and right columns with the left column as the main body.
Step 1-1 includes:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, sometimes breaks one logical line into several boxes.
First, all text box (LTTextBox) components of each page of the pdf resume file are obtained, and the coordinates of the lower-left and upper-right corners of each text box are recorded as (x0, y0) and (x1, y1) respectively (pdfminer places the coordinate origin at the lower-left corner of the pdf page). The text boxes are then sorted in descending order of y1 and ascending order of x0 and their heights are computed; for two text boxes with the same abscissa, if the line spacing between them is smaller than twice the text box height, the two boxes are merged. This resolves the wrong line-break problem.
Step 1-2 comprises: traverse the text boxes (LTTextBox) in the resume in the order established in step 1-1, recording the lower-left corner coordinates (x_maxlong_0, y_maxlong_0) and the upper-right corner coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traverse each text box again, recording the lower-left and upper-right corner coordinates of the current text box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right corner coordinate x_cur_1 of some text box is smaller than the lower-left corner coordinate x_maxlong_0 of the longest text box, the current resume is split into left and right columns with the right column as the main body.
The step 2 comprises the following steps:
Step 2-1, remote supervision data labeling: since the annotation data do not give the exact position at which each entity appears in the resume text, each entity must be labeled back to its position in the text according to the entity description. Each entity description in every training sample is traversed (a training sample is a given text annotated with fields such as work time and project experience); if an entity description appears in the resume text multiple times, all occurrence positions are judged to be correct occurrences of that entity. This greatly improves recall at a small cost in precision;
step 2-2, merging similar items during data back-labeling: project time, education time and work time in the resume text are uniformly labeled as time, and project content and work content are uniformly labeled as content.
The step 3 comprises the following steps:
Using rules, the training data are divided into 4 blocks: basic information, education information, work information and project information. Because the training data and the test data differ considerably, a 4-class classifier is built from the training data and used to divide the test data into blocks. This is a text classification process consisting mainly of feature selection and classification.
In step 3, the resume text extracted in step 1 is traversed line by line, each line is classified into one of 4 classes, and it is judged whether the line belongs to basic attributes, education experience, work experience or project experience.
The method specifically comprises the following steps:
Step 3-1, because the vocabulary is large, chi-square testing is used to extract the keywords that identify each category; the chi-square statistic χ2(t,c) of term t for category c is calculated by the following formula:
χ2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
The meaning of each parameter in the formula is as follows:
N: total number of documents in the training data set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis is that term t is independent of category c; for each term, the chi-square value with respect to category c is calculated, the results are sorted from largest to smallest, and the first k terms are taken in descending order of chi-square value;
step 3-2, text classification model and threshold selection stage: xgboost, a gradient boosted tree model, is adopted as the classifier; the training data set is divided into a training set and a validation set at a ratio of 9:1, and the corresponding precision, recall and F1 values are calculated:
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall; the value whose corresponding F1 value is largest is selected as the classifier threshold;
the resume text is segmented into words and represented, the features of the previous step are fed into the classifier for 4-class judgment (an xgboost classifier is selected), and the classifier judges, for each line, whether the line belongs to basic information, education information, work information or project information.
Step 4 comprises:
Step 4-1, this stage mainly extracts the work content and project experience in the resume; because these two kinds of content are generally long, information extraction is performed in units of sentences, i.e. sentence-level extraction: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is used as the sentence representation, and the sequence of sentence vectors is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, which indicates whether the sentence belongs to work content or project experience;
step 4-2, this stage mainly extracts the shorter phrase fragments in the resume, and extraction is performed in units of characters: after the sentence-level work content and project experience have been extracted, they are replaced by the special tokens [NUM] and [WN], and the other fields are then extracted with a character-level CRF; BERT internally uses a 12-layer encoding network, and learnable parameters are set to weight the outputs of the 12 layers to obtain the final output representation:
o = γ * Σ_{i=1}^{m} s_i * b_i
where m = 12 is the number of hidden layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12-layer network output.
Beneficial effects:
The invention effectively solves the column-split, page-break and line-fold problems in reading pdf rich text, as well as the problem of out-of-order reading; by modeling in units of both sentences and characters it completes the extraction of long and short text fragments; and by segmenting the resume into blocks it extracts specific information from specific regions, which resolves the confusion caused by similar fields (for example, work time and project time behave similarly and are uniformly labeled as time, and are then distinguished using the resume block information).
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a sentence-level attribute extraction schematic.
Fig. 3 is a schematic diagram of the weighting of the 12-layer network outputs.
Fig. 4 is a line text schematic.
Fig. 5 is a schematic diagram of an original resume.
Fig. 6 is a schematic diagram of the format after extracting the original resume.
Detailed Description
As shown in FIG. 1, the invention provides a resume information extraction method based on stacking sequence labeling, which comprises the following steps:
step 1, parsing the resume file in pdf format with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, labeling the training data: remotely supervised data are back-labeled onto the text, and similar items are merged during labeling;
step 3, dividing the resume into information blocks: the resume is divided into 4 blocks, and a classifier is trained to assign text to blocks;
step 4, using a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level.
The step 1 comprises the following steps:
A pdf is a rich-text document that must first be parsed into plain text; the parsing process has to deal with multi-column layouts, section splitting and wrongly broken lines.
Step 1-1, parse the pdf resume file into a sequence of sentences with pdfminer, and use the horizontal and vertical coordinates of text obtained from pdfminer's LTTextBox components to correct the wrong line breaks introduced during parsing. The method is as follows:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, sometimes breaks a line incorrectly, for example placing the month "11" of a date on the next line as shown in Fig. 4, which results in a parsing error.
First, all LTTextBox components of each page of the pdf are obtained, and the coordinates of the lower-left and upper-right corners of each text box are recorded as (x0, y0) and (x1, y1) respectively (pdfminer places the coordinate origin at the lower-left corner of the page). The text boxes are then sorted in descending order of y1 and ascending order of x0 and their heights are computed; for two text boxes with the same abscissa, if the line spacing between them is smaller than twice the text box height, the two boxes are merged. This resolves the wrong line-break problem.
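By way of illustration only (not part of the claimed method), the merging heuristic above can be sketched in Python roughly as follows, assuming the pdfminer.six package; the function name and the tolerance on the abscissa are illustrative choices:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

def page_lines(pdf_path):
    """Yield, for each page, the text lines after re-joining wrongly broken boxes."""
    for page in extract_pages(pdf_path):
        boxes = [(b.x0, b.y0, b.x1, b.y1, b.height, b.get_text().strip())
                 for b in page if isinstance(b, LTTextBox)]
        # Reading order: descending y1 (top to bottom), ascending x0 (left to right).
        boxes.sort(key=lambda b: (-b[3], b[0]))
        lines = []
        for x0, y0, x1, y1, height, text in boxes:
            if lines:
                prev_x0, prev_y0, prev_text, prev_h = lines[-1]
                # Same abscissa and a vertical gap below twice the box height:
                # pdfminer split one logical line into two boxes, so merge them.
                if abs(prev_x0 - x0) < 1.0 and (prev_y0 - y1) < 2 * height:
                    lines[-1] = (prev_x0, y0, prev_text + text, prev_h)
                    continue
            lines.append((x0, y0, text, height))
        yield [t for _, _, t, _ in lines]
```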
Step 1-2, to handle the column-split and out-of-order reading problems that arise during parsing, classify resumes into three templates: ordinary sequential resumes, resumes split into left and right columns with the right column as the main body, and resumes split into left and right columns with the left column as the main body. The specific method is as follows:
The text boxes (LTTextBox) are traversed in the order of the previous step, and the lower-left corner coordinates (x_maxlong_0, y_maxlong_0) and upper-right corner coordinates (x_maxlong_1, y_maxlong_1) of the longest text box are recorded. Each text box is then traversed again, recording the current text box coordinates (x_cur_0, y_cur_0), (x_cur_1, y_cur_1); if the upper-right corner coordinate x_cur_1 of the current text box is smaller than the lower-left corner coordinate x_maxlong_0 of the longest text box, the current resume is split into left and right columns with the right column as the main body. The other resume formats can be determined similarly.
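The column-detection rule of step 1-2 admits a similarly minimal sketch; the function below is illustrative only and assumes pdfminer layout boxes such as those collected above:

```python
def is_right_main_two_column(boxes):
    """Step 1-2 heuristic: the resume is split into left/right columns with the
    right column as the main body if some text box ends (x1) to the left of the
    abscissa (x0) at which the longest text box starts."""
    longest = max(boxes, key=lambda b: b.x1 - b.x0)   # the widest (longest) text box
    return any(b.x1 < longest.x0 for b in boxes if b is not longest)
```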
The step 2 comprises the following steps:
Step 2-1, remote supervision data labeling: since the annotation data do not give the exact position at which each entity appears in the original document, each entity must be labeled back to its position in the original document according to the entity description. If an entity description appears in the original text multiple times, all occurrence positions are taken as correct occurrences of that entity; this greatly improves recall at a small cost in precision;
step 2-2, merging similar items during data back-labeling: project time, education time and work time in the resume text are uniformly labeled as time, and project content and work content are uniformly labeled as content.
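A minimal sketch of the back-labeling and label-merging procedure is given below for illustration; the field names in entity_values are hypothetical and not taken from the patent:

```python
import re

# Step 2-2: similar items are merged into a single label before back-annotation.
MERGE = {"project_time": "TIME", "education_time": "TIME", "work_time": "TIME",
         "project_content": "CONTENT", "work_content": "CONTENT"}

def back_annotate(resume_text, entity_values):
    """Step 2-1 sketch: every occurrence of an annotated entity string in the
    resume text is treated as a correct mention of that entity and labelled
    with its (merged) type. Returns (start, end, label) spans."""
    spans = []
    for field, value in entity_values.items():
        label = MERGE.get(field, field.upper())
        for match in re.finditer(re.escape(value), resume_text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```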
Step 3 comprises the following steps: using rules, the training data are divided into 4 blocks: basic information, education information, work information and project information. Because the training data and the test data differ considerably, a 4-class classifier is built from the training data and used to divide the test data into blocks. This is a text classification process consisting mainly of feature selection and classification.
Step 3-1, because the vocabulary is large, chi-square testing is used to extract the keywords that identify each category; the chi-square statistic χ2(t,c) of term t for category c is calculated as follows:
χ2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
The meaning of each parameter in the formula is as follows:
N: total number of documents in the training data set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis is that term t is independent of category c; for each term, the chi-square value with respect to category c is calculated, the results are sorted from largest to smallest, and the first k terms are taken in descending order of chi-square value;
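For illustration, the chi-square keyword selection of step 3-1 can be sketched as follows, assuming each document is given as a list of tokens; the function name and signature are illustrative, not part of the method as claimed:

```python
from collections import Counter

def chi2_top_k(docs, labels, category, k):
    """Score every term against `category` with the chi-square statistic
    chi2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D)) and return the top k."""
    n = len(docs)
    n_c = sum(1 for lab in labels if lab == category)
    df_t, df_tc = Counter(), Counter()
    for tokens, lab in zip(docs, labels):
        for t in set(tokens):
            df_t[t] += 1                      # documents containing term t
            if lab == category:
                df_tc[t] += 1                 # ... that also belong to category c
    scores = {}
    for t, dft in df_t.items():
        a = df_tc[t]                          # contains t, in c
        b = dft - a                           # contains t, not in c
        c = n_c - a                           # in c, without t
        d = n - n_c - b                       # neither contains t nor in c
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[t] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]
```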
Step 3-2, the resume is classified line by line on the basis of the extracted lines, each line being assigned to one of 4 classes: basic attributes, education experience, work experience or project experience. Text classification model and threshold selection stage: xgboost, a gradient boosted tree model, is adopted as the classifier; the data set is divided into a training set and a validation set at a ratio of 9:1; different thresholds are set for the classifier, the precision, recall and F1 value on the validation set are computed for each threshold, and the threshold with the largest F1 value is selected as the decision criterion;
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall;
The original text, represented by the features selected in the previous step, is fed into the classifier for 4-class judgment (an xgboost classifier is selected); the classifier judges, for each line, which block the line belongs to: basic information, education information, work information or project information.
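The classifier and threshold-selection stage can be sketched as follows (illustrative only; it assumes the xgboost and scikit-learn packages, that X already holds the keyword features of step 3-1, and that lines whose top class probability falls below the threshold are rejected):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def train_block_classifier(X, y):
    """Train the 4-way line classifier and pick the confidence threshold that
    maximises macro-F1 on a 9:1 train/validation split (step 3-2 sketch)."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=0)
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    clf.fit(X_tr, y_tr)
    probs = clf.predict_proba(X_va)
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in np.arange(0.1, 0.95, 0.05):
        pred = probs.argmax(axis=1)
        pred[probs.max(axis=1) < threshold] = -1     # reject low-confidence lines
        f1 = f1_score(y_va, pred, average="macro")   # rejections count as errors
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return clf, best_threshold
```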
As shown in fig. 2 and 3, step 4 includes:
Step 4-1, this stage mainly extracts the work content and project experience in the resume; because these two kinds of content are generally long, information extraction is performed in units of sentences, i.e. sentence-level extraction: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is used as the sentence representation, and the sequence of sentence vectors is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, i.e. whether the sentence belongs to work content or project experience (as shown in Fig. 2);
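A minimal sketch of such a sentence-level tagger is shown below for illustration; it assumes PyTorch, the Hugging Face transformers package, the pytorch-crf package and the bert-base-chinese checkpoint, none of which are prescribed by the patent:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchcrf import CRF   # pytorch-crf; any linear-chain CRF layer would do

class SentenceLevelTagger(nn.Module):
    """Encode each resume sentence with BERT, take the [CLS] vector as the
    sentence representation, and label the sentence sequence with a BiLSTM-CRF
    (tags: work content / project experience / other)."""
    def __init__(self, bert_name="bert-base-chinese", num_tags=3, hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _sentence_vectors(self, sentences):
        enc = self.tokenizer(sentences, padding=True, truncation=True,
                             return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] vector per sentence
        return cls.unsqueeze(0)                          # one resume = one sequence

    def decode(self, sentences):
        feats, _ = self.lstm(self._sentence_vectors(sentences))
        return self.crf.decode(self.emit(feats))[0]      # one tag per sentence
```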
Step 4-2, this stage mainly extracts the shorter phrase fragments in the resume, and extraction is performed in units of characters: after the sentence-level work content and project experience have been extracted, they are replaced by the special tokens [NUM] and [WN], and the other fields are then extracted with a character-level CRF; BERT internally uses a 12-layer encoding network, and learnable parameters are set to weight the outputs of the 12 layers to obtain the final output representation:
o = γ * Σ_{i=1}^{m} s_i * b_i
where m = 12 is the number of hidden layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12-layer network output. Fig. 5 is a schematic diagram of an original resume, and Fig. 6 is a schematic diagram of the result obtained by extracting the resume information with the method of the present invention.
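The layer weighting can be sketched as a small module (illustrative only; the initialisation of s, the checkpoint name and the option of softmax-normalising the weights are assumptions, the patent only states that γ and s_i are learnable):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class WeightedBertEncoder(nn.Module):
    """Combine the outputs of the 12 BERT encoder layers with learnable
    per-layer weights s_i and a scalar gamma: o = gamma * sum_i s_i * b_i.
    (In practice s is often softmax-normalised, ELMo-style; that detail is an
    assumption, not stated in the patent.)"""
    def __init__(self, bert_name="bert-base-chinese", num_layers=12):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name, output_hidden_states=True)
        self.s = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, **encoded):
        hidden = self.bert(**encoded).hidden_states[1:]      # the 12 encoder layers
        stacked = torch.stack(hidden, dim=0)                 # (12, batch, seq, dim)
        weights = self.s[:, None, None, None]
        return self.gamma * (weights * stacked).sum(dim=0)   # weighted representation
```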
The invention provides a resume information extraction method based on stacking sequence labeling, and there are many ways to implement this technical solution; the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (1)

1. A resume information extraction method based on stacking sequence labeling, characterized by comprising the following steps:
step 1, parsing the resume file in pdf format with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, labeling the training data: remotely supervised data are back-labeled onto the text, and similar items are merged during labeling;
step 3, dividing the resume into information blocks: the resume is divided into 4 blocks, and a classifier is trained to assign text to blocks;
step 4, using a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level;
the step 1 comprises the following steps:
step 1-1, parsing the pdf resume file into a sequence of sentences with pdfminer, and using the horizontal and vertical coordinates of text obtained from pdfminer's LTTextBox components to correct the wrong line breaks introduced during parsing;
step 1-2, classifying resumes into three templates: ordinary sequential resumes, resumes split into left and right columns with the right column as the main body, and resumes split into left and right columns with the left column as the main body;
step 1-1 includes:
first obtaining all text box components of each page of the pdf resume file, recording the coordinates of the lower-left and upper-right corners of each text box as (x0, y0) and (x1, y1) respectively, then sorting in descending order of y1 and ascending order of x0 and computing the heights of the text boxes; for two text boxes with the same abscissa, if the line spacing between them is smaller than twice the text box height, the two boxes are merged;
step 1-2 comprises: traversing the text boxes in the resume in the order of step 1-1, recording the lower-left corner coordinates (x_maxlong_0, y_maxlong_0) and the upper-right corner coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traversing each text box, recording the lower-left and upper-right corner coordinates of the current text box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right corner coordinate x_cur_1 of the current text box is smaller than the lower-left corner coordinate x_maxlong_0 of the longest text box, the current resume is split into left and right columns with the right column as the main body;
the step 2 comprises the following steps:
step 2-1, remote supervision data labeling: traversing each entity description in each training sample; if an entity description appears in the resume text multiple times, all occurrence positions are judged to be correct occurrences of that entity;
step 2-2, merging similar items during data back-labeling: project time, education time and work time in the resume text are uniformly merged and labeled as time, and project content and work content are uniformly back-labeled as content;
the step 3 comprises the following steps:
traversing the resume text extracted in step 1 line by line, classifying each line into one of 4 classes, and judging whether the line belongs to basic attributes, education experience, work experience or project experience, specifically comprising the following steps:
step 3-1, extracting the keywords that identify each category by chi-square testing: the chi-square statistic χ2(t,c) of term t for category c is calculated by the following formula:
χ2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
The meaning of each parameter in the formula is as follows:
N: total number of documents in the training data set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis is that term t is independent of category c; for each term, the chi-square value with respect to category c is calculated, the results are sorted from largest to smallest, and the first k terms are taken in descending order of chi-square value;
step 3-2, adopting xgboost, a gradient boosted tree model, as the classifier, dividing the training data set into a training set and a validation set at a ratio of 9:1, and calculating the corresponding precision, recall and F1 values:
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall; the value whose corresponding F1 value is largest is selected as the classifier threshold;
the resume text is segmented into words and represented, features are selected by the chi-square testing of step 3-1 and fed into the classifier for 4-class judgment (an xgboost classifier is selected); the classifier judges, for each line, whether the line belongs to basic information, education information, work information or project information;
step 4 comprises:
step 4-1, extracting the work content and project experience in the resume, performing sentence-level extraction in units of sentences: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is used as the sentence representation, and the sequence of sentence vectors is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, which indicates whether the sentence belongs to work content or project experience;
step 4-2, extracting phrase fragments in the resume: after the sentence-level work content and project experience have been extracted, they are replaced by the special tokens [NUM] and [WN], the work content being replaced by [NUM] and the project experience by [WN], and the other fields are then extracted with a character-level CRF; BERT internally uses a 12-layer encoding network, and learnable parameters are set to weight the outputs of the 12 layers to obtain the final output representation:
o = γ * Σ_{i=1}^{m} s_i * b_i
where m = 12 is the number of hidden layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12-layer network output.
CN202010756164.2A 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling Active CN111966785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Publications (2)

Publication Number Publication Date
CN111966785A (en) 2020-11-20
CN111966785B true CN111966785B (en) 2023-06-20

Family

ID=73363289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010756164.2A Active CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Country Status (1)

Country Link
CN (1) CN111966785B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A resume accurate analysis method based on SVM text classification
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Resume identification method and apparatus, computer device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409911B2 (en) * 2016-04-29 2019-09-10 Cavium, Llc Systems and methods for text analytics processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A resume accurate analysis method based on SVM text classification
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Resume identification method and apparatus, computer device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于领域知识库的简历信息抽取系统的设计与实现 (Design and Implementation of a Resume Information Extraction System Based on a Domain Knowledge Base); 张博; China Master's Theses Full-text Database (Information Science and Technology Series); 2018-10-15; pp. 10-59 *

Also Published As

Publication number Publication date
CN111966785A (en) 2020-11-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant