CN111966785A - Resume information extraction method based on stacking sequence labeling - Google Patents

Resume information extraction method based on stacking sequence labeling

Info

Publication number
CN111966785A
Authority
CN
China
Prior art keywords
resume
text
sentence
information
project
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010756164.2A
Other languages
Chinese (zh)
Other versions
CN111966785B (en)
Inventor
徐建
郭培胜
徐琳
李晓冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202010756164.2A priority Critical patent/CN111966785B/en
Publication of CN111966785A publication Critical patent/CN111966785A/en
Application granted granted Critical
Publication of CN111966785B publication Critical patent/CN111966785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06F16/35 - Clustering; Classification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a resume information extraction method based on stacked sequence labeling, which comprises the following steps. Step 1: parse the PDF resume with pdfminer, converting the original PDF into a multi-line text representation; this step mainly addresses out-of-order reading and erroneous line breaks. Step 2: training-data annotation: back-label the data by distant supervision and merge same-type items during labeling. Step 3: resume block segmentation: for the sentences obtained from pdfminer, classify each sentence to decide which block it belongs to. Step 4: extract information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model; the resume block information is then used for filtering, which effectively improves recall without greatly reducing precision. Through these 4 stages, the method effectively extracts resume information.

Description

Resume information extraction method based on stacking sequence labeling
Technical Field
The invention relates to a resume information extraction method based on stacking sequence labeling.
Background
Resume key information extraction covers four major categories: attribute information, education experience, work experience, and project experience. Attribute information comprises: name, date of birth, gender, telephone, highest degree, native place, city/county, and political status. Education experience comprises: graduation institution, degree, and graduation date. Work experience comprises: employer, work content, position, and working period. Project experience comprises: project name, project responsibility, and project period. Of these 18 kinds of information, work content and project responsibility are extracted at the level of key sentences, while the other attributes are extracted as relatively short text fragments.
Existing information extraction techniques target only short text fragments; they cannot handle the extraction of long, sentence-level fragments and do not take the block structure of resume text into account.
Disclosure of Invention
Object of the invention: to overcome the shortcomings of the prior art, the invention provides a resume information extraction method based on stacked sequence labeling, which comprises the following steps:
Step 1: parse PDF resume files with pdfminer, converting the rich-text resume into a plain-text representation;
Step 2: training-data annotation: back-label the data by distant supervision and merge same-type items during labeling;
Step 3: resume block segmentation: divide the resume into 4 blocks and train a classifier to assign each piece of text to a block;
Step 4: extract information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model.
Step 1 comprises the following steps:
A PDF is a rich-text document that must first be parsed into plain text. The parsing may run into multi-column layouts, page breaks, and wrong line breaks.
Step 1-1: parse the PDF resume file into a sentence sequence with pdfminer, and use the LTTextBox component of the pdfminer parser to obtain the horizontal and vertical coordinates of the text in order to correct erroneous line breaks introduced during parsing;
Step 1-2: because of multi-column layouts and out-of-order reading during parsing, resumes fall into three templates: the ordinary sequential resume, the two-column resume whose main content is on the right, and the two-column resume whose main content is on the left.
Step 1-1 comprises the following steps:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, may produce erroneous line breaks.
First obtain all text-box (LTTextBox) components of each page of the PDF resume file, and record the lower-left and upper-right coordinates of each text box as (x0, y0) and (x1, y1) (in pdfminer the coordinate origin is at the lower-left corner of the PDF page). Then sort the text boxes by y1 descending and x0 ascending and compute each text box's height; for two text boxes with the same abscissa, merge them if the line spacing between them is less than twice the text-box height. This resolves the erroneous line breaks, as sketched below.
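A minimal sketch of this merging step is given below, assuming the pdfminer.six package; the function name merge_wrong_breaks and the tuple layout are illustrative, not mandated by the method.

```python
# Sketch of step 1-1: merge LTTextBox components that were split by wrong line breaks.
# Assumes pdfminer.six; merge_wrong_breaks and the tuple layout are illustrative.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

def merge_wrong_breaks(pdf_path):
    lines = []
    for page in extract_pages(pdf_path):
        boxes = [b for b in page if isinstance(b, LTTextBox)]
        # Sort by y1 descending (top of the page first), then x0 ascending.
        boxes.sort(key=lambda b: (-b.y1, b.x0))
        merged = []  # entries: (x0, y0, x1, y1, text)
        for box in boxes:
            height = box.y1 - box.y0
            if merged:
                px0, py0, px1, py1, ptext = merged[-1]
                # Same abscissa and a vertical gap below twice the box height: wrong break.
                if abs(px0 - box.x0) < 1e-3 and (py0 - box.y1) < 2 * height:
                    merged[-1] = (px0, box.y0, max(px1, box.x1), py1,
                                  ptext + box.get_text().strip())
                    continue
            merged.append((box.x0, box.y0, box.x1, box.y1, box.get_text().strip()))
        lines.extend(merged)
    return lines  # one tuple per corrected text line
```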
Step 1-2 comprises the following steps: traverse the text boxes (LTTextBox) in the order produced in step 1-1, recording the lower-left coordinates (x_maxlong_0, y_maxlong_0) and upper-right coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traverse each text box, recording the lower-left and upper-right coordinates of the current box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right abscissa x_cur_1 of the current text box is smaller than the lower-left abscissa x_maxlong_0 of the longest text box, the current resume is a two-column resume whose main content is on the right. A sketch of this check follows.
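The column-layout check can be sketched as follows, operating on the (x0, y0, x1, y1, text) tuples produced above; the detect_layout name and the returned labels are illustrative assumptions.

```python
# Sketch of step 1-2: detect a two-column, right-dominant resume layout.
def detect_layout(boxes):
    # "Longest" text box interpreted here as the widest one.
    x_maxlong_0, _, x_maxlong_1, _, _ = max(boxes, key=lambda b: b[2] - b[0])
    for x_cur_0, _, x_cur_1, _, _ in boxes:
        # Some box lies entirely to the left of the longest box:
        # two columns with the main content on the right.
        if x_cur_1 < x_maxlong_0:
            return "two-column, right dominant"
    return "sequential"  # the left-dominant case can be detected symmetrically
```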
Step 2 comprises the following steps:
Step 2-1: distant-supervision back-labeling: since the annotation data does not give the specific position at which each entity appears in the original resume text, the annotations must be mapped back to concrete positions according to each entity description. Traverse every entity description in each training example (a training example is a given text together with labeled fields such as working period and project experience); if an entity description appears multiple times in the resume text, all of its occurrence positions are taken as correct occurrences of the entity. Although this sacrifices a small amount of precision, it greatly improves recall. A back-labeling sketch follows;
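A minimal back-labeling sketch under these assumptions is shown below; the back_label name and the (label, description) input format are illustrative.

```python
# Sketch of step 2-1: distant-supervision back-labeling.
# Every occurrence of a labeled entity description in the resume text is
# marked as a gold span, trading a little precision for recall.
import re

def back_label(resume_text, entities):
    """entities: list of (label, description) pairs, e.g. ("time", "2016.07-2019.06")."""
    spans = []
    for label, description in entities:
        for match in re.finditer(re.escape(description), resume_text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```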
Step 2-2: merging same-type items during back-labeling: project period, education period, and working period in the resume text are merged and uniformly labeled as time, and project content and work content are uniformly labeled as content. A label-merging sketch follows.
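The merging can be sketched as a simple mapping applied to the back-labeled spans; the fine-grained label names below are illustrative assumptions.

```python
# Sketch of step 2-2: merge same-type items into the unified labels "time" and "content".
LABEL_MERGE = {
    "project_time": "time", "education_time": "time", "work_time": "time",
    "project_content": "content", "work_content": "content",
}

def merge_labels(spans):
    return [(start, end, LABEL_MERGE.get(label, label)) for start, end, label in spans]
```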
Step 3 comprises the following steps:
The training data is divided by rules into 4 blocks: basic information, education information, work information, and project information. Because the training data and the test data differ considerably, a 4-way classifier is built on the training data and then used to partition the test data into blocks. This is a text classification task consisting mainly of feature selection and classification.
In step 3, the resume text extracted in step 1 is traversed line by line; each line undergoes 4-way classification to decide whether it belongs to basic attributes, education experience, work experience, or project experience.
The method specifically comprises the following steps:
Step 3-1: because the vocabulary is large, keywords for each category are selected by the chi-square test. The chi-square statistic χ²(t, c) of term t with respect to category c is computed by the following formula:
χ²(t, c) = N*(A*D - C*B)² / [(A + C)*(B + D)*(A + B)*(C + D)]
where the parameters in the formula are defined as follows:
N: total number of documents in the training set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t.
The null hypothesis is that term t is independent of category c. The chi-square value of every term with respect to category c is computed, the results are sorted from largest to smallest, and the top k terms by chi-square value are kept as keywords (a selection sketch follows);
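A minimal sketch of the chi-square keyword selection follows; the chi2_keywords name and the (tokens, category) input format are illustrative.

```python
# Sketch of step 3-1: select the top-k keywords for one category by chi-square value.
from collections import Counter

def chi2_keywords(docs, category, k):
    """docs: list of (tokens, category) pairs, one per training document."""
    n = len(docs)
    vocab = {t for tokens, _ in docs for t in tokens}
    scores = {}
    for t in vocab:
        a = sum(1 for tokens, c in docs if t in tokens and c == category)          # A
        b = sum(1 for tokens, c in docs if t in tokens and c != category)          # B
        c_neg = sum(1 for tokens, c in docs if t not in tokens and c == category)  # C
        d = n - a - b - c_neg                                                       # D
        denom = (a + c_neg) * (b + d) * (a + b) * (c_neg + d)
        scores[t] = n * (a * d - c_neg * b) ** 2 / denom if denom else 0.0
    return [t for t, _ in Counter(scores).most_common(k)]
```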
Step 3-2: text classification model and threshold selection: xgboost, based on gradient-boosted trees, is adopted as the classifier; the data set is split into a training set and a validation set at a 9:1 ratio, and the corresponding precision, recall, and F1 value are computed:
F1 = 2*pre*recall/(pre + recall),
where pre is the precision and recall is the recall; the threshold whose corresponding F1 value is largest is selected as the classifier threshold.
The resume is tokenized and represented with the features selected in the previous step, then input to the classifier (xgboost) for 4-way classification; for each line the classifier decides whether it belongs to basic information, education information, work information, or project information. A training and threshold-selection sketch follows.
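A sketch of the classifier training and threshold selection follows, assuming xgboost and scikit-learn and a feature matrix X built from the chi-square keywords (for example, keyword indicator features); all names are illustrative, and the interpretation that lines whose top class probability falls below the threshold are treated as undecided is an assumption.

```python
# Sketch of step 3-2: train an xgboost block classifier and pick the probability
# threshold that maximizes macro F1 on a 9:1 validation split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from xgboost import XGBClassifier

def train_block_classifier(X, y):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=0)
    clf = XGBClassifier()            # 4 classes: basic, education, work, project
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)
    best_f1, best_thr = 0.0, 0.0
    for thr in np.arange(0.1, 0.95, 0.05):
        # Predictions below the confidence threshold are treated as abstentions (-1).
        pred = np.where(proba.max(axis=1) >= thr, proba.argmax(axis=1), -1)
        pre = precision_score(y_va, pred, labels=[0, 1, 2, 3], average="macro", zero_division=0)
        rec = recall_score(y_va, pred, labels=[0, 1, 2, 3], average="macro", zero_division=0)
        f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return clf, best_thr
```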
Step 4 comprises the following steps:
Step 4-1: this stage mainly extracts the work content and project experience in the resume; because these two fields are generally long, extraction is performed in units of sentences at the sentence level: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is taken as the sentence representation, and this representation is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence indicating whether the sentence belongs to work content or project experience (a model sketch follows);
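A model sketch for this sentence-level labeler is given below, assuming the transformers and pytorch-crf packages; the class name, tag count, and the choice of bert-base-chinese are illustrative assumptions.

```python
# Sketch of step 4-1: BERT [CLS] sentence vectors -> BiLSTM -> CRF over the
# sequence of sentences in one resume, tagging each sentence as work content,
# project experience, or other.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchcrf import CRF  # pytorch-crf package (assumed)

class SentenceLabeler(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_tags=3, hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, sentences, tags=None):
        # Encode each sentence independently and keep its [CLS] vector.
        enc = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0, :]    # (num_sentences, hidden_size)
        feats, _ = self.lstm(cls.unsqueeze(0))                # the resume is one sequence of sentences
        emissions = self.proj(feats)                          # (1, num_sentences, num_tags)
        if tags is not None:                                  # training: negative log-likelihood
            return -self.crf(emissions, tags.unsqueeze(0))
        return self.crf.decode(emissions)[0]                  # inference: best tag per sentence
```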
Step 4-2: this stage mainly extracts short phrase fragments in the resume, with extraction performed character by character: after the sentence-level work content and project experience have been extracted, that content is replaced by the special tokens [NUM] and [WN], and on this basis a character-level CRF extracts the remaining fields; BERT uses a 12-layer encoding network internally, and learnable parameters are introduced to weight the 12 layer outputs and obtain the final output representation:
o = γ * Σ_{i=1}^{m} softmax(s)_i * b_i
where m = 12 is the number of hidden-vector layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12 layer outputs.
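A minimal sketch of this layer weighting (a learnable softmax-weighted sum over the 12 hidden layers, consistent with the formula above) is shown below; the LayerMix name is illustrative.

```python
# Sketch of the 12-layer weighting in step 4-2: o = gamma * sum_i softmax(s)_i * b_i.
import torch
import torch.nn as nn

class LayerMix(nn.Module):
    def __init__(self, num_layers=12):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # one learnable weight per layer
        self.gamma = nn.Parameter(torch.ones(1))        # learnable global scale

    def forward(self, layer_outputs):
        # layer_outputs: list of 12 tensors of shape (batch, seq_len, hidden)
        weights = torch.softmax(self.s, dim=0)
        stacked = torch.stack(layer_outputs, dim=0)     # (12, batch, seq_len, hidden)
        return self.gamma * (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```

With BertModel called with output_hidden_states=True, the 12 transformer-layer outputs (excluding the embedding layer) can be fed into such a module before the character-level CRF.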
Advantageous effects:
The method effectively handles the multi-column, pagination, and line-break problems encountered when reading PDF rich text, and corrects the out-of-order reading sequence; by modeling at the sentence level and at the character level respectively, it completes the extraction of both long and short text fragments; and by partitioning the resume into blocks it extracts specific information from specific regions, which effectively resolves the confusion caused by similar fields (for example, working period and project period look alike, so both are labeled with the unified label time and the resume block information is then used to tell them apart).
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of sentence-level attribute extraction.
Fig. 3 is a schematic diagram of the weighted representation of the 12-layer network outputs.
Fig. 4 is a schematic diagram of a line of text.
Fig. 5 is a schematic diagram of an original resume.
FIG. 6 is a diagram of the structured format obtained after extraction from the original resume.
Detailed Description
As shown in fig. 1, the present invention provides a resume information extraction method based on stacked sequence labeling, which includes the following steps:
Step 1: parse PDF resume files with pdfminer, converting the rich-text resume into a plain-text representation;
Step 2: training-data annotation: back-label the data by distant supervision and merge same-type items during labeling;
Step 3: resume block segmentation: divide the resume into 4 blocks and train a classifier to assign each piece of text to a block;
Step 4: extract information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model.
Step 1 comprises the following steps:
A PDF is a rich-text document that must first be parsed into plain text. The parsing may run into multi-column layouts, page breaks, and wrong line breaks.
Step 1-1: parse the PDF resume file into a sentence sequence with pdfminer, and use the LTTextBox component of the pdfminer parser to obtain the horizontal and vertical coordinates of the text in order to correct erroneous line breaks introduced during parsing. The specific steps are as follows:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, may produce erroneous line breaks; for example, as shown in Fig. 4, the month part of a date may be pushed onto the next line, causing a parsing error.
First obtain all LTTextBox components of each page of the PDF, and record the lower-left and upper-right coordinates of each text box as (x0, y0) and (x1, y1) (in pdfminer the coordinate origin is at the lower-left corner of the page). Then sort the text boxes by y1 descending and x0 ascending and compute each text box's height; for two text boxes with the same abscissa, merge them if the line spacing between them is less than twice the text-box height. This resolves the erroneous line breaks.
Step 1-2: because of multi-column layouts and out-of-order reading during parsing, resumes fall into three templates: the ordinary sequential resume, the two-column resume whose main content is on the right, and the two-column resume whose main content is on the left. The specific steps are as follows:
For the text boxes (LTTextBox) in the order produced in the previous step, record the lower-left coordinates (x_maxlong_0, y_maxlong_0) and upper-right coordinates (x_maxlong_1, y_maxlong_1) of the longest text box. Then traverse each text box, recording the current box's coordinates (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right abscissa x_cur_1 of the current text box is smaller than the lower-left abscissa x_maxlong_0 of the longest text box, the current resume is a two-column resume whose main content is on the right. Resumes in other formats can be determined similarly.
Step 2 comprises the following steps:
Step 2-1: distant-supervision back-labeling: since the annotation data does not give the specific position at which each entity appears in the original text, each entity description must be mapped back to its concrete positions in the original text. If an entity description appears multiple times in the original text, all of its occurrence positions are taken as correct occurrences of the entity; although this sacrifices a small amount of precision, it greatly improves recall;
Step 2-2: merging same-type items during back-labeling: project period, education period, and working period in the resume text are merged and uniformly labeled as time, and project content and work content are uniformly labeled as content.
Step 3 comprises the following steps: the training data is divided by rules into 4 blocks: basic information, education information, work information, and project information. Because the training data and the test data differ considerably, a 4-way classifier is built on the training data and then used to partition the test data into blocks. This is a text classification task consisting mainly of feature selection and classification.
Step 3-1: because the vocabulary is large, keywords for each category are selected by the chi-square test. The chi-square statistic of term t with respect to category c is computed by the following formula:
χ²(t, c) = N*(A*D - C*B)² / [(A + C)*(B + D)*(A + B)*(C + D)]
where the parameters in the formula are defined as follows:
N: total number of documents in the training set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t.
The null hypothesis is that term t is independent of category c. The chi-square value of every term with respect to category c is computed, the results are sorted from largest to smallest, and the top k terms by chi-square value are kept as keywords;
Step 3-2: each extracted line of the resume is classified to decide whether it belongs to basic attributes, education experience, work experience, or project experience. Text classification model and threshold selection: xgboost, based on gradient-boosted trees, is used as the classifier; the data set is split into a training set and a validation set at a 9:1 ratio; different thresholds are set for the classifier, the precision, recall, and F1 value on the validation set are computed for each threshold, and the threshold with the largest F1 is selected as the decision criterion;
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall;
The original text, represented with the features selected in the previous step, is input to the classifier (xgboost) for 4-way classification; for each line the classifier decides which block the line belongs to: basic information, education information, work information, or project information.
As shown in fig. 2 and 3, step 4 includes:
Step 4-1: this stage mainly extracts the work content and project experience in the resume; because these two fields are generally long, extraction is performed in units of sentences at the sentence level: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is taken as the sentence representation, and this representation is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, i.e. whether the sentence belongs to work content or project experience (as shown in FIG. 2);
Step 4-2: this stage mainly extracts short phrase fragments in the resume, with extraction performed character by character: after the sentence-level work content and project experience have been extracted, that content is replaced by the special tokens [NUM] and [WN], and on this basis a character-level CRF extracts the remaining fields; BERT uses a 12-layer encoding network internally, and learnable parameters are introduced to weight the 12 layer outputs and obtain the final output representation:
o = γ * Σ_{i=1}^{m} softmax(s)_i * b_i
where m = 12 is the number of hidden-vector layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12 layer outputs. Fig. 5 is a schematic diagram of an original resume, and Fig. 6 is a schematic diagram of the result of extracting the resume information with the method of the invention.
The invention provides a resume information extraction method based on stacked sequence labeling, and there are many specific methods and ways to implement this technical solution; the above is only a preferred embodiment of the invention. It should be noted that a person skilled in the art can make several improvements and refinements without departing from the principle of the invention, and these improvements and refinements should also be regarded as falling within the protection scope of the invention. All components not specified in this embodiment can be implemented with the prior art.

Claims (7)

1. A resume information extraction method based on stacked sequence labeling, characterized by comprising the following steps:
step 1, parsing PDF resume files with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, training-data annotation: back-labeling the data by distant supervision and merging same-type items during labeling;
step 3, resume block segmentation: dividing the resume into 4 blocks and training a classifier to assign the text to blocks;
and step 4, extracting information at the sentence level and at the short-text-fragment level with a two-layer sequence labeling model.
2. The method of claim 1, wherein step 1 comprises:
step 1-1, parsing the PDF resume file into a sentence sequence with pdfminer, and using the LTTextBox component of the pdfminer parser to obtain the horizontal and vertical coordinates of the text in order to correct erroneous line breaks introduced during parsing;
step 1-2, dividing resumes into three templates: the ordinary sequential resume, the two-column resume whose main content is on the right, and the two-column resume whose main content is on the left.
3. The method of claim 2, wherein step 1-1 comprises:
first obtaining all text-box components of each page of the PDF resume file, then obtaining the lower-left and upper-right coordinates of each text box as (x0, y0) and (x1, y1), respectively, then sorting by y1 descending and x0 ascending and computing the text-box heights, and merging two text boxes with the same abscissa if the line spacing between them is less than twice the text-box height.
4. The method of claim 3, wherein step 1-2 comprises: traversing the text boxes in the order produced in step 1-1, recording the lower-left coordinates (x_maxlong_0, y_maxlong_0) and upper-right coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traversing each text box, recording the lower-left and upper-right coordinates of the current text box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); and if the upper-right abscissa x_cur_1 of the current text box is smaller than the lower-left abscissa x_maxlong_0 of the longest text box, determining that the current resume is a two-column resume whose main content is on the right.
5. The method of claim 4, wherein step 2 comprises:
step 2-1, distant-supervision back-labeling: traversing every entity description in each training example, and if an entity description appears multiple times in the resume text, taking all of its occurrence positions as correct occurrences of the entity;
step 2-2, merging same-type items during back-labeling: merging and uniformly labeling project period, education period, and working period in the resume text as time, and uniformly labeling project content and work content as content.
6. The method of claim 5, wherein step 3 comprises:
traversing the resume text extracted in step 1 line by line and performing 4-way classification of each line to decide whether the line belongs to basic attributes, education experience, work experience, or project experience, specifically comprising:
step 3-1, selecting keywords for each category by the chi-square test: computing the chi-square statistic χ²(t, c) of term t with respect to category c by the following formula:
χ²(t, c) = N*(A*D - C*B)² / [(A + C)*(B + D)*(A + B)*(C + D)]
wherein the parameters in the formula are defined as follows:
N: total number of documents in the training set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis being that term t is independent of category c; computing the chi-square value of every term with respect to category c, sorting the results from largest to smallest, and keeping the top k terms by chi-square value;
step 3-2, adopting xgboost, based on gradient-boosted trees, as the classifier, splitting the training data set into a training set and a validation set at a 9:1 ratio, and computing the corresponding precision, recall, and F1 value:
F1 = 2*pre*recall/(pre + recall),
where pre is the precision and recall is the recall; selecting the threshold whose corresponding F1 value is largest as the classifier threshold;
and tokenizing the resume, performing feature selection by the chi-square test of step 3-1, and inputting the representation into the classifier (xgboost) for 4-way classification, wherein for each line the classifier decides whether the line belongs to basic information, education information, work information, or project information.
7. The method of claim 6, wherein step 4 comprises:
step 4-1, extracting the work content and project experience in the resume at the sentence level, in units of sentences: first encoding each sentence with BERT, taking the [CLS] encoding vector at the beginning of the sentence as the sentence representation, and then passing this representation through a bidirectional LSTM and a CRF network to obtain a label for each sentence, the label indicating whether the sentence belongs to work content or project experience;
step 4-2, extracting phrase fragments in the resume: after the sentence-level work content and project experience have been extracted, replacing that content with special tokens, the work content with [NUM] and the project experience with [WN], and on this basis extracting the remaining fields with a character-level CRF; wherein BERT uses a 12-layer encoding network internally, and learnable parameters are introduced to weight the 12 layer outputs to obtain the final output representation:
o = γ * Σ_{i=1}^{m} softmax(s)_i * b_i
wherein m = 12 is the number of hidden-vector layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; and o is the weighted representation of the 12 layer outputs.
CN202010756164.2A 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling Active CN111966785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Publications (2)

Publication Number Publication Date
CN111966785A true CN111966785A (en) 2020-11-20
CN111966785B CN111966785B (en) 2023-06-20

Family

ID=73363289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010756164.2A Active CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Country Status (1)

Country Link
CN (1) CN111966785B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861630A (en) * 2022-05-10 2022-08-05 马上消费金融股份有限公司 Information acquisition and related model training method and device, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170315984A1 (en) * 2016-04-29 2017-11-02 Cavium, Inc. Systems and methods for text analytics processor
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Identify method and device, the computer equipment, storage medium of resume

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170315984A1 (en) * 2016-04-29 2017-11-02 Cavium, Inc. Systems and methods for text analytics processor
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A kind of resume accurate Analysis method based on SVM text classifications
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Identify method and device, the computer equipment, storage medium of resume

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张博: "Design and Implementation of a Resume Information Extraction System Based on a Domain Knowledge Base", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861630A (en) * 2022-05-10 2022-08-05 马上消费金融股份有限公司 Information acquisition and related model training method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN111966785B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110968667B (en) Periodical and literature table extraction method based on text state characteristics
US11782928B2 (en) Computerized information extraction from tables
CN106384282A (en) Method and device for building decision-making model
CN112287916B (en) Video image text courseware text extraction method, device, equipment and medium
JP2005526314A (en) Document structure identifier
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
JP2004139484A (en) Form processing device, program for implementing it, and program for creating form format
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN105677638B (en) Web information abstracting method
WO2014050774A1 (en) Document classification assisting apparatus, method and program
US7046847B2 (en) Document processing method, system and medium
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN116205211A (en) Document level resume analysis method based on large-scale pre-training generation model
CN112241730A (en) Form extraction method and system based on machine learning
JP4787955B2 (en) Method, system, and program for extracting keywords from target document
WO2020252931A1 (en) Pdf file data extraction method and apparatus, device, and storage medium
CN111966785A (en) Resume information extraction method based on stacking sequence labeling
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
Long An agent-based approach to table recognition and interpretation
JPH0821057B2 (en) Document image analysis method
CN116629258A (en) Structured analysis method and system for judicial document based on complex information item data
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN110765107A (en) Question type identification method and system based on digital coding
CN113642291B (en) Method, system, storage medium and terminal for constructing logical structure tree reported by listed companies
CN115659989A (en) Web table abnormal data discovery method based on text semantic mapping relation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant