CN111966785B - Resume information extraction method based on stacking sequence labeling - Google Patents

Resume information extraction method based on stacking sequence labeling

Info

Publication number
CN111966785B
CN111966785B
Authority
CN
China
Prior art keywords
resume
text
sentence
information
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010756164.2A
Other languages
Chinese (zh)
Other versions
CN111966785A (en)
Inventor
徐建
郭培胜
徐琳
李晓冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202010756164.2A priority Critical patent/CN111966785B/en
Publication of CN111966785A publication Critical patent/CN111966785A/en
Application granted granted Critical
Publication of CN111966785B publication Critical patent/CN111966785B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention provides a resume information extraction method based on stacking sequence labeling, which comprises the following steps. Step 1: parse the pdf resume with pdfminer and convert the original pdf into a multi-line text representation; this stage mainly resolves the out-of-order reading and wrong line-break problems that arise during parsing. Step 2: label the training data, using remotely supervised data for back-labeling and merging similar items during labeling. Step 3: divide the resume into information blocks by classifying each sentence obtained from pdfminer and judging which block it belongs to. Step 4: use a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level, and filter the results with the resume block information, which effectively improves recall without a large drop in precision. Through these four stages, the invention can effectively extract resume information.

Description

Resume information extraction method based on stacking sequence labeling
Technical Field
The invention relates to a resume information extraction method based on stacking sequence labeling.
Background
The key information to be extracted from a resume falls into four major categories: attribute information, education experience, work experience, and project experience. The attribute information includes: name, date of birth, sex, telephone, highest education, native place, city, county, and political status. The education experience includes: graduation university, degree, and graduation time. The work experience includes: work unit, work content, job title, and work time. The project experience includes: project name, project responsibility, and project time. Among these types of information, work content and project responsibility are extracted at the sentence level, while the other attributes are extracted as relatively short text segments.
Existing information extraction techniques target only short text fragments; they cannot handle the extraction of long, sentence-level fragments and do not take into account the block structure of the resume text itself.
Disclosure of Invention
Purpose of the invention: to address the shortcomings of the prior art, the invention provides a resume information extraction method based on stacking sequence labeling, which comprises the following steps:
step 1, parsing the resume file in pdf format with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, labeling the training data: remotely supervised data are back-labeled onto the text, and similar items are merged during labeling;
step 3, dividing the resume into information blocks: the resume is divided into 4 blocks, and a classifier is trained to assign text to blocks;
step 4, using a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level.
The step 1 comprises the following steps:
A pdf is a rich-text document that must first be parsed into plain text; the parsing process has to deal with multi-column layouts, section splitting and wrongly broken lines.
Step 1-1, parse the pdf resume file into a sequence of sentences with pdfminer, and use the horizontal and vertical coordinates of text obtained from pdfminer's LTTextBox components to correct the wrong line breaks introduced during parsing;
step 1-2, to handle the column-split and out-of-order reading problems that arise during parsing, classify resumes into three templates: ordinary sequential resumes, resumes split into left and right columns with the right column as the main body, and resumes split into left and right columns with the left column as the main body.
Step 1-1 includes:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, sometimes breaks one logical line into several boxes.
First, all text box (LTTextBox) components of each page of the pdf resume file are obtained, and the coordinates of the lower-left and upper-right corners of each text box are recorded as (x0, y0) and (x1, y1) respectively (pdfminer places the coordinate origin at the lower-left corner of the pdf page). The text boxes are then sorted in descending order of y1 and ascending order of x0 and their heights are computed; for two text boxes with the same abscissa, if the line spacing between them is smaller than twice the text box height, the two boxes are merged. This resolves the wrong line-break problem.
Step 1-2 comprises: traverse the text boxes (LTTextBox) in the resume in the order established in step 1-1, recording the lower-left corner coordinates (x_maxlong_0, y_maxlong_0) and the upper-right corner coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traverse each text box again, recording the lower-left and upper-right corner coordinates of the current text box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right corner coordinate x_cur_1 of some text box is smaller than the lower-left corner coordinate x_maxlong_0 of the longest text box, the current resume is split into left and right columns with the right column as the main body.
The step 2 comprises the following steps:
Step 2-1, remote supervision data labeling: since the annotation data do not give the exact position at which each entity appears in the resume text, each entity must be labeled back to its position in the text according to the entity description. Each entity description in every training sample is traversed (a training sample is a given text annotated with fields such as work time and project experience); if an entity description appears in the resume text multiple times, all occurrence positions are judged to be correct occurrences of that entity. This greatly improves recall at a small cost in precision;
step 2-2, merging similar items during data back-labeling: project time, education time and work time in the resume text are uniformly labeled as time, and project content and work content are uniformly labeled as content.
The step 3 comprises the following steps:
Using rules, the training data are divided into 4 blocks: basic information, education information, work information and project information. Because the training data and the test data differ considerably, a 4-class classifier is built from the training data and used to divide the test data into blocks. This is a text classification process consisting mainly of feature selection and classification.
In step 3, the resume text extracted in step 1 is traversed line by line, each line is classified into one of 4 classes, and it is judged whether the line belongs to basic attributes, education experience, work experience or project experience.
The method specifically comprises the following steps:
Step 3-1, because the vocabulary is large, chi-square testing is used to extract the keywords that identify each category; the chi-square statistic χ2(t,c) of term t for category c is calculated by the following formula:
χ2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
The meaning of each parameter in the formula is as follows:
N: total number of documents in the training data set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis is that term t is independent of category c; for each term, the chi-square value with respect to category c is calculated, the results are sorted from largest to smallest, and the first k terms are taken in descending order of chi-square value;
step 3-2, text classification model and threshold selection stage: xgboost, a gradient boosted tree model, is adopted as the classifier; the training data set is divided into a training set and a validation set at a ratio of 9:1, and the corresponding precision, recall and F1 values are calculated:
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall; the value whose corresponding F1 value is largest is selected as the classifier threshold;
the resume text is segmented into words and represented, the features of the previous step are fed into the classifier for 4-class judgment (an xgboost classifier is selected), and the classifier judges, for each line, whether the line belongs to basic information, education information, work information or project information.
Step 4 comprises:
Step 4-1, this stage mainly extracts the work content and project experience in the resume; because these two kinds of content are generally long, information extraction is performed in units of sentences, i.e. sentence-level extraction: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is used as the sentence representation, and the sequence of sentence vectors is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, which indicates whether the sentence belongs to work content or project experience;
step 4-2, this stage mainly extracts the shorter phrase fragments in the resume, and extraction is performed in units of characters: after the sentence-level work content and project experience have been extracted, they are replaced by the special tokens [NUM] and [WN], and the other fields are then extracted with a character-level CRF; BERT internally uses a 12-layer encoding network, and learnable parameters are set to weight the outputs of the 12 layers to obtain the final output representation:
o = γ * Σ_{i=1}^{m} s_i * b_i
where m = 12 is the number of hidden layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12-layer network output.
Beneficial effects:
The invention effectively solves the column-split, page-break and line-fold problems in reading pdf rich text, as well as the problem of out-of-order reading; by modeling in units of both sentences and characters it completes the extraction of long and short text fragments; and by segmenting the resume into blocks it extracts specific information from specific regions, which resolves the confusion caused by similar fields (for example, work time and project time behave similarly and are uniformly labeled as time, and are then distinguished using the resume block information).
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a sentence-level attribute extraction schematic.
Fig. 3 is a schematic diagram of the weighting of the 12-layer network outputs.
Fig. 4 is a line text schematic.
Fig. 5 is a schematic diagram of an original resume.
Fig. 6 is a schematic diagram of the format after extracting the original resume.
Detailed Description
As shown in FIG. 1, the invention provides a resume information extraction method based on stacking sequence labeling, which comprises the following steps:
step 1, parsing the resume file in pdf format with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, labeling the training data: remotely supervised data are back-labeled onto the text, and similar items are merged during labeling;
step 3, dividing the resume into information blocks: the resume is divided into 4 blocks, and a classifier is trained to assign text to blocks;
step 4, using a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level.
The step 1 comprises the following steps:
A pdf is a rich-text document that must first be parsed into plain text; the parsing process has to deal with multi-column layouts, section splitting and wrongly broken lines.
Step 1-1, parse the pdf resume file into a sequence of sentences with pdfminer, and use the horizontal and vertical coordinates of text obtained from pdfminer's LTTextBox components to correct the wrong line breaks introduced during parsing. The method is as follows:
An LTTextBox represents a block of text stored in a rectangular area, which by default corresponds to one line of text. This default parsing, however, sometimes breaks a line incorrectly, for example placing the month "11" of a date on the next line as shown in Fig. 4, which results in a parsing error.
First, all LTTextBox components of each page of the pdf are obtained, and the coordinates of the lower-left and upper-right corners of each text box are recorded as (x0, y0) and (x1, y1) respectively (pdfminer places the coordinate origin at the lower-left corner of the page). The text boxes are then sorted in descending order of y1 and ascending order of x0 and their heights are computed; for two text boxes with the same abscissa, if the line spacing between them is smaller than twice the text box height, the two boxes are merged. This resolves the wrong line-break problem.
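By way of illustration only (not part of the claimed method), the merging heuristic above can be sketched in Python roughly as follows, assuming the pdfminer.six package; the function name and the tolerance on the abscissa are illustrative choices:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

def page_lines(pdf_path):
    """Yield, for each page, the text lines after re-joining wrongly broken boxes."""
    for page in extract_pages(pdf_path):
        boxes = [(b.x0, b.y0, b.x1, b.y1, b.height, b.get_text().strip())
                 for b in page if isinstance(b, LTTextBox)]
        # Reading order: descending y1 (top to bottom), ascending x0 (left to right).
        boxes.sort(key=lambda b: (-b[3], b[0]))
        lines = []
        for x0, y0, x1, y1, height, text in boxes:
            if lines:
                prev_x0, prev_y0, prev_text, prev_h = lines[-1]
                # Same abscissa and a vertical gap below twice the box height:
                # pdfminer split one logical line into two boxes, so merge them.
                if abs(prev_x0 - x0) < 1.0 and (prev_y0 - y1) < 2 * height:
                    lines[-1] = (prev_x0, y0, prev_text + text, prev_h)
                    continue
            lines.append((x0, y0, text, height))
        yield [t for _, _, t, _ in lines]
```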
Step 1-2, to handle the column-split and out-of-order reading problems that arise during parsing, classify resumes into three templates: ordinary sequential resumes, resumes split into left and right columns with the right column as the main body, and resumes split into left and right columns with the left column as the main body. The specific method is as follows:
The text boxes (LTTextBox) are traversed in the order of the previous step, and the lower-left corner coordinates (x_maxlong_0, y_maxlong_0) and upper-right corner coordinates (x_maxlong_1, y_maxlong_1) of the longest text box are recorded. Each text box is then traversed again, recording the current text box coordinates (x_cur_0, y_cur_0), (x_cur_1, y_cur_1); if the upper-right corner coordinate x_cur_1 of the current text box is smaller than the lower-left corner coordinate x_maxlong_0 of the longest text box, the current resume is split into left and right columns with the right column as the main body. The other resume formats can be determined similarly.
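The column-detection rule of step 1-2 admits a similarly minimal sketch; the function below is illustrative only and assumes pdfminer layout boxes such as those collected above:

```python
def is_right_main_two_column(boxes):
    """Step 1-2 heuristic: the resume is split into left/right columns with the
    right column as the main body if some text box ends (x1) to the left of the
    abscissa (x0) at which the longest text box starts."""
    longest = max(boxes, key=lambda b: b.x1 - b.x0)   # the widest (longest) text box
    return any(b.x1 < longest.x0 for b in boxes if b is not longest)
```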
The step 2 comprises the following steps:
Step 2-1, remote supervision data labeling: since the annotation data do not give the exact position at which each entity appears in the original document, each entity must be labeled back to its position in the original document according to the entity description. If an entity description appears in the original text multiple times, all occurrence positions are taken as correct occurrences of that entity; this greatly improves recall at a small cost in precision;
step 2-2, merging similar items during data back-labeling: project time, education time and work time in the resume text are uniformly labeled as time, and project content and work content are uniformly labeled as content.
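A minimal sketch of the back-labeling and label-merging procedure is given below for illustration; the field names in entity_values are hypothetical and not taken from the patent:

```python
import re

# Step 2-2: similar items are merged into a single label before back-annotation.
MERGE = {"project_time": "TIME", "education_time": "TIME", "work_time": "TIME",
         "project_content": "CONTENT", "work_content": "CONTENT"}

def back_annotate(resume_text, entity_values):
    """Step 2-1 sketch: every occurrence of an annotated entity string in the
    resume text is treated as a correct mention of that entity and labelled
    with its (merged) type. Returns (start, end, label) spans."""
    spans = []
    for field, value in entity_values.items():
        label = MERGE.get(field, field.upper())
        for match in re.finditer(re.escape(value), resume_text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```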
Step 3 comprises the following steps: using rules, the training data are divided into 4 blocks: basic information, education information, work information and project information. Because the training data and the test data differ considerably, a 4-class classifier is built from the training data and used to divide the test data into blocks. This is a text classification process consisting mainly of feature selection and classification.
Step 3-1, because the vocabulary is large, chi-square testing is used to extract the keywords that identify each category; the chi-square statistic χ2(t,c) of term t for category c is calculated as follows:
χ2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
The meaning of each parameter in the formula is as follows:
N: total number of documents in the training data set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis is that term t is independent of category c; for each term, the chi-square value with respect to category c is calculated, the results are sorted from largest to smallest, and the first k terms are taken in descending order of chi-square value;
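For illustration, the chi-square keyword selection of step 3-1 can be sketched as follows, assuming each document is given as a list of tokens; the function name and signature are illustrative, not part of the method as claimed:

```python
from collections import Counter

def chi2_top_k(docs, labels, category, k):
    """Score every term against `category` with the chi-square statistic
    chi2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D)) and return the top k."""
    n = len(docs)
    n_c = sum(1 for lab in labels if lab == category)
    df_t, df_tc = Counter(), Counter()
    for tokens, lab in zip(docs, labels):
        for t in set(tokens):
            df_t[t] += 1                      # documents containing term t
            if lab == category:
                df_tc[t] += 1                 # ... that also belong to category c
    scores = {}
    for t, dft in df_t.items():
        a = df_tc[t]                          # contains t, in c
        b = dft - a                           # contains t, not in c
        c = n_c - a                           # in c, without t
        d = n - n_c - b                       # neither contains t nor in c
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[t] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]
```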
Step 3-2, the resume is classified line by line on the basis of the extracted lines, each line being assigned to one of 4 classes: basic attributes, education experience, work experience or project experience. Text classification model and threshold selection stage: xgboost, a gradient boosted tree model, is adopted as the classifier; the data set is divided into a training set and a validation set at a ratio of 9:1; different thresholds are set for the classifier, the precision, recall and F1 value on the validation set are computed for each threshold, and the threshold with the largest F1 value is selected as the decision criterion;
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall;
The original text, represented by the features selected in the previous step, is fed into the classifier for 4-class judgment (an xgboost classifier is selected); the classifier judges, for each line, which block the line belongs to: basic information, education information, work information or project information.
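The classifier and threshold-selection stage can be sketched as follows (illustrative only; it assumes the xgboost and scikit-learn packages, that X already holds the keyword features of step 3-1, and that lines whose top class probability falls below the threshold are rejected):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def train_block_classifier(X, y):
    """Train the 4-way line classifier and pick the confidence threshold that
    maximises macro-F1 on a 9:1 train/validation split (step 3-2 sketch)."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.1, random_state=0)
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    clf.fit(X_tr, y_tr)
    probs = clf.predict_proba(X_va)
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in np.arange(0.1, 0.95, 0.05):
        pred = probs.argmax(axis=1)
        pred[probs.max(axis=1) < threshold] = -1     # reject low-confidence lines
        f1 = f1_score(y_va, pred, average="macro")   # rejections count as errors
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return clf, best_threshold
```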
As shown in fig. 2 and 3, step 4 includes:
Step 4-1, this stage mainly extracts the work content and project experience in the resume; because these two kinds of content are generally long, information extraction is performed in units of sentences, i.e. sentence-level extraction: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is used as the sentence representation, and the sequence of sentence vectors is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, i.e. whether the sentence belongs to work content or project experience (as shown in Fig. 2);
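A minimal sketch of such a sentence-level tagger is shown below for illustration; it assumes PyTorch, the Hugging Face transformers package, the pytorch-crf package and the bert-base-chinese checkpoint, none of which are prescribed by the patent:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from torchcrf import CRF   # pytorch-crf; any linear-chain CRF layer would do

class SentenceLevelTagger(nn.Module):
    """Encode each resume sentence with BERT, take the [CLS] vector as the
    sentence representation, and label the sentence sequence with a BiLSTM-CRF
    (tags: work content / project experience / other)."""
    def __init__(self, bert_name="bert-base-chinese", num_tags=3, hidden=256):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _sentence_vectors(self, sentences):
        enc = self.tokenizer(sentences, padding=True, truncation=True,
                             return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] vector per sentence
        return cls.unsqueeze(0)                          # one resume = one sequence

    def decode(self, sentences):
        feats, _ = self.lstm(self._sentence_vectors(sentences))
        return self.crf.decode(self.emit(feats))[0]      # one tag per sentence
```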
Step 4-2, this stage mainly extracts the shorter phrase fragments in the resume, and extraction is performed in units of characters: after the sentence-level work content and project experience have been extracted, they are replaced by the special tokens [NUM] and [WN], and the other fields are then extracted with a character-level CRF; BERT internally uses a 12-layer encoding network, and learnable parameters are set to weight the outputs of the 12 layers to obtain the final output representation:
o = γ * Σ_{i=1}^{m} s_i * b_i
where m = 12 is the number of hidden layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12-layer network output. Fig. 5 is a schematic diagram of an original resume, and Fig. 6 is a schematic diagram of the result obtained by extracting the resume information with the method of the present invention.
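The layer weighting can be sketched as a small module (illustrative only; the initialisation of s, the checkpoint name and the option of softmax-normalising the weights are assumptions, the patent only states that γ and s_i are learnable):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class WeightedBertEncoder(nn.Module):
    """Combine the outputs of the 12 BERT encoder layers with learnable
    per-layer weights s_i and a scalar gamma: o = gamma * sum_i s_i * b_i.
    (In practice s is often softmax-normalised, ELMo-style; that detail is an
    assumption, not stated in the patent.)"""
    def __init__(self, bert_name="bert-base-chinese", num_layers=12):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name, output_hidden_states=True)
        self.s = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, **encoded):
        hidden = self.bert(**encoded).hidden_states[1:]      # the 12 encoder layers
        stacked = torch.stack(hidden, dim=0)                 # (12, batch, seq, dim)
        weights = self.s[:, None, None, None]
        return self.gamma * (weights * stacked).sum(dim=0)   # weighted representation
```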
The invention provides a resume information extraction method based on stacking sequence labeling, and there are many ways to implement this technical solution; the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make a number of improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.

Claims (1)

1. A resume information extraction method based on stacking sequence labeling, characterized by comprising the following steps:
step 1, parsing the resume file in pdf format with pdfminer, converting the rich-text resume into a plain-text representation;
step 2, labeling the training data: remotely supervised data are back-labeled onto the text, and similar items are merged during labeling;
step 3, dividing the resume into information blocks: the resume is divided into 4 blocks, and a classifier is trained to assign text to blocks;
step 4, using a two-layer sequence labeling model to extract information at both the sentence level and the short text-segment level;
the step 1 comprises the following steps:
step 1-1, parsing the pdf resume file into a sequence of sentences with pdfminer, and using the horizontal and vertical coordinates of text obtained from pdfminer's LTTextBox components to correct the wrong line breaks introduced during parsing;
step 1-2, classifying resumes into three templates: ordinary sequential resumes, resumes split into left and right columns with the right column as the main body, and resumes split into left and right columns with the left column as the main body;
step 1-1 includes:
first obtaining all text box components of each page of the pdf resume file, recording the coordinates of the lower-left and upper-right corners of each text box as (x0, y0) and (x1, y1) respectively, then sorting in descending order of y1 and ascending order of x0 and computing the heights of the text boxes; for two text boxes with the same abscissa, if the line spacing between them is smaller than twice the text box height, the two boxes are merged;
step 1-2 comprises: traversing the text boxes in the resume in the order of step 1-1, recording the lower-left corner coordinates (x_maxlong_0, y_maxlong_0) and the upper-right corner coordinates (x_maxlong_1, y_maxlong_1) of the longest text box; then traversing each text box, recording the lower-left and upper-right corner coordinates of the current text box as (x_cur_0, y_cur_0) and (x_cur_1, y_cur_1); if the upper-right corner coordinate x_cur_1 of the current text box is smaller than the lower-left corner coordinate x_maxlong_0 of the longest text box, the current resume is split into left and right columns with the right column as the main body;
the step 2 comprises the following steps:
step 2-1, remote supervision data labeling: traversing each entity description in each training sample; if an entity description appears in the resume text multiple times, all occurrence positions are judged to be correct occurrences of that entity;
step 2-2, merging similar items during data back-labeling: project time, education time and work time in the resume text are uniformly merged and labeled as time, and project content and work content are uniformly back-labeled as content;
the step 3 comprises the following steps:
traversing the resume text extracted in step 1 line by line, classifying each line into one of 4 classes, and judging whether the line belongs to basic attributes, education experience, work experience or project experience, specifically comprising the following steps:
step 3-1, extracting the keywords that identify each category by chi-square testing: the chi-square statistic χ2(t,c) of term t for category c is calculated by the following formula:
χ2(t,c) = N*(A*D-C*B)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
The meaning of each parameter in the formula is as follows:
N: total number of documents in the training data set;
A: number of documents that contain term t and belong to category c;
B: number of documents that contain term t but do not belong to category c;
C: number of documents that belong to category c but do not contain term t;
D: number of documents that neither belong to category c nor contain term t;
the null hypothesis is that term t is independent of category c; for each term, the chi-square value with respect to category c is calculated, the results are sorted from largest to smallest, and the first k terms are taken in descending order of chi-square value;
step 3-2, adopting xgboost, a gradient boosted tree model, as the classifier, dividing the training data set into a training set and a validation set at a ratio of 9:1, and calculating the corresponding precision, recall and F1 values:
F1=2*pre*recall/(pre+recall),
where pre is the precision and recall is the recall; the value whose corresponding F1 value is largest is selected as the classifier threshold;
the resume text is segmented into words and represented, features are selected by the chi-square testing of step 3-1 and fed into the classifier for 4-class judgment (an xgboost classifier is selected); the classifier judges, for each line, whether the line belongs to basic information, education information, work information or project information;
step 4 comprises:
step 4-1, extracting the work content and project experience in the resume, performing sentence-level extraction in units of sentences: each sentence is first encoded with BERT, the [CLS] encoding vector at the beginning of the sentence is used as the sentence representation, and the sequence of sentence vectors is then passed through a bidirectional LSTM and a CRF network to obtain a label for each sentence, which indicates whether the sentence belongs to work content or project experience;
step 4-2, extracting phrase fragments in the resume: after the sentence-level work content and project experience have been extracted, they are replaced by the special tokens [NUM] and [WN], the work content being replaced by [NUM] and the project experience by [WN], and the other fields are then extracted with a character-level CRF; BERT internally uses a 12-layer encoding network, and learnable parameters are set to weight the outputs of the 12 layers to obtain the final output representation:
o = γ * Σ_{i=1}^{m} s_i * b_i
where m = 12 is the number of hidden layers output by BERT; b_i is the output of the i-th BERT layer; γ and s_i are learnable parameters; o is the weighted representation of the 12-layer network output.
CN202010756164.2A 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling Active CN111966785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010756164.2A CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Publications (2)

Publication Number Publication Date
CN111966785A (en) 2020-11-20
CN111966785B true CN111966785B (en) 2023-06-20

Family

ID=73363289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010756164.2A Active CN111966785B (en) 2020-07-31 2020-07-31 Resume information extraction method based on stacking sequence labeling

Country Status (1)

Country Link
CN (1) CN111966785B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A resume accurate analysis method based on SVM text classification
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Resume identification method and apparatus, computer device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409911B2 (en) * 2016-04-29 2019-09-10 Cavium, Llc Systems and methods for text analytics processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392143A (en) * 2017-07-20 2017-11-24 中国科学院软件研究所 A resume accurate analysis method based on SVM text classification
CN110442841A (en) * 2019-06-20 2019-11-12 平安科技(深圳)有限公司 Resume identification method and apparatus, computer device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于领域知识库的简历信息抽取系统的设计与实现 (Design and Implementation of a Resume Information Extraction System Based on a Domain Knowledge Base); 张博; China Master's Theses Full-text Database (Information Science and Technology Series); 2018-10-15; pp. 10-59 *

Also Published As

Publication number Publication date
CN111966785A (en) 2020-11-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant