WO2020252919A1 - Resume identification method and apparatus, and computer device and storage medium - Google Patents

Resume identification method and apparatus, and computer device and storage medium

Info

Publication number
WO2020252919A1
Authority
WO
WIPO (PCT)
Prior art keywords
resume
lstm
dnlp
text block
text
Prior art date
Application number
PCT/CN2019/103268
Other languages
French (fr)
Chinese (zh)
Inventor
石明川
姚飞
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020252919A1 publication Critical patent/WO2020252919A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of computers, and in particular to a method and apparatus for identifying resumes, a computer device, and a storage medium.
  • Resume recognition is a form of semi-structured text recognition; because a resume lacks the natural word order of traditional unstructured text, it is difficult to recognize.
  • The resume recognition system in the prior art is keyword-based, relying on keywords such as "person's name", "mobile phone number", and "work history". If these keywords do not appear in the semi-structured text, a traditional resume recognition system cannot recognize the corresponding corpus. Keyword-based recognition is usually implemented with regular expressions, and the variety of resume formats makes recognition difficult. For example, the person-name keyword is normally followed by the candidate's name, but names vary in length, in language (Chinese or English), and in spacing; a resume may also contain multiple names and multiple time periods, and work experience is often confused with project experience because these sections have no unified format. This leads to a very low resume recognition rate and a need for manual screening.
  • In view of this, embodiments of the present application provide a method and apparatus for identifying resumes, a computer device, and a storage medium.
  • In one aspect, an embodiment of the present application provides a method for recognizing a resume, the method comprising: receiving a target resume to be recognized; inputting the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model; using the DNLP system to determine the resume template used by the target resume; and extracting feature information from the target resume according to the resume template.
  • Optionally, before the target resume is input into the DNLP system, the method further includes: determining a plurality of resume samples; and using the plurality of resume samples to train the initial neural network of the BI-LSTM-CRF model to obtain the DNLP system.
  • Optionally, using the multiple resume samples to train the initial neural network of the BI-LSTM-CRF model includes: segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume; performing word segmentation on the text blocks and extracting the feature words of each text block; and training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
  • Optionally, segmenting the resume text of each resume sample by supervised classification includes: segmenting the following resume texts in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience; and marking the resume texts with label information.
  • Optionally, extracting the feature words of each text block includes extracting them with the term frequency-inverse document frequency (TF-IDF) algorithm, where tfidf = tf * idf and the top-n words of each text block by tfidf are taken as its feature words, n being a positive integer greater than 1. Here tf_{i,j} = n_{i,j} / ∑_k n_{k,j}, where n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrences of all words in d_j, and k ranges over the words in d_j; and idf_i = log(|D| / |{j : t_i ∈ d_j}|), where |D| is the total number of files in the resume samples and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.
  • Optionally, training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words includes: in the BI layer of the BI-LSTM-CRF model, using a pre-trained or randomly initialized embedding matrix to map each character in the sentences of the text blocks from a one-hot vector to a low-dimensional dense character vector, with dropout applied before the input of the next layer to alleviate overfitting.
  • Optionally, when training the initial neural network, the following maximized log-likelihood function is used to process the sample data: logP(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample.
  • In another aspect, an embodiment of the present application provides an apparatus for recognizing resumes. The apparatus includes: a receiving module for receiving a target resume to be recognized; an input module for inputting the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model; a determining module for determining, with the DNLP system, the resume template used by the target resume; and an extraction module for extracting feature information from the target resume according to the resume template.
  • Optionally, the apparatus further includes: a determination module configured to determine a plurality of resume samples before the input module inputs the target resume into the DNLP system; and a training module configured to train the initial neural network of the BI-LSTM-CRF model with the plurality of resume samples to obtain the DNLP system.
  • Optionally, the training module includes: a segmentation unit for segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume; an extraction unit for performing word segmentation on the text blocks and extracting the feature words of each text block; and a training unit for training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
  • Optionally, the segmentation unit includes a segmentation subunit for segmenting the following resume texts in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience, and for marking the resume texts with label information.
  • Optionally, the extraction unit includes an extraction subunit for extracting the feature words of each text block with the TF-IDF algorithm, where tfidf = tf * idf, the top-n words of each text block by tfidf are taken as its feature words (n a positive integer greater than 1), tf_{i,j} = n_{i,j} / ∑_k n_{k,j}, idf_i = log(|D| / |{j : t_i ∈ d_j}|), |D| is the total number of files in the resume samples, and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.
  • Optionally, the training module includes: a first processing unit configured, in the BI layer of the BI-LSTM-CRF model, to map each character in the sentences of the text blocks from a one-hot vector to a low-dimensional dense character vector using a pre-trained or randomly initialized embedding matrix, with dropout applied before the next layer to alleviate overfitting; a second processing unit configured, in the LSTM layer of the BI-LSTM-CRF model, to extract sentence features by taking each feature-word sequence of a sentence as the input of each time step of the bidirectional LSTM and splicing, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM, obtaining the complete hidden-state sequence and outputting p_i, the probability of belonging to tag i; and a third processing unit configured, in the CRF layer of the BI-LSTM-CRF model, to perform sentence-level sequence labeling to obtain a linear CRF, where the score of sentence x having tag sequence y = (y1, y2, ..., yn) is score(x, y) = ∑_i (A_{y_{i-1}, y_i} + p_{i, y_i}) and the probability normalized with Softmax is P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y')), with y' ranging over all possible tag sequences.
  • Optionally, the third processing unit further includes a processing subunit for processing sample data with the following maximized log-likelihood function: logP(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample.
  • According to another embodiment of the present application, a storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute the steps in any one of the foregoing method embodiments when run.
  • According to yet another embodiment of the present application, an electronic device is also provided, including a memory and a processor; a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the foregoing method embodiments.
  • FIG. 1 is a hardware structure block diagram of a mobile terminal for identifying resumes according to an embodiment of the present application
  • Figure 2 is a flowchart of a method for identifying resumes according to an embodiment of the present application
  • FIG. 3 is a flowchart of training a BI-LSTM-CRF model in an embodiment of the application
  • Fig. 4 is a structural block diagram of a device for identifying resumes according to an embodiment of the present application.
  • FIG. 1 is a hardware structural block diagram of a computer terminal for identifying resumes according to an embodiment of the present application.
  • The computer terminal 10 may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data.
  • Optionally, the computer terminal may also include a transmission device 106 and an input/output device 108 for communication functions.
  • A person of ordinary skill in the art can understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the foregoing computer terminal.
  • The computer terminal 10 may also include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
  • The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the method for identifying resumes in the embodiment of the present application. By running the computer programs stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method.
  • The memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • The memory 104 may further include memories remotely located with respect to the processor 102, and these remote memories may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • The transmission device 106 is used to receive or send data via a network.
  • Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal 10.
  • In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • FIG. 2 is a flowchart of the method for identifying a resume according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
  • Step S202: receiving a target resume to be identified;
  • Step S204: inputting the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model;
  • Step S206: using the DNLP system to determine the resume template used by the target resume, the resume template including multiple physical sections;
  • The resume template in this embodiment refers to the resume style or resume layout adopted by the target resume. In different resume templates, the content of the same physical section (such as work experience) is distributed at different positions in the text, so determining the resume template of the target resume makes it possible to determine the position, within the target resume, of each piece of text content to be identified;
  • Step S208: extracting feature information from the target resume according to the resume template.
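  • As a rough illustration only, the following Python sketch wires steps S202-S208 together; the dnlp_model object and its predict_template and extract methods are hypothetical placeholders, since the application does not disclose a concrete programming interface.

```python
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    template_id: str  # resume template determined in step S206
    fields: dict      # feature information extracted in step S208

def recognize_resume(dnlp_model, resume_text: str) -> ExtractionResult:
    # S202: receive the target resume to be identified
    # S204: hand it to the DNLP system, trained from a BI-LSTM-CRF model
    template_id = dnlp_model.predict_template(resume_text)  # S206
    fields = dnlp_model.extract(resume_text, template_id)   # S208
    return ExtractionResult(template_id=template_id, fields=fields)
```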
  • Through the solution of this embodiment, the target resume is input into the deep neural language programming (DNLP) system, the DNLP system is used to determine the resume template used by the target resume, and the feature information in the target resume is finally extracted according to that template. By first identifying the resume's template and then extracting feature information from the corresponding template, the technical problem of a low resume recognition rate in the prior art is solved, and the recognition rate of resumes is improved.
  • In this embodiment, after the feature information is extracted from the target resume according to the resume template, the feature information can be re-typeset according to a template specified by the user, to facilitate centralized collection; alternatively, only the feature information the user cares about (for example, the graduating school) is extracted, bound to the resume identifier or other key information, and then displayed in a formatted way, reducing the time users spend looking for key information in complicated resumes.
  • In this embodiment, before the target resume is input into the DNLP system, the method further includes: determining a plurality of resume samples; and using the plurality of resume samples to train the initial neural network of the BI-LSTM-CRF model to obtain the DNLP system.
  • Fig. 3 is a flowchart of training the BI-LSTM-CRF model according to an embodiment of the present application. As shown in Fig. 3, training the initial neural network of the BI-LSTM-CRF model with the multiple resume samples includes:
  • S302: segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume;
  • Specifically, segmenting the resume text of each resume sample by supervised classification includes segmenting the following resume texts (physical sections) in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience; and marking the resume texts with label information. In the resume samples, a complete resume is composed of multiple resume texts, but in resumes built from different templates the same resume text may be distributed at different positions; this part is the process of learning each physical section of the resume;
  • S304: performing word segmentation on the text blocks and extracting the feature words of each text block; the key feature words can be extracted by performing word segmentation and synonym matching on the marked text blocks.
  • Specifically, extracting the feature words of each text block includes using the term frequency-inverse document frequency (TF-IDF) algorithm, where tfidf = tf * idf and the top-n words of each text block by tfidf are taken as its feature words, n being a positive integer greater than 1 (preferably n = 15). Here tf_{i,j} = n_{i,j} / ∑_k n_{k,j}, where n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrences of all words in d_j, and k ranges over the words in d_j; and idf_i = log(|D| / |{j : t_i ∈ d_j}|), where |D| is the total number of files in the resume samples and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.
  • TF-IDF filters out common words, keeps important words, and thereby extracts the feature words (see the sketch below).
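  • The TF-IDF computation above can be sketched in a few lines of Python. This is a minimal illustration of the formulas, assuming the text blocks have already been word-segmented; it is not the application's own implementation.

```python
import math
from collections import Counter

def top_n_feature_words(blocks, n=15):
    """Extract the top-n feature words of each text block by TF-IDF.

    blocks: a list of text blocks, each already word-segmented into a list
    of words; n = 15 follows the preferred value stated in the text.
    """
    doc_count = len(blocks)                                    # |D|
    df = Counter()                                             # |{j : t_i in d_j}|
    for words in blocks:
        df.update(set(words))
    features = []
    for words in blocks:
        counts = Counter(words)
        total = sum(counts.values())                           # sum_k n_{k,j}
        tfidf = {w: (c / total) * math.log(doc_count / df[w])  # tf * idf
                 for w, c in counts.items()}
        features.append(sorted(tfidf, key=tfidf.get, reverse=True)[:n])
    return features
```

  • Words that occur in every block get idf = log(1) = 0, which is exactly how TF-IDF filters out common words while keeping important ones.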
  • The BI-LSTM-CRF model is trained on the text blocks of each category to obtain a recognition model for each category: a character-based Bi-LSTM-CRF can be used, with tags such as B-PER and I-PER for the first and non-first characters of a person's name, and B-SCH and I-SCH for the first and non-first characters of a school name, to train the recognition model of each entity module; an invented labeling example follows.
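  • For example, a character-level labeling of one invented sentence would look as follows; the O tag for characters outside any entity is standard BIO practice and an assumption here, since the text names only the B-/I- prefixes.

```python
# Invented example of character-level BIO-style labels.
chars  = ["张", "三", "毕", "业", "于", "北", "京", "大", "学"]
labels = ["B-PER", "I-PER",                    # first / following characters of a person's name
          "O", "O", "O",                       # O: characters outside any entity
          "B-SCH", "I-SCH", "I-SCH", "I-SCH"]  # first / following characters of a school name
```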
  • The neural network of the BI-LSTM-CRF model has a three-layer logical structure. Training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words includes:
  • In the BI layer (also called the lookup layer) of the BI-LSTM-CRF model, each character in a sentence of a text block is mapped from a one-hot vector to a low-dimensional dense character vector using a pre-trained or randomly initialized embedding matrix, and dropout is applied before the input of the next layer to alleviate overfitting;
  • In the LSTM layer of the BI-LSTM-CRF model, sentence features are extracted: each feature-word sequence of a sentence is used as the input of each time step of the bidirectional LSTM, and the hidden-state sequence output by the forward LSTM is spliced, position by position, with the hidden states output by the backward LSTM at each position to obtain the complete hidden-state sequence, outputting p_i, the probability of belonging to tag i;
  • In the CRF layer of the BI-LSTM-CRF model, sentence-level sequence labeling is performed to obtain a linear CRF, in which the score of sentence x having tag sequence y = (y1, y2, ..., yn) is score(x, y) = ∑_i (A_{y_{i-1}, y_i} + p_{i, y_i}) and the probability normalized with Softmax is P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y')), with y' ranging over all possible tag sequences;
  • The softmax of this embodiment makes only a local decision; that is, the tag of the current character is not affected by the other tags.
  • When training the initial neural network of the BI-LSTM-CRF model, the following maximized log-likelihood function is used in the CRF layer to process the sample data: logP(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample.
  • The score of the entire sequence in this embodiment equals the sum of the scores at each position, and the score at each position is obtained from two parts: one part is determined by the p_i output by the LSTM, and the other by the transition matrix A of the CRF.
  • Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiment can be implemented by software plus the necessary general hardware platform, or of course by hardware, though in many cases the former is the better implementation.
  • Based on this understanding, the technical solution of this application, or the part that contributes to the prior art, can essentially be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions enabling a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the method described in each embodiment of the present application.
  • In this embodiment, an apparatus for recognizing resumes is also provided. The apparatus is used to implement the above-mentioned embodiments and preferred implementations; what has already been explained will not be repeated.
  • As used below, the term "module" can be a combination of software and/or hardware that implements a predetermined function.
  • Although the apparatuses described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and conceived.
  • Fig. 4 is a structural block diagram of an apparatus for identifying resumes according to an embodiment of the present application. As shown in Fig. 4, the apparatus includes:
  • a receiving module 40, used to receive the target resume to be identified;
  • an input module 42, configured to input the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model;
  • a determining module 44, configured to use the DNLP system to determine the resume template used by the target resume;
  • an extraction module 46, configured to extract feature information from the target resume according to the resume template.
  • Optionally, the apparatus further includes: a determination module configured to determine a plurality of resume samples before the input module inputs the target resume into the DNLP system; and a training module configured to train the initial neural network of the BI-LSTM-CRF model with the plurality of resume samples to obtain the DNLP system.
  • Optionally, the training module includes: a segmentation unit for segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume; an extraction unit for performing word segmentation on the text blocks and extracting the feature words of each text block; and a training unit for training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
  • Optionally, the segmentation unit includes a segmentation subunit for segmenting the following resume texts in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience, and for marking the resume texts with label information.
  • Optionally, the extraction unit includes an extraction subunit for extracting the feature words of each text block with the TF-IDF algorithm, where tfidf = tf * idf, the top-n words of each text block by tfidf are taken as its feature words (n a positive integer greater than 1), tf_{i,j} = n_{i,j} / ∑_k n_{k,j}, idf_i = log(|D| / |{j : t_i ∈ d_j}|), |D| is the total number of files in the resume samples, and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.
  • Optionally, the training module includes: a first processing unit configured, in the BI layer of the BI-LSTM-CRF model, to map each character in the sentences of the text blocks from a one-hot vector to a low-dimensional dense character vector using a pre-trained or randomly initialized embedding matrix, with dropout applied before the next layer to alleviate overfitting; a second processing unit configured, in the LSTM layer of the BI-LSTM-CRF model, to extract sentence features by taking each feature-word sequence of a sentence as the input of each time step of the bidirectional LSTM and splicing, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM, obtaining the complete hidden-state sequence and outputting p_i, the probability of belonging to tag i; and a third processing unit configured, in the CRF layer of the BI-LSTM-CRF model, to perform sentence-level sequence labeling to obtain a linear CRF, where the score of sentence x having tag sequence y = (y1, y2, ..., yn) is score(x, y) = ∑_i (A_{y_{i-1}, y_i} + p_{i, y_i}) and the probability normalized with Softmax is P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y')), with y' ranging over all possible tag sequences.
  • Optionally, the third processing unit further includes a processing subunit for processing sample data with the following maximized log-likelihood function: logP(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample.
  • In addition, each of the above modules can be implemented by software or hardware. The latter can be implemented in the following manner, but is not limited to it: the above modules are all located in the same processor, or the above modules, in any combination, are located in different processors.
  • The disclosed system, apparatus, and method may be implemented in other ways.
  • The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling or direct coupling or communication connection displayed or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • Each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above-mentioned integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • The above-mentioned integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium.
  • The above-mentioned software functional unit is stored in a storage medium and includes several instructions that make a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor execute part of the steps of the method described in each embodiment of the present application.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • An embodiment of the present application also provides a storage medium in which a computer program is stored, where the computer program is configured to execute the steps in any one of the foregoing method embodiments when run.
  • Optionally, in this embodiment, the foregoing storage medium may be configured to store a computer program for executing the following steps: receiving a target resume to be recognized; inputting the target resume into the DNLP system; using the DNLP system to determine the resume template used by the target resume; and extracting feature information from the target resume according to the resume template.
  • Optionally, in this embodiment, the foregoing storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store computer programs.
  • An embodiment of the present application also provides an electronic device, including a memory and a processor; a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the foregoing method embodiments.
  • Optionally, the aforementioned electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.
  • Optionally, in this embodiment, the foregoing processor may be configured to execute the following steps through a computer program: receiving a target resume to be recognized; inputting the target resume into the DNLP system; using the DNLP system to determine the resume template used by the target resume; and extracting feature information from the target resume according to the resume template.

Abstract

A resume identification method and apparatus, and a computer device and a storage medium. The method comprises: receiving a target resume to be identified (S202); inputting said target resume into a deep neural language programming (DNLP) system, wherein the DNLP system is obtained by training using a bidirectional long-short-term memory recurrent neural network (BI-LSTM-CRF) model (S204); determining a resume template used in said target resume by using the DNLP system (S206); and extracting feature information in said target resume according to the resume template (S208). According to the method, the technical problem in the prior art of low resume identification rate is solved.

Description

Resume identification method and apparatus, and computer device and storage medium

【Technical Field】

This application relates to the field of computers, and in particular to a method and apparatus for identifying resumes, a computer device, and a storage medium.

【Background Art】

Resume recognition is a form of semi-structured text recognition; because a resume lacks the natural word order of traditional unstructured text, it is difficult to recognize.

The resume recognition system in the prior art is keyword-based, relying on keywords such as "person's name", "mobile phone number", and "work history". If these keywords do not appear in the semi-structured text, a traditional resume recognition system cannot recognize the corresponding corpus. Keyword-based recognition is usually implemented with regular expressions, and the variety of resume formats makes recognition difficult. For example, the person-name keyword is normally followed by the candidate's name, but names vary in length, in language (Chinese or English), and in spacing; a resume may also contain multiple names and multiple time periods, and work experience is often confused with project experience because these sections have no unified format. This leads to a very low resume recognition rate and a need for manual screening.

For the above-mentioned problems in the related art, no effective solution has been found yet.

【Summary of the Invention】

In view of this, embodiments of the present application provide a method and apparatus for identifying resumes, a computer device, and a storage medium.
In one aspect, an embodiment of the present application provides a method for recognizing a resume, the method comprising: receiving a target resume to be recognized; inputting the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model; using the DNLP system to determine the resume template used by the target resume; and extracting feature information from the target resume according to the resume template.

Optionally, before the target resume is input into the DNLP system, the method further includes: determining a plurality of resume samples; and using the plurality of resume samples to train the initial neural network of the BI-LSTM-CRF model to obtain the DNLP system.

Optionally, using the multiple resume samples to train the initial neural network of the BI-LSTM-CRF model includes: segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume; performing word segmentation on the text blocks and extracting the feature words of each text block; and training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.

Optionally, segmenting the resume text of each resume sample by supervised classification includes: segmenting the following resume texts in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience; and marking the resume texts with label information.
Optionally, extracting the feature words of each text block includes extracting them with the term frequency-inverse document frequency (TF-IDF) algorithm, where tfidf = tf * idf and the top-n words of each text block by tfidf are taken as its feature words, n being a positive integer greater than 1. Here

tf_{i,j} = n_{i,j} / ∑_k n_{k,j},

where n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrences of all words in d_j, and k ranges over the words in d_j; and

idf_i = log(|D| / |{j : t_i ∈ d_j}|),

where |D| is the total number of files in the resume samples and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.
Optionally, training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words includes: in the BI layer of the BI-LSTM-CRF model, mapping each character in the sentences of the text blocks from a one-hot vector to a low-dimensional dense character vector using a pre-trained or randomly initialized embedding matrix, with dropout applied before the input of the next layer to alleviate overfitting; in the LSTM layer of the BI-LSTM-CRF model, extracting sentence features by taking each feature-word sequence of a sentence as the input of each time step of the bidirectional LSTM, then splicing, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM at each position to obtain the complete hidden-state sequence, and outputting p_i, the probability of belonging to tag i; and in the CRF layer of the BI-LSTM-CRF model, performing sentence-level sequence labeling to obtain a linear CRF, where in the calculation formula of the linear CRF the score of sentence x having tag sequence y is

score(x, y) = ∑_i (A_{y_{i-1}, y_i} + p_{i, y_i}),

for a tag sequence y = (y1, y2, ..., yn) whose length equals the sentence length, and the probability normalized with Softmax is

P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y')),

where y' ranges over all possible tag sequences.

Optionally, when training the initial neural network of the BI-LSTM-CRF model, the following maximized log-likelihood function is used in the CRF layer of the BI-LSTM-CRF model to process the sample data:

logP(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))),

where (x, y_x) is a training sample.
In another aspect, an embodiment of the present application provides an apparatus for recognizing resumes. The apparatus includes: a receiving module for receiving a target resume to be recognized; an input module for inputting the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model; a determining module for determining, with the DNLP system, the resume template used by the target resume; and an extraction module for extracting feature information from the target resume according to the resume template.

Optionally, the apparatus further includes: a determination module configured to determine a plurality of resume samples before the input module inputs the target resume into the DNLP system; and a training module configured to train the initial neural network of the BI-LSTM-CRF model with the plurality of resume samples to obtain the DNLP system.

Optionally, the training module includes: a segmentation unit for segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume; an extraction unit for performing word segmentation on the text blocks and extracting the feature words of each text block; and a training unit for training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.

Optionally, the segmentation unit includes a segmentation subunit for segmenting the following resume texts in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience, and for marking the resume texts with label information.
Optionally, the extraction unit includes an extraction subunit for extracting the feature words of each text block with the term frequency-inverse document frequency (TF-IDF) algorithm, where tfidf = tf * idf and the top-n words of each text block by tfidf are taken as its feature words, n being a positive integer greater than 1; here tf_{i,j} = n_{i,j} / ∑_k n_{k,j}, where n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrences of all words in d_j, and k ranges over the words in d_j; and idf_i = log(|D| / |{j : t_i ∈ d_j}|), where |D| is the total number of files in the resume samples and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.
Optionally, the training module includes: a first processing unit configured, in the BI layer of the BI-LSTM-CRF model, to map each character in the sentences of the text blocks from a one-hot vector to a low-dimensional dense character vector using a pre-trained or randomly initialized embedding matrix, with dropout applied before the next layer to alleviate overfitting; a second processing unit configured, in the LSTM layer of the BI-LSTM-CRF model, to extract sentence features by taking each feature-word sequence of a sentence as the input of each time step of the bidirectional LSTM and splicing, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM, obtaining the complete hidden-state sequence and outputting p_i, the probability of belonging to tag i; and a third processing unit configured, in the CRF layer of the BI-LSTM-CRF model, to perform sentence-level sequence labeling to obtain a linear CRF, where the score of sentence x having tag sequence y = (y1, y2, ..., yn) (of length equal to the sentence length) is

score(x, y) = ∑_i (A_{y_{i-1}, y_i} + p_{i, y_i}),

and the probability normalized with Softmax is

P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y')),

where y' ranges over all possible tag sequences.

Optionally, the third processing unit further includes a processing subunit for processing sample data with the following maximized log-likelihood function: logP(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample.
According to another embodiment of the present application, a storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute the steps in any one of the foregoing method embodiments when run.

According to yet another embodiment of the present application, an electronic device is also provided, including a memory and a processor; a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the foregoing method embodiments.

Through this application, the target resume is input into the deep neural language programming (DNLP) system, the DNLP system is used to determine the resume template used by the target resume, and the feature information in the target resume is finally extracted according to that template. By first identifying the resume's template and then extracting feature information from the corresponding template, the technical problem of a low resume recognition rate in the prior art is solved, and the recognition rate of resumes is improved.
【Brief Description of the Drawings】

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.

FIG. 1 is a hardware structure block diagram of a mobile terminal for identifying resumes according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for identifying resumes according to an embodiment of the present application;

FIG. 3 is a flowchart of training the BI-LSTM-CRF model according to an embodiment of the present application;

FIG. 4 is a structural block diagram of an apparatus for identifying resumes according to an embodiment of the present application.
【Detailed Description】

Hereinafter, the application will be described in detail with reference to the drawings and in conjunction with embodiments. It should be noted that, if there is no conflict, the embodiments in this application and the features in the embodiments can be combined with each other.

It should be noted that the terms "first", "second", etc. in the description and claims of the application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

Embodiment 1
The method embodiment provided in Embodiment 1 of the present application may be executed in a mobile terminal, a server, a computer terminal, or a similar computing device. Taking running on a computer terminal as an example, FIG. 1 is a hardware structural block diagram of a computer terminal for identifying resumes according to an embodiment of the present application. As shown in FIG. 1, the computer terminal 10 may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the computer terminal may also include a transmission device 106 and an input/output device 108 for communication functions. A person of ordinary skill in the art can understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the foregoing computer terminal. For example, the computer terminal 10 may also include more or fewer components than those shown in FIG. 1, or have a configuration different from that shown in FIG. 1.

The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the method for identifying resumes in the embodiment of the present application. By running the computer programs stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memories remotely located with respect to the processor 102, and these remote memories may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
In this embodiment, a method for identifying a resume is provided. FIG. 2 is a flowchart of the method for identifying a resume according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:

Step S202: receiving a target resume to be identified;

Step S204: inputting the target resume into a deep neural language programming (DNLP) system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network (BI-LSTM-CRF) model;

Step S206: using the DNLP system to determine the resume template used by the target resume, the resume template including multiple physical sections;

The resume template in this embodiment refers to the resume style or resume layout adopted by the target resume. In different resume templates, the content of the same physical section (such as work experience) is distributed at different positions in the text, so determining the resume template of the target resume makes it possible to determine the position, within the target resume, of each piece of text content to be identified;

Step S208: extracting feature information from the target resume according to the resume template.
Through the solution of this embodiment, the target resume is input into the deep neural language programming (DNLP) system, the DNLP system is used to determine the resume template used by the target resume, and the feature information in the target resume is finally extracted according to that template. By first identifying the resume's template and then extracting feature information from the corresponding template, the technical problem of a low resume recognition rate in the prior art is solved, and the recognition rate of resumes is improved.

In this embodiment, after the feature information is extracted from the target resume according to the resume template, the feature information can be re-typeset according to a template specified by the user, to facilitate centralized collection; alternatively, only the feature information the user cares about (for example, the graduating school) is extracted, bound to the resume identifier or other key information, and then displayed in a formatted way, reducing the time users spend looking for key information in complicated resumes; a minimal sketch of such re-typesetting follows.
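As a rough illustration of such re-typesetting (the field names and the template string below are assumptions; the application does not fix a display format):

```python
# Illustrative only: re-typeset extracted feature information according to a
# user-specified template; the field names and template string are assumptions.
def format_summary(fields: dict, template: str = "{name} | {school} | {phone}") -> str:
    return template.format(**fields)

print(format_summary({"name": "张三", "school": "北京大学", "phone": "138-0000-0000"}))
```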
In this embodiment, before the target resume is input into the DNLP system, the method further includes: determining a plurality of resume samples; and using the plurality of resume samples to train the initial neural network of the BI-LSTM-CRF model to obtain the DNLP system.

FIG. 3 is a flowchart of training the BI-LSTM-CRF model according to an embodiment of the present application. As shown in FIG. 3, training the initial neural network of the BI-LSTM-CRF model with the multiple resume samples includes:

S302: segmenting the resume text of each resume sample by supervised classification to obtain multiple text blocks that correspond to manual labels, where each text block corresponds to one category attribute in the resume. Specifically, this includes segmenting the following resume texts (physical sections) in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience, and marking the resume texts with label information. In the resume samples, a complete resume is composed of multiple resume texts, but in resumes built from different templates the same resume text may be distributed at different positions; this part is the process of learning each physical section of the resume.

S304: performing word segmentation on the text blocks and extracting the feature words of each text block; the key feature words can be extracted by performing word segmentation and synonym matching on the marked text blocks.
Specifically, extracting the feature words of each text block includes using the term frequency-inverse document frequency (TF-IDF) algorithm, where tfidf = tf * idf and the top-n words of each text block by tfidf are taken as its feature words, n being a positive integer greater than 1 (preferably n = 15). Here

tf_{i,j} = n_{i,j} / ∑_k n_{k,j},

where n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrences of all words in d_j, and k ranges over the words in d_j; and

idf_i = log(|D| / |{j : t_i ∈ d_j}|),

where |D| is the total number of files in the resume samples and |{j : t_i ∈ d_j}| is the number of files containing the word t_i.

TF-IDF filters out common words, keeps important words, and thereby extracts the feature words.
S306: training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words. By segmenting the sample resume text into different entity modules (resume texts), the different entity modules are then learned.

In one implementation of this embodiment, the BI-LSTM-CRF model is trained on the text blocks of each category to obtain a recognition model for each category: a character-based Bi-LSTM-CRF can be used, with tags such as B-PER and I-PER for the first and non-first characters of a person's name, and B-SCH and I-SCH for the first and non-first characters of a school name, to train the recognition model of each entity module. The neural network of the BI-LSTM-CRF model has a three-layer logical structure. Training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words includes:
In the BI layer (also called the lookup layer) of the BI-LSTM-CRF model, each character in a sentence of a text block is mapped from a one-hot vector to a low-dimensional dense character vector using a pre-trained or randomly initialized embedding matrix, and dropout is applied before the input of the next layer to alleviate overfitting.

In the LSTM layer of the BI-LSTM-CRF model, sentence features are extracted: each feature-word sequence of a sentence is used as the input of each time step of the bidirectional LSTM, and the hidden-state sequence output by the forward LSTM is spliced, position by position, with the hidden states output by the backward LSTM at each position to obtain the complete hidden-state sequence, outputting p_i, the probability of belonging to tag i.

In the CRF layer of the BI-LSTM-CRF model, sentence-level sequence labeling is performed to obtain a linear CRF. In the calculation formula of the linear CRF, the score of sentence x having tag sequence y is

score(x, y) = ∑_i (A_{y_{i-1}, y_i} + p_{i, y_i}),

where y = (y1, y2, ..., yn) is a tag sequence whose length equals the sentence length and A is the transition matrix of the CRF layer; the probability normalized with Softmax is

P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y')),

where y' ranges over all possible tag sequences.
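Under the assumptions noted in the comments, the three-layer structure could be organized roughly as follows in PyTorch. This is a sketch of the described architecture, not the application's implementation; the CRF scoring and loss are sketched separately after the likelihood discussion below.

```python
import torch
import torch.nn as nn

class BiLstmCrfSketch(nn.Module):
    """Minimal sketch of the three-layer structure described above:
    BI (lookup) layer -> bidirectional LSTM layer -> CRF layer.
    Hyperparameter values and tensor shapes are illustrative assumptions,
    not values disclosed in the application.
    """
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128,
                 dropout=0.5, pretrained=None):
        super().__init__()
        # BI (lookup) layer: one-hot index -> low-dimensional dense vector,
        # using a pre-trained or randomly initialized embedding matrix
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if pretrained is not None:
            self.embedding.weight.data.copy_(pretrained)
        # dropout before the next layer, to alleviate overfitting
        self.dropout = nn.Dropout(dropout)
        # LSTM layer: forward and backward hidden states are concatenated
        # position by position by the bidirectional LSTM
        self.lstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                            batch_first=True)
        # linear projection to per-position tag scores (the p_i of the text)
        self.emission = nn.Linear(hidden_dim, num_tags)
        # CRF layer: transition matrix A between tags
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

    def forward(self, char_ids):      # char_ids: (batch, seq_len) indices
        x = self.dropout(self.embedding(char_ids))
        h, _ = self.lstm(x)           # (batch, seq_len, hidden_dim)
        return self.emission(h)       # emission scores for each position/tag
```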
本实施例的softmax只做了局部的考虑,也就是说,当前词的tag,是不受其它的tag的影响的。The softmax of this embodiment only takes partial considerations, that is, the tag of the current word is not affected by other tags.
Optionally, when the initial neural network of the BI-LSTM-CRF model is trained, the CRF layer of the BI-LSTM-CRF model processes the sample data by maximizing the following log-likelihood: log P(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample. In this embodiment the score of the entire sequence equals the sum of the scores at the individual positions, and the score at each position comes from two parts: one is determined by the p_i output by the LSTM, the other by the transition matrix A of the CRF.
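To make the three layers and this training objective concrete, here is a minimal PyTorch sketch. It is an illustrative reconstruction under stated assumptions, not the applicant's code: the class and parameter names are invented, batches are assumed to share one sequence length (no padding mask), and the start/end transition terms of the full score formula are omitted for brevity.

    import torch
    import torch.nn as nn

    class BiLSTMCRF(nn.Module):
        """Look-up (embedding) layer + bidirectional LSTM layer + linear-chain CRF layer."""

        def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=200, dropout=0.5):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)   # BI (look-up) layer
            self.drop = nn.Dropout(dropout)                # dropout before the LSTM layer
            self.lstm = nn.LSTM(emb_dim, hidden_dim // 2,
                                bidirectional=True, batch_first=True)
            self.emit = nn.Linear(hidden_dim, num_tags)    # per-position scores P[i, tag]
            self.trans = nn.Parameter(torch.randn(num_tags, num_tags))  # transition matrix A

        def emissions(self, x):                            # x: (batch, seq_len) character ids
            h, _ = self.lstm(self.drop(self.emb(x)))       # forward/backward states, concatenated
            return self.emit(h)                            # (batch, seq_len, num_tags)

        def score(self, emis, tags):
            # score(x, y) = sum_i P[i, y_i] + sum_i A[y_{i-1}, y_i] (boundary terms omitted)
            emit_score = emis.gather(2, tags.unsqueeze(2)).squeeze(2).sum(1)
            trans_score = self.trans[tags[:, :-1], tags[:, 1:]].sum(1)
            return emit_score + trans_score

        def log_partition(self, emis):
            # Forward algorithm: log of the sum of exp(score(x, y')) over all tag sequences y'.
            alpha = emis[:, 0]                             # (batch, num_tags)
            for t in range(1, emis.size(1)):
                alpha = torch.logsumexp(
                    alpha.unsqueeze(2) + self.trans.unsqueeze(0) + emis[:, t].unsqueeze(1),
                    dim=1)
            return torch.logsumexp(alpha, dim=1)

        def neg_log_likelihood(self, x, tags):
            # -log P(y|x) = log sum_{y'} exp(score(x, y')) - score(x, y)
            emis = self.emissions(x)
            return (self.log_partition(emis) - self.score(emis, tags)).mean()

A training step would then, for a batch of character-index tensors x and tag-index tensors y, compute model.neg_log_likelihood(x, y), call backward(), and let the optimizer step; minimizing this quantity is exactly maximizing the log-likelihood above.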
From the description of the above implementations, those skilled in the art can clearly understand that the method of the above embodiment may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of this application, or the part that contributes over the prior art, may be embodied as a software product. That computer software product is stored on a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
Example 2
This embodiment also provides a device for identifying resumes. The device is used to implement the above embodiments and preferred implementations; what has already been explained is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and conceived.
Fig. 4 is a structural block diagram of a device for identifying resumes according to an embodiment of this application. As shown in Fig. 4, the device includes:
a receiving module 40, configured to receive a target resume to be identified;
an input module 42, configured to input the target resume into a deep neural language programming DNLP system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
a determining module 44, configured to determine, with the DNLP system, the resume template used by the target resume;
an extraction module 46, configured to extract the feature information in the target resume according to the resume template.
Optionally, the device further includes: a determining module, configured to determine multiple resume samples before the input module inputs the target resume into the deep neural language programming DNLP system; and a training module, configured to train the initial neural network of the BI-LSTM-CRF model with the multiple resume samples to obtain the DNLP system.
Optionally, the training module includes: a segmentation unit, configured to segment the resume text of each resume sample in a supervised-classification manner to obtain multiple text blocks that can correspond to manual labels, where each text block corresponds to one category attribute of the resume; an extraction unit, configured to perform word segmentation on the text blocks and extract the feature words of each text block; and a training unit, configured to train the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
Optionally, the segmentation unit includes: a segmentation subunit, configured to segment the following resume text in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience; and to annotate the resume text with label information.
Optionally, the extraction unit includes: an extraction subunit, configured to extract the feature words of each text block with the term frequency-inverse document frequency (TF-IDF) algorithm, where tfidf = tf * idf, each text block takes the top n words by tfidf as its feature words, and n is a positive integer greater than 1; where

tf_{i,j} = n_{i,j} / ∑_k n_{k,j}

with n_{i,j} the number of occurrences of the current word in text block d_j, the denominator the sum of the occurrence counts of all the words in d_j, and k ranging over the words of d_j; and

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

with |D| the total number of documents in the resume samples and |{j : t_i ∈ d_j}| the number of documents containing the word t_i.
Optionally, the training module includes: a first processing unit, configured to, in the BI layer of the BI-LSTM-CRF model, map each character in a sentence of the text block from a one-hot vector to a low-dimensional dense character vector with a pre-trained or randomly initialized embedding matrix, applying dropout before the next layer to mitigate overfitting; a second processing unit, configured to, in the LSTM layer of the BI-LSTM-CRF model, extract sentence features by feeding the feature-word sequence of a sentence to the bidirectional LSTM one element per time step and concatenating, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM at each position, yielding the complete hidden-state sequence and the output p_i, where p_i is the probability of belonging to tag i; and a third processing unit, configured to, in the CRF layer of the BI-LSTM-CRF model, perform sentence-level sequence labeling to obtain a linear CRF, where in the linear CRF the score assigned to the tag sequence y of a sentence x is:

score(x, y) = ∑_{i=1}^{n} P_{i, y_i} + ∑_{i=1}^{n+1} A_{y_{i-1}, y_i}

where y = (y1, y2, ..., yn) is a tag sequence whose length equals the sentence length and A is the transition matrix of the CRF layer; the normalized probability obtained with Softmax is:

P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y'))

where y' ranges over all candidate tag sequences.
Optionally, the third processing unit further includes: a processing subunit, configured to process the sample data by maximizing the following log-likelihood: log P(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), where (x, y_x) is a training sample.
It should be noted that each of the above modules may be implemented by software or hardware. For the latter, this may be achieved in, but is not limited to, the following manner: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
Example 3
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and there may be other divisions in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods described in the embodiments of this application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of this application also provides a storage medium in which a computer program is stored, where the computer program is configured to execute, when run, the steps of any one of the above method embodiments.
Optionally, in this embodiment, the above storage medium may be configured to store a computer program for executing the following steps:
S1: receiving a target resume to be identified;
S2: inputting the target resume into a deep neural language programming DNLP system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
S3: determining, with the DNLP system, the resume template used by the target resume;
S4: extracting the feature information in the target resume according to the resume template.
Optionally, in this embodiment, the above storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
An embodiment of this application also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any one of the above method embodiments.
Optionally, the above electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the above processor may be configured to execute the following steps through the computer program:
S1: receiving a target resume to be identified;
S2: inputting the target resume into a deep neural language programming DNLP system, where the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
S3: determining, with the DNLP system, the resume template used by the target resume;
S4: extracting the feature information in the target resume according to the resume template.
The above are only preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (20)

  1. A method for identifying resumes, the method comprising:
    receiving a target resume to be identified;
    inputting the target resume into a deep neural language programming DNLP system, wherein the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
    determining, with the DNLP system, a resume template used by the target resume; and
    extracting feature information in the target resume according to the resume template.
  2. The method according to claim 1, wherein before the target resume is input into the deep neural language programming DNLP system, the method further comprises:
    determining multiple resume samples; and
    training an initial neural network of the BI-LSTM-CRF model with the multiple resume samples to obtain the DNLP system.
  3. The method according to claim 2, wherein training the initial neural network of the BI-LSTM-CRF model with the multiple resume samples comprises:
    segmenting the resume text of each resume sample in a supervised-classification manner to obtain multiple text blocks that can correspond to manual labels, wherein each text block corresponds to one category attribute of the resume;
    performing word segmentation on the text blocks and extracting feature words of each text block; and
    training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
  4. The method according to claim 3, wherein segmenting the resume text of each resume sample in a supervised-classification manner comprises:
    segmenting the following resume text in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience; and
    annotating the resume text with label information.
  5. The method according to claim 3, wherein extracting the feature words of each text block comprises:
    extracting the feature words of each text block with the term frequency-inverse document frequency (TF-IDF) algorithm;
    wherein tfidf = tf * idf, each text block takes the top n words by tfidf as its feature words, and n is a positive integer greater than 1;
    wherein

    tf_{i,j} = n_{i,j} / ∑_k n_{k,j}

    where n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrence counts of all the words in d_j, and k ranges over the words of d_j; and

    idf_i = log( |D| / |{j : t_i ∈ d_j}| )

    where |D| is the total number of documents in the resume samples and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i.
  6. The method according to claim 3, wherein training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words comprises:
    in the BI layer of the BI-LSTM-CRF model, mapping each character in a sentence of the text block from a one-hot vector to a low-dimensional dense character vector with a pre-trained or randomly initialized embedding matrix, and applying dropout before the next layer to mitigate overfitting;
    in the LSTM layer of the BI-LSTM-CRF model, extracting sentence features by feeding the feature-word sequence of a sentence to the bidirectional LSTM one element per time step, and concatenating, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM at each position, to obtain the complete hidden-state sequence and the output p_i, wherein p_i is the probability of belonging to tag i; and
    in the CRF layer of the BI-LSTM-CRF model, performing sentence-level sequence labeling to obtain a linear CRF, wherein in the linear CRF the score assigned to the tag sequence y of a sentence x is:

    score(x, y) = ∑_{i=1}^{n} P_{i, y_i} + ∑_{i=1}^{n+1} A_{y_{i-1}, y_i}

    wherein y = (y1, y2, ..., yn) is a tag sequence whose length equals the sentence length and A is the transition matrix of the CRF layer;
    the normalized probability obtained with Softmax is:

    P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y'))

    wherein y' ranges over all candidate tag sequences.
  7. The method according to claim 6, wherein when the initial neural network of the BI-LSTM-CRF model is trained, the CRF layer of the BI-LSTM-CRF model processes the sample data by maximizing the following log-likelihood:
    log P(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y')));
    wherein (x, y_x) is a training sample.
  8. A device for identifying resumes, the device comprising:
    a receiving module, configured to receive a target resume to be identified;
    an input module, configured to input the target resume into a deep neural language programming DNLP system, wherein the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
    a determining module, configured to determine, with the DNLP system, a resume template used by the target resume; and
    an extraction module, configured to extract feature information in the target resume according to the resume template.
  9. The device according to claim 8, further comprising:
    a determining module, configured to determine multiple resume samples before the input module inputs the target resume into the deep neural language programming DNLP system; and a training module, configured to train an initial neural network of the BI-LSTM-CRF model with the multiple resume samples to obtain the DNLP system.
  10. The device according to claim 9, wherein the training module comprises:
    a segmentation unit, configured to segment the resume text of each resume sample in a supervised-classification manner to obtain multiple text blocks that can correspond to manual labels, wherein each text block corresponds to one category attribute of the resume; an extraction unit, configured to perform word segmentation on the text blocks and extract the feature words of each text block; and a training unit, configured to train the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
  11. The device according to claim 10, wherein the segmentation unit comprises:
    a segmentation subunit, configured to segment the following resume text in each resume sample: self-introduction, education experience, work experience, learning experience, and project experience, and to annotate the resume text with label information.
  12. The device according to claim 10, wherein the extraction unit comprises:
    an extraction subunit, configured to extract the feature words of each text block with the term frequency-inverse document frequency (TF-IDF) algorithm, wherein tfidf = tf * idf, each text block takes the top n words by tfidf as its feature words, and n is a positive integer greater than 1; wherein

    tf_{i,j} = n_{i,j} / ∑_k n_{k,j}

    wherein n_{i,j} is the number of occurrences of the current word in text block d_j, the denominator is the sum of the occurrence counts of all the words in d_j, and k ranges over the words of d_j; and

    idf_i = log( |D| / |{j : t_i ∈ d_j}| )

    wherein |D| is the total number of documents in the resume samples and |{j : t_i ∈ d_j}| is the number of documents containing the word t_i.
  13. The device according to claim 10, wherein the training module comprises:
    a first processing unit, configured to, in the BI layer of the BI-LSTM-CRF model, map each character in a sentence of the text block from a one-hot vector to a low-dimensional dense character vector with a pre-trained or randomly initialized embedding matrix, applying dropout before the next layer to mitigate overfitting; a second processing unit, configured to, in the LSTM layer of the BI-LSTM-CRF model, extract sentence features by feeding the feature-word sequence of a sentence to the bidirectional LSTM one element per time step and concatenating, position by position, the hidden-state sequence output by the forward LSTM with the hidden states output by the backward LSTM, to obtain the complete hidden-state sequence and the output p_i, wherein p_i is the probability of belonging to tag i; and a third processing unit, configured to, in the CRF layer of the BI-LSTM-CRF model, perform sentence-level sequence labeling to obtain a linear CRF, wherein in the linear CRF the score assigned to the tag sequence y of a sentence x is:

    score(x, y) = ∑_{i=1}^{n} P_{i, y_i} + ∑_{i=1}^{n+1} A_{y_{i-1}, y_i}

    wherein y = (y1, y2, ..., yn) is a tag sequence whose length equals the sentence length; the normalized probability obtained with Softmax is:

    P(y | x) = exp(score(x, y)) / ∑_{y'} exp(score(x, y'))

    wherein y' ranges over all candidate tag sequences.
  14. The device according to claim 13, wherein the third processing unit further comprises:
    a processing subunit, configured to process the sample data by maximizing the following log-likelihood: log P(y_x | x) = score(x, y_x) - log(∑_{y'} exp(score(x, y'))), wherein (x, y_x) is a training sample.
  15. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of a method for identifying resumes, comprising:
    receiving a target resume to be identified;
    inputting the target resume into a deep neural language programming DNLP system, wherein the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
    determining, with the DNLP system, a resume template used by the target resume; and
    extracting feature information in the target resume according to the resume template.
  16. The computer device according to claim 15, wherein before the target resume is input into the deep neural language programming DNLP system, the method further comprises:
    determining multiple resume samples; and
    training an initial neural network of the BI-LSTM-CRF model with the multiple resume samples to obtain the DNLP system.
  17. The computer device according to claim 15, wherein training the initial neural network of the BI-LSTM-CRF model with the multiple resume samples comprises:
    segmenting the resume text of each resume sample in a supervised-classification manner to obtain multiple text blocks that can correspond to manual labels, wherein each text block corresponds to one category attribute of the resume;
    performing word segmentation on the text blocks and extracting the feature words of each text block; and
    training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
  18. A computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of a method for identifying resumes, comprising:
    receiving a target resume to be identified;
    inputting the target resume into a deep neural language programming DNLP system, wherein the DNLP system is obtained by training a bidirectional long short-term memory recurrent neural network BI-LSTM-CRF model;
    determining, with the DNLP system, a resume template used by the target resume; and
    extracting feature information in the target resume according to the resume template.
  19. The computer storage medium according to claim 18, wherein before the target resume is input into the deep neural language programming DNLP system, the method further comprises:
    determining multiple resume samples; and
    training an initial neural network of the BI-LSTM-CRF model with the multiple resume samples to obtain the DNLP system.
  20. The computer storage medium according to claim 18, wherein training the initial neural network of the BI-LSTM-CRF model with the multiple resume samples comprises:
    segmenting the resume text of each resume sample in a supervised-classification manner to obtain multiple text blocks that can correspond to manual labels, wherein each text block corresponds to one category attribute of the resume;
    performing word segmentation on the text blocks and extracting the feature words of each text block; and
    training the initial neural network of the BI-LSTM-CRF model with the text blocks and the corresponding feature words.
PCT/CN2019/103268 2019-06-20 2019-08-29 Resume identification method and apparatus, and computer device and storage medium WO2020252919A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910534813.1A CN110442841B (en) 2019-06-20 2019-06-20 Resume identification method and device, computer equipment and storage medium
CN201910534813.1 2019-06-20

Publications (1)

Publication Number Publication Date
WO2020252919A1 true WO2020252919A1 (en) 2020-12-24

Family

ID=68428319

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103268 WO2020252919A1 (en) 2019-06-20 2019-08-29 Resume identification method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110442841B (en)
WO (1) WO2020252919A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541125A (en) * 2020-12-25 2021-03-23 北京百度网讯科技有限公司 Sequence labeling model training method and device and electronic equipment
CN112733550A (en) * 2020-12-31 2021-04-30 科大讯飞股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN112767106A (en) * 2021-01-14 2021-05-07 中国科学院上海高等研究院 Automatic auditing method, system, computer readable storage medium and auditing equipment
CN113076245A (en) * 2021-03-30 2021-07-06 山东英信计算机技术有限公司 Risk assessment method, device, equipment and storage medium of open source protocol
CN113361253A (en) * 2021-05-28 2021-09-07 北京金山数字娱乐科技有限公司 Recognition model training method and device
CN113627139A (en) * 2021-08-11 2021-11-09 平安国际智慧城市科技股份有限公司 Enterprise reporting form generation method, device, equipment and storage medium
CN114821603A (en) * 2022-03-03 2022-07-29 北京百度网讯科技有限公司 Bill recognition method, bill recognition device, electronic device and storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143517B (en) * 2019-12-30 2023-09-05 浙江阿尔法人力资源有限公司 Human selection label prediction method, device, equipment and storage medium
CN111144373B (en) * 2019-12-31 2020-12-04 广州市昊链信息科技股份有限公司 Information identification method and device, computer equipment and storage medium
CN111428480B (en) * 2020-03-06 2023-11-21 广州视源电子科技股份有限公司 Resume identification method, device, equipment and storage medium
CN111460084A (en) * 2020-04-03 2020-07-28 中国建设银行股份有限公司 Resume structured extraction model training method and system
CN111598462B (en) * 2020-05-19 2022-07-12 厦门大学 Resume screening method for campus recruitment
CN111966785B (en) * 2020-07-31 2023-06-20 中国电子科技集团公司第二十八研究所 Resume information extraction method based on stacking sequence labeling
CN113297845B (en) * 2021-06-21 2022-07-26 南京航空航天大学 Resume block classification method based on multi-level bidirectional circulation neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159962A (en) * 2015-08-21 2015-12-16 北京全聘致远科技有限公司 Position recommendation method and apparatus, resume recommendation method and apparatus, and recruitment platform
US20170300565A1 (en) * 2016-04-14 2017-10-19 Xerox Corporation System and method for entity extraction from semi-structured text documents
CN107943911A (en) * 2017-11-20 2018-04-20 北京大学深圳研究院 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
CN108664474A (en) * 2018-05-21 2018-10-16 众安信息技术服务有限公司 A kind of resume analytic method based on deep learning
CN109710930A (en) * 2018-12-20 2019-05-03 重庆邮电大学 A kind of Chinese Resume analytic method based on deep neural network

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6874002B1 (en) * 2000-07-03 2005-03-29 Magnaware, Inc. System and method for normalizing a resume
US20070005549A1 (en) * 2005-06-10 2007-01-04 Microsoft Corporation Document information extraction with cascaded hybrid model
CN107862303B (en) * 2017-11-30 2019-04-26 平安科技(深圳)有限公司 Information identifying method, electronic device and the readable storage medium storing program for executing of form class diagram picture
CN108897726B (en) * 2018-05-03 2021-11-16 平安科技(深圳)有限公司 Electronic resume creating method, storage medium and server
CN109214382A (en) * 2018-07-16 2019-01-15 顺丰科技有限公司 A kind of billing information recognizer, equipment and storage medium based on CRNN
CN109214385B (en) * 2018-08-15 2021-06-08 腾讯科技(深圳)有限公司 Data acquisition method, data acquisition device and storage medium
CN109635288B (en) * 2018-11-29 2023-05-23 东莞理工学院 Resume extraction method based on deep neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159962A (en) * 2015-08-21 2015-12-16 北京全聘致远科技有限公司 Position recommendation method and apparatus, resume recommendation method and apparatus, and recruitment platform
US20170300565A1 (en) * 2016-04-14 2017-10-19 Xerox Corporation System and method for entity extraction from semi-structured text documents
CN107943911A (en) * 2017-11-20 2018-04-20 北京大学深圳研究院 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
CN108664474A (en) * 2018-05-21 2018-10-16 众安信息技术服务有限公司 A kind of resume analytic method based on deep learning
CN109710930A (en) * 2018-12-20 2019-05-03 重庆邮电大学 A kind of Chinese Resume analytic method based on deep neural network

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541125A (en) * 2020-12-25 2021-03-23 北京百度网讯科技有限公司 Sequence labeling model training method and device and electronic equipment
CN112541125B (en) * 2020-12-25 2024-01-12 北京百度网讯科技有限公司 Sequence annotation model training method and device and electronic equipment
CN112733550A (en) * 2020-12-31 2021-04-30 科大讯飞股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN112733550B (en) * 2020-12-31 2023-07-25 科大讯飞股份有限公司 Knowledge distillation-based language model training method, text classification method and device
CN112767106A (en) * 2021-01-14 2021-05-07 中国科学院上海高等研究院 Automatic auditing method, system, computer readable storage medium and auditing equipment
CN112767106B (en) * 2021-01-14 2023-11-07 中国科学院上海高等研究院 Automatic auditing method, system, computer readable storage medium and auditing equipment
CN113076245A (en) * 2021-03-30 2021-07-06 山东英信计算机技术有限公司 Risk assessment method, device, equipment and storage medium of open source protocol
CN113361253A (en) * 2021-05-28 2021-09-07 北京金山数字娱乐科技有限公司 Recognition model training method and device
CN113361253B (en) * 2021-05-28 2024-04-09 北京金山数字娱乐科技有限公司 Recognition model training method and device
CN113627139A (en) * 2021-08-11 2021-11-09 平安国际智慧城市科技股份有限公司 Enterprise reporting form generation method, device, equipment and storage medium
CN114821603A (en) * 2022-03-03 2022-07-29 北京百度网讯科技有限公司 Bill recognition method, bill recognition device, electronic device and storage medium
CN114821603B (en) * 2022-03-03 2023-09-01 北京百度网讯科技有限公司 Bill identification method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110442841A (en) 2019-11-12
CN110442841B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
WO2020252919A1 (en) Resume identification method and apparatus, and computer device and storage medium
CN110569366B (en) Text entity relation extraction method, device and storage medium
CN109145153B (en) Intention category identification method and device
CN107729309B (en) Deep learning-based Chinese semantic analysis method and device
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
WO2021068329A1 (en) Chinese named-entity recognition method, device, and computer-readable storage medium
CN110502621A (en) Answering method, question and answer system, computer equipment and storage medium
CN108304373B (en) Semantic dictionary construction method and device, storage medium and electronic device
CN110909549B (en) Method, device and storage medium for punctuating ancient Chinese
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN108804423B (en) Medical text feature extraction and automatic matching method and system
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
WO2021051574A1 (en) English text sequence labelling method and system, and computer device
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN105760363B (en) Word sense disambiguation method and device for text file
CN112131881B (en) Information extraction method and device, electronic equipment and storage medium
CN112215008A (en) Entity recognition method and device based on semantic understanding, computer equipment and medium
CN108550065A (en) comment data processing method, device and equipment
Panda Developing an efficient text pre-processing method with sparse generative Naive Bayes for text mining
CN112188312A (en) Method and apparatus for determining video material of news
CN106897274B (en) Cross-language comment replying method
CN114840685A (en) Emergency plan knowledge graph construction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933488

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19933488

Country of ref document: EP

Kind code of ref document: A1