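WO2021174919A1 - Method, device, electronic device, and medium for analyzing and matching resume data information - Google Patents

Method, device, electronic device, and medium for analyzing and matching resume data information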

Info

Publication number
WO2021174919A1
WO2021174919A1 PCT/CN2020/131916 CN2020131916W WO2021174919A1 WO 2021174919 A1 WO2021174919 A1 WO 2021174919A1 CN 2020131916 W CN2020131916 W CN 2020131916W WO 2021174919 A1 WO2021174919 A1 WO 2021174919A1
Authority
WO
WIPO (PCT)
Prior art keywords
resume
word
sequence
label
word segmentation
Prior art date
Application number
PCT/CN2020/131916
Other languages
English (en)
French (fr)
Inventor
侯丽
周慧娟
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010151399.9A (CN111428488B)
Application filed by 平安科技(深圳)有限公司
Publication of WO2021174919A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/105 Human resources
    • G06Q10/1053 Employment or hiring

Definitions

  • This application relates to the field of data processing technology, and in particular to a method, device, electronic device, and medium for analyzing and matching resume data information.
  • a method for analyzing and matching resume data information comprising:
  • the similarity between each label in the resume label sequence and the label of each post is calculated, and a resume matching each post is determined from the resume to be parsed according to the calculated similarity.
  • a device for analyzing and matching resume data information comprising:
  • the preprocessing unit is used to retrieve resumes from the database and preprocess the retrieved resumes to obtain resumes to be parsed;
  • the construction unit is used to construct the word segmentation directed acyclic graph according to the pre-built word segmentation dictionary, and segment the to-be-analyzed resume according to the constructed word segmentation directed acyclic graph to obtain the resume text after word segmentation processing;
  • a determining unit configured to construct a co-occurrence matrix according to the resume text that has undergone word segmentation processing, and determine the keywords of the resume text based on the co-occurrence matrix;
  • a processing unit configured to obtain the word sequence in the keywords, and use a word representation model to process the word sequence to obtain a word representation of the word sequence;
  • a prediction unit configured to input the word representation into the constructed resume label analysis model to obtain the predicted resume label sequence;
  • the determining unit is further configured to calculate the similarity between each label in the resume label sequence and the label of each post, and determine a resume matching each post from the resumes to be parsed according to the calculated similarity.
  • An electronic device, which includes:
  • a memory storing at least one instruction; and
  • a processor that executes the instructions stored in the memory to implement the following steps:
  • the similarity between each label in the resume label sequence and the label of each post is calculated, and a resume matching each post is determined from the resume to be parsed according to the calculated similarity.
  • a computer-readable storage medium storing at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
  • the similarity between each label in the resume label sequence and the label of each post is calculated, and a resume matching each post is determined from the resume to be parsed according to the calculated similarity.
  • Fig. 1 is a flowchart of a preferred embodiment of a method for analyzing and matching resume data information of the present application.
  • Fig. 2 is a functional module diagram of a preferred embodiment of the apparatus for analyzing and matching resume data information of the present application.
  • FIG. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of a method for analyzing and matching resume data information according to the present application.
  • FIG. 4 is a schematic diagram of a co-occurrence matrix in a preferred embodiment of the method for analyzing and matching resume data information according to the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of the method for analyzing and matching resume data information of the present application. Depending on requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.
  • the resume data information analysis and matching method is applied to one or more electronic devices.
  • the electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
  • the electronic device may be any electronic product that can perform human-computer interaction with the user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive network television (Internet Protocol Television, IPTV), smart wearable devices, etc.
  • the electronic device may also include a network device and/or user equipment.
  • the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
  • the network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), etc.
  • the database may be a database that communicates with the electronic device, or may be an internal database of the electronic device, and can be customized according to different requirements.
  • the database may be a talent pool.
  • the electronic device retrieves and organizes resumes from the talent pool to obtain a large number of resumes.
  • the resume can be summarized as a set of nouns {name, gender, birthday, political affiliation, school, education, major, contact information, hometown, education experience, skills, ...}, each of which has an expanded description, and the items are separated by a separator. Because job hunting is a particular kind of social behavior and people imitate one another, many job seekers describe their own characteristics in very similar ways.
  • from a large number of resumes with common characteristics, the electronic device parses out the resumes that contain the content the resume screener is interested in and cares about, and forms a generally convergent, limited resume set as the retrieved resumes.
  • the electronic device may first remove duplicate resumes, thereby realizing the de-duplication of resumes.
  • the preprocessing of the retrieved resume by the electronic device includes:
  • the electronic device adopts a stop word list filtering method to perform stop word removal processing on the retrieved resume.
  • stop words are function words in the text data that carry no actual meaning. They have no effect on the classification of the text but occur very frequently, and may specifically include commonly used pronouns, prepositions, and the like.
  • leaving the stop words in place reduces the accuracy of text classification.
  • the electronic device can match the words in the retrieved resume against a pre-built stop word list one by one. If a word matches, it is a stop word, and the electronic device deletes it.
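  • The de-duplication and stop-word filtering described above can be sketched as follows. This is a minimal illustration only: the stop word list `STOP_WORDS` and the function names are hypothetical, and a real system would load a much larger pre-built list.

```python
# Hypothetical stop word list; a real system would load a much larger, pre-built list.
STOP_WORDS = {"the", "a", "of", "in", "and", "i", "my"}

def deduplicate(resumes):
    """Drop exact duplicate resumes while preserving their original order."""
    seen = set()
    unique = []
    for resume in resumes:
        if resume not in seen:
            seen.add(resume)
            unique.append(resume)
    return unique

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Match each token against the stop word list and drop the matches."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "I led the design of a payment system in Java".split()
print(remove_stop_words(tokens))
```

The match here is case-insensitive, which is a design choice of this sketch rather than something the text specifies.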
  • S11 Construct a word segmentation directed acyclic graph according to the pre-built word segmentation dictionary, and segment the to-be-analyzed resume according to the constructed word segmentation directed acyclic graph to obtain a resume text after word segmentation processing.
  • the word segmentation dictionary may include a prefix dictionary, a custom dictionary, and the like.
  • the prefix dictionary includes the prefix of each word segmentation in the dictionary.
  • for example, the prefixes of the word "北京大学" (Peking University) in the dictionary are "北", "北京", and "北京大"; the prefix of the word "大学" (university) is "大". The custom dictionary can also be called a proper noun dictionary; it holds specific, proprietary words of a certain field that do not exist in the statistical dictionary, such as resume, work experience, etc.
  • the electronic device constructs a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, wherein each word corresponds to a directed edge in the graph and is assigned a corresponding edge length (weight). Further, the electronic device calculates the lengths of all paths from the start point to the end point and arranges the length values in strictly ascending order (that is, the values at any two different positions must be different; the same applies below). The sets of paths ranked 1st, 2nd, ..., i-th, ..., N-th are used as the corresponding rough segmentation result set. If two or more paths are equal in length, they are tied at the same rank i and must all be included in the rough result set without affecting the ranks of other paths, so the size of the final rough result set is greater than or equal to N. The resume text after word segmentation is obtained accordingly.
  • the word segmentation result of the resume text can be quickly obtained by using the word segmentation dictionary and the directed acyclic graph.
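  • The graph construction and N-shortest-path rough segmentation described above can be sketched as follows. This is a minimal sketch under stated assumptions: every edge has weight 1 (so path length equals word count), single characters are always allowed as edges, and all paths are enumerated by brute force, which is only practical for short strings. The function names and the sample dictionary are hypothetical.

```python
from collections import defaultdict

def build_dag(text, dictionary):
    """Map each start index to the end indices reachable by one word.
    Single characters are always allowed so every position stays connected."""
    dag = {}
    for i in range(len(text)):
        ends = [i + 1]
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in dictionary:
                ends.append(j)
        dag[i] = ends
    return dag

def all_paths(text, dag, start=0):
    """Enumerate every segmentation, i.e. every path from start to the end of text."""
    if start == len(text):
        yield []
        return
    for end in dag[start]:
        for rest in all_paths(text, dag, end):
            yield [text[start:end]] + rest

def n_shortest(text, dictionary, n=1):
    """Return all segmentations whose length is among the n smallest distinct lengths.
    Paths of equal length share a rank, so the result may hold more than n paths."""
    by_length = defaultdict(list)
    for path in all_paths(text, build_dag(text, dictionary)):
        by_length[len(path)].append(path)  # edge weight 1: length = word count
    kept_lengths = sorted(by_length)[:n]
    return [p for length in kept_lengths for p in by_length[length]]

words = {"new", "york", "newyork", "city"}
print(n_shortest("newyorkcity", words, n=2))
```

A production segmenter would typically replace the brute-force enumeration with dynamic programming over the graph and use frequency-based edge weights.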
  • S12 Construct a co-occurrence matrix according to the resume text that has undergone word segmentation processing, and determine keywords of the resume text based on the co-occurrence matrix.
  • the electronic device constructs a co-occurrence matrix according to the resume text, and determining the keywords of the resume text based on the co-occurrence matrix includes:
  • the electronic device constructs the co-occurrence matrix according to the number of occurrences of each segmented word in the resume text, and extracts the word frequency (freq) and degree (deg) of each segmented word from the co-occurrence matrix. The electronic device calculates the score of each segmented word from its word frequency and degree, and then outputs the segmented words in descending order of score to obtain the keywords of the resume text.
  • the electronic device outputs the segmented words in descending order of score and takes the first n words, for example the top 1/3 of the words by score, as the keywords of the resume text.
  • the co-occurrence matrix counts the number of co-occurrences of words in a window of a predetermined size, and uses the number of co-occurring words around the word as the vector of the current word.
  • the constructed co-occurrence matrix X is shown in Figure 4.
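  • The keyword extraction described above can be sketched as follows. The text does not give the scoring formula, so this sketch assumes the RAKE-style score deg / freq, where freq is the diagonal entry of the co-occurrence matrix and deg is the row sum. The window size, the function names, and the tie-breaking rule are likewise assumptions.

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=2):
    """Count how often each pair of words appears within `window` positions of each other."""
    matrix = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        matrix[word][word] += 1  # the diagonal holds the word frequency
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != word:
                matrix[word][tokens[j]] += 1
    return matrix

def keyword_scores(matrix):
    """Score each word as deg / freq (assumed formula, not stated in the text)."""
    scores = {}
    for word, row in matrix.items():
        freq = row[word]
        deg = sum(row.values())  # row sum, including the diagonal
        scores[word] = deg / freq
    return scores

def top_keywords(tokens, fraction=1 / 3, window=2):
    """Return the top `fraction` of words in descending order of score."""
    scores = keyword_scores(cooccurrence_matrix(tokens, window))
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]

tokens = ["python", "developer", "python", "sql"]
print(top_keywords(tokens, window=1))
```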
  • the method further includes:
  • the electronic device merges the two keywords into a new keyword.
  • the preset value may be 2 times and so on.
  • S13 Obtain the word sequence in the keyword, and use a word representation model to perform word representation processing on the word sequence to obtain a word representation of the word sequence.
  • the electronic device uses a word representation model to process the word sequence, and obtaining the word representation of the word sequence includes:
  • the electronic device inputs the word sequence in the keywords into the word representation model. By reading the word sequence in the forward direction, it generates a first vector containing the word sequence and its preceding context; by reading the word sequence in the reverse direction, it generates a second vector containing the word sequence and its following context. The electronic device concatenates the first vector and the second vector to obtain a word representation that covers the word sequence and its context information.
  • the electronic device obtains the word representation of the word sequence.
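  • The forward and reverse reading described above can be illustrated with a toy example. This is not the word representation model of the application: the embeddings are made-up values, and the simple tanh recurrence stands in for a trained recurrent network such as an LSTM. It only shows how a forward state and a backward state are produced per word and then concatenated.

```python
import math

# Hypothetical 2-dimensional embeddings; a real model would learn these values.
EMBEDDINGS = {"java": [0.9, 0.1], "developer": [0.2, 0.8], "beijing": [0.5, 0.5]}

def read_sequence(vectors, decay=0.5):
    """Toy recurrent pass: each state mixes the current word vector with the previous state."""
    state = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:
        state = [math.tanh(x + decay * h) for x, h in zip(vec, state)]
        states.append(state)
    return states

def bidirectional_representation(words):
    """Concatenate, per word, the forward state and the backward state."""
    vectors = [EMBEDDINGS[w] for w in words]
    forward = read_sequence(vectors)               # word plus preceding context
    backward = read_sequence(vectors[::-1])[::-1]  # word plus following context
    return [f + b for f, b in zip(forward, backward)]

reps = bidirectional_representation(["java", "developer", "beijing"])
print(len(reps), len(reps[0]))
```

Each word ends up with a vector twice the embedding size, the first half carrying left context and the second half carrying right context.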
  • word representation models can be used to express the symbolic information of "words" into a mathematical vector form.
  • the vector representation of words can be used as input to various machine learning models.
  • the existing word representation models can include two categories: one is syntagmatic models, and the other is paradigmatic models.
  • the electronic device may further use regular expression matching to format it, and then analyze and classify it, and store it in a designated database for subsequent use.
  • the resume label analysis model is obtained by training a large amount of resume data as a training sample and verifying it with a verification set. Using the resume label analysis model to analyze unstructured word representations, corresponding labels can be output to form the resume label sequence.
  • the tags in the resume tag sequence may include, but are not limited to: undergraduates, postgraduates, proficiency in WORD, and so on.
  • the method further includes:
  • the electronic device obtains resume data and splits the resume data to obtain a training set and a verification set. Further, it trains a CRF model with the training set, uses a conditional log-likelihood function and a maximum score formula to predict the target label sequence, and verifies the target label sequence with the verification set. When the target label sequence passes the verification, training stops and the resume label analysis model is obtained.
  • $y^{*}$ refers to the predicted most suitable tag sequence.
  • $P$ represents the output score matrix of the bidirectional LSTM (long short-term memory) algorithm; its size is $n \times k$, where $k$ represents the number of target tags, which is the summary evaluation of the resume, $n$ represents the length of the word sequence, and $A$ represents the transition score matrix.
  • $y_0$ represents the start of a sequence.
  • $y_{n+1}$ represents the end of the sequence.
  • the size of the square matrix $A$ is $k+2$.
  • $Y_{Wd}$ represents all possible tag sequences corresponding to the resume information sequence $Wd$.
  • training maximizes the conditional log-likelihood of the correct label sequence, and the maximum score formula is used to predict the most suitable label sequence:
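  • The formulas themselves do not appear in this text. The symbol definitions given here ($P$ of size $n \times k$, a transition matrix $A$ of size $k+2$, start and end tags $y_0$ and $y_{n+1}$, and the set $Y_{Wd}$) match the widely used BiLSTM-CRF formulation, so the following is offered as an assumed reconstruction under that formulation, not as a quotation from the application. The names $s$ and $\tilde{y}$ are introduced here for illustration.

```latex
% Score of a tag sequence y = (y_1, ..., y_n) for the input sequence Wd
s(Wd, y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

% Conditional log-likelihood of the correct tag sequence, maximized during training
\log p(y \mid Wd) = s(Wd, y) - \log \sum_{\tilde{y} \in Y_{Wd}} e^{s(Wd, \tilde{y})}

% Maximum score formula used at prediction time
y^{*} = \operatorname*{arg\,max}_{\tilde{y} \in Y_{Wd}} s(Wd, \tilde{y})
```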
  • S15 Calculate the similarity between each label in the resume label sequence and the label of each post, and determine a resume matching each post from the resume to be parsed according to the calculated similarity.
  • the electronic device calculating the similarity between each label in the resume label sequence and the label of each post, and determining a resume matching each post from the resumes to be parsed according to the calculated similarity, includes:
  • the electronic device calculates the cosine distance between each tag and the tag of each post. When the cosine distance between a target tag and a target post is less than or equal to the preset distance, the electronic device retrieves the target resume corresponding to the target tag from the resumes to be parsed and determines that the target resume matches the target post.
  • the cosine distance uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are.
  • the resulting similarity ranges from -1 to 1, where -1 means that the two vectors point in exactly opposite directions, 1 means that their directions are exactly the same, 0 usually means that they are independent, and values in between indicate intermediate similarity or dissimilarity. With this algorithm, a resume with higher label similarity can be selected for each position for quick matching and onboarding.
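  • The matching rule described above can be sketched as follows. The text speaks of both a cosine distance threshold and a similarity in the range -1 to 1; this sketch assumes distance = 1 - similarity, which is a common convention but is not stated in the text. The tag vectors, the threshold value, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)

def match_resumes(resume_tags, post_tags, max_distance=0.2):
    """resume_tags maps a resume id to its tag vectors; post_tags maps a post to its tag vector.
    A resume matches a post when any of its tags lies within max_distance of the post tag."""
    matches = {post: [] for post in post_tags}
    for post, post_vec in post_tags.items():
        for resume_id, vectors in resume_tags.items():
            if any(cosine_distance(v, post_vec) <= max_distance for v in vectors):
                matches[post].append(resume_id)
    return matches

resume_tags = {"r1": [[1.0, 0.0]], "r2": [[0.0, 1.0]]}
post_tags = {"backend": [1.0, 0.0]}
print(match_resumes(resume_tags, post_tags))
```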
  • the electronic device may also assign weights to the labels in the obtained resume label sequence according to configuration (for example, the graduate student label has a weight of 0.2 in the resume score, and the undergraduate label has a weight of 0.1), represent the resume label sequence by a score, and then quickly screen the required employees based on the score.
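  • The weighted scoring described above can be sketched as follows. The two weights come from the example in the text; giving every other label a weight of 0.0, the label strings, and the function names are assumptions of this sketch.

```python
# Weights for these two labels follow the example in the text;
# any label not listed gets 0.0 here, which is an assumption.
LABEL_WEIGHTS = {"graduate": 0.2, "undergraduate": 0.1}

def resume_score(labels, weights=LABEL_WEIGHTS):
    """Represent a resume label sequence by the sum of its label weights."""
    return sum(weights.get(label, 0.0) for label in labels)

def screen(resumes, min_score):
    """Return the ids of resumes scoring at least min_score, highest score first."""
    scored = {rid: resume_score(labels) for rid, labels in resumes.items()}
    passing = [rid for rid, score in scored.items() if score >= min_score]
    return sorted(passing, key=lambda rid: (-scored[rid], rid))

resumes = {
    "a": ["undergraduate"],
    "b": ["graduate"],
    "c": ["proficient in WORD"],
}
print(screen(resumes, min_score=0.1))
```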
  • this application retrieves resumes from the database and preprocesses them to obtain resumes to be parsed. It constructs a word segmentation directed acyclic graph from the pre-built word segmentation dictionary and segments the resumes to be parsed according to that graph to obtain the resume text after word segmentation, so the word segmentation result of the resumes to be parsed is obtained quickly. It then constructs a co-occurrence matrix from the resume text, determines the keywords of the resume text based on the co-occurrence matrix, obtains the word sequence in the keywords, and processes the word sequence with the word representation model to obtain the word representation of the word sequence, which improves the analysis effect. The word representation is input into the constructed resume label analysis model to obtain the predicted resume label sequence. The similarity between each label in the resume label sequence and the label of each post is then calculated, and a resume matching each post is determined from the resumes to be parsed according to the calculated similarity, so as to realize quick and accurate intelligent matching of
  • the resume data information analysis and matching device 11 includes a preprocessing unit 110, a construction unit 111, a determination unit 112, a processing unit 113, a prediction unit 114, a merging unit 115, a training unit 116, an acquisition unit 117, a split unit 118, and a verification unit 119.
  • The module/unit referred to in this application is a series of computer program segments that can be executed by the processor 13, can complete fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • the preprocessing unit 110 retrieves a resume from the database, and preprocesses the retrieved resume to obtain a resume to be analyzed.
  • the database may be a database that communicates with an electronic device, or an internal database of the electronic device, which can be customized according to different requirements.
  • the database may be a talent pool.
  • the preprocessing unit 110 retrieves and organizes resumes from the talent pool to obtain a large number of resumes.
  • the resume can be summarized as a set of nouns {name, gender, birthday, political affiliation, school, education, major, contact information, hometown, education experience, skills, ...}, each of which has an expanded description, and the items are separated by a separator. Because job hunting is a particular kind of social behavior and people imitate one another, many job seekers describe their own characteristics in very similar ways.
  • from a large number of resumes with common characteristics, the preprocessing unit 110 parses out the resumes that contain the content the resume screener is interested in and cares about, and forms a generally convergent, limited resume set as the retrieved resumes.
  • the preprocessing unit 110 preprocessing the retrieved resume includes:
  • the preprocessing unit 110 uses a stop word list filtering method to remove stop words on the retrieved resume.
  • stop words are function words in the text data that carry no actual meaning. They have no effect on the classification of the text but occur very frequently, and may specifically include commonly used pronouns, prepositions, and the like.
  • leaving the stop words in place reduces the accuracy of text classification.
  • the preprocessing unit 110 may match the words in the retrieved resume against a pre-built stop word list one by one. If a word matches, it is a stop word, and the preprocessing unit 110 deletes it.
  • the construction unit 111 constructs a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, and segments the to-be-analyzed resume according to the constructed word segmentation directed acyclic graph to obtain a resume text after word segmentation processing.
  • the word segmentation dictionary may include a prefix dictionary, a custom dictionary, and the like.
  • the prefix dictionary includes the prefixes of each word in the statistical dictionary.
  • for example, the prefixes of the word "北京大学" (Peking University) in the dictionary are "北", "北京", and "北京大".
  • the custom dictionary can also be called a proper noun dictionary; it holds specific, proprietary words of a certain field that do not exist in the statistical dictionary, such as resume, work experience, etc.
  • the construction unit 111 constructs a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, wherein each word corresponds to a directed edge in the graph and is assigned a corresponding edge length (weight). Further, the construction unit 111 calculates the lengths of all paths from the start point to the end point and arranges the length values in strictly ascending order (that is, the values at any two different positions must be different; the same applies below). The sets of paths ranked 1st, 2nd, ..., i-th, ..., N-th are used as the corresponding rough segmentation result set. If two or more paths are equal in length, they are tied at the same rank i and must all be included in the rough result set without affecting the ranks of other paths, so the size of the final rough result set is greater than or equal to N. The resume text after word segmentation is obtained accordingly.
  • the word segmentation result of the resume text can be quickly obtained by using the word segmentation dictionary and the directed acyclic graph.
  • the determining unit 112 constructs a co-occurrence matrix according to the resume text, and determines the keywords of the resume text based on the co-occurrence matrix.
  • the determining unit 112 constructs a co-occurrence matrix according to the resume text, and determining the keywords of the resume text based on the co-occurrence matrix includes:
  • the determining unit 112 constructs the co-occurrence matrix according to the number of occurrences of each segmented word in the resume text, and extracts the word frequency (freq) and degree (deg) of each segmented word from the co-occurrence matrix. The determining unit 112 calculates the score of each segmented word from its word frequency and degree, and then outputs the segmented words in descending order of score to obtain the keywords of the resume text.
  • the determining unit 112 outputs the segmented words in descending order of score and takes the first n words, for example the top 1/3 of the words by score, as the keywords of the resume text.
  • the co-occurrence matrix counts the number of co-occurrences of words in a window of a predetermined size, and uses the number of co-occurring words around the word as the vector of the current word.
  • the constructed co-occurrence matrix X is shown in Figure 4.
  • the method further includes:
  • the merging unit 115 merges the two keywords into a new keyword.
  • the preset value may be 2 times and so on.
  • the processing unit 113 acquires the word sequence in the keyword, and uses a word representation model to perform word representation processing on the word sequence to obtain a word representation of the word sequence.
  • the processing unit 113 uses a word representation model to process the word sequence, and obtaining the word representation of the word sequence includes:
  • the processing unit 113 inputs the word sequence in the keywords into the word representation model. By reading the word sequence in the forward direction, it generates a first vector containing the word sequence and its preceding context; by reading the word sequence in the reverse direction, it generates a second vector containing the word sequence and its following context. The processing unit 113 concatenates the first vector and the second vector to obtain a word representation that covers the word sequence and its context information.
  • the processing unit 113 obtains the word representation of the word sequence.
  • word representation models can be used to express the symbolic information of "words" into a mathematical vector form.
  • the vector representation of words can be used as input to various machine learning models.
  • the existing word representation models can include two categories: one is syntagmatic models, and the other is paradigmatic models.
  • the electronic device may further use regular expression matching to format it, and then analyze and classify it, and store it in a designated database for subsequent use.
  • the prediction unit 114 inputs the word representation into the constructed resume label analysis model to obtain a predicted resume label sequence.
  • the resume label analysis model is obtained by training a large amount of resume data as a training sample and verifying it with a verification set. Using the resume label analysis model to analyze unstructured word representations, corresponding labels can be output to form the resume label sequence.
  • the tags in the resume tag sequence may include, but are not limited to: undergraduates, postgraduates, proficiency in WORD, and so on.
  • training the resume label analysis model includes:
  • the obtaining unit 117 obtains resume data, and the splitting unit 118 splits the resume data to obtain a training set and a verification set. Further, the training unit 116 trains the CRF model with the training set, using a conditional log-likelihood function and a maximum score formula to predict the target label sequence, and the verification unit 119 verifies the target label sequence with the verification set. When the target label sequence passes the verification, the training unit 116 stops training and the resume label analysis model is obtained.
  • $y^{*}$ refers to the predicted most suitable tag sequence.
  • $P$ represents the output score matrix of the bidirectional LSTM (long short-term memory) algorithm; its size is $n \times k$, where $k$ represents the number of target tags, which is the summary evaluation of the resume, $n$ represents the length of the word sequence, and $A$ represents the transition score matrix.
  • $y_0$ represents the start of a sequence.
  • $y_{n+1}$ represents the end of the sequence.
  • the size of the square matrix $A$ is $k+2$.
  • $Y_{Wd}$ represents all possible tag sequences corresponding to the resume information sequence $Wd$.
  • the training unit 116 maximizes the conditional log-likelihood of the correct label sequence, and uses the maximum score formula to predict the most suitable label sequence:
  • the determining unit 112 calculates the similarity between each label in the resume label sequence and the label of each post, and determines a resume matching each post from the resume to be parsed according to the calculated similarity.
  • the determining unit 112 calculating the similarity between each label in the resume label sequence and the label of each post, and determining a resume matching each post from the resumes to be parsed according to the calculated similarity, includes:
  • the determining unit 112 calculates the cosine distance between each tag and the tag of each post. When the cosine distance between a target tag and a target post is less than or equal to the preset distance, the determining unit 112 retrieves the target resume corresponding to the target tag from the resumes to be parsed and determines that the target resume matches the target post.
  • the cosine distance uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are.
  • the resulting similarity ranges from -1 to 1, where -1 means that the two vectors point in exactly opposite directions, 1 means that their directions are exactly the same, 0 usually means that they are independent, and values in between indicate intermediate similarity or dissimilarity. With this algorithm, a resume with higher label similarity can be selected for each position for quick matching and onboarding.
  • the determining unit 112 may also assign weights to the labels in the obtained resume label sequence according to configuration (for example, the graduate student label has a weight of 0.2 in the resume score, and the undergraduate label has a weight of 0.1), represent the resume label sequence by a score, and then quickly screen the required employees based on the score.
  • this application retrieves resumes from the database and preprocesses them to obtain resumes to be parsed. It constructs a word segmentation directed acyclic graph from the pre-built word segmentation dictionary and segments the resumes to be parsed according to that graph to obtain the resume text, so the word segmentation result of the resumes to be parsed is obtained quickly. It then constructs a co-occurrence matrix from the resume text, determines the keywords of the resume text based on the co-occurrence matrix, obtains the word sequence in the keywords, and processes the word sequence with the word representation model to obtain the word representation of the word sequence, which improves the analysis effect. The word representation is input into the constructed resume label analysis model to obtain the predicted resume label sequence. The similarity between each label in the resume label sequence and the label of each post is then calculated, and according to the calculated similarity, from the resumes to be parsed, the resume that
  • FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the method for analyzing and matching resume data information of the present application.
  • the electronic device 1 may include a memory 12, a processor 13, and a bus, and may also include a computer program stored in the memory 12 and running on the processor 13, such as a resume data information analysis and matching program.
  • the electronic device 1 may have a bus structure or a star structure.
  • the electronic device 1 may also include more or fewer hardware or software components than shown in the figure, or a different arrangement of components.
  • the electronic device 1 may also include an input/output device, a network access device, and the like.
  • the electronic device 1 is only an example. If other existing or future electronic products can be adapted to this application, they should also be included in the scope of protection of this application and are included here by reference.
  • the memory 12 includes at least one type of readable storage medium, the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 12 may be an internal storage unit of the electronic device 1 in some embodiments, for example, a mobile hard disk of the electronic device 1.
  • in other embodiments, the memory 12 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.
  • the memory 12 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 12 can be used not only to store application software and various types of data installed in the electronic device 1, such as resume data information analysis and matching program codes, etc., but also to temporarily store data that has been output or will be output.
  • the processor 13 may be composed of integrated circuits in some embodiments, for example, a single packaged integrated circuit or multiple packaged integrated circuits with the same or different functions, including a combination of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips.
  • the processor 13 is the control unit of the electronic device 1. It uses various interfaces and lines to connect the components of the entire electronic device 1, runs or executes the programs or modules stored in the memory 12 (such as the resume data information analysis and matching program), and calls the data stored in the memory 12 to execute the various functions of the electronic device 1 and process data.
  • the processor 13 executes the operating system of the electronic device 1 and various installed applications.
  • the processor 13 executes the application program to implement the steps in the foregoing embodiments of the resume data information analysis and matching method, such as steps S10, S11, S12, S13, S14, and S15 shown in FIG. 1.
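  • the processor 13 implements the functions of the modules/units in the foregoing device embodiments when executing the computer program, for example:
  • resumes are retrieved from the database, and the retrieved resumes are preprocessed to obtain the resume to be parsed;
  • a word segmentation directed acyclic graph is constructed according to the pre-built word segmentation dictionary, and the resume to be parsed is segmented according to the constructed graph to obtain the segmented resume text;
  • a co-occurrence matrix is constructed according to the segmented resume text, and the keywords of the resume text are determined based on the co-occurrence matrix;
  • the word sequence in the keywords is obtained and processed with the word representation model to obtain the word representation of the word sequence;
  • the word representation is input into the constructed resume label analysis model to obtain the predicted resume label sequence;
  • the similarity between each label in the resume label sequence and the label of each post is calculated, and a resume matching each post is determined from the resume to be parsed according to the calculated similarity.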
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to complete the present invention.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program in the electronic device 1.
  • the computer program can be divided into a preprocessing unit 110, a construction unit 111, a determination unit 112, a processing unit 113, a prediction unit 114, a merging unit 115, a training unit 116, an acquisition unit 117, a split unit 118, and a verification unit 119.
  • the above-mentioned integrated unit implemented in the form of a software function module may be stored in a computer-readable storage medium, and the computer-readable storage medium may be non-volatile or volatile.
  • the above-mentioned software function module is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, a computer device, or a network device, etc.) or a processor to execute part of the methods described in the various embodiments of the present application.
  • if the integrated module/unit of the electronic device 1 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be completed by a computer program instructing the relevant hardware devices.
  • the computer program can be stored in a computer-readable storage medium. When the computer program is executed by the processor, it can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, U disk, mobile hard disk, magnetic disk, optical disk, computer memory, read-only memory (ROM, Read-Only Memory).
  • the bus may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one arrow is used to indicate in FIG. 3, but it does not mean that there is only one bus or one type of bus.
  • the bus is configured to implement connection and communication between the memory 12 and at least one processor 13 and the like.
  • the electronic device 1 may also include a power source (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the at least one processor 13 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • the power supply may also include any components such as one or more DC or AC power supplies, recharging devices, power failure detection circuits, power converters or inverters, and power status indicators.
  • the electronic device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface.
  • the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may also include a user interface.
  • the user interface may be a display and an input unit (such as a keyboard).
  • the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • FIG. 3 only shows the electronic device 1 with components 12-13. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of some components, or a different arrangement of components.
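  • the memory 12 in the electronic device 1 stores multiple instructions to implement a method for analyzing and matching resume data information, and the processor 13 can execute the multiple instructions to achieve:
  • resumes are retrieved from the database, and the retrieved resumes are preprocessed to obtain the resume to be parsed;
  • a word segmentation directed acyclic graph is constructed according to the pre-built word segmentation dictionary, and the resume to be parsed is segmented according to the constructed graph to obtain the segmented resume text;
  • a co-occurrence matrix is constructed according to the segmented resume text, and the keywords of the resume text are determined based on the co-occurrence matrix;
  • the word sequence in the keywords is obtained and processed with the word representation model to obtain the word representation of the word sequence;
  • the word representation is input into the constructed resume label analysis model to obtain the predicted resume label sequence;
  • the similarity between each label in the resume label sequence and the label of each post is calculated, and a resume matching each post is determined from the resume to be parsed according to the calculated similarity.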
  • modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

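A method, device, electronic device, and medium for analyzing and matching resume data information. The method preprocesses retrieved resumes to obtain resumes to be parsed, and constructs a word segmentation directed acyclic graph from a pre-built word segmentation dictionary to segment the resumes to be parsed, so that the segmentation result is obtained quickly as resume text. It then constructs a co-occurrence matrix from the resume text, determines the keywords of the resume text based on the co-occurrence matrix, obtains the character sequence in the keywords, and applies a word representation model to the character sequence to obtain its word representation, which improves the parsing effect. The word representation is input into a resume label analysis model to obtain a resume label sequence, and the similarity between each label in the resume label sequence and the label of each post is calculated to determine the resume matching each post, so that posts and resumes are matched intelligently, quickly, and accurately.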

Description

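Method, device, electronic device, and medium for analyzing and matching resume data information
This application claims priority to Chinese patent application No. 202010151399.9, filed with the Chinese Patent Office on March 6, 2020 and entitled "Method, device, electronic device, and medium for analyzing and matching resume data information", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of data processing technology, and in particular to a method, device, electronic device, and medium for analyzing and matching resume data information.
Background
The inventor realized that, in existing technical solutions, resume matching usually requires manual screening to find the resumes relevant to a post, which consumes substantial labor and takes a long time.
At present, intelligent resume screening remains at the preliminary stage of discarding resumes that fail certain requirements (for example, filtering out resumes that do not meet an education requirement); automatic matching of posts and resumes is not yet possible.
Summary
In view of the above, it is necessary to provide a method, device, electronic device, and medium for analyzing and matching resume data information that can match posts and resumes intelligently, quickly, and accurately.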
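A method for analyzing and matching resume data information, the method comprising:
retrieving resumes from a database, and preprocessing the retrieved resumes to obtain resumes to be parsed;
constructing a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed graph to obtain segmented resume text;
constructing a co-occurrence matrix according to the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix;
obtaining the character sequence in the keywords, and processing the character sequence with a word representation model to obtain a word representation of the character sequence;
inputting the word representation into a constructed resume label analysis model to obtain a predicted resume label sequence;
calculating the similarity between each label in the resume label sequence and the label of each post, and determining, according to the calculated similarity, the resume matching each post from the resumes to be parsed.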
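A device for analyzing and matching resume data information, the device comprising:
a preprocessing unit, configured to retrieve resumes from a database and preprocess the retrieved resumes to obtain resumes to be parsed;
a construction unit, configured to construct a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, and to segment the resumes to be parsed according to the constructed graph to obtain segmented resume text;
a determining unit, configured to construct a co-occurrence matrix according to the segmented resume text and determine keywords of the resume text based on the co-occurrence matrix;
a processing unit, configured to obtain the character sequence in the keywords and process the character sequence with a word representation model to obtain a word representation of the character sequence;
a prediction unit, configured to input the word representation into a constructed resume label analysis model to obtain a predicted resume label sequence;
the determining unit being further configured to calculate the similarity between each label in the resume label sequence and the label of each post, and to determine, according to the calculated similarity, the resume matching each post from the resumes to be parsed.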
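An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing the instructions stored in the memory to implement the following steps:
retrieving resumes from a database, and preprocessing the retrieved resumes to obtain resumes to be parsed;
constructing a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed graph to obtain segmented resume text;
constructing a co-occurrence matrix according to the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix;
obtaining the character sequence in the keywords, and processing the character sequence with a word representation model to obtain a word representation of the character sequence;
inputting the word representation into a constructed resume label analysis model to obtain a predicted resume label sequence;
calculating the similarity between each label in the resume label sequence and the label of each post, and determining, according to the calculated similarity, the resume matching each post from the resumes to be parsed.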
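A computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following steps:
retrieving resumes from a database, and preprocessing the retrieved resumes to obtain resumes to be parsed;
constructing a word segmentation directed acyclic graph according to a pre-built word segmentation dictionary, and segmenting the resumes to be parsed according to the constructed graph to obtain segmented resume text;
constructing a co-occurrence matrix according to the segmented resume text, and determining keywords of the resume text based on the co-occurrence matrix;
obtaining the character sequence in the keywords, and processing the character sequence with a word representation model to obtain a word representation of the character sequence;
inputting the word representation into a constructed resume label analysis model to obtain a predicted resume label sequence;
calculating the similarity between each label in the resume label sequence and the label of each post, and determining, according to the calculated similarity, the resume matching each post from the resumes to be parsed.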
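Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the method for analyzing and matching resume data information of the present application.
FIG. 2 is a functional module diagram of a preferred embodiment of the device for analyzing and matching resume data information of the present application.
FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the method for analyzing and matching resume data information of the present application.
FIG. 4 is a schematic diagram of the co-occurrence matrix in a preferred embodiment of the method for analyzing and matching resume data information of the present application.
The realization of the purpose, the functional characteristics, and the advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.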
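Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a preferred embodiment of the method for analyzing and matching resume data information of the present application. Depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.
The method for analyzing and matching resume data information is applied to one or more electronic devices. An electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), or a smart wearable device.
The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network in which the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.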
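S10: retrieve resumes from the database, and preprocess the retrieved resumes to obtain the resumes to be parsed.
In at least one embodiment of the present application, the database may be a database in communication with the electronic device or an internal database of the electronic device, and may be configured as required.
For example, the database may be a talent pool. The electronic device retrieves and organizes resumes from the talent pool to obtain a large number of resumes. A resume can be summarized as a set of nouns {name, gender, date of birth, political affiliation, school, education level, major, contact information, native place, education history, skills, ...}, in which each item is described in detail and the items are separated by delimiters. Because of the particular nature of job hunting as a social activity, and because people imitate one another, many job seekers describe themselves in very similar ways. From this large number of similar resumes, the electronic device extracts the resumes containing the content that resume screeners are interested in and care about, forming a roughly convergent, finite set of resumes that serves as the retrieved resumes.
In at least one embodiment of the present application, because the same person may submit several resumes while job hunting, the electronic device may first remove duplicate resumes, thereby deduplicating the resumes.
Further, resumes also contain redundant stop words, which likewise harm parsing, so the stop words also need to be removed; this is the preprocessing of the retrieved resumes.
Specifically, the preprocessing of the retrieved resumes by the electronic device includes:
the electronic device removes stop words from the retrieved resumes by filtering against a stop-word list.
Stop words are function words in text data that carry no real meaning and contribute nothing to text classification, but occur frequently; they typically include common pronouns, prepositions, and the like. Leaving stop words in reduces the accuracy of text classification.
Further, the electronic device may match the words in the retrieved resumes one by one against a pre-built stop-word list. If a word matches, it is a stop word, and the electronic device deletes it.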
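The deduplication and stop-word filtering in S10 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the whitespace tokenization, and the tiny `STOP_WORDS` set are assumptions made for the example, and duplicates are detected by exact text match only.

```python
def deduplicate(resumes):
    """Drop exact duplicate resumes while keeping first-seen order."""
    seen = set()
    unique = []
    for text in resumes:
        key = text.strip()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def remove_stop_words(tokens, stop_words):
    """Delete every token that matches an entry in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical stop-word list; a real one would be built in advance.
STOP_WORDS = {"the", "a", "of", "and", "I", "my"}

resumes = ["I led a team of five", "I led a team of five", "my major is statistics"]
cleaned = [remove_stop_words(r.split(), STOP_WORDS) for r in deduplicate(resumes)]
print(cleaned)
```

A set is used for the stop-word list so that each membership check is constant time, which matters when the list is matched against every word of every resume.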
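S11: construct a word segmentation directed acyclic graph according to the pre-built word segmentation dictionary, and segment the resumes to be parsed according to the constructed graph to obtain the segmented resume text.
In at least one embodiment of the present application, the word segmentation dictionary may include a prefix dictionary, a custom dictionary, and the like.
The prefix dictionary contains the prefixes of every word in the statistical dictionary. For example, the prefixes of the dictionary word "北京大学" (Peking University) are "北", "北京", and "北京大", and the prefix of the word "大学" (university) is "大". The custom dictionary, also called the proper-noun dictionary, contains words that do not appear in the statistical dictionary but are specific to a particular field, such as "resume" and "work experience".
Further, the electronic device constructs the word segmentation directed acyclic graph according to the pre-built word segmentation dictionary, where each word corresponds to a directed edge in the graph and is assigned a corresponding edge length (weight). The electronic device then computes the length of every path from the start node to the end node and ranks the distinct length values in strictly ascending order (that is, the values at any two different ranks are never equal; the same applies below), giving the 1st, 2nd, ..., i-th, ..., N-th path sets, which together form the coarse segmentation result set. If two or more paths have the same length, they share rank i and are all included in the coarse segmentation result set without affecting the ranks of the other paths, so the final coarse segmentation result set contains at least N paths. The segmented resume text is obtained from this result set.
Through the above implementation, the segmentation result of the resume text can be obtained quickly by using the word segmentation dictionary and the directed acyclic graph.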
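The graph construction in S11 can be illustrated with a small sketch. It makes several simplifying assumptions that are not in the source: every edge has length 1 (so the shortest path is the one with the fewest words), only the single best path is returned rather than the N-ranked coarse result set, single characters are always allowed as fallback edges, and `DICTIONARY` is a hypothetical four-word dictionary.

```python
def build_dag(text, dictionary):
    """Map each start index to the end indices of dictionary words starting there.

    Single characters are always allowed so that every position is reachable.
    """
    dag = {}
    for i in range(len(text)):
        ends = [i + 1]
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in dictionary:
                ends.append(j)
        dag[i] = ends
    return dag


def shortest_segmentation(text, dictionary):
    """Return the segmentation with the fewest words (every edge has length 1)."""
    dag = build_dag(text, dictionary)
    n = len(text)
    # best[i] = (number of words needed for text[i:], end index of the first word)
    best = {n: (0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = min((1 + best[j][0], j) for j in dag[i])
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


DICTIONARY = {"北京", "大学", "北京大学", "毕业"}  # hypothetical dictionary
print(shortest_segmentation("北京大学毕业", DICTIONARY))
```

Because edges only ever point forward in the text, the graph is acyclic by construction, and a single right-to-left pass of dynamic programming is enough to find the shortest path.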
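S12: construct a co-occurrence matrix according to the segmented resume text, and determine the keywords of the resume text based on the co-occurrence matrix.
In at least one embodiment of the present application, constructing the co-occurrence matrix according to the resume text and determining the keywords of the resume text based on the co-occurrence matrix includes:
the electronic device constructs the co-occurrence matrix according to the number of occurrences of each segmented word in the resume text, and extracts the word frequency (freq) and degree (deg) of each segmented word from the co-occurrence matrix; it calculates a score for each segmented word from its word frequency and degree, and outputs the segmented words in descending order of score to obtain the keywords of the resume text.
For example, the electronic device outputs the segmented words in descending order of score and takes the first n words, such as the top one-third of the words by score, as the keywords of the resume text.
The co-occurrence matrix is built by counting how often words co-occur within a window of pre-specified size; the co-occurrence counts of a word with its surrounding words serve as the vector of that word.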
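The windowed co-occurrence counting and the frequency/degree scoring can be sketched as below. The source does not state how the score is computed from freq and deg; the ratio deg / freq used here is an assumption borrowed from RAKE-style keyword extraction, and the function names, the window size of 1, and the English stand-in corpus are likewise illustrative.

```python
from collections import Counter, defaultdict


def cooccurrence(sentences, window=1):
    """Count how often two words appear within `window` positions of each other."""
    matrix = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[word][tokens[j]] += 1
    return matrix


def keyword_scores(sentences, window=1):
    """Score each word as deg / freq (a RAKE-style ratio, assumed here)."""
    matrix = cooccurrence(sentences, window)
    freq = Counter(tok for tokens in sentences for tok in tokens)
    scores = {}
    for word, count in freq.items():
        deg = sum(matrix[word].values())  # total co-occurrences of the word
        scores[word] = deg / count
    return scores


corpus = [["I", "excel", "research"], ["I", "excel", "programming"], ["I", "enjoy", "reading"]]
scores = keyword_scores(corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
top = ranked[: max(1, len(ranked) // 3)]  # keep the top third of the ranked words
print(scores["excel"], top)
```

Under this scoring, a word that appears rarely but always alongside other content words scores higher than a frequent word with few distinct neighbors.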
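For example, suppose the resume text contains the following corpus:
I am good at research. (This sentence contains the segmented words "I", "am good at", "research", and "."; the two sentences below are segmented in the same way and are not listed one by one.)
I am good at programming.
I enjoy reading.
Based on the corpus above, the constructed co-occurrence matrix X is shown in FIG. 4.
In at least one embodiment of the present application, after the keywords of the resume text are obtained, the method further includes:
when the number of times two keywords appear adjacent to each other in the same document is greater than a preset value, the electronic device merges the two keywords into a new keyword.
The preset value may be, for example, 2.
Through the above implementation, similar keywords can be further merged to avoid redundant keywords.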
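The merging rule can be sketched in a few lines. The function name, the choice to join the merged keyword with a space, and the choice to drop the two original keywords once they are merged are assumptions for illustration; the source only states that the two keywords are merged into a new keyword.

```python
from collections import Counter


def merge_adjacent_keywords(tokens, keywords, threshold=2):
    """Merge two keywords into one when they are adjacent more than `threshold` times."""
    pair_counts = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a in keywords and b in keywords
    )
    merged = set(keywords)
    for (a, b), count in pair_counts.items():
        if count > threshold:
            merged.discard(a)
            merged.discard(b)
            merged.add(a + " " + b)
    return merged


tokens = "machine learning project with machine learning models and machine learning tools".split()
print(merge_adjacent_keywords(tokens, {"machine", "learning", "project"}))
```

With the preset value of 2 from the text, "machine" and "learning" are adjacent three times and are merged, while "learning" and "project" are adjacent only once and stay separate.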
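S13: obtain the character sequence in the keywords, and process the character sequence with a word representation model to obtain the word representation of the character sequence.
In at least one embodiment of the present application, processing the character sequence with the word representation model to obtain the word representation of the character sequence includes:
the electronic device inputs the character sequence in the keywords into the word representation model, generates a first vector containing the character sequence and its preceding-context information by reading the character sequence forward, and generates a second vector containing the character sequence and its following-context information by reading the character sequence backward; the electronic device concatenates the first vector and the second vector to obtain a word representation containing the character sequence and its context information.
For example, given the character sequence Char = (char_1, char_2, ..., char_n) of an unstructured text resume containing n keyword characters, where each char_i is a d-dimensional character vector, the unstructured text character sequence is input into the word representation model, which models the sequence. Reading the sequence forward generates a vector containing the character sequence and its preceding-context information, denoted CharF_i; similarly, reading the sequence backward generates a vector containing the character sequence and its following-context information, denoted CharB_i. CharF_i and CharB_i are then concatenated to form a word representation containing the character sequence and its context information:
Wd = [CharF_i : CharB_i]
In this way, the electronic device obtains the word representation of the character sequence.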
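The forward read, backward read, and concatenation can be illustrated with a toy example. The source does not specify the internals of the word representation model; the element-wise tanh recurrence below, with fixed scalar weights `w_in` and `w_rec`, is only a stand-in chosen so that the sketch runs with the standard library, and all names are hypothetical.

```python
import math


def read_sequence(vectors, w_in=0.5, w_rec=0.5):
    """Run a toy element-wise recurrent cell over the sequence; return all hidden states."""
    hidden = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:
        hidden = [math.tanh(w_in * x + w_rec * h) for x, h in zip(vec, hidden)]
        states.append(hidden)
    return states


def bidirectional_representation(vectors):
    """Concatenate forward and backward states: Wd_i = [CharF_i : CharB_i]."""
    forward = read_sequence(vectors)
    backward = read_sequence(vectors[::-1])[::-1]  # re-align to the original order
    return [f + b for f, b in zip(forward, backward)]


chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three hypothetical 2-d character vectors
wd = bidirectional_representation(chars)
print(len(wd), len(wd[0]))
```

Each position ends up with a vector of dimension 2d: the first half summarizes the characters read so far from the left, and the second half summarizes the characters read so far from the right.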
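It should be noted that, in natural language processing, various word representation models can be used to express the symbolic information of a "word" as a mathematical vector. The vector representation of a word can be used as input to various machine learning models. Existing word representation models fall into two broad categories: syntagmatic models and paradigmatic models.
Further, the electronic device may format the word representation using regular-expression matching, then parse and classify it and store it in a designated database for later use.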
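S14: input the word representation into the constructed resume label analysis model to obtain the predicted resume label sequence.
In at least one embodiment of the present application, the resume label analysis model is obtained by training on a large amount of resume data as training samples and validating on a validation set. Parsing the unstructured word representation with the resume label analysis model outputs the corresponding labels, which form the resume label sequence.
For example, the labels in the resume label sequence may include, but are not limited to: undergraduate, graduate, proficient in WORD, and so on.
In at least one embodiment of the present application, the method further includes:
the electronic device obtains resume data and splits it into a training set and a validation set; it then trains a CRF model on the training set, predicts the target label sequence using the conditional log-likelihood function and the maximum-score formula, and verifies the target label sequence against the validation set; when the target label sequence passes verification, training stops and the resume label analysis model is obtained.
Here, the target label sequence is the predicted best-fitting label sequence.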
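The split into a training set and a validation set can be sketched as below. The 80/20 ratio, the fixed seed, and the function name are assumptions for illustration; the source does not state how the resume data is divided.

```python
import random


def split_dataset(samples, validation_ratio=0.2, seed=0):
    """Shuffle the samples reproducibly and split them into training and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_ratio)
    cut = len(shuffled) - n_validation
    return shuffled[:cut], shuffled[cut:]


train_set, validation_set = split_dataset(range(10))
print(len(train_set), len(validation_set))
```

The model parameters would be fitted on `train_set` only, with `validation_set` held back to decide when the predicted label sequences are good enough to stop training.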
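Specifically, the electronic device uses a CRF (conditional random field) for modeling. Suppose the output target sequence (that is, the corresponding label sequence) for the keyword information of the unstructured text is y = (y_1, ..., y_n). To obtain the target sequence of the unstructured text resume information effectively, the score formula of the model is defined as follows:
Figure PCTCN2020131916-appb-000001
where P is the output score matrix of the bidirectional LSTM (long short-term memory) algorithm, of size n×k; k is the number of target labels, a target label being a summary evaluation of the resume; n is the length of the word sequence; and A is the transition score matrix. When j = 0, y_0 is the start-of-sequence marker, and when j = n, y_{n+1} is the end-of-sequence marker, so A is a square matrix of size k+2.
Over all label sequences of the resume information, the probability that the CRF generates the target sequence y is:
Figure PCTCN2020131916-appb-000002
where Y_Wd denotes all possible label sequences for the resume information sequence Wd. During training, to obtain the correct label sequence for the resume information, the conditional log-likelihood of the correct label sequence is maximized, and the maximum-score formula is used to predict the most suitable label sequence:
Figure PCTCN2020131916-appb-000003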
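The three formula images are not reproduced in this text. For orientation only, the following is the form these quantities usually take in a BiLSTM-CRF tagger, written with the symbols defined above (P of size n×k, A of size (k+2)×(k+2), start marker y_0, end marker y_{n+1}). It is an assumption that the images follow this standard form; they may differ in indexing or notation.

```latex
% Score of a label sequence y for the input Wd
s(Wd, y) = \sum_{j=0}^{n} A_{y_j,\, y_{j+1}} + \sum_{j=1}^{n} P_{j,\, y_j}

% Probability of y over all candidate label sequences Y_{Wd}
p(y \mid Wd) = \frac{\exp\bigl(s(Wd, y)\bigr)}{\sum_{\tilde{y} \in Y_{Wd}} \exp\bigl(s(Wd, \tilde{y})\bigr)}

% Training objective (conditional log-likelihood) and prediction (maximum score)
\log p(y \mid Wd) = s(Wd, y) - \log \sum_{\tilde{y} \in Y_{Wd}} \exp\bigl(s(Wd, \tilde{y})\bigr),
\qquad
y^{*} = \arg\max_{\tilde{y} \in Y_{Wd}} s(Wd, \tilde{y})
```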
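Through the above implementation, combining the conditional log-likelihood function with the maximum-score formula improves the accuracy of the model.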
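Prediction with the maximum-score formula is typically carried out with Viterbi decoding. The sketch below assumes the usual BiLSTM-CRF score (transition scores including start and end markers, plus emission scores from P); the function names, the toy matrices, and the convention that indices k and k+1 of A are the start and end markers are all assumptions for illustration.

```python
def sequence_score(P, A, y, start, end):
    """Score of label path y: transition scores (with start/end markers) plus emissions."""
    path = [start] + list(y) + [end]
    transitions = sum(A[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[j][label] for j, label in enumerate(y))
    return transitions + emissions


def viterbi(P, A, start, end):
    """Return the label path with the maximum score and that score."""
    k = len(P[0])
    # best[t] = best score of any path ending in label t at the current position
    best = [A[start][t] + P[0][t] for t in range(k)]
    back = []
    for row in P[1:]:
        prev = [max(range(k), key=lambda s: best[s] + A[s][t]) for t in range(k)]
        best = [best[prev[t]] + A[prev[t]][t] + row[t] for t in range(k)]
        back.append(prev)
    last = max(range(k), key=lambda t: best[t] + A[t][end])
    score = best[last] + A[last][end]
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1], score


# Toy example: n = 3 positions, k = 2 labels; rows/columns 2 and 3 of A are start/end.
P = [[2.0, 0.5], [0.3, 1.0], [1.5, 0.2]]
A = [
    [0.5, -1.0, 0.0, 0.0],
    [-0.5, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
best_path, best_score = viterbi(P, A, start=2, end=3)
print(best_path, best_score)
```

Viterbi decoding finds the highest-scoring path in time proportional to n·k², instead of enumerating all kⁿ candidate label sequences.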
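S15: calculate the similarity between each label in the resume label sequence and the label of each post, and determine, according to the calculated similarity, the resume matching each post from the resumes to be parsed.
In at least one embodiment of the present application, calculating the similarity between each label in the resume label sequence and the label of each post, and determining the resume matching each post from the resumes to be parsed according to the calculated similarity, includes:
the electronic device calculates the cosine distance between each label and the label of each post; when the cosine distance between a target label and a target post is less than or equal to a preset distance, the electronic device retrieves the target resume corresponding to the target label from the resumes to be parsed and determines that the target resume matches the target post.
Specifically, the cosine distance uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are.
For example, for the obtained resume label sequence X and the resume label sequence Y required by the post, the following formula is used, where X_i denotes the i-th vector in the resume label sequence X and Y_i denotes the i-th vector in the resume label sequence Y required by the post:
Figure PCTCN2020131916-appb-000004
The resulting similarity ranges from -1 to 1, where -1 means that the two vectors point in exactly opposite directions, 1 means that they point in exactly the same direction, 0 usually means that they are independent, and values in between indicate intermediate similarity or dissimilarity. With this algorithm, resumes with higher label similarity can be selected for each post, enabling quick matching and onboarding.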
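The matching rule can be sketched as follows. The text uses "cosine distance" for the cosine of the angle itself, yet matches when the distance is at most a preset value; the sketch assumes the distance is taken as 1 minus the cosine so that a small distance means similar vectors. That reading, the 0.2 threshold, the function names, and the idea that labels are already encoded as numeric vectors are all assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def match_resumes(resume_vectors, post_vector, max_distance=0.2):
    """Return ids of resumes whose cosine distance (1 - similarity) is within the threshold."""
    return [
        rid
        for rid, vec in resume_vectors.items()
        if 1.0 - cosine_similarity(vec, post_vector) <= max_distance
    ]


resume_vectors = {"r1": [1.0, 1.0, 0.0], "r2": [0.0, 0.0, 1.0], "r3": [2.0, 2.0, 0.1]}
print(match_resumes(resume_vectors, [1.0, 1.0, 0.0]))
```

Because the cosine depends only on direction, a resume vector that is a scaled copy of the post vector (such as "r3", approximately) still counts as a close match.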
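In at least one embodiment of the present application, the electronic device may also use the obtained resume label sequence and the configured weights (for example, the graduate label has a weight of 0.2 in the resume score and the undergraduate label has a weight of 0.1) to express the resume label sequence as a score, and then quickly screen for the required employees based on that score.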
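The weighted scoring can be sketched as a simple lookup and sum. The two weights come from the example in the text; the additive scoring rule, the threshold, the English label strings, and the function names are assumptions for illustration.

```python
# Hypothetical weight table; the two values follow the example in the text.
LABEL_WEIGHTS = {"graduate": 0.2, "undergraduate": 0.1}


def resume_score(labels, weights=LABEL_WEIGHTS):
    """Sum the configured weights of the labels found in a resume; unknown labels add 0."""
    return sum(weights.get(label, 0.0) for label in labels)


def screen(resume_labels, threshold):
    """Return resume ids whose score reaches the threshold, highest score first."""
    scored = {rid: resume_score(labels) for rid, labels in resume_labels.items()}
    kept = [rid for rid, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda rid: scored[rid], reverse=True)


candidates = {
    "r1": ["undergraduate"],
    "r2": ["undergraduate", "graduate"],
    "r3": ["proficient in WORD"],
}
print(screen(candidates, threshold=0.1))
```

Keeping the weights in a plain table makes the screening rule configurable per post without retraining the label model.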
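It can be seen from the above technical solutions that this application can retrieve resumes from the database and preprocess them to obtain the resumes to be parsed; construct a word segmentation directed acyclic graph from the pre-built word segmentation dictionary and use it to segment the resumes to be parsed into segmented resume text, so that the segmentation result is obtained quickly; construct a co-occurrence matrix from the resume text and determine the keywords of the resume text based on that matrix; obtain the character sequence in the keywords and process it with the word representation model to obtain the word representation of the character sequence, which improves the parsing effect; input the word representation into the constructed resume label analysis model to obtain the predicted resume label sequence; and calculate the similarity between each label in the resume label sequence and the label of each post, determining from the resumes to be parsed the resume that matches each post. Posts and resumes are thereby matched intelligently, quickly, and accurately.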
如图2所示,是本申请简历数据信息解析及匹配装置的较佳实施例的功能模块图。所述简历数据信息解析及匹配装置11包括预处理单元110、构建单元111、确定单元112、处理单元113、预测单元114、合并单元115、训练单元116、获取单元117、拆分单元118、验证单元119。本申请所称的模块/单元是指一种能够被处理器13所执行,并且能够完成固定功能的一系列计算机程序段,其存储在存储器12中。在本实施例中,关于各模块/单元的功能将在后续的实施例中详述。
预处理单元110从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历。
在本申请的至少一个实施例中,所述数据库可以是与电子设备相通信的数据库,也可以是所述电子设备的内部数据库,根据不同的需求,可以进行自定义配置。
例如:所述数据库可以是人才库。所述预处理单元110从所述人才库中进行简历的调取和整理,得到大量简历。所述简历可以归纳成一个名词集合{姓名、性别、生日、政貌、学校、学历、专业、联系方式、籍贯、教育经历、技能……},其中的每一项内容都有展开描述,并且每一项都有分隔符分开。由于求职这一社会行为的特殊性以及人与人之间的模仿,很多求职人员在描述自身特点方面有相当大的共性。所述预处理单元110从大量的具有共性的简历之中解析出包括简历挑选者感兴趣和关心的内容的简历,形成一个大致收敛的有限的简历集合,作为所述调取的简历。
在本申请的至少一个实施例中,由于在求职过程中,同一人有可能发送多份简历,因此可以首先将重复的简历进行剔除,从而实现简历的去重。
进一步地,由于简历中还存在一些冗余的停用词,同样会对解析产生不利影响,因此,还需要剔除停用词,即对调取的简历进行预处理。
具体地,所述预处理单元110对调取的简历进行预处理包括:
所述预处理单元110采用停用词表过滤方法对所述调取的简历进行去停用词处理。
其中,所述停用词是文本数据功能词中没有实际意义的词,对文本的分类没有影响,但是出现的频率高,具体可以包括常用的代词、介词等。所述停用词会降低文本分类效果的准确性。
进一步地,所述预处理单元110可以将调取的简历中的词语与预先构建好的停用词表进行一一匹配,如果匹配成功,那么该词语就是停用词,所述预处理单元110将该词删除。
构建单元111根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本。
在本申请的至少一个实施例中,所述分词词典可以包括前缀字典、自定义字典等。
其中,所述前缀词典包括统计的词典中每一个分词的前缀,例如:词典中的词“北京大学”的前缀分别是“北”、“北京”、“北京大”;词“大学”的前缀是“大”;所述自定义词典也可以称为专有名词词典,是在统计的词典中不存在,但是某领域特定、专有的词,如简历、工作经历等。
进一步地,所述构建单元111根据预先构建的分词词典构建词语切分有向无环图,其中,每个词对应图中的一条有向边,并赋给相应的边长(权值)。进一步地,所述构建单元111在起点到终点的所有路径中,求出长度值,并按严格升序排列(即:任何两个不同位置上的值一定不等,下同),依次为第1,第2,…,第i,…,第N的路径集合,作为相应的粗分结果集。如果两条或两条以上路径的长度相等,那么他们的长度并列为第i,都要列入所述粗分结果集,而且不影响其他路径的排列序号,最后的粗分结果集的大小大于或等于N,据此得到经过分词处理的简历文本。
通过上述实施方式,能够利用分词词典及有向无环图快速得到简历文本的分词结果。
确定单元112根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词。
在本申请的至少一个实施例中,所述确定单元112根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词包括:
所述确定单元112根据所述简历文本中每个分词出现的次数构建所述共现矩阵,并从所述共现矩阵中提取每个分词的词频(freq)及度(deg),所述确定单元112根据每个分词的词频及度计算每个分词的得分,进一步根据每个分词的得分对每个分词进行降序输出,得到所述简历文本的关键词。
例如:所述确定单元112根据每个分词的得分对每个分词降序输出,得到前n个词语,如按score大小降序输出前1/3的词语作为所述简历文本的关键词。
其中,所述共现矩阵是通过统计一个事先指定大小的窗口内的词语的共现次数,以词语周边的共现词的次数作为当前词语的向量。
例如,当所述简历文本中有如下语料:
我擅长研究。(该语料中包括分词:“我”、“擅长”、“研究”及“。”,下面两个语料采取类似的分词方式,将不再一一列举)
我擅长编程。
我享受阅读。
根据上述简历文本中的语料,构建的共现矩阵X如图4所示。在本申请的至少一个实施例中,在得到所述简历文本的关键词后,所述方法还包括:
当有两个关键词在同一文档中相邻的次数大于预设值时,合并单元115将所述两个关键词合并为新的关键词。
其中,所述预设值可以是2次等。
通过上述实施方式,能够将相似的关键词进一步合并,避免出现冗余关键词。
处理单元113获取所述关键词中的字序列,并利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示。
在本申请的至少一个实施例中,所述处理单元113利用词表示模型对所述字序列进行处理,得到所述字序列的词表示包括:
所述处理单元113将所述关键词中的字序列输入所述词表示模型,并通过正向读取所述字序列生成包含所述字序列以及所述字序列的上文信息的第一向量,及通过反向读取所述字序列生成包含所述字序列以及所述字序列的下文信息的第二向量,所述处理单元113连接所述第一向量及所述第二向量,得到包含所述字序列及所述字序列的上下文信息的词表示。
例如:对于给定一个包含n个关键字的非结构化文本简历的字序列Char=(char 1,char 2…,char n),其中char n是一个维度为d维的字向量,将所述非结构化文本字序列输入到词表示模型中,从而利用该词表示模型对字序列进行建模,通过正向读取字序列,以生成一个包含字序列以及字序列上文信息的向量,表示为CharF i,同理,通过反向读取字序列,以生成一个包含字序列以及字序列下文信息的向量,表示为CharB i,然后将CharF i和CharB i连接,形成一个包含字序列以及上下文信息的词表示:
Wd=[CharF i:CharB i]
据此,所述处理单元113得到所述字序列的词表示。
需要说明的是,在进行自然语言处理时,可以利用各种词表示模型将“词”这一符号信息表示成数学上的向量形式。词的向量表示可以作为各种机器学习模型的输入来使用。现有的词表示模型可以包括两大类:一类是syntagmatic models,一类是paradigmatic models。
进一步地,对于该词表示,所述电子设备还可以进一步使用正则表达匹配对其进行格式化处理,进而解析、分类,存入指定数据库中,以供后续使用。
预测单元114将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列。
在本申请的至少一个实施例中,所述简历标签解析模型是以大量的简历数据作为训练样本进行训练,并以验证集进行验证而得到。利用所述简历标签解析模型对非结构化的词表示进行解析,能够输出相对应的标签以形成所述简历标签序列。
例如:所述简历标签序列中的标签可以包括,但不限于:本科生、研究生、熟练掌握WORD等。
在本申请的至少一个实施例中,训练所述简历标签解析模型包括:
获取单元117获取简历数据,拆分单元118拆分所述简历数据,得到训练集和验证集,进一步地,验证单元119利用所述验证集训练CRF模型,训练单元116采用条件对数似然函数及最大分值公式预测目标标签序列,以所述验证集验证所述目标标签序列,当所述目标标签序列通过验证时,所述训练单元116停止训练并得到所述简历标签解析模型。
其中,所述是指预测的最适合的标签序列。
具体地,所述训练单元116采用CRF(conditional random field,条件随机场)进行建模。假定得到非结构化文本的关键字信息的输出目标序列(即对应的标签序列)为:y=(y 1,…y n)。为了有效获得非结构化文本简历信息的目标序列,模型的分值公式定义如下:
Figure PCTCN2020131916-appb-000005
其中,P表示双向LSTM算法(Long short-term memory,长短期记忆算法)的输出分值矩阵,其大小为n×k,k表示目标标签的数量,所述目标标签即对该简历的概述评价,n表示词序列的长度,A表示转移分值矩阵。当j=0时,y 0表示的是一个序列开始的标志,当j=n时,y n+1表示的是一个序列结束的标志,A方阵的大小为k+2。
在所有简历信息的标签序列上,CRF生成目标序列y的概率为:
Figure PCTCN2020131916-appb-000006
其中,Y Wd代表简历信息序列Wd对应的所有可能标签序列。在训练过程中,为了获得简历信息正确的标签序列,所述训练单元116将采用最大化正确标签序列的条件对数似然函数进行计算,并使用最大分值公式预测最合适的标签序列:
Figure PCTCN2020131916-appb-000007
通过上述实施方式,结合条件对数似然函数及最大分值公式,能够提升模型的准确率。
所述确定单元112计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
在本申请的至少一个实施例中,所述确定单元112计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历包括:
所述确定单元112计算每个标签与每个岗位的标签之间的余弦距离,当存在目标标签与目标岗位之间的余弦距离小于或者等于预设距离时,所述确定单元112从所述待解析简历中调取所述目标标签对应的目标简历,并确定所述目标简历与所述目标岗位相匹配。
具体地,所述余弦距离是用向量空间中两个向量夹角的余弦值作为衡量两个个体间差异的大小的度量,余弦值越接近1,就表明夹角越接近0度,也就是两个向量越相似。
例如:对于所得到的简历标签序列X和入职岗位所需要的简历标签序列Y,利用下列式子进行计算,式中X i表示简历标签序列X中第i个向量,Y i表示入职岗位所需要的简历标签序列Y中第i个向量:
Figure PCTCN2020131916-appb-000008
产生的相似性范围从-1到1,其中,-1意味着两个向量指向的方向正好截然相反,1表示它们的指向是完全相同的,0通常表示它们之间是独立的,而在这之间的值则表示中度的相似性或相异性,根据这一算法,能够对每份岗位选取标签相似度较高的简历,以进行快速匹配入职。
在本申请的至少一个实施例中,所述确定单元112还可以根据得到的简历标签序列及配置的相应的权重(如:研究生标签在简历评分中所占权重为0.2,而本科生标签在简历评分中所占权重为0.1),将所述简历标签序列通过得分进行表示,进一步根据得分快速筛选出所需的员工。
由以上技术方案可以看出,本申请能够从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历,根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到简历文本,进而能够快速得到待解析简历的分词结果,进一步根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词,获取所述关键词中的字序列,并利用词表示模型对所述字序列进行处理,得到所述字序列的词表示,提升了解析效果,将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列,进一步计算所述简历标签序列中的每 个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历,实现对岗位与简历快速且准确地智能匹配。
如图3所示,是本申请实现简历数据信息解析及匹配方法的较佳实施例的电子设备的结构示意图。
所述电子设备1可以包括存储器12、处理器13和总线,还可以包括存储在所述存储器12中并可在所述处理器13上运行的计算机程序,例如简历数据信息解析及匹配程序。
本领域技术人员可以理解,所述示意图仅仅是电子设备1的示例,并不构成对电子设备1的限定,所述电子设备1既可以是总线型结构,也可以是星形结构,所述电子设备1还可以包括比图示更多或更少的其他硬件或者软件,或者不同的部件布置,例如所述电子设备1还可以包括输入输出设备、网络接入设备等。
需要说明的是,所述电子设备1仅为举例,其他现有的或今后可能出现的电子产品如可适应于本申请,也应包含在本申请的保护范围以内,并以引用方式包含于此。
其中,存储器12至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器(例如:SD或DX存储器等)、磁性存储器、磁盘、光盘等。存储器12在一些实施例中可以是电子设备1的内部存储单元,例如该电子设备1的移动硬盘。存储器12在另一些实施例中也可以是电子设备1的外部存储设备,例如电子设备1上配备的插接式移动硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。进一步地,存储器12还可以既包括电子设备1的内部存储单元也包括外部存储设备。存储器12不仅可以用于存储安装于电子设备1的应用软件及各类数据,例如简历数据信息解析及匹配程序的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。
处理器13在一些实施例中可以由集成电路组成,例如可以由单个封装的集成电路所组成,也可以是由多个相同功能或不同功能封装的集成电路所组成,包括一个或者多个中央处理器(Central Processing unit,CPU)、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。处理器13是所述电子设备1的控制核心(Control Unit),利用各种接口和线路连接整个电子设备1的各个部件,通过运行或执行存储在所述存储器12内的程序或者模块(例如执行简历数据信息解析及匹配程序等),以及调用存储在所述存储器12内的数据,以执行电子设备1的各种功能和处理数据。
所述处理器13执行所述电子设备1的操作系统以及安装的各类应用程序。所述处理器13执行所述应用程序以实现上述各个简历数据信息解析及匹配方法实施例中的步骤,例如图1所示的步骤S10、S11、S12、S13、S14、S15。
或者,所述处理器13执行所述计算机程序时实现上述各装置实施例中各模块/单元的功能,例如:
从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历;
根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本;
根据分词处理后的所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词;
获取所述关键词中的字序列,并利用词表示模型对所述字序列进行处理,得到所述字序列的词表示;
将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列;
计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
示例性的,所述计算机程序可以被分割成一个或多个模块/单元,所述一个或者多个 模块/单元被存储在所述存储器12中,并由所述处理器13执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序在所述电子设备1中的执行过程。例如,所述计算机程序可以被分割成预处理单元110、构建单元111、确定单元112、处理单元113、预测单元114、合并单元115、训练单元116、获取单元117、拆分单元118、验证单元119。
上述以软件功能模块的形式实现的集成的单元,可以存储在一个计算机可读取存储介质中,所述计算机可读存储介质可以是非易失性,也可以是易失性。上述软件功能模块存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、计算机设备,或者网络设备等)或处理器(processor)执行本申请各个实施例所述方法的部分。
所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,也可以通过计算机程序来指示相关的硬件设备来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。
其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)。
总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,在图3中仅用一根箭头表示,但并不表示仅有一根总线或一种类型的总线。所述总线被设置为实现所述存储器12以及至少一个处理器13等之间的连接通信。
尽管未示出,所述电子设备1还可以包括给各个部件供电的电源(比如电池),优选地,电源可以通过电源管理装置与所述至少一个处理器13逻辑相连,从而通过电源管理装置实现充电管理、放电管理、以及功耗管理等功能。电源还可以包括一个或一个以上的直流或交流电源、再充电装置、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。所述电子设备1还可以包括多种传感器、蓝牙模块、Wi-Fi模块等,在此不再赘述。
进一步地,所述电子设备1还可以包括网络接口,可选地,所述网络接口可以包括有线接口和/或无线接口(如WI-FI接口、蓝牙接口等),通常用于在该电子设备1与其他电子设备之间建立通信连接。
可选地,该电子设备1还可以包括用户接口,用户接口可以是显示器(Display)、输入单元(比如键盘(Keyboard)),可选地,用户接口还可以是标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在电子设备1中处理的信息以及用于显示可视化的用户界面。
应该了解,所述实施例仅为说明之用,在专利申请范围上并不受此结构的限制。
图3仅示出了具有组件12-13的电子设备1,本领域技术人员可以理解的是,图3示出的结构并不构成对所述电子设备1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。
结合图1,所述电子设备1中的所述存储器12存储多个指令以实现一种简历数据信息解析及匹配方法,所述处理器13可执行所述多个指令从而实现:
从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历;
根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本;
根据分词处理后的所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词;
获取所述关键词中的字序列,并利用词表示模型对所述字序列进行处理,得到所述字序列的词表示;
将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列;
计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
具体地,所述处理器13对上述指令的具体实现方法可参考图1对应实施例中相关步骤的描述,在此不赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图标记视为限制所涉及的权利要求。
此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。系统权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称,而并不表示任何特定的顺序。
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。

Claims (20)

  1. 一种简历数据信息解析及匹配方法,其中,所述方法包括:
    从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历;
    根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本;
    根据经过分词处理的所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词;
    获取所述关键词中的字序列,并利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示;
    将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列;
    计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
  2. 如权利要求1所述的简历数据信息解析及匹配方法,其中,所述对调取的简历进行预处理包括:
    采用停用词表过滤方法对所述调取的简历进行去停用词处理。
  3. 如权利要求1所述的简历数据信息解析及匹配方法,其中,所述根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词包括:
    根据所述简历文本中每个分词出现的次数构建所述共现矩阵;
    从所述共现矩阵中提取每个分词的词频及角度;
    根据每个分词的词频及角度计算每个分词的得分;
    根据每个分词的得分对每个分词进行降序输出,得到所述简历文本的关键词。
  4. 如权利要求3所述的简历数据信息解析及匹配方法,其中,在得到所述简历文本的关键词后,所述方法还包括:
    当有两个关键词在同一文档中相邻的次数大于预设值时,将所述两个关键词合并为新的关键词。
  5. 如权利要求1所述的简历数据信息解析及匹配方法,其中,所述利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示包括:
    将所述关键词中的字序列输入所述词表示模型,并通过正向读取所述字序列生成包含所述字序列以及所述字序列的上文信息的第一向量,及通过反向读取所述字序列生成包含所述字序列以及所述字序列的下文信息的第二向量;
    连接所述第一向量及所述第二向量,得到包含所述字序列及所述字序列的上下文信息的词表示。
  6. 如权利要求1所述的简历数据信息解析及匹配方法,其中,所述方法还包括:
    获取简历数据;
    拆分所述简历数据,得到训练集和验证集;
    利用所述验证集训练CRF模型,并采用条件对数似然函数及最大分值公式预测目标标签序列;
    以所述验证集验证所述目标标签序列;
    当所述目标标签序列通过验证时,停止训练并得到所述简历标签解析模型。
  7. 如权利要求1所述的简历数据信息解析及匹配方法,其中,所述计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历包括:
    计算每个标签与每个岗位的标签之间的余弦距离;
    当存在目标标签与目标岗位之间的余弦距离小于或者等于预设距离时,从所述待解析简历中调取所述目标标签对应的目标简历;
    确定所述目标简历与所述目标岗位相匹配。
  8. 一种简历数据信息解析及匹配装置,其中,所述装置包括:
    预处理单元,用于从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历;
    构建单元,用于根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本;
    确定单元,用于根据经过分词处理的所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词;
    处理单元,用于获取所述关键词中的字序列,并利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示;
    预测单元,用于将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列;
    所述确定单元,还用于计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
  9. 一种电子设备,其中,所述电子设备包括:
    存储器,存储至少一个指令;及
    处理器,执行所述存储器中存储的指令以实现如下步骤:
    从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历;
    根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本;
    根据经过分词处理的所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词;
    获取所述关键词中的字序列,并利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示;
    将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列;
    计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
  10. 如权利要求9所述的电子设备,其中,所述对调取的简历进行预处理包括:
    采用停用词表过滤方法对所述调取的简历进行去停用词处理。
  11. 如权利要求9所述的电子设备,其中,所述根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词包括:
    根据所述简历文本中每个分词出现的次数构建所述共现矩阵;
    从所述共现矩阵中提取每个分词的词频及角度;
    根据每个分词的词频及角度计算每个分词的得分;
    根据每个分词的得分对每个分词进行降序输出,得到所述简历文本的关键词。
  12. 如权利要求11所述的电子设备,其中,在得到所述简历文本的关键词后,所述方法还包括:
    当有两个关键词在同一文档中相邻的次数大于预设值时,将所述两个关键词合并为新的关键词。
  13. 如权利要求9所述的电子设备,其中,所述利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示包括:
    将所述关键词中的字序列输入所述词表示模型,并通过正向读取所述字序列生成包含所述字序列以及所述字序列的上文信息的第一向量,及通过反向读取所述字序列生成包含所述字序列以及所述字序列的下文信息的第二向量;
    连接所述第一向量及所述第二向量,得到包含所述字序列及所述字序列的上下文信息的词表示。
  14. 如权利要求9所述的电子设备,其中,执行所述存储器中存储的指令时还实现如下步骤:
    获取简历数据;
    拆分所述简历数据,得到训练集和验证集;
    利用所述验证集训练CRF模型,并采用条件对数似然函数及最大分值公式预测目标标签序列;
    以所述验证集验证所述目标标签序列;
    当所述目标标签序列通过验证时,停止训练并得到所述简历标签解析模型。
  15. 如权利要求9所述的电子设备,其中,所述计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历包括:
    计算每个标签与每个岗位的标签之间的余弦距离;
    当存在目标标签与目标岗位之间的余弦距离小于或者等于预设距离时,从所述待解析简历中调取所述目标标签对应的目标简历;
    确定所述目标简历与所述目标岗位相匹配。
  16. 一种计算机可读存储介质,其中:所述计算机可读存储介质中存储有至少一个指令,所述至少一个指令被电子设备中的处理器执行以实现如下步骤:
    从数据库中调取简历,并对调取的简历进行预处理,得到待解析简历;
    根据预先构建的分词词典构建词语切分有向无环图,并根据构建的词语切分有向无环图切分所述待解析简历,得到分词处理后的简历文本;
    根据经过分词处理的所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词;
    获取所述关键词中的字序列,并利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示;
    将所述词表示输入到构建的简历标签解析模型中,得到预测的简历标签序列;
    计算所述简历标签序列中的每个标签与每个岗位的标签的相似度,并根据计算的相似度从所述待解析简历中确定与每个岗位匹配的简历。
  17. 如权利要求16所述的计算机可读存储介质,其中,所述对调取的简历进行预处理包括:
    采用停用词表过滤方法对所述调取的简历进行去停用词处理。
  18. 如权利要求16所述的计算机可读存储介质,其中,所述根据所述简历文本构建共现矩阵,并基于所述共现矩阵确定所述简历文本的关键词包括:
    根据所述简历文本中每个分词出现的次数构建所述共现矩阵;
    从所述共现矩阵中提取每个分词的词频及角度;
    根据每个分词的词频及角度计算每个分词的得分;
    根据每个分词的得分对每个分词进行降序输出,得到所述简历文本的关键词。
  19. 如权利要求18所述的计算机可读存储介质,其中,在得到所述简历文本的关键词后,所述方法还包括:
    当有两个关键词在同一文档中相邻的次数大于预设值时,将所述两个关键词合并为新的关键词。
  20. 如权利要求16所述的计算机可读存储介质,其中,所述利用词表示模型对所述字序列进行词表示处理,得到所述字序列的词表示包括:
    将所述关键词中的字序列输入所述词表示模型,并通过正向读取所述字序列生成包含所述字序列以及所述字序列的上文信息的第一向量,及通过反向读取所述字序列生成包含所述字序列以及所述字序列的下文信息的第二向量;
    连接所述第一向量及所述第二向量,得到包含所述字序列及所述字序列的上下文信息的词表示。
PCT/CN2020/131916 2020-03-06 2020-11-26 简历数据信息解析及匹配方法、装置、电子设备及介质 WO2021174919A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010151399.9 2020-03-06
CN202010151399.9A CN111428488B (zh) 2020-03-06 Resume data information parsing and matching method, apparatus, electronic device, and medium

Publications (1)

Publication Number Publication Date
WO2021174919A1 (zh) 2021-09-10

Family

ID=71546173

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131916 WO2021174919A1 (zh) 2020-03-06 2020-11-26 Resume data information parsing and matching method, apparatus, electronic device, and medium

Country Status (1)

Country Link
WO (1) WO2021174919A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080222133A1 (en) * 2007-03-08 2008-09-11 Anthony Au System that automatically identifies key words & key texts from a source document, such as a job description, and apply both (key words & text) as context in the automatic matching with another document, such as a resume, to produce a numerically scored result.
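CN107766318A (zh) * 2016-08-17 2018-03-06 北京金山安全软件有限公司 Keyword extraction method, apparatus, and electronic device
CN110399475A (zh) * 2019-06-18 2019-11-01 平安科技(深圳)有限公司 Artificial intelligence-based resume matching method, apparatus, device, and storage medium
CN110750993A (zh) * 2019-10-15 2020-02-04 成都数联铭品科技有限公司 Word segmentation method and word segmenter, named entity recognition method and system
CN111428488A (zh) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Resume data information parsing and matching method, apparatus, electronic device, and medium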

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
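CN113905095A (zh) * 2021-12-09 2022-01-07 深圳佑驾创新科技有限公司 Data generation method and apparatus based on a CAN communication matrix
CN114254951A (zh) * 2021-12-27 2022-03-29 南方电网物资有限公司 Arrival sampling inspection method for power grid equipment based on digital technology
CN114637839A (zh) * 2022-03-15 2022-06-17 平安国际智慧城市科技股份有限公司 Text highlighting method, apparatus, device, and storage medium
CN114637836A (zh) * 2022-03-15 2022-06-17 平安国际智慧城市科技股份有限公司 Text processing method, apparatus, device, and storage medium
CN115293131A (zh) * 2022-09-29 2022-11-04 广州万维视景科技有限公司 Data matching method, apparatus, device, and storage medium
CN115879901A (zh) * 2023-02-22 2023-03-31 陕西湘秦衡兴科技集团股份有限公司 Intelligent personnel self-service platform
CN115879901B (zh) * 2023-02-22 2023-07-28 陕西湘秦衡兴科技集团股份有限公司 Intelligent personnel self-service platform
CN116562837A (zh) * 2023-07-12 2023-08-08 深圳须弥云图空间科技有限公司 Person-position matching method, apparatus, electronic device, and computer-readable storage medium
CN116843155B (zh) * 2023-07-27 2024-04-30 深圳市贝福数据服务有限公司 SaaS-based two-way person-position matching method and system
CN116843155A (zh) * 2023-07-27 2023-10-03 深圳市贝福数据服务有限公司 SaaS-based two-way person-position matching method and system
CN116680590B (zh) * 2023-07-28 2023-10-20 中国人民解放军国防科技大学 Method and apparatus for extracting position profile labels based on job description parsing
CN116680590A (zh) * 2023-07-28 2023-09-01 中国人民解放军国防科技大学 Method and apparatus for extracting position profile labels based on job description parsing
CN117236647A (zh) * 2023-11-10 2023-12-15 贵州优特云科技有限公司 Artificial intelligence-based job recruitment analysis method and system
CN117236647B (zh) * 2023-11-10 2024-02-02 贵州优特云科技有限公司 Artificial intelligence-based job recruitment analysis method and system
CN117670273A (zh) * 2023-12-11 2024-03-08 南京道尔医药研究院有限公司 Employee service system based on a human resources intelligent terminal
CN117875921A (zh) * 2024-03-13 2024-04-12 北京金诚久安人力资源服务有限公司 Artificial intelligence-based human resource management method and system
CN117875921B (zh) * 2024-03-13 2024-05-24 北京金诚久安人力资源服务有限公司 Artificial intelligence-based human resource management method and system
CN118035561A (zh) * 2024-03-29 2024-05-14 上海云生未来技术集团有限公司 Big data-based position recommendation method and system
CN118333591A (zh) * 2024-05-07 2024-07-12 中国人民解放军91977部队 Dynamic optimization-based human resource scheduling method and apparatus
CN118195562A (zh) * 2024-05-16 2024-06-14 乐麦信息技术(杭州)有限公司 Method and system for assessing willingness to take up a position based on natural semantic analysis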

Also Published As

Publication number Publication date
CN111428488A (zh) 2020-07-17

Similar Documents

Publication Publication Date Title
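WO2021174919A1 (zh) Resume data information parsing and matching method, apparatus, electronic device, and medium
US11016966B2 (en) Semantic analysis-based query result retrieval for natural language procedural queries
CN108717406B (zh) Text sentiment analysis method, apparatus, and storage medium
CN112560479B (zh) Abstract extraction model training method, abstract extraction method, apparatus, and electronic device
US20180181544A1 (en) Systems for Automatically Extracting Job Skills from an Electronic Document
CN110516073A (zh) Text classification method, apparatus, device, and medium
JP5744228B2 (ja) Method and apparatus for blocking harmful information on the Internet
US9881037B2 (en) Method for systematic mass normalization of titles
US10102191B2 (en) Propagation of changes in master content to variant content
CN111753060A (zh) Information retrieval method, apparatus, device, and computer-readable storage medium
US10474752B2 (en) System and method for slang sentiment classification for opinion mining
US9483460B2 (en) Automated formation of specialized dictionaries
CN108038725A (zh) Machine learning-based method for analyzing customer satisfaction with e-commerce products
WO2021169423A1 (zh) Quality inspection method, apparatus, device, and storage medium for customer service recordings
CN109241319B (zh) Picture retrieval method, apparatus, server, and storage medium
US20130036076A1 (en) Method for keyword extraction
CN110597978B (zh) Item abstract generation method, system, electronic device, and readable storage medium
CN102043843A (zh) Method and acquisition device for acquiring target entries based on a target application
US20210397787A1 (en) Domain-specific grammar correction system, server and method for academic text
CN111695349A (zh) Text matching method and text matching system
US20220067290A1 (en) Automatically identifying multi-word expressions
US10198497B2 (en) Search term clustering
US10614109B2 (en) Natural language processing keyword analysis
WO2021174924A1 (zh) Information generation method, apparatus, electronic device, and storage medium
CN111858834B (zh) AI-based method, apparatus, device, and medium for determining the focus of dispute in a case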

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20922528; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20922528; Country of ref document: EP; Kind code of ref document: A1)