CN115481220A - Post and resume content-based intelligent matching method and system for comparison learning human posts - Google Patents


Publication number
CN115481220A
Authority
CN
China
Prior art keywords
post
resume
semantic
posts
resumes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211146339.3A
Other languages
Chinese (zh)
Other versions
CN115481220B (en)
Inventor
肖小范
刘王祥
李敬泉
刘雨晨
徐雯
谢志辉
景昊
吴显仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Today Talent Information Technology Co ltd
Original Assignee
Shenzhen Today Talent Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Today Talent Information Technology Co ltd
Priority to CN202211146339.3A
Publication of CN115481220A
Application granted
Publication of CN115481220B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/105 Human resources
    • G06Q10/1053 Employment or hiring
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an intelligent person-post matching method and system based on post and resume content using contrastive learning. The method comprises the following steps: offline training of a post-resume semantic encoder and construction of a resume index; and online person-post recall matching. Drawing on the person-post matching business scenario of the recruitment industry, the invention proposes a contrastive-learning person-post matching method for the recruitment field that is based on post and resume content. Through carefully designed training sample pairs, the post-resume contrastive-learning semantic encoder model can accurately encode the semantics of resume and post content, which improves the matching effect. By adopting a technical architecture of a cache plus Faiss, million-scale person-post matching is performed with millisecond-level response, meeting the performance requirements of high concurrency and low latency, greatly improving the efficiency of person-post matching, and reducing the burden on enterprises and job seekers.

Description

Intelligent person-post matching method and system based on post and resume content using contrastive learning
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent person-post matching method and system based on post and resume content using contrastive learning.
Background
With the development of Internet technology, more and more job seekers and enterprises publish job-hunting and recruitment information on recruitment websites, which have accumulated massive resume and post resources. By means of search and recommendation technologies, recruitment websites have greatly eased the difficulty of recruiting and job hunting, and online recruitment is gradually replacing traditional face-to-face recruitment.
At present, the most common resume-post matching technology is keyword matching: keywords are extracted from resumes and posts, resumes sharing keywords with a post are recommended to the enterprise, and posts sharing keywords with a resume are recommended to the job seeker. However, because language is diverse, different words can express the same meaning and the same word can express different meanings, so matching on the literal surface of words often yields recommended resumes that do not meet the enterprise's requirements and recommended posts that do not meet the job seeker's requirements.
To address this deficiency, semantic matching emerged after keyword matching. Each word is mapped to a high-dimensional vector by natural language processing techniques, the most typical being word2vec and GloVe; such vectors have the property that words with similar semantics lie closer together in the high-dimensional space.
The resume and the post are segmented into words, and once the word vector of each word is obtained, resumes and posts are matched by similar-vector retrieval. However, both word2vec and GloVe produce static word vectors and remain helpless in the face of polysemy. Meanwhile, most patents do not make clear how the top-k nearest-neighbor vectors are retrieved, and some even store the generated word vectors in a MySQL database, which is seriously time-consuming in a real production environment.
The person-post matching problem can be further refined into two recommendation problems: recommending posts to a person and recommending people to a post. Collaborative filtering from the e-commerce recommendation field has therefore been introduced into the recruitment field: an interaction matrix of resumes and posts is generated from the historical interactions of enterprises and job seekers, so that recommending posts to a person becomes user-based collaborative filtering and recommending people to a post becomes item-based collaborative filtering. However, traditional collaborative filtering cannot solve the cold-start problem, that is, it cannot make recommendations for enterprises and job seekers that have no interaction records.
Another scheme uses a K-means clustering algorithm. Each resume in the resume set is a data point; similar resumes are clustered by presetting the number of clusters, a post category is assigned to each resulting resume cluster, and when resumes are recommended to an enterprise they are selected from the cluster corresponding to the post category. This clustering approach has two main defects: 1. post types are numerous, so the number of post categories cannot be preset accurately; 2. there is no quantitative criterion for selecting the top-k resumes within a resume cluster.
In view of these difficulties, the invention provides an intelligent person-post matching method and system based on post and resume content using contrastive learning.
Disclosure of Invention
In view of the problems in the prior art, the invention provides an intelligent person-post matching method based on post and resume content using contrastive learning.
In order to achieve the purpose, the invention adopts the following specific scheme:
The invention provides an intelligent person-post matching method based on post and resume content using contrastive learning, which comprises the following steps:
S1: offline model training and resume index construction;
S2: online person-post matching.
Further, step S1 specifically includes the following steps:
S11: training sample pair construction; a training sample generator is provided for training the contrastive learning model;
S12: model training; the contrastive learning model is trained with the training samples generated in S11 to obtain the CV-BERT and JD-BERT models, which are used for resume and post semantic extraction respectively;
S13: full resume index construction; all resumes in the RCN resume pool are semantically encoded with the CV-BERT model trained in S12 to obtain a semantic vector for each resume; the semantic vectors of all resumes are then put into a Faiss retrieval library to construct the full resume index.
Further, step S2 specifically includes the following steps:
S21: post semantic caching; the JD-BERT trained in S12 is used in advance to extract the semantics of some hot, urgent, or active posts, which are put into a cache so that the system can quickly recommend resumes for these posts when they are processed;
S22: person-post retrieval; after S13 and S21 are completed, the post semantic vectors to be matched for the requested posts and the full resume vector index are available, and the top-k resumes related to each post in the request are recalled through the approximate nearest-neighbor algorithm of the Faiss vector retrieval engine.
The invention also discloses an intelligent person-post matching system based on post and resume content using contrastive learning, which is used to implement the above method and comprises: a training sample generator, a semantic encoder, and a real-time batch vector recaller;
the training sample generator: a Siamese-BERT model is used to semantically encode posts and resumes. To further learn the similarity between resumes and posts and between resumes and resumes, that is, inter-class and intra-class similarity, the semantic encoding model takes triplets as input: the post text is the anchor, denoted a; a resume that matches the post is the positive sample, denoted p; a resume that does not match the post is the negative sample, denoted n. The input of the semantic extraction model is thus the triplet s(a, p, n) formed by the post and the resumes;
the semantic encoder: a pseudo-Siamese-BERT model is used for accurate semantic capture and matching of posts and resumes; JD-BERT captures the semantic information of the post and CV-BERT captures the semantic information of the resume. Through accurate semantic encoding of post content and resume content, person-post pairs with highly related semantics are recommended;
the real-time batch vector recaller: Redis caching and the Faiss vector retrieval engine are used to implement batch person-post matching and retrieval.
Further, the training sample generator includes positive sample pair construction and negative sample pair construction;
the positive sample pair construction: on the RCN cooperation platform, a suitable resume is found for a post and delivered, and the resulting JD-CV pair is regarded as a positive sample pair, because in such a match the post and the resume are highly related and matched at the semantic level. All orders recommended to clients in the RCN are taken; for each order, the post serves as the anchor and the resume as the positive sample, forming a positive sample pair;
the negative sample pair construction: negative sample construction is divided into simple sample construction and difficult sample construction.
Further, the negative sample pair construction includes simple negative sample pair construction and difficult negative sample pair construction;
the simple negative sample pair construction: resumes that do not match the major, industry background, years of work experience, and skills required by the post are selected to generate negative samples for the post;
the difficult negative sample pair construction: training samples that capture slight differences between posts and resumes and between resumes and resumes.
Further, the difficult negative sample pair construction includes a skill-point-based construction method and a working-years-based construction method;
the skill-point-based construction method: resume parsing is performed based on the extraction and identification of resume skill points;
the working-years-based construction method: resumes are screened based on years of work experience.
Further, the semantic encoder: a pseudo-Siamese-BERT model is used for accurate semantic capture and matching of posts and resumes; JD-BERT captures the semantic information of the post and CV-BERT captures the semantic information of the resume.
The two BERT models are trained simultaneously. Global average pooling is applied separately to the embedding vectors of all tokens output by the last layer of JD-BERT and of CV-BERT, and the pooled embeddings serve as the JD semantic vector and the CV semantic vector respectively, where JD and CV denote the post and the resume. During training, the cosine distance between the JD and CV semantic vectors measures the semantic relationship between the two: the more related the semantics, the smaller the distance between the semantic vectors, as shown in formula (1):
[Formula (1): equation image not reproduced in the text]
The post anchor a and the positive sample resume p match semantically, whereas the negative sample resume n does not match the post semantically. The training goal is for the model to encode the text semantics of posts and resumes accurately, so that the anchor a and the positive sample p are semantically very similar, that is, the angle between their semantic vectors in space is as small as possible; conversely, the semantic similarity between the anchor a and the negative sample n should be as low as possible, that is, the angle between their semantic vectors should be as large as possible. In other words, the model shrinks the distance of the positive sample pair as much as possible and enlarges the distance of the negative sample pair as much as possible;
Denoting the semantic vector of the anchor post, the semantic vector of the positive sample resume, and the semantic vector of the negative sample resume by three symbols (inline symbol images not reproduced in the text), the loss function for training the person-post matching semantic encoding Siamese-BERT model is obtained as shown in formula (2):
[Formula (2): equation image not reproduced in the text]
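Since the equation images are not available in the text, the following is only a minimal sketch of what a cosine-distance triplet objective of this kind commonly looks like. It assumes the usual definitions: cosine distance as 1 minus cosine similarity, and a margin-based triplet loss. The function names and the `margin` value are hypothetical and are not taken from the patent.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); smaller means semantically closer."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss over the triplet s(a, p, n).

    The loss is zero once the negative resume is at least `margin`
    farther from the post than the positive resume is.
    `margin` is a hypothetical hyperparameter, not a value from the patent.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D vectors standing in for JD-BERT / CV-BERT outputs.
post_vec = [1.0, 0.0]
matching_cv = [1.0, 0.0]
unrelated_cv = [0.0, 1.0]

well_separated = triplet_loss(post_vec, matching_cv, unrelated_cv)
badly_separated = triplet_loss(post_vec, unrelated_cv, matching_cv)
```

In this toy case the well-separated triplet incurs no loss, while swapping the positive and negative resumes produces a positive loss that training would push down.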
Further, the real-time batch vector recaller:
Redis caching and the Faiss vector retrieval engine are used to implement batch person-post matching and retrieval.
Redis stores the semantic vectors of posts, which reduces the latency that computing post semantics would add to a matching request. The vectors are stored in the cache as key-value pairs (post ID, post vector) under the allkeys-lru eviction policy, which evicts the least recently used post data from the whole key set. This keeps enough hot posts in the cache, raises the cache hit rate, and lets post semantic vectors be fetched quickly; only when a requested post misses the cache is its semantic vector computed with the offline-trained model;
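The eviction behavior described above can be illustrated with a small in-process stand-in. This is a sketch of strict least-recently-used eviction built on `collections.OrderedDict`; it is not a Redis client, and Redis itself approximates LRU by sampling keys. The class name and capacity are hypothetical.

```python
from collections import OrderedDict

class PostVectorCache:
    """In-memory stand-in for a (post ID -> post vector) cache with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, post_id):
        """Return the cached vector, or None on a cache miss."""
        if post_id not in self._store:
            return None
        # A hit makes this post the most recently used entry.
        self._store.move_to_end(post_id)
        return self._store[post_id]

    def put(self, post_id, vector):
        """Insert or refresh a vector, evicting the least recently used post if full."""
        self._store[post_id] = vector
        self._store.move_to_end(post_id)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)

cache = PostVectorCache(capacity=2)
cache.put("jd-1", [0.1, 0.2])
cache.put("jd-2", [0.3, 0.4])
cache.get("jd-1")              # jd-1 becomes the most recently used post
cache.put("jd-3", [0.5, 0.6])  # capacity exceeded: jd-2 is evicted
```

Hot posts that keep being requested stay in the cache, while posts that are no longer queried fall out first, which is the effect the text attributes to the allkeys-lru policy.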
The trained CV-BERT generates the semantic vectors of all resume texts offline. All resume semantic vectors are put into the Faiss vector retrieval engine and indexed; the index is built with the FlatL2 retrieval method, whose similarity measure is the L2 norm, that is, the Euclidean distance;
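To make the retrieval semantics concrete, the following pure-Python sketch performs the same kind of exhaustive L2 search that a flat index carries out. It deliberately does not call Faiss; the function name and toy vectors are hypothetical. Like Faiss's flat L2 index, it ranks by squared Euclidean distance, which gives the same ordering as the Euclidean distance itself.

```python
import heapq

def flat_l2_search(index, query, k):
    """Exhaustively scan all resume vectors and return the k nearest.

    index: list of (resume_id, vector) pairs
    returns: list of (squared_l2_distance, resume_id), nearest first
    """
    def squared_l2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    scored = ((squared_l2(vector, query), resume_id) for resume_id, vector in index)
    return heapq.nsmallest(k, scored)

# Toy 2-D vectors standing in for CV-BERT resume embeddings.
resume_index = [
    ("cv-1", [0.0, 0.0]),
    ("cv-2", [1.0, 1.0]),
    ("cv-3", [0.2, 0.1]),
]
hits = flat_l2_search(resume_index, [0.0, 0.0], k=2)
top_ids = [resume_id for _, resume_id in hits]
```

A flat scan compares the query against every stored vector, so its results are exact; an approximate index would trade some recall for speed at larger scales.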
When a batch of post-to-resume matching requests enters the system, each JD is first looked up in the cache in request order. On a hit, the cached post semantic vector is returned; on a miss, the post semantic vector is computed with the offline-trained JD-BERT. Once the post semantic vector is obtained, it is searched against the Faiss library in which the resume semantic vectors are indexed, and finally the top-K resumes semantically matching each post are returned.
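The request flow just described can be sketched as follows. The encoder and the search function here are hypothetical stubs standing in for JD-BERT and the Faiss index, a plain dict stands in for the Redis cache, and writing a freshly computed vector back to the cache is an assumption rather than something the text states.

```python
def match_batch(post_requests, cache, encode_post, search_resumes, k):
    """Process posts in request order: cache lookup, encode on miss, then search."""
    results = {}
    for post_id, post_text in post_requests:
        vector = cache.get(post_id)
        if vector is None:
            # Cache miss: fall back to the (offline-trained) post encoder.
            vector = encode_post(post_text)
            cache[post_id] = vector  # assumption: misses are written back
        results[post_id] = search_resumes(vector, k)
    return results

def fake_encode(text):
    """Hypothetical stand-in for JD-BERT: maps text length to a 2-D vector."""
    return [float(len(text)), 0.0]

def nearest_resumes(vector, k):
    """Hypothetical stand-in for the Faiss search over indexed resume vectors."""
    index = {"cv-1": [3.0, 0.0], "cv-2": [10.0, 0.0], "cv-3": [4.0, 0.0]}
    ranked = sorted(
        index,
        key=lambda rid: sum((a - b) ** 2 for a, b in zip(index[rid], vector)),
    )
    return ranked[:k]

cache = {"jd-1": [10.0, 0.0]}  # jd-1 is already cached, jd-2 is not
matches = match_batch(
    [("jd-1", "unused text"), ("jd-2", "java")],
    cache, fake_encode, nearest_resumes, k=2,
)
```

Only the second request pays the encoding cost here; the first is answered directly from the cached vector.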
By adopting the technical scheme of the invention, the invention has the following beneficial effects:
1. Drawing on the person-post matching business scenario of the recruitment industry, the invention proposes a contrastive-learning person-post matching method for the recruitment field based on post and resume content; through carefully designed training sample pairs, the model can accurately encode the semantics of resume and post content, improving the person-post matching effect;
2. The scheme adopts a technical architecture of a cache plus Faiss and achieves million-scale person-post matching with millisecond-level response, meeting the performance requirements of high concurrency and low latency, greatly improving the efficiency of person-post matching, and reducing the burden on enterprises and job seekers.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a detailed flowchart of step S1 of the present invention;
FIG. 3 is a detailed flowchart of step S2 of the present invention;
FIG. 4 is a schematic diagram of the implementation principle of the present invention;
FIG. 5 is an example diagram of the skill tree of a Java backend post of the present invention;
FIG. 6 is a structural diagram of the contrastive-learning-based person-post matching Siamese-BERT model of the present invention;
FIG. 7 shows the distance relationship in space of a training sample triplet s(a, p, n) of the present invention;
FIG. 8 is an exemplary flow chart of real-time recall of matching resumes for a batch of post requests of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The invention provides an intelligent person-post matching method based on post and resume content using contrastive learning, which comprises the following steps:
S1: offline model training and resume index construction; the main sub-steps are training sample pair construction, model training, and construction of the Faiss index over the full set of resume vectors;
S2: online person-post matching; this step mainly considers the online person-post matching scenarios of the RCN, of which there are two: finding people for a post and finding posts for a person. The real-time batch vector recaller part of the invention is explained mainly with the finding-people-for-a-post scenario as the example. Online person-post matching consists of two sub-steps: post semantic caching and person-post retrieval.
As shown in fig. 2, step S1 specifically includes the following steps:
S11: training sample pair construction; this part provides a training sample generator for training the contrastive learning model; for details, see the training sample generator part of the Disclosure of Invention;
S12: model training; this part trains the contrastive learning model proposed in this scheme with the training samples generated in S11; for the detailed structure of the model, see the semantic encoder part of the Disclosure of Invention. After the model is fully trained, two models, CV-BERT and JD-BERT, are obtained, which are used for resume and post semantic extraction respectively;
S13: full resume index construction; all resumes in the RCN resume pool are semantically encoded with the CV-BERT model trained in S12 to obtain a semantic vector for each resume; the semantic vectors of all resumes are then put into a Faiss retrieval library to construct the full resume index. For the specific construction method and parameters of the Faiss resume index, see the "real-time batch vector recaller" part of the Disclosure of Invention.
As shown in fig. 3, step S2 specifically includes the following steps:
S21: post semantic caching;
Post semantic caching uses the JD-BERT trained in S12 to extract, in advance, the semantics of some hot, urgent, or active posts and puts them into the cache, so that the system can quickly recommend resumes for these posts when a headhunter processes them. Because the cache may expire, posts may become invalid, and so on, when a post queried by a headhunter misses the cache, JD-BERT must encode the post's semantics again. In short, this step yields the semantic vector of the post;
S22: person-post retrieval;
After S13 and S21 are completed, the semantic vectors of the batch of requested posts and the full resume vector index are available, so S22 can recall the top-k resumes related to each post in the request through the approximate nearest-neighbor algorithm of the Faiss vector retrieval engine; for details, see the "real-time batch vector recaller" part of the Disclosure of Invention. This completes the description of the implementation steps of the scheme.
The invention also relates to an intelligent person-post matching system based on post and resume content using contrastive learning, which is used to implement the above method and, as shown in FIG. 4, comprises: a training sample generator, a semantic encoder, and a real-time batch vector recaller;
1. Training sample generator: to accurately extract the semantic information of resumes and posts in the same semantic space, a Siamese-BERT model is used, as described above, to semantically encode posts and resumes. To further learn the similarity between resumes and posts and between resumes and resumes, that is, inter-class and intra-class similarity, the resume-post semantic encoding model takes triplets as input. Considering that the number of posts in the system is far smaller than the number of resumes, the post text is taken as the anchor, denoted a. A resume that matches the post is taken as the positive sample, denoted p. A resume that does not match the post is taken as the negative sample, denoted n. The input of the semantic extraction model is thus the triplet s(a, p, n) formed by the post and the resumes. The input shown in FIG. 6 gives an example of a triplet.
For a resume, its text information, including basic information (gender, location, etc.), educational background (school, major, etc.), work experience (company, industry, etc.), and project experience (project content, etc.), is spliced together to form the text representation of the resume. For a post, the company information, basic requirements, skill requirements, and so on are spliced together as the text features of the post.
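The splicing step can be sketched as plain field concatenation. The field names, their order, the separator, and the sample values below are all hypothetical; the patent does not specify how the fields are joined.

```python
def splice_fields(record, field_order, sep=" "):
    """Join the non-empty text fields of a record in a fixed order."""
    parts = [
        record[name].strip()
        for name in field_order
        if record.get(name, "").strip()
    ]
    return sep.join(parts)

# Hypothetical field names for the two kinds of documents.
RESUME_FIELDS = ["basic_info", "education", "work_experience", "other_text"]
POST_FIELDS = ["company_info", "basic_requirements", "skill_requirements"]

post = {
    "company_info": "Internet company",
    "basic_requirements": "3 years of experience",
    "skill_requirements": "NLP algorithms",
}
resume = {
    "basic_info": "Shenzhen",
    "education": "computer science",
    "work_experience": "",  # empty fields are skipped
}

post_text = splice_fields(post, POST_FIELDS)
resume_text = splice_fields(resume, RESUME_FIELDS)
```

Keeping the field order fixed means that two resumes with the same content always produce the same text, which keeps the encoder input stable across training and indexing.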
The training sample generator comprises:
Positive sample pair construction: on the RCN headhunting cooperation platform, a headhunter studies the JD (job description) of a post in detail and then finds a suitable resume to match and deliver to the post, so a JD-CV pair pushed by a headhunter to an enterprise client can be regarded as a positive sample pair. In such a match, the post and the resume are highly related and matched at the semantic level, so all orders recommended to clients in the RCN can be taken; for each order, the post serves as the anchor and the resume as the positive sample, forming a positive sample pair.
Negative sample pair construction: given the business process, positive samples are easy to construct, whereas the construction method and sampling strategy for negative samples are the difficult part of this scheme. In the contrastive learning model, negative sample construction is divided into simple sample construction and difficult sample construction.
The negative sample pair construction includes:
Simple negative sample pair construction: a simple negative sample is, as the name implies, a negative sample that is relatively easy to distinguish, typically one in which the post and the resume are clearly mismatched. For example, if a post requires 3 years of experience with NLP algorithms and the resume recommended for it belongs to a job seeker with only a packaging and printing background and related work experience, the recommendation is clearly invalid because the post and the resume do not match. Semantically, the resume differs greatly from the post requirements. Therefore, resumes that do not match the major, industry background, years of work experience, skills, and so on required by the post can be selected to generate negative samples for the post.
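One way to read this selection rule is as an attribute filter. The sketch below assumes that a simple negative must mismatch on all four attributes at once; the text leaves open whether all or only some must differ. The record layout and field names are hypothetical.

```python
def is_simple_negative(post, resume):
    """True when the resume mismatches the post on every screened attribute.

    Assumption: all four attributes must mismatch for an easy negative.
    """
    no_skill_overlap = not (set(post["skills"]) & set(resume["skills"]))
    return (
        resume["major"] != post["major"]
        and resume["industry"] != post["industry"]
        and resume["years"] < post["min_years"]
        and no_skill_overlap
    )

def simple_negatives(post, resumes):
    """Collect the IDs of resumes usable as simple negatives for the post."""
    return [r["id"] for r in resumes if is_simple_negative(post, r)]

nlp_post = {
    "major": "computer science",
    "industry": "internet",
    "min_years": 3,
    "skills": ["NLP"],
}
resume_pool = [
    {"id": "cv-1", "major": "packaging and printing", "industry": "printing",
     "years": 1, "skills": ["offset printing"]},
    {"id": "cv-2", "major": "computer science", "industry": "internet",
     "years": 4, "skills": ["NLP", "machine translation"]},
]
easy_negative_ids = simple_negatives(nlp_post, resume_pool)
```

Requiring every attribute to mismatch keeps these negatives unambiguous; resumes that mismatch on only one attribute are closer to the difficult samples discussed next.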
Difficult negative sample pair construction: in contrastive learning, if only simply constructed negative samples are used, resumes that differ greatly from the post semantics can be clearly distinguished, but slight differences between posts and resumes, and between resumes and resumes, cannot. For example, among posts that all recruit for "NLP algorithm", some require candidates to have project experience in "machine translation", while others require project experience in "entity extraction" or "knowledge graph"; although these skill points are all NLP-related, their emphasis differs. Such cases should therefore be taken into account when constructing negative samples. Training samples of this type are collectively referred to here as difficult samples.
The difficult negative sample pairs in this scheme are constructed as follows:
a) Skill-point-based construction method:
Extracting and identifying resume skill points is a very important but difficult part of resume parsing. In the person-post matching scenario, skill points can be viewed as a tree structure.
FIG. 5 gives an example for a "Java backend" post. The figure shows that posts with the same name need different skill points and related work and project experience depending on the emphasis of the work content. Therefore, when constructing difficult negative sample pairs, difficult negative samples under the same post name must be considered. In the specific implementation of the scheme, after the post and the skill node of the positive sample are determined, negative samples are randomly selected from subtrees on branches of the skill tree different from that of the positive sample. For example, for a "Java big data engineer" post, a resume forms a positive sample as long as it matches the "big data" skill node in the skill tree, the matching path being "Java backend" - "development" - "big data"; negative samples can then be constructed from the subtrees of the skill tree other than the one rooted at the "big data" node. For example, a resume covering the skill path "Java backend" - "development" - "ssm" - "spring mvc" is selected as a negative sample for this post. The difficulty of the constructed sample varies with the distance in the skill tree between the chosen skill node and the post's skill node.
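The skill-tree rule can be sketched by representing each resume's skill node as a path from the root. The code below lists candidate difficult negatives (same root, outside the post's subtree) and orders them by tree distance so the difficulty gradient is visible; the text samples randomly, so a random draw from this list would follow it more closely. The helper names and the nodes "spark", "testing", "automation", "frontend", and "vue" are hypothetical.

```python
def shared_prefix_len(path_a, path_b):
    """Number of leading skill nodes the two paths have in common."""
    count = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        count += 1
    return count

def tree_distance(path_a, path_b):
    """Edges between two nodes of the skill tree via their lowest common ancestor."""
    shared = shared_prefix_len(path_a, path_b)
    return (len(path_a) - shared) + (len(path_b) - shared)

def hard_negative_candidates(post_path, resume_paths):
    """Resumes under the same root but outside the post's subtree, closest first."""
    candidates = []
    for resume_id, path in resume_paths.items():
        in_post_subtree = path[:len(post_path)] == post_path
        same_root = path[:1] == post_path[:1]
        if same_root and not in_post_subtree:
            candidates.append((tree_distance(post_path, path), resume_id))
    return [resume_id for _, resume_id in sorted(candidates)]

post_path = ("Java backend", "development", "big data")
resume_paths = {
    "cv-1": ("Java backend", "development", "big data", "spark"),    # positive side
    "cv-2": ("Java backend", "development", "ssm", "spring mvc"),
    "cv-3": ("Java backend", "testing", "automation"),
    "cv-4": ("frontend", "vue"),                                     # different root
}
ranked_negatives = hard_negative_candidates(post_path, resume_paths)
```

Candidates that share a longer path prefix with the post sit closer in the tree, which is one way to grade how difficult a negative is.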
b) Construction method based on working years:
In person-post matching, "working years" is also a strong resume screening criterion. For example, when a post recruits a senior engineer (5 years or more), two resumes A and B may have the same professional background, industry background, skill points and so on, but A has only 2 years of work experience while B has 6; B obviously conforms better to the definition of "senior". If only the semantics of skills, industry, project background and the like are considered, resume A could easily be recommended to the post, causing a person-post mismatch. Therefore, when constructing difficult samples, this scheme takes the important attribute "working years" into account, so that the contrastive-learning-based position and resume semantic encoding model can capture such subtle semantic differences.
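As a rough sketch of the working-years rule, the following pairs a positive resume with resumes that look identical on skills but fall short of the post's minimum experience. The `Resume` fields, the exact-equality test on skills, and the function name are assumptions for illustration; the source does not specify how "same background" is decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resume:
    resume_id: str
    skills: frozenset      # hypothetical stand-in for background and skill points
    years: float           # years of work experience


def years_based_hard_negatives(positive, pool, min_years):
    """Resumes with the same skills as the positive but too little experience."""
    return [
        r for r in pool
        if r.resume_id != positive.resume_id
        and r.skills == positive.skills    # assumed notion of "same background"
        and r.years < min_years
    ]


skills = frozenset({"nlp", "python"})
resume_a = Resume("A", skills, 2)
resume_b = Resume("B", skills, 6)
resume_c = Resume("C", frozenset({"sales"}), 1)

# Senior post requiring 5+ years: B is the positive, A is the hard negative.
negatives = years_based_hard_negatives(resume_b, [resume_a, resume_b, resume_c], 5)
print([r.resume_id for r in negatives])
```

Resume C is excluded because it differs on skills as well, which would make it a simple negative rather than a difficult one.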
2. Semantic encoder
A good person-post matching recommendation should be one in which the post requirements and the resume are semantically highly related and matched. If a post is recruiting an NLP algorithm engineer but an insurance sales resume is recommended, the post information and the resume information are semantically mismatched, so the recommendation not only fails to achieve person-post matching but also harms the user experience. Therefore, accurately capturing the semantics of post content and resumes in person-post matching is a precondition for document matching.
Semantic encoder:
The invention adopts a pseudo-Siamese-BERT model for accurate semantic capture and matching of posts and resumes. As the name implies, Siamese-BERT is composed of two identical BERT models whose parameters are shared and kept consistent, that is, there is in effect only one BERT model. Pseudo-Siamese-BERT also consists of two BERT models, except that the two models are not identical. The model used in this scheme is shown in figure 3. The JD-BERT on the left captures the semantic information of the post, and the CV-BERT on the right captures the semantic information of the resume. Compared with static semantic encoding algorithms, the BERT model has a self-attention mechanism and encodes semantics according to the surrounding context; it has large capacity and many parameters, has been pre-trained on multiple language datasets, and has learned an excellent language model, so it is well suited to the semantic capture task. In model training, the two BERTs are trained simultaneously. Global average pooling (excluding CLS) is applied to the embedding vectors of all words output by the last layer of the JD-BERT model and of the CV-BERT model respectively, and the pooled embeddings are used as the JD (position) semantic vector and the CV (resume) semantic vector. In training, the cosine distance between the JD and CV semantic vectors measures the semantic relationship between the two: the more related the semantics, the smaller the distance between the semantic vectors, as shown in formula (1).
Figure BDA0003855425500000121
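Formula (1) is an image in the source and is not reproduced in this text. As a minimal sketch, assuming the cosine distance takes the common form d(u, v) = 1 - cos(u, v) and that the first token vector of each sequence is the [CLS] embedding, the pooling and distance steps could look like this. Plain Python lists stand in for BERT outputs, and the function names are illustrative.

```python
import math


def mean_pool(token_vectors, skip_first=True):
    """Average token embeddings; skip_first drops the leading [CLS] vector."""
    rows = token_vectors[1:] if skip_first else token_vectors
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def cosine_distance(u, v):
    """Assumed form of formula (1): 1 - cos(u, v); smaller means more related."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy last-layer outputs: first row plays the role of [CLS] and is ignored.
jd_tokens = [[9.0, 9.0], [1.0, 0.0], [3.0, 0.0]]
cv_tokens = [[9.0, 9.0], [2.0, 0.0], [2.0, 2.0]]

jd_vector = mean_pool(jd_tokens)
cv_vector = mean_pool(cv_tokens)
print(jd_vector, cv_vector, cosine_distance(jd_vector, cv_vector))
```

Under this definition the distance is 0 for vectors pointing the same way, 1 for orthogonal vectors and 2 for opposite vectors.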
The construction of the training sample triplet s (a, p, n) is described in detail in the sample collection part. In the triplet, the post anchor a and the positive sample resume p match semantically, while the negative sample resume n does not match the post semantically. The training goal of the model is to encode the text semantics of posts and resumes accurately, so that the post anchor a and the positive sample resume p are semantically very similar, that is, the angle between their semantic vectors in space is as small as possible. Conversely, the post anchor a and the negative sample resume n should be as dissimilar as possible semantically, that is, the angle between their semantic vectors is as large as possible. During training, to improve the robustness of the model, a hyper-parameter lambda is introduced, so that the model reduces the distance between positive sample pairs as much as possible and enlarges the distance between negative sample pairs as much as possible. FIG. 7 shows the spatial distance relationship of the triplet s (a, p, n) during training.
If
Figure BDA0003855425500000122
denotes the semantic vector of the anchor post, and
Figure BDA0003855425500000123
and
Figure BDA0003855425500000124
denote respectively the semantic vector of the positive sample resume and the semantic vector of the negative sample resume paired with it, the loss function for training the person-post matching semantic encoding Siamese-BERT model is as shown in formula (2):
Figure BDA0003855425500000125
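Formula (2) is likewise an image that is not reproduced here. The description (anchor, positive, negative, and a hyper-parameter lambda that separates the two distances) is consistent with a standard triplet margin loss, so the sketch below assumes the form max(0, d(a, p) - d(a, n) + lambda) with the cosine distance; this is an assumption rather than the patent's confirmed formula, and the margin value 0.2 is arbitrary.

```python
import math


def cosine_distance(u, v):
    """1 - cos(u, v), the assumed distance between two semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Assumed triplet margin loss; `margin` plays the role of lambda."""
    d_ap = cosine_distance(anchor, positive)   # should become small
    d_an = cosine_distance(anchor, negative)   # should become large
    return max(0.0, d_ap - d_an + margin)


post = [1.0, 0.0]
good_resume = [1.0, 0.0]
bad_resume = [0.0, 1.0]

# Well separated triplet: no loss. Swapped roles: large loss.
print(triplet_loss(post, good_resume, bad_resume))
print(triplet_loss(post, bad_resume, good_resume))
```

With this form the loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training effort concentrates on triplets that are still too close, which is where the difficult negatives matter.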
3. Real-time batch vector recaller
As shown in fig. 8, this scheme uses Redis caching and the Faiss vector search engine to implement batch person-post matching and retrieval.
In this scheme, Redis is used to store post semantic vectors, which greatly reduces the latency of computing post semantics within a matching request. The semantic vector of a post is stored in the cache as a key-value pair (post ID, post vector). The allkeys-lru eviction policy is adopted: the least recently used post data is selected from the whole key set for eviction, which ensures that enough hot posts stay in the cache, improves the cache hit rate, and allows post semantic vectors to be fetched quickly. The semantic vector of a post is computed with the offline-trained model only when the requested post misses the cache.
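The caching behaviour can be illustrated without a Redis server. The sketch below is an in-process stand-in, not Redis itself: it keeps (post ID, post vector) pairs in an `OrderedDict` and evicts the least recently used entry when a capacity limit is exceeded. In Redis the equivalent behaviour would come from setting `maxmemory` together with `maxmemory-policy allkeys-lru`, where eviction is triggered by memory use and the LRU choice is approximate. The class name and the `encode` callback are hypothetical.

```python
from collections import OrderedDict


class PostVectorCache:
    """In-process LRU stand-in for the Redis post-vector cache."""

    def __init__(self, capacity, encode):
        self.capacity = capacity
        self.encode = encode          # hypothetical stand-in for JD-BERT
        self._store = OrderedDict()   # post ID -> post vector
        self.hits = 0
        self.misses = 0

    def get(self, post_id, post_text):
        if post_id in self._store:
            self._store.move_to_end(post_id)   # mark as most recently used
            self.hits += 1
            return self._store[post_id]
        self.misses += 1
        vector = self.encode(post_text)        # computed only on a miss
        self._store[post_id] = vector
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)    # evict least recently used
        return vector

    def __contains__(self, post_id):
        return post_id in self._store


cache = PostVectorCache(capacity=2, encode=lambda text: [float(len(text))])
cache.get("jd-1", "java big data engineer")
cache.get("jd-2", "nlp algorithm engineer")
cache.get("jd-1", "java big data engineer")    # hit, refreshes jd-1
cache.get("jd-3", "insurance sales")           # evicts jd-2
print(cache.hits, cache.misses, "jd-2" in cache)
```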
Faiss (Facebook AI Similarity Search) is a clustering and similarity search library open-sourced by Facebook AI. It provides efficient similarity search and clustering for dense vectors, supports search over billion-scale vector sets, and is currently the most mature approximate nearest neighbor search library. It contains algorithms for searching vector sets of any size, offers highly efficient improvements over the basic algorithms, and provides GPU implementations of the core algorithms, so Faiss is fast and can achieve millisecond-level retrieval on billion-scale indexes. In this scheme, the trained CV-BERT is used to generate the semantic vectors of all resume documents offline; all document semantic vectors are put into the Faiss vector retrieval engine, and an index is built over them. The index is constructed with the FlatL2 retrieval method, where L2 indicates that the similarity metric used by the index is the L2 norm, i.e. the Euclidean distance.
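A flat L2 index compares the query against every stored vector. The pure-Python sketch below mimics that exhaustive search so it can run without Faiss installed; it is a stand-in, and the function names are illustrative. In Faiss itself the corresponding index class is `faiss.IndexFlatL2`. One point worth noting: training uses cosine distance while the index uses Euclidean distance, and the two rankings agree when vectors have unit length, since ||u - v||^2 = 2 - 2 cos(u, v) in that case. Whether the scheme normalizes its vectors is not stated in the source, so the normalization step here is an assumption.

```python
import heapq
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def flat_l2_search(index_vectors, query, k):
    """Exhaustive search: the k smallest (squared L2 distance, row id) pairs."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), row)
        for row, vec in enumerate(index_vectors)
    )
    return heapq.nsmallest(k, scored)


resume_vectors = [l2_normalize(v) for v in ([2.0, 0.0], [0.0, 5.0], [3.0, 4.0])]
post_vector = l2_normalize([1.0, 0.0])

for distance, row in flat_l2_search(resume_vectors, post_vector, k=2):
    cosine = sum(a * b for a, b in zip(resume_vectors[row], post_vector))
    # For unit vectors the squared L2 distance equals 2 - 2 * cosine.
    print(row, round(distance, 6), round(2 - 2 * cosine, 6))
```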
The real-time semantic matching process for a batch of positions is shown in fig. 8. When a batch position-to-resume matching request enters the system, each JD is looked up in the cache in request order; on a hit, the position semantic vector corresponding to the JD is returned. On a miss, the semantic vector of the post is computed with the offline-trained JD-BERT. Once the post semantic vectors are obtained, they are searched against the indexed resume semantic vectors in the Faiss library, and finally the topK resumes (K = 20 in this scheme) semantically matching each post are returned.
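The request flow just described (cache lookup, fallback encoding on a miss, index search, top-K return) can be sketched end to end as below. A plain dict stands in for the Redis cache, a sorted scan stands in for the Faiss index, and `encode_post` is a hypothetical placeholder for the offline-trained JD-BERT; only the orchestration is the point of the example.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_batch(requests, cache, encode_post, resume_ids, resume_vectors, k=20):
    """Return the top-k resume IDs for each (post ID, post text) in the batch."""
    results = {}
    for post_id, post_text in requests:       # processed in request order
        vector = cache.get(post_id)
        if vector is None:                    # cache miss: encode, then store
            vector = encode_post(post_text)
            cache[post_id] = vector
        order = sorted(
            range(len(resume_vectors)),
            key=lambda row: squared_l2(resume_vectors[row], vector),
        )
        results[post_id] = [resume_ids[row] for row in order[:k]]
    return results


encoder_calls = []


def fake_encoder(text):
    """Hypothetical stand-in for JD-BERT; records which texts were encoded."""
    encoder_calls.append(text)
    return [1.0, 0.0]


cache = {"jd-1": [0.0, 1.0]}                  # jd-1 is already cached
resume_ids = ["cv-a", "cv-b", "cv-c"]
resume_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

matches = match_batch(
    [("jd-1", "cached post"), ("jd-2", "new post")],
    cache, fake_encoder, resume_ids, resume_vectors, k=2,
)
print(matches, encoder_calls)
```

Only the uncached post reaches the encoder, which is the latency saving the cache is meant to provide.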
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (9)

1. A post and resume content-based intelligent matching method for comparison learning of human posts is characterized by comprising the following steps:
S1, offline model training and resume index construction;
S2, online person-post matching.
2. The post and resume content-based intelligent matching method for comparison learning human posts according to claim 1, wherein the step S1 specifically comprises the following steps:
S11, training sample pair construction: providing a training sample generator for training the contrastive learning model;
S12, model training: training the contrastive learning model with the training samples generated in S11 to obtain CV-BERT and JD-BERT models, which are used for resume and post semantic extraction respectively;
S13, full resume index construction: for all resumes in the RCN resume pool, performing semantic encoding with the CV-BERT model trained in S12 to obtain a semantic vector for each resume; then putting the semantic vectors of all resumes into a Faiss search library and constructing a full resume index.
3. The post and resume content-based intelligent matching method for comparison learning human posts according to claim 1, wherein the step S2 specifically comprises the following steps:
S21, post semantic caching: extracting in advance, with the JD-BERT trained in S12, the semantics of some hot, urgent or active positions and putting them into a cache, so that when these positions are processed the system can quickly recommend resumes for them;
S22, person-post retrieval: after S13 and S21 are completed, the semantic vectors of the posts to be matched in the request and the full resume vector index are available, and the topK resumes related to each post in the request are recalled through the approximate nearest neighbor algorithm of the Faiss vector retrieval engine.
4. A post and resume content-based intelligent matching system for comparison learning human posts, which is used for realizing the post and resume content-based intelligent matching method for comparison learning human posts as claimed in any one of claims 1-3, and is characterized in that the system comprises: training sample generator, semantic coder, real-time batch matcher;
the training sample generator: a Siamese-BERT model is adopted for semantic encoding of posts and resumes; in order to further learn the similarity between resumes and posts and the similarity between resumes and resumes, namely inter-class similarity and intra-class similarity, the semantic encoding model takes triplets as input: the post text information serves as the anchor, denoted a; a resume matching the post serves as the positive sample, denoted p; a resume not matching the post serves as the negative sample, denoted n; that is, the input of the semantic extraction model is the triplet s (a, p, n) formed by the post and the resumes;
the semantic encoder: a pseudo-Siamese-BERT model is adopted for accurate semantic capture and matching of posts and resumes, JD-BERT being used to capture the semantic information of posts and CV-BERT being used to capture the semantic information of resumes; person-post pairs with highly relevant semantics are recommended through precise semantic encoding of post content and resume content;
the real-time batch vector recaller: and adopting a Redis cache technology and a Faiss vector retrieval engine to realize batch human-job matching and retrieval.
5. The post and resume content based comparative learning human post intelligent matching system according to claim 4, wherein the training sample generator comprises: positive sample pair construction and negative sample pair construction;
the positive sample pair construction: in the RCN cooperation platform, when a suitable resume is found for a post, matched and delivered, the JD-CV pair is regarded as a pair of positive samples; all orders recommended to clients in the RCN are taken, and for each order the post serves as the anchor and the resume as the positive sample, forming a positive sample pair in which the post and the resume are highly related and matched at the semantic level;
the negative sample pair construction: the construction of negative examples is divided into simple and difficult example constructions.
6. The post and resume content based comparative learning human post intelligent matching system according to claim 5, wherein the negative sample pair construction comprises: a simple negative sample pair structure and a difficult negative sample pair structure;
the simple negative sample pair construction: selecting a resume which is not matched with the specialty, the industry background, the working age and the mastering skill required by the post, and generating a negative sample for the post;
the difficult negative sample pair construction: training samples with slight differences between posts and resumes, and between resumes and resumes.
7. The post and resume content based contrast learning human post intelligent matching system of claim 6, wherein the difficult negative sample pair construction comprises: a construction method based on skill points and a construction method based on working years;
the construction method based on the skill points comprises the following steps: resume analysis is carried out based on extraction and identification of resume skill points;
the construction method based on the working years comprises the following steps: and carrying out resume screening based on the working years.
8. The post and resume content based contrast learning post intelligent matching system of claim 4,
the semantic encoder: a pseudo-Siamese-BERT model is adopted for accurate semantic capture and matching of posts and resumes, JD-BERT being used to capture the semantic information of posts and CV-BERT being used to capture the semantic information of resumes,
the two BERTs are trained simultaneously; global average pooling is applied to the embedding vectors of all words output by the last layer of the JD-BERT and CV-BERT models respectively, and the pooled embeddings are used as the JD semantic vector and the CV semantic vector, where JD and CV denote the post and the resume respectively; in training, the cosine distance between the JD and CV semantic vectors measures the semantic relationship between JD and CV, and the more related the semantics, the smaller the distance between the semantic vectors, as shown in formula (1):
Figure FDA0003855425490000031
the post anchor a and the positive sample resume p match semantically, while the negative sample resume n does not match the post semantically; the training goal of the model is to encode the text semantics of posts and resumes accurately, so that the post anchor a and the positive sample resume p are semantically very similar, that is, the angle between their semantic vectors in space is as small as possible; conversely, the post anchor a and the negative sample resume n are semantically as dissimilar as possible, that is, the angle between their semantic vectors in space is as large as possible, so that the model reduces the distance between positive sample pairs as much as possible and enlarges the distance between negative sample pairs as much as possible;
with
Figure FDA0003855425490000041
denoting the semantic vector of the anchor post, and
Figure FDA0003855425490000042
and
Figure FDA0003855425490000043
denoting respectively the semantic vector of the positive sample resume and the semantic vector of the negative sample resume paired with it, the loss function for training the person-post matching semantic encoding Siamese-BERT model is as shown in formula (2):
Figure FDA0003855425490000044
9. The post and resume content based contrast learning human post intelligent matching system of claim 4, wherein the real-time batch vector recaller:
adopting Redis cache technology and Faiss vector search engine to realize batch human-job matching and searching,
Redis is used to store the semantic vectors of posts, reducing the latency caused by computing post semantics within a matching request; the semantic vector of a post is stored in the cache as a key-value pair (post ID, post vector); the allkeys-lru eviction policy is adopted, selecting the least recently used post data from the whole key set for eviction, which ensures that enough hot posts are kept in the cache, improves the cache hit rate and allows post semantic vectors to be fetched quickly; only when a requested post misses the cache is its semantic vector computed with the offline-trained model;
the trained CV-BERT is used to generate the semantic vectors of all resume documents offline; all document semantic vectors are put into the Faiss vector retrieval engine and an index is built over them; the index is constructed with the FlatL2 retrieval method, where L2 indicates that the similarity metric used by the index is the L2 norm, i.e. the Euclidean distance;
when a batch of post-to-resume matching requests enters the system, each JD is looked up in the cache in request order; on a hit, the post semantic vector corresponding to the JD is returned; on a miss, the post semantic vector is computed with the offline-trained JD-BERT; after the post semantic vectors are obtained, they are searched against the indexed resume semantic vectors in the Faiss library, and finally the topK resumes semantically matching each post are returned.
CN202211146339.3A 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content Active CN115481220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211146339.3A CN115481220B (en) 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content

Publications (2)

Publication Number Publication Date
CN115481220A true CN115481220A (en) 2022-12-16
CN115481220B CN115481220B (en) 2023-07-25

Family

ID=84423868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211146339.3A Active CN115481220B (en) 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content

Country Status (1)

Country Link
CN (1) CN115481220B (en)

Also Published As

Publication number Publication date
CN115481220B (en) 2023-07-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant