CN115481220B - Intelligent matching method and system for comparison learner post based on post and resume content - Google Patents

Intelligent person-post matching method and system based on contrastive learning over post and resume content

Info

Publication number
CN115481220B
Authority
CN
China
Prior art keywords
post
resume
semantic
matching
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211146339.3A
Other languages
Chinese (zh)
Other versions
CN115481220A (en
Inventor
肖小范
刘王祥
李敬泉
刘雨晨
徐雯
谢志辉
景昊
吴显仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Today Talent Information Technology Co ltd
Original Assignee
Shenzhen Today Talent Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Today Talent Information Technology Co ltd filed Critical Shenzhen Today Talent Information Technology Co ltd
Priority to CN202211146339.3A priority Critical patent/CN115481220B/en
Publication of CN115481220A publication Critical patent/CN115481220A/en
Application granted granted Critical
Publication of CN115481220B publication Critical patent/CN115481220B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/105 Human resources
    • G06Q10/1053 Employment or hiring
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Computational Linguistics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Strategic Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an intelligent person-post matching method and system based on contrastive learning over post and resume content. The method comprises: offline training of a post-resume semantic encoder and construction of a resume index; and online person-post recall matching. Drawing on the person-post matching business scenario of the recruitment industry, the invention proposes a contrastive-learning person-post matching method for the recruitment field based on post and resume content. Through carefully designed training sample pairs, the post-resume contrastive learning semantic encoder model can encode resume and post content with accurate semantics, which improves the quality of person-post matching. By adopting a technical architecture of caching plus Faiss, person-post matching at the scale of millions of records responds within milliseconds, meeting the performance requirements of high concurrency and low latency, greatly improving matching efficiency and reducing the burden on headhunting firms and job seekers.

Description

Intelligent person-post matching method and system based on contrastive learning over post and resume content
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an intelligent person-post matching method and system based on contrastive learning over post and resume content.
Background
With the development of internet technology, more and more job seekers and enterprises publish job-seeking and recruitment information on recruitment websites, where massive resume and post resources accumulate. Relying on search and recommendation technology, recruitment websites have greatly eased the difficulty of both hiring and job hunting, and online recruitment is gradually replacing traditional face-to-face recruitment.
At present, the most common resume-post matching technology is based on keyword matching: keywords are extracted from resumes and posts, resumes sharing the same keywords are recommended to enterprises, and posts sharing the same keywords are recommended to job seekers. However, because language is diverse and the same meaning can be expressed by different words, matching on the literal form of words often produces recommended resumes that do not meet the enterprise's requirements and recommended posts that do not meet the job seeker's requirements.
To address these shortcomings, semantic matching emerged after keyword matching. Each word is mapped to a high-dimensional vector by natural language processing techniques, of which word2vec and GloVe are the most typical; in the resulting space, words with similar meanings lie closer together.
Once a word vector has been obtained for each word, resumes and posts are matched by searching for similar vectors. However, word2vec and GloVe both produce static word vectors and therefore cannot handle polysemy, where one word has several meanings. Meanwhile, many patents are vague about how top-k nearest-neighbor vector retrieval is performed, and some store the generated word vectors in a MySQL database, which is very time-consuming in a real production environment.
The person-post matching problem can be further divided into two recommendation problems: recommending posts to people and recommending people to posts. Collaborative filtering from the e-commerce recommendation field has therefore been introduced into recruitment: a resume-post interaction matrix is generated from the historical interactions of enterprises and job seekers, recommending posts to people is cast as user-based collaborative filtering, and recommending people to posts is cast as item-based collaborative filtering. However, traditional collaborative filtering cannot solve the cold-start problem, so no recommendation can be made for enterprises and job seekers without interaction records.
Other schemes use a K-means clustering algorithm. Each resume is a data point in the resume set; similar resumes are clustered under a preset number of categories, and a post category is assigned to each resulting resume cluster. When resumes are recommended to an enterprise, they are selected from the cluster corresponding to the category of the post being recruited. Clustering-based algorithms have two main drawbacks: 1. post categories are numerous, and the number of categories cannot be preset accurately; 2. there is no quantitative criterion for selecting the top-k resumes within the chosen cluster.
In view of these difficulties, the invention provides an intelligent person-post matching method and system based on contrastive learning over post and resume content.
Disclosure of Invention
To address the problems in the prior art, the invention provides an intelligent person-post matching method based on contrastive learning over post and resume content.
To achieve the above object, the technical solution of the invention is as follows:
The invention provides an intelligent person-post matching method based on contrastive learning over post and resume content, comprising the following steps:
S1, offline model training and resume index construction;
S2, online person-post matching.
Further, the step S1 specifically includes the following steps:
S11, training sample pair construction; a training sample generator is provided for training the contrastive learning model;
S12, model training; the contrastive learning model is trained with the training samples generated in step S11, yielding two models used for resume and post semantic extraction respectively;
S13, full resume index construction; every resume in the RCN resume pool is semantically encoded with the CV-BERT model trained in S12 to obtain its semantic vector; the semantic vectors of all resumes are then put into a Faiss search library to build the full resume index.
Further, the step S2 specifically includes the following steps:
S21, post semantic caching; the semantics of some hot, urgent or active posts are extracted in advance by the JD-BERT trained in S12 and put into a cache, so that the system can quickly recommend resumes for a post when the post is processed;
S22, person-post retrieval; once S13 and S21 are complete, the semantic vectors of the posts for which matching is requested and the full resume vector index are available, and the top-k resumes relevant to each post in the request are recalled by the approximate nearest-neighbor algorithm of the Faiss vector search engine.
The invention also discloses an intelligent person-post matching system based on contrastive learning over post and resume content, used to implement the above method and comprising: a training sample generator, a semantic encoder and a real-time batch vector recaller;
the training sample generator: a Siamese-BERT model is adopted to semantically encode posts and resumes. In order to learn both the similarity between resumes and posts (inter-class similarity) and the similarity between resumes (intra-class similarity), the input of the semantic encoding model takes the form of triplets: the post text information serves as the anchor, denoted a; a resume that matches the post serves as the positive sample, denoted p; a resume that does not match the post serves as the negative sample, denoted m. That is, the input of the semantic extraction model is a triplet s(a, p, m) composed of a post and resumes;
the semantic encoder: a pseudo-Siamese BERT model is adopted to capture and match the precise semantics of posts and resumes; JD-BERT captures the semantic information of posts and CV-BERT captures the semantic information of resumes. Through accurate semantic encoding of post content and resume content, person-post pairs with high semantic relevance are recommended;
the real-time batch vector recaller: batch person-post matching and retrieval are implemented with Redis caching and the Faiss vector search engine.
Further, the training sample generator includes: positive sample pair construction and negative sample pair construction;
the positive sample pair construction: on the RCN cooperation platform, a suitable resume is found for a post and delivered, and each such jd-cv pair is regarded as a positive sample pair in which the post and the resume are highly relevant and matched at the semantic level. All orders recommended to clients in RCN are taken; for each order the post serves as the anchor and the resume as the positive sample, forming a positive sample pair;
the negative sample pair construction: negative samples are divided into simple samples and difficult samples.
Further, the negative sample pair construction includes: simple negative sample pair construction and difficult negative sample pair construction;
the simple negative sample pair construction: a resume that does not match the profession, industry background, working years or required skills of the post is selected to generate a negative sample for the post;
the difficult negative sample pair construction: training samples that capture subtle differences between posts and resumes and between resumes.
Further, the difficult negative sample pair construction includes: a construction method based on skill points and a construction method based on working years;
the construction method based on skill points: resume parsing is performed on the basis of extracting and identifying resume skill points;
the construction method based on working years: resume screening is performed on the basis of working years.
Further, the semantic encoder: a pseudo-Siamese BERT model is adopted to capture and match the precise semantics of posts and resumes; JD-BERT captures the semantic information of posts and CV-BERT captures the semantic information of resumes;
the two BERT models are trained together. Global average pooling is applied to the embedding vectors of all words output by the last layer of JD-BERT and of CV-BERT, and the pooled embeddings serve as the JD semantic vector and the CV semantic vector respectively, where JD and CV denote post and resume. During training, the cosine distance between the JD semantic vector and the CV semantic vector measures their semantic relationship: the more related the two are, the smaller the distance between their semantic vectors, as shown in formula (1):
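Formula (1) is not reproduced in this text. A standard cosine-distance definition consistent with the description is sketched below; the symbols $s_{jd}$ and $s_{cv}$ for the pooled JD and CV semantic vectors, and the exact form $1 - \cos$, are assumptions rather than the patent's own expression.

```latex
d(s_{jd}, s_{cv}) \;=\; 1 - \cos(s_{jd}, s_{cv})
\;=\; 1 - \frac{s_{jd} \cdot s_{cv}}{\lVert s_{jd} \rVert \, \lVert s_{cv} \rVert}
```

Under this form the distance is 0 for vectors pointing in the same direction and grows as the angle between them widens, which agrees with the statement that more related semantics give a smaller distance.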
the post anchor a and the positive-sample resume p match semantically, while the negative-sample resume m does not match the post. The training objective is for the model to encode the text semantics of posts and resumes accurately, so that the anchor a and the positive sample p are very similar, i.e. the angle between their semantic vectors is as small as possible, whereas the anchor a and the negative sample m are as dissimilar as possible, i.e. the angle between their semantic vectors is as large as possible. The model thus makes the distance of the positive pair as small as possible and the distance of the negative pair as large as possible;
denoting by s_a the semantic vector of the anchor post, and by s_p and s_m the semantic vectors of the matching positive-sample resume and the non-matching negative-sample resume respectively, the loss function for training the person-post matching semantic encoding Siamese-BERT model is obtained as shown in formula (2):
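Formula (2) is likewise not reproduced in this text. One common triplet loss that matches the stated objective (pull the positive pair together, push the negative pair apart) is sketched below. The symbols $s_a$, $s_p$, $s_m$ for the anchor, positive and negative semantic vectors, the cosine distance $d$, and the margin $\alpha$ are all assumptions; the patent's actual loss may differ.

```latex
\mathcal{L}(a, p, m) \;=\; \max\bigl(0,\; d(s_a, s_p) - d(s_a, s_m) + \alpha\bigr),
\qquad
d(u, v) \;=\; 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```

With this form the loss is zero once the negative resume is at least $\alpha$ farther from the post than the positive resume, so training effort concentrates on triplets that are still confused.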
Further, the real-time batch vector recaller:
batch person-post matching and retrieval are implemented with Redis caching and the Faiss vector search engine;
Redis stores the post semantic vectors, which reduces the latency of computing post semantics inside a matching request. Post semantic vectors are stored in the cache as key-value pairs of the form (post ID, post vector). The allkeys-lru eviction policy is used: the least recently used post data is evicted from the whole key set, which keeps enough hot posts in the cache, raises the cache hit rate and allows post semantic vectors to be fetched quickly. The offline model computes a post semantic vector only when the requested post misses the cache;
the semantic vectors of all resume documents are generated offline with the trained CV-BERT and put into the Faiss vector search engine, where an index is built over them. The index is built with the FlatL2 search method, where L2 indicates that the similarity measure of the index is the L2 norm, i.e. Euclidean distance;
when a batch of post-to-resume matching requests enters the system, each JD is first looked up in the cache in request order. On a hit, the corresponding post semantic vector is returned; on a miss, the post semantic vector is computed with the offline-trained JD-BERT. Once the post semantic vectors are obtained, they are searched against the indexed resume semantic vectors in the Faiss library, and the top-K resumes semantically matching each post are returned.
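The request flow described above can be sketched in plain Python. This is a stand-in for illustration only: `FlatL2Index` is a hypothetical class that mimics what an exhaustive L2 index does and is not the Faiss API, the `cache` dict stands in for Redis, and `encode_post` stands in for JD-BERT. Writing the vector back to the cache after a miss is also an assumption.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance; ranking by it equals ranking by L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


class FlatL2Index:
    """Hypothetical exhaustive index: keeps every resume vector and scans them all."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._vectors: List[Vector] = []

    def add(self, resume_id: str, vector: Vector) -> None:
        self._ids.append(resume_id)
        self._vectors.append(vector)

    def search(self, query: Vector, k: int) -> List[Tuple[str, float]]:
        """Return the k closest resumes as (resume_id, squared distance)."""
        scored = ((squared_l2(query, v), rid)
                  for rid, v in zip(self._ids, self._vectors))
        return [(rid, dist) for dist, rid in heapq.nsmallest(k, scored)]


def recall_batch(
    post_ids: List[str],
    cache: Dict[str, Vector],
    encode_post: Callable[[str], Vector],
    index: FlatL2Index,
    k: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """Serve posts in request order: cache first, encoder only on a miss."""
    results: Dict[str, List[Tuple[str, float]]] = {}
    for post_id in post_ids:
        vector = cache.get(post_id)
        if vector is None:
            vector = encode_post(post_id)  # stands in for the trained JD-BERT
            cache[post_id] = vector        # assumption: store it for later requests
        results[post_id] = index.search(vector, k)
    return results
```

A flat index compares the query with every stored vector, so its results are exact; the cost grows linearly with the number of resumes.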
The technical scheme of the invention has the following beneficial effects:
1. Drawing on the person-post matching business scenario of the recruitment industry, the invention proposes a contrastive-learning person-post matching method for the recruitment field based on post and resume content; through carefully designed training sample pairs, the model can encode resume and post content with accurate semantics, which improves the quality of person-post matching;
2. The scheme proposes a technical architecture of caching plus Faiss, so that person-post matching at the scale of millions of records responds within milliseconds, meeting the performance requirements of high concurrency and low latency, greatly improving matching efficiency and reducing the burden on headhunting firms and job seekers.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a flowchart showing the step S1 of the present invention;
FIG. 3 is a flowchart showing the step S2 of the present invention;
FIG. 4 is a schematic diagram of the implementation principle of the present invention;
FIG. 5 is an exemplary diagram of a Java backend post skills tree of the present invention;
FIG. 6 is a structure diagram of the contrastive-learning-based person-post matching Siamese-BERT of the present invention;
FIG. 7 is a spatial distance relationship of a training sample triplet s (a, p, m) of the present invention;
FIG. 8 is an example of the request processing flow in which a batch of posts recalls matching resumes in real time according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
The invention provides an intelligent person-post matching method based on contrastive learning over post and resume content which, as shown in FIG. 1, comprises the following steps:
S1, offline model training and resume index construction; the main sub-steps are training sample pair construction, model training, and building a Faiss index over the full set of resume vectors;
S2, online person-post matching; this step mainly concerns the online person-post matching scenarios of RCN, of which there are two: "person seeking post" and "post seeking person". The "real-time batch vector recaller" section of the invention mainly uses the "post seeking person" scenario as its example. Online person-post matching consists of two sub-steps: post semantic caching and person-post retrieval.
As shown in FIG. 2, step S1 specifically includes the following steps:
S11, training sample pair construction; this part provides a training sample generator for training the contrastive learning model; for details, refer to the training sample generator part of the Disclosure of Invention;
S12, model training; this part trains the contrastive learning model proposed in this scheme with the training samples generated in step S11; for a detailed description of the model structure, refer to the semantic encoder part of the Disclosure of Invention. After the model is fully trained, two models are obtained for resume and post semantic extraction respectively, namely CV-BERT and JD-BERT;
S13, full resume index construction; every resume in the RCN resume pool is semantically encoded with the CV-BERT model trained in S12 to obtain its semantic vector; the semantic vectors of all resumes are then put into a Faiss search library to build the full resume index. The specific methods and parameters for building the Faiss resume index are given in the "real-time batch vector recaller" part of the Disclosure of Invention.
As shown in FIG. 3, step S2 specifically includes the following steps:
S21, post semantic caching;
post semantic caching means that the semantics of some hot, urgent or active posts are extracted in advance by the JD-BERT trained in S12 and put into the cache, so that the system can quickly recommend resumes for a post when a headhunter processes it. Because cache entries may expire or posts may become invalid, when a post queried by a headhunter misses the cache its semantics must be re-encoded by JD-BERT. In either case, the semantic vector of the post is obtained in this step;
S22, person-post retrieval;
after S13 and S21 are completed, a set of semantic vectors of the posts for which matching is requested and the full resume vector index are available, so S22 recalls the top-k resumes relevant to each post in the request through the approximate nearest-neighbor algorithm of the Faiss vector retrieval engine; for details, refer to the "real-time batch vector recaller" part of the Disclosure of Invention. This completes the description of the implementation steps of the scheme.
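The caching behaviour in S21 (pre-encode hot posts, evict the least recently used entry, re-encode on a miss) can be illustrated with a small in-process cache. This is a minimal sketch under stated assumptions: `PostVectorCache` and `encode_post` are hypothetical names, an `OrderedDict` stands in for Redis, and the eviction here is exact LRU, whereas Redis applies an approximated LRU under the `allkeys-lru` policy.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class PostVectorCache:
    """In-process stand-in for the Redis (post ID, post vector) cache."""

    def __init__(self, capacity: int,
                 encode_post: Callable[[str], List[float]]) -> None:
        self.capacity = capacity
        self.encode_post = encode_post  # stands in for the trained JD-BERT
        self._store = OrderedDict()     # oldest entry first, newest last

    def __contains__(self, post_id: str) -> bool:
        return post_id in self._store

    def get(self, post_id: str) -> List[float]:
        """Return the post vector, encoding and caching it on a miss."""
        if post_id in self._store:
            self._store.move_to_end(post_id)  # mark as most recently used
            return self._store[post_id]
        vector = self.encode_post(post_id)    # cache miss: re-encode the post
        self._store[post_id] = vector
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict the least recently used post
        return vector

    def warm_up(self, post_ids: Iterable[str]) -> None:
        """Pre-encode hot, urgent or active posts ahead of any request."""
        for post_id in post_ids:
            self.get(post_id)
```

Keeping the encoder behind `get` means callers never need to know whether a vector came from the cache or was just computed.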
The invention also discloses an intelligent person-post matching system based on contrastive learning over post and resume content, used to implement the above method and, as shown in FIG. 4, comprising: a training sample generator, a semantic encoder and a real-time batch vector recaller;
1. Training sample generator: in order to extract the semantic information of resumes and posts accurately within the same semantic space, a Siamese-BERT model is adopted to semantically encode posts and resumes, as described above. In order to learn both the similarity between resumes and posts (inter-class similarity) and the similarity between resumes (intra-class similarity), the input of the semantic encoding model takes the form of triplets. Considering that the number of posts in the system is far smaller than the number of resumes, the post text information is used as the anchor, denoted a. A resume that matches the post is the positive sample, denoted p. A resume that does not match the post is the negative sample, denoted m. That is, the input of the semantic extraction model is a triplet s(a, p, m) composed of a post and resumes. The input in FIG. 6 gives an example of a triplet.
For a resume, text information such as basic information (gender, location, etc.), academic background (school, major, etc.), work experience (company, industry, etc.) and project experience (project content, etc.) is concatenated as the text representation of the resume. For a post, the company information, basic requirements, skill requirements and so on are concatenated as the text features of the post.
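The field concatenation described above might look like the following sketch. The field names in `RESUME_FIELDS` and `POST_FIELDS`, the space separator and the fixed field order are all assumptions made for illustration; the source only lists the kinds of information that are joined.

```python
from typing import Dict, Sequence

# Hypothetical field names, following the categories listed in the text.
RESUME_FIELDS = ("gender", "location", "school", "major",
                 "company", "industry", "project")
POST_FIELDS = ("company_info", "basic_requirements", "skill_requirements")


def join_fields(record: Dict[str, str], fields: Sequence[str],
                sep: str = " ") -> str:
    """Concatenate the non-empty fields of a record in a fixed order."""
    parts = [str(record[name]).strip() for name in fields if record.get(name)]
    return sep.join(parts)


resume_text = join_fields(
    {"gender": "F", "school": "X University", "major": "", "industry": "internet"},
    RESUME_FIELDS,
)
post_text = join_fields(
    {"company_info": "ACME", "skill_requirements": "NLP, Python"},
    POST_FIELDS,
)
```

Using a fixed field order keeps the text layout identical across records, so the encoder sees the same kind of information in the same position each time.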
The training sample generator comprises:
Positive sample pair construction: on the RCN headhunter cooperation platform, a headhunter studies the post (jd) information in detail and then finds a suitable resume to match and deliver to the post, so each jd-cv pair that a headhunter pushes to an enterprise client can be regarded as a positive sample pair. In such a match the post and the resume are highly relevant and matched at the semantic level, so all orders recommended to clients in RCN can be taken; for each order the post serves as the anchor and the resume as the positive sample, forming a positive sample pair.
Negative sample pair construction: given the business flow, positive samples are relatively easy to construct, whereas the construction and sampling of negative samples is the difficult part of this scheme. In the contrastive learning model of the invention, negative sample construction is divided into simple samples and difficult samples.
The negative sample pair construction includes:
Simple negative sample pair construction: simple negative samples, as the name implies, are negative samples that are relatively easy to distinguish, namely cases where the post and the resume clearly do not match. For example, if a post requires 3 years of NLP algorithm experience and the recommended resume belongs to a job seeker whose only professional background and work experience are in packaging and printing, the recommendation is clearly invalid because the post and the resume do not match. The semantics of such a resume differ greatly from the semantics of the post requirements, so a resume that does not match the profession, industry background, working years, required skills and so on of the post can be selected to generate a negative sample for the post.
Difficult negative sample pair construction: in contrastive learning, if only simple negative samples are constructed, the model can clearly distinguish resumes that differ greatly from the post's semantics, but cannot capture subtle differences between posts and resumes or between resumes. For example, among posts that all recruit for NLP algorithms, some require candidates to have project experience in machine translation, while others require related project experience in entity extraction or knowledge graphs; although these skill points are all NLP-related, their emphasis differs. Such cases should therefore be considered when constructing negative samples. Training samples of this kind are collectively called difficult samples in this scheme.
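The positive pairs and simple negatives described above can be assembled into triplets s(a, p, m) roughly as follows. This is a minimal sketch: the `Triplet` type, the record layout and the use of an industry mismatch as the only "obvious mismatch" test are assumptions; the source also names profession, working years and skills as criteria.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Triplet:
    anchor: str    # post text (a)
    positive: str  # resume recommended for the post (p)
    negative: str  # resume that clearly does not match (m)


def build_easy_triplets(
    orders: List[Tuple[str, str]],     # (post_id, resume_id) pushed to clients
    posts: Dict[str, Dict[str, str]],  # post_id -> {"text", "industry"}
    resumes: Dict[str, Dict[str, str]],  # resume_id -> {"text", "industry"}
    rng: random.Random,
) -> List[Triplet]:
    """One triplet per order: post as anchor, delivered resume as positive."""
    triplets = []
    for post_id, resume_id in orders:
        post = posts[post_id]
        # Simplifying assumption: a different industry is an obvious mismatch.
        candidates = [r for r in resumes.values()
                      if r["industry"] != post["industry"]]
        if not candidates:
            continue  # no easy negative available for this post
        negative = rng.choice(candidates)
        triplets.append(
            Triplet(post["text"], resumes[resume_id]["text"], negative["text"]))
    return triplets
```

Passing the random generator in explicitly makes sampling reproducible, which helps when comparing training runs.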
The schemes for constructing difficult negative sample pairs in this solution are as follows:
a) Construction method based on skill points:
extracting and identifying resume skill points is a very important but difficult part of resume parsing. In the person-post matching scenario, skill points can be viewed as a tree structure.
FIG. 5 gives an example for the "Java backend" post. The figure shows that posts with the same name require different skill points and different related work and project experience depending on the emphasis of the work. Therefore, when constructing difficult negative sample pairs, difficult negatives under the same post name must be considered. In the implementation of this scheme, once the post and the skill node of the positive sample are determined, negative samples are selected at random from subtrees on branches of the skill tree other than that of the positive sample. For example, given a "Java big data engineer" post, a resume that matches the "big data" skill node in the skill tree, with matching path "Java backend" - "development" - "big data", forms a positive sample; negative samples can be generated from the other subtrees of the skill tree, outside the subtree rooted at the "big data" node. For example, a resume whose skills cover the path "Java backend" - "development" - "ssm" - "spring mvc" is selected as a negative sample for this post. The difficulty of the constructed sample varies with the distance in the skill tree between the chosen skill node and the post's skill node.
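The skill-tree selection in a) can be sketched with a nested dictionary. The tree below contains only the nodes named in the text, not the full content of FIG. 5. Excluding the ancestors of the positive node, and reading a smaller tree distance as a harder negative, are assumptions made for this illustration.

```python
from typing import Dict, Iterator, List, Tuple

Path = Tuple[str, ...]

# Only the nodes mentioned in the text; the real tree in FIG. 5 is larger.
SKILL_TREE: Dict[str, dict] = {
    "Java backend": {
        "development": {
            "big data": {},
            "ssm": {"spring mvc": {}},
        },
    },
}


def all_paths(tree: Dict[str, dict], prefix: Path = ()) -> Iterator[Path]:
    """Yield the root-to-node path of every node in the tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from all_paths(children, path)


def tree_distance(p: Path, q: Path) -> int:
    """Number of edges between two nodes, via their deepest common ancestor."""
    common = 0
    for a, b in zip(p, q):
        if a != b:
            break
        common += 1
    return (len(p) - common) + (len(q) - common)


def hard_negative_paths(tree: Dict[str, dict], positive: Path) -> List[Path]:
    """Nodes on other branches than the positive node, nearest first."""
    result = []
    for path in all_paths(tree):
        in_subtree = path[:len(positive)] == positive
        is_ancestor = positive[:len(path)] == path
        if not in_subtree and not is_ancestor:
            result.append(path)
    return sorted(result, key=lambda path: tree_distance(positive, path))


POSITIVE: Path = ("Java backend", "development", "big data")
NEGATIVES = hard_negative_paths(SKILL_TREE, POSITIVE)
```

A resume whose skills cover one of the returned paths would then serve as the negative sample m for the "Java big data engineer" post.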
b) Construction method based on working years:
in person-post matching, "working years" is also a strong resume screening criterion. For example, resumes A and B have the same professional background, industry background, skill points and so on, but A has only 2 years of work experience while B has 6. If a post is recruiting a senior engineer (5 years and above), B clearly fits the definition of "senior" better. If only the semantics of skills, industry, project background and the like are considered, resume A could easily be recommended to the post, resulting in a person-post mismatch. Therefore, the important attribute of working years is taken into account when constructing difficult samples, so that the contrastive-learning-based post and resume semantic encoding model can capture such subtle semantic differences.
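The working-years rule in b) amounts to picking resumes that satisfy the skill requirements but fall short on experience. A minimal sketch, assuming a hypothetical `Resume` record and a simple "covers all required skills" test:

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Resume:
    resume_id: str
    skills: FrozenSet[str]
    years: float  # years of work experience


def years_hard_negatives(required_skills: FrozenSet[str], min_years: float,
                         resumes: List[Resume]) -> List[Resume]:
    """Resumes that cover every required skill but lack the required years."""
    return [r for r in resumes
            if required_skills <= r.skills and r.years < min_years]


skills = frozenset({"nlp", "python"})
pool = [
    Resume("A", skills, 2),                  # same skills, too junior
    Resume("B", skills, 6),                  # fits a senior post
    Resume("C", frozenset({"printing"}), 8), # an easy negative, not a hard one
]
hard = years_hard_negatives(skills, 5, pool)
```

Here resume A is the difficult negative for a post requiring 5 years and above, while B remains a candidate positive.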
2. Semantic encoder
A good person-post matching recommendation is one in which the post requirements and the resume are semantically highly correlated and matched. If a post is recruiting NLP algorithm engineers but the resume of an insurance salesperson is recommended, the post information and the resume information are semantically mismatched; the recommendation not only fails to achieve person-post matching but also harms the user experience. Therefore, accurately capturing the semantics of post content and resume content is the precondition of document matching.
Semantic encoder:
The invention adopts a pseudo Siamese-Bert model for accurate semantic capture and matching of posts and resumes. A Siamese-Bert is defined as two identical Bert models whose parameters are shared and consistent, that is, there is effectively only one Bert model. A pseudo Siamese-Bert also consists of two Bert models, except that the two models are not identical. The model used in this scheme is shown in FIG. 3, in which the JD-BERT on the left captures the semantic information of posts and the CV-BERT on the right captures the semantic information of resumes. Compared with static semantic encoding algorithms, the Bert model has a self-attention mechanism and encodes word senses fully according to context; it has large capacity and many parameters, and has been pretrained on multiple language datasets, so it has already learned an excellent language model and can be used to capture semantics. In model training, the two Berts are trained simultaneously. Global average pooling (excluding CLS) is performed on the embedding vectors of all words output by the last layer of the JD-BERT and CV-BERT models, and the pooled embeddings are used as the JD (post) semantic vector and the CV (resume) semantic vector respectively. In training, the cosine distance between the JD and CV semantic vectors is used to measure the semantic relationship between the two: the more related the two are semantically, the smaller the distance between their semantic vectors, as shown in formula (1).
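The pooling and distance steps can be illustrated with a small, dependency-free sketch. Formula (1) is not reproduced in this text, so the definition used here (one minus cosine similarity) is an assumption, as is the convention that the first row of the token matrix is the CLS embedding; padding and other special tokens are ignored for simplicity.

```python
import math


def mean_pool(token_embeddings, skip_first=True):
    """Average the token vectors, optionally skipping the first (assumed CLS) row."""
    rows = token_embeddings[1:] if skip_first else token_embeddings
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def cosine_distance(u, v):
    """Assumed form of the distance: 1 - cos(u, v); smaller means more related."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy last-layer outputs: first row stands in for CLS and is excluded.
jd_tokens = [[9.0, 9.0], [1.0, 0.0], [3.0, 0.0]]
cv_tokens = [[7.0, 7.0], [2.0, 0.0], [4.0, 0.0]]
jd_vec = mean_pool(jd_tokens)
cv_vec = mean_pool(cv_tokens)
distance = cosine_distance(jd_vec, cv_vec)
```

In a real system the token matrices would come from the last hidden layer of JD-BERT and CV-BERT; here they are hand-written lists so the arithmetic is easy to follow.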
The construction of the training triplet s(a, p, m) is described in detail in the sample acquisition part. In a triplet, the post anchor a and the positive-sample resume p are semantically matched, while the negative-sample resume m is semantically unmatched with the post. The training goal of the model is to encode the text semantics of posts and resumes accurately, so that the post anchor a and the positive-sample resume p are semantically very similar, that is, the angle between their semantic vectors in space is as small as possible. Conversely, for the negative sample, the semantic similarity between the post anchor a and the negative-sample resume m should be as low as possible, that is, the angle between their semantic vectors in space should be as large as possible. During training, to improve the robustness of the model, a hyperparameter λ is introduced, so that the distance between positive-sample pairs is reduced as much as possible and the distance between negative-sample pairs is increased as much as possible. FIG. 7 shows the spatial distance relationship of the triplet s(a, p, m) during training.
Given the semantic vector of the anchor post, together with the semantic vectors of the positive-sample resume and the negative-sample resume matched against it, the loss function for training the person-post matching semantic encoding Siamese-Bert model is obtained as shown in formula (2):
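Formula (2) is not reproduced in this text. As a hedged illustration only, the sketch below uses the standard triplet margin loss over cosine distance, with the hyperparameter λ playing the role of the margin; whether the patent's formula has exactly this form is an assumption, and the default value `margin=0.2` is arbitrary.

```python
import math


def cosine_distance(u, v):
    """Assumed distance: 1 - cos(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss (assumed form): max(0, d(a,p) - d(a,m) + margin).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it penalizes the shortfall.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


a = [1.0, 0.0]   # post anchor vector
p = [1.0, 0.0]   # matching resume: same direction as the anchor
m = [0.0, 1.0]   # non-matching resume: orthogonal to the anchor
well_separated = triplet_loss(a, p, m)
swapped = triplet_loss(a, m, p)
```

With the positive aligned and the negative orthogonal the loss vanishes, while swapping their roles produces a large loss, which matches the training goal described above.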
3. Real-time batch vector recall device
As shown in FIG. 8, the scheme uses Redis caching and the Faiss vector retrieval engine to implement batch person-post matching and retrieval.
In this scheme, Redis stores the semantic vectors of posts, which greatly reduces the delay caused by computing post semantics during a matching request. Post semantic vectors are stored in the cache as key-value pairs of (post ID, post vector). The allkeys-lru eviction policy is adopted, which selects the least recently used post data from the whole key set for eviction, ensuring that enough hot posts remain in the cache, improving the cache hit rate, and allowing post semantic vectors to be fetched quickly. Only when the requested post misses the cache is its semantic vector computed using the offline-trained model.
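The caching behaviour can be illustrated without a Redis server. The sketch below is an in-memory stand-in, not Redis itself: `LruVectorCache` and `encode` are hypothetical names, the exact-LRU eviction only approximates what allkeys-lru does (Redis uses a sampled approximation of LRU), and writing the freshly computed vector back to the cache on a miss is an assumption not stated in the text.

```python
from collections import OrderedDict


class LruVectorCache:
    """In-memory stand-in for a Redis cache of (post ID, post vector) pairs."""

    def __init__(self, capacity, encode):
        self.capacity = capacity
        self.encode = encode          # hypothetical callable: post text -> vector
        self.store = OrderedDict()    # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, post_id, post_text):
        if post_id in self.store:
            self.store.move_to_end(post_id)   # mark as most recently used
            self.hits += 1
            return self.store[post_id]
        self.misses += 1
        vector = self.encode(post_text)       # cache miss: compute with the model
        self.store[post_id] = vector          # write-back is an assumption
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict the least recently used key
        return vector


# Toy encoder standing in for JD-Bert.
cache = LruVectorCache(capacity=2, encode=lambda text: [float(len(text))])
cache.get("j1", "java")
cache.get("j2", "nlp engineer")
cache.get("j1", "java")          # hit: j1 becomes most recently used
cache.get("j3", "data analyst")  # miss: evicts j2, the least recently used
```

In a deployment, the same effect would be obtained by configuring the Redis server's eviction policy rather than by application code.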
Faiss (Facebook AI Similarity Search) is a clustering and similarity search library open-sourced by Facebook AI. It provides efficient similarity search and clustering for dense vectors, supports search over billion-scale vector sets, and is currently the most mature approximate nearest-neighbor search library. It includes algorithms for searching vector sets of arbitrary size, and through highly efficient improvements to the underlying algorithms and GPU implementations of the core algorithms, Faiss is fast, achieving millisecond-level retrieval on indexes of the order of one billion vectors. In this scheme, the trained CV-Bert is used to generate the semantic vectors of all resume documents offline, and all of these vectors are put into the Faiss vector search engine and indexed. The FlatL2 retrieval method is selected to construct the index, where L2 indicates that the similarity measure used by the index is the L2 norm, namely the Euclidean distance.
The real-time semantic matching flow for a batch of posts is shown in FIG. 8. When a batch of post-to-resume matching requests enters the system, each JD is first looked up in the cache in request order; on a hit, the corresponding post semantic vector is returned. On a miss, the post semantic vector is computed using the offline-trained JD-Bert. Once the post semantic vectors are obtained, they are searched against the indexed resume semantic vectors in the Faiss library, and the top K (K=20 in this scheme) resumes that semantically match each post are returned.
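The batch flow can be sketched end to end in plain Python. This is an illustration under stated assumptions rather than the actual system: a dict stands in for the Redis cache, a brute-force scan stands in for the Faiss flat L2 index (it performs the same exact Euclidean comparison, only without Faiss's optimizations), and `encode`, `top_k`, and `match_batch` are hypothetical names.

```python
import heapq


def l2_squared(u, v):
    """Squared Euclidean distance; ranking by it equals ranking by L2 distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def top_k(query, resume_vectors, k):
    """Exact nearest neighbours by L2 over a dict of resume_id -> vector."""
    return heapq.nsmallest(
        k, resume_vectors, key=lambda rid: l2_squared(query, resume_vectors[rid])
    )


def match_batch(requests, cache, encode, resume_vectors, k=20):
    """For each (post_id, post_text) request, in order: cache lookup, then search."""
    results = {}
    for post_id, post_text in requests:
        vector = cache.get(post_id)
        if vector is None:                 # cache miss: compute with the encoder
            vector = encode(post_text)
            cache[post_id] = vector        # write-back is an assumption
        results[post_id] = top_k(vector, resume_vectors, k)
    return results


# Toy data: the encoder maps text length to a 2-d vector.
resume_vectors = {"r1": [0.0, 0.0], "r2": [1.0, 1.0], "r3": [5.0, 5.0]}
cache = {"j2": [5.0, 4.0]}                 # j2 is already cached
encode = lambda text: [float(len(text)), float(len(text))]
results = match_batch([("j1", "a"), ("j2", "unused")], cache, encode, resume_vectors, k=2)
```

With K=20 and a real index, the structure of the loop stays the same; only the encoder, cache, and search backend change.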
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the description of the present invention and the accompanying drawings or direct/indirect application in other related technical fields are included in the scope of the present invention.

Claims (6)

1. A contrastive-learning person-post intelligent matching system based on post and resume content, characterized in that the system comprises: a training sample generator, a semantic encoder, and a real-time batch vector recall device;
the training sample generator: a Siamese-Bert model is used to semantically encode posts and resumes, and triplets are used as the input of the post semantic encoding model; the post text information is taken as the anchor; a resume that matches the post is taken as a positive sample, denoted p; a resume that does not match the post is taken as a negative sample, denoted m; that is, in order to further learn the similarity between resumes and posts and the similarity between resumes, namely the intra-class similarity, the input of the semantic extraction model is a triplet s(a, p, m) composed of a post and resumes;
the semantic encoder: a pseudo Siamese-Bert model is used for accurate semantic capture and matching of posts and resumes, JD-BERT being used to capture the semantic information of posts and CV-BERT being used to capture the semantic information of resumes; person-post pairs with highly correlated semantics are recommended through accurate semantic encoding of post content and resume content;
the real-time batch vector recall device: batch person-post matching and retrieval are implemented using Redis caching and the Faiss vector search engine.
2. The contrastive-learning person-post intelligent matching system based on post and resume content according to claim 1, wherein the training sample generator comprises: positive sample pair construction and negative sample pair construction;
the positive sample pair construction: in the RCN cooperation platform, a suitable resume is found for a post, matched, and delivered, and the jd-cv pair is regarded as a pair of positive samples; in such a match, the post and the resume are highly correlated and matched at the semantic level; all orders recommended to clients in the RCN are taken, each order is taken as an anchor and the resume as a positive sample, forming a positive sample pair;
the negative sample pair construction: negative sample construction is divided into simple sample construction and difficult sample construction.
3. The contrastive-learning person-post intelligent matching system based on post and resume content according to claim 2, wherein the negative sample pair construction comprises: simple negative sample pair construction and difficult negative sample pair construction;
the simple negative sample pair construction: a resume that does not match the profession, industry background, working years, and skills required by the post is selected to generate a negative sample for the post;
the difficult negative sample pair construction: training samples are constructed that exhibit subtle differences between post and resume and between resume and resume.
4. The contrastive-learning person-post intelligent matching system based on post and resume content according to claim 3, wherein the difficult negative sample pair construction comprises: a construction method based on skill points and a construction method based on working years;
the construction method based on skill points: resume analysis is performed based on the extraction and recognition of resume skill points;
the construction method based on working years: resume screening is performed based on working years.
5. The contrastive-learning person-post intelligent matching system based on post and resume content according to claim 1, wherein,
the semantic encoder: a pseudo Siamese-Bert model is adopted for accurate semantic capture and matching of posts and resumes, JD-BERT being used to capture the semantic information of posts and CV-BERT being used to capture the semantic information of resumes,
the two BERTs are trained; global average pooling is performed on the embedding vectors of all words output by the last layer of the JD-Bert and CV-Bert models respectively, and the pooled embeddings are used as the JD semantic vector and the CV semantic vector respectively, where JD and CV denote post and resume respectively; in training, the cosine distance between the JD semantic vector and the CV semantic vector is used to measure their semantic relationship, and the more related the two are semantically, the smaller the distance between the semantic vectors, as shown in formula (1):
the post anchor a and the positive-sample resume p are semantically matched, and the negative-sample resume m is semantically unmatched with the post; the training goal of the model is to encode the text semantics of posts and resumes accurately, so that the post anchor a and the positive-sample resume p are semantically very similar, that is, the angle between their semantic vectors in space is as small as possible; conversely, for the negative sample, the semantic similarity between the post anchor a and the negative-sample resume m is as low as possible, that is, the angle between their semantic vectors in space is as large as possible; the model reduces the distance between positive-sample pairs as much as possible and increases the distance between negative-sample pairs as much as possible;
given the semantic vector of the anchor post, together with the semantic vectors of the positive-sample resume and the negative-sample resume matched against it, the loss function for training the person-post matching semantic encoding Siamese-Bert model is obtained as shown in formula (2):
6. The contrastive-learning person-post intelligent matching system based on post and resume content according to claim 5, wherein the real-time batch vector recall device:
implements batch person-post matching and retrieval using Redis caching and the Faiss vector search engine,
Redis is used to store post semantic vectors, reducing the delay caused by computing post semantics during a matching request; post semantic vectors are stored in the cache as key-value pairs of (post ID, post vector); the allkeys-lru eviction policy is adopted, selecting the least recently used post data from the whole key set for eviction, ensuring that enough hot posts remain in the cache, improving the cache hit rate, and allowing post semantic vectors to be fetched quickly; only when the requested post misses the cache is its semantic vector computed using the offline model;
the trained CV-Bert is used to generate the semantic vectors of all resume documents offline, and the semantic vectors of all resume documents are put into the Faiss vector search engine and indexed; the index is constructed using the FlatL2 retrieval method, where L2 indicates that the similarity measure used by the index is the L2 norm, namely the Euclidean distance;
when a batch of post-to-resume matching requests enters the system, each JD is first looked up in the cache in request order; on a hit, the post semantic vector corresponding to the JD is returned; on a miss, the post semantic vector is computed using the offline-trained JD-Bert; after the post semantic vectors are obtained, they are searched against the indexed resume semantic vectors in the Faiss library, and finally the top K resumes that semantically match each post are returned.
CN202211146339.3A 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content Active CN115481220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211146339.3A CN115481220B (en) 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211146339.3A CN115481220B (en) 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content

Publications (2)

Publication Number Publication Date
CN115481220A CN115481220A (en) 2022-12-16
CN115481220B true CN115481220B (en) 2023-07-25

Family

ID=84423868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211146339.3A Active CN115481220B (en) 2022-09-20 2022-09-20 Intelligent matching method and system for comparison learner post based on post and resume content

Country Status (1)

Country Link
CN (1) CN115481220B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407683A (en) * 2021-01-25 2021-09-17 腾讯科技(深圳)有限公司 Text information processing method and device, electronic equipment and storage medium
CN113435841A (en) * 2021-06-24 2021-09-24 浙江工贸职业技术学院 Talent intelligent matching recruitment system based on big data
CN114782077A (en) * 2022-03-29 2022-07-22 北京沃东天骏信息技术有限公司 Information screening method, model training method, device, electronic equipment and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104421A1 (en) * 2018-09-28 2020-04-02 Microsoft Technology Licensing. LLC Job search ranking and filtering using word embedding
CN111984784B (en) * 2020-07-17 2024-03-12 北京嘀嘀无限科技发展有限公司 Person post matching method, device, electronic equipment and storage medium
CN114090877A (en) * 2021-11-03 2022-02-25 北京淘友天下科技发展有限公司 Position information recommendation method and device, electronic equipment and storage medium
CN114219248A (en) * 2021-12-03 2022-03-22 深圳市前海欢雀科技有限公司 Man-sentry matching method based on LDA model, dependency syntax and deep learning
CN114741538A (en) * 2022-05-10 2022-07-12 图谱天下(北京)科技有限公司 Resume screening method and device
CN115062220B (en) * 2022-06-16 2023-06-23 成都集致生活科技有限公司 Attention merging-based recruitment recommendation system

Also Published As

Publication number Publication date
CN115481220A (en) 2022-12-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant