CN115774778A

CN115774778A - Resume processing method and device, electronic equipment and readable storage medium

Info

Publication number: CN115774778A
Application number: CN202111050657.5A
Authority: CN
Inventors: 刘志煌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2023-03-10

Abstract

The embodiment of the application provides a resume processing method and device, electronic equipment and a readable storage medium, and relates to the fields of big data, cloud technology, data mining, machine learning and the like. The method comprises the following steps: acquiring a resume set to be processed, wherein the resume set comprises a first resume of a known position type and a second resume of an unknown position type; acquiring a resume feature vector of each resume in the resume set; clustering the resumes in the resume set based on the similarity between the resume feature vectors of the resumes to obtain a plurality of clustering categories; determining a target position category corresponding to each cluster category according to the proportion of the first resume of each position category in the resumes belonging to each cluster category; and determining the target position category corresponding to the cluster category to which each second resume belongs as the position category corresponding to each second resume. Based on the method, the position category of the resume with the unknown position category can be efficiently and accurately determined, and a basis is provided for subsequent resume screening or other processing requirements.

Description

Resume processing method and device, electronic equipment and readable storage medium

Technical Field

The application relates to the technical fields of computers, big data, cloud technology, machine learning, artificial intelligence and the like, in particular to a resume processing method and device, an electronic device and a readable storage medium.

Background

Talent plays a significant role in enterprise development, and recruiting suitable candidates is a very important task for every company. An accurate and efficient automatic resume screening system can greatly improve the efficiency with which companies select excellent candidates. Screening before the interview stage also reduces the number of unsuitable candidates brought in, improves the interview success rate, and supports intelligent decision-making. Building an intelligent, efficient and accurate resume screening system is therefore of great practical significance.

At present, various resume screening approaches exist in the prior art. Most of them require the participation of experienced personnel, and their efficiency and generality are poor; others do not require manual participation, but their screening results are not satisfactory. Conventional resume processing methods therefore still need to be improved.

Disclosure of Invention

The application aims to provide a resume processing method, a resume processing device, electronic equipment and a readable storage medium, and the method can be used for accurately and efficiently determining the position category corresponding to the resume.

In one aspect, an embodiment of the present application provides a resume processing method, where the method includes:

acquiring a resume set to be processed, wherein the resume set to be processed comprises a plurality of first resumes of known position categories and at least one second resume of an unknown position category;

acquiring a resume feature vector of each resume in the set of resumes to be processed;

clustering the resumes in the resume set to be processed based on the similarity between the resume feature vectors of the resumes in the resume set to be processed to obtain a plurality of clustering categories;

for each cluster category, determining a target position category corresponding to the cluster category according to the proportion of the first resumes of each position category among the resumes belonging to the cluster category;

and for each second resume, determining the target position category corresponding to the cluster category to which the second resume belongs as the position category corresponding to the second resume, and processing each second resume based on the position category of each second resume.

Optionally, for each cluster category, determining a target position category corresponding to the cluster category according to the proportion of the first resumes of each position category among the resumes belonging to the cluster category includes:

determining the proportion of the first resumes of each position category among the resumes belonging to the cluster category;

and, for the maximum proportion among these proportions, if the maximum proportion is not less than a set threshold, determining the position category corresponding to the maximum proportion as the target position category corresponding to the cluster category.
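The threshold rule above can be illustrated with a minimal Python sketch. The function name `target_category`, the use of `None` to mark a second (unlabeled) resume, and the default threshold of 0.5 are illustrative assumptions. The sketch also assumes that the proportion is taken over all resumes in the cluster, and that a cluster whose maximum proportion falls below the threshold simply receives no target category; the text above does not fix either choice.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def target_category(
    labels: Sequence[Optional[Hashable]], threshold: float = 0.5
) -> Optional[Hashable]:
    """Return the target position category of one cluster, or None.

    labels holds one entry per resume in the cluster: the known position
    category of a first resume, or None for a second (unlabeled) resume.
    """
    if not labels:
        return None
    # Count only the first resumes, i.e. those with a known category.
    counts = Counter(label for label in labels if label is not None)
    if not counts:
        return None
    category, count = counts.most_common(1)[0]
    # Assumption: the denominator is every resume in the cluster.
    proportion = count / len(labels)
    return category if proportion >= threshold else None


cluster = ["A", "A", "A", "B", None]
print(target_category(cluster))
```

Every second resume in a cluster would then inherit whatever category this function returns for that cluster.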

On the other hand, an embodiment of the present application provides a resume processing apparatus, including:

a to-be-processed resume acquisition module, configured to acquire a resume set to be processed, wherein the resume set to be processed comprises a plurality of first resumes of known position categories and at least one second resume of an unknown position category;

the resume clustering module is used for acquiring the resume feature vectors of each resume in the resume set to be processed, and clustering the resumes in the resume set to be processed based on the similarity between the resume feature vectors of the resumes in the resume set to be processed to obtain a plurality of clustering categories;

and a position category determining module, configured to determine a target position category corresponding to each cluster category according to the proportion of the first resumes of each position category among the resumes belonging to that cluster category, to determine the target position category corresponding to the cluster category to which each second resume belongs as the position category of that second resume, and to process each second resume based on its position category.

Optionally, when the resume clustering module obtains the resume feature vector of each resume in the resume set to be processed, it may be configured to:

acquiring a post category feature word library, wherein the post category feature word library comprises post feature words of a plurality of post categories; and for each resume in the resume set to be processed, determining the position feature words contained in the resume according to the position category feature word library, and obtaining the resume feature vector of the resume according to the position feature words contained in the resume.

Optionally, for each resume in the resume set to be processed, when obtaining the resume feature vector of the resume, the resume clustering module may be configured to:

obtaining a resume feature vector of the resume according to a mixed feature vector of each character in the post feature words contained in the resume;

the mixed feature vector of each character in each post feature word is obtained by the following method:

acquiring a word feature vector of the post feature words and a character feature vector of each character in the post feature words;

and for each character of the post characteristic words, obtaining a mixed characteristic vector of the character by fusing the character characteristic vector of the character and the word characteristic vector of the post characteristic words.
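As a rough illustration of the character-level fusion described above, the following Python sketch concatenates each character's vector with the vector of the word that contains it, and then averages the resulting mixed vectors into one resume feature vector. Both choices (concatenation as the fusion operator, averaging as the pooling step) are assumptions made for the example, as are the function names and the toy embedding tables; the text above only states that the two vectors are fused.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mixed_char_vectors(
    word: str, word_vecs: Dict[str, Vector], char_vecs: Dict[str, Vector]
) -> List[Vector]:
    """Fuse each character vector with its word vector (here: concatenation)."""
    word_vec = word_vecs[word]
    return [char_vecs[ch] + word_vec for ch in word]


def resume_vector(
    feature_words: Sequence[str],
    word_vecs: Dict[str, Vector],
    char_vecs: Dict[str, Vector],
) -> Vector:
    """Average the mixed vectors of all characters of all feature words."""
    mixed = [
        vec
        for word in feature_words
        for vec in mixed_char_vectors(word, word_vecs, char_vecs)
    ]
    if not mixed:
        return []
    dim = len(mixed[0])
    return [sum(vec[i] for vec in mixed) / len(mixed) for i in range(dim)]


# Toy embeddings, purely illustrative.
word_vecs = {"ab": [1.0, 0.0]}
char_vecs = {"a": [0.0], "b": [2.0]}
print(resume_vector(["ab"], word_vecs, char_vecs))
```

In practice the word and character vectors would come from trained embedding models rather than hand-written tables.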

Optionally, the post category feature word library is obtained by extracting a positive sample resume set of each of the plurality of post categories, and the positive sample resume set of each of the plurality of post categories includes a plurality of positive sample resumes; the post feature words of each post category include at least one of subject words, keywords, or named entities of the post category.

Optionally, the subject term of each post category in the post category feature term library is acquired by the subject term acquisition module in the following manner:

extracting resume data of each positive sample resume, and performing word segmentation on the resume data to obtain the segmented words contained in each positive sample resume;

determining candidate subject terms of each post category from the segmented words according to the first word frequency of each segmented word in the resume data of the positive sample resumes of that post category;

for each candidate subject term, determining the subject importance of the candidate subject term according to the second term frequency of the candidate subject term in the resume data of the positive sample resumes of all the post categories and the document frequency of the candidate subject term in the resume data of the positive sample resumes of all the post categories;

and for each post category, determining the subject term of the post category from the candidate subject terms according to the subject importance of the candidate subject terms of the post category.

Optionally, for each candidate subject term, the subject term obtaining module, when determining the subject importance of the candidate subject term, may be configured to:

determining the initial importance of the candidate subject term according to the second term frequency and the document frequency corresponding to the candidate subject term; determining the part of speech of the candidate subject term; and determining the subject importance of the candidate subject term according to its part of speech and initial importance.

Optionally, for each candidate subject term, when determining the subject importance of the candidate subject term according to its part of speech and initial importance, the subject term obtaining module may be configured to:

if the part of speech of the candidate subject term is a noun, determining the noun type of the candidate subject term, wherein the noun type is proper noun or common noun; and increasing the initial importance of the candidate subject term according to its noun type to obtain the subject importance of the candidate subject term, wherein the increase applied to a proper noun is greater than that applied to a common noun.
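The two-stage scoring described above (an initial importance from term frequency and document frequency, then a part-of-speech adjustment) might look like the Python sketch below. The concrete formula is not given in the text above, so the TF-IDF-style expression in `initial_importance` is a stand-in assumption, and the boost factors 1.5 and 1.2 are hypothetical values chosen only so that proper nouns are raised more than common nouns.

```python
import math
from typing import Optional

# Hypothetical boost factors; the only constraint taken from the text is
# that a proper noun is boosted more than a common noun.
NOUN_BOOST = {"proper_noun": 1.5, "common_noun": 1.2}


def initial_importance(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Assumed TF-IDF-style score: frequent terms found in few resumes rank high."""
    return term_freq * math.log((1 + num_docs) / (1 + doc_freq))


def subject_importance(initial: float, noun_type: Optional[str] = None) -> float:
    """Raise the initial importance for nouns; other parts of speech are unchanged."""
    return initial * NOUN_BOOST.get(noun_type, 1.0)


base = initial_importance(term_freq=12, doc_freq=3, num_docs=100)
print(subject_importance(base, "proper_noun") > subject_importance(base, "common_noun"))
```

Candidate subject terms of a post category could then be ranked by `subject_importance`, keeping the highest-scoring ones as that category's subject terms.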

Optionally, the candidate subject term of each position category includes at least one of a domain term or a common term of the position category, and the initial importance of any domain term is not less than the maximum value of the initial importance of all the common terms.

Optionally, the candidate subject terms of each post category include at least one of domain terms or common terms of the post category; for each candidate subject term, the subject term acquisition module, when determining the subject importance of the candidate subject term, may be configured to:

for each domain term, determining the subject importance of the domain term according to the second term frequency and the document frequency corresponding to the domain term and a first maximum document frequency, wherein the first maximum document frequency is the maximum of the document frequencies corresponding to the domain terms of all post categories;

and for each common term, determining the subject importance of the common term according to the second term frequency and the document frequency corresponding to the common term and a second maximum document frequency, wherein the second maximum document frequency is the maximum of the document frequencies corresponding to the common terms of all post categories.

Optionally, the keyword of each post category in the post category feature word library is acquired by the keyword acquisition module in the following manner:

determining a third word frequency of each segmented word in the resume data of the positive sample resumes of each post category;

for each segmented word and each post category, determining the ratio of the number of positive sample resumes of all post categories in which the segmented word appears to the number of positive sample resumes of the post categories other than that post category in which the segmented word appears;

for each post category, determining the category distinguishing capability of each segmented word for the post category according to the third word frequency of the segmented word for the post category and the ratio of the segmented word for the post category;

and determining the keywords of each post category from the segmented words according to the category distinguishing capability of each segmented word for each post category.
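A small Python sketch of this keyword selection follows. It computes, per post category, the word frequency of each segmented word and the ratio of resumes containing the word across all categories to resumes containing it in the other categories. How the two quantities are combined is not specified above, so the product of the frequency with the logarithm of the ratio, and the add-one smoothing that avoids division by zero, are assumptions; the function names and sample data are likewise illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def distinguishing_scores(
    samples: Dict[str, List[List[str]]]
) -> Dict[str, Dict[str, float]]:
    """samples maps a post category to its tokenized positive sample resumes."""
    # Word frequency of each token within each category.
    tf = {c: Counter(t for doc in docs for t in doc) for c, docs in samples.items()}
    # Number of resumes per category that contain each token.
    df = {c: Counter(t for doc in docs for t in set(doc)) for c, docs in samples.items()}
    scores: Dict[str, Dict[str, float]] = {}
    for category in samples:
        scores[category] = {}
        for token, freq in tf[category].items():
            df_all = sum(df[c][token] for c in df)
            df_other = df_all - df[category][token]
            # Assumed smoothing and combination; not taken from the source text.
            ratio = (df_all + 1) / (df_other + 1)
            scores[category][token] = freq * math.log(ratio)
    return scores


def top_keywords(samples: Dict[str, List[List[str]]], k: int = 1) -> Dict[str, List[str]]:
    """Pick the k highest-scoring tokens per category (ties broken alphabetically)."""
    scores = distinguishing_scores(samples)
    return {
        c: sorted(s, key=lambda t: (-s[t], t))[:k] for c, s in scores.items()
    }


samples = {
    "dev": [["python", "team"], ["python", "code"]],
    "hr": [["recruit", "team"], ["recruit", "hire"]],
}
print(top_keywords(samples))
```

A word such as "team", which appears in both categories, receives a low score for each of them, which is the intended effect of the ratio.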

Optionally, the resume clustering module may be configured to, when clustering resumes in the resume set to be processed:

determining similarity among the resumes based on the resume feature vectors of the resumes;

constructing a graph based on the resumes and the similarity between the resumes, wherein each resume is a node in the graph, a connecting edge exists between two nodes if the similarity corresponding to the two nodes is greater than or equal to a set threshold, and the similarity corresponding to the two nodes is used as the weight of the connecting edge;

determining transition probability between nodes with connected edges based on the weight of each connected edge in the graph; based on the transition probability among the nodes in the graph, the nodes in the graph are divided into a plurality of cluster categories by adopting a clustering mode based on random walk.
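The graph construction and transition probabilities described above can be sketched in Python as follows. Cosine similarity is assumed as the similarity measure, and the transition probability from a node to a neighbor is assumed to be the edge weight divided by the sum of the weights of that node's edges; neither detail is fixed by the text above. The random-walk clustering step itself is not shown; it would take these probabilities as input.

```python
import math
from typing import Dict, List, Sequence

Graph = Dict[int, Dict[int, float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def build_graph(vectors: List[Sequence[float]], threshold: float) -> Graph:
    """One node per resume; an edge where similarity >= threshold, weighted by it."""
    graph: Graph = {i: {} for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                graph[i][j] = sim
                graph[j][i] = sim
    return graph


def transition_probabilities(graph: Graph) -> Graph:
    """Normalize each node's edge weights so they sum to 1 (assumed rule)."""
    probs: Graph = {}
    for node, neighbors in graph.items():
        total = sum(neighbors.values())
        probs[node] = {n: w / total for n, w in neighbors.items()} if total else {}
    return probs


vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
graph = build_graph(vectors, threshold=0.8)
print(transition_probabilities(graph))
```

In this toy input the first two resumes are linked and the third is isolated, so a random walk started at node 0 can only ever reach node 1.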

Optionally, the resume to be processed acquiring module is further configured to:

acquiring at least one new resume to be processed of an unknown position category;

the position category determination module is further configured to:

determining the number of the at least one new resume to be processed; if the number is less than a set number, acquiring the resume feature vector of each new resume to be processed and the category feature vector of each cluster category, and, for each new resume to be processed, determining the cluster category to which the new resume belongs according to the similarity between its resume feature vector and the category feature vector of each cluster category, and determining the target position category corresponding to that cluster category as the position category of the new resume; if the number is greater than or equal to the set number, constructing a new resume set to be processed based on the at least one new resume to be processed, clustering the resumes in the new resume set to obtain a plurality of cluster categories, determining the target position category of each cluster category according to the proportion of the resumes of each known position category among the resumes belonging to that cluster category, and determining the target position category corresponding to the cluster category to which each new resume belongs as the position category of that new resume.
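The small-batch branch above can be sketched in Python as below. The function `assign_new_resumes`, its parameter names, and the use of cosine similarity are assumptions for illustration. Returning `None` when the batch reaches the set number is simply this sketch's way of signaling that the caller should build a new resume set and re-cluster instead.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def assign_new_resumes(
    new_vectors: List[Sequence[float]],
    category_vectors: Dict[str, Sequence[float]],
    cluster_to_position: Dict[str, str],
    set_number: int,
) -> Optional[List[str]]:
    """Assign each new resume to its most similar cluster, or request re-clustering."""
    if len(new_vectors) >= set_number:
        return None  # large batch: caller should re-cluster a new resume set
    positions = []
    for vec in new_vectors:
        # Most similar existing cluster by category feature vector.
        best = max(category_vectors, key=lambda c: cosine(vec, category_vectors[c]))
        positions.append(cluster_to_position[best])
    return positions


category_vectors = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
cluster_to_position = {"c1": "engineer", "c2": "designer"}
print(assign_new_resumes([[0.9, 0.1]], category_vectors, cluster_to_position, set_number=5))
```

The design intent of the two branches is that a handful of new resumes can be placed cheaply against existing clusters, while a large batch justifies the cost of clustering again.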

Optionally, the position category determining module may be configured to:

determining the proportion of the first resumes of each post category among the resumes belonging to the cluster category; and, for the maximum proportion among these proportions, if the maximum proportion is not less than a set threshold, determining the position category corresponding to the maximum proportion as the target position category corresponding to the cluster category.

In yet another aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor; the memory is configured to store a computer program; the processor, when executing the computer program, performs the method as provided in any of the alternative embodiments of the present application.

In another aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, the computer program may implement the method provided in any optional embodiment of the present application.

In yet another aspect, the present application provides a computer program product, which includes a computer program that, when executed by a processor or a computer device, implements the steps of the method provided by the present application.

In yet another aspect, embodiments of the present application provide a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the resume processing method provided in any optional embodiment of the present application.

The beneficial effects of the technical solutions provided by the present application are as follows:

The resume processing method provided by the embodiments of the present application clusters a resume set to be processed, which contains a large number of resumes of known position categories together with resumes of unknown position categories, based on the similarity between the resumes. Because the cluster categories are determined from the similarity between resumes, and the target position category of each cluster category is determined from the proportion of resumes of each known position category among the resumes belonging to that cluster category, the target position category of a cluster category represents well the position category of the resumes assigned to it.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic flowchart of a resume processing method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of a method for obtaining a hybrid feature vector of a character according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a resume processing method according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of obtaining a hybrid feature vector according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a clustering result of a graph provided in an example of the present application;

Fig. 6 is a schematic diagram of a resume processing system according to an embodiment of the present application;

Fig. 7 is a schematic flowchart of a resume processing method in a specific application scenario provided by the present application;

Fig. 8 is a schematic structural diagram of a resume processing apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.

Automatic resume screening systems have a wide range of applications in position matching and mining, talent screening, headhunting and similar fields. Conventional resume screening methods mainly include matching a user portrait against position capability labels, matching resumes to positions based on keywords, and scoring resumes with a purpose-built occupation scoring model. Although these approaches achieve some degree of automated resume screening, the inventors have found that they have at least the following problems:

The method of matching a user portrait against position capability labels requires a capability label library to be created manually, and matches the capability portrait of a resume against the capability labels of the target position in that library. It depends on a manually built prior knowledge base, has low generality, and requires the position knowledge base to be rebuilt for each profession, which is very time-consuming and labor-intensive. The keyword-based method judges whether a resume meets the position requirements by computing, according to the target keyword system of the target position, whether the matching value between the target resume and the target position reaches a set threshold. It relies mainly on keywords, so the accuracy of resume screening is unstable and the results are not satisfactory. In the method that scores resumes with an occupation scoring model, model training is time-consuming, which makes it difficult to meet the timeliness requirements of resume screening in industrial applications.

In view of this, the embodiments of the present application provide a resume processing method that can efficiently and accurately determine the position category of a resume of an unknown position category, so as to solve at least one of the above problems of conventional resume screening systems and better meet the need for automatic resume processing. The scheme can be widely applied in scenarios such as human resource system construction, targeted talent mining and target-group headhunting, and has high industrial application value and guiding significance.

Optionally, the resume processing method provided in the embodiments of the present application may be applied to big data processing and may, for example, be implemented based on cloud technology. The data computation involved may use cloud computing, for example for clustering a large number of resumes or computing the similarity between resume feature vectors. Resumes may also be kept in cloud storage; for example, the resume set to be processed, the positive sample resume sets and the like may be stored in the cloud.

Big data refers to data sets that cannot be captured, managed and processed by conventional software tools within an acceptable time frame; it is a massive, fast-growing and diversified information asset that yields stronger decision-making power, insight and process optimization capability only under new processing modes. With the advent of the cloud era, big data has attracted more and more attention, and special techniques are needed to process such large amounts of data effectively within a tolerable time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet and scalable storage systems. Cloud technology is a general term for the network, information, integration, management platform and application technologies applied under the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support for them.

Optionally, the scheme provided in the embodiments of the present application may also be implemented based on artificial intelligence (AI) technology. For example, the word feature vectors of the post feature words and the character feature vectors of the characters in those words may be obtained by feature extraction with a trained neural network. The clustering of the resumes may also be realized by a classification model or in another machine learning manner. Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a resume processing method according to an embodiment of the present application. The method may be executed by any electronic device, for example a terminal device. By executing the method, the terminal device can determine the position categories of the resumes of unknown position categories within a large set of resumes containing both known and unknown position categories, thereby automatically identifying the position category of each resume. Subsequent processing can then be performed on the resumes whose position categories have been determined; for example, all resumes of a certain position category may be selected as reserve-talent resumes or used in recruitment. The method may also be executed by a server, which optionally may be a cloud server. The method may be implemented as a resume processing program, or as a plug-in or functional module of an existing resume processing program. For example, as a new functional module of a recruitment application, the method can confirm the position categories of massive numbers of resumes, so that more suitable candidates can be recommended to recruiter clients using the application.

As shown in fig. 1, the resume processing method provided by the embodiment of the present application may include the following steps S110 to S140.

Step S110: acquiring a resume set to be processed, wherein the resume set to be processed comprises a plurality of first resumes of known position categories and at least one second resume of an unknown position category.

The specific division mode of the post categories is not limited in the embodiment of the application, and can be determined according to actual requirements. One post category may correspond to a specific post, or may correspond to a plurality of different posts, and the plurality of different posts are divided into one category according to a certain post division rule. That is to say, the classification granularity of the post categories may be set according to the actual application requirements, for example, an intellectual property engineer may serve as one post category, the posts belonging to the post category may include a plurality of posts such as a domestic patent agent, a patent agent involved in foreign countries, a trademark engineer, etc., if the classification is made with finer granularity, each of the different posts may serve as one post category, for example, a domestic patent agent may serve as an independent post category. Alternatively, each position may be regarded as a position category.

The first resumes of the multiple known position categories include multiple first resumes of each of the multiple position categories, and optionally, each first resume may have a label, where the label represents the position category corresponding to the resume. For example, the resume set to be processed includes 50 resumes with a label 1 and 60 resumes with a label 2, where the position category corresponding to the label 1 is a and the position category corresponding to the label 2 is B, that is, the resume set includes 50 resumes with the position category a and 60 resumes with the position category B. The second resume of the unknown position category is also the resume for which the position category needs to be determined.

Optionally, the first resume in the embodiment of the present application is a resume meeting a preset requirement, which may be specifically referred to as a qualified resume, and the manner of screening the first resume is not limited in the embodiment of the present application and may be configured according to an actual requirement. Optionally, for each post category, the resume of the historical employees and the current employees of the post category may be divided into the positive and negative sample sets according to preset rules, rules for screening the positive and negative samples may be flexibly set according to the needs of the scene, and the resume in the positive sample set is used as the first resume of the post category.

As an example, assume that the preset rule for screening the positive and negative sample sets is as follows:

the resume of the employee of which the post is more than a preset year (if the post is 3 years) and the assessment result meets the requirement in the post period is taken as a positive sample of the post category to which the post belongs, namely a qualified resume;

and the resume of an employee who left the post within a set period (which can be set to 2 years), or who is still in the post but whose assessment results during that time do not meet the requirement, is taken as a negative sample of the post category to which the post belongs, namely a non-qualified resume.

The condition for judging whether the assessment results meet the requirement can be configured as needed. For example, if the proportion of passed assessments among all assessments during the employee's time in the post is greater than or equal to a preset threshold (for example, 0.9 or 1), the assessment results are considered to meet the requirement; if the employee is still in the post but the proportion of failed assessments during that time is greater than or equal to a preset threshold (for example, 0.5), the assessment results are considered not to meet the requirement.
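The example rule above can be sketched as a small labelling function. This is a minimal illustration only: the record fields, the function name and the default thresholds (3 years, 2 years, 0.9, 0.5) are assumptions taken from the example values, and the negative rule is read as "left within the set period, or still in the post but failing the assessment requirement".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmployeeRecord:
    """Hypothetical record describing one employee's time in a post."""
    years_in_post: float       # time spent in the post, in years
    still_in_post: bool        # False if the employee has left the post
    passed_assessments: int    # assessments passed during that time
    total_assessments: int     # all assessments taken during that time


def label_sample(rec: EmployeeRecord,
                 min_years: float = 3.0,
                 max_years: float = 2.0,
                 pass_threshold: float = 0.9,
                 fail_threshold: float = 0.5) -> Optional[str]:
    """Return "positive", "negative", or None if neither rule applies."""
    if rec.total_assessments <= 0:
        return None  # no assessment data, so the example rule cannot be applied
    pass_ratio = rec.passed_assessments / rec.total_assessments
    failed = rec.total_assessments - rec.passed_assessments
    fail_ratio = failed / rec.total_assessments

    # Positive sample: long enough in the post and assessments meet the requirement.
    if rec.years_in_post > min_years and pass_ratio >= pass_threshold:
        return "positive"
    # Negative sample: left within the set period, or in the post but failing.
    left_early = (not rec.still_in_post) and rec.years_in_post <= max_years
    failing_in_post = rec.still_in_post and fail_ratio >= fail_threshold
    if left_early or failing_in_post:
        return "negative"
    return None
```

Resumes for which the function returns None fall under neither example rule and would simply be left out of both sample sets in this sketch.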

The resume set to be processed is constructed by obtaining a plurality of qualified resumes for each post category. In subsequent processing, a better clustering result can then be obtained from base data (namely the first resumes of known post categories) that better meets the application requirements, so that the post category of a resume of an unknown post category can be determined more reliably from the clustering result.

Step S120: acquiring a resume feature vector of each resume in the set of resumes to be processed; clustering the resumes in the resume set to be processed based on the similarity between the resume feature vectors of the resumes in the resume set to be processed to obtain a plurality of clustering categories;

The resume feature vector of a resume can also be referred to as a resume feature, and the specific manner of acquiring it is not limited in the embodiment of the application. Optionally, the resume features may be extracted through a trained neural network model, for example, a trained convolutional neural network model, a long short-term memory (LSTM) network, or the like.

In an optional embodiment of the present application, for each resume (first resume or second resume) in the set of resumes to be processed, obtaining the resume feature vector of the resume may include:

extracting resume data of at least one target information module in the resume based on the modular information structure of the resume;

and obtaining the resume feature vector of the resume according to the resume data of each target information module of the resume obtained by extraction.

In practical applications, resume text differs markedly from other text in that it has a hierarchical, modular structure: a resume usually includes a plurality of information modules, which generally include personal basic information, job seeking intention, educational experience, work experience (project experience), self evaluation, professional skills, awards, and the like. The content of some of these modules has no obvious relation to the position category to be determined. Therefore, to reduce the amount of data to be processed and improve the expressive power of the resume feature vector, the resume data can be filtered: the content of modules that are not obviously related to the position category (namely non-target information modules) is filtered out, and only the specific content (namely the resume data) of the target information modules (the modules that are obviously related to the position category) is extracted. The feature vector of the resume is then obtained through further processing based on this content.

Which information modules are filtered out, and which are kept as target information modules, can be set according to one or more of empirical values, experimental values or actual requirements. Optionally, the non-target information modules may include: the basic information module (name, sex, birthday, address, mobile phone number, mailbox, etc.), the job seeking intention module (intended post, expected salary, etc.), and the like.

Optionally, after the resume data of each target information module in the resume is extracted, the data may be preprocessed to remove data that has little or no influence on determining the position category of the resume, and the feature vector of the resume is then determined based on the preprocessed data. Various preprocessing modes may be used. For example, regular-expression matching can be adopted to filter out the time information in the resume data of each target information module: if the resume data in the work experience module states that the candidate held role b at company a from 2018 to 2020, the time information has no influence on determining the position category and can be removed.
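As a minimal sketch of the time-information filtering described above, the following removes year and year-range expressions with a regular expression. The pattern and the function name are illustrative assumptions; a real resume corpus would need patterns matched to its own date formats and language.

```python
import re

# A year (1900-2099) with an optional month, e.g. "2018", "2019.03", "2020-6".
_DATE = r"(?:19|20)\d{2}(?!\d)(?:[./-]\d{1,2}(?!\d))?"

# An optional "from", a date, and an optional range end ("to 2020", "-2021.06", "to present").
_TIME_PATTERN = re.compile(
    r"(?:from\s+)?\b" + _DATE + r"(?:\s*(?:to|-|~)\s*(?:" + _DATE + r"|present|now))?",
    flags=re.IGNORECASE,
)


def strip_time_info(text: str) -> str:
    """Remove date and date-range expressions, then collapse extra whitespace."""
    without_time = _TIME_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_time).strip()
```

For instance, applying `strip_time_info` to "Served as b at company a from 2018 to 2020" leaves only the role and company text.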

For each first resume and each second resume, resume data corresponding to each processed resume can be obtained through the processing mode, and then the feature vector of the resume can be obtained based on the resume data of each resume.

After the feature vector of each resume is obtained, the resumes may be clustered based on the similarity between the feature vectors of different resumes to obtain a plurality of different cluster categories (which may also be referred to as clusters, communities, etc.). Because the feature vector of a resume is obtained from its resume data, namely its content, clustering groups resumes with similar content into the same cluster category. The resumes are thus grouped by content, which provides a basis for subsequently determining the position category of the resumes of unknown position category.

The embodiment of the present application is not limited to the specific manner used for clustering, and any conventional clustering manner may be used, for example, resume classification may be performed by a classification model, or other clustering algorithms may be used. For example, a graph may be constructed based on the similarity between the resumes and the feature vectors of the resumes, each resume corresponds to one node in the graph, whether a connecting edge exists between the nodes is determined according to the similarity, and then classification of the nodes may be achieved through a neural network model according to the constructed graph and the resume feature vectors corresponding to the nodes, so as to obtain a plurality of clustering categories.
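The graph construction mentioned above can be illustrated with a simplified sketch. Note the substitution: instead of classifying nodes with a neural network, this example takes the connected components of the similarity graph as cluster categories. The cosine similarity measure, the edge threshold of 0.8 and all function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_by_similarity_graph(vectors: Sequence[Sequence[float]],
                                edge_threshold: float = 0.8) -> List[int]:
    """Return one cluster id per resume; each resume is a node in the graph."""
    n = len(vectors)
    parent = list(range(n))  # union-find forest over the nodes

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Connect two nodes with an edge when their similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(vectors[i], vectors[j]) >= edge_threshold:
                parent[find(i)] = find(j)

    # Number the connected components in order of first appearance.
    ids: Dict[int, int] = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

Connected components are a crude stand-in: a single chain of pairwise-similar resumes merges into one cluster, which is one reason a learned node classifier or a dedicated community detection algorithm may be preferred in practice.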

Step S130: and for each cluster type, determining a target position type corresponding to the cluster type according to the proportion of the first resume of each position type in the resumes belonging to the cluster type.

Step S140: and for each second resume, determining the target position category corresponding to the cluster category to which the second resume belongs as the position category corresponding to the second resume.

Through step S120, the resumes in the resume set to be processed are divided into a plurality of different cluster categories, each of which includes at least one resume. Because the content similarity between resumes belonging to the same cluster category should be relatively high, a large proportion of first resumes of a certain position category in a cluster category indicates that resumes of that position category are likely to fall into this cluster category. Therefore, according to the proportion of the first resumes of each position category among all the resumes belonging to the cluster category, the position category with the larger proportion can be determined as the target position category corresponding to the cluster category; that is, the position category of the resumes assigned to the cluster category is likely to be that position category.

Therefore, after the clustering is completed, the target position category corresponding to the clustering category can be determined according to the proportion of the first resume of each position category in the resumes belonging to the clustering category. For the second resume with the unknown position category, the position category corresponding to the second resume can be determined as the target position category corresponding to the cluster category to which the resume belongs.

For example, for a cluster category, after determining the percentage of the first resume of each position category in the resumes of the cluster category, the position category corresponding to the largest percentage may be determined as the target position category corresponding to the cluster category, and if a resume of an unknown position category belongs to the cluster category, it may be determined that the position category of the resume is the target position category corresponding to the cluster category. As an example, if the number of first resumes of the a position category in one cluster category is 8, the number of first resumes of the b position category is 1, and the number of second resumes of the unknown position category is 1, the occupation ratio of the resumes of the a position category is the largest, the target position category corresponding to the cluster category is the a position category, and the position category of the second resume is determined to be the a position category.
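Steps S130 and S140 in their simplest form (largest proportion, no threshold) can be sketched as follows. The function name and the convention that a second resume is marked with None are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence


def propagate_position_categories(cluster_ids: Sequence[int],
                                  known_categories: Sequence[Optional[str]]
                                  ) -> List[Optional[str]]:
    """Give each second resume (None) the majority category of its cluster.

    cluster_ids[i] is the cluster category of resume i, and known_categories[i]
    is its position category, or None if the resume is a second resume.
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for cid, category in zip(cluster_ids, known_categories):
        if category is not None:
            counts[cid][category] += 1

    # Target position category per cluster: the most frequent known category.
    # (Ties are broken by first occurrence, which is an arbitrary choice here.)
    target = {cid: c.most_common(1)[0][0] for cid, c in counts.items()}

    # Clusters without any first resume have no target category (None).
    return [category if category is not None else target.get(cid)
            for cid, category in zip(cluster_ids, known_categories)]
```

With the cluster from the example (8 first resumes of category a, 1 of category b, 1 second resume), the second resume receives category a.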

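Optionally, for each cluster category, determining the target position category corresponding to the cluster category according to the proportion of the first resumes of each position category in the resumes belonging to the cluster category may include:

determining, as the target position category corresponding to the cluster category, the position category whose first resumes account for the largest proportion of the resumes belonging to the cluster category, where that proportion is not smaller than a set threshold.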

To improve the accuracy of the determined position categories, a condition can be imposed when determining the target position category of each cluster category: if the proportion of the first resumes of a position category in a cluster category is the largest and is not smaller than a set threshold, that position category can be used as the target position category corresponding to the cluster category. The threshold can be set as required. The larger the threshold, the higher the accuracy of the position category determined for resumes of unknown position category; however, with a higher threshold, some cluster categories may have no target position category, in which case the position category of the resumes of unknown position category assigned to those cluster categories cannot be determined in this way. The threshold may therefore be configured according to one or more of actual demand, empirical values, or experimental values.

Of course, in practical applications, if none of the proportions corresponding to a certain cluster category is greater than or equal to the set threshold, that is, the corresponding target position category cannot be determined, the position category of a resume of unknown position category assigned to this cluster category may be determined in other manners. For example, a feature vector of each position category may be determined from the resume feature vectors of the first resumes of that position category (for example, as the average of the resume feature vectors of all the first resumes belonging to the position category). The similarities between the feature vector of the resume of unknown position category and the feature vectors of the position categories are then calculated, and the position category with the maximum similarity is determined as the position category of that resume.
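The fallback described above can be sketched as a nearest-centroid lookup. This is a minimal illustration: the use of cosine similarity and all function names are assumptions, since the similarity measure is not fixed by the description.

```python
import math
from typing import Dict, List, Optional, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_position_category(
        resume_vector: Sequence[float],
        first_resume_vectors: Dict[str, Sequence[Sequence[float]]]
) -> Optional[str]:
    """Return the category whose average first-resume vector is most similar."""
    best_category: Optional[str] = None
    best_similarity = -math.inf
    for category, vectors in first_resume_vectors.items():
        similarity = cosine(resume_vector, mean_vector(vectors))
        if similarity > best_similarity:
            best_category, best_similarity = category, similarity
    return best_category
```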

As an alternative, the set threshold may be set to 60%. If, in one cluster category, the number of first resumes of position category a is 8, the number of first resumes of position category b is 1, and the number of second resumes of unknown position category is 1, then the proportion of resumes of position category a is the largest, at 80%, and 80% is greater than 60%, so the target position category corresponding to this cluster category is position category a.
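The thresholded rule and the worked example can be checked with a short sketch. The function name is hypothetical, and the proportion is computed over all resumes in the cluster category (first and second resumes together), which is what the 80% figure in the example implies.

```python
from collections import Counter
from typing import Optional, Sequence


def target_position_category(first_resume_categories: Sequence[str],
                             cluster_size: int,
                             threshold: float = 0.6) -> Optional[str]:
    """Return the cluster's target category, or None if the threshold is not met.

    first_resume_categories lists the category of every first resume in the
    cluster; cluster_size counts all resumes in the cluster.
    """
    if cluster_size <= 0 or not first_resume_categories:
        return None
    category, count = Counter(first_resume_categories).most_common(1)[0]
    proportion = count / cluster_size
    return category if proportion >= threshold else None
```

With 8 first resumes of category a, 1 of category b and 1 second resume, the proportion for a is 8/10 = 0.8, which reaches the 0.6 threshold, so the function returns a.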

Optionally, after the post category of each second resume is determined, the post categories of the second resumes may be used to process the second resumes, where what kind of processing is specifically performed may be configured according to actual requirements, for example, the second resumes may be classified according to the post categories, or resumes of a specific post category may be screened from the second resumes.

According to the resume processing method provided by the embodiment of the application, a large number of resumes of known and unknown position categories in the resume set can be clustered into a plurality of cluster categories based on the similarity between the resumes, and the target position category corresponding to each cluster category is determined based on the proportion of the resumes of each known position category among the resumes belonging to that cluster category. Because the cluster categories are formed from the similarity between resumes and the target position category is chosen from these proportions, the target position category corresponding to a cluster category represents well the position categories of the resumes assigned to that cluster category, so the position category of each resume of unknown position category can be determined conveniently and quickly as the target position category of the cluster category to which it belongs.

In addition, when clustering is performed based on the similarity between the resumes, because the resume set simultaneously contains the resumes of the known position classes and the resumes of the unknown position classes, that is, when determining the position classes of the resumes of the unknown position classes, the method of the embodiment considers the relevance between the resumes of the known position classes and the resumes of the unknown position classes, and therefore, the accuracy of the determined position classes of the resumes of the unknown position classes is also well ensured.

In addition, based on the method provided by the embodiment of the application, the post categories corresponding to a large number of resumes of unknown post categories can be determined quickly without an a priori knowledge base being built manually or by experts. This provides good technical support for further processing of the resumes, and resumes can be screened automatically without manual participation. For example, according to the determined post categories of all the second resumes, the resumes of a required post category can be obtained, the resumes can be classified by post category, or a server can push resumes of the corresponding post categories to different companies according to the demands of those companies.

In an optional embodiment of the present application, the obtaining of the resume feature vector of each resume in the to-be-processed resume set may include:

acquiring a post category feature word library, wherein the post category feature word library comprises post feature words of a plurality of post categories;

and for each resume in the resume set to be processed, determining the position feature words contained in the resume according to the position category feature word library, and obtaining the resume feature vector of the resume according to the position feature words contained in the resume.

The position feature words of a position category can be understood as words that can be used to distinguish the position category from other position categories, that is, words that are representative of the position category, words that have a relatively high probability of appearing in resumes of the position category, or words that are specific to the position category.

Optionally, the position feature words of each position category may include at least one of subject words, keywords, or named entities of the position category.

In this context, for a text, a subject word is understood to be a word that represents the main content of the text or at least a part of its content, that is, a word that represents the central concept, main content or topic of the text. Accordingly, in the embodiment of the present application, a subject word of a position category refers to a word that can represent the position category, that is, a word with good position-category distinguishing capability, and it may be a so-called position keyword or a non-keyword. Whether a word is a subject word of a post category can be determined by the importance of the word to the post category (how much the word contributes to judging whether a resume is a resume of the post category), and the specific measurement condition can be determined according to requirements or experience. Subject words typically satisfy one or more of the principles of utility (meeting the requirements of indexing and retrieving text), accuracy (accurately expressing the meaning of a concept), and commonality (being commonly accepted words).

The keywords can be simply understood as more important words. In the embodiment of the present application, the keyword refers to a keyword of a position category, that is, a word having a distinguishing capability for a position category, for example, one word often appears in the resume of one position category, but less appears in the resumes of other position categories, and this word may be used as the keyword of this position category.

Named entities refer to entities identified by name; in a broader sense, entities may also include numbers, addresses, and the like. In the embodiment of the present application, a named entity of a position category refers to a named entity appearing in resumes of the position category, such as a company name. Named entities appearing in resume text often play an important role in position screening and distinguishing. Therefore, a Named Entity Recognition (NER) algorithm can be applied to the sample resumes of each position category to obtain the named entities in the resume text, and thereby the named entities corresponding to each position category.

For each resume in the resume set to be processed, which position feature words are included in the resume can be determined according to the position feature words included in the position category feature word library, and the feature vector of the resume can be obtained according to the position feature words included in the resume, for example, the feature vector of the resume can be extracted and obtained through a neural network model based on the position feature words included in the resume. Because the post category feature word library contains the representative words with the post category distinguishing capability of each post category, the resume feature vector obtained by the method can well represent the feature of the resume text of the category, and therefore after subsequent clustering is carried out on the resume feature vectors based on a large number of resumes, the target post category corresponding to each clustering category can well represent the post category of the resumes belonging to the clustering category.
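As a minimal sketch of deriving a feature vector from the position feature words contained in a resume, the following builds a simple count vector over the feature word library. This bag-of-words representation is an assumption chosen for illustration; the description also allows the vector to be extracted by a neural network model. The function name and the pre-tokenized input are likewise hypothetical.

```python
from typing import List, Sequence


def resume_feature_vector(resume_tokens: Sequence[str],
                          feature_words: Sequence[str]) -> List[int]:
    """Count how often each position feature word occurs in the resume.

    feature_words is the (ordered) position category feature word library;
    the returned vector has one dimension per feature word.
    """
    index = {word: i for i, word in enumerate(feature_words)}
    vector = [0] * len(feature_words)
    for token in resume_tokens:
        position = index.get(token)
        if position is not None:  # tokens outside the library are ignored
            vector[position] += 1
    return vector
```

Tokens that are not in the library contribute nothing, so two resumes are compared only on words that carry position-category information.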

In an optional embodiment of the present application, the position category feature word library may be extracted based on a positive sample resume set of each of a plurality of position categories, where the positive sample resume set of each of the position categories includes a plurality of positive sample resumes.

The positive sample resume in the positive sample resume set of each post category refers to a resume meeting the post requirement conditions of the post category, and the conditions can be configured according to actual requirements. Optionally, the positive sample resume of a position category refers to the qualified resume of the position category, that is, the qualified resume in the resume belonging to the position category. For the explanation of the eligible resume, reference may be made to the description in the foregoing, and the explanation will not be repeated here.

The advantage of obtaining the position feature words of each position category from the positive sample resume sets of the plurality of position categories is that it avoids the problem of non-position feature words, which unqualified resumes may contain, being included among the position feature words. The position feature words in the position category feature word library therefore have good position-category distinguishing capability, which provides a better basis for determining the resume feature vector of each resume in the resume set to be processed based on the feature word library.

In the embodiment of the present application, the positive sample resumes in the positive sample resume sets and the first resumes of known position categories in the resume set to be processed may be the same or different. That is, the plurality of first resumes in the resume set to be processed may, but need not, be resumes from the positive sample resume sets of the plurality of position categories.

In an optional embodiment of the present application, the subject term of each post category in the post category feature word library may be obtained by the following method:

determining candidate subject terms of each position category from each participle according to the first word frequency of each participle in the resume data of the positive sample resume of each position category;

for each candidate subject term, determining the subject importance of the candidate subject term according to the second word frequency of the candidate subject term in the resume data of the positive sample resumes of all the post categories and the document frequency of the candidate subject term in the resume data of the positive sample resumes of all the post categories;

For each sample resume, the resume data of the resume may be all resume data in the resume, that is, content in the whole text, or preprocessed resume data, and optionally, the resume data of each sample resume may be resume data in the target information module in the resume extracted by the method described in the foregoing, or data obtained by further processing the resume data of the extracted target information module.

Here, word frequency refers to the number of times a given word occurs in a document or corpus. It can be used to evaluate how frequently a word recurs in a document or in a set of domain documents in a corpus, and the count is usually normalized to prevent it from being biased toward long documents.

In the embodiment of the application, for a participle, the first word frequency refers to the frequency with which the word occurs in the resume data of the positive sample resumes of one position category, and the second word frequency refers to the frequency with which the word occurs in the resume data of the positive sample resumes of all position categories. For a position category, the first word frequency of a participle in the resume data of the positive sample resumes of the category may be represented as follows:

first word frequency = (number of times the participle occurs in the resumes of the category) / (number of all participles occurring in the resumes of the category)

The second word frequency of a participle in the resume data of the positive sample resumes of all position categories can be expressed as follows:

second word frequency = (number of times the participle occurs in the resumes of all categories) / (number of all participles occurring in the resumes of all categories)

In the above two expressions, the category refers to a post category, the resume refers to a positive sample resume, and the number of all occurring participles is the total number of participle occurrences.

The document frequency of a word characterizes the number of documents in a given document set in which the word occurs. In this embodiment of the present application, the document frequency of a candidate subject term in the resume data of the positive sample resumes of all the position categories characterizes the number of positive sample resumes, across all position categories, in which the candidate subject term appears. Optionally, the document frequency may be determined based on the ratio of the number of positive sample resumes in which the candidate subject term appears to the number of all positive sample resumes; for example, the ratio may be adopted directly as the document frequency.
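The three statistics can be computed as in the sketch below. It assumes that each word frequency is normalized by the total number of participle occurrences in the corresponding resumes and that the document frequency is taken directly as a ratio; the corpus layout (post category mapped to a list of tokenized positive sample resumes) and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical corpus layout: post category -> list of tokenized positive sample resumes.
Corpus = Dict[str, List[List[str]]]


def first_word_frequency(corpus: Corpus, category: str) -> Dict[str, float]:
    """Normalized frequency of each participle within one post category."""
    counts = Counter(tok for resume in corpus[category] for tok in resume)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def second_word_frequency(corpus: Corpus) -> Dict[str, float]:
    """Normalized frequency of each participle over all post categories."""
    counts = Counter(tok for resumes in corpus.values()
                     for resume in resumes for tok in resume)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def document_frequency(corpus: Corpus) -> Dict[str, float]:
    """Share of all positive sample resumes in which each participle appears."""
    resumes = [resume for rs in corpus.values() for resume in rs]
    containing = Counter(word for resume in resumes for word in set(resume))
    n = len(resumes)
    return {word: c / n for word, c in containing.items()} if n else {}
```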

For each participle, the importance of the word to different position categories can be determined by comparing the word's frequency across position categories. If its frequency (the first word frequency) is higher in a certain position category and lower in the other position categories, the word is more likely to be usable as a subject word of that position category. The candidate subject words corresponding to each position category can therefore be determined from all participles based on the first word frequencies of each participle for the different position categories. The second word frequency reflects how often a word occurs in the resumes of all the post categories, and the document frequency measures how widespread a word is across the resumes of all the categories. A high second word frequency indicates that the word may be more important; a high document frequency, however, indicates that the word is likely to occur in resumes of every category, so its ability to distinguish one post category is weak. Therefore, after the candidate subject words of each post category are determined, the subject importance of each candidate subject word can further be determined based on its second word frequency and document frequency. The subject importance characterizes how likely the word is to serve as a subject word of a post category, that is, the strength of the word's post-category distinguishing capability.

The specific manner of determining the topic importance of a candidate topic word based on the second word frequency and the document frequency corresponding to the candidate topic word may be configured according to requirements, for example, the topic importance may be represented by a ratio of the second word frequency to the document frequency.
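The ratio mentioned above can be written as a one-line computation. The function name and the dictionary inputs are illustrative assumptions; words with a zero or missing document frequency are skipped to avoid division by zero.

```python
from typing import Dict


def topic_importance(second_tf: Dict[str, float],
                     doc_freq: Dict[str, float]) -> Dict[str, float]:
    """Topic importance as the ratio of second word frequency to document frequency."""
    return {word: tf / doc_freq[word]
            for word, tf in second_tf.items()
            if doc_freq.get(word, 0.0) > 0.0}
```

A word that is frequent overall but concentrated in few resumes gets a high score, while a word spread across most resumes is penalized by its large document frequency.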

Optionally, the candidate subject word of each position category may include at least one of a domain word or a general word of the position category.

The domain words are proprietary words in one domain, and in the embodiment of the application, the domain words can be understood as proprietary words in resumes of one post category, that is, words which are likely to appear in resumes of one post category and rarely appear in resumes of other post categories. A general word for one position category refers to a word that appears relatively infrequently relative to domain words, but the frequency with which the word appears in the resume for that position category is still high relative to the frequency with which the word appears in the resumes for other position categories. By dividing the field words and the common words, the division of word types with finer granularity can be realized, so that the subject words of all post categories can be selected more specifically.

In practical applications, each participle can be assigned a word type according to its importance to the subject, and optionally the participles can be divided into three types: domain words, common words, and irrelevant words. If a participle has a higher frequency in a certain position category and a lower frequency in the other position categories, it can be used as a domain word of that position category; if the frequency of the participle differs little across the position categories, it can be determined to be an irrelevant word; otherwise, it is determined to be a common word. Specifically, the type of a word may be determined by setting frequency thresholds (i.e., word frequency thresholds), for example, a first threshold and a second threshold. If the difference between the first word frequency of a participle for one position category and its first word frequency for each of the other position categories is not less than the first threshold, the participle is considered a domain word of that position category. If the differences between the first word frequencies of a participle for the position categories are all less than the second threshold, the participle is considered an irrelevant word. If the difference between the first word frequency of a participle for one position category and its first word frequency for each of the other position categories is not less than the second threshold but less than the first threshold, the participle is considered a common word of that position category.
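The threshold-based division can be sketched as follows. This is one possible reading, stated as an assumption: the "difference to each other position category" is taken as the gap between a category's first word frequency and the highest first word frequency among the other categories, and the lower bound for common words is taken to be the second threshold. The function name and return convention are hypothetical.

```python
from typing import Dict, Optional, Tuple


def classify_word(first_tf_by_category: Dict[str, float],
                  first_threshold: float,
                  second_threshold: float) -> Tuple[str, Optional[str]]:
    """Return (word type, position category) for one participle.

    first_tf_by_category maps each position category to the participle's
    first word frequency in that category. The word type is "domain",
    "common" or "irrelevant"; the category is None for irrelevant words.
    """
    frequencies = first_tf_by_category
    # Irrelevant: the frequencies differ by less than the second threshold everywhere.
    if max(frequencies.values()) - min(frequencies.values()) < second_threshold:
        return ("irrelevant", None)

    # Only the category with the highest frequency can exceed all the others.
    best = max(frequencies, key=frequencies.get)
    others = [f for c, f in frequencies.items() if c != best]
    gap = frequencies[best] - max(others, default=0.0)

    if gap >= first_threshold:
        return ("domain", best)
    if gap >= second_threshold:
        return ("common", best)
    return ("irrelevant", None)
```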

Optionally, when the candidate subject words are further subdivided into domain words and common words, determining, for each candidate subject word, the subject importance of the candidate subject word according to its second word frequency in the positive sample resumes of all position categories and its document frequency in the positive sample resumes of all position categories may include:

for each field word, determining the topic importance of the field word according to the second word frequency and the document frequency corresponding to the field word and the first maximum document frequency, wherein the first maximum document frequency is the maximum value of the document frequencies corresponding to the field words of all position categories;

and for each common word, determining the topic importance of the common word according to a second word frequency and a document frequency corresponding to the common word and a second maximum document frequency, wherein the second maximum document frequency is the maximum value of the document frequencies corresponding to the common words of all position categories.

That is, the subject importance of a candidate subject word can be determined in a manner that depends on whether the candidate subject word is a domain word or a common word. The first maximum document frequency and the second maximum document frequency serve as reference parameter values for determining the subject importance of candidate subject words of the corresponding type. This solution may be understood as normalizing the document frequency corresponding to the candidate subject word; for example, the ratio of the maximum document frequency to the document frequency corresponding to the candidate subject word may be used as the normalized document frequency of the candidate subject word.
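A minimal sketch of the normalization follows. The description gives the normalized document frequency (maximum document frequency divided by the word's document frequency) but not how it is combined with the second word frequency; multiplying the two is an assumption made here for illustration, as are the function and parameter names. The function is meant to be called once with the domain words and once with the common words, so that each group uses its own maximum document frequency.

```python
from typing import Dict, Iterable


def normalized_topic_importance(second_tf: Dict[str, float],
                                doc_freq: Dict[str, float],
                                words: Iterable[str]) -> Dict[str, float]:
    """Topic importance of one word group (all domain words or all common words).

    Assumed combination: second word frequency * (max document frequency of
    the group / document frequency of the word).
    """
    group = [w for w in words if doc_freq.get(w, 0.0) > 0.0]
    if not group:
        return {}
    max_df = max(doc_freq[w] for w in group)  # first or second maximum document frequency
    return {w: second_tf.get(w, 0.0) * (max_df / doc_freq[w]) for w in group}
```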

In an alternative embodiment of the application, for each candidate topic word, the determining the topic importance of the candidate topic word according to the second word frequency of the candidate topic word in the positive sample resumes of all position categories and the document frequency of the candidate topic word in the positive sample resumes of all position categories may include:

determining the initial importance of the candidate subject term according to the second term frequency and the document frequency corresponding to the candidate subject term;

determining the part of speech of the candidate subject word;

and determining the topic importance of the candidate subject term according to the part of speech and the initial importance of the candidate subject term.

In practical applications, words of different parts of speech usually differ in how likely they are to serve as topic words. In most cases a noun is more likely than a verb to represent a topic, so the part of speech of a candidate topic word can also be taken into account to better determine its topic importance. For example, the initial importance of the candidate topic word may be determined according to the second word frequency and the document frequency corresponding to the candidate topic word, and the topic importance may then be obtained by adjusting the initial importance up or down according to the part of speech of the candidate topic word. Optionally, if the candidate topic word is a noun, the initial importance may be increased; for example, a weight greater than 1 may be preconfigured and used to weight the initial importance of a noun to obtain its topic importance. Of course, the same purpose can also be achieved by reducing the initial importance of candidate topic words that are not nouns.

Optionally, for a candidate topic word, determining the topic importance of the candidate topic word according to the part of speech and the initial importance of the candidate topic word may include:

if the part of speech of the candidate topic word is a noun, determining the noun type of the candidate topic word, wherein the noun type is a proper noun or a common noun;

and increasing the initial importance of the candidate topic word according to its noun type to obtain the topic importance of the candidate topic word, wherein the initial importance of a proper noun is increased by more than that of a common noun.

This alternative further takes into account that, in practical applications, a proper noun is more likely to represent a topic than a common noun, and a noun is more likely to represent a topic than a verb. By further distinguishing noun types, the importance of the candidate topic words is adjusted in a more targeted way, so that the word segmentation result can be ranked by topic importance in a manner that better matches actual requirements. For example, one weight may be configured for proper nouns and another for common nouns, where the weight for proper nouns is greater than the weight for common nouns, and the initial importance is adjusted by weighting it with the corresponding weight. Alternatively, a first weight corresponding to nouns and a second weight corresponding to proper nouns may be configured: if a candidate topic word is a noun, its initial importance is weighted by the first weight, and if the candidate topic word is also a proper noun, the result is weighted again by the second weight.

In addition, as described above, the candidate topic words of a position category may include at least one of domain words or common words. A domain word is distinguished from a common word in that it occurs more frequently in the resumes of one position category. As an alternative, therefore, in practical applications the initial importance of any domain word may be no less than the maximum initial importance of all common words.

After the topic importance of each candidate topic word of each position category is determined, the topic words of each position category can be selected from its candidate topic words according to topic importance. For example, for each position category, a certain number of top-ranked candidate topic words, in descending order of topic importance, are taken as the topic words of the position category, or the candidate topic words whose topic importance exceeds a certain threshold are determined as the topic words of the position category.

In an optional embodiment of the present application, the keyword of each post category in the post category feature word library may be obtained by the following method:

extracting resume data of each positive sample resume, and performing word segmentation processing on the resume data to obtain each word segment contained in each positive sample resume;

Optionally, for each position category, the category distinguishing capability of a word segment for the position category may be the product of a third word frequency of the word segment for the position category and the ratio of the word segment for the position category. The third word frequency in this alternative is the same as the first word frequency in the other alternative embodiments described above, i.e., the frequency with which a word appears in the positive sample resumes of a given position category.

The category distinguishing capability of a word segment for a position category can also be characterized by the TF-IDF (term frequency-inverse document frequency) of the word segment with respect to the positive sample resume set of the position category. The alternative provided by the embodiments of the present application is an improvement intended to measure more accurately how well a word distinguishes a given position category: when the IDF of a word is calculated directly in the existing manner, the number of samples in which the word appears within the resumes of the position category, that is, how many of the positive sample resumes of the position category contain the word, is not taken into account.

Optionally, after the category distinguishing capability of each word segment for each position category is determined, for each position category, a certain number of top-ranked word segments, in descending order of category distinguishing capability for the position category, may be taken as the keywords of the position category, or the word segments whose category distinguishing capability exceeds a certain threshold may be determined as the keywords of the position category.

It can be understood that, in practical applications, the steps of determining the topic words and determining the keywords need not follow a fixed order: they may be performed independently, or one may be performed before the other. If they are performed in sequence, for example the topic words first and the keywords second, then after the topic words of each position category are determined, the keywords of each position category may be determined from the word segments other than the topic words of that position category, so as to reduce the amount of data processing. In addition, when different alternative embodiments are used in combination and they include the same processing step, that step needs to be performed only once. For example, both the determination of topic words and the determination of keywords involve word segmentation and determining the word frequency of each word segment for each position category, and these steps may be performed only once.

In an optional embodiment of the present application, for each post feature word, the method further comprises:

acquiring a word feature vector of the post feature word and a character feature vector of each character in the post feature word;

for each character of the post feature words, a character feature vector of the character and a word feature vector of the post feature words are fused to obtain a mixed feature vector of the character;

the above obtaining, for each resume in the resume set to be processed, the resume feature vector of the resume according to the post feature word included in the resume, includes:

and obtaining the resume feature vector of the resume according to the mixed feature vector of each character in the post feature words contained in the resume.

For example, for Chinese, a character may refer to a single Chinese character, and for English, a character may refer to a letter in an English word.

In the embodiment of the application, for each position feature word in the position category feature word library, features can be extracted at both word granularity and character granularity to obtain feature representations of two different granularities for the word, namely the character feature vector of each character in the position feature word and the word feature vector of the feature word. Then, for each character, a mixed feature vector with better feature expression capability can be obtained by concatenating the character feature vector of the character with the word feature vector of the feature word to which the character belongs.

Correspondingly, when the resume feature vector of each resume is obtained according to the position feature words contained in each resume, the resume feature vector with better feature expression capability can be obtained based on the mixed feature vector of each character in each position feature word contained in each resume.

In practical application, the embodiment of the present application does not limit the specific manner of fusing, for each character, the character feature vector of the character and the word feature vector of the position feature word to which the character belongs. For example, the character feature vector and the corresponding word feature vector may be added (i.e., the feature values at corresponding positions of the two vectors are added) or concatenated to obtain the mixed feature vector of the character. When addition is used, the dimension of the character feature vector and the dimension of the word feature vector must be the same, i.e., the two vectors must have the same length. When concatenation is used, the two dimensions may be the same or different.
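The two fusion modes can be sketched as follows. This is a minimal illustration using plain Python lists; the function name `fuse` and its `mode` parameter are hypothetical and are not taken from the application.

```python
from typing import List

Vector = List[float]


def fuse(char_vec: Vector, word_vec: Vector, mode: str = "concat") -> Vector:
    """Fuse a character vector with the vector of the word it belongs to."""
    if mode == "add":
        # Element-wise addition only makes sense for equal dimensions.
        if len(char_vec) != len(word_vec):
            raise ValueError("addition requires vectors of the same dimension")
        return [c + w for c, w in zip(char_vec, word_vec)]
    if mode == "concat":
        # Concatenation works for equal or different dimensions.
        return list(char_vec) + list(word_vec)
    raise ValueError(f"unknown fusion mode: {mode}")


added = fuse([1.0, 2.0], [0.5, 0.5], mode="add")
spliced = fuse([1.0, 2.0], [0.5], mode="concat")
```

Addition keeps the dimension unchanged, whereas concatenation yields a vector whose length is the sum of the two input lengths.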

In practical application, as an alternative, word2vec (a family of models for generating word vectors) can be used to train, on a training corpus, a character-based character vector model and a word-based word vector model respectively. After the trained character vector model and word vector model are obtained, for each position feature word contained in a resume, the character vector of each character and the word feature vector of the word can be obtained through the two models, and the mixed feature vector of each character is then obtained by concatenation.

As an example, Fig. 2 shows a schematic diagram of a method for obtaining the mixed feature vector of a character provided in an embodiment of the present application. Take the word "简历" (resume): splitting it into characters yields the two characters "简" and "历". The word feature vector of "简历" (the word vector shown in the figure) can be obtained through the word vector model, and the character feature vectors of "简" and "历" (the character vectors shown in the figure) can be obtained through the character vector model. The mixed feature vector of "简" is obtained by concatenating the character vector of "简" with the word vector of "简历", and the mixed feature vector of "历" is obtained by concatenating the character vector of "历" with the word vector of "简历".

In practice, to obtain word feature vectors equal in number to the character feature vectors, each word vector needs to be encoded repeatedly (that is, reused), with the number of repetitions equal to the number of characters in the word. In the above example, the word vector of "简历" is used twice, fused once with the character vector of "简" and once with the character vector of "历".
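The reuse of the word vector for every character can be sketched as below. The lookup tables stand in for trained character and word vector models; their contents, the vector dimensions, and the function name `mixed_vectors` are assumptions made only for illustration, and the two-character word "简历" (resume) is assumed as the example word.

```python
from typing import Dict, List

Vector = List[float]


def mixed_vectors(word: str,
                  char_model: Dict[str, Vector],
                  word_model: Dict[str, Vector]) -> List[Vector]:
    """Return one mixed vector per character of `word` (character vector + word vector)."""
    word_vec = word_model[word]
    # The same word vector is reused once for each character of the word.
    return [list(char_model[ch]) + list(word_vec) for ch in word]


# Toy lookup tables standing in for trained models (values are made up).
char_model = {"简": [0.1, 0.2], "历": [0.3, 0.4]}
word_model = {"简历": [1.0, 2.0, 3.0]}

mixed = mixed_vectors("简历", char_model, word_model)
```

Each character of the word receives its own mixed vector, and all of them share the same word-level part.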

After the mixed feature vector of each character of each position feature word is obtained, the resume feature vector of a resume can be obtained based on the mixed feature vectors of the characters in the position feature words contained in the resume, for example by fusing these mixed feature vectors. The fusion may include, but is not limited to, addition (adding the feature values at corresponding positions of the vectors), averaging after addition, and the like. Optionally, the mixed feature vectors of the characters of each position feature word are first fused into a word feature vector of that feature word, and further feature extraction is then performed on the word feature vectors of the position feature words to obtain the resume feature vector.

After the resume feature vectors of each resume are obtained, the similarity between the resumes can be obtained by calculating the similarity between the resume feature vectors of different resumes, and the clustering processing of all the resumes in the resume set to be processed can be realized based on the calculated similarity between different resumes.
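A minimal sketch of one possible pipeline follows: mixed feature vectors are averaged into a resume feature vector, and two resumes are then compared. Averaging is one of the fusion options named in the description; the use of cosine similarity is an assumption, since the application does not fix a particular similarity measure, and all names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Fuse equally sized vectors by adding them and dividing by their count."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical mixed feature vectors of the characters found in two resumes.
resume_a = average([[1.0, 3.0], [3.0, 5.0]])
resume_b = average([[2.0, 4.0], [2.0, 4.0]])
similarity = cosine(resume_a, resume_b)
```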

In an optional example of the present application, based on similarity between the resume feature vectors of the resumes in the resume set to be processed, clustering the resumes in the resume set to be processed to obtain a plurality of clustering categories, which may include:

determining similarity between the resumes based on the resume feature vectors of the resumes;

constructing a graph based on the resumes and the similarity between the resumes, wherein each resume is a node in the graph, a connecting edge exists between two nodes if the similarity corresponding to the two nodes is greater than or equal to a set threshold, and the similarity corresponding to the two nodes is used as the weight of the connecting edge;

determining transition probabilities between nodes with connected edges based on the weights of the connected edges in the graph;

and based on the transition probability among the nodes in the graph, dividing the nodes in the graph into a plurality of cluster categories by adopting a clustering mode based on random walk.

This optional embodiment provides a graph-based clustering scheme in which the resumes are the nodes of a graph and whether a connecting edge exists between two nodes is determined by the similarity between the corresponding resumes. The transition probability between two nodes in the graph can be understood as the probability of moving from one of the two nodes to the other. The greater the weight of an edge, the stronger the relationship between the two nodes and the greater the transition probability between them. For two nodes a and b connected by an edge, the transition probability between node a and node b may be determined based on the weight w1 of the connecting edge between node a and node b and the sum w2 of the weights of all connecting edges of node b, and may specifically be the ratio of w1 to w2.
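The graph construction and the transition probabilities can be sketched as follows. The data layout, function names, and sample similarities are hypothetical. The sketch also assumes that the ratio w1/w2, normalized by the total edge weight of node b, is read as the probability of stepping from b to a, which is the usual random-walk convention but is not stated explicitly in the description.

```python
from typing import Dict, Tuple

Graph = Dict[str, Dict[str, float]]


def build_graph(similarity: Dict[Tuple[str, str], float], threshold: float) -> Graph:
    """Connect two resumes when their similarity reaches the threshold."""
    graph: Graph = {}
    for (a, b), sim in similarity.items():
        if sim >= threshold:
            graph.setdefault(a, {})[b] = sim  # similarity is the edge weight
            graph.setdefault(b, {})[a] = sim
    return graph


def transition_probabilities(graph: Graph) -> Graph:
    """probs[b][a] = w(a, b) / sum of the weights of all edges of b."""
    probs: Graph = {}
    for node, neighbours in graph.items():
        total = sum(neighbours.values())
        probs[node] = {nbr: w / total for nbr, w in neighbours.items()}
    return probs


# Hypothetical pairwise similarities between three resumes.
similarity = {("r1", "r2"): 0.9, ("r1", "r3"): 0.6, ("r2", "r3"): 0.2}
graph = build_graph(similarity, threshold=0.5)
probs = transition_probabilities(graph)
```

With these values, the pair (r2, r3) falls below the threshold and gets no edge, and the outgoing probabilities of each node sum to 1.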

After the transition probabilities between the nodes are determined, a clustering method based on random walks can be used to divide the nodes in the graph into a plurality of cluster categories, such that nodes belonging to the same cluster category are highly related and nodes belonging to different cluster categories are weakly related. The embodiment of the present application does not limit which random-walk-based clustering method is used in an actual implementation. For example, a clustering method based on the InfoMap algorithm (also known as the map equation method) may be used, whereby a graph network model is constructed and community division is performed on the resumes (i.e., the nodes in the graph) in the resume set to be processed.

In an optional embodiment of the present application, the method may further include:

determining the number of resumes of at least one new resume to be processed;

if the number of resumes is less than a set number, acquiring the resume feature vector of each new resume to be processed and the category feature vector of each cluster category; for each new resume to be processed, determining the cluster category of the new resume according to the similarity between its resume feature vector and each category feature vector, and determining the target position category corresponding to that cluster category as the position category of the new resume to be processed;

if the number of resumes is greater than or equal to the set number, constructing a new resume set to be processed based on the at least one new resume to be processed, clustering the resumes in the new resume set to be processed to obtain a plurality of cluster categories, and determining the target position category of each cluster category according to the proportion of the resumes of each known position category among the resumes belonging to that cluster category; and determining the target position category corresponding to the cluster category to which each new resume to be processed belongs as the position category of the new resume to be processed.

Based on this scheme, once resume clustering has been performed at least once on the resume set to be processed and the target position category corresponding to each cluster category has been determined, new resumes of unknown position category that are acquired later can be handled differently depending on how many there are. If the number is large, the method shown in Fig. 1 can be executed again to ensure the effect; in this case the new resume set to be processed includes the new resumes to be processed and a plurality of resumes of known position categories (which may include the resumes described above). If the number is small, the cluster category to which a resume belongs can be determined according to the distance (i.e., the similarity) between the resume and the cluster center of each cluster category, and the position category of the resume is predicted from the cluster category to which it belongs. This strikes a better balance between saving computing resources and ensuring the processing effect.
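The two branches can be sketched as follows. All names are hypothetical: `centroids` stands for the category feature vectors (cluster centers), `cluster_to_position` for the previously determined target position category of each cluster, and `recluster` for a caller-supplied function that reruns the full clustering method. Cosine similarity is assumed as the similarity measure.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_positions(new_vectors: List[Vector],
                     centroids: Dict[int, Vector],
                     cluster_to_position: Dict[int, str],
                     set_number: int,
                     recluster: Callable[[List[Vector]], List[str]]) -> List[str]:
    """Small batches use the nearest cluster center; large batches trigger re-clustering."""
    if len(new_vectors) >= set_number:
        return recluster(new_vectors)
    positions = []
    for vec in new_vectors:
        nearest = max(centroids, key=lambda c: cosine(vec, centroids[c]))
        positions.append(cluster_to_position[nearest])
    return positions


centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
cluster_to_position = {0: "engineer", 1: "designer"}
batch = [[0.9, 0.1], [0.2, 0.8]]
stub = lambda vectors: ["reclustered"] * len(vectors)  # placeholder for a full rerun

small = assign_positions(batch, centroids, cluster_to_position, 10, stub)
large = assign_positions(batch, centroids, cluster_to_position, 2, stub)
```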

The resume processing method provided by the embodiment of the application can effectively improve resume processing efficiency and better meet practical application requirements. The method can be applied to scenarios involving resume screening and the construction of talent matching systems, including but not limited to human resource system construction, talent screening for target positions, and headhunting for target positions. For example, in human resource system construction, an intelligent resume screening system built according to the method provided by the embodiment of the application can greatly improve a company's recruitment efficiency and reduce manual screening effort and interview costs. In headhunting for target positions, the method can be used to identify suitable candidates for a target position and to pursue them in a targeted way, maximizing the return-to-cost ratio.

In practical application, according to actual needs, periodically or when obtaining no less than a certain number of resumes of known position categories and resumes of unknown position categories, the resume set to be processed can be updated, the target position category corresponding to each clustering center is re-determined through re-clustering, and the position category of the resumes of the unknown position categories is determined based on the result at this time.

Compared with the prior art, the scheme provided by the embodiment of the application has at least the following advantages:

(1) According to the embodiment of the application, an association is established between the known qualified sample library (i.e., the resumes of known position categories) and the unknown samples (i.e., the resumes of unknown position categories), so that the position category of an unknown sample can be determined according to the preset rule and the known qualified sample library, which provides technical support for resume-based position category screening.

(2) The method for constructing the position category feature word library provided by the embodiment of the application makes it possible to accurately extract a larger number of position feature words for the different position categories, so that resume feature vectors with better feature expression capability can be obtained based on the word library, providing a good basis for the subsequent clustering and further processing of resumes.

(3) In the alternative provided by the application, a graph and the transition probabilities between its nodes are constructed, sequences are generated by random walks on the graph, the sequences are then hierarchically coded, and the entropy is minimized to achieve an optimized average coding length.

(4) According to the method provided by the embodiment of the application, no expert prior knowledge base needs to be constructed, and an automatic intelligent resume screening system requiring no manual participation can be built, so the method is well suited to industrial use.

In order to describe the solutions provided in the present application more systematically, the solutions are described in detail below with reference to a specific embodiment.

Fig. 3 shows a flow diagram of the resume processing method provided in this embodiment. As shown in Fig. 3, this embodiment mainly includes several parts: dividing the positive and negative sample resume sets of each position according to a preset rule, constructing a scoring database, extracting and vectorizing resume text features, constructing a position-resume distribution graph based on the InfoMap algorithm, and screening resumes based on the positive sample library and the preset rule. These parts are described separately below.

Step 1, dividing positive and negative samples of posts according to preset rules

Optionally, each position may be defined as a category, and the positive and negative sample sets of each position category are divided according to preset rules based on the resumes of former and current employees. The rules for screening positive and negative samples can be set flexibly according to the needs of the scenario. A positive sample is a positive sample resume as described above, which may also be referred to as a qualified sample resume or a qualified resume text, and a negative sample is an unqualified resume. The manner of obtaining the positive and negative samples of each position category is described above and is not repeated here.

Step 2, constructing a scoring database

The positive samples of each position category obtained in step 1 are taken as the qualified resumes of that position category, and a <position category, qualified resume> scoring database (high-score database) is constructed, i.e., the positive sample resume sets of the position categories described above.

Optionally, each position category is marked with a category identifier, i.e., a category id, such as 0, 1, 2, etc. The qualified resume texts under each position category constitute the high-score database, and negative samples are not included in the scoring database. As an example, the <position category, qualified resume> scoring database may be constructed with the structure shown in Table 1 below:

TABLE 1

As shown in Table 1, samples 1 through 3 are all positive sample resumes under the position category with position category identification 1, and samples 5 through 7 are positive sample resumes under the position category with position category identification 2.

Step 3, extracting and vectorizing resume text features

The resume text features are the position feature words described above. Based on the positive samples of each position category obtained in step 2, this step extracts the position feature words of each position category, including topic words, keywords and named entities, and the mixed feature vector of each character in the position feature words can be obtained by mixed coding. A schematic flow chart of an alternative embodiment of this step is shown in Fig. 4. As shown in Fig. 4, this step may include the following steps:

Step 31: partitioning each positive sample in the scoring database into blocks to obtain the information modules of each resume;

A resume text differs significantly from other texts in that it has a hierarchical, modular structure. The modules generally include: basic personal information, job-seeking intention, education experience, work experience (project experience), self-evaluation, professional skills, awards, and the like. The resume can be partitioned by regular-expression matching on the module names and keywords in the resume to obtain the information modules of each resume.
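A minimal sketch of such partitioning is given below. It assumes that each module starts with a heading line consisting only of the module name; the English heading names, the sample resume, and the function name `split_modules` are hypothetical, and a real resume would need a broader set of heading patterns.

```python
import re
from typing import Dict

# Hypothetical module headings; a real system would list many name variants.
MODULE_NAMES = ["Basic Information", "Job Intention", "Education",
                "Work Experience", "Self-Evaluation", "Skills", "Awards"]

HEADING_RE = re.compile(
    r"^(%s)[ \t]*:?[ \t]*$" % "|".join(re.escape(name) for name in MODULE_NAMES),
    re.MULTILINE,
)


def split_modules(text: str) -> Dict[str, str]:
    """Map each recognised heading to the text that follows it."""
    modules: Dict[str, str] = {}
    matches = list(HEADING_RE.finditer(text))
    for i, match in enumerate(matches):
        # A module runs until the next heading or the end of the resume.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        modules[match.group(1)] = text[match.end():end].strip()
    return modules


resume = """Basic Information
Name: Zhang San
Education
2015-2019 Example University, Computer Science
Work Experience
2019-2021 Backend engineer, built recommendation services
"""
modules = split_modules(resume)
```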

Step 32: acquiring resume data of a target information module of each resume;

Optionally, the data of information modules that are not significantly related to the factors influencing the position matching degree are filtered out to obtain the resume data of the target information modules of each resume. The modules that are not significantly related may include the basic information module (name, gender, birthday, address, mobile phone number, mailbox, etc.), the job-seeking intention module (intended position, expected salary, etc.), and the like.

Step 33: filtering the time information of each module based on regular-expression matching;
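Steps 32 and 33 can be sketched together as follows. The module names, the date pattern, and the function names are assumptions made for illustration; the pattern only covers a few common year and year-month formats and would also remove other four-digit numbers.

```python
import re
from typing import Dict

# Hypothetical names of modules treated as unrelated to position matching.
IRRELEVANT_MODULES = {"Basic Information", "Job Intention"}

DATE = r"\d{4}(?:[./]\d{1,2})?"  # e.g. 2019 or 2019.07
TIME_RE = re.compile(
    rf"\b{DATE}(?:\s*(?:-|~|to)\s*(?:{DATE}|present))?\b",
    re.IGNORECASE,
)


def strip_time(text: str) -> str:
    """Remove dates and date ranges, then collapse leftover whitespace."""
    return re.sub(r"\s+", " ", TIME_RE.sub(" ", text)).strip()


def target_module_text(modules: Dict[str, str]) -> Dict[str, str]:
    """Keep only target modules and strip their time information."""
    return {name: strip_time(body)
            for name, body in modules.items()
            if name not in IRRELEVANT_MODULES}


cleaned = target_module_text({
    "Basic Information": "Name: Zhang San",
    "Education": "2015-2019 Example University",
    "Work Experience": "2019.07 - 2021.03 Backend engineer",
})
```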

Step 34: extracting the topic words of each position category according to the positive samples in the scoring database;

Topic words can be ranked and extracted based on global features. A resume text generally contains proper nouns, common nouns, verbs, adjectives and the like. In topic mining, a proper noun is more likely to represent a topic than a common noun, and a noun is more likely to represent a topic than a verb. To better rank the word segmentation result by topic importance, this embodiment ranks words by topic importance on the basis of word classification.

First, the words, i.e., the word segments, are divided into three types: domain words, common words, and irrelevant words.

Position category feature statistics and new word discovery are performed separately on the qualified samples of the different position categories in the resume database (i.e., the scoring database) to obtain the domain dictionary of each position category. The domain dictionary is obtained mainly by comparing frequency differences among position categories: if a word has a high frequency in one position category and a low frequency in the others, it is defined as a domain word; if its frequency varies little across position categories, it is defined as an irrelevant word; otherwise it is defined as a common word.

Specifically, based on the scoring database, word segmentation may be performed on each qualified sample obtained in the previous step, i.e., on the resume data of each qualified resume, to obtain the word segments contained in each resume. From the frequency differences of the word segments among the qualified samples of different position categories, the irrelevant words among the word segments can be determined, as can the domain words and common words of each position category, i.e., the candidate topic words.

After the words of each type for each position category are determined, a topic importance weight (i.e., an initial importance) is assigned to each word. Optionally, the calculation is as follows:
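The three-way classification can be sketched as below. The description does not give concrete thresholds, so the values of `high`, `low`, and `flat`, like the function name and the sample frequencies, are purely illustrative assumptions.

```python
from typing import Dict


def classify_word(freqs: Dict[str, float], category: str,
                  high: float = 0.01, low: float = 0.001,
                  flat: float = 0.002) -> str:
    """Classify a word for `category` from its relative frequency per position category.

    The thresholds are illustrative assumptions, not values from the source.
    """
    values = list(freqs.values())
    if max(values) - min(values) <= flat:
        return "irrelevant"  # frequency barely changes across categories
    others = [f for c, f in freqs.items() if c != category]
    if freqs[category] >= high and all(f <= low for f in others):
        return "domain"  # frequent here, rare everywhere else
    return "common"


domain_example = classify_word({"dev": 0.02, "sales": 0.0005, "hr": 0.0}, "dev")
flat_example = classify_word({"dev": 0.010, "sales": 0.011, "hr": 0.0105}, "dev")
common_example = classify_word({"dev": 0.02, "sales": 0.008, "hr": 0.0}, "dev")
```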

The importance of an irrelevant word is 0.

The topic importance weight of a common word may be expressed as:

wherein W_c denotes any common word, tf_c is the frequency with which the common word W_c occurs in all position category texts of the standard library (i.e., the second word frequency), maxdf_k is the largest of the document frequencies with which the common words appear in all position category texts of the standard library (i.e., the second maximum document frequency described above), and df_c is the document frequency with which the common word W_c appears in all position category texts of the standard library.

The topic importance weight of a domain word may be expressed as:

wherein W_f denotes any domain word, maxPW_c is the largest of the importance weights of all common words, tf_f is the frequency with which the domain word W_f occurs in all position category texts of the standard library (i.e., the second word frequency), maxdf_l is the largest of the document frequencies with which the domain words appear in all position category texts of the standard library (i.e., the first maximum document frequency described above), and df_f is the document frequency with which the domain word W_f appears in all position category texts of the standard library.

Optionally, the topic importance weight may be weighted according to different parts of speech.

In this embodiment, a proper-noun weighting coefficient δ_prop(w) is used, where δ_prop(w) is a positive number greater than 1: if the domain word or common word w is a proper noun, its topic importance weight is multiplied by δ_prop(w). A noun weighting coefficient δ_noun(w) is also used, where δ_noun(w) is a positive number greater than 1 and δ_noun(w) < δ_prop(w): if the part of speech of the word w is noun, its weight is multiplied by δ_noun(w). The comprehensive topic weight of a word, i.e., the topic importance formula, can therefore be expressed as:

W_topic(w) = PW × δ_prop(w) × δ_noun(w)

where PW is PW_c or PW_f. Of course, if a domain word or common word is not a noun, W_topic(w) = PW.
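As a non-limiting illustration, the part-of-speech weighting and the subsequent ranking by topic importance might be sketched in Python as follows. The function names, the example coefficients 1.5 and 1.2, the example words and base weights, and the convention that a coefficient is treated as 1 when its condition does not hold are all assumptions made for illustration; the base weight PW is taken as given.

```python
def topic_weight(pw, is_proper_noun, is_noun, delta_prop=1.5, delta_noun=1.2):
    """Scale a base weight PW (PW_c or PW_f) by part-of-speech coefficients.

    Assumption: a coefficient counts as 1 when its condition does not hold.
    """
    if not 1.0 < delta_noun < delta_prop:
        raise ValueError("expected 1 < delta_noun < delta_prop")
    weight = pw
    if is_proper_noun:
        weight *= delta_prop
    if is_noun:
        weight *= delta_noun
    return weight


def top_k_subject_terms(weights, k):
    """Return the k words with the highest topic weight (ties broken alphabetically)."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


# Hypothetical candidate words sharing the same base weight PW = 0.8.
example_weights = {
    "TensorFlow": topic_weight(0.8, is_proper_noun=True, is_noun=True),
    "algorithm": topic_weight(0.8, is_proper_noun=False, is_noun=True),
    "quickly": topic_weight(0.8, is_proper_noun=False, is_noun=False),
}
subject_terms = top_k_subject_terms(example_weights, k=2)
```

With equal base weights, the ranking in this sketch is driven entirely by part of speech, which is the effect the weighting is intended to have.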

By ranking the words according to their topic importance, the top K₁ (TOPK₁) candidate subject terms of each position category can be obtained and used as the topic labels of that position category, i.e., the subject terms of the position category.

Step 35: extracting keywords of each post category according to positive samples in a grading database;

Based on the word segmentation results of the qualified resume samples, keywords of the different position categories can be mined using TF-IDF to construct a position category keyword library. The purpose of using TF-IDF to construct category keywords is as follows: if a word appears frequently in the texts of a certain category and rarely in the texts of other categories, the word discriminates well for that category.

In the embodiment of the application, TF-IDF represents the distinguishing capability of post category characteristic words in each classification category, and the application provides an improved TF-IDF calculation method, wherein the calculation formula is as follows:

TF-IDF = word frequency (TF) x Inverse Document Frequency (IDF)

The TF-IDF of a word for a position category measures the word's ability to distinguish that category. If a word appears frequently in the resumes of a certain position category and rarely in the resumes of other position categories, the word is a distinguishing keyword of that category. For each position category, the words can be sorted in descending order of their TF-IDF for that category, and the top K₂ (TOPK₂) words are used as the keyword tags of the position category, i.e., the keywords of the position category.
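As a non-limiting illustration, keyword selection by descending TF-IDF might be sketched as follows. The improved TF-IDF formula of the application is not reproduced here; this sketch substitutes a plain textbook TF-IDF in which each position category is treated as one document, and the function name and data layout are assumptions made for illustration.

```python
import math
from collections import Counter


def category_keywords(tokens_by_category, top_k):
    """Pick the top_k words per category by a plain (not the improved) TF-IDF.

    tokens_by_category maps a category name to a list of tokenized resumes.
    """
    # Term frequency of each word within each category.
    tf = {}
    for category, resumes in tokens_by_category.items():
        counts = Counter(token for resume in resumes for token in resume)
        total = sum(counts.values())
        tf[category] = {word: count / total for word, count in counts.items()}

    # Number of categories in which each word occurs.
    n_categories = len(tf)
    df = Counter(word for cat_tf in tf.values() for word in cat_tf)

    keywords = {}
    for category, cat_tf in tf.items():
        scores = {w: f * math.log(n_categories / df[w]) for w, f in cat_tf.items()}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[category] = ranked[:top_k]
    return keywords
```

A word that occurs in every category receives a score of zero under this formula, so it is ranked behind words that are specific to one category.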

Step 36: performing named entity recognition for each position category;

Named entities appearing in resume text often play an important role in screening and distinguishing positions. Therefore, entity recognition can be performed on the resume data in the qualified samples of each position category using a named entity recognition (NER) tool to obtain the named entities in the resume text.

As an example, assuming that a resume category is a machine learning related position category, the topic words, keywords, and named entities extracted based on the above scheme provided in the embodiment of the present application include: artificial intelligence, machine learning, data mining, AI, big data, data analysis, deep learning, neural networks, language models, python, SCI articles, top meetings, and the like.

Step 37: vectorizing and coding the text features (namely, the post feature words, namely, the extracted subject words, keywords and named entities of each post category) preprocessed in the above steps to obtain a mixed feature vector of each word contained in each text feature.

For a specific implementation of obtaining the mixed feature vector of each word included in each post feature word, reference may be made to the foregoing description, and the description is not repeated here.

Step 4, constructing a post-resume distribution diagram model based on the InfoMap algorithm

Each resume is mapped to a point in space based on the qualified resume sample library of each position category constructed in step 2 and the position feature words constructed in step 3.

In this step, the input of the model is the feature vector of each resume, which may optionally be derived from the feature vectors of the position feature words contained in the resume. Specifically, the feature vector of a position feature word may be obtained by fusing the mixed feature vectors of the characters it contains, for example by adding them together, and the feature vector of the resume is obtained by fusing the feature vectors of the position feature words contained in the resume.

As an example, suppose the position feature words contained in a resume include "artificial intelligence" and "machine learning". The encoding vector of "artificial intelligence" can be obtained from the mixed feature vectors of its characters obtained in step 3, and the encoding vector of "machine learning" can be obtained in the same way; for example, the encoding of "artificial intelligence" may be [0.255, -1.41, ...] and the encoding of "machine learning" may be [5.472, 6.2109, ...]. The encoding vector of the resume, i.e., the resume feature vector, can then be obtained from the encoding vectors of these two feature words.
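As a non-limiting illustration, the two fusion stages might be sketched as follows. Adding the mixed character vectors to obtain a word vector follows the example given in the description; averaging the word vectors to obtain the resume vector is only one possible fusion and is an assumption here, as are the function names and the toy two-dimensional vectors.

```python
def add_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def word_vector(mixed_char_vectors):
    """Feature vector of one position feature word: sum of its characters' mixed vectors."""
    return add_vectors(mixed_char_vectors)


def resume_vector(mixed_vectors_by_word):
    """Resume feature vector: mean of the word vectors (averaging is an assumed fusion)."""
    word_vectors = [word_vector(chars) for chars in mixed_vectors_by_word.values()]
    summed = add_vectors(word_vectors)
    return [value / len(word_vectors) for value in summed]


# Toy mixed feature vectors, one per character of each feature word (hypothetical values).
example = {
    "AI": [[1.0, 0.0], [0.0, 1.0]],
    "ML": [[2.0, 2.0], [0.0, 2.0]],
}
example_resume_vector = resume_vector(example)
```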

After the feature vector of each resume is obtained, a graph network model can be constructed based on the InfoMap algorithm to perform community division, i.e., clustering, of the resumes. In the graph, nodes represent resumes and edges represent the relationships between resumes. The strength (or closeness) of a relationship can be expressed by assigning a weight to each edge: the larger the weight, the stronger the relationship, i.e., the closer the two resumes.

The InfoMap algorithm performs a random walk on the graph by constructing transition probabilities between the nodes of the graph structure. Based on the transition probabilities between nodes and the occurrence probability of each node in the graph structure (obtained through continuous iterative optimization of an objective function), the walk starts from some point j (i.e., a node in the graph), jumps to a next point i according to the transition probabilities between that node and the other nodes, then jumps from point i to the next point according to the transition probabilities, and so on. The random walk generates sequences, Huffman codes are constructed according to the probabilities of the walk, and the sequences are then hierarchically encoded as follows: a category mark is inserted before the resume features (i.e., nodes) of the same position category, and a termination mark is inserted at the end of the category. The category marks are represented by one separate set of codes, such as 000, 001 and 002. The resume features within a category and the termination mark are represented by another set of codes, and because the category marks already distinguish the categories, the resume features of different position categories can reuse the same set of codes, such as 000, 001, 010, 011 and 100. Resume clustering is performed by minimizing the total shortest average code length.

The following describes a process for implementing resume clustering by constructing a resume-belonging position graph network model based on an InfoMap algorithm in the optional embodiment of the present application, and may include the following steps:

step 41, calculating to obtain the transition probability among the resume text characteristics;

The similarity between each node and every other node is calculated, i.e., the similarity between resumes. A relationship similarity threshold S is set (assumed here to be 0.7), the nodes corresponding to two resumes whose text feature similarity is higher than the threshold are connected by an edge, and the edge weights are normalized to serve as the transition probabilities.

Step 42, initializing the occurrence probability of each node;

Step 43, at initialization, treating each node as an independent cluster category; based on the transition probabilities between nodes and the initialized occurrence probability of each node, performing continuous iterative optimization with the goal of minimizing the total shortest average code length corresponding to the nodes in the graph; and, when a set condition is met, completing the cluster division of all nodes in the graph to obtain a plurality of clusters (one cluster is one cluster category), each of which contains a plurality of nodes.
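As a non-limiting illustration, step 41 might be sketched as follows. The similarity measure is not fixed by the steps above, so cosine similarity is assumed here; the function names are likewise assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def build_transitions(vectors, threshold=0.7):
    """Connect resumes whose similarity exceeds the threshold and normalize edge weights.

    Returns {node: {neighbor: transition probability}}; isolated nodes map to {}.
    """
    n = len(vectors)
    weights = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            similarity = cosine(vectors[i], vectors[j])
            if similarity > threshold:
                weights[i][j] = similarity
                weights[j][i] = similarity

    transitions = {}
    for node, neighbors in weights.items():
        total = sum(neighbors.values())
        transitions[node] = {j: w / total for j, w in neighbors.items()}
    return transitions
```

Normalizing per node means the outgoing probabilities of every connected node sum to 1, so the result can be used directly to drive a random walk.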

As an example, suppose that the resume feature vectors of two resumes are resume text feature α and resume text feature β, respectively. If the similarity between resume text feature α and resume text feature β is greater than the threshold S, a connecting edge exists between the nodes corresponding to the two resumes, and the transition probability p_α→β between those nodes is obtained by normalizing the similarity on the connecting edge. At initialization, all nodes may be given a uniform access probability, i.e., occurrence probability. Suppose the occurrence probability of resume text feature α is p_α, the occurrence probability of resume text feature β is p_β, and the crossing probability (also called the jump probability) is τ, an additional hyperparameter with a value less than 1 that may be set empirically. If the crossing probability is taken into account, the following relation holds:

Where n represents the number of all nodes.

Assuming that the position category to which resume text feature α belongs is category i, the category probability of category i (i.e., the probability that the random walk jumps from category i to another category) can be expressed as:

If the crossing probability is considered, p_α→β needs to be changed to:

Because hierarchical coding is used, the categories and the objects within the categories (i.e., nodes/resumes) are coded differently, so the shortest average code length must be calculated separately for the categories and for the objects within each category. The shortest average code length H(Q) of the categories is expressed as follows:

wherein:

The shortest average code length H(P^i) of the objects within each category (taking category i as an example) is expressed as follows:

wherein:

The shortest average code length of the categories and the shortest average code lengths of the objects within each category are weighted and averaged to obtain the total shortest average code length L(M), which is expressed as follows:

L(M) is taken as the objective function and minimized through continuous iterative optimization until L(M) can no longer be reduced, at which point the clustering of all nodes is complete and the nodes have been divided into a plurality of clusters.
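As a non-limiting illustration, the objective L(M) might be evaluated for a given division of the nodes as follows. The sketch uses the standard two-level map equation from the InfoMap literature, without the crossing probability τ, and it is an assumption that the expressions of this embodiment take the same form. The function names, the toy graph, and the use of strength-proportional occurrence probabilities (the stationary distribution of a random walk on an undirected weighted graph) are also assumptions.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def map_equation(visit, transitions, modules):
    """Total shortest average code length L(M) for a division of nodes into modules.

    visit: {node: occurrence probability}; transitions: {node: {neighbor: probability}};
    modules: list of node lists covering every node exactly once.
    """
    module_of = {node: m for m, nodes in enumerate(modules) for node in nodes}
    # Exit probability of each module: flow from its nodes to nodes outside it.
    exits = [
        sum(visit[a] * p for a in nodes for b, p in transitions[a].items()
            if module_of[b] != m)
        for m, nodes in enumerate(modules)
    ]
    q_total = sum(exits)
    # Category-level term: q * H(Q).
    length = q_total * entropy([q / q_total for q in exits]) if q_total > 0 else 0.0
    # Within-category terms: p_i * H(P^i), where the exit code shares the codebook.
    for q_i, nodes in zip(exits, modules):
        p_i = q_i + sum(visit[a] for a in nodes)
        if p_i > 0:
            length += p_i * entropy([q_i / p_i] + [visit[a] / p_i for a in nodes])
    return length


# Toy undirected graph: two tightly linked pairs joined by one weak edge.
edge_weights = {0: {1: 1.0}, 1: {0: 1.0, 2: 0.1}, 2: {1: 0.1, 3: 1.0}, 3: {2: 1.0}}
strength = {n: sum(nbrs.values()) for n, nbrs in edge_weights.items()}
total_strength = sum(strength.values())
visit = {n: s / total_strength for n, s in strength.items()}
transitions = {n: {m: w / strength[n] for m, w in nbrs.items()}
               for n, nbrs in edge_weights.items()}

one_cluster = map_equation(visit, transitions, [[0, 1, 2, 3]])
two_clusters = map_equation(visit, transitions, [[0, 1], [2, 3]])
singletons = map_equation(visit, transitions, [[0], [1], [2], [3]])
```

On this toy graph the two-cluster division yields a shorter code length than either a single cluster or four singleton clusters, which is why an optimizer minimizing L(M) would prefer it.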

Through step 4, the qualified resumes of the position categories are grouped into clusters, with each cluster representing a different position category.

As an example, fig. 5 shows a simple clustering result diagram, in this example, all nodes in the graph structure shown in fig. 5 may be divided into three clusters, as shown by three dashed boxes in the graph, a node in each dashed box is a node belonging to the cluster, and resumes corresponding to nodes belonging to the same cluster are divided into a clustering class.

It is understood that in practical applications, other clustering manners may be adopted to perform the division of the resume. Nodes included in different clusters obtained by clustering may have overlapped nodes (i.e., a node may belong to multiple clusters at the same time), or may have non-overlapped nodes (as shown in fig. 5).

Step 5, screening the resumes based on the positive sample library and a preset rule.

A positive sample library is constructed (i.e., the first resumes in the resume set to be processed; in this embodiment, the first resumes may be the positive samples in the grading database). According to the distribution of the qualified resumes of each position category across the communities (i.e., the clusters/cluster categories), the proportion of resumes of each known position category in each community is counted. If the proportion of qualified resumes of a certain position category in a community exceeds a preset threshold, the resumes of unknown category belonging to that community are predicted to be qualified resumes of that position category.

As an example, the distribution of qualified resumes of known position categories within each cluster obtained in step 4 is counted. If, for instance, qualified resumes of the position category "data analysis" make up 80% of a certain cluster and the preset threshold is 65%, the target position category corresponding to that cluster is "data analysis", and the resumes of unknown category in the cluster are predicted to be qualified resumes for the position "data analysis".
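As a non-limiting illustration, the proportion rule might be sketched as follows. The function name and data layout are assumptions made for illustration; the proportion is computed here over all resumes in the cluster, and a cluster is labelled when the largest proportion is not less than the threshold.

```python
from collections import Counter


def assign_positions(clusters, known, threshold=0.65):
    """Predict a position category for unknown resumes from the cluster they fall in.

    clusters: list of lists of resume ids; known: {resume id: position category}.
    Returns {unknown resume id: predicted position category}.
    """
    predictions = {}
    for members in clusters:
        counts = Counter(known[r] for r in members if r in known)
        if not counts:
            continue  # no labelled resume in this cluster, nothing to infer
        category, count = counts.most_common(1)[0]
        if count / len(members) >= threshold:
            for resume_id in members:
                if resume_id not in known:
                    predictions[resume_id] = category
    return predictions


# Example: 8 of 10 resumes in the first cluster are known "data analysis" resumes.
example_clusters = [[f"r{i}" for i in range(10)], ["s0", "s1", "s2", "s3", "s4"]]
example_known = {f"r{i}": "data analysis" for i in range(8)}
example_known.update({"s0": "sales", "s1": "sales", "s2": "design", "s3": "design"})
example_predictions = assign_positions(example_clusters, example_known)
```

In the second example cluster no category reaches the threshold, so its unknown resume is left without a predicted category.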

In addition, if there are overlapping nodes between different clusters, when a node corresponding to a resume of an unknown position category is an overlapping node of at least two clusters, a cluster most matched with the resume can be determined further according to the similarity between the resume and the cluster to which the resume belongs, for example, according to the distance between the resume feature vector of the resume and the cluster center of each cluster to which the resume belongs, and the target position category corresponding to the cluster is determined as the position category of the resume. Of course, a plurality of position category labels may also be set for the resume at the same time, that is, the position category of the resume may include the target position categories corresponding to the plurality of clusters to which the resume belongs at the same time.
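As a non-limiting illustration, choosing the best-matching cluster for an overlapping node by distance to the cluster centers might be sketched as follows. Taking the cluster center as the mean of the member vectors and using Euclidean distance are assumptions, as are the function names.

```python
import math


def cluster_center(member_vectors):
    """Center of a cluster, taken here as the mean of its members' feature vectors."""
    count = len(member_vectors)
    return [sum(components) / count for components in zip(*member_vectors)]


def closest_cluster(resume_vector, centers):
    """Return the id of the cluster whose center is nearest to the resume vector.

    centers: {cluster id: center vector} for the clusters the resume belongs to.
    """
    return min(centers, key=lambda c: math.dist(resume_vector, centers[c]))
```

The target position category of the returned cluster would then be used as the position category of the resume.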

To better illustrate the practical value of the scheme provided by the application, a specific application scenario embodiment is described below. The scheme can be applied to talent screening and recommendation scenarios and can be implemented as an application program or as a plug-in of an application program. Through the application program, a job seeker can fill in personal information and publish a resume at the job seeker's client, and a recruiter (such as a company) can publish the requirements for the talent to be recruited at the recruiter's client. The server can then screen suitable resumes according to the recruiter's requirements and recommend them to the recruiter.

Fig. 6 shows a schematic structural diagram of a resume processing system to which the present application is applied, and as shown in fig. 6, the system may include a server 10, a terminal device 21 of a job seeker, and a terminal device 22 of a recruiter, where the terminal device 21 and the terminal device 22 are respectively connected to the server 10 through a network 30. In the following, an optional embodiment of the scheme of the present application is described with reference to the system, and fig. 7 shows a resume processing flow in this scenario embodiment, which may include the following steps:

step 11: the terminal device 21 sends its resume (job-seeking resume shown in the figure) to the server 10 according to the operation of the job seeker at the client;

step 12: the server 10 receives job-seeking resumes sent by the terminal devices 21 of a large number of job seekers and stores the job-seeking resumes in the resume repository 11;

step 13: the terminal device 22 sends a recruitment requirement to the server 10 according to the operation of a recruiter at the client, wherein the recruitment requirement includes a recruitment position category (which may be a specific position or a category) of the recruiter;

step 14: the server 10 builds a resume set to be processed from the large number of resumes stored in the resume library 11 according to the received recruitment requirement, wherein the resume set comprises a large number of qualified resumes with known position categories (referred to as known resumes for short) and resumes with unknown position categories (referred to as unknown resumes for short).

The resume library 11 at the server 10 includes qualified resumes. The source of the qualified resumes is not limited in this application; optionally, in practical applications, the qualified resumes may be obtained by the server 10 from feedback of the recruiter, for example, the recruiter may feed back qualified resume samples to the server 10 through the client on its terminal device.

Step 15: the server 10 determines the post category of the unknown resume in the resume set to be processed, and specifically, may perform clustering on the resume set to be processed by executing the method of any optional embodiment of the present application, divide the resume in the resume set to be processed into a plurality of clustering categories, determine a target post category corresponding to each clustering category, and determine, for the unknown resume, a target post category corresponding to the clustering category to which the unknown resume belongs as the post category corresponding to the unknown resume;

step 16: based on the determined position categories, the server 10 sends the resumes matching the position category in the recruitment requirement to the terminal device 22.

Of course, in practical applications the recruitment requirement generally contains other requirement information besides the designated position category, such as the age range and years of work experience of the applicant. Accordingly, after the resumes matching the designated position category are determined, the resumes meeting these requirements can be further screened according to the resume data in each resume and sent to the terminal device 22.

Based on the same principle as the resume processing method provided in the embodiment of the present application, an embodiment of the present application further provides a resume processing apparatus, as shown in fig. 8, the resume processing apparatus 100 may include a to-be-processed resume obtaining module 110, a resume clustering module 120, and a position category determining module 130, where:

a to-be-processed resume acquisition module 110, configured to acquire a to-be-processed resume set, where the to-be-processed resume set includes a first resume in a plurality of known position categories and a second resume in at least one unknown position category;

the resume clustering module 120 is configured to obtain a resume feature vector of each resume in the resume set to be processed, and cluster the resumes in the resume set to be processed based on similarity between the resume feature vectors of the resumes in the resume set to be processed to obtain a plurality of clustering categories;

the position category determining module 130 is configured to determine a target position category corresponding to each cluster category according to the proportion of the first resumes of each position category among the resumes belonging to that cluster category, determine the target position category corresponding to the cluster category to which each second resume belongs as the position category of that second resume, and process each second resume based on its position category.

acquiring a post category feature word library, wherein the post category feature word library comprises post feature words of a plurality of post categories; and for each resume in the resume set to be processed, determining the position characteristic words contained in the resume according to the position category characteristic word library, and obtaining the resume characteristic vector of the resume according to the position characteristic words contained in the resume.

Optionally, for each resume in the set of resumes to be processed, the resume clustering module may be configured to:

obtaining a resume feature vector of the resume according to the mixed feature vector of each character in the position feature words contained in the resume; the mixed feature vector of each character in each post feature word is obtained by the following method:

acquiring a word feature vector of the post feature word and a character feature vector of each character in the post feature word; and for each character of the post characteristic words, obtaining a mixed characteristic vector of the character by fusing the character characteristic vector of the character and the word characteristic vector of the post characteristic words.

Optionally, the post category feature word library is obtained by extracting a positive sample resume set of each post category in the plurality of post categories, and the positive sample resume set of each post category comprises a plurality of positive sample resumes; the post feature words of each post category include at least one of subject words, keywords, or named entities of the post category.

The subject term acquiring module may be a module included in the resume processing apparatus, or may be a module in another apparatus, and the other apparatus executes the action of acquiring the subject term, so that the acquired subject term can be provided to the resume processing apparatus.

determining the initial importance of the candidate subject term according to the second term frequency and the document frequency corresponding to the candidate subject term; determining the part of speech of the candidate subject word; and determining the topic importance of the candidate subject term according to the part of speech and the initial importance of the candidate subject term.

Optionally, for each candidate subject term, the subject term obtaining module may be configured to, when determining the topic importance of the candidate subject term according to the part of speech and the initial importance of the candidate subject term:

extracting resume data of each positive sample resume, and performing word segmentation processing on the resume data to obtain each word segmentation contained in each positive sample resume; determining a third word frequency of each participle in resume data of the positive sample resume of each post category;

for each word segmentation and each position category, determining the ratio of the number of the resumes of the word segmentation appearing in the resume data of the positive sample resumes of all the position categories to the number of the resumes of the word segmentation appearing in the resume data of the positive sample resumes of the other position categories except the position category;

for each post category, determining the category distinguishing capability of each participle for the post category according to the ratio of the third word frequency of each participle corresponding to the post category of the participle;

Similarly, the keyword obtaining module may be a module included in the resume processing apparatus, or may be a module in another apparatus; in the latter case, the other apparatus performs the action of obtaining the keywords and may provide the obtained keywords to the resume processing apparatus.

constructing a graph based on the resumes and the similarity between the resumes, wherein each resume is a node in the graph, a connecting edge is provided between two nodes if the similarity corresponding to the two nodes is greater than or equal to a set threshold, and the similarity corresponding to the two nodes is used as the weight of the connecting edge;

determining transition probability between nodes with connected edges based on the weight of each connected edge in the graph; and based on the transition probability among the nodes in the graph, dividing the nodes in the graph into a plurality of cluster categories by adopting a clustering mode based on random walk.

Optionally, the resume to be processed acquiring module is further configured to: acquiring at least one new resume to be processed of an unknown position category;

the post category determination module is further configured to:

determining the number of resumes in the at least one new resume to be processed; if the number of resumes is smaller than a set number, acquiring the resume feature vector of each new resume to be processed and the category feature vector of each cluster category, and, for each new resume to be processed, determining the cluster category to which it belongs according to the similarity between its resume feature vector and each category feature vector and determining the target position category corresponding to that cluster category as the position category of the new resume to be processed; if the number of resumes is greater than or equal to the set number, constructing a new resume set to be processed based on the at least one new resume to be processed, clustering the resumes in the new resume set to be processed to obtain a plurality of cluster categories, determining the target position category of each cluster category according to the proportion of resumes of each known position category among the resumes belonging to that cluster category, and determining the target position category corresponding to the cluster category to which each new resume to be processed belongs as the position category of that new resume.
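As a non-limiting illustration, the branching on the number of new resumes might be sketched as follows. The function names, the use of cosine similarity, and the `recluster` callback (standing in for the full clustering and proportion-counting procedure) are hypothetical and introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def classify_new_resumes(new_vectors, category_vectors, cluster_positions,
                         set_number, recluster):
    """Assign position categories to new resumes of unknown category.

    new_vectors: {resume id: feature vector}
    category_vectors: {cluster id: category feature vector}
    cluster_positions: {cluster id: target position category}
    recluster: hypothetical callback that clusters a large batch from scratch
    """
    if len(new_vectors) >= set_number:
        # Large batch: build a new resume set and cluster it again.
        return recluster(new_vectors)
    # Small batch: attach each resume to the most similar existing cluster.
    result = {}
    for resume_id, vector in new_vectors.items():
        best = max(category_vectors, key=lambda c: cosine(vector, category_vectors[c]))
        result[resume_id] = cluster_positions[best]
    return result
```

Matching against existing category feature vectors avoids re-running the clustering for a handful of resumes, while a large batch is clustered again so that the new resumes can influence the cluster structure.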

Optionally, for each cluster category, the position category determining module may be configured to:

determining the proportion of the first resume of each post category in the resumes belonging to the cluster category; and for the maximum ratio in all ratios, if the maximum ratio is not less than a set threshold, determining the position category corresponding to the maximum ratio as the target position category corresponding to the cluster category.

Based on the same principle as the resume processing method and the resume processing apparatus provided in the embodiments of the present application, an embodiment of the present application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and the processor, when running the computer program, is configured to execute the resume processing method provided in any optional embodiment of the present application, or is configured to execute actions performed by the apparatus provided in any optional embodiment of the present application.

As an optional embodiment, fig. 9 shows a schematic structural diagram of an electronic device to which an embodiment of the present application is applicable; the electronic device may execute the resume processing method provided in any optional embodiment of the present application. As shown in fig. 9, the electronic device 4000 may include a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. In practical applications the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.

The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein. The processor 4001 may also be a combination that performs a computational function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.

The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

The memory 4003 is used for storing application program codes (computer programs) for executing the present scheme, and is controlled by the processor 4001 to execute. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.

The embodiment of the present application provides a computer readable storage medium, on which a computer program is stored, and when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiment.

Embodiments of the present application also provide a computer product, which includes a computer program that, when executed by a processor, implements the steps of the method provided by the embodiments of the present application.

Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application also provides a computer program product or a computer program, which includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided in any of the alternative embodiments of the present application.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. A resume processing method, comprising:

acquiring a resume set to be processed, wherein the resume set to be processed comprises a first resume of a plurality of known position types and a second resume of at least one unknown position type;

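acquiring a resume feature vector of each resume in the resume set to be processed;

clustering the resumes in the resume set to be processed based on the similarity between the resume feature vectors of the resumes in the resume set to be processed to obtain a plurality of cluster categories;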

for each cluster category, determining a target position category corresponding to the cluster category according to the proportion of the first resumes of each position category among the resumes belonging to the cluster category;

and for each second resume, determining the target position category corresponding to the cluster category to which the second resume belongs as the position category corresponding to the second resume, and processing the second resume based on the position category of the second resume.
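As a non-limiting illustration of the last two steps of claim 1, the following is a minimal Python sketch. It assumes that a cluster assignment is already available for every resume, and that the target position category of a cluster is the category with the highest proportion among the first resumes in that cluster; this selection rule, the function names `label_clusters` and `assign_unknown`, and the use of `None` to mark an unknown position category are all assumptions made for illustration and are not taken from the claims.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, known_categories):
    """Map each cluster id to the position category with the largest share.

    cluster_ids: cluster id of each resume.
    known_categories: position category of each resume, or None if unknown.
    """
    counts = defaultdict(Counter)
    for cid, cat in zip(cluster_ids, known_categories):
        if cat is not None:
            counts[cid][cat] += 1
    # The category with the highest count also has the highest proportion,
    # because every proportion in a cluster shares the same denominator.
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


def assign_unknown(cluster_ids, known_categories):
    """Give each unknown resume the target category of its cluster."""
    targets = label_clusters(cluster_ids, known_categories)
    return [
        cat if cat is not None else targets.get(cid)
        for cid, cat in zip(cluster_ids, known_categories)
    ]
```

A cluster that contains no first resume receives no target category in this sketch, so the second resumes in it remain `None`.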

2. The method according to claim 1, wherein the obtaining of the resume feature vector of each resume in the set of resumes to be processed comprises:

for each resume in the resume set to be processed, determining the position feature words contained in the resume according to a position category feature word library, and obtaining the resume feature vector of the resume according to the position feature words contained in the resume.

3. The method of claim 2, further comprising, for each of the position feature words:

for each character of the position feature word, obtaining a mixed feature vector of the character by fusing a character feature vector of the character and a word feature vector of the position feature word;

wherein, for each resume in the resume set to be processed, the obtaining of the resume feature vector of the resume according to the position feature words contained in the resume comprises:

obtaining the resume feature vector of the resume according to the mixed feature vector of each character in the position feature words contained in the resume.
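The fusion in claim 3 can be sketched as follows. The claim does not state how the character feature vector and the word feature vector are fused, nor how the mixed vectors are combined into a resume feature vector; this sketch assumes concatenation for the fusion and an element-wise mean for the combination. The names `mixed_char_vectors` and `resume_vector` and the dictionary-based embedding lookups are hypothetical.

```python
def mixed_char_vectors(word, char_vecs, word_vecs):
    """Return one mixed vector per character of a position feature word.

    Assumption: fusion is concatenation of the character vector with the
    vector of the word that contains the character.
    """
    return [list(char_vecs[ch]) + list(word_vecs[word]) for ch in word]


def resume_vector(feature_words, char_vecs, word_vecs):
    """Average the mixed vectors of all characters of all feature words."""
    mixed = [
        vec
        for word in feature_words
        for vec in mixed_char_vectors(word, char_vecs, word_vecs)
    ]
    if not mixed:
        raise ValueError("resume contains no position feature words")
    n = len(mixed)
    # zip(*mixed) iterates over the vectors column by column.
    return [sum(column) / n for column in zip(*mixed)]
```

Concatenation keeps the character-level and word-level signals in separate dimensions; an additive or weighted fusion would fit the claim wording equally well.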

4. The method according to claim 2 or 3, wherein the position category feature word library is extracted based on a positive sample resume set of each of a plurality of position categories, the positive sample resume set of each position category comprising a plurality of positive sample resumes; and the position feature words of each position category include at least one of subject words, keywords, or named entities of the position category.

5. The method according to claim 4, wherein the subject words of each position category in the position category feature word library are obtained by:

determining candidate subject words of each position category from the segmented words according to a first word frequency of each segmented word in the resume data of the positive sample resumes of the position category;

6. The method of claim 5, wherein, for each of the candidate subject words, determining the topic importance of the candidate subject word based on the second word frequency of the candidate subject word in the resume data of the positive sample resumes of all of the position categories and the document frequency of the candidate subject word in the resume data of the positive sample resumes of all of the position categories comprises:

determining the part of speech of the candidate subject word;

and determining the topic importance of the candidate subject word according to the part of speech and the initial importance of the candidate subject word.

7. The method of claim 6, wherein determining the topic importance of the candidate subject word according to the part of speech and the initial importance of the candidate subject word comprises:

if the part of speech of the candidate subject word is a noun, determining the noun type of the candidate subject word, wherein the noun type is a proper noun or a common noun;

and increasing the initial importance of the candidate subject word according to the noun type of the candidate subject word to obtain the topic importance of the candidate subject word, wherein the degree of increase in the initial importance for a proper noun is greater than that for a common noun.
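The part-of-speech adjustment of claim 7 can be illustrated with the short sketch below. The claim only requires that a proper noun be raised more than a common noun; the multiplicative form of the adjustment, the two factor values, and the handling of non-noun words (left unchanged) are assumptions chosen for illustration, as are all names in the code.

```python
# Hypothetical boost factors; the only constraint taken from the claim is
# that the proper-noun boost exceeds the common-noun boost.
PROPER_NOUN_BOOST = 1.5
COMMON_NOUN_BOOST = 1.2


def topic_importance(initial_importance, part_of_speech, noun_type=None):
    """Raise the initial importance of a noun according to its noun type."""
    if part_of_speech != "noun":
        # Assumption: words that are not nouns keep their initial importance.
        return initial_importance
    if noun_type == "proper":
        return initial_importance * PROPER_NOUN_BOOST
    if noun_type == "common":
        return initial_importance * COMMON_NOUN_BOOST
    raise ValueError("noun_type must be 'proper' or 'common' for a noun")
```

A multiplicative factor preserves the intended ordering only when the initial importance is positive; an additive bonus would be an equally valid reading of the claim.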

8. The method of claim 6, wherein the candidate subject words of each of the position categories comprise at least one of domain words or common words of the position category, and the initial importance of any domain word is not less than the maximum initial importance among all common words.

9. The method of claim 5, wherein the candidate subject words of each of the position categories comprise at least one of domain words or common words of the position category;

for each candidate subject word, determining the topic importance of the candidate subject word according to the second word frequency of the candidate subject word in the resume data of the positive sample resumes of all position categories and the document frequency of the candidate subject word in the resume data of the positive sample resumes of all position categories comprises:

for each common word, determining the topic importance of the common word according to the second word frequency and the document frequency corresponding to the common word and a second maximum document frequency, wherein the second maximum document frequency is the maximum of the document frequencies corresponding to the common words of all position categories.

10. The method according to claim 4, wherein the keywords of each position category in the position category feature word library are obtained by:

determining a third word frequency of each segmented word in the resume data of the positive sample resumes of each position category;

for each segmented word and each position category, determining the ratio of the number of resumes in which the segmented word appears in the resume data of the positive sample resumes of all position categories to the number of resumes in which the segmented word appears in the resume data of the positive sample resumes of the position categories other than the position category;

and determining the keywords of each position category from the segmented words according to the category distinguishing capability of each segmented word for each position category.

11. The method according to any one of claims 1 to 3, wherein the clustering of the resumes in the resume set to be processed based on the similarity between the resume feature vectors of the resumes in the resume set to be processed to obtain a plurality of cluster categories comprises:

constructing a graph based on the resumes and the similarity between the resumes, wherein each resume is a node in the graph, and if the similarity corresponding to two nodes is greater than or equal to a set threshold, a connecting edge is formed between the two nodes and the similarity corresponding to the two nodes is used as the weight of the connecting edge;
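The graph construction recited in claim 11 can be sketched as follows. Cosine similarity is assumed as the similarity measure, which the claim does not specify. The claim text as reproduced here does not state how cluster categories are derived from the graph, so `connected_components` is only a placeholder that groups nodes joined by a path of edges; it is not presented as the claimed clustering step. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_similarity_graph(vectors, threshold):
    """Return an adjacency map: node index -> {neighbour index: edge weight}."""
    graph = {i: {} for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:  # an edge exists only at or above the threshold
            graph[i][j] = sim
            graph[j][i] = sim
    return graph


def connected_components(graph):
    """Placeholder grouping: label each node with its connected component."""
    labels = {}
    next_label = 0
    for start in graph:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in labels:
                    labels[neighbour] = next_label
                    stack.append(neighbour)
        next_label += 1
    return labels
```

The pairwise loop is quadratic in the number of resumes, which is acceptable for a sketch; a large resume set would typically need an approximate nearest-neighbour search instead.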

12. The method of any of claims 1 to 3, further comprising:

acquiring at least one new resume to be processed of an unknown position category, and determining the number of resumes of the at least one new resume to be processed;

if the number of resumes is smaller than a set number, acquiring the resume feature vector of each new resume to be processed and the category feature vector of each cluster category; and for each new resume to be processed, determining the cluster category of the new resume to be processed according to the similarity between the resume feature vector of the new resume to be processed and each category feature vector, and determining the target position category corresponding to the cluster category of the new resume to be processed as the position category of the new resume to be processed;

if the number of resumes is greater than or equal to the set number, constructing a new resume set to be processed based on the at least one new resume to be processed, clustering the resumes in the new resume set to be processed to obtain a plurality of cluster categories, determining a target position category of each cluster category according to the proportion of resumes of each known position category among the resumes belonging to the cluster category, and determining the target position category corresponding to the cluster category to which each new resume to be processed belongs as the position category of the new resume to be processed.
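The two branches of claim 12 can be sketched as follows. The sketch assumes that each category feature vector is a single vector per cluster category (for example, a centroid), that similarity is cosine similarity, and that the full re-clustering path is supplied by the caller as a function; none of these details is specified in the claim. The names `classify_new_resumes`, `category_vectors`, `cluster_targets` and `recluster` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_new_resumes(new_vectors, category_vectors, cluster_targets,
                         set_number, recluster):
    """Assign position categories to new resumes of unknown category.

    new_vectors: resume feature vectors of the new resumes.
    category_vectors: cluster id -> category feature vector (assumed centroid).
    cluster_targets: cluster id -> target position category.
    set_number: batch-size threshold that selects the processing branch.
    recluster: caller-supplied function that runs the full clustering path
        and returns one position category per new resume.
    """
    if len(new_vectors) < set_number:
        # Small batch: reuse the existing clusters via nearest category vector.
        result = []
        for vec in new_vectors:
            best = max(category_vectors,
                       key=lambda cid: cosine(vec, category_vectors[cid]))
            result.append(cluster_targets[best])
        return result
    # Large batch: build a new resume set and cluster it again.
    return recluster(new_vectors)
```

The small-batch branch avoids the cost of re-clustering, while the large-batch branch lets the cluster structure adapt when enough new resumes have accumulated.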

13. A resume processing apparatus, comprising:

a to-be-processed resume acquisition module, configured to acquire a resume set to be processed, wherein the resume set to be processed comprises a plurality of first resumes of known position categories and at least one second resume of an unknown position category;

and a position category determining module, configured to determine a target position category corresponding to each cluster category according to the proportion of the first resumes of each position category among the resumes belonging to each cluster category, to determine the target position category corresponding to the cluster category to which each second resume belongs as the position category corresponding to each second resume, and to process each second resume based on the position category of each second resume.

14. An electronic device, comprising a memory and a processor;

the memory is configured to store a computer program;

the processor, when executing the computer program, performs the method of any of claims 1 to 12.

15. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1 to 12.

16. A computer product, characterized in that the computer product comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 12.