CN105786781A - Job description text similarity calculation method based on topic model - Google Patents
Job description text similarity calculation method based on topic model
- Publication number
- CN105786781A
- Authority
- CN
- China
- Prior art keywords
- job description
- text
- description text
- model
- topic model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a job description text similarity calculation method based on a topic model. The method comprises the steps of semantic preprocessing, model preprocessing, topic model analysis, cluster analysis, and similarity calculation. The projection features of job description texts on different topics are extracted and combined with several specific features, such as years of service, work location, and education, to obtain a vector representation of each job description text, on which text similarity calculation and clustering are performed. Because the texts are represented by both semantic features and domain-specific features, the accuracy of job description text similarity calculation is greatly improved. The method finds positions with highly overlapping functions in a massive database of posts and job descriptions and assists the relevant departments in analysis and decision making. It overcomes the large deviation that arises when a traditional vector space model is used to calculate text similarity, and therefore better achieves automatic identification of positions with overlapping functions.
Description
Technical field
The invention belongs to the field of information retrieval and text mining, and in particular relates to a job description text similarity calculation method based on a topic model.
Background
As competition among enterprises intensifies, human resources account for a growing share of enterprise operating costs. Correspondingly, the allocation and movement of personnel within enterprises is becoming more frequent. Reducing demand for positions whose functions overlap heavily, and making full use of existing staff, is therefore an important way for an enterprise to cut costs and improve efficiency. As enterprises keep growing, traditional means of identifying functionally similar positions, such as manual screening, can no longer meet enterprise needs. Designing a job description text similarity algorithm that automatically identifies positions with overlapping functions, and that partly or even entirely replaces costly and inefficient manual screening, has therefore become a problem that must be solved in the informatization of human resources management.
The key problems in job description text similarity calculation are how to represent the text content and how to evaluate text similarity. At present, the common text representation is the vector space model: a set of feature words is first selected from the vocabulary to form the representation space, and each text is then mapped to a vector in that space. The magnitude of each vector element directly reflects how much the corresponding feature word contributes to the text. After the representation vectors are normalized, cosine similarity can be used to compute the similarity of two texts. The weakness of the vector space model is that it assumes the feature words are mutually independent and ignores the correlations among them. Text similarity computed with the vector space model alone therefore often deviates from the true similarity.
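The weakness described above can be seen in a minimal sketch of the vector space model with cosine similarity. The vocabulary and the three short job texts are hypothetical and chosen only for illustration; the patent does not specify a term weighting scheme, so raw term counts are assumed here.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map a text to a term-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature words and job texts, for illustration only.
vocabulary = ["develop", "build", "software", "applications", "manage", "payroll"]
doc_a = "develop software"
doc_b = "build applications"
doc_c = "develop software applications"

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)
vec_c = bag_of_words(doc_c, vocabulary)

sim_ab = cosine_similarity(vec_a, vec_b)  # the two texts share no feature word
sim_ac = cosine_similarity(vec_a, vec_c)  # the two texts share two feature words
print(sim_ab, sim_ac)
```

Because `doc_a` and `doc_b` share no feature word, their similarity is zero even though a human reader would consider them closely related, which is the kind of deviation the independence assumption produces.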
Summary of the invention
To overcome the deficiencies of the prior art, the present invention provides a job description text similarity calculation method based on a topic model. It overcomes shortcomings such as the large deviation that arises when a traditional vector space model is used to calculate text similarity, and thereby better achieves automatic identification of positions with overlapping functions.
The present invention is achieved by the following technical solution: a job description text similarity calculation method based on a topic model, comprising the following steps:
Step 1) Input and storage of job description text: the method allows the user to input job description text in two ways;
Step 2) Specific feature extraction: based on the characteristics of job description text, extract specific features such as years of service, work location, working hours, education, or major;
Step 3) Semantic preprocessing: apply semantic preprocessing to the job description text to be analyzed, namely sentence segmentation, word segmentation, part-of-speech tagging, and lemmatization or stemming;
Step 4) Model preprocessing: filter out stop words and low-frequency words to form the corpus on which the topic model is computed; the purpose of this filtering is to delete from the text words that are unrelated to its content, such as prepositions or conjunctions;
Step 5) Topic model analysis: apply a latent semantic analysis model to the corpus to obtain a vector representation of every text in the corpus in the latent semantic space;
Step 6) Cluster analysis: first combine each text's projection features on the different topics with its specific features to obtain a precise representation of the job description text, then cluster the vectorized texts so that job description texts with similar features fall into the same class;
Step 7) Job description text similarity calculation: the text similarity is computed from the combined features, that is, the text's projection features on the different topics together with its specific features. The formula itself is not preserved in this text; it is defined over the representation vectors of the two texts being compared and the i-th components of those vectors. With job description text similarity calculation, the user can quantitatively analyze how similar two positions are. By setting a similarity threshold, the user can implement flexible position resource allocation strategies, which provides effective quantitative support for human resource optimization.
The two ways in step 1 are as follows: in the first, the user specifies a network address and the system retrieves the text stored on the Internet; in the second, the user directly inputs the text to be processed on the server side. Massive text data is stored in a distributed manner.
The latent semantic analysis model in step 5 builds on the traditional vector space model and maps each text to a vector in the topic representation space; the topic model is used to extract the projection features of job description text on the different topics.
The projection features and specific features in step 6 refer to years of service, work location, working hours, education, or major.
The beneficial effects of the invention are as follows. This application discloses a job description text similarity calculation method comprising the steps of semantic preprocessing, model preprocessing, topic model analysis, cluster analysis, and similarity calculation. The invention extracts the projection features of job description text on different topics and combines them with several specific features, such as years of service, work location, and education, to obtain a vector representation of the job description text, on which text similarity calculation and clustering are performed. Compared with existing text similarity calculation methods, the invention represents text with both semantic features and domain-specific features, which greatly improves the accuracy of job description text similarity calculation. The invention finds positions with highly overlapping functions in a massive database of posts and job descriptions and assists the relevant departments in analysis and decision making.
Brief description of the drawings
To illustrate the embodiments of the present invention more clearly, the drawings used in the implementation are briefly described below:
Fig. 1 is a system block diagram of the job description text similarity calculation method based on a topic model;
Fig. 2 is a flow chart of semantic preprocessing;
Fig. 3 is a flow chart of model preprocessing;
Fig. 4 is a flow chart of topic analysis of the corpus using the latent semantic analysis model.
Detailed description of the invention
The present invention is described in detail below with reference to the drawings and specific embodiments.
As shown in Figures 1 to 4, a job description text similarity calculation method based on a topic model comprises the following steps.
Step 1) Input and storage of job description text: the invention allows the user to input job description text in two ways. In the first, the user specifies a network address and the system retrieves the text stored on the Internet. In the second, the user directly inputs the text to be processed on the server side. Massive text data is stored in a distributed manner.
Step 2) Specific feature extraction: based on the characteristics of job description text, extract specific features such as years of service, work location, working hours, education, and major.
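A minimal sketch of specific feature extraction follows. The patent does not say how the features are extracted, so the regular expression, the city list `KNOWN_CITIES`, and the degree table `DEGREE_LEVELS` are all assumptions made for illustration, and only three of the listed features are covered.

```python
import re

# Hypothetical lookup tables; a real system would use much larger gazetteers.
KNOWN_CITIES = ["Beijing", "Shanghai", "Shenzhen"]
DEGREE_LEVELS = {"doctorate": 4, "master": 3, "bachelor": 2, "associate": 1}


def extract_specific_features(text):
    """Extract years of service, work location, and education from free text."""
    features = {"years": None, "location": None, "education": None}
    lowered = text.lower()

    # Matches phrases such as "3 years" or "5+ years of experience".
    match = re.search(r"(\d+)\s*\+?\s*years?", lowered)
    if match:
        features["years"] = int(match.group(1))

    # The first known city mentioned in the text is taken as the location.
    for city in KNOWN_CITIES:
        if city.lower() in lowered:
            features["location"] = city
            break

    # Degrees are checked from highest to lowest; the first hit wins.
    for degree in DEGREE_LEVELS:
        if degree in lowered:
            features["education"] = degree
            break

    return features


sample = "Java engineer, Shanghai. Bachelor degree or above, 3+ years of experience."
features = extract_specific_features(sample)
print(features)
```

Substring matching is deliberately crude (for example, "scrum master" would be read as a degree); it only shows the shape of the output that later steps combine with the topic features.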
Step 3) Semantic preprocessing: apply semantic preprocessing to the job description text to be analyzed, including sentence segmentation (English), word segmentation (Chinese), part-of-speech tagging, lemmatization (English), and stemming (English).
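The English-language part of this step can be sketched with the standard library alone. The suffix-stripping function `naive_stem` is a crude stand-in written for this example, not a real stemming algorithm, and Chinese word segmentation and part-of-speech tagging are omitted because they need dedicated tools that the patent does not name.

```python
import re


def split_sentences(text):
    """Split English text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def naive_stem(token):
    """Strip a few common English suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed tokens per sentence."""
    return [[naive_stem(t) for t in tokenize(s)] for s in split_sentences(text)]


sample = "Designs and tests web services. Managing databases is required."
processed = preprocess(sample)
print(processed)
```

Stems such as `servic` and `databas` are not dictionary words; that is normal for stemming, whose only job is to map inflected forms of a word to the same token.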
Step 4) Model preprocessing: filter out stop words and low-frequency words to form the corpus on which the topic model is computed. The purpose of this filtering is to delete from the text words that are unrelated to its content, such as prepositions and conjunctions. Model preprocessing reduces the amount of computation without losing the information contained in the original text.
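A minimal sketch of this filtering step is shown below. The stop-word set `STOP_WORDS` and the cutoff `min_count` are assumptions; the patent gives neither a stop-word list nor a frequency threshold.

```python
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"and", "or", "of", "in", "the", "a", "to", "with", "for"}


def build_corpus(tokenized_docs, min_count=2):
    """Drop stop words and words that occur fewer than min_count times overall."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    keep = {w for w, c in totals.items() if c >= min_count and w not in STOP_WORDS}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]


docs = [
    ["design", "and", "test", "software", "for", "banks"],
    ["test", "and", "deploy", "software"],
    ["design", "of", "payroll", "reports"],
]
corpus = build_corpus(docs, min_count=2)
print(corpus)
```

Here "and" is frequent but is removed as a stop word, while "banks", "deploy", "payroll", and "reports" are removed because each occurs only once in this tiny example.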
Step 5) Topic model analysis: apply a latent semantic analysis (Latent Semantic Analysis) model to the corpus to obtain a vector representation of every text in the corpus in the latent semantic space. The latent semantic analysis model builds on the traditional vector space model and maps each text to a vector in the topic representation space. With the topic model, the projection features of job description text on the different topics can be extracted.
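Latent semantic analysis is commonly computed as a truncated singular value decomposition of the document-term matrix. The sketch below approximates the leading right singular vectors by power iteration with deflation so that it needs only the standard library; in practice a numerical library's SVD routine would normally be used. The matrix, the vocabulary order, and the choice of two topics are assumptions, since the patent specifies neither the term weighting nor the number of topics.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_topics(doc_term, k, iterations=200):
    """Approximate the top-k right singular vectors of doc_term.

    Runs power iteration on the Gram matrix A^T A and deflates after
    each vector is found.
    """
    n_terms = len(doc_term[0])
    gram = [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]
    topics = []
    for _ in range(k):
        vec = [1.0 / (i + 1) for i in range(n_terms)]  # deterministic start
        for _ in range(iterations):
            nxt = matvec(gram, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(v * g for v, g in zip(vec, matvec(gram, vec)))
        topics.append(vec)
        # Remove the component just found so the next pass finds the next one.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n_terms)]
                for i in range(n_terms)]
    return topics


def project(doc_term, topics):
    """Project every document onto every topic direction."""
    return [[sum(a * t for a, t in zip(row, topic)) for topic in topics]
            for row in doc_term]


# Hypothetical term counts; columns are ["java", "code", "payroll", "salary"].
doc_term = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
topics = top_topics(doc_term, k=2)
doc_topics = project(doc_term, topics)
print(doc_topics)
```

The first two documents use "java" and "code" in different proportions, yet they receive the same two-dimensional topic vector, while the third document loads only on the second topic. This is the sense in which the topic space captures correlations among feature words that the plain vector space model ignores.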
Step 6) Cluster analysis: first combine each text's projection features on the different topics with its specific features (years of service, work location, working hours, education, major, and so on) to obtain a precise representation of the job description text. The vectorized texts are then clustered, so that job description texts with similar features fall into the same class.
With position cluster analysis, the user obtains an effective classification of job information and can screen in a targeted way for posts whose functions are similar or very different, which provides quantitative support for human resource optimization.
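The patent does not name a clustering algorithm, so the sketch below assumes plain k-means with Euclidean distance. Each vector is a hypothetical combination of two topic projections followed by two specific features scaled to the range 0 to 1 (years of service and education level); the scaling and the feature order are illustrative choices, not part of the source.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical combined vectors: [topic_1, topic_2, years_scaled, education_scaled].
job_vectors = [
    [2.1, 0.0, 0.3, 0.50],
    [0.0, 1.4, 0.5, 0.50],
    [2.0, 0.1, 0.2, 0.50],
    [0.1, 1.5, 0.5, 0.75],
]
labels = kmeans(job_vectors, k=2)
print(labels)
```

Seeding the centroids with the first k points keeps the example reproducible; a production system would more likely use randomized or k-means++ style initialization and would need to choose k.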
Step 7) Job description text similarity calculation: the text similarity is computed from the combined features (the text's projection features on the different topics together with its specific features). The formula itself is not preserved in this text; it is defined over the representation vectors of the two texts being compared and the i-th components of those vectors. With job description text similarity calculation, the user can quantitatively analyze how similar two positions are. By setting a similarity threshold, the user can implement flexible position resource allocation strategies, which provides effective quantitative support for human resource optimization.
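Since the similarity formula is not available in this text, the sketch below assumes cosine similarity, the measure the background section names for normalized representation vectors; the patent's actual formula may differ. The job names, the combined vectors, and the threshold of 0.9 are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlapping_pairs(vectors, threshold):
    """Return (name_a, name_b, similarity) for every pair at or above threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(vectors.items()), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((name_a, name_b, sim))
    return pairs


# Hypothetical combined vectors: [topic_1, topic_2, years_scaled, education_scaled].
jobs = {
    "backend_engineer": [2.1, 0.0, 0.3, 0.5],
    "java_developer": [2.0, 0.1, 0.2, 0.5],
    "payroll_clerk": [0.0, 1.4, 0.5, 0.5],
}
result = overlapping_pairs(jobs, threshold=0.9)
for name_a, name_b, sim in result:
    print(name_a, name_b, round(sim, 3))
```

Raising or lowering `threshold` changes how many pairs are flagged as overlapping, which is the kind of adjustable policy the step describes.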
Finally, it should be noted that all other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention. The above content is intended only to illustrate the technical solution of the present invention and not to limit its scope of protection; simple modifications or equivalent replacements made to the technical solution by a person of ordinary skill in the art do not depart from the spirit and scope of the technical solution of the present invention.
Claims (4)
1. A job description text similarity calculation method based on a topic model, characterized in that the method comprises the following steps:
Step 1) Input and storage of job description text: the method allows the user to input job description text in two ways;
Step 2) Specific feature extraction: based on the characteristics of job description text, extract specific features such as years of service, work location, working hours, education, or major;
Step 3) Semantic preprocessing: apply semantic preprocessing to the job description text to be analyzed, namely sentence segmentation, word segmentation, part-of-speech tagging, and lemmatization or stemming;
Step 4) Model preprocessing: filter out stop words and low-frequency words to form the corpus on which the topic model is computed, the purpose of this filtering being to delete from the text words that are unrelated to its content, such as prepositions or conjunctions;
Step 5) Topic model analysis: apply a latent semantic analysis model to the corpus to obtain a vector representation of every text in the corpus in the latent semantic space;
Step 6) Cluster analysis: first combine each text's projection features on the different topics with its specific features to obtain a precise representation of the job description text, then cluster the vectorized texts so that job description texts with similar features fall into the same class;
Step 7) Job description text similarity calculation: compute text similarity from the combined features, that is, the text's projection features on the different topics together with its specific features; with job description text similarity calculation, the user can quantitatively analyze how similar two positions are.
2. The job description text similarity calculation method based on a topic model according to claim 1, characterized in that the two ways in step 1 are as follows: in the first, the user specifies a network address and the system retrieves the text stored on the Internet; in the second, the user directly inputs the text to be processed on the server side; massive text data is stored in a distributed manner.
3. The job description text similarity calculation method based on a topic model according to claim 1, characterized in that the latent semantic analysis model in step 5 builds on the traditional vector space model and maps each text to a vector in the topic representation space, and the topic model is used to extract the projection features of job description text on the different topics.
4. The job description text similarity calculation method based on a topic model according to claim 1, characterized in that the projection features and specific features in step 6 refer to years of service, work location, working hours, education, or major.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610140634.6A CN105786781A (en) | 2016-03-14 | 2016-03-14 | Job description text similarity calculation method based on topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610140634.6A CN105786781A (en) | 2016-03-14 | 2016-03-14 | Job description text similarity calculation method based on topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105786781A true CN105786781A (en) | 2016-07-20 |
Family
ID=56393272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610140634.6A Pending CN105786781A (en) | 2016-03-14 | 2016-03-14 | Job description text similarity calculation method based on topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105786781A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446089A (en) * | 2016-09-12 | 2017-02-22 | 北京大学 | Method for extracting and storing multidimensional field key knowledge |
CN106777295A (en) * | 2016-12-30 | 2017-05-31 | 深圳爱拼信息科技有限公司 | Method and system is recommended in a kind of position search based on semantic matches |
CN106777296A (en) * | 2016-12-30 | 2017-05-31 | 深圳爱拼信息科技有限公司 | Method and system are recommended in a kind of talent's search based on semantic matches |
CN107818134A (en) * | 2017-09-26 | 2018-03-20 | 北京纳人网络科技有限公司 | A kind of position similarity calculating method, client and server |
US20190197482A1 (en) * | 2017-12-27 | 2019-06-27 | International Business Machines Corporation | Creating and using triplet representations to assess similarity between job description documents |
CN112100492A (en) * | 2020-09-11 | 2020-12-18 | 河北冀联人力资源服务集团有限公司 | Batch delivery method and system for resumes of different versions |
CN113221000A (en) * | 2021-05-17 | 2021-08-06 | 上海博亦信息科技有限公司 | Talent data intelligent retrieval and recommendation method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6446061B1 (en) * | 1998-07-31 | 2002-09-03 | International Business Machines Corporation | Taxonomy generation for document collections |
CN101594313A (en) * | 2008-05-30 | 2009-12-02 | 电子科技大学 | A kind of spam judgement, classification, filter method and system based on potential semantic indexing |
CN102110140A (en) * | 2011-01-26 | 2011-06-29 | 桂林电子科技大学 | Network-based method for analyzing opinion information in discrete text |
CN102929937A (en) * | 2012-09-28 | 2013-02-13 | 福州博远无线网络科技有限公司 | Text-subject-model-based data processing method for commodity classification |
CN103177087A (en) * | 2013-03-08 | 2013-06-26 | 浙江大学 | Similar Chinese herbal medicine search method based on probability topic model |
Non-Patent Citations (2)
Title |
---|
任姚鹏: ""基于语义相似度分析的软构件聚类算法研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
罗义兵: ""领域文本相似度计算方法研究"", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106446089A (en) * | 2016-09-12 | 2017-02-22 | 北京大学 | Method for extracting and storing multidimensional field key knowledge |
CN106446089B (en) * | 2016-09-12 | 2019-08-16 | 北京大学 | The extraction and storage method of a kind of various dimensions field all critical learning |
CN106777295A (en) * | 2016-12-30 | 2017-05-31 | 深圳爱拼信息科技有限公司 | Method and system is recommended in a kind of position search based on semantic matches |
CN106777296A (en) * | 2016-12-30 | 2017-05-31 | 深圳爱拼信息科技有限公司 | Method and system are recommended in a kind of talent's search based on semantic matches |
CN107818134A (en) * | 2017-09-26 | 2018-03-20 | 北京纳人网络科技有限公司 | A kind of position similarity calculating method, client and server |
US20190197482A1 (en) * | 2017-12-27 | 2019-06-27 | International Business Machines Corporation | Creating and using triplet representations to assess similarity between job description documents |
US11410130B2 (en) * | 2017-12-27 | 2022-08-09 | International Business Machines Corporation | Creating and using triplet representations to assess similarity between job description documents |
CN112100492A (en) * | 2020-09-11 | 2020-12-18 | 河北冀联人力资源服务集团有限公司 | Batch delivery method and system for resumes of different versions |
CN113221000A (en) * | 2021-05-17 | 2021-08-06 | 上海博亦信息科技有限公司 | Talent data intelligent retrieval and recommendation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105786781A (en) | Job description text similarity calculation method based on topic model | |
CN105786991B (en) | In conjunction with the Chinese emotion new word identification method and system of user feeling expression way | |
CN104462378B (en) | Data processing method and device for text identification | |
CN104636466B (en) | Entity attribute extraction method and system for open webpage | |
US20170364503A1 (en) | Multi-stage recognition of named entities in natural language text based on morphological and semantic features | |
CN104462053A (en) | Inner-text personal pronoun anaphora resolution method based on semantic features | |
Bhargava et al. | Atssi: Abstractive text summarization using sentiment infusion | |
CN103077164A (en) | Text analysis method and text analyzer | |
CN104598535A (en) | Event extraction method based on maximum entropy | |
CN105069021A (en) | Chinese short text sentiment classification method based on fields | |
CN109543034A (en) | Text Clustering Method, device and the readable storage medium storing program for executing of knowledge based map | |
Falk et al. | Classifying French verbs using French and English lexical resources | |
CN109033320B (en) | Bilingual news aggregation method and system | |
CN103324626A (en) | Method for setting multi-granularity dictionary and segmenting words and device thereof | |
WO2021098651A1 (en) | Method and apparatus for acquiring risk entity | |
CN107133212A (en) | It is a kind of that recognition methods is contained based on integrated study and the text of words and phrases integrated information | |
CN107515849A (en) | It is a kind of into word judgment model generating method, new word discovery method and device | |
CN102929860A (en) | Chinese clause emotion polarity distinguishing method based on context | |
CN106776695A (en) | The method for realizing the automatic identification of secretarial document value | |
CN110321561A (en) | A kind of keyword extracting method and device | |
CN105404693A (en) | Service clustering method based on demand semantics | |
CN110705292A (en) | Entity name extraction method based on knowledge base and deep learning | |
CN106126497A (en) | A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment | |
CN114265931A (en) | Big data text mining-based consumer policy perception analysis method and system | |
CN110334188A (en) | A kind of multi-document summary generation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160720 |