CN116680590A

CN116680590A - Post portrait label extraction method and device based on work instruction analysis

Info

Publication number: CN116680590A
Application number: CN202310941705.2A
Authority: CN
Inventors: 王涛; 沈大勇; 张忠山; 姚锋; 刘晓路; 杜永浩; 闫俊刚; 王沛; 陈英武; 吕济民; 何磊; 陈宇宁; 陈盈果
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-07-28
Filing date: 2023-07-28
Publication date: 2023-09-01
Anticipated expiration: 2043-07-28
Also published as: CN116680590B

Abstract

The application relates to a post portrait label extraction method and device based on work instruction analysis. The method comprises the following steps: acquiring a work instruction of a relevant enterprise post; preprocessing the work instruction to obtain a preprocessed work instruction; vectorizing the preprocessed work instruction according to a natural language processing technology to obtain a vectorized work instruction; performing fuzzy calculation on keywords in the vectorized work instruction to obtain fuzzy factors of the keywords; setting a loss function of fuzzy clustering by using the fuzzy factors, and assigning the keywords according to the loss function to obtain an initial category to which each keyword belongs; performing density clustering on the keywords after the initial category assignment according to the density-based DBSCAN algorithm to obtain a center of each category and a keyword set corresponding to the center; and taking the keyword set as a post portrait label. The method can be used for extracting post portrait labels.

Description

Post portrait label extraction method and device based on work instruction analysis

Technical Field

The application relates to the technical field of data processing, in particular to a post portrait label extraction method and device based on work instruction analysis.

Background

In recent years, with the continuous development and popularization of big data and artificial intelligence technology, post portraits have become an important tool in fields such as enterprise recruitment, talent cultivation and career planning. Based on post portraits, an enterprise can understand the requirements and characteristics of its various posts more accurately and thereby formulate more effective recruitment strategies. For job seekers, post portraits help them identify the posts that suit them, which improves their success rate in job hunting.

However, the prior patent applications CN 201910068512 (a post portrait setting method, device and terminal device), CN 201910744021 (a post portrait generating method, device and electronic device), CN 201910192576 (a method, device, equipment and storage medium for matching post portraits with resume information) and CN 202011286200 (a post portrait generating method, device, equipment and storage medium) mainly address the generation and setting of post portraits, and do not provide a method for extracting the relevant post portrait labels.

Disclosure of Invention

In view of the above, it is desirable to provide a post portrait tag extraction method, apparatus, computer device, and storage medium that enable post portrait tag extraction based on job instruction analysis.

A post portrait label extraction method based on work instruction analysis comprises the following steps:

acquiring a work instruction of a relevant enterprise post; preprocessing the work instruction to obtain a preprocessed work instruction;

vectorizing the preprocessed work instruction according to a natural language processing technology to obtain a vectorized work instruction;

performing fuzzy calculation on keywords in the vectorized work instruction to obtain fuzzy factors of the keywords; setting a loss function of fuzzy clustering by using the fuzzy factors, and assigning the keywords according to the loss function to obtain an initial category to which each keyword belongs;

performing density clustering on the keywords after the initial category distribution according to a density-based DBSCAN algorithm to obtain a center of each category and a keyword set corresponding to the center; and taking the keyword set as a post portrait tag.

In one embodiment, the preprocessing of the working instruction book to obtain a preprocessed working instruction book includes:

cleaning the text of the work instruction to remove useless information, performing word segmentation and part-of-speech tagging on the cleaned work instruction using the jieba word segmentation tool, and performing stop-word filtering to obtain the preprocessed work instruction.

In one embodiment, vectorizing the pre-processed work instruction according to a natural language processing technique to obtain a vectorized work instruction, including:

extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm, and vectorizing the extracted sentences or phrases according to a bag-of-words model to obtain vectorized sentences;

and carrying out weighted average on all the vectorized sentences to obtain a vectorized working specification.

In one embodiment, keyword extraction is performed on the pre-processed working specification according to a TF-IDF algorithm, including:

extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm to obtain the extracted sentences or phrases, the TF-IDF weight being

$$\mathrm{TFIDF}(w, s_i, D) = \mathrm{TF}(w, s_i) \times \mathrm{IDF}(w, D);$$

wherein $w$ represents a word, $s_i$ represents a sentence or phrase in the text of the work instruction, $D$ represents the whole work instruction, $\mathrm{TF}(w, s_i)$ represents the frequency with which the word $w$ occurs in the sentence or phrase $s_i$, and $\mathrm{IDF}(w, D)$ represents the inverse document frequency of the word $w$ throughout the work instruction.
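The TF-IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration and not the applicant's implementation: the normalization of the term frequency by sentence length, the logarithmic form of the inverse document frequency, and the names `tf_idf` and `top_keywords` are assumptions, since the source only names the TF and IDF terms.

```python
import math
from collections import Counter

def tf_idf(sentences):
    """Return one {word: weight} dict per tokenized sentence.

    Assumed forms (not stated in the source):
      tf  = occurrences of the word in the sentence / sentence length
      idf = log(number of sentences / sentences containing the word)
    """
    n = len(sentences)
    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter(word for s in sentences for word in set(s))
    weights = []
    for s in sentences:
        counts = Counter(s)
        weights.append({
            w: (c / len(s)) * math.log(n / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

def top_keywords(sentences, k=2):
    """Keep the k highest-weighted words of every sentence (ties by name)."""
    return [
        sorted(sc, key=lambda w: (-sc[w], w))[:k]
        for sc in tf_idf(sentences)
    ]

# Hypothetical, already segmented sentences from a work instruction.
sentences = [
    ["maintain", "server", "cluster"],
    ["write", "server", "report"],
    ["maintain", "network", "equipment"],
]
scores = tf_idf(sentences)
keywords = top_keywords(sentences, k=1)
```

Words that occur in fewer sentences receive a larger weight, so in the first sentence "cluster" outranks "server" and "maintain", which each appear in two sentences.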

In one embodiment, performing fuzzy calculation on keywords in the vectorized work instruction to obtain fuzzy factors of the keywords includes:

performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword as

$$u_{ij} = \frac{1}{\sum_{k=1}^{m} \left( \dfrac{d_{ij}}{d_{ik}} \right)^{\frac{2}{b-1}}};$$

wherein $d_{ij}$ represents the distance from the keyword $x_i$ to the $j$-th category center $c_j$, $d_{ik}$ represents the distance from the keyword $x_i$ to the $k$-th category center $c_k$, $m$ represents the total number of categories, and $b$ is the index of the fuzzy factor.

In one embodiment, setting a loss function of the fuzzy cluster using the fuzzy factor includes:

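setting the loss function of the fuzzy clustering by using the fuzzy factor as

$$J = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{\,b}\, d_{ij}^{2};$$

wherein $u_{ij}$ represents the fuzzy factor of the keyword $x_i$ for the $j$-th category, $d_{ij}$ represents the distance from the keyword $x_i$ to the $j$-th category center, $m$ represents the total number of categories, $b$ is the index of the fuzzy factor, and $n$ represents the total number of keywords.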
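The fuzzy factor and the fuzzy clustering loss can be illustrated with a short Python sketch. It assumes the standard fuzzy c-means forms (membership as a normalized inverse distance ratio, loss as the membership-weighted sum of squared distances) and Euclidean distance; the function names and the small `eps` guard against division by zero are hypothetical additions for illustration.

```python
import math

def fuzzy_factors(points, centers, b=2.0, eps=1e-12):
    """u[i][j] = 1 / sum_k (d_ij / d_ik) ** (2 / (b - 1)).

    Assumes Euclidean distance; eps avoids division by zero when a
    keyword coincides with a category center.
    """
    power = 2.0 / (b - 1.0)
    memberships = []
    for x in points:
        d = [max(math.dist(x, c), eps) for c in centers]
        memberships.append([
            1.0 / sum((d_j / d_k) ** power for d_k in d)
            for d_j in d
        ])
    return memberships

def fuzzy_loss(points, centers, memberships, b=2.0):
    """J = sum_i sum_j u_ij ** b * d_ij ** 2."""
    return sum(
        memberships[i][j] ** b * math.dist(x, c) ** 2
        for i, x in enumerate(points)
        for j, c in enumerate(centers)
    )

# Hypothetical 2-D keyword vectors and two category centers.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_factors(points, centers, b=2.0)
loss = fuzzy_loss(points, centers, u, b=2.0)
```

With `b = 2`, the keyword at distance 1 from the first center and 3 from the second receives fuzzy factors 0.9 and 0.1, and the fuzzy factors of every keyword sum to 1.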

In one embodiment, density clustering is performed on keywords after initial category assignment according to a density-based DBSCAN algorithm to obtain a keyword set corresponding to each category, including:

dividing the keywords after the initial category assignment into core points, boundary points and noise points; a core point is a data point whose neighborhood of radius $\varepsilon$, centered on the point itself, contains at least $MinPts$ data points, wherein $MinPts$ is a preset parameter; a boundary point is a data point that lies within the neighborhood of radius $\varepsilon$ of a core point but is not itself a core point; noise points are data points that are neither core points nor boundary points;

randomly selecting a keyword $x$ and judging whether it is a core point; if so, creating a new cluster and assigning the core point and all points density-reachable from it to the new cluster; if $x$ is not a core point but is a boundary point of some core point, assigning $x$ to the cluster corresponding to that core point; if $x$ is neither a core point nor a boundary point, marking $x$ as a noise point; and repeating until all keywords are classified, to obtain the keyword set corresponding to each category; wherein, for each keyword $x_i$, the neighborhood centered on $x_i$ with radius $\varepsilon$ is defined as $N_\varepsilon(x_i)$; if a keyword $x_j$ lies within the neighborhood of the keyword $x_i$, i.e., $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$; if there is a keyword sequence $p_1, p_2, \dots, p_t$ satisfying $p_1 = x_i$, $p_t = x_j$, and $p_{k+1}$ is a directly density-reachable point of $p_k$, then $x_j$ is called a density-reachable point of $x_i$.

A post portrait tag extraction device based on work instruction parsing, the device comprising:

the preprocessing module is used for acquiring a working specification of a relevant enterprise post; preprocessing the working instruction book to obtain a preprocessed working instruction book;

the vectorization processing module is used for vectorizing the preprocessed work instruction according to a natural language processing technology to obtain a vectorized work instruction;

the fuzzy clustering module is used for performing fuzzy calculation on keywords in the vectorized work instruction to obtain fuzzy factors of the keywords; setting a loss function of fuzzy clustering by using the fuzzy factors, and assigning the keywords according to the loss function to obtain an initial category to which each keyword belongs;

the density clustering module is used for carrying out density clustering on the keywords after the initial category distribution according to a DBSCAN algorithm based on density to obtain the center of each category and a keyword set corresponding to the center; and taking the keyword set as a post portrait tag.

According to the post portrait label extraction method and device based on work instruction analysis, the work instruction is preprocessed, and the preprocessed work instruction is vectorized according to a natural language processing technology. Vectorizing the content of the work instruction allows a large amount of data to be processed efficiently, which improves the efficiency of the algorithm. Fuzzy calculation is then performed on the keywords in the vectorized work instruction to obtain fuzzy factors of the keywords, a loss function of fuzzy clustering is set by using the fuzzy factors, and the keywords are assigned according to the loss function to obtain the initial category to which each keyword belongs. Because this step takes the ambiguity among keywords into account by designing fuzzy factors for the work content and tasks of the work instruction, the content of the work instruction can be analyzed and extracted more accurately and efficiently than with a traditional clustering algorithm, so that post portrait labels are generated automatically. Finally, density clustering is performed on the keywords after the initial category assignment according to the density-based DBSCAN algorithm. Further clustering each category in this way effectively handles uneven data distribution, irregular cluster shapes and noise, which improves the clustering accuracy and, in turn, the accuracy of the post portrait labels.

Drawings

FIG. 1 is a flow diagram of a post portrait tag extraction method based on work instruction parsing in one embodiment;

FIG. 2 is a block diagram of a post portrait tag extraction device based on work instruction parsing in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In one embodiment, as shown in fig. 1, a post portrait tag extraction method based on work instruction analysis is provided, which includes the following steps:

step 102, acquiring a working specification of a relevant enterprise post; and preprocessing the working instruction book to obtain a preprocessed working instruction book.

First, the work instructions of the posts of the relevant enterprises need to be collected; they can be obtained through channels on the Internet such as recruitment websites and official company websites. The work instructions then undergo text preprocessing, including text cleaning, word segmentation, part-of-speech tagging and stop-word processing, to remove semantically irrelevant words and numeric codes.

Step 104, vectorizing the preprocessed work instruction according to a natural language processing technology to obtain the vectorized work instruction.

The work instruction is processed using natural language processing technologies such as word2vec: information such as keywords, phrases and sentences in the work instruction is extracted, classified and vectorized. First, word segmentation is performed on the work instruction to convert its sentences and phrases into sequences of words, using a word segmentation tool such as jieba. The result of word segmentation is expressed as follows:

$$s_i = \{ w_{i1}, w_{i2}, \dots, w_{i n_i} \};$$

wherein $s_i$ represents the $i$-th sentence or phrase in the work instruction text, and $w_{i1}, w_{i2}, \dots, w_{i n_i}$ represent the words in that sentence or phrase.

Keywords are then extracted from the segmentation results so as to obtain the most important words in the text. After the keywords are obtained, each sentence or phrase is converted into vector form to facilitate subsequent classification and clustering. Vectorization uses a bag-of-words (BoW) model. The sentence vectorization formula is as follows:

$$\mathbf{v}_i = ( c_{i1}, c_{i2}, \dots, c_{iK} );$$

wherein $\mathbf{v}_i$ represents the vector representation of the $i$-th sentence or phrase in the work instruction text, and $c_{i1}, c_{i2}, \dots, c_{iK}$ represent the number of occurrences, throughout the document, of each keyword in the sentence or phrase.

The whole working specification needs to be converted into a vector form, so that subsequent task matching and similarity calculation are facilitated. All sentence vectors can be summarized in a weighted average manner to obtain a vectorized working specification.
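The bag-of-words vectorization and the weighted average over sentence vectors can be sketched as follows. This is an illustrative sketch under stated assumptions: counts are taken within each sentence, the weights are supplied by the caller (the source does not say how they are chosen), and the names `bag_of_words` and `weighted_average` are hypothetical.

```python
def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary keyword occurs in a tokenized sentence."""
    return [sentence.count(word) for word in vocabulary]

def weighted_average(vectors, weights):
    """Combine sentence vectors into a single document vector."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]

# Hypothetical keyword vocabulary and segmented sentences.
vocabulary = ["server", "report", "network"]
sentences = [
    ["maintain", "server", "server"],
    ["write", "server", "report"],
]
vectors = [bag_of_words(s, vocabulary) for s in sentences]
# Assumed weights, e.g. giving the second sentence three times the influence.
doc_vector = weighted_average(vectors, weights=[1.0, 3.0])
```

Here the two sentence vectors are `[2, 0, 0]` and `[1, 1, 0]`, and the weighted average yields the document vector `[1.25, 0.75, 0.0]`.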

Step 106, performing fuzzy calculation on the keywords in the vectorized work instruction to obtain fuzzy factors of the keywords; and setting a loss function of the fuzzy clustering by using the fuzzy factors, and assigning the keywords according to the loss function to obtain an initial category to which each keyword belongs.

To take the ambiguity between keywords into account, a fuzzy factor is introduced to represent the degree to which each keyword belongs to the different categories. In fuzzy clustering, each data point may belong to several categories, and the degree to which it belongs to each category is represented by the fuzzy factor. The keywords in the work instruction can therefore be assigned according to the fuzzy factors to obtain the initial category to which each keyword belongs. By designing fuzzy factors for the work content and tasks of the work instruction, the method can analyze and extract the content of the work instruction more accurately and efficiently than a traditional clustering algorithm, thereby realizing automatic generation of post portrait labels.
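One plausible way to turn fuzzy factors into initial categories is to give each keyword the category for which its fuzzy factor is largest. The sketch below assumes exactly that rule; the source only says that keywords are assigned according to the loss function, so the rule, the function names and the sample data are illustrative assumptions.

```python
def assign_initial_categories(memberships):
    """Pick, for every keyword, the category with the largest fuzzy factor.

    Ties are resolved in favor of the lowest category index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

def group_by_category(keywords, labels):
    """Collect the keywords assigned to each initial category."""
    groups = {}
    for word, label in zip(keywords, labels):
        groups.setdefault(label, []).append(word)
    return groups

# Hypothetical keywords with their fuzzy factors for two categories.
keywords = ["server", "report", "network"]
memberships = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = assign_initial_categories(memberships)
groups = group_by_category(keywords, labels)
```

The resulting groups can then be handed to the density clustering step, which refines each initial category.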

Step 108, performing density clustering on the keywords after the initial category distribution according to a density-based DBSCAN algorithm to obtain a center of each category and a keyword set corresponding to the center; and taking the keyword set as a post portrait tag.

The density clustering is carried out on the keywords after the initial category distribution according to the density-based DBSCAN algorithm, so that the problems of uneven data distribution, irregular clustering shape, noise and the like can be effectively solved, the clustering accuracy is improved, and the accuracy of post portrait labels is further improved.


In a specific embodiment, the preprocessing of the text data comprises the following operations.

Text cleaning: the text is first cleaned by removing useless information such as HTML tags, special characters, spaces, line breaks and non-Chinese characters, so that only meaningful text content is retained.

Text segmentation: sentences are segmented into individual words, which provides the basis for the subsequent part-of-speech tagging and stop-word processing. The method uses the jieba word segmentation tool for segmentation.

Part-of-speech tagging: the part of speech (noun, verb, etc.) of each word obtained by segmentation is determined. The method uses the part-of-speech tagging function of the jieba word segmentation tool.

Stop-word processing: stop words are words that occur with extremely high frequency but carry no actual meaning, for example "have" and similar function words. The method uses a preset stop-word list to filter the segmented words and remove meaningless words.
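The cleaning and stop-word filtering steps can be sketched with the Python standard library. This is a simplified illustration, not the described implementation: it works on English text, splits on whitespace by default, omits part-of-speech tagging, and uses a tiny hypothetical stop-word list. For Chinese text, a segmenter such as jieba's `lcut` function could be passed as the `segment` argument instead.

```python
import re

# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "to"}

def clean_text(raw):
    """Strip HTML tags, special characters and redundant whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)      # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)     # remove punctuation and symbols
    return re.sub(r"\s+", " ", text).strip() # collapse spaces and line breaks

def preprocess(raw, segment=str.split, stop_words=STOP_WORDS):
    """Clean, segment and filter a raw work-instruction text.

    `segment` is any callable mapping a string to a list of tokens.
    """
    tokens = segment(clean_text(raw).lower())
    return [t for t in tokens if t not in stop_words]

tokens = preprocess("<p>Maintain the servers,\n and write reports!</p>")
```

Keeping the segmenter as a parameter makes it possible to swap in a language-specific tool without touching the cleaning or filtering logic.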

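The loss function of the fuzzy clustering is set by using the fuzzy factor as

$$J = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{\,b}\, d_{ij}^{2};$$

wherein $u_{ij}$ represents the fuzzy factor of the keyword $x_i$ for the $j$-th category, $d_{ij}$ represents the distance from the keyword $x_i$ to the $j$-th category center, $m$ represents the total number of categories, $b$ is the index of the fuzzy factor, and $n$ represents the total number of keywords.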

In a specific embodiment, the density-based DBSCAN algorithm can effectively handle uneven data distribution, irregular cluster shapes, noise and similar problems. The basic idea of the algorithm is to represent the keywords as data points, divide the data points into core points, boundary points and noise points, and merge clusters through the connectivity between core points. The method comprises the following steps:

First, for each data point $x_i$, the neighborhood centered on $x_i$ with radius $\varepsilon$ is defined as $N_\varepsilon(x_i)$. If a data point $x_j$ lies within the neighborhood of $x_i$, i.e., $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$. If there is a data point sequence $p_1, p_2, \dots, p_t$ satisfying $p_1 = x_i$, $p_t = x_j$, and $p_{k+1}$ is a directly density-reachable point of $p_k$, then $x_j$ is called a density-reachable point of $x_i$. If there is a core point $c$ such that $x_i$ and $x_j$ are both density-reachable points of $c$, then $x_i$ and $x_j$ are called density-connected points. Based on the above definitions, data points can be divided into three categories: core points, boundary points and noise points. A core point is a data point whose neighborhood of radius $\varepsilon$, centered on the point itself, contains at least $MinPts$ data points, where $MinPts$ is a preset parameter. A boundary point is a data point that lies within the neighborhood of radius $\varepsilon$ of a core point but is not itself a core point. Noise points are data points that are neither core points nor boundary points.

Based on these definitions, the density-based DBSCAN algorithm, applied to the keywords carrying fuzzy factors, can be used for clustering. The clustering flow is as follows:

(1) Randomly select an unclassified data point x and judge whether it is a core point. If it is a core point, create a new cluster and assign the point and all points density-reachable from it to that cluster.

(2) If x is not a core point but is a boundary point of some core point, assign x to the cluster where that core point is located.

(3) If x is neither a core point nor a boundary point, mark it as a noise point.

(4) Repeat steps (1) to (3) until all data points are classified, obtaining keyword sets of a plurality of categories. The keyword set of each category is taken as the label of that category, and the labels are combined to obtain the content labels of the post portrait.
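The clustering flow above can be sketched as a compact, pure-Python DBSCAN. The sketch makes several assumptions that the source does not fix: Euclidean distance, a neighborhood that counts the point itself, iteration in index order rather than random order, and the hypothetical names `region_query` and `dbscan`. Noise points are labeled `-1`.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already classified
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later be claimed as a boundary point
            continue
        cluster += 1                      # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # boundary point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors) # j is also a core point: keep expanding
    return labels

# Hypothetical 2-D keyword vectors: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this example the first three points form one cluster, the next three form a second cluster, and the isolated point is marked as noise. Grouping the keywords by label then gives the keyword set of each category.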

In one embodiment, after the post portrait labels are generated, they are evaluated and corrected to ensure their accuracy and reliability. The application combines artificial intelligence technology with professional domain knowledge, auditing and modifying the labels manually or semi-automatically to improve their quality and credibility.

The post portrait label evaluation formula is as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN};$$

wherein $TP$ represents true positives, i.e., the number of samples that are actually positive and are correctly predicted as positive; $TN$ represents true negatives, i.e., the number of samples that are actually negative and are correctly predicted as negative; $FP$ represents false positives, i.e., the number of samples that are actually negative but are wrongly predicted as positive; and $FN$ represents false negatives, i.e., the number of samples that are actually positive but are wrongly predicted as negative.

The meaning and interpretation of the post portrait label evaluation indexes are as follows:

Accuracy: the ratio of the number of correctly predicted samples to the total number of samples. It is one of the important indexes for evaluating classifier performance; the higher the accuracy, the more accurate the predictions of the classifier.

True positives (TP): the number of samples that are actually positive and are correctly predicted as positive. In post portrait label evaluation, this is the number of labels correctly classified as features of the post.

True negatives (TN): the number of samples that are actually negative and are correctly predicted as negative. In post portrait label evaluation, this is the number of labels correctly classified as not being features of the post.

False positives (FP): the number of samples that are actually negative but are wrongly predicted as positive. In post portrait label evaluation, this is the number of labels wrongly classified as features of the post.

False negatives (FN): the number of samples that are actually positive but are wrongly predicted as negative. In post portrait label evaluation, this is the number of labels wrongly classified as not being features of the post.

The quality and reliability of the generated post portrait labels can be evaluated through the accuracy index. If the accuracy is high, the generated labels express the main characteristics and requirements of the post; conversely, if the accuracy is low, the labels may need further modification and optimization.
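The accuracy index can be computed from the four counts as in the following sketch. The way the counts are derived here, by comparing the generated labels with an expert-approved label set over a fixed list of candidate labels, is an assumption made for illustration; the function names and sample data are hypothetical.

```python
def confusion_counts(candidates, predicted, expert):
    """Count TP, TN, FP and FN over a list of candidate labels.

    predicted: labels generated for the post
    expert:    labels an expert considers true features of the post
    """
    tp = sum(1 for c in candidates if c in predicted and c in expert)
    tn = sum(1 for c in candidates if c not in predicted and c not in expert)
    fp = sum(1 for c in candidates if c in predicted and c not in expert)
    fn = sum(1 for c in candidates if c not in predicted and c in expert)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total

# Hypothetical candidate labels for one post.
candidates = ["python", "sql", "cooking", "reporting", "driving"]
predicted = {"python", "sql", "cooking"}
expert = {"python", "sql", "reporting"}

counts = confusion_counts(candidates, predicted, expert)
score = accuracy(*counts)
```

In this example two labels are true positives, one is a true negative, one is a false positive and one is a false negative, giving an accuracy of 0.6.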

It should be understood that, although the steps in the flowchart of fig. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 2, there is provided a post portrait tag extraction device based on work instruction parsing, including: a preprocessing module 202, a vectorization processing module 204, a fuzzy clustering module 206, and a density clustering module 208, wherein:

a preprocessing module 202, configured to obtain a working specification of a relevant enterprise post; preprocessing the working instruction book to obtain a preprocessed working instruction book;

the vectorization processing module 204 is configured to perform vectorization processing on the pre-processed working instruction according to a natural language processing technology, so as to obtain a vectorized working instruction;

the fuzzy clustering module 206 is configured to perform fuzzy calculation on the keywords in the vectorized work instruction to obtain fuzzy factors of the keywords; set a loss function of fuzzy clustering by using the fuzzy factors, and assign the keywords according to the loss function to obtain an initial category to which each keyword belongs;

the density clustering module 208 is configured to perform density clustering on the keywords after the initial category is allocated according to a density-based DBSCAN algorithm, so as to obtain a center of each category and a keyword set corresponding to the center; and taking the keyword set as a post portrait tag.

The specific limitation of the post portrait tag extraction device based on the analysis of the working specification can be referred to as the limitation of the post portrait tag extraction method based on the analysis of the working specification, and the description thereof is omitted here. The modules in the post portrait tag extraction device based on the analysis of the working specification can be all or partially realized by software, hardware and the combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A post portrait label extraction method based on work instruction analysis is characterized by comprising the following steps:

vectorizing the preprocessed work instruction according to a natural language processing technology to obtain a vectorized work instruction;

performing fuzzy calculation on keywords in the vectorized working specification to obtain fuzzy factors of the keywords; setting a loss function of fuzzy clustering by using the fuzzy factors, and distributing the keywords according to the loss function to obtain an initial category to which each keyword belongs;

2. The method of claim 1, wherein preprocessing the work instruction to obtain a preprocessed work instruction comprises:

and cleaning the text of the working instruction, removing useless information in the working instruction, performing word segmentation and part-of-speech tagging on the cleaned working instruction according to a jieba word segmentation tool, and performing stop word filtering to obtain the preprocessed working instruction.

3. The method of claim 1, wherein vectorizing the pre-processed work instruction according to natural language processing techniques to obtain a vectorized work instruction comprises:

extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm, and vectorizing the extracted sentences or phrases according to a bag-of-words model to obtain vectorized sentences;

4. A method according to claim 3, wherein keyword extraction of the pre-processed work instruction according to TF-IDF algorithm comprises:

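extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm to obtain the extracted sentences or phrases, the TF-IDF weight being

$$\mathrm{TFIDF}(w, s_i, D) = \mathrm{TF}(w, s_i) \times \mathrm{IDF}(w, D);$$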

5. The method of claim 1, wherein performing fuzzy computation on the keywords in the vectorized working specification to obtain fuzzy factors of the keywords comprises:

performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword as

$$u_{ij} = \frac{1}{\sum_{k=1}^{m} \left( \dfrac{d_{ij}}{d_{ik}} \right)^{\frac{2}{b-1}}};$$

wherein $d_{ij}$ represents the distance from the keyword $x_i$ to the $j$-th category center $c_j$, $d_{ik}$ represents the distance from the keyword $x_i$ to the $k$-th category center $c_k$, $m$ represents the total number of categories, and $b$ is the index of the fuzzy factor.

6. The method of claim 5, wherein setting a loss function of fuzzy clustering using the fuzzy factor comprises:

setting the loss function of the fuzzy clustering by using the fuzzy factor as

$$J = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{\,b}\, d_{ij}^{2};$$

wherein $n$ represents the total number of keywords.

7. The method of claim 5, wherein performing density clustering on the keywords after the initial category assignment according to a density-based DBSCAN algorithm to obtain a keyword set corresponding to each category comprises:

dividing the keywords after the initial category assignment into core points, boundary points and noise points; the core point is a data point whose neighborhood of radius $\varepsilon$, centered on the point itself, contains at least $MinPts$ data points, wherein $MinPts$ is a preset parameter; the boundary point is a data point that lies within the neighborhood of radius $\varepsilon$ of a core point but is not itself a core point; the noise points are data points that are neither core points nor boundary points;

randomly selecting a keyword $x$ and judging whether it is a core point; if so, creating a new cluster and assigning the core point and all points density-reachable from it to the new cluster; if $x$ is not a core point but is a boundary point of some core point, assigning $x$ to the cluster corresponding to that core point; if $x$ is neither a core point nor a boundary point, marking $x$ as a noise point; and repeating until all keywords are classified, to obtain the keyword set corresponding to each category; wherein, for each keyword $x_i$, the neighborhood centered on $x_i$ with radius $\varepsilon$ is defined as $N_\varepsilon(x_i)$; if a keyword $x_j$ lies within the neighborhood of the keyword $x_i$, i.e., $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$; if there is a keyword sequence $p_1, p_2, \dots, p_t$ satisfying $p_1 = x_i$, $p_t = x_j$, and $p_{k+1}$ is a directly density-reachable point of $p_k$, then $x_j$ is called a density-reachable point of $x_i$.

8. A post portrait label extraction device based on work instruction analysis, characterized in that the device comprises:

the fuzzy clustering module is used for carrying out fuzzy calculation on the keywords in the vectorized working specification to obtain fuzzy factors of the keywords; setting a loss function of fuzzy clustering by using the fuzzy factors, and distributing the keywords according to the loss function to obtain an initial category to which each keyword belongs;