CN116680590A - Post portrait label extraction method and device based on work instruction analysis - Google Patents

Post portrait label extraction method and device based on work instruction analysis

Info

Publication number
CN116680590A
CN116680590A CN202310941705.2A CN202310941705A CN116680590A CN 116680590 A CN116680590 A CN 116680590A CN 202310941705 A CN202310941705 A CN 202310941705A CN 116680590 A CN116680590 A CN 116680590A
Authority
CN
China
Prior art keywords
keywords
fuzzy
points
work instruction
working
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310941705.2A
Other languages
Chinese (zh)
Other versions
CN116680590B (en)
Inventor
王涛
沈大勇
张忠山
姚锋
刘晓路
杜永浩
闫俊刚
王沛
陈英武
吕济民
何磊
陈宇宁
陈盈果
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202310941705.2A
Publication of CN116680590A
Application granted
Publication of CN116680590B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/105Human resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a post portrait label extraction method and device based on work instruction analysis. The method comprises the following steps: acquiring the work instruction of a relevant enterprise post; preprocessing the work instruction to obtain a preprocessed work instruction; vectorizing the preprocessed work instruction using natural language processing technology to obtain a vectorized work instruction; performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword; setting a fuzzy clustering loss function using the fuzzy factors, and assigning the keywords according to the loss function to obtain the initial category to which each keyword belongs; performing density clustering on the keywords after initial category assignment using the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to that center; and taking the keyword sets as post portrait labels. The method can be used to extract post portrait labels.

Description

Post portrait label extraction method and device based on work instruction analysis
Technical Field
The application relates to the technical field of data processing, in particular to a post portrait label extraction method and device based on work instruction analysis.
Background
In recent years, with the continuous development and spread of big data and artificial intelligence technology, post portraits have become an important tool in enterprise recruitment, talent development, career planning, and related fields. Based on post portraits, an enterprise can understand the requirements and characteristics of its various posts more accurately and thereby formulate more effective recruitment strategies. For job seekers, post portraits help them identify suitable posts and improve their chances of a successful application.
However, the prior patent applications CN 201910068512 (a post portrait setting method, device and terminal device), CN 201910744021 (a post portrait generation method, device and electronic device), CN 201910192576 (a method, device, equipment and storage medium for matching post portraits with resume information), and CN 202011286200 (a post portrait generation method, device, equipment and storage medium) mainly address the generation and setting of post portraits and do not provide a method for extracting post portrait labels.
Disclosure of Invention
In view of the above, it is desirable to provide a post portrait label extraction method, apparatus, computer device, and storage medium that can extract post portrait labels based on work instruction analysis.
A post portrait label extraction method based on work instruction analysis comprises the following steps:
acquiring the work instruction of a relevant enterprise post; preprocessing the work instruction to obtain a preprocessed work instruction;
vectorizing the preprocessed work instruction using natural language processing technology to obtain a vectorized work instruction;
performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword; setting a fuzzy clustering loss function using the fuzzy factors, and assigning the keywords according to the loss function to obtain the initial category to which each keyword belongs;
performing density clustering on the keywords after initial category assignment using the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to that center; and taking the keyword sets as post portrait labels.
In one embodiment, preprocessing the work instruction to obtain a preprocessed work instruction includes:
cleaning the text of the work instruction to remove useless information, performing word segmentation and part-of-speech tagging on the cleaned work instruction with the jieba word segmentation tool, and performing stop-word filtering to obtain the preprocessed work instruction.
In one embodiment, vectorizing the preprocessed work instruction using natural language processing technology to obtain a vectorized work instruction includes:
extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm, and vectorizing the extracted sentences or phrases according to a bag-of-words model to obtain vectorized sentences;
performing a weighted average over all the vectorized sentences to obtain the vectorized work instruction.
In one embodiment, extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm includes:
extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm to obtain the extracted sentences or phrases as
$$\mathrm{TFIDF}(w, s_i, D) = \mathrm{TF}(w, s_i) \times \mathrm{IDF}(w, D)$$
where $w$ represents a word, $s_i$ represents a sentence or phrase in the text of the work instruction, $D$ represents the whole work instruction, $\mathrm{TF}(w, s_i)$ represents the frequency with which word $w$ occurs in sentence or phrase $s_i$, and $\mathrm{IDF}(w, D)$ represents the inverse document frequency of word $w$ over the whole work instruction.
In one embodiment, performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords includes:
performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword as
$$u_{ij} = \frac{1}{\sum_{k=1}^{m} \left( d_{ij} / d_{ik} \right)^{2/(b-1)}}$$
where $d_{ij}$ represents the distance from keyword $x_i$ to the $j$-th category center $c_j$, $d_{ik}$ represents the distance from keyword $x_i$ to the $k$-th category center $c_k$, $m$ represents the total number of categories, and $b$ is the exponent of the fuzzy factor.
In one embodiment, setting the fuzzy clustering loss function using the fuzzy factors includes:
setting the fuzzy clustering loss function using the fuzzy factors as
$$J = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{\,b} \, d_{ij}^{2}$$
where $n$ represents the total number of keywords.
In one embodiment, performing density clustering on the keywords after initial category assignment using the density-based DBSCAN algorithm to obtain the keyword set corresponding to each category includes:
dividing the keywords after initial category assignment into core points, boundary points, and noise points; a core point is a data point whose $\varepsilon$-neighborhood, centered on the point itself, contains at least $MinPts$ points, where $MinPts$ is a preset parameter; a boundary point is a data point that lies within the $\varepsilon$-radius neighborhood of a core point but is not itself a core point; a noise point is a data point that is neither a core point nor a boundary point;
randomly selecting a keyword $x$ and judging whether it is a core point; if so, creating a new cluster and assigning $x$ and all points density-reachable from it to the new cluster; if $x$ is not a core point but is a boundary point of some core point, assigning $x$ to the cluster of that core point; if $x$ is neither a core point nor a boundary point, marking $x$ as a noise point; repeating until all keywords are classified, to obtain the keyword set corresponding to each category; wherein, for each keyword $x_i$, the neighborhood of radius $\varepsilon$ centered on it is defined as $N_\varepsilon(x_i) = \{ x_j \mid d(x_i, x_j) \le \varepsilon \}$; if a keyword $x_j$ lies within the neighborhood of keyword $x_i$, i.e. $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$; if there is a keyword sequence $p_1, p_2, \ldots, p_t$ satisfying $p_1 = x_i$, $p_t = x_j$, and each $p_{k+1}$ is a directly density-reachable point of $p_k$, then $x_j$ is called a density-reachable point of $x_i$.
A post portrait label extraction device based on work instruction analysis, the device comprising:
a preprocessing module, configured to acquire the work instruction of a relevant enterprise post and preprocess the work instruction to obtain a preprocessed work instruction;
a vectorization processing module, configured to vectorize the preprocessed work instruction using natural language processing technology to obtain a vectorized work instruction;
a fuzzy clustering module, configured to perform fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword, set a fuzzy clustering loss function using the fuzzy factors, and assign the keywords according to the loss function to obtain the initial category to which each keyword belongs;
a density clustering module, configured to perform density clustering on the keywords after initial category assignment using the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to that center, and to take the keyword sets as post portrait labels.
With the above post portrait label extraction method and device based on work instruction analysis, the work instruction is first preprocessed, and the preprocessed work instruction is then vectorized using natural language processing technology. Vectorizing the content of the work instruction allows large amounts of data to be processed efficiently, which improves the efficiency of the algorithm. Fuzzy calculation is then performed on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword; a fuzzy clustering loss function is set using the fuzzy factors, and the keywords are assigned according to the loss function to obtain the initial category of each keyword. Because this step accounts for the ambiguity between keywords by designing fuzzy factors for the work content and tasks in the work instruction, the content of the work instruction can be analyzed and extracted more accurately and efficiently than with a traditional clustering algorithm, enabling automatic generation of post portrait labels. Finally, density clustering is performed on the keywords after initial category assignment using the density-based DBSCAN algorithm. Further clustering each category in this way effectively handles uneven data distribution, irregular cluster shapes, and noise, which improves clustering accuracy and, in turn, the accuracy of the post portrait labels.
Drawings
FIG. 1 is a flow diagram of a post portrait label extraction method based on work instruction analysis in one embodiment;
FIG. 2 is a block diagram of a post portrait label extraction device based on work instruction analysis in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a post portrait label extraction method based on work instruction analysis is provided, which includes the following steps:
Step 102: acquire the work instruction of a relevant enterprise post, and preprocess the work instruction to obtain a preprocessed work instruction.
First, the work instructions of the relevant enterprise posts are collected; they can be obtained from recruitment websites, official company websites, and other Internet channels. The text of each work instruction is then preprocessed, including text cleaning, word segmentation, part-of-speech tagging, and stop-word processing, to remove semantically irrelevant words and numeric codes.
Step 104: vectorize the preprocessed work instruction using natural language processing technology to obtain the vectorized work instruction.
The work instruction is processed using natural language processing techniques such as word2vec to extract keywords, phrases, sentences, and other information, which are then classified and vectorized. First, the work instruction is segmented so that its sentences and phrases are converted into sequences of words. Segmentation is performed with a word segmentation tool such as jieba. Word segmentation is expressed as:
$$s_i = \{ w_{i1}, w_{i2}, \ldots, w_{in} \}$$
where $s_i$ represents the $i$-th sentence or phrase in the work instruction text, and $w_{i1}, w_{i2}, \ldots, w_{in}$ represent the words in that sentence or phrase.
Keywords are then extracted from the segmentation results to identify the most important words in the text. Once the keywords are obtained, each sentence or phrase is converted into vector form to facilitate subsequent classification and clustering. Vectorization uses a bag-of-words (BoW) model. The sentence vectorization formula is:
$$v_i = ( c_{i1}, c_{i2}, \ldots, c_{ik} )$$
where $v_i$ is the vector representation of the $i$-th sentence or phrase in the work instruction text, and $c_{i1}, c_{i2}, \ldots, c_{ik}$ represent the number of occurrences, in the whole document, of each keyword in that sentence or phrase.
Finally, the whole work instruction is converted into vector form to facilitate subsequent task matching and similarity calculation. All sentence vectors can be combined by weighted average to obtain the vectorized work instruction.
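The bag-of-words and weighted-average steps can be illustrated with a minimal sketch. The function names, the sample vocabulary, and the choice of weighting each sentence by its token count are assumptions made for illustration; the source does not state how the weights are chosen, and this sketch counts occurrences within each sentence rather than across the whole document.

```python
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary keyword occurs in one tokenized sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def weighted_average(vectors, weights):
    """Combine sentence vectors into one document vector by weighted average."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[k] for v, w in zip(vectors, weights)) / total
        for k in range(dim)
    ]


# Hypothetical keyword vocabulary and tokenized sentences.
vocabulary = ["python", "testing", "design"]
sentences = [["python", "design", "python"], ["testing", "design"]]

vectors = [bow_vector(s, vocabulary) for s in sentences]
# Assumption: each sentence is weighted by its length in tokens.
weights = [len(s) for s in sentences]
doc_vector = weighted_average(vectors, weights)
```

With other weighting schemes (for example, uniform weights or TF-IDF totals per sentence) only the `weights` list changes.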
Step 106: perform fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword; set a fuzzy clustering loss function using the fuzzy factors, and assign the keywords according to the loss function to obtain the initial category to which each keyword belongs.
To take the ambiguity between keywords into account, a fuzzy factor is introduced to represent the degree to which each keyword belongs to the different categories. In fuzzy clustering, each data point may belong to several categories, and its degree of membership in each category is represented by the fuzzy factor. The keywords in the work instruction can therefore be assigned according to their fuzzy factors to obtain the initial category of each keyword. By designing fuzzy factors for the work content and tasks of the work instruction, the method can analyze and extract the content of the work instruction more accurately and efficiently than a traditional clustering algorithm, enabling automatic generation of post portrait labels.
Step 108: perform density clustering on the keywords after initial category assignment using the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to that center, and take the keyword sets as post portrait labels.
Density clustering of the keywords after initial category assignment with the density-based DBSCAN algorithm effectively handles uneven data distribution, irregular cluster shapes, and noise, which improves clustering accuracy and, in turn, the accuracy of the post portrait labels.
In the above post portrait label extraction method based on work instruction analysis, the work instruction is first preprocessed, and the preprocessed work instruction is then vectorized using natural language processing technology. Vectorizing the content of the work instruction allows large amounts of data to be processed efficiently, which improves the efficiency of the algorithm. Fuzzy calculation is then performed on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword; a fuzzy clustering loss function is set using the fuzzy factors, and the keywords are assigned according to the loss function to obtain the initial category of each keyword. Because this step accounts for the ambiguity between keywords by designing fuzzy factors for the work content and tasks in the work instruction, the content of the work instruction can be analyzed and extracted more accurately and efficiently than with a traditional clustering algorithm, enabling automatic generation of post portrait labels. Finally, density clustering is performed on the keywords after initial category assignment using the density-based DBSCAN algorithm. Further clustering each category in this way effectively handles uneven data distribution, irregular cluster shapes, and noise, which improves clustering accuracy and, in turn, the accuracy of the post portrait labels.
In one embodiment, preprocessing the work instruction to obtain a preprocessed work instruction includes:
cleaning the text of the work instruction to remove useless information, performing word segmentation and part-of-speech tagging on the cleaned work instruction with the jieba word segmentation tool, and performing stop-word filtering to obtain the preprocessed work instruction.
In a specific embodiment, preprocessing proceeds as follows:
Text cleaning: the text data is first cleaned by removing useless information such as HTML tags, special characters, spaces, line breaks, and non-Chinese characters, so that only meaningful text content is retained.
Word segmentation: sentences are split into individual words, which provides the basis for the subsequent part-of-speech tagging and stop-word processing. The method uses the jieba word segmentation tool.
Part-of-speech tagging: the part of speech (noun, verb, and so on) of each segmented word is determined. The method uses the part-of-speech tagging function of the jieba word segmentation tool.
Stop-word processing: stop words are words that occur extremely frequently but carry no real meaning, such as "or" and "having". The method filters the segmented words against a preset stop-word list to remove such meaningless words.
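The cleaning and stop-word filtering steps can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the regular expressions, the function names, and the sample stop-word set are hypothetical, and the tokenizer is passed in as a parameter so that the sketch runs without third-party packages. In practice a segmenter such as `jieba.lcut` could be supplied instead of the whitespace split used here; part-of-speech tagging is omitted.

```python
import re

# Assumed patterns: HTML tags, and any run of characters outside the
# basic CJK Unified Ideographs range.
TAG_RE = re.compile(r"<[^>]+>")
NON_CJK_RE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_text(raw):
    """Strip HTML tags, then collapse every run of non-Chinese characters to one space."""
    no_tags = TAG_RE.sub(" ", raw)
    return NON_CJK_RE.sub(" ", no_tags).strip()


def preprocess(raw, tokenize, stop_words):
    """Clean the text, segment it with the supplied tokenizer, and drop stop words."""
    tokens = tokenize(clean_text(raw))
    return [t for t in tokens if t and t not in stop_words]


# Demo input is already space-separated, so str.split stands in for a real segmenter.
raw = "<p>负责 软件 的 开发 和 测试 123!</p>"
tokens = preprocess(raw, str.split, {"的", "和"})
```

Keeping the tokenizer pluggable separates the cleaning rules, which are plain string operations, from the segmentation tool, which depends on an external dictionary.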
In one embodiment, vectorizing the preprocessed work instruction using natural language processing technology to obtain a vectorized work instruction includes:
extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm, and vectorizing the extracted sentences or phrases according to a bag-of-words model to obtain vectorized sentences;
performing a weighted average over all the vectorized sentences to obtain the vectorized work instruction.
In one embodiment, extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm includes:
extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm to obtain the extracted sentences or phrases as
$$\mathrm{TFIDF}(w, s_i, D) = \mathrm{TF}(w, s_i) \times \mathrm{IDF}(w, D)$$
where $w$ represents a word, $s_i$ represents a sentence or phrase in the text of the work instruction, $D$ represents the whole work instruction, $\mathrm{TF}(w, s_i)$ represents the frequency with which word $w$ occurs in sentence or phrase $s_i$, and $\mathrm{IDF}(w, D)$ represents the inverse document frequency of word $w$ over the whole work instruction.
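A minimal sketch of TF-IDF keyword scoring is given below, assuming the usual product of term frequency and inverse document frequency. The smoothed IDF variant, the function names, and the sample sentences are assumptions for illustration; the source does not specify which IDF variant is used.

```python
import math
from collections import Counter


def tf(word, sentence):
    """Relative frequency of the word within one tokenized sentence."""
    return Counter(sentence)[word] / len(sentence)


def idf(word, document):
    """Smoothed inverse document frequency over the sentences of the document.

    Assumption: ln((1 + N) / (1 + df)) + 1, which avoids division by zero.
    """
    containing = sum(1 for s in document if word in s)
    return math.log((1 + len(document)) / (1 + containing)) + 1


def top_keywords(sentence, document, k):
    """Return the k highest-scoring words of a sentence (ties broken alphabetically)."""
    scores = {w: tf(w, sentence) * idf(w, document) for w in set(sentence)}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical tokenized work instruction: one list per sentence.
document = [
    ["responsible", "python", "development"],
    ["responsible", "database", "maintenance"],
    ["responsible", "team", "communication"],
]
keywords = top_keywords(document[0], document, 2)
```

A word such as "responsible", which appears in every sentence, receives the lowest IDF and therefore drops out of the top keywords.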
In one embodiment, performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords includes:
performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factor of each keyword as
$$u_{ij} = \frac{1}{\sum_{k=1}^{m} \left( d_{ij} / d_{ik} \right)^{2/(b-1)}}$$
where $d_{ij}$ represents the distance from keyword $x_i$ to the $j$-th category center $c_j$, $d_{ik}$ represents the distance from keyword $x_i$ to the $k$-th category center $c_k$, $m$ represents the total number of categories, and $b$ is the exponent of the fuzzy factor.
In one embodiment, setting the fuzzy clustering loss function using the fuzzy factors includes:
setting the fuzzy clustering loss function using the fuzzy factors as
$$J = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{\,b} \, d_{ij}^{2}$$
where $n$ represents the total number of keywords.
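The fuzzy factor and loss function can be illustrated with a short sketch. It assumes the standard fuzzy c-means form, in which the membership of a point in category j is the reciprocal of the sum over k of (d_j / d_k) raised to 2/(b-1), and the loss is the sum of membership^b times squared distance. The function names, the Euclidean distance, the fixed centers, and the rule of assigning each keyword to its highest-membership category are assumptions; a full fuzzy clustering would also update the centers iteratively, which is omitted here.

```python
import math


def memberships(point, centers, b=2.0):
    """Fuzzy factors of one point for every category center (requires b > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (b - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]


def loss(points, centers, b=2.0):
    """Sum over points and centers of membership**b times squared distance."""
    total = 0.0
    for p in points:
        u = memberships(p, centers, b)
        total += sum((uj ** b) * math.dist(p, c) ** 2 for uj, c in zip(u, centers))
    return total


def initial_category(point, centers, b=2.0):
    """Assumed assignment rule: the category with the largest fuzzy factor."""
    u = memberships(point, centers, b)
    return max(range(len(centers)), key=lambda j: u[j])


# Hypothetical two-dimensional keyword vector and category centers.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = memberships((1.0, 0.0), centers)
category = initial_category((1.0, 0.0), centers)
total_loss = loss([(1.0, 0.0)], centers)
```

For the sample point the distances are 1 and 3, so with b = 2 the memberships are 0.9 and 0.1 and always sum to 1.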
In one embodiment, performing density clustering on the keywords after initial category assignment using the density-based DBSCAN algorithm to obtain the keyword set corresponding to each category includes:
dividing the keywords after initial category assignment into core points, boundary points, and noise points; a core point is a data point whose $\varepsilon$-neighborhood, centered on the point itself, contains at least $MinPts$ points, where $MinPts$ is a preset parameter; a boundary point is a data point that lies within the $\varepsilon$-radius neighborhood of a core point but is not itself a core point; a noise point is a data point that is neither a core point nor a boundary point;
randomly selecting a keyword $x$ and judging whether it is a core point; if so, creating a new cluster and assigning $x$ and all points density-reachable from it to the new cluster; if $x$ is not a core point but is a boundary point of some core point, assigning $x$ to the cluster of that core point; if $x$ is neither a core point nor a boundary point, marking $x$ as a noise point; repeating until all keywords are classified, to obtain the keyword set corresponding to each category; wherein, for each keyword $x_i$, the neighborhood of radius $\varepsilon$ centered on it is defined as $N_\varepsilon(x_i) = \{ x_j \mid d(x_i, x_j) \le \varepsilon \}$; if a keyword $x_j$ lies within the neighborhood of keyword $x_i$, i.e. $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$; if there is a keyword sequence $p_1, p_2, \ldots, p_t$ satisfying $p_1 = x_i$, $p_t = x_j$, and each $p_{k+1}$ is a directly density-reachable point of $p_k$, then $x_j$ is called a density-reachable point of $x_i$.
In a specific embodiment, the density-based DBSCAN algorithm effectively handles uneven data distribution, irregular cluster shapes, and noise. The basic idea of the algorithm is to represent keywords as data points, divide the data points into core points, boundary points, and noise points, and merge clusters through the connectivity between core points, as follows:
First, for each data point $x_i$, the neighborhood of radius $\varepsilon$ centered on it is defined as $N_\varepsilon(x_i) = \{ x_j \mid d(x_i, x_j) \le \varepsilon \}$. If a data point $x_j$ lies within the neighborhood of $x_i$, i.e. $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$. If there is a sequence of data points $p_1, p_2, \ldots, p_t$ satisfying $p_1 = x_i$, $p_t = x_j$, and each $p_{k+1}$ is a directly density-reachable point of $p_k$, then $x_j$ is called a density-reachable point of $x_i$. If there is a core point $c$ such that $x_i$ and $x_j$ are both density-reachable points of $c$, then $x_i$ and $x_j$ are called density-connected points. Based on these definitions, data points fall into three categories: core points, boundary points, and noise points. A core point is a data point whose $\varepsilon$-neighborhood, centered on the point itself, contains at least $MinPts$ points, where $MinPts$ is a preset parameter. A boundary point is a data point that lies within the $\varepsilon$-radius neighborhood of a core point but is not itself a core point. A noise point is a data point that is neither a core point nor a boundary point.
Based on these definitions, clustering can be performed with the density-based DBSCAN algorithm carrying the fuzzy factor. The clustering procedure is as follows:
(1) Randomly select an unclassified data point $x$ and judge whether it is a core point. If it is, create a new cluster and assign the point and all points density-reachable from it to that cluster.
(2) If $x$ is not a core point but is a boundary point of some core point, assign $x$ to the cluster containing that core point.
(3) If $x$ is neither a core point nor a boundary point, mark it as a noise point.
(4) Repeat steps (1) to (3) until all data points are classified, yielding keyword sets for several categories. The keyword set of each category serves as the label of that category, and together these give the content labels of the post portrait.
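The clustering procedure can be sketched in plain Python as below. This is a textbook DBSCAN under stated assumptions, not the patented variant: it does not model how the fuzzy factor is carried into the clustering, it visits points in index order rather than at random so that the result is reproducible, and the names `dbscan`, `region_query`, `eps`, and `min_pts` are chosen for illustration.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE marks noise points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may be relabeled later as a boundary point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # boundary point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical two-dimensional keyword vectors: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Grouping the keywords by their entry in `labels` then yields one keyword set per category, with noise points left out.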
In one embodiment, after the post portrait labels are generated, they are evaluated and corrected to ensure their accuracy and reliability. The application combines artificial intelligence technology with domain expertise, reviewing and modifying the labels manually or semi-automatically to improve their quality and credibility.
The post portrait label evaluation formula is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$ represents true positives, i.e., the number of samples that are actually positive and are correctly predicted as positive; $TN$ represents true negatives, i.e., the number of samples that are actually negative and are correctly predicted as negative; $FP$ represents false positives, i.e., the number of samples that are actually negative but are wrongly predicted as positive; and $FN$ represents false negatives, i.e., the number of samples that are actually positive but are wrongly predicted as negative.
The meaning and interpretation of the post portrait tag evaluation indices are as follows:
Accuracy: the ratio of the number of correctly predicted samples to the total number of samples. It is one of the important indices for evaluating the performance of a classifier; the higher the accuracy, the more accurate the classifier's predictions.
True positives (TP): the number of samples that are actually positive examples and are correctly predicted as positive examples. In post portrait tag evaluation, this is the number of samples in which a tag is correctly classified as a feature of the post.
True negatives (TN): the number of samples that are actually negative examples and are correctly predicted as negative examples. In post portrait tag evaluation, this is the number of samples in which a tag is correctly classified as not being a feature of the post.
False positives (FP): the number of samples that are actually negative examples but are mispredicted as positive examples. In post portrait tag evaluation, this is the number of samples in which a tag is misclassified as a feature of the post.
False negatives (FN): the number of samples that are actually positive examples but are mispredicted as negative examples. In post portrait tag evaluation, this is the number of samples in which a tag is misclassified as not being a feature of the post.
The quality and reliability of the generated post portrait tags can be evaluated through the accuracy index. If the accuracy is high, the generated tags express the main characteristics and requirements of the post well; conversely, if the accuracy is low, the tags may require further modification and optimization.
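As an illustration of the accuracy index described above, the following minimal sketch counts TP, TN, FP and FN from two parallel lists of boolean judgments and computes the ratio of correct predictions to all samples. The function names `confusion_counts` and `accuracy`, and the boolean encoding of "tag is a feature of the post", are assumptions made for this example and do not come from the application.

```python
def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN for parallel lists of boolean labels.

    predicted[i] is True if the tag was classified as a feature of the post;
    actual[i] is True if the tag really is a feature of the post
    (for example, according to an expert review).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


if __name__ == "__main__":
    predicted = [True, True, False, False, True]
    actual = [True, False, False, True, True]
    print(accuracy(*confusion_counts(predicted, actual)))
```

Accuracy alone can be misleading when positive and negative examples are strongly imbalanced, so in practice it may be worth inspecting the four counts as well.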
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 2, there is provided a post portrait tag extraction device based on work instruction parsing, including: a preprocessing module 202, a vectorization processing module 204, a fuzzy clustering module 206, and a density clustering module 208, wherein:
a preprocessing module 202, configured to acquire the work instruction of the relevant enterprise post, and to preprocess the work instruction to obtain a preprocessed work instruction;
a vectorization processing module 204, configured to vectorize the preprocessed work instruction according to natural language processing technology to obtain a vectorized work instruction;
a fuzzy clustering module 206, configured to perform fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords, to set a loss function of fuzzy clustering using the fuzzy factors, and to assign the keywords according to the loss function to obtain the initial category to which each keyword belongs;
a density clustering module 208, configured to perform density clustering on the keywords after the initial category assignment according to the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to the center, and to take the keyword set as the post portrait tag.
For the specific limitations of the post portrait tag extraction device based on work instruction analysis, reference may be made to the limitations of the post portrait tag extraction method based on work instruction analysis described above, which are not repeated here. Each module in the device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to each module.
Those skilled in the art will appreciate that all or part of the above methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above examples represent only a few embodiments of the application. They are described in relative detail, but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A post portrait label extraction method based on work instruction analysis, characterized by comprising the following steps:
acquiring the work instruction of a relevant enterprise post; preprocessing the work instruction to obtain a preprocessed work instruction;
vectorizing the preprocessed work instruction according to natural language processing technology to obtain a vectorized work instruction;
performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords; setting a loss function of fuzzy clustering using the fuzzy factors, and assigning the keywords according to the loss function to obtain the initial category to which each keyword belongs;
performing density clustering on the keywords after the initial category assignment according to the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to the center; and taking the keyword set as the post portrait tag.
2. The method of claim 1, wherein preprocessing the work instruction to obtain a preprocessed work instruction comprises:
cleaning the text of the work instruction to remove useless information, performing word segmentation and part-of-speech tagging on the cleaned work instruction using the jieba word segmentation tool, and performing stop word filtering to obtain the preprocessed work instruction.
3. The method of claim 1, wherein vectorizing the preprocessed work instruction according to natural language processing technology to obtain a vectorized work instruction comprises:
extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm, and vectorizing the extracted sentences or phrases according to a bag-of-words model to obtain vectorized sentences;
and performing a weighted average over all the vectorized sentences to obtain the vectorized work instruction.
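The bag-of-words vectorization and weighted averaging recited in this claim can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, sentences are assumed to be already segmented into token lists, and the sentence weights are supplied by the caller because the text does not state how they are chosen.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Collect every distinct token, sorted so vector positions are stable."""
    return sorted({word for tokens in sentences for word in tokens})


def bag_of_words(tokens, vocabulary):
    """Represent one sentence as a vector of token counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def weighted_average(vectors, weights):
    """Combine sentence vectors into one document vector by weighted average."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per sentence vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * vec[k] for vec, w in zip(vectors, weights)) / total
        for k in range(dim)
    ]


if __name__ == "__main__":
    sentences = [["operate", "radar", "operate"], ["radar", "maintain"]]
    vocab = build_vocabulary(sentences)
    vectors = [bag_of_words(s, vocab) for s in sentences]
    # Assumed weighting for the example: sentence length in tokens.
    print(weighted_average(vectors, [len(s) for s in sentences]))
```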
4. The method of claim 3, wherein extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm comprises:
extracting keywords from the preprocessed work instruction according to the TF-IDF algorithm, the extracted sentences or phrases being obtained as
$$\mathrm{TF\text{-}IDF}(w, s, D) = \mathrm{tf}(w, s) \times \mathrm{idf}(w, D)$$
wherein $w$ represents a word, $s$ represents a sentence or phrase in the text of the work instruction, $D$ represents the whole work instruction, $\mathrm{tf}(w, s)$ represents the frequency with which word $w$ occurs in sentence or phrase $s$, and $\mathrm{idf}(w, D)$ represents the inverse document frequency of word $w$ throughout the work instruction.
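A minimal sketch of TF-IDF scoring over the sentences of one work instruction is given below. It assumes the common variants tf = (count of the word in the sentence) / (sentence length) and idf = log(N / df), where N is the number of sentences and df is the number of sentences containing the word; the text does not fix these variants, and the function name `tf_idf_scores` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(sentences):
    """Return one {word: score} dict per sentence.

    sentences is a list of token lists (already segmented and filtered).
    Assumed variant: tf = count / sentence length, idf = log(N / df).
    """
    n = len(sentences)
    # Document frequency: in how many sentences each word appears.
    df = Counter()
    for tokens in sentences:
        df.update(set(tokens))
    scores = []
    for tokens in sentences:
        counts = Counter(tokens)
        length = len(tokens)
        scores.append({
            word: (count / length) * math.log(n / df[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(score_dict, k):
    """Pick the k highest-scoring words of one sentence."""
    return sorted(score_dict, key=score_dict.get, reverse=True)[:k]


if __name__ == "__main__":
    sentences = [["operate", "radar", "radar"], ["operate", "pump"]]
    for s in tf_idf_scores(sentences):
        print(top_keywords(s, 1), s)
```

With this variant, a word that appears in every sentence receives a score of zero, which is the intended effect of the inverse document frequency term.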
5. The method of claim 1, wherein performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords comprises:
performing fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords as follows
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}}$$
wherein $d_{ij}$ represents the distance from keyword $x_i$ to the $j$-th category center $v_j$, $d_{ik}$ represents the distance from keyword $x_i$ to the $k$-th category center $v_k$, $c$ represents the total number of categories, and $m$ is the index of the fuzzy factor.
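The fuzzy factor can be illustrated with the following minimal sketch. It assumes the standard fuzzy c-means membership, in which the membership of a keyword in category j is 1 divided by the sum over all categories k of (d_j / d_k) raised to 2 / (m - 1), with Euclidean distance between the keyword vector and the category centers. The function name, the Euclidean distance and the handling of a keyword that coincides with a center are assumptions of this example.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Membership of one keyword vector in every category.

    point   -- keyword vector (sequence of floats)
    centers -- list of category center vectors
    m       -- index of the fuzzy factor, must be greater than 1
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    dists = [math.dist(point, center) for center in centers]
    # A keyword lying exactly on one or more centers belongs fully to them
    # (assumed convention; it avoids division by zero).
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
        for d_j in dists
    ]


if __name__ == "__main__":
    # Distances to the two centers are 1 and 3.
    print(fuzzy_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)]))
```

The memberships of one keyword sum to 1, and a larger m spreads the membership more evenly across categories.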
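6. The method of claim 5, wherein setting a loss function of fuzzy clustering using the fuzzy factors comprises:
setting the loss function of the fuzzy clustering, using the fuzzy factors, as
$$J = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m} \, d_{ij}^{2}$$
wherein $n$ represents the total number of keywords, and $u_{ij}$, $d_{ij}$, $c$ and $m$ are as defined in claim 5.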
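As a sketch of how such a loss could be evaluated and used to assign initial categories, the code below assumes the standard fuzzy c-means objective, i.e. the sum over all keywords and all categories of the membership raised to the power m times the squared Euclidean distance to the category center. Taking the category with the largest membership as the initial category is also an assumption of this example, as are the function names.

```python
import math


def fcm_loss(points, centers, memberships, m=2.0):
    """Sum over keywords i and categories j of u_ij**m * d_ij**2."""
    return sum(
        memberships[i][j] ** m * math.dist(x, v) ** 2
        for i, x in enumerate(points)
        for j, v in enumerate(centers)
    )


def initial_categories(memberships):
    """Assign each keyword to the category with the largest membership."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


if __name__ == "__main__":
    points = [(0.0, 0.0), (4.0, 0.0)]
    centers = [(1.0, 0.0), (3.0, 0.0)]
    u = [[0.9, 0.1], [0.1, 0.9]]
    print(fcm_loss(points, centers, u), initial_categories(u))
```

In a full fuzzy c-means loop, memberships and centers would be updated alternately until this loss stops decreasing; only the evaluation step is shown here.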
7. The method of claim 5, wherein performing density clustering on the keywords after the initial category assignment according to the density-based DBSCAN algorithm to obtain a keyword set corresponding to each category comprises:
dividing the keywords after the initial category assignment into core points, boundary points and noise points; a core point is a data point whose neighborhood, centered on the point itself with radius $\varepsilon$, contains at least $MinPts$ data points, wherein $\varepsilon$ and $MinPts$ are preset parameters; a boundary point is a data point that lies within the $\varepsilon$-radius neighborhood of a core point but is not itself a core point; a noise point is a data point that is neither a core point nor a boundary point;
randomly selecting a keyword $x$ and judging whether it is a core point; if so, creating a new cluster and classifying $x$ and all points density-reachable from it into the new cluster; if $x$ is not a core point but is a boundary point of a certain core point, classifying $x$ into the cluster corresponding to that core point; if $x$ is neither a core point nor a boundary point, marking $x$ as a noise point; and repeating until all keywords are classified, to obtain the keyword set corresponding to each category; wherein, for each keyword $x_i$, the neighborhood centered on it with radius $\varepsilon$ is defined as $N_\varepsilon(x_i) = \{ x_j \mid d(x_i, x_j) \le \varepsilon \}$; if a keyword $x_j$ lies within the neighborhood of keyword $x_i$, i.e., $x_j \in N_\varepsilon(x_i)$, then $x_j$ is called a directly density-reachable point of $x_i$; if there is a keyword sequence $p_1, p_2, \ldots, p_n$ satisfying $p_1 = x_i$, $p_n = x_j$, and $p_{t+1}$ is a directly density-reachable point of $p_t$, then $x_j$ is called a density-reachable point of $x_i$.
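The density clustering procedure can be sketched in plain Python as follows. This is an illustrative implementation of the textbook DBSCAN algorithm under stated assumptions: Euclidean distance, a neighborhood that includes the point itself, points visited in index order rather than randomly, and the hypothetical names `dbscan`, `eps` and `min_pts`. It does not use the initial categories from fuzzy clustering, since the text does not specify how they enter this step.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE marks noise points."""
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a boundary point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # boundary point: joins, is not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is also a core point: keep expanding
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

The keyword set of each category is then the set of keywords sharing a cluster id; keywords labeled as noise belong to no category.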
8. A post portrait label extraction device based on work instruction analysis, characterized in that the device comprises:
a preprocessing module, configured to acquire the work instruction of a relevant enterprise post, and to preprocess the work instruction to obtain a preprocessed work instruction;
a vectorization processing module, configured to vectorize the preprocessed work instruction according to natural language processing technology to obtain a vectorized work instruction;
a fuzzy clustering module, configured to perform fuzzy calculation on the keywords in the vectorized work instruction to obtain the fuzzy factors of the keywords, to set a loss function of fuzzy clustering using the fuzzy factors, and to assign the keywords according to the loss function to obtain the initial category to which each keyword belongs;
a density clustering module, configured to perform density clustering on the keywords after the initial category assignment according to the density-based DBSCAN algorithm to obtain the center of each category and the keyword set corresponding to the center, and to take the keyword set as the post portrait tag.
CN202310941705.2A 2023-07-28 2023-07-28 Post portrait label extraction method and device based on work instruction analysis Active CN116680590B (en)

Publications (2)

Publication Number Publication Date
CN116680590A true CN116680590A (en) 2023-09-01
CN116680590B CN116680590B (en) 2023-10-20

Family

ID=87784012

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant