CN110263341B - Method for mining and locating personal ability from text - Google Patents


Info

Publication number
CN110263341B
Authority
CN
China
Prior art keywords
capability
word stock
name
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910538161.9A
Other languages
Chinese (zh)
Other versions
CN110263341A (en)
Inventor
吴漾
王鹏宇
缪新萍
杨箴
周玲
田钺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Power Grid Co Ltd
Original Assignee
Guizhou Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-06-20
Filing date
2019-06-20
Publication date
2023-06-20
Application filed by Guizhou Power Grid Co Ltd
Priority to CN201910538161.9A
Publication of CN110263341A
Application granted
Publication of CN110263341B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for mining and locating personal capabilities from text, comprising the following steps: storing document data and mail data in a database; generating a name word stock file and a system word stock file from the database; segmenting the text into words according to the generated name word stock and system word stock, and removing stop words; extracting all predicates and generating a predicate file; manually labelling the capability words using the predicate file and forming a capability word stock file; and segmenting the text into words using the capability word stock file, the name word stock, and the system word stock, removing stop words, and judging, according to regular expressions and rules, whether the capability and the names are in a parallel relation; if so, generating a correspondence between the capability and all of the named persons; if not, associating the capability with the nearest name by distance; then storing the resulting personnel capabilities in the database. The invention allows personnel to be found automatically from the corresponding capability, thereby improving office efficiency.

Description

Method for mining and locating personal ability from text
Technical Field
The invention belongs to the technical field of personal capability mining and locating, and relates to a method for mining and locating personal capabilities from text.
Background
The prior art offers no method for labelling personnel capabilities, and a person's capabilities cannot be extracted automatically from document descriptions. Only personnel or finances can be labelled, and the labels must be entered manually, which is difficult to operate in a large company.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for mining and locating personal capabilities from text, so as to solve the problems of the prior art.
The technical scheme adopted by the invention is a method for mining and locating personal capabilities from text, comprising the following steps:
(1) Data storage: storing document data (mainly Word files) and a large number of mails (eml files) in a database; Word and similar files are first converted to html (hypertext markup language), then crawled and stored, while mail data is stored directly;
(2) Generating a name word stock file and a system word stock file from the database built from the document data and mail data (system words are the names of the company's application systems, such as the office automation system);
(3) Segmenting the text into words according to the generated name word stock and system word stock, and removing stop words;
(4) Extracting all predicates (that is, verbs such as "promote" and "purchase") and generating a predicate file;
(5) Manually labelling the capability words using the predicate file and forming a capability word stock file (a word stock file suited to jieba word segmentation is generally a txt file with one word per line; the fields on each line are separated by spaces and generally comprise three attributes: the word, its frequency, and its part of speech);
(6) Segmenting the text into words using the capability word stock file, the name word stock, and the system word stock, and removing stop words; analyzing each sentence of the document and judging, according to regular expressions and rules, whether the capability and the names are in a parallel relation; if so, generating a correspondence between the capability and all of the named persons; if not, associating the capability with the nearest name by distance; then storing the resulting personnel capabilities in the database.
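The word stock file layout described in step (5) can be illustrated with a short Python sketch. It only formats and parses lines of the form `word frequency part-of-speech`. The helper names, the sample words, the frequencies, and the tags (`v` for verb) are assumptions chosen for illustration and do not come from the patent.

```python
from typing import NamedTuple


class DictEntry(NamedTuple):
    word: str   # the word itself
    freq: int   # word frequency
    pos: str    # part-of-speech tag


def format_entry(entry: DictEntry) -> str:
    """Render one entry as a space-separated line: word, frequency, part of speech."""
    return f"{entry.word} {entry.freq} {entry.pos}"


def parse_entry(line: str) -> DictEntry:
    """Parse one line; the word itself is assumed to contain no spaces."""
    word, freq, pos = line.split()
    return DictEntry(word, int(freq), pos)


# Hypothetical capability words labelled from the predicate file.
entries = [DictEntry("推广", 100, "v"), DictEntry("采购", 80, "v")]

# One word per line, as the txt word stock file is described.
dict_text = "\n".join(format_entry(e) for e in entries)
```

A file written this way has the layout that jieba accepts as a custom dictionary through `jieba.load_userdict`, although the patent does not state which jieba call it uses.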
The beneficial effects of the invention are as follows: compared with the prior art, the invention uses existing mails and office documents to generate word stock files, which makes word segmentation more accurate. After word segmentation, person names are taken as semantic roles and, combined with a web service, give the enterprise a convenient function for looking up users, so that personnel can be found automatically from the corresponding capability, which improves office efficiency.
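Looking up personnel from a capability amounts to an inverted index over the stored relations. The following Python sketch shows one possible shape for that lookup. The function name, the tuple layout `(person, capability, system)`, and the sample records (including "ERP系统") are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(relations: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Map each capability to the sorted list of people associated with it."""
    index = defaultdict(set)
    for person, capability, _system in relations:
        index[capability].add(person)
    return {cap: sorted(people) for cap, people in index.items()}


# Hypothetical (person, capability, system) records as they might be stored.
relations = [
    ("张三", "推广", "OA系统"),
    ("李四", "推广", "OA系统"),
    ("王五", "采购", "ERP系统"),
]

index = build_index(relations)
promoters = index.get("推广", [])  # people associated with the capability "推广"
```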
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
Example 1: as shown in FIG. 1, a method for mining and locating personal capabilities from text comprises the following steps:
(1) Data storage: storing document data (mainly Word files, such as company correspondence, technical files, project files, work orders, reports, and ledgers produced in daily work) and tens of thousands of mails (eml files produced in daily work, carrying information such as subject, body, sender and recipients, time, and attachments) in a database; Word and similar files are first converted to html (hypertext markup language), then crawled and stored, while mail data is stored directly;
(2) Generating a name word stock file and a system word stock file from the database built from the document data and mail data (system words are the names of the company's application systems, such as the office automation system; because the mails contain personnel names, system names, and the like, these files serve as initialization data);
(3) Segmenting the text into words according to the generated name word stock and system word stock, and removing stop words (natural language processing generally removes words that carry little meaning before further processing);
(4) Extracting all predicates (that is, verbs such as "promote" and "purchase") and generating a predicate file (the predicates are obtained through part-of-speech tagging with a trained part-of-speech model);
(5) Manually labelling the capability words using the predicate file and forming a capability word stock file (a word stock file suited to jieba word segmentation is generally a txt file with one word per line; the fields on each line are separated by spaces and generally comprise three attributes: the word, its frequency, and its part of speech);
(6) Segmenting the text into words using the capability word stock file, the name word stock, and the system word stock, and removing stop words; analyzing each sentence of the document (the relations among sentence components are judged by means of regular expressions, a semantic dependency tree, a syntactic dependency tree, and the like) and judging, according to regular expressions and rules, whether the capability and the names are in a parallel relation; if so, generating a correspondence between the capability and all of the named persons; if not, associating the capability with the nearest name by distance; then storing the resulting personnel capabilities in the database (for example, for the sentence "the OA system needs Zhang San and Li Si to promote it", the capability "promote" is associated with both Zhang San and Li Si).
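The decision in step (6) can be sketched in Python under one possible reading: names joined by a coordination marker form a parallel group, and each capability word is attached to the nearest group by token distance, so a group of several names receives the capability together. The marker set, the function names, and the sample tokens are assumptions. The patent itself does not spell out its rules or its distance measure.

```python
from typing import Dict, List, Set

# Assumed coordination markers between person names.
CONJUNCTIONS = {"和", "与", "及", "、"}


def group_names(tokens: List[str], names: Set[str]) -> List[List[int]]:
    """Group name positions; names joined as name, marker, name share a group."""
    groups: List[List[int]] = []
    for pos, tok in enumerate(tokens):
        if tok not in names:
            continue
        joined = (
            groups
            and pos - groups[-1][-1] == 2
            and tokens[pos - 1] in CONJUNCTIONS
        )
        if joined:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def assign(tokens: List[str], names: Set[str],
           capabilities: Set[str]) -> Dict[str, Set[str]]:
    """Attach each capability word to every name in the nearest name group."""
    groups = group_names(tokens, names)
    result: Dict[str, Set[str]] = {}
    if not groups:
        return result
    for pos, tok in enumerate(tokens):
        if tok not in capabilities:
            continue
        # Distance from the capability word to a group is the distance
        # to that group's closest name.
        nearest = min(groups, key=lambda g: min(abs(pos - p) for p in g))
        for p in nearest:
            result.setdefault(tokens[p], set()).add(tok)
    return result


names = {"张三", "李四", "王五"}
capabilities = {"推广", "采购"}
parallel = assign(["OA系统", "张三", "和", "李四", "推广"], names, capabilities)
separate = assign(["张三", "采购", "设备", "王五", "推广", "OA系统"], names, capabilities)
```

In the first sentence the two names are coordinated, so both receive "推广". In the second sentence each capability goes only to its nearest name.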
Enterprise content such as correspondence documents and mails is entered into a database with SQLAlchemy to facilitate later analysis, and the person names and system names found in the stored data are used to generate word stocks. The generated name and system word stocks make word segmentation more accurate. The predicates are written to a word stock file (a txt custom word stock file), and the capability words are obtained from the predicates either by manual labelling or by a trained model. The document and mail contents are then segmented, the modification relations between the semantic roles and the capabilities are judged using regular expressions and similar techniques, and finally the correspondences among persons, capabilities, and systems are generated to facilitate application of the system.
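The mail storage step can be sketched as follows. The patent names SQLAlchemy. To stay dependency-free, this sketch substitutes Python's standard `sqlite3` and `email` modules, so it shows the idea rather than the patent's implementation. The table layout, the function name, and the sample message are hypothetical.

```python
import sqlite3
from email import message_from_string
from email.policy import default

# A hypothetical eml message with the fields the description lists.
RAW_EML = """\
From: zhangsan@example.com
To: lisi@example.com
Subject: OA system rollout
Date: Thu, 20 Jun 2019 09:00:00 +0800
Content-Type: text/plain; charset="utf-8"

Zhang San will promote the OA system.
"""


def store_mail(conn: sqlite3.Connection, raw: str) -> None:
    """Parse one eml message and insert its main fields into the mail table."""
    msg = message_from_string(raw, policy=default)
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    conn.execute(
        "INSERT INTO mail (sender, recipient, subject, sent, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (str(msg["From"]), str(msg["To"]), str(msg["Subject"]),
         str(msg["Date"]), text),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mail (id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, "
    "subject TEXT, sent TEXT, body TEXT)"
)
store_mail(conn, RAW_EML)
rows = conn.execute("SELECT sender, subject, body FROM mail").fetchall()
```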
Semantic dependency parsing (SDP) (steps 2-4) analyzes the semantic associations between the language units of a sentence and presents them as a dependency structure. Describing sentence semantics with semantic dependencies has the advantage that the words themselves need not be abstracted; instead, each word is described by the semantic frame it bears, and the number of arguments is always far smaller than the number of words. At present, semantic dependency parsing is used mostly in academia. The invention only borrows the idea of semantic dependency to judge the relation between a capability and a person name, and then generates the relation between the capability and the person.
The invention takes a large amount of content such as documents and mails, stored in and extracted from a database, as the basis for analysis. A dedicated word stock is generated from the person names and system names. This word stock is used for accurate word segmentation, and the predicates are labelled to form a capability word stock. The documents are then segmented again according to the name, system, and capability word stocks, and the relations among persons, capabilities, and systems are obtained from the semantic dependency tree. In this way, a record of the professional skills of the enterprise's personnel can be built quickly, so that the relevant personnel can be conveniently found and put to use.
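Word segmentation with a dedicated word stock, followed by stop word removal, can be sketched in Python. The patent names jieba. This sketch substitutes plain forward maximum matching as a dependency-free stand-in, which is not how jieba works internally. The word stock contents, the stop word list, and the sample sentence are assumptions for illustration.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: at each position take the longest lexicon word,
    falling back to a single character when nothing matches."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical word stock: person names, a system name, a capability word,
# and one general word that is also treated as a stop word.
lexicon = {"张三", "李四", "OA系统", "推广", "需要"}
stop_words = {"需要"}

raw_tokens = segment("OA系统需要张三和李四推广", lexicon)
tokens = remove_stop_words(raw_tokens, stop_words)
```

Without "OA系统" and the two names in the word stock, the same sentence would fall apart into single characters, which is why the name and system word stocks are built before segmentation.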
The foregoing is merely a specific embodiment of the invention, and the scope of protection of the invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive within the scope of the invention shall be covered by it, and the scope of protection shall therefore be defined by the appended claims.

Claims (1)

1. A method for mining and locating personal capabilities from text, characterized in that the method comprises the following steps:
(1) Data storage: storing document data and a plurality of mail data in a database; Word files are first converted to html, then crawled and stored, while mail data is stored directly;
(2) Generating a name word stock file and a system word stock file from the database built from the document data and the mail data, wherein the system word stock is a word stock of the company's application systems;
(3) Segmenting the text into words according to the generated name word stock and system word stock, and removing stop words;
(4) Extracting all predicates and generating a predicate file;
(5) Manually labelling the capability words using the predicate file and forming a capability word stock file;
(6) Segmenting the text into words using the capability word stock file, the name word stock, and the system word stock, and removing stop words; analyzing each sentence of the document and judging, according to regular expressions and rules, whether the capability and the names are in a parallel relation; if so, generating a correspondence between the capability and all of the named persons; if not, associating the capability with the nearest name by distance; then storing the resulting personnel capabilities in the database.
CN201910538161.9A 2019-06-20 2019-06-20 Method for mining and locating personal ability from text Active CN110263341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910538161.9A CN110263341B (en) 2019-06-20 2019-06-20 Method for mining and locating personal ability from text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910538161.9A CN110263341B (en) 2019-06-20 2019-06-20 Method for mining and locating personal ability from text

Publications (2)

Publication Number Publication Date
CN110263341A (en) 2019-09-20
CN110263341B (en) 2023-06-20

Family

ID=67920064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910538161.9A Active CN110263341B (en) 2019-06-20 2019-06-20 Method for mining and locating personal ability from text

Country Status (1)

Country Link
CN (1) CN110263341B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117872A (en) * 2018-07-24 2019-01-01 贵州电网有限责任公司信息中心 A kind of user power utilization behavior analysis method based on automatic Optimal Clustering
CN109241538A (en) * 2018-09-26 2019-01-18 上海德拓信息技术股份有限公司 Based on the interdependent Chinese entity relation extraction method of keyword and verb
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
TW201118619A (en) * 2009-11-30 2011-06-01 Inst Information Industry An opinion term mining method and apparatus thereof

Non-Patent Citations (2)

Title
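Discussion on the automatic construction of a collocation dictionary based on text mining; Zhang Hui et al.; Journal of Shanghai University of Engineering Science; 2004-12-30 (No. 04); full text *
Research and application of heterogeneous data conversion technology in the migration of electric power marketing customer files; Wu Fangquan et al.; Information & Communications; 2018-10-15 (No. 10); full text *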

Also Published As

Publication number Publication date
CN110263341A (en) 2019-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant