CN110263341B

CN110263341B - Method for mining and locating personal ability from text

Info

Publication number: CN110263341B
Application number: CN201910538161.9A
Authority: CN
Inventors: 吴漾; 王鹏宇; 缪新萍; 杨箴; 周玲; 田钺
Original assignee: Guizhou Power Grid Co Ltd
Current assignee: Guizhou Power Grid Co Ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2023-06-20
Anticipated expiration: 2039-06-20
Also published as: CN110263341A

Abstract

The invention discloses a method for mining and locating personal capabilities from text, comprising the following steps: storing document data and email data in a database; generating a person-name word stock file and a system word stock file from the database; segmenting words according to the generated person-name and system word stocks and removing stop words; extracting all predicates and generating a predicate file; manually labeling capability words using the predicate file to form a capability word stock file; and segmenting words using the capability word stock file, the person-name word stock, and the system word stock, removing stop words, and judging by regular expressions and rules whether the capability and the person names are in a parallel relation. If so, the correspondence between the capability and the persons is generated; if not, the nearest person is found by distance. The resulting personnel capabilities are then stored in the database. The invention makes it possible to find personnel automatically by capability, thereby greatly improving office efficiency.

Description

Method for mining and locating personal ability from text

Technical Field

The invention belongs to the technical field of personal capability mining and locating, and relates to a method for mining and locating personal capabilities from text.

Background

In the prior art there is no method for tagging personnel capabilities, and a person's capabilities cannot be extracted automatically from document descriptions. Only personnel or financing can be tagged, and the tags must be entered manually, which is difficult to operate in a large company.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for mining and locating personal capabilities from text, so as to solve the problems of the prior art.

The technical scheme adopted by the invention is as follows: a method of mining and locating personal capabilities from text, the method comprising the steps of:

(1) Data storage: storing document data (mainly Word files) and a large number of emails (eml files) in a database; Word and similar files are first converted to HTML (HyperText Markup Language), then crawled and stored, while email data is stored directly;

(2) Generating a person-name word stock file and a system word stock file (system words are the names of the company's application systems, such as an office automation system) from the database built from the document data and email data;

(3) Segmenting words according to the generated person-name word stock and system word stock, and removing stop words;

(4) Extracting all predicates (that is, verbs such as "promote" and "purchase") and generating a predicate file;

(5) Manually labeling capability words using the predicate file and forming a capability word stock file (a word stock file suitable for jieba word segmentation is generally a txt file with one word per line; the fields on each line are separated by spaces and generally comprise three attributes: the word, its frequency, and its part of speech);

(6) Segmenting words using the capability word stock file, the person-name word stock, and the system word stock, and removing stop words; analyzing each sentence of the document and judging according to regular expressions and rules whether the capability and the person names are in a parallel relation; if so, generating the correspondence between the capability and all of the persons; if not, finding the nearest person by distance; and then generating the corresponding personnel capabilities and storing them in the database.
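The word stock format of step (5) and the dictionary-driven segmentation of steps (3) and (6) can be sketched as follows. This is only an illustration under stated assumptions: the patent relies on jieba for segmentation, whereas the sketch substitutes a simple forward-maximum-matching segmenter so that it runs with the standard library alone. The function names, the sample sentence, the frequencies, and the part-of-speech tags are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Entry = Tuple[str, int, str]  # (word, frequency, part-of-speech tag)


def format_word_stock(entries: Iterable[Entry]) -> str:
    """Render entries one word per line, fields separated by spaces."""
    return "".join(f"{word} {freq} {pos}\n" for word, freq, pos in entries)


def parse_word_stock(text: str) -> Dict[str, Tuple[int, str]]:
    """Read 'word frequency pos' lines back into a dictionary."""
    stock: Dict[str, Tuple[int, str]] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            word, freq, pos = parts
            stock[word] = (int(freq), pos)
    return stock


def segment(text: str, stock: Dict[str, Tuple[int, str]],
            stop_words: Set[str]) -> List[str]:
    """Forward maximum matching (a stand-in for jieba, not its algorithm).

    At each position the longest word found in the word stock is taken;
    if none matches, a single character is emitted. Stop words and
    whitespace are dropped from the result.
    """
    max_len = max((len(word) for word in stock), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in stock:
                break
        i += size
        if piece.strip() and piece not in stop_words:
            tokens.append(piece)
    return tokens


# Hypothetical entries: two person names, one system name, two verbs.
ENTRIES: List[Entry] = [
    ("张三", 10, "nr"),
    ("李四", 10, "nr"),
    ("OA系统", 10, "nz"),
    ("推广", 10, "v"),   # "promote"
    ("需要", 10, "v"),   # "need"
]
STOCK = parse_word_stock(format_word_stock(ENTRIES))

# Made-up sentence: "The OA system needs Zhang San and Li Si to promote it."
TOKENS = segment("OA系统需要张三和李四推广", STOCK, stop_words={"需要"})
print(TOKENS)
```

The conjunction "和" ("and") is deliberately kept out of the stop-word set here, since the later parallel-relation check needs it.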

The beneficial effects of the invention are as follows: compared with the prior art, the invention uses existing emails and office documents to generate word stock files so that words can be segmented accurately; after segmentation, person names are treated as semantic roles; and, combined with a web service, the enterprise is given a convenient personnel search function that automatically finds personnel by capability, thereby greatly improving office efficiency.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific examples.

Example 1: as shown in FIG. 1, a method of mining and locating personal capabilities from text comprises the following steps:

(1) Data storage: storing document data (mainly Word files, such as company correspondence, technical files, project files, work orders, reports, and ledgers generated in daily work) and tens of thousands of emails (eml files generated in daily work, carrying information such as subject, body, sender and recipients, time, and attachments) in a database; Word and similar files are first converted to HTML (HyperText Markup Language), then crawled and stored, while email data is stored directly;

(2) Generating a person-name word stock file and a system word stock file (system words are the names of the company's application systems, such as an office automation system) from the database built from the document data and email data (these files are initialization data, since the emails contain personnel names, system names, and the like);

(3) Segmenting words according to the generated person-name word stock and system word stock, and removing stop words (natural language processing generally removes words that carry no meaning before further processing);

(4) Extracting all predicates (that is, verbs such as "promote" and "purchase") and generating a predicate file (obtained through part-of-speech tagging with a trained part-of-speech analysis model);

(6) Segmenting words using the capability word stock file, the person-name word stock, and the system word stock, and removing stop words; analyzing each sentence of the document (judging the relations among sentence components by means of regular expressions, a semantic dependency tree, a syntactic dependency tree, and the like); judging according to the regular expressions and rules whether the capability and the person names are in a parallel relation; if so, generating the correspondence between the capability and all of the persons; if not, finding the nearest person by distance; and then generating the corresponding personnel capabilities and storing them in the database (for example, for "the OA system needs Zhang San and Li Si to promote it", the capability "promote" is assigned to both Zhang San and Li Si).
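The decision in step (6) can be sketched as follows, under one possible reading of the text: if the person name nearest to a capability word belongs to a run of names joined by conjunctions (a parallel relation), the capability is assigned to every name in that run; otherwise it is assigned to the nearest name only, measured in tokens. The conjunction set, the one-letter tagging scheme, and the function names are assumptions made for illustration and do not come from the patent.

```python
import re
from typing import Dict, List, Set

# Assumed conjunction tokens ("and", "with", "and", enumeration comma).
CONJUNCTIONS: Set[str] = {"和", "与", "及", "、"}

# One tag per token: N = person name, A = capability word,
# J = conjunction, O = anything else.
PARALLEL_NAMES = re.compile(r"N(?:JN)+")


def tag_tokens(tokens: List[str], names: Set[str],
               abilities: Set[str]) -> str:
    """Encode the token list as a string with one character per token."""
    tags = []
    for token in tokens:
        if token in names:
            tags.append("N")
        elif token in abilities:
            tags.append("A")
        elif token in CONJUNCTIONS:
            tags.append("J")
        else:
            tags.append("O")
    return "".join(tags)


def assign_abilities(tokens: List[str], names: Set[str],
                     abilities: Set[str]) -> Dict[str, Set[str]]:
    """Map each person name to the capability words attributed to it."""
    tags = tag_tokens(tokens, names, abilities)
    name_positions = [i for i, tag in enumerate(tags) if tag == "N"]

    # Position of a name -> positions of all names in its parallel group.
    group_of: Dict[int, List[int]] = {}
    for match in PARALLEL_NAMES.finditer(tags):
        members = [i for i in range(match.start(), match.end())
                   if tags[i] == "N"]
        for i in members:
            group_of[i] = members

    result: Dict[str, Set[str]] = {}
    for pos, tag in enumerate(tags):
        if tag != "A" or not name_positions:
            continue
        # Nearest name by token distance; ties go to the earlier name.
        nearest = min(name_positions, key=lambda i: abs(i - pos))
        for i in group_of.get(nearest, [nearest]):
            result.setdefault(tokens[i], set()).add(tokens[pos])
    return result


NAMES = {"张三", "李四"}
ABILITIES = {"推广", "采购"}  # "promote", "purchase"

# Parallel names: both persons receive the capability.
PARALLEL = assign_abilities(["OA系统", "张三", "和", "李四", "推广"],
                            NAMES, ABILITIES)
# No parallel relation: each capability goes to the nearest name.
NEAREST = assign_abilities(["张三", "采购", "设备", "李四", "推广", "OA系统"],
                           NAMES, ABILITIES)
print(PARALLEL, NEAREST)
```

Encoding tokens as single characters keeps string offsets aligned with token indices, so a plain regular expression can locate coordinated names. A dependency parser, which the text also mentions, would replace this tagging step in a fuller implementation.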

Based on enterprise content such as correspondence documents and emails, SQLAlchemy is used to load the data into a database to facilitate later analysis, and the stored data is used to find person names and system names and generate word stocks. The generated person-name and system word stocks make word segmentation more accurate. The predicates are written to a word stock file (a custom txt word stock file), and capability words are obtained from the predicates either by manual labeling or by a trained model. The documents and email contents are then segmented, the modification relations between the semantic roles and the capabilities are judged using regular expressions and other techniques, and finally the correspondences among persons, capabilities, and systems are generated for use by applications.
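The email side of the data-loading step might look like the sketch below. The patent names SQLAlchemy for this step; the sketch deliberately swaps in the standard-library `sqlite3` and `email` modules so that it runs without extra dependencies. The table layout, the column names, and the sample message are hypothetical.

```python
import sqlite3
from email import message_from_string
from email.message import Message

# A made-up eml payload with the fields the description lists.
RAW_EML = """\
From: zhang.san@example.com
To: li.si@example.com
Subject: OA rollout
Date: Thu, 20 Jun 2019 09:00:00 +0800

Please promote the OA system this week.
"""


def extract_text(msg: Message) -> str:
    """Collect the text/plain parts of a message, multipart or not."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is None:
                continue
            charset = part.get_content_charset() or "utf-8"
            chunks.append(payload.decode(charset, errors="replace"))
    return "\n".join(chunks)


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical mail table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mail ("
        " id INTEGER PRIMARY KEY,"
        " sender TEXT, recipient TEXT, subject TEXT,"
        " sent_at TEXT, body TEXT)"
    )
    return conn


def store_mail(conn: sqlite3.Connection, raw: str) -> int:
    """Parse one eml string and insert it; returns the new row id."""
    msg = message_from_string(raw)
    cursor = conn.execute(
        "INSERT INTO mail (sender, recipient, subject, sent_at, body)"
        " VALUES (?, ?, ?, ?, ?)",
        (msg["From"], msg["To"], msg["Subject"], msg["Date"],
         extract_text(msg)),
    )
    conn.commit()
    return cursor.lastrowid


CONN = open_database()
ROW_ID = store_mail(CONN, RAW_EML)
ROW = CONN.execute(
    "SELECT sender, subject, body FROM mail WHERE id = ?", (ROW_ID,)
).fetchone()
print(ROW)
```

Word documents would follow the separate path the description gives (conversion to HTML, then crawling and storing), which is not shown here.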

Semantic dependency analysis (step 2-4) (Semantic Dependency Parsing, SDP) analyzes the semantic associations between the individual language units of a sentence and presents them as a dependency structure. The advantage of characterizing sentence semantics with semantic dependencies is that the words themselves do not need to be abstracted; instead, each word is described by the semantic frame it bears, and the number of arguments is always much smaller than the number of words. At present, semantic dependency parsing is largely confined to academia. The invention only borrows the idea of semantic dependency to judge the relation between a capability and a person name, and then generates the relation between the capability and the person.

The invention takes a large amount of content such as documents and emails, stores it in a database, and uses the data extracted from the database as the basis for analysis. A dedicated word stock is generated from the person names and system names. This word stock is used to segment words accurately, and the predicates are labeled to form a capability word stock. The documents are then segmented again using the person-name, system, and capability word stocks, and the relations among persons, capabilities, and systems are obtained from the relations in the semantic dependency tree. A record of the professional skills of the enterprise's personnel can thus be built quickly, making it convenient to find and make use of the relevant people.
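Once the person, capability, and system relations are stored, finding personnel by capability reduces to a simple lookup. The sketch below assumes a hypothetical three-column table and uses the standard-library `sqlite3` module in place of whatever database and web service the enterprise actually runs; the names in the sample data are invented.

```python
import sqlite3
from typing import List, Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory store for (person, capability, system) triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person_ability ("
        " person TEXT NOT NULL,"
        " ability TEXT NOT NULL,"
        " system TEXT NOT NULL DEFAULT '',"
        " UNIQUE (person, ability, system))"
    )
    return conn


def add_relation(conn: sqlite3.Connection, person: str, ability: str,
                 system: str = "") -> None:
    """Record one relation; repeated sightings are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO person_ability (person, ability, system)"
        " VALUES (?, ?, ?)",
        (person, ability, system),
    )


def find_people(conn: sqlite3.Connection, ability: str,
                system: Optional[str] = None) -> List[str]:
    """Return the people recorded with a capability, optionally per system."""
    sql = "SELECT DISTINCT person FROM person_ability WHERE ability = ?"
    params = [ability]
    if system is not None:
        sql += " AND system = ?"
        params.append(system)
    sql += " ORDER BY person"
    return [row[0] for row in conn.execute(sql, params)]


STORE = open_store()
add_relation(STORE, "Zhang San", "promote", "OA system")
add_relation(STORE, "Li Si", "promote", "OA system")
add_relation(STORE, "Li Si", "promote", "OA system")  # duplicate, ignored
add_relation(STORE, "Li Si", "purchase")

PROMOTERS = find_people(STORE, "promote")
print(PROMOTERS)
```

The uniqueness constraint keeps the table small when the same relation is mined from many documents. A count column could be added instead if the frequency of a relation were to be used for ranking.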

The foregoing is merely a specific embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within its scope of protection, which shall therefore be defined by the appended claims.

Claims

1. A method of mining and locating personal capabilities from text, characterized by: the method comprises the following steps:

(1) Data storage: storing document data and a plurality of email data in a database, wherein Word files are converted to HTML, then crawled and stored, and the email data is stored directly;

(2) Generating a name word stock and a system word stock file by utilizing a database generated by the document data and the mail data, wherein the system word stock refers to a word stock of an application system of a company;

(4) Extracting all predicates and generating a predicate file;

(5) Manually labeling capability words using the predicate file and forming a capability word stock file;

(6) Segmenting words using the capability word stock file, the name word stock, and the system word stock, and removing stop words; analyzing each sentence of the document and judging according to regular expressions and rules whether the capability and the names are in a parallel relation; if so, generating the correspondence between the capability and all of the persons; if not, finding the nearest person by distance; and then generating the corresponding personnel capabilities and storing them in the database.