CN113468339A

CN113468339A - Label extraction method, system, electronic device and medium based on knowledge graph

Info

Publication number: CN113468339A
Application number: CN202110704870.7A
Authority: CN
Inventors: 刘俊辰; 尤旸
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Nanjing Minglue Technology Co ltd
Priority date: 2021-06-24
Filing date: 2021-06-24
Publication date: 2021-10-01

Abstract

The invention discloses a knowledge graph-based label extraction method, system, electronic device and medium. The label extraction method comprises the following steps: a document preprocessing step: preprocessing an online document to obtain entities and non-entities; an entity processing step: mapping the entities to a knowledge graph to obtain a plurality of entity labels; a non-entity processing step: parsing the non-entities by preset rules to obtain non-entity sequences, and processing the non-entity sequences to obtain a plurality of non-entity labels; and a merging processing step: merging and de-duplicating the entity labels and the non-entity labels to obtain the document labels corresponding to the online document. The invention introduces entity recognition, so that the required entity types are extracted in a targeted manner as text labels; in addition, the extracted entities are ranked by importance using the knowledge graph, so that the extracted labels are more valuable, manual review and similar operations are unnecessary, and labor cost is reduced.

Description

Label extraction method, system, electronic device and medium based on knowledge graph

Technical Field

The invention relates to the technical field of natural language processing, in particular to a knowledge graph-based label extraction method, a knowledge graph-based label extraction system, electronic equipment and a knowledge graph-based label extraction medium.

Background

Recently, online documents have received increasing attention: because they are stored in the cloud and support multi-person collaboration, they effectively improve the working efficiency of enterprises. An online document carries not only its own content but also information generated during multi-person collaboration, such as its editors and viewers. This information can readily be linked and associated with an enterprise's internal knowledge base, information systems and enterprise knowledge graph, which greatly increases the value of the online document.

Extracting tags from online documents makes it possible to support intelligent services of the knowledge base, such as document retrieval and recommendation, more effectively. Document tag extraction uses NLP and related techniques to extract the important information of one or more documents; this information is the content that users actually care about and includes entities, key phrases and the like. With these tags, services such as document query and recommendation become more intelligent and accurate, which improves efficiency.

Most existing techniques extract keywords directly from documents. For example, methods based on TF-IDF (term frequency-inverse document frequency) use the product of a word's term frequency and its inverse document frequency as the word's score and select the several highest-scoring words as keywords. Methods based on TextRank construct a candidate keyword graph from local lexical relations, namely a co-occurrence window: an edge is built between any two nodes that co-occur, and the weight of each node is computed iteratively according to a formula until convergence. Finally, the nodes are sorted by weight and the top several are selected as keywords. In practice, however, these prior-art methods determine the importance of words, and hence the ranking of the extracted keywords, only from word frequency or word co-occurrence, which in some cases is not accurate enough. Even if stop words are removed before extraction, the extracted keywords may still be words that occur frequently but carry no obvious meaning, so manual screening is often required after extraction. Moreover, the prior-art methods cannot focus extraction on word types of particular interest, such as product names, department names and document types.
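For orientation, the TF-IDF scoring described above can be sketched as follows. This is a minimal illustration of the prior-art baseline, not part of the claimed method; the function name `tfidf_keywords`, the toy corpus and the plain `log(N / df)` form of the inverse document frequency are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, k=3):
    """Score each word of docs[doc_index] by tf * idf and return the top k words."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Hypothetical, already tokenized corpus.
docs = [
    ["deploy", "the", "server", "the", "server"],
    ["the", "white", "paper"],
    ["the", "manual", "paper"],
]
print(tfidf_keywords(docs, 0, k=2))
```

In this toy corpus the word "the" occurs in every document, so its inverse document frequency is zero and it is never selected. A frequent word that is absent from the other documents would still score highly whether or not it is meaningful, which is the weakness discussed above.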

It is therefore desirable to develop a method, system, electronic device and medium for extracting labels based on knowledge-graph that overcomes the above-mentioned drawbacks.

Disclosure of Invention

In view of the above problems, embodiments of the present application provide a method, a system, an electronic device, and a medium for extracting a label based on a knowledge graph, so as to at least solve the problem that extraction cannot be focused on word types of interest.

The invention provides a knowledge graph-based label extraction method, which comprises the following steps:

document preprocessing step: preprocessing an online document to obtain an entity and a non-entity;

entity processing step: mapping the entity to a knowledge graph to obtain a plurality of entity labels;

non-entity processing step: analyzing the non-entity to obtain a non-entity sequence through a preset rule, and processing the non-entity sequence to obtain a plurality of non-entity labels;

merging processing step: merging and de-duplicating the entity tags and the non-entity tags to obtain the document tags corresponding to the online document.

In the above tag extraction method, the document preprocessing step includes:

an entity acquisition step: extracting a plurality of the entities of the online document through an entity identification technology;

non-entity acquisition step: extracting a plurality of the non-entities of the online document according to dependency relations.

In the above tag extraction method, the entity processing step includes:

entity position judgment step: determining the position of the entity in the online document;

entity tag obtaining step: setting the entity appearing in a document name or a document title of the online document as the entity tag; setting, as the entity tag, at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and mapping each entity appearing in the body text of the online document to the knowledge graph, constructing a new small knowledge graph from the corresponding related entities and the relationships between them, ranking the entities of the small knowledge graph by importance using the PageRank algorithm, and retaining at least one entity as the entity tag according to the ranking result.

In the above tag extraction method, the non-entity processing step includes: setting the non-entity sequences appearing in the title of the online document as the non-entity tags; and sorting the non-entity sequences appearing in the body text of the online document by word frequency and then selecting at least one of them as the non-entity tags.

The invention also provides a knowledge graph-based label extraction system, which comprises the following components:

the document preprocessing unit is used for preprocessing an online document to obtain an entity and a non-entity;

the entity processing unit is used for mapping the entity to a knowledge graph to obtain a plurality of entity labels;

the non-entity processing unit is used for analyzing the non-entity to obtain a non-entity sequence through a preset rule and processing the non-entity sequence to obtain a plurality of non-entity labels;

and the merging processing unit is used for merging and de-duplicating the entity tags and the non-entity tags to obtain document tags corresponding to the online documents.

In the above tag extraction system, the document preprocessing unit includes:

the entity acquisition module extracts a plurality of entities of the online document through an entity identification technology;

and the non-entity acquisition module extracts a plurality of non-entities of the online document according to the dependency relationship.

In the above tag extraction system, the entity processing unit includes:

the entity position judging module is used for judging the position of the entity in the online document;

an entity tag obtaining module, configured to: set the entity appearing in the document name or document title of the online document as the entity tag; set, as the entity tag, at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and map each entity appearing in the body text of the online document to the knowledge graph, construct a new small knowledge graph from the corresponding related entities and the relationships between them, rank the entities of the small knowledge graph by importance using the PageRank algorithm, and retain at least one entity as the entity tag according to the ranking result.

In the above tag extraction system, the non-entity processing unit sets the non-entity sequences appearing in the title of the online document as the non-entity tags and, after sorting the non-entity sequences appearing in the body text of the online document by word frequency, selects at least one of them as the non-entity tags.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the tag extraction method as described in any one of the above when executing the computer program.

The invention also provides a medium on which a computer program is stored, wherein the program, when executed by a processor, implements a tag extraction method as defined in any one of the above.

Compared with the prior art, the invention has the following effects: the invention introduces entity recognition, so that the required entity types are extracted in a targeted manner as text labels; in addition, the extracted entities are ranked by importance using the knowledge graph, so that the extracted labels are more valuable, manual review and similar operations are unnecessary, and labor cost is reduced.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a tag extraction method of the present invention;

FIG. 2 is a flowchart illustrating the substeps of step S1 in FIG. 1;

FIG. 3 is a flowchart illustrating the substeps of step S2 in FIG. 1;

FIG. 4 is a flow chart of an application of the tag extraction method of the present invention;

FIG. 5 is a schematic diagram of the structure of the tag extraction system of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The exemplary embodiments of the present invention and the description thereof are provided to explain the present invention and not to limit the present invention. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.

As used herein, the terms "first", "second", "S1", "S2", …, etc. do not particularly denote an order or sequential meaning, nor are they intended to limit the present invention, but merely distinguish between elements or operations described in the same technical terms.

With respect to directional terminology used herein, for example: up, down, left, right, front or rear, etc., are simply directions with reference to the drawings. Accordingly, the directional terminology used is intended to be illustrative and is not intended to be limiting of the present teachings.

As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.

As used herein, "and/or" includes any and all combinations of the described items.

References to "plurality" herein include "two" and "more than two"; reference to "multiple sets" herein includes "two sets" and "more than two sets".

As used herein, the terms "substantially", "about" and the like are used to modify any slight variation in quantity or error that does not alter the nature of the variation. Generally, the range of slight variations or errors modified by such terms may be 20% in some embodiments, 10% in some embodiments, 5% in some embodiments, or other values. It should be understood by those skilled in the art that the aforementioned values can be adjusted according to actual needs, and are not limited thereto.

Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.

The quality of document tag extraction depends on two aspects: the quality of the extracted entities or keywords, and the quality of their ranking. The extracted words are usually ranked, and the top k words are selected as the tags representing the document, so obtaining a more reasonable ranking is a very important problem. The invention first divides document tags into two types, entity and non-entity (noun phrases are used as the example of non-entities), and extracts and ranks the two types separately. Entities are obtained through entity recognition; in addition, an online document carries entities such as its editors and viewers. The entities are ranked using a related knowledge graph. Non-entities are obtained by extracting noun phrases and are ranked comprehensively according to where they occur and their word frequency. Finally, the top k of each type are taken and combined as the final document tags. The present invention is described in detail below with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 is a flowchart of a tag extraction method according to the present invention. As shown in fig. 1, the method for extracting a label based on a knowledge-graph includes:

document preprocessing step S1: preprocessing an online document to obtain an entity and a non-entity;

entity processing step S2: mapping the entity to a knowledge graph to obtain a plurality of entity labels;

non-entity processing step S3: analyzing the non-entity to obtain a non-entity sequence through a preset rule, and processing the non-entity sequence to obtain a plurality of non-entity labels;

merging processing step S4: merging and de-duplicating the entity tags and the non-entity tags to obtain the document tags corresponding to the online document.

In the embodiment, the non-entity may be a first-order phrase, but the invention is not limited thereto.
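The four steps S1 to S4 can be read as a simple pipeline. The following is a minimal sketch under stated assumptions: the function `extract_document_tags`, its callable parameters and the toy stand-in steps are hypothetical names introduced only for illustration, and the real S1 to S3 steps are replaced by trivial placeholders.

```python
def extract_document_tags(document, preprocess, entity_step, non_entity_step):
    """Run steps S1-S4 with caller-supplied implementations of S1-S3."""
    entities, non_entities = preprocess(document)               # S1
    entity_tags = entity_step(entities, document)               # S2
    non_entity_tags = non_entity_step(non_entities, document)   # S3
    merged = list(entity_tags) + list(non_entity_tags)          # S4: merge
    # S4: de-duplicate while keeping the first occurrence of each tag.
    return list(dict.fromkeys(merged))

# Toy stand-ins (assumptions): the document already lists its entities and phrases.
def toy_preprocess(document):
    return document["entities"], document["phrases"]

def toy_entity_step(entities, document):
    return entities

def toy_non_entity_step(non_entities, document):
    return non_entities

document = {
    "entities": ["Product X", "Alice"],
    "phrases": ["load balancer", "Product X"],
}
tags = extract_document_tags(
    document, toy_preprocess, toy_entity_step, toy_non_entity_step
)
print(tags)
```

Passing the steps in as callables keeps the skeleton independent of any particular entity recognizer or knowledge graph store.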

Further, referring to fig. 2, fig. 2 is a flowchart illustrating a sub-step of step S1 in fig. 1. As shown in fig. 2, the document preprocessing step S1 includes:

entity acquisition step S11: extracting a plurality of the entities of the online document through an entity identification technology;

non-entity acquisition step S12: extracting a plurality of the non-entities of the online document according to dependency relations.

In this embodiment, the entity may be, for example, a product, a function, a technology, a department or a document type, where the document type may be a deployment document, a white paper, an instruction manual and the like. The invention adopts entity recognition techniques such as dictionary-based methods and sequence labeling models such as CRF and the neural-network-based LSTM+CRF and BERT+CRF, and uses entity recognition to focus extraction on the required entity types, which narrows the extraction range and makes the final document tags more accurate.
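Of the recognition techniques mentioned, the dictionary-based one is the simplest to illustrate. The sketch below uses greedy longest-match lookup at character level; the function `dictionary_ner`, the lexicon contents and the entity type names are assumptions for the example, not details given in the description.

```python
def dictionary_ner(text, lexicon):
    """Greedy longest-match lookup; `lexicon` maps surface form -> entity type."""
    max_len = max((len(term) for term in lexicon), default=0)
    found = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so "white paper" beats "paper".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match:
            found.append((match, lexicon[match]))
            i += len(match)  # continue after the matched span
        else:
            i += 1
    return found

# Hypothetical lexicon of entity types of interest.
lexicon = {
    "white paper": "document_type",
    "paper": "document_type",
    "R&D department": "department",
}
text = "The R&D department published a white paper."
print(dictionary_ner(text, lexicon))
```

Character-level matching suits text without word delimiters, such as Chinese, but on space-delimited text it can also match inside longer words, so a production system would likely add boundary checks or use one of the sequence labeling models instead.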

Still further, referring to fig. 3, fig. 3 is a flowchart illustrating a sub-step of step S2 in fig. 1. As shown in fig. 3, the entity processing step S2 includes:

entity position determination step S21: determining a location of the entity in the online document;

entity tag acquisition step S22: setting the entity appearing in a document name or a document title of the online document as the entity tag; setting, as the entity tag, at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and mapping each entity appearing in the body text of the online document to the knowledge graph, constructing a new small knowledge graph from the corresponding related entities and the relationships between them, ranking the entities of the small knowledge graph by importance using the PageRank algorithm, and retaining at least one entity as the entity tag according to the ranking result. Because the invention ranks the extracted entities by importance using the entities in the knowledge graph and the relationships between them, the ranking result is more accurate.
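The path-distance rule in step S22 can be sketched with a breadth-first search. This is an illustration under assumptions: the graph is an undirected adjacency dictionary, the function name `entities_near` and the toy data are hypothetical, the threshold is treated as strict ("less than"), and the optional `limit` mirrors the embodiment's example of keeping the first 5 tags by path distance.

```python
from collections import deque

def entities_near(graph, source, threshold, document_entities, limit=5):
    """Document entities whose hop distance to `source` is below `threshold`."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] + 1 >= threshold:
            continue  # neighbors would not be strictly below the threshold
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    hits = [
        e for e in dict.fromkeys(document_entities)
        if e in distance and e != source
    ]
    # Nearest first; ties are broken alphabetically for determinism.
    hits.sort(key=lambda e: (distance[e], e))
    return hits[:limit]

# Hypothetical enterprise graph as an undirected adjacency dictionary.
graph = {
    "Product X": ["R&D department", "Deployment Guide"],
    "R&D department": ["Product X", "Alice"],
    "Alice": ["R&D department", "Bob"],
    "Deployment Guide": ["Product X"],
    "Bob": ["Alice"],
}
near = entities_near(
    graph, "Product X", threshold=3,
    document_entities=["Alice", "Bob", "Deployment Guide"],
)
print(near)
```

Here "Bob" is three hops from the title entity "Product X" and is therefore excluded, while the two closer entities are kept in order of distance.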

Further, the non-entity processing step S3 includes: setting the non-entity sequences appearing in the title of the online document as the non-entity tags; and sorting the non-entity sequences appearing in the body text of the online document by word frequency and then selecting at least one of them as the non-entity tags.

Referring to fig. 4, fig. 4 is a flowchart illustrating an application of the tag extraction method according to the present invention. The tag extraction method of the present invention is specifically described in an embodiment with reference to fig. 4.

(1) Entity label:

First, entities of the desired types are extracted from the online document through entity recognition.

It should be noted that, in addition to the entities extracted by entity recognition, the online document in this embodiment also carries entities such as the document's editors, viewers and administrators. All of these entities, together with the extracted entities, are mapped into the enterprise knowledge graph, i.e. the corresponding entity nodes are found in the graph.

A knowledge graph is a graph-based data structure composed of nodes and edges. Each node represents an entity, such as an employee, a product or a company, and each edge represents a relationship between entities. It is essentially a semantic network that exposes the relationships between entities and can link all information together. The enterprise knowledge graph is built from a large amount of data within the enterprise and therefore reflects the relevance between entities well, which helps find the entities that best reflect the document content.
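As a concrete picture of this data structure, a knowledge graph can be held in memory as an adjacency map built from (head, relation, tail) triples. The sketch below is only illustrative; the function `build_graph`, the triples and the relation names are assumptions, and a real enterprise graph would typically live in a graph database.

```python
from collections import defaultdict

def build_graph(triples):
    """Build an undirected adjacency map plus a relation lookup from triples."""
    neighbors = defaultdict(set)
    relations = {}
    for head, relation, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
        relations[(head, tail)] = relation  # keyed by (head, tail) direction
    return neighbors, relations

# Hypothetical enterprise facts.
triples = [
    ("Alice", "works_in", "R&D department"),
    ("R&D department", "develops", "Product X"),
    ("Product X", "documented_by", "Deployment Guide"),
]
neighbors, relations = build_graph(triples)
print(sorted(neighbors["R&D department"]))
print(relations[("Alice", "R&D department")])
```

Mapping an extracted entity into the graph then amounts to looking up its name (or a normalized form of it) among the keys of `neighbors`.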

There are different ways of handling an entity depending on where it appears in the document.

A. When an entity appears in the document name or document title, it is regarded as a very important entity and is directly classified as an entity tag. In addition, all entities that appear in the document and whose path distance to that entity node in the enterprise knowledge graph is 2 are also directly classified as entity tags. Because they are closely related to the important entities in the document title, these tags are also of high importance in the knowledge graph. The number of entity tags can be controlled as required, for example by taking the first 5 tags according to path distance, but the invention is not limited to this number.

B. The other entities appearing in the document body are likewise mapped to the enterprise knowledge graph, and all the corresponding related entities and the relationships between them are then extracted to form a new small knowledge graph specific to the document. The entities of the small knowledge graph are ranked by importance using the PageRank algorithm. PageRank is a graph computation algorithm that ranks the importance of entity nodes by iteratively computing node weights over the connections between nodes. Finally, the top k entities are retained as entity tags according to the ranking.
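The ranking in B can be sketched with a plain power-iteration PageRank over the small document-specific graph. This is a minimal sketch, not the patented implementation: the damping factor 0.85, the fixed iteration count, the function names and the toy graph are all assumptions, and every edge target is assumed to appear as a key of the graph dictionary.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [out-neighbors]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank

def top_k_entities(graph, k):
    """Return the k highest-ranked nodes; ties are broken alphabetically."""
    rank = pagerank(graph)
    return sorted(rank, key=lambda node: (-rank[node], node))[:k]

# Hypothetical small knowledge graph built for one document.
small_graph = {
    "Product X": ["R&D department"],
    "R&D department": ["Product X"],
    "Alice": ["Product X"],
    "Deployment Guide": ["Product X"],
}
print(top_k_entities(small_graph, k=2))
```

In this toy graph "Product X" receives links from every other node and therefore ranks first, followed by "R&D department", which it links to.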

(2) Non-entity label

In the present embodiment, noun phrase tags are used as the example of non-entity tags.

Noun phrases in the title and body of the online document are extracted. The noun phrases herein do not include the entities that have been previously extracted.

Noun phrases are extracted by parsing the dependency relations of each sentence with a Chinese spaCy model and then applying rules formulated on those dependency relations to extract conventional noun phrase sequences and noun phrase sequences containing special verb structures. The extracted phrase sequences are then combined. Noun phrases of 2 to 7 characters in length are kept, and common words and dirty words are filtered out; the dirty words can be collected manually according to test results.
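The length and word-list filtering at the end of this step can be sketched as follows, taking the candidate phrases as already extracted. The function name `filter_noun_phrases`, the example phrases and the word lists are assumptions for illustration; the dependency-based extraction rules themselves are not reproduced here.

```python
def filter_noun_phrases(phrases, entities, common_words, dirty_words,
                        min_len=2, max_len=7):
    """Keep phrases of 2-7 characters that are not entities or listed words."""
    blocked = set(entities) | set(common_words) | set(dirty_words)
    kept = []
    for phrase in phrases:
        if phrase in blocked:
            continue  # already an entity, or a common/dirty word
        if not (min_len <= len(phrase) <= max_len):
            continue  # too short or too long, counted in characters
        kept.append(phrase)
    return kept

# Hypothetical candidates (glosses: "deployment document", "of",
# "knowledge graph label extraction method", "we", "Product A").
phrases = ["部署文档", "的", "知识图谱标签提取方法", "我们", "产品A"]
kept = filter_noun_phrases(
    phrases,
    entities={"产品A"},
    common_words={"我们"},
    dirty_words=set(),
)
print(kept)
```

Duplicates are deliberately kept in the output so that a later step can still count how often each phrase occurs.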

C. If noun phrases appear in the title, they are considered important noun phrases and are directly retained as non-entity tags.

D. The remaining noun phrases appearing in the body text are ranked by importance according to their word frequency, and the top k are selected as non-entity tags.
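Rules C and D can be sketched together: title phrases are kept unconditionally, and body phrases are ranked by frequency. The function name `non_entity_tags`, the value of `k` and the sample phrases are assumptions for the example.

```python
from collections import Counter

def non_entity_tags(title_phrases, body_phrases, k=3):
    """Rule C: keep all title phrases. Rule D: add the k most frequent body phrases."""
    tags = list(dict.fromkeys(title_phrases))  # C, without duplicates
    # D: count body phrases that were not already kept from the title.
    counts = Counter(p for p in body_phrases if p not in tags)
    ranked = sorted(counts, key=lambda p: (-counts[p], p))
    return tags + ranked[:k]

# Hypothetical noun phrases from one document.
title_phrases = ["deployment guide"]
body_phrases = [
    "load balancer", "load balancer", "deployment guide",
    "cache layer", "load balancer", "cache layer", "log file",
]
print(non_entity_tags(title_phrases, body_phrases, k=2))
```

Ties in frequency are broken alphabetically here only to make the output deterministic; the description does not specify a tie-breaking rule.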

Thus, in addition to the required entities, the invention also considers noun phrase tags, which avoids missing important tag information.

(3) Document tag deduplication

The entity tags and non-entity tags extracted in A, B, C and D are collected together and de-duplicated to obtain the final document tags.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a tag extraction system according to the present invention. As shown in fig. 5, the system for extracting a knowledge-graph-based tag of the present invention comprises:

the document preprocessing unit 11 is used for preprocessing an online document to obtain an entity and a non-entity;

the entity processing unit 12 is used for mapping the entity to a knowledge graph to obtain a plurality of entity labels;

a non-entity processing unit 13, which analyzes the non-entity to obtain a non-entity sequence through a preset rule, and processes the non-entity sequence to obtain a plurality of non-entity labels;

and a merging processing unit 14, configured to merge and perform deduplication processing on the entity tags and the non-entity tags to obtain document tags corresponding to the online documents.

Further, the document preprocessing unit 11 includes:

an entity obtaining module 111, which extracts a plurality of entities of the online document through an entity identification technology;

the non-entity obtaining module 112 extracts a plurality of non-entities of the online document according to the dependency relationship.

Still further, the entity processing unit 12 includes:

an entity position judging module 121, which judges the position of the entity in the online document;

an entity tag obtaining module 122, configured to: set the entity appearing in the document name or document title of the online document as the entity tag; set, as the entity tag, at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and map each entity appearing in the body text of the online document to the knowledge graph, construct a new small knowledge graph from the corresponding related entities and the relationships between them, rank the entities of the small knowledge graph by importance using the PageRank algorithm, and retain at least one entity as the entity tag according to the ranking result.

Further, the non-entity processing unit 13 sets the non-entity sequences appearing in the title of the online document as the non-entity tags and, after sorting the non-entity sequences appearing in the body text of the online document by word frequency, selects at least one of them as the non-entity tags.

Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 6, the present embodiment discloses a specific implementation of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.

Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.

The memory 82 may include mass storage for data or instructions. By way of example, and not limitation, the memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, the memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Out Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.

The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.

The processor 81 implements any of the tag extraction methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.

In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.

The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also implement data communication with other components such as external devices, image/data acquisition devices, databases, external storage and image/data processing workstations.

The bus 80 includes hardware, software, or both, coupling the components of the electronic device to one another. The bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus and a local bus. By way of example, and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB) or other suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.

In addition, in combination with the processing methods in the foregoing embodiments, the embodiments of the present application may be implemented by providing a computer-readable storage medium. The computer-readable storage medium stores computer program instructions; when executed by a processor, the computer program instructions implement any of the tag extraction methods in the above embodiments.

In summary, the invention introduces entity recognition, so that the required entity types are extracted in a targeted manner as text labels; in addition, the extracted entities are ranked by importance using the knowledge graph, so that the extracted labels are more valuable, manual review and similar operations are unnecessary, and labor cost is reduced.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A label extraction method based on a knowledge graph is characterized by comprising the following steps:

2. The tag extraction method of claim 1, wherein the document preprocessing step comprises:

3. The tag extraction method of claim 1, wherein the entity processing step comprises:

4. The label extraction method of claim 1, wherein the non-entity processing step comprises: setting the non-entity sequences appearing in the title of the online document as the non-entity tags; and sorting the non-entity sequences appearing in the body text of the online document by word frequency and then selecting at least one of them as the non-entity tags.

5. A knowledge-graph-based tag extraction system, comprising:

6. The tag extraction system of claim 5, wherein the document preprocessing unit comprises:

7. The tag extraction system of claim 5, wherein the entity processing unit comprises:

8. The tag extraction system of claim 5, wherein the non-entity processing unit sets the non-entity sequences appearing in the title of the online document as the non-entity tags and, after sorting the non-entity sequences appearing in the body text of the online document by word frequency, selects at least one of them as the non-entity tags.

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the tag extraction method of any one of claims 1 to 4 when executing the computer program.

10. A medium on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the tag extraction method according to any one of claims 1 to 4.