CN113468339A - Label extraction method, system, electronic device and medium based on knowledge graph - Google Patents
Label extraction method, system, electronic device and medium based on knowledge graph
- Publication number
- CN113468339A CN202110704870.7A
- Authority
- CN
- China
- Prior art keywords
- entity
- document
- entities
- tag
- online document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a knowledge-graph-based label extraction method, system, electronic device, and medium. The label extraction method comprises: a document preprocessing step of preprocessing an online document to obtain entities and non-entities; an entity processing step of mapping the entities to a knowledge graph to obtain a plurality of entity tags; a non-entity processing step of analyzing the non-entities according to preset rules to obtain non-entity sequences, and processing the non-entity sequences to obtain a plurality of non-entity tags; and a merging processing step of merging and de-duplicating the entity tags and the non-entity tags to obtain the document tags corresponding to the online document. The invention introduces entity recognition so that the required entity types are extracted in a targeted manner as text tags; in addition, the extracted entities are ranked by importance using the knowledge graph, which makes the extracted tags more valuable, removes the need for manual re-checking and similar operations, and reduces labor cost.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a knowledge-graph-based label extraction method, system, electronic device, and medium.
Background
Recently, online documents have received increasing attention because their cloud storage enables multi-person collaboration, which effectively improves the working efficiency of enterprises. An online document carries not only its own content but also information generated during multi-person collaboration, such as its editors and viewers. This information can readily be connected with an enterprise's internal knowledge base, information systems, and enterprise knowledge graph, which greatly increases the value of online documents.
Extracting tags from online documents can more effectively support intelligent knowledge-base services such as document retrieval and recommendation. Document tag extraction uses NLP and other techniques to extract the important information of one or more documents; this information is the content that users actually care about and includes entities, key phrases, and the like. With these tags, services such as document search and recommendation can be made more intelligent and accurate, which improves efficiency.
Most existing techniques extract keywords directly from documents. For example, methods based on TF-IDF (term frequency-inverse document frequency) use the product of a word's term frequency and its inverse document frequency as the word's score, and select the several highest-scoring words as keywords. Methods based on TextRank construct a candidate keyword graph from local lexical relations, namely a co-occurrence window: an edge is constructed between any two nodes that co-occur, and the weight of each node is computed iteratively according to a formula until convergence. Finally, the nodes are sorted by weight and the top several are selected as keywords. In practice, however, these prior-art methods rely only on word frequency or word co-occurrence to judge the importance of words and rank the extracted keywords, which in some cases is not accurate enough. Even if stop words are removed before extraction, the extracted keywords may still be words that occur often but carry no obvious meaning, so manual screening is often required after extraction. In addition, the prior-art methods cannot focus extraction on the word types of interest, such as product names, department names, and document types.
It is therefore desirable to develop a knowledge-graph-based label extraction method, system, electronic device, and medium that overcome the above drawbacks.
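The TF-IDF scoring described above as prior art can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `tfidf_keywords`, the toy corpus, and the unsmoothed form idf = log(N / df) are assumptions chosen for brevity.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Score each word of one document by tf * idf and return the top_k words."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[doc_index])
    total = len(docs[doc_index])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

# Hypothetical, already tokenized documents.
docs = [
    ["deploy", "the", "gateway", "service", "the", "gateway"],
    ["the", "white", "paper", "of", "the", "product"],
    ["the", "manual", "of", "the", "service"],
]
print(tfidf_keywords(docs, 0, top_k=2))  # ['gateway', 'deploy']
```

A word such as "the" that occurs in every document scores zero here, but a frequent word that is absent from most other documents still ranks high whether or not it is a meaningful tag, which is the weakness the passage points out.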
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a knowledge-graph-based label extraction method, system, electronic device, and medium, so as to at least solve the problem of focusing extraction on the word types of interest.
The invention provides a knowledge graph-based label extraction method, which comprises the following steps:
a document preprocessing step: preprocessing an online document to obtain entities and non-entities;
an entity processing step: mapping the entities to a knowledge graph to obtain a plurality of entity tags;
a non-entity processing step: analyzing the non-entities according to preset rules to obtain non-entity sequences, and processing the non-entity sequences to obtain a plurality of non-entity tags;
a merging processing step: merging and de-duplicating the entity tags and the non-entity tags to obtain the document tags corresponding to the online document.
In the above tag extraction method, the document preprocessing step includes:
an entity acquisition step: extracting a plurality of the entities from the online document through an entity recognition technology;
a non-entity acquisition step: extracting a plurality of the non-entities from the online document according to dependency relationships.
In the above tag extraction method, the entity processing step includes:
an entity position determination step: determining the position of each entity in the online document;
an entity tag acquisition step: setting an entity that appears in the document name or document title of the online document as an entity tag; setting as an entity tag at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and mapping the entities that appear in the body text of the online document to the knowledge graph, constructing a new small knowledge graph from the corresponding related entities and the relationships between them, ranking the entities of the small knowledge graph by importance with the PageRank algorithm, and retaining at least one entity as an entity tag according to the ranking result.
In the above tag extraction method, the non-entity processing step includes: setting the non-entity sequences that appear in the title of the online document as non-entity tags; and sorting the non-entity sequences that appear in the body text of the online document by word frequency and selecting at least one of them as a non-entity tag.
The invention also provides a knowledge graph-based label extraction system, which comprises the following components:
the document preprocessing unit is used for preprocessing an online document to obtain an entity and a non-entity;
the entity processing unit is used for mapping the entity to a knowledge graph to obtain a plurality of entity labels;
the non-entity processing unit is used for analyzing the non-entity to obtain a non-entity sequence through a preset rule and processing the non-entity sequence to obtain a plurality of non-entity labels;
and the merging processing unit is used for merging and de-duplicating the entity tags and the non-entity tags to obtain document tags corresponding to the online documents.
The above tag extraction system, wherein the document preprocessing unit includes:
an entity acquisition module, which extracts a plurality of the entities from the online document through an entity recognition technology;
a non-entity acquisition module, which extracts a plurality of the non-entities from the online document according to dependency relationships.
The tag extraction system described above, wherein the entity processing unit includes:
an entity position determination module, which determines the position of each entity in the online document;
an entity tag acquisition module, configured to: set an entity that appears in the document name or document title of the online document as an entity tag; set as an entity tag at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and map the entities that appear in the body text of the online document to the knowledge graph, construct a new small knowledge graph from the corresponding related entities and the relationships between them, rank the entities of the small knowledge graph by importance with the PageRank algorithm, and retain at least one entity as an entity tag according to the ranking result.
In the above tag extraction system, the non-entity processing unit sets the non-entity sequences that appear in the title of the online document as non-entity tags, and, after sorting the non-entity sequences that appear in the body text of the online document by word frequency, selects at least one of them as a non-entity tag.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the tag extraction method as described in any one of the above when executing the computer program.
The invention also provides a medium on which a computer program is stored, wherein the program, when executed by a processor, implements a tag extraction method as defined in any one of the above.
Compared with the prior art, the invention has the following effects: it introduces entity recognition so that the required entity types are extracted in a targeted manner as text tags; in addition, the extracted entities are ranked by importance using the knowledge graph, which makes the extracted tags more valuable, removes the need for manual re-checking and similar operations, and reduces labor cost.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a tag extraction method of the present invention;
FIG. 2 is a flowchart illustrating the substeps of step S1 in FIG. 1;
FIG. 3 is a flowchart illustrating the substeps of step S2 in FIG. 1;
FIG. 4 is a flow chart of an application of the tag extraction method of the present invention;
FIG. 5 is a schematic diagram of the structure of the tag extraction system of the present invention;
FIG. 6 is a schematic structural diagram of the electronic device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The exemplary embodiments of the present invention and the description thereof are provided to explain the present invention and not to limit the present invention. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.
As used herein, the terms "first", "second", "S1", "S2", …, etc. do not particularly denote an order or sequential meaning, nor are they intended to limit the present invention, but merely distinguish between elements or operations described in the same technical terms.
With respect to directional terminology used herein, for example: up, down, left, right, front or rear, etc., are simply directions with reference to the drawings. Accordingly, the directional terminology used is intended to be illustrative and is not intended to be limiting of the present teachings.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
References to "plurality" herein include "two" and "more than two"; reference to "multiple sets" herein includes "two sets" and "more than two sets".
As used herein, the terms "substantially", "about" and the like modify any quantity or error that may vary slightly, where the variation does not alter its essence. Generally, the range of slight variations or errors modified by such terms may be 20% in some embodiments, 10% in some embodiments, 5% in some embodiments, or other values. It should be understood by those skilled in the art that the aforementioned values can be adjusted according to actual needs, and are not limited thereto.
Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.
The quality of document tag extraction depends on two aspects: the quality of the extracted entities or keywords, and how well the extracted words are ranked. The extracted words are usually sorted so that the top-k words can be selected as the tags representing the document, so obtaining a more reasonable ranking is a very important problem. The invention first divides document tags into two types, entity and non-entity (noun phrases are taken as the example of non-entities), and extracts and ranks the two types separately. Entities are obtained through an entity recognition technology; in addition, an online document carries further entities such as its editors and viewers. The entities are ranked using the related knowledge graph. Non-entities are obtained by extracting noun phrases, which are ranked by a combination of where they appear and their word frequency. Finally, the top-k tags of each type are taken and merged to form the final document tags. The present invention is described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a tag extraction method according to the present invention. As shown in fig. 1, the method for extracting a label based on a knowledge-graph includes:
document preprocessing step S1: preprocessing an online document to obtain an entity and a non-entity;
entity processing step S2: mapping the entity to a knowledge graph to obtain a plurality of entity labels;
non-entity processing step S3: analyzing the non-entity to obtain a non-entity sequence through a preset rule, and processing the non-entity sequence to obtain a plurality of non-entity labels;
merging processing step S4: merging and de-duplicating the entity tags and the non-entity tags to obtain the document tags corresponding to the online document.
In the embodiment, the non-entity may be a first-order phrase, but the invention is not limited thereto.
Further, referring to fig. 2, fig. 2 is a flowchart illustrating a sub-step of step S1 in fig. 1. As shown in fig. 2, the document preprocessing step S1 includes:
entity acquisition step S11: extracting a plurality of the entities of the online document through an entity identification technology;
non-entity acquisition step S12: extracting a plurality of the non-entities from the online document according to dependency relationships.
In this embodiment, an entity may be, for example, a product, a function, a technology, a department, or a document type, where the document type may be a deployment document, a white paper, an instruction manual, and the like. The invention may adopt entity recognition technologies such as dictionary-based methods and methods based on sequence labeling models, including deep-learning neural-network models, such as CRF, LSTM+CRF, and BERT+CRF. Using entity recognition to focus extraction on the required entity types narrows the extraction range and makes the final document tags more accurate.
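Of the entity recognition options listed above, the dictionary-based method is the simplest to illustrate. The following is a minimal sketch under stated assumptions: the function name `extract_entities`, the dictionary contents, and the longest-match rule are hypothetical and are not specified by the patent. Matching is done on raw substrings without word boundaries, which suits unsegmented Chinese text but would need boundary checks for English.

```python
def extract_entities(text, dictionary):
    """Return (surface form, entity type, start offset) for dictionary hits.

    The longest entry wins at each position, so "deployment document" is
    preferred over the shorter "deployment" starting at the same place.
    """
    # Try longer entries first so they shadow their own prefixes.
    terms = sorted(dictionary, key=len, reverse=True)
    found = []
    i = 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                found.append((term, dictionary[term], i))
                i += len(term)  # skip past the matched entity
                break
        else:
            i += 1  # no entry starts here; move one character on
    return found

# Hypothetical dictionary mapping surface forms to entity types.
dictionary = {
    "deployment document": "document_type",
    "deployment": "function",
    "gateway": "product",
}
print(extract_entities("gateway deployment document", dictionary))
# [('gateway', 'product', 0), ('deployment document', 'document_type', 8)]
```

Restricting the dictionary to the entity types of interest is what narrows the extraction range; a sequence labeling model would play the same role with learned rather than listed entities.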
Still further, referring to fig. 3, fig. 3 is a flowchart illustrating a sub-step of step S2 in fig. 1. As shown in fig. 3, the entity processing step S2 includes:
entity position determination step S21: determining the position of each entity in the online document;
entity tag acquisition step S22: setting an entity that appears in the document name or document title of the online document as an entity tag; setting as an entity tag at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and mapping the entities that appear in the body text of the online document to the knowledge graph, constructing a new small knowledge graph from the corresponding related entities and the relationships between them, ranking the entities of the small knowledge graph by importance with the PageRank algorithm, and retaining at least one entity as an entity tag according to the ranking result. Because the invention ranks the extracted entities by importance using the entities in the knowledge graph and the relationships between them, the ranking result is more accurate.
Further, the non-entity processing step S3 includes: setting the non-entity sequences that appear in the title of the online document as non-entity tags; and sorting the non-entity sequences that appear in the body text of the online document by word frequency and selecting at least one of them as a non-entity tag.
Referring to fig. 4, fig. 4 is a flowchart illustrating an application of the tag extraction method according to the present invention. The tag extraction method of the present invention is specifically described in an embodiment with reference to fig. 4.
(1) Entity label:
First, entities of the desired types are extracted from the online document through an entity recognition technology.
It should be noted that, in addition to the entities extracted by entity recognition, the online document in this embodiment carries further entities such as the document's editors, viewers, and administrators. These entities and the extracted entities are all mapped into the enterprise knowledge graph, that is, the corresponding entity node is found in the graph for each of them.
A knowledge graph is a graph-based data structure composed of nodes and edges. Each node represents an entity, such as an employee, a product, or a company, and each edge represents a relationship between entities. A knowledge graph is essentially a semantic network that exposes the relationships between entities and can link all information together. The enterprise knowledge graph is constructed from a large amount of data within the enterprise and reflects the relevance between entities well, which helps find the entities that best reflect the document content.
An entity is handled differently depending on where it appears in the document.
A. When an entity appears in the document name or a document title, it is regarded as a very important entity and is directly classified as an entity tag. In addition, all entities that appear in the document and whose path distance to that entity's node in the enterprise knowledge graph is 2 are also directly classified as entity tags. Because these entities are closely related in the knowledge graph to the important entities in the document title, they are also of high importance. The number of entity tags can be controlled as required, for example by taking the first 5 tags ordered by path distance, but the invention is not limited to this number.
B. The other entities, which appear in the document body, are likewise mapped to the enterprise knowledge graph, and all the corresponding related entities and the relationships between them are then extracted to form a new small knowledge graph specific to the document. The entities of the small knowledge graph are ranked by importance with the PageRank algorithm. PageRank is a graph computation algorithm that ranks the importance of entity nodes by iteratively computing node weights over the connections between nodes. Finally, the top-k entities in the ranking are retained as entity tags.
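Rules A and B above can be sketched on a toy graph as follows. Everything named here is hypothetical: the adjacency-map representation, the function names, the sample entities, and the damping factor 0.85 (a conventional default, not a value from the patent). Two further assumptions are made: edges are treated as undirected, and the path distance of 2 is read as an inclusive upper bound, whereas the claims speak of a distance below a threshold. The small graph is also simplified to the document's own entities, without additional related entities.

```python
from collections import deque

def related_title_tags(graph, title_entity, doc_entities, max_hops=2, limit=5):
    """Rule A: document entities within max_hops of a title entity."""
    dist = {title_entity: 0}
    queue = deque([title_entity])
    while queue:  # breadth-first search yields hop counts
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    hits = [e for e in dist if e != title_entity and e in doc_entities]
    # Closest first, then alphabetical, keeping at most `limit` tags.
    return sorted(hits, key=lambda e: (dist[e], e))[:limit]

def subgraph(graph, keep):
    """Small document-specific graph: kept nodes and the edges between them."""
    return {n: {m for m in graph[n] if m in keep} for n in keep if n in graph}

def pagerank(graph, damping=0.85, iterations=50):
    """Rule B: power-iteration PageRank over an undirected adjacency map."""
    nodes = sorted(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
                continue
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] += share
        rank = new_rank
    return rank

# Hypothetical enterprise knowledge graph: node -> neighbouring nodes.
GRAPH = {
    "Gateway": {"Alice", "Platform Dept", "Deployment Doc"},
    "Alice": {"Gateway"},
    "Platform Dept": {"Gateway", "Bob"},
    "Deployment Doc": {"Gateway"},
    "Bob": {"Platform Dept"},
}
doc_entities = {"Gateway", "Platform Dept", "Deployment Doc", "Bob"}

print(related_title_tags(GRAPH, "Gateway", doc_entities))
# ['Deployment Doc', 'Platform Dept', 'Bob']
rank = pagerank(subgraph(GRAPH, doc_entities))
print(sorted(rank, key=rank.get, reverse=True)[:2])  # top-k with k = 2
```

In this toy graph the two best-connected nodes, "Gateway" and "Platform Dept", receive the highest PageRank scores, so they would be the entities retained under rule B.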
(2) Non-entity label
In this embodiment, noun phrase tags are taken as the example of non-entity tags.
Noun phrases are extracted from the title and body of the online document. These noun phrases do not include the entities extracted earlier.
To extract noun phrases, the dependency relationships of each sentence are analyzed with a Chinese spaCy model, and rules formulated on those dependency relationships extract both conventional noun phrase sequences and noun phrase sequences containing special verb structures. The extracted phrase sequences are then merged. Noun phrases with a length of 2 to 7 characters are kept, and common words and dirty words are filtered out; the dirty words can be collected manually from test results.
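The filtering at the end of this step can be sketched as below. The dependency parsing and the extraction rules themselves are not shown, because the patent does not spell them out; the sketch assumes the candidate phrases have already been extracted. The function name `filter_noun_phrases` and all sample words are hypothetical.

```python
def filter_noun_phrases(phrases, entities, common_words, dirty_words,
                        min_len=2, max_len=7):
    """Keep candidate noun phrases that pass the length and word-list filters."""
    blocked = set(entities) | set(common_words) | set(dirty_words)
    kept = []
    for phrase in phrases:
        if not (min_len <= len(phrase) <= max_len):
            continue  # outside the 2-7 character window
        if phrase in blocked:
            continue  # already an entity, too common, or known noise
        kept.append(phrase)
    return kept

# Hypothetical candidates produced by the (omitted) dependency rules.
candidates = ["部署", "知识图谱", "的", "在线文档标签提取方法", "问题", "网关"]
print(filter_noun_phrases(
    candidates,
    entities={"网关"},       # already extracted as an entity
    common_words={"问题"},   # too generic to serve as a tag
    dirty_words=set(),
))
# ['部署', '知识图谱']
```

Length is counted in characters, which matches the 2 to 7 character range given for Chinese text; a word-based limit would be the natural substitute for space-delimited languages.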
C. Noun phrases that appear in the title are considered important and are directly retained as non-entity tags.
D. The remaining noun phrases, which appear in the body text, are ranked by importance according to their word frequency, and the top-k phrases are selected as non-entity tags.
In this way, the invention also takes into account noun phrase tags beyond the required entities, which avoids missing important tag information.
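Rules C and D amount to a pass-through for title phrases followed by a frequency ranking of body phrases. A minimal sketch, assuming the phrases have already been extracted and filtered; the function name `non_entity_tags`, the sample phrases, and the alphabetical tie-break are hypothetical choices:

```python
from collections import Counter

def non_entity_tags(title_phrases, body_phrases, top_k=3):
    """Rule C keeps every title phrase; rule D ranks body phrases by frequency."""
    tags = list(dict.fromkeys(title_phrases))  # de-duplicate, keep order
    # Count only the remaining phrases, i.e. those not already kept by rule C.
    counts = Counter(p for p in body_phrases if p not in tags)
    # Highest frequency first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda p: (-counts[p], p))
    return tags + ranked[:top_k]

title = ["deployment guide"]
body = ["cluster node", "deployment guide", "config file", "cluster node",
        "log level", "config file", "cluster node"]
print(non_entity_tags(title, body, top_k=2))
# ['deployment guide', 'cluster node', 'config file']
```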
(3) Document tag deduplication
The entity tags and non-entity tags extracted in A, B, C, and D are pooled and de-duplicated to obtain the final document tags.
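The merge and de-duplication step can be sketched as an order-preserving union of the four tag groups. The function name `merge_tags` and the sample tags are hypothetical, and treating tags as duplicates regardless of case and surrounding whitespace is an assumption; the patent only states that duplicates are removed.

```python
def merge_tags(*tag_groups):
    """Concatenate tag groups and drop duplicates, keeping first occurrences."""
    seen = set()
    merged = []
    for group in tag_groups:
        for tag in group:
            # Assumption: compare tags ignoring case and outer whitespace.
            key = tag.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(tag.strip())
    return merged

tags_a = ["Gateway"]                          # title entities (A)
tags_b = ["Platform Dept", "gateway"]         # PageRank entities (B)
tags_c = ["deployment guide"]                 # title noun phrases (C)
tags_d = ["cluster node", "Platform Dept "]   # frequent body phrases (D)
print(merge_tags(tags_a, tags_b, tags_c, tags_d))
# ['Gateway', 'Platform Dept', 'deployment guide', 'cluster node']
```

Passing the groups in the order A, B, C, D means that when the same tag arrives from several rules, the spelling from the earliest group is the one that is kept.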
Referring to fig. 5, fig. 5 is a schematic structural diagram of a tag extraction system according to the present invention. As shown in fig. 5, the system for extracting a knowledge-graph-based tag of the present invention comprises:
the document preprocessing unit 11 is used for preprocessing an online document to obtain an entity and a non-entity;
the entity processing unit 12 is used for mapping the entity to a knowledge graph to obtain a plurality of entity labels;
a non-entity processing unit 13, which analyzes the non-entity to obtain a non-entity sequence through a preset rule, and processes the non-entity sequence to obtain a plurality of non-entity labels;
and a merging processing unit 14, configured to merge and perform deduplication processing on the entity tags and the non-entity tags to obtain document tags corresponding to the online documents.
Further, the document preprocessing unit 11 includes:
an entity obtaining module 111, which extracts a plurality of entities of the online document through an entity identification technology;
the non-entity obtaining module 112 extracts a plurality of non-entities of the online document according to the dependency relationship.
Still further, the entity processing unit 12 includes:
an entity position judging module 121, which judges the position of the entity in the online document;
an entity tag acquisition module 122, configured to: set an entity that appears in the document name or document title of the online document as an entity tag; set as an entity tag at least one entity that appears in the online document and whose node path distance to that entity in the knowledge graph is less than a threshold; and map the entities that appear in the body text of the online document to the knowledge graph, construct a new small knowledge graph from the corresponding related entities and the relationships between them, rank the entities of the small knowledge graph by importance with the PageRank algorithm, and retain at least one entity as an entity tag according to the ranking result.
Further, the non-entity processing unit 13 sets the non-entity sequences that appear in the title of the online document as non-entity tags, and, after sorting the non-entity sequences that appear in the body text of the online document by word frequency, selects at least one of them as a non-entity tag.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 6, the present embodiment discloses a specific implementation of an electronic device. The electronic device may include a processor 81 and a memory 82 storing computer program instructions.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the tag extraction methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.
In some of these embodiments, the electronic device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
The bus 80 includes hardware, software, or both, and couples the components of the electronic device to one another. The bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example and not limitation, the bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. The bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.
In addition, in combination with the processing methods in the foregoing embodiments, the embodiments of the present application may be implemented by providing a computer-readable storage medium. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the tag extraction methods in the above embodiments.
In summary, the invention introduces entity recognition so that the required entity types are extracted in a targeted manner to serve as text labels; meanwhile, the extracted entities are ranked by importance through the knowledge graph, so that the extracted labels are more valuable, operations such as manual re-inspection are omitted, and labor cost is reduced.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A tag extraction method based on a knowledge graph, characterized by comprising the following steps:
a document preprocessing step: preprocessing an online document to obtain an entity and a non-entity;
an entity processing step: mapping the entity to a knowledge graph to obtain a plurality of entity tags;
a non-entity processing step: analyzing the non-entity according to a preset rule to obtain a non-entity sequence, and processing the non-entity sequence to obtain a plurality of non-entity tags;
a merging processing step: merging and de-duplicating the entity tags and the non-entity tags to obtain the document tags corresponding to the online document.
2. The tag extraction method of claim 1, wherein the document preprocessing step comprises:
an entity acquisition step: extracting a plurality of the entities of the online document through an entity recognition technology;
a non-entity acquisition step: extracting a plurality of the non-entities of the online document according to dependency relationships.
3. The tag extraction method of claim 1, wherein the entity processing step comprises:
an entity position judgment step: determining the position of the entity in the online document;
an entity tag obtaining step: setting the entity appearing in a document name or a document title of the online document as the entity tag; setting, as the entity tag, at least one entity in the knowledge graph whose node path distance to the entity is less than a threshold and which appears in the online document; and mapping the entities appearing in the body text of the online document to the knowledge graph, constructing a new small knowledge graph from the related entities corresponding to those entities and the relationships between them, ranking the entities of the small knowledge graph by importance using the PageRank algorithm, and retaining at least one entity as the entity tag according to the ranking result.
4. The tag extraction method of claim 1, wherein the non-entity processing step comprises: setting the non-entity sequences appearing in the title of the online document as the non-entity tags; and sorting the non-entity sequences appearing in the body text of the online document by word frequency and selecting at least one of them as the non-entity tag.
5. A knowledge-graph-based tag extraction system, comprising:
the document preprocessing unit is used for preprocessing an online document to obtain an entity and a non-entity;
the entity processing unit is used for mapping the entity to a knowledge graph to obtain a plurality of entity labels;
the non-entity processing unit is used for analyzing the non-entity to obtain a non-entity sequence through a preset rule and processing the non-entity sequence to obtain a plurality of non-entity labels;
and the merging processing unit is used for merging and de-duplicating the entity tags and the non-entity tags to obtain document tags corresponding to the online documents.
6. The tag extraction system of claim 5, wherein the document preprocessing unit comprises:
the entity acquisition module extracts a plurality of entities of the online document through an entity identification technology;
and the non-entity acquisition module extracts a plurality of non-entities of the online document according to the dependency relationship.
7. The tag extraction system of claim 5, wherein the entity processing unit comprises:
the entity position judging module is used for judging the position of the entity in the online document;
an entity tag obtaining module, configured to: set the entity appearing in the document name or document title of the online document as the entity tag; set, as the entity tag, at least one entity in the knowledge graph whose node path distance to the entity is less than a threshold and which appears in the online document; and map the entities appearing in the body text of the online document to the knowledge graph, construct a new small knowledge graph from the related entities corresponding to those entities and the relationships between them, rank the entities of the small knowledge graph by importance using the PageRank algorithm, and retain at least one entity as the entity tag according to the ranking result.
8. The tag extraction system of claim 5, wherein the non-entity processing unit sets the non-entity sequences appearing in the title of the online document as the non-entity tags, and, after sorting the non-entity sequences appearing in the body text of the online document by word frequency, selects at least one of them as the non-entity tag.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the tag extraction method of any one of claims 1 to 4 when executing the computer program.
10. A medium on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the tag extraction method according to any one of claims 1 to 4.
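The merging processing step recited in claim 1 (combine the entity tags and non-entity tags, then de-duplicate) can be illustrated with a short sketch. The function name `merge_tags` and the case-insensitive comparison are assumptions made for illustration; the source does not say how duplicates are detected.

```python
def merge_tags(entity_tags, non_entity_tags):
    """Concatenate both tag lists and drop duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for tag in [*entity_tags, *non_entity_tags]:
        cleaned = tag.strip()
        key = cleaned.casefold()  # assumed: duplicates compared case-insensitively
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged


document_tags = merge_tags(
    ["Knowledge Graph", "PageRank"],
    ["pagerank", "label extraction"],
)
print(document_tags)
```

Entity tags are listed first so that, when the same tag appears in both lists, the spelling from the entity list is the one that survives.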
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110704870.7A CN113468339B (en) | 2021-06-24 | 2021-06-24 | Label extraction method and system based on knowledge graph, electronic equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110704870.7A CN113468339B (en) | 2021-06-24 | 2021-06-24 | Label extraction method and system based on knowledge graph, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113468339A (en) | 2021-10-01 |
CN113468339B (en) | 2024-09-13 |
Family
ID=77872852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110704870.7A Active CN113468339B (en) | 2021-06-24 | 2021-06-24 | Label extraction method and system based on knowledge graph, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113468339B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114970548A (en) * | 2022-06-14 | 2022-08-30 | 阿里云计算有限公司 | Label processing method and device, electronic equipment and computer readable storage medium |
CN116737926A (en) * | 2023-06-07 | 2023-09-12 | 北京天融信网络安全技术有限公司 | Method, device, equipment and storage medium for classifying threat information text |
TWI848531B (en) * | 2023-01-18 | 2024-07-11 | 中華電信股份有限公司 | Electronic device and method for generating knowledge graph |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107436922A (en) * | 2017-07-05 | 2017-12-05 | 北京百度网讯科技有限公司 | Text label generation method and device |
CN111209411A (en) * | 2020-01-03 | 2020-05-29 | 北京明略软件系统有限公司 | Document analysis method and device |
CN112052304A (en) * | 2020-08-18 | 2020-12-08 | 中国建设银行股份有限公司 | Course label determining method and device and electronic equipment |
CN112446204A (en) * | 2020-12-07 | 2021-03-05 | 北京明略软件系统有限公司 | Document tag determination method, system and computer equipment |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107436922A (en) * | 2017-07-05 | 2017-12-05 | 北京百度网讯科技有限公司 | Text label generation method and device |
US20190012377A1 (en) * | 2017-07-05 | 2019-01-10 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and device for generating text tag |
CN111209411A (en) * | 2020-01-03 | 2020-05-29 | 北京明略软件系统有限公司 | Document analysis method and device |
CN112052304A (en) * | 2020-08-18 | 2020-12-08 | 中国建设银行股份有限公司 | Course label determining method and device and electronic equipment |
CN112446204A (en) * | 2020-12-07 | 2021-03-05 | 北京明略软件系统有限公司 | Document tag determination method, system and computer equipment |
Non-Patent Citations (1)
Title |
---|
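孙梦博 (Sun Mengbo): "Keyword extraction method based on convolutional neural network" ("基于卷积神经网络的关键词提取方法"), 《计算机产品与流通》 (Computer Products and Circulation), no. 1, page 50 * |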
Also Published As
Publication number | Publication date |
---|---|
CN113468339B (en) | 2024-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019174132A1 (en) | Data processing method, server and computer storage medium | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
CN109299480B (en) | Context-based term translation method and device | |
WO2019091026A1 (en) | Knowledge base document rapid search method, application server, and computer readable storage medium | |
WO2017097231A1 (en) | Topic processing method and device | |
CN107577671B (en) | Subject term extraction method based on multi-feature fusion | |
CN107463548B (en) | Phrase mining method and device | |
CN104598532A (en) | Information processing method and device | |
CN111444330A (en) | Method, device and equipment for extracting short text keywords and storage medium | |
CN107102993B (en) | User appeal analysis method and device | |
CN105279277A (en) | Knowledge data processing method and device | |
CN113468339B (en) | Label extraction method and system based on knowledge graph, electronic equipment and medium | |
CN113282955B (en) | Method, system, terminal and medium for extracting privacy information in privacy policy | |
CN108549723B (en) | Text concept classification method and device and server | |
CN110968664A (en) | Document retrieval method, device, equipment and medium | |
CN106844482B (en) | Search engine-based retrieval information matching method and device | |
CN109165373B (en) | Data processing method and device | |
CN111160445B (en) | Bid file similarity calculation method and device | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN105574004B (en) | A kind of removing duplicate webpages method and apparatus | |
CN110888977B (en) | Text classification method, apparatus, computer device and storage medium | |
CN108475265B (en) | Method and device for acquiring unknown words | |
CN109344397B (en) | Text feature word extraction method and device, storage medium and program product | |
US20210182549A1 (en) | Natural Language Processing (NLP) Pipeline for Automated Attribute Extraction | |
CN112446204B (en) | Method, system and computer equipment for determining document label |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20231109. Address after: Room 401, 4th Floor, Building J, Yunmi City, No. 19 Ningshuang Road, Yuhuatai District, Nanjing City, Jiangsu Province, 210000. Applicant after: Nanjing Minglue Technology Co.,Ltd. Address before: 100089 a1002, 10th floor, building 1, yard 1, Zhongguancun East Road, Haidian District, Beijing. Applicant before: MININGLAMP SOFTWARE SYSTEMS Co.,Ltd. |
| | GR01 | Patent grant | |