CN114048742A - Knowledge entity and relation extraction method of text information and text quality evaluation method - Google Patents

Publication number
CN114048742A
Authority
CN
China
Prior art keywords
knowledge
entity
text
knowledge entity
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111251665.6A
Other languages
Chinese (zh)
Other versions
CN114048742B (en)
Inventor
王怀波
陈丽
郑勤华
杜君磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University
Priority to CN202111251665.6A
Publication of CN114048742A
Application granted
Publication of CN114048742B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for extracting knowledge entities and relations of text information and a method for evaluating text quality, wherein the extraction method comprises the following steps: acquiring text data; extracting knowledge entities in the text data; calculating phrase importance indexes of the knowledge entities according to a mutual information algorithm; calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors; and determining the relation of the knowledge entities according to the occurrence probability and the refinement degree of the knowledge entities. By implementing the method, original words lost when the knowledge entity is extracted can be recombined by calculating the phrase importance index, so that the accuracy of extraction of the knowledge entity is improved; meanwhile, the quality of the extracted text information can be conveniently evaluated in the follow-up process by calculating the importance of the knowledge entity; in addition, the method also provides that the relation of the knowledge entities is determined based on the occurrence probability and the refinement degree of the knowledge entities, and further analysis of the text information can be facilitated through determination of the relation of the knowledge entities.

Description

Knowledge entity and relation extraction method of text information and text quality evaluation method
Technical Field
The invention relates to the technical field of text processing, in particular to a knowledge entity and relation extraction method and a text quality evaluation method of text information.
Background
With the rapid development of the internet, information resources of all kinds have grown richer and, in some areas, explosively. Text carries abundant information, and research on text is receiving more and more attention. Extracting the content of interest from huge volumes of text is therefore important, and the purpose of information extraction is to provide people with a powerful tool for acquiring information.
A piece of text may contain a great deal of knowledge and information, and text is an important and common data format. Extracting and mining the key points of text data with computer algorithms helps people obtain a large amount of refined knowledge entity data in a short time, capture key information quickly, and improve the efficiency and quality of reading. However, current methods for extracting and mining text data do not comprehensively consider the information in the text, resulting in insufficient information acquisition capability.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for extracting knowledge entities and relationships of text information and a method for evaluating text quality, so as to solve the technical problem in the prior art that information acquisition capability is insufficient when information is extracted from text data.
The technical scheme provided by the invention is as follows:
The first aspect of the embodiments of the present invention provides a method for extracting knowledge entities and relationships of text information, including: acquiring text data; extracting knowledge entities in the text data; calculating the phrase importance index of the knowledge entity according to a mutual information algorithm; calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors; and determining the relation of the knowledge entities according to the occurrence probability and the refinement degree of the knowledge entities.
Optionally, the method for extracting knowledge entities and relationships of text information further includes: performing part-of-speech tagging on the knowledge entity; and constructing a knowledge entity database according to the knowledge entity, the part of speech tagging result, the phrase importance index, the knowledge entity importance and the knowledge entity relationship.
Optionally, calculating a phrase importance index of the knowledge entity according to a mutual information algorithm, including: combining two of the extracted knowledge entities; calculating the cosine mutual information value of the combined knowledge entity according to a mutual information algorithm; calculating a phrase importance index of the combined knowledge entity according to the cosine mutual information value, wherein the phrase importance index is expressed by the following formula:
Figure BDA0003322378500000021
wherein Q-V represents a phrase importance index, and PMI-C represents a cosine mutual information value.
Optionally, calculating the knowledge entity importance according to the improved TextRank algorithm based on word vectors includes: calculating a word vector of each knowledge entity; calculating a Rank index of each knowledge entity according to the word vectors and the improved TextRank algorithm, wherein the improved TextRank algorithm includes constructing a keyword network from the knowledge entities, the keyword network comprising the keyword nodes, directional authority edges between keywords determined according to a network construction range, and the vector distances between keywords; calculating the reverse text frequency of each knowledge entity; and calculating the knowledge entity importance according to the Rank index and the reverse text frequency.
Optionally, determining the relationship of the knowledge entities according to the occurrence probability of the knowledge entities and the refinement degree includes: traversing the text data and determining a document set corresponding to the knowledge entity; calculating the occurrence probability of the knowledge entity according to the document set corresponding to the knowledge entity; calculating the refinement degree of the knowledge entity according to the occurrence probability of the knowledge entity; and determining the relation of the knowledge entities according to the document set corresponding to the knowledge entities and the refinement degree of the knowledge entities.
A second aspect of the embodiments of the present invention provides a text quality assessment method, including: acquiring the importance and the relation of the knowledge entities calculated in the first aspect of the embodiment of the invention; constructing a text entity network graph according to the knowledge entity relationship; calculating the average degree, the diameter, the average path length and the clustering coefficient according to the text entity network graph; and evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity.
Optionally, the evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity includes: calculating the text topic clustering power according to the diameter and the clustering coefficient; and calculating the quality of the text knowledge architecture according to the average degree, the average path length, the clustering coefficient and the importance of the knowledge entity.
A third aspect of the embodiments of the present invention provides a device for extracting knowledge entities and relationships of text information, including: the text data acquisition module is used for acquiring text data; the knowledge entity extraction module is used for extracting knowledge entities in the text data; the importance index calculation module is used for calculating the phrase importance index of the knowledge entity according to a mutual information algorithm; the importance calculating module is used for calculating the importance of the knowledge entity according to the improved TextRank algorithm based on the word vector; and the knowledge entity relationship determining module is used for determining the knowledge entity relationship according to the occurrence probability and the refinement degree of the knowledge entity.
A fourth aspect of the embodiments of the present invention provides a text quality evaluation apparatus, including: the data acquisition module is used for acquiring the importance and the relation of the knowledge entities calculated in the first aspect of the embodiment of the invention; the network graph building module is used for building a text entity network graph according to the knowledge entity relationship; the parameter calculation module is used for calculating the average degree, the diameter, the average path length and the clustering coefficient according to the text entity network graph; and the evaluation module is used for evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity.
A fifth aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a computer to execute the method for extracting knowledge entities and relationships of text information according to the first aspect or any implementation of the first aspect of the embodiments of the present invention, and to execute the text quality assessment method according to the second aspect or any implementation of the second aspect of the embodiments of the present invention.
A sixth aspect of the embodiments of the present invention provides an electronic device, including a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions so as to execute the method for extracting knowledge entities and relationships of text information according to the first aspect or any implementation of the first aspect of the embodiments of the present invention, and to execute the text quality assessment method according to the second aspect or any implementation of the second aspect of the embodiments of the present invention.
The technical scheme provided by the invention has the following effects:
The embodiment of the invention provides a method for extracting knowledge entities and relations of text information, which comprises the steps of, after extracting the knowledge entities in text data, calculating phrase importance indexes of the knowledge entities according to a mutual information algorithm; calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors; and determining the relation of the knowledge entities according to the occurrence probability and the refinement degree of the knowledge entities. Therefore, the method can recombine the original words lost during the extraction of the knowledge entity by calculating the phrase importance index, thereby improving the accuracy of the extraction of the knowledge entity; meanwhile, the quality of the extracted text information can be conveniently evaluated in the follow-up process by calculating the importance of the knowledge entity; in addition, the method also provides that the relation of the knowledge entities is determined based on the occurrence probability and the refinement degree of the knowledge entities, and further analysis of the text information can be facilitated through determination of the relation of the knowledge entities.
According to the text quality evaluation method provided by the embodiment of the invention, the text entity network diagram is constructed through the knowledge entity relationship, each index for evaluating the text quality is determined through the network diagram, and finally, the text quality is comprehensively evaluated by combining a plurality of indexes and the importance of the knowledge entity. Therefore, the text quality assessment method reconsiders the relationship representation mode between the entities based on the complex network view, provides the text assessment index based on the entity relationship, and provides ideas and references for exploring the text quality of the content in the characteristic field.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow diagram of a knowledge entity and relationship extraction method for textual information, according to an embodiment of the invention;
FIG. 2 is a flow diagram of a knowledge entity and relationship extraction method for textual information, according to another embodiment of the present invention;
FIG. 3 is a flow diagram of a knowledge entity and relationship extraction method for textual information, according to another embodiment of the present invention;
FIG. 4 is a flow diagram of a knowledge entity and relationship extraction method for textual information, according to another embodiment of the present invention;
FIG. 5 is a flow diagram of a method of text quality assessment according to an embodiment of the invention;
FIG. 6 is a block diagram of a knowledge entity and relationship extraction apparatus for textual information, according to an embodiment of the present invention;
FIG. 7 is a block diagram of the structure of a text quality evaluation apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer-readable storage medium provided in accordance with an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method for extracting knowledge entities and relations of text information, which comprises the following steps as shown in figure 1:
Step S101: acquiring text data. Specifically, the text data is the text from which knowledge entities and relations are to be extracted. The text data can be acquired from the internet or elsewhere; the embodiment of the invention does not limit the source. In one embodiment, the text data may be forum posts on certain topics, posts for certain topics or classes, blog content on certain topics, and so on. The acquired text data may be stored in segments in text form.
Step S102: extracting knowledge entities in the text data; the knowledge entity may be a keyword in the text data, or may also refer to a proper noun or phrase with a name attribute, which has a complete and specific semantic meaning and a fixed structure and can better express a semantic concept. Specifically, the TextRank algorithm can be adopted when the knowledge entity is extracted. The knowledge entity may serve as a knowledge keyword for the textual data.
In an embodiment, when extracting the knowledge entities in the text data, the text data may further undergo processing such as text segmentation and part-of-speech tagging. Part-of-speech tagging assigns an appropriate part of speech to each word in a sentence, that is, it determines whether each word is a noun, a verb, an adjective, or another part of speech. Various part-of-speech tagging standards exist, for example the PKU tag set provided by the Institute of Computational Linguistics of Peking University. In this embodiment, part-of-speech tagging follows the PKU tagging rules of that institute.
Step S103: calculating the phrase importance index of the knowledge entity according to a mutual information algorithm. Specifically, mutual information is a concept from information theory that measures the correlation between two random events. Applied to word combination, the pointwise mutual information (PMI) between the two segments V1 and V2 of a phrase V can be calculated, in its standard form, as:
$$\mathrm{PMI}(V_1, V_2) = \log \frac{P(V_1, V_2)}{P(V_1)\,P(V_2)}$$
PMI(V1, V2) serves as a consistency feature of the text: the larger its value, the stronger the correlation between V1 and V2 and the more likely they can be combined into a phrase. Conversely, if the correlation is below a certain threshold, the two segments cannot form a phrase.
In one embodiment, text segmentation is performed when the knowledge entities are extracted. Limited by the accuracy of the samples on which the segmentation tool was trained and by the algorithm itself, segmentation may split domain keywords. For example, the keyword "internet + education" is usually cut into the two keywords "internet" and "education", and the original word is lost. Calculating the phrase importance index makes it possible to judge whether related domain entities among the existing entities need to be connected to form an entity phrase.
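As a minimal sketch of how the merge decision in step S103 could be approached, the Python below estimates pointwise mutual information from counts in a tokenized corpus. The function name `pmi`, the toy token list, and the choice of adjacent-pair counts as the estimate of the joint probability are assumptions made for illustration; the source does not specify the estimator or the threshold.

```python
import math
from collections import Counter

def pmi(tokens, v1, v2):
    """Standard PMI: log( P(v1, v2) / (P(v1) * P(v2)) ).

    Assumption: P(v1, v2) is estimated from how often v2 directly
    follows v1; the source does not specify the estimator.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    p_joint = bigrams[(v1, v2)] / n_bi
    if p_joint == 0 or unigrams[v1] == 0 or unigrams[v2] == 0:
        return float("-inf")  # never adjacent: no evidence for a phrase
    p1 = unigrams[v1] / n_uni
    p2 = unigrams[v2] / n_uni
    return math.log(p_joint / (p1 * p2))

# Hypothetical segmented text in which "internet + education" was split.
tokens = ["internet", "education", "is", "growing",
          "internet", "education", "helps", "students"]
score_pair = pmi(tokens, "internet", "education")
score_other = pmi(tokens, "education", "students")
```

In this toy input the split pair scores well above zero while a pair that never occurs side by side scores negative infinity, which is the kind of contrast a merge threshold would act on.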
Step S104: calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors. Specifically, when calculating the importance of the knowledge entity, the word vector of the knowledge entity may be calculated first, and then the importance of the knowledge entity may be calculated based on the improved TextRank algorithm and the calculated word vector.
Step S105: determining the knowledge entity relationship according to the occurrence probability and the refinement degree of the knowledge entities. Specifically, the knowledge entity relationship includes an upper-lower relationship and a co-occurrence relationship. In an upper-lower relationship, if A is an upper entity of B, then A is broader than B and B is more specific than A. In a co-occurrence relationship, if A and B co-occur, they are generally considered to be at the same level and usually refer to different aspects of the same thing.
The refinement degree is an index reflecting the hierarchical relationship between entities; entities with a high refinement degree are usually located at the lower level. The entity occurrence probability is the probability that entity V1 occurs in the documents in which entity V2 occurs. Whether the relation between two knowledge entities is an upper-lower relation or a co-occurrence relation can be determined from the entity occurrence probability and the refinement degree.
The embodiment of the invention provides a method for extracting knowledge entities and relations of text information, which comprises the steps of after extracting the knowledge entities in text data, calculating phrase importance indexes of the knowledge entities according to a mutual information algorithm; calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors; and determining the relation of the knowledge entities according to the occurrence probability and the refinement degree of the knowledge entities. Therefore, the method can recombine the original words lost during the extraction of the knowledge entity by calculating the phrase importance index, thereby improving the accuracy of the extraction of the knowledge entity; meanwhile, the quality of the extracted text information can be conveniently evaluated in the follow-up process by calculating the importance of the knowledge entity; in addition, the method also provides that the relation of the knowledge entities is determined based on the occurrence probability and the refinement degree of the knowledge entities, and further analysis of the text information can be facilitated through determination of the relation of the knowledge entities.
As an optional implementation manner of the embodiment of the present invention, the method for extracting knowledge entities and relationships of text information further includes: performing part-of-speech tagging on the knowledge entity; and constructing a knowledge entity database according to the knowledge entity, the part of speech tagging result, the phrase importance index, the knowledge entity importance and the knowledge entity relationship. Specifically, the indexes such as entity content, entity importance, phrase content, phrase importance, the relationship between the upper entity and the lower entity and the like are stored in a knowledge entity database and serve as a special knowledge base in the field of internet texts, so that possible research and analysis in the future are facilitated.
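One possible shape for a record in such a knowledge entity database is sketched below. The class name `KnowledgeEntityRecord` and all field names are hypothetical; the source lists what is stored but does not define a schema.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeEntityRecord:
    """Hypothetical record layout; field names are illustrative only."""
    entity: str                 # entity or phrase content
    pos_tag: str                # part-of-speech tagging result
    phrase_importance: float    # phrase importance index (Q-V)
    importance: float           # knowledge entity importance (Q)
    upper_entities: list = field(default_factory=list)
    lower_entities: list = field(default_factory=list)
    co_occurring: list = field(default_factory=list)

# Placeholder values, not results from the source.
record = KnowledgeEntityRecord(
    entity="internet + education",
    pos_tag="n",
    phrase_importance=0.0,
    importance=0.0,
)
record.lower_entities.append("online course")
```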
As an optional implementation manner of the embodiment of the present invention, as shown in fig. 2, calculating the phrase importance index of the knowledge entity according to a mutual information algorithm includes the following steps:
Step S201: combining two of the extracted knowledge entities. Specifically, as noted above, text word segmentation during knowledge entity extraction may split an original phrase so that the original word is lost. To recover the original word, two of the extracted knowledge entities can be combined. For example, combining entities V1 and V2 yields entity V.
Step S202: calculating the cosine mutual information value of the combined knowledge entity according to a mutual information algorithm. Specifically, after the entities are combined, the cosine mutual information value of the combined knowledge entity is calculated. For example, if the two knowledge entities V1 and V2 are combined, the cosine mutual information value of the combined knowledge entity V is calculated by the following formulas:
Figure BDA0003322378500000081
Figure BDA0003322378500000082
where cos θ is the word distance based on the entity word vectors, and PMI-C(V1, V2) is the cosine mutual information value of the combined knowledge entity V, which reflects the importance of the combined knowledge entity V.
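The cos θ term can be illustrated with a plain cosine similarity between two entity word vectors, as in the sketch below. The vectors are made-up placeholders, and the sketch does not attempt to reproduce how PMI-C combines cos θ with PMI, because that formula is not reproduced in this text.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional word vectors for two entities.
vec_internet = [0.8, 0.1, 0.3]
vec_education = [0.7, 0.2, 0.4]
cos_theta = cosine_similarity(vec_internet, vec_education)
```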
Step S203: and calculating the phrase importance index of the combined knowledge entity according to the cosine mutual information value. Specifically, when calculating the phrase importance index, the combined knowledge entity V may be given an application scenario, and the importance of the entity phrase V in the corresponding scenario is calculated. In the calculation, the inverse document frequency of V is calculated based on logistic regression, thereby re-evaluating the textual importance of the entity V. Wherein, the phrase importance index is calculated by the following formula:
Figure BDA0003322378500000091
As an optional implementation manner of the embodiment of the present invention, as shown in fig. 3, calculating the importance of the knowledge entity according to the improved TextRank algorithm based on the word vector includes the following steps:
Step S301: calculating a word vector of each knowledge entity. Specifically, if there are N pieces of input text data, the N texts are denoted W1 to WN, and for each text the extracted entities i and the order information S1 to Sp of their first appearance in the text are determined. The word vector of each knowledge entity can be calculated with the Word2Vec model from natural language processing, or with the GloVe model or the BERT model. Other models may also be adopted, which is not limited in the embodiment of the present invention.
Step S302: calculating a Rank index of the knowledge entity according to the word vector and an improved TextRank algorithm. Specifically, when calculating the Rank index, a keyword network G = (V, E) is first constructed, where V denotes the keywords included in the keyword network, which serve as the nodes of the network, and an edge E is established between two keywords if they appear in one window. The edges are directional: following the order in which each knowledge entity or keyword appears in step S301, within one window the keyword that appears earlier in the order is the upper keyword and the one that appears later is the lower keyword, and the edge points from the upper keyword to the lower keyword. The window is the specified range of network construction. For example, if 100 keywords are extracted in total and the window is 10, the above process is repeated 10 times.
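The network construction described in step S302 might be sketched as follows. The function name `build_keyword_network` is hypothetical, and the use of consecutive, non-overlapping windows is an assumption suggested by the "100 keywords, window 10, repeated 10 times" example; a sliding window would be an equally plausible reading.

```python
from itertools import combinations

def build_keyword_network(ordered_keywords, window):
    """Return directed edges (earlier, later) for keywords sharing a window.

    ordered_keywords must already be sorted by first appearance.
    Assumption: windows are consecutive and do not overlap.
    """
    edges = set()
    for start in range(0, len(ordered_keywords), window):
        block = ordered_keywords[start:start + window]
        # combinations() keeps input order, so the first item is earlier.
        for earlier, later in combinations(block, 2):
            if earlier != later:
                edges.add((earlier, later))
    return edges

keywords = ["internet", "education", "teacher", "student"]
edges = build_keyword_network(keywords, window=2)
```

With a window of 2, the four keywords above produce two edges, one per window, each pointing from the earlier keyword to the later one.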
Thus, the Rank index WS of a knowledge entity can be calculated by the following formula:
Figure BDA0003322378500000092
where d is a damping coefficient representing the probability that a node points to other nodes in the graph network, with a value range of (0, 1) and a typical setting of 0.85; ω_ji is a weight coefficient, namely the vector distance D between the keywords; In(V_i) is the set of nodes pointing to node V_i; and Out(V_j) is the set of nodes that node V_j points to. The weight of each node is calculated iteratively until convergence.
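The iteration can be sketched with the widely used weighted TextRank update, WS(V_i) = (1 - d) + d * sum over V_j in In(V_i) of [ w_ji / sum over V_k in Out(V_j) of w_jk ] * WS(V_j). This is offered as an assumption about the general form; the improved formula itself is not reproduced in this text, and the edge weights below are placeholders standing in for vector distances.

```python
def weighted_textrank(edges, d=0.85, tol=1e-6, max_iter=100):
    """Iterate the standard weighted TextRank update until convergence.

    edges: dict mapping (source, target) -> positive weight.
    """
    nodes = {n for edge in edges for n in edge}
    if not nodes:
        return {}
    out_weight = {n: 0.0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for (src, dst), w in edges.items():
        out_weight[src] += w
        incoming[dst].append((src, w))
    scores = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_scores = {}
        for n in nodes:
            rank_sum = sum(w / out_weight[src] * scores[src]
                           for src, w in incoming[n])
            new_scores[n] = (1 - d) + d * rank_sum
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores

# Hypothetical directed keyword network with unit weights.
scores = weighted_textrank({("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0})
```

In this small example node "a" has no incoming edges and settles at 1 - d, while "c", which receives weight from both other nodes, ends with the highest score.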
Step S303: calculating the reverse text frequency of each knowledge entity. Specifically, the reverse text frequency, also called the inverse document frequency (IDF), is a measure of the general importance of a word. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term and taking the logarithm of the quotient. The fewer documents contain a term t, the larger its IDF and the better the term distinguishes categories. Thus, the reverse text frequency can be calculated by the following formula:
$$\mathrm{IDF}_i = \log \frac{|D|}{1 + |\{\, j : t_i \in d_j \,\}|}$$
where |D| is the total number of files in the corpus and |{j : t_i ∈ d_j}| is the number of files containing the term t_i (that is, the number of files for which n_{i,j} ≠ 0). If the term does not appear in the corpus, the denominator would be zero, so 1 + |{j : t_i ∈ d_j}| is generally used. That is, the IDF is the logarithm of the total number of documents in the corpus divided by the number of documents containing the term plus 1.
Step S304: calculating the importance of the knowledge entity according to the Rank index and the reverse text frequency. Specifically, the knowledge entity importance is obtained by multiplying the Rank index WS by the reverse text frequency, that is, the knowledge entity importance Q = WS × IDF.
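Steps S303 and S304 can be illustrated with the short sketch below. The function names, the representation of each document as a set of terms, and the sample Rank index value are assumptions for illustration.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF with the +1 smoothing described above: log(|D| / (1 + df))."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (1 + containing))

def entity_importance(rank_index, idf):
    """Knowledge entity importance Q = WS * IDF."""
    return rank_index * idf

# Hypothetical corpus: each document is the set of entities it contains.
documents = [
    {"internet", "education"},
    {"education", "policy"},
    {"internet"},
    {"teacher"},
]
idf_policy = inverse_document_frequency("policy", documents)
idf_education = inverse_document_frequency("education", documents)
q_policy = entity_importance(0.3954375, idf_policy)  # WS value is a placeholder
```

One consequence of the +1 in the denominator is that a term found in every document gets a slightly negative IDF, so very common entities can end up with a negative importance under this formula.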
As an optional implementation manner of the embodiment of the present invention, as shown in fig. 4, determining a knowledge entity relationship according to a probability of occurrence of a knowledge entity and a refinement degree includes the following steps:
Step S401: traversing the text data and determining a document set corresponding to the knowledge entity. Specifically, after the knowledge entities are extracted, the entities in the documents may be traversed and the documents numbered, so as to establish a document number set for entity V. For example, D_V = {D1, D2, D4} indicates that entity V appears in documents D1, D2 and D4. If, for two entities V1 and V2,
Figure BDA0003322378500000102
and
Figure BDA0003322378500000103
then V2 can be considered the general (upper) concept of V1, and V1 the lower concept of V2.
Step S402: calculating the occurrence probability of the knowledge entity according to the document set corresponding to the knowledge entity; specifically, the entity occurrence probability can be calculated by the following formula:
$$P(V_1 \mid V_2) = \frac{N(V_1 V_2)}{N(V_2)}$$
where P(V1|V2) is the probability that entity V1 appears in the documents in which entity V2 appears, N(V1V2) is the number of documents in which V1 and V2 appear together, and N(V2) is the number of documents in which V2 appears. When P(V1|V2) ≥ 0.8, V1 and V2 are considered to have a superordinate-subordinate relationship.
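The occurrence probability can be computed directly from the per-entity document number sets built in step S401. The following Python sketch assumes those sets are stored in a dictionary; the function name `occurrence_probability` and the sample entities are hypothetical.

```python
def occurrence_probability(doc_sets, v1, v2):
    """Return P(V1|V2) = N(V1 V2) / N(V2) from per-entity document-number sets."""
    docs_v2 = doc_sets[v2]
    if not docs_v2:
        return 0.0  # V2 appears in no document, so the ratio is undefined
    return len(doc_sets[v1] & docs_v2) / len(docs_v2)


# Hypothetical document-number sets D_V for three entities.
doc_sets = {
    "algorithm": {1, 2, 3, 4, 5},
    "sorting": {1, 2, 4},
    "network": {3, 5, 6},
}
p = occurrence_probability(doc_sets, "algorithm", "sorting")
```

In this toy data every document that mentions "sorting" also mentions "algorithm", so the probability reaches the 0.8 threshold in that direction but not in the reverse direction.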
Step S403: calculating the refinement degree of the knowledge entity according to the occurrence probability of the knowledge entity; specifically, in calculating the degree of refinement, it may be calculated on the basis of the entity occurrence probability. The degree of refinement is calculated by the following formula:
Figure BDA0003322378500000112
where N(V) is the refinement count: each time entity V and any entity V0 in the knowledge entity database satisfy P(V0|V) ≥ 0.8, N(V) is incremented by 1; and N is the total number of entities in the knowledge entity database.
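The counting rule for N(V) can be sketched in Python as follows. Because the refinement formula itself is only available as a figure, the ratio N(V)/N used here is an assumption inferred from the symbol definitions, as is the decision not to compare an entity with itself; the function name `refinement_degree` is likewise hypothetical.

```python
def refinement_degree(doc_sets, entity, threshold=0.8):
    """Return N(V) / N, where N(V) counts entities V0 with P(V0|V) >= threshold.

    Assumption: the ratio N(V)/N and the exclusion of V itself are guesses,
    because the original formula is not reproduced in the text.
    """
    docs_v = doc_sets[entity]
    if not docs_v:
        return 0.0
    count = 0
    for other, docs_other in doc_sets.items():
        if other == entity:
            continue
        # P(V0|V): share of V's documents in which V0 also appears.
        if len(docs_other & docs_v) / len(docs_v) >= threshold:
            count += 1
    return count / len(doc_sets)


# Hypothetical document-number sets D_V for three entities.
doc_sets = {
    "algorithm": {1, 2, 3, 4, 5},
    "sorting": {1, 2, 4},
    "network": {3, 5, 6},
}
r_sorting = refinement_degree(doc_sets, "sorting")
r_algorithm = refinement_degree(doc_sets, "algorithm")
```

Under these assumptions the narrower entity "sorting" obtains a higher refinement degree than the broader entity "algorithm".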
Step S404: determining the knowledge entity relationship according to the document set corresponding to each knowledge entity and the refinement degree of the knowledge entity. Specifically, after the occurrence and the refinement degree of the knowledge entities in the documents are determined, the relationship between two knowledge entities may be determined as follows: if
Figure BDA0003322378500000113
and
Figure BDA0003322378500000114
and R_V1 < R_V2, then V1 is the superordinate entity of V2; if
Figure BDA0003322378500000115
and
Figure BDA0003322378500000116
and R_V1 > R_V2, then V1 is the subordinate entity of V2; otherwise, V1 and V2 are considered to have a peer-level relationship, i.e., a co-occurrence relationship.
The embodiment of the present invention further provides a text quality evaluation method, as shown in fig. 5, the evaluation method includes the following steps:
Step S501: acquiring the knowledge entity importance and the knowledge entity relationship calculated by the knowledge entity and relationship extraction method of the text information. Specifically, the knowledge entity importance and the knowledge entity relationship may be obtained from the knowledge entity database constructed by the above method, or calculated directly using the above calculation method.
Step S502: constructing a text entity network graph according to the knowledge entity relationships. Specifically, as described above, the knowledge entity relationships fall into three types: a) V1 is the superordinate of V2; b) V1 is the subordinate of V2; c) V1 and V2 co-occur. Therefore, a directed graph α of the superordinate-subordinate relationships can be established, with each edge pointing from the superordinate node to the subordinate node; and an undirected graph β of the co-occurrence relationships can be established, in which an edge indicates that the two entities co-occur.
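The two graphs of step S502 can be represented with plain adjacency sets. The Python sketch below assumes that the relationships are supplied as (V1, V2, kind) triples; this encoding, the kind labels, and the function name `build_entity_graphs` are hypothetical choices made for illustration.

```python
def build_entity_graphs(relations):
    """Build directed graph alpha and undirected graph beta as adjacency sets.

    `relations` is a list of (v1, v2, kind) triples, where kind is "upper"
    (v1 is superordinate to v2), "lower" (v1 is subordinate to v2), or
    "co-occurrence". The triple format is a hypothetical encoding.
    """
    alpha, beta = {}, {}
    for v1, v2, kind in relations:
        if kind == "upper":
            alpha.setdefault(v1, set()).add(v2)  # edge: superordinate -> subordinate
            alpha.setdefault(v2, set())          # register the node without out-edges
        elif kind == "lower":
            alpha.setdefault(v2, set()).add(v1)
            alpha.setdefault(v1, set())
        elif kind == "co-occurrence":
            beta.setdefault(v1, set()).add(v2)   # undirected: store both directions
            beta.setdefault(v2, set()).add(v1)
        else:
            raise ValueError(f"unknown relation kind: {kind}")
    return alpha, beta


relations = [
    ("algorithm", "sorting", "upper"),
    ("quicksort", "sorting", "lower"),
    ("sorting", "searching", "co-occurrence"),
]
alpha, beta = build_entity_graphs(relations)
```

Keeping every node as a key, even when it has no outgoing edge, makes the node count N available to the later index calculations.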
Step S503: calculating the average degree, the diameter, the average path length and the clustering coefficient according to the text entity network graph; specifically, after the text entity network graph is constructed, an evaluation index for evaluating text quality may be calculated based on the relationship between the entities in the graph.
Here, the average degree k is the ratio of the sum of the degrees of all nodes (for the directed graph, the node out-degree k_out or in-degree k_in) to the number of nodes, and N denotes the number of nodes in the graph. The average degree reflects how the entities in the network are connected and can be used as one of the indexes for evaluating the connectivity of the text knowledge: a network with a higher average degree indicates stronger connectivity among the entities in the text and closer links between them.
For the directed graph α, the average degree calculation method is:
Figure BDA0003322378500000121
Figure BDA0003322378500000122
for undirected graph β, the mean degree calculation method is:
Figure BDA0003322378500000123
Figure BDA0003322378500000124
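As a minimal sketch of the average degree described above, the following Python functions assume the standard definition (sum of node degrees divided by the number of nodes N) and graphs stored as dictionaries mapping each node to a set of neighbors. The function names and the sample graphs are hypothetical.

```python
def average_out_degree(alpha):
    """Mean out-degree of a directed graph given as node -> set of successors."""
    if not alpha:
        return 0.0
    return sum(len(successors) for successors in alpha.values()) / len(alpha)


def average_degree(beta):
    """Mean degree of an undirected graph given as node -> set of neighbors."""
    if not beta:
        return 0.0
    return sum(len(neighbors) for neighbors in beta.values()) / len(beta)


# Hypothetical directed graph: algorithm -> sorting -> quicksort.
alpha = {"algorithm": {"sorting"}, "sorting": {"quicksort"}, "quicksort": set()}
# Hypothetical undirected graph: triangle a-b-c with a pendant node d on c.
beta = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

k_out = average_out_degree(alpha)
k = average_degree(beta)
```

In a directed graph every edge adds one to some out-degree and one to some in-degree, so the mean in-degree equals the mean out-degree when all nodes are counted.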
The diameter D of the network is defined as the maximum of the distances between all node pairs in the network. This index represents the breadth of the network, i.e., the generality of the text content: the larger the diameter D, the broader and more general the content covered by the text's entity network.
The average path length L, also called the mean distance, is obtained by averaging the distances of all node pairs and represents the typical distance between two entity nodes. This index represents the "size" of the network, i.e., the effective coverage of the text content; the larger L is for the knowledge network, the better the knowledge structure of the text.
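The diameter and the average path length can both be obtained from breadth-first search distances. The Python sketch below treats the graph as unweighted and ignores unreachable node pairs; that handling of disconnected graphs is an assumption, since the text does not say how such pairs are treated. Function names and the sample graph are hypothetical.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter_and_average_path_length(graph):
    """Return (D, L) over all reachable ordered pairs of distinct nodes."""
    lengths = []
    for source in graph:
        dist = shortest_path_lengths(graph, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)


# Hypothetical undirected graph: triangle a-b-c with a pendant node d on c.
beta = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
D, L = diameter_and_average_path_length(beta)
```

Running one search per node costs on the order of N·(N + E) steps, which is adequate for entity graphs of moderate size.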
The clustering coefficient C is a parameter that measures the degree to which nodes cluster together. The clustering coefficient of a single node is the ratio of the number of edges that exist between its neighboring nodes to the maximum possible number of such edges. The clustering coefficient C of the network is the average of the clustering coefficients of all nodes. The clustering coefficient reflects the degree of clustering among the entities in the text; generally speaking, the higher the clustering coefficient, the more focused and distinct the topic under discussion. The clustering coefficient of a single node and the clustering coefficient of the network are calculated using the following formulas, respectively:
Figure BDA0003322378500000131
Figure BDA0003322378500000132
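The verbal definition of the clustering coefficient given above can be sketched in Python as follows. The sketch assumes an undirected graph stored as node-to-neighbor sets and assigns a coefficient of 0 to nodes with fewer than two neighbors, which is a common convention rather than something stated in the text. Function names and the sample graph are hypothetical.

```python
def local_clustering(graph, node):
    """Ratio of edges among a node's neighbors to the maximum possible number."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in graph[neighbors[i]]
    )
    # k * (k - 1) / 2 is the maximum number of edges among k neighbors.
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    """Network clustering coefficient: mean of the per-node coefficients."""
    if not graph:
        return 0.0
    return sum(local_clustering(graph, node) for node in graph) / len(graph)


# Hypothetical undirected graph: triangle a-b-c with a pendant node d on c.
beta = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
c_node = local_clustering(beta, "c")
c_network = average_clustering(beta)
```

In the sample graph, node c has three neighbors of which only one pair (a, b) is connected, so its coefficient is one third, while a and b each have a fully connected neighborhood.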
Step S504: evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity. Specifically, during evaluation, the text topic focusing power is calculated from the diameter and the clustering coefficient; that is, the text topic focusing power P is calculated by the following formula:
Figure BDA0003322378500000133
In addition, the text knowledge framework quality can be calculated from the average degree, the average path length, the clustering coefficient and the knowledge entity importance; that is, the text knowledge framework quality M is calculated by the following formula:
Figure BDA0003322378500000134
according to the text quality evaluation method provided by the embodiment of the invention, the text entity network diagram is constructed through the knowledge entity relationship, each index for evaluating the text quality is determined through the network diagram, and finally, the text quality is comprehensively evaluated by combining a plurality of indexes and the importance of the knowledge entity. Therefore, the text quality assessment method reconsiders the relationship representation mode between the entities based on the complex network view, provides the text assessment index based on the entity relationship, and provides ideas and references for exploring the text quality of the content in the characteristic field.
As shown in fig. 6, the apparatus for extracting knowledge entities and relationships of text information according to the embodiment of the present invention includes:
The text data acquisition module is used for acquiring text data; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The knowledge entity extraction module is used for extracting knowledge entities in the text data; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The importance index calculation module is used for calculating the phrase importance index of the knowledge entity according to a mutual information algorithm; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The importance calculating module is used for calculating the importance of the knowledge entity according to the improved TextRank algorithm based on the word vector; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The knowledge entity relationship determining module is used for determining the knowledge entity relationship according to the occurrence probability and the refinement degree of the knowledge entity. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The device for extracting the knowledge entity and the relation of the text information provided by the embodiment of the invention calculates the phrase importance index of the knowledge entity according to a mutual information algorithm after the knowledge entity in the text data is extracted; calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors; and determining the relation of the knowledge entities according to the occurrence probability and the refinement degree of the knowledge entities. Therefore, the device can recombine the original words lost during the extraction of the knowledge entity by calculating the phrase importance index, and improves the accuracy of the extraction of the knowledge entity; meanwhile, the quality of the extracted text information can be conveniently evaluated in the follow-up process by calculating the importance of the knowledge entity; in addition, the device also provides that the relation of the knowledge entities is determined based on the occurrence probability and the refinement degree of the knowledge entities, and further analysis of the text information can be facilitated through the determination of the relation of the knowledge entities.
For a detailed description of the functions of the knowledge entity and relationship extraction apparatus for text information provided by the embodiment of the present invention, reference is made to the description of the knowledge entity and relationship extraction method of text information in the above embodiment.
As shown in fig. 7, the text quality evaluation apparatus provided in the embodiment of the present invention includes:
The data acquisition module is used for acquiring the knowledge entity importance and the knowledge entity relationship calculated by the knowledge entity and relationship extraction method of the text information; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The network graph building module is used for building a text entity network graph according to the knowledge entity relationship; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The parameter calculation module is used for calculating the average degree, the diameter, the average path length and the clustering coefficient according to the text entity network graph; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The evaluation module is used for evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The text quality evaluation device provided by the embodiment of the invention constructs the text entity network diagram through the knowledge entity relationship, determines each index for evaluating the text quality through the network diagram, and finally comprehensively evaluates the text quality by combining a plurality of indexes and the importance of the knowledge entity. Therefore, the text quality assessment device reconsiders the relationship representation mode between the entities based on the complex network view, provides the text assessment index based on the entity relationship, and provides ideas and references for exploring the text quality of the content in the characteristic field.
For a detailed description of the functions of the text quality assessment apparatus provided by the embodiment of the present invention, reference is made to the description of the text quality assessment method in the above embodiment.
An embodiment of the present invention further provides a storage medium, as shown in fig. 8, on which a computer program 601 is stored; when executed by a processor, the computer program implements the steps of the knowledge entity and relationship extraction method for text information in the foregoing embodiments. The storage medium also stores audio and video stream data, characteristic frame data, interactive request signaling, encrypted data, preset data size and the like. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kinds described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
An embodiment of the present invention further provides an electronic device, as shown in fig. 9, the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 9 takes the connection by the bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The Processor 51 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the corresponding program instructions/modules in the embodiments of the present invention. By running the non-transitory software programs, instructions and modules stored in the memory 52, the processor 51 executes its various functional applications and data processing, that is, implements the knowledge entity and relationship extraction method of text information in the above-described method embodiments.
The memory 52 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created by the processor 51, and the like. Further, the memory 52 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52 and, when executed by the processor 51, perform the knowledge entity and relationship extraction method of textual information as in the embodiment of fig. 1-5.
The details of the electronic device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 5, which are not described herein again.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (11)

1. A method for extracting knowledge entities and relations of text information is characterized by comprising the following steps:
acquiring text data;
extracting knowledge entities in the text data;
calculating the phrase importance index of the knowledge entity according to a mutual information algorithm;
calculating the importance of the knowledge entity according to an improved TextRank algorithm based on word vectors;
and determining the relation of the knowledge entities according to the occurrence probability and the refinement degree of the knowledge entities.
2. The method of extracting knowledge entities and relationships of textual information according to claim 1, further comprising:
performing part-of-speech tagging on the knowledge entity;
and constructing a knowledge entity database according to the knowledge entity, the part of speech tagging result, the phrase importance index, the knowledge entity importance and the knowledge entity relationship.
3. The method for extracting knowledge entities and relationships of text information according to claim 1, wherein calculating phrase importance indexes of the knowledge entities according to mutual information algorithm comprises:
combining two of the extracted knowledge entities;
calculating the cosine mutual information value of the combined knowledge entity according to a mutual information algorithm;
calculating a phrase importance index of the combined knowledge entity according to the cosine mutual information value, wherein the phrase importance index is expressed by the following formula:
Figure FDA0003322378490000011
wherein Q-V represents a phrase importance index, and PMI-C represents a cosine mutual information value.
4. The method of claim 1, wherein the calculating the importance of the knowledge entity according to the improved TextRank algorithm based on word vectors comprises:
calculating a word vector of each knowledge entity;
calculating a Rank index of a knowledge entity according to the word vector and an improved TextRank algorithm, wherein the improved TextRank algorithm comprises the steps of constructing a keyword network according to the knowledge entity, wherein the keyword network comprises each keyword node in the network, and directional authoritative edges among the keywords and the vector distance of the keywords which are determined according to a network construction range;
calculating the reverse text frequency of each knowledge entity;
and calculating the importance of the knowledge entity according to the Rank index and the reverse text frequency.
5. The method for extracting knowledge entities and relationships of text information according to claim 1, wherein determining knowledge entity relationships according to the occurrence probability and the refinement degree of the knowledge entities comprises:
traversing the text data and determining a document set corresponding to the knowledge entity;
calculating the occurrence probability of the knowledge entity according to the document set corresponding to the knowledge entity;
calculating the refinement degree of the knowledge entity according to the occurrence probability of the knowledge entity;
and determining the relation of the knowledge entities according to the document set corresponding to the knowledge entities and the refinement degree of the knowledge entities.
6. A text quality assessment method, comprising:
acquiring the knowledge entity importance and knowledge entity relationship computed in claim 1;
constructing a text entity network graph according to the knowledge entity relationship;
calculating the average degree, the diameter, the average path length and the clustering coefficient according to the text entity network graph;
and evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity.
7. The method of claim 6, wherein the evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity comprises:
calculating the text topic clustering power according to the diameter and the clustering coefficient;
and calculating the quality of the text knowledge architecture according to the average degree, the average path length, the clustering coefficient and the importance of the knowledge entity.
8. A knowledge entity and relationship extraction apparatus for text information, comprising:
the text data acquisition module is used for acquiring text data;
the knowledge entity extraction module is used for extracting knowledge entities in the text data;
the importance index calculation module is used for calculating the phrase importance index of the knowledge entity according to a mutual information algorithm;
the importance calculating module is used for calculating the importance of the knowledge entity according to the improved TextRank algorithm based on the word vector;
and the knowledge entity relationship determining module is used for determining the knowledge entity relationship according to the occurrence probability and the refinement degree of the knowledge entity.
9. A text quality evaluation apparatus, comprising:
a data acquisition module for acquiring the knowledge entity importance and the knowledge entity relationship calculated in claim 1;
the network graph building module is used for building a text entity network graph according to the knowledge entity relationship;
the parameter calculation module is used for calculating the average degree, the diameter, the average path length and the clustering coefficient according to the text entity network graph;
and the evaluation module is used for evaluating the text quality according to the average degree, the diameter, the average path length, the clustering coefficient and the importance of the knowledge entity.
10. A computer-readable storage medium storing computer instructions for causing a computer to execute the method of knowledge entity and relationship extraction of text information according to any one of claims 1 to 5 or the method of text quality assessment according to claim 6 or 7.
11. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing computer instructions, the processor executing the computer instructions to perform the method of extracting knowledge entities and relationships of text information according to any one of claims 1 to 5 or to perform the method of text quality assessment according to claim 6 or 7.
CN202111251665.6A 2021-10-26 2021-10-26 Knowledge entity and relation extraction method of text information and text quality assessment method Active CN114048742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111251665.6A CN114048742B (en) 2021-10-26 2021-10-26 Knowledge entity and relation extraction method of text information and text quality assessment method

Publications (2)

Publication Number Publication Date
CN114048742A true CN114048742A (en) 2022-02-15
CN114048742B CN114048742B (en) 2024-09-06

Family

ID=80206041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111251665.6A Active CN114048742B (en) 2021-10-26 2021-10-26 Knowledge entity and relation extraction method of text information and text quality assessment method

Country Status (1)

Country Link
CN (1) CN114048742B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187868A (en) * 2023-04-27 2023-05-30 深圳市迪博企业风险管理技术有限公司 Knowledge graph-based industrial chain development quality evaluation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021272A (en) * 2016-04-04 2016-10-12 上海大学 Keyword automatic extraction method based on distributed expression word vector calculation
CN110110330A (en) * 2019-04-30 2019-08-09 腾讯科技(深圳)有限公司 Text based keyword extracting method and computer equipment
CN112926310A (en) * 2019-12-06 2021-06-08 北京搜狗科技发展有限公司 Keyword extraction method and device
WO2021169347A1 (en) * 2020-02-25 2021-09-02 华为技术有限公司 Method and device for extracting text keywords

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
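Li Zhiqiang; Pan Suhan; Dai Juan; Hu Jiajia: "An improved TextRank keyword extraction algorithm" (一种改进的TextRank关键词提取算法), Computer Technology and Development (计算机技术与发展), no. 03, 5 December 2019 (2019-12-05) *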

Also Published As

Publication number Publication date
CN114048742B (en) 2024-09-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant