WO2023231331A1 - Knowledge extraction method, system and device, and storage medium - Google Patents

Knowledge extraction method, system and device, and storage medium

Info

Publication number
WO2023231331A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
distance
verb
verb phrase
knowledge extraction
Prior art date
Application number
PCT/CN2022/134806
Other languages
English (en)
Chinese (zh)
Inventor
刘宇
王丽
郭振华
赵雅倩
李仁刚
闫瑞栋
刘璐
徐聪
金良
贾麒
Original Assignee
浪潮电子信息产业股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮电子信息产业股份有限公司
Publication of WO2023231331A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of data processing technology, and in particular to a knowledge extraction method, system, equipment and storage medium.
  • the Max Planck Institute in Germany proposed a graph-based pattern recognition technology that uses a label propagation algorithm to solve the problem of semantic drift in small-scale text data. It also proposed a tree-based pattern recognition technology and successfully applied it to large-scale text data, enabling the extraction of multi-relationship knowledge in the medical field.
  • Microsoft uses simple string patterns, such as "such as" and "including", to build a complex knowledge instance evaluation framework, extract relational knowledge instances from massive texts, and apply them to search engines.
  • some institutions have proposed a variety of extraction methods based on distant supervision and constructed a model quality self-assessment method based on statistical information, which can solve the problem of excessive manual participation.
  • pattern recognition technology or pattern extraction technology
  • pattern recognition technology is a key technology that successfully solves knowledge extraction tasks.
  • the current knowledge extraction technology based on pattern recognition technology for large-scale text data still has problems such as poor versatility, low recall rate, and low quality of the obtained knowledge instances.
  • the purpose of this application is to provide a knowledge extraction method, system, equipment and storage medium to effectively extract knowledge and avoid the situation of poor versatility, low recall rate, and low quality of the obtained knowledge instances.
  • a knowledge extraction method including:
  • for any sentence, determine the search word distance of the sentence, determine the syntax parse tree distance of the search terms of the sentence through the sentence's syntax parse tree, find, for each entity in the sentence, the verb phrase closest to that entity, and determine the syntax parse tree distance of each verb phrase;
  • the weighted values of each verb phrase of the sentence are determined according to the preset weighting rules
  • K is a positive integer
  • For any target verb phrase retrieve each sentence including the target verb phrase from the annotated corpus, and verify it according to the preset rules;
  • Each sentence that passes the verification is summarized into knowledge extraction content corresponding to the seed data.
  • the text corpus is annotated, including:
  • Coreference resolution is performed on a text corpus to link pronouns in the text to the pronoun's original noun.
  • entity annotation is performed on the text corpus, including:
  • Entity annotation of text corpus using entity recognition tools.
  • coreference resolution is performed on the text corpus, including:
  • it also includes:
  • before annotating the text corpus, it also includes:
  • find the verb phrase closest to the entity including:
  • the weighted values of each verb phrase of the sentence are determined according to the preset weighting rules, including:
  • the weighted value corresponding to the search word distance of the sentence, the weighted value corresponding to the syntax parse tree distance of the search terms of the sentence, and the weighted value corresponding to the syntax parse tree distance of the verb phrase are summed to obtain the weighted value of the verb phrase;
  • the search word distance of a sentence is negatively correlated with the weighted value corresponding to the search word distance of the sentence
  • the syntax parse tree distance of the search terms of the sentence is negatively correlated with the weighted value corresponding to that distance
  • the syntax parse tree distance of the verb phrase is negatively correlated with the weighted value corresponding to that distance.
  • each verb phrase is obtained through aggregation, it also includes:
  • for any verb phrase, determine the frequency of occurrence of the verb phrase in the sentences retrieved from the annotated corpus that contain the search terms in the seed data, determine the frequency score value of the verb phrase based on the frequency of occurrence, and superimpose the frequency score value on the sum of the weighted values of the verb phrase to obtain the final score value of the verb phrase;
  • the K verb phrases with the highest sum of weighted values are obtained through aggregation as the selected K target verb phrases, including:
  • the K verb phrases with the highest final scores are used as the selected K target verb phrases.
  • the weighted value corresponding to the search word distance of the sentence is determined through function f1
  • the weighted value corresponding to the syntax parse tree distance of the search terms of the sentence is determined through function f2
  • the weighted value corresponding to the syntax parse tree distance of the verb phrase is determined through function f3;
  • function f1 is a function whose value changes linearly with the search word distance of the sentence
  • functions f2 and f3 are functions whose values change exponentially with the corresponding distance.
  • verification is performed according to preset rules, including:
  • a knowledge extraction system including:
  • the text corpus determination module is used to determine the text corpus
  • the annotation corpus determination module is used to annotate the text corpus and build an index to obtain the annotation corpus;
  • a retrieval module for setting seed data used to represent relationship information, and retrieving each sentence including the search terms in the seed data from the annotated corpus;
  • the distance calculation module is used to, for any sentence, determine the search word distance of the sentence, determine the syntax parse tree distance of the sentence's search terms through the sentence's syntax parse tree, find, for each entity in the sentence, the verb phrase closest to that entity, and determine the syntax parse tree distance of each verb phrase;
  • the weighted value calculation module is used to determine the respective weighted values of each verb phrase of the sentence for any sentence based on the determined distances of the sentences and in accordance with the preset weighting rules;
  • the target verb phrase determination module is used to obtain, through aggregation, the K verb phrases with the highest sum of weighted values based on the respective weighted values of each verb phrase in each sentence, as the selected K target verb phrases; K is a positive integer;
  • the verification module is used to retrieve each sentence including the target verb phrase from the annotated corpus for any target verb phrase, and verify it according to the preset rules;
  • the knowledge extraction content determination module is used to summarize each sentence after passing the verification into the knowledge extraction content corresponding to the seed data.
  • a knowledge extraction device including:
  • Memory used to store computer programs
  • a processor is used to execute a computer program to implement the steps of the above knowledge extraction method.
  • a computer-readable storage medium has a computer program stored on it; when the computer program is executed by a processor, the steps of the above knowledge extraction method are implemented.
  • the text corpus is annotated and an index is constructed.
  • the seed data used to represent the relationship information can be set, and each sentence including the search terms in the seed data can be retrieved from the annotated corpus.
  • for any sentence, this application determines the various distances of the sentence. The search word distance of the sentence and the syntax parse tree distance of the search terms of the sentence effectively reflect the quality of the sentence, and the syntax parse tree distance of a verb phrase reflects how closely the verb phrase is connected to the search terms in the seed data.
  • the selected K target verb phrases therefore effectively reflect the characteristics of the search terms in the seed data; that is, the sentences including a target verb phrase that are retrieved from the annotated corpus according to the K target verb phrases are high-quality knowledge extraction content. And since the sentences are verified according to preset rules, the quality of the extracted knowledge content is further ensured.
  • the solution of this application uses K target verb phrases as search objects, so a large number of sentences including the target verb phrases can be retrieved from the annotated corpus, making the recall rate of the solution of this application higher.
  • the solution of this application does not limit the content of the text corpus nor the search terms in the seed data. Therefore, the solution of this application can be used for knowledge extraction in various fields and is highly versatile.
  • the solution of this application can effectively extract knowledge, has high versatility and high recall rate, and can obtain high-quality knowledge extraction content.
  • Figure 1 is an implementation flow chart of a knowledge extraction method in this application
  • Figure 2 is a schematic flowchart of obtaining an annotated corpus in a specific implementation of the present application
  • Figure 3 is a schematic structural diagram of a syntax tree in a specific implementation of the present application.
  • Figure 4 is a schematic structural diagram of a knowledge extraction system in this application.
  • Figure 5 is a schematic structural diagram of a knowledge extraction device in this application.
  • Figure 6 is a schematic diagram of the application environment of a knowledge extraction method in this application.
  • the core of this application is to provide a knowledge extraction method that can effectively extract knowledge, has high versatility and high recall rate, and can obtain high-quality knowledge extraction content.
  • the knowledge extraction method may include the following steps:
  • Step S101 Determine the text corpus
  • the solution of this application has strong versatility.
  • the specific content of the text corpus can be determined according to the actual situation. For example, it can be a text corpus in medicine, geography, science, finance, culture, etc., or a text corpus that covers knowledge in several of these fields at the same time.
  • the content of the text corpus should be relatively accurate text content.
  • the text corpus can be a text corpus determined through news documents and/or Internet encyclopedia data.
  • Step S102 Annotate the text corpus and build an index to obtain an annotated corpus.
  • Annotating text corpora helps eliminate ambiguities and helps computers understand text content.
  • Step S21 Perform data cleaning on the text corpus to eliminate irrelevant information.
  • Step S22 Perform entity annotation on the text corpus to link text to entities
  • Step S23 Perform coreference resolution on the text corpus to link the pronouns in the text to the original nouns of the pronouns.
  • Step S24 Based on the results after entity annotation and the results after coreference resolution, when any pronoun points to an unambiguous noun object, link the pronoun to the entity.
  • Step S25 Construct the index and obtain the annotated corpus.
  • the text corpus will be data cleaned to eliminate irrelevant information, which will also help to further improve the accuracy of subsequent extracted knowledge.
  • developers can usually write corresponding programs based on the data types, data characteristics, data sources and other properties of the text corpus to effectively eliminate irrelevant information. For example, considering that some data in the text corpus comes from web pages, developers can choose to use tools such as Java to clean the text data, for instance by removing irrelevant information such as web page links, tables and advertisements.
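  • The cleaning in step S21 can be sketched as follows. This is a minimal illustration in Python rather than the Java tooling mentioned above; the regular expressions and the name `clean_text` are assumptions made for the example, and a real corpus would need source-specific rules.

```python
import re
from html import unescape

# Hypothetical patterns; a real corpus would need source-specific rules.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_text(raw: str) -> str:
    """Remove tables, markup and links, then collapse whitespace."""
    text = TABLE_RE.sub(" ", raw)   # drop whole tables, including their cells
    text = TAG_RE.sub(" ", text)    # drop any remaining tags
    text = unescape(text)           # decode entities such as &amp;
    text = URL_RE.sub(" ", text)    # drop bare links
    return re.sub(r"\s+", " ", text).strip()
```

  • Tables are removed before the generic tag pattern so that their cell text does not survive as stray tokens in the cleaned sentence.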
  • entity annotation is performed on the text corpus to link text to entities, which can effectively eliminate text ambiguity.
  • Linking text to entities, that is, Entity Linking, means identifying the words that represent entities in text data and linking them to entities in knowledge graphs and other content.
  • for example, if the word in the text is Apple, entity linking links the text to the entity Apple Inc. rather than to iPhone or Apple (fruit).
  • step S22 can specifically be: performing entity annotation on the text corpus using an entity recognition tool.
  • there are many choices for the specific entity recognition tool.
  • the open source tool package Dexter can be used as the entity recognition tool for this application.
  • Co-reference resolution refers to identifying pronouns in the text (such as he) and pointing to their original nouns (such as Zhang San).
  • step S23 can specifically be: performing coreference resolution on the text corpus using a natural language processing tool.
  • there are many choices for the specific natural language processing tool.
  • Stanford's CoreNLP 2.0 toolkit can be used as the natural language processing tool for this application.
  • for example, the pronoun he points to the noun Pitt (as resolved, for example, by the CoreNLP 2.0 toolkit), and Pitt is linked to the entity "Brad Pitt" (by the entity recognition tool Dexter). Based on the result after entity annotation and the result after coreference resolution, the pronoun he can be further linked to the entity "Brad Pitt".
  • in this way, the scope of annotated data is effectively increased, so that more sentences can be retrieved in subsequent retrieval of related sentences, which further improves the accuracy and coverage of knowledge extraction in the solution of this application.
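  • Step S24 can be sketched as combining two lookups. The dictionaries below stand in for the outputs of a coreference tool and an entity linker; their shape and the name `link_pronouns` are assumptions for illustration and do not reflect the actual output formats of CoreNLP or Dexter.

```python
from typing import Dict


def link_pronouns(coref: Dict[str, str], entity_links: Dict[str, str]) -> Dict[str, str]:
    """Link each pronoun to an entity through its antecedent noun.

    coref maps pronoun -> antecedent noun (hypothetical coreference output).
    entity_links maps noun -> entity (hypothetical entity linking output).
    Pronouns whose antecedent has no linked entity are left unlinked.
    """
    linked = {}
    for pronoun, noun in coref.items():
        entity = entity_links.get(noun)
        if entity is not None:
            linked[pronoun] = entity
    return linked


linked = link_pronouns({"he": "Pitt"}, {"Pitt": "Brad Pitt"})
```

  • Here `linked` maps the pronoun he to "Brad Pitt", matching the example in the text.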
  • Step S103 Set seed data for representing relationship information, and retrieve each sentence including the search terms in the seed data from the annotated corpus.
  • Seed data can be given in advance by users/developers and is used to represent relationship information, that is, to represent specific relationships.
  • for example, the seed data (Bill Gates, Melinda, Spouse) and (Brad Pitt, Angelina Jolie, Spouse) both express the relationship (A, B, husband and wife). In the solution of this application, only a small number of seeds are needed to complete high-quality knowledge extraction.
  • each sentence including the search terms in the seed data can be retrieved from the annotated corpus. It should be noted that in actual applications, one or more seed data can be used when extracting knowledge. The solution of this application can be carried out separately for each seed data, and the knowledge content corresponding to each seed data is then combined.
  • the seed data may contain one or more search terms, and the retrieved sentences need to carry each search term in the seed data.
  • the seed data usually includes 2 nouns as search terms. If 3 or more search terms need to be used, it can usually be split into multiple seeds for separate knowledge extraction.
  • for example, the seed data is set to (Brad Pitt, Jennifer Aniston, Spouse), and one sentence including the search terms in the seed data that is retrieved from the annotated corpus is: "Pitt met Friends actress Jennifer Aniston in 1998 and married her in a private wedding ceremony in Malibu on July 29, 2000."
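  • The retrieval in step S103 can be sketched with a small inverted index. This assumes, purely for illustration, that each annotated sentence is reduced to the set of entities it was linked to; `build_index` and `retrieve` are hypothetical names, not part of the application.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(annotated: List[Set[str]]) -> Dict[str, Set[int]]:
    """Map each linked entity to the ids of the sentences that mention it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sentence_id, entities in enumerate(annotated):
        for entity in entities:
            index[entity].add(sentence_id)
    return index


def retrieve(index: Dict[str, Set[int]], search_terms: Iterable[str]) -> List[int]:
    """Return ids of sentences that carry every search term of the seed."""
    postings = [index.get(term, set()) for term in search_terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy annotated corpus: one entity set per sentence.
corpus = [{"Bill Gates", "Melinda"}, {"Bill Gates"}, {"Melinda", "Bill Gates", "Seattle"}]
hits = retrieve(build_index(corpus), ["Bill Gates", "Melinda"])
```

  • Intersecting the posting sets enforces the requirement that a retrieved sentence carries each search term in the seed data.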
  • Step S104 For any sentence, determine the search word distance of the sentence, determine the syntax parse tree distance of the search terms of the sentence through the sentence's syntax parse tree, find, for each entity in the sentence, the verb phrase closest to that entity, and determine the syntax parse tree distance of each verb phrase.
  • the search word distance of a sentence refers to the distance between the search terms in the sentence, that is, the token distance. Take the sentence in Figure 3 as an example. There are three words, "met Friends actress", between the search term Pitt (linked to the entity "Brad Pitt") and the search term Jennifer Aniston, so the search word distance of this sentence is 3.
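  • The token distance can be sketched as counting the tokens strictly between the two search terms. Whitespace tokenization and the helper names `find_span` and `token_distance` are assumptions for this example.

```python
from typing import List, Sequence, Tuple


def find_span(tokens: Sequence[str], phrase: Sequence[str]) -> Tuple[int, int]:
    """Return the [start, end) token span of the first occurrence of phrase."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(phrase):
            return i, i + n
    raise ValueError(f"{phrase!r} not found in sentence")


def token_distance(tokens: Sequence[str], term_a: Sequence[str], term_b: Sequence[str]) -> int:
    """Count the tokens lying strictly between the two search terms."""
    first, second = sorted([find_span(tokens, term_a), find_span(tokens, term_b)])
    return max(0, second[0] - first[1])


tokens: List[str] = "Pitt met Friends actress Jennifer Aniston in 1998".split()
distance = token_distance(tokens, ["Pitt"], ["Jennifer", "Aniston"])
```

  • For this sentence fragment `distance` is 3, the three tokens "met Friends actress" described in the text.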
  • the seed data usually includes two nouns as search terms.
  • when the seed data contains a different number of search terms, each distance in step S104 can be processed adaptively according to the actual situation. For example, when there is only one search term, the search word distance of the sentence and the syntax parse tree distance of the search terms are disregarded, that is, both are treated as 0.
  • when there are 3 or more search terms, the search word distance of the sentence and the syntax parse tree distance of the search terms can be determined based on the two nearest search terms. However, the usual choice is to split the seed into multiple seeds for separate knowledge extraction, that is, into multiple seed data, each of which includes only 2 search terms.
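  • Splitting a seed with three or more search terms can be sketched as below. The text does not say how the terms are paired; pairing every two terms under the same relation is an assumption of this example, and `split_seed` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def split_seed(terms: Sequence[str], relation: str) -> List[Tuple[str, str, str]]:
    """Split a multi-term seed into seeds that each hold exactly 2 search terms.

    Assumption: every pair of terms is kept under the same relation label.
    """
    return [(a, b, relation) for a, b in combinations(terms, 2)]


seeds = split_seed(["A", "B", "C"], "R")
```

  • Each resulting seed can then be processed independently by steps S103 to S108.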
  • A syntax parse tree is a tree structure that represents the dependencies between words.
  • the leaf nodes are the elements in the sentence, and the non-leaf nodes are the parts of speech of words and phrases.
  • CoreNLP can be used to generate the syntax parse tree of a sentence.
  • ROOT: the root node of the tree
  • S: sentence (simple declarative clause)
  • NP: noun phrase
  • VP: verb phrase
  • NNP: proper noun, singular
  • NNPS: proper noun, plural
  • VBD: verb, past tense
  • NN: noun
  • CC: coordinating conjunction
  • PP: prepositional phrase
  • PRP: personal pronoun
  • IN: preposition or subordinating conjunction
  • DT: determiner, such as the, some, my
  • JJ: adjective
  • CD: cardinal number
  • the syntax parse tree distance of the search terms is the noun phrase tree-based distance.
  • the seed data is (Brad Pitt, Jennifer Aniston, Spouse).
  • it is also necessary to find, for each entity in the sentence, the verb phrase closest to that entity and to determine the syntax parse tree distance of each verb phrase, that is, the verb phrase tree-based distance. This helps to analyze the dependencies between verbs and nouns effectively and to dig out verb phrases with indicative meaning, such as the relationship between married and marriage, or between transfer into club and playing football.
  • the nearest neighbor method can be used to conveniently and quickly find the verb phrase closest to the entity for each entity in the sentence.
  • Jennifer Aniston is an entity with two verbs near it, met and married.
  • the syntax parse tree distance between Jennifer Aniston and met is 6, and the syntax parse tree distance between Jennifer Aniston and married is 8. Therefore, the verb phrase closest to the entity Jennifer Aniston is met, and the syntax parse tree distance of the verb phrase met is 6.
  • a verb phrase can consist of a single verb, such as met and married above, or of multiple words, such as transfer into club above; the staff only need to set up the verb phrase library in advance.
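  • The tree-based nearest neighbor search can be sketched on a small hand-written parse tree. The nested-tuple tree format, the edge-counting definition of distance, and the assumption that every word occurs once are choices made for this example; the toy tree is smaller than the one in Figure 3, so its distances differ from the values 6 and 8 quoted above.

```python
from typing import Dict, Iterable, Tuple, Union

Tree = Union[str, tuple]  # a leaf word, or (label, child, child, ...)


def leaf_paths(tree: Tree, prefix: Tuple[int, ...] = ()) -> Dict[str, Tuple[int, ...]]:
    """Map each leaf word to its path of child indices from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths: Dict[str, Tuple[int, ...]] = {}
    for i, child in enumerate(tree[1:]):  # tree[0] is the node label
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def tree_distance(paths: Dict[str, Tuple[int, ...]], a: str, b: str) -> int:
    """Number of edges on the tree path between leaves a and b."""
    pa, pb = paths[a], paths[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common


def nearest_verb(paths: Dict[str, Tuple[int, ...]], entity: str, verbs: Iterable[str]) -> Tuple[str, int]:
    """Return the verb closest to the entity together with its tree distance."""
    return min(((v, tree_distance(paths, entity, v)) for v in verbs), key=lambda p: p[1])


# Toy parse of "Pitt met Aniston and married her".
toy_tree = ("S",
            ("NP", ("NNP", "Pitt")),
            ("VP",
             ("VP", ("VBD", "met"), ("NP", ("NNP", "Aniston"))),
             ("CC", "and"),
             ("VP", ("VBD", "married"), ("NP", ("PRP", "her")))))
toy_paths = leaf_paths(toy_tree)
closest = nearest_verb(toy_paths, "Aniston", ["met", "married"])
```

  • In this toy tree the entity Aniston is 5 edges from met and 7 edges from married, so met is selected, mirroring the ordering in the example above.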
  • Step S105 For any sentence, based on the determined distances of each sentence, determine the respective weighting values of each verb phrase of the sentence according to the preset weighting rules.
  • function f2 represents the functional relationship between the syntax parse tree distance of the search terms and the weighted value corresponding to that distance.
  • function f3 represents the functional relationship between the syntax parse tree distance of the verb phrase and the weighted value corresponding to that distance.
  • step S105 may specifically include:
  • the weighted value corresponding to the search word distance of the sentence, the weighted value corresponding to the syntax parse tree distance of the search terms of the sentence, and the weighted value corresponding to the syntax parse tree distance of the verb phrase are summed to obtain the weighted value of the verb phrase;
  • the search word distance of a sentence is negatively correlated with the weighted value corresponding to the search word distance of the sentence
  • the syntax parse tree distance of the search terms of the sentence is negatively correlated with the weighted value corresponding to that distance
  • the syntax parse tree distance of the verb phrase is negatively correlated with the weighted value corresponding to that distance.
  • the shorter the search word distance of a sentence, the more reliable the sentence is considered to be, so the weighted value corresponding to the search word distance of the sentence is made larger.
  • the shorter the grammatical parse tree distance of the search term the higher the reliability of the sentence. Therefore, the weight value corresponding to the grammatical parse tree distance of the search term of the sentence is larger.
  • the greater the correlation between the verb phrase and the seed data the shorter the grammatical parse tree distance of the verb phrase, and the higher the corresponding weighted value. This can make the accuracy of the subsequently selected target verb phrases higher, which is also beneficial to improving the accuracy of the extracted knowledge.
  • the specific negative correlation functional relationship can be set according to actual needs.
  • for example, the weighted value determined by the function can increase exponentially as the corresponding distance decreases.
  • this application considers that the search word distances of different sentences differ considerably, that is, the search word distance of a sentence has a large value range, whereas the syntax parse tree distance of the search terms of a sentence and the syntax parse tree distance of the verb phrase are both determined from the syntax parse tree and therefore have a small value range. To avoid an excessive impact of the search word distance of the sentence on the weighted values of the verb phrases in the sentence, on one occasion:
  • the weighted value corresponding to the search word distance of the sentence is determined through function f1
  • the weighted value corresponding to the syntax parse tree distance of the search terms of the sentence is determined through function f2
  • the weighted value corresponding to the syntax parse tree distance of the verb phrase is determined through function f3
  • function f1 is a function whose value changes linearly with the search word distance of the sentence
  • functions f2 and f3 are functions whose values change exponentially with the corresponding distance.
  • that is, function f1 can be set as a linearly changing function.
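  • One possible shape for the three functions is sketched below: a linear, decreasing function of the search word distance and exponentially decaying functions of the two tree distances. The constants `max_dist` and `decay` and the exact formulas are assumptions for illustration; the text only requires negative correlation, a linear form for the first function and an exponential form for the other two.

```python
import math


def f1(token_dist: int, max_dist: int = 50) -> float:
    """Linear and decreasing in the search word distance, clipped at 0."""
    return max(0.0, 1.0 - token_dist / max_dist)


def f2(np_tree_dist: int, decay: float = 0.5) -> float:
    """Exponentially decreasing in the tree distance between the search terms."""
    return math.exp(-decay * np_tree_dist)


def f3(vp_tree_dist: int, decay: float = 0.5) -> float:
    """Exponentially decreasing in the tree distance of the verb phrase."""
    return math.exp(-decay * vp_tree_dist)


def verb_phrase_weight(token_dist: int, np_tree_dist: int, vp_tree_dist: int) -> float:
    """Sum the three weighted values to obtain the weight of one verb phrase."""
    return f1(token_dist) + f2(np_tree_dist) + f3(vp_tree_dist)
```

  • Keeping the first function linear limits how strongly the wide-ranging search word distance can dominate the sum, while the exponential forms separate small differences in the narrow-ranging tree distances.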
  • Step S106 Based on the respective weighted values of each verb phrase in each sentence, the K verb phrases with the highest sum of weighted values are obtained through aggregation as the selected K target verb phrases; K is a positive integer.
  • multiple sentences including the search terms in the seed data can usually be retrieved from the annotated corpus, for example 1,000 sentences on one occasion. For each sentence, one or more verb phrases can be determined, and the weighted value of each verb phrase in the sentence can be obtained. For example, if 150 of the 1,000 sentences identify met as a verb phrase, then there are 150 weighted values for met, and adding these 150 weighted values gives the sum of weighted values of the verb phrase met.
  • the text corpus can be a text corpus in Chinese, English, Russian, etc. It can be understood that the seed data is in the same language as the selected text corpus.
  • different tenses of the verb phrase can be regarded as the same verb phrase for summarization. That is equivalent to converting the verb phrase into its original tense and part of speech, and then summarizing the weighted values of the verb phrase in each sentence.
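  • The aggregation in step S106, including the merging of different tenses, can be sketched as follows. The per-sentence weight dictionaries and the `lemmas` lookup table are hypothetical inputs for this example; a real system would obtain base forms from a lemmatizer.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def top_k_verb_phrases(per_sentence_weights: List[Dict[str, float]],
                       k: int,
                       lemmas: Optional[Dict[str, str]] = None) -> List[Tuple[str, float]]:
    """Sum weights per verb phrase across sentences and keep the K largest sums."""
    lemmas = lemmas or {}
    totals: Dict[str, float] = defaultdict(float)
    for weights in per_sentence_weights:
        for verb_phrase, weight in weights.items():
            # Different tenses are folded into one base form before summing.
            totals[lemmas.get(verb_phrase, verb_phrase)] += weight
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


sentences = [{"met": 0.5, "married": 1.0}, {"marry": 1.0}, {"meet": 0.25}]
targets = top_k_verb_phrases(sentences, 2, lemmas={"met": "meet", "married": "marry"})
```

  • With these toy weights, married and marry are counted together, so marry ranks first with a sum of 2.0 and meet second with 0.75.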
  • Step S107 For any target verb phrase, retrieve each sentence including the target verb phrase from the annotated corpus, and verify it according to the preset rules.
  • Step S108 Summarize each sentence that has passed the verification into knowledge extraction content corresponding to the seed data.
  • each sentence including the target verb phrase can be retrieved from the annotated corpus.
  • the retrieved sentences are verified according to the preset rules, and only sentences that pass the verification are summarized as knowledge extraction content corresponding to the seed data. That is, if a sentence retrieved through a target verb phrase has a low correlation with the target verb phrase, it fails the verification and is discarded.
  • the verification according to preset rules described in step S107 may specifically include:
  • for example, the seed data is (Brad Pitt, Jennifer Aniston, Spouse), and one of the identified target verb phrases is marry. A forward search is performed first: through the syntax parse tree of the sentence, the noun closest to the target verb phrase marry is found in the sentence and used as the first noun. A reverse test is then performed: the nearest neighbor method based on the syntax parse tree can be used to find the verb closest to the first noun in the sentence.
  • the verb may or may not be marry. If it is marry, it means that the sentence has a high correlation with the target verb phrase marry, which means that the sentence has a high probability of being high-quality knowledge, so it passes the verification. Otherwise, it fails the verification.
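  • The forward search and reverse test can be sketched on a toy parse tree. The nested-tuple tree format, the edge-counting distance, the assumption that every word occurs once, and the pre-supplied lists of nouns and verbs are all choices made for this example rather than details given in the text.

```python
from typing import Dict, Sequence, Tuple, Union

Tree = Union[str, tuple]  # a leaf word, or (label, child, child, ...)


def leaf_paths(tree: Tree, prefix: Tuple[int, ...] = ()) -> Dict[str, Tuple[int, ...]]:
    """Map each leaf word to its path of child indices from the root."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths: Dict[str, Tuple[int, ...]] = {}
    for i, child in enumerate(tree[1:]):  # tree[0] is the node label
        paths.update(leaf_paths(child, prefix + (i,)))
    return paths


def tree_distance(paths: Dict[str, Tuple[int, ...]], a: str, b: str) -> int:
    """Number of edges on the tree path between leaves a and b."""
    pa, pb = paths[a], paths[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common


def passes_verification(tree: Tree, nouns: Sequence[str], verbs: Sequence[str], target: str) -> bool:
    """Forward search for the first noun, then reverse test from that noun."""
    paths = leaf_paths(tree)
    # Forward: the noun closest to the target verb phrase.
    first_noun = min(nouns, key=lambda n: tree_distance(paths, n, target))
    # Reverse: the verb closest to that noun must be the target again.
    closest_verb = min(verbs, key=lambda v: tree_distance(paths, first_noun, v))
    return closest_verb == target


# Toy parse of "Pitt married Aniston and tried to leave".
toy_tree = ("S",
            ("NP", ("NNP", "Pitt")),
            ("VP",
             ("VP", ("VBD", "married"), ("NP", ("NNP", "Aniston"))),
             ("CC", "and"),
             ("VP", ("VBD", "tried"),
              ("S", ("VP", ("TO", "to"), ("VP", ("VB", "leave")))))))
nouns = ["Pitt", "Aniston"]
verbs = ["married", "tried", "leave"]
```

  • In this toy sentence the target married passes, because its nearest noun Aniston has married as its nearest verb, while the target leave fails, because no noun in the sentence has leave as its nearest verb.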
  • each sentence that passes the verification can be called a knowledge instance, and these sentences are the knowledge extraction content corresponding to the seed data. As described above, multiple seed data may be used when performing knowledge extraction, and the knowledge extraction content corresponding to each seed data together forms the total extracted knowledge content.
  • the knowledge extraction content based on the seed data can be medical knowledge extraction, financial knowledge extraction content, geographical knowledge extraction content, or humanities knowledge extraction content.
  • the selected target verb phrase is marry
  • the sentences including the target verb phrase marry retrieved from the annotation corpus are all marry-related information.
  • these sentences are composed of some knowledge instances about marry.
  • for any verb phrase, determine the frequency of occurrence of the verb phrase in the sentences retrieved from the annotated corpus that contain the search terms in the seed data, determine the frequency score value of the verb phrase based on the frequency of occurrence, and superimpose the frequency score value on the sum of the weighted values of the verb phrase to obtain the final score value of the verb phrase;
  • the K verb phrases with the highest sum of weighted values are obtained through aggregation as the selected K target verb phrases, including:
  • the K verb phrases with the highest final scores are used as the selected K target verb phrases.
  • in this implementation, it is considered that in the aforementioned implementation a verb phrase obtains a weighted value in a sentence only when it appears in that sentence and is the closest verb phrase corresponding to an entity in the sentence. In some cases, certain verb phrases appear more frequently, which means that the verb phrase is more strongly related to the seed data. Therefore, this implementation also scores the verb phrase directly according to its frequency of occurrence.
  • after the frequency of occurrence of the verb phrase is determined, it is converted into the frequency score value of the verb phrase and superimposed on the sum of the weighted values of the verb phrase, which helps to further improve the accuracy of the selected K target verb phrases.
  • there are many specific methods for converting the frequency of occurrence into the frequency score value of a verb phrase. For example, one method is a simple proportional functional relationship.
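  • The proportional variant can be sketched as follows. The factor `alpha` and the function name `final_scores` are assumptions for this example; the text only says that the frequency score may be proportional to the frequency of occurrence.

```python
from typing import Dict


def final_scores(weight_sums: Dict[str, float],
                 frequencies: Dict[str, int],
                 alpha: float = 0.1) -> Dict[str, float]:
    """Add a frequency score (alpha * frequency) to each verb phrase's weight sum."""
    verb_phrases = set(weight_sums) | set(frequencies)
    return {vp: weight_sums.get(vp, 0.0) + alpha * frequencies.get(vp, 0)
            for vp in verb_phrases}


scores = final_scores({"marry": 2.0}, {"marry": 10, "date": 5}, alpha=0.5)
```

  • A verb phrase that is frequent but never the closest verb phrase to any entity, such as date in this toy input, still receives a nonzero final score and can compete for a place among the K target verb phrases.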
  • the text corpus is annotated and an index is constructed.
  • the seed data used to represent the relationship information can be set, and each sentence including the search terms in the seed data can be retrieved from the annotated corpus.
  • this application will determine the various distances of the sentence. It can be understood that the search word distance of the sentence and the syntax parse tree distance of the search word of the sentence can effectively reflect the quality of the sentence, and The grammatical parse tree distance of a verb phrase can reflect the degree of connection between the verb phrase and the search terms in the seed data.
  • The selected K target verb phrases can therefore effectively reflect the characteristics of the search terms in the seed data; that is, the sentences including a target verb phrase that are retrieved from the annotated corpus according to the K target verb phrases are high-quality knowledge extraction content. And since these sentences are verified according to preset rules, the quality of the extracted knowledge content is further ensured.
  • the solution of this application uses K target verb phrases as search objects, so a large number of sentences including the target verb phrases can be retrieved from the annotated corpus, making the recall rate of the solution of this application higher.
  • the solution of this application does not limit the content of the text corpus nor the search terms in the seed data. Therefore, the solution of this application can be used for knowledge extraction in various fields and is highly versatile.
  • FIG. 6 is a schematic diagram of the application environment of a knowledge extraction method in this application.
  • the application environment includes a terminal 610, an analysis device 620 and a network device 630.
  • the terminal 610 may be a monitor, a computer, a smart phone, a tablet computer, a laptop computer, or any other device capable of interacting with the user, and may display the summarized knowledge extraction content corresponding to the seed data.
  • the analysis device 620 may be a server, or a server cluster composed of several servers, or other devices capable of performing data analysis.
  • the analysis device 620 may be a cloud server (also called a cloud computing server).
  • the terminal 610 can establish a wired or wireless communication connection with the analysis device 620 through the communication network.
  • the network device 630 can provide data to be analyzed to the analysis device 620, so that the analysis device 620 performs knowledge extraction, and the terminal 610 can present the knowledge extraction results to the user or relevant staff.
  • The communication network involved in the embodiments of this application may be a second generation (2nd Generation, 2G) communication network, a third generation (3rd Generation, 3G) communication network, a Long Term Evolution (LTE) communication network, a fifth generation (5th Generation, 5G) communication network, etc.
  • the aforementioned application environment may also include a storage device for storing data required by the terminal 610, the analysis device 620, and/or the network device 630.
  • the storage device may be a distributed storage device.
  • the solution of this application can effectively extract knowledge, has high versatility and high recall rate, and can obtain high-quality knowledge extraction content.
  • embodiments of the present application also provide a knowledge extraction system, which can be mutually referenced with the above.
  • Figure 4 is a schematic structural diagram of a knowledge extraction system in this application, including:
  • the text corpus determination module 401 is used to determine a text corpus;
  • the annotation corpus determination module 402 is used to annotate the text corpus and build an index to obtain an annotation corpus;
  • the retrieval module 403 is used to set seed data used to represent relationship information, and to retrieve each sentence including the search terms in the seed data from the annotated corpus;
  • the distance calculation module 404 is used, for any sentence, to determine the search word distance of the sentence, determine the syntax parse tree distance of the search words of the sentence through the syntax parse tree of the sentence, find for each entity in the sentence the verb phrase closest to that entity, and determine the syntax parse tree distance of each such verb phrase;
  • the weighting value calculation module 405 is used to determine the respective weighting values of each verb phrase of the sentence for any sentence based on the determined distances of the sentences and in accordance with the preset weighting rules;
  • the target verb phrase determination module 406 is used to obtain, through aggregation of the weighted values of each verb phrase in each sentence, the K verb phrases with the highest sum of weighted values as the selected K target verb phrases, where K is a positive integer;
  • the verification module 407 is used for retrieving each sentence including the target verb phrase from the annotated corpus for any target verb phrase, and verifying it according to the preset rules;
  • the knowledge extraction content determination module 408 is used to summarize each sentence after passing the verification into knowledge extraction content corresponding to the seed data.
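The distances handled by the distance calculation module 404 can be sketched as follows. This excerpt does not define the distances precisely, so the sketch makes three assumptions: the search word distance is the number of token positions between two search words, a syntax parse tree distance is the number of edges on the path between two nodes, and "closest" is measured by that same tree distance. The tree is represented as a hypothetical child-to-parent mapping; all function names are illustrative.

```python
def search_word_distance(tokens, word_a, word_b):
    """Token-position gap between the first occurrences of two search words."""
    return abs(tokens.index(word_a) - tokens.index(word_b))


def tree_distance(parent, a, b):
    """Number of edges between nodes a and b in a tree.

    parent maps each node to its parent; the root maps to None.
    """
    # Record the number of steps from a to each of its ancestors.
    steps_from_a = {}
    node, steps = a, 0
    while node is not None:
        steps_from_a[node] = steps
        node = parent[node]
        steps += 1
    # Climb from b until reaching a node that is also an ancestor of a.
    node, steps = b, 0
    while node not in steps_from_a:
        node = parent[node]
        steps += 1
    return steps_from_a[node] + steps


def closest_verb_phrase(parent, entity, verb_phrases):
    """Pick the verb phrase node with the smallest tree distance to the entity."""
    return min(verb_phrases, key=lambda vp: tree_distance(parent, entity, vp))
```

For the tokens `["Alice", "founded", "Acme", "in", "Paris"]` with node 1 ("founded") as root, the mapping `{0: 1, 1: None, 2: 1, 3: 1, 4: 3}` gives a tree distance of 2 between "Alice" and "Acme".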
  • the annotation corpus determination module 402 includes:
  • the first annotation unit is used to annotate entities in the text corpus to link text to entities
  • the second annotation unit is used to perform coreference resolution on the text corpus to link the pronouns in the text to the original nouns of the pronouns;
  • the index building unit is used to build the index and obtain the annotated corpus.
  • the first annotation unit is specifically used for:
  • performing entity annotation on the text corpus using an entity recognition tool.
  • the second annotation unit is specifically used for:
  • the third annotation unit is used to link a pronoun to the entity when, based on the result of entity annotation and the result of coreference resolution, the pronoun points to an unambiguous noun object.
  • before the first annotation unit annotates the text corpus, the annotation corpus determination module 402 also includes:
  • the data cleaning unit is used to clean the text corpus to eliminate irrelevant information.
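Index construction and retrieval over the corpus can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: sentences are tokenized by whitespace and lowercased, and a sentence is returned only if it contains every search term. The application does not specify the index structure, and the names `build_index` and `retrieve` are hypothetical.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each lowercased token to the set of sentence ids containing it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            index[token].add(sid)
    return index


def retrieve(index, sentences, search_terms):
    """Return the sentences that contain all of the search terms."""
    if not search_terms:
        return []
    # Intersect the posting sets of all search terms.
    posting_sets = [index.get(term.lower(), set()) for term in search_terms]
    ids = set.intersection(*posting_sets)
    return [sentences[sid] for sid in sorted(ids)]
```

The same `retrieve` call could serve both retrieval steps, first with the search terms in the seed data and later with each target verb phrase, provided a verb phrase is split into its tokens.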
  • the distance calculation module 404 finds the verb phrase closest to the entity for each entity in the sentence, specifically for:
  • the weighted value calculation module 405 is specifically used to:
  • the weighted value corresponding to the search word distance of the sentence, the weighted value corresponding to the syntax parse tree distance of the search words of the sentence, and the weighted value corresponding to the syntax parse tree distance of the verb phrase are summed to obtain the weighted value of the verb phrase;
  • the search word distance of the sentence is negatively correlated with the weighted value corresponding to the search word distance of the sentence;
  • the syntax parse tree distance of the search words of the sentence is negatively correlated with the weighted value corresponding to the syntax parse tree distance of the search words of the sentence;
  • the syntax parse tree distance of the verb phrase is negatively correlated with the weighted value corresponding to the syntax parse tree distance of the verb phrase.
  • after obtaining the sum of the weighted values of each verb phrase through aggregation, the target verb phrase determination module 406 is also used to:
  • for any verb phrase, determine the frequency of occurrence of the verb phrase in the sentences retrieved from the annotated corpus that include the search terms in the seed data, determine the frequency score value of the verb phrase based on the frequency of occurrence, and superimpose the frequency score value on the sum of the weighted values of the verb phrase to obtain the final score value of the verb phrase;
  • the K verb phrases with the highest final scores are used as the selected K target verb phrases.
  • the weighted value corresponding to the search word distance of the sentence is determined through function f1;
  • the weighted value corresponding to the syntax parse tree distance of the search words of the sentence is determined through function f2;
  • the weighted value corresponding to the syntax parse tree distance of the verb phrase is determined through function f3;
  • function f1 is a function whose value changes linearly with the search word distance of the sentence;
  • function f2 and function f3 are functions whose values change exponentially with the corresponding distance.
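A minimal sketch of the three weighting functions and their sum is given below. The concrete forms are assumptions: the text only states that f1 is linear, that f2 and f3 are exponential, and that each weighted value decreases as its distance grows. The constants `max_weight`, `slope` and `decay`, and the clamping of f1 at zero, are hypothetical choices.

```python
import math


def f1(search_word_dist, max_weight=1.0, slope=0.1):
    """Linearly decreasing weight for the search word distance (clamped at 0)."""
    return max(0.0, max_weight - slope * search_word_dist)


def f2(search_tree_dist, decay=0.5):
    """Exponentially decreasing weight for the search words' parse tree distance."""
    return math.exp(-decay * search_tree_dist)


def f3(vp_tree_dist, decay=0.5):
    """Exponentially decreasing weight for the verb phrase's parse tree distance."""
    return math.exp(-decay * vp_tree_dist)


def verb_phrase_weight(search_word_dist, search_tree_dist, vp_tree_dist):
    """Weighted value of one verb phrase in one sentence: the sum of the three weights."""
    return f1(search_word_dist) + f2(search_tree_dist) + f3(vp_tree_dist)
```

Under these choices every term is largest at distance zero, so a verb phrase in a compact sentence that sits close to the search words in the parse tree receives the highest weighted value.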
  • the verification module 407 performs verification according to preset rules, and is specifically used for:
  • embodiments of the present application also provide a knowledge extraction device and a computer-readable storage medium, which may be mutually referenced with the above.
  • The computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the steps of the knowledge extraction method in any of the above embodiments are implemented.
  • The computer-readable storage media mentioned here include random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the technical field.
  • the knowledge extraction device may include:
  • the memory 501 is used to store a computer program;
  • the processor 502 is configured to execute a computer program to implement the steps of the knowledge extraction method in any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A knowledge extraction method, system and device, and a storage medium, applied to the technical field of data processing. The method comprises: determining a text corpus, annotating it, and building an index to obtain an annotated corpus; setting seed data representing relationship information, retrieving the corresponding sentences, and determining the search word distance of each sentence, the syntax parse tree distance of the search words, and the syntax parse tree distances of the verb phrases; based on the determined distances of the sentence, determining the respective weighted values of the verb phrases in the sentence according to a weighting rule; obtaining, through aggregation, the K target verb phrases with the highest sum of weighted values; retrieving, from the annotated corpus, the sentences that contain the target verb phrases and verifying them according to a preset rule; and summarizing the sentences that pass verification into knowledge extraction content corresponding to the seed data. Knowledge can be extracted effectively, with high versatility, a high recall rate, and high-quality knowledge extraction content.
PCT/CN2022/134806 2022-05-31 2022-11-28 Knowledge extraction method, system and device, and storage medium WO2023231331A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210609563.5A CN114840632A (zh) 2022-05-31 2022-05-31 Knowledge extraction method, system, device and storage medium
CN202210609563.5 2022-05-31

Publications (1)

Publication Number Publication Date
WO2023231331A1 (fr) 2023-12-07

Family

ID=82571583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134806 WO2023231331A1 (fr) 2022-05-31 2022-11-28 Knowledge extraction method, system and device, and storage medium

Country Status (2)

Country Link
CN (1) CN114840632A (fr)
WO (1) WO2023231331A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114840632A (zh) * 2022-05-31 2022-08-02 浪潮电子信息产业股份有限公司 Knowledge extraction method, system, device and storage medium
CN115017255B (zh) * 2022-08-08 2022-11-01 杭州实在智能科技有限公司 Knowledge base construction and search method based on a tree structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
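CN105938495A (zh) * 2016-04-29 2016-09-14 乐视控股(北京)有限公司 Entity relationship recognition method and device
CN110598078A (zh) * 2019-09-11 2019-12-20 京东数字科技控股有限公司 Data retrieval method and device, computer-readable storage medium, and electronic device
CN112328811A (zh) * 2020-11-12 2021-02-05 国衡智慧城市科技研究院(北京)有限公司 Intelligent generation method for word spectrum clustering based on same-type phrases
CN114840632A (zh) * 2022-05-31 2022-08-02 浪潮电子信息产业股份有限公司 Knowledge extraction method, system, device and storage medium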

Also Published As

Publication number Publication date
CN114840632A (zh) 2022-08-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22944632

Country of ref document: EP

Kind code of ref document: A1