CN114328970A - Triple extraction method, equipment and computer storage medium - Google Patents

Info

Publication number
CN114328970A
Authority
CN
China
Prior art keywords
text
corpus
extraction
triple
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111667514.9A
Other languages
Chinese (zh)
Inventor
聂建豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Robotics Co Ltd
Original Assignee
Cloudminds Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Robotics Co Ltd filed Critical Cloudminds Robotics Co Ltd
Priority to CN202111667514.9A priority Critical patent/CN114328970A/en
Publication of CN114328970A publication Critical patent/CN114328970A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention relate to the field of artificial intelligence and disclose a triple extraction method, triple extraction equipment and a computer storage medium. The triple extraction method comprises: obtaining a corpus set to be processed; clustering the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets; analyzing, through dependency syntax, the dependency relationships among the words of each text in the similar corpus sets; and determining an extraction template corresponding to each similar corpus set according to the dependency relationships, and extracting triples according to the extraction templates. No data needs to be labeled manually at any point in the triple extraction process, which saves time and labor.

Description

Triple extraction method, equipment and computer storage medium
Technical Field
The embodiment of the invention relates to the field of artificial intelligence, in particular to a triple extraction method, triple extraction equipment and a computer storage medium.
Background
Triples (each consisting of a subject, an object and the relation between them) in a knowledge graph play an important role in application scenarios such as entity question answering and entity recommendation. Triple extraction is an important prerequisite task in knowledge graph construction and, depending on the data source being processed, can be divided into structured text extraction, semi-structured text extraction and unstructured text extraction. The industry has done a great deal of research on relation extraction from unstructured text, including supervised learning methods such as deep learning and machine learning, as well as semi-supervised learning methods.
However, supervised and semi-supervised learning methods require a set of samples of known classes as a reference, so a large amount of data has to be labeled, which consumes manpower and time. Manually labeling the full set of triple data directly incurs a high time cost and labor cost.
Disclosure of Invention
The embodiment of the invention aims to provide a triple extraction method, equipment and a computer storage medium, so as to achieve the purpose of saving time and labor.
To solve the foregoing technical problem, an embodiment of the present invention provides a triple extraction method, including: obtaining a corpus set to be processed; clustering the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets; analyzing, through dependency syntax, the dependency relationships among the words of each text in the similar corpus sets; and determining an extraction template corresponding to each similar corpus set according to the dependency relationships, and extracting triples according to the extraction templates.
An embodiment of the present invention further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the triple extraction method.
Embodiments of the present invention further provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the triple extraction method is implemented.
Compared with the related art, the embodiments of the invention cluster the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets, where the texts within each similar corpus set are highly similar to one another. The dependency relationships between the words of each text in each similar corpus set are analyzed through dependency syntax, and an extraction template for extracting triples is determined for each similar corpus set according to its dependency relationships; the extraction template can then be used to extract triples directly from every text in that similar corpus set. No data needs to be labeled manually at any point in the triple extraction process, which saves a large amount of manpower and the time cost of labeling.
In addition, after obtaining the plurality of similar corpus sets, the method further comprises: matching the plurality of similar corpus sets against a regular expression; if a similar corpus set matches the regular expression, extracting triples according to the regular expression; and if a similar corpus set does not match the regular expression, performing the step of analyzing, through dependency syntax, the dependency relationships between the words of each text in the similar corpus set. When a similar corpus set matches the regular expression, triples are extracted directly according to the regular expression, which simplifies the extraction operation.
In addition, before analyzing, through dependency syntax, the dependency relationships between the words of each text in the similar corpus set, the method further includes: analyzing the part of speech of the words of each text in the similar corpus set; and identifying proper nouns in each text in the similar corpus set. Determining the extraction template corresponding to each similar corpus set according to the dependency relationships then includes: determining the corresponding extraction template according to the part of speech, the proper nouns and the dependency relationships of each text in the similar corpus set.
In addition, the method for determining the corresponding extraction template according to the part of speech, the proper noun and the dependency relationship of each text in the similar corpus set comprises the following steps: screening words in the text according to the part of speech; determining proper nouns in the screened words; determining a core word of the text according to the proper noun and the dependency relationship; and determining a corresponding extraction template according to the position of the core word in the text and the part of speech of the core word.
In addition, the extraction templates include at least: a basic triple template and an attribute triple template; the structure of the basic triple template comprises a first entity, an associated word and a second entity; the structure of the attribute triple template comprises a first nominal word, a second nominal word and a third nominal word.
In addition, analyzing the dependency relationship between words of each text in the similar corpus by the dependency syntax includes: acquiring the length of a text; if the length of the text exceeds a preset threshold value, decomposing the text according to a clause structure of the text; and analyzing the dependency relationship between words in the decomposed text through dependency syntax. Because the longer the text length is, the higher the error rate of determining the dependency relationship among the words in the text is, the accuracy of determining the dependency relationship can be improved to a greater extent by decomposing the long sentence into short sentences according to the clause structure and analyzing the dependency relationship from the short sentences.
In addition, after determining the extraction template corresponding to each similar corpus set according to the dependency relationships, the method further includes: iteratively optimizing the extraction template using a verification corpus. Extracting triples according to the extraction template then includes: extracting triples according to the iteratively optimized extraction template. This further improves the accuracy of triple extraction.
In addition, obtaining the corpus set to be processed includes: acquiring unstructured text from the Internet by means of a crawler, and using the unstructured text as the corpus set to be processed. In this way, the data volume of the corpus set to be processed can be enlarged and its data types enriched.
Drawings
One or more embodiments are illustrated by corresponding figures in the drawings, which are not to be construed as limiting the embodiments, unless expressly stated otherwise, and the drawings are not to scale.
FIG. 1 is a flow chart of a triple extraction method according to a first embodiment of the present application;
FIG. 2 is a diagram of a dependency syntax tree according to an embodiment of the present application;
FIG. 3 is a flow chart of a triple extraction method according to a second embodiment of the present application;
FIG. 4 is a flow chart of a triple extraction method according to a third embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus corresponding to the triple extraction method according to the first embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus corresponding to the triple extraction method according to the second embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus corresponding to the triple extraction method according to the third embodiment of the present application;
FIG. 8 is a schematic structural diagram of another apparatus corresponding to the triple extraction method according to the third embodiment of the present application;
FIG. 9 is a flow chart of a method of constructing a knowledge graph according to an embodiment of the present application;
FIG. 10 is a flow chart of a natural language processing method according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application. The technical solution claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments.
The following embodiments are divided for convenience of description, and should not constitute any limitation to the specific implementation manner of the present invention, and the embodiments may be mutually incorporated and referred to without contradiction.
A first embodiment of the present invention relates to a triple extraction method, including: obtaining a corpus set to be processed; clustering the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets; analyzing, through dependency syntax, the dependency relationships among the words of each text in the similar corpus sets; and determining the extraction template corresponding to each similar corpus set according to the dependency relationships, and extracting triples according to the extraction templates, so as to save time and labor. The implementation details of the triple extraction method of this embodiment are described below; they are provided only to facilitate understanding and are not required for implementing this embodiment.
The triple extraction method in this embodiment is shown in fig. 1, and the method includes:
step 101, obtaining a corpus set to be processed.
A large number of texts are obtained from the Internet by means of a crawler, and the obtained texts can be used as the corpus set to be processed. In addition, to facilitate subsequent processing and improve the accuracy of the results, the texts acquired from the Internet can be preprocessed according to preset rules. For example, the acquired text can be split into sentences at periods, or ill-formed sentences that do not meet the sentence specification can be screened out, and the preprocessed texts are then used as the corpus set to be processed.
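The preprocessing described above could be sketched as follows. This is a minimal illustration using only the Python standard library; the punctuation set, the length limits and the function names (`split_sentences`, `is_well_formed`, `build_corpus`) are assumptions made for the example and are not specified by the application.

```python
import re

# Assumed sentence delimiters (Chinese and Western end-of-sentence marks).
SENTENCE_END = re.compile(r"[。！？.!?]+")


def split_sentences(raw_text):
    """Split crawled text into candidate sentences at end-of-sentence marks."""
    return [part.strip() for part in SENTENCE_END.split(raw_text) if part.strip()]


def is_well_formed(sentence, min_chars=4, max_chars=200):
    """Assumed screening rule: plausible length and at least one letter or CJK character."""
    if not (min_chars <= len(sentence) <= max_chars):
        return False
    return any(ch.isalpha() for ch in sentence)


def build_corpus(raw_texts):
    """Turn crawled pages into the corpus set to be processed."""
    corpus = []
    for raw in raw_texts:
        for sentence in split_sentences(raw):
            if is_well_formed(sentence):
                corpus.append(sentence)
    return corpus


pages = [
    "Enterprise A was born in Shenzhen in 1988. ??? Its headquarters is located in Shenzhen!",
    "12345. ok",
]
corpus = build_corpus(pages)
print(corpus)
```

Splitting on every period is deliberately crude (it would also split decimal numbers); a production rule set would need to handle such cases.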
Step 102, clustering the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets.
Faiss can find the N most similar texts in the corpus set to be processed and take them as one similar corpus set; repeating this in the same way yields a plurality of similar corpus sets. Compared with other clustering algorithms, Faiss clusters quickly and its results are relatively accurate.
Specifically, each text in the corpus set to be processed has a corresponding text vector. Faiss can be used to compute the distances between these text vectors, and the texts are clustered according to those distances: the smaller the vector distance between two texts, the higher their similarity, and conversely, the larger the vector distance, the lower their similarity. Texts whose vector distance is smaller than a preset threshold are clustered together, yielding a plurality of similar corpus sets, each of which comprises a plurality of texts to be processed.
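To make the threshold rule concrete, the sketch below groups texts whose vector distance falls under a preset threshold. It is not the Faiss implementation: it swaps in a brute-force distance computation over toy bag-of-words vectors so that the example runs with the standard library alone. In practice the vectors would come from a sentence encoder and the nearest-neighbour search would be delegated to a Faiss index. The function names and the threshold value are assumptions.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words vector, L2-normalised (stand-in for a real text encoder)."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_by_threshold(texts, threshold):
    """Greedy grouping: a text joins the first cluster whose seed vector is closer than threshold."""
    vocab = sorted({word for t in texts for word in t.lower().split()})
    clusters = []  # list of (seed_vector, member_texts)
    for text in texts:
        vec = embed(text, vocab)
        for seed, members in clusters:
            if l2_distance(seed, vec) < threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]


texts = [
    "Enterprise A was founded in Shenzhen",
    "Enterprise B was founded in Beijing",
    "Passerby C joined Group D",
    "Passerby E joined Group F",
]
similar_sets = cluster_by_threshold(texts, threshold=1.0)
for group in similar_sets:
    print(group)
```

With normalised vectors the distance ranges from 0 (identical direction) to about 1.41 (no shared words), so a threshold of 1.0 separates the two sentence patterns in this toy input.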
Step 103, analyzing, through dependency syntax, the dependency relationships among the words of each text in the similar corpus sets.
Dependency syntax expresses the structure of an entire sentence through the dependency relationships between its words, and these relationships express the semantic dependencies between the components of the sentence. In a sentence, if one word modifies another word, the modifying word is called the dependent word (dependent), the modified word is called the dominant word (head), and the grammatical relationship between the two is called a dependency relationship. As shown in FIG. 2, each arrow points from the dominant word to the dependent word, and representing the dependency relationships of all words in a sentence as directed edges yields a dependency syntax tree. FIG. 2 shows the dependency syntax tree for the sentence "In January 2016, Passerby A joined Group A to serve as Position A".
The linguist Robinson proposed several constraining axioms for dependency syntax: 1) exactly one word (the root word) does not depend on any other word; 2) every other word must depend on some word; 3) no word may depend on more than one word. Dependency analysis of a text can be carried out on the basis of these axioms.
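The three axioms listed above can be checked mechanically on a candidate parse. The sketch below assumes a parse is given as a list of (head, dependent) index pairs; the function name and the example arcs are illustrative assumptions (in particular, the arc from "joined" to "serve as" is a guess at the tree in FIG. 2, which is not reproduced here).

```python
def check_dependency_axioms(num_words, arcs):
    """Check the three axioms on arcs given as (head_index, dependent_index) pairs.

    Returns a list of violation messages; an empty list means all three hold.
    Note: this does not check for cycles, which a full tree validation would need.
    """
    head_count = [0] * num_words
    for _head, dependent in arcs:
        head_count[dependent] += 1

    problems = []
    # Axioms 1 and 2 together: exactly one word has no head.
    roots = [i for i, count in enumerate(head_count) if count == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    # Axiom 3: no word depends on more than one word.
    multi = [i for i, count in enumerate(head_count) if count > 1]
    if multi:
        problems.append(f"words with more than one head: {multi}")
    return problems


words = ["January 2016", "Passerby A", "joined", "Group A", "serve as", "Position A"]
# Assumed arcs for the example sentence, head -> dependent.
good_arcs = [(2, 0), (2, 1), (2, 3), (2, 4), (4, 5)]
bad_arcs = good_arcs + [(4, 3)]  # "Group A" would now have two heads

print(check_dependency_axioms(len(words), good_arcs))
print(check_dependency_axioms(len(words), bad_arcs))
```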
Taking the dependency syntax tree shown in FIG. 2 as an example, the dependency relationship between "joined" and "Passerby A" is a subject-verb relationship; the dependency relationship between "joined" and "Group A" is a verb-object relationship; the dependency relationship between "serve as" (ren) and "Position A" is a verb-object relationship, and so on.
Step 104, determining an extraction template corresponding to each similar corpus set according to the dependency relationships, and extracting triples according to the extraction templates.
According to the dependency relationships, several types of extraction templates with general rules are summarized. For example, the extraction templates include at least: a basic triple template and an attribute triple template; the structure of the basic triple template comprises a first entity, an associated word and a second entity; the structure of the attribute triple template comprises a first nominal word, a second nominal word and a third nominal word. For the sentence shown in FIG. 2, "In January 2016, Passerby A joined Group A to serve as Position A", the triple (Passerby A (first entity), joined (associated word), Group A (second entity)) is extracted based on the basic triple template.
Compared with the related art, the embodiments of the invention cluster the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets, where the texts within each similar corpus set are highly similar to one another. The dependency relationships between the words of each text in each similar corpus set are analyzed through dependency syntax, and an extraction template for extracting triples is determined for each similar corpus set according to its dependency relationships; the extraction template can then be used to extract triples directly from every text in that similar corpus set. No data needs to be labeled manually at any point in the triple extraction process, which saves a large amount of manpower and the time cost of labeling.
A second embodiment of the invention relates to a triple extraction method. The second embodiment adds a step of matching the similar corpus sets against regular expressions. As shown in FIG. 3, the triple extraction method includes the following steps:
Step 201, obtaining a corpus set to be processed.
Step 202, clustering the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets.
Steps 201 and 202 are the same as steps 101 and 102 in the first embodiment and are not described again here to avoid repetition.
Step 203, judging whether the similar corpus set matches the regular expression; if so, proceeding to step 204; otherwise, proceeding to step 205.
Step 204, extracting triples according to the regular expression.
Step 205, analyzing, through dependency syntax, the dependency relationships between the words of each text in the similar corpus set.
Step 206, determining an extraction template corresponding to each similar corpus set according to the dependency relationships, and extracting triples according to the extraction templates.
Specifically, after the plurality of similar corpus sets are obtained, the texts in each similar corpus set can be analyzed to determine whether they contain sentences that follow a very obvious specific rule. For example, the texts in a similar corpus set can be matched against a regular expression, and a successful match indicates that the similar corpus set consists of sentences following a specific rule. By summarizing that rule, the triple extraction rule of the similar corpus set can be determined, and from it the extraction template corresponding to the similar corpus set. Using this extraction template, triples can be extracted in batches from all corpus sets whose sentences follow the specific rule, which improves the efficiency of triple extraction.
Taking sentences containing the entity categories [enterprise, place] as an example, the triples extracted according to the regular expression can take the form (enterprise, associated word, place). For example, for the texts "The enterprise was born in Shenzhen in 1988" and "The enterprise headquarters is located in Longgang District, Shenzhen, Guangdong Province, China", the following regular expression can be written directly: regex = 'originated in|born in|located in|established in', and the corresponding extracted triple type is (enterprise, headquarters location, place).
If a similar corpus set does not match any summarized regular expression, or no regular expression can be summarized for it, the remaining unmatched similar corpus sets are processed through dependency syntax: the dependency relationships among the words of each text in the similar corpus set are analyzed, an extraction template corresponding to each similar corpus set is determined according to the dependency relationships, and triples are extracted according to the extraction templates.
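The routing between steps 203 to 205 might look like the sketch below. The English pattern, the decision that a similar corpus set "matches" only when every text in it matches, and the names `extract_with_regex`, `process_cluster` and `dependency_fallback` are assumptions made for illustration; the application does not fix these details.

```python
import re

# Assumed English rendering of the alternation given in the text above.
LOCATION_PATTERN = re.compile(
    r"^(?P<enterprise>.+?) (?:was |is )?"
    r"(?:originated in|born in|located in|established in) "
    r"(?P<place>.+?)(?: in \d{4})?$"
)


def extract_with_regex(text):
    """Return an (enterprise, headquarters location, place) triple, or None if no match."""
    match = LOCATION_PATTERN.match(text)
    if match is None:
        return None
    return (match.group("enterprise"), "headquarters location", match.group("place"))


def dependency_fallback(texts):
    """Placeholder for steps 205 and 206 (dependency analysis and template extraction)."""
    return []


def process_cluster(texts, fallback=dependency_fallback):
    """Step 203: use the regex if the whole set matches (assumed criterion), else fall back."""
    triples = [extract_with_regex(t) for t in texts]
    if all(triple is not None for triple in triples):
        return triples            # step 204
    return fallback(texts)        # step 205 onwards


matched = process_cluster([
    "Enterprise A was born in Shenzhen in 1988",
    "Enterprise B is located in Longgang District, Shenzhen",
])
unmatched = process_cluster(["Passerby A joined Group A"])
print(matched)
print(unmatched)
```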
A third embodiment of the invention relates to a triple extraction method. The third embodiment builds on the first or second embodiment and refines the step of determining the extraction template: the corresponding extraction template is determined according to the part of speech, the proper nouns and the dependency relationships of each text in the similar corpus set. As shown in FIG. 4, the triple extraction method as adapted from the second embodiment includes the following steps:
Step 301, obtaining a corpus set to be processed.
Step 302, clustering the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets.
Step 303, judging whether the similar corpus set matches the regular expression; if so, proceeding to step 304; otherwise, proceeding to step 305.
Step 304, extracting triples according to the regular expression.
Steps 301 to 304 are the same as steps 201 to 204 in the second embodiment and are not repeated here to avoid repetition.
Step 305, analyzing, through dependency syntax, the dependency relationships between the words of each text in the similar corpus set.
Step 306, determining a corresponding extraction template according to the part of speech, the proper nouns and the dependency relationships of each text in the similar corpus set, and extracting triples according to the extraction template.
Specifically, before analyzing, through dependency syntax, the dependency relationships between the words of each text in the similar corpus set, the method further includes: analyzing the part of speech of the words of each text in the similar corpus set, and identifying proper nouns in each text in the similar corpus set. The main purpose of part-of-speech tagging is to mark the grammatical part of speech of each word in the text (sentence), since words of different parts of speech play different roles and serve different functions in a sentence. For example, nominal words are likely to be candidate entities for a triple; most of them are proper nouns or common nouns and often serve as the subject or the object. Words of certain other parts of speech are marked during part-of-speech analysis and filtered out in subsequent processing, which reduces the amount of data to be covered by dependency analysis. In addition, named entity recognition (NER) can recognize proper nouns in the text (sentence), such as the names of people and places; proper nouns very probably carry the core information of the sentence and therefore provide a basis for analyzing the dependency relationships.
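A minimal sketch of this filtering step is shown below. It assumes the text has already been tokenized and tagged, and that the tags follow the style that appears in this description (nh, nz, nt, v); the tag sets chosen here, the extra tag "p" for a preposition, and the function names are assumptions for illustration.

```python
# Assumed tag inventory: n common noun, nh person name, ni organisation name,
# ns place name, nz other proper noun, nt temporal noun, v verb.
NOMINAL_TAGS = {"n", "nh", "ni", "ns", "nz"}
PROPER_NOUN_TAGS = {"nh", "ni", "ns", "nz"}
KEPT_TAGS = NOMINAL_TAGS | {"nt", "v"}


def filter_by_pos(tagged_words):
    """Drop words whose part of speech is irrelevant to triple extraction."""
    return [(word, pos) for word, pos in tagged_words if pos in KEPT_TAGS]


def entity_candidates(tagged_words):
    """Nominal words are the candidate entities of a triple."""
    return [word for word, pos in tagged_words if pos in NOMINAL_TAGS]


def proper_nouns(tagged_words):
    """Proper nouns among the remaining words (a stand-in for an NER step)."""
    return [word for word, pos in tagged_words if pos in PROPER_NOUN_TAGS]


tagged = [
    ("in", "p"),              # hypothetical preposition, filtered out
    ("January 2016", "nt"),
    ("Passerby A", "nh"),
    ("joined", "v"),
    ("Group A", "nz"),
]
kept = filter_by_pos(tagged)
print(kept)
print(entity_candidates(kept))
print(proper_nouns(kept))
```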
Determining the corresponding extraction template according to the part of speech, the proper nouns and the dependency relationships of each text in the similar corpus set can be implemented as follows:
Taking the determination of the extraction template from nominal words as an example, first, all nominal words that may exist in the sentence are found; then, the extraction template is determined according to the information obtained from dependency analysis, such as the position and type of the root word and the dependency relationships. Taking the sentence shown in FIG. 2 as an example, the root word of the sentence is "joined", and the two dependencies related to it are ("Passerby A (nh)", "joined (v)") and ("Group A (nz)", "joined (v)"). A rule is derived from this correspondence: when the root word is a verb and the dependency pattern [a first entity in a subject-verb relationship with the root word, the root word, a second entity in a verb-object relationship with the root word] is satisfied, the basic triple (Passerby A (first entity), joined (associated word), Group A (second entity)) can be extracted from the sentence. Going further, it can be determined whether there are words that further modify the first entity (entity1), the associated word (rel) or the second entity (entity2). For example, in the example sentence above, "January 2016 (nt)" modifies "joined (v)" and can therefore additionally supply the time information of the sentence.
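The rule derived above can be written down as a small function over an already-parsed sentence. The sketch assumes each token carries its part of speech, the index of its head and a relation label; the labels "SBV" (subject-verb), "VOB" (verb-object), "ADV" (adverbial modifier) and "HED" (root), as well as the `Token` type and the function name, are assumptions chosen for the example rather than details given in this description.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    pos: str    # part-of-speech tag, e.g. "nh", "v", "nz", "nt"
    head: int   # index of the head token, -1 for the root word
    rel: str    # assumed relation label to the head


def extract_basic_triple(tokens):
    """Apply the basic triple template: [subject of root, verbal root, object of root]."""
    roots = [i for i, tok in enumerate(tokens) if tok.head == -1]
    if len(roots) != 1 or tokens[roots[0]].pos != "v":
        return None  # template only applies when the single root word is a verb
    root = roots[0]
    subject = next((t for t in tokens if t.head == root and t.rel == "SBV"), None)
    obj = next((t for t in tokens if t.head == root and t.rel == "VOB"), None)
    if subject is None or obj is None:
        return None
    # Words that further modify the associated word, e.g. time expressions.
    modifiers = [t.word for t in tokens if t.head == root and t.rel == "ADV"]
    return {
        "triple": (subject.word, tokens[root].word, obj.word),
        "modifiers": modifiers,
    }


sentence = [
    Token("January 2016", "nt", 2, "ADV"),
    Token("Passerby A", "nh", 2, "SBV"),
    Token("joined", "v", -1, "HED"),
    Token("Group A", "nz", 2, "VOB"),
]
result = extract_basic_triple(sentence)
print(result)
```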
As another example, consider the text "Congratulations came from Person B, chief executive officer of Company B, thanking Person C for always supporting and favoring the product". The sentence contains three consecutive nominal words, and the dependency relationships among the three are modifier relationships. The extraction rule corresponding to the text can therefore be determined as follows: when the dependency relationships among [first nominal word, second nominal word, third nominal word] are all modifier relationships, the text belongs to the attribute triple template, and the corresponding extracted triple is (Company B, chief executive officer, Person B), and so on.
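The attribute triple rule can be sketched in the same style. The example below assumes that "modifier relationships" means each of the first two nominal words is an attributive modifier of the word that follows it, labelled "ATT"; that reading, the tag set, the hand-written parse and the names used are all assumptions for illustration.

```python
from collections import namedtuple

# word, part-of-speech tag, index of head token (-1 for root), relation label
Tok = namedtuple("Tok", "word pos head rel")

NOMINAL_TAGS = {"n", "nh", "ni", "ns", "nz"}  # assumed nominal tag set


def extract_attribute_triples(tokens):
    """Find three consecutive nominal words chained by modifier (ATT) relations."""
    triples = []
    for i in range(len(tokens) - 2):
        first, second, third = tokens[i], tokens[i + 1], tokens[i + 2]
        all_nominal = all(t.pos in NOMINAL_TAGS for t in (first, second, third))
        chained = (
            first.rel == "ATT" and first.head == i + 1
            and second.rel == "ATT" and second.head == i + 2
        )
        if all_nominal and chained:
            triples.append((first.word, second.word, third.word))
    return triples


# Hypothetical, simplified parse of the example sentence.
parsed = [
    Tok("Company B", "ni", 1, "ATT"),
    Tok("chief executive officer", "n", 2, "ATT"),
    Tok("Person B", "nh", 3, "SBV"),
    Tok("thanked", "v", -1, "HED"),
    Tok("Person C", "nh", 3, "VOB"),
]
attribute_triples = extract_attribute_triples(parsed)
print(attribute_triples)
```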
In addition, language models and statistical evidence indicate that the longer the text, the higher the error rate of dependency analysis; the error rate also rises when the text contains clauses that make its structure too complicated. To improve the accuracy of the analysis, the length of the text can be obtained; if the length of the text exceeds a preset threshold, the text is decomposed according to its clause structure, and the dependency relationships between the words in the decomposed text are analyzed through dependency syntax. Alternatively, when the positions of the entities in the text are more than a certain distance apart, the entities are considered to have no dependency relationship, and no triple is extracted.
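Both safeguards can be sketched in a few lines. Splitting at commas, semicolons and colons is used here only as a crude proxy for the clause structure, and the threshold values and function names are assumptions; the application does not state how clauses are detected or which thresholds are used.

```python
import re

# Assumed clause delimiters (Western and full-width forms).
CLAUSE_BOUNDARY = re.compile(r"[,，;；:：]")


def decompose_if_long(text, max_length=30):
    """Return the text unchanged if short, otherwise split it into clauses."""
    if len(text) <= max_length:
        return [text]
    return [clause.strip() for clause in CLAUSE_BOUNDARY.split(text) if clause.strip()]


def entities_too_far(position_a, position_b, max_gap=20):
    """True if two entity positions are so far apart that no triple should be extracted."""
    return abs(position_a - position_b) > max_gap


short_text = "Passerby A joined Group A"
long_text = (
    "Congratulations came from the chief executive officer of Company B, "
    "who thanked Person C for supporting the product"
)
print(decompose_if_long(short_text))
print(decompose_if_long(long_text))
print(entities_too_far(0, 50), entities_too_far(3, 10))
```

Each returned clause would then be passed to the dependency parser separately.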
In addition, the extraction templates mentioned in the above embodiments may be implemented by formulating extraction models. Specifically, an extraction model can be formulated based on a regular expression matching algorithm, and an extraction model can also be formulated based on dependency syntax. After an extraction model is formulated, its accuracy is verified using other corpora that differ from every corpus in the corpus set to be processed, and the extraction rules adopted by the extraction template are iteratively optimized according to the sentence extraction results. The formulated extraction models cover various categories, so the method can be applied well in different scenarios, for example cold-start scenarios.
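One round of such verification could look like the sketch below: each candidate rule is scored on a held-out verification corpus and the better one is kept. The scoring metric (precision over produced triples), the two hypothetical candidate patterns and the tiny verification set are assumptions; the application does not prescribe how accuracy is measured or how rules are revised.

```python
import re


def make_extractor(pattern):
    """Build an extractor that returns (subject, 'located in', place) or None."""
    compiled = re.compile(pattern)

    def extract(text):
        match = compiled.match(text)
        if match is None:
            return None
        return (match.group(1), "located in", match.group(2))

    return extract


def precision(extract, verification_set):
    """Share of produced triples that equal the expected triple."""
    produced = correct = 0
    for text, expected in verification_set:
        got = extract(text)
        if got is not None:
            produced += 1
            if got == expected:
                correct += 1
    return correct / produced if produced else 0.0


def pick_best(candidates, verification_set):
    """Keep the candidate rule with the highest precision on the verification corpus."""
    return max(candidates, key=lambda name: precision(candidates[name], verification_set))


candidates = {
    "loose": make_extractor(r"^(.+?) (?:is|was) (.+)$"),
    "strict": make_extractor(r"^(.+?) is located in (.+)$"),
}
verification_set = [
    ("Enterprise A is located in Shenzhen", ("Enterprise A", "located in", "Shenzhen")),
    ("Passerby B is tired", None),  # no triple should be produced here
]
best = pick_best(candidates, verification_set)
print(best, precision(candidates[best], verification_set))
```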
The steps of the above methods are divided only for clarity of description. In implementation, several steps may be combined into one step, or one step may be split into several steps; as long as the same logical relationship is included, such variants fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process also falls within the protection scope of this patent.
In addition, the present application also provides an apparatus embodiment corresponding to the first embodiment. As shown in FIG. 5, the triple extraction apparatus includes: an obtaining module 41, configured to obtain a corpus set to be processed; a clustering module 42, configured to cluster the corpus set to be processed with the artificial intelligence similarity search algorithm Faiss to obtain a plurality of similar corpus sets; an analysis module 43, configured to analyze, through dependency syntax, the dependency relationships between the words of each text in the similar corpus sets; and an extraction module 44, configured to determine an extraction template corresponding to each similar corpus set according to the dependency relationships and to extract triples according to the extraction template.
This embodiment can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment remain valid in this embodiment and are not described again here to reduce repetition. Accordingly, the related technical details mentioned in this embodiment can also be applied to the first embodiment.
In addition, the present application also provides an apparatus embodiment corresponding to the second embodiment. As shown in FIG. 6, on the basis of the foregoing, the triple extraction apparatus further includes: a regular matching module 45, configured to match the plurality of similar corpus sets against a regular expression; if a similar corpus set matches the regular expression, to extract triples according to the regular expression; and if a similar corpus set does not match the regular expression, to perform the step of analyzing, through dependency syntax, the dependency relationships between the words of each text in the similar corpus set.
This embodiment can be implemented in cooperation with the second embodiment. The relevant technical details mentioned in the second embodiment are still valid in this embodiment.
In addition, the present application also provides an apparatus embodiment corresponding to the third embodiment. As shown in FIG. 7, on the basis of the foregoing, the triple extraction apparatus further includes: a part-of-speech analysis module 46, configured to analyze the part of speech of the words of each text in the similar corpus set; and a proper noun recognition module 47, configured to recognize proper nouns in each text in the similar corpus set. The analysis module 43 is configured to determine a corresponding extraction template according to the part of speech, the proper nouns and the dependency relationships of each text in the similar corpus set.
In addition, the analysis module 43 is specifically configured to filter words in the text according to the part of speech; determining proper nouns in the screened words; determining a core word of the text according to the proper noun and the dependency relationship; and determining a corresponding extraction template according to the position of the core word in the text and the part of speech of the core word.
In addition, the analysis module 43 is further configured to obtain the length of the text; if the length of the text exceeds a preset threshold value, decomposing the text according to a clause structure of the text; and analyzing the dependency relationship between words in the decomposed text through dependency syntax.
In addition, as shown in fig. 7, the apparatus further includes: an optimization module 48, configured to perform iterative optimization on the extraction template by using a verification corpus.
In addition, the obtaining module 41 is specifically configured to obtain unstructured text from the internet by means of a web crawler, and to use the unstructured text as the corpus set to be processed.
This embodiment can be implemented in cooperation with the third embodiment. The relevant technical details mentioned in the third embodiment are still valid in this embodiment.
It should be noted that all the modules involved in this embodiment are logic modules. In practical applications, one logic unit may be one physical unit, may be a part of one physical unit, or may be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
An embodiment of the present invention relates to a method for constructing a knowledge graph, comprising the following steps: step 901, acquiring a resource data set; step 902, extracting triples from the resource data set by using the triple extraction method in the above embodiments; and step 903, constructing a knowledge graph according to the extracted triples.
The related technical details mentioned in the above embodiments relating to the triplet extraction method are still valid in this embodiment.
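Step 903 can be sketched as follows. The in-memory adjacency mapping and the function name `build_graph` are assumptions made for this illustration. The application does not say how the graph is stored.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_graph(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each head entity to its (relation, tail entity) edges, skipping duplicates."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        if (relation, tail) not in graph[head]:
            graph[head].append((relation, tail))
    return dict(graph)
```

A graph database could replace the dictionary without changing the overall flow of steps 901 to 903.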
An embodiment of the present invention relates to a natural language processing method. As shown in fig. 10, the method comprises the following steps: step 1001, acquiring query data; step 1002, performing semantic analysis and syntactic analysis on the query data by using the knowledge graph constructed as described above; and step 1003, converting the query data into a query statement in a preset format according to the results of the semantic analysis and the syntactic analysis.
The relevant technical details mentioned in the above embodiments relating to the method of constructing a knowledge graph are still valid in this embodiment.
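Steps 1001 to 1003 can be illustrated with a deliberately small sketch. The question pattern, the use of the relation names in the knowledge graph as a semantic check, and the SPARQL-like output string are all assumptions. The application names neither the preset format nor the analysis technique.

```python
import re
from typing import Optional, Set

# Hypothetical syntactic pattern for questions such as "What is the capital of France?"
QUESTION = re.compile(r"^what is the (?P<rel>.+?) of (?P<ent>.+?)\??$", re.IGNORECASE)


def to_query(question: str, known_relations: Set[str]) -> Optional[str]:
    """Convert a question into an assumed SPARQL-like statement, or None."""
    match = QUESTION.match(question.strip())
    if match is None:
        return None  # syntactic analysis failed
    relation = match.group("rel").lower()
    if relation not in known_relations:
        return None  # relation is not present in the knowledge graph
    return f"SELECT ?x WHERE {{ <{match.group('ent')}> <{relation}> ?x }}"
```

Here `known_relations` would be collected from the relation words of the triples stored in the knowledge graph.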
An embodiment of the present invention relates to an electronic device, as shown in fig. 11, comprising at least one processor 1101; and a memory 1102 communicatively coupled to the at least one processor 1101; the memory 1102 stores instructions executable by the at least one processor 1101 to enable the at least one processor 1101 to perform the triple extraction method described above, or to perform the knowledge graph construction method described above, or to perform the natural language processing method described above.
The memory 1102 and the processor 1101 are coupled by a bus, which may comprise any number of interconnected buses and bridges that link together the various circuits of the one or more processors 1101 and the memory 1102. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 1101 is transmitted over a wireless medium through an antenna; the antenna also receives data and passes it to the processor 1101.
The processor 1101 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 1102 may be used to store data used by the processor 1101 in performing operations.
An embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by the processor, implements the method of triplet extraction described above, or is capable of performing the method of knowledge graph construction described above, or is capable of performing the method of natural language processing described above.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that, in practice, various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (13)

1. A triple extraction method, comprising:
obtaining a corpus set to be processed;
clustering the corpus set to be processed by an artificial intelligence similarity search algorithm faiss to obtain a plurality of similar corpus sets;
analyzing the dependency relationship among the words of each text in the similar corpus sets through dependency syntax;
and determining an extraction template corresponding to each similar corpus set according to the dependency relationship, and extracting the triples according to the extraction template.
2. The triple extraction method according to claim 1, further comprising, after obtaining the plurality of similar corpus sets:
matching the plurality of similar corpus sets with a regular expression;
if a similar corpus set matches the regular expression, extracting the triples according to the regular expression;
and if the similar corpus set does not match the regular expression, executing the step of analyzing the dependency relationship between the words of each text in the similar corpus set through dependency syntax.
3. The triple extraction method according to claim 1 or 2, wherein before the analyzing the dependency relationship between the words of each text in the similar corpus set through dependency syntax, the method further comprises:
analyzing the part of speech of the words of each text in the similar corpus set;
identifying proper nouns in each text in the similar corpus set;
the determining the extraction template corresponding to each similar corpus set according to the dependency relationship includes:
determining the corresponding extraction template according to the part of speech, the proper nouns and the dependency relationship of each text in the similar corpus set.
4. The triple extraction method as claimed in claim 3, wherein the determining the corresponding extraction template according to the part of speech, the proper nouns, and the dependency relationship of each text in the similar corpus set comprises:
screening words in the text according to the part of speech;
determining proper nouns in the screened words;
determining a core word of the text according to the proper noun and the dependency relationship;
and determining the corresponding extraction template according to the position of the core word in the text and the part of speech of the core word.
5. The triple extraction method as claimed in claim 3, wherein
the extraction template at least comprises: a basic triple template and an attribute triple template; the structure of the basic triple template comprises a first entity, a relevant word and a second entity; the structure of the attribute triple template comprises a first nominal word, a second nominal word and a third nominal word.
6. The triple extraction method as claimed in claim 5, wherein the analyzing the dependency relationship between the words of each text in the similar corpus set through dependency syntax includes:
acquiring the length of the text;
if the length of the text exceeds a preset threshold value, decomposing the text according to a clause structure of the text;
analyzing the dependency relationship between the words in the decomposed text through dependency syntax.
7. The triple extraction method according to claim 5 or 6, wherein after determining the extraction template corresponding to each similar corpus set according to the dependency relationship, the method further comprises:
performing iterative optimization on the extraction template by using a verification corpus;
the extracting the triples according to the extraction template includes:
extracting the triples according to the extraction template after the iterative optimization.
8. The triple extraction method according to claim 1, wherein the obtaining the corpus set to be processed includes:
acquiring unstructured text from the Internet by means of a web crawler, and taking the unstructured text as the corpus set to be processed.
9. A triple extraction device, comprising:
the acquisition module is used for acquiring a corpus set to be processed;
the clustering module is used for clustering the corpus set to be processed through an artificial intelligence similarity search algorithm faiss to obtain a plurality of similar corpus sets;
the analysis module is used for analyzing the dependency relationship among the words of each text in the similar corpus sets through dependency syntax;
and the extraction module is used for determining the extraction template corresponding to each similar corpus set according to the dependency relationship and extracting the triples according to the extraction template.
10. A method for constructing a knowledge graph, comprising:
acquiring a resource data set;
extracting triples from the resource data set using the triple extraction method as claimed in any one of claims 1 to 8;
and constructing a knowledge graph according to the extracted triples.
11. A natural language processing method, comprising:
acquiring query data;
semantically and syntactically analyzing the query data using the knowledge-graph constructed as set forth in claim 10;
and converting the query data into a query statement in a preset format according to the semantic analysis result and the syntactic analysis result.
12. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a triple extraction method as claimed in any one of claims 1 to 8, or to perform a method of knowledge-graph construction as claimed in claim 10, or to perform a natural language processing method as claimed in claim 11.
13. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the triple extraction method of any one of claims 1 to 8, or is capable of performing the method of constructing a knowledge graph of claim 10, or is capable of performing the method of natural language processing of claim 11.
CN202111667514.9A 2021-12-30 2021-12-30 Triple extraction method, equipment and computer storage medium Pending CN114328970A (en)

Priority Applications (1)

Application Number: CN202111667514.9A
Publication Number: CN114328970A (en)
Priority Date: 2021-12-30
Filing Date: 2021-12-30
Title: Triple extraction method, equipment and computer storage medium

Publications (1)

Publication Number: CN114328970A (en)
Publication Date: 2022-04-12

Family

ID=81020127

Country Status (1)

Country Link
CN (1) CN114328970A (en)

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 200245 Building 8, No. 207, Zhongqing Road, Minhang District, Shanghai
Applicant after: Dayu robot Co.,Ltd.
Address before: 200245 Building 8, No. 207, Zhongqing Road, Minhang District, Shanghai
Applicant before: Dalu Robot Co.,Ltd.