CN111859974A - Semantic disambiguation method and device combined with knowledge graph and intelligent learning equipment - Google Patents


Publication number
CN111859974A
Authority
CN
China
Prior art keywords
words
word
polysemous
triples
semantic disambiguation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910325404.0A
Other languages
Chinese (zh)
Inventor
魏誉荧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Genius Technology Co Ltd
Original Assignee
Guangdong Genius Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Genius Technology Co Ltd filed Critical Guangdong Genius Technology Co Ltd
Priority to CN201910325404.0A priority Critical patent/CN111859974A/en
Publication of CN111859974A publication Critical patent/CN111859974A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology

Abstract

The invention provides a semantic disambiguation method and device combined with a knowledge graph, and an intelligent learning device. The semantic disambiguation method comprises: performing word segmentation on a target text and part-of-speech tagging on the resulting words; when a polysemous word exists among the words, extracting the polysemous word and the related words corresponding to it from the words; and paraphrasing the polysemous word according to the polysemous word, the related words and a domain knowledge graph. By combining a knowledge graph with a traditional lexicon for semantic disambiguation, the invention further improves disambiguation accuracy and facilitates the popularization and application of intelligent education equipment.

Description

Semantic disambiguation method and device combined with knowledge graph and intelligent learning equipment
Technical Field
The invention relates to the technical field of semantics, in particular to a semantic disambiguation method and device combining a knowledge graph and intelligent learning equipment.
Background
With the continuous development of society, more and more intelligent learning devices, such as home tutoring machines and student tablets, are widely used in households. People use these devices to assist children's learning: for example, when a child encounters a problem that cannot be solved or a knowledge point that is not understood, the relevant question or knowledge point is input by voice or text, and the corresponding answer or explanation is looked up on the intelligent learning device.
Currently, in a human-computer interaction scenario, accurate understanding of semantics of input information is the basis for making correct responses. If the semantic analysis failure rate of the product is high, the use experience of the user is poor, and the popularization and the use of the product are not facilitated.
To obtain the correct semantics, the word senses of the key words in the input information must first be understood correctly. Word sense disambiguation is an important semantic technique that aims to determine the specific meaning of each ambiguous word in a particular context. Existing word sense disambiguation methods are based on a semantic dictionary: the paraphrases, synonym sets and the like of each word are looked up in the semantic dictionary and then combined with the context to obtain the paraphrase of the polysemous word in the text.
Disclosure of Invention
The invention aims to provide a semantic disambiguation method and device combining a knowledge graph and intelligent learning equipment, which not only improve the accuracy of the machine in understanding text semantics, but also improve the semantic disambiguation speed of the machine, thereby being beneficial to popularization and application of intelligent education equipment.
The technical scheme provided by the invention is as follows:
A method of semantic disambiguation in conjunction with a knowledge graph, comprising: performing word segmentation on a target text, and performing part-of-speech tagging on the words obtained after word segmentation; when a polysemous word exists among the words, extracting the polysemous word and the related words corresponding to the polysemous word from the words; and paraphrasing the polysemous word according to the polysemous word, the related words and a domain knowledge graph.
Further preferably, the extracting the ambiguous word and the related word corresponding to the ambiguous word from the words includes: extracting polysemous words and candidate words which accord with a preset part of speech range from the words; calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus; and selecting candidate words with point mutual information larger than a preset threshold value as related words of the polysemous words.
Further preferably, the formula for calculating the point mutual information is as follows:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
Further preferably, the paraphrasing the polysemous word according to the polysemous word, the related words and the domain knowledge graph comprises: acquiring the entity corresponding to the polysemous word in the domain knowledge graph according to the polysemous word; traversing all the triples of the entity to find the triples matching the related words; and determining the paraphrase of the polysemous word according to the matching triples.
Further preferably, the determining the paraphrase of the polysemous word according to the matching triples further comprises: when all triples of the entity have been traversed and no triple matching the related words is found, obtaining near-synonyms of the related words according to a preset lexicon; and traversing all the triples of the entity to find the triples matching the near-synonyms of the related words.
The invention also provides a semantic disambiguation device combined with a knowledge graph, comprising: a text word segmentation module, configured to segment a target text and perform part-of-speech tagging on the words obtained after segmentation; a related word extraction module, configured to extract, when a polysemous word exists among the words, the polysemous word and the related words corresponding to the polysemous word from the words; and a matching paraphrase module, configured to paraphrase the polysemous word according to the polysemous word, the related words and a domain knowledge graph.
Further preferably, the related word extraction module is further configured to extract the polysemous word and candidate words that fall within a preset part-of-speech range from the words; calculate the point mutual information between the candidate words and the polysemous word according to a large-scale corpus; and select the candidate words whose point mutual information is greater than a preset threshold as the related words of the polysemous word.
Further preferably, the matching paraphrase module comprises: a search matching unit, configured to acquire the entity corresponding to the polysemous word in the domain knowledge graph according to the polysemous word, and to traverse all the triples of the entity to find the triples matching the related words; and a word paraphrase unit, configured to determine the paraphrase of the polysemous word according to the matching triples.
Further preferably, the search matching unit is further configured to, when all triples of the entity have been traversed and no triple matching the related words is found, obtain near-synonyms of the related words according to a preset lexicon, and traverse all the triples of the entity to find the triples matching the near-synonyms of the related words.
The invention also provides an intelligent learning device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of semantic disambiguation in conjunction with a knowledge graph described in any one of the above.
The semantic disambiguation method and device combining the knowledge graph and the intelligent learning equipment provided by the invention can bring the following beneficial effects:
1. the invention combines the knowledge graph to carry out semantic disambiguation, and compared with the traditional method which is only based on a dictionary, the invention not only improves the semantic disambiguation speed of the machine, but also improves the accuracy of the machine in understanding the text semantics, thereby being beneficial to the popularization and the application of intelligent education equipment.
2. The invention measures the correlation between two words using a large-scale corpus and extracts the related words of the polysemous word according to that correlation, so the related words are extracted more accurately and the paraphrase accuracy for the polysemous word is improved.
3. The invention searches the knowledge graph for matching triples based on the polysemous word together with either the related words or the near-synonyms of the related words, which improves the success rate of finding matching triples and thereby the semantic understanding accuracy of the intelligent learning device.
Drawings
The above features, technical features, advantages and implementations of the method and device for semantic disambiguation in conjunction with a knowledge graph are further described below in a clear and easily understood manner, through preferred embodiments and with reference to the accompanying drawings.
FIG. 1 is a flow diagram of one embodiment of a method of semantic disambiguation in conjunction with a knowledge graph of the present invention;
FIG. 2 is a flow diagram of another embodiment of a method of semantic disambiguation in conjunction with a knowledge graph of the present invention;
FIG. 3 is a flow diagram of another embodiment of a method of semantic disambiguation in conjunction with a knowledge graph of the present invention;
FIG. 4 is a flow diagram of another embodiment of a method of semantic disambiguation in conjunction with a knowledge graph of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a semantic disambiguation apparatus incorporating knowledge-graphs of the present invention;
FIG. 6 is a schematic diagram of another embodiment of a semantic disambiguation apparatus incorporating knowledge-graphs of the present invention;
Fig. 7 is a schematic structural diagram of an embodiment of an intelligent learning device of the present invention.
The reference numbers illustrate:
100. semantic disambiguation device; 110. text word segmentation module; 120. related word extraction module; 130. matching paraphrase module; 131. search matching unit; 132. word paraphrase unit; 200. intelligent learning device; 210. memory; 220. processor; 230. computer program.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
In one embodiment of the invention, as shown in FIG. 1, a method of semantic disambiguation in conjunction with a knowledge graph, comprising:
step S100, performing word segmentation on the target text, and performing part-of-speech tagging on the words obtained after word segmentation.
Specifically, the target text is the text to be semantically parsed. It has already been split into paragraphs or clauses, so that its length is limited to a certain range, such as one or two clauses.
First, the target text is segmented to obtain all possible words that match a preset lexicon, which yields multiple candidate segmentation results. The optimal segmentation result is then determined according to a preset statistical language model: generally, the occurrence probability of each segmentation result is calculated, and the result with the highest probability is selected to segment the target text, thereby resolving ambiguity in word segmentation.
For example, for the text string "Nanjing Yangtze River Bridge" (南京市长江大桥), entry retrieval is performed first: all matching entries are found in the preset lexicon (Nanjing 南京, city 市, Yangtze River 长江, bridge 大桥, Nanjing City 南京市, Yangtze River Bridge 长江大桥, mayor 市长, river 江, Jiang Da 江大, bridge 桥), which yields several segmentation results:
Result 1: Nanjing \ city \ Yangtze River \ bridge (南京\市\长江\大桥); Result 2: Nanjing City \ Yangtze River \ bridge (南京市\长江\大桥); Result 3: Nanjing City \ Yangtze River Bridge (南京市\长江大桥); Result 4: Nanjing \ mayor \ river \ bridge (南京\市长\江\大桥); Result 5: Nanjing \ mayor \ Jiang Da \ bridge (南京\市长\江大\桥).
Path searching is then performed to find the optimal path based on the statistical language model. The segmentation "Nanjing City \ Yangtze River \ bridge" receives the highest language-model score, so it is the optimal segmentation.
And obtaining each word according to the optimal segmentation result, and performing part-of-speech tagging on each word.
Word segmentation and part-of-speech tagging of the text may be performed with an open-source toolkit (e.g., the NLP parser from Stanford University, USA).
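The segmentation procedure described above (enumerate every lexicon-compatible split, then keep the one the language model scores highest) can be sketched as follows. This is only an illustration under stated assumptions: the Chinese string 南京市长江大桥 is assumed to be the original of the "Nanjing Yangtze river bridge" example, the lexicon probabilities are invented toy values, and a simple unigram model stands in for the unspecified "preset statistical language model".

```python
import math

# Hypothetical toy lexicon: word -> made-up unigram probability (not from the source).
LEXICON = {
    "南京": 0.02, "市": 0.01, "长江": 0.02, "大桥": 0.015,
    "南京市": 0.03, "长江大桥": 0.0001, "市长": 0.01,
    "江": 0.005, "江大": 0.0001, "桥": 0.005,
}

def segmentations(text):
    """Yield every way to split `text` into words that all appear in LEXICON."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in LEXICON:
            for rest in segmentations(text[end:]):
                yield [word] + rest

def score(words):
    """Log-probability of a segmentation under the toy unigram model."""
    return sum(math.log(LEXICON[w]) for w in words)

def best_segmentation(text):
    """Pick the candidate segmentation with the highest model score."""
    return max(segmentations(text), key=score)

candidates = list(segmentations("南京市长江大桥"))
best = best_segmentation("南京市长江大桥")
print(best)
```

With these toy probabilities the selected split is Nanjing City \ Yangtze River \ bridge; a production system would use a trained language model and a dynamic-programming search rather than exhaustive enumeration.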
Step S200, when the words have the polysemous words, extracting the polysemous words and the corresponding related words from the words.
Specifically, suppose the target text is "the founder of Apple, Jobs"; word segmentation yields "apple (noun) \ of (structural particle) \ founder (noun) \ Jobs (noun)". Whether a polysemous word exists is judged from the paraphrases of each word in the preset lexicon. According to the lexicon, the word "apple" has several meanings: it may be a fruit or a mobile phone brand, so it is a polysemous word. To determine which meaning the polysemous word has in the text, words strongly correlated with it need to be extracted from the segmented words as related words. Related words contribute greatly to determining the sense of the polysemous word, and they can be extracted in various ways. In a first method, only nouns, verbs and adjectives are selected as related words according to a rule, so that uninformative words such as particles and pronouns are filtered out. In a second method, the degree of correlation between each segmented word (other than the polysemous word) and the polysemous word is evaluated, and words whose correlation exceeds a certain threshold are selected as related words. Extracting related words reduces interference from unrelated words, lowers the cost of the subsequent search, and speeds up correct paraphrasing.
Taking the target text "the founder of Apple, Jobs" as an example, the polysemous word is determined to be "apple"; selecting nouns, verbs and adjectives as related words according to the rule gives "founder" and "Jobs" as the related words of "apple".
As another example, for the text "the color of this apple is very red", word segmentation yields "this (demonstrative pronoun) \ apple (noun) \ of (structural particle) \ color (noun) \ very red (adjective)"; "apple" is a polysemous word, and selecting nouns, verbs and adjectives as related words according to the rule gives "color" and "very red" as the related words of "apple".
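The rule-based extraction described above (keep only nouns, verbs and adjectives, and drop the polysemous word itself) might be sketched like this. The tag names, the function name `extract_related_words` and the pre-tagged input are illustrative assumptions; the source does not specify a tag set or an API.

```python
# Parts of speech kept as related-word candidates, following the rule in the text.
CONTENT_POS = {"noun", "verb", "adjective"}

def extract_related_words(tagged_words, polysemous_word):
    """Return words whose tag is a noun, verb or adjective, excluding the polysemous word."""
    return [
        word
        for word, pos in tagged_words
        if pos in CONTENT_POS and word != polysemous_word
    ]

# Hypothetical tagged output for the "very red apple" example (English glosses).
tagged = [
    ("this", "pronoun"),
    ("apple", "noun"),
    ("of", "particle"),
    ("color", "noun"),
    ("very red", "adjective"),
]
related = extract_related_words(tagged, "apple")
print(related)  # ['color', 'very red']
```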
Step S300, paraphrasing the polysemous word according to the polysemous word, the related words and the domain knowledge graph.
Specifically, a knowledge graph describes, in a structured and visual form, the concepts and entities of the objective world and the complex relationships between them. A knowledge graph is a network formed of nodes and the relations between nodes; concepts and entities of the objective world can serve as nodes. A knowledge graph mainly comprises nodes, relations and triples. Each triple represents one piece of knowledge, namely that a certain relation or attribute holds between two nodes, and is written as (head node, relation or attribute, tail node). For example, (Hangzhou, located in, China) represents the knowledge that "Hangzhou is located in China", and (apple, color, red) represents that the color attribute of an apple is red.
Domain knowledge graphs include public domain knowledge graphs, professional domain knowledge graphs and subdivided domain knowledge graphs. Professional or subdivided domain knowledge graphs are more targeted and specialized, so when a professional text is semantically parsed by searching the corresponding professional domain knowledge graph, the semantics are determined with a higher success rate and the search is faster. For example, for semantic parsing of a medical text, searching a medical knowledge graph is considered first; for semantic parsing of a spoken question from a primary school student, searching an educational knowledge graph for primary school students makes it easier to obtain the correct semantics.
Taking the target text "the founder of Apple, Jobs" as an example, the polysemous word is "apple" and the related words are "founder" and "Jobs". The knowledge graph is searched with "apple", "founder" and "Jobs" to see whether a triple containing two or three of these words exists. If the triple (apple, founder, Jobs) exists, it is a match (the triple means that the founder of Apple is Jobs), and from the meaning of the triple it can be determined that "apple" in the target text refers to Apple, a company that designs and produces mobile phones.
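As a rough illustration of the triple representation and the "two or three of the words" check described above, the sketch below stores triples as (head, relation or attribute, tail) tuples and returns those containing at least two of the query words. The `Triple` type, the toy graph and the `min_hits` parameter are assumptions made for this example, and "founder" and "Jobs" are the English renderings assumed here for the related words of the Apple example.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """One piece of knowledge: (head node, relation or attribute, tail node)."""
    head: str
    relation: str
    tail: str

# Hypothetical toy knowledge graph built from the examples in the text.
GRAPH = [
    Triple("Hangzhou", "located in", "China"),
    Triple("apple", "color", "red"),
    Triple("apple", "founder", "Jobs"),
]

def triples_containing(words, graph, min_hits=2):
    """Return the triples in which at least `min_hits` of `words` appear as elements."""
    wanted = set(words)
    return [t for t in graph if len(wanted & set(t)) >= min_hits]

matches = triples_containing(["apple", "founder", "Jobs"], GRAPH)
print(matches)
```

Here only (apple, founder, Jobs) is returned, because (apple, color, red) shares a single word with the query.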
In this embodiment, the related words strongly correlated with the polysemous word are extracted from the target text, and the polysemous word and the related words are searched for in a suitable domain knowledge graph; when a matching triple is found, the semantics of the polysemous word can be determined accurately from the knowledge the triple expresses. Compared with traditional semantic disambiguation based only on a dictionary, this improves both the accuracy and the speed of the machine's semantic understanding of text, which facilitates the popularization and application of intelligent education equipment based on machine semantic understanding.
In another embodiment of the present invention, as shown in FIG. 2, a method of semantic disambiguation in conjunction with a knowledge-graph, comprising:
step S100, performing word segmentation on the target text, and performing part-of-speech tagging on the words obtained after word segmentation.
Step S210, when a polysemous word exists in the words, extracting the polysemous word and a candidate word which accords with a preset part of speech range from the words.
Step S220, calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus;
the calculation formula of the point mutual information is as follows:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
Step S230, selecting the candidate words whose point mutual information is greater than a preset threshold as the related words of the polysemous word.
Specifically, the preset part-of-speech range is a condition that the part of speech of a candidate word must satisfy. Since the elements that make up triples in a knowledge graph are mainly nouns, verbs and adjectives, the preset part-of-speech range is generally set to these parts of speech.
Point mutual information, more commonly called pointwise mutual information (PMI), is a measure from information theory of the correlation between two things; in the present invention it measures the correlation between two words. Based on a large-scale corpus, the number of occurrences of each word and the number of co-occurrences of two words in the same corpus are counted, and the point mutual information value between the words is calculated from these statistics.
First, candidate words (excluding the polysemous word) are selected from the segmented words according to the preset part-of-speech range; the point mutual information value between each candidate word and the polysemous word is then calculated; candidate words whose value is not higher than the preset threshold are filtered out, and the remaining candidate words are taken as the related words of the polysemous word.
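The counting and filtering steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes co-occurrence means "appearing in the same corpus document", converts the counts to relative document frequencies before taking the logarithm (the usual way PMI is estimated), and uses a four-document toy corpus and a threshold of 0 that are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    """Estimate PMI for every word pair that co-occurs in at least one document."""
    n = len(documents)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing both words of a pair
    for doc in documents:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return {
        pair: math.log((count / n) / ((word_df[pair[0]] / n) * (word_df[pair[1]] / n)))
        for pair, count in pair_df.items()
    }

def pmi(table, a, b):
    """Look up PMI(a, b); pairs that never co-occur get negative infinity."""
    return table.get(tuple(sorted((a, b))), float("-inf"))

def related_words(table, polysemous, candidates, threshold):
    """Keep candidates whose PMI with the polysemous word exceeds the threshold."""
    return [c for c in candidates if pmi(table, polysemous, c) > threshold]

# Hypothetical toy corpus; each inner list is one tokenized document.
CORPUS = [
    ["apple", "founder", "Jobs"],
    ["apple", "founder", "company"],
    ["apple", "color", "red"],
    ["sky", "color", "blue"],
]
table = pmi_table(CORPUS)
related = related_words(table, "apple", ["founder", "Jobs", "sky"], threshold=0.0)
print(related)
```

In this toy corpus "founder" and "Jobs" pass the threshold, while "sky" never co-occurs with "apple" and is filtered out.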
Step S300, paraphrasing the polysemous word according to the polysemous word, the related words and the domain knowledge graph.
In the embodiment, the relevance of the two words is identified through the large-scale corpus, and the relevant words of the polysemous words are extracted according to the relevance of the two words, so that the extraction of the relevant words is more accurate, and the paraphrase accuracy rate of the polysemous words is improved.
In another embodiment of the present invention, as shown in FIG. 3, a method of semantic disambiguation in conjunction with a knowledge-graph, comprising:
step S100, performing word segmentation on the target text, and performing part-of-speech tagging on the words obtained after word segmentation.
Step S210, when a polysemous word exists in the words, extracting the polysemous word and a candidate word which accords with a preset part of speech range from the words;
step S220, calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus;
the calculation formula of the point mutual information is as follows:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
Step S230, selecting the candidate words whose point mutual information is greater than a preset threshold as the related words of the polysemous word.
Step S310, acquiring the entities corresponding to the polysemous word in the domain knowledge graph according to the polysemous word.
Step S320, traversing all triples of the entities to find the triples matching the related words.
Step S360, determining the paraphrase of the polysemous word according to the matching triples.
Specifically, when the domain knowledge graph is searched with the polysemous word and its related words, the entities corresponding to the polysemous word are found first. For example, the polysemous word "apple" may correspond to three entities: "apple", "Apple company" and "Apple mobile phone". All triples of these three entities are retrieved, and "apple" (the polysemous word) and the two related words "founder" and "Jobs" are matched against each retrieved triple. The triple with the highest matching degree is selected as the matching triple, giving (Apple company, founder, Jobs). From the meaning of the matching triple it can be determined that "apple" in the target text refers to Apple, a company that designs and produces mobile phones.
As another example, for the text "the color of this apple is very red", the polysemous word is "apple" and the related words are "color" and "very red". The polysemous word "apple" corresponds to the three entities "apple", "Apple company" and "Apple mobile phone" in the domain knowledge graph. All triples of these three entities are retrieved, and "apple" (the polysemous word) and the two related words "color" and "very red" are matched against each retrieved triple, giving the matching triple (apple, color, red). From the meaning of the matching triple it can be determined that "apple" in the target text is a fruit.
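Steps S310 to S360 might look roughly like the sketch below: look up the candidate entities for the polysemous word, traverse each entity's triples, and keep the triple with the highest matching degree. The entity table, the triples and the scoring rule (number of related words that appear exactly as the relation or tail) are all assumptions for illustration; the source does not define how the matching degree is computed.

```python
# Hypothetical mapping from a surface word to its candidate entities.
ENTITIES = {"apple": ["apple", "Apple company", "Apple mobile phone"]}

# Hypothetical triples, written as (head entity, relation or attribute, tail).
TRIPLES = [
    ("apple", "color", "red"),
    ("apple", "category", "fruit"),
    ("Apple company", "founder", "Jobs"),
    ("Apple company", "product", "Apple mobile phone"),
    ("Apple mobile phone", "manufacturer", "Apple company"),
]

def disambiguate(polysemous, related, entities=ENTITIES, triples=TRIPLES):
    """Return the best-matching triple for the polysemous word, or None if nothing matches."""
    best, best_score = None, 0
    for entity in entities.get(polysemous, []):      # S310: candidate entities
        for triple in triples:                       # S320: traverse the entity's triples
            if triple[0] != entity:
                continue
            # Assumed matching degree: related words found as relation or tail.
            score = sum(1 for word in related if word in triple[1:])
            if score > best_score:
                best, best_score = triple, score
    return best                                      # S360: the head entity gives the sense

print(disambiguate("apple", ["founder", "Jobs"]))
print(disambiguate("apple", ["color", "very red"]))
```

With exact string matching, "very red" does not match the tail "red", so the second query is resolved by "color" alone; a real system would likely normalize or fuzzily match such modifiers.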
This embodiment describes a method for searching for matching triples in a knowledge graph based on ambiguous words and related words.
In another embodiment of the present invention, as shown in FIG. 4, a method of semantic disambiguation in conjunction with a knowledge-graph, comprising:
step S100, performing word segmentation on the target text, and performing part-of-speech tagging on the words obtained after word segmentation.
Step S210, when a polysemous word exists in the words, extracting the polysemous word and a candidate word which accords with a preset part of speech range from the words;
step S220, calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus;
the calculation formula of the point mutual information is as follows:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
Step S230, selecting the candidate words whose point mutual information is greater than a preset threshold as the related words of the polysemous word.
Step S310, acquiring the entity corresponding to the polysemous word in the domain knowledge graph according to the polysemous word.
Step S330, traversing all triples of the entity and judging whether a triple matching the related words is found; if so, jumping to step S360; otherwise, proceeding to step S340.
Step S340, obtaining near-synonyms of the related words according to a preset lexicon.
Step S350, traversing all triples of the entity to find the triples matching the near-synonyms of the related words.
Step S360, determining the paraphrase of the polysemous word according to the matching triples.
Specifically, when the related words are not accurate enough, no matching triple may be found in the knowledge graph from the polysemous word and the related words. In that case, searching with the near-synonyms of the related words can improve the success rate of finding a matching triple.
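The fallback in steps S330 to S350 can be sketched as a two-pass search: try the related words first, and only if no triple matches, retry with their near-synonyms. The synonym table, the triples and the helper names below are hypothetical; the source only says that near-synonyms come from "a preset word bank".

```python
# Hypothetical near-synonym lexicon.
SYNONYMS = {"creator": ["founder", "originator"]}

# Hypothetical triples of the entity already selected in step S310.
ENTITY_TRIPLES = [
    ("Apple company", "founder", "Jobs"),
    ("Apple company", "product", "mobile phone"),
]

def find_match(triples, words):
    """Return the first triple whose relation or tail equals one of `words`, else None."""
    for triple in triples:
        if any(word in triple[1:] for word in words):
            return triple
    return None

def find_with_fallback(triples, related, synonyms):
    """S330: search with the related words; S340/S350: retry with their near-synonyms."""
    match = find_match(triples, related)
    if match is not None:
        return match
    expanded = [syn for word in related for syn in synonyms.get(word, [])]
    return find_match(triples, expanded)

result = find_with_fallback(ENTITY_TRIPLES, ["creator"], SYNONYMS)
print(result)
```

Here "creator" matches nothing directly, but its near-synonym "founder" matches the first triple, so the search succeeds on the second pass.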
In this embodiment, when the search for a matching triple in the knowledge graph based on the polysemous word and the related words is unsuccessful, the search is repeated with the polysemous word and the near-synonyms of the related words, which improves the success rate of finding a matching triple.
In another embodiment of the present invention, as shown in FIG. 5, a semantic disambiguation apparatus 100 incorporating knowledge-graphs, comprising:
the text word segmentation module 110 is configured to segment a target text and perform part-of-speech tagging on a word obtained after the segmentation.
Specifically, the target text is the text to be semantically parsed. It has already been split into paragraphs or clauses, so that its length is limited to a certain range, such as one or two clauses.
First, the target text is segmented to obtain all possible words that match a preset lexicon, which yields multiple candidate segmentation results. The optimal segmentation result is then determined according to a preset statistical language model: generally, the occurrence probability of each segmentation result is calculated, and the result with the highest probability is selected to segment the target text, thereby resolving ambiguity in word segmentation.
For example, for the text string "Nanjing Yangtze River Bridge" (南京市长江大桥), entry retrieval is performed first: all matching entries are found in the preset lexicon (Nanjing 南京, city 市, Yangtze River 长江, bridge 大桥, Nanjing City 南京市, Yangtze River Bridge 长江大桥, mayor 市长, river 江, Jiang Da 江大, bridge 桥), which yields several segmentation results:
Result 1: Nanjing \ city \ Yangtze River \ bridge (南京\市\长江\大桥); Result 2: Nanjing City \ Yangtze River \ bridge (南京市\长江\大桥); Result 3: Nanjing City \ Yangtze River Bridge (南京市\长江大桥); Result 4: Nanjing \ mayor \ river \ bridge (南京\市长\江\大桥); Result 5: Nanjing \ mayor \ Jiang Da \ bridge (南京\市长\江大\桥).
Path searching is then performed to find the optimal path based on the statistical language model. The segmentation "Nanjing City \ Yangtze River \ bridge" receives the highest language-model score, so it is the optimal segmentation.
Each word is obtained from the optimal segmentation result, and part-of-speech tagging is performed on each word.
Word segmentation and part-of-speech tagging of the text may be performed with an open-source toolkit (e.g., the NLP parser from Stanford University, USA).
A related word extracting module 120, configured to, when a ambiguous word exists in the words, extract the ambiguous word and a related word corresponding to the ambiguous word from the words.
For example, the target text is "Apple's founder Jobs", and word segmentation yields "Apple (noun) \ 's (structural particle) \ founder (noun) \ Jobs (noun)". Whether a polysemous word exists is judged according to the paraphrases of each word in a preset lexicon. According to the preset lexicon, the word "apple" has multiple meanings: it can be a fruit or a mobile phone brand, so it is a polysemous word. To determine the meaning of the polysemous word in the text, words that are strongly correlated with the polysemous word need to be extracted from the segmented words as related words. Related words contribute greatly to determining the sense of the polysemous word. There are various methods for extracting related words. In the first method, only nouns, verbs or adjectives are selected as related words according to a rule, so that uninformative words such as auxiliary words and pronouns are filtered out. In the second method, the degree of correlation between each segmented word (other than the polysemous word) and the polysemous word is evaluated, and words whose correlation is higher than a certain threshold are selected as related words. Extracting related words reduces interference from unrelated words, reduces the cost of the subsequent search, and speeds up finding the correct paraphrase.
Taking the target text "Apple's founder Jobs" as an example, the polysemous word is determined to be "apple", and selecting nouns, verbs or adjectives as related words according to the rule yields "founder" and "Jobs" as the related words of "apple".
For another example, for the text "the color of the apple is very red", word segmentation yields "the (demonstrative pronoun) \ apple (noun) \ of (structural particle) \ color (noun) \ very red (adjective)"; "apple" is the polysemous word, and selecting nouns, verbs or adjectives as related words according to the rule yields "color" and "very red" as the related words of "apple".
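The rule-based extraction (keep only nouns, verbs and adjectives) might look like the following minimal sketch. The tag names, the constant `CONTENT_POS` and the function `extract_related_by_pos` are hypothetical and only illustrate the filtering rule; the tagged input is assumed to come from the word segmentation step.

```python
# Parts of speech kept by the rule; the tag names are an assumption of this sketch.
CONTENT_POS = {"noun", "verb", "adjective"}

def extract_related_by_pos(tagged_words, polysemous_word, allowed_pos=CONTENT_POS):
    """Keep words whose part of speech is allowed, excluding the polysemous word itself."""
    return [
        word
        for word, pos in tagged_words
        if word != polysemous_word and pos in allowed_pos
    ]

# Tagged result for "the color of the apple is very red" (word order as segmented).
tagged = [
    ("the", "pronoun"),
    ("apple", "noun"),
    ("of", "particle"),
    ("color", "noun"),
    ("very red", "adjective"),
]
print(extract_related_by_pos(tagged, "apple"))  # ['color', 'very red']
```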
A matching paraphrase module 130, configured to paraphrase the polysemous word according to the polysemous word, the related words, and a domain knowledge graph.
Specifically, a knowledge graph describes the complex relationships between concepts and entities in the objective world in a structured form. A knowledge graph is a network formed by nodes and the relations between nodes, and concepts and entities in the objective world can serve as nodes. A knowledge graph mainly comprises nodes, relations and triples. Each triple represents one piece of knowledge, namely that a certain relation or attribute exists between two nodes, and is written as (head node, relation or attribute, tail node). For example, (Hangzhou, located in, China) represents the knowledge that "Hangzhou is located in China", and (apple, color, red) represents that the color attribute of the apple is red.
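As a rough illustration of the (head node, relation or attribute, tail node) form, the sketch below stores triples in memory and indexes them by head node. The names `Triple` and `KnowledgeGraph` are hypothetical; the text does not prescribe any particular storage structure.

```python
from collections import defaultdict
from typing import NamedTuple

class Triple(NamedTuple):
    head: str      # head node
    relation: str  # relation or attribute
    tail: str      # tail node

class KnowledgeGraph:
    """Minimal in-memory store that indexes triples by their head node."""

    def __init__(self, triples):
        self._by_head = defaultdict(list)
        for triple in triples:
            self._by_head[triple.head].append(triple)

    def triples_of(self, entity):
        """Return all triples whose head node is the given entity."""
        return list(self._by_head.get(entity, []))

kg = KnowledgeGraph([
    Triple("Hangzhou", "located in", "China"),
    Triple("apple", "color", "red"),
])
print(kg.triples_of("apple"))
```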
The domain knowledge graphs comprise public-domain knowledge graphs, professional-domain knowledge graphs and subdivided-domain knowledge graphs. A professional-domain or subdivided-domain knowledge graph is more targeted and specialized; when a professional text is semantically analyzed by searching the knowledge graph of the corresponding professional domain, the success rate of determining the semantics is higher and the search is faster. For example, semantic parsing of a medical text preferentially searches a medical knowledge graph; for semantic analysis of a question spoken by a primary school pupil, searching a primary school education knowledge graph makes it easier to obtain the correct semantics.
Taking the target text "Apple's founder Jobs" as an example, the polysemous word is "apple" and the related words are "founder" and "Jobs". The knowledge graph is searched with "apple", "founder" and "Jobs" to see whether a triple containing two or three of these words exists. If a triple (Apple, founder, Jobs) exists and matches (the triple means that the founder of Apple is Jobs), the "apple" in the target text is determined, from the meaning of the triple, to be the Apple company, a company that designs and produces mobile phones.
In this embodiment, related words strongly correlated with the polysemous word in the target text are extracted, and the polysemous word and the related words are searched for in an appropriate domain knowledge graph; when a matching triple is found, the semantics of the polysemous word can be determined accurately from the knowledge in the matching triple. Compared with the traditional semantic disambiguation method based only on a dictionary, this improves both the accuracy of the machine's semantic understanding of text and the speed of semantic disambiguation, which facilitates the popularization and application of intelligent education devices based on machine semantic understanding.
In another embodiment of the present invention, as shown in FIG. 5, a semantic disambiguation apparatus 100 incorporating knowledge-graphs, comprising:
the text word segmentation module 110 is configured to segment a target text and perform part-of-speech tagging on a word obtained after the segmentation.
A related word extracting module 120, configured to, when a polysemous word exists in the words, extract the polysemous word and a candidate word that meets a preset part of speech range from the words; and calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus; and selecting candidate words with point mutual information larger than a preset threshold value as related words of the polysemous words.
The related word extraction module calculates point mutual information between two words according to the following formula:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
Specifically, the preset part-of-speech range is a condition that the part of speech of a candidate word must satisfy. Considering that the elements constituting triples in the knowledge graph are mainly nouns, verbs and adjectives, the preset part-of-speech range is generally set accordingly, i.e., to nouns, verbs and adjectives.
Point mutual information (PMI) is a useful information measure in information theory for measuring the correlation between two things; in the present invention it is used to measure the correlation between two words. Based on a large-scale corpus, the number of occurrences of each word and the number of co-occurrences of two words in the same corpus are counted, and the point mutual information value between the words is calculated from these statistics.
First, candidate words (excluding the polysemous word) are selected from the segmented words according to the preset part-of-speech range; the point mutual information value between each candidate word and the polysemous word is then calculated; candidate words whose value is not higher than the preset threshold are filtered out; and the remaining candidate words are taken as the related words of the polysemous word.
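The counting and thresholding steps can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes co-occurrence is counted per sentence, it normalizes the raw counts into relative frequencies before taking the logarithm, and the corpus, the threshold and the names `count_corpus`, `pmi` and `select_related` are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def count_corpus(sentences):
    """Count per-sentence word occurrences and pairwise co-occurrences."""
    word_counts, pair_counts = Counter(), Counter()
    for words in sentences:
        unique = sorted(set(words))
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return word_counts, pair_counts, len(sentences)

def pmi(w1, w2, word_counts, pair_counts, total):
    """PMI from relative frequencies; -inf when the words never co-occur."""
    joint = pair_counts[tuple(sorted((w1, w2)))]
    if joint == 0:
        return -math.inf
    p_joint = joint / total
    p1 = word_counts[w1] / total
    p2 = word_counts[w2] / total
    return math.log(p_joint / (p1 * p2))

def select_related(candidates, polysemous, stats, threshold):
    """Keep candidates whose PMI with the polysemous word exceeds the threshold."""
    word_counts, pair_counts, total = stats
    return [
        c for c in candidates
        if pmi(c, polysemous, word_counts, pair_counts, total) > threshold
    ]

# Tiny made-up corpus of already segmented sentences.
CORPUS = [
    ["apple", "color", "red"],
    ["apple", "color", "green"],
    ["apple", "founder"],
    ["weather", "today"],
]
STATS = count_corpus(CORPUS)
print(select_related(["color", "weather"], "apple", STATS, threshold=0.0))
```

In this toy corpus "color" co-occurs with "apple" more often than chance would predict, so it passes a threshold of 0, while "weather" never co-occurs with "apple" and is filtered out.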
A matching paraphrase module 130, configured to paraphrase the polysemous word according to the polysemous word, the related words, and a domain knowledge graph.
In the embodiment, the relevance of the two words is identified through the large-scale corpus, and the relevant words of the polysemous words are extracted according to the relevance of the two words, so that the extraction of the relevant words is more accurate, and the paraphrase accuracy rate of the polysemous words is improved.
In another embodiment of the present invention, as shown in FIG. 6, a semantic disambiguation apparatus 100 incorporating knowledge-graphs, comprising:
the text word segmentation module 110 is configured to segment a target text and perform part-of-speech tagging on a word obtained after the segmentation.
A related word extracting module 120, configured to, when a polysemous word exists in the words, extract the polysemous word and a candidate word that meets a preset part of speech range from the words; and calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus; and selecting candidate words with point mutual information larger than a preset threshold value as related words of the polysemous words.
The related word extraction module calculates point mutual information between two words according to the following formula:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
A matching paraphrase module 130, configured to paraphrase the polysemous word according to the polysemous word, the related words, and a domain knowledge graph.
The matching paraphrase module 130 includes a search matching unit 131 and a term paraphrase unit 132;
the search matching unit 131 is configured to obtain, according to the polysemous word, the entity corresponding to the polysemous word in a domain knowledge graph, and to traverse all triples of the entity to find the triple matching the related words;
a word paraphrase unit 132 for determining the paraphrase of the polysemous word based on the matching triple.
Specifically, when the domain knowledge graph is searched according to the polysemous word and its related words, the entities corresponding to the polysemous word are found first. For example, the polysemous word "apple" may correspond to three entities: "apple", "apple company" and "apple mobile phone". All triples of the three entities are found, and "apple" (the polysemous word) and the two related words "founder" and "Jobs" are matched against each triple found. The triple with the highest matching degree is selected as the matching triple, giving (apple company, founder, Jobs). From the meaning of the matching triple, the "apple" in the target text can be determined to refer to the Apple company, a company that designs and produces mobile phones.
For another example, for the text "the color of the apple is very red", the polysemous word is "apple" and the related words are "color" and "very red". The polysemous word "apple" corresponds to three entities in the domain knowledge graph: "apple", "apple company" and "apple phone". All triples of the three entities are found, and "apple" (the polysemous word) and the two related words "color" and "very red" are matched against each triple found, giving the matching triple (apple, color, red). From the meaning of the matching triple, the "apple" in the target text can be determined to be a fruit.
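The search over candidate entities and their triples can be sketched as below. The text does not define how the matching degree is computed, so this sketch assumes a simple one: the number of related words that overlap (by substring) the relation or tail of a triple. The data in `ENTITY_INDEX` and `TRIPLES_BY_ENTITY` and the function names are hypothetical, and substring overlap is a deliberate simplification that a real matcher would refine.

```python
# Hypothetical lookup from a polysemous word to its candidate entities.
ENTITY_INDEX = {"apple": ["apple", "apple company", "apple phone"]}

# Hypothetical triples (head, relation or attribute, tail) for each entity.
TRIPLES_BY_ENTITY = {
    "apple": [("apple", "color", "red")],
    "apple company": [("apple company", "product", "mobile phone")],
    "apple phone": [("apple phone", "maker", "apple company")],
}

def match_degree(triple, related_words):
    """Count related words that overlap the relation or tail of the triple."""
    _, relation, tail = triple
    return sum(
        1
        for word in related_words
        if any(word in element or element in word for element in (relation, tail))
    )

def disambiguate(polysemous, related_words):
    """Return (entity, triple) with the highest matching degree, or None."""
    best, best_score = None, 0
    for entity in ENTITY_INDEX.get(polysemous, []):
        for triple in TRIPLES_BY_ENTITY.get(entity, []):
            score = match_degree(triple, related_words)
            if score > best_score:
                best, best_score = (entity, triple), score
    return best

print(disambiguate("apple", ["color", "very red"]))
```

Here "color" matches the attribute and "very red" overlaps the tail "red", so the fruit entity wins; when no triple overlaps any related word the function returns `None`.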
In another embodiment of the present invention, as shown in FIG. 6, a semantic disambiguation apparatus 100 incorporating knowledge-graphs, comprising:
the text word segmentation module 110 is configured to segment a target text and perform part-of-speech tagging on a word obtained after the segmentation.
A related word extracting module 120, configured to, when a polysemous word exists in the words, extract the polysemous word and a candidate word that meets a preset part of speech range from the words; and calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus; and selecting candidate words with point mutual information larger than a preset threshold value as related words of the polysemous words.
The related word extraction module calculates point mutual information between two words according to the following formula:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
A matching paraphrase module 130 for paraphrasing the polysemous words according to the polysemous words, the related words, and a domain knowledge map;
the matching paraphrase module 130 includes a search matching unit 131 and a term paraphrase unit 132;
the search matching unit 131 is configured to obtain, according to the polysemous word, the entity corresponding to the polysemous word in a domain knowledge graph; traverse all triples of the entity and judge whether a triple matching the related words is found; if a matching triple is found, pass it to the word paraphrase unit for further processing; if no triple matching the related words is found, obtain near-synonyms of the related words from a preset lexicon; and traverse all triples of the entity to find a triple matching the near-synonyms of the related words.
A word paraphrase unit 132 for determining the paraphrase of the polysemous word based on the matching triple.
Specifically, when the related words are not accurate enough, a matching triple may not be found in the knowledge graph from the polysemous word and the related words; in that case, searching with near-synonyms of the related words can improve the success rate of finding a matching triple.
In this embodiment, when the search for a matching triple in the knowledge graph based on the polysemous word and the related words is unsuccessful, near-synonyms of the related words are used for the search, which improves the success rate of finding a matching triple.
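The fallback to near-synonyms can be sketched as a second pass over the same triples. The helper names, the exact-match rule and the sample lexicon `NEAR_SYNONYMS` are assumptions made only for illustration; the text does not specify how the preset lexicon is organized.

```python
def find_matching_triple(triples, words):
    """Return the first triple whose relation or tail equals one of the words."""
    for triple in triples:
        if any(word in triple[1:] for word in words):
            return triple
    return None

def find_with_synonym_fallback(triples, related_words, near_synonyms):
    """Search with the related words first, then with their near-synonyms."""
    match = find_matching_triple(triples, related_words)
    if match is not None:
        return match
    expanded = [s for word in related_words for s in near_synonyms.get(word, [])]
    return find_matching_triple(triples, expanded)

# Hypothetical triples of one entity and a hypothetical near-synonym lexicon.
ENTITY_TRIPLES = [("apple", "color", "red")]
NEAR_SYNONYMS = {"hue": ["color", "tint"]}

print(find_with_synonym_fallback(ENTITY_TRIPLES, ["hue"], NEAR_SYNONYMS))
```

The related word "hue" does not appear in any triple, so the first pass fails; its near-synonym "color" then matches the attribute of (apple, color, red).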
In one embodiment of the present invention, as shown in fig. 7, an intelligent learning apparatus 200 includes a memory 210 and a processor 220. The memory 210 is used to store a computer program 230. The processor, when running the computer program, implements a semantic disambiguation method incorporating a knowledge-graph as described above.
As an example, the processor realizes steps S100 to S300 according to the foregoing description when executing a computer program.
The processor, when executing the computer program, implements the functions of the modules and units in the semantic disambiguation apparatus 100 described above. As yet another example, the processor, when executing the computer program, implements the functionality of the text segmentation module 110, the related word extraction module 120, and the matching paraphrasing module 130.
Alternatively, the computer program may be divided into one or more modules/units according to the particular needs to accomplish the invention. Each module/unit may be a series of computer program instruction segments capable of performing a particular function. The computer program instruction segment is used for describing the execution process of the computer program in the semantic disambiguation apparatus 100. By way of example, the computer program may be partitioned into modules/units in a virtual device, such as a text segmentation module, a related word extraction module, and a matching paraphrase module. Correspondingly, the text word segmentation module is used for segmenting the target text and performing part-of-speech tagging on the words obtained after segmentation; the related word extracting module is used for extracting polysemous words and related words corresponding to the polysemous words from the words when the polysemous words exist in the words; and the matched paraphrasing module is used for paraphrasing the polysemous words according to the polysemous words, the related words and the domain knowledge map.
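One possible way to partition the program into the three modules is sketched below, with each module passed in as a callable. The class name `SemanticDisambiguator`, the method `paraphrase` and the stub functions are hypothetical and stand in for real implementations of the text segmentation, related word extraction and matching paraphrase modules.

```python
from typing import Callable, List, Optional, Tuple

TaggedWords = List[Tuple[str, str]]

class SemanticDisambiguator:
    """Chains the three modules: segment and tag, extract related words, match."""

    def __init__(
        self,
        segment_and_tag: Callable[[str], TaggedWords],
        extract_related: Callable[[TaggedWords], Optional[Tuple[str, List[str]]]],
        match_paraphrase: Callable[[str, List[str]], Optional[str]],
    ):
        self.segment_and_tag = segment_and_tag
        self.extract_related = extract_related
        self.match_paraphrase = match_paraphrase

    def paraphrase(self, text: str) -> Optional[str]:
        tagged = self.segment_and_tag(text)
        found = self.extract_related(tagged)
        if found is None:  # no polysemous word in the text
            return None
        polysemous, related = found
        return self.match_paraphrase(polysemous, related)

# Stub modules used only to exercise the pipeline.
def demo_segment(text):
    return [("apple", "noun"), ("color", "noun"), ("very red", "adjective")]

def demo_extract(tagged):
    words = [word for word, _ in tagged]
    if "apple" not in words:
        return None
    return "apple", [word for word in words if word != "apple"]

def demo_match(polysemous, related):
    return "fruit" if "color" in related else None

pipeline = SemanticDisambiguator(demo_segment, demo_extract, demo_match)
print(pipeline.paraphrase("the color of the apple is very red"))
```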
The processor is configured to implement semantic disambiguation by executing the computer program. The processor may be a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), general purpose processor or other logic device, etc., as desired.
The memory may be any internal storage unit and/or external storage device capable of storing data and programs. For example, the memory may be a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. The memory is used to store computer programs, other programs and data for the semantic disambiguation apparatus 100. The memory may also be used to temporarily store data that has been output or is to be output.
The intelligent learning device 200 may be a home education machine, a tablet terminal, a desktop computer, a notebook, a palm computer, a mobile phone, etc. The intelligent learning device 200 may further include an input/output device, a display device, a network access device, a bus, and the like, as required. The intelligent learning device 200 may also be a single chip microcomputer or a computing device integrating a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
It will be understood by those skilled in the art that the above units and modules are divided by function only for convenience of illustration and description; according to application requirements, the units and modules may be further divided or combined, that is, the internal structure of the device/apparatus may be re-divided and recombined to implement the above functions. Each unit and module in the above embodiments may be a separate physical unit, or two or more units and modules may be integrated into one physical unit. The units and modules in the above embodiments may implement the corresponding functions by using hardware and/or software functional units. Direct coupling, indirect coupling or communication connection among the units, components and modules in the above embodiments may be realized through a bus or an interface; the coupling or connection between units or devices may be electrical, mechanical, or of another form. Accordingly, the specific names of the units and modules in the above embodiments are only for convenience of description and distinction, and are not intended to limit the scope of the present application.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A semantic disambiguation method in conjunction with knowledge-graphs, comprising:
performing word segmentation on the target text, and performing part-of-speech tagging on words obtained after word segmentation;
when a polysemous word exists in the words, extracting the polysemous word and a related word corresponding to the polysemous word from the words;
and paraphrasing the polysemous words according to the polysemous words, the related words and the domain knowledge map.
2. The method of claim 1, wherein the extracting the polysemous word and the related words corresponding to the polysemous word from the words comprises:
extracting polysemous words and candidate words which accord with a preset part of speech range from the words;
calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus;
and selecting candidate words with point mutual information larger than a preset threshold value as related words of the polysemous words.
3. The method for semantic disambiguation in conjunction with knowledge-graph according to claim 2, wherein the formula for computing the point mutual information is:
$$\mathrm{PMI}(W_1 = w_1, W_2 = w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$$
wherein PMI(W1 = w1, W2 = w2) denotes the point mutual information value of two words w1 and w2, P(w1, w2) denotes the number of times the two words w1 and w2 co-occur, P(w1) denotes the number of times w1 occurs, and P(w2) denotes the number of times w2 occurs.
4. The method of claim 1, wherein paraphrasing the polysemous word based on the polysemous word, the related words, and a domain knowledge graph comprises:
acquiring entities corresponding to the polysemous words on a domain knowledge map according to the polysemous words;
traversing all the triples of the entity, and finding the triples matched with the related words;
determining paraphrases of the polysemous words according to the matching triples.
5. The method of semantic disambiguation in conjunction with a knowledge-graph of claim 4, wherein said determining a paraphrase of said polysemous word based on the matching triples further comprises:
when all triples of the entity are traversed and no triples matched with the related words are found, obtaining the near-synonyms of the related words according to a preset word bank;
and traversing all the triples of the entity to find the triples matched with the near-synonyms of the related words.
6. A semantic disambiguation apparatus incorporating knowledge-graphs, comprising:
the text word segmentation module is used for segmenting a target text and performing part-of-speech tagging on words obtained after segmentation;
the related word extracting module is used for extracting polysemous words and related words corresponding to the polysemous words from the words when the polysemous words exist in the words;
and the matched paraphrasing module is used for paraphrasing the polysemous words according to the polysemous words, the related words and the domain knowledge map.
7. The apparatus of claim 6, wherein the semantic disambiguation apparatus comprises:
the related word extraction module is further used for extracting polysemous words and candidate words which accord with a preset part of speech range from the words; and calculating point mutual information between the candidate words and the polysemous words according to the large-scale corpus; and selecting candidate words with point mutual information larger than a preset threshold value as related words of the polysemous words.
8. The apparatus of claim 6, wherein the matching paraphrasing module comprises:
the search matching unit is used for acquiring an entity corresponding to the polysemous word on the domain knowledge map according to the polysemous word; traversing all the triples of the entity to find the triples matched with the related words;
and the word paraphrasing unit is used for determining the paraphrases of the polysemous words according to the matching triples.
9. The apparatus of claim 8, wherein the semantic disambiguation apparatus comprises:
the search matching unit is further used for obtaining the near-synonyms of the related words according to a preset word bank when all the triples of the entity are traversed and no triples matched with the related words are found; and traversing all the triples of the entity to find the triples matched with the near-synonyms of the related words.
10. An intelligent learning apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of semantic disambiguation in combination with a knowledge-graph according to any of the claims 1 to 5.
CN201910325404.0A 2019-04-22 2019-04-22 Semantic disambiguation method and device combined with knowledge graph and intelligent learning equipment Pending CN111859974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910325404.0A CN111859974A (en) 2019-04-22 2019-04-22 Semantic disambiguation method and device combined with knowledge graph and intelligent learning equipment

Publications (1)

Publication Number Publication Date
CN111859974A true CN111859974A (en) 2020-10-30

Family

ID=72952001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910325404.0A Pending CN111859974A (en) 2019-04-22 2019-04-22 Semantic disambiguation method and device combined with knowledge graph and intelligent learning equipment

Country Status (1)

Country Link
CN (1) CN111859974A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination