KR101671892B1

KR101671892B1 - Entity's uri recognition system in text based on uri's definition statement and additional information, estimate for topic dispersion and method for selecting uri

Info

Publication number: KR101671892B1
Application number: KR1020150142458A
Authority: KR
Inventors: 최기선; 김지성
Original assignee: Korea Advanced Institute of Science and Technology (한국과학기술원)
Priority date: 2015-01-20
Filing date: 2015-10-12
Publication date: 2016-11-02
Also published as: KR20160089847A

Abstract

An apparatus for identifying the URI of an entity in text based on URI definition statements and additional information, a topic distribution estimation method, and a URI selection method are disclosed. The apparatus includes a topic distribution estimating unit that generates a topic distribution for each individual URI document from the topic distribution estimated for a set of URI documents composed of definitions and additional information for all URIs (Uniform Resource Identifiers), and a URI selection unit that, when a query text including an entity surface form to be identified is input, extracts the entity surface form from the query text and selects a URI corresponding to the entity surface form based on the topic distribution estimated for the query text.

Description

The present invention relates to an apparatus for identifying the URI of an entity in text based on URI definition statements and additional information, a topic distribution estimation method, and a URI selection method.

With the recent development of Semantic Web technology, research on and commercialization of entity linking have become active. Entity linking connects existing web documents written in natural language to the vast resources of LOD (Linked Open Data), the data of the Semantic Web. Existing entity linking systems use a thesaurus such as WordNet, or measure the similarity between the context of a surface form and the context of a resource.

Given an arbitrary sentence, the tasks of identifying a word or surface form that can be an entity (identifying the entity range) and of selecting one of the several entities that can be linked to that surface form (rendered elsewhere in this text as "surface type") are already well known and well studied. A typical example is DBpedia Spotlight, which identifies URIs (Uniform Resource Identifiers) for English text.

In entity recognition, entities can be defined in various ways. An entity may be a person, an animal, or a place name, or it may be the URI of a web page or other resource on the Internet. The definition depends on who sets out to solve the entity recognition problem. URI disambiguation can be regarded as a special case of named entity disambiguation.

One of the difficulties of entity recognition arises from the ambiguity of words. An identified surface form can have various meanings depending on the context of the text that contains it, which means that not one but several entities may be linkable to a single surface form. Selecting the entity best suited to a given surface form in light of the context is the entity disambiguation problem. When the entities are resources identified by URIs, such as web pages, it is called the URI disambiguation problem. The task is to select the actual URI when several URIs can be linked to a specific word in a given document.

Korean Patent Registration No. 10-0882582 discloses prior art related to URI disambiguation. However, that technique concerns a device for building citation relations based on URIs, and it does not describe URI identification for a given entity.

In addition, the paper [Mendes, Pablo N., et al. "DBpedia spotlight: shedding light on the web of documents." Proceedings of the 7th International Conference on Semantic Systems. ACM, 2011] describes a method for determining the URI corresponding to an entity string (word). It assumes that the meaning of a word is determined by its context and that the corresponding URI can be determined accordingly. The match between the context of the entity string and a URI resource is judged by the similarity between vectors of the words appearing in each.

Each vector has one component per word, whose value is the importance of that word in the entity string context (or URI resource) that the vector represents, and the similarity between these vectors is calculated.

However, this approach requires the words in the vectors representing a context and a resource to coincide. A thesaurus or a stemmer can be used to measure similarity between non-identical words, but there is a limit to measuring similarity between words that do not match.

In particular, when URIs are used as semantic identifiers, each URI resource frequently contains words used only in a specific domain, so two resources whose contexts are in fact similar may be erroneously measured as dissimilar.

Therefore, the technical problem to be solved by the present invention is to provide an apparatus for identifying the URI of an entity in text based on URI definition statements and additional information, a topic distribution estimation method, and a URI selection method. These use a topic distribution (topic model) generated from given documents, so that resources are compared by their domain rather than by the literal strings of the words appearing in each resource.

According to one aspect of the present invention, a URI identification apparatus includes a topic distribution estimating unit that generates a topic distribution for each individual URI document from the topic distribution estimated for a set of URI documents composed of definitions and additional information for all URIs (Uniform Resource Identifiers), and a URI selection unit that, when a query text including an entity surface form to be identified is input, extracts the entity surface form from the query text and, based on the topic distribution estimated for the query text, selects a URI corresponding to the entity surface form.

The URI selection unit may measure the similarity between the topic distribution estimated for the query text and the topic distribution of each individual URI document, and may select the URI for the entity surface form in the query text from among the URIs of the individual URI documents according to the measured similarity.

The topic distribution estimating unit may extract words from the URI document set, convert the URI document set into individual URI documents each composed of a set of nouns, extract a topic model by applying a machine learning algorithm to the individual URI documents, and estimate the topic distribution of each individual URI document based on the topic model.

The topic distribution estimating unit may extract the topic model by applying a Latent Dirichlet Allocation (LDA) algorithm.

The topic distribution estimating unit may estimate the topic distribution of the query text using the topic model, and the URI selection unit may measure the similarity between the topic distribution of the query text and the topic distribution of each individual URI document.

According to another aspect of the present invention, there is provided a method of estimating a topic distribution, comprising: collecting and storing a set of URI documents composed of definitions and additional information for all URIs (Uniform Resource Identifiers); and generating a topic distribution for each individual URI document from the topic distribution estimated for the set of URI documents.

The generating may comprise extracting words from the set of URI documents by morphological analysis, extracting a topic model by applying a machine learning algorithm to individual URI documents composed of the extracted words, and estimating the topic distribution of each individual URI document based on the topic model.

In the extracting, the topic model may be extracted by applying an LDA (Latent Dirichlet Allocation) algorithm to the individual URI documents.

According to another aspect of the present invention, there is provided a method for selecting a URI, comprising: receiving, by a computing-based URI (Uniform Resource Identifier) identification device, a query text including an entity surface form to be identified; extracting the entity surface form from the query text; estimating a topic distribution of the query text; and selecting a URI corresponding to the entity surface form based on the topic distribution of the query text.

In the estimating, the topic distribution may be estimated by applying a machine learning algorithm including an LDA (Latent Dirichlet Allocation) algorithm.

In the selecting, the similarity between the topic distribution of the query text and the topic distribution of each individual URI document may be measured, and a URI corresponding to the entity surface form may be selected from the individual URI documents according to the similarity measurement result.

According to an embodiment of the present invention, URI estimation for each surface form (character string) becomes applicable even when related entities do not appear in the text containing the string whose URI is to be identified, as in social network posts (e.g., Twitter) or short messages (e.g., KakaoTalk). The approach applies regardless of language.

In addition, related advertisements can be assigned to each string, which can improve the accuracy of advertising for each string.

In addition, big data today commonly consists of web pages. Because knowledge extraction from big data rests on identifying the meaning of each string, the invention is expected to serve as a basis for knowledge extraction.

FIG. 1 is a schematic block diagram of an apparatus for identifying a URI of an entity in a URI definition statement and an additional information-based text according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the topic distribution estimating unit of FIG. 1.
FIG. 3 is a flowchart showing the operation of the URI selection unit of FIG. 1.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the present invention. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. In order to clearly illustrate the present invention, parts not related to the description are omitted, and similar parts are denoted by like reference characters throughout the specification.

Throughout the specification, when a part is referred to as "comprising" an element, this means that it may further include other elements, rather than excluding them, unless specifically stated otherwise.

Also, terms such as "... unit" and "... module" in the description mean units that process at least one function or operation, which may be implemented by hardware, software, or a combination of hardware and software.

Hereinafter, with reference to the drawings, an apparatus for identifying the URI of an entity in text based on URI (Uniform Resource Identifier) definition statements and additional information, a topic distribution estimation method, and a URI selection method according to an embodiment of the present invention will be described in detail.

Herein, the apparatus for identifying a URI (hereinafter referred to as a "URI identification device") identifies the URI for the surface string of an entity. That is, given several URIs that can be linked to a particular word in a given document, it selects the actual URI.

The URI identification device selects, from all URI candidates that can be associated with an identified surface form, the one URI best suited to it and assigns that URI to the surface form. Given recent developments in Semantic Web technology, this is useful for linking text to the unique URIs of resources in the LOD (Linked Open Data) cloud and for resolving their ambiguity.

FIG. 1 is a schematic block diagram of an apparatus for identifying a URI of an entity in a URI definition statement and an additional information-based text according to an embodiment of the present invention.

Referring to FIG. 1, the URI identification apparatus includes an input unit 100, a URI identification unit including a topic distribution estimation unit 200 and a URI selection unit 300, and an output unit 400. Here, the URI identification device may be implemented as a computing-based device. Such a computing-based device may be a mobile device or a server device, but is not limited thereto.

The input unit 100 receives an arbitrary document. The arbitrary document is a paper document or an online searchable or readable document, and includes a URI document set and a query text.

The topic distribution estimator 200 generates a topic distribution of individual URI documents from the estimated topic distribution for a set of URI documents composed of definitions and additional information for all URIs (Uniform Resource Identifiers).

At this time, the URI may include the URI of Korean Wikipedia documents.

In the case of redirects, some Wikipedia URIs have no document of their own and are automatically redirected to another URI. Such a URI is therefore treated as the URI of the document reached after redirection. In the case of homonyms (disambiguation pages), some Wikipedia URI documents do not refer to a particular entity but only list links to entities that share the same word. Because these URIs do not point to a specific document, they are not treated as URIs.
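The handling of redirects and homonym pages described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `resolve_uri`, the `redirects` mapping, the `disambiguation_pages` set, the `max_hops` limit, and the sample URIs are all hypothetical.

```python
from typing import Dict, Optional, Set


def resolve_uri(uri: str,
                redirects: Dict[str, str],
                disambiguation_pages: Set[str],
                max_hops: int = 10) -> Optional[str]:
    """Return the canonical URI for `uri`, or None if it should be ignored."""
    seen = set()
    # Follow redirects until reaching a URI that has its own document.
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)
        uri = redirects[uri]
    # A redirect loop or an over-long chain never reaches a real document.
    if uri in redirects:
        return None
    # Homonym pages only list links, so they are not treated as URIs.
    if uri in disambiguation_pages:
        return None
    return uri


# Hypothetical sample data, for illustration only.
redirects = {"ko:Apple_Computer": "ko:Apple_Inc."}
disambiguation_pages = {"ko:Apple_(disambiguation)"}
```

With this sample data, a redirecting URI resolves to the document it redirects to, and the homonym page is dropped from the URI set.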

The topic distribution estimator 200 extracts words from the URI document set and converts the URI document set into individual URI documents, each composed of a set of nouns. It extracts a topic model by applying a machine learning algorithm for topic extraction to the individual URI documents, and estimates the topic distribution of each individual URI document based on the extracted topic model.

The topic distribution estimating unit 200 estimates the topic distribution of the query text using the topic model, and transmits the estimated topic distribution to the URI selecting unit 300.

The topic distribution estimating unit 200 may apply the Latent Dirichlet Allocation (LDA) algorithm, which is a machine learning method, in extracting the topic model.

Here, the LDA algorithm is an algorithm for estimating a topic distribution of a document using a word distribution existing in an individual URI document. The estimated topic distribution is compared and used to calculate the similarity between the surface type context and the candidate resource URI.

LDA is a generative probabilistic model for modeling discrete data such as word sets. It is often used for topic analysis of several documents in the field of information retrieval. It is assumed that a document in the LDA is composed of a mixture of several topics and has a probability distribution about them.

It is also assumed that a topic consists of a mixture of words and has a probability distribution about it.

Given training documents and the number of topics, LDA estimates the topic distribution of each document and the word distribution of each topic by analyzing the word frequencies in the documents. Using the topic model obtained after estimation, the topic distribution of new data that was not part of the training data can also be estimated.
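The assumptions described above, that a document is a mixture of topics and a topic is a mixture of words, are commonly written as the following generative process. The symbols are the conventional ones from the LDA literature and do not appear in the source: $\alpha$ and $\beta$ are Dirichlet hyperparameters, $\theta_d$ is the topic distribution of document $d$, and $\phi_k$ is the word distribution of topic $k$.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic distribution of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word of } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```

Training inverts this process: only the words $w_{d,n}$ are observed, and $\theta_d$ and $\phi_k$ are estimated from them.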

The URI selection unit 300 extracts the object surface type from the query text when the query text including the object surface type that is the object of URI identification is input. Then, based on the topic distribution of the query text transmitted from the topic distribution estimating unit 200, a URI corresponding to the entity surface type is selected.

The URI selection unit 300 measures the similarity between the topic distribution estimated for the query text and the topic distribution of each individual URI document, and selects the URI for the entity surface form in the query text from among the URIs of the individual URI documents according to the measured similarity.

Here, each individual URI document is a candidate URI resource, and the URI selection unit 300 uses the topic model to calculate the similarity between the context of the surface form word in the query text and the context of the candidate URI resource. The topic model serves to determine whether the two contexts concern the same topic.

Here, a method of calculating the similarity of the LDA-based document by the URI selector 300 is as follows.

The topic distribution of a document obtained through LDA represents that document. To obtain the similarity between documents, the similarity between the topic probability distributions representing them can be computed. KL divergence (Kullback-Leibler divergence) can be used to measure the similarity between probability distributions. The similarity between two discrete probability distributions P and Q can be obtained by the following equation (1).

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{1}$$

Here, $D_{KL}(P \,\|\, Q)$ denotes the Kullback-Leibler divergence, which is the difference that arises when the probability distribution Q is used as an approximation in place of the probability distribution P; it measures the amount of information lost by that approximation. $i$ is the random variable, and $P(i)$ and $Q(i)$ are its probabilities under the two distributions. P denotes the probability distribution of a given model document, Q denotes the probability distribution of the query text, and the comparison is made for each document.

However, because this function is not symmetric in the order of its arguments, a symmetric KL divergence can be used instead.
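The source's formula for the symmetric variant did not survive extraction. As a sketch, the code below assumes one common symmetrized form, $D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)$, and the patent may use a different one. The function names and the `eps` floor that guards against division by zero are illustrative choices.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), in nats."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with P(i) = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed symmetrized form: D_KL(P || Q) + D_KL(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

Swapping the arguments of `kl_divergence` generally changes the result, whereas `symmetric_kl` gives the same value in either order, which is the property the text asks for.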

According to one embodiment, when given an entity surface form and the URI (Uniform Resource Identifier) candidates of the various resources that can be connected to it, the URI selection unit 300 selects the one URI that is most related to the entity surface form.

For example, when DBpedia (http://dbpedia.org) URI resources are used and the surface string 'apple' is given, the appropriate URI must be chosen by determining whether the text containing the string is more similar to the Wikipedia page describing the fruit apple or to the Wikipedia page describing the company Apple. If the surface form 'apple' appears in a document about the IT company 'IBM', the URI to select is not that of the fruit (http://dbpedia.org/page/Apple) but that of the company (http://dbpedia.org/page/Apple_Inc.).

URI selection is similar to the WSD (Word Sense Disambiguation) problem in that one candidate is selected for a surface form from its corresponding URI candidates. In terms of the WSD problem, the surface form can be regarded as a homonym, and the URI candidates correspond to the senses of that homonym.

The meaning of the homonym can be determined with high probability by the context of surrounding sentences including the word, and the following can be established.

1. Homonyms can be surface types.

2. Multiple meanings belonging to homonyms can be thought of as URI candidates when selecting URI.

3. The meaning of the homonym w included in a sentence A agrees with the meaning of the homonym w in the sentence B with a similar context.

4. In view of items 1, 2, and 3, the URI of a surface type s contained in a document D is likely to be the same as the URI of a surface type s contained in a document E with a similar context.

Items 1, 2, and 4 in the above list hold because each resource corresponding to a DBpedia URI corresponds to a Wikipedia document, which generally contains the meaning or an explanation of a word. In other words, if Wikipedia is regarded as a kind of dictionary, each surface form can be given one of the several possible interpretations (documents) in that dictionary. Therefore, a URI can be selected using the similarity between the context of the document containing the surface form and the contexts of the Wikipedia documents linked to the URIs associated with that surface form.

The output unit 400 outputs the URI of the object surface type in the query text selected by the URI selection unit 300.

For any text, such a URI identification device can designate one link that describes the meaning of each word, or of any string whose range has been delimited, so that the link can be followed with a click. In addition, it can automatically assign or extend the topics, classifications, and tags that indicate the meaning of each character string in the text of web newspapers and social networks, whose volume keeps increasing. For example, it can specify what a Twitter hashtag is associated with.

In addition, the URI identification device can be used for big data analysis, for semantic estimation systems for social networks, and for automatically attaching link information to text in the web editor market.

FIG. 2 is a flowchart showing the operation of the topic distribution estimating unit of FIG. 1.

Referring to FIG. 2, the topic distribution estimator 200 of the computing-based URI identifying apparatus collects and stores a URI document set (S101).

The topic distribution estimating unit 200 performs morphological analysis on all the URI documents in the URI document set and extracts the words that are nouns (S103).

At this time, in step S103, all documents are morpheme analyzed, and nouns are extracted by considering prefixes, suffixes, and compound nouns. Each URI document is then transformed from a complete set of statements into a set of nouns. That is, all URI documents are converted into respective individual URI documents composed of extracted words.
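The conversion in step S103 can be sketched as a filter over part-of-speech tagged tokens. The source does not name a morphological analyzer, so this sketch assumes the tokens have already been tagged by some external tool and that noun tags begin with "NN" (for example NNG and NNP); both the tag convention and the function name `to_noun_document` are assumptions. The handling of prefixes, suffixes, and compound nouns mentioned above is left out.

```python
from typing import Iterable, List, Tuple

# Assumption: the external analyzer marks nouns with tags starting with "NN".
NOUN_TAG_PREFIX = "NN"


def to_noun_document(tagged_tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep only the tokens whose part-of-speech tag marks them as nouns."""
    return [word for word, tag in tagged_tokens
            if tag.startswith(NOUN_TAG_PREFIX)]


# Hypothetical tagged sentence ("An apple is a fruit"), for illustration only.
tagged = [("사과", "NNG"), ("는", "JX"), ("과일", "NNG"), ("이", "VCP"), ("다", "EF")]
nouns = to_noun_document(tagged)
```

Each URI document reduced this way becomes the bag of nouns that the later topic modeling steps take as input.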

The topic distribution estimating unit 200 generates a topic model by training all the documents composed of extracted words with the LDA algorithm (S105). That is, a topic model for all sets of URI documents collected in step S101 is generated.

Here, the LDA generates a topic model reflecting the co-occurrences of words. Words often appearing in the same document are likely to be tied to the same topic.

However, other machine learning algorithms besides the LDA algorithm may be used, and thus are not limited to the LDA algorithm.

The topic distribution estimating unit 200 estimates a topic distribution for each individual URI document based on the topic model generated in step S105 (S107). That is, a topic distribution is generated for each URI document.
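Steps S105 and S107 can be sketched with a small collapsed Gibbs sampler, one common way to fit LDA. The source does not say which inference procedure is used, so the sampler, the function name `train_lda`, the hyperparameter defaults, and the encoding of each document as a list of integer word ids are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def train_lda(docs: List[List[int]], vocab_size: int, num_topics: int,
              alpha: float = 0.1, beta: float = 0.01,
              iterations: int = 200, seed: int = 0
              ) -> Tuple[List[List[float]], List[List[float]]]:
    """Fit LDA by collapsed Gibbs sampling.

    Returns (theta, phi): the topic distribution of each document and
    the word distribution of each topic.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assignments = []
    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional weight of each topic for this occurrence.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # S107: smoothed per-document topic distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha)
              for c in doc_topic[d]] for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (topic_total[k] + vocab_size * beta)
            for c in topic_word[k]] for k in range(num_topics)]
    return theta, phi
```

Because words that co-occur in the same documents push each other toward the same topic through the `doc_topic` counts, the sampler reflects the co-occurrence behaviour described for S105. A production system would more likely rely on an established LDA library than on a hand-written sampler.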

FIG. 3 is a flowchart showing the operation of the URI selection unit of FIG. 1.

Referring to FIG. 3, the URI selecting unit 300 of the computing-based URI identifying apparatus receives the query text (S201).

The URI selecting unit 300 extracts the surface type of the object, which is the URI identification target, from the query text (S203).

The URI selection unit 300 measures the similarity between the topic distribution of the query text estimated by the topic distribution estimation unit 200 (S205) and the topic distribution of the individual URI documents (S207).

Here, the topic distribution estimating unit 200 estimates the topic distribution of the query text using the topic model generated in step S105 of FIG. 2 (S205).
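Step S205 can be sketched as a fold-in: the word distributions of the topics are held fixed and only the topic mixture of the query text is estimated. The EM-style update below is a simplified stand-in chosen for brevity, not a procedure taken from the source. The name `infer_topic_distribution`, the `phi` argument (topic-by-word probabilities assumed to come from an already trained topic model), and the smoothing value `alpha` are assumptions.

```python
from typing import List


def infer_topic_distribution(query_words: List[int], phi: List[List[float]],
                             alpha: float = 0.1,
                             iterations: int = 50) -> List[float]:
    """Estimate the topic distribution of new text with `phi` held fixed."""
    num_topics = len(phi)
    theta = [1.0 / num_topics] * num_topics
    for _ in range(iterations):
        counts = [0.0] * num_topics
        for w in query_words:
            # Responsibility of each topic for this word occurrence.
            r = [theta[k] * phi[k][w] for k in range(num_topics)]
            total = sum(r)
            if total == 0.0:
                continue  # word has zero probability under every topic
            for k in range(num_topics):
                counts[k] += r[k] / total
        # Re-estimate the mixture with Dirichlet-style smoothing.
        norm = sum(counts) + num_topics * alpha
        theta = [(counts[k] + alpha) / norm for k in range(num_topics)]
    return theta
```

An empty query yields the uniform distribution, and a query whose words all belong to one topic yields a distribution concentrated on that topic.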

In step S209, the URI selection unit 300 selects a URI for the entity surface form extracted in step S203 according to the degree of similarity measured in step S207. That is, the URI tagged in the individual URI document with the highest degree of similarity is selected as the URI for the entity surface form.

In other words, the URI selection unit 300 measures the similarity between the topic distribution of the query text estimated in step S205 and the topic distribution of each individual URI document estimated in step S107 of FIG. 2. From the individual URI document with the highest degree of similarity, the URI for the entity surface form extracted in step S203 is selected.
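Steps S207 and S209 can be sketched as choosing the candidate whose topic distribution has the smallest divergence from that of the query text. The sketch assumes a symmetrized KL divergence of the form $D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)$. The function names and the two-topic distributions are invented for illustration and are not data from the source; only the two DBpedia URIs come from the text.

```python
import math
from typing import Dict, List


def symmetric_kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """D_KL(P || Q) + D_KL(Q || P), with a small floor to avoid log(0)."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)
        # The sum of both directions simplifies to (p - q) * log(p / q).
        total += (pi - qi) * math.log(pi / qi)
    return total


def select_uri(query_topics: List[float],
               candidates: Dict[str, List[float]]) -> str:
    """Return the candidate URI whose document is closest in topic space."""
    if not candidates:
        raise ValueError("no candidate URIs for this surface form")
    return min(candidates,
               key=lambda uri: symmetric_kl(query_topics, candidates[uri]))


# Hypothetical topic distributions over two topics (food, technology).
candidates = {
    "http://dbpedia.org/page/Apple": [0.90, 0.10],
    "http://dbpedia.org/page/Apple_Inc.": [0.15, 0.85],
}
query_topics = [0.20, 0.80]  # e.g. a text that also mentions IBM
selected = select_uri(query_topics, candidates)
```

A lower divergence means a higher similarity, so taking the minimum here corresponds to selecting the document with the highest degree of similarity. With these made-up numbers the company URI is selected.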

The embodiments of the present invention described above are not implemented only by the apparatus and method, but may be implemented through a program for realizing the function corresponding to the configuration of the embodiment of the present invention or a recording medium on which the program is recorded.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments, and that modifications and improvements made by those skilled in the art also belong to the scope of rights of the invention.

Claims

1. A URI identification device comprising:
a topic distribution estimating unit that generates a topic distribution for each individual URI document from the topic distribution estimated for a set of URI documents composed of definitions and additional information for all URIs (Uniform Resource Identifiers); and
a URI selection unit that, when a query text including an entity surface form to be identified is input, extracts the entity surface form from the query text and, based on a topic distribution estimated for the query text, selects a URI corresponding to the entity surface form.

2. The URI identification device of claim 1,
wherein the URI selection unit measures the similarity between the topic distribution estimated for the query text and the topic distribution of each individual URI document, and selects the URI for the entity surface form in the query text from among the URIs of the individual URI documents according to the measured similarity.

3. The URI identification device of claim 2,
wherein the topic distribution estimating unit extracts words from the URI document set, converts the URI document set into individual URI documents each composed of a set of nouns, extracts a topic model by applying a machine learning algorithm to the individual URI documents, and estimates the topic distribution of each individual URI document based on the topic model.

4. The URI identification device of claim 3,
wherein the topic distribution estimating unit extracts the topic model by applying a Latent Dirichlet Allocation (LDA) algorithm.

5. The URI identification device of claim 4,
wherein the topic distribution estimating unit estimates the topic distribution of the query text using the topic model, and
the URI selection unit measures the similarity between the topic distribution of the query text and the topic distribution of each of the individual URI documents.

6. A topic distribution estimation method comprising:
collecting and storing, by a computing-based URI (Uniform Resource Identifier) identification device, a set of URI documents composed of definitions and additional information for all URIs; and
generating a topic distribution for each individual URI document from the topic distribution estimated for the set of URI documents.

7. The method of claim 6,
wherein the generating comprises:
extracting words from the set of URI documents through morphological analysis;
extracting a topic model by applying a machine learning algorithm to individual URI documents composed of the extracted words; and
estimating a topic distribution for each individual URI document based on the topic model.

8. The method of claim 7,
wherein the extracting of the topic model comprises applying an LDA (Latent Dirichlet Allocation) algorithm to the individual URI documents.

9. A URI selection method comprising:
receiving, by a computing-based URI (Uniform Resource Identifier) identification device, a query text including an entity surface form to be identified;
extracting the entity surface form from the query text;
estimating a topic distribution of the query text; and
selecting a URI corresponding to the entity surface form based on the topic distribution of the query text.

10. The method of claim 9,
wherein the estimating comprises estimating the topic distribution by applying a machine learning algorithm including an LDA (Latent Dirichlet Allocation) algorithm.

11. The method of claim 9,
wherein the selecting comprises measuring the similarity between the topic distribution of the query text and the topic distribution of each individual URI document, and selecting a URI corresponding to the entity surface form from the individual URI documents according to the result of the similarity measurement.