JP6499477B2

JP6499477B2 - Ontology generation device, metadata output device, content acquisition device, ontology generation method, and ontology generation program

Info

Publication number: JP6499477B2
Application number: JP2015038206A
Authority: JP
Inventors: 真浦川; 宮崎　勝
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2015-02-27
Filing date: 2015-02-27
Publication date: 2019-04-10
Anticipated expiration: 2035-02-27
Also published as: JP2016162054A

Description

The present invention relates to an ontology generation device, method, and program.

With the spread of video distribution services, video content has become more important; it is being used in a growing range of fields, and the amount of content is increasing.
It is therefore desirable to attach metadata to content that makes clear what the content is. For example, by attaching metadata such as “This content is a video of organism A spawning,” a content holder can present the substance of the video content to users as text. Further, by defining “organism A” and “spawning” separately, the content holder can present in association not only other content about “organism A” but also videos of multiple organisms related to “spawning.”
Patent Document 1 proposes a method that uses the title assigned to video content as a search keyword to search a specific community site, extracts information from the resulting Web page, and attaches it to the video content as metadata.

Patent Document 1: JP 2008-4080 A

A person entering metadata has to do so while imagining the scenes or services in which it will be used. For example, one person may enter only “organism A,” while another enters “spawning of organism A.” The definition of metadata also varies between people even when the content is similar: one person may enter “spawning” for “organism A,” while another enters “reproduction.”
The method of Patent Document 1 finds a single Web page on a community site that describes the video content and extracts information from that page. Because the structure of the attached metadata depends on a single Web page, the definition of the metadata may differ from one piece of video content to another.

An object of the present invention is to provide an ontology generation device, method, and program capable of generating knowledge structure data that defines a metadata scheme usable in common across multiple pieces of content.

An ontology generation device according to the present invention includes: a headline extraction unit that extracts hierarchical information of headwords from a plurality of pieces of document information in a specified field; a word extraction unit that extracts, from the document information, words associated with the headwords; and a structuring unit that integrates the headwords based on the similarity of the words and generates knowledge structure data for the field that includes the hierarchical information of the integrated headwords.

The structuring unit may generate knowledge structure data in which the headwords and the words are associated with each other.

The structuring unit may generate the knowledge structure data so that it includes a degree of association between each headword and each word.

A metadata output device according to the present invention includes an output unit that matches text data about content against the words included in the knowledge structure data generated by the ontology generation device, extracts from the knowledge structure data the hierarchical information of the headword associated with the matched word, and outputs it as metadata of the content.

The metadata output device may include a dictionary acquisition unit that acquires synonyms of the words based on dictionary data, and the output unit may extract the hierarchical information of the headword associated with a word by matching the text data against that word or its synonyms.

A content acquisition device according to the present invention includes a first content acquisition unit that acquires, from a predetermined database, content to which the same metadata as the metadata output by the metadata output device has been attached.

The content acquisition device may include a second content acquisition unit that, based on the hierarchical information of the metadata, extracts from the knowledge structure data other hierarchical information that shares the same upper level, and acquires from the predetermined database content to which metadata corresponding to that other hierarchical information has been attached.

In an ontology generation method according to the present invention, a control unit of a computer executes: a headline extraction step of extracting hierarchical information of headwords from a plurality of pieces of document information in a specified field; a word extraction step of extracting, from the document information, words associated with the headwords; and a structuring step of integrating the headwords based on the similarity of the words and generating knowledge structure data for the field that includes the hierarchical information of the integrated headwords.

An ontology generation program according to the present invention causes a control unit of a computer to execute: a headline extraction step of extracting hierarchical information of headwords from a plurality of pieces of document information in a specified field; a word extraction step of extracting, from the document information, words associated with the headwords; and a structuring step of integrating the headwords based on the similarity of the words and generating knowledge structure data for the field that includes the hierarchical information of the integrated headwords.

According to the present invention, knowledge structure data that defines a metadata scheme usable in common across multiple pieces of content can be generated.

FIG. 1 is a diagram showing the functional configuration of the management server according to the first embodiment.
FIG. 2 is a diagram showing an example of the ontology generation process according to the first embodiment.
FIG. 3 is a diagram showing an example of the ontology according to the first embodiment.
FIG. 4 is a diagram showing an example of the metadata output method according to the first embodiment.
FIG. 5 is a flowchart showing an example of the metadata attachment process according to the first embodiment.
FIG. 6 is a diagram showing the functional configuration of the content server according to the second embodiment.
FIG. 7 is a diagram showing an example of the related-content acquisition process according to the second embodiment.

[First Embodiment]
The first embodiment of the present invention is described below.
The management server 1 according to this embodiment is an information processing device (computer) that a content holder, service provider, or the like accesses directly or via a network. The management server 1 functions both as an ontology generation device that builds an ontology, as knowledge structure data, from information on a community site, and as a metadata output device that attaches ontology-based metadata to content such as video or Web pages.

FIG. 1 is a diagram showing the functional configuration of the management server 1 according to this embodiment.
The management server 1 includes a target designation unit 11, a headline extraction unit 12, a word extraction unit 13, a structuring unit 14, a dictionary acquisition unit 15, a content designation unit 16, and an output unit 17.

The target designation unit 11 acquires the document information from which the knowledge structure of a given field (for example, “organisms”) is to be extracted. The document information is, for example, Web pages that each explain an individual headword on a community site such as Wikipedia (registered trademark). The target designation unit 11 either accepts the URL of a Web page as input and fetches the page data, or accepts document information directly from the user as a file such as a CSV file. When multiple extraction targets belonging to “organisms” are designated, the target designation unit 11 can acquire document information that captures the characteristics of “organisms” as a whole. The extraction target may be a single Web page if its text can be used as the knowledge structure as is.

The headline extraction unit 12 analyzes the headword structure of each piece of document information acquired by the target designation unit 11 and extracts the headwords and their hierarchical information from the document information.

The word extraction unit 13 extracts, from the document information, words associated with the headwords.
For example, the word extraction unit 13 may perform morphological analysis on the document information and, from the resulting words, extract those with high importance under a predetermined index based on usage frequency or the like (for example, TF-IDF) as the characteristic words used under each heading.
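As a rough illustration of TF-IDF-based selection of characteristic words, the following Python sketch treats the text under each headword as one document. It assumes that morphological analysis has already produced a non-empty token list per headword; the function name `top_tfidf_words`, the parameter `k`, and the sample tokens are hypothetical and are not taken from the embodiment.

```python
import math
from collections import Counter

def top_tfidf_words(sections, k=2):
    """Return the k highest-scoring TF-IDF words for each headword.

    sections maps a headword to its (non-empty) list of tokens.
    """
    n = len(sections)
    # Document frequency: the number of sections in which each word occurs.
    df = Counter(w for tokens in sections.values() for w in set(tokens))
    result = {}
    for head, tokens in sections.items():
        tf = Counter(tokens)
        scores = {w: (c / len(tokens)) * math.log(n / df[w]) for w, c in tf.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[head] = ranked[:k]
    return result

# Hypothetical token lists, as if produced by a morphological analyzer.
sections = {
    "feeding habits": ["eat", "food", "water", "eat"],
    "reproduction": ["egg", "lay", "water", "egg"],
    "distribution": ["pond", "lake", "water"],
}
features = top_tfidf_words(sections)
```

In this sample, “water” appears under every headword, so its inverse document frequency is zero and it is never selected as a characteristic word.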

The structuring unit 14 integrates headwords based on word similarity and generates, for each field, an ontology that includes the hierarchical information of the integrated headwords.
For example, in the field of organisms, individual documents are organized under headwords that are not standardized, such as “characteristics,” “habitat,” “distribution,” “life history,” and “reproduction.” By integrating headwords with high similarity, the structuring unit 14 generates a shared, systematized ontology.

In the ontology, headwords and words are associated with each other. The ontology also includes a degree of association between each headword and each word, based on, for example, the probability with which the same word appears under multiple headwords.
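One possible reading of a degree of association based on appearance probability is the share of a word's occurrences that fall under each headword. The Python sketch below computes that share; the formula, the function name `relevance`, and the sample counts are assumptions made for illustration, since the text does not specify the exact calculation.

```python
from collections import Counter, defaultdict

def relevance(sections):
    """Return {word: {headword: share of the word's occurrences under that headword}}."""
    counts = defaultdict(Counter)
    for head, tokens in sections.items():
        for w in tokens:
            counts[w][head] += 1
    return {
        w: {head: c / sum(by_head.values()) for head, c in by_head.items()}
        for w, by_head in counts.items()
    }

# Hypothetical token lists per headword.
sections = {
    "reproduction": ["egg", "egg", "egg", "back"],
    "feeding habits": ["egg", "eat"],
    "external": ["back", "back", "back"],
}
rel = relevance(sections)
```

With these sample counts, “egg” is associated with “reproduction” at 0.75 and with “feeding habits” at 0.25, so a word that appears under several headwords contributes to each in proportion to how often it is used there.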

FIG. 2 is a diagram showing an example of the ontology generation process according to this embodiment.
First, in response to the designated URLs, the target designation unit 11 acquires the text data of Web pages such as “daphnia,” “sea urchin,” and “dragonfly” as document information about organisms.

From the acquired text data, the headline extraction unit 12 extracts, as the hierarchical information of the headwords, structured data in which, for example, “internal” and “external” are placed under “morphology,” and “feeding habits,” “reproduction,” and “distribution” are placed under “ecology.”
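The extraction of a heading hierarchy can be sketched in Python as follows. The sketch assumes a wiki-style markup in which `== Title ==` marks a top-level heading and `=== Title ===` a heading one level below; that markup, the function name `extract_heading_paths`, and the sample page are assumptions for illustration, and real pages would need their own parser.

```python
import re

# Assumed markup: "== Title ==" is level 1, "=== Title ===" is level 2, and so on.
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")

def extract_heading_paths(text):
    """Return one tuple per heading, giving its path from the top-level heading."""
    stack = []   # current chain of ancestor headings
    paths = []
    for line in text.splitlines():
        m = HEADING.match(line)
        if not m:
            continue
        level = len(m.group(1)) - 1   # "==" -> 1, "===" -> 2
        del stack[level - 1:]         # drop headings at this depth or deeper
        stack.append(m.group(2))
        paths.append(tuple(stack))
    return paths

# Hypothetical page text using the assumed markup.
page = """\
== Morphology ==
Body text is ignored.
=== Internal ===
=== External ===
== Ecology ==
=== Feeding habits ===
=== Reproduction ===
=== Distribution ===
"""
paths = extract_heading_paths(page)
```

Each resulting tuple, such as `("Ecology", "Reproduction")`, is one entry of headword hierarchy information in the sense used above.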

The word extraction unit 13 extracts, from the range of text data corresponding to each headword, the words that characterize that headword. For example, words such as “eat” and “food” are extracted for the headword “feeding habits.”

Based on the similarity of the words under these headings, the structuring unit 14 merges different headwords into one and generates shared headword hierarchy information. For example, when the two headwords “internal form” and “internal structure” have been extracted, the words associated with them are likely to match or be similar, so the two headwords are merged into one.
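A minimal Python sketch of such merging is shown below. It uses Jaccard overlap between characteristic-word sets as the similarity measure and a fixed threshold of 0.5; both choices, along with the names `jaccard` and `merge_headwords` and the sample words, are assumptions for illustration, because the text does not fix a particular similarity measure.

```python
def jaccard(a, b):
    """Overlap of two word sets: |a & b| / |a | b| (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_headwords(features, threshold=0.5):
    """Greedily group headwords whose word sets overlap at or above the threshold.

    Returns (merged, alias): merged maps each representative headword to its
    combined word set, and alias maps every headword to its representative.
    """
    merged = {}
    alias = {}
    for head, words in features.items():
        for rep, rep_words in merged.items():
            if jaccard(words, rep_words) >= threshold:
                merged[rep] = rep_words | words   # fold the words into the group
                alias[head] = rep
                break
        else:
            merged[head] = set(words)             # no similar group: start a new one
            alias[head] = head
    return merged, alias

# Hypothetical characteristic words per headword.
features = {
    "internal form": {"organ", "heart", "gut", "muscle"},
    "internal structure": {"organ", "heart", "gut", "nerve"},
    "feeding habits": {"eat", "food", "prey"},
}
merged, alias = merge_headwords(features)
```

Here “internal form” and “internal structure” share three of five distinct words (overlap 0.6), so they end up under one representative, while “feeding habits” stays separate. The greedy pass depends on input order; a clustering method could be substituted where that matters.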

The dictionary acquisition unit 15 acquires synonyms of the words included in the ontology based on dictionary data.
For example, the dictionary acquisition unit 15 uses the dictionary data to attach a conceptual structure to the words of the ontology and thereby acquires synonyms that have a similar conceptual structure.

FIG. 3 is a diagram showing an example of an ontology with the conceptual structure added according to this embodiment.
This ontology holds the structure and the words needed to describe “organisms.” For example, a conceptual structure can be defined in which the concept “external” has “morphology” as its superordinate concept and has the verb “shrink” and the noun “arm” as instances. From such sets and the dictionary data, the “morphology-external” concept is defined as having nouns related to the “body,” so even when a related noun with the same conceptual structure as “arm,” such as “proboscis” or “leg,” appears, it can be classified under “morphology-external.”
The ontology is written in a description language such as OWL.
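The concept-based classification described for FIG. 3 can be approximated in Python with plain dictionaries instead of OWL. In this sketch the dictionary data is reduced to a hypothetical word-to-concept table (`CONCEPT_OF`), and the ontology fragment (`INSTANCES`) lists only the instances mentioned above plus one invented entry; none of these names or tables come from the embodiment.

```python
# Hypothetical dictionary data: word -> broader concept.
CONCEPT_OF = {"arm": "body", "leg": "body", "proboscis": "body", "pond": "place"}

# Hypothetical ontology fragment: hierarchy path -> instance words.
INSTANCES = {
    ("morphology", "external"): {"arm", "shrink"},
    ("ecology", "distribution"): {"pond"},
}

def concepts_by_class(instances, concept_of):
    """Attach to each class the concepts of the instances the dictionary knows."""
    return {
        path: {concept_of[w] for w in words if w in concept_of}
        for path, words in instances.items()
    }

def classify(word, instances, concept_of):
    """Return the classes that list the word itself or share its dictionary concept."""
    class_concepts = concepts_by_class(instances, concept_of)
    concept = concept_of.get(word)
    return sorted(
        path for path, words in instances.items()
        if word in words or (concept is not None and concept in class_concepts[path])
    )

leg_classes = classify("leg", INSTANCES, CONCEPT_OF)
unknown_classes = classify("galaxy", INSTANCES, CONCEPT_OF)
```

Because “leg” shares the concept “body” with the instance “arm,” it is classified under `("morphology", "external")` even though it never appeared in the source documents; a word unknown to both tables matches nothing.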

The content designation unit 16 acquires text data about the content to which metadata is to be attached.
The content is, for example, video content owned by a content holder. The content designation unit 16 either extracts the text information of the designated content or accepts direct input of the text to which metadata is to be attached.

The output unit 17 breaks the text data about the content into morphemes and matches these morphemes against the words in the ontology or their synonyms. Based on the matching result, it extracts from the ontology the hierarchical information of the headwords associated with words that are highly similar to the content's text data, and outputs it as metadata representing the substance of the content.
In doing so, the output unit 17 extracts the hierarchical information of the headwords that rank highest by a score calculated from the degree of association between each matched word and each headword.

FIG. 4 is a diagram showing an example of the score-based metadata output method according to this embodiment.
In this example, the words “back,” “raise,” “egg,” and “lay” have been extracted from the text data about the content. Matching them against the knowledge structure data yields a score for each candidate piece of metadata.

For example, a score corresponding to the degree of association is added to each of the headwords matched by “back”: “internal,” “external,” “feeding habits,” and “reproduction.” Scores are likewise added to the headwords matched by “raise,” “egg,” and “lay.” The headword with the highest total score, “reproduction,” is selected, and the hierarchical information “ecology-reproduction” is extracted as the metadata of the content.
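The score accumulation in this example can be sketched in Python as follows. The degrees of association are invented numbers chosen only to reproduce the outcome described for FIG. 4, and the function name `best_metadata` and the table layout are likewise assumptions.

```python
from collections import defaultdict

def best_metadata(tokens, relevance, hierarchy):
    """Sum per-headword association degrees over the tokens and pick the top headword.

    Returns (hierarchy path of the best headword, all scores), or (None, {}) if
    no token matches the ontology.
    """
    scores = defaultdict(float)
    for w in tokens:
        for head, degree in relevance.get(w, {}).items():
            scores[head] += degree
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return hierarchy[best], dict(scores)

# Illustrative degrees of association; the values are made up for this sketch.
relevance = {
    "back":  {"internal": 0.1, "external": 0.5, "feeding habits": 0.1, "reproduction": 0.3},
    "raise": {"reproduction": 0.7, "feeding habits": 0.3},
    "egg":   {"reproduction": 0.8, "feeding habits": 0.2},
    "lay":   {"reproduction": 1.0},
}
hierarchy = {
    "internal": ("morphology", "internal"),
    "external": ("morphology", "external"),
    "feeding habits": ("ecology", "feeding habits"),
    "reproduction": ("ecology", "reproduction"),
}
path, scores = best_metadata(["back", "raise", "egg", "lay"], relevance, hierarchy)
```

Although “back” on its own points most strongly to “external,” the accumulated score across all four words favors “reproduction,” so the path `("ecology", "reproduction")` is returned.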

FIG. 5 is a flowchart showing an example of the process by which the management server 1 according to this embodiment attaches metadata to content.
This example shows the processing when a content holder that wants to attach metadata to its own video content uses the management server 1.
The content holder has accumulated the text information needed to manage its video content, for example program information and caption information. Using the management server 1, a content operator obtains an ontology from a community site and attaches metadata to the video content.

In step S1, the target designation unit 11 accepts from the content operator the URLs of the Web pages to be extracted. Alternatively, the target designation unit 11 acquires the text data used to analyze the knowledge structure by importing a file such as a CSV file. For example, designating the URL of the community-site Web page that explains “daphnia” yields text data about “daphnia.” Likewise, designating the Web pages of organisms (animals) in the same category as “daphnia,” such as “sea urchin” and “dragonfly,” adds the data of those organisms to the extracted data.

In step S2, the headline extraction unit 12 extracts the hierarchical information of the headwords from the heading structure of the extracted data acquired from the community site.
In step S3, the word extraction unit 13 extracts, from the words used under each heading extracted in step S2, the group of words that characterizes the headword.

In step S4, the structuring unit 14 generates an ontology, which is the knowledge structure data, based on the headword hierarchy information extracted in step S2 and the characteristic words under each heading extracted in step S3.
The content operator can also manually correct the headword hierarchy information, the characteristic words under each heading, or the generated ontology.

In step S5, to have ontology-based metadata attached from the descriptive text of the video content (program information, video details, and so on), the content operator designates the URL of the video content to which metadata is to be attached. The content designation unit 16 acquires the text data about the video content from the designated URL. Alternatively, the content designation unit 16 may acquire the text data by importing a file such as a CSV file.

In step S6, the output unit 17 splits the text data acquired in step S5 into morphemes. For example, the text “The larva of the cabbage white butterfly eats the eggshell.” is broken down into the morphemes “cabbage white butterfly,” “larva,” “egg,” “shell,” and “eat.”

In step S7, the output unit 17 calculates, for each word obtained in step S6, the classification in the ontology under which it is used most often. For the text “The larva of the cabbage white butterfly eats the eggshell.”, it extracts from the ontology and outputs, for example, the metadata “ecology-feeding habits.”
The content operator can obtain the output metadata “ecology-feeding habits” and use it in their own system.

The designated content can also be, for example, a Web page of a community site. By designating a Web page, the page author can obtain metadata and reorganize the page itself. For example, if a Web page about “daphnia” describes morphology under the heading “characteristics,” the page author can redefine “characteristics” as the shared headword “morphology” and so reorganize the page according to the common structure.

Further, a content operator can use the management server 1, for example, to extract the table-of-contents structure and the characteristics of the explanations from textbook data whose text is organized under hierarchical headwords, as a reference when producing content for school education.
The content operator designates to the target designation unit 11 multiple URLs where textbook data is published and designates everything to the content designation unit 16, and can thereby obtain the entire textbook ontology generated by the structuring unit 14.
For a first-year junior high school science textbook, for example, the knowledge structure obtained is an ontology that defines table-of-contents entries such as “organisms and cells,” “animal bodies,” and “classification” under the entry “animal life,” together with characteristic words such as “cell,” “division,” and “oviparity” that characterize these subordinate entries. This lets the content operator grasp the table of contents and the material needed in educational resources, and also identify items that a particular textbook does not cover.
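Identifying items that a particular textbook does not cover amounts to a set difference between the merged ontology and that textbook's own heading paths. The Python sketch below illustrates this under the assumption that both are available as lists of hierarchy tuples; the function name `missing_entries` and the sample textbook are hypothetical.

```python
def missing_entries(ontology_paths, textbook_paths):
    """Return the ontology paths that the given textbook does not contain, sorted."""
    return sorted(set(ontology_paths) - set(textbook_paths))

# Paths from the merged textbook ontology (entries taken from the example above).
ontology_paths = [
    ("animal life", "organisms and cells"),
    ("animal life", "animal bodies"),
    ("animal life", "classification"),
]
# Hypothetical textbook that omits one entry.
textbook_x = [
    ("animal life", "organisms and cells"),
    ("animal life", "animal bodies"),
]
gaps = missing_entries(ontology_paths, textbook_x)
```

For this sample, the only gap reported is `("animal life", "classification")`.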

According to this embodiment, the management server 1 extracts the headword hierarchy information and the characteristic words under each heading from multiple pieces of document information in a common field, and integrates them based on similarity to generate an ontology as knowledge structure data. The management server 1 can therefore generate knowledge structure data that defines a metadata scheme usable in common across multiple pieces of content.

For example, for broadcast program content, attaching metadata based on a knowledge structure of commonly usable headwords generated from multiple related Web pages, such as “cast,” “broadcast year,” and “program-related books,” makes it possible to standardize metadata definitions. In addition, because the extracted knowledge structure shows that broadcast program content carries “program-related books” information, video content that lacks this entry can be checked to confirm whether there really are no program-related books.

Because the knowledge structure data includes the words associated with each headword, the semantic content of each level of the headword hierarchy is expressed more concretely. This makes matching against arbitrary text data easier, so appropriate metadata can be attached to content easily.

Because the knowledge structure data also includes the degree of association between headwords and characteristic words, the degree to which content matches each headword can be compared more concretely based on this degree of association. This likewise makes matching against arbitrary text data easier, so appropriate metadata can be attached to content easily.

Further, because the management server 1 uses dictionary data to acquire synonyms of the words under each heading, matching against content can be performed appropriately based on the conceptual structure.

In this way, a content holder can use an ontology extracted with a community site as the knowledge source to automatically attach structured metadata to the text information of the content it holds. This lets the content operator identify, within that content, items whose substance overlaps and subject matter that is missing. A service provider, in turn, can offer services such as complementary linkage between Web pages and video content.

[Second Embodiment]
The second embodiment of the present invention is described below.
The content server 2 according to this embodiment is an information processing device (computer) that manages content at a content holder, service provider, or the like. The content server 2 functions as a content acquisition device that acquires new content related to the metadata attached by the management server 1 of the first embodiment.

FIG. 6 is a diagram showing the functional configuration of the content server 2 according to this embodiment.
The content server 2 includes a metadata acquisition unit 21, a first content acquisition unit 22, and a second content acquisition unit 23.

The metadata acquisition unit 21 provides its own content (for example, Web page data) to the management server 1 and acquires ontology-based metadata.

The first content acquisition unit 22 acquires, from a predetermined database (the content holder), content to which the same metadata as the acquired metadata has been attached.

The second content acquisition unit 23, based on the hierarchical information of the metadata, extracts from the ontology other hierarchical information that shares the same upper level, and acquires from the predetermined database content to which metadata corresponding to that other hierarchical information has been attached.

FIG. 7 is a diagram showing an example of the related-content acquisition process according to this embodiment.
This example shows the processing at a service provider that wants to add video content to its own Web page.
When the service provider wants to link video content related to a page's substance to its own Web page, it can select related content by using the metadata attached by the content holder.

For example, suppose the content holder has video content explaining the cultural background of Mt. Fuji and video content explaining the climate of Mt. Fuji, and that the metadata “mountain-faith” and “mountain-geology” are attached to these pieces of content.

When producing a site introducing Mt. Fuji, the service provider submits the text area 31, which explains the mountain's cultural background, to the management server 1 and obtains the metadata “mountain-faith.”

Using the acquired metadata, the service provider can revise the headword 32, search for content held by the content holder that relates to “mountain”-“faith”, and create a link 33. At this time, text 34 accompanying the retrieved content may be added.

Furthermore, by tracing the conceptual structure of the metadata, the service provider can learn that video content on “mountain”-“geology” also exists. The service provider can therefore additionally search for and display not only video content related to its own Web page but also systematically related video content. That is, a headword 35 for “geology”, a video link 36, and text 37 accompanying the video are added.
The query for searching for content may use an API, SPARQL, or the like, but is not limited to these.
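As one illustration of a SPARQL-based query, the following Python sketch builds a query string that finds content tagged with concepts sharing a parent with a given concept. It is only an assumption about how such a query might look. The use of skos:broader (from the W3C SKOS vocabulary) to represent the hierarchy, the ex: namespace, the ex:hasMetadata property, and the function name build_sibling_content_query are all hypothetical and do not appear in the embodiment.

```python
def build_sibling_content_query(concept_iri: str) -> str:
    """Build a SPARQL query for content tagged with siblings of concept_iri.

    Assumptions (hypothetical): the hierarchy is expressed with skos:broader,
    and content is linked to its metadata concept via ex:hasMetadata.
    The IRI is inserted as-is, so it must come from a trusted source.
    """
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex: <http://example.org/ontology#>
SELECT ?content ?sibling WHERE {{
  <{concept_iri}> skos:broader ?parent .
  ?sibling skos:broader ?parent .
  ?content ex:hasMetadata ?sibling .
  FILTER (?sibling != <{concept_iri}>)
}}"""


if __name__ == "__main__":
    # Hypothetical IRI standing in for the concept "mountain"-"faith".
    print(build_sibling_content_query("http://example.org/ontology#faith"))
```

The resulting string could be sent to any SPARQL endpoint that exposes the knowledge structure data in this assumed form.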

According to the present embodiment, the content server 2 acquires content with common metadata from the database, and can thereby use the ontology to provide information by linking multiple pieces of content.
Furthermore, because the content server 2 acquires, based on the ontology, related content whose metadata shares an upper hierarchy level, it can efficiently collect multiple pieces of content, including related information, and increase the amount of information provided.

In this way, by assigning metadata to content based on an ontology that uses a community site as its knowledge source, the content server 2 can not only organize content systematically but also create new content that complements and links with other content.
Using metadata standardized with the knowledge of the community site, together with the systematized content, the service provider can easily offer new services that link multiple pieces of content, such as a video encyclopedia. Flexible content linkage also becomes possible, for example between video content and Web content, or between video content and other video content.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the present embodiment are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the present embodiment.

In the embodiment described above, the method of assigning metadata was explained using video content or a Web page as an example. However, the content is not limited to these; any of various kinds of content to which text data that can be matched against the ontology has been attached can be targeted.

The functions of the management server 1 (ontology generation device, metadata output device) and the content server 2 (content acquisition device) described above may be provided as a system that is distributed or integrated as appropriate for the service form.

The present embodiment has described the configuration and operation of the ontology generation device and of the metadata output device and content acquisition device that use the ontology. However, the present invention is not limited to this, and may be configured as a method or program that includes each of the components and generates or uses an ontology.

Furthermore, a program for realizing the functions of each device may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on that medium.

The “computer system” referred to here includes an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into the computer system.

The “computer-readable recording medium” may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may realize only some of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Management server (ontology generation device, metadata output device)
2 Content server (content acquisition device)
11 Target specification unit
12 Headline extraction unit
13 Word extraction unit
14 Structuring unit
15 Dictionary acquisition unit
16 Content specification unit
17 Output unit
21 Metadata acquisition unit
22 First content acquisition unit
23 Second content acquisition unit

Claims

An ontology generation device comprising:
a headline extraction unit that extracts hierarchical information of headwords from each of a plurality of pieces of document information in a specified field;
a word extraction unit that extracts words associated with the headwords from each piece of the document information; and
a structuring unit that integrates different headwords into one based on the similarity of the words and generates knowledge structure data of the field including hierarchical information of the integrated headwords.

The ontology generation device according to claim 1, wherein the structuring unit generates the knowledge structure data in which the headword and the word are associated with each other.

The ontology generation device according to claim 2, wherein the structuring unit generates the knowledge structure data including a degree of association between the headword and the word.

A metadata output device comprising an output unit that matches text data related to content against the words included in the knowledge structure data generated by the ontology generation device according to any one of claims 1 to 3, extracts from the knowledge structure data the hierarchical information of the headword associated with a matched word, and outputs the hierarchical information as metadata of the content.

The metadata output device according to claim 4, further comprising a dictionary acquisition unit that acquires synonyms of the word based on dictionary data,
wherein the output unit extracts the hierarchical information of the headword associated with the word by matching the text data against the word or the synonyms.

A content acquisition device comprising a first content acquisition unit that acquires, from a predetermined database, content to which the same metadata as the metadata output by the metadata output device according to claim 4 or 5 has been assigned.

The content acquisition device according to claim 6, further comprising a second content acquisition unit that, based on the hierarchical information of the metadata, extracts from the knowledge structure data other hierarchical information that shares the same upper level, and acquires from the predetermined database content to which metadata corresponding to that other hierarchical information has been assigned.

An ontology generation method in which a control unit of a computer executes:
a headline extraction step of extracting hierarchical information of headwords from each of a plurality of pieces of document information in a specified field;
a word extraction step of extracting words associated with the headwords from each piece of the document information; and
a structuring step of integrating different headwords into one based on the similarity of the words and generating knowledge structure data of the field including hierarchical information of the integrated headwords.

An ontology generation program that causes a control unit of a computer to execute:
a headline extraction step of extracting hierarchical information of headwords from each of a plurality of pieces of document information in a specified field;
a word extraction step of extracting words associated with the headwords from each piece of the document information; and
a structuring step of integrating different headwords into one based on the similarity of the words and generating knowledge structure data of the field including hierarchical information of the integrated headwords.