WO2014021229A1 - Relevant document extraction device, relevant document extraction method and relevant document extraction program - Google Patents

Relevant document extraction device, relevant document extraction method and relevant document extraction program

Info

Publication number
WO2014021229A1
WO2014021229A1 PCT/JP2013/070376 JP2013070376W WO2014021229A1 WO 2014021229 A1 WO2014021229 A1 WO 2014021229A1 JP 2013070376 W JP2013070376 W JP 2013070376W WO 2014021229 A1 WO2014021229 A1 WO 2014021229A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
topic
tag
default
appearance frequency
Prior art date
Application number
PCT/JP2013/070376
Other languages
French (fr)
Japanese (ja)
Inventor
隼 赤塚
公亮 角野
渉 内田
Original Assignee
株式会社エヌ・ティ・ティ・ドコモ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社エヌ・ティ・ティ・ドコモ
Publication of WO2014021229A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Definitions

  • The present invention relates to a related document extraction apparatus, a related document extraction method, and a related document extraction program for extracting documents related to a specific topic from a plurality of documents.
  • Microblogging is an information service in which users post short texts of tens to hundreds of characters; on Twitter, users post short documents called tweets of up to 140 characters. Tweets cover a variety of content, such as the user's recent situation, shared news articles of interest, replies to an acquaintance's tweet, and comments on a specific topic such as a television program. Because users can share information by posting a comment together with the URL of, for example, a news article of interest, microblogging is widely used not only as a tool for following friends' latest status but also as an information collection tool.
  • Hashtags are associated with broad topics. For example, when a TV program is the topic, “XX drama: YYY (drama title), episode 1” is one topic. While watching the program, a user tweets about the XX drama YYY being broadcast and attaches a broadcast station hashtag. If the hashtag of the broadcasting station that broadcasts the XX drama YYY (the broadcast station hashtag) is #zzz, many people post tweets with a program hashtag (#xx, #YYY, #XX drama) in addition to the broadcast station hashtag. Broadcast station hashtags are hashtags that users widely attach to tweets about programs broadcast by a television station, whether or not the hashtag is official. By collecting tweets that include hashtags related to a topic, users' comments on the topic can be grasped.
  • Non-Patent Document 1 provides a service that extracts tweets associated with a broadcasting station and displays them for each broadcasting station.
  • In this service, the broadcasting station is one topic. A tweet can easily be linked to a broadcasting station by using the broadcast station hashtag. For example, to collect tweets about programs of the broadcasting station ZZZ, it is only necessary to collect tweets that include #zzz, the broadcast station hashtag.
  • Non-Patent Document 2 provides a service that extracts tweets associated with a program and displays them for each program being broadcast.
  • In this service, a program being broadcast is one topic.
  • A broadcast station hashtag is used to link tweets to the program being broadcast.
  • In addition, the service dynamically estimates program hashtags in real time. For example, for a program of the broadcasting station ZZZ, tweets including the broadcast station hashtag (#zzz) are linked to the program, and when the program “YYY” is broadcast, one or more program hashtags are dynamically generated.
  • The services of Non-Patent Document 1 and Non-Patent Document 2 have the following problems.
  • With television, some users post tweets unrelated to the program being broadcast and attach a plurality of broadcast station hashtags. Because the service of Non-Patent Document 1 simply collects tweets that include broadcast station hashtags, it also displays tweets unrelated to the program.
  • In addition, because the service of Non-Patent Document 1 extracts only tweets with a broadcast station hashtag, the number of tweets it can extract is limited.
  • The service of Non-Patent Document 2 dynamically estimates program hashtags in addition to the broadcast station hashtag and extracts tweets related to the program being broadcast, but it still cannot extract all such tweets. Tweets about a TV program being broadcast do not necessarily carry hashtags; in practice, many of them tend to have no hashtag at all.
  • In short, the service of Non-Patent Document 1 cannot extract tweets that carry only a program hashtag, and the service of Non-Patent Document 2 cannot extract program-related tweets that carry no hashtag.
  • The present invention has been made in view of the above problems, and its object is to provide a related document extraction apparatus, a related document extraction method, and a related document extraction program capable of appropriately extracting documents related to a specific topic from a plurality of documents such as tweets.
  • A related document extraction apparatus according to the present invention includes: default topic tag storage means that stores in advance a default topic tag indicating a topic; document storage means that stores a plurality of documents in advance; word acquisition means that divides the documents stored by the document storage means into words; default document extraction means that extracts, from the plurality of documents stored by the document storage means, documents including the default topic tag stored by the default topic tag storage means; first appearance frequency calculation means that calculates the appearance frequency of the words, divided by the word acquisition means, in the documents extracted by the default document extraction means; and topic document extraction means that uses the appearance frequency calculated by the first appearance frequency calculation means to extract documents related to the topic from documents other than those extracted by the default document extraction means.
  • In the related document extraction apparatus according to the present invention, documents related to a topic are extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches the appearance frequency.
  • As a result, documents related to a specific topic can be appropriately extracted from a plurality of documents such as tweets.
  • The topic document extraction means may include: score calculation means that uses the appearance frequency calculated by the first appearance frequency calculation means to calculate a score for a document, other than the documents extracted by the default document extraction means, from the words appearing in that document; and first topic document determination means that determines, based on the score calculated by the score calculation means, whether the scored document is related to the topic.
  • According to this configuration, a document that contains words with a high appearance frequency in the documents including the default topic tag can be extracted as a document related to the topic, so documents related to a specific topic can be extracted reliably.
  • Even when the same word appears more than once in a document, the score calculation means may calculate the score of the document in the same way as when the word appears once. According to this configuration, the score of a document is not inflated by words repeated within it, which avoids extracting an inappropriate document as a document related to the topic.
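  • The scoring idea above can be illustrated with a minimal sketch. It assumes the score is a plain sum of per-word weights and that a fixed threshold decides relevance; the function names, the weights, and the threshold are hypothetical and are not taken from the specification.

```python
from typing import Dict, Iterable


def score_document(words: Iterable[str], weights: Dict[str, float]) -> float:
    """Sum the topic weights of the distinct words in one document.

    Converting to a set means a word repeated many times contributes
    exactly as if it had appeared once.
    """
    return sum(weights.get(w, 0.0) for w in set(words))


def is_topic_document(words: Iterable[str], weights: Dict[str, float],
                      threshold: float) -> bool:
    """Treat the document as topic-related when its score reaches the threshold."""
    return score_document(words, weights) >= threshold


# Hypothetical weights derived from documents carrying the default topic tag.
topic_weights = {"YYY": 1.0, "AAAA": 0.9, "CCCC": 0.7}
```

  • With these assumed weights, a document that repeats “YYY” five times scores the same as one that mentions it once, so repetition alone cannot push it over the threshold.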
  • The topic document extraction means may include: tag document extraction means that extracts, from the plurality of documents stored by the document storage means, documents including a tag other than the default topic tag; second appearance frequency calculation means that calculates the appearance frequency of the words, divided by the word acquisition means, in the documents extracted by the tag document extraction means; and second topic document determination means that compares the appearance frequency calculated by the first appearance frequency calculation means with the appearance frequency calculated by the second appearance frequency calculation means and determines, based on the comparison result, whether the documents extracted by the tag document extraction means are related to the topic. According to this configuration, a document (or document group) including a tag other than the default topic tag can be extracted as related to the topic, so documents related to a specific topic can be extracted reliably.
  • The second topic document determination means may compare the appearance frequencies by calculating the cosine distance, Jaccard distance, or Euclidean distance between the feature amount indicated by the word appearance frequencies calculated by the first appearance frequency calculation means and the feature amount indicated by the word appearance frequencies calculated by the second appearance frequency calculation means. According to this configuration, documents related to a specific topic can be extracted more reliably.
  • The default topic tag storage means may store, as a default topic tag, a tag related to an inappropriate topic, and the topic document extraction means may determine whether a document is related to the inappropriate topic and exclude such documents. According to this configuration, inappropriate documents can be excluded; for example, they can be prevented from being presented to the user.
  • The document storage means may store information on the user who posted each document, and the first appearance frequency calculation means may calculate, as the appearance frequency of a word, the number of users who have posted a document including the word.
  • According to this configuration, the influence of each user can be made uniform; for example, the influence of one user posting documents with the same content several times can be suppressed. As a result, documents related to a specific topic can be extracted appropriately.
  • The first appearance frequency calculation means may calculate, for each word, a reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted a document including the word, and the topic document extraction means may extract documents related to the topic using the reverse appearance frequency calculated by the first appearance frequency calculation means. According to this configuration, documents related to a topic are extracted using the reverse appearance frequency of words in the documents including the default topic tag indicating the topic. As a result, documents related to a specific topic can be extracted more appropriately from a plurality of documents such as tweets.
  • The topic document extraction means may extract documents related to the topic using the number of characters of each word. According to this configuration, documents related to a topic are extracted using the number of characters of the words in the documents including the default topic tag indicating the topic. As a result, documents related to a specific topic can be extracted more appropriately from a plurality of documents such as tweets.
  • The default topic tag storage means may store a plurality of default topic tags, each indicating one of a plurality of topics, and the topic document extraction means may exclude documents that relate to more than one of those topics. Documents posted to multiple topics at once (multi-topic postings) are often unrelated to any of them. According to this configuration, therefore, extracting an inappropriate document as a document related to a topic can be avoided.
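  • As an illustration of excluding multi-topic postings, the following sketch assumes that a document "relates to" a topic when its text contains one of that topic's default topic tags. The topic IDs, the tag `#qqq`, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def topics_of(text: str, default_tags: Dict[str, List[str]]) -> Set[str]:
    """Topic IDs whose default topic tags occur in the text (plain string matching)."""
    return {topic_id for topic_id, tags in default_tags.items()
            if any(tag in text for tag in tags)}


def drop_multi_topic(texts: List[str], default_tags: Dict[str, List[str]]) -> List[str]:
    """Keep only documents that match at most one topic."""
    return [t for t in texts if len(topics_of(t, default_tags)) <= 1]


# Hypothetical default topic tags for two topics.
default_tags = {"topic1": ["#zzz"], "topic2": ["#qqq"]}
texts = ["YYY is great #zzz", "follow me #zzz #qqq", "no tag here"]
kept = drop_multi_topic(texts, default_tags)
```

  • Here the second text carries the tags of both topics and is dropped, while single-topic and untagged texts are kept for later processing.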
  • In addition to being described as an invention of a related document extraction apparatus as above, the present invention can also be described as an invention of a related document extraction method and a related document extraction program, as follows.
  • These are substantially the same invention, differing only in category, and they have the same operations and effects.
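  • A related document extraction method according to the present invention is performed by an apparatus that includes default topic tag storage means that stores in advance a default topic tag indicating a topic and document storage means that stores a plurality of documents in advance. The method includes a word acquisition step of dividing the documents stored by the document storage means into words, a default document extraction step of extracting, from the plurality of documents stored by the document storage means, documents including the default topic tag stored by the default topic tag storage means, and further steps corresponding to the first appearance frequency calculation means and the topic document extraction means described above.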
  • A related document extraction program according to the present invention causes a computer to function as: default topic tag storage means that stores in advance a default topic tag indicating a topic; document storage means that stores a plurality of documents in advance; word acquisition means that divides the documents stored by the document storage means into words; default document extraction means that extracts, from the plurality of documents stored by the document storage means, documents including the default topic tag stored by the default topic tag storage means; first appearance frequency calculation means that calculates the appearance frequency of the words, divided by the word acquisition means, in the documents extracted by the default document extraction means; and topic document extraction means that uses the appearance frequency calculated by the first appearance frequency calculation means to extract documents related to the topic from documents other than those extracted by the default document extraction means.
  • According to the present invention, documents related to a topic are extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches the appearance frequency.
  • As a result, documents related to a specific topic can be appropriately extracted from a plurality of documents such as tweets.
  • FIG. 1 shows a related document extraction apparatus 10 according to the present embodiment.
  • The related document extraction apparatus 10 is an apparatus that extracts documents related to a specific topic from a plurality of documents.
  • A document to be extracted is, for example, a document posted by a user to a microblog and published on the Web.
  • In the present embodiment, Twitter, a representative microblog, is used as a specific example.
  • The extraction target is called a document here, but depending on the microblog service it is also called a tweet or a comment. Note that the documents to be extracted do not necessarily need to be published on the Web.
  • The related document extraction apparatus 10 takes as input documents posted by a large number of users, extracts the documents related to a specific topic from them, and provides them to the user as a document group related to that topic.
  • Specific topics include, for example, specific television programs. By referring to the document group related to a specific topic, a user can learn what other users think about that topic.
  • The related document extraction apparatus 10 includes a document storage unit 100, a morpheme analysis unit 110, a morpheme storage unit 120, a topic tag estimation unit 130, a topic tag storage unit 140, and a topic ID assignment unit 150.
  • The related document extraction apparatus 10 is connected via a network such as the Internet to a device that outputs the documents to be extracted (for example, a server providing a microblog service), so that it can acquire (receive) those documents.
  • The document storage unit 100 is document storage means that inputs and stores in advance a plurality of documents to be extracted.
  • The document storage unit 100 may request and acquire (receive) documents via the Internet from a server that provides a microblog service and stores the documents, or it may receive document data streamed from that server.
  • On Twitter, for example, each document corresponds to the data of one tweet generated (posted) by a user.
  • The stored data is not necessarily limited to a single type of data.
  • FIG. 2 shows a sample format of a document stored in the document storage unit 100.
  • The data for one document stored in the document storage unit 100 associates a document ID, a user ID, a posting time, a text, and a hash tag with one another.
  • One row of the data shown in FIG. 2 corresponds to the data for one document.
  • The document ID is information that identifies each document and is a unique value.
  • The user ID is information that identifies the user who created each document.
  • In this way, the document storage unit 100 inputs and stores information on the user who posted each document.
  • The user ID may be a unique value such as a user account or, when a unique value is difficult to determine, an ID assigned to each Internet session.
  • The posting time is information indicating the time at which the user posted the document.
  • The text is the actual text data (document body) included in the document data.
  • The hash tag is tag information attached to the document.
  • “Hash tag” is a Twitter term; it is a tag that a user attaches to a document to post explicitly about a specific topic, for example a tag by which a specific event can be recognized, that is, an event identifier.
  • A document does not necessarily include a hash tag (event identifier); when it includes none, the hash tag field holds a NULL value.
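  • The document record described for FIG. 2 could be represented roughly as follows. This is a sketch only: the field names, the types, and the sample values are assumptions, and `None` stands in for the NULL value used when a document has no hash tag.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Document:
    doc_id: str                                  # unique document identifier
    user_id: str                                 # account ID or per-session ID
    posted_at: datetime                          # posting time
    text: str                                    # document body
    hashtags: Optional[Tuple[str, ...]] = None   # None plays the role of NULL


# Hypothetical sample rows.
tagged = Document("d1", "u1", datetime(2013, 7, 26, 21, 0),
                  "YYY is starting #zzz", ("#zzz",))
untagged = Document("d2", "u2", datetime(2013, 7, 26, 21, 1),
                    "watching the drama now")
```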
  • The morpheme analysis unit 110 is word acquisition means that reads the document data stored in the document storage unit 100 and divides the text of the document data into words.
  • The morpheme analysis unit 110 divides text into words by, for example, morphological analysis.
  • Conventional techniques can be used for the morphological analysis.
  • The division into words does not necessarily have to be performed by morphological analysis and may be performed by any method.
  • In the present embodiment, a word is a morpheme. Morphemes are acquired for each document.
  • The morpheme analysis unit 110 outputs information on the morphemes obtained from the text to the morpheme storage unit 120.
  • The morpheme storage unit 120 is means for storing the morphemes input from the morpheme analysis unit 110.
  • FIG. 3 shows a sample format of the morphemes stored in the morpheme storage unit 120.
  • The data for one morpheme stored in the morpheme storage unit 120 associates a document ID, a user ID, a posting time, a morpheme, and a part of speech with one another.
  • One row of the data shown in FIG. 3 corresponds to the data for one morpheme.
  • The document ID, user ID, and posting time are those of the document from which the morpheme was acquired.
  • The morpheme is a morpheme obtained by the morpheme analysis unit 110.
  • The part of speech is the part of speech of the morpheme, as determined by the analysis of the morpheme analysis unit 110. For example, information indicating whether the morpheme is a noun is stored.
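  • The per-morpheme rows described for FIG. 3 can be sketched as below. A regular-expression split is used purely as a stand-in for a real morphological analyzer (it cannot assign parts of speech, so that field is left empty); the record layout and names are assumptions.

```python
import re
from typing import List, NamedTuple, Optional


class MorphemeRecord(NamedTuple):
    doc_id: str
    user_id: str
    posted_at: str
    morpheme: str
    part_of_speech: Optional[str]  # a real analyzer would fill this in


def split_into_words(doc_id: str, user_id: str, posted_at: str,
                     text: str) -> List[MorphemeRecord]:
    """Produce one record per token, carrying over the document's metadata."""
    # Stand-in for morphological analysis: runs of word characters become tokens.
    tokens = re.findall(r"\w+", text.lower())
    return [MorphemeRecord(doc_id, user_id, posted_at, t, None) for t in tokens]


records = split_into_words("d1", "u1", "2013-07-26T21:00", "YYY is starting #zzz")
```

  • Keeping the user ID on every row is what later allows frequencies to be counted per unique user rather than per document.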
  • The topic tag estimation unit 130 is means for generating the information used to determine whether each document is related to a specific topic.
  • The topic tag estimation unit 130 generates this information using the information stored in the topic tag storage unit 140 and stores the result in the topic tag storage unit 140.
  • First, the topic tag storage unit 140 will be described.
  • The topic tag storage unit 140 includes a default topic tag storage unit 141, a topic feature word storage unit 142, and an extended topic hash tag storage unit 143.
  • The default topic tag storage unit 141 is default topic tag storage means that stores in advance default topic tags indicating topics.
  • A default topic tag is a tag for a topic whose related documents are to be extracted, and it is registered in advance by, for example, the administrator of the related document extraction apparatus 10.
  • A document that includes a default topic tag is extracted as a document related to the topic of that default topic tag. This extraction is performed by character string matching.
  • The default topic tag is, for example, a morpheme, a hash tag, or a keyword.
  • A default topic tag exists for each topic.
  • FIG. 4 shows a sample format of the default topic tags stored in the default topic tag storage unit 141.
  • The data for one default topic tag stored in the default topic tag storage unit 141 associates a topic ID with a tag.
  • One row of the data shown in FIG. 4 corresponds to the data for one default topic tag.
  • The topic ID is an ID that identifies one topic.
  • The tag is the default topic tag itself.
  • The default topic tag storage unit 141 may store a plurality of default topic tags, each indicating one of a plurality of topics (a plurality of topic IDs).
  • The information stored in the topic feature word storage unit 142 and the extended topic hash tag storage unit 143 is input from the topic tag estimation unit 130 and will be described later.
  • The topic tag estimation unit 130 uses the default topic tags stored in the default topic tag storage unit 141 to generate the information used to determine whether each document is related to a specific topic. This information is used to determine whether a document that does not include a default topic tag is nevertheless related to the topic of that default topic tag.
  • The topic tag estimation unit 130 includes a topic feature word estimation unit 131 and a topic hash tag estimation unit 132, each of which generates different information.
  • The topic feature word estimation unit 131 is means for estimating the feature words of a topic.
  • A feature word of a topic is a morpheme that appears characteristically in documents related to the topic.
  • The topic feature word estimation unit 131 is default document extraction means that reads the default topic tags stored by the default topic tag storage unit 141 and extracts, from the plurality of documents stored by the document storage unit 100, the documents including a default topic tag as documents related to the topic (topic documents). At this time, based on the posting time, it acquires the documents of a predetermined period, such as the most recent several hours before the time of extraction.
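  • The default document extraction step, with its string matching and time window, might look like the following sketch. The dictionary keys, the three-hour window, and the sample data are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def extract_topic_documents(docs: List[dict], default_tags: List[str],
                            now: datetime, window: timedelta) -> List[dict]:
    """Return documents posted within `window` before `now` whose text contains a default tag."""
    start = now - window
    return [
        d for d in docs
        if start <= d["posted_at"] <= now
        and any(tag in d["text"] for tag in default_tags)  # character string matching
    ]


# Hypothetical sample documents.
docs = [
    {"doc_id": "d1", "posted_at": datetime(2013, 7, 26, 21, 0), "text": "YYY is on #zzz"},
    {"doc_id": "d2", "posted_at": datetime(2013, 7, 26, 21, 5), "text": "so tired today"},
    {"doc_id": "d3", "posted_at": datetime(2013, 7, 26, 9, 0), "text": "morning news #zzz"},
]
recent = extract_topic_documents(docs, ["#zzz"],
                                 now=datetime(2013, 7, 26, 22, 0),
                                 window=timedelta(hours=3))
```

  • Plain substring matching is the simplest reading of "character string matching"; note that it would also match a longer tag that merely begins with the same characters, so a stricter token-level match may be preferable in practice.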
  • The topic feature word estimation unit 131 is also first appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of morphemes in the extracted topic documents (topic document group). At this time, the topic feature word estimation unit 131 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Further, for each morpheme, the topic feature word estimation unit 131 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted a document including the morpheme.
  • From these values, the topic feature word estimation unit 131 extracts the morphemes (features) that are characteristic of the topic documents (topic document group).
  • For each topic ID, the topic feature word estimation unit 131 generates a feature amount, which is information describing the features of the target topic.
  • The feature amount is composed of a plurality of features, one generated for each morpheme. For example, the feature “Today” has a score of “0.5”, and the feature “Sunny” has a score of “0.2”.
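  • The topic feature word estimation unit 131 calculates an IDF (Inverse Document Frequency) value (inverse appearance frequency) for each morpheme, from the morphemes included in each document, as the ratio of unique users described above:
  • $\mathrm{idf}_i = |U| \,/\, |U_i|$
  • Here, $|U|$ is the total number of unique users who have posted documents, and $|U_i|$ is the number of unique users who have posted a document including the morpheme $i$.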
  • The IDF value is a score expressing that the fewer the documents in which a word appears, the more useful that word is for characterizing the documents in which it does appear.
  • The frequency is calculated from the number of users rather than the number of documents for the following reason. If the number of documents is simply used, noise may be mixed in. For example, the same user may post a plurality of documents with the same content, and some people post the same content dozens of times. When the calculation is based on the number of unique users, a user who posts the same content many times is counted only once, so the calculated score is more reliable. This can also be regarded as making the influence of each user on the morpheme score uniform.
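  • A minimal sketch of the user-based reverse appearance frequency follows. It implements the plain ratio (total unique users divided by unique users who used the morpheme) with no logarithm; the input layout and names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def user_based_idf(postings: Iterable[Tuple[str, Iterable[str]]]) -> Dict[str, float]:
    """postings: (user_id, morphemes of one document) pairs.

    Returns, per morpheme, (number of unique users) / (unique users who used it).
    """
    all_users: Set[str] = set()
    users_by_morpheme: Dict[str, Set[str]] = defaultdict(set)
    for user_id, morphemes in postings:
        all_users.add(user_id)
        for m in morphemes:
            users_by_morpheme[m].add(user_id)  # a set ignores repeat posts by one user
    total = len(all_users)
    return {m: total / len(users) for m, users in users_by_morpheme.items()}


# Hypothetical postings; user u1 posts the same content twice.
postings = [
    ("u1", ["drama", "today"]),
    ("u1", ["drama", "today"]),
    ("u2", ["today"]),
    ("u3", ["today", "sunny"]),
]
idf = user_based_idf(postings)
```

  • Because users are collected in sets, the duplicate post by `u1` changes nothing: “drama” is still used by one of three users, and “today” by all three.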
  • The topic feature word estimation unit 131 also calculates, for each topic ID, a TF (Term Frequency) value (appearance frequency) for each morpheme from the morphemes included in the topic documents (the morphemes to which the topic ID is assigned), using the following quantities.
  • $j$ is a subscript indicating the topic ID.
  • $n_{i,j}$ is the number of unique users who have posted a document related to the topic ID $j$ (a document including the default topic tag of the topic ID $j$) that includes the morpheme $i$.
  • The TF value indicates how prominently a word appears in a given document; the larger the value, the better the word represents the content of the document.
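  • As a sketch only: one form of the TF value that is consistent with the definition of $n_{i,j}$ above is the conventional term-frequency normalization shown below. The normalizing sum over all morphemes $k$ of the topic is an assumption and is not stated in the text.

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```

  • Under this assumed form, the TF values of all morphemes within one topic sum to 1, so topics with different numbers of posting users remain comparable.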
  • The feature amount for a television program is, for example, “YYY (drama title): 1.0, AAAA (actor name): 0.9, CCCC (role name): 0.7, DDDD (role name): 0.4”, where each entry has the form (morpheme: TFIDF value).
  • In the above calculation, the IDF may be weighted to adjust the score (TFIDF value), for example by taking the logarithm of the IDF or raising the IDF to a constant power. The TFIDF value may also be calculated using the following formula:
  • $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \times \log(\mathrm{length}_i)$
  • Here, $\mathrm{length}_i$ is the number of characters of the morpheme $i$.
  • The character count may be weighted further by raising $\log(\mathrm{length}_i)$ to a constant power. Doing so increases the weight of morphemes that are written more specifically. It also counteracts the fact that morphemes with few characters appear frequently and therefore tend to receive high scores as noise.
  • The topic feature word estimation unit 131 outputs the calculated TFIDF value of each morpheme for each topic ID to the topic feature word storage unit 142, which stores it.
  • Only the morphemes (feature words) whose TFIDF value is equal to or greater than a preset threshold may be stored in the topic feature word storage unit 142.
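  • The TFIDF scoring, the optional character-count weighting, and the threshold filter can be sketched together as follows. The TF normalization (dividing by the sum of user counts in the topic) is an assumption, as are the function name, the sample counts, and the threshold value.

```python
import math
from typing import Dict


def tfidf_scores(topic_user_counts: Dict[str, int], idf: Dict[str, float],
                 threshold: float = 0.0, use_length: bool = False) -> Dict[str, float]:
    """topic_user_counts[m]: unique users who posted a topic document containing m."""
    total = sum(topic_user_counts.values())  # assumed TF normalizer
    scores: Dict[str, float] = {}
    for m, n in topic_user_counts.items():
        score = (n / total) * idf[m]
        if use_length:
            score *= math.log(len(m))  # optional weight by number of characters
        if score >= threshold:
            scores[m] = score  # keep only feature words at or above the threshold
    return scores


# Hypothetical counts and reverse appearance frequencies.
counts = {"YYY": 6, "AAAA": 3, "a": 1}
idf = {"YYY": 4.0, "AAAA": 2.0, "a": 1.0}
plain = tfidf_scores(counts, idf)
weighted = tfidf_scores(counts, idf, threshold=0.5, use_length=True)
```

  • With the length weighting enabled, the one-character morpheme “a” receives a weight of log(1) = 0 and falls below the threshold, which illustrates how short, frequent morphemes are suppressed.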
  • FIG. 5 shows a sample format of the feature amounts stored in the topic feature word storage unit 142.
  • The feature amount data stored in the topic feature word storage unit 142 is held per morpheme; the data for one morpheme associates a topic ID, a creation date, a tag, and a score with one another.
  • One row of the data shown in FIG. 5 corresponds to the data for one morpheme of one topic ID.
  • The topic ID is the topic ID of the topic to which the feature amount relates.
  • The creation date is the time at which the data was created.
  • The tag is the morpheme.
  • The score is the TFIDF value calculated by the topic feature word estimation unit 131. This concludes the description of the information generated by the topic feature word estimation unit 131.
  • The topic hash tag estimation unit 132 is means for estimating the hash tags, other than the default topic tags, that are related to a topic.
  • The topic hash tag estimation unit 132 is tag document extraction means that reads the default topic tags stored by the default topic tag storage unit 141 and extracts, from the plurality of documents stored by the document storage unit 100, the documents including a tag other than the default topic tags (a hash tag that is a candidate for being related to the topic) as tag documents (tag document group). At this time, based on the posting time, it acquires the documents of a predetermined period, such as the most recent several hours before the time of extraction.
  • The topic hash tag estimation unit 132 is also second appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of morphemes in the extracted tag documents (tag document group). At this time, the topic hash tag estimation unit 132 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Further, for each morpheme, the topic hash tag estimation unit 132 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted a document including the morpheme.
  • Like the topic feature word estimation unit 131, the topic hash tag estimation unit 132 calculates an IDF value (reverse appearance frequency) for each morpheme. Because the IDF values calculated and used by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132 are the same for each morpheme, the IDF value calculated by either unit may be used by the other.
  • For each tag other than the default topic tags, the topic hash tag estimation unit 132 calculates a TF value (appearance frequency) for each morpheme from the morphemes included in the tag documents (the morphemes to which the hash tag is assigned), using the following quantities.
  • $j$ is a subscript indicating the hash tag.
  • $n_{i,j}$ is the number of unique users who have posted a document that includes both the morpheme $i$ and the hash tag $j$.
  • the topic hash tag estimation unit 132 reads the default topic tag stored by the default topic tag storage unit 141, and sets a document including the default topic tag as a topic from a plurality of documents stored by the document storage unit 100. This is a default document extraction means for extracting as a related document (topic document). At this time, based on the posting time, a document for a predetermined period such as the most recent several hours from the time of extraction is acquired.
  • the topic hash tag estimation unit 132 refers to the information stored in the document storage unit 100 and the morpheme storage unit 120, and calculates the appearance frequency of the morpheme in the extracted topic document (topic document group). It is a calculation means.
  • the topic hash tag estimation unit 132 calculates the number of users who have posted a document including the morpheme as the appearance frequency of the morpheme. Further, the topic hash tag estimation unit 132 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme for each morpheme.
  • the topic hash tag estimation unit 132 obtains the TFIDF value (tfidf_{i,j}) of the morpheme i for the topic ID j.
  • since the TFIDF value for the topic ID j that is calculated and used by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132 is the same for each morpheme, the TFIDF value calculated by either unit may be used by the other.
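  • The user-based counting described above can be sketched as follows. This is only an illustration under stated assumptions: the formulas themselves are not reproduced in this text, so the sketch takes the raw user count n_{i,j} as the TF value and the logarithm of the user ratio as the IDF value. The function name `user_based_tfidf`, the `(user_id, hash_tag, morphemes)` tuple layout, and the restriction to one hash tag per post are hypothetical.

```python
import math
from collections import defaultdict


def user_based_tfidf(posts):
    """Return {hash_tag: {morpheme: tfidf}} from (user_id, hash_tag, morphemes) tuples.

    Frequencies are counted in unique users, not in documents or occurrences.
    """
    all_users = set()
    users_by_morpheme = defaultdict(set)      # users who posted morpheme i (for IDF)
    users_by_tag_morpheme = defaultdict(set)  # users who posted morpheme i with tag j (n_{i,j})
    for user_id, tag, morphemes in posts:
        all_users.add(user_id)
        for morpheme in set(morphemes):
            users_by_morpheme[morpheme].add(user_id)
            users_by_tag_morpheme[(tag, morpheme)].add(user_id)

    features = defaultdict(dict)
    for (tag, morpheme), users in users_by_tag_morpheme.items():
        tf = len(users)  # assumed TF: raw unique-user count n_{i,j}
        # assumed IDF: log of (all users / users who posted the morpheme)
        idf = math.log(len(all_users) / len(users_by_morpheme[morpheme]))
        features[tag][morpheme] = tf * idf
    return dict(features)


posts = [
    ("u1", "#drama", ["actor", "today"]),
    ("u2", "#drama", ["actor"]),
    ("u3", "#news", ["today", "election"]),
]
features = user_based_tfidf(posts)
```

  • Counting users rather than occurrences means that one user repeating a word many times does not inflate its weight, which matches the description of the appearance frequency as a number of users.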
  • the topic hash tag estimation unit 132 serves as part of the second topic document determination means by comparing the feature amount of the topic ID calculated as described above with the feature amount of each tag. Specifically, the topic hash tag estimation unit 132 calculates, for each topic ID, the similarity with every hash tag (other than the default topic tag) as a cosine distance using the following formula.
  • A and B are the feature amount of the topic ID and the feature amount of the hash tag, respectively.
  • A_i and B_i are the TFIDF values of each morpheme i.
  • a Jaccard distance or a Euclidean distance may also be used for calculating the similarity between the feature amounts indicated by the appearance frequency of the morphemes.
  • any calculation method can be used as long as the similarity between feature quantities can be calculated.
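  • As an illustration of the comparison described above, the following sketch computes the standard cosine similarity between two feature amounts held as sparse morpheme-to-TFIDF mappings. The dictionary representation and the name `cosine_similarity` are assumptions made for this example; the text only states that A_i and B_i are TFIDF values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two {morpheme: tfidf} feature amounts."""
    # Morphemes missing from one side contribute zero to the dot product.
    dot = sum(value * b.get(morpheme, 0.0) for morpheme, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty feature amount is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


topic_feature = {"actor": 2.0, "today": 1.0}
tag_feature = {"actor": 4.0, "today": 2.0}
unrelated_feature = {"election": 3.0}

similar = cosine_similarity(topic_feature, tag_feature)
dissimilar = cosine_similarity(topic_feature, unrelated_feature)
```

  • Two feature amounts that weight the same morphemes in the same proportions score close to 1, and feature amounts with no morpheme in common score 0.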
  • the topic hash tag estimation unit 132 determines, for each topic ID, whether there is a hash tag whose similarity is equal to or higher than a preset threshold, and treats each such hash tag as a tag related to the topic of the topic ID. By performing this process on all topic IDs, hash tags (similar hash tags) related to the topic of each topic ID can be extracted.
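  • The threshold step can be sketched as below, assuming the similarities have already been computed. The nested mapping layout, the function name `select_extended_tags`, and the sample tags and values are hypothetical.

```python
def select_extended_tags(similarities, threshold):
    """Keep, per topic ID, the hash tags whose similarity is at least the threshold.

    similarities: {topic_id: {hash_tag: similarity}} (assumed layout).
    """
    return {
        topic_id: sorted(tag for tag, sim in by_tag.items() if sim >= threshold)
        for topic_id, by_tag in similarities.items()
    }


similarities = {
    "T1": {"#aaa": 0.82, "#bbb": 0.10},
    "T2": {"#aaa": 0.05, "#ccc": 0.60},
}
extended = select_extended_tags(similarities, threshold=0.5)
```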
  • the topic hash tag estimation unit 132 outputs information indicating a hash tag (extended topic hash tag) related to the topic of the topic ID to the extended topic hash tag storage unit 143 for storage.
  • FIG. 6 shows a sample format of the extended topic hash tag stored in the extended topic hash tag storage unit 143.
  • the extended topic hash tag data stored in the extended topic hash tag storage unit 143 is held for each extended topic hash tag; in the data related to one extended topic hash tag, a topic ID, a creation date, and a hash tag are associated with each other.
  • One row of data shown in FIG. 6 corresponds to data related to one extended topic hash tag.
  • the topic ID is a topic ID of a topic related to the extended topic hash tag.
  • the creation date is the time when this data was created.
  • the hash tag is an extended topic hash tag. The information generated by the topic hash tag estimation unit 132 has been described above.
  • the topic ID assigning unit 150 is a topic document extracting unit that extracts a document related to a topic using information stored in the topic tag storage unit 140.
  • the topic ID assigning unit 150 is a topic document extracting unit that extracts a document related to a topic even when the default topic tag is not included in the document.
  • the topic ID assigning unit 150 first acquires the document stored by the document storage unit 100. At this time, based on the posting time, a document for a predetermined period such as the most recent several hours from the time of acquisition is acquired. To extract a document related to a topic based on information stored in the default topic tag storage unit 141, the following is performed.
  • the topic ID assigning unit 150 reads the default topic tag stored by the default topic tag storage unit 141, determines whether or not the acquired document includes the default topic tag, and assigns the topic ID related to the default topic tag to a document that includes it.
  • the topic ID assigning unit 150 reads the information on the feature amount (the TFIDF value (score) of each morpheme for each topic ID) stored by the topic feature word estimation unit 131, and calculates the score of each document for each topic ID from the feature amount information. In this respect it is a score calculation means.
  • the topic ID assigning unit 150 determines whether the score assignment target document includes a morpheme (feature word) related to the feature amount.
  • the topic ID assigning unit 150 adds up the scores of feature words included in the document.
  • when a feature word appears multiple times in a document, the score of the document may be calculated in the same manner as when the feature word appears once. That is, the score of the same feature word is not counted multiple times.
  • the score of the feature word “Today” is 1.0
  • the topic ID assigning unit 150 is a first topic document determination unit that determines, based on the calculated score, whether the document to which the score relates is a document related to the topic. Specifically, the topic ID assigning unit 150 determines whether the score is equal to or greater than a preset threshold value. If the score is equal to or greater than the threshold value, the topic ID assigning unit 150 determines that the document is a document related to the topic and assigns the topic ID to it. This process is repeated for each topic ID related to the feature amount stored in the topic feature word estimation unit 131, and a topic ID is assigned.
  • the topic ID assigning unit 150 reads the extended topic hash tag stored by the extended topic hash tag storage unit 143, and determines whether or not the acquired document includes the extended topic hash tag (that is, whether or not the acquired document is a document related to the topic). In this respect it is a second topic document determination means.
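  • A minimal sketch of this score-based determination follows. Each feature word contributes its score once, however often it appears, and a topic ID is assigned when the total reaches the threshold. The function names, the topic IDs, the threshold of 3.0, and every score other than the 1.0 given for the feature word "today" are hypothetical.

```python
def document_score(morphemes, feature_scores):
    """Sum the scores of the feature words present, counting each word once."""
    return sum(feature_scores[m] for m in set(morphemes) if m in feature_scores)


def assign_topic_ids(morphemes, features_by_topic, threshold):
    """Return the topic IDs whose score for this document is at least the threshold."""
    return sorted(
        topic_id
        for topic_id, feature_scores in features_by_topic.items()
        if document_score(morphemes, feature_scores) >= threshold
    )


features_by_topic = {
    "T1": {"today": 1.0, "actor": 2.5},
    "T2": {"election": 3.0},
}
# "today" appears twice but is scored only once.
doc_morphemes = ["today", "actor", "today", "funny"]
score_t1 = document_score(doc_morphemes, features_by_topic["T1"])
topic_ids = assign_topic_ids(doc_morphemes, features_by_topic, threshold=3.0)
```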
  • the topic ID assigning unit 150 assigns a topic ID related to the default topic tag to a document that includes the extended topic hash tag.
  • this assignment of a topic ID is repeated for each topic ID related to the extended topic hash tags stored in the extended topic hash tag storage unit 143.
  • the topic ID assigning unit 150 outputs the document to which the topic ID is assigned to the noise removing unit 190.
  • noise is removed from the documents stored in the document storage unit 100. That is, it is determined whether or not a document stored in the document storage unit 100 is inappropriate as a document related to a topic. If it is determined that the document is inappropriate, the document is excluded from the related documents.
  • On Twitter, it is common to attach a hash tag in order to share a tweet on a specific topic, but some users post their comments with the hash tags of multiple independent topics. In this case the post is made to multiple topics, and its content is only very weakly related to each individual topic. In the case of television, many such posts are criticism of politics or criticism of broadcasting stations. It is important to filter out this noise when extracting documents related to a topic with high accuracy.
  • the following configuration is for removing noise from a document.
  • the blacklist hash tag extension unit 160 is a means for generating information used to determine whether each document corresponds to noise, that is, whether each document is related to a specific topic that is inappropriate for extraction.
  • the blacklist hash tag extension unit 160 generates the above information using the information stored in the blacklist tag storage unit 170 and stores the information in the blacklist tag storage unit 170.
  • the black list tag storage unit 170 will be described.
  • the black list tag storage unit 170 includes a default black list morpheme storage unit 171, a default black list hash tag storage unit 172, and an extended black list hash tag storage unit 173.
  • the default blacklist morpheme storage unit 171 is a means for inputting and storing blacklist morphemes.
  • a blacklist morpheme is a morpheme whose presence in a document means that the document should be excluded.
  • the black list morpheme is registered in advance by, for example, an administrator of the related document extraction device 10.
  • FIG. 7A shows a sample format of the black list morpheme stored in the default black list morpheme storage unit 171. As shown in FIG. 7A, one line of data corresponds to data relating to one black list morpheme, and is stored for each black list morpheme.
  • the default blacklist hash tag storage unit 172 is a default topic tag storage unit that stores in advance a blacklist hash tag that is a default topic tag indicating an inappropriate topic.
  • the blacklist hash tag is a tag related to a topic for which a related document is to be excluded, and is registered in advance by an administrator of the related document extraction apparatus 10, for example. Documents containing blacklist hash tags are excluded as documents related to inappropriate topics. This exclusion is performed by character string matching.
  • the black list hash tag is, for example, a hash tag.
  • FIG. 7B shows a sample format of the black list hash tag stored in the default black list hash tag storage unit 172. As shown in FIG. 7B, one line of data corresponds to data related to one black list hash tag, and is stored for each black list hash tag.
  • the information stored in the extended blacklist hash tag storage unit 173 is information input from the blacklist hash tag extension unit 160, and will be described later.
  • the black list hash tag extension unit 160 uses the black list hash tag stored in the default black list hash tag storage unit 172 to generate information used for determining whether or not each document is a document to be excluded (a document related to a topic to be excluded). This information is for determining that a document is to be excluded even though the document does not contain a blacklist hash tag.
  • the blacklist hash tag extension unit 160 is a means for estimating feature words related to the blacklist hashtag.
  • the characteristic word of the blacklist hash tag is a morpheme that appears characteristically in a document including the blacklist hashtag.
  • the black list hash tag extension unit 160 reads the black list hash tag stored in the default black list hash tag storage unit 172 and, from the plurality of documents stored in the document storage unit 100, extracts each document that includes the black list hash tag as a document to be excluded. In this respect it is a default document extracting means. At this time, based on the posting time, documents for a predetermined period, such as the most recent several hours before the time of extraction, are acquired.
  • the blacklist hash tag extension unit 160 refers to the information stored in the document storage unit 100 and the morpheme storage unit 120, and calculates the appearance frequency of each morpheme in the extracted documents (document group) to be excluded. In this respect it is a first appearance frequency calculation means. At this time, the blacklist hash tag expansion unit 160 calculates the number of users who have posted a document including the morpheme as the appearance frequency of the morpheme. Also, the blacklist hash tag expansion unit 160 calculates, for each morpheme, the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.
  • the blacklist hash tag extension unit 160 extracts morphemes (features) characteristic of documents (document groups) to be excluded from the above values.
  • the black list hash tag extension unit 160 generates a feature amount that is information describing the feature of the target topic for each black list hash tag.
  • the blacklist hash tag extension unit 160 calculates an IDF value (reverse appearance frequency) for each morpheme from the morphemes included in each document by the following formula.
  • i is a subscript indicating a morpheme. The formula uses the total number of unique users and the number of unique users who have posted a document containing the morpheme i. Since the IDF value is the same as that calculated by the topic tag estimation unit 130, the IDF value calculated by the topic tag estimation unit 130 may be used instead.
  • the blacklist hash tag extension unit 160 calculates, by the following formula, the TF value (appearance frequency) for each morpheme from the morphemes included in each extracted document to be excluded (the morphemes to which the blacklist hash tag is added), for each blacklist hash tag.
  • j is a subscript indicating a black list hash tag.
  • n_{i,j} is the number of unique users who have posted a document related to the black list hash tag j including the morpheme i (a document including the black list hash tag j).
  • the TFIDF value may be weighted in the same manner as described above.
  • the blacklist hash tag expansion unit 160 outputs the calculated TFIDF value of each morpheme for each blacklist hashtag to the blacklist tag storage unit 170 for storage.
  • only the morphemes (feature words) having a TFIDF value equal to or greater than a preset threshold value may be stored in the blacklist tag storage unit 170.
  • the black list hash tag extension unit 160 is a means for estimating a hash tag included in a document to be excluded other than the black list hash tag.
  • the black list hash tag extension unit 160 reads the black list hash tag stored by the default black list hash tag storage unit 172 and, from the plurality of documents stored by the document storage unit 100, extracts the documents (document group) that include a tag other than the black list hash tag, that is, a hash tag that is a candidate for a hash tag included in a document to be excluded. In this respect it is a tag document extraction means. At this time, based on the posting time, documents for a predetermined period, such as the most recent several hours before the time of extraction, are acquired.
  • the blacklist hash tag extension unit 160 refers to the information stored in the document storage unit 100 and the morpheme storage unit 120, and calculates the appearance frequency of each morpheme in the extracted documents (document group). In this respect it is a second appearance frequency calculation means. At this time, the blacklist hash tag expansion unit 160 calculates the number of users who have posted a document including the morpheme as the appearance frequency of the morpheme. Also, the blacklist hash tag expansion unit 160 calculates, for each morpheme, the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.
  • the blacklist hash tag extension unit 160 calculates an IDF value (reverse appearance frequency) for each morpheme in the same manner as described above. Note that the blacklist hash tag extension unit 160 may use the TF value calculated above or by the topic tag estimation unit 130.
  • the blacklist hash tag extension unit 160 calculates, by the following formula, the TF value (appearance frequency) for each morpheme from the morphemes included in each tag document (the morphemes to which the hash tag is assigned), for each tag other than the blacklist hash tag.
  • j is a subscript indicating a hash tag
  • n_{i,j} is the number of unique users who have posted a document including the morpheme i and including the hash tag j.
  • the blacklist hash tag expansion unit 160 may use the TF value calculated by the topic tag estimation unit 130.
  • the blacklist hash tag extension unit 160 reads the blacklist hash tag stored by the default blacklist hash tag storage unit 172 and, from the plurality of documents stored by the document storage unit 100, extracts each document that includes the blacklist hash tag as a document to be excluded. In this respect it is a default document extraction means. At this time, based on the posting time, documents for a predetermined period, such as the most recent several hours before the time of extraction, are acquired.
  • the blacklist hash tag extension unit 160 refers to the information stored in the document storage unit 100 and the morpheme storage unit 120, and calculates the appearance frequency of each morpheme in the extracted documents (document group) to be excluded. In this respect it is a first appearance frequency calculation means.
  • the blacklist hash tag expansion unit 160 calculates the number of users who have posted a document including the morpheme as the appearance frequency of the morpheme. Also, the blacklist hash tag expansion unit 160 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme for each morpheme.
  • the blacklist hash tag extension unit 160 obtains the TFIDF value (tfidf_{i,j}) of the morpheme i for the blacklist hash tag j as described above. Since the TFIDF value for the blacklist hash tag j is the same value for each morpheme, the TFIDF value calculated above may be used.
  • the blacklist hash tag extension unit 160 serves as part of the second topic document determination means by comparing the feature amount of the blacklist hash tag calculated as described above with the feature amount of each hash tag. Specifically, the blacklist hash tag extension unit 160 calculates, for each blacklist hash tag, the similarity with every hash tag (other than the blacklist hash tag) as a cosine distance using the following formula.
  • A and B are the feature amount of the blacklist hash tag and the feature amount of the hash tag, respectively.
  • A_i and B_i are the TFIDF values of each morpheme i.
  • a Jaccard distance or a Euclidean distance may also be used for calculating the similarity between the feature amounts indicated by the appearance frequency of the morphemes.
  • any calculation method can be used as long as the similarity between feature quantities can be calculated.
  • the blacklist hash tag extension unit 160 determines, for each blacklist hash tag, whether there is a hash tag whose similarity is equal to or higher than a preset threshold, and treats each such hash tag as a hash tag related to documents to be excluded. By performing this process on all the blacklist hash tags, it is possible to extract hash tags related to documents to be excluded.
  • the black list hash tag extension unit 160 outputs information indicating the extracted hash tag (extended black list hash tag) related to the document to be excluded to the extended black list hash tag storage unit 173 for storage.
  • FIG. 7C shows a sample format of the extended blacklist hash tag stored in the extended blacklist hash tag storage unit 173. As shown in FIG. 7C, one line of data corresponds to data related to one extended black list hash tag, and is stored for each extended black list hash tag.
  • the blacklist user storage unit 180 is a means for inputting and storing a blacklist user ID indicating a blacklist user.
  • a blacklist user is a user whose posted documents should be excluded.
  • the blacklist user ID is registered in advance by, for example, the administrator of the related document extraction device 10.
  • FIG. 7D shows a sample format of the black list user ID stored in the black list user storage unit 180. As shown in FIG. 7D, one line of data corresponds to data related to one black list user ID, and is stored for each black list user ID. Any information other than the user ID may be used as long as the information can recognize the blacklist user.
  • the noise removing unit 190 determines whether or not the document input from the topic ID assigning unit 150 is an inappropriate document (a document related to an inappropriate topic) and, if so, excludes the document. In this way it performs the function of an exclusion means. Specifically, the noise removal unit 190 has the following functions.
  • the noise removing unit 190 reads the black list morpheme from the default black list morpheme storage unit 171 and determines whether or not the black list morpheme is included in the document input from the topic ID assigning unit 150. This determination is performed by matching a character string between a document and a blacklist morpheme. If the noise removing unit 190 determines that the black list morpheme is included in the document, the noise removing unit 190 excludes the document as an inappropriate document to be excluded.
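  • The character-string matching described above can be sketched as a plain substring test. The function name and the sample strings are hypothetical, and a real system might match on the morphemes obtained by morphological analysis instead of on raw text.

```python
def contains_blacklist_morpheme(text, blacklist_morphemes):
    """Return True if any blacklist morpheme occurs in the text as a substring."""
    return any(morpheme in text for morpheme in blacklist_morphemes)


blacklist_morphemes = ["spamword", "badword"]
documents = [
    "watching the drama now #aaa",
    "buy now spamword #aaa",
]
# Documents containing a blacklist morpheme are excluded.
kept = [d for d in documents if not contains_blacklist_morpheme(d, blacklist_morphemes)]
```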
  • the noise removing unit 190 determines whether the document input from the topic ID assigning unit 150 was posted by taking over (reposting) another document or as a reply to another document. Specifically, the noise removing unit 190 determines whether the document is an RT (retweet) or a reply tweet. Whether a document is an RT can be determined through the official Twitter API. The determination may also be made by text analysis. Specifically, it can easily be determined by checking whether the document includes the character string "RT" or a user name. If the noise removal unit 190 determines that the document was posted by taking over another document or as a reply to another document, the noise removal unit 190 excludes the document as an inappropriate document that should be excluded.
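  • The text-analysis variant can be sketched with two regular expressions. This is an assumption-laden illustration: it treats a standalone "RT" token as the retweet marker and an "@name" token as the user name, and the pattern names and function name are hypothetical. It does not use the Twitter API.

```python
import re

# "RT" as its own token, so that words such as "START" do not match.
RT_PATTERN = re.compile(r"(?:^|\s)RT\b")
# "@name" at the start of the text or after whitespace, so e-mail addresses do not match.
MENTION_PATTERN = re.compile(r"(?:^|\s)@\w+")


def is_retweet_or_reply(text):
    """Return True if the text looks like a retweet or a reply to another user."""
    return bool(RT_PATTERN.search(text) or MENTION_PATTERN.search(text))


tweets = [
    "RT @someone: great episode #aaa",
    "@friend thanks for the tip",
    "START of the drama #aaa",
]
kept = [t for t in tweets if not is_retweet_or_reply(t)]
```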
  • the noise removal unit 190 performs multi-post determination.
  • Multi-posting refers to posting on multiple topics. That is, it is determined whether the document is a document related to a plurality of topics. For example, when each broadcast station is set as one topic, a document containing the hash tags #fff and #zzz, each of which relates to a broadcast station, is considered a post made to a plurality of broadcast stations.
  • the noise removal unit 190 determines whether or not the document is multi-posted by determining whether the document input from the topic ID assigning unit 150 has been given a plurality of topic IDs by the topic ID assigning unit 150.
  • if the noise removing unit 190 determines that the document is multi-posted, the noise removing unit 190 excludes the document as an inappropriate document to be excluded.
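  • The multi-post determination amounts to counting the distinct topic IDs given to a document. In the sketch below, the dictionary layout with `text` and `topic_ids` keys and the function name are hypothetical.

```python
def remove_multi_posts(documents):
    """Drop every document that has been given more than one distinct topic ID."""
    return [doc for doc in documents if len(set(doc["topic_ids"])) <= 1]


documents = [
    {"text": "good show #fff", "topic_ids": ["T1"]},
    {"text": "comment #fff #zzz", "topic_ids": ["T1", "T2"]},
]
kept = remove_multi_posts(documents)
```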
  • the noise removing unit 190 determines whether the document input from the topic ID assigning unit 150 has been posted by the blacklist user.
  • the noise removal unit 190 reads the user ID of the black list user from the black list user storage unit 180 and compares the user ID of the user who posted the document input from the topic ID adding unit 150 with the user ID of the black list user. If they match, it is determined that the document has been posted by the blacklist user. If the noise removing unit 190 determines that the document has been posted by the blacklist user, the noise removing unit 190 excludes the document as an inappropriate document to be excluded.
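  • The comparison of user IDs can be sketched as a set membership test. The `user_id` key, the function name, and the sample IDs are hypothetical.

```python
def remove_blacklisted_users(documents, blacklist_user_ids):
    """Drop documents whose poster's user ID matches a blacklist user ID."""
    blocked = set(blacklist_user_ids)  # set lookup keeps each check constant-time
    return [doc for doc in documents if doc["user_id"] not in blocked]


documents = [
    {"user_id": "u1", "text": "nice episode #aaa"},
    {"user_id": "u9", "text": "advertisement #aaa"},
]
kept = remove_blacklisted_users(documents, blacklist_user_ids=["u9"])
```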
  • the noise removal unit 190 uses the information stored in the blacklist tag storage unit 170 to determine whether the document input from the topic ID adding unit 150 is an inappropriate document. In particular, the noise removal unit 190 determines and excludes inappropriate documents that do not contain blacklist hash tags.
  • the noise removing unit 190 reads the black list hash tag stored by the default black list hash tag storage unit 172, determines whether or not the document includes this default topic tag, and excludes a document that includes it as an inappropriate document that should be excluded.
  • the noise removal unit 190 reads the information on the feature amount (the TFIDF value (score) of each morpheme for each default topic tag) stored by the blacklist tag storage unit 170, and calculates the score of each document for each default topic tag from the feature amount information. In this respect it is a score calculation means.
  • the noise removal unit 190 determines whether the score assignment target document includes a morpheme (feature word) related to the feature amount.
  • the noise removing unit 190 adds up the scores of the feature words included in the document. As in the score calculation by the topic ID assigning unit 150, when a feature word appears multiple times in the document, the score of the document may be calculated in the same manner as when it appears once.
  • the noise removal unit 190 is a first topic document determination unit that determines, based on the calculated score, whether or not the document to which the score relates is an inappropriate document that should be excluded. Specifically, the noise removal unit 190 determines whether or not the score is equal to or greater than a preset threshold value. If the score is equal to or greater than the threshold value, the noise removal unit 190 determines that the document is an inappropriate document that should be excluded, and excludes it. This process is repeated for the number of blacklist hash tags related to the feature amounts stored in the blacklist tag storage unit 170.
  • the noise removing unit 190 reads the extended blacklist hash tag stored by the extended blacklist hash tag storage unit 173, and determines whether the acquired document includes the extended blacklist hash tag. In this respect it is a second topic document determination means for determining whether a document is an inappropriate document to be excluded. The noise removing unit 190 determines that a document including the extended blacklist hash tag is an inappropriate document to be excluded, and excludes the document. This exclusion is repeated for each extended blacklist hash tag stored in the extended blacklist hash tag storage unit 173.
  • the noise removal unit 190 outputs the documents that are not excluded by the above process to the topic document storage unit 200. Further, the documents excluded by the noise removing unit 190 may be left out of the processing by the topic tag estimating unit 130. For example, the documents stored in the document storage unit 100 and the morphemes stored in the morpheme storage unit 120 may each be associated with information indicating whether they relate to a document removed by the noise removal unit 190, and those relating to a removed document may be withheld from input to the topic tag estimation unit 130.
  • the topic document storage unit 200 is a means for inputting and storing a document that is input from the noise removal unit 190 and is assigned with one topic ID.
  • the document with the topic ID is extracted as a document related to the topic related to the topic ID.
  • FIG. 8 shows a sample format of a document stored in the topic document storage unit 200.
  • the data related to the document stored in the topic document storage unit 200 is associated with the topic ID in addition to the data related to the document stored in the document storage unit 100.
  • the document with the topic ID stored in the topic document storage unit 200 is provided to the user as a document related to the topic for each topic ID, for example.
  • the functional configuration of the related document extraction apparatus 10 has been described above.
  • FIG. 9 shows the hardware configuration of the related document extraction apparatus 10.
  • the related document extraction apparatus 10 is configured as a computer including hardware such as a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a communication module 1004 for communication, and an auxiliary storage device 1005 such as a hard disk.
  • the functions of the related document extracting apparatus 10 described above are exhibited by the operation of these components by a program or the like.
  • the above is the configuration of the related document extraction apparatus 10.
  • FIG. 10 is a flowchart showing the entire related document extraction method.
  • a plurality of documents to be extracted are input and stored by the document storage unit 100 (S01).
  • the document input to the document storage unit 100 is output to the morphological analysis unit 110.
  • the morpheme analysis unit 110 performs morpheme analysis on the document, and the document is divided into morphemes (S02, word acquisition step).
  • Information indicating the morpheme obtained by the morpheme analysis by the morpheme analysis unit 110 is stored in the morpheme storage unit 120.
  • the topic tag estimation unit 130 generates, from the documents stored in the document storage unit 100, the morphemes obtained by the morpheme analysis unit 110, and the information stored in the topic tag storage unit 140, information used to determine whether each document is related to a specific topic (S03). This processing is performed by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132, respectively.
  • the topic feature word estimation unit 131 reads the default topic tag stored by the default topic tag storage unit 141 and, from the plurality of documents stored by the document storage unit 100, extracts each document that includes the default topic tag as a document (topic document) related to the topic (S301, default document extraction step).
  • a feature amount is generated for each topic (S302, first appearance frequency calculation step). This process will be described in detail with reference to the flowchart of FIG.
  • an IDF value for each morpheme is calculated (S3021, first appearance frequency calculation step).
  • the TF value for each morpheme is calculated from the morphemes included in each topic document for each topic ID (processing target) (S3022, first appearance frequency calculation step).
  • the TFIDF value of the morpheme in each topic ID is obtained from the calculated IDF value and TF value (S3023, first appearance frequency calculation step). The obtained TFIDF value is a feature amount.
  • the processing of S3022 and S3023 is repeated until the processing for all topic IDs is completed.
  • feature words are stored in the topic feature word storage unit 142 for each topic ID (S303, first appearance frequency calculation step).
  • This process will be described in detail with reference to the flowchart of FIG. This process is performed for each topic ID.
  • For each morpheme, it is determined whether or not the TFIDF value of the morpheme is greater than or equal to a preset threshold value (S3031, first appearance frequency calculation step). When it is determined that the TFIDF value is equal to or greater than the preset threshold, the morpheme and the TFIDF value are output to and stored in the topic feature word storage unit 142 for the topic ID (S3032, first appearance frequency calculation step).
  • when it is determined that the TFIDF value is not greater than or equal to the preset threshold value, no special process is performed and the process moves on to the next morpheme.
  • the above processing is repeated for all morphemes for each topic ID, and is repeated until the processing for all topic IDs is completed.
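  • The nested loop over topic IDs and morphemes can be sketched as below. The in-memory dictionary standing in for the topic feature word storage unit 142, the function name, and the sample values are hypothetical. The step labels in the comments follow the description above.

```python
def store_feature_words(tfidf_by_topic, threshold):
    """Keep, per topic ID, only the morphemes whose TFIDF value reaches the threshold."""
    storage = {}
    for topic_id, tfidf in tfidf_by_topic.items():  # repeated for every topic ID
        kept = {}
        for morpheme, value in tfidf.items():       # repeated for every morpheme
            if value >= threshold:                  # S3031: compare with the threshold
                kept[morpheme] = value              # S3032: store morpheme and TFIDF value
            # otherwise nothing is stored and the loop moves to the next morpheme
        storage[topic_id] = kept
    return storage


tfidf_by_topic = {
    "T1": {"actor": 2.5, "today": 0.2},
    "T2": {"election": 3.0, "the": 0.1},
}
feature_words = store_feature_words(tfidf_by_topic, threshold=1.0)
```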
  • the processing by the topic feature word estimation unit 131 has been described above.
  • the topic hash tag estimation unit 132 reads the default topic tag stored in the default topic tag storage unit 141 and, from the plurality of documents stored in the document storage unit 100, extracts the documents (tag documents) that include a hash tag other than the default topic tag (S311, tag document extraction step).
  • a feature amount is generated for each hash tag (S312, second appearance frequency calculation step). The generation of the feature amount is performed in the same manner as the process described with reference to the flowchart of FIG. However, in this case, the processing loop shown in FIG. 12 is performed for each hash tag, and is repeated until the processing for all hash tags is completed.
  • a document including a default topic tag is extracted as a document (topic document) related to the topic from a plurality of documents stored by the document storage unit 100 (S313, default document extraction step).
  • a feature amount is generated for each topic (S314, first appearance frequency calculation step). The generation of the feature amount is performed in the same manner as the process described with reference to the flowchart of FIG.
  • the feature amount of the topic ID calculated as described above is compared with the feature amount of the hash tag, and the hash tag (extended topic hash tag) related to the topic with the topic ID is expanded based on the comparison result.
  • the data is output and stored in the storage unit 143 (S315, second topic document determination step).
  • the similarity of the feature amount between the topic ID and the hash tag is calculated (S3151, second topic document determination step). As this similarity, for example, a cosine distance is used as described above. Subsequently, it is determined whether or not the calculated similarity is equal to or greater than a preset threshold value (S3152, second topic document determination step). When it is determined that the similarity is equal to or higher than a preset threshold, the hash tag is output and stored in the extended topic hash tag storage unit 143 as an extended topic hash tag for the topic ID (S3153, No. 1). 2-topic document determination step).
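The comparison in S3151 to S3153 can be sketched as below. The sketch assumes that each feature amount is a sparse vector of TF-IDF values keyed by morpheme and that the "cosine distance" is used as a similarity (higher means more alike), which is what the "equal to or greater than" test implies. All function and variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {morpheme: weight}."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def extend_topic_hashtags(topic_features, hashtag_features, threshold):
    """Return {topic_id: [hash tags whose similarity meets the threshold]}."""
    extended = {topic_id: [] for topic_id in topic_features}
    for topic_id, topic_vec in topic_features.items():
        for tag, tag_vec in hashtag_features.items():
            if cosine_similarity(topic_vec, tag_vec) >= threshold:  # S3151, S3152
                extended[topic_id].append(tag)                      # S3153
    return extended


# Illustrative values only.
topics = {"t1": {"drama": 1.0, "finale": 1.0}}
hashtags = {"#YYY": {"drama": 1.0, "finale": 1.0}, "#cats": {"cat": 1.0}}
extended = extend_topic_hashtags(topics, hashtags, threshold=0.5)
```

The same comparison applies unchanged to the blacklist hash tag extension described later, with blacklist hash tags taking the place of topics.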
  • The blacklist hash tag extension unit 160 generates, from the documents stored in the document storage unit 100, the morphemes stored by the morpheme analysis unit 110, and the information stored in the blacklist tag storage unit 170, the information used to determine whether each document is noise, that is, whether each document relates to a specific topic that is inappropriate for extraction (S04).
  • The blacklist hash tag extension unit 160 estimates the feature words related to each blacklist hash tag.
  • The IDF value of each morpheme and the TF value of each morpheme for each blacklist hash tag are calculated, and from these the TFIDF value of each morpheme for each blacklist hash tag is calculated.
  • The calculated TFIDF value of each morpheme for each blacklist hash tag is output to and stored in the blacklist tag storage unit 170.
  • Only morphemes (feature words) whose TFIDF value is equal to or greater than a preset threshold may be stored in the blacklist tag storage unit 170.
  • The blacklist hash tag extension unit 160 estimates the extended blacklist hash tags. This process is described with reference to the flowcharts. As shown in FIG. 16, the blacklist hash tag extension unit 160 reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and extracts, from the plurality of documents stored in the document storage unit 100, documents (tag documents) that include a hash tag other than the blacklist hash tags (S411, tag document extraction step). Subsequently, a feature amount is generated for each hash tag (S412, second appearance frequency calculation step). The feature amount is generated in the same manner as in the flowchart-based process described above, except that the processing loop shown in FIG. 12 is performed for each hash tag and repeated until all hash tags have been processed.
  • Documents including a blacklist hash tag are extracted from the plurality of documents stored in the document storage unit 100 (S414, default document extraction step).
  • A feature amount is generated for each blacklist hash tag (S415, first appearance frequency calculation step). The feature amount is generated in the same manner as in the flowchart-based process described above, except that the processing loop shown in FIG. 12 is performed for each blacklist hash tag and repeated until all blacklist hash tags have been processed.
  • The feature amount of each blacklist hash tag calculated as described above is compared with the feature amount of each hash tag and, based on the comparison result, the hash tags related to the blacklist hash tag (extended blacklist hash tags) are output to and stored in the extended blacklist hash tag storage unit 173 (S415, second topic document determination step).
  • This process is described in more detail with reference to the flowchart. It is performed for each pair of blacklist hash tag and hash tag.
  • The similarity between the blacklist hash tag and the hash tag is calculated (S4151, second topic document determination step). As described above, the cosine distance, for example, is used as this similarity. It is then determined whether the calculated similarity is equal to or greater than a preset threshold (S4152, second topic document determination step). When it is, the hash tag is output to and stored in the extended blacklist hash tag storage unit 173 as an extended blacklist hash tag for the blacklist hash tag (S4153, second topic document determination step).
  • The topic ID assigning unit 150 reads the feature amount information stored by the topic feature word estimation unit 131 and assigns a topic ID to the document based on that information (S501, topic document extraction step).
  • The feature amount information stored by the topic feature word estimation unit 131 is acquired for each topic (topic ID) (S5011, score calculation step).
  • The "score total value" of the document is initialized, that is, set to zero (S5012, score calculation step).
  • For each feature word, it is determined whether the feature word is included in the document (S5013, score calculation step).
  • When the feature word is included in the document, the score (TFIDF value) of the feature word is added to the "score total value" (S5014, score calculation step).
  • When the feature word is not included in the document, its score is not added to the "score total value".
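The score accumulation of S5012 to S5014 can be sketched as follows. The sketch assumes the document has already been split into morphemes and that the feature words of one topic are given with their TFIDF scores; the function name and data layout are hypothetical. Because the test is only whether a feature word is included, a word that occurs several times contributes its score once.

```python
def score_document(doc_morphemes, feature_words):
    """Sum the TF-IDF scores of the topic's feature words found in the document.

    doc_morphemes: list of morphemes of one document
    feature_words: {feature_word: tfidf_score} for one topic (hypothetical layout)
    """
    present = set(doc_morphemes)      # repeated morphemes collapse to one entry
    total = 0.0                       # initialize the score total value (S5012)
    for word, tfidf in feature_words.items():
        if word in present:           # is the feature word in the document? (S5013)
            total += tfidf            # add its score (S5014)
        # otherwise nothing is added
    return total


# Illustrative values only.
features = {"drama": 0.8, "finale": 0.4}
score = score_document(["drama", "drama", "tonight"], features)
```

How the total is turned into a decision to assign the topic ID, for example by comparing it with a threshold, is not specified in this passage and is left out of the sketch.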
  • The topic ID assigning unit 150 reads the default topic tags stored in the default topic tag storage unit 141 and the extended topic hash tags stored in the extended topic hash tag storage unit 143, and assigns a topic ID to the document based on that information (S502, topic document extraction step (second topic document determination step)).
  • This process is described in more detail with reference to the flowchart. It is performed for each document and each topic.
  • The default topic tags and the extended topic hash tags associated with the topic are acquired (S5021).
  • It is determined whether each default topic tag and each extended topic hash tag is included in the document (S5022, second topic document determination step).
  • When the tag is included in the document, the topic ID of the topic is assigned to the document (S5023, second topic document determination step).
  • When the tag is not included in the document, the topic ID of the topic is not assigned to the document.
  • The above processing (S5022, S5023) is performed for all default topic tags and extended topic hash tags associated with the topic. It is repeated for all topics for each document until the processing for all documents is completed.
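The tag-based assignment of S5021 to S5023 can be sketched as below, assuming the hash tags of each document have already been extracted and that the default and extended tags of each topic are merged into one collection. The names are hypothetical.

```python
def assign_topics_by_tag(doc_hashtags, tags_by_topic):
    """Return the topic IDs whose default or extended hash tags occur in the document.

    doc_hashtags:  iterable of hash tags found in one document
    tags_by_topic: {topic_id: collection of default and extended topic hash tags}
    """
    doc_tags = set(doc_hashtags)
    assigned = []
    for topic_id, tags in tags_by_topic.items():   # acquire the tags of each topic (S5021)
        if doc_tags & set(tags):                   # is any of them in the document? (S5022)
            assigned.append(topic_id)              # assign the topic ID (S5023)
    return assigned


# Illustrative values only.
tags_by_topic = {"t1": {"#zzz", "#YYY"}, "t2": {"#news"}}
assigned = assign_topics_by_tag(["#YYY"], tags_by_topic)
```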
  • The document to which a topic ID has been assigned by the topic ID assigning unit 150 is output to the noise removing unit 190.
  • The noise removing unit 190 determines whether the document input from the topic ID assigning unit 150 is an inappropriate document and, if so, excludes it (S601, topic document extraction step).
  • The blacklist morphemes (NG words) are read from the default blacklist morpheme storage unit 171, and it is determined whether a blacklist morpheme is included in the document (S601). When a blacklist morpheme is included, the document is excluded as an inappropriate document (no subsequent processing is performed on it).
  • When no blacklist morpheme is included in the document, it is then determined whether the document is an RT or a reply tweet (S602). When the document is an RT or a reply tweet, it is excluded as an inappropriate document (no subsequent processing is performed on it).
  • When the document is neither an RT nor a reply tweet, the user IDs of the blacklist users are read from the blacklist user storage unit 180, and it is determined whether the document was posted by a blacklist user (S604). When the document was posted by a blacklist user, it is excluded as an inappropriate document (no subsequent processing is performed on it).
  • When the document was not posted by a blacklist user, the feature amounts (the TFIDF value (score) of each morpheme for each default topic tag) stored in the blacklist tag storage unit 170 and the extended blacklist hash tags stored in the extended blacklist hash tag storage unit 173 are read and, based on these, it is determined in the manner described above whether the document is an inappropriate document that should be excluded (S605). When the document is determined to be inappropriate, it is excluded (no further processing is performed). Otherwise, the document is output from the noise removing unit 190 to the topic document storage unit 200.
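The order of checks performed by the noise removing unit 190 can be sketched as a chain of early exits. This is only an illustration under several assumptions: the document is a dictionary with the hypothetical fields `morphemes`, `hashtags`, `user_id`, `is_retweet` and `is_reply`, and the final check (S605) is reduced to a test for a blacklist hash tag, whereas the description also uses the stored blacklist feature amounts.

```python
def is_noise(doc, ng_words, blacklist_users, blacklist_hashtags):
    """Return True if the document should be excluded as inappropriate."""
    if set(doc["morphemes"]) & ng_words:            # contains a blacklist morpheme (NG word)
        return True
    if doc["is_retweet"] or doc["is_reply"]:        # RT or reply tweet
        return True
    if doc["user_id"] in blacklist_users:           # posted by a blacklist user
        return True
    if set(doc["hashtags"]) & blacklist_hashtags:   # simplified stand-in for S605
        return True
    return False                                    # passes on to topic document storage


# Illustrative values only.
ng_words = {"spamword"}
blacklist_users = {"bot42"}
blacklist_hashtags = {"#ad"}
clean_doc = {"morphemes": ["drama", "finale"], "hashtags": ["#YYY"],
             "user_id": "u1", "is_retweet": False, "is_reply": False}
reply_doc = dict(clean_doc, is_reply=True)
kept = [d for d in (clean_doc, reply_doc)
        if not is_noise(d, ng_words, blacklist_users, blacklist_hashtags)]
```

Ordering the cheap checks first means a document rejected early never reaches the later lookups, which matches the "no subsequent processing is performed" wording of each step.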
  • The topic document storage unit 200 stores the input document together with its assigned topic ID.
  • The above is the processing executed by the related document extraction apparatus 10 according to the present embodiment.
  • The above processing may be triggered by an operation by the administrator of the related document extraction apparatus 10, or may be performed, for example, at every preset time interval.
  • In the above description, the assignment of a topic ID to a document and the generation of the information used for that assignment (the feature amounts and the extended topic hash tags) are presented as one series of processing, but these processes may be performed independently of each other and at different timings.
  • In this embodiment, documents related to a topic are extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches that appearance frequency.
  • Documents related to a specific topic can therefore be appropriately extracted from a plurality of documents such as tweets, which makes it possible to extract the documents related to the topic exhaustively.
  • Topic hash tags and topic feature words can be estimated dynamically, so documents related to a topic can be extracted in real time.
  • A document score may be calculated based on the feature words in order to extract documents.
  • According to this configuration, a document containing words that appear frequently in documents including a default topic tag can be extracted as a document related to the topic, so documents related to a specific topic can be reliably extracted. Documents without a hash tag can thus be extracted, and the number of extractable documents increases.
  • When a word appears multiple times in a document, the score of the document may be calculated in the same manner as when the word appears once. According to this configuration, the score of a document is prevented from being inflated by words that occur repeatedly in it, which avoids extracting an inappropriate document as a document related to the topic.
  • Topic hash tags may be extended, and documents extracted, by comparing the feature amounts of the tag documents and the topic documents.
  • According to this configuration, a document (or group of documents) including a tag other than the default topic tag can be extracted as related to the topic, so documents related to a specific topic can be reliably extracted.
  • Tags can be estimated dynamically, and the number of extractable documents increases.
  • A hash tag is attached by a poster who writes the document with a specific topic in mind. Since topics and topic hash tags have a one-to-N relationship, more topic documents can be extracted by picking up as many of the hash tags associated with a topic as possible. For example, users often post tweets about a broadcast program with the broadcast station hash tag, but a well-known program also has a hash tag of its own. By comparing the feature amounts of the topic and of the hash tags, hash tags related to the program being broadcast can be detected dynamically and early.
  • Noise may be removed as in the present embodiment.
  • Noise removal is important for document extraction. According to this configuration, inappropriate documents can be excluded so that, for example, they are not presented to the user. Further, if topic hash tags and topic feature words are estimated from a document group from which noise has been removed, the estimation accuracy improves. As described above, the feature amount serves as the reference value representing the topic when topic hash tags and topic feature words are estimated, so the quality of the estimated information falls as the noise in the data increases. Cleansing the seed data is therefore important. In addition, documents related to a topic can be extracted free of noise. Noise can also be removed more appropriately by removing it dynamically, in the same manner as document extraction. Further, since the blacklist is extended automatically in real time, there is less need to register blacklist entries manually.
  • The appearance frequency may be counted in units of users, as in this embodiment.
  • According to this configuration, the influence of each user can be made uniform; for example, the influence of one user posting a document with the same content several times can be suppressed. Documents related to a specific topic can thereby be appropriately extracted.
  • The appearance frequency may instead be counted in units of documents. That is, the IDF value or the TF value may be calculated by counting documents.
  • The popularity and rarity of a morpheme can be expressed by representing the feature amount per morpheme using the TFIDF value, as in this embodiment.
  • Documents related to a specific topic can thereby be more appropriately extracted from a plurality of documents such as tweets.
  • Documents related to multiple topics may be excluded.
  • A document posted to multiple topics is often not related to any of them. According to this configuration, it is therefore possible to avoid extracting an inappropriate document as a document related to a topic.
  • The related document extraction program 40 is stored in a program storage area 31 formed on a recording medium 30 that is inserted into a computer and accessed, or that is provided in the computer.
  • The related document extraction program 40 comprises a document storage module 400, a morpheme analysis module 410, a morpheme storage module 420, a topic tag estimation module 430, a topic tag storage module 440, a topic ID assignment module 450, a blacklist hash tag extension module 460, a blacklist tag storage module 470, a blacklist user storage module 480, a noise removal module 490, and a topic document storage module 500.
  • The functions realized by executing the document storage module 400, the morpheme analysis module 410, the morpheme storage module 420, the topic tag estimation module 430, the topic tag storage module 440, the topic ID assignment module 450, the blacklist hash tag extension module 460, the blacklist tag storage module 470, the blacklist user storage module 480, the noise removal module 490, and the topic document storage module 500 are the same as those of the corresponding units of the related document extraction apparatus 10 described above, beginning with the document storage unit 100.
  • A part or all of the related document extraction program 40 may be transmitted via a transmission medium such as a communication line and received and recorded (including installed) by another device.
  • The modules of the related document extraction program 40 may be installed on any of a plurality of computers instead of on a single computer. In that case, the series of related document extraction processes described above is performed by the computer system made up of the plurality of computers.
  • ... noise removal unit, 200 ... topic document storage unit, 1001 ... CPU, 1002 ... RAM, 1003 ... ROM, 1004 ... communication module, 1005 ... auxiliary storage device, 30 ... recording medium, 31 ... program storage area, 40 ... related document extraction program, 400 ... document storage module, 410 ... morpheme analysis module, 420 ... morpheme storage module, 430 ... topic tag estimation module, 440 ... topic tag storage module, 450 ... topic ID assignment module, 460 ... blacklist hash tag extension module, 470 ... blacklist tag storage module, 480 ... blacklist user storage module, 490 ... noise removal module, 500 ... topic document storage module.

Abstract

In the present invention, documents relevant to a specific topic are suitably extracted from a plurality of documents such as tweets. A relevant document extraction device (10) is provided with the following: a default topic tag storage unit (141) that stores a default topic tag indicating a topic; a document storage unit (100) that stores a plurality of documents; a morpheme analysis unit (110) that divides documents into morphemes; a topic tag estimation unit (130) that extracts documents that include the default topic tag from the plurality of documents and calculates the frequency of appearance of terms in the extracted documents; and a topic ID assigning unit (150) that extracts documents relevant to the topic using information based on the calculated frequency of appearance.

Description

Related document extraction apparatus, related document extraction method, and related document extraction program
 The present invention relates to a related document extraction apparatus, a related document extraction method, and a related document extraction program for extracting documents related to a specific topic from a plurality of documents.
 In recent years, communication through microblogs (miniblogs) such as Twitter has become common (see, for example, Patent Document 1). A microblog is an information service for posting short texts of several tens to a hundred and several tens of characters; on Twitter, users post short documents of up to 140 characters called tweets. The content posted as tweets varies: for example, updates on one's own recent situation, sharing of news articles of interest, replies to acquaintances' tweets, and comments on a specific topic such as a television program. Because users can also share information with others by posting a comment together with the URL of, for example, a news article of interest, microblogs are widely used not only as a tool for keeping up with friends but also as an information-gathering tool.
 When tweeting about a specific topic, users tend to attach a hashtag to the tweet. A major topic is often associated with one or more hashtags. For example, when television programs are the topics, "XX drama: YYY (drama title), episode 1" is one topic. Users tweet about the XX drama YYY while watching it on air, attaching the broadcast station hashtag. If the hashtag of the station broadcasting the XX drama YYY (the broadcast station hashtag) is #zzz, many people post tweets with program hashtags (#xx, #YYY, #XX drama) in addition to the broadcast station hashtag. A broadcast station hashtag is a hashtag, official or unofficial, that users widely use for programs broadcast on that station. By collecting tweets that include hashtags related to a topic, users' comments on the topic can be grasped.
 The website shown in Non-Patent Document 1 provides a service that extracts tweets associated with broadcast stations and displays the tweets for each station. In this service, each broadcast station is treated as one topic. Tweets can easily be linked to a station by using the broadcast station hashtag. For example, to collect tweets about the programs of broadcast station ZZZ, it suffices to collect tweets that include the broadcast station hashtag #zzz.
 The website shown in Non-Patent Document 2 provides a service that extracts tweets associated with programs and displays the tweets for each program being broadcast. In this service, each program being broadcast is treated as one topic. As in the service of Non-Patent Document 1, tweets are linked to the program being broadcast by using the broadcast station hashtag. In addition, program hashtags are estimated dynamically in real time. For example, for a program of broadcast station ZZZ, tweets including the broadcast station hashtag (#zzz) are linked to the program and, while the program "YYY" is on air, one or more program hashtags (#xx, #YYY, #XX drama) are estimated dynamically and the tweets associated with those program hashtags are also extracted. In this way, by estimating program hashtags in addition to using the broadcast station hashtag, the service of Non-Patent Document 2 can dynamically extract tweets about the program being broadcast.
JP 2012-38281 A
 However, the services shown in Non-Patent Document 1 and Non-Patent Document 2 have the following problems. In the case of television, some users attach several broadcast station hashtags and post tweets that have nothing to do with the programs being broadcast. Because the service of Non-Patent Document 1 simply collects tweets that include a broadcast station hashtag, it also displays tweets unrelated to the program. Moreover, because that service extracts only tweets carrying a broadcast station hashtag, the volume of tweets it can extract is limited.
 The service of Non-Patent Document 2 dynamically estimates program hashtags in addition to the broadcast station hashtag and extracts tweets about the program being broadcast, but it does not go as far as extracting program-related tweets that carry no hashtag. Tweets about a television program on air do not necessarily carry a hashtag; in practice, tweets without a hashtag tend to be the majority. As described above, the service of Non-Patent Document 1 cannot extract tweets to which a program hashtag is attached, and the service of Non-Patent Document 2 cannot extract program-related tweets to which no hashtag is attached.
 The present invention has been made in view of the above problems, and its object is to provide a related document extraction apparatus, a related document extraction method, and a related document extraction program that make it possible to appropriately extract documents related to a specific topic from a plurality of documents such as tweets.
 To achieve the above object, a related document extraction apparatus according to an embodiment of the present invention comprises: default topic tag storage means for storing in advance a default topic tag indicating a topic; document storage means for storing a plurality of documents in advance; word acquisition means for dividing the documents stored by the document storage means into words; default document extraction means for extracting, from the plurality of documents stored by the document storage means, documents that include the default topic tag stored by the default topic tag storage means; first appearance frequency calculation means for calculating the appearance frequency, in the documents extracted by the default document extraction means, of the words obtained by the word acquisition means; and topic document extraction means for extracting documents related to the topic, from documents other than those extracted by the default document extraction means, using the appearance frequency calculated by the first appearance frequency calculation means.
 In the related document extraction apparatus according to an embodiment of the present invention, documents related to a topic are extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches that appearance frequency. The related document extraction apparatus according to an embodiment of the present invention can thereby appropriately extract documents related to a specific topic from a plurality of documents such as tweets.
 The topic document extraction means may comprise: score calculation means for calculating, using the appearance frequency calculated by the first appearance frequency calculation means, the score of a document other than those extracted by the default document extraction means from the words appearing in that document; and first topic document determination means for determining, based on the score calculated by the score calculation means, whether the document is related to the topic. According to this configuration, for example, a document containing words that appear frequently in documents including the default topic tag can be extracted as a document related to the topic, so documents related to a specific topic can be reliably extracted.
 When a word appears multiple times in a document, the score calculation means may calculate the score of the document in the same manner as when the word appears once. According to this configuration, the score of a document is prevented from being inflated by words that occur repeatedly in it, which avoids extracting an inappropriate document as a document related to the topic.
 The topic document extraction means may comprise: tag document extraction means for extracting, from the plurality of documents stored by the document storage means, documents that include a tag other than the default topic tag; second appearance frequency calculation means for calculating the appearance frequency, in the documents extracted by the tag document extraction means, of the words obtained by the word acquisition means; and second topic document determination means for comparing the appearance frequency calculated by the first appearance frequency calculation means with the appearance frequency calculated by the second appearance frequency calculation means and determining, based on the comparison result, whether the documents extracted by the tag document extraction means are related to the topic. According to this configuration, a document (or group of documents) including a tag other than the default topic tag can be extracted as related to the topic, so documents related to a specific topic can be reliably extracted.
 The second topic document determination means may compare the appearance frequencies by calculating the cosine distance, the Jaccard distance, or the Euclidean distance between the feature amount represented by the word appearance frequencies calculated by the first appearance frequency calculation means and the feature amount represented by the word appearance frequencies calculated by the second appearance frequency calculation means. According to this configuration, documents related to a specific topic can be extracted even more reliably.
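The two alternatives to the cosine measure named above can be sketched as follows, assuming each feature amount is a sparse vector keyed by word. The Jaccard distance here is computed on the sets of words that occur in each vector, which is one common reading and an assumption of this sketch; the function names are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of words present in each feature vector."""
    words_a, words_b = set(a), set(b)
    union = words_a | words_b
    if not union:
        return 0.0  # two empty vectors are treated as identical
    return 1.0 - len(words_a & words_b) / len(union)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors; missing words count as 0."""
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in set(a) | set(b)))


# Illustrative values only.
topic_vec = {"drama": 3.0}
tag_vec = {"news": 4.0}
d_euclid = euclidean_distance(topic_vec, tag_vec)
d_jaccard = jaccard_distance(topic_vec, tag_vec)
```

Both functions return distances, so a smaller value means the two feature amounts are more alike; a threshold test on them would keep a tag when the distance is at or below the threshold, the reverse of a similarity test.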
 デフォルトトピックタグ格納手段は、デフォルトトピックタグとして、不適切なトピックに係るデフォルトトピックタグを格納して、トピック文書抽出手段は、文書が不適切なトピックに関連する文書であるか否かを判断して文書の除外を行う、こととしてもよい。この構成によれば、不適切な文書を除外し、例えば不適切な文書をユーザへ提示することを防止することができる。 The default topic tag storage means stores a default topic tag related to an inappropriate topic as a default topic tag, and the topic document extraction means determines whether or not the document is a document related to an inappropriate topic. It is also possible to exclude documents. According to this configuration, inappropriate documents can be excluded, and for example, inappropriate documents can be prevented from being presented to the user.
 文書格納手段は、文書を投稿したユーザに係る情報を格納して、第1出現頻度算出手段は、単語の出現頻度として当該単語が含まれる文書を投稿したユーザ数を算出する、こととしてもよい。この構成によれば、ユーザ毎の影響を均一にし、例えば、1ユーザが複数回同じ内容の文書を投稿したことによる影響を抑えることができる。これにより、適切に特定のトピックに関連する文書を抽出することができる。 The document storage means may store information relating to a user who posted the document, and the first appearance frequency calculation means may calculate the number of users who have posted the document including the word as the word appearance frequency. . According to this structure, the influence for every user can be made uniform, for example, the influence by one user posting the document of the same content several times can be suppressed. Thereby, it is possible to appropriately extract a document related to a specific topic.
 The first appearance frequency calculation means may calculate, for each word, an inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted documents containing that word, and the topic document extraction means may also use the inverse appearance frequency calculated by the first appearance frequency calculation means to extract documents related to the topic. With this configuration, the inverse appearance frequency of words in documents containing the default topic tag indicating the topic is also used in extracting documents related to the topic. Documents related to a specific topic can thereby be extracted still more appropriately from a plurality of documents such as tweets.
 The topic document extraction means may also use the number of characters in each word to extract documents related to the topic. With this configuration, the number of characters of the words in documents containing the default topic tag indicating the topic is also used in extracting documents related to the topic. Documents related to a specific topic can thereby be extracted still more appropriately from a plurality of documents such as tweets.
 The default topic tag storage means may store a plurality of default topic tags, each indicating one of a plurality of topics, and the topic document extraction means may exclude documents related to more than one topic. Documents posted to multiple topics (multi-topic posts) are often unrelated to the individual topics. With this configuration, therefore, extracting an inappropriate document as a document related to a topic can be avoided.
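A minimal sketch of the multi-topic exclusion follows, assuming topics are detected by substring matching against default topic tags. The dict layout, the tag values, and the function names are hypothetical.

```python
def topics_of(text, default_tags):
    """default_tags: topic_id -> list of tags. Returns the topic IDs whose tag occurs in text."""
    return {tid for tid, tags in default_tags.items() if any(tag in text for tag in tags)}

def drop_multi_topic(texts, default_tags):
    """Keep only documents that match at most one topic."""
    return [t for t in texts if len(topics_of(t, default_tags)) <= 1]

default_tags = {1: ["YYY"], 2: ["ZZZ"]}  # placeholder tags
texts = ["watching YYY now", "YYY ZZZ follow me", "nothing here"]
kept = drop_multi_topic(texts, default_tags)
```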
 In addition to being described as an invention of a related document extraction device as above, the present invention can also be described as an invention of a related document extraction method and of a related document extraction program, as follows. These differ only in category, are substantially the same invention, and provide the same operations and effects.
 That is, a related document extraction method according to one embodiment of the present invention is a method performed by a related document extraction device comprising default topic tag storage means that stores in advance a default topic tag indicating a topic and document storage means that stores in advance a plurality of documents, the method including: a word acquisition step of dividing the documents stored by the document storage means into words; a default document extraction step of extracting, from the plurality of documents stored by the document storage means, documents containing the default topic tag stored by the default topic tag storage means; a first appearance frequency calculation step of calculating the appearance frequency, in the documents extracted in the default document extraction step, of the words obtained in the word acquisition step; and a topic document extraction step of extracting, using the appearance frequency calculated in the first appearance frequency calculation step, documents related to the topic from documents other than those extracted in the default document extraction step.
 A related document extraction program according to one embodiment of the present invention causes a computer to function as: default topic tag storage means that stores in advance a default topic tag indicating a topic; document storage means that stores in advance a plurality of documents; word acquisition means that divides the documents stored by the document storage means into words; default document extraction means that extracts, from the plurality of documents stored by the document storage means, documents containing the default topic tag stored by the default topic tag storage means; first appearance frequency calculation means that calculates the appearance frequency, in the documents extracted by the default document extraction means, of the words obtained by the word acquisition means; and topic document extraction means that extracts, using the appearance frequency calculated by the first appearance frequency calculation means, documents related to the topic from documents other than those extracted by the default document extraction means.
 In one embodiment of the present invention, documents related to a topic are extracted using the appearance frequency of words in documents containing a default topic tag indicating the topic. That is, even a document that does not contain the default topic tag is extracted as a document related to the topic when it matches the above appearance frequency. Thus, according to one embodiment of the present invention, documents related to a specific topic can be appropriately extracted from a plurality of documents such as tweets.
A diagram showing the functional configuration of the related document extraction device according to an embodiment of the present invention.
A table showing an example of documents stored in the document storage unit.
A table showing an example of morphemes stored in the morpheme storage unit.
A table showing an example of default topic tags stored in the default topic tag storage unit.
A table showing an example of feature quantities stored in the topic feature word storage unit.
A table showing an example of extended topic hashtags stored in the extended topic hashtag storage unit.
A table showing an example of information used to exclude documents.
A table showing an example of documents stored in the topic document storage unit.
A diagram showing the hardware configuration of the related document extraction device according to an embodiment of the present invention.
A flowchart showing the overall processing (related document extraction method) executed by the related document extraction device according to an embodiment of the present invention.
A flowchart showing processing by the topic feature word estimation unit.
A flowchart showing processing by the topic feature word estimation unit.
A flowchart showing processing by the topic feature word estimation unit.
A flowchart showing processing by the topic hashtag estimation unit.
A flowchart showing processing by the topic hashtag estimation unit.
A flowchart showing processing by the blacklist hashtag expansion unit.
A flowchart showing processing by the blacklist hashtag expansion unit.
A flowchart showing processing by the topic ID assignment unit.
A flowchart showing processing by the topic ID assignment unit.
A flowchart showing processing by the topic ID assignment unit.
A flowchart showing processing by the noise removal unit.
A diagram showing the configuration of the related document extraction program according to an embodiment of the present invention, together with a recording medium.
 The related document extraction device, related document extraction method, and related document extraction program according to the present invention are described in detail below with reference to the drawings. In the description of the drawings, identical elements are given identical reference numerals, and redundant description is omitted.
 FIG. 1 shows a related document extraction device 10 according to the present embodiment. The related document extraction device 10 extracts documents related to a specific topic from a plurality of documents. The documents to be extracted are, for example, documents posted by users to a microblog and published on the Web. In the present embodiment, for brevity, Twitter, a representative microblog, is used as a concrete example where appropriate. Although the extraction target is called a document in the present embodiment, it may also be called a tweet or a comment depending on the microblog service. The documents to be extracted need not be documents published on the Web.
 The related document extraction device 10 receives documents posted by a large number of users, extracts from them the documents related to a specific topic, and provides these to users as a group of documents on that topic. An example of a specific topic is a particular television program. By referring to the group of documents on a specific topic, a user can learn, for example, what other users think about that topic.
 As shown in FIG. 1, the related document extraction device 10 comprises a document storage unit 100, a morphological analysis unit 110, a morpheme storage unit 120, a topic tag estimation unit 130, a topic tag storage unit 140, a topic ID assignment unit 150, a blacklist hashtag expansion unit 160, a blacklist tag storage unit 170, a blacklist user storage unit 180, a noise removal unit 190, and a topic document storage unit 200. The related document extraction device 10 is connected, via a network such as the Internet, to a device that outputs the documents to be extracted (for example, a server providing a microblog service) so that it can acquire (receive) those documents.
 The document storage unit 100 is document storage means that receives and stores in advance a plurality of documents to be extracted. The document storage unit 100 may, for example, request and acquire (receive) documents from a server that provides a microblog service over the Internet and stores the documents, or it may receive document data from that server by streaming. Each document on Twitter corresponds, for example, to the data of one tweet generated (posted) by a user. The stored data is not necessarily limited to a single type of data.
 FIG. 2 shows a sample format of the documents stored in the document storage unit 100. As shown in FIG. 2, the data for one document stored in the document storage unit 100 associates a document ID, a user ID, a posting time, text, and a hashtag. One row of data in FIG. 2 corresponds to the data for one document. The document ID is a unique value that identifies each document. The user ID is information that identifies the user who created each document. In this way, the document storage unit 100 receives and stores information on the user who posted each document. The user ID may be, for example, a unique value such as a user account or, when a unique value is difficult to determine, a per-session ID for Internet access.
 The posting time is information indicating the time at which the document was posted by the user. The text is the actual text data (document body) contained in the document data. The hashtag is tag information attached to the document. "Hashtag" is a Twitter term; it is a tag attached to a document when a user wants to post explicitly about a specific topic, for example a tag by which a specific event can be recognized, that is, an event identifier. A document need not contain a hashtag (event identifier); when it contains none, the hashtag field holds a NULL value.
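The record format of FIG. 2 can be pictured with a small sketch. The class name, field names, and sample values below are illustrative assumptions; only the five fields themselves (document ID, user ID, posting time, text, hashtag) come from the description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class Document:
    document_id: int                 # unique per document
    user_id: str                     # account name or per-session ID
    posted_at: datetime              # posting time
    text: str                        # document body
    hashtag: Optional[str] = None    # None stands in for the NULL value

# Hypothetical sample rows.
tagged = Document(1, "user_a", datetime(2013, 7, 26, 21, 0), "great episode", "#YYY")
untagged = Document(2, "user_b", datetime(2013, 7, 26, 21, 5), "AAAA was great tonight")
```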
 The morphological analysis unit 110 is word acquisition means that reads the document data stored in the document storage unit 100 and divides the text of the document data into words. The morphological analysis unit 110 divides text into words by, for example, morphological analysis, for which conventional techniques can be used. Division into words need not rely on morphological analysis, however, and may be performed by any method. In the following description, the words are assumed to be morphemes. Morphemes are acquired for each document. The morphological analysis unit 110 outputs information on the morphemes obtained from the text to the morpheme storage unit 120.
 The morpheme storage unit 120 is means for storing the morphemes input from the morphological analysis unit 110. FIG. 3 shows a sample format of the morphemes stored in the morpheme storage unit 120. As shown in FIG. 3, the data for one morpheme stored in the morpheme storage unit 120 associates a document ID, a user ID, a posting time, a morpheme, and a part of speech. One row of data in FIG. 3 corresponds to the data for one morpheme. The document ID, user ID, and posting time are those of the document from which the morpheme was obtained. The morpheme is the morpheme obtained by the morphological analysis unit 110. The part of speech is the part of speech of the morpheme obtained through analysis by the morphological analysis unit 110; for example, information on whether the morpheme is a noun is stored.
 The topic tag estimation unit 130 is means for generating information used to determine whether each document is related to a specific topic. The topic tag estimation unit 130 generates this information using the information stored in the topic tag storage unit 140 and stores it in the topic tag storage unit 140. The topic tag storage unit 140 is described first.
 The topic tag storage unit 140 includes a default topic tag storage unit 141, a topic feature word storage unit 142, and an extended topic hashtag storage unit 143.
 The default topic tag storage unit 141 is default topic tag storage means that receives and stores in advance default topic tags indicating topics. A default topic tag is a tag related to a topic for which related documents are to be extracted, and is registered in advance by, for example, an administrator of the related document extraction device 10. A document containing a default topic tag is extracted as a document related to the topic of that default topic tag. This extraction is performed by string matching. A default topic tag is, for example, a morpheme, a hashtag, or a keyword. Default topic tags exist for each topic. For example, when the topic is "XX drama: YYY (drama title)", "AAAA" (an actor appearing in YYY), "YYY", "BBBB" (another actor appearing in YYY), and the like are used as default topic tags.
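The string-matching extraction described above can be sketched as follows. The (topic ID, tag) row layout mirrors the description of the default topic tag data; the tag values, texts, and function name are placeholders chosen for illustration.

```python
# (topic_id, tag) rows; the values are placeholders.
DEFAULT_TOPIC_TAGS = [
    (1, "YYY"),
    (1, "AAAA"),
    (1, "BBBB"),
    (2, "ZZZ"),
]

def extract_default_documents(texts, topic_id, rows=DEFAULT_TOPIC_TAGS):
    """Return the texts that contain at least one default topic tag of topic_id."""
    tags = [tag for tid, tag in rows if tid == topic_id]
    return [text for text in texts if any(tag in text for tag in tags)]

texts = ["AAAA was great tonight", "lunch time", "ZZZ starts soon"]
topic1_docs = extract_default_documents(texts, 1)
topic2_docs = extract_default_documents(texts, 2)
```

Plain substring matching is used here for simplicity. Matching on morphemes or on the hashtag field instead would be an equally plausible reading of the description.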
 FIG. 4 shows a sample format of the default topic tags stored in the default topic tag storage unit 141. As shown in FIG. 4, the data for one default topic tag stored in the default topic tag storage unit 141 associates a topic ID and a tag. One row of data in FIG. 4 corresponds to the data for one default topic tag. The topic ID is an ID that identifies one topic. The tag is the default topic tag itself. As shown in FIG. 4, a plurality of default topic tags may be stored in the default topic tag storage unit 141 for one topic (one topic ID). The default topic tag storage unit 141 may also receive a plurality of default topic tags, each indicating one of a plurality of topics (a plurality of topic IDs).
 The information stored in the topic feature word storage unit 142 and the extended topic hashtag storage unit 143 is input from the topic tag estimation unit 130 and is described later.
 The topic tag estimation unit 130 uses the default topic tags stored in the default topic tag storage unit 141 to generate information used to determine whether each document is related to a specific topic. This information serves to determine whether a document that does not contain a default topic tag is nevertheless related to the topic of that default topic tag.
 The topic tag estimation unit 130 includes a topic feature word estimation unit 131 and a topic hashtag estimation unit 132, each of which generates different information.
 The topic feature word estimation unit 131 is means for estimating the feature words of a topic. A feature word of a topic is a morpheme that appears characteristically in documents related to that topic. The topic feature word estimation unit 131 is default document extraction means that reads the default topic tags stored in the default topic tag storage unit 141 and extracts, from the plurality of documents stored in the document storage unit 100, the documents containing a default topic tag as documents related to the topic (topic documents). At this time, based on the posting time, it acquires documents from a preset period, such as the several hours immediately preceding the extraction. The topic feature word estimation unit 131 is also first appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of morphemes in the extracted topic documents (topic document group). In doing so, the topic feature word estimation unit 131 calculates, as the appearance frequency of a morpheme, the number of users who posted documents containing that morpheme. The topic feature word estimation unit 131 also calculates, for each morpheme, an inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted documents containing that morpheme.
 From the above values, the topic feature word estimation unit 131 extracts the morphemes characteristic of the topic documents (topic document group) as a feature quantity. For each topic ID, the topic feature word estimation unit 131 generates a feature quantity, which is information describing the characteristics of the target topic. A feature quantity consists of a plurality of features, one generated per morpheme. For example, the feature "today" may have a score of 0.5 and the feature "sunny" a score of 0.2.
 Specifically, the feature quantity is generated as follows. First, the topic feature word estimation unit 131 calculates, from the morphemes contained in each document, an IDF (Inverse Document Frequency) value (inverse appearance frequency) $\mathrm{idf}_i$ for each morpheme using the following formula.
[Math. 1] (formula image not reproduced; $\mathrm{idf}_i$ is calculated from the ratio $|D| / |\{d : t_i \in d\}|$)
Here, $i$ is a subscript indicating a morpheme, $|D|$ is the total number of unique users, and $|\{d : t_i \in d\}|$ is the number of unique users who posted a document containing morpheme $i$. The IDF value is a score indicating that the fewer the documents in which a word appears, the more useful that word is for the documents in which it does appear.
 The frequency is calculated from the number of users rather than the number of documents for the following reason. Simply using the number of documents can introduce noise. For example, the same user may post several documents with the same content; some users post the same content dozens of times. When the calculation is based on the number of unique users, a user who posts the same content multiple times is counted only once, so the calculated score is more reliable. This can also be viewed as making each user's influence on a morpheme's score uniform.
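A user-based IDF calculation can be sketched as follows. The original formula image is not available here, so the plain ratio of total unique users to unique users per morpheme (with no logarithm) is an assumption of this sketch, as are the input shape and the function name.

```python
from collections import defaultdict

def idf_by_users(posts):
    """posts: iterable of (user_id, morphemes).

    Returns morpheme -> (total unique users) / (unique users who posted it).
    """
    all_users = set()
    users_by_morpheme = defaultdict(set)
    for user_id, morphemes in posts:
        all_users.add(user_id)
        for m in morphemes:
            users_by_morpheme[m].add(user_id)
    return {m: len(all_users) / len(users) for m, users in users_by_morpheme.items()}

posts = [
    ("u1", ["drama", "YYY"]),
    ("u1", ["drama", "YYY"]),  # repeated post by the same user, counted once
    ("u2", ["drama"]),
    ("u3", ["lunch"]),
    ("u4", ["drama"]),
]
idf = idf_by_users(posts)
```

With four unique users, "YYY" (posted by one user) receives a higher value than "drama" (posted by three), regardless of how many times u1 repeated the post.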
 Next, for each topic ID, the topic feature word estimation unit 131 calculates, from the morphemes contained in each topic document (the morphemes to which the topic ID is assigned), a TF (Term Frequency) value (appearance frequency) $\mathrm{tf}_{i,j}$ for each morpheme using the following formula.
[Math. 2] (formula image not reproduced; it defines $\mathrm{tf}_{i,j}$ in terms of $n_{i,j}$)
Here, $j$ is a subscript indicating the topic ID, and $n_{i,j}$ is the number of unique users who posted a document of topic ID $j$ (a document containing a default topic tag of topic ID $j$) that contains morpheme $i$. The TF value indicates how prominently a word appears in a given document; the larger the value, the better the word represents the content of the document.
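The per-topic user count $n_{i,j}$ and a TF value derived from it can be sketched as below. The formula image is not available here, so dividing $n_{i,j}$ by the sum of the counts over all morphemes of the topic is an assumption borrowed from the common textbook definition, not a detail confirmed by the source. The function name and input shape are also hypothetical.

```python
def tf_by_users(topic_posts):
    """topic_posts: (user_id, morphemes) pairs for documents of one topic ID j.

    Returns morpheme -> n_ij / sum_k n_kj, where n_ij counts unique users.
    The normalization is an assumption of this sketch.
    """
    users_by_morpheme = {}
    for user_id, morphemes in topic_posts:
        for m in morphemes:
            users_by_morpheme.setdefault(m, set()).add(user_id)
    n = {m: len(users) for m, users in users_by_morpheme.items()}
    total = sum(n.values())
    return {m: count / total for m, count in n.items()}

topic_posts = [
    ("u1", ["YYY", "AAAA"]),
    ("u2", ["YYY"]),
    ("u2", ["YYY"]),  # repeated post, counted once for u2
    ("u3", ["YYY", "fun"]),
]
tf = tf_by_users(topic_posts)
```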
 Next, the topic feature word estimation unit 131 obtains the TFIDF value $\mathrm{tfidf}_{i,j}$ of morpheme $i$ for topic ID $j$ using the following formula.
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$$
Performing this for each morpheme generates the feature quantity of the topic ID (the TFIDF value of each morpheme $i$). The process continues until feature quantities have been generated for all topic IDs. With the per-morpheme scores calculated in this way, morphemes that correlate more strongly with the topic receive higher scores.
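The product of the two values can be sketched in a few lines. The TF and IDF inputs below are made-up numbers used only to show the ranking effect; the function name is hypothetical.

```python
def tfidf(tf, idf):
    """Multiply per-morpheme TF (for one topic) by the global IDF."""
    return {m: tf[m] * idf[m] for m in tf if m in idf}

tf_topic = {"YYY": 0.6, "today": 0.4}                 # hypothetical TF values for one topic
idf_all = {"YYY": 4.0, "today": 1.25, "lunch": 4.0}   # hypothetical IDF values
features = tfidf(tf_topic, idf_all)
ranked = sorted(features, key=features.get, reverse=True)
```

A morpheme such as "today", which many users post regardless of topic, has a low IDF and therefore ends up below the topic-specific "YYY".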
 A feature quantity for a television program takes, for example, the form "YYY (drama title): 1.0, AAAA (actor name): 0.9, CCCC (role name): 0.7, DDDD (role name): 0.4" (morpheme: TFIDF value). Looking at the feature quantity in this way makes the characteristics of the program clear.
 In the above calculation, the score (TFIDF value) may be adjusted by weighting the IDF, for example by applying a logarithm to the IDF or raising the IDF to a constant power. The number of characters in each morpheme may also be used, for example by calculating the TFIDF value as follows.
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i \cdot \log(\mathrm{length}_i)$$
Here, $\mathrm{length}_i$ is the number of characters in morpheme $i$. The weight of the string length may also be adjusted by multiplying by power($\log(\mathrm{length}_i)$, constant), that is, $\log(\mathrm{length}_i)$ raised to a constant power. This raises the weight of morphemes that describe the topic more specifically. Morphemes with few characters also appear frequently and therefore tend to receive high scores as noise, a tendency that this weighting counteracts.
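The length weighting can be sketched as follows. The natural logarithm is assumed because the base is not stated, and the parameter name `exponent` for the constant power is hypothetical.

```python
import math

def length_weighted_tfidf(tf, idf, morpheme, exponent=1.0):
    """tf * idf * log(length) ** exponent for a non-empty morpheme."""
    weight = math.log(len(morpheme)) ** exponent
    return tf * idf * weight

short_score = length_weighted_tfidf(0.5, 2.0, "a")      # one character: log(1) = 0
long_score = length_weighted_tfidf(0.5, 2.0, "AAAA")    # four characters
```

Under this weighting a single-character morpheme scores zero, since log(1) = 0, so very short, frequent morphemes are suppressed entirely.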
 The topic feature word estimation unit 131 outputs the calculated TFIDF value of each morpheme for each topic ID to the topic feature word storage unit 142 for storage. Only morphemes (feature words) whose TFIDF value is at or above a preset threshold may be stored in the topic feature word storage unit 142. FIG. 5 shows a sample format of the feature quantities stored in the topic feature word storage unit 142. As shown in FIG. 5, the feature quantity data stored in the topic feature word storage unit 142 is per-morpheme data, and the data for one morpheme associates a topic ID, a creation date, a tag, and a score. One row of data in FIG. 5 corresponds to the data for one morpheme of some topic ID. The topic ID is the topic ID of the topic to which the feature quantity belongs. The creation date is the time at which the data was created. The tag is the morpheme. The score is the TFIDF value calculated by the topic feature word estimation unit 131. This completes the description of the information generated by the topic feature word estimation unit 131.
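The thresholding and the row format of FIG. 5 can be sketched together. The class and function names are hypothetical and the threshold 0.5 is chosen arbitrarily; the sample scores reuse the illustrative values given for the television program example.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FeatureRow:
    topic_id: int
    created_at: datetime  # time the row was created
    tag: str              # the morpheme
    score: float          # TFIDF value

def to_rows(topic_id, scores, threshold, now):
    """Keep only morphemes whose score is at or above the threshold."""
    return [FeatureRow(topic_id, now, m, s) for m, s in scores.items() if s >= threshold]

scores = {"YYY": 1.0, "AAAA": 0.9, "CCCC": 0.7, "DDDD": 0.4}
rows = to_rows(1, scores, threshold=0.5, now=datetime(2013, 7, 26, 22, 0))
```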
 The topic hashtag estimation unit 132 is means for estimating hashtags, other than the default topic tags, that relate to a topic. The topic hashtag estimation unit 132 is tag document extraction means that reads the default topic tags stored in the default topic tag storage unit 141 and extracts, from the plurality of documents stored in the document storage unit 100, the documents containing a tag other than a default topic tag (a hashtag that is a candidate for a hashtag related to the topic) as tag documents (tag document group). At this time, based on the posting time, it acquires documents from a preset period, such as the several hours immediately preceding the extraction. The topic hashtag estimation unit 132 is also second appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of morphemes in the extracted tag documents (tag document group). In doing so, the topic hashtag estimation unit 132 calculates, as the appearance frequency of a morpheme, the number of users who posted documents containing that morpheme. The topic hashtag estimation unit 132 also calculates, for each morpheme, an inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted documents containing that morpheme.
 具体的には、トピックハッシュタグ推定部132は、トピック特徴語推定部131と同様に各形態素についてのIDF値(逆出現頻度)を算出する。なお、トピック特徴語推定部131及びトピックハッシュタグ推定部132によって算出され利用されるIDF値は形態素毎に同一の値となるため、何れか一方が算出したIDF値をもう一方において利用することとしてもよい。 Specifically, the topic hash tag estimation unit 132 calculates an IDF value (reverse appearance frequency) for each morpheme, as with the topic feature word estimation unit 131. Since the IDF values calculated and used by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132 are the same for each morpheme, the IDF value calculated by either one is used in the other. Also good.
Next, for each tag other than the default topic tags, the topic hash tag estimation unit 132 calculates a TF value (appearance frequency) for each morpheme contained in the tag documents (the morphemes to which the hash tag is attached) by the following formula.
Figure JPOXMLDOC01-appb-M000003
Here, j is a subscript indicating a hash tag, and $n_{i,j}$ is the number of unique users who posted a document containing both morpheme i and hash tag j.
Next, the topic hash tag estimation unit 132 obtains the TFIDF value ($\mathrm{tfidf}_{i,j}$) of morpheme i for hash tag j by the following formula.
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}$$
Performing this for every morpheme generates the feature amount of the hash tag (the TFIDF value of each morpheme i). The process is repeated until feature amounts have been generated for all hash tags. Weighting of the TFIDF values and the like may be performed in the same way as described above for the topic feature word estimation unit 131.
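The user-count-based TF-IDF calculation for hash tags can be sketched as below. The TF and IDF formulas appear only as images in this text, so two details are assumptions made here and not confirmed by the source: TF is normalized as $n_{i,j} / \sum_k n_{k,j}$, and IDF is the natural logarithm of the ratio of all users to the users who posted the morpheme. The input layout `(user_id, morphemes, hashtags)` and the function name are also hypothetical.

```python
import math
from collections import defaultdict


def user_based_tfidf(docs):
    """docs: list of (user_id, morphemes, hashtags) tuples.

    Returns {hashtag: {morpheme: tfidf}}. Counts are unique users, not
    raw occurrences, so repeated posts by one user are counted once.
    """
    all_users = set()
    users_by_morpheme = defaultdict(set)  # users who posted morpheme i (for IDF)
    users_by_tag = defaultdict(lambda: defaultdict(set))  # n_{i,j} sources (for TF)
    for user, morphemes, hashtags in docs:
        all_users.add(user)
        for m in set(morphemes):
            users_by_morpheme[m].add(user)
            for tag in set(hashtags):
                users_by_tag[tag][m].add(user)

    # ASSUMPTION: idf_i = log(|D| / |{d : t_i in d}|), counted in unique users.
    idf = {m: math.log(len(all_users) / len(u)) for m, u in users_by_morpheme.items()}

    features = {}
    for tag, per_morpheme in users_by_tag.items():
        counts = {m: len(u) for m, u in per_morpheme.items()}  # n_{i,j}
        total = sum(counts.values())
        # ASSUMPTION: tf_{i,j} = n_{i,j} / sum_k n_{k,j}
        features[tag] = {m: (n / total) * idf[m] for m, n in counts.items()}
    return features


docs = [
    ("u1", ["goal", "today"], ["#soccer"]),
    ("u1", ["goal"], ["#soccer"]),  # same user again: not counted twice
    ("u2", ["goal", "today"], ["#soccer"]),
    ("u3", ["today", "rain"], ["#weather"]),
]
features = user_based_tfidf(docs)
```

Counting users instead of occurrences keeps one prolific poster from dominating the feature amount of a hash tag.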
The topic hash tag estimation unit 132 also serves as default document extraction means: it reads the default topic tags stored in the default topic tag storage unit 141 and, from the plurality of documents stored in the document storage unit 100, extracts the documents that contain a default topic tag as documents related to the topic (topic documents). In doing so, it acquires, based on posting time, the documents of a preset period, such as the several hours immediately preceding the extraction. The topic hash tag estimation unit 132 further serves as first appearance frequency calculation means: referring to the information stored in the document storage unit 100 and the morpheme storage unit 120, it calculates the appearance frequency of morphemes in the extracted topic documents (the topic document group). Here, it calculates, as the appearance frequency of a morpheme, the number of users who posted a document containing that morpheme. It also calculates, for each morpheme, the inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted a document containing that morpheme.
In the same way as the topic feature word estimation unit 131, the topic hash tag estimation unit 132 obtains the TFIDF value ($\mathrm{tfidf}_{i,j}$) of morpheme i for topic ID j. Because the TFIDF values for topic ID j calculated and used by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132 are identical for each morpheme, the TFIDF value calculated by one unit may be reused by the other.
The topic hash tag estimation unit 132 also provides one function of the second topic document determination means: it compares the feature amount of each topic ID, calculated as described above, with the feature amount of each tag. Specifically, for each topic ID, the topic hash tag estimation unit 132 calculates the similarity to every hash tag (other than the default topic tags) as the cosine distance, using the following formula.
$$\mathrm{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
Here, A and B are the feature amount of the topic ID and the feature amount of the hash tag, respectively, and $A_i$ and $B_i$ are the TFIDF values of morpheme i. Besides the cosine distance, the Jaccard distance or the Euclidean distance may be used to calculate the similarity between feature amounts expressed by morpheme appearance frequencies. Any other calculation method may also be used, provided that it can calculate the similarity between feature amounts.
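A cosine similarity over sparse TFIDF vectors can be sketched as follows. Representing each feature amount as a dictionary from morpheme to TFIDF value is an assumption made for illustration, as is returning 0.0 when either vector is empty.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts.

    Morphemes missing from a vector are treated as having a value of 0.
    """
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # ASSUMPTION: an empty feature amount is similar to nothing
    return dot / (norm_a * norm_b)


topic_vec = {"goal": 0.9, "keeper": 0.4}
tag_vec = {"goal": 0.8, "keeper": 0.5, "rain": 0.1}
sim = cosine_similarity(topic_vec, tag_vec)
```

Because only shared morphemes contribute to the dot product, two hash tags used in unrelated contexts score close to zero even when both vectors are large.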
For each topic ID, the topic hash tag estimation unit 132 determines whether any hash tag has a similarity equal to or greater than a preset threshold, and treats each tag whose similarity is equal to or greater than the threshold as a tag related to the topic of that topic ID. Performing this process for all topic IDs extracts the hash tags related to (similar to) the topic of each topic ID.
The topic hash tag estimation unit 132 outputs information indicating the hash tags related to the topic of each topic ID (extended topic hash tags) to the extended topic hash tag storage unit 143, which stores it. FIG. 6 shows a sample format of the extended topic hash tags stored in the extended topic hash tag storage unit 143. As shown in FIG. 6, the data is held per extended topic hash tag: the data for one extended topic hash tag associates a topic ID, a creation date, and a hash tag. Each row in FIG. 6 corresponds to the data for one extended topic hash tag. The topic ID identifies the topic to which the extended topic hash tag relates. The creation date is the time at which the data was created. The hash tag is the extended topic hash tag itself. This completes the description of the information generated by the topic hash tag estimation unit 132.
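The thresholding step and the FIG. 6 record layout can be sketched as below, starting from precomputed similarities. The nested-dictionary input, the function name `extended_tag_rows`, and the sample tags are hypothetical and used only for illustration.

```python
from datetime import datetime, timezone


def extended_tag_rows(similarities, threshold, created_at=None):
    """similarities: {topic_id: {hashtag: similarity}}.

    Returns (topic_id, created_at, hashtag) rows for every hash tag whose
    similarity to the topic is at or above `threshold`.
    """
    created_at = created_at or datetime.now(timezone.utc)
    rows = []
    for topic_id, by_tag in similarities.items():
        for tag, sim in by_tag.items():
            if sim >= threshold:
                rows.append((topic_id, created_at, tag))
    return rows


sims = {
    "T1": {"#getsu9": 0.91, "#lunch": 0.12},
    "T2": {"#getsu9": 0.05},
}
tag_rows = extended_tag_rows(
    sims, threshold=0.8, created_at=datetime(2013, 7, 26, tzinfo=timezone.utc)
)
```

A topic for which no hash tag reaches the threshold simply contributes no rows.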
The topic ID assigning unit 150 is topic document extraction means that extracts documents related to a topic using the information stored in the topic tag storage unit 140. In particular, it extracts documents that relate to a topic even though they do not contain a default topic tag.
The topic ID assigning unit 150 first acquires the documents stored in the document storage unit 100. In doing so, it acquires, based on posting time, the documents of a preset period, such as the several hours immediately preceding the acquisition. Documents related to a topic are extracted on the basis of the information stored in the default topic tag storage unit 141 as follows. The topic ID assigning unit 150 reads the default topic tags stored in the default topic tag storage unit 141, determines whether each acquired document contains a default topic tag, and assigns to each document that contains one the topic ID associated with that default topic tag.
Documents related to a topic are extracted on the basis of the feature word information as follows. The topic ID assigning unit 150 serves as score calculation means: it reads the feature amount information (the TFIDF value, or score, of each morpheme for each topic ID) that the topic feature word estimation unit 131 stored in the topic feature word storage unit 142, and from this information calculates a score for each acquired document with respect to each topic ID. The topic ID assigning unit 150 determines whether the document being scored contains the morphemes (feature words) of the feature amount, and sums the scores of the feature words contained in the document.
When a feature word appears more than once in a document, the document score may be calculated as though the word appeared once. That is, the score of the same feature word is not counted more than once. For example, if the document is "Today, I'm glad it was sunny. Today is nice weather." and the feature word "today" has a score of 1.0, the contribution of "today" to the score of this document is 1.0, not 1.0 + 1.0 = 2.0.
Not counting repeated occurrences removes noise and avoids extracting inappropriate documents as topic documents. Suppose, for example, that a word was extracted as a feature word but with a low score. If that feature word appears many times in a document and repeated occurrences were counted, the document could be extracted as a topic document. Disallowing repeated counting avoids this.
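The effect of not counting repeated feature words can be sketched with the "today" example from the text. The function name `score_document` and the `count_duplicates` switch are assumptions added for illustration; the document is assumed to be already split into morphemes.

```python
def score_document(morphemes, feature_scores, count_duplicates=False):
    """Sum the scores of the feature words found in a document.

    With count_duplicates=False (the behavior described in the text),
    each distinct feature word contributes its score at most once.
    """
    words = morphemes if count_duplicates else set(morphemes)
    return sum(feature_scores.get(m, 0.0) for m in words)


# "Today, I'm glad it was sunny. Today is nice weather." after morpheme splitting
doc = ["today", "glad", "sunny", "today", "nice", "weather"]
scores = {"today": 1.0}

once = score_document(doc, scores)                            # 1.0
repeated = score_document(doc, scores, count_duplicates=True)  # 2.0
```

With repeated counting, a low-scoring feature word repeated many times could push an unrelated document over the extraction threshold; the set-based sum prevents that.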
The topic ID assigning unit 150 serves as first topic document determination means: based on the calculated score, it determines whether the scored document is a document related to the topic. Specifically, the topic ID assigning unit 150 determines whether the score is equal to or greater than a preset threshold and, if so, judges the document to be related to the topic and assigns the topic ID to it. This process is repeated, assigning topic IDs, for every topic ID for which the topic feature word estimation unit 131 has stored a feature amount in the topic feature word storage unit 142.
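The score-and-threshold determination repeated over all topic IDs can be sketched as follows. The function name `assign_topic_ids`, the dictionary layout of the stored feature amounts, and the sample scores are hypothetical.

```python
def assign_topic_ids(morphemes, features_by_topic, threshold):
    """Return the topic IDs whose summed feature-word score meets the threshold.

    features_by_topic: {topic_id: {feature_word: score}}.
    Each distinct feature word in the document is counted once.
    """
    present = set(morphemes)
    assigned = []
    for topic_id, scores in features_by_topic.items():
        score = sum(s for word, s in scores.items() if word in present)
        if score >= threshold:
            assigned.append(topic_id)
    return assigned


features_by_topic = {
    "drama": {"episode": 0.6, "actor": 0.5},
    "soccer": {"goal": 0.9},
}
ids = assign_topic_ids(["actor", "episode", "episode"], features_by_topic, threshold=1.0)
```

A single document can receive more than one topic ID if it clears the threshold for several topics.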
Documents related to a topic are extracted on the basis of the information stored in the extended topic hash tag storage unit 143 as follows. The topic ID assigning unit 150 serves as second topic document determination means: it reads the extended topic hash tags stored in the extended topic hash tag storage unit 143 and determines whether each acquired document contains an extended topic hash tag (that is, whether the acquired document is a tag document for an extended topic hash tag), thereby determining whether the document is related to the topic. The topic ID assigning unit 150 assigns to each document that contains an extended topic hash tag the topic ID associated with that extended topic hash tag. This is repeated, assigning topic IDs, for every topic ID associated with an extended topic hash tag stored in the extended topic hash tag storage unit 143. The topic ID assigning unit 150 outputs the documents to which topic IDs have been assigned to the noise removing unit 190.
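Assigning topic IDs by hash tag lookup can be sketched as below. The regular expression used to pull hash tags out of the text and the flat `tag_to_topic` mapping are assumptions for illustration; the specification does not state how hash tags are detected.

```python
import re

# ASSUMPTION: a hash tag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")


def topic_ids_from_tags(text, tag_to_topic):
    """Return the sorted topic IDs whose registered hash tags occur in `text`.

    tag_to_topic maps each default or extended topic hash tag to a topic ID.
    """
    tags = HASHTAG_RE.findall(text)
    return sorted({tag_to_topic[t] for t in tags if t in tag_to_topic})


tag_to_topic = {"#drama9": "T1", "#getsu9": "T1", "#wc2014": "T2"}
found = topic_ids_from_tags("Loved tonight's episode #getsu9", tag_to_topic)
```

Because `#drama9` and `#getsu9` map to the same topic ID, a document carrying only the extended tag is still attributed to topic `T1`.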
In the present embodiment, noise is removed from the documents stored in the document storage unit 100. That is, it is determined whether a document stored in the document storage unit 100 is inappropriate as a document related to a topic, and a document determined to be inappropriate is excluded from the related documents.
On Twitter, it is common to attach a hash tag to a tweet in order to share it under a specific topic, but some users post comments tagged with the hash tags of several independent topics. Such a comment is posted to multiple topics at once, and its content usually has very little to do with any of the individual topics; in the case of television, it is often criticism of politics or of a broadcasting station. Filtering out this noise is important for accurately extracting documents related to a topic. The configuration described below removes noise from the documents.
The blacklist hash tag extension unit 160 is a means for generating the information used to determine whether each document is noise, that is, whether each document relates to a specific topic that is inappropriate for extraction. The blacklist hash tag extension unit 160 generates this information using the information stored in the blacklist tag storage unit 170 and stores the generated information in the blacklist tag storage unit 170. The blacklist tag storage unit 170 is described first.
The blacklist tag storage unit 170 includes a default blacklist morpheme storage unit 171, a default blacklist hash tag storage unit 172, and an extended blacklist hash tag storage unit 173.
The default blacklist morpheme storage unit 171 is a means for receiving and storing blacklist morphemes. A blacklist morpheme is a morpheme whose presence in a document means that the document should be excluded. Blacklist morphemes are registered in advance, for example, by the administrator of the related document extraction device 10. FIG. 7(a) shows a sample format of the blacklist morphemes stored in the default blacklist morpheme storage unit 171. As shown in FIG. 7(a), each row corresponds to the data for one blacklist morpheme, and the data is stored per blacklist morpheme.
The default blacklist hash tag storage unit 172 is default topic tag storage means that receives and stores, in advance, blacklist hash tags, which are default topic tags indicating inappropriate topics. A blacklist hash tag is a tag related to a topic whose related documents are to be excluded, and is registered in advance, for example, by the administrator of the related document extraction device 10. A document containing a blacklist hash tag is excluded as a document related to an inappropriate topic. This exclusion is performed by character string matching. A blacklist hash tag is, for example, a hash tag.
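Exclusion by character string matching can be sketched as below. Plain substring matching is an assumption about what "character string matching" means here; the function names and sample tags are hypothetical.

```python
def contains_blacklisted_tag(text, blacklist_tags):
    """True if any blacklist hash tag occurs in the text as a substring."""
    return any(tag in text for tag in blacklist_tags)


def drop_blacklisted(documents, blacklist_tags):
    """Keep only the documents that contain no blacklist hash tag."""
    return [d for d in documents if not contains_blacklisted_tag(d, blacklist_tags)]


documents = [
    "great match tonight #wc2014",
    "they are all corrupt #politics #wc2014",
]
kept = drop_blacklisted(documents, {"#politics"})
```

Note that substring matching also rejects longer tags that merely start with a blacklisted one (for example `#politics2013`); a stricter matcher would compare whole tags.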
FIG. 7(b) shows a sample format of the blacklist hash tags stored in the default blacklist hash tag storage unit 172. As shown in FIG. 7(b), each row corresponds to the data for one blacklist hash tag, and the data is stored per blacklist hash tag.
The information stored in the extended blacklist hash tag storage unit 173 is input from the blacklist hash tag extension unit 160 and is therefore described later.
Using the blacklist hash tags stored in the default blacklist hash tag storage unit 172, the blacklist hash tag extension unit 160 generates the information used to determine whether each document should be excluded (that is, whether it relates to a topic to be excluded). This information makes it possible to determine whether a document that does not contain a blacklist hash tag should nevertheless be excluded.
The blacklist hash tag extension unit 160 is a means for estimating the feature words of a blacklist hash tag. A feature word of a blacklist hash tag is a morpheme that appears characteristically in documents containing that blacklist hash tag. The blacklist hash tag extension unit 160 serves as default document extraction means: it reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and, from the plurality of documents stored in the document storage unit 100, extracts the documents that contain a blacklist hash tag as documents to be excluded. In doing so, it acquires, based on posting time, the documents of a preset period, such as the several hours immediately preceding the extraction. The blacklist hash tag extension unit 160 also serves as first appearance frequency calculation means: referring to the information stored in the document storage unit 100 and the morpheme storage unit 120, it calculates the appearance frequency of morphemes in the extracted documents to be excluded (the document group). Here, it calculates, as the appearance frequency of a morpheme, the number of users who posted a document containing that morpheme. It also calculates, for each morpheme, the inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted a document containing that morpheme.
From the above values, the blacklist hash tag extension unit 160 extracts the morphemes (feature amounts) characteristic of the documents to be excluded. For each blacklist hash tag, it generates a feature amount, which is information describing the characteristics of the target topic.
Specifically, the feature amount is generated as follows. First, the blacklist hash tag extension unit 160 calculates an IDF value (inverse appearance frequency) for each morpheme contained in the documents by the following formula.
Figure JPOXMLDOC01-appb-M000005
Here, i is a subscript indicating a morpheme, $|D|$ is the total number of unique users, and $|\{d : t_i \in d\}|$ is the number of unique users who posted a document containing morpheme i. Because this IDF value is the same as the one calculated by the topic tag estimation unit 130, the value calculated by the topic tag estimation unit 130 may be used instead.
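The formula itself is available here only as an image reference. A commonly used form that is consistent with the symbols defined above is shown below as an assumption, not as a transcription of the original equation; in particular, the use of a logarithm and its base are not confirmed by the surrounding text.

```latex
% ASSUMPTION: standard logarithmic IDF, counted in unique users rather than documents
\mathrm{idf}_{i} = \log \frac{|D|}{\left|\{\, d : t_i \in d \,\}\right|}
% Worked example with the natural logarithm:
% |D| = 1000 users in total, 10 of whom posted a document containing morpheme i
\mathrm{idf}_{i} = \ln \frac{1000}{10} = \ln 100 \approx 4.61
```

Under this form, a morpheme posted by every user has an IDF of 0 and therefore contributes nothing to any feature amount.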
Next, for each blacklist hash tag, the blacklist hash tag extension unit 160 calculates a TF value (appearance frequency) for each morpheme contained in the extracted documents to be excluded (the morphemes to which the blacklist hash tag is attached) by the following formula.
Figure JPOXMLDOC01-appb-M000006
Here, j is a subscript indicating a blacklist hash tag, and $n_{i,j}$ is the number of unique users who posted a document that contains morpheme i and relates to blacklist hash tag j (that is, a document containing blacklist hash tag j).
Next, the blacklist hash tag extension unit 160 obtains the TFIDF value ($\mathrm{tfidf}_{i,j}$) of morpheme i for blacklist hash tag j by the following formula.
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}$$
Performing this for every morpheme generates the feature amount of the blacklist hash tag (the TFIDF value of each morpheme i). The process is repeated until feature amounts have been generated for all blacklist hash tags. Weighting of the TFIDF values and the like may be performed in the same way as described above.
The blacklist hash tag extension unit 160 outputs the calculated TFIDF value of each morpheme for each blacklist hash tag to the blacklist tag storage unit 170, which stores it. Alternatively, only the morphemes (feature words) whose TFIDF value is equal to or greater than a threshold may be stored in the blacklist tag storage unit 170.
The blacklist hash tag extension unit 160 is also a means for estimating hash tags, other than the blacklist hash tags, that are contained in documents to be excluded. It serves as tag document extraction means: it reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and, from the plurality of documents stored in the document storage unit 100, extracts the documents (a document group) that contain a tag other than a blacklist hash tag (that is, a hash tag that is a candidate for a hash tag contained in documents to be excluded). In doing so, it acquires, based on posting time, the documents of a preset period, such as the several hours immediately preceding the extraction. The blacklist hash tag extension unit 160 also serves as second appearance frequency calculation means: referring to the information stored in the document storage unit 100 and the morpheme storage unit 120, it calculates the appearance frequency of morphemes in the extracted documents. Here, it calculates, as the appearance frequency of a morpheme, the number of users who posted a document containing that morpheme. It also calculates, for each morpheme, the inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted a document containing that morpheme.
Specifically, the blacklist hash tag extension unit 160 calculates an IDF value (inverse appearance frequency) for each morpheme in the same way as described above. The blacklist hash tag extension unit 160 may instead reuse the IDF value calculated above or the one calculated by the topic tag estimation unit 130.
Next, for each tag other than the blacklist hash tags, the blacklist hash tag extension unit 160 calculates a TF value (appearance frequency) for each morpheme contained in the tag documents (the morphemes to which the hash tag is attached) by the following formula.
Figure JPOXMLDOC01-appb-M000007
Here, j is a subscript indicating a hash tag, and $n_{i,j}$ is the number of unique users who posted a document containing both morpheme i and hash tag j. The blacklist hash tag extension unit 160 may instead reuse the TF value calculated by the topic tag estimation unit 130.
Next, the blacklist hash tag extension unit 160 obtains the TFIDF value ($\mathrm{tfidf}_{i,j}$) of morpheme i for hash tag j by the following formula.
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}$$
Performing this for every morpheme generates the feature amount of the hash tag (the TFIDF value of each morpheme i). The process is repeated until feature amounts have been generated for all hash tags. Weighting of the TFIDF values and the like may be performed in the same way as described above.
The blacklist hash tag extension unit 160 also serves as default document extraction means: it reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and, from the plurality of documents stored in the document storage unit 100, extracts the documents that contain a blacklist hash tag as documents to be excluded. In doing so, it acquires, based on posting time, the documents of a preset period, such as the several hours immediately preceding the extraction. The blacklist hash tag extension unit 160 further serves as first appearance frequency calculation means: referring to the information stored in the document storage unit 100 and the morpheme storage unit 120, it calculates the appearance frequency of morphemes in the extracted documents to be excluded (the document group). Here, it calculates, as the appearance frequency of a morpheme, the number of users who posted a document containing that morpheme. It also calculates, for each morpheme, the inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted a document containing that morpheme.
In the same way as described above, the blacklist hash tag extension unit 160 obtains the TFIDF value ($\mathrm{tfidf}_{i,j}$) of morpheme i for blacklist hash tag j. Because the TFIDF value for blacklist hash tag j is identical for each morpheme, the TFIDF value calculated above may be reused.
The blacklist hash tag extension unit 160 also provides one function of the second topic document determination means: it compares the feature amount of each blacklist hash tag, calculated as described above, with the feature amount of each hash tag. Specifically, for each blacklist hash tag, the blacklist hash tag extension unit 160 calculates the similarity to every hash tag (other than the blacklist hash tags) as the cosine distance, using the following formula.
$$\mathrm{similarity} = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$
Here, A and B are the feature amount of the blacklist hash tag and the feature amount of the hash tag, respectively, and $A_i$ and $B_i$ are the TFIDF values of morpheme i. Besides the cosine distance, the Jaccard distance or the Euclidean distance may be used to calculate the similarity between feature amounts expressed by morpheme appearance frequencies. Any other calculation method may also be used, provided that it can calculate the similarity between feature amounts.
For each blacklist hash tag, the blacklist hash tag extension unit 160 determines whether any hash tag has a similarity equal to or greater than a preset threshold, and treats each hash tag whose similarity is equal to or greater than the threshold as a hash tag related to documents to be excluded. Performing this process for all blacklist hash tags extracts the hash tags related to documents to be excluded.
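The blacklist expansion step can be sketched end to end as below: each blacklist hash tag is compared with every candidate hash tag, and candidates at or above the threshold are collected. Sparse dictionary vectors, the function names, and the sample feature amounts are assumptions for illustration.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def expand_blacklist(blacklist_features, candidate_features, threshold):
    """Return candidate hash tags similar to at least one blacklist hash tag.

    Both arguments map a hash tag to its {morpheme: tfidf} feature amount.
    """
    expanded = set()
    for bl_vec in blacklist_features.values():
        for tag, vec in candidate_features.items():
            if _cosine(bl_vec, vec) >= threshold:
                expanded.add(tag)
    return sorted(expanded)


blacklist_features = {"#politics": {"tax": 0.9, "minister": 0.4}}
candidate_features = {
    "#seiji": {"tax": 0.8, "minister": 0.5},
    "#drama": {"actor": 0.7},
}
extended = expand_blacklist(blacklist_features, candidate_features, threshold=0.8)
```

Here `#seiji` is used with almost the same vocabulary as `#politics` and is therefore added to the extended blacklist, while `#drama` shares no morphemes and is left alone.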
The blacklist hashtag extension unit 160 outputs information indicating the extracted hashtags of documents to be excluded (extended blacklist hashtags) to the extended blacklist hashtag storage unit 173, where it is stored. FIG. 7(c) shows a sample format of the extended blacklist hashtags stored in the extended blacklist hashtag storage unit 173. As shown in FIG. 7(c), each row of data corresponds to one extended blacklist hashtag, and the rows are stored for each blacklist hashtag.
The blacklist user storage unit 180 is means for receiving and storing blacklist user IDs, each of which identifies a blacklist user. A blacklist user is a user whose posted documents are to be excluded. Blacklist user IDs are registered in advance by, for example, the administrator of the related document extraction apparatus 10. FIG. 7(d) shows a sample format of the blacklist user IDs stored in the blacklist user storage unit 180. As shown in FIG. 7(d), each row of data corresponds to one blacklist user ID, and the data is stored for each blacklist user ID. Any information other than a user ID may be used as long as it allows a blacklist user to be identified.
The noise removal unit 190 is one function of the topic document extraction means: it determines whether a document input from the topic ID assigning unit 150 is an inappropriate document (a document related to an inappropriate topic, that is, noise) and excludes such documents. Specifically, the noise removal unit 190 has the following functions.
The noise removal unit 190 reads the blacklist morphemes from the default blacklist morpheme storage unit 171 and determines whether the document input from the topic ID assigning unit 150 contains any blacklist morpheme. This determination is made by string matching between the document and the blacklist morphemes. When the noise removal unit 190 determines that the document contains a blacklist morpheme, it excludes the document as an inappropriate document.
The noise removal unit 190 determines whether the document input from the topic ID assigning unit 150 was posted by carrying over another document or was posted as a reply to another document. Specifically, the noise removal unit 190 determines whether the document is an RT (retweet) or a reply tweet. Whether a document is an RT can be determined through the official Twitter API. Alternatively, the determination may be made by text analysis: it can easily be made by checking whether the document contains the character string "RT" or contains a user name. When the noise removal unit 190 determines that the document was posted by carrying over another document or as a reply to another document, it excludes the document as an inappropriate document.
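The text-analysis variant of this check can be sketched roughly as below. The sketch assumes that a user name appears in the text as an "@name" mention and that "RT" counts only when it is not embedded in a longer alphanumeric word; both conventions, and the name `is_retweet_or_reply`, are assumptions for illustration and are not specified in the text.

```python
import re

# "RT" as a standalone token, not as part of a longer ASCII word such as "SMART".
RT_PATTERN = re.compile(r"(?<![A-Za-z0-9_])RT(?![A-Za-z0-9_])")
# Assumed mention syntax: "@" followed by letters, digits or underscores,
# and not preceded by a word character (which would suggest an e-mail address).
MENTION_PATTERN = re.compile(r"(?<![A-Za-z0-9_])@[A-Za-z0-9_]+")


def is_retweet_or_reply(text: str) -> bool:
    """Return True when the text looks like a retweet or a reply."""
    return bool(RT_PATTERN.search(text) or MENTION_PATTERN.search(text))
```

A heuristic like this can misjudge ordinary posts that merely mention another user, so metadata from the posting service, where available, is likely to be more reliable.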
The noise removal unit 190 performs multi-post determination. A multi-post is a post directed at multiple topics; that is, the unit determines whether a document relates to more than one topic. For example, when each broadcast station is treated as one topic, a document containing both #fff and #zzz, which are hashtags of different broadcast stations, is posted toward multiple broadcast stations and is therefore regarded as a multi-post. The noise removal unit 190 determines whether a document is a multi-post by checking whether the topic ID assigning unit 150 has assigned multiple topic IDs to the document input from it. When the noise removal unit 190 determines that the document is a multi-post, it excludes the document as an inappropriate document.
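The multi-post check reduces to counting distinct topic IDs per document. The following is a small sketch, assuming each document is represented as a dictionary with a `topic_ids` list; that record layout and the function names are hypothetical.

```python
def is_multipost(topic_ids) -> bool:
    """A document is a multi-post when more than one distinct topic ID is assigned."""
    return len(set(topic_ids)) > 1


def drop_multiposts(docs):
    """Keep only the documents that are not multi-posts."""
    return [doc for doc in docs if not is_multipost(doc["topic_ids"])]
```

Using a set means that the same topic ID assigned twice (for example once by score and once by hashtag) is not mistaken for a multi-post.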
The noise removal unit 190 determines whether the document input from the topic ID assigning unit 150 was posted by a blacklist user. The noise removal unit 190 reads the user IDs of the blacklist users from the blacklist user storage unit 180 and compares them with the user ID of the user who posted the document; if they match, it determines that the document was posted by a blacklist user. When the noise removal unit 190 determines that the document was posted by a blacklist user, it excludes the document as an inappropriate document.
The noise removal unit 190 uses the information stored in the blacklist tag storage unit 170 to determine whether the document input from the topic ID assigning unit 150 is an inappropriate document. In particular, the noise removal unit 190 also identifies and excludes inappropriate documents that do not contain a blacklist hashtag.
The noise removal unit 190 reads the blacklist hashtags stored by the default blacklist hashtag storage unit 172 and determines whether the document contains one of them (here the blacklist hashtags play the role of the default topic tags). It excludes any document containing such a tag as an inappropriate document.
The noise removal unit 190 is also score calculation means: it reads the feature amount information (the TFIDF value (score) of each morpheme for each default topic tag) stored by the blacklist tag storage unit 170 and calculates, from that information, the score of each document for each default topic tag. The noise removal unit 190 determines whether the document to be scored contains morphemes belonging to the feature amount (feature words) and adds up the scores of the feature words contained in the document. As with the score calculation by the topic ID assigning unit 150, when a feature word appears in a document more than once, the document's score may be calculated as if the word appeared only once.
The noise removal unit 190 is also first topic document determination means: based on the calculated score, it determines whether the scored document is an inappropriate document that should be excluded. Specifically, the noise removal unit 190 determines whether the score is equal to or greater than a preset threshold and, if so, judges the document to be inappropriate and excludes it. This processing is repeated for each blacklist hashtag whose feature amount is stored in the blacklist tag storage unit 170, and documents are excluded accordingly.
The noise removal unit 190 is also second topic document determination means: it reads the extended blacklist hashtags stored by the extended blacklist hashtag storage unit 173 and determines whether an acquired document is an inappropriate document that should be excluded by checking whether the document contains an extended blacklist hashtag. The noise removal unit 190 judges a document containing an extended blacklist hashtag to be inappropriate and excludes it. This processing is repeated for each extended blacklist hashtag stored in the extended blacklist hashtag storage unit 173.
The noise removal unit 190 outputs the documents that were not excluded by the above processing to the topic document storage unit 200. Documents excluded by the noise removal unit 190 may also be kept out of the processing by the topic tag estimation unit 130. For example, each document stored in the document storage unit 100 and each morpheme stored in the morpheme storage unit may be associated with information indicating whether it belongs to a document removed by the noise removal unit 190, so that data belonging to removed documents is not input to the topic tag estimation unit 130.
The topic document storage unit 200 is means for receiving and storing the documents input from the noise removal unit 190, each of which has exactly one topic ID assigned. A document with a topic ID has been extracted as a document related to the topic identified by that topic ID. FIG. 8 shows a sample format of the documents stored in the topic document storage unit 200. As shown in FIG. 8, the data for a document stored in the topic document storage unit 200 consists of the data for the document stored in the document storage unit 100 with the topic ID associated with it. The documents with topic IDs stored in the topic document storage unit 200 are provided to users, for example, as documents related to a topic, grouped by topic ID. This concludes the description of the functional configuration of the related document extraction apparatus 10.
FIG. 9 shows the hardware configuration of the related document extraction apparatus 10. As shown in FIG. 9, the related document extraction apparatus 10 is configured to include a computer equipped with hardware such as a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002 and a ROM (Read Only Memory) 1003 serving as main storage devices, a communication module 1004 for communication, and an auxiliary storage device 1005 such as a hard disk. The functions of the related document extraction apparatus 10 described above are realized when these components operate under the control of a program or the like. This concludes the description of the configuration of the related document extraction apparatus 10.
Next, the related document extraction method, which is the processing executed by the related document extraction apparatus 10 according to this embodiment, is described with reference to the flowcharts of FIGS. 10 to 21. FIG. 10 is a flowchart of the overall related document extraction method. In this processing, the document storage unit 100 first receives and stores the plurality of documents from which extraction is to be performed (S01). The documents input to the document storage unit 100 are output to the morphological analysis unit 110. The morphological analysis unit 110 then performs morphological analysis on each document and divides it into morphemes (S02, word acquisition step). Information indicating the morphemes obtained by the morphological analysis is stored in the morpheme storage unit 120.
Next, the topic tag estimation unit 130 generates the information used to determine whether each document is related to a specific topic (S03). It does so from the documents stored in the document storage unit 100, the morphemes obtained by the morphological analysis unit 110, and the information stored in the topic tag storage unit 140. This processing is performed by the topic feature word estimation unit 131 and by the topic hashtag estimation unit 132.
The processing by the topic feature word estimation unit 131 is described with reference to the flowcharts of FIGS. 11 to 13. As shown in FIG. 11, the topic feature word estimation unit 131 reads the default topic tags stored by the default topic tag storage unit 141 and extracts, from the plurality of documents stored by the document storage unit 100, the documents containing a default topic tag as documents related to the topic (topic documents) (S301, default document extraction step). A feature amount is then generated for each topic (S302, first appearance frequency calculation step). This processing is described in more detail with reference to the flowchart of FIG. 12.
First, the IDF value of each morpheme is calculated (S3021, first appearance frequency calculation step). Next, for each topic ID being processed, the TF value of each morpheme is calculated from the morphemes contained in the topic documents of that topic (S3022, first appearance frequency calculation step). The TFIDF value of each morpheme for each topic ID is then obtained from the calculated IDF and TF values (S3023, first appearance frequency calculation step). The TFIDF values obtained in this way are the feature amount. The processing of S3022 and S3023 is repeated until all topic IDs have been processed.
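Steps S3021 to S3023 can be sketched as below. The sketch assumes common textbook definitions, namely TF as the relative frequency of a morpheme within a topic's documents and IDF as the natural logarithm of the number of documents divided by the number of documents containing the morpheme; the formulas actually used by the apparatus are defined elsewhere in this document and may differ. All function and parameter names are hypothetical, and each document is assumed to be a list of morphemes.

```python
import math
from collections import Counter


def compute_idf(all_docs):
    """S3021: IDF of each morpheme over the whole corpus (assumed log(N / df))."""
    n_docs = len(all_docs)
    doc_freq = Counter()
    for morphemes in all_docs:
        doc_freq.update(set(morphemes))  # count each morpheme once per document
    return {m: math.log(n_docs / df) for m, df in doc_freq.items()}


def compute_tf(topic_docs):
    """S3022: relative frequency of each morpheme within one topic's documents."""
    counts = Counter(m for morphemes in topic_docs for m in morphemes)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


def compute_tfidf(all_docs, docs_by_topic):
    """S3023: TFIDF of each morpheme, repeated for every topic ID."""
    idf = compute_idf(all_docs)
    features = {}
    for topic_id, topic_docs in docs_by_topic.items():
        tf = compute_tf(topic_docs)
        features[topic_id] = {m: tf[m] * idf.get(m, 0.0) for m in tf}
    return features
```

Because the IDF step does not depend on the topic, it is computed once outside the per-topic loop, which matches the order of the steps in the flowchart description.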
Returning to FIG. 11, the feature words are then stored in the topic feature word storage unit 142 for each topic ID (S303, first appearance frequency calculation step). This processing, which is performed for each topic ID, is described in more detail with reference to the flowchart of FIG. 13. For each morpheme, it is determined whether the TFIDF value of the morpheme is equal to or greater than a preset threshold (S3031, first appearance frequency calculation step). If it is, the morpheme and its TFIDF value are output to the topic feature word storage unit 142 and stored there for that topic ID (S3032, first appearance frequency calculation step). If it is not, no particular processing is performed and processing moves on to the next morpheme. This processing is repeated for all morphemes of each topic ID, and until all topic IDs have been processed. This concludes the processing by the topic feature word estimation unit 131.
Next, the processing by the topic hashtag estimation unit 132 is described with reference to the flowcharts of FIGS. 14 and 15. As shown in FIG. 14, the topic hashtag estimation unit 132 reads the default topic tags stored by the default topic tag storage unit 141 and extracts, from the plurality of documents stored by the document storage unit 100, the documents containing a hashtag other than the default topic tags (tag documents) (S311, tag document extraction step). A feature amount is then generated for each hashtag (S312, second appearance frequency calculation step). The feature amount is generated in the same way as in the processing described with reference to the flowchart of FIG. 12. In this case, however, the loop shown in FIG. 12 is executed for each hashtag and is repeated until all hashtags have been processed.
Next, the documents containing a default topic tag are extracted from the plurality of documents stored by the document storage unit 100 as documents related to the topic (topic documents) (S313, default document extraction step). A feature amount is then generated for each topic (S314, first appearance frequency calculation step). The feature amount is generated in the same way as in the processing described with reference to the flowchart of FIG. 12.
Next, the feature amount of each topic ID calculated as described above is compared with the feature amount of each hashtag, and, based on the comparison result, the hashtags belonging to the topic of the topic ID (extended topic hashtags) are output to the extended topic hashtag storage unit 143 and stored there (S315, second topic document determination step).
This processing, which is performed for each combination of topic ID and hashtag, is described in more detail with reference to the flowchart of FIG. 15. The similarity between the feature amount of the topic ID and the feature amount of the hashtag is calculated (S3151, second topic document determination step). As described above, the cosine distance, for example, is used as this similarity. It is then determined whether the calculated similarity is equal to or greater than a preset threshold (S3152, second topic document determination step). If it is, the hashtag is output to the extended topic hashtag storage unit 143 and stored there as an extended topic hashtag for that topic ID (S3153, second topic document determination step). If it is not, no particular processing is performed and processing moves on to the next hashtag. This processing is repeated for all hashtags for each topic ID, and until all topic IDs have been processed. This concludes the processing by the topic hashtag estimation unit 132.
Returning to FIG. 10, the blacklist hashtag extension unit 160 then generates the information used to determine whether each document is noise, that is, whether each document is related to a specific topic that is inappropriate for extraction (S04). It does so from the documents stored in the document storage unit 100, the morphemes obtained by the morphological analysis unit 110, and the information stored in the blacklist tag storage unit 170.
In this processing, the blacklist hashtag extension unit 160 estimates the feature words of the blacklist hashtags. In this estimation, the IDF value of each morpheme and the TF value of each morpheme for each blacklist hashtag are calculated, and from them the TFIDF value of each morpheme for each blacklist hashtag is calculated. The calculated TFIDF values are output to the blacklist tag storage unit 170 and stored there. Only the morphemes whose TFIDF value is equal to or greater than a preset threshold (feature words) may be stored in the blacklist tag storage unit 170.
The blacklist hashtag extension unit 160 also estimates the extended blacklist hashtags. This processing is described with reference to the flowcharts of FIGS. 16 and 17. As shown in FIG. 16, the blacklist hashtag extension unit 160 reads the blacklist hashtags stored by the default blacklist hashtag storage unit 172 and extracts, from the plurality of documents stored by the document storage unit 100, the documents containing a hashtag other than the blacklist hashtags (tag documents) (S411, tag document extraction step). A feature amount is then generated for each hashtag (S412, second appearance frequency calculation step). The feature amount is generated in the same way as in the processing described with reference to the flowchart of FIG. 12. In this case, however, the loop shown in FIG. 12 is executed for each hashtag and is repeated until all hashtags have been processed.
Next, the documents containing a blacklist hashtag are extracted from the plurality of documents stored by the document storage unit 100 (S414, default document extraction step). A feature amount is then generated for each blacklist hashtag (S415, first appearance frequency calculation step). The feature amount is generated in the same way as in the processing described with reference to the flowchart of FIG. 12. In this case, however, the loop shown in FIG. 12 is executed for each blacklist hashtag and is repeated until all blacklist hashtags have been processed.
Next, the feature amount of each blacklist hashtag calculated as described above is compared with the feature amount of each hashtag, and, based on the comparison result, the hashtags associated with the blacklist hashtag (extended blacklist hashtags) are output to the extended blacklist hashtag storage unit 173 and stored there (S415, second topic document determination step).
This processing, which is performed for each combination of blacklist hashtag and hashtag, is described in more detail with reference to the flowchart of FIG. 17. The similarity between the feature amount of the blacklist hashtag and the feature amount of the hashtag is calculated (S4151, second topic document determination step). As described above, the cosine distance, for example, is used as this similarity. It is then determined whether the calculated similarity is equal to or greater than a preset threshold (S4152, second topic document determination step). If it is, the hashtag is output to the extended blacklist hashtag storage unit 173 and stored there as an extended blacklist hashtag for that blacklist hashtag (S4153, second topic document determination step). If it is not, no particular processing is performed and processing moves on to the next hashtag. This processing is repeated for all hashtags for each blacklist hashtag, and until all blacklist hashtags have been processed. This concludes the processing by the blacklist hashtag extension unit 160.
Returning to FIG. 10, the topic ID assigning unit 150 then uses the information stored in the topic tag storage unit 140 to determine whether each document stored by the document storage unit 100 is related to a topic, and assigns a topic ID to the document according to the result (S05, topic document extraction step).
This processing is described in more detail with reference to the flowchart of FIG. 18. The topic ID assigning unit 150 reads the feature amount information stored by the topic feature word estimation unit 131 and assigns topic IDs to documents based on that information (S501, topic document extraction step).
This processing, which is performed for each document to which a topic may be assigned, is described in more detail with reference to the flowchart of FIG. 19. First, for each topic (topic ID), the feature amount information stored by the topic feature word estimation unit 131 is acquired (S5011, score calculation step). The "score total" of the document is then initialized to zero (S5012, score calculation step). Next, for each feature word, it is determined whether the feature word is contained in the document (S5013, score calculation step). If the feature word is contained in the document, its score (TFIDF value) is added to the "score total" (S5014, score calculation step). If it is not, its score is not added to the "score total".
When the above processing (S5013, S5014) has been completed for all feature words, it is determined whether the "score total" is equal to or greater than a preset threshold (S5015, first topic document determination step). If it is, the topic ID of the topic is assigned to the document (S5016, first topic document determination step). If it is not, the topic ID of the topic is not assigned to the document. This processing is repeated for all topics for each document, and until all documents have been processed.
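The scoring loop of S5011 to S5016 can be sketched for a single document as follows. The sketch assumes the document is given as a list of morphemes and the feature amounts as a dictionary of topic ID to a dictionary of feature word to score; it also adopts the optional behavior mentioned earlier of counting a repeated feature word only once. The function name and data layout are hypothetical.

```python
def assign_topic_ids_by_score(doc_morphemes, features_by_topic, threshold):
    """Return the topic IDs whose summed feature-word score reaches the threshold."""
    present = set(doc_morphemes)  # a repeated feature word is counted once
    assigned = []
    for topic_id, feature_words in features_by_topic.items():  # S5011
        score_total = 0.0                                       # S5012
        for word, score in feature_words.items():               # S5013
            if word in present:
                score_total += score                            # S5014
        if score_total >= threshold:                            # S5015
            assigned.append(topic_id)                           # S5016
    return assigned
```

A document may receive more than one topic ID from this loop; such documents are the ones later treated as multi-posts by the noise removal unit 190.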
Returning to FIG. 18, the topic ID assigning unit 150 then reads the default topic tags stored by the default topic tag storage unit 141 and the extended topic hashtags stored by the extended topic hashtag storage unit 143, and assigns topic IDs to documents based on that information (S502, topic document extraction step (second topic document determination step)).
 This process is described in more detail with reference to the flowchart of FIG. 20. The process is performed for each document to which a topic may be assigned. First, for each topic (topic ID), the default topic tags and extended topic hash tags associated with the topic are acquired (S5021). Next, for each default topic tag and extended topic hash tag, it is determined whether the tag is included in the document (S5022, second topic document determination step). If a tag is determined to be included in the document, the topic ID of the topic is assigned to the document (S5023, second topic document determination step). If the tag is determined not to be included in the document, the topic ID of the topic is not assigned to the document. The above processing (S5022, S5023) is performed for all default topic tags and extended topic hash tags associated with the topic. It is repeated for every topic for each document, and continues until all documents have been processed.
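 The tag-based assignment of steps S5021 to S5023 can be sketched as follows. The function name `assign_topics_by_tags`, the whitespace tokenization, and the sample tags are assumptions for illustration; the embodiment works on morphologically analyzed text, which this sketch does not reproduce.

```python
def assign_topics_by_tags(doc_text, topic_tags):
    """Return the topic IDs for which at least one associated tag occurs in the document.

    topic_tags -- hypothetical layout: {topic_id: [default and extended tags]}
    """
    tokens = set(doc_text.split())  # simplification: whitespace tokens instead of morphemes
    assigned = []
    for topic_id, tags in topic_tags.items():   # S5021: tags associated with the topic
        if any(tag in tokens for tag in tags):  # S5022: is any tag in the document?
            assigned.append(topic_id)           # S5023: assign the topic ID
    return assigned


# Illustrative tags only.
topic_tags = {
    "soccer": ["#soccer", "#worldcup"],
    "drama": ["#drama"],
}
tagged = assign_topics_by_tags("great save #worldcup", topic_tags)
print(tagged)
```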
 Documents to which a topic ID has been assigned by the topic ID assigning unit 150 are output to the noise removal unit 190.
 Next, the noise removal unit 190 determines whether each document input from the topic ID assigning unit 150 is an inappropriate document, and excludes inappropriate documents (S601, topic document extraction step).
 This process is described in more detail with reference to the flowchart of FIG. 21. The process is performed for each document input from the topic ID assigning unit 150 (that is, each document to which a topic has been assigned). Blacklist morphemes (NG words) are read from the default blacklist morpheme storage unit 171, and it is determined whether the document contains a blacklist morpheme (S601). If the document is determined to contain a blacklist morpheme, it is excluded as an inappropriate document (no subsequent processing is performed on it).
 If the document is determined not to contain a blacklist morpheme, it is then determined whether the document is an RT (retweet) or a reply tweet (S602). If the document is determined to be an RT or a reply tweet, it is excluded as an inappropriate document (no subsequent processing is performed on it).
 If the document is determined to be neither an RT nor a reply tweet, it is then determined whether the document has been multi-posted (S603). If the document is determined to have been multi-posted, it is excluded as an inappropriate document (no subsequent processing is performed on it).
 If the document is determined not to have been multi-posted, the user IDs of blacklist users are read from the blacklist user storage unit 180, and it is determined whether the document was posted by a blacklist user (S604). If the document is determined to have been posted by a blacklist user, it is excluded as an inappropriate document (no subsequent processing is performed on it).
 If the document is determined not to have been posted by a blacklist user, the feature amounts stored by the blacklist tag storage unit 170 (the TFIDF value (score) of each morpheme for each default topic tag) and the extended blacklist hash tags stored by the extended blacklist hash tag storage unit 173 are read, and based on them it is determined, as described above, whether the document is an inappropriate document that should be excluded (S605). If the document is determined to be inappropriate, it is excluded (no subsequent processing is performed on it). If the document is determined not to be inappropriate, it is output from the noise removal unit 190 to the topic document storage unit 200.
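 The ordered checks of steps S601 to S605 can be sketched as a short-circuiting filter. The `Doc` record, its field names, and the function `is_noise` are hypothetical names introduced for this example. The S605 check is passed in as a caller-supplied predicate (`matches_blacklist_topic`), because its scoring details are described elsewhere in the document and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical record for one topic-tagged document."""
    user_id: str
    words: list
    is_retweet: bool = False
    is_reply: bool = False
    is_multipost: bool = False


def is_noise(doc, ng_words, blacklist_users, matches_blacklist_topic):
    """Return True if the document should be excluded; checks run in the order S601..S605."""
    if any(word in ng_words for word in doc.words):  # S601: contains a blacklist morpheme
        return True
    if doc.is_retweet or doc.is_reply:               # S602: RT or reply tweet
        return True
    if doc.is_multipost:                             # S603: multi-posted
        return True
    if doc.user_id in blacklist_users:               # S604: posted by a blacklist user
        return True
    return matches_blacklist_topic(doc)              # S605: blacklist feature / hash tag check


# Illustrative documents; the S605 predicate is stubbed out to always pass.
docs = [
    Doc("u1", ["goal", "today"]),
    Doc("u2", ["goal"], is_retweet=True),
    Doc("spam", ["goal"]),
    Doc("u3", ["casino", "goal"]),
]
kept = [d.user_id for d in docs
        if not is_noise(d, {"casino"}, {"spam"}, lambda d: False)]
print(kept)
```

 Only documents that pass every check would be forwarded to the topic document storage unit 200.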
 Next, returning to FIG. 10, the topic document storage unit 200 stores each input document together with its assigned topic ID. The above is the processing executed by the related document extraction apparatus 10 according to the present embodiment. This processing may be performed, for example, at preset time intervals, or may be triggered by an operation of the administrator of the related document extraction apparatus 10. In the above description, the assignment of topic IDs to documents and the generation of the information used for that assignment (feature amounts and extended topic hash tags) form one series of processing, but these two kinds of processing may instead be performed independently at different timings.
 As described above, in the present embodiment, documents related to a topic are extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches that appearance frequency. Thus, according to the present embodiment, documents related to a specific topic can be appropriately extracted from a plurality of documents such as tweets, and documents related to the topic can be extracted exhaustively. In addition to exhaustiveness, topic hash tags and topic feature words can be estimated dynamically, so documents related to a topic can be extracted in real time.
 As in the present embodiment, a document score may be calculated from the feature words and used to extract documents. With this configuration, for example, a document containing words that appear frequently in documents including the default topic tag can be extracted as a document related to the topic, so documents related to a specific topic can be reliably extracted. As a result, documents without hashtags can also be extracted, which increases the number of documents that can be extracted.
 When the score is calculated, a word that appears multiple times in a document may be treated the same as a word that appears once. This configuration prevents a document's score from being inflated by a word that is repeated within the document, and so avoids extracting an inappropriate document as a document related to the topic.
 Also, as in the present embodiment, the topic hash tags may be extended by comparing the feature amounts of tag documents and topic documents, and documents may be extracted using the extended tags. With this configuration, a document (or group of documents) including a tag other than the default topic tag can be extracted as related to the topic, so documents related to a specific topic can be reliably extracted. Accordingly, registering one or more tags (hashtags or keywords) related to the topic in advance as default topic tags makes dynamic estimation of tags possible and increases the number of documents that can be extracted.
 In general, a poster writes a document and attaches a hashtag with a specific topic in mind. Topics and topic hash tags therefore have a one-to-N relationship, so collecting as many of the hash tags tied to a topic as possible allows more topic documents to be extracted. For example, users often post tweets about a program being broadcast with the hashtag of the broadcast station. Well-known programs, however, have hashtags of their own. By comparing the feature amounts of the topic and of the hash tags, hash tags related to the program being broadcast can be detected dynamically and sooner.
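 The comparison of topic and tag feature amounts can be sketched with cosine similarity, one of the measures named in claim 5. This is an illustration under assumptions: the function names, the sparse `{word: weight}` vector layout, the sample weights, and the acceptance threshold `min_similarity` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse feature vectors given as {word: weight} dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_topic_tags(topic_vec, tag_vecs, min_similarity):
    """Return candidate tags whose documents resemble the topic's documents closely enough."""
    return [tag for tag, vec in tag_vecs.items()
            if cosine_similarity(topic_vec, vec) >= min_similarity]


# Illustrative feature amounts; real values would be TFIDF scores per morpheme.
topic_vec = {"goal": 1.0, "keeper": 1.0}
tag_vecs = {
    "#tvstation": {"goal": 1.0, "keeper": 1.0, "news": 1.0},
    "#cooking": {"recipe": 1.0},
}
extended = expand_topic_tags(topic_vec, tag_vecs, min_similarity=0.8)
print(extended)
```

 A tag accepted this way would then be usable alongside the default topic tags when topic IDs are assigned.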
 Also, noise may be removed as in the present embodiment. Noise removal is important in document extraction. With this configuration, inappropriate documents are excluded, which prevents, for example, presenting inappropriate documents to a user. Furthermore, if topic hash tags and topic feature words are estimated from a document group from which noise has been removed, their estimation accuracy improves. As described above, in estimating topic hash tags and topic feature words the feature amount serves as the reference value representing the topic, so the noisier this data is, the lower the quality of the estimated information. Cleansing the seed data is therefore important. Noise removal also makes it possible to extract noise-free documents related to a topic. Performing noise removal dynamically, in the same way as document extraction, allows noise to be removed even more appropriately. In addition, because the blacklist is expanded automatically in real time, there is less need to register blacklist entries manually.
 However, when the document group to be extracted is considered to contain little noise, noise (inappropriate documents) does not necessarily have to be removed.
 Also, as in the present embodiment, the appearance frequency may be counted per user. This configuration evens out the influence of individual users and, for example, suppresses the effect of one user posting documents with the same content multiple times. Documents related to a specific topic can thereby be extracted appropriately. However, when information on the user who posted a document cannot be acquired, or when users are not expected to post the same content repeatedly, the appearance frequency may be counted per document. That is, the IDF value and the TF value may be calculated by counting per document.
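 Counting the appearance frequency per user rather than per document can be sketched as follows. The function name `user_counts` and the `(user_id, words)` post layout are assumptions for illustration, and the sketch stops at the raw counts: the exact TF and IDF formulas of the embodiment are defined elsewhere in the document and are not reproduced here.

```python
from collections import defaultdict


def user_counts(posts):
    """Count, for each word, the number of distinct users who posted a document containing it.

    posts -- iterable of (user_id, words) pairs, one pair per document
    Returns (word -> number of distinct users, total number of distinct users).
    """
    users_by_word = defaultdict(set)
    all_users = set()
    for user_id, words in posts:
        all_users.add(user_id)
        for word in set(words):
            users_by_word[word].add(user_id)
    counts = {word: len(users) for word, users in users_by_word.items()}
    return counts, len(all_users)


# One user repeats the same post three times; a per-document count of "goal" would be 4.
posts = [
    ("u1", ["goal"]),
    ("u1", ["goal"]),
    ("u1", ["goal"]),
    ("u2", ["goal", "keeper"]),
]
counts, total_users = user_counts(posts)
print(counts, total_users)
```

 Here "goal" is counted for two users even though it appears in four documents, so the repeated posts by `u1` do not inflate its frequency.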
 Also, as in the present embodiment, expressing the feature amount with morpheme-level features based on TFIDF values makes it possible to represent both the popularity and the rarity of a morpheme. Documents related to a specific topic can thereby be extracted even more appropriately from a plurality of documents such as tweets.
 Documents related to a plurality of topics may also be excluded. Documents posted to a plurality of topics (multi-topic posts) are often not actually related to each of those topics. This configuration therefore avoids extracting an inappropriate document as a document related to a topic.
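 Excluding multi-topic posts can be sketched as a filter over the assigned topic IDs. The function name `drop_multi_topic` and the `{doc_id: [topic_id, ...]}` layout are assumptions made for this example.

```python
def drop_multi_topic(doc_topics):
    """Keep only documents that were assigned at most one distinct topic ID.

    doc_topics -- hypothetical layout: {doc_id: [topic_id, ...]}
    """
    return {doc_id: topics for doc_id, topics in doc_topics.items()
            if len(set(topics)) < 2}


# Illustrative assignments only.
doc_topics = {
    "d1": ["soccer"],
    "d2": ["soccer", "drama"],
    "d3": ["drama", "drama"],
}
single_topic = drop_multi_topic(doc_topics)
print(single_topic)
```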
 Next, a related document extraction program for causing a computer to execute the above-described series of processing by the related document extraction apparatus 10 is described. As shown in FIG. 22, the related document extraction program 40 is stored in a program storage area 31 formed on a recording medium 30 that is either inserted into a computer and accessed or provided in the computer.
 The related document extraction program 40 comprises a document storage module 400, a morphological analysis module 410, a morpheme storage module 420, a topic tag estimation module 430, a topic tag storage module 440, a topic ID assignment module 450, a blacklist hash tag extension module 460, a blacklist tag storage module 470, a blacklist user storage module 480, a noise removal module 490, and a topic document storage module 500. The functions realized by executing these modules are the same as the functions of, respectively, the document storage unit 100, the morphological analysis unit 110, the morpheme storage unit 120, the topic tag estimation unit 130, the topic tag storage unit 140, the topic ID assigning unit 150, the blacklist hash tag extension unit 160, the blacklist tag storage unit 170, the blacklist user storage unit 180, the noise removal unit 190, and the topic document storage unit 200 of the related document extraction apparatus 10 described above.
 Part or all of the related document extraction program 40 may be transmitted via a transmission medium such as a communication line, and received and recorded (including installed) by another device. Each module of the related document extraction program 40 may also be installed on any of a plurality of computers rather than on a single computer. In that case, the above-described series of processing of the related document extraction program 40 is performed by a computer system made up of the plurality of computers.
 DESCRIPTION OF SYMBOLS: 10 ... related document extraction apparatus, 100 ... document storage unit, 110 ... morphological analysis unit, 120 ... morpheme storage unit, 130 ... topic tag estimation unit, 132 ... topic hash tag estimation unit, 131 ... topic feature word estimation unit, 140 ... topic tag storage unit, 141 ... default topic tag storage unit, 142 ... topic feature word storage unit, 143 ... extended topic hash tag storage unit, 150 ... topic ID assigning unit, 160 ... blacklist hash tag extension unit, 170 ... blacklist tag storage unit, 171 ... default blacklist morpheme storage unit, 172 ... default blacklist hash tag storage unit, 173 ... extended blacklist hash tag storage unit, 180 ... blacklist user storage unit, 190 ... noise removal unit, 200 ... topic document storage unit, 1001 ... CPU, 1002 ... RAM, 1003 ... ROM, 1004 ... communication module, 1005 ... auxiliary storage device, 30 ... recording medium, 31 ... program storage area, 40 ... related document extraction program, 400 ... document storage module, 410 ... morphological analysis module, 420 ... morpheme storage module, 430 ... topic tag estimation module, 440 ... topic tag storage module, 450 ... topic ID assignment module, 460 ... blacklist hash tag extension module, 470 ... blacklist tag storage module, 480 ... blacklist user storage module, 490 ... noise removal module, 500 ... topic document storage module.

Claims (12)

  1.  A related document extraction apparatus comprising:
     default topic tag storage means for storing in advance a default topic tag indicating a topic;
     document storage means for storing a plurality of documents in advance;
     word acquisition means for dividing a document stored by the document storage means into words;
     default document extraction means for extracting, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means;
     first appearance frequency calculation means for calculating the appearance frequency, in the document extracted by the default document extraction means, of the words divided by the word acquisition means; and
     topic document extraction means for extracting a document related to the topic from documents other than the document extracted by the default document extraction means, using the appearance frequency calculated by the first appearance frequency calculation means.
  2.  The related document extraction apparatus according to claim 1, wherein the topic document extraction means comprises:
     score calculation means for calculating, using the appearance frequency calculated by the first appearance frequency calculation means, a score of a document other than the document extracted by the default document extraction means from the words appearing in that document; and
     first topic document determination means for determining, based on the score calculated by the score calculation means, whether the document to which the score relates is a document related to the topic.
  3.  The related document extraction apparatus according to claim 2, wherein, when a word appears multiple times in a document, the score calculation means calculates the score of the document in the same manner as when the word appears once.
  4.  The related document extraction apparatus according to any one of claims 1 to 3, wherein the topic document extraction means comprises:
     tag document extraction means for extracting, from the plurality of documents stored by the document storage means, a document including a tag other than the default topic tag;
     second appearance frequency calculation means for calculating the appearance frequency, in the document extracted by the tag document extraction means, of the words divided by the word acquisition means; and
     second topic document determination means for comparing the appearance frequency calculated by the first appearance frequency calculation means with the appearance frequency calculated by the second appearance frequency calculation means, and determining, based on the comparison result, whether the document extracted by the tag document extraction means is a document related to the topic.
  5.  The related document extraction apparatus according to claim 4, wherein the second topic document determination means compares the appearance frequencies by calculating a cosine distance, a Jaccard distance, or a Euclidean distance between the feature amount indicated by the word appearance frequency calculated by the first appearance frequency calculation means and the feature amount indicated by the word appearance frequency calculated by the second appearance frequency calculation means.
  6.  The related document extraction apparatus according to any one of claims 1 to 5, wherein
     the default topic tag storage means stores, as the default topic tag, a default topic tag relating to an inappropriate topic, and
     the topic document extraction means determines whether a document is a document related to the inappropriate topic and excludes documents accordingly.
  7.  The related document extraction apparatus according to any one of claims 1 to 6, wherein
     the document storage means stores information relating to the user who posted each document, and
     the first appearance frequency calculation means calculates, as the appearance frequency of a word, the number of users who posted a document including the word.
  8.  The related document extraction apparatus according to claim 7, wherein
     the first appearance frequency calculation means calculates, for each word, an inverse appearance frequency from the ratio of the total number of users who posted documents to the number of users who posted a document including the word, and
     the topic document extraction means extracts a document related to the topic also using the inverse appearance frequency calculated by the first appearance frequency calculation means.
  9.  The related document extraction apparatus according to claim 8, wherein the topic document extraction means extracts a document related to the topic also using the number of characters of each word.
  10.  The related document extraction apparatus according to any one of claims 1 to 9, wherein
     the default topic tag storage means stores a plurality of default topic tags each indicating one of a plurality of topics, and
     the topic document extraction means excludes documents related to a plurality of topics.
  11.  A related document extraction method performed by a related document extraction apparatus comprising default topic tag storage means for storing in advance a default topic tag indicating a topic, and document storage means for storing a plurality of documents in advance, the method comprising:
     a word acquisition step of dividing a document stored by the document storage means into words;
     a default document extraction step of extracting, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means;
     a first appearance frequency calculation step of calculating the appearance frequency, in the document extracted in the default document extraction step, of the words divided in the word acquisition step; and
     a topic document extraction step of extracting a document related to the topic from documents other than the document extracted in the default document extraction step, using the appearance frequency calculated in the first appearance frequency calculation step.
  12.  A related document extraction program causing a computer to function as:
     default topic tag storage means for storing in advance a default topic tag indicating a topic;
     document storage means for storing a plurality of documents in advance;
     word acquisition means for dividing a document stored by the document storage means into words;
     default document extraction means for extracting, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means;
     first appearance frequency calculation means for calculating the appearance frequency, in the document extracted by the default document extraction means, of the words divided by the word acquisition means; and
     topic document extraction means for extracting a document related to the topic from documents other than the document extracted by the default document extraction means, using the appearance frequency calculated by the first appearance frequency calculation means.
PCT/JP2013/070376 2012-08-03 2013-07-26 Relevant document extraction device, relevant document extraction method and relevant document extraction program WO2014021229A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012172793A JP5389234B1 (en) 2012-08-03 2012-08-03 Related document extracting apparatus, related document extracting method, and related document extracting program
JP2012-172793 2012-08-03

Publications (1)

Publication Number Publication Date
WO2014021229A1 true WO2014021229A1 (en) 2014-02-06

Family

ID=50027906

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/070376 WO2014021229A1 (en) 2012-08-03 2013-07-26 Relevant document extraction device, relevant document extraction method and relevant document extraction program

Country Status (2)

Country Link
JP (1) JP5389234B1 (en)
WO (1) WO2014021229A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016024485A (en) * 2014-07-16 2016-02-08 株式会社ビデオリサーチ Contributed document acquiring device, and contributed document acquiring method
JP5957048B2 (en) 2014-08-19 2016-07-27 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Teacher data generation method, generation system, and generation program for eliminating ambiguity
CN108628906B (en) * 2017-03-24 2021-01-26 北京京东尚科信息技术有限公司 Short text template mining method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11161654A (en) * 1997-11-27 1999-06-18 Mitsubishi Electric Corp Method and device for electronic document processing and recording medium in which electronic document retrieval processing program is recorded
JP2010224622A (en) * 2009-03-19 2010-10-07 Nomura Research Institute Ltd Method and program for applying tag
JP2010288024A (en) * 2009-06-10 2010-12-24 Univ Of Electro-Communications Moving picture recommendation apparatus
JP2011134334A (en) * 2009-12-23 2011-07-07 Palo Alto Research Center Inc System and method for identifying topics for short text communications

Also Published As

Publication number Publication date
JP5389234B1 (en) 2014-01-15
JP2014032536A (en) 2014-02-20

Similar Documents

Publication Publication Date Title
Ghosh et al. Cognos: crowdsourcing search for topic experts in microblogs
JP6449351B2 (en) Data mining to identify online user response to broadcast messages
US9008489B2 (en) Keyword-tagging of scenes of interest within video content
Shamma et al. Tweet the debates: understanding community annotation of uncollected sources
US10242003B2 (en) Search relevance using messages of a messaging platform
Wu et al. Crowdsourced time-sync video tagging using temporal and personalized topic modeling
US9088808B1 (en) User interaction based related videos
KR20160057475A (en) System and method for actively obtaining social data
US20080104034A1 (en) Method For Scoring Changes to a Webpage
JP6429382B2 (en) Content recommendation device and program
Jansen et al. Real time search on the web: Queries, topics, and economic value
US9407589B2 (en) System and method for following topics in an electronic textual conversation
US9245035B2 (en) Information processing system, information processing method, program, and non-transitory information storage medium
Álvaro Cuesta et al. A Framework for massive Twitter data extraction and analysis
JP5952711B2 (en) Prediction server, program and method for predicting future number of comments in prediction target content
Sakai et al. Diversified search evaluation: Lessons from the NTCIR-9 INTENT task
Belkaroui et al. Conversation analysis on social networking sites
Nakazawa et al. Social indexing of TV programs: Detection and labeling of significant TV scenes by Twitter analysis
JP5389234B1 (en) Related document extracting apparatus, related document extracting method, and related document extracting program
US11200288B1 (en) Validating interests for a search and feed service
Paudel et al. An early look at the Gettr social network
KR101486924B1 (en) Method for recommending media contents using social network service
Masuda et al. Video scene retrieval using online video annotation
JP6373767B2 (en) Topic word ranking device, topic word ranking method, and program
Milajevs et al. Real time discussion retrieval from twitter

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13826059; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 13826059; Country of ref document: EP; Kind code of ref document: A1)