WO2014021229A1

WO2014021229A1 - Relevant document extraction device, relevant document extraction method and relevant document extraction program

Info

Publication number: WO2014021229A1
Application number: PCT/JP2013/070376
Authority: WO
Inventors: 隼赤塚; 公亮角野; 渉内田
Original assignee: 株式会社エヌ・ティ・ティ・ドコモ
Priority date: 2012-08-03
Filing date: 2013-07-26
Publication date: 2014-02-06
Also published as: JP5389234B1; JP2014032536A

Abstract

In the present invention, documents relevant to a specific topic are suitably extracted from documents such as a plurality of tweets. A relevant document extraction device (10) is provided with the following: a default topic tag storage unit (141) that stores a default topic tag indicating a topic; a document storage unit (100) that stores a plurality of documents; a morpheme analysis unit (110) that divides documents into morphemes; a topic tag estimation unit (130) that extracts a document that includes the default topic tag from the plurality of documents, and calculates the frequency of appearance of terms in the extracted document; and a topic ID assigning unit (150) that extracts a document relevant to the topic using information based on the calculated frequency of appearance.

Description

Related document extracting apparatus, related document extracting method, and related document extracting program

The present invention relates to a related document extracting apparatus, a related document extracting method, and a related document extracting program for extracting a document related to a specific topic from a plurality of documents.

In recent years, communication using microblogs (miniblogs) such as Twitter has become common (see, for example, Patent Document 1). A microblog is an information service for posting short texts of tens to hundreds of characters; on Twitter, users post short documents of up to 140 characters called tweets. Tweets cover a variety of content, such as the user's recent circumstances, news articles the user is interested in, replies to acquaintances' tweets, and comments on a specific topic such as a television program. Because users can share information with other users by posting a comment together with the URL of, for example, a news article of interest, microblogs are widely used not only as a tool for learning the latest status of friends but also as an information collection tool.

When users tweet about a specific topic, they tend to attach a hashtag to the tweet. Often one or more hashtags are associated with a broad topic. For example, when a TV program is the topic, “XX drama: YYY (drama title), episode 1” is one topic. While watching the program, users tweet about the XX drama YYY being broadcast, attaching a broadcast station hashtag. If the hashtag of the broadcasting station that broadcasts the XX drama YYY (the broadcast station hashtag) is #zzz, many people post tweets with a program hashtag (#xx, #YYY, #XX drama) in addition to the broadcast station hashtag. Broadcast station hashtags are hashtags that users widely use for programs broadcast by a television station, whether the hashtags are official or unofficial. By collecting tweets that include hashtags related to a topic, users' comments on the topic can be grasped.

The website shown in Non-Patent Document 1 provides a service for extracting tweets associated with a broadcasting station and displaying the tweets for each broadcasting station. In the service according to Non-Patent Document 1, the broadcasting station is one topic. It is possible to easily link a tweet to a broadcast station using a broadcast station hashtag. For example, when collecting tweets about a program of the broadcast station ZZZ, it is only necessary to collect tweets including #zzz, which is a broadcast station hash tag.

The website shown in Non-Patent Document 2 provides a service for extracting tweets associated with a program and displaying the tweets for each program being broadcast. In the service according to Non-Patent Document 2, a program being broadcast is treated as one topic. Like the service according to Non-Patent Document 1, it uses a broadcasting station hashtag to link tweets to the program being broadcast. In addition, it dynamically estimates program hashtags in real time. For example, in the case of a program of the broadcasting station ZZZ, tweets including the broadcasting station hash tag (#zzz) are linked to the program, and when the program “YYY” is being broadcast, one or more program hash tags (#xx, #YYY, #XX drama) are dynamically estimated and tweets associated with those program hash tags are also extracted. As described above, the service according to Non-Patent Document 2 can dynamically extract tweets about the program being broadcast by using the broadcast station hash tag and estimating program hash tags.

JP 2012-38281 A

However, the services shown in Non-Patent Document 1 and Non-Patent Document 2 have the following problems. With regard to television, there are users who post a tweet unrelated to the program being broadcast with a plurality of broadcast station hashtags. Since the service according to Non-Patent Document 1 simply collects tweets including broadcast station hashtags, it also displays tweets unrelated to the program. In addition, since the service according to Non-Patent Document 1 extracts only tweets with a broadcast station hash tag, the amount of tweets that can be extracted is limited.

In addition, the service according to Non-Patent Document 2 dynamically estimates program hash tags in addition to using the broadcast station hash tag and extracts tweets related to the program being broadcast, but it still cannot extract all such tweets. Tweets about a TV program being broadcast do not necessarily carry hashtags; in fact, there is a strong tendency for many of them to have no hashtag. As described above, the service according to Non-Patent Document 1 cannot extract tweets that carry only a program hash tag, and the service according to Non-Patent Document 2 cannot extract program-related tweets that have no hash tag.

The present invention has been made in view of the above problems, and an object thereof is to provide a related document extraction apparatus, a related document extraction method, and a related document extraction program capable of appropriately extracting documents related to a specific topic from a plurality of documents such as tweets.

To achieve the above object, a related document extraction apparatus according to an embodiment of the present invention includes: a default topic tag storage means that stores in advance a default topic tag indicating a topic; a document storage means that stores a plurality of documents in advance; a word acquisition means that divides the documents stored by the document storage means into words; a default document extracting means that extracts, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means; a first appearance frequency calculating means that calculates the appearance frequency of the words divided by the word acquisition means in the document extracted by the default document extracting means; and a topic document extracting means that uses the appearance frequency calculated by the first appearance frequency calculating means to extract a document related to the topic from documents other than the document extracted by the default document extracting means.

In the related document extraction apparatus according to an embodiment of the present invention, a document related to a topic is extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches the appearance frequency. Thus, the related document extraction apparatus according to an embodiment of the present invention can appropriately extract documents related to a specific topic from a plurality of documents such as tweets.

The topic document extracting means may include: a score calculation means that uses the appearance frequency calculated by the first appearance frequency calculating means to calculate a score for a document, other than the documents extracted by the default document extracting means, from the words appearing in that document; and a first topic document determination means that determines, based on the score calculated by the score calculation means, whether the scored document is a document related to the topic. According to this configuration, for example, a document that includes words appearing frequently in documents containing the default topic tag can be extracted as a document related to the topic, so documents related to a specific topic can be reliably extracted.

Even when a word appears more than once in a document, the score calculation means may calculate the score of the document in the same way as when the word appears only once. According to this configuration, the score of a document is prevented from being inflated by a word repeated many times within it, which avoids extracting an inappropriate document as a document related to the topic.
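
As a rough illustration of the scoring described above, the following Python sketch sums per-word weights over the distinct words of a document, so that a repeated word contributes only once. The names `score_document` and `is_topic_document`, the weight dictionary, and the threshold are hypothetical, and a plain sum is an assumption; the text does not prescribe how word weights are aggregated.

```python
def score_document(words, feature_weights):
    """Sum the weights of the distinct words in a document.

    Assumption: the document score is a plain sum of per-word weights.
    """
    # set() collapses repeats, so each word counts as if it appeared once.
    return sum(feature_weights.get(word, 0.0) for word in set(words))


def is_topic_document(words, feature_weights, threshold):
    """Hypothetical decision rule: the score reaches a preset threshold."""
    return score_document(words, feature_weights) >= threshold


# Illustrative weights for one topic (values are made up for the example).
weights = {"YYY": 1.0, "AAAA": 0.9}
repeated = ["YYY", "YYY", "YYY", "lunch"]
single = ["YYY", "lunch"]
```

With these inputs, `repeated` and `single` receive the same score, which is the effect the configuration is meant to achieve.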

The topic document extraction means may include: a tag document extraction means that extracts, from the plurality of documents stored by the document storage means, documents including a tag other than the default topic tag; a second appearance frequency calculating means that calculates the appearance frequency of the words divided by the word acquisition means in the documents extracted by the tag document extraction means; and a second topic document determination means that compares the appearance frequency calculated by the first appearance frequency calculating means with the appearance frequency calculated by the second appearance frequency calculating means and determines, based on the comparison result, whether the documents extracted by the tag document extraction means are documents related to the topic. According to this configuration, a document (group) including tags other than the default topic tag can be extracted as documents related to the topic, and documents related to a specific topic can be reliably extracted.

The second topic document determination means may compare the appearance frequencies by calculating the cosine distance, Jaccard distance, or Euclidean distance between the feature amount given by the word appearance frequencies calculated by the first appearance frequency calculating means and the feature amount given by the word appearance frequencies calculated by the second appearance frequency calculating means. According to this configuration, documents related to a specific topic can be extracted more reliably.
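
The three distances named above can be sketched in Python as follows, treating each feature amount as a sparse dictionary from word to frequency. This is only an illustration: the function names, the handling of empty vectors, and the sample frequencies are assumptions, and the text does not say which distance or threshold is actually used.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse word-frequency vectors (dicts)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for empty vectors (assumption)
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the words with nonzero frequency."""
    words_a = {word for word, value in a.items() if value}
    words_b = {word for word, value in b.items() if value}
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def euclidean_distance(a, b):
    """Straight-line distance, treating missing words as frequency 0."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in words))


# Hypothetical frequencies: documents with the default topic tag, documents
# with another tag that uses the same vocabulary, and unrelated documents.
default_tag_freq = {"YYY": 3.0, "AAAA": 4.0}
other_tag_freq = {"YYY": 3.0, "AAAA": 4.0}
unrelated_freq = {"lunch": 5.0}
```

A small distance between the two feature amounts suggests that the documents carrying the other tag discuss the same topic; a large distance suggests they do not.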

The default topic tag storage means may store, as a default topic tag, a default topic tag related to an inappropriate topic, and the topic document extraction means may determine whether a document is related to the inappropriate topic and exclude such documents. According to this configuration, inappropriate documents can be excluded, which, for example, prevents inappropriate documents from being presented to the user.

The document storage means may store information relating to the user who posted each document, and the first appearance frequency calculation means may calculate, as the appearance frequency of a word, the number of users who have posted a document including the word. According to this configuration, the influence of each user can be made uniform; for example, the effect of one user posting documents with the same content several times can be suppressed. Thereby, documents related to a specific topic can be appropriately extracted.

For each word, the first appearance frequency calculating means may calculate an inverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted a document including the word, and the topic document extracting means may extract a document related to the topic using the inverse appearance frequency calculated by the first appearance frequency calculating means. According to this configuration, a document related to a topic is extracted using the inverse appearance frequency of words in the documents that include the default topic tag indicating the topic. Thereby, documents related to a specific topic can be more appropriately extracted from a plurality of documents such as tweets.

The topic document extracting means may extract a document related to the topic using the number of characters for each word. According to this configuration, a document related to a topic is extracted using the number of characters of a word in the document including the default topic tag indicating the topic. Thereby, a document related to a specific topic can be more appropriately extracted from a plurality of documents such as tweets.

The default topic tag storage unit may store a plurality of default topic tags indicating each of a plurality of topics, and the topic document extraction unit may exclude documents related to the plurality of topics. Documents posted on multiple topics (multi-topic postings) are often not related to each topic. Therefore, according to this configuration, it is possible to avoid extracting an inappropriate document as a document related to a topic.
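
The exclusion of multi-topic postings might look like the following Python sketch. It assumes each document is represented by the list of tags it carries and that a document is excluded once it matches default topic tags of more than `max_topics` topics; that parameter, the function names, and the tags `#www` and `#vvv` are hypothetical.

```python
def matched_topics(document_tags, default_tags_by_topic):
    """Return the topic IDs whose default topic tags appear in the document."""
    present = set(document_tags)
    return {
        topic_id
        for topic_id, tags in default_tags_by_topic.items()
        if present & set(tags)
    }


def exclude_multi_topic(documents, default_tags_by_topic, max_topics=1):
    """Keep the IDs of documents tied to at most max_topics topics.

    Assumption: 'related to a plurality of topics' means more than one match.
    """
    return [
        document_id
        for document_id, tags in documents.items()
        if len(matched_topics(tags, default_tags_by_topic)) <= max_topics
    ]


default_tags_by_topic = {
    "stationZ": ["#zzz"],
    "stationW": ["#www"],  # hypothetical second station
    "stationV": ["#vvv"],  # hypothetical third station
}
documents = {
    "d1": ["#zzz"],
    "d2": ["#zzz", "#www", "#vvv"],  # posted to several stations at once
    "d3": [],
}
kept = exclude_multi_topic(documents, default_tags_by_topic)
```

Here `d2` is dropped because it carries the default topic tags of three topics, while `d1` and `d3` remain candidates.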

Incidentally, in addition to being described as an invention of a related document extraction apparatus as above, the present invention can also be described as an invention of a related document extraction method and a related document extraction program, as follows. These differ only in category and are substantially the same invention, with the same operations and effects.

That is, a related document extraction method according to an embodiment of the present invention is a related document extraction method performed by an apparatus including a default topic tag storage means that stores in advance a default topic tag indicating a topic and a document storage means that stores a plurality of documents in advance, the method including: a word acquisition step of dividing the documents stored by the document storage means into words; a default document extraction step of extracting, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means; a first appearance frequency calculation step of calculating the appearance frequency of the words divided in the word acquisition step in the document extracted in the default document extraction step; and a topic document extraction step of using the appearance frequency calculated in the first appearance frequency calculation step to extract a document related to the topic from documents other than the document extracted in the default document extraction step.

A related document extraction program according to an embodiment of the present invention causes a computer to function as: a default topic tag storage means that stores in advance a default topic tag indicating a topic; a document storage means that stores a plurality of documents in advance; a word acquisition means that divides the documents stored by the document storage means into words; a default document extraction means that extracts, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means; a first appearance frequency calculating means that calculates the appearance frequency of the words divided by the word acquisition means in the document extracted by the default document extraction means; and a topic document extraction means that uses the appearance frequency calculated by the first appearance frequency calculating means to extract a document related to the topic from documents other than the document extracted by the default document extraction means.

In one embodiment of the present invention, a document related to a topic is extracted using the appearance frequency of words in documents that include a default topic tag indicating the topic. That is, even a document that does not include the default topic tag is extracted as a document related to the topic if it matches the appearance frequency. Thus, according to one embodiment of the present invention, documents related to a specific topic can be appropriately extracted from a plurality of documents such as tweets.

A diagram showing the functional configuration of the related document extraction apparatus according to an embodiment of the present invention.
A table showing an example of the documents stored in the document storage unit.
A table showing an example of the morphemes stored in the morpheme storage unit.
A table showing an example of the default topic tags stored in the default topic tag storage unit.
A table showing an example of the feature amounts stored in the topic feature word storage unit.
A table showing an example of the extended topic hash tags stored in the extended topic hash tag storage unit.
A table showing an example of the information used for excluding documents.
A table showing an example of the documents stored in the topic document storage unit.
A diagram showing the hardware configuration of the related document extraction apparatus according to an embodiment of the present invention.
A flowchart showing the overall processing (related document extraction method) performed by the related document extraction apparatus according to an embodiment of the present invention.
A flowchart showing processing by the topic feature word estimation unit.
A flowchart showing processing by the topic feature word estimation unit.
A flowchart showing processing by the topic feature word estimation unit.
A flowchart showing processing by the topic hash tag estimation unit.
A flowchart showing processing by the topic hash tag estimation unit.
A flowchart showing processing by the blacklist hash tag expansion unit.
A flowchart showing processing by the blacklist hash tag expansion unit.
A flowchart showing processing by the topic ID assignment unit.
A flowchart showing processing by the topic ID assignment unit.
A flowchart showing processing by the topic ID assignment unit.
A flowchart showing processing by the noise removal unit.
A diagram showing the configuration of the related document extraction program according to an embodiment of the present invention, together with a recording medium.

Hereinafter, the related document extracting apparatus, the related document extracting method, and the related document extracting program according to the present invention will be described in detail with reference to the drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant description is omitted.

FIG. 1 shows a related document extraction apparatus 10 according to the present embodiment. The related document extraction device 10 is a device that extracts documents related to a specific topic from a plurality of documents. The documents to be extracted are, for example, documents that users have posted to a microblog and that are published on the Web. In this embodiment, for simplicity, Twitter, a representative microblog, is used as a specific example. In this embodiment the extraction target is called a document, but depending on the microblog service it may also be called a tweet or a comment. Note that the documents to be extracted need not be documents published on the Web.

The related document extracting apparatus 10 receives documents posted by a large number of users, extracts the documents related to a specific topic from among them, and provides them to the user as a document group related to that topic. Specific topics include, for example, specific television programs. By referring to the document group related to a specific topic, a user can learn what other users think about the topic.

As shown in FIG. 1, the related document extracting apparatus 10 includes a document storage unit 100, a morpheme analysis unit 110, a morpheme storage unit 120, a topic tag estimation unit 130, a topic tag storage unit 140, a topic ID assignment unit 150, a blacklist hash tag expansion unit 160, a blacklist tag storage unit 170, a blacklist user storage unit 180, a noise removal unit 190, and a topic document storage unit 200. The related document extraction device 10 is connected, via a network such as the Internet, to a device that outputs the documents to be extracted (for example, a server providing a microblog service), so that it can acquire (receive) those documents.

The document storage unit 100 is a document storage means that receives and stores in advance a plurality of documents to be extracted. For example, the document storage unit 100 may request and acquire (receive) documents via the Internet from a server that provides a microblog service and stores the documents, or may receive document data streamed from the server. On Twitter, each document corresponds, for example, to the data of one tweet generated (posted) by a user. The stored data need not be limited to a single type of data.

FIG. 2 shows a sample format of a document stored in the document storage unit 100. As shown in FIG. 2, the data relating to one document stored in the document storage unit 100 is associated with a document ID, a user ID, a posting time, a text, and a hash tag. One row of data shown in FIG. 2 corresponds to data relating to one document. The document ID is information that identifies each document and is a unique value. The user ID is information that identifies the user who created each document. In this way, the document storage unit 100 inputs and stores information related to the user who posted the document. For example, the user ID may be a unique value such as a user account, or may be an ID for each session when using the Internet when it is difficult to specify the unique value.

The posting time is information indicating the time at which the user posted the document. The text is the actual text data (document body) included in the document data. A hash tag is tag information attached to a document. “Hash tag” is a Twitter term; it is a tag that a user attaches to a document in order to post explicitly about a specific topic, for example a tag by which a specific event can be recognized, that is, an event identifier. A document does not necessarily include a hash tag (event identifier); when it includes none, a NULL value is stored.

The morpheme analysis unit 110 is a word acquisition means that reads the document data stored in the document storage unit 100 and divides the text of the document data into words. The morpheme analysis unit 110 divides the text into words by, for example, morphological analysis, for which conventional techniques can be used. However, division into words need not be performed by morphological analysis and may be performed by any method. In the following description, words are treated as morphemes. Morphemes are acquired for each document. The morpheme analysis unit 110 outputs the information on the morphemes obtained from the text to the morpheme storage unit 120.

The morpheme storage unit 120 is a means for storing the morphemes input from the morpheme analysis unit 110. FIG. 3 shows a sample format of the morphemes stored in the morpheme storage unit 120. As shown in FIG. 3, the data relating to one morpheme stored in the morpheme storage unit 120 associates a document ID, a user ID, a posting time, a morpheme, and a part of speech with each other. One row of data shown in FIG. 3 corresponds to the data relating to one morpheme. The document ID, user ID, and posting time are those of the document from which the morpheme was acquired. The morpheme is the morpheme obtained by the morpheme analysis unit 110. The part of speech is the part of speech of the morpheme obtained through the analysis by the morpheme analysis unit 110. For example, information indicating whether the morpheme is a noun is stored.
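
The document rows of FIG. 2 and the morpheme rows of FIG. 3 can be pictured with the following Python sketch. The class and field names are hypothetical, and whitespace splitting merely stands in for morphological analysis (an assumption; a real analyzer would also supply the part of speech, which is recorded here as "unknown").

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Document:
    document_id: str
    user_id: str
    posted_at: str  # posting time, kept as text for simplicity
    text: str
    hashtags: Optional[List[str]] = None  # None stands in for the NULL value


@dataclass
class MorphemeRecord:
    document_id: str
    user_id: str
    posted_at: str
    morpheme: str
    part_of_speech: str


def split_into_words(doc: Document) -> List[MorphemeRecord]:
    """Stand-in for the morpheme analysis unit: whitespace tokenization.

    Assumption: a real system would call a morphological analyzer here.
    """
    return [
        MorphemeRecord(doc.document_id, doc.user_id, doc.posted_at, token, "unknown")
        for token in doc.text.split()
    ]


doc = Document("d1", "u1", "2012-08-03T21:00:00", "YYY is great #zzz", ["#zzz"])
records = split_into_words(doc)
```

Each record carries the document ID, user ID, and posting time of its source document, which is what later allows frequencies to be counted per user rather than per document.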

The topic tag estimation unit 130 is a means for generating information used to determine whether each document is a document related to a specific topic. The topic tag estimation unit 130 generates the above information using the information stored in the topic tag storage unit 140 and stores the information in the topic tag storage unit 140. Here, the topic tag storage unit 140 will be described.

The topic tag storage unit 140 includes a default topic tag storage unit 141, a topic feature word storage unit 142, and an extended topic hash tag storage unit 143.

The default topic tag storage unit 141 is a default topic tag storage means that stores in advance a default topic tag indicating a topic. The default topic tag is a tag related to a topic for which related documents are to be extracted, and is registered in advance by, for example, the administrator of the related document extraction apparatus 10. A document including the default topic tag is extracted as a document related to the topic of that default topic tag. This extraction is performed by character string matching. The default topic tag is, for example, a morpheme, a hash tag, or a keyword. Default topic tags exist for each topic. For example, if the topic is “XX drama: YYY (drama title)”, then “AAA (an actor appearing in YYY)”, “YYY”, “BBBB (an actor appearing in YYY)”, and the like are default topic tags.

FIG. 4 shows a sample format of the default topic tags stored in the default topic tag storage unit 141. As shown in FIG. 4, the data relating to one default topic tag stored in the default topic tag storage unit 141 associates a topic ID with a tag. One row of data shown in FIG. 4 corresponds to the data relating to one default topic tag. The topic ID is an ID that identifies one topic. The tag is the default topic tag itself. As shown in FIG. 4, a plurality of default topic tags may be stored in the default topic tag storage unit 141 for one topic (one topic ID). Further, the default topic tag storage unit 141 may store a plurality of default topic tags indicating each of a plurality of topics (a plurality of topic IDs).
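
A minimal Python sketch of extracting topic documents by character string matching against (topic ID, tag) rows like those of FIG. 4 is shown below. The function name, the sample rows, and the use of plain substring matching on the document text are assumptions made for illustration.

```python
def extract_default_documents(documents, default_topic_tags):
    """Map each topic ID to the IDs of documents containing one of its tags.

    documents: list of (document_id, text) pairs.
    default_topic_tags: list of (topic_id, tag) rows.
    Assumption: matching is plain substring matching on the text.
    """
    result = {}
    for topic_id, tag in default_topic_tags:
        matched = result.setdefault(topic_id, [])
        for document_id, text in documents:
            # Skip documents already matched by another tag of this topic.
            if tag in text and document_id not in matched:
                matched.append(document_id)
    return result


# Hypothetical rows: topic T1 has two default topic tags, topic T2 has one.
default_topic_tags = [("T1", "YYY"), ("T1", "#XXdrama"), ("T2", "#news")]
documents = [
    ("d1", "Watching YYY right now #XXdrama"),
    ("d2", "Lunch was good"),
    ("d3", "Breaking story #news"),
]
topic_documents = extract_default_documents(documents, default_topic_tags)
```

Document `d1` matches two tags of topic `T1` but is listed once, and `d2` matches no default topic tag, so it is left for the later frequency-based determination.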

Information stored in the topic feature word storage unit 142 and the extended topic hash tag storage unit 143 is information input from the topic tag estimation unit 130, and will be described later.

The topic tag estimation unit 130 uses the default topic tags stored in the default topic tag storage unit 141 to generate information used to determine whether each document is a document related to a specific topic. This information is used to determine whether a document that does not include a default topic tag is nevertheless related to the topic of that default topic tag.

The topic tag estimation unit 130 includes a topic feature word estimation unit 131 and a topic hash tag estimation unit 132, and each generates different information.

The topic feature word estimation unit 131 is a means for estimating the feature words of a topic. A feature word of a topic is a morpheme that appears characteristically in documents related to the topic. The topic feature word estimation unit 131 is a default document extraction means that reads the default topic tags stored by the default topic tag storage unit 141 and extracts, from the plurality of documents stored by the document storage unit 100, the documents including a default topic tag as documents related to the topic (topic documents). At this time, based on the posting time, documents from a predetermined period, such as the most recent several hours before the time of extraction, are acquired. The topic feature word estimation unit 131 is also a first appearance frequency calculating means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of each morpheme in the extracted topic documents (topic document group). At this time, the topic feature word estimation unit 131 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Further, for each morpheme, the topic feature word estimation unit 131 calculates the inverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted a document containing the morpheme.

The topic feature word estimation unit 131 extracts the morphemes (features) characteristic of the topic documents (topic document group) from the above values. For each topic ID, the topic feature word estimation unit 131 generates a feature amount, which is information describing the characteristics of the target topic. The feature amount is composed of a plurality of features, and one feature is generated for each morpheme. For example, the feature “Today” has a score of 0.5, and the feature “Sunny” has a score of 0.2.

Specifically, the feature amount is generated as follows. First, the topic feature word estimation unit 131 calculates an IDF (Inverse Document Frequency) value (inverse appearance frequency) for each morpheme included in the documents by the following formula.

$$\mathrm{idf}_i = \frac{|D|}{|\{d : t_i \in d\}|}$$

Here, $i$ is a subscript indicating a morpheme, $|D|$ is the total number of unique users, and $|\{d : t_i \in d\}|$ is the number of unique users who have posted a document containing the morpheme $i$. The IDF value is a score indicating that the smaller the number of documents in which a word appears, the more informative an occurrence of that word is.

Note that the reason why the frequency is calculated not by the number of documents but by the number of users is as follows. If the number of documents is simply used, noise may be mixed. For example, the same user may post a plurality of documents having the same content. Some people submit documents with the same content dozens of times. If the calculation here is based on the number of unique users, even if the same user has posted a document with the same content multiple times, it is counted only once. Therefore, the calculated score is more reliable. It is also possible to think that the influence of one user on the morpheme score is made uniform.
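
The effect of counting unique users instead of documents can be seen in the following Python sketch, which computes the inverse appearance frequency as the ratio of the total number of unique users to the number of unique users who posted each morpheme. The function name and the sample posts are hypothetical.

```python
def user_based_idf(posts):
    """Compute an inverse appearance frequency per morpheme from unique users.

    posts: iterable of (user_id, morphemes) pairs, one pair per document.
    Returns {morpheme: total unique users / unique users who posted it}.
    """
    all_users = set()
    users_by_morpheme = {}
    for user_id, morphemes in posts:
        all_users.add(user_id)
        for morpheme in set(morphemes):
            users_by_morpheme.setdefault(morpheme, set()).add(user_id)
    total = len(all_users)
    return {m: total / len(users) for m, users in users_by_morpheme.items()}


posts = [
    ("u1", ["YYY", "great"]),
    ("u1", ["YYY", "great"]),  # the same user repeating the same content
    ("u1", ["YYY", "great"]),
    ("u2", ["YYY"]),
    ("u3", ["lunch"]),
    ("u4", ["YYY"]),
]
idf = user_based_idf(posts)
```

Although `u1` posted "great" three times, the word is attributed to a single user, so the repeated posts do not change its value.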

Subsequently, for each topic ID, the topic feature word estimation unit 131 calculates a TF (Term Frequency) value (appearance frequency) for each morpheme included in the topic documents (each morpheme to which the topic ID is assigned) by the following expression.

Here, $j$ is a subscript indicating the topic ID, and $n_{i,j}$ is the number of unique users who have posted a document related to the topic ID $j$ (a document including a default topic tag of the topic ID $j$) that includes the morpheme $i$. The TF value indicates how prominently a given word appears in a given document; the larger this value, the better the word represents the content of the document.

Subsequently, the topic feature word estimation unit 131 obtains the TFIDF value ($\mathrm{tfidf}_{i,j}$) of the morpheme $i$ for the topic ID $j$ by the following equation.
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$$
Performing this for every morpheme yields the feature amount of the topic ID (a TFIDF value for each morpheme $i$). The process continues until feature amounts have been generated for all topic IDs. In the scores calculated this way, a morpheme with a higher correlation with the topic receives a higher score.
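
The per-topic product of TF and IDF described above can be sketched in Python as follows. Note the assumption: the TF expression itself is not reproduced in this text, so the sketch uses the common normalized form in which the unique-user count of a morpheme is divided by the sum of the counts over all morphemes of the topic. The function name and sample values are hypothetical.

```python
def topic_tfidf(user_counts_by_topic, idf):
    """Compute a TFIDF value per morpheme for each topic ID.

    user_counts_by_topic: {topic_id: {morpheme: unique-user count n_ij}}.
    idf: {morpheme: inverse appearance frequency}.
    Assumption: tf_ij = n_ij / sum over k of n_kj (not confirmed by the text).
    """
    features = {}
    for topic_id, counts in user_counts_by_topic.items():
        total = sum(counts.values())
        scores = {}
        for morpheme, n in counts.items():
            tf = n / total if total else 0.0
            scores[morpheme] = tf * idf.get(morpheme, 0.0)
        features[topic_id] = scores
    return features


user_counts_by_topic = {"T1": {"YYY": 3, "today": 1}}
idf_values = {"YYY": 4.0, "today": 1.0}
features = topic_tfidf(user_counts_by_topic, idf_values)
```

In this example the drama title "YYY" is both common within the topic and rare overall, so it scores far higher than the everyday word "today".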

The feature amount relating to a television program is, for example, “YYY (drama title): 1.0, AAAA (actor name): 0.9, CCCC (role name): 0.7, DDDD (role name): 0.4” (in the form morpheme: TFIDF value). Looking at the feature amount in this way makes the characteristics of the program clear.

Note that, in the above calculation, the IDF may be weighted to adjust the score (TFIDF value), for example by applying a logarithm to the IDF or by raising the IDF to a constant power. The TFIDF value may also be calculated using the number of characters of each morpheme, for example as in the following formula.
tfidf_{i,j} = tf_{i,j} · idf_i · log(length_i)
Here, length_i is the number of characters of morpheme i. The weight may further be raised to a constant power, that is, power(log(length_i), constant). Doing so increases the weight of morphemes that are described more specifically. In addition, morphemes with a small number of characters appear frequently and, without such weighting, tend to receive high scores as noise.
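The character-count weighting can be illustrated with the short sketch below. The function name `length_weighted_tfidf` and the `power` parameter are hypothetical, and the natural logarithm is an assumption, as the base of the logarithm is not specified above.

```python
import math

def length_weighted_tfidf(tf, idf, morpheme, power=1.0):
    """Return tf * idf * log(length)^power for a non-empty morpheme.

    Assumption: natural logarithm. Note that a one-character morpheme
    receives weight log(1) = 0 and therefore a score of 0.
    """
    weight = math.log(len(morpheme)) ** power
    return tf * idf * weight

short_score = length_weighted_tfidf(0.5, 2.0, "a")
long_score = length_weighted_tfidf(0.5, 2.0, "drama")
```

With this form, a longer morpheme with the same TF and IDF values always scores at least as high as a shorter one, which is consistent with the aim of suppressing short, frequent morphemes.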

The topic feature word estimation unit 131 outputs the calculated TFIDF value of each morpheme for each topic ID to the topic feature word storage unit 142 for storage. Here, only the morphemes (feature words) having a TFIDF value equal to or greater than a preset threshold value may be stored in the topic feature word storage unit 142. FIG. 5 shows a sample format of the feature amounts stored in the topic feature word storage unit 142. As shown in FIG. 5, the feature amount data stored in the topic feature word storage unit 142 is data for each morpheme, and the data related to one morpheme associates a topic ID, a creation date, a tag, and a score with one another. One row of data shown in FIG. 5 corresponds to data related to one morpheme of one topic ID. The topic ID is the topic ID of the topic to which the feature amount relates. The creation date is the time when this data was created. The tag is the morpheme. The score is the TFIDF value calculated by the topic feature word estimation unit 131. The information generated by the topic feature word estimation unit 131 has been described above.

The topic hash tag estimation unit 132 is a means for estimating hash tags, other than the default topic tag, that relate to a topic. The topic hash tag estimation unit 132 is tag document extraction means that reads the default topic tag stored in the default topic tag storage unit 141 and extracts, from the plurality of documents stored in the document storage unit 100, documents including a tag other than the default topic tag (a hash tag that is a candidate for a hash tag related to the topic) as tag documents (a tag document group). At this time, based on the posting time, documents for a predetermined period, such as the several hours immediately preceding the time of extraction, are acquired. The topic hash tag estimation unit 132 is also appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of each morpheme in the extracted tag documents (tag document group). At this time, the topic hash tag estimation unit 132 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Further, for each morpheme, the topic hash tag estimation unit 132 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.

Specifically, the topic hash tag estimation unit 132 calculates an IDF value (reverse appearance frequency) for each morpheme, as the topic feature word estimation unit 131 does. Since the IDF values calculated and used by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132 are the same for each morpheme, the IDF value calculated by either unit may be used by the other.

Subsequently, the topic hash tag estimation unit 132 calculates, for each tag other than the default topic tag, a TF value (appearance frequency) for each morpheme from the morphemes included in each tag document (the morphemes to which that hash tag is assigned) by the following formula.

Here, j is a subscript indicating a hash tag, and n_{i,j} is the number of unique users who have posted a document including both morpheme i and hash tag j.

Subsequently, the topic hash tag estimation unit 132 obtains the TFIDF value (tfidf_{i,j}) of morpheme i in hash tag j by the following expression.
tfidf_{i,j} = tf_{i,j} · idf_i
By performing this for each morpheme, a hash tag feature amount (a TFIDF value for each morpheme i) is generated. Continue until all hash tag features are generated. The weighting of the TFIDF value may be performed in the same manner as the method using the topic feature word estimation unit 131 described above.

Further, the topic hash tag estimation unit 132 is default document extraction means that reads the default topic tag stored in the default topic tag storage unit 141 and extracts, from the plurality of documents stored in the document storage unit 100, documents including the default topic tag as documents related to the topic (topic documents). At this time, based on the posting time, documents for a predetermined period, such as the several hours immediately preceding the time of extraction, are acquired. The topic hash tag estimation unit 132 is also appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of each morpheme in the extracted topic documents (topic document group). At this time, the topic hash tag estimation unit 132 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Further, for each morpheme, the topic hash tag estimation unit 132 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.

Like the topic feature word estimation unit 131, the topic hash tag estimation unit 132 obtains the TFIDF value (tfidf_{i,j}) of morpheme i in topic ID j. Since the TFIDF values in topic ID j calculated and used by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132 are the same for each morpheme, the TFIDF value calculated by either unit may be used by the other.

The topic hash tag estimation unit 132 functions as part of the second topic document determination means by comparing the feature amount of each topic ID calculated as described above with the feature amount of each tag. Specifically, the topic hash tag estimation unit 132 calculates, for each topic ID, the similarity with every hash tag (other than the default topic tag) as a cosine distance (cosine similarity) using the following formula.

\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}

Here, A and B are the feature amount of the topic ID and the feature amount of the hash tag, respectively, and A_i and B_i are their TFIDF values for morpheme i. Instead of the cosine distance, a Jaccard distance or a Euclidean distance may be used to calculate the similarity between the feature amounts indicated by the appearance frequencies of the morphemes. Any other calculation method may also be used as long as it yields a similarity between feature amounts.

For each topic ID, the topic hash tag estimation unit 132 determines whether there is a hash tag whose similarity is equal to or higher than a preset threshold, and treats each hash tag whose similarity is equal to or higher than the threshold as a tag related to the topic of that topic ID. By performing this process on all topic IDs, the hash tags related to the topic of each topic ID (similar hash tags) can be extracted.
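The comparison of feature amounts and the threshold determination can be sketched as follows. The names `cosine_similarity` and `related_hashtags`, the dictionary representation of a feature amount, and the threshold value 0.8 are assumptions made for illustration only.

```python
import math

def cosine_similarity(a, b):
    """a, b: {morpheme: tfidf}. Morphemes missing from one side count as 0."""
    dot = sum(value * b.get(morpheme, 0.0) for morpheme, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty feature amount is treated as dissimilar
    return dot / (norm_a * norm_b)

def related_hashtags(topic_features, tag_features, threshold):
    """Return {topic_id: [hash tags whose similarity >= threshold]}."""
    return {
        topic_id: [
            tag for tag, tag_feature in tag_features.items()
            if cosine_similarity(topic_feature, tag_feature) >= threshold
        ]
        for topic_id, topic_feature in topic_features.items()
    }

topic_features = {"T1": {"drama": 1.0, "actor": 0.5}}
tag_features = {
    "#yyy": {"drama": 0.9, "actor": 0.6},
    "#news": {"election": 1.0},
}
extended = related_hashtags(topic_features, tag_features, threshold=0.8)
```

In this sample, `#yyy` shares its dominant morphemes with topic `T1` and is kept, while `#news` shares none and has a similarity of 0.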

The topic hash tag estimation unit 132 outputs information indicating the hash tags related to the topic of each topic ID (extended topic hash tags) to the extended topic hash tag storage unit 143 for storage. FIG. 6 shows a sample format of the extended topic hash tags stored in the extended topic hash tag storage unit 143. As shown in FIG. 6, the data stored in the extended topic hash tag storage unit 143 is data for each extended topic hash tag, and the data related to one extended topic hash tag associates a topic ID, a creation date, and a hash tag with one another. One row of data shown in FIG. 6 corresponds to data related to one extended topic hash tag. The topic ID is the topic ID of the topic to which the extended topic hash tag relates. The creation date is the time when this data was created. The hash tag is the extended topic hash tag. The information generated by the topic hash tag estimation unit 132 has been described above.

The topic ID assigning unit 150 is a topic document extracting unit that extracts documents related to a topic using the information stored in the topic tag storage unit 140. In particular, the topic ID assigning unit 150 extracts documents that are related to a topic even though they do not include a default topic tag.

The topic ID assigning unit 150 first acquires the documents stored in the document storage unit 100. At this time, based on the posting time, documents for a predetermined period, such as the several hours immediately preceding the time of acquisition, are acquired. To extract a document related to a topic based on the information stored in the default topic tag storage unit 141, the following is performed. The topic ID assigning unit 150 reads the default topic tag stored in the default topic tag storage unit 141, determines whether or not each acquired document includes the default topic tag, and assigns the topic ID related to the default topic tag to each document that includes it.

To extract a document related to a topic based on the information stored in the topic feature word storage unit 142, the following is performed. The topic ID assigning unit 150 is score calculation means that reads the feature amount information (the TFIDF value (score) of each morpheme for each topic ID) stored in the topic feature word storage unit 142 and calculates, from the feature amount information, the score of each acquired document for each topic ID. The topic ID assigning unit 150 determines whether the document to be scored includes a morpheme (feature word) of the feature amount, and adds up the scores of the feature words included in the document.

Note that, when the score is calculated, a feature word that appears multiple times in the document may be treated in the same manner as one that appears once. That is, the score of the same feature word is not counted multiple times. For example, when the document is “Today was fine. Today is good weather” and the score of the feature word “Today” is 1.0, the score derived from “Today” in this document is 1.0, not 1.0 + 1.0 = 2.0.

Not counting a feature word multiple times in this way removes noise and avoids extracting inappropriate documents as topic documents. For example, suppose that a certain word has been extracted as a feature word but with a low score. If that feature word appears many times in a document and duplicate counting is allowed, the document may be extracted as a topic document. Disallowing duplicate counting avoids this.
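The score calculation without duplicate counting can be sketched as follows, using the example sentence given above. The function name `document_score` and the pre-tokenized morpheme list are assumptions; morphological analysis itself is outside the scope of this sketch.

```python
def document_score(morphemes, feature_scores):
    """Sum the scores of the feature words in a document, counting each
    distinct feature word once regardless of how often it appears."""
    return sum(feature_scores[m] for m in set(morphemes) if m in feature_scores)

# "Today was fine. Today is good weather", already split into morphemes.
morphemes = ["Today", "was", "fine", "Today", "is", "good", "weather"]
score = document_score(morphemes, {"Today": 1.0})
```

Converting the morpheme list to a set before summing is what prevents “Today” from contributing 2.0.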

The topic ID assigning unit 150 is first topic document determination means that determines, based on the calculated score, whether the scored document is a document related to the topic. Specifically, the topic ID assigning unit 150 determines whether the score is equal to or greater than a preset threshold value. If the score is equal to or greater than the threshold value, the topic ID assigning unit 150 determines that the document is related to the topic and assigns the topic ID to the document. This process is repeated for each topic ID whose feature amount is stored in the topic feature word storage unit 142, and topic IDs are assigned accordingly.

To extract a document related to a topic based on the information stored in the extended topic hash tag storage unit 143, the following is performed. The topic ID assigning unit 150 is second topic document determination means that reads the extended topic hash tags stored in the extended topic hash tag storage unit 143 and determines whether or not each acquired document includes an extended topic hash tag (that is, whether or not the acquired document is a document related to the topic). The topic ID assigning unit 150 assigns the topic ID associated with the extended topic hash tag to a document that includes that tag. This assignment is repeated for each topic ID associated with an extended topic hash tag stored in the extended topic hash tag storage unit 143. The topic ID assigning unit 150 outputs the documents to which topic IDs have been assigned to the noise removing unit 190.
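The three routes by which a topic ID is assigned (default topic tag, feature word score, and extended topic hash tag) can be combined as in the following sketch. The function name `assign_topic_ids`, the dictionary layouts, and the threshold value are hypothetical and chosen only to make the example self-contained.

```python
def assign_topic_ids(hashtags, morphemes, default_tags, extended_tags,
                     topic_features, score_threshold):
    """Return the set of topic IDs assigned to one document.

    default_tags / extended_tags: {hash tag: topic_id}
    topic_features: {topic_id: {morpheme: tfidf score}}
    """
    topic_ids = set()
    for tag in hashtags:
        if tag in default_tags:       # route 1: default topic tag
            topic_ids.add(default_tags[tag])
        if tag in extended_tags:      # route 3: extended topic hash tag
            topic_ids.add(extended_tags[tag])
    unique_morphemes = set(morphemes)  # each feature word is counted once
    for topic_id, scores in topic_features.items():
        score = sum(scores.get(m, 0.0) for m in unique_morphemes)
        if score >= score_threshold:  # route 2: feature word score
            topic_ids.add(topic_id)
    return topic_ids

default_tags = {"#yyy": "T1"}
extended_tags = {"#yyydrama": "T1"}
topic_features = {"T1": {"AAAA": 0.9, "CCCC": 0.7}, "T2": {"goal": 1.0}}

by_score = assign_topic_ids([], ["AAAA", "CCCC", "AAAA"], default_tags,
                            extended_tags, topic_features, 1.5)
```

Returning a set of topic IDs keeps the information needed later for the multi-post determination, since a document that receives more than one topic ID can be recognized directly from the size of the set.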

In the present embodiment, noise is removed from the documents stored in the document storage unit 100. That is, it is determined whether or not a document stored in the document storage unit 100 is inappropriate as a document related to a topic, and a document determined to be inappropriate is excluded from the related documents.

On Twitter, it is common to attach a hash tag to a tweet in order to share it under a specific topic, but some users post comments bearing the hash tags of multiple independent topics. In such a case, the post is made to multiple topics at once, and its content is only very weakly related to each individual topic. In the case of television, many such posts are criticism of politics or criticism of broadcasting stations. Filtering out this noise is important for extracting documents related to a topic with high accuracy. The following configuration is for removing noise from the documents.

The blacklist hash tag extension unit 160 is a means for generating information used to determine whether each document is noise, that is, whether each document is related to a specific topic that is inappropriate for extraction. The blacklist hash tag extension unit 160 generates this information using the information stored in the blacklist tag storage unit 170 and stores the generated information in the blacklist tag storage unit 170. The blacklist tag storage unit 170 is described first.

The black list tag storage unit 170 includes a default black list morpheme storage unit 171, a default black list hash tag storage unit 172, and an extended black list hash tag storage unit 173.

The default blacklist morpheme storage unit 171 is a means for inputting and storing blacklist morphemes. A blacklist morpheme is a morpheme whose presence in a document means that the document should be excluded. The blacklist morphemes are registered in advance by, for example, an administrator of the related document extraction device 10. FIG. 7A shows a sample format of the blacklist morphemes stored in the default blacklist morpheme storage unit 171. As shown in FIG. 7A, one line of data corresponds to data relating to one blacklist morpheme, and data is stored for each blacklist morpheme.

The default blacklist hash tag storage unit 172 is a default topic tag storage unit that stores in advance a blacklist hash tag that is a default topic tag indicating an inappropriate topic. The blacklist hash tag is a tag related to a topic for which a related document is to be excluded, and is registered in advance by an administrator of the related document extraction apparatus 10, for example. Documents containing blacklist hash tags are excluded as documents related to inappropriate topics. This exclusion is performed by character string matching. The black list hash tag is, for example, a hash tag.

FIG. 7B shows a sample format of the black list hash tag stored in the default black list hash tag storage unit 172. As shown in FIG. 7B, one line of data corresponds to data related to one black list hash tag, and is stored for each black list hash tag.

The information stored in the extended blacklist hash tag storage unit 173 is information input from the blacklist hash tag extension unit 160, and will be described later.

The blacklist hash tag extension unit 160 uses the blacklist hash tags stored in the default blacklist hash tag storage unit 172 to generate information used for determining whether each document is a document to be excluded (a document related to a topic to be excluded). This information makes it possible to determine that a document is to be excluded even though it does not contain a blacklist hash tag.

The blacklist hash tag extension unit 160 is a means for estimating feature words related to a blacklist hash tag. A feature word of a blacklist hash tag is a morpheme that appears characteristically in documents including the blacklist hash tag. The blacklist hash tag extension unit 160 is default document extraction means that reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and extracts, from the plurality of documents stored in the document storage unit 100, documents including a blacklist hash tag as documents to be excluded. At this time, based on the posting time, documents for a predetermined period, such as the several hours immediately preceding the time of extraction, are acquired. The blacklist hash tag extension unit 160 is also first appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of each morpheme in the extracted documents (document group) to be excluded. At this time, the blacklist hash tag extension unit 160 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Also, for each morpheme, the blacklist hash tag extension unit 160 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.

The blacklist hash tag extension unit 160 extracts morphemes (features) characteristic of documents (document groups) to be excluded from the above values. The black list hash tag extension unit 160 generates a feature amount that is information describing the feature of the target topic for each black list hash tag.

Specifically, it is generated as follows. First, the blacklist hash tag extension unit 160 calculates an IDF value (reverse appearance frequency) for each morpheme from the morphemes included in each document by the following formula.

Here, i is a subscript indicating a morpheme, |D| is the total number of unique users, and |{d : t_i ∈ d}| is the number of unique users who have posted a document containing morpheme i. Since this IDF value is the same as that calculated by the topic tag estimation unit 130, the value calculated by the topic tag estimation unit 130 may be used.

Subsequently, the blacklist hash tag extension unit 160 calculates, for each blacklist hash tag, a TF value (appearance frequency) for each morpheme from the morphemes included in each extracted document to be excluded (the morphemes to which that blacklist hash tag is assigned) by the following formula.

Here, j is a subscript indicating a blacklist hash tag, and n_{i,j} is the number of unique users who have posted a document that relates to blacklist hash tag j (a document including blacklist hash tag j) and includes morpheme i.

Subsequently, the blacklist hash tag extension unit 160 obtains the TFIDF value (tfidf_{i,j}) of morpheme i in blacklist hash tag j by the following equation.
tfidf_{i,j} = tf_{i,j} · idf_i
By performing this for each morpheme, a feature quantity (a TFIDF value for each morpheme i) of the blacklist hash tag is generated. Continue until all blacklist hashtag features are generated. The TFIDF value may be weighted in the same manner as described above.

The blacklist hash tag expansion unit 160 outputs the calculated TFIDF value of each morpheme for each blacklist hashtag to the blacklist tag storage unit 170 for storage. Here, only the morphemes (feature words) having a TFIDF value equal to or greater than the threshold value may be stored in the blacklist tag storage unit 170.

The blacklist hash tag extension unit 160 is also a means for estimating hash tags, other than the blacklist hash tags, that are included in documents to be excluded. The blacklist hash tag extension unit 160 is tag document extraction means that reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and extracts, from the plurality of documents stored in the document storage unit 100, documents (a document group) including a tag other than the blacklist hash tags (a hash tag that is a candidate for a hash tag included in documents to be excluded). At this time, based on the posting time, documents for a predetermined period, such as the several hours immediately preceding the time of extraction, are acquired. The blacklist hash tag extension unit 160 is also second appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of each morpheme in the extracted documents (document group). At this time, the blacklist hash tag extension unit 160 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Also, for each morpheme, the blacklist hash tag extension unit 160 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.

Specifically, the blacklist hash tag extension unit 160 calculates an IDF value (reverse appearance frequency) for each morpheme in the same manner as described above. Note that the blacklist hash tag extension unit 160 may instead use the IDF value calculated above or the one calculated by the topic tag estimation unit 130.

Subsequently, the blacklist hash tag extension unit 160 calculates, for each tag other than the blacklist hash tags, a TF value (appearance frequency) for each morpheme from the morphemes included in each tag document (the morphemes to which that hash tag is assigned) by the following formula.

Here, j is a subscript indicating a hash tag, and n_{i,j} is the number of unique users who have posted a document including both morpheme i and hash tag j. The blacklist hash tag extension unit 160 may use the TF value calculated by the topic tag estimation unit 130.

Subsequently, the blacklist hash tag extension unit 160 obtains the TFIDF value (tfidf_{i,j}) of morpheme i in hash tag j by the following equation.
tfidf_{i,j} = tf_{i,j} · idf_i
By performing this for each morpheme, a hash tag feature amount (a TFIDF value for each morpheme i) is generated. Continue until all hash tag features are generated. The TFIDF value may be weighted in the same manner as described above.

Also, the blacklist hash tag extension unit 160 is default document extraction means that reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172 and extracts, from the plurality of documents stored in the document storage unit 100, documents including a blacklist hash tag as documents to be excluded. At this time, based on the posting time, documents for a predetermined period, such as the several hours immediately preceding the time of extraction, are acquired. The blacklist hash tag extension unit 160 is also first appearance frequency calculation means that refers to the information stored in the document storage unit 100 and the morpheme storage unit 120 and calculates the appearance frequency of each morpheme in the extracted documents (document group) to be excluded. At this time, the blacklist hash tag extension unit 160 calculates, as the appearance frequency of a morpheme, the number of users who have posted a document including the morpheme. Also, for each morpheme, the blacklist hash tag extension unit 160 calculates the reverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted documents containing the morpheme.

The blacklist hash tag extension unit 160 obtains the TFIDF value (tfidf_{i,j}) of morpheme i in blacklist hash tag j as described above. Since the TFIDF value in blacklist hash tag j is the same for each morpheme as the value already obtained, the TFIDF value calculated above may be used.

The blacklist hash tag extension unit 160 functions as part of the second topic document determination means by comparing the feature amount of each blacklist hash tag calculated as described above with the feature amount of each hash tag. Specifically, the blacklist hash tag extension unit 160 calculates, for each blacklist hash tag, the similarity with every hash tag (other than the blacklist hash tags) as a cosine distance (cosine similarity) using the following formula.

\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}

Here, A and B are the feature amount of the blacklist hash tag and the feature amount of the hash tag, respectively, and A_i and B_i are their TFIDF values for morpheme i. Instead of the cosine distance, a Jaccard distance or a Euclidean distance may be used to calculate the similarity between the feature amounts indicated by the appearance frequencies of the morphemes. Any other calculation method may also be used as long as it yields a similarity between feature amounts.

For each blacklist hash tag, the blacklist hash tag extension unit 160 determines whether there is a hash tag whose similarity is equal to or higher than a preset threshold, and treats each hash tag whose similarity is equal to or higher than the threshold as a hash tag related to documents to be excluded. By performing this process on all the blacklist hash tags, the hash tags related to documents to be excluded can be extracted.

The blacklist hash tag extension unit 160 outputs information indicating the extracted hash tags related to documents to be excluded (extended blacklist hash tags) to the extended blacklist hash tag storage unit 173 for storage. FIG. 7C shows a sample format of the extended blacklist hash tags stored in the extended blacklist hash tag storage unit 173. As shown in FIG. 7C, one line of data corresponds to data related to one extended blacklist hash tag, and data is stored for each extended blacklist hash tag.

The blacklist user storage unit 180 is a means for inputting and storing blacklist user IDs, each indicating a blacklist user. A blacklist user is a user whose posted documents should be excluded. The blacklist user IDs are registered in advance by, for example, the administrator of the related document extraction device 10. FIG. 7D shows a sample format of the blacklist user IDs stored in the blacklist user storage unit 180. As shown in FIG. 7D, one line of data corresponds to data related to one blacklist user ID, and data is stored for each blacklist user ID. Information other than a user ID may be used as long as it identifies the blacklist user.

The noise removing unit 190 is topic exclusion means that determines whether or not a document input from the topic ID assigning unit 150 is an inappropriate document (a document related to an inappropriate topic) and excludes the document if it is. Specifically, the noise removal unit 190 has the following functions.

The noise removing unit 190 reads the black list morpheme from the default black list morpheme storage unit 171 and determines whether or not the black list morpheme is included in the document input from the topic ID assigning unit 150. This determination is performed by matching a character string between a document and a blacklist morpheme. If the noise removing unit 190 determines that the black list morpheme is included in the document, the noise removing unit 190 excludes the document as an inappropriate document to be excluded.

The noise removing unit 190 determines whether the document input from the topic ID assigning unit 150 was posted by carrying over (reposting) another document or as a reply to another document. Specifically, the noise removing unit 190 determines whether the document is an RT (retweet) or a reply tweet. Whether a document is an RT can be determined from the official Twitter API. The determination may also be made by text analysis; specifically, it can easily be made by checking whether the document includes the character string “RT” or a user name. If the noise removal unit 190 determines that the document carries over another document or is a reply to another document, the noise removal unit 190 excludes the document as an inappropriate document to be excluded.
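The text-analysis variant of this determination can be sketched as follows. The two regular expressions are assumptions of this sketch: “RT” is matched only as a standalone token, and a user name is assumed to take the form of “@” followed by word characters. The function name `is_retweet_or_reply` is likewise hypothetical.

```python
import re

# "RT" as a standalone token, so that words such as "ARTIST" do not match.
RT_PATTERN = re.compile(r"(?:^|\s)RT(?:\s|:|$)")
# Assumed user-name form: "@" followed by letters, digits or underscores.
MENTION_PATTERN = re.compile(r"@\w+")

def is_retweet_or_reply(text):
    """Return True if the text looks like an RT or mentions a user name."""
    return bool(RT_PATTERN.search(text) or MENTION_PATTERN.search(text))

samples = [
    "RT @alice: great show",
    "@bob I agree",
    "Great show tonight #yyy",
]
flags = [is_retweet_or_reply(s) for s in samples]
```

A purely textual check of this kind is only a heuristic; where the posting service exposes retweet or reply metadata, that metadata would be the more reliable source.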

The noise removal unit 190 performs multi-post determination. Multi-posting refers to posting to multiple topics. That is, it is determined whether the document is a document related to a plurality of topics. For example, when each broadcast station is treated as one topic, a document containing both #fff and #zzz, which are hash tags related to different broadcast stations, is regarded as a post made to a plurality of broadcast stations. The noise removal unit 190 determines whether or not the document is a multi-post by determining whether the topic ID assigning unit 150 has assigned a plurality of topic IDs to the document input from the topic ID assigning unit 150. When the noise removing unit 190 determines that the document is a multi-post, the noise removing unit 190 excludes the document as an inappropriate document to be excluded.

The noise removing unit 190 determines whether the document input from the topic ID assigning unit 150 has been posted by the blacklist user. The noise removal unit 190 reads the user ID of the black list user from the black list user storage unit 180 and compares the user ID of the user who posted the document input from the topic ID adding unit 150 with the user ID of the black list user. If they match, it is determined that the document has been posted by the blacklist user. If the noise removing unit 190 determines that the document has been posted by the blacklist user, the noise removing unit 190 excludes the document as an inappropriate document to be excluded.

The noise removal unit 190 uses the information stored in the blacklist tag storage unit 170 to determine whether the document input from the topic ID adding unit 150 is an inappropriate document. In particular, the noise removal unit 190 determines and excludes inappropriate documents that do not contain blacklist hash tags.

The noise removing unit 190 reads the blacklist hash tags stored in the default blacklist hash tag storage unit 172, determines whether or not the document includes a blacklist hash tag (the default topic tag of an inappropriate topic), and excludes a document that includes one as an inappropriate document to be excluded.

The noise removal unit 190 is score calculation means that reads the feature amount information (the TFIDF value (score) of each morpheme for each blacklist hash tag, that is, each default topic tag) stored in the blacklist tag storage unit 170 and calculates, from the feature amount information, the score of each document for each blacklist hash tag. The noise removal unit 190 determines whether the document to be scored includes a morpheme (feature word) of the feature amount, and adds up the scores of the feature words included in the document. As in the score calculation by the topic ID assigning unit 150, a feature word that appears multiple times in the document may be treated in the same manner as one that appears once.

The noise removal unit 190 is a first topic document determination means that determines, based on the calculated score, whether the scored document is an inappropriate document that should be excluded. Specifically, the noise removal unit 190 determines whether the score is equal to or greater than a preset threshold value. If the score is equal to or greater than the threshold value, the noise removal unit 190 determines that the document is an inappropriate document and excludes it. This process is repeated for each blacklist hash tag whose feature amount is stored in the blacklist tag storage unit 170.

The noise removal unit 190 is also a second topic document determination means that reads the extended blacklist hash tags stored in the extended blacklist hash tag storage unit 173 and determines whether the acquired document is an inappropriate document by checking whether it includes an extended blacklist hash tag. The noise removal unit 190 determines that a document including an extended blacklist hash tag is an inappropriate document and excludes it. This process is repeated for each extended blacklist hash tag stored in the extended blacklist hash tag storage unit 173.

The noise removal unit 190 outputs the documents that are not excluded by the above processes to the topic document storage unit 200. A document excluded by the noise removal unit 190 may also be withheld from the processing by the topic tag estimation unit 130. For example, each document stored in the document storage unit 100 and each morpheme stored in the morpheme storage unit 120 may be associated with information indicating whether it relates to a document removed by the noise removal unit 190, and those relating to a removed document may be kept from being input to the topic tag estimation unit 130.

The topic document storage unit 200 is a means for inputting and storing a document that is input from the noise removal unit 190 and is assigned with one topic ID. The document with the topic ID is extracted as a document related to the topic related to the topic ID. FIG. 8 shows a sample format of a document stored in the topic document storage unit 200. As shown in FIG. 8, the data related to the document stored in the topic document storage unit 200 is associated with the topic ID in addition to the data related to the document stored in the document storage unit 100. The document with the topic ID stored in the topic document storage unit 200 is provided to the user as a document related to the topic for each topic ID, for example. The functional configuration of the related document extraction apparatus 10 has been described above.

FIG. 9 shows the hardware configuration of the related document extraction apparatus 10. As shown in FIG. 9, the related document extraction apparatus 10 is configured as a computer including hardware such as a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a communication module 1004 for communication, and an auxiliary storage device 1005 such as a hard disk. The functions of the related document extraction apparatus 10 described above are realized by operating these components under a program or the like. The above is the configuration of the related document extraction apparatus 10.

Subsequently, a related document extraction method, which is the processing executed by the related document extraction apparatus 10 according to the present embodiment, will be described using the flowcharts of FIGS. FIG. 10 is a flowchart showing the entire related document extraction method. In this processing, first, a plurality of documents to be extracted are input to and stored in the document storage unit 100 (S01). The documents input to the document storage unit 100 are output to the morpheme analysis unit 110. Subsequently, the morpheme analysis unit 110 performs morpheme analysis on each document and divides it into morphemes (S02, word acquisition step). Information indicating the morphemes obtained by the morpheme analysis unit 110 is stored in the morpheme storage unit 120.

Subsequently, the topic tag estimation unit 130 generates, from the documents stored in the document storage unit 100, the morphemes obtained by the morpheme analysis unit 110, and the information stored in the topic tag storage unit 140, the information used to determine whether each document is related to a specific topic (S03). This processing is performed by the topic feature word estimation unit 131 and the topic hash tag estimation unit 132, respectively.

The processing by the topic feature word estimation unit 131 will be described using the flowcharts of FIGS. As shown in FIG. 11, the topic feature word estimation unit 131 reads the default topic tag stored in the default topic tag storage unit 141 and extracts, from the plurality of documents stored in the document storage unit 100, each document that includes the default topic tag as a document related to the topic (topic document) (S301, default document extraction step). Subsequently, a feature amount is generated for each topic (S302, first appearance frequency calculation step). This process will be described in detail with reference to the flowchart of FIG.

First, an IDF value for each morpheme is calculated (S3021, first appearance frequency calculation step). Subsequently, the TF value for each morpheme is calculated from the morphemes included in each topic document for each topic ID (processing target) (S3022, first appearance frequency calculation step). Subsequently, the TFIDF value of the morpheme in each topic ID is obtained from the calculated IDF value and TF value (S3023, first appearance frequency calculation step). The obtained TFIDF value is a feature amount. The processing of S3022 and S3023 is repeated until the processing for all topic IDs is completed.
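The steps S3021 to S3023 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the IDF is taken as log(N / df) and the TF as a relative frequency within the topic, which are common textbook definitions and not necessarily the formulas used in this embodiment; the function names `idf_values` and `tfidf_per_topic` are hypothetical, and documents are assumed to be already divided into morpheme lists.

```python
import math
from collections import Counter


def idf_values(all_docs):
    """S3021: IDF per morpheme over all documents (assumed form: log(N / df))."""
    n_docs = len(all_docs)
    doc_freq = Counter()
    for morphemes in all_docs:
        doc_freq.update(set(morphemes))  # each document counts once per morpheme
    return {m: math.log(n_docs / df) for m, df in doc_freq.items()}


def tfidf_per_topic(topic_docs, idf):
    """S3022 and S3023: TF per morpheme within each topic ID, multiplied by IDF."""
    features = {}
    for topic_id, docs in topic_docs.items():  # loop over all topic IDs
        tf = Counter()
        for morphemes in docs:
            tf.update(morphemes)
        total = sum(tf.values())
        features[topic_id] = {
            m: (count / total) * idf.get(m, 0.0) for m, count in tf.items()
        }
    return features


# Hypothetical example data: four documents, two of which are topic documents of "T1".
all_docs = [["goal", "soccer"], ["soccer", "match"], ["rain", "today"], ["rain", "umbrella"]]
topic_docs = {"T1": [all_docs[0], all_docs[1]]}
features = tfidf_per_topic(topic_docs, idf_values(all_docs))
```

The resulting dictionary maps each topic ID to its feature amount, that is, a TFIDF value per morpheme.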

Subsequently, returning to FIG. 11, feature words are stored in the topic feature word storage unit 142 for each topic ID (S303, first appearance frequency calculation step). This process will be described in detail with reference to the flowchart of FIG. This process is performed for each topic ID. For each morpheme, it is determined whether or not the TFIDF value of the morpheme is greater than or equal to a preset threshold value (S3031, first appearance frequency calculation step). When it is determined that the TFIDF value is equal to or greater than the preset threshold, the morpheme and the TFIDF value are output to and stored in the topic feature word storage unit 142 for the topic ID (S3032, first appearance frequency calculation step). If it is determined that the TFIDF value is not greater than or equal to the preset threshold value, no special process is performed and the process moves to the next morpheme. The above processing is repeated for all morphemes for each topic ID, and is repeated until the processing for all topic IDs is completed. The processing by the topic feature word estimation unit 131 has been described above.
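The threshold test of S3031 and S3032 amounts to a simple filter. The following Python sketch assumes the feature amounts are held as a dictionary from topic ID to per-morpheme TFIDF values; the function name `select_feature_words` and the threshold value are hypothetical.

```python
def select_feature_words(tfidf_by_topic, threshold):
    """Keep, for each topic ID, only morphemes whose TFIDF value is >= threshold."""
    return {
        topic_id: {m: value for m, value in scores.items() if value >= threshold}
        for topic_id, scores in tfidf_by_topic.items()
    }


# Hypothetical values: "the" falls below the threshold and is not stored.
stored = select_feature_words({"T1": {"soccer": 0.35, "the": 0.01}}, threshold=0.1)
```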

Subsequently, processing by the topic hash tag estimation unit 132 will be described using the flowcharts of FIGS. 14 and 15. As illustrated in FIG. 14, the topic hash tag estimation unit 132 reads the default topic tag stored in the default topic tag storage unit 141 and extracts, from the plurality of documents stored in the document storage unit 100, documents that include a hash tag other than the default topic tag (tag documents) (S311, tag document extraction step). Subsequently, a feature amount is generated for each hash tag (S312, second appearance frequency calculation step). The generation of the feature amount is performed in the same manner as the process described with reference to the flowchart of FIG. However, in this case, the processing loop shown in FIG. 12 is performed for each hash tag, and is repeated until the processing for all hash tags is completed.

Subsequently, a document including a default topic tag is extracted as a document (topic document) related to the topic from a plurality of documents stored by the document storage unit 100 (S313, default document extraction step). Subsequently, a feature amount is generated for each topic (S314, first appearance frequency calculation step). The generation of the feature amount is performed in the same manner as the process described with reference to the flowchart of FIG.

Subsequently, the feature amount of each topic ID calculated as described above is compared with the feature amount of each hash tag, and, based on the comparison result, a hash tag related to the topic with the topic ID (extended topic hash tag) is output to and stored in the extended topic hash tag storage unit 143 (S315, second topic document determination step).

This process will be described in more detail using the flowchart of FIG. This process is performed for each topic ID and hash tag. The similarity of the feature amount between the topic ID and the hash tag is calculated (S3151, second topic document determination step). As this similarity, for example, a cosine distance is used as described above. Subsequently, it is determined whether or not the calculated similarity is equal to or greater than a preset threshold value (S3152, second topic document determination step). When it is determined that the similarity is equal to or higher than the preset threshold, the hash tag is output to and stored in the extended topic hash tag storage unit 143 as an extended topic hash tag for the topic ID (S3153, second topic document determination step). If it is determined that the similarity is not greater than or equal to the preset threshold, no special process is performed and the process proceeds to the next hash tag. The above processing is repeated for all hash tags for each topic ID, and is repeated until the processing for all topic IDs is completed. The processing by the topic hash tag estimation unit 132 has been described above.
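The comparison of S3151 to S3153 can be sketched in Python as follows. This is an illustration only: it assumes the feature amounts are sparse dictionaries of TFIDF values and interprets the "cosine distance" of the text as a cosine similarity, since larger values are treated as more similar; the function names and the threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature vectors (morpheme -> TFIDF)."""
    dot = sum(value * b[m] for m, value in a.items() if m in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


def extend_topic_hash_tags(topic_features, tag_features, threshold):
    """For each topic ID, collect the hash tags whose similarity is >= threshold."""
    extended = {topic_id: [] for topic_id in topic_features}
    for topic_id, topic_vec in topic_features.items():  # loop over topic IDs
        for tag, tag_vec in tag_features.items():       # loop over hash tags
            if cosine_similarity(topic_vec, tag_vec) >= threshold:  # S3152
                extended[topic_id].append(tag)                      # S3153
    return extended


# Hypothetical feature amounts for one topic and two candidate hash tags.
topic_features = {"T1": {"soccer": 1.0, "goal": 1.0}}
tag_features = {"#match": {"soccer": 1.0, "goal": 1.0}, "#weather": {"rain": 1.0}}
extended = extend_topic_hash_tags(topic_features, tag_features, threshold=0.8)
```

The same comparison applies unchanged to the expansion of blacklist hash tags, with blacklist hash tags taking the place of topic IDs.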

Subsequently, referring back to FIG. 10, the blacklist hash tag extension unit 160 generates, from the documents stored in the document storage unit 100, the morphemes obtained by the morpheme analysis unit 110, and the information stored in the blacklist tag storage unit 170, the information used to determine whether each document is noise, that is, whether each document is related to a specific topic that is inappropriate for extraction (S04).

In this process, the blacklist hash tag extension unit 160 estimates a feature word related to the blacklist hashtag. In this estimation, the IDF value for each morpheme and the TF value for each morpheme for each blacklist hash tag are calculated, and the TFIDF value for the morpheme for each blacklist hashtag is calculated. The calculated TFIDF value of each morpheme for each blacklist hash tag is output to and stored in the blacklist tag storage unit 170. Here, only morphemes (feature words) having a TFIDF value equal to or greater than a preset threshold value may be stored in the blacklist tag storage unit 170.

In addition, the blacklist hash tag extension unit 160 estimates extended blacklist hash tags. This process will be described with reference to the flowcharts of FIGS. As shown in FIG. 16, the blacklist hash tag extension unit 160 reads the blacklist hash tag stored in the default blacklist hash tag storage unit 172 and extracts, from the plurality of documents stored in the document storage unit 100, documents that include a hash tag other than the blacklist hash tag (tag documents) (S411, tag document extraction step). Subsequently, a feature amount is generated for each hash tag (S412, second appearance frequency calculation step). The generation of the feature amount is performed in the same manner as the process described with reference to the flowchart of FIG. However, in this case, the processing loop shown in FIG. 12 is performed for each hash tag, and is repeated until the processing for all hash tags is completed.

Subsequently, documents including a blacklist hash tag are extracted from the plurality of documents stored in the document storage unit 100 (S413, default document extraction step). Subsequently, a feature amount is generated for each blacklist hash tag (S414, first appearance frequency calculation step). The generation of the feature amount is performed in the same manner as the process described with reference to the flowchart of FIG. However, in this case, the processing loop shown in FIG. 12 is performed for each blacklist hash tag, and is repeated until the processing for all the blacklist hash tags is completed.

Subsequently, the feature amount of each blacklist hash tag calculated as described above is compared with the feature amount of each hash tag, and, based on the comparison result, a hash tag related to the blacklist hash tag (extended blacklist hash tag) is output to and stored in the extended blacklist hash tag storage unit 173 (S415, second topic document determination step).

This process will be described in more detail using the flowchart of FIG. This process is performed for each blacklist hash tag and hash tag. The similarity between the blacklist hash tag and the hash tag is calculated (S4151, second topic document determination step). As this similarity, for example, a cosine distance is used as described above. Subsequently, it is determined whether or not the calculated similarity is equal to or greater than a preset threshold (S4152, second topic document determination step). If it is determined that the similarity is equal to or higher than the preset threshold, the hash tag is output to the extended blacklist hash tag storage unit 173 and stored as an extended blacklist hash tag for the blacklist hash tag (S4153, second topic document determination step). If it is determined that the similarity is not greater than or equal to the preset threshold, no special process is performed and the process proceeds to the next hash tag. The above process is repeated for all the hash tags for each blacklist hash tag, and is repeated until the process for all the blacklist hash tags is completed. The above is the processing by the blacklist hash tag extension unit 160.

Subsequently, returning to FIG. 10, the topic ID assigning unit 150 uses the information stored in the topic tag storage unit 140 to determine whether each document stored in the document storage unit 100 is related to a topic and, according to the determination, assigns a topic ID to the document (S05, topic document extraction step).

This process will be described in more detail using the flowchart of FIG. The topic ID assigning unit 150 reads the feature amount information stored by the topic feature word estimation unit 131 and assigns a topic ID to the document based on that information (S501, topic document extraction step).

This process will be described in more detail using the flowchart of FIG. This process is performed for each document to which a topic is assigned. First, the feature amount information stored by the topic feature word estimation unit 131 is acquired for each topic (topic ID) (S5011, score calculation step). Subsequently, the “score total value” of the document is initialized (value is set to zero) (S5012, score calculation step). Subsequently, it is determined whether or not each feature word is included in the document (S5013, score calculation step). When it is determined that the feature word is included in the document, the score (TFIDF value) of the feature word is added to the “score total value” (S5014, score calculation step). When it is determined that the feature word is not included in the document, the score of the feature word is not added to the “score total value”.

When the above processing (S5013, S5014) is completed for all feature words, it is determined whether or not the “score total value” is equal to or greater than a preset threshold value (S5015, first topic document determination step). If it is determined that the “score total value” is equal to or greater than a preset threshold, the topic ID of the topic is assigned to the document (S5016, first topic document determination step). If it is determined that the “score total value” is not equal to or greater than a preset threshold value, the topic ID of the topic is not assigned to the document. The above processing is repeated for all topics for each document, and is repeated until the processing for all documents is completed.
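The scoring loop of S5011 to S5016 can be sketched in Python as follows. The sketch assumes that feature words are held per topic ID as a dictionary from word to score (TFIDF value) and that a feature word is counted once however often it appears in the document, as the embodiment permits; the function name `assign_topics_by_score` and the example values are hypothetical.

```python
def assign_topics_by_score(doc_morphemes, feature_words_by_topic, threshold):
    """Return the topic IDs whose summed feature-word score reaches the threshold."""
    present = set(doc_morphemes)  # a repeated word is scored as a single occurrence
    assigned = []
    for topic_id, feature_words in feature_words_by_topic.items():  # S5011
        score_total = 0.0                                           # S5012
        for word, score in feature_words.items():                   # S5013
            if word in present:
                score_total += score                                # S5014
        if score_total >= threshold:                                # S5015
            assigned.append(topic_id)                               # S5016
    return assigned


# Hypothetical feature words and scores for two topics.
feature_words_by_topic = {
    "T1": {"soccer": 0.4, "goal": 0.3},
    "T2": {"rain": 0.5},
}
topics = assign_topics_by_score(
    ["soccer", "goal", "goal", "today"], feature_words_by_topic, threshold=0.6
)
```

In this example the document contains no hash tag at all, yet it can still receive the topic ID "T1" because its feature words together reach the threshold.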

Next, returning to FIG. 18, the topic ID assigning unit 150 reads the default topic tag stored in the default topic tag storage unit 141 and the extended topic hash tag stored in the extended topic hash tag storage unit 143. Then, a topic ID is assigned to the document based on the information (S502, topic document extraction step (second topic document determination step)).

This process will be described in more detail using the flowchart of FIG. This process is performed for each document to which a topic is to be assigned. First, for each topic (topic ID), the default topic tag and the extended topic hash tags associated with the topic are acquired (S5021). Subsequently, it is determined whether each of these default topic tags and extended topic hash tags is included in the document (S5022, second topic document determination step). When it is determined that a default topic tag or an extended topic hash tag is included in the document, the topic ID of the topic is assigned to the document (S5023, second topic document determination step). When it is determined that neither a default topic tag nor an extended topic hash tag is included in the document, the topic ID of the topic is not assigned to the document. The above processing (S5022, S5023) is performed for all default topic tags and extended topic hash tags associated with the topic. The above processing is repeated for all topics for each document, and is repeated until the processing for all documents is completed.
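The tag-based assignment of S5021 to S5023 reduces to a membership test. The Python sketch below assumes that the tags of a document have already been extracted into a list and that the default topic tags and extended topic hash tags of each topic are merged into one collection; the function name `assign_topics_by_tag` is hypothetical.

```python
def assign_topics_by_tag(doc_tags, tags_by_topic):
    """Return the topic IDs for which the document contains at least one topic tag."""
    present = set(doc_tags)
    return [
        topic_id
        for topic_id, tags in tags_by_topic.items()  # S5021: tags per topic ID
        if present & set(tags)                       # S5022: is any tag included?
    ]                                                # S5023: the topic ID is assigned


# Hypothetical default and extended tags for two topics.
tags_by_topic = {"T1": ["#soccer", "#worldcup"], "T2": ["#weather"]}
topics = assign_topics_by_tag(["#worldcup", "#tv"], tags_by_topic)
```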

The document to which the topic ID is assigned by the topic ID assigning unit 150 is output to the noise removing unit 190.

Subsequently, the noise removal unit 190 determines whether the document input from the topic ID assigning unit 150 is an inappropriate document and, if so, excludes it (S06, topic document extraction step).

This process will be described in more detail using the flowchart of FIG. This process is performed for each document (to which a topic ID has been assigned) input from the topic ID assigning unit 150. A blacklist morpheme (NG word) is read from the default blacklist morpheme storage unit 171, and it is determined whether the blacklist morpheme is included in the document (S601). If it is determined that the blacklist morpheme is included in the document, the document is excluded as an inappropriate document (subsequent processing is not performed).

If it is determined that the blacklist morpheme is not included in the document, it is then determined whether the document is RT or a reply tweet (S602). If it is determined that the document is RT or a reply tweet, the document is excluded as an inappropriate document to be excluded (subsequent processing is not performed).

If it is determined that the document is neither an RT nor a reply tweet, it is then determined whether the document has been multi-posted (S603). If it is determined that the document is multi-posted, the document is excluded as an inappropriate document to be excluded (no subsequent processing is performed).

If it is determined that the document has not been multi-posted, the user IDs of the blacklist users are then read from the blacklist user storage unit 180, and it is determined whether the document has been posted by a blacklist user (S604). If it is determined that the document has been posted by a blacklist user, the document is excluded as an inappropriate document (subsequent processing is not performed).

If it is determined that the document has not been posted by a blacklist user, the feature amount (the TFIDF value (score) of each morpheme for each default topic tag) stored in the blacklist tag storage unit 170 and the extended blacklist hash tags stored in the extended blacklist hash tag storage unit 173 are then read, and it is determined based on these, as described above, whether the document is an inappropriate document that should be excluded (S605). If it is determined that the document is inappropriate, it is excluded (no further processing is performed). If it is determined that the document is not an inappropriate document that should be excluded, the document is output from the noise removal unit 190 to the topic document storage unit 200.
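The cascade of checks S601 to S605 can be sketched in Python as a sequence of early exits. The sketch is an illustration under several assumptions: the `Doc` record and its field names are hypothetical, the RT/reply and multi-post determinations are assumed to have been made beforehand and stored as flags, and the default and extended blacklist hash tags are passed together as one set.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical document record; the field names are illustrative only."""
    user_id: str
    morphemes: list
    hash_tags: list = field(default_factory=list)
    is_rt_or_reply: bool = False
    is_multi_post: bool = False


def is_noise(doc, ng_words, blacklist_users, blacklist_features, blacklist_tags, threshold):
    """Return True if the document should be excluded as an inappropriate document."""
    words = set(doc.morphemes)
    if words & set(ng_words):            # S601: contains a blacklist morpheme (NG word)
        return True
    if doc.is_rt_or_reply:               # S602: RT or reply tweet
        return True
    if doc.is_multi_post:                # S603: multi-posted document
        return True
    if doc.user_id in blacklist_users:   # S604: posted by a blacklist user
        return True
    # S605: blacklist hash tags (default and extended) and score-based determination
    if set(doc.hash_tags) & set(blacklist_tags):
        return True
    for feature_words in blacklist_features.values():
        score = sum(s for word, s in feature_words.items() if word in words)
        if score >= threshold:
            return True
    return False


# Hypothetical blacklist data and one document that passes every check.
doc = Doc(user_id="u1", morphemes=["soccer", "goal"], hash_tags=["#soccer"])
keep = not is_noise(doc, {"spamword"}, {"u9"}, {"#ad": {"sale": 0.6}}, {"#ad"}, threshold=0.5)
```

Only documents for which `is_noise` returns False would be output to the topic document storage unit 200.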

Subsequently, returning to FIG. 10, the document input to the topic document storage unit 200 is stored together with the assigned topic ID. The above is the processing executed by the related document extraction apparatus 10 according to the present embodiment. The above processing may be triggered, for example, by an operation of the administrator of the related document extraction apparatus 10, or may be performed at preset time intervals. Note that in the above processing, the assignment of topic IDs to documents and the generation of the information used for assigning topic IDs (feature amounts and extended topic hash tags) are performed as one series of processes, but these processes may be performed independently of each other at different timings.

As described above, in this embodiment, a document related to a topic is extracted using the frequency of appearance of words in a document including a default topic tag indicating a topic. That is, even if the default topic tag indicating a topic is not included, a document corresponding to the appearance frequency is extracted as a document related to the topic. Thereby, according to the present embodiment, a document related to a specific topic can be appropriately extracted from a plurality of documents such as tweets. Therefore, it is possible to exhaustively extract documents related to the topic. In addition to exhaustiveness, dynamic topic hash tags and topic feature words can be estimated, so that documents related to topics can be extracted in real time.

As in the present embodiment, the score of a document may be calculated based on the feature words in order to extract the document. According to this configuration, for example, a document including a word that appears frequently in documents including a default topic tag can be extracted as a document related to the topic, so that documents related to a specific topic can be reliably extracted. This makes it possible to extract documents without a hash tag, which increases the number of documents that can be extracted.

Further, when a word appears multiple times in the document when calculating the score, the score of the document may be calculated in the same manner as in the case of a single occurrence. According to this configuration, it is possible to prevent the score of the document from being increased due to words frequently included in the document, and it is possible to avoid extracting an inappropriate document as a document related to a topic.

In addition, as in the present embodiment, the topic hash tag may be expanded by extracting the document by comparing the feature quantities of the tag document and the topic document. According to this configuration, a document (group) including tags other than the default topic tag can be extracted as a document related to a topic, and a document related to a specific topic can be reliably extracted. Accordingly, by registering one or more tags (hash tags or keywords) related to the topic as default topic tags in advance, the tags can be dynamically estimated, and the number of documents that can be extracted increases.

Generally, a poster who attaches a hash tag creates the document with a specific topic in mind. Since topics and topic hash tags have a one-to-N relationship, more topic documents can be extracted by collecting as many of the hash tags associated with a topic as possible. For example, users often post tweets about broadcast programs with the hash tag of the broadcast station, whereas a well-known program has a hash tag of its own. By comparing the feature amounts of the topic and the hash tag, a hash tag related to the program being broadcast can be detected dynamically and at an early stage.

Also, noise may be removed as in the present embodiment. Noise removal is important for document extraction. According to this configuration, inappropriate documents can be excluded, and, for example, inappropriate documents can be prevented from being presented to the user. Further, if the topic hash tags and the topic feature words are estimated based on a document group from which noise has been removed, their estimation accuracy improves. As described above, in the estimation of the topic hash tags and the topic feature words, the feature amount serves as a reference value indicating the topic, so the quality of the estimated information decreases as the noise in the data increases. Therefore, cleansing the seed data is important. In addition, noise-free documents related to the topics can be extracted. Further, noise can be removed more appropriately by removing it dynamically in the same manner as document extraction. Further, since the blacklist is automatically expanded in real time, the need to register the blacklist manually is reduced.

However, if it is considered that the noise included in the document group to be extracted is small, it is not always necessary to remove noise (an inappropriate document).

Also, the appearance frequency may be counted in units of users as in this embodiment. According to this configuration, the influence of each user can be made uniform; for example, the influence of one user posting documents with the same content several times can be suppressed. Thereby, it is possible to appropriately extract a document related to a specific topic. However, when information on the user who posted a document cannot be acquired, or when users are not expected to post the same content repeatedly, the appearance frequency may be counted in document units. That is, the IDF value or TF value may be calculated by counting in document units.
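The difference between counting in units of users and counting in units of documents can be illustrated with a short Python sketch. It assumes that each post is available as a pair of user ID and morpheme list, and it counts, for each morpheme, either the number of distinct users or the number of documents that used it; the exact counting rule of the embodiment may differ, and the function names are hypothetical.

```python
from collections import Counter


def count_by_user(posts):
    """Count, per morpheme, the number of distinct users who used it."""
    users_per_word = {}
    for user_id, morphemes in posts:
        for m in set(morphemes):
            users_per_word.setdefault(m, set()).add(user_id)
    return {m: len(users) for m, users in users_per_word.items()}


def count_by_document(posts):
    """Count, per morpheme, the number of documents that contain it."""
    counts = Counter()
    for _, morphemes in posts:
        counts.update(set(morphemes))
    return dict(counts)


# Hypothetical posts: user "u1" posts the same content twice.
posts = [("u1", ["sale", "now"]), ("u1", ["sale", "now"]), ("u2", ["sale"])]
per_user = count_by_user(posts)          # {"sale": 2, "now": 1}
per_document = count_by_document(posts)  # {"sale": 3, "now": 2}
```

With user-unit counting, the repeated post by "u1" contributes only once to each morpheme, which is the suppression effect described above.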

Moreover, the popularity and rarity of a morpheme can be expressed by expressing the feature quantity by the feature of the morpheme unit using the TFIDF value as in this embodiment. Thereby, a document related to a specific topic can be more appropriately extracted from a plurality of documents such as tweets.

Also, documents related to multiple topics may be excluded. Documents posted on multiple topics (multi-topic postings) are often not related to each topic. Therefore, according to this configuration, it is possible to avoid extracting an inappropriate document as a document related to a topic.

Next, a related document extraction program for causing a computer to execute the above-described series of processes performed by the related document extraction apparatus 10 will be described. As shown in FIG. 22, the related document extraction program 40 is stored in a program storage area 31 formed on a recording medium 30 that is inserted into a computer and accessed, or that is provided in the computer.

The related document extraction program 40 includes a document storage module 400, a morpheme analysis module 410, a morpheme storage module 420, a topic tag estimation module 430, a topic tag storage module 440, a topic ID assignment module 450, a blacklist hash tag extension module 460, a blacklist tag storage module 470, a blacklist user storage module 480, a noise removal module 490, and a topic document storage module 500. The functions realized by executing the document storage module 400, the morpheme analysis module 410, the morpheme storage module 420, the topic tag estimation module 430, the topic tag storage module 440, the topic ID assignment module 450, the blacklist hash tag extension module 460, the blacklist tag storage module 470, the blacklist user storage module 480, the noise removal module 490, and the topic document storage module 500 are the same as the functions of the document storage unit 100, the morpheme analysis unit 110, the morpheme storage unit 120, the topic tag estimation unit 130, the topic tag storage unit 140, the topic ID assigning unit 150, the blacklist hash tag extension unit 160, the blacklist tag storage unit 170, the blacklist user storage unit 180, the noise removal unit 190, and the topic document storage unit 200 of the related document extraction apparatus 10 described above, respectively.

Note that a part or all of the related document extraction program 40 may be transmitted via a transmission medium such as a communication line and received and recorded (including installation) by another device. Moreover, the modules of the related document extraction program 40 may be installed across a plurality of computers instead of on one computer. In that case, the series of processes of the related document extraction program 40 described above is performed by the computer system formed by the plurality of computers.

DESCRIPTION OF SYMBOLS 10 ... Related document extraction apparatus, 100 ... Document storage unit, 110 ... Morpheme analysis unit, 120 ... Morpheme storage unit, 130 ... Topic tag estimation unit, 131 ... Topic feature word estimation unit, 132 ... Topic hash tag estimation unit, 140 ... Topic tag storage unit, 141 ... Default topic tag storage unit, 142 ... Topic feature word storage unit, 143 ... Extended topic hash tag storage unit, 150 ... Topic ID assigning unit, 160 ... Blacklist hash tag extension unit, 170 ... Blacklist tag storage unit, 171 ... Default blacklist morpheme storage unit, 172 ... Default blacklist hash tag storage unit, 173 ... Extended blacklist hash tag storage unit, 180 ... Blacklist user storage unit, 190 ... Noise removal unit, 200 ... Topic document storage unit, 1001 ... CPU, 1002 ... RAM, 1003 ... ROM, 1004 ... Communication module, 1005 ... Auxiliary storage device, 30 ... Recording medium, 31 ... Program storage area, 40 ... Related document extraction program, 400 ... Document storage module, 410 ... Morpheme analysis module, 420 ... Morpheme storage module, 430 ... Topic tag estimation module, 440 ... Topic tag storage module, 450 ... Topic ID assignment module, 460 ... Blacklist hash tag extension module, 470 ... Blacklist tag storage module, 480 ... Blacklist user storage module, 490 ... Noise removal module, 500 ... Topic document storage module.

Claims

Default topic tag storage means for storing a default topic tag indicating a topic in advance;
Document storage means for storing a plurality of documents in advance;
Word acquisition means for dividing the document stored by the document storage means into words;
Default document extraction means for extracting a document including a default topic tag stored by the default topic tag storage means from a plurality of documents stored by the document storage means;
First appearance frequency calculation means for calculating the appearance frequency of the words divided by the word acquisition means in the document extracted by the default document extraction means;
Topic document extraction means for extracting a document related to the topic from a document other than the document extracted by the default document extraction means using the appearance frequency calculated by the first appearance frequency calculation means;
Related document extraction apparatus comprising:
2. The related document extraction device according to claim 1, wherein the topic document extraction means includes:
score calculation means for calculating, using the appearance frequency calculated by the first appearance frequency calculation means, a score of a document other than the document extracted by the default document extraction means from the words appearing in that document; and
first topic document determination means for determining, based on the score calculated by the score calculation means, whether the document for which the score was calculated is a document related to the topic.
3. The related document extraction device according to claim 2, wherein, when a word appears multiple times in the document, the score calculation means calculates the score of the document in the same way as when the word appears once.
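For illustration only, and not as part of the claims, the scoring of claims 2 and 3 can be sketched as follows. The claims do not fix a scoring formula, so the sum of relative frequencies, the threshold value, and all names here are assumptions of this sketch. Using a set of words makes a repeated word count once, as claim 3 requires.

```python
def document_score(words, frequency):
    # Assumed formula: sum of the relative frequencies of the distinct words.
    total = sum(frequency.values())
    if total == 0:
        return 0.0
    # set(words) collapses repeated occurrences into a single occurrence.
    return sum(frequency.get(w, 0) for w in set(words)) / total

def is_topic_document(words, frequency, threshold=0.2):
    # Assumed decision rule: the document is related if its score reaches the threshold.
    return document_score(words, frequency) >= threshold

# Invented frequencies, as if taken from documents carrying the default topic tag.
frequency = {"goal": 2, "striker": 1, "tonight": 1}
repeated = document_score(["goal", "goal", "goal"], frequency)
single = document_score(["goal"], frequency)
```

Counting each word once keeps a short post that repeats one topical word from outscoring a post that uses several different topical words.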
4. The related document extraction device according to any one of claims 1 to 3, wherein the topic document extraction means includes:
tag document extraction means for extracting, from the plurality of documents stored by the document storage means, a document including a tag other than the default topic tag;
second appearance frequency calculation means for calculating the appearance frequency, in the document extracted by the tag document extraction means, of the words divided by the word acquisition means; and
second topic document determination means for comparing the appearance frequency calculated by the first appearance frequency calculation means with the appearance frequency calculated by the second appearance frequency calculation means, and determining, based on the comparison result, whether the document extracted by the tag document extraction means is a document related to the topic.
5. The related document extraction device according to claim 4, wherein the second topic document determination means compares the appearance frequencies by calculating a cosine distance, a Jaccard distance, or a Euclidean distance between a feature amount indicated by the word appearance frequency calculated by the first appearance frequency calculation means and a feature amount indicated by the word appearance frequency calculated by the second appearance frequency calculation means.
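For illustration only, and not as part of the claims, the three distances named in claim 5 can be sketched over word-frequency tables as follows. Representing each feature amount as a dictionary from word to count, and computing the Jaccard distance over the sets of words with nonzero counts, are assumptions of this sketch.

```python
import math

def cosine_distance(a, b):
    # 1 minus the cosine similarity of the two frequency vectors.
    words = set(a) | set(b)
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in words)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    # 1 minus the overlap ratio of the two vocabularies.
    set_a = {w for w, v in a.items() if v > 0}
    set_b = {w for w, v in b.items() if v > 0}
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def euclidean_distance(a, b):
    # Straight-line distance between the two frequency vectors.
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

# Invented frequencies: one table per tag.
default_tag_freq = {"goal": 2, "striker": 1}
similar_tag_freq = {"goal": 1, "striker": 1, "late": 1}
unrelated_tag_freq = {"rain": 1}
```

A smaller distance to `default_tag_freq` suggests that documents carrying the other tag discuss the same topic; where to place the cutoff is left open here.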
6. The related document extraction device according to any one of claims 1 to 5, wherein
the default topic tag storage means stores, as the default topic tag, a default topic tag related to an inappropriate topic, and
the topic document extraction means determines whether a document is related to the inappropriate topic and excludes the documents so determined.
7. The related document extraction device according to any one of claims 1 to 6, wherein
the document storage means stores information on the user who posted each document, and
the first appearance frequency calculation means calculates, as the appearance frequency of a word, the number of users who have posted a document including the word.
8. The related document extraction device according to claim 7, wherein
the first appearance frequency calculation means calculates, for each word, an inverse appearance frequency from the ratio of the total number of users who have posted documents to the number of users who have posted a document including the word, and
the topic document extraction means extracts a document related to the topic using the inverse appearance frequency calculated by the first appearance frequency calculation means.
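For illustration only, and not as part of the claims, the user-based frequency of claim 7 and the frequency of claim 8 derived from the user ratio can be sketched as follows. The claim states only that the value is calculated from the ratio of users; taking the logarithm of that ratio, as is common for inverse document frequency, is an assumption of this sketch, as are the names and sample posts.

```python
import math
from collections import defaultdict

def user_frequencies(posts):
    # For each word, count the distinct users who posted a document containing it.
    users_by_word = defaultdict(set)
    for user, text in posts:
        for word in set(text.lower().split()):
            users_by_word[word].add(user)
    return {word: len(users) for word, users in users_by_word.items()}

def inverse_user_frequencies(posts):
    # Assumed form: log(total users / users who used the word).
    total_users = len({user for user, _ in posts})
    return {
        word: math.log(total_users / count)
        for word, count in user_frequencies(posts).items()
    }

# Invented (user, document) pairs.
posts = [
    ("alice", "goal goal goal"),
    ("alice", "another goal"),
    ("bob", "late goal"),
    ("carol", "rain today"),
]
uf = user_frequencies(posts)
iuf = inverse_user_frequencies(posts)
```

Counting users rather than occurrences means that a single user who posts the same word many times contributes only once to that word's frequency.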
9. The related document extraction device according to claim 8, wherein the topic document extraction means extracts a document related to the topic using the number of characters of each word.
10. The related document extraction device according to any one of claims 1 to 9, wherein
the default topic tag storage means stores a plurality of default topic tags, each indicating one of a plurality of topics, and
the topic document extraction means excludes documents related to a plurality of topics.
11. A related document extraction method performed by a related document extraction device comprising default topic tag storage means for storing in advance a default topic tag indicating a topic and document storage means for storing a plurality of documents in advance, the method including:
a word acquisition step of dividing the documents stored by the document storage means into words;
a default document extraction step of extracting, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means;
a first appearance frequency calculation step of calculating the appearance frequency, in the document extracted in the default document extraction step, of the words divided in the word acquisition step; and
a topic document extraction step of extracting a document related to the topic from documents other than the document extracted in the default document extraction step, using the appearance frequency calculated in the first appearance frequency calculation step.
12. A related document extraction program causing a computer to function as:
default topic tag storage means for storing in advance a default topic tag indicating a topic;
document storage means for storing a plurality of documents in advance;
word acquisition means for dividing the documents stored by the document storage means into words;
default document extraction means for extracting, from the plurality of documents stored by the document storage means, a document including the default topic tag stored by the default topic tag storage means;
first appearance frequency calculation means for calculating the appearance frequency, in the document extracted by the default document extraction means, of the words divided by the word acquisition means; and
topic document extraction means for extracting a document related to the topic from documents other than the document extracted by the default document extraction means, using the appearance frequency calculated by the first appearance frequency calculation means.