JP5676552B2 - Daily word extraction apparatus, method, and program - Google Patents

Publication number
JP5676552B2
Authority
JP
Japan
Prior art keywords
character string
daily word
word
daily
equal
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2012274791A
Other languages
Japanese (ja)
Other versions
JP2014119977A (en
Inventor
九月 貞光
東中 竜一郎
齋藤 邦子
牧野 俊朗
松尾 義博
吉村 健
渉 内田
Original Assignee
日本電信電話株式会社
株式会社Nttドコモ
Application filed by 日本電信電話株式会社, 株式会社Nttドコモ
Priority to JP2012274791A
Publication of JP2014119977A
Application granted
Publication of JP5676552B2
Application status is Active

Description

  The present invention relates to a daily word extraction apparatus, method, and program, and more particularly, to a daily word extraction apparatus, method, and program for extracting daily words that appear in an input document.

  A daily word is a character string that (1) represents an event that occurs over a short period and (2) represents an event that is likely to be public information. For example, “announcement”, “marriage”, “lunch”, “byte”, and “coming” are daily word candidates because they satisfy condition (1) above. Note that “marriage” is also an ongoing state, and questions such as “Who is the husband of X?” are likely to keep being asked; nevertheless, in the present invention “marriage” is treated as a daily word candidate because it is readily used in expressions such as “marriage [announcement / ...]”.

  In addition, since “announcement” and “marriage” are public information, they satisfy the above condition (2) and become daily words. However, since “lunch”, “byte”, and “coming” are private information, they do not satisfy the above condition (2), so they are not daily words.

  As a conventional technique close to daily word detection, a burst topic detection technique is known that estimates bursting events by estimating which fields (topics) burst (Non-Patent Document 1).

"Finding Bursty Topics from Microblogs" Qiming Diao, Jing Jiang, Feida Zhu, Ee-Peng Lim, ACL 2012 http://www.mysmu.edu/faculty/jingjiang/papers/ACL'12.pdf

However, the method of Non-Patent Document 1 detects bursts and topics using a topic model and an HMM. Burst words can be detected from the topic model, but daily words cut across topics and are therefore difficult to detect with this method.

  The present invention has been made to solve the above problems, and an object thereof is to provide a daily word extraction apparatus, method, and program for extracting daily words with high accuracy.

  In order to achieve the above object, the daily word extraction device of the present invention includes storage means for storing a document composed of at least one sentence that has been subjected to morphological analysis, and extraction means. For the document stored in the storage means, the extraction means uses a regular expression indicating a character string pattern that simultaneously includes a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or number related to the character string. It extracts, as a daily word that is a character string representing an event, a character string that does not represent a date and time in a portion matching the regular expression and whose appearance frequency is equal to or higher than a predetermined threshold.

  The daily word extraction method of the present invention is a method in a daily word extraction device that includes storage means for storing a document composed of at least one sentence that has been subjected to morphological analysis, and extraction means. For the document stored in the storage means, the extraction means uses a regular expression indicating a character string pattern that simultaneously includes a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or number related to the character string. It extracts, as a daily word that is a character string representing an event, a character string that does not represent a date and time in a portion matching the regular expression and whose appearance frequency is equal to or higher than a predetermined threshold.

  Based on appearance frequency information of each character string obtained in advance for an official document, the extraction means according to the present invention can extract, as the daily word, a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or higher than the threshold, and whose appearance frequency in the official document is equal to or higher than a predetermined threshold.

  The extraction means according to the present invention can use, as the official document, appearance frequency information of each character string obtained in advance for the headline part of a newspaper and appearance frequency information of each character string obtained in advance for the body part of a newspaper. In that case it extracts, as the daily word, a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or higher than the threshold, and whose appearance frequency in the newspaper headline part is equal to or higher than a predetermined threshold or whose appearance frequency in the newspaper body part is equal to or higher than a predetermined threshold.

  The extraction means according to the present invention can also extract, as the daily word, a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or higher than the threshold, and that co-occurs with at least one of a plurality of predetermined character strings indicating search queries used in a search engine. In addition, it can extract, as the daily word, a character string that does not represent a date and time in a portion matching the regular expression, whose appearance frequency is equal to or higher than the threshold, whose appearance frequency in the newspaper headline part is equal to or higher than a predetermined threshold or whose appearance frequency in the newspaper body part is equal to or higher than a predetermined threshold, and that does not co-occur with any of a plurality of predetermined character strings used in private documents.

  According to the present invention, for a document composed of at least one sentence that has been subjected to morphological analysis, a regular expression is used that indicates a character string pattern simultaneously including a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or number related to the character string. A character string that does not represent a date and time in a portion matching the regular expression is extracted as a daily word, that is, a character string representing an event.

  In this way, daily words can be extracted with high accuracy by extracting them from portions that match a regular expression indicating a character string pattern that simultaneously includes a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or number related to the character string.

  The program of the present invention is a program for causing a computer to function as each means constituting the above-described daily word extraction device.

As described above, according to the daily word extraction apparatus, method, and program of the present invention, daily words can be extracted with high accuracy by extracting them from portions that match a regular expression indicating a character string pattern that simultaneously includes a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or number related to the character string.

FIG. 1 is a block diagram showing the functional configuration of the daily word extraction apparatus of the first embodiment.
FIG. 2 is a flowchart showing the daily word extraction processing routine in the daily word extraction apparatus of the first embodiment.
FIG. 3 is a diagram showing the position of the daily word.
FIG. 4 is a block diagram showing the functional configuration of the daily word extraction apparatus of the second embodiment.
FIG. 5 is a flowchart showing the daily word extraction processing routine in the daily word extraction apparatus of the second embodiment.
FIG. 6 is a block diagram showing the functional configuration of the daily word extraction apparatus of the third embodiment.
FIG. 7 is a flowchart showing the daily word extraction processing routine in the daily word extraction apparatus of the third embodiment.
FIG. 8 is a diagram showing the daily word extraction results in the experiment.

  Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<System configuration>
A daily word extraction apparatus according to the first embodiment of the present invention will be described. As shown in FIG. 1, the daily word extraction apparatus 100 according to the first embodiment of the present invention includes an input unit 10, a calculation unit 20 that executes a daily word extraction processing routine described later, and an output unit 50.

  The input unit 10 receives a plurality of morphologically analyzed documents from an input device such as a keyboard. Note that the input unit 10 may accept input from the outside via a network or the like.

  The calculation unit 20 is configured by a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing a daily word extraction processing routine described later. As shown in FIG. 1, this computer can be represented functionally as including a morphologically analyzed document storage unit 30, a specific expression extraction unit 31, and a daily word extraction unit 32.

  The morpheme analyzed document storage unit 30 stores a plurality of morpheme analyzed documents received by the input unit 10.

  The specific expression extraction unit 31 extracts specific expressions, that is, named entities (proper nouns such as person names, organization names, and artifact names, as well as date expressions (DATE) and time expressions (TIME)), from the plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30. A conventionally known method may be used for this extraction. In the example of the present embodiment, CaboCha is used.

  The daily word extraction unit 32 uses regular expressions indicating character string patterns that include a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or numeric expression related to the character string. For the plurality of documents stored in the morphologically analyzed document storage unit 30, it extracts, as daily word candidates, the character strings that do not represent a date and time from the portions that match a regular expression pattern. For example, using the character string pattern “[DAT] [X] is [Y]” as a regular expression, the character string X in a matching portion is extracted as a daily word candidate of a common noun. Likewise, using the character string pattern “[DAT] [X] (do | done) [Y]” as a regular expression, the character string X in a matching portion is extracted as a daily word candidate of an action noun. Only character strings X whose combined appearance frequency as a common noun and as an action noun is equal to or higher than a predetermined threshold (10 times in this embodiment) are targeted for extraction.

  Here, DAT is a character string whose part of speech is date and time, or a character string of a date expression (DATE) or time expression (TIME) extracted by the specific expression extraction unit 31. The character string X is a character string whose part of speech is other than date and time. The character string Y is a character string of a named entity (a proper noun such as a person name, an organization name, or an artifact name) or a numeric expression.
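
  The pattern matching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the tag names (`DAT`, `N`, `NE`, `NUM`, `O`), the `surface/TAG` encoding, and the function names are hypothetical, and the English words "is", "do", and "done" stand in for the patterns as rendered in this text.

```python
import re

# Hypothetical tag set (an assumption for illustration):
#   DAT = date/time expression, N = character string that is not a date/time,
#   NE = named entity, NUM = numeric expression, O = any other token.
# Each token is encoded as "surface/TAG"; surfaces are assumed to contain
# neither spaces nor slashes.
COMMON_NOUN = re.compile(r"\S+/DAT (\S+)/N is/O \S+/(?:NE|NUM)")
ACTION_NOUN = re.compile(r"\S+/DAT (\S+)/N (?:do|done)/O \S+/(?:NE|NUM)")


def encode(tokens):
    """Join (surface, tag) pairs into one string the regexes can scan."""
    return " ".join(f"{surface}/{tag}" for surface, tag in tokens)


def extract_candidates(tokens):
    """Return (X, kind) pairs for every portion matching either pattern."""
    text = encode(tokens)
    found = [(m.group(1), "common") for m in COMMON_NOUN.finditer(text)]
    found += [(m.group(1), "action") for m in ACTION_NOUN.finditer(text)]
    return found
```

  For instance, the token sequence `[("today", "DAT"), ("announcement", "N"), ("is", "O"), ("NTT", "NE")]` yields the candidate "announcement" as a common noun, while a sequence with no DAT token yields nothing.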

  Based on the extraction result of the daily word candidates, the daily word extraction unit 32 extracts, as daily words, the character strings that were extracted as daily word candidates N times (for example, 10 times) or more. The output unit 50 outputs the daily word list.

<Operation of Daily Word Extractor>

  Next, the operation of the daily word extraction device 100 according to the first embodiment of the present invention will be described. First, when a morphologically analyzed document is input by the input unit 10, it is stored in the morphologically analyzed document storage unit 30. The CPU then executes the program stored in the ROM of the daily word extraction device 100, whereby the daily word extraction processing routine shown in FIG. 2 is executed.

  First, in step S100, a plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30 are read.

  Next, in step S101, specific expression extraction is performed on the plurality of morphologically analyzed documents obtained in step S100.

  Next, in step S102, for each of the morphologically analyzed documents obtained in step S100, the character string corresponding to the character string X is extracted as a daily word candidate of a common noun from each portion that matches the regular expression of the character string pattern “[DAT] [X] is [Y]”, and the character string corresponding to the character string X is extracted as a daily word candidate of an action noun from each portion that matches the regular expression of the character string pattern “[DAT] [X] (do | done) [Y]”.

  Character strings that were extracted as daily word candidates ten times or more are then extracted as daily words.
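
  The frequency thresholding in this step might look like the sketch below. It assumes candidates arrive as hypothetical `(word, kind)` pairs and counts common-noun and action-noun matches together, following the combined frequency described for the daily word extraction unit 32; the function name is illustrative only.

```python
from collections import Counter


def select_daily_words(candidates, threshold=10):
    """Keep candidate strings matched at least `threshold` times in total.

    `candidates` is an iterable of (word, kind) pairs; matches as a common
    noun and as an action noun are counted together.
    """
    counts = Counter(word for word, _kind in candidates)
    return sorted(word for word, count in counts.items() if count >= threshold)
```

  With six common-noun matches and four action-noun matches of the same string, the string reaches the threshold of 10 and is kept; a string matched nine times is dropped.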

  Next, in step S104, the daily word extracted in step S102 is output as a daily word list by the output unit 50, and the process ends.

  As described above, according to the daily word extraction device of the first embodiment of the present invention, daily words can be extracted with high accuracy by extracting them from portions that match a regular expression indicating a character string pattern that simultaneously includes a character string that does not represent a date and time, a date and time expression related to the character string, and a named entity or number related to the character string.

  Further, because the character strings that become daily words change from day to day, daily words can be extracted with high accuracy by using a regular expression that includes both a date expression and a named entity.

  It is also possible to extract a daily word including private information such as “social gathering” and “birthday”.

  In addition, when answering a natural language question, it may be necessary to classify whether highly real-time social media or non-real-time web articles should be consulted to find the answer. By using the daily word list automatically extracted in this embodiment, a question that contains a daily word can be classified as one for which highly real-time social media should be consulted.
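
  As a rough illustration of this classification, the sketch below routes a tokenized question by checking it against the daily word list. The function name and the labels `"social_media"` and `"web_articles"` are hypothetical and chosen only for this example.

```python
def choose_source(question_tokens, daily_words):
    """Pick the resource to consult for a question (illustrative labels)."""
    if any(token in daily_words for token in question_tokens):
        return "social_media"   # a daily word suggests real-time information
    return "web_articles"       # otherwise non-real-time articles suffice
```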

  In addition, when burst words are detected from the number of appearances in articles, truly meaningful information cannot always be acquired (Non-Patent Document 2: “Topic extraction on microblog and classification of user attitudes”, Furukawa, Eiji, Yoshinaga, Kitsuregawa (DEIM2012), http://www.tkl.iis.u-tokyo.ac.jp/top/modules/newdb/extract/1176/data/DEIM2012.pdf ). Assuming that a burst word that co-occurs with a daily word is a more meaningful burst word, the daily word list automatically extracted in this embodiment can be used to extract truly meaningful burst words (FIG. 3).
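
  One way to picture this use of the daily word list is the sketch below, which keeps only burst words appearing in the same document as at least one daily word. Treating co-occurrence as "appears in the same tokenized document" is an assumption made for this example, as are the function and parameter names.

```python
def meaningful_burst_words(documents, burst_words, daily_words):
    """Return burst words that share a document with some daily word.

    `documents` is a list of token lists; `burst_words` and `daily_words`
    are sets of strings.
    """
    kept = set()
    for tokens in documents:
        token_set = set(tokens)
        if token_set & daily_words:          # document mentions a daily word
            kept |= token_set & burst_words  # keep its burst words
    return kept
```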

  Next, a second embodiment will be described. Parts having the same configuration as in the first embodiment are given the same reference numerals, and their description is omitted.

  The second embodiment is different from the first embodiment in that only character strings appearing in official documents are extracted as daily words from the daily word candidates.

<Configuration of Daily Word Extractor>
As illustrated in FIG. 4, the daily word extraction device 200 according to the second exemplary embodiment of the present invention includes an input unit 10, a calculation unit 120, and an output unit 50.

  The input unit 10 receives a plurality of morphologically analyzed documents and a plurality of morphologically analyzed newspaper data obtained by morphological analysis of newspaper documents from an input device such as a keyboard. Note that the input unit 10 may accept input from the outside via a network or the like.

  The calculation unit 120 is configured by a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing a daily word extraction processing routine described later. As shown in FIG. 4, this computer can be represented functionally as including a morphologically analyzed document storage unit 30, a specific expression extraction unit 31, a daily word candidate extraction unit 34, an initial daily word list storage unit 36, a morphologically analyzed newspaper data storage unit 38, a word frequency measurement unit 40, a frequency list storage unit 42, and a daily word extraction unit 44.

  The daily word candidate extraction unit 34 performs the same processing as the daily word extraction unit 32 of the first embodiment, and extracts daily word candidates from the document stored in the morphologically analyzed document storage unit 30. An initial daily word list composed of the extracted daily word candidates is stored in the initial daily word list storage unit 36.

  The initial daily word list storage unit 36 stores an initial daily word list made up of daily word candidates extracted by the daily word candidate extraction unit 34.

  The morphological-analyzed newspaper data storage unit 38 stores a plurality of morphological-analyzed newspaper data obtained by morphological analysis of newspaper documents received by the input unit 10.

  The word frequency measurement unit 40 divides each of the plurality of morphologically analyzed newspaper data stored in the morphologically analyzed newspaper data storage unit 38 into headline part data and body part data. From the headline part data of each of the plurality of morphologically analyzed newspaper data, it measures the appearance frequency of each character string that is a content word (for example, “byte”, “starting”) and creates a frequency list of the character strings in the headlines. Likewise, from the body part data of each of the plurality of morphologically analyzed newspaper data, it measures the appearance frequency of each character string that is a content word and creates a frequency list of the character strings in the body text. The created frequency list of character strings in the headlines and frequency list of character strings in the body text (hereinafter referred to as the word frequency list) are stored in the frequency list storage unit 42.

  The frequency list storage unit 42 stores the word frequency list input from the word frequency measurement unit 40.
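
  The construction of the word frequency list can be sketched as below. The article layout (a dict with `"headline"` and `"body"` lists of `(surface, tag)` pairs) and the tags treated as content words are assumptions for illustration; the source does not specify a data format.

```python
from collections import Counter


def build_frequency_lists(articles, content_tags=("N", "V")):
    """Count content words separately for headline parts and body parts.

    Each article is assumed to be a dict with "headline" and "body" keys,
    each holding a list of (surface, tag) pairs from morphological analysis.
    """
    headline_freq, body_freq = Counter(), Counter()
    for article in articles:
        for part, counter in (("headline", headline_freq), ("body", body_freq)):
            counter.update(
                surface for surface, tag in article[part] if tag in content_tags
            )
    return headline_freq, body_freq
```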

  For each character string of a daily word candidate included in the initial daily word list stored in the initial daily word list storage unit 36, the daily word extraction unit 44 makes the following determination. If the character string is a daily word candidate of an action noun, it determines whether the appearance frequency of the character string, obtained from the frequency list of character strings in the headlines within the word frequency list stored in the frequency list storage unit 42, is equal to or higher than a threshold (10 times in this embodiment). If the character string is a daily word candidate of a common noun, it determines whether the appearance frequency of the character string, obtained from the frequency list of character strings in the body text within the word frequency list stored in the frequency list storage unit 42, is equal to or higher than a threshold (10 times in this embodiment). The daily word extraction unit 44 extracts, as daily words, the daily word candidates whose appearance frequency is determined to be equal to or higher than the threshold, and the output unit 50 outputs them as the daily word list. On the other hand, if the appearance frequency of a daily word candidate is less than the threshold, the character string is determined not to be a daily word and is excluded.
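
  A minimal sketch of this filtering follows. It assumes candidates are hypothetical `(word, kind)` pairs with `kind` equal to `"action"` or `"common"`, and that the two frequency lists are plain mappings from string to count; none of these names come from the source.

```python
def filter_by_newspaper(candidates, headline_freq, body_freq, threshold=10):
    """Keep candidates that are frequent enough in newspaper data.

    Action-noun candidates are checked against headline frequencies and
    common-noun candidates against body-text frequencies.
    """
    daily_words = []
    for word, kind in candidates:
        freq_list = headline_freq if kind == "action" else body_freq
        if freq_list.get(word, 0) >= threshold:
            daily_words.append(word)
    return daily_words
```

  Checking action nouns against headlines and common nouns against body text mirrors the split described above; words missing from a frequency list are treated as having frequency 0 and are excluded.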

<Operation of Daily Word Extractor>
Next, the operation of the daily word extraction apparatus 200 according to the second embodiment of the present invention will be described. Processes similar to those of the first embodiment are given the same reference numerals, and their detailed description is omitted.

  First, when a plurality of morphologically analyzed newspaper data obtained by morphological analysis of newspaper documents is input by the input unit 10, the word frequency measurement unit 40 divides each of the newspaper data into headline part data and body part data. From the headline part data of each of the plurality of morphologically analyzed newspaper data, it measures the appearance frequency of each character string that is a content word and creates a frequency list of the character strings in the headlines. The word frequency measurement unit 40 also measures, from the body part data of each of the plurality of morphologically analyzed newspaper data, the appearance frequency of each character string that is a content word, creates a frequency list of the character strings in the body text, and stores the lists in the frequency list storage unit 42.

  When a plurality of morphologically analyzed documents is input by the input unit 10, the documents are stored in the morphologically analyzed document storage unit 30. The CPU then executes the program stored in the ROM of the daily word extraction device 100, whereby the daily word extraction processing routine shown in FIG. 5 is executed.

  In step S100, a plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30 are read.

  In step S200, the word frequency list stored in the frequency list storage unit 42 is read.

  In step S101, specific expression extraction is performed on the plurality of morphologically analyzed documents obtained in step S100.

  In step S204, for each of the plurality of morphologically analyzed documents obtained in step S100, the character string corresponding to the character string X is extracted as a daily word candidate of a common noun from each portion that matches the regular expression of the character string pattern “[DAT] [X] is [Y]”, and the character string corresponding to the character string X is extracted as a daily word candidate of an action noun from each portion that matches the regular expression of the character string pattern “[DAT] [X] (do | done) [Y]”. The character strings that were extracted as daily word candidates ten times or more are stored in the initial daily word list storage unit 36 as the initial daily word list.

  In step S206, for a character string of a daily word candidate included in the initial daily word list stored in the initial daily word list storage unit 36 in step S204, a determination is made based on the word frequency list obtained in step S200. If the character string is an action noun, it is determined whether its appearance frequency in the headlines is equal to or higher than the threshold. If the character string is a common noun, it is determined whether its appearance frequency in the body text is equal to or higher than the threshold. A daily word candidate whose appearance frequency is determined to be equal to or higher than the threshold is extracted as a daily word.

  In step S207, it is determined whether the daily word extraction processing in step S206 has been performed for all the character strings of the daily word candidates included in the initial daily word list stored in the initial daily word list storage unit 36. If the processing has been performed for all the daily word candidates, the process proceeds to step S104. If there is a daily word candidate that has not yet been processed, the process returns to step S206, and the above processing is repeated with that character string as the target of determination.

  In step S104, the daily word extracted in step S206 is output as a daily word list by the output unit 50, and the process ends.

  As described above, according to the daily word extraction device of the second embodiment of the present invention, the daily word candidates are extracted, and only the character strings appearing in the official document among the daily word candidates are extracted. By extracting as a daily word, a daily word can be extracted with high accuracy.

  In addition, by checking against frequency information in a highly public resource such as a newspaper, it is possible to extract a daily word while guaranteeing the publicity of a character string extracted as a daily word.

  Moreover, it is possible to avoid excessive acquisition of private information as a daily word by using the frequency information of a public document.

  Next, a third embodiment will be described. Parts having the same configuration as in the first and second embodiments are given the same reference numerals, and their description is omitted.

  The third embodiment differs from the second embodiment in that, after daily word candidates are extracted, daily words are extracted based on whether each candidate is a character string that appears in a public document, whether it co-occurs with a word used as a search query, and whether it co-occurs with a predetermined word used in private documents.

<Configuration of Daily Word Extractor>
As illustrated in FIG. 6, the daily word extraction device 300 according to the third exemplary embodiment of the present invention includes an input unit 10, a calculation unit 220, and an output unit 50.

  The input unit 10 receives, from an input device such as a keyboard, a plurality of morphologically analyzed documents, a plurality of morphologically analyzed newspaper data obtained by morphological analysis of newspaper documents, a private word list consisting of a plurality of predetermined words used in private documents, and a search query list consisting of a plurality of predetermined words indicating search queries used in a search engine. Note that the input unit 10 may accept input from the outside via a network or the like.

  The calculation unit 220 is configured by a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing a daily word extraction processing routine described later. As shown in FIG. 6, this computer can be represented functionally as including a morphologically analyzed document storage unit 30, a specific expression extraction unit 31, a daily word candidate extraction unit 34, an initial daily word list storage unit 36, a morphologically analyzed newspaper data storage unit 38, a word frequency measurement unit 40, a frequency list storage unit 42, a daily word extraction unit 44, a private word list storage unit 60, and a search query list storage unit 62.

  The private word list storage unit 60 stores the private word list received by the input unit 10.

  The search query list storage unit 62 stores the search query list received by the input unit 10.

  For each character string of a daily word candidate included in the initial daily word list stored in the initial daily word list storage unit 36, the daily word extraction unit 44 determines whether the character string co-occurs, in a morphologically analyzed document containing it, with at least one word included in the search query list stored in the search query list storage unit 62. If it is determined that they co-occur, the character string of the daily word candidate is extracted as a daily word.

When the character string of the daily word candidate does not co-occur with any character string included in the search query list, the daily word extraction unit 44 checks its appearance frequency in the word frequency list stored in the frequency list storage unit 42. If the candidate is an action noun, it is determined whether the appearance frequency obtained from the frequency list of character strings in the headlines is equal to or higher than a threshold value (10 times in this embodiment). If the candidate is a common noun, it is determined whether the appearance frequency obtained from the frequency list of character strings in the body text is equal to or higher than a threshold value (10 times in this embodiment). When the appearance frequency is determined to be equal to or higher than the threshold value, the daily word extraction unit 44 further determines, based on the private word list stored in the private word list storage unit 60, whether the candidate co-occurs with any word of the private word list in a morphologically analyzed document containing the candidate. When it is determined that the candidate does not co-occur with any word in the private word list, the character string of the daily word candidate is extracted as a daily word.
On the other hand, when the appearance frequency of the character string of the daily word candidate is less than the threshold value, the character string is determined not to be a daily word and is excluded. Likewise, if the character string of the daily word candidate is determined to co-occur with at least one word of the private word list stored in the private word list storage unit 60 in a morphologically analyzed document containing the candidate, it is determined not to be a daily word and is excluded.
The daily word extraction unit 44 repeats the above-described daily word extraction processing for every daily word candidate character string included in the initial daily word list stored in the initial daily word list storage unit 36. The extracted daily words are output by the output unit 50 as a daily word list.
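The decision order described above for the daily word extraction unit 44 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `Candidate`, `cooccurs`, and `is_daily_word`, and the representation of a morphologically analyzed document as a list of token strings, are assumptions introduced here.

```python
from dataclasses import dataclass

THRESHOLD = 10  # "10 times in this embodiment"


@dataclass(frozen=True)
class Candidate:
    text: str
    kind: str  # "action" (action noun) or "common" (common noun); labels are assumed


def cooccurs(candidate, words, documents):
    """Return True if some document contains the candidate and at least one of the words."""
    words = set(words)
    for doc in documents:  # each document is assumed to be a list of token strings
        tokens = set(doc)
        if candidate in tokens and (tokens - {candidate}) & words:
            return True
    return False


def is_daily_word(cand, documents, search_queries, private_words,
                  headline_freq, body_freq):
    # 1. Co-occurrence with a search query word: extract immediately.
    if cooccurs(cand.text, search_queries, documents):
        return True
    # 2. Frequency check: headlines for action nouns, body text for common nouns.
    freq_table = headline_freq if cand.kind == "action" else body_freq
    if freq_table.get(cand.text, 0) < THRESHOLD:
        return False
    # 3. Exclude candidates that co-occur with a private word.
    return not cooccurs(cand.text, private_words, documents)


if __name__ == "__main__":
    docs = [["announcement", "X", "husband"], ["lunch", "tired"], ["marriage", "today"]]
    print(is_daily_word(Candidate("marriage", "action"), docs,
                        {"husband"}, {"tired"}, {"marriage": 12}, {"lunch": 30}))
```

The three checks are kept in the same order as the text, so a candidate that co-occurs with a search query word is accepted without consulting the frequency list or the private word list.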

<Operation of Daily Word Extractor>
Next, the operation of the daily word extraction apparatus 300 according to the third embodiment of the present invention will be described. Processes similar to those of the first and second embodiments are denoted by the same reference numerals, and detailed description thereof is omitted.

  First, when a plurality of morphologically analyzed newspaper data items, obtained by morphological analysis of newspaper documents, are input by the input unit 10, each newspaper data item is divided into headline data and body text data. The appearance frequency of each character string that is a content word is measured from the headline data of the morphologically analyzed newspaper data, and a frequency list of the character strings in the headlines is created. Likewise, the appearance frequency of each character string that is a content word is measured from the body text data of the morphologically analyzed newspaper data, a frequency list of the character strings in the body text is created, and both lists are stored in the frequency list storage unit 42.
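The creation of the two frequency lists can be illustrated with a short Python sketch. The article layout (a dict with `headline` and `body` lists of `(surface, part_of_speech)` pairs) and the set of part-of-speech labels treated as content words are assumptions made for this example; the text does not specify them.

```python
from collections import Counter

# Assumed part-of-speech labels that count as content words.
CONTENT_POS = {"noun", "verb", "adjective"}


def build_frequency_lists(articles):
    """Count content-word character strings separately for headlines and body text."""
    headline_freq = Counter()
    body_freq = Counter()
    for article in articles:
        headline_freq.update(s for s, pos in article["headline"] if pos in CONTENT_POS)
        body_freq.update(s for s, pos in article["body"] if pos in CONTENT_POS)
    return headline_freq, body_freq


if __name__ == "__main__":
    sample = [{
        "headline": [("marriage", "noun"), ("of", "particle")],
        "body": [("marriage", "noun"), ("announce", "verb"), ("marriage", "noun")],
    }]
    headline_freq, body_freq = build_frequency_lists(sample)
    print(headline_freq, body_freq)
```

Keeping the headline and body counts in separate tables matches the later steps, which consult one list or the other depending on whether the candidate is an action noun or a common noun.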

  When a private word list is input by the input unit 10, the input private word list is stored in the private word list storage unit 60.

  When a search query list is input by the input unit 10, the input search query list is stored in the search query list storage unit 62.

  When a morphologically analyzed document is input by the input unit 10, it is stored in the morphologically analyzed document storage unit 30, and the CPU executes the program stored in the ROM of the daily word extraction device 300, whereby the daily word extraction processing routine shown in FIG. 7 is executed.

  In step S100, a plurality of morphologically analyzed documents stored in the morphologically analyzed document storage unit 30 are read.

  In step S200, the word frequency list stored in the frequency list storage unit 42 is read.

  In step S300, the private word list stored in the private word list storage unit 60 is read.

  In step S302, the search query list stored in the search query list storage unit 62 is read.

  In step S101, specific expression extraction is performed on the plurality of morphologically analyzed documents obtained in step S100.

  In step S204, for each of the plurality of morphologically analyzed documents obtained in step S100, the character string corresponding to X is extracted from each portion that matches the regular expression of the character string pattern “[DAT] [X] is [Y]” as a daily word candidate that is a common noun, and the character string corresponding to X is extracted from each portion that matches the regular expression of the character string pattern “[DAT] [X] (do |) [Y]” as a daily word candidate that is an action noun. Of the character strings extracted as daily word candidates, those extracted ten times or more are stored in the initial daily word list storage unit 36 as the initial daily word list.
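A rough Python sketch of the candidate extraction in step S204 is given below. The original patterns were written over Japanese text, and the connecting words are not recoverable from this translation, so the sketch makes several assumptions: tokens are serialized as `surface/TAG`, the tags `DAT`, `N`, `VN`, `NE`, `NUM`, and `LINK` are hypothetical, and the English glosses `is` and `do` stand in for the original connecting words.

```python
import re
from collections import Counter

MIN_MATCHES = 10  # candidates must be extracted ten times or more

# "[DAT] [X] is [Y]": X is taken as a common-noun candidate (assumed tag N).
COMMON_NOUN = re.compile(r"(?<!\S)\S+/DAT (\S+)/N is/LINK \S+/(?:NE|NUM)(?!\S)")
# "[DAT] [X] (do |) [Y]": X is taken as an action-noun candidate (assumed tag VN).
ACTION_NOUN = re.compile(r"(?<!\S)\S+/DAT (\S+)/VN (?:do/LINK )?\S+/(?:NE|NUM)(?!\S)")


def serialize(tokens):
    """Turn [(surface, tag), ...] into a 'surface/TAG surface/TAG ...' string."""
    return " ".join(f"{surface}/{tag}" for surface, tag in tokens)


def initial_daily_word_list(documents, min_matches=MIN_MATCHES):
    """Return the set of (character string, kind) pairs matched at least min_matches times."""
    counts = Counter()
    for tokens in documents:
        text = serialize(tokens)
        for m in COMMON_NOUN.finditer(text):
            counts[(m.group(1), "common")] += 1
        for m in ACTION_NOUN.finditer(text):
            counts[(m.group(1), "action")] += 1
    return {cand for cand, n in counts.items() if n >= min_matches}


if __name__ == "__main__":
    doc = [("today", "DAT"), ("lunch", "N"), ("is", "LINK"), ("ramen", "NE")]
    print(initial_daily_word_list([doc] * 10))
```

Counting matches per `(string, kind)` pair keeps the common-noun and action-noun readings of the same string separate, which the later frequency check relies on.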

  In step S304, for a daily word candidate character string included in the initial daily word list stored in the initial daily word list storage unit 36 in step S204, it is determined, based on the search query list acquired in step S302, whether the candidate co-occurs with a character string that is at least one word included in the search query list in a morphologically analyzed document containing the candidate. A candidate determined to co-occur is extracted as a daily word.

  If the candidate does not co-occur with any character string that is a word included in the search query list, it is determined, based on the word frequency list obtained in step S200, whether the appearance frequency of the candidate's character string in the headlines or in the body text is equal to or greater than a threshold value.

  If it is determined that the appearance frequency is equal to or greater than the threshold, it is determined, based on the private word list acquired in step S300, whether the character string of the daily word candidate co-occurs with any word included in the private word list. A daily word candidate determined not to co-occur with any such word is extracted as a daily word.

  In step S207, it is determined whether or not the daily word extraction processing in step S304 has been performed for all the character strings of the daily word candidates included in the initial daily word list stored in the initial daily word list storage unit 36. If the daily word extraction process has been performed for all the daily word candidates, the process proceeds to step S104. If there is a character string of a daily word candidate that has not been subjected to the daily word extraction process, the process proceeds to step S304. Then, the above process is repeated using the character string as a character string to be determined.

  In step S104, the daily word extracted in step S304 is output as a daily word list by the output unit 50, and the process ends.

  As described above, the daily word extraction device according to the third embodiment of the present invention extracts daily word candidates and then extracts daily words based on whether each candidate character string appears in an official document, whether it co-occurs with a word used as a search query, and whether it co-occurs with a word used in private documents, so that daily words can be extracted with high accuracy.

<Experimental example>
In order to verify the effect of the present invention, an experiment was performed to extract daily words using the method described in the second embodiment of the present invention. The results of daily word extraction in the experiment are shown in FIG. 8. In this experiment, blog data (35,294,684 articles) was used as the morphologically analyzed documents, and newspaper data (60000 sentences) was used as the official documents. In FIG. 8, a character string with a strikethrough is one that was excessively removed based on the frequency information, and an underlined character string is one that should have been removed but was extracted as a daily word without being removed based on the frequency information. From the experimental results shown in FIG. 8, it was found that daily words were extracted with high accuracy.

  In each entry, the first column is the number of times the character string matched the regular expression as a daily word candidate in the input blog articles, the second column is the frequency in the newspaper body text, and the third column is the frequency in the newspaper headlines.

  The present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the gist of the present invention.

  For example, in the above-described second and third embodiments, the calculation units 120 and 220 create the word frequency list by inputting the morphologically analyzed data of the newspaper document, but the present invention is not limited to this. A previously created word frequency list may be input and used.

  In the third embodiment, the case where a character string that co-occurs with at least one word in the search query list is extracted as a daily word and a character string that co-occurs with at least one word in the private word list is excluded from the daily words has been described. However, the present invention is not limited to this. For example, a character string that co-occurs with at least one word in the search query list may be extracted as a daily word, and a character string having an appearance frequency equal to or higher than a threshold value may be extracted as a daily word based on frequency information in newspaper data. Further, without considering co-occurrence with the search query list, a character string whose appearance frequency is equal to or higher than a threshold based on frequency information in newspaper data and which does not co-occur with any word in the private word list may be extracted as a daily word.

  In the third embodiment, the case where it is determined whether at least one word in the search query list or the private word list co-occurs with the character string of the daily word candidate in a document containing that character string has been described. However, the present invention is not limited to this, and it may instead be determined whether they co-occur in a sentence containing the character string.
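The difference between document-level and sentence-level co-occurrence can be sketched as follows. The function name `cooccurs_in_unit`, the `unit` parameter, and the representation of a document as a list of sentences (each a list of token strings) are assumptions introduced for illustration.

```python
def cooccurs_in_unit(candidate, words, documents, unit="document"):
    """Check co-occurrence of the candidate with any of the words.

    documents: list of documents, each a list of sentences, each a list of tokens.
    unit: "document" compares within a whole document, "sentence" within one sentence.
    """
    words = set(words)
    for sentences in documents:
        if unit == "document":
            scopes = [[tok for sent in sentences for tok in sent]]
        elif unit == "sentence":
            scopes = sentences
        else:
            raise ValueError(f"unknown unit: {unit!r}")
        for scope in scopes:
            tokens = set(scope)
            if candidate in tokens and (tokens - {candidate}) & words:
                return True
    return False


if __name__ == "__main__":
    docs = [[["lunch", "today"], ["so", "tired"]]]
    print(cooccurs_in_unit("lunch", {"tired"}, docs, unit="document"))
    print(cooccurs_in_unit("lunch", {"tired"}, docs, unit="sentence"))
```

Sentence-level matching is the stricter of the two: a private word that appears elsewhere in the same document no longer excludes the candidate.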

  In the embodiments of the present invention, the case where daily words are extracted using “[DAT] [X] is [Y]” and “[DAT] is [X] (do | done) [Y]” as regular expressions has been described. However, the present invention is not limited to this, and any regular expression may be used that indicates a character string pattern simultaneously containing a character string that does not represent a date and time, a date and time expression related to the character string, and a specific expression or number related to the character string.

  The above-described daily word extraction devices 100, 200, and 300 each have a computer system inside. The “computer system” also includes a homepage providing environment (or display environment) if a WWW system is used.

  Further, in the present specification, the embodiment has been described in which the program is installed in advance. However, the program can also be provided stored in a computer-readable recording medium, or provided via a network. Moreover, each unit of the daily word extraction devices 100, 200, and 300 of the present embodiments may be configured by hardware. The databases storing the morphologically analyzed newspaper data, the frequency lists, the morphologically analyzed documents, the private word list, and the search query list can be realized by storage means such as a hard disk device or a file server. These databases may be provided in the daily word extraction devices 100, 200, and 300, or may be provided in an external device.

DESCRIPTION OF SYMBOLS
10 Input part
20, 120, 220 Arithmetic unit
30 Morphological analysis document storage part
31 Specific expression extraction part
32, 44 Daily word extraction part
34 Daily word candidate extraction part
36 Initial daily word list storage part
38 Morphological analysis newspaper data storage part
40 Word frequency measurement unit
42 Frequency list storage unit
50 Output unit
60 Private word list storage unit
62 Search query list storage unit
100, 200, 300 Daily word extraction device

Claims (9)

  1. A daily word extraction device comprising:
    storage means for storing a document composed of at least one sentence that has undergone morphological analysis; and
    extraction means for extracting, from the document stored in the storage means, using a regular expression indicating a character string pattern that includes a character string that does not represent a date and time, a date expression related to the character string, and a unique expression or a number related to the character string, a character string that does not represent the date and time in a portion that matches the regular expression and whose appearance frequency is equal to or higher than a predetermined threshold, as a daily word that is a character string representing an event.
  2. The daily word extraction device, wherein the extraction means extracts, as the daily word, based on appearance frequency information of each character string obtained in advance for an official document, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or greater than the threshold, and whose appearance frequency in the official document is equal to or greater than a predetermined threshold.
  3. The daily word extraction device according to claim 2, wherein the extraction means extracts, as the daily word, based on appearance frequency information of each character string obtained in advance for the headline portion of a newspaper serving as the official document and appearance frequency information of each character string obtained in advance for the body text portion of the newspaper, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or greater than the threshold, and whose appearance frequency in the headline portion of the newspaper is equal to or greater than a predetermined threshold or whose appearance frequency in the body text portion of the newspaper is equal to or greater than a predetermined threshold.
  4. The daily word extraction device according to claim 3, wherein the extraction means extracts, as the daily word, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or higher than the threshold, and that co-occurs with at least one of a plurality of predetermined character strings indicating search queries used for a search engine, and also extracts, as the daily word, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or higher than the threshold, whose appearance frequency in the headline portion of the newspaper is equal to or higher than a predetermined threshold or whose appearance frequency in the body text portion of the newspaper is equal to or higher than a predetermined threshold, and that does not co-occur with any of a plurality of predetermined character strings used in private documents.
  5. A daily word extraction method in a daily word extraction device comprising storage means for storing a document composed of at least one sentence that has undergone morphological analysis, and extraction means, the method comprising:
    extracting, by the extraction means, from the document stored in the storage means, using a regular expression indicating a character string pattern that includes a character string that does not represent a date and time, a date and time expression related to the character string, and a unique expression or a number related to the character string, a character string that does not represent the date and time in a portion that matches the regular expression and whose appearance frequency is equal to or higher than a predetermined threshold, as a daily word that is a character string representing an event.
  6. The daily word extraction method, wherein the extraction means extracts, as the daily word, based on appearance frequency information of each character string obtained in advance for an official document, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or greater than the threshold, and whose appearance frequency in the official document is equal to or greater than a predetermined threshold.
  7. The daily word extraction method according to claim 6, wherein the extraction means extracts, as the daily word, based on appearance frequency information of each character string obtained in advance for the headline portion of a newspaper serving as the official document and appearance frequency information of each character string obtained in advance for the body text portion of the newspaper, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or greater than the threshold, and whose appearance frequency in the headline portion of the newspaper is equal to or greater than a predetermined threshold or whose appearance frequency in the body text portion of the newspaper is equal to or greater than a predetermined threshold.
  8. The daily word extraction method according to claim 7, wherein the extraction means extracts, as the daily word, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or higher than the threshold, and that co-occurs with at least one of a plurality of predetermined character strings indicating search queries used for a search engine, and also extracts, as the daily word, a character string that does not represent the date and time in a portion that matches the regular expression, whose appearance frequency is equal to or higher than the threshold, whose appearance frequency in the headline portion of the newspaper is equal to or higher than a predetermined threshold or whose appearance frequency in the body text portion of the newspaper is equal to or higher than a predetermined threshold, and that does not co-occur with any of a plurality of predetermined character strings used in private documents.
  9. A program for causing a computer to function as each means constituting the daily word extraction device according to any one of claims 1 to 4.
JP2012274791A 2012-12-17 2012-12-17 Daily word extraction apparatus, method, and program Active JP5676552B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2012274791A JP5676552B2 (en) 2012-12-17 2012-12-17 Daily word extraction apparatus, method, and program

Publications (2)

Publication Number Publication Date
JP2014119977A JP2014119977A (en) 2014-06-30
JP5676552B2 true JP5676552B2 (en) 2015-02-25

Family

ID=51174761

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2012274791A Active JP5676552B2 (en) 2012-12-17 2012-12-17 Daily word extraction apparatus, method, and program

Country Status (1)

Country Link
JP (1) JP5676552B2 (en)

Also Published As

Publication number Publication date
JP2014119977A (en) 2014-06-30

Legal Events

Date Code Title Description
A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20141015

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20141202

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20141225

R150 Certificate of patent or registration of utility model

Ref document number: 5676552

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250