WO2006048998A1 - Keyword Extraction Device - Google Patents
- Publication number
- WO2006048998A1, PCT/JP2005/018712 (JP2005018712W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document group
- group
- document
- index word
- calculating
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Definitions
- The present invention relates to a technique for automatically extracting, by computer, from a document group consisting of a plurality of documents, keywords representing the subject of the document group, and in particular to a keyword extraction device, extraction method, and extraction program.
- Patent documents, technical documents, and other documents are newly created every day, and their number is enormous.
- Techniques for automatically extracting keywords representing the characteristics of a document are therefore known.
- Non-Patent Document 1 discloses a method of extracting keywords representing the assertion of a document.
- First, words with a high number of appearances in the document (HighFreq) are extracted.
- Next, the co-occurrence degree within the document is calculated based on the presence or absence of co-occurrence of HighFreq words in sentence units, and a combination of HighFreq words with a high co-occurrence degree is defined as a "base" (foundation); HighFreq words with low co-occurrence belong to different foundations.
- Then, for each word, the co-occurrence degree with the words in each foundation is calculated in sentence units based on the presence or absence of co-occurrence, and, based on these co-occurrence degrees, words ("roofs") that integrate the sentences supported by these foundations are extracted.
- Non-Patent Document 1: Yukio Ohsawa et al., "KeyGraph: Extracting keywords by dividing and integrating co-occurrence graphs of words", IEICE Transactions, Vol. J82-D-I, No. 2, pp. 391-400 (February 1999).
- Disclosure of the Invention
- However, Non-Patent Document 1 does not extract keywords representing the characteristics of a document group composed of a plurality of documents.
- Moreover, the technique described in Non-Patent Document 1 is premised on a document written to assert its author's own idea, in which a flow of argument is formed; it therefore cannot be applied to a document group consisting of multiple mutually independent documents.
- An object of the present invention is to provide a keyword extraction device, an extraction method, and an extraction program capable of automatically extracting a keyword representing a feature of a document group including a plurality of documents.
- Another object of the present invention is to automatically extract, from a plurality of viewpoints, keywords representing the characteristics of a document group consisting of a plurality of documents, so that the characteristics of the document group can be understood three-dimensionally.
- a keyword extraction device of the present invention is a device for extracting a keyword from a document group consisting of a plurality of documents, and includes the following means. That is,
- Index word extraction means for extracting index words from the data of the document group
- a high-frequency word extracting means that calculates, for each index word, a weight including an evaluation of high appearance frequency in the document group, and extracts as high-frequency words the index words having large weights;
- a high-frequency word/index word co-occurrence degree calculating means that calculates the co-occurrence degree in the document group of each high-frequency word with each index word, based on the presence or absence of their co-occurrence in document units;
- Clustering means for classifying the high-frequency words based on the calculated co-occurrence degrees and generating clusters;
- Score calculating means for calculating a score for each index word; and keyword extracting means for extracting keywords based on the calculated scores.
- According to this configuration, high-frequency words are classified into clusters based on co-occurrence degrees computed from the presence or absence of co-occurrence, in document units, with each index word in the document group; index words that co-occur, in more documents, with high-frequency words belonging to more clusters are then evaluated highly, so keywords that accurately represent the characteristics of the document group can be extracted.
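The document-unit co-occurrence counting described above can be sketched as follows (a minimal illustration; the function name and example data are invented, and any per-document weighting is omitted):

```python
# Minimal sketch of document-unit co-occurrence: the co-occurrence degree of
# index word w and high-frequency word w' is the number of documents in which
# both appear (presence/absence only). Names and data are illustrative.

def cooccurrence(docs, w, w_prime):
    """Count documents containing both w and w_prime."""
    return sum(1 for doc in docs if w in doc and w_prime in doc)

docs = [
    {"keyword", "extraction", "document"},
    {"keyword", "cluster"},
    {"extraction", "cluster", "document"},
]
print(cooccurrence(docs, "keyword", "extraction"))   # 1
print(cooccurrence(docs, "document", "extraction"))  # 2
```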
- High-frequency words are extracted by calculating, for each extracted index word, a weight that includes an evaluation of its high appearance frequency in the document group, and taking a predetermined number of words in descending order of weight.
- The weight may be GF(E) (described later), which indicates the appearance frequency in the document group, or a function value that includes GF(E) as a variable.
- For each high-frequency word, a p-dimensional vector is created whose components are the co-occurrence degrees with each of the p index words. Clustering is then performed by the clustering means based on the degree of similarity (similarity or dissimilarity) between these p-dimensional vectors.
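The p-dimensional co-occurrence vectors and their comparison might look like this (an illustrative sketch; the word names and vectors are made up, and cosine is only one of the similarity measures the text allows):

```python
import math

# Each high-frequency word is represented by a p-dimensional vector of its
# co-occurrence degrees with the p index words; pairs are compared by cosine
# similarity before clustering. All numbers here are invented.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# co-occurrence vectors over p = 4 index words
vec = {
    "retrieval": [3, 0, 2, 1],
    "search":    [2, 0, 3, 1],
    "molding":   [0, 4, 0, 2],
}
print(cosine(vec["retrieval"], vec["search"]))   # high: same-cluster candidates
print(cosine(vec["retrieval"], vec["molding"]))  # low: different clusters
```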
- As a method of highly evaluating index words that co-occur with more high-frequency words belonging to more clusters, it is conceivable, for example, to use as the score of each index word a value derived from a polynomial that includes, over all clusters (the "foundations" described later), the product of the co-occurrence degrees between the index word and the high-frequency words within each cluster (the index word-base co-occurrence degree, described later).
- Alternatively, a function value that includes as a variable the index word-base co-occurrence degree Co(w, g) (described later; document-unit co-occurrence counted as 1 or 0, or given a weight), or the normalized index word-base co-occurrence degree Co′(w, g) (described later), may be used as the score of each index word.
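A minimal sketch of the index word-base co-occurrence degree Co(w, g) as defined later (the sum of C(w, w′) over the high-frequency words w′ of base g, excluding w itself; the co-occurrence values here are invented):

```python
# Co(w, g): sum of the document-unit co-occurrence degrees C(w, w') over the
# high-frequency words w' belonging to base g, excluding w itself. C is a
# plain dict of invented co-occurrence counts for illustration.

C = {("w", "a"): 2, ("w", "b"): 1, ("w", "c"): 0}

def co_with_base(w, base):
    return sum(C.get((w, hf), 0) for hf in base if hf != w)

print(co_with_base("w", {"a", "b", "c"}))  # 3
print(co_with_base("w", {"a", "w"}))       # 2 (w itself is excluded)
```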
- It is desirable that the score calculated by the score calculating means for each index word be higher for index words whose appearance frequency is lower in a larger document group that includes documents other than the analysis-target document group. This makes it possible to evaluate highly, and extract as keywords, index words unique to the document group to be analyzed.
- An example of this appearance frequency is DF(P), described later.
- For example, the reciprocal of DF(P), the reciprocal of DF(P) multiplied by the number of documents in the document group, or the logarithm of either of these can be added to or multiplied by the score that highly evaluates index words co-occurring, in more documents, with high-frequency words belonging to more clusters.
- Skey(w), described later, is an example of a score that highly evaluates index words with a low DF(P).
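One plausible way to combine these factors into a Skey(w)-style score, under the description above (an assumption for illustration, not the patent's exact formula):

```python
import math

# Hypothetical combination: boost the base score key(w) by GF(E), the word's
# count in the analysis-target group, and by the logarithm of the reciprocal
# document frequency DF(P) in a larger corpus of n_docs documents.

def skey(key_w, gf_e, df_p, n_docs):
    return key_w * gf_e * math.log(n_docs / df_p)

# a word that is frequent in the target group (GF(E)=10) but appears in only
# 5 of 500 documents of the larger corpus scores high
print(round(skey(key_w=3.0, gf_e=10, df_p=5, n_docs=500), 2))  # 138.16
```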
- It is also desirable that the score calculated by the score calculation means for each index word highly evaluate index words with a higher appearance frequency in the analysis-target document group.
- An example of this appearance frequency is GF(E), described later.
- For example, GF(E) can be multiplied by, or added to, the score that highly evaluates index words co-occurring, in more documents, with high-frequency words belonging to more clusters.
- Skey(w), described later, is an example of a score that highly evaluates index words with a high GF(E).
- The keyword extracting means may determine the number of keywords to extract based on the appearance frequency, in the document group, of the index words highly evaluated by the score calculating means.
- the frequency of appearance in the document group here is, for example, DF (E) described later.
- The keyword extracting means is preferably configured to extract the determined number of keywords based on the appearance rate of each word in the titles of the documents belonging to the document group. As a result, keywords that accurately represent the contents of the document group can be extracted.
- Evaluation value calculating means for calculating, for each index word, an evaluation value in each document group of a document group group that includes the analysis-target document group and other document groups, and
- concentration degree calculating means for calculating the degree of concentration of each index word's distribution over the document group group, obtained by calculating, for each index word, the sum of its evaluation values over all document groups belonging to the document group group, the ratio of the evaluation value in each document group to that sum, the square of each ratio, and the sum of those squares over all document groups, may further be provided.
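The concentration degree just described can be sketched directly: for one index word, take each document group's share of the word's total evaluation value and sum the squared shares (a Herfindahl-style measure; the numbers are invented):

```python
# Concentration degree of one index word's distribution over a document group
# group: sum of squared per-group shares of its evaluation values.

def concentration(values):
    total = sum(values)
    return sum((v / total) ** 2 for v in values)

print(concentration([10, 0, 0, 0]))  # 1.0  (fully concentrated in one group)
print(concentration([5, 5, 5, 5]))   # 0.25 (evenly dispersed over 4 groups)
```

A low concentration degree marks a word spread across the whole document group group, which is what lets it be positioned as a broad technical-area word.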
- The keyword extracting means may then extract keywords by adding the concentration degree calculated by the concentration degree calculating means to the evaluation, in addition to the score calculated by the score calculating means for the analysis-target document group.
- Words with a high score by the score calculation means but a low concentration degree by the concentration degree calculation means are dispersed throughout the document group group, and can therefore be positioned as words that broadly capture the technical area to which the analysis-target document group belongs.
- The individual document groups in this case can be obtained, for example, by clustering a document group group.
- Evaluation value calculating means for calculating, for each index word, an evaluation value in each document group of a document group group that includes the analysis-target document group and other document groups, and
- share calculating means for calculating the share of each index word in the analysis-target document group, obtained by calculating, for each index word, the sum of its evaluation values over all document groups belonging to the document group group and the ratio of its evaluation value in the analysis-target document group to that sum, may further be provided. It is then desirable that the keyword extracting means extract keywords by adding this share to the evaluation, in addition to the score calculated by the score calculating means for the analysis-target document group.
- Words with both a high score by the score calculation means and a high share by the share calculation means have a larger share in the analysis-target document group than other words, and can therefore be positioned as words that explain the analysis-target document group well (main words).
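Under one reading of the share described above (the fraction of an index word's total evaluation value that falls in the analysis-target group), a sketch with invented numbers:

```python
# Share of one index word in the analysis-target group: its evaluation value
# there divided by the sum of its evaluation values over all document groups
# of the group group. Interpretation and numbers are illustrative.

def share(target_value, values_per_group):
    return target_value / sum(values_per_group)

# evaluation values of one index word in groups [target, other1, other2]
print(share(8, [8, 1, 1]))  # 0.8: mostly in the target group -> main word
```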
- A first reciprocal calculating means for calculating, for each index word, a function value of the reciprocal of its appearance frequency in a document group group including the analysis-target document group and other document groups,
- a second reciprocal calculating means for calculating, for each index word, a function value of the reciprocal of its appearance frequency in a larger document group including the document group group, and
- originality calculating means for calculating the originality of each index word in the document group group as a function value obtained by subtracting the calculation result of the second reciprocal calculating means from that of the first reciprocal calculating means, may further be provided.
- The keyword extraction means may then extract keywords by adding the originality calculated by the originality calculation means to the evaluation, in addition to the score calculated by the score calculation means for the analysis-target document group.
- a large value of the reciprocal of the appearance frequency in the document group group means that the word is rare in this document group group.
- By also subtracting the function value of the reciprocal of the appearance frequency in the larger document group including the document group group, words that are frequently used in other fields but rarely used in this document group group can be identified; the use of such words in this field can be said to be original.
- a word having a high score by the score calculating means and a high originality by the originality calculating means can be regarded as a word representing an original viewpoint in the field.
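The originality measure can be sketched as a difference of IDF-style values, assuming a logarithmic function value for the reciprocal frequencies (an assumption consistent with the IDF mentioned below; the counts are invented):

```python
import math

# Originality: IDF of a word within the document group group S minus its IDF
# within the larger document group P. A word rare in S (high IDF_S) but
# common at large (low IDF_P) scores high, i.e. its use in this field is
# original. All counts are illustrative.

def idf(df, n):
    return math.log(n / df)

def originality(df_in_s, n_s, df_in_p, n_p):
    return idf(df_in_s, n_s) - idf(df_in_p, n_p)

# rare in the field (3 of 300 docs) yet common overall (50,000 of 1,000,000)
print(round(originality(3, 300, 50_000, 1_000_000), 2))  # 1.61
```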
- As the function value of the reciprocal of the appearance frequency, an IDF (inverse document frequency) can be used, for example.
- Another keyword extraction device of the present invention is a device for extracting keywords from a document group consisting of a plurality of documents, and includes the following means. That is,
- Index word extraction means for extracting an index word from data of a document group group including the document group to be analyzed and another document group;
- An evaluation value calculating means for calculating, for each index word, an evaluation value in each document group of the document group group; concentration degree calculating means for calculating the degree of concentration of each index word's distribution over the document group group, obtained by calculating, for each index word, the sum of its evaluation values over all document groups belonging to the document group group, the ratio of the evaluation value in each document group to that sum, the square of each ratio, and the sum of those squares over all document groups; share calculating means for calculating the share of each index word in the analysis-target document group; and
- keyword extracting means for extracting keywords based on a combination of the concentration degree calculated by the concentration degree calculating means and the share calculated by the share calculating means for the analysis-target document group.
- Words with a low sum of squares calculated by the concentration degree calculating means are distributed across multiple document groups, and can therefore be positioned as words that broadly capture the technical area to which the analysis-target document group belongs.
- Words with a high ratio calculated by the share calculating means have a high share in the analysis-target document group, and can therefore be positioned as words that explain the analysis-target document group well (main words).
- In this way, keywords can be categorized from these two viewpoints, and the characteristics of the document group can be understood three-dimensionally.
- a first reciprocal calculating means for calculating a function value of the reciprocal of the appearance frequency in the document group group for each index word
- a second reciprocal calculating means for calculating a function value of the reciprocal of the appearance frequency in the large document group including the document group group for each index word;
- Originality calculating means for calculating, as the originality, a function value obtained by subtracting the calculation result of the second reciprocal calculating means from that of the first may further be provided, and the keyword extracting means further extracts keywords based on a combination with the calculated originality.
- In this way, keywords can be categorized from three viewpoints, so that the characteristics of the document group can be understood three-dimensionally.
- the keyword extraction device of the present invention is a device for extracting a keyword from a document group consisting of a plurality of documents, and includes the following means. That is,
- Index word extraction means for extracting an index word from data of a document group group including the document group to be analyzed and another document group;
- Appearance frequency calculating means for calculating a function value of the appearance frequency in the analysis target document group for each index word
- This makes it possible to extract keywords representing the characteristics of a document group consisting of a plurality of documents, and to understand the characteristics of the document group three-dimensionally.
- Since keywords are categorized and extracted based on a combination of at least two of the concentration degree calculated by the concentration degree calculating means, the share calculated by the share calculating means, the originality calculated by the originality calculating means, and the function value of the appearance frequency calculated by the appearance frequency calculating means, the characteristics of the document group can be understood three-dimensionally.
- the keyword extracting means includes
- An index word whose function value of appearance frequency in the analysis-target document group is equal to or greater than a predetermined threshold is determined to be an important word in the analysis-target document group;
- an index word whose concentration degree is equal to or lower than a predetermined threshold is determined to be a technical-domain word in the analysis-target document group;
- an index word whose share in the analysis-target document group is equal to or greater than a predetermined threshold is determined to be a main word in the analysis-target document group; and
- an index word whose originality is equal to or greater than a predetermined threshold is determined to be a unique word in the analysis-target document group.
- It is desirable that the function value of the reciprocal of the appearance frequency in the document group group be the inverse document frequency (IDF) in the document group group, standardized over all index words of the analysis-target document group, and
- that the function value of the reciprocal of the appearance frequency in the larger document group including the document group group be the inverse document frequency (IDF) in that larger document group, standardized over all index words of the analysis-target document group.
- The present invention also provides a keyword extraction method including the same steps as those executed by each of the above devices, and a keyword extraction program that causes a computer to execute the same processing as that executed by each of the above devices.
- This program can be recorded on a recording medium such as an FD, CDROM, or DVD, or it can be sent and received over a network.
- FIG. 1 is a diagram showing a hardware configuration of a keyword extraction device according to the first embodiment of the present invention.
- FIG. 2 is a diagram for explaining in detail the configuration and functions of the keyword extraction device of the first embodiment.
- FIG. 3 is a flowchart showing an operation procedure of the processing device 1 in the keyword extracting device of the first embodiment.
- FIG. 4 is a diagram for explaining in detail the configuration and functions of the keyword extraction device according to the second embodiment of the present invention.
- FIG. 5 is a flowchart showing an operation procedure of the processing device 1 in the keyword extracting device of the second embodiment.
- FIG. 6 is a reference diagram showing an example in which keywords extracted by the keyword extracting device of the present invention are entered in a document correlation diagram showing the relationship between documents.
- FIG. 7 is a diagram for explaining in detail the configuration and functions of a keyword extraction device according to a third embodiment of the present invention.
- FIG. 8 is a flowchart showing an operation procedure of the processing device 1 in the keyword extracting device of the third embodiment.
- Explanation of symbols
- 20 Index word extraction section (index word extraction means)
- 30 High-frequency word extraction section (high-frequency word extraction means)
- 40 High-frequency word/index word co-occurrence degree calculation section (high-frequency word/index word co-occurrence degree calculation means)
- 50 Clustering section (clustering means)
- 60 Index word-base co-occurrence degree calculation section (index word-base co-occurrence degree calculation means)
- 70 key(w) calculation section (score calculation means)
- 80 Skey(w) calculation section (score calculation means)
- 90 Keyword extraction section (keyword extraction means)
- Label extraction section (label extraction means)
- Similarity: the similarity or dissimilarity between objects to be compared. Each object is expressed as a vector, and the similarity can be expressed using a product function of vector components, such as the cosine or Tanimoto coefficient between vectors (examples of similarity), or using a function of the differences between vector components, such as the distance between vectors (an example of dissimilarity).
- Index word: a word cut out from all or part of a document. There is no particular restriction on how words are extracted; conventionally known methods may be used, such as extracting meaningful parts of speech while excluding particles and conjunctions (for Japanese documents, for example, with commercially available morphological analysis software). Alternatively, an index word dictionary (thesaurus) database may be prepared in advance, and index words obtained from that database may be used.
- High-frequency words: a predetermined number of index words whose weight, which includes an evaluation of the appearance frequency in the analysis-target document group, is high. For example, GF(E) (described later), or a function value including GF(E) as a variable, is calculated as the weight of each index word, and a predetermined number of words with large values are extracted.
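High-frequency word selection by GF(E) can be sketched as follows (using the raw count as the weight; the GF(E)*IDF(P) variant mentioned above would further multiply in an inverse document frequency; the data is illustrative):

```python
from collections import Counter

# Select the k index words with the highest GF(E), i.e. total number of
# occurrences across the analysis-target document group E. Example documents
# are invented.

def top_high_freq_words(docs, k):
    gf = Counter()
    for doc in docs:  # each doc is a list of index words
        gf.update(doc)
    return [w for w, _ in gf.most_common(k)]

docs = [["keyword", "extraction", "keyword"],
        ["extraction", "keyword"],
        ["extraction", "cluster", "keyword"]]
print(top_high_freq_words(docs, 2))  # ['keyword', 'extraction']
```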
- E Document group to be analyzed.
- As the document group E, for example, a document group constituting an individual cluster obtained by clustering a large number of documents based on similarity may be used.
- When the document group group S contains multiple document groups E, each is denoted Eu (u = 1, 2, ..., n, where n is their number).
- S: a document group group including a plurality of document groups E; for example, one consisting of 300 patent documents similar to a given patent document or patent document group.
- Σ_H means summation over the range that satisfies condition H.
- C(w, w′): the co-occurrence degree in a document group, calculated based on the presence or absence of co-occurrence of index words in document units; it is summed for index word w and high-frequency word w′ over all documents belonging to the document group E (optionally with a per-document weighting).
- Co(w, g): the index word-base co-occurrence degree. The sum of the co-occurrence degree C(w, w′) of index word w with each high-frequency word w′ belonging to base g, taken over all such w′ (excluding w itself).
- y: the average appearance rate of title terms, obtained from the appearance rate of title terms in each title divided by the number of index words (title terms) appearing in that title.
- T1, T2, ...: titles extracted in descending order of title score.
- K Keyword suitability. This is calculated to determine the number of labels (explained later), and indicates the degree of keyword occupancy for document group E.
- DF(w, D): the document frequency of index word w in document D; that is, 1 if index word w is included in document D, and 0 if it is not.
- TF*IDF(P): the product of TF(D) and IDF(P), calculated for each index word in a document.
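A standard TF*IDF formulation matching this definition (the patent's exact variant may differ; the counts are invented):

```python
import math

# TF*IDF(P): term frequency of a word in a document multiplied by its inverse
# document frequency over the larger document group P of n_p documents.

def tf_idf(tf_in_doc, df_in_p, n_p):
    return tf_in_doc * math.log(n_p / df_in_p)

# a word occurring 4 times in the document, in 10 of P's 1000 documents
print(round(tf_idf(4, 10, 1000), 2))  # 18.42, i.e. 4 * ln(100)
```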
- FIG. 1 is a diagram showing a hardware configuration of the keyword extracting device according to the first embodiment of the present invention.
- The keyword extraction device of the present embodiment consists of a processing device 1 composed of a CPU (central processing unit) and memory (recording device); an input device 2, which is input means such as a keyboard (manual input device); a recording device 3, which is recording means for storing document data, conditions, and the processing results of the processing device 1; and an output device 4, which is output means for displaying or printing the extracted keywords.
- FIG. 2 is a diagram for explaining in detail the configuration and functions of the keyword extracting device of the first embodiment.
- The processing device 1 includes a document reading unit 10, an index word extraction unit 20, a high-frequency word extraction unit 30, a high-frequency word/index word co-occurrence degree calculation unit 40, a clustering unit 50, an index word-base co-occurrence degree calculation unit 60, a key(w) calculation unit 70, a Skey(w) calculation unit 80, and a keyword extraction unit 90.
- the recording device 3 includes a condition recording unit 310, a work result storage unit 320, a document storage unit 330, and the like.
- the document storage unit 330 includes an external database and an internal database.
- An external database means, for example, a digital document database such as the IPDL (Industrial Property Digital Library) provided by the Japan Patent Office, or PATOLIS provided by PATOLIS Corporation.
- The internal database is a database built by storing data such as commercially sold patent JP-ROM, by reading media storing documents, such as FD (flexible disk), CD-ROM (compact disk read-only memory), MO (magneto-optical disk), or DVD (digital video disc), with a media reading device, or by reading paper documents with an OCR (optical character reader).
- Communication means for exchanging signals and data among the processing device 1, the input device 2, the recording device 3, and the output device 4 may be a direct connection such as a USB (Universal Serial Bus) cable, transmission and reception via a network such as a LAN (local area network), transfer via media such as FD, CD-ROM, MO, or DVD storing the documents, or some combination of these.
- The input device 2 accepts inputs such as document reading conditions, high-frequency word extraction conditions, clustering conditions, tree diagram creation conditions, tree diagram cutting conditions, score calculation conditions, and keyword output conditions. These input conditions are sent to and stored in the condition recording unit 310 of the recording device 3.
- The document reading unit 10 reads the document group E consisting of a plurality of documents D1 to Dn to be analyzed from the document storage unit 330 of the recording device 3, in accordance with the reading conditions stored in the condition recording unit 310 of the recording device 3.
- the data of the read document group is directly sent to the index word extraction unit 20 and used for processing there, and is also sent to the work result storage unit 320 of the recording device 3 and stored therein.
- The data sent from the document reading unit 10 to the index word extraction unit 20 or the work result storage unit 320 may be all data including the document data of the read document group E, or may be only bibliographic data specifying each document D belonging to the document group E (for example, an application number or a publication number in the case of patent documents). In the latter case, the data of each document D may be read again from the document storage unit 330 based on the bibliographic data when needed for subsequent processing.
- the index word extraction unit 20 extracts an index word of each document from the document group read by the document reading unit 10.
- the index word data of each document is sent directly to the high-frequency word extraction unit 30 and used for processing there, and is also sent to the work result storage unit 320 of the recording device 3 for storage.
- The high-frequency word extraction unit 30 extracts, based on the index words of each document extracted by the index word extraction unit 20 and in accordance with the high-frequency word extraction conditions stored in the condition recording unit 310 of the recording device 3, a predetermined number of index words whose weight, including an evaluation of appearance frequency, is high. Specifically, for each index word, GF(E), the number of appearances in the document group E, is first calculated. It is further preferable to calculate IDF(P) of each index word and compute the product GF(E)*IDF(P).
- A predetermined number of index words with the highest values of GF(E), or of GF(E)*IDF(P), as the calculated weight are then extracted as high-frequency words.
- The extracted high-frequency word data is sent directly to the high-frequency word/index word co-occurrence degree calculation unit 40 for use in processing there, and is also sent to the work result storage unit 320 of the recording device 3 for storage.
- It is preferable that the GF(E) of each index word calculated above, and the IDF(P) of each index word when it is calculated, also be sent to the work result storage unit 320 of the recording device 3 for storage.
- The high-frequency word/index word co-occurrence degree calculation unit 40 calculates, for each high-frequency word extracted by the high-frequency word extraction unit 30 and each index word extracted by the index word extraction unit 20 and stored in the work result storage unit 320, the co-occurrence degree in document group E based on the presence or absence of co-occurrence in document units. If there are p index words and q high-frequency words have been extracted, the resulting matrix data has p rows and q columns.
- The co-occurrence degree data calculated by the high-frequency word/index word co-occurrence degree calculation unit 40 is sent directly to the clustering unit 50 and used for processing there, and is also sent to the work result storage unit 320 of the recording device 3 for storage.
- The clustering unit 50 performs cluster analysis of the q high-frequency words in accordance with the clustering conditions stored in the condition recording unit 310 of the recording device 3, based on the co-occurrence degree data calculated by the high-frequency word/index word co-occurrence degree calculation unit 40.
- For each high-frequency word, the degree of similarity (similarity or dissimilarity) of its co-occurrence with each index word is calculated.
- the calculation of the similarity is executed by calling the similarity calculation module for calculating the similarity from the condition recording unit 310 based on the condition input from the input device 2.
- The similarity calculation can be performed based on the cosine of, or the distance between, the p-dimensional column vectors of the high-frequency words being compared (the vector space method).
- A larger cosine between vectors (a similarity) means a higher degree of similarity, while a smaller distance between vectors (a dissimilarity) means a higher degree of similarity.
- The similarity is not limited to the vector space method, and may be defined using other methods.
- Next, a tree diagram in which the high-frequency words are connected in a tree shape is created in accordance with the tree diagram creation conditions stored in the condition recording unit 310 of the recording device 3. It is desirable to create a dendrogram that reflects the dissimilarity between high-frequency words in the height of the joining positions (joining distances).
- the base data formed by the clustering unit 50 is directly sent to the index word base co-occurrence degree calculation unit 60 for use in processing there, or sent to the work result storage unit 320 of the recording device 3 for storage. Is done.
- the index word-base co-occurrence degree calculation unit 60 calculates, for each index word extracted by the index word extraction unit 20 and stored in the work result storage unit 320 of the recording device 3, the co-occurrence degree with each base formed by the clustering unit 50. The co-occurrence degree data calculated for each index word is sent directly to the key(w) calculation unit 70 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 for storage.
- the key(w) calculation unit 70 calculates the evaluation score key(w) of each index word based on the co-occurrence degree of each index word with the bases, calculated by the index word-base co-occurrence degree calculation unit 60. The calculated key(w) data is sent directly to the Skey(w) calculation unit 80 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored therein.
- the Skey(w) calculation unit 80 calculates the Skey(w) score based on the key(w) score of each index word calculated by the key(w) calculation unit 70, and on the GF(E) and IDF(P) of each index word calculated by the high-frequency word extraction unit 30 and stored in the work result storage unit 320 of the recording device 3. The calculated Skey(w) data is sent directly to the keyword extraction unit 90 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored therein.
- the keyword extraction unit 90 extracts a predetermined number of index words with the highest Skey(w) scores calculated by the Skey(w) calculation unit 80 as keywords of the analysis target document group.
- the extracted keyword data is sent to and stored in the work result storage unit 320 of the recording device 3 and output by the output device 4 as necessary.
- the condition recording unit 310 records information such as conditions obtained from the input device 2, and sends necessary data based on a request from the processing device 1.
- the work result storage unit 320 stores the work result of each component in the processing device 1 and sends necessary data based on the request of the processing device 1.
- the document storage unit 330 stores document data obtained from an external database or an internal database via the input device 2 or the processing device 1, and provides the necessary document data based on requests from the processing device 1.
- the output device 4 in FIG. 2 outputs the keywords of the document group extracted by the keyword extraction unit 90 of the processing device 1 and stored in the work result storage unit 320 of the recording device 3.
- Examples of the output form include display on a display device, printing on a print medium such as paper, or transmission to a computer device on a network via a communication device.
- FIG. 3 is a flowchart showing an operation procedure of the processing device 1 in the keyword extracting device of the first embodiment.
- the document reading unit 10 reads the plurality of documents to be analyzed from the document storage unit 330 of the recording device 3 (step S10).
- the index word extraction unit 20 extracts index words of each document from the document group read in the document reading step S10 (step S20).
- the index word data of each document can be expressed by a vector whose components are function values of the number of occurrences, in each document D, of the index words included in the document group E (the index word frequency TF(w, D)).
- the high-frequency word extraction unit 30 extracts, based on the index word data of each document extracted in the index word extraction step S20, a predetermined number of index words whose evaluation, which includes the appearance frequency in the document group E, is high.
- first, GF(E), the number of appearances in the document group E, is calculated for each index word (step S30).
- GF(E) can be obtained by summing the index word frequency TF(w, D) of each index word, calculated in the index word extraction step S20, over the documents D belonging to the document group E.
- a predetermined number of index words with the highest appearance frequency are extracted (step S31).
- the number of high-frequency words extracted is, for example, 10 words. In this case, if the 10th and 11th words tie in appearance frequency, the 11th word is also extracted as a high-frequency word.
- the index words extracted in this way are taken as the high-frequency words.
- index words may be segmented differently depending on the morphological analysis software, so it is impossible to create a necessary and sufficient unnecessary word list; it is nevertheless desirable to eliminate unnecessary words as far as possible.
- as an unnecessary word list for patent documents, for example, the following can be considered. [No meaning as a keyword] the above, the following, description, claim, patent, number, formula, general, means, feature
- each high-frequency word extracted in the high-frequency word extraction step S31 and each index extracted in the index word extraction step S20 The co-occurrence degree with the word is calculated (step S40).
- the co-occurrence degree C(w_i, w_j) in the document group E of the index word w_i and the index word w_j is calculated by, for example, the following equation:
C(w_i, w_j) = Σ_{D∈E} β(w_i, D) × β(w_j, D) × DF(w_i, D) × DF(w_j, D)
where the weight β(w, D) is, for example, TF(w, D) or TF(w, D) × IDF(w, P).
- DF(w, D) is 1 if the document D contains the index word w, and 0 otherwise.
- DF(w_i, D) × DF(w_j, D) is therefore 1 if the index word w_i and the index word w_j co-occur in one document D, and 0 otherwise. Summing this over all documents D belonging to the document group E, weighted by β(w_i, D) × β(w_j, D), gives the co-occurrence degree C(w_i, w_j) of the index word w_i and the index word w_j.
- alternatively, co-occurrence may be counted in sentence units:
C(w_i, w_j) = Σ_{D∈E} β(w_i, D) × β(w_j, D) × δ(Σ_{sen∈D} [TF(w_i, sen) × TF(w_j, sen)])
where sen means each sentence in document D; Σ_{sen∈D}[TF(w_i, sen) × TF(w_j, sen)] returns a value greater than or equal to 1 if the index words w_i and w_j co-occur in some sentence, and δ(x) is 1 if x ≥ 1 and 0 otherwise.
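The sentence-unit co-occurrence computation can be sketched as follows, taking the weight β(w, D) = TF(w, D), one of the options mentioned above; the toy documents are illustrative, not from the embodiment.

```python
def tf(word, tokens):
    """Number of occurrences of word in a token list."""
    return tokens.count(word)

def delta(x):
    """1 if x >= 1, else 0."""
    return 1 if x >= 1 else 0

def cooccurrence(wi, wj, documents):
    """Sentence-level co-occurrence degree C(wi, wj) over a document group E.
    Each document is a list of sentences; each sentence is a list of tokens.
    The weight beta(w, D) is taken as TF(w, D) here."""
    total = 0.0
    for doc in documents:
        doc_tokens = [t for sen in doc for t in sen]
        beta_i = tf(wi, doc_tokens)
        beta_j = tf(wj, doc_tokens)
        sen_hit = delta(sum(tf(wi, sen) * tf(wj, sen) for sen in doc))
        total += beta_i * beta_j * sen_hit
    return total

E = [
    [["retrieval", "index", "word"], ["keyword", "extraction"]],   # D1
    [["index", "word", "keyword"]],                                # D2
    [["unrelated", "sentence"]],                                   # D3
]
# "index" and "word" co-occur in a sentence of D1 and of D2
assert cooccurrence("index", "word", E) == 2.0
```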
- for example, a pair of index words that co-occur in a total of three documents of the document group obtains a co-occurrence degree reflecting those three documents, whereas a pair of index words included together in only two documents obtains a correspondingly smaller co-occurrence degree.
- the clustering unit 50 performs cluster analysis on the high-frequency words based on the co-occurrence degree data calculated in the high-frequency word-index word co-occurrence degree calculation step S40.
- in step S50, the degree of similarity (similarity or dissimilarity) of the co-occurrence with each index word is calculated for each pair of high-frequency words.
- the lower left half of the table is omitted because it overlaps with the upper right half.
- one group of high-frequency words has a correlation coefficient exceeding 0.8 in every combination within the group.
- another group of high-frequency words likewise has a correlation coefficient exceeding 0.8 in every combination within the group.
- between high-frequency words belonging to different groups, the correlation coefficients are all less than 0.8.
- in step S51, a dendrogram in which the high-frequency words are joined in a tree shape is created.
- it is desirable to create a dendrogram that reflects the dissimilarity between high-frequency words in the height of the joint position (joining distance).
- the principle of creating a dendrogram is briefly explained. First, based on the dissimilarities between high-frequency words, the pair of high-frequency words with the lowest dissimilarity (highest similarity) is joined to form a combination. Then, new combinations are generated by joining a combination with another high-frequency word, or a combination with another combination, in ascending order of dissimilarity, and this work is repeated. The result can thus be expressed as a hierarchical structure. The dissimilarity between a combination and other high-frequency words, or between two combinations, is updated based on the dissimilarities between the individual high-frequency words; for example, the known Ward method is used as the update method.
- the clustering unit 50 cuts the created dendrogram (step S52). For example, it is cut at the position ⟨d⟩ + σ, where d is the joining distance in the dendrogram, ⟨d⟩ is the mean value of d, and σ is the standard deviation of d.
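The dendrogram construction and the cut at ⟨d⟩ + σ can be sketched as below. Note the sketch uses average linkage to stay short, whereas the embodiment names the Ward method for updating dissimilarities; the dissimilarity matrix is a toy example.

```python
import statistics

def avg_d(dissim, ca, cb):
    """Average pairwise dissimilarity between two clusters of word indices."""
    return sum(dissim[i][j] for i in ca for j in cb) / (len(ca) * len(cb))

def join_distances(dissim):
    """Run agglomerative clustering to the top and record each join distance."""
    clusters = [[i] for i in range(len(dissim))]
    dists = []
    while len(clusters) > 1:
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: avg_d(dissim, clusters[p[0]], clusters[p[1]]))
        dists.append(avg_d(dissim, clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return dists

def bases(dissim):
    """Cut the dendrogram at <d> + sigma (step S52) and return the bases."""
    dists = join_distances(dissim)
    thr = statistics.mean(dists) + statistics.pstdev(dists)
    clusters = [[i] for i in range(len(dissim))]
    while len(clusters) > 1:
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: avg_d(dissim, clusters[p[0]], clusters[p[1]]))
        if avg_d(dissim, clusters[a], clusters[b]) > thr:
            break   # stop merging above the cut position
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# toy dissimilarities among 4 high-frequency words: {0,1} and {2,3} are close
D = [[0.0, 0.10, 0.90, 0.80],
     [0.10, 0.0, 0.85, 0.90],
     [0.90, 0.85, 0.0, 0.15],
     [0.80, 0.90, 0.15, 0.0]]
print(bases(D))  # prints [[0, 1], [2, 3]]: two bases
```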
- high-frequency words belonging to the same base have high similarity in their co-occurrence with the index words.
- high-frequency words belonging to different bases have low similarity in their co-occurrence with the index words.
- the index word-base co-occurrence degree calculation unit 60 calculates, for each index word extracted in the index word extraction step S20, the co-occurrence degree with each base formed in the clustering step S53 (the index word-base co-occurrence degree) Co(w, g) (step S60).
- the index word-base co-occurrence degree Co(w, g) is calculated, for example, by the following equation:
Co(w, g) = Σ_{w'∈g, w'≠w} C(w, w')
where w' ranges over the high-frequency words belonging to the base g, excluding the index word w whose co-occurrence degree Co(w, g) is being measured.
- that is, the co-occurrence degree Co(w, g) between the index word w and the base g is the sum of the co-occurrence degrees C(w, w') over all such w'.
- the index word-base co-occurrence degree is not limited to the above Co(w, g), and may be calculated by the following equation:
Co'(w, g) = Σ_{D∈E} β(w, D) × DF(w, D) × δ(Σ_{w'∈g, w'≠w} DF(w', D))
- δ(Σ_{w'} DF(w', D)) returns 1 if the document D contains at least one high-frequency word w' belonging to the base g other than the index word w whose co-occurrence is being measured, and 0 if it contains none.
- DF(w, D) returns 1 if the index word w is included in the document D, and 0 if it is not.
- multiplying DF(w, D) by the δ(…) term therefore returns 1 if w and any w' belonging to the base g co-occur in the document D, and 0 otherwise.
- Co'(w, g) is obtained by multiplying this by the weight β(w, D) defined above and summing over all documents D belonging to the document group E.
- the index word-base co-occurrence degree Co(w, g) in [Equation 3] sums, over all documents in E, the presence or absence (1 or 0) of co-occurrence of w and w' in D, weighted by β(w, D) × β(w', D) (this is C(w, w')), and then sums the result over the w' in g.
- the index word-base co-occurrence degree Co'(w, g) in [Equation 4] sums, over all documents in E, the presence or absence (1 or 0) of co-occurrence in D of w with any of the w' in g, weighted by β(w, D).
- therefore, the index word-base co-occurrence degree Co(w, g) in [Equation 3] increases or decreases depending on the number of w' in the base g co-occurring with the index word w.
- in contrast, the index word-base co-occurrence degree Co'(w, g) increases or decreases depending only on the presence or absence of some w' in the base g that co-occurs with the index word w; the number of co-occurring w' is irrelevant.
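The contrast between Co(w, g) and Co'(w, g) can be illustrated with a minimal sketch, using document-level co-occurrence and β(w, D) = TF(w, D); the toy document group is illustrative, not from the embodiment.

```python
def tf(word, doc):
    """Term frequency of word in a document given as a token list."""
    return doc.count(word)

def df(word, doc):
    """1 if the document contains the word, else 0."""
    return 1 if word in doc else 0

def delta(x):
    return 1 if x >= 1 else 0

def C(wi, wj, E):
    """Document-level co-occurrence degree with beta(w, D) = TF(w, D)."""
    return sum(tf(wi, D) * tf(wj, D) * df(wi, D) * df(wj, D) for D in E)

def Co(w, g, E):
    """[Equation 3]-style: sum of C(w, w') over high-frequency words w' in g."""
    return sum(C(w, wp, E) for wp in g if wp != w)

def Co_prime(w, g, E):
    """[Equation 4]-style: counts only whether w co-occurs in each document
    with *any* w' of the base g."""
    return sum(tf(w, D) * df(w, D) *
               delta(sum(df(wp, D) for wp in g if wp != w))
               for D in E)

E = [["acid", "salt", "oral"], ["acid", "salt"], ["acid", "oral"]]
g = ["salt", "oral"]   # a base of high-frequency words
# "acid" co-occurs with both members of g in the first document, which
# Co counts twice but Co_prime counts only once
assert Co("acid", g, E) == 4
assert Co_prime("acid", g, E) == 3
```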
- the key(w) calculation unit calculates the evaluation score key(w) of each index word (step S70).
- key(w) is calculated by the following equation, for example.
- Co(w, g) in [Equation 3] above is used here as the index word-base co-occurrence degree, but Co'(w, g) in [Equation 4] above may also be used, as described above.
- the order of key(w) is greatly influenced by the order of the document frequency DF(E) in the document group E.
- the index word with the largest DF(E) ranks first in key(w), the index word with the next largest DF(E) ranks second, and so on for the remaining index words.
- an index word with a large document frequency DF(E) in the document group E can co-occur with the high-frequency words in more documents, and therefore obtains a larger index word-base co-occurrence degree Co(w, g) or Co'(w, g). This is the reason why the ranking of DF(E) has a great influence on the ranking of key(w).
- similarly, the influence of the ranking of the global frequency GF(E) in the document group E on the rank of key(w) is expected to increase.
- key(w) tends to be larger for an index word that straddles more bases.
- such an index word therefore has a larger key(w) than an index word that co-occurs with fewer bases. In addition, as shown in [Table 2] and [Table 6], the index words are compared.
- the Skey(w) calculation unit calculates the Skey(w) score from the key(w) of each index word, the GF(E) of each index word, and the IDF(P) of each index word (step S80).
- the Skey(w) score is calculated by the following equation.
- GF(w, E) gives a large value to words that frequently appear in the document group E.
- IDF(P) gives a large value to words that are rare in the whole document set P, i.e., words unique to the document group E.
- key(w) is a score that is affected by DF(E) as described above and gives a large value to words that co-occur with more bases. The larger the values of GF(w, E), IDF(P), and key(w), the larger Skey(w) becomes.
- TF*IDF, which is often used as a weight for index words, is the product of the index word frequency TF and the IDF, the logarithm of the reciprocal of the index word occurrence probability DF(P)/N(P). IDF has the effect of minimizing the contribution of index words that appear with high probability in a document group, and can give a high weight to index words that appear only in specific documents. However, it sometimes has the disadvantage that the value jumps merely because the document frequency is low. As explained below, the Skey(w) score has the effect of improving this shortcoming.
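A minimal TF·IDF sketch illustrating the point above: a word that appears in every document gets weight 0, while a word whose document frequency is low sees its IDF value jump. The toy document set is illustrative.

```python
import math

def idf(word, P):
    """IDF: logarithm of the reciprocal of the occurrence probability DF(P)/N(P)."""
    dfp = sum(1 for doc in P if word in doc)
    return math.log(len(P) / dfp)

def tf_idf(word, doc, P):
    """TF*IDF of a word in one document, relative to the whole set P."""
    return doc.count(word) * idf(word, P)

# toy whole document set P: "rare" appears in only one of 8 documents
P = [["common", "rare"]] + [["common"]] * 7
assert idf("common", P) == 0.0        # appears everywhere -> contribution minimized
assert idf("rare", P) == math.log(8)  # low document frequency -> the value jumps
```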
- writing Skey(key') for the Skey(w) score when key'(w) in [Equation 7] is used, and Skey(key) for the Skey(w) score when key(w) in [Equation 5] is used, the two can be compared.
- the keyword extraction unit 90 extracts a predetermined number of index words with the highest Skey(w) scores calculated in the Skey(w) calculation step S80 as keywords of the analysis target document group (step S90).
- in this way, index words that co-occur with high-frequency words in many documents are highly evaluated, and keywords are extracted. Since high-frequency words belonging to different bases are not similar in their co-occurrence with the index words, an index word that co-occurs with many bases can be said to be a word that bridges the varied topics and claims of the document group E. In addition, an index word that co-occurs with high-frequency words in many documents, i.e., one with a high document frequency DF(E) in the document group E, can be said to be a word expressing topics and claims common to the document group. By evaluating such index words highly, it is possible to automatically extract keywords that accurately represent the characteristics of the document group consisting of a plurality of documents.
- FIG. 4 is a diagram for explaining in detail the configuration and function of the keyword extracting apparatus according to the second embodiment of the present invention. Parts similar to those in FIG. 2 according to the first embodiment are denoted by the same reference numerals, and description thereof is omitted.
- the keyword extracting device of the second embodiment includes a title extraction unit 100, a title score calculation unit 110, a Skey(w) broader word reading unit 120, a label number determination unit 130, and a label extraction unit 140 in the processing device 1, in addition to the components of the first embodiment.
- the keyword extraction unit 90 among the components of the first embodiment may be omitted; in that case, the calculation result of the Skey(w) calculation unit 80 is stored as it is in the work result storage unit 320.
- the title extraction unit 100 extracts the title (title) of each document from the document data read by the document reading unit 10 and stored in the work result storage unit 320. For example, in the case of a patent document, the description of “Title of Invention” is extracted.
- the extracted title data is sent directly to the title score calculation unit 110 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored.
- the title score calculation unit 110 calculates a title score for the title of each document, using the title data of each document extracted by the title extraction unit 100 and the index word data of the document group E extracted by the index word extraction unit 20. The method of calculating this title score will be described later.
- the calculated title score is sent directly to the label extraction unit 140 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored.
- the Skey(w) broader word reading unit 120 extracts a predetermined number of index words with the top Skey(w) scores, based on the Skey(w) of each index word w calculated by the Skey(w) calculation unit 80 and stored in the work result storage unit 320. For example, 10 words are extracted. The extracted Skey(w) broader term data is directly sent to the label number determination unit 130 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored therein.
- the label number determination unit 130 calculates the keyword fitness, an index indicating the content uniformity of the document group E, based on the Skey(w) broader term data extracted by the Skey(w) broader word reading unit 120. It then determines the number of labels to be extracted based on the keyword fitness. The method for calculating the keyword fitness and the determination of the number of labels based on it will be described later.
- the data of the determined number of labels is sent directly to the label extraction unit 140 and used for processing there, or sent to the work result storage unit 320 of the recording device 3 and stored therein.
- the label extraction unit 140 extracts, as labels, the number of titles determined by the label number determination unit 130, in descending order of the title score of each title calculated by the title score calculation unit 110.
- these labels correspond to the keywords of the present invention.
- FIG. 5 is a flowchart showing the operation procedure of the processing device 1 in the keyword extracting device of the second embodiment.
- the keyword extraction device according to the second embodiment calculates Skey(w) through the same processing as in the first embodiment (up to step S80). The processing up to the calculation of Skey(w) is the same as in FIG. 3.
- in step S100, one title is extracted from each document D, so the number of extracted titles equals the number of documents N(E).
- the title extraction unit 100 then forms the title sum s by concatenating the title a_k of each document in the document group E; the summation here means string concatenation.
- the title terms obtained by morphologically segmenting the title sum s are registered as an index word dictionary.
- instead of the index words obtained from the title sum s, an index word dictionary obtained by segmenting the document contents of the document group E may be used.
- alternatively, only a predetermined number of index words with the top keyword scores Skey(w) (for example, 30 words) may be used as the index word dictionary.
- the title score calculation unit 110 calculates the title score for the title of each document (step S110).
- the title score is calculated using the title term appearance rate x and the title term appearance rate average y described below.
- for example:
f_k = (1/N(E)) × Σ_v TF(w_v, s) × IDF(w_v, P) × δ(TF(w_v, a_k))
where the number of occurrences of the index word w_v in the title sum s is given by TF(w_v, s).
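The title score formula can be sketched as follows; the IDF(w_v, P) values are assumed to be precomputed elsewhere, and the titles are toy data.

```python
def tf(word, tokens):
    """Number of occurrences of word in a token list."""
    return tokens.count(word)

def delta(x):
    return 1 if x >= 1 else 0

def title_score(title_k, title_sum, idf_p, n_docs):
    """f_k = (1/N(E)) * sum_v TF(w_v, s) * IDF(w_v, P) * delta(TF(w_v, a_k)),
    summing over the terms w_v of the title sum s. The IDF(w_v, P) values in
    idf_p are assumed precomputed (illustrative numbers below)."""
    return sum(tf(w, title_sum) * idf_p[w] * delta(tf(w, title_k))
               for w in idf_p) / n_docs

titles = [["oral", "composition"], ["oral", "dispersant"], ["deodorant"]]
s = [w for t in titles for w in t]          # title sum: concatenated titles
idf_p = {"oral": 1.0, "composition": 2.5, "dispersant": 2.0, "deodorant": 3.0}
scores = [title_score(t, s, idf_p, len(titles)) for t in titles]
# the title containing frequent, well-weighted title terms scores highest
best = titles[max(range(len(titles)), key=lambda k: scores[k])]
assert best == ["oral", "composition"]
```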
- the title score is an increasing function of the title term appearance rate x and the title term appearance rate average y.
- the title score may also be obtained by the following equation.
- the titles are ranked in descending order of the title score.
- the Skey(w) broader word reading unit 120 extracts a predetermined number (t) of index words with the top Skey(w) scores (step S120).
- the label number determination unit 130 calculates the keyword fitness indicating the content uniformity of the document group E, and determines the number of labels to be extracted (step S130).
- the keyword fitness represents the degree of occupation, in the document group E, of the words evaluated as keywords by Skey(w). If the document group E consists of a single field, the keywords are highly related and not diversified, so the degree of occupation is high. Conversely, if the document group E is composed of multiple fields, the number of documents per field is small, the keywords are varied, and the degree of occupation is low. Therefore, if the keyword fitness is high, the content uniformity of the document group E is high; if it is low, it can be determined that the document group E is composed of multiple fields.
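The specification's equation for the keyword fitness is not reproduced in this passage; as one plausible reading of the "degree of occupation" described above, it can be sketched as the fraction of the document group's index word occurrences covered by the top-Skey(w) words. This formula is an illustrative assumption, not the patent's equation.

```python
def keyword_fitness(gf, top_words):
    """Occupation degree of the top-Skey(w) words in a document group E:
    fraction of all index word occurrences (GF) covered by those words.
    Illustrative reading only, not the equation from the specification."""
    total = sum(gf.values())
    return sum(gf[w] for w in top_words) / total if total else 0.0

# uniform single-field group: a few keywords dominate -> high fitness
gf_single = {"oral": 40, "composition": 30, "salt": 20, "misc": 10}
# multi-field group: occurrences spread thin -> low fitness
gf_multi = {w: 10 for w in "abcdefghij"}
assert keyword_fitness(gf_single, ["oral", "composition"]) == 0.7
assert keyword_fitness(gf_multi, ["a", "b"]) == 0.2
```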
- the number of labels to be extracted and their output mode, i.e., the keywords extracted in the second embodiment, are determined according to the obtained keyword fitness value. For example:
- the threshold values of the keyword fitness are not limited to the set [0.55, 0.35, 0.2], and other values may be selected.
- when another threshold is used instead of the above keyword fitness threshold set, the set [0.3, 0.2, 0.02] is preferably used.
- in step S140, labels are extracted based on the title score of each title calculated in the title score calculation step S110 and the number of labels determined in the label number determination step S130.
- as described above, the number of extracted keywords (labels) is determined based on the appearance rates of the index words with the top Skey(w) scores.
- in this way, an appropriate number of keywords representing the characteristics of the document group can be automatically extracted according to the degree of content uniformity of the document group E.
- moreover, since keywords are extracted by highly evaluating words with a high appearance rate, keywords that accurately represent the contents of the document group can be extracted.
- the cluster analysis is based on the TF of the index words included in each of the above 850 documents.
- ⟨d⟩ is the mean value of d, and σ is the standard deviation of d.
- the top three words of Skey (w) were used as keywords according to the first embodiment.
- the keyword fitness was calculated, and based on it, the labels according to the second embodiment were generated.
- the index word dictionary for extracting labels according to the second embodiment uses the title terms obtained by segmenting the title sum s, as described above. Where using index words obtained by segmenting the document contents of the document group E generated a label different from that obtained using the title sum s, the result is marked with an asterisk (*).
- the presentation order of the document groups was set in descending order of the keyword fitness, so that the difference in the expression of the labels can be understood at a glance.
- Oral Composition Label ⁇ Oral Composition ⁇ Dispersant etc. ”* Oral Composition ⁇ Deodorant Composition Keyword [Acid. Salt. Oral]
- Keywords [document.loading.mutan] As described above, the labels of each document group according to the second embodiment tended to match the titles that a human would assign to each document group.
- the keywords of each document group according to the first embodiment were not limited to general names for the subject of the invention; rather, more specific terms indicating the technical contents were selected.
- FIG. 6 is a reference diagram showing an example in which the keywords extracted by the keyword extracting device of the present invention are entered in a document correlation diagram showing the relationship between documents.
- this document correlation diagram shows the content-based and temporal relationships among the 27 document groups shown in the above example.
- a dendrogram is constructed based on the similarity between the 26 vectors created in this way, and clusters were extracted by cutting the dendrogram at the position ⟨d⟩ + σ, where d is the joining distance in the dendrogram, ⟨d⟩ is the mean value of d, and σ is the standard deviation of d.
- the document correlation diagram created in this way is categorized based on the contents of the documents and arranged in chronological order, and is useful for analyzing the development trends of the surveyed household chemicals manufacturers.
- since the labels obtained by the method of the second embodiment of the present invention (which may also be the keywords of the first embodiment) are entered in the document correlation diagram for each document group, development trends can be grasped at a glance.
- keywords are extracted from each document group E.
- the plurality of document groups E are preferably the individual clusters obtained by clustering the document group group S, but conversely, a plurality of document groups E may be collected to form the document group group S.
- FIG. 7 is a diagram for explaining in detail the configuration and function of the keyword extracting device according to the third embodiment of the present invention. Parts similar to those in FIG. 2 according to the first embodiment are denoted by the same reference numerals, and description thereof is omitted.
- the keyword extraction device of the third embodiment includes an evaluation value calculation unit 200, a concentration degree calculation unit 210, a share calculation unit 220, a first reciprocal calculation unit 230, a second reciprocal calculation unit 240, an originality calculation unit 250, and a keyword extraction unit 260 in the processing device 1.
- the keyword extraction unit 90 among the components of the first embodiment may not be provided; in that case, the calculation result of the Skey(w) calculation unit 80 is stored in the work result storage unit 320 as it is.
- the evaluation value calculation unit 200 reads the index word w of each document extracted by the index word extraction unit 20 from the work result storage unit 320 for the document group group S including a plurality of document groups E. Alternatively, the evaluation value calculation unit 200 reads the Skey (w) of the index word calculated for each document group E by the Skey (w) calculation unit 80 from the work result storage unit 320. If necessary, the evaluation value calculation unit 200 may read the data of each document group E read by the document reading unit 10 from the work result storage unit 320 and count the number of documents N (E). Further, GF (E) and IDF (P) calculated in the process of high-frequency word extraction in the high-frequency word extraction unit 30 may be read from the work result storage unit 320.
- the evaluation value calculation unit 200 calculates an evaluation value A (w, E) based on the appearance frequency of each index word w in each document group E based on the read information.
- the calculated evaluation value is sent to and stored in the work result storage unit 320, or directly sent to the concentration degree calculation unit 210 and the share calculation unit 220 and used for processing there.
- the concentration degree calculation unit 210 reads the evaluation value A(w, E) in each document group E of each index word w calculated by the evaluation value calculation unit 200 from the work result storage unit 320, or receives it directly from the evaluation value calculation unit 200.
- the concentration degree calculation unit 210 calculates, for each index word w, the concentration degree of the distribution of that index word over the document group group S, based on the obtained evaluation values A(w, E). This concentration degree is obtained by calculating the sum of the evaluation values A(w, E) over all document groups E belonging to the document group group S, calculating for each document group E the ratio of its evaluation value A(w, E) to that sum, squaring the ratio, and summing the squared ratios over all document groups E belonging to the document group group S. The calculated concentration degree is sent to the work result storage unit 320 and stored therein.
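The concentration degree described above is a sum of squared shares over the document groups (a Herfindahl-type index); a minimal sketch with toy values:

```python
def concentration(evals):
    """Concentration of one index word w over the document groups of S:
    sum over E of the squared share of A(w, E) in the total (step S210)."""
    total = sum(evals)
    if total == 0:
        return 0.0
    return sum((a / total) ** 2 for a in evals)

# A(w, E) of one index word across 4 document groups (toy values)
assert concentration([10, 0, 0, 0]) == 1.0   # appears in a single group only
assert concentration([5, 5, 5, 5]) == 0.25   # spread evenly over 4 groups
```

A word concentrated in one document group scores 1, the maximum; a word spread evenly over n groups scores 1/n.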
- the share calculation unit 220 reads the evaluation value A(w, E) in each document group E of each index word w calculated by the evaluation value calculation unit 200 from the work result storage unit 320, or receives it directly from the evaluation value calculation unit 200.
- the share calculation unit 220 calculates, for each index word w, its share in each document group E based on the obtained evaluation values A(w, E).
- this share is obtained by calculating the sum of the evaluation values A(w, E) over all index words w extracted for the analysis target document group E belonging to the document group group S, and then calculating, for each index word w, the ratio of its evaluation value A(w, E) to that sum.
- the calculated share is sent to and stored in the work result storage unit 320.
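The share computation can be sketched as follows, with toy A(w, E) values:

```python
def shares(evals_in_group):
    """Share of each index word within one document group E:
    A(w, E) divided by the sum of A(w', E) over all index words w' of E."""
    total = sum(evals_in_group.values())
    return {w: a / total for w, a in evals_in_group.items()}

A = {"oral": 6.0, "salt": 3.0, "acid": 1.0}   # toy A(w, E) values for one E
s = shares(A)
assert s["oral"] == 0.6                        # dominant word in this group
assert abs(sum(s.values()) - 1.0) < 1e-9       # shares sum to 1
```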
- the first reciprocal calculation unit 230 reads the index words w of each document extracted by the index word extraction unit 20 from the work result storage unit 320, for the document group group S including a plurality of document groups E. Then, based on the read index word data of each document of the document group group S, the first reciprocal calculation unit 230 calculates, for each index word w, a function value of the reciprocal of its appearance frequency in the document group group S (for example, the standardized IDF(S) described later). The calculated function value is sent to and stored in the work result storage unit 320, or directly sent to the originality calculation unit 250 and used for processing there.
- the second reciprocal calculation unit 240 calculates a function value of the reciprocal of the appearance frequency in a large document group that includes the document group group S; here, the whole document set P is used as the large document group.
- specifically, the IDF(P) calculated in the process of high-frequency word extraction in the high-frequency word extraction unit 30 is read out from the work result storage unit 320, and its function value (for example, the standardized IDF(P) described later) is calculated.
- the calculated function value of the reciprocal of the appearance frequency in the large document group P is sent to and stored in the work result storage unit 320, or directly sent to the originality calculation unit 250 and used for processing there.
- the originality calculation unit 250 reads the function values of the reciprocals of the appearance frequencies calculated by the first reciprocal calculation unit 230 and the second reciprocal calculation unit 240 from the work result storage unit 320, or receives them directly from the first reciprocal calculation unit 230 and the second reciprocal calculation unit 240. It also obtains the GF(E) calculated in the process of high-frequency word extraction in the high-frequency word extraction unit 30 from the work result storage unit 320.
- the originality calculation unit 250 calculates, as the originality, a function value obtained by subtracting the calculation result of the first reciprocal calculation unit 230 from the calculation result of the second reciprocal calculation unit 240.
- this difference may further be divided by the sum of the calculation result of the first reciprocal calculation unit 230 and the calculation result of the second reciprocal calculation unit 240, and may also be multiplied by the GF(E_u) in each document group E_u.
- the calculated originality is sent to and stored in the work result storage unit 320.
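The originality computation can be sketched as below. The exact standardization of IDF(S) and IDF(P) is described later in the specification; plain IDF is used here as a stand-in assumption.

```python
import math

def stand_in_idf(word, docs):
    """Stand-in for the standardized IDF: plain IDF over the given collection.
    The patent's exact standardization may differ."""
    dfv = sum(1 for d in docs if word in d)
    return math.log(len(docs) / dfv) if dfv else math.log(len(docs) + 1)

def originality(word, S_docs, P_docs, gf_e=1.0, normalize=True):
    """Originality: the whole-collection value (second reciprocal calculation
    unit) minus the document-group-group value (first unit); the normalized
    variant divides by their sum, and GF(E_u) may be multiplied in."""
    a = stand_in_idf(word, S_docs)   # reciprocal-frequency value within S
    b = stand_in_idf(word, P_docs)   # reciprocal-frequency value within P
    v = b - a
    if normalize and (a + b):
        v /= (a + b)
    return v * gf_e

# "niche" is common inside S but rare in the whole collection P -> original
S = [["niche", "x"], ["niche", "y"]]
P = S + [["z"]] * 6
assert originality("niche", S, P) > 0
```

Intuitively, a word that is frequent inside the document group group S but rare in the whole collection P gets a large originality.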
- the keyword extraction unit 260 includes the Skey (w) calculated by the Skey (w) calculation unit 80, the concentration calculated by the concentration calculation unit 210, the share and originality calculated by the share calculation unit 220. Each data of originality calculated in the calculation unit 250 is read from the work result storage unit 320.
- the keyword extraction unit 260 extracts keywords based on two or more indicators selected from the four read indicators: Skey(w), concentration, share, and originality.
- keywords may be extracted, for example, according to whether the total of the selected indicator values is equal to or greater than a predetermined threshold, or whether it falls within a predetermined rank; keywords may also be classified and extracted based on a combination of the selected indicators.
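No concrete selection rule is given beyond the threshold/rank options above; as an illustrative sketch in Python (the function name, the dictionary layout, and the simple sum-then-threshold rule are assumptions, not the patent's specification):

```python
def extract_keywords(indicator_values, threshold):
    """Keep index words whose summed selected-indicator values reach a threshold.

    indicator_values: {word: [values of two or more selected indicators]}
    Summing and thresholding is one of the options the text mentions.
    """
    return [word for word, values in indicator_values.items()
            if sum(values) >= threshold]

# Hypothetical scores for two selected indicators (e.g. Skey(w) and share):
print(extract_keywords({"shredder": [0.9, 0.8], "device": [0.1, 0.2]}, 1.0))
```

The rank-based variant mentioned in the text would instead sort words by the combined value and keep the top k.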
- the extracted keyword data is sent to and stored in the work result storage unit 320 of the recording device 3 and output by the output device 4 as necessary.
- FIG. 8 is a flowchart showing an operation procedure of the processing device 1 in the keyword extracting device of the third embodiment.
- the plurality of document groups E are, for example, individual clusters obtained by clustering a certain document group group S.
- by the same processing as in the first embodiment, the processing from step S10 to step S80 is executed for each document group E belonging to the document group group S, and the Skey(w) of each index word in each document group E is calculated. The processing up to the calculation of Skey(w) is the same as in FIG.
- the evaluation value calculation unit 200 calculates, for each document group E and each index word w, an evaluation value A(w, E) based on the function value of the appearance frequency of the index word w in that document group E (step S200).
- as the evaluation value A(w, E), for example, the above Skey(w) may be used as it is, or Skey(w)/N(E) or GF(E) × IDF(P) may be used.
- the concentration calculation unit 210 calculates the concentration for each index word w as follows (step S210).
- first, the sum of the evaluation values A(w, E) over all document groups E belonging to the document group group S is calculated; the ratio of the evaluation value A(w, E) in each document group E to this sum is then obtained.
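The ratio computation can be sketched as follows (Python; the function name and dictionary representation are assumptions, and how the per-group ratios are aggregated into a single concentration value is not fully specified in this excerpt):

```python
def evaluation_ratios(a_by_group):
    """For one index word w: ratio of the evaluation value A(w, E) in each
    document group E to the sum of A(w, E) over all groups E in S."""
    total = sum(a_by_group.values())
    return {group: value / total for group, value in a_by_group.items()}

# A word concentrated in E1 versus spread into E2:
print(evaluation_ratios({"E1": 3.0, "E2": 1.0}))  # E1 gets ratio 0.75
```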
- the share calculation unit 220 calculates the share of each document group Eu for each index word w as follows (step S220).
- the first reciprocal calculation unit 230 calculates, for each index word w, a function value of the reciprocal of the appearance frequency in the document group group S (step S230).
- as the appearance frequency in the document group group S, for example, the document frequency DF(S) is used.
- as the function value of the reciprocal of the appearance frequency, the inverse document frequency IDF(S) in the document group group S is used, or, as a particularly preferable example, IDF(S) normalized over all the index words extracted from the document group E to be analyzed (standardized IDF(S)).
- IDF(S) is the logarithm of "the reciprocal of DF(S) × the number of documents N(S) of the document group group S".
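That definition translates directly into code; a minimal sketch in Python (the natural logarithm and the function name are assumptions):

```python
import math

def idf(df, n_docs):
    """IDF: logarithm of (reciprocal of document frequency x number of documents).

    df: document frequency DF(S) of the index word in the document group group S.
    n_docs: number of documents N(S) in the document group group S.
    """
    return math.log(n_docs / df)

# A word in 10 of 1000 documents scores higher (rarer) than one in 500 of 1000:
print(idf(10, 1000) > idf(500, 1000))  # True
```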
- for the standardization, for example, a deviation score is used. The reason for the standardization is to align the distributions so that the originality can easily be calculated in combination with the IDF(P) described later.
- the second reciprocal calculation unit 240 calculates, for each index word w, a function value of the reciprocal of the appearance frequency in the large document group P including the document group group S (step S240).
- as this function value, IDF(P) is used, or, particularly preferably, IDF(P) normalized over all the index words extracted from the document group Eu to be analyzed (standardized IDF(P)).
- for the standardization, for example, a deviation score is used.
- the reason for the standardization is to align the distributions so that the originality can easily be calculated in combination with the above IDF(S).
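A deviation score rescales a set of values to mean 50 and standard deviation 10; the sketch below assumes that conventional form (the text only names a deviation score as one example of the standardization):

```python
import statistics

def deviation_scores(values):
    """Standardize a list of IDF values as deviation scores (mean 50, s.d. 10),
    so that IDF(S) and IDF(P) share a comparable distribution."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population s.d.; sample s.d. is also plausible
    return [50 + 10 * (v - mean) / sd for v in values]

print(deviation_scores([1.0, 2.0, 3.0]))  # the result is centered on 50
```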
- the originality calculation unit 250 calculates, for each index word w, a function value of {function value of IDF(S) − function value of IDF(P)} as the originality (step S250).
- the originality is calculated as one value for each index word w.
- when standardized IDF(S) or standardized IDF(P) normalized within each document group E is used, or when separate weighting such as GF(E) is applied, the originality is calculated for each document group E and each index word w.
- the originality is particularly preferably given by DEV = standardized GF(E) × {standardized IDF(S) − standardized IDF(P)} / {standardized IDF(S) + standardized IDF(P)}.
- the first factor of DEV, standardized GF(E), is the global frequency GF(E) of each index word w in the document group E to be analyzed, standardized over the index words.
- the second factor of DEV is positive if the standardized IDF value in the document group group S is larger than the standardized IDF value in the large document group P, and negative if it is smaller.
- a large IDF in the document group group S means that the word is rare within the document group group S.
- a word with a small IDF in the large document group P including the document group group S is frequently used in other fields; if such a word is nevertheless used in the field related to the document group group S, it can be said to have particular originality.
- the second factor of DEV lies in the range of −1 to +1, which makes it easy to compare different document groups E.
- DEV is proportional to the standardized GF(E), so words used more frequently in the target document group also receive higher values.
- each of these document groups E is set in turn as the analysis-target document group, and a ranking of originality is created for it.
- in this ranking, index words common across the document group group S fall to lower ranks, while words characteristic of each document group E rise to the top in each document group E; this is useful for grasping the characteristics of each document group E.
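Putting the two factors together, DEV can be sketched as below (Python; the exact composition is reconstructed from the description of the factors and is an assumption, since the typeset formula is not reproduced in this text):

```python
def dev(std_gf_e, std_idf_s, std_idf_p):
    """Originality DEV: standardized GF(E) times the normalized IDF difference.

    The second factor, (difference / sum), lies in -1..+1: positive when the
    word is rarer within the document group group S than in the large group P.
    """
    second_factor = (std_idf_s - std_idf_p) / (std_idf_s + std_idf_p)
    return std_gf_e * second_factor

print(dev(60.0, 70.0, 30.0) > 0)  # rare in S, common in P -> positive
print(dev(60.0, 30.0, 70.0) < 0)  # common in S, rare in P -> negative
```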
- the keyword extraction unit 260 extracts keywords based on two or more indicators selected from the four indicators Skey(w), concentration, share, and originality obtained in the above steps (step S260).
- each index word w of the target document group E is first classified as either a "non-important word" or an "important word", and important words are then extracted by classifying them into one of "technical domain words", "main words", "original words", and "other important words".
- the classification method is as follows.
- the first judgment uses Skey(w).
- a descending ranking of Skey(w) is created, and index words ranked below a predetermined rank are set as "non-important words" and excluded from keyword extraction targets. Since index words within the predetermined rank are important words in each document group E, they are classified as "important words" and further classified by the following judgments.
- the second judgment uses the concentration. Words with a low concentration are distributed throughout the document group group, and can therefore be regarded as broadly representing the technical domain to which the analysis-target document group belongs. Accordingly, an ascending ranking of the concentration in the document group group S is created, and words within a predetermined rank are defined as "technical domain words". Among the important words of each document group E, those that match the above technical domain words are classified as "technical domain words" of that document group E.
- the third judgment uses the share. A word with a high share accounts for a larger proportion of the analysis-target document group than other words, and can therefore be positioned as a word that explains the analysis-target document group well (a main word).
- in each document group E, a descending ranking of the share is created for the important words not classified by the second judgment, and words within a predetermined rank are designated as "main words".
- the fourth judgment uses the originality.
- a descending ranking of originality is created for the important words not classified by the third judgment, and words within a predetermined rank are designated as "original words".
- the remaining important words are classified as "other important words".
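The four-step cascade can be summarized as a sketch (Python; the boolean flags stand in for the rank tests described above and are an illustrative simplification):

```python
def classify(within_skey_rank, is_domain_word, within_share_rank, within_orig_rank):
    """Classify one index word by the four judgments in order:
    importance (Skey), concentration, share, originality."""
    if not within_skey_rank:
        return "non-important word"
    if is_domain_word:
        return "technical domain word"
    if within_share_rank:
        return "main word"
    if within_orig_rank:
        return "original word"
    return "other important word"

print(classify(True, False, False, True))  # original word
```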
- in the above, Skey(w) is used as the importance index for the first judgment.
- however, the present invention is not limited to this, and another index indicating the importance within the document group may be used.
- for example, GF(E) × IDF(P) may be used.
- in the above, classification was performed using the four indicators of importance, concentration, share, and originality; however, index words can also be classified using any two or more of these indicators.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2006542917A JPWO2006048998A1 (ja) | 2004-11-05 | 2005-10-11 | キーワード抽出装置 |
EP05793129A EP1830281A1 (en) | 2004-11-05 | 2005-10-11 | Keyword extracting device |
US11/667,097 US20080195595A1 (en) | 2004-11-05 | 2005-10-11 | Keyword Extracting Device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2004-322924 | 2004-11-05 | ||
JP2004322924 | 2004-11-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006048998A1 true WO2006048998A1 (ja) | 2006-05-11 |
Family
ID=36319012
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2005/018712 WO2006048998A1 (ja) | 2004-11-05 | 2005-10-11 | キーワード抽出装置 |
Country Status (6)
Country | Link |
---|---|
US (1) | US20080195595A1 (ja) |
EP (1) | EP1830281A1 (ja) |
JP (1) | JPWO2006048998A1 (ja) |
KR (1) | KR20070084004A (ja) |
CN (1) | CN101069177A (ja) |
WO (1) | WO2006048998A1 (ja) |
Families Citing this family (56)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8275661B1 (en) | 1999-03-31 | 2012-09-25 | Verizon Corporate Services Group Inc. | Targeted banner advertisements |
WO2000058863A1 (en) | 1999-03-31 | 2000-10-05 | Verizon Laboratories Inc. | Techniques for performing a data query in a computer system |
US8572069B2 (en) | 1999-03-31 | 2013-10-29 | Apple Inc. | Semi-automatic index term augmentation in document retrieval |
US6718363B1 (en) | 1999-07-30 | 2004-04-06 | Verizon Laboratories, Inc. | Page aggregation for web sites |
US6912525B1 (en) | 2000-05-08 | 2005-06-28 | Verizon Laboratories, Inc. | Techniques for web site integration |
US7657506B2 (en) * | 2006-01-03 | 2010-02-02 | Microsoft International Holdings B.V. | Methods and apparatus for automated matching and classification of data |
US8775930B2 (en) * | 2006-07-07 | 2014-07-08 | International Business Machines Corporation | Generic frequency weighted visualization component |
US7954052B2 (en) * | 2006-07-07 | 2011-05-31 | International Business Machines Corporation | Method for processing a web page for display in a wiki environment |
US20080010388A1 (en) * | 2006-07-07 | 2008-01-10 | Bryce Allen Curtis | Method and apparatus for server wiring model |
US8560956B2 (en) | 2006-07-07 | 2013-10-15 | International Business Machines Corporation | Processing model of an application wiki |
US20080010345A1 (en) * | 2006-07-07 | 2008-01-10 | Bryce Allen Curtis | Method and apparatus for data hub objects |
US8219900B2 (en) | 2006-07-07 | 2012-07-10 | International Business Machines Corporation | Programmatically hiding and displaying Wiki page layout sections |
US20080010338A1 (en) * | 2006-07-07 | 2008-01-10 | Bryce Allen Curtis | Method and apparatus for client and server interaction |
US8196039B2 (en) * | 2006-07-07 | 2012-06-05 | International Business Machines Corporation | Relevant term extraction and classification for Wiki content |
US20080010387A1 (en) * | 2006-07-07 | 2008-01-10 | Bryce Allen Curtis | Method for defining a Wiki page layout using a Wiki page |
US20080010386A1 (en) * | 2006-07-07 | 2008-01-10 | Bryce Allen Curtis | Method and apparatus for client wiring model |
US7996393B1 (en) | 2006-09-29 | 2011-08-09 | Google Inc. | Keywords associated with document categories |
US8131722B2 (en) | 2006-11-20 | 2012-03-06 | Ebay Inc. | Search clustering |
JP2008165303A (ja) * | 2006-12-27 | 2008-07-17 | Fujifilm Corp | コンテンツ登録装置、及びコンテンツ登録方法、及びコンテンツ登録プログラム |
CA2572116A1 (en) * | 2006-12-27 | 2008-06-27 | Ibm Canada Limited - Ibm Canada Limitee | System and method for processing multi-modal communication within a workgroup |
US7873640B2 (en) * | 2007-03-27 | 2011-01-18 | Adobe Systems Incorporated | Semantic analysis documents to rank terms |
US8990225B2 (en) * | 2007-12-17 | 2015-03-24 | Palo Alto Research Center Incorporated | Outbound content filtering via automated inference detection |
US8290946B2 (en) * | 2008-06-24 | 2012-10-16 | Microsoft Corporation | Consistent phrase relevance measures |
US8171031B2 (en) | 2008-06-27 | 2012-05-01 | Microsoft Corporation | Index optimization for ranking using a linear model |
US8161036B2 (en) * | 2008-06-27 | 2012-04-17 | Microsoft Corporation | Index optimization for ranking using a linear model |
JP4656202B2 (ja) * | 2008-07-22 | 2011-03-23 | ソニー株式会社 | 情報処理装置および方法、プログラム、並びに記録媒体 |
US20100131513A1 (en) | 2008-10-23 | 2010-05-27 | Lundberg Steven W | Patent mapping |
US8849649B2 (en) * | 2009-12-24 | 2014-09-30 | Metavana, Inc. | System and method for determining sentiment expressed in documents |
US9201863B2 (en) * | 2009-12-24 | 2015-12-01 | Woodwire, Inc. | Sentiment analysis from social media content |
US8463786B2 (en) | 2010-06-10 | 2013-06-11 | Microsoft Corporation | Extracting topically related keywords from related documents |
CN102314448B (zh) * | 2010-07-06 | 2013-12-04 | 株式会社理光 | 一种在文档中获得一个或多个关键元素的设备和方法 |
WO2012050247A1 (ko) * | 2010-10-13 | 2012-04-19 | 정보통신산업진흥원 | 인적 자원 역량 평가 시스템 및 방법 |
JP5545876B2 (ja) * | 2011-01-17 | 2014-07-09 | 日本電信電話株式会社 | クエリ提供装置、クエリ提供方法及びクエリ提供プログラム |
US9904726B2 (en) | 2011-05-04 | 2018-02-27 | Black Hills IP Holdings, LLC. | Apparatus and method for automated and assisted patent claim mapping and expense planning |
US8645381B2 (en) * | 2011-06-27 | 2014-02-04 | International Business Machines Corporation | Document taxonomy generation from tag data using user groupings of tags |
US20130086093A1 (en) | 2011-10-03 | 2013-04-04 | Steven W. Lundberg | System and method for competitive prior art analytics and mapping |
CN103890763B (zh) * | 2011-10-26 | 2017-09-12 | 国际商业机器公司 | 信息处理装置、数据存取方法以及计算机可读存储介质 |
TWI477996B (zh) * | 2011-11-29 | 2015-03-21 | Iq Technology Inc | 自動分析個人化輸入之方法 |
CN103198057B (zh) * | 2012-01-05 | 2017-11-07 | 深圳市世纪光速信息技术有限公司 | 一种自动给文档添加标签的方法和装置 |
JP5530476B2 (ja) * | 2012-03-30 | 2014-06-25 | 株式会社Ubic | 文書分別システム及び文書分別方法並びに文書分別プログラム |
JP5526209B2 (ja) * | 2012-10-09 | 2014-06-18 | 株式会社Ubic | フォレンジックシステムおよびフォレンジック方法並びにフォレンジックプログラム |
US20140280178A1 (en) * | 2013-03-15 | 2014-09-18 | Citizennet Inc. | Systems and Methods for Labeling Sets of Objects |
US20140379713A1 (en) * | 2013-06-21 | 2014-12-25 | Hewlett-Packard Development Company, L.P. | Computing a moment for categorizing a document |
KR101374197B1 (ko) * | 2013-10-02 | 2014-03-12 | 한국과학기술정보연구원 | 다종 리소스들의 의미기반 시차 조정 방법, 다종 리소스들의 의미기반 시차 조정 장치 및 다종 리소스들의 의미기반 시차를 조정하는 프로그램을 저장하는 저장 매체 |
US10572491B2 (en) | 2014-11-19 | 2020-02-25 | Google Llc | Methods, systems, and media for presenting related media content items |
US9529860B2 (en) * | 2014-12-01 | 2016-12-27 | Bank Of America Corporation | Keyword frequency analysis system |
US10409909B2 (en) | 2014-12-12 | 2019-09-10 | Omni Ai, Inc. | Lexical analyzer for a neuro-linguistic behavior recognition system |
JP5923806B1 (ja) * | 2015-04-09 | 2016-05-25 | 真之 正林 | 情報処理装置及び方法、並びにプログラム |
US10628431B2 (en) | 2017-04-06 | 2020-04-21 | Salesforce.Com, Inc. | Predicting a type of a record searched for by a user |
US10614061B2 (en) * | 2017-06-28 | 2020-04-07 | Salesforce.Com, Inc. | Predicting user intent based on entity-type search indexes |
CN108334533B (zh) * | 2017-10-20 | 2021-12-24 | 腾讯科技(深圳)有限公司 | 关键词提取方法和装置、存储介质及电子装置 |
JP6847812B2 (ja) * | 2017-10-25 | 2021-03-24 | 株式会社東芝 | 文書理解支援装置、文書理解支援方法、およびプログラム |
US10498898B2 (en) * | 2017-12-13 | 2019-12-03 | Genesys Telecommunications Laboratories, Inc. | Systems and methods for chatbot generation |
KR102018906B1 (ko) * | 2018-01-10 | 2019-09-05 | 주식회사 메디씨앤씨 | 키워드에 대한 타겟 사용자 그룹 선정 방법 및 이를 수행하는 컴퓨팅 시스템 |
KR102515655B1 (ko) | 2018-01-30 | 2023-03-30 | (주)광개토연구소 | 미래 연구 가능성 높은 기술 키워드 추천 장치 및 방법 |
CN110362673B (zh) * | 2019-07-17 | 2022-07-08 | 福州大学 | 基于摘要语义分析的计算机视觉类论文内容判别方法及系统 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6185592B1 (en) * | 1997-11-18 | 2001-02-06 | Apple Computer, Inc. | Summarizing text documents by resolving co-referentiality among actors or objects around which a story unfolds |
SE520533C2 (sv) * | 2001-03-13 | 2003-07-22 | Picsearch Ab | Metod, datorprogram och system för indexering av digitaliserade enheter |
US20040133560A1 (en) * | 2003-01-07 | 2004-07-08 | Simske Steven J. | Methods and systems for organizing electronic documents |
-
2005
- 2005-10-11 WO PCT/JP2005/018712 patent/WO2006048998A1/ja active Application Filing
- 2005-10-11 CN CNA2005800372605A patent/CN101069177A/zh active Pending
- 2005-10-11 EP EP05793129A patent/EP1830281A1/en not_active Withdrawn
- 2005-10-11 JP JP2006542917A patent/JPWO2006048998A1/ja not_active Withdrawn
- 2005-10-11 US US11/667,097 patent/US20080195595A1/en not_active Abandoned
- 2005-10-11 KR KR1020077010276A patent/KR20070084004A/ko not_active Application Discontinuation
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000276487A (ja) * | 1999-03-26 | 2000-10-06 | Mitsubishi Electric Corp | 事例蓄積・検索装置、並びに事例蓄積方法および事例検索方法、並びに事例蓄積プログラムを記録したコンピュータで読取可能な記録媒体および事例検索プログラムを記録したコンピュータで読取可能な記録媒体 |
Non-Patent Citations (2)
Title |
---|
NIWASE S. ET AL: "KeyGraph to Tokei Shuho o Mochiita Shohisha Image no sokutei Shuho", DAI 51 KAI JINKO CHINO KISORON KENKYUKAI SHIRYO, 30 January 2003 (2003-01-30), pages 25 - 30, XP003007169 * |
OHSAWA Y. ET AL: "KeyGraph: Go no Kyoki Graph no Bankatsu Togo ni yoru Keayword Chushutsu", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J82-D-I, no. 2, 25 February 1999 (1999-02-25), pages 391 - 399, XP003007168 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100462979C (zh) * | 2007-06-26 | 2009-02-18 | 腾讯科技(深圳)有限公司 | 分布式索引文件的检索方法、检索系统及检索服务器 |
JP2011242975A (ja) * | 2010-05-18 | 2011-12-01 | Nippon Telegr & Teleph Corp <Ntt> | 代表語抽出装置、代表語抽出方法および代表語抽出プログラム |
JP2012073804A (ja) * | 2010-09-28 | 2012-04-12 | Toshiba Corp | キーワード提示装置、方法及びプログラム |
US8812504B2 (en) | 2010-09-28 | 2014-08-19 | Kabushiki Kaisha Toshiba | Keyword presentation apparatus and method |
JP2014096105A (ja) * | 2012-11-12 | 2014-05-22 | Nippon Telegr & Teleph Corp <Ntt> | バーストワード抽出装置、方法、及びプログラム |
JP5792871B1 (ja) * | 2014-05-23 | 2015-10-14 | 日本電信電話株式会社 | 代表スポット出力方法、代表スポット出力装置および代表スポット出力プログラム |
JP2016103205A (ja) * | 2014-11-28 | 2016-06-02 | 富士通株式会社 | データ分類装置、データ分類プログラム、および、データ分類方法 |
EP3046037A1 (en) | 2015-01-15 | 2016-07-20 | Fujitsu Limited | Similarity determination apparatus, similarity determination method, and computer-readable recording medium |
US10025784B2 (en) | 2015-01-15 | 2018-07-17 | Fujitsu Limited | Similarity determination apparatus, similarity determination method, and computer-readable recording medium |
JP2016218512A (ja) * | 2015-05-14 | 2016-12-22 | 富士ゼロックス株式会社 | 情報処理装置及び情報処理プログラム |
Also Published As
Publication number | Publication date |
---|---|
US20080195595A1 (en) | 2008-08-14 |
EP1830281A1 (en) | 2007-09-05 |
JPWO2006048998A1 (ja) | 2008-05-22 |
KR20070084004A (ko) | 2007-08-24 |
CN101069177A (zh) | 2007-11-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2006048998A1 (ja) | キーワード抽出装置 | |
CN111177365B (zh) | 一种基于图模型的无监督自动文摘提取方法 | |
Alajmi et al. | Toward an ARABIC stop-words list generation | |
JP4233836B2 (ja) | 文書自動分類システム、不要語判定方法、文書自動分類方法、およびプログラム | |
CN105808524A (zh) | 一种基于专利文献摘要的专利自动分类方法 | |
JP4634736B2 (ja) | 専門的記述と非専門的記述間の語彙変換方法・プログラム・システム | |
US20070112720A1 (en) | Two stage search | |
WO2006115260A1 (ja) | 情報解析報告書自動作成装置、情報解析報告書自動作成プログラムおよび情報解析報告書自動作成方法 | |
JP2012027845A (ja) | 情報処理装置、関連文提供方法、及びプログラム | |
Wang et al. | How far we can go with extractive text summarization? Heuristic methods to obtain near upper bounds | |
Patil et al. | A novel feature selection based on information gain using WordNet | |
Madsen et al. | Pruning the vocabulary for better context recognition | |
JP4525433B2 (ja) | 文書集約装置及びプログラム | |
Olsen et al. | Full text searching and information overload | |
Prasad et al. | A survey paper on concept mining in text documents | |
Tong et al. | Integrating hedonic quality for user experience modelling | |
BAZRFKAN et al. | Using machine learning methods to summarize persian texts | |
Burmani et al. | Graph based method for Arabic text summarization | |
Sridharan et al. | Modeling word meaning: Distributional semantics and the corpus quality-quantity trade-off | |
McDonald et al. | Contextual Distinctiveness: a new lexical property computed from large corpora | |
Wang et al. | Predicting thread linking structure by lexical chaining | |
Bazghandi et al. | Extractive summarization Of Farsi documents based on PSO clustering | |
Choi et al. | Specificity and exhaustivity of bibliographic classifications–A cross-cultural comparison with text analytic approach | |
Barakat | What makes an (audio) book popular? | |
Keim et al. | Analyzing document collections via context-aware term extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
WWE | Wipo information: entry into national phase |
Ref document number: 200580037260.5 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2006542917 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1020077010276 Country of ref document: KR |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2005793129 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2005793129 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 11667097 Country of ref document: US |