WO2015159702A1 - Partial information extraction system - Google Patents
Partial information extraction system
- Publication number
- WO2015159702A1 (application PCT/JP2015/060087)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- vector
- segment
- partial
- condition
- feature vector
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Definitions
- The present invention relates to a partial information extraction system that divides each of a plurality of pieces of information into partial information and extracts the partial information close to target information.
- Patent Document 1 calculates, for each paragraph, the appearance frequency of a keyword in the documents to be searched, and extracts paragraphs with a high appearance frequency.
- The weight of important words that are used repeatedly in the conditional sentence is the same as the weight of words that appear only once. That is, even if the conditional sentence is described in detail, the search accuracy does not improve and may even be lowered. In addition, because the index is created only from the conditional sentence, the number of words is limited, which reduces the accuracy of the similarity calculated between extracted partial documents. There is also the problem that a person must read everything, which takes time and effort.
- The object of the present invention is to realize a highly accurate partial search in a short time.
- The inventors found that it is effective not to use a keyword-based search method, but instead to generate feature vectors, based on the appearance frequency of words, for the condition and for each unit document of the document group to be searched, and to compare the two. That is, they found that when the condition is described in detail, even general-purpose words related to the keywords are often used; as a result, keyword variation due to synonyms and the like is mitigated and search accuracy improves.
- an index that is the basis for calculating the appearance frequency of words is extracted from the condition
- an index is extracted from the entire search target document.
- a feature vector of a condition and a partial document (hereinafter referred to as a document segment) is also generated based on the index, and the similarity between the two is calculated.
- the feature vector of the document segment does not change even if the conditional sentence is changed. Therefore, the feature vector of the document segment need only be calculated once, and it is not necessary to redo generation of the feature vector. Therefore, it is possible to extract similar document segments at high speed for various conditional sentences.
- The similarity between the document segments included in the search result for the condition can be calculated, and the search results can be clustered by content.
- The partial information extraction method is a method for extracting, from a plurality of pieces of information, partial information close to the concept of a condition, and includes the following procedures in order:
- a vector generation procedure in which a feature vector generation unit generates an index from a group of information to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
- a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and
- a partial extraction procedure in which a partial extraction unit extracts, on a predetermined basis, the segments close to the condition using the similarity between the condition vector and the feature vector of each segment.
- The method may further include, after the partial extraction procedure, a clustering procedure in which a clustering unit calculates the similarity between segments using the feature vectors of the segments extracted in the partial extraction procedure and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
- The method may further include, after the partial extraction procedure, a mapping procedure in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the degree of similarity between the segments.
- The partial information extraction system is a system that extracts, from a plurality of documents, partial information close to the concept of a condition, and includes: a feature vector generation unit that generates an index from the information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index; a vector determination unit that generates a feature vector of the condition as a condition vector and calculates the similarity between the condition vector and the feature vector of each segment; and a partial extraction unit that extracts, on a predetermined basis, the segments close to the condition using the similarity between the condition vector and the feature vector of each segment.
- The system may further include a clustering unit that calculates the similarity between segments using the feature vectors of the segments extracted by the partial extraction unit and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
- the partial information extraction system may further include a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.
- A highly accurate partial search can be realized in a short time.
- A configuration example of the partial information extraction system according to Embodiment 1 is shown.
- A sequence of the partial information extraction system according to Embodiment 1 is shown.
- A configuration example of the partial information extraction system according to Embodiment 2 is shown.
- A sequence of the partial information extraction system according to Embodiment 2 is shown.
- An example of a map is shown.
- FIG. 1 shows a configuration example of a partial information extraction system according to this embodiment.
- the partial information extraction system according to this embodiment includes a server 10, a storage 20, and a user terminal 30.
- the storage 20 is an arbitrary storage medium accessible from the server 10.
- the server 10 and the user terminal 30 are computers having computer resources such as a CPU (Central Processing Unit) and a storage medium, and a program is installed in the storage medium. Any number of servers 10, storages 20, and user terminals 30 may be employed. In the present embodiment, a case where there is one server 10, two storages 20, and one user terminal 30 will be described.
- the storage 20 holds an information group.
- The information group includes arbitrary data transmitted and received via a communication network, for example text, numerical data, log data, and customer information. Examples of text include patents, papers, books, reports, and web pages. Examples of numerical data include sensor data, measurement data, and POS (Point Of Sales) data. Examples of log data include online access data and status data of various devices. In the present embodiment, the case where the information is a document is described as an example.
- FIG. 2 shows a sequence of the partial information extraction system according to this embodiment.
- the server 10 acquires a document from the storage 20, divides the acquired document into a plurality of predetermined segments, and generates a feature vector based on the vector space model based on the index for each segment (S101). It is preferable that the feature vector of each segment is stored in the secondary storage 20 separately from the original information group and used for the subsequent calculation of similarity.
- the original information group is not used at all in the calculation stage, and is used only when displaying the original information in the final stage.
- User terminal 30 transmits a condition via a communication network (S102).
- The server 10 acquires the feature vector of each segment from the storage 20 (S103), extracts the segments whose feature vectors are close to the feature vector of the condition (S104), and transmits the extraction result to the user terminal 30 (S105).
- the user terminal 30 displays the extraction result received from the server 10 (S106).
- the server 10 includes a communication function unit (not shown) that transmits / receives information to / from the user terminal 30 and the storage 20 via a communication network, and a configuration for extracting a segment.
- the configuration for extracting a segment includes, for example, a feature vector generation unit 11, a vector determination unit 12, and a partial extraction unit 13.
- the server 10 may be realized by causing a computer to function as the feature vector generation unit 11, the vector determination unit 12, and the partial extraction unit 13. In this case, each configuration is realized by the CPU in the server 10 executing a computer program stored in a storage unit (not shown).
- the server 10 executes the partial information extraction method according to the present embodiment when extracting the segment.
- the partial information extraction method according to this embodiment includes a vector generation procedure (S101), a vector determination procedure (S103), and a partial extraction procedure (S104) in this order.
- In the vector generation procedure (S101), the feature vector generation unit 11 generates a feature vector based on the vector space model for each segment.
- The elements constituting the feature vector, that is, the index, are not defined by the conditional sentence but are generated from the search target information group. Because the index does not depend on the conditional sentence, the feature vector does not deteriorate depending on how the conditional sentence is written. Furthermore, even if the conditional sentence changes, the same feature vector can always be used for a given segment, so the processing load on the server 10 is small.
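The independence of the index from the condition can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw word counts; the function names `build_index` and `vectorize` and the sample texts are hypothetical and not taken from the patent.

```python
from collections import Counter

def build_index(segments):
    """Word list (the index) drawn from the search-target segments only."""
    return sorted({w for seg in segments for w in seg.lower().split()})

def vectorize(text, index):
    """Count how often each index word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in index]

segments = ["red apple pie", "green apple", "blue sky"]
index = build_index(segments)

# Segment vectors are computed once and can be stored for later queries.
segment_vectors = [vectorize(s, index) for s in segments]

# Two different conditions reuse the same index and the same segment vectors.
cond_a = vectorize("apple pie recipe", index)  # "recipe" is not in the index
cond_b = vectorize("sky", index)
```

Words in the condition that are absent from the index, such as "recipe" here, simply contribute nothing to the condition vector.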
- the segment is, for example, a paragraph or a sentence.
- Paragraph units are identified by detecting line breaks.
- Sentence units are identified by detecting a period “.”, a question mark “?”, or an exclamation mark “!”.
- the index is a vector element, for example, a word list. In the present embodiment, as an example, a case where a segment is a paragraph and an index is a word list will be described.
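As a rough sketch of the segmentation described above, the following splits text into paragraph or sentence segments and builds a word-list index. The regular expressions and function names are assumptions for illustration; the patent does not specify a tokenizer.

```python
import re

def split_paragraphs(text):
    """Treat each non-empty line as one paragraph segment."""
    return [p.strip() for p in text.splitlines() if p.strip()]

def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def word_index(segments):
    """Sorted list of distinct lower-cased words found in the segments."""
    return sorted({w for s in segments for w in re.findall(r"\w+", s.lower())})

paragraphs = split_paragraphs("First paragraph.\n\nSecond paragraph.\n")
sentences = split_sentences("Vectors are built once. Are they reused? Yes!")
index = word_index(["Blue sky.", "Sky high"])
```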
- The vector determination unit 12 determines, for each segment d_i, how close its content is to the condition d_k. For example, the vector determination unit 12 vectorizes the condition d_k based on the vector space model and then determines the proximity between the condition vector and each feature vector.
- The information d_i can be expressed as a matrix with respect to the elements t_j.
- the condition can be described by a condition vector whose elements are words included in the condition.
- a segment can also be described by a segment vector whose elements are words included in the segment.
- The segment d_i can be represented by a concept vector d_i = (n_i1, n_i2, n_i3, ...).
- For example, if the words t_1, t_2, t_3 occur 0, 1, and 0 times respectively in segment d_1, 2, 1, and 0 times respectively in segment d_2, and 1, 2, and 3 times respectively in segment d_3, the matrix M of the segments is expressed as follows.
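Assuming that rows correspond to the segments d_1 to d_3 and columns to the words t_1 to t_3 (the orientation is an assumption, since it is not stated explicitly in the text), the occurrence counts given above yield:

```latex
M =
\begin{pmatrix}
n_{11} & n_{12} & n_{13} \\
n_{21} & n_{22} & n_{23} \\
n_{31} & n_{32} & n_{33}
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 \\
2 & 1 & 0 \\
1 & 2 & 3
\end{pmatrix}
```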
- The closeness of content between segment d_i and the condition d_k can be quantified by a calculation on the feature vector d_i and the condition vector d_k.
- The calculation used for quantification may be a distance between vectors, or any other calculation such as an inner product or outer product.
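The following sketch applies two of the calculations mentioned above, the inner product and the Euclidean distance, to the example counts for segments d_1 to d_3. The condition vector `d_k` is a hypothetical value chosen for illustration, and cosine similarity is added as one common normalized form of the inner product; the patent does not name it.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Inner product normalized by vector lengths (0.0 for a zero vector)."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

M = [[0, 1, 0], [2, 1, 0], [1, 2, 3]]  # rows: segments d_1..d_3
d_k = [1, 1, 0]                        # hypothetical condition vector

inner = [dot(row, d_k) for row in M]
distances = [euclidean(row, d_k) for row in M]
cosines = [cosine(row, d_k) for row in M]
closest = max(range(len(M)), key=lambda i: cosines[i])
```

The measures need not agree: here d_1 and d_2 are equally distant from the condition, while the cosine measure ranks d_2 highest.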
- Words that are used commonly across all segments do not affect the closeness of content. Therefore, when calculating the vectors, it is preferable to make words that are characteristic of a given document contribute more to the vector than other words. For example, weighting is performed using the tfidf (Term Frequency Inverse Document Frequency) method. This improves the precision of the closeness of content between segments.
- The tfidf weight of a word that is used to the same extent in every document is small, and the tfidf weight of a word whose frequency of use differs greatly from document to document is large.
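A minimal sketch of such a weighting is shown below. It assumes one common tfidf variant, raw count multiplied by log(N / df), where N is the number of segments and df the number of segments containing the word; the patent does not state which formula it uses.

```python
import math

def tfidf_matrix(counts):
    """Weight raw counts by log(N / df); a word present in every row gets weight 0."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of rows (segments) in which term j occurs at least once
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]

M = [[0, 1, 0], [2, 1, 0], [1, 2, 3]]  # example counts for d_1..d_3
W = tfidf_matrix(M)
```

In this example t_2 occurs in all three segments, so its column becomes zero, while t_3, which occurs only in d_3, receives the largest weight.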
- the determination of the closeness of the content may be performed based on the presence or absence of a word included in the condition, for example.
- a determination is made based on whether the keyword is included or not included for each segment.
- a logical expression is formed, and each segment is determined by a binary value indicating whether or not the logical expression is met.
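The keyword-based alternative can be sketched as a binary test per segment. The logical expression used here, all required words present and no excluded word present, is only one assumed form; `matches` and its parameters are hypothetical names.

```python
def matches(segment, required=(), excluded=()):
    """True if every required word is present and no excluded word is present."""
    words = set(segment.lower().split())
    has_required = all(w in words for w in required)
    has_excluded = any(w in words for w in excluded)
    return has_required and not has_excluded

segments = [
    "feature vector of each segment",
    "keyword search over documents",
    "vector space model",
]
# Logical expression: "vector" AND NOT "segment"
flags = [matches(s, required=("vector",), excluded=("segment",)) for s in segments]
```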
- the partial extraction unit 13 extracts a segment having a vector close to a predetermined condition from a plurality of segments.
- the segment to be extracted may be a predetermined number of segments, or may be a segment whose vector is within a predetermined proximity range. In this way, by extracting segments with similar vectors, it is possible to extract only portions that are close to the concept constituted by the search conditions.
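Both extraction criteria, a fixed number of segments or a proximity range, can be sketched as follows. The similarity values, the threshold, and the function names are hypothetical; a higher value is assumed to mean a closer segment.

```python
def extract_top_n(similarities, n):
    """Indices of the n segments with the highest similarity, best first."""
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    return ranked[:n]

def extract_within(similarities, threshold):
    """Indices of all segments whose similarity reaches the threshold."""
    return [i for i, s in enumerate(similarities) if s >= threshold]

sims = [0.71, 0.95, 0.57]  # hypothetical similarity of each segment to the condition
top_two = extract_top_n(sims, 2)
close_enough = extract_within(sims, 0.7)
```

If a distance is used instead of a similarity, the comparison direction would be reversed.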
- clustering processing may be performed.
- the partial extraction unit 13 classifies the extracted segments into a plurality of information clusters based on the similarity between the segments using the similarity between the condition vector and the segment feature vector.
- The classification is performed, for example, by grouping segments into common clusters in order of increasing vector distance, starting from the closest pair. By performing the clustering process in this way, the result of classifying the contents of the segments into a hierarchy can be provided to the user terminal 30.
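One way to group segments starting from the closest pair is single-linkage agglomerative clustering, sketched below. The choice of single linkage, the Euclidean distance, and the stopping threshold `max_distance` are assumptions; the patent only states that clusters are formed in order of closeness.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(vectors, max_distance):
    """Merge clusters pair by pair in order of increasing distance."""
    label = list(range(len(vectors)))  # initial cluster id of each segment
    pairs = sorted(combinations(range(len(vectors)), 2),
                   key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]))
    merges = []  # merge history, closest pairs first (the hierarchy)
    for i, j in pairs:
        d = euclidean(vectors[i], vectors[j])
        if d > max_distance:
            break  # all remaining pairs are even farther apart
        if label[i] != label[j]:
            old, new = label[j], label[i]
            label = [new if l == old else l for l in label]
            merges.append((i, j, d))
    return label, merges

vectors = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
labels, merges = cluster(vectors, max_distance=2.0)
```

The `merges` list records the order in which clusters were joined, which corresponds to the hierarchy mentioned above; raising `max_distance` merges more distant groups.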
- the document in the present invention is not limited to this.
- The segment is, for example, a time period or point in time, a region or place, or an attribute. If the document includes customer data, the segment is, for example, a time period or point in time, a region or place, an attribute, or an age.
- the unit of time is arbitrary, for example, it may be a second unit or a year unit.
- vectorization based on the vector space model is performed as follows.
- When the document is user access log data from an online service, the number of accesses by user t_j between times d_i and d_i + T (time interval T) is n_ij.
- When the document is sensor data, the output value of sensor t_j between times d_i and d_i + T (time interval T) is n_ij.
- When the document is an image, the image d_i is frequency-converted, and the value of each frequency component t_j after the conversion is n_ij.
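For the access log case, the counts n_ij can be accumulated per time window as in the following sketch. The event format of (timestamp, user) pairs, the user names, and the function name `count_matrix` are hypothetical.

```python
def count_matrix(events, users, start, interval, n_windows):
    """Return n[i][j]: number of accesses by user j in time window i."""
    col = {u: j for j, u in enumerate(users)}
    n = [[0] * len(users) for _ in range(n_windows)]
    for ts, user in events:
        i = int((ts - start) // interval)  # index of the window containing ts
        if 0 <= i < n_windows and user in col:
            n[i][col[user]] += 1
    return n

events = [(0, "alice"), (5, "bob"), (12, "alice"), (14, "alice"), (25, "bob")]
n = count_matrix(events, ["alice", "bob"], start=0, interval=10, n_windows=3)
```

Each row of `n` then plays the role of a segment vector, with the time window as the segment and the users as the index.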
- The tfidf weighting is performed as follows.
- When the document is user access log data from an online service, the tfidf weight of a user who accesses evenly is small, and the tfidf weight of a user whose access is highly uneven is large.
- When the document is sensor data, the tfidf weight of a sensor whose output value changes little is small, and that of a sensor whose output value changes greatly is large.
- When the document is an image, the tfidf weight of a frequency component whose value varies little between images is small, and that of a frequency component whose value varies greatly between images is large.
- FIG. 3 shows a configuration example of the partial information extraction system according to the present embodiment.
- the partial information extraction system according to the present embodiment further includes a mapping unit 14 in addition to the configuration of the first embodiment.
- FIG. 4 shows a sequence of the partial information extraction system according to this embodiment.
- the partial information extraction method according to the present embodiment further includes a mapping procedure (S107) after the partial extraction procedure (S104) described in the first embodiment.
- the server 10 transmits the map created by the mapping procedure to the user terminal 30 (S108).
- the user terminal 30 displays the map received from the server 10 (S109).
- In the mapping procedure (S107), points representing the condition and the segments extracted by the partial extraction unit 13 are arranged on the map according to the closeness of content between the vectors, based on the vector values created by the vector determination unit 12.
- the closeness between feature vectors is calculated, and based on the closeness between the vectors, mapping based on the closeness of contents between information, that is, “semantic distance” is performed.
- the calculation may be a distance between vectors, or an arbitrary calculation such as inner product or outer product.
- An information cluster including a plurality of segments may be arranged on the map. Based on the closeness of content between the pieces of information d_i, a map such as the one shown in FIG. 5 can be created using a mapping algorithm.
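The patent does not name the mapping algorithm. As one possible choice, the sketch below uses metric multidimensional scaling with the SMACOF iteration to place points in two dimensions so that map distances approximate a given matrix of pairwise distances. The function name `mds_map`, the iteration count, and the example distances are assumptions.

```python
import math
import random

def mds_map(D, iterations=500, seed=0):
    """Place n points in 2D so that map distances approximate D (SMACOF)."""
    n = len(D)
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        # Guttman transform: X <- (1/n) * B(X) * X
        B = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i != j:
                    d = math.dist(X[i], X[j])
                    B[i][j] = -D[i][j] / d if d > 1e-12 else 0.0
            B[i][i] = -sum(B[i][j] for j in range(n) if j != i)
        X = [[sum(B[i][j] * X[j][k] for j in range(n)) / n for k in range(2)]
             for i in range(n)]
    return X

# Hypothetical pairwise "semantic distances" between three segments.
D = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
coords = mds_map(D)
```

The resulting coordinates are determined only up to rotation, reflection, and translation, so only the relative positions on the map carry meaning.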
- the system according to the present embodiment can extract segments using concept search and map the distribution of the contents of each segment using a vector calculated using concept search.
- the present invention can be applied to the information and communication industry.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An object of the present invention is to realize fast and accurate partial searches. The partial information extraction method according to the present invention includes the following steps, in this order: a vector generation step (S101) in which the information to be searched is divided into a predetermined plurality of segments and a feature vector is generated for each segment; a vector determination step (S103) in which a condition vector, consisting of a feature vector associated with a condition, is generated and, for each segment, the degree of similarity between the condition vector and the feature vector of that segment is calculated; and a partial extraction step (S104) in which the degrees of similarity between the condition vector and the feature vectors of the respective segments are used to extract segments that, according to predetermined criteria, are close to the condition.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014-082779 | 2014-04-14 | ||
JP2014082779A JP2015203960A (ja) | 2014-04-14 | 2014-04-14 | Partial information extraction system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015159702A1 (fr) | 2015-10-22 |
Family
ID=54323913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/060087 WO2015159702A1 (fr) | 2014-04-14 | 2015-03-31 | Partial information extraction system |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2015203960A (fr) |
WO (1) | WO2015159702A1 (fr) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294733B (zh) * | 2016-08-10 | 2019-05-07 | 成都轻车快马网络科技有限公司 | Web page detection method based on text analysis |
JP7068106B2 (ja) * | 2018-08-28 | 2022-05-16 | 株式会社日立製作所 | Test plan formulation support device, test plan formulation support method, and program |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10207911A (ja) * | 1996-11-25 | 1998-08-07 | Fuji Xerox Co Ltd | Document retrieval device |
JP2004213626A (ja) * | 2002-11-27 | 2004-07-29 | Sony United Kingdom Ltd | Information storage and retrieval |
JP2004295712A (ja) * | 2003-03-28 | 2004-10-21 | Hitachi Ltd | Similar document search method and similar document search device |
JP2013182466A (ja) * | 2012-03-02 | 2013-09-12 | Kurimoto Ltd | Web search system and Web search method |
- 2014-04-14: JP application JP2014082779A, published as JP2015203960A (status: Active, Pending)
- 2015-03-31: WO application PCT/JP2015/060087, published as WO2015159702A1 (status: Active, Application Filing)
Also Published As
Publication number | Publication date |
---|---|
JP2015203960A (ja) | 2015-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12086720B2 (en) | Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval | |
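CN107315759B (zh) | Method, device, and processing system for classifying keywords, and classification model generation method | |
CN108804641B (zh) | Text similarity calculation method, apparatus, device, and storage medium | |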
US9454602B2 (en) | Grouping semantically related natural language specifications of system requirements into clusters | |
Alami et al. | Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling | |
US10042896B2 (en) | Providing search recommendation | |
CN102799647B (zh) | Web page deduplication method and device | |
JP5817531B2 (ja) | Document clustering system, document clustering method, and program | |
US8095546B1 (en) | Book content item search | |
US10002188B2 (en) | Automatic prioritization of natural language text information | |
US11580119B2 (en) | System and method for automatic persona generation using small text components | |
US20130060769A1 (en) | System and method for identifying social media interactions | |
WO2020114100A1 (fr) | Information processing method and apparatus, and computer storage medium | |
US20180114136A1 (en) | Trend identification using multiple data sources and machine learning techniques | |
US10936806B2 (en) | Document processing apparatus, method, and program | |
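CN107688616B (zh) | Surfacing unique facts for entities | |
JP6346367B1 (ja) | Similarity index value calculation device, similarity search device, and similarity index value calculation program | |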
Rashid et al. | Analysis of streaming data using big data and hybrid machine learning approach | |
JP2015203961A (ja) | Document extraction system | |
Al Mostakim et al. | Bangla content categorization using text based supervised learning methods | |
CN111737607B (zh) | Data processing method, apparatus, electronic device, and storage medium | |
Komninos et al. | Structured generative models of continuous features for word sense induction | |
WO2015159702A1 (fr) | Partial information extraction system | |
TW201243627A (en) | Multi-label text categorization based on fuzzy similarity and k nearest neighbors | |
US20230169103A1 (en) | System and method for automatic profile segmentation using small text variations |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15779391; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 15779391; Country of ref document: EP; Kind code of ref document: A1 |