WO2015159702A1 - Partial information extraction system - Google Patents

Partial information extraction system

Info

Publication number
WO2015159702A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
segment
partial
condition
feature vector
Prior art date
Application number
PCT/JP2015/060087
Other languages
English (en)
Japanese (ja)
Inventor
佳男 高枝
哲也 金田
弘海 矢野
康生 大原
Original Assignee
株式会社toor
サイバネットシステム株式会社
Priority date
Filing date
Publication date
Application filed by 株式会社toor, サイバネットシステム株式会社
Publication of WO2015159702A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to a partial information extraction system that further divides a plurality of pieces of information into partial information and extracts partial information close to target information.
  • Patent Document 1 calculates the appearance frequency of a keyword included in a document to be searched for each paragraph, and extracts a paragraph having a high appearance frequency.
  • the weight of important words that are used repeatedly in the conditional sentence is the same as the weight of other words that appear only once. That is, there is a problem that even if the conditional sentence is described in detail, the search accuracy does not change or is lowered. In addition, since the index is created only from the conditional sentence, the number of words is limited, and the accuracy of calculating the similarity between extracted partial documents is reduced. There is also the problem that a person needs to read everything, which takes effort and time.
  • the object of the present invention is to realize a partial search with high accuracy in a short time.
  • the inventors have found that it is effective not to use a keyword-based search method, but to generate feature vectors of the condition and of each unit document in the document group to be searched, based on the appearance frequency of words, and to compare the two. That is, it was found that when the condition is described in more detail, even general-purpose words related to the keywords tend to be used in it, so that keyword fluctuation due to synonyms and the like is mitigated and search accuracy improves.
  • an index that is the basis for calculating the appearance frequency of words is extracted from the condition
  • an index is extracted from the entire search target document.
  • a feature vector of a condition and a partial document (hereinafter referred to as a document segment) is also generated based on the index, and the similarity between the two is calculated.
  • the feature vector of the document segment does not change even if the conditional sentence is changed. Therefore, the feature vector of the document segment need only be calculated once, and it is not necessary to redo generation of the feature vector. Therefore, it is possible to extract similar document segments at high speed for various conditional sentences.
  • the similarity between the document segments included in the search result based on the condition can be calculated, and the search result can be clustered by content.
  • the partial information extraction method is a partial information extraction method for extracting partial information close to the concept of a condition from a plurality of pieces of information, comprising, in order: a vector generation procedure in which a feature vector generation unit generates an index from a group of information to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
  • a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates a similarity between the condition vector and the feature vector of each segment; and
  • a partial extraction procedure in which a partial extraction unit extracts segments close to the condition on a predetermined basis using the similarity between the condition vector and the feature vector of each segment.
  • a clustering procedure, in which a clustering unit calculates the similarity between the segments using the feature vectors of the segments extracted in the partial extraction procedure and, based on that similarity, classifies the extracted segments into a plurality of information clusters, may be further included after the partial extraction procedure.
  • a mapping procedure, in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the degree of similarity between the segments, may be further included after the partial extraction procedure.
  • the partial information extraction system is a partial information extraction system that extracts partial information close to the concept of a condition from a plurality of documents, comprising: a feature vector generation unit that generates an index from the information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index; a vector determination unit that generates a feature vector of the condition as a condition vector and calculates a similarity between the condition vector and the feature vector of each segment; and a partial extraction unit that extracts segments close to the condition on a predetermined basis using the similarity between the condition vector and the feature vector of each segment.
  • a clustering unit may be further provided that calculates the similarity between the segments using the feature vectors of the segments extracted by the partial extraction unit and, based on that similarity, classifies the extracted segments into a plurality of information clusters.
  • the partial information extraction system may further include a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.
  • a partial search with high accuracy can be realized in a short time.
  • FIG. 1 shows a configuration example of the partial information extraction system according to Embodiment 1.
  • FIG. 2 shows the sequence of the partial information extraction system according to Embodiment 1.
  • FIG. 3 shows a configuration example of the partial information extraction system according to Embodiment 2.
  • FIG. 4 shows the sequence of the partial information extraction system according to Embodiment 2.
  • FIG. 5 shows an example of a map.
  • FIG. 1 shows a configuration example of a partial information extraction system according to this embodiment.
  • the partial information extraction system according to this embodiment includes a server 10, a storage 20, and a user terminal 30.
  • the storage 20 is an arbitrary storage medium accessible from the server 10.
  • the server 10 and the user terminal 30 are computers having computer resources such as a CPU (Central Processing Unit) and a storage medium, and a program is installed in the storage medium. Any number of servers 10, storages 20, and user terminals 30 may be employed. In the present embodiment, a case where there is one server 10, two storages 20, and one user terminal 30 will be described.
  • the storage 20 holds an information group.
  • the information group includes arbitrary data transmitted and received via the communication network, for example text, numerical data, log data, and customer information. Examples of text include patents, papers, books, reports, and web pages. Examples of numerical data include sensor data, measurement data, and POS (point of sale) data. Examples of log data include online access data and status data of various devices. In the present embodiment, a case where the information is a document will be described as an example.
  • FIG. 2 shows a sequence of the partial information extraction system according to this embodiment.
  • the server 10 acquires a document from the storage 20, divides the acquired document into a plurality of predetermined segments, and generates a feature vector based on the vector space model based on the index for each segment (S101). It is preferable that the feature vector of each segment is stored in the secondary storage 20 separately from the original information group and used for the subsequent calculation of similarity.
  • the original information group is not used at all in the calculation stage, and is used only when displaying the original information in the final stage.
  • User terminal 30 transmits a condition via a communication network (S102).
  • the server 10 acquires the feature vector of each segment from the storage 20 (S103), extracts segments having a feature vector close to the feature vector of the condition (S104), and transmits the extraction result to the user terminal 30 (S105).
  • the user terminal 30 displays the extraction result received from the server 10 (S106).
  • the server 10 includes a communication function unit (not shown) that transmits / receives information to / from the user terminal 30 and the storage 20 via a communication network, and a configuration for extracting a segment.
  • the configuration for extracting a segment includes, for example, a feature vector generation unit 11, a vector determination unit 12, and a partial extraction unit 13.
  • the server 10 may be realized by causing a computer to function as the feature vector generation unit 11, the vector determination unit 12, and the partial extraction unit 13. In this case, each configuration is realized by the CPU in the server 10 executing a computer program stored in a storage unit (not shown).
  • the server 10 executes the partial information extraction method according to the present embodiment when extracting the segment.
  • the partial information extraction method according to this embodiment includes a vector generation procedure (S101), a vector determination procedure (S103), and a partial extraction procedure (S104) in this order.
  • In the vector generation procedure (S101), the feature vector generation unit 11 generates a feature vector based on the vector space model for each segment.
  • the elements constituting the feature vector, that is, the index, are not defined by the conditional sentence but are generated from the search target information group. Since the index of the feature vector does not depend on the conditional sentence, the feature vector does not deteriorate depending on how the conditional sentence is written. Further, even if the conditional sentence changes, the same segment feature vectors can always be reused, so the processing load on the server 10 is small.
  • the segment is, for example, a paragraph or a sentence.
  • paragraph units are identified by detecting line breaks.
  • a sentence unit is identified by detecting a full stop ".", a question mark "?", or an exclamation mark "!".
  • the index is a vector element, for example, a word list. In the present embodiment, as an example, a case where a segment is a paragraph and an index is a word list will be described.
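As a minimal sketch of the segmentation and indexing described above: paragraphs are split at line breaks, sentences after ".", "?" or "!", and the index is a word list built from the search target rather than from the condition. The function names and the simple lowercase tokenizer are assumptions made for illustration; the source does not specify how words are identified.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraph segments at line breaks, dropping empty lines."""
    return [p.strip() for p in text.splitlines() if p.strip()]


def split_sentences(text: str) -> list[str]:
    """Split text into sentence segments after '.', '?' or '!'."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [s for s in parts if s]


def tokenize(segment: str) -> list[str]:
    """Hypothetical tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", segment.lower())


def build_index(segments: list[str]) -> list[str]:
    """Build the word list (index) from all segments of the search target."""
    return sorted({word for seg in segments for word in tokenize(seg)})


if __name__ == "__main__":
    doc = "Cats purr. Dogs bark!\nBirds sing?"
    paragraphs = split_paragraphs(doc)
    print(paragraphs)
    print(build_index(paragraphs))
```

Because the index is derived only from the search target, it can be built once and reused for any condition.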
  • the vector determination unit 12 determines, for each segment $d_i$, the closeness of its content to the condition $d_k$. For example, the vector determination unit 12 vectorizes the condition $d_k$ based on the vector space model and then determines the closeness between the condition vector and the feature vector.
  • the information $d_i$ can be expressed as a matrix with respect to the elements $t_j$.
  • the condition can be described by a condition vector whose elements are the words included in the condition.
  • a segment can also be described by a segment vector whose elements are the words included in the segment.
  • the segment $d_i$ can be represented by the concept vector $d_i = (n_{i1}, n_{i2}, n_{i3}, \dots)$, where $n_{ij}$ is the number of appearances of the word $t_j$ in the segment $d_i$.
  • for example, if the numbers of appearances of the words $t_1$, $t_2$, $t_3$ are 0, 1, 0 in the segment $d_1$, 2, 1, 0 in the segment $d_2$, and 1, 2, 3 in the segment $d_3$, the matrix $M$ of the segments, with one row per segment and one column per word, is expressed as follows.

    $$M = \begin{pmatrix} 0 & 1 & 0 \\ 2 & 1 & 0 \\ 1 & 2 & 3 \end{pmatrix}$$
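The counting in the example above can be sketched as follows. The token lists are hypothetical stand-ins chosen so that the word counts match the values given in the text (0, 1, 0; 2, 1, 0; 1, 2, 3), and the row-per-segment layout is an assumption.

```python
from collections import Counter


def count_vector(tokens: list[str], index: list[str]) -> list[int]:
    """Return the number of appearances of each index word in one segment."""
    counts = Counter(tokens)
    return [counts[term] for term in index]


# Index (word list) and hypothetical tokenized segments.
index = ["t1", "t2", "t3"]
segments = {
    "d1": ["t2"],
    "d2": ["t1", "t1", "t2"],
    "d3": ["t1", "t2", "t2", "t3", "t3", "t3"],
}

# One row per segment, one column per index word.
M = [count_vector(tokens, index) for tokens in segments.values()]

if __name__ == "__main__":
    for seg_id, row in zip(segments, M):
        print(seg_id, row)
```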
  • the closeness of the contents of the segment $d_i$ and the condition $d_k$ can be quantified by a calculation on the feature vector $d_i$ and the condition vector $d_k$.
  • the calculation used for quantification may be a distance between vectors, or an arbitrary calculation such as an inner product or outer product.
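A short sketch of two of the calculations named above, the inner product and the distance between vectors. Cosine similarity is added as a common normalized variant of the inner product; it is an assumption here and is not named in the source.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Inner product normalized by vector lengths; 0.0 if either vector is zero."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0


if __name__ == "__main__":
    d1, d2 = [0, 1, 0], [2, 1, 0]
    print(inner_product(d1, d2), euclidean_distance(d1, d2), cosine_similarity(d1, d2))
```

Note that a distance decreases as contents get closer while an inner product increases, so the extraction step has to sort in the matching direction.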
  • words that are used commonly in all segments do not affect the closeness of the content of a document. Therefore, in calculating the vectors, it is preferable that words characteristic of each document contribute more to the vector than other words. For example, weighting is performed using the tfidf (Term Frequency-Inverse Document Frequency) method. Thereby, the precision of the closeness of the content of a segment can be improved.
  • the weight tfidf of a word that is used equally in every document is small, and the weight tfidf of a word that is used frequently only in particular documents is large.
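The weighting can be sketched as below. The source names the tfidf method but not a specific formula, so the variant used here (raw count multiplied by the logarithm of the number of segments over the number of segments containing the word) is an assumption. With it, a word present in every segment gets weight zero, in line with the behavior described above.

```python
import math


def tfidf_matrix(counts: list[list[int]]) -> list[list[float]]:
    """Weight a segment-by-word count matrix with tf * log(N / df)."""
    n_segments = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # df[j]: number of segments in which word j appears at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_segments / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


if __name__ == "__main__":
    M = [[0, 1, 0], [2, 1, 0], [1, 2, 3]]
    for row in tfidf_matrix(M):
        print([round(w, 3) for w in row])
```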
  • the determination of the closeness of the content may be performed based on the presence or absence of a word included in the condition, for example.
  • a determination is made based on whether the keyword is included or not included for each segment.
  • a logical expression is formed, and each segment is determined by a binary value indicating whether or not the logical expression is met.
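The keyword-based alternative above yields a binary result per segment. A minimal sketch, assuming a logical expression made of required, optional, and excluded keywords (the parameter names `all_of`, `any_of`, and `none_of` are hypothetical):

```python
def matches(tokens: set[str], all_of=(), any_of=(), none_of=()) -> bool:
    """Return True if the segment's words satisfy the logical expression.

    all_of:  every keyword must be present (AND)
    any_of:  at least one keyword must be present (OR); ignored if empty
    none_of: no keyword may be present (NOT)
    """
    return (
        all(word in tokens for word in all_of)
        and (not any_of or any(word in tokens for word in any_of))
        and not any(word in tokens for word in none_of)
    )


if __name__ == "__main__":
    segment = {"feature", "vector", "segment"}
    print(matches(segment, all_of=["vector"], none_of=["keyword"]))
```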
  • the partial extraction unit 13 extracts a segment having a vector close to a predetermined condition from a plurality of segments.
  • the segment to be extracted may be a predetermined number of segments, or may be a segment whose vector is within a predetermined proximity range. In this way, by extracting segments with similar vectors, it is possible to extract only portions that are close to the concept constituted by the search conditions.
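Both extraction criteria mentioned above, a fixed number of segments or a proximity range, can be sketched over precomputed segment vectors. Cosine similarity is assumed as the closeness measure, and the segment identifiers and vectors are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Inner product normalized by vector lengths; 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_segments(condition: list[float], segment_vecs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every segment against the condition vector, most similar first."""
    scored = [(seg_id, cosine_similarity(condition, vec)) for seg_id, vec in segment_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def extract_top_n(condition, segment_vecs, n: int):
    """Extract a predetermined number of the closest segments."""
    return rank_segments(condition, segment_vecs)[:n]


def extract_within(condition, segment_vecs, min_similarity: float):
    """Extract every segment whose similarity is within the given range."""
    return [pair for pair in rank_segments(condition, segment_vecs) if pair[1] >= min_similarity]


if __name__ == "__main__":
    # Segment vectors are computed once and reused for every new condition.
    segment_vecs = {"d1": [0, 1, 0], "d2": [2, 1, 0], "d3": [1, 2, 3]}
    print(extract_top_n([1, 1, 0], segment_vecs, n=2))
```

Only the condition vector changes between queries, which is why the segment vectors can be stored and reused.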
  • clustering processing may be performed.
  • the partial extraction unit 13 classifies the extracted segments into a plurality of information clusters based on the similarity between the segments using the similarity between the condition vector and the segment feature vector.
  • the classification is performed, for example, by grouping segments into common clusters in order of increasing vector distance. By performing the clustering process in this way, it is possible to provide the user terminal 30 with the result of classifying the contents described in each segment into a hierarchy.
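One way to read "in order from the closest vector distance" is agglomerative clustering, where the two closest clusters are merged repeatedly. The sketch below assumes single linkage, Euclidean distance, and a stopping threshold `max_distance`; none of these choices is specified in the source.

```python
import math
from itertools import combinations


def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors: dict[str, list[float]], max_distance: float) -> list[set[str]]:
    """Merge the closest clusters first until the next merge would exceed max_distance."""
    clusters = [{seg_id} for seg_id in vectors]

    def linkage(c1: set[str], c2: set[str]) -> float:
        # Single linkage: distance between the closest pair of members.
        return min(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    vecs = {"a": [0, 0], "b": [0, 1], "c": [5, 5], "d": [5, 6]}
    print(agglomerate(vecs, max_distance=2.0))
```

Recording the order of merges instead of stopping at a threshold would give the hierarchy mentioned above.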
  • the document in the present invention is not limited to this.
  • the segment is, for example, time or time, region or place, or attribution. If the document includes customer data, the segment is, for example, time or time, region or location, attribution, or age.
  • the unit of time is arbitrary, for example, it may be a second unit or a year unit.
  • vectorization based on the vector space model is performed as follows.
  • when the document is access log data of users of an online service, the number of accesses by the user $t_j$ between times $d_i$ and $d_i + T$ (time interval $T$) is $n_{ij}$.
  • when the document is sensor data, the output value of the sensor $t_j$ between times $d_i$ and $d_i + T$ (time interval $T$) is $n_{ij}$.
  • when the document is an image, the image $d_i$ is frequency-converted, and the value of each frequency component $t_j$ after the conversion is $n_{ij}$.
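For the access log case, the vectorization can be sketched by counting accesses per user in each time window. Consecutive, non-overlapping windows of equal width are an assumption, as are the event format `(timestamp, user)` and the function name.

```python
from collections import Counter


def window_vectors(
    events: list[tuple[float, str]],
    start: float,
    width: float,
    n_windows: int,
    users: list[str],
) -> list[list[int]]:
    """Row i holds per-user access counts in [start + i*width, start + (i+1)*width)."""
    counts = [Counter() for _ in range(n_windows)]
    for timestamp, user in events:
        i = int((timestamp - start) // width)
        if 0 <= i < n_windows:  # ignore events outside the covered time range
            counts[i][user] += 1
    return [[c[user] for user in users] for c in counts]


if __name__ == "__main__":
    log = [(0, "u1"), (5, "u1"), (12, "u2"), (19, "u1"), (25, "u2")]
    print(window_vectors(log, start=0, width=10, n_windows=2, users=["u1", "u2"]))
```

Each row then plays the role of a segment vector, with users taking the place of index words.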
  • the tfidf weighting is performed as follows.
  • when the document is access log data of users of an online service, the weight tfidf of a user who accesses at an even rate is small, and the weight tfidf of a user whose access is highly uneven is large.
  • when the document is sensor data, the weight tfidf of a sensor whose output value changes little is small, and the weight tfidf of a sensor whose output value changes greatly is large.
  • when the document is an image, the weight tfidf of a frequency whose component value varies little between images is small, and the weight tfidf of a frequency whose component value varies greatly between images is large.
  • FIG. 3 shows a configuration example of the partial information extraction system according to the present embodiment.
  • the partial information extraction system according to the present embodiment further includes a mapping unit 14 in addition to the configuration of the first embodiment.
  • FIG. 4 shows a sequence of the partial information extraction system according to this embodiment.
  • the partial information extraction method according to the present embodiment further includes a mapping procedure (S107) after the partial extraction procedure (S104) described in the first embodiment.
  • the server 10 transmits the map created by the mapping procedure to the user terminal 30 (S108).
  • the user terminal 30 displays the map received from the server 10 (S109).
  • In the mapping procedure (S107), points indicating the segments extracted by the partial extraction unit 13 and the condition are arranged on the map according to the closeness of the contents of their vectors, based on the vector values created by the vector determination unit 12.
  • the closeness between feature vectors is calculated and, based on that closeness, mapping is performed according to the closeness of contents between pieces of information, that is, the “semantic distance”.
  • the calculation may be a distance between vectors, or an arbitrary calculation such as inner product or outer product.
  • an information cluster including a plurality of segments may be arranged on the map. Based on the obtained closeness of contents between the pieces of information $d_i$, a map such as that shown in FIG. 5 can be created by using a mapping algorithm.
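The source does not say which mapping algorithm is used. As one possible illustration, the sketch below places points in two dimensions so that map distances approximate the given pairwise distances, using plain gradient descent on the squared distance error. The function name, starting layout, step count, and learning rate are all assumptions.

```python
import math


def layout_2d(dist: list[list[float]], steps: int = 500, lr: float = 0.05) -> list[tuple[float, float]]:
    """Place n points in 2D so that map distances approximate dist[i][j]."""
    n = len(dist)
    # Deterministic start: points spread evenly on a unit circle.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)] for i in range(n)]
    for _ in range(steps):
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                cur = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                err = (cur - dist[i][j]) / cur
                gx += err * dx
                gy += err * dy
            # Move point i to reduce the mismatch with its target distances.
            pos[i][0] -= lr * gx
            pos[i][1] -= lr * gy
    return [(x, y) for x, y in pos]


if __name__ == "__main__":
    # Items 0 and 1 are close in content; item 2 is far from both.
    target = [[0, 1, 4], [1, 0, 4], [4, 4, 0]]
    for point in layout_2d(target):
        print(point)
```

Segments with similar content end up near each other on the map, which is the property the "semantic distance" mapping relies on.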
  • the system according to the present embodiment can extract segments using concept search and map the distribution of the contents of each segment using a vector calculated using concept search.
  • the present invention can be applied to the information and communication industry.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The purpose of the present invention is to implement fast and accurate partial searches. The partial information extraction method according to the present invention comprises the following steps, in this order: a vector generation step (S101) in which information to be searched is divided into a predetermined plurality of segments and a feature vector is generated for each segment; a vector determination step (S103) in which a condition vector consisting of a feature vector associated with a condition is generated and, for each segment, the degree of similarity between the condition vector and the feature vector for that segment is calculated; and a partial extraction step (S104) in which the degrees of similarity between the condition vector and the feature vectors for the respective segments are used to extract segments that, according to predetermined criteria, are close to the aforementioned condition.
PCT/JP2015/060087 2014-04-14 2015-03-31 Partial information extraction system WO2015159702A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014-082779 2014-04-14
JP2014082779A JP2015203960A (ja) 2014-04-14 2014-04-14 Partial information extraction system

Publications (1)

Publication Number Publication Date
WO2015159702A1 (fr) 2015-10-22

Family

ID=54323913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/060087 WO2015159702A1 (fr) Partial information extraction system 2014-04-14 2015-03-31

Country Status (2)

Country Link
JP (1) JP2015203960A (fr)
WO (1) WO2015159702A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294733B (zh) * 2016-08-10 2019-05-07 成都轻车快马网络科技有限公司 Web page detection method based on text analysis
JP7068106B2 (ja) * 2018-08-28 2022-05-16 株式会社日立製作所 Test plan formulation support device, test plan formulation support method, and program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10207911A (ja) * 1996-11-25 1998-08-07 Fuji Xerox Co Ltd Document retrieval device
JP2004213626A (ja) * 2002-11-27 2004-07-29 Sony United Kingdom Ltd Information storage and retrieval
JP2004295712A (ja) * 2003-03-28 2004-10-21 Hitachi Ltd Similar document retrieval method and similar document retrieval device
JP2013182466A (ja) * 2012-03-02 2013-09-12 Kurimoto Ltd Web search system and web search method

Also Published As

Publication number Publication date
JP2015203960A (ja) 2015-11-16

Similar Documents

Publication Publication Date Title
US12086720B2 (en) Cooperatively training and/or using separate input and subsequent content neural networks for information retrieval
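CN107315759B (zh) Method, device and processing system for classifying keywords, and classification model generation method
CN108804641B (zh) Text similarity calculation method, apparatus, device and storage medium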
US9454602B2 (en) Grouping semantically related natural language specifications of system requirements into clusters
Alami et al. Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling
US10042896B2 (en) Providing search recommendation
CN102799647B (zh) Web page deduplication method and device
JP5817531B2 (ja) Document clustering system, document clustering method, and program
US8095546B1 (en) Book content item search
US10002188B2 (en) Automatic prioritization of natural language text information
US11580119B2 (en) System and method for automatic persona generation using small text components
US20130060769A1 (en) System and method for identifying social media interactions
WO2020114100A1 (fr) Information processing method and apparatus, and computer storage medium
US20180114136A1 (en) Trend identification using multiple data sources and machine learning techniques
US10936806B2 (en) Document processing apparatus, method, and program
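CN107688616B (zh) Surfacing unique facts about an entity
JP6346367B1 (ja) Similarity index value calculation device, similarity search device, and similarity index value calculation program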
Rashid et al. Analysis of streaming data using big data and hybrid machine learning approach
JP2015203961A (ja) Document extraction system
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
CN111737607B (zh) Data processing method, apparatus, electronic device, and storage medium
Komninos et al. Structured generative models of continuous features for word sense induction
WO2015159702A1 (fr) Partial information extraction system
TW201243627A (en) Multi-label text categorization based on fuzzy similarity and k nearest neighbors
US20230169103A1 (en) System and method for automatic profile segmentation using small text variations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15779391

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15779391

Country of ref document: EP

Kind code of ref document: A1