WO2015159702A1

WO2015159702A1 - Partial-information extraction system

Info

Publication number: WO2015159702A1
Application number: PCT/JP2015/060087
Authority: WO
Inventors: 佳男高枝; 哲也金田; 弘海矢野; 康生大原
Original assignee: 株式会社ｔｏｏｒ; サイバネットシステム株式会社
Priority date: 2014-04-14
Filing date: 2015-03-31
Publication date: 2015-10-22
Also published as: JP2015203960A

Abstract

The purpose of this invention is to implement fast, precise partial searches. The partial-information extraction method in this invention includes the following steps, in this order: a vector generation step (S101) in which information being searched is divided into a predetermined plurality of segments and a feature vector is generated for each segment; a vector determination step (S103) in which a condition vector consisting of a feature vector associated with a condition is generated, and for each segment, the degree of similarity between the condition vector and the feature vector for that segment is calculated; and a partial extraction step (S104) in which the degrees of similarity between the condition vector and the feature vectors for the respective segments are used to extract segments that, using predetermined criteria, are close to the aforementioned condition.

Description

Partial information extraction system

The present invention relates to a partial information extraction system that further divides a plurality of pieces of information into partial information and extracts partial information close to target information.

Take a document as an example of information. Systems that search a large number of documents for documents with similar content have been proposed (see, for example, Patent Document 1). Patent Document 1 calculates, for each paragraph, the appearance frequency of keywords in a document to be searched, and extracts paragraphs with a high appearance frequency.

Patent Document 1: JP 2013-30089 A

Consider using the descriptive content one wants to find as a search condition and extracting, from a group of search-target text, the partial descriptions close to that condition. In the invention of Patent Document 1, words for creating an index are extracted from the conditional sentence, the appearance frequency of each index word is calculated for each page of the search-target document, and the document pages are weighted. In this method, however, the index differs with each conditional sentence, so the index-based word frequencies of the target documents must be recalculated every time the conditional sentence changes, which takes a long calculation time. Furthermore, the conditional sentence is used only for index extraction, and the appearance frequency of words within the conditional sentence is not calculated. An important word that is used repeatedly in the conditional sentence therefore carries the same weight as a word that appears only once. That is, even if the conditional sentence is written in detail, the search accuracy does not change or even decreases. In addition, since the index is created only from the conditional sentence, the number of index words is limited, and the accuracy of calculating the similarity between extracted partial documents is reduced. There is also the problem that a person must read through all of the results, which takes effort and time.

As described above, in the invention of Patent Document 1, the index changes every time the conditional sentence is changed, so the appearance frequency of words in the documents must be recalculated from the index each time. There is also the problem that search accuracy cannot be improved. Furthermore, finding the truly desired information in the extraction results takes time and effort.

The object of the present invention is to realize a partial search with high accuracy in a short time.

Conventional keyword-based search methods have the problem that a sentence using a synonym in place of a keyword cannot be found, even if the sentence is important in terms of content. Various countermeasures, such as using a synonym dictionary, have been proposed, but because the results depend on the developer (for example, on how the dictionary was created), the search results lack reproducibility.

The inventors have found that it is effective, instead of using a keyword-based search method, to generate feature vectors based on word appearance frequency for both the condition and each unit document of the document group to be searched, and to compare the two. That is, they found that when the condition is described in detail, words related to the keywords tend to be used as well, even if they are general-purpose words; as a result, keyword variation caused by synonyms and the like is mitigated and search accuracy improves.

Furthermore, if an index that is the basis for calculating the appearance frequency of words is extracted from the condition, there is a problem that the index changes every time the condition changes. In order to solve this problem, an index is extracted from the entire search target document. A feature vector of a condition and a partial document (hereinafter referred to as a document segment) is also generated based on the index, and the similarity between the two is calculated. By using this method, the feature vector of the document segment does not change even if the conditional sentence is changed. Therefore, the feature vector of the document segment need only be calculated once, and it is not necessary to redo generation of the feature vector. Therefore, it is possible to extract similar document segments at high speed for various conditional sentences.
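The following Python sketch illustrates the idea in the paragraph above: the index is built once from the search-target segments, and any number of conditions are then vectorized against that same index. The function names, the sample segments, and the whitespace tokenization are assumptions made for illustration and do not come from the original text.

```python
from collections import Counter

def build_index(segments):
    """Collect every distinct word in the search-target segments, sorted."""
    return sorted({word for seg in segments for word in seg.split()})

def vectorize(text, index):
    """Count occurrences of each index word; words outside the index are ignored."""
    counts = Counter(text.split())
    return [counts[word] for word in index]

# Hypothetical search-target segments (whitespace tokenization is an assumption).
segments = ["the pump stopped", "the valve opened", "pump pressure dropped"]
index = build_index(segments)

# Segment vectors are computed once and can be reused for every condition.
segment_vectors = [vectorize(seg, index) for seg in segments]

# Two different conditions are vectorized against the same, unchanged index.
condition_a = vectorize("pump pressure", index)
condition_b = vectorize("valve opened slowly", index)
```

Because the index never depends on the condition, changing the condition only requires recomputing one short condition vector, not the segment vectors.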

Furthermore, by using the feature vectors of the document segments generated in this way, the similarity between the document segments included in the search result for a condition can be calculated, and the search result can be clustered by content.

Specifically, the partial information extraction method according to the present invention is a partial information extraction method for extracting, from a plurality of pieces of information, partial information close to the concept of a condition, the method comprising, in this order:
a vector generation procedure in which a feature vector generation unit generates an index from a group of information to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates a similarity between the condition vector and the feature vector of each segment; and
a partial extraction procedure in which a partial extraction unit extracts, according to a predetermined criterion, the segments close to the condition using the similarity between the condition vector and the feature vector of each segment.

The partial information extraction method according to the present invention may further include, after the partial extraction procedure, a clustering procedure in which a clustering unit calculates the similarity between segments using the feature vectors of the segments extracted in the partial extraction procedure and, based on that similarity, classifies the extracted segments into a plurality of information clusters.

The partial information extraction method according to the present invention may further include, after the partial extraction procedure, a mapping procedure in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the similarity between the segments.

Specifically, the partial information extraction system according to the present invention is a partial information extraction system that extracts, from a plurality of documents, partial information close to the concept of a condition, the system comprising:
a feature vector generation unit that generates an index from the information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
a vector determination unit that generates a feature vector of the condition as a condition vector and calculates a similarity between the condition vector and the feature vector of each segment; and
a partial extraction unit that extracts, according to a predetermined criterion, the segments close to the condition using the similarity between the condition vector and the feature vector of each segment.

The partial information extraction system according to the present invention may further include a clustering unit that calculates the similarity between segments using the feature vectors of the segments extracted by the partial extraction unit and, based on that similarity, classifies the extracted segments into a plurality of information clusters.

The partial information extraction system according to the present invention may further include a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.

According to the present invention, a partial search with high accuracy can be realized in a short time.

FIG. 1 shows a configuration example of the partial information extraction system according to Embodiment 1.
FIG. 2 shows the sequence of the partial information extraction system according to Embodiment 1.
FIG. 3 shows a configuration example of the partial information extraction system according to Embodiment 2.
FIG. 4 shows the sequence of the partial information extraction system according to Embodiment 2.
FIG. 5 shows an example of a map.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. Note that the present invention is not limited to the embodiments shown below. These embodiments are merely examples, and the present invention can be implemented with various modifications and improvements based on the knowledge of those skilled in the art. In the present specification and drawings, the same reference numerals denote the same components.

(Embodiment 1)
FIG. 1 shows a configuration example of a partial information extraction system according to this embodiment. The partial information extraction system according to this embodiment includes a server 10, a storage 20, and a user terminal 30. The storage 20 is an arbitrary storage medium accessible from the server 10. The server 10 and the user terminal 30 are computers having computer resources such as a CPU (Central Processing Unit) and a storage medium, and a program is installed in the storage medium. Any number of servers 10, storages 20, and user terminals 30 may be employed. In the present embodiment, a case where there is one server 10, two storages 20, and one user terminal 30 will be described.

The storage 20 holds an information group. The information group includes arbitrary data transmitted and received via the communication network, for example, text, numerical data, log data, and customer information. Examples of text include patents, papers, books, reports, and web pages. Examples of numerical data include sensor data, measurement data, and POS (Point Of Sales) data. Examples of log data include online access data and status data of various devices. In the present embodiment, a case where the information is a document will be described as an example.

FIG. 2 shows a sequence of the partial information extraction system according to this embodiment. The server 10 acquires a document from the storage 20, divides the acquired document into a plurality of predetermined segments, and generates a feature vector based on the vector space model based on the index for each segment (S101). It is preferable that the feature vector of each segment is stored in the secondary storage 20 separately from the original information group and used for the subsequent calculation of similarity. The original information group is not used at all in the calculation stage, and is used only when displaying the original information in the final stage.

The user terminal 30 transmits a condition via a communication network (S102). Upon receiving the condition from the user terminal 30, the server 10 acquires the feature vector of each segment from the storage 20 (S103), extracts the segments whose feature vectors are close to the feature vector of the condition (S104), and transmits the extraction result to the user terminal 30 (S105). The user terminal 30 displays the extraction result received from the server 10 (S106).

The server 10 includes a communication function unit (not shown) that transmits / receives information to / from the user terminal 30 and the storage 20 via a communication network, and a configuration for extracting a segment. The configuration for extracting a segment includes, for example, a feature vector generation unit 11, a vector determination unit 12, and a partial extraction unit 13. The server 10 may be realized by causing a computer to function as the feature vector generation unit 11, the vector determination unit 12, and the partial extraction unit 13. In this case, each configuration is realized by the CPU in the server 10 executing a computer program stored in a storage unit (not shown).

The server 10 executes the partial information extraction method according to the present embodiment when extracting the segment. The partial information extraction method according to this embodiment includes a vector generation procedure (S101), a vector determination procedure (S103), and a partial extraction procedure (S104) in this order.

In the vector generation procedure (S101), the feature vector generation unit 11 generates a feature vector based on the vector space model for each segment. The elements constituting the feature vector, that is, the index, are not defined by the conditional sentence but are generated from the search target information group. Since the index of the feature vector does not depend on the conditional statement, the feature vector does not deteriorate depending on how the conditional statement is written. Further, even if the conditional statement changes, the feature vector of the same segment can always be used, so that the processing load on the server 10 is small.

When the document includes text, the segment is, for example, a paragraph or a sentence. Paragraphs are identified, for example, by detecting line breaks. Sentences are identified by detecting a period “.”, a question mark “?”, or an exclamation mark “!”. The index is the set of vector elements, for example, a word list. In the present embodiment, a case where the segment is a paragraph and the index is a word list will be described as an example.
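As a rough illustration of the segmentation described above, the following Python sketch splits a document into paragraphs at line breaks and a paragraph into sentences after a period, question mark, or exclamation mark. The function names are hypothetical, and requiring whitespace after the punctuation mark is a simplifying assumption (it avoids splitting decimals such as 3.5, but text written without spaces between sentences would need a different rule).

```python
import re

def split_paragraphs(document):
    """Split a document at line breaks, dropping empty lines."""
    return [p.strip() for p in document.splitlines() if p.strip()]

def split_sentences(paragraph):
    """Split after '.', '?' or '!' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", paragraph)
    return [s for s in parts if s]

paragraphs = split_paragraphs("First paragraph.\n\nSecond paragraph.\n")
sentences = split_sentences("Is it on? Yes! Done.")
```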

In the vector determination procedure (S103), the vector determination unit 12 determines, for each segment d_i, the closeness of its content to the condition d_k. For example, the vector determination unit 12 vectorizes the condition d_k based on the vector space model. Then, the vector determination unit 12 determines the closeness between the condition vector and each feature vector.

When the information d_i can be expressed as a matrix with respect to the elements t_j, the information d_i can be described by the vector space model d_i = (t_1, t_2, t_3, ...). The condition can therefore be described by a condition vector whose elements are the words included in the condition. Likewise, a segment can be described by a segment vector whose elements are the words included in the segment.

When the frequency of occurrence of the element t_j in the segment d_i is n_ij, the concept vector of the segment d_i can be represented by d_i = (n_i1, n_i2, n_i3, ...). For example, if the numbers of occurrences of the words t_1, t_2, t_3 are 0, 1, 0 in the segment d_1, 2, 1, 0 in the segment d_2, and 1, 2, 3 in the segment d_3, respectively, the matrix M of the segments, with row i holding the concept vector of the segment d_i, is expressed as follows.

\[
M = \begin{pmatrix} d_1 \\ d_2 \\ d_3 \end{pmatrix}
  = \begin{pmatrix} 0 & 1 & 0 \\ 2 & 1 & 0 \\ 1 & 2 & 3 \end{pmatrix}
\]
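The occurrence counts in the example above (0, 1, 0 for d_1; 2, 1, 0 for d_2; 1, 2, 3 for d_3) can be reproduced with a short Python sketch. The word lists below are made up so that they yield exactly those counts, and the function name `count_vector` is hypothetical; placing one segment per row is an assumption that follows d_i = (n_i1, n_i2, n_i3, ...).

```python
from collections import Counter

# Index of words t_1, t_2, t_3 and hypothetical segment contents.
index = ["t1", "t2", "t3"]
segments = {
    "d1": ["t2"],
    "d2": ["t1", "t1", "t2"],
    "d3": ["t1", "t2", "t2", "t3", "t3", "t3"],
}

def count_vector(words, index):
    """Return the occurrence count n_ij of every index word in one segment."""
    counts = Counter(words)
    return [counts[t] for t in index]

# One row per segment: row i is the vector of segment d_i.
M = [count_vector(words, index) for words in segments.values()]
```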

The closeness of the contents of the segment d_i and the condition d_k can be quantified by a calculation on the feature vector d_i and the condition vector d_k. The calculation used for the quantification may be the distance between the vectors, or any other calculation such as an inner product or outer product.
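A minimal Python sketch of such calculations is given below. The inner product and the Euclidean distance correspond to options named in the text; cosine similarity is added here as an assumption, since it is a common normalized form of the inner product but is not named in the original. The condition vector `dk` is hypothetical.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Inner product normalized by vector lengths; 0.0 if either vector is all zeros."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0

# Segment vectors from the example matrix M and a hypothetical condition vector.
d1, d2, d3 = [0, 1, 0], [2, 1, 0], [1, 2, 3]
dk = [0, 1, 0]
similarities = [cosine_similarity(d, dk) for d in (d1, d2, d3)]
```

Note that a distance decreases as content gets closer while an inner product or cosine similarity increases, so the extraction criterion has to be chosen to match the measure.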

Words that are used in the same way across all segments do not affect the closeness of content between documents. Therefore, when calculating the vectors, it is preferable to make the contribution of words that are characteristic of a particular document differ from that of other words. For example, weighting is performed using the tfidf (Term Frequency Inverse Document Frequency) method. This improves the accuracy of the closeness of content between segments. The tfidf weight of a word that is used similarly in every document is small, and the tfidf weight of a word that is used frequently in a particular document is large.
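The effect of this weighting can be sketched in Python as follows. The text does not specify a formula, so the sketch assumes one common variant: the raw count multiplied by log(N / df), where N is the number of segments and df is the number of segments containing the word. The function name `tfidf` is hypothetical.

```python
import math

def tfidf(matrix):
    """Weight a count matrix (rows = segments, columns = words) by count * log(N / df)."""
    n_segments = len(matrix)
    n_terms = len(matrix[0])
    # df[j]: number of segments in which word j occurs at least once.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_segments / df[j]) if df[j] else 0.0 for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

# Count matrix from the example: t_2 occurs in every segment, t_3 in only one.
M = [[0, 1, 0], [2, 1, 0], [1, 2, 3]]
W = tfidf(M)
```

Under this variant the word t_2, which appears in all three segments, receives a weight of zero everywhere, while t_3, which appears only in d_3, receives the largest weight.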

The closeness of content may also be determined, for example, based on the presence or absence of the words included in the condition. When the condition contains a single word, each segment is judged by whether or not it contains that word. When the condition contains a plurality of words, a logical expression is formed, and each segment is judged by a binary value indicating whether or not it satisfies the logical expression.
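A presence-or-absence judgment of this kind might look like the following Python sketch. The form of the logical expression (AND over `all_of`, OR over `any_of`, NOT over `none_of`) and the function name `matches` are assumptions for illustration; the text only states that a logical expression is formed.

```python
def matches(segment, all_of=(), any_of=(), none_of=()):
    """Return True if the segment satisfies a simple logical expression over words."""
    words = set(segment.lower().split())
    return (
        all(w in words for w in all_of)
        and (not any_of or any(w in words for w in any_of))
        and not any(w in words for w in none_of)
    )

# Single-word condition: the segment either contains the word or it does not.
single = matches("The pump stopped", all_of=["pump"])
# Multi-word condition: "pump AND NOT valve".
combined = matches("The pump stopped", all_of=["pump"], none_of=["valve"])
```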

In the partial extraction procedure (S104), the partial extraction unit 13 extracts a segment having a vector close to a predetermined condition from a plurality of segments. At this time, the segment to be extracted may be a predetermined number of segments, or may be a segment whose vector is within a predetermined proximity range. In this way, by extracting segments with similar vectors, it is possible to extract only portions that are close to the concept constituted by the search conditions.
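The two extraction criteria mentioned above, a predetermined number of segments or a predetermined proximity range, can be sketched in Python as follows. The function names and the sample similarity values are hypothetical, and the sketch assumes a measure where a larger value means closer content.

```python
def extract_top_n(similarities, n):
    """Return the indices of the n most similar segments, best first."""
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return ranked[:n]

def extract_within_range(similarities, threshold):
    """Return the indices of all segments whose similarity is at least the threshold."""
    return [i for i, s in enumerate(similarities) if s >= threshold]

# Hypothetical similarities between the condition vector and four segments.
similarities = [0.10, 0.85, 0.40, 0.90]
top_two = extract_top_n(similarities, 2)
close_enough = extract_within_range(similarities, 0.4)
```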

In the partial extraction procedure (S104), clustering processing may also be performed. In this case, the partial extraction unit 13 classifies the extracted segments into a plurality of information clusters based on the similarity between the segments, using the similarity between the condition vector and the feature vectors of the segments. The classification is performed, for example, by grouping segments into a common cluster in order of increasing vector distance. By performing the clustering process in this way, the user terminal 30 can be provided with a result in which the contents described in the segments are classified hierarchically.
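One way to read "in order from the closest vector distance" is agglomerative clustering, where the two closest clusters are merged repeatedly. The Python sketch below takes that reading as an assumption; the single-linkage rule, the `max_distance` stopping threshold, and the function names are illustrative choices not specified in the text.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative_clusters(vectors, max_distance):
    """Merge the two closest clusters (single linkage) until the closest
    remaining pair is farther apart than max_distance."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_distance:
            break
        # Merge cluster b into cluster a (a < b, so deleting b keeps a valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical feature vectors of four extracted segments.
vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]
clusters = agglomerative_clusters(vectors, max_distance=2.0)
```

Recording the order of the merges, rather than only the final grouping, would give the hierarchy mentioned in the text.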

In this embodiment, an example in which the document is text has been described, but the document in the present invention is not limited to this. When the document includes numerical data or log data, the segment is, for example, a point in time or a period of time, a region or a place, or an attribute. When the document includes customer data, the segment is, for example, a point in time or a period of time, a region or a place, an attribute, or an age. The unit of time is arbitrary; for example, it may be seconds or years.

When the document includes numerical data or log data, vectorization based on the vector space model is performed as follows.
When the document is the access log data of users of an online service, the number of accesses by the user t_j between the times d_i and d_i + T (time interval T) is n_ij. The time d_i can be expressed as a vector d_i = (n_i1, n_i2, n_i3, ...).
When the document is sensor data, the output value of the sensor t_j between the times d_i and d_i + T (time interval T) is n_ij. The time d_i can be expressed as a vector d_i = (n_i1, n_i2, n_i3, ...).
When the document is image data, the image d_i is frequency-converted, and the value of each frequency component t_j after the conversion is n_ij. The image d_i can be expressed as a vector d_i = (n_i1, n_i2, n_i3, ...).
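For the access-log case, the vectorization can be sketched in Python as follows: each time window of length T becomes one segment d_i, and each user t_j becomes one vector element holding the access count n_ij. The log format of (timestamp, user) pairs, the function name, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def access_vectors(log, users, start, interval, n_windows):
    """Return one access-count vector per time window [start + i*T, start + (i+1)*T)."""
    counts = [Counter() for _ in range(n_windows)]
    for timestamp, user in log:
        i = int((timestamp - start) // interval)
        if 0 <= i < n_windows:  # ignore entries outside the covered period
            counts[i][user] += 1
    return [[c[u] for u in users] for c in counts]

# Hypothetical access log: (timestamp in seconds, user name).
log = [(0, "alice"), (5, "bob"), (9, "alice"), (12, "alice"), (19, "bob"), (25, "carol")]
users = ["alice", "bob", "carol"]
vectors = access_vectors(log, users, start=0, interval=10, n_windows=3)
```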

When the document includes numerical data or log data, the tfidf weighting is performed as follows.
When the document is the access log data of users of an online service, the tfidf weight of a user whose accesses are evenly distributed is small, and the tfidf weight of a user whose accesses are highly uneven is large.
When the document is sensor data, the tfidf weight of a sensor whose output value changes little is small, and the tfidf weight of a sensor whose output value changes greatly is large.
When the document is image data, the tfidf weight of a frequency whose component value varies little between images is small, and the tfidf weight of a frequency whose component value varies greatly between images is large.

(Embodiment 2)
FIG. 3 shows a configuration example of the partial information extraction system according to the present embodiment. The partial information extraction system according to the present embodiment further includes a mapping unit 14 in addition to the configuration of the first embodiment.

FIG. 4 shows a sequence of the partial information extraction system according to this embodiment. The partial information extraction method according to the present embodiment further includes a mapping procedure (S107) after the partial extraction procedure (S104) described in the first embodiment. The server 10 transmits the map created by the mapping procedure to the user terminal 30 (S108). The user terminal 30 displays the map received from the server 10 (S109).

In the mapping procedure (S107), the mapping unit 14 arranges points representing the condition and the segments extracted by the partial extraction unit 13 on a map, based on the vector values created by the vector determination unit 12, according to the closeness of the contents of the vectors.

The closeness between feature vectors is calculated, and based on it, mapping is performed according to the closeness of content between pieces of information, that is, the “semantic distance”. The calculation may be a distance between vectors, or an arbitrary calculation such as an inner product or outer product. When the partial extraction unit 13 performs the clustering process, an information cluster including a plurality of segments may be arranged on the map. Based on the closeness of content between the pieces of information d_i, a map such as that shown in FIG. 5 can be created by using a mapping algorithm.
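The text does not name a specific mapping algorithm. As one possible illustration, the Python sketch below places points in a plane so that their pairwise distances approximate a given distance matrix, using plain gradient descent on the squared distance error (a simple form of multidimensional scaling). The algorithm choice, the function name `layout_2d`, its parameters, and the sample distances are all assumptions.

```python
import math
import random

def layout_2d(dist, steps=2000, rate=0.05, seed=0):
    """Place n points in the plane so that pairwise Euclidean distances
    approximate dist[i][j], by gradient descent on the squared error."""
    rng = random.Random(seed)
    n = len(dist)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                err = d - dist[i][j]            # positive if the points are too far apart
                gx += err * dx / d
                gy += err * dy / d
            pos[i][0] -= rate * gx
            pos[i][1] -= rate * gy
    return pos

# Hypothetical semantic distances: items 0 and 1 are close, item 2 is far from both.
D = [[0, 1, 4], [1, 0, 4], [4, 4, 0]]
positions = layout_2d(D)
```

The resulting coordinates could then be drawn as points, so that segments with similar content appear near each other on the map.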

The system according to the present embodiment can extract segments using concept search and map the distribution of the contents of each segment using a vector calculated using concept search.

The present invention can be applied to the information and communication industry.

10: Server
11: Feature vector generation unit
12: Vector determination unit
13: Partial extraction unit
14: Mapping unit
20: Storage
30: User terminal
31: Clustering unit

Claims

1. A partial information extraction method for extracting, from a plurality of pieces of information, partial information close to the concept of a condition, the method comprising, in this order:
a vector generation procedure in which a feature vector generation unit generates an index from a group of information to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
a vector determination procedure in which a vector determination unit generates a feature vector of the condition as a condition vector and calculates a similarity between the condition vector and the feature vector of each segment; and
a partial extraction procedure in which a partial extraction unit extracts, according to a predetermined criterion, the segments close to the condition using the similarity between the condition vector and the feature vector of each segment.
2. The partial information extraction method according to claim 1, wherein, in the partial extraction procedure, the extracted segments are classified into a plurality of information clusters based on the similarity between the segments, using the similarity between the condition vector and the feature vector of each segment.
3. The partial information extraction method according to claim 1, further comprising, after the partial extraction procedure, a mapping procedure in which a mapping unit arranges the segments extracted in the partial extraction procedure on a map according to the similarity between the segments.
4. A partial information extraction system that extracts, from a plurality of documents, partial information close to the concept of a condition, the system comprising:
a feature vector generation unit that generates an index from the information group to be searched, divides the information into a plurality of predetermined segments, and generates, for each segment, a feature vector based on a vector space model using the index;
a vector determination unit that generates a feature vector of the condition as a condition vector and calculates a similarity between the condition vector and the feature vector of each segment; and
a partial extraction unit that extracts, according to a predetermined criterion, the segments close to the condition using the similarity between the condition vector and the feature vector of each segment.
5. The partial information extraction system according to claim 4, wherein the partial extraction unit classifies the extracted segments into a plurality of information clusters based on the similarity between the segments, using the similarity between the condition vector and the feature vector of each segment.
6. The partial information extraction system according to claim 4 or 5, further comprising a mapping unit that arranges the segments extracted by the partial extraction unit on a map according to the similarity between the segments.