WO2024013991A1 - Information processing program, information processing method, and information processing device - Google Patents

Information processing program, information processing method, and information processing device

Info

Publication number
WO2024013991A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
words
characteristic
sentences
information processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/027902
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
正弘 片岡
彩 岩田
隆志 ▲徳▼田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to AU2022469863A (patent AU2022469863B2, en)
Priority to JP2024533482A (patent JP7800692B2, ja)
Priority to PCT/JP2022/027902 (patent WO2024013991A1, ja)
Priority to EP22951199.3A (patent EP4557126A4, en)
Publication of WO2024013991A1 (patent WO2024013991A1, ja)
Priority to US19/000,417 (patent US20250131194A1, en)
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis

Definitions

  • the present invention relates to information processing programs and the like.
  • a huge amount of data such as text is registered in a DB (Data Base), and it is required to appropriately search such a DB for data similar to a search query specified by a user.
  • text will be explained as a sentence having multiple words.
  • a transposed index (also called an inverted index) is set when text is registered in the DB, and a data search is executed when a search query is received.
  • a vector is calculated for each sentence (hereinafter referred to as a sentence vector), and similar sentence vectors are classified into the same cluster.
  • the positions of multiple sentences included in the same cluster and their representative vectors are associated with each other and set in a transposed index, thereby improving the efficiency of search processing.
  • each word vector of a plurality of words making up a sentence is calculated, and the word vectors of all words are integrated to calculate the sentence vector.
  • Sentences include various words such as nouns, verbs, adjectives, and particles. If the sentence vector is calculated simply by integrating the word vectors of all the words included in the sentence, as in the conventional technology, the sentence vector may not clearly indicate the characteristics of the sentence.
  • if clustering is performed using such sentence vectors, a plurality of sentences that should normally be classified into different clusters may end up being classified into the same cluster.
  • the present invention aims to provide an information processing program, an information processing method, and an information processing device that can appropriately cluster text.
  • the computer executes the following process.
  • the computer obtains multiple sentences with multiple words.
  • the computer executes, on a plurality of sentences, a process of identifying a set of feature words from a plurality of words, based on a sentence vector of a sentence having a plurality of words and a word vector of a plurality of words.
  • the computer classifies the plurality of sentences such that sentences having the same set of characteristic words are included in the same cluster.
  • Text can be clustered appropriately.
  • FIG. 1 is a diagram for explaining the process of identifying characteristic words of a sentence.
  • FIG. 2 is a diagram for explaining clustering processing.
  • FIG. 3 is a diagram (1) for explaining the search phase processing.
  • FIG. 4 is a diagram (2) for explaining the search phase processing.
  • FIG. 5 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
  • FIG. 6 is a diagram showing an example of the data structure of a word vector dictionary.
  • FIG. 7 is a diagram showing the data structure of a transposed index.
  • FIG. 8 is a flowchart showing the processing procedure of the preparation phase of the information processing apparatus according to this embodiment.
  • FIG. 9 is a flowchart showing the processing procedure of the search phase of the information processing apparatus according to the present embodiment.
  • FIG. 10 is a flowchart showing the processing procedure of search processing based on a plurality of sentences.
  • FIG. 11 is a diagram for explaining other processing of the information processing device.
  • FIG. 12 is a diagram illustrating an example of the hardware configuration of a computer that implements the same functions as the information processing device of the embodiment.
  • the information processing device performs preparation phase processing and search phase processing.
  • the preparation phase processing executed by the information processing device will be described.
  • the preparation phase includes a process of identifying characteristic words of sentences and a process of clustering sentences.
  • FIG. 1 is a diagram for explaining the process of identifying characteristic words of a sentence.
  • characteristic words are specified from the sentence "Horses like sweet carrots." registered in the text DB 50.
  • the information processing device identifies the vector of each word included in the sentence "Horses like sweet carrots." based on a word vector dictionary that defines the relationship between words and word vectors.
  • the vector of a word is referred to as a "word vector."
  • let the word vector of "horse" be wv-a.
  • let the word vector of "sweet" be wv-b.
  • let the word vector of "carrot" be wv-c.
  • let the word vector of "like" be wv-d. Illustrations of the word vectors for the words "ha", "ga", and "da" are omitted.
  • the information processing device calculates the sentence vector sv1 of the sentence "Horses like sweet carrots." by integrating the word vectors of each word of the sentence.
  • the information processing device calculates the cosine similarity between the sentence vector sv1 and each of the word vectors wv-a to wv-d and, based on the cosine similarity, identifies the words whose word vectors deviate from the sentence vector sv1 as "feature words". For example, the information processing device sets a word whose word vector has a cosine similarity with the sentence vector sv1 equal to or greater than a threshold value as a feature word.
  • assume that the cosine similarity between the sentence vector sv1 and the word vector wv-a, the cosine similarity between the sentence vector sv1 and the word vector wv-c, and the cosine similarity between the sentence vector sv1 and the word vector wv-d are each equal to or greater than the threshold. In this case, the information processing device specifies the word "horse" of the word vector wv-a, the word "carrot" of the word vector wv-c, and the word "like" of the word vector wv-d as characteristic words.
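The feature-word selection described for FIG. 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "integrating" word vectors means element-wise summation, and the 2-dimensional vectors, the threshold of 0.8, and the names `cosine`, `sentence_vector`, and `feature_words` are all invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(word_vectors):
    """Integrate word vectors by element-wise summation (an assumption)."""
    return [sum(components) for components in zip(*word_vectors)]


def feature_words(words, word_vector_dict, threshold):
    """Return the words whose cosine similarity with the sentence vector is >= threshold."""
    vectors = [word_vector_dict[w] for w in words]
    sv = sentence_vector(vectors)
    return [w for w, wv in zip(words, vectors) if cosine(sv, wv) >= threshold]


# Toy vectors, invented for illustration only.
toy_dict = {
    "horse": [1.0, 0.2],
    "sweet": [-0.5, 1.0],
    "carrot": [0.9, 0.4],
    "like": [0.8, 0.1],
}
selected = feature_words(["horse", "sweet", "carrot", "like"], toy_dict, threshold=0.8)
print(selected)
```

With these toy values, "sweet" falls below the threshold and the remaining three words are kept, which mirrors the example in the text.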
  • the information processing device clusters the sentences based on the characteristic words identified in the process of FIG. 1.
  • FIG. 2 is a diagram for explaining clustering processing.
  • the information processing device identifies the word cluster ID of each feature word based on the word cluster dictionary 60.
  • in the word cluster dictionary 60, a plurality of words are classified into a plurality of clusters, and a word cluster ID is set for each cluster.
  • the information processing device specifies the word cluster ID set in the cluster to which the feature word belongs, and sets the specified word cluster ID as the word cluster ID of the feature word.
  • the information processing device specifies the word cluster ID "l" set for the cluster containing the feature word "horse", and sets it as the word cluster ID of the feature word "horse".
  • the information processing device specifies the word cluster ID "m" set for the cluster containing the feature word "carrot", and sets it as the word cluster ID of the feature word "carrot".
  • the information processing device specifies the word cluster ID "n" set for the cluster containing the feature word "like", and sets it as the word cluster ID of the feature word "like".
  • the information processing device thus specifies the word cluster IDs "l", "m", and "n" corresponding to the characteristic words "horse", "carrot", and "like".
  • the information processing device sets this set of word cluster IDs "l", "m", and "n" as the set of word cluster IDs corresponding to the sentence "Horses like sweet carrots."
  • the information processing device identifies the sentence cluster to which the sentence belongs based on the set of word cluster IDs set for the sentence and the sentence cluster dictionary 70.
  • the sentence cluster dictionary 70 associates a sentence cluster ID that identifies a sentence cluster with a set of word cluster IDs. For example, the sentence cluster ID corresponding to the word cluster IDs "l", "m", and "n" is "Cr1". Therefore, the information processing device specifies "Cr1" as the sentence cluster ID of the sentence cluster to which the sentence "Horses like sweet carrots." belongs.
  • the information processing device associates the identified sentence cluster ID of the sentence "Horses like sweet carrots." with the position of that sentence in the text DB 50, and registers the pair in the transposed index 80.
  • the information processing device repeatedly executes the above process for each sentence registered in the text DB 50, and registers the relationship between the sentence cluster ID of each sentence and the position of each sentence in the transposed index 80.
  • as described above, the information processing device identifies a set of characteristic words from a plurality of words included in a sentence, and identifies the sentence cluster ID of the cluster to which the sentence belongs based on the set of characteristic words and the sentence cluster dictionary 70. This allows sentences to be appropriately clustered.
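The two dictionary lookups described for FIG. 2 can be sketched as below. The sketch assumes that a "set of word cluster IDs" is unordered, so it is modelled as a `frozenset` used as a dictionary key; the word "pony" and the function names are hypothetical and only serve to show two different sentences landing in the same sentence cluster.

```python
def word_cluster_ids(feature_words, word_cluster_dict):
    """Map each feature word to the ID of the word cluster it belongs to."""
    return frozenset(word_cluster_dict[w] for w in feature_words)


def sentence_cluster_id(feature_words, word_cluster_dict, sentence_cluster_dict):
    """Look up the sentence cluster ID for the set of word cluster IDs (None if unregistered)."""
    key = word_cluster_ids(feature_words, word_cluster_dict)
    return sentence_cluster_dict.get(key)


# Hypothetical dictionary contents mirroring the example in the text.
word_cluster_dict = {"horse": "l", "pony": "l", "carrot": "m", "like": "n"}
sentence_cluster_dict = {frozenset({"l", "m", "n"}): "Cr1"}

cid = sentence_cluster_id(["horse", "carrot", "like"], word_cluster_dict, sentence_cluster_dict)
same = sentence_cluster_id(["pony", "carrot", "like"], word_cluster_dict, sentence_cluster_dict)
print(cid, same)
```

Because "horse" and the hypothetical "pony" share word cluster "l", both sentences resolve to the same key and therefore to the same sentence cluster ID.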
  • FIG. 3 is a diagram (1) for explaining the search phase processing.
  • a case will be explained in which a single sentence "Horses like sweet carrots" is specified as the search query q1.
  • when the information processing device receives the search query q1, it identifies the characteristic words "horse", "carrot", and "like" from the sentence "Horses like sweet carrots." in the search query q1.
  • the process by which the information processing device identifies characteristic words from a plurality of words included in a sentence is similar to the process described in FIG. 1.
  • the information processing device identifies the word cluster ID of each feature word based on each feature word and the word cluster dictionary 60. For example, based on the word cluster dictionary 60, the information processing device specifies the word cluster ID "l" set for the cluster including the feature word "horse", and sets it as the word cluster ID of the feature word "horse". Based on the word cluster dictionary 60, the information processing device specifies the word cluster ID "m" set for the cluster including the feature word "carrot" and sets it as the word cluster ID of the feature word "carrot". Based on the word cluster dictionary 60, the information processing device specifies the word cluster ID "n" set for the cluster including the feature word "like" and sets it as the word cluster ID of the feature word "like".
  • the information processing device thus identifies the word cluster IDs "l", "m", and "n" corresponding to the characteristic words "horse", "carrot", and "like".
  • the information processing device sets the set of word cluster IDs "l", "m", and "n" as the set of word cluster IDs corresponding to the sentence "Horses like sweet carrots." of the search query q1.
  • the information processing device identifies the sentence cluster to which the sentence of the search query q1 belongs based on the set of word cluster IDs set for the sentence of the search query q1 and the sentence cluster dictionary 70. For example, the sentence cluster ID corresponding to the word cluster IDs "l", "m", and "n" is "Cr1". Therefore, the information processing device specifies "Cr1" as the sentence cluster ID of the sentence cluster to which the sentence "Horses like sweet carrots." of the search query q1 belongs.
  • the information processing device identifies the position of the sentence on the text DB 50 that belongs to the sentence cluster ID based on the sentence cluster ID (for example, "Cr1") corresponding to the sentence of the search query q1 and the transposed index 80.
  • the information processing device extracts a sentence from the specified position and outputs the extracted sentence as a search result.
  • as described above, the information processing device identifies a set of characteristic words from a plurality of words included in the search query q1, and identifies the sentence cluster ID corresponding to the search query q1 based on the set of characteristic words and the sentence cluster dictionary 70.
  • the information processing device then performs a search based on the transposed index 80 created in advance and the sentence cluster ID corresponding to the search query q1. Thereby, it is possible to appropriately search for a sentence corresponding to the search query q1.
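Once the query's sentence cluster ID is known, the final lookup of FIG. 3 reduces to reading the index and fetching sentences. The following sketch simplifies a position to a list index into the text DB; the sample sentences, the cluster ID "Cr2", and the function name `search_by_cluster` are assumptions made for illustration.

```python
def search_by_cluster(cluster_id, index, text_db):
    """Return the sentences stored at every position registered under cluster_id."""
    return [text_db[pos] for pos in index.get(cluster_id, [])]


# Hypothetical data: positions are simplified to list indexes into text_db.
text_db = [
    "Horses like sweet carrots.",
    "The weather is fine today.",
    "Ponies love carrots.",
]
index = {"Cr1": [0, 2], "Cr2": [1]}

hits = search_by_cluster("Cr1", index, text_db)
misses = search_by_cluster("Cr9", index, text_db)
print(hits, misses)
```

The query sentence itself is never compared against stored sentences; only the cluster ID is used as the search key, which is where the efficiency of the index-based search comes from.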
  • FIG. 4 is a diagram (2) for explaining the search phase processing.
  • the search query q2 includes multiple sentences, such as "The features of this program are as follows. ... It is composed of multiple sub-programs. ... Its features are ... A speed-up effect has been realized."
  • the information processing device calculates sentence vectors for each of the multiple sentences included in the search query q2. For example, the information processing device calculates the sentence vector of one sentence by integrating the word vectors of the plurality of words included in that sentence. Alternatively, the information processing device may execute the processing described in FIGS. 1 and 2 to identify a set of word cluster IDs for the feature words of a sentence, and use the set of word cluster IDs as the sentence vector.
  • let the sentence vector of the sentence "The function is its characteristic" be sv2-1.
  • let the sentence vector of the sentence "The features of this program are as follows." be sv2-2.
  • let the sentence vector of the sentence "It is composed of a plurality of sub-programs." be sv2-3.
  • let the sentence vector of the sentence "The effect of speeding up has been realized." be sv2-4.
  • the information processing device calculates the sentence vector dv1 by integrating the sentence vectors of multiple sentences included in the search query q2.
  • the information processing device calculates the cosine similarity between the sentence vector dv1 and each of the sentence vectors sv2-1 to sv2-4 and, based on the cosine similarity, specifies the sentences whose sentence vectors deviate from the sentence vector dv1 as "characteristic sentences". For example, the information processing device determines a sentence whose sentence vector has a cosine similarity with the sentence vector dv1 equal to or greater than a threshold value to be a characteristic sentence.
  • assume that the cosine similarity between the sentence vector dv1 and the sentence vector sv2-1, the cosine similarity between the sentence vector dv1 and the sentence vector sv2-3, and the cosine similarity between the sentence vector dv1 and the sentence vector sv2-4 are each equal to or greater than the threshold value. In this case, the information processing device specifies the sentence of the sentence vector sv2-1, the sentence of the sentence vector sv2-3, and the sentence of the sentence vector sv2-4 as characteristic sentences.
  • the information processing device searches the text DB 50 for a sentence corresponding to the characteristic sentence by executing the process described in FIG. 3 for each characteristic sentence.
  • sentences X1, X2, X3, and X4 are retrieved as search candidates corresponding to the characteristic sentence "The function is its characteristic".
  • sentences X2, X3, X6, and X10 are retrieved as search candidates corresponding to the characteristic sentence "It is composed of a plurality of sub-programs."
  • sentences X2, X3, X7, and X22 are retrieved as search candidates corresponding to the characteristic sentence "The effect of speeding up has been realized."
  • the information processing device identifies a common sentence among the search candidates for each feature sentence as the final search result.
  • sentences X2 and X3 common to each search candidate are output as the final search results.
  • as described above, the information processing device identifies characteristic sentences and identifies the sentences common to the search results for those characteristic sentences as the final search result. Thereby, even if the search query q2 includes multiple sentences, it is possible to efficiently search for a sentence corresponding to the search query q2.
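The last step of FIG. 4, keeping only the candidates shared by every characteristic sentence, amounts to a set intersection. The sketch below uses the candidate lists given in the text; the function name `final_results` and the choice to preserve the order of the first candidate list are assumptions.

```python
def final_results(candidates_per_sentence):
    """Intersect the candidate lists, preserving the order of the first list."""
    if not candidates_per_sentence:
        return []
    common = set.intersection(*(set(c) for c in candidates_per_sentence))
    return [s for s in candidates_per_sentence[0] if s in common]


# Candidate lists taken from the example in the text.
candidates = [
    ["X1", "X2", "X3", "X4"],
    ["X2", "X3", "X6", "X10"],
    ["X2", "X3", "X7", "X22"],
]
result = final_results(candidates)
print(result)
```

Selecting the characteristic sentences themselves follows the same cosine-similarity thresholding used for feature words, applied one level up to sentence vectors.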
  • FIG. 5 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
  • the information processing device 100 includes a communication section 110, an input section 120, a display section 130, a storage section 140, and a control section 150.
  • the communication unit 110 is connected to an external device or the like by wire or wirelessly, and transmits and receives information to and from the external device.
  • the communication unit 110 is realized by a NIC (Network Interface Card) or the like.
  • the communication unit 110 may be connected to a network (not shown).
  • the input unit 120 is an input device that inputs various information to the information processing device 100.
  • the input unit 120 corresponds to a keyboard, a mouse, a touch panel, etc.
  • the user may operate the input unit 120 to input data such as sentences, search queries, and the like.
  • the display unit 130 is a display device that displays information output from the control unit 150.
  • the display unit 130 corresponds to a liquid crystal display, an organic EL (Electro Luminescence) display, a touch panel, etc.
  • the search results of the search query are displayed on the display unit 130.
  • the storage unit 140 includes a word vector dictionary 40, a text DB 50, a word cluster dictionary 60, a sentence cluster dictionary 70, and a transposed index 80.
  • the storage unit 140 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk.
  • the word vector dictionary 40 is a table that defines codes assigned to words and word vectors.
  • FIG. 6 is a diagram showing an example of the data structure of a word vector dictionary. As shown in FIG. 6, this word vector dictionary 40 has codes, words, and word vectors (1) to (7).
  • the code is a code assigned to a word.
  • a word is a word included in a character string.
  • Word vectors (1) to (7) are vectors assigned to words.
  • the text DB 50 is a database that stores multiple sentences.
  • the text DB 50 includes multiple records.
  • One record includes multiple sentences.
  • in the word cluster dictionary 60, a plurality of words are classified into a plurality of clusters, and a word cluster ID is set for each cluster. For a plurality of words classified into the same cluster, the cosine similarity between the word vectors of the words is greater than or equal to the threshold. Other explanations regarding the word cluster dictionary 60 are the same as those given for FIG. 2.
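The condition placed on a word cluster can be checked with a few lines of code. The sketch below reads the condition as pairwise (every pair of words in a cluster meets the threshold), which is one possible interpretation of the text; the vectors, the threshold of 0.9, and the function name `is_valid_cluster` are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_valid_cluster(words, word_vector_dict, threshold):
    """True if every pair of words in the cluster has cosine similarity >= threshold."""
    return all(
        cosine(word_vector_dict[a], word_vector_dict[b]) >= threshold
        for a, b in combinations(words, 2)
    )


# Toy vectors, invented for illustration only.
toy = {"horse": [1.0, 0.1], "pony": [0.9, 0.2], "carrot": [0.1, 1.0]}
ok = is_valid_cluster(["horse", "pony"], toy, threshold=0.9)
bad = is_valid_cluster(["horse", "carrot"], toy, threshold=0.9)
print(ok, bad)
```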
  • the sentence cluster dictionary 70 associates sentence cluster IDs that identify clusters of sentences with sets of word cluster IDs. The same sentence cluster ID is set for multiple sentences belonging to the same sentence cluster. The rest of the explanation regarding the sentence cluster dictionary 70 is the same as the explanation given for FIG. 2.
  • the transposed index 80 associates a sentence cluster ID with the position of a sentence belonging to the sentence cluster ID (position on the text DB 50).
  • FIG. 7 is a diagram showing the data structure of a transposed index. As shown in FIG. 7, in this transposed index 80, a plurality of pairs of record pointers and position pointers are set in association with sentence cluster IDs.
  • the record pointer indicates the position of the corresponding record.
  • the position indicated by the record pointer is defined by the number of words (offset) from the first word of the text DB 50 to the first word of the record.
  • the position pointer indicates the position of the relevant sentence.
  • the position pointer is defined by the offset from the first word of the record containing the relevant sentence to the first word of the relevant sentence.
  • for example, assume that the sentence cluster ID of the sentence "Horses like sweet carrots." is "Cr1" and that the sentence is included in record R1.
  • in this case, the offset of the record R1 is set in the record pointer (1) corresponding to the sentence cluster ID "Cr1".
  • the offset of the sentence "Horses like sweet carrots." is set in the position pointer (1) corresponding to the sentence cluster ID "Cr1".
  • the data structure of the transposed index 80 is not limited to that shown in FIG. 7, and may simply be one that associates a sentence cluster ID with an offset of each sentence belonging to the corresponding sentence cluster ID.
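The pointer layout of FIG. 7 can be sketched as a mapping from sentence cluster ID to a list of (record pointer, position pointer) pairs, both measured in words as the text describes. The sample records, the toy classifier `toy_cluster`, and the function name `build_index` are assumptions; how sentences are actually assigned to clusters is covered by the earlier steps.

```python
from collections import defaultdict


def build_index(records, cluster_of):
    """Build {sentence cluster ID: [(record pointer, position pointer), ...]}.

    records    : list of records; each record is a list of sentences,
                 and each sentence is a list of words
    cluster_of : callable mapping a sentence (list of words) to its cluster ID
    """
    index = defaultdict(list)
    record_pointer = 0  # words from the start of the DB to the current record
    for record in records:
        position_pointer = 0  # words from the start of the record to the sentence
        for sentence in record:
            index[cluster_of(sentence)].append((record_pointer, position_pointer))
            position_pointer += len(sentence)
        record_pointer += position_pointer
    return dict(index)


records = [
    [["the", "sky", "is", "blue"], ["horses", "like", "sweet", "carrots"]],
    [["ponies", "like", "carrots"]],
]


def toy_cluster(sentence):
    """Hypothetical classifier: any sentence mentioning carrots goes to "Cr1"."""
    return "Cr1" if "carrots" in sentence else "Cr2"


index = build_index(records, toy_cluster)
print(index)
```

Here the second sentence of the first record is registered as (0, 4): its record starts at word 0 of the DB, and the sentence starts 4 words into that record.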
  • the control unit 150 includes an acquisition unit 151, a preprocessing unit 152, and a search unit 153.
  • the control unit 150 is realized by, for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit). Further, the control unit 150 may be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • the acquisition unit 151 acquires various information via the communication unit 110 or the input unit 120. For example, when acquiring record information, the acquiring unit 151 registers the acquired record information in the text DB 50.
  • the preprocessing unit 152 executes the preparation phase processing described above.
  • the preprocessing unit 152 acquires a sentence from the text DB and executes the process described in FIG. 1 to identify characteristic words included in the sentence. After identifying the characteristic words, the preprocessing unit 152 executes sentence clustering as described with reference to FIG. 2 .
  • the preprocessing unit 152 identifies the word cluster ID set to the cluster to which each feature word belongs based on the word cluster dictionary 60.
  • the preprocessing unit 152 identifies the sentence cluster ID to which the sentence belongs based on the specified set of word cluster IDs and the sentence cluster dictionary 70.
  • the preprocessing unit 152 associates the sentence cluster ID of the sentence with a record pointer and a position pointer that identify the position of the sentence, and sets them in the transposed index 80.
  • the preprocessing unit 152 repeatedly executes the above process for each sentence registered in the text DB 50.
  • the search unit 153 executes the search phase processing described above.
  • the search unit 153 acquires a search query via the communication unit 110 or the input unit 120.
  • the search unit 153 determines whether the search query includes one sentence or multiple sentences.
  • when the search query includes one sentence, the search unit 153 executes the process described in FIG. 3. For example, the search unit 153 identifies characteristic words from the sentence included in the search query. The search unit 153 identifies a set of word cluster IDs for the feature words based on each feature word and the word cluster dictionary 60. The search unit 153 identifies a sentence cluster ID corresponding to the search query based on the set of word cluster IDs and the sentence cluster dictionary 70.
  • based on the sentence cluster ID corresponding to the search query and the transposed index 80, the search unit 153 identifies the sets of a record pointer and a position pointer corresponding to the sentence cluster ID. The search unit 153 acquires the sentence (or sentences) corresponding to the identified sets of record pointer and position pointer from the text DB 50, and causes the display unit 130 to display them as the search result. The search unit 153 may notify an external device of the search results.
  • the search unit 153 executes the process described in FIG. 4 when the search query includes multiple sentences.
  • the search unit 153 identifies characteristic sentences from a plurality of sentences included in the search query.
  • the search unit 153 identifies the sentence cluster ID of each characteristic sentence based on the set of word cluster IDs corresponding to each characteristic sentence and the sentence cluster dictionary 70.
  • the search unit 153 acquires a plurality of sentences (search results) corresponding to each characteristic sentence from the text DB 50 based on the sentence cluster ID of each characteristic sentence and the transposed index 80.
  • the search unit 153 searches for a sentence common to the search results corresponding to each characteristic sentence as the final search result.
  • the search unit 153 causes the display unit 130 to display the search results.
  • the search unit 153 may notify an external device of the search results.
  • FIG. 8 is a flowchart showing the processing procedure of the preparation phase of the information processing apparatus according to this embodiment.
  • the preprocessing unit 152 of the information processing device 100 acquires an unprocessed sentence from the text DB 50 (step S101).
  • the preprocessing unit 152 calculates a sentence vector by integrating word vectors of a plurality of words included in a sentence based on the word vector dictionary 40 (step S102).
  • the preprocessing unit 152 identifies characteristic words based on the cosine similarity between the sentence vector and the word vector of each word (step S103).
  • the preprocessing unit 152 identifies the word cluster ID for the feature word based on the word cluster dictionary 60 (step S104).
  • the preprocessing unit 152 identifies the sentence cluster ID of the cluster to which the sentence belongs based on the set of word cluster IDs for the characteristic words of the sentence and the sentence cluster dictionary 70 (step S105).
  • the preprocessing unit 152 associates the sentence position information (a set of a record pointer and a position pointer) with the sentence cluster ID and registers them in the transposed index 80 (step S106).
  • if an unprocessed sentence exists in the text DB 50 (step S107, Yes), the preprocessing unit 152 returns to step S101. On the other hand, if no unprocessed sentence exists in the text DB 50 (step S107, No), the preprocessing unit 152 ends the preparation phase process.
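The loop of FIG. 8 can be sketched end to end as follows. To keep the example short, steps S102 and S103 are passed in as a callable, positions are opaque keys, and sentences whose set of word cluster IDs is not registered in the sentence cluster dictionary are skipped; that last behavior is an assumption, since the text does not say what happens in that case. All names and sample data are hypothetical.

```python
def preparation_phase(text_db, find_feature_words, word_cluster_dict, sentence_cluster_dict):
    """Register every sentence of text_db in an index keyed by sentence cluster ID.

    text_db            : {position: list of words}; positions are opaque here
    find_feature_words : callable covering steps S102 and S103
    """
    index = {}
    for position, words in text_db.items():                      # S101, S107
        features = find_feature_words(words)                     # S102, S103
        ids = frozenset(word_cluster_dict[w] for w in features)  # S104
        cluster_id = sentence_cluster_dict.get(ids)              # S105
        if cluster_id is not None:                               # assumption: skip unknown sets
            index.setdefault(cluster_id, []).append(position)    # S106
    return index


# Hypothetical stand-in for the cosine-similarity selection: drop "sweet".
def toy_feature_words(words):
    return [w for w in words if w != "sweet"]


text_db = {
    0: ["horse", "like", "sweet", "carrot"],
    1: ["pony", "like", "carrot"],
    2: ["sky", "blue"],
}
wcd = {"horse": "l", "pony": "l", "carrot": "m", "like": "n", "sky": "p", "blue": "q"}
scd = {frozenset({"l", "m", "n"}): "Cr1"}

index = preparation_phase(text_db, toy_feature_words, wcd, scd)
print(index)
```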
  • FIG. 9 is a flowchart showing the processing procedure of the search phase of the information processing apparatus according to the present embodiment.
  • the search unit 153 of the information processing device 100 receives a search query (step S201).
  • the search unit 153 determines whether the search query includes multiple sentences (step S202).
  • if the search query does not include multiple sentences (step S202, No), the search unit 153 calculates a sentence vector by integrating word vectors of a plurality of words included in the sentence of the search query based on the word vector dictionary 40 (step S203).
  • the search unit 153 identifies characteristic words included in the sentence of the search query based on the cosine similarity between the sentence vector and the word vector of each word (step S204).
  • the search unit 153 identifies the word cluster ID for the characteristic word based on the word cluster dictionary 60 (step S205).
  • the search unit 153 identifies the sentence cluster ID of the cluster to which the sentence of the search query belongs based on the set of word cluster IDs for the characteristic words of the sentence of the search query and the sentence cluster dictionary 70 (step S206).
  • the search unit 153 identifies the position information of the sentence corresponding to the sentence cluster ID based on the sentence cluster ID of the sentence of the search query and the transposed index 80 (step S207).
  • the search unit 153 acquires the sentence at the position corresponding to the position information from the text DB 50 (step S208).
  • the search unit 153 outputs the search results (step S209).
  • on the other hand, if the search query includes multiple sentences (step S202, Yes), the search unit 153 moves to step S210.
  • the search unit 153 executes a search process based on a plurality of sentences (step S210), and proceeds to step S209.
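The branch at step S202 can be sketched as a small dispatcher. How the device splits a query into sentences is not stated in the text, so the sketch assumes splitting at ".", "!", "?", or the Japanese full stop; `split_sentences`, `search`, and the two callables standing in for the single-sentence and multi-sentence paths are hypothetical.

```python
import re


def split_sentences(query):
    """Split a query into sentences at '.', '!', '?' or the Japanese full stop (an assumption)."""
    parts = re.findall(r"[^.!?。]+[.!?。]?", query)
    return [p.strip() for p in parts if p.strip()]


def search(query, search_single, search_multiple):
    """Dispatch to the single-sentence or multi-sentence search path (step S202)."""
    sentences = split_sentences(query)
    if len(sentences) > 1:                  # S202, Yes -> S210
        return search_multiple(sentences)
    if sentences:                           # S202, No -> S203 to S208
        return search_single(sentences[0])
    return []


# Stand-in callables that only report which path was taken.
one = search("Horses like sweet carrots.", lambda s: ["single", s], lambda ss: ["multi", len(ss)])
many = search("It has sub-programs. It is fast.", lambda s: ["single", s], lambda ss: ["multi", len(ss)])
print(one, many)
```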
  • FIG. 10 is a flowchart showing the processing procedure of search processing based on a plurality of sentences.
  • The search unit 153 of the information processing device 100 selects an unselected sentence from the plurality of sentences included in the search query (step S301).
  • The search unit 153 calculates a sentence vector by integrating the word vectors of the words included in the selected sentence, based on the word vector dictionary 40 (step S302).
  • The search unit 153 identifies the characteristic words included in the sentence based on the cosine similarity between the sentence vector and the word vector of each word (step S303).
  • The search unit 153 identifies the word cluster ID of each characteristic word based on the word cluster dictionary 60 (step S304).
  • The search unit 153 identifies the sentence cluster ID of the cluster to which the sentence belongs, based on the set of word cluster IDs of the characteristic words of the sentence and the sentence cluster dictionary 70 (step S305).
  • The search unit 153 identifies the position information of the sentences corresponding to the sentence cluster ID, based on the sentence cluster ID of the sentence and the transposed index 80 (step S306).
  • The search unit 153 acquires the sentences (search results) at the positions corresponding to the position information from the text DB 50 (step S307).
  • If there is an unprocessed sentence in the search query (step S308, Yes), the search unit 153 returns to step S301. If there is no unprocessed sentence in the search query (step S308, No), the search unit 153 sets the sentences common to the search results of each sentence included in the search query as the final search result (step S309), and ends the search process based on the plurality of sentences.
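  • The loop of steps S301 to S309 amounts to intersecting the per-sentence search results. A minimal sketch follows, assuming that each per-sentence search returns a set of sentence positions; the helper name search_one and the toy result sets are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set


def search_multiple_sentences(
    sentences: Iterable[str],
    search_one: Callable[[str], Set[int]],
) -> List[int]:
    """Return the positions common to the search results of every query sentence."""
    common: Optional[Set[int]] = None
    for sentence in sentences:            # S301 / S308: visit each sentence once
        hits = set(search_one(sentence))  # S302 to S307: per-sentence search
        common = hits if common is None else common & hits
    return sorted(common or [])           # S309: results shared by all sentences


# Hypothetical per-sentence results, keyed by query sentence.
toy_results = {"first sentence": {0, 2, 5}, "second sentence": {2, 5, 7}}
final = search_multiple_sentences(toy_results, lambda s: toy_results[s])
```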
  • The information processing device 100 identifies a set of characteristic words from the plurality of words included in a sentence, and identifies the sentence cluster ID of the cluster to which the sentence belongs based on the set of characteristic words and the sentence cluster dictionary 70. This allows sentences to be appropriately clustered.
  • The information processing device 100 calculates the cosine similarity between the sentence vector of the sentence and the word vector of each of the plurality of words, and identifies each word whose word vector has a cosine similarity with the sentence vector equal to or greater than a threshold value as a characteristic word. This makes it possible to identify characteristic words that deviate from the sentence vector.
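  • The selection rule stated above can be sketched as follows. The sketch assumes that "integrating" word vectors means an element-wise sum, which this passage does not spell out; the toy vectors and the threshold of 0.8 are likewise illustrative only.

```python
import math
from typing import Dict, List, Sequence


def integrate(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of the vectors (assumed meaning of 'integrating')."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def characteristic_words(
    words: List[str], word_vectors: Dict[str, List[float]], threshold: float
) -> List[str]:
    """Keep the words whose vector has cosine similarity >= threshold with the sentence vector."""
    sentence_vector = integrate([word_vectors[w] for w in words])
    return [
        w for w in words
        if cosine_similarity(sentence_vector, word_vectors[w]) >= threshold
    ]


# Made-up two-dimensional word vectors.
toy_vectors = {"horse": [1.0, 0.0], "likes": [0.0, 1.0], "carrot": [1.0, 0.2]}
picked = characteristic_words(["horse", "likes", "carrot"], toy_vectors, threshold=0.8)
```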
  • The information processing device 100 generates the transposed index 80 by associating the sentence cluster ID of the cluster to which each sentence belongs with the position information of that sentence.
  • By using such a transposed index 80, it becomes possible to easily specify the position information of a plurality of sentences belonging to the same sentence cluster ID.
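  • As a rough illustration of how such an index could be built, the sketch below maps each sentence cluster ID to the list of positions of the sentences assigned to it. Using the ordinal position of a sentence as its position information is an assumption of the example, and the cluster IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def build_transposed_index(sentence_cluster_ids: List[str]) -> Dict[str, List[int]]:
    """Map each sentence cluster ID to the positions of the sentences that belong to it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, cluster_id in enumerate(sentence_cluster_ids):
        index[cluster_id].append(position)
    return dict(index)


# One hypothetical sentence cluster ID per sentence, in document order.
index = build_transposed_index(["SC-1", "SC-2", "SC-1", "SC-3"])
```

A single lookup such as index["SC-1"] then yields every position that shares the cluster, without scanning the text.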
  • The information processing device 100 identifies a set of characteristic words from the plurality of words included in a search query q1 having one sentence, and identifies the sentence cluster ID corresponding to the search query q1 based on the set of characteristic words and the sentence cluster dictionary 70. Then, the information processing device 100 performs a search based on the transposed index 80 created in advance and the sentence cluster ID corresponding to the search query q1. Thereby, it is possible to appropriately search for sentences corresponding to the search query q1.
  • The information processing device 100 identifies the characteristic sentences of a search query q2, and takes the sentences common to the search results for those characteristic sentences as the final search result. Thereby, even if the search query q2 includes multiple sentences, it is possible to efficiently search for sentences corresponding to the search query q2.
  • The processing of the information processing device 100 described above is an example, and the information processing device 100 may perform other processing. Other processing of the information processing device 100 will be described below.
  • As described above, when receiving a search query having multiple sentences, the search unit 153 of the information processing device 100 identified multiple characteristic sentences whose cosine similarity is equal to or higher than a threshold value, and detected the sentences common to the search results obtained with each characteristic sentence as the final search result.
  • The search unit 153 may further execute a process of increasing or decreasing the number of characteristic sentences by accepting a change in the threshold value that is compared with the cosine similarity.
  • The search unit 153 accepts from the input unit 120 a change in the threshold value used when identifying characteristic sentences, and repeatedly executes a process of displaying the relationship between the changed threshold value and the characteristic sentences on the display unit 130. As the threshold value increases, the number of characteristic sentences decreases, and as the threshold value decreases, the number of characteristic sentences increases. When the search unit 153 receives a confirmation instruction from the input unit 120, the search unit 153 confirms the characteristic sentences. The processing performed after the search unit 153 confirms the characteristic sentences is the same as the processing described above.
  • In this way, a zoom-in/out function can be realized that increases or decreases the number of search candidates by increasing or decreasing the number of characteristic sentences included in a search query such as a paragraph or an item.
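  • The effect of changing the threshold can be illustrated with a small sketch. It assumes that a cosine similarity score has already been computed for each sentence of the query; the scores, sentence labels, and threshold values below are invented for the example.

```python
from typing import Dict, List


def characteristic_sentences(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return the sentences whose cosine similarity is equal to or higher than the threshold."""
    return [sentence for sentence, sim in similarities.items() if sim >= threshold]


# Hypothetical per-sentence cosine similarities for a three-sentence query.
toy_similarities = {"s1": 0.91, "s2": 0.74, "s3": 0.55}

# Raising the threshold shrinks the set of characteristic sentences; lowering it grows the set.
counts = {t: len(characteristic_sentences(toy_similarities, t)) for t in (0.5, 0.7, 0.9)}
```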
  • The primary structure of a protein includes multiple consecutive base acid sequences Kmer that appear repeatedly.
  • In the following, a continuous base acid sequence Kmer is referred to as a "basic structure" of the protein.
  • The "basic structure" of a protein may be expressed by a continuous amino acid sequence, an oligopeptide, or the like.
  • FIG. 11 is a diagram for explaining other processing of the information processing device.
  • The protein primary structure Pro1 includes a plurality of basic structures "⁇-Kmer", "⁇-Kmer", "⁇-Kmer", and "⁇-Kmer".
  • The information processing device 100 identifies the vector of each basic structure included in the protein primary structure Pro1 based on a basic structure vector dictionary that defines basic structures and basic structure vectors. For example, assume that the vector of the basic structure "⁇-Kmer" is v1. Let v2 be the vector of the basic structure "⁇-Kmer". Let the vector of the basic structure "⁇-Kmer" be v3. Let the vector of the basic structure "⁇-Kmer" be v4. The information processing device 100 calculates the vector tv1 of the primary structure Pro1 by integrating the vectors of the basic structures included in the primary structure Pro1 of the protein.
  • The information processing device 100 calculates the cosine similarity between the vector tv1 and each of the vectors v1 to v4, and, based on the cosine similarity, identifies the basic structure of a vector that deviates from the vector tv1 as a "feature basic structure". For example, the information processing device 100 sets the basic structure of a vector whose cosine similarity with the vector tv1 is equal to or greater than a threshold value as a feature basic structure.
  • Assume that the cosine similarity between vector tv1 and vector v1, the cosine similarity between vector tv1 and vector v3, and the cosine similarity between vector tv1 and vector v4 are each greater than or equal to the threshold value. Then, the information processing device 100 specifies the basic structure "⁇-Kmer" of vector v1, the basic structure "⁇-Kmer" of vector v3, and the basic structure "⁇-Kmer" of vector v4 as feature basic structures.
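  • Written out as formulas, the example above reads roughly as follows. The symbols θ (the threshold value) and F (the index set of the feature basic structures) are introduced here only for illustration, and treating "integrating" as a plain vector sum is an assumption.

```latex
\mathbf{tv}_1 = \mathbf{v}_1 + \mathbf{v}_2 + \mathbf{v}_3 + \mathbf{v}_4,
\qquad
\cos(\mathbf{tv}_1, \mathbf{v}_i)
  = \frac{\mathbf{tv}_1 \cdot \mathbf{v}_i}{\lVert \mathbf{tv}_1 \rVert \, \lVert \mathbf{v}_i \rVert},
\qquad
F = \{\, i \mid \cos(\mathbf{tv}_1, \mathbf{v}_i) \ge \theta \,\} = \{1, 3, 4\}
```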
  • The information processing device 100 clusters the primary structure based on the feature basic structures identified in the above process. Specifically, in the same manner as in FIG. 2, a basic structure cluster ID is assigned to each feature basic structure, and the cluster ID of the primary structure is specified based on the set of basic structure cluster IDs.
  • Other processing is the same as the processing described with reference to FIG. 2, with characteristic words replaced by feature basic structures and sentences replaced by primary structures.
  • The search process is likewise similar to the process described with reference to FIGS. 3 and 4, with characteristic words replaced by feature basic structures and sentences replaced by primary structures.
  • By using a transposed index associated with the cluster IDs of protein primary structures, it is possible to search for similar receptors in response to a search query for a receptor composed of a plurality of primary structures. By applying this, it is possible to search for receptors similar to the receptors that are targets of biopharmaceutical ligands, and to estimate side reactions of biopharmaceuticals.
  • FIG. 12 is a diagram illustrating an example of the hardware configuration of a computer that implements the same functions as the information processing device of the embodiment.
  • The computer 200 includes a CPU 201 that executes various calculation processes, an input device 202 that accepts data input from the user, and a display 203.
  • The computer 200 also includes a communication device 204 that exchanges data with an external device or the like via a wired or wireless network, and an interface device 205.
  • The computer 200 also includes a RAM 206 that temporarily stores various information, and a hard disk device 207. The devices 201 to 207 are connected to a bus 208.
  • The hard disk device 207 has an acquisition program 207a, a preprocessing program 207b, and a search program 207c. The CPU 201 reads each of the programs 207a to 207c and loads it into the RAM 206.
  • The acquisition program 207a functions as an acquisition process 206a.
  • The preprocessing program 207b functions as a preprocessing process 206b.
  • The search program 207c functions as a search process 206c.
  • The processing of the acquisition process 206a corresponds to the processing of the acquisition unit 151.
  • The processing of the preprocessing process 206b corresponds to the processing of the preprocessing unit 152.
  • The processing of the search process 206c corresponds to the processing of the search unit 153.
  • Note that each of the programs 207a to 207c does not necessarily have to be stored in the hard disk device 207 from the beginning.
  • For example, each program may be stored in a "portable physical medium" such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card that is inserted into the computer 200. The computer 200 may then read and execute each of the programs 207a to 207c.
  • Reference signs: 40 word vector dictionary; 50 text DB; 60 word cluster dictionary; 70 sentence cluster dictionary; 80 transposed index (inverted index); 100 information processing device; 110 communication unit; 120 input unit; 130 display unit; 140 storage unit; 150 control unit; 151 acquisition unit; 152 preprocessing unit; 153 search unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/JP2022/027902 2022-07-15 2022-07-15 情報処理プログラム、情報処理方法および情報処理装置 Ceased WO2024013991A1 (ja)

Priority Applications (5)

Application Number Priority Date Filing Date Title
AU2022469863A AU2022469863B2 (en) 2022-07-15 2022-07-15 Information processing program, information processing method, and information processing device
JP2024533482A JP7800692B2 (ja) 2022-07-15 2022-07-15 情報処理プログラム、情報処理方法および情報処理装置
PCT/JP2022/027902 WO2024013991A1 (ja) 2022-07-15 2022-07-15 情報処理プログラム、情報処理方法および情報処理装置
EP22951199.3A EP4557126A4 (en) 2022-07-15 2022-07-15 INFORMATION PROCESSING PROGRAM, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING DEVICE
US19/000,417 US20250131194A1 (en) 2022-07-15 2024-12-23 Computer-readable recording medium storing information processing program, information processing method, and information processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/027902 WO2024013991A1 (ja) 2022-07-15 2022-07-15 情報処理プログラム、情報処理方法および情報処理装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US19/000,417 Continuation US20250131194A1 (en) 2022-07-15 2024-12-23 Computer-readable recording medium storing information processing program, information processing method, and information processing device

Publications (1)

Publication Number Publication Date
WO2024013991A1 (ja) 2024-01-18

Family

ID=89536262

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/027902 Ceased WO2024013991A1 (ja) 2022-07-15 2022-07-15 情報処理プログラム、情報処理方法および情報処理装置

Country Status (5)

Country Link
US (1) US20250131194A1
EP (1) EP4557126A4
JP (1) JP7800692B2
AU (1) AU2022469863B2
WO (1) WO2024013991A1

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118312542A (zh) * 2024-06-07 2024-07-09 安徽南瑞中天电力电子有限公司 光伏逆变器通信协议查找方法、装置和电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016207141A (ja) * 2015-04-28 2016-12-08 ヤフー株式会社 要約生成装置、要約生成方法、及び要約生成プログラム
JP2019101993A (ja) 2017-12-07 2019-06-24 富士通株式会社 特定プログラム、特定方法および情報処理装置
JP2019211884A (ja) * 2018-06-01 2019-12-12 国立大学法人鳥取大学 情報検索システム
JP2020004156A (ja) * 2018-06-29 2020-01-09 富士通株式会社 分類方法、装置、及びプログラム
WO2020095357A1 (ja) 2018-11-06 2020-05-14 データ・サイエンティスト株式会社 検索ニーズ評価装置、検索ニーズ評価システム、及び検索ニーズ評価方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678832B2 (en) * 2017-09-29 2020-06-09 Apple Inc. Search index utilizing clusters of semantically similar phrases
US11269665B1 (en) 2018-03-28 2022-03-08 Intuit Inc. Method and system for user experience personalization in data management systems using machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4557126A4

Also Published As

Publication number Publication date
US20250131194A1 (en) 2025-04-24
AU2022469863B2 (en) 2026-04-16
JPWO2024013991A1 2024-01-18
EP4557126A4 (en) 2025-08-20
AU2022469863A1 (en) 2025-01-16
JP7800692B2 (ja) 2026-01-16
EP4557126A1 (en) 2025-05-21

Similar Documents

Publication Publication Date Title
Climer et al. Rearrangement Clustering: Pitfalls, Remedies, and Applications.
KR20200032258A (ko) 일정한 처리 시간 내에 k개의 극값을 찾는 방법
JP7367754B2 (ja) 特定方法および情報処理装置
CN103853802B (zh) 用于索引电子内容的装置和方法
CN109284497B (zh) 用于识别自然语言的医疗文本中的医疗实体的方法和装置
US20140081982A1 (en) Method and Computer for Indexing and Searching Structures
JP2019086995A (ja) 類似性指標値算出装置、類似検索装置および類似性指標値算出用プログラム
JP5194818B2 (ja) データ分類方法およびデータ処理装置
WO2024013991A1 (ja) 情報処理プログラム、情報処理方法および情報処理装置
CN112199958A (zh) 概念词序列生成方法、装置、计算机设备及存储介质
Corel et al. A min-cut algorithm for the consistency problem in multiple sequence alignment
JP7143752B2 (ja) 学習プログラム、学習方法および学習装置
JP2019046048A (ja) 特定プログラム、特定方法および情報処理装置
JP7626219B2 (ja) 情報処理プログラム、情報処理方法および情報処理装置
JP7643580B2 (ja) 処理方法、処理プログラムおよび情報処理装置
CN115146027A (zh) 文本向量化存储及检索方法、装置和计算机设备
US20210263923A1 (en) Information processing device, similarity calculation method, and computer-recording medium recording similarity calculation program
JPWO2019171538A1 (ja) 意味推定システム、方法およびプログラム
CN107967300B (zh) 机构名称的检索方法、装置、设备及存储介质
JPWO2019171537A1 (ja) 意味推定システム、方法およびプログラム
WO2021245926A1 (ja) 情報処理プログラム、情報処理方法および情報処理装置
JP6496025B2 (ja) 文書処理システム及び文書処理方法
JP2019125025A (ja) システム、文書データの管理方法、及びプログラム
WO2022264385A1 (ja) 検索方法、検索プログラムおよび情報処理装置
JP2009043023A (ja) 表示制御装置、表示制御方法、および、プログラム

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22951199; Country of ref document: EP; Kind code of ref document: A1
WWE WIPO information: entry into national phase. Ref document number: 2024533482; Country of ref document: JP
WWE WIPO information: entry into national phase. Ref document number: AU2022469863; Country of ref document: AU
ENP Entry into the national phase. Ref document number: 2022469863; Country of ref document: AU; Date of ref document: 20220715; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 2022951199; Country of ref document: EP
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2022951199; Country of ref document: EP; Effective date: 20250217
WWP WIPO information: published in national office. Ref document number: 2022951199; Country of ref document: EP