JP2023093349A - Information processing device and information processing method

Info

Publication number: JP2023093349A
Application number: JP2022194980A
Authority: JP
Inventors: 平王; Ping Wang; スヌ・リ; Li Sun; 留安汪; Liu An Wang; 俊孫; Shun Son
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-12-22
Filing date: 2022-12-06
Publication date: 2023-07-04
Also published as: CN116415587A

Abstract

To provide an information processing device and an information processing method. SOLUTION: An information processing device includes: a fifth text generation unit configured to generate a plurality of fifth texts corresponding to a processing waiting text using a text generation model; a candidate text selection unit configured to select, from among the plurality of fifth texts, a fifth text having a semantic matching degree with the processing waiting text equal to or greater than a predetermined matching degree as a candidate fifth text; a text similarity calculation unit configured to calculate, for each candidate fifth text, a text similarity between the candidate fifth text and each of other candidate fifth texts; and a target text selection unit configured to select, from among the candidate fifth texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity between each other as target texts for subsequent processing. SELECTED DRAWING: Figure 6

Description

The present invention relates to the field of information processing, and in particular to an information processing device and an information processing method.

Text is used in many applications in the field of information processing, and several techniques for processing text currently exist.

An object of the present invention is to provide an information processing device and an information processing method for processing text.

According to one aspect of the present invention, an information processing device is provided, comprising:
a second text generation unit configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
a candidate model selection unit configured to select a first predetermined number of text generation models from among the plurality of text generation models as candidate models, based on the degree of semantic matching between the plurality of second texts and the corresponding first text;
a similarity calculation unit configured to calculate, for each of the one or more predetermined first texts, the text similarity between the plurality of second texts that correspond to the first text and were generated using the candidate models, and to calculate the model similarity between the candidate models based on the text similarity between the second texts;
a target model selection unit configured to select, from among the candidate models, a second predetermined number of candidate models having the lowest model similarity to one another as target models; and
a fourth text generation unit configured to generate, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing.

According to another aspect of the present invention, an information processing device is provided, comprising:
a fifth text generation unit configured to generate a plurality of fifth texts corresponding to a pending text using a text generation model;
a candidate text selection unit configured to select, from among the plurality of fifth texts, fifth texts whose degree of semantic matching with the pending text is equal to or greater than a predetermined matching degree as candidate fifth texts;
a text similarity calculation unit configured to calculate, for each candidate fifth text, the text similarity between that candidate fifth text and each of the other candidate fifth texts; and
a target text selection unit configured to select, from among the candidate fifth texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity to one another as target texts for subsequent processing.

According to another aspect of the present invention, an information processing method is provided, comprising:
generating, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
selecting a first predetermined number of text generation models from among the plurality of text generation models as candidate models, based on the degree of semantic matching between the plurality of second texts and the corresponding first text;
calculating, for each of the one or more predetermined first texts, the text similarity between the plurality of second texts that correspond to the first text and were generated using the candidate models, and calculating the model similarity between the candidate models based on the text similarity between the second texts;
selecting, from among the candidate models, a second predetermined number of candidate models having the lowest model similarity to one another as target models; and
generating, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing.

According to other aspects of the present invention, there are further provided computer program code and a computer program product for implementing the above method according to the present invention, as well as a computer-readable storage medium storing computer program code for implementing the above method according to the present invention.

FIG. 1 is a block diagram of a functional configuration example of an information processing device according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of generating a second text using a back-translation method.
FIG. 3 is a block diagram of a functional configuration example of an information processing device according to a second embodiment of the present invention.
FIG. 4 is a diagram showing example results of target frame identification performed on the Charades-STA dataset using, respectively, the similarity sequence for the pending text, the average similarity sequence obtained by the video timing positioning unit, the median similarity sequence obtained by the video timing positioning unit, and a manual average similarity sequence.
FIG. 5 is a block diagram of a functional configuration example of an information processing device according to a third embodiment of the present invention.
FIG. 6 is a block diagram of a functional configuration example of an information processing device according to a fourth embodiment of the present invention.
FIG. 7 is a diagram showing example results of target frame identification performed on the Charades-STA dataset using, respectively, the similarity sequence for the pending text, the average similarity sequence obtained by the video timing positioning unit, and a manual average similarity sequence.
FIG. 8 is an exemplary flowchart of an information processing method according to a fifth embodiment of the present invention.
FIG. 9 is an exemplary flowchart of an information processing method according to a sixth embodiment of the present invention.
FIG. 10 is a block diagram of an exemplary configuration of a personal computer that may be employed in embodiments of the present invention.

Preferred embodiments for carrying out the present invention are described in detail below with reference to the accompanying drawings. Note that these embodiments are merely illustrative and do not limit the present invention.

First, an implementation example of an information processing device 100 according to a first embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram of a functional configuration example of the information processing device 100 according to the first embodiment of the present invention.

As shown in FIG. 1, the information processing device 100 according to the first embodiment of the present invention may include a second text generation unit 102, a candidate model selection unit 104, a similarity calculation unit 106, a target model selection unit 108, and a fourth text generation unit 110.

The second text generation unit 102 may be configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models. For example, each text generation model may be a standalone text generation model. Alternatively, the plurality of text generation models may each correspond to a different submodule of the same text generation model. For example, the second texts generated for the same first text may all belong to the same language.

The candidate model selection unit 104 may be configured to select a first predetermined number of text generation models from among the plurality of text generation models as candidate models, based on the degree of semantic matching between the plurality of second texts generated by the second text generation unit 102 and the corresponding first texts. For example, the first predetermined number may be set according to actual needs. For example, an existing model such as Sentence-BERT may be used to determine the degree of semantic matching between each second text and the corresponding first text.

As an example, the candidate model selection unit 104 may select, as candidate models, text generation models whose generated second texts have a relatively high degree of semantic matching with the corresponding first texts.

For example, the candidate model selection unit 104 may calculate, for each text generation model, the average degree of semantic matching between each second text generated using that model and the corresponding first text, and then select, from among the plurality of text generation models, the first predetermined number of models with the highest average degree of semantic matching as candidate models.
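The averaging-and-ranking step described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_candidate_models`, the dictionary layout, and the score values are assumptions, and the semantic matching scores are taken as already computed (for example by a Sentence-BERT style model).

```python
from typing import Dict, List


def select_candidate_models(match_scores: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k models with the highest average semantic matching degree.

    match_scores maps a (hypothetical) model name to the matching degrees
    between each second text it generated and the corresponding first text.
    """
    averages = {
        name: sum(scores) / len(scores)
        for name, scores in match_scores.items()
        if scores  # skip models that produced no scored text
    }
    # Sort model names by average score, highest first, and keep the top k.
    return sorted(averages, key=lambda name: averages[name], reverse=True)[:k]


# Illustrative scores only; k plays the role of the first predetermined number.
scores = {"A": [0.9, 0.8], "B": [0.5, 0.6], "C": [0.7, 0.95]}
candidates = select_candidate_models(scores, k=2)
```

With these made-up scores, models A (average 0.85) and C (average 0.825) are kept and model B is dropped.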

The similarity calculation unit 106 may be configured to calculate, for each of the one or more predetermined first texts, the text similarity between the plurality of second texts that correspond to the first text and were generated by the candidate models selected by the candidate model selection unit 104, and to calculate the model similarity between the candidate models based on the text similarity between the second texts.

The target model selection unit 108 may be configured to select, from among the candidate models, a second predetermined number of candidate models having the lowest model similarity to one another as target models. For example, the second predetermined number may be set according to actual needs.

The fourth text generation unit 110 may be configured to generate, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing. When there are multiple pending texts, the fourth text generation unit 110 can generate, for each pending text, a second predetermined number of fourth texts corresponding to that pending text. For example, the fourth texts generated for the same pending text may all belong to the same language.

As described above, the information processing device 100 according to the first embodiment of the present invention can select target models from among a plurality of text generation models by taking into account both the degree of semantic matching between the first texts and the second texts generated using the plurality of text generation models, and the text similarity between the second texts. Using the target models therefore makes it possible to generate fourth texts that have an appropriate degree of semantic matching with the pending text, for example fourth texts whose meaning is relatively close to that of the pending text. Such fourth texts may be referred to as high-quality fourth texts. In addition, the text similarity between the fourth texts can be relatively low, which yields relatively high diversity. In other words, the information processing device 100 can generate fourth texts that are both high in quality and high in diversity.

Furthermore, by selecting the target models as described above and generating the fourth texts with them, the information processing device 100 can reduce noise caused by inaccurate semantic matching. Here, "noise" refers to fourth texts whose meaning is not very close to that of the pending text.

As an example, the pending text and the first text may belong to the same language, such as Chinese, English, or Japanese, but are not limited to these.

As another example, the pending text and the first text may belong to different languages. In such a case, for example, the fourth text generation unit 110 may first convert the pending text into the same language as the first text and then generate the second predetermined number of fourth texts using the target models.

For example, the second text and the fourth text may belong to the same language, such as Chinese, English, or Japanese, but are not limited to these.

As an example, the first text, the second text, and the fourth text may belong to the same language. In such a case, for example, the second text generation unit 102 can generate the second text using a back-translation method. For example, the second text generation unit 102 can convert the first text into text in a language (hereinafter also referred to as the "second language") different from the language of the first text (hereinafter also referred to as the "first language"), and then convert the converted text back into the first language to obtain the second text. For example, FIG. 2 shows an example of generating a second text using the back-translation method, in which the first language is English and the second language is Chinese. As shown in FIG. 2, the second text generation unit 102 can convert the English first text "The weather is good" into the Chinese text "好天気", and then convert the Chinese text "好天気" into the English second text "Good weather".

Although FIG. 2 shows an example in which the second text generation unit 102 generates a single second text, the second text generation unit 102 may, according to actual needs, convert the first text into texts in a plurality of different second languages (for example, Japanese, German, and Spanish) and then convert the converted texts back into the first language to obtain a plurality of second texts.
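The back-translation flow described above can be sketched as below. The sketch assumes each translation direction is available as a plain callable from text to text; the `Translator` alias, the `back_translate` function, and the dictionary-backed toy translators are hypothetical stand-ins for real translation models, which the text does not name.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a translator is any callable mapping text to text.
Translator = Callable[[str], str]


def back_translate(
    first_text: str,
    pivots: Dict[str, Tuple[Translator, Translator]],
) -> List[str]:
    """Generate one second text per pivot (second) language.

    pivots maps a second-language label to a pair of translators:
    (first language -> second language, second language -> first language).
    """
    second_texts = []
    for _language, (to_pivot, from_pivot) in pivots.items():
        pivot_text = to_pivot(first_text)             # first language -> second language
        second_texts.append(from_pivot(pivot_text))   # second language -> first language
    return second_texts


# Toy lookup-table translators reproducing the FIG. 2 example (assumed data).
def en_to_zh(text: str) -> str:
    return {"The weather is good": "好天気"}[text]


def zh_to_en(text: str) -> str:
    return {"好天気": "Good weather"}[text]


second_texts = back_translate("The weather is good", {"zh": (en_to_zh, zh_to_en)})
```

Adding further entries to `pivots` (for example Japanese, German, or Spanish translators) yields one second text per second language, which matches the multi-language variant described above.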

Also, for example, the fourth text generation unit 110 may generate the fourth texts using a back-translation method. For example, the fourth text generation unit 110 may generate one or more fourth texts in a manner similar to that of the second text generation unit 102 described above.

For example, each text generation model may be a standalone text translation model; for example, one text generation model may correspond to a text translation model for converting between text in the first language and text in one second language.

As an example, the similarity calculation unit 106 can calculate the text similarity between any two second texts as follows: split each of the two second texts into a set of words and/or phrases, obtain the intersection and the union of the resulting sets, and take the ratio of the number of words and phrases in the intersection (that is, the sum of the number of words and the number of phrases) to the number of words and phrases in the union (likewise, the sum of the number of words and the number of phrases) as the text similarity between the two second texts.

Note that a "word" as used here may denote a single word, for example one English word, one Chinese character, or one Japanese word. A "phrase" as used here may denote a combination of two or more words.

The above text similarity calculation is further explained below using an example in which the second texts are English texts. For example, suppose the second texts generated by the second text generation unit 102 include a first English phrase "a man was standing in the bathroom holding glasses" and a second English phrase "a person is standing in the bathroom holding a glass". The similarity calculation unit 106 can split the first English phrase into a first word set {a, man, was, standing, in, the, bathroom, holding, glasses} and the second English phrase into a second word set {a, person, is, standing, in, the, bathroom, holding, a, glass}. The intersection of the first word set and the second word set is {a, man, standing, in, the, bathroom, holding}, which contains 7 words, and the union of the first word set and the second word set is {a, man, was, is, standing, in, the, bathroom, holding, glasses, glass}, which contains 11 words. The similarity calculation unit 106 can use the ratio (that is, 7/11) of the number of words in the intersection (that is, 7) to the number of words in the union (that is, 11) as the text similarity between the first English phrase and the second English phrase.

For example, when the second text is English text containing both uppercase and lowercase letters, the similarity calculation unit 106 can convert the second text to all uppercase or all lowercase before splitting it; of course, the similarity calculation unit 106 may instead split the second text first and then convert the resulting words to uppercase or lowercase. As those skilled in the art will understand, such letter-case conversion for English text is equally applicable to text in other languages, such as German, Spanish, and French.
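The word-set similarity and the case normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions: texts are split on whitespace into single words only (no phrases), and the function name `text_similarity` is hypothetical. The example uses "a man is standing in the bathroom holding a glass" as the second phrase, since that reading is the one consistent with the 7-word intersection and 11-word union given in the worked example; this substitution is an assumption made for the sketch.

```python
def text_similarity(text_a: str, text_b: str) -> float:
    """Ratio of shared words to distinct words across both texts."""
    # Lowercase before splitting so that "The" and "the" count as the same word.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty texts: define the similarity as 0 (assumption)
    return len(words_a & words_b) / len(union)


first_phrase = "a man was standing in the bathroom holding glasses"
second_phrase = "a man is standing in the bathroom holding a glass"

# 7 shared words out of 11 distinct words.
similarity = text_similarity(first_phrase, second_phrase)
```

Because sets discard repeated words, the second "a" in the second phrase does not change the counts.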

As an example, the similarity calculation unit 106 can calculate the model similarity between any two candidate models as follows: for each of the one or more predetermined first texts, obtain the text similarity between the second texts corresponding to that first text produced by the two candidate models, and then calculate the average of the text similarities over the one or more predetermined first texts as the model similarity between the two candidate models. For example, suppose that the one or more predetermined first texts include text_1, text_2, ..., text_m (where m is a natural number greater than 0 denoting the number of first texts); that the text similarity between the second text of text_1 obtained using candidate model A and the second text of text_1 obtained using candidate model B is s_1; that the text similarity between the second text of text_2 obtained using candidate model A and the second text of text_2 obtained using candidate model B is s_2; and that the text similarity between the second text of text_m obtained using candidate model A and the second text of text_m obtained using candidate model B is s_m. In this case, the similarity calculation unit 106 can take the average of the text similarities s_1, s_2, ..., s_m corresponding to text_1, text_2, ..., text_m as the model similarity between candidate model A and candidate model B. Of course, the similarity calculation unit 106 may also calculate the model similarity between candidate model A and candidate model B in other ways based on the text similarities s_1, s_2, ..., s_m.
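The averaging of s_1, ..., s_m into a model similarity can be sketched as below. The function names and sample texts are illustrative assumptions; the per-text similarity is passed in as a callable, and a simple whitespace word-set ratio is included only so that the sketch is self-contained.

```python
from typing import Callable, Sequence


def word_set_similarity(text_a: str, text_b: str) -> float:
    """Shared words divided by distinct words (whitespace split, lowercased)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def model_similarity(
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    similarity: Callable[[str, str], float] = word_set_similarity,
) -> float:
    """Average text similarity between two models' outputs.

    outputs_a[i] and outputs_b[i] are the second texts that candidate
    models A and B produced for the same first text text_i.
    """
    if not outputs_a or len(outputs_a) != len(outputs_b):
        raise ValueError("both models need one output per first text")
    scores = [similarity(a, b) for a, b in zip(outputs_a, outputs_b)]  # s_1 .. s_m
    return sum(scores) / len(scores)


# Illustrative outputs for m = 2 first texts: s_1 = 1.0 and s_2 = 3/5.
outputs_model_a = ["good weather", "a man is standing"]
outputs_model_b = ["good weather", "a person is standing"]
similarity_ab = model_similarity(outputs_model_a, outputs_model_b)
```

Here the model similarity is the mean of 1.0 and 0.6. Computing it for every pair of candidate models gives the entries of a pairwise model similarity table.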

As an example, the target model selection unit 108 can use a determinantal point process (see, for example, Chen L, Zhang G, Zhou E. Fast greedy MAP inference for determinantal point process to improve recommendation diversity [J]. Advances in Neural Information Processing Systems, 2018, 31) to select, from among the candidate models, the second predetermined number of candidate models having the lowest model similarity to one another as target models. For example, the target model selection unit 108 can construct an N×N matrix SS based on the model similarities between the candidate models, where each element of SS indicates the model similarity between the corresponding candidate models; for example, SS[i, j] represents the similarity between the i-th candidate model and the j-th candidate model. Here, N is a natural number greater than 0 denoting the number of candidate models (that is, the first predetermined number), and i and j are natural numbers greater than 0 and not greater than N. The target model selection unit 108 can then use the determinantal point process to find the M×M maximum-determinant submatrix of the N×N matrix SS, where the M×M maximum-determinant submatrix corresponds to the second predetermined number of candidate models having the lowest model similarity to one another. Here, M is a natural number greater than 0 and less than N denoting the second predetermined number.
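One way to approximate the maximum-determinant submatrix search is a greedy selection that adds, at each step, the candidate model that maximizes the determinant of the selected principal submatrix. The sketch below is a naive illustration that recomputes determinants from scratch; it is not the fast Cholesky-based algorithm of the cited paper, and greedy selection is not guaranteed to find the exact maximum. The function names, the 0-based indices, and the assumption that SS is symmetric with self-similarity 1 on the diagonal are all choices made for this sketch.

```python
from typing import List, Sequence


def determinant(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular (or numerically singular) matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def greedy_max_det_subset(ss: Sequence[Sequence[float]], m: int) -> List[int]:
    """Greedily pick m indices whose principal submatrix of ss has a large determinant."""
    if not 0 < m <= len(ss):
        raise ValueError("m must be between 1 and the number of candidate models")
    selected: List[int] = []
    remaining = list(range(len(ss)))
    for _ in range(m):
        best_idx, best_det = remaining[0], float("-inf")
        for idx in remaining:
            trial = selected + [idx]
            sub = [[ss[i][j] for j in trial] for i in trial]
            d = determinant(sub)
            if d > best_det:
                best_idx, best_det = idx, d
        selected.append(best_idx)
        remaining.remove(best_idx)
    return sorted(selected)


# Models 0 and 1 are near-duplicates (similarity 0.9); model 2 differs from both.
ss = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
target_models = greedy_max_det_subset(ss, m=2)
```

In this toy matrix the pair (0, 2) has determinant 0.99 while the pair (0, 1) has determinant 0.19, so the greedy search keeps models 0 and 2, the least similar pair.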

An information processing device 300 according to a second embodiment of the present invention is described below with reference to FIG. 3. FIG. 3 is a block diagram of a functional configuration example of the information processing device 300 according to the second embodiment of the present invention.

As shown in FIG. 3, the information processing device 300 according to the second embodiment of the present invention may include a second text generation unit 302, a candidate model selection unit 304, a similarity calculation unit 306, a target model selection unit 308, a fourth text generation unit 310, and a video timing positioning unit (also called a video positioning unit) 312. The second text generation unit 302, the candidate model selection unit 304, the similarity calculation unit 306, the target model selection unit 308, and the fourth text generation unit 310 are similar to the second text generation unit 102, the candidate model selection unit 104, the similarity calculation unit 106, the target model selection unit 108, and the fourth text generation unit 110 included in the information processing device 100 described with reference to FIGS. 1 and 2, so a detailed description of them is omitted here.

For example, the pending text may be text entered by a user, or text obtained by converting speech or an image entered by a user; for example, the pending text can indicate an object, an event, or the like that the user wishes to identify in a given video. Based on the pending text and the second predetermined number of fourth texts generated by the fourth text generation unit 310, the video timing positioning unit 312 can identify, in the given video, the position of the frame corresponding to the pending text (hereinafter also referred to as the "target frame"). Because the video timing positioning unit 312 can perform identification on the given video using the enhanced text (that is, the pending text and the fourth texts), identification accuracy can be improved.

For example, the video timing positioning unit 312 can use a trained multimodal model to identify, in the given video, the position of the frame corresponding to the pending text. For example, for each of the pending text and the fourth texts, the video timing positioning unit 312 uses the trained multimodal model to calculate the similarity between each frame of the given video and that text, thereby obtaining M+1 similarity sequences, and then determines the position of the frame corresponding to the pending text in the given video based on the M+1 similarity sequences. Here, M is a natural number greater than 0 and represents the number of fourth texts. For example, the video timing positioning unit 312 can average the M+1 similarity sequences to obtain an average similarity sequence, and identify the position of the frame corresponding to the pending text in the given video based on the average similarity sequence. Alternatively, for example, the video timing positioning unit 312 can determine the median of the peak values of the M+1 similarity sequences, and identify the position of the frame corresponding to the pending text in the given video based on the similarity sequence corresponding to that median (hereinafter also referred to as the "median similarity sequence").
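The two ways of combining the M+1 similarity sequences can be illustrated with a minimal sketch. It is not the implementation of the video timing positioning unit 312: the function names are hypothetical, the per-frame similarities are made-up numbers rather than outputs of a multimodal model, taking the highest-scoring frame as the identified position is an assumption, and using the lower median when M+1 is even is also an assumption (the description does not specify either point).

```python
from statistics import median_low

def average_sequence(sequences):
    """Element-wise mean of equally long per-frame similarity sequences."""
    n = len(sequences)
    return [sum(values) / n for values in zip(*sequences)]

def median_peak_sequence(sequences):
    """Return the sequence whose peak value is the median of all peak values."""
    peaks = [max(seq) for seq in sequences]
    # median_low always returns one of the actual peaks (assumption for even counts).
    target = median_low(peaks)
    return sequences[peaks.index(target)]

def locate_frame(sequence):
    """Index of the highest-scoring frame (a simple argmax, assumed for illustration)."""
    return max(range(len(sequence)), key=sequence.__getitem__)

# M + 1 = 3 sequences over 4 frames: the pending text plus two fourth texts.
seqs = [
    [0.1, 0.8, 0.3, 0.2],  # pending text
    [0.2, 0.6, 0.4, 0.1],  # fourth text 1
    [0.9, 0.1, 0.2, 0.1],  # fourth text 2 (a noisy outlier)
]
avg = average_sequence(seqs)
med = median_peak_sequence(seqs)
print(locate_frame(avg), locate_frame(med))
```

In this toy input the outlier sequence pulls the average toward frame 0, whereas the median-peak choice ignores it entirely, which is one way to picture the noise argument made for the median similarity sequence.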

FIG. 4 shows example results of target frame identification on the Charades-STA dataset using, respectively, the similarity sequence for the pending text, the average similarity sequence obtained by the video timing positioning unit 312, the median similarity sequence obtained by the video timing positioning unit 312, and a manual average similarity sequence. Here, the manual average similarity sequence is the average similarity sequence obtained by averaging the similarity sequence for the pending text together with the similarity sequences between the given video and M texts corresponding to the pending text, where the M texts are generated by M manually selected text generation models (M = 10 in the example shown in FIG. 4).

In FIG. 4, "IoU 0.5 R@1" denotes the recall rate determined by treating an identification result as correct when the IoU (Intersection over Union) between the best identification result and the ground truth is greater than 0.5. As shown in FIG. 4, the recall rate based on the median similarity sequence obtained by the video timing positioning unit 312 is about 3.01% higher than the recall rate based on the similarity sequence for the pending text, and about 2.01% higher than the recall rate based on the manual average similarity sequence. Likewise, the recall rate based on the average similarity sequence obtained by the video timing positioning unit 312 is about 1.83% higher than the recall rate based on the similarity sequence for the pending text, and about 0.83% higher than the recall rate based on the manual average similarity sequence.
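As a rough illustration of the metric, the following sketch computes a temporal IoU and the resulting top-1 recall. Representing each identification result and each ground truth as a (start, end) span in seconds is an assumption made for the example, and the spans themselves are invented; they are not taken from FIG. 4.

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) time spans."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, truths, threshold=0.5):
    """Fraction of queries whose best result has an IoU greater than the threshold."""
    hits = sum(temporal_iou(p, t) > threshold for p, t in zip(top1_preds, truths))
    return hits / len(truths)

# Hypothetical best results and ground-truth spans for three queries.
preds = [(2.0, 8.0), (0.0, 4.0), (10.0, 12.0)]
truths = [(3.0, 9.0), (5.0, 9.0), (10.0, 13.0)]
print(recall_at_1(preds, truths))
```

Here the first and third results overlap their ground truth with an IoU above 0.5 and the second does not overlap at all, so two of the three queries count as correct.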

In addition, as shown in FIG. 4, the recall rate based on the median similarity sequence obtained by the video timing positioning unit 312 is about 1.18% higher again than the recall rate based on the average similarity sequence obtained by the video timing positioning unit 312. This is because the median similarity sequence can reduce the influence of noise further than the average similarity sequence can.

As an example, as shown in FIG. 3, the information processing apparatus 300 may further include a candidate text selection unit 314 and a target text selection unit 316.

The candidate text selection unit 314 may be configured to select, as candidate texts, a plurality of fourth texts from among the second predetermined number of fourth texts whose semantic matching degree with the pending text is equal to or greater than a predetermined matching degree.

The target text selection unit 316 may be configured to select, as target texts, a third predetermined number of candidate texts having the lowest text similarity to one another from among the candidate texts selected by the candidate text selection unit 314. The third predetermined number may be set according to actual needs.

Through the operations of the candidate text selection unit 314 and the target text selection unit 316 described above, target texts having lower similarity to one another can, for example, be further selected from among the fourth texts, which can further improve the diversity of the target texts.

When the information processing apparatus 300 includes the candidate text selection unit 314 and the target text selection unit 316, the video timing positioning unit 312 identifies, in the given video, the position of the frame corresponding to the pending text based on the pending text and the target texts, which can, for example, further improve the identification accuracy.

For example, the target text selection unit 316 can use a determinantal point process to select, as the target texts, the third predetermined number of candidate texts having the lowest text similarity to one another among the candidate texts. For example, the target text selection unit 316 can construct a matrix in a manner similar to that described above for the target model selection unit 108, and then use the determinantal point process to find the L × L maximum-determinant submatrix of the constructed matrix, where the L × L maximum-determinant submatrix corresponds to the third predetermined number of candidate texts having the lowest text similarity to one another among the candidate texts. Here, L is a natural number greater than 0 and represents the third predetermined number.
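The maximum-determinant selection can be sketched as follows. The matrix layout is an assumption, since the construction used for the target model selection unit 108 is described elsewhere: here the matrix is taken to be symmetric, with ones on the diagonal and pairwise text similarities off the diagonal. The search is an exhaustive scan over all index subsets, which only suits small inputs and stands in for whatever determinantal point process inference a real system would use. All names and values are hypothetical.

```python
from itertools import combinations

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def most_diverse_subset(similarity, size):
    """Indices of the size x size principal submatrix with the largest determinant."""
    def sub_det(idx):
        return determinant([[similarity[i][j] for j in idx] for i in idx])
    return max(combinations(range(len(similarity)), size), key=sub_det)

# Assumed similarity matrix for four candidate texts.
S = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(most_diverse_subset(S, 2))  # (0, 3)
```

For L = 2 the determinant of a submatrix is 1 minus the squared similarity of the pair, so the largest determinant picks the least similar pair, here texts 0 and 3 with similarity 0.1.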

Note that the candidate text selection unit 314 and the target text selection unit 316 are drawn with dashed boxes in FIG. 3, which means that in some embodiments the information processing apparatus 300 need not include the candidate text selection unit 314 and the target text selection unit 316.

An information processing apparatus 400 according to a third embodiment of the present invention is described below with reference to FIG. 5. FIG. 5 is a block diagram of a functional configuration example of the information processing apparatus 400 according to the third embodiment of the present invention.

As shown in FIG. 5, the information processing apparatus 400 according to the third embodiment of the present invention may include a second text generation unit 402, a candidate model selection unit 404, a similarity calculation unit 406, a target model selection unit 408, a fourth text generation unit 410, and a multimodal model training unit 420. The second text generation unit 402, the candidate model selection unit 404, the similarity calculation unit 406, the target model selection unit 408, and the fourth text generation unit 410 are similar to the second text generation unit 102, the candidate model selection unit 104, the similarity calculation unit 106, the target model selection unit 108, and the fourth text generation unit 110 included in the information processing apparatus 100 of the first embodiment described above with reference to FIGS. 1 and 2, so a detailed description thereof is omitted here.

For example, the multimodal model training unit 420 may be configured to train a multimodal model for video timing positioning based on the pending text and the second predetermined number of fourth texts, thereby obtaining a trained multimodal model. This can improve, for example, the identification accuracy and robustness of the trained multimodal model.
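One plausible reading of training "based on the pending text and the fourth texts" is a form of text augmentation, sketched below. The sample layout (query text, video identifier, time span), the function name, and the rule that each generated text inherits the label of the text it was generated from are all assumptions made for illustration; the description does not spell out how the training data is assembled.

```python
def augment_training_set(samples, paraphrases):
    """Return the original samples plus one extra sample per generated text.

    samples: list of (query_text, video_id, (start_sec, end_sec)) tuples.
    paraphrases: dict mapping a query text to its generated fourth texts.
    """
    augmented = []
    for query, video_id, span in samples:
        augmented.append((query, video_id, span))
        # Assumption: a generated text describes the same moment, so it reuses the label.
        for text in paraphrases.get(query, []):
            augmented.append((text, video_id, span))
    return augmented

samples = [("person opens the door", "video_001", (2.0, 7.5))]
paraphrases = {"person opens the door": ["someone opens a door", "a door is opened"]}
for sample in augment_training_set(samples, paraphrases):
    print(sample)
```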

As an example, as shown in FIG. 5, the information processing apparatus 400 may further include a candidate text selection unit 414 and a target text selection unit 416. The candidate text selection unit 414 and the target text selection unit 416 are similar to the candidate text selection unit 314 and the target text selection unit 316 described above with reference to FIG. 3, so a detailed description thereof is omitted here.

For example, the multimodal model training unit 420 may be configured to train a multimodal model for video timing positioning based on the pending text and the target texts selected by the candidate text selection unit 414, thereby obtaining a trained multimodal model. This can further improve, for example, the identification accuracy and robustness of the trained multimodal model.

Note that the candidate text selection unit 414 and the target text selection unit 416 are drawn with dashed boxes in FIG. 5, which means that in some embodiments the information processing apparatus 400 need not include the candidate text selection unit 414 and the target text selection unit 416.

An information processing apparatus 500 according to a fourth embodiment of the present invention is described below with reference to FIG. 6. FIG. 6 is a block diagram of a functional configuration example of the information processing apparatus 500 according to the fourth embodiment of the present invention.

As shown in FIG. 6, the information processing apparatus 500 according to the fourth embodiment of the present invention may include a fifth text generation unit 502, a candidate text selection unit 504, a text similarity calculation unit 506, and a target text selection unit 508.

The fifth text generation unit 502 may be configured to use a text generation model to generate a plurality of fifth texts corresponding to the pending text. For example, the fifth text generation unit 502 can generate the plurality of fifth texts by a back-translation method. For example, the fifth text generation unit 502 can use a plurality of text generation models (for example, a plurality of text translation models) to generate the plurality of fifth texts. Alternatively, for example, the fifth text generation unit 502 can use a single text generation model (for example, a single text translation model) and generate the plurality of fifth texts by beam search.
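The multi-model back-translation variant can be pictured with the sketch below. The translators are lookup tables standing in for real translation models, and every name is hypothetical. Dropping outputs that merely echo the pending text or repeat an earlier output is a design choice of this sketch, not something the description requires.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate into a pivot language and back to obtain a rephrased text."""
    return from_pivot(to_pivot(text))

def generate_fifth_texts(pending_text, model_pairs):
    """One back-translated text per (to_pivot, from_pivot) pair, without echoes or repeats."""
    results = []
    for to_pivot, from_pivot in model_pairs:
        candidate = back_translate(pending_text, to_pivot, from_pivot)
        if candidate != pending_text and candidate not in results:
            results.append(candidate)
    return results

# Toy stand-ins for translation models (lookup tables, not real translators).
en_to_de = {"a person opens the door": "eine Person öffnet die Tür"}.get
de_to_en = {"eine Person öffnet die Tür": "someone opens the door"}.get
en_to_fr = {"a person opens the door": "une personne ouvre la porte"}.get
fr_to_en = {"une personne ouvre la porte": "a person is opening the door"}.get
identity = lambda text: text  # a "model" that returns its input unchanged

pairs = [(en_to_de, de_to_en), (en_to_fr, fr_to_en), (identity, identity)]
print(generate_fifth_texts("a person opens the door", pairs))
```

The third pair returns the pending text unchanged and is therefore filtered out, leaving two distinct fifth texts.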

The candidate text selection unit 504 may be configured to select, as candidate fifth texts, those fifth texts from among the plurality of fifth texts generated by the fifth text generation unit 502 whose semantic matching degree with the pending text is equal to or greater than a predetermined matching degree.
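The threshold test can be sketched as a simple filter. How the semantic matching degree is computed is not restated in this passage, so the scorer is passed in as a parameter; `toy_match` below is a purely lexical word-overlap stand-in used only to make the example runnable and should not be read as the matching method of the candidate text selection unit 504.

```python
def select_candidates(pending_text, texts, match_fn, min_match):
    """Keep the texts whose matching degree with the pending text is >= min_match."""
    return [t for t in texts if match_fn(pending_text, t) >= min_match]

def toy_match(a, b):
    """Stand-in scorer: fraction of the words of `a` that also occur in `b`."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a) if words_a else 0.0

pending = "person opens the door"
fifth_texts = [
    "a person opens the door slowly",  # score 1.0
    "the door is opened by someone",   # score 0.5, kept because the test is >=
    "a dog runs outside",              # score 0.0
]
print(select_candidates(pending, fifth_texts, toy_match, min_match=0.5))
```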

The text similarity calculation unit 506 may be configured to calculate, for each candidate fifth text, the text similarity between that candidate fifth text and each of the other candidate fifth texts. For example, the text similarity calculation unit 506 can calculate the text similarity between candidate fifth texts using a method similar to the method by which the similarity calculation unit 106 in the first embodiment described above calculates text similarity.

The target text selection unit 508 may be configured to select, from among the candidate fifth texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity to one another as target texts for subsequent processing. For example, the target text selection unit 508 can use a determinantal point process to select, from among the candidate fifth texts, the fourth predetermined number of candidate fifth texts having the lowest text similarity to one another as the target texts for subsequent processing. For example, the fourth predetermined number can be set according to actual needs.

As described above, the information processing apparatus 500 according to the fourth embodiment of the present invention can select target texts from among the plurality of fifth texts by taking into account both the semantic matching degree between the pending text and the fifth texts generated using the text generation model and the text similarity between the fifth texts. Consequently, there is an appropriate semantic matching degree between the target texts and the pending text; for example, the meanings of the target texts and the pending text may be close to each other. In addition, the text similarity between the target texts may be relatively low, which makes their diversity relatively high. In other words, the information processing apparatus 500 makes it possible to obtain target texts with both high quality and high diversity.

For example, the pending text may include text input by a user, or text obtained by converting speech or an image input by the user.

As an example, as shown in FIG. 6, the information processing apparatus 500 may further include a video timing positioning unit 512. Based on the pending text and the target texts selected by the target text selection unit 508, the video timing positioning unit 512 can identify, in a given video, the position of the frame corresponding to the pending text. Because the video timing positioning unit 512 can perform identification on the given video using enhanced text (that is, the pending text and the target texts), the identification accuracy can be improved. For example, the video timing positioning unit 512 may be configured similarly to the video timing positioning unit 312 in the second embodiment described above, so a detailed description thereof is omitted here.

FIG. 7 shows example results of target frame identification on the Charades-STA dataset using, respectively, the similarity sequence for the pending text, the average similarity sequence obtained by the video timing positioning unit 512, and the manual average similarity sequence. In the example shown in FIG. 7, the number of target texts corresponding to one pending text is 10. As shown in FIG. 7, the recall rate based on the average similarity sequence obtained by the video timing positioning unit 512 is about 1.62% higher than the recall rate based on the similarity sequence for the pending text, and about 0.62% higher than the recall rate based on the manual average similarity sequence.

As another example, the pending text may include training text, and, as shown in FIG. 6, the information processing apparatus 500 may further include a multimodal model training unit 520. For example, the multimodal model training unit 520 may train a multimodal model for video timing positioning based on the pending text and the target texts selected by the target text selection unit 508, thereby obtaining a trained multimodal model. This can improve, for example, the identification accuracy and robustness of the trained multimodal model. For example, the multimodal model training unit 520 may be configured similarly to the multimodal model training unit 420 in the third embodiment described above, so a detailed description thereof is omitted here.

Note that the video timing positioning unit 512 and the multimodal model training unit 520 are drawn with dashed boxes in FIG. 6, which indicates that in some embodiments the information processing apparatus 500 need not include the video timing positioning unit 512 and/or the multimodal model training unit 520.

The information processing apparatuses according to the embodiments of the present invention have been described above. Corresponding to those apparatus embodiments, the present invention further provides the following embodiments of an information processing method.

FIG. 8 is an exemplary flowchart of an information processing method 600 according to an embodiment of the present invention. As shown in FIG. 8, the information processing method 600 according to the embodiment of the present invention may start at a start step S601 and end at an end step S612, and may include a second text generation step S602, a candidate model selection step S604, a similarity calculation step S606, a target model selection step S608, and a fourth text generation step S610.

In the second text generation step S602, for each of one or more predetermined first texts, a plurality of text generation models can be used to generate a plurality of second texts corresponding to that first text. For example, the second text generation step S602 can be performed by the second text generation units 102, 302, and 402 in the apparatus embodiments described above; for specific details, refer to the description of the second text generation units 102, 302, and 402 above, which is not repeated here.

In the candidate model selection step S604, a first predetermined number of text generation models can be selected as candidate models from among the plurality of text generation models, based on the semantic matching degree between the plurality of second texts generated in the second text generation step S602 and the corresponding first texts. For example, the first predetermined number can be set according to actual needs. Also, for example, the candidate model selection step S604 can be performed by the candidate model selection units 104, 304, and 404 in the apparatus embodiments described above; for specific details, refer to the description of the candidate model selection units 104, 304, and 404 above, which is not repeated here.
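A minimal sketch of one way step S604 could rank models is given below. Ranking by the mean matching degree over all first texts and keeping the top entries is an assumption; this passage only states that the selection is based on the semantic matching degree. The model names and scores are invented.

```python
def select_candidate_models(match_scores, first_number):
    """Pick the models with the highest mean matching degree.

    match_scores: dict mapping a model name to a list of matching degrees,
    one per first text (second text versus its first text).
    """
    mean = {model: sum(s) / len(s) for model, s in match_scores.items()}
    ranked = sorted(mean, key=mean.get, reverse=True)
    return ranked[:first_number]

# Hypothetical matching degrees of three models on two first texts.
scores = {
    "model_a": [0.92, 0.88],
    "model_b": [0.55, 0.60],
    "model_c": [0.81, 0.79],
}
print(select_candidate_models(scores, 2))  # ['model_a', 'model_c']
```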

In the similarity calculation step S606, for each of the one or more predetermined first texts, the text similarity between the plurality of second texts that correspond to that first text and were generated by the candidate models selected in the candidate model selection step S604 can be calculated, and the model similarity between the candidate models can then be calculated based on the text similarity between the second texts. For example, the similarity calculation step S606 can be performed by the similarity calculation units 106, 306, and 406 in the apparatus embodiments described above; for specific details, refer to the description of the similarity calculation units 106, 306, and 406 above, which is not repeated here.

In the target model selection step S608, a second predetermined number of candidate models having the lowest model similarity to one another can be selected as target models from among the candidate models. For example, the second predetermined number can be set according to actual needs. For example, the target model selection step S608 can be performed by the target model selection units 108, 308, and 408 in the apparatus embodiments described above; for specific details, refer to the description of the target model selection units 108, 308, and 408 above, which is not repeated here.

In the fourth text generation step S610, the target models can be used to generate a second predetermined number of fourth texts corresponding to the pending text for subsequent processing. For example, the fourth text generation step S610 can be performed by the fourth text generation units 110, 310, and 410 in the apparatus embodiments described above; for specific details, refer to the description of the fourth text generation units 110, 310, and 410 above, which is not repeated here.

Like the information processing apparatuses according to the embodiments of the present invention, the information processing method 600 according to the fifth embodiment of the present invention can also select target models from among the plurality of text generation models by taking into account both the semantic matching degree between the first texts and the second texts generated by the plurality of text generation models and the text similarity between the second texts. Consequently, the target models can be used to generate fourth texts having an appropriate semantic matching degree with the pending text, for example, fourth texts close in meaning to the pending text. In addition, the text similarity between the fourth texts may be relatively low, which makes their diversity relatively high. In other words, the information processing method 600 makes it possible to generate fourth texts with both high quality and high diversity.

As an example, in the similarity calculation step S606, the text similarity between any two second texts can be calculated as follows: each of the two second texts is split into a set of words and/or phrases; the intersection and the union of the two resulting sets are obtained; and the ratio of the number of words and phrases in the intersection (that is, the number of words plus the number of phrases) to the number of words and phrases in the union (that is, the number of words plus the number of phrases) is taken as the text similarity between the two second texts.
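The ratio of intersection size to union size is the Jaccard index, which can be sketched in a few lines. Splitting on whitespace and lower-casing are assumptions of this sketch: the description also allows phrases as set elements, which are not handled here, and returning 0.0 when both sets are empty is merely a convention chosen for the example.

```python
def text_similarity(text_a, text_b):
    """Jaccard index over the sets of lower-cased, whitespace-separated words."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# 4 shared words out of 6 distinct words in total.
print(text_similarity("a person opens the door", "a person closes the door"))
```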

For example, in the similarity calculation step S606, the model similarity between any two candidate models can be calculated as follows: for each of the one or more predetermined first texts, the text similarity between the second texts that correspond to that first text and were obtained by the two candidate models is obtained; and the average of the text similarities over the one or more predetermined first texts is then calculated as the model similarity between the two candidate models.
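The averaging step can be sketched as follows, assuming each candidate model produces exactly one second text per first text and that the two output lists are aligned by first text. The word-set Jaccard helper is the simplified whitespace version, and all names and example sentences are hypothetical.

```python
def jaccard(text_a, text_b):
    """Word-set Jaccard index, used here as the text similarity."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def model_similarity(outputs_a, outputs_b):
    """Mean text similarity between two models' outputs for the same first texts."""
    pairs = list(zip(outputs_a, outputs_b))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

# Second texts of two candidate models for the same two first texts.
outputs_a = ["a man opens the door", "she sits down"]
outputs_b = ["a man opens a window", "she sits down"]
print(model_similarity(outputs_a, outputs_b))  # 0.75
```

The first pair shares 3 of 6 distinct words (0.5) and the second pair is identical (1.0), giving a model similarity of 0.75.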

For example, in the target model selection step S608, a determinantal point process can be used to select, as the target models, the second predetermined number of candidate models having the lowest model similarity to one another among the candidate models.

As an example, the information processing method 600 may further include a video timing positioning step (not shown). In the video timing positioning step, the position of the frame corresponding to the pending text can be identified in a given video based on the pending text and the second predetermined number of fourth texts generated in the fourth text generation step S610. Because the video timing positioning step can perform identification on the given video using enhanced text (that is, the pending text and the fourth texts), the identification accuracy can be improved.

For example, the video timing positioning step can be performed by the video timing positioning unit 312 in the second embodiment described above; for specific details, refer to the description of the video timing positioning unit 312 above, which is not repeated here.

As another example, the information processing method 600 may further include a multimodal model training step (not shown). In the multimodal model training step, a multimodal model for video timing positioning can be trained based on the pending text and the second predetermined number of fourth texts to obtain a trained multimodal model, which can improve, for example, the identification accuracy and robustness of the trained multimodal model. For example, the multimodal model training step can be performed by the multimodal model training unit 420 in the third embodiment described above; for specific details, refer to the description of the multimodal model training unit 420 above, which is not repeated here.

一例として、情報処理方法６００はさらに、候補テキスト選択ステップ及び目標テキスト選択ステップ（図示せず）を含んでも良い。 As an example, the information processing method 600 may further include a candidate text selection step and a target text selection step (not shown).

候補テキスト選択ステップでは、第二所定数の第四テキストのうちから、処理待ちテキストとの語義マッチ度が所定マッチ度以上である複数の第四テキストを候補テキストとして選択できる。例えば、候補テキスト選択ステップは上述の装置の実施例における候補テキスト選択ユニット３１４及び４１４により実施され得るため、具体的な細部については上述の候補テキスト選択ユニット３１４及び４１４についての説明を参照でき、ここではその詳しい説明を省略する。 In the candidate text selection step, a plurality of fourth texts whose degree of semantic matching with the pending text is equal to or greater than a predetermined matching degree can be selected as candidate texts from among the second predetermined number of fourth texts. For example, the candidate text selection step can be performed by the candidate text selection units 314 and 414 in the apparatus embodiments described above; for specific details, refer to the description of the candidate text selection units 314 and 414 above. A detailed explanation is omitted here.

目標テキスト選択ステップでは、候補テキスト選択ステップで選択された候補テキストのうちから、互いの間のテキスト類似度が最も低い第三所定数の候補テキストを目標テキストとして選ぶことができる。例えば、目標テキスト選択ステップは上述の装置の実施例における目標テキスト選択ユニット３１６及び４１６により実施され得るため、具体的な細部については上述の目標テキスト選択ユニット３１６及び４１６についての説明を参照でき、ここではその詳しい説明を省略する。 In the target text selection step, a third predetermined number of candidate texts having the lowest text similarity to one another can be selected as target texts from among the candidate texts selected in the candidate text selection step. For example, the target text selection step can be performed by the target text selection units 316 and 416 in the apparatus embodiments described above; for specific details, refer to the description of the target text selection units 316 and 416 above. A detailed explanation is omitted here.

例えば、情報処理方法６００が候補テキスト選択ステップ及び目標テキスト選択ステップを含む場合、ビデオタイミング位置決めステップでは、処理待ちテキスト及び目標テキストに基づいて、所定のビデオから、処理待ちテキストに対応するフレームの位置を識別できるため、例えば、識別精度をさらに向上させることができる。 For example, if the information processing method 600 includes the candidate text selection step and the target text selection step, the video timing positioning step can identify the position of the frame corresponding to the pending text from the given video based on the pending text and the target texts, so that, for example, the identification accuracy can be further improved.

また、例えば、情報処理方法６００が候補テキスト選択ステップ及び目標テキスト選択ステップを含む場合、マルチモーダルモデル訓練ステップでは、処理待ちテキスト及び目標テキストに基づいて、ビデオタイミング位置決めのためのマルチモーダルモデルを訓練することで、訓練済みのマルチモーダルモデルを得ることができるので、例えば、訓練済みのマルチモーダルモデルの識別精度、ロバストネスなどをさらに向上させることができる。 Also, for example, if the information processing method 600 includes a candidate text selection step and a target text selection step, the multimodal model training step trains a multimodal model for video timing positioning based on the pending text and the target text. By doing so, a trained multimodal model can be obtained, so that, for example, the discrimination accuracy, robustness, etc. of the trained multimodal model can be further improved.

以下、図９を参照しながら本発明の第六実施例における情報処理方法７００を説明する。図９は本発明の実施例における情報処理方法７００の例示的なフローチャートである。図９に示すように、本発明の実施例における情報処理方法７００はスタートステップＳ７０１で開始し、エンドステップＳ７１２で終了しても良く、また、第五テキスト生成ステップＳ７０２、候補テキスト選択ステップＳ７０４、テキスト類似度計算ステップＳ７０６及び目標テキスト選択ステップＳ７０８を含んでも良い。 The information processing method 700 according to the sixth embodiment of the present invention will now be described with reference to FIG. FIG. 9 is an exemplary flowchart of an information processing method 700 in accordance with an embodiment of the present invention. As shown in FIG. 9, the information processing method 700 in an embodiment of the present invention may start with a start step S701 and end with an end step S712, and also include a fifth text generation step S702, a candidate text selection step S704, A text similarity calculation step S706 and a target text selection step S708 may be included.

第五テキスト生成ステップＳ７０２では、テキスト生成モデルを用いて処理待ちテキストに対応する複数の第五テキストを生成できる。例えば、第五テキスト生成ステップＳ７０２は上述の第四実施例中の第五テキスト生成ユニット５０２により実施され得るため、具体的な細部については上述の第五テキスト生成ユニット５０２についての説明を参照でき、ここではその詳しい説明を省略する。 In the fifth text generation step S702, the text generation model can be used to generate a plurality of fifth texts corresponding to the pending texts. For example, the fifth text generation step S702 can be implemented by the fifth text generation unit 502 in the above fourth embodiment, so the specific details can refer to the description of the fifth text generation unit 502 above, A detailed description thereof is omitted here.

候補テキスト選択ステップＳ７０４では、第五テキスト生成ステップＳ７０２で生成された複数の第五テキストのうちから、処理待ちテキストとの語義マッチ度が所定マッチ度以上である第五テキストを候補第五テキストとして選択できる。例えば、候補テキスト選択ステップＳ７０４は上述の第四実施例における候補テキスト選択ユニット５０４により実施され得るため、具体的な細部については上述の候補テキスト選択ユニット５０４についての説明を参照でき、ここではその詳しい説明を省略する。 In the candidate text selection step S704, fifth texts whose degree of semantic matching with the pending text is equal to or greater than a predetermined matching degree can be selected as candidate fifth texts from among the plurality of fifth texts generated in the fifth text generation step S702. For example, the candidate text selection step S704 can be performed by the candidate text selection unit 504 in the fourth embodiment described above; for specific details, refer to the description of the candidate text selection unit 504 above. A detailed explanation is omitted here.

テキスト類似度計算ステップＳ７０６では、各候補第五テキストについて、該候補第五テキストと他の候補第五テキストのうちの各々とのテキスト類似度を計算できる。例えば、テキスト類似度計算ステップＳ７０６は上述の第四実施例中のテキスト類似度計算ユニット５０６により実施され得るため、具体的な細部については上述のテキスト類似度計算ユニット５０６についての説明を参照でき、ここではその詳しい説明を省略する。 In a text similarity calculation step S706, for each candidate fifth text, the text similarity between the candidate fifth text and each of the other candidate fifth texts can be calculated. For example, the text similarity calculation step S706 can be implemented by the text similarity calculation unit 506 in the above fourth embodiment, so the specific details can refer to the above description of the text similarity calculation unit 506, A detailed description thereof is omitted here.

目標テキスト選択ステップＳ７０８では、候補第五テキストのうちから、互いの間のテキスト類似度が最も低い第四所定数の候補第五テキストを、後続の処理のために、目標テキストとして選択することができる。例えば、目標テキスト選択ステップＳ７０８は上述の第四実施例の中の目標テキスト選択ユニット５０８により実施され得るので、具体的な細部については上述の目標テキスト選択ユニット５０８についての説明を参照でき、ここではその詳しい説明を省略する。 In the target text selection step S708, a fourth predetermined number of candidate fifth texts having the lowest text similarity to one another can be selected, from among the candidate fifth texts, as target texts for subsequent processing. For example, the target text selection step S708 can be performed by the target text selection unit 508 in the fourth embodiment described above; for specific details, refer to the description of the target text selection unit 508 above. A detailed explanation is omitted here.

上述のように、本発明の第六実施例による情報処理方法７００は、処理待ちテキストと、テキスト生成モデルを用いて生成された第五テキストとの間の語義マッチ度、及び第五テキストの間のテキスト類似度を考慮して、複数の第五テキストのうちから目標テキストを選択できる。よって、目標テキストと処理待ちテキストとの間には適切な語義マッチ度があり、例えば、目標テキストと処理待ちテキストとの意味は近くても良い。また、目標テキストの間のテキスト類似度が比較的低くなっても良く、これによって、多様性が比較的高くなる。言い換えれば、情報処理方法７００により、高品質及び高多様性を有する目標テキストを取得できる。 As described above, the information processing method 700 according to the sixth embodiment of the present invention can select target texts from among the plurality of fifth texts by considering both the degree of semantic matching between the pending text and the fifth texts generated using the text generation model, and the text similarity among the fifth texts. The target texts therefore have an appropriate degree of semantic matching with the pending text; for example, the target texts and the pending text may be close in meaning. In addition, the text similarity among the target texts can be relatively low, which makes their diversity relatively high. In other words, the information processing method 700 makes it possible to obtain target texts of both high quality and high diversity.
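
The selection flow of steps S704 to S708 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `semantic_match` stands in for whatever semantic matching model is used, word-set overlap stands in for the text similarity of step S706, and the exhaustive search over subsets is simply one straightforward way to find the least mutually similar candidates. It is not presented as the selection procedure of the embodiment.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def word_overlap(a: str, b: str) -> float:
    """Illustrative text similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_target_texts(
    pending: str,
    generated: Sequence[str],
    semantic_match: Callable[[str, str], float],  # hypothetical scoring function
    min_match: float,
    k: int,
) -> List[str]:
    # Candidate selection (S704): keep texts that match the pending text well enough.
    candidates = [t for t in generated if semantic_match(pending, t) >= min_match]
    if len(candidates) <= k:
        return candidates

    # Similarity calculation (S706) and target selection (S708): choose the
    # k candidates whose summed pairwise similarity is smallest.
    def total_similarity(subset: Sequence[str]) -> float:
        return sum(word_overlap(a, b) for a, b in combinations(subset, 2))

    return list(min(combinations(candidates, k), key=total_similarity))
```

The exhaustive search is only practical for small candidate sets; a larger set would need a greedy or determinant-based selection instead.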

一例として、情報処理方法７００はさらに、ビデオタイミング位置決めステップ（図示せず）を含み得る。ビデオタイミング位置決めステップでは、処理待ちテキスト及び目標テキストに基づいて、所定のビデオから、処理待ちテキストに対応するフレームの位置を認識できるので、例えば、識別精度を向上させることができる。例えば、ビデオタイミング位置決めステップは上述の第四実施例中のビデオタイミング位置決めユニット５１２により実施され得るため、具体的な細部については上述のビデオタイミング位置決めユニット５１２についての説明を参照でき、ここではその詳しい説明を省略する。 As an example, the information processing method 700 may further include a video timing positioning step (not shown). In the video timing positioning step, the position of the frame corresponding to the pending text can be identified from a given video based on the pending text and the target texts, so that, for example, the identification accuracy can be improved. For example, the video timing positioning step can be performed by the video timing positioning unit 512 in the fourth embodiment described above; for specific details, refer to the description of the video timing positioning unit 512 above. A detailed explanation is omitted here.

他の例として、情報処理方法７００はさらに、マルチモーダルモデル訓練ステップ（図示せず）を含んでも良い。例えば、マルチモーダルモデル訓練ステップでは、処理待ちテキスト及び目標テキストに基づいて、ビデオタイミング位置決めのためのマルチモーダルモデルを訓練することで、訓練済みのマルチモーダルモデルを得ることができ、これによって、例えば、訓練済みのマルチモーダルモデルの識別精度、ロバストネスなどを向上させることができる。例えば、マルチモーダルモデル訓練ステップは上述の第四実施例中のマルチモーダルモデル訓練ユニット５２０により実施され得るので、具体的な細部については上述のマルチモーダルモデル訓練ユニット５２０についての説明を参照でき、ここではその詳しい説明を省略する。 As another example, the information processing method 700 may further include a multimodal model training step (not shown). For example, in the multimodal model training step, a multimodal model for video timing positioning is trained based on the pending text and the target texts to obtain a trained multimodal model, which can improve, for example, the identification accuracy and robustness of the trained multimodal model. For example, the multimodal model training step can be performed by the multimodal model training unit 520 in the fourth embodiment described above; for specific details, refer to the description of the multimodal model training unit 520 above. A detailed explanation is omitted here.

なお、以上、本発明の実施例に係る情報処理装置及び情報処理方法の機能設定及び操作を説明したが、これらは例示に過ぎず、当業者は本発明の原理に基づいて上述の実施例に対して変更などを行うことができ、例えば、各実施例中の機能モジュール及び操作の増減、組み合わせなどを行うことができ、また、このような変更などは、すべて、本発明の範囲に属する。 Although the function settings and operations of the information processing apparatus and the information processing method according to the embodiments of the present invention have been described above, these are merely examples, and those skilled in the art will be able to modify the above-described embodiments based on the principles of the present invention. For example, functional modules and operations in each embodiment may be added, reduced, combined, etc., and all such modifications are within the scope of the present invention.

また、ここでの方法の実施例は上述の装置の実施例に対応するものであるので、方法の実施例で詳細に説明されない内容については装置の実施例中の対応する部分の説明を参照でき、ここではその詳しい説明を省略する。 In addition, since the method embodiments herein correspond to the apparatus embodiments described above, the corresponding descriptions in the apparatus embodiments can be referred to for details not described in the method embodiments. , the detailed description of which is omitted here.

また、本発明はさらに、記憶媒体及びプログラムプロダクトを提供する。なお、本発明の実施例による記憶媒体及びプログラムプロダクト中のマシン可実行な命令はさらに、上述の情報処理方法を実行するように構成され得る。よって、ここで詳細に説明されない内容については前の対応する部分の説明を参照できるため、ここではその詳しい説明を省略する。 Also, the present invention further provides a storage medium and a program product. It should be noted that the machine-executable instructions in the storage media and program products according to embodiments of the present invention may be further configured to perform the information processing methods described above. Therefore, the contents not described in detail here can be referred to the description of the previous corresponding part, and the detailed description thereof will be omitted here.

さらに、上述の一連の処理及び装置はソフトウェア及び／又はファームウェアにより実現され得る。ソフトウェア及び／又はファームウェアにより実現される場合、記憶媒体又はネットワークから、専用ハードウェア構造を有するコンピュータ、例えば、図１０に示す汎用パーソナルコンピュータ１０００に、該ソフトウェアを構成するプログラムをインストールし、該コンピュータは各種のプログラムがインストールされているときに、各種の機能などを実行できる。 Furthermore, the series of processes and devices described above may be implemented by software and/or firmware. When realized by software and/or firmware, a program that constitutes the software is installed from a storage medium or network to a computer having a dedicated hardware structure, for example, a general-purpose personal computer 1000 shown in FIG. Various functions can be executed when various programs are installed.

それ相応に、上述のマシン実行可能な命令を含むプログラムプロダクトをキャリー（ｃａｒｒｙ）する記憶媒体も本発明の開示に含まれる。該記憶媒体はフロッピーディスク、光ディスク、磁気ディスク、メモリカードなどを含んでも良いが、これらに限定されない。 Correspondingly, storage media carrying program products containing machine-executable instructions as described above are also included in the present disclosure. The storage medium may include, but is not limited to, floppy disks, optical disks, magnetic disks, memory cards, and the like.

上述の装置における各構成コンポーネントやユニットなどは、ソフトウェア、ファームウェア、ハードウェア又はその組み合わせの方式で構成されても良い。なお、構成に使用される具体的な手段や方法が当業者にとって周知のものであるため、ここではその詳しい説明を省略する。ソフトウェア又はファームウェアにより実現される場合、記憶媒体又はネットワークから専用ハードウェア構造を有するコンピュータ（例えば、図１０に示す汎用コンピュータ１０００）に該ソフトウェアを構成するプログラムをインストールし、該コンピュータは各種のプログラムがインストールされているときに、各種の機能などを実現することができる。 Each constituent component, unit, etc. in the above-described apparatus may be configured in the form of software, firmware, hardware, or a combination thereof. Since specific means and methods used for the configuration are well known to those skilled in the art, detailed description thereof will be omitted here. When realized by software or firmware, a program that constitutes the software is installed from a storage medium or a network to a computer having a dedicated hardware structure (for example, a general-purpose computer 1000 shown in FIG. 10), and the computer installs various programs. When installed, various functions can be realized.

図１０は、本発明の実施例における方法及び装置を実現し得るハードウェア構成（汎用コンピュータ）１０００の構成図である。 FIG. 10 is a block diagram of a hardware configuration (general-purpose computer) 1000 that can implement the methods and apparatus of embodiments of the present invention.

汎用コンピュータ１０００は、例えば、コンピュータシステムであっても良い。なお、汎用コンピュータ１０００は例示に過ぎず、本発明による方法及び装置の適用範囲又は機能について限定しない。また、汎用コンピュータ１０００は、上述の方法及び装置における任意のモジュールやアセンブリなど又はその組み合わせにも依存しない。 General purpose computer 1000 may be, for example, a computer system. It should be noted that general purpose computer 1000 is exemplary only and does not limit the scope or functionality of the methods and apparatus according to the present invention. Nor does general-purpose computer 1000 rely on any of the modules, assemblies, etc., or combinations thereof in the methods and apparatus described above.

図１０では、中央処理装置（ＣＰＵ）１００１は、ＲＯＭ１００２に記憶されるプログラム又は記憶部１００８からＲＡＭ１００３にロッドされているプログラムに基づいて各種の処理を行う。ＲＡＭ１００３では、ニーズに応じて、ＣＰＵ１００１が各種の処理を行うときに必要なデータなどを記憶することもできる。ＣＰＵ１００１、ＲＯＭ１００２及びＲＡＭ１００３は、バス１００４を経由して互いに接続される。入力／出力インターフェース１００５もバス１００４に接続される。 In FIG. 10, a central processing unit (CPU) 1001 performs various processes based on programs stored in a ROM 1002 or programs loaded from a storage unit 1008 to a RAM 1003 . The RAM 1003 can also store data necessary for the CPU 1001 to perform various processes according to needs. The CPU 1001 , ROM 1002 and RAM 1003 are interconnected via a bus 1004 . Input/output interface 1005 is also connected to bus 1004 .

また、入力／出力インターフェース１００５にはさらに、次のような部品が接続され、即ち、キーボードなどを含む入力部１００６、液晶表示器（ＬＣＤ）などのような表示器及びスピーカーなどを含む出力部１００７、ハードディスクなどを含む記憶部１００８、ネットワーク・インターフェース・カード、例えば、ＬＡＮカード、モデムなどを含む通信部１００９である。通信部１００９は、例えば、インターネット、ＬＡＮなどのネットワークを経由して通信処理を行う。ドライブ１０１０は、ニーズに応じて、入力／出力インターフェース１００５に接続されても良い。取り外し可能な媒体１０１１、例えば、半導体メモリなどは、必要に応じて、ドライブ１０１０にセットされることにより、その中から読み取られたコンピュータプログラムを記憶部１００８にインストールすることができる。 The input/output interface 1005 is further connected with the following components: an input unit 1006 including a keyboard, etc., an output unit 1007 including a display such as a liquid crystal display (LCD) and a speaker. , a storage unit 1008 including a hard disk, etc., and a communication unit 1009 including a network interface card such as a LAN card, a modem, and the like. A communication unit 1009 performs communication processing via a network such as the Internet or a LAN, for example. Drives 1010 may be connected to input/output interfaces 1005 as desired. A removable medium 1011 such as a semiconductor memory can be set in the drive 1010 as necessary, and a computer program read from the medium can be installed in the storage unit 1008 .

また、本発明はさらに、マシン可読命令コードを含むプログラムプロダクトを提供する。このような命令コードは、マシンにより読み取られ実行されるときに、上述の本発明の実施形態における方法を実行することができる。それ相応に、このようなプログラムプロダクトをキャリー（ｃａｒｒｙ）する、例えば、磁気ディスク（フロッピーディスク（登録商標）を含む）、光ディスク（ＣＤ－ＲＯＭ及びＤＶＤを含む）、光磁気ディスク（ＭＤ（登録商標）を含む）、及び半導体記憶装置などの各種の記憶媒体も、本発明に含まれる。 The present invention further provides a program product including machine-readable instruction code. When read and executed by a machine, such instruction code can perform the methods in the embodiments of the present invention described above. Correspondingly, various storage media carrying such a program product, for example magnetic disks (including floppy disks (registered trademark)), optical disks (including CD-ROMs and DVDs), magneto-optical disks (including MD (registered trademark)), and semiconductor storage devices, are also included in the present invention.

上述の記憶媒体は、例えば、磁気ディスク、光ディスク、光磁気ディスク、半導体記憶装置などを含んでも良いが、これらに限定されない。 The storage medium described above may include, for example, a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor storage device, etc., but is not limited to these.

また、上述の方法における各操作（処理）は、各種のマシン可読記憶媒体に記憶されるコンピュータ実行可能なプログラムの方式で実現することもできる。 Each operation (process) in the above-described method can also be implemented in the form of a computer-executable program stored on various machine-readable storage media.

また、以上の実施例などに関し、さらに以下のように付記として開示する。 In addition, the above examples and the like are further disclosed as supplementary notes as follows.

（付記１）
情報処理装置であって、
一つ又は複数の所定の第一テキストのうちの各々について、複数のテキスト生成モデルを用いて、該第一テキストに対応する複数の第二テキストを生成するように構成される第二テキスト生成ユニット；
前記複数の第二テキストと、対応する第一テキストとの間の語義マッチ度に基づいて、前記複数のテキスト生成モデルのうちから、第一所定数のテキスト生成モデルを候補モデルとして選択するように構成される候補モデル選択ユニット；
前記一つ又は複数の所定の第一テキストのうちの各々について、前記候補モデルを用いて生成された、該第一テキストに対応する複数の第二テキストの互いの間のテキスト類似度を計算し、そして、第二テキストの互いの間のテキスト類似度に基づいて、前記候補モデルの互いの間のモデル類似度を計算するように構成される類似度計算ユニット；
前記候補モデルのうちから、互いの間のモデル類似度が最も低い第二所定数の候補モデルを目標モデルとして選択するように構成される目標モデル選択ユニット；及び
前記目標モデルを用いて、処理待ちテキストに対応する第二所定数の第四テキストを、後続の処理のために生成するように構成される第四テキスト生成ユニットを含む、装置。
(Appendix 1)
An information processing device comprising:
a second text generation unit configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
a candidate model selection unit configured to select a first predetermined number of text generation models from among the plurality of text generation models as candidate models, based on the degree of semantic matching between the plurality of second texts and the corresponding first text;
a similarity calculation unit configured to calculate, for each of the one or more predetermined first texts, the text similarity between the plurality of second texts corresponding to the first text that were generated using the candidate models, and to calculate the model similarity between the candidate models based on the text similarity between the second texts;
a target model selection unit configured to select, from among the candidate models, a second predetermined number of candidate models having the lowest model similarity to one another as target models; and
a fourth text generation unit configured to generate, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing.

（付記２）
付記１に記載の情報処理装置であって、
前記処理待ちテキストは、ユーザ入力のテキスト、又は、ユーザ入力の語音又は画像を変換して取得したテキストを含み、
前記情報処理装置はビデオタイミング位置決めユニットをさらに含み、それは、前記処理待ちテキスト及び前記第二所定数の第四テキストに基づいて、所定のビデオから、前記処理待ちテキストに対応するフレームの位置を識別するように構成される、装置。
(Appendix 2)
The information processing device according to Appendix 1, wherein
the pending text includes text input by a user, or text obtained by converting speech or an image input by a user, and
the information processing device further includes a video timing positioning unit configured to identify, from a given video, the position of the frame corresponding to the pending text based on the pending text and the second predetermined number of fourth texts.

（付記３）
付記１に記載の情報処理装置であって、
前記処理待ちテキストは訓練テキストを含み、
前記情報処理装置は、さらに、
前記処理待ちテキスト及び前記第二所定数の第四テキストに基づいてビデオタイミング位置決めのためのマルチモーダルモデルを訓練し、訓練済みのマルチモーダルモデルを得るように構成される、マルチモーダルモデル訓練ユニットを含む、装置。
(Appendix 3)
The information processing device according to Appendix 1, wherein
the pending text includes training text, and
the information processing device further includes
a multimodal model training unit configured to train a multimodal model for video timing positioning based on the pending text and the second predetermined number of fourth texts to obtain a trained multimodal model.

（付記４）
付記２に記載の情報処理装置であって、さらに、
前記第二所定数の第四テキストのうちから、前記処理待ちテキストとの語義マッチ度が所定マッチ度以上である複数の第四テキストを候補テキストとして選択するように構成される候補テキスト選択ユニット；及び
前記候補テキストのうちから、互いの間のテキスト類似度が最も低い第三所定数の候補テキストを目標テキストとして選択するように構成される目標テキスト選択ユニットを含み、
そのうち、前記ビデオタイミング位置決めユニットはさらに、前記処理待ちテキスト及び前記目標テキストに基づいて、前記所定のビデオから、前記処理待ちテキストに対応するフレームの位置を識別するように構成される、装置。
(Appendix 4)
The information processing device according to Appendix 2, further comprising:
a candidate text selection unit configured to select, as candidate texts, a plurality of fourth texts whose degree of semantic matching with the pending text is equal to or greater than a predetermined matching degree from among the second predetermined number of fourth texts; and
a target text selection unit configured to select, as target texts, a third predetermined number of candidate texts having the lowest text similarity to one another from among the candidate texts,
wherein the video timing positioning unit is further configured to identify, from the predetermined video, the position of the frame corresponding to the pending text based on the pending text and the target texts.

（付記５）
付記１乃至４のうちの何れか１項に記載の情報処理装置であって、
前記第一テキスト、前記第二テキスト及び前記第四テキストは同じ語種に属し、
前記第二テキスト生成ユニットはバックトランスレーション方法を用いて前記複数の第二テキストを生成するように構成され、
前記第四テキスト生成ユニットはバックトランスレーション方法を用いて前記第四テキストを生成するように構成される、装置。
(Appendix 5)
The information processing device according to any one of Appendices 1 to 4, wherein
the first text, the second texts and the fourth texts are in the same language,
the second text generation unit is configured to generate the plurality of second texts using a back-translation method, and
the fourth text generation unit is configured to generate the fourth texts using a back-translation method.
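
As a rough illustration of back-translation, the sketch below sends a text through a pivot language and back, once per translator pair, so that each pair yields one paraphrase in the original language. The `Translator` type, `word_map_translator`, and the tiny word tables are hypothetical stand-ins for real machine translation models, which the source does not specify.

```python
from typing import Callable, Dict, List, Tuple

# A translator is modelled as any function from text to text (an assumption).
Translator = Callable[[str], str]


def word_map_translator(table: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate


def back_translate(text: str, pivots: List[Tuple[Translator, Translator]]) -> List[str]:
    """Translate to each pivot language and back, returning one paraphrase per pair."""
    return [backward(forward(text)) for forward, backward in pivots]


# Toy usage: the round trip changes the wording while staying in the source language.
to_pivot = word_map_translator({"big": "grande", "house": "maison"})
from_pivot = word_map_translator({"grande": "large", "maison": "home"})
paraphrases = back_translate("big house", [(to_pivot, from_pivot)])
```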

（付記６）
付記１乃至４のうちの何れか１項に記載の情報処理装置であって、
前記類似度計算ユニットは、次のような方式で、前記複数の第二テキストの互いの間のテキスト類似度を計算するように構成され、即ち、
各第二テキストをワード及び／又はフレーズの集合に分割し；及び
任意の二つの第二テキストに対応する二つのワード及び／又はフレーズの集合の間の共通集合及び合併集合を取得し、かつ取得された共通集合に含まれるワード及びフレーズの数と、取得された合併集合に含まれるワード及びフレーズの数との比を、前記任意の二つの第二テキストの間のテキスト類似度として計算する、装置。
(Appendix 6)
The information processing device according to any one of Appendices 1 to 4, wherein
the similarity calculation unit is configured to calculate the text similarity between the plurality of second texts by:
dividing each second text into a set of words and/or phrases; and
obtaining the intersection and the union of the two sets of words and/or phrases corresponding to any two second texts, and calculating the ratio of the number of words and phrases in the intersection to the number of words and phrases in the union as the text similarity between those two second texts.
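
The ratio described in Appendix 6 is the size of the intersection divided by the size of the union of two word sets (the Jaccard index). A minimal sketch follows; splitting on whitespace and lowercasing are assumptions made here, since the source does not say how texts are divided into words or phrases.

```python
def text_similarity(text_a: str, text_b: str) -> float:
    """Intersection-over-union of the word sets of two texts."""
    # Assumed tokenization: lowercase, split on whitespace.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty texts share nothing to compare; 0.0 is a choice made here.
        return 0.0
    return len(words_a & words_b) / len(union)
```

For example, "the cat sat" and "the cat ran" share two of four distinct words, giving a similarity of 0.5.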

（付記７）
付記６に記載の情報処理装置であって、
前記類似度計算ユニットは、次のような方式で、前記候補モデルの互いの間のモデル類似度を計算するように構成され、即ち、
前記一つ又は複数の所定の第一テキストのうちの各々について、任意の二つの候補モデルにより取得された、該第一テキストに対応する第二テキストの間のテキスト類似度を取得し、かつ前記一つ又は複数の所定の第一テキストに対応するテキスト類似度の平均値を、上述の任意の二つの候補モデルの間のモデル類似度として計算する、装置。
(Appendix 7)
The information processing device according to Appendix 6, wherein
the similarity calculation unit is configured to calculate the model similarity between the candidate models by:
obtaining, for each of the one or more predetermined first texts, the text similarity between the second texts corresponding to the first text that were obtained by any two candidate models, and calculating the average of the text similarities over the one or more predetermined first texts as the model similarity between those two candidate models.
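
The averaging in Appendix 7 can be sketched as follows. The sketch assumes each candidate model produces exactly one second text per first text, stored in lists aligned by first-text index, and it uses whitespace word sets for the text similarity. The dictionary layout and model names are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def text_similarity(text_a: str, text_b: str) -> float:
    """Intersection-over-union of whitespace-delimited word sets."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def model_similarity(outputs_a: List[str], outputs_b: List[str]) -> float:
    """Average text similarity of two models' outputs over the same first texts."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both models must cover the same first texts")
    return mean(text_similarity(a, b) for a, b in zip(outputs_a, outputs_b))


def pairwise_model_similarity(outputs: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Model similarity for every unordered pair of candidate models."""
    return {
        (m1, m2): model_similarity(outputs[m1], outputs[m2])
        for m1, m2 in combinations(outputs, 2)
    }
```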

（付記８）
付記１乃至４のうちの何れか１項に記載の情報処理装置であって、
前記目標モデル選択ユニットは、次のような方式で、前記目標モデルを選択するように構成され、即ち、
前記候補モデルの互いの間のモデル類似度に基づいてＮ＊Ｎ次元マトリックスを構築し、前記Ｎ＊Ｎ次元マトリックスにおける各要素は対応する候補モデルの間のモデル類似度を表し、そのうち、Ｎは前記第一所定数を表し；
行列式ポイントプロセスを用いて前記Ｎ＊Ｎ次元マトリックスのＭ＊Ｍ次元最大行列式サブマトリックスを求め、そのうち、Ｍは前記第二所定数を表し；及び
前記Ｍ＊Ｍ次元最大行列式サブマトリックスに対応する候補モデルを前記目標モデルとして選択する、装置。
(Appendix 8)
The information processing device according to any one of Appendices 1 to 4, wherein
the target model selection unit is configured to select the target models by:
constructing an N*N-dimensional matrix based on the model similarities between the candidate models, each element of the N*N-dimensional matrix representing the model similarity between the corresponding candidate models, where N is the first predetermined number;
determining the M*M-dimensional maximum-determinant submatrix of the N*N-dimensional matrix using a determinantal point process, where M is the second predetermined number; and
selecting the candidate models corresponding to the M*M-dimensional maximum-determinant submatrix as the target models.
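
To make the maximum-determinant criterion concrete, the sketch below finds the M*M principal submatrix with the largest determinant by exhaustive search. This brute-force search is a substitute chosen for clarity and is not an implementation of determinantal point process inference. It also assumes a symmetric similarity matrix with ones on the diagonal, for which smaller off-diagonal similarities give a larger determinant.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def determinant(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular (to within tolerance)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def max_determinant_subset(similarity: Sequence[Sequence[float]], m: int) -> Tuple[int, ...]:
    """Indices of the m models whose similarity submatrix has the largest determinant."""
    def sub_det(idx: Tuple[int, ...]) -> float:
        return determinant([[similarity[i][j] for j in idx] for i in idx])

    return max(combinations(range(len(similarity)), m), key=sub_det)
```

With three models where models 0 and 1 are near-duplicates (similarity 0.9), choosing two models picks 0 and 2, since that pair has the smallest mutual similarity and hence the largest determinant.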

（付記９）
付記４に記載の情報処理装置であって、
前記ビデオタイミング位置決めユニットはさらに、前記処理待ちテキストに基づく識別結果及び前記目標テキストに基づく識別結果の中値に基づいて最終識別結果を確定するように構成される、装置。
(Appendix 9)
The information processing device according to Appendix 4, wherein
the video timing positioning unit is further configured to determine a final identification result based on the median of the identification result based on the pending text and the identification results based on the target texts.
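
One possible reading of the median-based fusion is sketched below. It assumes each identification result is a `(start, end)` time segment in seconds and takes the median of the start times and of the end times separately. The source does not define the form of an identification result, so the `Segment` type and `fuse_segments` are hypothetical.

```python
from statistics import median
from typing import List, Tuple

# Assumed form of one identification result: (start_seconds, end_seconds).
Segment = Tuple[float, float]


def fuse_segments(segments: List[Segment]) -> Segment:
    """Combine per-text results by taking the median start and the median end."""
    if not segments:
        raise ValueError("at least one identification result is required")
    starts = [start for start, _ in segments]
    ends = [end for _, end in segments]
    return (median(starts), median(ends))
```

Taking a median rather than a mean keeps a single outlying result, such as one paraphrase that matched the wrong scene, from shifting the final segment.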

（付記１０）
情報処理装置であって、
テキスト生成モデルを用いて、処理待ちテキストに対応する複数の第五テキストを生成するように構成される第五テキスト生成ユニット；
前記複数の第五テキストのうちから、前記処理待ちテキストとの語義マッチ度が所定マッチ度以上である第五テキストを候補第五テキストとして選択するように構成される候補テキスト選択ユニット；
各候補第五テキストについて、該候補第五テキストと、他の候補第五テキストのうちの各々との間のテキスト類似度を計算するように構成されるテキスト類似度計算ユニット；及び
前記候補第五テキストのうちから、互いの間のテキスト類似度が最も低い第四所定数の候補第五テキストを、後続の処理のために、目標テキストとして選択するように構成される目標テキスト選択ユニットを含む、装置。
(Appendix 10)
An information processing device comprising:
a fifth text generation unit configured to generate a plurality of fifth texts corresponding to a pending text using a text generation model;
a candidate text selection unit configured to select, from among the plurality of fifth texts, fifth texts whose degree of semantic matching with the pending text is equal to or greater than a predetermined matching degree as candidate fifth texts;
a text similarity calculation unit configured to calculate, for each candidate fifth text, the text similarity between that candidate fifth text and each of the other candidate fifth texts; and
a target text selection unit configured to select, from among the candidate fifth texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity to one another as target texts for subsequent processing.

（付記１１）
付記１０に記載の情報処理装置であって、
前記処理待ちテキストは、ユーザ入力のテキスト、又は、ユーザ入力の語音又は画像を変換して取得したテキストを含み、
前記情報処理装置はビデオタイミング位置決めユニットをさらに含み、それは前記処理待ちテキスト及び前記目標テキストに基づいて、所定のビデオから前記処理待ちテキストに対応するフレームの位置を見つけるように構成される、装置。
(Appendix 11)
The information processing device according to Appendix 10, wherein
the pending text includes text input by a user, or text obtained by converting speech or an image input by a user, and
the information processing device further includes a video timing positioning unit configured to locate, in a given video, the position of the frame corresponding to the pending text based on the pending text and the target texts.

（付記１２）
付記１０に記載の情報処理装置であって、
前記処理待ちテキストは訓練テキストを含み、
前記情報処理装置はマルチモーダルモデル訓練ユニットをさらに含み、それは、前記処理待ちテキスト及び前記目標テキストに基づいてビデオタイミング位置決めのためのマルチモーダルモデルを訓練し、訓練済みのマルチモーダルモデルを得るように構成される、装置。
(Appendix 12)
The information processing device according to Appendix 10, wherein
the pending text includes training text, and
the information processing device further includes a multimodal model training unit configured to train a multimodal model for video timing positioning based on the pending text and the target texts to obtain a trained multimodal model.

（付記１３）
情報処理方法であって、
一つ又は複数の所定の第一テキストのうちの各々について、複数のテキスト生成モデルを用いて該第一テキストに対応する複数の第二テキストを生成し；
前記複数の第二テキストと、対応する第一テキストとの間の語義マッチ度に基づいて、前記複数のテキスト生成モデルのうちから、第一所定数のテキスト生成モデルを候補モデルとして選択し；
前記一つ又は複数の所定の第一テキストのうちの各々について、前記候補モデルを用いて生成された、該第一テキストに対応する複数の第二テキストの互いの間のテキスト類似度を計算し、そして、第二テキストの互いの間のテキスト類似度に基づいて、前記候補モデルの互いの間のモデル類似度を計算し；
前記候補モデルのうちから、互いの間のモデル類似度が最も低い第二所定数の候補モデルを目標モデルとして選択し；及び
前記目標モデルを用いて、処理待ちテキストに対応する第二所定数の第四テキストを、後続の処理のために生成することを含む、方法。
(Appendix 13)
An information processing method comprising:
generating, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
selecting a first predetermined number of text generation models from among the plurality of text generation models as candidate models, based on the degree of semantic matching between the plurality of second texts and the corresponding first text;
calculating, for each of the one or more predetermined first texts, the text similarity between the plurality of second texts corresponding to the first text that were generated using the candidate models, and calculating the model similarity between the candidate models based on the text similarity between the second texts;
selecting, from among the candidate models, a second predetermined number of candidate models having the lowest model similarity to one another as target models; and
generating, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing.

（付記１４）
付記１３に記載の情報処理方法であって、
前記処理待ちテキストは、ユーザ入力のテキスト、又は、ユーザ入力の語音又は画像を変換して取得したテキストを含み、
前記情報処理方法はさらに、前記処理待ちテキスト及び前記第二所定数の第四テキストに基づいて、所定のビデオから、前記処理待ちテキストに対応するフレームの位置を認識することを含む、方法。
(Appendix 14)
The information processing method according to Appendix 13, wherein
the pending text includes text input by a user, or text obtained by converting speech or an image input by a user, and
the information processing method further comprises recognizing, from a given video, the position of the frame corresponding to the pending text based on the pending text and the second predetermined number of fourth texts.

（付記１５）
付記１３に記載の情報処理方法であって、
前記処理待ちテキストは訓練テキストを含み、
前記情報処理方法はさらに、前記処理待ちテキスト及び前記第二所定数の第四テキストに基づいてビデオタイミング位置決めのためのマルチモーダルモデルを訓練し、訓練済みのマルチモーダルモデルを得ることを含む、方法。
(Appendix 15)
The information processing method according to Appendix 13, wherein
the pending text includes training text, and
the information processing method further comprises training a multimodal model for video timing positioning based on the pending text and the second predetermined number of fourth texts to obtain a trained multimodal model.

（付記１６）
付記１４に記載の情報処理方法であって、さらに、
前記第二所定数の第四テキストのうちから、前記処理待ちテキストとの語義マッチ度が所定マッチ度以上である複数の第四テキストを候補テキストとして選択し；及び
前記候補テキストのうちから、互いの間のテキスト類似度が最も低い第三所定数の候補テキストを目標テキストとして選択することを含み、
そのうち、前記処理待ちテキストに対応するフレームの位置の認識は前記処理待ちテキスト及び前記目標テキストに基づいて、前記所定のビデオから、前記処理待ちテキストに対応するフレームの位置を認識することを含む、方法。
(Appendix 16)
The information processing method according to Appendix 14, further comprising:
selecting, as candidate texts, a plurality of fourth texts whose degree of semantic matching with the pending text is equal to or greater than a predetermined matching degree from among the second predetermined number of fourth texts; and
selecting, as target texts, a third predetermined number of candidate texts having the lowest text similarity to one another from among the candidate texts,
wherein recognizing the position of the frame corresponding to the pending text includes recognizing, from the predetermined video, the position of the frame corresponding to the pending text based on the pending text and the target texts.

（付記１７）
付記１３乃至１６のうちの何れか１項に記載の情報処理方法であって、
前記第一テキスト、前記第二テキスト及び前記第四テキストは同じ語種に属し、
バックトランスレーション方法を用いて、前記複数の第二テキスト及び前記第四テキストを生成する、方法。
(Appendix 17)
The information processing method according to any one of Appendices 13 to 16, wherein
the first text, the second texts and the fourth texts are in the same language, and
the plurality of second texts and the fourth texts are generated using a back-translation method.

（付記１８）
付記１３乃至１６のうちの何れか１項に記載の情報処理方法であって、
次のような方式で、前記複数の第二テキストの互いの間のテキスト類似度を計算し、即ち、
各第二テキストをワード及び／又はフレーズの集合に分割し；及び
任意の二つの第二テキストに対応する二つのワード及び／又はフレーズの集合の間の共通集合及び合併集合を取得し、かつ取得された共通集合に含まれるワード及びフレーズの数と、取得された合併集合に含まれるワード及びフレーズの数との比を、前記任意の二つの第二テキストの間のテキスト類似度として計算する、方法。 (Appendix 18)
The information processing method according to any one of Appendices 13 to 16,
wherein the text similarity between the plurality of second texts is calculated in the following manner:
dividing each second text into a set of words and/or phrases; and
obtaining the intersection and the union of the two sets of words and/or phrases corresponding to any two second texts, and calculating the ratio of the number of words and phrases in the obtained intersection to the number of words and phrases in the obtained union as the text similarity between those two second texts.
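The ratio described in Appendix 18 is the intersection-over-union (Jaccard) ratio of two token sets. A minimal sketch follows, assuming a simple lowercase word tokenizer; the specification does not fix how texts are divided into words and phrases, and the names `tokenize` and `text_similarity` are hypothetical.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Divide a text into a set of lowercase words (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def text_similarity(text_a: str, text_b: str) -> float:
    """Ratio of the intersection size to the union size of the token sets."""
    set_a, set_b = tokenize(text_a), tokenize(text_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty texts are treated as dissimilar
    return len(set_a & set_b) / len(union)


sim = text_similarity("A man opens the door", "A man closes the door")
```

Here the two sentences share four of six distinct words, so the similarity is 4/6.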

（付記１９）
付記１８に記載の情報処理方法であって、
次のような方式で、前記候補モデル互いの間のモデル類似度を計算し、即ち、
前記一つ又は複数の所定の第一テキストのうちの各々について、任意の二つの候補モデルにより取得された、該第一テキストに対応する第二テキストの間のテキスト類似度を取得し、かつ前記一つ又は複数の所定の第一テキストに対応するテキスト類似度の平均値を、上述の任意の二つの候補モデルの間のモデル類似度として計算する、方法。 (Appendix 19)
The information processing method according to Appendix 18,
wherein the model similarity between the candidate models is calculated in the following manner:
for each of the one or more predetermined first texts, obtaining the text similarity between the second texts corresponding to that first text obtained by any two candidate models, and calculating the average of the text similarities over the one or more predetermined first texts as the model similarity between those two candidate models.
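The averaging in Appendix 19 can be sketched as below. The sketch assumes each candidate model yields exactly one second text per first text and uses a whitespace-based intersection-over-union ratio as the text similarity; the function names are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level intersection-over-union similarity (assumed measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def model_similarity(outputs_a: List[str], outputs_b: List[str]) -> float:
    """Average text similarity between two models' outputs.

    outputs_a[i] and outputs_b[i] are the second texts that the two
    candidate models generated for the same i-th first text.
    """
    if not outputs_a or len(outputs_a) != len(outputs_b):
        raise ValueError("both models need one output per first text")
    total = sum(jaccard(a, b) for a, b in zip(outputs_a, outputs_b))
    return total / len(outputs_a)


score = model_similarity(["a b", "c d"], ["a b", "c e"])
```

In this example the first pair of outputs is identical (similarity 1) and the second pair shares one of three distinct words (similarity 1/3), so the model similarity is their mean.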

（付記２０）
付記１６に記載の情報処理方法であって、
前記処理待ちテキストに基づく識別結果及び前記目標テキストに基づく識別結果の中値に基づいて最終識別結果を決定する、方法。 (Appendix 20)
The information processing method according to appendix 16,
determining a final identification result based on a median value of the identification result obtained from the pending text and the identification results obtained from the target texts.
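One way to read Appendix 20 is sketched below, taking "median value" literally and assuming each identification result is a (start, end) time span in the video; neither the span representation nor the name `fuse_spans` comes from the specification.

```python
from statistics import median
from typing import List, Tuple

Span = Tuple[float, float]  # assumed form of a result: (start, end) in seconds


def fuse_spans(pending_span: Span, target_spans: List[Span]) -> Span:
    """Combine the pending-text result with the target-text results by
    taking the median of the start times and of the end times."""
    spans = [pending_span, *target_spans]
    start = median(s for s, _ in spans)
    end = median(e for _, e in spans)
    return (start, end)


final = fuse_spans((10.0, 20.0), [(12.0, 22.0), (30.0, 40.0)])
```

A median is less sensitive than a mean to a single outlying result, such as the (30.0, 40.0) span in this example.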

以上、本発明の好ましい実施形態を説明したが、本発明はこの実施形態に限定されず、本発明の趣旨を離脱しない限り、本発明に対するあらゆる変更は、本発明の技術的範囲に属する。 Although the preferred embodiment of the present invention has been described above, the present invention is not limited to this embodiment, and all modifications to the present invention fall within the technical scope of the present invention as long as they do not depart from the gist of the present invention.

Claims

An information processing device comprising:
a second text generation unit configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
a candidate model selection unit configured to select a first predetermined number of text generation models from among the plurality of text generation models as candidate models based on the degree of semantic matching between the plurality of second texts and the corresponding first text;
a similarity calculation unit configured to calculate, for each of the one or more predetermined first texts, the text similarity between each pair of the plurality of second texts corresponding to the first text generated using the candidate models, and to calculate the model similarity between the candidate models based on the text similarities between the second texts;
a target model selection unit configured to select, as target models, a second predetermined number of candidate models having the lowest model similarity between each other from among the candidate models; and
a fourth text generation unit configured to generate, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing.

The information processing device according to claim 1,
wherein the pending text includes text input by a user or text obtained by converting speech or images input by the user, and
the information processing device further comprises a video timing positioning unit configured to identify, from a predetermined video, the position of a frame corresponding to the pending text based on the pending text and the second predetermined number of fourth texts.

The information processing device according to claim 1,
further comprising a multimodal model training unit configured to train a multimodal model for video timing positioning based on the pending text and the second predetermined number of fourth texts to obtain a trained multimodal model.

The information processing device according to claim 2,
further comprising:
a candidate text selection unit configured to select, as candidate texts, a plurality of fourth texts having a semantic matching degree with the pending text equal to or greater than a predetermined matching degree from among the second predetermined number of fourth texts; and
a target text selection unit configured to select, as target texts, a third predetermined number of candidate texts having the lowest text similarity between each other from among the candidate texts,
wherein the video timing positioning unit is further configured to identify, from the predetermined video, the position of the frame corresponding to the pending text based on the pending text and the target texts.

The information processing device according to any one of claims 1 to 4,
wherein the first text, the plurality of second texts and the fourth texts belong to the same language,
the second text generation unit is configured to generate the plurality of second texts using a back-translation method, and
the fourth text generation unit is configured to generate the fourth texts using a back-translation method.

The information processing device according to any one of claims 1 to 4,
wherein the similarity calculation unit is configured to calculate the text similarity between the plurality of second texts by:
dividing each second text into a set of words and/or phrases; and
obtaining the intersection and the union of the two sets of words and/or phrases corresponding to any two second texts, and calculating the ratio of the number of words and phrases contained in the obtained intersection to the number of words and phrases contained in the obtained union as the text similarity between those two second texts.

The information processing device according to claim 6,
wherein the similarity calculation unit is configured to calculate the model similarity between the candidate models by:
for each of the one or more predetermined first texts, obtaining the text similarity between the second texts corresponding to that first text obtained by any two candidate models; and
calculating the average of the text similarities over the one or more predetermined first texts as the model similarity between those two candidate models.

The information processing device according to any one of claims 1 to 4,
wherein the target model selection unit is configured to select the target models by:
constructing an N*N-dimensional matrix based on the model similarities between the candidate models, each element of the N*N-dimensional matrix representing the model similarity between the corresponding candidate models, and N representing the first predetermined number;
determining an M*M-dimensional maximum-determinant sub-matrix of the N*N-dimensional matrix using a determinantal point process, M representing the second predetermined number; and
selecting the candidate models corresponding to the M*M-dimensional maximum-determinant sub-matrix as the target models.
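The sub-matrix selection above can be illustrated with a small sketch. It swaps in an exhaustive search over all M-element index sets in place of a determinantal point process inference routine, which is practical only for small N; the helper names `determinant` and `select_diverse` are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def determinant(matrix: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def select_diverse(similarity: List[List[float]], m: int) -> Tuple[int, ...]:
    """Indices of the M*M principal sub-matrix with the largest determinant."""
    n = len(similarity)
    if not 0 < m <= n:
        raise ValueError("m must be between 1 and the number of candidates")
    best, best_det = (), float("-inf")
    for idx in combinations(range(n), m):
        sub = [[similarity[i][j] for j in idx] for i in idx]
        d = determinant(sub)
        if d > best_det:
            best, best_det = idx, d
    return best


# Hypothetical model similarities for N = 3 candidate models.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
chosen = select_diverse(sim, 2)
```

With unit diagonal entries, a 2*2 principal sub-matrix has determinant 1 - s^2 for pairwise similarity s, so the pair of least similar models yields the largest determinant.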

An information processing device comprising:
a fifth text generation unit configured to generate a plurality of fifth texts corresponding to a pending text using a text generation model;
a candidate text selection unit configured to select, from among the plurality of fifth texts, a fifth text having a semantic matching degree with the pending text equal to or greater than a predetermined matching degree as a candidate fifth text;
a text similarity calculation unit configured to calculate, for each candidate fifth text, the text similarity between the candidate fifth text and each of the other candidate fifth texts; and
a target text selection unit configured to select, from among the candidate fifth texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity between each other as target texts for subsequent processing.

An information processing method comprising:
for each of one or more predetermined first texts, generating a plurality of second texts corresponding to the first text using a plurality of text generation models;
selecting a first predetermined number of text generation models from among the plurality of text generation models as candidate models based on the degree of semantic match between the plurality of second texts and the corresponding first text;
calculating, for each of the one or more predetermined first texts, the text similarity between each pair of the plurality of second texts corresponding to the first text generated using the candidate models, and calculating the model similarity between the candidate models based on the text similarities between the second texts;
selecting, as target models, a second predetermined number of candidate models having the lowest model similarity between each other from among the candidate models; and
generating, using the target models, a second predetermined number of fourth texts corresponding to a pending text for subsequent processing.