CN116415587A - Information processing apparatus and information processing method - Google Patents


Info

Publication number
CN116415587A
Authority
CN
China
Prior art keywords
text
texts
candidate
similarity
model
Prior art date
Legal status
Pending
Application number
CN202111581637.0A
Other languages
Chinese (zh)
Inventor
王平
孙利
汪留安
孙俊
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN202111581637.0A priority Critical patent/CN116415587A/en
Priority to JP2022194980A priority patent/JP2023093349A/en
Publication of CN116415587A publication Critical patent/CN116415587A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an information processing apparatus and an information processing method. The information processing apparatus includes: a second text generation unit configured to generate, for each first text, a plurality of second texts corresponding to the first text using a plurality of text generation models; a candidate model selection unit configured to select candidate models from the plurality of text generation models based on a degree of semantic matching between the plurality of second texts and the corresponding first texts; a similarity calculation unit configured to calculate, for each first text, a text similarity between the plurality of second texts corresponding to the first text generated using the candidate models, and to calculate a model similarity between the candidate models based on the text similarity between the second texts; a target model selection unit configured to select target models from the candidate models based on the model similarity; and a fourth text generation unit configured to generate a second predetermined number of fourth texts corresponding to a text to be processed using the target models.

Description

Information processing apparatus and information processing method
Technical Field
The present disclosure relates to the field of information processing, and in particular, to an information processing apparatus and an information processing method.
Background
Many applications in the field of information processing require the use of text. Some techniques for processing text currently exist.
Disclosure of Invention
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. However, it should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its purpose is to present some concepts related to the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of the present disclosure, there is provided an information processing apparatus including: a second text generation unit configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models; a candidate model selection unit configured to select a first predetermined number of text generation models from the plurality of text generation models as candidate models based on a degree of semantic matching between the plurality of second texts and the corresponding first texts; a similarity calculation unit configured to calculate, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated using the candidate models, and calculate model similarity between the candidate models based on the text similarity between the second texts; a target model selecting unit configured to select, as target models, a second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models; and a fourth text generation unit configured to generate a second predetermined number of fourth texts corresponding to the text to be processed using the target model for subsequent processing.
According to another aspect of the present disclosure, there is provided an information processing apparatus including: a fifth text generation unit configured to generate a plurality of fifth texts corresponding to the text to be processed using the text generation model; a candidate text selection unit configured to select, as a candidate fifth text, a fifth text having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree from the plurality of fifth texts; a text similarity calculation unit configured to calculate, for each candidate fifth text, a text similarity between the candidate fifth text and each of the other candidate fifth texts; and a target text selection unit configured to select, as target texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity to each other from among the candidate fifth texts, for subsequent processing.
According to still another aspect of the present disclosure, there is provided an information processing method including: generating a plurality of second texts corresponding to the first texts by using a plurality of text generation models for each of one or more predetermined first texts; selecting a first predetermined number of text generation models from the plurality of text generation models as candidate models based on the degree of semantic matching between the plurality of second texts and the corresponding first texts; calculating, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated using the candidate models, and calculating model similarity between the candidate models based on the text similarity between the second texts; selecting a second predetermined number of candidate models with the lowest model similarity among the candidate models as target models; and generating a second predetermined number of fourth texts corresponding to the text to be processed by using the target model for subsequent processing.
According to other aspects of the present disclosure, there are also provided a computer program code and a computer program product for implementing the above-described method according to the present disclosure, and a computer readable storage medium having the computer program code recorded thereon for implementing the above-described method according to the present disclosure.
Other aspects of the disclosed embodiments are set forth in the description section below, wherein the detailed description is for fully disclosing preferred embodiments of the disclosed embodiments without placing limitations thereon.
Drawings
The present disclosure may be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout the figures to designate the same or similar components. The accompanying drawings, which are included to provide a further illustration of the preferred embodiments of the disclosure and to explain the principles and advantages of the disclosure, are incorporated in and form a part of the specification along with the detailed description that follows. Wherein:
Fig. 1 is a block diagram showing a functional configuration example of an information processing apparatus according to a first embodiment of the present disclosure;
Fig. 2 illustrates one example of generating a second text using a back-translation method;
Fig. 3 is a block diagram showing a functional configuration example of an information processing apparatus according to a second embodiment of the present disclosure;
Fig. 4 shows examples of target frame recognition results on the Charades-STA dataset obtained using, respectively, the similarity sequence for the text to be processed, the average similarity sequence obtained by the video timing positioning unit, the median similarity sequence obtained by the video timing positioning unit, and the manual average similarity sequence;
Fig. 5 is a block diagram showing a functional configuration example of an information processing apparatus according to a third embodiment of the present disclosure;
Fig. 6 is a block diagram showing a functional configuration example of an information processing apparatus according to a fourth embodiment of the present disclosure;
Fig. 7 shows an example of target frame recognition results on the Charades-STA dataset obtained using, respectively, the similarity sequence for the text to be processed, the average similarity sequence obtained by the video timing positioning unit, and the manual average similarity sequence;
Fig. 8 is a flowchart showing a flow example of an information processing method according to a fifth embodiment of the present disclosure;
Fig. 9 is a flowchart showing a flow example of an information processing method according to a sixth embodiment of the present disclosure; and
Fig. 10 is a block diagram showing an example structure of a personal computer employable in embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in this specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with system-and business-related constraints, and that these constraints will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
It is also noted herein that, in order to avoid obscuring the disclosure with unnecessary details, only the device structures and/or processing steps closely related to the solution according to the present disclosure are shown in the drawings, while other details not greatly related to the present disclosure are omitted.
Embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings.
< first embodiment >
First, an implementation example of an information processing apparatus 100 according to a first embodiment of the present disclosure will be described with reference to fig. 1. Fig. 1 is a block diagram showing a functional configuration example of an information processing apparatus 100 according to a first embodiment of the present disclosure.
As shown in fig. 1, an information processing apparatus 100 according to a first embodiment of the present disclosure may include a second text generation unit 102, a candidate model selection unit 104, a similarity calculation unit 106, a target model selection unit 108, and a fourth text generation unit 110.
The second text generation unit 102 may be configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models. For example, each text generation model may be a separate text generation model. Further, for example, the plurality of text generation models may each correspond to a different sub-module in the same text generation model. For example, each of the plurality of second texts generated for the same first text may belong to the same language.
The candidate model selection unit 104 may be configured to select, as candidate models, a first predetermined number of text generation models from the plurality of text generation models based on the degree of semantic matching between the plurality of second texts generated by the second text generation unit 102 and the corresponding first texts. For example, the first predetermined number may be set according to actual needs. For example, an existing model such as Sentence-BERT may be used to determine the degree of semantic matching between each second text and the corresponding first text.
As an example, the candidate model selection unit 104 may select, as candidate models, text generation models for which the degree of semantic matching between the generated second texts and the corresponding first texts is high.
For example, the candidate model selection unit 104 may calculate, for each text generation model, the average value of the degree of semantic matching between each second text generated with the text generation model and the corresponding first text, and select, as the candidate model, a first predetermined number of text generation models having the highest average value of the degree of semantic matching corresponding thereto among the plurality of text generation models.
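By way of illustration only (this sketch is not part of the disclosure; the function name and data layout are assumptions), the per-model averaging and top-ranking selection described above might look like:

```python
def select_candidate_models(match_degrees, first_predetermined_number):
    """match_degrees maps each model name to the list of semantic matching
    degrees between its generated second texts and the corresponding first
    texts; return the models with the highest average matching degree."""
    avg = {m: sum(ds) / len(ds) for m, ds in match_degrees.items()}
    ranked = sorted(avg, key=avg.get, reverse=True)
    return ranked[:first_predetermined_number]

degrees = {"model_a": [0.9, 0.8], "model_b": [0.5, 0.6], "model_c": [0.7, 0.9]}
print(select_candidate_models(degrees, 2))  # ['model_a', 'model_c']
```

In practice the matching degrees themselves would come from a semantic matching model such as Sentence-BERT, as noted above.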
The similarity calculation unit 106 may be configured to calculate, for each of the one or more predetermined first texts, a text similarity between the plurality of second texts corresponding to the first text generated using the candidate models selected by the candidate model selection unit 104, and to calculate a model similarity between the candidate models based on the text similarity between the second texts.
The target model selection unit 108 may be configured to select, as the target model, a second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models. For example, the second predetermined number may be set according to actual needs.
The fourth text generation unit 110 may be configured to generate, using the target models, a second predetermined number of fourth texts corresponding to the text to be processed, for subsequent processing. In the case where there are a plurality of texts to be processed, the fourth text generation unit 110 may generate, for each text to be processed, a second predetermined number of fourth texts corresponding to that text. For example, each of the plurality of fourth texts generated for the same text to be processed may belong to the same language.
As described above, the information processing apparatus 100 according to the first embodiment of the present disclosure can select a target model from a plurality of text generation models in consideration of the degree of semantic matching between a first text and a second text generated using the plurality of text generation models and the degree of text similarity between the second texts. Thus, using the target model, a fourth text having a suitable degree of semantic matching with the text to be processed, such as a fourth text having a meaning relatively close to that of the text to be processed, can be generated. For example, fourth text having a meaning closer to that of the text to be processed may also be referred to as high-quality fourth text. In addition, the text similarity between the fourth texts may be low, so that the diversity is high. That is, with the information processing apparatus 100, the fourth text having high quality and high diversity can be generated.
In addition, by selecting the target model and generating the fourth text using the target model as described above, the information processing apparatus 100 can reduce noise due to inaccuracy of semantic matching. Where "noise" refers to fourth text having a meaning that is not too close to the meaning of the text to be processed.
As an example, the text to be processed and the first text may belong to the same language, such as Chinese, English, Japanese, etc., but are not limited thereto.
As a further example, the text to be processed and the first text may belong to different languages. In this case, for example, the fourth text generation unit 110 may first convert the text to be processed into the same language as the first text and then generate the second predetermined number of fourth texts using the target models.
For example, the second text and the fourth text may belong to the same language, such as Chinese, English, Japanese, etc., but are not limited thereto.
As an example, the first text, the second text, and the fourth text may belong to the same language. In this case, for example, the second text generation unit 102 may generate the second text using a back-translation method. For example, the second text generation unit 102 may convert the first text into text in a language (hereinafter referred to as the "second language") different from the language of the first text (hereinafter referred to as the "first language"), and then convert the converted text back into the first language to obtain the second text. For example, fig. 2 illustrates one example of generating a second text using the back-translation method, where the first language is English and the second language is Chinese. As shown in fig. 2, the second text generation unit 102 may convert the English first text "The weather is good" into a corresponding Chinese text, and then convert that Chinese text into the English second text "Good weather".
Although fig. 2 shows an example in which the second text generation unit 102 generates one second text, the second text generation unit 102 may, according to actual needs, convert the first text into texts in a plurality of different second languages (e.g., Japanese, German, Spanish, etc.), and then convert each converted text back into the first language to obtain a plurality of second texts.
Further, for example, the fourth text generation unit 110 may generate the fourth text using the back-translation method. For example, the fourth text generation unit 110 may generate one or more fourth texts in a similar manner to the second text generation unit 102 described above.
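The round-trip flow just described can be sketched as follows. This is an illustrative assumption, not part of the disclosure: the `translate` function below is a toy lookup standing in for the real translation models the units would invoke, and all names are hypothetical.

```python
def translate(text, src, tgt):
    # Toy stand-in: a real system would call a trained translation model
    # for the (src, tgt) language pair.
    toy_model = {
        ("The weather is good", "en", "zh"): "\u5929\u6c14\u597d",
        ("\u5929\u6c14\u597d", "zh", "en"): "Good weather",
    }
    return toy_model[(text, src, tgt)]

def back_translate(first_text, pivot_languages, first_language="en"):
    """Generate one second text per pivot (second) language by translating
    the first text out to the pivot language and back again."""
    second_texts = []
    for pivot in pivot_languages:
        intermediate = translate(first_text, first_language, pivot)
        second_texts.append(translate(intermediate, pivot, first_language))
    return second_texts

print(back_translate("The weather is good", ["zh"]))  # ['Good weather']
```

Using several pivot languages yields several second texts per first text, which is what the candidate-model comparison above operates on.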
For example, each text generation model may be a separate text translation model; for example, one text generation model may correspond to a text translation model that converts between text in the first language and text in a second language.
As an example, the similarity calculation unit 106 may calculate the text similarity between any two second texts as follows: split the two second texts into sets of words and/or terms, respectively; obtain the intersection and the union of the resulting sets; and calculate, as the text similarity between the two second texts, the ratio of the number of words and terms (i.e., the sum of the number of words and the number of terms) contained in the intersection to the number of words and terms contained in the union.
As used herein, a "word" may represent a single word, such as a single English word, a single Chinese word, a single Japanese word, and the like. In addition, a "term" as used herein may refer to a combination of two or more words.
The manner of calculating the above-described text similarity will be further described below using an example in which the second texts are English texts. For example, assume that the second texts generated by the second text generation unit 102 include a first English sentence "a man was standing in the bathroom holding a glass" and a second English sentence "a person is standing in the bathroom holding a glass". The similarity calculation unit 106 may split the first English sentence into a first word set {a, man, was, standing, in, the, bathroom, holding, glass} and split the second English sentence into a second word set {a, person, is, standing, in, the, bathroom, holding, glass}. The intersection of the first word set and the second word set is {a, standing, in, the, bathroom, holding, glass}, containing 7 words, and the union of the first word set and the second word set is {a, man, was, person, is, standing, in, the, bathroom, holding, glass}, containing 11 words. The similarity calculation unit 106 may use the ratio of the number of words in the intersection (i.e., 7) to the number of words in the union (i.e., 11), that is, 7/11, as the text similarity between the first English sentence and the second English sentence.
For example, in the case where the second text is English text containing both uppercase and lowercase letters, the similarity calculation unit 106 may convert the second text to all uppercase or all lowercase before splitting it; of course, the similarity calculation unit 106 may instead convert the split words to uppercase or lowercase after splitting the second text. As will be appreciated by those skilled in the art, such letter-case conversion of English text may be similarly applicable to text in other languages, such as German text, Spanish text, French text, and the like.
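A minimal sketch of the intersection-over-union text similarity described above, lowercasing before splitting as the case-conversion note suggests (illustrative only; simple whitespace splitting stands in for whatever word/term segmentation is actually used):

```python
def text_similarity(text_a, text_b):
    """Ratio of words in the intersection to words in the union of the
    two texts' word sets (Jaccard similarity over lowercased words)."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

s1 = "a man was standing in the bathroom holding a glass"
s2 = "a person is standing in the bathroom holding a glass"
print(text_similarity(s1, s2))  # 7/11, about 0.636
```

For languages without whitespace word boundaries (e.g., Chinese or Japanese), the split step would need a proper segmenter in place of `str.split`.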
As an example, the similarity calculation unit 106 may calculate the model similarity between any two candidate models as follows: for each of the one or more predetermined first texts, obtain the text similarity between the second texts corresponding to that first text obtained using the two candidate models, and calculate the mean of the text similarities corresponding to the one or more predetermined first texts as the model similarity between the two candidate models. For example, assume that the one or more predetermined first texts include text_1, text_2, …, text_m (where m is a natural number greater than 0 representing the number of first texts), that the text similarity between the second text of text_1 obtained using candidate model A and the second text of text_1 obtained using candidate model B is s_1, that the text similarity between the second text of text_2 obtained using candidate model A and the second text of text_2 obtained using candidate model B is s_2, …, and that the text similarity between the second text of text_m obtained using candidate model A and the second text of text_m obtained using candidate model B is s_m. In this case, the similarity calculation unit 106 may use the mean of the text similarities s_1, s_2, …, s_m corresponding to text_1, text_2, …, text_m as the model similarity between candidate model A and candidate model B. Of course, the similarity calculation unit 106 may also calculate the model similarity between candidate model A and candidate model B in other ways based on the text similarities s_1, s_2, …, s_m.
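The mean-based model similarity might be sketched as follows (illustrative assumptions throughout; `jaccard` stands in for whichever text similarity is used, and the list layout is hypothetical):

```python
from statistics import mean

def jaccard(a, b):
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

def model_similarity(second_texts_a, second_texts_b):
    """second_texts_a[i] and second_texts_b[i] are the second texts that
    candidate models A and B produce for the i-th predetermined first text;
    the model similarity is the mean of the per-text similarities."""
    return mean(jaccard(a, b) for a, b in zip(second_texts_a, second_texts_b))

print(model_similarity(["good weather", "he runs fast"],
                       ["good weather", "he walks fast"]))  # (1.0 + 0.5) / 2 = 0.75
```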
As an example, the target model selection unit 108 may select, as the target models, the second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models using a determinantal point process (see, for example, Chen L, Zhang G, Zhou E. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. Advances in Neural Information Processing Systems, 2018, 31). For example, the target model selection unit 108 may construct an N×N-dimensional matrix SS based on the model similarities between the candidate models, where each element of the N×N-dimensional matrix SS represents the model similarity between the corresponding candidate models; for example, SS[i, j] represents the similarity between the i-th candidate model and the j-th candidate model. Here, N is a natural number greater than 0 representing the number of candidate models (i.e., the first predetermined number), and i and j are natural numbers greater than 0 and less than or equal to N. Then, the target model selection unit 108 may use the determinantal point process to solve for the M×M-dimensional maximum-determinant submatrix of the N×N-dimensional matrix SS, where the M×M-dimensional maximum-determinant submatrix corresponds to the second predetermined number of candidate models having the lowest model similarity to each other among the candidate models, and M is a natural number greater than 0 and less than N representing the second predetermined number.
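As an illustrative stand-in (not the greedy algorithm from the cited paper), the maximum-determinant submatrix selection can be demonstrated exhaustively for small N; for larger N, the greedy MAP inference cited above would be used instead:

```python
from itertools import combinations

def det(matrix):
    # Determinant by Laplace expansion; adequate for the small submatrices here.
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    return sum(
        (-1) ** j * matrix[0][j]
        * det([row[:j] + row[j + 1:] for row in matrix[1:]])
        for j in range(n)
    )

def select_target_models(SS, M):
    """Indices of the M candidate models whose M x M similarity submatrix
    of SS has maximum determinant, i.e. the most mutually dissimilar set."""
    best = max(
        combinations(range(len(SS)), M),
        key=lambda idx: det([[SS[i][j] for j in idx] for i in idx]),
    )
    return list(best)

# Models 0 and 1 are highly similar, so the most diverse pair is (0, 2):
SS = [[1.0, 0.9, 0.1],
      [0.9, 1.0, 0.2],
      [0.1, 0.2, 1.0]]
print(select_target_models(SS, 2))  # [0, 2]
```

Intuitively, a submatrix determinant shrinks toward zero as the selected rows become more similar, so maximizing it picks the candidate models with the lowest similarity to each other.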
< second embodiment >
An information processing apparatus 300 according to a second embodiment of the present disclosure will be described below with reference to fig. 3. Fig. 3 is a block diagram showing a functional configuration example of an information processing apparatus 300 according to the second embodiment of the present disclosure.
As shown in fig. 3, an information processing apparatus 300 according to a second embodiment of the present disclosure may include a second text generation unit 302, a candidate model selection unit 304, a similarity calculation unit 306, a target model selection unit 308, a fourth text generation unit 310, and a video timing positioning unit 312. The second text generating unit 302, the candidate model selecting unit 304, the similarity calculating unit 306, the object model selecting unit 308, and the fourth text generating unit 310 may be similar to the second text generating unit 102, the candidate model selecting unit 104, the similarity calculating unit 106, the object model selecting unit 108, and the fourth text generating unit 110 included in the information processing apparatus 100 described above with reference to fig. 1 and 2, and thus will not be described in detail herein.
For example, the text to be processed may include text input by a user or text obtained by converting voice or an image input by the user; for example, the text to be processed may indicate an object, event, or the like that the user desires to recognize in a predetermined video. The video timing positioning unit 312 may identify, from the predetermined video, the position of the frame corresponding to the text to be processed (hereinafter also referred to as the "target frame") based on the text to be processed and the second predetermined number of fourth texts generated by the fourth text generation unit 310. Since the video timing positioning unit 312 performs recognition on the predetermined video using enhanced text (i.e., the text to be processed together with the fourth texts), the recognition accuracy can be improved.
For example, the video timing positioning unit 312 may identify the position of the frame corresponding to the text to be processed from the predetermined video using a trained multimodal model. For example, the video timing positioning unit 312 may calculate, for the text to be processed and for each fourth text, the similarity between that text and each frame in the predetermined video using the trained multimodal model to obtain M+1 similarity sequences, and identify the position of the frame corresponding to the text to be processed in the predetermined video based on the obtained M+1 similarity sequences, where M is a natural number greater than 0 representing the number of fourth texts. For example, the video timing positioning unit 312 may average the obtained M+1 similarity sequences to obtain an average similarity sequence, and identify the position of the frame corresponding to the text to be processed in the predetermined video based on the average similarity sequence. Further, for example, the video timing positioning unit 312 may determine the median of the peaks of the obtained M+1 similarity sequences, and identify the position of the frame corresponding to the text to be processed in the predetermined video based on the similarity sequence corresponding to that median (hereinafter also referred to as the "median similarity sequence").
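A toy sketch of the average and median-peak strategies just described (illustrative only; in practice the per-frame similarity sequences would come from the trained multimodal model):

```python
from statistics import median_low

def average_sequence(sequences):
    # Frame-wise mean of the M+1 per-frame similarity sequences.
    return [sum(frame) / len(frame) for frame in zip(*sequences)]

def median_peak_sequence(sequences):
    # The single sequence whose peak similarity is the median of all peaks.
    peaks = [max(seq) for seq in sequences]
    return sequences[peaks.index(median_low(peaks))]

def target_frame(sequence):
    # Index of the frame with the highest similarity.
    return max(range(len(sequence)), key=sequence.__getitem__)

seqs = [[0.1, 0.9, 0.2],   # similarity sequence for the text to be processed
        [0.2, 0.7, 0.3],   # similarity sequence for fourth text 1
        [0.0, 0.5, 0.8]]   # similarity sequence for fourth text 2
print(target_frame(average_sequence(seqs)))      # 1
print(target_frame(median_peak_sequence(seqs)))  # 2
```

As the toy example shows, the two strategies can pick different target frames; the median variant discards outlier sequences entirely rather than blending them in.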
Fig. 4 shows an example of target frame recognition results on the Charades-STA dataset obtained using, respectively, the similarity sequence for the text to be processed, the average similarity sequence obtained by the video timing positioning unit 312, the median similarity sequence obtained by the video timing positioning unit 312, and the manual average similarity sequence. Here, the manual average similarity sequence denotes the average similarity sequence obtained by averaging the similarity sequence for the text to be processed together with the similarity sequences between the predetermined video and M texts corresponding to the text to be processed generated using M manually selected text generation models (M=10 in the example shown in fig. 4).
In fig. 4, IoU0.5 R@1 denotes the recall rate when the optimal recognition result is determined to be correct if its IoU (Intersection over Union) with the ground truth is greater than 0.5. As shown in fig. 4, the recall of the median similarity sequence obtained by the video timing positioning unit 312 is increased by about 3.01% relative to the recall of the similarity sequence for the text to be processed, and by about 2.01% relative to the recall of the manual average similarity sequence. Further, the recall of the average similarity sequence obtained by the video timing positioning unit 312 is increased by about 1.83% relative to the recall of the similarity sequence for the text to be processed, and by about 0.83% relative to the recall of the manual average similarity sequence.
In addition, as shown in fig. 4, the recall of the median similarity sequence obtained based on the video timing positioning unit 312 is further increased by about 1.18% relative to the recall of the average similarity sequence obtained based on the video timing positioning unit 312, because the median similarity sequence can further reduce the effect of noise relative to the average similarity sequence.
As an example, as shown in fig. 3, the information processing apparatus 300 may further include a candidate text selection unit 314 and a target text selection unit 316.
The candidate text selection unit 314 may be configured to select, as the candidate text, a plurality of fourth texts having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree from the second predetermined number of fourth texts.
The target text selection unit 316 may be configured to select, as the target text, a third predetermined number of candidate texts having the lowest text similarity to each other from among the candidate texts selected by the candidate text selection unit 314. The third predetermined number may be set according to actual needs.
By the operations of the candidate text selection unit 314 and the target text selection unit 316 described above, for example, target texts having a lower similarity to each other can be further selected from the fourth texts, so that the diversity of the target texts can be further improved.
In the case where the information processing apparatus 300 includes the candidate text selection unit 314 and the target text selection unit 316, the video timing positioning unit 312 can identify the position of the frame corresponding to the text to be processed from the predetermined video based on the text to be processed and the target text, so that, for example, the identification accuracy can be further improved.
For example, the target text selection unit 316 may select, using a determinantal point process (DPP), a third predetermined number of candidate texts having the lowest text similarity to each other from among the candidate texts as the target texts. For example, the target text selection unit 316 may construct a matrix using a method similar to that described above for the target model selection unit 108, and then use the determinantal point process to solve for the L×L submatrix of the constructed matrix having the maximum determinant, where this submatrix corresponds to the third predetermined number of candidate texts having the lowest text similarity to each other among the candidate texts. Here, L is a natural number greater than 0 representing the third predetermined number.
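A minimal sketch of the determinantal selection idea, under the assumption that the similarity matrix has already been built as described: the subset whose principal submatrix has the largest determinant is the least mutually similar, and a standard greedy approximation is used here rather than exact (combinatorially expensive) search:

```python
import numpy as np

def greedy_max_det_subset(K, L):
    """Greedily choose L indices approximately maximizing det(K[S, S]).

    K should be a symmetric positive semi-definite kernel derived from
    pairwise similarities; a larger principal-minor determinant means a
    more diverse (less mutually similar) subset.
    """
    selected = []
    for _ in range(L):
        best_i, best_det = None, -np.inf
        for i in range(K.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(K[np.ix_(idx, idx)])
            if det > best_det:
                best_i, best_det = i, det
        selected.append(best_i)
    return selected

# Toy kernel: items 0 and 1 are near-duplicates; item 2 is distinct.
K = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
picked = greedy_max_det_subset(K, 2)  # expected: the two least similar items
```

The determinant of the submatrix for the near-duplicate pair {0, 1} is 1 − 0.81 = 0.19, while for the dissimilar pair {0, 2} it is 1 − 0.01 = 0.99, so the diverse pair wins.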
Note that the candidate text selection unit 314 and the target text selection unit 316 are shown in fig. 3 with dashed boxes to illustrate that in some embodiments, the information processing apparatus 300 may not include the candidate text selection unit 314 and the target text selection unit 316.
< third embodiment >
An information processing apparatus 400 according to a third embodiment of the present disclosure will be described below with reference to fig. 5. Fig. 5 is a block diagram showing a functional configuration example of an information processing apparatus 400 according to the third embodiment of the present disclosure.
As shown in fig. 5, an information processing apparatus 400 according to a third embodiment of the present disclosure may include a second text generation unit 402, a candidate model selection unit 404, a similarity calculation unit 406, a target model selection unit 408, a fourth text generation unit 410, and a multimodal model training unit 420. The second text generating unit 402, the candidate model selecting unit 404, the similarity calculating unit 406, the target model selecting unit 408, and the fourth text generating unit 410 may be similar to the second text generating unit 102, the candidate model selecting unit 104, the similarity calculating unit 106, the target model selecting unit 108, and the fourth text generating unit 110 included in the information processing apparatus 100 of the first embodiment described above with reference to figs. 1 and 2, and thus will not be described in detail here.
For example, the multimodal model training unit 420 may be configured to train the multimodal model for video timing localization based on the text to be processed and the second predetermined number of fourth texts to obtain a trained multimodal model, so that for example, recognition accuracy, robustness, etc. of the trained multimodal model may be improved.
As an example, as shown in fig. 5, the information processing apparatus 400 may further include a candidate text selection unit 414 and a target text selection unit 416. The candidate text selection unit 414 and the target text selection unit 416 may be similar to the candidate text selection unit 314 and the target text selection unit 316 described above with reference to fig. 3, and thus the specific details will not be repeated.
For example, the multimodal model training unit 420 may train the multimodal model for video timing positioning based on the text to be processed and the target text selected by the candidate text selecting unit 414 to obtain a trained multimodal model, so that, for example, recognition accuracy, robustness, and the like of the trained multimodal model may be further improved.
Note that the candidate text selection unit 414 and the target text selection unit 416 are shown in fig. 5 with dashed boxes to illustrate that in some embodiments, the information processing apparatus 400 may not include the candidate text selection unit 414 and the target text selection unit 416.
< fourth embodiment >
An information processing apparatus 500 according to a fourth embodiment of the present disclosure will be described below with reference to fig. 6. Fig. 6 is a block diagram showing a functional configuration example of an information processing apparatus 500 according to a fourth embodiment of the present disclosure.
As shown in fig. 6, an information processing apparatus 500 according to a fourth embodiment of the present disclosure may include a fifth text generation unit 502, a candidate text selection unit 504, a text similarity calculation unit 506, and a target text selection unit 508.
The fifth text generation unit 502 may be configured to generate a plurality of fifth texts corresponding to the text to be processed using a text generation model. For example, the fifth text generation unit 502 may generate the plurality of fifth texts using the back-translation method. For example, the fifth text generation unit 502 may generate the plurality of fifth texts using a plurality of text generation models (e.g., a plurality of text translation models). Further, for example, the fifth text generation unit 502 may generate the plurality of fifth texts by means of beam search using one text generation model, such as one text translation model.
The candidate text selection unit 504 may be configured to select, as candidate fifth texts, those of the plurality of fifth texts generated by the fifth text generation unit 502 whose semantic matching degree with the text to be processed is greater than or equal to a predetermined matching degree.
The text similarity calculation unit 506 may be configured to calculate, for each candidate fifth text, a text similarity of the candidate fifth text with each of the other candidate fifth texts. For example, the text similarity calculation unit 506 may calculate the text similarity between the candidate fifth texts in a similar manner to the manner in which the similarity calculation unit 106 in the first embodiment described above calculates the text similarity.
The target text selection unit 508 may be configured to select, as target texts for subsequent processing, a fourth predetermined number of candidate fifth texts having the lowest text similarity to each other from among the candidate fifth texts. For example, the target text selection unit 508 may make this selection using a determinantal point process. The fourth predetermined number may be set according to actual needs.
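The filter-then-diversify pipeline formed by units 504–508 can be sketched as follows. This is a hedged illustration: `match_fn` is a placeholder for whatever semantic matching model is actually used (word-overlap Jaccard stands in for it below), and a greedy farthest-first pass stands in for exact determinantal-point-process selection:

```python
def jaccard(a, b):
    """Word-overlap similarity, in the spirit of the set-splitting
    text similarity described in the first embodiment."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_diverse_texts(query, candidates, match_fn, min_match, k):
    """Keep candidates matching `query` well enough, then greedily pick k
    texts least similar to those already picked (a cheap stand-in for
    exact determinantal-point-process selection)."""
    pool = [c for c in candidates if match_fn(query, c) >= min_match]
    chosen = []
    while pool and len(chosen) < k:
        # Pick the candidate least similar to anything already chosen.
        best = min(pool,
                   key=lambda c: max((jaccard(c, s) for s in chosen),
                                     default=0.0))
        chosen.append(best)
        pool.remove(best)
    return chosen

query = "a person opens the door"
candidates = [
    "a person opens the door",
    "someone opens a door",
    "the cat sleeps on the sofa",   # semantically unrelated, filtered out
]
targets = select_diverse_texts(query, candidates, jaccard, 0.2, 2)
```

The unrelated candidate falls below the matching threshold and is dropped; the remaining diverse paraphrases become the target texts.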
As described above, the information processing apparatus 500 according to the fourth embodiment of the present disclosure may select a target text from among a plurality of fifth texts in consideration of the degree of semantic matching between a text to be processed and the fifth texts generated using the text generation model and the degree of text similarity between the fifth texts. Thus, there is a suitable degree of semantic matching between the target text and the text to be processed, e.g. the meaning of the target text and the text to be processed may be close. In addition, the text similarity between the target texts may be low, so that the diversity is high. That is, with the information processing apparatus 500, a target text having high quality and high diversity can be obtained.
For example, the text to be processed may include text input by a user or text obtained by converting voice or image input by the user.
As an example, as shown in fig. 6, the information processing apparatus 500 may further include a video timing positioning unit 512. The video timing positioning unit 512 may identify the position of the frame corresponding to the text to be processed from the predetermined video based on the text to be processed and the target text selected by the target text selecting unit 508. Since the video timing positioning unit 512 can recognize a predetermined video using enhanced text (i.e., a text to be processed and a target text), recognition accuracy can be improved. For example, the video timing positioning unit 512 may have a configuration similar to that of the video timing positioning unit 312 in the above-described second embodiment, and thus will not be described in detail.
Fig. 7 shows an example of target frame recognition results on the Charades-STA dataset obtained by using, respectively, the similarity sequence for the text to be processed, the average similarity sequence obtained by the video timing positioning unit 512, and the manual average similarity sequence. In the example shown in fig. 7, the number of target texts corresponding to one text to be processed is 10. As shown in fig. 7, the recall based on the average similarity sequence obtained by the video timing positioning unit 512 is increased by about 1.62% relative to the recall based on the similarity sequence for the text to be processed, and by about 0.62% relative to the recall based on the manual average similarity sequence.
As a further example, the text to be processed may include training text, and the information processing apparatus 500 may further include a multimodal model training unit 520, as shown in fig. 6. For example, the multimodal model training unit 520 may train the multimodal model for video timing positioning based on the text to be processed and the target text selected by the target text selecting unit 508 to obtain a trained multimodal model, so that, for example, recognition accuracy, robustness, and the like of the trained multimodal model may be improved. For example, the multimodal model training unit 520 may have a configuration similar to that of the multimodal model training unit 420 in the above-described third embodiment, and thus will not be described in detail.
Note that video timing positioning unit 512 and multimodal model training unit 520 are shown in dashed boxes in fig. 6, and that in some embodiments, information processing apparatus 500 may not include video timing positioning unit 512 and/or multimodal model training unit 520.
The information processing apparatuses according to the embodiments of the present disclosure have been described above. Corresponding to these apparatus embodiments, the present disclosure also provides the following embodiments of the information processing method.
< fifth embodiment >
Fig. 8 is a flowchart showing an example flow of the information processing method 600 according to an embodiment of the present disclosure. As shown in fig. 8, the information processing method 600 according to an embodiment of the present disclosure may start at a start step S601, end at an end step S612, and may include a second text generation step S602, a candidate model selection step S604, a similarity calculation step S606, a target model selection step S608, and a fourth text generation step S610.
In the second text generation step S602, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text may be generated using a plurality of text generation models. For example, the second text generating step S602 may be implemented by the second text generating units 102, 302 and 402 in the above apparatus embodiments, so specific details may be found in the above description of the second text generating units 102, 302 and 402, and will not be repeated here.
In the candidate model selection step S604, a first predetermined number of text generation models may be selected as candidate models from the plurality of text generation models based on the degree of semantic matching between the plurality of second texts generated by the second text generation step S602 and the corresponding first texts. For example, the first predetermined number may be set according to actual needs. For example, the candidate model selection step S604 may be implemented by the candidate model selection units 104, 304 and 404 in the above apparatus embodiments, so specific details may be found in the above description of the candidate model selection units 104, 304 and 404, and will not be repeated here.
In the similarity calculating step S606, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated by the candidate model selected in the candidate model selecting step S604 may be calculated, and a model similarity between candidate models may be calculated based on the text similarity between the second texts. For example, the similarity calculation step S606 may be implemented by the similarity calculation units 106, 306 and 406 in the above apparatus embodiments, so specific details may be found in the above description of the similarity calculation units 106, 306 and 406, and will not be repeated here.
In the target model selection step S608, a second predetermined number of candidate models having the lowest model similarity to each other may be selected from among the candidate models as the target models. For example, the second predetermined number may be set according to actual needs. For example, the object model selection step S608 may be implemented by the object model selection units 108, 308 and 408 in the above apparatus embodiments, so specific details may be found in the above description of the object model selection units 108, 308 and 408, and will not be repeated here.
In a fourth text generation step S610, a second predetermined number of fourth texts corresponding to the text to be processed may be generated using the target model for subsequent processing. For example, the fourth text generating step S610 may be implemented by the fourth text generating units 110, 310 and 410 in the above apparatus embodiments, so specific details may be found in the above description of the fourth text generating units 110, 310 and 410, and will not be repeated here.
Similarly to the information processing apparatus according to the embodiment of the present disclosure, the information processing method 600 according to the fifth embodiment of the present disclosure may select a target model from a plurality of text generation models in consideration of the degree of semantic matching between a first text and a second text generated using the plurality of text generation models and the degree of text similarity between the second texts. Thus, using the object model, a fourth text having a suitable degree of semantic matching with the text to be processed, for example, a fourth text having a meaning close to that of the text to be processed, can be generated. In addition, the text similarity between the fourth texts may be low, so that the diversity is high. That is, with the information processing method 600, the fourth text having high quality and high diversity can be generated.
As an example, in the similarity calculation step S606, the text similarity between any two second texts can be calculated as follows: splitting the two second texts into sets of words and/or terms, respectively; taking the intersection and the union of the resulting sets; and calculating the ratio of the number of words and terms contained in the intersection to the number of words and terms contained in the union (in each case, the sum of the number of words and the number of terms) as the text similarity between the two second texts.
For example, in the similarity calculation step S606, the model similarity between any two candidate models can be calculated as follows: for each of one or more predetermined first texts, obtaining a text similarity between second texts corresponding to the first texts obtained by the arbitrary two candidate models, and calculating a mean value of the text similarity corresponding to the one or more predetermined first texts as a model similarity between the arbitrary two candidate models.
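The text-similarity and model-similarity calculations in the two steps above can be sketched together as follows. This is an illustrative sketch rather than the patent's implementation: `jaccard` is a word-level version of the set-splitting similarity, and the model kernel is the per-first-text mean just described:

```python
import numpy as np

def jaccard(a, b):
    """Split two texts into word sets and take |intersection| / |union|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def model_similarity_matrix(outputs, text_sim=jaccard):
    """outputs[m][t] is candidate model m's second text for first text t.

    The similarity between two models is the mean, over all first texts,
    of the similarity between their respective outputs for each text.
    """
    M, T = len(outputs), len(outputs[0])
    S = np.eye(M)
    for a in range(M):
        for b in range(a + 1, M):
            s = float(np.mean([text_sim(outputs[a][t], outputs[b][t])
                               for t in range(T)]))
            S[a, b] = S[b, a] = s
    return S

outputs = [
    ["a man runs fast", "a dog barks loudly"],         # model 0
    ["a man runs fast", "a dog barks loudly"],         # model 1 (identical)
    ["someone is sprinting", "the hound is howling"],  # model 2 (distinct)
]
S = model_similarity_matrix(outputs)
```

A matrix like `S` is exactly the kind of pairwise-similarity matrix on which the subsequent determinantal selection of maximally diverse models operates.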
For example, in the target model selection step S608, a second predetermined number of candidate models having the lowest model similarity to each other may be selected from among the candidate models as the target models using a determinantal point process.
As an example, the information processing method 600 may further include a video timing positioning step (not shown). In the video timing positioning step, the position of the frame corresponding to the text to be processed may be identified from the predetermined video based on the text to be processed and the second predetermined number of fourth texts generated by the fourth text generating step S610. Since the predetermined video can be recognized using the enhanced text (i.e., the text to be processed and the fourth text) in the video timing positioning step, the recognition accuracy can be improved.
For example, the video timing positioning step may be implemented by the video timing positioning unit 312 in the above second embodiment, so specific details may be found in the above description of the video timing positioning unit 312, and will not be repeated here.
As a further example, the information processing method 600 may also include a multimodal model training step (not shown). In the multimodal model training step, the multimodal model for video timing positioning may be trained based on the text to be processed and the second predetermined number of fourth texts to obtain a trained multimodal model, so that, for example, recognition accuracy, robustness, and the like of the trained multimodal model may be improved. For example, the multimodal model training step may be implemented by the multimodal model training unit 420 in the third embodiment, so specific details may be found in the description of the multimodal model training unit 420 above, and will not be repeated here.
As an example, the information processing method 600 may further include a candidate text selection step and a target text selection step (not shown).
In the candidate text selecting step, a plurality of fourth texts having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree may be selected from the second predetermined number of fourth texts as candidate texts. For example, the candidate text selection step may be implemented by the candidate text selection units 314 and 414 in the above apparatus embodiment, so specific details may be found in the above description of the candidate text selection units 314 and 414, and will not be repeated here.
In the target text selecting step, a third predetermined number of candidate texts having the lowest text similarity to each other may be selected as the target text from among the candidate texts selected by the candidate text selecting step. For example, the target text selection step may be implemented by the target text selection units 316 and 416 in the above apparatus embodiments, so specific details may be found in the above description of the target text selection units 316 and 416, and will not be repeated here.
For example, in the case where the information processing method 600 includes the candidate text selection step and the target text selection step, in the video timing positioning step, the position of the frame corresponding to the text to be processed can be identified from the predetermined video based on the text to be processed and the target text, so that, for example, the identification accuracy can be further improved.
Further, for example, in the case where the information processing method 600 includes the candidate text selection step and the target text selection step, in the multimodal model training step, the multimodal model for video timing positioning may be trained based on the text to be processed and the target text to obtain a trained multimodal model, so that, for example, recognition accuracy, robustness, and the like of the trained multimodal model may be further improved.
< sixth embodiment >
An information processing method 700 according to a sixth embodiment of the present disclosure will be described below with reference to fig. 9. Fig. 9 is a flowchart showing an example flow of the information processing method 700 according to an embodiment of the present disclosure. As shown in fig. 9, the information processing method 700 according to an embodiment of the present disclosure may start at a start step S701, end at an end step S712, and may include a fifth text generation step S702, a candidate text selection step S704, a text similarity calculation step S706, and a target text selection step S708.
In the fifth text generation step S702, a plurality of fifth texts corresponding to the text to be processed may be generated using the text generation model. For example, the fifth text generating step S702 may be implemented by the fifth text generating unit 502 in the fourth embodiment, so specific details may be found in the description of the fifth text generating unit 502 above, and will not be repeated here.
In the candidate text selection step S704, a fifth text having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree may be selected as a candidate fifth text from among the plurality of fifth texts generated by the fifth text generation step S702. For example, the candidate text selection step S704 may be implemented by the candidate text selection unit 504 in the fourth embodiment, so specific details may be found in the above description of the candidate text selection unit 504, and will not be repeated here.
In the text similarity calculation step S706, for each candidate fifth text, the text similarity of the candidate fifth text to each of the other candidate fifth texts may be calculated. For example, the text similarity calculating step S706 may be implemented by the text similarity calculating unit 506 in the fourth embodiment, so specific details may be found in the description of the text similarity calculating unit 506 above, and will not be repeated here.
In the target text selection step S708, a fourth predetermined number of candidate fifth texts having the lowest text similarity to each other may be selected as target texts from among the candidate fifth texts for subsequent processing. For example, the target text selection step S708 may be implemented by the target text selection unit 508 in the fourth embodiment, so specific details may be found in the description of the target text selection unit 508 above, and will not be repeated here.
As described above, the information processing method 700 according to the sixth embodiment of the present disclosure may select a target text from among a plurality of fifth texts in consideration of the degree of semantic matching between a text to be processed and the fifth texts generated using the text generation model and the degree of text similarity between the fifth texts. Thus, there is a suitable degree of semantic matching between the target text and the text to be processed, e.g. the meaning of the target text and the text to be processed may be close. In addition, the text similarity between the target texts may be low, so that the diversity is high. That is, with the information processing method 700, a target text having high quality and high diversity can be obtained.
As an example, the information processing method 700 may further include a video timing positioning step (not shown). In the video timing positioning step, the position of the frame corresponding to the text to be processed can be identified from the predetermined video based on the text to be processed and the target text, so that, for example, the identification accuracy can be improved. For example, the video timing positioning step may be implemented by the video timing positioning unit 512 in the fourth embodiment, so specific details may be found in the description of the video timing positioning unit 512 above, and will not be repeated here.
As a further example, the information processing method 700 may also include a multimodal model training step (not shown). For example, in the multimodal model training step, the multimodal model for video timing positioning may be trained based on the text to be processed and the target text to obtain a trained multimodal model, so that, for example, recognition accuracy, robustness, and the like of the trained multimodal model may be improved. For example, the multimodal model training step may be implemented by the multimodal model training unit 520 in the fourth embodiment, so specific details may be found in the description of the multimodal model training unit 520 above, and will not be repeated here.
It should be noted that, although the functional configurations and operations of the information processing apparatus and the information processing method according to the embodiments of the present disclosure are described above, this is merely an example and not a limitation, and those skilled in the art may modify the above embodiments according to the principles of the present disclosure, for example, may add, delete, or combine the functional modules and operations in the respective embodiments, and such modifications fall within the scope of the present disclosure.
It should be noted that the method embodiments herein correspond to the apparatus embodiments described above; therefore, for details not described in the method embodiments, reference may be made to the description of the corresponding parts in the apparatus embodiments, and the description is not repeated here.
Further, the present disclosure also provides storage media and program products. It should be appreciated that the machine executable instructions in the storage medium and the program product according to embodiments of the present disclosure may also be configured to perform the above-described information processing method, and thus the contents not described in detail herein may refer to the description of the previous corresponding parts, and the description is not repeated herein.
Accordingly, a storage medium carrying the above-described program product comprising machine-executable instructions is also included in the present disclosure. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
In addition, it should be noted that the above-described series of processes and systems may also be implemented in software and/or firmware. In that case, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example, the general-purpose personal computer 1000 shown in fig. 10, which is capable of executing various functions when various programs are installed.
In fig. 10, a Central Processing Unit (CPU) 1001 performs various processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 to a Random Access Memory (RAM) 1003. In the RAM 1003, data required when the CPU 1001 executes various processes and the like is also stored as needed.
The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An input/output interface 1005 is also connected to the bus 1004.
The following components are connected to the input/output interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet.
The drive 1010 is also connected to the input/output interface 1005 as needed. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.
In the case of implementing the above-described series of processes by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1011.
It will be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1011 shown in fig. 10, in which the program is stored and which is distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1011 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disk read only memory (CD-ROM) and a digital versatile disk (DVD)), a magneto-optical disk (including a mini disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1002, a hard disk contained in the storage section 1008, or the like, in which the program is stored and which is distributed to users together with the device containing it.
The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, but the present disclosure is of course not limited to the above examples. Various changes and modifications may be made by those skilled in the art within the scope of the appended claims, and it is understood that such changes and modifications will naturally fall within the technical scope of the present disclosure.
For example, a plurality of functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, the functions realized by the plurality of units in the above embodiments may be realized by separate devices, respectively. In addition, one of the above functions may be implemented by a plurality of units. Needless to say, such a configuration is included in the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only processes performed in time series in the order described, but also processes performed in parallel or individually and not necessarily in time series. Furthermore, even for steps processed in time series, the order may of course be changed as appropriate.
In addition, the technology according to the present disclosure may also be configured as follows.
Supplementary note 1. An information processing apparatus includes:
a second text generation unit configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
a candidate model selection unit configured to select a first predetermined number of text generation models from the plurality of text generation models as candidate models based on a degree of semantic matching between the plurality of second texts and the corresponding first texts;
a similarity calculation unit configured to calculate, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated using the candidate models, and calculate model similarity between the candidate models based on the text similarity between the second texts;
a target model selecting unit configured to select, as target models, a second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models; and
a fourth text generation unit configured to generate, using the target models, a second predetermined number of fourth texts corresponding to a text to be processed, for subsequent processing.
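As an illustration of the candidate-model selection step in supplementary note 1, the sketch below scores each text generation model by the average semantic matching degree between its generated second texts and the corresponding first texts, and keeps the top-scoring models. All names here are assumptions for illustration only: the `models` are arbitrary callables that paraphrase a text, and `semantic_match` is a stand-in for any semantic matching scorer (a plain word-overlap score is supplied as an example).

```python
def word_overlap(a: str, b: str) -> float:
    """Illustrative semantic matching stand-in: word-set overlap ratio."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def select_candidate_models(first_texts, models, semantic_match, k):
    """Keep the k models whose generated second texts best match the
    meaning of the corresponding first texts, averaged over all texts."""
    def score(model):
        outputs = [model(t) for t in first_texts]
        return sum(semantic_match(o, t)
                   for o, t in zip(outputs, first_texts)) / len(first_texts)
    return sorted(models, key=score, reverse=True)[:k]
```

A real system would replace `word_overlap` with a learned semantic matching model; the structure of the selection step is unchanged.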
Supplementary note 2. The information processing apparatus according to supplementary note 1, wherein,
the text to be processed includes text input by a user or text obtained by converting voice or image input by the user,
The information processing apparatus further includes a video timing positioning unit configured to identify a position of a frame corresponding to the text to be processed from a predetermined video based on the text to be processed and the second predetermined number of fourth texts.
Supplementary note 3. The information processing apparatus according to supplementary note 1, wherein,
the text to be processed comprises training text,
the information processing apparatus further includes:
a multimodal model training unit configured to train a multimodal model for video timing positioning based on the text to be processed and the second predetermined number of fourth texts to obtain a trained multimodal model.
Supplementary note 4. The information processing apparatus according to supplementary note 2, further comprising:
a candidate text selection unit configured to select, as candidate texts, a plurality of fourth texts having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree from the second predetermined number of fourth texts; and
a target text selection unit configured to select, as target texts, a third predetermined number of candidate texts having the lowest text similarity to each other from the candidate texts,
wherein the video timing positioning unit is further configured to identify a position of a frame corresponding to the text to be processed from the predetermined video based on the text to be processed and the target text.
Supplementary note 5. The information processing apparatus according to any one of supplementary notes 1 to 4, wherein,
the first text, the second text and the fourth text belong to the same language,
the second text generation unit is configured to generate the plurality of second texts by using a back-translation method, and
the fourth text generation unit is configured to generate the fourth text using a back-translation method.
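Back-translation, as referenced in supplementary note 5, paraphrases a text by translating it into a pivot language and then back into the source language. A toy sketch follows; the tiny word tables are purely illustrative stand-ins for real machine translation models.

```python
# Toy bilingual word tables standing in for real translation models.
EN_TO_FR = {"the": "le", "cat": "chat", "sleeps": "dort"}
FR_TO_EN = {"le": "the", "chat": "cat", "dort": "sleeps"}

def translate(text: str, table: dict) -> str:
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

def back_translate(text: str) -> str:
    """Generate a paraphrase candidate: source -> pivot -> source."""
    pivot = translate(text, EN_TO_FR)
    return translate(pivot, FR_TO_EN)
```

With real neural translation models the round trip typically yields a paraphrase rather than the identity; the toy dictionaries here round-trip exactly.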
Supplementary note 6. The information processing apparatus according to any one of supplementary notes 1 to 4, wherein,
the similarity calculation unit is configured to calculate text similarity between the plurality of second texts by:
splitting each of the second texts into a set of words and/or characters; and
acquiring the intersection and the union of the two sets corresponding to any two second texts, and calculating the ratio of the number of words and characters contained in the acquired intersection to the number of words and characters contained in the acquired union as the text similarity between those two second texts.
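The intersection-over-union ratio described above is the Jaccard coefficient. A minimal sketch, assuming a whitespace tokenizer (a CJK text would more likely be split into characters):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Text similarity as |intersection| / |union| of the word sets."""
    set_a = set(text_a.split())  # split each text into a set of words
    set_b = set(text_b.split())
    if not set_a and not set_b:
        return 1.0               # two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)
```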
Supplementary note 7. The information processing apparatus according to supplementary note 6, wherein the similarity calculation unit is configured to calculate the model similarity between the candidate models as follows:
for each of the one or more predetermined first texts, acquiring the text similarity between the second texts generated for that first text by any two candidate models, and calculating the mean of the text similarities over the one or more predetermined first texts as the model similarity between those two candidate models.
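The model-similarity computation just described — the mean, over the predetermined first texts, of the text similarities between two models' outputs — might be sketched as below. The inner word-overlap similarity is an illustrative assumption; `texts_m1[i]` and `texts_m2[i]` are assumed to be the second texts each model generated for the i-th first text.

```python
def model_similarity(texts_m1, texts_m2):
    """Mean text similarity between the outputs of two candidate models."""
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0
    pairs = list(zip(texts_m1, texts_m2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```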
Supplementary note 8. The information processing apparatus according to any one of supplementary notes 1 to 4, wherein the target model selecting unit is configured to select the target models by:
constructing an N x N-dimensional matrix based on the model similarities of the candidate models to each other, each element in the N x N-dimensional matrix representing the model similarity between the respective candidate models, wherein N represents the first predetermined number;
solving for an M x M-dimensional maximum-determinant submatrix of the N x N-dimensional matrix using a determinantal point process, wherein M represents the second predetermined number; and
selecting the candidate models corresponding to the M x M-dimensional maximum-determinant submatrix as the target models.
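A determinantal point process favors subsets whose similarity submatrix has a large determinant, which corresponds to mutually dissimilar models. The brute-force sketch below searches every M x M principal submatrix for the largest determinant; this is workable for small N, whereas a production DPP implementation would typically use greedy MAP inference instead.

```python
from itertools import combinations

def det(matrix):
    """Determinant by Laplace expansion (fine for small matrices)."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += ((-1) ** j) * matrix[0][j] * det(minor)
    return total

def select_target_models(similarity, m):
    """Pick the m indices whose principal submatrix of the N x N model
    similarity matrix has the largest determinant (most diverse models)."""
    n = len(similarity)
    best = max(
        combinations(range(n), m),
        key=lambda idx: det([[similarity[i][j] for j in idx] for i in idx]),
    )
    return list(best)
```

With unit diagonal entries, small off-diagonal similarities drive the determinant toward 1, so the selected subset is the one whose models disagree most.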
Supplementary note 9. The information processing apparatus according to supplementary note 4, wherein the video timing positioning unit is further configured to determine a final recognition result from the median of the recognition result based on the text to be processed and the recognition results based on the target texts.
Supplementary note 10. An information processing apparatus, including:
a fifth text generation unit configured to generate, using a text generation model, a plurality of fifth texts corresponding to a text to be processed;
a candidate text selection unit configured to select, as a candidate fifth text, a fifth text having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree from the plurality of fifth texts;
a text similarity calculation unit configured to calculate, for each candidate fifth text, a text similarity between the candidate fifth text and each of the other candidate fifth texts; and
a target text selection unit configured to select, as target texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity to each other from among the candidate fifth texts, for subsequent processing.
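The target text selection in supplementary note 10 — keeping the fourth predetermined number of candidates least similar to one another — can be sketched by minimizing total pairwise similarity over subsets. The word-overlap similarity is again an illustrative assumption, and the exhaustive subset search is only feasible for small candidate sets.

```python
from itertools import combinations

def word_overlap(a, b):
    """Illustrative text similarity: word-set overlap ratio."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def select_diverse_texts(candidates, k):
    """Choose the k candidate texts whose total pairwise similarity is lowest."""
    best = min(
        combinations(candidates, k),
        key=lambda subset: sum(word_overlap(a, b)
                               for a, b in combinations(subset, 2)),
    )
    return list(best)
```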
Supplementary note 11. The information processing apparatus according to supplementary note 10, wherein,
the text to be processed includes text input by a user or text obtained by converting voice or image input by the user,
the information processing apparatus further includes a video timing positioning unit configured to identify, based on the text to be processed and the target texts, the position of a frame corresponding to the text to be processed from a predetermined video.
Supplementary note 12. The information processing apparatus according to supplementary note 10, wherein,
the text to be processed comprises training text,
the information processing apparatus further includes a multimodal model training unit configured to train a multimodal model for video timing localization based on the text to be processed and the target text to obtain a trained multimodal model.
Supplementary note 13. An information processing method includes:
generating a plurality of second texts corresponding to the first texts by using a plurality of text generation models for each of one or more predetermined first texts;
selecting a first predetermined number of text generation models from the plurality of text generation models as candidate models based on the degree of semantic matching between the plurality of second texts and the corresponding first texts;
calculating, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated using the candidate models, and calculating model similarity between the candidate models based on the text similarity between the second texts;
selecting, as target models, a second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models; and
generating, using the target models, a second predetermined number of fourth texts corresponding to a text to be processed, for subsequent processing.
Supplementary note 14. The information processing method according to supplementary note 13, wherein,
the text to be processed includes text input by a user or text obtained by converting voice or image input by the user,
the information processing method further includes: based on the text to be processed and the second predetermined number of fourth texts, a position of a frame corresponding to the text to be processed is identified from a predetermined video.
Supplementary note 15. The information processing method according to supplementary note 13, wherein,
the text to be processed comprises training text,
the information processing method further includes: training a multimodal model for video timing positioning based on the text to be processed and the second predetermined number of fourth texts to obtain a trained multimodal model.
Supplementary note 16. The information processing method according to supplementary note 14, further comprising:
selecting, as candidate texts, a plurality of fourth texts whose degree of semantic matching with the text to be processed is greater than or equal to a predetermined matching degree from among the second predetermined number of fourth texts; and
selecting, as target texts, a third predetermined number of candidate texts having the lowest text similarity to each other from among the candidate texts,
wherein identifying the position of the frame corresponding to the text to be processed includes: identifying, based on the text to be processed and the target texts, the position of the frame corresponding to the text to be processed from the predetermined video.
Supplementary note 17 the information processing method according to any one of supplementary notes 13 to 16, wherein,
the first text, the second text and the fourth text belong to the same language,
the plurality of second texts and the fourth text are generated using a back-translation method.
Supplementary note 18 the information processing method according to any one of supplementary notes 13 to 16, wherein,
calculating text similarity between the plurality of second texts by using the following method:
splitting each of the second texts into a set of words and/or characters; and
acquiring the intersection and the union of the two sets corresponding to any two second texts, and calculating the ratio of the number of words and characters contained in the acquired intersection to the number of words and characters contained in the acquired union as the text similarity between those two second texts.
Supplementary note 19. The information processing method according to supplementary note 18, wherein the model similarity between the candidate models is calculated as follows:
for each of the one or more predetermined first texts, acquiring the text similarity between the second texts generated for that first text by any two candidate models, and calculating the mean of the text similarities over the one or more predetermined first texts as the model similarity between those two candidate models.
Supplementary note 20. The information processing method according to supplementary note 16, wherein a final recognition result is determined from the median of the recognition result based on the text to be processed and the recognition results based on the target texts.

Claims (10)

1. An information processing apparatus comprising:
a second text generation unit configured to generate, for each of one or more predetermined first texts, a plurality of second texts corresponding to the first text using a plurality of text generation models;
a candidate model selection unit configured to select a first predetermined number of text generation models from the plurality of text generation models as candidate models based on a degree of semantic matching between the plurality of second texts and the corresponding first texts;
a similarity calculation unit configured to calculate, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated using the candidate models, and calculate model similarity between the candidate models based on the text similarity between the second texts;
a target model selecting unit configured to select, as target models, a second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models; and
a fourth text generation unit configured to generate, using the target models, a second predetermined number of fourth texts corresponding to a text to be processed, for subsequent processing.
2. The information processing apparatus according to claim 1, wherein,
the text to be processed includes text input by a user or text obtained by converting voice or image input by the user,
the information processing apparatus further includes a video timing positioning unit configured to identify a position of a frame corresponding to the text to be processed from a predetermined video based on the text to be processed and the second predetermined number of fourth texts.
3. The information processing apparatus according to claim 1, wherein,
the information processing apparatus further includes a multimodal model training unit configured to train a multimodal model for video timing positioning based on the text to be processed and the second predetermined number of fourth texts to obtain a trained multimodal model.
4. The information processing apparatus according to claim 2, further comprising:
a candidate text selection unit configured to select, as candidate texts, a plurality of fourth texts having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree from the second predetermined number of fourth texts; and
a target text selection unit configured to select, as target texts, a third predetermined number of candidate texts having the lowest text similarity to each other from the candidate texts,
wherein the video timing positioning unit is further configured to identify a position of a frame corresponding to the text to be processed from the predetermined video based on the text to be processed and the target text.
5. The information processing apparatus according to any one of claims 1 to 4, wherein,
the first text, the plurality of second texts and the fourth text belong to the same language,
The second text generation unit is configured to generate the plurality of second texts by using a back-translation method, and
the fourth text generation unit is configured to generate the fourth text using a back-translation method.
6. The information processing apparatus according to any one of claims 1 to 4, wherein,
the similarity calculation unit is configured to calculate text similarity between the plurality of second texts by:
splitting each of the second texts into a set of words and/or characters; and
acquiring the intersection and the union of the two sets corresponding to any two second texts, and calculating the ratio of the number of words and characters contained in the acquired intersection to the number of words and characters contained in the acquired union as the text similarity between those two second texts.
7. The information processing apparatus according to claim 6, wherein the similarity calculation unit is configured to calculate the model similarity between the candidate models as follows:
for each of the one or more predetermined first texts, acquiring the text similarity between the second texts generated for that first text by any two candidate models, and calculating the mean of the text similarities over the one or more predetermined first texts as the model similarity between those two candidate models.
8. The information processing apparatus according to any one of claims 1 to 4, wherein the target model selecting unit is configured to select the target model by:
constructing an N x N-dimensional matrix based on the model similarity of the candidate models to each other, each element in the N x N-dimensional matrix representing the model similarity between the respective candidate models, wherein N represents the first predetermined number;
solving for an M x M-dimensional maximum-determinant submatrix of the N x N-dimensional matrix using a determinantal point process, wherein M represents the second predetermined number; and
selecting the candidate models corresponding to the M x M-dimensional maximum-determinant submatrix as the target models.
9. An information processing apparatus comprising:
a fifth text generation unit configured to generate, using a text generation model, a plurality of fifth texts corresponding to a text to be processed;
a candidate text selection unit configured to select, as a candidate fifth text, a fifth text having a semantic matching degree with the text to be processed greater than or equal to a predetermined matching degree from the plurality of fifth texts;
a text similarity calculation unit configured to calculate, for each candidate fifth text, a text similarity between the candidate fifth text and each of the other candidate fifth texts; and
a target text selection unit configured to select, as target texts, a fourth predetermined number of candidate fifth texts having the lowest text similarity to each other from among the candidate fifth texts, for subsequent processing.
10. An information processing method, comprising:
generating a plurality of second texts corresponding to the first texts by using a plurality of text generation models for each of one or more predetermined first texts;
selecting a first predetermined number of text generation models from the plurality of text generation models as candidate models based on the degree of semantic matching between the plurality of second texts and the corresponding first texts;
calculating, for each of the one or more predetermined first texts, a text similarity between a plurality of second texts corresponding to the first text generated using the candidate models, and calculating model similarity between the candidate models based on the text similarity between the second texts;
selecting, as target models, a second predetermined number of candidate models having the lowest model similarity to each other from among the candidate models; and
generating, using the target models, a second predetermined number of fourth texts corresponding to a text to be processed, for subsequent processing.
CN202111581637.0A 2021-12-22 2021-12-22 Information processing apparatus and information processing method Pending CN116415587A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111581637.0A CN116415587A (en) 2021-12-22 2021-12-22 Information processing apparatus and information processing method
JP2022194980A JP2023093349A (en) 2021-12-22 2022-12-06 Information processing device and information processing method


Publications (1)

Publication Number Publication Date
CN116415587A true CN116415587A (en) 2023-07-11

Family

ID=87001064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111581637.0A Pending CN116415587A (en) 2021-12-22 2021-12-22 Information processing apparatus and information processing method

Country Status (2)

Country Link
JP (1) JP2023093349A (en)
CN (1) CN116415587A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473047B (en) * 2023-12-26 2024-04-12 深圳市明源云客电子商务有限公司 Business text generation method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
JP2023093349A (en) 2023-07-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination