WO2016121048A1 - Text generation device and text generation method - Google Patents

Publication number
WO2016121048A1
WO2016121048A1 PCT/JP2015/052478 JP2015052478W WO2016121048A1 WO 2016121048 A1 WO2016121048 A1 WO 2016121048A1 JP 2015052478 W JP2015052478 W JP 2015052478W WO 2016121048 A1 WO2016121048 A1 WO 2016121048A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
expression
candidate
unit
evaluation
Prior art date
Application number
PCT/JP2015/052478
Other languages
French (fr)
Japanese (ja)
Inventor
佐藤 美沙
利昇 三好
利彦 柳瀬
芳樹 丹羽
孝介 柳井
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所
Priority to PCT/JP2015/052478
Publication of WO2016121048A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities

Definitions

  • The present invention relates to a text generation apparatus that abstracts a sentence or a text (a passage of several sentences) given by a user, and to a method executed by the apparatus.
  • In Patent Document 1, a recommendation sentence is generated from an example sentence by replacing keywords. Specifically, an example sentence is first selected based on a keyword designated by the user, and the keywords in the example sentence are matched with the input keywords. The similarity between corresponding keywords is measured, and when the similarity is moderate, the target sentence is obtained by replacing the keyword in the example sentence with the keyword specified by the user.
  • A specific expression (named entity) represents an entity.
  • An abstract expression represents a higher-level concept of the entity. For example, given the sentence “Malaria is endemic every year in Myanmar”, replacing “Myanmar” with “developing countries” generates the assertion “Malaria is endemic every year in developing countries”.
  • The technique of Patent Document 1 concerns the generation of recommendation sentences and cannot abstract a sentence.
  • In that technique, the term input by the user is used as-is as the replacement term.
  • In that technique, the replacement term is not automatically selected when there are multiple replacement candidates.
  • The present invention has been made in view of the above, and provides a mechanism for automatically generating, from a given sentence or text, an appropriately abstracted sentence or text while maintaining the correctness of its content.
  • a sentence generation system which is one of the inventions for solving the above problem has the following sections.
  • Input section used to input sentence and theme information to be processed
  • a replacement target expression extraction unit that extracts one or more of one or more unique expressions included in the sentence based on the theme information as a replacement target expression and specifies a keyword representing the theme information
  • a candidate generation unit that generates a plurality of candidate expressions that are replacement candidates that abstract the replacement target expression using dictionary information stored in advance.
  • a first evaluation unit that outputs a first evaluation result obtained by evaluating the candidate expression using the dictionary information.
  • a post-conversion sentence generation unit that generates a post-conversion sentence by replacing the replacement target expression with the candidate expression having a high evaluation in the first evaluation result
  • A replacement target expression included in a text can be replaced with a candidate expression that is appropriate in relation to the input theme information, so that a more abstract, easy-to-understand post-conversion text can be generated automatically.
  • FIG. 1 is a diagram illustrating a hardware configuration of a document generation device according to a first embodiment.
  • FIG. 2 is a diagram illustrating a functional configuration of a document generation apparatus according to a first embodiment.
  • A diagram showing the functional configuration of the second evaluation unit. A flowchart for explaining a processing procedure executed by the document generation apparatus according to the first embodiment.
  • In this embodiment, a text generation device is described that takes as input a text consisting of one or more sentences, together with text representing the theme information of that text, performs appropriate replacement, and outputs a generalized text.
  • For example, given the keyword “malaria” and the text “We should continue to promote economic assistance. In Myanmar, malaria is endemic every year and many people die.”, the device replaces the term “Myanmar” with the general expression “developing countries” and outputs “We should continue to promote economic assistance. In developing countries, malaria is endemic every year and many people die.”
  • the text generation device is configured with hardware using a normal computer.
  • FIG. 1 shows an example of a specific hardware configuration.
  • The text generation device includes an input device 110, an output device 120, an arithmetic device 130, a memory 140 that stores various data and various programs, a storage device 150 that stores various data and various programs, a network device 160 that controls communication with an external device, and a bus 170 connecting them.
  • The network device 160 is not essential.
  • the input device 110 and the output device 120 can be omitted.
  • FIG. 2 shows the functions of a program executed through the arithmetic unit 130 of the sentence generation device.
  • the input unit 210 receives a sentence to be replaced (only one sentence may be used) and theme information instructed by the user.
  • Input is performed through the input device 110, such as a keyboard, mouse or other input device, or a GUI screen.
  • the entity extraction unit 220 performs linguistic analysis on the input text and theme information, and identifies a specific expression to be replaced as an entity.
  • the “entity extraction unit” is also referred to as a “replacement target expression extraction unit”.
  • the entity information table 230 stores entity replacement destination candidate information.
  • the entity information table 230 is stored as a file in the memory 140 or the storage device 150.
  • the candidate generation unit 240 generates a replacement destination candidate for the entity extracted with reference to the entity information table 230.
  • The first evaluation unit 250 calculates a first evaluation score for each generated candidate using the entity information table 230. The first evaluation score is calculated for each sentence.
  • the second evaluation unit 260 calculates a second evaluation score for each candidate from the viewpoint of the entire sentence (a plurality of sentences). Note that the evaluation by the second evaluation unit 260 may be performed on a candidate with a high evaluation result by the first evaluation unit 250.
  • The post-conversion sentence generation unit 270 determines a replacement destination candidate based on the first evaluation score and the second evaluation score, and generates the final text using the determined candidate. The unit handles texts of one or more sentences; the name “post-conversion sentence generation unit” applies in particular when the conversion target is a single sentence.
  • the output unit 280 presents (displays) the generated text (abstracted text) to the user through the output device 120.
  • The entity extraction unit 220 first identifies the keyword that represents the theme, based on the input text and theme. When the theme is input as a keyword, that input is used as the keyword as-is. When the theme is input as a sentence, the keyword is identified from the expressions in that sentence: language analysis is performed on the input theme, specific expressions are extracted, and the specific expression that appears most often is set as the keyword. Alternatively, an expression that appears in both the text and the theme is extracted and used as the keyword.
  • the entity extraction unit 220 performs linguistic analysis on the input sentence, and extracts one or more specific expressions included in the sentence. Among the extracted specific expressions, those that are not keywords are used as specific expressions to be replaced (also referred to as “entities” or “replacement target expressions”). A specific expression that represents a date / number is an entity. There may be multiple entities in a sentence.
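The keyword and replacement target selection described above can be sketched roughly as follows. This is only an illustration, assuming that a language-analysis step has already produced the lists of specific expressions; the function names `select_keyword` and `select_replacement_targets` are hypothetical and do not appear in the specification.

```python
from collections import Counter


def select_keyword(theme_expressions):
    """Return the specific expression that appears most often in the theme.

    `theme_expressions` is assumed to come from an upstream
    language-analysis step that is not shown here.
    """
    if not theme_expressions:
        return None
    return Counter(theme_expressions).most_common(1)[0][0]


def select_replacement_targets(text_expressions, keywords):
    """Keep the specific expressions of the text that are not keywords."""
    return [e for e in text_expressions if e not in keywords]


keyword = select_keyword(["malaria", "Myanmar", "malaria"])
targets = select_replacement_targets(["Myanmar", "malaria"], {keyword})
```

With these inputs, "malaria" is kept as the keyword and only "Myanmar" remains as a replacement target expression.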
  • FIG. 3 shows a conceptual diagram of the entity information table 230.
  • the entity information table 230 is a dictionary (dictionary information) that stores one or more pairs of entities and their abstract expressions. A circle in the cell indicates that the entity in the corresponding column can take a candidate expression of the corresponding row.
  • By referring to the entity information table 230, it is possible to look up the abstract expressions that an entity can take. Conversely, it is also possible to look up the entities that can take a given abstract expression.
  • FIG. 4 shows an example of the data structure of the entity information table 230.
  • The entity information table 230 is a dictionary whose keys are the character strings of specific expressions and whose values are the entities that those expressions can represent.
  • Each entity consists of a class and candidates. The candidates field holds multiple abstract expressions for the entity.
  • The class field holds the class to which the entity belongs.
  • the class is a semantic classification such as “person name”, “location”, “organization name”, and the like.
  • Each entity may have a synonym expression field in order to prevent a plurality of data for the same entity from being distributed in the entity information table 230.
  • Each abstract expression can be given a score according to how frequently it co-occurs with the corresponding specific expression. For “Myanmar”, for example, “country with government”, “developing country”, “humid area”, “South Asia”, “country”, and so on are acquired as candidate expressions.
  • each entity has a class field, as described above.
  • “Nokia” may refer either to Nokia, a city in Finland, or to Nokia, a telecommunications equipment manufacturer that sells mobile phones and the like. When it refers to the Finnish city, the class is “place name” and the candidates include “city” and “Europe”. When it refers to the manufacturer, the class is “organization name” and the candidates include “telecommunications equipment manufacturer” and “company”.
  • Storing the class and the candidates separately for each entity in this way allows entities that share the same character string to be distinguished.
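One possible in-memory layout for the entity information table is sketched below. It is an assumption for illustration only: the `Entity` dataclass, the `lookup` helper and all score values are hypothetical, while the keys, classes and candidate expressions follow the “Nokia” example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    entity_class: str                 # e.g. "place name", "organization name"
    candidates: Dict[str, float]      # abstract expression -> score (values made up)
    synonyms: List[str] = field(default_factory=list)


# Key: character string of the specific expression.
# Value: every entity that this string can represent.
ENTITY_TABLE: Dict[str, List[Entity]] = {
    "Nokia": [
        Entity("place name", {"city": 1.0, "Europe": 1.0}),
        Entity(
            "organization name",
            {"telecommunications equipment manufacturer": 1.0, "company": 1.0},
        ),
    ],
}


def lookup(table, expression, entity_class) -> Optional[Entity]:
    """Return the entity with the given string and class, or None."""
    for entity in table.get(expression, []):
        if entity.entity_class == entity_class:
            return entity
    return None


city = lookup(ENTITY_TABLE, "Nokia", "place name")
maker = lookup(ENTITY_TABLE, "Nokia", "organization name")
```

Keeping a list of entities per string is one way to let the class disambiguate identical surface forms.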
  • The entity information table 230 can be created by manually assigning, to each specific expression, an entity and its abstract expressions. However, it is difficult to do this manually for all of a large number of specific expressions. Therefore, relation extraction technology is used to automatically extract entities and relationship information about them from plain text, and abstract expressions are assigned from the acquired relationship information.
  • the candidate generation unit 240 refers to the entity information table 230 and generates a plurality of candidate expressions that are candidates for replacing each entity. It should be noted that the possibility of not replacing is also ensured by including the specific expression to be replaced in the candidate expression.
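A minimal sketch of this candidate generation step, assuming a simplified table keyed by the pair (character string, class); the table contents and the name `generate_candidates` are hypothetical. The original expression is placed in the candidate list so that "no replacement" stays possible.

```python
# Hypothetical, simplified table: (string, class) -> abstract expressions.
TABLE = {
    ("Myanmar", "place name"): ["developing country", "humid area", "country"],
}


def generate_candidates(table, expression, entity_class):
    """Return the original expression followed by its abstract expressions."""
    candidates = [expression]  # keeping the original allows "do not replace"
    for abstract in table.get((expression, entity_class), []):
        if abstract not in candidates:
            candidates.append(abstract)
    return candidates


candidates = generate_candidates(TABLE, "Myanmar", "place name")
unknown = generate_candidates(TABLE, "Atlantis", "place name")
```

An expression that is missing from the table simply yields itself as the only candidate.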
  • FIG. 5 shows a functional configuration of the first evaluation unit 250.
  • the first evaluation unit 250 gives a first evaluation result in consideration of the content of the sentence to each candidate expression of the entity.
  • The similar case sentence search unit 251 acquires, from the sentence text data 252, a plurality of sentences that represent cases similar to the case represented by the sentence including the specific expression (“entity” or “replacement target expression”) to be replaced.
  • the sentence text data 252 may be text data stored in advance or text data on the Web.
  • Similar case sentences can be acquired by searching for similar sentences using an associative search engine using a query obtained by excluding a specific expression (replacement target expression) to be replaced from words in the sentence.
  • The similar case entity extraction unit 253 extracts the entity in each similar sentence that corresponds to the entity in the input sentence. For example, suppose that “Malaria is endemic every year and many people die in Sri Lanka” and “Malaria is endemic every year and many people die in Cambodia” are obtained as similar sentences. In this case, the similar case entity extraction unit 253 extracts “Cambodia” as an entity corresponding to “Myanmar”. In this case, “Myanmar” and “Cambodia” are similar case entities.
  • the similar case entity extraction unit 253 is also referred to as a “corresponding expression extraction unit” in this specification.
  • the first evaluation score calculation unit 254 calculates a score representing the accuracy of the extracted entity replacement expression candidate with a numerical value.
  • The operation of the first evaluation score calculation unit 254 will be described with reference to FIG. 6. Except for the bottom row and the rightmost column, the table is an excerpt of the entity information table 230.
  • FIG. 6 shows the column of the replacement target entity (replacement target expression) and all candidate expressions that the replacement target entity can take.
  • a circle in the cell indicates that the entity in the corresponding column can take a candidate expression of the corresponding row.
  • the bottom row of the table indicates whether the entity in that column has been extracted as a similar case entity.
  • the rightmost column of the table represents the calculation result (first evaluation result) of the first evaluation score for each candidate expression.
  • The first evaluation score calculation unit 254 assigns the first evaluation result to each candidate expression of an entity so as to reflect two viewpoints: (1) an abstract expression held by more similar case entities receives a higher score, and (2) an abstract expression that is not held by entities other than similar case entities receives a higher score. Specifically, a score representing how accurate replacement with the abstract expression a would be is calculated based on the following formula.
  • First evaluation result(a) = harmonic mean of P(a) and R(a) = 2 P(a) R(a) / (P(a) + R(a))
  • Evaluation P(a) and evaluation R(a) are given as follows.
  • Evaluation P(a) = (number of similar case entities having a as an abstract expression) / (number of all entities having a as an abstract expression)
  • Evaluation R(a) = (number of similar case entities having a as an abstract expression) / (number of all similar case entities)
  • the first evaluation score calculation unit 254 is also referred to as a “score calculation unit” in this specification.
  • the calculation method of the first evaluation result is not limited to this.
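The harmonic-mean score can be illustrated with a short sketch. It assumes that R(a) is normalized by the total number of similar case entities (the usual recall-style reading, which is an interpretation rather than something the text states unambiguously); the table contents, the set of similar case entities and the function name `first_evaluation` are all hypothetical.

```python
# Hypothetical excerpt of the entity information table:
# entity -> abstract expressions it can take.
TABLE = {
    "Myanmar": {"developing country", "country", "humid area"},
    "Sri Lanka": {"developing country", "country"},
    "Cambodia": {"developing country", "country", "humid area"},
    "Japan": {"country"},
    "Brazil": {"country", "humid area"},
}
# Entities assumed to have been extracted from similar case sentences.
SIMILAR = {"Myanmar", "Sri Lanka", "Cambodia"}


def first_evaluation(a, table, similar_entities):
    """Harmonic mean of P(a) and R(a) for the abstract expression `a`."""
    holders = {e for e, exprs in table.items() if a in exprs}
    similar_holders = holders & similar_entities
    if not similar_holders:
        return 0.0
    p = len(similar_holders) / len(holders)           # P(a)
    r = len(similar_holders) / len(similar_entities)  # R(a), assumed denominator
    return 2 * p * r / (p + r)


scores = {
    a: first_evaluation(a, TABLE, SIMILAR)
    for a in ("developing country", "country", "humid area")
}
```

In this toy data, "developing country" is held by exactly the similar case entities, so it scores highest, while the overly broad "country" and the only partly shared "humid area" score lower.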
  • the similar case sentence search unit 251 may simultaneously search for sentences that deny similar cases and use them for calculating the first evaluation result.
  • the similar case entity extraction unit 253 extracts “similar case negative entity” in which occurrence of the similar case is denied for the sentence that denies the similar case.
  • the abstract representations that similar case negative entities can take are inappropriate when replacing the original case text. Therefore, the following formula is used by adding a case classification to the calculation formula of the first evaluation result.
  • When an appropriate relationship is extracted, the first evaluation unit 250 newly adds to the entity information table 230 the information that the corresponding entity can take the corresponding candidate expression.
  • This function is referred to as “dictionary information update unit” in this specification. In this way, correspondence information between entities and candidate expressions can be increased.
  • the first evaluation unit 250 generates a provisional sentence by replacing other entities with candidate expressions for individual entities.
  • a process similar to that for calculating the first evaluation result P (a) for one entity is executed for the provisional sentence generated by the number of entities in the sentence.
  • the first evaluation result P (a) when there are a plurality of entities is not given separately to each candidate expression, but is given to a combination of candidate expressions. This combination function is referred to as a “combination generation unit” in this specification.
  • FIG. 7 shows a functional configuration of the second evaluation unit 260.
  • the second evaluation unit 260 gives each candidate expression a second evaluation result considering the context from the contents of the entire sentence.
  • the important word extraction unit 261 extracts important words in the input sentence.
  • the important words can be extracted by a technique such as TF-IDF (Term Frequency-Inverse Document Frequency).
  • the synonym expansion unit 262 acquires and outputs a synonym for the given word. Synonyms can be acquired by methods such as a synonym dictionary and Word2Vec.
  • Synonym expansion is performed on the important words extracted by the important word extraction unit 261 and on each candidate expression given from the first evaluation unit 250.
  • the second evaluation score calculation unit 263 calculates the degree of co-occurrence with an important word in the input sentence for each candidate expression and outputs it as a second evaluation score (second evaluation result).
  • the degree of co-occurrence refers to the relationship between words that are likely to co-occur in general sentences.
  • the co-occurrence degree can be obtained by the number of hits when a search is performed using a word / word combination as a query in a Web search engine.
  • By using the co-occurrence degree, it is possible to measure whether each candidate expression is an abstraction that fits the context of the input text.
  • a word expanded by the previous synonym expansion may be used.
  • For example, between “developing countries” and “humid areas”, a higher context appropriateness score (second evaluation result) is given to “developing countries”, which has a high degree of co-occurrence with the important word “economic assistance” in the input text. When there are a plurality of entities in the text, the second evaluation score (second evaluation result) is calculated for each combination of candidate expressions.
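The co-occurrence-based second evaluation can be sketched as below. The hit counts are invented stand-ins for the number of hits a Web search engine would return for each word pair, and summing them over the important words is only one possible way to aggregate; `HIT_COUNTS` and `second_evaluation` are hypothetical names.

```python
# Invented hit counts standing in for search-engine results
# for queries combining a candidate expression and an important word.
HIT_COUNTS = {
    frozenset(("developing countries", "economic assistance")): 120000,
    frozenset(("humid areas", "economic assistance")): 900,
    frozenset(("developing countries", "malaria")): 80000,
    frozenset(("humid areas", "malaria")): 15000,
}


def second_evaluation(candidate, important_words, hit_counts):
    """Sum the co-occurrence counts of the candidate with each important word."""
    return sum(
        hit_counts.get(frozenset((candidate, word)), 0)
        for word in important_words
    )


words = ["economic assistance", "malaria"]
score_developing = second_evaluation("developing countries", words, HIT_COUNTS)
score_humid = second_evaluation("humid areas", words, HIT_COUNTS)
```

Under these made-up counts, "developing countries" receives the higher context appropriateness score, matching the example in the text.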
  • The post-conversion sentence generation unit 270 generates a converted text (or converted sentence) by replacing each entity with the candidate expression that received a high evaluation from both the first evaluation unit 250 and the second evaluation unit 260. To produce a natural sentence, it also performs operations such as changing the candidate expression from singular to plural and capitalizing the first letter of the sentence.
  • a sentence may be generated using any candidate expression.
  • the selection may be made using criteria such as a small number of words constituting the candidate expression and a score of the candidate expression stored in the entity information table 230.
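A rough sketch of the final replacement step follows. How the two evaluation results are combined is not specified in the text, so adding them is purely an assumption here, as are the tie-break on the number of words, the score values and the function name `generate_converted_sentence`. Pluralization is not handled; the candidates are given already in plural form.

```python
def generate_converted_sentence(sentence, target, scores):
    """Replace `target` with the best-rated candidate expression.

    `scores` maps candidate -> (first evaluation, second evaluation).
    Candidates are ranked by the sum of both scores (an assumption);
    ties go to the candidate with fewer words.
    """
    best = max(
        scores,
        key=lambda c: (scores[c][0] + scores[c][1], -len(c.split())),
    )
    converted = sentence.replace(target, best)
    # Capitalize the first letter in case the replacement starts the sentence.
    return converted[:1].upper() + converted[1:]


SCORES = {  # made-up, normalized scores
    "Myanmar": (0.5, 0.1),
    "developing countries": (1.0, 0.9),
    "countries": (0.75, 0.4),
}

mid_sentence = generate_converted_sentence(
    "In Myanmar, malaria is endemic every year.", "Myanmar", SCORES
)
sentence_start = generate_converted_sentence(
    "Myanmar suffered from malaria.", "Myanmar", SCORES
)
```

The second call shows the capitalization adjustment when the replaced expression opens the sentence.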
  • Step S800 The user uses the input device 110 to input a sentence to be replaced and a theme of the sentence.
  • the input sentence and theme are analyzed through the arithmetic unit 130 and given to the entity extraction unit 220.
  • Step S801 The entity extraction unit 220 extracts a specific expression from each of the input sentence and the theme information, and specifies a specific expression (entity) to be replaced and a keyword representing the theme information.
  • Step S802 The candidate generating unit 240 refers to the entity information table 230 for each entity specified in step S801, and acquires a plurality of replacement candidate expressions. For a specific expression in a sentence, a class can be acquired as a result of the specific expression recognition.
  • the candidate generation unit 240 acquires information from the entity information table 230 using the character string and class of the unique expression, and acquires a plurality of candidate expressions to be replaced.
  • Step S803 The first evaluation unit 250 calculates a first evaluation result for the candidate expression generated by the candidate generation unit 240. That is, the first evaluation unit 250 assigns an accuracy score to each candidate expression.
  • Step S804 The second evaluation unit 260 calculates a second evaluation result for the candidate expression generated by the candidate generation unit 240. That is, the second evaluation unit 260 assigns a context appropriateness score.
  • Step S805 The post-conversion sentence generation unit 270 uses the candidate expression with the highest evaluation result to replace the entity, and generates a post-conversion sentence.
  • Step S901 The similar case sentence search unit 251 creates a character string obtained by removing the entity from the target sentence to be replaced as a query.
  • Step S902 The similar case sentence search unit 251 gives the query created in step S901 to the associative search engine, and acquires a plurality of similar case sentences representing cases similar to the case represented by the input sentence.
  • Step S903 The similar case entity extraction unit 253 performs language analysis on each similar case sentence, and extracts specific expressions as in step S801.
  • Step S904 The similar case entity extraction unit 253 associates the specific expressions with one another, and selects, among the specific expressions in each similar case sentence, the one corresponding to the entity.
  • Step S905 The similar case entity extraction unit 253 acquires candidate expressions from the entity information table 230 for the corresponding specific expressions, as in step S802.
  • Step S906 The first evaluation score calculation unit 254 counts, for each candidate expression generated by the candidate generation unit 240, the number of corresponding specific expressions in the similar case sentences that have the same candidate expression, and outputs the count as the accuracy score of that candidate expression. Whether the candidate expressions are the same can be determined by character string matching.
  • Step S907 The first evaluation score calculation unit 254 ranks the candidate expressions using the calculated accuracy score. By leaving only candidates with a certain rank or higher or a score or higher, it is possible to select a highly accurate candidate. When the candidate expression is narrowed down to one by the evaluation based on the accuracy score, the evaluation by the second evaluation unit 260 can be omitted.
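The query creation, counting and ranking steps of this procedure can be sketched as follows. The search engine call and the entity alignment are left out; the candidate sets of the corresponding expressions are supplied directly as hypothetical data, and the function names are illustrative only.

```python
def build_query(sentence, entity):
    """Remove the entity from the sentence to form the search query."""
    return " ".join(sentence.replace(entity, " ").split())


def accuracy_scores(candidates, corresponding_candidate_sets):
    """Count how many corresponding expressions share each candidate.

    Matching is done by exact character string comparison.
    """
    return {
        c: sum(c in s for s in corresponding_candidate_sets)
        for c in candidates
    }


def rank_candidates(scores, min_score):
    """Sort by score (highest first) and keep those at or above min_score."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [c for c, s in ordered if s >= min_score]


query = build_query("Malaria is endemic every year in Myanmar", "Myanmar")
scores = accuracy_scores(
    ["developing country", "country", "humid area"],
    [
        {"developing country", "country"},                # e.g. from one similar sentence
        {"developing country", "country", "humid area"},  # e.g. from another
    ],
)
kept = rank_candidates(scores, min_score=2)
```

If `kept` contained a single candidate, the second evaluation could be skipped, as the last step notes.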
  • Step 1002 the important word extraction unit 261 extracts words other than the unique expression from the input sentence. However, frequent words such as “of” and “a” are excluded.
  • Step 1003 The synonym expansion unit 262 expands words included in the candidate expression and the word set of the input sentence into synonyms using WordNet.
  • Step 1004 The second evaluation score calculation unit 263 counts the overlap between the candidate expression after synonym expansion and the word set extracted in the previous stage, and outputs it as a context appropriateness score.
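The three steps above can be sketched with a toy example. The text names WordNet as the synonym source; to keep the sketch self-contained, a tiny hand-written synonym dictionary is used instead, and its contents, the stop-word list and all function names are hypothetical.

```python
import re

STOP_WORDS = {"of", "a", "the", "is", "in", "to", "and"}
# Toy stand-in for a WordNet lookup (entries invented for illustration).
SYNONYMS = {"aid": {"assistance", "help"}, "poor": {"developing"}}


def important_words(sentence, specific_expressions):
    """Words of the sentence other than specific expressions and stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    excluded = STOP_WORDS | {
        w for e in specific_expressions for w in e.lower().split()
    }
    return {t for t in tokens if t not in excluded}


def expand(words):
    """Add the synonyms of each word to the word set."""
    expanded = set(words)
    for w in words:
        expanded |= SYNONYMS.get(w, set())
    return expanded


def context_score(candidate, sentence, specific_expressions):
    """Overlap between the expanded candidate words and context words."""
    candidate_words = expand(set(candidate.lower().split()))
    context_words = expand(important_words(sentence, specific_expressions))
    return len(candidate_words & context_words)


TEXT = "Economic aid helps poor regions. Malaria is endemic in Myanmar."
score_developing = context_score("developing countries", TEXT, ["Myanmar"])
score_humid = context_score("humid areas", TEXT, ["Myanmar"])
```

Here "developing countries" overlaps with the expanded context through the invented synonym entry, whereas "humid areas" does not.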
  • FIG. 11 shows an overall image of the sentence generation system used in the present embodiment.
  • the system includes a text generation device 1100 and a data management device 1101.
  • When a topic is input, the sentence generation device 1100 outputs a descriptive text that states an opinion on the topic.
  • the data management device 1101 stores data that has been processed in advance and is accessible from the text generation device 1100.
  • the sentence generation device 1100 sequentially executes nine processing functions.
  • the input unit 1102 receives a topic from the user.
  • the topic analysis unit 1103 analyzes the topic and determines the polarity of the topic and the keyword used for the search.
  • the search unit 1104 searches for an article using a keyword and an issue word indicating an issue in the debate.
  • the issue determination unit 1105 classifies the output articles and determines an issue to be used when generating an opinion.
  • the sentence extraction unit 1106 extracts a sentence describing the issue from the output article.
  • the sentence rearrangement unit 1107 generates a sentence by rearranging the extracted sentences.
  • the evaluation unit 1108 evaluates the generated sentence.
  • the replacement unit 1109 inserts appropriate conjunctions, deletes unnecessary expressions, and replaces some unique expressions with abstract expressions according to theme information.
  • the output unit 1110 outputs the sentence with the highest evaluation as a descriptive sentence describing an opinion.
  • the replacement unit 1109 in the present embodiment has a configuration in which input information is added to the configuration described in the first embodiment. In the following, processing functions added to the first embodiment will be described.
  • In this embodiment, the set of sentences rearranged into a text is input to the input unit 210 as the text, and a theme, an analysis result of the topic analysis unit 1103, or a keyword used as a query in the search unit 1104 is input as the theme information.
  • The similar case sentence search unit 251 of the first evaluation unit 250 used in the present embodiment can use the output of the search unit 1104 as a search target. Since each sentence has a document as an extraction source, the information in the entity information table can be updated by extracting relationships from that document.
  • the second evaluation unit 260 used in the present embodiment can include topic information in a target whose co-occurrence with candidate expressions is measured. Since each sentence has a document as an extraction source, the degree of co-occurrence with an important word in the document can be included in the evaluation.
  • The data management device 1101 includes an interface unit 1111, a structuring unit 1112, and four databases 1113 to 1116.
  • The interface unit 1111, together with the structuring unit 1112, provides access to the data managed in the databases.
  • the text data DB 1113 is text data such as news articles.
  • the text annotation data DB 1114 is data assigned to the text data DB 1113.
  • The search index DB 1115 is an index for making the text data DB 1113 and the text annotation data DB 1114 searchable.
  • the issue ontology DB 1116 is a database in which issues that are often discussed in debates and related words are linked.
  • the present invention is not limited to the above-described embodiments, and includes various modifications.
  • the above-described embodiment has been described in detail for easy understanding of the present invention, and it is not always necessary to include all the configurations described.
  • A part of the configuration of the above-described embodiment may be deleted, a known technique may be added to it, or a part of it may be replaced by a known technique.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit.
  • Each of the above-described configurations, functions, and the like may be realized by the processor interpreting and executing a program that realizes each function (that is, in software).
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or a storage medium such as an IC card, an SD card, or a DVD.
  • Control lines and information lines indicate what is considered necessary for the description, and do not represent all control lines and information lines necessary for the product. In practice, it can be considered that almost all components are connected to each other.

Abstract

A text generation device is provided with: (1) an input unit used to input text to be processed and theme information; (2) a replacement target expression extraction unit for extracting, as a replacement target expression, one or more from one or more unique expressions included in the text on the basis of the theme information, and specifying a keyword that expresses the theme information; (3) a candidate generation unit for generating a plurality of candidate expressions as replacement candidates for abstracting the replacement target expressions by using dictionary information accumulated in advance; (4) a first evaluation unit for outputting a first evaluation result obtained by evaluating the candidate expressions by using the dictionary information; and (5) a post-conversion text generation unit for generating post-conversion text by replacing the replacement target expression by the candidate expression which is highly valued as the first evaluation result.

Description

Text generation apparatus and method
The present invention relates to a text generation apparatus that abstracts a sentence or a text given by a user, and to a method executed by the apparatus.
In fields that generate new text by editing existing text, such as automatic summarization, and in the field called concept-to-text generation, which aims to generate text from concepts, there are methods that generate text whose original meaning has been changed by replacing keywords.
In Patent Document 1 below, a recommendation sentence is generated from an example sentence by replacing keywords. Specifically, an example sentence is first selected based on a keyword designated by the user, and the keywords in the example sentence are matched with the input keywords. The similarity between corresponding keywords is measured, and when the similarity is moderate, the target sentence is obtained by replacing the keyword in the example sentence with the keyword specified by the user.
Patent Document 1: JP 2001-256222 A
In argumentation, rather than stating only specific cases, it is preferable to state concisely an assertion with a certain degree of abstraction that expresses the point being argued; this makes the thrust of the text clear and the text easy to understand. For example, rather than the text “We should continue to promote economic assistance. In Myanmar, malaria is endemic every year and many people die.”, the more abstract text “We should continue to promote economic assistance. In developing countries, malaria is endemic every year and many people die.” is desirable.
It is therefore conceivable to abstract a sentence by replacing a specific expression (named entity) in it with a more abstract expression, thereby generating an assertion. A specific expression represents an entity, and an abstract expression represents a higher-level concept of that entity. For example, given the sentence “Malaria is endemic every year in Myanmar”, replacing “Myanmar” with “developing countries” generates the assertion “Malaria is endemic every year in developing countries”.
 しかし、エンティティには対応する上位概念が複数存在する。抽象度の段階も、方向性も様々である。文の抽象化のために、固有表現を上位概念表現に置き換える際には、置き換えた後の文の内容が正しさは保たれているかどうか、また、前後の文章の文脈に合っているかどうかを考慮して、上位概念表現を適切に選択する必要がある。 However, there are multiple corresponding superordinate concepts for entities. There are various levels of abstraction and directions. For sentence abstraction, when replacing a proper expression with a superordinate concept expression, it is necessary to check whether the contents of the replaced sentence are correct and whether it matches the context of the preceding and following sentences. It is necessary to appropriately select the superordinate concept expression in consideration.
 ところで、特許文献1に記載されている技術は、推薦文の生成に関するものであり、文を抽象化することはできない。また、当該技術では、置き換え先の用語として、利用者が入力した用語をそのまま用いている。また、当該技術では、複数の置換先候補がある場合に、置き換え先の用語を自動的に選択することもしていない。 By the way, the technique described in Patent Document 1 relates to generation of a recommended sentence, and the sentence cannot be abstracted. In this technique, the term input by the user is used as it is as the replacement term. Further, in the technique, when there are a plurality of replacement destination candidates, the replacement destination term is not automatically selected.
 本発明は、上記に鑑みてなされたものであり、与えられた文または文章に基づいて、内容の正しさを保ちつつ適切に抽象化した文または文章を自動的に生成するための仕組みを提供する。 The present invention has been made in view of the above, and provides a mechanism for automatically generating a properly abstracted sentence or sentence based on a given sentence or sentence while maintaining the correctness of the contents. To do.
A text generation system that is one of the inventions for solving the above problem has the following units.
(1) An input unit used to input a sentence to be processed and theme information
(2) A replacement target expression extraction unit that, based on the theme information, extracts one or more of the named entities contained in the sentence as replacement target expressions, and identifies a keyword representing the theme information
(3) A candidate generation unit that uses dictionary information stored in advance to generate a plurality of candidate expressions, which are replacement candidates that abstract the replacement target expression
(4) A first evaluation unit that outputs a first evaluation result obtained by evaluating the candidate expressions using the dictionary information
(5) A post-conversion sentence generation unit that generates a post-conversion sentence by replacing the replacement target expression with a candidate expression rated highly in the first evaluation result
According to the present invention, a replacement target expression contained in a sentence can be replaced with a candidate expression that is appropriate in relation to the input theme information, so that a more abstract post-conversion sentence whose point is easy to understand can be generated automatically. Problems, configurations, and effects other than those described above will become apparent from the following description of embodiments.
Diagram showing the hardware configuration of the text generation device of the first embodiment.
Diagram showing the functional configuration of the text generation device of the first embodiment.
Diagram showing an example of the entity information table.
Diagram showing an example of the data structure of the entity information table.
Diagram showing the functional configuration of the first evaluation unit.
Diagram explaining the scores calculated by the first evaluation score calculation unit.
Diagram showing the functional configuration of the second evaluation unit.
Flowchart explaining the processing procedure executed by the text generation device of the first embodiment.
Flowchart explaining the processing procedure executed by the first evaluation unit.
Flowchart explaining the processing procedure executed by the second evaluation unit.
Diagram showing the hardware configuration of the text generation device of the second embodiment.
Embodiments of the present invention are described below with reference to the drawings. The embodiments of the present invention are not limited to the forms described later, and various modifications are possible within the scope of the technical idea. The following embodiments mainly describe the processing of English and Japanese documents, but the same procedure is applicable to other languages such as Chinese if the language-specific processing is replaced.
(1) First Embodiment
This embodiment describes a text generation device that takes as input a text consisting of one or more sentences together with text representing the theme information of that text, and that outputs a more generalized text by applying appropriate replacements. For example, when the keyword "malaria" and the text "We should continue to promote economic assistance. In Myanmar, malaria breaks out every year and many people die." are given as input, the device replaces the entity term "Myanmar" in the text with the general expression "developing countries" and outputs the text "We should continue to promote economic assistance. In developing countries, malaria breaks out every year and many people die."
(1-1) Hardware Configuration
The text generation device is implemented on the hardware of an ordinary computer. FIG. 1 shows an example of a specific hardware configuration. The text generation device consists of an input device 110, an output device 120, an arithmetic device 130, a memory 140 that stores various data and programs, a storage device 150 that stores various data and programs, a network device 160 that controls communication with external devices, and a bus 170 that connects them. When only data in the storage device is used, the network device 160 is unnecessary. When the device is operated remotely over a network, the input device 110 and the output device 120 can be omitted.
(1-2) Functional Block Configuration
FIG. 2 shows the functions of the program executed by the arithmetic device 130 of the text generation device. The input unit 210 receives the text to be converted (which may be a single sentence) and the theme information specified by the user. The input device 110 (a keyboard, mouse, or other input device, a GUI screen, and so on) is used to enter the text and theme information into the input unit 210. The entity extraction unit 220 performs language analysis on the input text and theme information and identifies the named entities to be replaced as entities. The "entity extraction unit" is also referred to as the "replacement target expression extraction unit".
The entity information table 230 stores replacement candidate information for entities. The entity information table 230 is stored as a file in the memory 140 or the storage device 150. The candidate generation unit 240 refers to the entity information table 230 and generates replacement candidates for each extracted entity. The first evaluation unit 250 calculates a first evaluation score for each generated candidate using the entity information table 230. The first evaluation score is calculated for each sentence. The second evaluation unit 260 calculates a second evaluation score for each candidate from the viewpoint of the text as a whole (multiple sentences). The evaluation by the second evaluation unit 260 may be performed only on candidates rated highly by the first evaluation unit 250. The post-conversion text generation unit 270 decides on a replacement candidate based on the first and second evaluation scores and generates the final text using the chosen candidate. When attention is paid to the case in which the conversion target is a single sentence, the post-conversion text generation unit 270 is also referred to as the "post-conversion sentence generation unit". The output unit 280 presents (displays) the generated text (the abstracted text) to the user through the output device 120.
(1-3) Description of Each Functional Unit
The specific processing executed by each unit is described individually below.
(1-3-1) Entity Extraction Unit
The entity extraction unit 220 first identifies, from the input text and theme, the keyword that is stated as the theme. If the theme is entered as a keyword, the input is used as the keyword as-is. If the theme is entered as text, the keyword is identified from the expressions in that text. Specifically, language analysis is performed on the input theme to extract named entities, and the named entity that appears most often is taken as the keyword. Alternatively, an expression that appears in both the text and the theme is extracted and taken as the keyword.
Next, the entity extraction unit 220 performs language analysis on the input text and extracts the one or more named entities it contains. Among the extracted named entities, those that are not the keyword are taken as the named entities to be replaced (also called "entities" or "replacement target expressions"). A named entity that represents a date or a numerical value is an entity. A single sentence may contain multiple entities.
When the keyword "malaria" and the text "We should continue to promote economic assistance. In Myanmar, malaria breaks out every year and many people die." are input to the text generation device, the entity extraction unit 220 extracts "Myanmar" and "malaria" from the text as named entities, and of these it extracts "Myanmar", which is not the keyword, as the entity (replacement target expression).
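The selection described above can be sketched as follows. This is only an illustration: it assumes the named entities have already been extracted by some language-analysis step, and the function names `select_keyword` and `select_targets` are hypothetical and do not appear in the specification.

```python
from collections import Counter

def select_keyword(theme_entities):
    """Return the named entity that appears most often in the theme."""
    return Counter(theme_entities).most_common(1)[0][0]

def select_targets(text_entities, keyword):
    """Return the named entities in the text other than the keyword,
    in order of appearance and without duplicates."""
    targets = []
    for entity in text_entities:
        if entity != keyword and entity not in targets:
            targets.append(entity)
    return targets

# Named entities assumed to have been extracted from the example input.
keyword = select_keyword(["malaria"])
targets = select_targets(["Myanmar", "malaria"], keyword)
print(keyword, targets)
```

With the example input, "malaria" is the keyword and "Myanmar" is the only replacement target.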
[Entity information table]
FIG. 3 shows a conceptual diagram of the entity information table 230. The entity information table 230 is a dictionary (dictionary information) that stores one or more pairs of an entity and its abstract expression. A circle in a cell indicates that the entity in that column can take the candidate expression in that row. By referring to the entity information table 230, the abstract expressions that a given entity can take can be looked up. Conversely, the entities that can take a given abstract expression can also be looked up.
FIG. 4 shows an example of the data structure of the entity information table 230. As shown in FIG. 4, the entity information table 230 is a dictionary keyed by the character string of a named entity, and each value is the entity that the named entity denotes. An entity consists of a class (class) and candidates (candidates). The abstract expressions for each entity are held as multiple entries in the candidates field. Each entity can hold the class to which it belongs in a field. A class is a semantic category such as "person name", "location", or "organization name". Each entity can also have a synonym field, to prevent data for the same entity from being scattered across multiple entries in the entity information table 230. Each abstract expression can also be given a score, for example based on how often it co-occurs with the corresponding named entity. For "Myanmar", candidate expressions such as "countries with a government", "developing countries", "humid areas", "South Asia", and "countries" are obtained.
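As a rough sketch of such a table, the following uses a plain Python dictionary. The variable and function names are hypothetical, and the "Cambodia" entry is made-up illustrative data rather than content of FIG. 4; the optional synonym and score fields are omitted for brevity.

```python
# Hypothetical in-memory form of the entity information table.
ENTITY_TABLE = {
    "Myanmar": {
        "class": "location",
        "candidates": ["countries with a government", "developing countries",
                       "humid areas", "South Asia", "countries"],
    },
    "Cambodia": {  # illustrative entry, not taken from the specification
        "class": "location",
        "candidates": ["developing countries", "humid areas", "countries"],
    },
}

def abstractions_of(entity):
    """Abstract expressions that the given entity can take."""
    return ENTITY_TABLE[entity]["candidates"]

def entities_with(abstraction):
    """Entities that can take the given abstract expression (reverse lookup)."""
    return [name for name, record in ENTITY_TABLE.items()
            if abstraction in record["candidates"]]
```

The two helper functions correspond to the two lookup directions mentioned for the table: from an entity to its abstract expressions, and from an abstract expression to the entities that can take it.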
[Handling word-sense ambiguity in the entity information table]
The same character string may denote several different entities. To distinguish them, each entity has a class field, as described above. For example, "Nokia" may denote the Finnish city of Nokia or the telecommunications equipment manufacturer Nokia, which sells mobile phones and other products. When it denotes the Finnish city, the class is "place name" and the candidates include "city" and "Europe". When it denotes the telecommunications equipment manufacturer, the class is "organization name" and the candidates include "telecommunications equipment manufacturer" and "company". Storing each entity as a class together with its candidates in this way allows the entities to be distinguished.
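One possible way to realize this disambiguation is to store several records under one character string and select by class. The sketch below assumes that layout; the names `AMBIGUOUS_TABLE` and `lookup` are hypothetical.

```python
# Hypothetical layout: one surface string maps to several entity records.
AMBIGUOUS_TABLE = {
    "Nokia": [
        {"class": "place name", "candidates": ["city", "Europe"]},
        {"class": "organization name",
         "candidates": ["telecommunications equipment manufacturer", "company"]},
    ],
}

def lookup(surface, ne_class):
    """Return the candidates of the record whose class matches, or an empty list."""
    for record in AMBIGUOUS_TABLE.get(surface, []):
        if record["class"] == ne_class:
            return record["candidates"]
    return []
```

For instance, `lookup("Nokia", "place name")` selects the city record, while `lookup("Nokia", "organization name")` selects the manufacturer record.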
[Automatic generation of the entity information table]
The entity information table 230 can be created by manually assigning an entity and its abstract expressions to each named entity. However, manually assigning abstract expressions to every one of a large number of named entities is difficult. Therefore, relation extraction techniques are used to automatically extract entities and relation information about them from plain text, and abstract expressions are assigned from the relation information obtained.
For example, given the sentence "Nokia is a Finnish telecommunications corporation.", syntactic information such as the fact that the two phrases are joined by "is" can be used to extract the relation information that "Nokia" and "a Finnish telecommunications corporation" are in an "is-A" relation. From this relation information, "a Finnish telecommunications corporation" can be stored in the entity information table 230 as an abstract expression of "Nokia". In addition, removing the modifiers from "a Finnish telecommunications corporation" yields "a corporation" as an abstract expression with a higher level of abstraction. Alternatively, from the possessive relation in "Turkish casinos", "a country which has casinos" can be obtained as an abstract expression of "Turkey". These processes are also realized by a program executed on the arithmetic device 130.
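The "is-A" example can be approximated with a very simple pattern match. The sketch below is a deliberately naive stand-in for a real relation extractor: it assumes sentences of the form "X is a Y." and treats the last word of the phrase as the head noun. The names `extract_is_a` and `strip_modifiers` are hypothetical.

```python
import re

# Naive pattern for sentences of the form "<Entity> is a/an <noun phrase>."
IS_A = re.compile(r"^(?P<entity>[A-Z]\w*) is (?P<abstraction>an? [\w ]+?)\.?$")

def extract_is_a(sentence):
    """Return (entity, abstract expression) if the sentence matches, else None."""
    match = IS_A.match(sentence.strip())
    if match is None:
        return None
    return match.group("entity"), match.group("abstraction")

def strip_modifiers(abstraction):
    """Keep only the article and the last word, assumed to be the head noun."""
    words = abstraction.split()
    return f"{words[0]} {words[-1]}"

pair = extract_is_a("Nokia is a Finnish telecommunications corporation.")
print(pair, strip_modifiers(pair[1]))
```

A practical implementation would rely on syntactic parsing rather than a regular expression; this sketch only mirrors the two steps described in the text.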
(1-3-2) Candidate Generation Unit
The candidate generation unit 240 refers to the entity information table 230 and generates a plurality of candidate expressions, which are candidates for replacing each entity. The named entity to be replaced is itself also included among the candidate expressions, which keeps open the possibility of performing no replacement.
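A minimal sketch of this step, assuming the table is available as a mapping from an entity string to a list of abstract expressions (the function name `generate_candidates` is hypothetical):

```python
def generate_candidates(entity, candidates_by_entity):
    """Return the abstract expressions for the entity, plus the entity itself
    so that 'no replacement' remains one of the options."""
    candidates = list(candidates_by_entity.get(entity, []))
    if entity not in candidates:
        candidates.append(entity)
    return candidates

table = {"Myanmar": ["developing countries", "humid areas", "countries"]}
print(generate_candidates("Myanmar", table))
```

An entity that is missing from the table simply yields itself as the only candidate.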
(1-3-3) First Evaluation Unit
FIG. 5 shows the functional configuration of the first evaluation unit 250. The first evaluation unit 250 gives each candidate expression of an entity a first evaluation result that takes the content of the sentence into account. First, for the sentence containing the named entity to be replaced (the "entity" or "replacement target expression"), the similar case sentence search unit 251 retrieves from the sentence text data 252 a plurality of similar case sentences, each describing a case similar to the one described in that sentence. The sentence text data 252 may be text data accumulated in advance or text data on the Web. Similar case sentences can be obtained by forming a query from the words of the sentence with the named entity to be replaced (the replacement target expression) removed, and searching for similar sentences with an associative search engine.
For each similar case sentence obtained, the similar case entity extraction unit 253 extracts the entity in the similar sentence that corresponds to the entity in the input sentence. For example, suppose that "In Cambodia, malaria breaks out every year and many people die." is obtained as a sentence similar to "In Myanmar, malaria breaks out every year and many people die." The similar case entity extraction unit 253 then extracts "Cambodia" as the entity corresponding to "Myanmar". In this case, "Myanmar" and "Cambodia" are similar case entities. The similar case entity extraction unit 253 is also referred to in this specification as the "corresponding expression extraction unit".
The first evaluation score calculation unit 254 calculates, for each replacement candidate of the extracted entity, a score that expresses its accuracy numerically. The operation of the first evaluation score calculation unit 254 is described with reference to FIG. 6. Apart from the bottom row and the rightmost column, the table is an excerpt of the entity information table 230. FIG. 6 contains the column for the entity to be replaced (the replacement target expression) together with the columns of every entity that can take any of the candidate expressions that the entity to be replaced can take. A circle in a cell indicates that the entity in that column can take the candidate expression in that row.
The bottom row of the table indicates whether the entity in each column was extracted as a similar case entity. The rightmost column shows the calculated first evaluation score (first evaluation result) for each candidate expression. The first evaluation score calculation unit 254 assigns a first evaluation result to each candidate expression of the entity so as to reflect two criteria: (1) a candidate that is an abstract expression of more of the similar case entities receives a higher score, and (2) a candidate that is not an abstract expression of entities that are not similar case entities receives a higher score. Specifically, a score expressing how accurate it is to replace the entity with the abstract expression a is calculated using the following formulas.
First evaluation result(a) = harmonic mean of P(a) and R(a)
Here, the evaluations P(a) and R(a) are given as follows.
・Evaluation P(a)
= (number of similar case entities that have a as an abstract expression) / (number of all entities that have a as an abstract expression)
・Evaluation R(a)
= (number of similar case entities that have a as an abstract expression) / (total number of similar case entities)
In this example, "developing countries" receives the highest score, 1.0, and "humid areas" receives the next highest, 0.8. The first evaluation score calculation unit 254 is also referred to in this specification as the "score calculation unit".
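The score calculation can be sketched as below. The table contents are made-up illustrative data, not the contents of FIG. 6, and the function name `first_score` is hypothetical. The sketch also assumes that R(a) is the fraction of all similar case entities that have a as an abstract expression, so that the harmonic mean behaves like an F-measure over P(a) and R(a).

```python
def first_score(abstraction, table, similar):
    """Harmonic mean of P(a) and R(a) for one abstract expression."""
    # All entities in the table that can take this abstract expression.
    holders = {e for e, abstractions in table.items() if abstraction in abstractions}
    hits = len(holders & similar)
    if hits == 0:
        return 0.0
    p = hits / len(holders)   # share of holders that are similar case entities
    r = hits / len(similar)   # share of similar case entities covered (assumed denominator)
    return 2 * p * r / (p + r)

# Illustrative excerpt of an entity information table (made-up data).
TABLE = {
    "Myanmar": {"countries", "developing countries", "humid areas"},
    "Cambodia": {"countries", "developing countries", "humid areas"},
    "Singapore": {"countries", "humid areas"},
    "Finland": {"countries"},
}
SIMILAR = {"Myanmar", "Cambodia"}

for a in ("developing countries", "humid areas", "countries"):
    print(a, round(first_score(a, TABLE, SIMILAR), 2))
```

In this made-up table, "developing countries" is held only by the similar case entities and so scores highest, while "countries" is held by every entity and scores lowest.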
Incidentally, the method of calculating the first evaluation result is not limited to this. For example, the similar case sentence search unit 251 may also search for sentences that negate similar cases, and these may be used in calculating the first evaluation result. For a sentence that negates a similar case, the similar case entity extraction unit 253 extracts the "similar case negated entity", for which the occurrence of the similar case is denied. An abstract expression that a similar case negated entity can take is inappropriate for replacing the text of the original case. The following formula, which adds a case distinction to the preceding formula for the first evaluation result, is therefore used.
・First evaluation result P(a)
= (number of similar case entities that have a as an abstract expression) / (number of all entities that have a as an abstract expression)
However, when (number of similar case negated entities that have a as an abstract expression) > 1, the first evaluation result P(a) = 0.
This makes it possible to abstract a sentence while preserving its accuracy even when there are entities for which the occurrence of the similar case is denied.
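The case distinction can be sketched as follows. The data are made up for illustration, the function name `precision_with_negation` is hypothetical, and the threshold "> 1" is taken exactly as written in the text.

```python
def precision_with_negation(abstraction, table, similar, negated):
    """P(a) with the case distinction for similar case negated entities."""
    holders = {e for e, abstractions in table.items() if abstraction in abstractions}
    # Case distinction, with the threshold "> 1" taken as written in the text.
    if len(holders & negated) > 1:
        return 0.0
    if not holders:
        return 0.0
    return len(holders & similar) / len(holders)

# Illustrative data (made up): two entities for which the case is denied.
TABLE = {
    "Myanmar": {"developing countries", "countries"},
    "Cambodia": {"developing countries", "countries"},
    "Finland": {"countries"},
    "Norway": {"countries"},
}
SIMILAR = {"Myanmar", "Cambodia"}
NEGATED = {"Finland", "Norway"}

print(precision_with_negation("countries", TABLE, SIMILAR, NEGATED))
print(precision_with_negation("developing countries", TABLE, SIMILAR, NEGATED))
```

Here "countries" is rejected because it is also an abstract expression of the negated entities, whereas "developing countries" is unaffected.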
[Updating the entity information table by the first evaluation unit]
When the entity information table 230 holds little information, and in particular few correspondences between entities and candidate expressions, the first evaluation unit 250 ends up giving a high rating to most candidate expressions. Therefore, when the variation in the first evaluation result P(a) calculated for the abstract expressions is smaller than a predetermined threshold, the entity information table 230 is updated by referring to other text data. Specifically, sentences are searched for in which a candidate expression that some entity can take co-occurs with another entity. By executing the relation extraction technique described above, the first evaluation unit 250 extracts the syntactic relation concerning the entity from each sentence. This function is referred to in this specification as the "relation extraction unit". When an appropriate relation is extracted, the first evaluation unit 250 adds to the entity information table 230 the new information that the entity in question can take the candidate expression in question. This function is referred to in this specification as the "dictionary information update unit". In this way, the correspondence information between entities and candidate expressions can be increased.
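The trigger for updating the table can be sketched as a simple check on the spread of the scores. The text does not specify how the variation is measured; the sketch below assumes the population variance, and both the function name `needs_table_update` and the default threshold are hypothetical.

```python
from statistics import pvariance

def needs_table_update(scores, threshold=0.01):
    """True when the scores vary too little to discriminate between candidates.

    The use of population variance and the default threshold are assumptions.
    """
    return pvariance(scores) < threshold

print(needs_table_update([0.9, 0.9, 0.95]))  # nearly uniform scores
print(needs_table_update([1.0, 0.8, 0.2]))   # well-separated scores
```

When the check returns True, the relation extraction and dictionary update described above would be run before re-evaluating the candidates.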
[When a sentence contains multiple entities]
A single sentence input by the user may contain multiple entities. In this case, for each individual entity, the first evaluation unit 250 generates a provisional sentence by replacing the other entities with candidate expressions. For the provisional sentences, of which there are as many as there are entities in the sentence, the same processing is executed as when the first evaluation result P(a) is calculated for a single entity. When there are multiple entities, the first evaluation result P(a) is given not to each candidate expression separately but to a combination of candidate expressions. This combining function is referred to in this specification as the "combination generation unit".
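Enumerating the combinations of candidate expressions can be sketched with a Cartesian product. The sentence and candidate lists below are made-up illustrations, and the function names are hypothetical.

```python
from itertools import product

def candidate_combinations(candidates_by_entity):
    """Enumerate every assignment of one candidate expression per entity."""
    entities = list(candidates_by_entity)
    pools = [candidates_by_entity[e] for e in entities]
    return [dict(zip(entities, combo)) for combo in product(*pools)]

def apply_combination(sentence, combination):
    """Build a sentence by replacing each entity with its assigned candidate.

    Naive string replacement; a replacement containing another entity's
    string would be replaced again.
    """
    for entity, replacement in combination.items():
        sentence = sentence.replace(entity, replacement)
    return sentence

# Illustrative input (made up); each list includes the entity itself.
candidates = {
    "Nokia": ["Nokia", "a company"],
    "Myanmar": ["Myanmar", "a developing country"],
}
combos = candidate_combinations(candidates)
for combo in combos:
    print(apply_combination("Nokia opened an office in Myanmar.", combo))
```

Each resulting combination would then be scored as a unit rather than candidate by candidate.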
(1-3-4) Second Evaluation Unit
FIG. 7 shows the functional configuration of the second evaluation unit 260. The second evaluation unit 260 gives each candidate expression a second evaluation result that takes into account the context derived from the content of the text as a whole. First, the important word extraction unit 261 extracts the important words in the input text. Important words can be extracted with techniques such as TF-IDF (Term Frequency - Inverse Document Frequency). The synonym expansion unit 262 obtains and outputs synonyms of a given word. Synonyms can be obtained from a thesaurus or with methods such as Word2Vec. Here, synonym expansion is performed on the important words extracted by the important word extraction unit 261 and on each candidate expression supplied by the first evaluation unit 250.
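As a rough illustration of important word extraction, the sketch below ranks the tokens of a text by a basic TF-IDF weight against a small background collection. It assumes the text is already tokenized, uses one common TF-IDF variant among several, and the function name `tf_idf_keywords` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tf_idf_keywords(doc_tokens, background_docs, top_k=3):
    """Rank tokens of one document by TF-IDF against background documents.

    background_docs is a list of token sets; the document itself counts as
    one more document when computing document frequency.
    """
    tf = Counter(doc_tokens)
    n_docs = len(background_docs) + 1

    def idf(term):
        df = 1 + sum(term in doc for doc in background_docs)
        return math.log(n_docs / df)

    scores = {t: (count / len(doc_tokens)) * idf(t) for t, count in tf.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

doc = ["we", "should", "promote", "economic", "assistance",
       "economic", "assistance", "the", "the"]
background = [{"we", "should", "the"}, {"the", "promote"}, {"the", "we"}]
print(tf_idf_keywords(doc, background, top_k=2))
```

Words that are frequent in the input but rare in the background collection, such as "economic" and "assistance" here, rise to the top.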
The second evaluation score calculation unit 263 calculates, for each candidate expression, its degree of co-occurrence with the important words in the input text and outputs it as the second evaluation score (second evaluation result). The degree of co-occurrence expresses how readily two words appear together in ordinary text, and can be obtained, for example, from the number of hits returned when a Web search engine is queried with the combination of the two words. Using the degree of co-occurrence makes it possible to measure whether each candidate expression is an abstraction that fits the context of the input text. When the degree of co-occurrence is calculated, the words produced by the preceding synonym expansion may also be used.
Of "developing countries" and "humid areas", the higher context appropriateness score (second evaluation result) is given to "developing countries", which has a higher degree of co-occurrence with the important word "economic assistance" in the input text. When a sentence contains multiple entities, the second evaluation score (second evaluation result) is calculated for each combination of candidate expressions.
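The co-occurrence measure can be sketched as below. Instead of querying a Web search engine, this illustration counts co-occurrences in a small local list of sentences, which is a substitution made only to keep the example self-contained. The corpus is made up and the function names are hypothetical.

```python
def cooccurrence(candidate, important_word, corpus):
    """Number of corpus sentences that contain both expressions."""
    return sum(1 for sentence in corpus
               if candidate in sentence and important_word in sentence)

def second_score(candidate, important_words, corpus):
    """Sum of co-occurrence counts with each important word of the input text."""
    return sum(cooccurrence(candidate, word, corpus) for word in important_words)

# Made-up local corpus standing in for Web search hit counts.
CORPUS = [
    "economic assistance to developing countries has increased",
    "developing countries request economic assistance",
    "humid areas have many mosquitoes",
    "economic assistance programs target developing countries",
]

for candidate in ("developing countries", "humid areas"):
    print(candidate, second_score(candidate, ["economic assistance"], CORPUS))
```

With this corpus, "developing countries" co-occurs with "economic assistance" more often than "humid areas" does, matching the ordering described in the text.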
(1-3-5) Post-Conversion Text Generation Unit
The post-conversion text generation unit 270 generates the post-conversion text (or post-conversion sentence) by replacing the entities in the text with the candidate expressions that were rated highly by both the first evaluation unit 250 and the second evaluation unit 260. To make the text natural, operations such as changing a candidate expression from singular to plural and capitalizing the first letter of a sentence are also performed.
When the evaluation results of the first evaluation unit 250 and the second evaluation unit 260 are the same for multiple candidate expressions, those candidates are regarded as equally suitable replacements, and the text may be generated using any one of the top-rated candidate expressions. Alternatively, one may be selected using criteria such as the smallest number of words in the candidate expression or the score of the candidate expression stored in the entity information table 230.
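A minimal sketch of the selection and surface adjustments described above is shown below. The source does not say how the two scores are combined; comparing them as a tuple and breaking ties by the smaller word count is an assumption. The pluralization rule is deliberately crude, and all function names are hypothetical.

```python
def pick_candidate(scored: dict[str, tuple[int, int]]) -> str:
    """scored maps candidate -> (first score, second score).

    Ties on both scores are broken in favor of the candidate with fewer words.
    """
    return max(scored, key=lambda c: (scored[c][0], scored[c][1], -len(c.split())))


def naive_plural(phrase: str) -> str:
    """Pluralize the last word with simple English rules (illustrative only)."""
    head, _, last = phrase.rpartition(" ")
    if last.endswith("y") and last[-2:-1] not in "aeiou":
        last = last[:-1] + "ies"
    elif last.endswith(("s", "x", "z", "ch", "sh")):
        last += "es"
    else:
        last += "s"
    return f"{head} {last}".strip()


def replace_entity(sentence: str, entity: str, candidate: str) -> str:
    """Replace the entity and capitalize the first letter of the sentence."""
    out = sentence.replace(entity, candidate)
    return out[:1].upper() + out[1:]


scored = {
    "developing country": (3, 2),
    "African nation in the east": (3, 2),
    "humid area": (1, 0),
}
chosen = pick_candidate(scored)
result = replace_entity("Kenya received economic assistance.", "Kenya", naive_plural(chosen))
```

Here the first two candidates tie on both scores, so the shorter one is chosen, pluralized, and capitalized at the start of the sentence.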
(1-4) Process flow
(1-4-1) Process overview
The flow of processing executed by the text generation device when generating an assertion sentence is described with reference to FIG. 8.
Step S800
The user uses the input device 110 to input the text to be rewritten and the theme of that text. The input text and theme are analyzed by the arithmetic unit 130 and passed to the entity extraction unit 220.
Step S801
The entity extraction unit 220 extracts named entities from each of the input text and the theme information, and identifies the named entities to be replaced (the entities) and the keywords representing the theme information.
Step S802
For each entity identified in step S801, the candidate generation unit 240 refers to the entity information table 230 and acquires multiple candidate expressions as replacement targets. A class can be obtained for each named entity in the text as a result of named entity recognition. The candidate generation unit 240 looks up the entity information table 230 using the character string and the class of the named entity to acquire these candidate expressions.
Step S803
The first evaluation unit 250 calculates the first evaluation result for the candidate expressions generated by the candidate generation unit 240; that is, it assigns an accuracy score to each candidate expression.
Step S804
The second evaluation unit 260 calculates the second evaluation result for the candidate expressions generated by the candidate generation unit 240; that is, it assigns a context appropriateness score to each candidate expression.
Step S805
The post-conversion text generation unit 270 replaces the entities with the candidate expressions that received the highest evaluation results and generates the post-conversion text.
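The flow of steps S801 to S805 can be sketched end to end as follows. Everything concrete here is an assumption for illustration: the entity information table is a small dictionary, entity extraction is a plain substring match rather than named entity recognition, the two scores are supplied as precomputed dictionaries, and they are combined by simple addition.

```python
# Hypothetical entity information table: entity -> candidate expressions.
ENTITY_TABLE = {"Kenya": ["a developing country", "a humid area"]}


def extract_entities(sentence: str, table: dict[str, list[str]]) -> list[str]:
    """S801 (simplified): treat every table key found in the sentence as an entity."""
    return [e for e in table if e in sentence]


def generate(sentence: str, table: dict[str, list[str]],
             accuracy: dict[str, int], context: dict[str, int]) -> str:
    out = sentence
    for entity in extract_entities(sentence, table):      # S801
        candidates = table[entity]                         # S802
        # S803/S804: scores are looked up; S805: take the best combined score.
        best = max(candidates, key=lambda c: accuracy[c] + context[c])
        out = out.replace(entity, best)
    return out[:1].upper() + out[1:]


accuracy = {"a developing country": 3, "a humid area": 1}
context = {"a developing country": 2, "a humid area": 0}
converted = generate("Kenya received economic assistance.", ENTITY_TABLE, accuracy, context)
```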
(1-4-2) Processing of the first evaluation unit
The details of the processing executed by the first evaluation unit 250 (step S803) are described with reference to FIG. 9.
Step S901
The similar case sentence search unit 251 creates, as a query, the character string obtained by removing the entities from the target sentence to be rewritten.
Step S902
The similar case sentence search unit 251 gives the query created in step S901 to an associative search engine and acquires multiple similar case sentences, each describing a case similar to the one described by the input sentence.
Step S903
The similar case entity extraction unit 253 performs linguistic analysis on each similar case sentence and extracts named entities in the same way as in step S801.
Step S904
The similar case entity extraction unit 253 aligns the named entities and, from the named entities in each similar case sentence, selects the one that corresponds to the entity.
Step S905
The similar case entity extraction unit 253 acquires candidate expressions for each corresponding named entity from the entity information table 230, in the same way as in step S802.
Step S906
For each candidate expression generated by the candidate generation unit 240, the first evaluation score calculation unit 254 counts the number of corresponding named entities in the similar case sentences that have the same candidate expression, and outputs that count as the accuracy score of the candidate expression. Whether two candidate expressions are the same can be determined by string matching.
Step S907
The first evaluation score calculation unit 254 ranks the candidate expressions using the calculated accuracy scores. Keeping only the candidates at or above a certain rank, or at or above a certain score, selects the highly accurate candidates. If this evaluation narrows the candidates down to one, the evaluation by the second evaluation unit 260 can be omitted.
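Steps S906 and S907 amount to counting and thresholding, which can be sketched as below. The table contents, the corresponding entities assumed to have been found in similar case sentences, and the threshold value are all hypothetical.

```python
from collections import Counter

# Hypothetical entity information table.
TABLE = {
    "Kenya": ["developing country", "humid area"],
    "Ghana": ["developing country", "coastal state"],
    "Laos": ["developing country", "humid area"],
    "Nepal": ["developing country", "landlocked country"],
}


def accuracy_scores(candidates: list[str], similar_entities: list[str],
                    table: dict[str, list[str]]) -> dict[str, int]:
    """S906: for each candidate, count the corresponding entities that share it."""
    counts = Counter()
    for entity in similar_entities:
        for cand in set(table.get(entity, [])):  # exact string match on candidates
            counts[cand] += 1
    return {c: counts[c] for c in candidates}


def rank(scores: dict[str, int], min_score: int) -> list[tuple[str, int]]:
    """S907: keep candidates at or above the threshold, best first."""
    kept = [(c, s) for c, s in scores.items() if s >= min_score]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)


scores = accuracy_scores(TABLE["Kenya"], ["Ghana", "Laos", "Nepal"], TABLE)
ranked = rank(scores, min_score=2)
```

In this example only one candidate survives the threshold, which is the situation in which the source says the second evaluation can be skipped.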
(1-4-3) Processing of the second evaluation unit
Step 1001
First, the important word extraction unit 261 generates combinations of the candidate expressions.
Step 1002
Next, the important word extraction unit 261 extracts words other than named entities from the input text, excluding frequent words such as "of" and "a".
Step 1003
The synonym expansion unit 262 expands the words contained in the candidate expressions and in the word set of the input text into synonyms using WordNet.
Step 1004
The second evaluation score calculation unit 263 counts the overlap between the synonym-expanded candidate expressions and the word set extracted in the preceding step, and outputs the count as the context appropriateness score.
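Steps 1002 to 1004 can be sketched as follows. The source uses WordNet for synonym expansion; to keep the example self-contained, a small hand-written synonym dictionary stands in for it. The stopword list, the synonym entries, and the function names are assumptions for illustration.

```python
import re

STOPWORDS = {"of", "a", "the", "to", "and", "in"}
# Hypothetical stand-in for WordNet synonym lookup.
SYNONYMS = {"aid": {"assistance", "help"}, "economy": {"market"}}


def expand(words: set[str]) -> set[str]:
    """Step 1003: add the synonyms of every word to the set."""
    out = set(words)
    for w in words:
        out |= SYNONYMS.get(w, set())
    return out


def content_words(text: str, entities: list[str]) -> set[str]:
    """Step 1002: words other than named entities, with frequent words removed."""
    for e in entities:
        text = text.replace(e, " ")
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def context_score(candidate: str, text: str, entities: list[str]) -> int:
    """Step 1004: size of the overlap between the two expanded word sets."""
    cand_words = expand(set(candidate.lower().split()))
    ctx_words = expand(content_words(text, entities))
    return len(cand_words & ctx_words)


text = "Kenya needs aid to develop its economy."
score_a = context_score("country receiving assistance", text, ["Kenya"])
score_b = context_score("humid area", text, ["Kenya"])
```

The first candidate overlaps with the context through the synonym pair "aid" and "assistance", while the second has no overlap.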
(1-5) Summary
By executing the processing described above, it is possible to generate a sentence (text) in which some of the named entities in a sentence describing a specific case are replaced with expressions that have a more general meaning. An assertion sentence generated by rewriting a case sentence remains associated with that case sentence. Therefore, for example, a new text can be constructed by placing the generated assertion sentence before the case sentence and inserting, at the beginning of the case sentence, a conjunction indicating that it is an example. In this way, a text can be constructed automatically whose argument is easy to follow because the assertion is stated explicitly, and which is persuasive because a case is presented.
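The text construction described in this summary can be illustrated with a very small sketch. The connective "For example," and the function name are assumptions; the source only states that a conjunction marking the sentence as an example is inserted.

```python
def build_text(assertion: str, case: str, connective: str = "For example,") -> str:
    """Place the generalized assertion first, then the original case sentence
    introduced by a connective that marks it as an example."""
    return f"{assertion} {connective} {case}"


text = build_text(
    "Developing countries received economic assistance.",
    "Kenya received economic assistance.",
)
```

This sketch assumes the case sentence begins with a proper noun, so its capitalization is left unchanged after the connective.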
(2) Second embodiment
In this embodiment, the replacement for generating an argumentative text is performed, in the same way as in the first embodiment, on a sentence set in which sentences extracted from multiple texts are arranged in an appropriate order.
FIG. 11 shows an overview of the text generation system used in this embodiment. The system consists of a text generation device 1100 and a data management device 1101. When a topic is input, the text generation device 1100 outputs an argumentative text stating an opinion on that topic. The data management device 1101 stores preprocessed data and is accessible from the text generation device 1100.
The text generation device 1100 executes nine processing functions in sequence. First, the input unit 1102 receives a topic from the user. Next, the topic analysis unit 1103 analyzes the topic and determines its polarity and the keywords to be used for retrieval. The search unit 1104 searches for articles using the keywords and issue words, which indicate points of contention in a debate. The issue determination unit 1105 classifies the retrieved articles and determines the issues to be used when generating the opinion. The sentence extraction unit 1106 extracts, from the retrieved articles, sentences that discuss the issues. The sentence rearrangement unit 1107 generates texts by rearranging the extracted sentences. The evaluation unit 1108 evaluates the generated texts. The replacement unit 1109 inserts appropriate conjunctions, deletes unnecessary expressions, and replaces some named entities with abstract expressions according to the theme information. The output unit 1110 outputs the most highly rated text as the argumentative text stating the opinion.
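Three of the nine stages above (search, issue determination, and sentence extraction) can be sketched with toy logic as follows. The source does not specify how any of these stages is implemented; the substring matching, the frequency-based choice of issue, the issue word table, and the sample articles are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical issue ontology: issue -> related words.
ISSUE_WORDS = {"employment": ["jobs", "work"], "safety": ["crime"]}


def search(articles: list[str], keywords: list[str]) -> list[str]:
    """Search unit 1104 (toy): keep articles matching a keyword and an issue word."""
    terms = [w for words in ISSUE_WORDS.values() for w in words]
    return [a for a in articles
            if any(k in a.lower() for k in keywords)
            and any(t in a.lower() for t in terms)]


def determine_issue(articles: list[str]) -> str:
    """Issue determination unit 1105 (toy): pick the most frequently mentioned issue."""
    counts = Counter()
    for a in articles:
        for issue, words in ISSUE_WORDS.items():
            counts[issue] += sum(a.lower().count(w) for w in words)
    return counts.most_common(1)[0][0]


def extract_sentences(articles: list[str], issue: str) -> list[str]:
    """Sentence extraction unit 1106 (toy): keep sentences mentioning the issue."""
    sentences = [s.strip() for a in articles for s in re.split(r"(?<=[.!?])\s+", a)]
    return [s for s in sentences if any(w in s.lower() for w in ISSUE_WORDS[issue])]


articles = [
    "Casinos create jobs. Casinos also attract tourists.",
    "Casino jobs pay well. Some fear crime near casinos.",
    "The weather was mild.",
]
hits = search(articles, ["casino"])
issue = determine_issue(hits)
selected = extract_sentences(hits, issue)
```

The selected sentences would then be passed to the rearrangement, evaluation, and replacement stages.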
The replacement unit 1109 in this embodiment has the configuration described in the first embodiment with additional input information. The processing functions added to the first embodiment are described below.
The input unit 210 used in this embodiment receives the rearranged sentence set as the text and, as the theme information, the topic, the analysis result of the topic analysis unit 1103, or the keywords used as the query by the search unit 1104.
The similar case sentence search unit 251 of the first evaluation unit 250 used in this embodiment can use the output of the search unit 1104 as its search target. In addition, because each sentence has a source document from which it was extracted, the information in the entity information table 230 can be updated by extracting relations from that document.
The second evaluation unit 260 used in this embodiment can include the topic information among the targets against which co-occurrence with the candidate expressions is measured. In addition, because each sentence has a source document, the degree of co-occurrence with important words in that document can also be included in the evaluation.
The data management device 1101, on the other hand, consists of an interface unit 1111, a structuring unit 1112, and four databases 1113 to 1116. The interface unit 1111, together with the structuring unit 1112, provides a means of accessing the data managed in the databases. The text data DB 1113 holds text data such as news articles. The text annotation data DB 1114 holds the data attached to the text data DB 1113. The search index DB 1115 holds the index that makes the text data DB 1113 and the text annotation data DB 1114 searchable. The issue ontology DB 1116 is a database that links issues frequently argued in debates with their related words.
As described above, according to this embodiment, even when an argumentative text is generated from a sentence set in which sentences extracted from multiple texts are arranged in an appropriate order, an argumentative text with more generalized content can ultimately be output.
(3) Other embodiments
The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail to explain the present invention clearly, and the invention need not include every configuration described. Part of the configuration of an embodiment may be deleted, known techniques may be added to it, or part of it may be replaced with known techniques.
Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as an integrated circuit. They may also be implemented in software, by a processor interpreting and executing programs that realize the respective functions. Information such as the programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a storage medium such as an IC card, an SD card, or a DVD. The control lines and information lines shown are those considered necessary for the explanation and do not necessarily represent all the control lines and information lines required in a product. In practice, almost all of the components may be considered to be interconnected.
 110 Input device
 120 Output device
 130 Arithmetic unit (Central Processing Unit: CPU)
 140 Memory
 150 Storage device
 160 Network device
 170 Bus
 210 Input unit
 220 Entity extraction unit
 230 Entity information table
 240 Candidate generation unit
 250 First evaluation unit
 260 Second evaluation unit
 270 Post-conversion text generation unit
 280 Output unit
 251 Similar case sentence search unit
 252 Sentence text data
 253 Similar case entity extraction unit
 254 First evaluation score calculation unit
 261 Important word extraction unit
 262 Synonym expansion unit
 263 Second evaluation score calculation unit

Claims (12)

  1.  A text generation device comprising:
     an input unit used to input a sentence to be processed and theme information;
     a replacement target expression extraction unit that extracts, based on the theme information, one or more of the one or more named entities contained in the sentence as replacement target expressions, and identifies a keyword representing the theme information;
     a candidate generation unit that uses dictionary information stored in advance to generate a plurality of candidate expressions, which are replacement candidates that abstract the replacement target expression;
     a first evaluation unit that outputs a first evaluation result obtained by evaluating the candidate expressions using the dictionary information; and
     a post-conversion sentence generation unit that generates a post-conversion sentence by replacing the replacement target expression with a candidate expression rated highly in the first evaluation result.
  2.  The text generation device according to claim 1, further comprising
     a second evaluation unit that outputs a second evaluation result obtained by evaluating the candidate expressions based on their relationship with other named entities contained in the plurality of sentences constituting a text,
     wherein the post-conversion sentence generation unit generates the post-conversion sentence by replacing the replacement target expression with a candidate expression rated highly in both the first evaluation result and the second evaluation result.
  3.  The text generation device according to claim 1,
     wherein the first evaluation unit includes:
      a similar case sentence search unit that searches for similar case sentences, each describing a case similar to the case described by the sentence;
      a corresponding expression extraction unit that extracts, from the similar case sentences, a plurality of corresponding replacement target expressions in which the same case as that of the replacement target expression occurs; and
      a score calculation unit that generates a plurality of replacement candidate expressions for the corresponding replacement target expressions from the dictionary information and assigns an accuracy score to each of the candidate expressions.
  4.  The text generation device according to claim 2,
     wherein the second evaluation unit includes:
      an important word extraction unit that extracts important words from the text;
      a synonym expansion unit that acquires synonym expressions of the important words and of the candidate expressions; and
      a score calculation unit that assigns a context appropriateness score to each of the candidate expressions using the synonym expressions.
  5.  The text generation device according to claim 2, further comprising
     a combination generation unit that generates combinations of the candidate expressions when a plurality of the replacement target expressions are extracted,
     wherein the first evaluation unit and the second evaluation unit output evaluations for the combinations of the candidate expressions.
  6.  The text generation device according to claim 1,
     wherein the first evaluation unit further includes:
     a relation extraction unit that extracts, from within a sentence, syntactic relations concerning the replacement target expression; and
     a dictionary information update unit that updates the dictionary information based on the relations.
  7.  A text generation method executed in a text generation device having an arithmetic unit and a storage device, wherein
     the arithmetic unit executes:
     a process of accepting input of a sentence to be processed and theme information;
     a process of extracting, based on the theme information, one or more of the one or more named entities contained in the sentence as replacement target expressions;
     a process of identifying a keyword representing the theme information;
     a process of generating, using dictionary information stored in advance, a plurality of candidate expressions, which are replacement candidates that abstract the replacement target expression;
     a process of outputting a first evaluation result obtained by evaluating the candidate expressions using the dictionary information; and
     a process of generating a post-conversion sentence by replacing the replacement target expression with a candidate expression rated highly in the first evaluation result.
  8.  The text generation method according to claim 7, wherein
     the arithmetic unit further executes a process of outputting a second evaluation result obtained by evaluating the candidate expressions based on their relationship with other named entities contained in the plurality of sentences constituting a text, and
     the process of generating the post-conversion sentence generates the post-conversion sentence by replacing the replacement target expression with a candidate expression rated highly in both the first evaluation result and the second evaluation result.
  9.  The text generation method according to claim 7, wherein
     the process of outputting the first evaluation result includes:
      a process of searching for similar case sentences, each describing a case similar to the case described by the sentence;
      a process of extracting, from the similar case sentences, a plurality of corresponding replacement target expressions in which the same case as that of the replacement target expression occurs; and
      a process of generating a plurality of replacement candidate expressions for the corresponding replacement target expressions from the dictionary information and assigning an accuracy score to each of the candidate expressions.
  10.  The text generation method according to claim 8, wherein
     the process of outputting the second evaluation result includes:
      a process of extracting important words from the text;
      a process of acquiring synonym expressions of the important words and of the candidate expressions; and
      a process of assigning a context appropriateness score to each of the candidate expressions using the synonym expressions.
  11.  The text generation method according to claim 8, further comprising
     a process of generating combinations of the candidate expressions when a plurality of the replacement target expressions are extracted,
     wherein the process of outputting the first evaluation result and the process of outputting the second evaluation result output evaluations for the combinations of the candidate expressions.
  12.  The text generation method according to claim 7, wherein
     the arithmetic unit further executes:
     a process of extracting, from within a sentence, syntactic relations concerning the replacement target expression; and
     a process of updating the dictionary information based on the relations.
PCT/JP2015/052478 2015-01-29 2015-01-29 Text generation device and text generation method WO2016121048A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/052478 WO2016121048A1 (en) 2015-01-29 2015-01-29 Text generation device and text generation method

Publications (1)

Publication Number Publication Date
WO2016121048A1 true WO2016121048A1 (en) 2016-08-04

Family

ID=56542700

Country Status (1)

Country Link
WO (1) WO2016121048A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63158663A (en) * 1986-12-23 1988-07-01 Toshiba Corp Document privacy protecting device
JP2012027567A (en) * 2010-07-21 2012-02-09 National Institute Of Information & Communication Technology Paraphrase relationship set acquisition device, paraphrase relationship set acquisition method, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HITOYUKI OKADA ET AL.: "Wikipedia o Riyo shita Nihongo Sakubun Shien System no Kaihatsu", INFORMATION PROCESSING SOCIETY OF JAPAN SYMPOSIUM JINMONKON SYMPOSIUM, 11 December 2009 (2009-12-11), pages 225 - 230 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555196A (en) * 2018-05-30 2019-12-10 北京百度网讯科技有限公司 method, device, equipment and storage medium for automatically generating article
CN109284357A (en) * 2018-08-29 2019-01-29 腾讯科技(深圳)有限公司 Interactive method, device, electronic equipment and computer-readable medium
CN111353293A (en) * 2018-12-21 2020-06-30 深圳市优必选科技有限公司 Statement material generation method and terminal equipment
CN109858021A (en) * 2019-01-02 2019-06-07 平安科技(深圳)有限公司 Traffic issues statistical method, device, computer equipment and its storage medium
CN109858021B (en) * 2019-01-02 2023-11-14 平安科技(深圳)有限公司 Service problem statistics method, device, computer equipment and storage medium thereof
CN111832309A (en) * 2019-03-26 2020-10-27 北京京东尚科信息技术有限公司 Text generation method and device and computer readable storage medium
CN110674272A (en) * 2019-09-05 2020-01-10 科大讯飞股份有限公司 Question answer determining method and related device
CN111680152A (en) * 2020-06-10 2020-09-18 创新奇智(成都)科技有限公司 Method and device for extracting abstract of target text, electronic equipment and storage medium
CN111680152B (en) * 2020-06-10 2023-04-18 创新奇智(成都)科技有限公司 Method and device for extracting abstract of target text, electronic equipment and storage medium
CN113486169A (en) * 2021-07-27 2021-10-08 平安国际智慧城市科技股份有限公司 Synonymy statement generation method, device, equipment and storage medium based on BERT model
CN113486169B (en) * 2021-07-27 2024-04-16 平安国际智慧城市科技股份有限公司 Synonymous statement generation method, device, equipment and storage medium based on BERT model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879945

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879945

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP