WO2022110454A1 - Automatic text generation method and apparatus, and electronic device and storage medium - Google Patents

Info

Publication number
WO2022110454A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
text
word
paragraphs
generated
Prior art date
Application number
PCT/CN2020/139952
Other languages
French (fr)
Chinese (zh)
Inventor
夏维
孙赫
张恒
高鹏
Original Assignee
中译语通科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011341955.5A external-priority patent/CN112417846B/en
Application filed by 中译语通科技股份有限公司 filed Critical 中译语通科技股份有限公司
Publication of WO2022110454A1 publication Critical patent/WO2022110454A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • the present invention relates to the technical field of artificial intelligence, and in particular, to a method, device, electronic device and storage medium for automatic text generation.
  • Embodiments of the present invention provide a method, apparatus, electronic device, and storage medium for automatic text generation, so as to solve the defects existing in the prior art.
  • An embodiment of the present invention provides a method for automatic text generation, including:
  • the to-be-generated text is generated based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph keywords.
  • determining the number of paragraphs and the subject headings of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association map specifically including:
  • the number of paragraphs is determined to be the default number, and based on the word association map, a related word list composed of words that have a correlation relationship with each keyword is determined;
  • a reserved number of words is determined, and the paragraph topic word is determined based on the words of the reserved word number in the topic word list.
  • determining the number of paragraphs and the subject headings of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association map specifically including:
  • the number of paragraphs is determined as the default number, and based on the word association map, the number of words that are related to each keyword is determined;
  • similar words that have a similarity relationship with any keyword are determined based on the word association map, and a related word list is determined for each similar word based on the word association map;
  • the paragraph topic word is determined based on the words in the topic word list.
  • determining the number of paragraphs and the subject headings of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association map specifically including:
  • the paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
  • determining the number of paragraphs and the subject headings of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association map specifically including:
  • the paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
  • the word association map is specifically constructed by the following method:
  • the word association map is constructed based on the similarity relationship between any two sample words and the correlation relationship between any two sample words.
  • the estimated number of sentences is specifically obtained by the following method:
  • the estimated sentence count is determined.
  • An embodiment of the present invention also provides an automatic text generation device, including an acquisition module, a determination module, and a text generation module, wherein:
  • the acquisition module is configured to obtain the keywords of the text to be generated and the estimated number of sentences;
  • the determination module is configured to determine the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map;
  • the text generation module is configured to generate the to-be-generated text based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph keywords.
  • An embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any of the above automatic text generation methods when executing the program.
  • Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, implements the steps of any of the above-described automatic text generation methods.
  • the automatic text generation method, device, electronic device, and storage medium provided by the embodiments of the present invention first obtain the keywords of the text to be generated and the estimated number of sentences; then determine the number of paragraphs and the paragraph subject headings of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map; and finally generate the text based on the Transformer model, the number of paragraphs, and the paragraph subject headings of the text to be generated.
  • This is a new type of text generation method implemented by the Transformer model.
  • With the Transformer model, the generated text is no longer limited to a single kind of content and a fixed format, as text generated by traditional methods is, so the method can be widely used in many fields such as report generation, literary creation, and intelligent question answering.
  • FIG. 1 is a schematic flowchart of a method for automatically generating text according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of determining the number of paragraphs of text to be generated and the subject headings of paragraphs in a method for automatic text generation provided by an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a complete flow of a method for automatic text generation provided by an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a device for automatic text generation provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • an automatic text generation method is provided in the embodiments of the present invention to solve the problems existing in the prior art.
  • FIG. 1 is a schematic flowchart of a method for automatic text generation provided in an embodiment of the present invention. As shown in FIG. 1 , the method includes:
  • the to-be-generated text is generated based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph subject heading.
  • the execution body is a server, which can be either a local server or a cloud server, and the local server can be a computer, etc., which is not specifically limited in the embodiment of the present invention.
  • Step S1 is performed first: the keywords of the text to be generated and the estimated number of sentences are obtained.
  • the keywords refer to the keywords required for generating the text to be generated.
  • the keywords can be determined from the keyword-related information in the user input information.
  • the keyword-related information can include a single keyword, multiple keywords, or a piece of text containing one or more sentences.
  • when the keyword-related information is a single keyword or multiple keywords, the input keywords are the keywords of the text to be generated; when the keyword-related information is a piece of text, the more important words are first extracted from the text input by the user. The extraction can be performed by either extraction algorithms or syntactic analysis algorithms.
  • the extraction algorithms may include the TF-IDF algorithm, the TextRank algorithm, and the like.
  • which method to use for extraction can be controlled by passing parameters.
  • the user input information may further include extraction parameters, and different values of the extraction parameters represent different extraction methods selected by the user. The extracted words are then filtered to remove stop words, and the remaining words are the keywords of the text to be generated.
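  • The extraction step described above can be illustrated with a minimal sketch. It assumes the input text is already tokenized, that a background corpus is available for document frequencies, and that a stop-word set is supplied; the function name `extract_keywords`, its parameters, and the smoothed IDF formula are illustrative assumptions, not taken from the source.

```python
import math
from collections import Counter


def extract_keywords(tokens, corpus, stop_words, top_k=3):
    """Rank the tokens of one text by TF-IDF, then remove stop words.

    tokens:     list of words from the user's input text (assumed pre-tokenized)
    corpus:     list of token lists used only to compute document frequencies
    stop_words: set of words to discard after extraction
    """
    term_freq = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in term_freq.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF: an assumed variant, the source gives no formula.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / len(tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    extracted = ranked[:top_k]
    # Stop-word removal happens after extraction, as in the description.
    return [w for w in extracted if w not in stop_words]
```

  • Because stop words are removed after the top-ranked words are selected, the result can contain fewer than `top_k` keywords.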
  • the estimated number of sentences refers to the estimated number of sentences that may exist in the text to be generated, and the estimated number of sentences may also be determined by user input information.
  • the user input information may also include the target number of words of the text to be generated.
  • sentence-length statistics are computed on the training corpus and averaged, giving an average sentence length of 33 to 34 characters. Therefore, the default sentence length is 33 words, and the estimated number of sentences can be determined as the ratio of the target word count to the default sentence length. Note that the default sentence length does not mean that each sentence in the generated text contains 33 characters, and the estimated number of sentences is only a preliminary estimate. In particular, if the target word count entered by the user is less than 33, the estimated number of sentences defaults to 1.
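  • As a minimal sketch of this estimate: the function name is hypothetical, and rounding the ratio down is an assumption, since the text does not say how the ratio is rounded.

```python
DEFAULT_SENTENCE_LENGTH = 33  # default sentence length stated in the description


def estimate_sentence_count(target_words, sentence_length=DEFAULT_SENTENCE_LENGTH):
    """Estimate how many sentences a text of `target_words` words will contain."""
    if target_words < sentence_length:
        # Targets shorter than one default sentence yield a single sentence.
        return 1
    # Assumption: the ratio is rounded down; the source does not specify rounding.
    return target_words // sentence_length
```

  • Under these assumptions, a target of 500 words gives an estimate of 15 sentences.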
  • step S2 is performed to determine the number of paragraphs and the paragraph subject words of the text to be generated according to the keywords, the estimated number of sentences, and the pre-built word association map. Since the number of keywords and the estimated number of sentences vary, different processing modes can be used to determine the number of paragraphs and the subject words of the text to be generated. Therefore, in the embodiment of the present invention, a corresponding processing mode can be determined according to the conditions that the number of keywords and the estimated number of sentences satisfy, and the number of paragraphs and the paragraph headings of the text to be generated can then be determined according to that processing mode and the pre-built word association map. The word association graph is pre-built from the training corpus and represents the associations between words.
  • the relationship between words can include similarity and correlation. Similarity represents how similar two words are and can be determined from the similarity between the two words; correlation represents the dependency relationship between two words.
  • the dependency relationship of the two words may be determined by performing dependency analysis on the sentence in which the two words are located, which is not specifically limited in this embodiment of the present invention.
  • step S3 is performed, and the text to be generated is generated according to the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject headings.
  • the Transformer model combines the number of paragraphs of the text to be generated with the paragraph subject headings, and determines the next sentence from the previous sentence in each paragraph.
  • a Transformer model can take four inputs:
  • the first input: the semantic vector of the previous sentence. If there is no previous sentence, this input is the 0 vector of the corresponding dimension.
  • the second input: the word vector of a randomly selected paragraph subject word.
  • the third input: the sum of the word vectors of all keywords.
  • the fourth input: the final sentence judgment vector. For example, if the sentence is a closing sentence, this is the numeric 8 constant vector of the corresponding dimension; if not, it is the numeric 1 constant vector of the corresponding dimension.
  • based on the semantics of the preceding text, the Transformer model outputs the current sentence, and outputs the semantic vector of the current sentence at the same time.
  • the number of words in the text is counted after each run of the Transformer model. If the number of words is close to the target number of words of the current paragraph, the fourth input is changed so that the model outputs the concluding sentence of the paragraph.
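  • The sentence-by-sentence control flow described above can be sketched as follows. The callable `generate_sentence` is a hypothetical stand-in for the Transformer model, and the sketch passes words and a boolean flag rather than the vectors the description uses. The rule that "close to the target" means at most one default sentence length remaining, the whitespace-based word count, and the `max_sentences` guard are all assumptions.

```python
import random


def generate_paragraph(generate_sentence, topic_words, keywords,
                       target_words, sentence_length=33, max_sentences=1000):
    """Generate one paragraph, switching to a closing sentence near the target.

    generate_sentence(prev_vector, topic_word, keywords, is_final) is a
    hypothetical model wrapper returning (sentence, semantic_vector).
    """
    sentences = []
    prev_vector = None  # stands in for the zero vector used for the first sentence
    word_count = 0
    for _ in range(max_sentences):  # guard against a model that never finishes
        # Assumption: "close" means one default sentence length or less remains.
        is_final = target_words - word_count <= sentence_length
        topic_word = random.choice(topic_words)  # randomly selected subject word
        sentence, prev_vector = generate_sentence(
            prev_vector, topic_word, keywords, is_final)
        sentences.append(sentence)
        # Assumption: words are whitespace-separated.
        word_count += len(sentence.split())
        if is_final:
            break
    return sentences
```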
  • words and sentences are converted into the form of text vectors during use.
  • text semantic vector conversion can be performed with a pre-trained BERT model.
  • the automatic text generation method in the embodiment of the present invention can be developed and implemented based on Python.
  • the automatic text generation method provided in the embodiment of the present invention first obtains the keywords of the text to be generated and the estimated number of sentences; then determines the number of paragraphs and the paragraph keywords of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map; and finally generates the text based on the Transformer model, the number of paragraphs, and the paragraph keywords of the text to be generated.
  • This is a new type of text generation method implemented by the Transformer model. It introduces the screening and determination of paragraph subject words, which both expands and constrains the subject of the generated text, so that the generated text has a core idea. At the same time, with the Transformer model, the generated text is no longer limited to a single kind of content and a fixed format, as text generated by traditional methods is, so the method can be widely used in many fields such as report generation, literary creation, and intelligent question answering.
  • words may be randomly selected from the popular thesaurus as the keywords of the text to be generated.
  • the popular thesaurus can be obtained by collecting popular words on a daily basis, and can be regularly updated and maintained.
  • the estimated number of sentences is specifically obtained by the following method:
  • the estimated sentence count is determined.
  • the target number of words of the text to be generated may be determined first, and the target number of words may be input by the user, that is, the user input information may include the target number of words. Then, according to the target number of words, the estimated number of sentences can be determined. Specifically, the ratio of the target number of words to the default sentence length can be used as the estimated number of sentences.
  • the target number of words is introduced, so that the generated text is no longer a random number of words, but can generate text with the number of words desired by the user according to the user's needs.
  • a number may be randomly selected from 500 to 5000 as the target number of words.
  • the actual number of words in the generated text and the target number of words are not necessarily exactly equal.
  • when the target number of words is less than 500, the actual number of words in the generated text may deviate by about 50 words in either direction; when the target number of words is greater than 500, the actual number of words may deviate by 50 to 200 words. Both deviations are within the controllable range.
  • if the target number of words is too small, for example, less than 33 words, only one sentence is generated, and the sentence is generated entirely from the semantics of the keywords obtained in step S1.
  • FIG. 2 is a schematic flowchart of selecting different processing modes when the number of keywords and the estimated number of sentences meet different conditions in an embodiment of the present invention, which is described with reference to the following embodiments.
  • determining the number of paragraphs and the paragraph headings of the to-be-generated text based on the keywords, the estimated number of sentences, and a pre-built word association map includes:
  • the number of paragraphs is determined to be the default number, and based on the word association map, a related word list composed of words that have a correlation relationship with each keyword is determined;
  • a reserved number of words is determined, and the paragraph topic word is determined based on the words of the reserved word number in the topic word list.
  • the target number of words is about 500 words.
  • the first processing mode can be performed to determine the final number of paragraphs and paragraph headings.
  • if the number of sentences to be generated is too small (by default, less than 8), the paragraph subject words can likewise be filtered out directly, so the first processing mode is also performed.
  • the first condition may be that the number of keywords is greater than or equal to the first threshold and the estimated number of sentences is less than the second threshold, or the number of keywords is less than the first threshold and the estimated number of sentences is less than the third threshold.
  • the first threshold, the second threshold and the third threshold can be set as required, and the third threshold is smaller than the second threshold.
  • for example, the first threshold can be 2, the second threshold can be 15, and the third threshold can be 8.
  • the first condition corresponds to the first processing mode, that is, if the number of keywords and the estimated number of sentences satisfy the first condition, the first processing mode is used to determine the number of paragraphs and the subject headings of the text to be generated, as shown in FIG. 2.
  • the first processing mode is as follows: first, the number of paragraphs is determined as a default number, which can be set according to needs and the specific content of the first condition, for example, 1. Then, according to the word association graph, a related word list composed of words that have a related relationship with each keyword is determined. Assuming that the number of keywords is n, the n keywords are input into the word association graph, and the query is filtered by relationship type so that only words related to the input keyword are returned. The result of each keyword query is stored in its own list, giving n query result lists; each query result list is the related word list for one keyword.
  • the topic dictionary can be sorted once by value, from largest to smallest. Since a Python dictionary cannot itself be reordered by value, the sorted result is stored as a list, that is, the subject word list. For example: [(term1,7),(term2,7),(term3,5),(term4,2)].
  • the list of subject words is in the form of tuples, each tuple contains two values, the word itself and the number of occurrences of the word.
  • based on the estimated number of sentences, the number of words to retain is determined. Specifically, the estimated number of sentences multiplied by 0.6 is rounded up to obtain the number of reserved words.
  • the subject word list is intercepted according to the number of reserved words, and the intercepted words are the paragraph subject words. On this basis, you can add paragraph subject headings to a new list, and this new list is the paragraph heading list. Because there is only one paragraph, there is only one list of paragraph subject headings.
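  • The first processing mode can be sketched as follows. The word association graph is replaced here by a plain dictionary `related` mapping each word to its related words, which is a hypothetical stand-in for the graph query. Counting how many related word lists each word appears in is an assumption inferred from the (word, occurrences) tuples described above.

```python
import math
from collections import Counter


def first_mode_topic_words(keywords, related, estimated_sentences):
    """Return the paragraph subject words for a single-paragraph text.

    related: dict mapping a word to its list of related words
             (hypothetical stand-in for a word association graph query).
    """
    counts = Counter()
    for keyword in keywords:
        # Assumption: a word counts once per related word list it appears in.
        counts.update(set(related.get(keyword, [])))
    # Sort by occurrence count, largest first, into a list of (word, count) tuples.
    topic_list = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Reserved word count: 0.6 x estimated sentences, rounded up (3/5 avoids float drift).
    keep = math.ceil(estimated_sentences * 3 / 5)
    return [word for word, _ in topic_list[:keep]]
```

  • Ties in occurrence count are broken alphabetically here only to make the output deterministic; the source does not specify a tie-breaking rule.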
  • the embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, that is, through the first processing mode, the generated text can be divided into paragraphs intelligently without applying a template.
  • determining the number of paragraphs and the paragraph headings of the to-be-generated text based on the keywords, the estimated number of sentences, and a pre-built word association map includes:
  • the number of paragraphs is determined as the default number, and based on the word association map, the number of words that are related to each keyword is determined;
  • similar words that have a similarity relationship with any keyword are determined based on the word association map, and a related word list is determined for each similar word based on the word association map;
  • the paragraph topic word is determined based on the words in the topic word list.
  • when the estimated number of sentences is greater than or equal to 8 and less than 15, and there are not enough keywords (by default, fewer than 2, that is, only 1 keyword), the second processing mode is performed to determine the final number of paragraphs and paragraph headings.
  • the fundamental reason for not adopting the first processing mode in this case is that the number of keywords is too small, which may result in too few paragraph keywords and limit the freedom of article topics. That is, the second condition may be that the number of keywords is less than the first threshold, and the estimated number of sentences is less than the second threshold and greater than or equal to the third threshold.
  • the second condition corresponds to the second processing mode, that is, if the number of keywords and the estimated number of sentences satisfy the second condition, the second processing mode is used to determine the number of paragraphs and the subject headings of the text to be generated, as shown in FIG. 2.
  • the second processing mode is specifically as follows: first, the number of paragraphs is determined as a default number, and the default number can be set according to needs and the specific content of the second condition, for example, it can be set to 1. Then this keyword is input into the word association graph, and the related relationship query is performed to determine the number of words that have a related relationship with the keyword.
  • if the number of related words is less than the first preset threshold, the keyword needs to be expanded: the keyword is input into the word association map, and a similarity relationship query is performed to determine the similar words that have a similarity relationship with the keyword.
  • the keyword is then expanded with these similar words.
  • the first preset threshold may be 0.6 times the estimated number of sentences.
  • the default similarity filter value is 0.98. If no similar words are found, the similarity threshold is lowered by 0.01 at a time until similar words are found. If k similar words are found, related words are queried for each of the k words to obtain k related word lists; all the words in the k lists are merged and, after deduplication, put into a new list.
  • this new list is the subject word list. If the number of words in the topic word list is greater than or equal to the second preset threshold, the words in the topic word list can be directly used as the determined paragraph topic words, that is, the topic word list is a paragraph topic word list.
  • the second preset threshold may be 0.6 times the estimated number of sentences. If the number of words in the topic word list is less than the second preset threshold, continue to reduce the similarity threshold by 0.01, re-acquire new similar words, and repeat the above calculation operation until the number of words in the topic word list is greater than or equal to the second Preset threshold.
  • if the number of related words reaches the first preset threshold, the keyword does not need to be expanded, which is equivalent to the first processing mode.
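  • The threshold-lowering loop of the second processing mode can be sketched as follows. The dictionaries `related` and `similar` are hypothetical stand-ins for relationship queries against the word association graph, and returning the last list found when the threshold reaches zero is an assumed fallback that the source does not describe.

```python
def expand_topic_words(keyword, related, similar, min_words,
                       start_threshold=0.98, step=0.01):
    """Build a subject word list for a single keyword, expanding it if needed.

    related: dict word -> list of related words (hypothetical graph query)
    similar: dict word -> list of (similar word, similarity score) pairs
    min_words: preset threshold, e.g. 0.6 x the estimated number of sentences
    """
    topic_words = list(dict.fromkeys(related.get(keyword, [])))
    if len(topic_words) >= min_words:
        return topic_words  # enough related words, no expansion needed
    threshold = start_threshold
    while threshold > 0:
        candidates = [w for w, score in similar.get(keyword, [])
                      if score >= threshold]
        merged = []
        for word in candidates:
            merged.extend(related.get(word, []))
        topic_words = list(dict.fromkeys(merged))  # deduplicate, keep order
        if len(topic_words) >= min_words:
            return topic_words
        # Lower the similarity threshold by one step and try again.
        threshold = round(threshold - step, 2)
    return topic_words  # assumed fallback when the threshold is exhausted
```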
  • the embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, that is, through the second processing mode, the generated text can be divided into paragraphs intelligently without applying a template.
  • the second processing mode is suitable for the case that the number of keywords is too small, which can ensure that the number of paragraph keywords obtained is moderate, and the degree of freedom of the article topic can be improved.
  • determining the number of paragraphs and the paragraph headings of the to-be-generated text based on the keywords, the estimated number of sentences, and a pre-built word association map includes:
  • the paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
  • when the estimated number of sentences is greater than or equal to 15, the estimated number of sentences is compared with the number of keywords. If the estimated number of sentences is less than or equal to 1.5 times the number of keywords, the third processing mode is used to determine the number of paragraphs and paragraph headings. That is to say, the third condition may be that the estimated number of sentences is greater than or equal to the second threshold, and the estimated number of sentences is less than or equal to a preset multiple of the number of keywords.
  • the third condition corresponds to the third processing mode, that is, if the number of keywords and the estimated number of sentences satisfy the third condition, the third processing mode is used to determine the number of paragraphs and the subject words of the text to be generated, as shown in FIG. 2.
  • the third processing mode is similar to the first processing mode. Each keyword is queried directly in the word association graph. If the number of keywords is k, k related word lists are obtained. The words in the k related word lists are merged and deduplicated to obtain the subject word list. The estimated number of sentences is then multiplied by 0.6, and the resulting value is used to truncate the subject word list.
  • the last intercepted word list is the paragraph subject heading list, and the words contained in it are the paragraph subject headings.
  • the words in the paragraph subject word list are clustered, and the number of paragraphs can be determined from the clustering result.
  • the number of paragraphs is determined as shown in the formula:
  • each category of words corresponds to a paragraph, so that the paragraphs and the paragraph subject word list are in one-to-one correspondence.
  • the embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, that is, the third processing mode is implemented, so that the generated text can be divided into paragraphs intelligently without applying a template. Moreover, the third processing mode is suitable for a large number of estimated sentences, which can ensure the accuracy of the calculation results.
  • determining the number of paragraphs and the paragraph headings of the to-be-generated text based on the keywords, the estimated number of sentences, and a pre-built word association map includes:
  • paragraph subject headings are clustered, and the number of paragraphs is determined based on the result of the clustering.
  • when the estimated number of sentences is greater than or equal to 15 and greater than 1.5 times the number of keywords, the fourth processing mode is used to determine the number of paragraphs and paragraph headings. That is to say, the fourth condition may be that the estimated number of sentences is greater than or equal to the second threshold, and the estimated number of sentences is greater than a preset multiple of the number of keywords.
  • the fourth condition corresponds to the fourth processing mode, that is, if the number of keywords and the estimated number of sentences satisfy the fourth condition, the fourth processing mode is used to determine the number of paragraphs and the subject words of the text to be generated, as shown in FIG. 2.
  • the fourth processing mode is specifically: the determination of the paragraph subject heading is similar to the second processing mode, and the determination of the number of paragraphs is similar to the third processing mode. Firstly, all keywords are searched for related words respectively, and corresponding related word lists are obtained, and the words in all related word lists are aggregated and deduplicated and put into a list, which is a list of subject words. If the number of words in the topic word list is greater than or equal to the second preset threshold, the situation is similar to the third processing mode, and word clustering is performed on the topic word list, and the number of paragraphs and The paragraph subject heading for the paragraph. If the number of words in the subject word list is less than the second preset threshold, similar word matching needs to be performed.
  • Similar words are matched for each keyword according to the second processing mode, and the matched similar words are used to match related words. Finally, all related words are counted together and put into a list.
  • this list is the subject word list. If the number of words in the subject word list is still less than the second preset threshold, the similarity matching threshold is lowered and related words are obtained again. If the number of words in the subject word list is greater than the second preset threshold and the second preset threshold is greater than 6 (a subject word list whose second preset threshold does not exceed 6 is required), the words already in the subject word list take precedence, and words are randomly drawn from the subject word list without replacement until the size of the final subject word list is under control.
  • the embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, which is realized by the fourth processing mode, so that the generated text can be divided into paragraphs intelligently without applying a template. Moreover, the fourth processing mode is suitable for a large number of estimated sentences, which can ensure the accuracy of the calculation results.
  • the first processing mode and the second processing mode in the embodiments of the present invention may be combined into one, with the second processing mode as the main one;
  • the third processing mode and the fourth processing mode may be combined into one, with the fourth processing mode as the main one.
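The word-collection part of the fourth processing mode can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the word association map is replaced by two plain dictionaries (`related` and `similar`, both hypothetical), the list of similarity thresholds is an assumed way of "reducing the similarity matching threshold", and the clustering step is left out because the source does not name a clustering algorithm.

```python
def build_subject_words(keywords, related, similar, min_words,
                        thresholds=(0.8, 0.7, 0.6)):
    """Collect deduplicated related words; widen via similar words if too few.

    related: dict word -> list of related words (stand-in for the graph).
    similar: dict word -> list of (similar_word, similarity) pairs.
    min_words: plays the role of the second preset threshold.
    thresholds: assumed descending similarity cut-offs, tried in order.
    """
    subject, seen = [], set()

    def add_related(word):
        # Append related words in order, skipping duplicates.
        for w in related.get(word, []):
            if w not in seen:
                seen.add(w)
                subject.append(w)

    for kw in keywords:
        add_related(kw)

    for threshold in thresholds:
        if len(subject) >= min_words:
            break
        # Too few subject words: expand through similar words of each keyword.
        for kw in keywords:
            for candidate, score in similar.get(kw, []):
                if score >= threshold:
                    add_related(candidate)
    return subject


related = {"sea": ["wave", "ship"], "ocean": ["ship", "whale", "tide"]}
similar = {"sea": [("ocean", 0.75)]}
print(build_subject_words(["sea"], related, similar, min_words=4))
```

In this toy run the keyword alone yields only two related words, so the similar word "ocean" is admitted once the cut-off drops to 0.7, and its related words fill the list.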
  • the word association map is specifically constructed by the following method:
  • the word association map is constructed based on the similarity relationship between any two sample words and the correlation relationship between any two sample words.
  • the entities in the word association map are words, and the relationships between entities are of two types: similarity relationships and correlation relationships.
  • a BERT semantic model is obtained through corpus training, and the BERT semantic model is then used to convert each word into a semantic vector of the target dimension.
  • the target dimension can be controlled by a parameter; typical values are 64, 128, 256, 512, etc.
  • the cosine similarity is calculated for every two semantic vectors, and its value is the similarity between the words represented by the two semantic vectors.
  • the obtained similarity value is stored in the graph database as an attribute of the similarity relationship between the two words, to facilitate querying. Triples for the correlation relationship are obtained as follows: dependency analysis is performed on sentences to obtain the dependencies between words, and the words that have dependencies are then stored in the graph database as correlation triples.
  • the graph database used in the embodiment of the present invention is the Neo4j database, the development language is Python, and the Cypher language is invoked through the interface of the py2neo library to perform addition, deletion, modification, and query operations on the database.
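As a rough illustration of the two kinds of relationship described above, the sketch below computes cosine similarity between word vectors and turns dependency arcs into correlation triples. It keeps everything in memory instead of writing to a graph database; the relationship labels `SIMILAR` and `RELATED`, the function names, and the toy vectors are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_triples(vectors):
    """Return (word_a, 'SIMILAR', word_b, score) for every pair of words."""
    return [
        (a, "SIMILAR", b, cosine_similarity(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]


def correlation_triples(dependency_arcs):
    """Turn (head_word, dependent_word) arcs into deduplicated 'RELATED' triples."""
    seen, triples = set(), []
    for head, dep in dependency_arcs:
        if (head, dep) not in seen:
            seen.add((head, dep))
            triples.append((head, "RELATED", dep))
    return triples


# Toy 2-dimensional vectors; real semantic vectors would have 64 or more dimensions.
vectors = {"sea": [1.0, 0.0], "ocean": [1.0, 0.0], "desk": [0.0, 1.0]}
print(similarity_triples(vectors))
print(correlation_triples([("sail", "sea"), ("sail", "sea"), ("cross", "ocean")]))
```

Each resulting tuple corresponds to one relationship that could then be written to the graph database, with the score stored as the attribute of the similarity relationship.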
  • the automatic text generation method provided in the embodiment of the present invention further includes: verifying the generated text.
  • verifying the generated text means checking it for errors and correcting them, so that the generated text conforms to current grammatical rules and the sentences are fluent.
  • the first word of a sentence must not be a particle or modal particle such as "le" or "ah". A dictionary containing such words is created. If the first word of a generated sentence is in the dictionary, a new sentence is regenerated by the Transformer model and replaces the original sentence.
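The first-word check can be sketched as below. The dictionary contents, the `regenerate` callback (standing in for a call to the generation model) and the `max_attempts` guard are all assumptions introduced for illustration; the source only states that a sentence starting with a listed particle is regenerated.

```python
# Hypothetical particle dictionary ("le" and "ah"); a real one would be curated.
BANNED_FIRST_WORDS = {"了", "啊"}


def fix_sentence_start(tokens, regenerate, max_attempts=3):
    """Regenerate a tokenized sentence while its first word is a banned particle.

    regenerate: callable taking the current token list and returning a new one.
    max_attempts: assumed safety limit so the loop always terminates.
    """
    attempts = 0
    while tokens and tokens[0] in BANNED_FIRST_WORDS and attempts < max_attempts:
        tokens = regenerate(tokens)
        attempts += 1
    return tokens


# Stand-in "model" that simply drops the offending first token.
print(fix_sentence_start(["了", "天气", "好"], regenerate=lambda t: t[1:]))
```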
  • the collocation between words follows certain grammatical structures, such as adjectives followed by nouns, verbs followed by adverbs, and so on.
  • the system performs dependency syntax analysis and part-of-speech tagging on the generated text (which can be implemented with the ltp and hanlp libraries) and checks the result against established rules (a verb-object structure should correspond to a verb and a noun, an adverbial structure should correspond to an adverb and an adjective, and so on); sentences that do not conform to the rules are regenerated with the Transformer model.
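A rule check of this kind might look like the sketch below. It assumes the parser output has already been reduced to `(relation, head_pos, dependent_pos)` tuples; the relation names, part-of-speech labels and the rule table are illustrative and do not reproduce the tag set of any particular parser.

```python
# Allowed (head POS, dependent POS) pairs per dependency relation.
# Illustrative labels only; a real system would use its parser's own tag set.
RULES = {
    "verb-object": {("verb", "noun")},
    "adverbial": {("adjective", "adverb")},
}


def violates_rules(arcs):
    """Return True if any arc with a known relation has a disallowed POS pair.

    arcs: iterable of (relation, head_pos, dependent_pos) tuples.
    Relations that are not in RULES are ignored.
    """
    for relation, head_pos, dep_pos in arcs:
        allowed = RULES.get(relation)
        if allowed is not None and (head_pos, dep_pos) not in allowed:
            return True
    return False


print(violates_rules([("verb-object", "verb", "noun")]))
print(violates_rules([("verb-object", "verb", "adverb")]))
```

A sentence for which `violates_rules` returns `True` would be handed back to the generation model.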
  • FIG. 3 is a complete schematic flowchart of the automatic text generation method provided in the embodiment of the present invention.
  • as shown in FIG. 3, first, on the one hand, the keyword information and the target word count input by the user are obtained, from which the keywords of the text to be generated and the estimated number of sentences are determined; on the other hand, words are extracted to construct the word association map. Then, based on the keywords, the estimated number of sentences, and the constructed word association map, the number of paragraphs and the paragraph subject words of the text to be generated are determined. Next, the text to be generated is generated by the Transformer model. Finally, the generated text is verified.
  • an embodiment of the present invention provides an automatic text generation device, including an acquisition module 41, a determination module 42 and a text generation module 43, wherein:
  • the acquisition module 41 is used to acquire the keywords of the text to be generated and the estimated number of sentences, respectively;
  • the determination module 42 is configured to determine the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map;
  • the text generation module 43 is configured to generate the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.
  • the functions of the modules in the automatic text generation device provided in the embodiments of the present invention correspond one-to-one with the operation procedures of the steps in the above-mentioned method embodiments, and the achieved effects are also the same.
  • for details, refer to the above-mentioned embodiments, which are not repeated in this embodiment of the present invention.
  • the determining module is specifically used for:
  • the number of paragraphs is determined to be the default number, and based on the word association map, a related word list composed of the words that have a correlation relationship with each keyword is determined;
  • a reserved number of words is determined, and the paragraph topic word is determined based on the words of the reserved word number in the topic word list.
  • the determining module is specifically used for:
  • the number of paragraphs is determined as the default number, and based on the word association map, the number of words that are related to each keyword is determined;
  • based on the word association map, the similar words that have a similarity relationship with that keyword are determined, and based on the word association map, a related word list is determined for each similar word;
  • the paragraph topic word is determined based on the words in the topic word list.
  • the determining module is specifically used for:
  • the paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
  • the determining module is specifically used for:
  • the paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
  • the automatic text generation device provided in the embodiment of the present invention further includes: a graph construction module, which is used for:
  • the word association map is constructed based on the similarity relationship between any two sample words and the correlation relationship between any two sample words.
  • the acquisition module is specifically used for:
  • the estimated sentence count is determined.
  • FIG. 5 illustrates a schematic diagram of the physical structure of an electronic device.
  • the electronic device may include a processor 510, a communication interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540.
  • the processor 510 can call the logic instructions in the memory 530 to execute the automatic text generation method. The method includes: acquiring the keywords of the text to be generated and the estimated number of sentences, respectively; determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map; and generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.
  • the above-mentioned logic instructions in the memory 530 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
  • the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • an embodiment of the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the automatic text generation method provided by the above method embodiments. The method includes: acquiring the keywords of the text to be generated and the estimated number of sentences, respectively; determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map; and generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.
  • embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the automatic text generation method provided by the above embodiments is executed. The method includes: acquiring the keywords of the text to be generated and the estimated number of sentences, respectively; determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map; and generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.
  • the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
  • each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware.
  • the above-mentioned technical solutions, in essence, or the parts that contribute to the prior art, can be embodied in the form of software products. The computer software products can be stored in computer-readable storage media, such as ROM/RAM, a magnetic disk or an optical disc, and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in some parts of the embodiments.


Abstract

An automatic text generation method and apparatus, and an electronic device and a storage medium. The method comprises: firstly, respectively acquiring a keyword and an estimated sentence quantity of text to be generated (S1); then, determining a paragraph quantity and a paragraph subject word of said text on the basis of the keyword, the estimated sentence quantity and a pre-constructed word association graph (S2); and finally, generating said text on the basis of a transformer model, and the paragraph quantity and the paragraph subject word of said text (S3). The method is a new text generation method implemented by means of a transformer model. Screening and determination of a paragraph subject word are introduced, so that the subject of the generated text can be extended and constrained and the generated text contains a core idea. Moreover, because the transformer model is used, the generated text is no longer uniform in content and fixed in format like text generated by means of a traditional method.

Description

Automatic text generation method, device, electronic device and storage medium
Technical Field
The present invention relates to the technical field of artificial intelligence, and in particular to an automatic text generation method, device, electronic device and storage medium.
Background
At present, text generation based on artificial intelligence (AI) is a challenging task in the field of natural language processing; its goal is to enable computers to write high-quality articles as humans do. This requires the adopted model to have a stronger ability to understand and generate text. There are two traditional text generation methods: generation based on rules and templates, and generation based on extraction. The text produced by both methods is relatively fixed in format, and neither can generate text with rich content and diverse styles.
Summary of the Invention
Embodiments of the present invention provide an automatic text generation method, device, electronic device and storage medium to overcome the defects of the prior art.
An embodiment of the present invention provides an automatic text generation method, including:
acquiring the keywords of the text to be generated and the estimated number of sentences, respectively;
determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and a pre-built word association map;
generating the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated and the paragraph subject words.
According to the automatic text generation method of an embodiment of the present invention, determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a first condition, determining the number of paragraphs to be a default number, and determining, based on the word association map, a related word list composed of the words that have a correlation relationship with each keyword;
aggregating the related word lists corresponding to all keywords to determine a subject word list;
determining a word retention number based on the estimated number of sentences, and determining the paragraph subject words based on that number of words from the subject word list.
According to the automatic text generation method of an embodiment of the present invention, determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a second condition, determining the number of paragraphs to be the default number, and determining, based on the word association map, the number of words that have a correlation relationship with each keyword;
if the number of words corresponding to any keyword is less than or equal to a first preset threshold, determining, based on the word association map, the similar words that have a similarity relationship with that keyword, and determining, based on the word association map, a related word list for each similar word;
aggregating all related word lists to determine a subject word list;
if it is determined that the number of words in the subject word list is greater than or equal to a second preset threshold, determining the paragraph subject words based on the words in the subject word list.
According to the automatic text generation method of an embodiment of the present invention, determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a third condition, determining, based on the word association map, a related word list composed of the words that have a correlation relationship with each keyword;
aggregating the related word lists corresponding to all keywords to determine a subject word list;
determining a word retention number based on the estimated number of sentences, and determining the paragraph subject words based on that number of words from the subject word list;
clustering the paragraph subject words, and determining the number of paragraphs based on the clustering result.
According to the automatic text generation method of an embodiment of the present invention, determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a fourth condition, determining, based on the word association map, a related word list composed of the words that have a correlation relationship with each keyword;
aggregating the related word lists corresponding to all keywords to determine a subject word list;
if it is determined that the number of words in the subject word list is greater than or equal to a second preset threshold, determining the paragraph subject words based on the words in the subject word list;
clustering the paragraph subject words, and determining the number of paragraphs based on the clustering result.
According to the automatic text generation method of an embodiment of the present invention, the word association map is specifically constructed as follows:
acquiring the semantic vector of each sample word in a corpus, and calculating the similarity between the semantic vectors of any two sample words, the similarity being used to represent the similarity relationship between the two sample words;
performing dependency analysis on any two sample words in the corpus to determine the dependency between them, the dependency being used to represent the correlation relationship between the two sample words;
constructing the word association map based on the similarity relationship and the correlation relationship between any two sample words.
According to the automatic text generation method of an embodiment of the present invention, the estimated number of sentences is specifically obtained as follows:
determining the target word count of the text to be generated;
determining the estimated number of sentences based on the target word count.
An embodiment of the present invention also provides an automatic text generation device, including an acquisition module, a determination module and a text generation module, wherein:
the acquisition module is used to acquire the keywords of the text to be generated and the estimated number of sentences, respectively;
the determination module is used to determine the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map;
the text generation module is used to generate the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated and the paragraph subject words.
An embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the automatic text generation methods described above.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any of the automatic text generation methods described above.
The automatic text generation method, device, electronic device and storage medium provided by the embodiments of the present invention first acquire the keywords of the text to be generated and the estimated number of sentences; then determine the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map; and finally generate the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated and the paragraph subject words. This is a new type of text generation method implemented with the Transformer model. It introduces the screening and determination of paragraph subject words, which makes it possible to extend and constrain the subject of the generated text so that the generated text has a core idea. At the same time, with the Transformer model, the generated text is no longer uniform in content and fixed in format like text generated by traditional methods, and the method can be widely applied in fields such as report generation, literary creation and intelligent question answering.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of an automatic text generation method provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of determining the number of paragraphs and the paragraph subject words of the text to be generated in an automatic text generation method provided by an embodiment of the present invention;
FIG. 3 is a complete schematic flowchart of an automatic text generation method provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an automatic text generation device provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The text produced by traditional text generation methods is relatively fixed in format, and these methods cannot generate text with rich content and diverse styles. For this reason, an embodiment of the present invention provides an automatic text generation method to solve the problems existing in the prior art.
FIG. 1 is a schematic flowchart of an automatic text generation method provided in an embodiment of the present invention. As shown in FIG. 1, the method includes:
S1, acquiring the keywords of the text to be generated and the estimated number of sentences, respectively;
S2, determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences and a pre-built word association map;
S3, generating the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated and the paragraph subject words.
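Steps S1 to S3 can be read as a three-stage pipeline. The skeleton below is only a sketch of that control flow: the `Plan` container and the three callables (`acquire`, `plan_paragraphs`, `generate_paragraph`) are hypothetical placeholders for keyword acquisition, paragraph planning with the word association map, and generation with the Transformer model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    paragraph_count: int
    subject_words: List[List[str]]  # one list of subject words per paragraph


def generate_text(user_input, target_word_count,
                  acquire, plan_paragraphs, generate_paragraph):
    """Run S1 (acquire), S2 (plan) and S3 (generate) in order."""
    # S1: keywords and estimated number of sentences.
    keywords, sentence_estimate = acquire(user_input, target_word_count)
    # S2: number of paragraphs and paragraph subject words.
    plan = plan_paragraphs(keywords, sentence_estimate)
    # S3: one generated paragraph per planned set of subject words.
    return [generate_paragraph(words)
            for words in plan.subject_words[:plan.paragraph_count]]


# Stub stages, used only to show how the pieces fit together.
paragraphs = generate_text(
    "sea ship", 66,
    acquire=lambda text, n: (text.split(), max(1, n // 33)),
    plan_paragraphs=lambda kws, est: Plan(1, [kws]),
    generate_paragraph=lambda words: " ".join(words),
)
print(paragraphs)
```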
Specifically, the automatic text generation method provided in the embodiment of the present invention is executed by a server, which may be a local server or a cloud server; the local server may be a computer or the like, which is not specifically limited in the embodiment of the present invention.
Step S1 is performed first. The keywords of the text to be generated and the estimated number of sentences need to be acquired. The keywords are the keywords needed to generate the text, and they can be determined from the keyword-related information in the user input. The keyword-related information may be a single keyword, multiple keywords, or a piece of text containing one or more sentences. When the keyword-related information is a single keyword or multiple keywords, the input keywords are the keywords of the text to be generated. When the keyword-related information is a piece of text, the more important words are first extracted from the text input by the user; the extraction can be implemented either by an extraction algorithm or by a syntactic analysis algorithm. Extraction algorithms include the tf-idf algorithm, the textrank algorithm, and the like. In this embodiment of the present invention, to ensure the quality of the final generated text, the extraction method can be selected by passing a parameter. For example, the user input may further include an extraction parameter whose different values indicate the different extraction methods selected by the user. The extracted words are then subjected to stop-word removal, that is, the stop words among them are removed, to obtain the keywords of the text to be generated.
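As one possible reading of the extraction path for a text input, the sketch below ranks the words of an already tokenized text by a smoothed TF-IDF score and then removes stop words. The smoothing formula, the `top_k` cut-off, the function name and the toy corpus are assumptions; the source names tf-idf and textrank only as candidate algorithms and does not fix these details.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, stop_words, top_k=5):
    """Rank tokens of one text by TF-IDF, then drop stop words.

    doc_tokens: tokens of the user's input text (already segmented).
    corpus: list of token lists used to compute document frequencies.
    """
    if not doc_tokens:
        return []
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # assumed smoothed IDF
        scores[word] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    # Stop-word removal is applied after extraction, as in the description.
    return [w for w in ranked if w not in stop_words][:top_k]


corpus = [["the", "sea"], ["the", "ship"], ["the", "port"]]
print(tfidf_keywords(["the", "sea", "sea", "ship", "the"], corpus, {"the"}, top_k=2))
```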
The estimated number of sentences is an estimate of how many sentences the text to be generated may contain, and it can also be determined from the user input. Here, the user input may also include the target word count of the text to be generated. In this embodiment, sentence lengths in the training corpus were counted and averaged, giving a mean sentence length between 33 and 34 characters, so the default sentence length is set to 33 characters. The estimated number of sentences can then be determined from the ratio of the target word count to the default sentence length. Note that the default sentence length does not mean that every sentence in the generated text contains 33 characters, and the estimated number of sentences is only a preliminary estimate. In particular, if the target word count entered by the user is less than 33, the estimated number of sentences defaults to 1.
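The estimate described above amounts to a single division. The sketch below assumes the ratio is rounded to the nearest integer, which the text does not specify; the function name is hypothetical.

```python
DEFAULT_SENTENCE_LENGTH = 33  # default sentence length from the training-corpus statistics


def estimate_sentence_count(target_word_count):
    """Estimate the number of sentences from the target word count."""
    if target_word_count < DEFAULT_SENTENCE_LENGTH:
        return 1  # very short targets always yield a single sentence
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(target_word_count / DEFAULT_SENTENCE_LENGTH)
```

With this rounding, a target of 500 gives 15 sentences, which matches the later remark that fewer than 15 estimated sentences corresponds to roughly 500 words.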
Step S2 is then performed: the number of paragraphs and the paragraph topic words of the text to be generated are determined according to the keywords, the estimated number of sentences, and a pre-built word association graph. Depending on the number of keywords and the estimated number of sentences, different processing modes can be used to determine the number of paragraphs and the paragraph topic words. In this embodiment, the processing mode is therefore first selected according to the conditions that the number of keywords and the estimated number of sentences satisfy, and the number of paragraphs and the paragraph topic words are then determined with that processing mode in combination with the pre-built word association graph. The word association graph is built in advance from the training corpus and represents the association relationships between words. These relationships may include similarity relationships and correlation relationships. A similarity relationship characterizes how similar two words are and can be determined from the similarity between them. A correlation relationship characterizes the dependency between two words and can be determined by dependency analysis of the sentences in which the two words occur. This is not specifically limited in this embodiment of the present invention.
Finally, step S3 is performed: the text to be generated is generated according to the Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words. The Transformer model combines the number of paragraphs and the paragraph topic words and determines each next sentence from the preceding sentences in the same paragraph. The Transformer model may take four inputs:
First input: the semantic vector of the previous sentence. If there is no previous sentence, this input is a zero vector of the corresponding dimension.
Second input: the word vector of a randomly selected paragraph topic word.
Third input: the sum of the word vectors of all keywords.
Fourth input: the closing-sentence indicator vector. For example, if the sentence is a closing sentence, this is a constant vector of the corresponding dimension filled with the value 8; otherwise, it is a constant vector of the corresponding dimension filled with the value 1.
All input vectors are concatenated along the last dimension and then fed into the Transformer model. Based on the semantics of the preceding text, the Transformer model outputs the current sentence of the article and, at the same time, the semantic vector of that sentence.
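The assembly of the four inputs can be sketched with plain Python lists. This assumes one-dimensional vectors that all share the same dimension `dim`; the function and parameter names are hypothetical, and a real implementation would more likely use tensors.

```python
def build_model_input(prev_sentence_vec, topic_word_vec, keyword_vecs,
                      is_closing_sentence, dim):
    """Concatenate the four inputs along the last (here: only) dimension."""
    if prev_sentence_vec is None:
        prev_sentence_vec = [0.0] * dim            # no previous sentence: zero vector
    # Third input: element-wise sum of all keyword vectors.
    keyword_sum = [sum(column) for column in zip(*keyword_vecs)]
    # Fourth input: constant vector of 8s for a closing sentence, otherwise 1s.
    end_flag = [8.0 if is_closing_sentence else 1.0] * dim
    return list(prev_sentence_vec) + list(topic_word_vec) + keyword_sum + end_flag
```

For `dim = 2`, the result has length 8: two entries for each of the four inputs, in the order listed in the text.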
Let A denote the semantic vector of the previous sentence and B the semantic vector of the current sentence. The semantic vector fed into the Transformer model at the next step is then A' = A*0.1 + B*0.9.
The word count of the text is tallied after each run of the Transformer model. When it approaches the target word count of the current paragraph, the fourth input is switched so that the model outputs the closing sentence of the paragraph.
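The per-paragraph loop described above, including the update A' = A*0.1 + B*0.9, might look like the sketch below. Here `model_step` is a hypothetical callable standing in for the Transformer model: it receives the current semantic vector and the closing flag and returns a sentence plus its semantic vector. The `margin` that decides when the count is "close" to the target is also an assumption.

```python
def blend(prev_vec, cur_vec):
    """Semantic vector for the next step: A' = A*0.1 + B*0.9."""
    return [0.1 * a + 0.9 * b for a, b in zip(prev_vec, cur_vec)]


def generate_paragraph(model_step, target_chars, dim, margin=33):
    """Generate sentences until the paragraph reaches its target length."""
    sentences = []
    state = [0.0] * dim        # zero vector before the first sentence
    total = 0
    while True:
        # Switch the closing flag once the remaining budget is within the margin.
        closing = target_chars - total <= margin
        sentence, cur_vec = model_step(state, closing)
        sentences.append(sentence)
        total += len(sentence)  # count after every model run
        if closing:
            return sentences
        state = blend(state, cur_vec)
```

Because the blend weights the current sentence at 0.9, the state mostly follows the latest sentence while retaining a small trace of the earlier context.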
Note that in this embodiment of the present invention, words and sentences are converted into text vectors when they are used. Many conversion methods exist; as a preferred solution, the BERT pre-trained model can be used to convert text into semantic vectors. The automatic text generation method in this embodiment can be developed and implemented in Python.
The automatic text generation method provided in this embodiment of the present invention first obtains the keywords of the text to be generated and the estimated number of sentences; then determines the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph; and finally generates the text based on the Transformer model, the number of paragraphs, and the paragraph topic words. This is a new text generation method built on the Transformer model. Introducing the screening and determination of paragraph topic words makes it possible to both expand and constrain the topic of the generated text, so that the generated text has a core idea. Using the Transformer model also means that the generated text is no longer limited to the uniform content and fixed format produced by traditional methods, so the method can be applied widely in fields such as report generation, literary creation, and intelligent question answering.
On the basis of the above embodiment, if the user input does not contain keyword-related information, words can be drawn at random from a popular-word lexicon as the keywords of the text to be generated. The popular-word lexicon can be built by collecting popular words on a daily basis and can be updated and maintained regularly.
On the basis of the above embodiment, in the automatic text generation method provided in this embodiment of the present invention, the estimated number of sentences is obtained as follows:
determining the target word count of the text to be generated;
determining the estimated number of sentences based on the target word count.
Specifically, to determine the estimated number of sentences, the target word count of the text to be generated is determined first. The target word count can be entered by the user, that is, the user input may contain it. The estimated number of sentences is then determined from the target word count; specifically, the ratio of the target word count to the default sentence length can be used as the estimated number of sentences.
Introducing the target word count when determining the estimated number of sentences means that the generated text no longer has an arbitrary length: text of the length the user wants can be generated according to the user's needs.
On the basis of the above embodiment, if the user input does not contain a target word count, a number between 500 and 5000 can be chosen at random as the target word count. Note that the actual word count of the generated text is not necessarily equal to the target word count. When the target word count is less than 500, the actual word count may deviate by about 50 words in either direction; when the target word count is greater than 500, the actual word count may deviate by 50 to 200 words. Both are within a controllable range. If the target word count is too small, for example less than 33, only one sentence is generated, and that sentence is generated entirely from the semantics of the keywords obtained in step S1.
FIG. 2 is a schematic flowchart of how different processing modes are selected in this embodiment of the present invention when the keywords and the estimated number of sentences satisfy different conditions. It is described in detail with reference to the following embodiments.
On the basis of the above embodiment, in the automatic text generation method provided in this embodiment of the present invention, determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a first condition, setting the number of paragraphs to a default number and, based on the word association graph, determining for each keyword a related word list composed of the words that have a correlation relationship with that keyword;
aggregating the related word lists of all keywords to determine a topic word list;
determining a word retention number based on the estimated number of sentences, and determining the paragraph topic words from the first words of the topic word list, up to the word retention number.
Specifically, in this embodiment of the present invention, an estimated number of sentences below 15 corresponds to a target word count of roughly 500. For such a text, if there are enough keywords (by default, 2 or more), the keywords suffice to screen out the paragraph topic words, and the first processing mode can be used to determine the final number of paragraphs and the paragraph topic words. When there are not enough keywords but the number of sentences to be generated is small (by default, fewer than 8), the paragraph topic words can still be screened out, so the first processing mode is used as well. In other words, the first condition may be either that the number of keywords is greater than or equal to a first threshold and the estimated number of sentences is less than a second threshold, or that the number of keywords is less than the first threshold and the estimated number of sentences is less than a third threshold. The three thresholds can be set as required, with the third threshold smaller than the second; for example, the first threshold may be 2, the second 15, and the third 8. The first condition corresponds to the first processing mode: if the keywords and the estimated number of sentences satisfy the first condition, the number of paragraphs and the paragraph topic words are determined with the first processing mode, as shown in FIG. 2.
The first processing mode is as follows. First, the number of paragraphs is set to a default number, which can be chosen as needed and according to the specific content of the first condition; for example, it can be set to 1. Then, based on the word association graph, a related word list is determined for each keyword, composed of the words that have a correlation relationship with it. Assuming there are n keywords, each of the n keywords is queried in the word association graph, and the query is filtered by relationship type so that only words having a correlation relationship with the queried keyword are returned. The result of each keyword query is stored in its own list, which gives n query result lists; each query result list is the related word list of one keyword. An empty topic dictionary is then created, and the n query result lists are aggregated, deduplicated, and stored in it: each key of the topic dictionary is a single deduplicated word, and every value is initialized to 0. Next, the number of the n lists in which each word of the topic dictionary appears is counted, and the value of the word is incremented by 1 for each occurrence. Finally, the topic dictionary is sorted by value in descending order. Since a Python dictionary is not a sorted structure, the sorted result is stored as a list, which is the topic word list. For example: [(word 1, 7), (word 2, 7), (word 3, 5), (word 4, 2)]. The topic word list consists of tuples, each holding two values: the word itself and the number of times it occurs.
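The construction of the topic word list described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that a word repeated within a single query result list is counted once for that list.

```python
def build_topic_word_list(related_lists):
    """Merge n related word lists into a list of (word, count) tuples.

    The count is the number of lists in which the word appears; the result is
    sorted by count in descending order.
    """
    topic_dict = {}
    for words in related_lists:
        # dict.fromkeys removes duplicates within one list while keeping order.
        for word in dict.fromkeys(words):
            topic_dict[word] = topic_dict.get(word, 0) + 1
    # sorted() returns a list of tuples, matching the format in the text.
    return sorted(topic_dict.items(), key=lambda item: item[1], reverse=True)
```

Because `sorted` is stable, words with equal counts keep the order in which they were first encountered.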
The word retention number is then determined from the estimated number of sentences. Specifically, the estimated number of sentences is multiplied by 0.6 and the result is rounded up to give the word retention number. The topic word list is truncated to the word retention number, and the words that remain are the paragraph topic words. These paragraph topic words can then be added to a new list, which is the paragraph topic word list. Because there is only one paragraph, there is also only one paragraph topic word list.
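A minimal sketch of this truncation, assuming the topic word list is the list of (word, count) tuples described earlier; the function name is hypothetical. The factor 0.6 is written as 3/5 so that the rounding up is not disturbed by floating-point error.

```python
import math


def select_paragraph_topic_words(topic_word_list, estimated_sentences):
    """Keep the first ceil(0.6 * estimated_sentences) words of the topic word list."""
    keep = math.ceil(estimated_sentences * 3 / 5)   # 0.6 written as 3/5
    return [word for word, _count in topic_word_list[:keep]]
```

For example, an estimate of 4 sentences gives 0.6 * 4 = 2.4, which rounds up to 3 retained words.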
This embodiment of the present invention provides a method for determining the number of paragraphs and the paragraph topic words of the text to be generated, namely the first processing mode, which allows the generated text to be divided into paragraphs intelligently without applying a template.
On the basis of the above embodiment, in the automatic text generation method provided in this embodiment of the present invention, determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a second condition, setting the number of paragraphs to a default number and, based on the word association graph, determining the number of words that have a correlation relationship with each keyword;
if the number of words corresponding to any keyword is less than or equal to a first preset threshold, determining, based on the word association graph, the similar words that have a similarity relationship with that keyword, and determining, based on the word association graph, a related word list for each similar word;
aggregating all related word lists to determine a topic word list;
if it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, determining the paragraph topic words based on the words in the topic word list.
Specifically, in this embodiment of the present invention, when the estimated number of sentences is greater than or equal to 8 and less than 15 and there are not enough keywords (by default, fewer than 2, that is, only one keyword), the second processing mode is used to determine the final number of paragraphs and the paragraph topic words. The first processing mode is not used here because, with so few keywords, too few paragraph topic words might be obtained, which would limit the freedom of the article's topic. In other words, the second condition may be that the number of keywords is less than the first threshold and the estimated number of sentences is less than the second threshold and greater than or equal to the third threshold. The second condition corresponds to the second processing mode: if the keywords and the estimated number of sentences satisfy the second condition, the number of paragraphs and the paragraph topic words are determined with the second processing mode, as shown in FIG. 2.
The second processing mode is as follows. First, the number of paragraphs is set to a default number, which can be chosen as needed and according to the specific content of the second condition; for example, it can be set to 1. The single keyword is then queried in the word association graph for correlation relationships, which gives the number of words that have a correlation relationship with the keyword.
If the number of words related to the keyword is less than or equal to the first preset threshold, the keyword needs to be expanded. The keyword is queried in the word association graph for similarity relationships to find the similar words that have a similarity relationship with it, and the keyword is expanded with these similar words. The first preset threshold may be 0.6 times the estimated number of sentences. The default similarity threshold for filtering is 0.98; if no similar word is found, the threshold is lowered by 0.01 at a time until similar words are found. Suppose k similar words are found. A related-word query is run for each of them, which gives k related word lists. All words in these k lists are merged, deduplicated, and placed in a new list, which is the topic word list. If the number of words in the topic word list is greater than or equal to the second preset threshold, the words in the topic word list are used directly as the paragraph topic words, that is, the topic word list is the paragraph topic word list. The second preset threshold may be 0.6 times the estimated number of sentences. If the number of words in the topic word list is less than the second preset threshold, the similarity threshold is lowered by a further 0.01, new similar words are retrieved, and the above steps are repeated until the number of words in the topic word list is greater than or equal to the second preset threshold.
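The expansion loop above can be sketched with the word association graph represented as two plain dictionaries. That representation is an assumption of this sketch: `related` maps a word to its related words, and `similar` maps a word to (word, similarity) pairs. The function name and the decision to stop once the threshold reaches 0 are likewise assumptions.

```python
def expand_topic_words(keyword, related, similar, estimated_sentences,
                       start=0.98, step=0.01):
    """Collect topic words for a single keyword, expanding via similar words."""
    needed = estimated_sentences * 3 / 5     # both preset thresholds: 0.6 x estimate
    direct = related.get(keyword, [])
    if len(direct) > needed:
        return list(direct)                  # enough related words, no expansion
    threshold = start
    topic_words = []
    while threshold >= 0:
        matches = [w for w, score in similar.get(keyword, []) if score >= threshold]
        topic_words = []
        for word in matches:                 # merge and deduplicate the related lists
            for rel in related.get(word, []):
                if rel not in topic_words:
                    topic_words.append(rel)
        if len(topic_words) >= needed:
            break
        threshold = round(threshold - step, 2)   # lower the similarity threshold
    return topic_words
```

Rounding the threshold to two decimals after each step keeps repeated subtraction of 0.01 from drifting because of floating-point error.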
If the number of words related to the keyword is greater than the first preset threshold, the keyword does not need to be expanded, and the procedure is equivalent to the first processing mode.
This embodiment of the present invention provides a method for determining the number of paragraphs and the paragraph topic words of the text to be generated, namely the second processing mode, which allows the generated text to be divided into paragraphs intelligently without applying a template. Moreover, the second processing mode suits the case in which there are too few keywords: it ensures that a moderate number of paragraph topic words is obtained and increases the freedom of the article's topic.
On the basis of the above embodiment, in the automatic text generation method provided in this embodiment of the present invention, determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a third condition, determining, based on the word association graph, a related word list for each keyword composed of the words that have a correlation relationship with that keyword;
aggregating the related word lists of all keywords to determine a topic word list;
determining a word retention number based on the estimated number of sentences, and determining the paragraph topic words from the first words of the topic word list, up to the word retention number;
clustering the paragraph topic words, and determining the number of paragraphs based on the clustering result.
Specifically, in this embodiment of the present invention, when the estimated number of sentences is greater than or equal to 15, it is compared with the number of keywords. If the estimated number of sentences is less than or equal to 1.5 times the number of keywords, the third processing mode is used to determine the number of paragraphs and the paragraph topic words. In other words, the third condition may be that the estimated number of sentences is greater than or equal to the second threshold and less than or equal to a preset multiple of the number of keywords. The third condition corresponds to the third processing mode: if the keywords and the estimated number of sentences satisfy the third condition, the number of paragraphs and the paragraph topic words are determined with the third processing mode, as shown in FIG. 2.
The third processing mode is similar to the first. Each keyword is queried directly in the word association graph; if there are k keywords, k related word lists are obtained. The words in these k related word lists are aggregated and deduplicated, which gives the topic word list. The estimated number of sentences is then multiplied by 0.6, and the resulting value is used to truncate the topic word list. The truncated list is the paragraph topic word list, and the words it contains are the paragraph topic words.
The words in the paragraph topic word list are then clustered, and the number of paragraphs is determined from the clustering result according to the following formula:
number of paragraphs = max(3, number of clusters)
Since a single topic word cannot form a category on its own, it is assigned to the nearest category. Once the number of paragraphs is determined, the words of each category correspond to one paragraph, so that paragraphs and paragraph topic word lists correspond one to one.
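The two rules above (absorbing single-word clusters and applying the formula) can be sketched independently of any particular clustering algorithm, which the text does not name. The sketch assumes the clusters are already available as lists of words and that a `distance` function between two words is supplied by the caller; both function names are hypothetical.

```python
def merge_singletons(clusters, distance):
    """Assign every single-word cluster to its nearest multi-word cluster."""
    kept = [list(c) for c in clusters if len(c) > 1]
    singles = [c[0] for c in clusters if len(c) == 1]
    if not kept:                       # nothing to merge into
        return [singles] if singles else []
    for word in singles:
        # Nearest cluster = the one containing the closest member word.
        nearest = min(kept, key=lambda c: min(distance(word, w) for w in c))
        nearest.append(word)
    return kept


def paragraph_count(clusters):
    """number of paragraphs = max(3, number of clusters)"""
    return max(3, len(clusters))
```

How a result with fewer than three clusters is mapped onto three paragraphs is not specified in the text, so the sketch only computes the count.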
This embodiment of the present invention provides a method for determining the number of paragraphs and the paragraph topic words of the text to be generated, namely the third processing mode, which allows the generated text to be divided into paragraphs intelligently without applying a template. Moreover, the third processing mode suits the case in which the estimated number of sentences is large and helps keep the result accurate.
On the basis of the above embodiment, in the automatic text generation method provided in this embodiment of the present invention, determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically includes:
if it is determined that the keywords and the estimated number of sentences satisfy a fourth condition, determining, based on the word association graph, a list for each keyword composed of the words that have a correlation relationship with that keyword;
aggregating the lists of all keywords to determine a topic word list;
if it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, determining the paragraph topic words based on the words in the topic word list;
clustering the paragraph topic words, and determining the number of paragraphs based on the clustering result.
Specifically, in this embodiment of the present invention, when the estimated number of sentences is greater than or equal to 15, it is compared with the number of keywords. If the estimated number of sentences is greater than 1.5 times the number of keywords, the fourth processing mode is used to determine the number of paragraphs and the paragraph topic words. In other words, the fourth condition may be that the estimated number of sentences is greater than or equal to the second threshold and greater than the preset multiple of the number of keywords. The fourth condition corresponds to the fourth processing mode: if the keywords and the estimated number of sentences satisfy the fourth condition, the number of paragraphs and the paragraph topic words are determined with the fourth processing mode, as shown in FIG. 2.
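The four conditions described so far can be collected into one selection function. This is a sketch using the example values from the text (first threshold 2, second threshold 15, third threshold 8, preset multiple 1.5); the function name and the integer mode labels are hypothetical.

```python
def select_mode(num_keywords, est_sentences, t1=2, t2=15, t3=8, multiple=1.5):
    """Return the processing mode (1 to 4) for the given counts."""
    if est_sentences >= t2:
        # Long texts: compare the estimate with a multiple of the keyword count.
        return 3 if est_sentences <= multiple * num_keywords else 4
    if num_keywords >= t1 or est_sentences < t3:
        return 1      # enough keywords, or very few sentences
    return 2          # too few keywords and t3 <= estimate < t2
```

For instance, a single keyword with 10 estimated sentences selects the second mode, while 2 keywords with 20 estimated sentences selects the fourth.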
The fourth processing mode is as follows: the paragraph topic words are determined in a manner similar to the second processing mode, and the number of paragraphs is determined in a manner similar to the third processing mode. First, related words are searched for each keyword to obtain a related word list per keyword; the words in all related word lists are then merged, deduplicated, and placed in a single list, which is the topic word list. If the number of words in the topic word list is greater than or equal to the second preset threshold, the situation resembles the third processing mode: word clustering is performed on the topic word list, and the number of paragraphs and the topic words of each paragraph are obtained as in the third processing mode. If the number of words in the topic word list is less than the second preset threshold, similar-word matching is required. For each keyword, similar words are matched as in the second processing mode, related words are then matched for those similar words, and all related words are finally collected into a single list, which becomes the topic word list. If the number of words in this topic word list is still less than the second preset threshold, the similarity matching threshold is lowered and the related words are obtained again. If the number of words in the topic word list is greater than the second preset threshold and the second preset threshold is greater than 6 (a topic word list whose second preset threshold does not exceed 6 is already as required), words are drawn at random without replacement from the topic word list until the number of words in the final topic word list, divided by 0.6 and then reduced by the estimated number of sentences, lies between 0 and 10. The words in the resulting topic word list are the paragraph topic words. This topic word list is then clustered to obtain the number of paragraphs of the article and the topic word list corresponding to each paragraph.
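The random draw described above can be sketched as follows. This is one possible reading of the stopping rule, not a definitive implementation: words are drawn one at a time without replacement until k / 0.6 − S lies in [0, 10], where k is the number of words drawn and S is the estimated number of sentences. The function name and the `seed` parameter are hypothetical. The test is written in integer form (5k − 3S in [0, 30]) to avoid floating-point rounding.

```python
import random


def draw_topic_words(topic_words, estimated_sentences, seed=None):
    """Draw words without replacement until k/0.6 - S lies in [0, 10].

    Assumption: the draw stops at the first k that satisfies the rule.
    If the pool is exhausted first, all words are returned.
    """
    rng = random.Random(seed)
    pool = list(topic_words)
    rng.shuffle(pool)  # a shuffled pool gives draws without replacement

    selected = []
    for word in pool:
        selected.append(word)
        # k/0.6 - S in [0, 10]  <=>  5k - 3S in [0, 30]
        margin = 5 * len(selected) - 3 * estimated_sentences
        if 0 <= margin <= 30:
            break
    return selected
```

With 30 candidate words and 20 estimated sentences, this sketch stops at 12 words, since 12 / 0.6 − 20 = 0.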
This embodiment provides a method for determining the number of paragraphs and the paragraph topic words of the text to be generated, namely the fourth processing mode, which allows the generated text to be divided into paragraphs intelligently without applying a template. Moreover, the fourth processing mode suits cases where the estimated number of sentences is large and helps ensure the accuracy of the result.
On the basis of the above embodiments, in the automatic text generation method provided in this embodiment, the first and second processing modes may be merged into one, with the second processing mode taking precedence; the third and fourth processing modes may likewise be merged into one, with the fourth processing mode taking precedence.
On the basis of the above embodiments, in the automatic text generation method provided in this embodiment, the word association graph is constructed as follows:
obtaining the semantic vector of each sample word in a corpus, and calculating the similarity between the semantic vectors of any two sample words, the similarity being used to characterize the similarity relationship between the two sample words;
performing dependency analysis on any two sample words in the corpus to determine the dependency between them, the dependency being used to characterize the correlation relationship between the two sample words;
constructing the word association graph based on the similarity relationships and the correlation relationships between the sample words.
Specifically, the entities in the word association graph are words, and there are two kinds of relationship between entities: the similarity relationship and the correlation relationship. Triples for the similarity relationship are obtained as follows: a BERT semantic model is trained on the corpus and used to convert each word into a semantic vector of a target dimension; the target dimension is controlled by a parameter and may be, for example, 64, 128, 256, or 512. Once the semantic vectors are obtained, the cosine similarity is computed for every pair of vectors, and its value is taken as the similarity between the two corresponding words. This value is stored in the graph database as an attribute of the similarity relationship between the two words, so that it can be queried easily. Triples for the correlation relationship are obtained as follows: dependency analysis is performed on sentences to obtain the dependencies between words, and words linked by a dependency are stored in the graph database as correlation triples.
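The pairwise cosine-similarity step can be sketched in plain Python as follows. The sketch assumes the semantic vectors have already been produced by a model such as the BERT model mentioned above; the relation label `"SIMILAR"` and the function names are hypothetical choices for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_triples(vectors):
    """Build (word_a, "SIMILAR", word_b, score) for every pair of words.

    `vectors` maps each word to its semantic vector.
    """
    triples = []
    for (w1, v1), (w2, v2) in combinations(vectors.items(), 2):
        triples.append((w1, "SIMILAR", w2, cosine_similarity(v1, v2)))
    return triples
```

Computing every pair is quadratic in the vocabulary size, so a large corpus would probably need a similarity cutoff or an approximate nearest-neighbour search; the text does not specify either.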
The graph database used in this embodiment is Neo4j and the development language is Python; Cypher statements are issued through the interface of the py2neo library to create, delete, update, and query data in the database.
Because the word association graph has been constructed, a shortage of topic words does not arise in use when the default parameters are applied.
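As a rough sketch of how a similarity triple might be written to the graph database, the helper below builds a parameterized Cypher statement. The node label `Word`, the relationship type `SIMILAR`, the property names, and the helper itself are assumptions made for illustration; the disclosure does not give the schema. The statement is only constructed here, not executed.

```python
def similar_relation_statement(word_a, word_b, score):
    """Return a Cypher statement and its parameters for one similarity edge.

    MERGE creates each node or relationship only if it does not exist yet,
    so re-running the statement does not duplicate data.
    """
    cypher = (
        "MERGE (a:Word {name: $word_a}) "
        "MERGE (b:Word {name: $word_b}) "
        "MERGE (a)-[r:SIMILAR]->(b) "
        "SET r.score = $score"
    )
    params = {"word_a": word_a, "word_b": word_b, "score": score}
    return cypher, params


# Assumed usage with py2neo (requires a running Neo4j instance):
#   graph.run(cypher, params)   # `graph` would be a py2neo Graph object
```

Passing values as parameters rather than formatting them into the string avoids quoting problems with arbitrary words.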
On the basis of the above embodiments, the automatic text generation method provided in this embodiment further includes: verifying the generated text.
Specifically, verifying the generated text means analyzing it for errors and correcting them, so that the generated text conforms to current grammatical rules and reads fluently.
First, a sentence must not begin with a particle or modal word such as 的, 了, 地, or 啊. A dictionary containing such words is built; if the first character of a generated sentence is in the dictionary, a new sentence is regenerated by the Transformer model and replaces the original sentence.
Second, in Chinese, word collocations follow certain grammatical structures, for example an adjective followed by a noun, or a verb followed by an adverb. The system performs dependency parsing and part-of-speech tagging on the generated text (both the LTP and HanLP libraries support this), evaluates the result against predefined rules (a verb-object structure should pair a verb with a noun, an adverbial should be an adverb or an adjective, and so on), and regenerates any non-conforming sentence with the Transformer model.
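The first-character check can be sketched as below. The dictionary contains only the four examples named in the text, and the function names are hypothetical; regeneration by the Transformer model is outside the scope of the sketch, which only reports which sentences would be regenerated.

```python
# Example dictionary; only the four words named in the text are included.
SENTENCE_INITIAL_BLOCKLIST = {"的", "了", "地", "啊"}


def needs_regeneration(sentence, blocklist=SENTENCE_INITIAL_BLOCKLIST):
    """True if the sentence starts with a character from the dictionary."""
    stripped = sentence.strip()
    return bool(stripped) and stripped[0] in blocklist


def sentences_to_regenerate(sentences):
    """Indices of the sentences that fail the first-character check."""
    return [i for i, s in enumerate(sentences) if needs_regeneration(s)]
```

Note that a pure first-character test also flags sentences that begin with legitimate words such as 了解, so a practical system might need word segmentation before this check.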
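The rule-based check can be sketched on pre-parsed input as follows. The sketch assumes a parser has already produced, for each dependency arc, a tuple of (relation, head part of speech, dependent part of speech). The tag strings (`VOB`, `ADV`, `v`, `n`, `d`, `a`) are illustrative assumptions and should be replaced by whatever label set the chosen parser emits. No LTP or HanLP calls are shown.

```python
# relation -> (allowed head POS tags or None, allowed dependent POS tags or None)
RULES = {
    "VOB": ({"v"}, {"n"}),      # verb-object: verb head, noun dependent
    "ADV": (None, {"d", "a"}),  # adverbial: dependent is adverb or adjective
}


def violates_rules(arcs, rules=RULES):
    """Return True if any arc breaks the rule defined for its relation.

    `arcs` is an iterable of (relation, head_pos, dependent_pos) tuples.
    Relations without a rule are ignored.
    """
    for relation, head_pos, dep_pos in arcs:
        if relation not in rules:
            continue
        allowed_head, allowed_dep = rules[relation]
        if allowed_head is not None and head_pos not in allowed_head:
            return True
        if allowed_dep is not None and dep_pos not in allowed_dep:
            return True
    return False
```

A sentence for which `violates_rules` returns True would then be passed back to the generation model.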
FIG. 3 is a schematic flowchart of the complete automatic text generation method provided in this embodiment. In FIG. 3, on the one hand, the keyword information and target quantity entered by the user are obtained, from which the keywords of the text to be generated and the estimated number of sentences are determined; on the other hand, the word association graph is constructed by extracting similar words and related words from the corpus. Next, the number of paragraphs and the paragraph topic words of the text to be generated are determined based on the keywords, the estimated number of sentences, and the constructed word association graph. The text is then generated by the Transformer model, and finally the generated text is verified.
As shown in FIG. 4, on the basis of the above embodiments, this embodiment provides an automatic text generation apparatus comprising an acquisition module 41, a determination module 42, and a text generation module 43, wherein:
the acquisition module 41 is configured to obtain the keywords of the text to be generated and the estimated number of sentences;
the determination module 42 is configured to determine the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association graph;
the text generation module 43 is configured to generate the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words.
Specifically, the functions of the modules in the automatic text generation apparatus provided in this embodiment correspond one-to-one to the steps of the method embodiments above and achieve the same effects. For details, refer to the above embodiments; they are not repeated here.
On the basis of the above embodiments, in the automatic text generation apparatus provided in this embodiment, the determination module is specifically configured to:
if it is determined that the keywords and the estimated number of sentences satisfy a first condition, determine that the number of paragraphs is a default number, and determine, based on the word association graph, a related word list consisting of the words that have a correlation relationship with each keyword;
merge the related word lists corresponding to all keywords to determine a topic word list;
determine a number of words to retain based on the estimated number of sentences, and determine the paragraph topic words based on that number of words in the topic word list.
On the basis of the above embodiments, in the automatic text generation apparatus provided in this embodiment, the determination module is specifically configured to:
if it is determined that the keywords and the estimated number of sentences satisfy a second condition, determine that the number of paragraphs is a default number, and determine, based on the word association graph, the number of words that have a correlation relationship with each keyword;
if the number of words corresponding to any keyword is less than or equal to a first preset threshold, determine, based on the word association graph, similar words that have a similarity relationship with that keyword, and determine, based on the word association graph, a related word list for each similar word;
merge all related word lists to determine a topic word list;
if it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, determine the paragraph topic words based on the words in the topic word list.
On the basis of the above embodiments, in the automatic text generation apparatus provided in this embodiment, the determination module is specifically configured to:
if it is determined that the keywords and the estimated number of sentences satisfy a third condition, determine, based on the word association graph, a related word list consisting of the words that have a correlation relationship with each keyword;
merge the related word lists corresponding to all keywords to determine a topic word list;
determine a number of words to retain based on the estimated number of sentences, and determine the paragraph topic words based on that number of words in the topic word list;
cluster the paragraph topic words, and determine the number of paragraphs based on the clustering result.
On the basis of the above embodiments, in the automatic text generation apparatus provided in this embodiment, the determination module is specifically configured to:
if it is determined that the keywords and the estimated number of sentences satisfy a fourth condition, determine, based on the word association graph, a related word list consisting of the words that have a correlation relationship with each keyword;
merge the related word lists corresponding to all keywords to determine a topic word list;
if it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, determine the paragraph topic words based on the words in the topic word list;
cluster the paragraph topic words, and determine the number of paragraphs based on the clustering result.
On the basis of the above embodiments, the automatic text generation apparatus provided in this embodiment further includes a graph construction module configured to:
obtain the semantic vector of each sample word in a corpus, and calculate the similarity between the semantic vectors of any two sample words, the similarity being used to characterize the similarity relationship between the two sample words;
perform dependency analysis on any two sample words in the corpus to determine the dependency between them, the dependency being used to characterize the correlation relationship between the two sample words;
construct the word association graph based on the similarity relationships and the correlation relationships between the sample words.
On the basis of the above embodiments, in the automatic text generation apparatus provided in this embodiment, the acquisition module is specifically configured to:
determine the target word count of the text to be generated;
determine the estimated number of sentences based on the target word count.
FIG. 5 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, wherein the processor 510, the communications interface 520, and the memory 530 communicate with one another through the communication bus 540. The processor 510 may call logic instructions in the memory 530 to execute the automatic text generation method, the method including: obtaining the keywords of the text to be generated and the estimated number of sentences; determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association graph; and generating the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words.
In addition, when the logic instructions in the memory 530 are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to execute the automatic text generation method provided by the above method embodiments, the method including: obtaining the keywords of the text to be generated and the estimated number of sentences; determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association graph; and generating the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words.
In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the automatic text generation method provided by the above embodiments, the method including: obtaining the keywords of the text to be generated and the estimated number of sentences; determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association graph; and generating the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware. Based on this understanding, the above technical solutions in essence, or the parts that contribute to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. An automatic text generation method, characterized by comprising:
    obtaining the keywords of the text to be generated and an estimated number of sentences;
    determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association graph;
    generating the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words.
  2. The automatic text generation method according to claim 1, characterized in that determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically comprises:
    if it is determined that the keywords and the estimated number of sentences satisfy a first condition, determining that the number of paragraphs is a default number, and determining, based on the word association graph, a related word list consisting of the words that have a correlation relationship with each keyword;
    merging the related word lists corresponding to all keywords to determine a topic word list;
    determining a number of words to retain based on the estimated number of sentences, and determining the paragraph topic words based on that number of words in the topic word list.
  3. The automatic text generation method according to claim 1, characterized in that determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically comprises:
    if it is determined that the keywords and the estimated number of sentences satisfy a second condition, determining that the number of paragraphs is a default number, and determining, based on the word association graph, the number of words that have a correlation relationship with each keyword;
    if the number of words corresponding to any keyword is less than or equal to a first preset threshold, determining, based on the word association graph, similar words that have a similarity relationship with that keyword, and determining, based on the word association graph, a related word list for each similar word;
    merging all related word lists to determine a topic word list;
    if it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, determining the paragraph topic words based on the words in the topic word list.
  4. The automatic text generation method according to claim 1, characterized in that determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically comprises:
    if it is determined that the keywords and the estimated number of sentences satisfy a third condition, determining, based on the word association graph, a related word list consisting of the words that have a correlation relationship with each keyword;
    merging the related word lists corresponding to all keywords to determine a topic word list;
    determining a number of words to retain based on the estimated number of sentences, and determining the paragraph topic words based on that number of words in the topic word list;
    clustering the paragraph topic words, and determining the number of paragraphs based on the clustering result.
  5. The automatic text generation method according to claim 1, characterized in that determining the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association graph specifically comprises:
    if it is determined that the keywords and the estimated number of sentences satisfy a fourth condition, determining, based on the word association graph, a related word list consisting of the words that have a correlation relationship with each keyword;
    merging the related word lists corresponding to all keywords to determine a topic word list;
    if it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, determining the paragraph topic words based on the words in the topic word list;
    clustering the paragraph topic words, and determining the number of paragraphs based on the clustering result.
  6. The automatic text generation method according to any one of claims 1 to 5, characterized in that the word association graph is constructed as follows:
    obtaining the semantic vector of each sample word in a corpus, and calculating the similarity between the semantic vectors of any two sample words, the similarity being used to characterize the similarity relationship between the two sample words;
    performing dependency analysis on any two sample words in the corpus to determine the dependency between them, the dependency being used to characterize the correlation relationship between the two sample words;
    constructing the word association graph based on the similarity relationships and the correlation relationships between the sample words.
  7. The automatic text generation method according to any one of claims 1 to 5, characterized in that the estimated number of sentences is obtained as follows:
    determining the target word count of the text to be generated;
    determining the estimated number of sentences based on the target word count.
  8. An automatic text generation apparatus, characterized by comprising:
    an acquisition module, configured to obtain the keywords of the text to be generated and an estimated number of sentences;
    a determination module, configured to determine the number of paragraphs and the paragraph topic words of the text to be generated based on the keywords, the estimated number of sentences, and a pre-built word association graph;
    a text generation module, configured to generate the text to be generated based on a Transformer model, the number of paragraphs of the text to be generated, and the paragraph topic words.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the automatic text generation method according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the automatic text generation method according to any one of claims 1 to 7.
PCT/CN2020/139952 2020-11-25 2020-12-28 Automatic text generation method and apparatus, and electronic device and storage medium WO2022110454A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011341955.5A CN112417846B (en) 2020-11-25 Text automatic generation method and device, electronic equipment and storage medium
CN202011341955.5 2020-11-25

Publications (1)

Publication Number Publication Date
WO2022110454A1 (en) 2022-06-02

Family

ID=74842398

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139952 WO2022110454A1 (en) 2020-11-25 2020-12-28 Automatic text generation method and apparatus, and electronic device and storage medium

Country Status (1)

Country Link
WO (1) WO2022110454A1 (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526718A (en) * 2017-09-19 2017-12-29 北京百度网讯科技有限公司 Method and apparatus for generating text
CN108108342A (en) * 2017-11-07 2018-06-01 汉王科技股份有限公司 Generation method, search method and the device of structured text
CN108427665A (en) * 2018-03-15 2018-08-21 广州大学 A kind of text automatic generation method based on LSTM type RNN models
US20190317953A1 (en) * 2018-04-12 2019-10-17 Abel BROWARNIK System and method for computerized semantic indexing and searching
CN109086408A (en) * 2018-08-02 2018-12-25 腾讯科技(深圳)有限公司 Document creation method, device, electronic equipment and computer-readable medium
CN110362797A (en) * 2019-06-14 2019-10-22 哈尔滨工业大学(深圳) A kind of research report generation method and relevant device
CN110688857A (en) * 2019-10-08 2020-01-14 北京金山数字娱乐科技有限公司 Article generation method and device
CN111274776A (en) * 2020-01-21 2020-06-12 中国搜索信息科技股份有限公司 Article generation method based on keywords
CN111930929A (en) * 2020-07-09 2020-11-13 车智互联(北京)科技有限公司 Article title generation method and device and computing equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116484805A (en) * 2023-05-06 2023-07-25 国网浙江省电力有限公司 Intelligent cleaning processing method for power report combining knowledge graph and semantic analysis
CN116484805B (en) * 2023-05-06 2023-09-15 国网浙江省电力有限公司 Intelligent cleaning processing method for power report combining knowledge graph and semantic analysis
CN117422795A (en) * 2023-12-18 2024-01-19 华南理工大学 Automatic generation method and system for packaging material printing graphics context based on data processing
CN117422795B (en) * 2023-12-18 2024-03-29 华南理工大学 Automatic generation method and system for packaging material printing graphics context based on data processing
CN117976231A (en) * 2024-01-30 2024-05-03 北京康众时代医药科技集团有限公司 Method for integrating and analyzing clinical data of Chinese patent medicine in evidence-based medicine
CN117934229A (en) * 2024-03-18 2024-04-26 新励成教育科技股份有限公司 Originality excitation-based talent training guiding method, system, equipment and medium

Also Published As

Publication number Publication date
CN112417846A (en) 2021-02-26

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20963342; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 20963342; Country of ref document: EP; Kind code of ref document: A1