WO2022110454A1

WO2022110454A1 - Automatic text generation method and apparatus, and electronic device and storage medium

Info

Publication number: WO2022110454A1
Application number: PCT/CN2020/139952
Authority: WO
Inventors: 夏维; 孙赫; 张恒; 高鹏
Original assignee: 中译语通科技股份有限公司
Priority date: 2020-11-25
Filing date: 2020-12-28
Publication date: 2022-06-02
Also published as: CN112417846A

Abstract

An automatic text generation method and apparatus, an electronic device, and a storage medium. The method comprises: first, acquiring a keyword and an estimated sentence quantity of text to be generated (S1); then, determining a paragraph quantity and paragraph subject words of said text on the basis of the keyword, the estimated sentence quantity and a pre-constructed word association graph (S2); and finally, generating said text on the basis of a Transformer model, the paragraph quantity and the paragraph subject words (S3). The method is a new text generation method implemented by means of a Transformer model. Screening and determination of paragraph subject words are introduced, so the subject of the generated text can be both extended and constrained, such that the generated text has a core idea. Moreover, because a Transformer model is used, the generated text is no longer limited to simple content and a fixed format, as text generated by traditional methods is.

Description

Automatic text generation method, device, electronic device and storage medium

Technical field

The present invention relates to the technical field of artificial intelligence, and in particular, to a method, device, electronic device and storage medium for automatic text generation.

Background art

At present, text generation based on artificial intelligence (AI) is a challenging task in the field of natural language processing. Its purpose is to enable computers to write high-quality articles as humans do, which requires the adopted model to have a strong ability to understand and generate text. There are two traditional text generation methods: generation based on rules and templates, and generation based on extraction. The text produced by both methods has a relatively fixed format, so neither can generate text with rich content and diverse styles.

Summary of the invention

Embodiments of the present invention provide a method, apparatus, electronic device, and storage medium for automatic text generation, so as to solve the defects existing in the prior art.

An embodiment of the present invention provides a method for automatic text generation, including:

obtaining the keywords of the text to be generated and the estimated number of sentences;

determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map;

generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.

According to the automatic text generation method of an embodiment of the present invention, determining the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map specifically includes:

If it is determined that the keyword and the estimated number of sentences satisfy the first condition, the number of paragraphs is determined to be the default number, and based on the word association map, a related word list composed of words that have a correlation relationship with each keyword is determined;

Summarize the list of related words corresponding to all keywords to determine the list of subject words;

Based on the estimated number of sentences, a reserved number of words is determined, and the paragraph topic word is determined based on the words of the reserved word number in the topic word list.

If it is determined that the keyword and the estimated number of sentences satisfy the second condition, the number of paragraphs is determined as the default number, and based on the word association map, the number of words that are related to each keyword is determined;

If the number of words corresponding to any one of the keywords is less than or equal to the first preset threshold, then based on the word association map, determine the similar words that have a similarity relationship with that keyword, and based on the word association map, determine the related word list of each similar word;

Summarize all related word lists to determine the subject word list;

If it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, the paragraph topic word is determined based on the words in the topic word list.

If it is determined that the keyword and the estimated number of sentences satisfy the third condition, then based on the word association map, determine a related word list composed of words that are related to each keyword;

Based on the estimated number of sentences, determine the number of reserved words, and determine the paragraph subject words based on the words in the reserved number of words in the subject word list;

The paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.

If it is determined that the keyword and the estimated number of sentences satisfy the fourth condition, then, based on the word association map, determine a related word list composed of words that are related to each keyword;

If it is judged that the number of words in the topic word list is greater than or equal to a second preset threshold, determining the paragraph topic word based on the words in the topic word list;

According to the automatic text generation method according to an embodiment of the present invention, the word association map is specifically constructed by the following method:

Obtain the semantic vector of each sample word in the corpus, and calculate the similarity between the semantic vectors of any two sample words, where the similarity is used to represent the similarity relationship between the any two sample words;

Performing a dependency analysis on the any two sample words in the corpus, and determining a dependency between the any two sample words, where the dependency is used to represent the correlation between the any two sample words;

The word association map is constructed based on the similarity relationship between the any two sample words and the correlation relationship between the any two sample words.
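As an illustration only, the construction steps above can be sketched as follows. The data layout, the `min_similarity` cut-off, the use of cosine similarity, and the form of `dependency_pairs` (pairs assumed to come from some dependency parser run over the corpus) are assumptions made for this sketch and are not specified by the embodiment.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_word_graph(vectors, dependency_pairs, min_similarity=0.5):
    """Return {word: {"similar": {other: score}, "related": set_of_words}}.

    vectors: hypothetical mapping word -> semantic vector.
    dependency_pairs: hypothetical (head, dependent) word pairs taken from
        a dependency analysis of the corpus (the parser is out of scope).
    min_similarity: assumed cut-off below which no similarity edge is stored.
    """
    graph = defaultdict(lambda: {"similar": {}, "related": set()})
    words = sorted(vectors)
    # Similarity relationship: compare the semantic vectors pairwise.
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            score = cosine(vectors[w1], vectors[w2])
            if score >= min_similarity:
                graph[w1]["similar"][w2] = score
                graph[w2]["similar"][w1] = score
    # Correlation relationship: one undirected edge per dependency pair.
    for head, dependent in dependency_pairs:
        graph[head]["related"].add(dependent)
        graph[dependent]["related"].add(head)
    return dict(graph)
```

Keeping the two relationship types in separate fields lets a query filter by relationship, which is what the processing modes described later rely on.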

According to the automatic text generation method according to an embodiment of the present invention, the estimated number of sentences is specifically obtained by the following method:

Determine the target word count of the text to be generated;

Based on the target word count, the estimated sentence count is determined.

An embodiment of the present invention also provides an automatic text generation device, including an acquisition module, a determination module, and a text generation module, wherein:

The obtaining module is used to obtain the keywords of the text to be generated and the estimated number of sentences respectively;

A determination module, configured to determine the number of paragraphs and the subject words of the paragraphs of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map;

A text generation module, configured to generate the to-be-generated text based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph subject words.

An embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the above automatic text generation methods.

Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, implements the steps of any of the above-described automatic text generation methods.

The automatic text generation method, device, electronic device, and storage medium provided by the embodiments of the present invention first obtain the keywords of the text to be generated and the estimated number of sentences; then determine the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map; and finally generate the text based on the Transformer model, the number of paragraphs, and the paragraph subject words. This is a new type of text generation method implemented with the Transformer model. It introduces the screening and determination of paragraph subject words, which allows the subject of the generated text to be both expanded and constrained, so that the generated text has a core idea. At the same time, using the Transformer model means the generated text is no longer limited to a single kind of content and a fixed format, as text generated by traditional methods is, so the method can be widely used in fields such as report generation, literary creation, and intelligent question answering.

Description of drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for automatic text generation according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of determining the number of paragraphs and the paragraph subject words of the text to be generated in a method for automatic text generation provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a complete flow of a method for automatic text generation provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a device for automatic text generation provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed description

In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Since the text formats generated by traditional text generation methods are relatively fixed, they cannot generate texts with rich content and diverse styles. To this end, an automatic text generation method is provided in the embodiments of the present invention to solve the problems existing in the prior art.

FIG. 1 is a schematic flowchart of a method for automatic text generation provided in an embodiment of the present invention. As shown in FIG. 1 , the method includes:

S1, obtain the keywords of the text to be generated and the estimated number of sentences respectively;

S2, based on the keywords, the estimated sentence quantity and the pre-built word association map, determine the paragraph quantity and the paragraph subject words of the text to be generated;

S3, generate the to-be-generated text based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph subject words.

Specifically, in the automatic text generation method provided in the embodiment of the present invention, the execution body is a server, which can be either a local server or a cloud server, and the local server can be a computer, etc., which is not specifically limited in the embodiment of the present invention.

Step S1 is performed first: the keywords of the text to be generated and the estimated number of sentences are obtained. The keywords are the words on which generation of the text is based, and they can be determined from keyword-related information in the user input. The keyword-related information may be a single keyword, multiple keywords, or a piece of text containing one or more sentences. When it is a single keyword or multiple keywords, the input keywords are used directly as the keywords of the text to be generated. When it is a piece of text, the more important words are first extracted from that text, either by an extraction algorithm or by a syntactic analysis algorithm. The extraction algorithm may be, for example, the TF-IDF algorithm or the TextRank algorithm. In this embodiment of the present invention, to ensure the quality of the final generated text, the extraction method can be selected by passing a parameter: for example, the user input may further include an extraction parameter whose different values represent the different extraction methods selected by the user. Stop words are then removed from the extracted words, and the remaining words are the keywords of the text to be generated.
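A minimal sketch of this keyword step is given below, assuming whitespace tokenization, a toy stop-word list, a smoothed TF-IDF formula, and a hypothetical `method` parameter standing in for the extraction parameter. None of these details are fixed by the embodiment.

```python
import math
from collections import Counter

# Toy stop-word list used only for this illustration (assumption).
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}


def tfidf_rank(tokens, corpus):
    """Rank the distinct tokens of one text by TF-IDF against a corpus.

    corpus: list of tokenized reference documents (assumed to be available).
    """
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[word] = (count / len(tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))


def extract_keywords(user_input, corpus, method="tfidf", top_k=3):
    """Return keywords from either a keyword list or a piece of free text."""
    if isinstance(user_input, (list, tuple)):
        return list(user_input)  # keywords supplied directly by the user
    tokens = user_input.lower().split()
    if method != "tfidf":
        raise ValueError(f"unsupported extraction method: {method}")
    ranked = tfidf_rank(tokens, corpus)
    # Remove stop words after extraction, as described in the text.
    return [w for w in ranked if w not in STOP_WORDS][:top_k]
```

A TextRank or syntactic-analysis branch could be added behind the same `method` parameter.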

The estimated number of sentences is a preliminary estimate of how many sentences the text to be generated may contain, and it can also be determined from the user input. Here, the user input may include the target word count of the text to be generated. In the embodiment of the present invention, sentence length statistics are computed over the training corpus, and the average sentence length is found to be 33 to 34 words. The default sentence length is therefore 33 words, and the estimated number of sentences is the ratio of the target word count to the default sentence length. It should be noted that the default sentence length does not mean that every sentence in the generated text contains 33 words; the estimated number of sentences is only a preliminary estimate. In particular, if the target word count entered by the user is less than 33, the estimated number of sentences defaults to 1.
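The estimate can be sketched as below. The text specifies only a ratio and the special case for targets under 33 words; rounding the ratio down is an assumption of this sketch.

```python
DEFAULT_SENTENCE_LENGTH = 33  # default sentence length stated in the text


def estimate_sentence_count(target_word_count,
                            sentence_length=DEFAULT_SENTENCE_LENGTH):
    """Estimate how many sentences the generated text will contain."""
    if target_word_count < sentence_length:
        return 1  # very short targets yield a single sentence
    # Rounding down is an assumption; the text only says "ratio".
    return target_word_count // sentence_length
```

For example, a target of 500 words gives 500 // 33 = 15 estimated sentences.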

Then, step S2 is performed to determine the number of paragraphs and the paragraph subject words of the text to be generated according to the keywords, the estimated number of sentences, and the pre-built word association map. Because the number of keywords and the estimated number of sentences vary, different processing modes can be used. In the embodiment of the present invention, the processing mode is therefore selected according to the condition satisfied by the number of keywords and the estimated number of sentences, and the number of paragraphs and the paragraph subject words are then determined according to that processing mode and the pre-built word association map. The word association map is pre-built from the training corpus and represents the associations between words, which can include a similarity relationship and a correlation relationship. The similarity relationship represents how similar two words are and can be determined from the similarity between their semantic vectors. The correlation relationship represents the dependency between two words and can be determined by performing dependency analysis on the sentences in which the two words occur. Neither is specifically limited in this embodiment of the present invention.

Finally, step S3 is performed, and the text to be generated is generated according to the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words. The Transformer model takes the number of paragraphs and the paragraph subject words into account and, within each paragraph, determines the next sentence from the previous sentence. The Transformer model can take four inputs:

The first input: the semantic vector of the previous sentence. If there is no sentence before, the semantic vector of this input is the 0 vector of the corresponding dimension.

The second input: the word vector of the randomly selected paragraph subject words.

The third input: the sum of the word vectors of all keywords.

The fourth input: the closing-sentence indicator vector. For example, for a closing sentence it is a constant vector of the corresponding dimension whose entries are all 8; otherwise it is a constant vector of the corresponding dimension whose entries are all 1.

All input vectors are concatenated along the last dimension and then fed into the Transformer model. Based on the semantics of the preceding text, the Transformer model outputs the current sentence together with the semantic vector of the current sentence.

Assuming that the semantic vector of the previous sentence is denoted A and the semantic vector of the current sentence is denoted B, the semantic vector input to the Transformer model at the next step is A' = A*0.1 + B*0.9.

The number of words in the text is counted after each run of the Transformer model. If the number of words is close to the target number of words for the current paragraph, the fourth input is switched to the closing-sentence value so that the model outputs the concluding sentence of the paragraph.
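The input assembly and the semantic-vector update can be sketched with plain Python lists as follows. The function names, the choice to give every input the same dimension, and the `tolerance` used to decide that the word count is "close" to the paragraph target are assumptions; the indicator values 8 and 1 and the 0.1/0.9 weights are taken from the text.

```python
def sum_vectors(vectors):
    """Element-wise sum of equal-length vectors (third input)."""
    return [sum(components) for components in zip(*vectors)]


def build_model_input(prev_vec, subject_vec, keyword_vecs, is_closing,
                      closing_value=8.0, open_value=1.0):
    """Concatenate the four inputs along the last dimension."""
    dim = len(subject_vec)
    if prev_vec is None:
        prev_vec = [0.0] * dim  # no previous sentence: zero vector
    flag = [closing_value if is_closing else open_value] * dim
    return list(prev_vec) + list(subject_vec) + sum_vectors(keyword_vecs) + flag


def update_context(prev_vec, current_vec, keep=0.1):
    """A' = 0.1 * A + 0.9 * B, applied element-wise."""
    return [keep * a + (1.0 - keep) * b for a, b in zip(prev_vec, current_vec)]


def should_close(words_so_far, paragraph_target, tolerance=33):
    """Hypothetical closeness test: within one default sentence of the target."""
    return paragraph_target - words_so_far <= tolerance
```

Weighting the current sentence at 0.9 keeps the context focused on the most recent sentence while retaining a small trace of earlier ones.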

It should be noted that, in the embodiments of the present invention, words and sentences are converted into vectors when they are used. There are many conversion methods. As a preferred solution, the semantic vectors can be obtained through the BERT pre-trained model. The automatic text generation method in the embodiment of the present invention can be developed and implemented in Python.

The automatic text generation method provided in the embodiment of the present invention first obtains the keywords of the text to be generated and the estimated number of sentences; then determines the number of paragraphs and the paragraph subject words of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map; and finally generates the text based on the Transformer model, the number of paragraphs, and the paragraph subject words. This is a new type of text generation method implemented with the Transformer model. It introduces the screening and determination of paragraph subject words, which allows the subject of the generated text to be both expanded and constrained, so that the generated text has a core idea. At the same time, using the Transformer model means the generated text is no longer limited to a single kind of content and a fixed format, as text generated by traditional methods is, so the method can be widely used in fields such as report generation, literary creation, and intelligent question answering.

On the basis of the above embodiment, if the information input by the user does not contain keyword-related information, words may be randomly selected from the popular thesaurus as the keywords of the text to be generated. The popular thesaurus can be obtained by collecting popular words on a daily basis, and can be regularly updated and maintained.

On the basis of the above embodiment, in the automatic text generation method provided in the embodiment of the present invention, the estimated number of sentences is specifically obtained by the following method:

Determine the target word count of the text to be generated;

Based on the target word count, the estimated sentence count is determined.

Specifically, when determining the number of estimated sentences, the target number of words of the text to be generated may be determined first, and the target number of words may be input by the user, that is, the user input information may include the target number of words. Then, according to the target number of words, the estimated number of sentences can be determined. Specifically, the ratio of the target number of words to the default sentence length can be used as the estimated number of sentences.

In this embodiment of the present invention, when determining the number of estimated sentences, the target number of words is introduced, so that the generated text is no longer a random number of words, but can generate text with the number of words desired by the user according to the user's needs.

On the basis of the above embodiment, if the user input information does not contain the target number of words, a number may be randomly selected from 500 to 5000 as the target number of words. It should be noted that the actual number of words in the generated text and the target number of words are not necessarily exactly equal. When the target number of words is less than 500, the actual number of words in the generated text may have a deviation of 50 words up and down; when the target number of words is greater than 500, the generated text There may be a deviation of 50 words to 200 words in the actual number of words, which are all within the controllable range. Meanwhile, if the target number of words is too small, for example, less than 33 words, only one sentence is generated, and the generation of the sentence is completely based on the semantics of the keywords of the text to be generated in step S1.
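The default target and the stated deviation bounds can be sketched as below. The helper names are hypothetical, and treating a target of exactly 500 words like the larger targets is an assumption, since the text leaves that case open.

```python
import random


def resolve_target_word_count(user_target=None, low=500, high=5000, rng=random):
    """Use the user's target if given, otherwise draw one from [low, high]."""
    if user_target is not None:
        return user_target
    return rng.randint(low, high)  # randint includes both end points


def within_expected_deviation(target, actual):
    """Check the actual word count against the deviation described in the text."""
    # Below 500 words: up to 50 words either way; otherwise up to 200 words.
    limit = 50 if target < 500 else 200
    return abs(actual - target) <= limit
```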

As shown in FIG. 2 , it is a schematic flowchart of selecting different processing modes when the number of keywords and estimated sentences meet different conditions in an embodiment of the present invention, which is specifically described with reference to the following embodiments.

On the basis of the above embodiment, in the automatic text generation method provided in the embodiment of the present invention, the number of paragraphs and the paragraph subject words of the text to be generated are determined based on the keywords, the estimated number of sentences, and the pre-built word association map, according to which condition the keywords and the estimated number of sentences satisfy, as follows.

Specifically, in the embodiment of the present invention, an estimated number of sentences below 15 corresponds to a target word count of about 500 words. For such a text, if there are enough keywords (by default, 2 or more), the keywords are sufficient for screening out the paragraph subject words, and the first processing mode is used to determine the final number of paragraphs and the paragraph subject words. When there are not enough keywords but the number of sentences to be generated is very small (by default, fewer than 8), the paragraph subject words can still be screened out, so the first processing mode is also used. That is, the first condition may be either that the number of keywords is greater than or equal to the first threshold and the estimated number of sentences is less than the second threshold, or that the number of keywords is less than the first threshold and the estimated number of sentences is less than the third threshold. The three thresholds can be set as required, with the third threshold smaller than the second threshold. For example, the first threshold can be 2, the second threshold 15, and the third threshold 8. The first condition corresponds to the first processing mode: if the number of keywords and the estimated number of sentences satisfy the first condition, the first processing mode determines the number of paragraphs and the paragraph subject words of the text to be generated, as shown in FIG. 2.

The first processing mode is as follows. First, the number of paragraphs is set to a default number, which can be chosen as needed and according to the specific content of the first condition; for example, it can be set to 1. Then, according to the word association graph, a related word list is determined for each keyword, composed of the words that have a correlation relationship with that keyword. Assuming there are n keywords, the n keywords are input into the word association graph and the query is filtered by relationship type, so that only words with a correlation relationship to the input keyword are returned. The result of each keyword query is stored in its own list, giving n query result lists, each of which is the related word list of one keyword. An empty topic dictionary is then created, and the n query result lists are merged, deduplicated and stored in it: each key of the topic dictionary is a distinct word, and each value is initially 0. Next, the number of times each word in the topic dictionary appears in the n lists is counted, and the value for a word is incremented by 1 for each occurrence. Finally, the topic dictionary is sorted by value in descending order. Because a Python dictionary is not a sorted structure, the sorted result can be stored as a list, which is the subject word list. For example: [(term1,7),(term2,7),(term3,5),(term4,2)]. The subject word list consists of tuples, each containing two values: the word itself and the number of occurrences of the word.

Then according to the estimated number of sentences, determine the number of words to retain. Specifically, the value obtained by multiplying the estimated number of sentences by 0.6 can be rounded up to obtain the number of reserved words. The subject word list is intercepted according to the number of reserved words, and the intercepted words are the paragraph subject words. On this basis, you can add paragraph subject headings to a new list, and this new list is the paragraph heading list. Because there is only one paragraph, there is only one list of paragraph subject headings.
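The first processing mode can be sketched as follows, assuming the related word lists have already been queried from the word association graph. `collections.Counter` stands in for the topic dictionary, and the alphabetical tie-break is an assumption added only to make the ordering deterministic.

```python
import math
from collections import Counter


def first_mode_subject_words(related_lists, estimated_sentences, ratio=0.6):
    """Rank related words by how often they occur and keep the top ones.

    related_lists: one related word list per keyword (n lists for n keywords).
    Returns the paragraph subject word list for the single paragraph.
    """
    counts = Counter()
    for words in related_lists:
        counts.update(words)  # +1 for every occurrence across the n lists
    # Sort by count, largest first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Reserved word count: estimated sentences * 0.6, rounded up.
    # round() guards against floating-point noise before the ceiling.
    keep = math.ceil(round(estimated_sentences * ratio, 9))
    return [word for word, _ in ranked[:keep]]
```

With an estimated 4 sentences, 4 * 0.6 = 2.4 is rounded up to 3, so the three most frequent related words become the paragraph subject words.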

The embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, that is, through the first processing mode, the generated text can be divided into paragraphs intelligently without applying a template.

Summarize all related word lists to determine the subject word list;

Specifically, in this embodiment of the present invention, when the estimated number of sentences is greater than or equal to 8 and less than 15, and there are not enough keywords (by default fewer than 2, that is, only 1 keyword), the second processing mode is used to determine the final number of paragraphs and the paragraph subject words. The first processing mode is not used here because, with so few keywords, it might yield too few paragraph subject words and restrict the range of topics the article can cover. That is, the second condition may be that the number of keywords is less than the first threshold, and the estimated number of sentences is less than the second threshold and greater than or equal to the third threshold. The second condition corresponds to the second processing mode: if the number of keywords and the estimated number of sentences satisfy the second condition, the second processing mode determines the number of paragraphs and the paragraph subject words of the text to be generated, as shown in FIG. 2.

The second processing mode is specifically as follows: first, the number of paragraphs is determined as a default number, and the default number can be set according to needs and the specific content of the second condition, for example, it can be set to 1. Then this keyword is input into the word association graph, and the related relationship query is performed to determine the number of words that have a related relationship with the keyword.

If the number of words corresponding to the keyword is less than or equal to the first preset threshold, the keyword needs to be expanded. The keyword is input into the word association map again and a similarity relationship query is performed to find the similar words that have a similarity relationship with the keyword, and the keyword is expanded with these similar words. The first preset threshold may be 0.6 times the estimated number of sentences. The default similarity threshold for filtering is 0.98. If no similar words are found, the similarity threshold is lowered in steps of 0.01 until similar words are found. If k similar words are found, the related words of each of these k words are queried, giving k related word lists. All the words in the k related word lists are merged, deduplicated, and placed in a new list, which is the subject word list. If the number of words in the subject word list is greater than or equal to the second preset threshold, the words in the subject word list are used directly as the paragraph subject words, that is, the subject word list is the paragraph subject word list. The second preset threshold may be 0.6 times the estimated number of sentences. If the number of words in the subject word list is less than the second preset threshold, the similarity threshold is lowered by a further 0.01, new similar words are obtained, and the above operation is repeated until the number of words in the subject word list is greater than or equal to the second preset threshold.

If the number of words corresponding to the keyword is greater than the first preset threshold, the keyword does not need to be expanded, which is equivalent to the first processing mode.
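The threshold-relaxation loop of the second processing mode can be sketched as follows. The `similar` and `related` mappings are a hypothetical in-memory view of the word association graph, and stopping once the threshold reaches zero is an assumption added so the loop always terminates.

```python
def second_mode_subject_words(keyword, similar, related, estimated_sentences,
                              ratio=0.6, start=0.98, step=0.01):
    """Expand a single keyword through similar words until enough subject words exist.

    similar: hypothetical mapping word -> {similar_word: similarity score}.
    related: hypothetical mapping word -> list of related words.
    """
    needed = estimated_sentences * ratio  # both preset thresholds: 0.6 * sentences
    own = list(related.get(keyword, []))
    if len(own) > needed:
        return own  # enough related words: handled as in the first processing mode
    # Work in integer hundredths so repeated subtraction does not drift.
    level = round(start * 100)
    step_hundredths = max(1, round(step * 100))
    subject_words = []
    while level >= 0:
        threshold = level / 100
        neighbours = [w for w, score in similar.get(keyword, {}).items()
                      if score >= threshold]
        subject_words = []
        for word in neighbours:
            for candidate in related.get(word, []):
                if candidate not in subject_words:  # deduplicate, keep order
                    subject_words.append(candidate)
        if len(subject_words) >= needed:
            return subject_words
        level -= step_hundredths  # relax the similarity threshold by 0.01
    return subject_words  # graph exhausted: fewer words than requested
```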

The embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, that is, through the second processing mode, the generated text can be divided into paragraphs intelligently without applying a template. Moreover, the second processing mode is suitable for the case that the number of keywords is too small, which can ensure that the number of paragraph keywords obtained is moderate, and the degree of freedom of the article topic can be improved.

Specifically, in the embodiment of the present invention, when the estimated number of sentences is greater than or equal to 15, the estimated number of sentences is compared with the number of keywords. If the estimated number of sentences is less than or equal to 1.5 times the number of keywords, the third processing method is used. to determine the number of paragraphs and paragraph headings. That is to say, the third condition may be that the estimated number of sentences is greater than or equal to the second threshold, and the estimated number of sentences is less than or equal to the number of keywords that is a preset multiple. The third condition corresponds to the third processing mode, that is, if the number of keywords and estimated sentences satisfies the third condition, the third processing mode is used to determine the number of paragraphs and the subject words of the text to be generated. as shown in picture 2.

The third processing mode is similar to the first processing mode. Each keyword is queried directly in the word association graph. If the number of keywords is k, k related word lists are obtained. The words in the k related word lists are merged and deduplicated to obtain the subject word list. The estimated number of sentences is then multiplied by 0.6, and the resulting value is used to truncate the subject word list. The truncated word list is the paragraph subject word list, and the words contained in it are the paragraph subject words.
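The merge, deduplicate, and truncate steps of the third processing mode can be sketched as below. The function name and the `related_words` callback are hypothetical, and rounding the product 0.6 × (estimated number of sentences) down to an integer is an assumption; the text does not say how a fractional value is handled.

```python
from typing import Callable, List


def third_mode_subject_words(
    keywords: List[str],
    estimated_sentences: int,
    related_words: Callable[[str], List[str]],  # hypothetical graph query
) -> List[str]:
    """Merge the related word lists of all keywords, deduplicate them, and
    keep the first 0.6 * estimated_sentences words."""
    merged: List[str] = []
    seen = set()
    for keyword in keywords:
        for word in related_words(keyword):
            if word not in seen:
                seen.add(word)
                merged.append(word)
    # Integer arithmetic for 0.6 * n, rounded down (assumed rounding rule).
    keep = estimated_sentences * 6 // 10
    return merged[:keep]
```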

The words in the paragraph subject word list are clustered, and the number of paragraphs can be determined through the results of the clustering. The number of paragraphs is determined as shown in the formula:

Number of paragraphs = max(3, number of clusters)

Since a subject heading alone cannot be used as a category, it will be assigned to the nearest category. After the number of paragraphs is determined, each category of words corresponds to a paragraph, so that the paragraphs and the paragraph subject word list are in one-to-one correspondence.
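A minimal sketch of the two rules above, the singleton reassignment and the formula for the number of paragraphs, is given here. The clustering algorithm itself is not named in the text, so the sketch takes already-computed clusters as input. The `distance` callback and the choice of "nearest" as the smallest distance to any member of a cluster are assumptions for illustration.

```python
from typing import Callable, List


def merge_singletons(
    clusters: List[List[str]],
    distance: Callable[[str, str], float],  # hypothetical word distance
) -> List[List[str]]:
    """Assign every single-word cluster to its nearest multi-word cluster."""
    multi = [list(c) for c in clusters if len(c) > 1]
    singles = [c[0] for c in clusters if len(c) == 1]
    if not multi:
        # Nothing to merge into; return the clusters unchanged.
        return [list(c) for c in clusters]
    for word in singles:
        nearest = min(multi, key=lambda c: min(distance(word, m) for m in c))
        nearest.append(word)
    return multi


def paragraph_count(clusters: List[List[str]]) -> int:
    """Number of paragraphs = max(3, number of clusters)."""
    return max(3, len(clusters))
```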

The embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, that is, the third processing mode is implemented, so that the generated text can be divided into paragraphs intelligently without applying a template. Moreover, the third processing mode is suitable for a large number of estimated sentences, which can ensure the accuracy of the calculation results.

If it is determined that the keyword and the estimated number of sentences satisfy the fourth condition, then based on the word association map, determine a list of words that are related to each keyword;

Summarize the lists corresponding to all keywords to determine the list of subject words;

The paragraph subject headings are clustered, and the number of paragraphs is determined based on the result of the clustering.

Specifically, in the embodiment of the present invention, when the estimated number of sentences is greater than or equal to 15, the estimated number of sentences is compared with the number of keywords, and if the estimated number of sentences is greater than 1.5 times the number of keywords, the fourth processing mode is used to determine the number of paragraphs and the paragraph subject words. That is to say, the fourth condition may be that the estimated number of sentences is greater than or equal to the second threshold, and the estimated number of sentences is greater than a preset multiple of the number of keywords. The fourth condition corresponds to the fourth processing mode, that is, if the keywords and the estimated number of sentences satisfy the fourth condition, the fourth processing mode is used to determine the number of paragraphs and the paragraph subject words of the text to be generated, as shown in FIG. 2.
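The choice between the third and fourth conditions can be sketched as a small selector. The values 15 and 1.5 come from the text; the function name and the string labels are hypothetical. The sketch returns `None` for texts with fewer than 15 estimated sentences, because the first and second conditions are not modelled here.

```python
from typing import Optional


def select_mode_for_long_text(
    num_keywords: int,
    estimated_sentences: int,
    second_threshold: int = 15,     # value given in the text
    preset_multiple: float = 1.5,   # value given in the text
) -> Optional[str]:
    """Return "third" or "fourth" for long texts, or None for short texts."""
    if estimated_sentences < second_threshold:
        return None  # first or second condition applies (not modelled here)
    if estimated_sentences <= preset_multiple * num_keywords:
        return "third"
    return "fourth"
```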

The fourth processing mode is specifically as follows: the paragraph subject words are determined similarly to the second processing mode, and the number of paragraphs is determined similarly to the third processing mode. First, related words are searched for each keyword to obtain the corresponding related word lists, and the words in all related word lists are aggregated, deduplicated, and put into a list, which is the subject word list. If the number of words in the subject word list is greater than or equal to the second preset threshold, the situation is similar to the third processing mode: word clustering is performed on the subject word list to obtain the number of paragraphs and the paragraph subject words of each paragraph. If the number of words in the subject word list is less than the second preset threshold, similar word matching needs to be performed: similar words are matched for each keyword according to the second processing mode, the matched similar words are used to match related words, and finally all related words are collected into a list, which is the subject word list. If the number of words in this subject word list is still less than the second preset threshold, the similarity matching threshold is reduced and related words are obtained again. If the number of words in the subject word list is greater than the second preset threshold and the second preset threshold is greater than 6 (a subject word list whose second preset threshold does not exceed 6 is required), words are randomly selected from the subject word list without replacement until the number of words in the final subject word list, divided by 0.6, minus the estimated number of sentences, gives a value between 0 and 10. The words in the subject word list thus obtained are the paragraph subject words.
The subject word list is then clustered to obtain the number of paragraphs in the article and the subject word list corresponding to each paragraph.

The embodiment of the present invention provides a method for determining the number of paragraphs and the subject headings of the text to be generated, which is realized by the fourth processing mode, so that the generated text can be divided into paragraphs intelligently without applying a template. Moreover, the fourth processing mode is suitable for a large number of estimated sentences, which can ensure the accuracy of the calculation results.

On the basis of the above embodiments, in the automatic text generation method provided in the embodiments of the present invention, the first processing mode and the second processing mode may be combined into one, with the second processing mode as the main one; the third processing mode and the fourth processing mode may be combined into one, with the fourth processing mode as the main one.

On the basis of the above embodiment, in the automatic text generation method provided in the embodiment of the present invention, the word association map is specifically constructed by the following method:

Specifically, the entities in the word association graph are words, and the relationships between entities are of two types: a similarity relationship and a related relationship. Triples for the similarity relationship are obtained as follows: a BERT semantic model is obtained through corpus training, and the BERT semantic model is then used to convert each word into a semantic vector of the target dimension. The target dimension can be controlled by a parameter, with values such as 64, 128, 256, or 512. After the semantic vectors of the words are obtained, the cosine similarity is calculated for every pair of semantic vectors, and the value of the cosine similarity is the similarity between the words represented by the two semantic vectors. The obtained similarity value is stored in the graph database as an attribute of the similarity relationship between the two words, so that it can be used in queries. Triples for the related relationship are obtained as follows: dependency analysis is performed on sentences to obtain the dependencies between words, and the words that have a dependency are then stored in the graph database as related-relationship triples.
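The pairwise cosine similarity step can be illustrated with plain Python. This is a sketch only: the vectors are assumed to come from a trained semantic model, the tuple layout `(word, "similar", word, score)` is a hypothetical representation of a similarity triple with its attribute, and the toy vectors in the usage note are not real embeddings.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_triples(
    vectors: Dict[str, Sequence[float]],
) -> List[Tuple[str, str, str, float]]:
    """One (word, "similar", word, score) tuple for every unordered word pair."""
    return [
        (w1, "similar", w2, cosine_similarity(vectors[w1], vectors[w2]))
        for w1, w2 in combinations(sorted(vectors), 2)
    ]
```

For example, `similarity_triples({"a": [1, 0], "c": [2, 0]})` yields a single tuple for the pair `("a", "c")`, whose vectors point in the same direction. Computing every pair is quadratic in the vocabulary size, so a real corpus would likely need batching or an approximate nearest-neighbour index.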

The graph database used in the embodiment of the present invention is the Neo4j database, the development language is Python, and the Cypher language is invoked through the interface of the py2neo library to perform addition, deletion, modification, and query operations on the database.
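As a rough illustration of storing a similarity relationship, the sketch below only builds a parameterised Cypher statement and does not connect to a database. The node label `Word`, the relationship type `SIMILAR`, and the property names `name` and `score` are hypothetical; the text does not describe the actual graph schema.

```python
from typing import Dict, Tuple, Union

Params = Dict[str, Union[str, float]]


def similar_relation_statement(
    word_a: str, word_b: str, score: float
) -> Tuple[str, Params]:
    """Build a Cypher MERGE statement and its parameters for one similarity
    relationship. Labels and property names are assumptions."""
    query = (
        "MERGE (a:Word {name: $a}) "
        "MERGE (b:Word {name: $b}) "
        "MERGE (a)-[r:SIMILAR]->(b) "
        "SET r.score = $score"
    )
    return query, {"a": word_a, "b": word_b, "score": score}
```

With py2neo, such a statement could presumably be executed by passing the query string and the parameters to `Graph.run`, given a reachable Neo4j instance; the connection details are deployment-specific and are not covered by the text.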

Due to the construction of the word association graph, there is no shortage of subject words in the process of use when judging based on the default parameters.

On the basis of the foregoing embodiment, the automatic text generation method provided in the embodiment of the present invention further includes: verifying the generated text.

Specifically, the verification of the generated text is to perform right and wrong analysis and correction, so that the generated text conforms to the current grammatical rules and the sentences are fluent.

First, the first word of a sentence must not be a particle or modal particle such as le, 地, or ah. A dictionary containing such words is created. If the first word of a generated sentence is included in the dictionary, the Transformer model regenerates a new sentence, which replaces the original sentence.
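The first-word check can be sketched as follows. The dictionary entries are illustrative guesses at the kind of particles meant, the sentence is assumed to be already segmented into words, and `regenerate` is a hypothetical callback standing in for the generation model. The retry limit is an assumption added so the loop always terminates.

```python
from typing import Callable, FrozenSet, List

# Illustrative entries only; the actual dictionary is not given in the text.
LEADING_PARTICLES: FrozenSet[str] = frozenset({"了", "地", "啊"})


def starts_with_particle(
    tokens: List[str], banned: FrozenSet[str] = LEADING_PARTICLES
) -> bool:
    """True if the first word of a segmented sentence is in the dictionary."""
    return bool(tokens) and tokens[0] in banned


def repair_sentence(
    tokens: List[str],
    regenerate: Callable[[], List[str]],  # hypothetical model call
    max_attempts: int = 5,                # assumed retry limit
) -> List[str]:
    """Regenerate the sentence while it starts with a banned particle."""
    attempts = 0
    while starts_with_particle(tokens) and attempts < max_attempts:
        tokens = regenerate()
        attempts += 1
    return tokens
```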

Secondly, in Chinese, the collocation between words follows certain grammatical structures, such as adjectives followed by nouns, verbs followed by adverbs, and so on. The system performs dependency syntax analysis and part-of-speech tagging on the generated text (which can be implemented with the ltp and hanlp libraries) and judges the result against established rules (for example, a verb-object structure should correspond to a verb and a noun, and an adverbial structure should correspond to adverbs and adjectives). Sentences that do not conform to the rules are regenerated with the Transformer model.

FIG. 3 is a complete schematic flowchart of the automatic text generation method provided in the embodiment of the present invention. In FIG. 3, on the one hand, the keyword information and target quantity input by the user are obtained, from which the keywords of the text to be generated and the estimated number of sentences are determined; on the other hand, a word association map is constructed through word extraction. Then, based on the keywords, the estimated number of sentences, and the constructed word association graph, the number of paragraphs and the paragraph subject words of the text to be generated are determined. Then, the text to be generated is generated by the Transformer model. Finally, the generated text is verified.

As shown in FIG. 4, on the basis of the above-mentioned embodiment, an embodiment of the present invention provides an automatic text generation device, including an acquisition module 41, a determination module 42, and a text generation module 43, wherein:

The obtaining module 41 is used to obtain the keywords of the text to be generated and the number of estimated sentences respectively;

The determining module 42 is configured to determine the number of paragraphs and the subject words of the paragraphs of the text to be generated based on the keywords, the number of estimated sentences and the pre-built word association map;

The text generation module 43 is configured to generate the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.

Specifically, the functions of the modules in the automatic text generation device provided in the embodiments of the present invention correspond one-to-one with the operation procedures of the steps in the above-mentioned method embodiments, and the achieved effects are also the same. For details, refer to the above-mentioned embodiments. This is not repeated in this embodiment of the present invention.
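The three-module structure can be pictured as a simple pipeline. The sketch below is only a structural illustration: the class name, the callable signatures, and the representation of paragraph subject words as a list of word lists are all assumptions, and each module is reduced to a plain function.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TextGenerationDevice:
    # Acquisition module: user input -> (keywords, estimated sentence count)
    acquire: Callable[[str], Tuple[List[str], int]]
    # Determination module: -> one subject word list per paragraph
    determine: Callable[[List[str], int], List[List[str]]]
    # Text generation module: paragraph subject words -> generated text
    generate: Callable[[List[List[str]]], str]

    def run(self, user_input: str) -> str:
        """Chain the three modules in the order used by the method."""
        keywords, estimated_sentences = self.acquire(user_input)
        paragraph_topics = self.determine(keywords, estimated_sentences)
        return self.generate(paragraph_topics)
```

In this picture the number of paragraphs is simply the length of the list returned by the determination step.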

On the basis of the foregoing embodiment, in the automatic text generation device provided in the embodiment of the present invention, the determining module is specifically used for:

Summarize all related word lists to determine the subject word list;

On the basis of the above-mentioned embodiment, the automatic text generation device provided in the embodiment of the present invention further includes: a graph construction module, which is used for:

Obtain the semantic vector of each sample word in the corpus, and calculate the similarity between the semantic vectors of any two sample words, and the similarity is used to characterize the similarity between the any two sample words;

On the basis of the above embodiment, in the automatic text generation device provided in the embodiment of the present invention, the acquisition module is specifically used for:

Determine the target word count of the text to be generated;

Based on the target word count, the estimated sentence count is determined.

FIG. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520, and the memory 530 communicate with each other through the communication bus 540. The processor 510 can call the logic instructions in the memory 530 to execute the automatic text generation method, the method including: respectively obtaining the keywords of the text to be generated and the estimated number of sentences; determining, based on the keywords, the estimated number of sentences, and the pre-built word association map, the number of paragraphs and the paragraph subject words of the text to be generated; and generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.

In addition, the above-mentioned logic instructions in the memory 530 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product. Based on such understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.

On the other hand, an embodiment of the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the automatic text generation method provided by the above method embodiments, the method including: respectively acquiring the keywords of the text to be generated and the estimated number of sentences; determining, based on the keywords, the estimated number of sentences, and the pre-built word association map, the number of paragraphs and the paragraph subject words of the text to be generated; and generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.

In yet another aspect, embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the automatic text generation method provided by the above embodiments is implemented, the method including: respectively acquiring the keywords of the text to be generated and the estimated number of sentences; determining, based on the keywords, the estimated number of sentences, and a pre-built word association map, the number of paragraphs and the paragraph subject words of the text to be generated; and generating the text to be generated based on the Transformer model, the number of paragraphs of the text to be generated, and the paragraph subject words.

The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above-mentioned technical solutions in essence, or the parts that contribute to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in some parts of the embodiments.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

A method for automatic text generation, comprising:

Obtain the keywords of the text to be generated and the estimated number of sentences respectively;

Determine the number of paragraphs and the subject words of the paragraphs of the text to be generated based on the keywords, the estimated number of sentences, and the pre-built word association map;

The to-be-generated text is generated based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph subject words.
The automatic text generation method according to claim 1, wherein the number of paragraphs and the subject words of the paragraphs of the text to be generated are determined based on the keywords, the estimated number of sentences, and a pre-built word association map , including:

If it is determined that the keyword and the estimated number of sentences satisfy the first condition, the number of paragraphs is determined to be the default number, and based on the word association map, a related word list composed of words that are related to each keyword is determined;

Summarize the list of related words corresponding to all keywords to determine the list of subject words;

Based on the estimated number of sentences, a reserved number of words is determined, and the paragraph topic word is determined based on the words of the reserved word number in the topic word list.
The automatic text generation method according to claim 1, wherein the number of paragraphs and the subject words of the paragraphs of the text to be generated are determined based on the keywords, the estimated number of sentences, and a pre-built word association map , including:

If it is determined that the keyword and the estimated number of sentences satisfy the second condition, the number of paragraphs is determined as the default number, and based on the word association map, the number of words that are related to each keyword is determined;

If the number of the words corresponding to any one of the keywords is less than or equal to the first preset threshold, then based on the word association map, determine the similar words that have a similar relationship with the any keyword, and based on the word association map , determine the list of related words for each similar word;

Summarize all related word lists to determine the subject word list;

If it is determined that the number of words in the topic word list is greater than or equal to a second preset threshold, the paragraph topic word is determined based on the words in the topic word list.
The automatic text generation method according to claim 1, wherein the number of paragraphs and the subject words of the paragraphs of the text to be generated are determined based on the keywords, the estimated number of sentences, and a pre-built word association map , including:

If it is judged that the keyword and the estimated number of sentences meet the third condition, then based on the word association map, determine a related word list composed of words that are related to each keyword;

Summarize the list of related words corresponding to all keywords to determine the list of subject words;

Based on the estimated number of sentences, determine the number of reserved words, and determine the paragraph subject words based on the words in the reserved number of words in the subject word list;

The paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
The automatic text generation method according to claim 1, wherein the number of paragraphs and the subject words of the paragraphs of the text to be generated are determined based on the keywords, the estimated number of sentences, and a pre-built word association map , including:

If it is determined that the keyword and the estimated number of sentences satisfy the fourth condition, then, based on the word association map, determine a related word list composed of words that are related to each keyword;

Summarize the list of related words corresponding to all keywords to determine the list of subject words;

If it is judged that the number of words in the topic word list is greater than or equal to a second preset threshold, determining the paragraph topic word based on the words in the topic word list;

The paragraph subject headings are clustered, and the number of paragraphs is determined based on a result of the clustering.
The automatic text generation method according to any one of claims 1-5, wherein the word association map is specifically constructed by the following method:

Obtain the semantic vector of each sample word in the corpus, and calculate the similarity between the semantic vectors of any two sample words, where the similarity is used to represent the similarity relationship between the any two sample words;

Performing a dependency analysis on the any two sample words in the corpus, and determining a dependency between the any two sample words, where the dependency is used to represent the correlation between the any two sample words;

The word association map is constructed based on the similarity relationship between the any two sample words and the correlation relationship between the any two sample words.
The automatic text generation method according to any one of claims 1-5, wherein the estimated number of sentences is specifically obtained by the following method:

Determine the target word count of the text to be generated;

Based on the target word count, the estimated sentence count is determined.
A device for automatic text generation, comprising:

The obtaining module is used to obtain the keywords of the text to be generated and the estimated number of sentences respectively;

A determination module, configured to determine the number of paragraphs and the subject words of the paragraphs of the text to be generated based on the keywords, the estimated number of sentences and the pre-built word association map;

A text generation module, configured to generate the to-be-generated text based on the Transformer model, the number of paragraphs of the to-be-generated text, and the paragraph subject words.
An electronic device, comprising a memory, a processor and a computer program stored on the memory and running on the processor, characterized in that, when the processor executes the program, the steps of the automatic text generation method according to any one of claims 1 to 7 are implemented.
A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the automatic text generation method according to any one of claims 1 to 7 are implemented.