CN113688633A

CN113688633A - Outline determination method and device

Info

Publication number: CN113688633A
Application number: CN202110880841.6A
Authority: CN
Inventors: 王浪; 陈启贤; 余燕
Original assignee: Beijing Kingsoft Office Software Inc; Zhuhai Kingsoft Office Software Co Ltd
Current assignee: Beijing Kingsoft Office Software Inc; Zhuhai Kingsoft Office Software Co Ltd
Priority date: 2021-08-02
Filing date: 2021-08-02
Publication date: 2021-11-23

Abstract

The embodiment of the invention provides an outline determination method and device, wherein the method comprises the following steps: obtaining description information of a text to be generated; obtaining semantic features of the description information as first semantic features; and determining the outline of the text to be generated from the preset outlines based on the semantic features of the preset outlines and the first semantic features. Applying the scheme provided by the embodiment of the invention to determine an outline can improve the efficiency of outline determination.

Description

Outline determination method and device

Technical Field

The invention relates to the technical field of data processing, in particular to an outline determination method and device.

Background

Automatic text generation is a research branch of natural language processing in which an electronic device generates text. The text generated by the electronic device can assist the user in efficiently authoring high-quality text. Before generating the text, it is generally necessary to determine an outline of the text, and the electronic device then generates the text based on the outline.

In the prior art, an outline is usually manually selected from a large number of outlines by a worker. However, this method is time-consuming and labor-intensive, resulting in inefficient outline determination.

Disclosure of Invention

The embodiment of the invention aims to provide an outline determination method and device so as to improve the efficiency of outline determination. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides an outline determination method, where the method includes:

obtaining description information of a text to be generated;

obtaining semantic features of the description information as first semantic features;

and selecting the outline of the text to be generated from the preset outlines based on the semantic features of the preset outlines and the first semantic features.

In an embodiment of the present invention, the determining the outline of the text to be generated from the preset outline based on the semantic features of the preset outline and the first semantic features includes:

and selecting the outline of the text to be generated from the preset outlines based on the similarity between the semantic features of the preset outlines and the first semantic features.

In an embodiment of the present invention, the selecting the outline of the text to be generated from the preset outline based on the similarity between the semantic features of the preset outline and the first semantic features includes:

based on the similarity between the semantic features of the clustering centers of all outline groups and the first semantic features, selecting outline groups to which the outlines of the text to be generated belong from all outline groups as candidate outline groups, wherein each outline group is an outline group obtained by clustering according to the similarity between the semantic features of the outlines;

and selecting the outline of the text to be generated from all the outlines of the candidate outline group according to the similarity between the semantic features of all the outlines in the candidate outline group and the first semantic features.

In an embodiment of the present invention, selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature includes:

selecting, from the candidate outline groups, the candidate outline group with the highest similarity between the semantic features of its clustering center and the first semantic features;

and selecting the outline of the text to be generated from all the outlines in the selected candidate outline group according to the similarity between the semantic features of all the outlines in the selected candidate outline group and the first semantic features.

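In an embodiment of the present invention, the selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature includes:

calculating the similarity between the semantic features of each outline in the candidate outline group and the first semantic features;

and selecting, in descending order of the similarities corresponding to the outlines, a first preset number of outlines from the outlines as the outlines of the text to be generated.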

In an embodiment of the present invention, the selecting, based on the similarity between the semantic feature of the clustering center of each outline group and the first semantic feature, an outline group to which an outline of the text to be generated belongs from each outline group includes:

calculating the similarity between the semantic features of the clustering centers of all outline groups and the first semantic features;

selecting a second preset number of outline groups from the outline groups according to the sequence of the corresponding similarity of all the outline groups from high to low;

determining, for each selected outline group, the number of outlines that contain the description information;

and determining, according to the determined numbers of outlines, the outline group to which the outline of the text to be generated belongs from the selected outline groups.
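
The four steps above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `pick_group_by_keyword_count`, the data layout, and the rule of keeping the group with the largest count are assumptions, since the text only states that the group is determined according to the determined number of outlines.

```python
def pick_group_by_keyword_count(groups, center_similarity, keywords, second_preset_number=2):
    """Choose an outline group among the groups whose centers are most similar.

    groups: mapping of group name -> list of outline strings (hypothetical layout).
    center_similarity: mapping of group name -> similarity between the group's
        clustering center and the first semantic features (computed elsewhere).
    keywords: description information, reduced here to plain keywords.
    """
    # Keep the second preset number of groups, in descending order of similarity.
    ranked = sorted(groups, key=lambda name: center_similarity[name], reverse=True)
    shortlisted = ranked[:second_preset_number]

    def count_matching(name):
        # Number of outlines in the group that contain at least one keyword.
        return sum(
            1
            for outline in groups[name]
            if any(k.lower() in outline.lower() for k in keywords)
        )

    counts = {name: count_matching(name) for name in shortlisted}
    # Assumption: the group with the most matching outlines is selected.
    return max(shortlisted, key=counts.get), counts
```

Groups outside the shortlist are never counted, so a group with many keyword matches but a dissimilar clustering center cannot be selected.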

In an embodiment of the present invention, the method further includes:

and selecting the paragraphs of the text to be generated from preset paragraphs corresponding to the outline of the text to be generated.

In an embodiment of the present invention, the selecting a paragraph of the text to be generated from a preset paragraph corresponding to the outline of the text to be generated includes:

and selecting the paragraph of the text to be generated from the preset paragraphs corresponding to the outline of the text to be generated based on the similarity between the semantic features of the preset paragraphs corresponding to the outline of the text to be generated and the first semantic features.

In an embodiment of the present invention, the preset paragraphs are determined in the following manner:

acquiring a pre-selected text corresponding to a preset outline;

and extracting paragraphs from each paragraph corresponding to the preset outline in the text to serve as the preset paragraphs corresponding to the outline.

In an embodiment of the present invention, the extracting paragraphs from the paragraphs corresponding to the predetermined outline in the text as the predetermined paragraphs corresponding to the outline includes:

determining feature information of each paragraph corresponding to the preset outline in the text;

selecting an alternative paragraph from the paragraphs based on the feature information of the paragraphs, and determining the alternative paragraph as a preset paragraph corresponding to the preset outline in the text.

In an embodiment of the present invention, the determining the alternative paragraphs as preset paragraphs corresponding to the preset outline in the text includes:

determining semantic features of the alternative paragraphs and word sense features of words in the alternative paragraphs for each alternative paragraph, inputting the determined semantic features and word sense features into a pre-trained paragraph quality evaluation model to obtain a quality score value of the alternative paragraphs, and taking the paragraphs with the quality score values larger than a preset quality score threshold value as preset paragraphs corresponding to the preset outline in the text;

wherein the paragraph quality evaluation model is a model that is obtained by training a preset neural network model, with the semantic features of a sample paragraph and the word sense features of each word in the sample paragraph as model input and the labeled quality score value of the sample paragraph as a training reference, and that is used for obtaining the quality score value of a paragraph.

In an embodiment of the present invention, the method further includes:

and sequencing the selected paragraphs of the text to be generated based on the outline of the text to be generated, and generating the text containing the outline of the text to be generated and the sequenced paragraphs.
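
The ordering step can be illustrated with a short sketch. The function name `generate_text` and the mapping from outline entry to selected paragraph are hypothetical; the sketch only shows paragraphs being arranged in outline order and emitted together with the outline.

```python
def generate_text(outline: list[str], paragraphs: dict[str, str]) -> str:
    """Arrange selected paragraphs in outline order and emit outline plus paragraphs.

    outline: outline entries in their intended order.
    paragraphs: mapping of outline entry -> selected paragraph (hypothetical layout).
    """
    lines = []
    for entry in outline:
        lines.append(entry)  # the generated text contains the outline itself
        if entry in paragraphs:
            lines.append(paragraphs[entry])  # paragraph follows its outline entry
    return "\n".join(lines)
```

The insertion order of `paragraphs` does not matter here, because the output order is driven entirely by `outline`.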

In an embodiment of the present invention, the description information includes at least one of the following information: user portrait, keywords, entity words, key sentences, and text types.

In a second aspect, an embodiment of the present invention provides an outline determining apparatus, where the apparatus includes:

the information acquisition module is used for acquiring the description information of the text to be generated;

the characteristic obtaining module is used for obtaining semantic characteristics of the description information as first semantic characteristics;

and the outline selection module is used for selecting the outline of the text to be generated from the preset outlines based on the semantic features of the preset outlines and the first semantic features.

In an embodiment of the present invention, the outline selecting module is specifically configured to select the outline of the text to be generated from the preset outlines based on a similarity between a semantic feature of the preset outline and the first semantic feature.

In an embodiment of the present invention, the outline selecting module includes:

the outline group selection submodule is used for selecting an outline group to which the outline of the text to be generated belongs from each outline group as a candidate outline group based on the similarity between the semantic features of the clustering centers of the outline groups and the first semantic features, wherein each outline group is an outline group obtained by clustering according to the similarity between the semantic features of the outlines;

and the outline selection submodule is used for selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features.

In an embodiment of the present invention, the outline selection sub-module is specifically configured to select, from the candidate outline groups, a candidate outline group with a highest similarity between the semantic features of the clustering center and the first semantic features; and selecting the outline of the text to be generated from all the outlines in the selected candidate outline group according to the similarity between the semantic features of all the outlines in the selected candidate outline group and the first semantic features.

In an embodiment of the present invention, the outline selection sub-module is specifically configured to calculate a similarity between a semantic feature of each outline in the candidate outline group and the first semantic feature; and according to the sequence of the similarity corresponding to each outline from high to low, selecting the first preset number of outlines from the outlines as the outlines of the text to be generated.

In an embodiment of the invention, the outline group selection submodule includes:

the similarity calculation unit is used for calculating the similarity between the semantic features of the clustering centers of all outline groups and the first semantic features;

the outline group selection unit is used for selecting a second preset number of outline groups from the outline groups according to the sequence from high to low of the corresponding similarity of each outline group obtained through calculation;

a quantity determining unit, configured to determine the outline quantity of the outline containing the description information in each selected outline group;

and the outline group determining unit is used for determining the outline group to which the outline of the text to be generated belongs from the selected outline group according to the determined outline number.

In an embodiment of the present invention, the apparatus further includes a paragraph selection module,

the paragraph selection module is specifically configured to select a paragraph of the text to be generated from a preset paragraph corresponding to the outline of the text to be generated.

In an embodiment of the present invention, the paragraph selection module is specifically configured to select a paragraph of the text to be generated from a preset paragraph corresponding to the outline of the text to be generated based on a similarity between a semantic feature of the preset paragraph corresponding to the outline of the text to be generated and the first semantic feature.

In an embodiment of the present invention, the apparatus further includes a preset paragraph determining module, where the preset paragraph determining module includes:

the text acquisition submodule is used for acquiring a preselected text corresponding to the preset outline;

and the paragraph determining submodule is used for extracting paragraphs from each paragraph corresponding to the preset outline in the text to serve as the preset paragraphs corresponding to the outline.

In an embodiment of the invention, the paragraph determination submodule includes:

the information determining unit is used for determining the characteristic information of each paragraph corresponding to the preset outline in the text;

a paragraph determining unit, configured to select, based on the feature information of each paragraph, an alternative paragraph from the paragraphs, and determine the alternative paragraph as a preset paragraph corresponding to the preset outline in the text.

In an embodiment of the present invention, the paragraph determining unit is specifically configured to determine, for each alternative paragraph, semantic features of the alternative paragraph and word sense features of words in the alternative paragraph, input the determined semantic features and word sense features into a pre-trained paragraph quality evaluation model, obtain a quality score value of the alternative paragraph, and use a paragraph with a quality score value greater than a preset quality score threshold as a preset paragraph corresponding to the preset outline in the text.

In an embodiment of the present invention, the apparatus further includes a text generation module,

the text generation module is specifically configured to sort the selected paragraphs of the text to be generated based on the outline of the text to be generated, and generate a text including the outline of the text to be generated and the sorted paragraphs.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor configured to implement the method steps of the first aspect when executing the program stored in the memory.

In a fourth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps described in the first aspect.

As can be seen from the above, when the outline is determined by applying the scheme provided by the embodiment of the present invention, the outline of the text to be generated is determined from the preset outlines based on the semantic features of the preset outlines and the first semantic features. Compared with the prior art, a worker does not need to manually determine the outline, so the outline determining efficiency is improved.

In addition, because the first semantic features can reflect the semantics expressed by the description information of the text to be generated, the semantic features of the preset outline can reflect the semantics expressed by each preset outline, and the outline of the text to be generated can be more accurately determined from the preset outlines according to the two types of information.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a first outline determination method according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of an outline selection method according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of a second outline determination method according to an embodiment of the present invention;

Fig. 4 is a schematic flow chart of a third outline determination method according to an embodiment of the present invention;

Fig. 5 is a block flow diagram of a paragraph obtaining method according to an embodiment of the present invention;

Fig. 6 is a flowchart of a text information obtaining method according to an embodiment of the present invention;

Fig. 7 is a block diagram of a process for calculating semantic similarity based on a model according to an embodiment of the present invention;

Fig. 8 is a flowchart of a text generation method according to an embodiment of the present invention;

Fig. 9a is a flowchart illustrating a preset paragraph obtaining process according to an embodiment of the present invention;

Fig. 9b is a schematic flowchart of a quality evaluation process of an alternative paragraph according to an embodiment of the present invention;

Fig. 9c is a flowchart illustrating an Attention mechanism according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of a first outline determining apparatus according to an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of an outline selection module according to an embodiment of the present invention;

Fig. 12 is a schematic structural diagram of a second outline determining apparatus according to an embodiment of the present invention;

Fig. 13 is a schematic structural diagram of a third outline determining apparatus according to an embodiment of the present invention;

Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic flow chart of a first outline determining method according to an embodiment of the present invention, where the method includes S101 to S103.

The execution subject of the embodiment of the present invention may be an electronic device, for example: servers, notebook computers, etc.

S101: obtaining the description information of the text to be generated.

The text to be generated can be news, articles, miscellaneous text and the like.

The description information of the text to be generated can be understood as: information describing the text to be generated, for example: the description information of the text to be generated may include a keyword, a title, a text type, and the like of the text to be generated.

The description information may reflect basic features of the text to be generated to a certain extent, such as its type, style, and keywords, and may therefore also be referred to as feature information.

Specifically, the information for describing the text to be generated, which is input by the user, may be used as the description information of the text to be generated.

In one embodiment, the user may directly enter the descriptive information of the text to be generated.

For example: the user inputs keywords, types, titles and the like of the text to be generated, and the electronic equipment takes the information input by the user as description information of the text to be generated.

In another embodiment, the user may also input a description text segment of the text to be generated, and the electronic device extracts the description information of the description text segment to obtain the description information of the text to be generated.

Specifically, after obtaining the description text segment, the electronic device may perform cleaning and filtering on the description text segment, for example: the above description text segment may be subjected to sensitive word filtering, stop word filtering, and the like, and then the description information of the cleaned and filtered description text segment is extracted.
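
The cleaning and extraction steps can be sketched as follows. The word lists, the function names, and the frequency-based keyword extraction are all assumptions made for illustration; the text does not specify how the description information is extracted from the cleaned segment.

```python
import re

# Hypothetical word lists; a real system would load maintained resources.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "for", "about", "please", "write"}
SENSITIVE_WORDS = {"forbiddenword"}


def clean_description(segment: str) -> list[str]:
    """Tokenize a description text segment and drop stop words and sensitive words."""
    tokens = re.findall(r"[a-z0-9]+", segment.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in SENSITIVE_WORDS]


def extract_keywords(tokens: list[str], limit: int = 5) -> list[str]:
    """Keep the most frequent tokens; ties keep their order of first appearance."""
    counts: dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # dicts preserve insertion order and sorted() is stable, so ties stay ordered.
    return sorted(counts, key=lambda t: -counts[t])[:limit]
```

For example, `extract_keywords(clean_description(text), limit=3)` yields up to three keywords that could serve as description information.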

In another embodiment, the electronic device may further obtain a user portrait, and use both the description information input by the user and the user portrait as description information of the text to be generated.

Specifically, when the user portrait is obtained, the user portrait may be obtained based on a user identifier and a preset correspondence between the user identifier and the user portrait. The user identification may be an ID number, a login name, etc. of the user.

The user portrait may be understood as information describing the user. The user portrait may include descriptive information in a plurality of different dimensions, such as: attribute information, interest information, behavior information, scene information, etc. of the user.

S102: obtaining semantic features of the description information as first semantic features.

Semantic features reflect the semantics expressed by an object. Semantic features of several different objects appear in the embodiments of the present invention; in each case, the semantic features of an object reflect the semantics expressed by that object.

The semantic features of the description information are used to reflect the semantics expressed by the description information. When obtaining the semantic features of the description information, semantic features characterized in a vectorized form may be obtained. For example: the description information can be vectorized and coded, and the coded result is used as the semantic features of the description information. Alternatively, the semantic information of the description information can be analyzed, and semantic features are extracted based on the analysis result.
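
As a rough illustration of vectorized coding, the sketch below maps text to a fixed-length vector with the hashing trick. This is a lexical stand-in chosen so the example stays self-contained; it is an assumption, not the encoder of the embodiment, and a practical system would more likely use a trained sentence-embedding model. The function name `encode` and the dimension of 64 are hypothetical.

```python
import hashlib
import math


def encode(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-length, L2-normalized bag-of-words vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # sha256 serves only as a stable hash; built-in hash() is salted per process.
        index = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty text yields the zero vector, which is returned unchanged.
    return [v / norm for v in vec] if norm else vec
```

Because every non-empty result has unit length, the dot product of two such vectors equals their cosine similarity.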

S103: determining the outline of the text to be generated from the preset outlines based on the semantic features of the preset outlines and the first semantic features.

The outline is used to reflect the structural information of the text. Specifically, the outline may be each subtitle in the text, and the outline may also be a central sentence of each paragraph in the text.

For example: taking patent text as an example, the outline of patent text may include: the specification abstract, the description of the drawings, the claims, the specification, the drawings of the specification; taking an academic paper as an example, the outline of the academic paper may include: abstract, keywords, specific content, references.

Because each subtitle of the text is usually a summarizing title, the central sentence of each paragraph in the text is usually used for summarizing the central thought of the paragraph, and the text is usually formed according to a certain logic sequence based on the central thought of each paragraph or the content of each subtitle, the outline can more accurately reflect the structural information of the text.

The preset outlines may be outlines extracted from various types of texts acquired in advance. Specifically, the electronic device may store historical texts, extract outlines from the stored texts, and use the extracted outlines as the preset outlines. The electronic device may also use an automatic crawler system to periodically crawl texts from specified websites and store them in a database, thereby obtaining the incremental texts of the specified websites. By monitoring text information on the Internet in real time, exploring text data sources, and detecting changes in the text data, a large number of texts can be obtained; the outlines in the obtained texts are then extracted and stored as the preset outlines.

The semantic features of a preset outline reflect the semantics expressed by that outline. Specifically, the electronic device may perform vectorization coding on each preset outline in advance and use the coded vector as the semantic features of that preset outline.

In an embodiment of the present invention, the outline of the text to be generated may be selected from the preset outline based on a similarity between the semantic features of the preset outline and the first semantic features.

The similarity between the semantic features of the preset outline and the first semantic features may be determined by calculating a distance between the semantic features of the preset outline and the first semantic features, and determining the similarity based on the calculated distance. The distance may be a Euclidean distance, a cosine distance, or the like. For example: a distance similarity conversion algorithm may be employed to convert the calculated distances into similarities.
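
The two distances named above, together with one possible conversion to a similarity, can be sketched as follows. The conversion `1 / (1 + d)` is an assumption picked for illustration, because the text does not name a specific distance similarity conversion algorithm.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def distance_to_similarity(distance: float) -> float:
    """Assumed conversion: distance 0 maps to 1, larger distances approach 0."""
    return 1.0 / (1.0 + distance)
```

With this conversion, a smaller distance always yields a larger similarity, so ranking by similarity and ranking by distance select the same outlines.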

When the outline of the text to be generated is selected from the preset outlines based on the similarity between the semantic features of the preset outlines and the first semantic features, a preset number of preset outlines with the highest similarity can be selected as the outlines of the text to be generated; alternatively, the preset outlines whose similarity is greater than a preset similarity threshold can be selected as the outlines of the text to be generated.

The preset number may be set by a worker according to experience, for example: the preset number may be 1, 2, 3, 4, etc. Taking a preset number of 3 as an example, the 3 preset outlines with the highest similarity are selected as the outlines of the text to be generated.
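
Both selection rules, the preset number and the similarity threshold, can be expressed in a few lines. The function names and the dictionary of precomputed similarities are hypothetical.

```python
def select_top_n(similarities: dict[str, float], n: int) -> list[str]:
    """Return the n outlines with the highest similarity, best first."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:n]


def select_above(similarities: dict[str, float], threshold: float) -> list[str]:
    """Return every outline whose similarity exceeds the preset threshold."""
    return [name for name, score in similarities.items() if score > threshold]
```

The first rule always returns a fixed number of outlines (when enough exist), whereas the second may return none or many, depending on how the threshold is set.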

In this way, the outline of the text to be generated is determined based on the similarity between the semantic features of the preset outline and the first semantic features, the first semantic features can reflect the semantics expressed by the description information of the text to be generated, and the semantic features of the preset outline can reflect the semantics expressed by each preset outline, so that the outline of the text to be generated can be determined more accurately based on the similarity between the semantic features.

As can be seen from the above, when the outline is determined by applying the scheme provided in this embodiment, the outline of the text to be generated is determined from the preset outlines based on the semantic features of the preset outlines and the first semantic features. Compared with the prior art, a worker does not need to manually determine the outline, so the outline determining efficiency is improved.

In an embodiment of the present invention, the preset outlines may be: outlines included in a plurality of outline groups obtained by clustering according to the similarity among the semantic features of the outlines.

Specifically, the semantic features of the preset outlines may be clustered based on the similarity between the semantic features of the preset outlines. For example: the semantic features of each preset outline can be clustered by adopting a K-means (K-means clustering) algorithm to obtain a clustered outline group.
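
A compact version of K-means (Lloyd's algorithm) over outline feature vectors is sketched below. Seeding the centroids with the first `k` vectors and running a fixed number of iterations are simplifications made so the example is deterministic; they are assumptions, and a production system would more likely rely on a library implementation with better initialization.

```python
import math


def kmeans(vectors: list[list[float]], k: int, iterations: int = 20):
    """Cluster vectors into k groups; returns (labels, centroids)."""
    # Simplification: the first k vectors seed the centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Outlines that share a label form one outline group. Note that the centroid is a mean vector and need not coincide with the feature vector of any actual outline.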

After the clustered outline groups are determined, a clustering center is determined in each outline group, where the clustering center refers to an outline with high similarity to each of the preset outlines included in the outline group.
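
One way to read this definition is that the clustering center is the member outline with the highest average similarity to the other members of its group. The sketch below follows that reading, which is an assumption; the function names and the use of cosine similarity are likewise illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_center(group: dict[str, list[float]]) -> str:
    """Return the outline with the highest mean similarity to the other outlines."""

    def mean_similarity(name: str) -> float:
        others = [vec for other, vec in group.items() if other != name]
        if not others:
            return 1.0  # a single-member group is its own center
        return sum(cosine_similarity(group[name], vec) for vec in others) / len(others)

    return max(group, key=mean_similarity)
```

Under this reading the center is always a real outline, so its semantic features can be stored and compared in the same way as those of any other outline.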

Specifically, a distributed file storage system may be used to store semantic feature vector data of each outline in each outline group and semantic feature vector data of a cluster center, and store the semantic feature vector data of the cluster center of each clustered outline group in a memory of the electronic device.

On the basis of the above embodiment, referring to fig. 2, fig. 2 is a schematic flow chart of an outline selection method provided by the embodiment of the present invention. In step S103, the outline of the text to be generated can be selected from the preset outlines, based on the similarity between the semantic features of the preset outlines and the first semantic features, according to the following steps S201 to S202.

S201: selecting the outline group to which the outline of the text to be generated belongs from each outline group as a candidate outline group based on the similarity between the semantic features of the clustering center of each outline group and the first semantic features.

Each of the above outline groups is an outline group obtained by clustering according to the similarity between the semantic features of the outlines.

Because the outline groups are obtained by clustering according to the similarity between the semantic features of the outlines, each outline group has a clustering center. The clustering center refers to an outline with high similarity to the preset outlines included in the outline group, and the semantic features of the clustering center can reflect the overall semantic features of the outline group.

The first semantic features are the semantic features of the description information.

Specifically, because the semantic features of the clustering centers of the outline groups can be stored in the memory of the electronic device, when the similarity between the semantic features of the clustering centers of the outline groups and the first semantic features is calculated, the semantic features of the clustering centers of the outline groups can be obtained from the memory of the electronic device, so that the obtaining efficiency of the semantic features of the clustering centers is improved.

When determining the similarity between the semantic features of the cluster centers of the outline groups and the first semantic features, the distance between the semantic features of the preset cluster centers of the outline groups and the first semantic features may be calculated, and the similarity may be determined based on the calculated distance. The distance may be a euclidean distance, a cosine distance, or the like. For example: a distance similarity conversion algorithm may be employed to convert the calculated distances into similarities.

When the outline group to which the outline of the text to be generated belongs is selected from the outline groups as a candidate outline group, based on the similarity between the semantic features of the clustering centers of the outline groups and the first semantic features, either a preset number of outline groups with the highest similarity can be selected as candidate outline groups, or the outline groups whose similarity is greater than a preset similarity can be selected as candidate outline groups.

S202: Select the outline of the text to be generated from the outlines of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features.

Specifically, when determining the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, the distance between the semantic features of each outline and the first semantic features may be calculated, and the similarity may be determined based on the calculated distance. The distance may be a Euclidean distance, a cosine distance, or the like. For example, a distance-to-similarity conversion algorithm may be used to convert the calculated distance into a similarity.

Specifically, when the outline of the text to be generated is selected from the outlines of the candidate outline group according to this similarity, either a preset number of outlines with the highest similarity may be selected, or the outlines whose similarity is greater than a preset similarity may be selected.

Therefore, the outline group to which the outline of the text to be generated belongs is first selected from the clustered outline groups as the candidate outline group, and the outline of the text to be generated is then selected from the outlines of the candidate outline group. Because the outline groups are obtained by clustering based on the similarity between the semantic features of the outlines, similar outlines fall into the same outline group. Compared with selecting the outline of the text to be generated by examining every outline individually, selecting it in units of outline groups is more efficient.
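The two-stage selection of S201 and S202 can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is used as the similarity measure, and the data layout (a list of groups, each with a "center" vector and an "outlines" mapping) and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_outlines(first_feature, groups, num_groups=1, num_outlines=1):
    """Pick candidate groups by cluster center, then outlines inside them."""
    # S201: rank the outline groups by the similarity of their clustering center.
    ranked_groups = sorted(
        groups,
        key=lambda g: cosine_similarity(g["center"], first_feature),
        reverse=True,
    )
    candidate_groups = ranked_groups[:num_groups]

    # S202: only the outlines of the candidate groups are compared one by one.
    pool = [
        (name, feature)
        for group in candidate_groups
        for name, feature in group["outlines"].items()
    ]
    pool.sort(key=lambda item: cosine_similarity(item[1], first_feature), reverse=True)
    return [name for name, _ in pool[:num_outlines]]

groups = [
    {"center": [1.0, 0.0], "outlines": {"report-a": [0.9, 0.1], "report-b": [1.0, 0.3]}},
    {"center": [0.0, 1.0], "outlines": {"speech-a": [0.1, 0.9]}},
]
print(select_outlines([1.0, 0.1], groups))
```

With G groups and M outlines in total, the first stage needs G comparisons and the second stage only touches the outlines of the selected groups, which is where the efficiency gain over comparing all M outlines comes from.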

In an embodiment of the present invention, step S202, in which the outline of the text to be generated is selected from the outlines of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, may be implemented according to the following steps A1-A2.

Step A1: Select, from the candidate outline groups, the candidate outline group whose clustering center has the highest similarity between its semantic features and the first semantic features.

Specifically, the candidate outline group with the highest similarity may be selected from the candidate outline groups based on the similarity between the semantic feature of the cluster center of each candidate outline group and the first semantic feature.

Step A2: Select the outline of the text to be generated from the outlines in the selected candidate outline group according to the similarity between the semantic features of each outline in the selected candidate outline group and the first semantic features.

Specifically, when the outline of the text to be generated is selected, a preset number of outlines with the highest similarity may be selected as the outline of the text to be generated. Alternatively, the outlines whose similarity is greater than a preset similarity may be selected as the outline of the text to be generated.

In this way, because the similarity between the semantic features of the clustering center of the selected candidate outline group and the first semantic features is the highest, the semantic information represented by that clustering center is closest to the description information of the text to be generated. When the outline of the text to be generated is determined from the selected candidate outline group, the semantic information expressed by the determined outline can therefore be closest to the description information, which improves the accuracy of determining the outline of the text to be generated.

In an embodiment of the present invention, step S202, in which the outline of the text to be generated is selected from the outlines of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features, may be implemented according to the following steps B1-B2.

Step B1: Calculate the similarity between the semantic features of each outline in the candidate outline group and the first semantic features.

Specifically, when the similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature is calculated, the distance between the semantic feature of each outline and the first semantic feature may be calculated, and the similarity may be determined based on the calculated distance.

Step B2: In descending order of the similarity corresponding to each outline, select the first preset number of outlines from the outlines of the candidate outline groups as the outlines of the text to be generated.

The first preset number may be set by a worker according to experience. For example, the first preset number may be 6, 8, etc. Taking a first preset number of 6 as an example, the 6 outlines with the highest similarity can be selected.

After the similarity between the semantic features of each outline and the first semantic features is obtained, the first preset number of outlines are selected in descending order of the calculated similarity. It is understood that the selected outlines can belong to different candidate outline groups or to the same candidate outline group.

For example, assume that the outlines, in descending order of their corresponding similarity, are outline 1, outline 2, outline 3, outline 4 and outline 5, and that the first preset number is 3. The selected outlines are then outline 1, outline 2 and outline 3.

In this way, the first preset number of outlines are selected as the outlines of the text to be generated in descending order of the similarity corresponding to each outline. Because the similarity reflects how close the semantic information expressed by each outline is to the description information of the text to be generated, the accuracy of selecting the outline of the text to be generated is improved.
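Steps B1-B2 amount to a top-N selection over the pooled outlines of the candidate outline groups. The sketch below assumes the similarities of step B1 have already been computed and are held in a dictionary; the function names and the similarity values are illustrative only.

```python
import heapq

def top_outlines(similarities, first_preset_number):
    """Return the names of the outlines with the highest similarity,
    in descending order of similarity (step B2)."""
    best = heapq.nlargest(
        first_preset_number, similarities.items(), key=lambda item: item[1]
    )
    return [name for name, _ in best]

def outlines_above(similarities, preset_similarity):
    """Alternative rule: keep every outline above a preset similarity."""
    kept = [item for item in similarities.items() if item[1] > preset_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]

# Hypothetical similarities from step B1, pooled over all candidate groups.
similarities = {
    "outline 1": 0.9,
    "outline 2": 0.8,
    "outline 3": 0.7,
    "outline 4": 0.6,
    "outline 5": 0.5,
}
print(top_outlines(similarities, 3))
```

Because the pool is not partitioned by group, the selected outlines may come from one candidate outline group or from several.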

In an embodiment of the present invention, step S201, in which the outline group to which the outline of the text to be generated belongs is selected from the outline groups based on the similarity between the semantic features of the clustering center of each outline group and the first semantic features, may be implemented according to the following steps C1-C4.

Step C1: Calculate the similarity between the semantic features of the clustering center of each outline group and the first semantic features.

Specifically, when the similarity between the semantic features of the cluster centers of the outline group and the first semantic features is calculated, the distance between the semantic features of each cluster center and the first semantic features may be calculated, and the similarity may be determined based on the calculated distance.

Step C2: In descending order of the similarity corresponding to each outline group, select a second preset number of outline groups from the outline groups.

The second preset number may be set by a worker according to experience. For example, the second preset number may be 5, 10, etc.

After the similarity between the semantic features of the clustering center of each outline group and the first semantic features is obtained, a second preset number of outline groups are selected from the outline groups in descending order of the corresponding similarity.

For example, assume that the outline groups, in descending order of their corresponding similarity, are outline group 1, outline group 2, outline group 3, outline group 4 and outline group 5, and that the second preset number is 3. The selected outline groups are then outline group 1, outline group 2 and outline group 3.

Step C3: Determine, for each selected outline group, the number of outlines that contain the description information.

The description information is the description information of the text to be generated obtained in S101.

An outline containing the description information covers two cases: an outline that contains all of the description information of the text to be generated, and an outline that contains part of the description information of the text to be generated.

Specifically, when determining the number of the outlines, it may be determined whether the outlines in each outline group include description information. In one embodiment, the description information of each outline in the outline group may be extracted, whether the extracted description information is the description information of the text to be generated is determined, and if so, the outline is considered to include the description information.

In another embodiment, the description information of each outline may be extracted in advance, the extracted description information is stored in the database, the description information of the text to be generated is matched with the description information corresponding to each outline stored in the database, and if the matching is successful, the outline may be considered to include the description information.

After it is determined which outlines in each outline group contain the description information, the outlines containing the description information are counted for each group, which gives the number of such outlines in each selected outline group.

Step C4: According to the determined numbers of outlines, determine from the selected outline groups the outline group to which the outline of the text to be generated belongs, and take it as the candidate outline group.

In one embodiment, the same initial weight is assigned to each outline group, wherein the weight of the outline group is used to represent the probability that the outline group is selected as the candidate outline group. When the selection weight of the outline group is larger, the probability that the outline group is selected as the alternative outline group is larger; when the selection weight of the outline group is smaller, the probability that the outline group is selected as the alternative outline group is smaller.

The weight of each outline group is then adjusted based on the determined number of outlines. After the weights have been updated, a preset number of outline groups with the highest weights are selected as candidate outline groups.

For example, suppose that the selected outline groups include outline group S₁ and outline group S₂, where outline group S₁ contains 5 outlines that contain the description information, outline group S₂ contains 3 such outlines, the initial weight of both outline group S₁ and outline group S₂ is 10, and the preset number is 1.

The weight of each outline group is adjusted based on the number of outlines in that group that contain the description information. The adjusted weights are 15 for outline group S₁ and 12 for outline group S₂. Because outline group S₁ has the highest weight, outline group S₁ can be taken as the candidate outline group.

In another embodiment, the same initial weight is given to each outline group, the number of description information included in each outline group is determined, the weight of each outline group is adjusted based on the determined number and the outline number, and a preset number of outline groups with the highest weight value can be selected as the candidate outline group.

For example, suppose that the selected outline groups include outline group S₃ and outline group S₄, where outline group S₃ contains 2 outlines that contain the description information, outline group S₄ contains 3 such outlines, each of these outlines in outline group S₃ contains 2 items of description information, each of these outlines in outline group S₄ contains 3 items of description information, the initial weight of both outline groups is 10, and the preset number is 1.

The weight of each outline group is adjusted based on the number of outlines in that group that contain the description information and on the number of items of description information contained in those outlines. The adjustment can be made according to a preset weight corresponding to the number of outlines and a preset weight corresponding to the number of items of description information contained.

In this way, the candidate outline group is determined based on the determined number of outlines. This number reflects how many outlines in an outline group contain the description information: the larger the number, the more outlines in the group contain the description information, and the closer the semantic information expressed by the outlines in the group is to the description information of the text to be generated. Determining the candidate outline group based on the determined number of outlines can therefore improve the accuracy of the determination.
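Steps C3-C4 can be sketched as follows. The source does not state the rule used to adjust the weights, so the additive rule below (initial weight plus a per-outline bonus plus a per-item bonus) is purely an assumption and is not intended to reproduce the weight values of the examples above. The substring test used to decide whether an outline contains an item of description information, and all names, are likewise hypothetical.

```python
def adjust_weights(groups, description_items, initial_weight=10.0,
                   weight_per_outline=1.0, weight_per_item=0.0):
    """Step C3 plus weight adjustment.

    groups maps a group name to the list of outline texts it contains.
    An outline counts as containing the description information when it
    contains all or part of the items (assumed: plain substring match).
    """
    weights = {}
    for name, outlines in groups.items():
        outline_count = 0   # outlines containing description information
        item_count = 0      # items of description information found in them
        for text in outlines:
            hits = sum(1 for item in description_items if item in text)
            if hits:
                outline_count += 1
                item_count += hits
        weights[name] = (initial_weight
                         + weight_per_outline * outline_count
                         + weight_per_item * item_count)
    return weights

def select_candidate_groups(weights, preset_number=1):
    """Step C4: keep the preset number of groups with the highest weight."""
    return sorted(weights, key=weights.get, reverse=True)[:preset_number]

groups = {
    "S1": ["annual sales report", "sales summary for the team", "holiday plan"],
    "S2": ["travel notes", "sales diary"],
}
weights = adjust_weights(groups, ["sales", "report"])
print(weights, select_candidate_groups(weights))
```

Setting `weight_per_item` to a non-zero value gives the second embodiment, in which the number of items of description information contained also influences the weight.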

Referring to fig. 3, fig. 3 is a schematic flow chart of a second outline determination method according to an embodiment of the present invention. On the basis of the above embodiment, the above method further includes the following step S104.

S104: Select the paragraphs of the text to be generated from the preset paragraphs corresponding to the outline of the text to be generated.

The preset paragraphs may be obtained as follows: after outlines are extracted from pre-acquired texts, the paragraphs corresponding to each outline are obtained and used as the preset paragraphs corresponding to that outline.

Taking a patent text as an example, the paragraph corresponding to the outline "abstract of the specification" is the content of the abstract section, and the paragraph corresponding to the outline "claims" is the content of the claims section.

Specifically, the electronic device may store historical texts, extract outlines from the stored texts, and obtain the paragraphs corresponding to the extracted outlines as preset paragraphs. Alternatively, an automatic crawler system may periodically and automatically crawl texts from specified websites and store them in a database, thereby obtaining the incremental texts of those websites. By monitoring text information on the Internet in real time, discovering text data sources, and detecting dynamic information about text data, a large number of texts can be obtained. Outlines are then extracted from the obtained texts, and the paragraphs corresponding to each outline are obtained as the preset paragraphs corresponding to that outline.

The preset outline and the preset paragraphs are in a corresponding relationship, and specifically, one preset outline may correspond to one preset paragraph or to a plurality of preset paragraphs. Therefore, after the outline of the text to be generated is determined, the preset paragraphs corresponding to the outline of the text to be generated can be determined.

When a paragraph of the text to be generated is selected from the preset paragraphs corresponding to an outline of the text to be generated, in one implementation a paragraph may be randomly selected from those preset paragraphs as the paragraph of the text to be generated.

In an embodiment of the present invention, a paragraph of the text to be generated may be selected from the preset paragraphs corresponding to the outline of the text to be generated based on the similarity between the semantic features of those preset paragraphs and the first semantic features.

The semantic features of the preset paragraphs corresponding to the outline of the text to be generated are used for reflecting the semantics expressed by the preset paragraphs. The first semantic features are semantic features of the description information and are used for reflecting semantics expressed by the description information.

Specifically, the distance between the semantic features of each preset paragraph corresponding to the outline of the text to be generated and the first semantic features may be calculated, and the similarity may be determined based on the calculated distance. The distance may be a Euclidean distance, a cosine distance, or the like. For example, a distance-to-similarity conversion algorithm may be used to convert the calculated distance into a similarity.

When a paragraph of the text to be generated is selected from the preset paragraphs corresponding to the outline of the text to be generated based on the similarity between the semantic features and the first semantic features of the preset paragraphs corresponding to the outline of the text to be generated, the preset paragraph with the highest similarity may be selected as the paragraph of the text to be generated, or the preset paragraph with the similarity greater than the preset similarity may be selected as the paragraph of the text to be generated.

Because the paragraphs of the text to be generated are determined based on the similarity between the semantic features of the preset paragraphs and the first semantic features, the first semantic features can reflect the semantics expressed by the description information of the text to be generated, and the semantic features of the preset paragraphs can reflect the semantics expressed by each preset paragraph, the semantic information expressed by the paragraphs determined based on the similarity between the semantic features can be closer to the semantic information expressed by the text to be generated, so that the generated text is more accurate.
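The similarity-based paragraph selection of S104 can be sketched as follows. This is an illustration only: cosine similarity stands in for whichever measure is actually used, and the mapping from each outline to a list of (paragraph text, semantic feature) pairs is a hypothetical data layout that reflects the one-to-many correspondence between preset outlines and preset paragraphs.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_paragraphs(first_feature, outlines, preset_paragraphs):
    """For each outline of the text to be generated, pick the preset
    paragraph whose semantic features are most similar to the first
    semantic features. Only the paragraphs attached to that outline
    are examined, not the whole paragraph collection."""
    selected = {}
    for outline in outlines:
        candidates = preset_paragraphs[outline]  # list of (text, feature)
        best_text, _ = max(
            candidates, key=lambda c: cosine_similarity(c[1], first_feature)
        )
        selected[outline] = best_text
    return selected

preset_paragraphs = {
    "Background": [("paragraph 11", [1.0, 0.0]), ("paragraph 12", [0.0, 1.0])],
    "Summary": [("paragraph 21", [0.5, 0.5])],
}
chosen = select_paragraphs([0.2, 0.9], ["Background", "Summary"], preset_paragraphs)
print(chosen)
```

Replacing `max` with a filter on a preset similarity would give the threshold-based variant described above.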

Therefore, because the paragraphs of the text to be generated are selected from the preset paragraphs corresponding to the outline of the text to be generated, the selected paragraphs fit the outline closely, and the overall logic of the subsequently generated text is strong.

Referring to fig. 4, fig. 4 is a schematic flow chart of a third outline determining method according to an embodiment of the present invention. On the basis of the above embodiment, the above method further includes the following step S105.

Step S105: Order the selected paragraphs of the text to be generated based on the outline of the text to be generated, and generate a text containing the outline of the text to be generated and the ordered paragraphs.

When the selected paragraphs of the text to be generated are ordered based on the outline of the text to be generated, the arrangement order of the paragraphs may be determined based on the position information of the outline of the text to be generated in the text.

For example, assume that the outlines of the text to be generated include a beginning outline, a middle outline and an ending outline, and that the selected paragraphs include paragraph 1 corresponding to the beginning outline, paragraph 2 corresponding to the middle outline, and paragraph 3 corresponding to the ending outline. Because a text is generally structured as a beginning, a middle and an ending, the order of the selected paragraphs can be determined based on the outlines of the text to be generated as: paragraph 1, paragraph 2, paragraph 3.

When the text containing the outline of the text to be generated and the ordered paragraphs is generated, each outline of the text to be generated may be used as the subtitle of its corresponding ordered paragraph, and the subtitles and the ordered paragraphs may be combined to generate the text. Alternatively, each outline of the text to be generated can be used as the opening sentence of its corresponding ordered paragraph to generate the text.
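The ordering and assembly of step S105 can be sketched as follows. The sketch assumes the outlines are already listed in the order of their position in the text and that each outline has exactly one selected paragraph; the function name and the plain-text layout (blank line between sections) are illustrative choices only.

```python
def generate_text(ordered_outlines, selected_paragraphs, as_subtitle=True):
    """Order the selected paragraphs by the position of their outline
    and combine outlines and paragraphs into one text.

    ordered_outlines: outlines in the order they appear in the text.
    selected_paragraphs: mapping from outline to its selected paragraph,
        in any order.
    as_subtitle: if True, each outline becomes a subtitle line above its
        paragraph; otherwise it becomes the opening sentence.
    """
    sections = []
    for outline in ordered_outlines:
        paragraph = selected_paragraphs[outline]
        if as_subtitle:
            sections.append(f"{outline}\n{paragraph}")
        else:
            sections.append(f"{outline} {paragraph}")
    return "\n\n".join(sections)

ordered_outlines = ["Beginning", "Middle", "Ending"]
selected_paragraphs = {
    "Ending": "paragraph 3",
    "Beginning": "paragraph 1",
    "Middle": "paragraph 2",
}
print(generate_text(ordered_outlines, selected_paragraphs))
```

The paragraph order is taken entirely from the outline order, so the paragraphs can be selected in any order beforehand.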

In this way, the selected paragraphs are ordered based on the outline of the text to be generated, and a text containing the outline of the text to be generated and the ordered paragraphs is generated. Because the outline reflects the structural information of the text, ordering the selected paragraphs based on the outline gives the ordered paragraphs a clear structure, which improves the structure of the generated text and strengthens its overall logic.

In addition, because the outline of the text to be generated is determined first, and then the paragraphs of the text to be generated are selected from the preset paragraphs corresponding to the outline of the text to be generated, and the number of the preset paragraphs corresponding to the outline of the text to be generated is far smaller than the total number of the preset paragraphs, the efficiency of selecting the paragraphs of the text to be generated from the preset paragraphs corresponding to the outline of the text to be generated is high, and the text generation efficiency is improved.

In the prior art, electronic devices typically generate text in units of words. Specifically, the first word of the text to be generated is determined according to the obtained keywords of the text to be generated, the word adjacent to the first word is determined based on the first word, each subsequent word is determined in the same way, and the text is generated from the determined words. Text generated word by word in this way tends to have poor structure and weak overall logic. In the present scheme, the text is not generated word by word. Instead, the selected paragraphs are ordered based on the outline of the text to be generated, so that the ordered paragraphs are structured, which improves the structure and the overall logic of the generated text.

In an embodiment of the present invention, the description information of the text to be generated in step S101 may include at least one of the following: a user portrait, keywords, entity words, key sentences, and a text type.

Specifically, the user portrait is used to describe characteristic information of the user, and may describe this information from different dimensions. For example, the user portrait may include user attribute feature information, user interest feature information, user behavior feature information, and user scene feature information.

The user attribute feature information may include the gender, age, location, occupation, and the like of the user. The user interest feature information may include the types of articles the user is interested in, the types of articles the user has created, the user's writing habits, and the like. The user behavior feature information includes the types of articles the user has recently read and recently authored. The user scene feature information includes the current writing scene of the user.

Specifically, the user portrait may be determined based on the user identifier and a pre-stored correspondence between user identifiers and user portraits. To obtain this correspondence, the electronic device can monitor the user's online and offline data in real time, construct a comprehensive, accurate, multidimensional user portrait for each user, and determine the correspondence between the user identifier and the user portrait based on the constructed user portrait.

Keywords may be understood as words that express the central idea or main content of a text. Specifically, the keywords may be determined based on the frequency with which each word occurs in the description text segment of the text to be generated that is input by the user. For example, the TF-IDF (term frequency-inverse document frequency) algorithm can be used to extract keywords. In the TF-IDF algorithm, the following formula (the standard TF-IDF weighting, consistent with the symbol definitions below) is mainly used:

$$W_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

where tf_{i,j} denotes the frequency of occurrence of the i-th word in the description text segment, df_i denotes the number of texts in a preset text library that contain the i-th word, N denotes the total number of texts in the preset text library, and W_{i,j} is the importance value of the i-th word in the description text segment. The higher the value of W_{i,j}, the more important the i-th word is in the description text segment; when W_{i,j} is sufficiently high, for example above a preset threshold, the i-th word can be considered a keyword.
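The TF-IDF weighting described above can be sketched as follows, assuming the standard form W = tf × log(N / df). The whitespace-free word lists, the representation of the preset text library as a list of word sets, the threshold value, and the treatment of words that occur in no library text (df is floored at 1 to avoid division by zero) are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_weights(segment_words, library):
    """Importance value of each word of the description text segment.

    segment_words: the words of the description text segment, in order.
    library: the preset text library, one set of words per text.
    """
    counts = Counter(segment_words)
    total_words = len(segment_words)
    n_texts = len(library)
    weights = {}
    for word, count in counts.items():
        tf = count / total_words                       # term frequency
        df = sum(1 for text in library if word in text)  # document frequency
        df = max(df, 1)  # assumption: unseen words are treated as df = 1
        weights[word] = tf * math.log(n_texts / df)
    return weights

def extract_keywords(weights, threshold):
    """Words whose importance value exceeds a preset threshold."""
    return sorted(word for word, w in weights.items() if w > threshold)

library = [
    {"the", "report"},
    {"the", "sales"},
    {"the", "plan"},
    {"the", "budget"},
]
segment = ["the", "sales", "sales", "report"]
weights = tfidf_weights(segment, library)
print(weights, extract_keywords(weights, 0.5))
```

A word such as "the", which occurs in every library text, receives a weight of zero regardless of how often it appears in the segment, which is the intended effect of the inverse document frequency term.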

Entity words may be understood as proper nouns appearing in the text. The entity words in the description text segment of the text to be generated that is input by the user can be extracted using a bidirectional LSTM and a conditional random field.

A key sentence may be understood as a sentence that expresses the central idea or main content of a text. Specifically, the description text segment of the text to be generated that is input by the user may be divided into a plurality of sentence units, a graph model may be established according to the contextual relationship between the sentence units, and the sentence units with higher importance may be determined based on the established graph model, so as to extract the key sentences in the text segment. This can be implemented based on the TextRank algorithm.
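A graph-based ranking of sentence units in the spirit of TextRank can be sketched as follows. This is a simplified illustration rather than the method of the embodiment: sentences are tokenized by whitespace, the edge weight is the word-overlap measure commonly used with TextRank (shared words divided by the sum of the logarithms of the sentence lengths), and the damping factor and iteration count are assumed values.

```python
import math

def sentence_similarity(a, b):
    """Word overlap normalized by the log of the sentence lengths."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:  # both sentences consist of a single word
        return 0.0
    return len(words_a & words_b) / denominator

def textrank(sentences, damping=0.85, iterations=50):
    """Return one importance score per sentence unit."""
    n = len(sentences)
    # Weighted graph: nodes are sentence units, edges carry their similarity.
    sim = [
        [0.0 if i == j else sentence_similarity(sentences[i], sentences[j])
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion
        # to the weight of the connecting edge.
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply today",
]
scores = textrank(sentences)
print(scores)
```

Sentence units that share vocabulary with many other units accumulate a higher score, and the highest-scoring units can be taken as the key sentences.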

When the text type is obtained, the semantic features of the description text segment of the text to be generated that is input by the user can be identified, the identified semantic features can be matched against the semantic features of preset text types, and the text type can be determined based on the matching result.

The following embodiment describes a specific implementation process for obtaining a paragraph of text to be generated. Referring to fig. 5, fig. 5 is a flowchart of a paragraph obtaining method according to an embodiment of the present invention.

In fig. 5, an optimal outline group is first selected from the clustered outline groups based on semantic features of the text to be generated. The outline groups comprise outline group 1, outline group 2, and so on.

Then, the outline of the text to be generated is determined from the selected optimal outline group based on the feature information of the text to be generated. Specifically, the outline of the text to be generated may be determined according to whether each outline in the optimal outline group contains feature information of the text to be generated. The outlines in the optimal outline group comprise outline 1, outline 2, and so on.

Next, the paragraphs of the text to be generated are determined, based on the feature information of the text to be generated, from the preset paragraphs corresponding to each determined outline. The preset paragraphs corresponding to outline 1 include paragraph 11, paragraph 12, and so on. The preset paragraphs corresponding to outline 2 include paragraph 21, paragraph 22, and so on. Specifically, the paragraphs of the text to be generated may be determined according to whether each paragraph contains feature information of the text to be generated.

In an embodiment of the present invention, the semantic features and feature information of the preset outline, and the semantic features and feature information of the preset paragraph may be determined as follows. Referring to fig. 6, fig. 6 is a flowchart of a text information obtaining method according to an embodiment of the present invention.

In fig. 6, data acquisition is performed first. Specifically, the electronic device can use an automatic crawler system to monitor text information on the Internet in real time, periodically and automatically crawl texts from specified websites, and store the crawled texts in a message cache queue. A preset data scheduling system can then be used to discover text data sources from the message cache queue and to monitor and crawl dynamic information about Internet text data in real time.

Then, cleaning and filtering are performed. Specifically, the processing may include similarity-based deduplication of texts, spelling correction, filtering of sensitive topics in the text, handling of abnormal characters, sentiment analysis, correction of traditional and simplified Chinese character forms, unification of character formats, deletion of useless information, and the like.
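Two of the cleaning operations listed above, unification of character formats and similarity-based deduplication, can be sketched as follows. The choices made here are assumptions for illustration: Unicode NFKC normalization is used to fold full-width characters into their usual forms, non-printable characters are dropped as abnormal characters, and two texts are treated as duplicates when the Jaccard similarity of their word sets reaches a hypothetical threshold of 0.8.

```python
import unicodedata

def normalize(text):
    """Unify character formats and strip abnormal characters."""
    # NFKC folds compatibility forms, e.g. full-width letters and the
    # ideographic space, into their ordinary equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop characters that are neither printable nor whitespace.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to one already kept."""
    kept = []
    for text in map(normalize, texts):
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept

raw = ["annual sales report 2021", "Annual  sales report 2021", "travel notes"]
print(deduplicate(raw))
```

The pairwise comparison is quadratic in the number of texts, so a large crawl would need an approximate technique instead; the sketch only shows the principle.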

Next, the document portrait is constructed. Specifically, the outlines and paragraphs of the cleaned and filtered text can be extracted separately, and the feature information of the outlines and of the paragraphs can be obtained separately. The feature information may include the document type, keywords, key sentences, entity words, the text body, and the like. The same processing method can be used to obtain the feature information of outlines and of paragraphs.

Specifically, the keywords may be determined based on the frequency with which each word occurs in the description text segment of the text to be generated that is input by the user; for example, the keywords can be extracted using the TF-IDF algorithm. The entity words of the description text segment of the text to be generated that is input by the user can be extracted using a bidirectional LSTM and a conditional random field. To obtain the key sentences, the description text segment of the text to be generated that is input by the user can be divided into a plurality of sentence units, a graph model can be established according to the contextual relationship between the sentence units, and the sentence units with higher importance can be determined based on the established graph model, so as to extract the key sentences in the text segment; this can be implemented based on the TextRank algorithm. To obtain the text type, the semantic features of the description text segment of the text to be generated that is input by the user can be identified, the identified semantic features can be matched against the semantic features of preset text types, and the text type can be determined based on the matching result.

And finally, carrying out feature extraction and data storage.

Specifically, a vectorization coding technique may be used to code the outline or the paragraph, and the coded result is used as the semantic feature of the outline or the paragraph.

And after the text portrait and the semantic features are obtained, the data can be stored in two sets of distributed data storage modes. One set of distributed data storage system is used for storing the characteristic information of the outline and the paragraph, and the data storage system can store the original information of the text and the characteristic information of the keywords, the participles, the inverted indexes and the like. And the other set of distributed data storage system is used for storing the semantic features of the outline and the paragraph and calculating the similarity of the semantic features.

In an embodiment of the present invention, selecting the outline of the text to be generated from the preset outlines based on the similarity between the semantic features of the preset outlines and the first semantic features, as in S102, may be implemented as follows.

The semantic features of the preset outline and the first semantic features are input into a pre-trained semantic similarity calculation model to obtain the similarity between them, and the outline of the text to be generated is selected from the preset outlines based on the obtained similarity.

The semantic similarity calculation model is obtained by training a preset neural network model, taking the semantic features of a large number of sample preset outlines and the semantic features of sample texts as input, and taking the similarity between the semantic features of the sample preset outlines and the semantic features of the sample texts as a training reference. The trained model is used to obtain the similarity between the semantic features of a preset outline and the first semantic features.

The semantic features of the sample preset outline and the semantic features of the sample text can adopt the semantic features in the semantic feature vector library in the graph X as samples.

Specifically, referring to fig. 7, fig. 7 is a block diagram of a process for calculating semantic similarity based on a model according to an embodiment of the present invention.

Fig. 7 shows a calculation based on the semantic similarity calculation model. Specifically, when calculating the similarity between the semantic features of the preset outline and the first semantic features, the semantic features of the preset outline and the first semantic features are first encoded separately to obtain encoding results; the encoding results are then concatenated, and their difference and inner product are calculated; finally, the calculation results are processed by a fully connected layer and a Softmax layer, which output the similarity between the semantic features of the preset outline and the first semantic features.
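The flow of fig. 7 can be sketched roughly as follows. This is only an illustration under several assumptions: the two inputs are taken to be already-encoded vectors, the inner product is treated as a single scalar, one fully connected layer with two output classes (not similar, similar) is used, and the weights are placeholders rather than trained parameters. The names `interaction_features` and `similarity` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def interaction_features(u, v):
    """Concatenate two encodings with their element-wise difference and inner product."""
    diff = [a - b for a, b in zip(u, v)]
    inner = sum(a * b for a, b in zip(u, v))
    return list(u) + list(v) + diff + [inner]


def similarity(u, v, weights, bias):
    """Apply one fully connected layer and softmax; return the 'similar' probability."""
    feats = interaction_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)[1]


u, v = [1.0, 2.0], [3.0, 4.0]
feats = interaction_features(u, v)
# Placeholder weights: class 1 ("similar") looks only at the inner product.
weights = [[0.0] * 7, [0.0] * 6 + [1.0]]
score = similarity(u, v, weights, [0.0, 0.0])
```

In a trained model the weights and bias would come from the training procedure described above; here they only show how the concatenated features feed the fully connected and Softmax layers.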

A specific implementation process of text generation is described below with a specific embodiment, referring to fig. 8, and fig. 8 is a flow chart of a text generation method according to an embodiment of the present invention. FIG. 8 includes a data platform building module, a user portrait and user intention identifying module, a vectorized semantic search module, and a text generating module.

The data platform building module comprises the steps of data acquisition, data cleaning, feature information and semantic feature extraction and data storage.

In a user portrait and intention identification module, firstly, acquiring information of a text to be generated input by a user; the information is then filtered, for example: filtering the input information by sensitive words, filtering stop words and the like to remove noise information; and secondly, extracting characteristic information of the cleaned information, wherein the characteristic information can comprise keywords, entity words and text types of the text to be generated, and extracting semantic characteristics of the characteristic information.

In the vectorized semantic retrieval module, based on the extracted semantic features, similar semantic features are recalled from the semantic features of the outlines and paragraphs stored in the distributed vector database.

In the text generation module, outline groups similar to the semantic features are determined from the distributed vector database, the extracted outline groups are sorted based on the feature information, a preset number of the sorted outline groups are selected, the paragraphs of the text to be generated are determined based on the semantic features and the semantic features of the preset paragraphs corresponding to the selected outline groups, and the text is generated based on the determined outlines and paragraphs.

In an embodiment of the invention, the preset paragraphs corresponding to the outline in step S104 are paragraphs determined in advance, and may be specifically determined according to the following steps D1-D2.

Step D1: acquiring a pre-selected text corresponding to the preset outline.

As can be known from the description of step S103, the electronic device may store historical texts and extract outlines from them, or may use the automatic crawler system to periodically crawl the texts of specified websites, extract the outlines of the crawled texts, store the outlines in the database, and store the correspondence between each outline and its text. On this basis, the electronic device can select, from the texts stored in the database, the text corresponding to the preset outline, and use it as the pre-selected text corresponding to the preset outline.

Step D2: extracting paragraphs from the paragraphs corresponding to the preset outline in the text, as the preset paragraphs corresponding to the outline.

In an embodiment, paragraphs may be randomly selected from the paragraphs corresponding to the predetermined outline in the text, and the number of the selected paragraphs may be 1 or multiple.

In an embodiment of the present invention, feature information of each paragraph corresponding to a preset outline in the text may also be determined, based on the feature information of each paragraph, an alternative paragraph is selected from each paragraph, and the alternative paragraph is determined as the preset paragraph corresponding to the outline.

The feature information of the above paragraph is used to describe basic information of the paragraph, and the feature information of the above paragraph may include: the length of each sentence in the paragraph, punctuation mark composition of each sentence, the number of sentences in the paragraph, and the like.

Taking the case where the feature information is the length of each sentence in the paragraph as an example, the length of each sentence in each paragraph of the text may be calculated when determining the feature information.

When selecting the alternative paragraphs, it may be determined whether the feature information of each paragraph satisfies a preset information filtering rule, and if so, the paragraph is taken as the alternative paragraph.

The preset information screening rule may be set by a worker according to experience, or the worker extracts feature information of a high-quality paragraph and uses the screening rule containing the high-quality feature information as the preset information screening rule.

For example: the preset information screening rule may include: the sentence length in the paragraph is more than 8 bytes; the punctuation marks in the sentence comprise commas and periods; the number of sentences in the paragraph is greater than 5, etc.

When the characteristics of the paragraphs satisfy the preset information screening rules, the paragraphs can be used as alternative paragraphs.

For example: assume the preset information screening rule is to keep paragraphs in which the sentence length is greater than 8 bytes, the sentences contain the preset punctuation marks (comma, period), and the number of sentences in the paragraph is greater than 5. Assume the paragraph is: "Students should study hard and make progress every day, respect the old and care for the young, lend a helping hand promptly when classmates are in difficulty, listen and speak attentively in class, and write their homework carefully." This paragraph contains a sentence longer than 8 bytes ("students should study hard and make progress every day"), contains commas and a period, and has more than 5 sentences; that is, the paragraph satisfies the preset information screening rule, so it can be used as an alternative paragraph.
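The screening rule in this example can be sketched as a small predicate. The sketch makes assumptions that the text leaves open: sentence units are split at commas as well as sentence-ending marks, "length" is measured in UTF-8 bytes, and one sentence longer than the threshold is enough. The names `split_sentences` and `passes_screening` are hypothetical.

```python
import re

# Assumed delimiters: ASCII and full-width commas and sentence-ending marks.
SPLIT_PATTERN = re.compile(r"[,.;!?，。；！？]")


def split_sentences(paragraph):
    """Split a paragraph into non-empty sentence units."""
    return [s.strip() for s in SPLIT_PATTERN.split(paragraph) if s.strip()]


def passes_screening(paragraph, min_bytes=8, min_sentences=5,
                     required_marks=(",", ".")):
    """Return True if the paragraph satisfies the three example rules."""
    sentences = split_sentences(paragraph)
    # Rule 1: some sentence is longer than min_bytes when encoded as UTF-8.
    has_long_sentence = any(len(s.encode("utf-8")) > min_bytes for s in sentences)
    # Rule 2: every required punctuation mark appears in the paragraph.
    has_marks = all(mark in paragraph for mark in required_marks)
    # Rule 3: the paragraph has more than min_sentences sentence units.
    return has_long_sentence and has_marks and len(sentences) > min_sentences


good = ("Study hard every day, respect the old, help classmates in need, "
        "listen in class, speak up, write homework carefully.")
kept = passes_screening(good)
dropped = passes_screening("Short, text.")
```

For Chinese text the required marks would be the full-width comma and period, which can be passed through `required_marks`.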

When the alternative paragraphs are determined as the preset paragraphs corresponding to the outline, in an embodiment, the alternative paragraphs may be directly determined as the preset paragraphs corresponding to the preset outline in the text. For example: assuming that the determined candidate paragraph is paragraph 1, paragraph 1 may be directly determined as a preset paragraph corresponding to a preset outline in the text.

In an embodiment of the present invention, for each alternative paragraph, semantic features of the alternative paragraph and word sense features of words in the alternative paragraph may be determined, the determined semantic features and word sense features are input into a pre-trained paragraph quality evaluation model, a quality score value of the alternative paragraph is obtained, and the alternative paragraph with the quality score value greater than the preset quality score value is used as a preset paragraph corresponding to a preset outline in a text.

The semantic features of the alternative paragraphs are used for reflecting the semantics expressed by the alternative paragraphs, and the word sense features of the words are used for reflecting the semantics expressed by the words.

Specifically, when the word sense features of each word in the alternative paragraphs are determined, word segmentation may be performed on the alternative paragraphs to obtain each word in the alternative paragraphs, and word vectorization may be performed on each word to obtain its word sense features. For example, the jieba word segmentation tool may be used for word segmentation, and a Word2Vec (Word to Vector) model may be used to vectorize each word.

When determining the semantic features of the alternative paragraphs, the semantic information of the alternative paragraphs may be extracted, and the semantic features of the alternative paragraphs may be determined based on the extracted semantic information.

The association degree between the words can be determined based on the word sense characteristics of the words in the alternative paragraphs, and the semantic characteristics of the alternative paragraphs are determined based on the determined association degree. Specifically, an Attention mechanism may be adopted, and first, based on the word meaning characteristics of each word in the alternative paragraphs, the association degree between each word is determined, and based on the determined association degree, each word is given a weight to obtain a weight matrix; then, based on the word vector of each word and the weight matrix, carrying out weighted summation on the weight of each word to obtain the weight matrix after weighted summation of each word; and determining the semantic features of the alternative paragraphs based on the word vectors of the words and the weight matrix obtained by weighting and summing the words.

The above paragraph quality evaluation model is a model obtained by training a preset neural network model, taking the semantic features of a sample paragraph and the word sense features of each word in the sample paragraph as model input and taking the labeled quality score value of the sample paragraph as a training reference; the model is used to obtain the quality score value of a paragraph. The neural network model may be TextCNN (Text Convolutional Neural Networks).

Specifically, when the semantic features of the alternative paragraphs and the word sense features of each word in the alternative paragraphs are input into the paragraph quality evaluation model, a convolution layer in the model performs a convolution operation on the semantic features and the word sense features to obtain a convolution result; pooling and a softmax solution are then applied to the convolution result to obtain the quality score value of the alternative paragraph.

Specifically, when performing the softmax (logistic regression) solution, the following formula can be adopted for calculation:

$$a_p = \frac{e^{z_p}}{\sum_{j=1}^{k} e^{z_j}}$$

wherein p denotes the sequence number of the preset quality classes, k denotes the total number of the preset quality classes, $e^{z_p}$ denotes the quality score value indicating that the quality of the alternative paragraph is the p-th preset quality class (with $z_p$ the model output for that class), $\sum_{j=1}^{k} e^{z_j}$ denotes the sum of the quality score values over the preset quality classes, and $a_p$ denotes the normalized quality score value indicating that the quality of the alternative paragraph is the p-th preset quality class. When the quality score value is calculated by this formula, the calculated value falls within the range [0, 1].

In an embodiment of the present invention, the semantic features and feature information of the preset outline, and the semantic features and feature information of the preset paragraph may be determined as follows. Referring to fig. 8, fig. 8 is a flowchart of a text information obtaining method according to an embodiment of the present invention.

In fig. 8, data acquisition is first performed. Specifically, the electronic device can monitor text information in the internet in real time based on an automatic crawler system, regularly and automatically crawl texts of a specified website and store the texts into a message cache message queue, and explore text data sources from the message cache message queue by adopting a preset data scheduling system and monitor and crawl dynamic information of the text data of the internet in real time.

Specifically, when obtaining the above keywords, the keywords may be determined based on the frequency of occurrence of each word in a text segment in a description text segment of the text to be generated, which is input by the user. For example: the key words can be extracted using the TF-IDF algorithm. When obtaining the above entity words, the entity words describing text segments of the text to be generated, which are input by the user, can be extracted by using a bidirectional LSTM (Long Short-Term Memory network) and a conditional random field. When the key sentences are obtained, the description text segment of the text to be generated, which is input by the user, can be divided into a plurality of sentence units, a graph model is established according to the context relationship among the sentence units, and the sentence units with higher importance degree are determined based on the established graph model, so that the key sentences in the text segment are extracted. The method can be realized based on a TextRank algorithm. When the text type is obtained, the semantic features of the description text segment of the text to be generated, which is input by the user, can be identified, the semantic features are matched with the semantic features of the preset text type based on the identified semantic features, and the text type is determined based on the matching result.

And finally, carrying out feature extraction and data storage.

Referring to fig. 9a, fig. 9a is a schematic flowchart of a preset segment obtaining process according to an embodiment of the present invention.

In the first step in fig. 9a, a large amount of document data including a large amount of text is obtained.

In the second step, the paragraphs in each text are extracted to form a paragraph list.

In the third step, the paragraphs in the paragraph list are coarsely screened based on preset paragraph description information such as length, punctuation marks, and number of sentences, to obtain alternative paragraphs.

In the fifth and sixth steps, the word sense features of each word in each alternative paragraph are determined using a Word2Vec model, the semantic features of each alternative paragraph are determined using an Attention mechanism, and the determined semantic features and word sense features are input into a pre-trained TextCNN model to obtain the quality score value of the alternative paragraph.

In the seventh step, based on the quality score values of the alternative paragraphs, the alternative paragraphs whose quality score values are greater than the preset quality threshold are used as the preset paragraphs.

In this way, since the quality score value of a paragraph is determined using a deep learning method, the quality of the paragraph can be accurately determined based on the quality score value.

Referring to fig. 9b, fig. 9b is a schematic flowchart of a quality evaluation process of an alternative paragraph according to an embodiment of the present invention.

In fig. 9b, the paragraph list after rough extraction, i.e. each alternative paragraph, is obtained first according to the pointing order of the arrow.

Next, the alternative paragraphs are processed in a batch processing mode.

Then, for each alternative paragraph, the words in the alternative paragraph are obtained by word segmentation, and each word is vectorized using Word2Vec to obtain the word vector of each word in the alternative paragraph. An Attention Feature Map of the alternative paragraph is then obtained based on the obtained word vectors and the Attention mechanism.

Finally, the word vectors and the attention feature map of the words in the alternative paragraph are input into a CNN (Convolutional Neural Networks) model, which outputs the quality score value of the alternative paragraph after convolution, pooling, and softmax.

Because the alternative paragraphs are processed in batches, the processing efficiency is improved, which in turn improves the paragraph acquisition efficiency.

Referring to fig. 9c, fig. 9c is a schematic flowchart of an Attention mechanism according to an embodiment of the present invention. In fig. 9c, firstly, performing dot product operation (dot) based on the word sense feature of each word in the alternative paragraph to obtain the word weight (weight) of each word in the alternative paragraph; performing softmax solution on the word weight of each word in the alternative paragraphs to obtain a weight normalization result; then, carrying out weighted summation (summalize) on the word meaning characteristics of each word in the alternative paragraphs and the weight normalization result, and outputting a characteristic map (feature map); and obtaining a final feature map (final feature map) based on the feature map and the word meaning features of the words in the alternative paragraphs.
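The steps of fig. 9c can be illustrated with a minimal sketch of dot-product attention over word vectors. It assumes no learned projection matrices and assumes that the final feature map is formed by concatenating each word vector with its attended vector; the text does not specify how the two are combined, so that choice and all function names are illustrative only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_feature_map(word_vectors):
    """For each word, mix all word vectors using softmax-normalized dot-product weights."""
    feature_map = []
    for query in word_vectors:
        # dot: raw weights between this word and every word in the paragraph.
        weights = softmax([dot(query, key) for key in word_vectors])
        # Weighted sum of the word vectors using the normalized weights.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, word_vectors))
                 for i in range(len(query))]
        feature_map.append(mixed)
    return feature_map


def final_feature_map(word_vectors):
    """Concatenate each original word vector with its attended vector (an assumption)."""
    attended = attention_feature_map(word_vectors)
    return [list(vec) + att for vec, att in zip(word_vectors, attended)]


vectors = [[1.0, 0.0], [0.0, 1.0]]
attended = attention_feature_map(vectors)
final = final_feature_map(vectors)
```

Each attended vector is a convex combination of the word vectors, so words that are strongly associated with the current word contribute more to its row of the feature map.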

Corresponding to the outline determining method, the embodiment of the invention also provides an outline determining device.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a first outline determining apparatus according to an embodiment of the present invention, where the apparatus includes the following modules 1001-1003.

An information obtaining module 1001, configured to obtain description information of a text to be generated;

a feature obtaining module 1002, configured to obtain a semantic feature of the description information as a first semantic feature;

and the outline selecting module 1003 is configured to select an outline of the text to be generated from the preset outlines based on semantic features of preset outlines and the first semantic features.

In an embodiment of the present invention, the outline selecting module 1003 is specifically configured to select the outline of the text to be generated from the preset outlines based on a similarity between a semantic feature of the preset outline and the first semantic feature.

Referring to fig. 11, fig. 11 is a schematic structural diagram of the outline selection module 1003 provided in an embodiment of the present invention, where the module includes the following sub-modules 10031-10032.

The outline group selection sub-module 10031 is configured to select, based on the similarity between the semantic features of the clustering center of each outline group and the first semantic features, an outline group to which an outline of the text to be generated belongs from each outline group as an alternative outline group, where each outline group is: clustering according to the similarity between semantic features of the outline to obtain an outline group;

the outline selection sub-module 10032 is configured to select an outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature.

In an embodiment of the present invention, the outline selection sub-module 10032 is specifically configured to select, from the candidate outline groups, a candidate outline group with a highest similarity between the semantic features of the cluster center and the first semantic features; and selecting the outline of the text to be generated from all the outlines in the selected candidate outline group according to the similarity between the semantic features of all the outlines in the selected candidate outline group and the first semantic features.

In an embodiment of the present invention, the outline selection sub-module 10032 is specifically configured to calculate a similarity between the semantic feature of each outline in the candidate outline group and the first semantic feature; and according to the sequence of the similarity corresponding to each outline from high to low, selecting the first preset number of outlines from the outlines as the outlines of the text to be generated.
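The two-stage selection performed by sub-modules 10031 and 10032 can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and a simple in-memory layout for the outline groups; both are illustrative assumptions, and the names `cosine` and `select_outlines` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_outlines(query, groups, top_n=2):
    """Pick the group whose cluster center is closest, then its top_n outlines.

    groups: list of dicts with 'center' (vector) and
            'outlines' (list of (name, vector) pairs).
    """
    # Stage 1: compare the query only with the cluster centers.
    best_group = max(groups, key=lambda g: cosine(g["center"], query))
    # Stage 2: rank the outlines inside the chosen group, highest similarity first.
    ranked = sorted(best_group["outlines"],
                    key=lambda item: cosine(item[1], query), reverse=True)
    return [name for name, _ in ranked[:top_n]]


groups = [
    {"center": [1.0, 0.0],
     "outlines": [("report", [1.0, 0.1]), ("summary", [1.0, 0.5]), ("plan", [0.8, 0.9])]},
    {"center": [0.0, 1.0],
     "outlines": [("poem", [0.0, 1.0])]},
]
chosen = select_outlines([1.0, 0.05], groups, top_n=2)
```

Comparing with cluster centers first means the per-outline similarity is computed only inside one group rather than over every stored outline.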

In an embodiment of the present invention, the outline group selection sub-module 10031 includes:

Referring to fig. 12, fig. 12 is a schematic structural diagram of a second outline determining apparatus according to an embodiment of the present invention, where the apparatus further includes a paragraph selecting module 1004.

The paragraph selecting module 1004 is specifically configured to select a paragraph of the text to be generated from a preset paragraph corresponding to the outline of the text to be generated.

In an embodiment of the present invention, the paragraph selecting module 1004 is specifically configured to select a paragraph of the text to be generated from a preset paragraph corresponding to the outline of the text to be generated, based on a similarity between a semantic feature of the preset paragraph corresponding to the outline of the text to be generated and the first semantic feature.

Referring to fig. 13, fig. 13 is a schematic structural diagram of a third outline determining apparatus according to an embodiment of the present invention, where the apparatus further includes a text generating module 1005.

The text generating module 1005 is specifically configured to sort the selected paragraphs of the text to be generated based on the outline of the text to be generated, and generate a text including the outline of the text to be generated and the sorted paragraphs.
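As a rough illustration of what the text generating module 1005 does, the sketch below orders selected paragraphs by the position of their outline item and emits the outline items followed by their paragraphs. The data layout (an ordered list of outline items and a list of (outline item, paragraph) pairs) and the function name `assemble_text` are assumptions made for this example.

```python
def assemble_text(outline, selected_paragraphs):
    """Order paragraphs by outline position and join outline items with their paragraphs.

    outline: ordered list of outline items (headings).
    selected_paragraphs: list of (heading, paragraph) pairs in arbitrary order.
    """
    by_heading = {heading: [] for heading in outline}
    for heading, paragraph in selected_paragraphs:
        by_heading[heading].append(paragraph)
    lines = []
    for heading in outline:
        # Each outline item is followed by the paragraphs selected for it.
        lines.append(heading)
        lines.extend(by_heading[heading])
    return "\n".join(lines)


text = assemble_text(
    ["Background", "Plan"],
    [("Plan", "We will ship in May."), ("Background", "Sales fell.")],
)
```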

Corresponding to the outline determination method, the embodiment of the invention also provides electronic equipment.

Referring to fig. 14, fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, including a processor 1401, a communication interface 1402, a memory 1403, and a communication bus 1404, where the processor 1401, the communication interface 1402, and the memory 1403 communicate with one another via the communication bus 1404;

a memory 1403 for storing a computer program;

the processor 1401 is configured to implement the outline determination method provided in the embodiment of the present invention when executing the program stored in the memory 1403.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

In another embodiment provided by the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the outline determination method provided by the embodiment of the present invention.

In another embodiment, the present invention further provides a computer program product containing instructions, which when executed on a computer, causes the computer to implement the outline determination method provided by the embodiment of the present invention.

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, and the computer-readable storage medium, since they are substantially similar to the embodiments of the method, the description is simple, and for the relevant points, reference may be made to the partial description of the embodiments of the method.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. An outline determination method, characterized in that the method comprises:

obtaining description information of a text to be generated;

and determining the outline of the text to be generated from the preset outlines based on the semantic features of the preset outlines and the first semantic features.

2. The method according to claim 1, wherein the determining the outline of the text to be generated from the preset outline based on the semantic features of the preset outline and the first semantic features comprises:

3. The method according to claim 2, wherein selecting the outline of the text to be generated from the preset outline based on the similarity between the semantic features of the preset outline and the first semantic features comprises:

4. The method according to claim 3, wherein the selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features comprises:

5. The method according to claim 3, wherein the selecting the outline of the text to be generated from each outline of the candidate outline group according to the similarity between the semantic features of each outline in the candidate outline group and the first semantic features comprises:

6. The method according to any one of claims 3 to 5, wherein selecting, from each outline group, an outline group to which an outline of the text to be generated belongs based on a similarity between the semantic feature of the cluster center of each outline group and the first semantic feature comprises:

7. The method of claim 1, further comprising:

8. The method according to claim 7, wherein the selecting the paragraph of the text to be generated from the preset paragraphs corresponding to the outline of the text to be generated comprises:

9. The method according to claim 8, wherein the preset paragraph is a predetermined paragraph, comprising the steps of:

acquiring a pre-selected text corresponding to a preset outline;

10. The method according to claim 9, wherein extracting paragraphs from the paragraphs corresponding to the preset outline in the text as the preset paragraphs corresponding to the preset outline comprises:

11. The method according to claim 10, wherein determining the alternative paragraphs as preset paragraphs corresponding to the preset outline in the text comprises:

determining, for each alternative paragraph, the semantic features of the alternative paragraph and the word sense features of the words in the alternative paragraph;

inputting the semantic features and the word sense features determined for each alternative paragraph into a pre-trained paragraph quality evaluation model to obtain a quality score of each alternative paragraph, and taking the alternative paragraphs whose quality scores are greater than a preset quality score threshold as the preset paragraphs corresponding to the preset outline in the text;
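
As a non-limiting illustration of the threshold-based selection recited in claim 11, the following Python sketch scores each alternative paragraph and keeps those whose score exceeds a threshold. The function `toy_quality_model` is a hypothetical stand-in for the pre-trained paragraph quality evaluation model, and the dictionary keys and the threshold value are assumptions made for the example only.

```python
def toy_quality_model(paragraph_feature, word_features):
    """Hypothetical stand-in for a pre-trained quality model: averages the
    mean of the paragraph feature and the mean of the per-word feature means."""
    para = sum(paragraph_feature) / len(paragraph_feature)
    words = sum(sum(w) / len(w) for w in word_features) / len(word_features)
    return 0.5 * (para + words)


def select_preset_paragraphs(candidates, quality_model, threshold):
    """Keep the text of every alternative paragraph whose quality score
    is strictly greater than the threshold."""
    kept = []
    for c in candidates:
        score = quality_model(c["paragraph_feature"], c["word_features"])
        if score > threshold:
            kept.append(c["text"])
    return kept


candidates = [
    {"text": "A", "paragraph_feature": [0.9, 0.7],
     "word_features": [[0.8, 0.8], [0.6, 1.0]]},
    {"text": "B", "paragraph_feature": [0.1, 0.3],
     "word_features": [[0.2, 0.2]]},
]
preset_paragraphs = select_preset_paragraphs(candidates, toy_quality_model, 0.5)
```

Passing the model as a parameter keeps the filtering logic independent of how the quality score is actually computed.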

12. The method according to any one of claims 7-11, further comprising:

13. The method according to any of claims 7-11, wherein the description information comprises at least one of the following information: user representation, keywords, entity words, key sentences, and text types.

14. An outline determination apparatus, characterized by comprising:

15. The apparatus of claim 14, wherein the outline selection module comprises:

16. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1-13 when executing the program stored in the memory.

17. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 13.
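
As a non-limiting illustration of the group-based selection referred to in claims 3 to 6, the following Python sketch first picks the outline group whose cluster center is most similar to the first semantic features, and then picks the most similar outline within that candidate group. The vector representation, the cosine similarity measure, and the data layout (`center`, `outlines`) are assumptions made for the example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def select_outline(first_feature, groups):
    """Two-stage selection: choose the candidate group by cluster center,
    then choose the outline inside that group.

    groups: list of dicts with keys 'center' (vector) and
            'outlines' (list of (name, vector) pairs); a hypothetical layout.
    """
    candidate = max(groups, key=lambda g: cosine(g["center"], first_feature))
    name, _ = max(candidate["outlines"],
                  key=lambda o: cosine(o[1], first_feature))
    return name


groups = [
    {"center": [1.0, 0.0],
     "outlines": [("report", [0.9, 0.1]), ("memo", [1.0, 0.3])]},
    {"center": [0.0, 1.0],
     "outlines": [("story", [0.1, 0.9]), ("poem", [0.3, 1.0])]},
]
selected = select_outline([0.1, 0.9], groups)
```

Comparing against cluster centers first means the first semantic features need only be compared with one center per group plus the outlines of a single group, rather than with every preset outline.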