WO2020020084A1 - Text generation method, apparatus and device - Google Patents

Info

Publication number
WO2020020084A1
WO2020020084A1 PCT/CN2019/096894 CN2019096894W WO2020020084A1 WO 2020020084 A1 WO2020020084 A1 WO 2020020084A1 CN 2019096894 W CN2019096894 W CN 2019096894W WO 2020020084 A1 WO2020020084 A1 WO 2020020084A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
module
information
valid
position information
Prior art date
Application number
PCT/CN2019/096894
Other languages
French (fr)
Chinese (zh)
Inventor
沈力行
陈展
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2020020084A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/174: Form filling; Merging

Definitions

  • The present application relates to the technical field of natural language processing, and in particular to a text generation method, apparatus, and device.
  • Natural language is the language that people use every day, and natural language processing technology enables communication between humans and computers in natural language. It is widely used to generate texts that have a fixed writing format, carry specified requirement information, and are expressed in natural language. For example, for each module in the fixed writing format of the text to be generated, natural language processing technology is used to determine, from a database, the valid texts that meet the requirement information of that module. The determined valid texts are then filled directly into the module to obtain the filled text of the module, and the filled texts of all modules are arranged according to the fixed writing format to obtain the text to be generated. The filled text of each module usually includes structured text, in which the words or sentences have a fixed structure, and/or unstructured text, in which the sentences do not have a fixed structure.
  • For example, the modules in the fixed writing format of a hot news item are a "Title" module, a "Release Date" module, and a "Body" module.
  • The filled text of the "Title" and "Release Date" modules is structured text.
  • The filled text of the "Body" module is unstructured text.
  • However, because valid text is filled directly into the module without considering the expression structure of the result, the filled text of a module that holds unstructured text is likely to be a mechanical combination of multiple valid texts that does not conform to the natural language expression structure. As a result, the text to be generated that is assembled from such filled text does not conform to the natural language expression structure either.
  • For example, the requirement information of the "Body" module is "2018 World Cup".
  • The valid texts determined from the database that meet this requirement information include: "The World Cup is held in Russia for the first time", "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia", and "The competition will be held from June 14th to July 15th, 2018". The filled text of the "Body" module is unstructured text without a fixed structure, and the valid texts are filled directly into the module.
  • The filled text of the "Body" module may therefore be: "The competition will be held from June 14th to July 15th, 2018, the 2018 World Cup is held in 12 stadiums in 11 cities in Russia, the World Cup is held in Russia for the first time", which is a mechanical combination of the valid texts rather than a natural expression.
  • The purpose of the embodiments of the present application is to provide a text generation method, apparatus, and device, so as to generate text that conforms to the natural language expression structure.
  • The specific technical solutions are as follows:
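  • An embodiment of the present application provides a text generation method, which includes:
  • for each module in the fixed writing format of the text to be generated, obtaining from a preset database a plurality of valid texts that meet the module's requirement information, where the requirement information indicates the text content corresponding to the module;
  • for each module, inputting the plurality of valid texts of the module into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
  • for each module, inputting the first feature vector of each valid text of the module into a pre-trained memory network to obtain first position information, in the filled text, of each participle in each valid text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network;
  • the first sample text is a text that conforms to the natural language expression structure and meets the specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
  • for each module, arranging each participle in each valid text of the module according to the obtained first position information to obtain the filled text of the module; and, according to the fixed writing format of the text to be generated, arranging the filled text of each module to obtain the text to be generated.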
  • An embodiment of the present application provides a text generation apparatus, where the apparatus includes:
  • a text acquisition module, configured to obtain, for each module in the fixed writing format of the text to be generated, a plurality of valid texts from a preset database that meet the module's requirement information, where the requirement information indicates the text content corresponding to the module;
  • a feature extraction module, configured to input, for each module, the plurality of valid texts of the module into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
  • a position information determining module, configured to input, for each module, the first feature vector of each valid text of the module into a pre-trained memory network to obtain first position information, in the filled text, of each participle in each valid text of the module;
  • the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, and the first sample text is a text that conforms to the natural language expression structure and meets the specified requirement information;
  • the memory network is obtained by training with a plurality of pre-collected first sample texts;
  • a text generation module, configured to arrange, for each module, each participle in each valid text of the module according to the obtained first position information to obtain the filled text of the module, and to arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
  • an embodiment of the present application provides a computer device, where the device includes:
  • An embodiment of the present application provides a computer-readable storage medium in which a computer program is stored. When the computer program is executed, the steps of the text generation method provided in the first aspect are implemented.
  • In the text generation method, apparatus, and device provided in the embodiments of the present application, the memory network is trained with a plurality of pre-collected first sample texts, and each first sample text conforms to the natural language expression structure.
  • On this basis, the participles in the valid texts are arranged according to the first position information, so the text structure of the resulting filled text is the same as that of the first sample text and likewise conforms to the natural language expression structure. Therefore, when the filled text of each module is arranged in accordance with the fixed writing format of the text to be generated, the resulting text is guaranteed to conform to the natural language expression structure.
  • FIG. 1 is a schematic flowchart of a text generation method according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a recurrent neural network in a text generation method according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a memory network in a text generation method according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a memory network in a text generating method according to another embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a text generation method according to another embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a convolutional neural network in a text generation method according to another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a sequence labeling model in a text generation method according to another embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a text generating device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a text generating apparatus according to another embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • A text generation method according to an embodiment of the present application is introduced first.
  • The text generation method provided in the embodiment of the present application can be applied to a computer device capable of text generation.
  • Such a device may be a desktop computer, a portable computer, an Internet television, a smart mobile terminal, a wearable smart terminal, a server, and so on, without limitation. Any computer device that can implement the embodiments of the present application falls within their protection scope.
  • The flow of the text generation method may include:
  • The text of each module describes the same event.
  • For example, the "Title" module and the "Body" module in a press release covering the start of the 2018 World Cup both describe the start of the 2018 World Cup.
  • Therefore, while the requirement information of each module indicates the module's corresponding text content, it can also indicate the event described by the text to be generated to which the module belongs.
  • To obtain valid text, keyword matching may be used to retrieve from the preset database the texts that contain the requirement information of the module.
  • Alternatively, the requirement information may be treated as the text to be answered, and a reading comprehension technique may be used to locate, in the preset database, the position of the answer that matches it; the answer at this position is used as the valid text.
  • Any method for obtaining valid text can be used in this application, and this embodiment places no limitation on this.
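  • The keyword-matching option mentioned above can be sketched as follows. This is a minimal illustration, not the method of the application: the function name `find_valid_texts`, the whitespace tokenization of the requirement information, and the toy database are all assumptions made for the example.

```python
from typing import List


def find_valid_texts(requirement: str, database: List[str]) -> List[str]:
    """Return every database text that contains all keywords of the requirement.

    Assumption: the requirement information is split on whitespace into
    keywords and matched case-insensitively as substrings.
    """
    keywords = requirement.lower().split()
    return [
        text for text in database
        if all(keyword in text.lower() for keyword in keywords)
    ]


# Hypothetical preset database with three stored texts.
database = [
    "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia",
    "The weather in spring is suitable for flying kites",
    "The 2018 World Cup runs from June 14th to July 15th",
]

valid_texts = find_valid_texts("2018 World Cup", database)
for text in valid_texts:
    print(text)
```

  • Matching at the literal text level like this is simple, but it only finds texts that repeat the requirement wording, which is the limitation the reading comprehension variant is meant to address.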
  • S102: For each module, input the multiple valid texts of the module into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module.
  • The first recurrent neural network is obtained by training with multiple pre-collected sample valid texts that meet specified requirement information.
  • The events described by the sample valid texts that meet the specified requirement information have the same characteristics as the event described by the module's requirement information; the specified requirement information may be the same as or similar to the module's requirement information.
  • For example, if the requirement information is "Spring Game",
  • the specified requirement information may be "Spring Game", "Winter Game", "Indoor Game", and so on.
  • The recurrent neural network (RNN) may have the structure shown in FIG. 2.
  • The current input of the neuron 202 in the hidden layer may include the output 2010 of the input layer 201 and the output 2020 of the neuron 202 at the previous moment.
  • This enables the recurrent neural network to remember the output at the previous moment and use it to determine the output at the current moment, and then to obtain the feature vector output by the output layer 203. Therefore, for texts in which the participles are not isolated from one another, the current participle and the previous participle can be used to predict the next participle.
  • Because the participles in a text are related to one another, a recurrent neural network can be used to extract the feature vector of a valid text.
  • The recurrent neural network remembers the output of the previous moment and uses it to determine the output at the current moment, so the extracted feature vector reflects both the characteristics of each participle in the valid text and the relationships between the participles.
  • The first recurrent neural network in S102, obtained by training with a plurality of pre-collected sample valid texts that meet the specified requirement information, establishes a mapping between valid texts and feature vectors.
  • This ensures that the obtained first feature vector reflects the semantic features of the valid text as a whole, not just the features of the individual participles in the text. For example, if the current participle is "hit" and the previous participle is "driving", the next participle is likely to be "hurt".
  • The recurrent neural networks in the other embodiments of the present application are similar to the first recurrent neural network in S102. The difference is that, in order to extract feature vectors from different input texts, the samples used to train the different recurrent neural networks are different.
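  • The recurrence described above, in which the hidden neuron combines the current input with its own output from the previous moment, can be sketched as a plain Elman-style layer. This is only an illustration under stated assumptions: the weights `w_in` and `w_rec`, the two-dimensional participle embeddings, the tanh activation, and the choice of the final hidden state as the feature vector are hypothetical and are not taken from the application.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_feature_vector(inputs: List[Vector], w_in: Matrix, w_rec: Matrix) -> Vector:
    """Run a simple recurrent layer and return the final hidden state.

    At each step the hidden state is computed from the current input and
    from the hidden state of the previous step, so the result depends on
    the order of the participles, not only on which participles occur.
    """
    hidden = [0.0] * len(w_rec)
    for x in inputs:
        from_input = matvec(w_in, x)
        from_previous = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_previous)]
    return hidden


# Hypothetical embeddings for three participles and hypothetical weights.
participles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_in = [[0.5, -0.2], [0.1, 0.3]]
w_rec = [[0.4, 0.0], [0.0, 0.4]]

feature = rnn_feature_vector(participles, w_in, w_rec)
reversed_feature = rnn_feature_vector(participles[::-1], w_in, w_rec)
print(feature)
print(reversed_feature)
```

  • Feeding the same participles in reverse order yields a different vector, which illustrates why such a feature can carry information about the relationships between participles.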
  • S103: For each module, input the first feature vector of each valid text of the module into a pre-trained memory network to obtain the first position information, in the filled text, of each participle in each valid text of the module.
  • The text structure of the filled text is the same as the text structure of the first sample text used in the training of the memory network.
  • The first sample text is a text that conforms to the natural language expression structure and meets the module's requirement information.
  • The memory network is obtained by training with a plurality of pre-collected first sample texts.
  • The arrangement of the modules is related only to the fixed writing format of the text to be generated and does not involve the structure of the text inside each module.
  • Consequently, the text to be generated may satisfy the fixed format while its text still fails to conform to the natural language expression structure.
  • Therefore, to make the text to be generated conform to the natural language expression structure, it is necessary to ensure that the filled text of each module conforms to the natural language expression structure.
  • A plurality of pre-collected first sample texts can be used to train the memory network, where the first sample texts are samples that conform to the natural language expression structure and meet the module's requirement information. The first position information, in the filled text, that the memory network produces for each participle in a valid text is therefore the same as the position information of the corresponding participle in the first sample text. When each participle in each valid text is arranged by its first position information in the subsequent step S104, the arrangement of participles in the resulting filled text matches that of the first sample text, which ensures that the filled text conforms to the natural language expression structure.
  • the memory network in this embodiment may specifically have a structure as shown in FIG. 3:
  • The input layer 301 is a first recurrent neural network with the same structure as the recurrent neural network in the embodiment of FIG. 2. It obtains the first feature vector and inputs it to the hidden layer. For details, see the description of the embodiment shown in FIG. 2 above.
  • The hidden layer 302 may include a neuron 3020, a neuron 3021, and a neuron 3022.
  • The hidden layer 302 may adopt the structure of a recurrent neural network.
  • The position of each participle is related to the characteristics of the entire text. Therefore, the historical state information 3023 of each neuron also needs to be saved and used as an input of each neuron.
  • For example, the input of the neuron 3021 may include the output of the neuron 3020 and the historical state information 3023. Features associated with the historical state information can thus be extracted from the input according to the historical state information stored in the memory network.
  • When a plurality of pre-collected first sample texts are used to train the memory network, the memory network stores the historical state information of first sample texts that conform to the natural language expression structure and meet the module's requirement information. The memory network is then used to determine the position of each participle in the valid text.
  • For an input valid text, the memory network can determine, based on the historical state information saved by each neuron, the start position 303 and the end position 304 of each participle within the features that indicate conformity with the natural language expression structure.
  • Participles are used here as the example unit of valid text.
  • The valid text is not limited to participles and may also include sentences and paragraphs.
  • For example, the first sample text for the requirement information "Spring Game" is "Xiao Hong kicks the shuttlecock".
  • The participle "Xiao Hong" is in the first position,
  • "kicks" is in the second position,
  • and "shuttlecock" is in the third position.
  • From the feature vectors input to it, the memory network trained with this first sample text can determine
  • the position of the participle "Xiao Ming", which has the same characteristics as "Xiao Hong", as the first position,
  • the position of the participle "flies", which has the same characteristics as "kicks", as the second position,
  • and the position of the participle "kite", which has the same characteristics as "shuttlecock", as the third position.
  • In step S104, each participle in the valid texts is arranged according to the first position information, so the text structure of the resulting filled text is the same as that of the first sample text and conforms to the natural language expression structure. For example, based on the first position information of the valid texts "Xiao Ming", "flies", and "kite" corresponding to the first feature vectors, the filled text "Xiao Ming flies kite" is obtained.
  • By determining the first position information, filled text that conforms to the natural language expression structure can be generated. This avoids the filled text that results from filling the valid texts directly into the module and does not conform to natural language expression habits, such as "kite flies Xiao Ming".
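  • Once the first position information is known, the arrangement in step S104 reduces to a sort. The sketch below assumes the positions have already been predicted by the memory network and are given as 1-based integers; the function name `arrange_by_position`, the English rendering of the participles, and joining with spaces are assumptions for illustration.

```python
from typing import List, Tuple


def arrange_by_position(participles: List[Tuple[str, int]]) -> str:
    """Order participles by their first position information and join them.

    Each item is (participle, position), with position counted from 1.
    """
    ordered = sorted(participles, key=lambda item: item[1])
    return " ".join(word for word, _ in ordered)


# Valid texts in the order they happened to be retrieved, each paired with
# the position assumed to have been predicted for it.
predicted = [("kite", 3), ("Xiao Ming", 1), ("flies", 2)]

filled_text = arrange_by_position(predicted)
print(filled_text)
```

  • Filling the valid texts in retrieval order would instead give "kite Xiao Ming flies", which is the kind of mechanical combination the position information is meant to prevent.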
  • The fixed writing format of the text to be generated may include an arrangement rule for the modules. The identification information of each module distinguishes the modules from one another, and the filled text of each module is then arranged by using the identification information of the module.
  • For example, if the fixed writing format of the text to be generated specifies that the "Theme" module M1 is arranged before the "Body" module M2, the filled text carrying identification information M1 is arranged before the filled text carrying identification information M2.
  • In the text generation method provided in the embodiment of the present application, for each module, the memory network is trained with a plurality of pre-collected first sample texts, and each first sample text conforms to the natural language expression structure and meets the module's requirement information. The first position information, in the filled text, that the memory network produces for each participle in a valid text is therefore the same as the position information of the corresponding participle in the first sample text. On this basis, the participles in the valid texts are arranged according to the first position information, so the text structure of the resulting filled text is the same as that of the first sample text and likewise conforms to the natural language expression structure. Therefore, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the resulting text is guaranteed to conform to the natural language expression structure.
  • The text generation method provided in the embodiment of the present application may further include the following step:
  • Mark each valid text of the module with the first identification information of the module.
  • The first identification information is preset information that uniquely identifies each module.
  • step S105 in the embodiment shown in FIG. 1 of the present application may specifically include:
  • Each filled text is arranged according to the sixth position information to obtain the text to be generated.
  • The filled text of each module needs to be arranged according to the fixed writing format of the text to be generated to which the module belongs.
  • The fixed writing format of the text to be generated may therefore be expressed in advance as a correspondence table or mapping (for example, key-value pairs) between first identification information and module positions.
  • According to this correspondence, the sixth position information of each filled text in the text to be generated can be determined. Each filled text is then arranged according to the sixth position information, and the resulting text conforms to the fixed writing format of the text to be generated.
  • For example, a fixed writing format for generating hot news includes ["Title" module, "Posting Time" module, "Body" module]. The filled text "2018 World Cup" of the "Title" module is marked with first identification information a1, which corresponds to position 01. The filled text "June 14, 2018" of the "Posting Time" module is marked with first identification information a2, which corresponds to position 02. The filled text of the "Body" module, "The 2018 World Cup matches will start on June 14, 2018 and will run until July 15 in 12 stadiums in 11 cities in Russia. This is the first time the World Cup has been held in Russia.", is marked with first identification information a3, which corresponds to position 03.
  • Accordingly, the sixth position information of the filled text of the "Title" module in the text to be generated is position 01,
  • the sixth position information of the filled text of the "Posting Time" module in the text to be generated is position 02,
  • and the sixth position information of the filled text of the "Body" module in the text to be generated is position 03.
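  • The key-value form of the fixed writing format can be sketched directly as a dictionary from first identification information to module position. The identifiers a1 to a3 and the filled texts come from the hot news example; the names `MODULE_POSITIONS` and `assemble`, and the choice to return a list of filled texts, are assumptions for illustration.

```python
from typing import Dict, List

# Fixed writing format as a mapping: first identification information -> module position.
MODULE_POSITIONS: Dict[str, int] = {"a1": 1, "a2": 2, "a3": 3}


def assemble(filled_texts: Dict[str, str], module_positions: Dict[str, int]) -> List[str]:
    """Order filled texts by the position that their identification maps to."""
    ordered_ids = sorted(filled_texts, key=lambda ident: module_positions[ident])
    return [filled_texts[ident] for ident in ordered_ids]


# Filled texts keyed by first identification information, in arbitrary order.
filled = {
    "a3": (
        "The 2018 World Cup matches will start on June 14, 2018 and will run "
        "until July 15 in 12 stadiums in 11 cities in Russia. This is the first "
        "time the World Cup has been held in Russia."
    ),
    "a1": "2018 World Cup",
    "a2": "June 14, 2018",
}

document = assemble(filled, MODULE_POSITIONS)
print("\n".join(document))
```

  • Keeping the format as data rather than code means a different document type only needs a different mapping.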
  • The requirement information can be used as the text to be answered, and the text that meets the requirement information can be used as the answer to it.
  • The valid text is thereby obtained at the semantic level of the requirement information, which avoids the inaccurate and insufficiently rich valid text obtained by matching only at the literal text level.
  • step S101 in the embodiment shown in FIG. 1 of the present application may specifically include the following steps 1 to 5:
  • Step 1: For each module in the fixed writing format of the text to be generated, obtain from the preset database a plurality of complete texts that match the event described by the text to be generated, as the backup texts of the module.
  • The complete text of each module describes the same event.
  • For example, the "Title" module and the "Body" module in a press release covering the start of the 2018 World Cup both describe the start of the 2018 World Cup.
  • Therefore, while the requirement information of each module indicates the text content of the module itself, it can also indicate the event described by the text to be generated to which the module belongs.
  • Accordingly, multiple complete texts in the preset database that match the event described by the text to be generated can be used as the backup texts of each module.
  • For example, the valid text of the "Party Natural Situations" module is the parties' information text in the case data,
  • and the valid text of the "Cause" module is the text of the litigation request in the case data.
  • Step 2: For each module, input each backup text of the module into a pre-trained second recurrent neural network to obtain a second feature vector of each backup text.
  • The second recurrent neural network is obtained by training with a plurality of pre-collected sample backup texts.
  • Step 3: For each module, input the module's requirement information into a pre-trained third recurrent neural network to obtain a third feature vector of the requirement information as the feature vector of the module.
  • The third recurrent neural network is obtained by training with multiple pre-collected pieces of sample requirement information of the module.
  • When the requirement information is used as the text to be answered and the text that meets the requirement information is used as the answer, the task is equivalent to calculating the degree of feature matching between the backup text and the requirement information. Therefore, for each module, it is necessary to obtain the second feature vector of each backup text of the module and the third feature vector of the requirement information of the module.
  • The second and third recurrent neural networks have the same structure as the recurrent neural network in the embodiment of FIG. 2. The difference is that, in order to obtain the corresponding outputs for different inputs, the samples used to train the different recurrent neural networks are different. The shared parts are not repeated here; for details, refer to the description of the embodiment shown in FIG. 2 above.
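  • One simple way to picture a "degree of feature matching" between a second feature vector and a third feature vector is cosine similarity. The application does not name a similarity measure, and its fourth recurrent neural network learns the matching rather than computing a fixed formula, so the measure, the function name, and the toy vectors below are purely illustrative assumptions.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical third feature vector of the requirement information.
requirement_vector = [1.0, 0.0, 1.0]

# Hypothetical second feature vectors of two backup texts.
backup_vectors: Dict[str, List[float]] = {
    "backup_1": [0.9, 0.1, 0.8],
    "backup_2": [0.0, 1.0, 0.0],
}

scores = {
    name: cosine_similarity(requirement_vector, vector)
    for name, vector in backup_vectors.items()
}
best = max(scores, key=lambda name: scores[name])
print(best, scores)
```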
  • Step 4: For each module, input the vector information corresponding to each backup text of the module into a pre-trained fourth recurrent neural network to obtain second position information, that is, the position in each backup text of the text that meets the module's requirement information.
  • The fourth recurrent neural network is obtained by training with sample complete texts that are marked with third position information and describe the same event as the module's requirement information. The third position information is the position, in the sample complete text, of the text that meets the module's requirement information.
  • For example, take one backup text and one sample complete text. The sample complete text, which describes the event "Spring Game" described by the text to be generated, is "Spring is here, children can go out and play."
  • The third position information marked in the sample complete text for the requirement information "Spring Game" includes the start position information of "Spring", which is the first position, and the end position information of "Kicker", which is the tenth position.
  • For the backup text "Spring is here, children can go out to play, and Xiao Ming likes flying a kite", the vector information consists of the second feature vector of the backup text and the third feature vector of the requirement information "Spring Game".
  • This vector information, in which the third feature vector is the feature vector of the module, is input to the fourth recurrent neural network to obtain the second position information of the text in the backup text that meets the module's requirement information "Spring Game":
  • the start position information of "Spring" is the first position,
  • and the end position information of "kite" is the tenth position.
  • Step 5: For each module, extract the text at the corresponding second position information from each backup text of the module, as valid text that meets the requirement information of the module.
  • For example, the text at the second position information is extracted from the backup text: the text that runs from the start position of "Spring" (the 1st position) to the end position of "kite" (the 10th position) is used as valid text that meets the module's requirement information "Spring Game".
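  • Step 5 is a span extraction: given a start position and an end position, the text between them is cut out of the backup text. The sketch below assumes positions are 1-based, inclusive, and counted in tokens; the application counts positions in its own units, and the token list here is a hypothetical tokenization chosen so that "kite" falls at the tenth position.

```python
from typing import List


def extract_span(tokens: List[str], start: int, end: int) -> List[str]:
    """Return the tokens from start to end, both 1-based and inclusive."""
    if not 1 <= start <= end <= len(tokens):
        raise ValueError("span positions fall outside the backup text")
    return tokens[start - 1:end]


# Hypothetical tokenization of a backup text; "kite" is the 10th token.
backup_tokens = [
    "Spring", "is", "here", "children", "go",
    "out", "to", "play", "flying", "kite", "today",
]

# Second position information assumed to come from the fourth recurrent
# neural network: start at position 1, end at position 10.
valid_tokens = extract_span(backup_tokens, 1, 10)
print(" ".join(valid_tokens))
```

  • Validating the span before slicing matters because Python slicing silently tolerates out-of-range indices, which would hide a bad position prediction.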
  • S101 in the embodiment shown in FIG. 1 in the foregoing application may specifically include the following steps:
  • For example, the requirement information of the "Body" module includes: requirement information Q1 "2018 World Cup holding time", requirement information Q2 "2018 World Cup holding place", and requirement information Q3 "Special information for the 2018 World Cup".
  • The multiple valid texts obtained from the preset database that meet each piece of requirement information of the "Body" module include: for requirement information Q1, valid texts A1 "The 2018 World Cup match starts on June 14, 2018" and A2 "The 2018 World Cup will continue until July 15"; for requirement information Q2, valid texts A3 "in Russia" and A4 "held in 12 stadiums in 11 cities"; and for requirement information Q3, valid text A5 "World Cup for the first time held in Russia".
  • S102 in the embodiment shown in FIG. 1 of the foregoing application may specifically include:
  • For each module, the plurality of valid texts of each piece of requirement information of the module are respectively input into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text.
  • This step obtains the feature vectors of the multiple valid texts that correspond to the multiple pieces of requirement information of the same module.
  • the text generation method provided in the embodiment of the present application may further include:
  • For each module, each piece of requirement information of the module is input into a pre-trained third recurrent neural network to obtain a third feature vector of each piece of requirement information of the module.
  • The third recurrent neural network is obtained by training with a plurality of pre-collected pieces of sample requirement information of the module.
  • The valid texts corresponding to the multiple pieces of requirement information need to be arranged according to their respective requirement information. Therefore, a feature vector of each piece of requirement information of the same module must be obtained for the subsequent determination of the position information of the module's multiple valid texts.
  • S103 in the embodiment shown in FIG. 1 of the present application may specifically include:
  • For each module, input the vector information corresponding to each piece of requirement information of the module into the pre-trained memory network to obtain the valid text corresponding to each piece of requirement information of the module and the corresponding first position information;
  • the first position information is the position information, in the filled text of the module, of each participle in the valid text;
  • the vector information corresponding to any piece of requirement information of the module includes the first feature vector of each valid text corresponding to that requirement information and the third feature vector of that requirement information;
  • the filled text is the text corresponding to the module, and the text structure of the filled text is the same as the text structure of the first sample text, marked with fourth position information, that is used in the training of the memory network;
  • the fourth position information is the position information, in the first sample text, of each text that meets the specified requirement information.
  • the vector information corresponding to each requirement information of the module can be input into the pre-trained memory network to obtain the valid text corresponding to each requirement information of the module and the corresponding first position information. Based on this, the participles in each valid text are subsequently arranged according to the first position information, and the resulting filled text has the same structure as the first sample text. Because the first sample text is in line with natural language description habits, the filled text is also in line with them.
  • the third feature vector of the requirement information corresponding to the valid text is also input to the memory network, and the first sample text used for training the memory network is labeled with the fourth position information. This ensures that the position information determined for each participle in the valid text is accurate, so that the participles can be arranged according to the requirement information.
  • the memory network in this embodiment may specifically have a structure as shown in FIG. 4:
  • the memory network in this embodiment is similar to the memory network in the embodiment of FIG. 3 described above. The difference is that, to cope with a situation where there is multiple requirement information, the memory network in this embodiment adds an input layer 401. For each module, a third feature vector of each requirement information of the module is obtained and input to the hidden layer. The recurrent neural network is not described again here; for details, refer to the description of the embodiment of FIG. 2 described above. After the input layer 401, the neuron 406, and the historical state information 4033 corresponding to the neuron are added to extract the third feature vector, the output of the input layer 401 is used as the input of the neuron 4030 to obtain the valid text corresponding to each requirement information of the module and the corresponding first position information.
  • each participle in the text is arranged corresponding to the demand information.
  • the input layer 402, the hidden layer 403, the neuron 4030, the neuron 4031, the neuron 4032, the historical state information 4033, the start position 404, and the cut-off position 405 of each participle are the same as the input layer 301, the hidden layer 302, the neuron 3020, the neuron 3021, the neuron 3022, the historical state information 3023, the start position 303, and the cut-off position 304 of each participle in the memory network of the embodiment shown in FIG. 3 of this application, and are not repeated here. For details, see the description of the embodiment shown in FIG. 3.
  • for example, a first sample text, marked with fourth position information, that meets the specified requirement information Q11 "2008 Olympic Games Held", Q12 "2008 Olympic Games Held Location", and Q13 "2008 Olympic Special Information" is: "The 2008 Olympic Games will be held in 6 cities in China from August 8 to August 24, 2008. This is the first time that the Olympic Games will be held in China."
  • the fourth position information marked in the first sample text includes the position information 4th and 6th positions of "August 8, 2008” and "August 24, 2008” that meet the specified requirement information Q11, which meet the specified requirements.
  • the valid texts A1 to A5 obtained above and the requirement information Q1 to Q3 of the "body" module are input into the memory network, and the fourth position information of the valid text of each requirement information in the filled text is determined, so that, after arrangement according to each fourth position information, a filled text that has the same structure as the first sample text and conforms to natural language is obtained:
  • the 2018 World Cup will be held from June 14 to July 15, 2018 in 12 stadiums in 11 cities in Russia. This is the first time the World Cup has been held in Russia.
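The arrangement step behind this example can be sketched in a few lines of Python. This is a minimal illustration rather than the application's implementation: the position values and the helper name `arrange_by_position` are assumptions, and in the described method the positions would be produced by the trained memory network.

```python
from typing import List, Tuple

# (position in the filled text, participle); positions are illustrative
# assumptions standing in for the memory network's output.
Placement = Tuple[int, str]

def arrange_by_position(placements: List[Placement]) -> List[str]:
    """Return the participles ordered by their position information."""
    return [word for _, word in sorted(placements, key=lambda p: p[0])]

placements = [
    (3, "in 12 stadiums in 11 cities in Russia"),
    (1, "The 2018 World Cup"),
    (2, "will be held from June 14 to July 15, 2018"),
]
filled_text = " ".join(arrange_by_position(placements)) + "."
print(filled_text)
```

Sorting on the position alone keeps the sketch independent of how the participles were segmented.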
  • the complete text of the module is a structured text and the complete text of the module is an unstructured text.
  • because structured type text has a fixed representation structure, less of its information needs to be determined by a neural network than is the case for unstructured type text, and a neural network usually occupies a large amount of computing resources. Therefore, in order to reduce the occupation of computing resources and improve the efficiency of text generation, the text type of the module can be determined, so that different text generation methods can be performed on modules with different text types in a targeted manner.
  • a flow of a text generation method according to another embodiment of the present application may include:
  • S501 For each module, input the requirement information of the module into a preset classification algorithm to obtain the text type of the filled text of the module, and the text type includes a structured type and an unstructured type.
  • if the text type of the filled text of the module is an unstructured type, S502 to S505 are performed; if the text type of the filled text of the module is a structured type, S506 to S508 are performed.
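The branching after S501 can be sketched as a small dispatcher. This is only an illustration under stated assumptions: the function name `fill_module` and the three callables are hypothetical stand-ins for the classification algorithm, the S502 to S505 path, and the S506 to S508 path.

```python
from typing import Callable

STRUCTURED = "structured"
UNSTRUCTURED = "unstructured"

def fill_module(
    requirement: str,
    classify: Callable[[str], str],
    run_unstructured: Callable[[str], str],
    run_structured: Callable[[str], str],
) -> str:
    """Route one module to the pipeline matching its text type (S501)."""
    text_type = classify(requirement)
    if text_type == UNSTRUCTURED:
        return run_unstructured(requirement)  # stands in for S502 to S505
    if text_type == STRUCTURED:
        return run_structured(requirement)    # stands in for S506 to S508
    raise ValueError(f"unknown text type: {text_type!r}")

# Stub callables, for illustration only.
demo = fill_module(
    "body",
    classify=lambda req: UNSTRUCTURED if req == "body" else STRUCTURED,
    run_unstructured=lambda req: f"[memory-network path for {req}]",
    run_structured=lambda req: f"[sequence-labeling path for {req}]",
)
print(demo)
```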
  • the preset classification algorithm may specifically be a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network trained in advance using a plurality of pre-collected sample requirement information corresponding to structured text and unstructured text. It is also possible to judge whether the requirement information is preset information corresponding to a text type. For example, for a civil indictment to be generated, the preset information corresponding to the structured type is "the natural situation of the parties", "the respondent court", "payment", and "attachment", and the preset information corresponding to the unstructured type is "suit request" and "facts and reasons". Any classification algorithm capable of determining the text type corresponding to a module based on the module's requirement information can be used in this application, which is not limited in this embodiment.
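The preset-information variant amounts to a table lookup. The following sketch uses the civil indictment entries quoted above; the dictionary name, the function name, and the choice to normalize case and whitespace are assumptions made for illustration.

```python
from typing import Dict, Optional

# Preset information from the civil indictment example, keyed in lower case.
PRESET_TEXT_TYPES: Dict[str, str] = {
    "the natural situation of the parties": "structured",
    "the respondent court": "structured",
    "payment": "structured",
    "attachment": "structured",
    "suit request": "unstructured",
    "facts and reasons": "unstructured",
}

def text_type_from_preset(
    requirement: str, presets: Dict[str, str] = PRESET_TEXT_TYPES
) -> Optional[str]:
    """Return the preset text type, or None if the requirement is not preset."""
    return presets.get(requirement.strip().lower())

print(text_type_from_preset("Attachment"))
print(text_type_from_preset("facts and reasons"))
```

Returning `None` for unknown requirement information leaves room to fall back to one of the trained classifiers.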
  • when the convolutional neural network is used to determine the text type of the filled text, it may specifically have the structure shown in FIG. 6.
  • the hidden layer of the neural network of this embodiment has two feature extraction channels. After the requirement information is input through the input layer 601, the channel 602 is used to extract local feature variables and the channel 603 is used to extract global feature variables, to ensure that the extracted features reflect not only the characteristics of each participle in the requirement information but also the overall semantics of the participles.
  • the probability that the requirement information belongs to each text type is output by the output layer 604, and based on the output probability, the text type of the filled text corresponding to the input requirement information is determined.
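How the output probabilities lead to a text type can be sketched as follows. The application does not state how the output layer computes its probabilities; the softmax over two raw scores and the rule of picking the most probable type are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

TEXT_TYPES = ("structured", "unstructured")

def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities (assumed output-layer behavior)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def pick_text_type(scores: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable text type and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return TEXT_TYPES[best], probs[best]

print(pick_text_type([2.0, 0.5]))
```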
  • Structured type text includes text or sentences with a fixed structure of expression.
  • Unstructured type text includes sentences whose structure of expression is not fixed. For example, the modules in the fixed writing format of a hot news item are a "title" module, a "release date" module, and a "body" module, where the text of the "title" and "release date" modules is structured text and the text of the "body" module is unstructured text.
  • S502 For each module in the fixed writing format of the text to be generated, obtain a plurality of valid texts from a preset database that meet the requirement information of the module, and the requirement information is used to indicate the text content corresponding to the module.
  • S503 For each module, input the plurality of valid texts of the module into the first recurrent neural network trained in advance to obtain the first feature vector of each valid text. The first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet the specified requirement information.
  • S504 For each module, input the first feature vector of each valid text of the module into the memory network trained in advance to obtain the first position information, in the filled text of the module, of each participle in each valid text of the module. The text structure of the filled text is the same as the text structure of the first sample text used in the training of the memory network, and the first sample text is a text that conforms to the natural language expression structure and meets the requirement information of the module.
  • the memory network is obtained by training with a plurality of pre-collected first sample texts.
  • S505 For each module, arrange the participles in each valid text of the module according to the obtained first position information to obtain the filled text of the module.
  • Steps S502 to S505 are the same steps as S101 to S104 in the embodiment shown in FIG. 1 of this application, and are not repeated here. For details, refer to the description of the embodiment shown in FIG. 1 of this application.
  • S506 Input multiple valid texts of the module into a sequence labeling model trained in advance to obtain the second identification information of each participle in each valid text.
  • the sequence labeling model is obtained by training with a plurality of pre-collected second sample valid texts that are pre-labeled with second identification information and meet the requirement information of the module.
  • the second identification information is used to represent the uniqueness of each participle in the valid text.
  • the sequence labeling model is used to label the input valid text with the second identification information, which is used to determine the position information of each participle in the filled text in step S507.
  • the sequence labeling model in this embodiment may specifically have the structure shown in FIG. 7.
  • the valid text is input to the sequence labeling model through the input layer 701 in the form of a string.
  • the second identification information corresponding to each participle is determined, so that the second identification information of each participle is labeled at the output layer 703. The participles in a text are associated with one another, and the context of a participle affects its semantics.
  • therefore, each neuron in the hidden layer of the sequence labeling model is an LSTM network (Long Short-Term Memory, a kind of RNN with a special structure). When LSTM networks serve as the neurons, information is exchanged between neurons to extract a feature that reflects the overall semantics of the valid text, and based on this feature, the second identification information is labeled for each participle of the valid text.
  • S507 Determine, according to the second identification information, the fifth position information of each participle in each valid text in the filled text of the module by using a preset correspondence between the identification and the participle position information.
  • the preset correspondence between the identifier and the segmentation position information may be a correspondence table between the identifier and the segmentation position information, and may also be a correspondence mapping (for example, key-value).
  • S508 Arrange the participles in each valid text according to the fifth position information of each participle in each valid text to obtain a filled text.
  • for example, the filled text corresponding to the "Title" module in the fixed writing format of the hot news to be generated is structured type text. The valid text "2018 World Cup starts on June 14 in Russia" is input into the preset sequence labeling model to obtain the second identification information g1 of the participle "2018", the second identification information g2 of the participle "World Cup", and the second identification information g3 of the participle "open match".
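Steps S507 and S508 for this example can be sketched with a key-value correspondence. The identifiers g1 to g3 come from the example above; the positions assigned to them and the function name `arrange_by_identification` are illustrative assumptions, since the application does not list the contents of the preset correspondence.

```python
from typing import Dict, List, Tuple

# Hypothetical preset correspondence: identification -> participle position.
ID_TO_POSITION: Dict[str, int] = {"g1": 1, "g2": 2, "g3": 3}

def arrange_by_identification(
    labelled: List[Tuple[str, str]], id_to_position: Dict[str, int]
) -> List[str]:
    """Map each (participle, identification) pair to a position, then sort."""
    placed = sorted(
        ((id_to_position[ident], word) for word, ident in labelled),
        key=lambda item: item[0],
    )
    return [word for _, word in placed]

# Output of the sequence labeling model, given here in arbitrary order.
labelled = [("World Cup", "g2"), ("open match", "g3"), ("2018", "g1")]
print(arrange_by_identification(labelled, ID_TO_POSITION))
```

Because the position comes from a lookup rather than a network, this path needs no inference beyond the labeling step.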
  • because structured type text has a fixed representation structure, less of its information needs to be determined through a neural network than is the case for unstructured type text, and a neural network usually occupies a large amount of computing resources. Therefore, determining the text type of the module and performing different text generation methods on modules with different text types reduces the occupation of computing resources and improves the efficiency of text generation.
  • an embodiment of the present application further provides a text generating device.
  • the structure of a text generating device may include:
  • a text acquisition module 801, configured to, for each module in the fixed writing format of the text to be generated, obtain from a preset database a plurality of valid texts that meet the requirement information of the module, where the requirement information is used to indicate the text content corresponding to the module;
  • a feature extraction module 802, configured to, for each module, input the plurality of valid texts of the module into a first recurrent neural network trained in advance to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet the specified requirement information;
  • a position information determining module 803, configured to, for each module, input the first feature vector of each valid text of the module into a memory network trained in advance to obtain the first position information, in the filled text, of each participle in each valid text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in the training of the memory network, the first sample text is a text that conforms to the natural language expression structure and meets the specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
  • a text generating module 804, configured to, for each module, arrange the participles in each valid text of the module according to the obtained first position information to obtain the filled text, and to arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
  • in the text generating device provided in the embodiment of the present application, for each module, the memory network used is obtained by training with a plurality of pre-collected first sample texts, and the first sample texts conform to the natural language expression structure.
  • the participles in the valid text are arranged according to the first position information, and the text structure of the obtained filled text is the same as that of the first sample text, so it also conforms to the natural language expression structure. Therefore, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the obtained text to be generated is also a text conforming to the natural language expression structure.
  • the text generation module 804 is specifically configured to:
  • determine the sixth position information, in the text to be generated, of the filled text of the module according to a preset correspondence between first identification information and module position, where the preset correspondence between the first identification information and the module position is used to represent the fixed writing format of the text to be generated;
  • arrange each filled text according to the sixth position information to obtain the text to be generated.
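Arranging the filled texts of the modules can be sketched as a lookup followed by a sort. The module identifiers, their positions, and the newline separator below are assumptions chosen to mirror the hot news example; the application only states that a preset correspondence between first identification information and module position encodes the fixed writing format.

```python
from typing import Dict

# Hypothetical correspondence: first identification info -> module position.
MODULE_POSITIONS: Dict[str, int] = {"title": 1, "release_date": 2, "body": 3}

def assemble_document(
    filled_texts: Dict[str, str], module_positions: Dict[str, int]
) -> str:
    """Order the filled texts by module position and join them."""
    ordered = sorted(filled_texts, key=lambda module_id: module_positions[module_id])
    return "\n".join(filled_texts[module_id] for module_id in ordered)

document = assemble_document(
    {
        "body": "The 2018 World Cup will be held in Russia.",
        "title": "2018 World Cup open match",
        "release_date": "June 14, 2018",
    },
    MODULE_POSITIONS,
)
print(document)
```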
  • the text acquisition module 801 is specifically used for:
  • the feature extraction module 802 is further configured to, for each module, input each backup text of the module into a second recurrent neural network trained in advance to obtain a second feature vector of each backup text, where the second recurrent neural network is obtained by training with a plurality of pre-collected sample backup texts;
  • the requirement information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of the requirement information is obtained as the feature vector of the module, where the third recurrent neural network is obtained by training with a plurality of pre-collected sample requirement information of the module;
  • the position information determining module 803 is further configured for each module to input the vector information corresponding to each backup text of the module into the fourth recurrent neural network obtained in advance, and obtain each backup text of the module.
  • the third position information is the position information, in the text, of the text that meets the requirement information of the module;
  • the text obtaining module 801 is specifically used for each module to extract the text at the corresponding second position information from each standby text of the module, as valid text that meets the requirements of the module.
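Extracting valid text from a backup text by position can be sketched as a slice. Representing the second position information as a pair of character offsets is an assumption made for illustration; the application does not specify the encoding at this point.

```python
from typing import Tuple

def extract_valid_text(backup_text: str, span: Tuple[int, int]) -> str:
    """Return the substring at the given (start, end) offsets, end exclusive."""
    start, end = span
    if not 0 <= start < end <= len(backup_text):
        raise ValueError(f"span {span} is outside the backup text")
    return backup_text[start:end]

backup = "According to reports, the World Cup is held in Russia for the first time."
target = "the World Cup is held in Russia for the first time"
start = backup.index(target)  # stands in for a predicted start offset
print(extract_valid_text(backup, (start, start + len(target))))
```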
  • when the module has multiple requirement information:
  • the text acquisition module 801 is specifically used for:
  • the feature extraction module 802 is further configured to:
  • for each module, input the plurality of valid texts of each requirement information of the module into the first recurrent neural network trained in advance to obtain the first feature vector of each valid text;
  • each requirement information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of each requirement information of the module is obtained, where the third recurrent neural network is obtained by training with a plurality of pre-collected sample requirement information of the module;
  • the location information determining module 803 is specifically configured to:
  • for each module, input the vector information corresponding to each requirement information of the module into the pre-trained memory network, and obtain the valid text corresponding to each requirement information of the module and the corresponding first position information;
  • the first position information is the position information, in the filled text of the module, of each participle in the valid text;
  • the vector information corresponding to any requirement information of the module includes: the first feature vector of each valid text corresponding to the requirement information, and the third feature vector of the requirement information;
  • the text structure of the filled text is the same as the text structure of the first sample text, labeled with the fourth position information, that is used in the training of the memory network, and the fourth position information is the position information of each text in the first sample text that meets the requirement information.
  • the structure of a text generating device may include:
  • a text classification module 901 is configured to input the requirement information of the module into a preset classification algorithm for each module, and obtain a text type of the module's filled text, where the text type includes a structured type and an unstructured type;
  • a text acquisition module 902, configured to, for each module in the fixed writing format of the text to be generated, when the text type of the filled text of the module is an unstructured type, obtain from a preset database a plurality of valid texts that meet the requirement information of the module;
  • a feature extraction module 903, configured to, for each module, when the text type of the filled text of the module is an unstructured type, input the plurality of valid texts of the module into a first recurrent neural network trained in advance to obtain the first feature vector of each valid text;
  • a position information determining module 904, configured to, for each module, when the text type of the filled text of the module is an unstructured type, input the first feature vector of each valid text into a memory network trained in advance to obtain the first position information, in the filled text of the module, of each participle in each valid text;
  • the text acquisition module 902 is further configured to, for each module in the fixed writing format of the text to be generated, when the text type of the filled text of the module is a structured type, input the plurality of valid texts of the module into a sequence labeling model trained in advance to obtain the second identification information of each participle in each valid text, where the sequence labeling model is obtained by training with a plurality of pre-collected second sample valid texts that are pre-labeled with second identification information and meet the requirement information of the module;
  • the position information determining module 904 is further configured to determine, according to the second identification information, the fifth position information of each participle in each valid text in the filled text of the module by using a preset correspondence between the identification and the participle position information;
  • the text generating module 905 is further configured to arrange the participles in each valid text according to the fifth position information of each participle in each valid text to obtain the filled text of the module; according to the text to be generated Fixed writing format, arrange the filled text of each module to get the text to be generated.
  • an embodiment of the present application further provides a computer device, as shown in FIG. 10, which may include:
  • the processor 1001 is configured to implement the steps of the text generating method in any one of the foregoing embodiments when the computer program stored in the memory 1003 is executed.
  • in the computer device provided in the embodiment of the present application, for each module, the memory network used is obtained by training with a plurality of pre-collected first sample texts, and the first sample text is a sample that conforms to the natural language expression structure and meets the module's requirement information. Therefore, the first position information, in the filled text, of each participle in the valid text obtained from the memory network is the same as the position information of each participle in the first sample text. On this basis, the participles in the valid text are arranged according to the first position information, and the text structure of the obtained filled text is the same as that of the first sample text, so it also conforms to the natural language expression structure. This guarantees that, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the obtained text to be generated is a text conforming to the natural language expression structure.
  • the foregoing memory may include RAM (Random Access Memory), and may also include NVM (Non-Volatile Memory), such as at least one disk memory.
  • the memory may also be at least one storage device located far from the processor.
  • the above processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), etc.; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • an embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the text generation method in any of the foregoing embodiments are implemented.
  • with the computer-readable storage medium provided by an embodiment of the present application, when the computer program is executed by a processor, the memory network used for each module is one obtained by training with a plurality of pre-collected first sample texts, and the first sample text is a sample that conforms to the natural language expression structure and meets the module's requirement information. Therefore, the first position information, in the filled text, of each participle in the valid text obtained from the memory network is the same as the position information of each participle in the first sample text.
  • the participles in the valid text are arranged according to the first position information, and the text structure of the obtained filled text is the same as that of the first sample text, so it also conforms to the natural language expression structure. This guarantees that, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the obtained text to be generated is a text conforming to the natural language expression structure.
  • a computer program product containing instructions is also provided.
  • when the computer program product is run on a computer, the computer is caused to execute the text generating method in any of the foregoing embodiments.
  • an application program is also provided, and when the application program is running, the text generating method in any of the foregoing embodiments may be executed.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or DSL (Digital Subscriber Line)) or wirelessly (for example, infrared, radio, or microwave).
  • a computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media.
  • the available media may be magnetic media (for example, a floppy disk, a hard disk, or a magnetic tape), optical media (for example, a DVD (Digital Versatile Disc)), or semiconductor media (for example, an SSD (Solid State Disk)).

Abstract

The embodiments of the present application provide a text generation method, an apparatus and a device. Said method comprises: for each module in a fixed writing format of a text to be generated, acquiring, from a preset database, a plurality of valid texts conforming to demand information of the module; inputting, for each module, the plurality of valid texts of the module into a pre-trained first recurrent neural network respectively, so as to obtain a first feature vector of each valid text; inputting, for each module, the first feature vector of each valid text into a pre-trained memory network, so as to obtain first position information, in a filling text of the module, of the segmented words in each valid text, and arranging the segmented words in each valid text according to the first position information to obtain the filling text; and arranging the filling text of each module according to the fixed writing format of the text to be generated, so as to obtain the text to be generated. Thus, a text to be generated that conforms to the natural language expression structure is obtained.

Description

Text generation method, apparatus and device
This application claims priority to Chinese patent application No. 201810846953.8, filed with the Chinese Patent Office on July 27, 2018 and entitled "Text generation method, apparatus and device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of natural language processing, and in particular to a text generation method, apparatus, and device.
Background
Natural language is the language people use every day. Natural language processing technology enables natural language communication between humans and computers, and is widely used to generate texts that have a fixed writing format and specified requirement information and are expressed in natural language. For example, for each module in the fixed writing format of the text to be generated, natural language processing technology is used to determine, from a database, valid text that meets the text requirement information of each module; the determined valid text is then filled directly into each module to obtain the filled text of each module, and the filled texts of the modules are arranged according to the fixed writing format to obtain the text to be generated. The filled text of each module in the fixed writing format usually includes structured text, in which the expression structure of words or sentences is fixed, and/or unstructured text, in which the expression structure of sentences is not fixed. For example, the modules in the fixed writing format of a hot news item are a "Title" module, a "Release Date" module, and a "Body" module, where the filled text of the "Title" and "Release Date" modules is structured text and the filled text of the "Body" module is unstructured text.
In the above natural language processing technology, valid text is filled directly into the module without considering the expression structure after filling. For a module containing unstructured text, the filled text of the module is therefore likely to be a mechanical combination of multiple valid texts that does not conform to the natural language expression structure, so the text to be generated that is obtained from the filled modules also fails to conform to the natural language expression structure. Take the "Body" module of the above hot news item as an example. The text requirement information of the "Body" module is "2018 World Cup". For the "Body" module, the valid texts determined from the database that meet the text requirement information include: "The World Cup is held in Russia for the first time", "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia", and "The competition will be held from June 14 to July 15, 2018". Because the filled text of the "Body" module is unstructured text whose expression structure is not fixed, filling the valid texts directly into the module may produce the filled text "The competition will be held from June 14 to July 15, 2018, the 2018 World Cup is held in 12 stadiums in 11 cities in Russia, the World Cup is held in Russia for the first time". A filled text that conforms to the natural language expression structure could instead be "The 2018 World Cup will be held from June 14 to July 15, 2018 in 12 stadiums in 11 cities in Russia. This is the first time the World Cup has been held in Russia".
It can be seen that, for a module containing unstructured text, when valid content is filled directly into the module to produce the text to be generated, the resulting text has a text structure that does not conform to natural language expression structure.
Summary of the Invention
The purpose of the embodiments of the present application is to provide a text generation method, apparatus, and device, so as to generate text that conforms to natural language expression structure. The specific technical solutions are as follows:
In a first aspect, an embodiment of the present application provides a text generation method, which includes:
for each module in the fixed writing format of a text to be generated, obtaining, from a preset database, a plurality of valid texts that meet the requirement information of the module, where the requirement information indicates the text content corresponding to the module;
for each module, inputting the plurality of valid texts of the module separately into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
for each module, inputting the first feature vector of each valid text of the module separately into a pre-trained memory network to obtain first position information of each word segment of each valid text in the filled text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, the first sample text is text that conforms to natural language expression structure and meets the specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
for each module, arranging the word segments of each valid text of the module according to the obtained first position information to obtain the filled text of the module;
arranging the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
In a second aspect, an embodiment of the present application provides a text generation apparatus, which includes:
a text acquisition module, configured to obtain, for each module in the fixed writing format of a text to be generated, a plurality of valid texts from a preset database that meet the requirement information of the module, where the requirement information indicates the text content corresponding to the module;
a feature extraction module, configured to input, for each module, the plurality of valid texts of the module separately into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
a position information determination module, configured to input, for each module, the first feature vector of each valid text of the module separately into a pre-trained memory network to obtain first position information of each word segment of each valid text of the module in the filled text of the module, where the text structure of the filled text is the same as the text structure of the first sample text in the memory network, the first sample text is text that conforms to natural language expression structure and meets the specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
a text generation module, configured to arrange, for each module, the word segments of each valid text of the module according to the obtained first position information to obtain the filled text of the module, and to arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
In a third aspect, an embodiment of the present application provides a computer device, which includes:
a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to implement the steps of the text generation method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the steps of the text generation method provided in the first aspect are implemented.
In the text generation method, apparatus, and device provided by the embodiments of the present application, the memory network for each module is trained with a plurality of pre-collected first sample texts, and each first sample text is a sample that conforms to natural language content structure and meets the requirement information of the module. Therefore, the first position information, obtained with the memory network, of each word segment of a valid text in the filled text is the same as the position information of the corresponding word segment in the first sample text. On this basis, when the word segments of the valid texts are arranged according to the first position information, the text structure of the resulting filled text is the same as that of the first sample text and thus conforms to natural language expression structure. This ensures that, after the filled text of each module is arranged according to the fixed writing format of the text to be generated, the resulting text conforms to natural language expression structure.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application and of the prior art more clearly, the drawings used in the embodiments and the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a text generation method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a recurrent neural network in the text generation method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a memory network in the text generation method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a memory network in the text generation method according to another embodiment of the present application;
FIG. 5 is a schematic flowchart of a text generation method according to yet another embodiment of the present application;
FIG. 6 is a schematic structural diagram of a convolutional neural network in the text generation method according to yet another embodiment of the present application;
FIG. 7 is a schematic structural diagram of a sequence labeling model in the text generation method according to yet another embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text generation apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text generation apparatus according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The text generation method of an embodiment of the present application is introduced first.
The text generation method provided by the embodiments of the present application can be applied to any computer device capable of text generation, including desktop computers, portable computers, Internet televisions, smart mobile terminals, wearable smart terminals, servers, and the like. No limitation is imposed here; any computer device that can implement the embodiments of the present application falls within their protection scope.
FIG. 1 shows the flow of a text generation method according to an embodiment of the present application. The method may include:
S101: for each module in the fixed writing format of a text to be generated, obtain, from a preset database, a plurality of valid texts that meet the requirement information of the module, where the requirement information indicates the text content corresponding to the module.
Within one text to be generated, the texts of all modules describe the same event. For example, in a news release reporting the start of the 2018 World Cup, both the "Title" module and the "Body" module describe the start of the 2018 World Cup. Therefore, for each module of a text to be generated, the module's requirement information indicates not only the text content corresponding to the module itself but also the event described by the text to be generated to which the module belongs.
A plurality of valid texts that meet the requirement information of a module can be obtained from the preset database in various ways. For example, keyword matching may be used to obtain, from the preset database, texts that contain the requirement information of the module. Alternatively, the requirement information may be treated as a text to be answered, and reading comprehension technology may be used to locate, in the preset database, the answer that matches the text to be answered; the answer at that location is then used as a valid text. Any method for obtaining valid texts can be used in the present application, and this embodiment imposes no limitation in this respect.
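The keyword-matching option can be illustrated with a minimal sketch. The function name `find_valid_texts` and the sample database entries are hypothetical and are not taken from the application; the sketch only assumes that the database is a list of strings and that "matching" means plain substring containment.

```python
def find_valid_texts(requirement, database):
    """Return every text in the database that contains the requirement string.

    Hypothetical helper: keyword matching is modelled as substring containment.
    """
    return [text for text in database if requirement in text]


# Illustrative database entries (invented for this sketch).
database = [
    "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia",
    "The 2008 Olympic Games were held in Beijing",
    "The 2018 World Cup matches run from June 14 to July 15",
]

valid_texts = find_valid_texts("2018 World Cup", database)
```

A reading-comprehension retriever would replace the substring test with a model that locates an answer span, which is why the application later prefers it when literal matching is too narrow.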
S102: for each module, input the plurality of valid texts of the module separately into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information.
For each module, the event described by a sample valid text that meets the specified requirement information shares features with the event described by the module's requirement information. The specified requirement information may be the same as or similar to the module's requirement information and can be set according to the needs of the specific application. For example, when the requirement information is "2018 World Cup", the specified requirement information may be "2018 World Cup", "2008 Olympic Games", "2018 NBA", and so on. When the requirement information is "spring games", the specified requirement information may be "spring games", "winter games", "indoor games", and so on.
In addition, by way of example, the RNN (Recurrent Neural Network) may have the structure shown in FIG. 2: the current input of a neuron 202 in the hidden layer may include the output 2010 of the input layer 201 and the output 2020 of the neuron 202 at the previous moment, so that the recurrent neural network remembers the output of the previous moment and uses it to determine the output at the current moment, and thereby obtains the feature vector output by the output layer 203. Thus, for text whose word segments are not isolated from one another, the current word segment and the previous word segment can be used to predict the next word segment. When extracting the feature vector of a valid text, a recurrent neural network can be used so that the extracted features contain not only the features of individual word segments but also the relationships among the word segments: because the network remembers the output of the previous moment and uses it to determine the output at the current moment, the extracted feature vector reflects both the features of each word segment of the valid text and the features of the relationships among the word segments. On this basis, the first recurrent neural network in S102, obtained by training with a plurality of pre-collected sample valid texts that meet the specified requirement information, establishes a mapping between valid texts and feature vectors, which ensures that the obtained first feature vector reflects the semantic features of the valid text as a whole rather than only the features of individual word segments. For example, if the current word segment is "hit" and the previous word segment is "driving", the next word segment is likely to be "injured".
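The recurrence described for FIG. 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the application does not specify the cell type, so a simple Elman-style update h_t = tanh(W_in x_t + W_rec h_(t-1)) is assumed; the weights `W_IN` and `W_REC` and the two-dimensional word-segment vectors are arbitrary untrained values chosen for the sketch, and the final hidden state stands in for the "first feature vector".

```python
import math


def rnn_encode(segment_vectors, w_in, w_rec):
    """Run an Elman-style recurrence and return the final hidden state.

    Each step combines the current input with the hidden state of the
    previous step, so the result depends on the order of the segments.
    """
    hidden = [0.0] * len(w_rec)
    for x in segment_vectors:
        new_hidden = []
        for i in range(len(hidden)):
            s = sum(w_in[i][j] * x[j] for j in range(len(x)))
            s += sum(w_rec[i][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(s))
        hidden = new_hidden
    return hidden


# Arbitrary, untrained weights (assumptions for illustration only).
W_IN = [[0.5, -0.3], [0.1, 0.8]]
W_REC = [[0.2, 0.4], [-0.6, 0.1]]

# Hypothetical one-hot vectors for two word segments.
drive, hit = [1.0, 0.0], [0.0, 1.0]

forward = rnn_encode([drive, hit], W_IN, W_REC)
backward = rnn_encode([hit, drive], W_IN, W_REC)
```

Because the hidden state carries over between steps, `forward` and `backward` differ even though they contain the same two word segments, which is the property the paragraph above relies on.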
It can further be understood that the recurrent neural network in any embodiment of the present application is similar to the first recurrent neural network in S102; the difference is that, to extract feature vectors from different input texts, different samples are used to train the different recurrent neural networks.
S103: for each module, input the first feature vector of each valid text of the module separately into a pre-trained memory network to obtain first position information of each word segment of each valid text of the module in the filled text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, the first sample text is text that conforms to natural language expression structure and meets the requirement information of the module, and the memory network is obtained by training with a plurality of pre-collected first sample texts.
A characteristic of the fixed writing format of a text to be generated is that the arrangement of the modules concerns only the fixed format of the text and does not involve the expression structure of the wording in the text. For example, if the "Title" module is placed after the "Body" module while the text in both modules conforms to natural language expression structure, the resulting text merely has an abnormal fixed format; it does not suffer from text that fails to conform to natural language expression structure. Therefore, to make the text to be generated conform to natural language expression structure, the filled text of each module must conform to natural language expression structure.
To this end, the memory network can be trained with a plurality of pre-collected first sample texts, each of which is a sample that conforms to natural language content structure and meets the requirement information of the module. The first position information, obtained with the memory network, of each word segment of a valid text in the filled text is therefore the same as the position information of the corresponding word segment in the first sample text. This ensures that, in the subsequent step S104, when the word segments of each valid text are arranged according to the first position information, the word segments in the resulting filled text occupy the same positions as the word segments in the first sample text, so that the filled text conforms to natural language content structure.
The memory network in this embodiment may have the structure shown in FIG. 3:
The input layer 301 is a first recurrent neural network with the same structure as the recurrent neural network of the embodiment of FIG. 2. It is used to obtain the first feature vector and to input the first feature vector into the hidden layer. Details are not repeated here; see the description of the embodiment shown in FIG. 2.
The hidden layer 302 may include a neuron 3020, a neuron 3021, and a neuron 3022. When the position of each word segment in a text is determined, the contextual relationships among the word segments affect those positions, so the hidden layer 302 may adopt the structure of a recurrent neural network. Moreover, the position of each word segment is related to the features of the text as a whole, so the historical state information 3023 of each neuron also needs to be saved and used as an input to each neuron; for example, the input of neuron 3021 may include the output of neuron 3020 and the state information 3023. In this way, features can be extracted from the input according to the historical state information saved by the memory network, so as to extract features associated with that historical state information. For example, in step S103 the memory network is trained with a plurality of pre-collected first sample texts, so the memory network stores historical state information of the first sample texts, which conform to natural language content structure and meet the requirement information of the module. When the memory network is subsequently used to determine the first position information of each word segment of a valid text in the filled text, the start position 303 and the end position 304 of each word segment in the features of the input valid text can be determined from the historical state information that each neuron has saved to indicate conformity with natural language content structure.
For ease of understanding, the embodiments of the present application use word segments as the valid texts in the exemplary description. In specific applications, valid texts are not limited to word segments and may also include sentences, paragraphs, and so on.
By way of example, a first sample text that meets the requirement information "spring games" is "Xiao Hong kicks a shuttlecock". In the first sample text, the word segment "Xiao Hong" is at position 1, "kicks" is at position 2, and "a shuttlecock" is at position 3. For the feature vectors input to it, the memory network trained with this first sample can then assign position 1 to the word segment "Xiao Ming", which has the same features as "Xiao Hong"; position 2 to the word segment "flies", which has the same features as "kicks"; and position 3 to the word segment "a kite", which has the same features as "a shuttlecock".
S104: for each module, arrange the word segments of each valid text of the module according to the obtained first position information to obtain the filled text of the module.
Because the first position information determined in step S103 is the same as the positions of the word segments in the first sample text, the filled text obtained in step S104 by arranging the word segments of the valid texts according to the first position information has the same text structure as the first sample text and conforms to natural language expression structure. For example, from the first position information of the valid texts "Xiao Ming", "flies", and "a kite" corresponding to the first feature vectors, the filled text "Xiao Ming flies a kite" is obtained. Determining the first position information makes it possible to generate filled text that conforms to natural language expression structure, and avoids the filled texts that result from filling valid texts directly into the module, such as "flies a kite Xiao Ming" or "a kite flies Xiao Ming", which do not conform to natural language expression habits.
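Step S104 itself reduces to ordering the word segments by their predicted positions. The sketch below assumes the memory network's output has already been collected into a dictionary from word segment to first position information; the dictionary `predicted`, the function name `arrange_segments`, and the English glosses of the kite example (小明放风筝) are hypothetical stand-ins, not part of the application.

```python
def arrange_segments(segment_positions):
    """Order word segments by their predicted position and join them."""
    ordered = sorted(segment_positions.items(), key=lambda item: item[1])
    return " ".join(segment for segment, _ in ordered)


# Hypothetical memory-network output: word segment -> first position information.
predicted = {"a kite": 3, "Xiao Ming": 1, "flies": 2}

filled_text = arrange_segments(predicted)
```

The input order of the dictionary plays no role; only the predicted positions determine the filled text, which is what prevents the mechanical concatenation described in the background section.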
S105: arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
The fixed writing format of the text to be generated may include an arrangement rule for the modules, with each module distinguished by its identification information, so that the filled text of each module can be arranged by using the module's identification information. For example, if the fixed writing format of the text to be generated specifies that the "Subject" module M1 comes before the "Body" module M2, the filled text of the module identified by M1 is placed before the filled text of the module identified by M2.
In the text generation method provided by the embodiments of the present application, the memory network for each module is trained with a plurality of pre-collected first sample texts, and each first sample text is a sample that conforms to natural language content structure and meets the requirement information of the module. Therefore, the first position information, obtained with the memory network, of each word segment of a valid text in the filled text is the same as the position information of the corresponding word segment in the first sample text. On this basis, when the word segments of the valid texts are arranged according to the first position information, the text structure of the resulting filled text is the same as that of the first sample text and thus conforms to natural language expression structure. This ensures that, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the resulting text conforms to natural language expression structure.
Optionally, after step S101 of the embodiment shown in FIG. 1 of the present application, the text generation method provided by the embodiments of the present application may further include the following step:
for each module, label each valid text of the module with the first identification information of the module.
The first identification information is preset information that indicates the uniqueness of each module.
Correspondingly, step S105 of the embodiment shown in FIG. 1 of the present application may specifically include:
for each module, determining sixth position information of the filled text of the module in the text to be generated according to a preset correspondence between first identification information and module positions, where the preset correspondence between first identification information and module positions represents the fixed writing format of the text to be generated;
arranging each filled text according to the sixth position information to obtain the text to be generated.
To obtain the text to be generated, the filled text of each module must also be arranged according to the fixed writing format of the text to be generated to which the module belongs. Specifically, the fixed writing format of the text to be generated can be expressed in advance as a correspondence table or mapping (for example, key-value pairs) between first identification information and module positions. The sixth position information of each filled text in the text to be generated can then be determined from this correspondence, and each filled text can be arranged according to its sixth position information, so that the resulting text conforms to the fixed writing format of the text to be generated.
示例性的,一待生成热点新闻的固定写作格式包括【“标题”模块,“发布时间”模块,“正文”模块】。为“标题”模块的填充文本“2018世界杯开赛”标注第一标识信息a1,为“发布时间”模块的填充文本“2018年6月14日”标注第一标识信息a2,位置02,为“正文”模块的填充文本“2018年世界杯比赛于2018年6月14日开赛,将持续至7月15日,在俄罗斯境内11座城市中的12座球场内举行,这是世界杯首次在俄罗斯境内举行”标注第一标识信息a3,位置03。按照预设的第一标识信息与模块位置的对应关系【a1对应位置01,a2对应位置02,a3对应位置03】,确定“标题”模块的填充文本在待生成文本中的第六位置信息是位置01,“发布时间”模块的填充文本在待生成文本中的第六位置信息是位置02,“正文”模块的填充文本在待生成文 本中的第六位置信息是位置03。按照第六位置信息排列每个填充文本,得到待生成文本【标题:2018世界杯开赛;发布时间:2018年6月14日;正文:2018年世界杯比赛于2018年6月14日开赛,将持续至7月15日,在俄罗斯境内11座城市中的12座球场内举行,这是世界杯首次在俄罗斯境内举行】。Exemplarily, a fixed writing format for generating hot news includes ["Title" module, "Posting time" module, "Text" module]. Mark the first identification information a1 for the filling text of the "Title" module "2018 World Cup", and mark the first identification information a2 for the filling text of the "Posting Time" module on June 14, 2018, position 02, as the "body text" Filled text of the module "The 2018 World Cup matches will start on June 14, 2018 and will run until July 15 in 12 stadiums in 11 cities in Russia. This is the first time the World Cup has been held in Russia." Mark the first identification information a3, position 03. According to the preset correspondence between the first identification information and the module position [a1 corresponds to position 01, a2 corresponds to position 02, and a3 corresponds to position 03], it is determined that the sixth position information of the filling text of the "title" module in the text to be generated is At position 01, the sixth position information of the filled text of the "Post Time" module in the text to be generated is position 02, and the sixth position information of the filled text of the "Text" module in the text to be generated is position 03. 
Arrange each fill text according to the sixth position information to get the text to be generated [Title: 2018 World Cup starts; Release time: June 14, 2018; Text: 2018 World Cup matches start on June 14, 2018, and will continue until July 15, held in 12 stadiums in 11 cities in Russia, this is the first time the World Cup was held in Russia].
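The arrangement by identifier-to-position mapping described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `arrange_filled_texts`, the tuple layout, and the "; " separator are assumptions made for the example, and the body text is abbreviated.

```python
# Minimal sketch: order module texts by a preset identifier -> position mapping.
# The names and data layout here are illustrative assumptions.

# Preset correspondence between first identification information and module position.
ID_TO_POSITION = {"a1": 1, "a2": 2, "a3": 3}


def arrange_filled_texts(filled_texts, id_to_position):
    """Arrange (identifier, module name, text) triples by their mapped position."""
    ordered = sorted(filled_texts, key=lambda item: id_to_position[item[0]])
    return "; ".join(f"{name}: {text}" for _, name, text in ordered)


# Filled texts may arrive in any order; the body text is abbreviated here.
filled = [
    ("a3", "Body", "The 2018 World Cup kicks off on June 14, 2018"),
    ("a1", "Title", "2018 World Cup kicks off"),
    ("a2", "Release time", "June 14, 2018"),
]

document = arrange_filled_texts(filled, ID_TO_POSITION)
print(document)
```

Keeping the format as a plain mapping means a different fixed writing format only requires a different `ID_TO_POSITION` table, not different code.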
When obtaining text that meets the requirement information, using the requirement information as keywords and retrieving text by keyword matching can yield content that is neither accurate nor rich enough. To avoid this, the requirement information can be treated as the text to be answered, and the text that meets the requirement information as the answer to it, so that valid text is obtained at the semantic level of the requirement information rather than by matching only at the literal level.
Therefore, optionally, step S101 of the embodiment shown in FIG. 1 of the present application may specifically include the following steps 1 to 5:
Step 1: for each module in the fixed writing format of the text to be generated, obtain from the preset database multiple complete texts that match the event described by the text to be generated, as the backup texts of the module.
For a given text to be generated, the complete texts of all modules describe the same event. For example, in a news article reporting the start of the 2018 World Cup, both the "Title" module and the "Body" module describe the start of the 2018 World Cup. Therefore, for each module of a text to be generated, the requirement information indicates not only the complete text of the module itself but also the event described by the text to be generated to which the module belongs. On this basis, to ensure that the obtained valid texts describe the same event and that rich valid text can be obtained, multiple complete texts in the preset database that match the event described by the text to be generated can be used as the backup texts of every module. Although every module uses the same set of complete texts, different modules have different requirement information, so the valid texts for different requirement information are different parts of those complete texts, and no content is duplicated. For example, for a civil complaint to be generated, the valid text of the "Particulars of the parties" module is the party information text in the case materials, and the valid text of the "Cause of action" module is the claims text in the case materials.
Step 2: for each module, input each backup text of the module into a pre-trained second recurrent neural network to obtain a second feature vector of each backup text. The second recurrent neural network is trained on multiple pre-collected sample backup texts.
Step 3: for each module, input the requirement information of the module into a pre-trained third recurrent neural network to obtain a third feature vector of the requirement information, which serves as the feature vector of the module. The third recurrent neural network is trained on multiple pre-collected items of sample requirement information for the module.
In specific applications, treating the requirement information as the text to be answered and the text that meets it as the answer amounts to computing the degree of feature matching between each backup text and the requirement information. Therefore, for each module, the second feature vector of each backup text of the module and the third feature vector of the requirement information of the module must be obtained. The second and third recurrent neural networks have the same structure as the recurrent neural network in the embodiment of FIG. 2 of the present application; they differ only in the samples used for training, so that each network produces the appropriate output for its own kind of input. The common parts are not repeated here; see the description of the embodiment shown in FIG. 2 above.
Step 4: for each module, input the vector information corresponding to each backup text of the module into a pre-trained fourth recurrent neural network to obtain, for each backup text, second position information of the text that meets the requirement information of the module. The vector information corresponding to any backup text of the module includes the second feature vector of that backup text and the feature vector of the module. The fourth recurrent neural network is trained on multiple pre-collected sample complete texts that are labeled with third position information and that describe the same event corresponding to the requirement information of the module; the third position information is the position, within the sample complete text, of the text that meets the requirement information of the module.
For example, consider one backup text and one sample complete text. The sample complete text, which is labeled with third position information and describes the event "spring games" described by the text to be generated, is "Spring has come, children can go out to play, and Xiaohong goes to kick the shuttlecock". The third position information labeled in this sample for the requirement information "games suitable for spring" includes the start position, position 1, where "spring" is located, and the end position, position 10, where "kick the shuttlecock" is located. With a fourth recurrent neural network trained on such samples, the vector information corresponding to the backup text "When spring comes, children can go out to play, and Xiaoming likes flying kites", namely its second feature vector together with the third feature vector of the requirement information "spring games" (that is, the feature vector of the module), is input into the fourth recurrent neural network. The output is the second position information of the text in the backup text that meets the requirement information "spring games": the start position, position 1, where "spring" is located, and the end position, position 10, where "flying kites" is located.
Step 5: for each module, extract from each backup text of the module the text at the corresponding second position information, as valid text that meets the requirement information of the module.
After the second position information is obtained in step 4, the text at the corresponding second position information can be extracted, for each module, from each backup text of the module as valid text that meets the requirement information of the module. For example, from the backup text "When spring comes, children can go out to play, and Xiaoming likes flying kites", the text "flying kites in spring" at the second position information of that backup text (start position 1, where "spring" is located, and end position 10, where "flying kites" is located) is extracted as valid text that meets the requirement information "spring games" of the module.
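The extraction in step 5 can be sketched as follows, assuming the second position information is a set of 1-based positions over the segmented words of a backup text. Following the example above, the sketch returns the text located at the start and end positions. The 10-token segmentation, the English glosses, and the function name `extract_at_positions` are hypothetical and chosen only so that the positions match the example.

```python
# Minimal sketch of step 5: pick the text at the predicted positions.
# Positions are 1-based, as in the example above; the segmentation is hypothetical.


def extract_at_positions(tokens, positions):
    """Return the tokens located at the given 1-based positions."""
    for pos in positions:
        if not 1 <= pos <= len(tokens):
            raise ValueError(f"position {pos} is outside the text (1..{len(tokens)})")
    return [tokens[pos - 1] for pos in positions]


# Hypothetical 10-token segmentation of the backup text (English gloss).
backup_tokens = [
    "spring", "comes", ",", "children", "can", "go out",
    "to play", ",", "Xiaoming likes", "flying kites",
]

# Second position information: start position 1 and end position 10.
valid_text = extract_at_positions(backup_tokens, (1, 10))
print(valid_text)
```

The range check is an added safeguard, not something the description specifies; it makes a position predicted outside the backup text fail loudly instead of silently returning the wrong word.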
In specific applications, a single module may have multiple items of requirement information, in which case valid text must be obtained for each item. Accordingly, optionally, for each module in the fixed writing format of the text to be generated, when the module has multiple items of requirement information:
S101 in the embodiment shown in FIG. 1 of the present application may specifically include the following step:
For each module in the fixed writing format of the text to be generated, obtain from the preset database multiple valid texts that meet each item of requirement information of the module.
For example, in the fixed writing format of a hot news item to be generated, the requirement information of the "Body" module includes requirement information Q1 "time of the 2018 World Cup", requirement information Q2 "location of the 2018 World Cup", and requirement information Q3 "special information about the 2018 World Cup". The valid texts obtained from the preset database for each item of requirement information of the "Body" module then include: for Q1, valid text A1 "The 2018 World Cup kicks off on June 14, 2018" and A2 "The 2018 World Cup runs until July 15"; for Q2, valid text A3 "in Russia" and A4 "held in 12 stadiums in 11 cities"; and for Q3, valid text A5 "the World Cup is held in Russia for the first time".
Correspondingly, S102 in the embodiment shown in FIG. 1 of the present application may specifically include:
For each module, input the multiple valid texts of each item of requirement information of the module into the pre-trained first recurrent neural network to obtain a first feature vector of each valid text.
Unlike S102 in the embodiment shown in FIG. 1, which handles the multiple valid texts corresponding to the requirement information of a module, this step handles the multiple valid texts corresponding to multiple items of requirement information of the same module.
Correspondingly, before S103 in the embodiment shown in FIG. 1 of the present application, the text generation method provided in the embodiments of the present application may further include:
For each module, input each item of requirement information of the module into the pre-trained third recurrent neural network to obtain a third feature vector of each item of requirement information of the module. The third recurrent neural network is trained on multiple pre-collected items of sample requirement information for the module.
In specific applications, if a module has multiple items of requirement information, the valid texts corresponding to those items must each be arranged according to their own requirement information. Therefore, the feature vector of each item of requirement information of the module must be obtained for later use in determining the position information of the module's valid texts.
Correspondingly, S103 in the embodiment shown in FIG. 1 of the present application may specifically include:
For each module, input the vector information corresponding to each item of requirement information of the module into a pre-trained memory network to obtain the first position information of the valid texts corresponding to each item of requirement information. The first position information is the position, in the filled text of the module, of each segmented word in a valid text. The vector information corresponding to any item of requirement information of the module includes the first feature vector of each valid text corresponding to that item and the third feature vector of that item. The filled text is the text corresponding to the module, and its text structure is the same as that of the first sample text, labeled with fourth position information, that is used to train the memory network; the fourth position information is the position, in the first sample text, of each text that meets the specified requirement information.
Because the first sample text is labeled with the fourth position information of each text that meets the specified requirement information, the vector information corresponding to each item of requirement information of a module can be input into the pre-trained memory network to obtain the first position information of the valid texts corresponding to that item. On this basis, the segmented words in each valid text are subsequently arranged according to the first position information, and the resulting filled text has the same structure as the first sample text. Since the first sample text conforms to natural-language description habits, so does the filled text. This embodiment additionally inputs the third feature vector of the requirement information corresponding to the valid texts into the memory network, and labels the first sample text used to train the memory network with fourth position information, which ensures that the fourth position information determined for the segmented words in the valid texts is arranged according to the requirement information.
The memory network in this embodiment may specifically have the structure shown in FIG. 4:
The memory network of this embodiment is similar to the memory network in the embodiment of FIG. 3 above. The difference is that, to handle multiple items of requirement information, the memory network of this embodiment adds an input layer 401, which obtains, for each module, the third feature vector of each item of requirement information of the module and inputs it into the hidden layer. The recurrent neural network is not described again here; see the description of the embodiment of FIG. 2 above. After the input layer 401 used to extract the third feature vector, the neuron 406, and the historical state information 4033 corresponding to that neuron are added, the output of the input layer 401 serves as the input of the neuron 4030, so as to obtain the first position information of the valid texts corresponding to each item of requirement information of the module. In addition, adding the output of the neuron 406 to the output of the neuron 4032 makes it possible to determine the probability that the start position 404 and the end position 405 of each output segmented word belong to the different items of requirement information; based on this probability, the positions determined for the segmented words in the valid texts are guaranteed to be arranged in correspondence with the requirement information.
In addition, the input layer 402, hidden layer 403, neuron 4030, neuron 4031, neuron 4032, historical state information 4033, and the start position 404 and end position 405 of each segmented word are the same as the input layer 301, hidden layer 302, neuron 3020, neuron 3021, neuron 3022, historical state information 3023, and the start position 303 and end position 304 of each segmented word in the memory network of the embodiment of FIG. 3 of the present application. They are not described again here; see the description of the embodiment shown in FIG. 3.
For example, a first sample text that meets the specified requirement information Q11 "time of the 2008 Olympic Games", Q12 "location of the 2008 Olympic Games", and Q13 "special information about the 2008 Olympic Games", and that is labeled with fourth position information, is "The 2008 Olympic Games were held from August 8 to August 24, 2008, in 6 cities in China; this was the first time the Olympic Games were held in China". The fourth position information labeled in this first sample text includes: positions 4 and 6, where "August 8, 2008" and "August 24", which meet Q11, are located; positions 8 and 9, where "in China" and "6 cities", which meet Q12, are located; and position 12, where "the first time the Olympic Games were held in China", which meets Q13, is located. The valid texts A1 to A5 obtained above and the requirement information Q1 to Q3 of the "Body" module are input into the memory network to determine the fourth position information, in the filled text, of the valid texts of each item of requirement information. That position information is then used to obtain a filled text that has the same structure as the first sample text and reads as natural language: "The 2018 World Cup will be held from June 14 to July 15, 2018, in 12 stadiums in 11 cities in Russia; this is the first time the World Cup is held in Russia".
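Once the memory network has assigned a position to each fragment, the arrangement itself is a sort. The sketch below illustrates only that final arrangement step, not the memory network; the requirement ids, positions, English fragments, and function names are assumptions made for the example.

```python
# Minimal sketch: arrange fragments by predicted position in the filled text.
# Requirement ids, positions, and English fragments are illustrative assumptions.


def arrange_fragments(fragments):
    """Join (requirement id, position, text) fragments in ascending position order."""
    ordered = sorted(fragments, key=lambda frag: frag[1])
    return " ".join(text for _, _, text in ordered)


def requirement_order(fragments):
    """List requirement ids in the order they first appear in the arranged text."""
    seen = []
    for req_id, _, _ in sorted(fragments, key=lambda frag: frag[1]):
        if req_id not in seen:
            seen.append(req_id)
    return seen


fragments = [
    ("Q3", 5, "for the first time"),
    ("Q1", 1, "from June 14, 2018"),
    ("Q2", 4, "in 12 stadiums in 11 cities"),
    ("Q1", 2, "to July 15"),
    ("Q2", 3, "in Russia"),
]

print(arrange_fragments(fragments))
print(requirement_order(fragments))
```

`requirement_order` is a small check that fragments belonging to the same requirement end up adjacent and in the same requirement order as the sample text (time, then location, then special information).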
In specific applications, many texts to be generated with a fixed writing format contain both modules whose complete text is structured and modules whose complete text is unstructured. Structured text has a fixed expression structure, so compared with unstructured text, less information needs to be determined by a neural network, and neural networks generally consume substantial computing resources. Therefore, to reduce the use of computing resources and improve the efficiency of text generation, the text type of each module can be determined so that a different text generation method is applied to modules of different text types.
To this end, FIG. 5 shows the flow of a text generation method according to yet another embodiment of the present application. The method may include:
S501: for each module, input the requirement information of the module into a preset classification algorithm to obtain the text type of the filled text of the module; the text types include a structured type and an unstructured type. When the text type of the filled text of the module is the unstructured type, perform S502 to S505; when it is the structured type, perform S506 to S508.
The preset classification algorithm may specifically be a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network trained in advance on multiple pre-collected items of sample requirement information corresponding to structured and unstructured text. It may also simply check whether the requirement information is preset information associated with a text type. For example, for a civil complaint to be generated, the preset information for the structured type is "Particulars of the parties", "Court of filing", "Signature block", and "Description of attachments", and the preset information for the unstructured type is "Claims" and "Facts and reasons". Any classification algorithm that can determine the text type corresponding to a module based on the module's requirement information can be used in the present application; this embodiment places no limitation on it.
When a convolutional neural network is used to determine the text type of the filled text, it may have the structure shown in FIG. 6. The hidden layer of the neural network in this embodiment has two feature extraction channels. After the requirement information is input through the input layer 601, channel 602 extracts local feature variables and channel 603 extracts global feature variables, so that the extracted features reflect not only the characteristics of each segmented word in the requirement information but also the overall semantics of the segmented words. The local and global feature variables are combined to obtain, at the output layer 604, the probabilities that the requirement information belongs to each text type, and the text type of the filled text corresponding to the input requirement information is determined from these probabilities.
Structured text includes text in which the expression structure of the words or sentences is fixed, and unstructured text includes text in which the expression structure of the words or sentences is not fixed. For example, the modules in the fixed writing format of a hot news item are the "Title" module, the "Release date" module, and the "Body" module; the text of the "Title" and "Release date" modules is structured text, and the text of the "Body" module is unstructured text.
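Of the classification options listed for S501, the preset-information check is the simplest and can be sketched as a set lookup. The English module names below are illustrative renderings of the civil-complaint example, and raising an error for an unknown requirement is an assumption of this sketch; the description does not say how such a case is handled.

```python
# Minimal sketch of S501 using the preset-information check.
# Module names are illustrative English renderings of the civil-complaint example.

STRUCTURED_REQUIREMENTS = {
    "Particulars of the parties",
    "Court of filing",
    "Signature block",
    "Description of attachments",
}
UNSTRUCTURED_REQUIREMENTS = {"Claims", "Facts and reasons"}


def classify_text_type(requirement):
    """Return 'structured' or 'unstructured' for a module's requirement information."""
    if requirement in STRUCTURED_REQUIREMENTS:
        return "structured"
    if requirement in UNSTRUCTURED_REQUIREMENTS:
        return "unstructured"
    # Assumption: an unknown requirement is treated as an error, not guessed.
    raise ValueError(f"no preset text type for requirement: {requirement!r}")


print(classify_text_type("Court of filing"))
print(classify_text_type("Facts and reasons"))
```

A trained classifier (support vector machine, logistic regression, or the convolutional network of FIG. 6) could replace the lookup behind the same function signature.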
S502: for each module in the fixed writing format of the text to be generated, obtain from a preset database multiple valid texts that meet the requirement information of the module, where the requirement information indicates the text content corresponding to the module.
S503: for each module, input the multiple valid texts of the module into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text. The first recurrent neural network is trained on multiple pre-collected sample valid texts that meet specified requirement information.
S504: for each module, input the first feature vector of each valid text of the module into a pre-trained memory network to obtain the first position information, in the filled text of the module, of each segmented word in each valid text of the module. The text structure of the filled text is the same as that of the first sample text used to train the memory network; the first sample text is text that conforms to the natural-language expression structure and meets the requirement information of the module, and the memory network is trained on multiple pre-collected first sample texts.
S505: for each module, arrange the segmented words in each valid text of the module according to the obtained first position information to obtain the filled text of the module.
S502 to S505 are the same as S101 to S104 in the embodiment shown in FIG. 1 of the present application and are not described again here; see the description of that embodiment.
S506: input the multiple valid texts of the module into a pre-trained sequence labeling model to obtain second identification information for each segmented word in each valid text. The sequence labeling model is trained on multiple pre-collected second sample valid texts that are labeled in advance with second identification information and that meet the requirement information of the module.
The second identification information uniquely identifies each segmented word in a valid text. The sequence labeling model labels the input valid text with second identification information, which is later used in S507 to determine the position of each segmented word in the filled text. In this embodiment, the sequence labeling model may have the structure shown in FIG. 7. The valid text is input into the sequence labeling model as a character string through the input layer 701; the hidden layer 702 extracts features and determines the second identification information corresponding to each segmented word; and the output layer 703 labels each segmented word with its second identification information. Because the segmented words in a text are interrelated and the context of a segmented word affects its meaning, each neuron in the hidden layer of the sequence labeling model in this embodiment is an LSTM (Long Short-Term Memory) network, a kind of RNN with a special structure. Used as neurons, these networks exchange information with one another to extract features that reflect the overall semantics of the valid text, and the second identification information of each segmented word is labeled based on those features.
S507: according to the second identification information, use a preset correspondence between identifiers and segmented-word position information to determine the fifth position information, in the filled text of the module, of each segmented word in each valid text.
The preset correspondence between identifiers and segmented-word position information may be a correspondence table or a mapping (for example, key-value pairs).
S508: arrange the segmented words in each valid text according to their fifth position information to obtain the filled text.
For example, the filled text corresponding to the "Title" module in the fixed writing format of a hot news item to be generated is structured text. The valid text "The 2018 World Cup kicks off in Russia on June 14" is input into the preset sequence labeling model to obtain second identification information g1 for the segmented word "2018", g2 for the segmented word "World Cup", and g3 for the segmented word "kicks off". Using the preset correspondence between identifiers and segmented-word position information ["g1-position 1", "g2-position 2", "g3-position 3"], the fifth position information of "2018" is determined to be position 1, that of "World Cup" position 2, and that of "kicks off" position 3. Arranging the segmented words of each valid text according to their fifth position information yields the filled text "2018 World Cup kicks off".
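S507 and S508 can be sketched as a label lookup followed by a sort. The sketch assumes the sequence labeling model has already produced a label for each segmented word, and that words without a mapped label (here "on June 14" and "in Russia") are left out of the filled text, which matches the example but is not stated explicitly. The function name and the English tokens are illustrative.

```python
# Minimal sketch of S507-S508: map labels to positions, then arrange the words.
# Assumption: segmented words without a mapped label are omitted from the result.

# Preset correspondence between second identification information and position.
LABEL_TO_POSITION = {"g1": 1, "g2": 2, "g3": 3}


def fill_structured(labeled_tokens, label_to_position):
    """Arrange (token, label) pairs by the position mapped to each label."""
    placed = [
        (label_to_position[label], token)
        for token, label in labeled_tokens
        if label in label_to_position
    ]
    return " ".join(token for _, token in sorted(placed))


# Hypothetical output of the sequence labeling model, in source-text order.
labeled = [
    ("2018", "g1"),
    ("World Cup", "g2"),
    ("on June 14", None),
    ("in Russia", None),
    ("kicks off", "g3"),
]

filled_title = fill_structured(labeled, LABEL_TO_POSITION)
print(filled_title)
```

No neural network is involved after labeling, which is the source of the resource savings the structured branch is meant to provide.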
S509,按照待生成文本的固定写作格式,排列每个模块的填充文本,得到待生成文本。S509. According to the fixed writing format of the text to be generated, arrange the filled text of each module to obtain the text to be generated.
上述S509与本申请图1实施例的S105为相同的步骤,在此不再赘述,详见本申请图1实施例的描述。The above S509 is the same step as S105 in the embodiment of FIG. 1 of this application, which is not repeated here. For details, refer to the description of the embodiment of FIG. 1 in this application.
In the embodiment of FIG. 5 above, structured-type text has a fixed expression structure, so less information needs to be determined by a neural network than for unstructured-type text, and a neural network usually consumes a large amount of computing resources. Therefore, determining the text type of each module and applying a different text generation approach to modules of different text types can reduce computing resource usage and improve text generation efficiency.
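The per-type handling described above might be sketched as a simple dispatch. The names `TextType`, `fill_module`, and the two stand-in fill functions are hypothetical; the stand-ins only record which path was taken and do not implement the sequence labeling model or the neural networks.

```python
from enum import Enum


class TextType(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"


def fill_module(text_type, valid_texts, structured_fill, unstructured_fill):
    """Route a module to the fill routine that matches its text type."""
    if text_type is TextType.STRUCTURED:
        # Lighter path: sequence labeling plus a preset identifier lookup.
        return structured_fill(valid_texts)
    # Heavier path: recurrent neural network features plus a memory network.
    return unstructured_fill(valid_texts)


calls = []


def fake_structured_fill(texts):
    # Stand-in that records the path taken and returns the first valid text.
    calls.append("structured")
    return texts[0]


def fake_unstructured_fill(texts):
    # Stand-in for the neural-network path.
    calls.append("unstructured")
    return texts[0]


title = fill_module(TextType.STRUCTURED, ["2018 World Cup kicks off"],
                    fake_structured_fill, fake_unstructured_fill)
```

Under this arrangement the neural-network path runs only for modules classified as unstructured, which is where the resource saving described above would come from.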
相应于上述方法实施例,本申请一实施例还提供了文本生成装置。Corresponding to the foregoing method embodiments, an embodiment of the present application further provides a text generating device.
FIG. 8 shows the structure of a text generating device according to an embodiment of the present application. The device may include:
文本获取模块801,用于针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合该模块的需求信息的多个有效文本,需求信息用于表明该模块对应的文本内容;A text acquisition module 801, for each module in the fixed writing format of the text to be generated, obtains a plurality of valid texts from a preset database that meet the requirement information of the module, and the requirement information is used to indicate the text corresponding to the module content;
A feature extraction module 802, configured to, for each module, input the multiple valid texts of the module separately into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with multiple pre-collected sample valid texts that meet specified requirement information;
A position information determining module 803, configured to, for each module, input the first feature vector of each valid text of the module separately into a pre-trained memory network to obtain first position information, in the filled text, of each word segment of each valid text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, the first sample text is text that conforms to a natural language expression structure and meets specified requirement information, and the memory network is obtained by training with multiple pre-collected first sample texts;
文本生成模块804,用于针对所述每个模块,将该模块的每个有效文本中的各分词,按照所得到的第一位置信息排列,得到填充文本;按照待生成文本的固定写作格式,排列每个模块的所述填充文本,得到所述待生成文本。A text generating module 804, for each module, arranging the participles in each valid text of the module according to the obtained first position information to obtain filled text; according to a fixed writing format of the text to be generated, Arrange the filled text of each module to obtain the text to be generated.
In the text generating device provided by this embodiment of the present application, the memory network used for each module is trained on multiple pre-collected first sample texts, and each first sample text is a sample that conforms to a natural language content structure and meets the requirement information of the module. Therefore, the first position information, in the filled text, of each word segment of a valid text obtained with the memory network is the same as the position information of the corresponding word segment in the first sample text. On this basis, arranging the word segments of the valid text according to the first position information yields filled text whose text structure is the same as that of the first sample text, and which therefore conforms to a natural language expression structure. Consequently, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the resulting text also conforms to a natural language expression structure.
可选的,文本生成模块804,具体用于:Optionally, the text generation module 804 is specifically configured to:
针对每个模块,为该模块的每个有效文本标注该模块的第一标识信息;For each module, mark the first identification information of the module for each valid text of the module;
针对每个模块,按照预设的第一标识信息与模块位置的对应关系,确定该模块的填充文本在待生成文本中的第六位置信息,所述预设的第一标识信息与模块位置的对应关系用于表示待生成文本的固定写作格式;For each module, the sixth position information of the filled text of the module in the text to be generated is determined according to the preset correspondence between the first identification information and the module position, and the preset first identification information and the position of the module The correspondence relationship is used to represent a fixed writing format of the text to be generated;
按照第六位置信息排列每个填充文本,得到待生成文本。Each filled text is arranged according to the sixth position information to obtain the text to be generated.
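A minimal sketch of the arrangement performed by the text generation module 804 is given below, assuming that the first identification information is a module name and that the preset correspondence is a dictionary. `MODULE_POSITION`, `assemble_text`, and the module names are illustrative only.

```python
# Hypothetical correspondence between first identification information
# (here, module names) and module positions; it encodes the writing format.
MODULE_POSITION = {"title": 1, "lead": 2, "body": 3}


def assemble_text(filled_by_module, module_position=MODULE_POSITION, sep="\n"):
    """Order each module's filled text by its sixth position information."""
    ordered = sorted(
        filled_by_module.items(),
        key=lambda item: module_position[item[0]],
    )
    return sep.join(text for _, text in ordered)


# Filled texts arrive in arbitrary order and are labeled with their module.
filled = {
    "body": "The opening match is played in Moscow.",
    "title": "2018 World Cup kicks off",
    "lead": "The tournament starts on June 14 in Russia.",
}
article = assemble_text(filled)
```

Because every filled text carries its module's identifier, the order in which modules are processed does not affect the final layout.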
可选的,文本获取模块801,具体用于:Optionally, the text acquisition module 801 is specifically used for:
针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合所述待生成文本所描述事件的多个完整文本,作为该模块的备用文本;For each module in the fixed writing format of the text to be generated, obtaining a plurality of complete texts from the preset database that conform to the events described by the text to be generated, as the backup text of the module;
Correspondingly, the feature extraction module 802 is further configured to, for each module, input each backup text of the module separately into a pre-trained second recurrent neural network to obtain a second feature vector of each backup text, where the second recurrent neural network is obtained by training with multiple pre-collected sample backup texts; and to input the requirement information of the module into a pre-trained third recurrent neural network to obtain a third feature vector of the requirement information as the feature vector of the module, where the third recurrent neural network is obtained by training with multiple pre-collected pieces of sample requirement information of the module;
相应的,位置信息确定模块803,还用于针对每个模块,分别将该模块的每个备用文本对应的向量信息输入预先训练得到的第四循环神经网络,得到该模块的每个备用文本中,符合该模块的需求信息的文本的第二位置信息;其中,该模块的任一备用文本对应的向量信息包括:该备用文本的第二特征向量和该模块的特征向量;第四循环神经网络为以多个预先收集的标注了第三位置信息、且描述指定需求信息对应的同一事件的样本完整文本进行训练得到的,第三位置信息为符合该模块的需求信息的文本在所述样本完整文本中的位置信息;Correspondingly, the position information determining module 803 is further configured for each module to input the vector information corresponding to each backup text of the module into the fourth recurrent neural network obtained in advance, and obtain each backup text of the module. , The second position information of the text that meets the requirement information of the module; wherein the vector information corresponding to any backup text of the module includes: the second feature vector of the backup text and the feature vector of the module; the fourth recurrent neural network It is obtained by training with a plurality of pre-collected samples of the complete text marked with the third position information and describing the same event corresponding to the specified demand information. The third position information is the text that meets the demand information of the module. Position information in the text;
相应的,文本获取模块801,具体用于针对每个模块,分别从该模块的每个备用文本中,抽取相应的第二位置信息处的文本,作为符合该模块的需求信息的有效文本。Correspondingly, the text obtaining module 801 is specifically used for each module to extract the text at the corresponding second position information from each standby text of the module, as valid text that meets the requirements of the module.
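The extraction step can be illustrated as follows, under the assumption that the second position information is a pair of character offsets (start, end) into the backup text. The application does not fix the representation of position information, and the offsets below are computed by hand instead of being predicted by the fourth recurrent neural network.

```python
def extract_valid_texts(backup_texts, second_positions):
    """Cut the valid text out of each backup text at its (start, end) span."""
    valid_texts = []
    for text, (start, end) in zip(backup_texts, second_positions):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span {(start, end)} is outside the backup text")
        valid_texts.append(text[start:end])
    return valid_texts


backup = "The 2018 World Cup kicks off in Russia on June 14."
target = "2018 World Cup kicks off"
start = backup.index(target)  # stand-in for the predicted position
spans = [(start, start + len(target))]
extracted = extract_valid_texts([backup], spans)
print(extracted)  # prints ['2018 World Cup kicks off']
```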
Optionally, each module in the fixed writing format of the text to be generated has multiple pieces of requirement information:
相应的,文本获取模块801,具体用于:Correspondingly, the text acquisition module 801 is specifically used for:
针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合该模块的每个需求信息的多个有效文本;For each module in the fixed writing format of the text to be generated, obtain a plurality of valid texts from a preset database that meets each requirement information of the module;
相应的,特征提取模块802,还用于:Correspondingly, the feature extraction module 802 is further configured to:
针对每个模块,分别将该模块的每个需求信息的多个有效文本输入预先训练得到的第一循环神经网络,得到每个有效文本的第一特征向量;For each module, input multiple valid texts of each requirement information of the module into the first recurrent neural network trained in advance to obtain the first feature vector of each valid text;
针对每个模块,分别将该模块的每个需求信息输入预先训练得到的第三循环神经网络,得到该模块的每个需求信息的第三特征向量,第三循环神经网络为以多个预先收集的该模块的样本需求信息进行训练得到的;For each module, each requirement information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of each requirement information of the module is obtained. The third recurrent neural network is a plurality of previously collected Obtained by training the sample requirement information of the module;
相应的,位置信息确定模块803,具体用于:Correspondingly, the location information determining module 803 is specifically configured to:
For each module, input the vector information corresponding to each piece of requirement information of the module separately into the pre-trained memory network to obtain the first position information corresponding to the valid texts of each piece of requirement information of the module; where the first position information is the position information, in the filled text of the module, of each word segment of the valid text; the vector information corresponding to any piece of requirement information of the module includes the first feature vector of each valid text corresponding to that requirement information and the third feature vector of that requirement information; the text structure of the filled text is the same as the text structure of the first sample text, labeled with fourth position information, that is used in training the memory network; and the fourth position information is the position information, in the first sample text, of each text that meets the requirement information.
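One possible way to form the vector information for a piece of requirement information is sketched below. Concatenation is an assumption of this sketch: the description only states that the vector information includes the first feature vectors and the third feature vector, not how they are combined. The function name and the toy numbers are hypothetical.

```python
def build_vector_information(first_feature_vectors, third_feature_vector):
    """Pair each valid text's feature vector with the requirement's vector.

    Returns one combined vector per valid text, formed by appending the
    requirement's third feature vector to the text's first feature vector.
    """
    return [
        list(first) + list(third_feature_vector)
        for first in first_feature_vectors
    ]


# Toy vectors standing in for recurrent-neural-network outputs.
first_vectors = [[0.1, 0.2], [0.3, 0.4]]  # one per valid text
third_vector = [0.9]                      # for this requirement information
vector_information = build_vector_information(first_vectors, third_vector)
```

Keeping the requirement's vector alongside every valid text's vector lets a downstream network tell apart valid texts that answer different requirements of the same module.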
FIG. 9 shows the structure of a text generating device according to another embodiment of the present application. The device may include:
文本分类模块901,用于针对所述每个模块,将该模块的需求信息输入预设分类算法,得到该模块的填充文本的文本类型,所述文本类型包括结构化类型和非结构化类型;A text classification module 901 is configured to input the requirement information of the module into a preset classification algorithm for each module, and obtain a text type of the module's filled text, where the text type includes a structured type and an unstructured type;
文本获取模块902,用于针对待生成文本的固定写作格式中的每个模块,当该模块的填充文本的文本类型为非结构化类型时,从预设资料库中获取符合该模块的需求信息的多个有效文本;A text acquisition module 902 is used for each module in the fixed writing format of the text to be generated. When the text type of the filled text of the module is an unstructured type, the information corresponding to the module's requirements is obtained from a preset database. Multiple valid texts;
特征提取模块903,用于针对每个模块,当该模块的填充文本的文本类型为非结构化类型时,将该模块的多个有效文本分别输入预先训练得到的第一循环神经网络,得到每个有效文本的第一特征向量;A feature extraction module 903 is configured for each module. When the text type of the filled text of the module is an unstructured type, multiple valid texts of the module are input into a first recurrent neural network trained in advance to obtain each First feature vector of valid text;
位置信息确定模块904,用于针对每个模块,当该模块的填充文本的文本类型为非结构化类型时,将每个有效文本的第一特征向量分别输入预先训练得到的记忆网络,得到每个有效文本中的各分词在该模块的填充文本中的第一位置信息;The position information determining module 904 is configured for each module. When the text type of the filled text of the module is an unstructured type, the first feature vector of each valid text is separately input into a memory network obtained in advance to obtain each The first position information of each participle in each valid text in the filled text of the module;
The text acquisition module 902 is further configured to, for each module in the fixed writing format of the text to be generated, when the text type of the filled text of the module is the structured type, input the plurality of valid texts of the module into a pre-trained sequence labeling model to obtain second identification information of each word segment of each valid text, where the sequence labeling model is obtained by training with multiple pre-collected second sample valid texts that are labeled in advance with the second identification information and meet the requirement information of the module;
The position information determining module 904 is further configured to determine, according to the second identification information and using the preset correspondence between identifiers and word-segment position information, fifth position information, in the filled text of the module, of each word segment of each valid text;
文本生成模块905,还用于按照每个有效文本中的各分词的所述第五位置信息,排列所述每个有效文本中的各分词,得到该模块的所述填充文本;按照待生成文本的固定写作格式,排列每个模块的所述填充文本,得到待生成文本。The text generating module 905 is further configured to arrange the participles in each valid text according to the fifth position information of each participle in each valid text to obtain the filled text of the module; according to the text to be generated Fixed writing format, arrange the filled text of each module to get the text to be generated.
相应于上述实施例,本申请实施例还提供了一种计算机设备,如图10所示,可以包括:Corresponding to the foregoing embodiments, an embodiment of the present application further provides a computer device, as shown in FIG. 10, which may include:
A processor 1001, a communication interface 1002, a memory 1003, and a communication bus 1004, where the processor 1001, the communication interface 1002, and the memory 1003 communicate with one another through the communication bus 1004;
存储器1003,用于存放计算机程序;A memory 1003, configured to store a computer program;
处理器1001,用于执行上述存储器1003上所存放的计算机程序时,实现上述任一实施例中文本生成方法的步骤。The processor 1001 is configured to implement the steps of the text generating method in any one of the foregoing embodiments when the computer program stored in the memory 1003 is executed.
In the computer device provided by this embodiment of the present application, the memory network used for each module is trained on multiple pre-collected first sample texts, and each first sample text is a sample that conforms to a natural language content structure and meets the requirement information of the module. Therefore, the first position information, in the filled text, of each word segment of a valid text obtained with the memory network is the same as the position information of the corresponding word segment in the first sample text. On this basis, arranging the word segments of the valid text according to the first position information yields filled text whose text structure is the same as that of the first sample text, and which therefore conforms to a natural language expression structure. This ensures that, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the resulting text conforms to a natural language expression structure.
上述存储器可以包括RAM(Random Access Memory,随机存取存储器),也可以包括NVM(Non-Volatile Memory,非易失性存储器),例如至少一个磁盘存储器。可选的,存储器还可以是至少一个位于远离于上述处理器的存储装置。The foregoing memory may include RAM (Random Access Memory, Random Access Memory), and may also include NVM (Non-Volatile Memory, non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one storage device located far from the processor.
上述处理器可以是通用处理器,包括CPU(Central Processing Unit,中央处理器)、NP(Network Processor,网络处理器)等;还可以是DSP(Digital Signal Processor,数字信号处理器)、ASIC(Application Specific Integrated Circuit,专用集成电路)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。The above processor may be a general-purpose processor, including a CPU (Central Processing Unit), a NP (Network Processor), etc .; it may also be a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit (ASIC), FPGA (Field-Programmable Gate Array), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
An embodiment of the present application provides a computer-readable storage medium in which a computer program is stored. When the computer program is executed by a processor, the steps of the text generation method in any of the foregoing embodiments are implemented.
With the computer-readable storage medium provided by this embodiment of the present application, when the computer program is executed by a processor, the memory network used for each module is one trained on multiple pre-collected first sample texts, and each first sample text is a sample that conforms to a natural language content structure and meets the requirement information of the module. Therefore, the first position information, in the filled text, of each word segment of a valid text obtained with the memory network is the same as the position information of the corresponding word segment in the first sample text. On this basis, arranging the word segments of the valid text according to the first position information yields filled text whose text structure is the same as that of the first sample text, and which therefore conforms to a natural language expression structure. This ensures that, when the filled text of each module is arranged according to the fixed writing format of the text to be generated, the resulting text conforms to a natural language expression structure.
在本申请的又一实施例中,还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述任一实施例中文本生成方法。In still another embodiment of the present application, a computer program product containing instructions is also provided. When the computer program product is run on a computer, the computer is caused to execute the text generating method in any of the foregoing embodiments.
在本申请的又一实施例中,还提供了一种应用程序,当所述应用程序运行时,可以执行上述任一实施例中文本生成方法。In another embodiment of the present application, an application program is also provided, and when the application program is running, the text generating method in any of the foregoing embodiments may be executed.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or DSL (Digital Subscriber Line)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD (Digital Versatile Disc)), or a semiconductor medium (for example, an SSD (Solid State Disk)).
在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。In this article, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such relationship between these entities or operations. Actual relationship or order. Moreover, the terms "including", "comprising", or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also those that are not explicitly listed Or other elements inherent to such a process, method, article, or device. Without more restrictions, the elements defined by the sentence "including a ..." do not exclude the existence of other identical elements in the process, method, article, or equipment including the elements.
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置和计算机设备实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a related manner, and the same or similar parts between the various embodiments can be referred to each other. Each embodiment focuses on the differences from other embodiments. In particular, the embodiments of the apparatus and computer equipment are basically similar to the method embodiments, so the description is relatively simple. For the related parts, refer to the description of the method embodiments.
以上所述仅为本申请的较佳实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本申请的保护范围内。The above descriptions are merely preferred embodiments of the present application, and are not intended to limit the protection scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and principle of this application are included in the protection scope of this application.
以上所述仅为本申请的较佳实施例而已,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请保护的范围之内。The above are only preferred embodiments of this application, and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall be included in this application Within the scope of protection.

Claims (15)

  1. 一种文本生成方法,其特征在于,所述方法包括:A text generation method, characterized in that the method includes:
    针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合该模块的需求信息的多个有效文本,所述需求信息用于表明该模块对应的文本内容;For each module in the fixed writing format of the text to be generated, obtain a plurality of valid texts from a preset database that meet the requirement information of the module, and the requirement information is used to indicate the text content corresponding to the module;
    针对所述每个模块,将该模块的所述多个有效文本分别输入预先训练得到的第一循环神经网络,得到该模块的每个有效文本的第一特征向量,所述第一循环神经网络为以多个预先收集的符合指定需求信息的样本有效文本进行训练得到的;For each module, input the plurality of valid texts of the module into a first recurrent neural network trained in advance to obtain a first feature vector of each valid text of the module, the first recurrent neural network It is obtained by training with multiple pre-collected sample valid texts that meet the specified requirements information;
    For each module, inputting the first feature vector of each valid text of the module separately into a pre-trained memory network to obtain first position information, in the filled text of the module, of each word segment of each valid text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, the first sample text is text that conforms to a natural language expression structure and meets specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
    针对所述每个模块,将该模块的每个有效文本中的各分词,按照所得到的第一位置信息排列,得到该模块的所述填充文本;For each module, arrange the participles in each valid text of the module according to the obtained first position information to obtain the filled text of the module;
    按照所述待生成文本的固定写作格式,排列每个模块的所述填充文本,得到所述待生成文本。According to the fixed writing format of the text to be generated, the filled text of each module is arranged to obtain the text to be generated.
  2. 根据权利要求1所述的方法,其特征在于,所述针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合该模块的需求信息的多个有效文本,包括:The method according to claim 1, wherein, for each module in the fixed writing format for the text to be generated, obtaining a plurality of valid texts from a preset database that meets the module's requirement information includes:
    针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合所述待生成文本所描述事件的多个完整文本,作为该模块的备用文本;For each module in the fixed writing format of the text to be generated, obtaining a plurality of complete texts from the preset database that conform to the events described by the text to be generated, as the backup text of the module;
    针对所述每个模块,将该模块的各备用文本分别输入预先训练得到的第二循环神经网络,得到每个备用文本的第二特征向量,所述第二循环神经网络为以多个预先收集的样本备用文本进行训练得到的;For each module, each backup text of the module is input into a second recurrent neural network trained in advance to obtain a second feature vector of each backup text. The second recurrent neural network is Obtained by training the sample backup text;
    针对所述每个模块,将该模块的需求信息输入预先训练得到的第三循环神经网络,得到所述需求信息的第三特征向量,作为该模块的特征向量,所 述第三循环神经网络为以多个预先收集的该模块的样本需求信息进行训练得到的;For each of the modules, the demand information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of the demand information is obtained as a feature vector of the module. The third recurrent neural network is Trained with multiple pre-collected sample requirement information of this module;
    针对所述每个模块,分别将该模块的每个备用文本对应的向量信息输入预先训练得到的第四循环神经网络,得到该模块的每个备用文本中,符合该模块的需求信息的文本的第二位置信息;其中,该模块的任一备用文本对应的向量信息包括:该备用文本的第二特征向量和该模块的特征向量;所述第四循环神经网络为以多个预先收集的标注了第三位置信息、且描述指定需求信息对应的同一事件的样本完整文本进行训练得到的,所述第三位置信息为符合该模块的需求信息的文本在所述样本完整文本中的位置信息;For each module, input the vector information corresponding to each backup text of the module into the fourth recurrent neural network trained in advance, and obtain the text of each backup text of the module that meets the information requirements of the module. Second position information; wherein the vector information corresponding to any backup text of the module includes: the second feature vector of the backup text and the feature vector of the module; the fourth recurrent neural network is a plurality of pre-collected labels The third position information is obtained by training the complete text of the sample describing the same event corresponding to the specified demand information, where the third position information is the position information of the text that meets the demand information of the module in the complete text of the sample;
    针对所述每个模块,分别从该模块的每个备用文本中,抽取相应的所述第二位置信息处的文本,作为符合该模块的需求信息的有效文本。For each module, the corresponding text at the second position information is extracted from each standby text of the module as the valid text that meets the module's requirement information.
  3. 根据权利要求1所述的方法,其特征在于,所述该模块的需求信息为多个:The method according to claim 1, wherein the module has a plurality of requirements information:
    所述针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合该模块的需求信息的有效文本,包括:For each module in the fixed writing format of the text to be generated, obtaining a valid text from a preset database that meets the information requirements of the module includes:
    针对待生成文本的固定写作格式中的每个模块,从预设资料库中获取符合该模块的每个需求信息的多个有效文本;For each module in the fixed writing format of the text to be generated, obtain a plurality of valid texts from a preset database that meets each requirement information of the module;
    所述针对所述每个模块,将该模块的所述多个有效文本分别输入预先训练得到的第一循环神经网络,得到该模块的每个有效文本的第一特征向量,包括:For each module, inputting the plurality of valid texts of the module into a first recurrent neural network trained in advance to obtain a first feature vector of each valid text of the module includes:
    针对所述每个模块,分别将该模块的每个需求信息的多个有效文本输入预先训练得到的第一循环神经网络,得到该模块的每个有效文本的第一特征向量;For each of the modules, input a plurality of valid texts of each requirement information of the module into a first recurrent neural network trained in advance to obtain a first feature vector of each valid text of the module;
    Before the step of, for each module, inputting the first feature vector of each valid text of the module separately into the pre-trained memory network to obtain the first position information, in the filled text of the module, of each word segment of each valid text, the method further includes:
    针对所述每个模块,分别将该模块的每个需求信息输入预先训练得到的第三循环神经网络,得到该模块的每个需求信息的第三特征向量,所述第三循环神经网络为以多个预先收集的该模块的样本需求信息进行训练得到的;For each module, each requirement information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of each requirement information of the module is obtained. The third recurrent neural network is Multiple pre-collected sample requirement information of the module is obtained through training;
    The step of, for each module, inputting the first feature vector of each valid text of the module separately into the pre-trained memory network to obtain the first position information, in the filled text, of each word segment of each valid text of the module includes:
    针对所述每个模块,分别将该模块的每个需求信息对应的向量信息输入预先训练得到的记忆网络,得到该模块的每个需求信息对应的有效文本,所对应的第一位置信息;其中,所述第一位置信息为所述有效文本中的各分词,在该模块的填充文本中的位置信息;该模块的任一需求信息对应的向量信息包括:该需求信息对应的每个有效文本的第一特征向量,以及该需求信息的第三特征向量;所述填充文本的文本结构与所述记忆网络训练时所利用的标注了第四位置信息的第一样本文本的文本结构相同,所述第四位置信息为符合该需求信息的每个文本在所述第一样本文本中的位置信息。For each module, input the vector information corresponding to each requirement information of the module into the pre-trained memory network to obtain the valid text corresponding to each requirement information of the module and the corresponding first position information; , The first position information is each participle in the valid text, and the position information in the filled text of the module; the vector information corresponding to any requirement information of the module includes: each valid text corresponding to the requirement information The first feature vector and the third feature vector of the demand information; the text structure of the filled text is the same as the text structure of the first sample text marked with the fourth position information used in the training of the memory network, The fourth position information is position information of each text that meets the requirement information in the first sample text.
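The "vector information" described in this claim pairs, for each piece of requirement information, the first feature vectors of its valid texts with the third feature vector of the requirement itself. A minimal sketch of that grouping follows. The function name `build_vector_info`, the dictionary keys, and the use of plain lists of floats as vectors are illustrative assumptions and do not come from the application, which does not specify a data layout.

```python
from typing import Dict, List

Vector = List[float]


def build_vector_info(
    first_vectors: Dict[str, List[Vector]],
    third_vectors: Dict[str, Vector],
) -> Dict[str, Dict[str, object]]:
    """Group, per requirement, the inputs that would be handed to the memory network.

    first_vectors maps a requirement name to the first feature vectors of its
    valid texts; third_vectors maps a requirement name to its third feature
    vector. Both layouts are hypothetical.
    """
    vector_info: Dict[str, Dict[str, object]] = {}
    for requirement, third_vector in third_vectors.items():
        vector_info[requirement] = {
            # Copy the vectors so later changes to the inputs do not leak in.
            "first_feature_vectors": [
                list(v) for v in first_vectors.get(requirement, [])
            ],
            "third_feature_vector": list(third_vector),
        }
    return vector_info
```

Keying the result by requirement keeps each requirement's valid texts together, so a downstream position predictor can be called once per requirement, as the claim describes.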
  4. The method according to claim 1, wherein before inputting, for each module, the plurality of valid texts of the module separately into the pre-trained first recurrent neural network to obtain the first feature vector of each valid text, the method further comprises:
    for each module, inputting the requirement information of the module into a preset classification algorithm to obtain the text type of the filled text of the module, the text type comprising a structured type and an unstructured type;
    for each module, when the text type of the filled text of the module is the unstructured type, performing the step of inputting the plurality of valid texts of the module into the pre-trained first recurrent neural network to obtain the first feature vector of each valid text.
  5. The method according to claim 4, wherein after inputting, for each module, the requirement information of the module into the preset classification algorithm to obtain the text type of the filled text of the module, the method further comprises:
    for each module, when the text type of the filled text of the module is the structured type, performing the following steps:
    inputting the plurality of valid texts of the module into a pre-trained sequence labeling model to obtain second identification information of each word segment in each valid text, the sequence labeling model being obtained by training with a plurality of pre-collected second sample valid texts that are annotated in advance with the second identification information and meet the requirement information of the module;
    determining, according to the second identification information and a preset correspondence between identifications and word-segment position information, fifth position information of each word segment of each valid text in the filled text of the module;
    arranging the word segments of each valid text according to their fifth position information to obtain the filled text of the module.
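The last two steps of the structured-type branch (look up a position for each labelled word segment, then arrange the segments by position) can be sketched as follows. The sequence labeling model is not implemented here; its output is assumed to arrive as `(segment, tag)` pairs. The tag names, the table `TAG_TO_POSITION`, and the rule of dropping segments whose tag has no entry are illustrative assumptions, not details taken from the application.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical preset correspondence between identification (tag) and position.
TAG_TO_POSITION: Dict[str, int] = {"TIME": 0, "PLACE": 1, "PERSON": 2}


def arrange_by_tags(
    tagged_segments: Sequence[Tuple[str, str]],
    tag_to_position: Dict[str, int],
) -> List[str]:
    """Order labelled word segments by the position assigned to their tag."""
    positioned = []
    for segment, tag in tagged_segments:
        # Assumption: segments whose tag is not in the table are not placed.
        if tag in tag_to_position:
            positioned.append((tag_to_position[tag], segment))
    # sorted() is stable, so segments sharing a position keep their input order.
    return [segment for _, segment in sorted(positioned, key=lambda p: p[0])]


tagged = [
    ("Zhang San", "PERSON"),
    ("2018-07-27", "TIME"),
    ("Hangzhou", "PLACE"),
    ("reportedly", "O"),
]
print(arrange_by_tags(tagged, TAG_TO_POSITION))
# prints ['2018-07-27', 'Hangzhou', 'Zhang San']
```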
  6. The method according to claim 1, wherein after obtaining, for each module in the fixed writing format of the text to be generated, the plurality of valid texts that meet the requirement information of the module from the preset database, the method further comprises:
    for each module, annotating each valid text of the module with first identification information of the module;
    wherein arranging the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated comprises:
    for each module, determining sixth position information of the filled text of the module in the text to be generated according to a preset correspondence between first identification information and module positions, the preset correspondence representing the fixed writing format of the text to be generated;
    arranging the filled texts according to the sixth position information to obtain the text to be generated.
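The final assembly step of this claim amounts to a lookup followed by a sort: each module's identification is mapped to a position, and the filled texts are joined in that order. A minimal sketch follows, assuming the filled texts are held in a dictionary keyed by module identification and that modules are separated by newlines. The names `assemble_document`, `id_to_position`, and the example module identifiers are hypothetical.

```python
from typing import Dict


def assemble_document(
    filled_texts: Dict[str, str],
    id_to_position: Dict[str, int],
    separator: str = "\n",
) -> str:
    """Join each module's filled text in the order fixed by the writing format.

    id_to_position plays the role of the preset correspondence between first
    identification information and module position.
    """
    ordered = sorted(filled_texts.items(), key=lambda item: id_to_position[item[0]])
    return separator.join(text for _, text in ordered)


filled = {"body": "Details follow.", "title": "Incident report", "summary": "One event."}
positions = {"title": 0, "summary": 1, "body": 2}
print(assemble_document(filled, positions))
```

Because the writing format lives entirely in `id_to_position`, a different fixed format only requires a different table, not a change to the assembly logic.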
  7. A text generation apparatus, comprising:
    a text acquisition module, configured to obtain, for each module in a fixed writing format of a text to be generated, a plurality of valid texts that meet the requirement information of the module from a preset database, the requirement information indicating the text content corresponding to the module;
    a feature extraction module, configured to input, for each module, the plurality of valid texts of the module separately into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text of the module, the first recurrent neural network being obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
    a position information determination module, configured to input, for each module, the first feature vector of each valid text of the module separately into a pre-trained memory network to obtain first position information, in the filled text of the module, of each word segment in each valid text of the module, wherein the text structure of the filled text is the same as the text structure of the first sample text used to train the memory network, the first sample text is a text that conforms to a natural-language expression structure and meets the specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
    a text generation module, configured to arrange, for each module, the word segments of each valid text of the module according to the obtained first position information to obtain the filled text, and to arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
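How the four modules of this apparatus hand data to one another can be sketched with pluggable callables standing in for the trained networks. This is only an outline under stated assumptions: `acquire`, `encode`, and `predict_positions` are hypothetical stand-ins for the preset database lookup, the first recurrent neural network, and the memory network. Joining word segments with spaces and modules with newlines is also an assumption.

```python
from typing import Callable, Dict, List, Sequence

Segments = List[str]


def generate_text(
    module_order: Sequence[str],
    acquire: Callable[[str], List[Segments]],
    encode: Callable[[Segments], object],
    predict_positions: Callable[[object, Segments], List[int]],
) -> str:
    """Run acquisition, feature extraction, positioning and generation per module."""
    filled: Dict[str, str] = {}
    for module in module_order:
        valid_texts = acquire(module)                  # text acquisition
        placed = []
        for text in valid_texts:
            vector = encode(text)                      # feature extraction
            positions = predict_positions(vector, text)  # one position per segment
            placed.extend(zip(positions, text))
        placed.sort(key=lambda pair: pair[0])          # arrange segments by position
        filled[module] = " ".join(segment for _, segment in placed)
    # module_order stands in for the fixed writing format.
    return "\n".join(filled[module] for module in module_order)
```

In a real system the stubs would be replaced by trained models; the sketch only fixes the order of the calls and the shape of the data passed between them.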
  8. The apparatus according to claim 7, wherein the text acquisition module is specifically configured to:
    for each module in the fixed writing format of the text to be generated, obtain, from the preset database, a plurality of complete texts that match the event described by the text to be generated, as backup texts of the module;
    the feature extraction module is further configured to, for each module: input each backup text of the module separately into a pre-trained second recurrent neural network to obtain a second feature vector of each backup text, the second recurrent neural network being obtained by training with a plurality of pre-collected sample backup texts; and input the requirement information of the module into a pre-trained third recurrent neural network to obtain a third feature vector of the requirement information as the feature vector of the module, the third recurrent neural network being obtained by training with a plurality of pre-collected pieces of sample requirement information of the module;
    the position information determination module is further configured to, for each module, separately input the vector information corresponding to each backup text of the module into a pre-trained fourth recurrent neural network to obtain second position information of the text, within each backup text of the module, that meets the requirement information of the module; wherein the vector information corresponding to any backup text of the module comprises: the second feature vector of that backup text and the feature vector of the module; the fourth recurrent neural network is obtained by training with a plurality of pre-collected sample complete texts that are annotated with third position information and describe the same event corresponding to specified requirement information, the third position information being the position information, in the sample complete text, of the text that meets the requirement information of the module;
    the text acquisition module is specifically configured to, for each module, extract from each backup text of the module the text at the corresponding second position information, as a valid text that meets the requirement information of the module.
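The extraction step at the end of this claim takes the second position information predicted for each backup text and cuts out the matching text. A minimal sketch follows, assuming each backup text is already split into word segments and that the position information is a half-open `(start, end)` pair of segment indices. That encoding, the function name `extract_valid_texts`, and the rule of skipping out-of-range spans are assumptions for illustration; the application does not specify them.

```python
from typing import List, Sequence, Tuple


def extract_valid_texts(
    backup_texts: Sequence[Sequence[str]],
    spans: Sequence[Tuple[int, int]],
) -> List[List[str]]:
    """Cut the span at the predicted position out of each backup text."""
    valid_texts: List[List[str]] = []
    for segments, (start, end) in zip(backup_texts, spans):
        # Assumption: a span outside the text yields no valid text for it.
        if 0 <= start < end <= len(segments):
            valid_texts.append(list(segments[start:end]))
    return valid_texts


backups = [
    ["On", "Friday", "a", "fire", "broke", "out"],
    ["Police", "said", "the", "fire", "started", "at", "noon"],
]
print(extract_valid_texts(backups, [(3, 6), (3, 5)]))
# prints [['fire', 'broke', 'out'], ['fire', 'started']]
```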
  9. The apparatus according to claim 7, wherein the module has a plurality of pieces of requirement information;
    the text acquisition module is specifically configured to:
    for each module in the fixed writing format of the text to be generated, obtain, from the preset database, a plurality of valid texts that meet each piece of requirement information of the module;
    the feature extraction module is further configured to:
    for each module, separately input the plurality of valid texts for each piece of requirement information of the module into the pre-trained first recurrent neural network to obtain the first feature vector of each valid text;
    for each module, separately input each piece of requirement information of the module into a pre-trained third recurrent neural network to obtain a third feature vector of that piece of requirement information of the module, the third recurrent neural network being obtained by training with a plurality of pre-collected pieces of sample requirement information of the module;
    the position information determination module is specifically configured to:
    for each module, separately input the vector information corresponding to each piece of requirement information of the module into the pre-trained memory network to obtain the first position information corresponding to the valid texts of that piece of requirement information; wherein the first position information is the first position information of each word segment in the valid texts; the vector information corresponding to any piece of requirement information of the module comprises: the first feature vector of each valid text corresponding to that requirement information, and the third feature vector of that requirement information; the text structure of the filled text is the same as the text structure of the first sample text, annotated with fourth position information, that is used to train the memory network, the fourth position information being the position information, in the first sample text, of each text that meets that requirement information.
  10. The apparatus according to claim 7, further comprising:
    a text classification module, configured to input, for each module, the requirement information of the module into a preset classification algorithm to obtain the text type of the filled text of the module, the text type comprising a structured type and an unstructured type;
    for each module, when the text type of the filled text of the module is the unstructured type, the text acquisition module is configured to input the plurality of valid texts of the module into the pre-trained first recurrent neural network to obtain the first feature vector of each valid text.
  11. The apparatus according to claim 10, wherein the text acquisition module is further configured to:
    for each module, when the text type of the filled text of the module is the structured type, input the plurality of valid texts of the module into a pre-trained sequence labeling model to obtain second identification information of each word segment in each valid text, the sequence labeling model being obtained by training with a plurality of pre-collected second sample valid texts that are annotated in advance with the second identification information and meet the requirement information of the module;
    the position information determination module is further configured to determine, according to the second identification information and a preset correspondence between identifications and word-segment position information, fifth position information of each word segment of each valid text in the filled text of the module;
    the text generation module is further configured to arrange the word segments of each valid text according to their fifth position information to obtain the filled text of the module, and to arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
  12. The apparatus according to claim 7, wherein the text generation module is specifically configured to:
    for each module, annotate each valid text of the module with first identification information of the module;
    for each module, determine sixth position information of the filled text of the module in the text to be generated according to a preset correspondence between first identification information and module positions, the preset correspondence representing the fixed writing format of the text to be generated;
    arrange the filled texts according to the sixth position information to obtain the text to be generated.
  13. A computer device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to implement the method steps according to any one of claims 1-6.
  14. A computer-readable storage medium, wherein a computer program is stored in the storage medium, and the computer program, when executed by a processor, implements the method steps according to any one of claims 1-6.
  15. An application program, wherein the application program, when run, performs the method steps according to any one of claims 1-6.
PCT/CN2019/096894 2018-07-27 2019-07-19 Text generation method, apparatus and device WO2020020084A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810846953.8 2018-07-27
CN201810846953.8A CN110852084B (en) 2018-07-27 2018-07-27 Text generation method, device and equipment

Publications (1)

Publication Number Publication Date
WO2020020084A1 (en) 2020-01-30

Family

ID=69181212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096894 WO2020020084A1 (en) 2018-07-27 2019-07-19 Text generation method, apparatus and device

Country Status (2)

Country Link
CN (1) CN110852084B (en)
WO (1) WO2020020084A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193792A (en) * 2017-05-18 2017-09-22 北京百度网讯科技有限公司 The method and apparatus of generation article based on artificial intelligence
JP2018084627A (en) * 2016-11-22 2018-05-31 日本放送協会 Language model learning device and program thereof
CN108197294A (en) * 2018-01-22 2018-06-22 桂林电子科技大学 A kind of text automatic generation method based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199805B (en) * 2014-09-11 2017-10-20 清华大学 Text joining method and device
US9881003B2 (en) * 2015-09-23 2018-01-30 Google Llc Automatic translation of digital graphic novels
CN106919646B (en) * 2017-01-18 2020-06-09 南京云思创智信息科技有限公司 Chinese text abstract generating system and method
CN107832310A (en) * 2017-11-27 2018-03-23 首都师范大学 Structuring argument generation method and system based on seq2seq models

Also Published As

Publication number Publication date
CN110852084A (en) 2020-02-28
CN110852084B (en) 2021-04-02

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.08.2021)

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19840753

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19840753

Country of ref document: EP

Kind code of ref document: A1