WO2020020084A1 - Method, apparatus, and device for text generation - Google Patents

Method, apparatus, and device for text generation

Info

Publication number
WO2020020084A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
module
information
valid
position information
Prior art date
Application number
PCT/CN2019/096894
Other languages
English (en)
Chinese (zh)
Inventor
沈力行
陈展
Original Assignee
杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Publication of WO2020020084A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/174 Form filling; Merging

Definitions

  • The present application relates to the technical field of natural language processing, and in particular to a method, an apparatus, and a device for generating text.
  • Natural language is the language people use every day, and natural language processing technology enables natural language communication between humans and computers. It is widely used to generate texts that follow a fixed writing format, carry specified requirement information, and are expressed in natural language. For example, for each module in the fixed writing format of the text to be generated, natural language processing technology determines, from a database, valid text that meets the module's text requirements; the determined valid text is then filled directly into each module to obtain the filled text of each module, and the filled texts are arranged according to the fixed writing format to obtain the text to be generated. The filled text of a module in the fixed writing format usually includes structured text, in which the words or sentences have a fixed structure, and/or unstructured text, in which the sentences do not have a fixed structure.
  • For example, the modules in the fixed writing format of a breaking-news article are a "Title" module, a "Release Date" module, and a "Body" module.
  • The filled text of the "Title" and "Release Date" modules is structured text, while the filled text of the "Body" module is unstructured text.
  • Because valid text is filled directly into a module without considering the resulting representation structure, for a module whose filled text is unstructured, the filled text is likely to be a mechanical combination of multiple valid texts that does not conform to the natural language expression structure. As a result, the text to be generated, assembled from such filled modules, also fails to conform to the natural language expression structure.
  • For example, suppose the text requirement information of the "Body" module is "2018 World Cup".
  • The valid texts determined from the database that meet this requirement information include: "The World Cup is held in Russia for the first time", "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia", and "The competition will be held from June 14 to July 15, 2018". Because the filled text of the "Body" module is unstructured text, the valid texts are filled directly into the module, and the resulting filled text may read: "The competition will be from June 14 to July 2018. Held on the 15th, the 2018 World Cup will be held in 12 stadiums in 11 cities in Russia, and the World Cup will be held in Russia for the first time." This is a mechanical combination that does not read as natural language.
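  • The direct-filling scheme described above can be sketched as follows. This is a minimal illustration; the names (FIXED_FORMAT, fill_module, generate) are hypothetical and not from the application:

```python
# Hypothetical sketch of the direct-filling scheme described above.
# Names (FIXED_FORMAT, fill_module, generate) are illustrative only.

FIXED_FORMAT = ["title", "release_date", "body"]  # the fixed writing format

def fill_module(valid_texts):
    """Naive filling: mechanically concatenate the valid texts.

    As noted above, for unstructured modules such as "body" this can
    produce text that does not conform to the natural language
    expression structure, which is the problem the method addresses.
    """
    return " ".join(valid_texts)

def generate(valid_texts_per_module):
    filled = {m: fill_module(valid_texts_per_module[m]) for m in FIXED_FORMAT}
    # Arrange the filled texts according to the fixed writing format.
    return "\n".join(filled[m] for m in FIXED_FORMAT)

doc = generate({
    "title": ["2018 World Cup"],
    "release_date": ["June 14, 2018"],
    "body": [
        "The World Cup is held in Russia for the first time",
        "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia",
    ],
})
```

Because fill_module merely concatenates the valid texts, the body reads as a mechanical combination; the embodiments replace this step with positions predicted by a memory network.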
  • The purpose of the embodiments of the present application is to provide a method, an apparatus, and a device for generating text that conforms to the natural language expression structure.
  • Specific technical solutions are as follows:
  • an embodiment of the present application provides a text generating method, which includes:
  • For each module in the fixed writing format of the text to be generated, a plurality of valid texts that meet the module's requirement information are obtained from a preset database, where the requirement information indicates the text content corresponding to the module;
  • For each module, the module's valid texts are input into a first recurrent neural network trained in advance to obtain the first feature vector of each valid text of the module; the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
  • For each module, the first feature vector of each valid text is input into a pre-trained memory network to obtain the first position information of each participle of each valid text in the filled text of the module; the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, where the first sample text is text that conforms to the natural language expression structure and meets the specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
  • The filled text of each module is arranged according to the fixed writing format to obtain the text to be generated.
  • an embodiment of the present application provides a text generating device, where the device includes:
  • a text acquisition module, configured to, for each module in the fixed writing format of the text to be generated, obtain from a preset database a plurality of valid texts that meet the module's requirement information, where the requirement information indicates the text content corresponding to the module;
  • a feature extraction module, configured to, for each module, input the module's valid texts into a first recurrent neural network trained in advance to obtain the first feature vector of each valid text of the module, where the first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information;
  • a position information determining module, configured to, for each module, input the first feature vector of each valid text of the module into a pre-trained memory network to obtain the first position information of each participle of each valid text in the filled text of the module, where the text structure of the filled text is the same as the text structure of the first sample text used in training the memory network, the first sample text is text that conforms to the natural language expression structure and meets specified requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts;
  • a text generating module, configured to, for each module, arrange the participles of each valid text according to the obtained first position information to obtain the filled text of the module, and to arrange the filled texts of the modules according to the fixed writing format of the text to be generated to obtain the text to be generated.
  • an embodiment of the present application provides a computer device, where the device includes:
  • In another aspect, an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed, the steps of the text generation method provided in the first aspect are implemented.
  • In the text generation method, apparatus, and device provided in the embodiments of the present application, the memory network is trained with a plurality of pre-collected first sample texts, and each first sample text conforms to the natural language expression structure. The participles of each valid text are arranged according to the first position information, so the text structure of the resulting filled text is the same as that of the first sample text and likewise conforms to the natural language expression structure. Therefore, when the filled texts of the modules are arranged according to the fixed writing format of the text to be generated, the resulting text to be generated conforms to the natural language expression structure.
  • FIG. 1 is a schematic flowchart of a text generation method according to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a recurrent neural network in a text generating method according to an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of a memory network in a text generating method according to an embodiment of the present application
  • FIG. 4 is a schematic structural diagram of a memory network in a text generating method according to another embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a text generation method according to another embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a convolutional neural network in a text generation method according to another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a sequence labeling model in a text generation method according to another embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a text generating device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a text generating apparatus according to another embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • a text generation method according to an embodiment of the present application is first introduced below.
  • the text generation method provided in the embodiment of the present application can be applied to a computer device capable of text generation.
  • Such devices include desktop computers, portable computers, Internet televisions, smart mobile terminals, wearable smart terminals, servers, and the like, without limitation. Any computer equipment that can implement the embodiments of the present application falls within the protection scope of the embodiments of the present application.
  • the flow of a text generation method may include:
  • the text of each module is used to describe the same event.
  • For example, the "Title" module and "Body" module of a press release covering the start of the 2018 World Cup both describe that event. Because the requirement information of each module indicates the module's corresponding text content, it also indicates the event described by the text to be generated to which the module belongs.
  • Specifically, keyword matching may be used to obtain, from the preset database, text containing the module's requirement information. Alternatively, the requirement information may be used as the text to be answered; the position of the answer matching it is obtained from the preset database using a reading comprehension technique, and the text at that position is used as the valid text.
  • Any method for obtaining valid text can be used in this application, and this embodiment does not limit this.
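  • As a rough illustration of the keyword-matching option, the following sketch keeps any database text containing a keyword of the requirement information. The database contents and function name are hypothetical; the application does not prescribe this exact logic:

```python
# Rough illustration of the keyword-matching option for obtaining valid
# texts from a preset database. Hypothetical contents and names.

DATABASE = [
    "The World Cup is held in Russia for the first time",
    "The 2018 World Cup is held in 12 stadiums in 11 cities in Russia",
    "Ticket prices rose sharply this year",
]

def retrieve_valid_texts(requirement_info, database):
    # Keep any text containing at least one keyword of the requirement info.
    keywords = [k.lower() for k in requirement_info.split()]
    return [t for t in database if any(k in t.lower() for k in keywords)]

hits = retrieve_valid_texts("World Cup", DATABASE)
```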
  • S102: For each module, input the module's valid texts into a first recurrent neural network trained in advance to obtain the first feature vector of each valid text of the module.
  • The first recurrent neural network is obtained by training with a plurality of pre-collected sample valid texts that meet specified requirement information.
  • The events described by the sample valid texts that meet the specified requirement information share characteristics with the events described by the module's requirement information; the specified requirement information may be the same as or similar to the module's requirement information.
  • For example, if the requirement information is "Spring Game", the specified requirement information may be "Spring Game", "Winter Game", "Indoor Game", and so on.
  • the RNN may have a structure as shown in FIG. 2.
  • The current input of a neuron 202 in the hidden layer may include the output 2010 of the input layer 201 and the output 2020 of neuron 202 at the previous moment. This enables the recurrent neural network to remember the output of the previous moment and use it when determining the output of the current moment, finally producing the feature vector output by the output layer 203. Therefore, for text in which the participles are not isolated from one another, the current participle and the previous participles can be used to predict the next participle.
  • In other words, a recurrent neural network can exploit the relationships between the participles in a text when extracting the feature vector of a valid text: because the network remembers and uses the output of the previous moment when determining the output of the current moment, the extracted feature vector reflects both the characteristics of each participle and the characteristics of the relationships between participles.
  • The first recurrent neural network in S102, obtained by training with a plurality of pre-collected sample valid texts that meet the specified requirement information, establishes a mapping between valid texts and feature vectors. This ensures that the obtained first feature vector reflects the semantic features of the valid text as a whole, not merely the features of its individual participles. For example, if the current participle is "hit" and the previous participle is "driving", the next participle is likely to be "hurt".
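  • The recurrence property described above, that the current feature depends on the remembered previous state, can be shown with a toy update rule. This is an illustration of recurrence only, not the application's trained first recurrent neural network, and the weights are arbitrary:

```python
import math

# Toy recurrent update (pure Python): the hidden state carries the
# previous step's output, so a token's feature depends on the tokens
# before it. Illustrative weights, not a trained network.

def rnn_features(token_values, w_in=0.5, w_rec=0.8):
    h = 0.0  # hidden state, initially empty history
    features = []
    for x in token_values:
        # Current output mixes the current input with the remembered state.
        h = math.tanh(w_in * x + w_rec * h)
        features.append(h)
    return features

# The same token value yields different features in different contexts:
a = rnn_features([1.0, 1.0])
b = rnn_features([0.0, 1.0])
```

Here a[1] and b[1] differ even though both steps receive the input 1.0, because the remembered history differs.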
  • The recurrent neural networks in the other embodiments of the present application are similar to the first recurrent neural network in S102; the difference is that, to extract feature vectors from different input texts, the sample sets used to train the different recurrent neural networks differ.
  • S103: For each module, input the first feature vector of each valid text of the module into a pre-trained memory network to obtain the first position information of each participle of each valid text in the filled text of the module. The text structure of the filled text is the same as the text structure of the first sample text used in training the memory network; the first sample text is text that conforms to the natural language expression structure and meets the module's requirement information, and the memory network is obtained by training with a plurality of pre-collected first sample texts.
  • Arranging the modules concerns only the fixed format of the text to be generated; it does not address the structure of the text within each module. Guaranteeing only the fixed format therefore leaves the text itself liable not to conform to the natural language expression structure. To make the text to be generated conform to the natural language expression structure, the filled text of each module must itself conform to that structure.
  • To that end, a plurality of pre-collected first sample texts can be used to train the memory network; the first sample texts are samples that conform to the natural language expression structure and meet the module's requirement information. Consequently, the first position information, produced by the memory network, of each participle of a valid text in the filled text matches the position information of the corresponding participle in the first sample text. This guarantees that, in the subsequent step S104, arranging the participles of each valid text according to the first position information yields filled text in which each participle occupies the same position as in the first sample text, so the filled text conforms to the natural language expression structure.
  • the memory network in this embodiment may specifically have a structure as shown in FIG. 3:
  • The input layer 301 is a first recurrent neural network with the same structure as the recurrent neural network of the embodiment in FIG. 2 of the present application; it obtains the first feature vector and inputs it to the hidden layer. Details are not repeated here; see the description of the embodiment shown in FIG. 2 above.
  • The hidden layer 302 may specifically include a neuron 3020, a neuron 3021, and a neuron 3022, and may adopt the structure of a recurrent neural network.
  • Because the position of each participle is related to the characteristics of the entire text, the historical state information 3023 of each neuron must also be saved and fed back as input to each neuron. For example, the input of neuron 3021 may include the output of neuron 3020 and the state information 3023. Features associated with the historical state information can therefore be extracted from the input according to the historical state information stored in the memory network.
  • A plurality of pre-collected first sample texts are used to train the memory network, so the memory network stores historical state information of first sample texts that conform to the natural language expression structure and meet the module's requirement information. When the memory network is then used to determine each participle in a valid text, the start position 303 and end position 304 of each participle, among the features indicating conformity with the natural language expression structure, can be determined from the input according to the historical state information saved by each neuron.
  • Word segmentation is used here as an example for describing valid text; valid text is not limited to participles and may include sentences and paragraphs.
  • For example, suppose the first sample text for the requirement information "Spring Game" is "Xiao Hong kicks a shuttlecock", in which the participle "Xiao Hong" is in the first position, "kicks" is in the second position, and "shuttlecock" is in the third position. For an input feature vector, the memory network trained with this first sample text determines the position of the participle "Xiao Ming", which has the same characteristics as "Xiao Hong", as the first position; the position of the participle "flies", which has the same characteristics as "kicks", as the second position; and the position of the participle "kite", which has the same characteristics as "shuttlecock", as the third position.
  • In step S104, the participles of the valid text are arranged according to the first position information, so the text structure of the resulting filled text is the same as that of the first sample text and conforms to the natural language expression structure. For example, based on the first position information of the participles "Xiao Ming", "flies", and "kite" corresponding to the first feature vector, the filled text "Xiao Ming flies kite" is obtained. Determining the first position information thus makes it possible to generate filled text that conforms to the natural language expression structure, avoiding non-natural expressions, such as "Kite flies Xiao Ming", that can result from filling valid text directly into the module.
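  • The arrangement in step S104 can be sketched as a sort by position. Here the first position information is supplied directly; in the method it comes from the memory network, and the function name is illustrative:

```python
# Sketch of step S104: each participle of a valid text is placed at the
# slot given by its first position information, yielding filled text whose
# structure matches the first sample text. Positions are supplied directly
# here; in the method they come from the memory network.

def arrange(participles_with_positions):
    # Sort by the first position information and join the participles.
    ordered = sorted(participles_with_positions, key=lambda pair: pair[1])
    return " ".join(word for word, _ in ordered)

filled_text = arrange([("kite", 3), ("Xiao Ming", 1), ("flies", 2)])
```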
  • Specifically, the fixed writing format of the text to be generated may include an arrangement rule for the modules; identification information distinguishes the modules, and the filled texts are then arranged using the modules' identification information.
  • For example, if the fixed writing format specifies that the "Theme" module M1 is arranged before the "Body" module M2, the filled text of the module with identification information M1 is arranged in front of the filled text of the module with identification information M2.
  • In the text generation method provided in the embodiments of the present application, for each module the memory network is trained with a plurality of pre-collected first sample texts, each of which conforms to the natural language expression structure and meets the module's requirement information. The first position information, produced by the memory network, of each participle of a valid text in the filled text therefore matches the position information of the corresponding participle in the first sample text. On this basis, arranging the participles of each valid text according to the first position information yields filled text whose structure matches that of the first sample text and thus conforms to the natural language expression structure. Consequently, when the filled texts of the modules are arranged according to the fixed writing format, the resulting text to be generated conforms to the natural language expression structure.
  • the text generating method provided in the embodiment of the present application may further include the following steps:
  • For each valid text of a module, the first identification information of the module is marked; the first identification information is preset information that uniquely identifies each module.
  • step S105 in the embodiment shown in FIG. 1 of the present application may specifically include:
  • Each filled text is arranged according to the sixth position information to obtain the text to be generated.
  • the filling text of each module needs to be arranged according to the fixed writing format of the text to be generated to which the module belongs.
  • The fixed writing format of the text to be generated may be expressed in advance as a correspondence table or mapping (for example, key-value pairs) between the first identification information and module positions. From this correspondence, the sixth position information of each filled text in the text to be generated can be determined, and arranging the filled texts according to the sixth position information yields text that conforms to the fixed writing format of the text to be generated.
  • For example, suppose the fixed writing format for generating a breaking-news article includes the "Title" module, the "Posting Time" module, and the "Body" module in that order. The filled text of the "Title" module, "2018 World Cup", is marked with first identification information a1; the filled text of the "Posting Time" module, "June 14, 2018", is marked with first identification information a2; and the filled text of the "Body" module, "The 2018 World Cup matches will start on June 14, 2018 and will run until July 15 in 12 stadiums in 11 cities in Russia. This is the first time the World Cup has been held in Russia.", is marked with first identification information a3.
  • According to the correspondence between the first identification information and the module positions, the sixth position information of the filled text of the "Title" module in the text to be generated is position 01, that of the "Posting Time" module is position 02, and that of the "Body" module is position 03.
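  • The key-value expression of the fixed writing format described above can be sketched as follows. The identifiers a1, a2, and a3 follow the example above; the code itself is illustrative and not from the application:

```python
# Sketch of expressing the fixed writing format as a key-value mapping
# from first identification information to module position (the sixth
# position information), then arranging the filled texts accordingly.

FORMAT_MAP = {"a1": 1, "a2": 2, "a3": 3}  # identification info -> position

def arrange_filled_texts(filled):
    # filled maps first identification information to the module's filled text.
    return [filled[k] for k in sorted(filled, key=FORMAT_MAP.get)]

ordered = arrange_filled_texts({
    "a3": "body text",
    "a1": "2018 World Cup",
    "a2": "June 14, 2018",
})
```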
  • In addition, the requirement information can be used as the text to be answered, and the text that meets the requirement information can be used as the answer to that text. Valid text is then obtained at the semantic level of the requirement information, avoiding the inaccuracy and limited richness of text obtained by matching only at the literal text level.
  • step S101 in the embodiment shown in FIG. 1 of the present application may specifically include the following steps 1 to 5:
  • Step 1: For each module in the fixed writing format of the text to be generated, obtain from the preset database a plurality of complete texts that match the event described by the text to be generated, as the backup texts of the module.
  • the complete text of each module is used to describe the same event.
  • the "Title” module and “Body” module in the press release covering the 2018 World Cup start all describe the 2018 World Cup start.
  • Because the requirement information of each module indicates the module's own text content, it also indicates the event described by the text to be generated to which the module belongs. Therefore, multiple complete texts in the preset database that match the event described by the text to be generated can be used as backup texts for each module.
  • For example, the valid text of the "Parties' Natural Situation" module is the parties' information text in the case data, and the valid text of the "Cause of Action" module is the litigation request text in the case data.
  • Step 2: For each module, input each backup text of the module into a second recurrent neural network trained in advance to obtain the second feature vector of each backup text.
  • The second recurrent neural network is obtained by training with a plurality of pre-collected sample backup texts.
  • Step 3: For each module, input the module's requirement information into a third recurrent neural network trained in advance to obtain the third feature vector of the requirement information as the feature vector of the module.
  • The third recurrent neural network is obtained by training with a plurality of pre-collected samples of the module's requirement information.
  • When the requirement information is used as the text to be answered and text that meets the requirement information is used as the answer, the procedure amounts to computing the feature matching degree between the backup texts and the requirement information. Therefore, for each module, the second feature vector of each backup text and the third feature vector of the module's requirement information must both be obtained.
  • The second and third recurrent neural networks have the same structure as the recurrent neural network of the embodiment in FIG. 2 of the present application; the difference is that, to obtain corresponding outputs for different inputs, the sample sets used to train the different recurrent neural networks differ. The common parts are not repeated here; see the description of the embodiment shown in FIG. 2 above.
  • Step 4: For each module, input the vector information corresponding to each backup text of the module into a fourth recurrent neural network obtained in advance, to obtain the second position information, in each backup text, of the text that meets the module's requirement information. The fourth recurrent neural network is obtained by training with sample complete texts of the event corresponding to the module's requirement information, marked with third position information; the third position information is the position information, within the sample complete text, of the text that meets the module's requirement information.
  • Taking one backup text and one sample complete text as an example, suppose the sample complete text describing the event "Spring Game" described by the text to be generated is "Spring is here, children can go out and play", and the third position information marked in the sample complete text for the requirement information "Spring Game" includes the start position of "Spring", the first position, and the end position of the matching text, the tenth position.
  • The vector information of the backup text "In spring, children can go out to play, and Xiao Ming likes flying a kite", namely its second feature vector together with the third feature vector of the requirement information "Spring Game" (that is, the feature vector of the module), is input into the fourth recurrent neural network to obtain the second position information of the text in the backup text that meets the requirement information "Spring Game": the start position of "Spring" is the first position, and the end position of "kite" is the tenth position.
  • Step 5: For each module, extract from each backup text the text at the corresponding second position information, as valid text that meets the module's requirement information.
  • Continuing the example, the text between the start position of "Spring", the first position, and the end position of "kite", the tenth position, is extracted from the backup text as the valid text meeting the requirement information "Spring Game".
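  • The extraction in step 5 can be sketched as a token-span slice. The positions are 1-based to match the example above, and the tokens and function name are illustrative assumptions:

```python
# Sketch of step 5: given the second position information (start and end
# positions) predicted for a backup text, the token span between them is
# extracted as the valid text. Positions are 1-based; tokens illustrative.

def extract_span(backup_tokens, start_pos, end_pos):
    return " ".join(backup_tokens[start_pos - 1:end_pos])

tokens = "In spring children can go out to play and fly".split()
valid_text = extract_span(tokens, 1, 10)
```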
  • S101 in the embodiment shown in FIG. 1 in the foregoing application may specifically include the following steps:
  • For example, the requirement information of the "Body" module includes requirement information Q1 "2018 World Cup holding time", requirement information Q2 "2018 World Cup holding place", and requirement information Q3 "special information of the 2018 World Cup". The valid texts obtained from the preset database that meet the requirement information of the "Body" module include: for Q1, A1 "The 2018 World Cup matches start on June 14, 2018" and A2 "The 2018 World Cup will continue until July 15"; for Q2, A3 "in Russia" and A4 "held in 12 stadiums in 11 cities"; and for Q3, A5 "The World Cup is held in Russia for the first time".
  • S102 in the embodiment shown in FIG. 1 of the foregoing application may specifically include:
  • For each module, the plurality of valid texts of each piece of the module's requirement information are respectively input into the first recurrent neural network trained in advance to obtain the first feature vector of each valid text. In this case, this step obtains multiple valid texts corresponding to the multiple pieces of requirement information of the same module.
  • the text generation method provided in the embodiment of the present application may further include:
  • For each module, each piece of the module's requirement information is input into a third recurrent neural network trained in advance to obtain the third feature vector of each piece of requirement information; the third recurrent neural network is obtained by training with a plurality of pre-collected samples of the module's requirement information.
  • Because the valid texts corresponding to the multiple pieces of requirement information must be arranged according to their respective requirement information, a feature vector of each piece of requirement information of the same module must be obtained for the subsequent determination of the position information of the module's valid texts.
  • S103 in the embodiment shown in FIG. 1 of the present application may specifically include:
  • the vector information corresponding to each requirement information of the module For each module, input the vector information corresponding to each requirement information of the module into the pre-trained memory network, and obtain the valid text corresponding to each requirement information of the module and the corresponding first position information;
  • the first position information is each participle in the valid text, and the position information in the filled text of the module;
  • the vector information corresponding to any requirement information of the module includes: the first feature of each valid text corresponding to the requirement information Vector, and the third feature vector of the demand information;
  • the filled text is the text corresponding to the module, and the text structure of the filled text is the same as the text structure of the first sample text, labeled with the fourth position information, that is used in the training of the memory network;
  • the fourth position information is the position information of each piece of text, in the first sample text, that meets the specified requirement information.
  • the vector information corresponding to each piece of requirement information of the module can be input into the pre-trained memory network to obtain the valid text corresponding to each piece of requirement information of the module and the corresponding first position information. On this basis, the participles in each valid text are subsequently arranged according to the first position information, and the resulting filled text has the same structure as the first sample text. Since the first sample text conforms to natural language description habits, the filled text also conforms to natural language description habits.
  • moreover, since the third feature vector of the requirement information corresponding to each valid text is also input into the memory network, and the first sample text used for training the memory network is labeled with the fourth position information, it is ensured that each participle in the determined valid texts can be accurately arranged according to its corresponding requirement information.
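  • For ease of understanding, the arrangement according to the first position information can be sketched as follows; the participles and positions below are illustrative assumptions rather than the output of an actual trained memory network:

```python
# Hypothetical sketch: arrange participles into a filled text according to
# their first position information. In the described method these positions
# are produced by the trained memory network; here they are hard-coded.

def arrange_by_position(participle_positions):
    """participle_positions: list of (participle, position in the filled text)."""
    ordered = sorted(participle_positions, key=lambda pair: pair[1])
    return " ".join(participle for participle, _ in ordered)

# Participles of one module's valid texts, each with its first position info.
positions = [
    ("held", 2), ("the 2018 World Cup", 0), ("in Russia", 3), ("was", 1),
]
filled_text = arrange_by_position(positions)  # "the 2018 World Cup was held in Russia"
```

  • Once the memory network has assigned a position to each participle, obtaining the filled text reduces to a simple sort.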
  • the memory network in this embodiment may specifically have a structure as shown in FIG. 4:
  • the memory network in this embodiment is similar to the memory network in the embodiment of FIG. 3 described above. The difference is that, to handle the case where there are multiple pieces of requirement information, the memory network in this embodiment adds an input layer 401, which obtains the third feature vector of each piece of requirement information of the module and inputs it to the hidden layer. The recurrent neural network is not repeated here; for details, refer to the description of the embodiment of FIG. 2 described above. After the input layer 401, the neuron 406, and the historical state information 4033 corresponding to that neuron are added to extract the third feature vector, the output of the input layer 401 is used as the input of the neuron 4030 to obtain the valid text corresponding to each piece of requirement information of the module and the corresponding first position information, so that each participle in the valid texts is arranged in correspondence with its requirement information.
  • the input layer 402, hidden layer 403, neuron 4030, neuron 4031, neuron 4032, historical state information 4033, start position 404, and cut-off position 405 of each participle are the same as the input layer 301, hidden layer 302, neuron 3020, neuron 3021, neuron 3022, historical state information 3023, start position 303, and cut-off position 304 of each participle in the memory network of the embodiment of FIG. 3 of this application, and will not be repeated here; for details, see the description of the embodiment shown in FIG. 3.
  • for example, a first sample text, labeled with fourth position information, that meets the specified requirement information Q11 "time the 2008 Olympic Games were held", Q12 "location the 2008 Olympic Games were held", and Q13 "special information about the 2008 Olympics" is: "The 2008 Olympic Games were held in 6 cities in China from August 8 to August 24, 2008. This is the first time that the Olympic Games were held in China."
  • the fourth position information labeled in the first sample text includes, for example, the 4th and 6th positions, in the sample text, of "August 8, 2008" and "August 24, 2008", which meet the specified requirement information Q11.
  • the valid texts A1 to A5 obtained above and the requirement information Q1 to Q3 of the "body" module are input into the memory network, and the first position information, in the filled text, of the valid text of each piece of requirement information is determined; after arrangement according to each piece of first position information, a filled text with the same structure as the first sample text and conforming to natural language is obtained:
  • the 2018 World Cup will be held from June 14 to July 15, 2018 in 12 stadiums in 11 cities in Russia. This is the first time the World Cup has been held in Russia.
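  • The structure transfer in this example can be sketched as slot filling: the fourth position information labeled on the first sample text indicates which slot of the sentence structure each piece of requirement information occupies, and the valid texts of the new event are placed into the same slots. The template, slot indices, and requirement names below are illustrative assumptions:

```python
# Illustrative slot filling: the sample's labeled fourth position information
# fixes the sentence structure; valid texts for a new event fill the slots.
template_slots = ["From", None, "to", None, ",", None, "was held in", None, "."]
# Hypothetical fourth position information: requirement -> slot index.
slot_for_requirement = {"start": 1, "end": 3, "event": 5, "place": 7}

valid_texts = {
    "event": "the 2018 World Cup",
    "start": "June 14, 2018",
    "end": "July 15, 2018",
    "place": "12 stadiums in 11 cities in Russia",
}

slots = list(template_slots)
for requirement, index in slot_for_requirement.items():
    slots[index] = valid_texts[requirement]
sentence = " ".join(token for token in slots if token)
```

  • Because the slots come from the first sample text, the resulting sentence keeps the sample's natural-language structure while carrying the new event's information.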
  • the filled text of a module may be structured text, or it may be unstructured text.
  • structured text has a fixed representation structure; compared with unstructured text, less information needs to be determined by the neural network, and a neural network usually occupies a large amount of computing resources. Therefore, in order to reduce the occupation of computing resources and improve the efficiency of text generation, the text type of each module can be determined first, so that different text generation methods can be applied, in a targeted manner, to modules with different text types.
  • a flow of a text generation method according to another embodiment of the present application may include:
  • S501 For each module, input the requirement information of the module into a preset classification algorithm to obtain the text type of the filled text of the module, and the text type includes a structured type and an unstructured type.
  • if the text type of the filled text of the module is an unstructured type, S502 to S505 are performed; if the text type of the filled text of the module is a structured type, S506 to S508 are performed.
  • the preset classification algorithm may specifically be a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network pre-trained with a plurality of pre-collected sample requirement information corresponding to structured text and unstructured text. Alternatively, it may be judged whether the requirement information is preset information corresponding to a text type. For example, for a civil indictment to be generated, the preset information corresponding to the structured type is "the natural situation of the parties", "the respondent court", "payment", and "attachment", and the preset information corresponding to the unstructured type is "suit request" and "facts and reasons". Any classification algorithm capable of determining the text type corresponding to a module based on the module's requirement information can be used in this application, which is not limited in this embodiment.
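  • The preset-information variant of the classification can be sketched as a plain lookup; the dictionary below encodes the civil indictment example and is itself an illustrative assumption:

```python
# Sketch of the lookup-based classifier: requirement information matching a
# preset name is mapped directly to a text type, with no neural network.
STRUCTURED = "structured"
UNSTRUCTURED = "unstructured"

# Preset information for a civil indictment, following the example above.
PRESET_TYPES = {
    "the natural situation of the parties": STRUCTURED,
    "the respondent court": STRUCTURED,
    "payment": STRUCTURED,
    "attachment": STRUCTURED,
    "suit request": UNSTRUCTURED,
    "facts and reasons": UNSTRUCTURED,
}

def classify_module(requirement_information, default=UNSTRUCTURED):
    """Return the text type of the filled text for a module's requirement information."""
    return PRESET_TYPES.get(requirement_information, default)
```

  • Defaulting unknown requirement information to the unstructured branch is a design choice of this sketch; a support vector machine or convolutional neural network would make the same decision from learned features instead.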
  • when the convolutional neural network is used to determine the text type of the filled text, it may specifically have a structure as shown in FIG. 6.
  • the hidden layer of the neural network of this embodiment has two feature extraction channels. After the requirement information is input through the input layer 601, the channel 602 is used to extract local feature variables and the channel 603 is used to extract global feature variables, ensuring that the extracted features reflect not only the characteristics of each participle in the requirement information but also the overall semantics of all the participles.
  • the probability that the requirement information belongs to each text type is output by the output layer 604, and based on the output probabilities, the text type of the filled text corresponding to the input requirement information is determined.
  • Structured text includes words or sentences with a fixed expression structure; unstructured text includes sentences without a fixed expression structure. For example, the modules in the fixed writing format of a piece of hot news are a "title" module, a "release date" module, and a "body" module, where the texts of the "title" and "release date" modules are structured text, and the text of the "body" module is unstructured text.
  • S502 For each module in the fixed writing format of the text to be generated, obtain a plurality of valid texts from a preset database that meet the requirement information of the module, and the requirement information is used to indicate the text content corresponding to the module.
  • S503 For each module, input the multiple valid texts of the module into the first recurrent neural network trained in advance to obtain the first feature vector of each valid text; the first recurrent neural network is obtained by training with multiple pre-collected sample valid texts that meet the specified requirement information.
  • S504 For each module, input the first feature vector of each valid text of the module into the memory network obtained in advance, and obtain the first position information, in the filled text of the module, of the participles in each valid text; the text structure of the filled text is the same as the text structure of the first sample text used in the training of the memory network, and the first sample text is a text that conforms to the natural language expression structure and meets the requirement information of the module; the memory network is obtained by training with a plurality of pre-collected first sample texts.
  • S505 For each module, arrange the participles in each valid text of the module according to the obtained first position information to obtain the filled text of the module.
  • Steps S502 to S505 are the same steps as S101 to S104 in the embodiment shown in FIG. 1 of this application, and are not repeated here. For details, refer to the description of the embodiment shown in FIG. 1 of this application.
  • S506 Input multiple valid texts of the module into a sequence labeling model trained in advance to obtain the second identification information of each participle in each valid text.
  • the sequence labeling model is obtained by training with a plurality of pre-collected second sample valid texts that meet the requirement information of the module and are pre-labeled with second identification information.
  • the second identification information is used to represent the uniqueness of each participle in the valid text.
  • the sequence labeling model is used to label the second identification information of the input valid text, which is then used to determine the position information of each participle in the filled text in step S507.
  • the sequence labeling model in this embodiment may specifically have the structure shown in FIG. 7.
  • the valid text is input to the sequence labeling model through the input layer 701 in the form of a string.
  • the second identification information corresponding to each participle is determined in the hidden layer, so that the second identification information of each participle is labeled at the output layer 703. Considering that the participles in a text are associated with each other and the context of a participle affects its semantics, each neuron in the hidden layer of the sequence labeling model is an LSTM network (Long Short-Term Memory, an RNN with a special structure). Information is exchanged between the neurons to extract features that reflect the overall semantics of the valid text, and based on these features the second identification information is labeled for each participle of the valid text.
  • S507 Determine, according to the second identification information, the fifth position information of each participle in each valid text in the filled text of the module by using a preset correspondence between the identification and the participle position information.
  • the preset correspondence between the identifier and the segmentation position information may be a correspondence table between the identifier and the segmentation position information, and may also be a correspondence mapping (for example, key-value).
  • S508 Arrange the participles in each valid text according to the fifth position information of each participle in each valid text to obtain a filled text.
  • for example, the filled text corresponding to the "title" module in the fixed writing format of the hot news to be generated is structured text; the valid text "2018 World Cup starts on June 14 in Russia" is input into the preset sequence labeling model to obtain the second identification information g1 of the participle "2018", the second identification information g2 of the participle "World Cup", and the second identification information g3 of the participle "opening match".
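  • Steps S507 and S508 on this example can be sketched with a key-value correspondence table; the identifiers g1 to g3 follow the example above, while the positions assigned to them are assumptions:

```python
# Illustrative correspondence between second identification information and
# participle position information (S507), followed by arrangement (S508).
ID_TO_POSITION = {"g1": 0, "g2": 1, "g3": 2}  # assumed fifth position info

labeled_participles = [("2018", "g1"), ("World Cup", "g2"), ("opening match", "g3")]

# S507: look up the fifth position information of each labeled participle.
positioned = [(ID_TO_POSITION[label], word) for word, label in labeled_participles]
# S508: arrange the participles by position to obtain the filled text.
filled_title = " ".join(word for _, word in sorted(positioned))
```

  • Because structured text has a fixed structure, a table lookup replaces the memory network here, which is exactly why the structured branch consumes fewer computing resources.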
  • structured text has a fixed representation structure; compared with unstructured text, less information needs to be determined through the neural network, and a neural network usually occupies a large amount of computing resources. Therefore, by determining the text type of each module and applying different text generation methods to modules with different text types accordingly, the occupation of computing resources can be reduced and the efficiency of text generation can be improved.
  • an embodiment of the present application further provides a text generating device.
  • the structure of a text generating device may include:
  • a text acquisition module 801 is configured, for each module in the fixed writing format of the text to be generated, to obtain a plurality of valid texts that meet the requirement information of the module from a preset database; the requirement information is used to indicate the text content corresponding to the module;
  • a feature extraction module 802 is configured, for each module, to input the multiple valid texts of the module into a first recurrent neural network trained in advance to obtain a first feature vector of each valid text of the module; the first recurrent neural network is obtained by training with multiple pre-collected sample valid texts that meet the specified requirement information;
  • a position information determining module 803 is configured, for each module, to input the first feature vector of each valid text of the module into a memory network obtained in advance, and obtain the first position information, in the filled text, of the participles in each valid text of the module; the text structure of the filled text is the same as the text structure of the first sample text used in the training of the memory network, and the first sample text is a text that conforms to the natural language expression structure and meets the specified requirement information; the memory network is obtained by training with a plurality of the first sample texts collected in advance;
  • a text generating module 804 is configured, for each module, to arrange the participles in each valid text of the module according to the obtained first position information to obtain the filled text, and to arrange the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated.
  • with the text generating device provided in the embodiment of the present application, for each module, the memory network used is obtained by training with a plurality of pre-collected first sample texts, and the first sample texts are texts that conform to the natural language expression structure and meet the requirement information of the module. On this basis, the participles in the valid texts are arranged according to the first position information, and the text structure of the obtained filled text is the same as that of the first sample text, which also conforms to the natural language expression structure. Therefore, when the filled texts of the modules are arranged according to the fixed writing format of the text to be generated, the obtained text to be generated is also a text conforming to the natural language expression structure.
  • the text generation module 804 is specifically configured to:
  • the sixth position information, in the text to be generated, of the filled text of each module is determined according to the preset correspondence between the first identification information and the module positions; the preset correspondence between the first identification information and the module positions is used to represent the fixed writing format of the text to be generated; each filled text is arranged according to the sixth position information to obtain the text to be generated.
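  • The module-level arrangement can be sketched the same way; the correspondence below, which encodes the hot-news format of the earlier example (title, release date, body), is an illustrative assumption:

```python
# Illustrative preset correspondence between first identification information
# (module identifiers) and module positions, encoding the fixed writing format.
MODULE_POSITION = {"title": 0, "release_date": 1, "body": 2}

filled_texts = {
    "body": "The 2018 World Cup was held in 12 stadiums in 11 cities in Russia.",
    "title": "2018 World Cup opening match",
    "release_date": "June 14, 2018",
}

# Determine the sixth position information of each filled text, then arrange.
ordered_modules = sorted(filled_texts, key=MODULE_POSITION.get)
generated_text = "\n".join(filled_texts[module] for module in ordered_modules)
```

  • The same sort-by-position idea thus applies at two levels: participles within a module's filled text, and filled texts within the text to be generated.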
  • the text acquisition module 801 is specifically used for:
  • the feature extraction module 802 is further configured, for each module, to input each backup text of the module into a second recurrent neural network trained in advance to obtain a second feature vector of each backup text; the second recurrent neural network is trained with multiple pre-collected sample backup texts;
  • the requirement information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of the requirement information is obtained as the feature vector of the module; the third recurrent neural network is obtained by training with a plurality of sample requirement information of the module collected in advance;
  • the position information determining module 803 is further configured, for each module, to input the vector information corresponding to each backup text of the module into a fourth recurrent neural network obtained in advance, and obtain the second position information in each backup text of the module; the second position information is the position information, in the backup text, of the text that meets the requirement information of the module;
  • the text obtaining module 801 is specifically used for each module to extract the text at the corresponding second position information from each standby text of the module, as valid text that meets the requirements of the module.
  • when there are multiple pieces of requirement information for the module:
  • the text acquisition module 801 is specifically used for:
  • the feature extraction module 802 is further configured to:
  • for each module, input the multiple valid texts of each piece of requirement information of the module into the first recurrent neural network trained in advance to obtain the first feature vector of each valid text;
  • each piece of requirement information of the module is input into a third recurrent neural network trained in advance, and a third feature vector of each piece of requirement information of the module is obtained; the third recurrent neural network is obtained by training with a plurality of previously collected sample requirement information of the module;
  • the location information determining module 803 is specifically configured to:
  • for each module, the vector information corresponding to each piece of requirement information of the module is input into the pre-trained memory network, and the valid text corresponding to each piece of requirement information of the module and the corresponding first position information are obtained;
  • the first position information is the position information, in the filled text of the module, of each participle in the valid text;
  • the vector information corresponding to any piece of requirement information of the module includes: the first feature vector of each valid text corresponding to the requirement information, and the third feature vector of the requirement information; the text structure of the filled text is the same as the text structure of the first sample text, labeled with the fourth position information, that is used in the training of the memory network, and the fourth position information is the position information of each piece of text, in the first sample text, that meets the requirement information.
  • the structure of a text generating device may include:
  • a text classification module 901 is configured to input the requirement information of the module into a preset classification algorithm for each module, and obtain a text type of the module's filled text, where the text type includes a structured type and an unstructured type;
  • a text acquisition module 902 is configured, for each module in the fixed writing format of the text to be generated, to obtain, if the text type of the filled text of the module is an unstructured type, multiple valid texts that meet the requirement information of the module from a preset database;
  • a feature extraction module 903 is configured, for each module, to input, if the text type of the filled text of the module is an unstructured type, the multiple valid texts of the module into a first recurrent neural network trained in advance to obtain the first feature vector of each valid text;
  • a position information determining module 904 is configured, for each module, to input, if the text type of the filled text of the module is an unstructured type, the first feature vector of each valid text into a memory network obtained in advance to obtain the first position information, in the filled text of the module, of each participle in each valid text;
  • the text acquisition module 902 is further configured, for each module in the fixed writing format of the text to be generated, to input, if the text type of the filled text of the module is a structured type, the plurality of valid texts of the module into a pre-trained sequence labeling model to obtain the second identification information of each participle in each valid text; the sequence labeling model is obtained by training with a plurality of pre-collected second sample valid texts that meet the requirement information of the module and are pre-labeled with second identification information;
  • the position information determining module 904 is further configured to determine, according to the second identification information and by using the preset correspondence between identifiers and participle position information, the fifth position information, in the filled text of the module, of each participle in each valid text;
  • the text generating module 905 is further configured to arrange the participles in each valid text according to the fifth position information of each participle in each valid text to obtain the filled text of the module, and to arrange the filled texts of the modules according to the fixed writing format of the text to be generated to obtain the text to be generated.
  • an embodiment of the present application further provides a computer device, as shown in FIG. 10, which may include:
  • the processor 1001 is configured to implement the steps of the text generating method in any one of the foregoing embodiments when the computer program stored in the memory 1003 is executed.
  • with the computer device provided in the embodiment of the present application, for each module, the memory network used is obtained by training with a plurality of pre-collected first sample texts, and the first sample texts are samples that conform to the natural language expression structure and meet the requirement information of the module. Therefore, the first position information, in the filled text, of each participle in the valid texts obtained from the memory network is the same as the position information of each participle in the first sample text. On this basis, the participles in the valid texts are arranged according to the first position information, and the text structure of the obtained filled text is the same as that of the first sample text, which also conforms to the natural language expression structure. It is therefore guaranteed that, when the filled texts of the modules are arranged according to the fixed writing format of the text to be generated, the obtained text to be generated is a text conforming to the natural language expression structure.
  • the foregoing memory may include RAM (Random Access Memory, Random Access Memory), and may also include NVM (Non-Volatile Memory, non-volatile memory), such as at least one disk memory.
  • the memory may also be at least one storage device located remotely from the foregoing processor.
  • the above processor may be a general-purpose processor, including a CPU (Central Processing Unit), an NP (Network Processor), etc.; it may also be a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • an embodiment of the present application further provides a computer-readable storage medium; the computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the steps of the text generation method in any of the foregoing embodiments are implemented.
  • a computer-readable storage medium provided by an embodiment of the present application.
  • when the computer program is executed by a processor, since the memory network used for each module is obtained by training with a plurality of pre-collected first sample texts, and the first sample texts are samples that conform to the natural language expression structure and meet the requirement information of the module, the first position information, in the filled text, of each participle in the valid texts obtained from the memory network is the same as the position information of each participle in the first sample text. On this basis, the participles in the valid texts are arranged according to the first position information, and the text structure of the obtained filled text is the same as that of the first sample text, which also conforms to the natural language expression structure. It is therefore guaranteed that, when the filled texts of the modules are arranged according to the fixed writing format of the text to be generated, the obtained text to be generated is a text conforming to the natural language expression structure.
  • a computer program product containing instructions is also provided.
  • the computer program product is run on a computer, the computer is caused to execute the text generating method in any of the foregoing embodiments.
  • an application program is also provided, and when the application program is running, the text generating method in any of the foregoing embodiments may be executed.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or DSL (Digital Subscriber Line)) or wirelessly (for example, infrared, radio, or microwave).
  • a computer-readable storage medium may be any available media that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes one or more available media integrations.
  • the available media may be magnetic media (for example, a floppy disk, a hard disk, or magnetic tape), optical media (for example, a DVD (Digital Versatile Disc)), or semiconductor media (for example, an SSD (Solid State Disk)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a text generation method, apparatus, and device. The method includes: for each module in a fixed writing format of a text to be generated, acquiring, from a preset database, a plurality of valid texts that meet the requirement information of the module; for each module, respectively inputting the plurality of valid texts of the module into a pre-trained first recurrent neural network to obtain a first feature vector of each valid text; for each module, inputting the first feature vector of each valid text into a pre-trained memory network to obtain the first position information, in the filled text of the module, of the participles in each valid text, and arranging the participles in each valid text to obtain the filled text; and arranging the filled text of each module according to the fixed writing format of the text to be generated to obtain the text to be generated. A text to be generated that conforms to the natural language expression structure is thereby obtained.
PCT/CN2019/096894 2018-07-27 2019-07-19 Procédé, appareil et dispositif de génération de texte WO2020020084A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810846953.8 2018-07-27
CN201810846953.8A CN110852084B (zh) 2018-07-27 2018-07-27 文本生成方法、装置及设备

Publications (1)

Publication Number Publication Date
WO2020020084A1 true WO2020020084A1 (fr) 2020-01-30

Family

ID=69181212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096894 WO2020020084A1 (fr) 2018-07-27 2019-07-19 Procédé, appareil et dispositif de génération de texte

Country Status (2)

Country Link
CN (1) CN110852084B (fr)
WO (1) WO2020020084A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193792A (zh) * 2017-05-18 2017-09-22 北京百度网讯科技有限公司 基于人工智能的生成文章的方法和装置
JP2018084627A (ja) * 2016-11-22 2018-05-31 日本放送協会 言語モデル学習装置およびそのプログラム
CN108197294A (zh) * 2018-01-22 2018-06-22 桂林电子科技大学 一种基于深度学习的文本自动生成方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199805B (zh) * 2014-09-11 2017-10-20 清华大学 文本拼接方法及装置
US9881003B2 (en) * 2015-09-23 2018-01-30 Google Llc Automatic translation of digital graphic novels
CN106919646B (zh) * 2017-01-18 2020-06-09 南京云思创智信息科技有限公司 中文文本摘要生成系统及方法
CN107832310A (zh) * 2017-11-27 2018-03-23 首都师范大学 基于seq2seq模型的结构化论点生成方法及系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018084627A (ja) * 2016-11-22 2018-05-31 日本放送協会 言語モデル学習装置およびそのプログラム
CN107193792A (zh) * 2017-05-18 2017-09-22 北京百度网讯科技有限公司 基于人工智能的生成文章的方法和装置
CN108197294A (zh) * 2018-01-22 2018-06-22 桂林电子科技大学 一种基于深度学习的文本自动生成方法

Also Published As

Publication number Publication date
CN110852084A (zh) 2020-02-28
CN110852084B (zh) 2021-04-02

Similar Documents

Publication Publication Date Title
US11216510B2 (en) Processing an incomplete message with a neural network to generate suggested messages
US20190114362A1 (en) Searching Online Social Networks Using Entity-based Embeddings
CN109670163B (zh) 信息识别方法、信息推荐方法、模板构建方法及计算设备
WO2018032937A1 (fr) Procédé et appareil pour classifier des informations textuelles
US20190188285A1 (en) Image Search with Embedding-based Models on Online Social Networks
US9047868B1 (en) Language model data collection
WO2018036272A1 (fr) Procédé de poussée de contenu d'actualités, dispositif électronique et support d'informations lisible par ordinateur
US10678786B2 (en) Translating search queries on online social networks
CN112313644A (zh) 基于会话数据构建定制的用户简档
WO2018149209A1 (fr) Procédé de reconnaissance vocale, dispositif électronique et support de stockage d'ordinateur
US20170193086A1 (en) Methods, devices, and systems for constructing intelligent knowledge base
US20190155916A1 (en) Retrieving Content Objects Through Real-time Query-Post Association Analysis on Online Social Networks
US10951555B2 (en) Providing local service information in automated chatting
WO2018201600A1 (fr) Système et procédé d'extraction d'informations, dispositif électronique et support de stockage lisible
CN106982256A (zh) 信息推送方法、装置、设备及存储介质
CN111046667B (zh) 一种语句识别方法、语句识别装置及智能设备
WO2022134421A1 (fr) Procédé et appareil de réponse intelligente basée sur un graphe multi-connaissances, dispositif informatique et support de stockage
CN102314440B (zh) 利用网络维护语言模型库的方法和系统
US10810214B2 (en) Determining related query terms through query-post associations on online social networks
CN102567534B (zh) 互动产品用户生成内容拦截系统及其拦截方法
CN108960574A (zh) 问答的质量确定方法、装置、服务器和存储介质
US11880401B2 (en) Template generation using directed acyclic word graphs
WO2022111347A1 (fr) Procédé et appareil de traitement d'informations, dispositif électronique et support de stockage
CN103984771A (zh) 一种英文微博中地理兴趣点抽取和感知其时间趋势的方法
CN106462564A (zh) 在文档内提供实际建议

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.08.2021)

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19840753

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19840753

Country of ref document: EP

Kind code of ref document: A1