WO2022188584A1 - Similar sentence generation method and apparatus based on pre-trained language model - Google Patents

Similar sentence generation method and apparatus based on pre-trained language model

Info

Publication number
WO2022188584A1
Authority
WO
WIPO (PCT)
Prior art keywords
similar
sentence
candidate
model
discriminant
Prior art date
Application number
PCT/CN2022/075657
Other languages
French (fr)
Chinese (zh)
Inventor
高臻
闫慧丽
顾松庠
Original Assignee
京东科技控股股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2021-03-12
Filing date
Publication date
Application filed by 京东科技控股股份有限公司
Publication of WO2022188584A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a method and apparatus for generating similar sentences based on a pre-trained language model.
  • Typically, a customer service robot adds new FAQs (Frequently Asked Questions) from time to time, and accordingly each new FAQ needs to be expanded into a diverse set of similar questions.
  • In the related art, templates are formulated manually, and question expansion is completed simply by filling in the corresponding entities and keywords. This requires considerable manpower and time to edit the templates, a corresponding template must be customized whenever a new question type is added, and the resulting sentences follow fixed patterns that lack diversity of expression.
  • The present application aims to solve, at least to some extent, one of the technical problems in the related art.
  • To this end, the present application proposes a method and apparatus for generating similar sentences based on a pre-trained language model, so as to automatically generate similar questions with diverse forms and consistent semantics and to improve the quality and efficiency of similar sentence generation.
  • An embodiment of the first aspect of the present application proposes a method for generating similar sentences based on a pre-trained language model, including:
  • acquiring a sentence to be processed; inputting the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences; generating a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences; and inputting the plurality of discriminative sentence pairs into a trained discriminant model, obtaining a discrimination result, and obtaining a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  • The method for generating similar sentences based on a pre-trained language model according to the embodiments of the present application acquires a sentence to be processed; inputs the sentence into a trained generation model to obtain a plurality of candidate similar sentences; generates a plurality of discriminative sentence pairs from the sentence to be processed and the candidate similar sentences; inputs the discriminative sentence pairs into a trained discriminant model to obtain a discrimination result; and obtains a target similar sentence from the candidate similar sentences according to the discrimination result.
  • In this way, similar questions that are both diverse in form and consistent in semantics are generated automatically, improving the quality and efficiency of similar sentence generation.
  • An embodiment of the second aspect of the present application proposes an apparatus for generating similar sentences based on a pre-trained language model, including:
  • a first acquisition module, configured to acquire the sentence to be processed;
  • a first processing module, configured to input the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences;
  • a first generation module, configured to generate a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences;
  • a second processing module, configured to input the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result; and
  • a second acquisition module, configured to obtain a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  • The apparatus for generating similar sentences based on a pre-trained language model according to the embodiments of the present application acquires a sentence to be processed; inputs the sentence into a trained generation model to obtain a plurality of candidate similar sentences; generates a plurality of discriminative sentence pairs from the sentence to be processed and the candidate similar sentences; inputs the discriminative sentence pairs into a trained discriminant model to obtain a discrimination result; and obtains a target similar sentence from the candidate similar sentences according to the discrimination result.
  • In this way, similar questions that are both diverse in form and consistent in semantics are generated automatically, improving the quality and efficiency of similar sentence generation.
  • An embodiment of the third aspect of the present application proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the method for generating similar sentences based on a pre-trained language model proposed by the embodiment of the first aspect of the present application is implemented.
  • An embodiment of the fourth aspect of the present application proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for generating similar sentences based on a pre-trained language model proposed by the embodiment of the first aspect of the present application is implemented.
  • An embodiment of the fifth aspect of the present application proposes a computer program product; when the instructions in the computer program product are executed by a processor, the method for generating similar sentences based on a pre-trained language model proposed by the embodiment of the first aspect of the present application is executed.
  • FIG. 1 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided in Embodiment 1 of the present application;
  • FIG. 2 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided in Embodiment 2 of the present application;
  • FIG. 3 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided in Embodiment 3 of the present application;
  • FIG. 4 is a schematic flowchart of generating similar sentences in an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of an apparatus for generating similar sentences based on a pre-trained language model provided in Embodiment 4 of the present application;
  • FIG. 6 is a block diagram of an exemplary electronic device or server suitable for implementing embodiments of the present application.
  • To address the problems that editing templates requires considerable manpower and time, that a corresponding template must be customized whenever a new question type is added, and that the resulting sentence patterns are fixed and lack expressive diversity, the embodiments of the present application propose a method for generating similar sentences based on a pre-trained language model: a sentence to be processed is acquired; the sentence is input into a trained generation model to obtain a plurality of candidate similar sentences; a plurality of discriminative sentence pairs are generated from the sentence to be processed and the candidate similar sentences; the discriminative sentence pairs are input into a trained discriminant model to obtain a discrimination result; and a target similar sentence is obtained from the candidate similar sentences according to the discrimination result.
  • In this way, similar questions that are both diverse in form and consistent in semantics are generated automatically, improving the quality and efficiency of similar sentence generation.
  • FIG. 1 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model according to Embodiment 1 of the present application.
  • The similar sentence generation method of the embodiments of the present application can be applied to an electronic device.
  • The electronic device may be any device with computing capability, such as a PC (personal computer) or a mobile terminal.
  • The mobile terminal may be, for example, a mobile phone, a tablet computer, a personal digital assistant, or a wearable device, that is, a hardware device with an operating system, a touch screen and/or a display screen.
  • As shown in FIG. 1, the method for generating similar sentences based on a pre-trained language model may include the following steps 101 to 104.
  • Step 101: acquire the sentence to be processed.
  • In the embodiments of the present application, the sentence to be processed is a sentence for which a plurality of corresponding similar sentences need to be generated; it can be selected and acquired according to the actual application scenario.
  • For example, the sentence to be processed may be "How is product A developing?" or "Could you introduce product A?".
  • Step 102: input the sentence to be processed into the trained generation model to obtain a plurality of candidate similar sentences.
  • In the embodiments of the present application, the generation model is a pre-trained language model that has already been trained; for the specific training process, refer to the description below.
  • In the embodiments of the present application, the sentence to be processed is encoded to obtain an encoding vector, and candidate similar sentences are generated word by word in an autoregressive manner: at each step, the probability distribution over candidate similar words is obtained, one candidate similar word is randomly sampled from the top N highest-probability candidates as the target candidate similar word (where N is a positive integer), and the candidate similar sentence is generated from the target candidate similar words.
  • For example, given an input sentence X to be processed, X is encoded and candidate similar sentences are then generated with a random sampling strategy. Generation proceeds word by word from left to right, and each word is randomly selected from the N (for example, 5) highest-probability candidates; as a result, the same input sentence yields a different candidate similar sentence each time it is fed into the generation model, and repeating the process many times produces multiple candidate similar sentences.
  • In this random sampling scheme, each generated word is conditioned on the standard question and the content generated so far, and is randomly selected from the highest-probability candidate similar words of the current conditional distribution, which increases the diversity of the generated expressions. A minimal sketch of this sampling loop is given below.
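  • The following is a minimal illustrative sketch (not the patent's own code) of the top-N autoregressive sampling described above. `model` and `tokenizer` are hypothetical stand-ins for a fine-tuned UniLM-style generator, where `model` is assumed to map a token-id tensor to next-token logits of shape [batch, seq_len, vocab_size]. Because each word is drawn at random from the top N, repeated calls with the same input naturally yield different candidate similar sentences.

      import torch

      def sample_candidates(model, tokenizer, sentence, n_top=5, num_candidates=5, max_len=32):
          candidates = []
          for _ in range(num_candidates):
              src_ids = tokenizer.encode(sentence)            # encode the sentence to be processed
              generated = []
              for _ in range(max_len):
                  ids = torch.tensor([src_ids + generated])
                  probs = torch.softmax(model(ids)[0, -1], dim=-1)
                  top_p, top_i = torch.topk(probs, n_top)     # N highest-probability next words
                  pick = top_i[torch.multinomial(top_p / top_p.sum(), 1)].item()
                  if pick == tokenizer.eos_token_id:          # stop at end of sentence
                      break
                  generated.append(pick)
              candidates.append(tokenizer.decode(generated))  # a different candidate on each pass
          return candidates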
  • Step 103: generate a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences.
  • Step 104: input the plurality of discriminative sentence pairs into the trained discriminant model, obtain a discrimination result, and obtain a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  • In the embodiments of the present application, the sentence to be processed is paired with each candidate similar sentence to form a discriminative sentence pair. For example, if the sentence to be processed is X and the candidate similar sentences are Y1 to Y5, the pairs (X, Y1), (X, Y2), ..., (X, Y5) are formed, yielding five discriminative sentence pairs.
  • In the embodiments of the present application, the discriminant model is a trained BERT (Bidirectional Encoder Representations from Transformers) module; for the specific training process, refer to the description below.
  • In the embodiments of the present application, each discriminative sentence pair is encoded to obtain a discriminant vector, and a prediction is made on each discriminant vector to obtain the similarity between the sentence to be processed and each candidate similar sentence.
  • Specifically, the input of the discriminant model is a sentence pair (sentence to be processed, candidate similar sentence); the discriminant model encodes the pair and predicts, as a classification, whether the two sentences are similar. If they are, the corresponding candidate similar sentence is taken as a target similar sentence, as in the sketch below.
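  • As a rough illustration (assuming a Hugging Face-style BERT pair classifier; the checkpoint path and the label order, with class 1 meaning "similar", are hypothetical), the filtering step might look like:

      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      tok = AutoTokenizer.from_pretrained("bert-base-chinese")
      clf = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-discriminator")

      def filter_candidates(query, candidates, threshold=0.5):
          kept = []
          for cand in candidates:
              inputs = tok(query, cand, return_tensors="pt", truncation=True)  # encode the pair
              probs = torch.softmax(clf(**inputs).logits, dim=-1)
              if probs[0, 1].item() >= threshold:   # class 1 assumed to mean "similar"
                  kept.append(cand)                 # keep as a target similar sentence
          return kept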
  • The method for generating similar sentences based on a pre-trained language model according to the embodiments of the present application acquires a sentence to be processed; inputs the sentence into a trained generation model to obtain a plurality of candidate similar sentences; generates a plurality of discriminative sentence pairs from the sentence to be processed and the candidate similar sentences; inputs the discriminative sentence pairs into a trained discriminant model to obtain a discrimination result; and obtains a target similar sentence from the candidate similar sentences according to the discrimination result.
  • In this way, similar questions that are both diverse in form and consistent in semantics are generated automatically, improving the quality and efficiency of similar sentence generation.
  • In a possible implementation of the embodiments of the present application, the pre-trained language model UniLM (UNIfied pre-trained Language Model) is used as the generation model to generate high-quality text, and BERT (Bidirectional Encoder Representations from Transformers) is used as the discriminant model to filter out unqualified generated text; the training processes are described in detail below with reference to FIG. 2 and FIG. 3.
  • FIG. 2 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided by Embodiment 2 of the present application.
  • As shown in FIG. 2, the method for generating similar sentences based on a pre-trained language model may include the following steps 201 to 204.
  • Step 201: obtain a general-domain similar question dataset.
  • Step 202: input the general-domain similar question dataset into the pre-trained language model for training to obtain a first training similar sentence, compute the first error between the first training sentence and the first standard sentence through a loss function, and adjust the parameters of the pre-trained language model until the first error is smaller than a preset threshold, generating a candidate generation model.
  • In the embodiments of the present application, the encoder (UniLM) performs similar question generation task transfer on the general-domain similar question dataset. This dataset can be obtained in many ways, for example by crawling and collecting similar questions recommended on related forums and question-answering websites. Transfer learning of the similar question generation task is carried out with maximum likelihood estimation until the pre-trained language model converges. Because the training data is crawled from the Internet, no manual annotation is required, which improves training efficiency.
  • In the embodiments of the present application, the UniLM model, open-sourced by Microsoft, is a pre-trained language model based on the Transformer architecture that integrates natural language understanding and generation capabilities. UniLM pre-training adopts multi-task learning that combines autoencoding and autoregression; its two tasks are the masked language model (MLM) and sequence-to-sequence (seq2seq) tasks, so the model can serve both natural language understanding and natural language generation downstream tasks. In other words, encoding and decoding training can be performed after randomly masking the words of a training sentence, which improves the quality of subsequent generation.
  • In the embodiments of the present application, UniLM is a pre-trained model whose original pre-training tasks do not include similar question generation. The present application therefore uses the parameters of the UniLM model as initialization and trains the similar question generation task on top of them, that is, transfer learning. The training objective is to maximize the likelihood of the generated target sequence; when the value of the objective function no longer changes, or changes by less than a given threshold, the pre-trained language model is considered to have converged, training can stop, and the candidate generation model is produced. A sketch of such a training loop follows.
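  • The following is a minimal sketch of maximum likelihood fine-tuning with the convergence criterion described above; the seq2seq `model` signature (source ids and shifted target ids to token logits) and the data loader of (question, similar question) pairs are assumptions for illustration.

      import torch
      from torch.nn.functional import cross_entropy

      def finetune(model, loader, epochs=10, tol=1e-4, lr=3e-5):
          opt = torch.optim.AdamW(model.parameters(), lr=lr)
          prev = float("inf")
          for _ in range(epochs):
              total = 0.0
              for src_ids, tgt_ids in loader:                       # (question, similar question)
                  logits = model(src_ids, tgt_ids[:, :-1])          # teacher forcing
                  loss = cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       tgt_ids[:, 1:].reshape(-1))  # negative log-likelihood
                  opt.zero_grad(); loss.backward(); opt.step()
                  total += loss.item()
              if abs(prev - total) < tol:     # objective no longer changing: converged
                  break
              prev = total
          return model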
  • Step 203: obtain a target-domain similar question dataset.
  • Step 204: input the target-domain similar question dataset into the candidate generation model for training to obtain a second training similar sentence, compute the second error between the second training sentence and the second standard sentence through the loss function, and adjust the parameters of the candidate generation model until the second error is smaller than the preset threshold, generating the trained generation model.
  • In this way, the trained pre-trained language model becomes better suited to the target domain. The target domain can be selected and set according to the application scenario, for example the customer service business domain, in which case the encoder (UniLM) is fine-tuned for the similar question generation task on the FAQ similar question database of the customer service business.
  • In the embodiments of the present application, a relatively small target-domain similar question dataset can be input into the candidate generation model for maximum likelihood estimation; when the negative log-likelihood of the generated target sequence falls below a preset likelihood threshold, the trained generation model is produced.
  • Thus, the present application first uses a large amount of easily obtained supervised data to transfer the similar question tasks, and then uses a small amount of existing business data, together with the small amount of labeled data obtained while filtering available data, to transfer the domain, achieving ideal business indicators at the lowest labeling cost and improving processing efficiency.
  • FIG. 3 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided by Embodiment 3 of the present application.
  • As shown in FIG. 3, the method for generating similar sentences based on a pre-trained language model may include the following steps 301 to 304.
  • Step 301: obtain a similar sentence pair dataset.
  • Step 302: input the similar sentence pair dataset into the BERT-based bidirectional encoding representation module for training, and generate a candidate discriminant model.
  • In the embodiments of the present application, the discriminant model (BERT) first performs similar question discrimination task transfer on a financial semantic similarity dataset, and the discriminator (BERT) is then fine-tuned for the similar question discrimination task on the FAQs and similar questions in use by the customer service business.
  • Specifically, the discriminant model BERT is constructed, and a publicly available similar question corpus is used for similar question discrimination training.
  • In the embodiments of the present application, the BERT model, open-sourced by Google, is a pre-trained language model based on the Transformer architecture and is mainly used for natural language understanding tasks.
  • BERT pre-training adopts autoencoding multi-task learning; its two tasks are the masked language model (MLM) and next sentence prediction (NSP). BERT can serve as the initialization parameters for downstream task models.
  • In this way, transfer learning of the similar question discrimination task is carried out on easily obtained public datasets, such as the financial semantic similarity dataset, which requires no manual data labeling and improves training efficiency.
  • Further, maximum likelihood estimation is used to transfer the domain so that the discriminant model learns the data distribution of the customer service business; the amount of data used is much smaller than the training data scale used to generate the candidate discriminant model, further improving training efficiency.
  • Step 303: acquire positive samples and negative samples of similar sentence pairs in the target domain.
  • Step 304: input the positive samples and negative samples of the similar sentence pairs into the candidate discriminant model for training, and generate the trained discriminant model.
  • In the embodiments of the present application, the similar question discrimination task is fine-tuned on the FAQ similar questions accumulated by the customer service business.
  • In addition, training the discriminant model requires marking the data found unusable when screening available similar questions and using it as negative examples for domain transfer. This lets the discriminant model learn the operators' discrimination criteria, and the amount of data required is relatively small, which further improves training efficiency. A sketch of this fine-tuning step follows.
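  • A minimal sketch of Step 304 under the same Hugging Face-style assumptions as above; the checkpoint path, the (query, candidate, label) pair format, and the label convention (1 = similar, 0 = not similar) are illustrative assumptions.

      import torch
      from transformers import AutoTokenizer, AutoModelForSequenceClassification

      tok = AutoTokenizer.from_pretrained("bert-base-chinese")
      model = AutoModelForSequenceClassification.from_pretrained(
          "path/to/candidate-discriminator", num_labels=2)
      opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

      def finetune_discriminator(pairs, epochs=3):
          model.train()
          for _ in range(epochs):
              for query, cand, label in pairs:    # label: 1 = similar, 0 = not similar
                  inputs = tok(query, cand, return_tensors="pt", truncation=True)
                  loss = model(**inputs, labels=torch.tensor([label])).loss
                  opt.zero_grad(); loss.backward(); opt.step()
          return model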
  • To sum up, the present application uses the pre-trained language model UniLM as the generation model to generate high-quality text, and uses BERT as the discriminant model to filter out unqualified generated text, as shown in FIG. 4: the candidate similar sentences produced by the generation model are filtered by the discriminant model to obtain the target similar sentences that meet the standard, thereby automatically generating similar questions that are both diverse in form and consistent in semantics and improving the quality and efficiency of similar sentence generation. The overall flow is sketched below.
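  • Putting the hypothetical helpers above together, the end-to-end flow of FIG. 4 might read as follows; `gen_model` and `gen_tokenizer` stand for the fine-tuned generator assumed earlier.

      def generate_similar(query, num_candidates=10):
          # generate diverse candidates, then keep only those the discriminator accepts
          candidates = sample_candidates(gen_model, gen_tokenizer, query,
                                         n_top=5, num_candidates=num_candidates)
          return filter_candidates(query, candidates)   # the target similar sentences

      # e.g. generate_similar("How is product A developing?")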
  • Corresponding to the methods provided by the above embodiments, the present application also provides an apparatus for generating similar sentences based on a pre-trained language model.
  • Since the apparatus corresponds to the method for generating similar sentences based on a pre-trained language model provided in the embodiments of FIG. 1 to FIG. 4, the implementation of the method also applies to the apparatus, and it will not be described in detail in the embodiments of the present application.
  • FIG. 5 is a schematic structural diagram of an apparatus for generating similar sentences based on a pre-trained language model according to Embodiment 4 of the present application.
  • As shown in FIG. 5, the apparatus 500 for generating similar sentences based on a pre-trained language model is applied to an electronic device and includes a first acquisition module 501, a first processing module 502, a first generation module 503, a second processing module 504, and a second acquisition module 505.
  • The first acquisition module 501 is configured to acquire the sentence to be processed.
  • The first processing module 502 is configured to input the sentence to be processed into the trained generation model and obtain a plurality of candidate similar sentences.
  • The first generation module 503 is configured to generate a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences.
  • The second processing module 504 is configured to input the plurality of discriminative sentence pairs into the trained discriminant model to obtain a discrimination result.
  • The second acquisition module 505 is configured to obtain a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  • In the embodiments of the present application, the first processing module 502 is specifically configured to: encode the sentence to be processed to obtain an encoding vector; and decode the encoding vector, generating candidate similar sentences in an autoregressive manner, where the probability distribution over candidate similar words is obtained at each step, one candidate similar word is randomly sampled from the top N highest-probability candidates as the target candidate similar word (N being a positive integer), and the candidate similar sentences are generated from the target candidate similar words.
  • In the embodiments of the present application, the second processing module 504 is specifically configured to: encode each discriminative sentence pair to obtain a plurality of discriminant vectors; and make a prediction on each discriminant vector to obtain the similarity between the sentence to be processed and each candidate similar sentence.
  • In the embodiments of the present application, the apparatus 500 for generating similar sentences based on a pre-trained language model may further include:
  • a third acquisition module, configured to acquire the general-domain similar question dataset; a second generation module, configured to input the general-domain similar question dataset into the pre-trained language model for training, obtain the first training similar sentence, compute the first error between the first training sentence and the first standard sentence through the loss function, and adjust the parameters of the pre-trained language model until the first error is less than the preset threshold, generating the candidate generation model; a fourth acquisition module, configured to acquire the target-domain similar question dataset; and a third generation module, configured to input the target-domain similar question dataset into the candidate generation model for training, obtain the second training similar sentence, compute the second error between the second training sentence and the second standard sentence through the loss function, and adjust the parameters of the candidate generation model until the second error is smaller than the preset threshold, generating the trained generation model.
  • In the embodiments of the present application, the apparatus 500 for generating similar sentences based on a pre-trained language model may further include:
  • a fifth acquisition module, configured to acquire the similar sentence pair dataset; a fourth generation module, configured to input the similar sentence pair dataset into the BERT-based bidirectional encoding representation module for training and generate the candidate discriminant model; a sixth acquisition module, configured to acquire the positive samples and negative samples of similar sentence pairs in the target domain; and a fifth generation module, configured to input the positive samples and negative samples of the similar sentence pairs into the candidate discriminant model for training and, when the obtained similarity of the target sequence is less than the preset similarity threshold, generate the trained discriminant model.
  • The apparatus for generating similar sentences based on a pre-trained language model according to the embodiments of the present application acquires a sentence to be processed; inputs the sentence into the trained generation model to obtain a plurality of candidate similar sentences; generates a plurality of discriminative sentence pairs from the sentence to be processed and the candidate similar sentences; inputs the discriminative sentence pairs into the trained discriminant model to obtain a discrimination result; and obtains a target similar sentence from the candidate similar sentences according to the discrimination result.
  • In this way, similar questions that are both diverse in form and consistent in semantics are generated automatically, improving the quality and efficiency of similar sentence generation.
  • In order to implement the above embodiments, the present application also proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the method for generating similar sentences based on a pre-trained language model proposed by any of the embodiments of FIG. 1 to FIG. 4 is implemented.
  • In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for generating similar sentences based on a pre-trained language model proposed in any of the foregoing embodiments of the present application is implemented.
  • In order to implement the above embodiments, the present application also proposes a computer program product; when the instructions in the computer program product are executed by a processor, the method for generating similar sentences based on a pre-trained language model proposed in any of the foregoing embodiments of the present application is executed.
  • FIG. 6 shows a block diagram of an exemplary electronic device or server suitable for use in implementing embodiments of the present application.
  • The electronic device or server 12 shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
  • The electronic device or server 12 takes the form of a general-purpose computing device.
  • Components of the electronic device or server 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus structures.
  • By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • The electronic device or server 12 typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the electronic device or server 12, including volatile and non-volatile media and removable and non-removable media.
  • The memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • The electronic device or server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive").
  • A magnetic disk drive for reading from and writing to removable non-volatile magnetic disks (e.g., "floppy disks"), and an optical disc drive for reading from and writing to removable non-volatile optical discs (e.g., Compact Disc Read-Only Memory (CD-ROM), Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media), may also be provided.
  • In these cases, each drive may be connected to the bus 18 through one or more data media interfaces.
  • The memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the various embodiments of the present application.
  • A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • The electronic device or server 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device or server 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device or server 12 to communicate with one or more other computing devices. Such communication may take place through the input/output (I/O) interface 22.
  • Moreover, the electronic device or server 12 can also communicate through the network adapter 20 with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network (e.g., the Internet).
  • As shown, the network adapter 20 communicates with the other modules of the electronic device or server 12 via the bus 18.
  • Other hardware and/or software modules may be used in conjunction with the electronic device or server 12, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the methods mentioned in the foregoing embodiments.
  • The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implying the number of technical features indicated. Thus, a feature defined with "first" or "second" may expressly or implicitly include at least one such feature.
  • "Plurality" means at least two, for example two or three, unless expressly and specifically defined otherwise.
  • a "computer-readable medium” can be any device that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or apparatus.
  • computer readable media include the following: electrical connections with one or more wiring (electronic devices), portable computer disk cartridges (magnetic devices), random access memory (RAM), Read Only Memory (ROM), Erasable Editable Read Only Memory (EPROM or Flash Memory), Fiber Optic Devices, and Portable Compact Disc Read Only Memory (CDROM).
  • the computer readable medium may even be paper or other suitable medium on which the program may be printed, as the paper or other medium may be optically scanned, for example, followed by editing, interpretation, or other suitable medium as necessary process to obtain the program electronically and then store it in computer memory.
  • The functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module.
  • The above integrated modules may be implemented in the form of hardware or in the form of software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a similar sentence generation method and apparatus based on a pre-trained language model. The method comprises: acquiring a sentence to be processed; inputting, into a trained generative model, the sentence to be processed, so as to acquire a plurality of candidate similar sentences; generating a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences; and inputting the plurality of discriminative sentence pairs into a trained discriminative model, so as to acquire a discrimination result, and acquiring a target similar sentence from among the plurality of candidate similar sentences according to the discrimination result.

Description

Similar sentence generation method and apparatus based on pre-trained language model
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on, and claims priority to, the Chinese patent application No. 202110270871.5 filed on March 12, 2021, the entire content of which is incorporated herein by reference.
技术领域technical field
本申请涉及人工智能技术领域,尤其涉及一种基于预训练语言模型的相似语句生成方法和装置。The present application relates to the technical field of artificial intelligence, and in particular, to a method and device for generating similar sentences based on a pre-trained language model.
背景技术Background technique
通常,客服机器人会不定期新增FAQ(Frequently Asked Questions,经常问到的问题),相应就需要做相似问题多样性扩写。Usually, the customer service robot will add FAQs (Frequently Asked Questions, frequently asked questions) from time to time, and accordingly, it is necessary to expand the diversity of similar questions.
相关技术中,由人工制定模版,只需填入相应的实体和关键词完成问题扩写,需要投入大量人力和时间来编辑模版,每有新的问题类型加入就需要订制相应的模版,产生的句式固定,缺乏表达的多样性。In the related art, the template is manually formulated, and only the corresponding entities and keywords need to be filled in to complete the problem expansion, which requires a lot of manpower and time to edit the template. The sentence pattern is fixed and lacks the diversity of expression.
发明内容SUMMARY OF THE INVENTION
本申请旨在至少在一定程度上解决相关技术中的技术问题之一。The present application aims to solve one of the technical problems in the related art at least to a certain extent.
本申请提出一种基于预训练语言模型的相似语句生成方法和装置,以实现自动生成兼具形式多样且语义一致的相似问题,提高相似语句生成质量和效率。The present application proposes a method and device for generating similar sentences based on a pre-trained language model, so as to automatically generate similar questions with diverse forms and consistent semantics, and improve the quality and efficiency of similar sentence generation.
本申请第一方面实施例提出了一种基于预训练语言模型的相似语句生成方法,包括:The embodiment of the first aspect of the present application proposes a method for generating similar sentences based on a pre-trained language model, including:
获取待处理语句;Get the pending statement;
将所述待处理语句输入已训练的生成模型,获取多个候选相似语句;Inputting the to-be-processed statement into a trained generative model to obtain multiple candidate similar statements;
根据所述待处理语句和所述多个候选相似语句,生成多个判别语句对;generating a plurality of discriminative sentence pairs according to the to-be-processed sentence and the plurality of candidate similar sentences;
将所述多个判别语句对输入已训练的判别模型,获取判别结果,以及根据所述判别结果从所述多个候选相似语句中获取目标相似语句。The plurality of discriminative sentence pairs are input into a trained discriminant model, a discriminant result is obtained, and a target similar sentence is obtained from the plurality of candidate similar sentences according to the discriminant result.
本申请实施例的基于预训练语言模型的相似语句生成方法,通过获取待处理语句;将待处理语句输入已训练的生成模型,获取多个候选相似语句;根据待处理语句和多个候选相似语句,生成多个判别语句对;将多个判别语句对输入已训练的判别模型,获取判别结果,以及根据判别结果从多个候选相似语句中获取目标相似语句。由此,自动生成兼具形式多样且语义一致的相似问题,提高相似语句生成质量和效率。The method for generating similar sentences based on a pre-trained language model according to the embodiment of the present application obtains a statement to be processed; inputs the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences; , generate multiple discriminative sentence pairs; input the multiple discriminative sentence pairs into the trained discriminant model, obtain the discriminant result, and obtain the target similar sentence from the multiple candidate similar sentences according to the discriminant result. As a result, similar questions with diverse forms and consistent semantics are automatically generated, and the quality and efficiency of similar sentences are improved.
本申请第二方面实施例提出了一种基于预训练语言模型的相似语句生成装置,包括:The embodiment of the second aspect of the present application proposes an apparatus for generating similar sentences based on a pre-trained language model, including:
第一获取模块,用于获取待处理语句;The first obtaining module is used to obtain the to-be-processed statement;
第一处理模块,用于将所述待处理语句输入已训练的生成模型,获取多个候选相似语句;a first processing module, configured to input the to-be-processed statement into a trained generative model to obtain a plurality of candidate similar statements;
第一生成模块,用于根据所述待处理语句和所述多个候选相似语句,生成多个判别语句对;a first generation module, configured to generate a plurality of discriminative sentence pairs according to the to-be-processed sentence and the plurality of candidate similar sentences;
第二处理模块,用于将所述多个判别语句对输入已训练的判别模型,获取判别结果;The second processing module is used for inputting the plurality of discriminative sentence pairs into a discriminant model that has been trained to obtain a discriminant result;
第二获取模块,用于根据所述判别结果从所述多个候选相似语句中获取目标相似语句。The second obtaining module is configured to obtain a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
本申请实施例的基于预训练语言模型的相似语句生成装置,通过获取待处理语句;将待处理语句输入已训练的生成模型,获取多个候选相似语句;根据待处理语句和多个候选相似语句,生成多个判别语句对;将多个判别语句对输入已训练的判别模型,获取判别结果,以及根据判别结果从多个候选相似语句中获取目标相似语句。由此,自动生成兼具形式多样且语义一致的相似问题,提高相似语句生成质量和效率。The apparatus for generating similar sentences based on a pre-trained language model according to the embodiment of the present application obtains the sentences to be processed; inputs the sentences to be processed into the trained generation model to obtain a plurality of candidate similar sentences; , generate multiple discriminative sentence pairs; input the multiple discriminative sentence pairs into the trained discriminant model, obtain the discriminant result, and obtain the target similar sentence from the multiple candidate similar sentences according to the discriminant result. As a result, similar questions with diverse forms and consistent semantics are automatically generated, and the quality and efficiency of similar sentences are improved.
本申请第三方面实施例提出了一种电子设备,包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时,实现如本申请第一方面实施例提出的基于预训练语言模型的相似语句生成方法。An embodiment of the third aspect of the present application proposes an electronic device, including: a memory, a processor, and a computer program stored in the memory and running on the processor. When the processor executes the program, the computer program as described in the present application A method for generating similar sentences based on a pre-trained language model proposed by the embodiments of the first aspect.
本申请第四方面实施例提出了一种非临时性计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现如本申请第一方面实施例提出的基于预训练语言模型的相似语句生成方法。The embodiment of the fourth aspect of the present application provides a non-transitory computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, implements the pre-training language-based language proposed by the embodiment of the first aspect of the present application Similar sentence generation method for the model.
本申请第五方面实施例提出了一种计算机程序产品,当所述计算机程序产品中的指令由处理器执行时,执行如本申请第一方面实施例提出的基于预训练语言模型的相似语句生成方法。The embodiment of the fifth aspect of the present application provides a computer program product. When the instructions in the computer program product are executed by the processor, the similar sentence generation based on the pre-trained language model proposed in the embodiment of the first aspect of the present application is executed. method.
本申请附加的方面和优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本申请的实践了解到。Additional aspects and advantages of the present application will be set forth, in part, in the following description, and in part will be apparent from the following description, or learned by practice of the present application.
附图说明Description of drawings
本申请上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解,其中:The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, wherein:
图1为本申请实施例一所提供的基于预训练语言模型的相似语句生成方法的流程示意图;1 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided in Embodiment 1 of the present application;
图2为本申请实施例二所提供的基于预训练语言模型的相似语句生成方法的流程示意图;2 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided in Embodiment 2 of the present application;
图3为本申请实施例三所提供的基于预训练语言模型的相似语句生成方法的流程示意图;3 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model provided in Embodiment 3 of the present application;
图4为本申请实施例中相似语句生成流程示意图;FIG. 4 is a schematic flow chart of generating a similar sentence in an embodiment of the present application;
图5为本申请实施例四所提供的基于预训练语言模型的相似语句生成装置的结构示意图;5 is a schematic structural diagram of an apparatus for generating similar sentences based on a pre-trained language model provided in Embodiment 4 of the present application;
图6示出了适于用来实现本申请实施方式的示例性电子设备或服务器的框图。Figure 6 shows a block diagram of an exemplary electronic device or server suitable for use in implementing embodiments of the present application.
具体实施方式Detailed ways
下面详细描述本申请的实施例,所述实施例的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,旨在用于解释本申请,而不能理解为对本申请的限制。The following describes in detail the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary, and are intended to be used to explain the present application, but should not be construed as a limitation to the present application.
针对需要投入大量人力和时间来编辑模版,每有新的问题类型加入就需要订制相应的模版,产生的句式固定,缺乏表达的多样性的问题,本申请实施例提出一种基于预训练语言模型的相似语句生成方法,通过获取待处理语句;将待处理语句输入已训练的生成模型,获取多个候选相似语句;根据待处理语句和多个候选相似语句,生成多个判别语句对;将多个判别语句对输入已训练的判别模型,获取判别结果,以及根据判别结果从多个候选相似语句中获取目标相似语句。由此,自动生成兼具形式多样且语义一致的相似问题,提高相似语句生成质量和效率。In view of the need to invest a lot of manpower and time to edit the template, the corresponding template needs to be customized every time a new question type is added, the generated sentence pattern is fixed, and the variety of expressions is lacking, the embodiment of the present application proposes a pre-training based method A method for generating similar sentences for language models, by obtaining the sentences to be processed; inputting the sentences to be processed into a trained generation model to obtain multiple candidate similar sentences; and generating multiple discriminative sentence pairs according to the sentences to be processed and the multiple candidate similar sentences; A plurality of discriminative sentence pairs are input into the trained discriminant model, a discriminant result is obtained, and a target similar sentence is obtained from a plurality of candidate similar sentences according to the discriminant result. As a result, similar questions with diverse forms and consistent semantics are automatically generated, and the quality and efficiency of similar sentences are improved.
下面参考附图描述本申请实施例的基于预训练语言模型的相似语句生成方法和装置。The method and apparatus for generating similar sentences based on a pretrained language model according to the embodiments of the present application will be described below with reference to the accompanying drawings.
图1为本申请实施例一所提供的基于预训练语言模型的相似语句生成方法的流程示意图。FIG. 1 is a schematic flowchart of a method for generating similar sentences based on a pre-trained language model according to Embodiment 1 of the present application.
本申请实施例的对话识别方法,可以应用于电子设备。其中,电子设备可以为任一具有计算能力的设备,例如可以为PC(Personal Computer,个人电脑)、移动终端等,移动终端例如可以为手机、平板电脑、个人数字助理、穿戴式设备等具有各种操作系统、触摸屏和/或显示屏的硬件设备。The dialog recognition method of the embodiment of the present application can be applied to an electronic device. The electronic device can be any device with computing capabilities, such as a PC (Personal Computer), a mobile terminal, etc., and the mobile terminal can be, for example, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, etc. An operating system, touch screen and/or display hardware device.
如图1所示,该基于预训练语言模型的相似语句生成方法可以包括以下步骤101至步骤104。As shown in FIG. 1 , the method for generating similar sentences based on a pretrained language model may include the following steps 101 to 104 .
步骤101,获取待处理语句。 Step 101 , acquiring the to-be-processed statement.
本申请实施例中,待处理语句可以理解为需要生成与其对应的多个相似语句,可以根据实际应用场景选择获取。In the embodiment of the present application, the to-be-processed statement can be understood as needing to generate a plurality of similar statements corresponding to it, which can be selected and acquired according to the actual application scenario.
举例而言,待处理语句可以为“产品A发展怎么样”、“可以介绍一下产品A不”等等。For example, the to-be-processed sentences can be "how is the development of product A", "can you introduce the product A is not" and so on.
步骤102,将待处理语句输入已训练的生成模型,获取多个候选相似语句。Step 102: Input the sentence to be processed into the trained generation model, and obtain a plurality of candidate similar sentences.
在本申请实施例中,生成模型为已训练的预训练语言模型,具体训练过程详见后续描述,此处不再详述。In the embodiment of the present application, the generation model is a pre-trained language model that has been trained. For details of the specific training process, please refer to the subsequent description, which will not be described in detail here.
在本申请实施例中,对待处理语句进行编码,获取编码向量;采用自回归方式,逐字生成候选相似语句;其中,获取每个候选相似字的概率分布,并从概率最高的前N个候选相似字中随机采样一个候选相似字字作为目标候选相似字,其中,N为正整数;根据每个待处理字的目标候选相似字生成候选相似语句。In the embodiment of the present application, the to-be-processed sentence is encoded to obtain an encoding vector; an autoregressive method is used to generate candidate similar sentences word by word; wherein, the probability distribution of each candidate similar word is obtained, and the top N candidates with the highest probability are obtained A candidate similar word is randomly sampled from the similar words as a target candidate similar word, where N is a positive integer; candidate similar sentences are generated according to the target candidate similar words of each word to be processed.
举例而言,输入待处理语句X,对X进行编码,然后采用随机采样策略生成候选相似语句,具体地,生成候选相似语句的过程是自左向右逐字生成的,由于生成每个字的过程都是从概率最高的N比如5个字中随机选择得到的,由此,同样的待处理语句输入 到生成模型每次输出的到的候选相似语句都不一样,多次重复这个过程就得到了多个候选相似语句。For example, input the sentence X to be processed, encode X, and then use a random sampling strategy to generate candidate similar sentences. Specifically, the process of generating candidate similar sentences is generated word by word from left to right. The process is randomly selected from the N with the highest probability, such as 5 words. Therefore, the same to-be-processed sentence is input to the generation model and the output of the candidate similar sentences is different each time. Repeat this process many times to get multiple candidate similar sentences.
由此,采用随机采样的方式,即每生成一个字都是以标准问题和已生成内容为条件,从当前条件分布中概率最高的多个候选相似字中随机选择,获取多个候选相似语句,提高生成表达形式的多样性。Therefore, the random sampling method is adopted, that is, each word generated is based on the standard question and the generated content, and is randomly selected from multiple candidate similar words with the highest probability in the current conditional distribution to obtain multiple candidate similar sentences. Increase the variety of generated expressions.
步骤103,根据待处理语句和多个候选相似语句,生成多个判别语句对。Step 103: Generate a plurality of discriminative sentence pairs according to the sentence to be processed and a plurality of candidate similar sentences.
步骤104,将多个判别语句对输入已训练的判别模型,获取判别结果,以及根据判别结果从多个候选相似语句中获取目标相似语句。 Step 104 , inputting a plurality of discriminative sentence pairs into the trained discriminant model, obtaining a discriminant result, and obtaining a target similar sentence from a plurality of candidate similar sentences according to the discriminant result.
在本申请实施例中,待处理语句与每个候选相似语句分别组成判别语句对,比如待处理语句X,多个候选相似语句为Y1-Y5,组成的判别语句对为(X Y1)、(X Y2)到(X Y5),获取5个判别语句对。In the embodiment of the present application, the to-be-processed statement and each candidate similar statement respectively form a discriminative statement pair, such as the to-be-processed statement X, the multiple candidate similar statements are Y1-Y5, and the formed discriminative statement pairs are (X Y1), ( X Y2) to (X Y5), get 5 discriminative sentence pairs.
在本申请实施例中,判别模型为已训练的基于机器翻译的双向编码表示BERT模块,具体训练过程详见后续描述,此处不再详述。In the embodiment of the present application, the discriminant model is a trained BERT module based on machine translation bidirectional coding. For details of the training process, please refer to the subsequent description, which will not be described in detail here.
在本申请实施例中,对每个判别语句对进行编码,获取多个判别向量,对每个判别向量进行预测,获取待处理语句和每个候选相似语句之间的相似度。In the embodiment of the present application, each discriminant sentence pair is encoded, a plurality of discriminant vectors are obtained, each discriminant vector is predicted, and the similarity between the to-be-processed sentence and each candidate similar sentence is obtained.
具体地,判别模型的输入是(待处理语句,候选相似句)组成的语句对,判别模型对语句对进行编码,分类预测语句对是否是相似句,在是相似句的情况下,获取对应的候选相似句为目标相似语句。Specifically, the input of the discriminant model is a sentence pair composed of (to-be-processed sentence, candidate similar sentence), the discriminant model encodes the sentence pair, and classifies and predicts whether the sentence pair is a similar sentence, and in the case of a similar sentence, obtain the corresponding Candidate similar sentences are target similar sentences.
本申请实施例的基于预训练语言模型的相似语句生成方法,通过获取待处理语句;将待处理语句输入已训练的生成模型,获取多个候选相似语句;根据待处理语句和多个候选相似语句,生成多个判别语句对;将多个判别语句对输入已训练的判别模型,获取判别结果,以及根据判别结果从多个候选相似语句中获取目标相似语句。由此,自动生成兼具形式多样且语义一致的相似问题,提高相似语句生成质量和效率。The method for generating similar sentences based on a pre-trained language model according to the embodiment of the present application obtains a statement to be processed; inputs the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences; , generate multiple discriminative sentence pairs; input the multiple discriminative sentence pairs into the trained discriminant model, obtain the discriminant result, and obtain the target similar sentence from the multiple candidate similar sentences according to the discriminant result. As a result, similar questions with diverse forms and consistent semantics are automatically generated, and the quality and efficiency of similar sentences are improved.
在本申请实施例的一种可能的实现方式中,采用预训练语言模UniLM(UNIfied pre-trained Language Model,统一预训练语言模型)作为生成模型,生成高质量文本,利用BERT(Bidirectional Encoder Representation from Transformers即基于机器翻译的双向编码表示)作为判别模型,过滤不合格的生成文本,具体结合图2和图3进行详细描述训练过程。In a possible implementation of the embodiment of the present application, a pre-trained language model UniLM (UNIfied pre-trained Language Model, unified pre-trained language model) is used as a generation model to generate high-quality text, using BERT (Bidirectional Encoder Representation from Transformers (two-way encoding representation based on machine translation) is used as a discriminant model to filter unqualified generated texts. The training process is described in detail with reference to Figures 2 and 3.
FIG. 2 is a schematic flowchart of a similar sentence generation method based on a pre-trained language model provided in Embodiment 2 of the present application.
As shown in FIG. 2, the method may include the following steps 201 to 204.
Step 201: obtain a general-domain similar question dataset.
Step 202: input the general-domain similar question dataset into the pre-trained language model for training to obtain a first training similar sentence; calculate a first error between the first training sentence and a first standard sentence through a loss function; and adjust the parameters of the pre-trained language model until the first error is smaller than a preset threshold, generating a candidate generation model.
In an embodiment of the present application, the encoder (UniLM) performs task transfer for similar question generation on the general-domain similar question dataset. The dataset can be obtained in many ways, for example by crawling similar questions recommended on forums and question-answering websites. Maximum likelihood estimation is used for the transfer learning of the similar question generation task until the pre-trained language model converges. Because the training data is crawled from the web, no manual annotation is needed, which improves training efficiency.
In an embodiment of the present application, the UniLM model, open-sourced by Microsoft, is a pre-trained language model based on the Transformer (deep self-attention transformation network) architecture that combines natural language understanding and generation capabilities. UniLM pre-training uses multi-task learning that combines autoencoding and autoregression, with two tasks: masked language modeling (MLM) and sequence-to-sequence (seq2seq) prediction. It can therefore handle both understanding-type and generation-type downstream tasks; that is, each word in a training sentence can be randomly masked before encoding-decoding training, improving the quality of subsequent generation.
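To make the seq2seq variant concrete, the sketch below reconstructs the style of self-attention mask UniLM uses for generation: source tokens attend bidirectionally among themselves, while each target token attends to the full source and only to earlier target tokens. This is an illustrative reconstruction under assumed dimensions, not code from the UniLM release.

```python
# Illustrative construction of a UniLM-style seq2seq self-attention mask:
# a (src_len + tgt_len) x (src_len + tgt_len) matrix where 1 = "may attend".
import torch

def unilm_seq2seq_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.long)
    mask[:, :src_len] = 1                       # every position sees the full source
    mask[src_len:, src_len:] = torch.tril(      # target positions attend causally
        torch.ones(tgt_len, tgt_len, dtype=torch.long))
    return mask                                 # source never peeks at the target

print(unilm_seq2seq_mask(3, 2))
```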
In an embodiment of the present application, UniLM is a pre-trained model whose original pre-training tasks do not include similar question generation. The present application uses the UniLM parameters as initialization and trains the similar question generation task on top of them, i.e., transfer learning. The training objective is to maximize the likelihood of the generated target sequence; when the value of the objective function no longer changes, or changes by less than a certain threshold, the pre-trained language model is considered to have converged, training can be stopped, and the candidate generation model is generated.
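A schematic maximum-likelihood training loop consistent with this convergence test might look as follows; `model`, `data_loader`, and `optimizer` are assumed stand-ins for the UniLM generator, the crawled dataset, and any standard optimizer, and the loop stops once the objective changes by less than a small threshold.

```python
# Schematic MLE transfer-learning loop (assumed names, not the actual
# training script): maximize target-sequence likelihood, i.e. minimize NLL,
# and stop once the objective changes by less than a small threshold.
import math

def train_until_converged(model, data_loader, optimizer, epsilon=1e-4, max_epochs=50):
    prev_loss = math.inf
    for epoch in range(max_epochs):
        total, steps = 0.0, 0
        for batch in data_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss        # cross-entropy NLL over target tokens
            loss.backward()
            optimizer.step()
            total, steps = total + loss.item(), steps + 1
        avg = total / max(steps, 1)
        if abs(prev_loss - avg) < epsilon:    # objective no longer changing
            break
        prev_loss = avg
    return model
```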
Step 203: obtain a target-domain similar question dataset.
Step 204: input the target-domain similar question dataset into the candidate generation model for training to obtain a second training similar sentence; calculate a second error between the second training sentence and a second standard sentence through the loss function; and adjust the parameters of the candidate generation model until the second error is smaller than a preset threshold, generating the trained generation model.
In an embodiment of the present application, to make the trained pre-trained language model better fit the target domain, where the target domain can be selected according to the application scenario, for example the customer service domain, the encoder (UniLM) can be fine-tuned for the similar question generation task on the FAQ similar question database currently used by the customer service business.
In an embodiment of the present application, a relatively small target-domain similar question dataset can be input into the candidate generation model for maximum likelihood estimation; when the negative log-likelihood of the target sequence falls below a preset likelihood threshold, the trained generation model is generated.
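The domain-transfer step differs mainly in its stopping rule; a hedged sketch, reusing the same assumed names as above, stops once the average negative log-likelihood falls below the preset likelihood threshold.

```python
# Schematic domain fine-tuning: start from the candidate generation model and
# train on a small in-domain set until the average NLL drops below a preset
# likelihood threshold. All names are illustrative assumptions.
def finetune_to_threshold(model, domain_loader, optimizer,
                          nll_threshold=2.0, max_epochs=20):
    for epoch in range(max_epochs):
        total, steps = 0.0, 0
        for batch in domain_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss        # negative log-likelihood
            loss.backward()
            optimizer.step()
            total, steps = total + loss.item(), steps + 1
        if total / max(steps, 1) < nll_threshold:
            break                             # in-domain likelihood is good enough
    return model
```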
Therefore, the present application first uses a large amount of easily obtained supervised data for transfer of the similar task, and then uses a small amount of existing business data, together with the small amount of labeled data obtained while screening usable data, for domain transfer. Ideal business metrics are thus achieved with minimal annotation cost, and processing efficiency is improved.
FIG. 3 is a schematic flowchart of a similar sentence generation method based on a pre-trained language model provided in Embodiment 3 of the present application.
As shown in FIG. 3, the method may include the following steps 301 to 304.
Step 301: obtain a similar sentence pair dataset.
Step 302: input the similar sentence pair dataset into a BERT-based bidirectional encoder representation module for training, generating a candidate discriminant model.
In an embodiment of the present application, the discriminant model (BERT) performs task transfer for similar question discrimination on a financial semantic similarity dataset, and the discriminator (BERT) is fine-tuned for the similar question discrimination task on the FAQs and similar questions currently used by the customer service business.
Specifically, the discriminant model BERT is constructed and trained for similar question discrimination on publicly available similar question corpora. The BERT model, open-sourced by Google, is a pre-trained language model based on the Transformer architecture, applied mainly to natural language understanding tasks. BERT pre-training uses autoencoding multi-task learning with two tasks: masked language modeling (MLM) and next sentence prediction (NSP). BERT can serve as the initialization of a downstream task model; the effect of the present application can be achieved simply by adding a simple task-specific output layer structure and fine-tuning on a small amount of labeled data.
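As a hedged sketch of the "simple output layer" idea, the following adds a two-way classification head over BERT's pooled [CLS] representation and runs one fine-tuning step on a labeled sentence pair; the `bert-base-chinese` checkpoint and the label convention (1 = similar) are assumptions for illustration only.

```python
# Sketch of fine-tuning BERT for similar-question discrimination: a
# 2-way classification head on top of the pooled [CLS] vector.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)        # adds the task-specific output layer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative step on a labeled sentence pair (1 = similar, 0 = not).
batch = tokenizer(["怎么还款"], ["如何进行还款"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss     # cross-entropy on the pair label
loss.backward()
optimizer.step()
```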
Transfer learning of the similar question discrimination task is performed on an easily obtained public dataset, such as a financial semantic similarity dataset, so no manual data annotation is required, which improves training efficiency.
Specifically, the FAQ similar question data accumulated by the customer service business is used for domain transfer with the maximum likelihood estimation method, so that the discriminant model learns the data distribution of the customer service business. The amount of data used is far smaller than the training data scale used in generating the candidate discriminant model, further improving training efficiency.
Step 303: obtain positive samples and negative samples of similar sentence pairs in the target domain.
Step 304: input the positive samples and negative samples of similar sentence pairs into the candidate discriminant model for training, generating the trained discriminant model.
Specifically, the similar question discrimination task is fine-tuned on the FAQ similar questions accumulated by the customer service business. In addition to these FAQ similar questions, training the discriminant model also requires, as negative examples for domain transfer, the unusable data labeled while screening usable similar questions, so that the discriminant model learns the operators' discrimination criteria. The amount of data required is also small, further improving training efficiency.
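One plausible way to assemble this fine-tuning set is sketched below under assumed inputs: operator-approved FAQ rewrites become positive pairs, and the candidates rejected during screening become negatives.

```python
# Hedged sketch: build (sentence_a, sentence_b, label) triples for the
# discriminant model. `faq_similar` and `rejected` are assumed inputs:
# lists of (standard_question, rewrite) pairs kept or discarded by operators.
def build_pairs(faq_similar, rejected):
    examples = []
    for std_q, rewrite in faq_similar:
        examples.append((std_q, rewrite, 1))   # positive: approved similar question
    for std_q, rewrite in rejected:
        examples.append((std_q, rewrite, 0))   # negative: rejected during screening
    return examples
```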
Based on the above embodiments, the present application uses the pre-trained language model UniLM as the generation model to generate high-quality text, and uses BERT as the discriminant model to filter out unqualified generated text. As shown in FIG. 4, the sentence to be processed is input into the generation model to obtain candidate similar sentences, which are then filtered by the discriminant model to obtain target similar sentences that meet the standard. Similar questions that are both diverse in form and consistent in semantics are thus generated automatically, improving the quality and efficiency of similar sentence generation.
Corresponding to the similar sentence generation methods based on a pre-trained language model provided in the embodiments of FIG. 1 to FIG. 4 above, the present application further provides a similar sentence generation apparatus based on a pre-trained language model. Since the apparatus provided in the embodiments of the present application corresponds to the methods provided in the embodiments of FIG. 1 to FIG. 4, the implementations of the method also apply to the apparatus, and are not described in detail again in the embodiments of the present application.
FIG. 5 is a schematic structural diagram of a similar sentence generation apparatus based on a pre-trained language model provided in Embodiment 4 of the present application.
As shown in FIG. 5, the similar sentence generation apparatus 500 based on a pre-trained language model is applied to an electronic device and includes: a first acquisition module 501, a first processing module 502, a first generation module 503, a second processing module 504, and a second acquisition module 505.
The first acquisition module 501 is configured to obtain a sentence to be processed.
The first processing module 502 is configured to input the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences.
The first generation module 503 is configured to generate a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences.
The second processing module 504 is configured to input the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result.
The second acquisition module 505 is configured to obtain a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
Further, in a possible implementation of the embodiments of the present application, the first processing module 502 is specifically configured to: encode the sentence to be processed to obtain an encoding vector; and decode the encoding vector, generating candidate similar sentences autoregressively. At each step, the probability distribution over candidate similar words is obtained, and one candidate similar word is randomly sampled from the top N candidate similar words with the highest probability as the target candidate similar word, where N is a positive integer; the candidate similar sentences are generated according to the target candidate similar words.
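The top-N sampling rule of the decoder can be written compactly: keep the N highest-probability next-word candidates, renormalize, and draw one at random. The sketch below assumes `logits` is the generation model's output for the current decoding position.

```python
# Top-N (top-k) sampling over one decoding step, as described above:
# restrict to the N most probable candidate words, then sample one at random
# in proportion to their probabilities.
import torch

def sample_top_n(logits: torch.Tensor, n: int = 5) -> int:
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, n)   # N best candidate words
    top_probs = top_probs / top_probs.sum()     # renormalize over the top N
    choice = torch.multinomial(top_probs, 1)    # random draw among them
    return int(top_ids[choice])

# Usage with a dummy logits vector (21128 is e.g. the BERT-Chinese vocab size).
next_token_id = sample_top_n(torch.randn(21128), n=5)
```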
Further, in a possible implementation of the embodiments of the present application, the second processing module 504 is specifically configured to: encode each discriminative sentence pair to obtain a plurality of discriminant vectors; and make a prediction on each discriminant vector to obtain the similarity between the sentence to be processed and each candidate similar sentence.
Further, in a possible implementation of the embodiments of the present application, the similar sentence generation apparatus 500 based on a pre-trained language model may further include:
a third acquisition module, configured to obtain a general-domain similar question dataset; a second generation module, configured to input the general-domain similar question dataset into a pre-trained language model for training to obtain a first training similar sentence, calculate a first error between the first training sentence and a first standard sentence through a loss function, and adjust the parameters of the pre-trained language model until the first error is smaller than a preset threshold, generating a candidate generation model; a fourth acquisition module, configured to obtain a target-domain similar question dataset; and a third generation module, configured to input the target-domain similar question dataset into the candidate generation model for training to obtain a second training similar sentence, calculate a second error between the second training sentence and a second standard sentence through the loss function, and adjust the parameters of the candidate generation model until the second error is smaller than the preset threshold, generating the trained generation model.
Further, in a possible implementation of the embodiments of the present application, the similar sentence generation apparatus 500 based on a pre-trained language model may further include:
a fifth acquisition module, configured to obtain a similar sentence pair dataset; a fourth generation module, configured to input the similar sentence pair dataset into a BERT-based bidirectional encoder representation module for training, generating a candidate discriminant model; a sixth acquisition module, configured to obtain positive samples and negative samples of similar sentence pairs in the target domain; and a fifth generation module, configured to input the positive samples and negative samples of similar sentence pairs into the candidate discriminant model for training until the similarity of the target sequence is smaller than a preset similarity threshold, generating the trained discriminant model.
With the similar sentence generation apparatus based on a pre-trained language model of the embodiments of the present application, a sentence to be processed is obtained; the sentence is input into a trained generation model to obtain a plurality of candidate similar sentences; a plurality of discriminative sentence pairs are generated according to the sentence to be processed and the candidates; the pairs are input into a trained discriminant model to obtain a discrimination result; and target similar sentences are obtained from the candidates according to the discrimination result. Similar questions that are both diverse in form and consistent in semantics are thus generated automatically, improving the quality and efficiency of similar sentence generation.
To implement the above embodiments, the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the similar sentence generation method based on a pre-trained language model proposed in any of the foregoing embodiments of FIG. 1 to FIG. 4 of the present application is implemented.
To implement the above embodiments, the present application further provides a non-transitory computer-readable storage medium having a computer program stored thereon; when the program is executed by a processor, the similar sentence generation method based on a pre-trained language model proposed in any of the foregoing embodiments of the present application is implemented.
To implement the above embodiments, the present application further provides a computer program product; when instructions in the computer program product are executed by a processor, the similar sentence generation method based on a pre-trained language model proposed in any of the foregoing embodiments of the present application is performed.
FIG. 6 shows a block diagram of an exemplary electronic device or server suitable for implementing the embodiments of the present application. The electronic device or server 12 shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in FIG. 6, the electronic device or server 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the various system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The electronic device or server 12 typically includes a variety of computer-system-readable media. These may be any available media accessible by the electronic device or server 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The electronic device or server 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive"). Although not shown in FIG. 6, a disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present application.
The electronic device or server 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device or server 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device or server 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, the electronic device or server 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device or server 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device or server 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the methods mentioned in the foregoing embodiments.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic statements of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine different embodiments or examples described in this specification, and features thereof.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, such as two or three, unless otherwise expressly and specifically defined.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code including one or more executable instructions for implementing steps of a custom logic function or process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logic functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection portion (electronic device) having one or more wirings, a portable computer disk cartridge (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner where necessary, and then stored in a computer memory.
It should be understood that parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of the following technologies known in the art, or a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by instructing relevant hardware through a program; the program may be stored in a computer-readable storage medium, and when executed, the program includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present application; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.

Claims (13)

  1. A similar sentence generation method based on a pre-trained language model, comprising:
    obtaining a sentence to be processed;
    inputting the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences;
    generating a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences; and
    inputting the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result, and obtaining a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  2. The method of claim 1, wherein inputting the sentence to be processed into the trained generation model to obtain the plurality of candidate similar sentences comprises:
    encoding the sentence to be processed to obtain an encoding vector; and
    decoding the encoding vector and generating the candidate similar sentences autoregressively, wherein a probability distribution over candidate similar words is obtained at each step, one candidate similar word is randomly sampled from the top N candidate similar words with the highest probability as a target candidate similar word, N being a positive integer, and the candidate similar sentences are generated according to the target candidate similar words.
  3. The method of claim 1, wherein inputting the plurality of discriminative sentence pairs into the trained discriminant model to obtain the discrimination result comprises:
    encoding each discriminative sentence pair to obtain a plurality of discriminant vectors; and
    making a prediction on each discriminant vector to obtain a similarity between the sentence to be processed and each candidate similar sentence.
  4. The method of claim 1, further comprising, before inputting the sentence to be processed into the trained generation model:
    obtaining a general-domain similar question dataset;
    inputting the general-domain similar question dataset into a pre-trained language model for training to obtain a first training similar sentence, calculating a first error between the first training sentence and a first standard sentence through a loss function, and adjusting parameters of the pre-trained language model until the first error is smaller than a preset threshold, generating a candidate generation model;
    obtaining a target-domain similar question dataset; and
    inputting the target-domain similar question dataset into the candidate generation model for training to obtain a second training similar sentence, calculating a second error between the second training sentence and a second standard sentence through the loss function, and adjusting parameters of the candidate generation model until the second error is smaller than the preset threshold, generating the trained generation model.
  5. The method of claim 1, further comprising, before inputting the plurality of discriminative sentence pairs into the trained discriminant model:
    obtaining a similar sentence pair dataset;
    inputting the similar sentence pair dataset into a BERT-based bidirectional encoder representation module for training, generating a candidate discriminant model;
    obtaining positive samples and negative samples of similar sentence pairs in a target domain; and
    inputting the positive samples and negative samples of similar sentence pairs into the candidate discriminant model for training, generating the trained discriminant model.
  6. A similar sentence generation apparatus based on a pre-trained language model, comprising:
    a first acquisition module, configured to obtain a sentence to be processed;
    a first processing module, configured to input the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences;
    a first generation module, configured to generate a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences;
    a second processing module, configured to input the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result; and
    a second acquisition module, configured to obtain a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  7. The apparatus of claim 6, wherein the first processing module is further configured to:
    encode the sentence to be processed to obtain an encoding vector; and
    decode the encoding vector and generate the candidate similar sentences autoregressively, wherein a probability distribution over candidate similar words is obtained at each step, one candidate similar word is randomly sampled from the top N candidate similar words with the highest probability as a target candidate similar word, N being a positive integer, and the candidate similar sentences are generated according to the target candidate similar words.
  8. The apparatus of claim 6, wherein the second processing module is further configured to:
    encode each discriminative sentence pair to obtain a plurality of discriminant vectors; and
    make a prediction on each discriminant vector to obtain a similarity between the sentence to be processed and each candidate similar sentence.
  9. The apparatus of claim 6, further comprising:
    a third acquisition module, configured to obtain a general-domain similar question dataset;
    a second generation module, configured to input the general-domain similar question dataset into a pre-trained language model for training to obtain a first training similar sentence, calculate a first error between the first training sentence and a first standard sentence through a loss function, and adjust parameters of the pre-trained language model until the first error is smaller than a preset threshold, generating a candidate generation model;
    a fourth acquisition module, configured to obtain a target-domain similar question dataset; and
    a third generation module, configured to input the target-domain similar question dataset into the candidate generation model for training to obtain a second training similar sentence, calculate a second error between the second training sentence and a second standard sentence through the loss function, and adjust parameters of the candidate generation model until the second error is smaller than the preset threshold, generating the trained generation model.
  10. The apparatus of claim 6, further comprising:
    a fifth acquisition module, configured to obtain a similar sentence pair dataset;
    a fourth generation module, configured to input the similar sentence pair dataset into a BERT-based bidirectional encoder representation module for training, generating a candidate discriminant model;
    a sixth acquisition module, configured to obtain positive samples and negative samples of similar sentence pairs in a target domain; and
    a fifth generation module, configured to input the positive samples and negative samples of similar sentence pairs into the candidate discriminant model for training, generating the trained discriminant model.
  11. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the following steps are implemented:
    obtaining a sentence to be processed;
    inputting the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences;
    generating a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences; and
    inputting the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result, and obtaining a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  12. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the following steps are implemented:
    obtaining a sentence to be processed;
    inputting the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences;
    generating a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences; and
    inputting the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result, and obtaining a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
  13. A computer program product, wherein when instructions in the computer program product are executed by a processor, the following steps are performed:
    obtaining a sentence to be processed;
    inputting the sentence to be processed into a trained generation model to obtain a plurality of candidate similar sentences;
    generating a plurality of discriminative sentence pairs according to the sentence to be processed and the plurality of candidate similar sentences; and
    inputting the plurality of discriminative sentence pairs into a trained discriminant model to obtain a discrimination result, and obtaining a target similar sentence from the plurality of candidate similar sentences according to the discrimination result.
PCT/CN2022/075657 2021-03-12 2022-02-09 Similar sentence generation method and apparatus based on pre-trained language model WO2022188584A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110270871.5A CN113807074A (en) 2021-03-12 2021-03-12 Similar statement generation method and device based on pre-training language model
CN202110270871.5 2021-03-12

Publications (1)

Publication Number Publication Date
WO2022188584A1 true WO2022188584A1 (en) 2022-09-15

Family

ID=78892914

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075657 WO2022188584A1 (en) 2021-03-12 2022-02-09 Similar sentence generation method and apparatus based on pre-trained language model

Country Status (2)

Country Link
CN (1) CN113807074A (en)
WO (1) WO2022188584A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model
CN114357974B (en) * 2021-12-28 2022-09-23 北京海泰方圆科技股份有限公司 Similar sample corpus generation method and device, electronic equipment and storage medium
CN114817517B (en) * 2022-05-30 2022-12-20 北京海天瑞声科技股份有限公司 Corpus acquisition method and device, electronic equipment and storage medium
CN117291181A (en) * 2022-06-17 2023-12-26 华为云计算技术有限公司 Statement generation method, device and storage medium
CN116955590B (en) * 2023-09-20 2023-12-08 成都明途科技有限公司 Training data screening method, model training method and text generation method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180018577A1 (en) * 2016-07-12 2018-01-18 International Business Machines Corporation Generating training data for machine learning
CN110245219A (en) * 2019-04-25 2019-09-17 义语智能科技(广州)有限公司 A kind of answering method and equipment based on automatic extension Q & A database
CN110765758A (en) * 2019-11-04 2020-02-07 北京小米智能科技有限公司 Method, device and medium for generating synonym sentence generation model
CN111400470A (en) * 2020-03-13 2020-07-10 深圳市腾讯计算机系统有限公司 Question processing method and device, computer equipment and storage medium
CN111695356A (en) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 Synonym corpus generation method, synonym corpus generation device, computer system and readable storage medium
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135237B2 (en) * 2011-07-13 2015-09-15 Nuance Communications, Inc. System and a method for generating semantically similar sentences for building a robust SLM
CN109710915B (en) * 2017-10-26 2021-02-23 华为技术有限公司 Method and device for generating repeated statement
CN109033390B (en) * 2018-07-27 2020-02-18 深圳追一科技有限公司 Method and device for automatically generating similar question sentences
CN111046147A (en) * 2018-10-11 2020-04-21 马上消费金融股份有限公司 Question answering method and device and terminal equipment
CN110162604B (en) * 2019-01-24 2023-09-12 腾讯科技(深圳)有限公司 Statement generation method, device, equipment and storage medium
CN111368024A (en) * 2020-02-14 2020-07-03 深圳壹账通智能科技有限公司 Text semantic similarity analysis method and device and computer equipment


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497633A (en) * 2022-10-19 2022-12-20 联仁健康医疗大数据科技股份有限公司 Data processing method, device, equipment and storage medium
CN115497633B (en) * 2022-10-19 2024-01-30 联仁健康医疗大数据科技股份有限公司 Data processing method, device, equipment and storage medium
CN117332180A (en) * 2023-12-01 2024-01-02 浙商期货有限公司 Method, equipment and storage medium for intelligent writing of research report based on large language model
CN117332180B (en) * 2023-12-01 2024-03-12 浙商期货有限公司 Method, equipment and storage medium for intelligent writing of research report based on large language model

Also Published As

Publication number Publication date
CN113807074A (en) 2021-12-17


Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 22766111; country of ref document: EP; kind code of ref document: A1)
NENP Non-entry into the national phase (ref country code: DE)
32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established (free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.01.2024))