WO2023045725A1 - Method, electronic device, and computer program product for dataset creation - Google Patents

Method, electronic device, and computer program product for dataset creation

Info

Publication number
WO2023045725A1
WO2023045725A1 PCT/CN2022/116381
Authority
WO
WIPO (PCT)
Prior art keywords
premise
statements
statement
target
conclusion
Prior art date
Application number
PCT/CN2022/116381
Other languages
English (en)
French (fr)
Inventor
张欣勃
袁莉萍
周浩
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023045725A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models

Definitions

  • Embodiments of the present disclosure relate generally to data processing systems, and more particularly, to a method, electronic device, and computer program product for dataset creation.
  • a scheme for dataset creation is provided.
  • A computer-implemented method includes: obtaining a set of first premise statements and a set of second premise statements associated with the set of first premise statements; generating a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating a correlation between the set of first premise statements and the set of second premise statements; and determining a target data set based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  • In a second aspect of the present disclosure, an electronic device includes at least one processing unit and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit.
  • The instructions, when executed by the at least one processing unit, cause the device to perform the following actions: obtain a set of first premise statements and a set of second premise statements associated with the set of first premise statements; generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating a correlation between the set of first premise statements and the set of second premise statements; and determine a target data set based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  • An apparatus for dataset creation includes: an acquisition module configured to acquire a set of first premise statements and a set of second premise statements associated with the set of first premise statements; a generation module configured to generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating a correlation between the set of first premise statements and the set of second premise statements; and a determination module configured to determine a target data set based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  • A computer-readable storage medium is provided.
  • A computer program is stored on the medium; when the program is executed by a processor, the method of the first aspect is realized.
  • Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented.
  • Figure 2 shows a schematic diagram of a process of creating a data set according to some embodiments of the present disclosure.
  • Figure 3 shows a schematic diagram of a process of creating a data set according to some embodiments of the present disclosure.
  • Figure 4 shows a flowchart of a process of creating a data set according to some embodiments of the present disclosure.
  • Figure 5 shows a block diagram of an apparatus for creating a data set according to some embodiments of the present disclosure.
  • Figure 6 shows a block diagram of a device capable of implementing various embodiments of the present disclosure.
  • A model can learn the relationship between inputs and corresponding outputs from training data, so that after training is completed a corresponding output can be generated for a given input.
  • The generation of the model may be based on machine learning techniques.
  • Deep learning is a class of machine learning algorithms that uses multiple layers of processing units to process input and provide corresponding output.
  • A neural network model is an example of a deep-learning-based model.
  • A "model" may also be referred to herein as a "machine learning model," "learning model," "machine learning network," or "learning network," and these terms are used interchangeably herein.
  • A "neural network" is a machine learning network based on deep learning.
  • A neural network is capable of processing input and providing a corresponding output, and generally includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer.
  • Neural networks used in deep learning applications typically include many hidden layers, which increase the depth of the network.
  • The layers of a neural network are connected in sequence, so that the output of a previous layer is provided as the input of a subsequent layer; the input layer receives the input of the neural network, and the output of the output layer serves as the final output of the neural network.
  • Each layer of a neural network consists of one or more nodes (also known as processing nodes or neurons), each of which processes input from the previous layer.
  • Machine learning can roughly include three phases: a training phase, a testing phase, and an application phase (also known as the inference phase).
  • In the training phase, a given model can be trained using a large amount of training data, with parameter values updated iteratively until the model can draw, from the training data, consistent inferences that meet the expected goal.
  • Through training, a model can be thought of as learning the association from input to output (also known as the input-to-output mapping) from the training data.
  • After training, the parameter values of the trained model are determined.
  • In the testing phase, test inputs are applied to the trained model to test whether the model can provide the correct output, thereby determining the performance of the model.
  • In the application phase, the model can be used to process actual input and determine the corresponding output based on the parameter values obtained by training.
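The three phases can be sketched with a toy one-parameter model y = w * x fit by gradient descent. The choice of model is an assumption for illustration only; the disclosure itself concerns language models.

```python
def train(pairs, lr=0.05, epochs=200):
    """Training phase: iteratively update the parameter until the model
    captures the input-to-output mapping in the training data."""
    w = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            w -= lr * 2 * (w * x - y) * x  # gradient of squared error
    return w

w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])

# Testing phase: apply a held-out input and check the output.
test_error = abs(w * 4.0 - 8.0)

# Application (inference) phase: parameter values are now fixed;
# process actual input to produce the corresponding output.
prediction = w * 5.0
```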
  • Figure 1 shows a schematic diagram of an example environment in which embodiments of the disclosure can be implemented.
  • The example environment 100 may include a computing device 110.
  • The computing device 110 may perform processing on data.
  • Data processing may include, for example, data collection, data analysis, data segment extraction, data segment transformation, data screening, and data set generation.
  • The computing device 110 may acquire, look up, or search for target data from a knowledge base 120. For example, when the computing device 110 is building a natural-language-based data set, it may obtain a plurality of natural language sentences from the knowledge base 120 as collected data. The computing device 110 may also search the knowledge base 120 for required target data based on, for example, certain specific sentence elements.
  • The computing device 110 may also, for example, classify, transform, filter, or label the collected data.
  • The computing device 110 may generate the desired data set based on the processed data.
  • The generated data set can be sent by the computing device 110 to a language training model 130 as input, so as to realize the desired learning effect for the language training model 130 based on the data set.
  • The computing device 110 illustrated in the example environment 100 of FIG. 1 may be any computing device capable of data processing, including but not limited to a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), or a media player), a multiprocessor system, consumer electronics, a minicomputer, a mainframe computer, a distributed computing environment including any of the above systems or devices, and the like.
  • Computing systems suitable for implementing the example embodiments described in this disclosure may include one or more different components, other components, and/or different arrangements.
  • computing device 110 and language training model 130 may be integrated in the same system or device. Embodiments of the present disclosure are not limited in this respect.
  • the computing device 110 may obtain a set of first premise statements and a set of second premise statements associated with the set of first premise statements.
  • Computing device 110 may also generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements.
  • the plurality of conclusion statements indicates a correlation between the set of first premise statements and the set of second premise statements.
  • Computing device 110 may also determine a target data set based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  • Fig. 2 shows a schematic diagram of the process of creating a data set according to some embodiments of the present disclosure.
  • computing device 110 may obtain one or more first premise statements 210 .
  • the first premise statement 210 may be, for example, a natural language statement.
  • The computing device 110 may acquire the one or more first premise statements 210 from a natural language knowledge base.
  • Computing device 110 may also obtain corresponding second premise statements 220 associated with one or more first premise statements 210 . It should be understood that one or more second premise statements 220 may be obtained for the same first premise statement 210 .
  • The second premise statement 220 may be, for example, a natural language statement.
  • the computing device 110 may extract any segment from the one or more first premise statements 210 as keywords. Based on the keyword and the semantics of the one or more first premise sentences 210, the computing device 110 may search the natural language knowledge base for a natural language sentence related to the first premise sentence as the second premise sentence.
  • For example, suppose the first premise statement is "In the food chain, green plants play the role of producers and can provide food for consumers." If "green plant" is used as the extracted keyword, the second premise statement may be "Green plants provide food for consumers through photosynthesis."
  • It should be understood that for different keywords extracted from the same first premise statement, the second premise statement acquired by the computing device 110 may differ. It should also be understood that for the same keyword extracted from the first premise statement, the computing device 110 may obtain multiple different second premise statements.
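The retrieval step above can be sketched as follows, assuming a small in-memory knowledge base and plain keyword matching. The disclosure leaves the search mechanism open; a real system would also use the semantics of the first premise statement, not just the keyword.

```python
# Hypothetical knowledge base standing in for knowledge base 120.
KNOWLEDGE_BASE = [
    "Green plants provide food for consumers through photosynthesis.",
    "Water evaporates faster at higher temperatures.",
    "Consumers in a food chain depend on producers for energy.",
]

def find_second_premises(keyword: str) -> list:
    """Return knowledge-base sentences that mention the keyword."""
    kw = keyword.lower()
    return [s for s in KNOWLEDGE_BASE if kw in s.lower()]

# Using the keyword extracted from the first premise statement:
candidates = find_second_premises("green plant")
```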
  • Computing device 110 may generate a plurality of conclusion statements associated with the first premise statements and the second premise statements.
  • the concluding statement may indicate a correlation between a first premise statement of the one or more first premise statements and a corresponding second premise statement of the one or more second premise statements.
  • Continuing the example above, the conclusion statement generated by the computing device 110 may be "In the food chain, green plants play the role of producers through photosynthesis."
  • the conclusion statement can be given, for example, by a set of associations between reference premise statements.
  • the correlation may, for example, involve a pre-trained model used to characterize the correlation between multiple premise sentences.
  • The first premise statement obtained by the computing device 110 and the second premise statement associated with it can be used as the input of the model, and the output of the model can be used to indicate the correlation between the first premise statement and the second premise statement.
  • Alternatively, the conclusion statements for the acquired one or more first premise statements 210 and the corresponding one or more second premise statements 220 may be determined through manual annotation.
  • In that case, the conclusion statement may be entered into computing device 110, for example.
  • Based on the first premise statement 210, the second premise statement 220 associated with the first premise statement 210, and the conclusion statement indicating the correlation between the first premise statement 210 and the second premise statement 220, the computing device 110 can construct one of the data entries of the data set to be generated.
  • The data entry may have, for example, the following format: <Premise 1, Premise 2, Conclusion>.
  • If a conclusion statement can be obtained based on the first premise statement 210 and the second premise statement 220 associated with it, then the conclusion field in the above format is marked with the conclusion statement describing the correlation between the first premise statement 210 and the second premise statement 220.
  • If a conclusion statement cannot be obtained based on the first premise statement 210 and the second premise statement 220 associated with it, then "no valid conclusion" is marked at the conclusion field in the above format.
  • Computing device 110 may generate a data set based on the one or more first premise statements 210, the corresponding second premise statements 220 associated with them, and the corresponding conclusion statements. The data set may include a plurality of data entries, each consisting of a first premise statement, a second premise statement associated with the first premise statement, and a conclusion statement indicating the correlation between them.
  • Data set 230 generated by computing device 110 may include entries 231 through 23N. If a conclusion can be inferred from the first premise statement and the second premise statement, the specific conclusion can be identified at the conclusion statement field, as shown in entry 231 in FIG. 2. If a conclusion cannot be deduced from the first premise statement and the second premise statement, then "no valid conclusion" can be marked at the conclusion statement field, as in entry 232 shown in FIG. 2.
  • data set 230 may include any number of data entries and is not limited to the example shown in FIG. 2 .
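The <Premise 1, Premise 2, Conclusion> entry format, including the "no valid conclusion" marker, can be sketched as below. The class and function names are illustrative, not from the patent.

```python
from dataclasses import dataclass
from typing import Optional

NO_VALID_CONCLUSION = "no valid conclusion"

@dataclass
class DataEntry:
    premise1: str
    premise2: str
    conclusion: str

def make_entry(premise1: str, premise2: str,
               conclusion: Optional[str]) -> DataEntry:
    """Build one data entry; mark "no valid conclusion" when none exists."""
    return DataEntry(premise1, premise2, conclusion or NO_VALID_CONCLUSION)

entry = make_entry(
    "In the food chain, green plants act as producers.",
    "Green plants provide food for consumers through photosynthesis.",
    "In the food chain, green plants play the role of producers "
    "through photosynthesis.",
)
invalid = make_entry("Premise A.", "Unrelated premise B.", None)
```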
  • a data set generated during the process of creating a data set described in connection with FIG. 2 may be considered an initial data set generated by computing device 110 .
  • This data set can be used as a training data set for training a natural language model.
  • the initial data set can be further optimized.
  • Fig. 3 shows a schematic diagram of the process of creating a data set according to some embodiments of the present disclosure.
  • Computing device 110 may remove data entries whose conclusion statement field is marked as "no valid conclusion".
  • The computing device 110 can transform the data entries with valid conclusions, that is, those marked with specific conclusions at the conclusion statement field.
  • transforming the data item may include transforming at least one of the first premise statement and the second premise statement.
  • the transforming of the first premise sentence and the second premise sentence may include transforming a specific segment in at least one of the first premise sentence and the second premise sentence.
  • The specific segment may refer to the middle term in the first premise statement and the second premise statement.
  • The term "middle term" in this application refers to the concept from the syllogism in logic.
  • A syllogism is a simple form of deductive reasoning. It contains two premises consisting of categorical propositions (i.e., the first premise statement and the second premise statement described above), and a conclusion consisting of a categorical proposition.
  • A correct syllogism has exactly three terms, among which the term connecting the first premise statement and the second premise statement is called the middle term, and it appears twice in the premises.
  • The transformation of a specific segment in at least one of the first premise statement and the second premise statement may include at least one of: synonymous segment replacement, antonymous segment replacement, superordinate segment replacement, subordinate segment replacement, negative segment replacement, double-negative segment replacement, and back-translation segment replacement.
  • Synonymous, antonymous, superordinate, and subordinate segment replacement can operate on the middle term in the first premise statement and the second premise statement mentioned above. For example, for each middle term, first perform word sense disambiguation to find its corresponding sense entry in a lexicon such as WordNet, then find the corresponding replacement word, and finally perform grammatical error correction.
  • Negative segment replacement, double-negative segment replacement, and back-translation segment replacement can be implemented using language transformation tools, such as the TextFlint toolkit.
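The middle-term replacement step can be sketched as follows, using a toy synonym lexicon as a stand-in for the WordNet lookup; a real pipeline would first disambiguate the term (e.g. with an interface to WordNet), or use a toolkit such as TextFlint, and then run grammatical error correction.

```python
# Hypothetical lexicon entry standing in for a WordNet sense lookup.
SYNONYMS = {"green plants": "green flora"}

def replace_segment(sentence: str, middle_term: str, lexicon: dict) -> str:
    """Replace the middle term with a synonymous segment, if one is known."""
    replacement = lexicon.get(middle_term)
    if replacement is None:
        return sentence  # no transformation available; keep the original
    return sentence.replace(middle_term, replacement)

transformed = replace_segment(
    "In the food chain, green plants act as producers.",
    "green plants",
    SYNONYMS,
)
```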
  • The data entry 231 with a valid conclusion can be transformed to generate data entry 231' and data entry 231".
  • Data entry 231', for example, can include a transformed first premise statement, the original second premise statement, and the conclusion.
  • Data entry 231", for example, can include the original first premise statement, a transformed second premise statement, and the conclusion.
  • The conclusion statement field of the processed original data entry may be marked as "no valid conclusion".
  • Natural logic has richer expression ability than formal logic, such as it can express probability, quantity and other issues.
  • common sense information can also be combined in reasoning.
  • The data set obtained in this way can alleviate the lack of data sets for natural language reasoning, so that a language model trained on this data set can acquire the ability to reason rather than merely follow simple rule patterns, and the performance of the trained language model is thereby further optimized.
  • The first premise statement and the second premise statement can be input multiple times, or input into different models, each representing the association relationship between premise statements. If the deviation among the conclusions obtained across multiple checks for the first premise statement and the second premise statement is less than a threshold deviation, the data entry is regarded as a valid entry. If the deviation is greater than the threshold deviation, the data entry is removed from the data set.
  • The above verification process can also be implemented by manual annotation. For example, annotators can judge whether the conclusion in a data entry is correct. If the ratio of the number of verifiers who judge the conclusion to be correct to the total number of verifiers is greater than a threshold ratio, the data entry is regarded as a valid entry; otherwise, the data entry is removed from the data set.
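The manual-verification rule above can be sketched as follows; an entry is kept only if the fraction of verifiers judging its conclusion correct exceeds a threshold ratio. The 0.5 default is an assumed value, since the patent does not fix the threshold.

```python
def is_valid_entry(votes: list, threshold: float = 0.5) -> bool:
    """votes: one boolean per verifier (True = conclusion judged correct).
    Keep the entry only if the correct-vote ratio exceeds the threshold."""
    if not votes:
        return False
    return sum(votes) / len(votes) > threshold

entries = ["entry-1", "entry-2"]
votes_per_entry = [[True, True, False], [False, False, True]]
kept = [e for e, v in zip(entries, votes_per_entry) if is_valid_entry(v)]
```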
  • A conclusion generated by a machine may be provided for each data entry in the data set, and it may be manually marked whether the generated conclusion is correct.
  • In this way, an evaluator can be obtained to evaluate the generation results of the model, the diversity of reasoning results can be taken into account, and the shortcoming of word-overlap-based evaluation methods, which struggle to judge the quality of model-generated conclusions, can be avoided.
  • FIG. 4 shows a flowchart of a process 400 of creating a data set according to some embodiments of the present disclosure.
  • Process 400 may be implemented at computing device 110 shown in FIG. 1 .
  • a set of first premise statements and a set of second premise statements associated with the set of first premise statements are obtained.
  • A keyword may be extracted from each first premise statement in the set of first premise statements, and the set of second premise statements may be acquired based on each keyword and the semantics of the set of first premise statements.
  • a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements are generated.
  • the plurality of conclusion statements indicates a correlation between the set of first premise statements and the set of second premise statements.
  • An association relationship between a set of reference premise statements can be obtained. If it is determined that the correlation between a first part of the first premise statements in the set of first premise statements and a first part of the second premise statements in the set of second premise statements is successfully inferred based on the association relationship, a conclusion statement describing the correlation is generated.
  • When generating the conclusion statements, if it is determined that a correlation between a second part of the first premise statements in the set of first premise statements and a second part of the second premise statements in the set of second premise statements is not successfully deduced based on the association relationship, an indication that the correlation does not have a valid conclusion is generated.
  • a target data set is determined based on at least the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  • When determining the target data set, if it is determined that the correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, a change may be made to the first target premise statement. A conclusion statement indicating a correlation between the changed first target premise statement and the second target premise statement is generated. The target data set is determined based on the changed first target premise statement, the second target premise statement, and the conclusion statement.
  • Alternatively, a change may be made to the second target premise statement. A conclusion statement indicating a correlation between the first target premise statement and the changed second target premise statement is generated. The target data set is determined based on the first target premise statement, the changed second target premise statement, and the conclusion statement.
  • A change may also be made to both the first target premise statement and the second target premise statement.
  • In that case, a conclusion statement is generated indicating a correlation between the changed first target premise statement and the changed second target premise statement.
  • The target data set is determined based on the changed first target premise statement, the changed second target premise statement, and the conclusion statement.
  • When at least one of the first target premise statement and the second target premise statement is changed, at least one of the following operations is performed on the target transformation segment: synonymous segment replacement; antonymous segment replacement; superordinate segment replacement; subordinate segment replacement; negative segment replacement; double-negative segment replacement; and back-translation segment replacement.
  • The initial data set generated based on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements is verified; if it is determined that some conclusion statements among the plurality of conclusion statements are erroneous, the initial data set is updated by deleting the erroneous conclusion statements together with the corresponding part of the set of first premise statements and the corresponding part of the set of second premise statements associated with them; and the updated initial data set is determined as the target data set.
  • the set of first premise sentences and the set of second premise sentences comprise natural language sentences.
  • Fig. 5 shows a block diagram of an apparatus 500 for creating a dataset according to some embodiments of the present disclosure.
  • Apparatus 500 may be implemented as or included in computing device 110 shown in FIG. 1 .
  • Each module/component in the device 500 may be implemented by hardware, software, firmware or any combination thereof.
  • the apparatus 500 includes an acquisition module 510 configured to acquire a set of first premise statements and a set of second premise statements associated with the set of first premise statements.
  • Apparatus 500 also includes a generating module 520 configured to generate a plurality of concluding sentences associated with the set of first premise sentences and the set of second premise sentences, the plurality of concluding sentences indicating that the set of first premise sentences A correlation between a premise statement and the set of second premise statements.
  • the apparatus 500 further includes a determination module configured to determine a target data set based at least on the set of first premise statements, the set of second premise statements and the plurality of conclusion statements.
  • The acquisition module 510 includes: a keyword extraction module configured to extract a keyword from each first premise statement in the set of first premise statements; and a second premise statement acquisition module configured to acquire the set of second premise statements based on each keyword and the semantics of the set of first premise statements.
  • The generation module 520 includes: an association relationship acquisition module configured to acquire an association relationship between a set of reference premise statements; and a first conclusion statement generation module configured to, if it is determined that the correlation between a first part of the first premise statements in the set of first premise statements and a first part of the second premise statements in the set of second premise statements is successfully inferred based on the association relationship, generate a conclusion statement describing the correlation.
  • The generation module 520 further includes a second conclusion statement generation module configured to, if it is determined that a correlation between a second part of the first premise statements in the set of first premise statements and a second part of the second premise statements in the set of second premise statements is not successfully deduced based on the association relationship, generate an indication that the correlation does not have a valid conclusion.
  • The determination module is further configured to, if it is determined that the correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, make a change to the first target premise statement,
  • generate a conclusion statement indicating a correlation between the changed first target premise statement and the second target premise statement, and determine the target data set based on the changed first target premise statement, the second target premise statement, and the conclusion statement.
  • The determination module is further configured to, if it is determined that the correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, make a change to the second target premise statement,
  • generate a conclusion statement indicating a correlation between the first target premise statement and the changed second target premise statement, and determine the target data set based on the first target premise statement, the changed second target premise statement, and the conclusion statement.
  • The determination module is further configured to, if it is determined that the correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, make a change to both the first target premise statement and the second target premise statement,
  • generate a conclusion statement indicating a correlation between the changed first target premise statement and the changed second target premise statement,
  • and determine the target data set based on the changed first target premise statement, the changed second target premise statement, and the conclusion statement.
  • The apparatus 500 may further include a change module configured to, when changing at least one of the first target premise statement and the second target premise statement, perform at least one of the following operations on the target transformation segment: synonymous segment replacement; antonymous segment replacement; superordinate segment replacement; subordinate segment replacement; negative segment replacement; double-negative segment replacement; and back-translation segment replacement.
  • the determining module is further configured to verify the initial dataset generated based on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements; if it is determined that some of the plurality of conclusion statements are erroneous, update the initial dataset by deleting the erroneous conclusion statements and the corresponding portions of the set of first premise statements and the set of second premise statements associated with the erroneous conclusion statements; and determine the updated initial dataset as the target dataset.
  • the set of first premise sentences and the set of second premise sentences comprise natural language sentences.
  • FIG. 6 shows a block diagram illustrating a computing device 600 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the computing device 600 shown in FIG. 6 is exemplary only and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device 600 shown in FIG. 6 may be used to implement the computing device 110 of FIG. 1 .
  • computing device 600 is in the form of a general-purpose computing device.
  • Components of computing device 600 may include, but are not limited to, one or more processors or processing units 610, memory 620, storage devices 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660.
  • the processing unit 610 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 620.
  • multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 600 .
  • Computing device 600 typically includes a plurality of computer storage media. Such media can be any available media that is accessible by computing device 600, including but not limited to volatile and nonvolatile media, removable and non-removable media.
  • Memory 620 can be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage device 630 may be removable or non-removable media, and may include machine-readable media, such as flash drives, magnetic disks, or any other media that can be used to store information and/or data (e.g., training data for training) and that can be accessed within computing device 600.
  • Computing device 600 may further include additional removable/non-removable, volatile/nonvolatile storage media.
  • although not shown, a disk drive for reading from or writing to a removable, nonvolatile disk (such as a "floppy disk") and an optical disc drive for reading from or writing to a removable, nonvolatile optical disc may be provided.
  • in these cases, each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • Memory 620 may include a computer program product 625 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure.
  • the communication unit 640 enables communication with other computing devices through the communication medium. Additionally, the functionality of the components of computing device 600 may be implemented in a single computing cluster or as a plurality of computing machines capable of communicating via communication links. Accordingly, computing device 600 may operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC), or another network node.
  • Input device 650 may be one or more input devices, such as a mouse, keyboard, trackball, and the like.
  • Output device 660 may be one or more output devices, such as a display, speakers, printer, or the like.
  • the computing device 600 can also, as needed, communicate through the communication unit 640 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device 600, or with any device (e.g., network card, modem, etc.) that enables the computing device 600 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the methods described above.
  • a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing apparatuses, and/or other devices to work in a specific way, such that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process; the instructions executed on the computer, other programmable data processing apparatus, or other equipment can thus implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two successive blocks may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A method, device, apparatus, and medium for dataset creation are provided. The method includes: obtaining a set of first premise statements and a set of second premise statements associated with the set of first premise statements (S410); generating a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements (S420); and determining a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements (S430). A dataset obtained in this way addresses the shortage of datasets for natural language reasoning, so that a language model trained on the dataset can acquire reasoning ability rather than relying on simple rule patterns, thereby improving the performance of the trained language model.

Description

Method, Electronic Device, and Computer Program Product for Dataset Creation
Cross-Reference to Related Application
This application claims priority to Chinese invention patent application No. CN202111130224.0, filed on September 26, 2021 and entitled "Method, Electronic Device, and Computer Program Product for Dataset Creation".
Technical Field
Embodiments of the present disclosure relate generally to data processing systems and, more particularly, to a method, electronic device, and computer program product for dataset creation.
Background
Reasoning with external knowledge systems has long been a goal of artificial intelligence. A common approach is to semantically parse natural language and then reason with formal logic. This approach suffers from error propagation introduced by semantic parsing and from the limited expressive power of formal logic.
To date, no work has proposed a reasoning-generation task based on natural language, so datasets for natural language reasoning are scarce. Yet natural language reasoning is of great significance for training language models.
Summary
According to example embodiments of the present disclosure, a scheme for dataset creation is provided.
In a first aspect of the present disclosure, a computer-implemented method is provided. The method includes: obtaining a set of first premise statements and a set of second premise statements associated with the set of first premise statements; generating a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements; and determining a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
In a second aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to perform the following actions: obtaining a set of first premise statements and a set of second premise statements associated with the set of first premise statements; generating a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements; and determining a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
In a third aspect of the present disclosure, an apparatus for dataset creation is provided. The apparatus includes: an obtaining module configured to obtain a set of first premise statements and a set of second premise statements associated with the set of first premise statements; a generating module configured to generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements; and a determining module configured to determine a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the program, when executed by a processor, implements the method of the first aspect.
It should be understood that the content described in this Summary is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The above and other objects, features, and advantages of embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present disclosure are shown by way of example and not limitation, in which:
FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a schematic diagram of a process of creating a dataset according to some embodiments of the present disclosure;
FIG. 3 shows a schematic diagram of a process of creating a dataset according to some embodiments of the present disclosure;
FIG. 4 shows a flowchart of a process of creating a dataset according to some embodiments of the present disclosure;
FIG. 5 shows a block diagram of an apparatus for creating a dataset according to some embodiments of the present disclosure; and
FIG. 6 shows a block diagram of a device capable of implementing multiple embodiments of the present disclosure.
Throughout the drawings, the same or similar reference numerals denote the same or similar components.
Detailed Description
The principles and spirit of the present disclosure will be described below with reference to several example embodiments shown in the accompanying drawings. It should be understood that these specific embodiments are described only so that those skilled in the art can better understand and implement the present disclosure, and not to limit the scope of the present disclosure in any way.
As used herein, the term "model" can learn associations between respective inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input. Model generation may be based on machine learning techniques. Deep learning is a class of machine learning algorithms that uses multiple layers of processing units to process inputs and provide corresponding outputs. A neural network model is one example of a deep-learning-based model. Herein, "model" may also be referred to as "machine learning model", "learning model", "machine learning network", or "learning network", and these terms are used interchangeably herein.
A "neural network" is a machine learning network based on deep learning. A neural network processes an input and provides a corresponding output, and typically includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, increasing the depth of the network. The layers of a neural network are connected in sequence, so that the output of a preceding layer is provided as the input of a following layer, where the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), and each node processes the input from the preceding layer.
In general, machine learning may roughly include three phases: a training phase, a testing phase, and an application phase (also called an inference phase). In the training phase, a given model may be trained with a large amount of training data, iteratively updating parameter values until the model can obtain, from the training data, consistent inferences that meet the expected goal. Through training, the model can be regarded as able to learn the association from input to output (also called an input-to-output mapping) from the training data. The parameter values of the trained model are thereby determined. In the testing phase, test inputs are applied to the trained model to test whether the model can provide correct outputs, thereby determining the performance of the model. In the application phase, the model can be used to process actual inputs based on the trained parameter values and determine the corresponding outputs.
As mentioned above, reasoning with external knowledge systems has long been a goal of artificial intelligence. A common approach is to semantically parse natural language and then reason with formal logic. This approach suffers from error propagation introduced by semantic parsing and from the limited expressive power of formal logic.
To date, no work has proposed a reasoning-generation task based on natural language. Some existing datasets pose the task of generating a reasoning process within question answering: given multiple facts and rules, a question, and candidate answers, the correct answer must be given and the entire reasoning process written out.
However, the reasoning ability involved in such datasets covers only simple rule patterns. Models or machine learning networks trained on these datasets do not truly learn to reason; they merely learn some simple rule patterns.
For these reasons, datasets for natural language reasoning are currently scarce, even though natural language reasoning is of great significance for training language models.
Example Environment
FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented.
As shown in FIG. 1, the example environment 100 may include a computing device 110. The computing device 110 may perform processing on data; such processing may include, for example, data collection, data analysis, segment extraction, segment transformation, data filtering, and dataset generation.
The computing device 110 may obtain, look up, or search for target data in a knowledge base 120. For example, when the computing device 110 is building a dataset based on natural language, it may obtain multiple natural language statements from the knowledge base 120 as collected data. The computing device 110 may also, for example, search the knowledge base 120 for the required target data based on certain statement elements.
In addition, the computing device 110 may, for example, classify, transform, filter, or annotate the collected data.
The computing device 110 may generate the required dataset based on the processed data. The generated dataset may be sent by the computing device 110 to a language training model 130 as input to the language training model 130, so that the desired learning effect for the language training model 130 is achieved based on this dataset.
It should be understood that the computing device 110 shown in the example environment 100 of FIG. 1 may be any computing device capable of data processing, including but not limited to a personal computer, server computer, handheld or laptop device, mobile device (such as a mobile phone, personal digital assistant (PDA), media player, etc.), multiprocessor system, consumer electronics product, minicomputer, mainframe computer, distributed computing environment including any of the above systems or devices, and the like.
It should be understood that the components and arrangements in the environment shown in FIG. 1 are merely examples; a computing system suitable for implementing the example embodiments described in the present disclosure may include one or more different components, other components, and/or different arrangements. For example, although shown as separate, the computing device 110 and the language training model 130 may be integrated into the same system or device. Embodiments of the present disclosure are not limited in this respect.
Example embodiments will be described below with continued reference to the accompanying drawings.
Dataset Creation
According to embodiments of the present disclosure, a scheme for dataset creation is proposed. According to this scheme, in the process of creating a dataset, the computing device 110 may obtain a set of first premise statements and a set of second premise statements associated with the set of first premise statements. The computing device 110 may also generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements; the plurality of conclusion statements indicate correlations between the set of first premise statements and the set of second premise statements. The computing device 110 may further determine a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
FIG. 2 shows a schematic diagram of a process of creating a dataset according to some embodiments of the present disclosure.
As shown in FIG. 2, the computing device 110 may obtain one or more first premise statements 210. A first premise statement 210 may be, for example, a natural language statement, and the computing device 110 may obtain the one or more first premise statements 210 from a natural language knowledge base.
The computing device 110 may also obtain corresponding second premise statements 220 associated with the one or more first premise statements 210. It should be understood that one or more second premise statements 220 may be obtained for the same first premise statement 210. A second premise statement 220 may be, for example, a natural language statement.
In some embodiments, the computing device 110 may extract an arbitrary segment from the one or more first premise statements 210 as a keyword. Based on the keyword and the semantics of the one or more first premise statements 210, the computing device 110 may search the natural language knowledge base for natural language statements correlated with a first premise statement, to serve as second premise statements.
For example, a first premise statement may be "In the food chain, green plants, in their role as producers, can provide food for consumers." With "green plants" as the extracted keyword, a second premise statement may be "Green plants provide food for consumers through photosynthesis."
It should be understood that the second premise statements obtained by the computing device 110 may differ depending on which keyword is extracted from the first premise statement. It should also be understood that the computing device 110 may obtain multiple different second premise statements for the same keyword extracted from the first premise statement.
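The keyword-based retrieval described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the knowledge base is a toy list, keyword extraction is naive whitespace splitting, and matching is plain substring search, whereas a real system would use semantic, sense-aware retrieval.

```python
def extract_keywords(premise):
    # Toy keyword extraction: keep the longer tokens. A real system would
    # use NLP tooling (e.g., noun-phrase chunking or sense disambiguation).
    return [w for w in premise.split() if len(w) > 3]

def retrieve_second_premises(first_premise, knowledge_base):
    # Return knowledge-base sentences that share at least one keyword with
    # the first premise statement: candidate second premise statements.
    keywords = extract_keywords(first_premise)
    return [s for s in knowledge_base
            if s != first_premise and any(k in s for k in keywords)]

kb = [
    "green plants provide food for consumers through photosynthesis",
    "water boils at 100 degrees Celsius",
]
first = "green plants act as producers in the food chain"
matches = retrieve_second_premises(first, kb)
print(matches)  # only the photosynthesis sentence shares a keyword
```

Because retrieval is keyword-driven, choosing a different keyword (or a richer matcher) yields different candidate second premises, which is exactly the behavior noted in the paragraph above.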
Based on the obtained one or more first premise statements 210 and the corresponding associated second premise statements 220, the computing device 110 may generate a plurality of conclusion statements associated with the first and second premise statements. A conclusion statement may indicate a correlation between one of the one or more first premise statements and the corresponding one of the one or more second premise statements.
Continuing the example described above: if the first premise statement is "In the food chain, green plants, in their role as producers, can provide food for consumers" and the second premise statement is "Green plants provide food for consumers through photosynthesis", the conclusion statement generated by the computing device 110 may be "In the food chain, green plants act in the role of producers through photosynthesis."
In some embodiments, the conclusion statement may be given through association relationships among a set of reference premise statements. The association relationships may involve, for example, a pre-trained model for characterizing correlations among multiple premise statements. The first premise statement obtained by the computing device 110 and the second premise statement associated with it may serve as input to the model, and the model's output may be a conclusion statement indicating the correlation between the first premise statement and the second premise statement associated with it.
In some embodiments, the computing device 110 may also determine, through manual annotation, conclusion statements indicating the respective correlations between the obtained one or more first premise statements 210 and the corresponding associated second premise statements 220. Such a conclusion statement may, for example, be input to the computing device 110.
Based on a first premise statement 210, a second premise statement 220 associated with the first premise statement 210, and a conclusion statement indicating the correlation between the first premise statement 210 and the second premise statement 220, the computing device 110 may determine one data entry of the dataset to be generated. The data entry may, for example, have the following format: <premise 1, premise 2, conclusion>.
In some embodiments, if a conclusion statement can be derived based on a first premise statement 210 and the second premise statement 220 associated with it, the conclusion field in the above format is annotated with a statement describing the conclusion indicating the correlation between the first premise statement 210 and the second premise statement 220.
In some embodiments, if no conclusion statement can be derived based on a first premise statement 210 and the second premise statement 220 associated with it, the conclusion field in the above format is annotated with "no valid conclusion".
In this way, the computing device 110 may generate a dataset from the one or more first premise statements 210, the corresponding associated second premise statements 220, and the corresponding conclusion statements. The dataset may include multiple data entries, each consisting of a first premise statement, a second premise statement associated with that first premise statement, and a conclusion statement indicating the correlation between the two.
For example, as shown in FIG. 2, the dataset 230 generated by the computing device 110 may include entries 231 through 23N. If a conclusion can be inferred from a first premise statement and a second premise statement, the specific conclusion is identified in the conclusion statement field, as in entry 231 shown in FIG. 2. If no conclusion can be inferred, "no valid conclusion" is identified in the conclusion statement field, as in entry 232 shown in FIG. 2.
It should be understood that the dataset 230 may include any number of data entries and is not limited to the example shown in FIG. 2.
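The <premise 1, premise 2, conclusion> entries described above can be assembled as, for example, simple dictionaries. This is a hedged sketch: the field names and the "no valid conclusion" label are illustrative stand-ins for whatever concrete representation an implementation chooses.

```python
def make_entry(premise1, premise2, conclusion=None):
    # One dataset entry in <premise 1, premise 2, conclusion> form; entries
    # whose premises yield no inferable conclusion are annotated
    # "no valid conclusion" in the conclusion field.
    return {
        "premise1": premise1,
        "premise2": premise2,
        "conclusion": conclusion if conclusion is not None else "no valid conclusion",
    }

dataset = [
    make_entry(
        "in the food chain, green plants act as producers providing food",
        "green plants provide food through photosynthesis",
        "in the food chain, green plants act as producers through photosynthesis",
    ),
    # Unrelated premises: no conclusion can be derived.
    make_entry("water boils at 100 degrees Celsius",
               "green plants perform photosynthesis"),
]
print(dataset[1]["conclusion"])  # prints: no valid conclusion
```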
In some embodiments, the dataset generated in the dataset creation process described with reference to FIG. 2 may be regarded as an initial dataset generated by the computing device 110. This dataset may be used as a training dataset for training a natural language model. However, to further increase the complexity of the reasoning so that the natural language model can be trained more thoroughly, the initial dataset may be further optimized.
FIG. 3 shows a schematic diagram of a process of creating a dataset according to some embodiments of the present disclosure.
To optimize the initial dataset, the computing device 110 may remove data entries whose conclusion statement field is annotated "no valid conclusion". In addition, to increase the complexity of the reasoning, the computing device 110 may transform data entries that have a valid conclusion, i.e., whose conclusion statement field is annotated with a specific conclusion.
In some embodiments, transforming a data entry may include transforming at least one of the first premise statement and the second premise statement.
In some embodiments, the transformation applied to the first and second premise statements may be performed by transforming a specific segment in at least one of the first premise statement and the second premise statement.
In some embodiments, the specific segment may involve the middle term of the first and second premise statements. The term "middle term" in this application refers to a concept from syllogistic logic. Syllogistic reasoning is a simple form of judgment inference within deductive reasoning. It comprises premises consisting of two categorical propositions (i.e., the first premise statement and second premise statement described above) and a conclusion consisting of one categorical proposition. A correct syllogism has exactly three terms, among which the term connecting the first premise statement and the second premise statement is called the middle term, and it may appear twice in the premises.
In some embodiments, the transformation of a specific segment in at least one of the first and second premise statements may include at least one of: synonym segment substitution, antonym segment substitution, hypernym segment substitution, hyponym segment substitution, negation segment substitution, double-negation segment substitution, and back-translation segment substitution.
In some embodiments, synonym, antonym, hypernym, and hyponym segment substitutions may operate on the middle term of the first and second premise statements mentioned above. For example, for each middle term, word-sense disambiguation is first performed to find its corresponding sense in a lexical database such as WordNet; the corresponding substitute word is then found; and finally grammatical error correction is applied.
In some embodiments, negation, double-negation, and back-translation segment substitutions may be performed using language transformation tools, for example the TextFlint toolkit.
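The middle-term substitutions above can be sketched as follows. The tiny lexicon is a placeholder for a real lexical database such as WordNet, and the string-level negation stands in for the grammar-aware transformations that, per the description, a toolkit such as TextFlint would perform; none of this is the patent's actual implementation.

```python
# Placeholder lexicon; a real implementation would query WordNet after
# word-sense disambiguation and then run grammatical error correction.
LEXICON = {
    "plants": {"synonym": "flora", "hypernym": "organisms", "hyponym": "ferns"},
}

def substitute(sentence, middle_term, relation):
    # Replace the middle term with its synonym / hypernym / hyponym;
    # return the sentence unchanged if the lexicon has no entry.
    replacement = LEXICON.get(middle_term, {}).get(relation)
    return sentence.replace(middle_term, replacement) if replacement else sentence

def negate(sentence):
    # Toy negation substitution on one known verb phrase.
    return sentence.replace(" provide ", " do not provide ")

s = "green plants provide food for consumers"
print(substitute(s, "plants", "hypernym"))  # prints: green organisms provide food for consumers
print(negate(s))                            # prints: green plants do not provide food for consumers
```

Applying such a perturbation to one premise of an entry produces the transformed entries (e.g., 231' and 231") discussed below, whose conclusions may or may not survive the change.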
As shown in FIG. 3, data entry 231, which has a valid conclusion, may for example be transformed to generate data entry 231' and data entry 231". Data entry 231' may, for example, include the transformed first premise statement, the original second premise statement, and the conclusion, while data entry 231" may, for example, include the original first premise statement, the transformed second premise statement, and the conclusion.
In some embodiments, it may happen that after at least one of the first premise statement and the second premise statement is transformed, no valid conclusion can be derived. The conclusion statement field of the processed original data entry may therefore be annotated "no valid conclusion".
In some embodiments, if a specific conclusion exists in the conclusion statement field of a processed data entry, that conclusion may also be compared with the conclusion described in the conclusion statement field of the original data entry for consistency.
After this further processing of the initial dataset 230, as shown in FIG. 3, the computing device 110 may generate a dataset 330, which may include data entries 231 and 232 from the initial dataset, as well as data entries 231' and 231" obtained by processing the original data entry 231.
The scheme described in the embodiments of the present disclosure is based on reasoning in natural language. Natural logic has richer expressive power than formal logic; for example, it can express probability and quantity. Meanwhile, with large-scale pre-trained language models, commonsense information can also be incorporated into the reasoning.
Moreover, to increase the difficulty of the dataset, small perturbations are applied to the premises of each data entry, forcing similar premises to yield entirely different conclusions and thereby preventing a model from learning simple rule patterns.
A dataset obtained in this way addresses the shortage of datasets for natural language reasoning, so that a language model trained on it can acquire reasoning ability rather than relying on simple rule patterns, thereby improving the performance of the trained language model.
In some embodiments, the data entries of an already generated dataset may also be verified. For example, the first and second premise statements may be input multiple times, or input into different models for characterizing the association relationships among premise statements. If the deviation among the conclusions associated with the first and second premise statements obtained across multiple verifications is less than a threshold deviation, the data entry is regarded as a valid entry. If the deviation among the conclusions obtained across multiple verifications is greater than the threshold deviation, the data entry is removed from the dataset.
Likewise, in some embodiments, the above verification process may also be implemented through manual annotation, for example by judging whether the conclusion in a data entry is correct. If the ratio of verifiers who judge the conclusion of a data entry to be correct to the total number of verifiers exceeds a threshold ratio, the data entry is regarded as a valid entry; otherwise, the data entry is removed from the dataset.
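Both validation schemes above reduce to the same filtering rule: keep an entry only if the share of positive checks (consistent model re-runs, or annotators judging the conclusion correct) exceeds a threshold. A minimal sketch, with an illustrative threshold of 0.5 (the patent does not fix a value):

```python
def filter_entries(entries, votes, threshold=0.5):
    # votes[i] is a list of booleans for entries[i]: True means one check
    # (a model re-run or a human verifier) accepted the entry's conclusion.
    kept = []
    for entry, entry_votes in zip(entries, votes):
        if sum(entry_votes) / len(entry_votes) > threshold:
            kept.append(entry)
    return kept

entries = ["entry-a", "entry-b"]
votes = [[True, True, False],   # 2/3 of checks accept -> keep the entry
         [False, False, True]]  # 1/3 accept -> remove from the dataset
print(filter_entries(entries, votes))  # prints: ['entry-a']
```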
In some embodiments, to assess the quality of conclusions generated by a model, a machine-generated conclusion may be provided for each data entry in the dataset and manually annotated as to whether the generated conclusion is correct. Fine-tuning a model, for example a BLEURT model, on these data yields an evaluator for assessing the model's generated results. In this way, the diversity of reasoning results can be taken into account, avoiding the drawback that word-overlap-based evaluation methods struggle to judge the quality of model-generated conclusions.
Example Process
FIG. 4 shows a flowchart of a process 400 for dataset creation according to some embodiments of the present disclosure. The process 400 may be implemented at the computing device 110 shown in FIG. 1.
At block 410, a set of first premise statements and a set of second premise statements associated with the set of first premise statements are obtained.
In some embodiments, when obtaining the set of second premise statements, a respective keyword may be extracted from each of the set of first premise statements, and the set of second premise statements may be obtained based on the respective keywords and the semantics of the set of first premise statements.
At block 420, a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements are generated. The plurality of conclusion statements indicate correlations between the set of first premise statements and the set of second premise statements.
In some embodiments, when generating the conclusion statements, association relationships among a set of reference premise statements may be obtained. If it is determined that a correlation between a first portion of the set of first premise statements and a first portion of the set of second premise statements is successfully inferred based on the association relationships, a conclusion statement describing the correlation is generated.
In some embodiments, when generating the conclusion statements, if it is determined that a correlation between a second portion of the set of first premise statements and a second portion of the set of second premise statements is not successfully inferred based on the association relationships, an indication that the correlation has no valid conclusion is generated.
At block 430, a target dataset is determined based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
In some embodiments, when determining the target dataset, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the first target premise statement is changed. A conclusion statement indicating a correlation between the changed first target premise statement and the second target premise statement is generated. The target dataset is determined based on the changed first target premise statement, the second target premise statement, and the conclusion statement.
In some embodiments, when determining the target dataset, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the second target premise statement is changed. A conclusion statement indicating a correlation between the first target premise statement and the changed second target premise statement is generated. The target dataset is determined based on the first target premise statement, the changed second target premise statement, and the conclusion statement.
In some embodiments, when determining the target dataset, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the first target premise statement and the second target premise statement are changed. A conclusion statement indicating a correlation between the changed first target premise statement and the changed second target premise statement is generated. The target dataset is determined based on the changed first target premise statement, the changed second target premise statement, and the conclusion statement.
In some embodiments, when at least one of the first target premise statement and the second target premise statement is changed, at least one of the following operations is performed on the target transformation segment: synonym segment substitution; antonym segment substitution; hypernym segment substitution; hyponym segment substitution; negation segment substitution; double-negation segment substitution; and back-translation segment substitution.
In some embodiments, when determining the target dataset, an initial dataset generated based on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements is verified. If it is determined that some of the plurality of conclusion statements are erroneous, the initial dataset is updated by deleting the erroneous conclusion statements and the corresponding portions of the set of first premise statements and of the set of second premise statements associated with the erroneous conclusion statements; and the updated initial dataset is determined as the target dataset.
In some embodiments, the set of first premise statements and the set of second premise statements include natural language statements.
Example Apparatus and Device
FIG. 5 shows a block diagram of an apparatus 500 for dataset creation according to some embodiments of the present disclosure. The apparatus 500 may be implemented as or included in the computing device 110 shown in FIG. 1. The modules/components of the apparatus 500 may be implemented by hardware, software, firmware, or any combination thereof.
As shown, the apparatus 500 includes an obtaining module 510 configured to obtain a set of first premise statements and a set of second premise statements associated with the set of first premise statements. The apparatus 500 further includes a generating module 520 configured to generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements. The apparatus 500 further includes a determining module configured to determine a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
In some embodiments, the obtaining module 510 includes: a keyword extraction module configured to extract a respective keyword from each of the set of first premise statements; and a second premise statement obtaining module configured to obtain the set of second premise statements based on the respective keywords and the semantics of the set of first premise statements.
In some embodiments, the generating module 520 includes: an association relationship obtaining module configured to obtain association relationships among a set of reference premise statements; and a first conclusion statement generating module configured to, if it is determined that a correlation between a first portion of the set of first premise statements and a first portion of the set of second premise statements is successfully inferred based on the association relationships, generate a conclusion statement describing the correlation.
In some embodiments, the generating module 520 further includes a second conclusion statement generating module configured to, if it is determined that a correlation between a second portion of the set of first premise statements and a second portion of the set of second premise statements is not successfully inferred based on the association relationships, generate an indication that the correlation has no valid conclusion.
In some embodiments, the determining module is further configured to, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, change the first target premise statement; generate a conclusion statement indicating a correlation between the changed first target premise statement and the second target premise statement; and determine the target dataset based on the changed first target premise statement, the second target premise statement, and the conclusion statement.
In some embodiments, the determining module is further configured to, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, change the second target premise statement; generate a conclusion statement indicating a correlation between the first target premise statement and the changed second target premise statement; and determine the target dataset based on the first target premise statement, the changed second target premise statement, and the conclusion statement.
In some embodiments, the determining module is further configured to, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, change the first target premise statement and the second target premise statement; generate a conclusion statement indicating a correlation between the changed first target premise statement and the changed second target premise statement; and determine the target dataset based on the changed first target premise statement, the changed second target premise statement, and the conclusion statement.
In some embodiments, the apparatus 500 may further include a changing module configured to, when changing at least one of the first target premise statement and the second target premise statement, perform at least one of the following operations on the target transformation segment: synonym segment substitution; antonym segment substitution; hypernym segment substitution; hyponym segment substitution; negation segment substitution; double-negation segment substitution; and back-translation segment substitution.
In some embodiments, the determining module is further configured to verify an initial dataset generated based on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements; if it is determined that some of the plurality of conclusion statements are erroneous, update the initial dataset by deleting the erroneous conclusion statements and the corresponding portions of the set of first premise statements and of the set of second premise statements associated with the erroneous conclusion statements; and determine the updated initial dataset as the target dataset.
In some embodiments, the set of first premise statements and the set of second premise statements include natural language statements.
FIG. 6 shows a block diagram of a computing device 600 in which one or more embodiments of the present disclosure can be implemented. It should be understood that the computing device 600 shown in FIG. 6 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device 600 shown in FIG. 6 may be used to implement the computing device 110 of FIG. 1.
As shown in FIG. 6, the computing device 600 is in the form of a general-purpose computing device. Components of the computing device 600 may include, but are not limited to, one or more processors or processing units 610, a memory 620, a storage device 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660. The processing unit 610 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 620. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capability of the computing device 600.
The computing device 600 typically includes multiple computer storage media. Such media may be any available media accessible by the computing device 600, including but not limited to volatile and nonvolatile media, removable and non-removable media. The memory 620 may be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 630 may be removable or non-removable media, and may include machine-readable media, such as flash drives, magnetic disks, or any other media that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device 600.
The computing device 600 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in FIG. 6, a disk drive for reading from or writing to a removable, nonvolatile disk (e.g., a "floppy disk") and an optical disc drive for reading from or writing to a removable, nonvolatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 620 may include a computer program product 625 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure.
The communication unit 640 enables communication with other computing devices through communication media. Additionally, the functionality of the components of the computing device 600 may be implemented in a single computing cluster or as multiple computing machines capable of communicating via communication links. Accordingly, the computing device 600 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another network node.
The input device 650 may be one or more input devices, such as a mouse, keyboard, trackball, and the like. The output device 660 may be one or more output devices, such as a display, speakers, printer, and the like. The computing device 600 may also, as needed, communicate through the communication unit 640 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device 600, or with any device (e.g., network card, modem, etc.) that enables the computing device 600 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to example implementations of the present disclosure, a computer-readable storage medium is provided on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to example implementations of the present disclosure, a computer program product is also provided that is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, the computer-executable instructions being executed by a processor to implement the methods described above.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, programmable data processing apparatus, and/or other device to work in a specific way, such that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other device, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple implementations of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two successive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The implementations of the present disclosure have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The choice of terms used herein is intended to best explain the principles of the implementations, their practical applications, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (22)

  1. A computer-implemented method, comprising:
    obtaining a set of first premise statements and a set of second premise statements associated with the set of first premise statements;
    generating a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements; and
    determining a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  2. The method of claim 1, wherein obtaining the set of second premise statements comprises:
    extracting a respective keyword from each of the set of first premise statements; and
    obtaining the set of second premise statements based on the respective keywords and the semantics of the set of first premise statements.
  3. The method of claim 1, wherein generating the conclusion statements comprises:
    obtaining association relationships among a set of reference premise statements; and
    generating, if it is determined that a correlation between a first portion of the set of first premise statements and a first portion of the set of second premise statements is successfully inferred based on the association relationships, a conclusion statement describing the correlation.
  4. The method of claim 3, further comprising:
    generating, if it is determined that a correlation between a second portion of the set of first premise statements and a second portion of the set of second premise statements is not successfully inferred based on the association relationships, an indication that the correlation has no valid conclusion.
  5. The method of claim 1, wherein determining the target dataset comprises:
    changing, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the first target premise statement;
    generating a conclusion statement indicating a correlation between the changed first target premise statement and the second target premise statement; and
    determining the target dataset based on the changed first target premise statement, the second target premise statement, and the conclusion statement.
  6. The method of claim 1, wherein determining the target dataset comprises:
    changing, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the second target premise statement;
    generating a conclusion statement indicating a correlation between the first target premise statement and the changed second target premise statement; and
    determining the target dataset based on the first target premise statement, the changed second target premise statement, and the conclusion statement.
  7. The method of claim 1, wherein determining the target dataset comprises:
    changing, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the first target premise statement and the second target premise statement;
    generating a conclusion statement indicating a correlation between the changed first target premise statement and the changed second target premise statement; and
    determining the target dataset based on the changed first target premise statement, the changed second target premise statement, and the conclusion statement.
  8. The method of any of claims 5 to 7, wherein changing at least one of the first target premise statement and the second target premise statement comprises:
    determining, from segments contained in at least one of the first target premise statement and the second target premise statement, a target transformation segment whose semantics can be transformed; and
    performing at least one of the following operations on the target transformation segment:
    synonym segment substitution;
    antonym segment substitution;
    hypernym segment substitution;
    hyponym segment substitution;
    negation segment substitution;
    double-negation segment substitution; and
    back-translation segment substitution.
  9. The method of claim 1, wherein determining the target dataset comprises:
    verifying an initial dataset generated based on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements;
    updating, if it is determined that some of the plurality of conclusion statements are erroneous, the initial dataset by deleting the erroneous conclusion statements and the corresponding portions of the set of first premise statements and of the set of second premise statements associated with the erroneous conclusion statements; and
    determining the updated initial dataset as the target dataset.
  10. The method of claim 1, wherein the set of first premise statements and the set of second premise statements comprise natural language statements.
  11. An electronic device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform the following actions:
    obtaining a set of first premise statements and a set of second premise statements associated with the set of first premise statements;
    generating a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements; and
    determining a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  12. The device of claim 11, wherein obtaining the set of second premise statements comprises:
    extracting a respective keyword from each of the set of first premise statements; and
    obtaining the set of second premise statements based on the respective keywords and the semantics of the set of first premise statements.
  13. The device of claim 11, wherein generating the conclusion statements comprises:
    obtaining association relationships among a set of reference premise statements; and
    generating, if it is determined that a correlation between a first portion of the set of first premise statements and a first portion of the set of second premise statements is successfully inferred based on the association relationships, a conclusion statement describing the correlation.
  14. The device of claim 13, wherein the actions further comprise:
    generating, if it is determined that a correlation between a second portion of the set of first premise statements and a second portion of the set of second premise statements is not successfully inferred based on the association relationships, an indication that the correlation has no valid conclusion.
  15. The device of claim 11, wherein determining the target dataset comprises:
    changing, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the first target premise statement;
    generating a conclusion statement indicating a correlation between the changed first target premise statement and the second target premise statement; and
    determining the target dataset based on the changed first target premise statement, the second target premise statement, and the conclusion statement.
  16. The device of claim 11, wherein determining the target dataset comprises:
    changing, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the second target premise statement;
    generating a conclusion statement indicating a correlation between the first target premise statement and the changed second target premise statement; and
    determining the target dataset based on the first target premise statement, the changed second target premise statement, and the conclusion statement.
  17. The device of claim 11, wherein determining the target dataset comprises:
    changing, if it is determined that a correlation between a first target premise statement in the set of first premise statements and a second target premise statement in the set of second premise statements can be inferred, the first target premise statement and the second target premise statement;
    generating a conclusion statement indicating a correlation between the changed first target premise statement and the changed second target premise statement; and
    determining the target dataset based on the changed first target premise statement, the changed second target premise statement, and the conclusion statement.
  18. The device of any of claims 15 to 17, wherein changing at least one of the first target premise statement and the second target premise statement comprises:
    determining, from segments contained in at least one of the first target premise statement and the second target premise statement, a target transformation segment whose semantics can be transformed; and
    performing at least one of the following operations on the target transformation segment:
    synonym segment substitution;
    antonym segment substitution;
    hypernym segment substitution;
    hyponym segment substitution;
    negation segment substitution;
    double-negation segment substitution; and
    back-translation segment substitution.
  19. The device of claim 11, wherein determining the target dataset comprises:
    verifying an initial dataset generated based on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements;
    updating, if it is determined that some of the plurality of conclusion statements are erroneous, the initial dataset by deleting the erroneous conclusion statements and the corresponding portions of the set of first premise statements and of the set of second premise statements associated with the erroneous conclusion statements; and
    determining the updated initial dataset as the target dataset.
  20. The device of claim 11, wherein the set of first premise statements and the set of second premise statements comprise natural language statements.
  21. An apparatus for dataset creation, comprising:
    an obtaining module configured to obtain a set of first premise statements and a set of second premise statements associated with the set of first premise statements;
    a generating module configured to generate a plurality of conclusion statements associated with the set of first premise statements and the set of second premise statements, the plurality of conclusion statements indicating correlations between the set of first premise statements and the set of second premise statements; and
    a determining module configured to determine a target dataset based at least on the set of first premise statements, the set of second premise statements, and the plurality of conclusion statements.
  22. A computer-readable storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the method of any of claims 1 to 10.
PCT/CN2022/116381 2021-09-26 2022-08-31 Method, electronic device, and computer program product for dataset creation WO2023045725A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111130224.0A CN113806489A (zh) 2021-09-26 2021-09-26 Method, electronic device, and computer program product for dataset creation
CN202111130224.0 2021-09-26

Publications (1)

Publication Number Publication Date
WO2023045725A1 true WO2023045725A1 (zh) 2023-03-30

Family

ID=78938665

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116381 WO2023045725A1 (zh) 2022-08-31 Method, electronic device, and computer program product for dataset creation

Country Status (2)

Country Link
CN (1) CN113806489A (zh)
WO (1) WO2023045725A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806489A (zh) * 2021-09-26 2021-12-17 北京有竹居网络技术有限公司 用于数据集创建的方法、电子设备和计算机程序产品
CN116401339A (zh) 2023-06-07 2023-07-07 北京百度网讯科技有限公司 Data processing method and apparatus, electronic device, medium, and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3163513A1 (en) * 2015-10-26 2017-05-03 Agfa Healthcare Method of automated notation 3 (n3) query rule creation
CN110705255A (zh) * 2019-10-12 2020-01-17 京东数字科技控股有限公司 Method and apparatus for detecting association relationships between statements
CN110765235A (zh) * 2019-09-09 2020-02-07 深圳市人马互动科技有限公司 Method, apparatus, terminal, and readable medium for generating training data
US20200242146A1 (en) * 2019-01-24 2020-07-30 Andrew R. Kalukin Artificial intelligence system for generating conjectures and comprehending text, audio, and visual data using natural language understanding
CN113806489A (zh) * 2021-09-26 2021-12-17 北京有竹居网络技术有限公司 Method, electronic device, and computer program product for dataset creation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688583A (zh) * 2016-08-05 2018-02-13 株式会社Ntt都科摩 Method and device for creating training data for a natural language processing apparatus
EP3483746A1 (en) * 2017-11-09 2019-05-15 Snips Methods and devices for generating data to train a natural language understanding component

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3163513A1 (en) * 2015-10-26 2017-05-03 Agfa Healthcare Method of automated notation 3 (n3) query rule creation
US20200242146A1 (en) * 2019-01-24 2020-07-30 Andrew R. Kalukin Artificial intelligence system for generating conjectures and comprehending text, audio, and visual data using natural language understanding
CN110765235A (zh) * 2019-09-09 2020-02-07 深圳市人马互动科技有限公司 Method, apparatus, terminal, and readable medium for generating training data
CN110705255A (zh) * 2019-10-12 2020-01-17 京东数字科技控股有限公司 Method and apparatus for detecting association relationships between statements
CN113806489A (zh) * 2021-09-26 2021-12-17 北京有竹居网络技术有限公司 Method, electronic device, and computer program product for dataset creation

Also Published As

Publication number Publication date
CN113806489A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
US11281976B2 (en) Generative adversarial network based modeling of text for natural language processing
US9442917B2 (en) Detecting semantic errors in text using ontology-based extraction rules
WO2023045725A1 (zh) 用于数据集创建的方法、电子设备和计算机程序产品
US10956463B2 (en) System and method for generating improved search queries from natural language questions
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
US20230259707A1 (en) Systems and methods for natural language processing (nlp) model robustness determination
Burdisso et al. τ-SS3: A text classifier with dynamic n-grams for early risk detection over text streams
US11669740B2 (en) Graph-based labeling rule augmentation for weakly supervised training of machine-learning-based named entity recognition
Zhu et al. Building a Large-scale Software Programming Taxonomy from Stackoverflow.
JP6770709B2 (ja) 機械学習用モデル生成装置及びプログラム。
Wu et al. MFD: Multi-Feature Detection of LLM-Generated Text
Xiao et al. Machine learning-based automated essay scoring system for Chinese proficiency test (HSK)
Singh et al. Validation of inspection reviews over variable features set threshold
Bai et al. Gated character-aware convolutional neural network for effective automated essay scoring
Sangeetha et al. Information retrieval system for laws
Olivero Figurative Language Understanding based on Large Language Models
Baziyad et al. On the Linguistic Limitations of ChatGPT: An Experimental Case Study
Mizgajski et al. Return on investment in machine learning: Crossing the chasm between academia and business
Loyola et al. UNSL at eRisk 2022: Decision policies with history for early classification.
Gudmundsson et al. Swedish Natural Language Processing with Long Short-term Memory Neural Networks: A Machine Learning-powered Grammar and Spell-checker for the Swedish Language
Zhao et al. Test case classification via few-shot learning
US20230112740A1 (en) Textual content evaluation using machine learned models
Bourgeade From text to trust: a priori interpretability versus post hoc explainability in natural language processing
Bu et al. Prompt-based data labeling method for aspect based sentiment analysis
Yang Intelligent English Translation Evaluation System Based on Internet Automation Technology

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871775

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE