WO2021184547A1 - Conversation robot intention corpus generation method and apparatus, medium, and electronic device - Google Patents

Conversation robot intention corpus generation method and apparatus, medium, and electronic device Download PDF

Info

Publication number
WO2021184547A1
WO2021184547A1 PCT/CN2020/093043 CN2020093043W WO2021184547A1 WO 2021184547 A1 WO2021184547 A1 WO 2021184547A1 CN 2020093043 W CN2020093043 W CN 2020093043W WO 2021184547 A1 WO2021184547 A1 WO 2021184547A1
Authority
WO
WIPO (PCT)
Prior art keywords
similar sentence
sentence corpus
target
candidate
corpus
Prior art date
Application number
PCT/CN2020/093043
Other languages
French (fr)
Chinese (zh)
Inventor
陈亮
李治根
杨坤
许开河
周琳
王少军
王嘉雯
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021184547A1 publication Critical patent/WO2021184547A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems

Definitions

  • This application relates to the field of data processing technology, and in particular to a method, device, medium, and electronic equipment for generating intent corpus of a dialogue robot.
  • dialogue robots especially task-type dialogue robots, generally rely on intent recognition algorithms for intent recognition.
  • Dialogue robots generally perform corresponding actions based on the identified intent, such as verbal reply, information query, etc.
  • the inventor realizes that in order to ensure the quality of the dialogue when a dialogue robot is in a dialogue, it needs to have high requirements for the quantity and quality of similar sentences under each intent.
  • Different dialogue robots can perform dialogues with different tasks. There are often problems such as low-frequency problems accumulated by some dialogue robots, such as less intent corpus and imbalance in the number of intent corpora, which reduces the accuracy of intent recognition.
  • Labeling will also waste a lot of labor costs.
  • the purpose of this application is to provide a method, device, medium, and electronic equipment for generating the intention corpus of a dialogue robot.
  • a method for generating intention corpus of a dialogue robot including:
  • each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
  • the candidate similar sentence corpus in the candidate similar sentence corpus is determined to belong to The target similar sentence corpus of the target intention.
  • a device for generating an intention corpus of a dialogue robot comprising:
  • the first acquisition module is configured to acquire an intent set including a plurality of intents, wherein each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
  • the second acquisition module is configured to acquire the target similar sentence corpus included in the target intention as a target similar sentence corpus
  • the first determining module is configured to determine the similarity between the target similar sentence corpus and the similar sentence corpus;
  • a construction module configured to select candidate similar sentence corpora from the intent set based on the similarity to construct a candidate similar sentence corpus
  • the second determining module is configured to be based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus.
  • the candidate similar sentence corpus determines the target similar sentence corpus belonging to the target intention.
  • an electronic device including:
  • a memory where computer-readable instructions are stored, and when the computer-readable instructions are executed by the processor, it realizes:
  • each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
  • the candidate similar sentence corpus in the candidate similar sentence corpus is determined to belong to The target similar sentence corpus of the target intention.
  • a computer-readable storage medium which stores computer program instructions, and when the computer program instructions are executed by a computer, the computer executes the aforementioned method.
  • This application can realize the automatic expansion of intent corpus, increase the number of intent corpus, and make the number of corpus of each intent more balanced, thereby improving the accuracy of intent recognition to a certain extent, and reducing the need to expand the intent corpus. cost.
  • Fig. 1 is a schematic diagram showing a system architecture of a method for generating intention corpus of a dialogue robot according to an exemplary embodiment
  • Fig. 2 is a flow chart showing a method for generating intention corpus of a dialogue robot according to an exemplary embodiment
  • step 3 is a flowchart showing details of step 210 and step 220 of an embodiment according to the embodiment corresponding to FIG. 2;
  • FIG. 4 is a detailed flowchart of step 240 according to an embodiment shown in the embodiment corresponding to FIG. 2;
  • Fig. 5 is a block diagram showing a device for generating intention corpus of a dialogue robot according to an exemplary embodiment
  • Fig. 6 is a block diagram showing an example of an electronic device for realizing the above-mentioned method for generating intent corpus of a dialogue robot according to an exemplary embodiment
  • Fig. 7 shows a computer-readable storage medium for realizing the above-mentioned method for generating an intention corpus of a dialog robot according to an exemplary embodiment.
  • the intent corpus generation method of the dialogue robot provided in this application can also be applied to the field of artificial intelligence, and enhance the universal applicability of the dialogue robot based on machine learning and deep learning.
  • Dialogue robots can be various robots that can conduct human-machine dialogue with humans.
  • Dialogue robots can include multiple models or algorithms, such as language models, acoustic models, etc., and dialogue robots can conduct text, voice, or video voice dialogue with humans.
  • the relationship between the intention of the dialogue robot and the corpus is a different way of expressing meaning.
  • the meaning is the intention, and a specific way of expression is a kind of corpus. Therefore, an intention of the dialogue robot usually corresponds to multiple similar corpora, and different dialogues. Robots also have different intentions and corpus.
  • the general method of intent recognition by dialogue robots during human-machine dialogue is classification algorithms based on statistical learning or deep learning to learn which similar corpus corresponds to each intent to classify the intent.
  • the intention corpus generation is the process of adding corpus for a certain intent of the dialog robot, that is, the method for generating the intention corpus of the dialog robot provided in this application can increase the corpus of a certain intent of the dialog robot.
  • the implementation terminal of this application can be any device with computing, processing, and storage functions.
  • the device can be connected to an external device for receiving or sending data.
  • it can be a portable mobile device, such as a smart phone, a tablet computer, a notebook computer, PDA (Personal Digital Assistant), etc., can also be fixed devices, such as computer equipment, field terminals, desktop computers, servers, workstations, etc., or a collection of multiple devices, such as cloud computing physical infrastructure or server clusters .
  • the implementation terminal of this application may be a server or a physical infrastructure of cloud computing.
  • Fig. 1 is a schematic diagram showing a system architecture of a method for generating an intention corpus of a dialogue robot according to an exemplary embodiment.
  • the system architecture includes a server 110, a plurality of robot terminals 120, and a database 130 corresponding to each robot terminal 120. Between each robot terminal 120 and the server 110, each robot terminal 120 and the corresponding database 130 They are all connected by a communication link, so that data can be received and sent.
  • Each robot terminal 120 is fixedly equipped with a dialogue robot, and the database 130 corresponding to the robot terminal 120 stores data used by the dialogue robot to conduct a dialogue. For example, it may include intent and corresponding corpus data.
  • the corpus data may be text, etc.
  • the database 130 corresponding to each robot terminal 120 can store multiple corpus data corresponding to multiple intents.
  • the server 110 is the implementation terminal of the application.
  • the server 110 can operate the corpus data in the database 130 corresponding to each robot terminal 120 through each robot terminal 120, for example, it can correspond to a robot terminal 120.
  • the database 130 acquires corpus data and transfers the acquired corpus data to the database 130 corresponding to other robot terminals 120, so that the corpus corresponding to the intention of a certain dialogue robot can be added.
  • Fig. 1 is only an embodiment of the present application.
  • the implementation terminal in this embodiment is a server, in other embodiments, the implementation terminal may be various terminals or devices as described above; although in this embodiment, different dialog robots are fixed on different terminals.
  • the corpus of intents corresponding to different dialog robots are also stored in different databases.
  • the corpus of intents corresponding to each dialog robot and/or each dialog robot can be stored in the same terminal or in different databases.
  • each dialogue robot and the corresponding intent corpus can also be stored locally in the implementation terminal of this application, this application does not make any limitation on this, and the scope of protection of this application should not be restricted in any way.
  • Fig. 2 is a flowchart showing a method for generating intention corpus of a dialogue robot according to an exemplary embodiment.
  • the method for generating dialogue robot intention corpus provided in this embodiment can be executed by a server, as shown in FIG. 2, and includes the following steps:
  • Step 210 Obtain an intent set including multiple intents.
  • each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent.
  • Each intent corresponds to a dialogue robot means that the intent is the intent of the dialogue robot, and the dialogue robot can use the intent to have a dialogue with humans.
  • each intent includes an identification of the dialogue robot, and the intent corresponds to the dialogue robot by including the identification of the dialogue robot.
  • the relationship between intention and corpus is the relationship between a meaning and different expressions corresponding to the meaning.
  • a meaning is equivalent to an intention, and a specific expression corresponding to the meaning is equivalent to a corpus. .
  • the corpus included in the same intention is usually similar, so it is called similar sentence corpus.
  • the two corpus "I don't know medical insurance” and “what does medical insurance mean” are similar sentence corpora, and both belong to the intention of "I want to know a detailed introduction about medical insurance”.
  • the intent set W including multiple intents can be expressed by the following expression:
  • bracket pair In the same bracket pair are an intention I x and a similar sentence corpus S xi included in the intention.
  • I 1 can represent the intention numbered 1
  • S 11 can represent the first similar intention included in the intention.
  • Sentence corpus, S 12 can represent the second similar sentence corpus included in the intention, and so on.
  • the intent set is pre-stored locally, and the acquiring an intent set including a plurality of intents includes: reading the intent set including a plurality of intents locally.
  • the intent set is pre-stored in a database
  • the obtaining an intent set including a plurality of intents includes: obtaining an intent set including a plurality of intents by querying the database.
  • the intent set is pre-stored in a target terminal other than the local terminal, and the acquiring of intent sets including multiple intents includes:
  • the intent set may also be pre-stored in a blockchain network node.
  • the intent set can be shared between different platforms, and data can also be prevented from being tampered with.
  • the intent set needs to be obtained, it can be obtained directly from the blockchain by invoking the smart contract.
  • the blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • the blockchain is essentially a decentralized database, which is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify the validity of the information. (Anti-counterfeiting) and generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Step 220 Obtain the target similar sentence corpus included in the target intention as a target similar sentence corpus.
  • the acquiring the target similar sentence corpus included in the target intention as a target similar sentence corpus includes: reading the target similar sentence corpus included in the target intention from a local preset path as the target similarity Sentence corpus collection.
  • FIG. 3 is a flowchart showing details of step 210 and step 220 of an embodiment according to the embodiment corresponding to FIG. 2. As shown in Figure 3, it includes the following steps:
  • a plurality of intents is selected from a total set of intents including a plurality of intents to form an intent sub-set based on the first predetermined rule.
  • each intention includes a plurality of similar sentence corpora, and each intention in the total set of intentions corresponds to a dialogue robot.
  • the multiple intentions that make up the intention sub-set can be selected from the total set of intentions based on various methods or rules.
  • the first predetermined rule can be to randomly select multiple intentions from the total set of intentions to form the intention sub-set, or according to
  • the generation sequence of each intent is to sequentially select a predetermined number of intents from a total set of intents including a plurality of intents to form an intent sub-set.
  • Step 221 Select the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent sub-set based on the second predetermined rule.
  • the selection of the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent subset based on the second predetermined rule includes:
  • the intent that includes the least corpus of similar sentences is selected as the target intent.
  • the selection of the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent subset based on the second predetermined rule includes:
  • intents corresponding to all other dialogue robots except for the intents corresponding to the intents are determined to include intents whose number of similar sentence corpus is less than the first predetermined number as the first candidate target intent;
  • Any one of the first candidate target intentions is selected as the target intention.
  • the intents included in the similar sentence corpus that are less than the first predetermined number have the same possibility of being selected as the target intent, which improves fairness, and because the selected target intent includes similar sentences
  • the number of corpus is less than the first predetermined number, and corpus can be generated preferentially for low-frequency intent.
  • the selection of the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent subset based on the second predetermined rule includes:
  • intents corresponding to all other dialog robots except for the intents corresponding to the intents are determined to include intents whose number of similar sentence corpus is less than the minimum value, as the second candidate target intent;
  • Any one of the second candidate target intentions is selected as the target intention.
  • Step 222 Obtain the similar sentence corpus included in the target intention as the target similar sentence corpus, and obtain the target similar sentence corpus.
  • This embodiment is an example of obtaining similar sentence corpus from intents outside the intent set.
  • Step 230 Determine the similarity between the target similar sentence corpus and the similar sentence corpus.
  • the target similar sentence corpus and the similar sentence corpus are respectively composed of multiple word elements, and the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
  • s 1 represents the target similar sentence corpus
  • s 2 represents the similar sentence corpus
  • Len is used to find the number of word elements in the set
  • f score (s 1 , s 2 ) is the target similar sentence corpus The degree of similarity with the similar sentence corpus.
  • Len(s 1 ⁇ s 2 ) is used to calculate the number of word elements included in the target similar sentence corpus and the similar sentence corpus
  • Len(s 1 ⁇ s 2 ) is used to calculate the target similarity The sentence corpus and the number of all word elements contained in the similar sentence corpus.
  • the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
  • the similarity between the target similar sentence corpus and each similar sentence corpus is determined.
  • the number of similarities between the determined target similar sentence corpus and the similar sentence corpus is maximized, so that the scale of the established candidate similar sentence corpus can be maximized.
  • the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
  • Step 240 Select a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus.
  • step 240 may include the following steps:
  • Step 241 For each intent in the intent set, if there is a similar sentence corpus in the similar sentence corpus included in the intent, the similarity between the similar sentence corpus and the target similar sentence corpus is greater than a predetermined similarity threshold, then all the intents included in the intent are acquired. Similar sentence corpus is used as candidate similar sentence corpus.
  • the predetermined similarity threshold may be a floating point number in the range of [0, 1].
  • Step 242 Use all the obtained candidate similar sentence corpora to construct a candidate similar sentence corpus.
  • the similar sentence corpus not only guarantees the number of candidate similar sentence corpus in the constructed candidate similar sentence corpus, but also for an intent, if it is determined that the similar sentence corpus included in the intent has a similar sentence corpus and the target similar sentence corpus If the similarity of is greater than the predetermined similarity threshold, there is no need to judge other similar sentence corpus of the intention, and the amount of calculation can also be reduced.
  • the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
  • the selecting candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
  • For each similar sentence corpus determine the average value of the similarity between each target similar sentence corpus and the similar sentence corpus;
  • the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
  • the selecting candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
  • For each similar sentence corpus determine the maximum value of the similarity between each target similar sentence corpus and the similar sentence corpus;
  • a similar sentence corpus with the maximum value greater than a predetermined maximum similarity threshold is obtained as a candidate similar sentence corpus, and a candidate similar sentence corpus is constructed using all the obtained candidate similar sentence corpora.
  • the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
  • the selecting candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
  • For each similar sentence corpus determine the minimum similarity between each target similar sentence corpus and the similar sentence corpus
  • Step 250 Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, in the candidate similar sentence corpus in the candidate similar sentence corpus Determine the target similar sentence corpus belonging to the target intention.
  • step 250 may include:
  • the following formula is used to calculate each candidate similar sentence corpus in the candidate similar sentence corpus Based on the score, the target similar sentence corpus belonging to the target intention is determined in the candidate similar sentence corpus of the candidate similar sentence corpus in the candidate similar sentence corpus:
  • s i and s j represent the target similar sentence corpus
  • s k represents the candidate similar sentence corpus
  • Len is used to find the number of word elements in the set
  • f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus
  • C is the candidate similar sentence corpus
  • O is the target similar sentence corpus
  • n is the candidate similar sentence in the candidate similar sentence corpus
  • the number of corpora m is the number of the target similar sentence corpus in the target similar sentence corpus
  • is a weighting factor
  • selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus.
  • can be 0.7, then 1- ⁇ is 0.3.
  • This part calculates the average of the similarity between the target similar sentence corpus in the target similar sentence corpus and the candidate similar sentence corpus in the candidate similar sentence corpus.
  • the average similarity degree of the candidate similar sentence corpus This part calculates the maximum similarity between the target similar sentence corpus in the target similar sentence corpus and the candidate similar sentence corpus in the candidate similar sentence corpus.
  • the above formula takes into account the selection of candidate similar sentence corpus with high average similarity, which can ensure that the target similar sentence corpus is similar to the original target intent. At the same time, it also calculates the total similarity score and deducts a certain weight. The similarity between the candidate similar sentence corpus and a target similar sentence corpus in the existing target similar sentence corpus can ensure that the generated target similar sentence corpus is a semantic supplement to the existing target similar sentence corpus.
  • the candidate similar sentence is calculated using the following formula
  • the score of each candidate similar sentence corpus in the corpus, and the target similar sentence corpus that belongs to the target intention is determined based on the score in the candidate similar sentence corpus of the candidate similar sentence corpus, including:
  • the step of selecting the target similar sentence corpus is iteratively performed, and the step of selecting the target similar sentence corpus includes:
  • the step of determining the score of candidate similar sentence corpus is performed, and the step of determining the score of candidate similar sentence corpus includes: based on the comparison between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus. For similarity, the following formula is used to calculate the score of each candidate similar sentence corpus in the candidate similar sentence corpus:
  • s i and s j represent the target similar sentence corpus
  • s k represents the candidate similar sentence corpus
  • Len is used to find the number of word elements in the set
  • f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus
  • C is the candidate similar sentence corpus
  • O is the target similar sentence corpus
  • n is the candidate similar sentence in the candidate similar sentence corpus
  • the number of corpora m is the number of the target similar sentence corpus in the target similar sentence corpus
  • is a weighting factor
  • selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus;
  • the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
  • the step of determining the score of the candidate similar sentence corpus is transferred again, and the amplified target similar sentence is used.
  • the corpus set recalculates the score of each candidate similar sentence corpus in the candidate similar sentence corpus, so that the determined score of the candidate similar sentence corpus becomes more and more accurate, thereby ensuring the quality of the target similar sentence corpus added to the target similar sentence corpus
  • the candidate similar sentence corpus with the highest score and the score reaching the predetermined score threshold is selected each time and added to the target similar sentence corpus, so that the candidate similar sentence corpus added to the target similar sentence corpus is always the candidate similar sentence corpus The highest score in the set, thereby further ensuring the quality of the transferred target similar sentence corpus.
  • the candidate similar sentence corpus is labeled.
  • all candidate similar sentence corpora in the candidate similar sentence corpus are labeled, It is determined that all candidate similar sentence corpora in the candidate similar sentence corpus have been judged.
  • the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
  • a candidate similar sentence corpus whose score reaches a predetermined score threshold is obtained as the target similar sentence corpus belonging to the target intention.
  • the target similar sentence corpus is determined by comparing the score with a predetermined score threshold, which ensures the rationality of the selected target similar sentence corpus.
  • the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
  • the third predetermined number of candidate similar sentence corpora with the score reaching the predetermined score threshold is randomly selected as belonging to The target similar sentence corpus of the target intention;
  • the candidate similar sentence corpus whose score reaches the predetermined score threshold is obtained as the target similar sentence corpus belonging to the target intention.
  • the number of candidate similar sentence corpora whose score reaches the predetermined score threshold is too large, the number of target similar sentence corpora to be finally selected is limited.
  • the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
  • the step of determining the target candidate similar sentence corpus includes: obtaining the candidate similar sentence corpus with the highest score among the candidate similar sentence corpora in the candidate similar sentence corpus, as the target candidate similarity Sentence corpus
  • the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
  • the candidate similar sentence corpus with the highest score is selected each time, and the candidate similar sentence corpus is added to the target similar sentence corpus when it is judged that the score of the candidate similar sentence corpus reaches a predetermined score threshold. , So that the candidate similar sentence corpus added to the target similar sentence corpus has the highest score, thereby ensuring the quality of the transferred target similar sentence corpus.
  • the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
  • one candidate similar sentence corpus is selected each time, and if the score of the candidate similar sentence corpus reaches a predetermined score threshold, the candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus , And delete the candidate similar sentence corpus from the candidate similar sentence corpus until the number of target similar sentence corpora included in the target similar sentence corpus reaches the second predetermined number or the selected candidate similar sentence corpus has no score
  • the predetermined score threshold is reached.
  • the corpus of other intentions is transferred to the intention that needs to be expanded, so as to realize the automatic expansion of the intention corpus and improve the intention.
  • the number of corpus can make the corpus of each intent more balanced, thereby improving the accuracy of intent recognition to a certain extent, and also reducing the cost of expanding the intent corpus.
  • the present application also provides a device for generating the intention corpus of a dialogue robot.
  • the following are device embodiments of the present application.
  • Fig. 5 is a block diagram showing a device for generating intention corpus of a dialogue robot according to an exemplary embodiment. As shown in FIG. 5, the device 500 includes:
  • the first obtaining module 510 is configured to obtain an intent set including a plurality of intents, wherein each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
  • the second acquiring module 520 is configured to acquire the target similar sentence corpus included in the target intention as a target similar sentence corpus;
  • the first determining module 530 is configured to determine the similarity between the target similar sentence corpus and the similar sentence corpus;
  • the construction module 540 is configured to select candidate similar sentence corpora from the intent set based on the similarity to construct a candidate similar sentence corpus;
  • the second determining module 550 is configured to determine, based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, in the candidate similar sentence corpus, Identify the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus.
  • an electronic device capable of implementing the above method.
  • the electronic device 600 according to this embodiment of the present application will be described below with reference to FIG. 6.
  • the electronic device 600 shown in FIG. 6 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the electronic device 600 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 600 may include, but are not limited to: the aforementioned at least one processing unit 610, the aforementioned at least one storage unit 620, and a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610).
  • the storage unit stores program code, and the program code can be executed by the processing unit 610, so that the processing unit 610 executes the various exemplary methods described in the above-mentioned "Embodiment Method" section of this specification. Steps of implementation.
  • the storage unit 620 may include a readable storage medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 621 and/or a cache storage unit 622, and may further include a read-only storage unit (ROM) 623.
  • RAM random access storage unit
  • ROM read-only storage unit
  • the storage unit 620 may also include a program/utility tool 624 having a set of (at least one) program module 625.
  • program module 625 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples or some combination may include the implementation of a network environment.
  • the bus 630 may represent one or more of several types of bus structures, including a storage unit bus or a storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local area using any bus structure among multiple bus structures. bus.
  • the electronic device 600 may also communicate with one or more external devices 800 (such as keyboards, pointing devices, Bluetooth devices, etc.), and may also communicate with one or more devices that enable a user to interact with the electronic device 600, and/or communicate with Any device (such as a router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 650.
  • the electronic device 600 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 660.
  • networks for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet
  • the network adapter 660 communicates with other modules of the electronic device 600 through the bus 630. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
  • the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiment of the present application.
  • a computing device which can be a personal computer, a server, a terminal device, or a network device, etc.
  • a computer-readable storage medium on which is stored a program product capable of implementing the above-mentioned method in this specification.
  • the computer-readable storage medium may be non-volatile or volatile.
  • each aspect of the present application can also be implemented in the form of a program product, which includes program code.
  • the program product runs on a terminal device, the program code is used to make the The terminal device executes the steps according to various exemplary embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
  • a program product 700 for implementing the above method according to an embodiment of the present application is described. It can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can be stored in a terminal device, For example, running on a personal computer.
  • CD-ROM compact disk read-only memory
  • the program product of this application is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
  • the program product may adopt any combination of one or more readable storage media.
  • the readable storage medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable Type programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable storage medium other than the readable storage medium, and the readable storage medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable storage medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • the program code used to perform the operations of the present application can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural programming languages. Programming language-such as "C" language or similar programming language.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, executed as an independent software package, partly on the user's computing device and partly executed on the remote computing device, or entirely on the remote computing device or server Executed on.
  • the remote computing device can be connected to a user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, using Internet service providers). Business to connect via the Internet).
  • LAN local area network
  • WAN wide area network
  • Internet service providers for example, using Internet service providers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A conversation robot intention corpus generation method and apparatus, a medium, and an electronic device, relating to the field of data processing. The method comprises: obtaining an intention set comprising a plurality of intentions (210); obtaining a target similar sentence corpus of a target intention as a target similar sentence corpus set (220); determining similarity between the target similar sentence corpus and a similar sentence corpus (230); according to the similarity, selecting candidate similar sentence corpora from the intention set to construct a candidate similar sentence corpus set (240); and according to the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus set and the target similar sentence corpus in the target similar sentence corpus set, determining, from the candidate similar sentence corpora of the candidate similar sentence corpus set, the target similar sentence corpus belonging to the target intention (250). The method achieves the automatic expansion of intention corpora, increases the number of the intention corpora, and can make the number of the corpora of each intention be more balanced, thereby further improving the accuracy of intention recognition, and further reducing the cost required for expanding the intention corpora.

Description

对话机器人意图语料生成方法、装置、介质及电子设备Method, device, medium and electronic equipment for generating intention corpus of dialogue robot
本申请要求于2020年3月20日提交中国专利局,申请号为2020102010018、发明名称为“对话机器人意图语料生成方法、装置、介质及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on March 20, 2020, the application number is 2020102010018, and the invention title is "Methods, Apparatus, Media and Electronic Equipment for Generating Conversational Robot Intent Corpus", all of which are approved The reference is incorporated in this application.
技术领域Technical field
本申请涉及数据处理技术领域,特别涉及一种对话机器人意图语料生成方法、装置、介质及电子设备。This application relates to the field of data processing technology, and in particular to a method, device, medium, and electronic equipment for generating intent corpus of a dialogue robot.
背景技术Background technique
目前对话机器人,特别是任务型对话机器人普遍依赖意图识别算法进行意图识别,对话机器人一般根据识别后的意图执行相应的动作,如话术回复,信息查询等。然而,发明人意识到,对话机器人在进行对话时,若要保证对话质量,需要对每个意图下的相似句数量和质量都有很高的要求。不同对话机器人能够执行不同任务的对话,经常存在有些对话机器人积累的低频问题的意图语料较少、各意图语料间数量不均衡等问题,从而降低了意图识别的准确率,此外,如果安排标注人员进行标注,还会浪费大量的人工成本。At present, dialogue robots, especially task-type dialogue robots, generally rely on intent recognition algorithms for intent recognition. Dialogue robots generally perform corresponding actions based on the identified intent, such as verbal reply, information query, etc. However, the inventor realizes that in order to ensure the quality of the dialogue when a dialogue robot is in a dialogue, it needs to have high requirements for the quantity and quality of similar sentences under each intent. Different dialogue robots can perform dialogues with different tasks. There are often problems such as low-frequency problems accumulated by some dialogue robots, such as less intent corpus and imbalance in the number of intent corpora, which reduces the accuracy of intent recognition. In addition, if annotator is arranged Labeling will also waste a lot of labor costs.
发明内容Summary of the invention
在数据处理技术领域,为了解决上述技术问题,本申请的目的在于提供一种对话机器人意图语料生成方法、装置、介质及电子设备。In the field of data processing technology, in order to solve the above-mentioned technical problems, the purpose of this application is to provide a method, device, medium, and electronic equipment for generating the intention corpus of a dialogue robot.
根据本申请的一方面,提供了一种对话机器人意图语料生成方法,所述方法包括:According to an aspect of the present application, there is provided a method for generating intention corpus of a dialogue robot, the method including:
获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;Acquiring an intent set including a plurality of intents, where each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;Obtain the target similar sentence corpus included in the target intention as a target similar sentence corpus;
确定所述目标相似句语料与所述相似句语料的相似度;Determine the similarity between the target similar sentence corpus and the similar sentence corpus;
基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;Selecting a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus;
基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the candidate similar sentence corpus in the candidate similar sentence corpus is determined to belong to The target similar sentence corpus of the target intention.
根据本申请的另一方面,提供了一种对话机器人意图语料生成装置,所述装置包括:According to another aspect of the present application, there is provided a device for generating an intention corpus of a dialogue robot, the device comprising:
第一获取模块,被配置为获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;The first acquisition module is configured to acquire an intent set including a plurality of intents, wherein each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
第二获取模块,被配置为获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;The second acquisition module is configured to acquire the target similar sentence corpus included in the target intention as a target similar sentence corpus;
第一确定模块,被配置为确定所述目标相似句语料与所述相似句语料的相似度;The first determining module is configured to determine the similarity between the target similar sentence corpus and the similar sentence corpus;
构建模块,被配置为基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;A construction module configured to select candidate similar sentence corpora from the intent set based on the similarity to construct a candidate similar sentence corpus;
第二确定模块,被配置为基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。The second determining module is configured to be based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus. The candidate similar sentence corpus determines the target similar sentence corpus belonging to the target intention.
根据本申请的另一方面,提供了一种电子设备,所述电子设备包括:According to another aspect of the present application, there is provided an electronic device, the electronic device including:
处理器;processor;
存储器,所述存储器上存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,实现:A memory, where computer-readable instructions are stored, and when the computer-readable instructions are executed by the processor, it realizes:
获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;Acquiring an intent set including a plurality of intents, where each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;Obtain the target similar sentence corpus included in the target intention as a target similar sentence corpus;
确定所述目标相似句语料与所述相似句语料的相似度;Determine the similarity between the target similar sentence corpus and the similar sentence corpus;
基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;Selecting a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus;
基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the candidate similar sentence corpus in the candidate similar sentence corpus is determined to belong to The target similar sentence corpus of the target intention.
根据本申请的另一方面,提供了一种计算机可读存储介质,其存储有计算机程序指令,当所述计算机程序指令被计算机执行时,使计算机执行如前所述的方法。According to another aspect of the present application, a computer-readable storage medium is provided, which stores computer program instructions, and when the computer program instructions are executed by a computer, the computer executes the aforementioned method.
本申请可实现意图语料的自动扩充,提高了意图语料的数量,可以使各意图的语料数量更为均衡,进而在一定程度上提高了意图识别的准确率,还降低了扩展意图语料所需的成本。This application can realize the automatic expansion of intent corpus, increase the number of intent corpus, and make the number of corpus of each intent more balanced, thereby improving the accuracy of intent recognition to a certain extent, and reducing the need to expand the intent corpus. cost.
附图说明Description of the drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。The drawings herein are incorporated into the specification and constitute a part of the specification, show embodiments that conform to the application, and are used together with the specification to explain the principle of the application.
图1是根据一示例性实施例示出的一种对话机器人意图语料生成方法的系统架构示意图;Fig. 1 is a schematic diagram showing a system architecture of a method for generating intention corpus of a dialogue robot according to an exemplary embodiment;
图2是根据一示例性实施例示出的一种对话机器人意图语料生成方法的流程图;Fig. 2 is a flow chart showing a method for generating intention corpus of a dialogue robot according to an exemplary embodiment;
图3是根据图2对应实施例示出的一实施例的步骤210及步骤220的细节的流程图;3 is a flowchart showing details of step 210 and step 220 of an embodiment according to the embodiment corresponding to FIG. 2;
图4是根据图2对应实施例示出的一实施例的步骤240的细节流程图;FIG. 4 is a detailed flowchart of step 240 according to an embodiment shown in the embodiment corresponding to FIG. 2;
图5是根据一示例性实施例示出的一种对话机器人意图语料生成装置的框图;Fig. 5 is a block diagram showing a device for generating intention corpus of a dialogue robot according to an exemplary embodiment;
图6是根据一示例性实施例示出的一种实现上述对话机器人意图语料生成方法的电子设备示例框图;Fig. 6 is a block diagram showing an example of an electronic device for realizing the above-mentioned method for generating intent corpus of a dialogue robot according to an exemplary embodiment;
图7是根据一示例性实施例示出的一种实现上述对话机器人意图语料生成方法的计算机可读存储介质。Fig. 7 shows a computer-readable storage medium for realizing the above-mentioned method for generating an intention corpus of a dialog robot according to an exemplary embodiment.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。The exemplary embodiments will be described in detail here, and examples thereof are shown in the accompanying drawings. When the following description refers to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present application. On the contrary, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
此外,附图仅为本申请的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。In addition, the drawings are only schematic illustrations of the application and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and thus their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities.
本申请提供的对话机器人意图语料生成方法,可同样适用于人工智能领域,基于机器学习、深度学习等方面增强对话机器人的普适度。对话机器人可以是各种能够与人类进行人机对话的机器人,对话机器人可以包括多种模型或者算法,比如可以包括语言模型、声学模型等,对话机器人可以与人类进行文本、语音或者视频语音对话。对话机器人的意图和语料的关系就是一种含义的不同表达方式,含义就是意图,而一种具体的表达方式就是一种语料,因此,对话机器人的一个意图通常对应多个相似的语料,不同对话机器人也有不同的意图和语料。对话机器人在进行人机对话时采用意图识别的一般方式为基于统计学习或深度学习的分类算法来学习每个意图下对应了哪些相似的语料来进行意图分类。而意图语料生成就是为对话机器人的某一意图增加语料的过程,也就是说,本申请提供的对话机器人意图语料生成方法可以增加对话机器人的某一意图的语料。The intent corpus generation method of the dialogue robot provided in this application can also be applied to the field of artificial intelligence, and enhance the universal applicability of the dialogue robot based on machine learning and deep learning. Dialogue robots can be various robots that can conduct human-machine dialogue with humans. Dialogue robots can include multiple models or algorithms, such as language models, acoustic models, etc., and dialogue robots can conduct text, voice, or video voice dialogue with humans. The relationship between the intention of the dialogue robot and the corpus is a different way of expressing meaning. The meaning is the intention, and a specific way of expression is a kind of corpus. Therefore, an intention of the dialogue robot usually corresponds to multiple similar corpora, and different dialogues. Robots also have different intentions and corpus. The general method of intent recognition by dialogue robots during human-machine dialogue is classification algorithms based on statistical learning or deep learning to learn which similar corpus corresponds to each intent to classify the intent. The intention corpus generation is the process of adding corpus for a certain intent of the dialog robot, that is, the method for generating the intention corpus of the dialog robot provided in this application can increase the corpus of a certain intent of the dialog robot.
本申请的实施终端可以是任何具有运算、处理以及存储功能的设备,该设备可以与外 部设备相连,用于接收或者发送数据,具体可以是便携移动设备,例如智能手机、平板电脑、笔记本电脑、PDA(Personal Digital Assistant)等,也可以是固定式设备,例如,计算机设备、现场终端、台式电脑、服务器、工作站等,还可以是多个设备的集合,比如云计算的物理基础设施或者服务器集群。The implementation terminal of this application can be any device with computing, processing, and storage functions. The device can be connected to an external device for receiving or sending data. Specifically, it can be a portable mobile device, such as a smart phone, a tablet computer, a notebook computer, PDA (Personal Digital Assistant), etc., can also be fixed devices, such as computer equipment, field terminals, desktop computers, servers, workstations, etc., or a collection of multiple devices, such as cloud computing physical infrastructure or server clusters .
可选地,本申请的实施终端可以为服务器或者云计算的物理基础设施。Optionally, the implementation terminal of this application may be a server or a physical infrastructure of cloud computing.
图1是根据一示例性实施例示出的一种对话机器人意图语料生成方法的系统架构示意图。如图1所示,该系统架构包括服务器110、多个机器人终端120以及与每一机器人终端120对应的数据库130,各机器人终端120与服务器110之间、每一机器人终端120与对应的数据库130之间均通过通信链路相连,从而可以进行数据的接收和发送。每一机器人终端120上固设有对话机器人,而机器人终端120对应的数据库130则存储有对话机器人进行对话所使用的数据,比如,可以包括意图和对应的语料数据,语料数据比如可以是文本等类型的数据,每一机器人终端120对应的数据库130可以存储多个意图对应的多个语料数据。在图1所示实施例中,服务器110为本申请的实施终端,服务器110可以经由各机器人终端120对各机器人终端120对应的数据库130中的语料数据进行操作,比如可以从一个机器人终端120对应的数据库130获取语料数据并将获取的该语料数据迁移至其他机器人终端120对应的数据库130,这样就可以增加某一对话机器人对应意图的语料。Fig. 1 is a schematic diagram showing a system architecture of a method for generating an intention corpus of a dialogue robot according to an exemplary embodiment. As shown in FIG. 1, the system architecture includes a server 110, a plurality of robot terminals 120, and a database 130 corresponding to each robot terminal 120. Between each robot terminal 120 and the server 110, each robot terminal 120 and the corresponding database 130 They are all connected by a communication link, so that data can be received and sent. Each robot terminal 120 is fixedly equipped with a dialogue robot, and the database 130 corresponding to the robot terminal 120 stores data used by the dialogue robot to conduct a dialogue. For example, it may include intent and corresponding corpus data. The corpus data may be text, etc. For types of data, the database 130 corresponding to each robot terminal 120 can store multiple corpus data corresponding to multiple intents. In the embodiment shown in FIG. 1, the server 110 is the implementation terminal of the application. The server 110 can operate the corpus data in the database 130 corresponding to each robot terminal 120 through each robot terminal 120, for example, it can correspond to a robot terminal 120. The database 130 acquires corpus data and transfers the acquired corpus data to the database 130 corresponding to other robot terminals 120, so that the corpus corresponding to the intention of a certain dialogue robot can be added.
值得一提的是,图1仅为本申请的一个实施例。虽然在本实施例中的实施终端为服务器,但在其他实施例中,实施终端可以为如前所述的各种终端或设备;虽然在本实施例中,不同对话机器人固设在不同的终端上,不同对话机器人对应的意图的语料也分别存储于不同的数据库中,但在其他实施例或者具体应用中,各对话机器人和/或各对话机器人对应的意图的语料可以存储于同一终端或不同的终端上,各对话机器人以及对应的意图的语料还可以存储于本申请的实施终端本地,本申请对此不作任何限定,本申请的保护范围也不应因此而受到任何限制。It is worth mentioning that Fig. 1 is only an embodiment of the present application. Although the implementation terminal in this embodiment is a server, in other embodiments, the implementation terminal may be various terminals or devices as described above; although in this embodiment, different dialog robots are fixed on different terminals. Above, the corpus of intents corresponding to different dialog robots are also stored in different databases. However, in other embodiments or specific applications, the corpus of intents corresponding to each dialog robot and/or each dialog robot can be stored in the same terminal or in different databases. On the terminal of, each dialogue robot and the corresponding intent corpus can also be stored locally in the implementation terminal of this application, this application does not make any limitation on this, and the scope of protection of this application should not be restricted in any way.
图2是根据一示例性实施例示出的一种对话机器人意图语料生成方法的流程图。本实施例提供的对话机器人意图语料生成方法可以由服务器执行,如图2所示,包括以下步骤:Fig. 2 is a flowchart showing a method for generating intention corpus of a dialogue robot according to an exemplary embodiment. The method for generating dialogue robot intention corpus provided in this embodiment can be executed by a server, as shown in FIG. 2, and includes the following steps:
步骤210,获取包括多个意图的意图集合。Step 210: Obtain an intent set including multiple intents.
其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图。Among them, each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent.
每一意图对应一个对话机器人是指,意图是对话机器人的意图,对话机器人可以利用意图对人类进行对话。Each intent corresponds to a dialogue robot means that the intent is the intent of the dialogue robot, and the dialogue robot can use the intent to have a dialogue with humans.
在一个实施例中,每一意图包括对话机器人的标识,所述意图通过包括的对话机器人的标识与对话机器人相对应。In one embodiment, each intent includes an identification of the dialogue robot, and the intent corresponds to the dialogue robot by including the identification of the dialogue robot.
如前所述,意图和语料的关系就是一种含义和该含义对应的不同表达方式的关系,一种含义就相当于一种意图,该含义对应的一种具体表达方式就相当于一种语料。同一意图包括的语料之间通常是相似的,所以称为相似句语料。比如,在保险领域,“我不知道医疗险”和“医疗险是什么意思”这两个语料是相似句语料,都属于“我想知道关于医疗险的详细介绍”这一意图。As mentioned above, the relationship between intention and corpus is the relationship between a meaning and different expressions corresponding to the meaning. A meaning is equivalent to an intention, and a specific expression corresponding to the meaning is equivalent to a corpus. . The corpus included in the same intention is usually similar, so it is called similar sentence corpus. For example, in the field of insurance, the two corpus "I don't know medical insurance" and "what does medical insurance mean" are similar sentence corpora, and both belong to the intention of "I want to know a detailed introduction about medical insurance".
在一个实施例中,包括多个意图的意图集合W可以利用如下表达式来表示:In one embodiment, the intent set W including multiple intents can be expressed by the following expression:
W=[(I 1→S 11),(I 1→S 12),…,(I x→S xi)], W=[(I 1 →S 11 ),(I 1 →S 12 ),…,(I x →S xi )],
其中属于同一括号对内的分别为一个意图I x和该意图包括的一个相似句语料S xi,比如,I 1可以代表编号为1的意图,而S 11可以代表该意图包括的第一个相似句语料,S 12可以代表该意图包括的第二个相似句语料,以此类推。 In the same bracket pair are an intention I x and a similar sentence corpus S xi included in the intention. For example, I 1 can represent the intention numbered 1, and S 11 can represent the first similar intention included in the intention. Sentence corpus, S 12 can represent the second similar sentence corpus included in the intention, and so on.
在一个实施例中,所述意图集合预先存储在本地,所述获取包括多个意图的意图集合,包括:从本地读取包括多个意图的意图集合。In one embodiment, the intent set is pre-stored locally, and the acquiring an intent set including a plurality of intents includes: reading the intent set including a plurality of intents locally.
在一个实施例中,所述意图集合预先存储在数据库中,所述获取包括多个意图的意图集合,包括:通过查询数据库获取包括多个意图的意图集合。In one embodiment, the intent set is pre-stored in a database, and the obtaining an intent set including a plurality of intents includes: obtaining an intent set including a plurality of intents by querying the database.
在一个实施例中,所述意图集合预先存储在本端之外的目标终端中,所述获取包括多个意图的意图集合,包括:In an embodiment, the intent set is pre-stored in a target terminal other than the local terminal, and the acquiring of intent sets including multiple intents includes:
向目标终端发送意图集合获取请求;Send an intent set acquisition request to the target terminal;
接收目标终端根据所述意图集合获取请求做出响应时返回的包括多个意图的意图集合。Receive an intent set including multiple intents returned when the target terminal responds according to the intent set acquisition request.
在一个实施例中,所述意图集合还可预先存储至区块链网络节点中,通过区块链存储,实现意图集合在不同平台之间的共享,也可防止数据被篡改。当需要获取意图集合时,可通过调用智能合约的方式直接从区块链中获取。In one embodiment, the intent set may also be pre-stored in a blockchain network node. Through the blockchain storage, the intent set can be shared between different platforms, and data can also be prevented from being tampered with. When the intent set needs to be obtained, it can be obtained directly from the blockchain by invoking the smart contract.
其中,区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层。Among them, the blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. The blockchain is essentially a decentralized database, which is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify the validity of the information. (Anti-counterfeiting) and generate the next block. The blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
步骤220,获取目标意图所包括的目标相似句语料,作为目标相似句语料集合。Step 220: Obtain the target similar sentence corpus included in the target intention as a target similar sentence corpus.
在一个实施例中,所述获取目标意图所包括的目标相似句语料,作为目标相似句语料集合,包括:从在本地的预设路径读取目标意图所包括的目标相似句语料,作为目标相似句语料集合。In one embodiment, the acquiring the target similar sentence corpus included in the target intention as a target similar sentence corpus includes: reading the target similar sentence corpus included in the target intention from a local preset path as the target similarity Sentence corpus collection.
在一个实施例中,步骤210以及步骤220的具体步骤可以如图3所示。图3是根据图2对应实施例示出的一实施例的步骤210及步骤220的细节的流程图。如图3所示,包括以下步骤:In an embodiment, the specific steps of step 210 and step 220 may be as shown in FIG. 3. FIG. 3 is a flowchart showing details of step 210 and step 220 of an embodiment according to the embodiment corresponding to FIG. 2. As shown in Figure 3, it includes the following steps:
步骤211,基于第一预定规则从包括多个意图的意图总集合中选择出多个意图组成意图子集合。In step 211, a plurality of intents is selected from a total set of intents including a plurality of intents to form an intent sub-set based on the first predetermined rule.
其中,每一意图包括多个相似句语料,所述意图总集合中的每一意图对应一个对话机器人。Wherein, each intention includes a plurality of similar sentence corpora, and each intention in the total set of intentions corresponds to a dialogue robot.
可以基于各种方式或规则从意图总集合中选择出组成意图子集合的多个意图,比如,第一预定规则可以是从意图总集合中随机选取多个意图组成意图子集合,还可以是按照各意图的生成顺序从包括多个意图的意图总集合中依次选取预定数目个意图组成意图子集合。The multiple intentions that make up the intention sub-set can be selected from the total set of intentions based on various methods or rules. For example, the first predetermined rule can be to randomly select multiple intentions from the total set of intentions to form the intention sub-set, or according to The generation sequence of each intent is to sequentially select a predetermined number of intents from a total set of intents including a plurality of intents to form an intent sub-set.
步骤221,基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图。Step 221: Select the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent sub-set based on the second predetermined rule.
在一个实施例中,所述基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图,包括:In one embodiment, the selection of the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent subset based on the second predetermined rule includes:
在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出包括的相似句语料最少的意图,作为目标意图。Among the intents corresponding to all other dialogue robots except the dialogue robots corresponding to the intents in the intent subset, the intent that includes the least corpus of similar sentences is selected as the target intent.
在本实施例中,通过将包括的相似句语料最少的意图作为目标意图,可以为包括的相似句语料最少的意图这类低频意图优先生成语料。In this embodiment, by taking the intention with the least included similar sentence corpus as the target intention, it is possible to preferentially generate corpus for the low-frequency intention including the least included similar sentence corpus.
在一个实施例中,所述基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图,包括:In one embodiment, the selection of the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent subset based on the second predetermined rule includes:
在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中确定出包括的相似句语料的数量小于第一预定数目的意图,作为第一候选目标意图;In the intention subset, intents corresponding to all other dialogue robots except for the intents corresponding to the intents are determined to include intents whose number of similar sentence corpus is less than the first predetermined number as the first candidate target intent;
在所述第一候选目标意图中任取一个,作为目标意图。Any one of the first candidate target intentions is selected as the target intention.
在本实施例中,使包括的相似句语料的数量小于第一预定数目的意图均有相同的被选为目标意图的可能性,提高了公平性,并且由于选出的目标意图包括的相似句语料的数量 小于第一预定数目,可以为低频意图优先生成语料。In this embodiment, the intents included in the similar sentence corpus that are less than the first predetermined number have the same possibility of being selected as the target intent, which improves fairness, and because the selected target intent includes similar sentences The number of corpus is less than the first predetermined number, and corpus can be generated preferentially for low-frequency intent.
在一个实施例中,所述基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图,包括:In one embodiment, the selection of the target intent from the intents corresponding to all the dialogue robots except the intents corresponding to the intents in the intent subset based on the second predetermined rule includes:
确定所述意图子集合内各意图所包括的相似句语料的数目的最小值;Determining the minimum number of similar sentence corpora included in each intent in the intent subset;
在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中确定出包括的相似句语料的数量小于所述最小值的意图,作为第二候选目标意图;In the intention sub-set, intents corresponding to all other dialog robots except for the intents corresponding to the intents are determined to include intents whose number of similar sentence corpus is less than the minimum value, as the second candidate target intent;
在所述第二候选目标意图中任取一个,作为目标意图。Any one of the second candidate target intentions is selected as the target intention.
当一个意图包括的相似句语料的数量比意图子集合内各意图所包括的相似句语料的数目的最小值还要小时,说明该意图包括的相似句语料的数量是足够小的,在本实施例中通过在这类意图中任选一个作为目标意图,保证了选择的目标意图的合理性。When the number of similar sentence corpora included in an intention is smaller than the minimum value of the number of similar sentence corpora included in each intention in the intention sub-set, it means that the number of similar sentence corpora included in the intention is small enough. In this implementation In the example, by choosing one of these intentions as the target intention, the rationality of the selected target intention is ensured.
步骤222,获取目标意图包括的相似句语料作为目标相似句语料,得到目标相似句语料集合。Step 222: Obtain the similar sentence corpus included in the target intention as the target similar sentence corpus, and obtain the target similar sentence corpus.
本实施例即为从意图集合之外的意图获取相似句语料的示例。This embodiment is an example of obtaining similar sentence corpus from intents outside the intent set.
步骤230,确定所述目标相似句语料与所述相似句语料的相似度。Step 230: Determine the similarity between the target similar sentence corpus and the similar sentence corpus.
可以利用各种算法或公式来计算两个相似句语料之间的相似度。Various algorithms or formulas can be used to calculate the similarity between two similar sentence corpora.
在一个实施例中,所述目标相似句语料和所述相似句语料分别由多个词元素组成,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:In an embodiment, the target similar sentence corpus and the similar sentence corpus are respectively composed of multiple word elements, and the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
利用如下公式确定所述目标相似句语料与所述相似句语料的相似度:Use the following formula to determine the similarity between the target similar sentence corpus and the similar sentence corpus:
Figure PCTCN2020093043-appb-000001
Figure PCTCN2020093043-appb-000001
其中,s 1代表所述目标相似句语料,s 2代表所述相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述相似句语料的相似度。 Among them, s 1 represents the target similar sentence corpus, s 2 represents the similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the target similar sentence corpus The degree of similarity with the similar sentence corpus.
比如,Len(s 1∩s 2)用于计算所述目标相似句语料与所述相似句语料共同包含的词元素的个数,而Len(s 1∪s 2)用于计算所述目标相似句语料与所述相似句语料所包含的所有词元素的个数。 For example, Len(s 1 ∩s 2 ) is used to calculate the number of word elements included in the target similar sentence corpus and the similar sentence corpus, and Len(s 1 ∪s 2 ) is used to calculate the target similarity The sentence corpus and the number of all word elements contained in the similar sentence corpus.
在一个实施例中,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:In an embodiment, the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
针对每一所述目标相似句语料,确定该目标相似句语料与每一所述相似句语料的相似度。For each target similar sentence corpus, the similarity between the target similar sentence corpus and each similar sentence corpus is determined.
在本实施例中,使得确定出的目标相似句语料与相似句语料的相似度的数量最大,从而可以使建立的候选相似句语料集合的规模最大。In this embodiment, the number of similarities between the determined target similar sentence corpus and the similar sentence corpus is maximized, so that the scale of the established candidate similar sentence corpus can be maximized.
在一个实施例中,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:In an embodiment, the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
在目标意图所包括的目标相似句语料中任取一个目标相似句语料;Choose any target similar sentence corpus from the target similar sentence corpus included in the target intention;
确定任取的该目标相似句语料与每一所述相似句语料的相似度。Determine the similarity between the arbitrary target similar sentence corpus and each similar sentence corpus.
步骤240,基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合。Step 240: Select a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus.
在一个实施例中,步骤240的具体步骤可以如图4所示。图4是根据图2对应实施例示出的一实施例的步骤240的细节流程图。参照图4所示,步骤240可以包括以下步骤:In an embodiment, the specific steps of step 240 may be as shown in FIG. 4. FIG. 4 is a detailed flowchart of step 240 according to an embodiment shown in the embodiment corresponding to FIG. 2. Referring to FIG. 4, step 240 may include the following steps:
步骤241,针对所述意图集合中每一意图,若该意图包括的相似句语料中存在一个相似句语料与所述目标相似句语料的相似度大于预定相似度阈值,则获取该意图包括的所有相似句语料作为候选相似句语料。Step 241: For each intent in the intent set, if there is a similar sentence corpus in the similar sentence corpus included in the intent, the similarity between the similar sentence corpus and the target similar sentence corpus is greater than a predetermined similarity threshold, then all the intents included in the intent are acquired. Similar sentence corpus is used as candidate similar sentence corpus.
预定相似度阈值可以是属于[0,1]范围内的浮点数。The predetermined similarity threshold may be a floating point number in the range of [0, 1].
步骤242,利用获取的所有候选相似句语料构建候选相似句语料集合。Step 242: Use all the obtained candidate similar sentence corpora to construct a candidate similar sentence corpus.
在本实施例中,在只要一个意图中的相似句语料与目标相似句语料的相似度大于预定相似度阈值的情况下,就选取该意图包括的所有相似句语料作为候选相似句语料来构建候 选相似句语料集合,不仅保证了构建的候选相似句语料集合中候选相似句语料的数量,而且对于一个意图来说,如果确定该意图包括的相似句语料中有一个相似句语料与目标相似句语料的相似度大于预定相似度阈值,就不需要对该意图的其他相似句语料进行判断,还可以减少计算量。In this embodiment, as long as the similarity between the similar sentence corpus in an intent and the target similar sentence corpus is greater than the predetermined similarity threshold, all similar sentence corpora included in the intent are selected as candidate similar sentence corpora to construct the candidate. The similar sentence corpus not only guarantees the number of candidate similar sentence corpus in the constructed candidate similar sentence corpus, but also for an intent, if it is determined that the similar sentence corpus included in the intent has a similar sentence corpus and the target similar sentence corpus If the similarity of is greater than the predetermined similarity threshold, there is no need to judge other similar sentence corpus of the intention, and the amount of calculation can also be reduced.
在一个实施例中,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:In an embodiment, the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
针对每一所述目标相似句语料,确定该目标相似句语料与每一所述相似句语料的相似度;For each target similar sentence corpus, determine the similarity between the target similar sentence corpus and each similar sentence corpus;
所述基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合,包括:The selecting candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
针对每一所述相似句语料,确定各目标相似句语料与该相似句语料的相似度的平均值;For each similar sentence corpus, determine the average value of the similarity between each target similar sentence corpus and the similar sentence corpus;
获取所述平均值大于预定相似度平均值阈值的相似句语料所属意图包括的所有相似句语料作为候选相似句语料,并利用获取的所有候选相似句语料构建候选相似句语料集合。Obtain all similar sentence corpora intended to include similar sentence corpora whose average value is greater than a predetermined similarity average threshold value as candidate similar sentence corpora, and construct a candidate similar sentence corpus set by using all the obtained candidate similar sentence corpora.
在一个实施例中,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:In an embodiment, the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
针对每一所述目标相似句语料,确定该目标相似句语料与每一所述相似句语料的相似度;For each target similar sentence corpus, determine the similarity between the target similar sentence corpus and each similar sentence corpus;
所述基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合,包括:The selecting candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
针对每一所述相似句语料,确定各目标相似句语料与该相似句语料的相似度的最大值;For each similar sentence corpus, determine the maximum value of the similarity between each target similar sentence corpus and the similar sentence corpus;
获取所述最大值大于预定相似度最大值阈值的相似句语料作为候选相似句语料,并利用获取的所有候选相似句语料构建候选相似句语料集合。A similar sentence corpus with the maximum value greater than a predetermined maximum similarity threshold is obtained as a candidate similar sentence corpus, and a candidate similar sentence corpus is constructed using all the obtained candidate similar sentence corpora.
在一个实施例中,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:In an embodiment, the determining the similarity between the target similar sentence corpus and the similar sentence corpus includes:
针对每一所述目标相似句语料,确定该目标相似句语料与每一所述相似句语料的相似度;For each target similar sentence corpus, determine the similarity between the target similar sentence corpus and each similar sentence corpus;
所述基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合,包括:The selecting candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
针对每一所述相似句语料,确定各目标相似句语料与该相似句语料的相似度的最小值;For each similar sentence corpus, determine the minimum similarity between each target similar sentence corpus and the similar sentence corpus;
获取所述最小值大于预定相似度最小值阈值的相似句语料所属意图包括的所有相似句语料作为候选相似句语料,并利用获取的所有候选相似句语料构建候选相似句语料集合。Obtain all similar sentence corpora intended to include the similar sentence corpus whose minimum value is greater than the predetermined similarity minimum threshold value as candidate similar sentence corpus, and use all the obtained candidate similar sentence corpus to construct a candidate similar sentence corpus.
对于一个相似句语料来说,当各目标相似句语料与该相似句语料的相似度的最小值还比预定相似度最小值阈值大,说明该相似句语料与各目标相似句语料的整体相比是足够相似的,在本实施例中,提高了获取候选相似句语料的标准。For a similar sentence corpus, when the minimum similarity between each target similar sentence corpus and the similar sentence corpus is greater than the predetermined minimum similarity threshold, it indicates that the similar sentence corpus is compared with the overall similarity sentence corpus of each target. Are sufficiently similar. In this embodiment, the standard for obtaining candidate similar sentence corpus is improved.
步骤250,基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。Step 250: Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, in the candidate similar sentence corpus in the candidate similar sentence corpus Determine the target similar sentence corpus belonging to the target intention.
在一个实施例中,步骤250可以包括:In one embodiment, step 250 may include:
基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分,并基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料:Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the following formula is used to calculate each candidate similar sentence corpus in the candidate similar sentence corpus Based on the score, the target similar sentence corpus belonging to the target intention is determined in the candidate similar sentence corpus of the candidate similar sentence corpus in the candidate similar sentence corpus:
Figure PCTCN2020093043-appb-000002
Figure PCTCN2020093043-appb-000002
其中,s i和s j代表所述目标相似句语料,s k代表所述候选相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述候选相似句语料的相似度,C为所述候选相似句语料集合,O为所述目标相似句语料集合,n为所述候选相似句语料集合中所述候选相似句语料的数目,m为所述目标相似句语料集合中所述目标相似句语料的数目,α为权重因子,selectSen为所述候选相似句语料集合中候选相似句语料的得分。 Among them, s i and s j represent the target similar sentence corpus, s k represents the candidate similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus, C is the candidate similar sentence corpus, O is the target similar sentence corpus, n is the candidate similar sentence in the candidate similar sentence corpus The number of corpora, m is the number of the target similar sentence corpus in the target similar sentence corpus, α is a weighting factor, and selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus.
比如,α可以为0.7,那么1-α为0.3。For example, α can be 0.7, then 1-α is 0.3.
在上述公式中,
Figure PCTCN2020093043-appb-000003
这一部分计算了所述目标相似句语料集合中所述目标相似句语料与所述候选相似句语料集合中所述候选相似句语料的相似度的平均值,即衡量了所述目标相似句语料与所述候选相似句语料的平均相似程度;
Figure PCTCN2020093043-appb-000004
这一部分计算了所述目标相似句语料集合中所述目标相似句语料与所述候选相似句语料集合中所述候选相似句语料的相似度的最大值。
In the above formula,
Figure PCTCN2020093043-appb-000003
This part calculates the average of the similarity between the target similar sentence corpus in the target similar sentence corpus and the candidate similar sentence corpus in the candidate similar sentence corpus. The average similarity degree of the candidate similar sentence corpus;
Figure PCTCN2020093043-appb-000004
This part calculates the maximum similarity between the target similar sentence corpus in the target similar sentence corpus and the candidate similar sentence corpus in the candidate similar sentence corpus.
因此,上述公式一方面考虑了选择平均相似度高的候选相似句语料,可以保证目标相似句语料与原目标意图的目标相似句语料含义相似;同时也计算了相似性总分上减掉一定权重的候选相似句语料与已有目标相似句语料中某条目标相似句语料最相似的相似度,可以保证生成的目标相似句语料是对已有目标相似句语料的语义补充。Therefore, the above formula takes into account the selection of candidate similar sentence corpus with high average similarity, which can ensure that the target similar sentence corpus is similar to the original target intent. At the same time, it also calculates the total similarity score and deducts a certain weight. The similarity between the candidate similar sentence corpus and a target similar sentence corpus in the existing target similar sentence corpus can ensure that the generated target similar sentence corpus is a semantic supplement to the existing target similar sentence corpus.
在一个实施例中,所述基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分,并基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:In one embodiment, based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the candidate similar sentence is calculated using the following formula The score of each candidate similar sentence corpus in the corpus, and the target similar sentence corpus that belongs to the target intention is determined based on the score in the candidate similar sentence corpus of the candidate similar sentence corpus, including:
迭代执行目标相似句语料选取步骤,所述目标相似句语料选取步骤包括:The step of selecting the target similar sentence corpus is iteratively performed, and the step of selecting the target similar sentence corpus includes:
执行确定候选相似句语料得分步骤,所述确定候选相似句语料得分步骤包括:基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分:The step of determining the score of candidate similar sentence corpus is performed, and the step of determining the score of candidate similar sentence corpus includes: based on the comparison between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus. For similarity, the following formula is used to calculate the score of each candidate similar sentence corpus in the candidate similar sentence corpus:
Figure PCTCN2020093043-appb-000005
Figure PCTCN2020093043-appb-000005
其中,s i和s j代表所述目标相似句语料,s k代表所述候选相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述候选相似句语料的相似度,C为所述候选相似句语料集合,O为所述目标相似句语料集合,n为所述候选相似句语料集合中所述候选相似句语料的数目,m为所述目标相似句语料集合中所述目标相似句语料的数目,α为权重因子,selectSen为所述候选相似句语料集合中候选相似句语料的得分; Among them, s i and s j represent the target similar sentence corpus, s k represents the candidate similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus, C is the candidate similar sentence corpus, O is the target similar sentence corpus, n is the candidate similar sentence in the candidate similar sentence corpus The number of corpora, m is the number of the target similar sentence corpus in the target similar sentence corpus, α is a weighting factor, and selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus;
在所述候选相似句语料集合的各候选相似句语料中获取所述得分最高的候选相似句语料,作为目标候选相似句语料;Acquiring the candidate similar sentence corpus with the highest score from each candidate similar sentence corpus in the candidate similar sentence corpus as a target candidate similar sentence corpus;
若该目标候选相似句语料的得分达到预定得分阈值,则将该目标候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该目标候选相似句语料从所述候选相似句语料集合删除;If the score of the target candidate similar sentence corpus reaches the predetermined score threshold, the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
转至所述确定候选相似句语料得分步骤,直至所述目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Turn to the step of determining the scores of candidate similar sentence corpora until the number of target similar sentence corpora included in the target similar sentence corpus reaches a second predetermined number or all candidate similar sentence corpora in the candidate similar sentence corpus are equal After the judgment.
在本实施例中,一方面,通过在将目标候选相似句语料作为目标相似句语料加入至目 标相似句语料集合后,重新转至确定候选相似句语料得分步骤,利用扩增后的目标相似句语料集合重新计算候选相似句语料集合中各候选相似句语料的得分,使得确定出的候选相似句语料的得分越来越准确,从而保证了加入至目标相似句语料集合的目标相似句语料的质量;另一方面,通过每次选取得分最高并且得分达到预定得分阈值的候选相似句语料加入至目标相似句语料集合,使得加入至目标相似句语料集合的候选相似句语料总是候选相似句语料集合中得分最高的,从而进一步保证了所迁移的目标相似句语料的质量。In this embodiment, on the one hand, after adding the target candidate similar sentence corpus as the target similar sentence corpus to the target similar sentence corpus, the step of determining the score of the candidate similar sentence corpus is transferred again, and the amplified target similar sentence is used. The corpus set recalculates the score of each candidate similar sentence corpus in the candidate similar sentence corpus, so that the determined score of the candidate similar sentence corpus becomes more and more accurate, thereby ensuring the quality of the target similar sentence corpus added to the target similar sentence corpus On the other hand, the candidate similar sentence corpus with the highest score and the score reaching the predetermined score threshold is selected each time and added to the target similar sentence corpus, so that the candidate similar sentence corpus added to the target similar sentence corpus is always the candidate similar sentence corpus The highest score in the set, thereby further ensuring the quality of the transferred target similar sentence corpus.
在一个实施例中,通过以下方式确定对所述候选相似句语料集合的所有候选相似句语料均经过了判断:In an embodiment, it is determined that all candidate similar sentence corpora in the candidate similar sentence corpus have been judged in the following manner:
每当对所述候选相似句语料集合的一个候选相似句语料进行了判断,则为该候选相似句语料打上标签,当所述候选相似句语料集合的所有候选相似句语料均被打上标签时,则确定对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Whenever a candidate similar sentence corpus in the candidate similar sentence corpus is judged, the candidate similar sentence corpus is labeled. When all candidate similar sentence corpora in the candidate similar sentence corpus are labeled, It is determined that all candidate similar sentence corpora in the candidate similar sentence corpus have been judged.
在一个实施例中,所述基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:In an embodiment, the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
获取所述得分达到预定得分阈值的候选相似句语料,作为属于所述目标意图的目标相似句语料。A candidate similar sentence corpus whose score reaches a predetermined score threshold is obtained as the target similar sentence corpus belonging to the target intention.
在本实施例中,通过得分与预定得分阈值的比较来确定目标相似句语料,保证了选择出的目标相似句语料的合理性。In this embodiment, the target similar sentence corpus is determined by comparing the score with a predetermined score threshold, which ensures the rationality of the selected target similar sentence corpus.
在一个实施例中,所述基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:In an embodiment, the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
若所述得分达到预定得分阈值的候选相似句语料的数目达到第三预定数目,则在所述得分达到预定得分阈值的候选相似句语料中任取第三预定数目个候选相似句语料,作为属于所述目标意图的目标相似句语料;If the number of candidate similar sentence corpora with the score reaching the predetermined score threshold reaches the third predetermined number, the third predetermined number of candidate similar sentence corpora with the score reaching the predetermined score threshold is randomly selected as belonging to The target similar sentence corpus of the target intention;
若所述得分达到预定得分阈值的候选相似句语料的数目未达到第三预定数目,则获取所述得分达到预定得分阈值的候选相似句语料,作为属于所述目标意图的目标相似句语料。If the number of candidate similar sentence corpora whose score reaches the predetermined score threshold does not reach the third predetermined number, the candidate similar sentence corpus whose score reaches the predetermined score threshold is obtained as the target similar sentence corpus belonging to the target intention.
在本实施例中,当得分达到预定得分阈值的候选相似句语料的数目过多时,对最终选择的目标相似句语料的数量进行了限制。In this embodiment, when the number of candidate similar sentence corpora whose score reaches the predetermined score threshold is too large, the number of target similar sentence corpora to be finally selected is limited.
在一个实施例中,所述基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:In an embodiment, the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
执行确定目标候选相似句语料步骤,所述确定目标候选相似句语料步骤包括:在所述候选相似句语料集合的各候选相似句语料中获取所述得分最高的候选相似句语料,作为目标候选相似句语料;The step of determining the target candidate similar sentence corpus is performed, and the step of determining the target candidate similar sentence corpus includes: obtaining the candidate similar sentence corpus with the highest score among the candidate similar sentence corpora in the candidate similar sentence corpus, as the target candidate similarity Sentence corpus
若该目标候选相似句语料的得分达到预定得分阈值,则将该目标候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该目标候选相似句语料从所述候选相似句语料集合删除;If the score of the target candidate similar sentence corpus reaches the predetermined score threshold, the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
转至所述确定目标候选相似句语料步骤,直至所述目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Go to the step of determining the target similar sentence corpus until the number of target similar sentence corpora included in the target similar sentence corpus reaches a second predetermined number, or all candidate similar sentence corpora in the candidate similar sentence corpus is equal to After the judgment.
在本实施例中,通过每次所述选取得分最高的候选相似句语料,并在判断该候选相似句语料的得分达到预定得分阈值时就将该候选相似句语料加入至目标相似句语料集合,使得加入至目标相似句语料集合的候选相似句语料得分是最高的,从而保证了所迁移的目标相似句语料的质量。In this embodiment, the candidate similar sentence corpus with the highest score is selected each time, and the candidate similar sentence corpus is added to the target similar sentence corpus when it is judged that the score of the candidate similar sentence corpus reaches a predetermined score threshold. , So that the candidate similar sentence corpus added to the target similar sentence corpus has the highest score, thereby ensuring the quality of the transferred target similar sentence corpus.
在一个实施例中,所述基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:In an embodiment, the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
对所述候选相似句语料集合的各候选相似句语料按照所述得分从高到低的顺序进行排 序;Sort the candidate similar sentence corpora in the candidate similar sentence corpus according to the order of the score from high to low;
按照所述排序的顺序,每次选取一个候选相似句语料,若该候选相似句语料的得分达到预定得分阈值,则将该候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该候选相似句语料从所述候选相似句语料集合删除,直至所述目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者选取的候选相似句语料的得分未达到预定得分阈值。According to the sorting order, one candidate similar sentence corpus is selected each time, and if the score of the candidate similar sentence corpus reaches a predetermined score threshold, the candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus , And delete the candidate similar sentence corpus from the candidate similar sentence corpus until the number of target similar sentence corpora included in the target similar sentence corpus reaches the second predetermined number or the selected candidate similar sentence corpus has no score The predetermined score threshold is reached.
综上所述,根据图2实施例提供的对话机器人意图语料生成方法,通过利用知识迁移的方式,将其他意图的语料迁移到需要扩展的意图中,从而实现意图语料的自动扩充,提高了意图语料的数量,可以使各意图的语料数量更为均衡,进而在一定程度上提高了意图识别的准确率,还降低了扩展意图语料所需的成本。To sum up, according to the method for generating the intention corpus of the dialog robot provided by the embodiment in FIG. 2, by using knowledge transfer, the corpus of other intentions is transferred to the intention that needs to be expanded, so as to realize the automatic expansion of the intention corpus and improve the intention. The number of corpus can make the corpus of each intent more balanced, thereby improving the accuracy of intent recognition to a certain extent, and also reducing the cost of expanding the intent corpus.
本申请还提供了一种对话机器人意图语料生成装置,以下是本申请的装置实施例。The present application also provides a device for generating the intention corpus of a dialogue robot. The following are device embodiments of the present application.
图5是根据一示例性实施例示出的一种对话机器人意图语料生成装置的框图。如图5所示,装置500包括:Fig. 5 is a block diagram showing a device for generating intention corpus of a dialogue robot according to an exemplary embodiment. As shown in FIG. 5, the device 500 includes:
第一获取模块510,被配置为获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;The first obtaining module 510 is configured to obtain an intent set including a plurality of intents, wherein each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
第二获取模块520,被配置为获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;The second acquiring module 520 is configured to acquire the target similar sentence corpus included in the target intention as a target similar sentence corpus;
第一确定模块530,被配置为确定所述目标相似句语料与所述相似句语料的相似度;The first determining module 530 is configured to determine the similarity between the target similar sentence corpus and the similar sentence corpus;
构建模块540,被配置为基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;The construction module 540 is configured to select candidate similar sentence corpora from the intent set based on the similarity to construct a candidate similar sentence corpus;
第二确定模块550,被配置为基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。The second determining module 550 is configured to determine, based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, in the candidate similar sentence corpus, Identify the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus.
根据本申请的第三方面,还提供了一种能够实现上述方法的电子设备。According to the third aspect of the present application, there is also provided an electronic device capable of implementing the above method.
所属技术领域的技术人员能够理解,本申请的各个方面可以实现为系统、方法或程序产品。因此,本申请的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。Those skilled in the art can understand that various aspects of the present application can be implemented as a system, a method, or a program product. Therefore, each aspect of the present application can be specifically implemented in the following forms, namely: complete hardware implementation, complete software implementation (including firmware, microcode, etc.), or a combination of hardware and software implementations, which can be collectively referred to herein as "Circuit", "Module" or "System".
下面参照图6来描述根据本申请的这种实施方式的电子设备600。图6显示的电子设备600仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。The electronic device 600 according to this embodiment of the present application will be described below with reference to FIG. 6. The electronic device 600 shown in FIG. 6 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
如图6所示,电子设备600以通用计算设备的形式表现。电子设备600的组件可以包括但不限于:上述至少一个处理单元610、上述至少一个存储单元620、连接不同系统组件(包括存储单元620和处理单元610)的总线630。As shown in FIG. 6, the electronic device 600 is represented in the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: the aforementioned at least one processing unit 610, the aforementioned at least one storage unit 620, and a bus 630 connecting different system components (including the storage unit 620 and the processing unit 610).
其中,所述存储单元存储有程序代码,所述程序代码可以被所述处理单元610执行,使得所述处理单元610执行本说明书上述“实施例方法”部分中描述的根据本申请各种示例性实施方式的步骤。Wherein, the storage unit stores program code, and the program code can be executed by the processing unit 610, so that the processing unit 610 executes the various exemplary methods described in the above-mentioned "Embodiment Method" section of this specification. Steps of implementation.
存储单元620可以包括易失性存储单元形式的可读存储介质,例如随机存取存储单元(RAM)621和/或高速缓存存储单元622,还可以进一步包括只读存储单元(ROM)623。The storage unit 620 may include a readable storage medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 621 and/or a cache storage unit 622, and may further include a read-only storage unit (ROM) 623.
存储单元620还可以包括具有一组(至少一个)程序模块625的程序/实用工具624,这样的程序模块625包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 620 may also include a program/utility tool 624 having a set of (at least one) program module 625. Such program module 625 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples or some combination may include the implementation of a network environment.
总线630可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or a storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local area using any bus structure among multiple bus structures. bus.
电子设备600也可以与一个或多个外部设备800(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该电子设备600交互的设备通信,和/或与使得该电子设备600能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口650进行。并且,电子设备600还可以通过网络适配器660与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器660通过总线630与电子设备600的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备600使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The electronic device 600 may also communicate with one or more external devices 800 (such as keyboards, pointing devices, Bluetooth devices, etc.), and may also communicate with one or more devices that enable a user to interact with the electronic device 600, and/or communicate with Any device (such as a router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 650. In addition, the electronic device 600 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 660. As shown in the figure, the network adapter 660 communicates with other modules of the electronic device 600 through the bus 630. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiment of the present application.
根据本申请的第四方面,还提供了一种计算机可读存储介质,其上存储有能够实现本说明书上述方法的程序产品。其中,该计算机可读存储介质可以是非易失性,也可以是易失性。在一些可能的实施方式中,本申请的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在终端设备上运行时,所述程序代码用于使所述终端设备执行本说明书上述“示例性方法”部分中描述的根据本申请各种示例性实施方式的步骤。According to the fourth aspect of the present application, there is also provided a computer-readable storage medium on which is stored a program product capable of implementing the above-mentioned method in this specification. Wherein, the computer-readable storage medium may be non-volatile or volatile. In some possible implementation manners, each aspect of the present application can also be implemented in the form of a program product, which includes program code. When the program product runs on a terminal device, the program code is used to make the The terminal device executes the steps according to various exemplary embodiments of the present application described in the above-mentioned "Exemplary Method" section of this specification.
参考图7所示,描述了根据本申请的实施方式的用于实现上述方法的程序产品700,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本申请的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。Referring to FIG. 7, a program product 700 for implementing the above method according to an embodiment of the present application is described. It can adopt a portable compact disk read-only memory (CD-ROM) and include program code, and can be stored in a terminal device, For example, running on a personal computer. However, the program product of this application is not limited to this. In this document, the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
所述程序产品可以采用一个或多个可读存储介质的任意组合。可读存储介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The program product may adopt any combination of one or more readable storage media. The readable storage medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable Type programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读存储介质,该可读存储介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable storage medium other than the readable storage medium, and the readable storage medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
可读存储介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。The program code contained on the readable storage medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
可以以一种或多种程序设计语言的任意组合来编写用于执行本申请操作的程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程 计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。The program code used to perform the operations of the present application can be written in any combination of one or more programming languages. The programming languages include object-oriented programming languages—such as Java, C++, etc., as well as conventional procedural programming languages. Programming language-such as "C" language or similar programming language. The program code can be executed entirely on the user's computing device, partly on the user's device, executed as an independent software package, partly on the user's computing device and partly executed on the remote computing device, or entirely on the remote computing device or server Executed on. In the case of a remote computing device, the remote computing device can be connected to a user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, using Internet service providers). Business to connect via the Internet).
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围执行各种修改和改变。本申请的范围仅由所附的权利要求来限制。It should be understood that the present application is not limited to the precise structure that has been described above and shown in the drawings, and various modifications and changes can be performed without departing from its scope. The scope of the application is only limited by the appended claims.

Claims (20)

  1. 一种对话机器人意图语料生成方法,其中,所述方法包括:A method for generating conversational robot intention corpus, wherein the method includes:
    获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;Acquiring an intent set including a plurality of intents, where each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
    获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;Obtain the target similar sentence corpus included in the target intention as a target similar sentence corpus;
    确定所述目标相似句语料与所述相似句语料的相似度;Determine the similarity between the target similar sentence corpus and the similar sentence corpus;
    基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;Selecting a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus;
    基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the candidate similar sentence corpus in the candidate similar sentence corpus is determined to belong to The target similar sentence corpus of the target intention.
  2. 根据权利要求1所述的方法,其中,所述意图集合为意图子集合,所述获取包括多个意图的意图集合,包括:The method according to claim 1, wherein the intent set is an intent subset, and the acquiring an intent set including a plurality of intents includes:
    基于第一预定规则从包括多个意图的意图总集合中选择出多个意图组成意图子集合,其中,每一意图包括多个相似句语料,所述意图总集合中的每一意图对应一个对话机器人;Based on the first predetermined rule, multiple intents are selected from a total set of intents including multiple intents to form an intent sub-set, wherein each intent includes a plurality of similar sentence corpus, and each intent in the total set of intents corresponds to a dialogue robot;
    所述获取目标意图所包括的目标相似句语料,作为目标相似句语料集合,包括:The acquiring target similar sentence corpus included in the target intention, as a target similar sentence corpus, includes:
    基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图;Selecting the target intent from the intents corresponding to all dialogue robots other than the intents corresponding to the intents in the intent sub-set based on the second predetermined rule;
    获取目标意图包括的相似句语料作为目标相似句语料,得到目标相似句语料集合。Obtain the similar sentence corpus included in the target intention as the target similar sentence corpus, and obtain the target similar sentence corpus.
  3. 根据权利要求1或2所述的方法,其中,所述目标相似句语料和所述相似句语料分别由多个词元素组成,所述确定所述目标相似句语料与所述相似句语料的相似度,包括:The method according to claim 1 or 2, wherein the target similar sentence corpus and the similar sentence corpus are respectively composed of a plurality of word elements, and the determination of the similarity between the target similar sentence corpus and the similar sentence corpus is Degree, including:
    利用如下公式确定所述目标相似句语料与所述相似句语料的相似度:Use the following formula to determine the similarity between the target similar sentence corpus and the similar sentence corpus:
    Figure PCTCN2020093043-appb-100001
    Figure PCTCN2020093043-appb-100001
    其中,s 1代表所述目标相似句语料,s 2代表所述相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述相似句语料的相似度。 Among them, s 1 represents the target similar sentence corpus, s 2 represents the similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the target similar sentence corpus The degree of similarity with the similar sentence corpus.
  4. 根据权利要求1或2所述的方法,其中,所述基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合,包括:The method according to claim 1 or 2, wherein the selecting a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus includes:
    针对所述意图集合中每一意图,若该意图包括的相似句语料中存在一个相似句语料与所述目标相似句语料的相似度大于预定相似度阈值,则获取该意图包括的所有相似句语料作为候选相似句语料;For each intent in the intent set, if there is a similar sentence corpus in the similar sentence corpus included in the intent and the similarity between the similar sentence corpus and the target similar sentence corpus is greater than a predetermined similarity threshold, then all similar sentence corpora included in the intent are acquired As candidate similar sentence corpus;
    利用获取的所有候选相似句语料构建候选相似句语料集合。Use all the obtained candidate similar sentence corpora to construct a candidate similar sentence corpus.
  5. 根据权利要求1或2所述的方法,其中,所述基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:The method according to claim 1 or 2, wherein the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus is based on In the candidate similar sentence corpus of the candidate similar sentence corpus, the target similar sentence corpus that belongs to the target intention is determined from the candidate similar sentence corpus, including:
    基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分,并基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料:Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the following formula is used to calculate each candidate similar sentence corpus in the candidate similar sentence corpus Based on the score, the target similar sentence corpus belonging to the target intention is determined in the candidate similar sentence corpus of the candidate similar sentence corpus in the candidate similar sentence corpus:
    Figure PCTCN2020093043-appb-100002
    Figure PCTCN2020093043-appb-100002
    其中,s i和s j代表所述目标相似句语料,s k代表所述候选相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述候选相似句语料的相似度,C为所述候选相似句语料集合,O为所述目标相似句语料集合,n为所述候选相似句语料集合中所述候选相似句语料的数目,m为所述目标相似句语料集合中所述目标相似句语料的数目,α为权重因子,selectSen为所述候选相似句语料集合中候选相似句语料的得分。 Among them, s i and s j represent the target similar sentence corpus, s k represents the candidate similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus, C is the candidate similar sentence corpus, O is the target similar sentence corpus, n is the candidate similar sentence in the candidate similar sentence corpus The number of corpora, m is the number of the target similar sentence corpus in the target similar sentence corpus, α is a weighting factor, and selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus.
  6. 根据权利要求5所述的方法,其中,所述基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分,并基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:The method according to claim 5, wherein the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus is calculated by using the following formula The score of each candidate similar sentence corpus in the candidate similar sentence corpus, and determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score includes:
    迭代执行目标相似句语料选取步骤,所述目标相似句语料选取步骤包括:The step of selecting the target similar sentence corpus is iteratively performed, and the step of selecting the target similar sentence corpus includes:
    执行确定候选相似句语料得分步骤,所述确定候选相似句语料得分步骤包括:基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分:The step of determining the score of candidate similar sentence corpus is performed, and the step of determining the score of candidate similar sentence corpus includes: based on the comparison between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus. For similarity, the following formula is used to calculate the score of each candidate similar sentence corpus in the candidate similar sentence corpus:
    Figure PCTCN2020093043-appb-100003
    Figure PCTCN2020093043-appb-100003
    其中,s i和s j代表所述目标相似句语料,s k代表所述候选相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述候选相似句语料的相似度,C为所述候选相似句语料集合,O为所述目标相似句语料集合,n为所述候选相似句语料集合中所述候选相似句语料的数目,m为所述目标相似句语料集合中所述目标相似句语料的数目,α为权重因子,selectSen为所述候选相似句语料集合中候选相似句语料的得分; Among them, s i and s j represent the target similar sentence corpus, s k represents the candidate similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus, C is the candidate similar sentence corpus, O is the target similar sentence corpus, n is the candidate similar sentence in the candidate similar sentence corpus The number of corpora, m is the number of the target similar sentence corpus in the target similar sentence corpus, α is a weighting factor, and selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus;
    在所述候选相似句语料集合的各候选相似句语料中获取所述得分最高的候选相似句语料,作为目标候选相似句语料;Acquiring the candidate similar sentence corpus with the highest score from each candidate similar sentence corpus in the candidate similar sentence corpus as a target candidate similar sentence corpus;
    若该目标候选相似句语料的得分达到预定得分阈值,则将该目标候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该目标候选相似句语料从所述候选相似句语料集合删除;If the score of the target candidate similar sentence corpus reaches the predetermined score threshold, the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
    转至所述确定候选相似句语料得分步骤,直至目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Turn to the step of determining the score of candidate similar sentence corpus until the number of target similar sentence corpora included in the target similar sentence corpus reaches the second predetermined number or all candidate similar sentence corpora in the candidate similar sentence corpus has passed judge.
  7. 根据权利要求5所述的方法,其中,所述基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:5. The method according to claim 5, wherein the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score comprises:
    执行确定目标候选相似句语料步骤,所述确定目标候选相似句语料步骤包括:在所述候选相似句语料集合的各候选相似句语料中获取所述得分最高的候选相似句语料,作为目标候选相似句语料;The step of determining the target candidate similar sentence corpus is performed, and the step of determining the target candidate similar sentence corpus includes: obtaining the candidate similar sentence corpus with the highest score among the candidate similar sentence corpora in the candidate similar sentence corpus, as the target candidate similarity Sentence corpus
    若该目标候选相似句语料的得分达到预定得分阈值,则将该目标候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该目标候选相似句语料从所述候选相似句语料集合删除;If the score of the target candidate similar sentence corpus reaches the predetermined score threshold, the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
    转至所述确定目标候选相似句语料步骤,直至所述目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Go to the step of determining the target similar sentence corpus until the number of target similar sentence corpora included in the target similar sentence corpus reaches a second predetermined number, or all candidate similar sentence corpora in the candidate similar sentence corpus is equal to After the judgment.
  8. 根据权利要求2所述的方法,其中,所述基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图,包括:The method according to claim 2, wherein the selecting the target intention from the intents corresponding to all the dialog robots other than the dialog robots corresponding to the intentions in the intention subset based on the second predetermined rule comprises:
    在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中确定出包括的相似句语料的数量小于第一预定数目的意图,作为第一候选目标意图;In the intention subset, intents corresponding to all other dialogue robots except for the intents corresponding to the intents are determined to include intents whose number of similar sentence corpus is less than the first predetermined number as the first candidate target intent;
    在所述第一候选目标意图中任取一个,作为目标意图。Any one of the first candidate target intentions is selected as the target intention.
  9. 根据权利要求5所述的方法,其中,所述基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料,包括:5. The method according to claim 5, wherein the determining the target similar sentence corpus belonging to the target intention from the candidate similar sentence corpus of the candidate similar sentence corpus based on the score comprises:
    若所述得分达到预定得分阈值的候选相似句语料的数目达到第三预定数目,则在所述得分达到预定得分阈值的候选相似句语料中任取第三预定数目个候选相似句语料,作为属于所述目标意图的目标相似句语料;If the number of candidate similar sentence corpora with the score reaching the predetermined score threshold reaches the third predetermined number, the third predetermined number of candidate similar sentence corpora with the score reaching the predetermined score threshold is randomly selected as belonging to The target similar sentence corpus of the target intention;
    若所述得分达到预定得分阈值的候选相似句语料的数目未达到第三预定数目,则获取所述得分达到预定得分阈值的候选相似句语料,作为属于所述目标意图的目标相似句语料。If the number of candidate similar sentence corpora whose score reaches the predetermined score threshold does not reach the third predetermined number, the candidate similar sentence corpus whose score reaches the predetermined score threshold is obtained as the target similar sentence corpus belonging to the target intention.
  10. 一种对话机器人意图语料生成装置,其中,所述装置包括:An intention corpus generation device for a dialogue robot, wherein the device includes:
    第一获取模块,被配置为获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;The first acquisition module is configured to acquire an intent set including a plurality of intents, wherein each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
    第二获取模块,被配置为获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;The second acquisition module is configured to acquire the target similar sentence corpus included in the target intention as a target similar sentence corpus;
    第一确定模块,被配置为确定所述目标相似句语料与所述相似句语料的相似度;The first determining module is configured to determine the similarity between the target similar sentence corpus and the similar sentence corpus;
    构建模块,被配置为基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;A construction module configured to select candidate similar sentence corpora from the intent set based on the similarity to construct a candidate similar sentence corpus;
    第二确定模块,被配置为基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。The second determining module is configured to be based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus. The candidate similar sentence corpus determines the target similar sentence corpus belonging to the target intention.
  11. 一种电子设备,其中,所述电子设备包括:An electronic device, wherein the electronic device includes:
    处理器;processor;
    存储器,所述存储器上存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,实现:A memory, where computer-readable instructions are stored, and when the computer-readable instructions are executed by the processor, it realizes:
    获取包括多个意图的意图集合,其中,每一意图包括多个相似句语料,每一意图对应一个对话机器人,每一对话机器人具有至少一个意图;Acquiring an intent set including a plurality of intents, where each intent includes a plurality of similar sentence corpus, each intent corresponds to a dialogue robot, and each dialogue robot has at least one intent;
    获取目标意图所包括的目标相似句语料,作为目标相似句语料集合;Obtain the target similar sentence corpus included in the target intention as a target similar sentence corpus;
    确定所述目标相似句语料与所述相似句语料的相似度;Determine the similarity between the target similar sentence corpus and the similar sentence corpus;
    基于所述相似度在所述意图集合中选择出候选相似句语料,以构建候选相似句语料集合;Selecting a candidate similar sentence corpus from the intent set based on the similarity to construct a candidate similar sentence corpus;
    基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料。Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the candidate similar sentence corpus in the candidate similar sentence corpus is determined to belong to The target similar sentence corpus of the target intention.
  12. 根据权利要求11所述的电子设备,其中,所述意图集合为意图子集合,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 11, wherein the set of intents is a subset of intents, and when the computer-readable instruction is executed by the processor, it realizes:
    基于第一预定规则从包括多个意图的意图总集合中选择出多个意图组成意图子集合,其中,每一意图包括多个相似句语料,所述意图总集合中的每一意图对应一个对话机器人;Based on the first predetermined rule, multiple intents are selected from a total set of intents including multiple intents to form an intent sub-set, wherein each intent includes a plurality of similar sentence corpus, and each intent in the total set of intents corresponds to a dialogue robot;
    所述获取目标意图所包括的目标相似句语料,作为目标相似句语料集合,包括:The acquiring target similar sentence corpus included in the target intention, as a target similar sentence corpus, includes:
    基于第二预定规则在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中选择出目标意图;Selecting the target intent from the intents corresponding to all dialogue robots other than the intents corresponding to the intents in the intent sub-set based on the second predetermined rule;
    获取目标意图包括的相似句语料作为目标相似句语料,得到目标相似句语料集合。Obtain the similar sentence corpus included in the target intention as the target similar sentence corpus, and obtain the target similar sentence corpus.
  13. 根据权利要求11或12所述的电子设备,其中,所述目标相似句语料和所述相似句语料分别由多个词元素组成,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 11 or 12, wherein the target similar sentence corpus and the similar sentence corpus are respectively composed of multiple word elements, and when the computer-readable instruction is executed by the processor, it realizes:
    利用如下公式确定所述目标相似句语料与所述相似句语料的相似度:Use the following formula to determine the similarity between the target similar sentence corpus and the similar sentence corpus:
    Figure PCTCN2020093043-appb-100004
    Figure PCTCN2020093043-appb-100004
    其中,s 1代表所述目标相似句语料,s 2代表所述相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述相似句语料的相似度。 Among them, s 1 represents the target similar sentence corpus, s 2 represents the similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the target similar sentence corpus The degree of similarity with the similar sentence corpus.
  14. 根据权利要求11或12所述的电子设备,其中,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 11 or 12, wherein, when the computer-readable instruction is executed by the processor, it realizes:
    针对所述意图集合中每一意图,若该意图包括的相似句语料中存在一个相似句语料与所述目标相似句语料的相似度大于预定相似度阈值,则获取该意图包括的所有相似句语料作为候选相似句语料;For each intent in the intent set, if there is a similar sentence corpus in the similar sentence corpus included in the intent and the similarity between the target similar sentence corpus is greater than the predetermined similarity threshold, then all similar sentence corpora included in the intent are acquired As a candidate similar sentence corpus;
    利用获取的所有候选相似句语料构建候选相似句语料集合。Use all the obtained candidate similar sentence corpora to construct a candidate similar sentence corpus.
  15. 根据权利要求11或12所述的电子设备,其中,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 11 or 12, wherein, when the computer-readable instruction is executed by the processor, it realizes:
    基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分,并基于所述得分在所述候选相似句语料集合的候选相似句语料中确定出属于所述目标意图的目标相似句语料:Based on the similarity between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus, the following formula is used to calculate each candidate similar sentence corpus in the candidate similar sentence corpus Based on the score, the target similar sentence corpus belonging to the target intention is determined in the candidate similar sentence corpus of the candidate similar sentence corpus in the candidate similar sentence corpus:
    Figure PCTCN2020093043-appb-100005
    Figure PCTCN2020093043-appb-100005
    其中,s i和s j代表所述目标相似句语料,s k代表所述候选相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述候选相似句语料的相似度,C为所述候选相似句语料集合,O为所述目标相似句语料集合,n为所述候选相似句语料集合中所述候选相似句语料的数目,m为所述目标相似句语料集合中所述目标相似句语料的数目,α为权重因子,selectSen为所述候选相似句语料集合中候选相似句语料的得分。 Among them, s i and s j represent the target similar sentence corpus, s k represents the candidate similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus, C is the candidate similar sentence corpus, O is the target similar sentence corpus, n is the candidate similar sentence in the candidate similar sentence corpus The number of corpora, m is the number of the target similar sentence corpus in the target similar sentence corpus, α is a weighting factor, and selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus.
  16. 根据权利要求15所述的电子设备,其中,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 15, wherein, when the computer-readable instruction is executed by the processor, it realizes:
    迭代执行目标相似句语料选取步骤,所述目标相似句语料选取步骤包括:The step of selecting the target similar sentence corpus is iteratively performed, and the step of selecting the target similar sentence corpus includes:
    执行确定候选相似句语料得分步骤,所述确定候选相似句语料得分步骤包括:基于所述候选相似句语料集合中各候选相似句语料与所述目标相似句语料集合中所述目标相似句语料的相似度,利用如下公式计算所述候选相似句语料集合中各候选相似句语料的得分:The step of determining the score of candidate similar sentence corpus is performed, and the step of determining the score of candidate similar sentence corpus includes: based on the comparison between each candidate similar sentence corpus in the candidate similar sentence corpus and the target similar sentence corpus in the target similar sentence corpus. For similarity, the following formula is used to calculate the score of each candidate similar sentence corpus in the candidate similar sentence corpus:
    Figure PCTCN2020093043-appb-100006
    Figure PCTCN2020093043-appb-100006
    其中,s i和s j代表所述目标相似句语料,s k代表所述候选相似句语料,Len用于求取集合内词元素的个数,f score(s 1,s 2)为所述目标相似句语料与所述候选相似句语料的相似度,C为所述候选相似句语料集合,O为所述目标相似句语料集合,n为所述候选相似句语料集合中所述候选相似句语料的数目,m为所述目标相似句语料集合中所述目标相似句语料的数目,α为权重因子,selectSen为所述候选相似句语料集合中候选相似句语料的得分; Among them, s i and s j represent the target similar sentence corpus, s k represents the candidate similar sentence corpus, Len is used to find the number of word elements in the set, and f score (s 1 , s 2 ) is the The similarity between the target similar sentence corpus and the candidate similar sentence corpus, C is the candidate similar sentence corpus, O is the target similar sentence corpus, n is the candidate similar sentence in the candidate similar sentence corpus The number of corpora, m is the number of the target similar sentence corpus in the target similar sentence corpus, α is a weighting factor, and selectSen is the score of the candidate similar sentence corpus in the candidate similar sentence corpus;
    在所述候选相似句语料集合的各候选相似句语料中获取所述得分最高的候选相似句语料,作为目标候选相似句语料;Acquiring the candidate similar sentence corpus with the highest score from each candidate similar sentence corpus in the candidate similar sentence corpus as a target candidate similar sentence corpus;
    若该目标候选相似句语料的得分达到预定得分阈值,则将该目标候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该目标候选相似句语料从所述候选相似句语料集合删除;If the score of the target candidate similar sentence corpus reaches the predetermined score threshold, the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
    转至所述确定候选相似句语料得分步骤,直至目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Turn to the step of determining the score of candidate similar sentence corpus until the number of target similar sentence corpora included in the target similar sentence corpus reaches the second predetermined number or all candidate similar sentence corpora in the candidate similar sentence corpus has passed judge.
  17. 根据权利要求15所述的电子设备,其中,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 15, wherein, when the computer-readable instruction is executed by the processor, it realizes:
    执行确定目标候选相似句语料步骤,所述确定目标候选相似句语料步骤包括:在所述候选相似句语料集合的各候选相似句语料中获取所述得分最高的候选相似句语料,作为目标候选相似句语料;The step of determining the target candidate similar sentence corpus is performed, and the step of determining the target candidate similar sentence corpus includes: obtaining the candidate similar sentence corpus with the highest score among the candidate similar sentence corpora in the candidate similar sentence corpus, as the target candidate similarity Sentence corpus
    若该目标候选相似句语料的得分达到预定得分阈值,则将该目标候选相似句语料作为目标相似句语料加入至所述目标相似句语料集合,并将该目标候选相似句语料从所述候选相似句语料集合删除;If the score of the target candidate similar sentence corpus reaches the predetermined score threshold, the target candidate similar sentence corpus is added to the target similar sentence corpus as the target similar sentence corpus, and the target candidate similar sentence corpus is similar to the candidate Sentence corpus deletion;
    转至所述确定目标候选相似句语料步骤,直至所述目标相似句语料集合中包括的目标相似句语料的数目达到第二预定数目或者对所述候选相似句语料集合的所有候选相似句语料均经过了判断。Turn to the step of determining the target similar sentence corpus, until the number of target similar sentence corpora included in the target similar sentence corpus reaches a second predetermined number or all candidate similar sentence corpora in the candidate similar sentence corpus is equal to After the judgment.
  18. 根据权利要求12所述的电子设备,其中,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 12, wherein, when the computer-readable instruction is executed by the processor, it realizes:
    在所述意图子集合中意图对应的对话机器人之外的其他所有对话机器人对应的意图中确定出包括的相似句语料的数量小于第一预定数目的意图,作为第一候选目标意图;In the intention subset, intents corresponding to all other dialogue robots except for the intents corresponding to the intents are determined to include intents whose number of similar sentence corpus is less than the first predetermined number as the first candidate target intent;
    在所述第一候选目标意图中任取一个,作为目标意图。Any one of the first candidate target intentions is selected as the target intention.
  19. 根据权利要求15所述的电子设备,其中,所述计算机可读指令被所述处理器执行时,实现:The electronic device according to claim 15, wherein, when the computer-readable instruction is executed by the processor, it realizes:
    若所述得分达到预定得分阈值的候选相似句语料的数目达到第三预定数目,则在所述得分达到预定得分阈值的候选相似句语料中任取第三预定数目个候选相似句语料,作为属于所述目标意图的目标相似句语料;If the number of candidate similar sentence corpora with the score reaching the predetermined score threshold reaches the third predetermined number, the third predetermined number of candidate similar sentence corpora with the score reaching the predetermined score threshold is randomly selected as belonging to The target similar sentence corpus of the target intention;
    若所述得分达到预定得分阈值的候选相似句语料的数目未达到第三预定数目,则获取所述得分达到预定得分阈值的候选相似句语料,作为属于所述目标意图的目标相似句语料。If the number of candidate similar sentence corpora whose score reaches the predetermined score threshold does not reach the third predetermined number, the candidate similar sentence corpus whose score reaches the predetermined score threshold is obtained as the target similar sentence corpus belonging to the target intention.
  20. 一种计算机可读存储介质,其中,其存储有计算机程序指令,当所述计算机程序指令被计算机执行时,使计算机执行根据权利要求1至9中任一项所述的方法。A computer-readable storage medium, wherein computer program instructions are stored, and when the computer program instructions are executed by a computer, the computer executes the method according to any one of claims 1 to 9.
PCT/CN2020/093043 2020-03-20 2020-05-28 Conversation robot intention corpus generation method and apparatus, medium, and electronic device WO2021184547A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010201001.8A CN111460117B (en) 2020-03-20 2020-03-20 Method and device for generating intent corpus of conversation robot, medium and electronic equipment
CN202010201001.8 2020-03-20

Publications (1)

Publication Number Publication Date
WO2021184547A1 true WO2021184547A1 (en) 2021-09-23

Family

ID=71685675

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093043 WO2021184547A1 (en) 2020-03-20 2020-05-28 Conversation robot intention corpus generation method and apparatus, medium, and electronic device

Country Status (2)

Country Link
CN (1) CN111460117B (en)
WO (1) WO2021184547A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996506A (en) * 2022-05-24 2022-09-02 腾讯科技(深圳)有限公司 Corpus generation method and device, electronic equipment and computer-readable storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784024B (en) * 2021-01-11 2023-10-31 软通动力信息技术(集团)股份有限公司 Man-machine conversation method, device, equipment and storage medium
CN113539245B (en) * 2021-07-05 2024-03-15 思必驰科技股份有限公司 Language model automatic training method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070073534A1 (en) * 2005-09-29 2007-03-29 International Business Machines Corporation Corpus expansion system and method thereof
CN109597873A (en) * 2018-11-21 2019-04-09 腾讯科技(深圳)有限公司 Processing method, device, computer-readable medium and the electronic equipment of corpus data
CN109710939A (en) * 2018-12-28 2019-05-03 北京百度网讯科技有限公司 Method and apparatus for determining theme
CN110390006A (en) * 2019-07-23 2019-10-29 腾讯科技(深圳)有限公司 Question and answer corpus generation method, device and computer readable storage medium
CN110765759A (en) * 2019-10-21 2020-02-07 普信恒业科技发展(北京)有限公司 Intention identification method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021796B (en) * 2013-02-28 2017-06-20 华为技术有限公司 Speech enhan-cement treating method and apparatus
CN103617280B (en) * 2013-12-09 2017-01-25 苏州大学 Method and system for mining Chinese event information
CN104216875B (en) * 2014-09-26 2017-05-03 中国科学院自动化研究所 Automatic microblog text abstracting method based on unsupervised key bigram extraction
CN104834735B (en) * 2015-05-18 2018-01-23 大连理工大学 A kind of documentation summary extraction method based on term vector
CN106598949B (en) * 2016-12-22 2019-01-04 北京金山办公软件股份有限公司 A kind of determination method and device of word to text contribution degree
CN109933787B (en) * 2019-02-14 2023-07-14 安徽省泰岳祥升软件有限公司 Text key information extraction method, device and medium
CN110222192A (en) * 2019-05-20 2019-09-10 国网电子商务有限公司 Corpus method for building up and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070073534A1 (en) * 2005-09-29 2007-03-29 International Business Machines Corporation Corpus expansion system and method thereof
CN109597873A (en) * 2018-11-21 2019-04-09 腾讯科技(深圳)有限公司 Processing method, device, computer-readable medium and the electronic equipment of corpus data
CN109710939A (en) * 2018-12-28 2019-05-03 北京百度网讯科技有限公司 Method and apparatus for determining theme
CN110390006A (en) * 2019-07-23 2019-10-29 腾讯科技(深圳)有限公司 Question and answer corpus generation method, device and computer readable storage medium
CN110765759A (en) * 2019-10-21 2020-02-07 普信恒业科技发展(北京)有限公司 Intention identification method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996506A (en) * 2022-05-24 2022-09-02 腾讯科技(深圳)有限公司 Corpus generation method and device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN111460117A (en) 2020-07-28
CN111460117B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
US11636264B2 (en) Stylistic text rewriting for a target author
WO2020207431A1 (en) Document classification method, apparatus and device, and storage medium
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
WO2021017721A1 (en) Intelligent question answering method and apparatus, medium and electronic device
WO2020182122A1 (en) Text matching model generation method and device
WO2021249528A1 (en) Intelligent dialogue method and apparatus and electronic device
WO2021184547A1 (en) Conversation robot intention corpus generation method and apparatus, medium, and electronic device
CA3065765C (en) Extracting domain-specific actions and entities in natural language commands
CN110741364A (en) Determining a state of an automated assistant dialog
US11551437B2 (en) Collaborative information extraction
WO2020244065A1 (en) Character vector definition method, apparatus and device based on artificial intelligence, and storage medium
AU2017424116B2 (en) Extracting domain-specific actions and entities in natural language commands
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
WO2021072863A1 (en) Method and apparatus for calculating text similarity, electronic device, and computer-readable storage medium
WO2021063089A1 (en) Rule matching method, rule matching apparatus, storage medium and electronic device
CN104714942B (en) Method and system for the content availability for natural language processing task
US9460081B1 (en) Transcription correction using multi-token structures
US11977567B2 (en) Method of retrieving query, electronic device and medium
WO2021143016A1 (en) Approximate data processing method and apparatus, medium and electronic device
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN111931488A (en) Method, device, electronic equipment and medium for verifying accuracy of judgment result
US10902215B1 (en) Social hash for language models
WO2020252925A1 (en) Method and apparatus for searching user feature group for optimized user feature, electronic device, and computer nonvolatile readable storage medium
WO2021072864A1 (en) Text similarity acquisition method and apparatus, and electronic device and computer-readable storage medium
JP2019036210A (en) FAQ registration support method using machine learning, and computer system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20925334

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20925334

Country of ref document: EP

Kind code of ref document: A1