WO2021129123A1 - Corpus data processing method, apparatus, server and storage medium


Publication number
WO2021129123A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus data
user
diversity
diversity score
word
Prior art date
Application number
PCT/CN2020/124481
Other languages
English (en)
French (fr)
Inventor
邓东
张晴
舒昌文
周元甲
曾春亮
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a corpus data processing method, device, server and storage medium.
  • The dialogue system is an important research direction in interactive artificial intelligence (AI), and it also has important applications in industry.
  • An Intelligent Virtual Assistant (IVA) or Voice Assistant (VA) can analyze and recognize a user's voice query and then perform corresponding operations to meet the user's requirements. For example, in a smart car terminal, the driver's voice is detected to identify requests such as playing music or checking hot news; in a smart home system, the user's voice commands are detected to identify requests such as playing a TV series or cleaning the room.
  • Natural Language Understanding (NLU)
  • the embodiments of the present application provide a corpus data processing method, device, server, and storage medium, which can increase the diversity of user statements on the Bot platform.
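  • In a first aspect, an embodiment of the present application provides a corpus data processing method, including: acquiring corpus data to be processed; extracting feature information of the corpus data; calculating a diversity score of the corpus data according to the feature information; and processing the corpus data according to the diversity score.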
  • the obtaining the corpus data to be processed includes: obtaining the original corpus data input by the user; performing data cleaning on the original corpus data to obtain the corpus data to be processed.
  • Through preprocessing such as data cleaning, the interference that irrelevant words or symbols cause in feature extraction and the subsequent diversity score calculation can be reduced.
  • Performing data cleaning on the original corpus data to obtain the corpus data to be processed includes: identifying a plurality of slot-value pairs in the original corpus data and determining the slot name of the word in each slot-value pair; replacing words that have the same slot name with that slot name; and identifying and deleting the stop words in the original corpus data to obtain the corpus data to be processed.
  • The feature information includes the generation probability of each word in the corpus data, and extracting the feature information of the corpus data includes: identifying at least one user intention contained in the corpus data; determining the user sentences included in each user intention; and calculating the generation probability of each word according to the number of occurrences of each word in the user sentences.
  • Calculating the generation probability of each word according to the number of occurrences of each word in the user sentences includes: performing word segmentation on each user sentence included in a target user intention, where the target user intention is any one of the user intentions contained in the corpus data; counting the number of occurrences of each word after word segmentation and the total number of occurrences of all words after word segmentation; and calculating the generation probability of each word under the target user intention according to the number of occurrences of each word and the total number of occurrences of all words.
  • Calculating the generation probability of each word under the target user intention according to the number of occurrences of each word and the total number of occurrences of all words includes: calculating the ratio between the number of occurrences of a target word and the total number of occurrences of all words, and taking the ratio as the generation probability of the target word under the target user intention, where the target word is any one of all the words.
  • Calculating the diversity score of the corpus data according to the feature information includes: counting the number of words among all words after word segmentation; using the number of words and the generation probability of each word as parameters, calculating the diversity score of the target user intention with a preset information entropy formula; and determining the diversity score of the corpus data according to the diversity scores of multiple target user intentions.
  • The diversity score of each user intention contained in the corpus data is calculated by means of information entropy, and the diversity score of the corpus data is then determined from these scores. This quantifies the diversity of the corpus data, so that developers and reviewers of the Bot platform can intuitively understand whether the corpus data currently provided is rich.
  • Determining the diversity score of the corpus data according to the diversity scores of multiple target user intentions includes: counting the number of user sentences included in each target user intention and the total number of user sentences included in all target user intentions; calculating the ratio between the number of user sentences included in each target user intention and the total number of sentences, and using the ratio as the weight value of the corresponding target user intention; and, according to the weight value of each target user intention, performing a weighted summation of the diversity scores of the target user intentions to obtain the diversity score of the corpus data.
  • The method further includes: receiving annotation information with which the user has annotated a plurality of pieces of sample corpus data, where the annotation information includes first information or second information; collecting the sample corpus data that have the same annotation information into the same set to obtain a first set and a second set; and determining a diversity score threshold according to the diversity score of each piece of sample corpus data in the first set and the second set.
  • By determining the threshold used to judge whether diversity is rich from annotated sample corpus data, the accuracy of the determined threshold can be effectively ensured.
  • Determining the diversity score threshold according to the diversity score of each piece of sample corpus data in the first set and the second set includes: calculating the lower bound of the diversity scores of the sample corpus data in the first set; calculating the upper bound of the diversity scores of the sample corpus data in the second set; and calculating the average of the lower bound and the upper bound, and using the average as the diversity score threshold.
  • Processing the corpus data according to the diversity score includes: if the diversity score of the corpus data is greater than or equal to the diversity score threshold, determining that the corpus data configured by the user is sufficiently diverse, so that other processing can be performed on the corpus data and its corresponding interaction skill; if the diversity score of the corpus data is less than the diversity score threshold, prompting the user to modify or supplement the corpus data to improve its diversity.
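  • The threshold rule and the comparison step described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the "lower bound" of the first (rich) set is its minimum score and the "upper bound" of the second (poor) set is its maximum score, and the function names, action labels, and sample scores are all hypothetical.

```python
def diversity_threshold(rich_scores, poor_scores):
    """Average of the lowest 'rich' score and the highest 'poor' score."""
    if not rich_scores or not poor_scores:
        raise ValueError("both labelled sets must be non-empty")
    lower_bound = min(rich_scores)  # assumed lower bound of the first set
    upper_bound = max(poor_scores)  # assumed upper bound of the second set
    return (lower_bound + upper_bound) / 2


def review_action(score, threshold):
    """Return a hypothetical action label for a corpus diversity score."""
    return "proceed" if score >= threshold else "prompt_user"


# Made-up scores for manually labelled sample corpora (illustrative only).
rich = [3.0, 3.5, 4.0]
poor = [1.0, 1.5, 2.0]
threshold = diversity_threshold(rich, poor)
```

With these made-up scores the threshold falls midway between the two sets, so a corpus scoring at or above it would move on to further processing, and one scoring below it would trigger a prompt to the developer.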
  • an embodiment of the present application provides a corpus data processing device, including:
  • the corpus data acquisition module is used to acquire the corpus data to be processed
  • the feature information extraction module is used to extract feature information of the corpus data
  • the diversity score calculation module is configured to calculate the diversity score of the corpus data according to the feature information
  • the corpus data processing module is used to process the corpus data according to the diversity score.
  • An embodiment of the present application provides a server, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When the processor executes the computer program, the corpus data processing method according to any one of the above-mentioned first aspects is implemented.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor of a server, the corpus data processing method according to any one of the above-mentioned first aspects is implemented.
  • the embodiments of the present application provide a computer program product, which when the computer program product runs on a server, causes the server to execute the corpus data processing method described in any one of the above-mentioned first aspects.
  • the embodiments of the present application include the following beneficial effects:
  • The diversity score of the corpus data can be calculated according to the feature information, so that the corpus data can be processed in a targeted manner according to the diversity score.
  • The above method can effectively evaluate the diversity of the user statements defined by developers, making it easier for developers to provide richer statement data when configuring skills. This helps to improve the quality of skills, reduce the skill review cycle, and improve the overall skill development cycle. Applying this method in fields such as natural language processing, especially in the data preprocessing stage of a dialogue system, can improve the efficiency and accuracy of subsequent language understanding and analysis.
  • FIG. 1 is a schematic diagram of the technical framework of corpus generalization in the prior art;
  • FIG. 2 is a schematic step flowchart of a corpus data processing method provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a Bot platform skill configuration page provided by an embodiment of the present application.
  • FIG. 4 is a graph structure of an N-gram language model provided by an embodiment of the present application.
  • FIG. 5 is a schematic step flowchart of a corpus data processing method provided by another embodiment of the present application.
  • FIG. 6 is a schematic step flowchart of a corpus data processing method provided by another embodiment of the present application.
  • FIG. 7 is a schematic step flowchart of a corpus data processing method provided by another embodiment of the present application.
  • FIG. 8 is a schematic diagram of the architecture of a system to which the corpus data processing method provided by an embodiment of the present application is applicable;
  • FIG. 9 is a schematic diagram of a Bot platform skill development process provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the diversity scores of the "My Train Steward 1" skill provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the diversity scores of the "My Train Steward 2" skill provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the diversity scores of the "My Train Steward 3" skill provided by an embodiment of the present application.
  • FIG. 13 is a structural block diagram of a corpus data processing device provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the corpus generalization technology can combine the user statement data provided by the developer for each intent, and based on the model learning method, generalize more user statements in different expressions and increase the corpus diversity of skills.
  • the Bot platform can enhance skills and capabilities by adding user-defined user statements and generalized user statements to the training module for training.
  • As shown in FIG. 1, which is a schematic diagram of the technical framework of corpus generalization in the prior art, the technical framework includes the following key modules and processes:
  • the word segmentation module performs word segmentation on the input data. Since Chinese text data is based on phrases as the basic semantic unit, the word segmentation module needs to convert the input text sequence data into phrases containing semantic information.
  • The sentence structure generation module extracts the existing sentence structures from the corpus, divides the user statements into clusters according to the extracted sentence structures, and records the mapping method for converting between the different sentence structure types stored in the same sentence cluster.
  • the corpus generation module generates new corpus data based on the sentence structure and mapping method in the previous step, and adds it to the original unit parallel corpus.
  • The advantage of corpus generalization is that it can automatically generate a large number of user statements with different sentence patterns. However, it is difficult to quantify the diversity of the user statements generated by this method, and the generated statements include many sentences with grammatical errors or ambiguous semantics. For example, it is impossible to judge whether the generalized user statement data meets the diversity needs of Bot platform skill training. Because the corpus generalization process is not tied to skill effects, it cannot be determined how many corpora must be generalized before the diversity meets the training requirements.
  • In addition, because corpus generalization starts from the user statements provided by the developer and generates new ones automatically through model training, the grammatical logic of many generalized corpora differs considerably from the sentences users actually speak. Although the corpus grows, the quality of the added corpora cannot be guaranteed.
  • The core idea of the embodiments of this application addresses the lack of a method for judging or preprocessing the diversity of user statements in a dialogue system. User statement data is collected from the Bot platform, cleaned, and then subjected to feature extraction. During feature extraction, a data-driven language model is constructed to calculate the probability distribution of all words in the current corpus. Information entropy is then used to calculate the diversity score of the corpus, and comparing the score with a threshold makes it easy to determine whether interaction with the user is needed, that is, whether to notify the user to add more user statements with different sentence patterns, thereby increasing the diversity of user statements.
  • This method can improve the life cycle of "voice interaction" skills on the Bot platform, improve the quality of the skills, and strengthen the interaction iteration between the user and the Bot platform.
  • this method can be applied to a server, that is, the execution subject of this embodiment is the server.
  • the aforementioned server may refer to a skill platform used to provide users or developers with voice interaction skill configuration, that is, a bot platform.
  • the Bot platform is one of the most important entrances for major companies to open up their voice interaction capabilities to a large number of third-party developers.
  • the developer configures a voice interaction skill on the Bot platform, so that after the skill is reviewed and released, the user can perform corresponding voice interaction with the voice assistant on the terminal device, so that the voice assistant can perform according to the user's voice instructions Corresponding operations to meet the needs of users.
  • FIG. 3 is a schematic diagram of a Bot platform skill configuration page in this embodiment. Developers can fill in the skill name on the page shown in FIG. 3 and classify the skill. For example, the developer can configure a skill named "My Train Ticket", which belongs to the category of "Tool Assistant". Of course, when configuring the skill, the developer can also set the confidence level of the skill, submit a picture as the icon of the skill, and so on.
  • the developer can provide some user statements related to "My Train Ticket" to the Bot platform. These user statements may refer to some statements or sentence patterns that the user may use when purchasing a train ticket. For example: buy me a train ticket from Beijing to Shenzhen tomorrow, I want to buy a high-speed rail ticket from Beijing to Shenzhen, I want to change this high-speed rail ticket, and so on.
  • the Bot platform can review or process these user statements provided by the developer.
  • the above user statements are the corpus data that the Bot platform needs to process.
  • a piece of corpus data can contain one or more user intentions.
  • For example, the corpus data provided by the developer may contain two user intentions: "book a train ticket" and "change a train ticket".
  • Each user intention includes multiple user sentences (statements), and each user sentence contains multiple words.
  • FIG. 4 shows the graph structure of an N-gram language model of this embodiment, where I represents the number of user intentions in the skill, M represents the number of user statements in a user intention, N represents the number of words in a user statement, and the w in the middle circle represents the variable corresponding to a word.
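  • The nesting of skill, intention, statement and word can be pictured with a small data structure. The sketch below is only an illustration: the dictionary layout and variable names are hypothetical, and whitespace splitting stands in for a real word segmenter (Chinese text would need one).

```python
# Hypothetical corpus for one skill: intent name -> list of user statements.
corpus = {
    "book a train ticket": [
        "buy me a train ticket from Beijing to Shenzhen tomorrow",
        "I want to buy a high-speed rail ticket from Beijing to Shenzhen",
    ],
    "change a train ticket": [
        "I want to change this high-speed rail ticket",
    ],
}

num_intents = len(corpus)  # I: number of user intentions in the skill
statements_per_intent = {k: len(v) for k, v in corpus.items()}  # M per intent
words_per_statement = {
    intent: [len(s.split()) for s in statements]  # N per statement
    for intent, statements in corpus.items()
}
```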
  • extracting the feature information of the corpus data may refer to extracting the feature information of each word in the corpus data to determine the probability distribution of each word in the corpus data.
  • the N-gram language model can be used to extract feature information.
  • other language models can also be used to extract feature information, and the same effect can be obtained. This embodiment does not limit the type of language model used.
  • S203 Calculate the diversity score of the corpus data according to the feature information
  • the diversity score of the corpus data can be calculated by means of information entropy.
  • The aforementioned diversity score can be used to indicate the richness of the user sentences or statements provided by the developer under a certain skill or intention. Generally, the higher the diversity score, the richer the user sentences or statements provided under the skill or intention; conversely, the lower the diversity score, the fewer user sentences or statements the developer has provided and the poorer the diversity.
  • The corpus data can be processed differently depending on its diversity score.
  • For corpus data with a relatively high diversity score, the Bot platform can prioritize the review to speed up the release of the skill.
  • For corpus data with a relatively low diversity score, the developer provided few user sentences or statements when configuring the skill or intention, so the diversity is poor and the developer needs to provide more user sentence or statement data to enrich the corpus.
  • The Bot platform can promptly notify developers of the insufficient diversity and inform them that they need to fill in more statements with different sentence patterns to increase the diversity score. In this way, the problem of insufficient diversity can be resolved before the skill review step of the Bot platform, which effectively avoids the long development cycle caused by a failed skill review.
  • The diversity score of the corpus data can be calculated according to the feature information, so that the corpus data can be processed in a targeted manner according to the diversity score.
  • The above method can effectively evaluate the diversity of the user statements defined by developers, making it easier for developers to provide richer statement data when configuring skills. This helps to improve the quality of skills, reduce the skill review cycle, and improve the overall skill development cycle. Applying this method in fields such as natural language processing, especially in the data preprocessing stage of a dialogue system, can improve the efficiency and accuracy of subsequent language understanding and analysis.
  • Referring to FIG. 5, a schematic step flowchart of a corpus data processing method provided by another embodiment of the present application is shown.
  • the method may specifically include the following steps:
  • the execution subject of this embodiment is a server, and the server may refer to a Bot platform used to provide developers with voice interaction skills.
  • the Bot platform may provide a configuration page to the developer through the terminal device, for the developer to fill in the relevant information of the skill to be configured on the page, and input the corresponding original corpus data. Therefore, the original corpus data to be obtained in this embodiment may refer to the unprocessed corpus data directly submitted by the developer. For example: buy me a train ticket from Beijing to Shenzhen tomorrow, I want to buy a high-speed rail ticket from Beijing to Shenzhen, I want to change this high-speed rail ticket, and so on.
  • S502 Perform data cleaning on the original corpus data to obtain to-be-processed corpus data
  • The original corpus data may include many meaningless words or symbols that contribute little to subsequent processing. In addition, the user statements input by the developer may include number strings or letter strings; such consecutive digits and letters usually carry a special meaning and need to be treated as a whole. Therefore, after the Bot platform obtains the original corpus data input by the developer, it performs data cleaning on the original corpus data and uses the cleaned corpus data as the corpus data to be processed subsequently by the platform.
  • Data cleaning mainly performs basic preprocessing of the user statement data collected from the Bot platform.
  • The following data cleaning tasks can be carried out:
  • Stop word filtering: many words and characters cannot be used to distinguish the user's intention even though they appear in many scenarios. Such stop words in the original corpus data can therefore be identified and deleted.
  • the corpus data to be processed can be obtained.
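  • The cleaning steps named in the method summary (replacing slot values with their slot names, then deleting stop words) can be sketched as below. This is an illustration under stated assumptions: the slot dictionary, the stop-word list, the `{city}`/`{date}` placeholders, and the regular-expression tokenizer are all hypothetical and are not taken from the application.

```python
import re

# Hypothetical slot dictionary and stop-word list (not from the source).
SLOT_VALUES = {"beijing": "{city}", "shenzhen": "{city}", "tomorrow": "{date}"}
STOP_WORDS = {"a", "the", "to", "me", "please"}


def clean_statement(statement):
    """Replace slot values with slot names, then drop stop words."""
    # Lowercase and keep runs of letters, digits and hyphens as tokens.
    tokens = re.findall(r"[a-z0-9\-]+", statement.lower())
    replaced = [SLOT_VALUES.get(tok, tok) for tok in tokens]
    return [tok for tok in replaced if tok not in STOP_WORDS]


cleaned = clean_statement("Buy me a train ticket from Beijing to Shenzhen tomorrow")
```

Replacing "Beijing" and "Shenzhen" with the same slot name means that statements differing only in their slot values no longer look like different wordings, which is presumably why the slot step precedes the diversity calculation.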
  • The user intentions identified in the corpus data may refer to the user intentions contained in the skill that the developer is currently configuring.
  • For example, the skill "My train ticket" can include different intentions such as "book a train ticket", "change a train ticket" and "cancel a train ticket".
  • The user sentences contained in each intention may differ; a user sentence is a statement that the developer entered into the Bot platform for the corresponding intention.
  • For example, the intention "book a train ticket" can include user sentences such as "help me buy a train ticket from Beijing to Shenzhen tomorrow" and "I want to buy a high-speed rail ticket from Beijing to Shenzhen", while the intention "change a train ticket" can include user sentences such as "I want to change this high-speed rail ticket" and "help me change the high-speed rail ticket from Beijing to Shenzhen tomorrow".
  • S505 Calculate the generation probability of each word according to the number of occurrences of each word in the user sentence;
  • the probability distribution of all words can be extracted through the language model. This distribution can effectively describe the probability of developers using different words or sentence patterns.
  • the N-gram language model can be used to extract features.
  • other language models different from N-gram have the same effect in practical applications and can be used instead.
  • the value of N in the N-gram language model is generally 2 or 3.
  • The generation probability of the i-th word w_i in the user sentence can be represented by the following formula:
  • In a specific implementation, each user sentence included in the target user intention can first be segmented, where the target user intention may be any one of the user intentions included in the corpus data. Then, the number of occurrences of each word after word segmentation and the total number of occurrences of all words after word segmentation are counted, and these two counts are used to calculate the generation probability of each word under the target user intention.
  • Specifically, the ratio between the number of occurrences of the target word and the total number of occurrences of all words can be calculated as the generation probability of the target word under the target user intention, where the target word can be any one of all the words.
  • the generation probability of each word obtained by the above calculation is the feature information that is subsequently used to calculate the diversity score of the corpus data.
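  • The count-ratio calculation described above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the statements have already been cleaned and segmented into whitespace-separated words.

```python
from collections import Counter


def generation_probabilities(statements):
    """Map each word to count(word) / total word occurrences for one intent."""
    counts = Counter(word for s in statements for word in s.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Already-cleaned, whitespace-separated statements for one hypothetical intent.
probs = generation_probabilities(["buy train ticket", "buy rail ticket"])
```

Here "buy" and "ticket" each occur twice out of six word occurrences, so each receives a probability of one third, and the probabilities over all words sum to one.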
  • S506 Calculate the diversity score of the corpus data according to the feature information
  • S507 Process the corpus data according to the diversity score.
  • Steps S506-S507 in this embodiment are similar to steps S204-S205 in the foregoing embodiment, and reference may be made to each other, which will not be repeated in this embodiment.
  • In this embodiment, after the corpus data that the developer inputs to configure the voice interaction skill is acquired, the corpus data can be cleaned. Preprocessing such as data cleaning reduces the interference that irrelevant words or symbols cause in feature extraction and the subsequent diversity score calculation. In addition, using a data-driven language model to calculate the generation probability of words effectively extracts the probability distribution of each word in the user statements, which contributes to the accuracy of the subsequent diversity score calculation.
  • Referring to FIG. 6, a schematic step flowchart of a corpus data processing method provided by another embodiment of the present application is shown.
  • the method may specifically include the following steps:
  • S602 Identify at least one user intention included in the corpus data, and determine a user sentence included in each user intention;
  • S603 Perform word segmentation on each user sentence included in the intention of the target user, where the target user intention is any one of the user intentions included in the corpus data;
  • S604 Count the number of occurrences of each word after word segmentation; and, count the total number of occurrences of all words after word segmentation;
  • Since steps S601-S605 in this embodiment are similar to steps S201-S202 and S501-S505 in the foregoing embodiments, reference may be made to each other, and details are not repeated in this embodiment.
  • The probability distribution of each word can be extracted from the corpus data. Based on this probability distribution, information entropy can be used to calculate the diversity score of the user statements under a certain user intention, that is, the diversity score of the intention.
  • the following formula can be used to calculate the diversity score of a user's intention:
  • Here, P(w_i) represents the generation probability of a word or phrase w_i in all user statements (with repeated slots removed) under a certain intent of the current conversation task; V is the number of words in the dictionary for the current intent; S represents the set of all user statements under the current user intention, where each user statement corresponds to one user sentence in the corpus data.
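  • As a sketch of what such a formula could look like: assuming the preset information entropy formula takes the standard Shannon form over the symbols defined here, the diversity score of one intention with statement set S would read as follows. The name H(S), the logarithm base, and the absence of any normalization are assumptions, not details confirmed by this text.

```latex
H(S) = -\sum_{i=1}^{V} P(w_i)\,\log P(w_i)
```

Under this form, an intention whose statements reuse a single word scores zero, and the score grows as the word distribution becomes more varied and more even.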
  • S607 Determine the diversity score of the corpus data according to the diversity scores of multiple target users' intentions
  • the diversity score of the corpus data corresponding to the skill can be determined according to the diversity score of each intent.
  • When the diversity score of each intent under the skill A_i has been obtained, the scores can be weighted according to the proportion of the user statements contained in each intent to the total user statements of all intents in the conversation task, giving the diversity score of the skill A_i.
  • Specifically, the number of user sentences in each target user intention and the total number of user sentences across all target user intentions can be counted, and the ratio between each intention's sentence count and that total is used as the weight value of the corresponding target user intention.
  • The diversity scores of the target user intentions are then weighted and summed according to these weight values to obtain the diversity score of the corpus data corresponding to the skill.
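  • The per-intention score and the statement-count weighting can be combined in a short sketch. It assumes the per-intention score is plain Shannon entropy with a natural logarithm, which is one reading of the "preset information entropy formula" rather than a confirmed detail; the function names and the sample corpus are hypothetical, and the statements are assumed to be cleaned and whitespace-segmented already.

```python
import math
from collections import Counter


def intent_diversity(statements):
    """Shannon entropy (natural log) of the word distribution of one intent."""
    counts = Counter(word for s in statements for word in s.split())
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def corpus_diversity(corpus):
    """Weight each intent's score by its share of all user statements."""
    total_statements = sum(len(stmts) for stmts in corpus.values())
    return sum(
        (len(stmts) / total_statements) * intent_diversity(stmts)
        for stmts in corpus.values()
    )


# Hypothetical, already-cleaned corpus: intent -> user statements.
corpus = {
    "book": ["buy ticket", "book ticket", "get ticket"],
    "change": ["change ticket"],
}
score = corpus_diversity(corpus)
```

In this example the "book" intention holds three of the four statements, so its entropy contributes with weight 0.75 and the "change" intention with weight 0.25. The sketch does not guard against an empty corpus, which would need separate handling.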
  • S608 Process the corpus data according to the diversity score.
  • In this embodiment, the diversity score of each user intention contained in the corpus data can be calculated by means of information entropy, and the diversity score of the corpus data can then be determined. This score effectively quantifies the diversity of the corpus data, so that developers and Bot platform reviewers can intuitively understand whether the corpus data currently provided is rich.
  • Referring to FIG. 7, a schematic step flowchart of a corpus data processing method provided by another embodiment of the present application is shown.
  • the method may specifically include the following steps:
  • S703 Calculate the diversity score of the corpus data according to the feature information
  • It should be noted that steps S701-S703 in this embodiment are similar to steps S201-S203, S501-S506, and S601-S607 in the foregoing embodiments and may be read with reference to them; they are not described again here.
  • S704. Receive annotation information with which the user has separately annotated multiple pieces of sample corpus data, where the annotation information includes first information or second information.
  • The annotation information may be obtained by manually reviewing part of the corpus data.
  • The first information may be a label indicating rich diversity.
  • The second information may be a label indicating poor diversity.
  • For example, Bot platform reviewers can manually classify each piece of sample corpus data as diversity-rich or diversity-poor.
  • S705. Collect sample corpus data with the same annotation information in the same set to obtain a first set and a second set.
  • the skills corresponding to the corpus data marked as rich in diversity may be gathered into the first set A, and the skills corresponding to the corpus data marked as poor in diversity may be gathered into the second set B.
  • S706 Determine a diversity score threshold according to the diversity score of each sample corpus data in the first set and the second set;
  • The diversity score threshold can be obtained by calculating the lower bound of the diversity scores of the diversity-rich skill set A and the upper bound of the diversity scores of the diversity-poor skill set B, and then taking their average.
  • In a specific implementation, first calculate the lower bound of the diversity scores of the sample corpus data in the diversity-rich first set and the upper bound of the diversity scores of the sample corpus data in the diversity-poor second set.
  • Then calculate the average of that lower bound and upper bound, and use the average as the final diversity score threshold.
  • the diversity score threshold can be used to determine whether the corpus data of a certain skill provided by the developer is rich.
  • The diversity score of the corpus data can then simply be compared with the threshold. If the diversity score is greater than or equal to the threshold, the user statements configured by the developer are sufficiently diverse, and the Bot platform can give priority to reviewing such skills to speed up their release.
  • If the diversity score is less than the threshold, the user statements configured by the developer are not diverse enough, and the developer needs to be prompted to change or supplement the current corpus data with more differently phrased statements to raise the diversity score.
  • The Bot platform can flag the lack of diversity on the configuration page, so that the developer sees the problem directly and is reminded to add more differently phrased statements as soon as possible, which speeds up the skill release process.
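The comparison against the threshold amounts to a two-way branch. A minimal Python sketch follows; the function name `review_action` and the two returned labels are hypothetical and chosen only for illustration.

```python
def review_action(diversity_score: float, threshold: float) -> str:
    """Pick the platform action for a skill from its corpus diversity score."""
    if diversity_score >= threshold:
        # Statements are diverse enough: the skill can be reviewed with priority.
        return "prioritize_review"
    # Statements are not diverse enough: ask the developer to add or change them.
    return "prompt_developer"
```

With an assumed threshold of 10.0, a skill scoring exactly 10.0 would be prioritized for review, while a skill scoring 8.333 would trigger a prompt to the developer.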
  • Because the threshold used to judge whether diversity is rich is determined from the diversity scores of sample corpus data, the accuracy of the determined threshold can be effectively ensured.
  • Comparing against the diversity score threshold also makes it possible to quickly determine whether the user statements provided by the current skill are rich.
  • For skills with rich diversity, the Bot platform can give priority to review and release them quickly; for skills with insufficient diversity, developers can be reminded in time through the interactive interface, helping them find problems as soon as possible. If the developer were only notified of insufficient corpus diversity during skill review, the time cost of finding the problem would rise considerably, since the review cycle is generally measured in days.
  • Referring to FIG. 8, a schematic diagram of the architecture of a system to which the corpus data processing method of this embodiment is applicable is shown.
  • The user statement data of the Bot platform can be collected, and feature extraction can be performed after preliminary data cleaning.
  • During feature extraction, a data-driven language model is constructed to calculate the probability distribution of all words in the current corpus; information entropy is then used to calculate the diversity score of the corpus.
  • Referring to FIG. 9, a schematic diagram of a Bot platform skill development process in this embodiment is shown.
  • The developer configures the skill on the Bot platform and then submits it for model training. After training is completed, the skill must pass the Bot platform's review before it can be released.
  • The diversity score of each skill configured by the developer is calculated. If the diversity score is sufficient, the skill can directly pass the online preliminary review; if it is insufficient, the Bot platform can give the developer immediate feedback on the diversity of the user statements, so that the developer can configure more sentence patterns or user statements with richer wording. Through this cycle, the skills deployed by developers keep improving.
  • The entire process described above can be completed on the Bot platform, and the diversity score is computed online, which greatly shortens the time developers need to find problems and improves both the developer experience and the pace at which the Bot platform delivers skills.
  • Figure 10 lists the diversity scores of the skill "My Train Manager 1". As Figure 10 shows, "My Train Manager 1" defines three intents: "book train ticket", "change train ticket", and "cancel train ticket". All three intents contain statements of the same sentence pattern ({departure time} {departure city} to {destination city} high-speed rail ticket). "Change train ticket" and "cancel train ticket" each contain only one statement, while "book train ticket" contains five, but these five differ only in the destination city and are therefore still user statements of the same sentence pattern. According to the processing method provided in this embodiment, the skill diversity score under this corpus configuration is 8.333.
  • Figure 11 shows the diversity score of the skill "My Train Manager 2", which is configured on the basis of the "My Train Manager 1" skill. "My Train Manager 2" replaces the four statements previously defined in the statement data of the "book train ticket" intent with statements of richer sentence structure.
  • The embodiment of the present application provides a new method for preprocessing the diversity of user statements that integrates a language model and information entropy, and that can effectively evaluate the diversity scores and thresholds of developer-defined user statements.
  • The Bot platform and the developer can interact and iterate based on the diversity score, with the developer being notified to provide more sentence patterns and richer statements, thereby improving skill performance.
  • In addition, it greatly shortens the skill review and launch cycle and improves the developer's experience on the Bot platform.
  • Referring to FIG. 13, a structural block diagram of a corpus data processing device provided by an embodiment of the present application is shown. For ease of description, only the parts related to the embodiment of the present application are shown.
  • the device can be applied to a server, and specifically can include the following modules:
  • the corpus data acquisition module 1301 is used to acquire the corpus data to be processed
  • the feature information extraction module 1302 is used to extract feature information of the corpus data
  • the diversity score calculation module 1303 is configured to calculate the diversity score of the corpus data according to the feature information
  • the corpus data processing module 1304 is configured to process the corpus data according to the diversity score.
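The four modules can be read as a simple pipeline in which each stage feeds the next. The sketch below is only an illustration, assuming each module is a plain Python callable; the class name `CorpusProcessor` and the toy stage functions are hypothetical and not part of the application.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CorpusProcessor:
    """Chains four stages mirroring modules 1301-1304 (names are illustrative)."""
    acquire: Callable[[Any], Any]         # 1301: obtain the corpus data to be processed
    extract: Callable[[Any], Any]         # 1302: extract feature information
    score: Callable[[Any], float]         # 1303: compute the diversity score
    process: Callable[[Any, float], Any]  # 1304: act on the corpus given its score

    def run(self, raw_input: Any) -> Any:
        corpus = self.acquire(raw_input)
        features = self.extract(corpus)
        diversity = self.score(features)
        return self.process(corpus, diversity)


# Toy stages for demonstration only: the "score" here is just the number of
# distinct tokens, not the entropy-based score described in the application.
pipeline = CorpusProcessor(
    acquire=lambda raw: [s.strip().lower() for s in raw if s.strip()],
    extract=lambda corpus: {tok for s in corpus for tok in s.split()},
    score=lambda features: float(len(features)),
    process=lambda corpus, d: {"statements": len(corpus), "score": d},
)
result = pipeline.run(["Book a ticket", "  ", "Cancel a ticket"])
```

Keeping the stages as separate callables matches the statement that the module division is a logical one: any stage can be swapped without touching the others.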
  • the corpus data acquisition module 1301 may specifically include the following sub-modules:
  • the original corpus data acquisition sub-module is used to obtain the original corpus data input by the user;
  • the data cleaning sub-module is used to perform data cleaning on the original corpus data to obtain the corpus data to be processed.
  • the data cleaning submodule may specifically include the following units:
  • a slot identification unit used to identify multiple slot-value pairs in the original corpus data, and determine the slot name of the word in each slot-value pair;
  • the stop word filtering unit is used to identify and delete the stop words in the original corpus data to obtain the corpus data to be processed.
  • the feature information may include the generation probability of each word in the corpus data; the feature information extraction module 1302 may specifically include the following sub-modules:
  • the user intention recognition sub-module is used to identify at least one user intention contained in the corpus data
  • the user sentence determination sub-module is used to determine the user sentences included in each user intent;
  • the generation probability calculation sub-module is used to calculate the generation probability of each word according to the number of times each word in the user sentence appears.
  • the generation probability calculation sub-module may specifically include the following units:
  • the user sentence segmentation unit is used to segment each user sentence included in the target user intent, where the target user intent is any one of the user intents included in the corpus data;
  • the word count unit is used to separately count the number of occurrences of each word after word segmentation; and, to count the total number of occurrences of all words after word segmentation;
  • the generation probability calculation unit is configured to calculate the generation probability of each word under the intention of the target user according to the number of occurrences of each word and the total number of occurrences of all words.
  • the generation probability calculation unit may specifically include the following subunits:
  • the generation probability calculation subunit is used to calculate the ratio between the number of occurrences of the target word and the total number of occurrences of all the words, and use the ratio as the generation probability of the target word under the intention of the target user.
  • the target word is any one of all the words.
  • the diversity score calculation module 1303 may specifically include the following sub-modules:
  • the word count sub-module is used to count the word count of all words after word segmentation
  • the user intent diversity score calculation sub-module is used to calculate the diversity score of the target user intent with a preset information entropy formula, taking the word count of all the words and the generation probability of each word as parameters;
  • the corpus data diversity score calculation sub-module is used to determine the diversity score of the corpus data according to the diversity scores of multiple target users' intentions.
  • the corpus data diversity score calculation sub-module may specifically include the following units:
  • the sentence count unit is used to count the number of user sentences included in each target user intent, and to count the total number of user sentences included in all target user intents;
  • the weight value calculation unit is used to calculate the ratio between the number of user sentences included in each target user intent and the total number of sentences, and use the ratio as the weight value of the corresponding target user intent;
  • the corpus data diversity score calculation unit is configured to perform a weighted summation of the diversity scores of the target user intents according to the weight value of each target user intent, to obtain the diversity score of the corpus data.
  • the device may further include the following modules:
  • the labeling information receiving module is configured to receive labeling information that a user has respectively annotated for multiple pieces of sample corpus data, where the labeling information includes the first information or the second information;
  • the corpus data collection module is used to collect sample corpus data with the same annotation information in the same collection to obtain the first collection and the second collection;
  • the diversity score threshold determination module is configured to determine the diversity score threshold according to the diversity score of each sample corpus data in the first set and the second set.
  • the diversity score threshold determination module may specifically include the following sub-modules:
  • the diversity score lower bound calculation sub-module is used to calculate the lower bound of the diversity scores of the sample corpus data in the first set;
  • the diversity score upper bound calculation sub-module is used to calculate the upper bound of the diversity scores of the sample corpus data in the second set;
  • the diversity score threshold calculation sub-module is used to calculate the average of the score lower bound and the score upper bound, and use the average as the diversity score threshold.
  • the corpus data processing module 1304 may specifically include the following sub-modules:
  • the interactive skill processing sub-module is configured to process the interactive skills corresponding to the corpus data if the diversity score of the corpus data is greater than or equal to the diversity score threshold;
  • the developer prompting sub-module is configured to prompt the user to modify or supplement the corpus data if the diversity score of the corpus data is less than the diversity score threshold.
  • For the device embodiment, the description is relatively brief; for related parts, refer to the description of the method embodiments.
  • The server 1400 of this embodiment includes a processor 1410, a memory 1420, and a computer program 1421 stored in the memory 1420 and executable on the processor 1410.
  • When the processor 1410 executes the computer program 1421, the steps in each embodiment of the aforementioned corpus data processing method are implemented, for example steps S201 to S204 shown in FIG. 2.
  • Alternatively, when the processor 1410 executes the computer program 1421, the functions of the modules/units in the foregoing device embodiments are implemented, for example the functions of the modules 1301 to 1304 shown in FIG. 13.
  • The computer program 1421 may be divided into one or more modules/units, which are stored in the memory 1420 and executed by the processor 1410 to complete the method provided in the embodiment of this application.
  • the one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments may be used to describe the execution process of the computer program 1421 in the server 1400.
  • the computer program 1421 can be divided into a corpus data acquisition module, a feature information extraction module, a diversity score calculation module, and a corpus data processing module.
  • the specific functions of each module are as follows:
  • the corpus data acquisition module is used to acquire the corpus data to be processed
  • the feature information extraction module is used to extract feature information of the corpus data
  • the diversity score calculation module is configured to calculate the diversity score of the corpus data according to the feature information
  • the corpus data processing module is used to process the corpus data according to the diversity score.
  • The server 1400 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server.
  • the server 1400 may include, but is not limited to, a processor 1410 and a memory 1420.
  • FIG. 14 is only an example of the server 1400 and does not constitute a limitation on the server 1400, which may include more or fewer components than shown, a combination of certain components, or different components.
  • the server 1400 may also include input and output devices, network access devices, buses, and the like.
  • The processor 1410 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • The memory 1420 may be an internal storage unit of the server 1400, such as a hard disk or memory of the server 1400.
  • The memory 1420 may also be an external storage device of the server 1400, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the server 1400.
  • The memory 1420 may also include both an internal storage unit of the server 1400 and an external storage device.
  • the memory 1420 is used to store the computer program 1421 and other programs and data required by the server 1400.
  • the memory 1420 can also be used to temporarily store data that has been output or will be output.
  • the embodiment of the present application also discloses a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the aforementioned corpus data processing method can be realized.
  • the embodiment of the present application also discloses a computer program product, which when the computer program product runs on a server, causes the server to execute the aforementioned corpus data processing method.
  • the disclosed corpus data processing method, device, server, and storage medium can be implemented in other ways.
  • the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation.
  • For example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The computer program can be stored in a computer-readable storage medium, and when it is executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • The computer-readable medium may at least include: any entity or device capable of carrying computer program code to the corpus data processing device and server, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a floppy disk, or a CD-ROM.
  • computer-readable media cannot be electrical carrier signals and telecommunication signals.


Abstract

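The embodiments of this application belong to the field of artificial intelligence and provide a corpus processing method, device, server, and storage medium. The method includes: acquiring corpus data to be processed; extracting feature information of the corpus data; calculating a diversity score of the corpus data according to the feature information; and processing the corpus data according to the diversity score. The method can effectively evaluate the diversity of developer-defined user statements, making it easier for developers to provide richer statement data when configuring skills. This helps improve skill quality, shorten the skill review cycle, and improve the overall skill development cycle. The method can be applied to fields such as natural language processing; in particular, applying it in the data preprocessing stage of a dialogue system can improve the efficiency and accuracy of subsequent language understanding and analysis.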

Description

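Corpus data processing method, device, server, and storage medium
This application claims priority to Chinese patent application No. 201911355478.5, entitled "Corpus data processing method, device, server, and storage medium" and filed with the China National Intellectual Property Administration on December 25, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a corpus data processing method, device, server, and storage medium.
Background
Dialogue systems are an important research direction in interactive artificial intelligence (AI) and also have important industrial applications. An Intelligent Virtual Assistant (IVA) or Voice Assistant (VA) can analyze and recognize a user's voice query and then perform the corresponding operation to meet the user's request. For example, an in-vehicle smart terminal detects the driver's speech and recognizes requests such as playing music or checking trending news; a smart home system detects the user's voice commands and recognizes requests such as playing a TV series or cleaning the room.
In practice, dialogue systems remain a challenging topic. The main problems include interference of external noise with the user's speech, the accuracy of natural language understanding, and dialogue context management. Among these, Natural Language Understanding (NLU) is a very important factor in how intelligent a dialogue system is. However, users often express the same intent in many different ways, which greatly hinders the dialogue system from correctly understanding user intent.
At present, many commercial companies provide developers with skill platforms (Bot platforms) so that developers can offer "voice interaction" capabilities to users. However, because users lack expertise in dialogue systems, they often miss many user statements or sentence patterns when configuring a skill, so the voice skills they define perform poorly. Therefore, improving the diversity or richness of user statements on the Bot platform is key to improving the "voice interaction" capability.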
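Summary
The embodiments of this application provide a corpus data processing method, device, server, and storage medium that can improve the diversity of user statements on a Bot platform.
In a first aspect, an embodiment of this application provides a corpus data processing method, including:
acquiring corpus data to be processed;
extracting feature information of the corpus data;
calculating a diversity score of the corpus data according to the feature information; and
processing the corpus data according to the diversity score.
Exemplarily, acquiring the corpus data to be processed includes: acquiring original corpus data input by a user; and performing data cleaning on the original corpus data to obtain the corpus data to be processed. Preprocessing such as data cleaning reduces the interference of irrelevant words or symbols with feature extraction and the subsequent diversity score calculation.
Exemplarily, performing data cleaning on the original corpus data to obtain the corpus data to be processed includes: identifying multiple slot-value pairs in the original corpus data and determining the slot name of the word in each slot-value pair; replacing words that have the same slot name with that slot name; and identifying and deleting stop words in the original corpus data to obtain the corpus data to be processed.
Exemplarily, the feature information includes the generation probability of each word in the corpus data, and extracting the feature information of the corpus data includes: identifying at least one user intent contained in the corpus data; determining the user sentences contained in each user intent; and calculating the generation probability of each word according to the number of times each word appears in the user sentences. Calculating word generation probabilities with a data-driven language model effectively extracts the probability distribution of each word in the user statements, which helps the accuracy of the subsequent diversity score calculation.
Exemplarily, calculating the generation probability of each word according to the number of times each word appears in the user sentences includes: segmenting each user sentence contained in a target user intent, where the target user intent is any one of the user intents contained in the corpus data; counting the number of occurrences of each word after segmentation, and counting the total number of occurrences of all words after segmentation; and calculating the generation probability of each word under the target user intent according to the number of occurrences of each word and the total number of occurrences of all words.
Exemplarily, calculating the generation probability of each word under the target user intent according to the number of occurrences of each word and the total number of occurrences of all words includes: calculating the ratio between the number of occurrences of a target word and the total number of occurrences of all words, and using the ratio as the generation probability of the target word under the target user intent, where the target word is any one of all the words.
Exemplarily, calculating the diversity score of the corpus data according to the feature information includes: counting the number of words among all words after segmentation; calculating the diversity score of the target user intent with a preset information entropy formula, taking the number of words and the generation probability of each word as parameters; and determining the diversity score of the corpus data according to the diversity scores of multiple target user intents. Calculating the diversity score of each user intent contained in the corpus data by means of information entropy, and then determining the diversity score of the corpus data, effectively quantifies the diversity of the corpus data, so that developers and Bot platform reviewers can see at a glance whether the corpus data currently provided is rich.
Exemplarily, determining the diversity score of the corpus data according to the diversity scores of multiple target user intents includes: counting the number of user sentences contained in each target user intent, and counting the total number of user sentences contained in all target user intents; calculating the ratio between the number of user sentences contained in each target user intent and the total number of sentences, and using the ratio as the weight value of the corresponding target user intent; and performing a weighted summation of the diversity scores of the target user intents according to their weight values to obtain the diversity score of the corpus data.
Exemplarily, after calculating the diversity score of the corpus data according to the feature information, the method further includes: receiving annotation information with which the user has separately annotated multiple pieces of sample corpus data, where the annotation information includes first information or second information; gathering sample corpus data with the same annotation information into the same set to obtain a first set and a second set; and determining a diversity score threshold according to the diversity score of each piece of sample corpus data in the first set and the second set. Determining the threshold used to judge whether diversity is rich from the diversity scores of sample corpus data effectively ensures the accuracy of the determined threshold.
Exemplarily, determining the diversity score threshold according to the diversity score of each piece of sample corpus data in the first set and the second set includes: calculating the lower bound of the diversity scores of the sample corpus data in the first set; calculating the upper bound of the diversity scores of the sample corpus data in the second set; and calculating the average of the lower bound and the upper bound and using the average as the diversity score threshold.
Exemplarily, processing the corpus data according to the diversity score includes: if the diversity score of the corpus data is greater than or equal to the diversity score threshold, determining that the corpus data configured by the user is sufficiently diverse, so that the corpus data and its corresponding interactive skill can undergo further processing; and if the diversity score of the corpus data is less than the diversity score threshold, prompting the user to change or supplement the corpus data to improve corpus diversity. Comparing against the diversity score threshold makes it possible to quickly determine whether the user statements provided by the current skill are rich.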
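In a second aspect, an embodiment of this application provides a corpus data processing device, including:
a corpus data acquisition module, used to acquire corpus data to be processed;
a feature information extraction module, used to extract feature information of the corpus data;
a diversity score calculation module, used to calculate the diversity score of the corpus data according to the feature information; and
a corpus data processing module, used to process the corpus data according to the diversity score.
In a third aspect, an embodiment of this application provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the corpus data processing method according to any one of the implementations of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program implements the corpus data processing method according to any one of the implementations of the first aspect when executed by a processor of a server.
In a fifth aspect, an embodiment of this application provides a computer program product which, when run on a server, causes the server to execute the corpus data processing method according to any one of the implementations of the first aspect.
Compared with the prior art, the embodiments of this application have the following beneficial effects:
In the embodiments of this application, by acquiring corpus data to be processed and extracting its feature information, the diversity score of the corpus data can be calculated from the feature information, so that the corpus data can be processed in a targeted manner according to the diversity score. The method can effectively evaluate the diversity of developer-defined user statements, making it easier for developers to provide richer statement data when configuring skills. This helps improve skill quality, shorten the skill review cycle, and improve the overall skill development cycle. The method can be applied to fields such as natural language processing; in particular, applying it in the data preprocessing stage of a dialogue system can improve the efficiency and accuracy of subsequent language understanding and analysis.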
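Brief Description of the Drawings
FIG. 1 is a schematic diagram of the technical framework of corpus generalization in the prior art;
FIG. 2 is a schematic step flowchart of a corpus data processing method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a Bot platform skill configuration page provided by an embodiment of this application;
FIG. 4 is the graph structure of an N-gram language model provided by an embodiment of this application;
FIG. 5 is a schematic step flowchart of a corpus data processing method provided by another embodiment of this application;
FIG. 6 is a schematic step flowchart of a corpus data processing method provided by another embodiment of this application;
FIG. 7 is a schematic step flowchart of a corpus data processing method provided by another embodiment of this application;
FIG. 8 is a schematic diagram of the architecture of a system to which the corpus data processing method provided by an embodiment of this application is applicable;
FIG. 9 is a schematic diagram of a Bot platform skill development process provided by an embodiment of this application;
FIG. 10 is a schematic diagram of the diversity score of the "My Train Manager 1" skill provided by an embodiment of this application;
FIG. 11 is a schematic diagram of the diversity score of the "My Train Manager 2" skill provided by an embodiment of this application;
FIG. 12 is a schematic diagram of the diversity score of the "My Train Manager 3" skill provided by an embodiment of this application;
FIG. 13 is a structural block diagram of a corpus data processing device provided by an embodiment of this application;
FIG. 14 is a schematic structural diagram of a server provided by an embodiment of this application.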
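Detailed Description
With the development of information technology, most commercial platforms currently tend to use corpus generalization to address the diversity of user statements. On a Bot platform, corpus generalization takes the user statement data that developers provide under each intent and, through model learning, generalizes more user statements in different forms, increasing the corpus diversity of the skill. By adding the user-defined statements together with the generalized statements to the training module, the Bot platform can strengthen the skill.
FIG. 1 is a schematic diagram of the technical framework of corpus generalization in the prior art. The framework includes the following key modules and processes:
(1) Part of the unit parallel corpus defined by developers on the Bot platform is collected as input to the corpus generalization framework.
(2) The word segmentation module segments the input data. Because Chinese text uses phrases as its basic semantic units, the word segmentation module needs to convert the input text sequence into phrases that carry semantic information.
(3) The sentence structure generation module extracts the existing sentence structures from the corpus, groups the user statements into clusters according to the extracted structures, and records the mappings by which the different sentence structure categories stored in the same cluster transform a sentence.
(4) The corpus generation module generates new corpus data based on the sentence structures and mappings from the previous step and adds it to the initial unit parallel corpus.
The advantage of corpus generalization is that it can automatically generate a large number of user statements with different sentence patterns. However, it is difficult to quantify the diversity of the generalized user statements produced this way, and the generated statements also contain many sentences with grammatical errors or ambiguous meaning. For example, it is impossible to tell whether the generalized user statement data has reached the diversity needed for skill training on the Bot platform. Because the generalization process is not tied to skill performance, there is no guarantee about how many statements should be generalized, or after how many the diversity meets the training requirement. In addition, because generalized statements are generated automatically through model training from the user statements that developers provide, the grammar and logic of many of them differ considerably from what users would normally say. The corpus grows, but its quality cannot be guaranteed.
Therefore, to address the current lack of methods in the field of artificial intelligence for judging or preprocessing the diversity of user statements in dialogue systems, the core idea of the embodiments of this application is as follows: user statement data of the Bot platform is collected, and feature extraction is performed after preliminary data cleaning; during feature extraction, a data-driven language model is constructed to calculate the probability distribution of all words in the current corpus; information entropy is then used to calculate the diversity score of the corpus, and by comparing the score with a threshold it is easy to decide whether to interact with the user and notify them to add more user statements with different sentence patterns, thereby improving the diversity of user statements. The method can substantially improve the life cycle of "voice interaction" skills on the Bot platform, improve skill quality, and strengthen the interactive iteration between users and the Bot platform.
The corpus data processing method of this application is described below with reference to specific embodiments.
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, in order to provide a thorough understanding of the embodiments of this application. However, it should be clear to those skilled in the art that this application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary details do not obscure the description of this application.
The terms used in the following embodiments are only for the purpose of describing specific embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", "said", "the above", "the", and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of this application, "one or more" means one, two, or more than two; "and/or" describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone, where A and B can be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it.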
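Referring to FIG. 2, a schematic step flowchart of a corpus data processing method provided by an embodiment of this application is shown. The method may specifically include the following steps:
S201. Acquire corpus data to be processed.
It should be noted that the method can be applied to a server, that is, the execution subject of this embodiment is a server. The server may be a skill platform, namely a Bot platform, that is provided to users or developers for configuring voice interaction skills.
The Bot platform is one of the most important entry points through which major companies open voice interaction capabilities to large numbers of third-party developers. A developer configures a voice interaction skill on the Bot platform; once the skill has been reviewed and released, users can carry out the corresponding voice interaction with the voice assistant on a terminal device, and the voice assistant performs the operations indicated by the user's speech to meet the user's needs.
FIG. 3 is a schematic diagram of a Bot platform skill configuration page in this embodiment. On the page shown in FIG. 3, the developer can fill in the skill name and classify the skill. For example, the developer can configure a skill named "My Train Ticket" that belongs to the "Tool Assistant" category. When configuring the skill, the developer can also set its confidence level, submit a picture as its icon, and so on.
After completing this basic configuration, the developer can provide the Bot platform with user sentences related to "My Train Ticket". These are statements or sentence patterns that users might use when buying train tickets, for example: "Buy me a train ticket from Beijing to Shenzhen for tomorrow", "I want to buy a high-speed rail ticket from Beijing to Shenzhen", "I want to change this high-speed rail ticket", and so on.
Generally, the richer the user sentences a developer provides and the more types of user statements they cover, the higher the accuracy when users use the skill after it is released. Therefore, when a developer configures a skill, the Bot platform can review or process the user sentences the developer provides; these user sentences are the corpus data that the Bot platform needs to process.
S202. Extract feature information of the corpus data.
Generally, the user sentences or statements a developer provides when configuring a voice skill can be regarded as one piece of corpus data. One piece of corpus data can contain one or more user intents. For example, in the "My Train Ticket" skill above, the corpus data provided by the developer contains two user intents, "book train ticket" and "change train ticket". Each user intent in turn contains multiple user sentences or statements, and each user sentence or statement contains multiple words.
FIG. 4 shows the graph structure of an N-gram language model in this embodiment. In FIG. 4, I denotes the number of user intents in the skill, M the number of user statements or sentences in one user intent, N the number of words in one user statement or sentence, and the w in the central circle the variable corresponding to a word.
Therefore, in this embodiment, extracting the feature information of the corpus data may mean extracting the feature information of each word in the corpus data and determining the probability distribution of each word in the corpus data.
In a specific implementation, an N-gram language model can be used to extract the feature information. Depending on actual needs, other language models can also be used to the same effect; this embodiment does not limit the type of language model used.
S203. Calculate the diversity score of the corpus data according to the feature information.
In this embodiment, information entropy can be used to calculate the diversity score of the corpus data from the feature information extracted in the previous step.
The concept of information entropy solves the problem of quantifying information. Generally speaking, when a piece of information has a higher probability of occurrence, it has been spread more widely or, in other words, is cited more often. Using information entropy to calculate the diversity score therefore solves the problem of how to quantify the diversity of corpus data.
The diversity score indicates how rich the user sentences or statements provided by the developer are under a given skill or intent. Generally, a higher diversity score means the developer has provided richer user sentences or statements under that skill or intent; a lower score means fewer sentences or statements and poorer diversity.
S204. Process the corpus data according to the diversity score.
Because the diversity score of the corpus data indicates whether the corpus is rich, the corpus can be handled differently depending on the score once it has been calculated.
For example, corpus data with a high diversity score indicates that the developer has provided plenty of user sentences or statement data with rich diversity when configuring a skill or intent. The Bot platform can give priority to reviewing such skills and speed up their release. Corpus data with a relatively low diversity score indicates that the developer has provided little user sentence or statement data with poor diversity, and needs to provide more to make the corpus richer. For such skills, the Bot platform can promptly notify the developer of the lack of diversity and ask them to fill in more statements with different sentence patterns to raise the diversity score. In this way, the lack of diversity can be resolved before the Bot platform's skill review step, which avoids the long development cycles caused by failed skill reviews.
In the embodiments of this application, by acquiring corpus data to be processed and extracting its feature information, the diversity score of the corpus data can be calculated from the feature information, so that the corpus data can be processed in a targeted manner according to the diversity score. The method can effectively evaluate the diversity of developer-defined user statements, making it easier for developers to provide richer statement data when configuring skills. This helps improve skill quality, shorten the skill review cycle, and improve the overall skill development cycle. The method can be applied to fields such as natural language processing; in particular, applying it in the data preprocessing stage of a dialogue system can improve the efficiency and accuracy of subsequent language understanding and analysis.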
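Referring to FIG. 5, a schematic step flowchart of a corpus data processing method provided by another embodiment of this application is shown. The method may specifically include the following steps:
S501. Acquire original corpus data input by the user.
It should be noted that the execution subject of this embodiment is a server, which may be a Bot platform provided to developers for configuring voice interaction skills.
In this embodiment, the Bot platform can provide the developer with a configuration page through a terminal device, on which the developer fills in information about the skill to be configured and enters the corresponding original corpus data. The original corpus data to be acquired in this embodiment is therefore the unprocessed corpus data submitted directly by the developer, for example: "Buy me a train ticket from Beijing to Shenzhen for tomorrow", "I want to buy a high-speed rail ticket from Beijing to Shenzhen", "I want to change this high-speed rail ticket", and so on.
S502. Perform data cleaning on the original corpus data to obtain the corpus data to be processed.
Original corpus data often contains many meaningless words or symbols that contribute little to subsequent processing. In addition, the user statements entered by the developer may contain digit strings or letter strings; such consecutive digits and letters usually carry a specific meaning and need to be treated as a whole. Therefore, after obtaining the original corpus data entered by the developer, the Bot platform needs to clean it and use the cleaned corpus data as the corpus data to be processed subsequently.
In this embodiment, data cleaning mainly performs basic preprocessing on the user statement data collected from the Bot platform. The following cleaning tasks can be carried out:
(1) Merging consecutive digits. Because skill intents vary widely, many user statements contain consecutive digit strings, such as phone numbers, measurements, and amounts of money. These digit strings need to be treated as a whole and not split.
(2) Merging consecutive English characters. Consecutive English characters appearing in Chinese text usually represent an English word or a string with a special meaning, and likewise need to be treated as a whole.
(3) Stop word filtering. Many words, such as "的" and "在", cannot be used to distinguish user intents yet are used in many scenarios. Such stop words in the original corpus data can therefore be identified and deleted.
(4) Slot synonym filtering. User statements contain both system-defined slots and their synonyms (such as country, city, and time) and developer-defined slots and their synonyms (such as train ticket, high-speed rail ticket, and hard seat). These words can be swapped freely in a user statement without changing its diversity. Therefore, multiple slot-value pairs in the original corpus data can be identified, the slot name of the word in each slot-value pair determined, and words with the same slot name replaced with that slot name. That is, during data cleaning, all synonyms that represent the same slot are uniformly represented by that slot. For example, "train ticket", "hard seat", and "high-speed rail ticket", which all belong to the "train ticket" slot, can be uniformly represented by "train ticket" in user statements.
After the data cleaning steps above are completed, the corpus data to be processed is obtained.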
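The four cleaning steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: it uses English example statements and a regular-expression tokenizer in place of a Chinese word segmenter, and the `SLOT_VALUES` and `STOP_WORDS` tables are hypothetical.

```python
import re
from typing import List

# Hypothetical slot dictionary (surface form -> slot name); illustrative only.
SLOT_VALUES = {
    "high-speed rail ticket": "ticket",
    "train ticket": "ticket",
    "hard seat": "ticket",
    "beijing": "city",
    "shenzhen": "city",
    "tomorrow": "time",
}
# Hypothetical stop-word list; a real list would be language-specific.
STOP_WORDS = {"a", "an", "the", "me", "please"}

# Single pass over the text, longest surface form first, so that a longer
# slot value is preferred over any shorter overlapping entry.
_SLOT_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(v) for v in sorted(SLOT_VALUES, key=len, reverse=True))
    + r")\b"
)
# A token is a slot placeholder, a run of digits, or a run of letters, so
# consecutive digits (step 1) and consecutive letters (step 2) stay whole.
_TOKEN_RE = re.compile(r"\{[a-z]+\}|\d+|[a-z]+")


def clean_statement(statement: str) -> List[str]:
    text = statement.lower()
    # Step 4: replace every slot value with its slot name.
    text = _SLOT_RE.sub(lambda m: "{" + SLOT_VALUES[m.group(0)] + "}", text)
    tokens = _TOKEN_RE.findall(text)
    # Step 3: drop stop words.
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

Two statements that differ only in slot values, such as a high-speed rail ticket from Beijing to Shenzhen versus a hard seat from Shenzhen to Beijing, clean to the same token list, which is why swapping slot synonyms does not add diversity.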
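Of course, the data cleaning process described above is only an example of this embodiment; other cleaning methods can be used according to actual needs, and this embodiment does not limit them.
S503. Identify at least one user intent contained in the corpus data.
In this embodiment, the user intents to be identified in the corpus data are the user intents contained in the skill the developer is currently configuring. For example, the "My Train Ticket" skill can contain different intents such as "book train ticket", "change train ticket", and "cancel train ticket".
S504. Determine the user sentences contained in each user intent.
Different user intents may contain different user sentences; these user sentences are the user statements that the developer enters into the Bot platform for the corresponding intent.
For example, the "book train ticket" intent can contain user sentences such as "Buy me a train ticket from Beijing to Shenzhen for tomorrow" and "I want to buy a high-speed rail ticket from Beijing to Shenzhen", while the "change train ticket" intent can contain user sentences such as "I want to change this high-speed rail ticket" and "Change my high-speed rail ticket from Beijing to Shenzhen for tomorrow".
S505. Calculate the generation probability of each word according to the number of times each word appears in the user sentences.
Based on the corpus data that the developer provides on the Bot platform, a language model can extract the probability distribution of all words. This distribution effectively characterizes how likely the developer is to use different words or sentence patterns.
In this embodiment, given the characteristics of the Bot platform, an N-gram language model can be used to extract features. Language models other than N-gram have the same effect in practice and can be used instead.
In an N-gram language model, N is generally set to 2 or 3 in order to take local word order into account.
In the N-gram language model, all user statements can be assumed to be generated on the basis of probability statistics; that is, the generation probability of the i-th word w_i in a user sentence can be expressed by the following formula:
$$P(w_i \mid w_{i-1}, \dots, w_1) = P(w_i) = \frac{C(w_i)}{C(w)} \qquad (1)$$
Here, P(w_i | w_{i-1}, …, w_1) = P(w_i) means that the generation of word w_i is independent of the words before and after it, which is the basic assumption of the N-gram model; C(w_i) is the number of times word w_i occurs under the current intent, and C(w) is the total number of word occurrences under the current intent.
That is, when calculating the generation probability of each word, each user sentence contained in the target user intent is first segmented into words; the target user intent can be any one of the user intents contained in the corpus data. Then, by counting the occurrences of each word after segmentation and the total occurrences of all words after segmentation, the generation probability of each word under the target user intent can be calculated from these two counts.
As shown in formula (1), the ratio of the number of occurrences of the target word to the total number of occurrences of all words is taken as the generation probability of the target word under the target user intent; the target word can be any one of all the words.
The generation probabilities calculated above are the feature information subsequently used to calculate the diversity score of the corpus data.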
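The count ratio C(w_i) / C(w) can be sketched in a few lines of Python. The sketch assumes the sentences of one intent have already been segmented into word lists; because the ratio conditions on no preceding words, it is a plain per-word frequency count. The function name `generation_probabilities` is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List


def generation_probabilities(sentences: Iterable[List[str]]) -> Dict[str, float]:
    """Map each word to C(w_i) / C(w) over the segmented sentences of one intent."""
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values())  # C(w): total word occurrences under the intent
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

For two segmented sentences `["book", "{ticket}"]` and `["cancel", "{ticket}"]`, the slot token accounts for two of the four occurrences and the two verbs for one each, and the probabilities sum to 1.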
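S506. Calculate the diversity score of the corpus data according to the feature information;
S507. Process the corpus data according to the diversity score.
Steps S506-S507 in this embodiment are similar to steps S204-S205 in the foregoing embodiment; reference may be made to that description, and details are not repeated here.
In this embodiment of the present application, after the corpus data entered by the developer for configuring a voice interaction skill is obtained, the corpus data can be cleaned. Preprocessing such as data cleaning reduces the interference of irrelevant words or symbols with feature extraction and with the subsequent diversity score calculation. In addition, by using a data-driven language model to calculate word generation probabilities, this embodiment can effectively extract the probability distribution of each word in the user utterances, which helps make the subsequent diversity score calculation more accurate.
Referring to FIG. 6, a schematic flowchart of the steps of a corpus data processing method according to another embodiment of the present application is shown. The method may specifically include the following steps:
S601. Obtain corpus data to be processed;
S602. Identify at least one user intent contained in the corpus data, and determine the user sentences contained in each user intent;
S603. Segment each user sentence contained in a target user intent into words, the target user intent being any one of the user intents contained in the corpus data;
S604. Count the number of occurrences of each word after segmentation, and count the total number of occurrences of all words after segmentation;
S605. Calculate the ratio of the number of occurrences of a target word to the total number of occurrences of all words, and take the ratio as the generation probability of the target word under the target user intent, the target word being any one of the words;
It should be noted that steps S601-S605 in this embodiment are similar to steps S201-S202 and S501-S505 in the foregoing embodiments; reference may be made to those descriptions, and details are not repeated here.
S606. Count the number of words among all the words obtained after segmentation and, taking this number of words and the generation probability of each word as parameters, calculate the diversity score of the target user intent using a preset information entropy formula;
A language model can extract the probability distribution of each word from the corpus data. Based on this distribution, information entropy can be used to calculate the diversity score of the user utterances under a given user intent, that is, the diversity score of that intent.
In a specific implementation, the diversity score of a user intent can be calculated using the following formula:
[Formula image PCTCN2020124481-appb-000002: information entropy formula for the diversity score of a user intent]
where P(w_i) is the generation probability of the word or phrase w_i in all user utterances (with duplicate slots removed) under a given intent of the current dialogue task; V is the number of words in the dictionary, that is, the number of words obtained after segmenting all user utterances under the current intent; and S is the set of all user utterances under the current user intent, each user utterance corresponding to one user sentence in the corpus data.
Using the above information entropy formula, the diversity score of each user intent in the skill that the developer is currently configuring can be calculated.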
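The exact formula used in the application is contained in an image that is not reproduced in this text, so the sketch below substitutes the standard Shannon entropy (in bits) over the word generation probabilities as a stand-in. It is an assumption, not the application's formula: it does not use V or S explicitly, it will not reproduce the scores reported for FIG. 10 to FIG. 12, and the function name `shannon_entropy` is hypothetical.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a word -> probability mapping."""
    # Words with zero probability contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities.values() if p > 0)
```

Under this stand-in, an intent whose utterances use a single word scores 0, and the score grows as the probability mass is spread over more words, which matches the qualitative behaviour described in the text.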
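S607. Determine the diversity score of the corpus data according to the diversity scores of multiple target user intents;
For a given skill, after the diversity score of each intent contained in the skill is calculated, the diversity score of the corpus data corresponding to the skill can be determined from the per-intent scores.
Taking a skill A_i as an example, once the diversity score of each intent under skill A_i has been obtained, the scores can be weighted by the proportion of user utterances under each intent relative to the total number of user utterances under all intents of the dialogue task, yielding the diversity score of skill A_i.
In a specific implementation, the number of user sentences contained in each target user intent and the total number of user sentences contained in all target user intents can be counted. The ratio of each target user intent's sentence count to the total sentence count is then calculated and used as the weight of that target user intent. On this basis, the diversity scores of the target user intents are summed with these weights to obtain the diversity score of the corpus data corresponding to the skill.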
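The sentence-count weighting can be illustrated with a short sketch. The function name `corpus_diversity` and the dictionary-based inputs are assumptions made for illustration; the per-intent scores are taken as given.

```python
def corpus_diversity(intent_scores, intent_sentence_counts):
    """Weighted sum of per-intent diversity scores.

    Both arguments map intent name -> value; the weight of an intent is its
    share of the total number of user sentences across all intents.
    """
    total_sentences = sum(intent_sentence_counts.values())
    if total_sentences == 0:
        raise ValueError("the corpus contains no user sentences")
    return sum(
        intent_scores[intent] * intent_sentence_counts[intent] / total_sentences
        for intent in intent_scores
    )
```

For example, with hypothetical scores of 2.0, 1.0, and 1.0 for three intents holding 5, 1, and 1 sentences, the weights are 5/7, 1/7, and 1/7, and the corpus score is 12/7.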
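S608. Process the corpus data according to the diversity score.
In this embodiment of the present application, by extracting the probability distribution of each word in the corpus data, information entropy can be used to calculate the diversity score of each user intent contained in the corpus data, from which the diversity score of the corpus data is determined. This score is used to evaluate, and effectively quantifies, the diversity of the corpus data, so that developers and Bot platform reviewers can see at a glance whether the corpus data currently provided is rich enough.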
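Referring to FIG. 7, a schematic flowchart of the steps of a corpus data processing method according to another embodiment of the present application is shown. The method may specifically include the following steps:
S701. Obtain corpus data to be processed;
S702. Extract feature information of the corpus data;
S703. Calculate the diversity score of the corpus data according to the feature information;
It should be noted that steps S701-S703 in this embodiment are similar to steps S201-S203, S501-S506, and S601-S607 in the foregoing embodiments; reference may be made to those descriptions, and details are not repeated here.
S704. Receive annotation information with which a user has separately annotated multiple pieces of sample corpus data, the annotation information including first information or second information;
In this embodiment, the annotation information may be obtained by manually examining part of the corpus data. The first information may indicate that the data is annotated as rich in diversity, and the second information may indicate that it is annotated as poor in diversity.
For example, for some sample corpus data, Bot platform reviewers can manually classify each piece as either diversity-rich or diversity-poor.
S705. Gather sample corpus data having the same annotation information into the same set, obtaining a first set and a second set;
In this embodiment, the skills corresponding to corpus data annotated as rich in diversity can be gathered into a first set A, and the skills corresponding to corpus data annotated as poor in diversity can be gathered into a second set B.
S706. Determine a diversity scoring threshold according to the diversity score of each piece of sample corpus data in the first set and the second set;
In this embodiment, the diversity scoring threshold can be obtained by calculating the lower bound of the diversity scores of the diversity-rich skill set A and the upper bound of the diversity scores of the diversity-poor skill set B, and then taking their average.
In a specific implementation, the lower bound of the diversity scores of the sample corpus data in the diversity-rich first set and the upper bound of the diversity scores of the sample corpus data in the diversity-poor second set are calculated first; the average of this lower bound and upper bound is then taken as the final diversity scoring threshold. The threshold can be used to judge whether the corpus data that a developer provides for a skill is rich enough.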
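The threshold calculation can be sketched as below. The text does not say how the lower and upper bounds are computed, so this sketch assumes they are simply the minimum score of the diversity-rich set and the maximum score of the diversity-poor set; the function name `diversity_threshold` is hypothetical.

```python
def diversity_threshold(rich_scores, poor_scores):
    """Average of the lowest diversity-rich score and the highest diversity-poor score."""
    lower_bound_rich = min(rich_scores)   # assumed lower bound of the first set
    upper_bound_poor = max(poor_scores)   # assumed upper bound of the second set
    return (lower_bound_rich + upper_bound_poor) / 2
```

For hypothetical annotated scores of 30.0, 42.0, and 26.0 in the first set and 8.0 and 12.0 in the second set, the threshold is (26.0 + 12.0) / 2 = 19.0.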
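S707. If the diversity score of the corpus data is greater than or equal to the diversity scoring threshold, process the interactive skill corresponding to the corpus data;
In this embodiment, once the diversity score of the corpus data and the diversity scoring threshold have been calculated, the two can easily be compared. If the diversity score is greater than or equal to the threshold, the user utterances configured by the developer are sufficiently diverse, and the Bot platform can review such skills with priority to speed up their release.
S708. If the diversity score of the corpus data is less than the diversity scoring threshold, prompt the user to modify or supplement the corpus data.
If the diversity score is less than the threshold, the user utterances configured by the developer are insufficiently diverse, and the developer needs to be prompted to modify or supplement the current corpus data by entering more utterances with different sentence patterns so as to raise the diversity score.
In this embodiment, the Bot platform can use the configuration page to inform the developer directly that diversity is insufficient, reminding the developer to add more utterances with different sentence patterns as soon as possible and thereby speeding up skill release.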
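The branch between steps S707 and S708 amounts to a single comparison. The sketch below is illustrative only; the function name `handle_corpus` and the two returned action labels are hypothetical and stand in for whatever the platform actually does in each case.

```python
def handle_corpus(diversity_score, threshold):
    """Choose the follow-up action for a skill from its diversity score."""
    if diversity_score >= threshold:
        # S707: utterances are diverse enough, so the skill can be reviewed with priority.
        return "prioritize_review"
    # S708: utterances are not diverse enough, so the developer is prompted to add more.
    return "prompt_developer"
```

Note that a score exactly equal to the threshold falls on the S707 side, as the step definitions specify "greater than or equal to".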
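In this embodiment of the present application, determining the threshold for judging diversity from the diversity scores of sample corpus data effectively ensures the accuracy of the threshold. Comparison with the diversity scoring threshold also makes it possible to judge quickly whether the user utterances provided for the current skill are rich enough. Skills with rich diversity can be reviewed with priority and released quickly by the Bot platform, whereas for skills with insufficient diversity the developer can be alerted promptly through the interactive interface and thus discover the problem early. If developers were informed of insufficient corpus diversity only at skill review, the time cost of discovering the problem would increase greatly, because the review cycle is generally measured in days. Moreover, for the design, training, and launch of a skill to meet user needs, the Bot platform and the developer need to work in coordination. By designing an interactive iteration between the Bot platform and the developer, this embodiment greatly enhances the platform's interactive capability, lets developers discover their problems in real time while configuring a skill, and gives them a better experience when configuring skills on the Bot platform.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
For ease of understanding, the corpus data processing method of the embodiments of the present application is described below with reference to a specific example.
FIG. 8 is a schematic architecture diagram of a system to which the corpus data processing method of this embodiment is applicable. According to the architecture shown in FIG. 8, user utterance data of the Bot platform is collected and, after preliminary data cleaning, subjected to feature extraction. During feature extraction, a data-driven language model is built to calculate the probability distribution of all words in the current corpus. Information entropy is then used to calculate the diversity score of the corpus, and comparison with the threshold readily determines whether interaction with the user is needed to ask for more user utterances with different sentence patterns, thereby improving the diversity of the user utterances.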
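The flow through this architecture can be condensed into one end-to-end sketch. It rests on several assumptions: the utterances are already cleaned and segmented, standard Shannon entropy stands in for the application's entropy formula (which is not reproduced in this text), and the function name `score_skill` and the input layout are hypothetical.

```python
import math
from collections import Counter


def score_skill(skill, threshold):
    """Return (diversity score, passes threshold) for one skill.

    `skill` maps each intent name to a list of tokenized user utterances.
    """
    total_sentences = sum(len(utterances) for utterances in skill.values())
    score = 0.0
    for utterances in skill.values():
        # Feature extraction: word counts under this intent.
        counts = Counter(token for utterance in utterances for token in utterance)
        total_words = sum(counts.values())
        # Per-intent diversity: Shannon entropy of the word distribution.
        entropy = sum(
            -(count / total_words) * math.log2(count / total_words)
            for count in counts.values()
        )
        # Weight the intent by its share of all user sentences.
        score += entropy * len(utterances) / total_sentences
    return score, score >= threshold
```

Under these assumptions, an intent that repeats one sentence pattern scores lower than an intent with the same number of sentences written in different patterns, so only the latter clears a threshold placed between the two scores.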
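FIG. 9 is a schematic diagram of a skill development process on the Bot platform according to this embodiment. To develop a voice interaction skill, a developer configures the skill on the Bot platform and then submits it for model training; after training is completed, the skill must pass the Bot platform's review before it can be released.
When configuring skills on the Bot platform, many developers lack expertise in dialogue systems and therefore do not realize, when configuring the user utterances of an intent, that they need to provide sentences with sufficiently rich patterns or vocabulary. Such a development process not only greatly lengthens the cycle for developing a skill but also seriously affects the Bot platform's skill delivery schedule. With the processing method provided in this embodiment, the diversity score of each skill configured by the developer is calculated. If the score is sufficient, the skill can pass the preliminary online review directly; if not, the Bot platform can immediately give the developer feedback on the diversity of the user utterances so that the developer configures utterances with more sentence patterns or richer wording. Through repeated iterations, the skills configured by the developer become better and better. The entire process can be completed on the Bot platform, and the diversity scoring is likewise performed online, which greatly shortens the time developers need to discover problems and greatly improves both the developer experience and the platform's skill delivery schedule.
This embodiment embeds the diversity scoring of the user utterances contained in a skill into the skill training process on the Bot platform, before the skill review step, which effectively mitigates the overly long development cycles caused by failed skill reviews.
Below, the skill "My Train Butler" is created on the Bot platform, and utterances of differing diversity are defined under the intent "purchase a train ticket" to illustrate the effectiveness of the processing method provided in this embodiment in evaluating the diversity of user utterances.
FIG. 10 shows the diversity score of the skill "My Train Butler 1". As can be seen from FIG. 10, "My Train Butler 1" defines three intents: "book a train ticket", "change a train ticket", and "cancel a train ticket". In FIG. 10, all three intents contain utterance data of only one and the same sentence pattern (the pattern being "{departure time} high-speed rail ticket from {departure city} to {destination city}"). "Change a train ticket" and "cancel a train ticket" each contain only one sentence, while "book a train ticket" contains five sentences; these five sentences differ only in the destination city and are still user utterances of the same pattern. With this corpus configuration, the processing method provided in this embodiment gives the skill a diversity score of 8.333.
FIG. 11 shows the diversity score of the skill "My Train Butler 2", which is configured on the basis of "My Train Butler 1". "My Train Butler 2" replaces four of the previously defined sentences of the intent "book a train ticket" with utterances of richer sentence patterns. As can be seen from FIG. 11, in addition to the pattern "{departure time} high-speed rail ticket from {departure city} to {destination city}", the intent contains four new patterns: "help me buy a ticket for the high-speed train from {departure city} to {destination city}", "buy a high-speed rail ticket departing {departure city} for {destination city}", "are there any high-speed rail tickets left with {departure city} as the origin station and {destination city} as the terminal station", and "are there any high-speed rail tickets left departing {departure city} for {destination city}". With this data configuration, the diversity score of the skill rises to 26.812. Compared with the first skill in FIG. 10, the user utterances of this skill are configured with greater diversity, and the diversity score calculated by the method of this embodiment is correspondingly higher, which demonstrates the effectiveness of the method.
Further, as shown in FIG. 12, utterances of other sentence patterns are added on the basis of "My Train Butler 2", such as "help me buy a high-speed rail ticket for the ride from {departure city} to {destination city}", "could you help me buy a ticket to {destination city} by high-speed rail", "are there any high-speed rail tickets left from {departure city} to {destination city}", and "book a high-speed rail ticket from {departure city} to {destination city}". With this corpus configuration, the diversity score of the skill rises to 42.203, because the skill contains user utterances of more sentence patterns. Compared with "My Train Butler 1", this skill also recognizes user utterances far more accurately, which improves the experience of users of the skill.
The embodiments of the present application provide a new method for preprocessing user utterance diversity that combines a language model with information entropy, and can effectively evaluate the diversity score and threshold of the user utterances defined by a developer. When the user utterances defined by the user are insufficiently diverse (the diversity score is less than the threshold), the platform can iterate interactively with the developer on the basis of the diversity score, notifying the developer to provide utterances with more and richer sentence patterns. This improves skill performance while greatly shortening the review and launch cycle of the skill and improving the developer's experience of the Bot platform.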
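Corresponding to the corpus data processing method described in the foregoing embodiments, FIG. 13 is a structural block diagram of a corpus data processing apparatus according to an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 13, the apparatus may be applied to a server and may specifically include the following modules:
a corpus data acquisition module 1301, configured to obtain corpus data to be processed;
a feature information extraction module 1302, configured to extract feature information of the corpus data;
a diversity score calculation module 1303, configured to calculate the diversity score of the corpus data according to the feature information;
a corpus data processing module 1304, configured to process the corpus data according to the diversity score.
In this embodiment of the present application, the corpus data acquisition module 1301 may specifically include the following submodules:
a raw corpus data acquisition submodule, configured to obtain raw corpus data entered by a user;
a data cleaning submodule, configured to clean the raw corpus data to obtain the corpus data to be processed.
In this embodiment of the present application, the data cleaning submodule may specifically include the following units:
a slot identification unit, configured to identify multiple slot-value pairs in the raw corpus data and determine the slot name of the word in each slot-value pair;
a slot word replacement unit, configured to replace words having the same slot name with the corresponding slot name;
a stop-word filtering unit, configured to identify and delete stop words in the raw corpus data to obtain the corpus data to be processed.
In this embodiment of the present application, the feature information may include the generation probability of each word in the corpus data, and the feature information extraction module 1302 may specifically include the following submodules:
a user intent identification submodule, configured to identify at least one user intent contained in the corpus data;
a user sentence determination submodule, configured to determine the user sentences contained in each user intent;
a generation probability calculation submodule, configured to calculate the generation probability of each word according to the number of times each word appears in the user sentences.
In this embodiment of the present application, the generation probability calculation submodule may specifically include the following units:
a user sentence segmentation unit, configured to segment each user sentence contained in a target user intent into words, the target user intent being any one of the user intents contained in the corpus data;
a word count statistics unit, configured to count the number of occurrences of each word after segmentation, and to count the total number of occurrences of all words after segmentation;
a generation probability calculation unit, configured to calculate the generation probability of each word under the target user intent according to the number of occurrences of each word and the total number of occurrences of all words.
In this embodiment of the present application, the generation probability calculation unit may specifically include the following subunit:
a generation probability calculation subunit, configured to calculate the ratio of the number of occurrences of a target word to the total number of occurrences of all words, and take the ratio as the generation probability of the target word under the target user intent, the target word being any one of the words.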
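In this embodiment of the present application, the diversity score calculation module 1303 may specifically include the following submodules:
a word number statistics submodule, configured to count the number of words among all the words obtained after segmentation;
a user intent diversity score calculation submodule, configured to calculate the diversity score of the target user intent using a preset information entropy formula, with the number of words and the generation probability of each word as parameters;
a corpus data diversity score calculation submodule, configured to determine the diversity score of the corpus data according to the diversity scores of multiple target user intents.
In this embodiment of the present application, the corpus data diversity score calculation submodule may specifically include the following units:
a sentence number statistics unit, configured to count the number of user sentences contained in each target user intent, and to count the total number of user sentences contained in all target user intents;
a weight calculation unit, configured to calculate the ratio of the number of user sentences contained in each target user intent to the total number of sentences, and take each ratio as the weight of the corresponding target user intent;
a corpus data diversity score calculation unit, configured to perform a weighted summation of the diversity scores of the target user intents according to the weight of each target user intent, to obtain the diversity score of the corpus data.
In this embodiment of the present application, the apparatus may further include the following modules:
an annotation information receiving module, configured to receive annotation information with which a user has separately annotated multiple pieces of sample corpus data, the annotation information including first information or second information;
a corpus data gathering module, configured to gather sample corpus data having the same annotation information into the same set, obtaining a first set and a second set;
a diversity scoring threshold determination module, configured to determine a diversity scoring threshold according to the diversity score of each piece of sample corpus data in the first set and the second set.
In this embodiment of the present application, the diversity scoring threshold determination module may specifically include the following submodules:
a diversity score lower bound calculation submodule, configured to calculate the lower bound of the diversity scores of the sample corpus data in the first set;
a diversity score upper bound calculation submodule, configured to calculate the upper bound of the diversity scores of the sample corpus data in the second set;
a diversity scoring threshold calculation submodule, configured to calculate the average of the lower bound and the upper bound, and take the average as the diversity scoring threshold.
In this embodiment of the present application, the corpus data processing module 1304 may specifically include the following submodules:
an interactive skill processing submodule, configured to process the interactive skill corresponding to the corpus data if the diversity score of the corpus data is greater than or equal to the diversity scoring threshold;
a developer prompting submodule, configured to prompt the user to modify or supplement the corpus data if the diversity score of the corpus data is less than the diversity scoring threshold.
Because the apparatus embodiment is substantially similar to the method embodiments, it is described relatively briefly; for relevant details, refer to the description of the method embodiments.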
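Referring to FIG. 14, a schematic diagram of a server according to an embodiment of the present application is shown. As shown in FIG. 14, the server 1400 of this embodiment includes a processor 1410, a memory 1420, and a computer program 1421 stored in the memory 1420 and executable on the processor 1410. When executing the computer program 1421, the processor 1410 implements the steps in the embodiments of the corpus data processing method described above, for example steps S201 to S204 shown in FIG. 2. Alternatively, when executing the computer program 1421, the processor 1410 implements the functions of the modules/units in the apparatus embodiments described above, for example the functions of modules 1301 to 1304 shown in FIG. 13.
Exemplarily, the computer program 1421 may be divided into one or more modules/units, which are stored in the memory 1420 and executed by the processor 1410 to carry out the method provided in the embodiments of the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments may be used to describe the execution of the computer program 1421 in the server 1400. For example, the computer program 1421 may be divided into a corpus data acquisition module, a feature information extraction module, a diversity score calculation module, and a corpus data processing module, whose specific functions are as follows:
the corpus data acquisition module is configured to obtain corpus data to be processed;
the feature information extraction module is configured to extract feature information of the corpus data;
the diversity score calculation module is configured to calculate the diversity score of the corpus data according to the feature information;
the corpus data processing module is configured to process the corpus data according to the diversity score.
The server 1400 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The server 1400 may include, but is not limited to, the processor 1410 and the memory 1420. A person skilled in the art will understand that FIG. 14 is merely an example of the server 1400 and does not limit it; the server may include more or fewer components than shown, combine certain components, or use different components. For example, the server 1400 may further include input/output devices, network access devices, a bus, and the like.
The processor 1410 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 1420 may be an internal storage unit of the server 1400, such as a hard disk or main memory of the server 1400. The memory 1420 may also be an external storage device of the server 1400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the server 1400. Further, the memory 1420 may include both an internal storage unit and an external storage device of the server 1400. The memory 1420 is used to store the computer program 1421 and other programs and data required by the server 1400. The memory 1420 may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further discloses a computer-readable storage medium storing a computer program that, when executed by a processor, implements the foregoing corpus data processing method.
An embodiment of the present application further discloses a computer program product that, when run on a server, causes the server to perform the foregoing corpus data processing method.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, refer to the related descriptions of other embodiments.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed corpus data processing method, apparatus, server, and storage medium may be implemented in other ways. For example, the division into modules or units is merely a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the procedures of the methods in the foregoing embodiments may be carried out by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the foregoing method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the corpus data processing apparatus and the server, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunications signals.
Finally, it should be noted that the foregoing is merely specific implementations of the present application, and the protection scope of the present application is not limited thereto. Any change or replacement within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.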

Claims (14)

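  1. A corpus data processing method, characterized by comprising:
    obtaining corpus data to be processed;
    extracting feature information of the corpus data;
    calculating a diversity score of the corpus data according to the feature information;
    processing the corpus data according to the diversity score.
  2. The method according to claim 1, characterized in that obtaining the corpus data to be processed comprises:
    obtaining raw corpus data entered by a user;
    performing data cleaning on the raw corpus data to obtain the corpus data to be processed.
  3. The method according to claim 2, characterized in that performing data cleaning on the raw corpus data to obtain the corpus data to be processed comprises:
    identifying multiple slot-value pairs in the raw corpus data, and determining the slot name of the word in each slot-value pair;
    replacing words having the same slot name with the corresponding slot name;
    identifying and deleting stop words in the raw corpus data to obtain the corpus data to be processed.
  4. The method according to any one of claims 1 to 3, characterized in that the feature information comprises the generation probability of each word in the corpus data;
    correspondingly, extracting the feature information of the corpus data comprises:
    identifying at least one user intent contained in the corpus data;
    determining the user sentences contained in each user intent;
    calculating the generation probability of each word according to the number of times each word appears in the user sentences.
  5. The method according to claim 4, characterized in that calculating the generation probability of each word according to the number of times each word appears in the user sentences comprises:
    segmenting each user sentence contained in a target user intent into words, the target user intent being any one of the user intents contained in the corpus data;
    counting the number of occurrences of each word after segmentation; and
    counting the total number of occurrences of all words after segmentation;
    calculating the generation probability of each word under the target user intent according to the number of occurrences of each word and the total number of occurrences of all words.
  6. The method according to claim 5, characterized in that calculating the generation probability of each word under the target user intent according to the number of occurrences of each word and the total number of occurrences of all words comprises:
    calculating the ratio of the number of occurrences of a target word to the total number of occurrences of all words, and taking the ratio as the generation probability of the target word under the target user intent, the target word being any one of the words.
  7. The method according to claim 5 or [claim number missing in the source], characterized in that calculating the diversity score of the corpus data according to the feature information comprises:
    counting the number of words among all the words obtained after segmentation;
    calculating the diversity score of the target user intent using a preset information entropy formula, with the number of words and the generation probability of each word as parameters;
    determining the diversity score of the corpus data according to the diversity scores of multiple target user intents.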
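  8. The method according to claim 7, characterized in that determining the diversity score of the corpus data according to the diversity scores of multiple target user intents comprises:
    counting the number of user sentences contained in each target user intent, and counting the total number of user sentences contained in all target user intents;
    calculating the ratio of the number of user sentences contained in each target user intent to the total number of sentences, and taking each ratio as the weight of the corresponding target user intent;
    performing a weighted summation of the diversity scores of the target user intents according to the weight of each target user intent, to obtain the diversity score of the corpus data.
  9. The method according to claim 1, 2, 3, 5, 6, 7, or 8, characterized in that, after calculating the diversity score of the corpus data according to the feature information, the method further comprises:
    receiving annotation information with which a user has separately annotated multiple pieces of sample corpus data, the annotation information comprising first information or second information;
    gathering sample corpus data having the same annotation information into the same set, obtaining a first set and a second set;
    determining a diversity scoring threshold according to the diversity score of each piece of sample corpus data in the first set and the second set.
  10. The method according to claim 9, characterized in that determining the diversity scoring threshold according to the diversity score of each piece of sample corpus data in the first set and the second set comprises:
    calculating the lower bound of the diversity scores of the sample corpus data in the first set; and
    calculating the upper bound of the diversity scores of the sample corpus data in the second set;
    calculating the average of the lower bound and the upper bound, and taking the average as the diversity scoring threshold.
  11. The method according to claim 10, characterized in that processing the corpus data according to the diversity score comprises:
    if the diversity score of the corpus data is greater than or equal to the diversity scoring threshold, processing the interactive skill corresponding to the corpus data;
    if the diversity score of the corpus data is less than the diversity scoring threshold, prompting the user to modify or supplement the corpus data.
  12. A corpus data processing apparatus, characterized by comprising:
    a corpus data acquisition module, configured to obtain corpus data to be processed;
    a feature information extraction module, configured to extract feature information of the corpus data;
    a diversity score calculation module, configured to calculate a diversity score of the corpus data according to the feature information;
    a corpus data processing module, configured to process the corpus data according to the diversity score.
  13. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the corpus data processing method according to any one of claims 1 to 11.
  14. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the corpus data processing method according to any one of claims 1 to 11.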
PCT/CN2020/124481 2019-12-25 2020-10-28 Corpus data processing method and apparatus, server, and storage medium WO2021129123A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911355478.5 2019-12-25
CN201911355478.5A CN111209363B (zh) 2019-12-25 2019-12-25 Corpus data processing method and apparatus, server, and storage medium

Publications (1)

Publication Number Publication Date
WO2021129123A1 true WO2021129123A1 (zh) 2021-07-01

Family

ID=70784297

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124481 WO2021129123A1 (zh) 2019-12-25 2020-10-28 Corpus data processing method and apparatus, server, and storage medium

Country Status (2)

Country Link
CN (1) CN111209363B (zh)
WO (1) WO2021129123A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209363B (zh) * 2019-12-25 2024-02-09 华为技术有限公司 语料数据处理方法、装置、服务器和存储介质
CN112035632A (zh) * 2020-08-21 2020-12-04 惠州市德赛西威汽车电子股份有限公司 一种适用于多对话机器人协作任务的择优分发方法和系统
CN112489628B (zh) * 2020-11-23 2024-02-06 平安科技(深圳)有限公司 语音数据选择方法、装置、电子设备及存储介质
CN114330285B (zh) * 2021-11-30 2024-04-16 腾讯科技(深圳)有限公司 语料处理方法、装置、电子设备及计算机可读存储介质
CN114372446B (zh) * 2021-12-13 2023-02-17 北京爱上车科技有限公司 一种车属性标注方法、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104866496A (zh) * 2014-02-22 2015-08-26 腾讯科技(深圳)有限公司 确定词素重要性分析模型的方法及装置
US20180217977A1 (en) * 2014-11-12 2018-08-02 Applause App Quality, Inc. Computer-implemented methods and systems for clustering user reviews and ranking clusters
CN109614608A (zh) * 2018-10-26 2019-04-12 平安科技(深圳)有限公司 电子装置、文本信息检测方法及存储介质
CN110457684A (zh) * 2019-07-15 2019-11-15 广州九四智能科技有限公司 智能电话客服的语义分析方法
CN111209363A (zh) * 2019-12-25 2020-05-29 华为技术有限公司 语料数据处理方法、装置、服务器和存储介质

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9081760B2 (en) * 2011-03-08 2015-07-14 At&T Intellectual Property I, L.P. System and method for building diverse language models
DE102013101871A1 (de) * 2013-02-26 2014-08-28 PSYWARE GmbH Wortwahlbasierte Sprachanalyse und Sprachanalyseeinrichtung
US20160162473A1 (en) * 2014-12-08 2016-06-09 Microsoft Technology Licensing, Llc Localization complexity of arbitrary language assets and resources
WO2017171826A1 (en) * 2016-04-01 2017-10-05 Intel Corporation Entropic classification of objects
CN108334353B (zh) * 2017-08-31 2021-04-02 科大讯飞股份有限公司 技能开发系统及方法
CN108268668B (zh) * 2018-02-28 2022-01-18 福州大学 一种基于话题多样性的文本数据观点摘要挖掘方法
CN108549656B (zh) * 2018-03-09 2022-06-28 北京百度网讯科技有限公司 语句解析方法、装置、计算机设备及可读介质
CN108664568A (zh) * 2018-04-24 2018-10-16 科大讯飞股份有限公司 语义技能创建方法及装置
CN108831442A (zh) * 2018-05-29 2018-11-16 平安科技(深圳)有限公司 兴趣点识别方法、装置、终端设备及存储介质
CN109858029B (zh) * 2019-01-31 2023-02-10 沈阳雅译网络技术有限公司 一种提高语料整体质量的数据预处理方法
CN110223674B (zh) * 2019-04-19 2023-05-26 平安科技(深圳)有限公司 语音语料训练方法、装置、计算机设备和存储介质
CN110297880B (zh) * 2019-05-21 2023-04-18 深圳壹账通智能科技有限公司 语料产品的推荐方法、装置、设备及存储介质
CN110377900A (zh) * 2019-06-17 2019-10-25 深圳壹账通智能科技有限公司 网络内容发布的审核方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN111209363A (zh) 2020-05-29
CN111209363B (zh) 2024-02-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20905760

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20905760

Country of ref document: EP

Kind code of ref document: A1