CN111339767A - Conversation source data processing method and device, electronic equipment and computer readable medium - Google Patents

Conversation source data processing method and device, electronic equipment and computer readable medium

Info

Publication number
CN111339767A
Authority
CN
China
Prior art keywords
source data
dialogue
conversation
word segmentation
granularity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010107942.5A
Other languages
Chinese (zh)
Other versions
CN111339767B (en)
Inventor
翟周伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Shanghai Xiaodu Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010107942.5A
Publication of CN111339767A
Application granted
Publication of CN111339767B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The present disclosure provides a dialogue source data processing method, which includes: step 101, obtaining effective dialogue source data based on dialogue source data; step 102, obtaining a dialogue sample by using the effective dialogue source data; step 103, constructing a word segmentation model based on the dialogue sample; and step 104, applying the word segmentation model to a dialogue system, acquiring new dialogue source data according to user behaviors, and returning the newly acquired dialogue source data to step 101. The method obtains large-scale, high-precision dialogue samples and adaptively improves the dialogue system, so that word segmentation accuracy is improved. The disclosure also provides a dialogue source data processing device, an electronic device and a computer readable medium.

Description

Conversation source data processing method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the disclosure relates to the technical field of information processing, and in particular relates to a method and a device for processing conversation source data, an electronic device and a computer readable medium.
Background
Natural Language Processing (NLP) technology recognizes the voice content of a user through a voice recognition module, analyzes the voice content, and finally obtains the corresponding semantics, thereby facilitating communication between a person and a machine.
Analyzing the speech involves segmenting the speech content input by the user into words by using labeled corpora. Common word segmentation methods include the dictionary matching method, the statistical and Dynamic Programming (DP) method based on an N-Gram model, and the sequence labeling method. The dictionary matching method mines an independent basic word dictionary and then segments by forward maximum or reverse maximum matching. This method has difficulty resolving semantic and boundary ambiguity, so its precision is poor. The N-Gram statistics and DP method calculates the segmentation probability from N-Gram statistics and an N-Gram dictionary, but its boundary segmentation of predicates and proper nouns is not ideal in a dialogue scene. For example, for "enlarged pessimism", the ideal segmentation is the predicate "put" and the proper noun "pessimism", but the text is easily segmented into the predicate "enlarged" and the proper noun "pessimism". The sequence labeling method segments words with a sequence labeling model, but large-scale, high-precision samples are difficult to obtain, so the segmentation precision is difficult to improve.
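The boundary problem of the dictionary matching method can be illustrated with a minimal sketch of forward maximum matching. The function name, the `max_len` parameter and the toy Latin-letter dictionary are assumptions made for illustration only; they do not come from the disclosure.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position the longest dictionary word is taken; a single
    character is used as a fallback when nothing in the dictionary matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a hit is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy data: "a" plays the predicate, "bcd" the proper noun.
toy_dictionary = {"ab", "bcd"}
greedy_result = forward_max_match("abcd", toy_dictionary)
print(greedy_result)
```

In this toy case the greedy match consumes "ab" first, so the intended split "a / bcd" can no longer be produced, which is the same kind of predicate and proper noun boundary error described above.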
Disclosure of Invention
The embodiment of the disclosure provides a conversation source data processing method and device, electronic equipment and a computer readable medium.
In a first aspect, an embodiment of the present disclosure provides a method for processing dialog source data, including:
step S101, obtaining effective dialogue source data according to the dialogue source data;
step S102, obtaining a conversation sample by utilizing the effective conversation source data;
step S103, constructing a word segmentation model based on the dialogue sample;
and step S104, applying the word segmentation model to a dialogue system, acquiring new dialogue source data according to user behaviors, and returning the new dialogue source data to the step S101.
In some embodiments, the obtaining valid dialog source data from the dialog source data includes:
obtaining conversation satisfaction from the conversation source data;
and extracting the dialogue source data with the dialogue satisfaction degree larger than the satisfaction degree threshold value as the effective dialogue source data.
In some embodiments, said obtaining a conversation sample using said valid conversation source data comprises:
segmenting conversation content in the conversation source data into N fragments in a collision alignment mode; wherein N is an integer greater than or equal to 1;
segmenting the segments to obtain word segmentation granularity;
and correcting the boundary and granularity of the word segmentation granularity to obtain a conversation sample.
In some embodiments, the segment is the largest common segment.
In some embodiments, the participle granularity includes one or more of a base word granularity and a shuffle word granularity.
In some embodiments, the modifying the boundaries and the granularity of the word segmentation granularity to obtain a dialog sample includes:
counting the alignment times, the segmented times and the independent search times of the word segmentation granularity;
calculating a merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times and proper nouns;
calculating the segmentation probability of the word segmentation granularity based on the alignment times and the segmentation times;
and modifying the word segmentation granularity according to the merging probability and the segmented probability to obtain the dialogue sample.
In some embodiments, said constructing a word segmentation model based on said dialogue sample comprises:
and constructing a sequence labeling model by using the dialogue sample through a gated recurrent model and a conditional random field model.
In some embodiments, the sequence labeling model comprises a base word model and a shuffle word model.
In a second aspect, an embodiment of the present disclosure provides a dialog source data processing apparatus, which includes:
the effective data acquisition module is used for acquiring effective conversation source data according to the conversation source data;
the sample acquisition module is used for acquiring a conversation sample by utilizing the effective conversation source data;
the model building module is used for building a word segmentation model based on the dialogue sample;
and the source data acquisition module is used for applying the word segmentation model to a dialogue system, acquiring new dialogue source data according to user behavior and returning the new dialogue source data to the effective data acquisition module.
In some embodiments, the valid data acquisition module comprises:
a satisfaction acquiring unit for acquiring a conversation satisfaction from the conversation source data;
and the extracting unit is used for extracting the dialogue source data with the dialogue satisfaction degree larger than the satisfaction degree threshold value as the effective dialogue source data.
In some embodiments, the sample acquisition module comprises:
the segmentation unit is used for segmenting the conversation content in the conversation source data into N fragments in a collision alignment mode; wherein N is an integer greater than or equal to 1;
the word segmentation unit is used for segmenting the fragments to obtain word segmentation granularity;
and the sample acquisition unit is used for correcting the boundary and the granularity of the word segmentation granularity to obtain a conversation sample.
In some embodiments, the sample acquiring unit comprises:
the counting subunit is used for counting the alignment times, the segmented times and the independent search times of the word segmentation granularity;
a merging probability calculating subunit, configured to calculate a merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times, and the proper nouns;
a segmented probability calculating subunit, configured to calculate a segmented probability of the word segmentation granularity based on the alignment times and the segmented times;
and the word segmentation granularity correction subunit is used for correcting the word segmentation granularity according to the merging probability and the segmented probability to obtain the conversation sample.
In some embodiments, the model building module builds a sequence labeling model using the dialogue sample by a gated recurrent model and a conditional random field model.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
one or more processors;
a memory having one or more programs stored thereon that, when executed by the one or more processors, cause the one or more processors to perform any of the above-described dialog source data processing methods;
one or more I/O interfaces connected between the processor and the memory and configured to enable information interaction between the processor and the memory.
In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor implements any of the above-mentioned dialog source data processing methods.
The dialogue source data processing method provided by the embodiment of the disclosure includes: step 101, obtaining effective dialogue source data based on dialogue source data; step 102, obtaining a dialogue sample by using the effective dialogue source data; step 103, constructing a word segmentation model based on the dialogue sample; and step 104, applying the word segmentation model to a dialogue system, obtaining new dialogue source data according to user behaviors, and returning the new dialogue source data to step 101. In this way, large-scale, high-precision dialogue samples are mined automatically and cyclically, and the word segmentation model is adaptively trained and improved, which improves word segmentation accuracy in different dialogue scenes and the user experience of the dialogue system.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure.
The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:
fig. 1 is a flowchart of a dialog source data processing method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of the step of obtaining a dialog sample in the dialog source data processing method according to an embodiment of the present disclosure;
fig. 3 is a schematic block diagram of a dialog source data processing apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic block diagram of a sample obtaining module in the dialog source data processing apparatus according to the embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present disclosure, the following describes in detail a dialog source data processing method and apparatus, an electronic device, and a computer readable medium provided in the present disclosure with reference to the accompanying drawings.
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but which may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Word segmentation is a fundamental problem of lexical analysis in natural language processing; its difficulty lies in segmenting ambiguous boundaries and selecting the segmentation granularity. Current word segmentation methods are aimed at non-dialogue scenes, and boundary errors between predicates and proper nouns easily occur in dialogue scenes, particularly those based on artificial intelligence voice assistants. For example, in a dialogue scenario, "please turn up an air conditioner by one" is easily segmented incorrectly, and "leaving cat" is mistakenly segmented as "leave school". One reason for inaccurate word segmentation is that high-precision samples are difficult to obtain, so model training is deficient, which affects the boundary and granularity of the segmentation. The embodiment provides a dialogue source data processing method and device that address the problem of inaccurate word segmentation.
In a first aspect, an embodiment of the present disclosure provides a method for processing dialog source data. The dialogue source data processing method can be applied to an automatic voice dialogue system to accurately and automatically reply to the inquiry of a user. Fig. 1 is a flowchart of a dialog source data processing method according to an embodiment of the present disclosure. Referring to fig. 1, the dialog source data processing method includes:
step 101, obtaining valid dialog source data based on the dialog source data.
Wherein the dialog source data may be from an automated voice dialog system. The automatic voice dialogue system is used for analyzing the voice of a user inquiry, recognizing the intention of the user and automatically replying to the user inquiry. In this process, dialog source data is formed that includes user queries, intentions, and system replies.
The effective dialogue source data is dialogue source data for which the user is satisfied with the system reply. In some embodiments, the dialogue source data also includes the user's evaluation of dialogue satisfaction. The evaluation may be a simple rating of satisfied or dissatisfied, or may be a specific satisfaction score.
In some embodiments, the user evaluates the system reply after using the automatic voice dialogue system, and the dialogue source data that the user rates as satisfactory, or whose satisfaction score is higher than a preset score value, is used as the effective dialogue source data.
Specifically, the dialogue satisfaction is obtained from the dialogue source data, and the dialogue source data whose dialogue satisfaction is greater than a satisfaction threshold is extracted as the effective dialogue source data. The satisfaction threshold may be set arbitrarily; for example, it may be set according to the data amount of the dialogue source data. It will be appreciated that the higher the satisfaction threshold, the less effective dialogue source data is obtained, and the lower the threshold, the more effective dialogue source data is obtained. In practical applications, the satisfaction threshold may be set according to the performance of the hardware.
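The threshold-based extraction of step 101 can be sketched as follows. The record layout (`DialogueRecord`, with a numeric `satisfaction` field on an assumed 0.0 to 1.0 scale) and the function name `select_effective` are hypothetical; the disclosure does not specify a data format.

```python
from dataclasses import dataclass


@dataclass
class DialogueRecord:
    query: str           # user query text
    reply: str           # system reply text
    satisfaction: float  # assumed scale: 0.0 (dissatisfied) to 1.0 (satisfied)


def select_effective(records, threshold):
    """Keep only records whose satisfaction is strictly greater than the threshold."""
    return [r for r in records if r.satisfaction > threshold]


records = [
    DialogueRecord("query A", "reply A", 0.9),
    DialogueRecord("query B", "reply B", 0.4),
    DialogueRecord("query C", "reply C", 0.7),
]
effective = select_effective(records, threshold=0.6)
print([r.query for r in effective])
```

Raising `threshold` shrinks the returned list and lowering it enlarges the list, which matches the trade-off between sample quality and data volume noted above.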
Step 102, obtaining a dialogue sample by using the effective dialogue source data.
The dialogue content is segmented by collision alignment to obtain segments; the segments are then segmented by a word segmentation method to obtain the word segmentation granularity, thereby obtaining a dialogue sample. In some embodiments, the boundary segmentation probability and the merging probability of a segment are statistically analyzed, and the boundary and the word segmentation granularity of the segment are then corrected, thereby obtaining more accurate dialogue samples.
Step 103, constructing a word segmentation model based on the dialogue sample.
The word segmentation model may be one or more sequence labeling models for word segmentation. In some embodiments, the sequence labeling model is constructed from the dialogue sample by a gated recurrent model and a conditional random field model.
It should be noted that, in the embodiment of the present disclosure, constructing a word segmentation model based on the dialog sample is not only interpreted as initially constructing the word segmentation model, but also interpreted as optimizing the word segmentation model. For example, when the word segmentation model is established for the first time, the word segmentation model is established based on the dialogue sample; when the word segmentation model already exists, the word segmentation model is optimized based on the dialogue sample.
In some embodiments, the sequence annotation model can be a base word model and a shuffle word model.
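Before a sequence labeling model can be trained, each segmented dialogue sample has to be turned into a per-character tag sequence. The sketch below assumes the widely used B/M/E/S tagging scheme (begin, middle, end, single); the disclosure does not name a tagging scheme, and the function name `to_bmes` is hypothetical.

```python
def to_bmes(words):
    """Convert a list of segmented words into parallel character and tag lists.

    Tags: B = first character of a multi-character word, M = middle,
    E = last character, S = single-character word.
    """
    chars, tags = [], []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return chars, tags


# Toy sample with Latin letters standing in for characters.
chars, tags = to_bmes(["ab", "c", "def"])
print(list(zip(chars, tags)))
```

The resulting character and tag pairs are the kind of input a recurrent encoder with a conditional random field output layer would typically consume.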
And step 104, applying the word segmentation model to the dialogue system, acquiring new dialogue source data according to the user behavior, and returning the newly acquired dialogue source data to the step 101.
The word segmentation model obtained in step 103 is applied to the dialogue system to obtain new dialogue source data, that is, a large amount of effective source data is generated. The newly obtained dialogue source data is returned to step 101 and used to optimize the word segmentation model, so that the effect of the dialogue system and the word segmentation accuracy of the model are continuously improved.
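The cycle formed by steps 101 to 104 can be summarized as a small driver loop. This is only a structural sketch: the four callables and the fixed `rounds` count are assumptions introduced for illustration, since the disclosure describes a continuing loop without prescribing an interface or a stopping rule.

```python
def run_mining_loop(source_data, select_effective, build_samples,
                    train, deploy_and_collect, rounds=3):
    """Run the step 101 to step 104 cycle a fixed number of times."""
    model = None
    for _ in range(rounds):
        effective = select_effective(source_data)  # step 101: filter source data
        samples = build_samples(effective)         # step 102: derive dialogue samples
        model = train(model, samples)              # step 103: build or optimize the model
        source_data = deploy_and_collect(model)    # step 104: gather new source data
    return model


# Stub callables that only count samples, to show the data flow.
final_model = run_mining_loop(
    source_data=[1, -1, 2],
    select_effective=lambda data: [x for x in data if x > 0],
    build_samples=lambda effective: effective,
    train=lambda model, samples: (model or 0) + len(samples),
    deploy_and_collect=lambda model: [1] * (model + 1),
    rounds=2,
)
print(final_model)
```

Passing the previous `model` into `train` reflects the point that step 103 covers both the initial construction and the later optimization of the word segmentation model.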
Fig. 2 is a flow chart of obtaining a dialog sample in an embodiment of the present disclosure. As shown in fig. 2, the dialogue sample is obtained by the following steps:
step 201, segmenting the dialogue content in the dialogue source data into N segments by means of collision alignment.
Wherein N is an integer greater than or equal to 1.
In some embodiments, the dialogue query and the system reply content are collision-aligned to find the largest common segment, and the dialogue query and the system reply content are segmented at the same time.
For example, the content of the dialogue query is "listen to the sea grass dance" and the system replies "listen to XX sea grass dance together". Collision alignment yields the largest common segment "sea grass dance", and "listen to the sea grass dance" is divided into "listen / sea grass dance".
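One way to read collision alignment is as a longest common substring search between the query and the reply, followed by a split of the query around that substring. The sketch below takes that reading as an assumption; the function names and the space-free toy strings are hypothetical stand-ins for the original example.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the match ending at i-1, j-1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def split_by_alignment(query, reply):
    """Split the query into segments around its largest common segment with the reply."""
    common = longest_common_substring(query, reply)
    if not common:
        return [query]
    start = query.index(common)
    parts = [query[:start], common, query[start + len(common):]]
    return [p for p in parts if p]


segments = split_by_alignment("listenSeaGrassDance", "listenXXSeaGrassDance")
print(segments)
```

Here the shared run "SeaGrassDance" is longer than the shared run "listen", so the query is cut just before it, mirroring the "listen / sea grass dance" split in the example.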
Step 202, segmenting the segments by utilizing a word segmentation mode to obtain word segmentation granularity.
And segmenting the segments to obtain word segmentation granularity. The word segmentation mode can adopt a word segmentation mode commonly used in the field.
In some embodiments, the word segmentation granularity includes one or more of a base word granularity and a shuffle word granularity. The base word granularity is the smallest unit segment with independent semantics. The shuffle word granularity is a combination of multiple granularities built on the base word granularity.
For example, when the segment "listen to the sea grass dance" is segmented, "listen" is a single word that belongs to the base word granularity and cannot be segmented further, while "sea grass dance" may be segmented into "sea grass / dance".
Step 203, correcting the boundary and granularity of the word segmentation granularity to obtain a dialogue sample.
In some embodiments, the step of correcting the boundary and granularity of the word segmentation granularity comprises: counting the alignment times, the segmented times and the independent search times of the word segmentation granularity; calculating the merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times and the proper nouns; calculating the segmented probability of the word segmentation granularity based on the alignment times and the segmented times; and correcting the word segmentation granularity according to the merging probability and the segmented probability to obtain a dialogue sample.
In some embodiments, the merging probability of the word segmentation granularity is obtained by the merging probability formula (1):
[Formula (1): equation image not reproduced]
where P_m is the merging probability, selfAlignFreq is the number of self-alignments of the word segmentation granularity, SplitFreq is the number of times the word segmentation granularity is segmented, searchFreq is the number of independent searches of the segment, isProperNoun indicates whether the segment is a proper noun, and f is the decision function.
In some embodiments, the segmented probability of the word segmentation granularity is obtained by the segmentation probability formula (2):
[Formula (2): equation image not reproduced]
where P_s is the segmented probability, SplitFreq is the number of times the word segmentation granularity is segmented, selfAlignFreq is the number of self-alignments of the word segmentation granularity, and f is the decision function.
It should be understood that the higher the number of alignments and the number of independent searches, the more tightly bound the word segmentation granularity is, and the granularities should be merged. If the segmented probability is high, a segmentation point (boundary) is determined.
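Because formulas (1) and (2) are not available in this text, the following sketch only illustrates the stated inputs and the qualitative behavior. The simple frequency ratios, the proper noun bonus of 0.1, the 0.5 threshold and all function names are assumptions and should not be taken as the decision function f of the disclosure.

```python
def merge_probability(self_align_freq, split_freq, search_freq, is_proper_noun):
    """Assumed stand-in for formula (1): share of evidence that favors merging."""
    total = self_align_freq + split_freq + search_freq
    if total == 0:
        return 0.0
    p = (self_align_freq + search_freq) / total
    if is_proper_noun:
        p = min(1.0, p + 0.1)  # assumed bonus for proper nouns
    return p


def split_probability(self_align_freq, split_freq):
    """Assumed stand-in for formula (2): share of occurrences that were segmented."""
    total = self_align_freq + split_freq
    return split_freq / total if total else 0.0


def decide(p_merge, p_split, threshold=0.5):
    """Choose an action for one word segmentation granularity."""
    if p_merge >= threshold and p_merge >= p_split:
        return "merge"
    if p_split >= threshold:
        return "split"
    return "keep"


p_m = merge_probability(self_align_freq=8, split_freq=1, search_freq=1, is_proper_noun=False)
p_s = split_probability(self_align_freq=8, split_freq=1)
print(decide(p_m, p_s))
```

With many alignments and independent searches the sketch favors merging, and with many observed segmentations it favors a boundary, which is the behavior described above; any real implementation would substitute the actual formulas.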
In the dialogue source data processing method provided by this embodiment, effective dialogue source data is obtained based on the dialogue source data; a dialogue sample is obtained by using the effective dialogue source data; a word segmentation model is constructed based on the dialogue sample; and the word segmentation model is applied to a dialogue system, new dialogue source data is obtained according to user behaviors, and the obtained dialogue source data is returned to further optimize the word segmentation model. Large-scale, high-precision dialogue samples are thus mined automatically and cyclically, and the word segmentation model is adaptively trained and improved, which improves word segmentation accuracy in different dialogue scenes and the user experience of the dialogue system.
In a second aspect, an embodiment of the present disclosure provides a dialog source data processing apparatus. The apparatus can be applied to an automatic voice dialogue system to accurately and automatically reply to the inquiry of a user. Fig. 3 is a functional block diagram of a dialog source data processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 3, a dialog source data processing apparatus provided in an embodiment of the present disclosure includes:
a valid data acquisition module 301, configured to acquire valid dialog source data based on the dialog source data.
The dialogue source data is the voice data generated during the inquiry and automatic reply process between the automatic voice dialogue system and the user. The effective dialogue source data is the dialogue source data for which the user is satisfied with the system reply. In some embodiments, the dialogue source data also includes the user's evaluation of dialogue satisfaction. The evaluation may be a simple rating of satisfied or dissatisfied, or may be a specific satisfaction score.
In some embodiments, the user evaluates the system reply after using the automatic voice dialogue system, and the dialogue source data that the user rates as satisfactory, or whose satisfaction score is higher than a preset score value, is used as the effective dialogue source data. The effective dialogue source data is obtained by obtaining the dialogue satisfaction from the dialogue source data and extracting the dialogue source data whose dialogue satisfaction is greater than the satisfaction threshold. The satisfaction threshold may be set arbitrarily; for example, it may be set according to the data amount of the dialogue source data. It will be appreciated that the higher the satisfaction threshold, the less effective dialogue source data is obtained, and the lower the threshold, the more effective dialogue source data is obtained. In practical applications, the satisfaction threshold may be set according to the performance of the hardware.
In some embodiments, the valid data acquisition module 301 includes a satisfaction acquisition unit for obtaining the dialog satisfaction from the dialog source data, and an extraction unit. The extraction unit is used for extracting the dialogue source data with the dialogue satisfaction degree larger than the satisfaction degree threshold value as effective dialogue source data.
a sample obtaining module 302, configured to obtain the dialogue sample by using the effective dialogue source data.
The dialogue content is segmented by collision alignment to obtain segments; the segments are then segmented by a word segmentation method to obtain the word segmentation granularity, thereby obtaining a dialogue sample. In some embodiments, the boundary segmentation probability and the merging probability of a segment are statistically analyzed, and the boundary and the word segmentation granularity of the segment are then corrected, thereby obtaining more accurate dialogue samples.
a model building module 303, configured to build a word segmentation model based on the dialogue sample.
The word segmentation model may be one or more sequence labeling models for word segmentation. In some embodiments, the sequence labeling model is constructed from the dialogue sample by a gated recurrent model and a conditional random field model.
a source data obtaining module 304, configured to apply the word segmentation model to the dialogue system, obtain dialogue source data again according to the user behavior, and return the obtained dialogue source data to the valid data acquisition module.
The word segmentation model obtained by the model building module 303 is applied to the dialogue system, and new dialogue source data is obtained again according to the user behavior, that is, a large amount of dialogue source data is obtained. The model is optimized with the new dialogue source data, so that the effect of the dialogue system and the word segmentation accuracy are continuously improved.
In some embodiments, the model building module 303 builds the sequence labeling model using the dialogue samples by a gated recurrent model and a conditional random field model.
As shown in fig. 4, in some embodiments, the sample acquisition module 400 includes:
a dividing unit 401, configured to divide the dialogue content in the dialogue source data into N segments by collision alignment, where N is an integer greater than or equal to 1;
a word segmentation unit 402, configured to segment the segments to obtain the word segmentation granularity.
The segments are segmented by a word segmentation method to obtain the word segmentation granularity. The word segmentation method may be one commonly used in the field.
In some embodiments, the word segmentation granularity includes one or more of a base word granularity and a shuffle word granularity. The base word granularity is the smallest unit segment with independent semantics. The shuffle word granularity is a combination of multiple granularities built on the base word granularity.
a sample obtaining unit 403, configured to correct the boundary and the granularity of the word segmentation granularity to obtain a dialogue sample.
In some embodiments, the sample obtaining unit 403 includes a statistics subunit, a merging probability calculating subunit, a segmented probability calculating subunit, and a word segmentation granularity correcting subunit, where the statistics subunit is configured to count the number of times of alignment, the number of times of segmentation, and the number of times of independent search of the word segmentation granularity; a merging probability calculating subunit, configured to calculate merging probabilities of the word segmentation granularity based on the alignment times, the segmented times, the independent search times, and the proper nouns; a segmented probability calculating subunit, configured to calculate a segmented probability of the word segmentation granularity based on the alignment times and the segmented times; and the word segmentation granularity correction subunit is used for correcting the word segmentation granularity according to the merging probability and the segmented probability so as to obtain the conversation sample.
The merging probability calculating subunit obtains the merging probability by the merging probability formula (1):
[Formula (1): image BDA0002389013670000101, not reproduced in this text]
where P_m is the merging probability, selfAlignFreq is the number of self-alignments of the word segmentation granularity, SplitFreq is the number of times the word segmentation granularity is segmented, searchFreq is the number of times the segment is independently searched, isProperNoun indicates whether the word segmentation granularity is a proper noun, and f is the decision function.
The segmented probability calculating subunit obtains the segmented probability by the segmented probability formula (2):
[Formula (2): image BDA0002389013670000102, not reproduced in this text]
where P_s is the segmented probability, SplitFreq is the number of times the word segmentation granularity is segmented, selfAlignFreq is the number of self-alignments of the word segmentation granularity, and f is the decision function.
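Formulas (1) and (2) are available only as images in the source, so their exact form is not known here. The sketch below merely shows how the counted statistics could feed two scores and a correction decision. The ratio form, the proper-noun bonus `noun_bonus`, the clamp used as the decision function `f`, and the tie-breaking rule in `correct` are all assumptions made for illustration and should not be read as the patented formulas.

```python
from dataclasses import dataclass


@dataclass
class GranularityStats:
    self_align_freq: int   # times the granularity aligned as a whole
    split_freq: int        # times the granularity was segmented
    search_freq: int       # times it was searched on its own
    is_proper_noun: bool   # whether it is a proper noun


def f(x: float) -> float:
    """Hypothetical decision function: clamp a score into [0, 1]."""
    return max(0.0, min(1.0, x))


def merge_probability(s: GranularityStats, noun_bonus: float = 0.2) -> float:
    """Assumed form: share of 'kept whole' evidence plus a proper-noun bonus."""
    whole = s.self_align_freq + s.search_freq
    total = whole + s.split_freq
    if total == 0:
        return 0.0
    return f(whole / total + (noun_bonus if s.is_proper_noun else 0.0))


def split_probability(s: GranularityStats) -> float:
    """Assumed form: share of observations in which it was segmented."""
    total = s.self_align_freq + s.split_freq
    if total == 0:
        return 0.0
    return f(s.split_freq / total)


def correct(s: GranularityStats) -> str:
    """Keep the granularity merged or split it, whichever score is higher."""
    return "merge" if merge_probability(s) >= split_probability(s) else "split"
```

Under these assumed forms, a granularity that is often aligned or searched as a whole tends toward "merge", while one that is segmented in most observations tends toward "split".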
In the dialogue source data processing apparatus provided in this embodiment, the effective data acquisition module obtains effective dialogue source data from the dialogue source data; the sample acquisition module obtains a dialogue sample using the effective dialogue source data; the model building module constructs a word segmentation model based on the dialogue sample; and the source data acquisition module applies the word segmentation model to the dialogue system and obtains new dialogue source data according to user behavior. Mining proceeds automatically and cyclically in this way, yielding large-scale, high-precision dialogue samples with which the dialogue system is adaptively trained and improved, so that word segmentation accuracy is improved.
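The closed loop described above can be sketched as follows. All names here (`Dialogue`, `filter_valid`, `run_cycle`, the `build_samples`, `train`, and `collect` callables, the `rounds` limit, and the threshold value) are hypothetical placeholders. Only the four-step order and the satisfaction-threshold filter of the method come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Dialogue:
    content: str
    satisfaction: float  # dialogue satisfaction derived from user behavior


def filter_valid(source: List[Dialogue], threshold: float) -> List[Dialogue]:
    """Step 101: keep dialogues whose satisfaction exceeds the threshold."""
    return [d for d in source if d.satisfaction > threshold]


def run_cycle(
    source: List[Dialogue],
    build_samples: Callable[[List[Dialogue]], list],
    train: Callable[[list], object],
    collect: Callable[[object], List[Dialogue]],
    rounds: int = 3,
    threshold: float = 0.8,
):
    """Repeat steps 101-104 for a fixed number of rounds.

    The fixed round count is an assumption; the text does not state a
    stopping rule for the cycle.
    """
    model = None
    for _ in range(rounds):
        valid = filter_valid(source, threshold)   # step 101
        samples = build_samples(valid)            # step 102
        model = train(samples)                    # step 103
        source = collect(model)                   # step 104: new source data
    return model
```

Passing the sample builder, trainer, and collector in as callables keeps the sketch independent of any particular segmentation model or dialogue system.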
In a third aspect, referring to fig. 5, an embodiment of the present disclosure provides an electronic device, including:
one or more processors 501;
a memory 502 on which one or more programs are stored, the one or more programs, when executed by the one or more processors, causing the one or more processors to implement any one of the dialogue source data processing methods described above;
one or more I/O interfaces 503 coupled between the processor and the memory and configured to enable information interaction between the processor and the memory.
The processor 501 is a device with data processing capability, and includes but is not limited to a Central Processing Unit (CPU) and the like; memory 502 is a device having data storage capabilities including, but not limited to, random access memory (RAM, more specifically SDRAM, DDR, etc.), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), FLASH memory (FLASH); an I/O interface (read/write interface) 503 is connected between the processor 501 and the memory 502, and can realize information interaction between the processor 501 and the memory 502, which includes but is not limited to a data Bus (Bus) and the like.
In some embodiments, the processor 501, memory 502, and I/O interface 503 are connected to each other and to other components of the computing device by a bus.
In a fourth aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the dialogue source data processing methods described above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims (15)

1. A conversation source data processing method, comprising:
step 101, obtaining effective dialogue source data based on dialogue source data;
step 102, acquiring a dialogue sample by using the effective dialogue source data;
step 103, constructing a word segmentation model based on the dialogue sample;
and step 104, applying the word segmentation model to a dialogue system, acquiring new dialogue source data according to user behavior, and returning to step 101 with the newly acquired dialogue source data.
2. The method of claim 1, wherein said obtaining effective dialogue source data based on dialogue source data comprises:
obtaining conversation satisfaction from the conversation source data;
and extracting the dialogue source data with the dialogue satisfaction degree larger than the satisfaction degree threshold value as the effective dialogue source data.
3. The method of claim 1, wherein said acquiring a dialogue sample by using the effective dialogue source data comprises:
segmenting conversation content in the effective conversation source data into N fragments in a collision alignment mode; wherein N is an integer greater than or equal to 1;
segmenting the segments to obtain word segmentation granularity;
and correcting the boundary and granularity of the word segmentation granularity to obtain a conversation sample.
4. The method of claim 3, wherein the segment is a largest common segment.
5. The method of claim 3, wherein the word segmentation granularity comprises one or more of a base word granularity and a shuffle word granularity.
6. The method of claim 3, wherein the correcting the boundary and granularity of the word segmentation granularity to obtain a conversation sample comprises:
counting the alignment times, the segmented times and the independent search times of the word segmentation granularity;
calculating a merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times and proper nouns;
calculating the segmentation probability of the word segmentation granularity based on the alignment times and the segmentation times;
and correcting the word segmentation granularity according to the merging probability and the segmented probability to obtain the dialogue sample.
7. The method of claim 1, wherein the constructing a word segmentation model based on the conversational sample comprises:
and constructing a sequence labeling model from the dialogue sample by means of a gated recurrent unit (GRU) model and a conditional random field (CRF) model.
8. The method of claim 7, wherein the sequence labeling model comprises a base word model and a shuffle word model.
9. A conversation source data processing apparatus, comprising:
the effective data acquisition module is used for acquiring effective conversation source data based on the conversation source data;
the sample acquisition module is used for acquiring a conversation sample by utilizing the effective conversation source data;
the model building module is used for building a word segmentation model based on the dialogue sample;
and the source data acquisition module is used for applying the word segmentation model to a dialogue system, acquiring new dialogue source data according to user behavior and returning the new dialogue source data to the effective data acquisition module.
10. The apparatus of claim 9, wherein the valid data acquisition module comprises:
a satisfaction acquiring unit for acquiring a conversation satisfaction from the conversation source data;
and the extracting unit is used for extracting the dialogue source data with the dialogue satisfaction degree larger than the satisfaction degree threshold value as the effective dialogue source data.
11. The apparatus of claim 9, wherein the sample acquisition module comprises:
the segmentation unit is used for segmenting the conversation content in the conversation source data into N fragments in a collision alignment mode; wherein N is an integer greater than or equal to 1;
the word segmentation unit is used for segmenting the fragments to obtain word segmentation granularity;
and the sample acquisition unit is used for correcting the boundary and the granularity of the word segmentation granularity to obtain a conversation sample.
12. The apparatus of claim 11, wherein the sample acquisition unit comprises:
the counting subunit is used for counting the alignment times, the segmented times and the independent search times of the word segmentation granularity;
a merging probability calculating subunit, configured to calculate a merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times, and the proper nouns;
a segmented probability calculating subunit, configured to calculate a segmented probability of the word segmentation granularity based on the alignment times and the segmented times;
and the word segmentation granularity correction subunit is used for correcting the word segmentation granularity according to the merging probability and the segmented probability to obtain the conversation sample.
13. The apparatus of claim 9, wherein the model building module constructs a sequence labeling model from the dialogue sample by means of a gated recurrent unit (GRU) model and a conditional random field (CRF) model.
14. An electronic device, comprising:
one or more processors;
a memory having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8;
one or more I/O interfaces connected between the processor and the memory and configured to enable information interaction between the processor and the memory.
15. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202010107942.5A 2020-02-21 2020-02-21 Dialogue source data processing method and device, electronic equipment and computer readable medium Active CN111339767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107942.5A CN111339767B (en) 2020-02-21 2020-02-21 Dialogue source data processing method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010107942.5A CN111339767B (en) 2020-02-21 2020-02-21 Dialogue source data processing method and device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN111339767A (en) 2020-06-26
CN111339767B (en) 2023-07-21

Family

ID=71185416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107942.5A Active CN111339767B (en) 2020-02-21 2020-02-21 Dialogue source data processing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN111339767B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030009339A1 (en) * 2001-07-03 2003-01-09 Yuen Michael S. Method and apparatus for improving voice recognition performance in a voice application distribution system
CN105930432A (en) * 2016-04-19 2016-09-07 北京百度网讯科技有限公司 Training method and apparatus for sequence labeling tool
CN108038725A (en) * 2017-12-04 2018-05-15 中国计量大学 A kind of electric business Customer Satisfaction for Product analysis method based on machine learning
CN108717409A (en) * 2018-05-16 2018-10-30 联动优势科技有限公司 A kind of sequence labelling method and device
CN108874967A (en) * 2018-06-07 2018-11-23 腾讯科技(深圳)有限公司 Dialogue state determines method and device, conversational system, terminal, storage medium
CN108898435A (en) * 2018-06-28 2018-11-27 北京京东尚科信息技术有限公司 Session data processing method and system, computer system and readable storage medium storing program for executing
KR20190004486A (en) * 2017-07-04 2019-01-14 조희정 Method for training conversation using dubbing/AR
CN109657056A (en) * 2018-11-14 2019-04-19 金色熊猫有限公司 Target sample acquisition methods, device, storage medium and electronic equipment
US20190138597A1 (en) * 2017-07-28 2019-05-09 Nia Marcia Maria Dowell Computational linguistic analysis of learners' discourse in computer-mediated group learning environments
CN109783623A (en) * 2018-12-25 2019-05-21 华东师范大学 The data analysing method of user and customer service dialogue under a kind of real scene
MX2018001930A (en) * 2018-02-15 2019-08-16 Centro De Investigacion Y De Estudios Avanzados Del Instituto Politecnico Nac Facial and voice recognition system in collaborative groups for the search of designated multimedia content.
CN110298391A (en) * 2019-06-12 2019-10-01 同济大学 A kind of iterative increment dialogue intention classification recognition methods based on small sample
CN110647617A (en) * 2019-09-29 2020-01-03 百度在线网络技术(北京)有限公司 Training sample construction method of dialogue guide model and model generation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN X ET AL: "Long short-term memory neural network for Chinese word segmentation", 《PROCEEDINGS OF THE 2015 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 *
李超: "Research on Chinese word segmentation methods based on deep learning", 《Wanfang Dissertation Database》 *

Also Published As

Publication number Publication date
CN111339767B (en) 2023-07-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210520

Address after: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant after: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

Applicant after: Shanghai Xiaodu Technology Co.,Ltd.

Address before: 100085 Baidu Building, 10 Shangdi Tenth Street, Haidian District, Beijing

Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.

GR01 Patent grant