CN111339767A - Conversation source data processing method and device, electronic equipment and computer readable medium

Info

Publication number: CN111339767A
Application number: CN202010107942.5A
Authority: CN
Inventors: 翟周伟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Shanghai Xiaodu Technology Co Ltd
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2020-06-26
Anticipated expiration: 2040-02-21
Also published as: CN111339767B

Abstract

The present disclosure provides a dialog source data processing method, which includes: step 101, obtaining valid dialog source data based on dialog source data; step 102, obtaining a dialog sample by using the valid dialog source data; step 103, constructing a word segmentation model based on the dialog sample; and step 104, applying the word segmentation model to a dialog system, acquiring new dialog source data according to user behavior, and returning the newly acquired dialog source data to step 101. The method obtains large-scale, high-precision dialog samples and adaptively improves the dialog system, so that word segmentation accuracy is improved. The disclosure also provides a dialog source data processing device, an electronic device and a computer readable medium.

Description

Conversation source data processing method and device, electronic equipment and computer readable medium

Technical Field

Embodiments of the present disclosure relate to the technical field of information processing, and in particular to a dialog source data processing method and device, an electronic device and a computer readable medium.

Background

Natural Language Processing (NLP) technology recognizes the voice content of a user through a voice recognition module, analyzes the voice content, and finally obtains the semantics corresponding to the voice content, thereby facilitating communication between a person and a machine.

Analyzing the speech involves segmenting the speech content input by the user into words by using labeled corpora. Word segmentation methods generally include the dictionary matching method, the statistical method based on an N-Gram language model combined with Dynamic Programming (DP), and the sequence labeling method. The dictionary matching method mines an independent basic word dictionary and then segments by forward maximum matching or reverse maximum matching. This method has difficulty resolving semantic and boundary ambiguity, so its precision is poor. The N-Gram and DP method calculates segmentation probabilities from N-Gram statistics and an N-Gram dictionary, but its boundary segmentation of predicates and proper nouns is not ideal in a dialog scene. For example, for "enlarged pessimism", the ideal segmentation is the predicate "put" followed by the proper noun "pessimism", but the phrase is easily cut into the predicate "enlarged" and the proper noun "pessimism". The sequence labeling method segments words with a sequence labeling model, but large-scale, high-precision samples are difficult to obtain, so the segmentation precision is difficult to improve.
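The forward maximum matching mentioned above can be sketched as follows. This is a minimal illustration and not part of the disclosure: the function name `forward_max_match`, the `max_len` parameter and the toy dictionary (letters standing in for characters) are all assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation: at each position take the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary word matches.
        # A single character is always accepted so the loop makes progress.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; letters stand in for characters.
toy_dictionary = {"ab", "abc", "cd", "d"}
print(forward_max_match("abcd", toy_dictionary))  # ['abc', 'd']
```

The greedy choice of "abc" hides the alternative reading "ab" + "cd", which is the kind of boundary ambiguity that dictionary matching cannot resolve on its own.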

Disclosure of Invention

The embodiment of the disclosure provides a conversation source data processing method and device, electronic equipment and a computer readable medium.

In a first aspect, an embodiment of the present disclosure provides a method for processing dialog source data, including:

step S101, obtaining valid dialog source data based on dialog source data;

step S102, obtaining a dialog sample by using the valid dialog source data;

step S103, constructing a word segmentation model based on the dialog sample;

and step S104, applying the word segmentation model to a dialog system, acquiring new dialog source data according to user behavior, and returning the new dialog source data to step S101.

In some embodiments, the obtaining valid dialog source data from the dialog source data includes:

obtaining conversation satisfaction from the conversation source data;

and extracting the dialog source data whose dialog satisfaction is greater than a satisfaction threshold as the valid dialog source data.

In some embodiments, the obtaining a dialog sample by using the valid dialog source data includes:

segmenting dialog content in the dialog source data into N segments by collision alignment, wherein N is an integer greater than or equal to 1;

segmenting the segments into words to obtain word segmentation granularity;

and correcting the boundary and granularity of the word segmentation granularity to obtain a dialog sample.

In some embodiments, the segment is the largest common segment.

In some embodiments, the word segmentation granularity includes one or more of a base word granularity and a shuffle word granularity.

In some embodiments, the modifying the boundaries and the granularity of the word segmentation granularity to obtain a dialog sample includes:

counting the alignment times, the segmented times and the independent search times of the word segmentation granularity;

calculating a merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times and proper nouns;

calculating a segmented probability of the word segmentation granularity based on the alignment times and the segmented times;

and modifying the word segmentation granularity according to the merging probability and the segmented probability to obtain the dialogue sample.

In some embodiments, said constructing a word segmentation model based on said dialogue sample comprises:

constructing a sequence labeling model from the dialog sample by a gated recurrent model and a conditional random field model.

In some embodiments, the sequence labeling model comprises a base word model and a shuffle word model.

In a second aspect, an embodiment of the present disclosure provides a dialog source data processing apparatus, which includes:

a valid data acquisition module, configured to obtain valid dialog source data based on dialog source data;

a sample acquisition module, configured to obtain a dialog sample by using the valid dialog source data;

a model building module, configured to construct a word segmentation model based on the dialog sample;

and a source data acquisition module, configured to apply the word segmentation model to a dialog system, acquire new dialog source data according to user behavior, and return the new dialog source data to the valid data acquisition module.

In some embodiments, the valid data acquisition module comprises:

a satisfaction acquiring unit for acquiring a conversation satisfaction from the conversation source data;

and the extracting unit is used for extracting the dialogue source data with the dialogue satisfaction degree larger than the satisfaction degree threshold value as the effective dialogue source data.

In some embodiments, the sample acquisition module comprises:

the segmentation unit is used for segmenting the conversation content in the conversation source data into N fragments in a collision alignment mode; wherein N is an integer greater than or equal to 1;

the word segmentation unit is used for segmenting the fragments to obtain word segmentation granularity;

and the sample acquisition unit is used for correcting the boundary and the granularity of the word segmentation granularity to obtain a conversation sample.

In some embodiments, the sample acquiring unit comprises:

the counting subunit is used for counting the alignment times, the segmented times and the independent search times of the word segmentation granularity;

a merging probability calculating subunit, configured to calculate a merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times, and the proper nouns;

a segmented probability calculating subunit, configured to calculate a segmented probability of the word segmentation granularity based on the alignment times and the segmented times;

and the word segmentation granularity correction subunit is used for correcting the word segmentation granularity according to the merging probability and the segmented probability to obtain the conversation sample.

In some embodiments, the model building module builds a sequence labeling model from the dialog sample by a gated recurrent model and a conditional random field model.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

one or more processors;

a memory having one or more programs stored thereon that, when executed by the one or more processors, cause the one or more processors to perform any of the above-described dialog source data processing methods;

one or more I/O interfaces connected between the processor and the memory and configured to enable information interaction between the processor and the memory.

In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which when executed by a processor implements any of the above-mentioned dialog source data processing methods.

In the dialog source data processing method provided by the embodiments of the present disclosure, step 101 obtains valid dialog source data based on dialog source data; step 102 obtains a dialog sample by using the valid dialog source data; step 103 constructs a word segmentation model based on the dialog sample; and step 104 applies the word segmentation model to a dialog system, obtains new dialog source data according to user behavior, and returns the new dialog source data to step 101. In this way, large-scale, high-precision dialog samples are mined automatically and cyclically, and the word segmentation model is trained and improved adaptively, which improves word segmentation accuracy in different dialog scenes and the user experience of the dialog system.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure.

The above and other features and advantages will become more apparent to those skilled in the art by describing in detail exemplary embodiments thereof with reference to the attached drawings, in which:

Fig. 1 is a flowchart of a dialog source data processing method according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of the step of obtaining a dialog sample in a dialog source data processing method according to an embodiment of the present disclosure;

Fig. 3 is a schematic block diagram of a dialog source data processing apparatus according to an embodiment of the present disclosure;

Fig. 4 is a schematic block diagram of a sample acquisition module in the dialog source data processing apparatus according to an embodiment of the present disclosure;

Fig. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present disclosure, the following describes in detail a dialog source data processing method and apparatus, an electronic device, and a computer readable medium provided in the present disclosure with reference to the accompanying drawings.

Example embodiments will be described more fully hereinafter with reference to the accompanying drawings; however, they may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.

As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Word segmentation is a fundamental problem of lexical analysis in natural language processing; its difficulty lies in segmenting ambiguous boundaries and selecting the segmentation granularity. Current word segmentation methods target non-dialog scenes, and boundary errors between predicates and proper nouns easily occur in dialog scenes, particularly those based on artificial intelligence voice assistants. For example, in a dialog scene, part of "please turn up an air conditioner by one" is easily mis-segmented as "by one", and "leaving cat" is easily mis-segmented as "leave school". One reason for inaccurate word segmentation is that high-precision samples are difficult to obtain, so model training is deficient, which affects the boundaries and granularity of the segmentation. To address inaccurate word segmentation, the embodiments provide a dialog source data processing method and device.

In a first aspect, an embodiment of the present disclosure provides a method for processing dialog source data. The dialogue source data processing method can be applied to an automatic voice dialogue system to accurately and automatically reply to the inquiry of a user. Fig. 1 is a flowchart of a dialog source data processing method according to an embodiment of the present disclosure. Referring to fig. 1, the dialog source data processing method includes:

Step 101, obtaining valid dialog source data based on the dialog source data.

The dialog source data may come from an automatic voice dialog system. The automatic voice dialog system analyzes the voice of a user query, recognizes the intention of the user, and automatically replies to the query. This process produces dialog source data that includes user queries, intentions, and system replies.

The valid dialog source data is the dialog source data for which the user is satisfied with the system reply. In some embodiments, the dialog source data also includes the user's rating of dialog satisfaction. The rating may be a simple satisfied/dissatisfied rating or a specific satisfaction score.

In some embodiments, after the user uses the automatic voice dialog system, the user rates the system reply; the dialog source data that the user rates as satisfactory, or whose satisfaction score is higher than a preset score, is used as the valid dialog source data.

Specifically, the dialog satisfaction is obtained from the dialog source data, and the dialog source data whose dialog satisfaction is greater than a satisfaction threshold is extracted as the valid dialog source data. The satisfaction threshold may be set arbitrarily; for example, it may be set according to the data amount of the dialog source data. It will be appreciated that the greater the satisfaction threshold, the less valid dialog source data is obtained, and the smaller the threshold, the more valid dialog source data is obtained. In practical applications, the satisfaction threshold may also be set according to the performance of the hardware.
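The threshold-based extraction of step 101 can be sketched as follows. This is only an illustrative sketch: the `DialogRecord` type, its field names, the numeric score scale and the threshold value are assumptions and are not specified by the disclosure.

```python
from dataclasses import dataclass


@dataclass
class DialogRecord:
    """Hypothetical container for one piece of dialog source data."""
    query: str           # user query
    intent: str          # recognized intention
    reply: str           # system reply
    satisfaction: float  # user's satisfaction score (scale is an assumption)


def select_valid_records(records, threshold):
    """Keep only records whose satisfaction is strictly greater than the threshold."""
    return [r for r in records if r.satisfaction > threshold]


records = [
    DialogRecord("listen to the sea grass dance", "play_music", "ok", 0.9),
    DialogRecord("turn up the air conditioner", "device_control", "sorry?", 0.2),
]
valid = select_valid_records(records, threshold=0.5)
print(len(valid))  # 1
```

Raising `threshold` shrinks the returned list and lowering it enlarges the list, which matches the trade-off described above.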

Step 102, obtaining a dialog sample by using the valid dialog source data.

The dialog content is segmented into segments by collision alignment; the segments are then segmented by a word segmentation method to obtain the word segmentation granularity, thereby obtaining a dialog sample. In some embodiments, the boundary segmentation probability and the merging probability of a segment are statistically analyzed, and the boundary and word segmentation granularity of the segment are then corrected, thereby obtaining a more accurate dialog sample.

Step 103, constructing a word segmentation model based on the dialog sample.

The word segmentation model may be one or more sequence labeling models for word segmentation. In some embodiments, the sequence labeling model is constructed from the dialog sample by a gated recurrent model and a conditional random field model.

It should be noted that, in the embodiments of the present disclosure, constructing a word segmentation model based on the dialog sample covers both initially building the word segmentation model and optimizing it. For example, when the word segmentation model is established for the first time, it is built based on the dialog sample; when the word segmentation model already exists, it is optimized based on the dialog sample.

In some embodiments, the sequence labeling model may include a base word model and a shuffle word model.
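To train a sequence labeling model, a segmented dialog sample is usually converted into one label per character. The sketch below uses the common B/M/E/S tagging scheme as an assumption; the disclosure does not name a tag set, and the function name and the toy sample (letters standing in for characters) are hypothetical.

```python
def segments_to_bmes(segments):
    """Label each character as B(egin), M(iddle) or E(nd) of a word, or S(ingle)."""
    labeled = []
    for word in segments:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            labeled.append((word, "S"))
            continue
        labeled.append((word[0], "B"))
        labeled.extend((ch, "M") for ch in word[1:-1])
        labeled.append((word[-1], "E"))
    return labeled


# Hypothetical segmented sample; letters stand in for characters.
sample = ["a", "bcd", "ef"]
print(segments_to_bmes(sample))
# [('a', 'S'), ('b', 'B'), ('c', 'M'), ('d', 'E'), ('e', 'B'), ('f', 'E')]
```

Character and label pairs of this kind are a typical input format for a recurrent encoder followed by a conditional random field layer.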

Step 104, applying the word segmentation model to the dialog system, acquiring new dialog source data according to user behavior, and returning the newly acquired dialog source data to step 101.

The word segmentation model obtained in step 103 is applied to the dialog system to obtain new dialog source data, which generates a large amount of valid source data. The newly obtained dialog source data is returned to step 101 and used to optimize the word segmentation model, so that the effect of the dialog system and the accuracy of the word segmentation model both improve continuously.
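The cycle formed by steps 101 to 104 can be sketched as a simple loop. This is a schematic sketch under stated assumptions: the function `run_mining_loop`, its callable parameters and the fixed number of rounds are hypothetical, and the lambdas in the usage example are trivial stand-ins rather than real filtering, sampling or training logic.

```python
def run_mining_loop(initial_source, select_valid, build_samples,
                    train_model, collect_new_source, rounds):
    """Repeat steps 101-104 for a fixed number of rounds and return the final model."""
    source = initial_source
    model = None
    for _ in range(rounds):
        valid = select_valid(source)         # step 101: keep valid dialog source data
        samples = build_samples(valid)       # step 102: derive dialog samples
        model = train_model(model, samples)  # step 103: build first, optimize afterwards
        source = collect_new_source(model)   # step 104: deploy and gather new source data
    return model


# Toy stand-ins: a "model" is just the list of samples it has seen so far.
model = run_mining_loop(
    initial_source=[("q1", 0.9), ("q2", 0.2)],
    select_valid=lambda src: [s for s in src if s[1] > 0.5],
    build_samples=lambda valid: [q for q, _ in valid],
    train_model=lambda m, samples: (m or []) + samples,
    collect_new_source=lambda m: [("q3", 0.8)],
    rounds=2,
)
print(model)  # ['q1', 'q3']
```

Passing `model` back into `train_model` reflects the point made above that step 103 builds the model on the first pass and optimizes the existing model on later passes.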

Fig. 2 is a flow chart of obtaining a dialog sample in an embodiment of the present disclosure. As shown in fig. 2, the dialogue sample is obtained by the following steps:

Step 201, segmenting the dialog content in the dialog source data into N segments by collision alignment.

N is an integer greater than or equal to 1.

In some embodiments, the dialog query and the system reply content are collision-aligned to hit the largest common segment, and the dialog query and the system reply content are segmented accordingly.

For example, the content of the dialog query is "listen to the sea grass dance" and the system reply is "listen to XX sea grass dance together". Collision alignment yields the largest common segment "sea grass dance", and the query is divided into "listen / sea grass dance".
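One way to read collision alignment is as finding the longest contiguous run shared by the query and the reply and cutting the query around it. The sketch below takes that reading as an assumption; it operates on token lists, uses the standard library's `difflib.SequenceMatcher`, and the function name `collision_align` is hypothetical.

```python
from difflib import SequenceMatcher


def collision_align(query_tokens, reply_tokens):
    """Split the query around its longest contiguous run shared with the reply.

    Returns (segments, common): the query cut into N >= 1 segments, and the shared run.
    """
    matcher = SequenceMatcher(None, query_tokens, reply_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(query_tokens), 0, len(reply_tokens))
    if match.size == 0:
        # Nothing in common: the whole query stays as a single segment (N = 1).
        return [list(query_tokens)], []
    start, end = match.a, match.a + match.size
    common = list(query_tokens[start:end])
    pieces = [query_tokens[:start], common, query_tokens[end:]]
    segments = [list(p) for p in pieces if p]
    return segments, common


query = ["listen", "sea", "grass", "dance"]
reply = ["listen", "XX", "sea", "grass", "dance", "together"]
segments, common = collision_align(query, reply)
print(segments)  # [['listen'], ['sea', 'grass', 'dance']]
print(common)    # ['sea', 'grass', 'dance']
```

For unsegmented text, the same function can be called with lists of characters instead of word tokens.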

Step 202, segmenting the segments by a word segmentation method to obtain word segmentation granularity.

The word segmentation method may be any word segmentation method commonly used in the field.

In some embodiments, the word segmentation granularity includes one or more of a base word granularity and a shuffle word granularity. The base word granularity is the smallest unit segment with independent semantics. The shuffle word granularity is a combination of multiple granularities built on the base word granularity.

For example, when the segments of "listen / sea grass dance" are segmented, "listen" is a single word at the base word granularity and cannot be segmented further, while "sea grass dance" may be segmented into "sea grass / dance".
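The relation between the two granularities can be illustrated by enumerating combinations of adjacent base words. This is one possible reading of "a combination of multiple granularities", offered as an assumption; the function name `shuffle_candidates` and the choice to join words with a space are hypothetical.

```python
def shuffle_candidates(base_words):
    """Return every contiguous run of two or more base words, joined with a space."""
    spans = []
    n = len(base_words)
    for start in range(n):
        for end in range(start + 2, n + 1):
            spans.append(" ".join(base_words[start:end]))
    return spans


base_words = ["listen", "sea grass", "dance"]
print(shuffle_candidates(base_words))
# ['listen sea grass', 'listen sea grass dance', 'sea grass dance']
```

Which of these candidates should actually be kept as a unit is decided later, when the boundary and granularity are corrected.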

Step 203, correcting the boundary and granularity of the word segmentation granularity to obtain a dialog sample.

In some embodiments, the step of correcting the boundary and granularity of the word segmentation granularity comprises: counting the alignment times, the segmented times and the independent search times of the word segmentation granularity; calculating the merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times and the proper nouns; calculating the segmented probability of the word segmentation granularity based on the alignment times and the segmented times; and correcting the word segmentation granularity according to the merging probability and the segmented probability to obtain a dialog sample.

In some embodiments, the merging probability of the word segmentation granularity is obtained by the merging probability formula (1):

P_m = f(selfAlignFreq, splitFreq, searchFreq, isProperNoun)    (1)

where P_m is the merging probability, selfAlignFreq is the number of self-alignments of the word segmentation granularity, splitFreq is the number of times the word segmentation granularity is segmented, searchFreq is the number of independent searches for the segment, isProperNoun indicates whether the segment is a proper noun, and f is the decision function.

In some embodiments, the segmented probability of the word segmentation granularity is obtained by the segmented probability formula (2):

P_s = f(splitFreq, selfAlignFreq)    (2)

where P_s is the segmented probability, splitFreq is the number of times the word segmentation granularity is segmented, selfAlignFreq is the number of self-alignments of the word segmentation granularity, and f is the decision function.

It should be understood that the higher the alignment times and the independent search times, the more cohesive the word segmentation granularity is, and the granularities should be merged. If the segmented probability is high, the position is determined to be a segmentation point (boundary).
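The disclosure leaves the decision function f unspecified, so the sketch below substitutes simple frequency ratios purely for illustration. The ratio forms, the `noun_bonus` parameter, the 0.5 threshold and all function names are assumptions, not the patented formulas; they only reproduce the qualitative behavior described above.

```python
def merge_probability(self_align_freq, split_freq, search_freq, is_proper_noun,
                      noun_bonus=0.1):
    """Hypothetical f for P_m: share of evidence that the unit stays together."""
    support = self_align_freq + search_freq
    total = support + split_freq
    if total == 0:
        return 0.0
    p = support / total
    if is_proper_noun:
        p = min(1.0, p + noun_bonus)  # assumed boost for proper nouns
    return p


def split_probability(split_freq, self_align_freq):
    """Hypothetical f for P_s: share of occurrences in which the unit was segmented."""
    total = split_freq + self_align_freq
    return split_freq / total if total else 0.0


def correct_granularity(p_merge, p_split, threshold=0.5):
    """Decide whether to merge, split, or keep the current granularity."""
    if p_merge >= threshold and p_merge >= p_split:
        return "merge"
    if p_split >= threshold:
        return "split"
    return "keep"


p_m = merge_probability(self_align_freq=8, split_freq=2, search_freq=10,
                        is_proper_noun=False)
p_s = split_probability(split_freq=2, self_align_freq=8)
print(correct_granularity(p_m, p_s))  # merge
```

With many alignments and independent searches and few splits, the unit is merged; reversing the counts would push the decision toward a segmentation point.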

In the dialog source data processing method provided by this embodiment, valid dialog source data is obtained based on the dialog source data; a dialog sample is obtained by using the valid dialog source data; a word segmentation model is constructed based on the dialog sample; and the word segmentation model is applied to a dialog system, new dialog source data is obtained according to user behavior, and the obtained dialog source data is returned to further optimize the word segmentation model. In this way, large-scale, high-precision dialog samples are mined automatically and cyclically, and the word segmentation model is trained and improved adaptively, which improves word segmentation accuracy in different dialog scenes and the user experience of the dialog system.

In a second aspect, an embodiment of the present disclosure provides a dialog source data processing apparatus. The dialog source data processing apparatus can be applied to an automatic voice dialog system to accurately and automatically reply to the query of a user. Fig. 3 is a functional block diagram of a dialog source data processing apparatus according to an embodiment of the present disclosure.

Referring to Fig. 3, a dialog source data processing apparatus provided in an embodiment of the present disclosure includes:

A valid data acquisition module 301, configured to obtain valid dialog source data based on the dialog source data.

The dialog source data is the voice data generated during the queries and automatic replies between the user and the automatic voice dialog system. The valid dialog source data is the dialog source data for which the user is satisfied with the system reply. In some embodiments, the dialog source data also includes the user's rating of dialog satisfaction. The rating may be a simple satisfied/dissatisfied rating or a specific satisfaction score.

In some embodiments, after the user uses the automatic voice dialog system, the user rates the system reply; the dialog source data that the user rates as satisfactory, or whose satisfaction score is higher than a preset score, is used as the valid dialog source data. Specifically, the dialog satisfaction is obtained from the dialog source data, and the dialog source data whose dialog satisfaction is greater than a satisfaction threshold is extracted as the valid dialog source data. The satisfaction threshold may be set arbitrarily; for example, it may be set according to the data amount of the dialog source data. It will be appreciated that the greater the satisfaction threshold, the less valid dialog source data is obtained, and the smaller the threshold, the more valid dialog source data is obtained. In practical applications, the satisfaction threshold may also be set according to the performance of the hardware.

In some embodiments, the valid data acquisition module 301 includes a satisfaction acquisition unit and an extraction unit. The satisfaction acquisition unit is configured to obtain the dialog satisfaction from the dialog source data. The extraction unit is configured to extract the dialog source data whose dialog satisfaction is greater than the satisfaction threshold as the valid dialog source data.

A sample acquisition module 302, configured to obtain a dialog sample by using the valid dialog source data.

A model building module 303, configured to construct a word segmentation model based on the dialog sample.

A source data acquisition module 304, configured to apply the word segmentation model to the dialog system, obtain new dialog source data according to user behavior, and return the obtained dialog source data to the valid data acquisition module.

The word segmentation model obtained by the model building module 303 is applied to the dialog system, and new dialog source data is obtained according to user behavior, which yields a large amount of dialog source data. The new dialog source data is used to optimize the model, so that the effect of the dialog system and the word segmentation accuracy both improve continuously.

In some embodiments, the model building module 303 builds the sequence labeling model from the dialog sample by a gated recurrent model and a conditional random field model.

As shown in Fig. 4, in some embodiments, the sample acquisition module 400 includes:

A segmentation unit 401, configured to segment the dialog content in the dialog source data into N segments by collision alignment, wherein N is an integer greater than or equal to 1.

A word segmentation unit 402, configured to segment the segments by a word segmentation method to obtain word segmentation granularity. The word segmentation method may be any word segmentation method commonly used in the field.

A sample acquisition unit 403, configured to correct the boundary and granularity of the word segmentation granularity to obtain a dialog sample.

In some embodiments, the sample acquisition unit 403 includes a counting subunit, a merging probability calculating subunit, a segmented probability calculating subunit, and a word segmentation granularity correction subunit. The counting subunit is configured to count the alignment times, the segmented times and the independent search times of the word segmentation granularity. The merging probability calculating subunit is configured to calculate the merging probability of the word segmentation granularity based on the alignment times, the segmented times, the independent search times and the proper nouns. The segmented probability calculating subunit is configured to calculate the segmented probability of the word segmentation granularity based on the alignment times and the segmented times. The word segmentation granularity correction subunit is configured to correct the word segmentation granularity according to the merging probability and the segmented probability to obtain the dialog sample.

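The merging probability calculating subunit obtains the merging probability by the merging probability formula (1):

P_m = f(selfAlignFreq, splitFreq, searchFreq, isProperNoun)    (1)

The segmented probability calculating subunit obtains the segmented probability by the segmented probability formula (2):

P_s = f(splitFreq, selfAlignFreq)    (2)

where P_m is the merging probability, P_s is the segmented probability, selfAlignFreq is the number of self-alignments of the word segmentation granularity, splitFreq is the number of times the word segmentation granularity is segmented, searchFreq is the number of independent searches for the segment, isProperNoun indicates whether the segment is a proper noun, and f is the decision function.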

In the dialog source data processing apparatus provided in this embodiment, the valid data acquisition module obtains valid dialog source data based on the dialog source data; the sample acquisition module obtains a dialog sample by using the valid dialog source data; the model building module constructs a word segmentation model based on the dialog sample; and the source data acquisition module applies the word segmentation model to the dialog system and obtains new dialog source data according to user behavior. In this way, large-scale, high-precision dialog samples are mined automatically and cyclically, and the dialog system is trained and improved adaptively, so that word segmentation accuracy is improved.

In a third aspect, referring to fig. 5, an embodiment of the present disclosure provides an electronic device, including:

one or more processors 501;

a memory 502 on which one or more programs are stored, which when executed by the one or more processors, cause the one or more processors to implement the dialog source data processing method of any one of the above;

one or more I/O interfaces 503 coupled between the processor and the memory and configured to enable information interaction between the processor and the memory.

The processor 501 is a device with data processing capability, and includes but is not limited to a Central Processing Unit (CPU) and the like; memory 502 is a device having data storage capabilities including, but not limited to, random access memory (RAM, more specifically SDRAM, DDR, etc.), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), FLASH memory (FLASH); an I/O interface (read/write interface) 503 is connected between the processor 501 and the memory 502, and can realize information interaction between the processor 501 and the memory 502, which includes but is not limited to a data Bus (Bus) and the like.

In some embodiments, the processor 501, memory 502, and I/O interface 503 are connected to each other and to other components of the computing device by a bus.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the dialogue source data processing method of any one of the above.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. 
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, features, characteristics and/or elements described in connection with a particular embodiment may be used alone or in combination with features, characteristics and/or elements described in connection with other embodiments, unless expressly stated otherwise, as would be apparent to one skilled in the art. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the disclosure as set forth in the appended claims.

Claims

1. A dialogue source data processing method, comprising:

step 101, obtaining effective dialogue source data based on dialogue source data;

step 102, acquiring a dialogue sample by using the effective dialogue source data;

step 103, constructing a word segmentation model based on the dialogue sample; and

step 104, applying the word segmentation model to a dialogue system, acquiring new dialogue source data according to user behaviors, and returning the newly acquired dialogue source data to step 101.

2. The method of claim 1, wherein said obtaining effective dialogue source data based on the dialogue source data comprises:

obtaining a dialogue satisfaction from the dialogue source data;

3. The method of claim 1, wherein said acquiring a dialogue sample by using the effective dialogue source data comprises:

segmenting dialogue content in the effective dialogue source data into N segments in a collision alignment mode, wherein N is an integer greater than or equal to 1;

performing word segmentation on the segments to obtain a word segmentation granularity;

4. The method of claim 3, wherein the segment is a largest common segment.
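As an illustration only, one possible reading of a "largest common segment" is the longest contiguous substring shared by two utterances. The sketch below assumes that reading; the disclosure does not state that the collision alignment is computed this way, and the function name `largest_common_segment` is hypothetical. It relies on `difflib.SequenceMatcher.find_longest_match` from the Python standard library.

```python
from difflib import SequenceMatcher


def largest_common_segment(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    This is an assumed interpretation of "largest common segment",
    not a method confirmed by the disclosure.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


# Two hypothetical user utterances that share the segment "周杰伦的".
print(largest_common_segment("播放周杰伦的歌", "我想听周杰伦的七里香"))
```

When the two utterances share no character, the function returns an empty string.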

5. The method of claim 3, wherein the word segmentation granularity comprises one or more of a base word granularity and a shuffle word granularity.

6. The method of claim 3, wherein the modifying the boundaries and granularity of the word segmentation granularity to obtain a dialogue sample comprises:

7. The method of claim 1, wherein the constructing a word segmentation model based on the dialogue sample comprises:

8. The method of claim 7, wherein the sequence annotation model comprises a base word model and a shuffle word model.

9. A dialogue source data processing apparatus, comprising:

an effective data acquisition module, configured to obtain effective dialogue source data based on dialogue source data;

10. The apparatus of claim 9, wherein the effective data acquisition module comprises:

11. The apparatus of claim 9, wherein the sample acquisition module comprises:

12. The apparatus of claim 11, wherein the sample acquisition unit comprises:

13. The apparatus of claim 9, wherein the model construction module constructs a sequence annotation model from the dialogue samples by using a gated recurrent model and a conditional random field model.
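For illustration, a sequence annotation model used for word segmentation typically assigns one tag to each character, and word boundaries are then read off the tag sequence. The sketch below assumes the common B/M/E/S scheme (begin, middle, end, single); the disclosure does not state which tag set is used, and the function name `tags_to_words` is hypothetical. Producing the tags with the gated recurrent and conditional random field models is outside the scope of this sketch.

```python
from typing import List


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Convert per-character B/M/E/S tags (an assumed scheme) into words."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":                 # single-character word
            if current:
                words.append(current)  # flush an unfinished word
                current = ""
            words.append(ch)
        elif tag == "B":               # start of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":               # middle of a word
            current += ch
        elif tag == "E":               # end of a word
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag}")
    if current:
        words.append(current)          # flush a trailing unfinished word
    return words
```

For example, the hypothetical tag sequence B, E, B, M, E over the characters of "播放七里香" yields the two words "播放" and "七里香".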

14. An electronic device, comprising:

one or more processors;

storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8;

15. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-8.