CN111125317A - Model training, classification, system, device and medium for conversational text classification


Info

Publication number
CN111125317A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911374381.9A
Other languages
Chinese (zh)
Inventor
杨森
罗超
胡泓
王思博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN201911374381.9A
Publication of CN111125317A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses model training, classification, a system, a device and a medium for conversational text classification, comprising the following steps: performing violation annotation on training data to obtain first data, where the first data comprises original dialogue sentences; segmenting the first data into a plurality of word segments; matching the word segments against pre-trained word vectors to obtain a first word-segment vector; matching the first word-segment vector against pre-trained character vectors to obtain a second word-segment vector; and inputting the second word-segment vector into a HAN model for training. The model training method for conversational text classification classifies conversational texts with a hierarchical attention network (HAN) algorithm, which avoids losing the relationships between sentences and their semantics and improves the accuracy of the classification results; at the same time, both word-vector and character-vector representations are introduced, which reduces the influence of out-of-vocabulary (OOV) words and improves the accuracy of the classification results and the precision of the model.

Description

Model training, classification, system, device and medium for conversational text classification
Technical Field
The present invention relates to the field of text classification, and more particularly to model training, classification, systems, devices and media for conversational text classification.
Background
Current model training methods for text classification fall mainly into two types. In the first, all sentences in a document are concatenated and the text is then modeled for classification with a CNN (convolutional neural network), an RNN (recurrent neural network) or BERT (Bidirectional Encoder Representations from Transformers); concatenation, however, discards the relationships between sentences. In the second, the number of sentences in the document is limited before concatenation and classification; although this shortens the text, discarding some sentences is likely to lose semantics, so the precision of the final classification result is low.
Disclosure of Invention
The invention aims to overcome the defects of the prior art that sentence concatenation loses the relationships between sentences and that limiting the number of sentences in a text loses semantics, both of which lower the accuracy of the classification result, and provides model training, classification, a system, a device and a medium for conversational text classification.
The invention solves the technical problems through the following technical scheme:
a model training method for conversational text classification comprises the following steps:
carrying out violation annotation on training data to obtain first data, wherein the first data comprises original dialogue sentences;
segmenting the first data into a plurality of word segments;
matching the word segments against pre-trained word vectors to obtain a first word-segment vector;
matching the first word-segment vector against pre-trained character vectors to obtain a second word-segment vector;
and inputting the second word-segment vector into a HAN (Hierarchical Attention Network) model for training.
Preferably, the step of marking the training data for violation to obtain the first data specifically includes:
dividing the training data into a plurality of first sample data by using a clustering algorithm;
extracting a plurality of pieces of data from each piece of the first sample data as second sample data;
calculating the edit distance between any piece of the second sample data and the labeled training data;
and, if the edit distance is greater than a preset value, performing violation annotation on the first sample data to obtain the first data.
Preferably, the step of segmenting the first data into a plurality of segments specifically includes:
preprocessing the first data to obtain second data;
comparing the second data with preset model data to filter meaningless sentences in the second data to obtain third data;
segmenting the third data into a plurality of segments.
Preferably, the step of inputting the second segmentation vector into the HAN model for training specifically includes:
and inputting the second word segmentation vector and a preset feature vector into the HAN model.
Preferably, the model training method further comprises:
judging whether the model output result meets a preset threshold; if so, taking the model output result as the classification result; if not, re-labeling the training data for which the confidence of the model output result is below the preset threshold to obtain fourth data;
and re-inputting the fourth data into the HAN model, then returning to the step of judging whether the model output result meets the preset threshold.
A method of classifying conversational text, the method comprising the steps of:
obtaining dialogue type text data, wherein the dialogue type text data comprises original dialogue sentences;
segmenting the conversational text data into a plurality of word segments;
matching the word segments against pre-trained word vectors to obtain a first word-segment vector;
matching the first word-segment vector against pre-trained character vectors to obtain a second word-segment vector;
and inputting the second word-segment vector into a HAN model trained by any of the above model training methods for conversational text classification, to obtain a classification result.
Preferably, the step of inputting the second segmentation vector into the HAN model trained by using any one of the above methods for model training for conversational text classification to obtain a classification result specifically includes:
and inputting the second word segmentation vector and a preset feature vector into an HAN model trained by adopting any one of the dialogue type text classification model training methods to obtain a classification result.
Preferably, the pre-trained word vectors comprise an ELMo (Embeddings from Language Models) word vector and a GloVe (Global Vectors for Word Representation) word vector;
and the pre-trained character vectors comprise a GloVe character vector.
A model training system for conversational text classification comprising the following modules:
the first data acquisition module is used for carrying out violation marking on training data to obtain first data, and the first data comprises original dialogue sentences;
the first word-segmentation acquisition module is used for segmenting the first data into a plurality of word segments;
the first vector acquisition module is used for matching the word segments against pre-trained word vectors to obtain a first word-segment vector;
the second vector acquisition module is used for matching the first word-segment vector against pre-trained character vectors to obtain a second word-segment vector;
and the model input module is used for inputting the second word-segment vector into the HAN model for training.
Preferably, the first data acquisition module includes:
a first sample acquisition unit for dividing the training data into a plurality of pieces of first sample data using a clustering algorithm;
a second sample obtaining unit, configured to extract a plurality of pieces of data from each piece of the first sample data as second sample data;
the edit-distance calculation unit is used for calculating the edit distance between any piece of the second sample data and the labeled training data;
and the first data acquisition unit is used for performing violation annotation on the first sample data to obtain the first data if the edit distance is greater than a preset value.
Preferably, the first word segmentation obtaining module includes:
the second data acquisition unit is used for preprocessing the first data to obtain second data;
the third data acquisition unit is used for comparing the second data with preset model data and filtering meaningless sentences in the second data to obtain third data;
and the word segmentation acquisition unit is used for segmenting the third data into a plurality of words.
Preferably, the model input module is specifically configured to input the second segmentation vector and a preset feature vector to the HAN model.
Preferably, the model training system further comprises:
a result judging module, for judging whether the model output result meets a preset threshold;
the first classification result acquisition module is used for taking the model output result as a classification result when the model output result meets a preset threshold value;
the fourth data labeling module is used for labeling the training data with the confidence coefficient of the model output result smaller than the preset threshold value again to obtain fourth data when the model output result does not meet the preset threshold value;
and the second classification result acquisition module is used for inputting the fourth data to the HAN model again and returning the fourth data to the result judgment module.
A classification system for conversational text, the classification system comprising the following modules:
the system comprises a text data acquisition module, a conversion module and a conversion module, wherein the text data acquisition module is used for acquiring conversation type text data which comprises original conversation sentences;
the second word-segmentation acquisition module is used for segmenting the conversational text data into a plurality of word segments;
the first word-segment vector acquisition module is used for matching the word segments against pre-trained word vectors to obtain a first word-segment vector;
the second word-segment vector acquisition module is used for matching the first word-segment vector against pre-trained character vectors to obtain a second word-segment vector;
and the second classification result acquisition module is used for inputting the second segmentation vector into an HAN model trained by adopting any one of the dialogue type text classification model training systems to obtain a classification result.
Preferably, the second classification result obtaining module is further configured to input the second segmentation vector and a preset feature vector to an HAN model trained by using any one of the above model training systems for conversational text classification, so as to obtain a classification result.
Preferably, the pre-trained word vectors comprise an ELMo word vector and a GloVe word vector;
and the pre-trained character vectors comprise a GloVe character vector.
An electronic device comprising a memory, a processor and a computer program stored on the memory and operable on the processor, the processor implementing any of the above methods of model training for conversational text classification or the methods of classification of conversational text when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for model training for conversational text classification or the method for classifying conversational text described in any of the above.
The positive effects of the invention are as follows:
1. Conversational text is classified with a hierarchical attention network (HAN) algorithm, which avoids losing the relationships between sentences and their semantics and improves the accuracy of the classification results.
2. Word-vector and character-vector representations are introduced together, which reduces the influence of out-of-vocabulary (OOV) words and improves the accuracy of the classification results and the precision of the model.
3. Feature-vector information is added, which further increases the accuracy of the classification results and the precision of the model.
4. Data iteration is performed in an active-learning manner, which improves data quality and further optimizes the model.
Drawings
Fig. 1 is a flowchart illustrating a method for training a dialog-type text classification model according to a preferred embodiment 1 of the present invention.
FIG. 2 is a flowchart illustrating a method for training dialog-based text classification according to a preferred embodiment 2 of the present invention.
Fig. 3 is a flowchart illustrating a method for classifying dialog-type texts according to a preferred embodiment 3 of the present invention.
FIG. 4 is a block diagram of a system for training dialog-based text classification models according to an embodiment 4 of the present invention.
FIG. 5 is a block diagram of a system for training dialog-based text classification models according to an embodiment 5 of the present invention.
FIG. 6 is a block diagram of a dialog-based text classification system according to an embodiment 6 of the present invention.
Fig. 7 is a block diagram of an electronic device according to a preferred embodiment 7 of the invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, the method for training a dialog-type text classification model of the present embodiment includes the following steps:
s101, carrying out violation annotation on training data to obtain first data, wherein the first data comprises original dialogue sentences;
s102, segmenting the first data into a plurality of word segments;
s103, matching the word segmentation with the pre-trained word vector to obtain a first word segmentation vector;
specifically, matching is performed from pre-trained word vectors through the word segmentation result of the previous step, wherein the word vectors comprise a glove word vector and an Elmo word vector. The Glove word vector and the Elmo word vector are obtained by collecting a large amount of high-quality text data which are open on the internet and text data in the hotel field, preprocessing the high-quality text data and the text data and then training the preprocessed high-quality text data and the preprocessed high-quality text data based on unsupervised learning, and the purpose of obtaining a universal language model is achieved.
S104, matching the first word-segment vector against the pre-trained character vectors to obtain a second word-segment vector;
specifically, to reduce the effect of OOV, a word-based feature vector, i.e., a glove word vector, is added. The final result will be a word vector of the word vector concatenated to get a resultant representation from word to vector.
And S105, inputting the second word-segment vector into the HAN model for training.
In the model training method for conversational text classification of this embodiment, model training uses data containing the original dialogue sentences, which keeps the sentences complete and makes it easy to determine the relationships between interlocutors from the surrounding dialogue content. Word vectors and character vectors are matched together, which resolves the ambiguity of words across different contexts and improves model accuracy. The hierarchical attention HAN model performs semantic analysis on the surrounding call content, so the contextual relationships are well established and the model's classification precision is higher.
Example 2
As shown in fig. 2, the model training method for conversational text classification in this embodiment further refines embodiment 1, specifically:
step S101 includes:
s1010, dividing the training data into a plurality of first sample data by using a clustering algorithm;
s1011, extracting a plurality of pieces of data from each piece of first sample data as second sample data;
s1012, calculating the edit distance between any one piece of second sample data and the marked training data;
and S1013, if the editing distance is larger than a preset value, carrying out violation marking on the first sample data to obtain first data.
Labeling conversational text data requires more manpower than labeling other kinds of text. In the model training method for conversational text classification of this embodiment, when the training data are obtained, a clustering algorithm first divides the data into k parts; n pieces of data are randomly extracted from each part; the remaining data in the k groups are then traversed, and for each remaining piece one of the n sampled pieces is drawn at random and the edit distance between them is computed, to check whether that piece differs substantially from the sampled data; if the difference is large, the piece is not yet represented in the data sample and needs to be added to it. The specific algorithm is the Levenshtein (edit distance) algorithm, with a preset confidence of 0.7 or 0.8; when the computed edit distance is smaller than the preset confidence, the data are supplemented into the training sample. This ensures sufficient training-sample data while reducing the cost of manual labeling.
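The sampling step above rests on the Levenshtein edit distance, which can be sketched with plain dynamic programming. The `needs_labeling` helper, its threshold value, and the toy sentences are hypothetical illustrations of the "add only sufficiently different data" idea, not the patent's exact thresholds.

```python
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def needs_labeling(candidate, sampled, max_distance=3):
    """A candidate joins the labeled sample only if it differs enough
    (edit distance above the threshold) from a randomly drawn sampled
    sentence -- the check the embodiment describes."""
    probe = random.choice(sampled)
    return levenshtein(candidate, probe) > max_distance

sampled = ["the room is clean"]
near_duplicate = needs_labeling("the room is clean!", sampled)  # skipped
novel = needs_labeling("please call me back", sampled)          # added
```

A near-duplicate sentence is filtered out, while a genuinely different one is flagged for manual annotation, reducing labeling cost as the embodiment claims.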
Step S102 includes:
s1021, preprocessing the first data to obtain second data;
s1022, comparing the second data with preset model data, and filtering meaningless sentences in the second data to obtain third data;
specifically, the first data is subjected to full angle to half angle, traditional Chinese to simplified Chinese and upper case to lower case operation, and the purpose of the operation is to unify the format of the text content; then, the text with unified format is subjected to word deactivation, such as character removal or low-frequency word removal, null value filtration and sensitive word filtration, so as to reduce the text length and reduce nonsense words to obtain second data; and calculating the similarity of the second data and model data prepared in advance, and further filtering out partial meaningless sentences in the conversation, such as hello, goodbye and the like, wherein the model data is a self-defined meaningless word template. Data preprocessing is performed before word segmentation, and the word segmentation accuracy is further improved.
And S1023, dividing the third data into a plurality of word segments.
The training method for the conversational text classification model of this embodiment uses the HanLP tokenizer; to increase segmentation accuracy and coverage in this scenario, a custom domain dictionary with entries such as "big double bed" and "pre-authorization" is added to the tokenizer.
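Since HanLP is a separate (Java-based) toolkit, the effect of a custom domain dictionary can be illustrated with a toy greedy longest-match tokenizer; the `tokenize` helper and the dictionary entries are hypothetical, but they show why a domain term like "大床房" must be in the dictionary to survive segmentation intact.

```python
def tokenize(text, dictionary):
    """Greedy longest-match segmentation against a user dictionary.
    (The patent uses the HanLP tokenizer; this toy matcher only
    illustrates the role of custom domain entries.)"""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character becomes its own token
        tokens.append(match)
        i += len(match)
    return tokens

base = {"房间", "预订"}
domain = base | {"大床房", "预授权"}  # hotel-domain custom entries

without_dict = tokenize("预订大床房", base)    # domain term falls apart
with_dict = tokenize("预订大床房", domain)     # domain term kept whole
```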
Step S105 specifically includes:
and inputting the second word segmentation vector and the preset feature vector into the HAN model for training.
Specifically, the HAN model weights each word in each sentence to obtain a sentence-vector representation of the whole sentence, and the vector representation of the whole document is then obtained as the weighted sum of the sentences in the document. To enrich the features, additional feature dimensions such as call duration and order status are added at the classification layer, further improving model precision.
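The two-level attention pooling described above might be sketched as follows. This is a simplified sketch: a real HAN places GRU encoders before each attention layer and learns the context vectors, both omitted here, and the extra feature values (call duration, order status) are hypothetical.

```python
import numpy as np

def attention_pool(vectors, context):
    """Weight vectors by softmax similarity to a context vector and
    return their weighted sum (the HAN pooling step at one level)."""
    scores = np.array([v @ context for v in vectors])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ np.stack(vectors)

def han_document_vector(sentences, word_ctx, sent_ctx, extra_features):
    """Word-level attention builds each sentence vector; sentence-level
    attention builds the document vector; hand-crafted features (e.g.
    call duration, order status) are concatenated before the classifier."""
    sent_vecs = [attention_pool(words, word_ctx) for words in sentences]
    doc_vec = attention_pool(sent_vecs, sent_ctx)
    return np.concatenate([doc_vec, extra_features])

rng = np.random.default_rng(0)
# 3 sentences of 5 word vectors each, dimension 8.
sentences = [[rng.normal(size=8) for _ in range(5)] for _ in range(3)]
doc = han_document_vector(sentences, rng.normal(size=8),
                          rng.normal(size=8), np.array([120.0, 1.0]))
```

The classifier then operates on `doc`, whose last dimensions carry the appended feature information.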
The method for training the dialog text classification model of the embodiment further comprises the following steps:
s1061, judging whether the output result of the model meets a preset threshold value, if so, executing a step S1062: taking the output result of the model as a classification result;
specifically, the preset threshold is set as the number of iterations or the accuracy of the model or there is no more data with a confidence below 0.9.
If not, go to step S1063: labeling the training data with the confidence coefficient of the model output result smaller than the preset threshold value again to obtain fourth data;
specifically, because the artificially labeled data has certain errors, the active learning idea is used to re-label the data of which the confidence coefficient of the model classification result is lower than 0.9, and the artificially re-labeling is specifically adopted.
S1064, inputting the fourth data into the HAN model again, and returning to the step S1601;
specifically, according to the model classification result, the classification with the highest confidence coefficient is selected as the final classification, and the specific classification is divided into two classes with or without violation.
The model training method for conversational text classification of this embodiment preprocesses the training data with a clustering algorithm, ensuring sufficient training-sample data; performs data preprocessing before word segmentation, further improving segmentation accuracy; adds a custom domain dictionary, increasing segmentation accuracy and coverage for this scenario; and adds feature-vector information, further increasing the accuracy of the classification results and the precision of the model.
Example 3
As shown in fig. 3, the method for classifying dialog texts of the present embodiment includes the following steps:
s301, obtaining dialogue type text data, wherein the dialogue type text data comprises original dialogue sentences;
s302, segmenting the conversational text data into a plurality of participles;
s303, matching the participles with pre-trained word vectors to obtain first participle vectors;
s304, matching the first word segmentation vector with the pre-trained word vector to obtain a second word segmentation vector;
and S305, inputting the second segmentation vector into the HAN model trained by the dialogue-type text classification model training method of the embodiment 1 or 2 to obtain a classification result.
Step S305 is to input the second segmentation vector and the preset feature vector into the HAN model trained by the dialog-type text classification model training method of embodiment 1 or 2, so as to obtain a classification result.
The pre-trained word vectors comprise an ELMo word vector and a GloVe word vector;
the pre-trained character vectors comprise a GloVe character vector.
The conversational-text classification method of this embodiment classifies transcribed conversation data with the trained hierarchical-attention HAN model and introduces factors that strongly influence the judgment of the result, which further enriches the feature information and increases model precision. The class with the largest predicted probability is selected as the final classification result, giving higher classification precision. Compared with manual spot checks, the recall rate is improved while precision is maintained; the method not only achieves violation detection but also saves a large amount of manpower.
Example 4
Referring to fig. 4, the model training system for conversational text classification in this embodiment includes the following modules:
the first data acquisition module 1 is used for carrying out violation annotation on training data to obtain first data, and the first data comprises original dialogue sentences;
the first segmentation obtaining module 2 is used for segmenting the first data into a plurality of segments;
specifically, matching is performed from pre-trained word vectors through the word segmentation result of the previous step, wherein the word vectors comprise a Glove word vector and an Elmo word vector. The Glove word vector and the Elmo word vector are obtained by collecting a large amount of high-quality text data which are open on the internet and text data in the hotel field, preprocessing the high-quality text data and the text data and then training the preprocessed high-quality text data and the preprocessed high-quality text data based on unsupervised learning, and the purpose of obtaining a universal language model is achieved.
The first vector acquisition module 3 is used for matching the word segments against the pre-trained word vectors to obtain a first word-segment vector;
the second vector acquisition module 4 is used for matching the first word-segment vector against the pre-trained character vectors to obtain a second word-segment vector;
specifically, to reduce the effect of OOV, a word-based feature vector, i.e., a Glove word vector, is added. The final result will be a word vector of the word vector concatenated to get a resultant representation from word to vector.
And the model input module 5 is used for inputting the second participle vector into the HAN model for training.
The dialogue-type text classification model training system of this embodiment performs model training with data containing the original dialogue sentences, which keeps the sentences complete and makes it easy to determine the relationships between interlocutors from the surrounding dialogue content. Word vectors and character vectors are matched together, which resolves the ambiguity of words across different contexts and improves model accuracy. The hierarchical attention HAN model performs semantic analysis on the surrounding call content, so the contextual relationships are well established and the model's classification precision is higher.
Example 5
As shown in fig. 5, the model training system for conversational text classification of this embodiment further refines embodiment 4, specifically:
the first data obtaining module 1 specifically includes:
a first sample acquiring unit 11, configured to divide the training data into a plurality of pieces of first sample data by using a clustering algorithm;
a second sample obtaining unit 12, configured to extract a plurality of pieces of data from each piece of the first sample data as second sample data;
an edit distance calculation unit 13, configured to calculate an edit distance between any one piece of second sample data and the labeled training data;
the first data obtaining unit 14 is configured to, if the edit distance is greater than the preset value, perform violation marking on the first sample data to obtain first data.
Labeling conversational text data requires more manpower than labeling other kinds of text. In the model training system for conversational text classification of this embodiment, when the training data are obtained, a clustering algorithm first divides the data into k parts; n pieces of data are randomly extracted from each part; the remaining data in the k groups are then traversed, and for each remaining piece one of the n sampled pieces is drawn at random and the edit distance between them is computed, to check whether that piece differs substantially from the sampled data; if the difference is large, the piece is not yet represented in the data sample and needs to be added to it. The specific algorithm is the Levenshtein algorithm, with a preset confidence of 0.7 or 0.8; when the computed edit distance is smaller than the preset confidence, the data are supplemented into the training sample. This ensures sufficient training-sample data while reducing the cost of manual labeling.
The first segmentation obtaining module 2 specifically includes:
a second data obtaining unit 21, configured to pre-process the first data to obtain second data;
the third data acquisition unit 22 is configured to compare the second data with preset model data and filter meaningless sentences in the second data to obtain third data;
specifically, the first data is subjected to full angle to half angle, traditional Chinese to simplified Chinese and upper case to lower case operation, and the purpose of the operation is to unify the format of the text content; then, the text with unified format is subjected to word deactivation, such as character removal or low-frequency word removal, null value filtration and sensitive word filtration, so as to reduce the text length and reduce nonsense words to obtain second data; and calculating the similarity of the second data and model data prepared in advance, and further filtering out partial meaningless sentences in the conversation, such as hello, goodbye and the like, wherein the model data is a self-defined meaningless word template. Data preprocessing is performed before word segmentation, and the word segmentation accuracy is further improved.
A word segmentation obtaining unit 23, configured to segment the third data into a plurality of words.
The dialogue-type text classification model training system of this embodiment uses the HanLP tokenizer; to increase segmentation accuracy and coverage in this scenario, a custom domain dictionary with entries such as "big double bed" and "pre-authorization" is added to the tokenizer.
And the model input module 5 is specifically configured to input the second segmentation vector and the preset feature vector to the HAN model.
Specifically, the HAN model weights each word in a sentence to obtain a vector representation of the whole sentence, and then weights and sums the sentence vectors under a document to obtain the vector representation of the whole document. To enrich the features, additional feature dimensions such as call duration and order status are appended at the classification layer, further improving model accuracy.
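The two-level weighting can be sketched with NumPy. This is a simplified, untrained illustration of hierarchical attention pooling, not the full HAN: a fixed (hypothetical) context vector scores each word, the softmax-weighted sum gives the sentence vector, the same pooling over sentence vectors gives the document vector, and the extra feature dimensions are concatenated before classification. A real HAN additionally uses bidirectional GRU encoders and learned projection parameters.

```python
import numpy as np

def attention_pool(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Softmax-weighted sum of row vectors; the context/query vector scores
    how informative each row is."""
    scores = vectors @ query                      # (n,)
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ vectors                      # (d,)

def han_document_vector(sentences, word_query, sent_query, extra_features):
    """Words -> sentence vectors -> document vector, then concatenate the
    extra feature dimensions (e.g. call duration, order status)."""
    sent_vecs = np.stack([attention_pool(s, word_query) for s in sentences])
    doc_vec = attention_pool(sent_vecs, sent_query)
    return np.concatenate([doc_vec, extra_features])
```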
The model training system for conversational text classification of the present embodiment further includes the following modules:
the result judging module 6 is used for judging whether the output result of the model meets a preset threshold value or not;
Specifically, the preset threshold is set as a number of iterations, a target model accuracy, or the condition that no data remains with a confidence below 0.9.
The first classification result acquisition module 7 is used for taking the model output result as a classification result when the model output result meets a preset threshold value;
the fourth data labeling module 8 is configured to label the training data with the confidence coefficient of the model output result being smaller than the preset threshold again to obtain fourth data when the model output result does not meet the preset threshold;
Specifically, because manually labeled data contains certain errors, the idea of active learning is used to re-label the data whose model-classification confidence is below 0.9; the re-labeling is performed manually.
And the second classification result acquisition module 9 is configured to re-input the fourth data into the HAN model for training and return to the result judging module.
Specifically, according to the model classification result, the class with the highest confidence is selected as the final classification; the specific classes are two, violation and no violation.
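The threshold check and routing performed by modules 6 through 9 might be sketched as follows; the data shapes, function name, and the 0.9 default are assumptions drawn from the description:

```python
def split_by_confidence(predictions, threshold=0.9):
    """predictions: list of (sample, {label: probability}) pairs.
    Accept the argmax label when its probability clears the threshold;
    route everything else to manual re-annotation (active learning)."""
    accepted, to_relabel = [], []
    for sample, probs in predictions:
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            accepted.append((sample, label))   # final classification
        else:
            to_relabel.append(sample)          # back to human annotators
    return accepted, to_relabel
```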
The model training method for conversational text classification in this embodiment preprocesses the training data with a clustering algorithm, ensuring the sufficiency of the training sample data; data preprocessing before word segmentation further improves segmentation accuracy; the user-defined domain dictionary increases segmentation accuracy and coverage for the scenario; and the added feature-vector information further increases the accuracy of the classification result and of the model.
Example 6
As shown in fig. 6, the classification system for conversational text of this embodiment adopts the model training system for conversational text classification of embodiment 4 or 5, and includes the following modules:
a text data obtaining module 60, configured to obtain dialogue-type text data, where the dialogue-type text data includes an original dialogue statement;
a second segmentation obtaining module 61, configured to segment the conversational text data into a plurality of segments;
a first segmentation vector obtaining module 62, configured to match the segmentation with a pre-trained word vector to obtain a first segmentation vector;
a second segmentation vector obtaining module 63, configured to match the first segmentation vector with the pre-trained word vector to obtain a second segmentation vector;
and a second classification result obtaining module 64, configured to input the second segmentation vector to the pre-trained HAN model to obtain a classification result.
The second classification result obtaining module 64 is specifically configured to input the second segmentation vector and the preset feature vector to the HAN model trained by the dialog text classification model training system in embodiment 4 or 5, so as to obtain a classification result.
The pre-trained word vectors include an Elmo word vector and a Glove word vector;
the pre-trained word vectors include Glove word vectors.
The classification system for conversational text of this embodiment classifies transcribed call data with the trained hierarchy-attention-based HAN model and introduces factors that strongly influence the judgment, further enriching the feature information and increasing model precision. Selecting the class with the largest predicted probability as the final classification result makes the model's classification precision higher; compared with manual spot checks, recall is improved while precision is maintained; and the purpose of violation detection is achieved while saving a large amount of manpower.
Example 7
Fig. 7 is a schematic block diagram of an electronic device according to embodiment 7 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the model training method for conversational text classification provided in embodiment 1 or 2, or the classification method for conversational text provided in embodiment 3. The electronic device 30 shown in fig. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 7, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as random access memory (RAM) 321 and/or cache memory 322, and may further include read-only memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the model training method for conversational text classification provided in embodiment 1 or 2 of the present invention or the classification method for conversational text provided in embodiment 3, by running a computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Example 8
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the model training method for conversational text classification provided in embodiment 1 or 2, or of the classification method for conversational text provided in embodiment 3.
More specific examples that the readable storage medium may employ include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the present invention can also be implemented in the form of a program product comprising program code which, when the program product runs on a terminal device, causes the terminal device to carry out the steps of the model training method for conversational text classification provided in embodiment 1 or 2, or of the classification method for conversational text provided in embodiment 3.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device; partly on the user device; as a stand-alone software package; partly on the user device and partly on a remote device; or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (18)

1. A method for training a model for conversational text classification, comprising the steps of:
carrying out violation annotation on training data to obtain first data, wherein the first data comprises original dialogue sentences;
segmenting the first data into a plurality of segments;
matching the word segmentation with a pre-trained word vector to obtain a first word segmentation vector;
matching the first word segmentation vector with a pre-trained word vector to obtain a second word segmentation vector;
and inputting the second segmentation vector into a HAN model for training.
2. The method for training models for conversational text classification according to claim 1, wherein the step of labeling training data for violations to obtain first data specifically comprises:
dividing the training data into a plurality of first sample data by using a clustering algorithm;
extracting a plurality of pieces of data from each piece of the first sample data as second sample data;
calculating the edit distance between any piece of the second sample data and the marked training data;
and if the editing distance is larger than a preset value, carrying out violation marking on the first sample data to obtain first data.
3. The method of model training for conversational text classification according to claim 1, wherein the step of segmenting the first data into a plurality of segments specifically comprises:
preprocessing the first data to obtain second data;
comparing the second data with preset model data to filter meaningless sentences in the second data to obtain third data;
segmenting the third data into a plurality of segments.
4. The method of model training for conversational text classification according to claim 1, wherein the step of inputting the second segmentation vector into the HAN model for training specifically comprises:
and inputting the second word segmentation vector and a preset feature vector into an HAN model for training.
5. The method of model training for conversational text classification as recited in claim 1, further comprising:
judging whether the model output result meets a preset threshold value, if so, taking the model output result as a classification result; if not, labeling the training data with the confidence coefficient of the model output result smaller than the preset threshold value again to obtain fourth data;
and inputting the fourth data into the HAN model again for training, and returning to the step to judge whether the output result of the model meets a preset threshold value.
6. A method for classifying conversational text, the method comprising the steps of:
obtaining dialogue type text data, wherein the dialogue type text data comprises original dialogue sentences;
segmenting the conversational text data into a plurality of segments;
matching the word segmentation with a pre-trained word vector to obtain a first word segmentation vector;
matching the first word segmentation vector with a pre-trained word vector to obtain a second word segmentation vector;
inputting the second segmentation vector into the HAN model trained by the dialog type text classification model training method according to any one of claims 1-5 to obtain a classification result.
7. The method for classifying conversational text according to claim 6, wherein the step of inputting the second segmentation vector into the HAN model trained by the method for model training of conversational text classification according to any one of claims 1-5 to obtain the classification result specifically comprises:
inputting the second segmentation vectors and preset feature vectors into the HAN model trained by the conversational text classification model training method according to any one of claims 1-5, and obtaining classification results.
8. The method for classifying conversational text according to claim 6, wherein the pre-trained word vectors comprise Elmo word vectors and Glove word vectors;
the pre-trained word vector comprises a Glove word vector.
9. A conversational text classification model training system, comprising the following modules:
the first data acquisition module is used for carrying out violation marking on training data to obtain first data, and the first data comprises original dialogue sentences;
the first word segmentation acquisition module is used for segmenting the first data into a plurality of words;
the first vector acquisition module is used for matching the word segmentation with a pre-trained word vector to obtain a first word segmentation vector;
the second vector acquisition module is used for matching the first word segmentation vector with a pre-trained word vector to obtain a second word segmentation vector;
and the model input module is used for inputting the second participle vector into the HAN model for training.
10. The system for model training of conversational text classification of claim 9, wherein the first data acquisition module comprises:
a first sample acquisition unit for dividing the training data into a plurality of pieces of first sample data using a clustering algorithm;
a second sample obtaining unit, configured to extract a plurality of pieces of data from each piece of the first sample data as second sample data;
the editing distance calculation unit is used for calculating the editing distance between any piece of second sample data and the marked training data;
and the first data acquisition unit is used for carrying out violation marking on the first sample data to obtain first data if the editing distance is larger than a preset value.
11. The system for model training of conversational text classification of claim 9, wherein the first segmentation acquisition module comprises:
the second data acquisition unit is used for preprocessing the first data to obtain second data;
the third data acquisition unit is used for comparing the second data with preset model data and filtering meaningless sentences in the second data to obtain third data;
and the word segmentation acquisition unit is used for segmenting the third data into a plurality of words.
12. The system of model training for conversational text classification according to claim 9, wherein the model input module is specifically configured to input the second segmentation vector and preset feature vectors into a HAN model.
13. The conversational text classification model training system of claim 9, further comprising:
a result judging module for judging whether the output result of the model meets a preset threshold value,
the first classification result acquisition module is used for taking the model output result as a classification result when the model output result meets a preset threshold value;
the fourth data labeling module is used for labeling the training data with the confidence coefficient of the model output result smaller than the preset threshold value again to obtain fourth data when the model output result does not meet the preset threshold value;
and the second classification result acquisition module is used for inputting the fourth data to the HAN model again and returning the fourth data to the result judgment module.
14. A classification system for conversational text, the classification system comprising the modules:
the system comprises a text data acquisition module, a conversion module and a conversion module, wherein the text data acquisition module is used for acquiring conversation type text data which comprises original conversation sentences;
the second segmentation acquisition module is used for segmenting the dialogue type text data into a plurality of segments;
the first word segmentation vector acquisition module is used for matching the word segmentation with a pre-trained word vector to obtain a first word segmentation vector;
the second word segmentation vector acquisition module is used for matching the first word segmentation vector with a pre-trained character vector to obtain a second word segmentation vector;
a second classification result obtaining module, configured to input the second segmentation vector to an HAN model trained by using the dialogue-type text classification model training system according to any one of claims 9 to 13, so as to obtain a classification result.
15. The system for classifying conversational text according to claim 14, wherein the second classification result obtaining module is further configured to input the second segmentation vector and preset feature vectors into the HAN model trained by the model training system for conversational text classification according to any one of claims 9-13, so as to obtain the classification result.
16. The system for classification of conversational text according to claim 14, wherein the pre-trained word vectors comprise Elmo word vectors and Glove word vectors;
the pre-trained word vector comprises a Glove word vector.
17. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of model training for conversational text classification of any of claims 1-5 or the method of classification of conversational text of any of claims 6-8 when the program is executed.
18. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for model training for conversational text classification of any of claims 1-5 or the method for classification of conversational text of any of claims 6-8.
CN201911374381.9A 2019-12-27 2019-12-27 Model training, classification, system, device and medium for conversational text classification Pending CN111125317A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911374381.9A CN111125317A (en) 2019-12-27 2019-12-27 Model training, classification, system, device and medium for conversational text classification


Publications (1)

Publication Number Publication Date
CN111125317A true CN111125317A (en) 2020-05-08

Family

ID=70503745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911374381.9A Pending CN111125317A (en) 2019-12-27 2019-12-27 Model training, classification, system, device and medium for conversational text classification

Country Status (1)

Country Link
CN (1) CN111125317A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CA3062891A1 (en) * 2017-05-19 2018-11-22 Salesforce.Com, Inc. Natural language processing using context-specific word vectors
US20180349359A1 (en) * 2017-05-19 2018-12-06 salesforce.com,inc. Natural language processing using a neural network
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
WO2019200806A1 (en) * 2018-04-20 2019-10-24 平安科技(深圳)有限公司 Device for generating text classification model, method, and computer readable storage medium
CN110502738A (en) * 2018-05-18 2019-11-26 阿里巴巴集团控股有限公司 Chinese name entity recognition method, device, equipment and inquiry system
CN108829818A (en) * 2018-06-12 2018-11-16 中国科学院计算技术研究所 A kind of file classification method
CN110580288A (en) * 2019-08-23 2019-12-17 腾讯科技(深圳)有限公司 text classification method and device based on artificial intelligence

Non-Patent Citations (2)

Title
Zhao Ya'ou; Zhang Jiazhong; Li Yibin; Fu Xianrui; Sheng Wei: "Sentiment analysis fusing language-model-based word embeddings and a multi-scale convolutional neural network", no. 03, pages 651-657 *
Tao Zhiyong; Li Xiaobing; Liu Ying; Liu Xiaofang: "An improved attention-based short-text classification method using bidirectional long short-term memory networks", no. 12, pages 21-29 *

Cited By (16)

Publication number Priority date Publication date Assignee Title
CN111343203B (en) * 2020-05-18 2020-08-28 国网电子商务有限公司 Sample recognition model training method, malicious sample extraction method and device
CN111343203A (en) * 2020-05-18 2020-06-26 国网电子商务有限公司 Sample recognition model training method, malicious sample extraction method and device
CN111625649A (en) * 2020-05-28 2020-09-04 北京字节跳动网络技术有限公司 Text processing method and device, electronic equipment and medium
CN111708866A (en) * 2020-08-24 2020-09-25 北京世纪好未来教育科技有限公司 Session segmentation method and device, electronic equipment and storage medium
CN111708866B (en) * 2020-08-24 2020-12-11 北京世纪好未来教育科技有限公司 Session segmentation method and device, electronic equipment and storage medium
CN112434501A (en) * 2020-10-23 2021-03-02 杭州未名信科科技有限公司 Work order intelligent generation method and device, electronic equipment and medium
CN112417147A (en) * 2020-11-05 2021-02-26 腾讯科技(深圳)有限公司 Method and device for selecting training samples
CN112784031A (en) * 2021-01-29 2021-05-11 湖南大学 Method and system for classifying customer service conversation texts based on small sample learning
CN112784031B (en) * 2021-01-29 2022-04-22 湖南大学 Method and system for classifying customer service conversation texts based on small sample learning
CN113159921A (en) * 2021-04-23 2021-07-23 上海晓途网络科技有限公司 Overdue prediction method and device, electronic equipment and storage medium
CN113011533A (en) * 2021-04-30 2021-06-22 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN113011533B (en) * 2021-04-30 2023-10-24 平安科技(深圳)有限公司 Text classification method, apparatus, computer device and storage medium
CN113128199A (en) * 2021-05-11 2021-07-16 济南大学 Word vector generation method based on pre-training language model and multiple word information embedding
CN113128199B (en) * 2021-05-11 2022-06-21 济南大学 Word vector generation method based on pre-training language model and multiple word information embedding
CN116051848A (en) * 2023-02-10 2023-05-02 阿里巴巴(中国)有限公司 Image feature extraction method, network model, device and equipment
CN116051848B (en) * 2023-02-10 2024-01-09 阿里巴巴(中国)有限公司 Image feature extraction method, network model, device and equipment

Similar Documents

Publication Publication Date Title
CN111125317A (en) Model training, classification, system, device and medium for conversational text classification
CN107797985B (en) Method and device for establishing synonymous identification model and identifying synonymous text
CN109145294B (en) Text entity identification method and device, electronic equipment and storage medium
JP5901001B1 (en) Method and device for acoustic language model training
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN110633366B (en) Short text classification method, device and storage medium
CN111177186B (en) Single sentence intention recognition method, device and system based on question retrieval
CN106570180A (en) Artificial intelligence based voice searching method and device
CN113486178B (en) Text recognition model training method, text recognition method, device and medium
CN109271624B (en) Target word determination method, device and storage medium
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN111667817A (en) Voice recognition method, device, computer system and readable storage medium
CN113935314A (en) Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network
CN117807482B (en) Method, device, equipment and storage medium for classifying customs clearance notes
CN114220505A (en) Information extraction method of medical record data, terminal equipment and readable storage medium
CN112100360B (en) Dialogue response method, device and system based on vector retrieval
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN116361511A (en) Video retrieval method, device and equipment of composite semantics and storage medium
CN116166792A (en) Template-based Chinese privacy policy abstract generation method and device
CN115358817A (en) Intelligent product recommendation method, device, equipment and medium based on social data
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN111506764B (en) Audio data screening method, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination