CN115098680A - Data processing method, data processing apparatus, electronic device, medium, and program product

Info

Publication number
CN115098680A
Authority
CN
China
Prior art keywords
text data
sample
target
classification model
data
Legal status
Granted
Application number
CN202210756062.XA
Other languages
Chinese (zh)
Other versions
CN115098680B (en)
Inventor
蔡浩锐
李健
蔡超维
李利强
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210756062.XA
Publication of CN115098680A
Application granted
Publication of CN115098680B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods


Abstract

Embodiments of the present application disclose a data processing method, a data processing apparatus, an electronic device, a medium, and a program product, applied in the field of computer technology. The method comprises the following steps: obtaining sample text data; sequentially performing sliding segmentation on the N sub-texts of the sample text data based on a sliding window to obtain a plurality of candidate text data; pre-training a classification model based on the sample text data to obtain a pre-trained classification model; calling the pre-trained classification model to perform classification prediction on the candidate text data; selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data; and training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model. With the method and apparatus, the accuracy of the trained classification model can be improved, so that more accurate classification prediction can be performed on text data by the trained classification model.

Description

Data processing method, data processing apparatus, electronic device, medium, and program product
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a data processing method, apparatus, electronic device, medium, and program product.
Background
With the continuous development of computer technology, Artificial Intelligence (AI) technology is becoming increasingly mature; AI technology involves machine-learning-related techniques.
In the prior art, a model can be trained using machine learning techniques, and the trained model can be applied to classify and predict text data in a specified service scene. For example, session text data of a target object is obtained, and the model is used to classify the risk of the session text data, such as whether the session text data of the target object contains malicious text. When a training set is actually constructed, however, the number of samples that can be collected in the current business scene may be small; training a model on such a small-sample training set easily causes the model to overfit, so that the accuracy of the trained model is low.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, an electronic device, a medium and a program product, which can improve the accuracy of a trained classification model, and further can perform more accurate classification prediction on text data through the trained classification model.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring sample text data; the number of sample text data items is less than a sample index number (a preset sample-count threshold); the sample text data carries a classification label; the sample text data comprises N sub-texts, where N is a positive integer;
sequentially performing sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any candidate text data comprises one or more consecutive sub-texts of the N sub-texts;
pre-training the classification model based on the sample text data to obtain a pre-trained classification model;
calling a pre-trained classification model to perform classification prediction on the candidate text data, and selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data;
training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on the text data.
In one aspect, an embodiment of the present application provides a data processing apparatus, where the apparatus includes:
the acquisition module is used for acquiring sample text data; the number of sample text data items is less than a sample index number (a preset sample-count threshold); the sample text data carries a classification label; the sample text data comprises N sub-texts, where N is a positive integer;
the processing module is used for sequentially performing sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any candidate text data comprises one or more consecutive sub-texts of the N sub-texts;
the processing module is further used for pre-training the classification model based on the sample text data to obtain a pre-trained classification model;
the processing module is further used for calling the pre-trained classification model to perform classification prediction on the candidate text data, and selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data;
the processing module is also used for training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on the text data.
In one aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to perform some or all of the steps in the above method.
In one aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, are used to perform some or all of the steps of the above method.
Accordingly, an aspect of the present application provides a computer program product or computer program comprising program instructions stored in a computer-readable storage medium. A processor of a computer device reads the program instructions from the computer-readable storage medium and executes them, causing the computer device to execute the data processing method provided above.
In the embodiments of the present application, sample text data can be obtained, and the N sub-texts are sequentially subjected to sliding segmentation based on a sliding window to obtain a plurality of candidate text data. Because the candidate text data are obtained by sample expansion of the sample text data, the number of samples can be increased, avoiding the overfitting problem that a small-sample training set may cause. The classification model is pre-trained based on the sample text data to obtain a pre-trained classification model; the pre-trained classification model is called to perform classification prediction on the candidate text data, and target text data are selected from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data, so that higher-quality target text data can be selected based on the prediction accuracy. The pre-trained classification model is then trained based on the sample text data and the target text data to obtain a trained classification model. The high-quality target text data and the sample text data positively affect model training, so the prediction accuracy of the trained classification model can be improved, and text data can in turn be classified and predicted more accurately by the trained classification model.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show merely some embodiments of the present application, and persons of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application architecture according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 4a is a schematic view of a scene for acquiring candidate text data according to an embodiment of the present disclosure;
fig. 4b is a scene schematic diagram for acquiring candidate text data according to an embodiment of the present disclosure;
fig. 4c is a schematic view of a scene for acquiring candidate text data according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for training a classification model according to an embodiment of the present disclosure;
FIG. 6a is a schematic flow chart illustrating an application classification model according to an embodiment of the present application;
FIG. 6b is a schematic flow chart of a feature engineering process provided in an embodiment of the present application;
fig. 7 is a scene schematic diagram of risk early warning based on a classification model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings.
The data processing method provided in the embodiments of the present application is implemented in an electronic device, which may be a server or a terminal. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, or an aircraft. The embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent traffic, and driving assistance.
Technical terms that may be involved in the fields to which the solutions of the embodiments of the present application apply are described below:
Firstly, artificial intelligence:
The embodiments of the present application relate to Machine Learning (ML) in artificial intelligence. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning. The classification model in the technical solution of the present application can be trained based on machine learning techniques.
In some embodiments, please refer to fig. 1, which is a schematic diagram of an application architecture provided in the embodiments of the present application; the data processing method provided in the present application may be executed through this application architecture. As shown in fig. 1, the architecture may include an electronic device in which a classification model to be trained is deployed. The electronic device can obtain a sample set, where the sample set comprises sample text data and the sample text data comprises N sub-texts; pre-train the classification model using the sample text data; perform data segmentation on the sample text data to obtain a plurality of candidate text data, thereby expanding the sample set, where the data segmentation can be realized based on a sliding window; call the pre-trained classification model to perform classification prediction on the candidate text data; select target text data from the candidate text data according to the prediction results for the candidate text data, where a prediction result represents the prediction accuracy of the model on the candidate text data; and train the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model.
It should be understood that fig. 1 merely illustrates a possible application architecture of the present application, and does not limit the specific architecture of the present application, that is, the present application may also provide other forms of application architectures.
Optionally, in some embodiments, the electronic device may execute the data processing method according to an actual service requirement to improve the prediction accuracy of the obtained model. The technical solution can be applied to a classification scenario for any text data. For example, the text data may be session text data of a target object, and the classification of the session text data may be a risk classification, such as whether the target object has been exposed to malicious information (which may also be called malicious text); the classification result may, for example, indicate that the target object has or has not been exposed to malicious information. It is to be understood that when the classification result indicates exposure to malicious information, it may indicate a session risk; when the classification result indicates no exposure to malicious information, it may indicate no session risk. The electronic device can acquire sample session text data carrying risk classification labels, perform sample expansion on the sample session text data according to the method provided in the technical solution of the present application, and train a classification model capable of risk classification based on the sample session text data together with the obtained candidate session text data.
As another example, the text data may be social text data of the target object, and the classification of the social text data may be an emotion classification, such as a classification of the emotional tendency of the target object (or of so-called emotional text), where the classification result may be, for example, positive or negative. The electronic device can obtain sample social text data carrying emotion classification labels, perform sample expansion on the sample social text data according to the method provided in the technical solution of the present application, and train a classification model capable of emotion classification based on the sample social text data together with the obtained candidate social text data.
Optionally, data related to the present application, such as sample text data, candidate text data, and the like, may be stored in a database, or may be stored in a blockchain, for example, stored by a blockchain distributed system, which is not limited in the present application.
It should be noted that specific implementations of the present application may involve user-related data, such as the user data (e.g., session data or social data) required when constructing a sample set or when actually applying the model. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It is to be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as can be known by those skilled in the art, with the evolution of system architecture and the emergence of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The solutions provided in the embodiments of the present application relate to technologies such as machine learning in artificial intelligence, and are specifically described through the following embodiments:
based on the above description, the embodiments of the present application propose a data processing method, which may be executed by the above-mentioned electronic device. Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure. As shown in fig. 2, the flow of the data processing method according to the embodiment of the present application may include the following steps:
s201, sample text data is obtained.
In some embodiments, the number of sample text data items is less than a sample index number. The sample index number may be set manually and is used to distinguish a small-sample training set from a non-small-sample training set: when the number of samples in a sample set is less than the sample index number, the sample set is regarded as a small-sample training set. That is, the acquired sample text data belongs to a small-sample training set. The classification model can therefore be trained with the sample text data, realizing model training in a small-sample scenario; the training method of the embodiments of the present application can improve the model training effect and prediction accuracy in such a scenario.
In some embodiments, the sample text data may be text of any service type. For example, the text data may be text transcribed from conversation records acquired when an outbound robot conducts intelligent outbound calls with a sample object (e.g., a user), in which the outbound-robot side and the user side of the conversation are marked separately; or it may be social text data (e.g., text combined from comment information posted by the sample object in a social application). The specific type of the sample text data is not limited here. The sample text data may be text in any language, such as Chinese, English, or a mixture of Chinese and English; the present application does not limit the form of the text. The sample text data may contain at least one sentence. A sentence is a basic unit of language composed of words and phrases that expresses a complete meaning; its end is usually marked with an identifier such as a period, question mark, ellipsis, or exclamation mark.
In some embodiments, the sample text data carries a classification label. The classification label can be set by related service personnel according to the sample text data and the actual application scene. For example, the sample text data is sample session text data, and the actual application scenario is risk prediction on the session text data, so the classification tag may indicate a specific risk classification, such as contact with malicious information and no contact with malicious information. Therefore, the trained classification model has different classification functions according to different classification labels.
In some embodiments, each sample text data item may include one or more sub-texts, and the sub-texts of each sample text data item are divided in the same manner; one sample text data item is taken as an example here. Let the sample text data contain N sub-texts, where N is a positive integer. The electronic device may divide the sample text data according to a preset division rule to obtain the N sub-texts. The preset division rule may be division according to separation characters contained in the sample text data; the separation characters may be set by relevant service personnel according to empirical values and may, for example, be identifiers such as commas, periods, and exclamation marks. A sub-text divided in this way may be a complete sentence of the sample text data or part of the characters of a sentence. The preset division rule may also be division of the sample text data by a specified length, for example, every 10 characters of the sample text data form one sub-text; specified characters in the sample text data, such as specified identifiers, may be pre-filtered before division, or left unfiltered. The preset division rule is not limited here.
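By way of illustration only, the two division rules above can be sketched in Python as follows; the separator set, the sub-text length of 10 characters, and the padding character are illustrative assumptions, not values prescribed by the embodiments:

```python
import re

def split_by_separators(text: str, separators: str = ",.!?;，。！？；") -> list:
    # Rule 1: divide the sample text into sub-texts at separation characters.
    parts = re.split("[" + re.escape(separators) + "]", text)
    return [p.strip() for p in parts if p.strip()]

def split_by_length(text: str, length: int = 10, pad: str = "#") -> list:
    # Rule 2: divide the sample text into fixed-length sub-texts,
    # padding the last one with a default character if it is too short.
    chunks = [text[i:i + length] for i in range(0, len(text), length)]
    if chunks and len(chunks[-1]) < length:
        chunks[-1] = chunks[-1].ljust(length, pad)
    return chunks
```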
S202, sequentially carrying out sliding segmentation on the N sub texts based on the sliding window to obtain a plurality of candidate text data of the sample text data.
In some embodiments, the electronic device may perform data slicing on the sample text data to obtain a plurality of candidate text data of the sample text data. The data slicing may be implemented based on a sliding window. Such as sliding slicing of sub-texts contained in the sample text data based on a sliding window.
In some embodiments, the process and principle of obtaining the candidate text data of each sample text data are the same, and the determination process of one sample text data is described as an example. Let the one sample text data contain N sub-text data. Specifically, a sliding window is obtained, where the window size of the sliding window is the size of M sub-texts, and M is a positive integer; and sequentially carrying out sliding segmentation on the N sub texts based on the sliding window to obtain a plurality of candidate text data. When the last candidate text data is segmented, if the number of the remaining sub-texts is less than M, the remaining sub-texts can be directly used as the candidate text data; or setting default sub-texts to supplement the remaining sub-texts, so that the number of the supplemented remaining sub-texts is equal to M, and taking the supplemented remaining sub-texts as candidate text data.
The sliding window may have a fixed size or a variable size; that is, the value of M may be fixed or may vary each time the sliding window slides. For example, the window size may be 3 sub-texts for every slide. As another example, the window size may be 1 sub-text, then 3 sub-texts, then 5 sub-texts, and so on, for successive slides. The size definition rule of the sliding window is not limited here and may be set by the relevant service personnel. Accordingly, any candidate text data comprises one or more consecutive sub-texts of the N sub-texts. It will be appreciated that the length of the sliding window may differ between divisions, depending on the size of the M sub-texts; the size of the M sub-texts is determined by the number of words or characters they contain.
Thus, when the text data is divided with the sub-text as the granularity, the sub-texts within one sliding window yield one candidate text data item, and the candidate text data differs from the corresponding sample text data in information content. Sample expansion can therefore be realized through the candidate text data, alleviating the overfitting problem caused by training a model on small-sample data.
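A minimal, non-authoritative Python sketch of the fixed-window segmentation described in this step, including the remainder handling from the paragraph above (all names are illustrative):

```python
def sliding_candidates(sub_texts: list, window: int = 3, step: int = 1,
                       keep_remainder: bool = True) -> list:
    # Slide a fixed-size window over the sub-texts; each window position
    # yields one candidate text (a run of consecutive sub-texts).
    n = len(sub_texts)
    candidates = [sub_texts[i:i + window] for i in range(0, n - window + 1, step)]
    # Leftover sub-texts shorter than the window may also be kept as a final
    # candidate (or padded with default sub-texts), as described above.
    last_end = (len(candidates) - 1) * step + window if candidates else 0
    if keep_remainder and last_end < n:
        candidates.append(sub_texts[last_end:])
    return candidates

# e.g. 5 sub-texts, window size 3, step 1 -> 3 candidates, as in Fig. 4c (1)
assert len(sliding_candidates(["s1", "s2", "s3", "s4", "s5"], 3, 1)) == 3
```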
S203, pre-training the classification model based on the sample text data to obtain a pre-trained classification model.
In some embodiments, when the electronic device pre-trains the classification model based on the sample text data, it may train the classification model with the sub-texts contained in the sample text data. Specifically, pre-training the classification model based on the sub-texts of the sample text data may comprise: obtaining sample features corresponding to the sample text data based on the sub-texts it contains; calling the classification model to output a classification result for the sample text data based on the sample features; generating a prediction deviation based on the classification result and the classification label of the sample text data; and correcting the model parameters of the classification model based on the prediction deviation. The sample features may include one or more different types of features, such as features obtained without the classification model and features generated by the classification model. The manner of obtaining the sample features may be set by the relevant service personnel according to empirical values and is not limited here. The sample feature types used in pre-training may be the same as those used when training the pre-trained classification model. The specific manner of obtaining the sample features is described in the embodiments below.
S204, calling the pre-trained classification model to perform classification prediction on the candidate text data, and selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data.
In some embodiments, because the information content of the candidate text data varies widely, the candidate text data may differ in sample quality for model training. The electronic device can therefore call the pre-trained classification model to perform classification prediction on the plurality of candidate text data to obtain a prediction result for each candidate text data item, and determine the target text data based on the prediction accuracy represented by the prediction results. The target text data are samples of high quality; training the model with these target text data gives the trained classification model a better training effect and realizes model training in a semi-supervised learning manner, which alleviates the overfitting caused by supervised training of the model on small-sample data.
In some embodiments, any one of the plurality of candidate text data is denoted target candidate text data; the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model for the target candidate text data comprises a prediction category and a prediction probability for that category. Selecting the target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model then works as follows: if the prediction category for the target candidate text data is the same as the category indicated by its classification label, and the prediction probability is greater than a probability threshold, it is determined that the pre-trained classification model predicts the target candidate text data accurately, and the target candidate text data is taken as target text data. The probability threshold may be set by the relevant service personnel based on empirical values, such as 0.9.
The prediction accuracy can measure the prediction performance of the pre-trained classification model on the candidate text data. The prediction probability can be understood as a confidence level, and when the pre-trained classification model can predict the prediction category of the candidate text data with a higher prediction probability (which can be understood as being higher than a confidence level threshold) and the prediction category is correct, it indicates that the pre-trained classification model has a certain capability of predicting the correct result of the candidate text data, and indicates that the candidate text data is highly reliable for the classification model, so that the candidate text data can be selected to train the model, so that the model can learn the features in the candidate text data.
In some embodiments, the classification label of the target candidate text data may be the same classification label as the corresponding sample text data, or may be obtained by determining the classification label of the target candidate text data according to a classification label determination method of the sample text data. For example, the sample text data a is subjected to data segmentation to obtain candidate sample text data a.1 and candidate sample text data a.2, and the sample text data B is subjected to data segmentation to obtain candidate sample text data b.1 and candidate sample text data b.2, so that the classification labels of the candidate sample text data a.1 and the candidate sample text data a.2 are the same as the sample text data a, and the classification labels of the candidate sample text data b.1 and the candidate sample text data b.2 are the same as the sample text data B; or labeling the classification labels of the sample text data A and the sample text data B according to a specified mode, and labeling the classification labels of the candidate sample text data A.1, the candidate sample text data A.2, the candidate sample text data B.1 and the candidate sample text data B.2 according to the specified mode.
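The selection rule of this step can be sketched as follows; the `(class, probability)` prediction interface and the 0.9 threshold (taken from the example above) are assumptions for illustration:

```python
def select_target_texts(model, candidates: list, candidate_labels: list,
                        prob_threshold: float = 0.9) -> list:
    # Keep only candidates that the pre-trained model classifies both
    # correctly and confidently; these become the target text data.
    targets = []
    for cand, label in zip(candidates, candidate_labels):
        pred_class, pred_prob = model.predict(cand)  # assumed model interface
        if pred_class == label and pred_prob > prob_threshold:
            targets.append(cand)
    return targets
```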
S205, training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model.
The trained classification model is used for classifying and predicting the text data.
In some embodiments, the process by which the electronic device trains the pre-trained classification model based on the sample text data and the target text data may be the same as the pre-training process. Specifically, any one item of the sample text data and target text data is denoted target sample data; the sample features of the target sample data are acquired, the pre-trained classification model is called to output a classification result for the target sample data based on the sample features, a prediction deviation is generated based on the classification result and the classification label carried by the target sample data, and the model parameters of the pre-trained classification model are corrected based on the prediction deviation to obtain a trained classification model. Further, the electronic device may continue, according to the above process, to input the candidates other than the target text data into the trained classification model for classification prediction, select new target text data from those candidates based on the classification prediction results, and add the new target text data to the sample set to continue training the trained classification model.
In addition, the electronic equipment can take the sample text data and the target text data as a new sample set, take data except the target text data in the candidate text data as a new candidate set, and continue iterative training on the trained classification model based on the new sample set and the new candidate set according to the process until the model performance is not obviously improved, namely the model converges, so that the finally trained classification model is obtained.
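A sketch of the iterative semi-supervised loop just described, assuming a classifier object exposing `fit` and a `predict` that returns a `(class, probability)` pair; neither interface is prescribed by the embodiments:

```python
def iterative_training(model, samples: list, labels: list, pool: list,
                       rounds: int = 10, threshold: float = 0.9):
    # pool: list of (candidate_text, candidate_label) pairs.
    model.fit(samples, labels)  # pre-training on the small sample set
    for _ in range(rounds):
        targets = []
        for cand, lab in pool:
            pred_class, pred_prob = model.predict(cand)
            # Selection step (S204): correct category predicted with high confidence.
            if pred_class == lab and pred_prob > threshold:
                targets.append((cand, lab))
        if not targets:
            break  # no new reliable candidates: treat the model as converged
        samples = samples + [c for c, _ in targets]
        labels = labels + [l for _, l in targets]
        model.fit(samples, labels)  # retrain on the expanded sample set
        pool = [pair for pair in pool if pair not in targets]
    return model
```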
In the embodiments of the present application, sample text data can be obtained, and the N sub-texts are sequentially subjected to sliding segmentation based on a sliding window to obtain a plurality of candidate text data of the sample text data. Because the candidate text data are obtained by sample expansion of the sample text data, the number of samples can be increased, avoiding the overfitting problem that a small-sample training set may cause. The classification model is pre-trained based on the sample text data to obtain a pre-trained classification model; the pre-trained classification model is called to perform classification prediction on the candidate text data, and target text data are selected from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data, so that higher-quality target text data can be selected based on the prediction accuracy. The pre-trained classification model is then trained based on the sample text data and the target text data to obtain a trained classification model. The high-quality target text data and sample text data positively affect model training, so the prediction accuracy of the trained classification model can be improved, and text data can in turn be classified and predicted more accurately by the trained classification model.
Referring to fig. 3, fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present disclosure, where the method can be executed by the above-mentioned electronic device. As shown in fig. 3, the flow of the data processing method in the embodiment of the present application may include the following steps:
s301, sample text data is obtained.
In some embodiments, the sample text data carries a classification label. The sample text data may be any type of text, and the classification label may be any type of label. For ease of explanation, risk classification of session text data is taken as an example here. When the sample text data is sample session text data, text data transcribed from session records between an outbound robot and different users may be acquired (that is, multi-turn dialog text, which refers to a continuous, progressive dialog process conducted in context to reach a specific target); or text data transcribed from conversation records between related personnel and different users may be acquired; and so on. The composition of the session text data is not limited here.
In some embodiments, because the session record contains the user's natural language (such as spoken expressions, words, or phrases) and is easily influenced by the environment of the communication session, the transcribed text data is prone to transcription errors and dirty data (such as invalid characters). The sample text data may therefore be pre-processed in advance, and the process indicated in the embodiments of the present application may then be performed on the pre-processed sample text data. The pre-processing may include removing dirty data, removing stop words, de-duplicating continuously repeated text, and the like; the specific types of pre-processing are not limited here.
In some embodiments, each sample text data item may contain one or more sub-texts. Let a sample text data item include N sub-texts. Specifically, the separation characters in the sample text data may be detected and the sample text data divided into N sub-texts based on the detected separation characters; or the sub-texts may be obtained by dividing the sample text data by a defined specified length (if the remaining characters are fewer than the specified length when the last sub-text is divided, default characters may be supplemented so that the size of the last sub-text satisfies the specified length).
S302, sequentially performing sliding segmentation on the N sub-texts based on the sliding window to obtain a plurality of candidate text data of the sample text data.
In some embodiments, the process and principle by which the electronic device performs sliding segmentation on each sample text data item based on the sliding window to obtain the corresponding candidate text data are the same; the process of determining candidate text data from one sample text data item is described as an example. Determining the candidate text data of the sample text data may be a process of sliding-slicing the N sub-texts contained in the sample text data based on a sliding window. The segmentation process may be designed in conjunction with the idea of the N-gram (a language model). The sliding window slides with the sub-text as the granularity; the window size of the sliding window is the size of M sub-texts, and when the sizes of different runs of M consecutive sub-texts change, the window size of the corresponding sliding window changes accordingly.
The sliding window may be divided into a fixed window and a variable window; that is, M is a positive integer, and M may be a fixed value or a variable value. When the sliding window is a fixed window, a window size c and a step length i need to be defined: the window size determines how much data the current window can cover, the step length determines the sliding distance, and both c and i are counted in numbers of sub-texts. Assuming that the current sample text data contains t sub-texts, the number k of candidate text data generated by sliding the window is:

$k = \lfloor (t - c) / i \rfloor + 1$

where $\lfloor \cdot \rfloor$ denotes the floor (round-down) operation, which discards the fractional part, e.g. $\lfloor 3.7 \rfloor = 3$.

For example, if the sample text data includes 16 sub-texts, the window size is 3, and the step size is 2, then $k = \lfloor (16 - 3)/2 \rfloor + 1 = 7$ candidate text data items are generated by sliding the window, and each candidate text data item includes 3 complete sub-texts.
When the sliding window is a variable window, a step length i needs to be defined, and the window size changes according to a specified rule. For example, if the window start point is fixed and the window end point grows by the step size i from an initial size $c_0$, the number k of candidate text data generated by the variable window is:

$k = \lceil (t - c_0) / i \rceil + 1$

where $\lceil \cdot \rceil$ denotes the ceiling (round-up) operation, which rounds any fractional value up to the next integer, e.g. $\lceil 3.2 \rceil = 4$.
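As a quick consistency check, a short Python sketch of the two counting formulas, verified against the worked examples in this section:

```python
import math

def num_candidates_fixed(t: int, c: int, i: int) -> int:
    # k = floor((t - c) / i) + 1 for a fixed window of size c and step i.
    return math.floor((t - c) / i) + 1

def num_candidates_variable(t: int, c0: int, i: int) -> int:
    # k = ceil((t - c0) / i) + 1 for a window end point growing from c0 by step i.
    return math.ceil((t - c0) / i) + 1

assert num_candidates_fixed(16, 3, 2) == 7     # example in this section
assert num_candidates_variable(5, 3, 1) == 3   # example of Fig. 4c (2)
```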
In some embodiments, the candidate text data may also be obtained by performing sliding segmentation on the N sub-texts with a sliding window to obtain R sub-text sets, and combining the R sub-text sets into candidate text data corresponding to the sample text data; the data contained in one sub-text set is spliced into a whole and used as one sub-text. The candidate text data in this case differs from the sample text data in data content.
For example, as shown in fig. 4a to 4c, fig. 4a to 4c are schematic views of scenes for acquiring candidate text data according to an embodiment of the present application. As shown in fig. 4a: 1) one or more sample text data items are acquired, and each is divided based on specified rule 1 to obtain the sub-texts it contains (the number of sub-texts in each sample text data item may be the same or different); 2) the sub-texts contained in each sample text data item are taken as an original sample set, which is used for training the model; a further portion of text data may be obtained by the same method and taken as a test sample set; 3) the relevant parameters of the sliding window are defined; 4) the sub-texts contained in each sample text data item are traversed and data-segmented based on the relevant parameters and specified rule 2 to obtain a plurality of candidate text data; 5) the plurality of candidate text data are taken as a candidate sample set, and the total sample set used for training the classification model is obtained from the original sample set (which may also comprise the test sample set) and the candidate sample set;
based on the above (2): taking one sample text data item (sample 1) as an example, as shown in (1) in fig. 4b, specified rule 1 may be division based on separation characters, that is, sample 1 may be divided at its commas and periods to obtain 5 sub-texts; as shown in (2) in fig. 4b, specified rule 1 may be division of the character-filtered text data by a specified length, that is, sample 1, with its separation characters filtered out, is divided every 5 characters to obtain 6 sub-texts; in this case, when sub-text 6 is divided, the remaining characters are insufficient, so characters can be supplemented so that sub-text 6 contains 5 characters;
based on the above (4): as shown in (1) in fig. 4c, the specified rule 2 may be sliding segmentation according to a fixed window, and a sub-text in a fixed sliding window is used as a candidate text data, that is, 5 sub-texts of the sample 1 may be divided according to a window size of 3 and a step length of 1, so as to obtain 3 candidate text data; when the last candidate text data is divided, if the number of the remaining sub-texts is less than 3, the remaining sub-texts can be directly used as the candidate text data, or the remaining sub-texts are supplemented based on the default sub-texts, so that the number of the supplemented remaining sub-texts is equal to 3, and the supplemented remaining sub-texts are used as the candidate text data;
as shown in (2) in fig. 4c, specified rule 2 may be sliding segmentation according to a variable window, with the sub-texts in one variable window taken as one candidate text data item; that is, the 5 sub-texts of sample 1 may be divided with a step size of 1 and an initial window end-point size of 3 to obtain 3 candidate text data items;
as shown in (3) in fig. 4c, specified rule 2 may be sliding segmentation according to fixed windows, with the sub-texts in each fixed window combined into candidate text data; that is, the 5 sub-texts of sample 1 may be divided with a window size of 3 and a step length of 2 to obtain 2 sub-text sets, the data in each sub-text set is spliced into one sub-text, and the 2 sub-text sets are combined to obtain candidate text data.
In some embodiments, after the plurality of candidate text data are obtained, it may further be determined whether duplicate data exist among them, and a de-duplication operation may be performed on the plurality of candidate text data. The classification labels of the candidate text data may be determined in the same manner as the classification labels of the sample text data.
S303, pre-training the classification model based on the sample text data to obtain the pre-trained classification model. The specific implementation of step S303 may refer to the related description of the above embodiments, and is not described herein again.
S304, calling the pre-trained classification model to perform classification prediction on the candidate text data, and selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data.
In some embodiments, the electronic device may call a pre-trained classification model to perform classification prediction on multiple candidate text data to obtain a classification prediction result for each candidate text data, and determine the target text data according to prediction accuracy characterized by the classification prediction result.
In some embodiments, any one of the candidate text data is denoted target candidate text data; the target candidate text data has a classification label, and the manner of determining this classification label can be found in the related description of the above embodiments. If the classification prediction result of the pre-trained classification model for the target candidate text data comprises a prediction category and a prediction probability for that category, then target text data is selected from the candidate text data according to the prediction accuracy of the pre-trained classification model as follows: if the prediction category is the same as the category indicated by the classification label and the prediction probability is greater than the probability threshold, prediction accuracy is established, and the target candidate text data is taken as target text data.
In some embodiments, if the classification prediction result of the pre-trained classification model on the target candidate text data includes a prediction category of the target candidate text data, then according to the prediction accuracy of the pre-trained classification model on the plurality of candidate text data, the target text data may be selected from the plurality of candidate text data, and if the prediction category for the target candidate text data is the same as the category indicated by the classification label of the target candidate text data, it is determined that the pre-trained classification model has prediction accuracy on the target candidate text data, and it is determined that the target candidate text data is the target text data.
S305, obtaining a first sample feature corresponding to the sample text data and a second sample feature corresponding to the target text data, calling the pre-trained classification model, and outputting a first classification result for the sample text data based on the first sample feature and a second classification result for the target text data based on the second sample feature.
In some embodiments, the first sample feature corresponding to the sample text data and the second sample feature corresponding to the target text data are processed in the same manner and on the same principle; any one item of the sample text data and the target text data (denoted target sample data) is taken as an example here. The electronic device can perform feature engineering on the target sample data to obtain the corresponding sample features, and call the pre-trained classification model to predict on the sample features so as to output a classification result for the target sample data. The feature engineering may be set by the relevant service personnel according to the specific application scenario, and the obtained sample features may include sample text statistical features for the target sample data and/or sample text features of the target sample data generated by calling the pre-trained classification model. The sample text statistical features may include at least one of the following: part-of-speech frequency distribution features of the characters in the target sample data, statistical features of the sub-texts in the target sample data, or statistical features of the classification keywords in the target sample data. The feature dimensions of the various features in the sample features are fixed; that is, the feature dimension of the same feature is the same across different sample features, while the feature dimensions of different features may be the same or different.
In some embodiments, the electronic device invoking the pre-trained classification model to predict the sample features may be to sequentially splice a plurality of features included in the sample features to obtain spliced sample features, and invoke the pre-trained classification model to predict the spliced sample features.
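The splicing amounts to concatenating the feature groups in a fixed order; a trivial sketch, assuming each group is already a numeric vector:

```python
import numpy as np

def splice_features(*feature_groups) -> np.ndarray:
    # Concatenate the feature groups in a fixed order so that the spliced
    # sample feature always has the same dimensionality across samples.
    return np.concatenate([np.asarray(g, dtype=float) for g in feature_groups])
```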
In some embodiments, the part-of-speech frequency distribution features for the characters in the target sample data may be obtained by performing jieba word-segmentation analysis on the target sample data to obtain the part of speech of each segmented word, and generating the part-of-speech frequency distribution features from the parts of speech of the segmented words. For example, if the word segmentation set L for the target sample data contains s words, the part-of-speech frequency value $f_{tc}$ of the words whose part-of-speech type is tc is:

$f_{tc} = \frac{\left|\{x \in L \mid t(x) = tc\}\right|}{s}$

where t(x) is defined as a function returning the part of speech of an input participle x.
Therefore, a plurality of part-of-speech frequency values can be calculated according to the above formula and spliced to obtain the part-of-speech frequency distribution feature; alternatively, the part-of-speech frequency values are mapped into a specified range and the mapped values are spliced to obtain the part-of-speech frequency distribution feature. For example, if the verb frequency is 0.3, the noun frequency is 0.1, and so on, the part-of-speech frequency distribution feature is [0.3, 0.1, ...]; alternatively, if the part-of-speech frequencies are mapped to between 0 and 10, so that the mapped verb frequency is 3, the mapped noun frequency is 1, and so on, the part-of-speech frequency distribution feature is [3, 1, ...]. The composition of the part-of-speech frequency distribution features can be seen, for example, in Table 1 below:
TABLE 1 (rendered as an image in the original publication; contents not reproduced)
Parts of speech may be truncated according to the specific application scenario; for example, parts of speech whose frequency value is 0 or below a preset threshold may be cut off. In addition, when the session text data is obtained from session records between the outbound robot and the user, the outbound-robot side of the session usually follows a specified dialog template, while the user's side of the session carries a large amount of information. Therefore, when the part-of-speech frequency distribution feature is calculated, the target sub-texts belonging to the user side can be obtained from the N sub-texts, and the part-of-speech frequency distribution feature can be determined based on those target sub-texts. Because session text data transcribed from session records may suffer from context conflicts (for example, a user initially denies a certain behavior but subsequently acknowledges it) and transcription-quality problems (such as spoken-language text), adding the part-of-speech frequency distribution feature effectively characterizes spoken-text properties, counteracts the text problems caused by context conflicts and speech transcription, and, from the perspective of emphasizing the user's spoken language, compensates for the distortion of semantic information and eliminates semantic loss, greatly improving the model effect.
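A sketch of the part-of-speech frequency feature using jieba's part-of-speech module (`jieba.posseg`); the fixed tag vocabulary passed in is an illustrative assumption:

```python
from collections import Counter

import jieba.posseg as pseg  # jieba's part-of-speech tagging module

def pos_frequency_features(text: str, pos_tags: list) -> list:
    # f_tc = (number of segmented words whose POS tag is tc) / (total words),
    # computed over a fixed tag vocabulary so the feature dimension is constant.
    words = list(pseg.cut(text))            # items carry .word and .flag (POS tag)
    counts = Counter(w.flag for w in words)
    s = max(len(words), 1)                  # guard against empty text
    return [counts.get(tag, 0) / s for tag in pos_tags]

# Example with an assumed tag vocabulary: 'v' verb, 'n' noun, 'r' pronoun, 't' time word.
features = pos_frequency_features("今天转账给他了", ["v", "n", "r", "t"])
```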
In some embodiments, the statistical features of the sub-texts in the target sample data specifically include, but are not limited to, the following types: the number of dialog turns with the user determined from the sub-texts, the average length of the sub-texts, the standard deviation of the sub-text lengths, and so on. It is to be understood that the statistical features may differ depending on the type of the sample text data.
Therefore, various types of statistical results can be calculated and spliced to obtain the statistical features of the sub-texts in the target sample data. The composition of the statistical features can be seen, for example, in Table 2 below:
[Table 2: composition of the sub-text statistical characteristics (table image not recoverable)]
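A minimal sketch of these sub-text statistics, assuming session-style sample text data (the exact statistic set is scenario-dependent, as noted above):

```python
# Sketch: sub-text statistical characteristics (turn count, average length,
# length standard deviation), concatenated into one feature list.
import statistics

def subtext_statistics(sub_texts: list[str]) -> list[float]:
    lengths = [len(t) for t in sub_texts]
    return [
        float(len(sub_texts)),                          # number of dialog turns
        statistics.mean(lengths) if lengths else 0.0,   # average sub-text length
        statistics.pstdev(lengths) if lengths else 0.0, # standard deviation of sub-text lengths
    ]
```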
Since a large amount of information is contained in the user's side of the session, part of the statistical characteristics may likewise be determined from the target sub-texts obtained from the user side when computing the statistical characteristics for the sub-texts in the target sample data. The specific rule can be set by the relevant service personnel according to experience.
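As a sketch of restricting computation to the user side, assuming each sub-text carries a speaker tag (the tagging scheme is an assumption; the application only specifies that user-side target sub-texts are used):

```python
# Sketch: select the user-side target sub-texts from (speaker, text) pairs.
def user_side_subtexts(turns: list[tuple[str, str]]) -> list[str]:
    return [text for speaker, text in turns if speaker == "user"]
```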
In some embodiments, the statistical characteristics for the classification keywords in the target sample data may be obtained by taking a specified set of classification keywords (for example, risk-related keywords such as "transfer" or a risk application name), detecting whether the target sample data contains each classification keyword, generating a one-hot code for each classification keyword according to the detection result, and concatenating the one-hot codes of all classification keywords to obtain the statistical characteristics of the classification keywords. The classification keywords can be set by the relevant business personnel according to the actual scenario, and so can the one-hot encoding rule for the classification keywords. For example, if the target sample data contains a classification keyword, its one-hot code is a first numerical value; if not, its one-hot code is a second numerical value. Thus, if the keyword "transfer" is contained, the corresponding one-hot code is 1; if the keyword "risk application name" is not contained, the corresponding one-hot code is 0; and concatenation yields the statistical characteristics [1, 0]. The composition of the statistical characteristics of the classification keywords can be seen, for example, in table 3 below:
[Table 3: composition of the classification keyword statistical characteristics (table image not recoverable)]
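A sketch of the classification keyword characteristic under the first/second numerical value rule above (1 for present, 0 for absent); the keyword list is a placeholder to be set by business personnel:

```python
# Sketch: one indicator per classification keyword, concatenated as in the
# [1, 0] example above.
KEYWORDS = ["transfer", "risk application name"]  # hypothetical keyword set

def keyword_features(text: str) -> list[int]:
    return [1 if kw in text else 0 for kw in KEYWORDS]
```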
Since a large amount of information is contained in the user's side of the session, when calculating the statistical characteristics of the classification keywords, the target sub-texts belonging to the user side can be obtained and the presence of the classification keywords determined within those target sub-texts. The specific rule can be set by the relevant service personnel according to experience.
In some embodiments, the sample text features may be obtained by invoking the pre-trained classification model to generate a text vector of the target sample data and taking the text vector as the sample text features. The text vector may be generated by producing a sentence vector for each sub-text and determining the text vector of the target sample data from the sentence vectors of the sub-texts, for example as the average of those sentence vectors, although this is not limited here; alternatively, a word vector may be produced for each segmented word in a sub-text, the sentence vector of the sub-text determined from those word vectors (for example as their average, again without limitation), and the text vector of the target sample data then determined from the sentence vectors of the sub-texts in the manner described above.
In some embodiments, when the text vector is generated by invoking the pre-trained classification model to produce a sentence vector for each sub-text, a feature generation layer may be built in the classification model; the feature generation layer includes a sentence vector generation network, which may be built following the idea of doc2vec (a paragraph vector tool that extends word2vec, i.e. word-to-vector). When the text vector is generated by invoking the pre-trained classification model to produce a word vector for each segmented word, the feature generation layer includes a word vector generation network, which may be built on word2vec in CBoW (Continuous Bag-of-Words) form, word2vec in Skip-Gram (word-skipping) form, or GloVe (Global Vectors for Word Representation, a word representation tool based on global word frequency statistics).
In some embodiments, in order for the sample text features of the target sample data to better cover semantic information, the technical scheme of the application adopts 200-dimensional word2vec word vector features. Specifically, the word segmentation set corresponding to the target sample data may be obtained, a 200-dimensional word2vec model may be trained in the classification model using gensim (a natural language processing library), and the word segmentation set may be vector-converted by the trained word2vec model to obtain the word vectors corresponding to the target sample data. The sample text features may also be determined in other ways, which are not limited here.
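A minimal sketch of the 200-dimensional word2vec features with gensim, assuming the corpus is already word-segmented and averaging word vectors into a text vector (one of the averaging options described above); the toy corpus is a placeholder:

```python
# Sketch: train a 200-dim word2vec model with gensim and average word vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [["user", "said", "transfer"], ["robot", "asked", "amount"]]  # toy segmented corpus
w2v = Word2Vec(sentences=corpus, vector_size=200, window=5, min_count=1)

def text_vector(words: list[str]) -> np.ndarray:
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]  # per-word 200-dim vectors
    return np.mean(vecs, axis=0) if vecs else np.zeros(200)
```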
Because the outbound-robot side of the session in the sample text data usually follows a fixed template, a default vector corresponding to the robot-side sub-texts can be preset; when the classification model is invoked to generate the sample text features, vector conversion may be performed only on the user-side target sub-texts, and the sample text features obtained from the conversion vectors of the target sub-texts together with the default vector.
In some embodiments, the classification model may be constructed on any model structure and concept. The classification model may include a feature prediction layer, which can be understood as a classifier producing the classification result; when the classification result indicates one of two classes, the classifier is a binary classifier. The electronic device can choose a suitable modeling idea for the feature prediction layer according to the specific sample features. For example, if the statistical characteristics and part-of-speech frequency distribution characteristics of the sub-texts are used, traditional machine learning models such as a logistic regression classifier, a random forest classifier, or a decision tree classifier may be used; if the sample text features are used, a deep learning model suited to text features, such as an RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), or GRU (Gated Recurrent Unit), may be used. The technical scheme of the application selects a bidirectional LSTM as the model structure of the feature prediction layer in the classification model. The classification model may also take other model structures, which are not limited here.
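An illustrative bidirectional-LSTM feature prediction layer with a binary classifier head; the layer widths and input dimensionality are assumptions, as the application only fixes the bidirectional LSTM choice:

```python
# Sketch: bidirectional LSTM over a sequence of 200-dim word vectors,
# ending in a binary (two-class) classifier.
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 200))                      # variable-length word-vector sequence
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)     # binary classification result
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```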
S306: generating a prediction deviation for the pre-trained classification model based on the first classification result and the second classification result, and correcting the model parameters of the pre-trained classification model based on the prediction deviation to obtain the trained classification model.
In some embodiments, the electronic device may construct a loss function, generate a prediction deviation for the pre-trained classification model based on the classification result for the target sample data and the classification label carried by the target sample data, and correct the model parameters of the pre-trained classification model based on the prediction deviation, obtaining the trained classification model through continuous iterative semi-supervised training. It will be appreciated that the pre-training process of the classification model is the same as the above process for training the pre-trained classification model and uses the same types of sample features.
For example, as shown in fig. 5, a schematic flowchart of training a classification model according to an embodiment of the present application, the process comprises the following steps: 1) pre-training the classification model with an original sample set (i.e., one or more sample text data) to obtain a pre-trained classification model; 2) invoking the pre-trained classification model to perform classification prediction on a candidate sample set (i.e., a plurality of candidate text data), obtaining for each candidate text data a classification prediction result comprising a prediction category and a prediction probability; 3) dividing the candidate sample set into a high-confidence candidate sample set and a low-confidence candidate sample set according to the prediction accuracy represented by each classification prediction result, specifically: candidate data whose prediction probability for the prediction category is greater than a probability threshold form the high-confidence candidate sample set, and candidate data whose prediction probability is less than or equal to the threshold form the low-confidence candidate sample set; 4) selecting target text data from the high-confidence candidate sample set and adding it to the original sample set to obtain a new sample set, specifically: taking the candidate text data whose prediction category matches the category indicated by the classification label as target text data, or randomly sampling the high-confidence candidate sample set and taking the sampled candidate text data whose prediction category matches the label-indicated category as target text data; the candidate text data other than the target text data form a new candidate sample set; 5) continuing from step 1) with the new sample set and the new candidate sample set, and when the performance of the classification model no longer improves appreciably after multiple rounds of training, i.e. successive prediction results on a test sample set are essentially unchanged, obtaining the trained classification model and an expanded sample set (i.e., the original sample set used in the final round of training). A new classification model may subsequently be trained in a supervised manner on the expanded sample set.
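Purely as a condensed sketch of steps 1) to 5) above: `train` and `predict` are passed in as placeholders for the training and prediction routines, 0.9 is an assumed probability threshold, and each candidate is assumed to carry the classification label inherited from its sample text data (exposed here as `sample.label`):

```python
# Sketch: iterative semi-supervised expansion of the original sample set.
def self_training(original, candidates, train, predict, threshold=0.9, max_rounds=10):
    model = train(original)                                    # 1) (pre-)train on the current sample set
    for _ in range(max_rounds):
        promoted, remaining = [], []
        for sample in candidates:
            category, prob = predict(model, sample)            # 2) classification prediction
            if prob > threshold and category == sample.label:  # 3)+4) high-confidence, label-consistent
                promoted.append(sample)
            else:
                remaining.append(sample)                       # stays in the candidate sample set
        if not promoted:                                       # 5) no appreciable improvement possible
            break
        original, candidates = original + promoted, remaining
        model = train(original)                                # retrain on the expanded sample set
    return model, original                                     # trained model and expanded sample set
```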
In some embodiments, taking as an example the use of the trained classification model to classify the risk of text data (here, session text data), the application may obtain the session text data of a target object (such as a target user), obtain the session text statistical characteristics of the session text data, invoke the trained classification model to generate the session text features of the session text data, and invoke the trained classification model to output, based on the session text statistical characteristics and the session text features, a risk classification result indicating that the session text data does or does not carry a session risk. The session text data can be obtained by the outbound robot making an intelligent outbound call to the target object. Subsequently, if the risk classification result indicates that the session text data carries a session risk, the target object is warned, for example by manual intervention.
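A sketch of this application step, standing in a scikit-learn logistic regression for the trained classification model (the application also names logistic regression as a candidate classifier) and fitting it on random demo data purely so the snippet runs; the feature dimensions are assumptions:

```python
# Sketch: concatenate session text statistical characteristics with session
# text features and output a risk classification result.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(16, 203))             # 3 statistical dims + 200 text-vector dims (assumed)
y_demo = rng.integers(0, 2, size=16)            # 1 = session risk, 0 = no session risk
clf = LogisticRegression().fit(X_demo, y_demo)  # stands in for the trained classification model

def session_risk(stat_feats: list[float], text_feats: np.ndarray) -> bool:
    x = np.concatenate([np.asarray(stat_feats), text_feats]).reshape(1, -1)
    return bool(clf.predict(x)[0])              # True -> early-warn the target object
```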
In some embodiments, there may be a plurality of trained classification models, and the risk classification results output by the trained classification models jointly determine whether to issue an early warning. The plurality of trained classification models can each be obtained by training according to the above process; alternatively, one trained classification model and the final expanded sample set can be obtained according to the above process, and the expanded sample set then used to train the remaining classification models.
For example, as shown in fig. 6a, a schematic flowchart of applying the classification model according to an embodiment of the present application: 1) acquiring session text data and processing it to obtain the N sub-session text data it contains; 2) performing feature engineering on the N sub-session text data for a plurality of trained classification models (for example: a "contacted malicious information or not" classification model, a "transferred money or not" classification model, and so on), specifically: acquiring the session text statistical characteristics from the N sub-session text data, invoking each trained classification model to generate its own session text features, and taking the session text statistical characteristics together with each model's session text features as that model's input session features; 3) invoking the plurality of trained classification models to predict from their respective input session features, obtaining a plurality of risk classification results (for example: "contacted malicious information", "transferred money", "did not download a risk application", and so on); 4) determining from the plurality of risk classification results whether the session text data carries a session risk and whether to issue an early warning (for example, manual intervention).

The feature engineering on the N sub-session text data may proceed as shown in fig. 6b: 1) determining the statistical characteristics of the sub-session text data in the session text data, such as the average length of the user-side sub-session text data among the N sub-session texts; 2) determining the part-of-speech frequency distribution characteristics for the characters in the session text data; 3) determining the statistical characteristics for the classification keywords in the session text data; and 4) invoking the trained classification model to generate the sample text features of the session text data.
For another example, as shown in fig. 7, fig. 7 is a scene schematic diagram of risk early warning based on a classification model according to an embodiment of the present application; taking risk classification of the session text data as an example:
acquiring early warning data which can comprise object information, risk sources and the like of a target object; the early warning data can be generated and sent by upstream business equipment, and can also be detected and generated by electronic equipment; for example, a security program is installed in a terminal of a target object, and the electronic device may be a background device of the security program and has a right to detect an abnormal operation behavior on the terminal, such as answering a malicious call; or a security plug-in is embedded in a target application (such as a browser) installed in a terminal of a target object, and the electronic device may be a background device of the security plug-in and has a right to detect abnormal operation behaviors on the target application, such as browsing a malicious website;
an intelligent voice module in the electronic device or the outbound device can initiate an intelligent outbound to a terminal of a target object based on the early warning data to obtain a session record, process the session record, for example, transcribe and display the session record, and splice the transcribed session content to obtain session text data; the template for carrying out intelligent outbound call on different target objects can be different and can be specifically set by related service personnel;
and performing risk prediction based on the session text data to obtain a processing result, for example: invoking a plurality of trained classification models to perform classification prediction on the session text data, obtaining a plurality of classification prediction results (such as "transferred money", "contacted malicious information", and so on), and determining based on these classification prediction results whether a risk early warning is required (for example, whether manual intervention is needed);
Thus, the overall processing flow for the trained classification model may include a training process comprising three modules (sample preprocessing, feature engineering, and iterative semi-supervised training) and an application process comprising one module (text classification). See table 4 below for details:
[Table 4: overall processing flow of the trained classification model, covering sample preprocessing, feature engineering, iterative semi-supervised training, and text classification (table images not recoverable)]
In the embodiment of the application, sample text data can be obtained and the N sub-texts segmented in turn by a sliding window to obtain a plurality of candidate text data of the sample text data; because the candidate text data are obtained by expanding the sample text data, the number of samples can be increased and the overfitting problem that a small sample training set may produce can be avoided. The classification model is pre-trained on the sample text data to obtain a pre-trained classification model; the pre-trained classification model is invoked to perform classification prediction on the candidate text data, and target text data are selected from the candidate text data according to the prediction accuracy of the pre-trained classification model on the candidate text data, so that target text data of higher quality can be chosen on the basis of prediction accuracy. First sample features corresponding to the sample text data and second sample features corresponding to the target text data are then obtained; the pre-trained classification model is invoked to output a first classification result for the sample text data based on the first sample features and a second classification result for the target text data based on the second sample features; a prediction deviation for the pre-trained classification model is generated from the first and second classification results, and the model parameters of the pre-trained classification model are corrected based on the prediction deviation to obtain the trained classification model. The high-quality target text data together with the sample text data thus benefit model training, improving the prediction accuracy of the trained classification model so that text data can subsequently be classified more accurately by the trained classification model.
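A sketch of the sliding segmentation summarized above: windows of each size slide over the N sub-texts so that every candidate text data is a run of consecutive sub-texts; the window sizes are assumptions, since the application does not fix them:

```python
# Sketch: generate candidate text data by sliding windows over the N sub-texts.
def sliding_candidates(sub_texts: list[str], window_sizes=(1, 2, 3)) -> list[str]:
    candidates = []
    for w in window_sizes:
        for i in range(len(sub_texts) - w + 1):
            candidates.append("".join(sub_texts[i:i + w]))  # consecutive sub-texts only
    return candidates
```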
Please refer to fig. 8, which is a schematic structural diagram of a data processing apparatus according to the present application. It should be noted that the data processing apparatus shown in fig. 8 is used for executing the methods of the embodiments shown in fig. 2 and fig. 3 of the present application; for convenience of description, only the portions related to the embodiments of the present application are shown, and for undisclosed technical details reference is made to the embodiments shown in fig. 2 and fig. 3 of the present application. The data processing apparatus 800 may include: an obtaining module 801 and a processing module 802. Wherein:
an obtaining module 801, configured to obtain sample text data; the number of the sample text data is less than the number of the sample indexes; the sample text data carries a classification label; the sample text data comprises N sub-texts, wherein N is a positive integer;
the processing module 802 is configured to sequentially perform sliding segmentation on the N sub-texts based on the sliding window to obtain a plurality of candidate text data of the sample text data; any candidate text data comprises at least one arbitrary continuous sub text in the N sub texts;
the processing module 802 is further configured to pre-train the classification model based on the sample text data to obtain a pre-trained classification model;
the processing module 802 is further configured to invoke the pre-trained classification model to perform classification prediction on the multiple candidate text data, and select target text data from the multiple candidate text data according to prediction accuracy of the pre-trained classification model for the multiple candidate text data;
the processing module 802 is further configured to train the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on the text data.
In some embodiments, the obtaining module 801 is further configured to:
detecting separator characters in the sample text data;
the sample text data is divided into N sub-texts based on the detected separator characters.
In some embodiments, any one of the plurality of candidate text data is represented as a target candidate text data, the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model on the target candidate text data comprises a prediction category of the target candidate text data and a prediction probability for the prediction category;
the processing module 802 is specifically configured to, when configured to select target text data from the multiple candidate text data according to prediction accuracy of the pre-trained classification model for the multiple candidate text data:
if the prediction category for the target candidate text data is the same as the category indicated by the classification label of the target candidate text data and the prediction probability for the target candidate text data is greater than the probability threshold, determining that the pre-trained classification model has prediction accuracy on the target candidate text data and determining that the target candidate text data is the target text data.
In some embodiments, any one of the plurality of candidate text data is represented as a target candidate text data, the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model on the target candidate text data comprises a prediction category of the target candidate text data;
the processing module 802 is specifically configured to, when configured to select target text data from the multiple candidate text data according to prediction accuracy of the pre-trained classification model for the multiple candidate text data:
and if the prediction type of the target candidate text data is the same as the type indicated by the classification label of the target candidate text data, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
In some embodiments, either one of the sample text data and the target text data is represented as target sample data; the processing module 802 is specifically configured to, when being configured to train the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model:
acquiring sample text statistical characteristics aiming at target sample data;
calling a pre-trained classification model to generate sample text characteristics of target sample data;
calling a pre-trained classification model, and outputting a classification result aiming at target sample data based on the sample text statistical characteristics and the sample text characteristics;
generating a prediction deviation for a pre-trained classification model based on a classification result for the target sample data and a classification label carried by the target sample data;
and correcting the model parameters of the pre-trained classification model based on the prediction deviation to obtain the trained classification model.
In some embodiments, the sample text statistical features include at least one of: the method comprises the steps of aiming at the part-of-speech frequency distribution characteristics of characters in target sample data, the statistical characteristics of sub-texts in the target sample data, or the statistical characteristics of classified keywords in the target sample data.
In some embodiments, the trained classification model is used to classify the risk of the text data;
the processing module 802 is further configured to:
acquiring session text data of a target object, and acquiring session text statistical characteristics of the session text data;
calling the trained classification model to generate the session text characteristics of the session text data;
calling a trained classification model, and outputting a risk classification result aiming at the conversation text data based on the conversation text statistical characteristics and the conversation text characteristics;
the processing module 802 is further configured to:
and if the risk classification result is used for indicating that the conversation text data has the conversation risk, early warning is carried out on the target object.
In the embodiment of the application, the obtaining module obtains sample text data; the processing module sequentially performs sliding segmentation on the N sub-texts based on the sliding window to obtain a plurality of candidate text data of the sample text data; the processing module trains the classification model based on the sample text data to obtain a pre-trained classification model; the processing module calls the pre-trained classification model to perform classification prediction on the candidate text data and selects target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data; and the processing module trains the pre-trained classification model based on the sample text data and the target text data to obtain the trained classification model, which is used for performing classification prediction on text data. Through the apparatus, the candidate text data are obtained by sample expansion of the sample text data, so that the number of samples can be increased and the overfitting problem possibly produced by a small sample training set avoided; target text data of higher quality can be selected from the candidate text data based on prediction accuracy; and model training benefits from the high-quality target text data together with the sample text data, so that the prediction accuracy of the trained classification model is improved and text data can be classified more accurately by the trained classification model.
Each functional module in the embodiments of the present application may be integrated into one module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module may be implemented in a form of hardware, or may be implemented in a form of software functional module, which is not limited in this application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 includes: at least one processor 901 and a memory 902. Optionally, the electronic device may further include a network interface. Data can be exchanged among the processor 901, the memory 902 and the network interface; the network interface is controlled by the processor 901 to transmit and receive messages; the memory 902 is used for storing a computer program comprising program instructions; and the processor 901 is used for executing the program instructions stored in the memory 902, calling the program instructions to perform the method described above.
The memory 902 may include a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 902 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 902 may also comprise a combination of the above-described types of memory.
The processor 901 may be a Central Processing Unit (CPU). In one embodiment, processor 901 may also be a Graphics Processing Unit (GPU). The processor 901 may also be a combination of a CPU and a GPU.
In one possible embodiment, the memory 902 is used for storing program instructions, which the processor 901 can call to perform the following steps:
acquiring sample text data; the number of the sample text data is less than the number of the sample indexes; the sample text data carries a classification label; the sample text data comprises N sub-texts, wherein N is a positive integer;
sequentially carrying out sliding segmentation on the N sub-texts based on the sliding window to obtain a plurality of candidate text data of the sample text data; any candidate text data comprises at least one arbitrary continuous sub text in the N sub texts;
pre-training the classification model based on the sample text data to obtain a pre-trained classification model;
calling a pre-trained classification model to perform classification prediction on the candidate text data, and selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data;
training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on the text data.
In some embodiments, the processor 901 is further configured to:
detecting separator characters in the sample text data;
the sample text data is divided into N sub-texts based on the detected separator characters.
In some embodiments, any one of the plurality of candidate text data is represented as a target candidate text data, the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model on the target candidate text data comprises a prediction category of the target candidate text data and a prediction probability for the prediction category;
the processor 901 is specifically configured to, when configured to select target text data from the multiple candidate text data according to prediction accuracy of the pre-trained classification model for the multiple candidate text data,:
if the prediction category for the target candidate text data is the same as the category indicated by the classification label of the target candidate text data and the prediction probability for the target candidate text data is greater than the probability threshold, determining that the pre-trained classification model has prediction accuracy on the target candidate text data and determining that the target candidate text data is the target text data.
In some embodiments, any one of the plurality of candidate text data is represented as a target candidate text data, the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model on the target candidate text data comprises a prediction category of the target candidate text data;
the processor 901 is specifically configured to, when configured to select target text data from the multiple candidate text data according to prediction accuracy of the pre-trained classification model for the multiple candidate text data,:
and if the prediction type of the target candidate text data is the same as the type indicated by the classification label of the target candidate text data, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
In some embodiments, either one of the sample text data and the target text data is represented as target sample data; the processor 901 is specifically configured to, when training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model:
acquiring sample text statistical characteristics aiming at target sample data;
calling a pre-trained classification model to generate sample text characteristics of target sample data;
calling a pre-trained classification model, and outputting a classification result aiming at target sample data based on the sample text statistical characteristics and the sample text characteristics;
generating a prediction deviation for a pre-trained classification model based on a classification result for the target sample data and a classification label carried by the target sample data;
and correcting the model parameters of the pre-trained classification model based on the prediction deviation to obtain the trained classification model.
In some embodiments, the sample text statistical features include at least one of: the method comprises the steps of aiming at the part-of-speech frequency distribution characteristics of characters in target sample data, the statistical characteristics of sub-texts in the target sample data, or the statistical characteristics of classified keywords in the target sample data.
In some embodiments, the trained classification model is used to classify the risk of the text data;
the processor 901 is further configured to:
acquiring session text data of a target object, and acquiring session text statistical characteristics of the session text data;
calling the trained classification model to generate the session text characteristics of the session text data;
calling a trained classification model, and outputting a risk classification result aiming at the conversation text data based on the conversation text statistical characteristics and the conversation text characteristics;
the processor 901 is further configured to:
and if the risk classification result is used for indicating that the conversation text data has the conversation risk, early warning is carried out on the target object.
In specific implementation, the above-described devices, processors, memories, and the like may perform the implementation manners described in the above method embodiments, and may also perform the implementation manners described in the embodiments of the present application, which are not described herein again.
Also provided in embodiments of the present application is a computer (readable) storage medium storing a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform some or all of the steps performed in the above-mentioned method embodiments. Alternatively, the computer storage media may be volatile or nonvolatile. The computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
Embodiments of the present application also provide a computer program product, which includes computer instructions, and when executed by a processor, the computer instructions can implement some or all of the steps of the above method.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated object, indicating that there may be three relationships, for example, a and/or B, which may indicate: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer storage medium, where the computer storage medium may be a computer readable storage medium, and when executed, the computer program may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While only some embodiments have been described in detail herein, it will be understood that all modifications and equivalents may be resorted to, falling within the scope of the invention.

Claims (11)

1. A method of data processing, the method comprising:
acquiring sample text data; the number of the sample text data is less than the number of the sample indexes; the sample text data carries a classification label; the sample text data comprises N sub-texts, wherein N is a positive integer;
sequentially performing sliding segmentation on the N sub texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any candidate text data comprises at least one arbitrary continuous sub text in the N sub texts;
pre-training a classification model based on the sample text data to obtain a pre-trained classification model;
calling the pre-trained classification model to perform classification prediction on the candidate text data, and selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data;
training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on the text data.
2. The method of claim 1, further comprising:
detecting a separator character in the sample text data;
dividing the sample text data into the N sub-texts based on the detected separator characters.
3. The method of claim 1, wherein any of the candidate text data is represented as a target candidate text data, the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model on the target candidate text data comprises a prediction category of the target candidate text data and a prediction probability for the prediction category;
the selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data comprises:
if the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data and the prediction probability of the target candidate text data is greater than a probability threshold, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
4. The method of claim 1, wherein any of the candidate text data is represented as a target candidate text data, the target candidate text data has a classification label, and the classification prediction result of the pre-trained classification model on the target candidate text data comprises a prediction category of the target candidate text data;
the selecting target text data from the candidate text data according to the prediction accuracy of the pre-trained classification model for the candidate text data comprises:
if the prediction type of the target candidate text data is the same as the type indicated by the classification label of the target candidate text data, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
5. The method according to claim 1, wherein any one of the sample text data and the target text data is represented as target sample data; training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model, comprising:
acquiring sample text statistical characteristics aiming at the target sample data;
calling the pre-trained classification model to generate sample text features of the target sample data;
calling the pre-trained classification model, and outputting a classification result aiming at the target sample data based on the sample text statistical characteristics and the sample text characteristics;
generating a prediction bias for the pre-trained classification model based on the classification result for the target sample data and a classification label carried by the target sample data;
and correcting the model parameters of the pre-trained classification model based on the prediction deviation to obtain the trained classification model.
6. The method of claim 5, wherein the sample text statistical features comprise at least one of: and aiming at the part-of-speech frequency distribution characteristics of characters in the target sample data, the statistical characteristics of the sub-text in the target sample data or the statistical characteristics of classified keywords in the target sample data.
7. The method of claim 1, wherein the trained classification model is used to classify risk of text data; the method further comprises the following steps:
acquiring session text data of a target object, and acquiring session text statistical characteristics of the session text data;
calling the trained classification model to generate the session text features of the session text data;
calling the trained classification model, and outputting a risk classification result aiming at the conversation text data based on the conversation text statistical characteristics and the conversation text characteristics;
the method further comprises the following steps:
and if the risk classification result is used for indicating that the conversation text data has conversation risk, early warning is carried out on the target object.
8. A data processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring sample text data; the number of the sample text data is less than the number of the sample indexes; the sample text data carries a classification label; the sample text data comprises N sub-texts, wherein N is a positive integer;
the processing module is used for sequentially carrying out sliding segmentation on the N sub texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any candidate text data comprises at least one arbitrary continuous sub text in the N sub texts;
the processing module is further used for training a classification model based on the sample text data to obtain a pre-trained classification model;
the processing module is further configured to invoke the pre-trained classification model to perform classification prediction on the candidate text data, and select target text data from the candidate text data according to prediction accuracy of the pre-trained classification model for the candidate text data;
the processing module is further configured to train the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on the text data.
9. An electronic device comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, and wherein the processor is configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
11. A computer program product, characterized in that the computer program product comprises computer instructions which, when executed by a processor, implement the method according to any one of claims 1-7.
CN202210756062.XA 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product Active CN115098680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210756062.XA CN115098680B (en) 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product

Publications (2)

Publication Number Publication Date
CN115098680A true CN115098680A (en) 2022-09-23
CN115098680B CN115098680B (en) 2024-08-09

Family

ID=83294589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210756062.XA Active CN115098680B (en) 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product

Country Status (1)

Country Link
CN (1) CN115098680B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40074439

Country of ref document: HK

GR01 Patent grant