CN113988059A

CN113988059A - Session data type identification method, system, equipment and storage medium

Info

Publication number: CN113988059A
Application number: CN202110968186.XA
Authority: CN
Inventors: 吴明平
Original assignee: Beijing Minglue Zhaohui Technology Co Ltd
Current assignee: Beijing Minglue Zhaohui Technology Co Ltd
Priority date: 2021-08-23
Filing date: 2021-08-23
Publication date: 2022-01-28

Abstract

The invention discloses a method, a system, equipment and a storage medium for identifying session data types. The method comprises the following steps: storing real-time group chat data in es (Elasticsearch), and labeling the group chat data by text type through a risk control module; identifying large-class data from the labeled group chat data with a multi-classification text classification model; and identifying small sample data from the labeled group chat data with a metric-based few-shot learning algorithm. By combining the text classification algorithm with the few-shot idea from small sample learning, the method can reduce the amount of labeled data, solve the problem of data imbalance at a small labeling cost, and reduce the misjudgment rate of the data.

Description

Session data type identification method, system, equipment and storage medium

Technical Field

The invention relates to the technical field of computer data processing, in particular to a conversation data type identification method, a system, equipment and a storage medium based on a classification algorithm and small sample learning.

Background

More and more e-commerce businesses are turning to private domain traffic. Unlike public domain traffic, private domain traffic is retained mainly in WeChat accounts and communities (group chats), so the health of a community is particularly important. Community operation is increasingly a common marketing means for brands to retain fans, publicize new products and stimulate purchases. A community has to survive over the long term, which requires keeping it active and healthy: members who post irrelevant advertisements must be dealt with promptly, so that the community does not turn into a bargain-hunting or idle-chatter group that discourages real users from participating.

In the early stage of a community, manual inspection is basically used to check whether anyone is posting advertisements in violation of the rules. Once a community grows to a certain scale, however, a large brand merchant may own hundreds or even thousands of fan groups; manual monitoring then becomes very costly, and violations may not be handled in time.

At present, group chat risk control is traditionally handled either by keyword matching or by using an NLP classification algorithm to identify advertisement data.

However, group chat messages are colloquial, and their content is random and diverse. Advertisements may come from externally linked e-commerce sites or from text written by individuals, and some even appear in "Martian script" (deliberately obfuscated characters). As a result, labeling is costly, and advertisement data makes up only a very small share of group chat data. With samples this imbalanced, a simple classification algorithm combined with data augmentation cannot bring the misjudgment rate down to a low level.

Disclosure of Invention

The invention provides a conversation data type identification method, a system, equipment and a storage medium based on a classification algorithm and small sample learning, aiming at the technical problems of higher labeling cost and higher misjudgment rate in the existing data type identification.

In a first aspect, an embodiment of the present application provides a session data type identification method, including:

data labeling: real-time group chat data are stored through es, and group chat data labeling is carried out through a risk control module according to text types;

large category data identification: identifying large-class data by adopting a multi-classification text classification model according to the marked group chat data;

small sample data identification: identifying the small sample data by using a metric-based few-shot learning algorithm according to the marked group chat data.

The above session data type identification method, wherein the data labeling step further includes: performing text type proportion analysis on the group chat data through the risk control module.

The above session data type identification method, wherein the large category data identification step includes:

model training: analyzing and training the text classification model according to the text type proportion, and calculating the confidence coefficient of a prediction category according to the text classification model;

confidence calculation: setting a confidence threshold according to a training result, calculating a micro-average value of the text classification model under the condition that the maximum value of the confidence of the prediction category is larger than the confidence threshold, and selecting the group chat data with the confidence smaller than the confidence threshold as a Query sample according to the micro-average value.

The above session data type identification method, wherein the small sample data identification step includes:

sampling sample embedding: randomly sampling the marked group chat data according to text types, and then embedding the sampled samples based on the attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding;

category characteristic induction: traversing and extracting sample semantics of the support set by adopting a dynamic routing method, and inducing class characteristics;

semantic relation measurement: measuring the semantic relation between the Query sample and the class characteristics of the support set by adopting a Euclidean distance algorithm to complete small sample classification.

In a second aspect, an embodiment of the present application provides a session data type identification system, including:

a data labeling unit: real-time group chat data are stored through es, and group chat data labeling is carried out through a risk control module according to text types;

a large category data identification unit: identifying large-class data by adopting a multi-classification text classification model according to the marked group chat data;

a small sample data identification unit: identifying the small sample data by using a metric-based few-shot learning algorithm according to the marked group chat data.

The above session data type identification system, wherein the data labeling unit further includes:

a type proportion analysis module: performing text type proportion analysis on the group chat data through the risk control module.

The above session data type identification system, wherein the large category data identification unit includes:

a model training module: analyzing and training the text classification model according to the text type proportion, and calculating the confidence coefficient of a prediction category according to the text classification model;

a confidence calculation module: setting a confidence threshold according to a training result, calculating a micro-average value of the text classification model under the condition that the maximum value of the confidence of the prediction category is larger than the confidence threshold, and selecting the group chat data with the confidence smaller than the confidence threshold as a Query sample according to the micro-average value.

The above session data type identification system, wherein the small sample data identification unit includes:

a sampling sample embedding module: randomly sampling the marked group chat data according to text types, and then embedding the sampled samples based on the attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding;

a category characteristic induction module: traversing and extracting sample semantics of the support set by adopting a dynamic routing method, and inducing class characteristics;

a semantic relation measurement module: measuring the semantic relation between the Query sample and the class characteristics of the support set by adopting a Euclidean distance algorithm to complete small sample classification.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the session data type identification method according to the first aspect.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the session data type identification method according to the first aspect.

Compared with the prior art, the invention has the advantages and positive effects that:

The invention relates to deep learning technology. It combines a text classification algorithm with the few-shot idea from small sample learning: a multi-classification model computes the confidence that the data to be predicted belongs to a large category, the results with lower confidence are output, and a small sample model performs secondary filtering on them. In this way the amount of labeled data can be reduced, the problem of data imbalance can be solved at a small labeling cost, and the misjudgment rate of the data can be reduced.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a session data type identification method based on a classification algorithm and small sample learning according to the present invention;

FIG. 2 is a schematic flow chart based on step S2 in FIG. 1 according to the present invention;

FIG. 3 is a schematic flow chart based on step S3 in FIG. 1 according to the present invention;

FIG. 4 is a schematic flowchart of an embodiment of a session data type identification method based on a classification algorithm and small sample learning according to the present invention;

FIG. 5 is a block diagram of a conversational data type recognition system based on a classification algorithm and small sample learning according to the present invention;

FIG. 6 is a block diagram of a computer device according to an embodiment of the present application.

Wherein the reference numerals are:

1. a data labeling unit; 11. a type proportion analysis module; 2. a large category data identification unit; 21. a model training module; 22. a confidence calculation module; 3. a small sample data identification unit; 31. a sampling sample embedding module; 32. a category characteristic induction module; 33. a semantic relationship measurement module; 81. a processor; 82. a memory; 83. a communication interface; 80. a bus.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. References to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof, as used in the present application, are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. References to "connected," "coupled," and the like in this application are not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "And/or" describes an association relationship of associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The terms "first," "second," "third," and the like herein merely distinguish similar objects and do not denote a particular ordering of the objects.

The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.

Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.

According to the method, confidence calculation is carried out on the group chat data by adopting a multi-classification text classification model, whether the group chat data belongs to a large class or not is judged, a result with low confidence is output, secondary filtering is carried out by utilizing a small sample model, and the recognition rate of the data type is improved.

Embodiment one:

FIG. 1 is a schematic step diagram of a session data type identification method based on a classification algorithm and small sample learning according to the present invention. As shown in FIG. 1, this embodiment discloses a specific implementation of a session data type identification method (hereinafter referred to as "the method") based on a classification algorithm and small sample learning.

Specifically, the method disclosed in this embodiment mainly includes the following steps:

Step S1: real-time group chat data are stored through es, and group chat data labeling is carried out through a risk control module according to text types;

wherein step S1 further includes: performing text type proportion analysis on the group chat data through the risk control module.

Step S2: identifying large-class data by adopting a multi-classification text classification model according to the marked group chat data;

As shown in FIG. 2, step S2 specifically includes the following contents:

Step S21: analyzing and training the text classification model according to the text type proportion, and calculating the confidence coefficient of a prediction category according to the text classification model;

Specifically, the model mainly comprises the following steps: 1. computing word representations of the group chat data; 2. extracting a bidirectional representation of the text features of the group chat data with a BiLSTM; 3. concatenating the representations from steps 1 and 2, training a weight matrix through a tanh function, and computing the target yi; 4. applying max-pooling over the multiple yi to obtain the sentence representation y; 5. passing y through a fully connected layer with softmax as the output activation function.

Step S22: setting a confidence threshold according to a training result, calculating a micro-average value of the text classification model under the condition that the maximum value of the confidence of the prediction category is larger than the confidence threshold, and selecting the group chat data with the confidence smaller than the confidence threshold as a Query sample according to the micro-average value.

Step S3: identifying the small sample data by using a metric-based few-shot learning algorithm according to the marked group chat data.

As shown in FIG. 3, step S3 specifically includes the following contents:

Step S31: randomly sampling the marked group chat data according to text types, and then embedding the sampled samples based on the attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding;

Step S32: traversing and extracting sample semantics of the support set by adopting a dynamic routing method, and inducing class characteristics;

Step S33: measuring the semantic relation between the Query sample and the class characteristics of the support set by adopting a Euclidean distance algorithm to complete small sample classification.
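Step S33 can be illustrated with a minimal Python sketch. It assumes that the embedding and induction steps (S31, S32) have already produced one class vector per text type and one vector per Query sample. The vectors, label names and function names below are hypothetical and are not taken from this specification.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_query(query_vec, class_vectors):
    """Return the label whose class vector is closest to the query vector."""
    return min(
        class_vectors,
        key=lambda label: euclidean_distance(query_vec, class_vectors[label]),
    )


# Hypothetical 2-dimensional class vectors; real ones would come from S31/S32.
class_vectors = {
    "advertising": [0.9, 0.1],
    "other": [0.1, 0.8],
}
query_vec = [0.8, 0.2]
predicted = classify_query(query_vec, class_vectors)
```

Because the decision only compares distances to the class vectors, a class can be added by inserting one more entry into `class_vectors`, without changing the functions themselves.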

The method mainly combines a text classification algorithm with the few-shot idea from small sample learning, so that the amount of labeled data can be reduced and a new type can be identified from a small number of samples. In the training stage, few-shot learning decomposes the data set into different meta-tasks in order to learn how the model generalizes when the categories change; in the testing stage, classification of brand-new categories can then be completed without changing the existing model.

Please refer to FIG. 4. FIG. 4 is a schematic flowchart of an embodiment of the session data type identification method based on a classification algorithm and small sample learning provided by the present invention. The embodiment aims to identify advertisement data in group chat data in real time and to prompt an operator to handle it. The application flow of the method is described below with reference to FIG. 4:

Step one: real-time acquisition of group chat text

First, the invention uses es (Elasticsearch) to store real-time group chat data. The data stream passes through a risk control module, which judges whether the data is of the advertisement type. The input fields are the user id (user_id), the chat message id (id), the chat content (content) and the time (create_time). Assume that the group chat messages of one of a customer's fan groups over a certain time period are already available and are roughly divided into 5 text types: product-related, operation-activity-related, casual chat data, advertising data, and other data. After 10,000 messages are labeled in chronological order, the distribution of the text types is found to be 25%, 20%, 50%, 1% and 4%, respectively.
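The proportion analysis described above can be sketched in Python as follows. The record fields user_id, id, content and create_time come from the text; the `label` field, the label names, the 5% cutoff and the helper functions are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical label names and counts matching the 10,000-message example.
TYPE_COUNTS = {
    "product": 2500,
    "activity": 2000,
    "chat": 5000,
    "advertising": 100,
    "other": 400,
}


def make_records(type_counts):
    """Build placeholder records with the stated fields plus an assumed 'label'."""
    records = []
    for label, count in type_counts.items():
        for _ in range(count):
            records.append({
                "user_id": "u0",
                "id": len(records),
                "content": "placeholder text",
                "create_time": "2021-08-01T00:00:00",
                "label": label,
            })
    return records


def type_proportions(labeled_records):
    """Return {label: share of all labeled records}."""
    counts = Counter(record["label"] for record in labeled_records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def minority_types(proportions, cutoff=0.05):
    """Labels whose share is below an assumed cutoff (candidates for few-shot)."""
    return sorted(label for label, share in proportions.items() if share < cutoff)


proportions = type_proportions(make_records(TYPE_COUNTS))
small_classes = minority_types(proportions)
```

With this distribution, the advertising and other types fall below the assumed cutoff, which is the kind of imbalance that motivates handling them with a small sample model.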

Step two: text classification algorithm for calculating confidence of large class

Based on the text type proportion analysis in step one, this module designs a multi-classification text classification model, which is mainly used to identify large-class data. It mainly comprises the following steps: 1. computing word representations; 2. extracting a bidirectional representation of the text features with a BiLSTM; 3. concatenating the representations from steps 1 and 2, training a weight matrix through a tanh function, and computing the target yi; 4. applying max-pooling over the multiple yi to obtain the sentence representation y; 5. passing y through a fully connected layer with softmax as the output activation function; 6. defining a threshold according to the training result, computing the micro-average value of the whole model over the samples whose maximum predicted-category value from step 5 is greater than the threshold, and, at the point where the micro-average value drops sharply, selecting the data below the threshold as the lower-confidence samples.
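Steps 3 to 5 of the model can be sketched in plain Python. The sketch assumes that the word representations (step 1) and the forward and backward BiLSTM states (step 2) are already available as lists of numbers; all weights, dimensions and names are hypothetical toy values, and training is not shown.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def category_confidences(word_vecs, fwd_states, bwd_states, W, b, W_out, b_out):
    """Return one confidence per category for a single message."""
    token_features = []
    for word, fwd, bwd in zip(word_vecs, fwd_states, bwd_states):
        x_i = fwd + word + bwd  # step 3: concatenate context and word vectors
        y_i = [math.tanh(z + bias) for z, bias in zip(matvec(W, x_i), b)]
        token_features.append(y_i)
    # Step 4: element-wise max-pooling over all tokens gives the sentence vector y.
    y = [max(column) for column in zip(*token_features)]
    # Step 5: fully connected layer followed by softmax.
    logits = [z + bias for z, bias in zip(matvec(W_out, y), b_out)]
    return softmax(logits)


# Toy example: 2 tokens, 1-dimensional vectors, 2 hidden units, 3 categories.
word_vecs = [[0.5], [-0.3]]
fwd_states = [[0.1], [0.4]]
bwd_states = [[0.2], [-0.1]]
W = [[0.3, -0.2, 0.5], [0.1, 0.4, -0.6]]
b = [0.0, 0.1]
W_out = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b_out = [0.0, 0.0, 0.0]

confidences = category_confidences(word_vecs, fwd_states, bwd_states, W, b, W_out, b_out)
max_confidence = max(confidences)
```

The value `max_confidence` is the quantity that step 6 compares against the threshold.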

Step three: small sample learning and identifying advertisement data

When small sample data is identified with a metric-based few-shot learning algorithm, a different meta-task is sampled in each training episode of the few-shot training process. Viewed as a whole, training therefore covers different category combinations, and this mechanism lets the model learn what the different meta-tasks have in common, such as how to extract important features and how to compare sample similarity. A model learned in this way can classify well when it faces a new, unseen meta-task.
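Sampling one meta-task per episode can be sketched as follows. The N-way, K-shot terminology, the function name `sample_episode` and the toy data are assumptions for illustration; the specification does not state the number of classes or samples drawn per episode.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one meta-task: a support set and a query set over n_way classes."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = {}, []
    for label in classes:
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support[label] = picked[:k_shot]  # K labeled examples per class
        query.extend((text, label) for text in picked[k_shot:])
    return support, query


# Hypothetical labeled pool: 5 text types with 10 messages each.
data_by_class = {
    f"type_{i}": [f"type_{i}_msg_{j}" for j in range(10)] for i in range(5)
}
rng = random.Random(0)
support, query = sample_episode(data_by_class, n_way=3, k_shot=2, n_query=1, rng=rng)
```

Calling `sample_episode` again draws a different class combination, which is how training comes to cover many meta-tasks.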

The small sample learning approach adopted in the invention first randomly samples from the original 5 classes of labeled data and then embeds the sampled samples with an attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding. A dynamic routing method is then used to traverse the sample semantics of the support set and induce category characteristics. Finally, a Euclidean distance algorithm measures the semantic relationship between the Query sample and the category characteristics, which completes the classification.
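The specification does not give the details of the dynamic routing step. One common form, sketched below as an assumption, iteratively reweights the support vectors of a class by their agreement with the current class vector and applies a squash non-linearity. The learned transformation that usually precedes routing is omitted, and the vectors and iteration count are hypothetical.

```python
import math


def squash(vec):
    """Shrink a vector so its norm lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in vec)
    if norm_sq == 0.0:
        return list(vec)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in vec]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def induce_class_vector(support_vecs, iterations=3):
    """Induce one class vector from the support vectors of a single class."""
    dim = len(support_vecs[0])
    routing_logits = [0.0] * len(support_vecs)
    class_vec = [0.0] * dim
    for _ in range(iterations):
        coupling = softmax(routing_logits)
        weighted_sum = [
            sum(c * vec[j] for c, vec in zip(coupling, support_vecs))
            for j in range(dim)
        ]
        class_vec = squash(weighted_sum)
        # Raise the logit of each sample that agrees with the class vector.
        routing_logits = [
            logit + sum(x * c for x, c in zip(vec, class_vec))
            for logit, vec in zip(routing_logits, support_vecs)
        ]
    return class_vec


# Hypothetical support embeddings for one class; the third is an outlier.
support_vecs = [[1.0, 0.0], [0.9, 0.1], [-0.2, 1.0]]
class_vec = induce_class_vector(support_vecs)
```

Compared with a plain average, the routing weights give less influence to support samples that disagree with the rest of the class.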

Step four: providing a treatment plan according to risk level

Step two provides the confidence that a text belongs to a large-class sample, and computing the micro-average against the threshold yields a batch of data with lower confidence. The small sample model in step three can judge whether the data to be predicted belongs to a small class in the existing labeled data. It also allows operators to label a small sample of a new class and feed it to the model to judge whether new data belongs to that new class, which avoids retraining the main model.
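The threshold analysis can be sketched as a sweep over candidate thresholds. The sketch assumes that "micro-average" means micro-averaged precision, which for single-label predictions equals the accuracy over the retained samples. The validation tuples, candidate thresholds and function names are hypothetical.

```python
def micro_average_above(predictions, threshold):
    """Micro-averaged precision over samples whose confidence exceeds the threshold.

    predictions: list of (confidence, predicted_label, true_label).
    Returns None when no sample is retained.
    """
    kept = [(pred, true) for conf, pred, true in predictions if conf > threshold]
    if not kept:
        return None
    return sum(1 for pred, true in kept if pred == true) / len(kept)


def select_query_samples(scored_texts, threshold):
    """Texts whose confidence is below the threshold become Query samples."""
    return [text for text, conf in scored_texts if conf < threshold]


# Hypothetical validation results from the main classifier.
validation = [
    (0.95, "chat", "chat"),
    (0.92, "product", "product"),
    (0.85, "chat", "chat"),
    (0.75, "activity", "activity"),
    (0.60, "chat", "advertising"),
    (0.55, "other", "advertising"),
    (0.45, "product", "other"),
]
sweep = {t: micro_average_above(validation, t) for t in (0.9, 0.7, 0.5, 0.4)}
queries = select_query_samples([("msg_a", 0.93), ("msg_b", 0.58)], threshold=0.7)
```

In this toy data the micro-average stays at 1.0 down to a threshold of 0.7 and falls once lower-confidence samples are admitted, so 0.7 would be a reasonable cut under these assumptions.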

The invention mainly uses a multi-classification model to compute the confidence that the data to be predicted belongs to a large class, outputs the lower-confidence results of the classification model, and performs secondary filtering on them with a small sample model, thereby solving the problem of data imbalance at a small labeling cost and ensuring a high recognition rate for new types of data.
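The two-stage flow summarized above can be sketched as a small routing function. Both classifiers are passed in as callables and are replaced here by hypothetical stubs; the threshold value and the handling of a confidence exactly equal to the threshold (sent to the small sample model) are assumptions.

```python
def identify_type(text, main_classifier, few_shot_classifier, threshold):
    """Return (label, stage) using the main model first, then the few-shot model."""
    confidences = main_classifier(text)  # {label: confidence}
    label, confidence = max(confidences.items(), key=lambda item: item[1])
    if confidence > threshold:
        return label, "main"
    # Low confidence: treat the text as a Query sample for secondary filtering.
    return few_shot_classifier(text), "few_shot"


def stub_main_classifier(text):
    """Hypothetical stand-in for the multi-classification model."""
    if "hello" in text:
        return {"chat": 0.9, "advertising": 0.1}
    return {"chat": 0.4, "product": 0.35, "advertising": 0.25}


def stub_few_shot_classifier(text):
    """Hypothetical stand-in for the small sample model."""
    return "advertising"


confident = identify_type("hello everyone", stub_main_classifier, stub_few_shot_classifier, 0.7)
uncertain = identify_type("add me for cheap deals", stub_main_classifier, stub_few_shot_classifier, 0.7)
```

Keeping the two models behind separate callables mirrors the point made in the text: new classes can be introduced on the few-shot side without retraining the main model.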

Embodiment two:

In combination with the session data type identification method based on a classification algorithm and small sample learning disclosed in embodiment one, this embodiment discloses a specific implementation example of a session data type identification system (hereinafter referred to as "the system") based on a classification algorithm and small sample learning.

Referring to fig. 5, the system includes:

data labeling unit 1: real-time group chat data are stored through es, and group chat data labeling is carried out through a risk control module according to text types;

the large category data identification unit 2: identifying large-class data by adopting a multi-classification text classification model according to the marked group chat data;

the small sample data identification unit 3: identifying the small sample data by using a metric-based few-shot learning algorithm according to the marked group chat data.

Specifically, the data labeling unit 1 further includes:

type proportion analysis module 11: performing text type proportion analysis on the group chat data through the risk control module.

Specifically, the large category data identification unit 2 includes:

the model training module 21: analyzing and training the text classification model according to the text type proportion, and calculating the confidence coefficient of a prediction category according to the text classification model;

the confidence calculation module 22: setting a confidence threshold according to a training result, calculating a micro-average value of the text classification model under the condition that the maximum value of the confidence of the prediction category is larger than the confidence threshold, and selecting the group chat data with the confidence smaller than the confidence threshold as a Query sample according to the micro-average value.

Specifically, the small sample data identification unit 3 includes:

the sampling sample embedding module 31: randomly sampling the marked group chat data according to text types, and then embedding the sampled samples based on the attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding;

the category characteristic induction module 32: traversing and extracting sample semantics of the support set by adopting a dynamic routing method, and inducing class characteristics;

the semantic relationship measurement module 33: measuring the semantic relation between the Query sample and the class characteristics of the support set by adopting a Euclidean distance algorithm to complete small sample classification.

For the remaining technical solutions that the session data type identification system based on a classification algorithm and small sample learning disclosed in this embodiment shares with the session data type identification method disclosed in embodiment one, please refer to the description of embodiment one; details are not repeated here.

Embodiment three:

Referring to FIG. 6, the present embodiment discloses an embodiment of a computer device. The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.

Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a non-volatile memory. In particular embodiments, memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended Data Output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.

The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.

The processor 81 implements any of the session data type identification methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.

In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.

The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.

Bus 80 includes hardware, software, or both, coupling the components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.

In addition, in combination with the session data type identification method in the foregoing embodiments, the embodiments of the present application may be implemented by providing a computer-readable storage medium. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the session data type identification methods in the above embodiments.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present application. Their description is specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A session data type identification method, comprising:

2. The session data type identification method according to claim 1, wherein the data labeling step further comprises: performing text type proportion analysis on the group chat data through the risk control module.

3. The session data type identification method according to claim 2, wherein the large category data identification step comprises:

4. The session data type identification method according to claim 3, wherein the small sample data identification step includes:

5. A session data type identification system, comprising:

6. The session data type identification system according to claim 5, wherein the data annotation unit further comprises:

7. The session data type identification system according to claim 6, wherein the large category data identification unit comprises:

8. The session data type identification system according to claim 7, wherein the small sample data identification unit includes:

9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the session data type identification method according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a session data type identification method according to any one of claims 1 to 4.