CN113988059A - Session data type identification method, system, equipment and storage medium - Google Patents
- Publication number
- CN113988059A CN202110968186.XA
- Authority
- CN
- China
- Prior art keywords
- data
- group chat
- text
- confidence
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/169—Annotation, e.g. comment data or footnotes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method, a system, equipment and a storage medium for identifying session data types. The method comprises the following steps: storing real-time group chat data in es, and labeling the group chat data by text type through a risk control module; identifying large-category data from the labeled group chat data using a multi-class text classification model; and identifying small-sample data from the labeled group chat data using a metric-based few-shot learning algorithm. By combining a text classification algorithm with the few-shot approach from small sample learning, the method reduces the amount of data that must be labeled, addresses data imbalance at a small labeling cost, and lowers the misjudgment rate.
Description
Technical Field
The invention relates to the technical field of computer data processing, and in particular to a session data type identification method, system, equipment and storage medium based on a classification algorithm and small sample learning.
Background
More and more e-commerce businesses are turning to private-domain traffic. Unlike public-domain traffic, private-domain traffic is retained mainly in WeChat and in communities (group chats), so the health of a community is particularly important. Community operation is increasingly a common marketing means for brands to retain fans, promote new products and stimulate purchases. For a community to survive over the long term, its activity and health must be maintained: users who post irrelevant advertising in the community need to be dealt with promptly, so that the community does not degenerate into a spam or bargain-hunting group that discourages real users from participating.
In the initial stage of a community, manual review is generally used to check whether anyone is posting advertisements in the group chat in violation of the rules. Once the community grows to a certain scale, however, a large brand merchant may own hundreds or even thousands of fan groups; the cost of manual monitoring becomes very high, and violations may not be handled in time.
At present, the traditional approach to group chat risk control is keyword matching, or identifying advertisement data with an NLP classification algorithm.
However, group chat messages are colloquial, and chat content is random and diverse. Advertisements may come from external e-commerce links or be written by individuals, and some even use "Martian script" (deliberately obfuscated characters). In this situation the labeling cost is high, and advertisement data makes up only a very small share of the group chat data. With such extremely imbalanced samples, a simple classification algorithm combined with data augmentation cannot bring the misjudgment rate down to a low level.
Disclosure of Invention
Aiming at the technical problems of high labeling cost and high misjudgment rate in existing data type identification, the invention provides a session data type identification method, system, equipment and storage medium based on a classification algorithm and small sample learning.
In a first aspect, an embodiment of the present application provides a session data type identification method, including:
a data labeling step: storing real-time group chat data in es, and labeling the group chat data by text type through a risk control module;
a large category data identification step: identifying large-category data from the labeled group chat data using a multi-class text classification model;
a small sample data identification step: identifying small-sample data from the labeled group chat data using a metric-based few-shot learning algorithm.
The above session data type identification method, wherein the data labeling step further includes: performing text type proportion analysis on the group chat data through the risk control module.
The above session data type identification method, wherein the large category data identification step includes:
a model training step: training the text classification model according to the text type proportion analysis, and calculating the confidence of each prediction category with the text classification model;
a confidence calculation step: setting a confidence threshold according to the training result, calculating a micro-average value of the text classification model over the cases where the maximum prediction-category confidence is greater than the confidence threshold, and, according to the micro-average value, selecting the group chat data whose confidence is smaller than the confidence threshold as Query samples.
The above session data type identification method, wherein the small sample data identification step includes:
a sampling sample embedding step: randomly sampling the labeled group chat data by text type, and then embedding the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embeddings;
a category characteristic induction step: traversing the support set with a dynamic routing method to extract sample semantics and induce class characteristics;
a semantic relation measurement step: measuring the semantic relation between the Query sample and the class characteristics of the support set using a Euclidean distance algorithm to complete small sample classification.
In a second aspect, an embodiment of the present application provides a session data type identification system, including:
a data labeling unit: storing real-time group chat data in es, and labeling the group chat data by text type through a risk control module;
a large category data identification unit: identifying large-category data from the labeled group chat data using a multi-class text classification model;
a small sample data identification unit: identifying small-sample data from the labeled group chat data using a metric-based few-shot learning algorithm.
The above session data type identification system, wherein the data labeling unit further includes:
a type proportion analysis module: performing text type proportion analysis on the group chat data through the risk control module.
The above session data type identification system, wherein the large category data identification unit includes:
a model training module: training the text classification model according to the text type proportion analysis, and calculating the confidence of each prediction category with the text classification model;
a confidence calculation module: setting a confidence threshold according to the training result, calculating a micro-average value of the text classification model over the cases where the maximum prediction-category confidence is greater than the confidence threshold, and, according to the micro-average value, selecting the group chat data whose confidence is smaller than the confidence threshold as Query samples.
The above session data type identification system, wherein the small sample data identification unit includes:
a sampling sample embedding module: randomly sampling the labeled group chat data by text type, and then embedding the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embeddings;
a category characteristic induction module: traversing the support set with a dynamic routing method to extract sample semantics and induce class characteristics;
a semantic relation measurement module: measuring the semantic relation between the Query sample and the class characteristics of the support set using a Euclidean distance algorithm to complete small sample classification.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the session data type identification method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the session data type identification method according to the first aspect.
Compared with the prior art, the invention has the advantages and positive effects that:
The invention relates to deep learning technology and combines a text classification algorithm with the few-shot approach from small sample learning. A multi-class model computes the confidence that the data to be predicted belongs to a large category, the results with lower confidence are output, and a small sample model performs secondary filtering on them. This reduces the amount of labeled data, addresses data imbalance at a small labeling cost, and lowers the misjudgment rate.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a session data type identification method based on a classification algorithm and small sample learning according to the present invention;
FIG. 2 is a schematic flow chart based on step S2 in FIG. 1 according to the present invention;
FIG. 3 is a schematic flow chart based on step S3 in FIG. 1 according to the present invention;
FIG. 4 is a schematic flowchart of an embodiment of a session data type identification method based on a classification algorithm and small sample learning according to the present invention;
FIG. 5 is a block diagram of a session data type identification system based on a classification algorithm and small sample learning according to the present invention;
FIG. 6 is a block diagram of a computer device according to an embodiment of the present application.
Wherein the reference numerals are:
1. a data labeling unit; 11. a type proportion analysis module; 2. a large category data identification unit; 21. a model training module; 22. a confidence calculation module; 3. a small sample data identification unit; 31. a sampling sample embedding module; 32. a category characteristic induction module; 33. a semantic relation measurement module; 81. a processor; 82. a memory; 83. a communication interface; 80. a bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before describing in detail the various embodiments of the present invention, the core inventive concepts of the present invention are summarized and described in detail by the following several embodiments.
According to the method, confidence calculation is carried out on the group chat data by adopting a multi-classification text classification model, whether the group chat data belongs to a large class or not is judged, a result with low confidence is output, secondary filtering is carried out by utilizing a small sample model, and the recognition rate of the data type is improved.
Embodiment one:
Fig. 1 is a schematic diagram of the steps of the session data type identification method based on a classification algorithm and small sample learning according to the present invention. As shown in fig. 1, this embodiment discloses a specific implementation of the session data type identification method (hereinafter referred to as the "method") based on a classification algorithm and small sample learning.
Specifically, the method disclosed in this embodiment mainly includes the following steps:
Step S1: storing real-time group chat data in es, and labeling the group chat data by text type through a risk control module;
wherein step S1 further includes: performing text type proportion analysis on the group chat data through the risk control module.
Step S2: identifying large-category data from the labeled group chat data using a multi-class text classification model;
As shown in fig. 2, step S2 specifically includes the following:
Step S21: training the text classification model according to the text type proportion analysis, and calculating the confidence of each prediction category with the text classification model;
Specifically, this mainly comprises the following steps:
1. computing word representations for the group chat data;
2. extracting a bidirectional text feature representation of the group chat data with a BiLSTM;
3. concatenating the representations from steps 1 and 2, applying a trained weight matrix followed by a tanh function, and calculating the target yi;
4. performing max-pooling over the yi to obtain the sentence representation y;
5. passing y through a fully connected layer with softmax as the output activation function.
Step S22: setting a confidence threshold according to the training result, calculating a micro-average value of the text classification model over the cases where the maximum prediction-category confidence is greater than the confidence threshold, and, according to the micro-average value, selecting the group chat data whose confidence is smaller than the confidence threshold as Query samples.
Step S3: identifying small-sample data from the labeled group chat data using a metric-based few-shot learning algorithm.
As shown in fig. 3, step S3 specifically includes the following:
Step S31: randomly sampling the labeled group chat data by text type, and then embedding the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embeddings;
Step S32: traversing the support set with a dynamic routing method to extract sample semantics and induce class characteristics;
Step S33: measuring the semantic relation between the Query sample and the class characteristics of the support set using a Euclidean distance algorithm to complete small sample classification.
The method mainly combines a text classification algorithm with the few-shot approach from small sample learning, so that the amount of labeled data can be reduced and new types can be identified from a small number of samples. In the training stage, few-shot learning decomposes the data set into different meta-tasks in order to learn a model that generalizes when the categories change; in the testing stage, brand-new categories can be classified without changing the existing model.
Please refer to fig. 4, which is a schematic flowchart of an embodiment of the session data type identification method based on a classification algorithm and small sample learning provided by the present invention. The embodiment aims to identify advertisement data in group chat data in real time and prompt an operator to handle it. The application flow of the method is described below with reference to fig. 4:
Step one: Real-time acquisition of group chat text
First, the invention uses es to store real-time group chat data. The data stream passes through a risk control module, which identifies advertisement-type data. The input fields are the user ID (user_id), the chat message ID (id), the chat content (content) and the time (create_time). Suppose the group chat messages of one of a customer's fan groups over a certain time period are available and are roughly divided into five text types: product-related, operation-activity-related, casual chat data, advertisement data, and other data. After 10,000 messages are labeled in chronological order, the distribution of the text types is found to be 25%, 20%, 50%, 1% and 4%, respectively.
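The text type proportion analysis described above can be sketched as follows. This is a minimal illustration only: the `ChatMessage` class, the `label` field, the class names and the toy data are assumptions made for the example, and only the four input fields come from the description.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ChatMessage:
    # Input fields named in the description.
    user_id: str
    id: str
    content: str
    create_time: str
    # Hypothetical field holding the text type assigned during labeling.
    label: str = ""


def type_proportions(messages):
    """Return the share of each text type among the labeled messages."""
    counts = Counter(m.label for m in messages if m.label)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Toy data: 100 messages mirroring the 25/20/50/1/4 split in the text.
labels = (["product"] * 25 + ["operation"] * 20 + ["chat"] * 50
          + ["advertisement"] * 1 + ["other"] * 4)
messages = [
    ChatMessage(user_id=f"u{i}", id=str(i), content="hello",
                create_time="2021-01-01 00:00:00", label=lab)
    for i, lab in enumerate(labels)
]
proportions = type_proportions(messages)
```

With this split, advertisement data accounts for 1% of the labeled messages, which is the imbalance the later steps are designed to handle.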
Step two: text classification algorithm for calculating confidence of large class
Based on the text type proportion analysis in step one, this module designs a multi-class text classification model, mainly to identify large-category data. It mainly comprises the following steps:
1. computing word representations;
2. extracting a bidirectional text feature representation with a BiLSTM;
3. concatenating the representations from steps 1 and 2, applying a trained weight matrix followed by a tanh function, and calculating the target yi;
4. performing max-pooling over the yi to obtain the sentence representation y;
5. a fully connected layer with softmax as the output activation function;
6. defining a threshold according to the training result and calculating the micro-average value of the whole model over the cases where the maximum prediction-category value from step 5 is greater than the threshold; when the micro-average value drops sharply, the data below the threshold are selected as the lower-confidence samples.
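Steps 3 to 5 of the classifier can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions, not the patented implementation: the BiLSTM of step 2 is replaced by placeholder vectors, the weights are random rather than trained, and all names and dimensions (`classify`, `W`, `U`, `d_hidden` and so on) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def classify(word_vecs, bilstm_vecs, W, b, U, c):
    # Step 3: concatenate both representations per token, then tanh(W x + b).
    ys = []
    for e, h in zip(word_vecs, bilstm_vecs):
        x = e + h  # list concatenation
        ys.append([math.tanh(a + bi) for a, bi in zip(matvec(W, x), b)])
    # Step 4: element-wise max-pooling over token positions.
    y = [max(col) for col in zip(*ys)]
    # Step 5: fully connected layer with softmax output.
    logits = [a + ci for a, ci in zip(matvec(U, y), c)]
    return softmax(logits)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Assumed toy dimensions; 5 classes matches the five text types in the text.
seq_len, d_word, d_lstm, d_hidden, n_classes = 4, 3, 4, 5, 5
word_vecs = rand_matrix(seq_len, d_word)    # step 1 output (placeholder)
bilstm_vecs = rand_matrix(seq_len, d_lstm)  # step 2 output (placeholder)
W, b = rand_matrix(d_hidden, d_word + d_lstm), [0.0] * d_hidden
U, c = rand_matrix(n_classes, d_hidden), [0.0] * n_classes

probs = classify(word_vecs, bilstm_vecs, W, b, U, c)
confidence = max(probs)  # the value compared against the threshold in step 6
```

The maximum softmax probability is the confidence value that step 6 compares against the threshold.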
Step three: small sample learning and identifying advertisement data
When a metric-based few-shot learning algorithm is used to identify small-sample data, a different meta-task is sampled in each training episode, so that over the course of training the model sees many different combinations of categories. This mechanism lets the model learn what the different meta-tasks have in common, such as how to extract important features and how to compare sample similarity. A model trained in this way can classify well when it faces a new, unseen meta-task.
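The episode sampling described above is commonly organized as N-way K-shot sampling. The sketch below shows one way to draw a meta-task; the function name `sample_episode`, its parameters and the toy data are assumptions for illustration and do not appear in the source.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one meta-task: n_way classes, k_shot support and n_query
    query samples per class. Returns (support, query)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = {}, []
    for label in classes:
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support[label] = picked[:k_shot]
        query.extend((text, label) for text in picked[k_shot:])
    return support, query


# Toy labeled data: five text types with ten messages each (placeholders).
data = {
    name: [f"{name}-msg-{i}" for i in range(10)]
    for name in ["product", "operation", "chat", "advertisement", "other"]
}

rng = random.Random(42)
support, query = sample_episode(data, n_way=3, k_shot=2, n_query=1, rng=rng)
```

Because each call draws a different class combination, repeated calls expose the model to many different meta-tasks over the course of training.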
The small sample learning approach adopted in the invention randomly samples the original five types of labeled data and then embeds the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embeddings. A dynamic routing method then traverses the support set, extracts sample semantics and induces class characteristics. Finally, a Euclidean distance algorithm measures the semantic relation between the Query sample and the class characteristics, which completes the classification.
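The induction and measurement steps can be sketched in plain Python as follows. This is an assumption-laden illustration: the embeddings are toy 2-D vectors rather than attention-LSTM outputs, any learned transformation applied before routing is omitted, squashing the Query vector before comparison is a choice made for this example, and all function names are hypothetical.

```python
import math


def squash(v):
    """Capsule-style squashing: keeps direction, maps the norm into [0, 1)."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        return [0.0] * len(v)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in v]


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]


def induce_class_vector(support_vecs, iterations=3):
    """Dynamic routing over the support samples of one class."""
    dim = len(support_vecs[0])
    logits = [0.0] * len(support_vecs)
    class_vec = [0.0] * dim
    for _ in range(iterations):
        coupling = softmax(logits)
        weighted = [sum(d * v[j] for d, v in zip(coupling, support_vecs))
                    for j in range(dim)]
        class_vec = squash(weighted)
        # Samples that agree with the class vector get more weight next round.
        logits = [b + sum(x * c for x, c in zip(v, class_vec))
                  for b, v in zip(logits, support_vecs)]
    return class_vec


def classify_query(query_vec, support_by_class):
    """Assign the query to the class whose induced vector is nearest."""
    class_vecs = {label: induce_class_vector(vecs)
                  for label, vecs in support_by_class.items()}
    q = squash(query_vec)  # assumption: put the query on the same scale
    return min(class_vecs, key=lambda label: math.dist(q, class_vecs[label]))


support_by_class = {
    "advertisement": [[2.0, 0.1], [1.8, -0.1]],
    "chat": [[0.1, 2.0], [-0.1, 1.9]],
}
predicted = classify_query([1.5, 0.2], support_by_class)
```

In this toy setup the query vector points along the same axis as the advertisement support vectors, so it lies closest to that class vector.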
Step four: providing a treatment plan according to risk level
In step two, the confidence that a text belongs to a large-category sample is obtained, and a batch of lower-confidence data is obtained by computing the micro-average against the threshold. The small sample model in step three can judge whether data to be predicted belongs to a small category in the existing labeled data. It also allows operators to label a few samples of a new category and feed them to the model to judge whether new data belongs to that category, which avoids retraining the main model.
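The hand-off between the two models can be sketched as a confidence-based router. The interpretation here is an assumption: the micro-average is taken to be micro-averaged precision over the samples the main classifier keeps, and samples whose confidence equals the threshold are sent to the small sample model. Function names and toy predictions are hypothetical.

```python
def route_by_confidence(predictions, threshold):
    """Split predictions into those accepted by the main classifier and
    those forwarded to the small sample model as Query samples.

    predictions: list of (item_id, {label: probability}) pairs.
    """
    accepted, queries = [], []
    for item_id, probs in predictions:
        label = max(probs, key=probs.get)
        if probs[label] > threshold:
            accepted.append((item_id, label))
        else:
            # Assumption: confidence equal to the threshold is also forwarded.
            queries.append(item_id)
    return accepted, queries


def micro_average_precision(accepted, gold):
    """Correct predictions divided by all accepted predictions."""
    if not accepted:
        return 0.0
    correct = sum(1 for item_id, label in accepted if gold[item_id] == label)
    return correct / len(accepted)


predictions = [
    ("m1", {"chat": 0.95, "advertisement": 0.05}),
    ("m2", {"chat": 0.55, "advertisement": 0.45}),
    ("m3", {"product": 0.90, "chat": 0.10}),
]
gold = {"m1": "chat", "m2": "advertisement", "m3": "product"}

accepted, queries = route_by_confidence(predictions, threshold=0.8)
score = micro_average_precision(accepted, gold)
```

Sweeping the threshold over a validation set and watching where the micro-average drops sharply is one way to choose the operating point described in step two.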
The invention mainly uses a multi-class model to compute the confidence that the data to be predicted belongs to a large category, outputs the lower-confidence results of the classification model, and performs secondary filtering on them with a small sample model, thereby addressing data imbalance at a small labeling cost and maintaining a high recognition rate for new types of data.
Embodiment two:
In combination with the session data type identification method based on a classification algorithm and small sample learning disclosed in the first embodiment, this embodiment discloses a specific implementation of a session data type identification system (hereinafter referred to as the "system") based on a classification algorithm and small sample learning.
Referring to fig. 5, the system includes:
data labeling unit 1: storing real-time group chat data in es, and labeling the group chat data by text type through a risk control module;
large category data identification unit 2: identifying large-category data from the labeled group chat data using a multi-class text classification model;
small sample data identification unit 3: identifying small-sample data from the labeled group chat data using a metric-based few-shot learning algorithm.
Specifically, the data labeling unit 1 further includes:
type proportion analysis module 11: performing text type proportion analysis on the group chat data through the risk control module.
Specifically, the large category data identification unit 2 includes:
model training module 21: training the text classification model according to the text type proportion analysis, and calculating the confidence of each prediction category with the text classification model;
confidence calculation module 22: setting a confidence threshold according to the training result, calculating a micro-average value of the text classification model over the cases where the maximum prediction-category confidence is greater than the confidence threshold, and, according to the micro-average value, selecting the group chat data whose confidence is smaller than the confidence threshold as Query samples.
Specifically, the small sample data identification unit 3 includes:
sampling sample embedding module 31: randomly sampling the labeled group chat data by text type, and then embedding the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embeddings;
category characteristic induction module 32: traversing the support set with a dynamic routing method to extract sample semantics and induce class characteristics;
semantic relation measurement module 33: measuring the semantic relation between the Query sample and the class characteristics of the support set using a Euclidean distance algorithm to complete small sample classification.
For the parts of the session data type identification system disclosed in this embodiment that are the same as in the session data type identification method disclosed in the first embodiment, please refer to the description of the first embodiment; details are not repeated here.
Embodiment three:
Referring to fig. 6, this embodiment discloses a specific implementation of a computer device. The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.
Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.
The processor 81 implements any of the session data type identification methods in the above embodiments by reading and executing computer program instructions stored in the memory 82.
In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 6, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.
The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 83 may also carry out data communication with other components, such as external equipment, image/data acquisition equipment, a database, external storage, and an image/data processing workstation.
In addition, in combination with the session data type identification method in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium for implementation. The computer-readable storage medium has computer program instructions stored thereon; when executed by a processor, the computer program instructions implement any of the session data type identification methods in the above embodiments.
The technical features of the embodiments described above may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (10)
1. A session data type identification method, comprising:
a data labeling step: storing real-time group chat data through ES, and labeling the group chat data by text type through a risk control module;
a large-category data identification step: identifying large-category data with a multi-class text classification model according to the labeled group chat data; and
a small-sample data identification step: identifying small-sample data with a metric-based few-shot learning algorithm according to the labeled group chat data.
2. The session data type identification method according to claim 1, wherein the data labeling step further comprises: performing text type proportion analysis on the group chat data through the risk control module.
3. The session data type identification method according to claim 2, wherein the large-category data identification step comprises:
a model training step: training the text classification model according to the text type proportion analysis, and calculating the confidence of a prediction category according to the text classification model; and
a confidence calculation step: setting a confidence threshold according to the training result, calculating a micro-average value of the text classification model when the maximum confidence of the prediction category is greater than the confidence threshold, and selecting, according to the micro-average value, the group chat data whose confidence is less than the confidence threshold as Query samples.
4. The session data type identification method according to claim 3, wherein the small-sample data identification step comprises:
a sample embedding step: randomly sampling the labeled group chat data by text type, and then embedding the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding;
a class feature induction step: traversing and extracting the sample semantics of the support set with a dynamic routing method, and inducing class features; and
a semantic relation measurement step: measuring the semantic relation between the Query sample and the class features of the support set with a Euclidean distance algorithm to complete small-sample classification.
5. A session data type identification system, comprising:
a data labeling unit, configured to store real-time group chat data through ES and to label the group chat data by text type through a risk control module;
a large-category data identification unit, configured to identify large-category data with a multi-class text classification model according to the labeled group chat data; and
a small-sample data identification unit, configured to identify small-sample data with a metric-based few-shot learning algorithm according to the labeled group chat data.
6. The session data type identification system according to claim 5, wherein the data labeling unit further comprises:
a type proportion analysis module, configured to perform text type proportion analysis on the group chat data through the risk control module.
7. The session data type identification system according to claim 6, wherein the large-category data identification unit comprises:
a model training module, configured to train the text classification model according to the text type proportion analysis and to calculate the confidence of a prediction category according to the text classification model; and
a confidence calculation module, configured to set a confidence threshold according to the training result, to calculate a micro-average value of the text classification model when the maximum confidence of the prediction category is greater than the confidence threshold, and to select, according to the micro-average value, the group chat data whose confidence is less than the confidence threshold as Query samples.
8. The session data type identification system according to claim 7, wherein the small-sample data identification unit comprises:
a sample embedding module, configured to randomly sample the labeled group chat data by text type and then embed the sampled samples based on an attention-LSTM, so that the embedding of each Query sample is a function of the support set embedding;
a class feature induction module, configured to traverse and extract the sample semantics of the support set with a dynamic routing method and to induce class features; and
a semantic relation measurement module, configured to measure the semantic relation between the Query sample and the class features of the support set with a Euclidean distance algorithm to complete small-sample classification.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the session data type identification method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the session data type identification method according to any one of claims 1 to 4.
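The confidence routing of claim 3 and the distance-based classification of claim 4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the 0.8 threshold, and the input format are hypothetical; the "micro-average value" is read here as micro-averaged precision over the accepted predictions; class features are taken as the plain mean of the support embeddings, swapped in for the dynamic routing induction; and embeddings are passed in as ready-made vectors rather than produced by an attention-LSTM.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical value; the claims only say the threshold is set from the training result.
CONFIDENCE_THRESHOLD = 0.8


def route_by_confidence(
    probabilities: Sequence[Sequence[float]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[Dict[int, int], List[int]]:
    """Split messages into accepted large-category predictions and Query samples.

    probabilities[i][k] is the classifier's confidence that message i is in class k.
    Messages whose top confidence exceeds the threshold keep that prediction;
    all others (including ties with the threshold, an assumption) become Query samples.
    """
    accepted: Dict[int, int] = {}
    query_indices: List[int] = []
    for i, row in enumerate(probabilities):
        best_class = max(range(len(row)), key=row.__getitem__)
        if row[best_class] > threshold:
            accepted[i] = best_class
        else:
            query_indices.append(i)
    return accepted, query_indices


def micro_precision(accepted: Dict[int, int], labels: Dict[int, int]) -> float:
    """Micro-averaged precision of the accepted predictions.

    For single-label classification this is correct predictions / all predictions.
    Reading the claimed "micro-average value" this way is an assumption.
    """
    if not accepted:
        return 0.0
    correct = sum(1 for i, predicted in accepted.items() if labels[i] == predicted)
    return correct / len(accepted)


def class_prototypes(
    support: Dict[str, List[Sequence[float]]],
) -> Dict[str, List[float]]:
    """Average the support embeddings of each class into one class feature vector.

    Mean pooling is a stand-in for the dynamic routing induction named in the claims.
    """
    prototypes: Dict[str, List[float]] = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(vector[d] for vector in vectors) / len(vectors) for d in range(dim)
        ]
    return prototypes


def classify_query(
    query: Sequence[float], prototypes: Dict[str, Sequence[float]]
) -> str:
    """Return the class whose feature vector is nearest to the Query embedding
    in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))
```

With a threshold of 0.8, a message scored `[0.95, 0.03, 0.02]` keeps its large-category prediction, while one scored `[0.40, 0.35, 0.25]` becomes a Query sample and is assigned to whichever class feature vector is nearest in Euclidean distance.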
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110968186.XA CN113988059A (en) | 2021-08-23 | 2021-08-23 | Session data type identification method, system, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110968186.XA CN113988059A (en) | 2021-08-23 | 2021-08-23 | Session data type identification method, system, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113988059A (en) | 2022-01-28 |
Family
ID=79735168
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110968186.XA Pending CN113988059A (en) | 2021-08-23 | 2021-08-23 | Session data type identification method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113988059A (en) |
-
2021
- 2021-08-23 CN CN202110968186.XA patent/CN113988059A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117807482A (en) * | 2024-02-29 | 2024-04-02 | 深圳市明心数智科技有限公司 | Method, device, equipment and storage medium for classifying customs clearance notes |
CN117807482B (en) * | 2024-02-29 | 2024-05-14 | 深圳市明心数智科技有限公司 | Method, device, equipment and storage medium for classifying customs clearance notes |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022141861A1 (en) | Emotion classification method and apparatus, electronic device, and storage medium | |
CN110019782B (en) | Method and device for outputting text categories | |
CN109872162B (en) | Wind control classification and identification method and system for processing user complaint information | |
CN112270196B (en) | Entity relationship identification method and device and electronic equipment | |
JP7153004B2 (en) | COMMUNITY Q&A DATA VERIFICATION METHOD, APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM | |
CN106874253A (en) | Recognize the method and device of sensitive information | |
CN113051356B (en) | Open relation extraction method and device, electronic equipment and storage medium | |
US9672475B2 (en) | Automated opinion prediction based on indirect information | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN108550054B (en) | Content quality evaluation method, device, equipment and medium | |
WO2021174812A1 (en) | Data cleaning method and apparatus for profile, and medium and electronic device | |
CN111950279B (en) | Entity relationship processing method, device, equipment and computer readable storage medium | |
CN113254643B (en) | Text classification method and device, electronic equipment and text classification program | |
CN109582788A (en) | Comment spam training, recognition methods, device, equipment and readable storage medium storing program for executing | |
US20170011480A1 (en) | Data analysis system, data analysis method, and data analysis program | |
CN111553318A (en) | Sensitive information extraction method, referee document processing method and device and electronic equipment | |
US20200143159A1 (en) | Search device, search method, search program, and recording medium | |
CN112183102A (en) | Named entity identification method based on attention mechanism and graph attention network | |
CN112052424A (en) | Content auditing method and device | |
CN111695357A (en) | Text labeling method and related product | |
CN111475651A (en) | Text classification method, computing device and computer storage medium | |
CN108268602A (en) | Analyze method, apparatus, equipment and the computer storage media of text topic point | |
CN113392920B (en) | Method, apparatus, device, medium, and program product for generating cheating prediction model | |
CN113051911B (en) | Method, apparatus, device, medium and program product for extracting sensitive words | |
CN109753646B (en) | Article attribute identification method and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||