CN111737464A - Text classification method and device and electronic equipment


Info

Publication number
CN111737464A
Authority
CN
China
Prior art keywords
model
target text
sub
feature extraction
sample
Legal status
Pending
Application number
CN202010540067.XA
Other languages
Chinese (zh)
Inventor
上官亚力
梁兆豪
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202010540067.XA
Publication of CN111737464A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text classification method, a text classification device, and electronic equipment. The method first converts a target text into a symbol string matching the target text; inputs the symbol string into a classification model comprising a first sub-model and a second sub-model; performs feature extraction on the symbol string through the first sub-model to obtain multiple groups of feature data of the symbol string; and classifies the multiple groups of feature data through the second sub-model to obtain a classification result for the target text. Through the first and second sub-models of the classification model, the method fully learns the contextual semantic information of the target text, and feature extraction and analysis by this two-layer network yield an accurate classification result, improving the accuracy of text classification. At the same time, the method requires no maintenance of keyword tables, which reduces labor cost.

Description

Text classification method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a text classification method and apparatus, and an electronic device.
Background
Games are a popular form of mass entertainment in which player speech may involve politics, advertising, abuse, and similar content; game speech therefore needs to be classified so that it can be monitored.
In the related art, there are three common classification methods for game speech. The first is keyword matching, in which the speech to be processed is matched against pre-stored keyword tables using regular expressions to obtain its category; however, this method requires maintaining a large number of keyword tables, wasting human resources. The second extracts word-segmentation features based on tf-idf (term frequency-inverse document frequency) and classifies them with a classifier; although it requires no keyword tables, its understanding of the contextual semantics of the speech to be processed is insufficient, which easily leads to poor classification accuracy. The third is classification based on a neural network (e.g., a fastText, word2vec, or TextCNN network), which can understand the context of the speech to be processed but has difficulty reaching an optimal solution, affecting the accuracy of speech classification.
Disclosure of Invention
The invention aims to provide a text classification method, a text classification device, and electronic equipment that improve the accuracy of speech classification.
In a first aspect, an embodiment of the present invention provides a text classification method, the method comprising: converting a target text into a symbol string matching the target text; inputting the symbol string into a pre-trained classification model, where the classification model comprises a first sub-model and a second sub-model; performing feature extraction on the symbol string through the first sub-model to obtain multiple groups of feature data of the symbol string; and classifying the multiple groups of feature data through the second sub-model to obtain a classification result for the target text.
In an optional embodiment, the step of converting the target text into a symbol string matching the target text includes: extracting the participles in the target text; converting each participle in the target text into a corresponding symbol according to a preset participle and symbol comparison dictionary; and forming the symbol string matching the target text from the symbols corresponding to the participles.
In an optional implementation manner, the step of extracting the word segmentation in the target text includes: deleting invalid characters in the target text; the invalid characters comprise spaces, expressions, URL addresses and system identifications; and extracting the participles from the target text after the invalid characters are deleted according to a preset rule.
In an alternative embodiment, the first sub-model comprises a plurality of feature extraction components connected in parallel; each feature extraction component is used for outputting a group of feature data of the symbol string; the step of classifying the plurality of groups of feature data through the second submodel to obtain the classification result of the target text comprises the following steps: receiving a plurality of groups of feature extraction data output by the plurality of feature extraction components through a second sub-model; and calculating the average characteristic value of the plurality of groups of characteristic data through the second submodel, inputting the average characteristic value into a preset classifier, and outputting the classification result of the target text.
In an alternative embodiment, the classification model is trained by: dividing a preset sample set to obtain a plurality of subsets; training an initial model of the first sub-model based on the plurality of sub-sets to obtain a trained first sub-model; inputting the samples in the plurality of subsets into the trained first submodel, and outputting sample characteristics corresponding to the samples in the plurality of subsets; and training an initial model of the second sub-model based on the sample characteristics to obtain the trained second sub-model.
In an alternative embodiment, the set of samples is determined by: setting a category label of a preset sample; calculating a feature value of a participle corresponding to each character in a preset sample; the characteristic values include: word frequency and inverse text frequency index; replacing characters with characteristic values lower than a preset threshold value in a preset sample by using characters corresponding to preset word segmentation to obtain an amplification sample, and setting a category label corresponding to the preset sample on the amplification sample; and determining the preset sample and the amplified sample with the set class label as a sample set.
In an optional embodiment, the initial model corresponding to the first sub-model includes a plurality of feature extraction components connected in parallel; the step of training the initial model of the first sub-model based on the plurality of subsets to obtain the final first sub-model includes: for each feature extraction component, performing the following operations: determining a test set of the current feature extraction component from the plurality of subsets; determining subsets of the plurality of subsets except the test set as a training set of the current feature extraction component; determining a target sample from a training set; inputting the target sample into the current feature extraction component to obtain an output result; calculating a loss value of a preset loss function based on the output result; and continuing to execute the step of determining the target sample from the training set until the loss value is converged, and obtaining the trained current feature extraction component.
In an optional embodiment, the step of determining the test set of the current feature extraction component from the plurality of subsets includes: and determining a test set corresponding to the current feature extraction component according to the test set corresponding to the feature extraction components except for the current feature extraction component in the plurality of feature extraction components and the plurality of subsets, so that each feature extraction component corresponds to a different test set.
In an alternative embodiment, the feature extraction component comprises a BERT model, and the preset loss function comprises a focal loss function.
In an optional embodiment, the trained first sub-model includes a plurality of trained feature extraction components; the step of inputting the samples in the plurality of subsets to the trained first sub-model and outputting the sample features corresponding to the samples in the plurality of subsets includes: for each trained feature extraction component, inputting samples in a test set corresponding to the current feature extraction component into the current feature extraction component to obtain sample features corresponding to the test set; wherein the test set comprises a subset of the plurality of subsets; the sum of the test sets corresponding to each trained feature extraction component is a plurality of subsets; and combining the sample characteristics corresponding to each trained characteristic extraction component to obtain the sample characteristics corresponding to the samples in the plurality of subsets.
In an alternative embodiment, the classification model described above is deployed in a first container via Kubernetes, and the step of converting the target text into a symbol string matching the target text is deployed in a second container via Kubernetes. The step of converting the target text into the symbol string then includes: acquiring the target text through the second container so as to convert it into the matching symbol string. The step of inputting the symbol string into the pre-trained classification model includes: calling the classification model in the first container through the second container, and inputting the symbol string into the classification model.
In an alternative embodiment, the first container includes a first sub-container and a second sub-container, and the step of deploying the classification model in the first container via Kubernetes comprises: deploying the first sub-model in the first sub-container using TensorFlow Serving; and deploying the second sub-model in the second sub-container using pickle.
In an optional embodiment, before the step of converting the target text into a symbol string matching the target text, the method further includes: acquiring the target text from a preset speech log of a Kafka system. After the step of classifying the multiple groups of feature data through the second sub-model to obtain the classification result of the target text, the method further includes: adding a category label to the target text based on the classification result; and writing the labeled target text back into the speech log to update it.
In a second aspect, an embodiment of the present invention provides a text classification apparatus, where the apparatus includes: the symbol conversion module is used for converting the target text into a symbol string matched with the target text; the symbol input module is used for inputting the symbol string into a classification model which is trained in advance, wherein the classification model comprises a first sub-model and a second sub-model; the characteristic extraction module is used for extracting the characteristics of the symbol string through the first sub-model to obtain a plurality of groups of characteristic data of the symbol string; and the classification module is used for classifying the multiple groups of characteristic data through the second submodel to obtain a classification result of the target text.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the text classification method according to any one of the foregoing embodiments.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement a text classification method as described in any one of the preceding embodiments.
The embodiment of the invention has the following beneficial effects:
the embodiments of the invention provide a text classification method, a text classification device, and electronic equipment, in which a target text is converted into a symbol string matching it; the symbol string is input into a pre-trained classification model comprising a first sub-model and a second sub-model; the first sub-model performs feature extraction on the symbol string to obtain multiple groups of feature data; and the second sub-model classifies the multiple groups of feature data to obtain the classification result of the target text. Through the first and second sub-models of the classification model, the method fully learns the contextual semantic information of the target text, and feature extraction and analysis by the two-layer network yield an accurate classification result, improving the accuracy of text classification; at the same time, the method requires no maintenance of keyword tables, reducing labor cost.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a text classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another text classification method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another text classification method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a training model using a Stacking strategy according to an embodiment of the present invention;
FIG. 5 is a flowchart of another text classification method according to an embodiment of the present invention;
FIG. 6 is a flowchart of another text classification method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the related art, there are generally three classification methods for game speech:
the first is a keyword matching method, which requires the establishment of different categories of keyword libraries to obtain different categories of keyword lists. According to the method, the to-be-processed speech is submitted to a processing platform or system, the to-be-processed speech is further subjected to regular matching with the keyword form, the matched target category is adopted as the category of the to-be-processed speech, however, a large number of keyword forms need to be maintained in the method, not only is frequent maintenance required in the maintenance process, but also personnel are required to screen and count a large number of keywords or phrases, and waste of human resources is caused.
The second extracts word-segmentation features based on tf-idf and classifies them with a classifier. Its tf-idf scores measure the feature value of a given keyword within the speech, but its understanding of the contextual semantics of the speech to be processed is insufficient; although more intelligent than keyword matching, its classification accuracy is not high.
The third is classification based on a neural network, using a fastText, word2vec, or TextCNN network. The fastText network segments text in an n-gram manner; for example, a 2-gram segmentation of "we have eaten" yields the overlapping units "we have" and "have eaten". It vectorizes words through word embeddings, averages the embedding and n-gram vectors, and outputs a classification label through a hidden layer.
The word2vec network typically selects five words or phrases from a passage, masks the third word, and predicts it through a preset network structure, thereby learning contextual relationships; this can be cast as a classification problem, but one with only a positive class. To introduce negative classes, the word2vec features of each word in the paragraph can be learned through negative sampling, after which the words or phrases are classified with a classifier. The TextCNN network represents each word of the speech to be processed as a word vector of fixed dimension, expands the vectors into a two-dimensional matrix in Euclidean space, obtains features through convolution and pooling, and produces classification probabilities through a softmax output layer.
Among these neural-network methods, fastText trains quickly, but, like word2vec, the semantic span it understands is limited by the size of the sliding window. Although TextCNN can increase its convolution size, experimental results show an upper limit imposed by the capacity of the model, so it is not an optimal solution, which affects the accuracy of speech classification. In addition, when such a neural network model is deployed as a deep learning model via Flask and Connexion, the model and the preprocessing are coupled, which is inconvenient to manage.
Based on the above description, embodiments of the present invention provide a text classification method, an apparatus, and an electronic device. The technique can be applied to various classification scenarios for text data and speech-to-text data, especially the classification of typed speech exchanged between players. To facilitate understanding of the embodiments, the text classification method disclosed herein is first described in detail; as shown in fig. 1, the method includes the following steps:
step S102, converting the target text into a symbol string matched with the target text.
The target text can be data in any text form: a sentence, a paragraph, a chapter, or a combination of several sentences, and it may include punctuation marks or other special characters. In a specific implementation, the target text may be typed communication between the user and other users, for example chat between players in a game; it may be text converted from speech uttered by the user; or it may be text edited or input by the user.
Data in text form is generally difficult for a machine to recognize, so for machine recognition the target text may be converted, according to a preset rule, into a symbol string matching it, where each symbol in the string corresponds one-to-one to a participle (including Chinese characters, punctuation marks, and so on) in the target text; for example, "have you eaten?" may be converted into "152645488210120".
And step S104, inputting the symbol string into a classification model which is trained in advance, wherein the classification model comprises a first sub-model and a second sub-model.
And step S106, performing feature extraction on the symbol string through the first sub-model to obtain a plurality of groups of feature data of the symbol string.
And S108, classifying the multiple groups of characteristic data through the second sub-model, and determining the classification result of the target text.
The classification categories of the texts are preset by the user and may include advertisement, pornography, abuse, politics, and other categories. Text samples of each category are obtained through keyword matching and manual review; each text sample carries its own classification label, and a text sample may be a symbol string after symbol conversion. In a specific implementation, the classification model can be obtained by training on a large number of text samples. It comprises a two-layer network: the first layer is the first sub-model and the second layer is the second sub-model. The first sub-model may be a neural network or deep learning model, and the second sub-model may be a classifier, for example a Support Vector Machine (SVM) classifier, a logistic regression (LR) classifier, and the like.
In the training process of the classification model, the first sub-model is trained on the text samples to obtain the trained first sub-model; the text samples are then input into the trained first sub-model, which outputs their feature data (also called prediction data); the feature data is input into the second sub-model, and the second sub-model is trained on it to obtain the trained second sub-model.
When classifying a target text, the obtained target text is first converted into a matching symbol string; the symbol string is input into the trained first sub-model, which performs feature extraction to obtain multiple groups of feature data of the symbol string; the groups of feature data are then input into the trained second sub-model, which outputs the classification result of the target text. In a specific implementation, the trained second sub-model may average the groups of feature data and produce the classification result from the averaged feature value: the result contains a probability value for each preset category, the category with the maximum probability value is taken as the category of the target text, and the corresponding category label is set for it. For example, given three categories 1, 2, and 3 with probability values [0.333, 0.111, 0.556], category 3, corresponding to 0.556, is determined as the category of the target text.
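As a minimal sketch of this averaging-and-argmax step (the function names and the scikit-learn-style classifier API are assumptions for illustration, not taken from the patent):

    import numpy as np

    def classify(feature_groups, classifier):
        """feature_groups: list of per-component feature vectors for one text."""
        avg = np.mean(np.stack(feature_groups), axis=0)  # element-wise average
        probs = classifier.predict_proba([avg])[0]       # assumed fitted classifier, e.g. SVC(probability=True)
        return int(np.argmax(probs))                     # index of the maximum probability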
In the text classification method provided by the embodiment of the invention, the target text is first converted into a symbol string matching it; the symbol string is input into a pre-trained classification model comprising a first sub-model and a second sub-model; the first sub-model performs feature extraction on the symbol string to obtain multiple groups of feature data; and the second sub-model classifies those groups of feature data to obtain the classification result of the target text. Through the first and second sub-models of the classification model, the method fully learns the contextual semantic information of the target text, and feature extraction and analysis by the two-layer network yield an accurate classification result, improving the accuracy of text classification; at the same time, the method requires no maintenance of keyword tables, reducing labor cost.
The embodiment of the present invention further provides another text classification method, which is implemented on the basis of the above embodiment, and the method mainly describes a specific process of converting a target text into a symbol string matched with the target text (specifically, implemented by the following steps S202 to S206); as shown in fig. 2, the method comprises the following specific steps:
step S202, extracting the participles in the target text.
In extracting the participles of the target text, the text needs to be split into single Chinese characters or phrases to obtain the participles corresponding to it. Specifically, each Chinese character in the target text can be regarded as a participle, a phrase can be regarded as a participle, and a punctuation mark can be regarded as a participle. In a specific implementation, the target text may be segmented into participles with the jieba segmentation tool or a similar tool; for example, "have you eaten?" may be split into "eat", "meal", "got", "did", and "?".
In a specific implementation, the step S202 can be implemented by the following steps 10-11:
step 10, deleting invalid characters in the target text; the invalid character includes a space, an emoticon, a URL (Uniform Resource Locator) address, and a system identifier.
Before the target text undergoes symbol conversion, it needs to be preprocessed, that is, invalid characters are removed from it, which improves the efficiency of symbol conversion. An invalid character is generally one that has no effect on the semantics of the target text, such as a space, an emoticon, a URL address, a system identifier, an invalid character string (e.g., garbled text), or a stop word. The system identifier may be a brand's fixed logo, system-prompt speech within game chat, speech of a non-player character (NPC), and so on; what counts as a system identifier generally differs between application scenarios and is not specifically limited here.
Step 11, extracting participles from the target text after deleting the invalid characters according to a preset rule; namely, the target text with the invalid characters deleted is subjected to word segmentation operation through a jieba word segmentation tool or other word segmentation rules to obtain a plurality of word segments. The preset rule may be splitting according to a single Chinese character, splitting according to a phrase, or other word segmentation rules, and is not specifically limited herein.
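A possible preprocessing-and-segmentation sketch using the jieba tool mentioned above; the regular expressions for emoticons and system identifiers are placeholders, since the patent leaves their exact form scenario-specific:

    import re
    import jieba  # Chinese word-segmentation tool named in the patent

    def extract_participles(text):
        text = re.sub(r'https?://\S+', '', text)  # delete URL addresses
        text = re.sub(r'\[[^\]]*\]', '', text)    # delete bracketed emoticon codes (assumed format)
        text = re.sub(r'\s+', '', text)           # delete spaces
        return list(jieba.cut(text))              # split the remaining text into participles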
Step S204, converting each participle in the target text into a corresponding symbol according to a preset participle and symbol comparison dictionary.
The participle and symbol comparison dictionary is set by the user in advance and contains a large number of participles and their corresponding symbols, stored in a form such as "I -> 1", "he -> 2", i.e., the symbol corresponding to the participle "I" is "1" and the symbol corresponding to "he" is "2". In a specific implementation, the dictionary contains symbols for a large number of participles but cannot cover them all, and the user can add new participles to it as needed. In some embodiments, the dictionary may include 21128 participles; a participle outside this range may be encoded with a uniform symbol.
And step S206, forming a character string matched with the target text by the symbols corresponding to each participle.
Each participle of the target text is matched against the participles in the dictionary, the symbol of the matched dictionary participle is taken as the symbol for that participle, and the symbols for all participles are assembled into the symbol string matching the target text; for example, "have you eaten?" is symbolized as "152645488210120".
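For illustration, this dictionary lookup might be sketched as follows; the tiny vocabulary and the choice of 0 as the uniform out-of-vocabulary symbol are assumptions:

    vocab = {"我": 1, "他": 2}  # hypothetical participle-to-symbol dictionary entries
    UNK = 0                     # uniform symbol for participles outside the dictionary

    def to_symbol_string(participles):
        # Map each participle to its symbol; unseen participles get the uniform symbol.
        return "".join(str(vocab.get(p, UNK)) for p in participles)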
And step S208, inputting the symbol strings into a classification model trained in advance to obtain a classification result of the target text.
In the text classification method above, the participles of the target text are first extracted; each participle is converted into a corresponding symbol according to the preset participle and symbol comparison dictionary, and the symbols corresponding to the participles form the symbol string matching the target text; the symbol string is then input into the pre-trained classification model to obtain the classification result of the target text. Symbolizing the participles of the target text according to the preset dictionary makes the text easier for the model to recognize, improves recognition efficiency, and benefits the subsequent classification by the classification model.
The embodiment of the present invention further provides another text classification method, which is implemented on the basis of the above-mentioned embodiment, and the method mainly describes a specific process of training a classification model (specifically, implemented by the following steps S302 to S308), as shown in fig. 3, the method includes the following specific steps:
step S302, a preset sample set is divided to obtain a plurality of subsets.
In a specific implementation, the sample set may be divided into a plurality of subsets according to the average number of samples, that is, the number of samples included in each subset is the same, and the number of samples included in each category in each subset is also the same; the division may be performed according to other rules. Specifically, the sample set may be determined by the following steps 20 to 23:
and 20, setting a category label of the preset sample.
The preset samples are symbol strings corresponding to speech collected in text form. The category label of a preset sample is determined through keyword matching or manual review and is then set on the sample. For example, when setting category labels by keyword matching, keywords corresponding to the texts of each category are set in advance; the participle corresponding to each character of the preset sample is matched against the keywords, and the category of the successfully matched keyword is determined as the category of the preset sample (advertisement-category keywords might be "substitute selling", "selling", and the like). The category label may take the form of numbers, characters, or letters, e.g., the label of the advertisement category may be set to aa and the label of the politics category to bb.
Step 21, calculating a feature value of a word segmentation corresponding to each character in the preset sample; the characteristic values include: word frequency and inverse text frequency index.
In a specific implementation, the tf-idf value can be used as the feature value of the participle corresponding to each symbol; in other words, the importance of each participle is measured with the tf-idf method, computed as follows.
First, calculate the word frequency of each participle:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $tf_{i,j}$ is the word frequency of the i-th participle in preset sample j, $n_{i,j}$ is the number of times the i-th participle appears in preset sample j, and $n_{k,j}$ is the number of times the k-th participle appears in preset sample j; the denominator is therefore the total number of occurrences of all participles in preset sample j.
Then calculate the inverse text frequency index of each participle:

$$idf_{i} = \log \frac{|D|}{|\{\, j : t_{i} \in d_{j} \,\}|}$$

where $idf_{i}$ is the inverse text frequency index of the i-th participle given preset sample j, $|D|$ is the total number of samples sharing the class label of preset sample j, and $|\{ j : t_{i} \in d_{j} \}|$ is the number of samples $d_{j}$ containing the i-th participle $t_{i}$. If the i-th participle appears in none of the samples sharing the class label of preset sample j, this denominator is zero, so it can be replaced by $1 + |\{ j : t_{i} \in d_{j} \}|$ to keep the computation well defined.
For the participle corresponding to each character in the preset sample, multiplying its word frequency by its inverse text frequency index gives its tf-idf value, i.e., its feature value; a larger feature value indicates a more important participle.
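A direct transcription of the two formulas above into Python (the helper names are ours; samples are represented as lists of participles, and the smoothed denominator from the text is used throughout):

    import math
    from collections import Counter

    def tf_idf(participle, sample, same_label_samples):
        counts = Counter(sample)
        tf = counts[participle] / sum(counts.values())               # word frequency tf_{i,j}
        containing = sum(1 for d in same_label_samples if participle in d)
        idf = math.log(len(same_label_samples) / (1 + containing))   # smoothed idf_i
        return tf * idf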
And step 22, replacing the characters with the characteristic values lower than the preset threshold value in the preset sample by the characters corresponding to the preset participles to obtain an amplification sample, and setting a category label corresponding to the preset sample on the amplification sample.
The preset participles are replacement words, set in advance by the user, that do not change the semantics of the preset sample; for example, "delete" may be replaced by "remove", so as to enrich the samples. In a specific implementation, the characters to be replaced are determined from the feature values of the participles of the preset sample: characters whose feature values fall below a preset threshold may be selected, or the character with the lowest feature value may be selected, ensuring that the replaced characters are of low importance within the preset sample. Replacing such characters with the characters corresponding to the preset participles yields an amplified sample whose semantics are the same as the preset sample's, so the amplified sample is given the category label of the preset sample. For example, if the preset sample is "123456", where 5 is a character of low importance and 9 (corresponding to a preset participle) corresponds to a participle with a meaning similar to that of the participle corresponding to 5, then 5 can be replaced by 9, producing the amplified sample "123496".
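A sketch of this replacement step, under the simplifying assumption that a synonym table mapping low-importance symbols to preset-participle symbols is available:

    def amplify(sample, feature_values, synonym_map):
        """sample: list of symbols; feature_values: tf-idf score per position."""
        idx = min(range(len(sample)), key=lambda i: feature_values[i])  # lowest-importance position
        amplified = list(sample)
        if sample[idx] in synonym_map:
            amplified[idx] = synonym_map[sample[idx]]  # e.g. replace 5 with 9
        return amplified  # inherits the preset sample's category label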
In some embodiments, besides replacing the participle corresponding to a character of the preset sample, the word order of the preset sample may also be adjusted without changing its semantics to obtain an amplified sample, further enriching the sample set.
And step 23, determining the preset sample and the amplification sample with the class label as a sample set.
In a specific implementation, steps 20 to 22 above may be performed for each acquired preset sample, and the preset samples and amplified samples are stored together as the sample set; amplifying the sample set in this way benefits the subsequent training of the classification model.
Step S304, training an initial model of the first sub-model based on the plurality of subsets to obtain the trained first sub-model.
The samples in the subsets are input in turn into the initial model of the first sub-model to obtain output results, and the model parameters of the initial model are adjusted automatically based on those results until the initial model converges or a preset number of training iterations is reached, yielding the trained first sub-model.
In a specific implementation, the initial model corresponding to the first submodel includes a plurality of feature extraction components connected in parallel, and the network structures of the plurality of feature extraction components included in the initial model corresponding to the first submodel may be the same or different, but the training mode of each feature extraction component is the same; therefore, in training the first submodel, the following steps 30-33 need to be performed for each feature extraction component to obtain a trained feature extraction component:
step 30, determining a test set of the current feature extraction component from the plurality of subsets; and determining the subsets except the test set in the plurality of subsets as the training set of the current feature extraction component.
Before the current feature extraction component is trained, its corresponding test set and training set are determined. Each feature extraction component of the initial model of the first sub-model has a different test set and training set, so the network parameters obtained after training differ between components, which allows more accurate classification.
In a specific implementation, the test set corresponding to the current feature extraction component may be determined according to the test set corresponding to the feature extraction component except for the current feature extraction component among the plurality of feature extraction components and the plurality of subsets, so that each feature extraction component corresponds to a different test set.
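One simple assignment satisfying this requirement rotates the held-out subset so that component k tests on fold k; a sketch under that assumption:

    def fold_assignments(subsets):
        """Give the k-th feature extraction component subset k as its test set."""
        assignments = []
        for k in range(len(subsets)):
            test_set = subsets[k]
            train_set = [s for i, s in enumerate(subsets) if i != k]
            assignments.append((train_set, test_set))  # (training folds, held-out fold)
        return assignments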
Step 31, determining a target sample from the training set.
The training set usually contains a large number of samples, from which one sample can be arbitrarily determined as a target sample.
And 32, inputting the target sample into the current feature extraction component to obtain an output result.
The current feature extraction component may perform feature extraction on the target sample to obtain feature data of the target sample, where the feature data is the output result.
Step 33, calculating a loss value of a preset loss function based on the output result; and continuing to execute the step of determining the target sample from the training set until the loss value is converged, and obtaining the trained current feature extraction component.
The output result is substituted into the preset loss function to obtain a loss value; the larger the loss value, the further the output is from what is desired, so the network parameters of the current feature extraction component are adjusted. A new target sample is then determined from the training set and input into the parameter-adjusted current feature extraction component to obtain a new output result, and the loss value of the preset loss function is recomputed from it; this repeats until the loss value converges, at which point adjustment of the network parameters stops and the trained feature extraction component is obtained.
In a specific implementation, after every feature extraction component has been trained, multiple parallel trained feature extraction components are obtained, and together they serve as the trained first sub-model. In some embodiments, the feature extraction component may be a BERT (Bidirectional Encoder Representations from Transformers) model. BERT is built as a multi-layer bidirectional Transformer encoder, so when it processes a participle it can take into account the participles both before and after it; it thus captures contextual semantics, allowing the model to learn the semantic information of the text more fully.
Because the number of samples per category in the sample set may differ greatly, the preset loss function may adopt the focal loss function, which was proposed mainly to address severely imbalanced positive and negative sample ratios (or imbalanced data sets) in object detection. Suppose positive samples are numerous and negative samples are few, with positive samples denoted by y = 1 and negative samples by y = 0; the focal loss function L_fl can then be expressed as:
$$L_{fl} = \begin{cases} -\alpha \,(1 - y')^{\gamma} \log y', & y = 1 \\ -(1 - \alpha)\,(y')^{\gamma} \log (1 - y'), & y = 0 \end{cases}$$
where α is a balancing weight factor for the cross entropy, a preset value in [0, 1]; γ is a preset focusing coefficient; α and γ are both known quantities; and y' is the output produced when the target sample is input into the current feature extraction component. In a specific implementation, positive samples are those belonging to categories with many samples in the training data, and negative samples are those belonging to categories with few.
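A NumPy transcription of this two-branch loss; α = 0.25 and γ = 2 are common defaults from the focal loss literature, not values given in the patent:

    import numpy as np

    def focal_loss(y_prime, y, alpha=0.25, gamma=2.0):
        """y_prime: predicted probabilities; y: labels (1 positive, 0 negative)."""
        y_prime = np.clip(y_prime, 1e-7, 1 - 1e-7)                   # avoid log(0)
        pos = -alpha * (1 - y_prime) ** gamma * np.log(y_prime)      # y = 1 branch
        neg = -(1 - alpha) * y_prime ** gamma * np.log(1 - y_prime)  # y = 0 branch
        return np.where(y == 1, pos, neg).mean()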
Step S306, inputting the samples in the plurality of subsets into the trained first sub-model, and outputting the sample features corresponding to the samples in the plurality of subsets.
And after the training of the first sub-model is finished, sequentially inputting the samples in the plurality of sub-sets into the trained first sub-model to obtain the sample characteristics corresponding to each sample, and using the sample characteristics as training data of the second sub-model.
In a specific implementation, because the initial model of the first sub-model comprises multiple feature extraction components connected in parallel, the trained first sub-model comprises multiple trained feature extraction components. Based on this, step S306 above can be realized through the following steps 40 to 41:
Step 40: for each trained feature extraction component, input the samples of its corresponding test set into it to obtain the sample features corresponding to that test set; the test set comprises one subset of the plurality of subsets, and the test sets of all the trained feature extraction components together make up the plurality of subsets.
Based on step 30 above, each trained feature extraction component has its own training set and test set: the training set is used to train the component, and the test set is used both to test the trained component and to provide training data for the second sub-model. Each component's test set is one of the subsets, the test sets of different components are different, and together the test sets make up all of the subsets; that is, the total number of test sets equals the number of subsets, ensuring sufficient training data for the second sub-model.
And step 41, combining the sample features corresponding to each trained feature extraction component to obtain the sample features corresponding to the samples in the plurality of subsets.
And respectively inputting the samples in each test set into the corresponding trained feature extraction components to obtain the sample features corresponding to each test set, namely obtaining the sample features corresponding to each trained feature extraction component, and combining all the sample features to obtain the sample features corresponding to the samples in the plurality of subsets.
And step S308, training an initial model of the second sub-model based on the sample characteristics to obtain the trained second sub-model.
And taking the sample characteristics obtained by the first submodel as new characteristics corresponding to each sample in a plurality of subsets to obtain a training sample corresponding to the second submodel, and training an initial model of the second submodel through the training sample to obtain the trained second submodel.
Taking a model trained with a 5-fold Stacking strategy as an example, the training of the first and second sub-models is described in detail. As shown in fig. 4, assume the initial model of the first sub-model comprises 5 parallel feature extraction components, all BERT models, denoted Bert-1, Bert-2, Bert-3, Bert-4, and Bert-5, and that the second sub-model is an SVM classifier; Train in fig. 4 denotes a training set and Test a test set.
Because the 5-fold Stacking strategy is used, the sample set must be divided evenly into 5 subsets (with the same number of samples per category in each subset); the 5 parallel BERT models form the first layer of the stacking, and the SVM classifier forms the second layer. Five identical initial BERT models undergo 5-fold cross-validation: four folds serve as the training set and the remaining fold as the test set. The 5 subsets seen by each BERT model in the first layer are the same, but the first subset serves as Bert-1's test set, the second as Bert-2's, and so on, until the fifth subset serves as Bert-5's test set; other assignments may equally be used, as long as each BERT model's test set is different.
Each cross-validation run comprises two processes: training a BERT model on its training set, and predicting its test set with the trained model. After each BERT model's cross-validation completes, the trained model and the predicted values (i.e., the sample features) for the samples in its test set are obtained; since each BERT model's training set differs, the trained models have different network parameters. The predicted values for the samples in each trained model's test set are then concatenated into predicted values covering the whole sample set; these are new features of each sample in the original set, and combining the original samples with the new features yields a new sample set. This new sample set is used as the training set of the SVM classifier, producing the trained SVM. In a specific implementation, the loss function for BERT training may be the focal loss function.
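The out-of-fold mechanics can be sketched with scikit-learn; here a lightweight stand-in model replaces the BERT components, and the per-sample predictions are kept one-dimensional for brevity:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    def train_stacking(X, y, n_folds=5):
        oof = np.zeros(len(X))                 # out-of-fold predictions = new features
        base_models = []
        for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
            base = LogisticRegression().fit(X[train_idx], y[train_idx])  # stand-in for one Bert-k
            oof[test_idx] = base.predict(X[test_idx])  # predict only the held-out fold
            base_models.append(base)
        meta = SVC().fit(oof.reshape(-1, 1), y)        # second-layer SVM on the new features
        return base_models, meta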
Step S310, converting the obtained target text into a symbol string matched with the target text.
Step S312, inputting the symbol string into the trained first sub-model to obtain a plurality of groups of feature data of the symbol string.
In a specific implementation, the trained first sub-model comprises a plurality of parallel feature extraction components; each feature extraction component is used for outputting a group of feature data of the symbol string; therefore, the symbol strings corresponding to the target text are respectively input into a plurality of parallel feature extraction components, a group of feature data output by each feature extraction component can be obtained, and the feature data output by all the feature extraction components are combined to obtain a plurality of groups of feature data.
Step S314, inputting the multiple groups of feature data into the trained second sub-model, so that the trained second sub-model calculates average feature values of the multiple groups of feature data, and inputs the average feature values into a preset classifier, and outputs a classification result of the target text.
Specifically, the second sub-model calculates the average feature value of the multiple groups of feature data, inputs the average feature value into the preset classifier, and outputs the classification result of the target text. The preset classifier may be an SVM classifier, an LR classifier, or the like. Each group of feature data is a feature extracted from the symbol string of the target text (it may also be viewed as a classification prediction for that string); to avoid redundancy, the groups of feature data are averaged to obtain an average feature value, which is sent to the classifier to obtain the classification result of the target text.
In the text classification method above, before a target text is classified, the sample set is divided into multiple subsets; the initial model of the first sub-model is trained on these subsets to obtain the trained first sub-model; the samples in the subsets are input into the trained first sub-model to output the corresponding sample features; and the initial model of the second sub-model is trained on those sample features to obtain the trained second sub-model. Finally, the symbol string of the obtained target text is input into the trained first sub-model to obtain multiple groups of feature data, the groups of feature data are input into the trained second sub-model, which computes their average feature value, and the average feature value is input into the preset classifier, which outputs the classification result of the target text. The first and second sub-models trained in this way classify the target text accurately, overcoming the inflexibility of existing keyword-matching classification and improving classification precision and recall; at the same time, the text classification method of this embodiment generalizes well.
An embodiment of the present invention further provides another text classification method, implemented on the basis of the above embodiments, in which the classification model is deployed in a first container via Kubernetes and the step of converting the target text into a matching symbol string is deployed in a second container via Kubernetes. Kubernetes is an open-source container orchestration engine for automated deployment, scaling, and management of containerized applications; it has built-in fault detection and self-healing, rolling service upgrades and online scaling, an extensible automatic resource scheduler, and multi-granularity resource quota management. Deploying containers with Kubernetes therefore allows automatic scaling and multi-container integration; for example, capacity can be expanded when more text data must be processed in real time or when more users are online. A container here may be a Deployment in Kubernetes.
To decouple the classification model from text processing, the two are deployed in different containers via Kubernetes; text processing here comprises converting the target text into a matching symbol string and the subsequent handling of the classification result of the target text. Meanwhile, with the classification model in its own container, updating the model only requires redeploying that container, which is convenient.
In a specific implementation, the first container includes a first sub-container and a second sub-container. When the classification model is deployed in the first container through Kubernetes, the first sub-model may be deployed in the first sub-container using TensorFlow Serving, and the second sub-model may be deployed in the second sub-container using pickle; the first sub-container and the second sub-container are then wrapped into a RESTful API (an interface in the REpresentational State Transfer style) using the Connexion framework, so that the symbol string of the target text can be input into the first sub-container through this interface.
TensorFlow Serving is a serving system suitable for deploying machine learning models; it is flexible, high-performance and production-ready, and supports model version control and rollback, concurrency and high throughput, distributed models, and so on. pickle is generally used for the serialization and deserialization of objects: serialization converts an object into a binary byte stream and writes it to a file in binary form, while deserialization reads the binary file and converts it back into the object itself.
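For illustration, a minimal pickle round trip for the second sub-model might look as follows (the file name is a placeholder, and the untrained SVC merely stands in for the trained model):

```python
import pickle
from sklearn.svm import SVC

model = SVC()  # stands in for the trained second sub-model

# Serialization: convert the object into a binary byte stream and
# write it to a file in binary form.
with open("second_submodel.pkl", "wb") as f:
    pickle.dump(model, f)

# Deserialization: read the binary file and convert it back into the
# object itself.
with open("second_submodel.pkl", "rb") as f:
    restored_model = pickle.load(f)
```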
In a specific implementation, the first sub-model and the second sub-model are deployed in different containers, which makes the models more independent. When the first sub-model includes a plurality of parallel feature extraction components for predicting the text category, the components can be run in a multi-threaded manner, with each thread corresponding to one feature extraction component, thereby shortening the processing time.
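A thread-pool sketch of this per-component parallelism (assuming each feature extraction component exposes a predict method; the names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features_parallel(components, symbol_string):
    # One thread per feature extraction component; each thread returns
    # one group of feature data for the same symbol string.
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        futures = [pool.submit(c.predict, symbol_string) for c in components]
        return [f.result() for f in futures]
```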
In some embodiments, ways other than Kubernetes may also be used to deploy the classification model and the text processing flow in containers, so as to achieve the same decoupling.
Based on this container deployment, the embodiment of the invention mainly describes the specific process of converting a target text into a matching symbol string (implemented by step S504 below) and inputting the symbol string into the pre-trained classification model to obtain the classification result of the target text (implemented by step S506 below); as shown in fig. 5, the method includes the following steps:
Step S502, obtaining a target text from a preset statement log of the Kafka system.
Kafka is a high-throughput, distributed publish-subscribe messaging system that can handle all action-stream data in a consumer-scale website; because of this throughput, it is commonly used for log processing and log aggregation. Messages in Kafka (here equivalent to texts) are managed in units of Topics, and the Kafka cluster is responsible for the partitioned log data of each Topic. In a specific implementation, the text generated by each user can be written into a Kafka Topic, and the target text is then read from the statement log corresponding to that Topic.
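For example, with the kafka-python client the statement log could be consumed roughly as follows (the Topic name and broker address are placeholders):

```python
from kafka import KafkaConsumer  # kafka-python package, one possible client

consumer = KafkaConsumer("statement-log", bootstrap_servers="localhost:9092")
for message in consumer:
    target_text = message.value.decode("utf-8")
    # ... pass target_text to the second container for symbol conversion
```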
Step S504, obtaining the target text through the second container so as to convert the target text into a symbol string matched with the target text.
Step S506, calling the classification model in the first container through the second container, and inputting the symbol string into the classification model, wherein the classification model comprises a first sub-model and a second sub-model.
Since the text processing flow is deployed in the second container, the obtained target text is converted into symbols by the second container to obtain the corresponding symbol string; the second container then calls the exposed interface or port of the first container, inputs the symbol string corresponding to the target text into the first container, classifies the symbol string through the classification model, and outputs the classification result of the target text corresponding to the symbol string.
And step S508, performing feature extraction on the symbol string through the first sub-model to obtain a plurality of groups of feature data of the symbol string.
And step S510, classifying the multiple groups of characteristic data through a second sub-model, and determining the classification result of the target text.
And step S512, adding a category label to the target text based on the classification result.
Step S514, inputting the target text added with the category label into the statement log to update the statement log.
That is, a category label matching the classification result is added to the target text, and the labelled text is sent to the Kafka Topic, so that in the statement log corresponding to the Topic the original target text is replaced by the target text with the category label added, thereby updating the statement log.
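A corresponding write-back sketch, again using kafka-python with placeholder names (the record layout is an assumption, as the embodiment does not specify one):

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def write_back(target_text, label):
    # Send the labelled text to the same Topic so that it supersedes
    # the original record in the statement log.
    record = {"text": target_text, "label": label}
    producer.send("statement-log", json.dumps(record).encode("utf-8"))
    producer.flush()
```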
In a specific implementation, if the target text is predicted to be related to pornography, abuse, politics or the like, that is, the category label of the target text corresponds to pornography, abuse, politics or the like, a corresponding warning can be triggered for the user. For example, when the target text is an in-game statement, the corresponding game character can be muted for a certain duration.
To facilitate understanding of the embodiment of the invention, the text classification process is described in detail, as shown in fig. 6, taking as an example a trained first sub-model that includes five parallel feature extraction components, Bert-1, Bert-2, Bert-3, Bert-4 and Bert-5, and a second sub-model whose classifier is an SVM classifier. First, a target text is obtained from the statement log of the Kafka system (Kafka-log in fig. 6), and the target text is symbolized to obtain the corresponding symbol string. The symbol string is input into Bert-1 to Bert-5 respectively to obtain five groups of feature data, namely predict-1 to predict-5 in fig. 6, and these are averaged to obtain the average feature value (Average in fig. 6). The average feature value is then input into the SVM classifier to obtain the classification result of the target text (result in fig. 6), a category label is added to the target text according to the classification result, and the labelled target text is written back into Kafka-log to update it.
According to the above text classification method, a real-time processing platform comprising the first container and the second container is built through Kubernetes, so that the classification model and the text processing flow are decoupled, which facilitates model updates and speeds up text processing, while also making the real-time processing platform easier for users to manage.
Corresponding to the above method embodiment, an embodiment of the present invention provides a text classification apparatus, as shown in fig. 7, where the apparatus includes:
and a symbol conversion module 70 for converting the target text into a symbol string matching the target text.
And the symbol input module 71 is configured to input the symbol string into a classification model which is trained in advance, where the classification model includes a first sub-model and a second sub-model.
And the feature extraction module 72 is configured to perform feature extraction on the symbol string through the first sub-model to obtain multiple sets of feature data of the symbol string.
And the classification module 73 is configured to classify the multiple sets of feature data through the second sub-model to obtain a classification result of the target text.
The above text classification apparatus first converts a target text into a symbol string matched with the target text; inputs the symbol string into a pre-trained classification model including a first sub-model and a second sub-model; performs feature extraction on the symbol string through the first sub-model to obtain a plurality of groups of feature data of the symbol string; and then classifies the groups of feature data through the second sub-model to obtain the classification result of the target text. With this apparatus, the context semantic information of the target text can be fully learned through the first and second sub-models, and an accurate classification result can be obtained by extracting and analyzing features of the target text through the two-layer network model, which improves the accuracy of text classification while requiring no maintenance of a keyword table, thereby reducing labor cost.
Further, the symbol conversion module 70 includes: a word segmentation extracting unit, configured to extract the participles in the target text; and a symbol combination unit, configured to convert each participle in the target text into a corresponding symbol according to a preset participle and symbol comparison dictionary, and to form the symbol string matched with the target text from the symbols corresponding to the participles.
Specifically, the word segmentation extracting unit is configured to: delete the invalid characters in the target text, where the invalid characters include spaces, emoticons, URL addresses and system identifiers; and extract the participles, according to a preset rule, from the target text with the invalid characters deleted.
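A rough sketch of this cleanup and symbol conversion (the regular expressions, the per-character segmentation and the [UNK] fallback are simplifying assumptions; a real deployment would use its own segmenter and preset comparison dictionary):

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF]")  # rough emoticon range

def to_symbol_string(target_text, vocab):
    # Delete invalid characters: URL addresses, emoticons and spaces
    # (system identifiers are deployment specific and omitted here).
    text = URL_RE.sub("", target_text)
    text = EMOJI_RE.sub("", text)
    text = text.replace(" ", "")
    # Map each segment to its symbol via the comparison dictionary;
    # per-character segmentation stands in for a real segmenter here.
    return [vocab.get(ch, vocab.get("[UNK]", 0)) for ch in text]
```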
Further, the first sub-model comprises a plurality of parallel feature extraction components; each feature extraction component is used for outputting a group of feature data of the symbol string; the classification module 73 is configured to: receiving a plurality of groups of feature extraction data output by the plurality of feature extraction components through a second sub-model; and calculating the average characteristic value of the plurality of groups of characteristic data through the second submodel, inputting the average characteristic value into a preset classifier, and outputting the classification result of the target text.
Further, the apparatus further comprises a model training module configured to: dividing a preset sample set to obtain a plurality of subsets; training an initial model of the first sub-model based on the plurality of sub-sets to obtain a trained first sub-model; inputting the samples in the plurality of subsets into the trained first submodel, and outputting sample characteristics corresponding to the samples in the plurality of subsets; and training an initial model of the second sub-model based on the sample characteristics to obtain the trained second sub-model.
Further, the apparatus further includes a sample set determining module, configured to: set a category label for a preset sample; calculate a feature value of the participle corresponding to each character in the preset sample, where the feature values include the term frequency and the inverse document frequency index (TF-IDF); replace the characters whose feature values are lower than a preset threshold in the preset sample with characters corresponding to preset participles to obtain an amplified sample, and set the category label of the preset sample on the amplified sample; and determine the preset sample and the labelled amplified sample as the sample set.
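This augmentation step could be sketched as follows (a simplified smoothed TF-IDF; the threshold, the replacement token and the tokenized representation are all assumptions):

```python
import math
from collections import Counter

def augment_sample(tokens, corpus, replacement, threshold):
    # TF-IDF of each token in the preset sample relative to the corpus;
    # tokens scoring below the threshold are replaced with the preset
    # token, yielding an amplified sample with the same category label.
    tf = Counter(tokens)
    n_docs = len(corpus)

    def tfidf(tok):
        df = sum(tok in doc for doc in corpus)       # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1  # smoothed IDF
        return tf[tok] / len(tokens) * idf

    return [replacement if tfidf(t) < threshold else t for t in tokens]
```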
Specifically, the initial model corresponding to the first sub-model includes a plurality of parallel feature extraction components, and the model training module is configured to perform the following operations for each feature extraction component: determine a test set of the current feature extraction component from the plurality of subsets, and determine the subsets other than the test set as the training set of the current feature extraction component; determine a target sample from the training set; input the target sample into the current feature extraction component to obtain an output result; calculate a loss value of a preset loss function based on the output result; and continue to execute the step of determining a target sample from the training set until the loss value converges, obtaining the trained current feature extraction component.
The model training module is further configured to: and determining a test set corresponding to the current feature extraction component according to the test set corresponding to the feature extraction components except for the current feature extraction component in the plurality of feature extraction components and the plurality of subsets, so that each feature extraction component corresponds to a different test set.
Specifically, the feature extraction component includes a Bert model, and the preset loss function includes a focal loss function.
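For reference, a NumPy sketch of the focal loss over predicted class probabilities (the gamma and alpha defaults are the values commonly used in the focal loss literature, not values stated in the embodiment):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    # FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t): well-classified
    # samples (p_t near 1) are down-weighted so that training focuses
    # on hard ones.
    p_t = probs[np.arange(len(targets)), targets]  # prob. of true class
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))
```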
In a specific implementation, the trained first sub-model includes a plurality of trained feature extraction components, and the model training module is further configured to: for each trained feature extraction component, input the samples in the test set corresponding to the current feature extraction component into that component to obtain the sample features corresponding to the test set, where the test set is one of the plurality of subsets and the union of the test sets corresponding to all trained feature extraction components is the plurality of subsets; and combine the sample features corresponding to each trained feature extraction component to obtain the sample features corresponding to the samples in the plurality of subsets.
Specifically, the classification model is deployed in a first container through Kubernetes, and the step of converting the target text into a symbol string matched with the target text is deployed in a second container through Kubernetes. The symbol conversion module 70 is configured to obtain the target text through the second container so as to convert the target text into the matching symbol string; the symbol input module 71 is configured to call the classification model in the first container through the second container and input the symbol string into the classification model.
Further, the first container includes a first sub-container and a second sub-container, and the apparatus further includes a model deployment module configured to: deploy the first sub-model in the first sub-container using TensorFlow Serving; and deploy the second sub-model in the second sub-container using pickle.
Further, the apparatus further includes a text acquisition module configured to obtain the target text from a preset statement log of the Kafka system, and a log update module configured to: add a category label to the target text based on the classification result; and input the labelled target text into the statement log to update the statement log.
The text classification apparatus provided by the embodiment of the invention has the same implementation principle and technical effect as the foregoing method embodiment; for brevity, for anything not mentioned in the apparatus embodiment, reference may be made to the corresponding content of the method embodiment.
An embodiment of the present invention further provides an electronic device, which is shown in fig. 8 and includes a processor 101 and a memory 100, where the memory 100 stores machine executable instructions that can be executed by the processor 101, and the processor 101 executes the machine executable instructions to implement the text classification method.
Further, the electronic device shown in fig. 8 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The memory 100 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 103 (wired or wireless), and the Internet, a wide area network, a local area network, a metropolitan area network and the like can be used. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in fig. 8, but this does not mean there is only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP) and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logic blocks disclosed in the embodiments of the invention. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the method disclosed in connection with the embodiments of the invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100 and completes the steps of the method of the foregoing embodiments in combination with its hardware.
The embodiment of the invention further provides a machine-readable storage medium storing machine-executable instructions which, when called and executed by a processor, cause the processor to implement the above text classification method; for specific implementation, reference may be made to the method embodiments, which are not repeated here.
The computer program product of the text classification method, apparatus and electronic device provided by the embodiments of the invention includes a computer-readable storage medium storing program code, and the instructions included in the program code may be used to execute the methods described in the foregoing method embodiments; for specific implementation, reference may be made to the method embodiments, which are not repeated here.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the invention, used to illustrate rather than limit its technical solutions, and the protection scope of the invention is not limited thereto. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may still, within the technical scope disclosed by the invention, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the invention and shall all be covered within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (16)

1. A method of text classification, the method comprising:
converting a target text into a symbol string matched with the target text;
inputting the symbol string into a classification model which is trained in advance, wherein the classification model comprises a first sub-model and a second sub-model;
extracting the characteristics of the symbol string through the first submodel to obtain a plurality of groups of characteristic data of the symbol string;
and classifying the multiple groups of characteristic data through the second submodel to obtain a classification result of the target text.
2. The method of claim 1, wherein the step of converting the target text into a string of symbols matching the target text comprises:
extracting the participles in the target text;
converting each participle in the target text into a corresponding symbol according to a preset participle and symbol comparison dictionary; and forming the symbol string matched with the target text from the symbols corresponding to the participles.
3. The method of claim 2, wherein the step of extracting the segmentation in the target text comprises:
deleting invalid characters in the target text; the invalid characters comprise spaces, emoticons, URL addresses and system identifiers;
and extracting the participles from the target text after the invalid characters are deleted according to a preset rule.
4. The method of claim 1, wherein the first submodel comprises a plurality of parallel feature extraction components; each feature extraction component is used for outputting a set of feature data of the symbol string; the step of classifying the plurality of groups of feature data through the second submodel to obtain the classification result of the target text comprises:
receiving a plurality of groups of feature extraction data output by a plurality of feature extraction components through the second sub-model;
and calculating the average characteristic value of the plurality of groups of characteristic data through the second submodel, inputting the average characteristic value into a preset classifier, and outputting the classification result of the target text.
5. The method of claim 1, wherein the classification model is trained by:
dividing a preset sample set to obtain a plurality of subsets;
training an initial model of the first sub-model based on the plurality of subsets to obtain a trained first sub-model;
inputting the samples in the plurality of subsets into the trained first submodel, and outputting sample characteristics corresponding to the samples in the plurality of subsets;
and training the initial model of the second sub-model based on the sample characteristics to obtain the trained second sub-model.
6. The method of claim 5, wherein the sample set is determined by:
setting a category label of a preset sample;
calculating a feature value of the participle corresponding to each character in the preset sample; the feature values include: a term frequency and an inverse document frequency index;
replacing the characters with the characteristic values lower than a preset threshold value in the preset sample by using characters corresponding to preset word segmentation to obtain an amplified sample, and setting a category label corresponding to the preset sample on the amplified sample;
and determining the preset sample and the amplified sample with the set category label as the sample set.
7. The method of claim 5, wherein the initial model corresponding to the first sub-model comprises a plurality of parallel feature extraction components; the step of training the initial model of the first sub-model based on the plurality of subsets to obtain the trained first sub-model comprises:
for each of the feature extraction components, performing the following operations:
determining a test set of current feature extraction components from the plurality of subsets; determining subsets of the plurality of subsets other than the test set as a training set for the current feature extraction component;
determining a target sample from the training set;
inputting the target sample into the current feature extraction component to obtain an output result;
calculating a loss value of a preset loss function based on the output result; and continuing to execute the step of determining the target sample from the training set until the loss value is converged, so as to obtain the trained current feature extraction component.
8. The method of claim 7, wherein the step of determining a test set of current feature extraction components from the plurality of subsets comprises:
and determining the test set corresponding to the current feature extraction component according to the test sets corresponding to the feature extraction components except for the current feature extraction component in the plurality of feature extraction components and the plurality of subsets, so that each feature extraction component corresponds to a different test set.
9. The method of claim 7, wherein the feature extraction component comprises a Bert model; the preset loss function comprises a focal loss function.
10. The method of claim 5, wherein the trained first sub-model comprises a plurality of trained feature extraction components;
the step of inputting the samples in the plurality of subsets into the trained first sub-model and outputting the sample features corresponding to the samples in the plurality of subsets comprises:
for each trained feature extraction component, inputting a sample in a test set corresponding to the current feature extraction component into the current feature extraction component to obtain a sample feature corresponding to the test set; wherein a subset of the plurality of subsets is included in the test set; the sum of the test sets corresponding to each trained feature extraction component is the plurality of subsets;
and combining the sample characteristics corresponding to each trained characteristic extraction component to obtain the sample characteristics corresponding to the samples in the plurality of subsets.
11. The method of claim 1, wherein the classification model is deployed in a first container by Kubernetes; the step of converting the target text into a symbol string matched with the target text is deployed in a second container through Kubernetes;
the step of converting the target text into a symbol string matching the target text comprises:
acquiring the target text through the second container so as to convert the target text into a symbol string matched with the target text;
the step of inputting the symbol string into a classification model trained in advance comprises:
and calling the classification model in the first container through the second container, and inputting the symbol string into the classification model.
12. The method of claim 11, wherein the first container comprises a first sub-container and a second sub-container; the step of the classification model being deployed in a first container by Kubernetes comprises:
deploying the first sub-model in a first sub-container by adopting a TensorFlow Serving mode;
and deploying the second sub-model in a second sub-container in a pickle mode.
13. The method of claim 1, wherein prior to the step of converting the target text into a string of symbols matching the target text, the method further comprises:
acquiring the target text from a preset statement log of the kafka system;
after the step of classifying the plurality of groups of feature data through the second submodel to obtain the classification result of the target text, the method further includes:
adding a category label to the target text based on the classification result;
inputting the target text added with the category label into the statement log so as to update the statement log.
14. An apparatus for classifying text, the apparatus comprising:
the symbol conversion module is used for converting the target text into a symbol string matched with the target text;
the symbol input module is used for inputting the symbol string into a classification model which is trained in advance, wherein the classification model comprises a first sub-model and a second sub-model;
the characteristic extraction module is used for extracting the characteristics of the symbol string through the first sub-model to obtain a plurality of groups of characteristic data of the symbol string;
and the classification module is used for classifying the plurality of groups of characteristic data through the second sub-model to obtain a classification result of the target text.
15. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor, the processor executing the machine executable instructions to implement the text classification method of any one of claims 1 to 13.
16. A computer-readable storage medium having stored thereon computer-executable instructions that, when invoked and executed by a processor, cause the processor to implement the text classification method of any of claims 1 to 13.
CN202010540067.XA 2020-06-12 2020-06-12 Text classification method and device and electronic equipment Pending CN111737464A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010540067.XA CN111737464A (en) 2020-06-12 2020-06-12 Text classification method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111737464A true CN111737464A (en) 2020-10-02

Family

ID=72649148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010540067.XA Pending CN111737464A (en) 2020-06-12 2020-06-12 Text classification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111737464A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081667A (en) * 2011-01-23 2011-06-01 浙江大学 Chinese text classification method based on Base64 coding
US20170147682A1 (en) * 2015-11-19 2017-05-25 King Abdulaziz City For Science And Technology Automated text-evaluation of user generated text
WO2020073507A1 (en) * 2018-10-11 2020-04-16 平安科技(深圳)有限公司 Text classification method and terminal
CN109508456A (en) * 2018-10-22 2019-03-22 网易(杭州)网络有限公司 A kind of text handling method and device
CN109993216A (en) * 2019-03-11 2019-07-09 深兰科技(上海)有限公司 A kind of file classification method and its equipment based on K arest neighbors KNN

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836043A (en) * 2020-10-13 2021-05-25 讯飞智元信息科技有限公司 Long text clustering method and device based on pre-training language model
CN113554107A (en) * 2021-07-28 2021-10-26 工银科技有限公司 Corpus generating method, apparatus, device, storage medium and program product
CN113672732A (en) * 2021-08-19 2021-11-19 胜斗士(上海)科技技术发展有限公司 Method and device for classifying business data
CN113672732B (en) * 2021-08-19 2024-04-26 胜斗士(上海)科技技术发展有限公司 Method and device for classifying service data
CN116628168A (en) * 2023-06-12 2023-08-22 深圳市逗娱科技有限公司 User personality analysis processing method and system based on big data and cloud platform
CN116628168B (en) * 2023-06-12 2023-11-14 深圳市逗娱科技有限公司 User personality analysis processing method and system based on big data and cloud platform

Similar Documents

Publication Publication Date Title
CN109657054B (en) Abstract generation method, device, server and storage medium
CN109416705B (en) Utilizing information available in a corpus for data parsing and prediction
CN108628971B (en) Text classification method, text classifier and storage medium for unbalanced data set
CN106156204B (en) Text label extraction method and device
CN111737464A (en) Text classification method and device and electronic equipment
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
CN109815336B (en) Text aggregation method and system
CN109271514B (en) Generation method, classification method, device and storage medium of short text classification model
CN112800170A (en) Question matching method and device and question reply method and device
CN110569354B (en) Barrage emotion analysis method and device
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN106506327B (en) Junk mail identification method and device
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN112052331A (en) Method and terminal for processing text information
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN111460149A (en) Text classification method, related equipment and readable storage medium
CN109446393B (en) Network community topic classification method and device
CN114416979A (en) Text query method, text query equipment and storage medium
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN111581377B (en) Text classification method and device, storage medium and computer equipment
CN110263163B (en) Method and device for obtaining text abstract
CN111428487A (en) Model training method, lyric generation method, device, electronic equipment and medium
CN110717316A (en) Topic segmentation method and device for subtitle dialog flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination