CN111753082A - Text classification method and device based on comment data, equipment and medium


Info

Publication number
CN111753082A
Authority
CN
China
Prior art keywords
data
classification
word
comment
text
Prior art date
Legal status
Pending
Application number
CN202010207346.4A
Other languages
Chinese (zh)
Inventor
徐路
罗壮
何云龙
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010207346.4A priority Critical patent/CN111753082A/en
Publication of CN111753082A publication Critical patent/CN111753082A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a text classification method, device, equipment and medium based on comment data, belongs to the technical field of natural language processing, and can be applied to scenarios in which text data is classified. The text classification method based on comment data includes: obtaining comment data, and performing text preprocessing on the comment data to generate word segmentation data to be processed; performing word vectorization on the word segmentation data to be processed to generate corresponding word vector representation data; inputting the word vector representation data into a target language representation model to generate corresponding sentence vector representation data; and inputting the sentence vector representation data into a first classification model and a second classification model respectively, where the first classification model determines whether the comment data belongs to question text data and the second classification model determines the question type classification corresponding to the comment data. By mining the information contained in comment data, the method and device can screen out comment texts with quality problems and the specific classification of those quality problems.

Description

Text classification method and device based on comment data, equipment and medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text classification method based on comment data, a text classification device based on comment data, an electronic device, and a computer-readable storage medium.
Background
With the development of the Internet, more and more people shop on e-commerce websites, and the characteristics of each commodity can be obtained by analyzing and mining users' shopping reviews. The commodity can then be improved based on the mined characteristics. For example, by analyzing user comments on fresh products, the proportion of quality problems and their sources can be mined, and based on the mined sources a processing strategy can be formulated to improve the commodity and reduce the proportion of quality problems.
Against this background, how to classify and mine quality problems from product review text has become a research focus. The core of classifying comment data is to screen out the data that describe quality problems; for this task, methods such as keyword matching and text classification models based on word representations (Word Embedding) are generally adopted.
However, keyword matching is too simple to mine deeper semantic information, and the type of the text quality problem cannot be determined from keyword matching alone. A text classification model based on Word Embedding, on the other hand, needs a large amount of labeled data to train the classification model, and at the present stage the labeling cost is too high, making the method too expensive.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The purpose of the present disclosure is to provide a text classification method based on comment data, a text classification device based on comment data, an electronic device, and a computer-readable storage medium, so as to overcome, at least to a certain extent, the problems that keyword matching cannot mine the quality problems in comment texts and that a Word Embedding-based text classification model incurs an excessively high labeling cost.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the invention.
According to a first aspect of the present disclosure, there is provided a text classification method based on comment data, including: obtaining comment data, and performing text preprocessing on the comment data to generate word segmentation data to be processed; performing word vectorization processing on the word segmentation data to be processed to generate corresponding word vector representation data; inputting the word vector representation data into a target language representation model to generate corresponding sentence vector representation data; inputting sentence vector representation data into a first classification model, and determining whether comment data belong to question text data or not by the first classification model; and inputting the sentence vector representation data into a second classification model, and determining the question type classification corresponding to the comment data by the second classification model.
Optionally, performing text preprocessing on the comment data to generate word segmentation data to be processed, including: performing regular matching processing on the comment data to generate regular text data; performing word segmentation on the regular text data to generate word segmentation data; and performing word correction processing on the word segmentation data to generate word segmentation data to be processed.
Optionally, performing word correction processing on the word segmentation data includes: acquiring a stop word reference table, and deleting stop words in the word segmentation data according to the stop word reference table; and determining words to be modified in the word segmentation data, and performing replacement processing on the words to be modified.
Optionally, performing word vectorization processing on the to-be-processed word segmentation data to generate corresponding word vector representation data, including: acquiring a word segmentation vocabulary, and constructing a vector space corresponding to the word segmentation vocabulary; and performing word vectorization processing on the word segmentation data to be processed according to the vector space to obtain word vector representation data of the comment data.
Optionally, before inputting the word vector representation data into the target language representation model, the method further includes: acquiring an initial language representation model and initial training data; acquiring mask identification and a preset proportion, and randomly selecting a target number of replacement training data from the initial training data according to the preset proportion; carrying out replacement processing on the replacement training data according to the mask identification to generate target training data; inputting target training data into the initial language representation model, and obtaining an output result of the initial language representation model; and adjusting the parameters of the initial language representation model according to the output result to obtain the target language representation model.
Optionally, inputting the sentence vector representation data into the first classification model, and determining whether the comment data belongs to the question text data by the first classification model, including: inputting sentence vector representation data to a first classification model; outputting, by the first classification model, a first result vector corresponding to the sentence vector representation data; and determining whether the comment data belongs to the question text data or not according to the first result vector.
Optionally, the sentence vector representation data is input to the second classification model, and the second classification model determines the question type classification corresponding to the comment data, including: inputting sentence vector representation data to a second classification model; outputting, by the second classification model, a second result vector corresponding to the sentence vector representation data; wherein the second result vector comprises a plurality of confidences; and obtaining confidence threshold values corresponding to the confidence degrees, and determining the problem type classification corresponding to the comment data according to the size relation between the confidence degrees and the confidence threshold values.
According to a second aspect of the present disclosure, there is provided a text classification apparatus based on comment data, including: the preprocessing module is used for acquiring comment data and performing text preprocessing on the comment data to generate word segmentation data to be processed; the word vector generation module is used for carrying out word vectorization on the word segmentation data to be processed so as to generate corresponding word vector representation data; a sentence vector generation module for inputting the word vector representation data to the target language representation model to generate corresponding sentence vector representation data; the first classification module is used for inputting sentence vector representation data into a first classification model, and determining whether the comment data belong to question text data or not by the first classification model; and the second classification module is used for inputting the sentence vector representation data into the second classification model, and determining the problem type classification corresponding to the comment data by the second classification model.
Optionally, the preprocessing module includes a preprocessing unit, configured to perform regular matching processing on the comment data to generate regular text data; performing word segmentation on the regular text data to generate word segmentation data; and performing word correction processing on the word segmentation data to generate word segmentation data to be processed.
Optionally, the preprocessing unit includes a word modification subunit, configured to obtain a stop word reference table, and delete stop words in the word segmentation data according to the stop word reference table; and determine words to be modified in the word segmentation data, and perform replacement processing on the words to be modified.
Optionally, the word vector generation module includes a word vector generation unit, configured to obtain a word segmentation vocabulary, and construct a vector space corresponding to the word segmentation vocabulary; and performing word vectorization processing on the word segmentation data to be processed according to the vector space to obtain word vector representation data of the comment data.
Optionally, the text classification device based on the comment data further includes a model training module, configured to obtain an initial language representation model and initial training data; acquiring mask identification and a preset proportion, and randomly selecting a target number of replacement training data from the initial training data according to the preset proportion; carrying out replacement processing on the replacement training data according to the mask identification to generate target training data; inputting target training data into the initial language representation model, and obtaining an output result of the initial language representation model; and adjusting the parameters of the initial language representation model according to the output result to obtain the target language representation model.
Optionally, the first classification module includes a first classification unit, configured to input sentence vector representation data into the first classification model; outputting, by the first classification model, a first result vector corresponding to the sentence vector representation data; and determining whether the comment data belongs to the question text data or not according to the first result vector.
Optionally, the second classification module includes a second classification unit, configured to input sentence vector representation data into the second classification model; outputting, by the second classification model, a second result vector corresponding to the sentence vector representation data; wherein the second result vector comprises a plurality of confidences; and obtaining confidence threshold values corresponding to the confidence degrees, and determining the problem type classification corresponding to the comment data according to the size relation between the confidence degrees and the confidence threshold values.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of text classification based on comment data according to any of the above.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of text classification based on comment data according to any one of the above.
The technical scheme provided by the disclosure can comprise the following beneficial effects:
the comment data-based text classification method in the exemplary embodiment of the present disclosure obtains comment data, and performs text preprocessing on the comment data to generate word segmentation data to be processed; performing word vectorization processing on the word segmentation data to be processed to generate corresponding word vector representation data; inputting the word vector representation data into a target language representation model to generate corresponding sentence vector representation data; inputting sentence vector representation data into a first classification model, and determining whether comment data belong to question text data or not by the first classification model; and inputting the sentence vector representation data into a second classification model, and determining the question type classification corresponding to the comment data by the second classification model. On one hand, the efficiency of text classification processing can be improved by performing text preprocessing on the acquired comment data and performing text classification processing according to the comment data subjected to the text preprocessing. On the other hand, the comment data are classified in a mode of combining the target language representation model in the unsupervised learning mode and the classification model in the supervised learning mode, so that the accuracy of data classification can be effectively improved, and the labeling cost generated by text classification can be reduced. In another aspect, whether quality problems exist in the comment data and the target classification of the quality problems are respectively determined through the first classification model and the second classification model, and a targeted article improvement strategy can be formulated according to article feedback information obtained through text classification.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a method of text classification based on comment data according to an exemplary embodiment of the present disclosure;
FIG. 2 schematically illustrates a detailed flow diagram of a review data-based semi-supervised text classification methodology, according to an exemplary embodiment of the present disclosure;
FIG. 3 schematically shows a flow diagram of text pre-processing for comment data according to an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for processing word vector representation data using a target language representation model, according to an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram for training a target language representation model according to an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a data flow diagram for classification of sentence vector representation data using a classification model according to an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates an overall architecture diagram of a comment data based text classification method according to an exemplary embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a review data-based text classification apparatus, according to an exemplary embodiment of the present disclosure;
FIG. 9 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure;
fig. 10 schematically illustrates a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
On an e-commerce platform, by collecting users' shopping reviews after purchase, the e-commerce website can mine the characteristics of each commodity, such as the proportion of quality problems for a certain item and the root causes of those problems. Mining the root causes of quality problems helps improve the processing strategy for the commodity and reduce the proportion of quality problems. The core of comment data classification is to screen out texts that describe quality problems, and the prior art mainly adopts two methods to classify comment text data. One is keyword matching, which builds a table of quality-problem keywords and compares each piece of text data against it; once a text contains a quality-problem word, it is considered a quality-problem text. The disadvantage of keyword matching is that it is too simple: deeper semantic information cannot be mined, and the type of the quality problem cannot be determined from keyword matching alone. The other method classifies text with a Word Embedding-based text classification model: a continuous word vector space is first built, each word of the comment text is converted into a vector in that space, and a neural network or machine learning model then models the word vectors and classifies the whole sentence. The disadvantage of this approach is that a large amount of labeled data is needed to train the classification model, and at the present stage the labeling cost is too high, making the method too expensive.
Based on this, in the present exemplary embodiment, firstly, a text classification method based on comment data is provided, the text classification method based on comment data of the present disclosure may be implemented by using a server, and the method described in the present disclosure may also be implemented by using a terminal device, where the terminal described in the present disclosure may include a mobile terminal such as a mobile phone, a tablet computer, a notebook computer, a palm computer, a Personal Digital Assistant (PDA), and a fixed terminal such as a desktop computer. Fig. 1 schematically illustrates a schematic diagram of a review data-based text classification method flow, according to some embodiments of the present disclosure. Referring to fig. 1, the text classification method based on comment data may include the steps of:
step S110, comment data are obtained, text preprocessing is conducted on the comment data, and word segmentation data to be processed are generated.
Step S120, performing word vectorization on the word segmentation data to be processed to generate corresponding word vector representation data.
Step S130, the word vector representation data is input to the target language representation model to generate corresponding sentence vector representation data.
Step S140, the sentence vector representation data is input to the first classification model, and it is determined by the first classification model whether the comment data belongs to the question text data.
And S150, inputting sentence vector representation data into a second classification model, and determining question type classification corresponding to the comment data by the second classification model.
According to the text classification method based on the comment data in the exemplary embodiment, on one hand, the efficiency of text classification processing can be improved by performing text preprocessing on the acquired comment data and performing text classification processing according to the comment data subjected to the text preprocessing. On the other hand, the comment data are classified in a mode of combining the target language representation model in the unsupervised learning mode and the classification model in the supervised learning mode, so that the accuracy of data classification can be effectively improved, and the labeling cost generated by text classification can be reduced. In another aspect, whether quality problems exist in the comment data and the target classification of the quality problems are respectively determined through the first classification model and the second classification model, and a targeted article improvement strategy can be formulated according to article feedback information obtained through text classification.
Next, a text classification method based on comment data in the present exemplary embodiment will be further explained.
In step S110, comment data is acquired, and text preprocessing is performed on the comment data to generate word segmentation data to be processed.
In some exemplary embodiments of the present disclosure, the comment data may be related comment information generated after a user purchases a certain item and evaluates the item performance, use experience, and the like of the item. For example, the comment data may be user comment data in an e-commerce platform; the review data may also be review data for the item that is obtained from other platforms. The present disclosure mainly performs mining analysis processing on related text information contained in comment data. The text preprocessing can be a preprocessing operation performed on the comment data, and the text preprocessing can be a process of filtering out special symbols or other types of information such as expression data in the comment data and performing word segmentation processing on the comment text. The to-be-processed participle data can be participle data corresponding to the comment data obtained after text preprocessing, such as text replacement, text deletion, participle segmentation and the like, is performed on the comment data.
There is a large amount of user comment data in the e-commerce platform, and the data are stored in a distributed manner in databases in various places. Before the comment data is obtained, a data acquisition script can be deployed in the e-commerce platform, and the comment data can be obtained through this script. For example, the present disclosure may extract comment data from a Hadoop cluster through a data collection script. After the comment data is obtained, a text preprocessing operation can be performed on it to delete non-standard data and other irrelevant data, for example empty comment information, emoticons and special symbols in the comment data. In addition, the comment data can be subjected to regular matching processing and word segmentation processing to generate the word segmentation data to be processed corresponding to the comment data.
According to some exemplary embodiments of the present disclosure, regular matching processing is performed on the comment data to generate regular text data; word segmentation is performed on the regular text data to generate word segmentation data; and word correction processing is performed on the word segmentation data to generate the word segmentation data to be processed. The regular matching processing may delete meaningless empty comments in the comment data, or delete special symbols such as line feed characters, and keep the semantically meaningful text data in the comment data. The regular text data may be the text data corresponding to the comment data obtained after the regular matching processing. The word segmentation processing may be a segmentation operation performed on the regular text data; for example, a Chinese word segmentation algorithm may be used to divide the regular text data of the comment data into an ordered set composed of words. The word segmentation data may be this ordered set of words obtained by segmenting the regular text data. The word correction processing may be a process of replacing erroneous words in the comment text data and replacing certain words with their synonyms.
Referring to fig. 2, fig. 2 schematically shows a detailed flowchart of a comment data-based semi-supervised text classification method. In step S201, the original comment data is acquired from the database. Referring to fig. 3, fig. 3 schematically shows a flowchart of text preprocessing for comment data. In step S310, regular matching processing may be performed on the comment data to remove empty comments, emoticons, special symbols and the like and keep the text in the comment data, that is, to obtain regular text data. In step S202, the regular text data may be subjected to word segmentation processing. Specifically, in step S320, the obtained regular text data may be segmented into words; since Chinese, unlike English, has no obvious segmentation marks, the present disclosure performs word segmentation on the regular text data with a Chinese word segmentation algorithm and divides the comment text data into an ordered set composed of words. For example, the Chinese word segmentation algorithm may be a dictionary-based, machine-learning-based or neural-network-based word segmentation algorithm. It should be noted that other methods that can perform word segmentation on the regular text data also fall within the protection scope of the present disclosure.
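To make steps S310 and S320 concrete, the following is a minimal sketch in Python. It assumes the jieba package as an example of a dictionary-based Chinese word segmentation tool and an illustrative regular expression; neither is prescribed by the present disclosure.

```python
import re

import jieba  # example dictionary-based Chinese word segmentation tool (an assumption, not named in the text)

def regular_match(comment: str) -> str:
    """Regular matching processing: drop emoticons, special symbols and blank comments, keep meaningful text."""
    if comment is None or not comment.strip():
        return ""  # empty comments carry no semantics and are filtered out
    # illustrative pattern: keep Chinese characters, letters, digits and basic punctuation
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？,.!?]", "", comment)

def segment(regular_text: str) -> list:
    """Word segmentation processing: divide the regular text data into an ordered set of words."""
    return [w for w in jieba.lcut(regular_text) if w.strip()]

# usage on an illustrative comment
print(segment(regular_match("质量不好，我再也不来了！！\n")))
```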
According to another exemplary embodiment of the present disclosure, a stop word reference table is obtained, and stop words in the word segmentation data are deleted according to the stop word reference table; words to be modified in the word segmentation data are determined, and replacement processing is performed on them. Stop words may be words that occur frequently in the review data but do not carry distinguishing features, such as common auxiliary and function words. The stop word reference table may be a data table for storing common stop words; in the present disclosure, the stop word reference table may be configured by a preset program, and entries may be added to or deleted from it by that program. The words to be modified may be misspelled words in the comment text data, words that can be replaced by synonyms, and the like.
Referring to fig. 2, in step S203, stop word removal may be performed on the comment data that has undergone word segmentation. Specifically, referring to fig. 3, in step S330, after the comment text data is segmented, the stop word reference table is obtained, and word correction processing may be performed on the obtained word segmentation data according to the stop word reference table to generate the word segmentation data to be processed. Since the words in the comment text data may contain synonyms, and since misspelled words and synonyms may interfere with the classification of Chinese comment text, in step S204 the relevant synonyms and misspelled words in the comment text data are replaced so that a better result is achieved when the comment text data is classified with the classification model.
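A minimal sketch of the word correction processing in steps S330 and S204 follows; the stop word reference table and the replacement dictionary below are illustrative assumptions, not the tables actually used.

```python
# Illustrative stop word reference table and correction dictionary (assumptions).
STOP_WORD_TABLE = {"的", "了", "是", "啊"}
WORDS_TO_MODIFY = {"新鲜度差": "不新鲜"}  # word to be modified -> replacement (synonym / spelling fix)

def word_correction(segmented_words: list) -> list:
    """Delete stop words and replace words to be modified, yielding the word segmentation data to be processed."""
    to_be_processed = []
    for word in segmented_words:
        if word in STOP_WORD_TABLE:
            continue  # delete stop words according to the stop word reference table
        to_be_processed.append(WORDS_TO_MODIFY.get(word, word))
    return to_be_processed
```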
In step S120, word vectorization processing is performed on the word segmentation data to be processed to generate corresponding word vector representation data.
In some exemplary embodiments of the present disclosure, the word vectorization processing may be a process of vectorizing the to-be-processed participle data, that is, a vectorization processing process of converting the comment data from text type information to numerical type information. The word vector representing data may be vector data corresponding to each word obtained by performing word vectorization on the word segmentation data to be processed. After the text preprocessing is performed on the comment data to obtain the preprocessed participle data, word vectorization processing can be performed on the participle data to be processed to obtain word vector representation data corresponding to the participle data to be processed.
According to some exemplary embodiments of the present disclosure, a word segmentation vocabulary is obtained, and a vector space corresponding to the word segmentation vocabulary is constructed; and performing word vectorization processing on the word segmentation data to be processed according to the vector space to obtain word vector representation data of the comment data. The participle vocabulary may be a vocabulary of all vocabulary components present in the review data used for model training. The vector space may be a continuous vector space corresponding to a vocabulary in the participle vocabulary.
The present disclosure can mainly build a vocabulary table for all vocabularies appearing in the training stage, and build a continuous vector space for all vocabularies, ensuring that each word can be converted into a vector in a continuous space. After the comment data are processed to obtain word segmentation data to be processed, the comment data are converted into a set of ordered words, and the word segmentation data to be processed can be converted into corresponding word vector representation data by using the constructed vector space.
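The following sketch illustrates one way to build the word segmentation vocabulary and its continuous vector space and to convert the word segmentation data to be processed into word vector representation data; the embedding dimension and the random initialization are assumptions made for the example.

```python
import numpy as np

def build_vocabulary(training_sentences):
    """Word segmentation vocabulary: every word that appears in the training-stage comment data."""
    vocabulary = {"[UNK]": 0}
    for sentence in training_sentences:
        for word in sentence:
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary

def build_vector_space(vocabulary, dim=128):
    """Continuous vector space: one dim-dimensional vector for each vocabulary word."""
    rng = np.random.default_rng(seed=0)
    return rng.normal(size=(len(vocabulary), dim)).astype("float32")

def word_vectorize(words_to_process, vocabulary, vector_space):
    """Word vectorization: map each word of a comment to its vector, giving the word vector representation data."""
    indices = [vocabulary.get(w, vocabulary["[UNK]"]) for w in words_to_process]
    return vector_space[indices]  # shape: (number of words, dim)
```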
In step S130, the word vector representation data is input to the target language representation model to generate corresponding sentence vector representation data.
In some exemplary embodiments of the present disclosure, the target language representation model may be a model that processes the word vector representation data to extract the textual information it contains; for example, the target language representation model may be a Bidirectional Encoder Representations from Transformers (BERT) model. The sentence vector representation data may be the sentence-level vectorized representation of the comment data generated after the target language representation model processes the word vector representation data; it can contain the semantic information of the comment text data and may be denoted as T1, T2, T3, …, TN.
Referring to FIG. 4, FIG. 4 schematically illustrates a flowchart of processing word vector representation data using the target language representation model. The word vector representation data 410 may be Chinese character vectors obtained after Word Embedding processing, and may be denoted as E1, E2, E3, …, En. The word vector representation data E1, E2, E3, …, En are input into a BERT model serving as the target language representation model, and after being processed by the Transformer structures 420 in the BERT model, the corresponding sentence vector representation data 430, namely T1, T2, T3, …, TN, are generated. Through the above steps, each piece of comment data to be processed can be represented as a group of vectors, realizing the vectorization of the text data of the comment data. The obtained sentence vector representation data is then input into the classification models to determine whether the comment data has a quality problem and the specific classification of that quality problem.
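As an illustration of this step, the sketch below obtains sentence vector representation data T1…TN from a BERT model; it assumes the Hugging Face transformers library and the public "bert-base-chinese" checkpoint, neither of which is specified in the present disclosure.

```python
import torch
from transformers import BertModel, BertTokenizer  # assumed library; any BERT implementation would do

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # example checkpoint (assumption)
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def sentence_vector_representation(comment_text: str) -> torch.Tensor:
    """Return one vector per token (T1..TN) produced by the Transformer layers of BERT."""
    inputs = tokenizer(comment_text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (N, hidden_size)
```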
According to some example embodiments of the present disclosure, an initial language representation model is obtained along with initial training data; acquiring mask identification and a preset proportion, and randomly selecting a target number of replacement training data from the initial training data according to the preset proportion; carrying out replacement processing on the replacement training data according to the mask identification to generate target training data; inputting target training data into the initial language representation model, and obtaining an output result of the initial language representation model; and adjusting the parameters of the initial language representation model according to the output result to obtain the target language representation model. The initial language representation model may be an initially built model for sentence-vectorized representation of the review data. The initial training data may be comment text data employed for training the initial language representation model. The mask identification may be an identification used to replace a particular word in the process of training the initial language representation model. The preset ratio may be a preset numerical ratio. The replacement training data may be training data randomly selected from the initial training data for replacement with the mask identification. The target training data may be training data generated by performing replacement processing on replacement training data in the initial training data by using mask identification. The output result may be result data output by the initial language representation model after the target training data is processed by the initial language representation model.
Referring to fig. 2, in step S205, before the word vector representation data is processed with the target language representation model, the initial language representation model may be trained to obtain the target language representation model. Referring to FIG. 5, FIG. 5 schematically illustrates a flowchart of training the target language representation model. The present disclosure describes the training of the initial language representation model by taking a BERT model as an example. Specifically, the training process is as follows. In step S510, a pre-constructed initial language representation model and initial training data are obtained; the BERT model may be trained with the original sentences in the initial data, and in the initial training stage the BERT model is trained with an unsupervised random mask. In step S520, the mask identifier and the preset ratio are determined; for example, the mask identifier may be a "[mask]" token, and the preset ratio may be 10%, 15%, 20% or the like. In step S530, for each piece of initial training data, replacement training data is randomly selected according to the preset ratio, and part of that data is replaced with the mask identifier to generate target training data. In step S540, the target training data is input to the initial language representation model, the model learns the structure of the sentences in the initial training data, and an output result is produced. In step S550, the parameters of the initial language representation model are continuously adjusted and optimized according to the difference between the output result and the initial training data, so as to train the model and obtain the target language representation model. For example, for each sentence (i.e., each piece of initial training data), 15% of the input tokens are randomly picked (i.e., the replacement training data) and subjected to masking. Of these 15% masked positions, 80% are replaced with the mask identifier, 10% are replaced with randomly chosen words from the word stock, and 10% remain unchanged. In this way, the BERT model learns the structure of the sentences, yielding the final target language representation model.
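The unsupervised random mask described above can be sketched as follows; the mask token string and the word stock are assumptions for the example, while the 15% / 80% / 10% / 10% ratios are the ones given in the text.

```python
import random

def build_target_training_data(tokens, word_stock, mask_token="[MASK]", mask_ratio=0.15):
    """Randomly replace a preset ratio of the tokens of one sentence to form target training data."""
    target = list(tokens)
    n_selected = max(1, round(len(tokens) * mask_ratio))       # ~15% of the input is selected
    for position in random.sample(range(len(tokens)), n_selected):
        r = random.random()
        if r < 0.8:
            target[position] = mask_token                      # 80%: replace with the mask identifier
        elif r < 0.9:
            target[position] = random.choice(word_stock)       # 10%: replace with a random word from the word stock
        # remaining 10%: keep the original token unchanged
    return target
```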
It is easily understood by those skilled in the art that the specific format of the mask mark and the specific value of the preset ratio can be set according to the model training requirement, and the disclosure is not limited in any way.
In step S140, sentence vector representation data is input to the first classification model, and it is determined by the first classification model whether or not comment data belongs to question text data.
In some exemplary embodiments of the present disclosure, the first classification model may be a classification model for determining whether the comment data has a quality problem. The question text data may be comment data indicating that there is a quality problem. The first classification model may be a Multilayer Perceptron (MLP) network, a Decision Tree model, or a Support Vector Machine (SVM) model. It will be readily appreciated by those skilled in the art that the first classification model may also be a classification model of another structure.
Referring to fig. 2, before classifying the comment data, training data for training a classification model may be labeled, a first classification model and a second classification model may be trained from the labeled training data in step S206, and a question classification model (i.e., a first classification model) and a quality question segmentation model (i.e., a second classification model) for classifying the comment data are generated in step S207.
After the sentence vector representation data of the comment data is obtained, it can be input into the first classification model, which classifies the sentence vector representation data to determine whether the comment data contains text with a quality problem, that is, whether the comment data belongs to question text data. Specifically, the first classification model classifies the sentence vector representation data and outputs one of two results: the comment data has no quality problem, or the comment data has a quality problem.
According to some exemplary embodiments of the present disclosure, the sentence vector representation data is input to the first classification model; the first classification model outputs a first result vector corresponding to the sentence vector representation data; and whether the comment data belongs to the question text data is determined according to the first result vector. The first result vector may be the output of the first classification model corresponding to the sentence vector representation data. The first result vector may include the classification result of the model and a confidence for each class, where the classification result may be denoted prediction and the confidences may be denoted prob_list. In the present disclosure, for the classification result, 1 may indicate a quality problem and 0 no quality problem; for the confidences, the 0th dimension represents the confidence of no quality problem, the 1st dimension represents the confidence of a quality problem, the two sum to 1, and the classification result is the dimension with the higher confidence.
Referring to fig. 6, fig. 6 schematically illustrates a data flow diagram for classifying sentence vector representation data using the classification models. After text preprocessing and vectorization of the comment data, the sentence vector representation data 610 can be obtained. When comment data A is "The quality is not good, I will not come again!", the sentence vector representation data 611 of comment data A is input to the first classification model 621, the first result vector 631 output by the first classification model 621 is "prediction: 1, prob_list: [0.12498622, 0.87501377]", and the conclusion 641 that comment data A has a quality problem, i.e., is question text data, can be obtained from the first result vector 631.
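A minimal sketch of a first classification model built as an MLP (one of the structures mentioned above) is shown below, using PyTorch; the layer sizes and the mean pooling of T1..TN are assumptions made for the example, not details given in the present disclosure.

```python
import torch
import torch.nn as nn

class FirstClassificationModel(nn.Module):
    """Binary classifier: does the comment belong to question (quality problem) text data?"""

    def __init__(self, hidden_size: int = 768):  # 768 matches BERT-base (assumption)
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, sentence_vectors: torch.Tensor):
        pooled = sentence_vectors.mean(dim=0)            # pool T1..TN into a single sentence vector
        prob_list = torch.softmax(self.mlp(pooled), dim=-1)
        prediction = int(prob_list.argmax())             # 1: quality problem, 0: no quality problem
        return prediction, prob_list
```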
In step S150, the sentence vector representation data is input to the second classification model, and the question type classification corresponding to the comment data is determined by the second classification model.
In some example embodiments of the present disclosure, the second classification model may be used to determine a model of a quality issue specific classification to which the review data corresponds. The second classification model can be an MLP network, a decision tree model or an SVM model. It is easily understood by those skilled in the art that the first classification model and the second classification model may also be classification models of other structures, and the present disclosure does not make any special limitation on the specific model structures of the first classification model and the second classification model, and classification models of other structures are within the scope of protection of the present disclosure.
And inputting sentence vector representation data into a second classification model, wherein the second classification model can determine the specific classification of the quality problem corresponding to the comment data, and one comment data can correspond to one quality problem classification or a plurality of quality problem classifications.
According to some exemplary embodiments of the present disclosure, sentence vector representation data is input to the second classification model; outputting, by the second classification model, a second result vector corresponding to the sentence vector representation data; wherein the second result vector comprises a plurality of confidences; and obtaining confidence threshold values corresponding to the confidence degrees, and determining the problem type classification corresponding to the comment data according to the size relation between the confidence degrees and the confidence threshold values. The second result vector may be an output result of a quality problem specific classification corresponding to the sentence vector representation data output by the second classification model, and the second result vector may include a plurality of confidence levels therein. The second result vector can also adopt prediction to represent the classification result of the model, a certain dimension is 1 to represent that the comment data has quality problems under the category, and 0 represents that the comment data has no quality problems under the category; the data for each dimension of the prob _ list represents the confidence level that there is a quality problem under that category. Confidence may indicate the probability that the evaluation value is within a certain allowable error range from the overall parameter and that the review data is in the quality problem target classification. The confidence threshold may be a value preset for comparison with the magnitude of each confidence. The question type classification may be a specific classification of the quality question corresponding to the review data, for example, in a fresh product, the quality question classification may include text mismatch, freshness, damaged goods, weight loss, and so on.
Referring to FIG. 6, the process by which the second classification model determines the specific classification of the quality problem of the comment data is as follows. If comment data B is "It is not fresh, I will not come again!", the sentence vector representation data 612 corresponding to comment data B is input to the second classification model 622, and the second result vector 632 output by the second classification model 622 may be "prediction: [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], prob_list: [8.29909288e-04, 2.09750287e-04, 9.96354342e-01, 3.79751466e-04, 3.23055923e-04, 6.41657214e-04, 2.17023538e-04, 9.52679693e-06, 2.69078882e-05, 1.26935687e-04, 4.70576160e-05, 5.83176443e-05, 2.85345799e-04, 1.75762907e-04, 2.74996244e-04, 1.55316502e-05, 1.27878729e-05, 9.50200683e-06, 1.78207074e-06]". After the second result vector output by the second classification model is obtained, a threshold-based approach can be used to judge whether the comment data has a quality problem in a given dimension: each confidence is compared with the confidence threshold, and if the confidence of a quality problem in a dimension is greater than the threshold, a quality problem is considered to exist in that dimension. As can be seen from the second result vector 632, the value of the third dimension is 1 and its confidence is 9.96354342e-01, so the quality problem classification 642 corresponding to the comment data can be determined as "not fresh".
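The threshold-based judgment on the second result vector can be sketched as follows; the category list, its ordering and the 0.5 threshold are illustrative assumptions (only the "not fresh" class in the third dimension is taken from the example above).

```python
def question_type_classification(prob_list, categories, confidence_threshold=0.5):
    """Compare each confidence with the threshold; a comment may fall into several quality problem classes."""
    prediction = [1 if confidence > confidence_threshold else 0 for confidence in prob_list]
    hit_categories = [c for c, flag in zip(categories, prediction) if flag == 1]
    return prediction, hit_categories

# usage with a truncated version of the confidences from the example above
prob_list = [8.29909288e-04, 2.09750287e-04, 9.96354342e-01, 3.79751466e-04]
categories = ["text mismatch", "damaged goods", "not fresh", "weight loss"]  # illustrative order
print(question_type_classification(prob_list, categories))  # -> ([0, 0, 1, 0], ['not fresh'])
```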
Further, when determining the algorithm performance of the first classification model and the second classification model, the quality problem classification algorithm may be tested with four evaluation indexes: Accuracy, Precision, Recall, and F1 score (F1-Score). Each evaluation index is defined as follows:
$$\text{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn}$$

$$\text{Precision} = \frac{tp}{tp + fp}$$

$$\text{Recall} = \frac{tp}{tp + fn}$$

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
the definition of tp can be that the comment data has quality problems, and the algorithm classification result is that the comment data has quality problems; the definition of fp can be that the comment data has no quality problem, and the algorithm classification result is that the algorithm has quality problem; the definition of fn can be that the comment data has quality problems, and the algorithm classification result is no quality problem; the definition of tn can be that the comment data has no quality problem, and the algorithm classification result has no quality problem.
Based on these evaluation indexes, the overall performance of the classification algorithm can be presented. For example, in the present disclosure, for certain comment data, the classification result obtained with the above classification algorithm may be: an accuracy of 95.8%, a precision of 96.4%, a recall of 95.5%, and an F1 value of 95.8%. Whether the comment data has a quality problem and which kind of quality problem it is are judged by the two classification models respectively; the error of the whole model is the sum of the errors of the two classification models, and combining the errors of the two models helps improve the extraction of overall information.
The first classification model and the second classification model may classify the sentence vector representation data simultaneously; that is, the sentence vector representation data is input to both models at the same time, and each model outputs its corresponding classification result. Classifying the data with the two models in parallel reduces the time consumed in the classification process and improves classification efficiency; in addition, using the first classification model to judge whether a quality problem exists and the second classification model to judge which quality problem it is yields a good classification effect.
It should be noted that the terms "first", "second", etc. are used in this disclosure only for distinguishing different classification models and result vectors, and should not impose any limitation on this disclosure.
Referring to fig. 7, fig. 7 schematically shows the overall architecture of the text classification method based on comment data. A basic corpus data module 710, a classification module 720 and a classification data visualization module 730 may be configured for the text classification process based on comment data. Specifically, the basic corpus data module 710 may be used to perform data cleaning on the obtained comment data, such as regular matching and special symbol processing. The classification module 720 may be used to vectorize the comment data and to compute, with the first classification model, the confidence that a quality problem is present or absent and, with the second classification model, the quality problem classification. The classification data visualization module 730 can display the calculation results obtained by the classification module 720 and the test results of the classification algorithm.
In conclusion, comment data are obtained, and text preprocessing is performed on the comment data to generate word segmentation data to be processed; performing word vectorization processing on the word segmentation data to be processed to generate corresponding word vector representation data; inputting the word vector representation data into a target language representation model to generate corresponding sentence vector representation data; inputting sentence vector representation data into a first classification model, and determining whether comment data belong to question text data or not by the first classification model; and inputting the sentence vector representation data into a second classification model, and determining the question type classification corresponding to the comment data by the second classification model. On one hand, the efficiency of text classification processing can be improved by performing text preprocessing on the acquired comment data, deleting irregular data such as special symbols and emoticons in the comment data, performing word segmentation processing on the comment data, and performing text classification processing according to the comment data subjected to the text preprocessing. On the other hand, the comment data are classified in a semi-supervised classification mode combining a target language representation model in an unsupervised learning mode and a classification model in a supervised learning mode, so that the labeling cost in the text classification processing process can be effectively reduced, and the accuracy of data classification can be effectively improved. In another aspect, whether quality problems exist in the comment data and the target classification of the quality problems are determined through the first classification model and the second classification model respectively, and by combining classification errors of the first classification model and the second classification model, information extraction in the comment data can be improved, a targeted article improvement strategy is formulated according to extracted article feedback information, the quality problem proportion of the articles can be reduced, and user experience is improved.
It is noted that although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Further, in the present exemplary embodiment, there is also provided a text classification apparatus based on comment data. Referring to fig. 8, the comment data-based text classification apparatus 800 may include: a preprocessing module 810, a word vector generation module 820, a sentence vector generation module 830, a first classification module 840, and a second classification module 850.
Specifically, the preprocessing module 810 may be configured to obtain comment data, and perform text preprocessing on the comment data to generate word segmentation data to be processed; the word vector generating module 820 may be configured to perform word vectorization on the to-be-processed word segmentation data to generate corresponding word vector representation data; the sentence vector generation module 830 may be configured to input the word vector representation data to the target language representation model to generate corresponding sentence vector representation data; the first classification module 840 may be configured to input sentence vector representation data to a first classification model, which determines whether the comment data belongs to question text data; and the second classification module 850 may be configured to input the sentence vector representation data to the second classification model, and determine, by the second classification model, a question type classification corresponding to the comment data.
The comment data-based text classification apparatus 800 may perform text preprocessing on the comment data to generate the corresponding word segmentation data to be processed, perform sentence vectorization on the word segmentation data to be processed through the target language representation model to generate sentence vector representation data, and input the generated sentence vector representation data to the first classification model and the second classification model, which respectively determine whether the comment data has a quality problem and the target classification of the corresponding quality problem. With this text classification apparatus based on comment data, performing the text preprocessing operation before the comment data is classified improves the processing efficiency of text classification; in addition, classifying the vectorized comment data with a target language representation model based on unsupervised learning and classification models based on supervised learning reduces the labeling cost of the data classification process and improves the accuracy of the classification results.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the preprocessing module includes a preprocessing unit configured to perform regular matching processing on the comment data to generate regular text data; perform word segmentation processing on the regular text data to generate word segmentation data; and perform word correction processing on the word segmentation data to generate the word segmentation data to be processed.
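By way of a non-limiting illustration only, the regular matching and word segmentation steps may be sketched as follows; the regular expression and the use of the jieba tokenizer are assumptions introduced for illustration.

```python
# Illustrative preprocessing unit: regular matching followed by word segmentation.
# The regex and the jieba tokenizer are assumptions, not mandated by the disclosure.
import re
import jieba

def regular_matching(comment: str) -> str:
    """Keep Chinese characters, letters and digits; drop special symbols and emoticons."""
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", " ", comment)

def word_segmentation(regular_text: str) -> list:
    """Split the regular text data into word tokens."""
    return [token for token in jieba.lcut(regular_text) if token.strip()]

word_segmentation_data = word_segmentation(regular_matching("这个杯子质量太差了！！"))
```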
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the preprocessing unit includes a word correction subunit configured to obtain a stop word reference table and delete the stop words in the word segmentation data according to the stop word reference table; and to determine the words to be modified in the word segmentation data and perform replacement processing on the words to be modified.
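By way of a non-limiting illustration only, the word correction subunit may be sketched as follows; the stop word reference table and the replacement dictionary are small assumptions introduced for illustration.

```python
# Illustrative word correction: delete stop words listed in a stop word reference
# table, then replace the words to be modified. Both tables are illustrative assumptions.
STOP_WORD_REFERENCE_TABLE = {"的", "了", "啊", "呢"}
WORDS_TO_BE_MODIFIED = {"灰常": "非常", "肿么": "怎么"}   # colloquial form -> standard form

def word_correction(word_segmentation_data: list) -> list:
    """Return the word segmentation data to be processed after stop-word deletion and replacement."""
    kept = [w for w in word_segmentation_data if w not in STOP_WORD_REFERENCE_TABLE]
    return [WORDS_TO_BE_MODIFIED.get(w, w) for w in kept]
```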
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the word vector generation module includes a word vector generation unit configured to obtain a word segmentation vocabulary and construct a vector space corresponding to the word segmentation vocabulary; and to perform word vectorization processing on the word segmentation data to be processed according to the vector space to obtain the word vector representation data of the comment data.
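By way of a non-limiting illustration only, the construction of a vector space from a word segmentation vocabulary may be sketched as follows; the vocabulary, the vector dimension, and the random initialization are assumptions introduced for illustration.

```python
# Illustrative word vectorization: build a vector space corresponding to a word
# segmentation vocabulary and map each token to its vector. All values are assumptions.
import numpy as np

word_segmentation_vocabulary = ["质量", "太", "差", "非常", "好"]
word_index = {word: i for i, word in enumerate(word_segmentation_vocabulary)}
embedding_dim = 8
vector_space = np.random.rand(len(word_segmentation_vocabulary) + 1, embedding_dim)  # last row reserved for unknown words

def word_vector_representation(tokens: list) -> np.ndarray:
    """Return one vector per token; unknown tokens map to the reserved last row."""
    unknown = len(word_segmentation_vocabulary)
    return np.stack([vector_space[word_index.get(t, unknown)] for t in tokens])
```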
In an exemplary embodiment of the present disclosure, based on the foregoing solution, the text classification apparatus based on comment data further includes a model training module configured to obtain an initial language representation model and initial training data; obtain a mask identification and a preset proportion, and randomly select a target number of pieces of replacement training data from the initial training data according to the preset proportion; perform replacement processing on the replacement training data according to the mask identification to generate target training data; input the target training data into the initial language representation model and obtain an output result of the initial language representation model; and adjust the parameters of the initial language representation model according to the output result to obtain the target language representation model.
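By way of a non-limiting illustration only, the generation of target training data by mask replacement may be sketched as follows; the 15% preset proportion and the "[MASK]" identification are assumptions introduced for illustration.

```python
# Illustrative masking step: a preset proportion of tokens is randomly selected
# from each training sequence and replaced with the mask identification, yielding
# the target training data. The proportion and the mask token are assumptions.
import random

MASK_IDENTIFICATION = "[MASK]"
PRESET_PROPORTION = 0.15

def generate_target_training_data(initial_training_data: list) -> list:
    """Replace a random subset of tokens in each sequence with the mask identification."""
    target_training_data = []
    for tokens in initial_training_data:
        if not tokens:
            target_training_data.append(tokens)
            continue
        target_number = max(1, int(len(tokens) * PRESET_PROPORTION))
        positions = set(random.sample(range(len(tokens)), target_number))
        target_training_data.append(
            [MASK_IDENTIFICATION if i in positions else t for i, t in enumerate(tokens)]
        )
    return target_training_data
```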
In an exemplary embodiment of the present disclosure, based on the foregoing, the first classification module includes a first classification unit for inputting sentence vector representation data to the first classification model; outputting, by the first classification model, a first result vector corresponding to the sentence vector representation data; and determining whether the comment data belongs to the question text data or not according to the first result vector.
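By way of a non-limiting illustration only, the decision based on the first result vector may be sketched as follows; the two-entry softmax form of the first result vector is an assumption introduced for illustration.

```python
# Illustrative first classification unit: the first result vector holds two scores
# (no quality problem / quality problem); the comment is treated as question text
# data when the "quality problem" score dominates. The softmax form is an assumption.
import numpy as np

def belongs_to_question_text_data(first_result_vector: np.ndarray) -> bool:
    """first_result_vector[0]: no quality problem, first_result_vector[1]: quality problem."""
    scores = np.exp(first_result_vector - first_result_vector.max())
    probabilities = scores / scores.sum()
    return bool(probabilities[1] > probabilities[0])
```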
In an exemplary embodiment of the present disclosure, based on the foregoing, the second classification module includes a second classification unit for inputting the sentence vector representation data to the second classification model; outputting, by the second classification model, a second result vector corresponding to the sentence vector representation data, wherein the second result vector comprises a plurality of confidences; and obtaining the confidence threshold values corresponding to the confidences, and determining the question type classification corresponding to the comment data according to the magnitude relationship between the confidences and the confidence threshold values.
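By way of a non-limiting illustration only, the threshold comparison on the second result vector may be sketched as follows; the question type names and the confidence threshold values are assumptions introduced for illustration.

```python
# Illustrative second classification unit: the second result vector holds one
# confidence per question type; a type is selected when its confidence exceeds the
# corresponding confidence threshold. Type names and thresholds are assumptions.
import numpy as np

QUESTION_TYPES = ["damaged", "wrong_item", "poor_material", "missing_parts"]
CONFIDENCE_THRESHOLDS = np.array([0.5, 0.5, 0.6, 0.4])

def question_type_classification(second_result_vector: np.ndarray) -> list:
    """Return every question type whose confidence exceeds its threshold."""
    return [
        q_type
        for q_type, confidence, threshold in zip(QUESTION_TYPES, second_result_vector, CONFIDENCE_THRESHOLDS)
        if confidence > threshold
    ]
```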
The specific details of each module of the above text classification apparatus based on comment data have already been described in detail in the corresponding text classification method based on comment data, and therefore are not repeated here.
It should be noted that although several modules or units of the text classification apparatus based on comment data are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into and embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to such an embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
The storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section above of the present specification.
The storage unit 920 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 921 and/or a cache memory unit 922, and may further include a read-only memory unit (ROM) 923.
Storage unit 920 may include a program/utility 924 having a set (at least one) of program modules 925, such program modules 925 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 930 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 970 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 900 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 950. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 via the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 10, a program product 1000 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read-only memory (CD-ROM) including program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (10)

1. A text classification method based on comment data is characterized by comprising the following steps:
obtaining comment data, and performing text preprocessing on the comment data to generate word segmentation data to be processed;
performing word vectorization processing on the word segmentation data to be processed to generate corresponding word vector representation data;
inputting the word vector representation data to a target language representation model to generate corresponding sentence vector representation data;
inputting the sentence vector representation data to a first classification model, determining by the first classification model whether the comment data belongs to question text data; and
inputting the sentence vector representation data into a second classification model, and determining the question type classification corresponding to the comment data by the second classification model.
2. The method for text classification based on comment data according to claim 1, wherein the text preprocessing of the comment data to generate word segmentation data to be processed comprises:
performing regular matching processing on the comment data to generate regular text data;
performing word segmentation on the regular text data to generate word segmentation data;
and performing word correction processing on the word segmentation data to generate the word segmentation data to be processed.
3. The method for classifying texts based on comment data according to claim 2, wherein the performing word correction processing on the word segmentation data comprises:
acquiring a stop word reference table, and deleting the stop words in the word segmentation data according to the stop word reference table; and
determining the words to be modified in the word segmentation data, and performing replacement processing on the words to be modified.
4. The method for classifying text based on comment data according to claim 1, wherein the performing word vectorization processing on the to-be-processed participle data to generate corresponding word vector representation data includes:
acquiring a word segmentation vocabulary, and constructing a vector space corresponding to the word segmentation vocabulary;
and performing word vectorization processing on the word segmentation data to be processed according to the vector space to obtain word vector representation data of the comment data.
5. The method of comment data based text classification of claim 1 wherein prior to the inputting of the word vector representation data into a target language representation model, the method further comprises:
acquiring an initial language representation model and initial training data;
acquiring mask identification and a preset proportion, and randomly selecting a target number of replacement training data from the initial training data according to the preset proportion;
carrying out replacement processing on the replacement training data according to the mask identification to generate target training data;
inputting the target training data into the initial language representation model, and acquiring an output result of the initial language representation model;
and adjusting the parameters of the initial language representation model according to the output result to obtain the target language representation model.
6. The method of claim 1, wherein the inputting the sentence vector representation data into a first classification model, determining whether the comment data belongs to question text data by the first classification model, comprises:
inputting the sentence vector representation data to the first classification model;
outputting, by the first classification model, a first result vector corresponding to the sentence vector representation data;
and determining whether the comment data belongs to question text data or not according to the first result vector.
7. The method of claim 1, wherein the inputting the sentence vector representation data into a second classification model, the determining the question type classification corresponding to the comment data by the second classification model, comprises:
inputting the sentence vector representation data to the second classification model;
outputting, by the second classification model, a second result vector corresponding to the sentence vector representation data; wherein the second result vector comprises a plurality of confidence levels;
and obtaining confidence threshold values corresponding to the confidence degrees, and determining the question type classification corresponding to the comment data according to the magnitude relationship between the confidence degrees and the confidence threshold values.
8. A text classification apparatus based on comment data, characterized by comprising:
the preprocessing module is used for acquiring comment data and performing text preprocessing on the comment data to generate word segmentation data to be processed;
the word vector generation module is used for carrying out word vectorization processing on the word segmentation data to be processed so as to generate corresponding word vector representation data;
a sentence vector generation module for inputting the word vector representation data to a target language representation model to generate corresponding sentence vector representation data;
a first classification module, configured to input the sentence vector representation data into a first classification model, and determine, by the first classification model, whether the comment data belongs to question text data; and
and the second classification module is used for inputting the sentence vector representation data into a second classification model, and determining the question type classification corresponding to the comment data by the second classification model.
9. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the comment data based text classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of text classification based on comment data according to any one of claims 1 to 7.
CN202010207346.4A 2020-03-23 2020-03-23 Text classification method and device based on comment data, equipment and medium Pending CN111753082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010207346.4A CN111753082A (en) 2020-03-23 2020-03-23 Text classification method and device based on comment data, equipment and medium

Publications (1)

Publication Number Publication Date
CN111753082A true CN111753082A (en) 2020-10-09

Family

ID=72673038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010207346.4A Pending CN111753082A (en) 2020-03-23 2020-03-23 Text classification method and device based on comment data, equipment and medium

Country Status (1)

Country Link
CN (1) CN111753082A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573046A (en) * 2015-01-20 2015-04-29 成都品果科技有限公司 Comment analyzing method and system based on term vector
CN105912576A (en) * 2016-03-31 2016-08-31 北京外国语大学 Emotion classification method and emotion classification system
US10248648B1 (en) * 2016-07-11 2019-04-02 Microsoft Technology Licensing, Llc Determining whether a comment represented as natural language text is prescriptive
US20190287142A1 (en) * 2018-02-12 2019-09-19 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus for evaluating review, device and storage medium
CN110309308A (en) * 2019-06-27 2019-10-08 北京金山安全软件有限公司 Text information classification method and device and electronic equipment
CN110516055A (en) * 2019-08-16 2019-11-29 西北工业大学 A kind of cross-platform intelligent answer implementation method for teaching task of combination BERT
CN110717090A (en) * 2019-08-30 2020-01-21 昆山市量子昆慈量子科技有限责任公司 Network public praise evaluation method and system for scenic spots and electronic equipment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735413A (en) * 2020-12-25 2021-04-30 浙江大华技术股份有限公司 Instruction analysis method based on camera device, electronic equipment and storage medium
CN112948583A (en) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 Data classification method and device, storage medium and electronic device
CN113705213A (en) * 2021-03-01 2021-11-26 腾讯科技(深圳)有限公司 Wrongly written character recognition method, device, equipment and readable storage medium
CN113393276A (en) * 2021-06-25 2021-09-14 食亨(上海)科技服务有限公司 Comment data classification method and device and computer readable medium
CN113393276B (en) * 2021-06-25 2023-06-16 食亨(上海)科技服务有限公司 Comment data classification method, comment data classification device and computer-readable medium
CN113656607A (en) * 2021-08-19 2021-11-16 郑州轻工业大学 Text mining device and storage medium
CN114064895A (en) * 2021-11-16 2022-02-18 深圳视界信息技术有限公司 Method, device, equipment and medium for discovering new user suggestions in real time
CN114064895B (en) * 2021-11-16 2023-12-19 深圳数阔信息技术有限公司 Method, device, equipment and medium for discovering new suggestions of user in real time

Similar Documents

Publication Publication Date Title
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN110413780B (en) Text emotion analysis method and electronic equipment
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111291195B (en) Data processing method, device, terminal and readable storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN111680159A (en) Data processing method and device and electronic equipment
CN110741376B (en) Automatic document analysis for different natural languages
CN110377744B (en) Public opinion classification method and device, storage medium and electronic equipment
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111414561B (en) Method and device for presenting information
CN113076739A (en) Method and system for realizing cross-domain Chinese text error correction
CN113961685A (en) Information extraction method and device
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN110705304B (en) Attribute word extraction method
CN112632226A (en) Semantic search method and device based on legal knowledge graph and electronic equipment
CN111858843A (en) Text classification method and device
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN112632227A (en) Resume matching method, resume matching device, electronic equipment, storage medium and program product
CN112667780A (en) Comment information generation method and device, electronic equipment and storage medium
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111241273A (en) Text data classification method and device, electronic equipment and computer readable medium
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
CN113761895A (en) Text abstract generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination